GitHub Repository: torvalds/linux
Path: blob/master/block/blk-iocost.c
/* SPDX-License-Identifier: GPL-2.0
 *
 * IO cost model based controller.
 *
 * Copyright (C) 2019 Tejun Heo <[email protected]>
 * Copyright (C) 2019 Andy Newell <[email protected]>
 * Copyright (C) 2019 Facebook
 *
 * One challenge of controlling IO resources is the lack of a trivially
 * observable cost metric. This is distinguished from CPU and memory where
 * wallclock time and the number of bytes can serve as accurate enough
 * approximations.
 *
 * Bandwidth and iops are the most commonly used metrics for IO devices but
 * depending on the type and specifics of the device, different IO patterns
 * easily lead to multiple orders of magnitude variations rendering them
 * useless for the purpose of IO capacity distribution. While on-device
 * time, with a lot of crutches, could serve as a useful approximation for
 * non-queued rotational devices, this is no longer viable with modern
 * devices, even the rotational ones.
 *
 * While there is no cost metric we can trivially observe, it isn't a
 * complete mystery. For example, on a rotational device, seek cost
 * dominates while a contiguous transfer contributes a smaller amount
 * proportional to the size. If we can characterize at least the relative
 * costs of these different types of IOs, it should be possible to
 * implement a reasonable work-conserving proportional IO resource
 * distribution.
 *
 * 1. IO Cost Model
 *
 * The IO cost model estimates the cost of an IO given its basic parameters
 * and history (e.g. the end sector of the last IO). The cost is measured
 * in device time. If a given IO is estimated to cost 10ms, the device
 * should be able to process ~100 of those IOs in a second.
 *
 * Currently, there's only one builtin cost model - linear. Each IO is
 * classified as sequential or random and given a base cost accordingly.
 * On top of that, a size cost proportional to the length of the IO is
 * added. While simple, this model captures the operational
 * characteristics of a wide variety of devices well enough. Default
 * parameters for several different classes of devices are provided and the
 * parameters can be configured from userspace via
 * /sys/fs/cgroup/io.cost.model.
 *
 * If needed, tools/cgroup/iocost_coef_gen.py can be used to generate
 * device-specific coefficients.
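 *
 * For example, under the linear model a 64KB random read costs the random
 * base cost plus 16 (64KB in 4KB pages) times the per-page cost, while a
 * 1MB sequential read costs the sequential base cost plus 256 times the
 * per-page cost. The absolute values depend entirely on the configured
 * coefficients.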
 *
 * 2. Control Strategy
 *
 * The device virtual time (vtime) is used as the primary control metric.
 * The control strategy is composed of the following three parts.
 *
 * 2-1. Vtime Distribution
 *
 * When a cgroup becomes active in terms of IOs, its hierarchical share is
 * calculated. Please consider the following hierarchy where the numbers
 * inside parentheses denote the configured weights.
 *
 *        root
 *       /       \
 *    A (w:100)  B (w:300)
 *    /       \
 *  A0 (w:100)  A1 (w:100)
 *
 * If B is idle and only A0 and A1 are actively issuing IOs, as the two are
 * of equal weight, each gets a 50% share. If B then starts issuing IOs, B
 * gets 300/(100+300) or a 75% share, and A0 and A1 equally split the rest,
 * 12.5% each. The distribution mechanism only cares about these flattened
 * shares. They're called hweights (hierarchical weights) and always add
 * up to 1 (WEIGHT_ONE).
 *
 * A given cgroup's vtime runs slower in inverse proportion to its hweight.
 * For example, with a 12.5% weight, A0's time runs 8 times slower (100/12.5)
 * against the device vtime - an IO which takes 10ms on the underlying
 * device is considered to take 80ms on A0.
 *
 * This constitutes the basis of IO capacity distribution. Each cgroup's
 * vtime is running at a rate determined by its hweight. A cgroup tracks
 * the vtime consumed by past IOs and can issue a new IO if doing so
 * wouldn't outrun the current device vtime. Otherwise, the IO is
 * suspended until the vtime has progressed enough to cover it.
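 *
 * In other words, a cgroup may issue an IO of cost C only while its own
 * vtime plus C does not run ahead of the device vtime. With A0's 12.5%
 * hweight, charging 80ms of vtime per 10ms IO means A0 can consume at
 * most ~1/8 of the device capacity before further issuance is suspended.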
 *
 * 2-2. Vrate Adjustment
 *
 * It's unrealistic to expect the cost model to be perfect. There are too
 * many devices and even on the same device the overall performance
 * fluctuates depending on numerous factors such as IO mixture and device
 * internal garbage collection. The controller needs to adapt dynamically.
 *
 * This is achieved by adjusting the overall IO rate according to how busy
 * the device is. If the device becomes overloaded, we're sending down too
 * many IOs and should generally slow down. If there are waiting issuers
 * but the device isn't saturated, we're issuing too few and should
 * generally speed up.
 *
 * To slow down, we lower the vrate - the rate at which the device vtime
 * passes compared to the wall clock. For example, if the vtime is running
 * at a vrate of 75%, all cgroups added up would only be able to issue
 * 750ms worth of IOs per second, and vice-versa for speeding up.
 *
 * Device busyness is determined using two criteria - rq wait and
 * completion latencies.
 *
 * When a device gets saturated, the on-device and then the request queues
 * fill up and a bio which is ready to be issued has to wait for a request
 * to become available. When this delay becomes noticeable, it's a clear
 * indication that the device is saturated and we lower the vrate. This
 * saturation signal is fairly conservative as it only triggers when both
 * hardware and software queues are filled up, and is used as the default
 * busy signal.
 *
 * As devices can have deep queues and be unfair in how the queued commands
 * are executed, solely depending on rq wait may not result in satisfactory
 * control quality. For better control quality, completion latency QoS
 * parameters can be configured so that the device is considered saturated
 * if the N'th percentile completion latency rises above the set point.
 *
 * The completion latency requirements are a function of both the
 * underlying device characteristics and the desired IO latency quality of
 * service. There is an inherent trade-off - the tighter the latency QoS,
 * the greater the bandwidth loss. Latency QoS is disabled by default
 * and can be set through /sys/fs/cgroup/io.cost.qos.
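 *
 * A QoS configuration written to io.cost.qos takes roughly the following
 * form (keyed by MAJ:MIN, values here purely illustrative):
 *
 *  8:16 enable=1 ctrl=user rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=150.00
 *
 * which treats the device as saturated once the 95th percentile read or
 * write completion latency exceeds 10ms / 20ms respectively, and bounds
 * vrate between 50% and 150%.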
 *
 * 2-3. Work Conservation
 *
 * Imagine two cgroups A and B with equal weights. A is issuing a small IO
 * periodically while B is sending out enough parallel IOs to saturate the
 * device on its own. Let's say A's usage amounts to 100ms worth of IO
 * cost per second, i.e., 10% of the device capacity. The naive
 * distribution of half and half would lead to 60% utilization of the
 * device, a significant reduction in the total amount of work done
 * compared to free-for-all competition. This is too high a cost to pay
 * for IO control.
 *
 * To conserve the total amount of work done, we keep track of how much
 * each active cgroup is actually using and yield part of its weight if
 * there are other cgroups which can make use of it. In the above case,
 * A's weight will be lowered so that it hovers above the actual usage and
 * B would be able to use the rest.
 *
 * As we don't want to penalize a cgroup for donating its weight, the
 * surplus weight adjustment factors in a margin and has an immediate
 * snapback mechanism in case the cgroup needs more IO vtime for itself.
 *
 * Note that adjusting down surplus weights has the same effect as
 * accelerating vtime for other cgroups and work conservation can also be
 * implemented by adjusting vrate dynamically. However, working out who can
 * donate and how much each should take back requires hweight propagations
 * anyway, making it easier to implement and understand as a separate
 * mechanism.
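 *
 * In the example above, instead of being pinned at a 50% hweight_inuse, A's
 * inuse weight is lowered until its hweight_inuse hovers a margin above its
 * ~10% actual usage, letting B's hweight_inuse grow towards ~90% and
 * restoring near-full device utilization, while A keeps the right to snap
 * back to its full 50% the moment it needs it.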
 *
 * 3. Monitoring
 *
 * Instead of debugfs or other clumsy monitoring mechanisms, this
 * controller uses a drgn based monitoring script -
 * tools/cgroup/iocost_monitor.py. For details on drgn, please see
 * https://github.com/osandov/drgn. The output looks like the following.
 *
 *  sdb RUN   per=300ms cur_per=234.218:v203.695 busy= +1 vrate= 62.12%
 *                 active      weight      hweight% inflt% dbt  delay usages%
 *  test/a              *    50/   50  33.33/ 33.33  27.65   2  0*041 033:033:033
 *  test/b              *   100/  100  66.67/ 66.67  17.56   0  0*000 066:079:077
 *
 * - per      : Timer period
 * - cur_per  : Internal wall and device vtime clock
 * - vrate    : Device virtual time rate against wall clock
 * - weight   : Surplus-adjusted and configured weights
 * - hweight  : Surplus-adjusted and configured hierarchical weights
 * - inflt    : The percentage of in-flight IO cost at the end of last period
 * - delay    : Deferred issuer delay induction level and duration
 * - usages   : Usage history
 */

#include <linux/kernel.h>
176
#include <linux/module.h>
177
#include <linux/timer.h>
178
#include <linux/time64.h>
179
#include <linux/parser.h>
180
#include <linux/sched/signal.h>
181
#include <asm/local.h>
182
#include <asm/local64.h>
183
#include "blk-rq-qos.h"
184
#include "blk-stat.h"
185
#include "blk-wbt.h"
186
#include "blk-cgroup.h"
187
188
#ifdef CONFIG_TRACEPOINTS
189
190
/* copied from TRACE_CGROUP_PATH, see cgroup-internal.h */
191
#define TRACE_IOCG_PATH_LEN 1024
192
static DEFINE_SPINLOCK(trace_iocg_path_lock);
193
static char trace_iocg_path[TRACE_IOCG_PATH_LEN];
194
195
#define TRACE_IOCG_PATH(type, iocg, ...) \
196
do { \
197
unsigned long flags; \
198
if (trace_iocost_##type##_enabled()) { \
199
spin_lock_irqsave(&trace_iocg_path_lock, flags); \
200
cgroup_path(iocg_to_blkg(iocg)->blkcg->css.cgroup, \
201
trace_iocg_path, TRACE_IOCG_PATH_LEN); \
202
trace_iocost_##type(iocg, trace_iocg_path, \
203
##__VA_ARGS__); \
204
spin_unlock_irqrestore(&trace_iocg_path_lock, flags); \
205
} \
206
} while (0)
207
208
#else /* CONFIG_TRACEPOINTS */
209
#define TRACE_IOCG_PATH(type, iocg, ...) do { } while (0)
210
#endif /* CONFIG_TRACEPOINTS */
211
212
enum {
213
MILLION = 1000000,
214
215
/* timer period is calculated from latency requirements, bound it */
216
MIN_PERIOD = USEC_PER_MSEC,
217
MAX_PERIOD = USEC_PER_SEC,
218
219
/*
220
* iocg->vtime is targeted at 50% behind the device vtime, which
221
* serves as its IO credit buffer. Surplus weight adjustment is
222
* immediately canceled if the vtime margin runs below 10%.
223
*/
224
MARGIN_MIN_PCT = 10,
225
MARGIN_LOW_PCT = 20,
226
MARGIN_TARGET_PCT = 50,
227
228
INUSE_ADJ_STEP_PCT = 25,
229
230
/* Have some play in timer operations */
231
TIMER_SLACK_PCT = 1,
232
233
/* 1/64k is granular enough and can easily be handled w/ u32 */
234
WEIGHT_ONE = 1 << 16,
235
};
236
237
enum {
238
/*
239
* As vtime is used to calculate the cost of each IO, it needs to
240
* be fairly high precision. For example, it should be able to
241
* represent the cost of a single page worth of discard with
242
* sufficient accuracy. At the same time, it should be able to
243
* represent reasonably long enough durations to be useful and
244
* convenient during operation.
245
*
246
* 1s worth of vtime is 2^37. This gives us both sub-nanosecond
247
* granularity and days of wrap-around time even at extreme vrates.
248
*/
249
VTIME_PER_SEC_SHIFT = 37,
250
VTIME_PER_SEC = 1LLU << VTIME_PER_SEC_SHIFT,
251
VTIME_PER_USEC = VTIME_PER_SEC / USEC_PER_SEC,
252
VTIME_PER_NSEC = VTIME_PER_SEC / NSEC_PER_SEC,
253
254
/* bound vrate adjustments within two orders of magnitude */
255
VRATE_MIN_PPM = 10000, /* 1% */
256
VRATE_MAX_PPM = 100000000, /* 10000% */
257
258
VRATE_MIN = VTIME_PER_USEC * VRATE_MIN_PPM / MILLION,
259
VRATE_CLAMP_ADJ_PCT = 4,
260
261
/* switch iff the conditions are met for longer than this */
262
AUTOP_CYCLE_NSEC = 10LLU * NSEC_PER_SEC,
263
};
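
/*
 * For scale: VTIME_PER_SEC is 2^37 ~= 1.37e11, so VTIME_PER_USEC is ~137438
 * and VTIME_PER_NSEC ~137 - roughly 137 vtime units per nanosecond. A u64
 * vtime thus wraps only after about 2^63 / 2^37 = 2^26 seconds (~776 days)
 * at the nominal vrate.
 */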
264
265
enum {
266
/* if IOs end up waiting for requests, issue less */
267
RQ_WAIT_BUSY_PCT = 5,
268
269
/* unbusy hysteresis */
270
UNBUSY_THR_PCT = 75,
271
272
/*
273
* The effect of delay is indirect and non-linear and a huge amount of
274
* future debt can accumulate abruptly while unthrottled. Linearly scale
275
* up delay as debt is going up and then let it decay exponentially.
276
* This gives us quick ramp ups while delay is accumulating and long
277
* tails which can help reducing the frequency of debt explosions on
278
* unthrottle. The parameters are experimentally determined.
279
*
280
* The delay mechanism provides adequate protection and behavior in many
281
* cases. However, this is far from ideal and falls short on both
282
* fronts. The debtors are often throttled too harshly costing a
283
* significant amount of fairness and possibly total work, while the
284
* protection against their impacts on the system can be choppy and
285
* unreliable.
286
*
287
* The shortcoming primarily stems from the fact that, unlike for page
288
* cache, the kernel doesn't have well-defined back-pressure propagation
289
* mechanism and policies for anonymous memory. Fully addressing this
290
* issue will likely require substantial improvements in the area.
291
*/
292
MIN_DELAY_THR_PCT = 500,
293
MAX_DELAY_THR_PCT = 25000,
294
MIN_DELAY = 250,
295
MAX_DELAY = 250 * USEC_PER_MSEC,
296
297
/* halve debts if avg usage over 100ms is under 50% */
298
DFGV_USAGE_PCT = 50,
299
DFGV_PERIOD = 100 * USEC_PER_MSEC,
300
301
/* don't let cmds which take a very long time pin lagging for too long */
302
MAX_LAGGING_PERIODS = 10,
303
304
/*
305
* Count IO size in 4k pages. The 12-bit shift helps keep the
* size-proportional components of the cost calculation within a
* similar number of digits as the per-IO cost components.
308
*/
309
IOC_PAGE_SHIFT = 12,
310
IOC_PAGE_SIZE = 1 << IOC_PAGE_SHIFT,
311
IOC_SECT_TO_PAGE_SHIFT = IOC_PAGE_SHIFT - SECTOR_SHIFT,
312
313
/* if further apart than 16M, consider randio for the linear model */
314
LCOEF_RANDIO_PAGES = 4096,
315
};
316
317
enum ioc_running {
318
IOC_IDLE,
319
IOC_RUNNING,
320
IOC_STOP,
321
};
322
323
/* io.cost.qos controls including per-dev enable of the whole controller */
324
enum {
325
QOS_ENABLE,
326
QOS_CTRL,
327
NR_QOS_CTRL_PARAMS,
328
};
329
330
/* io.cost.qos params */
331
enum {
332
QOS_RPPM,
333
QOS_RLAT,
334
QOS_WPPM,
335
QOS_WLAT,
336
QOS_MIN,
337
QOS_MAX,
338
NR_QOS_PARAMS,
339
};
340
341
/* io.cost.model controls */
342
enum {
343
COST_CTRL,
344
COST_MODEL,
345
NR_COST_CTRL_PARAMS,
346
};
347
348
/* builtin linear cost model coefficients */
349
enum {
350
I_LCOEF_RBPS,
351
I_LCOEF_RSEQIOPS,
352
I_LCOEF_RRANDIOPS,
353
I_LCOEF_WBPS,
354
I_LCOEF_WSEQIOPS,
355
I_LCOEF_WRANDIOPS,
356
NR_I_LCOEFS,
357
};
358
359
enum {
360
LCOEF_RPAGE,
361
LCOEF_RSEQIO,
362
LCOEF_RRANDIO,
363
LCOEF_WPAGE,
364
LCOEF_WSEQIO,
365
LCOEF_WRANDIO,
366
NR_LCOEFS,
367
};
368
369
enum {
370
AUTOP_INVALID,
371
AUTOP_HDD,
372
AUTOP_SSD_QD1,
373
AUTOP_SSD_DFL,
374
AUTOP_SSD_FAST,
375
};
376
377
struct ioc_params {
378
u32 qos[NR_QOS_PARAMS];
379
u64 i_lcoefs[NR_I_LCOEFS];
380
u64 lcoefs[NR_LCOEFS];
381
u32 too_fast_vrate_pct;
382
u32 too_slow_vrate_pct;
383
};
384
385
struct ioc_margins {
386
s64 min;
387
s64 low;
388
s64 target;
389
};
390
391
struct ioc_missed {
392
local_t nr_met;
393
local_t nr_missed;
394
u32 last_met;
395
u32 last_missed;
396
};
397
398
struct ioc_pcpu_stat {
399
struct ioc_missed missed[2];
400
401
local64_t rq_wait_ns;
402
u64 last_rq_wait_ns;
403
};
404
405
/* per device */
406
struct ioc {
407
struct rq_qos rqos;
408
409
bool enabled;
410
411
struct ioc_params params;
412
struct ioc_margins margins;
413
u32 period_us;
414
u32 timer_slack_ns;
415
u64 vrate_min;
416
u64 vrate_max;
417
418
spinlock_t lock;
419
struct timer_list timer;
420
struct list_head active_iocgs; /* active cgroups */
421
struct ioc_pcpu_stat __percpu *pcpu_stat;
422
423
enum ioc_running running;
424
atomic64_t vtime_rate;
425
u64 vtime_base_rate;
426
s64 vtime_err;
427
428
seqcount_spinlock_t period_seqcount;
429
u64 period_at; /* wallclock starttime */
430
u64 period_at_vtime; /* vtime starttime */
431
432
atomic64_t cur_period; /* inc'd each period */
433
int busy_level; /* saturation history */
434
435
bool weights_updated;
436
atomic_t hweight_gen; /* for lazy hweights */
437
438
/* debt forgiveness */
439
u64 dfgv_period_at;
440
u64 dfgv_period_rem;
441
u64 dfgv_usage_us_sum;
442
443
u64 autop_too_fast_at;
444
u64 autop_too_slow_at;
445
int autop_idx;
446
bool user_qos_params:1;
447
bool user_cost_model:1;
448
};
449
450
struct iocg_pcpu_stat {
451
local64_t abs_vusage;
452
};
453
454
struct iocg_stat {
455
u64 usage_us;
456
u64 wait_us;
457
u64 indebt_us;
458
u64 indelay_us;
459
};
460
461
/* per device-cgroup pair */
462
struct ioc_gq {
463
struct blkg_policy_data pd;
464
struct ioc *ioc;
465
466
/*
467
* An iocg can get its weight from two sources - an explicit
468
* per-device-cgroup configuration or the default weight of the
469
* cgroup. `cfg_weight` is the explicit per-device-cgroup
470
* configuration. `weight` is the effective weight considering both
471
* sources.
472
*
473
* When an idle cgroup becomes active its `active` goes from 0 to
474
* `weight`. `inuse` is the surplus adjusted active weight.
475
* `active` and `inuse` are used to calculate `hweight_active` and
476
* `hweight_inuse`.
477
*
478
* `last_inuse` remembers `inuse` while an iocg is idle to persist
479
* surplus adjustments.
480
*
481
* `inuse` may be adjusted dynamically during a period. `saved_*` are used
482
* to determine and track adjustments.
483
*/
484
u32 cfg_weight;
485
u32 weight;
486
u32 active;
487
u32 inuse;
488
489
u32 last_inuse;
490
s64 saved_margin;
491
492
sector_t cursor; /* to detect randio */
493
494
/*
495
* `vtime` is this iocg's vtime cursor which progresses as IOs are
496
* issued. If lagging behind device vtime, the delta represents
497
* the currently available IO budget. If running ahead, the
498
* overage.
499
*
500
* `vtime_done` is the same but progressed on completion rather
501
* than issue. The delta behind `vtime` represents the cost of
502
* currently in-flight IOs.
503
*/
504
atomic64_t vtime;
505
atomic64_t done_vtime;
506
u64 abs_vdebt;
507
508
/* current delay in effect and when it started */
509
u64 delay;
510
u64 delay_at;
511
512
/*
513
* The period this iocg was last active in. Used for deactivation
514
* and invalidating `vtime`.
515
*/
516
atomic64_t active_period;
517
struct list_head active_list;
518
519
/* see __propagate_weights() and current_hweight() for details */
520
u64 child_active_sum;
521
u64 child_inuse_sum;
522
u64 child_adjusted_sum;
523
int hweight_gen;
524
u32 hweight_active;
525
u32 hweight_inuse;
526
u32 hweight_donating;
527
u32 hweight_after_donation;
528
529
struct list_head walk_list;
530
struct list_head surplus_list;
531
532
struct wait_queue_head waitq;
533
struct hrtimer waitq_timer;
534
535
/* timestamp at the latest activation */
536
u64 activated_at;
537
538
/* statistics */
539
struct iocg_pcpu_stat __percpu *pcpu_stat;
540
struct iocg_stat stat;
541
struct iocg_stat last_stat;
542
u64 last_stat_abs_vusage;
543
u64 usage_delta_us;
544
u64 wait_since;
545
u64 indebt_since;
546
u64 indelay_since;
547
548
/* this iocg's depth in the hierarchy and ancestors including self */
549
int level;
550
struct ioc_gq *ancestors[];
551
};
552
553
/* per cgroup */
554
struct ioc_cgrp {
555
struct blkcg_policy_data cpd;
556
unsigned int dfl_weight;
557
};
558
559
struct ioc_now {
560
u64 now_ns;
561
u64 now;
562
u64 vnow;
563
};
564
565
struct iocg_wait {
566
struct wait_queue_entry wait;
567
struct bio *bio;
568
u64 abs_cost;
569
bool committed;
570
};
571
572
struct iocg_wake_ctx {
573
struct ioc_gq *iocg;
574
u32 hw_inuse;
575
s64 vbudget;
576
};
577
578
static const struct ioc_params autop[] = {
579
[AUTOP_HDD] = {
580
.qos = {
581
[QOS_RLAT] = 250000, /* 250ms */
582
[QOS_WLAT] = 250000,
583
[QOS_MIN] = VRATE_MIN_PPM,
584
[QOS_MAX] = VRATE_MAX_PPM,
585
},
586
.i_lcoefs = {
587
[I_LCOEF_RBPS] = 174019176,
588
[I_LCOEF_RSEQIOPS] = 41708,
589
[I_LCOEF_RRANDIOPS] = 370,
590
[I_LCOEF_WBPS] = 178075866,
591
[I_LCOEF_WSEQIOPS] = 42705,
592
[I_LCOEF_WRANDIOPS] = 378,
593
},
594
},
595
[AUTOP_SSD_QD1] = {
596
.qos = {
597
[QOS_RLAT] = 25000, /* 25ms */
598
[QOS_WLAT] = 25000,
599
[QOS_MIN] = VRATE_MIN_PPM,
600
[QOS_MAX] = VRATE_MAX_PPM,
601
},
602
.i_lcoefs = {
603
[I_LCOEF_RBPS] = 245855193,
604
[I_LCOEF_RSEQIOPS] = 61575,
605
[I_LCOEF_RRANDIOPS] = 6946,
606
[I_LCOEF_WBPS] = 141365009,
607
[I_LCOEF_WSEQIOPS] = 33716,
608
[I_LCOEF_WRANDIOPS] = 26796,
609
},
610
},
611
[AUTOP_SSD_DFL] = {
612
.qos = {
613
[QOS_RLAT] = 25000, /* 25ms */
614
[QOS_WLAT] = 25000,
615
[QOS_MIN] = VRATE_MIN_PPM,
616
[QOS_MAX] = VRATE_MAX_PPM,
617
},
618
.i_lcoefs = {
619
[I_LCOEF_RBPS] = 488636629,
620
[I_LCOEF_RSEQIOPS] = 8932,
621
[I_LCOEF_RRANDIOPS] = 8518,
622
[I_LCOEF_WBPS] = 427891549,
623
[I_LCOEF_WSEQIOPS] = 28755,
624
[I_LCOEF_WRANDIOPS] = 21940,
625
},
626
.too_fast_vrate_pct = 500,
627
},
628
[AUTOP_SSD_FAST] = {
629
.qos = {
630
[QOS_RLAT] = 5000, /* 5ms */
631
[QOS_WLAT] = 5000,
632
[QOS_MIN] = VRATE_MIN_PPM,
633
[QOS_MAX] = VRATE_MAX_PPM,
634
},
635
.i_lcoefs = {
636
[I_LCOEF_RBPS] = 3102524156LLU,
637
[I_LCOEF_RSEQIOPS] = 724816,
638
[I_LCOEF_RRANDIOPS] = 778122,
639
[I_LCOEF_WBPS] = 1742780862LLU,
640
[I_LCOEF_WSEQIOPS] = 425702,
641
[I_LCOEF_WRANDIOPS] = 443193,
642
},
643
.too_slow_vrate_pct = 10,
644
},
645
};
646
647
/*
648
* vrate adjust percentages indexed by ioc->busy_level. We adjust up on
649
* vtime credit shortage and down on device saturation.
650
*/
651
static const u32 vrate_adj_pct[] =
652
{ 0, 0, 0, 0,
653
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
654
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
655
4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 16 };
656
657
static struct blkcg_policy blkcg_policy_iocost;
658
659
/* accessors and helpers */
660
static struct ioc *rqos_to_ioc(struct rq_qos *rqos)
661
{
662
return container_of(rqos, struct ioc, rqos);
663
}
664
665
static struct ioc *q_to_ioc(struct request_queue *q)
666
{
667
return rqos_to_ioc(rq_qos_id(q, RQ_QOS_COST));
668
}
669
670
static const char __maybe_unused *ioc_name(struct ioc *ioc)
671
{
672
struct gendisk *disk = ioc->rqos.disk;
673
674
if (!disk)
675
return "<unknown>";
676
return disk->disk_name;
677
}
678
679
static struct ioc_gq *pd_to_iocg(struct blkg_policy_data *pd)
680
{
681
return pd ? container_of(pd, struct ioc_gq, pd) : NULL;
682
}
683
684
static struct ioc_gq *blkg_to_iocg(struct blkcg_gq *blkg)
685
{
686
return pd_to_iocg(blkg_to_pd(blkg, &blkcg_policy_iocost));
687
}
688
689
static struct blkcg_gq *iocg_to_blkg(struct ioc_gq *iocg)
690
{
691
return pd_to_blkg(&iocg->pd);
692
}
693
694
static struct ioc_cgrp *blkcg_to_iocc(struct blkcg *blkcg)
695
{
696
return container_of(blkcg_to_cpd(blkcg, &blkcg_policy_iocost),
697
struct ioc_cgrp, cpd);
698
}
699
700
/*
701
* Scale @abs_cost to the inverse of @hw_inuse. The lower the hierarchical
702
* weight, the more expensive each IO. Must round up.
703
*/
704
static u64 abs_cost_to_cost(u64 abs_cost, u32 hw_inuse)
705
{
706
return DIV64_U64_ROUND_UP(abs_cost * WEIGHT_ONE, hw_inuse);
707
}
708
709
/*
710
* The inverse of abs_cost_to_cost(). Must round up.
711
*/
712
static u64 cost_to_abs_cost(u64 cost, u32 hw_inuse)
713
{
714
return DIV64_U64_ROUND_UP(cost * hw_inuse, WEIGHT_ONE);
715
}
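
/*
 * For example, if an iocg's hw_inuse is WEIGHT_ONE / 8 (a 12.5% share), an
 * IO whose absolute cost is 10ms worth of device vtime gets charged
 * abs_cost * WEIGHT_ONE / hw_inuse = 80ms of vtime by abs_cost_to_cost(),
 * and cost_to_abs_cost() maps that 80ms charge back to the 10ms absolute
 * cost.
 */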
716
717
static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio,
718
u64 abs_cost, u64 cost)
719
{
720
struct iocg_pcpu_stat *gcs;
721
722
bio->bi_iocost_cost = cost;
723
atomic64_add(cost, &iocg->vtime);
724
725
gcs = get_cpu_ptr(iocg->pcpu_stat);
726
local64_add(abs_cost, &gcs->abs_vusage);
727
put_cpu_ptr(gcs);
728
}
729
730
static void iocg_lock(struct ioc_gq *iocg, bool lock_ioc, unsigned long *flags)
731
{
732
if (lock_ioc) {
733
spin_lock_irqsave(&iocg->ioc->lock, *flags);
734
spin_lock(&iocg->waitq.lock);
735
} else {
736
spin_lock_irqsave(&iocg->waitq.lock, *flags);
737
}
738
}
739
740
static void iocg_unlock(struct ioc_gq *iocg, bool unlock_ioc, unsigned long *flags)
741
{
742
if (unlock_ioc) {
743
spin_unlock(&iocg->waitq.lock);
744
spin_unlock_irqrestore(&iocg->ioc->lock, *flags);
745
} else {
746
spin_unlock_irqrestore(&iocg->waitq.lock, *flags);
747
}
748
}
749
750
#define CREATE_TRACE_POINTS
751
#include <trace/events/iocost.h>
752
753
static void ioc_refresh_margins(struct ioc *ioc)
754
{
755
struct ioc_margins *margins = &ioc->margins;
756
u32 period_us = ioc->period_us;
757
u64 vrate = ioc->vtime_base_rate;
758
759
margins->min = (period_us * MARGIN_MIN_PCT / 100) * vrate;
760
margins->low = (period_us * MARGIN_LOW_PCT / 100) * vrate;
761
margins->target = (period_us * MARGIN_TARGET_PCT / 100) * vrate;
762
}
763
764
/* latency QoS params changed, update period_us and all the dependent params */
765
static void ioc_refresh_period_us(struct ioc *ioc)
766
{
767
u32 ppm, lat, multi, period_us;
768
769
lockdep_assert_held(&ioc->lock);
770
771
/* pick the higher latency target */
772
if (ioc->params.qos[QOS_RLAT] >= ioc->params.qos[QOS_WLAT]) {
773
ppm = ioc->params.qos[QOS_RPPM];
774
lat = ioc->params.qos[QOS_RLAT];
775
} else {
776
ppm = ioc->params.qos[QOS_WPPM];
777
lat = ioc->params.qos[QOS_WLAT];
778
}
779
780
/*
781
* We want the period to be long enough to contain a healthy number
782
* of IOs while short enough for granular control. Define it as a
783
* multiple of the latency target. Ideally, the multiplier should
784
* be scaled according to the percentile so that it would nominally
785
* contain a certain number of requests. Let's be simpler and
786
* scale it linearly so that it's 2x >= pct(90) and 10x at pct(50).
787
*/
788
if (ppm)
789
multi = max_t(u32, (MILLION - ppm) / 50000, 2);
790
else
791
multi = 2;
792
period_us = multi * lat;
793
period_us = clamp_t(u32, period_us, MIN_PERIOD, MAX_PERIOD);
794
795
/* calculate dependent params */
796
ioc->period_us = period_us;
797
ioc->timer_slack_ns = div64_u64(
798
(u64)period_us * NSEC_PER_USEC * TIMER_SLACK_PCT,
799
100);
800
ioc_refresh_margins(ioc);
801
}
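
/*
 * For example, a latency target at the 50th percentile (ppm = 500000) with
 * a 25ms threshold yields multi = (1000000 - 500000) / 50000 = 10 and a
 * 250ms period, while the same threshold at pct(90) or above falls back to
 * the minimum multiple of 2, i.e. a 50ms period.
 */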
802
803
/*
804
* ioc->rqos.disk isn't initialized when this function is called from
805
* the init path.
806
*/
807
static int ioc_autop_idx(struct ioc *ioc, struct gendisk *disk)
808
{
809
int idx = ioc->autop_idx;
810
const struct ioc_params *p = &autop[idx];
811
u32 vrate_pct;
812
u64 now_ns;
813
814
/* rotational? */
815
if (!blk_queue_nonrot(disk->queue))
816
return AUTOP_HDD;
817
818
/* handle SATA SSDs w/ broken NCQ */
819
if (blk_queue_depth(disk->queue) == 1)
820
return AUTOP_SSD_QD1;
821
822
/* use one of the normal ssd sets */
823
if (idx < AUTOP_SSD_DFL)
824
return AUTOP_SSD_DFL;
825
826
/* if user is overriding anything, maintain what was there */
827
if (ioc->user_qos_params || ioc->user_cost_model)
828
return idx;
829
830
/* step up/down based on the vrate */
831
vrate_pct = div64_u64(ioc->vtime_base_rate * 100, VTIME_PER_USEC);
832
now_ns = blk_time_get_ns();
833
834
if (p->too_fast_vrate_pct && p->too_fast_vrate_pct <= vrate_pct) {
835
if (!ioc->autop_too_fast_at)
836
ioc->autop_too_fast_at = now_ns;
837
if (now_ns - ioc->autop_too_fast_at >= AUTOP_CYCLE_NSEC)
838
return idx + 1;
839
} else {
840
ioc->autop_too_fast_at = 0;
841
}
842
843
if (p->too_slow_vrate_pct && p->too_slow_vrate_pct >= vrate_pct) {
844
if (!ioc->autop_too_slow_at)
845
ioc->autop_too_slow_at = now_ns;
846
if (now_ns - ioc->autop_too_slow_at >= AUTOP_CYCLE_NSEC)
847
return idx - 1;
848
} else {
849
ioc->autop_too_slow_at = 0;
850
}
851
852
return idx;
853
}
854
855
/*
856
* Take the following as input
857
*
858
* @bps maximum sequential throughput
859
* @seqiops maximum sequential 4k iops
860
* @randiops maximum random 4k iops
861
*
862
* and calculate the linear model cost coefficients.
863
*
864
* *@page per-page cost 1s / (@bps / 4096)
865
* *@seqio base cost of a seq IO max((1s / @seqiops) - *@page, 0)
866
* *@randio base cost of a rand IO max((1s / @randiops) - *@page, 0)
867
*/
868
static void calc_lcoefs(u64 bps, u64 seqiops, u64 randiops,
869
u64 *page, u64 *seqio, u64 *randio)
870
{
871
u64 v;
872
873
*page = *seqio = *randio = 0;
874
875
if (bps) {
876
u64 bps_pages = DIV_ROUND_UP_ULL(bps, IOC_PAGE_SIZE);
877
878
if (bps_pages)
879
*page = DIV64_U64_ROUND_UP(VTIME_PER_SEC, bps_pages);
880
else
881
*page = 1;
882
}
883
884
if (seqiops) {
885
v = DIV64_U64_ROUND_UP(VTIME_PER_SEC, seqiops);
886
if (v > *page)
887
*seqio = v - *page;
888
}
889
890
if (randiops) {
891
v = DIV64_U64_ROUND_UP(VTIME_PER_SEC, randiops);
892
if (v > *page)
893
*randio = v - *page;
894
}
895
}
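
/*
 * As a rough worked example, plugging the AUTOP_SSD_DFL read parameters
 * above into calc_lcoefs() - bps ~= 489MB/s, seqiops = 8932, randiops =
 * 8518 - gives:
 *
 *	*page   = VTIME_PER_SEC / (488636629 / 4096)	(~119k pages/s)
 *	*seqio  = max(VTIME_PER_SEC / 8932 - *page, 0)
 *	*randio = max(VTIME_PER_SEC / 8518 - *page, 0)
 *
 * i.e. the per-page cost comes from the peak sequential bandwidth and the
 * seq/rand base costs are whatever per-IO overhead remains after the
 * size-proportional portion is subtracted.
 */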
896
897
static void ioc_refresh_lcoefs(struct ioc *ioc)
898
{
899
u64 *u = ioc->params.i_lcoefs;
900
u64 *c = ioc->params.lcoefs;
901
902
calc_lcoefs(u[I_LCOEF_RBPS], u[I_LCOEF_RSEQIOPS], u[I_LCOEF_RRANDIOPS],
903
&c[LCOEF_RPAGE], &c[LCOEF_RSEQIO], &c[LCOEF_RRANDIO]);
904
calc_lcoefs(u[I_LCOEF_WBPS], u[I_LCOEF_WSEQIOPS], u[I_LCOEF_WRANDIOPS],
905
&c[LCOEF_WPAGE], &c[LCOEF_WSEQIO], &c[LCOEF_WRANDIO]);
906
}
907
908
/*
909
* struct gendisk is required as an argument because ioc->rqos.disk
910
* is not properly initialized when called from the init path.
911
*/
912
static bool ioc_refresh_params_disk(struct ioc *ioc, bool force,
913
struct gendisk *disk)
914
{
915
const struct ioc_params *p;
916
int idx;
917
918
lockdep_assert_held(&ioc->lock);
919
920
idx = ioc_autop_idx(ioc, disk);
921
p = &autop[idx];
922
923
if (idx == ioc->autop_idx && !force)
924
return false;
925
926
if (idx != ioc->autop_idx) {
927
atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC);
928
ioc->vtime_base_rate = VTIME_PER_USEC;
929
}
930
931
ioc->autop_idx = idx;
932
ioc->autop_too_fast_at = 0;
933
ioc->autop_too_slow_at = 0;
934
935
if (!ioc->user_qos_params)
936
memcpy(ioc->params.qos, p->qos, sizeof(p->qos));
937
if (!ioc->user_cost_model)
938
memcpy(ioc->params.i_lcoefs, p->i_lcoefs, sizeof(p->i_lcoefs));
939
940
ioc_refresh_period_us(ioc);
941
ioc_refresh_lcoefs(ioc);
942
943
ioc->vrate_min = DIV64_U64_ROUND_UP((u64)ioc->params.qos[QOS_MIN] *
944
VTIME_PER_USEC, MILLION);
945
ioc->vrate_max = DIV64_U64_ROUND_UP((u64)ioc->params.qos[QOS_MAX] *
946
VTIME_PER_USEC, MILLION);
947
948
return true;
949
}
950
951
static bool ioc_refresh_params(struct ioc *ioc, bool force)
952
{
953
return ioc_refresh_params_disk(ioc, force, ioc->rqos.disk);
954
}
955
956
/*
957
* When an iocg accumulates too much vtime or gets deactivated, we throw away
958
* some vtime, which lowers the overall device utilization. As the exact amount
959
* which is being thrown away is known, we can compensate by accelerating the
960
* vrate accordingly so that the extra vtime generated in the current period
961
* matches what got lost.
962
*/
963
static void ioc_refresh_vrate(struct ioc *ioc, struct ioc_now *now)
964
{
965
s64 pleft = ioc->period_at + ioc->period_us - now->now;
966
s64 vperiod = ioc->period_us * ioc->vtime_base_rate;
967
s64 vcomp, vcomp_min, vcomp_max;
968
969
lockdep_assert_held(&ioc->lock);
970
971
/* we need some time left in this period */
972
if (pleft <= 0)
973
goto done;
974
975
/*
976
* Calculate how much vrate should be adjusted to offset the error.
977
* Limit the amount of adjustment and deduct the adjusted amount from
978
* the error.
979
*/
980
vcomp = -div64_s64(ioc->vtime_err, pleft);
981
vcomp_min = -(ioc->vtime_base_rate >> 1);
982
vcomp_max = ioc->vtime_base_rate;
983
vcomp = clamp(vcomp, vcomp_min, vcomp_max);
984
985
ioc->vtime_err += vcomp * pleft;
986
987
atomic64_set(&ioc->vtime_rate, ioc->vtime_base_rate + vcomp);
988
done:
989
/* bound how much error can accumulate */
990
ioc->vtime_err = clamp(ioc->vtime_err, -vperiod, vperiod);
991
}
992
993
static void ioc_adjust_base_vrate(struct ioc *ioc, u32 rq_wait_pct,
994
int nr_lagging, int nr_shortages,
995
int prev_busy_level, u32 *missed_ppm)
996
{
997
u64 vrate = ioc->vtime_base_rate;
998
u64 vrate_min = ioc->vrate_min, vrate_max = ioc->vrate_max;
999
1000
if (!ioc->busy_level || (ioc->busy_level < 0 && nr_lagging)) {
1001
if (ioc->busy_level != prev_busy_level || nr_lagging)
1002
trace_iocost_ioc_vrate_adj(ioc, vrate,
1003
missed_ppm, rq_wait_pct,
1004
nr_lagging, nr_shortages);
1005
1006
return;
1007
}
1008
1009
/*
1010
* If vrate is out of bounds, apply clamp gradually as the
1011
* bounds can change abruptly. Otherwise, apply busy_level
1012
* based adjustment.
1013
*/
1014
if (vrate < vrate_min) {
1015
vrate = div64_u64(vrate * (100 + VRATE_CLAMP_ADJ_PCT), 100);
1016
vrate = min(vrate, vrate_min);
1017
} else if (vrate > vrate_max) {
1018
vrate = div64_u64(vrate * (100 - VRATE_CLAMP_ADJ_PCT), 100);
1019
vrate = max(vrate, vrate_max);
1020
} else {
1021
int idx = min_t(int, abs(ioc->busy_level),
1022
ARRAY_SIZE(vrate_adj_pct) - 1);
1023
u32 adj_pct = vrate_adj_pct[idx];
1024
1025
if (ioc->busy_level > 0)
1026
adj_pct = 100 - adj_pct;
1027
else
1028
adj_pct = 100 + adj_pct;
1029
1030
vrate = clamp(DIV64_U64_ROUND_UP(vrate * adj_pct, 100),
1031
vrate_min, vrate_max);
1032
}
1033
1034
trace_iocost_ioc_vrate_adj(ioc, vrate, missed_ppm, rq_wait_pct,
1035
nr_lagging, nr_shortages);
1036
1037
ioc->vtime_base_rate = vrate;
1038
ioc_refresh_margins(ioc);
1039
}
1040
1041
/* take a snapshot of the current [v]time and vrate */
1042
static void ioc_now(struct ioc *ioc, struct ioc_now *now)
1043
{
1044
unsigned seq;
1045
u64 vrate;
1046
1047
now->now_ns = blk_time_get_ns();
1048
now->now = ktime_to_us(now->now_ns);
1049
vrate = atomic64_read(&ioc->vtime_rate);
1050
1051
/*
1052
* The current vtime is
1053
*
1054
* vtime at period start + (wallclock time since the start) * vrate
1055
*
1056
* As a consistent snapshot of `period_at_vtime` and `period_at` is
1057
* needed, they're seqcount protected.
1058
*/
1059
do {
1060
seq = read_seqcount_begin(&ioc->period_seqcount);
1061
now->vnow = ioc->period_at_vtime +
1062
(now->now - ioc->period_at) * vrate;
1063
} while (read_seqcount_retry(&ioc->period_seqcount, seq));
1064
}
1065
1066
static void ioc_start_period(struct ioc *ioc, struct ioc_now *now)
1067
{
1068
WARN_ON_ONCE(ioc->running != IOC_RUNNING);
1069
1070
write_seqcount_begin(&ioc->period_seqcount);
1071
ioc->period_at = now->now;
1072
ioc->period_at_vtime = now->vnow;
1073
write_seqcount_end(&ioc->period_seqcount);
1074
1075
ioc->timer.expires = jiffies + usecs_to_jiffies(ioc->period_us);
1076
add_timer(&ioc->timer);
1077
}
1078
1079
/*
1080
* Update @iocg's `active` and `inuse` to @active and @inuse, update level
1081
* weight sums and propagate upwards accordingly. If @save, the current margin
1082
* is saved to be used as reference for later inuse in-period adjustments.
1083
*/
1084
static void __propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse,
1085
bool save, struct ioc_now *now)
1086
{
1087
struct ioc *ioc = iocg->ioc;
1088
int lvl;
1089
1090
lockdep_assert_held(&ioc->lock);
1091
1092
/*
1093
* For an active leaf node, its inuse shouldn't be zero or exceed
1094
* @active. An active internal node's inuse is solely determined by the
1095
* inuse to active ratio of its children regardless of @inuse.
1096
*/
1097
if (list_empty(&iocg->active_list) && iocg->child_active_sum) {
1098
inuse = DIV64_U64_ROUND_UP(active * iocg->child_inuse_sum,
1099
iocg->child_active_sum);
1100
} else {
1101
/*
1102
* It may be tempting to turn this into a clamp expression with
1103
* a lower limit of 1 but active may be 0, which cannot be used
1104
* as an upper limit in that situation. This expression allows
1105
* active to clamp inuse unless it is 0, in which case inuse
1106
* becomes 1.
1107
*/
1108
inuse = min(inuse, active) ?: 1;
1109
}
1110
1111
iocg->last_inuse = iocg->inuse;
1112
if (save)
1113
iocg->saved_margin = now->vnow - atomic64_read(&iocg->vtime);
1114
1115
if (active == iocg->active && inuse == iocg->inuse)
1116
return;
1117
1118
for (lvl = iocg->level - 1; lvl >= 0; lvl--) {
1119
struct ioc_gq *parent = iocg->ancestors[lvl];
1120
struct ioc_gq *child = iocg->ancestors[lvl + 1];
1121
u32 parent_active = 0, parent_inuse = 0;
1122
1123
/* update the level sums */
1124
parent->child_active_sum += (s32)(active - child->active);
1125
parent->child_inuse_sum += (s32)(inuse - child->inuse);
1126
/* apply the updates */
1127
child->active = active;
1128
child->inuse = inuse;
1129
1130
/*
1131
* The delta between the inuse and active sums indicates how
* much weight is being given away. The parent's inuse
1133
* and active should reflect the ratio.
1134
*/
1135
if (parent->child_active_sum) {
1136
parent_active = parent->weight;
1137
parent_inuse = DIV64_U64_ROUND_UP(
1138
parent_active * parent->child_inuse_sum,
1139
parent->child_active_sum);
1140
}
1141
1142
/* do we need to keep walking up? */
1143
if (parent_active == parent->active &&
1144
parent_inuse == parent->inuse)
1145
break;
1146
1147
active = parent_active;
1148
inuse = parent_inuse;
1149
}
1150
1151
ioc->weights_updated = true;
1152
}
1153
1154
static void commit_weights(struct ioc *ioc)
1155
{
1156
lockdep_assert_held(&ioc->lock);
1157
1158
if (ioc->weights_updated) {
1159
/* paired with rmb in current_hweight(), see there */
1160
smp_wmb();
1161
atomic_inc(&ioc->hweight_gen);
1162
ioc->weights_updated = false;
1163
}
1164
}
1165
1166
static void propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse,
1167
bool save, struct ioc_now *now)
1168
{
1169
__propagate_weights(iocg, active, inuse, save, now);
1170
commit_weights(iocg->ioc);
1171
}
1172
1173
static void current_hweight(struct ioc_gq *iocg, u32 *hw_activep, u32 *hw_inusep)
1174
{
1175
struct ioc *ioc = iocg->ioc;
1176
int lvl;
1177
u32 hwa, hwi;
1178
int ioc_gen;
1179
1180
/* hot path - if uptodate, use cached */
1181
ioc_gen = atomic_read(&ioc->hweight_gen);
1182
if (ioc_gen == iocg->hweight_gen)
1183
goto out;
1184
1185
/*
1186
* Paired with wmb in commit_weights(). If we saw the updated
1187
* hweight_gen, all the weight updates from __propagate_weights() are
1188
* visible too.
1189
*
1190
* We can race with weight updates during calculation and get it
1191
* wrong. However, hweight_gen would have changed and a future
1192
* reader will recalculate and we're guaranteed to discard the
1193
* wrong result soon.
1194
*/
1195
smp_rmb();
1196
1197
hwa = hwi = WEIGHT_ONE;
1198
for (lvl = 0; lvl <= iocg->level - 1; lvl++) {
1199
struct ioc_gq *parent = iocg->ancestors[lvl];
1200
struct ioc_gq *child = iocg->ancestors[lvl + 1];
1201
u64 active_sum = READ_ONCE(parent->child_active_sum);
1202
u64 inuse_sum = READ_ONCE(parent->child_inuse_sum);
1203
u32 active = READ_ONCE(child->active);
1204
u32 inuse = READ_ONCE(child->inuse);
1205
1206
/* we can race with deactivations and either may read as zero */
1207
if (!active_sum || !inuse_sum)
1208
continue;
1209
1210
active_sum = max_t(u64, active, active_sum);
1211
hwa = div64_u64((u64)hwa * active, active_sum);
1212
1213
inuse_sum = max_t(u64, inuse, inuse_sum);
1214
hwi = div64_u64((u64)hwi * inuse, inuse_sum);
1215
}
1216
1217
iocg->hweight_active = max_t(u32, hwa, 1);
1218
iocg->hweight_inuse = max_t(u32, hwi, 1);
1219
iocg->hweight_gen = ioc_gen;
1220
out:
1221
if (hw_activep)
1222
*hw_activep = iocg->hweight_active;
1223
if (hw_inusep)
1224
*hw_inusep = iocg->hweight_inuse;
1225
}
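
/*
 * Using the hierarchy from the comment at the top of the file: with A
 * (weight 100) and B (weight 300) both active under root, and A0 / A1
 * (weight 100 each) active under A, the walk above computes A0's
 * hweight_active as WEIGHT_ONE * (100 / 400) * (100 / 200), i.e. 12.5% of
 * WEIGHT_ONE (8192), matching the 12.5% share described there.
 */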
1226
1227
/*
1228
* Calculate the hweight_inuse @iocg would get with max @inuse assuming all the
1229
* other weights stay unchanged.
1230
*/
1231
static u32 current_hweight_max(struct ioc_gq *iocg)
1232
{
1233
u32 hwm = WEIGHT_ONE;
1234
u32 inuse = iocg->active;
1235
u64 child_inuse_sum;
1236
int lvl;
1237
1238
lockdep_assert_held(&iocg->ioc->lock);
1239
1240
for (lvl = iocg->level - 1; lvl >= 0; lvl--) {
1241
struct ioc_gq *parent = iocg->ancestors[lvl];
1242
struct ioc_gq *child = iocg->ancestors[lvl + 1];
1243
1244
child_inuse_sum = parent->child_inuse_sum + inuse - child->inuse;
1245
hwm = div64_u64((u64)hwm * inuse, child_inuse_sum);
1246
inuse = DIV64_U64_ROUND_UP(parent->active * child_inuse_sum,
1247
parent->child_active_sum);
1248
}
1249
1250
return max_t(u32, hwm, 1);
1251
}
1252
1253
static void weight_updated(struct ioc_gq *iocg, struct ioc_now *now)
1254
{
1255
struct ioc *ioc = iocg->ioc;
1256
struct blkcg_gq *blkg = iocg_to_blkg(iocg);
1257
struct ioc_cgrp *iocc = blkcg_to_iocc(blkg->blkcg);
1258
u32 weight;
1259
1260
lockdep_assert_held(&ioc->lock);
1261
1262
weight = iocg->cfg_weight ?: iocc->dfl_weight;
1263
if (weight != iocg->weight && iocg->active)
1264
propagate_weights(iocg, weight, iocg->inuse, true, now);
1265
iocg->weight = weight;
1266
}
1267
1268
static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
1269
{
1270
struct ioc *ioc = iocg->ioc;
1271
u64 __maybe_unused last_period, cur_period;
1272
u64 vtime, vtarget;
1273
int i;
1274
1275
/*
1276
* If we seem to be already active, just update the stamp to tell the
* timer that we're still active. We don't mind occasional races.
1278
*/
1279
if (!list_empty(&iocg->active_list)) {
1280
ioc_now(ioc, now);
1281
cur_period = atomic64_read(&ioc->cur_period);
1282
if (atomic64_read(&iocg->active_period) != cur_period)
1283
atomic64_set(&iocg->active_period, cur_period);
1284
return true;
1285
}
1286
1287
/* racy check on internal node IOs, treat as root level IOs */
1288
if (iocg->child_active_sum)
1289
return false;
1290
1291
spin_lock_irq(&ioc->lock);
1292
1293
ioc_now(ioc, now);
1294
1295
/* update period */
1296
cur_period = atomic64_read(&ioc->cur_period);
1297
last_period = atomic64_read(&iocg->active_period);
1298
atomic64_set(&iocg->active_period, cur_period);
1299
1300
/* already activated or breaking leaf-only constraint? */
1301
if (!list_empty(&iocg->active_list))
1302
goto succeed_unlock;
1303
for (i = iocg->level - 1; i > 0; i--)
1304
if (!list_empty(&iocg->ancestors[i]->active_list))
1305
goto fail_unlock;
1306
1307
if (iocg->child_active_sum)
1308
goto fail_unlock;
1309
1310
/*
1311
* Always start with the target budget. On deactivation, we throw away
1312
* anything above it.
1313
*/
1314
vtarget = now->vnow - ioc->margins.target;
1315
vtime = atomic64_read(&iocg->vtime);
1316
1317
atomic64_add(vtarget - vtime, &iocg->vtime);
1318
atomic64_add(vtarget - vtime, &iocg->done_vtime);
1319
vtime = vtarget;
1320
1321
/*
1322
* Activate, propagate weight and start period timer if not
1323
* running. Reset hweight_gen to avoid accidental match from
1324
* wrapping.
1325
*/
1326
iocg->hweight_gen = atomic_read(&ioc->hweight_gen) - 1;
1327
list_add(&iocg->active_list, &ioc->active_iocgs);
1328
1329
propagate_weights(iocg, iocg->weight,
1330
iocg->last_inuse ?: iocg->weight, true, now);
1331
1332
TRACE_IOCG_PATH(iocg_activate, iocg, now,
1333
last_period, cur_period, vtime);
1334
1335
iocg->activated_at = now->now;
1336
1337
if (ioc->running == IOC_IDLE) {
1338
ioc->running = IOC_RUNNING;
1339
ioc->dfgv_period_at = now->now;
1340
ioc->dfgv_period_rem = 0;
1341
ioc_start_period(ioc, now);
1342
}
1343
1344
succeed_unlock:
1345
spin_unlock_irq(&ioc->lock);
1346
return true;
1347
1348
fail_unlock:
1349
spin_unlock_irq(&ioc->lock);
1350
return false;
1351
}
1352
1353
static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
1354
{
1355
struct ioc *ioc = iocg->ioc;
1356
struct blkcg_gq *blkg = iocg_to_blkg(iocg);
1357
u64 tdelta, delay, new_delay, shift;
1358
s64 vover, vover_pct;
1359
u32 hwa;
1360
1361
lockdep_assert_held(&iocg->waitq.lock);
1362
1363
/*
1364
* If the delay is set by another CPU, we may be in the past. No need to
1365
* change anything if so. This avoids decay calculation underflow.
1366
*/
1367
if (time_before64(now->now, iocg->delay_at))
1368
return false;
1369
1370
/* calculate the current delay in effect - 1/2 every second */
1371
tdelta = now->now - iocg->delay_at;
1372
shift = div64_u64(tdelta, USEC_PER_SEC);
1373
if (iocg->delay && shift < BITS_PER_LONG)
1374
delay = iocg->delay >> shift;
1375
else
1376
delay = 0;
1377
1378
/* calculate the new delay from the debt amount */
1379
current_hweight(iocg, &hwa, NULL);
1380
vover = atomic64_read(&iocg->vtime) +
1381
abs_cost_to_cost(iocg->abs_vdebt, hwa) - now->vnow;
1382
vover_pct = div64_s64(100 * vover,
1383
ioc->period_us * ioc->vtime_base_rate);
1384
1385
if (vover_pct <= MIN_DELAY_THR_PCT)
1386
new_delay = 0;
1387
else if (vover_pct >= MAX_DELAY_THR_PCT)
1388
new_delay = MAX_DELAY;
1389
else
1390
new_delay = MIN_DELAY +
1391
div_u64((MAX_DELAY - MIN_DELAY) *
1392
(vover_pct - MIN_DELAY_THR_PCT),
1393
MAX_DELAY_THR_PCT - MIN_DELAY_THR_PCT);
1394
1395
/* pick the higher one and apply */
1396
if (new_delay > delay) {
1397
iocg->delay = new_delay;
1398
iocg->delay_at = now->now;
1399
delay = new_delay;
1400
}
1401
1402
if (delay >= MIN_DELAY) {
1403
if (!iocg->indelay_since)
1404
iocg->indelay_since = now->now;
1405
blkcg_set_delay(blkg, delay * NSEC_PER_USEC);
1406
return true;
1407
} else {
1408
if (iocg->indelay_since) {
1409
iocg->stat.indelay_us += now->now - iocg->indelay_since;
1410
iocg->indelay_since = 0;
1411
}
1412
iocg->delay = 0;
1413
blkcg_clear_delay(blkg);
1414
return false;
1415
}
1416
}
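
/*
 * For example, a debt overage amounting to 12750% of a period - halfway
 * between MIN_DELAY_THR_PCT and MAX_DELAY_THR_PCT - maps to a delay of
 * MIN_DELAY + (MAX_DELAY - MIN_DELAY) / 2, roughly 125ms, while overages
 * at or below 500% of a period induce no delay at all.
 */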
1417
1418
static void iocg_incur_debt(struct ioc_gq *iocg, u64 abs_cost,
1419
struct ioc_now *now)
1420
{
1421
struct iocg_pcpu_stat *gcs;
1422
1423
lockdep_assert_held(&iocg->ioc->lock);
1424
lockdep_assert_held(&iocg->waitq.lock);
1425
WARN_ON_ONCE(list_empty(&iocg->active_list));
1426
1427
/*
1428
* Once in debt, debt handling owns inuse. @iocg stays at the minimum
* inuse, donating all of its share to others until its debt is paid off.
1430
*/
1431
if (!iocg->abs_vdebt && abs_cost) {
1432
iocg->indebt_since = now->now;
1433
propagate_weights(iocg, iocg->active, 0, false, now);
1434
}
1435
1436
iocg->abs_vdebt += abs_cost;
1437
1438
gcs = get_cpu_ptr(iocg->pcpu_stat);
1439
local64_add(abs_cost, &gcs->abs_vusage);
1440
put_cpu_ptr(gcs);
1441
}
1442
1443
static void iocg_pay_debt(struct ioc_gq *iocg, u64 abs_vpay,
1444
struct ioc_now *now)
1445
{
1446
lockdep_assert_held(&iocg->ioc->lock);
1447
lockdep_assert_held(&iocg->waitq.lock);
1448
1449
/*
1450
* make sure that nobody messed with @iocg. Check iocg->pd.online
1451
* to avoid a warning when removing the blkcg or disk.
1452
*/
1453
WARN_ON_ONCE(list_empty(&iocg->active_list) && iocg->pd.online);
1454
WARN_ON_ONCE(iocg->inuse > 1);
1455
1456
iocg->abs_vdebt -= min(abs_vpay, iocg->abs_vdebt);
1457
1458
/* if debt is paid in full, restore inuse */
1459
if (!iocg->abs_vdebt) {
1460
iocg->stat.indebt_us += now->now - iocg->indebt_since;
1461
iocg->indebt_since = 0;
1462
1463
propagate_weights(iocg, iocg->active, iocg->last_inuse,
1464
false, now);
1465
}
1466
}
1467
1468
static int iocg_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode,
1469
int flags, void *key)
1470
{
1471
struct iocg_wait *wait = container_of(wq_entry, struct iocg_wait, wait);
1472
struct iocg_wake_ctx *ctx = key;
1473
u64 cost = abs_cost_to_cost(wait->abs_cost, ctx->hw_inuse);
1474
1475
ctx->vbudget -= cost;
1476
1477
if (ctx->vbudget < 0)
1478
return -1;
1479
1480
iocg_commit_bio(ctx->iocg, wait->bio, wait->abs_cost, cost);
1481
wait->committed = true;
1482
1483
/*
1484
* autoremove_wake_function() removes the wait entry only when it
1485
* actually changed the task state. We want the wait always removed.
1486
* Remove explicitly and use default_wake_function(). Note that the
1487
* order of operations is important as finish_wait() tests whether
1488
* @wq_entry is removed without grabbing the lock.
1489
*/
1490
default_wake_function(wq_entry, mode, flags, key);
1491
list_del_init_careful(&wq_entry->entry);
1492
return 0;
1493
}
1494
1495
/*
1496
* Calculate the accumulated budget, pay debt if @pay_debt and wake up waiters
1497
* accordingly. When @pay_debt is %true, the caller must be holding ioc->lock in
1498
* addition to iocg->waitq.lock.
1499
*/
1500
static void iocg_kick_waitq(struct ioc_gq *iocg, bool pay_debt,
1501
struct ioc_now *now)
1502
{
1503
struct ioc *ioc = iocg->ioc;
1504
struct iocg_wake_ctx ctx = { .iocg = iocg };
1505
u64 vshortage, expires, oexpires;
1506
s64 vbudget;
1507
u32 hwa;
1508
1509
lockdep_assert_held(&iocg->waitq.lock);
1510
1511
current_hweight(iocg, &hwa, NULL);
1512
vbudget = now->vnow - atomic64_read(&iocg->vtime);
1513
1514
/* pay off debt */
1515
if (pay_debt && iocg->abs_vdebt && vbudget > 0) {
1516
u64 abs_vbudget = cost_to_abs_cost(vbudget, hwa);
1517
u64 abs_vpay = min_t(u64, abs_vbudget, iocg->abs_vdebt);
1518
u64 vpay = abs_cost_to_cost(abs_vpay, hwa);
1519
1520
lockdep_assert_held(&ioc->lock);
1521
1522
atomic64_add(vpay, &iocg->vtime);
1523
atomic64_add(vpay, &iocg->done_vtime);
1524
iocg_pay_debt(iocg, abs_vpay, now);
1525
vbudget -= vpay;
1526
}
1527
1528
if (iocg->abs_vdebt || iocg->delay)
1529
iocg_kick_delay(iocg, now);
1530
1531
/*
1532
* Debt can still be outstanding if we haven't paid all yet or the
1533
* caller raced and called without @pay_debt. Shouldn't wake up waiters
1534
* under debt. Make sure @vbudget reflects the outstanding amount and is
1535
* not positive.
1536
*/
1537
if (iocg->abs_vdebt) {
1538
s64 vdebt = abs_cost_to_cost(iocg->abs_vdebt, hwa);
1539
vbudget = min_t(s64, 0, vbudget - vdebt);
1540
}
1541
1542
/*
1543
* Wake up the ones which are due and see how much vtime we'll need for
1544
* the next one. As paying off debt restores hw_inuse, it must be read
1545
* after the above debt payment.
1546
*/
1547
ctx.vbudget = vbudget;
1548
current_hweight(iocg, NULL, &ctx.hw_inuse);
1549
1550
__wake_up_locked_key(&iocg->waitq, TASK_NORMAL, &ctx);
1551
1552
if (!waitqueue_active(&iocg->waitq)) {
1553
if (iocg->wait_since) {
1554
iocg->stat.wait_us += now->now - iocg->wait_since;
1555
iocg->wait_since = 0;
1556
}
1557
return;
1558
}
1559
1560
if (!iocg->wait_since)
1561
iocg->wait_since = now->now;
1562
1563
if (WARN_ON_ONCE(ctx.vbudget >= 0))
1564
return;
1565
1566
/* determine next wakeup, add a timer margin to guarantee chunking */
1567
vshortage = -ctx.vbudget;
1568
expires = now->now_ns +
1569
DIV64_U64_ROUND_UP(vshortage, ioc->vtime_base_rate) *
1570
NSEC_PER_USEC;
1571
expires += ioc->timer_slack_ns;
1572
1573
/* if already active and close enough, don't bother */
1574
oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->waitq_timer));
1575
if (hrtimer_is_queued(&iocg->waitq_timer) &&
1576
abs(oexpires - expires) <= ioc->timer_slack_ns)
1577
return;
1578
1579
hrtimer_start_range_ns(&iocg->waitq_timer, ns_to_ktime(expires),
1580
ioc->timer_slack_ns, HRTIMER_MODE_ABS);
1581
}
1582
1583
static enum hrtimer_restart iocg_waitq_timer_fn(struct hrtimer *timer)
1584
{
1585
struct ioc_gq *iocg = container_of(timer, struct ioc_gq, waitq_timer);
1586
bool pay_debt = READ_ONCE(iocg->abs_vdebt);
1587
struct ioc_now now;
1588
unsigned long flags;
1589
1590
ioc_now(iocg->ioc, &now);
1591
1592
iocg_lock(iocg, pay_debt, &flags);
1593
iocg_kick_waitq(iocg, pay_debt, &now);
1594
iocg_unlock(iocg, pay_debt, &flags);
1595
1596
return HRTIMER_NORESTART;
1597
}
1598
1599
static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p)
1600
{
1601
u32 nr_met[2] = { };
1602
u32 nr_missed[2] = { };
1603
u64 rq_wait_ns = 0;
1604
int cpu, rw;
1605
1606
for_each_online_cpu(cpu) {
1607
struct ioc_pcpu_stat *stat = per_cpu_ptr(ioc->pcpu_stat, cpu);
1608
u64 this_rq_wait_ns;
1609
1610
for (rw = READ; rw <= WRITE; rw++) {
1611
u32 this_met = local_read(&stat->missed[rw].nr_met);
1612
u32 this_missed = local_read(&stat->missed[rw].nr_missed);
1613
1614
nr_met[rw] += this_met - stat->missed[rw].last_met;
1615
nr_missed[rw] += this_missed - stat->missed[rw].last_missed;
1616
stat->missed[rw].last_met = this_met;
1617
stat->missed[rw].last_missed = this_missed;
1618
}
1619
1620
this_rq_wait_ns = local64_read(&stat->rq_wait_ns);
1621
rq_wait_ns += this_rq_wait_ns - stat->last_rq_wait_ns;
1622
stat->last_rq_wait_ns = this_rq_wait_ns;
1623
}
1624
1625
for (rw = READ; rw <= WRITE; rw++) {
1626
if (nr_met[rw] + nr_missed[rw])
1627
missed_ppm_ar[rw] =
1628
DIV64_U64_ROUND_UP((u64)nr_missed[rw] * MILLION,
1629
nr_met[rw] + nr_missed[rw]);
1630
else
1631
missed_ppm_ar[rw] = 0;
1632
}
1633
1634
*rq_wait_pct_p = div64_u64(rq_wait_ns * 100,
1635
ioc->period_us * NSEC_PER_USEC);
1636
}
1637
1638
/* was iocg idle this period? */
1639
static bool iocg_is_idle(struct ioc_gq *iocg)
1640
{
1641
struct ioc *ioc = iocg->ioc;
1642
1643
/* did something get issued this period? */
1644
if (atomic64_read(&iocg->active_period) ==
1645
atomic64_read(&ioc->cur_period))
1646
return false;
1647
1648
/* is something in flight? */
1649
if (atomic64_read(&iocg->done_vtime) != atomic64_read(&iocg->vtime))
1650
return false;
1651
1652
return true;
1653
}
1654
1655
/*
1656
* Call this function on the target leaf @iocgs to build a pre-order traversal
1657
* list of all the ancestors in @inner_walk. The inner nodes are linked through
1658
* ->walk_list and the caller is responsible for dissolving the list after use.
1659
*/
1660
static void iocg_build_inner_walk(struct ioc_gq *iocg,
1661
struct list_head *inner_walk)
1662
{
1663
int lvl;
1664
1665
WARN_ON_ONCE(!list_empty(&iocg->walk_list));
1666
1667
/* find the first ancestor which hasn't been visited yet */
1668
for (lvl = iocg->level - 1; lvl >= 0; lvl--) {
1669
if (!list_empty(&iocg->ancestors[lvl]->walk_list))
1670
break;
1671
}
1672
1673
/* walk down and visit the inner nodes to get pre-order traversal */
1674
while (++lvl <= iocg->level - 1) {
1675
struct ioc_gq *inner = iocg->ancestors[lvl];
1676
1677
/* record traversal order */
1678
list_add_tail(&inner->walk_list, inner_walk);
1679
}
1680
}
1681
1682
/* propagate the deltas to the parent */
1683
static void iocg_flush_stat_upward(struct ioc_gq *iocg)
1684
{
1685
if (iocg->level > 0) {
1686
struct iocg_stat *parent_stat =
1687
&iocg->ancestors[iocg->level - 1]->stat;
1688
1689
parent_stat->usage_us +=
1690
iocg->stat.usage_us - iocg->last_stat.usage_us;
1691
parent_stat->wait_us +=
1692
iocg->stat.wait_us - iocg->last_stat.wait_us;
1693
parent_stat->indebt_us +=
1694
iocg->stat.indebt_us - iocg->last_stat.indebt_us;
1695
parent_stat->indelay_us +=
1696
iocg->stat.indelay_us - iocg->last_stat.indelay_us;
1697
}
1698
1699
iocg->last_stat = iocg->stat;
1700
}
1701
1702
/* collect per-cpu counters and propagate the deltas to the parent */
1703
static void iocg_flush_stat_leaf(struct ioc_gq *iocg, struct ioc_now *now)
1704
{
1705
struct ioc *ioc = iocg->ioc;
1706
u64 abs_vusage = 0;
1707
u64 vusage_delta;
1708
int cpu;
1709
1710
lockdep_assert_held(&iocg->ioc->lock);
1711
1712
/* collect per-cpu counters */
1713
for_each_possible_cpu(cpu) {
1714
abs_vusage += local64_read(
1715
per_cpu_ptr(&iocg->pcpu_stat->abs_vusage, cpu));
1716
}
1717
vusage_delta = abs_vusage - iocg->last_stat_abs_vusage;
1718
iocg->last_stat_abs_vusage = abs_vusage;
1719
1720
iocg->usage_delta_us = div64_u64(vusage_delta, ioc->vtime_base_rate);
1721
iocg->stat.usage_us += iocg->usage_delta_us;
1722
1723
iocg_flush_stat_upward(iocg);
1724
}
1725
1726
/* get stat counters ready for reading on all active iocgs */
1727
static void iocg_flush_stat(struct list_head *target_iocgs, struct ioc_now *now)
1728
{
1729
LIST_HEAD(inner_walk);
1730
struct ioc_gq *iocg, *tiocg;
1731
1732
/* flush leaves and build inner node walk list */
1733
list_for_each_entry(iocg, target_iocgs, active_list) {
1734
iocg_flush_stat_leaf(iocg, now);
1735
iocg_build_inner_walk(iocg, &inner_walk);
1736
}
1737
1738
/* keep flushing upwards by walking the inner list backwards */
1739
list_for_each_entry_safe_reverse(iocg, tiocg, &inner_walk, walk_list) {
1740
iocg_flush_stat_upward(iocg);
1741
list_del_init(&iocg->walk_list);
1742
}
1743
}
1744
1745
/*
1746
* Determine what @iocg's hweight_inuse should be after donating unused
1747
* capacity. @hwm is the upper bound and used to signal no donation. This
1748
* function also throws away @iocg's excess budget.
1749
*/
1750
static u32 hweight_after_donation(struct ioc_gq *iocg, u32 old_hwi, u32 hwm,
1751
u32 usage, struct ioc_now *now)
1752
{
1753
struct ioc *ioc = iocg->ioc;
1754
u64 vtime = atomic64_read(&iocg->vtime);
1755
s64 excess, delta, target, new_hwi;
1756
1757
/* debt handling owns inuse for debtors */
1758
if (iocg->abs_vdebt)
1759
return 1;
1760
1761
/* see whether minimum margin requirement is met */
1762
if (waitqueue_active(&iocg->waitq) ||
1763
time_after64(vtime, now->vnow - ioc->margins.min))
1764
return hwm;
1765
1766
/* throw away excess above target */
1767
excess = now->vnow - vtime - ioc->margins.target;
1768
if (excess > 0) {
1769
atomic64_add(excess, &iocg->vtime);
1770
atomic64_add(excess, &iocg->done_vtime);
1771
vtime += excess;
1772
ioc->vtime_err -= div64_u64(excess * old_hwi, WEIGHT_ONE);
1773
}
1774
1775
/*
 * Let delta be the distance between @iocg's and the device's vtimes as
 * a fraction of the period duration. Assuming that the iocg will
 * consume the usage determined above, we want to determine new_hwi so
 * that delta equals MARGIN_TARGET at the end of the next period.
 *
 * We need to execute usage worth of IOs while spending the sum of the
 * new budget (1 - MARGIN_TARGET) and the leftover from the last period
 * (delta):
 *
 * usage = (1 - MARGIN_TARGET + delta) * new_hwi
 *
 * Therefore, the new_hwi is:
 *
 * new_hwi = usage / (1 - MARGIN_TARGET + delta)
 */
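/*
 * Worked example with made-up numbers, taking MARGIN_TARGET as half a
 * period for illustration: if the iocg is lagging the device by 10% of
 * a period (delta = 0.1) and consumed 25% of the device (usage = 0.25),
 *
 * new_hwi = 0.25 / (1 - 0.5 + 0.1) ~= 0.42
 *
 * i.e. hweight_inuse is cut to roughly 42% of WEIGHT_ONE, subject to
 * the clamp into [1, hwm] below.
 */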
1791
delta = div64_s64(WEIGHT_ONE * (now->vnow - vtime),
1792
now->vnow - ioc->period_at_vtime);
1793
target = WEIGHT_ONE * MARGIN_TARGET_PCT / 100;
1794
new_hwi = div64_s64(WEIGHT_ONE * usage, WEIGHT_ONE - target + delta);
1795
1796
return clamp_t(s64, new_hwi, 1, hwm);
1797
}
/*
 * For work-conservation, an iocg which isn't using all of its share should
 * donate the leftover to other iocgs. There are two ways to achieve this - 1.
 * bumping up vrate accordingly and 2. lowering the donating iocg's inuse
 * weight.
 *
 * #1 is mathematically simpler but has the drawback of requiring synchronous
 * global hweight_inuse updates when idle iocgs get activated or inuse weights
 * change due to donation snapbacks as it has the possibility of grossly
 * overshooting what's allowed by the model and vrate.
 *
 * #2 is inherently safe with local operations. The donating iocg can easily
 * snap back to higher weights when needed without worrying about impacts on
 * other nodes as the impacts will be inherently correct. This also makes idle
 * iocg activations safe. The only effect activations have is decreasing
 * hweight_inuse of others, the right solution to which is for those iocgs to
 * snap back to higher weights.
 *
 * So, we go with #2. The challenge is calculating how each donating iocg's
 * inuse should be adjusted to achieve the target donation amounts. This is done
 * using Andy's method described in the following pdf.
 *
 * https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
 *
 * Given the weights and target after-donation hweight_inuse values, Andy's
 * method determines how the proportional distribution should look at each
 * sibling level to maintain the relative relationship between all non-donating
 * pairs. To roughly summarize, it divides the tree into donating and
 * non-donating parts, calculates the global donation rate which is used to
 * determine the target hweight_inuse for each node, and then derives per-level
 * proportions.
 *
 * The following pdf shows that global distribution calculated this way can be
 * achieved by scaling inuse weights of donating leaves and propagating the
 * adjustments upwards proportionally.
 *
 * https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
 *
 * Combining the above two, we can determine how each leaf iocg's inuse should
 * be adjusted to achieve the target donation.
 *
 * https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN
 *
 * The inline comments use symbols from the last pdf.
 *
 * b is the sum of the absolute budgets in the subtree. 1 for the root node.
 * f is the sum of the absolute budgets of non-donating nodes in the subtree.
 * t is the sum of the absolute budgets of donating nodes in the subtree.
 * w is the weight of the node. w = w_f + w_t
 * w_f is the non-donating portion of w. w_f = w * f / b
 * w_t is the donating portion of w. w_t = w * t / b
 * s is the sum of all sibling weights. s = Sum(w) for siblings
 * s_f and s_t are the non-donating and donating portions of s.
 *
 * Subscript p denotes the parent's counterpart and ' the adjusted value - e.g.
 * w_pt is the donating portion of the parent's weight and w'_pt the same value
 * after adjustments. Subscript r denotes the root node's values.
 */
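
/*
 * For example (illustrative numbers only): a node with w = 300 whose
 * subtree budget is two thirds non-donating (f / b = 2/3) and one third
 * donating has w_f = 200 and w_t = 100; s, s_f and s_t are simply these
 * quantities summed over all of the node's siblings.
 */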
1856
static void transfer_surpluses(struct list_head *surpluses, struct ioc_now *now)
1857
{
1858
LIST_HEAD(over_hwa);
1859
LIST_HEAD(inner_walk);
1860
struct ioc_gq *iocg, *tiocg, *root_iocg;
1861
u32 after_sum, over_sum, over_target, gamma;
1862
1863
/*
1864
* It's pretty unlikely but possible for the total sum of
1865
* hweight_after_donation's to be higher than WEIGHT_ONE, which will
1866
* confuse the following calculations. If such condition is detected,
1867
* scale down everyone over its full share equally to keep the sum below
1868
* WEIGHT_ONE.
1869
*/
1870
after_sum = 0;
1871
over_sum = 0;
1872
list_for_each_entry(iocg, surpluses, surplus_list) {
1873
u32 hwa;
1874
1875
current_hweight(iocg, &hwa, NULL);
1876
after_sum += iocg->hweight_after_donation;
1877
1878
if (iocg->hweight_after_donation > hwa) {
1879
over_sum += iocg->hweight_after_donation;
1880
list_add(&iocg->walk_list, &over_hwa);
1881
}
1882
}
1883
1884
if (after_sum >= WEIGHT_ONE) {
1885
/*
 * The delta should be deducted from the over_sum; calculate the
 * target over_sum value.
 */
1889
u32 over_delta = after_sum - (WEIGHT_ONE - 1);
1890
WARN_ON_ONCE(over_sum <= over_delta);
1891
over_target = over_sum - over_delta;
1892
} else {
1893
over_target = 0;
1894
}
1895
1896
list_for_each_entry_safe(iocg, tiocg, &over_hwa, walk_list) {
1897
if (over_target)
1898
iocg->hweight_after_donation =
1899
div_u64((u64)iocg->hweight_after_donation *
1900
over_target, over_sum);
1901
list_del_init(&iocg->walk_list);
1902
}
1903
1904
/*
1905
* Build pre-order inner node walk list and prepare for donation
1906
* adjustment calculations.
1907
*/
1908
list_for_each_entry(iocg, surpluses, surplus_list) {
1909
iocg_build_inner_walk(iocg, &inner_walk);
1910
}
1911
1912
root_iocg = list_first_entry(&inner_walk, struct ioc_gq, walk_list);
1913
WARN_ON_ONCE(root_iocg->level > 0);
1914
1915
list_for_each_entry(iocg, &inner_walk, walk_list) {
1916
iocg->child_adjusted_sum = 0;
1917
iocg->hweight_donating = 0;
1918
iocg->hweight_after_donation = 0;
1919
}
1920
1921
/*
1922
* Propagate the donating budget (b_t) and after donation budget (b'_t)
1923
* up the hierarchy.
1924
*/
1925
list_for_each_entry(iocg, surpluses, surplus_list) {
1926
struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
1927
1928
parent->hweight_donating += iocg->hweight_donating;
1929
parent->hweight_after_donation += iocg->hweight_after_donation;
1930
}
1931
1932
list_for_each_entry_reverse(iocg, &inner_walk, walk_list) {
1933
if (iocg->level > 0) {
1934
struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
1935
1936
parent->hweight_donating += iocg->hweight_donating;
1937
parent->hweight_after_donation += iocg->hweight_after_donation;
1938
}
1939
}
1940
1941
/*
1942
* Calculate inner hwa's (b) and make sure the donation values are
1943
* within the accepted ranges as we're doing low res calculations with
1944
* roundups.
1945
*/
1946
list_for_each_entry(iocg, &inner_walk, walk_list) {
1947
if (iocg->level) {
1948
struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
1949
1950
iocg->hweight_active = DIV64_U64_ROUND_UP(
1951
(u64)parent->hweight_active * iocg->active,
1952
parent->child_active_sum);
1953
1954
}
1955
1956
iocg->hweight_donating = min(iocg->hweight_donating,
1957
iocg->hweight_active);
1958
iocg->hweight_after_donation = min(iocg->hweight_after_donation,
1959
iocg->hweight_donating - 1);
1960
if (WARN_ON_ONCE(iocg->hweight_active <= 1 ||
1961
iocg->hweight_donating <= 1 ||
1962
iocg->hweight_after_donation == 0)) {
1963
pr_warn("iocg: invalid donation weights in ");
1964
pr_cont_cgroup_path(iocg_to_blkg(iocg)->blkcg->css.cgroup);
1965
pr_cont(": active=%u donating=%u after=%u\n",
1966
iocg->hweight_active, iocg->hweight_donating,
1967
iocg->hweight_after_donation);
1968
}
1969
}
1970
1971
/*
1972
* Calculate the global donation rate (gamma) - the rate to adjust
1973
* non-donating budgets by.
1974
*
1975
* No need to use 64bit multiplication here as the first operand is
1976
* guaranteed to be smaller than WEIGHT_ONE (1<<16).
1977
*
1978
* We know that there are beneficiary nodes and the sum of the donating
1979
* hweights can't be whole; however, due to the round-ups during hweight
1980
* calculations, root_iocg->hweight_donating might still end up equal to
1981
* or greater than whole. Limit the range when calculating the divider.
1982
*
1983
* gamma = (1 - t_r') / (1 - t_r)
1984
*/
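/*
 * For example (illustrative numbers only): if the donating nodes
 * currently hold 40% of the device (t_r = 0.4) and will keep only 10%
 * after donation (t_r' = 0.1), gamma = (1 - 0.1) / (1 - 0.4) = 1.5 and
 * every non-donating budget gets scaled up by 1.5x.
 */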
1985
gamma = DIV_ROUND_UP(
1986
(WEIGHT_ONE - root_iocg->hweight_after_donation) * WEIGHT_ONE,
1987
WEIGHT_ONE - min_t(u32, root_iocg->hweight_donating, WEIGHT_ONE - 1));
1988
1989
/*
1990
* Calculate adjusted hwi, child_adjusted_sum and inuse for the inner
1991
* nodes.
1992
*/
1993
list_for_each_entry(iocg, &inner_walk, walk_list) {
1994
struct ioc_gq *parent;
1995
u32 inuse, wpt, wptp;
1996
u64 st, sf;
1997
1998
if (iocg->level == 0) {
1999
/* adjusted weight sum for 1st level: s' = s * b_pf / b'_pf */
2000
iocg->child_adjusted_sum = DIV64_U64_ROUND_UP(
2001
iocg->child_active_sum * (WEIGHT_ONE - iocg->hweight_donating),
2002
WEIGHT_ONE - iocg->hweight_after_donation);
2003
continue;
2004
}
2005
2006
parent = iocg->ancestors[iocg->level - 1];
2007
2008
/* b' = gamma * b_f + b_t' */
2009
iocg->hweight_inuse = DIV64_U64_ROUND_UP(
2010
(u64)gamma * (iocg->hweight_active - iocg->hweight_donating),
2011
WEIGHT_ONE) + iocg->hweight_after_donation;
2012
2013
/* w' = s' * b' / b'_p */
2014
inuse = DIV64_U64_ROUND_UP(
2015
(u64)parent->child_adjusted_sum * iocg->hweight_inuse,
2016
parent->hweight_inuse);
2017
2018
/* adjusted weight sum for children: s' = s_f + s_t * w'_pt / w_pt */
2019
st = DIV64_U64_ROUND_UP(
2020
iocg->child_active_sum * iocg->hweight_donating,
2021
iocg->hweight_active);
2022
sf = iocg->child_active_sum - st;
2023
wpt = DIV64_U64_ROUND_UP(
2024
(u64)iocg->active * iocg->hweight_donating,
2025
iocg->hweight_active);
2026
wptp = DIV64_U64_ROUND_UP(
2027
(u64)inuse * iocg->hweight_after_donation,
2028
iocg->hweight_inuse);
2029
2030
iocg->child_adjusted_sum = sf + DIV64_U64_ROUND_UP(st * wptp, wpt);
2031
}
2032
2033
/*
2034
* All inner nodes now have ->hweight_inuse and ->child_adjusted_sum and
2035
* we can finally determine leaf adjustments.
2036
*/
2037
list_for_each_entry(iocg, surpluses, surplus_list) {
2038
struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
2039
u32 inuse;
2040
2041
/*
 * In-debt iocgs participated in the donation calculation with
 * the minimum target hweight_inuse. Configuring inuse
 * accordingly would work fine but debt handling expects
 * @iocg->inuse to stay at the minimum and we don't want to
 * interfere.
 */
2048
if (iocg->abs_vdebt) {
2049
WARN_ON_ONCE(iocg->inuse > 1);
2050
continue;
2051
}
2052
2053
/* w' = s' * b' / b'_p, note that b' == b'_t for donating leaves */
2054
inuse = DIV64_U64_ROUND_UP(
2055
parent->child_adjusted_sum * iocg->hweight_after_donation,
2056
parent->hweight_inuse);
2057
2058
TRACE_IOCG_PATH(inuse_transfer, iocg, now,
2059
iocg->inuse, inuse,
2060
iocg->hweight_inuse,
2061
iocg->hweight_after_donation);
2062
2063
__propagate_weights(iocg, iocg->active, inuse, true, now);
2064
}
2065
2066
/* walk list should be dissolved after use */
2067
list_for_each_entry_safe(iocg, tiocg, &inner_walk, walk_list)
2068
list_del_init(&iocg->walk_list);
2069
}
2070
2071
/*
2072
* A low weight iocg can amass a large amount of debt, for example, when
2073
* anonymous memory gets reclaimed aggressively. If the system has a lot of
2074
* memory paired with a slow IO device, the debt can span multiple seconds or
2075
* more. If there are no other subsequent IO issuers, the in-debt iocg may end
2076
* up blocked paying its debt while the IO device is idle.
2077
*
2078
* The following protects against such cases. If the device has been
2079
* sufficiently idle for a while, the debts are halved and delays are
2080
* recalculated.
2081
*/
2082
static void ioc_forgive_debts(struct ioc *ioc, u64 usage_us_sum, int nr_debtors,
2083
struct ioc_now *now)
2084
{
2085
struct ioc_gq *iocg;
2086
u64 dur, usage_pct, nr_cycles, nr_cycles_shift;
2087
2088
/* if no debtor, reset the cycle */
2089
if (!nr_debtors) {
2090
ioc->dfgv_period_at = now->now;
2091
ioc->dfgv_period_rem = 0;
2092
ioc->dfgv_usage_us_sum = 0;
2093
return;
2094
}
2095
2096
/*
2097
* Debtors can pass through a lot of writes choking the device and we
2098
* don't want to be forgiving debts while the device is struggling from
2099
* write bursts. If we're missing latency targets, consider the device
2100
* fully utilized.
2101
*/
2102
if (ioc->busy_level > 0)
2103
usage_us_sum = max_t(u64, usage_us_sum, ioc->period_us);
2104
2105
ioc->dfgv_usage_us_sum += usage_us_sum;
2106
if (time_before64(now->now, ioc->dfgv_period_at + DFGV_PERIOD))
2107
return;
2108
2109
/*
2110
* At least DFGV_PERIOD has passed since the last period. Calculate the
2111
* average usage and reset the period counters.
2112
*/
2113
dur = now->now - ioc->dfgv_period_at;
2114
usage_pct = div64_u64(100 * ioc->dfgv_usage_us_sum, dur);
2115
2116
ioc->dfgv_period_at = now->now;
2117
ioc->dfgv_usage_us_sum = 0;
2118
2119
/* if was too busy, reset everything */
2120
if (usage_pct > DFGV_USAGE_PCT) {
2121
ioc->dfgv_period_rem = 0;
2122
return;
2123
}
2124
2125
/*
2126
* Usage is lower than threshold. Let's forgive some debts. Debt
2127
* forgiveness runs off of the usual ioc timer but its period usually
2128
* doesn't match ioc's. Compensate the difference by performing the
2129
* reduction as many times as would fit in the duration since the last
2130
* run and carrying over the left-over duration in @ioc->dfgv_period_rem
2131
* - if ioc period is 75% of DFGV_PERIOD, one out of three consecutive
2132
* reductions is doubled.
2133
*/
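/*
 * Purely illustrative timeline, taking DFGV_PERIOD as 100ms and the ioc
 * period as 75ms: the timer runs that get past the DFGV_PERIOD check
 * (at 150ms, 300ms and 450ms) compute nr_cycles of 1, 2 and 1 with
 * dfgv_period_rem left at 50ms, 0 and 50ms respectively - three
 * halvings every 300ms on average, matching the description above.
 */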
2134
nr_cycles = dur + ioc->dfgv_period_rem;
2135
ioc->dfgv_period_rem = do_div(nr_cycles, DFGV_PERIOD);
2136
2137
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
2138
u64 __maybe_unused old_debt, __maybe_unused old_delay;
2139
2140
if (!iocg->abs_vdebt && !iocg->delay)
2141
continue;
2142
2143
spin_lock(&iocg->waitq.lock);
2144
2145
old_debt = iocg->abs_vdebt;
2146
old_delay = iocg->delay;
2147
2148
nr_cycles_shift = min_t(u64, nr_cycles, BITS_PER_LONG - 1);
2149
if (iocg->abs_vdebt)
2150
iocg->abs_vdebt = iocg->abs_vdebt >> nr_cycles_shift ?: 1;
2151
2152
if (iocg->delay)
2153
iocg->delay = iocg->delay >> nr_cycles_shift ?: 1;
2154
2155
iocg_kick_waitq(iocg, true, now);
2156
2157
TRACE_IOCG_PATH(iocg_forgive_debt, iocg, now, usage_pct,
2158
old_debt, iocg->abs_vdebt,
2159
old_delay, iocg->delay);
2160
2161
spin_unlock(&iocg->waitq.lock);
2162
}
2163
}
/*
 * Check the active iocgs' state to avoid oversleeping and deactivate
 * idle iocgs.
 *
 * Since waiters determine the sleep durations based on the vrate
 * they saw at the time of sleep, if vrate has increased, some
 * waiters could be sleeping for too long. Wake up tardy waiters
 * which should have woken up in the last period and expire idle
 * iocgs.
 */
2175
static int ioc_check_iocgs(struct ioc *ioc, struct ioc_now *now)
2176
{
2177
int nr_debtors = 0;
2178
struct ioc_gq *iocg, *tiocg;
2179
2180
list_for_each_entry_safe(iocg, tiocg, &ioc->active_iocgs, active_list) {
2181
if (!waitqueue_active(&iocg->waitq) && !iocg->abs_vdebt &&
2182
!iocg->delay && !iocg_is_idle(iocg))
2183
continue;
2184
2185
spin_lock(&iocg->waitq.lock);
2186
2187
/* flush wait and indebt stat deltas */
2188
if (iocg->wait_since) {
2189
iocg->stat.wait_us += now->now - iocg->wait_since;
2190
iocg->wait_since = now->now;
2191
}
2192
if (iocg->indebt_since) {
2193
iocg->stat.indebt_us +=
2194
now->now - iocg->indebt_since;
2195
iocg->indebt_since = now->now;
2196
}
2197
if (iocg->indelay_since) {
2198
iocg->stat.indelay_us +=
2199
now->now - iocg->indelay_since;
2200
iocg->indelay_since = now->now;
2201
}
2202
2203
if (waitqueue_active(&iocg->waitq) || iocg->abs_vdebt ||
2204
iocg->delay) {
2205
/* might be oversleeping vtime / hweight changes, kick */
2206
iocg_kick_waitq(iocg, true, now);
2207
if (iocg->abs_vdebt || iocg->delay)
2208
nr_debtors++;
2209
} else if (iocg_is_idle(iocg)) {
2210
/* no waiter and idle, deactivate */
2211
u64 vtime = atomic64_read(&iocg->vtime);
2212
s64 excess;
2213
2214
/*
2215
* @iocg has been inactive for a full duration and will
2216
* have a high budget. Account anything above target as
2217
* error and throw away. On reactivation, it'll start
2218
* with the target budget.
2219
*/
2220
excess = now->vnow - vtime - ioc->margins.target;
2221
if (excess > 0) {
2222
u32 old_hwi;
2223
2224
current_hweight(iocg, NULL, &old_hwi);
2225
ioc->vtime_err -= div64_u64(excess * old_hwi,
2226
WEIGHT_ONE);
2227
}
2228
2229
TRACE_IOCG_PATH(iocg_idle, iocg, now,
2230
atomic64_read(&iocg->active_period),
2231
atomic64_read(&ioc->cur_period), vtime);
2232
__propagate_weights(iocg, 0, 0, false, now);
2233
list_del_init(&iocg->active_list);
2234
}
2235
2236
spin_unlock(&iocg->waitq.lock);
2237
}
2238
2239
commit_weights(ioc);
2240
return nr_debtors;
2241
}
2242
2243
static void ioc_timer_fn(struct timer_list *timer)
2244
{
2245
struct ioc *ioc = container_of(timer, struct ioc, timer);
2246
struct ioc_gq *iocg, *tiocg;
2247
struct ioc_now now;
2248
LIST_HEAD(surpluses);
2249
int nr_debtors, nr_shortages = 0, nr_lagging = 0;
2250
u64 usage_us_sum = 0;
2251
u32 ppm_rthr;
2252
u32 ppm_wthr;
2253
u32 missed_ppm[2], rq_wait_pct;
2254
u64 period_vtime;
2255
int prev_busy_level;
2256
2257
/* how were the latencies during the period? */
2258
ioc_lat_stat(ioc, missed_ppm, &rq_wait_pct);
2259
2260
/* take care of active iocgs */
2261
spin_lock_irq(&ioc->lock);
2262
2263
ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM];
2264
ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM];
2265
ioc_now(ioc, &now);
2266
2267
period_vtime = now.vnow - ioc->period_at_vtime;
2268
if (WARN_ON_ONCE(!period_vtime)) {
2269
spin_unlock_irq(&ioc->lock);
2270
return;
2271
}
2272
2273
nr_debtors = ioc_check_iocgs(ioc, &now);
2274
2275
/*
2276
* Wait and indebt stat are flushed above and the donation calculation
2277
* below needs updated usage stat. Let's bring stat up-to-date.
2278
*/
2279
iocg_flush_stat(&ioc->active_iocgs, &now);
2280
2281
/* calc usage and see whether some weights need to be moved around */
2282
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
2283
u64 vdone, vtime, usage_us;
2284
u32 hw_active, hw_inuse;
2285
2286
/*
2287
* Collect unused and wind vtime closer to vnow to prevent
2288
* iocgs from accumulating a large amount of budget.
2289
*/
2290
vdone = atomic64_read(&iocg->done_vtime);
2291
vtime = atomic64_read(&iocg->vtime);
2292
current_hweight(iocg, &hw_active, &hw_inuse);
2293
2294
/*
2295
* Latency QoS detection doesn't account for IOs which are
2296
* in-flight for longer than a period. Detect them by
2297
* comparing vdone against period start. If lagging behind
2298
* IOs from past periods, don't increase vrate.
2299
*/
2300
if ((ppm_rthr != MILLION || ppm_wthr != MILLION) &&
2301
!atomic_read(&iocg_to_blkg(iocg)->use_delay) &&
2302
time_after64(vtime, vdone) &&
2303
time_after64(vtime, now.vnow -
2304
MAX_LAGGING_PERIODS * period_vtime) &&
2305
time_before64(vdone, now.vnow - period_vtime))
2306
nr_lagging++;
2307
2308
/*
2309
* Determine absolute usage factoring in in-flight IOs to avoid
2310
* high-latency completions appearing as idle.
2311
*/
2312
usage_us = iocg->usage_delta_us;
2313
usage_us_sum += usage_us;
2314
2315
/* see whether there's surplus vtime */
2316
WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
2317
if (hw_inuse < hw_active ||
2318
(!waitqueue_active(&iocg->waitq) &&
2319
time_before64(vtime, now.vnow - ioc->margins.low))) {
2320
u32 hwa, old_hwi, hwm, new_hwi, usage;
2321
u64 usage_dur;
2322
2323
if (vdone != vtime) {
2324
u64 inflight_us = DIV64_U64_ROUND_UP(
2325
cost_to_abs_cost(vtime - vdone, hw_inuse),
2326
ioc->vtime_base_rate);
2327
2328
usage_us = max(usage_us, inflight_us);
2329
}
2330
2331
/* convert to hweight based usage ratio */
2332
if (time_after64(iocg->activated_at, ioc->period_at))
2333
usage_dur = max_t(u64, now.now - iocg->activated_at, 1);
2334
else
2335
usage_dur = max_t(u64, now.now - ioc->period_at, 1);
2336
2337
usage = clamp_t(u32,
2338
DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE,
2339
usage_dur),
2340
1, WEIGHT_ONE);
2341
2342
/*
2343
* Already donating or accumulated enough to start.
2344
* Determine the donation amount.
2345
*/
2346
current_hweight(iocg, &hwa, &old_hwi);
2347
hwm = current_hweight_max(iocg);
2348
new_hwi = hweight_after_donation(iocg, old_hwi, hwm,
2349
usage, &now);
2350
/*
 * Donation calculation assumes hweight_after_donation
 * to be positive, a condition that a donor w/ hwa < 2
 * can't meet. Don't bother with donation if hwa is
 * below 2. It's not going to make a meaningful difference
 * anyway.
 */
2357
if (new_hwi < hwm && hwa >= 2) {
2358
iocg->hweight_donating = hwa;
2359
iocg->hweight_after_donation = new_hwi;
2360
list_add(&iocg->surplus_list, &surpluses);
2361
} else if (!iocg->abs_vdebt) {
2362
/*
 * @iocg doesn't have enough to donate. Reset
 * its inuse to active.
 *
 * Don't reset debtors as their inuses are
 * owned by debt handling. This shouldn't affect
 * donation calculation in any meaningful way
 * as @iocg doesn't have a meaningful amount of
 * share anyway.
 */
2372
TRACE_IOCG_PATH(inuse_shortage, iocg, &now,
2373
iocg->inuse, iocg->active,
2374
iocg->hweight_inuse, new_hwi);
2375
2376
__propagate_weights(iocg, iocg->active,
2377
iocg->active, true, &now);
2378
nr_shortages++;
2379
}
2380
} else {
2381
/* genuinely short on vtime */
2382
nr_shortages++;
2383
}
2384
}
2385
2386
if (!list_empty(&surpluses) && nr_shortages)
2387
transfer_surpluses(&surpluses, &now);
2388
2389
commit_weights(ioc);
2390
2391
/* surplus list should be dissolved after use */
2392
list_for_each_entry_safe(iocg, tiocg, &surpluses, surplus_list)
2393
list_del_init(&iocg->surplus_list);
2394
2395
/*
2396
* If q is getting clogged or we're missing too much, we're issuing
2397
* too much IO and should lower vtime rate. If we're not missing
2398
* and experiencing shortages but not surpluses, we're too stingy
2399
* and should increase vtime rate.
2400
*/
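/*
 * For example, with a hypothetical rpct of 95.00, ppm_rthr is 50000:
 * more than 50000 missed ppm on reads pushes busy_level up, while the
 * "unbusy" branch, taking UNBUSY_THR_PCT as 75 (matching the >25%
 * margin note below), requires missing no more than 37500 ppm on reads
 * before busy_level may be lowered.
 */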
2401
prev_busy_level = ioc->busy_level;
2402
if (rq_wait_pct > RQ_WAIT_BUSY_PCT ||
2403
missed_ppm[READ] > ppm_rthr ||
2404
missed_ppm[WRITE] > ppm_wthr) {
2405
/* clearly missing QoS targets, slow down vrate */
2406
ioc->busy_level = max(ioc->busy_level, 0);
2407
ioc->busy_level++;
2408
} else if (rq_wait_pct <= RQ_WAIT_BUSY_PCT * UNBUSY_THR_PCT / 100 &&
2409
missed_ppm[READ] <= ppm_rthr * UNBUSY_THR_PCT / 100 &&
2410
missed_ppm[WRITE] <= ppm_wthr * UNBUSY_THR_PCT / 100) {
2411
/* QoS targets are being met with >25% margin */
2412
if (nr_shortages) {
2413
/*
2414
* We're throttling while the device has spare
2415
* capacity. If vrate was being slowed down, stop.
2416
*/
2417
ioc->busy_level = min(ioc->busy_level, 0);
2418
2419
/*
2420
* If there are IOs spanning multiple periods, wait
2421
* them out before pushing the device harder.
2422
*/
2423
if (!nr_lagging)
2424
ioc->busy_level--;
2425
} else {
2426
/*
2427
* Nobody is being throttled and the users aren't
2428
* issuing enough IOs to saturate the device. We
2429
* simply don't know how close the device is to
2430
* saturation. Coast.
2431
*/
2432
ioc->busy_level = 0;
2433
}
2434
} else {
2435
/* inside the hysteresis margin, we're good */
2436
ioc->busy_level = 0;
2437
}
2438
2439
ioc->busy_level = clamp(ioc->busy_level, -1000, 1000);
2440
2441
ioc_adjust_base_vrate(ioc, rq_wait_pct, nr_lagging, nr_shortages,
2442
prev_busy_level, missed_ppm);
2443
2444
ioc_refresh_params(ioc, false);
2445
2446
ioc_forgive_debts(ioc, usage_us_sum, nr_debtors, &now);
2447
2448
/*
2449
* This period is done. Move onto the next one. If nothing's
2450
* going on with the device, stop the timer.
2451
*/
2452
atomic64_inc(&ioc->cur_period);
2453
2454
if (ioc->running != IOC_STOP) {
2455
if (!list_empty(&ioc->active_iocgs)) {
2456
ioc_start_period(ioc, &now);
2457
} else {
2458
ioc->busy_level = 0;
2459
ioc->vtime_err = 0;
2460
ioc->running = IOC_IDLE;
2461
}
2462
2463
ioc_refresh_vrate(ioc, &now);
2464
}
2465
2466
spin_unlock_irq(&ioc->lock);
2467
}
2468
2469
static u64 adjust_inuse_and_calc_cost(struct ioc_gq *iocg, u64 vtime,
2470
u64 abs_cost, struct ioc_now *now)
2471
{
2472
struct ioc *ioc = iocg->ioc;
2473
struct ioc_margins *margins = &ioc->margins;
2474
u32 __maybe_unused old_inuse = iocg->inuse, __maybe_unused old_hwi;
2475
u32 hwi, adj_step;
2476
s64 margin;
2477
u64 cost, new_inuse;
2478
unsigned long flags;
2479
2480
current_hweight(iocg, NULL, &hwi);
2481
old_hwi = hwi;
2482
cost = abs_cost_to_cost(abs_cost, hwi);
2483
margin = now->vnow - vtime - cost;
2484
2485
/* debt handling owns inuse for debtors */
2486
if (iocg->abs_vdebt)
2487
return cost;
2488
2489
/*
2490
* We only increase inuse during period and do so if the margin has
2491
* deteriorated since the previous adjustment.
2492
*/
2493
if (margin >= iocg->saved_margin || margin >= margins->low ||
2494
iocg->inuse == iocg->active)
2495
return cost;
2496
2497
spin_lock_irqsave(&ioc->lock, flags);
2498
2499
/* we own inuse only when @iocg is in the normal active state */
2500
if (iocg->abs_vdebt || list_empty(&iocg->active_list)) {
2501
spin_unlock_irqrestore(&ioc->lock, flags);
2502
return cost;
2503
}
2504
2505
/*
 * Bump up inuse till @abs_cost fits in the existing budget.
 * adj_step must be determined after acquiring ioc->lock - we might
 * have raced and lost to another thread for activation and could
 * be reading 0 iocg->active before ioc->lock which would lead to an
 * infinite loop.
 */
2512
new_inuse = iocg->inuse;
2513
adj_step = DIV_ROUND_UP(iocg->active * INUSE_ADJ_STEP_PCT, 100);
2514
do {
2515
new_inuse = new_inuse + adj_step;
2516
propagate_weights(iocg, iocg->active, new_inuse, true, now);
2517
current_hweight(iocg, NULL, &hwi);
2518
cost = abs_cost_to_cost(abs_cost, hwi);
2519
} while (time_after64(vtime + cost, now->vnow) &&
2520
iocg->inuse != iocg->active);
2521
2522
spin_unlock_irqrestore(&ioc->lock, flags);
2523
2524
TRACE_IOCG_PATH(inuse_adjust, iocg, now,
2525
old_inuse, iocg->inuse, old_hwi, hwi);
2526
2527
return cost;
2528
}
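
/*
 * Illustrative walk of the adjustment loop above, taking
 * INUSE_ADJ_STEP_PCT as 25 and made-up weights: with active = 1000 and
 * inuse = 200, adj_step is 250 and successive iterations try inuse =
 * 450, 700, 950 and finally 1000 (inuse never exceeds active),
 * re-deriving hwi and cost each time and stopping as soon as
 * vtime + cost fits under vnow or inuse reaches active.
 */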
2529
2530
static void calc_vtime_cost_builtin(struct bio *bio, struct ioc_gq *iocg,
2531
bool is_merge, u64 *costp)
2532
{
2533
struct ioc *ioc = iocg->ioc;
2534
u64 coef_seqio, coef_randio, coef_page;
2535
u64 pages = max_t(u64, bio_sectors(bio) >> IOC_SECT_TO_PAGE_SHIFT, 1);
2536
u64 seek_pages = 0;
2537
u64 cost = 0;
2538
2539
/* Can't calculate cost for empty bio */
2540
if (!bio->bi_iter.bi_size)
2541
goto out;
2542
2543
switch (bio_op(bio)) {
2544
case REQ_OP_READ:
2545
coef_seqio = ioc->params.lcoefs[LCOEF_RSEQIO];
2546
coef_randio = ioc->params.lcoefs[LCOEF_RRANDIO];
2547
coef_page = ioc->params.lcoefs[LCOEF_RPAGE];
2548
break;
2549
case REQ_OP_WRITE:
2550
coef_seqio = ioc->params.lcoefs[LCOEF_WSEQIO];
2551
coef_randio = ioc->params.lcoefs[LCOEF_WRANDIO];
2552
coef_page = ioc->params.lcoefs[LCOEF_WPAGE];
2553
break;
2554
default:
2555
goto out;
2556
}
2557
2558
if (iocg->cursor) {
2559
seek_pages = abs(bio->bi_iter.bi_sector - iocg->cursor);
2560
seek_pages >>= IOC_SECT_TO_PAGE_SHIFT;
2561
}
2562
2563
if (!is_merge) {
2564
if (seek_pages > LCOEF_RANDIO_PAGES) {
2565
cost += coef_randio;
2566
} else {
2567
cost += coef_seqio;
2568
}
2569
}
2570
cost += pages * coef_page;
2571
out:
2572
*costp = cost;
2573
}
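
/*
 * For illustration, assuming the usual 4KiB cost page: a 64KiB read is
 * 16 pages, so it is charged coef_seqio + 16 * coef_page when it starts
 * within LCOEF_RANDIO_PAGES of the previous IO's end (iocg->cursor) and
 * coef_randio + 16 * coef_page otherwise; a merge (is_merge) only pays
 * the 16 * coef_page size component.
 */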
2574
2575
static u64 calc_vtime_cost(struct bio *bio, struct ioc_gq *iocg, bool is_merge)
2576
{
2577
u64 cost;
2578
2579
calc_vtime_cost_builtin(bio, iocg, is_merge, &cost);
2580
return cost;
2581
}
2582
2583
static void calc_size_vtime_cost_builtin(struct request *rq, struct ioc *ioc,
2584
u64 *costp)
2585
{
2586
unsigned int pages = blk_rq_stats_sectors(rq) >> IOC_SECT_TO_PAGE_SHIFT;
2587
2588
switch (req_op(rq)) {
2589
case REQ_OP_READ:
2590
*costp = pages * ioc->params.lcoefs[LCOEF_RPAGE];
2591
break;
2592
case REQ_OP_WRITE:
2593
*costp = pages * ioc->params.lcoefs[LCOEF_WPAGE];
2594
break;
2595
default:
2596
*costp = 0;
2597
}
2598
}
2599
2600
static u64 calc_size_vtime_cost(struct request *rq, struct ioc *ioc)
2601
{
2602
u64 cost;
2603
2604
calc_size_vtime_cost_builtin(rq, ioc, &cost);
2605
return cost;
2606
}
2607
2608
static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
2609
{
2610
struct blkcg_gq *blkg = bio->bi_blkg;
2611
struct ioc *ioc = rqos_to_ioc(rqos);
2612
struct ioc_gq *iocg = blkg_to_iocg(blkg);
2613
struct ioc_now now;
2614
struct iocg_wait wait;
2615
u64 abs_cost, cost, vtime;
2616
bool use_debt, ioc_locked;
2617
unsigned long flags;
2618
2619
/* bypass IOs if disabled, still initializing, or for root cgroup */
2620
if (!ioc->enabled || !iocg || !iocg->level)
2621
return;
2622
2623
/* calculate the absolute vtime cost */
2624
abs_cost = calc_vtime_cost(bio, iocg, false);
2625
if (!abs_cost)
2626
return;
2627
2628
if (!iocg_activate(iocg, &now))
2629
return;
2630
2631
iocg->cursor = bio_end_sector(bio);
2632
vtime = atomic64_read(&iocg->vtime);
2633
cost = adjust_inuse_and_calc_cost(iocg, vtime, abs_cost, &now);
2634
2635
/*
2636
* If no one's waiting and within budget, issue right away. The
2637
* tests are racy but the races aren't systemic - we only miss once
2638
* in a while which is fine.
2639
*/
2640
if (!waitqueue_active(&iocg->waitq) && !iocg->abs_vdebt &&
2641
time_before_eq64(vtime + cost, now.vnow)) {
2642
iocg_commit_bio(iocg, bio, abs_cost, cost);
2643
return;
2644
}
2645
2646
/*
2647
* We're over budget. This can be handled in two ways. IOs which may
2648
* cause priority inversions are punted to @ioc->aux_iocg and charged as
2649
* debt. Otherwise, the issuer is blocked on @iocg->waitq. Debt handling
2650
* requires @ioc->lock, waitq handling @iocg->waitq.lock. Determine
2651
* whether debt handling is needed and acquire locks accordingly.
2652
*/
2653
use_debt = bio_issue_as_root_blkg(bio) || fatal_signal_pending(current);
2654
ioc_locked = use_debt || READ_ONCE(iocg->abs_vdebt);
2655
retry_lock:
2656
iocg_lock(iocg, ioc_locked, &flags);
2657
2658
/*
 * @iocg must stay activated for debt and waitq handling. Deactivation
 * is synchronized against both ioc->lock and waitq.lock and we won't
 * get deactivated as long as we're waiting or have debt, so we're good
 * if we're activated here. In the unlikely cases that we aren't, just
 * issue the IO.
 */
2665
if (unlikely(list_empty(&iocg->active_list))) {
2666
iocg_unlock(iocg, ioc_locked, &flags);
2667
iocg_commit_bio(iocg, bio, abs_cost, cost);
2668
return;
2669
}
2670
2671
/*
2672
* We're over budget. If @bio has to be issued regardless, remember
2673
* the abs_cost instead of advancing vtime. iocg_kick_waitq() will pay
2674
* off the debt before waking more IOs.
2675
*
2676
* This way, the debt is continuously paid off each period with the
2677
* actual budget available to the cgroup. If we just wound vtime, we
2678
* would incorrectly use the current hw_inuse for the entire amount
2679
* which, for example, can lead to the cgroup staying blocked for a
2680
* long time even with substantially raised hw_inuse.
2681
*
2682
* An iocg with vdebt should stay online so that the timer can keep
2683
* deducting its vdebt and [de]activate use_delay mechanism
2684
* accordingly. We don't want to race against the timer trying to
2685
* clear them and leave @iocg inactive w/ dangling use_delay heavily
2686
* penalizing the cgroup and its descendants.
2687
*/
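/*
 * Example with made-up numbers: a 5ms abs_cost charged while
 * hweight_inuse is 10% would wind vtime by 50ms worth of budget up
 * front. Recorded as abs_vdebt instead, the same 5ms is paid off over
 * the following periods at whatever hweight the iocg actually has then
 * - only 10ms worth if its hweight_inuse has meanwhile grown to 50%.
 */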
2688
if (use_debt) {
2689
iocg_incur_debt(iocg, abs_cost, &now);
2690
if (iocg_kick_delay(iocg, &now))
2691
blkcg_schedule_throttle(rqos->disk,
2692
(bio->bi_opf & REQ_SWAP) == REQ_SWAP);
2693
iocg_unlock(iocg, ioc_locked, &flags);
2694
return;
2695
}
2696
2697
/* guarantee that iocgs w/ waiters have maximum inuse */
2698
if (!iocg->abs_vdebt && iocg->inuse != iocg->active) {
2699
if (!ioc_locked) {
2700
iocg_unlock(iocg, false, &flags);
2701
ioc_locked = true;
2702
goto retry_lock;
2703
}
2704
propagate_weights(iocg, iocg->active, iocg->active, true,
2705
&now);
2706
}
2707
2708
/*
2709
* Append self to the waitq and schedule the wakeup timer if we're
2710
* the first waiter. The timer duration is calculated based on the
2711
* current vrate. vtime and hweight changes can make it too short
2712
* or too long. Each wait entry records the absolute cost it's
2713
* waiting for to allow re-evaluation using a custom wait entry.
2714
*
2715
* If too short, the timer simply reschedules itself. If too long,
2716
* the period timer will notice and trigger wakeups.
2717
*
2718
* All waiters are on iocg->waitq and the wait states are
2719
* synchronized using waitq.lock.
2720
*/
2721
init_wait_func(&wait.wait, iocg_wake_fn);
2722
wait.bio = bio;
2723
wait.abs_cost = abs_cost;
2724
wait.committed = false; /* will be set true by waker */
2725
2726
__add_wait_queue_entry_tail(&iocg->waitq, &wait.wait);
2727
iocg_kick_waitq(iocg, ioc_locked, &now);
2728
2729
iocg_unlock(iocg, ioc_locked, &flags);
2730
2731
while (true) {
2732
set_current_state(TASK_UNINTERRUPTIBLE);
2733
if (wait.committed)
2734
break;
2735
io_schedule();
2736
}
2737
2738
/* waker already committed us, proceed */
2739
finish_wait(&iocg->waitq, &wait.wait);
2740
}
2741
2742
static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq,
2743
struct bio *bio)
2744
{
2745
struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg);
2746
struct ioc *ioc = rqos_to_ioc(rqos);
2747
sector_t bio_end = bio_end_sector(bio);
2748
struct ioc_now now;
2749
u64 vtime, abs_cost, cost;
2750
unsigned long flags;
2751
2752
/* bypass if disabled, still initializing, or for root cgroup */
2753
if (!ioc->enabled || !iocg || !iocg->level)
2754
return;
2755
2756
abs_cost = calc_vtime_cost(bio, iocg, true);
2757
if (!abs_cost)
2758
return;
2759
2760
ioc_now(ioc, &now);
2761
2762
vtime = atomic64_read(&iocg->vtime);
2763
cost = adjust_inuse_and_calc_cost(iocg, vtime, abs_cost, &now);
2764
2765
/* update cursor if backmerging into the request at the cursor */
2766
if (blk_rq_pos(rq) < bio_end &&
2767
blk_rq_pos(rq) + blk_rq_sectors(rq) == iocg->cursor)
2768
iocg->cursor = bio_end;
2769
2770
/*
2771
* Charge if there's enough vtime budget and the existing request has
2772
* cost assigned.
2773
*/
2774
if (rq->bio && rq->bio->bi_iocost_cost &&
2775
time_before_eq64(atomic64_read(&iocg->vtime) + cost, now.vnow)) {
2776
iocg_commit_bio(iocg, bio, abs_cost, cost);
2777
return;
2778
}
2779
2780
/*
2781
* Otherwise, account it as debt if @iocg is online, which it should
2782
* be for the vast majority of cases. See debt handling in
2783
* ioc_rqos_throttle() for details.
2784
*/
2785
spin_lock_irqsave(&ioc->lock, flags);
2786
spin_lock(&iocg->waitq.lock);
2787
2788
if (likely(!list_empty(&iocg->active_list))) {
2789
iocg_incur_debt(iocg, abs_cost, &now);
2790
if (iocg_kick_delay(iocg, &now))
2791
blkcg_schedule_throttle(rqos->disk,
2792
(bio->bi_opf & REQ_SWAP) == REQ_SWAP);
2793
} else {
2794
iocg_commit_bio(iocg, bio, abs_cost, cost);
2795
}
2796
2797
spin_unlock(&iocg->waitq.lock);
2798
spin_unlock_irqrestore(&ioc->lock, flags);
2799
}
2800
2801
static void ioc_rqos_done_bio(struct rq_qos *rqos, struct bio *bio)
2802
{
2803
struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg);
2804
2805
if (iocg && bio->bi_iocost_cost)
2806
atomic64_add(bio->bi_iocost_cost, &iocg->done_vtime);
2807
}
2808
2809
static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
2810
{
2811
struct ioc *ioc = rqos_to_ioc(rqos);
2812
struct ioc_pcpu_stat *ccs;
2813
u64 on_q_ns, rq_wait_ns, size_nsec;
2814
int pidx, rw;
2815
2816
if (!ioc->enabled || !rq->alloc_time_ns || !rq->start_time_ns)
2817
return;
2818
2819
switch (req_op(rq)) {
2820
case REQ_OP_READ:
2821
pidx = QOS_RLAT;
2822
rw = READ;
2823
break;
2824
case REQ_OP_WRITE:
2825
pidx = QOS_WLAT;
2826
rw = WRITE;
2827
break;
2828
default:
2829
return;
2830
}
2831
2832
on_q_ns = blk_time_get_ns() - rq->alloc_time_ns;
2833
rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns;
2834
size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC);
2835
2836
ccs = get_cpu_ptr(ioc->pcpu_stat);
2837
2838
if (on_q_ns <= size_nsec ||
2839
on_q_ns - size_nsec <= ioc->params.qos[pidx] * NSEC_PER_USEC)
2840
local_inc(&ccs->missed[rw].nr_met);
2841
else
2842
local_inc(&ccs->missed[rw].nr_missed);
2843
2844
local64_add(rq_wait_ns, &ccs->rq_wait_ns);
2845
2846
put_cpu_ptr(ccs);
2847
}
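
/*
 * For example (made-up numbers): a write that spent 3ms between
 * allocation and completion with a size cost of 1ms is judged on the
 * remaining 2ms - met if qos[QOS_WLAT] is 2000us or more, missed
 * otherwise. rq_wait_ns feeds the rq_wait_pct used by the busy_level
 * logic in ioc_timer_fn().
 */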
2848
2849
static void ioc_rqos_queue_depth_changed(struct rq_qos *rqos)
2850
{
2851
struct ioc *ioc = rqos_to_ioc(rqos);
2852
2853
spin_lock_irq(&ioc->lock);
2854
ioc_refresh_params(ioc, false);
2855
spin_unlock_irq(&ioc->lock);
2856
}
2857
2858
static void ioc_rqos_exit(struct rq_qos *rqos)
2859
{
2860
struct ioc *ioc = rqos_to_ioc(rqos);
2861
2862
blkcg_deactivate_policy(rqos->disk, &blkcg_policy_iocost);
2863
2864
spin_lock_irq(&ioc->lock);
2865
ioc->running = IOC_STOP;
2866
spin_unlock_irq(&ioc->lock);
2867
2868
timer_shutdown_sync(&ioc->timer);
2869
free_percpu(ioc->pcpu_stat);
2870
kfree(ioc);
2871
}
2872
2873
static const struct rq_qos_ops ioc_rqos_ops = {
2874
.throttle = ioc_rqos_throttle,
2875
.merge = ioc_rqos_merge,
2876
.done_bio = ioc_rqos_done_bio,
2877
.done = ioc_rqos_done,
2878
.queue_depth_changed = ioc_rqos_queue_depth_changed,
2879
.exit = ioc_rqos_exit,
2880
};
2881
2882
static int blk_iocost_init(struct gendisk *disk)
2883
{
2884
struct ioc *ioc;
2885
int i, cpu, ret;
2886
2887
ioc = kzalloc(sizeof(*ioc), GFP_KERNEL);
2888
if (!ioc)
2889
return -ENOMEM;
2890
2891
ioc->pcpu_stat = alloc_percpu(struct ioc_pcpu_stat);
2892
if (!ioc->pcpu_stat) {
2893
kfree(ioc);
2894
return -ENOMEM;
2895
}
2896
2897
for_each_possible_cpu(cpu) {
2898
struct ioc_pcpu_stat *ccs = per_cpu_ptr(ioc->pcpu_stat, cpu);
2899
2900
for (i = 0; i < ARRAY_SIZE(ccs->missed); i++) {
2901
local_set(&ccs->missed[i].nr_met, 0);
2902
local_set(&ccs->missed[i].nr_missed, 0);
2903
}
2904
local64_set(&ccs->rq_wait_ns, 0);
2905
}
2906
2907
spin_lock_init(&ioc->lock);
2908
timer_setup(&ioc->timer, ioc_timer_fn, 0);
2909
INIT_LIST_HEAD(&ioc->active_iocgs);
2910
2911
ioc->running = IOC_IDLE;
2912
ioc->vtime_base_rate = VTIME_PER_USEC;
2913
atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC);
2914
seqcount_spinlock_init(&ioc->period_seqcount, &ioc->lock);
2915
ioc->period_at = ktime_to_us(blk_time_get());
2916
atomic64_set(&ioc->cur_period, 0);
2917
atomic_set(&ioc->hweight_gen, 0);
2918
2919
spin_lock_irq(&ioc->lock);
2920
ioc->autop_idx = AUTOP_INVALID;
2921
ioc_refresh_params_disk(ioc, true, disk);
2922
spin_unlock_irq(&ioc->lock);
2923
2924
/*
 * rqos must be added before activation to allow ioc_pd_init() to
 * lookup the ioc from q. This means that the rqos methods may get
 * called before policy activation completes, so they can't assume that
 * the target bio has an associated iocg and need to test for a NULL
 * iocg.
 */
2930
ret = rq_qos_add(&ioc->rqos, disk, RQ_QOS_COST, &ioc_rqos_ops);
2931
if (ret)
2932
goto err_free_ioc;
2933
2934
ret = blkcg_activate_policy(disk, &blkcg_policy_iocost);
2935
if (ret)
2936
goto err_del_qos;
2937
return 0;
2938
2939
err_del_qos:
2940
rq_qos_del(&ioc->rqos);
2941
err_free_ioc:
2942
free_percpu(ioc->pcpu_stat);
2943
kfree(ioc);
2944
return ret;
2945
}
2946
2947
static struct blkcg_policy_data *ioc_cpd_alloc(gfp_t gfp)
2948
{
2949
struct ioc_cgrp *iocc;
2950
2951
iocc = kzalloc(sizeof(struct ioc_cgrp), gfp);
2952
if (!iocc)
2953
return NULL;
2954
2955
iocc->dfl_weight = CGROUP_WEIGHT_DFL * WEIGHT_ONE;
2956
return &iocc->cpd;
2957
}
2958
2959
static void ioc_cpd_free(struct blkcg_policy_data *cpd)
2960
{
2961
kfree(container_of(cpd, struct ioc_cgrp, cpd));
2962
}
2963
2964
static struct blkg_policy_data *ioc_pd_alloc(struct gendisk *disk,
2965
struct blkcg *blkcg, gfp_t gfp)
2966
{
2967
int levels = blkcg->css.cgroup->level + 1;
2968
struct ioc_gq *iocg;
2969
2970
iocg = kzalloc_node(struct_size(iocg, ancestors, levels), gfp,
2971
disk->node_id);
2972
if (!iocg)
2973
return NULL;
2974
2975
iocg->pcpu_stat = alloc_percpu_gfp(struct iocg_pcpu_stat, gfp);
2976
if (!iocg->pcpu_stat) {
2977
kfree(iocg);
2978
return NULL;
2979
}
2980
2981
return &iocg->pd;
2982
}
2983
2984
static void ioc_pd_init(struct blkg_policy_data *pd)
2985
{
2986
struct ioc_gq *iocg = pd_to_iocg(pd);
2987
struct blkcg_gq *blkg = pd_to_blkg(&iocg->pd);
2988
struct ioc *ioc = q_to_ioc(blkg->q);
2989
struct ioc_now now;
2990
struct blkcg_gq *tblkg;
2991
unsigned long flags;
2992
2993
ioc_now(ioc, &now);
2994
2995
iocg->ioc = ioc;
2996
atomic64_set(&iocg->vtime, now.vnow);
2997
atomic64_set(&iocg->done_vtime, now.vnow);
2998
atomic64_set(&iocg->active_period, atomic64_read(&ioc->cur_period));
2999
INIT_LIST_HEAD(&iocg->active_list);
3000
INIT_LIST_HEAD(&iocg->walk_list);
3001
INIT_LIST_HEAD(&iocg->surplus_list);
3002
iocg->hweight_active = WEIGHT_ONE;
3003
iocg->hweight_inuse = WEIGHT_ONE;
3004
3005
init_waitqueue_head(&iocg->waitq);
3006
hrtimer_setup(&iocg->waitq_timer, iocg_waitq_timer_fn, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
3007
3008
iocg->level = blkg->blkcg->css.cgroup->level;
3009
3010
for (tblkg = blkg; tblkg; tblkg = tblkg->parent) {
3011
struct ioc_gq *tiocg = blkg_to_iocg(tblkg);
3012
iocg->ancestors[tiocg->level] = tiocg;
3013
}
3014
3015
spin_lock_irqsave(&ioc->lock, flags);
3016
weight_updated(iocg, &now);
3017
spin_unlock_irqrestore(&ioc->lock, flags);
3018
}
3019
3020
static void ioc_pd_free(struct blkg_policy_data *pd)
3021
{
3022
struct ioc_gq *iocg = pd_to_iocg(pd);
3023
struct ioc *ioc = iocg->ioc;
3024
unsigned long flags;
3025
3026
if (ioc) {
3027
spin_lock_irqsave(&ioc->lock, flags);
3028
3029
if (!list_empty(&iocg->active_list)) {
3030
struct ioc_now now;
3031
3032
ioc_now(ioc, &now);
3033
propagate_weights(iocg, 0, 0, false, &now);
3034
list_del_init(&iocg->active_list);
3035
}
3036
3037
WARN_ON_ONCE(!list_empty(&iocg->walk_list));
3038
WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
3039
3040
spin_unlock_irqrestore(&ioc->lock, flags);
3041
3042
hrtimer_cancel(&iocg->waitq_timer);
3043
}
3044
free_percpu(iocg->pcpu_stat);
3045
kfree(iocg);
3046
}
3047
3048
static void ioc_pd_stat(struct blkg_policy_data *pd, struct seq_file *s)
3049
{
3050
struct ioc_gq *iocg = pd_to_iocg(pd);
3051
struct ioc *ioc = iocg->ioc;
3052
3053
if (!ioc->enabled)
3054
return;
3055
3056
if (iocg->level == 0) {
3057
unsigned vp10k = DIV64_U64_ROUND_CLOSEST(
3058
ioc->vtime_base_rate * 10000,
3059
VTIME_PER_USEC);
3060
seq_printf(s, " cost.vrate=%u.%02u", vp10k / 100, vp10k % 100);
3061
}
3062
3063
seq_printf(s, " cost.usage=%llu", iocg->last_stat.usage_us);
3064
3065
if (blkcg_debug_stats)
3066
seq_printf(s, " cost.wait=%llu cost.indebt=%llu cost.indelay=%llu",
3067
iocg->last_stat.wait_us,
3068
iocg->last_stat.indebt_us,
3069
iocg->last_stat.indelay_us);
3070
}
3071
3072
static u64 ioc_weight_prfill(struct seq_file *sf, struct blkg_policy_data *pd,
3073
int off)
3074
{
3075
const char *dname = blkg_dev_name(pd->blkg);
3076
struct ioc_gq *iocg = pd_to_iocg(pd);
3077
3078
if (dname && iocg->cfg_weight)
3079
seq_printf(sf, "%s %u\n", dname, iocg->cfg_weight / WEIGHT_ONE);
3080
return 0;
3081
}
3082
3083
3084
static int ioc_weight_show(struct seq_file *sf, void *v)
3085
{
3086
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
3087
struct ioc_cgrp *iocc = blkcg_to_iocc(blkcg);
3088
3089
seq_printf(sf, "default %u\n", iocc->dfl_weight / WEIGHT_ONE);
3090
blkcg_print_blkgs(sf, blkcg, ioc_weight_prfill,
3091
&blkcg_policy_iocost, seq_cft(sf)->private, false);
3092
return 0;
3093
}
3094
3095
static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
3096
size_t nbytes, loff_t off)
3097
{
3098
struct blkcg *blkcg = css_to_blkcg(of_css(of));
3099
struct ioc_cgrp *iocc = blkcg_to_iocc(blkcg);
3100
struct blkg_conf_ctx ctx;
3101
struct ioc_now now;
3102
struct ioc_gq *iocg;
3103
u32 v;
3104
int ret;
3105
3106
if (!strchr(buf, ':')) {
3107
struct blkcg_gq *blkg;
3108
3109
if (!sscanf(buf, "default %u", &v) && !sscanf(buf, "%u", &v))
3110
return -EINVAL;
3111
3112
if (v < CGROUP_WEIGHT_MIN || v > CGROUP_WEIGHT_MAX)
3113
return -EINVAL;
3114
3115
spin_lock_irq(&blkcg->lock);
3116
iocc->dfl_weight = v * WEIGHT_ONE;
3117
hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
3118
struct ioc_gq *iocg = blkg_to_iocg(blkg);
3119
3120
if (iocg) {
3121
spin_lock(&iocg->ioc->lock);
3122
ioc_now(iocg->ioc, &now);
3123
weight_updated(iocg, &now);
3124
spin_unlock(&iocg->ioc->lock);
3125
}
3126
}
3127
spin_unlock_irq(&blkcg->lock);
3128
3129
return nbytes;
3130
}
3131
3132
blkg_conf_init(&ctx, buf);
3133
3134
ret = blkg_conf_prep(blkcg, &blkcg_policy_iocost, &ctx);
3135
if (ret)
3136
goto err;
3137
3138
iocg = blkg_to_iocg(ctx.blkg);
3139
3140
if (!strncmp(ctx.body, "default", 7)) {
3141
v = 0;
3142
} else {
3143
if (!sscanf(ctx.body, "%u", &v))
3144
goto einval;
3145
if (v < CGROUP_WEIGHT_MIN || v > CGROUP_WEIGHT_MAX)
3146
goto einval;
3147
}
3148
3149
spin_lock(&iocg->ioc->lock);
3150
iocg->cfg_weight = v * WEIGHT_ONE;
3151
ioc_now(iocg->ioc, &now);
3152
weight_updated(iocg, &now);
3153
spin_unlock(&iocg->ioc->lock);
3154
3155
blkg_conf_exit(&ctx);
3156
return nbytes;
3157
3158
einval:
3159
ret = -EINVAL;
3160
err:
3161
blkg_conf_exit(&ctx);
3162
return ret;
3163
}
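
/*
 * For reference, the accepted forms mirror the parsing above: writing
 * "default 200" (or just "200") to io.weight updates the cgroup-wide
 * default, "8:16 50" sets this cgroup's weight on the (hypothetical)
 * 8:16 device, and "8:16 default" drops the per-device override so the
 * default applies again. Values must fall within CGROUP_WEIGHT_MIN and
 * CGROUP_WEIGHT_MAX.
 */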
3164
3165
static u64 ioc_qos_prfill(struct seq_file *sf, struct blkg_policy_data *pd,
3166
int off)
3167
{
3168
const char *dname = blkg_dev_name(pd->blkg);
3169
struct ioc *ioc = pd_to_iocg(pd)->ioc;
3170
3171
if (!dname)
3172
return 0;
3173
3174
spin_lock(&ioc->lock);
3175
seq_printf(sf, "%s enable=%d ctrl=%s rpct=%u.%02u rlat=%u wpct=%u.%02u wlat=%u min=%u.%02u max=%u.%02u\n",
3176
dname, ioc->enabled, ioc->user_qos_params ? "user" : "auto",
3177
ioc->params.qos[QOS_RPPM] / 10000,
3178
ioc->params.qos[QOS_RPPM] % 10000 / 100,
3179
ioc->params.qos[QOS_RLAT],
3180
ioc->params.qos[QOS_WPPM] / 10000,
3181
ioc->params.qos[QOS_WPPM] % 10000 / 100,
3182
ioc->params.qos[QOS_WLAT],
3183
ioc->params.qos[QOS_MIN] / 10000,
3184
ioc->params.qos[QOS_MIN] % 10000 / 100,
3185
ioc->params.qos[QOS_MAX] / 10000,
3186
ioc->params.qos[QOS_MAX] % 10000 / 100);
3187
spin_unlock(&ioc->lock);
3188
return 0;
3189
}
3190
3191
static int ioc_qos_show(struct seq_file *sf, void *v)
3192
{
3193
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
3194
3195
blkcg_print_blkgs(sf, blkcg, ioc_qos_prfill,
3196
&blkcg_policy_iocost, seq_cft(sf)->private, false);
3197
return 0;
3198
}
3199
3200
static const match_table_t qos_ctrl_tokens = {
3201
{ QOS_ENABLE, "enable=%u" },
3202
{ QOS_CTRL, "ctrl=%s" },
3203
{ NR_QOS_CTRL_PARAMS, NULL },
3204
};
3205
3206
static const match_table_t qos_tokens = {
3207
{ QOS_RPPM, "rpct=%s" },
3208
{ QOS_RLAT, "rlat=%u" },
3209
{ QOS_WPPM, "wpct=%s" },
3210
{ QOS_WLAT, "wlat=%u" },
3211
{ QOS_MIN, "min=%s" },
3212
{ QOS_MAX, "max=%s" },
3213
{ NR_QOS_PARAMS, NULL },
3214
};
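
/*
 * A line written to io.cost.qos strings these tokens together after the
 * device number, e.g. (hypothetical device, made-up values):
 *
 * 8:16 enable=1 ctrl=user rpct=95.00 rlat=5000 wpct=95.00 wlat=5000 min=50.00 max=150.00
 *
 * rpct/wpct and min/max take two decimal places; rlat/wlat are in
 * microseconds.
 */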
3215
3216
static ssize_t ioc_qos_write(struct kernfs_open_file *of, char *input,
3217
size_t nbytes, loff_t off)
3218
{
3219
struct blkg_conf_ctx ctx;
3220
struct gendisk *disk;
3221
struct ioc *ioc;
3222
u32 qos[NR_QOS_PARAMS];
3223
bool enable, user;
3224
char *body, *p;
3225
unsigned long memflags;
3226
int ret;
3227
3228
blkg_conf_init(&ctx, input);
3229
3230
memflags = blkg_conf_open_bdev_frozen(&ctx);
3231
if (IS_ERR_VALUE(memflags)) {
3232
ret = memflags;
3233
goto err;
3234
}
3235
3236
body = ctx.body;
3237
disk = ctx.bdev->bd_disk;
3238
if (!queue_is_mq(disk->queue)) {
3239
ret = -EOPNOTSUPP;
3240
goto err;
3241
}
3242
3243
ioc = q_to_ioc(disk->queue);
3244
if (!ioc) {
3245
ret = blk_iocost_init(disk);
3246
if (ret)
3247
goto err;
3248
ioc = q_to_ioc(disk->queue);
3249
}
3250
3251
blk_mq_quiesce_queue(disk->queue);
3252
3253
spin_lock_irq(&ioc->lock);
3254
memcpy(qos, ioc->params.qos, sizeof(qos));
3255
enable = ioc->enabled;
3256
user = ioc->user_qos_params;
3257
3258
while ((p = strsep(&body, " \t\n"))) {
3259
substring_t args[MAX_OPT_ARGS];
3260
char buf[32];
3261
int tok;
3262
s64 v;
3263
3264
if (!*p)
3265
continue;
3266
3267
switch (match_token(p, qos_ctrl_tokens, args)) {
3268
case QOS_ENABLE:
3269
if (match_u64(&args[0], &v))
3270
goto einval;
3271
enable = v;
3272
continue;
3273
case QOS_CTRL:
3274
match_strlcpy(buf, &args[0], sizeof(buf));
3275
if (!strcmp(buf, "auto"))
3276
user = false;
3277
else if (!strcmp(buf, "user"))
3278
user = true;
3279
else
3280
goto einval;
3281
continue;
3282
}
3283
3284
tok = match_token(p, qos_tokens, args);
3285
switch (tok) {
3286
case QOS_RPPM:
3287
case QOS_WPPM:
3288
if (match_strlcpy(buf, &args[0], sizeof(buf)) >=
3289
sizeof(buf))
3290
goto einval;
3291
if (cgroup_parse_float(buf, 2, &v))
3292
goto einval;
3293
if (v < 0 || v > 10000)
3294
goto einval;
3295
qos[tok] = v * 100;
3296
break;
3297
case QOS_RLAT:
3298
case QOS_WLAT:
3299
if (match_u64(&args[0], &v))
3300
goto einval;
3301
qos[tok] = v;
3302
break;
3303
case QOS_MIN:
3304
case QOS_MAX:
3305
if (match_strlcpy(buf, &args[0], sizeof(buf)) >=
3306
sizeof(buf))
3307
goto einval;
3308
if (cgroup_parse_float(buf, 2, &v))
3309
goto einval;
3310
if (v < 0)
3311
goto einval;
3312
qos[tok] = clamp_t(s64, v * 100,
3313
VRATE_MIN_PPM, VRATE_MAX_PPM);
3314
break;
3315
default:
3316
goto einval;
3317
}
3318
user = true;
3319
}
3320
3321
if (qos[QOS_MIN] > qos[QOS_MAX])
3322
goto einval;
3323
3324
if (enable && !ioc->enabled) {
3325
blk_stat_enable_accounting(disk->queue);
3326
blk_queue_flag_set(QUEUE_FLAG_RQ_ALLOC_TIME, disk->queue);
3327
ioc->enabled = true;
3328
} else if (!enable && ioc->enabled) {
3329
blk_stat_disable_accounting(disk->queue);
3330
blk_queue_flag_clear(QUEUE_FLAG_RQ_ALLOC_TIME, disk->queue);
3331
ioc->enabled = false;
3332
}
3333
3334
if (user) {
3335
memcpy(ioc->params.qos, qos, sizeof(qos));
3336
ioc->user_qos_params = true;
3337
} else {
3338
ioc->user_qos_params = false;
3339
}
3340
3341
ioc_refresh_params(ioc, true);
3342
spin_unlock_irq(&ioc->lock);
3343
3344
if (enable)
3345
wbt_disable_default(disk);
3346
else
3347
wbt_enable_default(disk);
3348
3349
blk_mq_unquiesce_queue(disk->queue);
3350
3351
blkg_conf_exit_frozen(&ctx, memflags);
3352
return nbytes;
3353
einval:
3354
spin_unlock_irq(&ioc->lock);
3355
blk_mq_unquiesce_queue(disk->queue);
3356
ret = -EINVAL;
3357
err:
3358
blkg_conf_exit_frozen(&ctx, memflags);
3359
return ret;
3360
}
3361
3362
static u64 ioc_cost_model_prfill(struct seq_file *sf,
3363
struct blkg_policy_data *pd, int off)
3364
{
3365
const char *dname = blkg_dev_name(pd->blkg);
3366
struct ioc *ioc = pd_to_iocg(pd)->ioc;
3367
u64 *u = ioc->params.i_lcoefs;
3368
3369
if (!dname)
3370
return 0;
3371
3372
spin_lock(&ioc->lock);
3373
seq_printf(sf, "%s ctrl=%s model=linear "
3374
"rbps=%llu rseqiops=%llu rrandiops=%llu "
3375
"wbps=%llu wseqiops=%llu wrandiops=%llu\n",
3376
dname, ioc->user_cost_model ? "user" : "auto",
3377
u[I_LCOEF_RBPS], u[I_LCOEF_RSEQIOPS], u[I_LCOEF_RRANDIOPS],
3378
u[I_LCOEF_WBPS], u[I_LCOEF_WSEQIOPS], u[I_LCOEF_WRANDIOPS]);
3379
spin_unlock(&ioc->lock);
3380
return 0;
3381
}
3382
3383
static int ioc_cost_model_show(struct seq_file *sf, void *v)
3384
{
3385
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
3386
3387
blkcg_print_blkgs(sf, blkcg, ioc_cost_model_prfill,
3388
&blkcg_policy_iocost, seq_cft(sf)->private, false);
3389
return 0;
3390
}
3391
3392
static const match_table_t cost_ctrl_tokens = {
3393
{ COST_CTRL, "ctrl=%s" },
3394
{ COST_MODEL, "model=%s" },
3395
{ NR_COST_CTRL_PARAMS, NULL },
3396
};
3397
3398
static const match_table_t i_lcoef_tokens = {
3399
{ I_LCOEF_RBPS, "rbps=%u" },
3400
{ I_LCOEF_RSEQIOPS, "rseqiops=%u" },
3401
{ I_LCOEF_RRANDIOPS, "rrandiops=%u" },
3402
{ I_LCOEF_WBPS, "wbps=%u" },
3403
{ I_LCOEF_WSEQIOPS, "wseqiops=%u" },
3404
{ I_LCOEF_WRANDIOPS, "wrandiops=%u" },
3405
{ NR_I_LCOEFS, NULL },
3406
};
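
/*
 * A matching io.cost.model line, with a hypothetical device and made-up
 * coefficients, would look like:
 *
 * 8:16 ctrl=user model=linear rbps=2000000000 rseqiops=100000 rrandiops=80000 wbps=1000000000 wseqiops=60000 wrandiops=40000
 *
 * where rbps/wbps are bytes per second and the iops figures are IOs per
 * second; ioc_refresh_params() turns these into the lcoefs used by
 * calc_vtime_cost_builtin().
 */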
3407
3408
static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
3409
size_t nbytes, loff_t off)
3410
{
3411
struct blkg_conf_ctx ctx;
3412
struct request_queue *q;
3413
unsigned int memflags;
3414
struct ioc *ioc;
3415
u64 u[NR_I_LCOEFS];
3416
bool user;
3417
char *body, *p;
3418
int ret;
3419
3420
blkg_conf_init(&ctx, input);
3421
3422
ret = blkg_conf_open_bdev(&ctx);
3423
if (ret)
3424
goto err;
3425
3426
body = ctx.body;
3427
q = bdev_get_queue(ctx.bdev);
3428
if (!queue_is_mq(q)) {
3429
ret = -EOPNOTSUPP;
3430
goto err;
3431
}
3432
3433
ioc = q_to_ioc(q);
3434
if (!ioc) {
3435
ret = blk_iocost_init(ctx.bdev->bd_disk);
3436
if (ret)
3437
goto err;
3438
ioc = q_to_ioc(q);
3439
}
3440
3441
memflags = blk_mq_freeze_queue(q);
3442
blk_mq_quiesce_queue(q);
3443
3444
spin_lock_irq(&ioc->lock);
3445
memcpy(u, ioc->params.i_lcoefs, sizeof(u));
3446
user = ioc->user_cost_model;
3447
3448
while ((p = strsep(&body, " \t\n"))) {
3449
substring_t args[MAX_OPT_ARGS];
3450
char buf[32];
3451
int tok;
3452
u64 v;
3453
3454
if (!*p)
3455
continue;
3456
3457
switch (match_token(p, cost_ctrl_tokens, args)) {
3458
case COST_CTRL:
3459
match_strlcpy(buf, &args[0], sizeof(buf));
3460
if (!strcmp(buf, "auto"))
3461
user = false;
3462
else if (!strcmp(buf, "user"))
3463
user = true;
3464
else
3465
goto einval;
3466
continue;
3467
case COST_MODEL:
3468
match_strlcpy(buf, &args[0], sizeof(buf));
3469
if (strcmp(buf, "linear"))
3470
goto einval;
3471
continue;
3472
}
3473
3474
tok = match_token(p, i_lcoef_tokens, args);
3475
if (tok == NR_I_LCOEFS)
3476
goto einval;
3477
if (match_u64(&args[0], &v))
3478
goto einval;
3479
u[tok] = v;
3480
user = true;
3481
}
3482
3483
if (user) {
3484
memcpy(ioc->params.i_lcoefs, u, sizeof(u));
3485
ioc->user_cost_model = true;
3486
} else {
3487
ioc->user_cost_model = false;
3488
}
3489
ioc_refresh_params(ioc, true);
3490
spin_unlock_irq(&ioc->lock);
3491
3492
blk_mq_unquiesce_queue(q);
3493
blk_mq_unfreeze_queue(q, memflags);
3494
3495
blkg_conf_exit(&ctx);
3496
return nbytes;
3497
3498
einval:
3499
spin_unlock_irq(&ioc->lock);
3500
3501
blk_mq_unquiesce_queue(q);
3502
blk_mq_unfreeze_queue(q, memflags);
3503
3504
ret = -EINVAL;
3505
err:
3506
blkg_conf_exit(&ctx);
3507
return ret;
3508
}
3509
3510
static struct cftype ioc_files[] = {
3511
{
3512
.name = "weight",
3513
.flags = CFTYPE_NOT_ON_ROOT,
3514
.seq_show = ioc_weight_show,
3515
.write = ioc_weight_write,
3516
},
3517
{
3518
.name = "cost.qos",
3519
.flags = CFTYPE_ONLY_ON_ROOT,
3520
.seq_show = ioc_qos_show,
3521
.write = ioc_qos_write,
3522
},
3523
{
3524
.name = "cost.model",
3525
.flags = CFTYPE_ONLY_ON_ROOT,
3526
.seq_show = ioc_cost_model_show,
3527
.write = ioc_cost_model_write,
3528
},
3529
{}
3530
};
3531
3532
static struct blkcg_policy blkcg_policy_iocost = {
3533
.dfl_cftypes = ioc_files,
3534
.cpd_alloc_fn = ioc_cpd_alloc,
3535
.cpd_free_fn = ioc_cpd_free,
3536
.pd_alloc_fn = ioc_pd_alloc,
3537
.pd_init_fn = ioc_pd_init,
3538
.pd_free_fn = ioc_pd_free,
3539
.pd_stat_fn = ioc_pd_stat,
3540
};
3541
3542
static int __init ioc_init(void)
3543
{
3544
return blkcg_policy_register(&blkcg_policy_iocost);
3545
}
3546
3547
static void __exit ioc_exit(void)
3548
{
3549
blkcg_policy_unregister(&blkcg_policy_iocost);
3550
}
3551
3552
module_init(ioc_init);
3553
module_exit(ioc_exit);