GitHub Repository: torvalds/linux
Path: blob/master/block/blk-iocost.c
1
/* SPDX-License-Identifier: GPL-2.0
2
*
3
* IO cost model based controller.
4
*
5
* Copyright (C) 2019 Tejun Heo <[email protected]>
6
* Copyright (C) 2019 Andy Newell <[email protected]>
7
* Copyright (C) 2019 Facebook
8
*
9
* One challenge of controlling IO resources is the lack of trivially
10
* observable cost metric. This is distinguished from CPU and memory where
11
* wallclock time and the number of bytes can serve as accurate enough
12
* approximations.
13
*
14
* Bandwidth and iops are the most commonly used metrics for IO devices but
15
* depending on the type and specifics of the device, different IO patterns
16
* easily lead to multiple orders of magnitude variations rendering them
17
* useless for the purpose of IO capacity distribution. While on-device
18
* time, with a lot of crutches, could serve as a useful approximation for
19
* non-queued rotational devices, this is no longer viable with modern
20
* devices, even the rotational ones.
21
*
22
* While there is no cost metric we can trivially observe, it isn't a
23
* complete mystery. For example, on a rotational device, seek cost
24
* dominates while a contiguous transfer contributes a smaller amount
25
* proportional to the size. If we can characterize at least the relative
26
* costs of these different types of IOs, it should be possible to
27
* implement a reasonable work-conserving proportional IO resource
28
* distribution.
29
*
30
* 1. IO Cost Model
31
*
32
* IO cost model estimates the cost of an IO given its basic parameters and
33
* history (e.g. the end sector of the last IO). The cost is measured in
34
* device time. If a given IO is estimated to cost 10ms, the device should
35
* be able to process ~100 of those IOs in a second.
36
*
37
* Currently, there's only one builtin cost model - linear. Each IO is
38
* classified as sequential or random and given a base cost accordingly.
39
* On top of that, a size cost proportional to the length of the IO is
40
* added. While simple, this model captures the operational
41
* characteristics of a wide variety of devices well enough. Default
42
* parameters for several different classes of devices are provided and the
43
* parameters can be configured from userspace via
44
* /sys/fs/cgroup/io.cost.model.
45
*
46
* If needed, tools/cgroup/iocost_coef_gen.py can be used to generate
47
* device-specific coefficients.
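*
* For illustration (with made-up coefficients): if a random IO has a
* base cost of 1ms and each 4k page adds 0.01ms, a random 64k IO is
* estimated at 1ms + 16 * 0.01ms = 1.16ms, i.e. the device is expected
* to sustain roughly 860 such IOs per second.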
48
*
49
* 2. Control Strategy
50
*
51
* The device virtual time (vtime) is used as the primary control metric.
52
* The control strategy is composed of the following three parts.
53
*
54
* 2-1. Vtime Distribution
55
*
56
* When a cgroup becomes active in terms of IOs, its hierarchical share is
57
* calculated. Please consider the following hierarchy where the numbers
58
* inside parentheses denote the configured weights.
59
*
60
* root
61
* / \
62
* A (w:100) B (w:300)
63
* / \
64
* A0 (w:100) A1 (w:100)
65
*
66
* If B is idle and only A0 and A1 are actively issuing IOs, as the two are
67
* of equal weight, each gets 50% share. If then B starts issuing IOs, B
68
* gets 300/(100+300) or 75% share, and A0 and A1 equally split the rest,
69
* 12.5% each. The distribution mechanism only cares about these flattened
70
* shares. They're called hweights (hierarchical weights) and always add
71
* up to 1 (WEIGHT_ONE).
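*
* Equivalently, a leaf's hweight is the product of its weight ratios at
* each level - A0's is 100/(100+300) * 100/(100+100) = 25% * 50% = 12.5%.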
72
*
73
* A given cgroup's vtime runs slower in inverse proportion to its hweight.
74
* For example, with 12.5% weight, A0's time runs 8 times slower (100/12.5)
75
* against the device vtime - an IO which takes 10ms on the underlying
76
* device is considered to take 80ms on A0.
77
*
78
* This constitutes the basis of IO capacity distribution. Each cgroup's
79
* vtime is running at a rate determined by its hweight. A cgroup tracks
80
* the vtime consumed by past IOs and can issue a new IO if doing so
81
* wouldn't outrun the current device vtime. Otherwise, the IO is
82
* suspended until the vtime has progressed enough to cover it.
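*
* For example, if A0's vtime cursor is currently 30ms behind the device
* vtime, it can immediately issue IOs whose scaled cost adds up to 30ms;
* anything beyond that has to wait until the device vtime advances far
* enough to cover it.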
83
*
84
* 2-2. Vrate Adjustment
85
*
86
* It's unrealistic to expect the cost model to be perfect. There are too
87
* many devices and even on the same device the overall performance
88
* fluctuates depending on numerous factors such as IO mixture and device
89
* internal garbage collection. The controller needs to adapt dynamically.
90
*
91
* This is achieved by adjusting the overall IO rate according to how busy
92
* the device is. If the device becomes overloaded, we're sending down too
93
* many IOs and should generally slow down. If there are waiting issuers
94
* but the device isn't saturated, we're issuing too few and should
95
* generally speed up.
96
*
97
* To slow down, we lower the vrate - the rate at which the device vtime
98
* passes compared to the wall clock. For example, if the vtime is running
99
* at the vrate of 75%, all cgroups added up would only be able to issue
100
* 750ms worth of IOs per second, and vice-versa for speeding up.
101
*
102
* Device busyness is determined using two criteria - rq wait and
103
* completion latencies.
104
*
105
* When a device gets saturated, the on-device and then the request queues
106
* fill up and a bio which is ready to be issued has to wait for a request
107
* to become available. When this delay becomes noticeable, it's a clear
108
* indication that the device is saturated and we lower the vrate. This
109
* saturation signal is fairly conservative as it only triggers when both
110
* hardware and software queues are filled up, and is used as the default
111
* busy signal.
112
*
113
* As devices can have deep queues and be unfair in how the queued commands
114
* are executed, solely depending on rq wait may not result in satisfactory
115
* control quality. For a better control quality, completion latency QoS
116
* parameters can be configured so that the device is considered saturated
117
* if N'th percentile completion latency rises above the set point.
118
*
119
* The completion latency requirements are a function of both the
120
* underlying device characteristics and the desired IO latency quality of
121
* service. There is an inherent trade-off - the tighter the latency QoS,
122
* the higher the bandwidth lossage. Latency QoS is disabled by default
123
* and can be set through /sys/fs/cgroup/io.cost.qos.
124
*
125
* 2-3. Work Conservation
126
*
127
* Imagine two cgroups A and B with equal weights. A is issuing a small IO
128
* periodically while B is sending out enough parallel IOs to saturate the
129
* device on its own. Let's say A's usage amounts to 100ms worth of IO
130
* cost per second, i.e., 10% of the device capacity. The naive
131
* distribution of half and half would lead to 60% utilization of the
132
* device, a significant reduction in the total amount of work done
133
* compared to free-for-all competition. This is too high a cost to pay
134
* for IO control.
135
*
136
* To conserve the total amount of work done, we keep track of how much
137
* each active cgroup is actually using and yield part of its weight if
138
* there are other cgroups which can make use of it. In the above case,
139
* A's weight will be lowered so that it hovers above the actual usage and
140
* B would be able to use the rest.
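*
* Continuing the example, A's inuse weight would be dropped so that its
* hweight_inuse hovers a bit above 10% while B takes the remaining ~90%,
* bringing total device utilization back close to 100%.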
141
*
142
* As we don't want to penalize a cgroup for donating its weight, the
143
* surplus weight adjustment factors in a margin and has an immediate
144
* snapback mechanism in case the cgroup needs more IO vtime for itself.
145
*
146
* Note that adjusting down surplus weights has the same effects as
147
* accelerating vtime for other cgroups and work conservation can also be
148
* implemented by adjusting vrate dynamically. However, working out who
149
* can donate and how much they should take back requires hweight
150
* propagation anyway, so it is easier to implement and understand this
151
* as a separate mechanism.
152
*
153
* 3. Monitoring
154
*
155
* Instead of debugfs or other clumsy monitoring mechanisms, this
156
* controller uses a drgn based monitoring script -
157
* tools/cgroup/iocost_monitor.py. For details on drgn, please see
158
* https://github.com/osandov/drgn. The output looks like the following.
159
*
160
* sdb RUN per=300ms cur_per=234.218:v203.695 busy= +1 vrate= 62.12%
161
* active weight hweight% inflt% dbt delay usages%
162
* test/a * 50/ 50 33.33/ 33.33 27.65 2 0*041 033:033:033
163
* test/b * 100/ 100 66.67/ 66.67 17.56 0 0*000 066:079:077
164
*
165
* - per : Timer period
166
* - cur_per : Internal wall and device vtime clock
167
* - vrate : Device virtual time rate against wall clock
168
* - weight : Surplus-adjusted and configured weights
169
* - hweight : Surplus-adjusted and configured hierarchical weights
170
* - inflt : The percentage of in-flight IO cost at the end of last period
171
* - delay : Deferred issuer delay induction level and duration
172
* - usages : Usage history
173
*/
174
175
#include <linux/kernel.h>
176
#include <linux/module.h>
177
#include <linux/timer.h>
178
#include <linux/time64.h>
179
#include <linux/parser.h>
180
#include <linux/sched/signal.h>
181
#include <asm/local.h>
182
#include <asm/local64.h>
183
#include "blk-rq-qos.h"
184
#include "blk-stat.h"
185
#include "blk-wbt.h"
186
#include "blk-cgroup.h"
187
188
#ifdef CONFIG_TRACEPOINTS
189
190
/* copied from TRACE_CGROUP_PATH, see cgroup-internal.h */
191
#define TRACE_IOCG_PATH_LEN 1024
192
static DEFINE_SPINLOCK(trace_iocg_path_lock);
193
static char trace_iocg_path[TRACE_IOCG_PATH_LEN];
194
195
#define TRACE_IOCG_PATH(type, iocg, ...) \
196
do { \
197
unsigned long flags; \
198
if (trace_iocost_##type##_enabled()) { \
199
spin_lock_irqsave(&trace_iocg_path_lock, flags); \
200
cgroup_path(iocg_to_blkg(iocg)->blkcg->css.cgroup, \
201
trace_iocg_path, TRACE_IOCG_PATH_LEN); \
202
trace_iocost_##type(iocg, trace_iocg_path, \
203
##__VA_ARGS__); \
204
spin_unlock_irqrestore(&trace_iocg_path_lock, flags); \
205
} \
206
} while (0)
207
208
#else /* CONFIG_TRACEPOINTS */
209
#define TRACE_IOCG_PATH(type, iocg, ...) do { } while (0)
210
#endif /* CONFIG_TRACEPOINTS */
211
212
enum {
213
MILLION = 1000000,
214
215
/* timer period is calculated from latency requirements, bound it */
216
MIN_PERIOD = USEC_PER_MSEC,
217
MAX_PERIOD = USEC_PER_SEC,
218
219
/*
220
* iocg->vtime is targeted at 50% behind the device vtime, which
221
* serves as its IO credit buffer. Surplus weight adjustment is
222
* immediately canceled if the vtime margin runs below 10%.
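*
* E.g. with a 100ms period and vrate at 100%, the target is to stay
* 50ms worth of vtime behind the device, and surplus donation snaps
* back once the cushion shrinks below 10ms (see ioc_refresh_margins()).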
223
*/
224
MARGIN_MIN_PCT = 10,
225
MARGIN_LOW_PCT = 20,
226
MARGIN_TARGET_PCT = 50,
227
228
INUSE_ADJ_STEP_PCT = 25,
229
230
/* Have some play in timer operations */
231
TIMER_SLACK_PCT = 1,
232
233
/* 1/64k is granular enough and can easily be handled w/ u32 */
234
WEIGHT_ONE = 1 << 16,
235
};
236
237
enum {
238
/*
239
* As vtime is used to calculate the cost of each IO, it needs to
240
* be fairly high precision. For example, it should be able to
241
* represent the cost of a single page worth of discard with
242
* sufficient accuracy. At the same time, it should be able to
243
* represent reasonably long durations to be useful and
244
* convenient during operation.
245
*
246
* 1s worth of vtime is 2^37. This gives us both sub-nanosecond
247
* granularity and days of wrap-around time even at extreme vrates.
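*
* As a rough check: 2^37 per second is ~137 vtime units per nanosecond,
* and a 64bit counter wraps only after 2^63 / 2^37 = 2^26 seconds, i.e.
* roughly 776 days at 100% vrate and still about a week at the maximum
* 10000% vrate.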
248
*/
249
VTIME_PER_SEC_SHIFT = 37,
250
VTIME_PER_SEC = 1LLU << VTIME_PER_SEC_SHIFT,
251
VTIME_PER_USEC = VTIME_PER_SEC / USEC_PER_SEC,
252
VTIME_PER_NSEC = VTIME_PER_SEC / NSEC_PER_SEC,
253
254
/* bound vrate adjustments within two orders of magnitude */
255
VRATE_MIN_PPM = 10000, /* 1% */
256
VRATE_MAX_PPM = 100000000, /* 10000% */
257
258
VRATE_MIN = VTIME_PER_USEC * VRATE_MIN_PPM / MILLION,
259
VRATE_CLAMP_ADJ_PCT = 4,
260
261
/* switch iff the conditions are met for longer than this */
262
AUTOP_CYCLE_NSEC = 10LLU * NSEC_PER_SEC,
263
};
264
265
enum {
266
/* if IOs end up waiting for requests, issue less */
267
RQ_WAIT_BUSY_PCT = 5,
268
269
/* unbusy hysteresis */
270
UNBUSY_THR_PCT = 75,
271
272
/*
273
* The effect of delay is indirect and non-linear and a huge amount of
274
* future debt can accumulate abruptly while unthrottled. Linearly scale
275
* up delay as debt is going up and then let it decay exponentially.
276
* This gives us quick ramp ups while delay is accumulating and long
277
* tails which can help reduce the frequency of debt explosions on
278
* unthrottle. The parameters are experimentally determined.
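*
* E.g. with the thresholds below, a debt overage of 500% of the period
* or less adds no delay, 25000% maps to the full 250ms, and a point
* halfway in between (~12750%) comes out to roughly 125ms (see
* iocg_kick_delay()).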
279
*
280
* The delay mechanism provides adequate protection and behavior in many
281
* cases. However, this is far from ideal and falls short on both
282
* fronts. The debtors are often throttled too harshly, costing a
283
* significant amount of fairness and possibly total work, while the
284
* protection against their impacts on the system can be choppy and
285
* unreliable.
286
*
287
* The shortcoming primarily stems from the fact that, unlike for page
288
* cache, the kernel doesn't have a well-defined back-pressure propagation
289
* mechanism and policies for anonymous memory. Fully addressing this
290
* issue will likely require substantial improvements in the area.
291
*/
292
MIN_DELAY_THR_PCT = 500,
293
MAX_DELAY_THR_PCT = 25000,
294
MIN_DELAY = 250,
295
MAX_DELAY = 250 * USEC_PER_MSEC,
296
297
/* halve debts if avg usage over 100ms is under 50% */
298
DFGV_USAGE_PCT = 50,
299
DFGV_PERIOD = 100 * USEC_PER_MSEC,
300
301
/* don't let cmds which take a very long time pin lagging for too long */
302
MAX_LAGGING_PERIODS = 10,
303
304
/*
305
* Count IO size in 4k pages. The 12-bit shift helps keep the
306
* size-proportional components of the cost calculation within a
307
* similar number of digits as the per-IO cost components.
308
*/
309
IOC_PAGE_SHIFT = 12,
310
IOC_PAGE_SIZE = 1 << IOC_PAGE_SHIFT,
311
IOC_SECT_TO_PAGE_SHIFT = IOC_PAGE_SHIFT - SECTOR_SHIFT,
312
313
/* if further apart than 16M, consider randio for the linear model */
314
LCOEF_RANDIO_PAGES = 4096,
315
};
316
317
enum ioc_running {
318
IOC_IDLE,
319
IOC_RUNNING,
320
IOC_STOP,
321
};
322
323
/* io.cost.qos controls including per-dev enable of the whole controller */
324
enum {
325
QOS_ENABLE,
326
QOS_CTRL,
327
NR_QOS_CTRL_PARAMS,
328
};
329
330
/* io.cost.qos params */
331
enum {
332
QOS_RPPM,
333
QOS_RLAT,
334
QOS_WPPM,
335
QOS_WLAT,
336
QOS_MIN,
337
QOS_MAX,
338
NR_QOS_PARAMS,
339
};
340
341
/* io.cost.model controls */
342
enum {
343
COST_CTRL,
344
COST_MODEL,
345
NR_COST_CTRL_PARAMS,
346
};
347
348
/* builtin linear cost model coefficients */
349
enum {
350
I_LCOEF_RBPS,
351
I_LCOEF_RSEQIOPS,
352
I_LCOEF_RRANDIOPS,
353
I_LCOEF_WBPS,
354
I_LCOEF_WSEQIOPS,
355
I_LCOEF_WRANDIOPS,
356
NR_I_LCOEFS,
357
};
358
359
enum {
360
LCOEF_RPAGE,
361
LCOEF_RSEQIO,
362
LCOEF_RRANDIO,
363
LCOEF_WPAGE,
364
LCOEF_WSEQIO,
365
LCOEF_WRANDIO,
366
NR_LCOEFS,
367
};
368
369
enum {
370
AUTOP_INVALID,
371
AUTOP_HDD,
372
AUTOP_SSD_QD1,
373
AUTOP_SSD_DFL,
374
AUTOP_SSD_FAST,
375
};
376
377
struct ioc_params {
378
u32 qos[NR_QOS_PARAMS];
379
u64 i_lcoefs[NR_I_LCOEFS];
380
u64 lcoefs[NR_LCOEFS];
381
u32 too_fast_vrate_pct;
382
u32 too_slow_vrate_pct;
383
};
384
385
struct ioc_margins {
386
s64 min;
387
s64 low;
388
s64 target;
389
};
390
391
struct ioc_missed {
392
local_t nr_met;
393
local_t nr_missed;
394
u32 last_met;
395
u32 last_missed;
396
};
397
398
struct ioc_pcpu_stat {
399
struct ioc_missed missed[2];
400
401
local64_t rq_wait_ns;
402
u64 last_rq_wait_ns;
403
};
404
405
/* per device */
406
struct ioc {
407
struct rq_qos rqos;
408
409
bool enabled;
410
411
struct ioc_params params;
412
struct ioc_margins margins;
413
u32 period_us;
414
u32 timer_slack_ns;
415
u64 vrate_min;
416
u64 vrate_max;
417
418
spinlock_t lock;
419
struct timer_list timer;
420
struct list_head active_iocgs; /* active cgroups */
421
struct ioc_pcpu_stat __percpu *pcpu_stat;
422
423
enum ioc_running running;
424
atomic64_t vtime_rate;
425
u64 vtime_base_rate;
426
s64 vtime_err;
427
428
seqcount_spinlock_t period_seqcount;
429
u64 period_at; /* wallclock starttime */
430
u64 period_at_vtime; /* vtime starttime */
431
432
atomic64_t cur_period; /* inc'd each period */
433
int busy_level; /* saturation history */
434
435
bool weights_updated;
436
atomic_t hweight_gen; /* for lazy hweights */
437
438
/* debt forgiveness */
439
u64 dfgv_period_at;
440
u64 dfgv_period_rem;
441
u64 dfgv_usage_us_sum;
442
443
u64 autop_too_fast_at;
444
u64 autop_too_slow_at;
445
int autop_idx;
446
bool user_qos_params:1;
447
bool user_cost_model:1;
448
};
449
450
struct iocg_pcpu_stat {
451
local64_t abs_vusage;
452
};
453
454
struct iocg_stat {
455
u64 usage_us;
456
u64 wait_us;
457
u64 indebt_us;
458
u64 indelay_us;
459
};
460
461
/* per device-cgroup pair */
462
struct ioc_gq {
463
struct blkg_policy_data pd;
464
struct ioc *ioc;
465
466
/*
467
* An iocg can get its weight from two sources - an explicit
468
* per-device-cgroup configuration or the default weight of the
469
* cgroup. `cfg_weight` is the explicit per-device-cgroup
470
* configuration. `weight` is the effective weight considering both
471
* sources.
472
*
473
* When an idle cgroup becomes active its `active` goes from 0 to
474
* `weight`. `inuse` is the surplus adjusted active weight.
475
* `active` and `inuse` are used to calculate `hweight_active` and
476
* `hweight_inuse`.
477
*
478
* `last_inuse` remembers `inuse` while an iocg is idle to persist
479
* surplus adjustments.
480
*
481
* `inuse` may be adjusted dynamically during the period. `saved_*` are used
482
* to determine and track adjustments.
483
*/
484
u32 cfg_weight;
485
u32 weight;
486
u32 active;
487
u32 inuse;
488
489
u32 last_inuse;
490
s64 saved_margin;
491
492
sector_t cursor; /* to detect randio */
493
494
/*
495
* `vtime` is this iocg's vtime cursor which progresses as IOs are
496
* issued. If lagging behind device vtime, the delta represents
497
* the currently available IO budget. If running ahead, the
498
* overage.
499
*
500
* `vtime_done` is the same but progressed on completion rather
501
* than issue. The delta behind `vtime` represents the cost of
502
* currently in-flight IOs.
503
*/
504
atomic64_t vtime;
505
atomic64_t done_vtime;
506
u64 abs_vdebt;
507
508
/* current delay in effect and when it started */
509
u64 delay;
510
u64 delay_at;
511
512
/*
513
* The period this iocg was last active in. Used for deactivation
514
* and invalidating `vtime`.
515
*/
516
atomic64_t active_period;
517
struct list_head active_list;
518
519
/* see __propagate_weights() and current_hweight() for details */
520
u64 child_active_sum;
521
u64 child_inuse_sum;
522
u64 child_adjusted_sum;
523
int hweight_gen;
524
u32 hweight_active;
525
u32 hweight_inuse;
526
u32 hweight_donating;
527
u32 hweight_after_donation;
528
529
struct list_head walk_list;
530
struct list_head surplus_list;
531
532
struct wait_queue_head waitq;
533
struct hrtimer waitq_timer;
534
535
/* timestamp at the latest activation */
536
u64 activated_at;
537
538
/* statistics */
539
struct iocg_pcpu_stat __percpu *pcpu_stat;
540
struct iocg_stat stat;
541
struct iocg_stat last_stat;
542
u64 last_stat_abs_vusage;
543
u64 usage_delta_us;
544
u64 wait_since;
545
u64 indebt_since;
546
u64 indelay_since;
547
548
/* this iocg's depth in the hierarchy and ancestors including self */
549
int level;
550
struct ioc_gq *ancestors[];
551
};
552
553
/* per cgroup */
554
struct ioc_cgrp {
555
struct blkcg_policy_data cpd;
556
unsigned int dfl_weight;
557
};
558
559
struct ioc_now {
560
u64 now_ns;
561
u64 now;
562
u64 vnow;
563
};
564
565
struct iocg_wait {
566
struct wait_queue_entry wait;
567
struct bio *bio;
568
u64 abs_cost;
569
bool committed;
570
};
571
572
struct iocg_wake_ctx {
573
struct ioc_gq *iocg;
574
u32 hw_inuse;
575
s64 vbudget;
576
};
577
578
static const struct ioc_params autop[] = {
579
[AUTOP_HDD] = {
580
.qos = {
581
[QOS_RLAT] = 250000, /* 250ms */
582
[QOS_WLAT] = 250000,
583
[QOS_MIN] = VRATE_MIN_PPM,
584
[QOS_MAX] = VRATE_MAX_PPM,
585
},
586
.i_lcoefs = {
587
[I_LCOEF_RBPS] = 174019176,
588
[I_LCOEF_RSEQIOPS] = 41708,
589
[I_LCOEF_RRANDIOPS] = 370,
590
[I_LCOEF_WBPS] = 178075866,
591
[I_LCOEF_WSEQIOPS] = 42705,
592
[I_LCOEF_WRANDIOPS] = 378,
593
},
594
},
595
[AUTOP_SSD_QD1] = {
596
.qos = {
597
[QOS_RLAT] = 25000, /* 25ms */
598
[QOS_WLAT] = 25000,
599
[QOS_MIN] = VRATE_MIN_PPM,
600
[QOS_MAX] = VRATE_MAX_PPM,
601
},
602
.i_lcoefs = {
603
[I_LCOEF_RBPS] = 245855193,
604
[I_LCOEF_RSEQIOPS] = 61575,
605
[I_LCOEF_RRANDIOPS] = 6946,
606
[I_LCOEF_WBPS] = 141365009,
607
[I_LCOEF_WSEQIOPS] = 33716,
608
[I_LCOEF_WRANDIOPS] = 26796,
609
},
610
},
611
[AUTOP_SSD_DFL] = {
612
.qos = {
613
[QOS_RLAT] = 25000, /* 25ms */
614
[QOS_WLAT] = 25000,
615
[QOS_MIN] = VRATE_MIN_PPM,
616
[QOS_MAX] = VRATE_MAX_PPM,
617
},
618
.i_lcoefs = {
619
[I_LCOEF_RBPS] = 488636629,
620
[I_LCOEF_RSEQIOPS] = 8932,
621
[I_LCOEF_RRANDIOPS] = 8518,
622
[I_LCOEF_WBPS] = 427891549,
623
[I_LCOEF_WSEQIOPS] = 28755,
624
[I_LCOEF_WRANDIOPS] = 21940,
625
},
626
.too_fast_vrate_pct = 500,
627
},
628
[AUTOP_SSD_FAST] = {
629
.qos = {
630
[QOS_RLAT] = 5000, /* 5ms */
631
[QOS_WLAT] = 5000,
632
[QOS_MIN] = VRATE_MIN_PPM,
633
[QOS_MAX] = VRATE_MAX_PPM,
634
},
635
.i_lcoefs = {
636
[I_LCOEF_RBPS] = 3102524156LLU,
637
[I_LCOEF_RSEQIOPS] = 724816,
638
[I_LCOEF_RRANDIOPS] = 778122,
639
[I_LCOEF_WBPS] = 1742780862LLU,
640
[I_LCOEF_WSEQIOPS] = 425702,
641
[I_LCOEF_WRANDIOPS] = 443193,
642
},
643
.too_slow_vrate_pct = 10,
644
},
645
};
646
647
/*
648
* vrate adjust percentages indexed by ioc->busy_level. We adjust up on
649
* vtime credit shortage and down on device saturation.
650
*/
651
static const u32 vrate_adj_pct[] =
652
{ 0, 0, 0, 0,
653
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
654
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
655
4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 16 };
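/*
 * Reading the table above: the first few consecutive busy (or short)
 * periods leave vrate untouched, levels 4-19 nudge it by 1% per period,
 * 20-35 by 2%, and only a long-sustained signal reaches the 4%, 8% and
 * 16% steps at the tail (see ioc_adjust_base_vrate()).
 */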
656
657
static struct blkcg_policy blkcg_policy_iocost;
658
659
/* accessors and helpers */
660
static struct ioc *rqos_to_ioc(struct rq_qos *rqos)
661
{
662
return container_of(rqos, struct ioc, rqos);
663
}
664
665
static struct ioc *q_to_ioc(struct request_queue *q)
666
{
667
return rqos_to_ioc(rq_qos_id(q, RQ_QOS_COST));
668
}
669
670
static const char __maybe_unused *ioc_name(struct ioc *ioc)
671
{
672
struct gendisk *disk = ioc->rqos.disk;
673
674
if (!disk)
675
return "<unknown>";
676
return disk->disk_name;
677
}
678
679
static struct ioc_gq *pd_to_iocg(struct blkg_policy_data *pd)
680
{
681
return pd ? container_of(pd, struct ioc_gq, pd) : NULL;
682
}
683
684
static struct ioc_gq *blkg_to_iocg(struct blkcg_gq *blkg)
685
{
686
return pd_to_iocg(blkg_to_pd(blkg, &blkcg_policy_iocost));
687
}
688
689
static struct blkcg_gq *iocg_to_blkg(struct ioc_gq *iocg)
690
{
691
return pd_to_blkg(&iocg->pd);
692
}
693
694
static struct ioc_cgrp *blkcg_to_iocc(struct blkcg *blkcg)
695
{
696
return container_of(blkcg_to_cpd(blkcg, &blkcg_policy_iocost),
697
struct ioc_cgrp, cpd);
698
}
699
700
/*
701
* Scale @abs_cost to the inverse of @hw_inuse. The lower the hierarchical
702
* weight, the more expensive each IO. Must round up.
703
*/
704
static u64 abs_cost_to_cost(u64 abs_cost, u32 hw_inuse)
705
{
706
return DIV64_U64_ROUND_UP(abs_cost * WEIGHT_ONE, hw_inuse);
707
}
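/*
 * E.g. an IO whose absolute cost is 1ms worth of vtime, charged against a
 * cgroup whose hw_inuse is 25% of WEIGHT_ONE, is billed 4ms of vtime;
 * cost_to_abs_cost() below performs the reverse conversion.
 */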
708
709
/*
710
* The inverse of abs_cost_to_cost(). Must round up.
711
*/
712
static u64 cost_to_abs_cost(u64 cost, u32 hw_inuse)
713
{
714
return DIV64_U64_ROUND_UP(cost * hw_inuse, WEIGHT_ONE);
715
}
716
717
static void iocg_commit_bio(struct ioc_gq *iocg, struct bio *bio,
718
u64 abs_cost, u64 cost)
719
{
720
struct iocg_pcpu_stat *gcs;
721
722
bio->bi_iocost_cost = cost;
723
atomic64_add(cost, &iocg->vtime);
724
725
gcs = get_cpu_ptr(iocg->pcpu_stat);
726
local64_add(abs_cost, &gcs->abs_vusage);
727
put_cpu_ptr(gcs);
728
}
729
730
static void iocg_lock(struct ioc_gq *iocg, bool lock_ioc, unsigned long *flags)
731
{
732
if (lock_ioc) {
733
spin_lock_irqsave(&iocg->ioc->lock, *flags);
734
spin_lock(&iocg->waitq.lock);
735
} else {
736
spin_lock_irqsave(&iocg->waitq.lock, *flags);
737
}
738
}
739
740
static void iocg_unlock(struct ioc_gq *iocg, bool unlock_ioc, unsigned long *flags)
741
{
742
if (unlock_ioc) {
743
spin_unlock(&iocg->waitq.lock);
744
spin_unlock_irqrestore(&iocg->ioc->lock, *flags);
745
} else {
746
spin_unlock_irqrestore(&iocg->waitq.lock, *flags);
747
}
748
}
749
750
#define CREATE_TRACE_POINTS
751
#include <trace/events/iocost.h>
752
753
static void ioc_refresh_margins(struct ioc *ioc)
754
{
755
struct ioc_margins *margins = &ioc->margins;
756
u32 period_us = ioc->period_us;
757
u64 vrate = ioc->vtime_base_rate;
758
759
margins->min = (period_us * MARGIN_MIN_PCT / 100) * vrate;
760
margins->low = (period_us * MARGIN_LOW_PCT / 100) * vrate;
761
margins->target = (period_us * MARGIN_TARGET_PCT / 100) * vrate;
762
}
763
764
/* latency QoS params changed, update period_us and all the dependent params */
765
static void ioc_refresh_period_us(struct ioc *ioc)
766
{
767
u32 ppm, lat, multi, period_us;
768
769
lockdep_assert_held(&ioc->lock);
770
771
/* pick the higher latency target */
772
if (ioc->params.qos[QOS_RLAT] >= ioc->params.qos[QOS_WLAT]) {
773
ppm = ioc->params.qos[QOS_RPPM];
774
lat = ioc->params.qos[QOS_RLAT];
775
} else {
776
ppm = ioc->params.qos[QOS_WPPM];
777
lat = ioc->params.qos[QOS_WLAT];
778
}
779
780
/*
781
* We want the period to be long enough to contain a healthy number
782
* of IOs while short enough for granular control. Define it as a
783
* multiple of the latency target. Ideally, the multiplier should
784
* be scaled according to the percentile so that it would nominally
785
* contain a certain number of requests. Let's keep it simple and
786
* scale it linearly so that it's 2x >= pct(90) and 10x at pct(50).
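*
* E.g. a 95th percentile target yields multi = (1M - 950000) / 50000 = 1,
* clamped to 2, while a 50th percentile target yields multi = 10; with a
* 25ms latency target those give 50ms and 250ms periods respectively.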
787
*/
788
if (ppm)
789
multi = max_t(u32, (MILLION - ppm) / 50000, 2);
790
else
791
multi = 2;
792
period_us = multi * lat;
793
period_us = clamp_t(u32, period_us, MIN_PERIOD, MAX_PERIOD);
794
795
/* calculate dependent params */
796
ioc->period_us = period_us;
797
ioc->timer_slack_ns = div64_u64(
798
(u64)period_us * NSEC_PER_USEC * TIMER_SLACK_PCT,
799
100);
800
ioc_refresh_margins(ioc);
801
}
802
803
/*
804
* ioc->rqos.disk isn't initialized when this function is called from
805
* the init path.
806
*/
807
static int ioc_autop_idx(struct ioc *ioc, struct gendisk *disk)
808
{
809
int idx = ioc->autop_idx;
810
const struct ioc_params *p = &autop[idx];
811
u32 vrate_pct;
812
u64 now_ns;
813
814
/* rotational? */
815
if (blk_queue_rot(disk->queue))
816
return AUTOP_HDD;
817
818
/* handle SATA SSDs w/ broken NCQ */
819
if (blk_queue_depth(disk->queue) == 1)
820
return AUTOP_SSD_QD1;
821
822
/* use one of the normal ssd sets */
823
if (idx < AUTOP_SSD_DFL)
824
return AUTOP_SSD_DFL;
825
826
/* if user is overriding anything, maintain what was there */
827
if (ioc->user_qos_params || ioc->user_cost_model)
828
return idx;
829
830
/* step up/down based on the vrate */
831
vrate_pct = div64_u64(ioc->vtime_base_rate * 100, VTIME_PER_USEC);
832
now_ns = blk_time_get_ns();
833
834
if (p->too_fast_vrate_pct && p->too_fast_vrate_pct <= vrate_pct) {
835
if (!ioc->autop_too_fast_at)
836
ioc->autop_too_fast_at = now_ns;
837
if (now_ns - ioc->autop_too_fast_at >= AUTOP_CYCLE_NSEC)
838
return idx + 1;
839
} else {
840
ioc->autop_too_fast_at = 0;
841
}
842
843
if (p->too_slow_vrate_pct && p->too_slow_vrate_pct >= vrate_pct) {
844
if (!ioc->autop_too_slow_at)
845
ioc->autop_too_slow_at = now_ns;
846
if (now_ns - ioc->autop_too_slow_at >= AUTOP_CYCLE_NSEC)
847
return idx - 1;
848
} else {
849
ioc->autop_too_slow_at = 0;
850
}
851
852
return idx;
853
}
854
855
/*
856
* Take the following as input
857
*
858
* @bps maximum sequential throughput
859
* @seqiops maximum sequential 4k iops
860
* @randiops maximum random 4k iops
861
*
862
* and calculate the linear model cost coefficients.
863
*
864
* *@page per-page cost 1s / (@bps / 4096)
865
* *@seqio base cost of a seq IO max((1s / @seqiops) - *@page, 0)
866
* *@randio base cost of a rand IO max((1s / @randiops) - *@page, 0)
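*
* For example (illustrative numbers): a device streaming 128MiB/s moves
* 32768 pages/s, so *@page is 1s/32768 worth of vtime; if it also does
* 16384 sequential 4k IOPS, *@seqio comes out to 1s/16384 - 1s/32768 =
* 1s/32768 per IO.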
867
*/
868
static void calc_lcoefs(u64 bps, u64 seqiops, u64 randiops,
869
u64 *page, u64 *seqio, u64 *randio)
870
{
871
u64 v;
872
873
*page = *seqio = *randio = 0;
874
875
if (bps) {
876
u64 bps_pages = DIV_ROUND_UP_ULL(bps, IOC_PAGE_SIZE);
877
878
if (bps_pages)
879
*page = DIV64_U64_ROUND_UP(VTIME_PER_SEC, bps_pages);
880
else
881
*page = 1;
882
}
883
884
if (seqiops) {
885
v = DIV64_U64_ROUND_UP(VTIME_PER_SEC, seqiops);
886
if (v > *page)
887
*seqio = v - *page;
888
}
889
890
if (randiops) {
891
v = DIV64_U64_ROUND_UP(VTIME_PER_SEC, randiops);
892
if (v > *page)
893
*randio = v - *page;
894
}
895
}
896
897
static void ioc_refresh_lcoefs(struct ioc *ioc)
898
{
899
u64 *u = ioc->params.i_lcoefs;
900
u64 *c = ioc->params.lcoefs;
901
902
calc_lcoefs(u[I_LCOEF_RBPS], u[I_LCOEF_RSEQIOPS], u[I_LCOEF_RRANDIOPS],
903
&c[LCOEF_RPAGE], &c[LCOEF_RSEQIO], &c[LCOEF_RRANDIO]);
904
calc_lcoefs(u[I_LCOEF_WBPS], u[I_LCOEF_WSEQIOPS], u[I_LCOEF_WRANDIOPS],
905
&c[LCOEF_WPAGE], &c[LCOEF_WSEQIO], &c[LCOEF_WRANDIO]);
906
}
907
908
/*
909
* struct gendisk is required as an argument because ioc->rqos.disk
910
* is not properly initialized when called from the init path.
911
*/
912
static bool ioc_refresh_params_disk(struct ioc *ioc, bool force,
913
struct gendisk *disk)
914
{
915
const struct ioc_params *p;
916
int idx;
917
918
lockdep_assert_held(&ioc->lock);
919
920
idx = ioc_autop_idx(ioc, disk);
921
p = &autop[idx];
922
923
if (idx == ioc->autop_idx && !force)
924
return false;
925
926
if (idx != ioc->autop_idx) {
927
atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC);
928
ioc->vtime_base_rate = VTIME_PER_USEC;
929
}
930
931
ioc->autop_idx = idx;
932
ioc->autop_too_fast_at = 0;
933
ioc->autop_too_slow_at = 0;
934
935
if (!ioc->user_qos_params)
936
memcpy(ioc->params.qos, p->qos, sizeof(p->qos));
937
if (!ioc->user_cost_model)
938
memcpy(ioc->params.i_lcoefs, p->i_lcoefs, sizeof(p->i_lcoefs));
939
940
ioc_refresh_period_us(ioc);
941
ioc_refresh_lcoefs(ioc);
942
943
ioc->vrate_min = DIV64_U64_ROUND_UP((u64)ioc->params.qos[QOS_MIN] *
944
VTIME_PER_USEC, MILLION);
945
ioc->vrate_max = DIV64_U64_ROUND_UP((u64)ioc->params.qos[QOS_MAX] *
946
VTIME_PER_USEC, MILLION);
947
948
return true;
949
}
950
951
static bool ioc_refresh_params(struct ioc *ioc, bool force)
952
{
953
return ioc_refresh_params_disk(ioc, force, ioc->rqos.disk);
954
}
955
956
/*
957
* When an iocg accumulates too much vtime or gets deactivated, we throw away
958
* some vtime, which lowers the overall device utilization. As the exact amount
959
* which is being thrown away is known, we can compensate by accelerating the
960
* vrate accordingly so that the extra vtime generated in the current period
961
* matches what got lost.
962
*/
963
static void ioc_refresh_vrate(struct ioc *ioc, struct ioc_now *now)
964
{
965
s64 pleft = ioc->period_at + ioc->period_us - now->now;
966
s64 vperiod = ioc->period_us * ioc->vtime_base_rate;
967
s64 vcomp, vcomp_min, vcomp_max;
968
969
lockdep_assert_held(&ioc->lock);
970
971
/* we need some time left in this period */
972
if (pleft <= 0)
973
goto done;
974
975
/*
976
* Calculate how much vrate should be adjusted to offset the error.
977
* Limit the amount of adjustment and deduct the adjusted amount from
978
* the error.
979
*/
980
vcomp = -div64_s64(ioc->vtime_err, pleft);
981
vcomp_min = -(ioc->vtime_base_rate >> 1);
982
vcomp_max = ioc->vtime_base_rate;
983
vcomp = clamp(vcomp, vcomp_min, vcomp_max);
984
985
ioc->vtime_err += vcomp * pleft;
986
987
atomic64_set(&ioc->vtime_rate, ioc->vtime_base_rate + vcomp);
988
done:
989
/* bound how much error can accumulate */
990
ioc->vtime_err = clamp(ioc->vtime_err, -vperiod, vperiod);
991
}
992
993
static void ioc_adjust_base_vrate(struct ioc *ioc, u32 rq_wait_pct,
994
int nr_lagging, int nr_shortages,
995
int prev_busy_level, u32 *missed_ppm)
996
{
997
u64 vrate = ioc->vtime_base_rate;
998
u64 vrate_min = ioc->vrate_min, vrate_max = ioc->vrate_max;
999
1000
if (!ioc->busy_level || (ioc->busy_level < 0 && nr_lagging)) {
1001
if (ioc->busy_level != prev_busy_level || nr_lagging)
1002
trace_iocost_ioc_vrate_adj(ioc, vrate,
1003
missed_ppm, rq_wait_pct,
1004
nr_lagging, nr_shortages);
1005
1006
return;
1007
}
1008
1009
/*
1010
* If vrate is out of bounds, apply clamp gradually as the
1011
* bounds can change abruptly. Otherwise, apply busy_level
1012
* based adjustment.
1013
*/
1014
if (vrate < vrate_min) {
1015
vrate = div64_u64(vrate * (100 + VRATE_CLAMP_ADJ_PCT), 100);
1016
vrate = min(vrate, vrate_min);
1017
} else if (vrate > vrate_max) {
1018
vrate = div64_u64(vrate * (100 - VRATE_CLAMP_ADJ_PCT), 100);
1019
vrate = max(vrate, vrate_max);
1020
} else {
1021
int idx = min_t(int, abs(ioc->busy_level),
1022
ARRAY_SIZE(vrate_adj_pct) - 1);
1023
u32 adj_pct = vrate_adj_pct[idx];
1024
1025
if (ioc->busy_level > 0)
1026
adj_pct = 100 - adj_pct;
1027
else
1028
adj_pct = 100 + adj_pct;
1029
1030
vrate = clamp(DIV64_U64_ROUND_UP(vrate * adj_pct, 100),
1031
vrate_min, vrate_max);
1032
}
1033
1034
trace_iocost_ioc_vrate_adj(ioc, vrate, missed_ppm, rq_wait_pct,
1035
nr_lagging, nr_shortages);
1036
1037
ioc->vtime_base_rate = vrate;
1038
ioc_refresh_margins(ioc);
1039
}
1040
1041
/* take a snapshot of the current [v]time and vrate */
1042
static void ioc_now(struct ioc *ioc, struct ioc_now *now)
1043
{
1044
unsigned seq;
1045
u64 vrate;
1046
1047
now->now_ns = blk_time_get_ns();
1048
now->now = ktime_to_us(now->now_ns);
1049
vrate = atomic64_read(&ioc->vtime_rate);
1050
1051
/*
1052
* The current vtime is
1053
*
1054
* vtime at period start + (wallclock time since the start) * vrate
1055
*
1056
* As a consistent snapshot of `period_at_vtime` and `period_at` is
1057
* needed, they're seqcount protected.
1058
*/
1059
do {
1060
seq = read_seqcount_begin(&ioc->period_seqcount);
1061
now->vnow = ioc->period_at_vtime +
1062
(now->now - ioc->period_at) * vrate;
1063
} while (read_seqcount_retry(&ioc->period_seqcount, seq));
1064
}
1065
1066
static void ioc_start_period(struct ioc *ioc, struct ioc_now *now)
1067
{
1068
WARN_ON_ONCE(ioc->running != IOC_RUNNING);
1069
1070
write_seqcount_begin(&ioc->period_seqcount);
1071
ioc->period_at = now->now;
1072
ioc->period_at_vtime = now->vnow;
1073
write_seqcount_end(&ioc->period_seqcount);
1074
1075
ioc->timer.expires = jiffies + usecs_to_jiffies(ioc->period_us);
1076
add_timer(&ioc->timer);
1077
}
1078
1079
/*
1080
* Update @iocg's `active` and `inuse` to @active and @inuse, update level
1081
* weight sums and propagate upwards accordingly. If @save, the current margin
1082
* is saved to be used as reference for later inuse in-period adjustments.
1083
*/
1084
static void __propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse,
1085
bool save, struct ioc_now *now)
1086
{
1087
struct ioc *ioc = iocg->ioc;
1088
int lvl;
1089
1090
lockdep_assert_held(&ioc->lock);
1091
1092
/*
1093
* For an active leaf node, its inuse shouldn't be zero or exceed
1094
* @active. An active internal node's inuse is solely determined by the
1095
* inuse to active ratio of its children regardless of @inuse.
1096
*/
1097
if (list_empty(&iocg->active_list) && iocg->child_active_sum) {
1098
inuse = DIV64_U64_ROUND_UP(active * iocg->child_inuse_sum,
1099
iocg->child_active_sum);
1100
} else {
1101
/*
1102
* It may be tempting to turn this into a clamp expression with
1103
* a lower limit of 1 but active may be 0, which cannot be used
1104
* as an upper limit in that situation. This expression allows
1105
* active to clamp inuse unless it is 0, in which case inuse
1106
* becomes 1.
1107
*/
1108
inuse = min(inuse, active) ?: 1;
1109
}
1110
1111
iocg->last_inuse = iocg->inuse;
1112
if (save)
1113
iocg->saved_margin = now->vnow - atomic64_read(&iocg->vtime);
1114
1115
if (active == iocg->active && inuse == iocg->inuse)
1116
return;
1117
1118
for (lvl = iocg->level - 1; lvl >= 0; lvl--) {
1119
struct ioc_gq *parent = iocg->ancestors[lvl];
1120
struct ioc_gq *child = iocg->ancestors[lvl + 1];
1121
u32 parent_active = 0, parent_inuse = 0;
1122
1123
/* update the level sums */
1124
parent->child_active_sum += (s32)(active - child->active);
1125
parent->child_inuse_sum += (s32)(inuse - child->inuse);
1126
/* apply the updates */
1127
child->active = active;
1128
child->inuse = inuse;
1129
1130
/*
1131
* The delta between the inuse and active sums indicates how
1132
* much weight is being given away. The parent's inuse and
1133
* active should reflect the ratio.
1134
*/
1135
if (parent->child_active_sum) {
1136
parent_active = parent->weight;
1137
parent_inuse = DIV64_U64_ROUND_UP(
1138
parent_active * parent->child_inuse_sum,
1139
parent->child_active_sum);
1140
}
1141
1142
/* do we need to keep walking up? */
1143
if (parent_active == parent->active &&
1144
parent_inuse == parent->inuse)
1145
break;
1146
1147
active = parent_active;
1148
inuse = parent_inuse;
1149
}
1150
1151
ioc->weights_updated = true;
1152
}
1153
1154
static void commit_weights(struct ioc *ioc)
1155
{
1156
lockdep_assert_held(&ioc->lock);
1157
1158
if (ioc->weights_updated) {
1159
/* paired with rmb in current_hweight(), see there */
1160
smp_wmb();
1161
atomic_inc(&ioc->hweight_gen);
1162
ioc->weights_updated = false;
1163
}
1164
}
1165
1166
static void propagate_weights(struct ioc_gq *iocg, u32 active, u32 inuse,
1167
bool save, struct ioc_now *now)
1168
{
1169
__propagate_weights(iocg, active, inuse, save, now);
1170
commit_weights(iocg->ioc);
1171
}
1172
1173
static void current_hweight(struct ioc_gq *iocg, u32 *hw_activep, u32 *hw_inusep)
1174
{
1175
struct ioc *ioc = iocg->ioc;
1176
int lvl;
1177
u32 hwa, hwi;
1178
int ioc_gen;
1179
1180
/* hot path - if uptodate, use cached */
1181
ioc_gen = atomic_read(&ioc->hweight_gen);
1182
if (ioc_gen == iocg->hweight_gen)
1183
goto out;
1184
1185
/*
1186
* Paired with wmb in commit_weights(). If we saw the updated
1187
* hweight_gen, all the weight updates from __propagate_weights() are
1188
* visible too.
1189
*
1190
* We can race with weight updates during calculation and get it
1191
* wrong. However, hweight_gen would have changed and a future
1192
* reader will recalculate and we're guaranteed to discard the
1193
* wrong result soon.
1194
*/
1195
smp_rmb();
1196
1197
hwa = hwi = WEIGHT_ONE;
1198
for (lvl = 0; lvl <= iocg->level - 1; lvl++) {
1199
struct ioc_gq *parent = iocg->ancestors[lvl];
1200
struct ioc_gq *child = iocg->ancestors[lvl + 1];
1201
u64 active_sum = READ_ONCE(parent->child_active_sum);
1202
u64 inuse_sum = READ_ONCE(parent->child_inuse_sum);
1203
u32 active = READ_ONCE(child->active);
1204
u32 inuse = READ_ONCE(child->inuse);
1205
1206
/* we can race with deactivations and either may read as zero */
1207
if (!active_sum || !inuse_sum)
1208
continue;
1209
1210
active_sum = max_t(u64, active, active_sum);
1211
hwa = div64_u64((u64)hwa * active, active_sum);
1212
1213
inuse_sum = max_t(u64, inuse, inuse_sum);
1214
hwi = div64_u64((u64)hwi * inuse, inuse_sum);
1215
}
1216
1217
iocg->hweight_active = max_t(u32, hwa, 1);
1218
iocg->hweight_inuse = max_t(u32, hwi, 1);
1219
iocg->hweight_gen = ioc_gen;
1220
out:
1221
if (hw_activep)
1222
*hw_activep = iocg->hweight_active;
1223
if (hw_inusep)
1224
*hw_inusep = iocg->hweight_inuse;
1225
}
1226
1227
/*
1228
* Calculate the hweight_inuse @iocg would get with max @inuse assuming all the
1229
* other weights stay unchanged.
1230
*/
1231
static u32 current_hweight_max(struct ioc_gq *iocg)
1232
{
1233
u32 hwm = WEIGHT_ONE;
1234
u32 inuse = iocg->active;
1235
u64 child_inuse_sum;
1236
int lvl;
1237
1238
lockdep_assert_held(&iocg->ioc->lock);
1239
1240
for (lvl = iocg->level - 1; lvl >= 0; lvl--) {
1241
struct ioc_gq *parent = iocg->ancestors[lvl];
1242
struct ioc_gq *child = iocg->ancestors[lvl + 1];
1243
1244
child_inuse_sum = parent->child_inuse_sum + inuse - child->inuse;
1245
hwm = div64_u64((u64)hwm * inuse, child_inuse_sum);
1246
inuse = DIV64_U64_ROUND_UP(parent->active * child_inuse_sum,
1247
parent->child_active_sum);
1248
}
1249
1250
return max_t(u32, hwm, 1);
1251
}
1252
1253
static void weight_updated(struct ioc_gq *iocg, struct ioc_now *now)
1254
{
1255
struct ioc *ioc = iocg->ioc;
1256
struct blkcg_gq *blkg = iocg_to_blkg(iocg);
1257
struct ioc_cgrp *iocc = blkcg_to_iocc(blkg->blkcg);
1258
u32 weight;
1259
1260
lockdep_assert_held(&ioc->lock);
1261
1262
weight = iocg->cfg_weight ?: iocc->dfl_weight;
1263
if (weight != iocg->weight && iocg->active)
1264
propagate_weights(iocg, weight, iocg->inuse, true, now);
1265
iocg->weight = weight;
1266
}
1267
1268
static bool iocg_activate(struct ioc_gq *iocg, struct ioc_now *now)
1269
{
1270
struct ioc *ioc = iocg->ioc;
1271
u64 __maybe_unused last_period, cur_period;
1272
u64 vtime, vtarget;
1273
int i;
1274
1275
/*
1276
* If we seem to be already active, just update the stamp to tell the
1278
* timer that we're still active. We don't mind occasional races.
1278
*/
1279
if (!list_empty(&iocg->active_list)) {
1280
ioc_now(ioc, now);
1281
cur_period = atomic64_read(&ioc->cur_period);
1282
if (atomic64_read(&iocg->active_period) != cur_period)
1283
atomic64_set(&iocg->active_period, cur_period);
1284
return true;
1285
}
1286
1287
/* racy check on internal node IOs, treat as root level IOs */
1288
if (iocg->child_active_sum)
1289
return false;
1290
1291
spin_lock_irq(&ioc->lock);
1292
1293
ioc_now(ioc, now);
1294
1295
/* update period */
1296
cur_period = atomic64_read(&ioc->cur_period);
1297
last_period = atomic64_read(&iocg->active_period);
1298
atomic64_set(&iocg->active_period, cur_period);
1299
1300
/* already activated or breaking leaf-only constraint? */
1301
if (!list_empty(&iocg->active_list))
1302
goto succeed_unlock;
1303
for (i = iocg->level - 1; i > 0; i--)
1304
if (!list_empty(&iocg->ancestors[i]->active_list))
1305
goto fail_unlock;
1306
1307
if (iocg->child_active_sum)
1308
goto fail_unlock;
1309
1310
/*
1311
* Always start with the target budget. On deactivation, we throw away
1312
* anything above it.
1313
*/
1314
vtarget = now->vnow - ioc->margins.target;
1315
vtime = atomic64_read(&iocg->vtime);
1316
1317
atomic64_add(vtarget - vtime, &iocg->vtime);
1318
atomic64_add(vtarget - vtime, &iocg->done_vtime);
1319
vtime = vtarget;
1320
1321
/*
1322
* Activate, propagate weight and start period timer if not
1323
* running. Reset hweight_gen to avoid accidental match from
1324
* wrapping.
1325
*/
1326
iocg->hweight_gen = atomic_read(&ioc->hweight_gen) - 1;
1327
list_add(&iocg->active_list, &ioc->active_iocgs);
1328
1329
propagate_weights(iocg, iocg->weight,
1330
iocg->last_inuse ?: iocg->weight, true, now);
1331
1332
TRACE_IOCG_PATH(iocg_activate, iocg, now,
1333
last_period, cur_period, vtime);
1334
1335
iocg->activated_at = now->now;
1336
1337
if (ioc->running == IOC_IDLE) {
1338
ioc->running = IOC_RUNNING;
1339
ioc->dfgv_period_at = now->now;
1340
ioc->dfgv_period_rem = 0;
1341
ioc_start_period(ioc, now);
1342
}
1343
1344
succeed_unlock:
1345
spin_unlock_irq(&ioc->lock);
1346
return true;
1347
1348
fail_unlock:
1349
spin_unlock_irq(&ioc->lock);
1350
return false;
1351
}
1352
1353
static bool iocg_kick_delay(struct ioc_gq *iocg, struct ioc_now *now)
1354
{
1355
struct ioc *ioc = iocg->ioc;
1356
struct blkcg_gq *blkg = iocg_to_blkg(iocg);
1357
u64 tdelta, delay, new_delay, shift;
1358
s64 vover, vover_pct;
1359
u32 hwa;
1360
1361
lockdep_assert_held(&iocg->waitq.lock);
1362
1363
/*
1364
* If the delay is set by another CPU, we may be in the past. No need to
1365
* change anything if so. This avoids decay calculation underflow.
1366
*/
1367
if (time_before64(now->now, iocg->delay_at))
1368
return false;
1369
1370
/* calculate the current delay in effect - 1/2 every second */
1371
tdelta = now->now - iocg->delay_at;
1372
shift = div64_u64(tdelta, USEC_PER_SEC);
1373
if (iocg->delay && shift < BITS_PER_LONG)
1374
delay = iocg->delay >> shift;
1375
else
1376
delay = 0;
1377
1378
/* calculate the new delay from the debt amount */
1379
current_hweight(iocg, &hwa, NULL);
1380
vover = atomic64_read(&iocg->vtime) +
1381
abs_cost_to_cost(iocg->abs_vdebt, hwa) - now->vnow;
1382
vover_pct = div64_s64(100 * vover,
1383
ioc->period_us * ioc->vtime_base_rate);
1384
1385
if (vover_pct <= MIN_DELAY_THR_PCT)
1386
new_delay = 0;
1387
else if (vover_pct >= MAX_DELAY_THR_PCT)
1388
new_delay = MAX_DELAY;
1389
else
1390
new_delay = MIN_DELAY +
1391
div_u64((MAX_DELAY - MIN_DELAY) *
1392
(vover_pct - MIN_DELAY_THR_PCT),
1393
MAX_DELAY_THR_PCT - MIN_DELAY_THR_PCT);
1394
1395
/* pick the higher one and apply */
1396
if (new_delay > delay) {
1397
iocg->delay = new_delay;
1398
iocg->delay_at = now->now;
1399
delay = new_delay;
1400
}
1401
1402
if (delay >= MIN_DELAY) {
1403
if (!iocg->indelay_since)
1404
iocg->indelay_since = now->now;
1405
blkcg_set_delay(blkg, delay * NSEC_PER_USEC);
1406
return true;
1407
} else {
1408
if (iocg->indelay_since) {
1409
iocg->stat.indelay_us += now->now - iocg->indelay_since;
1410
iocg->indelay_since = 0;
1411
}
1412
iocg->delay = 0;
1413
blkcg_clear_delay(blkg);
1414
return false;
1415
}
1416
}
1417
1418
static void iocg_incur_debt(struct ioc_gq *iocg, u64 abs_cost,
1419
struct ioc_now *now)
1420
{
1421
struct iocg_pcpu_stat *gcs;
1422
1423
lockdep_assert_held(&iocg->ioc->lock);
1424
lockdep_assert_held(&iocg->waitq.lock);
1425
WARN_ON_ONCE(list_empty(&iocg->active_list));
1426
1427
/*
1428
* Once in debt, debt handling owns inuse. @iocg stays at the minimum
1429
* inuse donating all of its share to others until its debt is paid off.
1430
*/
1431
if (!iocg->abs_vdebt && abs_cost) {
1432
iocg->indebt_since = now->now;
1433
propagate_weights(iocg, iocg->active, 0, false, now);
1434
}
1435
1436
iocg->abs_vdebt += abs_cost;
1437
1438
gcs = get_cpu_ptr(iocg->pcpu_stat);
1439
local64_add(abs_cost, &gcs->abs_vusage);
1440
put_cpu_ptr(gcs);
1441
}
1442
1443
static void iocg_pay_debt(struct ioc_gq *iocg, u64 abs_vpay,
1444
struct ioc_now *now)
1445
{
1446
lockdep_assert_held(&iocg->ioc->lock);
1447
lockdep_assert_held(&iocg->waitq.lock);
1448
1449
/*
1450
* make sure that nobody messed with @iocg. Check iocg->pd.online
1451
* to avoid warn when removing blkcg or disk.
1452
*/
1453
WARN_ON_ONCE(list_empty(&iocg->active_list) && iocg->pd.online);
1454
WARN_ON_ONCE(iocg->inuse > 1);
1455
1456
iocg->abs_vdebt -= min(abs_vpay, iocg->abs_vdebt);
1457
1458
/* if debt is paid in full, restore inuse */
1459
if (!iocg->abs_vdebt) {
1460
iocg->stat.indebt_us += now->now - iocg->indebt_since;
1461
iocg->indebt_since = 0;
1462
1463
propagate_weights(iocg, iocg->active, iocg->last_inuse,
1464
false, now);
1465
}
1466
}
1467
1468
static int iocg_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode,
1469
int flags, void *key)
1470
{
1471
struct iocg_wait *wait = container_of(wq_entry, struct iocg_wait, wait);
1472
struct iocg_wake_ctx *ctx = key;
1473
u64 cost = abs_cost_to_cost(wait->abs_cost, ctx->hw_inuse);
1474
1475
ctx->vbudget -= cost;
1476
1477
if (ctx->vbudget < 0)
1478
return -1;
1479
1480
iocg_commit_bio(ctx->iocg, wait->bio, wait->abs_cost, cost);
1481
wait->committed = true;
1482
1483
/*
1484
* autoremove_wake_function() removes the wait entry only when it
1485
* actually changed the task state. We want the wait always removed.
1486
* Remove explicitly and use default_wake_function(). Note that the
1487
* order of operations is important as finish_wait() tests whether
1488
* @wq_entry is removed without grabbing the lock.
1489
*/
1490
default_wake_function(wq_entry, mode, flags, key);
1491
list_del_init_careful(&wq_entry->entry);
1492
return 0;
1493
}
1494
1495
/*
1496
* Calculate the accumulated budget, pay debt if @pay_debt and wake up waiters
1497
* accordingly. When @pay_debt is %true, the caller must be holding ioc->lock in
1498
* addition to iocg->waitq.lock.
1499
*/
1500
static void iocg_kick_waitq(struct ioc_gq *iocg, bool pay_debt,
1501
struct ioc_now *now)
1502
{
1503
struct ioc *ioc = iocg->ioc;
1504
struct iocg_wake_ctx ctx = { .iocg = iocg };
1505
u64 vshortage, expires, oexpires;
1506
s64 vbudget;
1507
u32 hwa;
1508
1509
lockdep_assert_held(&iocg->waitq.lock);
1510
1511
current_hweight(iocg, &hwa, NULL);
1512
vbudget = now->vnow - atomic64_read(&iocg->vtime);
1513
1514
/* pay off debt */
1515
if (pay_debt && iocg->abs_vdebt && vbudget > 0) {
1516
u64 abs_vbudget = cost_to_abs_cost(vbudget, hwa);
1517
u64 abs_vpay = min_t(u64, abs_vbudget, iocg->abs_vdebt);
1518
u64 vpay = abs_cost_to_cost(abs_vpay, hwa);
1519
1520
lockdep_assert_held(&ioc->lock);
1521
1522
atomic64_add(vpay, &iocg->vtime);
1523
atomic64_add(vpay, &iocg->done_vtime);
1524
iocg_pay_debt(iocg, abs_vpay, now);
1525
vbudget -= vpay;
1526
}
1527
1528
if (iocg->abs_vdebt || iocg->delay)
1529
iocg_kick_delay(iocg, now);
1530
1531
/*
1532
* Debt can still be outstanding if we haven't paid all yet or the
1533
* caller raced and called without @pay_debt. Shouldn't wake up waiters
1534
* under debt. Make sure @vbudget reflects the outstanding amount and is
1535
* not positive.
1536
*/
1537
if (iocg->abs_vdebt) {
1538
s64 vdebt = abs_cost_to_cost(iocg->abs_vdebt, hwa);
1539
vbudget = min_t(s64, 0, vbudget - vdebt);
1540
}
1541
1542
/*
1543
* Wake up the ones which are due and see how much vtime we'll need for
1544
* the next one. As paying off debt restores hw_inuse, it must be read
1545
* after the above debt payment.
1546
*/
1547
ctx.vbudget = vbudget;
1548
current_hweight(iocg, NULL, &ctx.hw_inuse);
1549
1550
__wake_up_locked_key(&iocg->waitq, TASK_NORMAL, &ctx);
1551
1552
if (!waitqueue_active(&iocg->waitq)) {
1553
if (iocg->wait_since) {
1554
iocg->stat.wait_us += now->now - iocg->wait_since;
1555
iocg->wait_since = 0;
1556
}
1557
return;
1558
}
1559
1560
if (!iocg->wait_since)
1561
iocg->wait_since = now->now;
1562
1563
if (WARN_ON_ONCE(ctx.vbudget >= 0))
1564
return;
1565
1566
/* determine next wakeup, add a timer margin to guarantee chunking */
1567
vshortage = -ctx.vbudget;
1568
expires = now->now_ns +
1569
DIV64_U64_ROUND_UP(vshortage, ioc->vtime_base_rate) *
1570
NSEC_PER_USEC;
1571
expires += ioc->timer_slack_ns;
1572
1573
/* if already active and close enough, don't bother */
1574
oexpires = ktime_to_ns(hrtimer_get_softexpires(&iocg->waitq_timer));
1575
if (hrtimer_is_queued(&iocg->waitq_timer) &&
1576
abs(oexpires - expires) <= ioc->timer_slack_ns)
1577
return;
1578
1579
hrtimer_start_range_ns(&iocg->waitq_timer, ns_to_ktime(expires),
1580
ioc->timer_slack_ns, HRTIMER_MODE_ABS);
1581
}
1582
1583
static enum hrtimer_restart iocg_waitq_timer_fn(struct hrtimer *timer)
1584
{
1585
struct ioc_gq *iocg = container_of(timer, struct ioc_gq, waitq_timer);
1586
bool pay_debt = READ_ONCE(iocg->abs_vdebt);
1587
struct ioc_now now;
1588
unsigned long flags;
1589
1590
ioc_now(iocg->ioc, &now);
1591
1592
iocg_lock(iocg, pay_debt, &flags);
1593
iocg_kick_waitq(iocg, pay_debt, &now);
1594
iocg_unlock(iocg, pay_debt, &flags);
1595
1596
return HRTIMER_NORESTART;
1597
}
1598
1599
static void ioc_lat_stat(struct ioc *ioc, u32 *missed_ppm_ar, u32 *rq_wait_pct_p)
1600
{
1601
u32 nr_met[2] = { };
1602
u32 nr_missed[2] = { };
1603
u64 rq_wait_ns = 0;
1604
int cpu, rw;
1605
1606
for_each_online_cpu(cpu) {
1607
struct ioc_pcpu_stat *stat = per_cpu_ptr(ioc->pcpu_stat, cpu);
1608
u64 this_rq_wait_ns;
1609
1610
for (rw = READ; rw <= WRITE; rw++) {
1611
u32 this_met = local_read(&stat->missed[rw].nr_met);
1612
u32 this_missed = local_read(&stat->missed[rw].nr_missed);
1613
1614
nr_met[rw] += this_met - stat->missed[rw].last_met;
1615
nr_missed[rw] += this_missed - stat->missed[rw].last_missed;
1616
stat->missed[rw].last_met = this_met;
1617
stat->missed[rw].last_missed = this_missed;
1618
}
1619
1620
this_rq_wait_ns = local64_read(&stat->rq_wait_ns);
1621
rq_wait_ns += this_rq_wait_ns - stat->last_rq_wait_ns;
1622
stat->last_rq_wait_ns = this_rq_wait_ns;
1623
}
1624
1625
for (rw = READ; rw <= WRITE; rw++) {
1626
if (nr_met[rw] + nr_missed[rw])
1627
missed_ppm_ar[rw] =
1628
DIV64_U64_ROUND_UP((u64)nr_missed[rw] * MILLION,
1629
nr_met[rw] + nr_missed[rw]);
1630
else
1631
missed_ppm_ar[rw] = 0;
1632
}
1633
1634
*rq_wait_pct_p = div64_u64(rq_wait_ns * 100,
1635
ioc->period_us * NSEC_PER_USEC);
1636
}
1637
1638
/* was iocg idle this period? */
1639
static bool iocg_is_idle(struct ioc_gq *iocg)
1640
{
1641
struct ioc *ioc = iocg->ioc;
1642
1643
/* did something get issued this period? */
1644
if (atomic64_read(&iocg->active_period) ==
1645
atomic64_read(&ioc->cur_period))
1646
return false;
1647
1648
/* is something in flight? */
1649
if (atomic64_read(&iocg->done_vtime) != atomic64_read(&iocg->vtime))
1650
return false;
1651
1652
return true;
1653
}
1654
1655
/*
1656
* Call this function on the target leaf @iocg's to build pre-order traversal
1657
* list of all the ancestors in @inner_walk. The inner nodes are linked through
1658
* ->walk_list and the caller is responsible for dissolving the list after use.
1659
*/
1660
static void iocg_build_inner_walk(struct ioc_gq *iocg,
1661
struct list_head *inner_walk)
1662
{
1663
int lvl;
1664
1665
WARN_ON_ONCE(!list_empty(&iocg->walk_list));
1666
1667
/* find the first ancestor which hasn't been visited yet */
1668
for (lvl = iocg->level - 1; lvl >= 0; lvl--) {
1669
if (!list_empty(&iocg->ancestors[lvl]->walk_list))
1670
break;
1671
}
1672
1673
/* walk down and visit the inner nodes to get pre-order traversal */
1674
while (++lvl <= iocg->level - 1) {
1675
struct ioc_gq *inner = iocg->ancestors[lvl];
1676
1677
/* record traversal order */
1678
list_add_tail(&inner->walk_list, inner_walk);
1679
}
1680
}
1681
1682
/* propagate the deltas to the parent */
1683
static void iocg_flush_stat_upward(struct ioc_gq *iocg)
1684
{
1685
if (iocg->level > 0) {
1686
struct iocg_stat *parent_stat =
1687
&iocg->ancestors[iocg->level - 1]->stat;
1688
1689
parent_stat->usage_us +=
1690
iocg->stat.usage_us - iocg->last_stat.usage_us;
1691
parent_stat->wait_us +=
1692
iocg->stat.wait_us - iocg->last_stat.wait_us;
1693
parent_stat->indebt_us +=
1694
iocg->stat.indebt_us - iocg->last_stat.indebt_us;
1695
parent_stat->indelay_us +=
1696
iocg->stat.indelay_us - iocg->last_stat.indelay_us;
1697
}
1698
1699
iocg->last_stat = iocg->stat;
1700
}
1701
1702
/* collect per-cpu counters and propagate the deltas to the parent */
1703
static void iocg_flush_stat_leaf(struct ioc_gq *iocg, struct ioc_now *now)
1704
{
1705
struct ioc *ioc = iocg->ioc;
1706
u64 abs_vusage = 0;
1707
u64 vusage_delta;
1708
int cpu;
1709
1710
lockdep_assert_held(&iocg->ioc->lock);
1711
1712
/* collect per-cpu counters */
1713
for_each_possible_cpu(cpu) {
1714
abs_vusage += local64_read(
1715
per_cpu_ptr(&iocg->pcpu_stat->abs_vusage, cpu));
1716
}
1717
vusage_delta = abs_vusage - iocg->last_stat_abs_vusage;
1718
iocg->last_stat_abs_vusage = abs_vusage;
1719
1720
iocg->usage_delta_us = div64_u64(vusage_delta, ioc->vtime_base_rate);
1721
iocg->stat.usage_us += iocg->usage_delta_us;
1722
1723
iocg_flush_stat_upward(iocg);
1724
}
1725
1726
/* get stat counters ready for reading on all active iocgs */
1727
static void iocg_flush_stat(struct list_head *target_iocgs, struct ioc_now *now)
1728
{
1729
LIST_HEAD(inner_walk);
1730
struct ioc_gq *iocg, *tiocg;
1731
1732
/* flush leaves and build inner node walk list */
1733
list_for_each_entry(iocg, target_iocgs, active_list) {
1734
iocg_flush_stat_leaf(iocg, now);
1735
iocg_build_inner_walk(iocg, &inner_walk);
1736
}
1737
1738
/* keep flushing upwards by walking the inner list backwards */
1739
list_for_each_entry_safe_reverse(iocg, tiocg, &inner_walk, walk_list) {
1740
iocg_flush_stat_upward(iocg);
1741
list_del_init(&iocg->walk_list);
1742
}
1743
}
1744
1745
/*
1746
* Determine what @iocg's hweight_inuse should be after donating unused
1747
* capacity. @hwm is the upper bound and used to signal no donation. This
1748
* function also throws away @iocg's excess budget.
1749
*/
1750
static u32 hweight_after_donation(struct ioc_gq *iocg, u32 old_hwi, u32 hwm,
1751
u32 usage, struct ioc_now *now)
1752
{
1753
struct ioc *ioc = iocg->ioc;
1754
u64 vtime = atomic64_read(&iocg->vtime);
1755
s64 excess, delta, target, new_hwi;
1756
1757
/* debt handling owns inuse for debtors */
1758
if (iocg->abs_vdebt)
1759
return 1;
1760
1761
/* see whether minimum margin requirement is met */
1762
if (waitqueue_active(&iocg->waitq) ||
1763
time_after64(vtime, now->vnow - ioc->margins.min))
1764
return hwm;
1765
1766
/* throw away excess above target */
1767
excess = now->vnow - vtime - ioc->margins.target;
1768
if (excess > 0) {
1769
atomic64_add(excess, &iocg->vtime);
1770
atomic64_add(excess, &iocg->done_vtime);
1771
vtime += excess;
1772
ioc->vtime_err -= div64_u64(excess * old_hwi, WEIGHT_ONE);
1773
}
1774
1775
/*
 * Let delta be the distance between the iocg's and the device's vtimes
 * as a fraction of the period duration. Assuming that the iocg will
 * consume the usage determined above, we want to determine new_hwi so
 * that delta equals MARGIN_TARGET at the end of the next period.
 *
 * We need to execute usage worth of IOs while spending the sum of the
 * new budget (1 - MARGIN_TARGET) and the leftover from the last period
 * (delta):
 *
 *   usage = (1 - MARGIN_TARGET + delta) * new_hwi
 *
 * Therefore, the new_hwi is:
 *
 *   new_hwi = usage / (1 - MARGIN_TARGET + delta)
 */
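/*
 * Worked example (illustrative numbers only): if MARGIN_TARGET is 50%,
 * delta is 25% and the measured usage is 30% of the device, then
 * new_hwi = 0.30 / (1 - 0.50 + 0.25) = 0.40, i.e. the iocg should keep
 * 40% hweight_inuse and donate the rest of its current share.
 */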
1791
delta = div64_s64(WEIGHT_ONE * (now->vnow - vtime),
1792
now->vnow - ioc->period_at_vtime);
1793
target = WEIGHT_ONE * MARGIN_TARGET_PCT / 100;
1794
new_hwi = div64_s64(WEIGHT_ONE * usage, WEIGHT_ONE - target + delta);
1795
1796
return clamp_t(s64, new_hwi, 1, hwm);
1797
}
1798
1799
/*
 * For work-conservation, an iocg which isn't using all of its share should
 * donate the leftover to other iocgs. There are two ways to achieve this - 1.
 * bumping up vrate accordingly 2. lowering the donating iocg's inuse weight.
 *
 * #1 is mathematically simpler but has the drawback of requiring synchronous
 * global hweight_inuse updates when idle iocgs get activated or inuse weights
 * change due to donation snapbacks as it has the possibility of grossly
 * overshooting what's allowed by the model and vrate.
 *
 * #2 is inherently safe with local operations. The donating iocg can easily
 * snap back to higher weights when needed without worrying about impacts on
 * other nodes as the impacts will be inherently correct. This also makes idle
 * iocg activations safe. The only effect activations have is decreasing
 * hweight_inuse of others, the right solution to which is for those iocgs to
 * snap back to higher weights.
 *
 * So, we go with #2. The challenge is calculating how each donating iocg's
 * inuse should be adjusted to achieve the target donation amounts. This is done
 * using Andy's method described in the following pdf.
 *
 *   https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
 *
 * Given the weights and target after-donation hweight_inuse values, Andy's
 * method determines how the proportional distribution should look at each
 * sibling level to maintain the relative relationship between all non-donating
 * pairs. To roughly summarize, it divides the tree into donating and
 * non-donating parts, calculates the global donation rate which is used to
 * determine the target hweight_inuse for each node, and then derives per-level
 * proportions.
 *
 * The following pdf shows that global distribution calculated this way can be
 * achieved by scaling inuse weights of donating leaves and propagating the
 * adjustments upwards proportionally.
 *
 *   https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
 *
 * Combining the above two, we can determine how each leaf iocg's inuse should
 * be adjusted to achieve the target donation.
 *
 *   https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN
 *
 * The inline comments use symbols from the last pdf.
 *
 *   b is the sum of the absolute budgets in the subtree. 1 for the root node.
 *   f is the sum of the absolute budgets of non-donating nodes in the subtree.
 *   t is the sum of the absolute budgets of donating nodes in the subtree.
 *   w is the weight of the node. w = w_f + w_t
 *   w_f is the non-donating portion of w. w_f = w * f / b
 *   w_t is the donating portion of w. w_t = w * t / b
 *   s is the sum of all sibling weights. s = Sum(w) for siblings
 *   s_f and s_t are the non-donating and donating portions of s.
 *
 * Subscript p denotes the parent's counterpart and ' the adjusted value - e.g.
 * w_pt is the donating portion of the parent's weight and w'_pt the same value
 * after adjustments. Subscript r denotes the root node's values.
 */
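/*
 * Quick illustration of the symbols above (hypothetical numbers): if a node
 * has weight w = 200 and its subtree holds b = 0.5 of the absolute budget,
 * of which t = 0.2 is being donated (so f = 0.3), then
 * w_t = w * t / b = 200 * 0.2 / 0.5 = 80 and w_f = w * f / b = 120.
 */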
1856
static void transfer_surpluses(struct list_head *surpluses, struct ioc_now *now)
1857
{
1858
LIST_HEAD(over_hwa);
1859
LIST_HEAD(inner_walk);
1860
struct ioc_gq *iocg, *tiocg, *root_iocg;
1861
u32 after_sum, over_sum, over_target, gamma;
1862
1863
/*
 * It's pretty unlikely but possible for the total sum of the
 * hweight_after_donation values to be higher than WEIGHT_ONE, which will
 * confuse the following calculations. If such a condition is detected,
 * scale down everyone over its full share equally to keep the sum below
 * WEIGHT_ONE.
 */
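/*
 * For example (hypothetical numbers, normalized to WEIGHT_ONE): if after_sum
 * comes out to 1.10 and the members above their full share sum to
 * over_sum = 0.60, then over_delta is roughly 0.10, over_target is 0.50 and
 * each over-share member is scaled by 0.50 / 0.60, bringing the total back
 * under WEIGHT_ONE.
 */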
1870
after_sum = 0;
1871
over_sum = 0;
1872
list_for_each_entry(iocg, surpluses, surplus_list) {
1873
u32 hwa;
1874
1875
current_hweight(iocg, &hwa, NULL);
1876
after_sum += iocg->hweight_after_donation;
1877
1878
if (iocg->hweight_after_donation > hwa) {
1879
over_sum += iocg->hweight_after_donation;
1880
list_add(&iocg->walk_list, &over_hwa);
1881
}
1882
}
1883
1884
if (after_sum >= WEIGHT_ONE) {
1885
/*
1886
* The delta should be deducted from the over_sum, calculate
1887
* target over_sum value.
1888
*/
1889
u32 over_delta = after_sum - (WEIGHT_ONE - 1);
1890
WARN_ON_ONCE(over_sum <= over_delta);
1891
over_target = over_sum - over_delta;
1892
} else {
1893
over_target = 0;
1894
}
1895
1896
list_for_each_entry_safe(iocg, tiocg, &over_hwa, walk_list) {
1897
if (over_target)
1898
iocg->hweight_after_donation =
1899
div_u64((u64)iocg->hweight_after_donation *
1900
over_target, over_sum);
1901
list_del_init(&iocg->walk_list);
1902
}
1903
1904
/*
1905
* Build pre-order inner node walk list and prepare for donation
1906
* adjustment calculations.
1907
*/
1908
list_for_each_entry(iocg, surpluses, surplus_list) {
1909
iocg_build_inner_walk(iocg, &inner_walk);
1910
}
1911
1912
root_iocg = list_first_entry(&inner_walk, struct ioc_gq, walk_list);
1913
WARN_ON_ONCE(root_iocg->level > 0);
1914
1915
list_for_each_entry(iocg, &inner_walk, walk_list) {
1916
iocg->child_adjusted_sum = 0;
1917
iocg->hweight_donating = 0;
1918
iocg->hweight_after_donation = 0;
1919
}
1920
1921
/*
1922
* Propagate the donating budget (b_t) and after donation budget (b'_t)
1923
* up the hierarchy.
1924
*/
1925
list_for_each_entry(iocg, surpluses, surplus_list) {
1926
struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
1927
1928
parent->hweight_donating += iocg->hweight_donating;
1929
parent->hweight_after_donation += iocg->hweight_after_donation;
1930
}
1931
1932
list_for_each_entry_reverse(iocg, &inner_walk, walk_list) {
1933
if (iocg->level > 0) {
1934
struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
1935
1936
parent->hweight_donating += iocg->hweight_donating;
1937
parent->hweight_after_donation += iocg->hweight_after_donation;
1938
}
1939
}
1940
1941
/*
1942
* Calculate inner hwa's (b) and make sure the donation values are
1943
* within the accepted ranges as we're doing low res calculations with
1944
* roundups.
1945
*/
1946
list_for_each_entry(iocg, &inner_walk, walk_list) {
1947
if (iocg->level) {
1948
struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
1949
1950
iocg->hweight_active = DIV64_U64_ROUND_UP(
1951
(u64)parent->hweight_active * iocg->active,
1952
parent->child_active_sum);
1953
1954
}
1955
1956
iocg->hweight_donating = min(iocg->hweight_donating,
1957
iocg->hweight_active);
1958
iocg->hweight_after_donation = min(iocg->hweight_after_donation,
1959
iocg->hweight_donating - 1);
1960
if (WARN_ON_ONCE(iocg->hweight_active <= 1 ||
1961
iocg->hweight_donating <= 1 ||
1962
iocg->hweight_after_donation == 0)) {
1963
pr_warn("iocg: invalid donation weights in ");
1964
pr_cont_cgroup_path(iocg_to_blkg(iocg)->blkcg->css.cgroup);
1965
pr_cont(": active=%u donating=%u after=%u\n",
1966
iocg->hweight_active, iocg->hweight_donating,
1967
iocg->hweight_after_donation);
1968
}
1969
}
1970
1971
/*
 * Calculate the global donation rate (gamma) - the rate to adjust
 * non-donating budgets by.
 *
 * No need to use 64bit multiplication here as the first operand is
 * guaranteed to be smaller than WEIGHT_ONE (1<<16).
 *
 * We know that there are beneficiary nodes and the sum of the donating
 * hweights can't be whole; however, due to the round-ups during hweight
 * calculations, root_iocg->hweight_donating might still end up equal to
 * or greater than whole. Limit the range when calculating the divider.
 *
 * gamma = (1 - t_r') / (1 - t_r)
 */
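/*
 * Illustrative numbers: if the donating leaves account for t_r = 40% of the
 * device and are targeted to keep only t_r' = 10% after donation, then
 * gamma = (1 - 0.10) / (1 - 0.40) = 1.5, i.e. every non-donating budget is
 * scaled up by 50% to absorb the donated capacity.
 */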
1985
gamma = DIV_ROUND_UP(
1986
(WEIGHT_ONE - root_iocg->hweight_after_donation) * WEIGHT_ONE,
1987
WEIGHT_ONE - min_t(u32, root_iocg->hweight_donating, WEIGHT_ONE - 1));
1988
1989
/*
1990
* Calculate adjusted hwi, child_adjusted_sum and inuse for the inner
1991
* nodes.
1992
*/
1993
list_for_each_entry(iocg, &inner_walk, walk_list) {
1994
struct ioc_gq *parent;
1995
u32 inuse, wpt, wptp;
1996
u64 st, sf;
1997
1998
if (iocg->level == 0) {
1999
/* adjusted weight sum for 1st level: s' = s * b_pf / b'_pf */
2000
iocg->child_adjusted_sum = DIV64_U64_ROUND_UP(
2001
iocg->child_active_sum * (WEIGHT_ONE - iocg->hweight_donating),
2002
WEIGHT_ONE - iocg->hweight_after_donation);
2003
continue;
2004
}
2005
2006
parent = iocg->ancestors[iocg->level - 1];
2007
2008
/* b' = gamma * b_f + b_t' */
2009
iocg->hweight_inuse = DIV64_U64_ROUND_UP(
2010
(u64)gamma * (iocg->hweight_active - iocg->hweight_donating),
2011
WEIGHT_ONE) + iocg->hweight_after_donation;
2012
2013
/* w' = s' * b' / b'_p */
2014
inuse = DIV64_U64_ROUND_UP(
2015
(u64)parent->child_adjusted_sum * iocg->hweight_inuse,
2016
parent->hweight_inuse);
2017
2018
/* adjusted weight sum for children: s' = s_f + s_t * w'_pt / w_pt */
2019
st = DIV64_U64_ROUND_UP(
2020
iocg->child_active_sum * iocg->hweight_donating,
2021
iocg->hweight_active);
2022
sf = iocg->child_active_sum - st;
2023
wpt = DIV64_U64_ROUND_UP(
2024
(u64)iocg->active * iocg->hweight_donating,
2025
iocg->hweight_active);
2026
wptp = DIV64_U64_ROUND_UP(
2027
(u64)inuse * iocg->hweight_after_donation,
2028
iocg->hweight_inuse);
2029
2030
iocg->child_adjusted_sum = sf + DIV64_U64_ROUND_UP(st * wptp, wpt);
2031
}
2032
2033
/*
2034
* All inner nodes now have ->hweight_inuse and ->child_adjusted_sum and
2035
* we can finally determine leaf adjustments.
2036
*/
2037
list_for_each_entry(iocg, surpluses, surplus_list) {
2038
struct ioc_gq *parent = iocg->ancestors[iocg->level - 1];
2039
u32 inuse;
2040
2041
/*
 * In-debt iocgs participated in the donation calculation with
 * the minimum target hweight_inuse. Configuring inuse
 * accordingly would work fine but debt handling expects
 * @iocg->inuse to stay at the minimum and we don't want to
 * interfere.
 */
2048
if (iocg->abs_vdebt) {
2049
WARN_ON_ONCE(iocg->inuse > 1);
2050
continue;
2051
}
2052
2053
/* w' = s' * b' / b'_p, note that b' == b'_t for donating leaves */
2054
inuse = DIV64_U64_ROUND_UP(
2055
parent->child_adjusted_sum * iocg->hweight_after_donation,
2056
parent->hweight_inuse);
2057
2058
TRACE_IOCG_PATH(inuse_transfer, iocg, now,
2059
iocg->inuse, inuse,
2060
iocg->hweight_inuse,
2061
iocg->hweight_after_donation);
2062
2063
__propagate_weights(iocg, iocg->active, inuse, true, now);
2064
}
2065
2066
/* walk list should be dissolved after use */
2067
list_for_each_entry_safe(iocg, tiocg, &inner_walk, walk_list)
2068
list_del_init(&iocg->walk_list);
2069
}
2070
2071
/*
 * A low weight iocg can amass a large amount of debt, for example, when
 * anonymous memory gets reclaimed aggressively. If the system has a lot of
 * memory paired with a slow IO device, the debt can span multiple seconds or
 * more. If there are no other subsequent IO issuers, the in-debt iocg may end
 * up blocked paying its debt while the IO device is idle.
 *
 * The following protects against such cases. If the device has been
 * sufficiently idle for a while, the debts are halved and delays are
 * recalculated.
 */
2082
static void ioc_forgive_debts(struct ioc *ioc, u64 usage_us_sum, int nr_debtors,
2083
struct ioc_now *now)
2084
{
2085
struct ioc_gq *iocg;
2086
u64 dur, usage_pct, nr_cycles, nr_cycles_shift;
2087
2088
/* if no debtor, reset the cycle */
2089
if (!nr_debtors) {
2090
ioc->dfgv_period_at = now->now;
2091
ioc->dfgv_period_rem = 0;
2092
ioc->dfgv_usage_us_sum = 0;
2093
return;
2094
}
2095
2096
/*
2097
* Debtors can pass through a lot of writes choking the device and we
2098
* don't want to be forgiving debts while the device is struggling from
2099
* write bursts. If we're missing latency targets, consider the device
2100
* fully utilized.
2101
*/
2102
if (ioc->busy_level > 0)
2103
usage_us_sum = max_t(u64, usage_us_sum, ioc->period_us);
2104
2105
ioc->dfgv_usage_us_sum += usage_us_sum;
2106
if (time_before64(now->now, ioc->dfgv_period_at + DFGV_PERIOD))
2107
return;
2108
2109
/*
2110
* At least DFGV_PERIOD has passed since the last period. Calculate the
2111
* average usage and reset the period counters.
2112
*/
2113
dur = now->now - ioc->dfgv_period_at;
2114
usage_pct = div64_u64(100 * ioc->dfgv_usage_us_sum, dur);
2115
2116
ioc->dfgv_period_at = now->now;
2117
ioc->dfgv_usage_us_sum = 0;
2118
2119
/* if was too busy, reset everything */
2120
if (usage_pct > DFGV_USAGE_PCT) {
2121
ioc->dfgv_period_rem = 0;
2122
return;
2123
}
2124
2125
/*
 * Usage is lower than threshold. Let's forgive some debts. Debt
 * forgiveness runs off of the usual ioc timer but its period usually
 * doesn't match ioc's. Compensate the difference by performing the
 * reduction as many times as would fit in the duration since the last
 * run and carrying over the left-over duration in @ioc->dfgv_period_rem
 * - if ioc period is 75% of DFGV_PERIOD, one out of three consecutive
 * reductions is doubled.
 */
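/*
 * Carry-over example (hypothetical numbers): if the duration since the last
 * run plus the remainder adds up to 2.5 * DFGV_PERIOD, the do_div() below
 * yields nr_cycles = 2 and leaves 0.5 * DFGV_PERIOD in
 * @ioc->dfgv_period_rem for the next run.
 */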
2134
nr_cycles = dur + ioc->dfgv_period_rem;
2135
ioc->dfgv_period_rem = do_div(nr_cycles, DFGV_PERIOD);
2136
2137
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
2138
u64 __maybe_unused old_debt, __maybe_unused old_delay;
2139
2140
if (!iocg->abs_vdebt && !iocg->delay)
2141
continue;
2142
2143
spin_lock(&iocg->waitq.lock);
2144
2145
old_debt = iocg->abs_vdebt;
2146
old_delay = iocg->delay;
2147
2148
nr_cycles_shift = min_t(u64, nr_cycles, BITS_PER_LONG - 1);
2149
if (iocg->abs_vdebt)
2150
iocg->abs_vdebt = iocg->abs_vdebt >> nr_cycles_shift ?: 1;
2151
2152
if (iocg->delay)
2153
iocg->delay = iocg->delay >> nr_cycles_shift ?: 1;
2154
2155
iocg_kick_waitq(iocg, true, now);
2156
2157
TRACE_IOCG_PATH(iocg_forgive_debt, iocg, now, usage_pct,
2158
old_debt, iocg->abs_vdebt,
2159
old_delay, iocg->delay);
2160
2161
spin_unlock(&iocg->waitq.lock);
2162
}
2163
}
2164
2165
/*
 * Check the active iocgs' state to avoid oversleeping and deactivate
 * idle iocgs.
 *
 * Since waiters determine the sleep durations based on the vrate
 * they saw at the time of sleep, if vrate has increased, some
 * waiters could be sleeping for too long. Wake up tardy waiters
 * which should have woken up in the last period and expire idle
 * iocgs.
 */
2175
static int ioc_check_iocgs(struct ioc *ioc, struct ioc_now *now)
2176
{
2177
int nr_debtors = 0;
2178
struct ioc_gq *iocg, *tiocg;
2179
2180
list_for_each_entry_safe(iocg, tiocg, &ioc->active_iocgs, active_list) {
2181
if (!waitqueue_active(&iocg->waitq) && !iocg->abs_vdebt &&
2182
!iocg->delay && !iocg_is_idle(iocg))
2183
continue;
2184
2185
spin_lock(&iocg->waitq.lock);
2186
2187
/* flush wait and indebt stat deltas */
2188
if (iocg->wait_since) {
2189
iocg->stat.wait_us += now->now - iocg->wait_since;
2190
iocg->wait_since = now->now;
2191
}
2192
if (iocg->indebt_since) {
2193
iocg->stat.indebt_us +=
2194
now->now - iocg->indebt_since;
2195
iocg->indebt_since = now->now;
2196
}
2197
if (iocg->indelay_since) {
2198
iocg->stat.indelay_us +=
2199
now->now - iocg->indelay_since;
2200
iocg->indelay_since = now->now;
2201
}
2202
2203
if (waitqueue_active(&iocg->waitq) || iocg->abs_vdebt ||
2204
iocg->delay) {
2205
/* might be oversleeping vtime / hweight changes, kick */
2206
iocg_kick_waitq(iocg, true, now);
2207
if (iocg->abs_vdebt || iocg->delay)
2208
nr_debtors++;
2209
} else if (iocg_is_idle(iocg)) {
2210
/* no waiter and idle, deactivate */
2211
u64 vtime = atomic64_read(&iocg->vtime);
2212
s64 excess;
2213
2214
/*
2215
* @iocg has been inactive for a full duration and will
2216
* have a high budget. Account anything above target as
2217
* error and throw away. On reactivation, it'll start
2218
* with the target budget.
2219
*/
2220
excess = now->vnow - vtime - ioc->margins.target;
2221
if (excess > 0) {
2222
u32 old_hwi;
2223
2224
current_hweight(iocg, NULL, &old_hwi);
2225
ioc->vtime_err -= div64_u64(excess * old_hwi,
2226
WEIGHT_ONE);
2227
}
2228
2229
TRACE_IOCG_PATH(iocg_idle, iocg, now,
2230
atomic64_read(&iocg->active_period),
2231
atomic64_read(&ioc->cur_period), vtime);
2232
__propagate_weights(iocg, 0, 0, false, now);
2233
list_del_init(&iocg->active_list);
2234
}
2235
2236
spin_unlock(&iocg->waitq.lock);
2237
}
2238
2239
commit_weights(ioc);
2240
return nr_debtors;
2241
}
2242
2243
static void ioc_timer_fn(struct timer_list *timer)
2244
{
2245
struct ioc *ioc = container_of(timer, struct ioc, timer);
2246
struct ioc_gq *iocg, *tiocg;
2247
struct ioc_now now;
2248
LIST_HEAD(surpluses);
2249
int nr_debtors, nr_shortages = 0, nr_lagging = 0;
2250
u64 usage_us_sum = 0;
2251
u32 ppm_rthr;
2252
u32 ppm_wthr;
2253
u32 missed_ppm[2], rq_wait_pct;
2254
u64 period_vtime;
2255
int prev_busy_level;
2256
2257
/* how were the latencies during the period? */
2258
ioc_lat_stat(ioc, missed_ppm, &rq_wait_pct);
2259
2260
/* take care of active iocgs */
2261
spin_lock_irq(&ioc->lock);
2262
2263
ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM];
2264
ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM];
2265
ioc_now(ioc, &now);
2266
2267
period_vtime = now.vnow - ioc->period_at_vtime;
2268
if (WARN_ON_ONCE(!period_vtime)) {
2269
spin_unlock_irq(&ioc->lock);
2270
return;
2271
}
2272
2273
nr_debtors = ioc_check_iocgs(ioc, &now);
2274
2275
/*
 * Wait and indebt stats are flushed above and the donation calculation
 * below needs updated usage stats. Let's bring the stats up-to-date.
 */
2279
iocg_flush_stat(&ioc->active_iocgs, &now);
2280
2281
/* calc usage and see whether some weights need to be moved around */
2282
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
2283
u64 vdone, vtime, usage_us;
2284
u32 hw_active, hw_inuse;
2285
2286
/*
2287
* Collect unused and wind vtime closer to vnow to prevent
2288
* iocgs from accumulating a large amount of budget.
2289
*/
2290
vdone = atomic64_read(&iocg->done_vtime);
2291
vtime = atomic64_read(&iocg->vtime);
2292
current_hweight(iocg, &hw_active, &hw_inuse);
2293
2294
/*
2295
* Latency QoS detection doesn't account for IOs which are
2296
* in-flight for longer than a period. Detect them by
2297
* comparing vdone against period start. If lagging behind
2298
* IOs from past periods, don't increase vrate.
2299
*/
2300
if ((ppm_rthr != MILLION || ppm_wthr != MILLION) &&
2301
!atomic_read(&iocg_to_blkg(iocg)->use_delay) &&
2302
time_after64(vtime, vdone) &&
2303
time_after64(vtime, now.vnow -
2304
MAX_LAGGING_PERIODS * period_vtime) &&
2305
time_before64(vdone, now.vnow - period_vtime))
2306
nr_lagging++;
2307
2308
/*
2309
* Determine absolute usage factoring in in-flight IOs to avoid
2310
* high-latency completions appearing as idle.
2311
*/
2312
usage_us = iocg->usage_delta_us;
2313
usage_us_sum += usage_us;
2314
2315
/* see whether there's surplus vtime */
2316
WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
2317
if (hw_inuse < hw_active ||
2318
(!waitqueue_active(&iocg->waitq) &&
2319
time_before64(vtime, now.vnow - ioc->margins.low))) {
2320
u32 hwa, old_hwi, hwm, new_hwi, usage;
2321
u64 usage_dur;
2322
2323
if (vdone != vtime) {
2324
u64 inflight_us = DIV64_U64_ROUND_UP(
2325
cost_to_abs_cost(vtime - vdone, hw_inuse),
2326
ioc->vtime_base_rate);
2327
2328
usage_us = max(usage_us, inflight_us);
2329
}
2330
2331
/* convert to hweight based usage ratio */
2332
if (time_after64(iocg->activated_at, ioc->period_at))
2333
usage_dur = max_t(u64, now.now - iocg->activated_at, 1);
2334
else
2335
usage_dur = max_t(u64, now.now - ioc->period_at, 1);
2336
2337
usage = clamp(DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE, usage_dur),
2338
1, WEIGHT_ONE);
2339
2340
/*
2341
* Already donating or accumulated enough to start.
2342
* Determine the donation amount.
2343
*/
2344
current_hweight(iocg, &hwa, &old_hwi);
2345
hwm = current_hweight_max(iocg);
2346
new_hwi = hweight_after_donation(iocg, old_hwi, hwm,
2347
usage, &now);
2348
/*
 * Donation calculation assumes hweight_after_donation
 * to be positive, a condition that a donor w/ hwa < 2
 * can't meet. Don't bother with donation if hwa is
 * below 2. It's not going to make a meaningful difference
 * anyway.
 */
2355
if (new_hwi < hwm && hwa >= 2) {
2356
iocg->hweight_donating = hwa;
2357
iocg->hweight_after_donation = new_hwi;
2358
list_add(&iocg->surplus_list, &surpluses);
2359
} else if (!iocg->abs_vdebt) {
2360
/*
 * @iocg doesn't have enough to donate. Reset
 * its inuse to active.
 *
 * Don't reset debtors as their inuses are
 * owned by debt handling. This shouldn't affect
 * donation calculation in any meaningful way
 * as @iocg doesn't have a meaningful amount of
 * share anyway.
 */
2370
TRACE_IOCG_PATH(inuse_shortage, iocg, &now,
2371
iocg->inuse, iocg->active,
2372
iocg->hweight_inuse, new_hwi);
2373
2374
__propagate_weights(iocg, iocg->active,
2375
iocg->active, true, &now);
2376
nr_shortages++;
2377
}
2378
} else {
2379
/* genuinely short on vtime */
2380
nr_shortages++;
2381
}
2382
}
2383
2384
if (!list_empty(&surpluses) && nr_shortages)
2385
transfer_surpluses(&surpluses, &now);
2386
2387
commit_weights(ioc);
2388
2389
/* surplus list should be dissolved after use */
2390
list_for_each_entry_safe(iocg, tiocg, &surpluses, surplus_list)
2391
list_del_init(&iocg->surplus_list);
2392
2393
/*
 * If q is getting clogged or we're missing too much, we're issuing
 * too much IO and should lower the vtime rate. If we're not missing
 * targets and are seeing shortages but no surpluses, we're too stingy
 * and should increase the vtime rate.
 */
2399
prev_busy_level = ioc->busy_level;
2400
if (rq_wait_pct > RQ_WAIT_BUSY_PCT ||
2401
missed_ppm[READ] > ppm_rthr ||
2402
missed_ppm[WRITE] > ppm_wthr) {
2403
/* clearly missing QoS targets, slow down vrate */
2404
ioc->busy_level = max(ioc->busy_level, 0);
2405
ioc->busy_level++;
2406
} else if (rq_wait_pct <= RQ_WAIT_BUSY_PCT * UNBUSY_THR_PCT / 100 &&
2407
missed_ppm[READ] <= ppm_rthr * UNBUSY_THR_PCT / 100 &&
2408
missed_ppm[WRITE] <= ppm_wthr * UNBUSY_THR_PCT / 100) {
2409
/* QoS targets are being met with >25% margin */
2410
if (nr_shortages) {
2411
/*
2412
* We're throttling while the device has spare
2413
* capacity. If vrate was being slowed down, stop.
2414
*/
2415
ioc->busy_level = min(ioc->busy_level, 0);
2416
2417
/*
2418
* If there are IOs spanning multiple periods, wait
2419
* them out before pushing the device harder.
2420
*/
2421
if (!nr_lagging)
2422
ioc->busy_level--;
2423
} else {
2424
/*
2425
* Nobody is being throttled and the users aren't
2426
* issuing enough IOs to saturate the device. We
2427
* simply don't know how close the device is to
2428
* saturation. Coast.
2429
*/
2430
ioc->busy_level = 0;
2431
}
2432
} else {
2433
/* inside the hysteresis margin, we're good */
2434
ioc->busy_level = 0;
2435
}
2436
2437
ioc->busy_level = clamp(ioc->busy_level, -1000, 1000);
2438
2439
ioc_adjust_base_vrate(ioc, rq_wait_pct, nr_lagging, nr_shortages,
2440
prev_busy_level, missed_ppm);
2441
2442
ioc_refresh_params(ioc, false);
2443
2444
ioc_forgive_debts(ioc, usage_us_sum, nr_debtors, &now);
2445
2446
/*
2447
* This period is done. Move onto the next one. If nothing's
2448
* going on with the device, stop the timer.
2449
*/
2450
atomic64_inc(&ioc->cur_period);
2451
2452
if (ioc->running != IOC_STOP) {
2453
if (!list_empty(&ioc->active_iocgs)) {
2454
ioc_start_period(ioc, &now);
2455
} else {
2456
ioc->busy_level = 0;
2457
ioc->vtime_err = 0;
2458
ioc->running = IOC_IDLE;
2459
}
2460
2461
ioc_refresh_vrate(ioc, &now);
2462
}
2463
2464
spin_unlock_irq(&ioc->lock);
2465
}
2466
2467
static u64 adjust_inuse_and_calc_cost(struct ioc_gq *iocg, u64 vtime,
2468
u64 abs_cost, struct ioc_now *now)
2469
{
2470
struct ioc *ioc = iocg->ioc;
2471
struct ioc_margins *margins = &ioc->margins;
2472
u32 __maybe_unused old_inuse = iocg->inuse, __maybe_unused old_hwi;
2473
u32 hwi, adj_step;
2474
s64 margin;
2475
u64 cost, new_inuse;
2476
unsigned long flags;
2477
2478
current_hweight(iocg, NULL, &hwi);
2479
old_hwi = hwi;
2480
cost = abs_cost_to_cost(abs_cost, hwi);
2481
margin = now->vnow - vtime - cost;
2482
2483
/* debt handling owns inuse for debtors */
2484
if (iocg->abs_vdebt)
2485
return cost;
2486
2487
/*
 * We only increase inuse during the period and do so only if the margin
 * has deteriorated since the previous adjustment.
 */
2491
if (margin >= iocg->saved_margin || margin >= margins->low ||
2492
iocg->inuse == iocg->active)
2493
return cost;
2494
2495
spin_lock_irqsave(&ioc->lock, flags);
2496
2497
/* we own inuse only when @iocg is in the normal active state */
2498
if (iocg->abs_vdebt || list_empty(&iocg->active_list)) {
2499
spin_unlock_irqrestore(&ioc->lock, flags);
2500
return cost;
2501
}
2502
2503
/*
 * Bump up inuse till @abs_cost fits in the existing budget.
 * adj_step must be determined after acquiring ioc->lock - we might
 * have raced and lost to another thread for activation and could
 * be reading 0 iocg->active before ioc->lock which would lead to an
 * infinite loop.
 */
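/*
 * Sketch of the loop below (INUSE_ADJ_STEP_PCT is whatever this file
 * defines it to be; 25% is only an illustration): with active = 10000,
 * inuse is raised in steps of 2500 and the cost re-evaluated at the new
 * hweight_inuse until the charge fits before now->vnow or inuse reaches
 * active.
 */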
2510
new_inuse = iocg->inuse;
2511
adj_step = DIV_ROUND_UP(iocg->active * INUSE_ADJ_STEP_PCT, 100);
2512
do {
2513
new_inuse = new_inuse + adj_step;
2514
propagate_weights(iocg, iocg->active, new_inuse, true, now);
2515
current_hweight(iocg, NULL, &hwi);
2516
cost = abs_cost_to_cost(abs_cost, hwi);
2517
} while (time_after64(vtime + cost, now->vnow) &&
2518
iocg->inuse != iocg->active);
2519
2520
spin_unlock_irqrestore(&ioc->lock, flags);
2521
2522
TRACE_IOCG_PATH(inuse_adjust, iocg, now,
2523
old_inuse, iocg->inuse, old_hwi, hwi);
2524
2525
return cost;
2526
}
2527
2528
static void calc_vtime_cost_builtin(struct bio *bio, struct ioc_gq *iocg,
2529
bool is_merge, u64 *costp)
2530
{
2531
struct ioc *ioc = iocg->ioc;
2532
u64 coef_seqio, coef_randio, coef_page;
2533
u64 pages = max_t(u64, bio_sectors(bio) >> IOC_SECT_TO_PAGE_SHIFT, 1);
2534
u64 seek_pages = 0;
2535
u64 cost = 0;
2536
2537
/* Can't calculate cost for empty bio */
2538
if (!bio->bi_iter.bi_size)
2539
goto out;
2540
2541
switch (bio_op(bio)) {
2542
case REQ_OP_READ:
2543
coef_seqio = ioc->params.lcoefs[LCOEF_RSEQIO];
2544
coef_randio = ioc->params.lcoefs[LCOEF_RRANDIO];
2545
coef_page = ioc->params.lcoefs[LCOEF_RPAGE];
2546
break;
2547
case REQ_OP_WRITE:
2548
coef_seqio = ioc->params.lcoefs[LCOEF_WSEQIO];
2549
coef_randio = ioc->params.lcoefs[LCOEF_WRANDIO];
2550
coef_page = ioc->params.lcoefs[LCOEF_WPAGE];
2551
break;
2552
default:
2553
goto out;
2554
}
2555
2556
if (iocg->cursor) {
2557
seek_pages = abs(bio->bi_iter.bi_sector - iocg->cursor);
2558
seek_pages >>= IOC_SECT_TO_PAGE_SHIFT;
2559
}
2560
2561
if (!is_merge) {
2562
if (seek_pages > LCOEF_RANDIO_PAGES) {
2563
cost += coef_randio;
2564
} else {
2565
cost += coef_seqio;
2566
}
2567
}
2568
cost += pages * coef_page;
2569
out:
2570
*costp = cost;
2571
}
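/*
 * Cost example under the linear model above (illustrative only, assuming
 * IOC_SECT_TO_PAGE_SHIFT converts 512-byte sectors to 4KB pages): a 64KB
 * read that seeks farther than LCOEF_RANDIO_PAGES from the cursor costs
 * coef_randio + 16 * coef_page, while the same read continuing at the
 * cursor costs coef_seqio + 16 * coef_page.
 */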
2572
2573
static u64 calc_vtime_cost(struct bio *bio, struct ioc_gq *iocg, bool is_merge)
2574
{
2575
u64 cost;
2576
2577
calc_vtime_cost_builtin(bio, iocg, is_merge, &cost);
2578
return cost;
2579
}
2580
2581
static void calc_size_vtime_cost_builtin(struct request *rq, struct ioc *ioc,
2582
u64 *costp)
2583
{
2584
unsigned int pages = blk_rq_stats_sectors(rq) >> IOC_SECT_TO_PAGE_SHIFT;
2585
2586
switch (req_op(rq)) {
2587
case REQ_OP_READ:
2588
*costp = pages * ioc->params.lcoefs[LCOEF_RPAGE];
2589
break;
2590
case REQ_OP_WRITE:
2591
*costp = pages * ioc->params.lcoefs[LCOEF_WPAGE];
2592
break;
2593
default:
2594
*costp = 0;
2595
}
2596
}
2597
2598
static u64 calc_size_vtime_cost(struct request *rq, struct ioc *ioc)
2599
{
2600
u64 cost;
2601
2602
calc_size_vtime_cost_builtin(rq, ioc, &cost);
2603
return cost;
2604
}
2605
2606
static void ioc_rqos_throttle(struct rq_qos *rqos, struct bio *bio)
2607
{
2608
struct blkcg_gq *blkg = bio->bi_blkg;
2609
struct ioc *ioc = rqos_to_ioc(rqos);
2610
struct ioc_gq *iocg = blkg_to_iocg(blkg);
2611
struct ioc_now now;
2612
struct iocg_wait wait;
2613
u64 abs_cost, cost, vtime;
2614
bool use_debt, ioc_locked;
2615
unsigned long flags;
2616
2617
/* bypass IOs if disabled, still initializing, or for root cgroup */
2618
if (!ioc->enabled || !iocg || !iocg->level)
2619
return;
2620
2621
/* calculate the absolute vtime cost */
2622
abs_cost = calc_vtime_cost(bio, iocg, false);
2623
if (!abs_cost)
2624
return;
2625
2626
if (!iocg_activate(iocg, &now))
2627
return;
2628
2629
iocg->cursor = bio_end_sector(bio);
2630
vtime = atomic64_read(&iocg->vtime);
2631
cost = adjust_inuse_and_calc_cost(iocg, vtime, abs_cost, &now);
2632
2633
/*
2634
* If no one's waiting and within budget, issue right away. The
2635
* tests are racy but the races aren't systemic - we only miss once
2636
* in a while which is fine.
2637
*/
2638
if (!waitqueue_active(&iocg->waitq) && !iocg->abs_vdebt &&
2639
time_before_eq64(vtime + cost, now.vnow)) {
2640
iocg_commit_bio(iocg, bio, abs_cost, cost);
2641
return;
2642
}
2643
2644
/*
2645
* We're over budget. This can be handled in two ways. IOs which may
2646
* cause priority inversions are punted to @ioc->aux_iocg and charged as
2647
* debt. Otherwise, the issuer is blocked on @iocg->waitq. Debt handling
2648
* requires @ioc->lock, waitq handling @iocg->waitq.lock. Determine
2649
* whether debt handling is needed and acquire locks accordingly.
2650
*/
2651
use_debt = bio_issue_as_root_blkg(bio) || fatal_signal_pending(current);
2652
ioc_locked = use_debt || READ_ONCE(iocg->abs_vdebt);
2653
retry_lock:
2654
iocg_lock(iocg, ioc_locked, &flags);
2655
2656
/*
 * @iocg must stay activated for debt and waitq handling. Deactivation
 * is synchronized against both ioc->lock and waitq.lock and we won't
 * get deactivated as long as we're waiting or have debt, so we're good
 * if we're activated here. In the unlikely case that we aren't, just
 * issue the IO.
 */
2663
if (unlikely(list_empty(&iocg->active_list))) {
2664
iocg_unlock(iocg, ioc_locked, &flags);
2665
iocg_commit_bio(iocg, bio, abs_cost, cost);
2666
return;
2667
}
2668
2669
/*
2670
* We're over budget. If @bio has to be issued regardless, remember
2671
* the abs_cost instead of advancing vtime. iocg_kick_waitq() will pay
2672
* off the debt before waking more IOs.
2673
*
2674
* This way, the debt is continuously paid off each period with the
2675
* actual budget available to the cgroup. If we just wound vtime, we
2676
* would incorrectly use the current hw_inuse for the entire amount
2677
* which, for example, can lead to the cgroup staying blocked for a
2678
* long time even with substantially raised hw_inuse.
2679
*
2680
* An iocg with vdebt should stay online so that the timer can keep
2681
* deducting its vdebt and [de]activate use_delay mechanism
2682
* accordingly. We don't want to race against the timer trying to
2683
* clear them and leave @iocg inactive w/ dangling use_delay heavily
2684
* penalizing the cgroup and its descendants.
2685
*/
2686
if (use_debt) {
2687
iocg_incur_debt(iocg, abs_cost, &now);
2688
if (iocg_kick_delay(iocg, &now))
2689
blkcg_schedule_throttle(rqos->disk,
2690
(bio->bi_opf & REQ_SWAP) == REQ_SWAP);
2691
iocg_unlock(iocg, ioc_locked, &flags);
2692
return;
2693
}
2694
2695
/* guarantee that iocgs w/ waiters have maximum inuse */
2696
if (!iocg->abs_vdebt && iocg->inuse != iocg->active) {
2697
if (!ioc_locked) {
2698
iocg_unlock(iocg, false, &flags);
2699
ioc_locked = true;
2700
goto retry_lock;
2701
}
2702
propagate_weights(iocg, iocg->active, iocg->active, true,
2703
&now);
2704
}
2705
2706
/*
2707
* Append self to the waitq and schedule the wakeup timer if we're
2708
* the first waiter. The timer duration is calculated based on the
2709
* current vrate. vtime and hweight changes can make it too short
2710
* or too long. Each wait entry records the absolute cost it's
2711
* waiting for to allow re-evaluation using a custom wait entry.
2712
*
2713
* If too short, the timer simply reschedules itself. If too long,
2714
* the period timer will notice and trigger wakeups.
2715
*
2716
* All waiters are on iocg->waitq and the wait states are
2717
* synchronized using waitq.lock.
2718
*/
2719
init_wait_func(&wait.wait, iocg_wake_fn);
2720
wait.bio = bio;
2721
wait.abs_cost = abs_cost;
2722
wait.committed = false; /* will be set true by waker */
2723
2724
__add_wait_queue_entry_tail(&iocg->waitq, &wait.wait);
2725
iocg_kick_waitq(iocg, ioc_locked, &now);
2726
2727
iocg_unlock(iocg, ioc_locked, &flags);
2728
2729
while (true) {
2730
set_current_state(TASK_UNINTERRUPTIBLE);
2731
if (wait.committed)
2732
break;
2733
io_schedule();
2734
}
2735
2736
/* waker already committed us, proceed */
2737
finish_wait(&iocg->waitq, &wait.wait);
2738
}
2739
2740
static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq,
2741
struct bio *bio)
2742
{
2743
struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg);
2744
struct ioc *ioc = rqos_to_ioc(rqos);
2745
sector_t bio_end = bio_end_sector(bio);
2746
struct ioc_now now;
2747
u64 vtime, abs_cost, cost;
2748
unsigned long flags;
2749
2750
/* bypass if disabled, still initializing, or for root cgroup */
2751
if (!ioc->enabled || !iocg || !iocg->level)
2752
return;
2753
2754
abs_cost = calc_vtime_cost(bio, iocg, true);
2755
if (!abs_cost)
2756
return;
2757
2758
ioc_now(ioc, &now);
2759
2760
vtime = atomic64_read(&iocg->vtime);
2761
cost = adjust_inuse_and_calc_cost(iocg, vtime, abs_cost, &now);
2762
2763
/* update cursor if backmerging into the request at the cursor */
2764
if (blk_rq_pos(rq) < bio_end &&
2765
blk_rq_pos(rq) + blk_rq_sectors(rq) == iocg->cursor)
2766
iocg->cursor = bio_end;
2767
2768
/*
2769
* Charge if there's enough vtime budget and the existing request has
2770
* cost assigned.
2771
*/
2772
if (rq->bio && rq->bio->bi_iocost_cost &&
2773
time_before_eq64(atomic64_read(&iocg->vtime) + cost, now.vnow)) {
2774
iocg_commit_bio(iocg, bio, abs_cost, cost);
2775
return;
2776
}
2777
2778
/*
2779
* Otherwise, account it as debt if @iocg is online, which it should
2780
* be for the vast majority of cases. See debt handling in
2781
* ioc_rqos_throttle() for details.
2782
*/
2783
spin_lock_irqsave(&ioc->lock, flags);
2784
spin_lock(&iocg->waitq.lock);
2785
2786
if (likely(!list_empty(&iocg->active_list))) {
2787
iocg_incur_debt(iocg, abs_cost, &now);
2788
if (iocg_kick_delay(iocg, &now))
2789
blkcg_schedule_throttle(rqos->disk,
2790
(bio->bi_opf & REQ_SWAP) == REQ_SWAP);
2791
} else {
2792
iocg_commit_bio(iocg, bio, abs_cost, cost);
2793
}
2794
2795
spin_unlock(&iocg->waitq.lock);
2796
spin_unlock_irqrestore(&ioc->lock, flags);
2797
}
2798
2799
static void ioc_rqos_done_bio(struct rq_qos *rqos, struct bio *bio)
2800
{
2801
struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg);
2802
2803
if (iocg && bio->bi_iocost_cost)
2804
atomic64_add(bio->bi_iocost_cost, &iocg->done_vtime);
2805
}
2806
2807
static void ioc_rqos_done(struct rq_qos *rqos, struct request *rq)
2808
{
2809
struct ioc *ioc = rqos_to_ioc(rqos);
2810
struct ioc_pcpu_stat *ccs;
2811
u64 on_q_ns, rq_wait_ns, size_nsec;
2812
int pidx, rw;
2813
2814
if (!ioc->enabled || !rq->alloc_time_ns || !rq->start_time_ns)
2815
return;
2816
2817
switch (req_op(rq)) {
2818
case REQ_OP_READ:
2819
pidx = QOS_RLAT;
2820
rw = READ;
2821
break;
2822
case REQ_OP_WRITE:
2823
pidx = QOS_WLAT;
2824
rw = WRITE;
2825
break;
2826
default:
2827
return;
2828
}
2829
2830
on_q_ns = blk_time_get_ns() - rq->alloc_time_ns;
2831
rq_wait_ns = rq->start_time_ns - rq->alloc_time_ns;
2832
size_nsec = div64_u64(calc_size_vtime_cost(rq, ioc), VTIME_PER_NSEC);
2833
2834
ccs = get_cpu_ptr(ioc->pcpu_stat);
2835
2836
if (on_q_ns <= size_nsec ||
2837
on_q_ns - size_nsec <= ioc->params.qos[pidx] * NSEC_PER_USEC)
2838
local_inc(&ccs->missed[rw].nr_met);
2839
else
2840
local_inc(&ccs->missed[rw].nr_missed);
2841
2842
local64_add(rq_wait_ns, &ccs->rq_wait_ns);
2843
2844
put_cpu_ptr(ccs);
2845
}
2846
2847
static void ioc_rqos_queue_depth_changed(struct rq_qos *rqos)
2848
{
2849
struct ioc *ioc = rqos_to_ioc(rqos);
2850
2851
spin_lock_irq(&ioc->lock);
2852
ioc_refresh_params(ioc, false);
2853
spin_unlock_irq(&ioc->lock);
2854
}
2855
2856
static void ioc_rqos_exit(struct rq_qos *rqos)
2857
{
2858
struct ioc *ioc = rqos_to_ioc(rqos);
2859
2860
blkcg_deactivate_policy(rqos->disk, &blkcg_policy_iocost);
2861
2862
spin_lock_irq(&ioc->lock);
2863
ioc->running = IOC_STOP;
2864
spin_unlock_irq(&ioc->lock);
2865
2866
timer_shutdown_sync(&ioc->timer);
2867
free_percpu(ioc->pcpu_stat);
2868
kfree(ioc);
2869
}
2870
2871
static const struct rq_qos_ops ioc_rqos_ops = {
2872
.throttle = ioc_rqos_throttle,
2873
.merge = ioc_rqos_merge,
2874
.done_bio = ioc_rqos_done_bio,
2875
.done = ioc_rqos_done,
2876
.queue_depth_changed = ioc_rqos_queue_depth_changed,
2877
.exit = ioc_rqos_exit,
2878
};
2879
2880
static int blk_iocost_init(struct gendisk *disk)
2881
{
2882
struct ioc *ioc;
2883
int i, cpu, ret;
2884
2885
ioc = kzalloc(sizeof(*ioc), GFP_KERNEL);
2886
if (!ioc)
2887
return -ENOMEM;
2888
2889
ioc->pcpu_stat = alloc_percpu(struct ioc_pcpu_stat);
2890
if (!ioc->pcpu_stat) {
2891
kfree(ioc);
2892
return -ENOMEM;
2893
}
2894
2895
for_each_possible_cpu(cpu) {
2896
struct ioc_pcpu_stat *ccs = per_cpu_ptr(ioc->pcpu_stat, cpu);
2897
2898
for (i = 0; i < ARRAY_SIZE(ccs->missed); i++) {
2899
local_set(&ccs->missed[i].nr_met, 0);
2900
local_set(&ccs->missed[i].nr_missed, 0);
2901
}
2902
local64_set(&ccs->rq_wait_ns, 0);
2903
}
2904
2905
spin_lock_init(&ioc->lock);
2906
timer_setup(&ioc->timer, ioc_timer_fn, 0);
2907
INIT_LIST_HEAD(&ioc->active_iocgs);
2908
2909
ioc->running = IOC_IDLE;
2910
ioc->vtime_base_rate = VTIME_PER_USEC;
2911
atomic64_set(&ioc->vtime_rate, VTIME_PER_USEC);
2912
seqcount_spinlock_init(&ioc->period_seqcount, &ioc->lock);
2913
ioc->period_at = ktime_to_us(blk_time_get());
2914
atomic64_set(&ioc->cur_period, 0);
2915
atomic_set(&ioc->hweight_gen, 0);
2916
2917
spin_lock_irq(&ioc->lock);
2918
ioc->autop_idx = AUTOP_INVALID;
2919
ioc_refresh_params_disk(ioc, true, disk);
2920
spin_unlock_irq(&ioc->lock);
2921
2922
/*
 * rqos must be added before activation to allow ioc_pd_init() to
 * look up the ioc from q. This means that the rqos methods may get
 * called before policy activation completes, so they can't assume that
 * the target bio has an associated iocg and must test for a NULL iocg.
 */
2928
ret = rq_qos_add(&ioc->rqos, disk, RQ_QOS_COST, &ioc_rqos_ops);
2929
if (ret)
2930
goto err_free_ioc;
2931
2932
ret = blkcg_activate_policy(disk, &blkcg_policy_iocost);
2933
if (ret)
2934
goto err_del_qos;
2935
return 0;
2936
2937
err_del_qos:
2938
rq_qos_del(&ioc->rqos);
2939
err_free_ioc:
2940
free_percpu(ioc->pcpu_stat);
2941
kfree(ioc);
2942
return ret;
2943
}
2944
2945
static struct blkcg_policy_data *ioc_cpd_alloc(gfp_t gfp)
2946
{
2947
struct ioc_cgrp *iocc;
2948
2949
iocc = kzalloc(sizeof(struct ioc_cgrp), gfp);
2950
if (!iocc)
2951
return NULL;
2952
2953
iocc->dfl_weight = CGROUP_WEIGHT_DFL * WEIGHT_ONE;
2954
return &iocc->cpd;
2955
}
2956
2957
static void ioc_cpd_free(struct blkcg_policy_data *cpd)
2958
{
2959
kfree(container_of(cpd, struct ioc_cgrp, cpd));
2960
}
2961
2962
static struct blkg_policy_data *ioc_pd_alloc(struct gendisk *disk,
2963
struct blkcg *blkcg, gfp_t gfp)
2964
{
2965
int levels = blkcg->css.cgroup->level + 1;
2966
struct ioc_gq *iocg;
2967
2968
iocg = kzalloc_node(struct_size(iocg, ancestors, levels), gfp,
2969
disk->node_id);
2970
if (!iocg)
2971
return NULL;
2972
2973
iocg->pcpu_stat = alloc_percpu_gfp(struct iocg_pcpu_stat, gfp);
2974
if (!iocg->pcpu_stat) {
2975
kfree(iocg);
2976
return NULL;
2977
}
2978
2979
return &iocg->pd;
2980
}
2981
2982
static void ioc_pd_init(struct blkg_policy_data *pd)
2983
{
2984
struct ioc_gq *iocg = pd_to_iocg(pd);
2985
struct blkcg_gq *blkg = pd_to_blkg(&iocg->pd);
2986
struct ioc *ioc = q_to_ioc(blkg->q);
2987
struct ioc_now now;
2988
struct blkcg_gq *tblkg;
2989
unsigned long flags;
2990
2991
ioc_now(ioc, &now);
2992
2993
iocg->ioc = ioc;
2994
atomic64_set(&iocg->vtime, now.vnow);
2995
atomic64_set(&iocg->done_vtime, now.vnow);
2996
atomic64_set(&iocg->active_period, atomic64_read(&ioc->cur_period));
2997
INIT_LIST_HEAD(&iocg->active_list);
2998
INIT_LIST_HEAD(&iocg->walk_list);
2999
INIT_LIST_HEAD(&iocg->surplus_list);
3000
iocg->hweight_active = WEIGHT_ONE;
3001
iocg->hweight_inuse = WEIGHT_ONE;
3002
3003
init_waitqueue_head(&iocg->waitq);
3004
hrtimer_setup(&iocg->waitq_timer, iocg_waitq_timer_fn, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
3005
3006
iocg->level = blkg->blkcg->css.cgroup->level;
3007
3008
for (tblkg = blkg; tblkg; tblkg = tblkg->parent) {
3009
struct ioc_gq *tiocg = blkg_to_iocg(tblkg);
3010
iocg->ancestors[tiocg->level] = tiocg;
3011
}
3012
3013
spin_lock_irqsave(&ioc->lock, flags);
3014
weight_updated(iocg, &now);
3015
spin_unlock_irqrestore(&ioc->lock, flags);
3016
}
3017
3018
static void ioc_pd_free(struct blkg_policy_data *pd)
3019
{
3020
struct ioc_gq *iocg = pd_to_iocg(pd);
3021
struct ioc *ioc = iocg->ioc;
3022
unsigned long flags;
3023
3024
if (ioc) {
3025
spin_lock_irqsave(&ioc->lock, flags);
3026
3027
if (!list_empty(&iocg->active_list)) {
3028
struct ioc_now now;
3029
3030
ioc_now(ioc, &now);
3031
propagate_weights(iocg, 0, 0, false, &now);
3032
list_del_init(&iocg->active_list);
3033
}
3034
3035
WARN_ON_ONCE(!list_empty(&iocg->walk_list));
3036
WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
3037
3038
spin_unlock_irqrestore(&ioc->lock, flags);
3039
3040
hrtimer_cancel(&iocg->waitq_timer);
3041
}
3042
free_percpu(iocg->pcpu_stat);
3043
kfree(iocg);
3044
}
3045
3046
static void ioc_pd_stat(struct blkg_policy_data *pd, struct seq_file *s)
3047
{
3048
struct ioc_gq *iocg = pd_to_iocg(pd);
3049
struct ioc *ioc = iocg->ioc;
3050
3051
if (!ioc->enabled)
3052
return;
3053
3054
if (iocg->level == 0) {
3055
unsigned vp10k = DIV64_U64_ROUND_CLOSEST(
3056
ioc->vtime_base_rate * 10000,
3057
VTIME_PER_USEC);
3058
seq_printf(s, " cost.vrate=%u.%02u", vp10k / 100, vp10k % 100);
3059
}
3060
3061
seq_printf(s, " cost.usage=%llu", iocg->last_stat.usage_us);
3062
3063
if (blkcg_debug_stats)
3064
seq_printf(s, " cost.wait=%llu cost.indebt=%llu cost.indelay=%llu",
3065
iocg->last_stat.wait_us,
3066
iocg->last_stat.indebt_us,
3067
iocg->last_stat.indelay_us);
3068
}
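/*
 * The fields printed above are appended to this cgroup's io.stat line. A
 * fragment might look like (values invented for illustration):
 *   cost.vrate=137.50 cost.usage=52146 cost.wait=1203 cost.indebt=0 cost.indelay=0
 */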
3069
3070
static u64 ioc_weight_prfill(struct seq_file *sf, struct blkg_policy_data *pd,
3071
int off)
3072
{
3073
const char *dname = blkg_dev_name(pd->blkg);
3074
struct ioc_gq *iocg = pd_to_iocg(pd);
3075
3076
if (dname && iocg->cfg_weight)
3077
seq_printf(sf, "%s %u\n", dname, iocg->cfg_weight / WEIGHT_ONE);
3078
return 0;
3079
}
3080
3081
3082
static int ioc_weight_show(struct seq_file *sf, void *v)
3083
{
3084
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
3085
struct ioc_cgrp *iocc = blkcg_to_iocc(blkcg);
3086
3087
seq_printf(sf, "default %u\n", iocc->dfl_weight / WEIGHT_ONE);
3088
blkcg_print_blkgs(sf, blkcg, ioc_weight_prfill,
3089
&blkcg_policy_iocost, seq_cft(sf)->private, false);
3090
return 0;
3091
}
3092
3093
static ssize_t ioc_weight_write(struct kernfs_open_file *of, char *buf,
3094
size_t nbytes, loff_t off)
3095
{
3096
struct blkcg *blkcg = css_to_blkcg(of_css(of));
3097
struct ioc_cgrp *iocc = blkcg_to_iocc(blkcg);
3098
struct blkg_conf_ctx ctx;
3099
struct ioc_now now;
3100
struct ioc_gq *iocg;
3101
u32 v;
3102
int ret;
3103
3104
if (!strchr(buf, ':')) {
3105
struct blkcg_gq *blkg;
3106
3107
if (!sscanf(buf, "default %u", &v) && !sscanf(buf, "%u", &v))
3108
return -EINVAL;
3109
3110
if (v < CGROUP_WEIGHT_MIN || v > CGROUP_WEIGHT_MAX)
3111
return -EINVAL;
3112
3113
spin_lock_irq(&blkcg->lock);
3114
iocc->dfl_weight = v * WEIGHT_ONE;
3115
hlist_for_each_entry(blkg, &blkcg->blkg_list, blkcg_node) {
3116
struct ioc_gq *iocg = blkg_to_iocg(blkg);
3117
3118
if (iocg) {
3119
spin_lock(&iocg->ioc->lock);
3120
ioc_now(iocg->ioc, &now);
3121
weight_updated(iocg, &now);
3122
spin_unlock(&iocg->ioc->lock);
3123
}
3124
}
3125
spin_unlock_irq(&blkcg->lock);
3126
3127
return nbytes;
3128
}
3129
3130
blkg_conf_init(&ctx, buf);
3131
3132
ret = blkg_conf_prep(blkcg, &blkcg_policy_iocost, &ctx);
3133
if (ret)
3134
goto err;
3135
3136
iocg = blkg_to_iocg(ctx.blkg);
3137
3138
if (!strncmp(ctx.body, "default", 7)) {
3139
v = 0;
3140
} else {
3141
if (!sscanf(ctx.body, "%u", &v))
3142
goto einval;
3143
if (v < CGROUP_WEIGHT_MIN || v > CGROUP_WEIGHT_MAX)
3144
goto einval;
3145
}
3146
3147
spin_lock(&iocg->ioc->lock);
3148
iocg->cfg_weight = v * WEIGHT_ONE;
3149
ioc_now(iocg->ioc, &now);
3150
weight_updated(iocg, &now);
3151
spin_unlock(&iocg->ioc->lock);
3152
3153
blkg_conf_exit(&ctx);
3154
return nbytes;
3155
3156
einval:
3157
ret = -EINVAL;
3158
err:
3159
blkg_conf_exit(&ctx);
3160
return ret;
3161
}
3162
3163
static u64 ioc_qos_prfill(struct seq_file *sf, struct blkg_policy_data *pd,
3164
int off)
3165
{
3166
const char *dname = blkg_dev_name(pd->blkg);
3167
struct ioc *ioc = pd_to_iocg(pd)->ioc;
3168
3169
if (!dname)
3170
return 0;
3171
3172
spin_lock(&ioc->lock);
3173
seq_printf(sf, "%s enable=%d ctrl=%s rpct=%u.%02u rlat=%u wpct=%u.%02u wlat=%u min=%u.%02u max=%u.%02u\n",
3174
dname, ioc->enabled, ioc->user_qos_params ? "user" : "auto",
3175
ioc->params.qos[QOS_RPPM] / 10000,
3176
ioc->params.qos[QOS_RPPM] % 10000 / 100,
3177
ioc->params.qos[QOS_RLAT],
3178
ioc->params.qos[QOS_WPPM] / 10000,
3179
ioc->params.qos[QOS_WPPM] % 10000 / 100,
3180
ioc->params.qos[QOS_WLAT],
3181
ioc->params.qos[QOS_MIN] / 10000,
3182
ioc->params.qos[QOS_MIN] % 10000 / 100,
3183
ioc->params.qos[QOS_MAX] / 10000,
3184
ioc->params.qos[QOS_MAX] % 10000 / 100);
3185
spin_unlock(&ioc->lock);
3186
return 0;
3187
}
3188
3189
static int ioc_qos_show(struct seq_file *sf, void *v)
3190
{
3191
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
3192
3193
blkcg_print_blkgs(sf, blkcg, ioc_qos_prfill,
3194
&blkcg_policy_iocost, seq_cft(sf)->private, false);
3195
return 0;
3196
}
3197
3198
static const match_table_t qos_ctrl_tokens = {
3199
{ QOS_ENABLE, "enable=%u" },
3200
{ QOS_CTRL, "ctrl=%s" },
3201
{ NR_QOS_CTRL_PARAMS, NULL },
3202
};
3203
3204
static const match_table_t qos_tokens = {
3205
{ QOS_RPPM, "rpct=%s" },
3206
{ QOS_RLAT, "rlat=%u" },
3207
{ QOS_WPPM, "wpct=%s" },
3208
{ QOS_WLAT, "wlat=%u" },
3209
{ QOS_MIN, "min=%s" },
3210
{ QOS_MAX, "max=%s" },
3211
{ NR_QOS_PARAMS, NULL },
3212
};
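/*
 * With the tokens above, an io.cost.qos write might look like the following
 * (8:16 stands in for an arbitrary MAJ:MIN, values are only illustrative):
 *
 *   8:16 enable=1 ctrl=user rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=150.00
 *
 * which mirrors the format printed by ioc_qos_prfill() above.
 */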
3213
3214
static ssize_t ioc_qos_write(struct kernfs_open_file *of, char *input,
3215
size_t nbytes, loff_t off)
3216
{
3217
struct blkg_conf_ctx ctx;
3218
struct gendisk *disk;
3219
struct ioc *ioc;
3220
u32 qos[NR_QOS_PARAMS];
3221
bool enable, user;
3222
char *body, *p;
3223
unsigned long memflags;
3224
int ret;
3225
3226
blkg_conf_init(&ctx, input);
3227
3228
memflags = blkg_conf_open_bdev_frozen(&ctx);
3229
if (IS_ERR_VALUE(memflags)) {
3230
ret = memflags;
3231
goto err;
3232
}
3233
3234
body = ctx.body;
3235
disk = ctx.bdev->bd_disk;
3236
if (!queue_is_mq(disk->queue)) {
3237
ret = -EOPNOTSUPP;
3238
goto err;
3239
}
3240
3241
ioc = q_to_ioc(disk->queue);
3242
if (!ioc) {
3243
ret = blk_iocost_init(disk);
3244
if (ret)
3245
goto err;
3246
ioc = q_to_ioc(disk->queue);
3247
}
3248
3249
blk_mq_quiesce_queue(disk->queue);
3250
3251
spin_lock_irq(&ioc->lock);
3252
memcpy(qos, ioc->params.qos, sizeof(qos));
3253
enable = ioc->enabled;
3254
user = ioc->user_qos_params;
3255
3256
while ((p = strsep(&body, " \t\n"))) {
3257
substring_t args[MAX_OPT_ARGS];
3258
char buf[32];
3259
int tok;
3260
s64 v;
3261
3262
if (!*p)
3263
continue;
3264
3265
switch (match_token(p, qos_ctrl_tokens, args)) {
3266
case QOS_ENABLE:
3267
if (match_u64(&args[0], &v))
3268
goto einval;
3269
enable = v;
3270
continue;
3271
case QOS_CTRL:
3272
match_strlcpy(buf, &args[0], sizeof(buf));
3273
if (!strcmp(buf, "auto"))
3274
user = false;
3275
else if (!strcmp(buf, "user"))
3276
user = true;
3277
else
3278
goto einval;
3279
continue;
3280
}
3281
3282
tok = match_token(p, qos_tokens, args);
3283
switch (tok) {
3284
case QOS_RPPM:
3285
case QOS_WPPM:
3286
if (match_strlcpy(buf, &args[0], sizeof(buf)) >=
3287
sizeof(buf))
3288
goto einval;
3289
if (cgroup_parse_float(buf, 2, &v))
3290
goto einval;
3291
if (v < 0 || v > 10000)
3292
goto einval;
3293
qos[tok] = v * 100;
3294
break;
3295
case QOS_RLAT:
3296
case QOS_WLAT:
3297
if (match_u64(&args[0], &v))
3298
goto einval;
3299
qos[tok] = v;
3300
break;
3301
case QOS_MIN:
3302
case QOS_MAX:
3303
if (match_strlcpy(buf, &args[0], sizeof(buf)) >=
3304
sizeof(buf))
3305
goto einval;
3306
if (cgroup_parse_float(buf, 2, &v))
3307
goto einval;
3308
if (v < 0)
3309
goto einval;
3310
qos[tok] = clamp_t(s64, v * 100,
3311
VRATE_MIN_PPM, VRATE_MAX_PPM);
3312
break;
3313
default:
3314
goto einval;
3315
}
3316
user = true;
3317
}
3318
3319
if (qos[QOS_MIN] > qos[QOS_MAX])
3320
goto einval;
3321
3322
if (enable && !ioc->enabled) {
3323
blk_stat_enable_accounting(disk->queue);
3324
blk_queue_flag_set(QUEUE_FLAG_RQ_ALLOC_TIME, disk->queue);
3325
ioc->enabled = true;
3326
} else if (!enable && ioc->enabled) {
3327
blk_stat_disable_accounting(disk->queue);
3328
blk_queue_flag_clear(QUEUE_FLAG_RQ_ALLOC_TIME, disk->queue);
3329
ioc->enabled = false;
3330
}
3331
3332
if (user) {
3333
memcpy(ioc->params.qos, qos, sizeof(qos));
3334
ioc->user_qos_params = true;
3335
} else {
3336
ioc->user_qos_params = false;
3337
}
3338
3339
ioc_refresh_params(ioc, true);
3340
spin_unlock_irq(&ioc->lock);
3341
3342
if (enable)
3343
wbt_disable_default(disk);
3344
else
3345
wbt_enable_default(disk);
3346
3347
blk_mq_unquiesce_queue(disk->queue);
3348
3349
blkg_conf_exit_frozen(&ctx, memflags);
3350
return nbytes;
3351
einval:
3352
spin_unlock_irq(&ioc->lock);
3353
blk_mq_unquiesce_queue(disk->queue);
3354
ret = -EINVAL;
3355
err:
3356
blkg_conf_exit_frozen(&ctx, memflags);
3357
return ret;
3358
}
3359
3360
static u64 ioc_cost_model_prfill(struct seq_file *sf,
3361
struct blkg_policy_data *pd, int off)
3362
{
3363
const char *dname = blkg_dev_name(pd->blkg);
3364
struct ioc *ioc = pd_to_iocg(pd)->ioc;
3365
u64 *u = ioc->params.i_lcoefs;
3366
3367
if (!dname)
3368
return 0;
3369
3370
spin_lock(&ioc->lock);
3371
seq_printf(sf, "%s ctrl=%s model=linear "
3372
"rbps=%llu rseqiops=%llu rrandiops=%llu "
3373
"wbps=%llu wseqiops=%llu wrandiops=%llu\n",
3374
dname, ioc->user_cost_model ? "user" : "auto",
3375
u[I_LCOEF_RBPS], u[I_LCOEF_RSEQIOPS], u[I_LCOEF_RRANDIOPS],
3376
u[I_LCOEF_WBPS], u[I_LCOEF_WSEQIOPS], u[I_LCOEF_WRANDIOPS]);
3377
spin_unlock(&ioc->lock);
3378
return 0;
3379
}
3380
3381
static int ioc_cost_model_show(struct seq_file *sf, void *v)
3382
{
3383
struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
3384
3385
blkcg_print_blkgs(sf, blkcg, ioc_cost_model_prfill,
3386
&blkcg_policy_iocost, seq_cft(sf)->private, false);
3387
return 0;
3388
}
3389
3390
static const match_table_t cost_ctrl_tokens = {
3391
{ COST_CTRL, "ctrl=%s" },
3392
{ COST_MODEL, "model=%s" },
3393
{ NR_COST_CTRL_PARAMS, NULL },
3394
};
3395
3396
static const match_table_t i_lcoef_tokens = {
3397
{ I_LCOEF_RBPS, "rbps=%u" },
3398
{ I_LCOEF_RSEQIOPS, "rseqiops=%u" },
3399
{ I_LCOEF_RRANDIOPS, "rrandiops=%u" },
3400
{ I_LCOEF_WBPS, "wbps=%u" },
3401
{ I_LCOEF_WSEQIOPS, "wseqiops=%u" },
3402
{ I_LCOEF_WRANDIOPS, "wrandiops=%u" },
3403
{ NR_I_LCOEFS, NULL },
3404
};
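/*
 * Correspondingly, an io.cost.model write might look like the following
 * (illustrative values, 8:16 as a placeholder MAJ:MIN):
 *
 *   8:16 ctrl=user model=linear rbps=174019176 rseqiops=41708 rrandiops=370 wbps=178075866 wseqiops=42705 wrandiops=378
 *
 * matching the format printed by ioc_cost_model_prfill() above.
 */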
3405
3406
static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
3407
size_t nbytes, loff_t off)
3408
{
3409
struct blkg_conf_ctx ctx;
3410
struct request_queue *q;
3411
unsigned int memflags;
3412
struct ioc *ioc;
3413
u64 u[NR_I_LCOEFS];
3414
bool user;
3415
char *body, *p;
3416
int ret;
3417
3418
blkg_conf_init(&ctx, input);
3419
3420
ret = blkg_conf_open_bdev(&ctx);
3421
if (ret)
3422
goto err;
3423
3424
body = ctx.body;
3425
q = bdev_get_queue(ctx.bdev);
3426
if (!queue_is_mq(q)) {
3427
ret = -EOPNOTSUPP;
3428
goto err;
3429
}
3430
3431
ioc = q_to_ioc(q);
3432
if (!ioc) {
3433
ret = blk_iocost_init(ctx.bdev->bd_disk);
3434
if (ret)
3435
goto err;
3436
ioc = q_to_ioc(q);
3437
}
3438
3439
memflags = blk_mq_freeze_queue(q);
3440
blk_mq_quiesce_queue(q);
3441
3442
spin_lock_irq(&ioc->lock);
3443
memcpy(u, ioc->params.i_lcoefs, sizeof(u));
3444
user = ioc->user_cost_model;
3445
3446
while ((p = strsep(&body, " \t\n"))) {
3447
substring_t args[MAX_OPT_ARGS];
3448
char buf[32];
3449
int tok;
3450
u64 v;
3451
3452
if (!*p)
3453
continue;
3454
3455
switch (match_token(p, cost_ctrl_tokens, args)) {
3456
case COST_CTRL:
3457
match_strlcpy(buf, &args[0], sizeof(buf));
3458
if (!strcmp(buf, "auto"))
3459
user = false;
3460
else if (!strcmp(buf, "user"))
3461
user = true;
3462
else
3463
goto einval;
3464
continue;
3465
case COST_MODEL:
3466
match_strlcpy(buf, &args[0], sizeof(buf));
3467
if (strcmp(buf, "linear"))
3468
goto einval;
3469
continue;
3470
}
3471
3472
tok = match_token(p, i_lcoef_tokens, args);
3473
if (tok == NR_I_LCOEFS)
3474
goto einval;
3475
if (match_u64(&args[0], &v))
3476
goto einval;
3477
u[tok] = v;
3478
user = true;
3479
}
3480
3481
if (user) {
3482
memcpy(ioc->params.i_lcoefs, u, sizeof(u));
3483
ioc->user_cost_model = true;
3484
} else {
3485
ioc->user_cost_model = false;
3486
}
3487
ioc_refresh_params(ioc, true);
3488
spin_unlock_irq(&ioc->lock);
3489
3490
blk_mq_unquiesce_queue(q);
3491
blk_mq_unfreeze_queue(q, memflags);
3492
3493
blkg_conf_exit(&ctx);
3494
return nbytes;
3495
3496
einval:
3497
spin_unlock_irq(&ioc->lock);
3498
3499
blk_mq_unquiesce_queue(q);
3500
blk_mq_unfreeze_queue(q, memflags);
3501
3502
ret = -EINVAL;
3503
err:
3504
blkg_conf_exit(&ctx);
3505
return ret;
3506
}
3507
3508
static struct cftype ioc_files[] = {
3509
{
3510
.name = "weight",
3511
.flags = CFTYPE_NOT_ON_ROOT,
3512
.seq_show = ioc_weight_show,
3513
.write = ioc_weight_write,
3514
},
3515
{
3516
.name = "cost.qos",
3517
.flags = CFTYPE_ONLY_ON_ROOT,
3518
.seq_show = ioc_qos_show,
3519
.write = ioc_qos_write,
3520
},
3521
{
3522
.name = "cost.model",
3523
.flags = CFTYPE_ONLY_ON_ROOT,
3524
.seq_show = ioc_cost_model_show,
3525
.write = ioc_cost_model_write,
3526
},
3527
{}
3528
};
3529
3530
static struct blkcg_policy blkcg_policy_iocost = {
3531
.dfl_cftypes = ioc_files,
3532
.cpd_alloc_fn = ioc_cpd_alloc,
3533
.cpd_free_fn = ioc_cpd_free,
3534
.pd_alloc_fn = ioc_pd_alloc,
3535
.pd_init_fn = ioc_pd_init,
3536
.pd_free_fn = ioc_pd_free,
3537
.pd_stat_fn = ioc_pd_stat,
3538
};
3539
3540
static int __init ioc_init(void)
3541
{
3542
return blkcg_policy_register(&blkcg_policy_iocost);
3543
}
3544
3545
static void __exit ioc_exit(void)
3546
{
3547
blkcg_policy_unregister(&blkcg_policy_iocost);
3548
}
3549
3550
module_init(ioc_init);
3551
module_exit(ioc_exit);
3552
3553