// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * Budget Fair Queueing (BFQ) I/O scheduler.
 *
 * Based on ideas and code from CFQ:
 * Copyright (C) 2003 Jens Axboe <[email protected]>
 *
 * Copyright (C) 2008 Fabio Checconi <[email protected]>
 *                    Paolo Valente <[email protected]>
 *
 * Copyright (C) 2010 Paolo Valente <[email protected]>
 *                    Arianna Avanzini <[email protected]>
 *
 * Copyright (C) 2017 Paolo Valente <[email protected]>
 *
 * BFQ is a proportional-share I/O scheduler, with some extra
 * low-latency capabilities. BFQ also supports full hierarchical
 * scheduling through cgroups. The next paragraphs provide an
 * introduction to BFQ's inner workings. Details on BFQ benefits, usage
 * and limitations can be found in Documentation/block/bfq-iosched.rst.
 *
 * BFQ is a proportional-share storage-I/O scheduling algorithm based
 * on the slice-by-slice service scheme of CFQ. But BFQ assigns
 * budgets, measured in number of sectors, to processes instead of
 * time slices. The device is not granted to the in-service process
 * for a given time slice, but until it has exhausted its assigned
 * budget. This change from the time to the service domain enables BFQ
 * to distribute the device throughput among processes as desired,
 * without any distortion due to throughput fluctuations, or to device
 * internal queueing. BFQ uses an ad hoc internal scheduler, called
 * B-WF2Q+, to schedule processes according to their budgets. More
 * precisely, BFQ schedules queues associated with processes. Each
 * process/queue is assigned a user-configurable weight, and B-WF2Q+
 * guarantees that each queue receives a fraction of the throughput
 * proportional to its weight. Thanks to the accurate policy of
 * B-WF2Q+, BFQ can afford to assign high budgets to I/O-bound
 * processes issuing sequential requests (to boost the throughput),
 * and yet guarantee a low latency to interactive and soft real-time
 * applications.
 *
 * In particular, to provide these low-latency guarantees, BFQ
 * explicitly privileges the I/O of two classes of time-sensitive
 * applications: interactive and soft real-time. In more detail, BFQ
 * behaves this way if the low_latency parameter is set (default
 * configuration). This feature enables BFQ to provide applications in
 * these classes with a very low latency.
 *
 * To implement this feature, BFQ constantly tries to detect whether
 * the I/O requests in a bfq_queue come from an interactive or a soft
 * real-time application. For brevity, in these cases, the queue is
 * said to be interactive or soft real-time. In both cases, BFQ
 * privileges the service of the queue, over that of non-interactive
 * and non-soft-real-time queues. This privileging is performed,
 * mainly, by raising the weight of the queue. So, for brevity, we
 * call just weight-raising periods the time periods during which a
 * queue is privileged, because deemed interactive or soft real-time.
 *
 * The detection of soft real-time queues/applications is described in
 * detail in the comments on the function
 * bfq_bfqq_softrt_next_start. On the other hand, the detection of an
 * interactive queue works as follows: a queue is deemed interactive
 * if it is constantly non-empty only for a limited time interval,
 * after which it does become empty. The queue may be deemed
 * interactive again (for a limited time), if it restarts being
 * constantly non-empty, provided that this happens only after the
 * queue has remained empty for a given minimum idle time.
 *
 * By default, BFQ automatically computes the above maximum time
 * interval, i.e., the time interval after which a constantly
 * non-empty queue stops being deemed interactive. Since a queue is
 * weight-raised while it is deemed interactive, this maximum time
 * interval happens to coincide with the (maximum) duration of the
 * weight-raising for interactive queues.
 *
 * Finally, BFQ also features additional heuristics for
 * preserving both a low latency and a high throughput on NCQ-capable,
 * rotational or flash-based devices, and for getting the job done
 * quickly for applications consisting of many I/O-bound processes.
 *
 * NOTE: if the main or only goal, with a given device, is to achieve
 * the maximum-possible throughput at all times, then do switch off
 * all low-latency heuristics for that device, by setting low_latency
 * to 0.
 *
 * BFQ is described in [1], which also contains a reference to the
 * initial, more theoretical paper on BFQ. The interested reader
 * can find in the latter paper full details on the main algorithm, as
 * well as formulas of the guarantees and formal proofs of all the
 * properties. With respect to the version of BFQ presented in these
 * papers, this implementation adds a few more heuristics, such as the
 * ones that guarantee a low latency to interactive and soft real-time
 * applications, and a hierarchical extension based on H-WF2Q+.
 *
 * B-WF2Q+ is based on WF2Q+, which is described in [2], together with
 * H-WF2Q+, while the augmented tree used here to implement B-WF2Q+
 * with O(log N) complexity derives from the one introduced with EEVDF
 * in [3].
 *
 * [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
 *     Scheduler", Proceedings of the First Workshop on Mobile System
 *     Technologies (MST-2015), May 2015.
 *     http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
 *
 * [2] Jon C.R. Bennett and H. Zhang, "Hierarchical Packet Fair Queueing
 *     Algorithms", IEEE/ACM Transactions on Networking, 5(5):675-689,
 *     Oct 1997.
 *
 *     http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
 *
 * [3] I. Stoica and H. Abdel-Wahab, "Earliest Eligible Virtual Deadline
 *     First: A Flexible and Accurate Mechanism for Proportional Share
 *     Resource Allocation", technical report.
 *
 *     http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
 */
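
/*
 * A rough illustration of the proportional-share policy described
 * above (example numbers only, not from the original sources): two
 * always-backlogged queues with weights 100 and 300 sharing an
 * otherwise idle device receive about 25% and 75% of the aggregate
 * throughput over time, regardless of the request sizes they issue.
 * Budgets are accounted in sectors, so a maximum budget of 16 * 1024
 * sectors corresponds to at most 8 MiB of service per scheduling slot
 * with 512-byte sectors.
 */
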
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/blkdev.h>
#include <linux/cgroup.h>
#include <linux/ktime.h>
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/sbitmap.h>
#include <linux/delay.h>
#include <linux/backing-dev.h>

#include <trace/events/block.h>

#include "elevator.h"
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-sched.h"
#include "bfq-iosched.h"
#include "blk-wbt.h"

#define BFQ_BFQQ_FNS(name)						\
void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)			\
{									\
	__set_bit(BFQQF_##name, &(bfqq)->flags);			\
}									\
void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)			\
{									\
	__clear_bit(BFQQF_##name, &(bfqq)->flags);			\
}									\
int bfq_bfqq_##name(const struct bfq_queue *bfqq)			\
{									\
	return test_bit(BFQQF_##name, &(bfqq)->flags);			\
}

BFQ_BFQQ_FNS(just_created);
BFQ_BFQQ_FNS(busy);
BFQ_BFQQ_FNS(wait_request);
BFQ_BFQQ_FNS(non_blocking_wait_rq);
BFQ_BFQQ_FNS(fifo_expire);
BFQ_BFQQ_FNS(has_short_ttime);
BFQ_BFQQ_FNS(sync);
BFQ_BFQQ_FNS(IO_bound);
BFQ_BFQQ_FNS(in_large_burst);
BFQ_BFQQ_FNS(coop);
BFQ_BFQQ_FNS(split_coop);
BFQ_BFQQ_FNS(softrt_update);
#undef BFQ_BFQQ_FNS						\

/* Expiration time of async (0) and sync (1) requests, in ns. */
static const u64 bfq_fifo_expire[2] = { NSEC_PER_SEC / 4, NSEC_PER_SEC / 8 };

/* Maximum backwards seek (magic number lifted from CFQ), in KiB. */
static const int bfq_back_max = 16 * 1024;

/* Penalty of a backwards seek, in number of sectors. */
static const int bfq_back_penalty = 2;

/* Idling period duration, in ns. */
static u64 bfq_slice_idle = NSEC_PER_SEC / 125;

/* Minimum number of assigned budgets for which stats are safe to compute. */
static const int bfq_stats_min_budgets = 194;

/* Default maximum budget values, in sectors and number of requests. */
static const int bfq_default_max_budget = 16 * 1024;

/*
 * When a sync request is dispatched, the queue that contains that
 * request, and all the ancestor entities of that queue, are charged
 * with the number of sectors of the request. In contrast, if the
 * request is async, then the queue and its ancestor entities are
 * charged with the number of sectors of the request, multiplied by
 * the factor below. This throttles the bandwidth for async I/O,
 * w.r.t. sync I/O, and it is done to counter the tendency of async
 * writes to steal I/O throughput from reads.
 *
 * The current value of this parameter is the result of a tuning with
 * several hardware and software configurations. We tried to find the
 * lowest value for which writes do not cause noticeable problems for
 * reads. In fact, the lower this parameter is, the more stable the I/O
 * control, in the following respect. The lower this parameter is, the
 * less the bandwidth enjoyed by a group decreases
 * - when the group does writes, w.r.t. when it does reads;
 * - when other groups do reads, w.r.t. when they do writes.
 */
static const int bfq_async_charge_factor = 3;
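
/*
 * A rough worked example of the charging rule above (illustrative
 * numbers only): a queue dispatching a 256-sector sync read is charged
 * 256 sectors of its budget, whereas a 256-sector async write is
 * charged 256 * bfq_async_charge_factor = 768 sectors (see
 * bfq_serv_to_charge() below, which also skips the extra charge for
 * weight-raised queues and in asymmetric scenarios).
 */
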
/* Default timeout values, in jiffies, approximating CFQ defaults. */
const int bfq_timeout = HZ / 8;

/*
 * Time limit for merging (see comments in bfq_setup_cooperator). Set
 * to the slowest value that, in our tests, proved to be effective in
 * removing false positives, while not causing true positives to miss
 * queue merging.
 *
 * As can be deduced from the low time limit below, queue merging, if
 * successful, happens at the very beginning of the I/O of the involved
 * cooperating processes, as a consequence of the arrival of the very
 * first requests from each cooperator. After that, there is very
 * little chance to find cooperators.
 */
static const unsigned long bfq_merge_time_limit = HZ/10;

static struct kmem_cache *bfq_pool;

/* Below this threshold (in ns), we consider thinktime immediate. */
#define BFQ_MIN_TT		(2 * NSEC_PER_MSEC)

/* hw_tag detection: parallel requests threshold and min samples needed. */
#define BFQ_HW_QUEUE_THRESHOLD	3
#define BFQ_HW_QUEUE_SAMPLES	32

#define BFQQ_SEEK_THR		(sector_t)(8 * 100)
#define BFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
#define BFQ_RQ_SEEKY(bfqd, last_pos, rq) \
	(get_sdist(last_pos, rq) >		\
	 BFQQ_SEEK_THR &&			\
	 (!blk_queue_nonrot(bfqd->queue) ||	\
	  blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
#define BFQQ_CLOSE_THR		(sector_t)(8 * 1024)
#define BFQQ_SEEKY(bfqq)	(hweight32(bfqq->seek_history) > 19)
/*
 * Sync random I/O is likely to be confused with soft real-time I/O,
 * because it is characterized by limited throughput and apparently
 * isochronous arrival pattern. To avoid false positives, queues
 * containing only random (seeky) I/O are prevented from being tagged
 * as soft real-time.
 */
#define BFQQ_TOTALLY_SEEKY(bfqq)	(bfqq->seek_history == -1)

/* Min number of samples required to perform peak-rate update */
#define BFQ_RATE_MIN_SAMPLES	32
/* Min observation time interval required to perform a peak-rate update (ns) */
#define BFQ_RATE_MIN_INTERVAL	(300*NSEC_PER_MSEC)
/* Target observation time interval for a peak-rate update (ns) */
#define BFQ_RATE_REF_INTERVAL	NSEC_PER_SEC

/*
 * Shift used for peak-rate fixed precision calculations.
 * With
 * - the current shift: 16 positions
 * - the current type used to store rate: u32
 * - the current unit of measure for rate: [sectors/usec], or, more precisely,
 *   [(sectors/usec) / 2^BFQ_RATE_SHIFT] to take into account the shift,
 * the range of rates that can be stored is
 * [1 / 2^BFQ_RATE_SHIFT, 2^(32 - BFQ_RATE_SHIFT)] sectors/usec =
 * [1 / 2^16, 2^16] sectors/usec = [15e-6, 65536] sectors/usec =
 * [15, 65G] sectors/sec
 * Which, assuming a sector size of 512B, corresponds to a range of
 * [7.5K, 33T] B/sec
 */
#define BFQ_RATE_SHIFT		16
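
/*
 * Worked example of the fixed-point representation above (illustrative
 * numbers only): a device with a measured peak rate of about 1 GB/s
 * transfers roughly 2 * 10^6 512-byte sectors per second, i.e., about
 * 2 sectors/usec. With BFQ_RATE_SHIFT == 16, that rate is stored as
 * 2 << 16 = 131072, comfortably within the u32 range computed above.
 */
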
/*
 * When configured for computing the duration of the weight-raising
 * for interactive queues automatically (see the comments at the
 * beginning of this file), BFQ does it using the following formula:
 * duration = (ref_rate / r) * ref_wr_duration,
 * where r is the peak rate of the device, and ref_rate and
 * ref_wr_duration are two reference parameters. In particular,
 * ref_rate is the peak rate of the reference storage device (see
 * below), and ref_wr_duration is about the maximum time needed, with
 * BFQ and while reading two files in parallel, to load typical large
 * applications on the reference device (see the comments on
 * max_service_from_wr below, for more details on how ref_wr_duration
 * is obtained). In practice, the slower/faster the device at hand
 * is, the more/less it takes to load applications with respect to the
 * reference device. Accordingly, the longer/shorter BFQ grants
 * weight raising to interactive applications.
 *
 * BFQ uses two different reference pairs (ref_rate, ref_wr_duration),
 * depending on whether the device is rotational or non-rotational.
 *
 * In the following definitions, ref_rate[0] and ref_wr_duration[0]
 * are the reference values for a rotational device, whereas
 * ref_rate[1] and ref_wr_duration[1] are the reference values for a
 * non-rotational device. The reference rates are not the actual peak
 * rates of the devices used as a reference, but slightly lower
 * values. The reason for using slightly lower values is that the
 * peak-rate estimator tends to yield slightly lower values than the
 * actual peak rate (it can yield the actual peak rate only if there
 * is only one process doing I/O, and the process does sequential
 * I/O).
 *
 * The reference peak rates are measured in sectors/usec, left-shifted
 * by BFQ_RATE_SHIFT.
 */
static int ref_rate[2] = {14000, 33000};
/*
 * To improve readability, a conversion function is used to initialize
 * the following array, which entails that the array can be
 * initialized only in a function.
 */
static int ref_wr_duration[2];

/*
 * BFQ uses the above-detailed, time-based weight-raising mechanism to
 * privilege interactive tasks. This mechanism is vulnerable to the
 * following false positives: I/O-bound applications that will go on
 * doing I/O for much longer than the duration of weight
 * raising. These applications have basically no benefit from being
 * weight-raised at the beginning of their I/O. On the opposite end,
 * while being weight-raised, these applications
 * a) unjustly steal throughput from applications that may actually need
 * low latency;
 * b) make BFQ uselessly perform device idling; device idling results
 * in loss of device throughput with most flash-based storage, and may
 * increase latencies when used purposelessly.
 *
 * BFQ tries to reduce these problems, by adopting the following
 * countermeasure. To introduce this countermeasure, we need first to
 * finish explaining how the duration of weight-raising for
 * interactive tasks is computed.
 *
 * For a bfq_queue deemed as interactive, the duration of weight
 * raising is dynamically adjusted, as a function of the estimated
 * peak rate of the device, so as to be equal to the time needed to
 * execute the 'largest' interactive task we benchmarked so far. By
 * largest task, we mean the task for which each involved process has
 * to do more I/O than for any of the other tasks we benchmarked. This
 * reference interactive task is the start-up of LibreOffice Writer,
 * and in this task each process/bfq_queue needs to have at most ~110K
 * sectors transferred.
 *
 * This last piece of information enables BFQ to reduce the actual
 * duration of weight-raising for at least one class of I/O-bound
 * applications: those doing sequential or quasi-sequential I/O. An
 * example is file copy. In fact, once started, the main I/O-bound
 * processes of these applications usually consume the above 110K
 * sectors in much less time than the processes of an application that
 * is starting, because these I/O-bound processes will greedily devote
 * almost all their CPU cycles only to their target,
 * throughput-friendly I/O operations. This is even more true if BFQ
 * happens to be underestimating the device peak rate, and thus
 * overestimating the duration of weight raising. But, according to
 * our measurements, once they have transferred 110K sectors, these
 * processes have no right to be weight-raised any longer.
 *
 * Based on the last consideration, BFQ ends weight-raising for a
 * bfq_queue if the latter happens to have received an amount of
 * service at least equal to the following constant. The constant is
 * set to slightly more than 110K, to have a minimum safety margin.
 *
 * This early ending of weight-raising reduces the amount of time
 * during which interactive false positives cause the two problems
 * described at the beginning of these comments.
 */
static const unsigned long max_service_from_wr = 120000;
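
/*
 * A rough numeric illustration of the two mechanisms above (example
 * values only): on a rotational drive whose estimated peak rate r is
 * twice ref_rate[0], interactive weight-raising lasts about
 * ref_wr_duration[0] / 2, following duration = (ref_rate / r) *
 * ref_wr_duration. Independently of that duration, a queue that has
 * already received max_service_from_wr = 120000 sectors (~60 MB with
 * 512-byte sectors) of service while weight-raised is switched back to
 * its normal weight early.
 */
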
/*
 * Maximum time between the creation of two queues, for stable merge
 * to be activated (in ms)
 */
static const unsigned long bfq_activation_stable_merging = 600;
/*
 * Minimum time to be waited before evaluating delayed stable merge (in ms)
 */
static const unsigned long bfq_late_stable_merging = 600;

#define RQ_BIC(rq)		((struct bfq_io_cq *)((rq)->elv.priv[0]))
#define RQ_BFQQ(rq)		((rq)->elv.priv[1])

struct bfq_queue *bic_to_bfqq(struct bfq_io_cq *bic, bool is_sync,
			      unsigned int actuator_idx)
{
	if (is_sync)
		return bic->bfqq[1][actuator_idx];

	return bic->bfqq[0][actuator_idx];
}

static void bfq_put_stable_ref(struct bfq_queue *bfqq);

void bic_set_bfqq(struct bfq_io_cq *bic,
		  struct bfq_queue *bfqq,
		  bool is_sync,
		  unsigned int actuator_idx)
{
	struct bfq_queue *old_bfqq = bic->bfqq[is_sync][actuator_idx];

	/*
	 * If bfqq != NULL, then a non-stable queue merge between
	 * bic->bfqq and bfqq is happening here. This causes troubles
	 * in the following case: bic->bfqq has also been scheduled
	 * for a possible stable merge with bic->stable_merge_bfqq,
	 * and bic->stable_merge_bfqq == bfqq happens to
	 * hold. Troubles occur because bfqq may then undergo a split,
	 * thereby becoming eligible for a stable merge. Yet, if
	 * bic->stable_merge_bfqq points exactly to bfqq, then bfqq
	 * would be stably merged with itself. To avoid this anomaly,
	 * we cancel the stable merge if
	 * bic->stable_merge_bfqq == bfqq.
	 */
	struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[actuator_idx];

	/* Clear bic pointer if bfqq is detached from this bic */
	if (old_bfqq && old_bfqq->bic == bic)
		old_bfqq->bic = NULL;

	if (is_sync)
		bic->bfqq[1][actuator_idx] = bfqq;
	else
		bic->bfqq[0][actuator_idx] = bfqq;

	if (bfqq && bfqq_data->stable_merge_bfqq == bfqq) {
		/*
		 * Actually, these same instructions are executed also
		 * in bfq_setup_cooperator, in case of abort or actual
		 * execution of a stable merge. We could avoid
		 * repeating these instructions there too, but if we
		 * did so, we would nest even more complexity in this
		 * function.
		 */
		bfq_put_stable_ref(bfqq_data->stable_merge_bfqq);

		bfqq_data->stable_merge_bfqq = NULL;
	}
}

struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic)
{
	return bic->icq.q->elevator->elevator_data;
}

/**
 * icq_to_bic - convert iocontext queue structure to bfq_io_cq.
 * @icq: the iocontext queue.
 */
static struct bfq_io_cq *icq_to_bic(struct io_cq *icq)
{
	/* bic->icq is the first member, %NULL will convert to %NULL */
	return container_of(icq, struct bfq_io_cq, icq);
}

/**
 * bfq_bic_lookup - search, in the icq list of current's io_context, for a
 *		    bic associated with request queue @q.
 * @q: the request queue.
 */
static struct bfq_io_cq *bfq_bic_lookup(struct request_queue *q)
{
	if (!current->io_context)
		return NULL;

	return icq_to_bic(ioc_lookup_icq(q));
}

/*
 * Scheduler run of queue, if there are requests pending and no one in the
 * driver that will restart queueing.
 */
void bfq_schedule_dispatch(struct bfq_data *bfqd)
{
	lockdep_assert_held(&bfqd->lock);

	if (bfqd->queued != 0) {
		bfq_log(bfqd, "schedule dispatch");
		blk_mq_run_hw_queues(bfqd->queue, true);
	}
}

#define bfq_class_idle(bfqq)	((bfqq)->ioprio_class == IOPRIO_CLASS_IDLE)

#define bfq_sample_valid(samples)	((samples) > 80)
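
/*
 * Illustration of the backward-seek penalty applied by bfq_choose_req()
 * below (example values only): with bfqd->bfq_back_max at its default
 * of 16 * 1024 KiB, i.e., back_max = 32768 sectors, and
 * bfqd->bfq_back_penalty = 2, suppose the head is at sector 10000 and
 * both candidate requests are sync and non-META. A request at sector
 * 9000 (1000 sectors behind the head) gets distance
 * (10000 - 9000) * 2 = 2000, while a request at sector 13000 (3000
 * sectors ahead) gets distance 3000, so the request behind the head is
 * still preferred in this case.
 */
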
/*
 * Lifted from AS - choose which of rq1 and rq2 is best served now.
 * We choose the request that is closer to the head right now. Distance
 * behind the head is penalized and only allowed to a certain extent.
 */
static struct request *bfq_choose_req(struct bfq_data *bfqd,
				      struct request *rq1,
				      struct request *rq2,
				      sector_t last)
{
	sector_t s1, s2, d1 = 0, d2 = 0;
	unsigned long back_max;
#define BFQ_RQ1_WRAP	0x01 /* request 1 wraps */
#define BFQ_RQ2_WRAP	0x02 /* request 2 wraps */
	unsigned int wrap = 0; /* bit mask: requests behind the disk head? */

	if (!rq1 || rq1 == rq2)
		return rq2;
	if (!rq2)
		return rq1;

	if (rq_is_sync(rq1) && !rq_is_sync(rq2))
		return rq1;
	else if (rq_is_sync(rq2) && !rq_is_sync(rq1))
		return rq2;
	if ((rq1->cmd_flags & REQ_META) && !(rq2->cmd_flags & REQ_META))
		return rq1;
	else if ((rq2->cmd_flags & REQ_META) && !(rq1->cmd_flags & REQ_META))
		return rq2;

	s1 = blk_rq_pos(rq1);
	s2 = blk_rq_pos(rq2);

	/*
	 * By definition, 1KiB is 2 sectors.
	 */
	back_max = bfqd->bfq_back_max * 2;

	/*
	 * Strict one way elevator _except_ in the case where we allow
	 * short backward seeks which are biased as twice the cost of a
	 * similar forward seek.
	 */
	if (s1 >= last)
		d1 = s1 - last;
	else if (s1 + back_max >= last)
		d1 = (last - s1) * bfqd->bfq_back_penalty;
	else
		wrap |= BFQ_RQ1_WRAP;

	if (s2 >= last)
		d2 = s2 - last;
	else if (s2 + back_max >= last)
		d2 = (last - s2) * bfqd->bfq_back_penalty;
	else
		wrap |= BFQ_RQ2_WRAP;

	/* Found required data */

	/*
	 * By doing switch() on the bit mask "wrap" we avoid having to
	 * check two variables for all permutations: --> faster!
	 */
	switch (wrap) {
	case 0: /* common case for CFQ: rq1 and rq2 not wrapped */
		if (d1 < d2)
			return rq1;
		else if (d2 < d1)
			return rq2;

		if (s1 >= s2)
			return rq1;
		else
			return rq2;

	case BFQ_RQ2_WRAP:
		return rq1;
	case BFQ_RQ1_WRAP:
		return rq2;
	case BFQ_RQ1_WRAP|BFQ_RQ2_WRAP: /* both rqs wrapped */
	default:
		/*
		 * Since both rqs are wrapped,
		 * start with the one that's further behind head
		 * (--> only *one* back seek required),
		 * since back seek takes more time than forward.
		 */
		if (s1 <= s2)
			return rq1;
		else
			return rq2;
	}
}

#define BFQ_LIMIT_INLINE_DEPTH 16

#ifdef CONFIG_BFQ_GROUP_IOSCHED
static bool bfqq_request_over_limit(struct bfq_data *bfqd,
				    struct bfq_io_cq *bic, blk_opf_t opf,
				    unsigned int act_idx, int limit)
{
	struct bfq_entity *inline_entities[BFQ_LIMIT_INLINE_DEPTH];
	struct bfq_entity **entities = inline_entities;
	int alloc_depth = BFQ_LIMIT_INLINE_DEPTH;
	struct bfq_sched_data *sched_data;
	struct bfq_entity *entity;
	struct bfq_queue *bfqq;
	unsigned long wsum;
	bool ret = false;
	int depth;
	int level;

retry:
	spin_lock_irq(&bfqd->lock);
	bfqq = bic_to_bfqq(bic, op_is_sync(opf), act_idx);
	if (!bfqq)
		goto out;

	entity = &bfqq->entity;
	if (!entity->on_st_or_in_serv)
		goto out;

	/* +1 for bfqq entity, root cgroup not included */
	depth = bfqg_to_blkg(bfqq_group(bfqq))->blkcg->css.cgroup->level + 1;
	if (depth > alloc_depth) {
		spin_unlock_irq(&bfqd->lock);
		if (entities != inline_entities)
			kfree(entities);
		entities = kmalloc_array(depth, sizeof(*entities), GFP_NOIO);
		if (!entities)
			return false;
		alloc_depth = depth;
		goto retry;
	}

	sched_data = entity->sched_data;
	/* Gather our ancestors as we need to traverse them in reverse order */
	level = 0;
	for_each_entity(entity) {
		/*
		 * If at some level entity is not even active, allow request
		 * queueing so that BFQ knows there's work to do and activate
		 * entities.
		 */
		if (!entity->on_st_or_in_serv)
			goto out;
		/* Uh, more parents than cgroup subsystem thinks? */
		if (WARN_ON_ONCE(level >= depth))
			break;
		entities[level++] = entity;
	}
	WARN_ON_ONCE(level != depth);
	for (level--; level >= 0; level--) {
		entity = entities[level];
		if (level > 0) {
			wsum = bfq_entity_service_tree(entity)->wsum;
		} else {
			int i;
			/*
			 * For bfqq itself we take into account service trees
			 * of all higher priority classes and multiply their
			 * weights so that low prio queue from higher class
			 * gets more requests than high prio queue from lower
			 * class.
			 */
			wsum = 0;
			for (i = 0; i <= bfqq->ioprio_class - 1; i++) {
				wsum = wsum * IOPRIO_BE_NR +
					sched_data->service_tree[i].wsum;
			}
		}
		if (!wsum)
			continue;
		limit = DIV_ROUND_CLOSEST(limit * entity->weight, wsum);
		if (entity->allocated >= limit) {
			bfq_log_bfqq(bfqq->bfqd, bfqq,
				"too many requests: allocated %d limit %d level %d",
				entity->allocated, limit, level);
			ret = true;
			break;
		}
	}
out:
	spin_unlock_irq(&bfqd->lock);
	if (entities != inline_entities)
		kfree(entities);
	return ret;
}
#else
static bool bfqq_request_over_limit(struct bfq_data *bfqd,
				    struct bfq_io_cq *bic, blk_opf_t opf,
				    unsigned int act_idx, int limit)
{
	return false;
}
#endif

/*
 * Async I/O can easily starve sync I/O (both sync reads and sync
 * writes), by consuming all tags. Similarly, storms of sync writes,
 * such as those that sync(2) may trigger, can starve sync reads.
 * Limit depths of async I/O and sync writes so as to counter both
 * problems.
 *
 * Also, if a bfq queue or its parent cgroup consumes more tags than would
 * be appropriate for its weight, we trim the available tag depth to 1. This
 * avoids a situation where one cgroup can starve another cgroup from tags and
 * thus block service differentiation among cgroups. Note that because the
 * queue / cgroup already has many requests allocated and queued, this does not
 * significantly affect service guarantees coming from the BFQ scheduling
 * algorithm.
 */
static void bfq_limit_depth(blk_opf_t opf, struct blk_mq_alloc_data *data)
{
	struct bfq_data *bfqd = data->q->elevator->elevator_data;
	struct bfq_io_cq *bic = bfq_bic_lookup(data->q);
	unsigned int limit, act_idx;

	/* Sync reads have full depth available */
	if (op_is_sync(opf) && !op_is_write(opf))
		limit = data->q->nr_requests;
	else
		limit = bfqd->async_depths[!!bfqd->wr_busy_queues][op_is_sync(opf)];

	for (act_idx = 0; bic && act_idx < bfqd->num_actuators; act_idx++) {
		/* Fast path to check if bfqq is already allocated. */
		if (!bic_to_bfqq(bic, op_is_sync(opf), act_idx))
			continue;

		/*
		 * Does queue (or any parent entity) exceed number of
		 * requests that should be available to it? Heavily
		 * limit depth so that it cannot consume more
		 * available requests and thus starve other entities.
		 */
		if (bfqq_request_over_limit(bfqd, bic, opf, act_idx, limit)) {
			limit = 1;
			break;
		}
	}

	bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
		__func__, bfqd->wr_busy_queues, op_is_sync(opf), limit);

	if (limit < data->q->nr_requests)
		data->shallow_depth = limit;
}
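
/*
 * A small worked example of the per-level scaling performed by
 * bfqq_request_over_limit() and used by bfq_limit_depth() above
 * (numbers are illustrative): suppose nr_requests is 256 and the limit
 * passed in is 64. If, at some level, the queue's entity has weight
 * 100 while the corresponding total weight (wsum) is 400, the limit
 * becomes DIV_ROUND_CLOSEST(64 * 100, 400) = 16; once the entity
 * already has 16 or more requests allocated, the queue is reported as
 * over limit and bfq_limit_depth() clamps the allocation depth to 1.
 */
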
static struct bfq_queue *
bfq_rq_pos_tree_lookup(struct bfq_data *bfqd, struct rb_root *root,
		       sector_t sector, struct rb_node **ret_parent,
		       struct rb_node ***rb_link)
{
	struct rb_node **p, *parent;
	struct bfq_queue *bfqq = NULL;

	parent = NULL;
	p = &root->rb_node;
	while (*p) {
		struct rb_node **n;

		parent = *p;
		bfqq = rb_entry(parent, struct bfq_queue, pos_node);

		/*
		 * Sort strictly based on sector. Smallest to the left,
		 * largest to the right.
		 */
		if (sector > blk_rq_pos(bfqq->next_rq))
			n = &(*p)->rb_right;
		else if (sector < blk_rq_pos(bfqq->next_rq))
			n = &(*p)->rb_left;
		else
			break;
		p = n;
		bfqq = NULL;
	}

	*ret_parent = parent;
	if (rb_link)
		*rb_link = p;

	bfq_log(bfqd, "rq_pos_tree_lookup %llu: returning %d",
		(unsigned long long)sector,
		bfqq ? bfqq->pid : 0);

	return bfqq;
}

static bool bfq_too_late_for_merging(struct bfq_queue *bfqq)
{
	return bfqq->service_from_backlogged > 0 &&
		time_is_before_jiffies(bfqq->first_IO_time +
				       bfq_merge_time_limit);
}

/*
 * The following function is not marked as __cold merely because it is
 * actually cold; rather, it is marked this way for the same performance
 * goal described in the comments on the likely() at the beginning of
 * bfq_setup_cooperator(). Unexpectedly, to reach an even lower
 * execution time for the case where this function is not invoked, we
 * had to add an unlikely() in each involved if().
 */
void __cold
bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
	struct rb_node **p, *parent;
	struct bfq_queue *__bfqq;

	if (bfqq->pos_root) {
		rb_erase(&bfqq->pos_node, bfqq->pos_root);
		bfqq->pos_root = NULL;
	}

	/* oom_bfqq does not participate in queue merging */
	if (bfqq == &bfqd->oom_bfqq)
		return;

	/*
	 * bfqq cannot be merged any longer (see comments in
	 * bfq_setup_cooperator): no point in adding bfqq into the
	 * position tree.
	 */
	if (bfq_too_late_for_merging(bfqq))
		return;

	if (bfq_class_idle(bfqq))
		return;
	if (!bfqq->next_rq)
		return;

	bfqq->pos_root = &bfqq_group(bfqq)->rq_pos_tree;
	__bfqq = bfq_rq_pos_tree_lookup(bfqd, bfqq->pos_root,
			blk_rq_pos(bfqq->next_rq), &parent, &p);
	if (!__bfqq) {
		rb_link_node(&bfqq->pos_node, parent, p);
		rb_insert_color(&bfqq->pos_node, bfqq->pos_root);
	} else
		bfqq->pos_root = NULL;
}

/*
 * The following function returns false either if every active queue
 * must receive the same share of the throughput (symmetric scenario),
 * or, as a special case, if bfqq must receive a share of the
 * throughput lower than or equal to the share that every other active
 * queue must receive. If bfqq does sync I/O, then these are the only
 * two cases where bfqq happens to be guaranteed its share of the
 * throughput even if I/O dispatching is not plugged when bfqq remains
 * temporarily empty (for more details, see the comments in the
 * function bfq_better_to_idle()). For this reason, the return value
 * of this function is used to check whether I/O-dispatch plugging can
 * be avoided.
 *
 * The above first case (symmetric scenario) occurs when:
 * 1) all active queues have the same weight,
 * 2) all active queues belong to the same I/O-priority class,
 * 3) all active groups at the same level in the groups tree have the same
 *    weight,
 * 4) all active groups at the same level in the groups tree have the same
 *    number of children.
 *
 * Unfortunately, keeping the necessary state for evaluating exactly
 * the last two symmetry sub-conditions above would be quite complex
 * and time consuming. Therefore this function evaluates, instead,
 * only the following stronger three sub-conditions, for which it is
 * much easier to maintain the needed state:
 * 1) all active queues have the same weight,
 * 2) all active queues belong to the same I/O-priority class,
 * 3) there is at most one active group.
 * In particular, the last condition is always true if hierarchical
 * support or the cgroups interface is not enabled, thus no state
 * needs to be maintained in this case.
 */
static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
				    struct bfq_queue *bfqq)
{
	bool smallest_weight = bfqq &&
		bfqq->weight_counter &&
		bfqq->weight_counter ==
		container_of(
			rb_first_cached(&bfqd->queue_weights_tree),
			struct bfq_weight_counter,
			weights_node);

	/*
	 * For queue weights to differ, queue_weights_tree must contain
	 * at least two nodes.
	 */
	bool varied_queue_weights = !smallest_weight &&
		!RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) &&
		(bfqd->queue_weights_tree.rb_root.rb_node->rb_left ||
		 bfqd->queue_weights_tree.rb_root.rb_node->rb_right);

	bool multiple_classes_busy =
		(bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
		(bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
		(bfqd->busy_queues[1] && bfqd->busy_queues[2]);

	return varied_queue_weights || multiple_classes_busy
#ifdef CONFIG_BFQ_GROUP_IOSCHED
	       || bfqd->num_groups_with_pending_reqs > 1
#endif
		;
}

/*
 * If the weight-counter tree passed as input contains no counter for
 * the weight of the input queue, then add that counter; otherwise just
 * increment the existing counter.
 *
 * Note that weight-counter trees contain few nodes in mostly symmetric
 * scenarios. For example, if all queues have the same weight, then the
 * weight-counter tree for the queues may contain at most one node.
 * This holds even if low_latency is on, because weight-raised queues
 * are not inserted in the tree.
 * In most scenarios, the rate at which nodes are created/destroyed
 * should be low too.
 */
void bfq_weights_tree_add(struct bfq_queue *bfqq)
{
	struct rb_root_cached *root = &bfqq->bfqd->queue_weights_tree;
	struct bfq_entity *entity = &bfqq->entity;
	struct rb_node **new = &(root->rb_root.rb_node), *parent = NULL;
	bool leftmost = true;

	/*
	 * Do not insert if the queue is already associated with a
	 * counter, which happens if:
	 * 1) a request arrival has caused the queue to become both
	 *    non-weight-raised, and hence change its weight, and
	 *    backlogged; in this respect, each of the two events
	 *    causes an invocation of this function,
	 * 2) this is the invocation of this function caused by the
	 *    second event. This second invocation is actually useless,
	 *    and we handle this fact by exiting immediately. More
	 *    efficient or clearer solutions might possibly be adopted.
	 */
	if (bfqq->weight_counter)
		return;

	while (*new) {
		struct bfq_weight_counter *__counter = container_of(*new,
						struct bfq_weight_counter,
						weights_node);
		parent = *new;

		if (entity->weight == __counter->weight) {
			bfqq->weight_counter = __counter;
			goto inc_counter;
		}
		if (entity->weight < __counter->weight)
			new = &((*new)->rb_left);
		else {
			new = &((*new)->rb_right);
			leftmost = false;
		}
	}

	bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
				       GFP_ATOMIC);

	/*
	 * In the unlucky event of an allocation failure, we just
	 * exit. This will cause the weight of the queue not to be
	 * considered in bfq_asymmetric_scenario, which, in turn,
	 * causes the scenario to be wrongly deemed symmetric in case
	 * bfqq's weight would have been the only weight making the
	 * scenario asymmetric. On the bright side, no unbalance will
	 * however occur when bfqq becomes inactive again (the
	 * invocation of this function is triggered by an activation
	 * of the queue). In fact, bfq_weights_tree_remove does nothing
	 * if !bfqq->weight_counter.
	 */
	if (unlikely(!bfqq->weight_counter))
		return;

	bfqq->weight_counter->weight = entity->weight;
	rb_link_node(&bfqq->weight_counter->weights_node, parent, new);
	rb_insert_color_cached(&bfqq->weight_counter->weights_node, root,
				leftmost);

inc_counter:
	bfqq->weight_counter->num_active++;
	bfqq->ref++;
}

/*
 * Decrement the weight counter associated with the queue, and, if the
 * counter reaches 0, remove the counter from the tree.
 * See the comments to the function bfq_weights_tree_add() for considerations
 * about overhead.
 */
void bfq_weights_tree_remove(struct bfq_queue *bfqq)
{
	struct rb_root_cached *root;

	if (!bfqq->weight_counter)
		return;

	root = &bfqq->bfqd->queue_weights_tree;
	bfqq->weight_counter->num_active--;
	if (bfqq->weight_counter->num_active > 0)
		goto reset_entity_pointer;

	rb_erase_cached(&bfqq->weight_counter->weights_node, root);
	kfree(bfqq->weight_counter);

reset_entity_pointer:
	bfqq->weight_counter = NULL;
	bfq_put_queue(bfqq);
}

/*
 * Return expired entry, or NULL to just start from scratch in rbtree.
 */
static struct request *bfq_check_fifo(struct bfq_queue *bfqq,
				      struct request *last)
{
	struct request *rq;

	if (bfq_bfqq_fifo_expire(bfqq))
		return NULL;

	bfq_mark_bfqq_fifo_expire(bfqq);

	rq = rq_entry_fifo(bfqq->fifo.next);

	if (rq == last || blk_time_get_ns() < rq->fifo_time)
		return NULL;

	bfq_log_bfqq(bfqq->bfqd, bfqq, "check_fifo: returned %p", rq);
	return rq;
}
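
/*
 * Worked example for the FIFO check above, assuming (as done at request
 * insertion elsewhere in this file) that rq->fifo_time is set to the
 * insertion time plus bfq_fifo_expire[rq_is_sync(rq)]: a sync request
 * inserted at time t gets fifo_time = t + NSEC_PER_SEC / 8, i.e., a
 * 125 ms deadline (250 ms for async requests). bfq_check_fifo() returns
 * such a request only once blk_time_get_ns() has passed that deadline;
 * otherwise bfq_find_next_rq() below falls back to the sector-ordered
 * rbtree.
 */
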
static struct request *bfq_find_next_rq(struct bfq_data *bfqd,
					struct bfq_queue *bfqq,
					struct request *last)
{
	struct rb_node *rbnext = rb_next(&last->rb_node);
	struct rb_node *rbprev = rb_prev(&last->rb_node);
	struct request *next, *prev = NULL;

	/* Follow expired path, else get first next available. */
	next = bfq_check_fifo(bfqq, last);
	if (next)
		return next;

	if (rbprev)
		prev = rb_entry_rq(rbprev);

	if (rbnext)
		next = rb_entry_rq(rbnext);
	else {
		rbnext = rb_first(&bfqq->sort_list);
		if (rbnext && rbnext != &last->rb_node)
			next = rb_entry_rq(rbnext);
	}

	return bfq_choose_req(bfqd, next, prev, blk_rq_pos(last));
}

/* see the definition of bfq_async_charge_factor for details */
static unsigned long bfq_serv_to_charge(struct request *rq,
					struct bfq_queue *bfqq)
{
	if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1 ||
	    bfq_asymmetric_scenario(bfqq->bfqd, bfqq))
		return blk_rq_sectors(rq);

	return blk_rq_sectors(rq) * bfq_async_charge_factor;
}

/**
 * bfq_updated_next_req - update the queue after a new next_rq selection.
 * @bfqd: the device data the queue belongs to.
 * @bfqq: the queue to update.
 *
 * If the first request of a queue changes we make sure that the queue
 * has enough budget to serve at least its first request (if the
 * request has grown). We do this because if the queue has not enough
 * budget for its first request, it has to go through two dispatch
 * rounds to actually get it dispatched.
 */
static void bfq_updated_next_req(struct bfq_data *bfqd,
				 struct bfq_queue *bfqq)
{
	struct bfq_entity *entity = &bfqq->entity;
	struct request *next_rq = bfqq->next_rq;
	unsigned long new_budget;

	if (!next_rq)
		return;

	if (bfqq == bfqd->in_service_queue)
		/*
		 * In order not to break guarantees, budgets cannot be
		 * changed after an entity has been selected.
		 */
		return;

	new_budget = max_t(unsigned long,
			   max_t(unsigned long, bfqq->max_budget,
				 bfq_serv_to_charge(next_rq, bfqq)),
			   entity->service);
	if (entity->budget != new_budget) {
		entity->budget = new_budget;
		bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
					 new_budget);
		bfq_requeue_bfqq(bfqd, bfqq, false);
	}
}
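
/*
 * Numeric illustration of the update above (values are illustrative):
 * if bfqq->max_budget is 1024 sectors, the newly selected next_rq would
 * be charged 2048 sectors, and entity->service is 0, then
 * entity->budget is raised to 2048, so the request can be served within
 * a single budget instead of forcing an early expiration and a second
 * dispatch round.
 */
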
static unsigned int bfq_wr_duration(struct bfq_data *bfqd)
{
	u64 dur;

	dur = bfqd->rate_dur_prod;
	do_div(dur, bfqd->peak_rate);

	/*
	 * Limit duration between 3 and 25 seconds. The upper limit
	 * has been conservatively set after the following worst case:
	 * on a QEMU/KVM virtual machine
	 * - running in a slow PC
	 * - with a virtual disk stacked on a slow low-end 5400rpm HDD
	 * - serving a heavy I/O workload, such as the sequential reading
	 *   of several files
	 * mplayer took 23 seconds to start, if constantly weight-raised.
	 *
	 * As for higher values than that accommodating the above bad
	 * scenario, tests show that higher values would often yield
	 * the opposite of the desired result, i.e., would worsen
	 * responsiveness by allowing non-interactive applications to
	 * preserve weight raising for too long.
	 *
	 * On the other end, lower values than 3 seconds make it
	 * difficult for most interactive tasks to complete their jobs
	 * before weight-raising finishes.
	 */
	return clamp_val(dur, msecs_to_jiffies(3000), msecs_to_jiffies(25000));
}

/* switch back from soft real-time to interactive weight raising */
static void switch_back_to_interactive_wr(struct bfq_queue *bfqq,
					  struct bfq_data *bfqd)
{
	bfqq->wr_coeff = bfqd->bfq_wr_coeff;
	bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);
	bfqq->last_wr_start_finish = bfqq->wr_start_at_switch_to_srt;
}

static void
bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
		      struct bfq_io_cq *bic, bool bfq_already_existing)
{
	unsigned int old_wr_coeff = 1;
	bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
	unsigned int a_idx = bfqq->actuator_idx;
	struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];

	if (bfqq_data->saved_has_short_ttime)
		bfq_mark_bfqq_has_short_ttime(bfqq);
	else
		bfq_clear_bfqq_has_short_ttime(bfqq);

	if (bfqq_data->saved_IO_bound)
		bfq_mark_bfqq_IO_bound(bfqq);
	else
		bfq_clear_bfqq_IO_bound(bfqq);

	bfqq->last_serv_time_ns = bfqq_data->saved_last_serv_time_ns;
	bfqq->inject_limit = bfqq_data->saved_inject_limit;
	bfqq->decrease_time_jif = bfqq_data->saved_decrease_time_jif;

	bfqq->entity.new_weight = bfqq_data->saved_weight;
	bfqq->ttime = bfqq_data->saved_ttime;
	bfqq->io_start_time = bfqq_data->saved_io_start_time;
	bfqq->tot_idle_time = bfqq_data->saved_tot_idle_time;
	/*
	 * Restore weight coefficient only if low_latency is on
	 */
	if (bfqd->low_latency) {
		old_wr_coeff = bfqq->wr_coeff;
		bfqq->wr_coeff = bfqq_data->saved_wr_coeff;
	}
	bfqq->service_from_wr = bfqq_data->saved_service_from_wr;
	bfqq->wr_start_at_switch_to_srt =
		bfqq_data->saved_wr_start_at_switch_to_srt;
	bfqq->last_wr_start_finish = bfqq_data->saved_last_wr_start_finish;
	bfqq->wr_cur_max_time = bfqq_data->saved_wr_cur_max_time;

	if (bfqq->wr_coeff > 1 && (bfq_bfqq_in_large_burst(bfqq) ||
	    time_is_before_jiffies(bfqq->last_wr_start_finish +
				   bfqq->wr_cur_max_time))) {
		if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
		    !bfq_bfqq_in_large_burst(bfqq) &&
		    time_is_after_eq_jiffies(bfqq->wr_start_at_switch_to_srt +
					     bfq_wr_duration(bfqd))) {
			switch_back_to_interactive_wr(bfqq, bfqd);
		} else {
			bfqq->wr_coeff = 1;
			bfq_log_bfqq(bfqq->bfqd, bfqq,
				     "resume state: switching off wr");
		}
	}

	/* make sure weight will be updated, however we got here */
	bfqq->entity.prio_changed = 1;

	if (likely(!busy))
		return;

	if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
		bfqd->wr_busy_queues++;
	else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
		bfqd->wr_busy_queues--;
}
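
/*
 * Reference-count bookkeeping example for the helper below (numbers are
 * purely illustrative): a queue with ref == 5 that has 2 requests
 * allocated, is on a service tree (on_st_or_in_serv == 1), owns a
 * weight counter, and holds no stable references is considered to have
 * 5 - 2 - 1 - 1 - 0 = 1 process reference left.
 */
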
static int bfqq_process_refs(struct bfq_queue *bfqq)
{
	return bfqq->ref - bfqq->entity.allocated -
		bfqq->entity.on_st_or_in_serv -
		(bfqq->weight_counter != NULL) - bfqq->stable_ref;
}

/* Empty burst list and add just bfqq (see comments on bfq_handle_burst) */
static void bfq_reset_burst_list(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
	struct bfq_queue *item;
	struct hlist_node *n;

	hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
		hlist_del_init(&item->burst_list_node);

	/*
	 * Start the creation of a new burst list only if there is no
	 * active queue. See comments on the conditional invocation of
	 * bfq_handle_burst().
	 */
	if (bfq_tot_busy_queues(bfqd) == 0) {
		hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
		bfqd->burst_size = 1;
	} else
		bfqd->burst_size = 0;

	bfqd->burst_parent_entity = bfqq->entity.parent;
}

/* Add bfqq to the list of queues in current burst (see bfq_handle_burst) */
static void bfq_add_to_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
	/* Increment burst size to take into account also bfqq */
	bfqd->burst_size++;

	if (bfqd->burst_size == bfqd->bfq_large_burst_thresh) {
		struct bfq_queue *pos, *bfqq_item;
		struct hlist_node *n;

		/*
		 * Enough queues have been activated shortly after each
		 * other to consider this burst as large.
		 */
		bfqd->large_burst = true;

		/*
		 * We can now mark all queues in the burst list as
		 * belonging to a large burst.
		 */
		hlist_for_each_entry(bfqq_item, &bfqd->burst_list,
				     burst_list_node)
			bfq_mark_bfqq_in_large_burst(bfqq_item);
		bfq_mark_bfqq_in_large_burst(bfqq);

		/*
		 * From now on, and until the current burst finishes, any
		 * new queue being activated shortly after the last queue
		 * was inserted in the burst can be immediately marked as
		 * belonging to a large burst. So the burst list is not
		 * needed any more. Remove it.
		 */
		hlist_for_each_entry_safe(pos, n, &bfqd->burst_list,
					  burst_list_node)
			hlist_del_init(&pos->burst_list_node);
	} else /*
		* Burst not yet large: add bfqq to the burst list. Do
		* not increment the ref counter for bfqq, because bfqq
		* is removed from the burst list before freeing bfqq
		* in put_queue.
		*/
		hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
}

/*
 * If many queues belonging to the same group happen to be created
 * shortly after each other, then the processes associated with these
 * queues typically have a common goal. In particular, bursts of queue
 * creations are usually caused by services or applications that spawn
 * many parallel threads/processes. Examples are systemd during boot,
 * or git grep. To help these processes get their job done as soon as
 * possible, it is usually better to not grant either weight-raising
 * or device idling to their queues, unless these queues must be
 * protected from the I/O flowing through other active queues.
 *
 * In this comment we describe, firstly, the reasons why this fact
 * holds, and, secondly, the next function, which implements the main
 * steps needed to properly mark these queues so that they can then be
 * treated in a different way.
 *
 * The above services or applications benefit mostly from a high
 * throughput: the quicker the requests of the activated queues are
 * cumulatively served, the sooner the target job of these queues gets
 * completed. As a consequence, weight-raising any of these queues,
 * which also implies idling the device for it, is almost always
 * counterproductive, unless there are other active queues to isolate
 * these new queues from. If there are no other active queues, then
 * weight-raising these new queues just lowers throughput in most
 * cases.
 *
 * On the other hand, a burst of queue creations may also be caused by
 * the start of an application that does not consist of a lot of
 * parallel I/O-bound threads. In fact, with a complex application,
 * several short processes may need to be executed to start up the
 * application. In this respect, to start an application as quickly as
 * possible, the best thing to do is in any case to privilege the I/O
 * related to the application with respect to all other
 * I/O. Therefore, the best strategy to start as quickly as possible
 * an application that causes a burst of queue creations is to
 * weight-raise all the queues created during the burst. This is the
 * exact opposite of the best strategy for the other type of bursts.
 *
 * In the end, to take the best action for each of the two cases, the
 * two types of bursts need to be distinguished. Fortunately, this
 * seems relatively easy, by looking at the sizes of the bursts. In
 * particular, we found a threshold such that only bursts with a
 * larger size than that threshold are apparently caused by
 * services or commands such as systemd or git grep. For brevity,
 * hereafter we call just 'large' these bursts. BFQ *does not*
 * weight-raise queues whose creation occurs in a large burst. In
 * addition, for each of these queues BFQ performs or does not perform
 * idling depending on which choice boosts the throughput more. The
 * exact choice depends on the device and request pattern at
 * hand.
 *
 * Unfortunately, false positives may occur while an interactive task
 * is starting (e.g., an application is being started). The
 * consequence is that the queues associated with the task do not
 * enjoy weight raising as expected. Fortunately these false positives
 * are very rare. They typically occur if some service happens to
 * start doing I/O exactly when the interactive task starts.
 *
 * Turning back to the next function, it is invoked only if there are
 * no active queues (apart from active queues that would belong to the
 * same, possible burst bfqq would belong to), and it implements all
 * the steps needed to detect the occurrence of a large burst and to
 * properly mark all the queues belonging to it (so that they can then
 * be treated in a different way). This goal is achieved by
 * maintaining a "burst list" that holds, temporarily, the queues that
 * belong to the burst in progress. The list is then used to mark
 * these queues as belonging to a large burst if the burst does become
 * large. The main steps are the following.
 *
 * . when the very first queue is created, the queue is inserted into the
 *   list (as it could be the first queue in a possible burst)
 *
 * . if the current burst has not yet become large, and a queue Q that does
 *   not yet belong to the burst is activated shortly after the last time
 *   at which a new queue entered the burst list, then the function appends
 *   Q to the burst list
 *
 * . if, as a consequence of the previous step, the burst size reaches
 *   the large-burst threshold, then
 *
 *   . all the queues in the burst list are marked as belonging to a
 *     large burst
 *
 *   . the burst list is deleted; in fact, the burst list already served
 *     its purpose (temporarily keeping track of the queues in a burst,
 *     so as to be able to mark them as belonging to a large burst in the
 *     previous sub-step), and now is not needed any more
 *
 *   . the device enters a large-burst mode
 *
 * . if a queue Q that does not belong to the burst is created while
 *   the device is in large-burst mode and shortly after the last time
 *   at which a queue either entered the burst list or was marked as
 *   belonging to the current large burst, then Q is immediately marked
 *   as belonging to a large burst.
 *
 * . if a queue Q that does not belong to the burst is created a while
 *   later than (i.e., not shortly after) the last time at which a queue
 *   either entered the burst list or was marked as belonging to the
 *   current large burst, then the current burst is deemed finished and:
 *
 *        . the large-burst mode is reset if set
 *
 *        . the burst list is emptied
 *
 *        . Q is inserted in the burst list, as Q may be the first queue
 *          in a possible new burst (then the burst list contains just Q
 *          after this step).
 */
static void bfq_handle_burst(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
	/*
	 * If bfqq is already in the burst list or is part of a large
	 * burst, or finally has just been split, then there is
	 * nothing else to do.
	 */
	if (!hlist_unhashed(&bfqq->burst_list_node) ||
	    bfq_bfqq_in_large_burst(bfqq) ||
	    time_is_after_eq_jiffies(bfqq->split_time +
				     msecs_to_jiffies(10)))
		return;

	/*
	 * If bfqq's creation happens late enough, or bfqq belongs to
	 * a different group than the burst group, then the current
	 * burst is finished, and related data structures must be
	 * reset.
	 *
	 * In this respect, consider the special case where bfqq is
	 * the very first queue created after BFQ is selected for this
	 * device. In this case, last_ins_in_burst and
	 * burst_parent_entity are not yet significant when we get
	 * here. But it is easy to verify that, whether or not the
	 * following condition is true, bfqq will end up being
	 * inserted into the burst list. In particular the list will
	 * happen to contain only bfqq. And this is exactly what has
	 * to happen, as bfqq may be the first queue of the first
	 * burst.
	 */
	if (time_is_before_jiffies(bfqd->last_ins_in_burst +
	    bfqd->bfq_burst_interval) ||
	    bfqq->entity.parent != bfqd->burst_parent_entity) {
		bfqd->large_burst = false;
		bfq_reset_burst_list(bfqd, bfqq);
		goto end;
	}

	/*
	 * If we get here, then bfqq is being activated shortly after the
	 * last queue. So, if the current burst is also large, we can mark
	 * bfqq as belonging to this large burst immediately.
	 */
	if (bfqd->large_burst) {
		bfq_mark_bfqq_in_large_burst(bfqq);
		goto end;
	}

	/*
	 * If we get here, then a large-burst state has not yet been
	 * reached, but bfqq is being activated shortly after the last
	 * queue. Then we add bfqq to the burst.
	 */
	bfq_add_to_burst(bfqd, bfqq);
end:
	/*
	 * At this point, bfqq either has been added to the current
	 * burst or has caused the current burst to terminate and a
	 * possible new burst to start. In particular, in the second
	 * case, bfqq has become the first queue in the possible new
	 * burst. In both cases last_ins_in_burst needs to be moved
	 * forward.
	 */
	bfqd->last_ins_in_burst = jiffies;
}
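
/*
 * A condensed scenario for the logic above (the threshold and interval
 * are whatever bfqd->bfq_large_burst_thresh and bfqd->bfq_burst_interval
 * are configured to at init time): if several queues are created in
 * rapid succession, each new creation lands within bfq_burst_interval
 * of the previous one and the burst list grows; as soon as burst_size
 * reaches bfq_large_burst_thresh, every listed queue is marked
 * in_large_burst and loses weight-raising. A queue created only after a
 * longer pause instead resets the list and may start a new, possibly
 * interactive, burst.
 */
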
static int bfq_bfqq_budget_left(struct bfq_queue *bfqq)
{
	struct bfq_entity *entity = &bfqq->entity;

	return entity->budget - entity->service;
}

/*
 * If enough samples have been computed, return the current max budget
 * stored in bfqd, which is dynamically updated according to the
 * estimated disk peak rate; otherwise return the default max budget
 */
static int bfq_max_budget(struct bfq_data *bfqd)
{
	if (bfqd->budgets_assigned < bfq_stats_min_budgets)
		return bfq_default_max_budget;
	else
		return bfqd->bfq_max_budget;
}

/*
 * Return min budget, which is a fraction of the current or default
 * max budget (trying with 1/32)
 */
static int bfq_min_budget(struct bfq_data *bfqd)
{
	if (bfqd->budgets_assigned < bfq_stats_min_budgets)
		return bfq_default_max_budget / 32;
	else
		return bfqd->bfq_max_budget / 32;
}

/*
 * The next function, invoked after the input queue bfqq switches from
 * idle to busy, updates the budget of bfqq. The function also tells
 * whether the in-service queue should be expired, by returning
 * true. The purpose of expiring the in-service queue is to give bfqq
 * the chance to possibly preempt the in-service queue, and the reason
 * for preempting the in-service queue is to achieve one of the two
 * goals below.
 *
 * 1. Guarantee to bfqq its reserved bandwidth even if bfqq has
 * expired because it has remained idle. In particular, bfqq may have
 * expired for one of the following two reasons:
 *
 * - BFQQE_NO_MORE_REQUESTS bfqq did not enjoy any device idling
 *   and did not make it to issue a new request before its last
 *   request was served;
 *
 * - BFQQE_TOO_IDLE bfqq did enjoy device idling, but did not issue
 *   a new request before the expiration of the idling-time.
 *
 * Even if bfqq has expired for one of the above reasons, the process
 * associated with the queue may however be issuing requests greedily,
 * and thus be sensitive to the bandwidth it receives (bfqq may have
 * remained idle for other reasons: CPU high load, bfqq not enjoying
 * idling, I/O throttling somewhere in the path from the process to
 * the I/O scheduler, ...). But if, after every expiration for one of
 * the above two reasons, bfqq has to wait for the service of at least
 * one full budget of another queue before being served again, then
 * bfqq is likely to get a much lower bandwidth or resource time than
 * its reserved ones. To address this issue, two countermeasures need
 * to be taken.
 *
 * First, the budget and the timestamps of bfqq need to be updated in
 * a special way on bfqq reactivation: they need to be updated as if
 * bfqq did not remain idle and did not expire. In fact, if they are
 * computed as if bfqq expired and remained idle until reactivation,
 * then the process associated with bfqq is treated as if, instead of
 * being greedy, it stopped issuing requests when bfqq remained idle,
 * and restarts issuing requests only on this reactivation. In other
 * words, the scheduler does not help the process recover the "service
 * hole" between bfqq expiration and reactivation. As a consequence,
 * the process receives a lower bandwidth than its reserved one. In
 * contrast, to recover this hole, the budget must be updated as if
 * bfqq was not expired at all before this reactivation, i.e., it must
 * be set to the value of the remaining budget when bfqq was
 * expired. Along the same line, timestamps need to be assigned the
 * value they had the last time bfqq was selected for service, i.e.,
 * before last expiration. Thus timestamps need to be back-shifted
 * with respect to their normal computation (see [1] for more details
 * on this tricky aspect).
 *
 * Secondly, to allow the process to recover the hole, the in-service
 * queue must be expired too, to give bfqq the chance to preempt it
 * immediately. In fact, if bfqq has to wait for a full budget of the
 * in-service queue to be completed, then it may become impossible to
 * let the process recover the hole, even if the back-shifted
 * timestamps of bfqq are lower than those of the in-service queue. If
 * this happens for most or all of the holes, then the process may not
 * receive its reserved bandwidth. In this respect, it is worth noting
 * that, since the service of outstanding requests is not preemptible,
 * a small fraction of the holes may however be unrecoverable, thereby
 * causing a small loss of bandwidth.
 *
 * The last important point is detecting whether bfqq does need this
 * bandwidth recovery. In this respect, the next function deems the
 * process associated with bfqq greedy, and thus allows it to recover
 * the hole, if: 1) the process is waiting for the arrival of a new
 * request (which implies that bfqq expired for one of the above two
 * reasons), and 2) such a request has arrived soon. The first
 * condition is controlled through the flag non_blocking_wait_rq,
 * while the second through the flag arrived_in_time. If both
 * conditions hold, then the function computes the budget in the
 * above-described special way, and signals that the in-service queue
 * should be expired. Timestamp back-shifting is done later in
 * __bfq_activate_entity.
 *
 * 2. Reduce latency. Even if timestamps are not backshifted to let
 * the process associated with bfqq recover a service hole, bfqq may
 * however happen to have, after being (re)activated, a lower finish
 * timestamp than the in-service queue. That is, the next budget of
 * bfqq may have to be completed before the one of the in-service
 * queue. If this is the case, then preempting the in-service queue
 * allows this goal to be achieved, apart from the unpreemptible,
 * outstanding requests mentioned above.
 *
 * Unfortunately, regardless of which of the above two goals one wants
 * to achieve, service trees need first to be updated to know whether
 * the in-service queue must be preempted. To have service trees
 * correctly updated, the in-service queue must be expired and
 * rescheduled, and bfqq must be scheduled too. This is one of the
 * most costly operations (in future versions, the scheduling
 * mechanism may be re-designed in such a way as to make it possible to
 * know whether preemption is needed without needing to update service
 * trees). In addition, queue preemptions almost always cause random
 * I/O, which may in turn cause loss of throughput. Finally, there may
 * even be no in-service queue when the next function is invoked (so,
 * no queue to compare timestamps with). Because of these facts, the
Because of these facts, the1578* next function adopts the following simple scheme to avoid costly1579* operations, too frequent preemptions and too many dependencies on1580* the state of the scheduler: it requests the expiration of the1581* in-service queue (unconditionally) only for queues that need to1582* recover a hole. Then it delegates to other parts of the code the1583* responsibility of handling the above case 2.1584*/1585static bool bfq_bfqq_update_budg_for_activation(struct bfq_data *bfqd,1586struct bfq_queue *bfqq,1587bool arrived_in_time)1588{1589struct bfq_entity *entity = &bfqq->entity;15901591/*1592* In the next compound condition, we check also whether there1593* is some budget left, because otherwise there is no point in1594* trying to go on serving bfqq with this same budget: bfqq1595* would be expired immediately after being selected for1596* service. This would only cause useless overhead.1597*/1598if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time &&1599bfq_bfqq_budget_left(bfqq) > 0) {1600/*1601* We do not clear the flag non_blocking_wait_rq here, as1602* the latter is used in bfq_activate_bfqq to signal1603* that timestamps need to be back-shifted (and is1604* cleared right after).1605*/16061607/*1608* In next assignment we rely on that either1609* entity->service or entity->budget are not updated1610* on expiration if bfqq is empty (see1611* __bfq_bfqq_recalc_budget). Thus both quantities1612* remain unchanged after such an expiration, and the1613* following statement therefore assigns to1614* entity->budget the remaining budget on such an1615* expiration.1616*/1617entity->budget = min_t(unsigned long,1618bfq_bfqq_budget_left(bfqq),1619bfqq->max_budget);16201621/*1622* At this point, we have used entity->service to get1623* the budget left (needed for updating1624* entity->budget). Thus we finally can, and have to,1625* reset entity->service. 
The latter must be reset1626* because bfqq would otherwise be charged again for1627* the service it has received during its previous1628* service slot(s).1629*/1630entity->service = 0;16311632return true;1633}16341635/*1636* We can finally complete expiration, by setting service to 0.1637*/1638entity->service = 0;1639entity->budget = max_t(unsigned long, bfqq->max_budget,1640bfq_serv_to_charge(bfqq->next_rq, bfqq));1641bfq_clear_bfqq_non_blocking_wait_rq(bfqq);1642return false;1643}16441645/*1646* Return the farthest past time instant according to jiffies1647* macros.1648*/1649static unsigned long bfq_smallest_from_now(void)1650{1651return jiffies - MAX_JIFFY_OFFSET;1652}16531654static void bfq_update_bfqq_wr_on_rq_arrival(struct bfq_data *bfqd,1655struct bfq_queue *bfqq,1656unsigned int old_wr_coeff,1657bool wr_or_deserves_wr,1658bool interactive,1659bool in_burst,1660bool soft_rt)1661{1662if (old_wr_coeff == 1 && wr_or_deserves_wr) {1663/* start a weight-raising period */1664if (interactive) {1665bfqq->service_from_wr = 0;1666bfqq->wr_coeff = bfqd->bfq_wr_coeff;1667bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);1668} else {1669/*1670* No interactive weight raising in progress1671* here: assign minus infinity to1672* wr_start_at_switch_to_srt, to make sure1673* that, at the end of the soft-real-time1674* weight raising periods that is starting1675* now, no interactive weight-raising period1676* may be wrongly considered as still in1677* progress (and thus actually started by1678* mistake).1679*/1680bfqq->wr_start_at_switch_to_srt =1681bfq_smallest_from_now();1682bfqq->wr_coeff = bfqd->bfq_wr_coeff *1683BFQ_SOFTRT_WEIGHT_FACTOR;1684bfqq->wr_cur_max_time =1685bfqd->bfq_wr_rt_max_time;1686}16871688/*1689* If needed, further reduce budget to make sure it is1690* close to bfqq's backlog, so as to reduce the1691* scheduling-error component due to a too large1692* budget. Do not care about throughput consequences,1693* but only about latency. Finally, do not assign a1694* too small budget either, to avoid increasing1695* latency by causing too frequent expirations.1696*/1697bfqq->entity.budget = min_t(unsigned long,1698bfqq->entity.budget,16992 * bfq_min_budget(bfqd));1700} else if (old_wr_coeff > 1) {1701if (interactive) { /* update wr coeff and duration */1702bfqq->wr_coeff = bfqd->bfq_wr_coeff;1703bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);1704} else if (in_burst)1705bfqq->wr_coeff = 1;1706else if (soft_rt) {1707/*1708* The application is now or still meeting the1709* requirements for being deemed soft rt. 
We1710* can then correctly and safely (re)charge1711* the weight-raising duration for the1712* application with the weight-raising1713* duration for soft rt applications.1714*1715* In particular, doing this recharge now, i.e.,1716* before the weight-raising period for the1717* application finishes, reduces the probability1718* of the following negative scenario:1719* 1) the weight of a soft rt application is1720* raised at startup (as for any newly1721* created application),1722* 2) since the application is not interactive,1723* at a certain time weight-raising is1724* stopped for the application,1725* 3) at that time the application happens to1726* still have pending requests, and hence1727* is destined to not have a chance to be1728* deemed soft rt before these requests are1729* completed (see the comments to the1730* function bfq_bfqq_softrt_next_start()1731* for details on soft rt detection),1732* 4) these pending requests experience a high1733* latency because the application is not1734* weight-raised while they are pending.1735*/1736if (bfqq->wr_cur_max_time !=1737bfqd->bfq_wr_rt_max_time) {1738bfqq->wr_start_at_switch_to_srt =1739bfqq->last_wr_start_finish;17401741bfqq->wr_cur_max_time =1742bfqd->bfq_wr_rt_max_time;1743bfqq->wr_coeff = bfqd->bfq_wr_coeff *1744BFQ_SOFTRT_WEIGHT_FACTOR;1745}1746bfqq->last_wr_start_finish = jiffies;1747}1748}1749}17501751static bool bfq_bfqq_idle_for_long_time(struct bfq_data *bfqd,1752struct bfq_queue *bfqq)1753{1754return bfqq->dispatched == 0 &&1755time_is_before_jiffies(1756bfqq->budget_timeout +1757bfqd->bfq_wr_min_idle_time);1758}175917601761/*1762* Return true if bfqq is in a higher priority class, or has a higher1763* weight than the in-service queue.1764*/1765static bool bfq_bfqq_higher_class_or_weight(struct bfq_queue *bfqq,1766struct bfq_queue *in_serv_bfqq)1767{1768int bfqq_weight, in_serv_weight;17691770if (bfqq->ioprio_class < in_serv_bfqq->ioprio_class)1771return true;17721773if (in_serv_bfqq->entity.parent == bfqq->entity.parent) {1774bfqq_weight = bfqq->entity.weight;1775in_serv_weight = in_serv_bfqq->entity.weight;1776} else {1777if (bfqq->entity.parent)1778bfqq_weight = bfqq->entity.parent->weight;1779else1780bfqq_weight = bfqq->entity.weight;1781if (in_serv_bfqq->entity.parent)1782in_serv_weight = in_serv_bfqq->entity.parent->weight;1783else1784in_serv_weight = in_serv_bfqq->entity.weight;1785}17861787return bfqq_weight > in_serv_weight;1788}17891790/*1791* Get the index of the actuator that will serve bio.1792*/1793static unsigned int bfq_actuator_index(struct bfq_data *bfqd, struct bio *bio)1794{1795unsigned int i;1796sector_t end;17971798/* no search needed if one or zero ranges present */1799if (bfqd->num_actuators == 1)1800return 0;18011802/* bio_end_sector(bio) gives the sector after the last one */1803end = bio_end_sector(bio) - 1;18041805for (i = 0; i < bfqd->num_actuators; i++) {1806if (end >= bfqd->sector[i] &&1807end < bfqd->sector[i] + bfqd->nr_sectors[i])1808return i;1809}18101811WARN_ONCE(true,1812"bfq_actuator_index: bio sector out of ranges: end=%llu\n",1813end);1814return 0;1815}18161817static bool bfq_better_to_idle(struct bfq_queue *bfqq);18181819static void bfq_bfqq_handle_idle_busy_switch(struct bfq_data *bfqd,1820struct bfq_queue *bfqq,1821int old_wr_coeff,1822struct request *rq,1823bool *interactive)1824{1825bool soft_rt, in_burst, wr_or_deserves_wr,1826bfqq_wants_to_preempt,1827idle_for_long_time = bfq_bfqq_idle_for_long_time(bfqd, bfqq),1828/*1829* See the comments on1830* bfq_bfqq_update_budg_for_activation 
		 * for details on the usage of the next variable.
		 */
		arrived_in_time = blk_time_get_ns() <=
			bfqq->ttime.last_end_request +
			bfqd->bfq_slice_idle * 3;
	unsigned int act_idx = bfq_actuator_index(bfqd, rq->bio);
	bool bfqq_non_merged_or_stably_merged =
		bfqq->bic || RQ_BIC(rq)->bfqq_data[act_idx].stably_merged;

	/*
	 * bfqq deserves to be weight-raised if:
	 * - it is sync,
	 * - it does not belong to a large burst,
	 * - it has been idle for enough time or is soft real-time,
	 * - is linked to a bfq_io_cq (it is not shared in any sense),
	 * - has a default weight (otherwise we assume the user wanted
	 *   to control its weight explicitly)
	 */
	in_burst = bfq_bfqq_in_large_burst(bfqq);
	soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
		!BFQQ_TOTALLY_SEEKY(bfqq) &&
		!in_burst &&
		time_is_before_jiffies(bfqq->soft_rt_next_start) &&
		bfqq->dispatched == 0 &&
		bfqq->entity.new_weight == 40;
	*interactive = !in_burst && idle_for_long_time &&
		bfqq->entity.new_weight == 40;
	/*
	 * Merged bfq_queues are kept out of weight-raising
	 * (low-latency) mechanisms. The reason is that these queues
	 * are usually created for non-interactive and
	 * non-soft-real-time tasks. Yet this is not the case for
	 * stably-merged queues. These queues are merged just because
	 * they are created shortly after each other. So they may
	 * easily serve the I/O of an interactive or soft-real-time
	 * application, if the application happens to spawn multiple
	 * processes. So let also stably-merged queues enjoy weight
	 * raising.
	 */
	wr_or_deserves_wr = bfqd->low_latency &&
		(bfqq->wr_coeff > 1 ||
		 (bfq_bfqq_sync(bfqq) && bfqq_non_merged_or_stably_merged &&
		  (*interactive || soft_rt)));

	/*
	 * Using the last flag, update budget and check whether bfqq
	 * may want to preempt the in-service queue.
	 */
	bfqq_wants_to_preempt =
		bfq_bfqq_update_budg_for_activation(bfqd, bfqq,
						    arrived_in_time);

	/*
	 * If bfqq happened to be activated in a burst, but has been
	 * idle for much more than an interactive queue, then we
	 * assume that, in the overall I/O initiated in the burst, the
	 * I/O associated with bfqq is finished. So bfqq does not need
	 * to be treated as a queue belonging to a burst
	 * anymore. Accordingly, we reset bfqq's in_large_burst flag
	 * if set, and remove bfqq from the burst list if it's
	 * there.
We do not decrement burst_size, because the fact1892* that bfqq does not need to belong to the burst list any1893* more does not invalidate the fact that bfqq was created in1894* a burst.1895*/1896if (likely(!bfq_bfqq_just_created(bfqq)) &&1897idle_for_long_time &&1898time_is_before_jiffies(1899bfqq->budget_timeout +1900msecs_to_jiffies(10000))) {1901hlist_del_init(&bfqq->burst_list_node);1902bfq_clear_bfqq_in_large_burst(bfqq);1903}19041905bfq_clear_bfqq_just_created(bfqq);19061907if (bfqd->low_latency) {1908if (unlikely(time_is_after_jiffies(bfqq->split_time)))1909/* wraparound */1910bfqq->split_time =1911jiffies - bfqd->bfq_wr_min_idle_time - 1;19121913if (time_is_before_jiffies(bfqq->split_time +1914bfqd->bfq_wr_min_idle_time)) {1915bfq_update_bfqq_wr_on_rq_arrival(bfqd, bfqq,1916old_wr_coeff,1917wr_or_deserves_wr,1918*interactive,1919in_burst,1920soft_rt);19211922if (old_wr_coeff != bfqq->wr_coeff)1923bfqq->entity.prio_changed = 1;1924}1925}19261927bfqq->last_idle_bklogged = jiffies;1928bfqq->service_from_backlogged = 0;1929bfq_clear_bfqq_softrt_update(bfqq);19301931bfq_add_bfqq_busy(bfqq);19321933/*1934* Expire in-service queue if preemption may be needed for1935* guarantees or throughput. As for guarantees, we care1936* explicitly about two cases. The first is that bfqq has to1937* recover a service hole, as explained in the comments on1938* bfq_bfqq_update_budg_for_activation(), i.e., that1939* bfqq_wants_to_preempt is true. However, if bfqq does not1940* carry time-critical I/O, then bfqq's bandwidth is less1941* important than that of queues that carry time-critical I/O.1942* So, as a further constraint, we consider this case only if1943* bfqq is at least as weight-raised, i.e., at least as time1944* critical, as the in-service queue.1945*1946* The second case is that bfqq is in a higher priority class,1947* or has a higher weight than the in-service queue. If this1948* condition does not hold, we don't care because, even if1949* bfqq does not start to be served immediately, the resulting1950* delay for bfqq's I/O is however lower or much lower than1951* the ideal completion time to be guaranteed to bfqq's I/O.1952*1953* In both cases, preemption is needed only if, according to1954* the timestamps of both bfqq and of the in-service queue,1955* bfqq actually is the next queue to serve. So, to reduce1956* useless preemptions, the return value of1957* next_queue_may_preempt() is considered in the next compound1958* condition too. Yet next_queue_may_preempt() just checks a1959* simple, necessary condition for bfqq to be the next queue1960* to serve. In fact, to evaluate a sufficient condition, the1961* timestamps of the in-service queue would need to be1962* updated, and this operation is quite costly (see the1963* comments on bfq_bfqq_update_budg_for_activation()).1964*1965* As for throughput, we ask bfq_better_to_idle() whether we1966* still need to plug I/O dispatching. If bfq_better_to_idle()1967* says no, then plugging is not needed any longer, either to1968* boost throughput or to perserve service guarantees. Then1969* the best option is to stop plugging I/O, as not doing so1970* would certainly lower throughput. 
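	 *
	 * (Putting the pieces together, the check below expires the
	 * in-service queue only when
	 *	(bfqq_wants_to_preempt &&
	 *	 bfqq->wr_coeff >= bfqd->in_service_queue->wr_coeff) ||
	 *	bfq_bfqq_higher_class_or_weight(bfqq, bfqd->in_service_queue) ||
	 *	!bfq_better_to_idle(bfqd->in_service_queue)
	 * holds and, in addition, next_queue_may_preempt(bfqd) returns
	 * true. The last alternative, idling having become pointless,
	 * is discussed next.)
	 *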
We may end up in this1971* case if: (1) upon a dispatch attempt, we detected that it1972* was better to plug I/O dispatch, and to wait for a new1973* request to arrive for the currently in-service queue, but1974* (2) this switch of bfqq to busy changes the scenario.1975*/1976if (bfqd->in_service_queue &&1977((bfqq_wants_to_preempt &&1978bfqq->wr_coeff >= bfqd->in_service_queue->wr_coeff) ||1979bfq_bfqq_higher_class_or_weight(bfqq, bfqd->in_service_queue) ||1980!bfq_better_to_idle(bfqd->in_service_queue)) &&1981next_queue_may_preempt(bfqd))1982bfq_bfqq_expire(bfqd, bfqd->in_service_queue,1983false, BFQQE_PREEMPTED);1984}19851986static void bfq_reset_inject_limit(struct bfq_data *bfqd,1987struct bfq_queue *bfqq)1988{1989/* invalidate baseline total service time */1990bfqq->last_serv_time_ns = 0;19911992/*1993* Reset pointer in case we are waiting for1994* some request completion.1995*/1996bfqd->waited_rq = NULL;19971998/*1999* If bfqq has a short think time, then start by setting the2000* inject limit to 0 prudentially, because the service time of2001* an injected I/O request may be higher than the think time2002* of bfqq, and therefore, if one request was injected when2003* bfqq remains empty, this injected request might delay the2004* service of the next I/O request for bfqq significantly. In2005* case bfqq can actually tolerate some injection, then the2006* adaptive update will however raise the limit soon. This2007* lucky circumstance holds exactly because bfqq has a short2008* think time, and thus, after remaining empty, is likely to2009* get new I/O enqueued---and then completed---before being2010* expired. This is the very pattern that gives the2011* limit-update algorithm the chance to measure the effect of2012* injection on request service times, and then to update the2013* limit accordingly.2014*2015* However, in the following special case, the inject limit is2016* left to 1 even if the think time is short: bfqq's I/O is2017* synchronized with that of some other queue, i.e., bfqq may2018* receive new I/O only after the I/O of the other queue is2019* completed. Keeping the inject limit to 1 allows the2020* blocking I/O to be served while bfqq is in service. And2021* this is very convenient both for bfqq and for overall2022* throughput, as explained in detail in the comments in2023* bfq_update_has_short_ttime().2024*2025* On the opposite end, if bfqq has a long think time, then2026* start directly by 1, because:2027* a) on the bright side, keeping at most one request in2028* service in the drive is unlikely to cause any harm to the2029* latency of bfqq's requests, as the service time of a single2030* request is likely to be lower than the think time of bfqq;2031* b) on the downside, after becoming empty, bfqq is likely to2032* expire before getting its next request. With this request2033* arrival pattern, it is very hard to sample total service2034* times and update the inject limit accordingly (see comments2035* on bfq_update_inject_limit()). So the limit is likely to be2036* never, or at least seldom, updated. As a consequence, by2037* setting the limit to 1, we avoid that no injection ever2038* occurs with bfqq. On the downside, this proactive step2039* further reduces chances to actually compute the baseline2040* total service time. 
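	 *
	 * (In short, the initial policy below is simply
	 *	bfqq->inject_limit = bfq_bfqq_has_short_ttime(bfqq) ? 0 : 1;
	 * the adaptive logic in bfq_update_inject_limit() then raises
	 * or lowers the limit from this starting value.)
	 *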
Thus it reduces chances to execute the2041* limit-update algorithm and possibly raise the limit to more2042* than 1.2043*/2044if (bfq_bfqq_has_short_ttime(bfqq))2045bfqq->inject_limit = 0;2046else2047bfqq->inject_limit = 1;20482049bfqq->decrease_time_jif = jiffies;2050}20512052static void bfq_update_io_intensity(struct bfq_queue *bfqq, u64 now_ns)2053{2054u64 tot_io_time = now_ns - bfqq->io_start_time;20552056if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfqq->dispatched == 0)2057bfqq->tot_idle_time +=2058now_ns - bfqq->ttime.last_end_request;20592060if (unlikely(bfq_bfqq_just_created(bfqq)))2061return;20622063/*2064* Must be busy for at least about 80% of the time to be2065* considered I/O bound.2066*/2067if (bfqq->tot_idle_time * 5 > tot_io_time)2068bfq_clear_bfqq_IO_bound(bfqq);2069else2070bfq_mark_bfqq_IO_bound(bfqq);20712072/*2073* Keep an observation window of at most 200 ms in the past2074* from now.2075*/2076if (tot_io_time > 200 * NSEC_PER_MSEC) {2077bfqq->io_start_time = now_ns - (tot_io_time>>1);2078bfqq->tot_idle_time >>= 1;2079}2080}20812082/*2083* Detect whether bfqq's I/O seems synchronized with that of some2084* other queue, i.e., whether bfqq, after remaining empty, happens to2085* receive new I/O only right after some I/O request of the other2086* queue has been completed. We call waker queue the other queue, and2087* we assume, for simplicity, that bfqq may have at most one waker2088* queue.2089*2090* A remarkable throughput boost can be reached by unconditionally2091* injecting the I/O of the waker queue, every time a new2092* bfq_dispatch_request happens to be invoked while I/O is being2093* plugged for bfqq. In addition to boosting throughput, this2094* unblocks bfqq's I/O, thereby improving bandwidth and latency for2095* bfqq. Note that these same results may be achieved with the general2096* injection mechanism, but less effectively. For details on this2097* aspect, see the comments on the choice of the queue for injection2098* in bfq_select_queue().2099*2100* Turning back to the detection of a waker queue, a queue Q is deemed as a2101* waker queue for bfqq if, for three consecutive times, bfqq happens to become2102* non empty right after a request of Q has been completed within given2103* timeout. In this respect, even if bfqq is empty, we do not check for a waker2104* if it still has some in-flight I/O. In fact, in this case bfqq is actually2105* still being served by the drive, and may receive new I/O on the completion2106* of some of the in-flight requests. In particular, on the first time, Q is2107* tentatively set as a candidate waker queue, while on the third consecutive2108* time that Q is detected, the field waker_bfqq is set to Q, to confirm that Q2109* is a waker queue for bfqq. These detection steps are performed only if bfqq2110* has a long think time, so as to make it more likely that bfqq's I/O is2111* actually being blocked by a synchronization. This last filter, plus the2112* above three-times requirement and time limit for detection, make false2113* positives less likely.2114*2115* NOTE2116*2117* The sooner a waker queue is detected, the sooner throughput can be2118* boosted by injecting I/O from the waker queue. Fortunately,2119* detection is likely to be actually fast, for the following2120* reasons. While blocked by synchronization, bfqq has a long think2121* time. This implies that bfqq's inject limit is at least equal to 12122* (see the comments in bfq_update_inject_limit()). 
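 *
 * (For reference, the thresholds used below are: new I/O of bfqq must
 * arrive within 4 ms from a request completion of Q for that
 * completion to count as a detection; Q is confirmed as waker on the
 * third consecutive detection; and the tentative state is dropped if
 * more than 128 * bfq_slice_idle, i.e., roughly one second with the
 * default 8 ms bfq_slice_idle, has elapsed since the first detection.)
 *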
So, thanks to2123* injection, the waker queue is likely to be served during the very2124* first I/O-plugging time interval for bfqq. This triggers the first2125* step of the detection mechanism. Thanks again to injection, the2126* candidate waker queue is then likely to be confirmed no later than2127* during the next I/O-plugging interval for bfqq.2128*2129* ISSUE2130*2131* On queue merging all waker information is lost.2132*/2133static void bfq_check_waker(struct bfq_data *bfqd, struct bfq_queue *bfqq,2134u64 now_ns)2135{2136char waker_name[MAX_BFQQ_NAME_LENGTH];21372138if (!bfqd->last_completed_rq_bfqq ||2139bfqd->last_completed_rq_bfqq == bfqq ||2140bfq_bfqq_has_short_ttime(bfqq) ||2141now_ns - bfqd->last_completion >= 4 * NSEC_PER_MSEC ||2142bfqd->last_completed_rq_bfqq == &bfqd->oom_bfqq ||2143bfqq == &bfqd->oom_bfqq)2144return;21452146/*2147* We reset waker detection logic also if too much time has passed2148* since the first detection. If wakeups are rare, pointless idling2149* doesn't hurt throughput that much. The condition below makes sure2150* we do not uselessly idle blocking waker in more than 1/64 cases.2151*/2152if (bfqd->last_completed_rq_bfqq !=2153bfqq->tentative_waker_bfqq ||2154now_ns > bfqq->waker_detection_started +2155128 * (u64)bfqd->bfq_slice_idle) {2156/*2157* First synchronization detected with a2158* candidate waker queue, or with a different2159* candidate waker queue from the current one.2160*/2161bfqq->tentative_waker_bfqq =2162bfqd->last_completed_rq_bfqq;2163bfqq->num_waker_detections = 1;2164bfqq->waker_detection_started = now_ns;2165bfq_bfqq_name(bfqq->tentative_waker_bfqq, waker_name,2166MAX_BFQQ_NAME_LENGTH);2167bfq_log_bfqq(bfqd, bfqq, "set tentative waker %s", waker_name);2168} else /* Same tentative waker queue detected again */2169bfqq->num_waker_detections++;21702171if (bfqq->num_waker_detections == 3) {2172bfqq->waker_bfqq = bfqd->last_completed_rq_bfqq;2173bfqq->tentative_waker_bfqq = NULL;2174bfq_bfqq_name(bfqq->waker_bfqq, waker_name,2175MAX_BFQQ_NAME_LENGTH);2176bfq_log_bfqq(bfqd, bfqq, "set waker %s", waker_name);21772178/*2179* If the waker queue disappears, then2180* bfqq->waker_bfqq must be reset. To2181* this goal, we maintain in each2182* waker queue a list, woken_list, of2183* all the queues that reference the2184* waker queue through their2185* waker_bfqq pointer. 
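		 *
		 * (woken_list is a plain hlist, and each bfq_queue
		 * has a single woken_list_node, so a queue can sit
		 * on at most one woken_list at a time; membership
		 * is tested below with hlist_unhashed().)
		 *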
When the waker2186* queue exits, the waker_bfqq pointer2187* of all the queues in the woken_list2188* is reset.2189*2190* In addition, if bfqq is already in2191* the woken_list of a waker queue,2192* then, before being inserted into2193* the woken_list of a new waker2194* queue, bfqq must be removed from2195* the woken_list of the old waker2196* queue.2197*/2198if (!hlist_unhashed(&bfqq->woken_list_node))2199hlist_del_init(&bfqq->woken_list_node);2200hlist_add_head(&bfqq->woken_list_node,2201&bfqd->last_completed_rq_bfqq->woken_list);2202}2203}22042205static void bfq_add_request(struct request *rq)2206{2207struct bfq_queue *bfqq = RQ_BFQQ(rq);2208struct bfq_data *bfqd = bfqq->bfqd;2209struct request *next_rq, *prev;2210unsigned int old_wr_coeff = bfqq->wr_coeff;2211bool interactive = false;2212u64 now_ns = blk_time_get_ns();22132214bfq_log_bfqq(bfqd, bfqq, "add_request %d", rq_is_sync(rq));2215bfqq->queued[rq_is_sync(rq)]++;2216/*2217* Updating of 'bfqd->queued' is protected by 'bfqd->lock', however, it2218* may be read without holding the lock in bfq_has_work().2219*/2220WRITE_ONCE(bfqd->queued, bfqd->queued + 1);22212222if (bfq_bfqq_sync(bfqq) && RQ_BIC(rq)->requests <= 1) {2223bfq_check_waker(bfqd, bfqq, now_ns);22242225/*2226* Periodically reset inject limit, to make sure that2227* the latter eventually drops in case workload2228* changes, see step (3) in the comments on2229* bfq_update_inject_limit().2230*/2231if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +2232msecs_to_jiffies(1000)))2233bfq_reset_inject_limit(bfqd, bfqq);22342235/*2236* The following conditions must hold to setup a new2237* sampling of total service time, and then a new2238* update of the inject limit:2239* - bfqq is in service, because the total service2240* time is evaluated only for the I/O requests of2241* the queues in service;2242* - this is the right occasion to compute or to2243* lower the baseline total service time, because2244* there are actually no requests in the drive,2245* or2246* the baseline total service time is available, and2247* this is the right occasion to compute the other2248* quantity needed to update the inject limit, i.e.,2249* the total service time caused by the amount of2250* injection allowed by the current value of the2251* limit. It is the right occasion because injection2252* has actually been performed during the service2253* hole, and there are still in-flight requests,2254* which are very likely to be exactly the injected2255* requests, or part of them;2256* - the minimum interval for sampling the total2257* service time and updating the inject limit has2258* elapsed.2259*/2260if (bfqq == bfqd->in_service_queue &&2261(bfqd->tot_rq_in_driver == 0 ||2262(bfqq->last_serv_time_ns > 0 &&2263bfqd->rqs_injected && bfqd->tot_rq_in_driver > 0)) &&2264time_is_before_eq_jiffies(bfqq->decrease_time_jif +2265msecs_to_jiffies(10))) {2266bfqd->last_empty_occupied_ns = blk_time_get_ns();2267/*2268* Start the state machine for measuring the2269* total service time of rq: setting2270* wait_dispatch will cause bfqd->waited_rq to2271* be set when rq will be dispatched.2272*/2273bfqd->wait_dispatch = true;2274/*2275* If there is no I/O in service in the drive,2276* then possible injection occurred before the2277* arrival of rq will not affect the total2278* service time of rq. So the injection limit2279* must not be updated as a function of such2280* total service time, unless new injection2281* occurs before rq is completed. 
To have the2282* injection limit updated only in the latter2283* case, reset rqs_injected here (rqs_injected2284* will be set in case injection is performed2285* on bfqq before rq is completed).2286*/2287if (bfqd->tot_rq_in_driver == 0)2288bfqd->rqs_injected = false;2289}2290}22912292if (bfq_bfqq_sync(bfqq))2293bfq_update_io_intensity(bfqq, now_ns);22942295elv_rb_add(&bfqq->sort_list, rq);22962297/*2298* Check if this request is a better next-serve candidate.2299*/2300prev = bfqq->next_rq;2301next_rq = bfq_choose_req(bfqd, bfqq->next_rq, rq, bfqd->last_position);2302bfqq->next_rq = next_rq;23032304/*2305* Adjust priority tree position, if next_rq changes.2306* See comments on bfq_pos_tree_add_move() for the unlikely().2307*/2308if (unlikely(!bfqd->nonrot_with_queueing && prev != bfqq->next_rq))2309bfq_pos_tree_add_move(bfqd, bfqq);23102311if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */2312bfq_bfqq_handle_idle_busy_switch(bfqd, bfqq, old_wr_coeff,2313rq, &interactive);2314else {2315if (bfqd->low_latency && old_wr_coeff == 1 && !rq_is_sync(rq) &&2316time_is_before_jiffies(2317bfqq->last_wr_start_finish +2318bfqd->bfq_wr_min_inter_arr_async)) {2319bfqq->wr_coeff = bfqd->bfq_wr_coeff;2320bfqq->wr_cur_max_time = bfq_wr_duration(bfqd);23212322bfqd->wr_busy_queues++;2323bfqq->entity.prio_changed = 1;2324}2325if (prev != bfqq->next_rq)2326bfq_updated_next_req(bfqd, bfqq);2327}23282329/*2330* Assign jiffies to last_wr_start_finish in the following2331* cases:2332*2333* . if bfqq is not going to be weight-raised, because, for2334* non weight-raised queues, last_wr_start_finish stores the2335* arrival time of the last request; as of now, this piece2336* of information is used only for deciding whether to2337* weight-raise async queues2338*2339* . if bfqq is not weight-raised, because, if bfqq is now2340* switching to weight-raised, then last_wr_start_finish2341* stores the time when weight-raising starts2342*2343* . 
if bfqq is interactive, because, regardless of whether2344* bfqq is currently weight-raised, the weight-raising2345* period must start or restart (this case is considered2346* separately because it is not detected by the above2347* conditions, if bfqq is already weight-raised)2348*2349* last_wr_start_finish has to be updated also if bfqq is soft2350* real-time, because the weight-raising period is constantly2351* restarted on idle-to-busy transitions for these queues, but2352* this is already done in bfq_bfqq_handle_idle_busy_switch if2353* needed.2354*/2355if (bfqd->low_latency &&2356(old_wr_coeff == 1 || bfqq->wr_coeff == 1 || interactive))2357bfqq->last_wr_start_finish = jiffies;2358}23592360static struct request *bfq_find_rq_fmerge(struct bfq_data *bfqd,2361struct bio *bio,2362struct request_queue *q)2363{2364struct bfq_queue *bfqq = bfqd->bio_bfqq;236523662367if (bfqq)2368return elv_rb_find(&bfqq->sort_list, bio_end_sector(bio));23692370return NULL;2371}23722373static sector_t get_sdist(sector_t last_pos, struct request *rq)2374{2375if (last_pos)2376return abs(blk_rq_pos(rq) - last_pos);23772378return 0;2379}23802381static void bfq_remove_request(struct request_queue *q,2382struct request *rq)2383{2384struct bfq_queue *bfqq = RQ_BFQQ(rq);2385struct bfq_data *bfqd = bfqq->bfqd;2386const int sync = rq_is_sync(rq);23872388if (bfqq->next_rq == rq) {2389bfqq->next_rq = bfq_find_next_rq(bfqd, bfqq, rq);2390bfq_updated_next_req(bfqd, bfqq);2391}23922393if (rq->queuelist.prev != &rq->queuelist)2394list_del_init(&rq->queuelist);2395bfqq->queued[sync]--;2396/*2397* Updating of 'bfqd->queued' is protected by 'bfqd->lock', however, it2398* may be read without holding the lock in bfq_has_work().2399*/2400WRITE_ONCE(bfqd->queued, bfqd->queued - 1);2401elv_rb_del(&bfqq->sort_list, rq);24022403elv_rqhash_del(q, rq);2404if (q->last_merge == rq)2405q->last_merge = NULL;24062407if (RB_EMPTY_ROOT(&bfqq->sort_list)) {2408bfqq->next_rq = NULL;24092410if (bfq_bfqq_busy(bfqq) && bfqq != bfqd->in_service_queue) {2411bfq_del_bfqq_busy(bfqq, false);2412/*2413* bfqq emptied. In normal operation, when2414* bfqq is empty, bfqq->entity.service and2415* bfqq->entity.budget must contain,2416* respectively, the service received and the2417* budget used last time bfqq emptied. These2418* facts do not hold in this case, as at least2419* this last removal occurred while bfqq is2420* not in service. 
To avoid inconsistencies,2421* reset both bfqq->entity.service and2422* bfqq->entity.budget, if bfqq has still a2423* process that may issue I/O requests to it.2424*/2425bfqq->entity.budget = bfqq->entity.service = 0;2426}24272428/*2429* Remove queue from request-position tree as it is empty.2430*/2431if (bfqq->pos_root) {2432rb_erase(&bfqq->pos_node, bfqq->pos_root);2433bfqq->pos_root = NULL;2434}2435} else {2436/* see comments on bfq_pos_tree_add_move() for the unlikely() */2437if (unlikely(!bfqd->nonrot_with_queueing))2438bfq_pos_tree_add_move(bfqd, bfqq);2439}24402441if (rq->cmd_flags & REQ_META)2442bfqq->meta_pending--;24432444}24452446static bool bfq_bio_merge(struct request_queue *q, struct bio *bio,2447unsigned int nr_segs)2448{2449struct bfq_data *bfqd = q->elevator->elevator_data;2450struct bfq_io_cq *bic = bfq_bic_lookup(q);2451struct request *free = NULL;2452bool ret;24532454spin_lock_irq(&bfqd->lock);24552456if (bic) {2457/*2458* Make sure cgroup info is uptodate for current process before2459* considering the merge.2460*/2461bfq_bic_update_cgroup(bic, bio);24622463bfqd->bio_bfqq = bic_to_bfqq(bic, op_is_sync(bio->bi_opf),2464bfq_actuator_index(bfqd, bio));2465} else {2466bfqd->bio_bfqq = NULL;2467}2468bfqd->bio_bic = bic;24692470ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);24712472spin_unlock_irq(&bfqd->lock);2473if (free)2474blk_mq_free_request(free);24752476return ret;2477}24782479static int bfq_request_merge(struct request_queue *q, struct request **req,2480struct bio *bio)2481{2482struct bfq_data *bfqd = q->elevator->elevator_data;2483struct request *__rq;24842485__rq = bfq_find_rq_fmerge(bfqd, bio, q);2486if (__rq && elv_bio_merge_ok(__rq, bio)) {2487*req = __rq;24882489if (blk_discard_mergable(__rq))2490return ELEVATOR_DISCARD_MERGE;2491return ELEVATOR_FRONT_MERGE;2492}24932494return ELEVATOR_NO_MERGE;2495}24962497static void bfq_request_merged(struct request_queue *q, struct request *req,2498enum elv_merge type)2499{2500if (type == ELEVATOR_FRONT_MERGE &&2501rb_prev(&req->rb_node) &&2502blk_rq_pos(req) <2503blk_rq_pos(container_of(rb_prev(&req->rb_node),2504struct request, rb_node))) {2505struct bfq_queue *bfqq = RQ_BFQQ(req);2506struct bfq_data *bfqd;2507struct request *prev, *next_rq;25082509if (!bfqq)2510return;25112512bfqd = bfqq->bfqd;25132514/* Reposition request in its sort_list */2515elv_rb_del(&bfqq->sort_list, req);2516elv_rb_add(&bfqq->sort_list, req);25172518/* Choose next request to be served for bfqq */2519prev = bfqq->next_rq;2520next_rq = bfq_choose_req(bfqd, bfqq->next_rq, req,2521bfqd->last_position);2522bfqq->next_rq = next_rq;2523/*2524* If next_rq changes, update both the queue's budget to2525* fit the new request and the queue's position in its2526* rq_pos_tree.2527*/2528if (prev != bfqq->next_rq) {2529bfq_updated_next_req(bfqd, bfqq);2530/*2531* See comments on bfq_pos_tree_add_move() for2532* the unlikely().2533*/2534if (unlikely(!bfqd->nonrot_with_queueing))2535bfq_pos_tree_add_move(bfqd, bfqq);2536}2537}2538}25392540/*2541* This function is called to notify the scheduler that the requests2542* rq and 'next' have been merged, with 'next' going away. 
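 *
 * (A compact way to state the fix implemented below: when rq and
 * 'next' sit in the same bfq_queue, rq ends up in the older of the
 * two fifo positions, with
 *	rq->fifo_time = min(rq->fifo_time, next->fifo_time).)
 *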
 * BFQ exploits this hook to address the following issue: if 'next'
 * has a fifo_time lower than rq's, then the fifo_time of rq must be
 * set to the value of 'next', so as not to forget the greater age of
 * 'next'.
 *
 * NOTE: in this function we assume that rq is in a bfq_queue, based
 * on the fact that rq is picked from the hash table
 * q->elevator->hash, which, in turn, is filled only with I/O requests
 * present in bfq_queues, while BFQ is in use for the request queue
 * q. In fact, the function that fills this hash table
 * (elv_rqhash_add) is called only by bfq_insert_request.
 */
static void bfq_requests_merged(struct request_queue *q, struct request *rq,
				struct request *next)
{
	struct bfq_queue *bfqq = RQ_BFQQ(rq),
		*next_bfqq = RQ_BFQQ(next);

	if (!bfqq)
		goto remove;

	/*
	 * If next and rq belong to the same bfq_queue and next is older
	 * than rq, then reposition rq in the fifo (by substituting next
	 * with rq). Otherwise, if next and rq belong to different
	 * bfq_queues, never reposition rq: in fact, we would have to
	 * reposition it with respect to next's position in its own fifo,
	 * which would most certainly be too expensive with respect to
	 * the benefits.
	 */
	if (bfqq == next_bfqq &&
	    !list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
	    next->fifo_time < rq->fifo_time) {
		list_del_init(&rq->queuelist);
		list_replace_init(&next->queuelist, &rq->queuelist);
		rq->fifo_time = next->fifo_time;
	}

	if (bfqq->next_rq == next)
		bfqq->next_rq = rq;

	bfqg_stats_update_io_merged(bfqq_group(bfqq), next->cmd_flags);
remove:
	/* Merged request may be in the IO scheduler. Remove it. */
	if (!RB_EMPTY_NODE(&next->rb_node)) {
		bfq_remove_request(next->q, next);
		if (next_bfqq)
			bfqg_stats_update_io_remove(bfqq_group(next_bfqq),
						    next->cmd_flags);
	}
}

/* Must be called with bfqq != NULL */
static void bfq_bfqq_end_wr(struct bfq_queue *bfqq)
{
	/*
	 * If bfqq has been enjoying interactive weight-raising, then
	 * reset soft_rt_next_start. We do it for the following
	 * reason. bfqq may have been conveying the I/O needed to load
	 * a soft real-time application. Such an application actually
	 * exhibits a soft real-time I/O pattern after it finishes
	 * loading, and finally starts doing its job. But, if bfqq has
	 * been receiving a lot of bandwidth so far (likely to happen
	 * on a fast device), then soft_rt_next_start now contains a
	 * high value.
So, without this reset, bfqq would be2607* prevented from being possibly considered as soft_rt for a2608* very long time.2609*/26102611if (bfqq->wr_cur_max_time !=2612bfqq->bfqd->bfq_wr_rt_max_time)2613bfqq->soft_rt_next_start = jiffies;26142615if (bfq_bfqq_busy(bfqq))2616bfqq->bfqd->wr_busy_queues--;2617bfqq->wr_coeff = 1;2618bfqq->wr_cur_max_time = 0;2619bfqq->last_wr_start_finish = jiffies;2620/*2621* Trigger a weight change on the next invocation of2622* __bfq_entity_update_weight_prio.2623*/2624bfqq->entity.prio_changed = 1;2625}26262627void bfq_end_wr_async_queues(struct bfq_data *bfqd,2628struct bfq_group *bfqg)2629{2630int i, j, k;26312632for (k = 0; k < bfqd->num_actuators; k++) {2633for (i = 0; i < 2; i++)2634for (j = 0; j < IOPRIO_NR_LEVELS; j++)2635if (bfqg->async_bfqq[i][j][k])2636bfq_bfqq_end_wr(bfqg->async_bfqq[i][j][k]);2637if (bfqg->async_idle_bfqq[k])2638bfq_bfqq_end_wr(bfqg->async_idle_bfqq[k]);2639}2640}26412642static void bfq_end_wr(struct bfq_data *bfqd)2643{2644struct bfq_queue *bfqq;2645int i;26462647spin_lock_irq(&bfqd->lock);26482649for (i = 0; i < bfqd->num_actuators; i++) {2650list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)2651bfq_bfqq_end_wr(bfqq);2652}2653list_for_each_entry(bfqq, &bfqd->idle_list, bfqq_list)2654bfq_bfqq_end_wr(bfqq);2655bfq_end_wr_async(bfqd);26562657spin_unlock_irq(&bfqd->lock);2658}26592660static sector_t bfq_io_struct_pos(void *io_struct, bool request)2661{2662if (request)2663return blk_rq_pos(io_struct);2664else2665return ((struct bio *)io_struct)->bi_iter.bi_sector;2666}26672668static int bfq_rq_close_to_sector(void *io_struct, bool request,2669sector_t sector)2670{2671return abs(bfq_io_struct_pos(io_struct, request) - sector) <=2672BFQQ_CLOSE_THR;2673}26742675static struct bfq_queue *bfqq_find_close(struct bfq_data *bfqd,2676struct bfq_queue *bfqq,2677sector_t sector)2678{2679struct rb_root *root = &bfqq_group(bfqq)->rq_pos_tree;2680struct rb_node *parent, *node;2681struct bfq_queue *__bfqq;26822683if (RB_EMPTY_ROOT(root))2684return NULL;26852686/*2687* First, if we find a request starting at the end of the last2688* request, choose it.2689*/2690__bfqq = bfq_rq_pos_tree_lookup(bfqd, root, sector, &parent, NULL);2691if (__bfqq)2692return __bfqq;26932694/*2695* If the exact sector wasn't found, the parent of the NULL leaf2696* will contain the closest sector (rq_pos_tree sorted by2697* next_request position).2698*/2699__bfqq = rb_entry(parent, struct bfq_queue, pos_node);2700if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))2701return __bfqq;27022703if (blk_rq_pos(__bfqq->next_rq) < sector)2704node = rb_next(&__bfqq->pos_node);2705else2706node = rb_prev(&__bfqq->pos_node);2707if (!node)2708return NULL;27092710__bfqq = rb_entry(node, struct bfq_queue, pos_node);2711if (bfq_rq_close_to_sector(__bfqq->next_rq, true, sector))2712return __bfqq;27132714return NULL;2715}27162717static struct bfq_queue *bfq_find_close_cooperator(struct bfq_data *bfqd,2718struct bfq_queue *cur_bfqq,2719sector_t sector)2720{2721struct bfq_queue *bfqq;27222723/*2724* We shall notice if some of the queues are cooperating,2725* e.g., working closely on the same area of the device. 
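	 *
	 * ("Close" is the criterion of bfq_rq_close_to_sector(): the
	 * candidate queue's next_rq must start within BFQQ_CLOSE_THR
	 * sectors of the given sector, in either direction.)
	 *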
In2726* that case, we can group them together and: 1) don't waste2727* time idling, and 2) serve the union of their requests in2728* the best possible order for throughput.2729*/2730bfqq = bfqq_find_close(bfqd, cur_bfqq, sector);2731if (!bfqq || bfqq == cur_bfqq)2732return NULL;27332734return bfqq;2735}27362737static struct bfq_queue *2738bfq_setup_merge(struct bfq_queue *bfqq, struct bfq_queue *new_bfqq)2739{2740int process_refs, new_process_refs;2741struct bfq_queue *__bfqq;27422743/*2744* If there are no process references on the new_bfqq, then it is2745* unsafe to follow the ->new_bfqq chain as other bfqq's in the chain2746* may have dropped their last reference (not just their last process2747* reference).2748*/2749if (!bfqq_process_refs(new_bfqq))2750return NULL;27512752/* Avoid a circular list and skip interim queue merges. */2753while ((__bfqq = new_bfqq->new_bfqq)) {2754if (__bfqq == bfqq)2755return NULL;2756new_bfqq = __bfqq;2757}27582759process_refs = bfqq_process_refs(bfqq);2760new_process_refs = bfqq_process_refs(new_bfqq);2761/*2762* If the process for the bfqq has gone away, there is no2763* sense in merging the queues.2764*/2765if (process_refs == 0 || new_process_refs == 0)2766return NULL;27672768/*2769* Make sure merged queues belong to the same parent. Parents could2770* have changed since the time we decided the two queues are suitable2771* for merging.2772*/2773if (new_bfqq->entity.parent != bfqq->entity.parent)2774return NULL;27752776bfq_log_bfqq(bfqq->bfqd, bfqq, "scheduling merge with queue %d",2777new_bfqq->pid);27782779/*2780* Merging is just a redirection: the requests of the process2781* owning one of the two queues are redirected to the other queue.2782* The latter queue, in its turn, is set as shared if this is the2783* first time that the requests of some process are redirected to2784* it.2785*2786* We redirect bfqq to new_bfqq and not the opposite, because2787* we are in the context of the process owning bfqq, thus we2788* have the io_cq of this process. So we can immediately2789* configure this io_cq to redirect the requests of the2790* process to new_bfqq. In contrast, the io_cq of new_bfqq is2791* not available any more (new_bfqq->bic == NULL).2792*2793* Anyway, even in case new_bfqq coincides with the in-service2794* queue, redirecting requests the in-service queue is the2795* best option, as we feed the in-service queue with new2796* requests close to the last request served and, by doing so,2797* are likely to increase the throughput.2798*/2799bfqq->new_bfqq = new_bfqq;2800/*2801* The above assignment schedules the following redirections:2802* each time some I/O for bfqq arrives, the process that2803* generated that I/O is disassociated from bfqq and2804* associated with new_bfqq. 
Here we increases new_bfqq->ref2805* in advance, adding the number of processes that are2806* expected to be associated with new_bfqq as they happen to2807* issue I/O.2808*/2809new_bfqq->ref += process_refs;2810return new_bfqq;2811}28122813static bool bfq_may_be_close_cooperator(struct bfq_queue *bfqq,2814struct bfq_queue *new_bfqq)2815{2816if (bfq_too_late_for_merging(new_bfqq))2817return false;28182819if (bfq_class_idle(bfqq) || bfq_class_idle(new_bfqq) ||2820(bfqq->ioprio_class != new_bfqq->ioprio_class))2821return false;28222823/*2824* If either of the queues has already been detected as seeky,2825* then merging it with the other queue is unlikely to lead to2826* sequential I/O.2827*/2828if (BFQQ_SEEKY(bfqq) || BFQQ_SEEKY(new_bfqq))2829return false;28302831/*2832* Interleaved I/O is known to be done by (some) applications2833* only for reads, so it does not make sense to merge async2834* queues.2835*/2836if (!bfq_bfqq_sync(bfqq) || !bfq_bfqq_sync(new_bfqq))2837return false;28382839return true;2840}28412842static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,2843struct bfq_queue *bfqq);28442845static struct bfq_queue *2846bfq_setup_stable_merge(struct bfq_data *bfqd, struct bfq_queue *bfqq,2847struct bfq_queue *stable_merge_bfqq,2848struct bfq_iocq_bfqq_data *bfqq_data)2849{2850int proc_ref = min(bfqq_process_refs(bfqq),2851bfqq_process_refs(stable_merge_bfqq));2852struct bfq_queue *new_bfqq = NULL;28532854bfqq_data->stable_merge_bfqq = NULL;2855if (idling_boosts_thr_without_issues(bfqd, bfqq) || proc_ref == 0)2856goto out;28572858/* next function will take at least one ref */2859new_bfqq = bfq_setup_merge(bfqq, stable_merge_bfqq);28602861if (new_bfqq) {2862bfqq_data->stably_merged = true;2863if (new_bfqq->bic) {2864unsigned int new_a_idx = new_bfqq->actuator_idx;2865struct bfq_iocq_bfqq_data *new_bfqq_data =2866&new_bfqq->bic->bfqq_data[new_a_idx];28672868new_bfqq_data->stably_merged = true;2869}2870}28712872out:2873/* deschedule stable merge, because done or aborted here */2874bfq_put_stable_ref(stable_merge_bfqq);28752876return new_bfqq;2877}28782879/*2880* Attempt to schedule a merge of bfqq with the currently in-service2881* queue or with a close queue among the scheduled queues. Return2882* NULL if no merge was scheduled, a pointer to the shared bfq_queue2883* structure otherwise.2884*2885* The OOM queue is not allowed to participate to cooperation: in fact, since2886* the requests temporarily redirected to the OOM queue could be redirected2887* again to dedicated queues at any time, the state needed to correctly2888* handle merging with the OOM queue would be quite complex and expensive2889* to maintain. 
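 *
 * (Concretely, this only means that the checks below bail out
 * whenever bfqq itself, the in-service queue, or the candidate
 * cooperator turns out to be &bfqd->oom_bfqq.)
 *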
Besides, in such a critical condition as an out of memory,2890* the benefits of queue merging may be little relevant, or even negligible.2891*2892* WARNING: queue merging may impair fairness among non-weight raised2893* queues, for at least two reasons: 1) the original weight of a2894* merged queue may change during the merged state, 2) even being the2895* weight the same, a merged queue may be bloated with many more2896* requests than the ones produced by its originally-associated2897* process.2898*/2899static struct bfq_queue *2900bfq_setup_cooperator(struct bfq_data *bfqd, struct bfq_queue *bfqq,2901void *io_struct, bool request, struct bfq_io_cq *bic)2902{2903struct bfq_queue *in_service_bfqq, *new_bfqq;2904unsigned int a_idx = bfqq->actuator_idx;2905struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];29062907/* if a merge has already been setup, then proceed with that first */2908new_bfqq = bfqq->new_bfqq;2909if (new_bfqq) {2910while (new_bfqq->new_bfqq)2911new_bfqq = new_bfqq->new_bfqq;2912return new_bfqq;2913}29142915/*2916* Check delayed stable merge for rotational or non-queueing2917* devs. For this branch to be executed, bfqq must not be2918* currently merged with some other queue (i.e., bfqq->bic2919* must be non null). If we considered also merged queues,2920* then we should also check whether bfqq has already been2921* merged with bic->stable_merge_bfqq. But this would be2922* costly and complicated.2923*/2924if (unlikely(!bfqd->nonrot_with_queueing)) {2925/*2926* Make sure also that bfqq is sync, because2927* bic->stable_merge_bfqq may point to some queue (for2928* stable merging) also if bic is associated with a2929* sync queue, but this bfqq is async2930*/2931if (bfq_bfqq_sync(bfqq) && bfqq_data->stable_merge_bfqq &&2932!bfq_bfqq_just_created(bfqq) &&2933time_is_before_jiffies(bfqq->split_time +2934msecs_to_jiffies(bfq_late_stable_merging)) &&2935time_is_before_jiffies(bfqq->creation_time +2936msecs_to_jiffies(bfq_late_stable_merging))) {2937struct bfq_queue *stable_merge_bfqq =2938bfqq_data->stable_merge_bfqq;29392940return bfq_setup_stable_merge(bfqd, bfqq,2941stable_merge_bfqq,2942bfqq_data);2943}2944}29452946/*2947* Do not perform queue merging if the device is non2948* rotational and performs internal queueing. In fact, such a2949* device reaches a high speed through internal parallelism2950* and pipelining. This means that, to reach a high2951* throughput, it must have many requests enqueued at the same2952* time. But, in this configuration, the internal scheduling2953* algorithm of the device does exactly the job of queue2954* merging: it reorders requests so as to obtain as much as2955* possible a sequential I/O pattern. As a consequence, with2956* the workload generated by processes doing interleaved I/O,2957* the throughput reached by the device is likely to be the2958* same, with and without queue merging.2959*2960* Disabling merging also provides a remarkable benefit in2961* terms of throughput. Merging tends to make many workloads2962* artificially more uneven, because of shared queues2963* remaining non empty for incomparably more time than2964* non-merged queues. This may accentuate workload2965* asymmetries. For example, if one of the queues in a set of2966* merged queues has a higher weight than a normal queue, then2967* the shared queue may inherit such a high weight and, by2968* staying almost always active, may force BFQ to perform I/O2969* plugging most of the time. 
This evidently makes it harder2970* for BFQ to let the device reach a high throughput.2971*2972* Finally, the likely() macro below is not used because one2973* of the two branches is more likely than the other, but to2974* have the code path after the following if() executed as2975* fast as possible for the case of a non rotational device2976* with queueing. We want it because this is the fastest kind2977* of device. On the opposite end, the likely() may lengthen2978* the execution time of BFQ for the case of slower devices2979* (rotational or at least without queueing). But in this case2980* the execution time of BFQ matters very little, if not at2981* all.2982*/2983if (likely(bfqd->nonrot_with_queueing))2984return NULL;29852986/*2987* Prevent bfqq from being merged if it has been created too2988* long ago. The idea is that true cooperating processes, and2989* thus their associated bfq_queues, are supposed to be2990* created shortly after each other. This is the case, e.g.,2991* for KVM/QEMU and dump I/O threads. Basing on this2992* assumption, the following filtering greatly reduces the2993* probability that two non-cooperating processes, which just2994* happen to do close I/O for some short time interval, have2995* their queues merged by mistake.2996*/2997if (bfq_too_late_for_merging(bfqq))2998return NULL;29993000if (!io_struct || unlikely(bfqq == &bfqd->oom_bfqq))3001return NULL;30023003/* If there is only one backlogged queue, don't search. */3004if (bfq_tot_busy_queues(bfqd) == 1)3005return NULL;30063007in_service_bfqq = bfqd->in_service_queue;30083009if (in_service_bfqq && in_service_bfqq != bfqq &&3010likely(in_service_bfqq != &bfqd->oom_bfqq) &&3011bfq_rq_close_to_sector(io_struct, request,3012bfqd->in_serv_last_pos) &&3013bfqq->entity.parent == in_service_bfqq->entity.parent &&3014bfq_may_be_close_cooperator(bfqq, in_service_bfqq)) {3015new_bfqq = bfq_setup_merge(bfqq, in_service_bfqq);3016if (new_bfqq)3017return new_bfqq;3018}3019/*3020* Check whether there is a cooperator among currently scheduled3021* queues. The only thing we need is that the bio/request is not3022* NULL, as we need it to establish whether a cooperator exists.3023*/3024new_bfqq = bfq_find_close_cooperator(bfqd, bfqq,3025bfq_io_struct_pos(io_struct, request));30263027if (new_bfqq && likely(new_bfqq != &bfqd->oom_bfqq) &&3028bfq_may_be_close_cooperator(bfqq, new_bfqq))3029return bfq_setup_merge(bfqq, new_bfqq);30303031return NULL;3032}30333034static void bfq_bfqq_save_state(struct bfq_queue *bfqq)3035{3036struct bfq_io_cq *bic = bfqq->bic;3037unsigned int a_idx = bfqq->actuator_idx;3038struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[a_idx];30393040/*3041* If !bfqq->bic, the queue is already shared or its requests3042* have already been redirected to a shared queue; both idle window3043* and weight raising state have already been saved. 
Do nothing.3044*/3045if (!bic)3046return;30473048bfqq_data->saved_last_serv_time_ns = bfqq->last_serv_time_ns;3049bfqq_data->saved_inject_limit = bfqq->inject_limit;3050bfqq_data->saved_decrease_time_jif = bfqq->decrease_time_jif;30513052bfqq_data->saved_weight = bfqq->entity.orig_weight;3053bfqq_data->saved_ttime = bfqq->ttime;3054bfqq_data->saved_has_short_ttime =3055bfq_bfqq_has_short_ttime(bfqq);3056bfqq_data->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);3057bfqq_data->saved_io_start_time = bfqq->io_start_time;3058bfqq_data->saved_tot_idle_time = bfqq->tot_idle_time;3059bfqq_data->saved_in_large_burst = bfq_bfqq_in_large_burst(bfqq);3060bfqq_data->was_in_burst_list =3061!hlist_unhashed(&bfqq->burst_list_node);30623063if (unlikely(bfq_bfqq_just_created(bfqq) &&3064!bfq_bfqq_in_large_burst(bfqq) &&3065bfqq->bfqd->low_latency)) {3066/*3067* bfqq being merged right after being created: bfqq3068* would have deserved interactive weight raising, but3069* did not make it to be set in a weight-raised state,3070* because of this early merge. Store directly the3071* weight-raising state that would have been assigned3072* to bfqq, so that to avoid that bfqq unjustly fails3073* to enjoy weight raising if split soon.3074*/3075bfqq_data->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff;3076bfqq_data->saved_wr_start_at_switch_to_srt =3077bfq_smallest_from_now();3078bfqq_data->saved_wr_cur_max_time =3079bfq_wr_duration(bfqq->bfqd);3080bfqq_data->saved_last_wr_start_finish = jiffies;3081} else {3082bfqq_data->saved_wr_coeff = bfqq->wr_coeff;3083bfqq_data->saved_wr_start_at_switch_to_srt =3084bfqq->wr_start_at_switch_to_srt;3085bfqq_data->saved_service_from_wr =3086bfqq->service_from_wr;3087bfqq_data->saved_last_wr_start_finish =3088bfqq->last_wr_start_finish;3089bfqq_data->saved_wr_cur_max_time = bfqq->wr_cur_max_time;3090}3091}309230933094void bfq_reassign_last_bfqq(struct bfq_queue *cur_bfqq,3095struct bfq_queue *new_bfqq)3096{3097if (cur_bfqq->entity.parent &&3098cur_bfqq->entity.parent->last_bfqq_created == cur_bfqq)3099cur_bfqq->entity.parent->last_bfqq_created = new_bfqq;3100else if (cur_bfqq->bfqd && cur_bfqq->bfqd->last_bfqq_created == cur_bfqq)3101cur_bfqq->bfqd->last_bfqq_created = new_bfqq;3102}31033104void bfq_release_process_ref(struct bfq_data *bfqd, struct bfq_queue *bfqq)3105{3106/*3107* To prevent bfqq's service guarantees from being violated,3108* bfqq may be left busy, i.e., queued for service, even if3109* empty (see comments in __bfq_bfqq_expire() for3110* details). But, if no process will send requests to bfqq any3111* longer, then there is no point in keeping bfqq queued for3112* service. In addition, keeping bfqq queued for service, but3113* with no process ref any longer, may have caused bfqq to be3114* freed when dequeued from service. 
But this is assumed to3115* never happen.3116*/3117if (bfq_bfqq_busy(bfqq) && RB_EMPTY_ROOT(&bfqq->sort_list) &&3118bfqq != bfqd->in_service_queue)3119bfq_del_bfqq_busy(bfqq, false);31203121bfq_reassign_last_bfqq(bfqq, NULL);31223123bfq_put_queue(bfqq);3124}31253126static struct bfq_queue *bfq_merge_bfqqs(struct bfq_data *bfqd,3127struct bfq_io_cq *bic,3128struct bfq_queue *bfqq)3129{3130struct bfq_queue *new_bfqq = bfqq->new_bfqq;31313132bfq_log_bfqq(bfqd, bfqq, "merging with queue %lu",3133(unsigned long)new_bfqq->pid);3134/* Save weight raising and idle window of the merged queues */3135bfq_bfqq_save_state(bfqq);3136bfq_bfqq_save_state(new_bfqq);3137if (bfq_bfqq_IO_bound(bfqq))3138bfq_mark_bfqq_IO_bound(new_bfqq);3139bfq_clear_bfqq_IO_bound(bfqq);31403141/*3142* The processes associated with bfqq are cooperators of the3143* processes associated with new_bfqq. So, if bfqq has a3144* waker, then assume that all these processes will be happy3145* to let bfqq's waker freely inject I/O when they have no3146* I/O.3147*/3148if (bfqq->waker_bfqq && !new_bfqq->waker_bfqq &&3149bfqq->waker_bfqq != new_bfqq) {3150new_bfqq->waker_bfqq = bfqq->waker_bfqq;3151new_bfqq->tentative_waker_bfqq = NULL;31523153/*3154* If the waker queue disappears, then3155* new_bfqq->waker_bfqq must be reset. So insert3156* new_bfqq into the woken_list of the waker. See3157* bfq_check_waker for details.3158*/3159hlist_add_head(&new_bfqq->woken_list_node,3160&new_bfqq->waker_bfqq->woken_list);31613162}31633164/*3165* If bfqq is weight-raised, then let new_bfqq inherit3166* weight-raising. To reduce false positives, neglect the case3167* where bfqq has just been created, but has not yet made it3168* to be weight-raised (which may happen because EQM may merge3169* bfqq even before bfq_add_request is executed for the first3170* time for bfqq). Handling this case would however be very3171* easy, thanks to the flag just_created.3172*/3173if (new_bfqq->wr_coeff == 1 && bfqq->wr_coeff > 1) {3174new_bfqq->wr_coeff = bfqq->wr_coeff;3175new_bfqq->wr_cur_max_time = bfqq->wr_cur_max_time;3176new_bfqq->last_wr_start_finish = bfqq->last_wr_start_finish;3177new_bfqq->wr_start_at_switch_to_srt =3178bfqq->wr_start_at_switch_to_srt;3179if (bfq_bfqq_busy(new_bfqq))3180bfqd->wr_busy_queues++;3181new_bfqq->entity.prio_changed = 1;3182}31833184if (bfqq->wr_coeff > 1) { /* bfqq has given its wr to new_bfqq */3185bfqq->wr_coeff = 1;3186bfqq->entity.prio_changed = 1;3187if (bfq_bfqq_busy(bfqq))3188bfqd->wr_busy_queues--;3189}31903191bfq_log_bfqq(bfqd, new_bfqq, "merge_bfqqs: wr_busy %d",3192bfqd->wr_busy_queues);31933194/*3195* Merge queues (that is, let bic redirect its requests to new_bfqq)3196*/3197bic_set_bfqq(bic, new_bfqq, true, bfqq->actuator_idx);3198bfq_mark_bfqq_coop(new_bfqq);3199/*3200* new_bfqq now belongs to at least two bics (it is a shared queue):3201* set new_bfqq->bic to NULL. bfqq either:3202* - does not belong to any bic any more, and hence bfqq->bic must3203* be set to NULL, or3204* - is a queue whose owning bics have already been redirected to a3205* different queue, hence the queue is destined to not belong to3206* any bic soon and bfqq->bic is already NULL (therefore the next3207* assignment causes no harm).3208*/3209new_bfqq->bic = NULL;3210/*3211* If the queue is shared, the pid is the pid of one of the associated3212* processes. Which pid depends on the exact sequence of merge events3213* the queue underwent. 
So printing such a pid is useless and confusing3214* because it reports a random pid between those of the associated3215* processes.3216* We mark such a queue with a pid -1, and then print SHARED instead of3217* a pid in logging messages.3218*/3219new_bfqq->pid = -1;3220bfqq->bic = NULL;32213222bfq_reassign_last_bfqq(bfqq, new_bfqq);32233224bfq_release_process_ref(bfqd, bfqq);32253226return new_bfqq;3227}32283229static bool bfq_allow_bio_merge(struct request_queue *q, struct request *rq,3230struct bio *bio)3231{3232struct bfq_data *bfqd = q->elevator->elevator_data;3233bool is_sync = op_is_sync(bio->bi_opf);3234struct bfq_queue *bfqq = bfqd->bio_bfqq, *new_bfqq;32353236/*3237* Disallow merge of a sync bio into an async request.3238*/3239if (is_sync && !rq_is_sync(rq))3240return false;32413242/*3243* Lookup the bfqq that this bio will be queued with. Allow3244* merge only if rq is queued there.3245*/3246if (!bfqq)3247return false;32483249/*3250* We take advantage of this function to perform an early merge3251* of the queues of possible cooperating processes.3252*/3253new_bfqq = bfq_setup_cooperator(bfqd, bfqq, bio, false, bfqd->bio_bic);3254if (new_bfqq) {3255/*3256* bic still points to bfqq, then it has not yet been3257* redirected to some other bfq_queue, and a queue3258* merge between bfqq and new_bfqq can be safely3259* fulfilled, i.e., bic can be redirected to new_bfqq3260* and bfqq can be put.3261*/3262while (bfqq != new_bfqq)3263bfqq = bfq_merge_bfqqs(bfqd, bfqd->bio_bic, bfqq);32643265/*3266* Change also bqfd->bio_bfqq, as3267* bfqd->bio_bic now points to new_bfqq, and3268* this function may be invoked again (and then may3269* use again bqfd->bio_bfqq).3270*/3271bfqd->bio_bfqq = bfqq;3272}32733274return bfqq == RQ_BFQQ(rq);3275}32763277/*3278* Set the maximum time for the in-service queue to consume its3279* budget. This prevents seeky processes from lowering the throughput.3280* In practice, a time-slice service scheme is used with seeky3281* processes.3282*/3283static void bfq_set_budget_timeout(struct bfq_data *bfqd,3284struct bfq_queue *bfqq)3285{3286unsigned int timeout_coeff;32873288if (bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time)3289timeout_coeff = 1;3290else3291timeout_coeff = bfqq->entity.weight / bfqq->entity.orig_weight;32923293bfqd->last_budget_start = blk_time_get();32943295bfqq->budget_timeout = jiffies +3296bfqd->bfq_timeout * timeout_coeff;3297}32983299static void __bfq_set_in_service_queue(struct bfq_data *bfqd,3300struct bfq_queue *bfqq)3301{3302if (bfqq) {3303bfq_clear_bfqq_fifo_expire(bfqq);33043305bfqd->budgets_assigned = (bfqd->budgets_assigned * 7 + 256) / 8;33063307if (time_is_before_jiffies(bfqq->last_wr_start_finish) &&3308bfqq->wr_coeff > 1 &&3309bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&3310time_is_before_jiffies(bfqq->budget_timeout)) {3311/*3312* For soft real-time queues, move the start3313* of the weight-raising period forward by the3314* time the queue has not received any3315* service. Otherwise, a relatively long3316* service delay is likely to cause the3317* weight-raising period of the queue to end,3318* because of the short duration of the3319* weight-raising period of a soft real-time3320* queue. 
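			 *
			 * As a purely illustrative example (the numbers
			 * are not taken from the code): if a soft
			 * real-time queue has received no service for
			 * the last 50 jiffies, the start of its
			 * weight-raising period is simply pushed 50
			 * jiffies forward, so that the stall does not
			 * eat into its (short) weight-raising period.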
It is worth noting that this move3321* is not so dangerous for the other queues,3322* because soft real-time queues are not3323* greedy.3324*3325* To not add a further variable, we use the3326* overloaded field budget_timeout to3327* determine for how long the queue has not3328* received service, i.e., how much time has3329* elapsed since the queue expired. However,3330* this is a little imprecise, because3331* budget_timeout is set to jiffies if bfqq3332* not only expires, but also remains with no3333* request.3334*/3335if (time_after(bfqq->budget_timeout,3336bfqq->last_wr_start_finish))3337bfqq->last_wr_start_finish +=3338jiffies - bfqq->budget_timeout;3339else3340bfqq->last_wr_start_finish = jiffies;3341}33423343bfq_set_budget_timeout(bfqd, bfqq);3344bfq_log_bfqq(bfqd, bfqq,3345"set_in_service_queue, cur-budget = %d",3346bfqq->entity.budget);3347}33483349bfqd->in_service_queue = bfqq;3350bfqd->in_serv_last_pos = 0;3351}33523353/*3354* Get and set a new queue for service.3355*/3356static struct bfq_queue *bfq_set_in_service_queue(struct bfq_data *bfqd)3357{3358struct bfq_queue *bfqq = bfq_get_next_queue(bfqd);33593360__bfq_set_in_service_queue(bfqd, bfqq);3361return bfqq;3362}33633364static void bfq_arm_slice_timer(struct bfq_data *bfqd)3365{3366struct bfq_queue *bfqq = bfqd->in_service_queue;3367u32 sl;33683369bfq_mark_bfqq_wait_request(bfqq);33703371/*3372* We don't want to idle for seeks, but we do want to allow3373* fair distribution of slice time for a process doing back-to-back3374* seeks. So allow a little bit of time for him to submit a new rq.3375*/3376sl = bfqd->bfq_slice_idle;3377/*3378* Unless the queue is being weight-raised or the scenario is3379* asymmetric, grant only minimum idle time if the queue3380* is seeky. A long idling is preserved for a weight-raised3381* queue, or, more in general, in an asymmetric scenario,3382* because a long idling is needed for guaranteeing to a queue3383* its reserved share of the throughput (in particular, it is3384* needed if the queue has a higher weight than some other3385* queue).3386*/3387if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&3388!bfq_asymmetric_scenario(bfqd, bfqq))3389sl = min_t(u64, sl, BFQ_MIN_TT);3390else if (bfqq->wr_coeff > 1)3391sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);33923393bfqd->last_idling_start = blk_time_get();3394bfqd->last_idling_start_jiffies = jiffies;33953396hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),3397HRTIMER_MODE_REL);3398bfqg_stats_set_start_idle_time(bfqq_group(bfqq));3399}34003401/*3402* In autotuning mode, max_budget is dynamically recomputed as the3403* amount of sectors transferred in timeout at the estimated peak3404* rate. This enables BFQ to utilize a full timeslice with a full3405* budget, even if the in-service queue is served at peak rate. And3406* this maximises throughput with sequential workloads.3407*/3408static unsigned long bfq_calc_max_budget(struct bfq_data *bfqd)3409{3410return (u64)bfqd->peak_rate * USEC_PER_MSEC *3411jiffies_to_msecs(bfqd->bfq_timeout)>>BFQ_RATE_SHIFT;3412}34133414/*3415* Update parameters related to throughput and responsiveness, as a3416* function of the estimated peak rate. 
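 *
 * As a rough, purely illustrative example: peak_rate is stored as
 * (sectors/usec) << BFQ_RATE_SHIFT, so, for a device measured at about
 * one sector per microsecond (~512 MB/s) and a budget timeout of, say,
 * 125 ms, bfq_calc_max_budget() yields about 1 * 1000 * 125 = 125000
 * sectors (~64 MB) as the new max_budget.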
See comments on3417* bfq_calc_max_budget(), and on the ref_wr_duration array.3418*/3419static void update_thr_responsiveness_params(struct bfq_data *bfqd)3420{3421if (bfqd->bfq_user_max_budget == 0) {3422bfqd->bfq_max_budget =3423bfq_calc_max_budget(bfqd);3424bfq_log(bfqd, "new max_budget = %d", bfqd->bfq_max_budget);3425}3426}34273428static void bfq_reset_rate_computation(struct bfq_data *bfqd,3429struct request *rq)3430{3431if (rq != NULL) { /* new rq dispatch now, reset accordingly */3432bfqd->last_dispatch = bfqd->first_dispatch = blk_time_get_ns();3433bfqd->peak_rate_samples = 1;3434bfqd->sequential_samples = 0;3435bfqd->tot_sectors_dispatched = bfqd->last_rq_max_size =3436blk_rq_sectors(rq);3437} else /* no new rq dispatched, just reset the number of samples */3438bfqd->peak_rate_samples = 0; /* full re-init on next disp. */34393440bfq_log(bfqd,3441"reset_rate_computation at end, sample %u/%u tot_sects %llu",3442bfqd->peak_rate_samples, bfqd->sequential_samples,3443bfqd->tot_sectors_dispatched);3444}34453446static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)3447{3448u32 rate, weight, divisor;34493450/*3451* For the convergence property to hold (see comments on3452* bfq_update_peak_rate()) and for the assessment to be3453* reliable, a minimum number of samples must be present, and3454* a minimum amount of time must have elapsed. If not so, do3455* not compute new rate. Just reset parameters, to get ready3456* for a new evaluation attempt.3457*/3458if (bfqd->peak_rate_samples < BFQ_RATE_MIN_SAMPLES ||3459bfqd->delta_from_first < BFQ_RATE_MIN_INTERVAL)3460goto reset_computation;34613462/*3463* If a new request completion has occurred after last3464* dispatch, then, to approximate the rate at which requests3465* have been served by the device, it is more precise to3466* extend the observation interval to the last completion.3467*/3468bfqd->delta_from_first =3469max_t(u64, bfqd->delta_from_first,3470bfqd->last_completion - bfqd->first_dispatch);34713472/*3473* Rate computed in sects/usec, and not sects/nsec, for3474* precision issues.3475*/3476rate = div64_ul(bfqd->tot_sectors_dispatched<<BFQ_RATE_SHIFT,3477div_u64(bfqd->delta_from_first, NSEC_PER_USEC));34783479/*3480* Peak rate not updated if:3481* - the percentage of sequential dispatches is below 3/4 of the3482* total, and rate is below the current estimated peak rate3483* - rate is unreasonably high (> 20M sectors/sec)3484*/3485if ((bfqd->sequential_samples < (3 * bfqd->peak_rate_samples)>>2 &&3486rate <= bfqd->peak_rate) ||3487rate > 20<<BFQ_RATE_SHIFT)3488goto reset_computation;34893490/*3491* We have to update the peak rate, at last! To this purpose,3492* we use a low-pass filter. We compute the smoothing constant3493* of the filter as a function of the 'weight' of the new3494* measured rate.3495*3496* As can be seen in next formulas, we define this weight as a3497* quantity proportional to how sequential the workload is,3498* and to how long the observation time interval is.3499*3500* The weight runs from 0 to 8. The maximum value of the3501* weight, 8, yields the minimum value for the smoothing3502* constant. At this minimum value for the smoothing constant,3503* the measured rate contributes for half of the next value of3504* the estimated peak rate.3505*3506* So, the first step is to compute the weight as a function3507* of how sequential the workload is. 
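	 *
	 * For instance (illustrative numbers only): with
	 * bfqd->sequential_samples = 80 out of bfqd->peak_rate_samples = 90,
	 * this first step gives weight = (9 * 80) / 90 = 8, whereas a mostly
	 * random workload with only 20 sequential samples out of 90 gives
	 * weight = (9 * 20) / 90 = 2.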
Note that the weight3508* cannot reach 9, because bfqd->sequential_samples cannot3509* become equal to bfqd->peak_rate_samples, which, in its3510* turn, holds true because bfqd->sequential_samples is not3511* incremented for the first sample.3512*/3513weight = (9 * bfqd->sequential_samples) / bfqd->peak_rate_samples;35143515/*3516* Second step: further refine the weight as a function of the3517* duration of the observation interval.3518*/3519weight = min_t(u32, 8,3520div_u64(weight * bfqd->delta_from_first,3521BFQ_RATE_REF_INTERVAL));35223523/*3524* Divisor ranging from 10, for minimum weight, to 2, for3525* maximum weight.3526*/3527divisor = 10 - weight;35283529/*3530* Finally, update peak rate:3531*3532* peak_rate = peak_rate * (divisor-1) / divisor + rate / divisor3533*/3534bfqd->peak_rate *= divisor-1;3535bfqd->peak_rate /= divisor;3536rate /= divisor; /* smoothing constant alpha = 1/divisor */35373538bfqd->peak_rate += rate;35393540/*3541* For a very slow device, bfqd->peak_rate can reach 0 (see3542* the minimum representable values reported in the comments3543* on BFQ_RATE_SHIFT). Push to 1 if this happens, to avoid3544* divisions by zero where bfqd->peak_rate is used as a3545* divisor.3546*/3547bfqd->peak_rate = max_t(u32, 1, bfqd->peak_rate);35483549update_thr_responsiveness_params(bfqd);35503551reset_computation:3552bfq_reset_rate_computation(bfqd, rq);3553}35543555/*3556* Update the read/write peak rate (the main quantity used for3557* auto-tuning, see update_thr_responsiveness_params()).3558*3559* It is not trivial to estimate the peak rate (correctly): because of3560* the presence of sw and hw queues between the scheduler and the3561* device components that finally serve I/O requests, it is hard to3562* say exactly when a given dispatched request is served inside the3563* device, and for how long. As a consequence, it is hard to know3564* precisely at what rate a given set of requests is actually served3565* by the device.3566*3567* On the opposite end, the dispatch time of any request is trivially3568* available, and, from this piece of information, the "dispatch rate"3569* of requests can be immediately computed. So, the idea in the next3570* function is to use what is known, namely request dispatch times3571* (plus, when useful, request completion times), to estimate what is3572* unknown, namely in-device request service rate.3573*3574* The main issue is that, because of the above facts, the rate at3575* which a certain set of requests is dispatched over a certain time3576* interval can vary greatly with respect to the rate at which the3577* same requests are then served. But, since the size of any3578* intermediate queue is limited, and the service scheme is lossless3579* (no request is silently dropped), the following obvious convergence3580* property holds: the number of requests dispatched MUST become3581* closer and closer to the number of requests completed as the3582* observation interval grows. This is the key property used in3583* the next function to estimate the peak service rate as a function3584* of the observed dispatch rate. 
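 *
 * As an illustrative aside (the queue depth below is just an example):
 * if the device never holds more than 32 requests internally, then,
 * after 1000 dispatches in an observation interval, at least 968 of
 * those requests have necessarily been completed. Dispatch and
 * completion counts can thus differ by at most ~3% over that interval,
 * and the gap keeps shrinking as the interval grows.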
The function assumes to be invoked3585* on every request dispatch.3586*/3587static void bfq_update_peak_rate(struct bfq_data *bfqd, struct request *rq)3588{3589u64 now_ns = blk_time_get_ns();35903591if (bfqd->peak_rate_samples == 0) { /* first dispatch */3592bfq_log(bfqd, "update_peak_rate: goto reset, samples %d",3593bfqd->peak_rate_samples);3594bfq_reset_rate_computation(bfqd, rq);3595goto update_last_values; /* will add one sample */3596}35973598/*3599* Device idle for very long: the observation interval lasting3600* up to this dispatch cannot be a valid observation interval3601* for computing a new peak rate (similarly to the late-3602* completion event in bfq_completed_request()). Go to3603* update_rate_and_reset to have the following three steps3604* taken:3605* - close the observation interval at the last (previous)3606* request dispatch or completion3607* - compute rate, if possible, for that observation interval3608* - start a new observation interval with this dispatch3609*/3610if (now_ns - bfqd->last_dispatch > 100*NSEC_PER_MSEC &&3611bfqd->tot_rq_in_driver == 0)3612goto update_rate_and_reset;36133614/* Update sampling information */3615bfqd->peak_rate_samples++;36163617if ((bfqd->tot_rq_in_driver > 0 ||3618now_ns - bfqd->last_completion < BFQ_MIN_TT)3619&& !BFQ_RQ_SEEKY(bfqd, bfqd->last_position, rq))3620bfqd->sequential_samples++;36213622bfqd->tot_sectors_dispatched += blk_rq_sectors(rq);36233624/* Reset max observed rq size every 32 dispatches */3625if (likely(bfqd->peak_rate_samples % 32))3626bfqd->last_rq_max_size =3627max_t(u32, blk_rq_sectors(rq), bfqd->last_rq_max_size);3628else3629bfqd->last_rq_max_size = blk_rq_sectors(rq);36303631bfqd->delta_from_first = now_ns - bfqd->first_dispatch;36323633/* Target observation interval not yet reached, go on sampling */3634if (bfqd->delta_from_first < BFQ_RATE_REF_INTERVAL)3635goto update_last_values;36363637update_rate_and_reset:3638bfq_update_rate_reset(bfqd, rq);3639update_last_values:3640bfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);3641if (RQ_BFQQ(rq) == bfqd->in_service_queue)3642bfqd->in_serv_last_pos = bfqd->last_position;3643bfqd->last_dispatch = now_ns;3644}36453646/*3647* Remove request from internal lists.3648*/3649static void bfq_dispatch_remove(struct request_queue *q, struct request *rq)3650{3651struct bfq_queue *bfqq = RQ_BFQQ(rq);36523653/*3654* For consistency, the next instruction should have been3655* executed after removing the request from the queue and3656* dispatching it. We execute instead this instruction before3657* bfq_remove_request() (and hence introduce a temporary3658* inconsistency), for efficiency. 
	 * In fact, should this dispatch occur for a non in-service bfqq,
	 * this anticipated increment prevents two counters related to
	 * bfqq->dispatched from first being uselessly decremented, and
	 * then incremented again when the (new) value of bfqq->dispatched
	 * happens to be taken into account.
	 */
	bfqq->dispatched++;
	bfq_update_peak_rate(q->elevator->elevator_data, rq);

	bfq_remove_request(q, rq);
}

/*
 * There is a case where idling does not have to be performed for
 * throughput concerns, but to preserve the throughput share of
 * the process associated with bfqq.
 *
 * To introduce this case, we can note that allowing the drive
 * to enqueue more than one request at a time, and hence
 * delegating de facto final scheduling decisions to the
 * drive's internal scheduler, entails loss of control on the
 * actual request service order. In particular, the critical
 * situation is when requests from different processes happen
 * to be present, at the same time, in the internal queue(s)
 * of the drive. In such a situation, the drive, by deciding
 * the service order of the internally-queued requests, does
 * determine also the actual throughput distribution among
 * these processes. But the drive typically has no notion or
 * concern about per-process throughput distribution, and
 * makes its decisions only on a per-request basis. Therefore,
 * the service distribution enforced by the drive's internal
 * scheduler is likely to coincide with the desired throughput
 * distribution only in a completely symmetric, or favorably
 * skewed scenario where:
 * (i-a) each of these processes must get the same throughput as
 *       the others,
 * (i-b) in case (i-a) does not hold, it holds that the process
 *       associated with bfqq must receive a lower or equal
 *       throughput than any of the other processes;
 * (ii)  the I/O of each process has the same properties, in
 *       terms of locality (sequential or random), direction
 *       (reads or writes), request sizes, greediness
 *       (from I/O-bound to sporadic), and so on.
 *
 * In fact, in such a scenario, the drive tends to treat the requests
 * of each process in about the same way as the requests of the
 * others, and thus to provide each of these processes with about the
 * same throughput. This is exactly the desired throughput
 * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
 * even more convenient distribution for (the process associated with)
 * bfqq.
 *
 * In contrast, in any asymmetric or unfavorable scenario, device
 * idling (I/O-dispatch plugging) is certainly needed to guarantee
 * that bfqq receives its assigned fraction of the device throughput
 * (see [1] for details).
 *
 * The problem is that idling may significantly reduce throughput with
 * certain combinations of types of I/O and devices. An important
 * example is sync random I/O on flash storage with command
 * queueing. So, unless bfqq falls in cases where idling also boosts
 * throughput, it is important to check conditions (i-a), (i-b) and
 * (ii) accurately, so as to avoid idling when not strictly needed for
 * service guarantees.
 *
 * Unfortunately, it is extremely difficult to thoroughly check
 * condition (ii). And, in case there are active groups, it becomes
 * very difficult to check conditions (i-a) and (i-b) too.
 * In fact, if there are active groups, then, for conditions (i-a) or
 * (i-b) to become false 'indirectly', it is enough that an active
 * group contains more active processes or sub-groups than some other
 * active group. More precisely, for conditions (i-a) or (i-b) to
 * become false because of such a group, it is not even necessary that
 * the group is (still) active: it is sufficient that, even if the
 * group has become inactive, some of its descendant processes still
 * have some request already dispatched but still waiting for
 * completion. In fact, requests still have to be guaranteed their
 * share of the throughput even after being dispatched. In this
 * respect, it is easy to show that, if a group frequently becomes
 * inactive while still having in-flight requests, and if, when this
 * happens, the group is not considered in the calculation of whether
 * the scenario is asymmetric, then the group may fail to be
 * guaranteed its fair share of the throughput (basically because
 * idling may not be performed for the descendant processes of the
 * group, but it should have been). We address this issue with the
 * following bi-modal behavior, implemented in the function
 * bfq_asymmetric_scenario().
 *
 * If there are groups with requests waiting for completion
 * (as commented above, some of these groups may even be
 * already inactive), then the scenario is tagged as
 * asymmetric, conservatively, without checking any of the
 * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
 * This behavior also matches the fact that groups are created
 * exactly if controlling I/O is a primary concern (to
 * preserve bandwidth and latency guarantees).
 *
 * On the opposite end, if there are no groups with requests waiting
 * for completion, then only conditions (i-a) and (i-b) are actually
 * controlled, i.e., provided that condition (i-a) or (i-b) holds,
 * idling is not performed, regardless of whether condition (ii)
 * holds. In other words, only if conditions (i-a) and (i-b) do not
 * hold is idling allowed, and the device tends to be prevented
 * from queueing many requests, possibly of several processes. Since
 * there are no groups with requests waiting for completion, then, to
 * control conditions (i-a) and (i-b) it is enough to check just
 * whether all the queues with requests waiting for completion also
 * have the same weight.
 *
 * Not checking condition (ii) evidently exposes bfqq to the
 * risk of getting less throughput than its fair share.
 * However, for queues with the same weight, a further
 * mechanism, preemption, mitigates or even eliminates this
 * problem. And it does so without consequences on overall
 * throughput. This mechanism and its benefits are explained
 * in the next three paragraphs.
 *
 * Even if a queue, say Q, is expired when it remains idle, Q
 * can still preempt the new in-service queue if the next
 * request of Q arrives soon (see the comments on
 * bfq_bfqq_update_budg_for_activation). If all queues and
 * groups have the same weight, this form of preemption,
 * combined with the hole-recovery heuristic described in the
 * comments on function bfq_bfqq_update_budg_for_activation,
 * is enough to preserve a correct bandwidth distribution in
 * the mid term, even without idling.
In fact, even if not3785* idling allows the internal queues of the device to contain3786* many requests, and thus to reorder requests, we can rather3787* safely assume that the internal scheduler still preserves a3788* minimum of mid-term fairness.3789*3790* More precisely, this preemption-based, idleless approach3791* provides fairness in terms of IOPS, and not sectors per3792* second. This can be seen with a simple example. Suppose3793* that there are two queues with the same weight, but that3794* the first queue receives requests of 8 sectors, while the3795* second queue receives requests of 1024 sectors. In3796* addition, suppose that each of the two queues contains at3797* most one request at a time, which implies that each queue3798* always remains idle after it is served. Finally, after3799* remaining idle, each queue receives very quickly a new3800* request. It follows that the two queues are served3801* alternatively, preempting each other if needed. This3802* implies that, although both queues have the same weight,3803* the queue with large requests receives a service that is3804* 1024/8 times as high as the service received by the other3805* queue.3806*3807* The motivation for using preemption instead of idling (for3808* queues with the same weight) is that, by not idling,3809* service guarantees are preserved (completely or at least in3810* part) without minimally sacrificing throughput. And, if3811* there is no active group, then the primary expectation for3812* this device is probably a high throughput.3813*3814* We are now left only with explaining the two sub-conditions in the3815* additional compound condition that is checked below for deciding3816* whether the scenario is asymmetric. To explain the first3817* sub-condition, we need to add that the function3818* bfq_asymmetric_scenario checks the weights of only3819* non-weight-raised queues, for efficiency reasons (see comments on3820* bfq_weights_tree_add()). Then the fact that bfqq is weight-raised3821* is checked explicitly here. More precisely, the compound condition3822* below takes into account also the fact that, even if bfqq is being3823* weight-raised, the scenario is still symmetric if all queues with3824* requests waiting for completion happen to be3825* weight-raised. Actually, we should be even more precise here, and3826* differentiate between interactive weight raising and soft real-time3827* weight raising.3828*3829* The second sub-condition checked in the compound condition is3830* whether there is a fair amount of already in-flight I/O not3831* belonging to bfqq. If so, I/O dispatching is to be plugged, for the3832* following reason. The drive may decide to serve in-flight3833* non-bfqq's I/O requests before bfqq's ones, thereby delaying the3834* arrival of new I/O requests for bfqq (recall that bfqq is sync). If3835* I/O-dispatching is not plugged, then, while bfqq remains empty, a3836* basically uncontrolled amount of I/O from other queues may be3837* dispatched too, possibly causing the service of bfqq's I/O to be3838* delayed even longer in the drive. This problem gets more and more3839* serious as the speed and the queue depth of the drive grow,3840* because, as these two quantities grow, the probability to find no3841* queue busy but many requests in flight grows too. 
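 *
 * To make this second sub-condition concrete (with illustrative
 * numbers): if bfqq is weight-raised and has, say, one request
 * dispatched while the drive holds five requests overall, then at
 * least four in-flight requests belong to other queues, and the check
 * tot_rq_in_driver >= bfqq->dispatched + 4 used below triggers
 * plugging.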
 * By contrast, plugging I/O dispatching minimizes the delay induced
 * by already in-flight I/O, and enables bfqq to recover the bandwidth
 * it may lose because of this delay.
 *
 * As a side note, it is worth considering that the above
 * device-idling countermeasures may however fail in the following
 * unlucky scenario: if I/O-dispatch plugging is (correctly) disabled
 * in a time period during which all symmetry sub-conditions hold, and
 * therefore the device is allowed to enqueue many requests, but at
 * some later point in time some sub-condition ceases to hold, then it
 * may become impossible to make requests be served in the desired
 * order until all the requests already queued in the device have been
 * served. The last sub-condition commented above somewhat mitigates
 * this problem for weight-raised queues.
 *
 * However, as an additional mitigation for this problem, we preserve
 * plugging for a special symmetric case that may suddenly turn into
 * asymmetric: the case where only bfqq is busy. In this case, not
 * expiring bfqq does not cause any harm to any other queues in terms
 * of service guarantees. In contrast, it avoids the following unlucky
 * sequence of events: (1) bfqq is expired, (2) a new queue with a
 * lower weight than bfqq becomes busy (or more queues), (3) the new
 * queue is served until a new request arrives for bfqq, (4) when bfqq
 * is finally served, there are so many requests of the new queue in
 * the drive that the pending requests for bfqq take a lot of time to
 * be served. In particular, event (2) may cause even already
 * dispatched requests of bfqq to be delayed, inside the drive. So, to
 * avoid this series of events, the scenario is preventively declared
 * as asymmetric also if bfqq is the only busy queue.
 */
static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
						 struct bfq_queue *bfqq)
{
	int tot_busy_queues = bfq_tot_busy_queues(bfqd);

	/* No point in idling for bfqq if it won't get requests any longer */
	if (unlikely(!bfqq_process_refs(bfqq)))
		return false;

	return (bfqq->wr_coeff > 1 &&
		(bfqd->wr_busy_queues < tot_busy_queues ||
		 bfqd->tot_rq_in_driver >= bfqq->dispatched + 4)) ||
		bfq_asymmetric_scenario(bfqd, bfqq) ||
		tot_busy_queues == 1;
}

static bool __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
			      enum bfqq_expiration reason)
{
	/*
	 * If this bfqq is shared between multiple processes, check
	 * to make sure that those processes are still issuing I/Os
	 * within the mean seek distance. If not, it may be time to
	 * break the queues apart again.
	 */
	if (bfq_bfqq_coop(bfqq) && BFQQ_SEEKY(bfqq))
		bfq_mark_bfqq_split_coop(bfqq);

	/*
	 * Consider queues with a higher finish virtual time than
	 * bfqq. If idling_needed_for_service_guarantees(bfqq) returns
	 * true, then bfqq's bandwidth would be violated if an
	 * uncontrolled amount of I/O from these queues were
	 * dispatched while bfqq is waiting for its new I/O to
	 * arrive. This is exactly what may happen if this is a forced
	 * expiration caused by a preemption attempt, and if bfqq is
	 * not re-scheduled. To prevent this from happening, re-queue
	 * bfqq if it needs I/O-dispatch plugging, even if it is
	 * empty.
By doing so, bfqq is granted to be served before the3911* above queues (provided that bfqq is of course eligible).3912*/3913if (RB_EMPTY_ROOT(&bfqq->sort_list) &&3914!(reason == BFQQE_PREEMPTED &&3915idling_needed_for_service_guarantees(bfqd, bfqq))) {3916if (bfqq->dispatched == 0)3917/*3918* Overloading budget_timeout field to store3919* the time at which the queue remains with no3920* backlog and no outstanding request; used by3921* the weight-raising mechanism.3922*/3923bfqq->budget_timeout = jiffies;39243925bfq_del_bfqq_busy(bfqq, true);3926} else {3927bfq_requeue_bfqq(bfqd, bfqq, true);3928/*3929* Resort priority tree of potential close cooperators.3930* See comments on bfq_pos_tree_add_move() for the unlikely().3931*/3932if (unlikely(!bfqd->nonrot_with_queueing &&3933!RB_EMPTY_ROOT(&bfqq->sort_list)))3934bfq_pos_tree_add_move(bfqd, bfqq);3935}39363937/*3938* All in-service entities must have been properly deactivated3939* or requeued before executing the next function, which3940* resets all in-service entities as no more in service. This3941* may cause bfqq to be freed. If this happens, the next3942* function returns true.3943*/3944return __bfq_bfqd_reset_in_service(bfqd);3945}39463947/**3948* __bfq_bfqq_recalc_budget - try to adapt the budget to the @bfqq behavior.3949* @bfqd: device data.3950* @bfqq: queue to update.3951* @reason: reason for expiration.3952*3953* Handle the feedback on @bfqq budget at queue expiration.3954* See the body for detailed comments.3955*/3956static void __bfq_bfqq_recalc_budget(struct bfq_data *bfqd,3957struct bfq_queue *bfqq,3958enum bfqq_expiration reason)3959{3960struct request *next_rq;3961int budget, min_budget;39623963min_budget = bfq_min_budget(bfqd);39643965if (bfqq->wr_coeff == 1)3966budget = bfqq->max_budget;3967else /*3968* Use a constant, low budget for weight-raised queues,3969* to help achieve a low latency. Keep it slightly higher3970* than the minimum possible budget, to cause a little3971* bit fewer expirations.3972*/3973budget = 2 * min_budget;39743975bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last budg %d, budg left %d",3976bfqq->entity.budget, bfq_bfqq_budget_left(bfqq));3977bfq_log_bfqq(bfqd, bfqq, "recalc_budg: last max_budg %d, min budg %d",3978budget, bfq_min_budget(bfqd));3979bfq_log_bfqq(bfqd, bfqq, "recalc_budg: sync %d, seeky %d",3980bfq_bfqq_sync(bfqq), BFQQ_SEEKY(bfqd->in_service_queue));39813982if (bfq_bfqq_sync(bfqq) && bfqq->wr_coeff == 1) {3983switch (reason) {3984/*3985* Caveat: in all the following cases we trade latency3986* for throughput.3987*/3988case BFQQE_TOO_IDLE:3989/*3990* This is the only case where we may reduce3991* the budget: if there is no request of the3992* process still waiting for completion, then3993* we assume (tentatively) that the timer has3994* expired because the batch of requests of3995* the process could have been served with a3996* smaller budget. Hence, betting that3997* process will behave in the same way when it3998* becomes backlogged again, we reduce its3999* next budget. As long as we guess right,4000* this budget cut reduces the latency4001* experienced by the process.4002*4003* However, if there are still outstanding4004* requests, then the process may have not yet4005* issued its next request just because it is4006* still waiting for the completion of some of4007* the still outstanding ones. 
So in this4008* subcase we do not reduce its budget, on the4009* contrary we increase it to possibly boost4010* the throughput, as discussed in the4011* comments to the BUDGET_TIMEOUT case.4012*/4013if (bfqq->dispatched > 0) /* still outstanding reqs */4014budget = min(budget * 2, bfqd->bfq_max_budget);4015else {4016if (budget > 5 * min_budget)4017budget -= 4 * min_budget;4018else4019budget = min_budget;4020}4021break;4022case BFQQE_BUDGET_TIMEOUT:4023/*4024* We double the budget here because it gives4025* the chance to boost the throughput if this4026* is not a seeky process (and has bumped into4027* this timeout because of, e.g., ZBR).4028*/4029budget = min(budget * 2, bfqd->bfq_max_budget);4030break;4031case BFQQE_BUDGET_EXHAUSTED:4032/*4033* The process still has backlog, and did not4034* let either the budget timeout or the disk4035* idling timeout expire. Hence it is not4036* seeky, has a short thinktime and may be4037* happy with a higher budget too. So4038* definitely increase the budget of this good4039* candidate to boost the disk throughput.4040*/4041budget = min(budget * 4, bfqd->bfq_max_budget);4042break;4043case BFQQE_NO_MORE_REQUESTS:4044/*4045* For queues that expire for this reason, it4046* is particularly important to keep the4047* budget close to the actual service they4048* need. Doing so reduces the timestamp4049* misalignment problem described in the4050* comments in the body of4051* __bfq_activate_entity. In fact, suppose4052* that a queue systematically expires for4053* BFQQE_NO_MORE_REQUESTS and presents a4054* new request in time to enjoy timestamp4055* back-shifting. The larger the budget of the4056* queue is with respect to the service the4057* queue actually requests in each service4058* slot, the more times the queue can be4059* reactivated with the same virtual finish4060* time. It follows that, even if this finish4061* time is pushed to the system virtual time4062* to reduce the consequent timestamp4063* misalignment, the queue unjustly enjoys for4064* many re-activations a lower finish time4065* than all newly activated queues.4066*4067* The service needed by bfqq is measured4068* quite precisely by bfqq->entity.service.4069* Since bfqq does not enjoy device idling,4070* bfqq->entity.service is equal to the number4071* of sectors that the process associated with4072* bfqq requested to read/write before waiting4073* for request completions, or blocking for4074* other reasons.4075*/4076budget = max_t(int, bfqq->entity.service, min_budget);4077break;4078default:4079return;4080}4081} else if (!bfq_bfqq_sync(bfqq)) {4082/*4083* Async queues get always the maximum possible4084* budget, as for them we do not care about latency4085* (in addition, their ability to dispatch is limited4086* by the charging factor).4087*/4088budget = bfqd->bfq_max_budget;4089}40904091bfqq->max_budget = budget;40924093if (bfqd->budgets_assigned >= bfq_stats_min_budgets &&4094!bfqd->bfq_user_max_budget)4095bfqq->max_budget = min(bfqq->max_budget, bfqd->bfq_max_budget);40964097/*4098* If there is still backlog, then assign a new budget, making4099* sure that it is large enough for the next request. 
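	 *
	 * For example (with purely illustrative numbers): if the
	 * feedback above left bfqq->max_budget at 2 * min_budget, but
	 * the next queued request would be charged more than that, the
	 * assignment below raises entity.budget to the charge of that
	 * request, so that bfqq can always dispatch at least its next
	 * request within the assigned budget.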
Since4100* the finish time of bfqq must be kept in sync with the4101* budget, be sure to call __bfq_bfqq_expire() *after* this4102* update.4103*4104* If there is no backlog, then no need to update the budget;4105* it will be updated on the arrival of a new request.4106*/4107next_rq = bfqq->next_rq;4108if (next_rq)4109bfqq->entity.budget = max_t(unsigned long, bfqq->max_budget,4110bfq_serv_to_charge(next_rq, bfqq));41114112bfq_log_bfqq(bfqd, bfqq, "head sect: %u, new budget %d",4113next_rq ? blk_rq_sectors(next_rq) : 0,4114bfqq->entity.budget);4115}41164117/*4118* Return true if the process associated with bfqq is "slow". The slow4119* flag is used, in addition to the budget timeout, to reduce the4120* amount of service provided to seeky processes, and thus reduce4121* their chances to lower the throughput. More details in the comments4122* on the function bfq_bfqq_expire().4123*4124* An important observation is in order: as discussed in the comments4125* on the function bfq_update_peak_rate(), with devices with internal4126* queues, it is hard if ever possible to know when and for how long4127* an I/O request is processed by the device (apart from the trivial4128* I/O pattern where a new request is dispatched only after the4129* previous one has been completed). This makes it hard to evaluate4130* the real rate at which the I/O requests of each bfq_queue are4131* served. In fact, for an I/O scheduler like BFQ, serving a4132* bfq_queue means just dispatching its requests during its service4133* slot (i.e., until the budget of the queue is exhausted, or the4134* queue remains idle, or, finally, a timeout fires). But, during the4135* service slot of a bfq_queue, around 100 ms at most, the device may4136* be even still processing requests of bfq_queues served in previous4137* service slots. On the opposite end, the requests of the in-service4138* bfq_queue may be completed after the service slot of the queue4139* finishes.4140*4141* Anyway, unless more sophisticated solutions are used4142* (where possible), the sum of the sizes of the requests dispatched4143* during the service slot of a bfq_queue is probably the only4144* approximation available for the service received by the bfq_queue4145* during its service slot. And this sum is the quantity used in this4146* function to evaluate the I/O speed of a process.4147*/4148static bool bfq_bfqq_is_slow(struct bfq_data *bfqd, struct bfq_queue *bfqq,4149bool compensate, unsigned long *delta_ms)4150{4151ktime_t delta_ktime;4152u32 delta_usecs;4153bool slow = BFQQ_SEEKY(bfqq); /* if delta too short, use seekyness */41544155if (!bfq_bfqq_sync(bfqq))4156return false;41574158if (compensate)4159delta_ktime = bfqd->last_idling_start;4160else4161delta_ktime = blk_time_get();4162delta_ktime = ktime_sub(delta_ktime, bfqd->last_budget_start);4163delta_usecs = ktime_to_us(delta_ktime);41644165/* don't use too short time intervals */4166if (delta_usecs < 1000) {4167if (blk_queue_nonrot(bfqd->queue))4168/*4169* give same worst-case guarantees as idling4170* for seeky4171*/4172*delta_ms = BFQ_MIN_TT / NSEC_PER_MSEC;4173else /* charge at least one seek */4174*delta_ms = bfq_slice_idle / NSEC_PER_MSEC;41754176return slow;4177}41784179*delta_ms = delta_usecs / USEC_PER_MSEC;41804181/*4182* Use only long (> 20ms) intervals to filter out excessive4183* spikes in service rate estimation.4184*/4185if (delta_usecs > 20000) {4186/*4187* Caveat for rotational devices: processes doing I/O4188* in the slower disk zones tend to be slow(er) even4189* if not seeky. 
In this respect, the estimated peak4190* rate is likely to be an average over the disk4191* surface. Accordingly, to not be too harsh with4192* unlucky processes, a process is deemed slow only if4193* its rate has been lower than half of the estimated4194* peak rate.4195*/4196slow = bfqq->entity.service < bfqd->bfq_max_budget / 2;4197}41984199bfq_log_bfqq(bfqd, bfqq, "bfq_bfqq_is_slow: slow %d", slow);42004201return slow;4202}42034204/*4205* To be deemed as soft real-time, an application must meet two4206* requirements. First, the application must not require an average4207* bandwidth higher than the approximate bandwidth required to playback or4208* record a compressed high-definition video.4209* The next function is invoked on the completion of the last request of a4210* batch, to compute the next-start time instant, soft_rt_next_start, such4211* that, if the next request of the application does not arrive before4212* soft_rt_next_start, then the above requirement on the bandwidth is met.4213*4214* The second requirement is that the request pattern of the application is4215* isochronous, i.e., that, after issuing a request or a batch of requests,4216* the application stops issuing new requests until all its pending requests4217* have been completed. After that, the application may issue a new batch,4218* and so on.4219* For this reason the next function is invoked to compute4220* soft_rt_next_start only for applications that meet this requirement,4221* whereas soft_rt_next_start is set to infinity for applications that do4222* not.4223*4224* Unfortunately, even a greedy (i.e., I/O-bound) application may4225* happen to meet, occasionally or systematically, both the above4226* bandwidth and isochrony requirements. This may happen at least in4227* the following circumstances. First, if the CPU load is high. The4228* application may stop issuing requests while the CPUs are busy4229* serving other processes, then restart, then stop again for a while,4230* and so on. The other circumstances are related to the storage4231* device: the storage device is highly loaded or reaches a low-enough4232* throughput with the I/O of the application (e.g., because the I/O4233* is random and/or the device is slow). In all these cases, the4234* I/O of the application may be simply slowed down enough to meet4235* the bandwidth and isochrony requirements. To reduce the probability4236* that greedy applications are deemed as soft real-time in these4237* corner cases, a further rule is used in the computation of4238* soft_rt_next_start: the return value of this function is forced to4239* be higher than the maximum between the following two quantities.4240*4241* (a) Current time plus: (1) the maximum time for which the arrival4242* of a request is waited for when a sync queue becomes idle,4243* namely bfqd->bfq_slice_idle, and (2) a few extra jiffies. We4244* postpone for a moment the reason for adding a few extra4245* jiffies; we get back to it after next item (b). Lower-bounding4246* the return value of this function with the current time plus4247* bfqd->bfq_slice_idle tends to filter out greedy applications,4248* because the latter issue their next request as soon as possible4249* after the last one has been completed. In contrast, a soft4250* real-time application spends some time processing data, after a4251* batch of its requests has been completed.4252*4253* (b) Current value of bfqq->soft_rt_next_start. 
 * As pointed out above, greedy applications may happen to meet both
 * the bandwidth and isochrony requirements under heavy CPU or
 * storage-device load. In more detail, in these scenarios, these
 * applications happen, only for limited time periods, to do I/O
 * slowly enough to meet all the requirements described so far,
 * including the filtering in above item (a). These slow-speed
 * time intervals are usually interspersed between other time
 * intervals during which these applications do I/O at a very high
 * speed. Fortunately, exactly because of the high speed of the
 * I/O in the high-speed intervals, the values returned by this
 * function happen to be so high, near the end of any such
 * high-speed interval, that they are likely to fall *after* the
 * end of the low-speed time interval that follows. These high
 * values are stored in bfqq->soft_rt_next_start after each
 * invocation of this function. As a consequence, if the last value
 * of bfqq->soft_rt_next_start is constantly used to lower-bound the
 * next value that this function may return, then, from the very
 * beginning of a low-speed interval, bfqq->soft_rt_next_start is
 * likely to be constantly kept so high that any I/O request
 * issued during the low-speed interval is considered as arriving
 * too soon for the application to be deemed as soft
 * real-time. Then, in the high-speed interval that follows, the
 * application will not be deemed as soft real-time, just because
 * it will do I/O at a high speed. And so on.
 *
 * Getting back to the filtering in item (a), in the following two
 * cases this filtering might be easily passed by a greedy
 * application, if the reference quantity was just
 * bfqd->bfq_slice_idle:
 * 1) HZ is so low that the duration of a jiffy is comparable to or
 *    higher than bfqd->bfq_slice_idle. This happens, e.g., on slow
 *    devices with HZ=100. The time granularity may be so coarse
 *    that the approximation, in jiffies, of bfqd->bfq_slice_idle
 *    is rather lower than the exact value.
 * 2) jiffies, instead of increasing at a constant rate, may stop increasing
 *    for a while, then suddenly 'jump' by several units to recover the lost
 *    increments. This seems to happen, e.g., inside virtual machines.
 * To address this issue, in the filtering in (a) we do not use as a
 * reference time interval just bfqd->bfq_slice_idle, but
 * bfqd->bfq_slice_idle plus a few jiffies. In particular, we add the
 * minimum number of jiffies for which the filter seems to be quite
 * precise also in embedded systems and KVM/QEMU virtual machines.
 */
static unsigned long bfq_bfqq_softrt_next_start(struct bfq_data *bfqd,
						struct bfq_queue *bfqq)
{
	return max3(bfqq->soft_rt_next_start,
		    bfqq->last_idle_bklogged +
		    HZ * bfqq->service_from_backlogged /
		    bfqd->bfq_wr_max_softrt_rate,
		    jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4);
}

/**
 * bfq_bfqq_expire - expire a queue.
 * @bfqd: device owning the queue.
 * @bfqq: the queue to expire.
 * @compensate: if true, compensate for the time spent idling.
 * @reason: the reason causing the expiration.
 *
 * If the process associated with bfqq does slow I/O (e.g., because it
 * issues random requests), we charge bfqq with the time it has been
 * in service instead of the service it has received (see
 * bfq_bfqq_charge_time for details on how this goal is achieved).
As4318* a consequence, bfqq will typically get higher timestamps upon4319* reactivation, and hence it will be rescheduled as if it had4320* received more service than what it has actually received. In the4321* end, bfqq receives less service in proportion to how slowly its4322* associated process consumes its budgets (and hence how seriously it4323* tends to lower the throughput). In addition, this time-charging4324* strategy guarantees time fairness among slow processes. In4325* contrast, if the process associated with bfqq is not slow, we4326* charge bfqq exactly with the service it has received.4327*4328* Charging time to the first type of queues and the exact service to4329* the other has the effect of using the WF2Q+ policy to schedule the4330* former on a timeslice basis, without violating service domain4331* guarantees among the latter.4332*/4333void bfq_bfqq_expire(struct bfq_data *bfqd,4334struct bfq_queue *bfqq,4335bool compensate,4336enum bfqq_expiration reason)4337{4338bool slow;4339unsigned long delta = 0;4340struct bfq_entity *entity = &bfqq->entity;43414342/*4343* Check whether the process is slow (see bfq_bfqq_is_slow).4344*/4345slow = bfq_bfqq_is_slow(bfqd, bfqq, compensate, &delta);43464347/*4348* As above explained, charge slow (typically seeky) and4349* timed-out queues with the time and not the service4350* received, to favor sequential workloads.4351*4352* Processes doing I/O in the slower disk zones will tend to4353* be slow(er) even if not seeky. Therefore, since the4354* estimated peak rate is actually an average over the disk4355* surface, these processes may timeout just for bad luck. To4356* avoid punishing them, do not charge time to processes that4357* succeeded in consuming at least 2/3 of their budget. This4358* allows BFQ to preserve enough elasticity to still perform4359* bandwidth, and not time, distribution with little unlucky4360* or quasi-sequential processes.4361*/4362if (bfqq->wr_coeff == 1 &&4363(slow ||4364(reason == BFQQE_BUDGET_TIMEOUT &&4365bfq_bfqq_budget_left(bfqq) >= entity->budget / 3)))4366bfq_bfqq_charge_time(bfqd, bfqq, delta);43674368if (bfqd->low_latency && bfqq->wr_coeff == 1)4369bfqq->last_wr_start_finish = jiffies;43704371if (bfqd->low_latency && bfqd->bfq_wr_max_softrt_rate > 0 &&4372RB_EMPTY_ROOT(&bfqq->sort_list)) {4373/*4374* If we get here, and there are no outstanding4375* requests, then the request pattern is isochronous4376* (see the comments on the function4377* bfq_bfqq_softrt_next_start()). 
Therefore we can4378* compute soft_rt_next_start.4379*4380* If, instead, the queue still has outstanding4381* requests, then we have to wait for the completion4382* of all the outstanding requests to discover whether4383* the request pattern is actually isochronous.4384*/4385if (bfqq->dispatched == 0)4386bfqq->soft_rt_next_start =4387bfq_bfqq_softrt_next_start(bfqd, bfqq);4388else if (bfqq->dispatched > 0) {4389/*4390* Schedule an update of soft_rt_next_start to when4391* the task may be discovered to be isochronous.4392*/4393bfq_mark_bfqq_softrt_update(bfqq);4394}4395}43964397bfq_log_bfqq(bfqd, bfqq,4398"expire (%d, slow %d, num_disp %d, short_ttime %d)", reason,4399slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));44004401/*4402* bfqq expired, so no total service time needs to be computed4403* any longer: reset state machine for measuring total service4404* times.4405*/4406bfqd->rqs_injected = bfqd->wait_dispatch = false;4407bfqd->waited_rq = NULL;44084409/*4410* Increase, decrease or leave budget unchanged according to4411* reason.4412*/4413__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);4414if (__bfq_bfqq_expire(bfqd, bfqq, reason))4415/* bfqq is gone, no more actions on it */4416return;44174418/* mark bfqq as waiting a request only if a bic still points to it */4419if (!bfq_bfqq_busy(bfqq) &&4420reason != BFQQE_BUDGET_TIMEOUT &&4421reason != BFQQE_BUDGET_EXHAUSTED) {4422bfq_mark_bfqq_non_blocking_wait_rq(bfqq);4423/*4424* Not setting service to 0, because, if the next rq4425* arrives in time, the queue will go on receiving4426* service with this same budget (as if it never expired)4427*/4428} else4429entity->service = 0;44304431/*4432* Reset the received-service counter for every parent entity.4433* Differently from what happens with bfqq->entity.service,4434* the resetting of this counter never needs to be postponed4435* for parent entities. In fact, in case bfqq may have a4436* chance to go on being served using the last, partially4437* consumed budget, bfqq->entity.service needs to be kept,4438* because if bfqq then actually goes on being served using4439* the same budget, the last value of bfqq->entity.service is4440* needed to properly decrement bfqq->entity.budget by the4441* portion already consumed. In contrast, it is not necessary4442* to keep entity->service for parent entities too, because4443* the bubble up of the new value of bfqq->entity.budget will4444* make sure that the budgets of parent entities are correct,4445* even in case bfqq and thus parent entities go on receiving4446* service with the same budget.4447*/4448entity = entity->parent;4449for_each_entity(entity)4450entity->service = 0;4451}44524453/*4454* Budget timeout is not implemented through a dedicated timer, but4455* just checked on request arrivals and completions, as well as on4456* idle timer expirations.4457*/4458static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)4459{4460return time_is_before_eq_jiffies(bfqq->budget_timeout);4461}44624463/*4464* If we expire a queue that is actively waiting (i.e., with the4465* device idled) for the arrival of a new request, then we may incur4466* the timestamp misalignment problem described in the body of the4467* function __bfq_activate_entity. 
Hence we return true only if this4468* condition does not hold, or if the queue is slow enough to deserve4469* only to be kicked off for preserving a high throughput.4470*/4471static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq)4472{4473bfq_log_bfqq(bfqq->bfqd, bfqq,4474"may_budget_timeout: wait_request %d left %d timeout %d",4475bfq_bfqq_wait_request(bfqq),4476bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3,4477bfq_bfqq_budget_timeout(bfqq));44784479return (!bfq_bfqq_wait_request(bfqq) ||4480bfq_bfqq_budget_left(bfqq) >= bfqq->entity.budget / 3)4481&&4482bfq_bfqq_budget_timeout(bfqq);4483}44844485static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,4486struct bfq_queue *bfqq)4487{4488bool rot_without_queueing =4489!blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,4490bfqq_sequential_and_IO_bound,4491idling_boosts_thr;44924493/* No point in idling for bfqq if it won't get requests any longer */4494if (unlikely(!bfqq_process_refs(bfqq)))4495return false;44964497bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&4498bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_has_short_ttime(bfqq);44994500/*4501* The next variable takes into account the cases where idling4502* boosts the throughput.4503*4504* The value of the variable is computed considering, first, that4505* idling is virtually always beneficial for the throughput if:4506* (a) the device is not NCQ-capable and rotational, or4507* (b) regardless of the presence of NCQ, the device is rotational and4508* the request pattern for bfqq is I/O-bound and sequential, or4509* (c) regardless of whether it is rotational, the device is4510* not NCQ-capable and the request pattern for bfqq is4511* I/O-bound and sequential.4512*4513* Secondly, and in contrast to the above item (b), idling an4514* NCQ-capable flash-based device would not boost the4515* throughput even with sequential I/O; rather it would lower4516* the throughput in proportion to how fast the device4517* is. Accordingly, the next variable is true if any of the4518* above conditions (a), (b) or (c) is true, and, in4519* particular, happens to be false if bfqd is an NCQ-capable4520* flash-based device.4521*/4522idling_boosts_thr = rot_without_queueing ||4523((!blk_queue_nonrot(bfqd->queue) || !bfqd->hw_tag) &&4524bfqq_sequential_and_IO_bound);45254526/*4527* The return value of this function is equal to that of4528* idling_boosts_thr, unless a special case holds. In this4529* special case, described below, idling may cause problems to4530* weight-raised queues.4531*4532* When the request pool is saturated (e.g., in the presence4533* of write hogs), if the processes associated with4534* non-weight-raised queues ask for requests at a lower rate,4535* then processes associated with weight-raised queues have a4536* higher probability to get a request from the pool4537* immediately (or at least soon) when they need one. Thus4538* they have a higher probability to actually get a fraction4539* of the device throughput proportional to their high4540* weight. This is especially true with NCQ-capable drives,4541* which enqueue several requests in advance, and further4542* reorder internally-queued requests.4543*4544* For this reason, we force to false the return value if4545* there are weight-raised busy queues. In this case, and if4546* bfqq is not weight-raised, this guarantees that the device4547* is not idled for bfqq (if, instead, bfqq is weight-raised,4548* then idling will be guaranteed by another variable, see4549* below). 
Combined with the timestamping rules of BFQ (see4550* [1] for details), this behavior causes bfqq, and hence any4551* sync non-weight-raised queue, to get a lower number of4552* requests served, and thus to ask for a lower number of4553* requests from the request pool, before the busy4554* weight-raised queues get served again. This often mitigates4555* starvation problems in the presence of heavy write4556* workloads and NCQ, thereby guaranteeing a higher4557* application and system responsiveness in these hostile4558* scenarios.4559*/4560return idling_boosts_thr &&4561bfqd->wr_busy_queues == 0;4562}45634564/*4565* For a queue that becomes empty, device idling is allowed only if4566* this function returns true for that queue. As a consequence, since4567* device idling plays a critical role for both throughput boosting4568* and service guarantees, the return value of this function plays a4569* critical role as well.4570*4571* In a nutshell, this function returns true only if idling is4572* beneficial for throughput or, even if detrimental for throughput,4573* idling is however necessary to preserve service guarantees (low4574* latency, desired throughput distribution, ...). In particular, on4575* NCQ-capable devices, this function tries to return false, so as to4576* help keep the drives' internal queues full, whenever this helps the4577* device boost the throughput without causing any service-guarantee4578* issue.4579*4580* Most of the issues taken into account to get the return value of4581* this function are not trivial. We discuss these issues in the two4582* functions providing the main pieces of information needed by this4583* function.4584*/4585static bool bfq_better_to_idle(struct bfq_queue *bfqq)4586{4587struct bfq_data *bfqd = bfqq->bfqd;4588bool idling_boosts_thr_with_no_issue, idling_needed_for_service_guar;45894590/* No point in idling for bfqq if it won't get requests any longer */4591if (unlikely(!bfqq_process_refs(bfqq)))4592return false;45934594if (unlikely(bfqd->strict_guarantees))4595return true;45964597/*4598* Idling is performed only if slice_idle > 0. 
In addition, we4599* do not idle if4600* (a) bfqq is async4601* (b) bfqq is in the idle io prio class: in this case we do4602* not idle because we want to minimize the bandwidth that4603* queues in this class can steal to higher-priority queues4604*/4605if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||4606bfq_class_idle(bfqq))4607return false;46084609idling_boosts_thr_with_no_issue =4610idling_boosts_thr_without_issues(bfqd, bfqq);46114612idling_needed_for_service_guar =4613idling_needed_for_service_guarantees(bfqd, bfqq);46144615/*4616* We have now the two components we need to compute the4617* return value of the function, which is true only if idling4618* either boosts the throughput (without issues), or is4619* necessary to preserve service guarantees.4620*/4621return idling_boosts_thr_with_no_issue ||4622idling_needed_for_service_guar;4623}46244625/*4626* If the in-service queue is empty but the function bfq_better_to_idle4627* returns true, then:4628* 1) the queue must remain in service and cannot be expired, and4629* 2) the device must be idled to wait for the possible arrival of a new4630* request for the queue.4631* See the comments on the function bfq_better_to_idle for the reasons4632* why performing device idling is the best choice to boost the throughput4633* and preserve service guarantees when bfq_better_to_idle itself4634* returns true.4635*/4636static bool bfq_bfqq_must_idle(struct bfq_queue *bfqq)4637{4638return RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_better_to_idle(bfqq);4639}46404641/*4642* This function chooses the queue from which to pick the next extra4643* I/O request to inject, if it finds a compatible queue. See the4644* comments on bfq_update_inject_limit() for details on the injection4645* mechanism, and for the definitions of the quantities mentioned4646* below.4647*/4648static struct bfq_queue *4649bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)4650{4651struct bfq_queue *bfqq, *in_serv_bfqq = bfqd->in_service_queue;4652unsigned int limit = in_serv_bfqq->inject_limit;4653int i;46544655/*4656* If4657* - bfqq is not weight-raised and therefore does not carry4658* time-critical I/O,4659* or4660* - regardless of whether bfqq is weight-raised, bfqq has4661* however a long think time, during which it can absorb the4662* effect of an appropriate number of extra I/O requests4663* from other queues (see bfq_update_inject_limit for4664* details on the computation of this number);4665* then injection can be performed without restrictions.4666*/4667bool in_serv_always_inject = in_serv_bfqq->wr_coeff == 1 ||4668!bfq_bfqq_has_short_ttime(in_serv_bfqq);46694670/*4671* If4672* - the baseline total service time could not be sampled yet,4673* so the inject limit happens to be still 0, and4674* - a lot of time has elapsed since the plugging of I/O4675* dispatching started, so drive speed is being wasted4676* significantly;4677* then temporarily raise inject limit to one request.4678*/4679if (limit == 0 && in_serv_bfqq->last_serv_time_ns == 0 &&4680bfq_bfqq_wait_request(in_serv_bfqq) &&4681time_is_before_eq_jiffies(bfqd->last_idling_start_jiffies +4682bfqd->bfq_slice_idle)4683)4684limit = 1;46854686if (bfqd->tot_rq_in_driver >= limit)4687return NULL;46884689/*4690* Linear search of the source queue for injection; but, with4691* a high probability, very few steps are needed to find a4692* candidate queue, i.e., a queue with enough budget left for4693* its next request. 
In fact:4694* - BFQ dynamically updates the budget of every queue so as4695* to accommodate the expected backlog of the queue;4696* - if a queue gets all its requests dispatched as injected4697* service, then the queue is removed from the active list4698* (and re-added only if it gets new requests, but then it4699* is assigned again enough budget for its new backlog).4700*/4701for (i = 0; i < bfqd->num_actuators; i++) {4702list_for_each_entry(bfqq, &bfqd->active_list[i], bfqq_list)4703if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&4704(in_serv_always_inject || bfqq->wr_coeff > 1) &&4705bfq_serv_to_charge(bfqq->next_rq, bfqq) <=4706bfq_bfqq_budget_left(bfqq)) {4707/*4708* Allow for only one large in-flight request4709* on non-rotational devices, for the4710* following reason. On non-rotationl drives,4711* large requests take much longer than4712* smaller requests to be served. In addition,4713* the drive prefers to serve large requests4714* w.r.t. to small ones, if it can choose. So,4715* having more than one large requests queued4716* in the drive may easily make the next first4717* request of the in-service queue wait for so4718* long to break bfqq's service guarantees. On4719* the bright side, large requests let the4720* drive reach a very high throughput, even if4721* there is only one in-flight large request4722* at a time.4723*/4724if (blk_queue_nonrot(bfqd->queue) &&4725blk_rq_sectors(bfqq->next_rq) >=4726BFQQ_SECT_THR_NONROT &&4727bfqd->tot_rq_in_driver >= 1)4728continue;4729else {4730bfqd->rqs_injected = true;4731return bfqq;4732}4733}4734}47354736return NULL;4737}47384739static struct bfq_queue *4740bfq_find_active_bfqq_for_actuator(struct bfq_data *bfqd, int idx)4741{4742struct bfq_queue *bfqq;47434744if (bfqd->in_service_queue &&4745bfqd->in_service_queue->actuator_idx == idx)4746return bfqd->in_service_queue;47474748list_for_each_entry(bfqq, &bfqd->active_list[idx], bfqq_list) {4749if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&4750bfq_serv_to_charge(bfqq->next_rq, bfqq) <=4751bfq_bfqq_budget_left(bfqq)) {4752return bfqq;4753}4754}47554756return NULL;4757}47584759/*4760* Perform a linear scan of each actuator, until an actuator is found4761* for which the following three conditions hold: the load of the4762* actuator is below the threshold (see comments on4763* actuator_load_threshold for details) and lower than that of the4764* next actuator (comments on this extra condition below), and there4765* is a queue that contains I/O for that actuator. On success, return4766* that queue.4767*4768* Performing a plain linear scan entails a prioritization among4769* actuators. The extra condition above breaks this prioritization and4770* tends to distribute injection uniformly across actuators.4771*/4772static struct bfq_queue *4773bfq_find_bfqq_for_underused_actuator(struct bfq_data *bfqd)4774{4775int i;47764777for (i = 0 ; i < bfqd->num_actuators; i++) {4778if (bfqd->rq_in_driver[i] < bfqd->actuator_load_threshold &&4779(i == bfqd->num_actuators - 1 ||4780bfqd->rq_in_driver[i] < bfqd->rq_in_driver[i+1])) {4781struct bfq_queue *bfqq =4782bfq_find_active_bfqq_for_actuator(bfqd, i);47834784if (bfqq)4785return bfqq;4786}4787}47884789return NULL;4790}479147924793/*4794* Select a queue for service. 
If we have a current queue in service,4795* check whether to continue servicing it, or retrieve and set a new one.4796*/4797static struct bfq_queue *bfq_select_queue(struct bfq_data *bfqd)4798{4799struct bfq_queue *bfqq, *inject_bfqq;4800struct request *next_rq;4801enum bfqq_expiration reason = BFQQE_BUDGET_TIMEOUT;48024803bfqq = bfqd->in_service_queue;4804if (!bfqq)4805goto new_queue;48064807bfq_log_bfqq(bfqd, bfqq, "select_queue: already in-service queue");48084809/*4810* Do not expire bfqq for budget timeout if bfqq may be about4811* to enjoy device idling. The reason why, in this case, we4812* prevent bfqq from expiring is the same as in the comments4813* on the case where bfq_bfqq_must_idle() returns true, in4814* bfq_completed_request().4815*/4816if (bfq_may_expire_for_budg_timeout(bfqq) &&4817!bfq_bfqq_must_idle(bfqq))4818goto expire;48194820check_queue:4821/*4822* If some actuator is underutilized, but the in-service4823* queue does not contain I/O for that actuator, then try to4824* inject I/O for that actuator.4825*/4826inject_bfqq = bfq_find_bfqq_for_underused_actuator(bfqd);4827if (inject_bfqq && inject_bfqq != bfqq)4828return inject_bfqq;48294830/*4831* This loop is rarely executed more than once. Even when it4832* happens, it is much more convenient to re-execute this loop4833* than to return NULL and trigger a new dispatch to get a4834* request served.4835*/4836next_rq = bfqq->next_rq;4837/*4838* If bfqq has requests queued and it has enough budget left to4839* serve them, keep the queue, otherwise expire it.4840*/4841if (next_rq) {4842if (bfq_serv_to_charge(next_rq, bfqq) >4843bfq_bfqq_budget_left(bfqq)) {4844/*4845* Expire the queue for budget exhaustion,4846* which makes sure that the next budget is4847* enough to serve the next request, even if4848* it comes from the fifo expired path.4849*/4850reason = BFQQE_BUDGET_EXHAUSTED;4851goto expire;4852} else {4853/*4854* The idle timer may be pending because we may4855* not disable disk idling even when a new request4856* arrives.4857*/4858if (bfq_bfqq_wait_request(bfqq)) {4859/*4860* If we get here: 1) at least a new request4861* has arrived but we have not disabled the4862* timer because the request was too small,4863* 2) then the block layer has unplugged4864* the device, causing the dispatch to be4865* invoked.4866*4867* Since the device is unplugged, now the4868* requests are probably large enough to4869* provide a reasonable throughput.4870* So we disable idling.4871*/4872bfq_clear_bfqq_wait_request(bfqq);4873hrtimer_try_to_cancel(&bfqd->idle_slice_timer);4874}4875goto keep_queue;4876}4877}48784879/*4880* No requests pending. 
However, if the in-service queue is idling4881* for a new request, or has requests waiting for a completion and4882* may idle after their completion, then keep it anyway.4883*4884* Yet, inject service from other queues if it boosts4885* throughput and is possible.4886*/4887if (bfq_bfqq_wait_request(bfqq) ||4888(bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) {4889unsigned int act_idx = bfqq->actuator_idx;4890struct bfq_queue *async_bfqq = NULL;4891struct bfq_queue *blocked_bfqq =4892!hlist_empty(&bfqq->woken_list) ?4893container_of(bfqq->woken_list.first,4894struct bfq_queue,4895woken_list_node)4896: NULL;48974898if (bfqq->bic && bfqq->bic->bfqq[0][act_idx] &&4899bfq_bfqq_busy(bfqq->bic->bfqq[0][act_idx]) &&4900bfqq->bic->bfqq[0][act_idx]->next_rq)4901async_bfqq = bfqq->bic->bfqq[0][act_idx];4902/*4903* The next four mutually-exclusive ifs decide4904* whether to try injection, and choose the queue to4905* pick an I/O request from.4906*4907* The first if checks whether the process associated4908* with bfqq has also async I/O pending. If so, it4909* injects such I/O unconditionally. Injecting async4910* I/O from the same process can cause no harm to the4911* process. On the contrary, it can only increase4912* bandwidth and reduce latency for the process.4913*4914* The second if checks whether there happens to be a4915* non-empty waker queue for bfqq, i.e., a queue whose4916* I/O needs to be completed for bfqq to receive new4917* I/O. This happens, e.g., if bfqq is associated with4918* a process that does some sync. A sync generates4919* extra blocking I/O, which must be completed before4920* the process associated with bfqq can go on with its4921* I/O. If the I/O of the waker queue is not served,4922* then bfqq remains empty, and no I/O is dispatched,4923* until the idle timeout fires for bfqq. This is4924* likely to result in lower bandwidth and higher4925* latencies for bfqq, and in a severe loss of total4926* throughput. The best action to take is therefore to4927* serve the waker queue as soon as possible. So do it4928* (without relying on the third alternative below for4929* eventually serving waker_bfqq's I/O; see the last4930* paragraph for further details). This systematic4931* injection of I/O from the waker queue does not4932* cause any delay to bfqq's I/O. On the contrary,4933* next bfqq's I/O is brought forward dramatically,4934* for it is not blocked for milliseconds.4935*4936* The third if checks whether there is a queue woken4937* by bfqq, and currently with pending I/O. Such a4938* woken queue does not steal bandwidth from bfqq,4939* because it remains soon without I/O if bfqq is not4940* served. So there is virtually no risk of loss of4941* bandwidth for bfqq if this woken queue has I/O4942* dispatched while bfqq is waiting for new I/O.4943*4944* The fourth if checks whether bfqq is a queue for4945* which it is better to avoid injection. It is so if4946* bfqq delivers more throughput when served without4947* any further I/O from other queues in the middle, or4948* if the service times of bfqq's I/O requests both4949* count more than overall throughput, and may be4950* easily increased by injection (this happens if bfqq4951* has a short think time). If none of these4952* conditions holds, then a candidate queue for4953* injection is looked for through4954* bfq_choose_bfqq_for_injection(). 
Note that the4955* latter may return NULL (for example if the inject4956* limit for bfqq is currently 0).4957*4958* NOTE: motivation for the second alternative4959*4960* Thanks to the way the inject limit is updated in4961* bfq_update_has_short_ttime(), it is rather likely4962* that, if I/O is being plugged for bfqq and the4963* waker queue has pending I/O requests that are4964* blocking bfqq's I/O, then the fourth alternative4965* above lets the waker queue get served before the4966* I/O-plugging timeout fires. So one may deem the4967* second alternative superfluous. It is not, because4968* the fourth alternative may be way less effective in4969* case of a synchronization. For two main4970* reasons. First, throughput may be low because the4971* inject limit may be too low to guarantee the same4972* amount of injected I/O, from the waker queue or4973* other queues, that the second alternative4974* guarantees (the second alternative unconditionally4975* injects a pending I/O request of the waker queue4976* for each bfq_dispatch_request()). Second, with the4977* fourth alternative, the duration of the plugging,4978* i.e., the time before bfqq finally receives new I/O,4979* may not be minimized, because the waker queue may4980* happen to be served only after other queues.4981*/4982if (async_bfqq &&4983icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic &&4984bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <=4985bfq_bfqq_budget_left(async_bfqq))4986bfqq = async_bfqq;4987else if (bfqq->waker_bfqq &&4988bfq_bfqq_busy(bfqq->waker_bfqq) &&4989bfqq->waker_bfqq->next_rq &&4990bfq_serv_to_charge(bfqq->waker_bfqq->next_rq,4991bfqq->waker_bfqq) <=4992bfq_bfqq_budget_left(bfqq->waker_bfqq)4993)4994bfqq = bfqq->waker_bfqq;4995else if (blocked_bfqq &&4996bfq_bfqq_busy(blocked_bfqq) &&4997blocked_bfqq->next_rq &&4998bfq_serv_to_charge(blocked_bfqq->next_rq,4999blocked_bfqq) <=5000bfq_bfqq_budget_left(blocked_bfqq)5001)5002bfqq = blocked_bfqq;5003else if (!idling_boosts_thr_without_issues(bfqd, bfqq) &&5004(bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 ||5005!bfq_bfqq_has_short_ttime(bfqq)))5006bfqq = bfq_choose_bfqq_for_injection(bfqd);5007else5008bfqq = NULL;50095010goto keep_queue;5011}50125013reason = BFQQE_NO_MORE_REQUESTS;5014expire:5015bfq_bfqq_expire(bfqd, bfqq, false, reason);5016new_queue:5017bfqq = bfq_set_in_service_queue(bfqd);5018if (bfqq) {5019bfq_log_bfqq(bfqd, bfqq, "select_queue: checking new queue");5020goto check_queue;5021}5022keep_queue:5023if (bfqq)5024bfq_log_bfqq(bfqd, bfqq, "select_queue: returned this queue");5025else5026bfq_log(bfqd, "select_queue: no queue returned");50275028return bfqq;5029}50305031static void bfq_update_wr_data(struct bfq_data *bfqd, struct bfq_queue *bfqq)5032{5033struct bfq_entity *entity = &bfqq->entity;50345035if (bfqq->wr_coeff > 1) { /* queue is being weight-raised */5036bfq_log_bfqq(bfqd, bfqq,5037"raising period dur %u/%u msec, old coeff %u, w %d(%d)",5038jiffies_to_msecs(jiffies - bfqq->last_wr_start_finish),5039jiffies_to_msecs(bfqq->wr_cur_max_time),5040bfqq->wr_coeff,5041bfqq->entity.weight, bfqq->entity.orig_weight);50425043if (entity->prio_changed)5044bfq_log_bfqq(bfqd, bfqq, "WARN: pending prio change");50455046/*5047* If the queue was activated in a burst, or too much5048* time has elapsed from the beginning of this5049* weight-raising period, then end weight raising.5050*/5051if (bfq_bfqq_in_large_burst(bfqq))5052bfq_bfqq_end_wr(bfqq);5053else if (time_is_before_jiffies(bfqq->last_wr_start_finish +5054bfqq->wr_cur_max_time)) {5055if 
(bfqq->wr_cur_max_time != bfqd->bfq_wr_rt_max_time ||5056time_is_before_jiffies(bfqq->wr_start_at_switch_to_srt +5057bfq_wr_duration(bfqd))) {5058/*5059* Either in interactive weight5060* raising, or in soft_rt weight5061* raising with the5062* interactive-weight-raising period5063* elapsed (so no switch back to5064* interactive weight raising).5065*/5066bfq_bfqq_end_wr(bfqq);5067} else { /*5068* soft_rt finishing while still in5069* interactive period, switch back to5070* interactive weight raising5071*/5072switch_back_to_interactive_wr(bfqq, bfqd);5073bfqq->entity.prio_changed = 1;5074}5075}5076if (bfqq->wr_coeff > 1 &&5077bfqq->wr_cur_max_time != bfqd->bfq_wr_rt_max_time &&5078bfqq->service_from_wr > max_service_from_wr) {5079/* see comments on max_service_from_wr */5080bfq_bfqq_end_wr(bfqq);5081}5082}5083/*5084* To improve latency (for this or other queues), immediately5085* update weight both if it must be raised and if it must be5086* lowered. Since, entity may be on some active tree here, and5087* might have a pending change of its ioprio class, invoke5088* next function with the last parameter unset (see the5089* comments on the function).5090*/5091if ((entity->weight > entity->orig_weight) != (bfqq->wr_coeff > 1))5092__bfq_entity_update_weight_prio(bfq_entity_service_tree(entity),5093entity, false);5094}50955096/*5097* Dispatch next request from bfqq.5098*/5099static struct request *bfq_dispatch_rq_from_bfqq(struct bfq_data *bfqd,5100struct bfq_queue *bfqq)5101{5102struct request *rq = bfqq->next_rq;5103unsigned long service_to_charge;51045105service_to_charge = bfq_serv_to_charge(rq, bfqq);51065107bfq_bfqq_served(bfqq, service_to_charge);51085109if (bfqq == bfqd->in_service_queue && bfqd->wait_dispatch) {5110bfqd->wait_dispatch = false;5111bfqd->waited_rq = rq;5112}51135114bfq_dispatch_remove(bfqd->queue, rq);51155116if (bfqq != bfqd->in_service_queue)5117return rq;51185119/*5120* If weight raising has to terminate for bfqq, then next5121* function causes an immediate update of bfqq's weight,5122* without waiting for next activation. As a consequence, on5123* expiration, bfqq will be timestamped as if has never been5124* weight-raised during this service slot, even if it has5125* received part or even most of the service as a5126* weight-raised queue. 
This inflates bfqq's timestamps, which5127* is beneficial, as bfqq is then more willing to leave the5128* device immediately to possible other weight-raised queues.5129*/5130bfq_update_wr_data(bfqd, bfqq);51315132/*5133* Expire bfqq, pretending that its budget expired, if bfqq5134* belongs to CLASS_IDLE and other queues are waiting for5135* service.5136*/5137if (bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq))5138bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);51395140return rq;5141}51425143static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)5144{5145struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;51465147/*5148* Avoiding lock: a race on bfqd->queued should cause at5149* most a call to dispatch for nothing5150*/5151return !list_empty_careful(&bfqd->dispatch) ||5152READ_ONCE(bfqd->queued);5153}51545155static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)5156{5157struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;5158struct request *rq = NULL;5159struct bfq_queue *bfqq = NULL;51605161if (!list_empty(&bfqd->dispatch)) {5162rq = list_first_entry(&bfqd->dispatch, struct request,5163queuelist);5164list_del_init(&rq->queuelist);51655166bfqq = RQ_BFQQ(rq);51675168if (bfqq) {5169/*5170* Increment counters here, because this5171* dispatch does not follow the standard5172* dispatch flow (where counters are5173* incremented)5174*/5175bfqq->dispatched++;51765177goto inc_in_driver_start_rq;5178}51795180/*5181* We exploit the bfq_finish_requeue_request hook to5182* decrement tot_rq_in_driver, but5183* bfq_finish_requeue_request will not be invoked on5184* this request. So, to avoid unbalance, just start5185* this request, without incrementing tot_rq_in_driver. As5186* a negative consequence, tot_rq_in_driver is deceptively5187* lower than it should be while this request is in5188* service. This may cause bfq_schedule_dispatch to be5189* invoked uselessly.5190*5191* As for implementing an exact solution, the5192* bfq_finish_requeue_request hook, if defined, is5193* probably invoked also on this request. So, by5194* exploiting this hook, we could 1) increment5195* tot_rq_in_driver here, and 2) decrement it in5196* bfq_finish_requeue_request. Such a solution would5197* let the value of the counter be always accurate,5198* but it would entail using an extra interface5199* function. This cost seems higher than the benefit,5200* being the frequency of non-elevator-private5201* requests very low.5202*/5203goto start_rq;5204}52055206bfq_log(bfqd, "dispatch requests: %d busy queues",5207bfq_tot_busy_queues(bfqd));52085209if (bfq_tot_busy_queues(bfqd) == 0)5210goto exit;52115212/*5213* Force device to serve one request at a time if5214* strict_guarantees is true. Forcing this service scheme is5215* currently the ONLY way to guarantee that the request5216* service order enforced by the scheduler is respected by a5217* queueing device. 
Otherwise the device is free even to make5218* some unlucky request wait for as long as the device5219* wishes.5220*5221* Of course, serving one request at a time may cause loss of5222* throughput.5223*/5224if (bfqd->strict_guarantees && bfqd->tot_rq_in_driver > 0)5225goto exit;52265227bfqq = bfq_select_queue(bfqd);5228if (!bfqq)5229goto exit;52305231rq = bfq_dispatch_rq_from_bfqq(bfqd, bfqq);52325233if (rq) {5234inc_in_driver_start_rq:5235bfqd->rq_in_driver[bfqq->actuator_idx]++;5236bfqd->tot_rq_in_driver++;5237start_rq:5238rq->rq_flags |= RQF_STARTED;5239}5240exit:5241return rq;5242}52435244#ifdef CONFIG_BFQ_CGROUP_DEBUG5245static void bfq_update_dispatch_stats(struct request_queue *q,5246struct request *rq,5247struct bfq_queue *in_serv_queue,5248bool idle_timer_disabled)5249{5250struct bfq_queue *bfqq = rq ? RQ_BFQQ(rq) : NULL;52515252if (!idle_timer_disabled && !bfqq)5253return;52545255/*5256* rq and bfqq are guaranteed to exist until this function5257* ends, for the following reasons. First, rq can be5258* dispatched to the device, and then can be completed and5259* freed, only after this function ends. Second, rq cannot be5260* merged (and thus freed because of a merge) any longer,5261* because it has already started. Thus rq cannot be freed5262* before this function ends, and, since rq has a reference to5263* bfqq, the same guarantee holds for bfqq too.5264*5265* In addition, the following queue lock guarantees that5266* bfqq_group(bfqq) exists as well.5267*/5268spin_lock_irq(&q->queue_lock);5269if (idle_timer_disabled)5270/*5271* Since the idle timer has been disabled,5272* in_serv_queue contained some request when5273* __bfq_dispatch_request was invoked above, which5274* implies that rq was picked exactly from5275* in_serv_queue. Thus in_serv_queue == bfqq, and is5276* therefore guaranteed to exist because of the above5277* arguments.5278*/5279bfqg_stats_update_idle_time(bfqq_group(in_serv_queue));5280if (bfqq) {5281struct bfq_group *bfqg = bfqq_group(bfqq);52825283bfqg_stats_update_avg_queue_size(bfqg);5284bfqg_stats_set_start_empty_time(bfqg);5285bfqg_stats_update_io_remove(bfqg, rq->cmd_flags);5286}5287spin_unlock_irq(&q->queue_lock);5288}5289#else5290static inline void bfq_update_dispatch_stats(struct request_queue *q,5291struct request *rq,5292struct bfq_queue *in_serv_queue,5293bool idle_timer_disabled) {}5294#endif /* CONFIG_BFQ_CGROUP_DEBUG */52955296static struct request *bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)5297{5298struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;5299struct request *rq;5300struct bfq_queue *in_serv_queue;5301bool waiting_rq, idle_timer_disabled = false;53025303spin_lock_irq(&bfqd->lock);53045305in_serv_queue = bfqd->in_service_queue;5306waiting_rq = in_serv_queue && bfq_bfqq_wait_request(in_serv_queue);53075308rq = __bfq_dispatch_request(hctx);5309if (in_serv_queue == bfqd->in_service_queue) {5310idle_timer_disabled =5311waiting_rq && !bfq_bfqq_wait_request(in_serv_queue);5312}53135314spin_unlock_irq(&bfqd->lock);5315bfq_update_dispatch_stats(hctx->queue, rq,5316idle_timer_disabled ? in_serv_queue : NULL,5317idle_timer_disabled);53185319return rq;5320}53215322/*5323* Task holds one reference to the queue, dropped when task exits. Each rq5324* in-flight on this queue also holds a reference, dropped when rq is freed.5325*5326* Scheduler lock must be held here. 
Recall not to use bfqq after calling5327* this function on it.5328*/5329void bfq_put_queue(struct bfq_queue *bfqq)5330{5331struct bfq_queue *item;5332struct hlist_node *n;5333struct bfq_group *bfqg = bfqq_group(bfqq);53345335bfq_log_bfqq(bfqq->bfqd, bfqq, "put_queue: %p %d", bfqq, bfqq->ref);53365337bfqq->ref--;5338if (bfqq->ref)5339return;53405341if (!hlist_unhashed(&bfqq->burst_list_node)) {5342hlist_del_init(&bfqq->burst_list_node);5343/*5344* Decrement also burst size after the removal, if the5345* process associated with bfqq is exiting, and thus5346* does not contribute to the burst any longer. This5347* decrement helps filter out false positives of large5348* bursts, when some short-lived process (often due to5349* the execution of commands by some service) happens5350* to start and exit while a complex application is5351* starting, and thus spawning several processes that5352* do I/O (and that *must not* be treated as a large5353* burst, see comments on bfq_handle_burst).5354*5355* In particular, the decrement is performed only if:5356* 1) bfqq is not a merged queue, because, if it is,5357* then this free of bfqq is not triggered by the exit5358* of the process bfqq is associated with, but exactly5359* by the fact that bfqq has just been merged.5360* 2) burst_size is greater than 0, to handle5361* unbalanced decrements. Unbalanced decrements may5362* happen in te following case: bfqq is inserted into5363* the current burst list--without incrementing5364* bust_size--because of a split, but the current5365* burst list is not the burst list bfqq belonged to5366* (see comments on the case of a split in5367* bfq_set_request).5368*/5369if (bfqq->bic && bfqq->bfqd->burst_size > 0)5370bfqq->bfqd->burst_size--;5371}53725373/*5374* bfqq does not exist any longer, so it cannot be woken by5375* any other queue, and cannot wake any other queue. Then bfqq5376* must be removed from the woken list of its possible waker5377* queue, and all queues in the woken list of bfqq must stop5378* having a waker queue. Strictly speaking, these updates5379* should be performed when bfqq remains with no I/O source5380* attached to it, which happens before bfqq gets freed. In5381* particular, this happens when the last process associated5382* with bfqq exits or gets associated with a different5383* queue. However, both events lead to bfqq being freed soon,5384* and dangling references would come out only after bfqq gets5385* freed. 
So these updates are done here, as a simple and safe5386* way to handle all cases.5387*/5388/* remove bfqq from woken list */5389if (!hlist_unhashed(&bfqq->woken_list_node))5390hlist_del_init(&bfqq->woken_list_node);53915392/* reset waker for all queues in woken list */5393hlist_for_each_entry_safe(item, n, &bfqq->woken_list,5394woken_list_node) {5395item->waker_bfqq = NULL;5396hlist_del_init(&item->woken_list_node);5397}53985399if (bfqq->bfqd->last_completed_rq_bfqq == bfqq)5400bfqq->bfqd->last_completed_rq_bfqq = NULL;54015402WARN_ON_ONCE(!list_empty(&bfqq->fifo));5403WARN_ON_ONCE(!RB_EMPTY_ROOT(&bfqq->sort_list));5404WARN_ON_ONCE(bfqq->dispatched);54055406kmem_cache_free(bfq_pool, bfqq);5407bfqg_and_blkg_put(bfqg);5408}54095410static void bfq_put_stable_ref(struct bfq_queue *bfqq)5411{5412bfqq->stable_ref--;5413bfq_put_queue(bfqq);5414}54155416void bfq_put_cooperator(struct bfq_queue *bfqq)5417{5418struct bfq_queue *__bfqq, *next;54195420/*5421* If this queue was scheduled to merge with another queue, be5422* sure to drop the reference taken on that queue (and others in5423* the merge chain). See bfq_setup_merge and bfq_merge_bfqqs.5424*/5425__bfqq = bfqq->new_bfqq;5426while (__bfqq) {5427next = __bfqq->new_bfqq;5428bfq_put_queue(__bfqq);5429__bfqq = next;5430}5431}54325433static void bfq_exit_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq)5434{5435if (bfqq == bfqd->in_service_queue) {5436__bfq_bfqq_expire(bfqd, bfqq, BFQQE_BUDGET_TIMEOUT);5437bfq_schedule_dispatch(bfqd);5438}54395440bfq_log_bfqq(bfqd, bfqq, "exit_bfqq: %p, %d", bfqq, bfqq->ref);54415442bfq_put_cooperator(bfqq);54435444bfq_release_process_ref(bfqd, bfqq);5445}54465447static void bfq_exit_icq_bfqq(struct bfq_io_cq *bic, bool is_sync,5448unsigned int actuator_idx)5449{5450struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync, actuator_idx);5451struct bfq_data *bfqd;54525453if (bfqq)5454bfqd = bfqq->bfqd; /* NULL if scheduler already exited */54555456if (bfqq && bfqd) {5457bic_set_bfqq(bic, NULL, is_sync, actuator_idx);5458bfq_exit_bfqq(bfqd, bfqq);5459}5460}54615462static void _bfq_exit_icq(struct bfq_io_cq *bic, unsigned int num_actuators)5463{5464struct bfq_iocq_bfqq_data *bfqq_data = bic->bfqq_data;5465unsigned int act_idx;54665467for (act_idx = 0; act_idx < num_actuators; act_idx++) {5468if (bfqq_data[act_idx].stable_merge_bfqq)5469bfq_put_stable_ref(bfqq_data[act_idx].stable_merge_bfqq);54705471bfq_exit_icq_bfqq(bic, true, act_idx);5472bfq_exit_icq_bfqq(bic, false, act_idx);5473}5474}54755476static void bfq_exit_icq(struct io_cq *icq)5477{5478struct bfq_io_cq *bic = icq_to_bic(icq);5479struct bfq_data *bfqd = bic_to_bfqd(bic);5480unsigned long flags;54815482/*5483* If bfqd and thus bfqd->num_actuators is not available any5484* longer, then cycle over all possible per-actuator bfqqs in5485* next loop. 
We rely on bic being zeroed on creation, and5486* therefore on its unused per-actuator fields being NULL.5487*5488* bfqd is NULL if scheduler already exited, and in that case5489* this is the last time these queues are accessed.5490*/5491if (bfqd) {5492spin_lock_irqsave(&bfqd->lock, flags);5493_bfq_exit_icq(bic, bfqd->num_actuators);5494spin_unlock_irqrestore(&bfqd->lock, flags);5495} else {5496_bfq_exit_icq(bic, BFQ_MAX_ACTUATORS);5497}5498}54995500/*5501* Update the entity prio values; note that the new values will not5502* be used until the next (re)activation.5503*/5504static void5505bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)5506{5507struct task_struct *tsk = current;5508int ioprio_class;5509struct bfq_data *bfqd = bfqq->bfqd;55105511if (!bfqd)5512return;55135514ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);5515switch (ioprio_class) {5516default:5517pr_err("bdi %s: bfq: bad prio class %d\n",5518bdi_dev_name(bfqq->bfqd->queue->disk->bdi),5519ioprio_class);5520fallthrough;5521case IOPRIO_CLASS_NONE:5522/*5523* No prio set, inherit CPU scheduling settings.5524*/5525bfqq->new_ioprio = task_nice_ioprio(tsk);5526bfqq->new_ioprio_class = task_nice_ioclass(tsk);5527break;5528case IOPRIO_CLASS_RT:5529bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);5530bfqq->new_ioprio_class = IOPRIO_CLASS_RT;5531break;5532case IOPRIO_CLASS_BE:5533bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);5534bfqq->new_ioprio_class = IOPRIO_CLASS_BE;5535break;5536case IOPRIO_CLASS_IDLE:5537bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;5538bfqq->new_ioprio = IOPRIO_NR_LEVELS - 1;5539break;5540}55415542if (bfqq->new_ioprio >= IOPRIO_NR_LEVELS) {5543pr_crit("bfq_set_next_ioprio_data: new_ioprio %d\n",5544bfqq->new_ioprio);5545bfqq->new_ioprio = IOPRIO_NR_LEVELS - 1;5546}55475548bfqq->entity.new_weight = bfq_ioprio_to_weight(bfqq->new_ioprio);5549bfq_log_bfqq(bfqd, bfqq, "new_ioprio %d new_weight %d",5550bfqq->new_ioprio, bfqq->entity.new_weight);5551bfqq->entity.prio_changed = 1;5552}55535554static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,5555struct bio *bio, bool is_sync,5556struct bfq_io_cq *bic,5557bool respawn);55585559static void bfq_check_ioprio_change(struct bfq_io_cq *bic, struct bio *bio)5560{5561struct bfq_data *bfqd = bic_to_bfqd(bic);5562struct bfq_queue *bfqq;5563int ioprio = bic->icq.ioc->ioprio;55645565/*5566* This condition may trigger on a newly created bic, be sure to5567* drop the lock before returning.5568*/5569if (unlikely(!bfqd) || likely(bic->ioprio == ioprio))5570return;55715572bic->ioprio = ioprio;55735574bfqq = bic_to_bfqq(bic, false, bfq_actuator_index(bfqd, bio));5575if (bfqq) {5576struct bfq_queue *old_bfqq = bfqq;55775578bfqq = bfq_get_queue(bfqd, bio, false, bic, true);5579bic_set_bfqq(bic, bfqq, false, bfq_actuator_index(bfqd, bio));5580bfq_release_process_ref(bfqd, old_bfqq);5581}55825583bfqq = bic_to_bfqq(bic, true, bfq_actuator_index(bfqd, bio));5584if (bfqq)5585bfq_set_next_ioprio_data(bfqq, bic);5586}55875588static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,5589struct bfq_io_cq *bic, pid_t pid, int is_sync,5590unsigned int act_idx)5591{5592u64 now_ns = blk_time_get_ns();55935594bfqq->actuator_idx = act_idx;5595RB_CLEAR_NODE(&bfqq->entity.rb_node);5596INIT_LIST_HEAD(&bfqq->fifo);5597INIT_HLIST_NODE(&bfqq->burst_list_node);5598INIT_HLIST_NODE(&bfqq->woken_list_node);5599INIT_HLIST_HEAD(&bfqq->woken_list);56005601bfqq->ref = 0;5602bfqq->bfqd = bfqd;56035604if (bic)5605bfq_set_next_ioprio_data(bfqq, bic);56065607if (is_sync) 
{5608/*5609* No need to mark as has_short_ttime if in5610* idle_class, because no device idling is performed5611* for queues in idle class5612*/5613if (!bfq_class_idle(bfqq))5614/* tentatively mark as has_short_ttime */5615bfq_mark_bfqq_has_short_ttime(bfqq);5616bfq_mark_bfqq_sync(bfqq);5617bfq_mark_bfqq_just_created(bfqq);5618} else5619bfq_clear_bfqq_sync(bfqq);56205621/* set end request to minus infinity from now */5622bfqq->ttime.last_end_request = now_ns + 1;56235624bfqq->creation_time = jiffies;56255626bfqq->io_start_time = now_ns;56275628bfq_mark_bfqq_IO_bound(bfqq);56295630bfqq->pid = pid;56315632/* Tentative initial value to trade off between thr and lat */5633bfqq->max_budget = (2 * bfq_max_budget(bfqd)) / 3;5634bfqq->budget_timeout = bfq_smallest_from_now();56355636bfqq->wr_coeff = 1;5637bfqq->last_wr_start_finish = jiffies;5638bfqq->wr_start_at_switch_to_srt = bfq_smallest_from_now();5639bfqq->split_time = bfq_smallest_from_now();56405641/*5642* To not forget the possibly high bandwidth consumed by a5643* process/queue in the recent past,5644* bfq_bfqq_softrt_next_start() returns a value at least equal5645* to the current value of bfqq->soft_rt_next_start (see5646* comments on bfq_bfqq_softrt_next_start). Set5647* soft_rt_next_start to now, to mean that bfqq has consumed5648* no bandwidth so far.5649*/5650bfqq->soft_rt_next_start = jiffies;56515652/* first request is almost certainly seeky */5653bfqq->seek_history = 1;56545655bfqq->decrease_time_jif = jiffies;5656}56575658static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,5659struct bfq_group *bfqg,5660int ioprio_class, int ioprio, int act_idx)5661{5662switch (ioprio_class) {5663case IOPRIO_CLASS_RT:5664return &bfqg->async_bfqq[0][ioprio][act_idx];5665case IOPRIO_CLASS_NONE:5666ioprio = IOPRIO_BE_NORM;5667fallthrough;5668case IOPRIO_CLASS_BE:5669return &bfqg->async_bfqq[1][ioprio][act_idx];5670case IOPRIO_CLASS_IDLE:5671return &bfqg->async_idle_bfqq[act_idx];5672default:5673return NULL;5674}5675}56765677static struct bfq_queue *5678bfq_do_early_stable_merge(struct bfq_data *bfqd, struct bfq_queue *bfqq,5679struct bfq_io_cq *bic,5680struct bfq_queue *last_bfqq_created)5681{5682unsigned int a_idx = last_bfqq_created->actuator_idx;5683struct bfq_queue *new_bfqq =5684bfq_setup_merge(bfqq, last_bfqq_created);56855686if (!new_bfqq)5687return bfqq;56885689if (new_bfqq->bic)5690new_bfqq->bic->bfqq_data[a_idx].stably_merged = true;5691bic->bfqq_data[a_idx].stably_merged = true;56925693/*5694* Reusing merge functions. This implies that5695* bfqq->bic must be set too, for5696* bfq_merge_bfqqs to correctly save bfqq's5697* state before killing it.5698*/5699bfqq->bic = bic;5700return bfq_merge_bfqqs(bfqd, bic, bfqq);5701}57025703/*5704* Many throughput-sensitive workloads are made of several parallel5705* I/O flows, with all flows generated by the same application, or5706* more generically by the same task (e.g., system boot). The most5707* counterproductive action with these workloads is plugging I/O5708* dispatch when one of the bfq_queues associated with these flows5709* remains temporarily empty.5710*5711* To avoid this plugging, BFQ has been using a burst-handling5712* mechanism for years now. This mechanism has proven effective for5713* throughput, and not detrimental for service guarantees. 
The5714* following function pushes this mechanism a little bit further,5715* basing on the following two facts.5716*5717* First, all the I/O flows of a the same application or task5718* contribute to the execution/completion of that common application5719* or task. So the performance figures that matter are total5720* throughput of the flows and task-wide I/O latency. In particular,5721* these flows do not need to be protected from each other, in terms5722* of individual bandwidth or latency.5723*5724* Second, the above fact holds regardless of the number of flows.5725*5726* Putting these two facts together, this commits merges stably the5727* bfq_queues associated with these I/O flows, i.e., with the5728* processes that generate these IO/ flows, regardless of how many the5729* involved processes are.5730*5731* To decide whether a set of bfq_queues is actually associated with5732* the I/O flows of a common application or task, and to merge these5733* queues stably, this function operates as follows: given a bfq_queue,5734* say Q2, currently being created, and the last bfq_queue, say Q1,5735* created before Q2, Q2 is merged stably with Q1 if5736* - very little time has elapsed since when Q1 was created5737* - Q2 has the same ioprio as Q15738* - Q2 belongs to the same group as Q15739*5740* Merging bfq_queues also reduces scheduling overhead. A fio test5741* with ten random readers on /dev/nullb shows a throughput boost of5742* 40%, with a quadcore. Since BFQ's execution time amounts to ~50% of5743* the total per-request processing time, the above throughput boost5744* implies that BFQ's overhead is reduced by more than 50%.5745*5746* This new mechanism most certainly obsoletes the current5747* burst-handling heuristics. We keep those heuristics for the moment.5748*/5749static struct bfq_queue *bfq_do_or_sched_stable_merge(struct bfq_data *bfqd,5750struct bfq_queue *bfqq,5751struct bfq_io_cq *bic)5752{5753struct bfq_queue **source_bfqq = bfqq->entity.parent ?5754&bfqq->entity.parent->last_bfqq_created :5755&bfqd->last_bfqq_created;57565757struct bfq_queue *last_bfqq_created = *source_bfqq;57585759/*5760* If last_bfqq_created has not been set yet, then init it. If5761* it has been set already, but too long ago, then move it5762* forward to bfqq. Finally, move also if bfqq belongs to a5763* different group than last_bfqq_created, or if bfqq has a5764* different ioprio, ioprio_class or actuator_idx. If none of5765* these conditions holds true, then try an early stable merge5766* or schedule a delayed stable merge. As for the condition on5767* actuator_idx, the reason is that, if queues associated with5768* different actuators are merged, then control is lost on5769* each actuator. Therefore some actuator may be5770* underutilized, and throughput may decrease.5771*5772* A delayed merge is scheduled (instead of performing an5773* early merge), in case bfqq might soon prove to be more5774* throughput-beneficial if not merged. Currently this is5775* possible only if bfqd is rotational with no queueing. For5776* such a drive, not merging bfqq is better for throughput if5777* bfqq happens to contain sequential I/O. So, we wait a5778* little bit for enough I/O to flow through bfqq. After that,5779* if such an I/O is sequential, then the merge is5780* canceled. 
Otherwise the merge is finally performed.5781*/5782if (!last_bfqq_created ||5783time_before(last_bfqq_created->creation_time +5784msecs_to_jiffies(bfq_activation_stable_merging),5785bfqq->creation_time) ||5786bfqq->entity.parent != last_bfqq_created->entity.parent ||5787bfqq->ioprio != last_bfqq_created->ioprio ||5788bfqq->ioprio_class != last_bfqq_created->ioprio_class ||5789bfqq->actuator_idx != last_bfqq_created->actuator_idx)5790*source_bfqq = bfqq;5791else if (time_after_eq(last_bfqq_created->creation_time +5792bfqd->bfq_burst_interval,5793bfqq->creation_time)) {5794if (likely(bfqd->nonrot_with_queueing))5795/*5796* With this type of drive, leaving5797* bfqq alone may provide no5798* throughput benefits compared with5799* merging bfqq. So merge bfqq now.5800*/5801bfqq = bfq_do_early_stable_merge(bfqd, bfqq,5802bic,5803last_bfqq_created);5804else { /* schedule tentative stable merge */5805/*5806* get reference on last_bfqq_created,5807* to prevent it from being freed,5808* until we decide whether to merge5809*/5810last_bfqq_created->ref++;5811/*5812* need to keep track of stable refs, to5813* compute process refs correctly5814*/5815last_bfqq_created->stable_ref++;5816/*5817* Record the bfqq to merge to.5818*/5819bic->bfqq_data[last_bfqq_created->actuator_idx].stable_merge_bfqq =5820last_bfqq_created;5821}5822}58235824return bfqq;5825}582658275828static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,5829struct bio *bio, bool is_sync,5830struct bfq_io_cq *bic,5831bool respawn)5832{5833const int ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);5834const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);5835struct bfq_queue **async_bfqq = NULL;5836struct bfq_queue *bfqq;5837struct bfq_group *bfqg;58385839bfqg = bfq_bio_bfqg(bfqd, bio);5840if (!is_sync) {5841async_bfqq = bfq_async_queue_prio(bfqd, bfqg, ioprio_class,5842ioprio,5843bfq_actuator_index(bfqd, bio));5844bfqq = *async_bfqq;5845if (bfqq)5846goto out;5847}58485849bfqq = kmem_cache_alloc_node(bfq_pool, GFP_NOWAIT | __GFP_ZERO,5850bfqd->queue->node);58515852if (bfqq) {5853bfq_init_bfqq(bfqd, bfqq, bic, current->pid,5854is_sync, bfq_actuator_index(bfqd, bio));5855bfq_init_entity(&bfqq->entity, bfqg);5856bfq_log_bfqq(bfqd, bfqq, "allocated");5857} else {5858bfqq = &bfqd->oom_bfqq;5859bfq_log_bfqq(bfqd, bfqq, "using oom bfqq");5860goto out;5861}58625863/*5864* Pin the queue now that it's allocated, scheduler exit will5865* prune it.5866*/5867if (async_bfqq) {5868bfqq->ref++; /*5869* Extra group reference, w.r.t. sync5870* queue. This extra reference is removed5871* only if bfqq->bfqg disappears, to5872* guarantee that this queue is not freed5873* until its group goes away.5874*/5875bfq_log_bfqq(bfqd, bfqq, "get_queue, bfqq not in async: %p, %d",5876bfqq, bfqq->ref);5877*async_bfqq = bfqq;5878}58795880out:5881bfqq->ref++; /* get a process reference to this queue */58825883if (bfqq != &bfqd->oom_bfqq && is_sync && !respawn)5884bfqq = bfq_do_or_sched_stable_merge(bfqd, bfqq, bic);5885return bfqq;5886}58875888static void bfq_update_io_thinktime(struct bfq_data *bfqd,5889struct bfq_queue *bfqq)5890{5891struct bfq_ttime *ttime = &bfqq->ttime;5892u64 elapsed;58935894/*5895* We are really interested in how long it takes for the queue to5896* become busy when there is no outstanding IO for this queue. 
So5897* ignore cases when the bfq queue has already IO queued.5898*/5899if (bfqq->dispatched || bfq_bfqq_busy(bfqq))5900return;5901elapsed = blk_time_get_ns() - bfqq->ttime.last_end_request;5902elapsed = min_t(u64, elapsed, 2ULL * bfqd->bfq_slice_idle);59035904ttime->ttime_samples = (7*ttime->ttime_samples + 256) / 8;5905ttime->ttime_total = div_u64(7*ttime->ttime_total + 256*elapsed, 8);5906ttime->ttime_mean = div64_ul(ttime->ttime_total + 128,5907ttime->ttime_samples);5908}59095910static void5911bfq_update_io_seektime(struct bfq_data *bfqd, struct bfq_queue *bfqq,5912struct request *rq)5913{5914bfqq->seek_history <<= 1;5915bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq);59165917if (bfqq->wr_coeff > 1 &&5918bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&5919BFQQ_TOTALLY_SEEKY(bfqq)) {5920if (time_is_before_jiffies(bfqq->wr_start_at_switch_to_srt +5921bfq_wr_duration(bfqd))) {5922/*5923* In soft_rt weight raising with the5924* interactive-weight-raising period5925* elapsed (so no switch back to5926* interactive weight raising).5927*/5928bfq_bfqq_end_wr(bfqq);5929} else { /*5930* stopping soft_rt weight raising5931* while still in interactive period,5932* switch back to interactive weight5933* raising5934*/5935switch_back_to_interactive_wr(bfqq, bfqd);5936bfqq->entity.prio_changed = 1;5937}5938}5939}59405941static void bfq_update_has_short_ttime(struct bfq_data *bfqd,5942struct bfq_queue *bfqq,5943struct bfq_io_cq *bic)5944{5945bool has_short_ttime = true, state_changed;59465947/*5948* No need to update has_short_ttime if bfqq is async or in5949* idle io prio class, or if bfq_slice_idle is zero, because5950* no device idling is performed for bfqq in this case.5951*/5952if (!bfq_bfqq_sync(bfqq) || bfq_class_idle(bfqq) ||5953bfqd->bfq_slice_idle == 0)5954return;59555956/* Idle window just restored, statistics are meaningless. */5957if (time_is_after_eq_jiffies(bfqq->split_time +5958bfqd->bfq_wr_min_idle_time))5959return;59605961/* Think time is infinite if no process is linked to5962* bfqq. Otherwise check average think time to decide whether5963* to mark as has_short_ttime. To this goal, compare average5964* think time with half the I/O-plugging timeout.5965*/5966if (atomic_read(&bic->icq.ioc->active_ref) == 0 ||5967(bfq_sample_valid(bfqq->ttime.ttime_samples) &&5968bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle>>1))5969has_short_ttime = false;59705971state_changed = has_short_ttime != bfq_bfqq_has_short_ttime(bfqq);59725973if (has_short_ttime)5974bfq_mark_bfqq_has_short_ttime(bfqq);5975else5976bfq_clear_bfqq_has_short_ttime(bfqq);59775978/*5979* Until the base value for the total service time gets5980* finally computed for bfqq, the inject limit does depend on5981* the think-time state (short|long). In particular, the limit5982* is 0 or 1 if the think time is deemed, respectively, as5983* short or long (details in the comments in5984* bfq_update_inject_limit()). Accordingly, the next5985* instructions reset the inject limit if the think-time state5986* has changed and the above base value is still to be5987* computed.5988*5989* However, the reset is performed only if more than 100 ms5990* have elapsed since the last update of the inject limit, or5991* (inclusive) if the change is from short to long think5992* time. The reason for this waiting is as follows.5993*5994* bfqq may have a long think time because of a5995* synchronization with some other queue, i.e., because the5996* I/O of some other queue may need to be completed for bfqq5997* to receive new I/O. 
Details in the comments on the choice5998* of the queue for injection in bfq_select_queue().5999*6000* As stressed in those comments, if such a synchronization is6001* actually in place, then, without injection on bfqq, the6002* blocking I/O cannot happen to served while bfqq is in6003* service. As a consequence, if bfqq is granted6004* I/O-dispatch-plugging, then bfqq remains empty, and no I/O6005* is dispatched, until the idle timeout fires. This is likely6006* to result in lower bandwidth and higher latencies for bfqq,6007* and in a severe loss of total throughput.6008*6009* On the opposite end, a non-zero inject limit may allow the6010* I/O that blocks bfqq to be executed soon, and therefore6011* bfqq to receive new I/O soon.6012*6013* But, if the blocking gets actually eliminated, then the6014* next think-time sample for bfqq may be very low. This in6015* turn may cause bfqq's think time to be deemed6016* short. Without the 100 ms barrier, this new state change6017* would cause the body of the next if to be executed6018* immediately. But this would set to 0 the inject6019* limit. Without injection, the blocking I/O would cause the6020* think time of bfqq to become long again, and therefore the6021* inject limit to be raised again, and so on. The only effect6022* of such a steady oscillation between the two think-time6023* states would be to prevent effective injection on bfqq.6024*6025* In contrast, if the inject limit is not reset during such a6026* long time interval as 100 ms, then the number of short6027* think time samples can grow significantly before the reset6028* is performed. As a consequence, the think time state can6029* become stable before the reset. Therefore there will be no6030* state change when the 100 ms elapse, and no reset of the6031* inject limit. The inject limit remains steadily equal to 16032* both during and after the 100 ms. So injection can be6033* performed at all times, and throughput gets boosted.6034*6035* An inject limit equal to 1 is however in conflict, in6036* general, with the fact that the think time of bfqq is6037* short, because injection may be likely to delay bfqq's I/O6038* (as explained in the comments in6039* bfq_update_inject_limit()). But this does not happen in6040* this special case, because bfqq's low think time is due to6041* an effective handling of a synchronization, through6042* injection. In this special case, bfqq's I/O does not get6043* delayed by injection; on the contrary, bfqq's I/O is6044* brought forward, because it is not blocked for6045* milliseconds.6046*6047* In addition, serving the blocking I/O much sooner, and much6048* more frequently than once per I/O-plugging timeout, makes6049* it much quicker to detect a waker queue (the concept of6050* waker queue is defined in the comments in6051* bfq_add_request()). 
This makes it possible to start sooner6052* to boost throughput more effectively, by injecting the I/O6053* of the waker queue unconditionally on every6054* bfq_dispatch_request().6055*6056* One last, important benefit of not resetting the inject6057* limit before 100 ms is that, during this time interval, the6058* base value for the total service time is likely to get6059* finally computed for bfqq, freeing the inject limit from6060* its relation with the think time.6061*/6062if (state_changed && bfqq->last_serv_time_ns == 0 &&6063(time_is_before_eq_jiffies(bfqq->decrease_time_jif +6064msecs_to_jiffies(100)) ||6065!has_short_ttime))6066bfq_reset_inject_limit(bfqd, bfqq);6067}60686069/*6070* Called when a new fs request (rq) is added to bfqq. Check if there's6071* something we should do about it.6072*/6073static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,6074struct request *rq)6075{6076if (rq->cmd_flags & REQ_META)6077bfqq->meta_pending++;60786079bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);60806081if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {6082bool small_req = bfqq->queued[rq_is_sync(rq)] == 1 &&6083blk_rq_sectors(rq) < 32;6084bool budget_timeout = bfq_bfqq_budget_timeout(bfqq);60856086/*6087* There is just this request queued: if6088* - the request is small, and6089* - we are idling to boost throughput, and6090* - the queue is not to be expired,6091* then just exit.6092*6093* In this way, if the device is being idled to wait6094* for a new request from the in-service queue, we6095* avoid unplugging the device and committing the6096* device to serve just a small request. In contrast6097* we wait for the block layer to decide when to6098* unplug the device: hopefully, new requests will be6099* merged to this one quickly, then the device will be6100* unplugged and larger requests will be dispatched.6101*/6102if (small_req && idling_boosts_thr_without_issues(bfqd, bfqq) &&6103!budget_timeout)6104return;61056106/*6107* A large enough request arrived, or idling is being6108* performed to preserve service guarantees, or6109* finally the queue is to be expired: in all these6110* cases disk idling is to be stopped, so clear6111* wait_request flag and reset timer.6112*/6113bfq_clear_bfqq_wait_request(bfqq);6114hrtimer_try_to_cancel(&bfqd->idle_slice_timer);61156116/*6117* The queue is not empty, because a new request just6118* arrived. 
Hence we can safely expire the queue, in6119* case of budget timeout, without risking that the6120* timestamps of the queue are not updated correctly.6121* See [1] for more details.6122*/6123if (budget_timeout)6124bfq_bfqq_expire(bfqd, bfqq, false,6125BFQQE_BUDGET_TIMEOUT);6126}6127}61286129static void bfqq_request_allocated(struct bfq_queue *bfqq)6130{6131struct bfq_entity *entity = &bfqq->entity;61326133for_each_entity(entity)6134entity->allocated++;6135}61366137static void bfqq_request_freed(struct bfq_queue *bfqq)6138{6139struct bfq_entity *entity = &bfqq->entity;61406141for_each_entity(entity)6142entity->allocated--;6143}61446145/* returns true if it causes the idle timer to be disabled */6146static bool __bfq_insert_request(struct bfq_data *bfqd, struct request *rq)6147{6148struct bfq_queue *bfqq = RQ_BFQQ(rq),6149*new_bfqq = bfq_setup_cooperator(bfqd, bfqq, rq, true,6150RQ_BIC(rq));6151bool waiting, idle_timer_disabled = false;61526153if (new_bfqq) {6154struct bfq_queue *old_bfqq = bfqq;6155/*6156* Release the request's reference to the old bfqq6157* and make sure one is taken to the shared queue.6158*/6159bfqq_request_allocated(new_bfqq);6160bfqq_request_freed(bfqq);6161new_bfqq->ref++;6162/*6163* If the bic associated with the process6164* issuing this request still points to bfqq6165* (and thus has not been already redirected6166* to new_bfqq or even some other bfq_queue),6167* then complete the merge and redirect it to6168* new_bfqq.6169*/6170if (bic_to_bfqq(RQ_BIC(rq), true,6171bfq_actuator_index(bfqd, rq->bio)) == bfqq) {6172while (bfqq != new_bfqq)6173bfqq = bfq_merge_bfqqs(bfqd, RQ_BIC(rq), bfqq);6174}61756176bfq_clear_bfqq_just_created(old_bfqq);6177/*6178* rq is about to be enqueued into new_bfqq,6179* release rq reference on bfqq6180*/6181bfq_put_queue(old_bfqq);6182rq->elv.priv[1] = new_bfqq;6183}61846185bfq_update_io_thinktime(bfqd, bfqq);6186bfq_update_has_short_ttime(bfqd, bfqq, RQ_BIC(rq));6187bfq_update_io_seektime(bfqd, bfqq, rq);61886189waiting = bfqq && bfq_bfqq_wait_request(bfqq);6190bfq_add_request(rq);6191idle_timer_disabled = waiting && !bfq_bfqq_wait_request(bfqq);61926193rq->fifo_time = blk_time_get_ns() + bfqd->bfq_fifo_expire[rq_is_sync(rq)];6194list_add_tail(&rq->queuelist, &bfqq->fifo);61956196bfq_rq_enqueued(bfqd, bfqq, rq);61976198return idle_timer_disabled;6199}62006201#ifdef CONFIG_BFQ_CGROUP_DEBUG6202static void bfq_update_insert_stats(struct request_queue *q,6203struct bfq_queue *bfqq,6204bool idle_timer_disabled,6205blk_opf_t cmd_flags)6206{6207if (!bfqq)6208return;62096210/*6211* bfqq still exists, because it can disappear only after6212* either it is merged with another queue, or the process it6213* is associated with exits. 
But both actions must be taken by6214* the same process currently executing this flow of6215* instructions.6216*6217* In addition, the following queue lock guarantees that6218* bfqq_group(bfqq) exists as well.6219*/6220spin_lock_irq(&q->queue_lock);6221bfqg_stats_update_io_add(bfqq_group(bfqq), bfqq, cmd_flags);6222if (idle_timer_disabled)6223bfqg_stats_update_idle_time(bfqq_group(bfqq));6224spin_unlock_irq(&q->queue_lock);6225}6226#else6227static inline void bfq_update_insert_stats(struct request_queue *q,6228struct bfq_queue *bfqq,6229bool idle_timer_disabled,6230blk_opf_t cmd_flags) {}6231#endif /* CONFIG_BFQ_CGROUP_DEBUG */62326233static struct bfq_queue *bfq_init_rq(struct request *rq);62346235static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,6236blk_insert_t flags)6237{6238struct request_queue *q = hctx->queue;6239struct bfq_data *bfqd = q->elevator->elevator_data;6240struct bfq_queue *bfqq;6241bool idle_timer_disabled = false;6242blk_opf_t cmd_flags;6243LIST_HEAD(free);62446245#ifdef CONFIG_BFQ_GROUP_IOSCHED6246if (!cgroup_subsys_on_dfl(io_cgrp_subsys) && rq->bio)6247bfqg_stats_update_legacy_io(q, rq);6248#endif6249spin_lock_irq(&bfqd->lock);6250bfqq = bfq_init_rq(rq);6251if (blk_mq_sched_try_insert_merge(q, rq, &free)) {6252spin_unlock_irq(&bfqd->lock);6253blk_mq_free_requests(&free);6254return;6255}62566257trace_block_rq_insert(rq);62586259if (flags & BLK_MQ_INSERT_AT_HEAD) {6260list_add(&rq->queuelist, &bfqd->dispatch);6261} else if (!bfqq) {6262list_add_tail(&rq->queuelist, &bfqd->dispatch);6263} else {6264idle_timer_disabled = __bfq_insert_request(bfqd, rq);6265/*6266* Update bfqq, because, if a queue merge has occurred6267* in __bfq_insert_request, then rq has been6268* redirected into a new queue.6269*/6270bfqq = RQ_BFQQ(rq);62716272if (rq_mergeable(rq)) {6273elv_rqhash_add(q, rq);6274if (!q->last_merge)6275q->last_merge = rq;6276}6277}62786279/*6280* Cache cmd_flags before releasing scheduler lock, because rq6281* may disappear afterwards (for example, because of a request6282* merge).6283*/6284cmd_flags = rq->cmd_flags;6285spin_unlock_irq(&bfqd->lock);62866287bfq_update_insert_stats(q, bfqq, idle_timer_disabled,6288cmd_flags);6289}62906291static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,6292struct list_head *list,6293blk_insert_t flags)6294{6295while (!list_empty(list)) {6296struct request *rq;62976298rq = list_first_entry(list, struct request, queuelist);6299list_del_init(&rq->queuelist);6300bfq_insert_request(hctx, rq, flags);6301}6302}63036304static void bfq_update_hw_tag(struct bfq_data *bfqd)6305{6306struct bfq_queue *bfqq = bfqd->in_service_queue;63076308bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver,6309bfqd->tot_rq_in_driver);63106311if (bfqd->hw_tag == 1)6312return;63136314/*6315* This sample is valid if the number of outstanding requests6316* is large enough to allow a queueing behavior. Note that the6317* sum is not exact, as it's not taking into account deactivated6318* requests.6319*/6320if (bfqd->tot_rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)6321return;63226323/*6324* If active queue hasn't enough requests and can idle, bfq might not6325* dispatch sufficient requests to hardware. 
Don't zero hw_tag in this6326* case6327*/6328if (bfqq && bfq_bfqq_has_short_ttime(bfqq) &&6329bfqq->dispatched + bfqq->queued[0] + bfqq->queued[1] <6330BFQ_HW_QUEUE_THRESHOLD &&6331bfqd->tot_rq_in_driver < BFQ_HW_QUEUE_THRESHOLD)6332return;63336334if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)6335return;63366337bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;6338bfqd->max_rq_in_driver = 0;6339bfqd->hw_tag_samples = 0;63406341bfqd->nonrot_with_queueing =6342blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;6343}63446345static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)6346{6347u64 now_ns;6348u32 delta_us;63496350bfq_update_hw_tag(bfqd);63516352bfqd->rq_in_driver[bfqq->actuator_idx]--;6353bfqd->tot_rq_in_driver--;6354bfqq->dispatched--;63556356if (!bfqq->dispatched && !bfq_bfqq_busy(bfqq)) {6357/*6358* Set budget_timeout (which we overload to store the6359* time at which the queue remains with no backlog and6360* no outstanding request; used by the weight-raising6361* mechanism).6362*/6363bfqq->budget_timeout = jiffies;63646365bfq_del_bfqq_in_groups_with_pending_reqs(bfqq);6366bfq_weights_tree_remove(bfqq);6367}63686369now_ns = blk_time_get_ns();63706371bfqq->ttime.last_end_request = now_ns;63726373/*6374* Using us instead of ns, to get a reasonable precision in6375* computing rate in next check.6376*/6377delta_us = div_u64(now_ns - bfqd->last_completion, NSEC_PER_USEC);63786379/*6380* If the request took rather long to complete, and, according6381* to the maximum request size recorded, this completion latency6382* implies that the request was certainly served at a very low6383* rate (less than 1M sectors/sec), then the whole observation6384* interval that lasts up to this time instant cannot be a6385* valid time interval for computing a new peak rate. Invoke6386* bfq_update_rate_reset to have the following three steps6387* taken:6388* - close the observation interval at the last (previous)6389* request dispatch or completion6390* - compute rate, if possible, for that observation interval6391* - reset to zero samples, which will trigger a proper6392* re-initialization of the observation interval on next6393* dispatch6394*/6395if (delta_us > BFQ_MIN_TT/NSEC_PER_USEC &&6396(bfqd->last_rq_max_size<<BFQ_RATE_SHIFT)/delta_us <63971UL<<(BFQ_RATE_SHIFT - 10))6398bfq_update_rate_reset(bfqd, NULL);6399bfqd->last_completion = now_ns;6400/*6401* Shared queues are likely to receive I/O at a high6402* rate. This may deceptively let them be considered as wakers6403* of other queues. But a false waker will unjustly steal6404* bandwidth to its supposedly woken queue. So considering6405* also shared queues in the waking mechanism may cause more6406* control troubles than throughput benefits. Then reset6407* last_completed_rq_bfqq if bfqq is a shared queue.6408*/6409if (!bfq_bfqq_coop(bfqq))6410bfqd->last_completed_rq_bfqq = bfqq;6411else6412bfqd->last_completed_rq_bfqq = NULL;64136414/*6415* If we are waiting to discover whether the request pattern6416* of the task associated with the queue is actually6417* isochronous, and both requisites for this condition to hold6418* are now satisfied, then compute soft_rt_next_start (see the6419* comments on the function bfq_bfqq_softrt_next_start()). We6420* do not compute soft_rt_next_start if bfqq is in interactive6421* weight raising (see the comments in bfq_bfqq_expire() for6422* an explanation). 
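/*
 * Illustrative sketch (with made-up constants, not the kernel's): the
 * hw_tag detection in bfq_update_hw_tag() above boils down to a simple
 * sampling scheme - track the maximum number of requests observed in the
 * drive over a fixed number of valid samples, and declare the drive
 * queueing-capable once that maximum exceeds a small threshold. A
 * self-contained model of the same idea:
 *
 *	#define DEMO_QUEUE_THRESHOLD	3
 *	#define DEMO_QUEUE_SAMPLES	32
 *
 *	struct demo_hw_tag_state {
 *		int samples;
 *		int max_in_driver;
 *		int hw_tag;	/* -1 = unknown, 0 = no queueing, 1 = queueing */
 *	};
 *
 *	static void demo_update_hw_tag(struct demo_hw_tag_state *s,
 *				       int rq_in_driver)
 *	{
 *		if (s->hw_tag == 1)
 *			return;
 *		if (rq_in_driver > s->max_in_driver)
 *			s->max_in_driver = rq_in_driver;
 *		if (++s->samples < DEMO_QUEUE_SAMPLES)
 *			return;
 *		s->hw_tag = s->max_in_driver > DEMO_QUEUE_THRESHOLD;
 *		s->samples = 0;
 *		s->max_in_driver = 0;
 *	}
 *
 * The real function additionally discards samples taken while too few
 * requests are outstanding, so that an idling queue does not make the
 * drive look incapable of internal queueing.
 */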
We schedule this delayed update when bfqq6423* expires, if it still has in-flight requests.6424*/6425if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&6426RB_EMPTY_ROOT(&bfqq->sort_list) &&6427bfqq->wr_coeff != bfqd->bfq_wr_coeff)6428bfqq->soft_rt_next_start =6429bfq_bfqq_softrt_next_start(bfqd, bfqq);64306431/*6432* If this is the in-service queue, check if it needs to be expired,6433* or if we want to idle in case it has no pending requests.6434*/6435if (bfqd->in_service_queue == bfqq) {6436if (bfq_bfqq_must_idle(bfqq)) {6437if (bfqq->dispatched == 0)6438bfq_arm_slice_timer(bfqd);6439/*6440* If we get here, we do not expire bfqq, even6441* if bfqq was in budget timeout or had no6442* more requests (as controlled in the next6443* conditional instructions). The reason for6444* not expiring bfqq is as follows.6445*6446* Here bfqq->dispatched > 0 holds, but6447* bfq_bfqq_must_idle() returned true. This6448* implies that, even if no request arrives6449* for bfqq before bfqq->dispatched reaches 0,6450* bfqq will, however, not be expired on the6451* completion event that causes bfqq->dispatch6452* to reach zero. In contrast, on this event,6453* bfqq will start enjoying device idling6454* (I/O-dispatch plugging).6455*6456* But, if we expired bfqq here, bfqq would6457* not have the chance to enjoy device idling6458* when bfqq->dispatched finally reaches6459* zero. This would expose bfqq to violation6460* of its reserved service guarantees.6461*/6462return;6463} else if (bfq_may_expire_for_budg_timeout(bfqq))6464bfq_bfqq_expire(bfqd, bfqq, false,6465BFQQE_BUDGET_TIMEOUT);6466else if (RB_EMPTY_ROOT(&bfqq->sort_list) &&6467(bfqq->dispatched == 0 ||6468!bfq_better_to_idle(bfqq)))6469bfq_bfqq_expire(bfqd, bfqq, false,6470BFQQE_NO_MORE_REQUESTS);6471}64726473if (!bfqd->tot_rq_in_driver)6474bfq_schedule_dispatch(bfqd);6475}64766477/*6478* The processes associated with bfqq may happen to generate their6479* cumulative I/O at a lower rate than the rate at which the device6480* could serve the same I/O. This is rather probable, e.g., if only6481* one process is associated with bfqq and the device is an SSD. It6482* results in bfqq becoming often empty while in service. In this6483* respect, if BFQ is allowed to switch to another queue when bfqq6484* remains empty, then the device goes on being fed with I/O requests,6485* and the throughput is not affected. In contrast, if BFQ is not6486* allowed to switch to another queue---because bfqq is sync and6487* I/O-dispatch needs to be plugged while bfqq is temporarily6488* empty---then, during the service of bfqq, there will be frequent6489* "service holes", i.e., time intervals during which bfqq gets empty6490* and the device can only consume the I/O already queued in its6491* hardware queues. During service holes, the device may even get to6492* remaining idle. In the end, during the service of bfqq, the device6493* is driven at a lower speed than the one it can reach with the kind6494* of I/O flowing through bfqq.6495*6496* To counter this loss of throughput, BFQ implements a "request6497* injection mechanism", which tries to fill the above service holes6498* with I/O requests taken from other queues. The hard part in this6499* mechanism is finding the right amount of I/O to inject, so as to6500* both boost throughput and not break bfqq's bandwidth and latency6501* guarantees. In this respect, the mechanism maintains a per-queue6502* inject limit, computed as below. 
While bfqq is empty, the injection6503* mechanism dispatches extra I/O requests only until the total number6504* of I/O requests in flight---i.e., already dispatched but not yet6505* completed---remains lower than this limit.6506*6507* A first definition comes in handy to introduce the algorithm by6508* which the inject limit is computed. We define as first request for6509* bfqq, an I/O request for bfqq that arrives while bfqq is in6510* service, and causes bfqq to switch from empty to non-empty. The6511* algorithm updates the limit as a function of the effect of6512* injection on the service times of only the first requests of6513* bfqq. The reason for this restriction is that these are the6514* requests whose service time is affected most, because they are the6515* first to arrive after injection possibly occurred.6516*6517* To evaluate the effect of injection, the algorithm measures the6518* "total service time" of first requests. We define as total service6519* time of an I/O request, the time that elapses since when the6520* request is enqueued into bfqq, to when it is completed. This6521* quantity allows the whole effect of injection to be measured. It is6522* easy to see why. Suppose that some requests of other queues are6523* actually injected while bfqq is empty, and that a new request R6524* then arrives for bfqq. If the device does start to serve all or6525* part of the injected requests during the service hole, then,6526* because of this extra service, it may delay the next invocation of6527* the dispatch hook of BFQ. Then, even after R gets eventually6528* dispatched, the device may delay the actual service of R if it is6529* still busy serving the extra requests, or if it decides to serve,6530* before R, some extra request still present in its queues. As a6531* conclusion, the cumulative extra delay caused by injection can be6532* easily evaluated by just comparing the total service time of first6533* requests with and without injection.6534*6535* The limit-update algorithm works as follows. On the arrival of a6536* first request of bfqq, the algorithm measures the total time of the6537* request only if one of the three cases below holds, and, for each6538* case, it updates the limit as described below:6539*6540* (1) If there is no in-flight request. This gives a baseline for the6541* total service time of the requests of bfqq. If the baseline has6542* not been computed yet, then, after computing it, the limit is6543* set to 1, to start boosting throughput, and to prepare the6544* ground for the next case. If the baseline has already been6545* computed, then it is updated, in case it results to be lower6546* than the previous value.6547*6548* (2) If the limit is higher than 0 and there are in-flight6549* requests. By comparing the total service time in this case with6550* the above baseline, it is possible to know at which extent the6551* current value of the limit is inflating the total service6552* time. If the inflation is below a certain threshold, then bfqq6553* is assumed to be suffering from no perceivable loss of its6554* service guarantees, and the limit is even tentatively6555* increased. If the inflation is above the threshold, then the6556* limit is decreased. Due to the lack of any hysteresis, this6557* logic makes the limit oscillate even in steady workload6558* conditions. Yet we opted for it, because it is fast in reaching6559* the best value for the limit, as a function of the current I/O6560* workload. 
To reduce oscillations, this step is disabled for a6561* short time interval after the limit happens to be decreased.6562*6563* (3) Periodically, after resetting the limit, to make sure that the6564* limit eventually drops in case the workload changes. This is6565* needed because, after the limit has gone safely up for a6566* certain workload, it is impossible to guess whether the6567* baseline total service time may have changed, without measuring6568* it again without injection. A more effective version of this6569* step might be to just sample the baseline, by interrupting6570* injection only once, and then to reset/lower the limit only if6571* the total service time with the current limit does happen to be6572* too large.6573*6574* More details on each step are provided in the comments on the6575* pieces of code that implement these steps: the branch handling the6576* transition from empty to non empty in bfq_add_request(), the branch6577* handling injection in bfq_select_queue(), and the function6578* bfq_choose_bfqq_for_injection(). These comments also explain some6579* exceptions, made by the injection mechanism in some special cases.6580*/6581static void bfq_update_inject_limit(struct bfq_data *bfqd,6582struct bfq_queue *bfqq)6583{6584u64 tot_time_ns = blk_time_get_ns() - bfqd->last_empty_occupied_ns;6585unsigned int old_limit = bfqq->inject_limit;65866587if (bfqq->last_serv_time_ns > 0 && bfqd->rqs_injected) {6588u64 threshold = (bfqq->last_serv_time_ns * 3)>>1;65896590if (tot_time_ns >= threshold && old_limit > 0) {6591bfqq->inject_limit--;6592bfqq->decrease_time_jif = jiffies;6593} else if (tot_time_ns < threshold &&6594old_limit <= bfqd->max_rq_in_driver)6595bfqq->inject_limit++;6596}65976598/*6599* Either we still have to compute the base value for the6600* total service time, and there seem to be the right6601* conditions to do it, or we can lower the last base value6602* computed.6603*6604* NOTE: (bfqd->tot_rq_in_driver == 1) means that there is no I/O6605* request in flight, because this function is in the code6606* path that handles the completion of a request of bfqq, and,6607* in particular, this function is executed before6608* bfqd->tot_rq_in_driver is decremented in such a code path.6609*/6610if ((bfqq->last_serv_time_ns == 0 && bfqd->tot_rq_in_driver == 1) ||6611tot_time_ns < bfqq->last_serv_time_ns) {6612if (bfqq->last_serv_time_ns == 0) {6613/*6614* Now we certainly have a base value: make sure we6615* start trying injection.6616*/6617bfqq->inject_limit = max_t(unsigned int, 1, old_limit);6618}6619bfqq->last_serv_time_ns = tot_time_ns;6620} else if (!bfqd->rqs_injected && bfqd->tot_rq_in_driver == 1)6621/*6622* No I/O injected and no request still in service in6623* the drive: these are the exact conditions for6624* computing the base value of the total service time6625* for bfqq. So let's update this value, because it is6626* rather variable. For example, it varies if the size6627* or the spatial locality of the I/O requests in bfqq6628* change.6629*/6630bfqq->last_serv_time_ns = tot_time_ns;663166326633/* update complete, not waiting for any request completion any longer */6634bfqd->waited_rq = NULL;6635bfqd->rqs_injected = false;6636}66376638/*6639* Handle either a requeue or a finish for rq. The things to do are6640* the same in both cases: all references to rq are to be dropped. 
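/*
 * Worked example for bfq_update_inject_limit() above (illustrative only,
 * all numbers made up): with a baseline total service time of 1 ms for
 * first requests, the tolerance threshold is 1.5 ms (3/2 of the
 * baseline). A first request that takes 2 ms while injection is active
 * exceeds the threshold, so the limit is lowered; one that takes 1.2 ms
 * stays within it, so the limit may be raised, as long as it has not
 * already exceeded the maximum number of requests observed in the drive.
 * A self-contained sketch of just this update rule:
 *
 *	static unsigned int demo_update_limit(unsigned int limit,
 *					      unsigned long long baseline_ns,
 *					      unsigned long long measured_ns,
 *					      unsigned int max_rq_in_driver)
 *	{
 *		unsigned long long threshold = (baseline_ns * 3) >> 1;
 *
 *		if (measured_ns >= threshold && limit > 0)
 *			return limit - 1;
 *		if (measured_ns < threshold && limit <= max_rq_in_driver)
 *			return limit + 1;
 *		return limit;
 *	}
 *
 * For instance, demo_update_limit(2, 1000000, 2000000, 4) yields 1,
 * while demo_update_limit(2, 1000000, 1200000, 4) yields 3. The real
 * function also computes and refreshes the baseline itself, and records
 * the time of each decrease in decrease_time_jif so that updates can be
 * paused for a short while afterwards (see the big comment above).
 */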
In6641* particular, rq is considered completed from the point of view of6642* the scheduler.6643*/6644static void bfq_finish_requeue_request(struct request *rq)6645{6646struct bfq_queue *bfqq = RQ_BFQQ(rq);6647struct bfq_data *bfqd;6648unsigned long flags;66496650/*6651* rq either is not associated with any icq, or is an already6652* requeued request that has not (yet) been re-inserted into6653* a bfq_queue.6654*/6655if (!rq->elv.icq || !bfqq)6656return;66576658bfqd = bfqq->bfqd;66596660if (rq->rq_flags & RQF_STARTED)6661bfqg_stats_update_completion(bfqq_group(bfqq),6662rq->start_time_ns,6663rq->io_start_time_ns,6664rq->cmd_flags);66656666spin_lock_irqsave(&bfqd->lock, flags);6667if (likely(rq->rq_flags & RQF_STARTED)) {6668if (rq == bfqd->waited_rq)6669bfq_update_inject_limit(bfqd, bfqq);66706671bfq_completed_request(bfqq, bfqd);6672}6673bfqq_request_freed(bfqq);6674bfq_put_queue(bfqq);6675RQ_BIC(rq)->requests--;6676spin_unlock_irqrestore(&bfqd->lock, flags);66776678/*6679* Reset private fields. In case of a requeue, this allows6680* this function to correctly do nothing if it is spuriously6681* invoked again on this same request (see the check at the6682* beginning of the function). Probably, a better general6683* design would be to prevent blk-mq from invoking the requeue6684* or finish hooks of an elevator, for a request that is not6685* referred by that elevator.6686*6687* Resetting the following fields would break the6688* request-insertion logic if rq is re-inserted into a bfq6689* internal queue, without a re-preparation. Here we assume6690* that re-insertions of requeued requests, without6691* re-preparation, can happen only for pass_through or at_head6692* requests (which are not re-inserted into bfq internal6693* queues).6694*/6695rq->elv.priv[0] = NULL;6696rq->elv.priv[1] = NULL;6697}66986699static void bfq_finish_request(struct request *rq)6700{6701bfq_finish_requeue_request(rq);67026703if (rq->elv.icq) {6704put_io_context(rq->elv.icq->ioc);6705rq->elv.icq = NULL;6706}6707}67086709/*6710* Removes the association between the current task and bfqq, assuming6711* that bic points to the bfq iocontext of the task.6712* Returns NULL if a new bfqq should be allocated, or the old bfqq if this6713* was the last process referring to that bfqq.6714*/6715static struct bfq_queue *6716bfq_split_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq)6717{6718bfq_log_bfqq(bfqq->bfqd, bfqq, "splitting queue");67196720if (bfqq_process_refs(bfqq) == 1 && !bfqq->new_bfqq) {6721bfqq->pid = current->pid;6722bfq_clear_bfqq_coop(bfqq);6723bfq_clear_bfqq_split_coop(bfqq);6724return bfqq;6725}67266727bic_set_bfqq(bic, NULL, true, bfqq->actuator_idx);67286729bfq_put_cooperator(bfqq);67306731bfq_release_process_ref(bfqq->bfqd, bfqq);6732return NULL;6733}67346735static struct bfq_queue *6736__bfq_get_bfqq_handle_split(struct bfq_data *bfqd, struct bfq_io_cq *bic,6737struct bio *bio, bool split, bool is_sync,6738bool *new_queue)6739{6740unsigned int act_idx = bfq_actuator_index(bfqd, bio);6741struct bfq_queue *bfqq = bic_to_bfqq(bic, is_sync, act_idx);6742struct bfq_iocq_bfqq_data *bfqq_data = &bic->bfqq_data[act_idx];67436744if (likely(bfqq && bfqq != &bfqd->oom_bfqq))6745return bfqq;67466747if (new_queue)6748*new_queue = true;67496750if (bfqq)6751bfq_put_queue(bfqq);6752bfqq = bfq_get_queue(bfqd, bio, is_sync, bic, split);67536754bic_set_bfqq(bic, bfqq, is_sync, act_idx);6755if (split && is_sync) {6756if ((bfqq_data->was_in_burst_list && bfqd->large_burst) 
		    || bfqq_data->saved_in_large_burst)
			bfq_mark_bfqq_in_large_burst(bfqq);
		else {
			bfq_clear_bfqq_in_large_burst(bfqq);
			if (bfqq_data->was_in_burst_list)
				/*
				 * If bfqq was in the current burst list
				 * before being merged, then we have to
				 * add it back. And we do not need to
				 * increase burst_size, as we did not
				 * decrement burst_size when we removed
				 * bfqq from the burst list as a
				 * consequence of a merge (see comments
				 * in bfq_put_queue). In this respect, it
				 * would be rather costly to know whether
				 * the current burst list is still the
				 * same burst list from which bfqq was
				 * removed on the merge. To avoid this
				 * cost, if bfqq was in a burst list,
				 * then we add bfqq to the current burst
				 * list without any further check. This
				 * can cause inappropriate insertions,
				 * but rarely enough to not harm the
				 * detection of large bursts
				 * significantly.
				 */
				hlist_add_head(&bfqq->burst_list_node,
					       &bfqd->burst_list);
		}
		bfqq->split_time = jiffies;
	}

	return bfqq;
}

/*
 * Only reset private fields. The actual request preparation will be
 * performed by bfq_init_rq, when rq is either inserted or merged. See
 * comments on bfq_init_rq for the reason behind this delayed
 * preparation.
 */
static void bfq_prepare_request(struct request *rq)
{
	rq->elv.icq = ioc_find_get_icq(rq->q);

	/*
	 * Regardless of whether we have an icq attached, we have to
	 * clear the scheduler pointers, as they might point to
	 * previously allocated bic/bfqq structs.
	 */
	rq->elv.priv[0] = rq->elv.priv[1] = NULL;
}

static struct bfq_queue *bfq_waker_bfqq(struct bfq_queue *bfqq)
{
	struct bfq_queue *new_bfqq = bfqq->new_bfqq;
	struct bfq_queue *waker_bfqq = bfqq->waker_bfqq;

	if (!waker_bfqq)
		return NULL;

	while (new_bfqq) {
		if (new_bfqq == waker_bfqq) {
			/*
			 * If waker_bfqq is in the merge chain, and current
			 * is the only process, waker_bfqq can be freed.
			 */
			if (bfqq_process_refs(waker_bfqq) == 1)
				return NULL;

			return waker_bfqq;
		}

		new_bfqq = new_bfqq->new_bfqq;
	}

	/*
	 * If waker_bfqq is not in the merge chain, and its process
	 * reference is 0, waker_bfqq can be freed.
	 */
	if (bfqq_process_refs(waker_bfqq) == 0)
		return NULL;

	return waker_bfqq;
}

static struct bfq_queue *bfq_get_bfqq_handle_split(struct bfq_data *bfqd,
						   struct bfq_io_cq *bic,
						   struct bio *bio,
						   unsigned int idx,
						   bool is_sync)
{
	struct bfq_queue *waker_bfqq;
	struct bfq_queue *bfqq;
	bool new_queue = false;

	bfqq = __bfq_get_bfqq_handle_split(bfqd, bic, bio, false, is_sync,
					   &new_queue);
	if (unlikely(new_queue))
		return bfqq;

	/* If the queue was seeky for too long, break it apart. */
	if (!bfq_bfqq_coop(bfqq) || !bfq_bfqq_split_coop(bfqq) ||
	    bic->bfqq_data[idx].stably_merged)
		return bfqq;

	waker_bfqq = bfq_waker_bfqq(bfqq);

	/* Update bic before losing reference to bfqq */
	if (bfq_bfqq_in_large_burst(bfqq))
		bic->bfqq_data[idx].saved_in_large_burst = true;

	bfqq = bfq_split_bfqq(bic, bfqq);
	if (bfqq) {
		bfq_bfqq_resume_state(bfqq, bfqd, bic, true);
		return bfqq;
	}

	bfqq = __bfq_get_bfqq_handle_split(bfqd, bic, bio, true, is_sync, NULL);
	if (unlikely(bfqq == &bfqd->oom_bfqq))
		return bfqq;

	bfq_bfqq_resume_state(bfqq, bfqd, bic, false);
	bfqq->waker_bfqq = waker_bfqq;
	bfqq->tentative_waker_bfqq = NULL;

	/*
	 * If the waker queue disappears, then new_bfqq->waker_bfqq must be
	 * reset. So insert new_bfqq into the woken_list of the waker. See
	 * bfq_check_waker for details.
	 */
	if (waker_bfqq)
		hlist_add_head(&bfqq->woken_list_node,
			       &bfqq->waker_bfqq->woken_list);

	return bfqq;
}

/*
 * If needed, init rq, allocate bfq data structures associated with
 * rq, and increment reference counters in the destination bfq_queue
 * for rq. Return the destination bfq_queue for rq, or NULL if rq is
 * not associated with any bfq_queue.
 *
 * This function is invoked by the functions that perform rq insertion
 * or merging. One may have expected the above preparation operations
 * to be performed in bfq_prepare_request, and not delayed to when rq
 * is inserted or merged. The rationale behind this delayed
 * preparation is that, after the prepare_request hook is invoked for
 * rq, rq may still be transformed into a request with no icq, i.e., a
 * request not associated with any queue. No bfq hook is invoked to
 * signal this transformation. As a consequence, should these
 * preparation operations be performed when the prepare_request hook
 * is invoked, and should rq be transformed one moment later, bfq
 * would end up in an inconsistent state, because it would have
 * incremented some queue counters for an rq destined to
 * transformation, without any chance to correctly lower these
 * counters back. In contrast, no transformation can happen any longer
 * for rq after rq has been inserted or merged. So, it is safe to
 * execute these preparation operations when rq is finally inserted
 * or merged.
 */
static struct bfq_queue *bfq_init_rq(struct request *rq)
{
	struct request_queue *q = rq->q;
	struct bio *bio = rq->bio;
	struct bfq_data *bfqd = q->elevator->elevator_data;
	struct bfq_io_cq *bic;
	const int is_sync = rq_is_sync(rq);
	struct bfq_queue *bfqq;
	unsigned int a_idx = bfq_actuator_index(bfqd, bio);

	if (unlikely(!rq->elv.icq))
		return NULL;

	/*
	 * Assuming that RQ_BFQQ(rq) is set only if everything is set for
	 * this rq. This holds true, because this function is invoked only
	 * for insertion or merging, and, after such events, a request
	 * cannot be manipulated any longer before being removed from bfq.
	 */
	if (RQ_BFQQ(rq))
		return RQ_BFQQ(rq);

	bic = icq_to_bic(rq->elv.icq);
	bfq_check_ioprio_change(bic, bio);
	bfq_bic_update_cgroup(bic, bio);
	bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio, a_idx, is_sync);

	bfqq_request_allocated(bfqq);
	bfqq->ref++;
	bic->requests++;
	bfq_log_bfqq(bfqd, bfqq, "get_request %p: bfqq %p, %d",
		     rq, bfqq, bfqq->ref);

	rq->elv.priv[0] = bic;
	rq->elv.priv[1] = bfqq;

	/*
	 * If a bfq_queue has only one process reference, it is owned
	 * by only this bic: we can then set bfqq->bic = bic. In
	 * addition, if the queue has also just been split, we have to
	 * resume its state.
	 */
	if (likely(bfqq != &bfqd->oom_bfqq) && !bfqq->new_bfqq &&
	    bfqq_process_refs(bfqq) == 1)
		bfqq->bic = bic;

	/*
	 * Consider bfqq as possibly belonging to a burst of newly
	 * created queues only if:
	 * 1) A burst is actually happening (bfqd->burst_size > 0)
	 * or
	 * 2) There is no other active queue. In fact, if, in
	 *    contrast, there are active queues not belonging to the
	 *    possible burst bfqq may belong to, then there is no gain
	 *    in considering bfqq as belonging to a burst, and
	 *    therefore in not weight-raising bfqq. See comments on
	 *    bfq_handle_burst().
	 *
	 * This filtering also helps eliminate false positives, which
	 * occur when bfqq does not belong to an actual large burst,
	 * but some background task (e.g., a service) happens to
	 * trigger the creation of new queues very close to when bfqq
	 * and its possible companion queues are created. See comments
	 * on bfq_handle_burst() for further details also on this
	 * issue.
	 */
	if (unlikely(bfq_bfqq_just_created(bfqq) &&
		     (bfqd->burst_size > 0 ||
		      bfq_tot_busy_queues(bfqd) == 0)))
		bfq_handle_burst(bfqd, bfqq);

	return bfqq;
}

static void
bfq_idle_slice_timer_body(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
	enum bfqq_expiration reason;
	unsigned long flags;

	spin_lock_irqsave(&bfqd->lock, flags);

	/*
	 * bfqq may be in a race with expiration, so first check whether it
	 * is still the in-service queue before doing anything on it. If it
	 * is not, it has already been expired through __bfq_bfqq_expire()
	 * and its wait_request flag has been cleared in
	 * __bfq_bfqd_reset_in_service().
	 */
	if (bfqq != bfqd->in_service_queue) {
		spin_unlock_irqrestore(&bfqd->lock, flags);
		return;
	}

	bfq_clear_bfqq_wait_request(bfqq);

	if (bfq_bfqq_budget_timeout(bfqq))
		/*
		 * Also here the queue can be safely expired
		 * for budget timeout without wasting
		 * guarantees
		 */
		reason = BFQQE_BUDGET_TIMEOUT;
	else if (bfqq->queued[0] == 0 && bfqq->queued[1] == 0)
		/*
		 * The queue may not be empty upon timer expiration,
		 * because we may not disable the timer when the
		 * first request of the in-service queue arrives
		 * during disk idling.
		 */
		reason = BFQQE_TOO_IDLE;
	else
		goto schedule_dispatch;

	bfq_bfqq_expire(bfqd, bfqq, true, reason);

schedule_dispatch:
	bfq_schedule_dispatch(bfqd);
	spin_unlock_irqrestore(&bfqd->lock, flags);
}

/*
 * Handler of the expiration of the timer running if the in-service queue
 * is idling inside its time slice.
 */
static enum hrtimer_restart bfq_idle_slice_timer(struct hrtimer *timer)
{
	struct bfq_data *bfqd = container_of(timer, struct bfq_data,
					     idle_slice_timer);
	struct bfq_queue *bfqq = bfqd->in_service_queue;

	/*
	 * Theoretical race here: the in-service queue can be NULL or
	 * different from the queue that was idling if a new request
	 * arrives for the current queue and there is a full dispatch
	 * cycle that changes the in-service queue. This can hardly
	 * happen, but in the worst case we just expire a queue too
	 * early.
	 */
	if (bfqq)
		bfq_idle_slice_timer_body(bfqd, bfqq);

	return HRTIMER_NORESTART;
}

static void __bfq_put_async_bfqq(struct bfq_data *bfqd,
				 struct bfq_queue **bfqq_ptr)
{
	struct bfq_queue *bfqq = *bfqq_ptr;

	bfq_log(bfqd, "put_async_bfqq: %p", bfqq);
	if (bfqq) {
		bfq_bfqq_move(bfqd, bfqq, bfqd->root_group);

		bfq_log_bfqq(bfqd, bfqq, "put_async_bfqq: putting %p, %d",
			     bfqq, bfqq->ref);
		bfq_put_queue(bfqq);
		*bfqq_ptr = NULL;
	}
}

/*
 * Release all the bfqg references to its async queues. If we are
 * deallocating the group these queues may still contain requests, so
 * we reparent them to the root cgroup (i.e., the only one that will
 * exist for sure until all the requests on a device are gone).
 */
void bfq_put_async_queues(struct bfq_data *bfqd, struct bfq_group *bfqg)
{
	int i, j, k;

	for (k = 0; k < bfqd->num_actuators; k++) {
		for (i = 0; i < 2; i++)
			for (j = 0; j < IOPRIO_NR_LEVELS; j++)
				__bfq_put_async_bfqq(bfqd, &bfqg->async_bfqq[i][j][k]);

		__bfq_put_async_bfqq(bfqd, &bfqg->async_idle_bfqq[k]);
	}
}

/*
 * See the comments on bfq_limit_depth for the purpose of
 * the depths set in the function. 
Return minimum shallow depth we'll use.7110*/7111static void bfq_update_depths(struct bfq_data *bfqd, struct sbitmap_queue *bt)7112{7113unsigned int nr_requests = bfqd->queue->nr_requests;71147115/*7116* In-word depths if no bfq_queue is being weight-raised:7117* leaving 25% of tags only for sync reads.7118*7119* In next formulas, right-shift the value7120* (1U<<bt->sb.shift), instead of computing directly7121* (1U<<(bt->sb.shift - something)), to be robust against7122* any possible value of bt->sb.shift, without having to7123* limit 'something'.7124*/7125/* no more than 50% of tags for async I/O */7126bfqd->async_depths[0][0] = max(nr_requests >> 1, 1U);7127/*7128* no more than 75% of tags for sync writes (25% extra tags7129* w.r.t. async I/O, to prevent async I/O from starving sync7130* writes)7131*/7132bfqd->async_depths[0][1] = max((nr_requests * 3) >> 2, 1U);71337134/*7135* In-word depths in case some bfq_queue is being weight-7136* raised: leaving ~63% of tags for sync reads. This is the7137* highest percentage for which, in our tests, application7138* start-up times didn't suffer from any regression due to tag7139* shortage.7140*/7141/* no more than ~18% of tags for async I/O */7142bfqd->async_depths[1][0] = max((nr_requests * 3) >> 4, 1U);7143/* no more than ~37% of tags for sync writes (~20% extra tags) */7144bfqd->async_depths[1][1] = max((nr_requests * 6) >> 4, 1U);7145}71467147static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)7148{7149struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;7150struct blk_mq_tags *tags = hctx->sched_tags;71517152bfq_update_depths(bfqd, &tags->bitmap_tags);7153sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, 1);7154}71557156static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)7157{7158bfq_depth_updated(hctx);7159return 0;7160}71617162static void bfq_exit_queue(struct elevator_queue *e)7163{7164struct bfq_data *bfqd = e->elevator_data;7165struct bfq_queue *bfqq, *n;7166unsigned int actuator;71677168hrtimer_cancel(&bfqd->idle_slice_timer);71697170spin_lock_irq(&bfqd->lock);7171list_for_each_entry_safe(bfqq, n, &bfqd->idle_list, bfqq_list)7172bfq_deactivate_bfqq(bfqd, bfqq, false, false);7173spin_unlock_irq(&bfqd->lock);71747175for (actuator = 0; actuator < bfqd->num_actuators; actuator++)7176WARN_ON_ONCE(bfqd->rq_in_driver[actuator]);7177WARN_ON_ONCE(bfqd->tot_rq_in_driver);71787179hrtimer_cancel(&bfqd->idle_slice_timer);71807181/* release oom-queue reference to root group */7182bfqg_and_blkg_put(bfqd->root_group);71837184#ifdef CONFIG_BFQ_GROUP_IOSCHED7185blkcg_deactivate_policy(bfqd->queue->disk, &blkcg_policy_bfq);7186#else7187spin_lock_irq(&bfqd->lock);7188bfq_put_async_queues(bfqd, bfqd->root_group);7189kfree(bfqd->root_group);7190spin_unlock_irq(&bfqd->lock);7191#endif71927193blk_stat_disable_accounting(bfqd->queue);7194blk_queue_flag_clear(QUEUE_FLAG_DISABLE_WBT_DEF, bfqd->queue);7195set_bit(ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT, &e->flags);71967197kfree(bfqd);7198}71997200static void bfq_init_root_group(struct bfq_group *root_group,7201struct bfq_data *bfqd)7202{7203int i;72047205#ifdef CONFIG_BFQ_GROUP_IOSCHED7206root_group->entity.parent = NULL;7207root_group->my_entity = NULL;7208root_group->bfqd = bfqd;7209#endif7210root_group->rq_pos_tree = RB_ROOT;7211for (i = 0; i < BFQ_IOPRIO_CLASSES; i++)7212root_group->sched_data.service_tree[i] = BFQ_SERVICE_TREE_INIT;7213root_group->sched_data.bfq_class_idle_last_service = jiffies;7214}72157216static int bfq_init_queue(struct request_queue *q, struct elevator_queue 
*eq)7217{7218struct bfq_data *bfqd;7219unsigned int i;7220struct blk_independent_access_ranges *ia_ranges = q->disk->ia_ranges;72217222bfqd = kzalloc_node(sizeof(*bfqd), GFP_KERNEL, q->node);7223if (!bfqd)7224return -ENOMEM;72257226eq->elevator_data = bfqd;72277228spin_lock_irq(&q->queue_lock);7229q->elevator = eq;7230spin_unlock_irq(&q->queue_lock);72317232/*7233* Our fallback bfqq if bfq_find_alloc_queue() runs into OOM issues.7234* Grab a permanent reference to it, so that the normal code flow7235* will not attempt to free it.7236* Set zero as actuator index: we will pretend that7237* all I/O requests are for the same actuator.7238*/7239bfq_init_bfqq(bfqd, &bfqd->oom_bfqq, NULL, 1, 0, 0);7240bfqd->oom_bfqq.ref++;7241bfqd->oom_bfqq.new_ioprio = BFQ_DEFAULT_QUEUE_IOPRIO;7242bfqd->oom_bfqq.new_ioprio_class = IOPRIO_CLASS_BE;7243bfqd->oom_bfqq.entity.new_weight =7244bfq_ioprio_to_weight(bfqd->oom_bfqq.new_ioprio);72457246/* oom_bfqq does not participate to bursts */7247bfq_clear_bfqq_just_created(&bfqd->oom_bfqq);72487249/*7250* Trigger weight initialization, according to ioprio, at the7251* oom_bfqq's first activation. The oom_bfqq's ioprio and ioprio7252* class won't be changed any more.7253*/7254bfqd->oom_bfqq.entity.prio_changed = 1;72557256bfqd->queue = q;72577258bfqd->num_actuators = 1;7259/*7260* If the disk supports multiple actuators, copy independent7261* access ranges from the request queue structure.7262*/7263spin_lock_irq(&q->queue_lock);7264if (ia_ranges) {7265/*7266* Check if the disk ia_ranges size exceeds the current bfq7267* actuator limit.7268*/7269if (ia_ranges->nr_ia_ranges > BFQ_MAX_ACTUATORS) {7270pr_crit("nr_ia_ranges higher than act limit: iars=%d, max=%d.\n",7271ia_ranges->nr_ia_ranges, BFQ_MAX_ACTUATORS);7272pr_crit("Falling back to single actuator mode.\n");7273} else {7274bfqd->num_actuators = ia_ranges->nr_ia_ranges;72757276for (i = 0; i < bfqd->num_actuators; i++) {7277bfqd->sector[i] = ia_ranges->ia_range[i].sector;7278bfqd->nr_sectors[i] =7279ia_ranges->ia_range[i].nr_sectors;7280}7281}7282}72837284/* Otherwise use single-actuator dev info */7285if (bfqd->num_actuators == 1) {7286bfqd->sector[0] = 0;7287bfqd->nr_sectors[0] = get_capacity(q->disk);7288}7289spin_unlock_irq(&q->queue_lock);72907291INIT_LIST_HEAD(&bfqd->dispatch);72927293hrtimer_setup(&bfqd->idle_slice_timer, bfq_idle_slice_timer, CLOCK_MONOTONIC,7294HRTIMER_MODE_REL);72957296bfqd->queue_weights_tree = RB_ROOT_CACHED;7297#ifdef CONFIG_BFQ_GROUP_IOSCHED7298bfqd->num_groups_with_pending_reqs = 0;7299#endif73007301INIT_LIST_HEAD(&bfqd->active_list[0]);7302INIT_LIST_HEAD(&bfqd->active_list[1]);7303INIT_LIST_HEAD(&bfqd->idle_list);7304INIT_HLIST_HEAD(&bfqd->burst_list);73057306bfqd->hw_tag = -1;7307bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue);73087309bfqd->bfq_max_budget = bfq_default_max_budget;73107311bfqd->bfq_fifo_expire[0] = bfq_fifo_expire[0];7312bfqd->bfq_fifo_expire[1] = bfq_fifo_expire[1];7313bfqd->bfq_back_max = bfq_back_max;7314bfqd->bfq_back_penalty = bfq_back_penalty;7315bfqd->bfq_slice_idle = bfq_slice_idle;7316bfqd->bfq_timeout = bfq_timeout;73177318bfqd->bfq_large_burst_thresh = 8;7319bfqd->bfq_burst_interval = msecs_to_jiffies(180);73207321bfqd->low_latency = true;73227323/*7324* Trade-off between responsiveness and fairness.7325*/7326bfqd->bfq_wr_coeff = 30;7327bfqd->bfq_wr_rt_max_time = msecs_to_jiffies(300);7328bfqd->bfq_wr_min_idle_time = msecs_to_jiffies(2000);7329bfqd->bfq_wr_min_inter_arr_async = msecs_to_jiffies(500);7330bfqd->bfq_wr_max_softrt_rate = 7000; 
/*7331* Approximate rate required7332* to playback or record a7333* high-definition compressed7334* video.7335*/7336bfqd->wr_busy_queues = 0;73377338/*7339* Begin by assuming, optimistically, that the device peak7340* rate is equal to 2/3 of the highest reference rate.7341*/7342bfqd->rate_dur_prod = ref_rate[blk_queue_nonrot(bfqd->queue)] *7343ref_wr_duration[blk_queue_nonrot(bfqd->queue)];7344bfqd->peak_rate = ref_rate[blk_queue_nonrot(bfqd->queue)] * 2 / 3;73457346/* see comments on the definition of next field inside bfq_data */7347bfqd->actuator_load_threshold = 4;73487349spin_lock_init(&bfqd->lock);73507351/*7352* The invocation of the next bfq_create_group_hierarchy7353* function is the head of a chain of function calls7354* (bfq_create_group_hierarchy->blkcg_activate_policy->7355* blk_mq_freeze_queue) that may lead to the invocation of the7356* has_work hook function. For this reason,7357* bfq_create_group_hierarchy is invoked only after all7358* scheduler data has been initialized, apart from the fields7359* that can be initialized only after invoking7360* bfq_create_group_hierarchy. This, in particular, enables7361* has_work to correctly return false. Of course, to avoid7362* other inconsistencies, the blk-mq stack must then refrain7363* from invoking further scheduler hooks before this init7364* function is finished.7365*/7366bfqd->root_group = bfq_create_group_hierarchy(bfqd, q->node);7367if (!bfqd->root_group)7368goto out_free;7369bfq_init_root_group(bfqd->root_group, bfqd);7370bfq_init_entity(&bfqd->oom_bfqq.entity, bfqd->root_group);73717372/* We dispatch from request queue wide instead of hw queue */7373blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);73747375blk_queue_flag_set(QUEUE_FLAG_DISABLE_WBT_DEF, q);7376wbt_disable_default(q->disk);7377blk_stat_enable_accounting(q);73787379return 0;73807381out_free:7382kfree(bfqd);7383return -ENOMEM;7384}73857386static void bfq_slab_kill(void)7387{7388kmem_cache_destroy(bfq_pool);7389}73907391static int __init bfq_slab_setup(void)7392{7393bfq_pool = KMEM_CACHE(bfq_queue, 0);7394if (!bfq_pool)7395return -ENOMEM;7396return 0;7397}73987399static ssize_t bfq_var_show(unsigned int var, char *page)7400{7401return sprintf(page, "%u\n", var);7402}74037404static int bfq_var_store(unsigned long *var, const char *page)7405{7406unsigned long new_val;7407int ret = kstrtoul(page, 10, &new_val);74087409if (ret)7410return ret;7411*var = new_val;7412return 0;7413}74147415#define SHOW_FUNCTION(__FUNC, __VAR, __CONV) \7416static ssize_t __FUNC(struct elevator_queue *e, char *page) \7417{ \7418struct bfq_data *bfqd = e->elevator_data; \7419u64 __data = __VAR; \7420if (__CONV == 1) \7421__data = jiffies_to_msecs(__data); \7422else if (__CONV == 2) \7423__data = div_u64(__data, NSEC_PER_MSEC); \7424return bfq_var_show(__data, (page)); \7425}7426SHOW_FUNCTION(bfq_fifo_expire_sync_show, bfqd->bfq_fifo_expire[1], 2);7427SHOW_FUNCTION(bfq_fifo_expire_async_show, bfqd->bfq_fifo_expire[0], 2);7428SHOW_FUNCTION(bfq_back_seek_max_show, bfqd->bfq_back_max, 0);7429SHOW_FUNCTION(bfq_back_seek_penalty_show, bfqd->bfq_back_penalty, 0);7430SHOW_FUNCTION(bfq_slice_idle_show, bfqd->bfq_slice_idle, 2);7431SHOW_FUNCTION(bfq_max_budget_show, bfqd->bfq_user_max_budget, 0);7432SHOW_FUNCTION(bfq_timeout_sync_show, bfqd->bfq_timeout, 1);7433SHOW_FUNCTION(bfq_strict_guarantees_show, bfqd->strict_guarantees, 0);7434SHOW_FUNCTION(bfq_low_latency_show, bfqd->low_latency, 0);7435#undef SHOW_FUNCTION74367437#define USEC_SHOW_FUNCTION(__FUNC, __VAR) \7438static ssize_t __FUNC(struct 
elevator_queue *e, char *page) \7439{ \7440struct bfq_data *bfqd = e->elevator_data; \7441u64 __data = __VAR; \7442__data = div_u64(__data, NSEC_PER_USEC); \7443return bfq_var_show(__data, (page)); \7444}7445USEC_SHOW_FUNCTION(bfq_slice_idle_us_show, bfqd->bfq_slice_idle);7446#undef USEC_SHOW_FUNCTION74477448#define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \7449static ssize_t \7450__FUNC(struct elevator_queue *e, const char *page, size_t count) \7451{ \7452struct bfq_data *bfqd = e->elevator_data; \7453unsigned long __data, __min = (MIN), __max = (MAX); \7454int ret; \7455\7456ret = bfq_var_store(&__data, (page)); \7457if (ret) \7458return ret; \7459if (__data < __min) \7460__data = __min; \7461else if (__data > __max) \7462__data = __max; \7463if (__CONV == 1) \7464*(__PTR) = msecs_to_jiffies(__data); \7465else if (__CONV == 2) \7466*(__PTR) = (u64)__data * NSEC_PER_MSEC; \7467else \7468*(__PTR) = __data; \7469return count; \7470}7471STORE_FUNCTION(bfq_fifo_expire_sync_store, &bfqd->bfq_fifo_expire[1], 1,7472INT_MAX, 2);7473STORE_FUNCTION(bfq_fifo_expire_async_store, &bfqd->bfq_fifo_expire[0], 1,7474INT_MAX, 2);7475STORE_FUNCTION(bfq_back_seek_max_store, &bfqd->bfq_back_max, 0, INT_MAX, 0);7476STORE_FUNCTION(bfq_back_seek_penalty_store, &bfqd->bfq_back_penalty, 1,7477INT_MAX, 0);7478STORE_FUNCTION(bfq_slice_idle_store, &bfqd->bfq_slice_idle, 0, INT_MAX, 2);7479#undef STORE_FUNCTION74807481#define USEC_STORE_FUNCTION(__FUNC, __PTR, MIN, MAX) \7482static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)\7483{ \7484struct bfq_data *bfqd = e->elevator_data; \7485unsigned long __data, __min = (MIN), __max = (MAX); \7486int ret; \7487\7488ret = bfq_var_store(&__data, (page)); \7489if (ret) \7490return ret; \7491if (__data < __min) \7492__data = __min; \7493else if (__data > __max) \7494__data = __max; \7495*(__PTR) = (u64)__data * NSEC_PER_USEC; \7496return count; \7497}7498USEC_STORE_FUNCTION(bfq_slice_idle_us_store, &bfqd->bfq_slice_idle, 0,7499UINT_MAX);7500#undef USEC_STORE_FUNCTION75017502static ssize_t bfq_max_budget_store(struct elevator_queue *e,7503const char *page, size_t count)7504{7505struct bfq_data *bfqd = e->elevator_data;7506unsigned long __data;7507int ret;75087509ret = bfq_var_store(&__data, (page));7510if (ret)7511return ret;75127513if (__data == 0)7514bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);7515else {7516if (__data > INT_MAX)7517__data = INT_MAX;7518bfqd->bfq_max_budget = __data;7519}75207521bfqd->bfq_user_max_budget = __data;75227523return count;7524}75257526/*7527* Leaving this name to preserve name compatibility with cfq7528* parameters, but this timeout is used for both sync and async.7529*/7530static ssize_t bfq_timeout_sync_store(struct elevator_queue *e,7531const char *page, size_t count)7532{7533struct bfq_data *bfqd = e->elevator_data;7534unsigned long __data;7535int ret;75367537ret = bfq_var_store(&__data, (page));7538if (ret)7539return ret;75407541if (__data < 1)7542__data = 1;7543else if (__data > INT_MAX)7544__data = INT_MAX;75457546bfqd->bfq_timeout = msecs_to_jiffies(__data);7547if (bfqd->bfq_user_max_budget == 0)7548bfqd->bfq_max_budget = bfq_calc_max_budget(bfqd);75497550return count;7551}75527553static ssize_t bfq_strict_guarantees_store(struct elevator_queue *e,7554const char *page, size_t count)7555{7556struct bfq_data *bfqd = e->elevator_data;7557unsigned long __data;7558int ret;75597560ret = bfq_var_store(&__data, (page));7561if (ret)7562return ret;75637564if (__data > 1)7565__data = 1;7566if 
(!bfqd->strict_guarantees && __data == 17567&& bfqd->bfq_slice_idle < 8 * NSEC_PER_MSEC)7568bfqd->bfq_slice_idle = 8 * NSEC_PER_MSEC;75697570bfqd->strict_guarantees = __data;75717572return count;7573}75747575static ssize_t bfq_low_latency_store(struct elevator_queue *e,7576const char *page, size_t count)7577{7578struct bfq_data *bfqd = e->elevator_data;7579unsigned long __data;7580int ret;75817582ret = bfq_var_store(&__data, (page));7583if (ret)7584return ret;75857586if (__data > 1)7587__data = 1;7588if (__data == 0 && bfqd->low_latency != 0)7589bfq_end_wr(bfqd);7590bfqd->low_latency = __data;75917592return count;7593}75947595#define BFQ_ATTR(name) \7596__ATTR(name, 0644, bfq_##name##_show, bfq_##name##_store)75977598static const struct elv_fs_entry bfq_attrs[] = {7599BFQ_ATTR(fifo_expire_sync),7600BFQ_ATTR(fifo_expire_async),7601BFQ_ATTR(back_seek_max),7602BFQ_ATTR(back_seek_penalty),7603BFQ_ATTR(slice_idle),7604BFQ_ATTR(slice_idle_us),7605BFQ_ATTR(max_budget),7606BFQ_ATTR(timeout_sync),7607BFQ_ATTR(strict_guarantees),7608BFQ_ATTR(low_latency),7609__ATTR_NULL7610};76117612static struct elevator_type iosched_bfq_mq = {7613.ops = {7614.limit_depth = bfq_limit_depth,7615.prepare_request = bfq_prepare_request,7616.requeue_request = bfq_finish_requeue_request,7617.finish_request = bfq_finish_request,7618.exit_icq = bfq_exit_icq,7619.insert_requests = bfq_insert_requests,7620.dispatch_request = bfq_dispatch_request,7621.next_request = elv_rb_latter_request,7622.former_request = elv_rb_former_request,7623.allow_merge = bfq_allow_bio_merge,7624.bio_merge = bfq_bio_merge,7625.request_merge = bfq_request_merge,7626.requests_merged = bfq_requests_merged,7627.request_merged = bfq_request_merged,7628.has_work = bfq_has_work,7629.depth_updated = bfq_depth_updated,7630.init_hctx = bfq_init_hctx,7631.init_sched = bfq_init_queue,7632.exit_sched = bfq_exit_queue,7633},76347635.icq_size = sizeof(struct bfq_io_cq),7636.icq_align = __alignof__(struct bfq_io_cq),7637.elevator_attrs = bfq_attrs,7638.elevator_name = "bfq",7639.elevator_owner = THIS_MODULE,7640};7641MODULE_ALIAS("bfq-iosched");76427643static int __init bfq_init(void)7644{7645int ret;76467647#ifdef CONFIG_BFQ_GROUP_IOSCHED7648ret = blkcg_policy_register(&blkcg_policy_bfq);7649if (ret)7650return ret;7651#endif76527653ret = -ENOMEM;7654if (bfq_slab_setup())7655goto err_pol_unreg;76567657/*7658* Times to load large popular applications for the typical7659* systems installed on the reference devices (see the7660* comments before the definition of the next7661* array). Actually, we use slightly lower values, as the7662* estimated peak rate tends to be smaller than the actual7663* peak rate. The reason for this last fact is that estimates7664* are computed over much shorter time intervals than the long7665* intervals typically used for benchmarking. Why? First, to7666* adapt more quickly to variations. 
Second, because an I/O7667* scheduler cannot rely on a peak-rate-evaluation workload to7668* be run for a long time.7669*/7670ref_wr_duration[0] = msecs_to_jiffies(7000); /* actually 8 sec */7671ref_wr_duration[1] = msecs_to_jiffies(2500); /* actually 3 sec */76727673ret = elv_register(&iosched_bfq_mq);7674if (ret)7675goto slab_kill;76767677return 0;76787679slab_kill:7680bfq_slab_kill();7681err_pol_unreg:7682#ifdef CONFIG_BFQ_GROUP_IOSCHED7683blkcg_policy_unregister(&blkcg_policy_bfq);7684#endif7685return ret;7686}76877688static void __exit bfq_exit(void)7689{7690elv_unregister(&iosched_bfq_mq);7691#ifdef CONFIG_BFQ_GROUP_IOSCHED7692blkcg_policy_unregister(&blkcg_policy_bfq);7693#endif7694bfq_slab_kill();7695}76967697module_init(bfq_init);7698module_exit(bfq_exit);76997700MODULE_AUTHOR("Paolo Valente");7701MODULE_LICENSE("GPL");7702MODULE_DESCRIPTION("MQ Budget Fair Queueing I/O Scheduler");770377047705
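/*
 * Usage note (illustrative, not part of the code): once this scheduler is
 * available, built in or loaded as a module, it can be selected per block
 * device by writing "bfq" to /sys/block/<dev>/queue/scheduler; the
 * tunables exported through bfq_attrs above (low_latency, slice_idle,
 * max_budget, ...) then appear under /sys/block/<dev>/queue/iosched/.
 */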