Path: blob/main/sys/contrib/openzfs/module/zfs/arc.c
106597 views
// SPDX-License-Identifier: CDDL-1.01/*2* CDDL HEADER START3*4* The contents of this file are subject to the terms of the5* Common Development and Distribution License (the "License").6* You may not use this file except in compliance with the License.7*8* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE9* or https://opensource.org/licenses/CDDL-1.0.10* See the License for the specific language governing permissions11* and limitations under the License.12*13* When distributing Covered Code, include this CDDL HEADER in each14* file and include the License file at usr/src/OPENSOLARIS.LICENSE.15* If applicable, add the following below this CDDL HEADER, with the16* fields enclosed by brackets "[]" replaced with your own identifying17* information: Portions Copyright [yyyy] [name of copyright owner]18*19* CDDL HEADER END20*/21/*22* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.23* Copyright (c) 2018, Joyent, Inc.24* Copyright (c) 2011, 2020, Delphix. All rights reserved.25* Copyright (c) 2014, Saso Kiselkov. All rights reserved.26* Copyright (c) 2017, Nexenta Systems, Inc. All rights reserved.27* Copyright (c) 2019, loli10K <[email protected]>. All rights reserved.28* Copyright (c) 2020, George Amanakis. All rights reserved.29* Copyright (c) 2019, 2024, 2025, Klara, Inc.30* Copyright (c) 2019, Allan Jude31* Copyright (c) 2020, The FreeBSD Foundation [1]32* Copyright (c) 2021, 2024 by George Melikov. All rights reserved.33*34* [1] Portions of this software were developed by Allan Jude35* under sponsorship from the FreeBSD Foundation.36*/3738/*39* DVA-based Adjustable Replacement Cache40*41* While much of the theory of operation used here is42* based on the self-tuning, low overhead replacement cache43* presented by Megiddo and Modha at FAST 2003, there are some44* significant differences:45*46* 1. The Megiddo and Modha model assumes any page is evictable.47* Pages in its cache cannot be "locked" into memory. This makes48* the eviction algorithm simple: evict the last page in the list.49* This also make the performance characteristics easy to reason50* about. Our cache is not so simple. At any given moment, some51* subset of the blocks in the cache are un-evictable because we52* have handed out a reference to them. Blocks are only evictable53* when there are no external references active. This makes54* eviction far more problematic: we choose to evict the evictable55* blocks that are the "lowest" in the list.56*57* There are times when it is not possible to evict the requested58* space. In these circumstances we are unable to adjust the cache59* size. To prevent the cache growing unbounded at these times we60* implement a "cache throttle" that slows the flow of new data61* into the cache until we can make space available.62*63* 2. The Megiddo and Modha model assumes a fixed cache size.64* Pages are evicted when the cache is full and there is a cache65* miss. Our model has a variable sized cache. It grows with66* high use, but also tries to react to memory pressure from the67* operating system: decreasing its size when system memory is68* tight.69*70* 3. The Megiddo and Modha model assumes a fixed page size. All71* elements of the cache are therefore exactly the same size. So72* when adjusting the cache size following a cache miss, its simply73* a matter of choosing a single page to evict. In our model, we74* have variable sized cache blocks (ranging from 512 bytes to75* 128K bytes). 
We therefore choose a set of blocks to evict to make76* space for a cache miss that approximates as closely as possible77* the space used by the new block.78*79* See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"80* by N. Megiddo & D. Modha, FAST 200381*/8283/*84* The locking model:85*86* A new reference to a cache buffer can be obtained in two87* ways: 1) via a hash table lookup using the DVA as a key,88* or 2) via one of the ARC lists. The arc_read() interface89* uses method 1, while the internal ARC algorithms for90* adjusting the cache use method 2. We therefore provide two91* types of locks: 1) the hash table lock array, and 2) the92* ARC list locks.93*94* Buffers do not have their own mutexes, rather they rely on the95* hash table mutexes for the bulk of their protection (i.e. most96* fields in the arc_buf_hdr_t are protected by these mutexes).97*98* buf_hash_find() returns the appropriate mutex (held) when it99* locates the requested buffer in the hash table. It returns100* NULL for the mutex if the buffer was not in the table.101*102* buf_hash_remove() expects the appropriate hash mutex to be103* already held before it is invoked.104*105* Each ARC state also has a mutex which is used to protect the106* buffer list associated with the state. When attempting to107* obtain a hash table lock while holding an ARC list lock you108* must use: mutex_tryenter() to avoid deadlock. Also note that109* the active state mutex must be held before the ghost state mutex.110*111* It as also possible to register a callback which is run when the112* metadata limit is reached and no buffers can be safely evicted. In113* this case the arc user should drop a reference on some arc buffers so114* they can be reclaimed. For example, when using the ZPL each dentry115* holds a references on a znode. These dentries must be pruned before116* the arc buffer holding the znode can be safely evicted.117*118* Note that the majority of the performance stats are manipulated119* with atomic operations.120*121* The L2ARC uses the l2ad_mtx on each vdev for the following:122*123* - L2ARC buflist creation124* - L2ARC buflist eviction125* - L2ARC write completion, which walks L2ARC buflists126* - ARC header destruction, as it removes from L2ARC buflists127* - ARC header release, as it removes from L2ARC buflists128*/129130/*131* ARC operation:132*133* Every block that is in the ARC is tracked by an arc_buf_hdr_t structure.134* This structure can point either to a block that is still in the cache or to135* one that is only accessible in an L2 ARC device, or it can provide136* information about a block that was recently evicted. If a block is137* only accessible in the L2ARC, then the arc_buf_hdr_t only has enough138* information to retrieve it from the L2ARC device. This information is139* stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t. A block140* that is in this state cannot access the data directly.141*142* Blocks that are actively being referenced or have not been evicted143* are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within144* the arc_buf_hdr_t that will point to the data block in memory. A block can145* only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC146* caches data in two ways -- in a list of ARC buffers (arc_buf_t) and147* also in the arc_buf_hdr_t's private physical data block pointer (b_pabd).148*149* The L1ARC's data pointer may or may not be uncompressed. 
The ARC has the150* ability to store the physical data (b_pabd) associated with the DVA of the151* arc_buf_hdr_t. Since the b_pabd is a copy of the on-disk physical block,152* it will match its on-disk compression characteristics. This behavior can be153* disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the154* compressed ARC functionality is disabled, the b_pabd will point to an155* uncompressed version of the on-disk data.156*157* Data in the L1ARC is not accessed by consumers of the ARC directly. Each158* arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it.159* Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC160* consumer. The ARC will provide references to this data and will keep it161* cached until it is no longer in use. The ARC caches only the L1ARC's physical162* data block and will evict any arc_buf_t that is no longer referenced. The163* amount of memory consumed by the arc_buf_ts' data buffers can be seen via the164* "overhead_size" kstat.165*166* Depending on the consumer, an arc_buf_t can be requested in uncompressed or167* compressed form. The typical case is that consumers will want uncompressed168* data, and when that happens a new data buffer is allocated where the data is169* decompressed for them to use. Currently the only consumer who wants170* compressed arc_buf_t's is "zfs send", when it streams data exactly as it171* exists on disk. When this happens, the arc_buf_t's data buffer is shared172* with the arc_buf_hdr_t.173*174* Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The175* first one is owned by a compressed send consumer (and therefore references176* the same compressed data buffer as the arc_buf_hdr_t) and the second could be177* used by any other consumer (and has its own uncompressed copy of the data178* buffer).179*180* arc_buf_hdr_t181* +-----------+182* | fields |183* | common to |184* | L1- and |185* | L2ARC |186* +-----------+187* | l2arc_buf_hdr_t188* | |189* +-----------+190* | l1arc_buf_hdr_t191* | | arc_buf_t192* | b_buf +------------>+-----------+ arc_buf_t193* | b_pabd +-+ |b_next +---->+-----------+194* +-----------+ | |-----------| |b_next +-->NULL195* | |b_comp = T | +-----------+196* | |b_data +-+ |b_comp = F |197* | +-----------+ | |b_data +-+198* +->+------+ | +-----------+ |199* compressed | | | |200* data | |<--------------+ | uncompressed201* +------+ compressed, | data202* shared +-->+------+203* data | |204* | |205* +------+206*207* When a consumer reads a block, the ARC must first look to see if the208* arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new209* arc_buf_t and either copies uncompressed data into a new data buffer from an210* existing uncompressed arc_buf_t, decompresses the hdr's b_pabd buffer into a211* new data buffer, or shares the hdr's b_pabd buffer, depending on whether the212* hdr is compressed and the desired compression characteristics of the213* arc_buf_t consumer. 
If the arc_buf_t ends up sharing data with the214* arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be215* the last buffer in the hdr's b_buf list, however a shared compressed buf can216* be anywhere in the hdr's list.217*218* The diagram below shows an example of an uncompressed ARC hdr that is219* sharing its data with an arc_buf_t (note that the shared uncompressed buf is220* the last element in the buf list):221*222* arc_buf_hdr_t223* +-----------+224* | |225* | |226* | |227* +-----------+228* l2arc_buf_hdr_t| |229* | |230* +-----------+231* l1arc_buf_hdr_t| |232* | | arc_buf_t (shared)233* | b_buf +------------>+---------+ arc_buf_t234* | | |b_next +---->+---------+235* | b_pabd +-+ |---------| |b_next +-->NULL236* +-----------+ | | | +---------+237* | |b_data +-+ | |238* | +---------+ | |b_data +-+239* +->+------+ | +---------+ |240* | | | |241* uncompressed | | | |242* data +------+ | |243* ^ +->+------+ |244* | uncompressed | | |245* | data | | |246* | +------+ |247* +---------------------------------+248*249* Writing to the ARC requires that the ARC first discard the hdr's b_pabd250* since the physical block is about to be rewritten. The new data contents251* will be contained in the arc_buf_t. As the I/O pipeline performs the write,252* it may compress the data before writing it to disk. The ARC will be called253* with the transformed data and will memcpy the transformed on-disk block into254* a newly allocated b_pabd. Writes are always done into buffers which have255* either been loaned (and hence are new and don't have other readers) or256* buffers which have been released (and hence have their own hdr, if there257* were originally other readers of the buf's original hdr). This ensures that258* the ARC only needs to update a single buf and its hdr after a write occurs.259*260* When the L2ARC is in use, it will also take advantage of the b_pabd. The261* L2ARC will always write the contents of b_pabd to the L2ARC. This means262* that when compressed ARC is enabled that the L2ARC blocks are identical263* to the on-disk block in the main data pool. This provides a significant264* advantage since the ARC can leverage the bp's checksum when reading from the265* L2ARC to determine if the contents are valid. However, if the compressed266* ARC is disabled, then the L2ARC's block must be transformed to look267* like the physical block in the main data pool before comparing the268* checksum and determining its validity.269*270* The L1ARC has a slightly different system for storing encrypted data.271* Raw (encrypted + possibly compressed) data has a few subtle differences from272* data that is just compressed. The biggest difference is that it is not273* possible to decrypt encrypted data (or vice-versa) if the keys aren't loaded.274* The other difference is that encryption cannot be treated as a suggestion.275* If a caller would prefer compressed data, but they actually wind up with276* uncompressed data the worst thing that could happen is there might be a277* performance hit. If the caller requests encrypted data, however, we must be278* sure they actually get it or else secret information could be leaked. Raw279* data is stored in hdr->b_crypt_hdr.b_rabd. An encrypted header, therefore,280* may have both an encrypted version and a decrypted version of its data at281* once. When a caller needs a raw arc_buf_t, it is allocated and the data is282* copied out of this header. 
To avoid complications with b_pabd, raw buffers283* cannot be shared.284*/285286#include <sys/spa.h>287#include <sys/zio.h>288#include <sys/spa_impl.h>289#include <sys/zio_compress.h>290#include <sys/zio_checksum.h>291#include <sys/zfs_context.h>292#include <sys/arc.h>293#include <sys/zfs_refcount.h>294#include <sys/vdev.h>295#include <sys/vdev_impl.h>296#include <sys/dsl_pool.h>297#include <sys/multilist.h>298#include <sys/abd.h>299#include <sys/dbuf.h>300#include <sys/zil.h>301#include <sys/fm/fs/zfs.h>302#include <sys/callb.h>303#include <sys/kstat.h>304#include <sys/zthr.h>305#include <zfs_fletcher.h>306#include <sys/arc_impl.h>307#include <sys/trace_zfs.h>308#include <sys/aggsum.h>309#include <sys/wmsum.h>310#include <cityhash.h>311#include <sys/vdev_trim.h>312#include <sys/zfs_racct.h>313#include <sys/zstd/zstd.h>314315#ifndef _KERNEL316/* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */317boolean_t arc_watch = B_FALSE;318#endif319320/*321* This thread's job is to keep enough free memory in the system, by322* calling arc_kmem_reap_soon() plus arc_reduce_target_size(), which improves323* arc_available_memory().324*/325static zthr_t *arc_reap_zthr;326327/*328* This thread's job is to keep arc_size under arc_c, by calling329* arc_evict(), which improves arc_is_overflowing().330*/331static zthr_t *arc_evict_zthr;332static arc_buf_hdr_t **arc_state_evict_markers;333static int arc_state_evict_marker_count;334335static kmutex_t arc_evict_lock;336static boolean_t arc_evict_needed = B_FALSE;337static clock_t arc_last_uncached_flush;338339static taskq_t *arc_evict_taskq;340static struct evict_arg *arc_evict_arg;341342/*343* Count of bytes evicted since boot.344*/345static uint64_t arc_evict_count;346347/*348* List of arc_evict_waiter_t's, representing threads waiting for the349* arc_evict_count to reach specific values.350*/351static list_t arc_evict_waiters;352353/*354* When arc_is_overflowing(), arc_get_data_impl() waits for this percent of355* the requested amount of data to be evicted. For example, by default for356* every 2KB that's evicted, 1KB of it may be "reused" by a new allocation.357* Since this is above 100%, it ensures that progress is made towards getting358* arc_size under arc_c. Since this is finite, it ensures that allocations359* can still happen, even during the potentially long time that arc_size is360* more than arc_c.361*/362static uint_t zfs_arc_eviction_pct = 200;363364/*365* The number of headers to evict in arc_evict_state_impl() before366* dropping the sublist lock and evicting from another sublist. A lower367* value means we're more likely to evict the "correct" header (i.e. the368* oldest header in the arc state), but comes with higher overhead369* (i.e. 
more invocations of arc_evict_state_impl()).370*/371static uint_t zfs_arc_evict_batch_limit = 10;372373/*374* Number batches to process per parallel eviction task under heavy load to375* reduce number of context switches.376*/377static uint_t zfs_arc_evict_batches_limit = 5;378379/* number of seconds before growing cache again */380uint_t arc_grow_retry = 5;381382/*383* Minimum time between calls to arc_kmem_reap_soon().384*/385static const int arc_kmem_cache_reap_retry_ms = 1000;386387/* shift of arc_c for calculating overflow limit in arc_get_data_impl */388static int zfs_arc_overflow_shift = 8;389390/* log2(fraction of arc to reclaim) */391uint_t arc_shrink_shift = 7;392393#ifdef _KERNEL394/* percent of pagecache to reclaim arc to */395uint_t zfs_arc_pc_percent = 0;396#endif397398/*399* log2(fraction of ARC which must be free to allow growing).400* I.e. If there is less than arc_c >> arc_no_grow_shift free memory,401* when reading a new block into the ARC, we will evict an equal-sized block402* from the ARC.403*404* This must be less than arc_shrink_shift, so that when we shrink the ARC,405* we will still not allow it to grow.406*/407uint_t arc_no_grow_shift = 5;408409410/*411* minimum lifespan of a prefetch block in clock ticks412* (initialized in arc_init())413*/414static uint_t arc_min_prefetch;415static uint_t arc_min_prescient_prefetch;416417/*418* If this percent of memory is free, don't throttle.419*/420uint_t arc_lotsfree_percent = 10;421422/*423* The arc has filled available memory and has now warmed up.424*/425boolean_t arc_warm;426427/*428* These tunables are for performance analysis.429*/430uint64_t zfs_arc_max = 0;431uint64_t zfs_arc_min = 0;432static uint64_t zfs_arc_dnode_limit = 0;433static uint_t zfs_arc_dnode_reduce_percent = 10;434static uint_t zfs_arc_grow_retry = 0;435static uint_t zfs_arc_shrink_shift = 0;436uint_t zfs_arc_average_blocksize = 8 * 1024; /* 8KB */437438/*439* ARC dirty data constraints for arc_tempreserve_space() throttle:440* * total dirty data limit441* * anon block dirty limit442* * each pool's anon allowance443*/444static const unsigned long zfs_arc_dirty_limit_percent = 50;445static const unsigned long zfs_arc_anon_limit_percent = 25;446static const unsigned long zfs_arc_pool_dirty_percent = 20;447448/*449* Enable or disable compressed arc buffers.450*/451int zfs_compressed_arc_enabled = B_TRUE;452453/*454* Balance between metadata and data on ghost hits. 
Values above 100455* increase metadata caching by proportionally reducing effect of ghost456* data hits on target data/metadata rate.457*/458static uint_t zfs_arc_meta_balance = 500;459460/*461* Percentage that can be consumed by dnodes of ARC meta buffers.462*/463static uint_t zfs_arc_dnode_limit_percent = 10;464465/*466* These tunables are Linux-specific467*/468static uint64_t zfs_arc_sys_free = 0;469static uint_t zfs_arc_min_prefetch_ms = 0;470static uint_t zfs_arc_min_prescient_prefetch_ms = 0;471static uint_t zfs_arc_lotsfree_percent = 10;472473/*474* Number of arc_prune threads475*/476static int zfs_arc_prune_task_threads = 1;477478/* Used by spa_export/spa_destroy to flush the arc asynchronously */479static taskq_t *arc_flush_taskq;480481/*482* Controls the number of ARC eviction threads to dispatch sublists to.483*484* Possible values:485* 0 (auto) compute the number of threads using a logarithmic formula.486* 1 (disabled) one thread - parallel eviction is disabled.487* 2+ (manual) set the number manually.488*489* See arc_evict_thread_init() for how "auto" is computed.490*/491static uint_t zfs_arc_evict_threads = 0;492493/* The 7 states: */494arc_state_t ARC_anon;495arc_state_t ARC_mru;496arc_state_t ARC_mru_ghost;497arc_state_t ARC_mfu;498arc_state_t ARC_mfu_ghost;499arc_state_t ARC_l2c_only;500arc_state_t ARC_uncached;501502arc_stats_t arc_stats = {503{ "hits", KSTAT_DATA_UINT64 },504{ "iohits", KSTAT_DATA_UINT64 },505{ "misses", KSTAT_DATA_UINT64 },506{ "demand_data_hits", KSTAT_DATA_UINT64 },507{ "demand_data_iohits", KSTAT_DATA_UINT64 },508{ "demand_data_misses", KSTAT_DATA_UINT64 },509{ "demand_metadata_hits", KSTAT_DATA_UINT64 },510{ "demand_metadata_iohits", KSTAT_DATA_UINT64 },511{ "demand_metadata_misses", KSTAT_DATA_UINT64 },512{ "prefetch_data_hits", KSTAT_DATA_UINT64 },513{ "prefetch_data_iohits", KSTAT_DATA_UINT64 },514{ "prefetch_data_misses", KSTAT_DATA_UINT64 },515{ "prefetch_metadata_hits", KSTAT_DATA_UINT64 },516{ "prefetch_metadata_iohits", KSTAT_DATA_UINT64 },517{ "prefetch_metadata_misses", KSTAT_DATA_UINT64 },518{ "mru_hits", KSTAT_DATA_UINT64 },519{ "mru_ghost_hits", KSTAT_DATA_UINT64 },520{ "mfu_hits", KSTAT_DATA_UINT64 },521{ "mfu_ghost_hits", KSTAT_DATA_UINT64 },522{ "uncached_hits", KSTAT_DATA_UINT64 },523{ "deleted", KSTAT_DATA_UINT64 },524{ "mutex_miss", KSTAT_DATA_UINT64 },525{ "access_skip", KSTAT_DATA_UINT64 },526{ "evict_skip", KSTAT_DATA_UINT64 },527{ "evict_not_enough", KSTAT_DATA_UINT64 },528{ "evict_l2_cached", KSTAT_DATA_UINT64 },529{ "evict_l2_eligible", KSTAT_DATA_UINT64 },530{ "evict_l2_eligible_mfu", KSTAT_DATA_UINT64 },531{ "evict_l2_eligible_mru", KSTAT_DATA_UINT64 },532{ "evict_l2_ineligible", KSTAT_DATA_UINT64 },533{ "evict_l2_skip", KSTAT_DATA_UINT64 },534{ "hash_elements", KSTAT_DATA_UINT64 },535{ "hash_elements_max", KSTAT_DATA_UINT64 },536{ "hash_collisions", KSTAT_DATA_UINT64 },537{ "hash_chains", KSTAT_DATA_UINT64 },538{ "hash_chain_max", KSTAT_DATA_UINT64 },539{ "meta", KSTAT_DATA_UINT64 },540{ "pd", KSTAT_DATA_UINT64 },541{ "pm", KSTAT_DATA_UINT64 },542{ "c", KSTAT_DATA_UINT64 },543{ "c_min", KSTAT_DATA_UINT64 },544{ "c_max", KSTAT_DATA_UINT64 },545{ "size", KSTAT_DATA_UINT64 },546{ "compressed_size", KSTAT_DATA_UINT64 },547{ "uncompressed_size", KSTAT_DATA_UINT64 },548{ "overhead_size", KSTAT_DATA_UINT64 },549{ "hdr_size", KSTAT_DATA_UINT64 },550{ "data_size", KSTAT_DATA_UINT64 },551{ "metadata_size", KSTAT_DATA_UINT64 },552{ "dbuf_size", KSTAT_DATA_UINT64 },553{ "dnode_size", KSTAT_DATA_UINT64 },554{ "bonus_size", 
KSTAT_DATA_UINT64 },555#if defined(COMPAT_FREEBSD11)556{ "other_size", KSTAT_DATA_UINT64 },557#endif558{ "anon_size", KSTAT_DATA_UINT64 },559{ "anon_data", KSTAT_DATA_UINT64 },560{ "anon_metadata", KSTAT_DATA_UINT64 },561{ "anon_evictable_data", KSTAT_DATA_UINT64 },562{ "anon_evictable_metadata", KSTAT_DATA_UINT64 },563{ "mru_size", KSTAT_DATA_UINT64 },564{ "mru_data", KSTAT_DATA_UINT64 },565{ "mru_metadata", KSTAT_DATA_UINT64 },566{ "mru_evictable_data", KSTAT_DATA_UINT64 },567{ "mru_evictable_metadata", KSTAT_DATA_UINT64 },568{ "mru_ghost_size", KSTAT_DATA_UINT64 },569{ "mru_ghost_data", KSTAT_DATA_UINT64 },570{ "mru_ghost_metadata", KSTAT_DATA_UINT64 },571{ "mru_ghost_evictable_data", KSTAT_DATA_UINT64 },572{ "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },573{ "mfu_size", KSTAT_DATA_UINT64 },574{ "mfu_data", KSTAT_DATA_UINT64 },575{ "mfu_metadata", KSTAT_DATA_UINT64 },576{ "mfu_evictable_data", KSTAT_DATA_UINT64 },577{ "mfu_evictable_metadata", KSTAT_DATA_UINT64 },578{ "mfu_ghost_size", KSTAT_DATA_UINT64 },579{ "mfu_ghost_data", KSTAT_DATA_UINT64 },580{ "mfu_ghost_metadata", KSTAT_DATA_UINT64 },581{ "mfu_ghost_evictable_data", KSTAT_DATA_UINT64 },582{ "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },583{ "uncached_size", KSTAT_DATA_UINT64 },584{ "uncached_data", KSTAT_DATA_UINT64 },585{ "uncached_metadata", KSTAT_DATA_UINT64 },586{ "uncached_evictable_data", KSTAT_DATA_UINT64 },587{ "uncached_evictable_metadata", KSTAT_DATA_UINT64 },588{ "l2_hits", KSTAT_DATA_UINT64 },589{ "l2_misses", KSTAT_DATA_UINT64 },590{ "l2_prefetch_asize", KSTAT_DATA_UINT64 },591{ "l2_mru_asize", KSTAT_DATA_UINT64 },592{ "l2_mfu_asize", KSTAT_DATA_UINT64 },593{ "l2_bufc_data_asize", KSTAT_DATA_UINT64 },594{ "l2_bufc_metadata_asize", KSTAT_DATA_UINT64 },595{ "l2_feeds", KSTAT_DATA_UINT64 },596{ "l2_rw_clash", KSTAT_DATA_UINT64 },597{ "l2_read_bytes", KSTAT_DATA_UINT64 },598{ "l2_write_bytes", KSTAT_DATA_UINT64 },599{ "l2_writes_sent", KSTAT_DATA_UINT64 },600{ "l2_writes_done", KSTAT_DATA_UINT64 },601{ "l2_writes_error", KSTAT_DATA_UINT64 },602{ "l2_writes_lock_retry", KSTAT_DATA_UINT64 },603{ "l2_evict_lock_retry", KSTAT_DATA_UINT64 },604{ "l2_evict_reading", KSTAT_DATA_UINT64 },605{ "l2_evict_l1cached", KSTAT_DATA_UINT64 },606{ "l2_free_on_write", KSTAT_DATA_UINT64 },607{ "l2_abort_lowmem", KSTAT_DATA_UINT64 },608{ "l2_cksum_bad", KSTAT_DATA_UINT64 },609{ "l2_io_error", KSTAT_DATA_UINT64 },610{ "l2_size", KSTAT_DATA_UINT64 },611{ "l2_asize", KSTAT_DATA_UINT64 },612{ "l2_hdr_size", KSTAT_DATA_UINT64 },613{ "l2_log_blk_writes", KSTAT_DATA_UINT64 },614{ "l2_log_blk_avg_asize", KSTAT_DATA_UINT64 },615{ "l2_log_blk_asize", KSTAT_DATA_UINT64 },616{ "l2_log_blk_count", KSTAT_DATA_UINT64 },617{ "l2_data_to_meta_ratio", KSTAT_DATA_UINT64 },618{ "l2_rebuild_success", KSTAT_DATA_UINT64 },619{ "l2_rebuild_unsupported", KSTAT_DATA_UINT64 },620{ "l2_rebuild_io_errors", KSTAT_DATA_UINT64 },621{ "l2_rebuild_dh_errors", KSTAT_DATA_UINT64 },622{ "l2_rebuild_cksum_lb_errors", KSTAT_DATA_UINT64 },623{ "l2_rebuild_lowmem", KSTAT_DATA_UINT64 },624{ "l2_rebuild_size", KSTAT_DATA_UINT64 },625{ "l2_rebuild_asize", KSTAT_DATA_UINT64 },626{ "l2_rebuild_bufs", KSTAT_DATA_UINT64 },627{ "l2_rebuild_bufs_precached", KSTAT_DATA_UINT64 },628{ "l2_rebuild_log_blks", KSTAT_DATA_UINT64 },629{ "memory_throttle_count", KSTAT_DATA_UINT64 },630{ "memory_direct_count", KSTAT_DATA_UINT64 },631{ "memory_indirect_count", KSTAT_DATA_UINT64 },632{ "memory_all_bytes", KSTAT_DATA_UINT64 },633{ "memory_free_bytes", KSTAT_DATA_UINT64 },634{ 
"memory_available_bytes", KSTAT_DATA_INT64 },635{ "arc_no_grow", KSTAT_DATA_UINT64 },636{ "arc_tempreserve", KSTAT_DATA_UINT64 },637{ "arc_loaned_bytes", KSTAT_DATA_UINT64 },638{ "arc_prune", KSTAT_DATA_UINT64 },639{ "arc_meta_used", KSTAT_DATA_UINT64 },640{ "arc_dnode_limit", KSTAT_DATA_UINT64 },641{ "async_upgrade_sync", KSTAT_DATA_UINT64 },642{ "predictive_prefetch", KSTAT_DATA_UINT64 },643{ "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },644{ "demand_iohit_predictive_prefetch", KSTAT_DATA_UINT64 },645{ "prescient_prefetch", KSTAT_DATA_UINT64 },646{ "demand_hit_prescient_prefetch", KSTAT_DATA_UINT64 },647{ "demand_iohit_prescient_prefetch", KSTAT_DATA_UINT64 },648{ "arc_need_free", KSTAT_DATA_UINT64 },649{ "arc_sys_free", KSTAT_DATA_UINT64 },650{ "arc_raw_size", KSTAT_DATA_UINT64 },651{ "cached_only_in_progress", KSTAT_DATA_UINT64 },652{ "abd_chunk_waste_size", KSTAT_DATA_UINT64 },653};654655arc_sums_t arc_sums;656657#define ARCSTAT_MAX(stat, val) { \658uint64_t m; \659while ((val) > (m = arc_stats.stat.value.ui64) && \660(m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \661continue; \662}663664/*665* We define a macro to allow ARC hits/misses to be easily broken down by666* two separate conditions, giving a total of four different subtypes for667* each of hits and misses (so eight statistics total).668*/669#define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \670if (cond1) { \671if (cond2) { \672ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \673} else { \674ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \675} \676} else { \677if (cond2) { \678ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \679} else { \680ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\681} \682}683684/*685* This macro allows us to use kstats as floating averages. Each time we686* update this kstat, we first factor it and the update value by687* ARCSTAT_AVG_FACTOR to shrink the new value's contribution to the overall688* average. This macro assumes that integer loads and stores are atomic, but689* is not safe for multiple writers updating the kstat in parallel (only the690* last writer's update will remain).691*/692#define ARCSTAT_F_AVG_FACTOR 3693#define ARCSTAT_F_AVG(stat, value) \694do { \695uint64_t x = ARCSTAT(stat); \696x = x - x / ARCSTAT_F_AVG_FACTOR + \697(value) / ARCSTAT_F_AVG_FACTOR; \698ARCSTAT(stat) = x; \699} while (0)700701static kstat_t *arc_ksp;702703/*704* There are several ARC variables that are critical to export as kstats --705* but we don't want to have to grovel around in the kstat whenever we wish to706* manipulate them. For these variables, we therefore define them to be in707* terms of the statistic variable. 
This assures that we are not introducing708* the possibility of inconsistency by having shadow copies of the variables,709* while still allowing the code to be readable.710*/711#define arc_tempreserve ARCSTAT(arcstat_tempreserve)712#define arc_loaned_bytes ARCSTAT(arcstat_loaned_bytes)713#define arc_dnode_limit ARCSTAT(arcstat_dnode_limit) /* max size for dnodes */714#define arc_need_free ARCSTAT(arcstat_need_free) /* waiting to be evicted */715716hrtime_t arc_growtime;717list_t arc_prune_list;718kmutex_t arc_prune_mtx;719taskq_t *arc_prune_taskq;720721#define GHOST_STATE(state) \722((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \723(state) == arc_l2c_only)724725#define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)726#define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)727#define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_FLAG_IO_ERROR)728#define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_FLAG_PREFETCH)729#define HDR_PRESCIENT_PREFETCH(hdr) \730((hdr)->b_flags & ARC_FLAG_PRESCIENT_PREFETCH)731#define HDR_COMPRESSION_ENABLED(hdr) \732((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)733734#define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_FLAG_L2CACHE)735#define HDR_UNCACHED(hdr) ((hdr)->b_flags & ARC_FLAG_UNCACHED)736#define HDR_L2_READING(hdr) \737(((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) && \738((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))739#define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITING)740#define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)741#define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)742#define HDR_PROTECTED(hdr) ((hdr)->b_flags & ARC_FLAG_PROTECTED)743#define HDR_NOAUTH(hdr) ((hdr)->b_flags & ARC_FLAG_NOAUTH)744#define HDR_SHARED_DATA(hdr) ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)745746#define HDR_ISTYPE_METADATA(hdr) \747((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)748#define HDR_ISTYPE_DATA(hdr) (!HDR_ISTYPE_METADATA(hdr))749750#define HDR_HAS_L1HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)751#define HDR_HAS_L2HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)752#define HDR_HAS_RABD(hdr) \753(HDR_HAS_L1HDR(hdr) && HDR_PROTECTED(hdr) && \754(hdr)->b_crypt_hdr.b_rabd != NULL)755#define HDR_ENCRYPTED(hdr) \756(HDR_PROTECTED(hdr) && DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))757#define HDR_AUTHENTICATED(hdr) \758(HDR_PROTECTED(hdr) && !DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))759760/* For storing compression mode in b_flags */761#define HDR_COMPRESS_OFFSET (highbit64(ARC_FLAG_COMPRESS_0) - 1)762763#define HDR_GET_COMPRESS(hdr) ((enum zio_compress)BF32_GET((hdr)->b_flags, \764HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))765#define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \766HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp));767768#define ARC_BUF_LAST(buf) ((buf)->b_next == NULL)769#define ARC_BUF_SHARED(buf) ((buf)->b_flags & ARC_BUF_FLAG_SHARED)770#define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)771#define ARC_BUF_ENCRYPTED(buf) ((buf)->b_flags & ARC_BUF_FLAG_ENCRYPTED)772773/*774* Other sizes775*/776777#define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))778#define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))779780/*781* Hash table routines782*/783784#define BUF_LOCKS 2048785typedef struct buf_hash_table {786uint64_t ht_mask;787arc_buf_hdr_t **ht_table;788kmutex_t ht_locks[BUF_LOCKS] ____cacheline_aligned;789} buf_hash_table_t;790791static buf_hash_table_t buf_hash_table;792793#define BUF_HASH_INDEX(spa, dva, birth) \794(buf_hash(spa, dva, birth) & 
buf_hash_table.ht_mask)795#define BUF_HASH_LOCK(idx) (&buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])796#define HDR_LOCK(hdr) \797(BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))798799uint64_t zfs_crc64_table[256];800801/*802* Asynchronous ARC flush803*804* We track these in a list for arc_async_flush_guid_inuse().805* Used for both L1 and L2 async teardown.806*/807static list_t arc_async_flush_list;808static kmutex_t arc_async_flush_lock;809810typedef struct arc_async_flush {811uint64_t af_spa_guid;812taskq_ent_t af_tqent;813uint_t af_cache_level; /* 1 or 2 to differentiate node */814list_node_t af_node;815} arc_async_flush_t;816817818/*819* Level 2 ARC820*/821822#define L2ARC_WRITE_SIZE (32 * 1024 * 1024) /* initial write max */823#define L2ARC_HEADROOM 8 /* num of writes */824825/*826* If we discover during ARC scan any buffers to be compressed, we boost827* our headroom for the next scanning cycle by this percentage multiple.828*/829#define L2ARC_HEADROOM_BOOST 200830#define L2ARC_FEED_SECS 1 /* caching interval secs */831#define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */832833/*834* We can feed L2ARC from two states of ARC buffers, mru and mfu,835* and each of the state has two types: data and metadata.836*/837#define L2ARC_FEED_TYPES 4838839/* L2ARC Performance Tunables */840uint64_t l2arc_write_max = L2ARC_WRITE_SIZE; /* def max write size */841uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra warmup write */842uint64_t l2arc_headroom = L2ARC_HEADROOM; /* # of dev writes */843uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;844uint64_t l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */845uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval msecs */846int l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */847int l2arc_feed_again = B_TRUE; /* turbo warmup */848int l2arc_norw = B_FALSE; /* no reads during writes */849static uint_t l2arc_meta_percent = 33; /* limit on headers size */850851/*852* L2ARC Internals853*/854static list_t L2ARC_dev_list; /* device list */855static list_t *l2arc_dev_list; /* device list pointer */856static kmutex_t l2arc_dev_mtx; /* device list mutex */857static l2arc_dev_t *l2arc_dev_last; /* last device used */858static list_t L2ARC_free_on_write; /* free after write buf list */859static list_t *l2arc_free_on_write; /* free after write list ptr */860static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */861static uint64_t l2arc_ndev; /* number of devices */862863typedef struct l2arc_read_callback {864arc_buf_hdr_t *l2rcb_hdr; /* read header */865blkptr_t l2rcb_bp; /* original blkptr */866zbookmark_phys_t l2rcb_zb; /* original bookmark */867int l2rcb_flags; /* original flags */868abd_t *l2rcb_abd; /* temporary buffer */869} l2arc_read_callback_t;870871typedef struct l2arc_data_free {872/* protected by l2arc_free_on_write_mtx */873abd_t *l2df_abd;874size_t l2df_size;875arc_buf_contents_t l2df_type;876list_node_t l2df_list_node;877} l2arc_data_free_t;878879typedef enum arc_fill_flags {880ARC_FILL_LOCKED = 1 << 0, /* hdr lock is held */881ARC_FILL_COMPRESSED = 1 << 1, /* fill with compressed data */882ARC_FILL_ENCRYPTED = 1 << 2, /* fill with encrypted data */883ARC_FILL_NOAUTH = 1 << 3, /* don't attempt to authenticate */884ARC_FILL_IN_PLACE = 1 << 4 /* fill in place (special case) */885} arc_fill_flags_t;886887typedef enum arc_ovf_level {888ARC_OVF_NONE, /* ARC within target size. */889ARC_OVF_SOME, /* ARC is slightly overflowed. */890ARC_OVF_SEVERE /* ARC is severely overflowed. 
*/891} arc_ovf_level_t;892893static kmutex_t l2arc_feed_thr_lock;894static kcondvar_t l2arc_feed_thr_cv;895static uint8_t l2arc_thread_exit;896897static kmutex_t l2arc_rebuild_thr_lock;898static kcondvar_t l2arc_rebuild_thr_cv;899900enum arc_hdr_alloc_flags {901ARC_HDR_ALLOC_RDATA = 0x1,902ARC_HDR_USE_RESERVE = 0x4,903ARC_HDR_ALLOC_LINEAR = 0x8,904};905906907static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, const void *, int);908static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, const void *);909static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, const void *, int);910static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, const void *);911static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, const void *);912static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size,913const void *tag);914static void arc_hdr_free_abd(arc_buf_hdr_t *, boolean_t);915static void arc_hdr_alloc_abd(arc_buf_hdr_t *, int);916static void arc_hdr_destroy(arc_buf_hdr_t *);917static void arc_access(arc_buf_hdr_t *, arc_flags_t, boolean_t);918static void arc_buf_watch(arc_buf_t *);919static void arc_change_state(arc_state_t *, arc_buf_hdr_t *);920921static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);922static uint32_t arc_bufc_to_flags(arc_buf_contents_t);923static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);924static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);925926static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);927static void l2arc_read_done(zio_t *);928static void l2arc_do_free_on_write(void);929static void l2arc_hdr_arcstats_update(arc_buf_hdr_t *hdr, boolean_t incr,930boolean_t state_only);931932static void arc_prune_async(uint64_t adjust);933934#define l2arc_hdr_arcstats_increment(hdr) \935l2arc_hdr_arcstats_update((hdr), B_TRUE, B_FALSE)936#define l2arc_hdr_arcstats_decrement(hdr) \937l2arc_hdr_arcstats_update((hdr), B_FALSE, B_FALSE)938#define l2arc_hdr_arcstats_increment_state(hdr) \939l2arc_hdr_arcstats_update((hdr), B_TRUE, B_TRUE)940#define l2arc_hdr_arcstats_decrement_state(hdr) \941l2arc_hdr_arcstats_update((hdr), B_FALSE, B_TRUE)942943/*944* l2arc_exclude_special : A zfs module parameter that controls whether buffers945* present on special vdevs are eligibile for caching in L2ARC. If946* set to 1, exclude dbufs on special vdevs from being cached to947* L2ARC.948*/949int l2arc_exclude_special = 0;950951/*952* l2arc_mfuonly : A ZFS module parameter that controls whether only MFU953* metadata and data are cached from ARC into L2ARC.954*/955static int l2arc_mfuonly = 0;956957/*958* L2ARC TRIM959* l2arc_trim_ahead : A ZFS module parameter that controls how much ahead of960* the current write size (l2arc_write_max) we should TRIM if we961* have filled the device. It is defined as a percentage of the962* write size. If set to 100 we trim twice the space required to963* accommodate upcoming writes. A minimum of 64MB will be trimmed.964* It also enables TRIM of the whole L2ARC device upon creation or965* addition to an existing pool or if the header of the device is966* invalid upon importing a pool or onlining a cache device. The967* default is 0, which disables TRIM on L2ARC altogether as it can968* put significant stress on the underlying storage devices. 
This969* will vary depending of how well the specific device handles970* these commands.971*/972static uint64_t l2arc_trim_ahead = 0;973974/*975* Performance tuning of L2ARC persistence:976*977* l2arc_rebuild_enabled : A ZFS module parameter that controls whether adding978* an L2ARC device (either at pool import or later) will attempt979* to rebuild L2ARC buffer contents.980* l2arc_rebuild_blocks_min_l2size : A ZFS module parameter that controls981* whether log blocks are written to the L2ARC device. If the L2ARC982* device is less than 1GB, the amount of data l2arc_evict()983* evicts is significant compared to the amount of restored L2ARC984* data. In this case do not write log blocks in L2ARC in order985* not to waste space.986*/987static int l2arc_rebuild_enabled = B_TRUE;988static uint64_t l2arc_rebuild_blocks_min_l2size = 1024 * 1024 * 1024;989990/* L2ARC persistence rebuild control routines. */991void l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen);992static __attribute__((noreturn)) void l2arc_dev_rebuild_thread(void *arg);993static int l2arc_rebuild(l2arc_dev_t *dev);994995/* L2ARC persistence read I/O routines. */996static int l2arc_dev_hdr_read(l2arc_dev_t *dev);997static int l2arc_log_blk_read(l2arc_dev_t *dev,998const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp,999l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,1000zio_t *this_io, zio_t **next_io);1001static zio_t *l2arc_log_blk_fetch(vdev_t *vd,1002const l2arc_log_blkptr_t *lp, l2arc_log_blk_phys_t *lb);1003static void l2arc_log_blk_fetch_abort(zio_t *zio);10041005/* L2ARC persistence block restoration routines. */1006static void l2arc_log_blk_restore(l2arc_dev_t *dev,1007const l2arc_log_blk_phys_t *lb, uint64_t lb_asize);1008static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le,1009l2arc_dev_t *dev);10101011/* L2ARC persistence write I/O routines. */1012static uint64_t l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,1013l2arc_write_callback_t *cb);10141015/* L2ARC persistence auxiliary routines. */1016boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev,1017const l2arc_log_blkptr_t *lbp);1018static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev,1019const arc_buf_hdr_t *ab);1020boolean_t l2arc_range_check_overlap(uint64_t bottom,1021uint64_t top, uint64_t check);1022static void l2arc_blk_fetch_done(zio_t *zio);1023static inline uint64_t1024l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev);10251026/*1027* We use Cityhash for this. 
It's fast, and has good hash properties without1028* requiring any large static buffers.1029*/1030static uint64_t1031buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)1032{1033return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth));1034}10351036#define HDR_EMPTY(hdr) \1037((hdr)->b_dva.dva_word[0] == 0 && \1038(hdr)->b_dva.dva_word[1] == 0)10391040#define HDR_EMPTY_OR_LOCKED(hdr) \1041(HDR_EMPTY(hdr) || MUTEX_HELD(HDR_LOCK(hdr)))10421043#define HDR_EQUAL(spa, dva, birth, hdr) \1044((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \1045((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \1046((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)10471048static void1049buf_discard_identity(arc_buf_hdr_t *hdr)1050{1051hdr->b_dva.dva_word[0] = 0;1052hdr->b_dva.dva_word[1] = 0;1053hdr->b_birth = 0;1054}10551056static arc_buf_hdr_t *1057buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)1058{1059const dva_t *dva = BP_IDENTITY(bp);1060uint64_t birth = BP_GET_PHYSICAL_BIRTH(bp);1061uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);1062kmutex_t *hash_lock = BUF_HASH_LOCK(idx);1063arc_buf_hdr_t *hdr;10641065mutex_enter(hash_lock);1066for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;1067hdr = hdr->b_hash_next) {1068if (HDR_EQUAL(spa, dva, birth, hdr)) {1069*lockp = hash_lock;1070return (hdr);1071}1072}1073mutex_exit(hash_lock);1074*lockp = NULL;1075return (NULL);1076}10771078/*1079* Insert an entry into the hash table. If there is already an element1080* equal to elem in the hash table, then the already existing element1081* will be returned and the new element will not be inserted.1082* Otherwise returns NULL.1083* If lockp == NULL, the caller is assumed to already hold the hash lock.1084*/1085static arc_buf_hdr_t *1086buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)1087{1088uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);1089kmutex_t *hash_lock = BUF_HASH_LOCK(idx);1090arc_buf_hdr_t *fhdr;1091uint32_t i;10921093ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));1094ASSERT(hdr->b_birth != 0);1095ASSERT(!HDR_IN_HASH_TABLE(hdr));10961097if (lockp != NULL) {1098*lockp = hash_lock;1099mutex_enter(hash_lock);1100} else {1101ASSERT(MUTEX_HELD(hash_lock));1102}11031104for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;1105fhdr = fhdr->b_hash_next, i++) {1106if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))1107return (fhdr);1108}11091110hdr->b_hash_next = buf_hash_table.ht_table[idx];1111buf_hash_table.ht_table[idx] = hdr;1112arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);11131114/* collect some hash table performance data */1115if (i > 0) {1116ARCSTAT_BUMP(arcstat_hash_collisions);1117if (i == 1)1118ARCSTAT_BUMP(arcstat_hash_chains);1119ARCSTAT_MAX(arcstat_hash_chain_max, i);1120}1121ARCSTAT_BUMP(arcstat_hash_elements);11221123return (NULL);1124}11251126static void1127buf_hash_remove(arc_buf_hdr_t *hdr)1128{1129arc_buf_hdr_t *fhdr, **hdrp;1130uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);11311132ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));1133ASSERT(HDR_IN_HASH_TABLE(hdr));11341135hdrp = &buf_hash_table.ht_table[idx];1136while ((fhdr = *hdrp) != hdr) {1137ASSERT3P(fhdr, !=, NULL);1138hdrp = &fhdr->b_hash_next;1139}1140*hdrp = hdr->b_hash_next;1141hdr->b_hash_next = NULL;1142arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);11431144/* collect some hash table performance data */1145ARCSTAT_BUMPDOWN(arcstat_hash_elements);1146if (buf_hash_table.ht_table[idx] &&1147buf_hash_table.ht_table[idx]->b_hash_next == 
NULL)1148ARCSTAT_BUMPDOWN(arcstat_hash_chains);1149}11501151/*1152* Global data structures and functions for the buf kmem cache.1153*/11541155static kmem_cache_t *hdr_full_cache;1156static kmem_cache_t *hdr_l2only_cache;1157static kmem_cache_t *buf_cache;11581159static void1160buf_fini(void)1161{1162#if defined(_KERNEL)1163/*1164* Large allocations which do not require contiguous pages1165* should be using vmem_free() in the linux kernel.1166*/1167vmem_free(buf_hash_table.ht_table,1168(buf_hash_table.ht_mask + 1) * sizeof (void *));1169#else1170kmem_free(buf_hash_table.ht_table,1171(buf_hash_table.ht_mask + 1) * sizeof (void *));1172#endif1173for (int i = 0; i < BUF_LOCKS; i++)1174mutex_destroy(BUF_HASH_LOCK(i));1175kmem_cache_destroy(hdr_full_cache);1176kmem_cache_destroy(hdr_l2only_cache);1177kmem_cache_destroy(buf_cache);1178}11791180/*1181* Constructor callback - called when the cache is empty1182* and a new buf is requested.1183*/1184static int1185hdr_full_cons(void *vbuf, void *unused, int kmflag)1186{1187(void) unused, (void) kmflag;1188arc_buf_hdr_t *hdr = vbuf;11891190memset(hdr, 0, HDR_FULL_SIZE);1191hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;1192zfs_refcount_create(&hdr->b_l1hdr.b_refcnt);1193#ifdef ZFS_DEBUG1194mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);1195#endif1196multilist_link_init(&hdr->b_l1hdr.b_arc_node);1197list_link_init(&hdr->b_l2hdr.b_l2node);1198arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS);11991200return (0);1201}12021203static int1204hdr_l2only_cons(void *vbuf, void *unused, int kmflag)1205{1206(void) unused, (void) kmflag;1207arc_buf_hdr_t *hdr = vbuf;12081209memset(hdr, 0, HDR_L2ONLY_SIZE);1210arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);12111212return (0);1213}12141215static int1216buf_cons(void *vbuf, void *unused, int kmflag)1217{1218(void) unused, (void) kmflag;1219arc_buf_t *buf = vbuf;12201221memset(buf, 0, sizeof (arc_buf_t));1222arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);12231224return (0);1225}12261227/*1228* Destructor callback - called when a cached buf is1229* no longer required.1230*/1231static void1232hdr_full_dest(void *vbuf, void *unused)1233{1234(void) unused;1235arc_buf_hdr_t *hdr = vbuf;12361237ASSERT(HDR_EMPTY(hdr));1238zfs_refcount_destroy(&hdr->b_l1hdr.b_refcnt);1239#ifdef ZFS_DEBUG1240mutex_destroy(&hdr->b_l1hdr.b_freeze_lock);1241#endif1242ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));1243arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS);1244}12451246static void1247hdr_l2only_dest(void *vbuf, void *unused)1248{1249(void) unused;1250arc_buf_hdr_t *hdr = vbuf;12511252ASSERT(HDR_EMPTY(hdr));1253arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);1254}12551256static void1257buf_dest(void *vbuf, void *unused)1258{1259(void) unused;1260(void) vbuf;12611262arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);1263}12641265static void1266buf_init(void)1267{1268uint64_t *ct = NULL;1269uint64_t hsize = 1ULL << 12;1270int i, j;12711272/*1273* The hash table is big enough to fill all of physical memory1274* with an average block size of zfs_arc_average_blocksize (default 8K).1275* By default, the table will take up1276* totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).1277*/1278while (hsize * zfs_arc_average_blocksize < arc_all_memory())1279hsize <<= 1;1280retry:1281buf_hash_table.ht_mask = hsize - 1;1282#if defined(_KERNEL)1283/*1284* Large allocations which do not require contiguous pages1285* should be using vmem_alloc() in the linux kernel1286*/1287buf_hash_table.ht_table 
=1288vmem_zalloc(hsize * sizeof (void*), KM_SLEEP);1289#else1290buf_hash_table.ht_table =1291kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);1292#endif1293if (buf_hash_table.ht_table == NULL) {1294ASSERT(hsize > (1ULL << 8));1295hsize >>= 1;1296goto retry;1297}12981299hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,13000, hdr_full_cons, hdr_full_dest, NULL, NULL, NULL, KMC_RECLAIMABLE);1301hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",1302HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, NULL,1303NULL, NULL, 0);1304buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),13050, buf_cons, buf_dest, NULL, NULL, NULL, 0);13061307for (i = 0; i < 256; i++)1308for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)1309*ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);13101311for (i = 0; i < BUF_LOCKS; i++)1312mutex_init(BUF_HASH_LOCK(i), NULL, MUTEX_DEFAULT, NULL);1313}13141315#define ARC_MINTIME (hz>>4) /* 62 ms */13161317/*1318* This is the size that the buf occupies in memory. If the buf is compressed,1319* it will correspond to the compressed size. You should use this method of1320* getting the buf size unless you explicitly need the logical size.1321*/1322uint64_t1323arc_buf_size(arc_buf_t *buf)1324{1325return (ARC_BUF_COMPRESSED(buf) ?1326HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));1327}13281329uint64_t1330arc_buf_lsize(arc_buf_t *buf)1331{1332return (HDR_GET_LSIZE(buf->b_hdr));1333}13341335/*1336* This function will return B_TRUE if the buffer is encrypted in memory.1337* This buffer can be decrypted by calling arc_untransform().1338*/1339boolean_t1340arc_is_encrypted(arc_buf_t *buf)1341{1342return (ARC_BUF_ENCRYPTED(buf) != 0);1343}13441345/*1346* Returns B_TRUE if the buffer represents data that has not had its MAC1347* verified yet.1348*/1349boolean_t1350arc_is_unauthenticated(arc_buf_t *buf)1351{1352return (HDR_NOAUTH(buf->b_hdr) != 0);1353}13541355void1356arc_get_raw_params(arc_buf_t *buf, boolean_t *byteorder, uint8_t *salt,1357uint8_t *iv, uint8_t *mac)1358{1359arc_buf_hdr_t *hdr = buf->b_hdr;13601361ASSERT(HDR_PROTECTED(hdr));13621363memcpy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN);1364memcpy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN);1365memcpy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN);1366*byteorder = (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?1367ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;1368}13691370/*1371* Indicates how this buffer is compressed in memory. If it is not compressed1372* the value will be ZIO_COMPRESS_OFF. It can be made normally readable with1373* arc_untransform() as long as it is also unencrypted.1374*/1375enum zio_compress1376arc_get_compression(arc_buf_t *buf)1377{1378return (ARC_BUF_COMPRESSED(buf) ?1379HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF);1380}13811382/*1383* Return the compression algorithm used to store this data in the ARC. If ARC1384* compression is enabled or this is an encrypted block, this will be the same1385* as what's used to store it on-disk. 
Otherwise, this will be ZIO_COMPRESS_OFF.1386*/1387static inline enum zio_compress1388arc_hdr_get_compress(arc_buf_hdr_t *hdr)1389{1390return (HDR_COMPRESSION_ENABLED(hdr) ?1391HDR_GET_COMPRESS(hdr) : ZIO_COMPRESS_OFF);1392}13931394uint8_t1395arc_get_complevel(arc_buf_t *buf)1396{1397return (buf->b_hdr->b_complevel);1398}13991400__maybe_unused1401static inline boolean_t1402arc_buf_is_shared(arc_buf_t *buf)1403{1404boolean_t shared = (buf->b_data != NULL &&1405buf->b_hdr->b_l1hdr.b_pabd != NULL &&1406abd_is_linear(buf->b_hdr->b_l1hdr.b_pabd) &&1407buf->b_data == abd_to_buf(buf->b_hdr->b_l1hdr.b_pabd));1408IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr));1409EQUIV(shared, ARC_BUF_SHARED(buf));1410IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));14111412/*1413* It would be nice to assert arc_can_share() too, but the "hdr isn't1414* already being shared" requirement prevents us from doing that.1415*/14161417return (shared);1418}14191420/*1421* Free the checksum associated with this header. If there is no checksum, this1422* is a no-op.1423*/1424static inline void1425arc_cksum_free(arc_buf_hdr_t *hdr)1426{1427#ifdef ZFS_DEBUG1428ASSERT(HDR_HAS_L1HDR(hdr));14291430mutex_enter(&hdr->b_l1hdr.b_freeze_lock);1431if (hdr->b_l1hdr.b_freeze_cksum != NULL) {1432kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));1433hdr->b_l1hdr.b_freeze_cksum = NULL;1434}1435mutex_exit(&hdr->b_l1hdr.b_freeze_lock);1436#endif1437}14381439/*1440* Return true iff at least one of the bufs on hdr is not compressed.1441* Encrypted buffers count as compressed.1442*/1443static boolean_t1444arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)1445{1446ASSERT(hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY_OR_LOCKED(hdr));14471448for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) {1449if (!ARC_BUF_COMPRESSED(b)) {1450return (B_TRUE);1451}1452}1453return (B_FALSE);1454}145514561457/*1458* If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data1459* matches the checksum that is stored in the hdr. If there is no checksum,1460* or if the buf is compressed, this is a no-op.1461*/1462static void1463arc_cksum_verify(arc_buf_t *buf)1464{1465#ifdef ZFS_DEBUG1466arc_buf_hdr_t *hdr = buf->b_hdr;1467zio_cksum_t zc;14681469if (!(zfs_flags & ZFS_DEBUG_MODIFY))1470return;14711472if (ARC_BUF_COMPRESSED(buf))1473return;14741475ASSERT(HDR_HAS_L1HDR(hdr));14761477mutex_enter(&hdr->b_l1hdr.b_freeze_lock);14781479if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {1480mutex_exit(&hdr->b_l1hdr.b_freeze_lock);1481return;1482}14831484fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);1485if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))1486panic("buffer modified while frozen!");1487mutex_exit(&hdr->b_l1hdr.b_freeze_lock);1488#endif1489}14901491/*1492* This function makes the assumption that data stored in the L2ARC1493* will be transformed exactly as it is in the main pool. Because of1494* this we can verify the checksum against the reading process's bp.1495*/1496static boolean_t1497arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)1498{1499ASSERT(!BP_IS_EMBEDDED(zio->io_bp));1500VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));15011502/*1503* Block pointers always store the checksum for the logical data.1504* If the block pointer has the gang bit set, then the checksum1505* it represents is for the reconstituted data and not for an1506* individual gang member. 
The zio pipeline, however, must be able to1507* determine the checksum of each of the gang constituents so it1508* treats the checksum comparison differently than what we need1509* for l2arc blocks. This prevents us from using the1510* zio_checksum_error() interface directly. Instead we must call the1511* zio_checksum_error_impl() so that we can ensure the checksum is1512* generated using the correct checksum algorithm and accounts for the1513* logical I/O size and not just a gang fragment.1514*/1515return (zio_checksum_error_impl(zio->io_spa, zio->io_bp,1516BP_GET_CHECKSUM(zio->io_bp), zio->io_abd, zio->io_size,1517zio->io_offset, NULL) == 0);1518}15191520/*1521* Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a1522* checksum and attaches it to the buf's hdr so that we can ensure that the buf1523* isn't modified later on. If buf is compressed or there is already a checksum1524* on the hdr, this is a no-op (we only checksum uncompressed bufs).1525*/1526static void1527arc_cksum_compute(arc_buf_t *buf)1528{1529if (!(zfs_flags & ZFS_DEBUG_MODIFY))1530return;15311532#ifdef ZFS_DEBUG1533arc_buf_hdr_t *hdr = buf->b_hdr;1534ASSERT(HDR_HAS_L1HDR(hdr));1535mutex_enter(&hdr->b_l1hdr.b_freeze_lock);1536if (hdr->b_l1hdr.b_freeze_cksum != NULL || ARC_BUF_COMPRESSED(buf)) {1537mutex_exit(&hdr->b_l1hdr.b_freeze_lock);1538return;1539}15401541ASSERT(!ARC_BUF_ENCRYPTED(buf));1542ASSERT(!ARC_BUF_COMPRESSED(buf));1543hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),1544KM_SLEEP);1545fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,1546hdr->b_l1hdr.b_freeze_cksum);1547mutex_exit(&hdr->b_l1hdr.b_freeze_lock);1548#endif1549arc_buf_watch(buf);1550}15511552#ifndef _KERNEL1553void1554arc_buf_sigsegv(int sig, siginfo_t *si, void *unused)1555{1556(void) sig, (void) unused;1557panic("Got SIGSEGV at address: 0x%lx\n", (long)si->si_addr);1558}1559#endif15601561static void1562arc_buf_unwatch(arc_buf_t *buf)1563{1564#ifndef _KERNEL1565if (arc_watch) {1566ASSERT0(mprotect(buf->b_data, arc_buf_size(buf),1567PROT_READ | PROT_WRITE));1568}1569#else1570(void) buf;1571#endif1572}15731574static void1575arc_buf_watch(arc_buf_t *buf)1576{1577#ifndef _KERNEL1578if (arc_watch)1579ASSERT0(mprotect(buf->b_data, arc_buf_size(buf),1580PROT_READ));1581#else1582(void) buf;1583#endif1584}15851586static arc_buf_contents_t1587arc_buf_type(arc_buf_hdr_t *hdr)1588{1589arc_buf_contents_t type;1590if (HDR_ISTYPE_METADATA(hdr)) {1591type = ARC_BUFC_METADATA;1592} else {1593type = ARC_BUFC_DATA;1594}1595VERIFY3U(hdr->b_type, ==, type);1596return (type);1597}15981599boolean_t1600arc_is_metadata(arc_buf_t *buf)1601{1602return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0);1603}16041605static uint32_t1606arc_bufc_to_flags(arc_buf_contents_t type)1607{1608switch (type) {1609case ARC_BUFC_DATA:1610/* metadata field is 0 if buffer contains normal data */1611return (0);1612case ARC_BUFC_METADATA:1613return (ARC_FLAG_BUFC_METADATA);1614default:1615break;1616}1617panic("undefined ARC buffer type!");1618return ((uint32_t)-1);1619}16201621void1622arc_buf_thaw(arc_buf_t *buf)1623{1624arc_buf_hdr_t *hdr = buf->b_hdr;16251626ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);1627ASSERT(!HDR_IO_IN_PROGRESS(hdr));16281629arc_cksum_verify(buf);16301631/*1632* Compressed buffers do not manipulate the b_freeze_cksum.1633*/1634if (ARC_BUF_COMPRESSED(buf))1635return;16361637ASSERT(HDR_HAS_L1HDR(hdr));1638arc_cksum_free(hdr);1639arc_buf_unwatch(buf);1640}16411642void1643arc_buf_freeze(arc_buf_t *buf)1644{1645if (!(zfs_flags & 
ZFS_DEBUG_MODIFY))
		return;

	if (ARC_BUF_COMPRESSED(buf))
		return;

	ASSERT(HDR_HAS_L1HDR(buf->b_hdr));
	arc_cksum_compute(buf);
}

/*
 * The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
 * the following functions should be used to ensure that the flags are
 * updated in a thread-safe way. When manipulating the flags either
 * the hash_lock must be held or the hdr must be undiscoverable. This
 * ensures that we're not racing with any other threads when updating
 * the flags.
 */
static inline void
arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
{
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
	hdr->b_flags |= flags;
}

static inline void
arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
{
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
	hdr->b_flags &= ~flags;
}

/*
 * Setting the compression bits in the arc_buf_hdr_t's b_flags is
 * done in a special way since we have to clear and set bits
 * at the same time. Consumers that wish to set the compression bits
 * must use this function to ensure that the flags are updated in a
 * thread-safe manner.
 */
static void
arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp)
{
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

	/*
	 * Holes and embedded blocks will always have a psize = 0, so we
	 * ignore the compression of the blkptr and mark them as
	 * uncompressed here.
	 */
	if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) {
		arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
		ASSERT(!HDR_COMPRESSION_ENABLED(hdr));
	} else {
		arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
		ASSERT(HDR_COMPRESSION_ENABLED(hdr));
	}

	HDR_SET_COMPRESS(hdr, cmp);
	ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp);
}

/*
 * Looks for another buf on the same hdr which has the data decompressed,
 * copies from it, and returns true. If no such buf exists, returns false.
 */
static boolean_t
arc_buf_try_copy_decompressed_data(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	boolean_t copied = B_FALSE;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT3P(buf->b_data, !=, NULL);
	ASSERT(!ARC_BUF_COMPRESSED(buf));

	for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL;
	    from = from->b_next) {
		/* can't use our own data buffer */
		if (from == buf) {
			continue;
		}

		if (!ARC_BUF_COMPRESSED(from)) {
			memcpy(buf->b_data, from->b_data, arc_buf_size(buf));
			copied = B_TRUE;
			break;
		}
	}

#ifdef ZFS_DEBUG
	/*
	 * There were no decompressed bufs, so there should not be a
	 * checksum on the hdr either.
	 */
	if (zfs_flags & ZFS_DEBUG_MODIFY)
		EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
#endif

	return (copied);
}

/*
 * Allocates an ARC buf header that's in an evicted & L2-cached state.
 * This is used during l2arc reconstruction to make empty ARC buffers
 * which circumvent the regular disk->arc->l2arc path and instead come
 * into being in the reverse order, i.e.
l2arc->arc.1751*/1752static arc_buf_hdr_t *1753arc_buf_alloc_l2only(size_t size, arc_buf_contents_t type, l2arc_dev_t *dev,1754dva_t dva, uint64_t daddr, int32_t psize, uint64_t asize, uint64_t birth,1755enum zio_compress compress, uint8_t complevel, boolean_t protected,1756boolean_t prefetch, arc_state_type_t arcs_state)1757{1758arc_buf_hdr_t *hdr;17591760ASSERT(size != 0);1761ASSERT(dev->l2ad_vdev != NULL);17621763hdr = kmem_cache_alloc(hdr_l2only_cache, KM_SLEEP);1764hdr->b_birth = birth;1765hdr->b_type = type;1766hdr->b_flags = 0;1767arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);1768HDR_SET_LSIZE(hdr, size);1769HDR_SET_PSIZE(hdr, psize);1770HDR_SET_L2SIZE(hdr, asize);1771arc_hdr_set_compress(hdr, compress);1772hdr->b_complevel = complevel;1773if (protected)1774arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);1775if (prefetch)1776arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);1777hdr->b_spa = spa_load_guid(dev->l2ad_vdev->vdev_spa);17781779hdr->b_dva = dva;17801781hdr->b_l2hdr.b_dev = dev;1782hdr->b_l2hdr.b_daddr = daddr;1783hdr->b_l2hdr.b_arcs_state = arcs_state;17841785return (hdr);1786}17871788/*1789* Return the size of the block, b_pabd, that is stored in the arc_buf_hdr_t.1790*/1791static uint64_t1792arc_hdr_size(arc_buf_hdr_t *hdr)1793{1794uint64_t size;17951796if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF &&1797HDR_GET_PSIZE(hdr) > 0) {1798size = HDR_GET_PSIZE(hdr);1799} else {1800ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0);1801size = HDR_GET_LSIZE(hdr);1802}1803return (size);1804}18051806static int1807arc_hdr_authenticate(arc_buf_hdr_t *hdr, spa_t *spa, uint64_t dsobj)1808{1809int ret;1810uint64_t csize;1811uint64_t lsize = HDR_GET_LSIZE(hdr);1812uint64_t psize = HDR_GET_PSIZE(hdr);1813abd_t *abd = hdr->b_l1hdr.b_pabd;1814boolean_t free_abd = B_FALSE;18151816ASSERT(HDR_EMPTY_OR_LOCKED(hdr));1817ASSERT(HDR_AUTHENTICATED(hdr));1818ASSERT3P(abd, !=, NULL);18191820/*1821* The MAC is calculated on the compressed data that is stored on disk.1822* However, if compressed arc is disabled we will only have the1823* decompressed data available to us now. Compress it into a temporary1824* abd so we can verify the MAC. The performance overhead of this will1825* be relatively low, since most objects in an encrypted objset will1826* be encrypted (instead of authenticated) anyway.1827*/1828if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&1829!HDR_COMPRESSION_ENABLED(hdr)) {1830abd = NULL;1831csize = zio_compress_data(HDR_GET_COMPRESS(hdr),1832hdr->b_l1hdr.b_pabd, &abd, lsize, MIN(lsize, psize),1833hdr->b_complevel);1834if (csize >= lsize || csize > psize) {1835ret = SET_ERROR(EIO);1836return (ret);1837}1838ASSERT3P(abd, !=, NULL);1839abd_zero_off(abd, csize, psize - csize);1840free_abd = B_TRUE;1841}18421843/*1844* Authentication is best effort. We authenticate whenever the key is1845* available. 
If we succeed we clear ARC_FLAG_NOAUTH.1846*/1847if (hdr->b_crypt_hdr.b_ot == DMU_OT_OBJSET) {1848ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);1849ASSERT3U(lsize, ==, psize);1850ret = spa_do_crypt_objset_mac_abd(B_FALSE, spa, dsobj, abd,1851psize, hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);1852} else {1853ret = spa_do_crypt_mac_abd(B_FALSE, spa, dsobj, abd, psize,1854hdr->b_crypt_hdr.b_mac);1855}18561857if (ret == 0)1858arc_hdr_clear_flags(hdr, ARC_FLAG_NOAUTH);1859else if (ret == ENOENT)1860ret = 0;18611862if (free_abd)1863abd_free(abd);18641865return (ret);1866}18671868/*1869* This function will take a header that only has raw encrypted data in1870* b_crypt_hdr.b_rabd and decrypt it into a new buffer which is stored in1871* b_l1hdr.b_pabd. If designated in the header flags, this function will1872* also decompress the data.1873*/1874static int1875arc_hdr_decrypt(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb)1876{1877int ret;1878abd_t *cabd = NULL;1879boolean_t no_crypt = B_FALSE;1880boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);18811882ASSERT(HDR_EMPTY_OR_LOCKED(hdr));1883ASSERT(HDR_ENCRYPTED(hdr));18841885arc_hdr_alloc_abd(hdr, 0);18861887ret = spa_do_crypt_abd(B_FALSE, spa, zb, hdr->b_crypt_hdr.b_ot,1888B_FALSE, bswap, hdr->b_crypt_hdr.b_salt, hdr->b_crypt_hdr.b_iv,1889hdr->b_crypt_hdr.b_mac, HDR_GET_PSIZE(hdr), hdr->b_l1hdr.b_pabd,1890hdr->b_crypt_hdr.b_rabd, &no_crypt);1891if (ret != 0)1892goto error;18931894if (no_crypt) {1895abd_copy(hdr->b_l1hdr.b_pabd, hdr->b_crypt_hdr.b_rabd,1896HDR_GET_PSIZE(hdr));1897}18981899/*1900* If this header has disabled arc compression but the b_pabd is1901* compressed after decrypting it, we need to decompress the newly1902* decrypted data.1903*/1904if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&1905!HDR_COMPRESSION_ENABLED(hdr)) {1906/*1907* We want to make sure that we are correctly honoring the1908* zfs_abd_scatter_enabled setting, so we allocate an abd here1909* and then loan a buffer from it, rather than allocating a1910* linear buffer and wrapping it in an abd later.1911*/1912cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr, 0);19131914ret = zio_decompress_data(HDR_GET_COMPRESS(hdr),1915hdr->b_l1hdr.b_pabd, cabd, HDR_GET_PSIZE(hdr),1916HDR_GET_LSIZE(hdr), &hdr->b_complevel);1917if (ret != 0) {1918goto error;1919}19201921arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,1922arc_hdr_size(hdr), hdr);1923hdr->b_l1hdr.b_pabd = cabd;1924}19251926return (0);19271928error:1929arc_hdr_free_abd(hdr, B_FALSE);1930if (cabd != NULL)1931arc_free_data_abd(hdr, cabd, arc_hdr_size(hdr), hdr);19321933return (ret);1934}19351936/*1937* This function is called during arc_buf_fill() to prepare the header's1938* abd plaintext pointer for use. This involves authenticated protected1939* data and decrypting encrypted data into the plaintext abd.1940*/1941static int1942arc_fill_hdr_crypt(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, spa_t *spa,1943const zbookmark_phys_t *zb, boolean_t noauth)1944{1945int ret;19461947ASSERT(HDR_PROTECTED(hdr));19481949if (hash_lock != NULL)1950mutex_enter(hash_lock);19511952if (HDR_NOAUTH(hdr) && !noauth) {1953/*1954* The caller requested authenticated data but our data has1955* not been authenticated yet. 
Verify the MAC now if we can.1956*/1957ret = arc_hdr_authenticate(hdr, spa, zb->zb_objset);1958if (ret != 0)1959goto error;1960} else if (HDR_HAS_RABD(hdr) && hdr->b_l1hdr.b_pabd == NULL) {1961/*1962* If we only have the encrypted version of the data, but the1963* unencrypted version was requested we take this opportunity1964* to store the decrypted version in the header for future use.1965*/1966ret = arc_hdr_decrypt(hdr, spa, zb);1967if (ret != 0)1968goto error;1969}19701971ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);19721973if (hash_lock != NULL)1974mutex_exit(hash_lock);19751976return (0);19771978error:1979if (hash_lock != NULL)1980mutex_exit(hash_lock);19811982return (ret);1983}19841985/*1986* This function is used by the dbuf code to decrypt bonus buffers in place.1987* The dbuf code itself doesn't have any locking for decrypting a shared dnode1988* block, so we use the hash lock here to protect against concurrent calls to1989* arc_buf_fill().1990*/1991static void1992arc_buf_untransform_in_place(arc_buf_t *buf)1993{1994arc_buf_hdr_t *hdr = buf->b_hdr;19951996ASSERT(HDR_ENCRYPTED(hdr));1997ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE);1998ASSERT(HDR_EMPTY_OR_LOCKED(hdr));1999ASSERT3PF(hdr->b_l1hdr.b_pabd, !=, NULL, "hdr %px buf %px", hdr, buf);20002001zio_crypt_copy_dnode_bonus(hdr->b_l1hdr.b_pabd, buf->b_data,2002arc_buf_size(buf));2003buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;2004buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;2005}20062007/*2008* Given a buf that has a data buffer attached to it, this function will2009* efficiently fill the buf with data of the specified compression setting from2010* the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr2011* are already sharing a data buf, no copy is performed.2012*2013* If the buf is marked as compressed but uncompressed data was requested, this2014* will allocate a new data buffer for the buf, remove that flag, and fill the2015* buf with uncompressed data. You can't request a compressed buf on a hdr with2016* uncompressed data, and (since we haven't added support for it yet) if you2017* want compressed data your buf must already be marked as compressed and have2018* the correct-sized data buffer.2019*/2020static int2021arc_buf_fill(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb,2022arc_fill_flags_t flags)2023{2024int error = 0;2025arc_buf_hdr_t *hdr = buf->b_hdr;2026boolean_t hdr_compressed =2027(arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);2028boolean_t compressed = (flags & ARC_FILL_COMPRESSED) != 0;2029boolean_t encrypted = (flags & ARC_FILL_ENCRYPTED) != 0;2030dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap;2031kmutex_t *hash_lock = (flags & ARC_FILL_LOCKED) ? NULL : HDR_LOCK(hdr);20322033ASSERT3P(buf->b_data, !=, NULL);2034IMPLY(compressed, hdr_compressed || ARC_BUF_ENCRYPTED(buf));2035IMPLY(compressed, ARC_BUF_COMPRESSED(buf));2036IMPLY(encrypted, HDR_ENCRYPTED(hdr));2037IMPLY(encrypted, ARC_BUF_ENCRYPTED(buf));2038IMPLY(encrypted, ARC_BUF_COMPRESSED(buf));2039IMPLY(encrypted, !arc_buf_is_shared(buf));20402041/*2042* If the caller wanted encrypted data we just need to copy it from2043* b_rabd and potentially byteswap it. We won't be able to do any2044* further transforms on it.2045*/2046if (encrypted) {2047ASSERT(HDR_HAS_RABD(hdr));2048abd_copy_to_buf(buf->b_data, hdr->b_crypt_hdr.b_rabd,2049HDR_GET_PSIZE(hdr));2050goto byteswap;2051}20522053/*2054* Adjust encrypted and authenticated headers to accommodate2055* the request if needed. 
Dnode blocks (ARC_FILL_IN_PLACE) are2056* allowed to fail decryption due to keys not being loaded2057* without being marked as an IO error.2058*/2059if (HDR_PROTECTED(hdr)) {2060error = arc_fill_hdr_crypt(hdr, hash_lock, spa,2061zb, !!(flags & ARC_FILL_NOAUTH));2062if (error == EACCES && (flags & ARC_FILL_IN_PLACE) != 0) {2063return (error);2064} else if (error != 0) {2065if (hash_lock != NULL)2066mutex_enter(hash_lock);2067arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);2068if (hash_lock != NULL)2069mutex_exit(hash_lock);2070return (error);2071}2072}20732074/*2075* There is a special case here for dnode blocks which are2076* decrypting their bonus buffers. These blocks may request to2077* be decrypted in-place. This is necessary because there may2078* be many dnodes pointing into this buffer and there is2079* currently no method to synchronize replacing the backing2080* b_data buffer and updating all of the pointers. Here we use2081* the hash lock to ensure there are no races. If the need2082* arises for other types to be decrypted in-place, they must2083* add handling here as well.2084*/2085if ((flags & ARC_FILL_IN_PLACE) != 0) {2086ASSERT(!hdr_compressed);2087ASSERT(!compressed);2088ASSERT(!encrypted);20892090if (HDR_ENCRYPTED(hdr) && ARC_BUF_ENCRYPTED(buf)) {2091ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE);20922093if (hash_lock != NULL)2094mutex_enter(hash_lock);2095arc_buf_untransform_in_place(buf);2096if (hash_lock != NULL)2097mutex_exit(hash_lock);20982099/* Compute the hdr's checksum if necessary */2100arc_cksum_compute(buf);2101}21022103return (0);2104}21052106if (hdr_compressed == compressed) {2107if (ARC_BUF_SHARED(buf)) {2108ASSERT(arc_buf_is_shared(buf));2109} else {2110abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd,2111arc_buf_size(buf));2112}2113} else {2114ASSERT(hdr_compressed);2115ASSERT(!compressed);21162117/*2118* If the buf is sharing its data with the hdr, unlink it and2119* allocate a new data buffer for the buf.2120*/2121if (ARC_BUF_SHARED(buf)) {2122ASSERTF(ARC_BUF_COMPRESSED(buf),2123"buf %p was uncompressed", buf);21242125/* We need to give the buf its own b_data */2126buf->b_flags &= ~ARC_BUF_FLAG_SHARED;2127buf->b_data =2128arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);2129arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);21302131/* Previously overhead was 0; just add new overhead */2132ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr));2133} else if (ARC_BUF_COMPRESSED(buf)) {2134ASSERT(!arc_buf_is_shared(buf));21352136/* We need to reallocate the buf's b_data */2137arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr),2138buf);2139buf->b_data =2140arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);21412142/* We increased the size of b_data; update overhead */2143ARCSTAT_INCR(arcstat_overhead_size,2144HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));2145}21462147/*2148* Regardless of the buf's previous compression settings, it2149* should not be compressed at the end of this function.2150*/2151buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;21522153/*2154* Try copying the data from another buf which already has a2155* decompressed version. 
If that's not possible, it's time to2156* bite the bullet and decompress the data from the hdr.2157*/2158if (arc_buf_try_copy_decompressed_data(buf)) {2159/* Skip byteswapping and checksumming (already done) */2160return (0);2161} else {2162abd_t dabd;2163abd_get_from_buf_struct(&dabd, buf->b_data,2164HDR_GET_LSIZE(hdr));2165error = zio_decompress_data(HDR_GET_COMPRESS(hdr),2166hdr->b_l1hdr.b_pabd, &dabd,2167HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr),2168&hdr->b_complevel);2169abd_free(&dabd);21702171/*2172* Absent hardware errors or software bugs, this should2173* be impossible, but log it anyway so we can debug it.2174*/2175if (error != 0) {2176zfs_dbgmsg(2177"hdr %px, compress %d, psize %d, lsize %d",2178hdr, arc_hdr_get_compress(hdr),2179HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));2180if (hash_lock != NULL)2181mutex_enter(hash_lock);2182arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);2183if (hash_lock != NULL)2184mutex_exit(hash_lock);2185return (SET_ERROR(EIO));2186}2187}2188}21892190byteswap:2191/* Byteswap the buf's data if necessary */2192if (bswap != DMU_BSWAP_NUMFUNCS) {2193ASSERT(!HDR_SHARED_DATA(hdr));2194ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS);2195dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr));2196}21972198/* Compute the hdr's checksum if necessary */2199arc_cksum_compute(buf);22002201return (0);2202}22032204/*2205* If this function is being called to decrypt an encrypted buffer or verify an2206* authenticated one, the key must be loaded and a mapping must be made2207* available in the keystore via spa_keystore_create_mapping() or one of its2208* callers.2209*/2210int2211arc_untransform(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb,2212boolean_t in_place)2213{2214int ret;2215arc_fill_flags_t flags = 0;22162217if (in_place)2218flags |= ARC_FILL_IN_PLACE;22192220ret = arc_buf_fill(buf, spa, zb, flags);2221if (ret == ECKSUM) {2222/*2223* Convert authentication and decryption errors to EIO2224* (and generate an ereport) before leaving the ARC.2225*/2226ret = SET_ERROR(EIO);2227spa_log_error(spa, zb, buf->b_hdr->b_birth);2228(void) zfs_ereport_post(FM_EREPORT_ZFS_AUTHENTICATION,2229spa, NULL, zb, NULL, 0);2230}22312232return (ret);2233}22342235/*2236* Increment the amount of evictable space in the arc_state_t's refcount.2237* We account for the space used by the hdr and the arc buf individually2238* so that we can add and remove them from the refcount individually.2239*/2240static void2241arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state)2242{2243arc_buf_contents_t type = arc_buf_type(hdr);22442245ASSERT(HDR_HAS_L1HDR(hdr));22462247if (GHOST_STATE(state)) {2248ASSERT0P(hdr->b_l1hdr.b_buf);2249ASSERT0P(hdr->b_l1hdr.b_pabd);2250ASSERT(!HDR_HAS_RABD(hdr));2251(void) zfs_refcount_add_many(&state->arcs_esize[type],2252HDR_GET_LSIZE(hdr), hdr);2253return;2254}22552256if (hdr->b_l1hdr.b_pabd != NULL) {2257(void) zfs_refcount_add_many(&state->arcs_esize[type],2258arc_hdr_size(hdr), hdr);2259}2260if (HDR_HAS_RABD(hdr)) {2261(void) zfs_refcount_add_many(&state->arcs_esize[type],2262HDR_GET_PSIZE(hdr), hdr);2263}22642265for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;2266buf = buf->b_next) {2267if (ARC_BUF_SHARED(buf))2268continue;2269(void) zfs_refcount_add_many(&state->arcs_esize[type],2270arc_buf_size(buf), buf);2271}2272}22732274/*2275* Decrement the amount of evictable space in the arc_state_t's refcount.2276* We account for the space used by the hdr and the arc buf individually2277* so that we can add and remove them from the refcount 
individually.2278*/2279static void2280arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state)2281{2282arc_buf_contents_t type = arc_buf_type(hdr);22832284ASSERT(HDR_HAS_L1HDR(hdr));22852286if (GHOST_STATE(state)) {2287ASSERT0P(hdr->b_l1hdr.b_buf);2288ASSERT0P(hdr->b_l1hdr.b_pabd);2289ASSERT(!HDR_HAS_RABD(hdr));2290(void) zfs_refcount_remove_many(&state->arcs_esize[type],2291HDR_GET_LSIZE(hdr), hdr);2292return;2293}22942295if (hdr->b_l1hdr.b_pabd != NULL) {2296(void) zfs_refcount_remove_many(&state->arcs_esize[type],2297arc_hdr_size(hdr), hdr);2298}2299if (HDR_HAS_RABD(hdr)) {2300(void) zfs_refcount_remove_many(&state->arcs_esize[type],2301HDR_GET_PSIZE(hdr), hdr);2302}23032304for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;2305buf = buf->b_next) {2306if (ARC_BUF_SHARED(buf))2307continue;2308(void) zfs_refcount_remove_many(&state->arcs_esize[type],2309arc_buf_size(buf), buf);2310}2311}23122313/*2314* Add a reference to this hdr indicating that someone is actively2315* referencing that memory. When the refcount transitions from 0 to 1,2316* we remove it from the respective arc_state_t list to indicate that2317* it is not evictable.2318*/2319static void2320add_reference(arc_buf_hdr_t *hdr, const void *tag)2321{2322arc_state_t *state = hdr->b_l1hdr.b_state;23232324ASSERT(HDR_HAS_L1HDR(hdr));2325if (!HDR_EMPTY(hdr) && !MUTEX_HELD(HDR_LOCK(hdr))) {2326ASSERT(state == arc_anon);2327ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));2328ASSERT0P(hdr->b_l1hdr.b_buf);2329}23302331if ((zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) &&2332state != arc_anon && state != arc_l2c_only) {2333/* We don't use the L2-only state list. */2334multilist_remove(&state->arcs_list[arc_buf_type(hdr)], hdr);2335arc_evictable_space_decrement(hdr, state);2336}2337}23382339/*2340* Remove a reference from this hdr. When the reference transitions from2341* 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's2342* list making it eligible for eviction.2343*/2344static int2345remove_reference(arc_buf_hdr_t *hdr, const void *tag)2346{2347int cnt;2348arc_state_t *state = hdr->b_l1hdr.b_state;23492350ASSERT(HDR_HAS_L1HDR(hdr));2351ASSERT(state == arc_anon || MUTEX_HELD(HDR_LOCK(hdr)));2352ASSERT(!GHOST_STATE(state)); /* arc_l2c_only counts as a ghost. */23532354if ((cnt = zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) != 0)2355return (cnt);23562357if (state == arc_anon) {2358arc_hdr_destroy(hdr);2359return (0);2360}2361if (state == arc_uncached && !HDR_PREFETCH(hdr)) {2362arc_change_state(arc_anon, hdr);2363arc_hdr_destroy(hdr);2364return (0);2365}2366multilist_insert(&state->arcs_list[arc_buf_type(hdr)], hdr);2367arc_evictable_space_increment(hdr, state);2368return (0);2369}23702371/*2372* Returns detailed information about a specific arc buffer. When the2373* state_index argument is set the function will calculate the arc header2374* list position for its arc state. Since this requires a linear traversal2375* callers are strongly encourage not to do this. 
However, it can be helpful2376* for targeted analysis so the functionality is provided.2377*/2378void2379arc_buf_info(arc_buf_t *ab, arc_buf_info_t *abi, int state_index)2380{2381(void) state_index;2382arc_buf_hdr_t *hdr = ab->b_hdr;2383l1arc_buf_hdr_t *l1hdr = NULL;2384l2arc_buf_hdr_t *l2hdr = NULL;2385arc_state_t *state = NULL;23862387memset(abi, 0, sizeof (arc_buf_info_t));23882389if (hdr == NULL)2390return;23912392abi->abi_flags = hdr->b_flags;23932394if (HDR_HAS_L1HDR(hdr)) {2395l1hdr = &hdr->b_l1hdr;2396state = l1hdr->b_state;2397}2398if (HDR_HAS_L2HDR(hdr))2399l2hdr = &hdr->b_l2hdr;24002401if (l1hdr) {2402abi->abi_bufcnt = 0;2403for (arc_buf_t *buf = l1hdr->b_buf; buf; buf = buf->b_next)2404abi->abi_bufcnt++;2405abi->abi_access = l1hdr->b_arc_access;2406abi->abi_mru_hits = l1hdr->b_mru_hits;2407abi->abi_mru_ghost_hits = l1hdr->b_mru_ghost_hits;2408abi->abi_mfu_hits = l1hdr->b_mfu_hits;2409abi->abi_mfu_ghost_hits = l1hdr->b_mfu_ghost_hits;2410abi->abi_holds = zfs_refcount_count(&l1hdr->b_refcnt);2411}24122413if (l2hdr) {2414abi->abi_l2arc_dattr = l2hdr->b_daddr;2415abi->abi_l2arc_hits = l2hdr->b_hits;2416}24172418abi->abi_state_type = state ? state->arcs_state : ARC_STATE_ANON;2419abi->abi_state_contents = arc_buf_type(hdr);2420abi->abi_size = arc_hdr_size(hdr);2421}24222423/*2424* Move the supplied buffer to the indicated state. The hash lock2425* for the buffer must be held by the caller.2426*/2427static void2428arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr)2429{2430arc_state_t *old_state;2431int64_t refcnt;2432boolean_t update_old, update_new;2433arc_buf_contents_t type = arc_buf_type(hdr);24342435/*2436* We almost always have an L1 hdr here, since we call arc_hdr_realloc()2437* in arc_read() when bringing a buffer out of the L2ARC. However, the2438* L1 hdr doesn't always exist when we change state to arc_anon before2439* destroying a header, in which case reallocating to add the L1 hdr is2440* pointless.2441*/2442if (HDR_HAS_L1HDR(hdr)) {2443old_state = hdr->b_l1hdr.b_state;2444refcnt = zfs_refcount_count(&hdr->b_l1hdr.b_refcnt);2445update_old = (hdr->b_l1hdr.b_buf != NULL ||2446hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));24472448IMPLY(GHOST_STATE(old_state), hdr->b_l1hdr.b_buf == NULL);2449IMPLY(GHOST_STATE(new_state), hdr->b_l1hdr.b_buf == NULL);2450IMPLY(old_state == arc_anon, hdr->b_l1hdr.b_buf == NULL ||2451ARC_BUF_LAST(hdr->b_l1hdr.b_buf));2452} else {2453old_state = arc_l2c_only;2454refcnt = 0;2455update_old = B_FALSE;2456}2457update_new = update_old;2458if (GHOST_STATE(old_state))2459update_old = B_TRUE;2460if (GHOST_STATE(new_state))2461update_new = B_TRUE;24622463ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));2464ASSERT3P(new_state, !=, old_state);24652466/*2467* If this buffer is evictable, transfer it from the2468* old state list to the new state list.2469*/2470if (refcnt == 0) {2471if (old_state != arc_anon && old_state != arc_l2c_only) {2472ASSERT(HDR_HAS_L1HDR(hdr));2473/* remove_reference() saves on insert. */2474if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {2475multilist_remove(&old_state->arcs_list[type],2476hdr);2477arc_evictable_space_decrement(hdr, old_state);2478}2479}2480if (new_state != arc_anon && new_state != arc_l2c_only) {2481/*2482* An L1 header always exists here, since if we're2483* moving to some L1-cached state (i.e. 
not l2c_only or2484* anonymous), we realloc the header to add an L1hdr2485* beforehand.2486*/2487ASSERT(HDR_HAS_L1HDR(hdr));2488multilist_insert(&new_state->arcs_list[type], hdr);2489arc_evictable_space_increment(hdr, new_state);2490}2491}24922493ASSERT(!HDR_EMPTY(hdr));2494if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))2495buf_hash_remove(hdr);24962497/* adjust state sizes (ignore arc_l2c_only) */24982499if (update_new && new_state != arc_l2c_only) {2500ASSERT(HDR_HAS_L1HDR(hdr));2501if (GHOST_STATE(new_state)) {25022503/*2504* When moving a header to a ghost state, we first2505* remove all arc buffers. Thus, we'll have no arc2506* buffer to use for the reference. As a result, we2507* use the arc header pointer for the reference.2508*/2509(void) zfs_refcount_add_many(2510&new_state->arcs_size[type],2511HDR_GET_LSIZE(hdr), hdr);2512ASSERT0P(hdr->b_l1hdr.b_pabd);2513ASSERT(!HDR_HAS_RABD(hdr));2514} else {25152516/*2517* Each individual buffer holds a unique reference,2518* thus we must remove each of these references one2519* at a time.2520*/2521for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;2522buf = buf->b_next) {25232524/*2525* When the arc_buf_t is sharing the data2526* block with the hdr, the owner of the2527* reference belongs to the hdr. Only2528* add to the refcount if the arc_buf_t is2529* not shared.2530*/2531if (ARC_BUF_SHARED(buf))2532continue;25332534(void) zfs_refcount_add_many(2535&new_state->arcs_size[type],2536arc_buf_size(buf), buf);2537}25382539if (hdr->b_l1hdr.b_pabd != NULL) {2540(void) zfs_refcount_add_many(2541&new_state->arcs_size[type],2542arc_hdr_size(hdr), hdr);2543}25442545if (HDR_HAS_RABD(hdr)) {2546(void) zfs_refcount_add_many(2547&new_state->arcs_size[type],2548HDR_GET_PSIZE(hdr), hdr);2549}2550}2551}25522553if (update_old && old_state != arc_l2c_only) {2554ASSERT(HDR_HAS_L1HDR(hdr));2555if (GHOST_STATE(old_state)) {2556ASSERT0P(hdr->b_l1hdr.b_pabd);2557ASSERT(!HDR_HAS_RABD(hdr));25582559/*2560* When moving a header off of a ghost state,2561* the header will not contain any arc buffers.2562* We use the arc header pointer for the reference2563* which is exactly what we did when we put the2564* header on the ghost state.2565*/25662567(void) zfs_refcount_remove_many(2568&old_state->arcs_size[type],2569HDR_GET_LSIZE(hdr), hdr);2570} else {25712572/*2573* Each individual buffer holds a unique reference,2574* thus we must remove each of these references one2575* at a time.2576*/2577for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;2578buf = buf->b_next) {25792580/*2581* When the arc_buf_t is sharing the data2582* block with the hdr, the owner of the2583* reference belongs to the hdr. 
Only2584* add to the refcount if the arc_buf_t is2585* not shared.2586*/2587if (ARC_BUF_SHARED(buf))2588continue;25892590(void) zfs_refcount_remove_many(2591&old_state->arcs_size[type],2592arc_buf_size(buf), buf);2593}2594ASSERT(hdr->b_l1hdr.b_pabd != NULL ||2595HDR_HAS_RABD(hdr));25962597if (hdr->b_l1hdr.b_pabd != NULL) {2598(void) zfs_refcount_remove_many(2599&old_state->arcs_size[type],2600arc_hdr_size(hdr), hdr);2601}26022603if (HDR_HAS_RABD(hdr)) {2604(void) zfs_refcount_remove_many(2605&old_state->arcs_size[type],2606HDR_GET_PSIZE(hdr), hdr);2607}2608}2609}26102611if (HDR_HAS_L1HDR(hdr)) {2612hdr->b_l1hdr.b_state = new_state;26132614if (HDR_HAS_L2HDR(hdr) && new_state != arc_l2c_only) {2615l2arc_hdr_arcstats_decrement_state(hdr);2616hdr->b_l2hdr.b_arcs_state = new_state->arcs_state;2617l2arc_hdr_arcstats_increment_state(hdr);2618}2619}2620}26212622void2623arc_space_consume(uint64_t space, arc_space_type_t type)2624{2625ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);26262627switch (type) {2628default:2629break;2630case ARC_SPACE_DATA:2631ARCSTAT_INCR(arcstat_data_size, space);2632break;2633case ARC_SPACE_META:2634ARCSTAT_INCR(arcstat_metadata_size, space);2635break;2636case ARC_SPACE_BONUS:2637ARCSTAT_INCR(arcstat_bonus_size, space);2638break;2639case ARC_SPACE_DNODE:2640aggsum_add(&arc_sums.arcstat_dnode_size, space);2641break;2642case ARC_SPACE_DBUF:2643ARCSTAT_INCR(arcstat_dbuf_size, space);2644break;2645case ARC_SPACE_HDRS:2646ARCSTAT_INCR(arcstat_hdr_size, space);2647break;2648case ARC_SPACE_L2HDRS:2649aggsum_add(&arc_sums.arcstat_l2_hdr_size, space);2650break;2651case ARC_SPACE_ABD_CHUNK_WASTE:2652/*2653* Note: this includes space wasted by all scatter ABD's, not2654* just those allocated by the ARC. But the vast majority of2655* scatter ABD's come from the ARC, because other users are2656* very short-lived.2657*/2658ARCSTAT_INCR(arcstat_abd_chunk_waste_size, space);2659break;2660}26612662if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE)2663ARCSTAT_INCR(arcstat_meta_used, space);26642665aggsum_add(&arc_sums.arcstat_size, space);2666}26672668void2669arc_space_return(uint64_t space, arc_space_type_t type)2670{2671ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);26722673switch (type) {2674default:2675break;2676case ARC_SPACE_DATA:2677ARCSTAT_INCR(arcstat_data_size, -space);2678break;2679case ARC_SPACE_META:2680ARCSTAT_INCR(arcstat_metadata_size, -space);2681break;2682case ARC_SPACE_BONUS:2683ARCSTAT_INCR(arcstat_bonus_size, -space);2684break;2685case ARC_SPACE_DNODE:2686aggsum_add(&arc_sums.arcstat_dnode_size, -space);2687break;2688case ARC_SPACE_DBUF:2689ARCSTAT_INCR(arcstat_dbuf_size, -space);2690break;2691case ARC_SPACE_HDRS:2692ARCSTAT_INCR(arcstat_hdr_size, -space);2693break;2694case ARC_SPACE_L2HDRS:2695aggsum_add(&arc_sums.arcstat_l2_hdr_size, -space);2696break;2697case ARC_SPACE_ABD_CHUNK_WASTE:2698ARCSTAT_INCR(arcstat_abd_chunk_waste_size, -space);2699break;2700}27012702if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE)2703ARCSTAT_INCR(arcstat_meta_used, -space);27042705ASSERT(aggsum_compare(&arc_sums.arcstat_size, space) >= 0);2706aggsum_add(&arc_sums.arcstat_size, -space);2707}27082709/*2710* Given a hdr and a buf, returns whether that buf can share its b_data buffer2711* with the hdr's b_pabd.2712*/2713static boolean_t2714arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)2715{2716/*2717* The criteria for sharing a hdr's data are:2718* 1. the buffer is not encrypted2719* 2. the hdr's compression matches the buf's compression2720* 3. 
the hdr doesn't need to be byteswapped
 * 4. the hdr isn't already being shared
 * 5. the buf is either compressed or it is the last buf in the hdr list
 *
 * Criterion #5 maintains the invariant that shared uncompressed
 * bufs must be the final buf in the hdr's b_buf list. Reading this, you
 * might ask, "if a compressed buf is allocated first, won't that be the
 * last thing in the list?", but in that case it's impossible to create
 * a shared uncompressed buf anyway (because the hdr must be compressed
 * to have the compressed buf). You might also think that #3 is
 * sufficient to make this guarantee, however it's possible
 * (specifically in the rare L2ARC write race mentioned in
 * arc_buf_alloc_impl()) there will be an existing uncompressed buf that
 * is shareable, but wasn't at the time of its allocation. Rather than
 * allow a new shared uncompressed buf to be created and then shuffle
 * the list around to make it the last element, this simply disallows
 * sharing if the new buf isn't the first to be added.
 */
	ASSERT3P(buf->b_hdr, ==, hdr);
	boolean_t hdr_compressed =
	    arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF;
	boolean_t buf_compressed = ARC_BUF_COMPRESSED(buf) != 0;
	return (!ARC_BUF_ENCRYPTED(buf) &&
	    buf_compressed == hdr_compressed &&
	    hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS &&
	    !HDR_SHARED_DATA(hdr) &&
	    (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf)));
}
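
/*
 * Illustrative sketch (not part of the ARC implementation): a plain-C
 * model of the five sharing criteria documented above, using hypothetical
 * booleans in place of the real hdr/buf state. It only makes the decision
 * explicit in isolation; arc_can_share() above remains authoritative. The
 * block is kept under #if 0 so it is never compiled.
 */
#if 0
#include <stdbool.h>

/* All parameter names here are hypothetical stand-ins. */
static bool
example_can_share(bool buf_encrypted, bool buf_compressed,
    bool hdr_compressed, bool hdr_needs_byteswap, bool hdr_already_shared,
    bool buf_is_last_in_list)
{
	return (!buf_encrypted &&			/* criterion 1 */
	    buf_compressed == hdr_compressed &&		/* criterion 2 */
	    !hdr_needs_byteswap &&			/* criterion 3 */
	    !hdr_already_shared &&			/* criterion 4 */
	    (buf_is_last_in_list || buf_compressed));	/* criterion 5 */
}
#endif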

/*
 * Allocate a buf for this hdr. If you care about the data that's in the hdr,
 * or if you want a compressed buffer, pass those flags in. Returns 0 if the
 * copy was made successfully, or an error code otherwise.
 */
static int
arc_buf_alloc_impl(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb,
    const void *tag, boolean_t encrypted, boolean_t compressed,
    boolean_t noauth, boolean_t fill, arc_buf_t **ret)
{
	arc_buf_t *buf;
	arc_fill_flags_t flags = ARC_FILL_LOCKED;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
	VERIFY(hdr->b_type == ARC_BUFC_DATA ||
	    hdr->b_type == ARC_BUFC_METADATA);
	ASSERT3P(ret, !=, NULL);
	ASSERT0P(*ret);
	IMPLY(encrypted, compressed);

	buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
	buf->b_hdr = hdr;
	buf->b_data = NULL;
	buf->b_next = hdr->b_l1hdr.b_buf;
	buf->b_flags = 0;

	add_reference(hdr, tag);

	/*
	 * We're about to change the hdr's b_flags. We must either
	 * hold the hash_lock or be undiscoverable.
	 */
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

	/*
	 * Only honor requests for compressed bufs if the hdr is actually
	 * compressed. This must be overridden if the buffer is encrypted since
	 * encrypted buffers cannot be decompressed.
	 */
	if (encrypted) {
		buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
		buf->b_flags |= ARC_BUF_FLAG_ENCRYPTED;
		flags |= ARC_FILL_COMPRESSED | ARC_FILL_ENCRYPTED;
	} else if (compressed &&
	    arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {
		buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
		flags |= ARC_FILL_COMPRESSED;
	}

	if (noauth) {
		ASSERT0(encrypted);
		flags |= ARC_FILL_NOAUTH;
	}

	/*
	 * If the hdr's data can be shared then we share the data buffer and
	 * set the appropriate bit in the hdr's b_flags to indicate the hdr is
	 * sharing its b_pabd with the arc_buf_t. Otherwise, we allocate a new
	 * buffer to store the buf's data.
	 *
	 * There are two additional restrictions here because we're sharing
	 * hdr -> buf instead of the usual buf -> hdr. First, the hdr can't be
	 * actively involved in an L2ARC write, because if this buf is used by
	 * an arc_write() then the hdr's data buffer will be released when the
	 * write completes, even though the L2ARC write might still be using it.
	 * Second, the hdr's ABD must be linear so that the buf's user doesn't
	 * need to be ABD-aware. It must be allocated via
	 * zio_[data_]buf_alloc(), not as a page, because we need to be able
	 * to abd_release_ownership_of_buf(), which isn't allowed on "linear
	 * page" buffers because the ABD code needs to handle freeing them
	 * specially.
	 */
	boolean_t can_share = arc_can_share(hdr, buf) &&
	    !HDR_L2_WRITING(hdr) &&
	    hdr->b_l1hdr.b_pabd != NULL &&
	    abd_is_linear(hdr->b_l1hdr.b_pabd) &&
	    !abd_is_linear_page(hdr->b_l1hdr.b_pabd);

	/* Set up b_data and sharing */
	if (can_share) {
		buf->b_data = abd_to_buf(hdr->b_l1hdr.b_pabd);
		buf->b_flags |= ARC_BUF_FLAG_SHARED;
		arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);
	} else {
		buf->b_data =
		    arc_get_data_buf(hdr, arc_buf_size(buf), buf);
		ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));
	}
	VERIFY3P(buf->b_data, !=, NULL);

	hdr->b_l1hdr.b_buf = buf;

	/*
	 * If the user wants the data from the hdr, we need to either copy or
	 * decompress the data.
	 */
	if (fill) {
		ASSERT3P(zb, !=, NULL);
		return (arc_buf_fill(buf, spa, zb, flags));
	}

	return (0);
}

static const char *arc_onloan_tag = "onloan";

static inline void
arc_loaned_bytes_update(int64_t delta)
{
	atomic_add_64(&arc_loaned_bytes, delta);

	/* assert that it did not wrap around */
	ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);
}

/*
 * Loan out an anonymous arc buffer. Loaned buffers are not counted as in
 * flight data by arc_tempreserve_space() until they are "returned". Loaned
 * buffers must be returned to the arc before they can be used by the DMU or
 * freed.
 */
arc_buf_t *
arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)
{
	arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,
	    is_metadata ?
ARC_BUFC_METADATA : ARC_BUFC_DATA, size);28762877arc_loaned_bytes_update(arc_buf_size(buf));28782879return (buf);2880}28812882arc_buf_t *2883arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,2884enum zio_compress compression_type, uint8_t complevel)2885{2886arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag,2887psize, lsize, compression_type, complevel);28882889arc_loaned_bytes_update(arc_buf_size(buf));28902891return (buf);2892}28932894arc_buf_t *2895arc_loan_raw_buf(spa_t *spa, uint64_t dsobj, boolean_t byteorder,2896const uint8_t *salt, const uint8_t *iv, const uint8_t *mac,2897dmu_object_type_t ot, uint64_t psize, uint64_t lsize,2898enum zio_compress compression_type, uint8_t complevel)2899{2900arc_buf_t *buf = arc_alloc_raw_buf(spa, arc_onloan_tag, dsobj,2901byteorder, salt, iv, mac, ot, psize, lsize, compression_type,2902complevel);29032904atomic_add_64(&arc_loaned_bytes, psize);2905return (buf);2906}290729082909/*2910* Return a loaned arc buffer to the arc.2911*/2912void2913arc_return_buf(arc_buf_t *buf, const void *tag)2914{2915arc_buf_hdr_t *hdr = buf->b_hdr;29162917ASSERT3P(buf->b_data, !=, NULL);2918ASSERT(HDR_HAS_L1HDR(hdr));2919(void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag);2920(void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);29212922arc_loaned_bytes_update(-arc_buf_size(buf));2923}29242925/* Detach an arc_buf from a dbuf (tag) */2926void2927arc_loan_inuse_buf(arc_buf_t *buf, const void *tag)2928{2929arc_buf_hdr_t *hdr = buf->b_hdr;29302931ASSERT3P(buf->b_data, !=, NULL);2932ASSERT(HDR_HAS_L1HDR(hdr));2933(void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);2934(void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag);29352936arc_loaned_bytes_update(arc_buf_size(buf));2937}29382939static void2940l2arc_free_abd_on_write(abd_t *abd, size_t size, arc_buf_contents_t type)2941{2942l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP);29432944df->l2df_abd = abd;2945df->l2df_size = size;2946df->l2df_type = type;2947mutex_enter(&l2arc_free_on_write_mtx);2948list_insert_head(l2arc_free_on_write, df);2949mutex_exit(&l2arc_free_on_write_mtx);2950}29512952static void2953arc_hdr_free_on_write(arc_buf_hdr_t *hdr, boolean_t free_rdata)2954{2955arc_state_t *state = hdr->b_l1hdr.b_state;2956arc_buf_contents_t type = arc_buf_type(hdr);2957uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr);29582959/* protected by hash lock, if in the hash table */2960if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {2961ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));2962ASSERT(state != arc_anon && state != arc_l2c_only);29632964(void) zfs_refcount_remove_many(&state->arcs_esize[type],2965size, hdr);2966}2967(void) zfs_refcount_remove_many(&state->arcs_size[type], size, hdr);2968if (type == ARC_BUFC_METADATA) {2969arc_space_return(size, ARC_SPACE_META);2970} else {2971ASSERT(type == ARC_BUFC_DATA);2972arc_space_return(size, ARC_SPACE_DATA);2973}29742975if (free_rdata) {2976l2arc_free_abd_on_write(hdr->b_crypt_hdr.b_rabd, size, type);2977} else {2978l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);2979}2980}29812982/*2983* Share the arc_buf_t's data with the hdr. 
Whenever we are sharing the2984* data buffer, we transfer the refcount ownership to the hdr and update2985* the appropriate kstats.2986*/2987static void2988arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)2989{2990ASSERT(arc_can_share(hdr, buf));2991ASSERT0P(hdr->b_l1hdr.b_pabd);2992ASSERT(!ARC_BUF_ENCRYPTED(buf));2993ASSERT(HDR_EMPTY_OR_LOCKED(hdr));29942995/*2996* Start sharing the data buffer. We transfer the2997* refcount ownership to the hdr since it always owns2998* the refcount whenever an arc_buf_t is shared.2999*/3000zfs_refcount_transfer_ownership_many(3001&hdr->b_l1hdr.b_state->arcs_size[arc_buf_type(hdr)],3002arc_hdr_size(hdr), buf, hdr);3003hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));3004abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,3005HDR_ISTYPE_METADATA(hdr));3006arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);3007buf->b_flags |= ARC_BUF_FLAG_SHARED;30083009/*3010* Since we've transferred ownership to the hdr we need3011* to increment its compressed and uncompressed kstats and3012* decrement the overhead size.3013*/3014ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));3015ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));3016ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf));3017}30183019static void3020arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)3021{3022ASSERT(arc_buf_is_shared(buf));3023ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);3024ASSERT(HDR_EMPTY_OR_LOCKED(hdr));30253026/*3027* We are no longer sharing this buffer so we need3028* to transfer its ownership to the rightful owner.3029*/3030zfs_refcount_transfer_ownership_many(3031&hdr->b_l1hdr.b_state->arcs_size[arc_buf_type(hdr)],3032arc_hdr_size(hdr), hdr, buf);3033arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);3034abd_release_ownership_of_buf(hdr->b_l1hdr.b_pabd);3035abd_free(hdr->b_l1hdr.b_pabd);3036hdr->b_l1hdr.b_pabd = NULL;3037buf->b_flags &= ~ARC_BUF_FLAG_SHARED;30383039/*3040* Since the buffer is no longer shared between3041* the arc buf and the hdr, count it as overhead.3042*/3043ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));3044ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));3045ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));3046}30473048/*3049* Remove an arc_buf_t from the hdr's buf list and return the last3050* arc_buf_t on the list. If no buffers remain on the list then return3051* NULL.3052*/3053static arc_buf_t *3054arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf)3055{3056ASSERT(HDR_HAS_L1HDR(hdr));3057ASSERT(HDR_EMPTY_OR_LOCKED(hdr));30583059arc_buf_t **bufp = &hdr->b_l1hdr.b_buf;3060arc_buf_t *lastbuf = NULL;30613062/*3063* Remove the buf from the hdr list and locate the last3064* remaining buffer on the list.3065*/3066while (*bufp != NULL) {3067if (*bufp == buf)3068*bufp = buf->b_next;30693070/*3071* If we've removed a buffer in the middle of3072* the list then update the lastbuf and update3073* bufp.3074*/3075if (*bufp != NULL) {3076lastbuf = *bufp;3077bufp = &(*bufp)->b_next;3078}3079}3080buf->b_next = NULL;3081ASSERT3P(lastbuf, !=, buf);3082IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf));30833084return (lastbuf);3085}30863087/*3088* Free up buf->b_data and pull the arc_buf_t off of the arc_buf_hdr_t's3089* list and free it.3090*/3091static void3092arc_buf_destroy_impl(arc_buf_t *buf)3093{3094arc_buf_hdr_t *hdr = buf->b_hdr;30953096/*3097* Free up the data associated with the buf but only if we're not3098* sharing this with the hdr. 
If we are sharing it with the hdr, the3099* hdr is responsible for doing the free.3100*/3101if (buf->b_data != NULL) {3102/*3103* We're about to change the hdr's b_flags. We must either3104* hold the hash_lock or be undiscoverable.3105*/3106ASSERT(HDR_EMPTY_OR_LOCKED(hdr));31073108arc_cksum_verify(buf);3109arc_buf_unwatch(buf);31103111if (ARC_BUF_SHARED(buf)) {3112arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);3113} else {3114ASSERT(!arc_buf_is_shared(buf));3115uint64_t size = arc_buf_size(buf);3116arc_free_data_buf(hdr, buf->b_data, size, buf);3117ARCSTAT_INCR(arcstat_overhead_size, -size);3118}3119buf->b_data = NULL;31203121/*3122* If we have no more encrypted buffers and we've already3123* gotten a copy of the decrypted data we can free b_rabd3124* to save some space.3125*/3126if (ARC_BUF_ENCRYPTED(buf) && HDR_HAS_RABD(hdr) &&3127hdr->b_l1hdr.b_pabd != NULL && !HDR_IO_IN_PROGRESS(hdr)) {3128arc_buf_t *b;3129for (b = hdr->b_l1hdr.b_buf; b; b = b->b_next) {3130if (b != buf && ARC_BUF_ENCRYPTED(b))3131break;3132}3133if (b == NULL)3134arc_hdr_free_abd(hdr, B_TRUE);3135}3136}31373138arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);31393140if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {3141/*3142* If the current arc_buf_t is sharing its data buffer with the3143* hdr, then reassign the hdr's b_pabd to share it with the new3144* buffer at the end of the list. The shared buffer is always3145* the last one on the hdr's buffer list.3146*3147* There is an equivalent case for compressed bufs, but since3148* they aren't guaranteed to be the last buf in the list and3149* that is an exceedingly rare case, we just allow that space be3150* wasted temporarily. We must also be careful not to share3151* encrypted buffers, since they cannot be shared.3152*/3153if (lastbuf != NULL && !ARC_BUF_ENCRYPTED(lastbuf)) {3154/* Only one buf can be shared at once */3155ASSERT(!arc_buf_is_shared(lastbuf));3156/* hdr is uncompressed so can't have compressed buf */3157ASSERT(!ARC_BUF_COMPRESSED(lastbuf));31583159ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);3160arc_hdr_free_abd(hdr, B_FALSE);31613162/*3163* We must setup a new shared block between the3164* last buffer and the hdr. The data would have3165* been allocated by the arc buf so we need to transfer3166* ownership to the hdr since it's now being shared.3167*/3168arc_share_buf(hdr, lastbuf);3169}3170} else if (HDR_SHARED_DATA(hdr)) {3171/*3172* Uncompressed shared buffers are always at the end3173* of the list. Compressed buffers don't have the3174* same requirements. 
This makes it hard to3175* simply assert that the lastbuf is shared so3176* we rely on the hdr's compression flags to determine3177* if we have a compressed, shared buffer.3178*/3179ASSERT3P(lastbuf, !=, NULL);3180ASSERT(arc_buf_is_shared(lastbuf) ||3181arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);3182}31833184/*3185* Free the checksum if we're removing the last uncompressed buf from3186* this hdr.3187*/3188if (!arc_hdr_has_uncompressed_buf(hdr)) {3189arc_cksum_free(hdr);3190}31913192/* clean up the buf */3193buf->b_hdr = NULL;3194kmem_cache_free(buf_cache, buf);3195}31963197static void3198arc_hdr_alloc_abd(arc_buf_hdr_t *hdr, int alloc_flags)3199{3200uint64_t size;3201boolean_t alloc_rdata = ((alloc_flags & ARC_HDR_ALLOC_RDATA) != 0);32023203ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);3204ASSERT(HDR_HAS_L1HDR(hdr));3205ASSERT(!HDR_SHARED_DATA(hdr) || alloc_rdata);3206IMPLY(alloc_rdata, HDR_PROTECTED(hdr));32073208if (alloc_rdata) {3209size = HDR_GET_PSIZE(hdr);3210ASSERT0P(hdr->b_crypt_hdr.b_rabd);3211hdr->b_crypt_hdr.b_rabd = arc_get_data_abd(hdr, size, hdr,3212alloc_flags);3213ASSERT3P(hdr->b_crypt_hdr.b_rabd, !=, NULL);3214ARCSTAT_INCR(arcstat_raw_size, size);3215} else {3216size = arc_hdr_size(hdr);3217ASSERT0P(hdr->b_l1hdr.b_pabd);3218hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, size, hdr,3219alloc_flags);3220ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);3221}32223223ARCSTAT_INCR(arcstat_compressed_size, size);3224ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));3225}32263227static void3228arc_hdr_free_abd(arc_buf_hdr_t *hdr, boolean_t free_rdata)3229{3230uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr);32313232ASSERT(HDR_HAS_L1HDR(hdr));3233ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));3234IMPLY(free_rdata, HDR_HAS_RABD(hdr));32353236/*3237* If the hdr is currently being written to the l2arc then3238* we defer freeing the data by adding it to the l2arc_free_on_write3239* list. The l2arc will free the data once it's finished3240* writing it to the l2arc device.3241*/3242if (HDR_L2_WRITING(hdr)) {3243arc_hdr_free_on_write(hdr, free_rdata);3244ARCSTAT_BUMP(arcstat_l2_free_on_write);3245} else if (free_rdata) {3246arc_free_data_abd(hdr, hdr->b_crypt_hdr.b_rabd, size, hdr);3247} else {3248arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, size, hdr);3249}32503251if (free_rdata) {3252hdr->b_crypt_hdr.b_rabd = NULL;3253ARCSTAT_INCR(arcstat_raw_size, -size);3254} else {3255hdr->b_l1hdr.b_pabd = NULL;3256}32573258if (hdr->b_l1hdr.b_pabd == NULL && !HDR_HAS_RABD(hdr))3259hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;32603261ARCSTAT_INCR(arcstat_compressed_size, -size);3262ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));3263}32643265/*3266* Allocate empty anonymous ARC header. The header will get its identity3267* assigned and buffers attached later as part of read or write operations.3268*3269* In case of read arc_read() assigns header its identify (b_dva + b_birth),3270* inserts it into ARC hash to become globally visible and allocates physical3271* (b_pabd) or raw (b_rabd) ABD buffer to read into from disk. On disk read3272* completion arc_read_done() allocates ARC buffer(s) as needed, potentially3273* sharing one of them with the physical ABD buffer.3274*3275* In case of write arc_alloc_buf() allocates ARC buffer to be filled with3276* data. Then after compression and/or encryption arc_write_ready() allocates3277* and fills (or potentially shares) physical (b_pabd) or raw (b_rabd) ABD3278* buffer. 
On disk write completion arc_write_done() assigns the header its3279* new identity (b_dva + b_birth) and inserts into ARC hash.3280*3281* In case of partial overwrite the old data is read first as described. Then3282* arc_release() either allocates new anonymous ARC header and moves the ARC3283* buffer to it, or reuses the old ARC header by discarding its identity and3284* removing it from ARC hash. After buffer modification normal write process3285* follows as described.3286*/3287static arc_buf_hdr_t *3288arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,3289boolean_t protected, enum zio_compress compression_type, uint8_t complevel,3290arc_buf_contents_t type)3291{3292arc_buf_hdr_t *hdr;32933294VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);3295hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);32963297ASSERT(HDR_EMPTY(hdr));3298#ifdef ZFS_DEBUG3299ASSERT0P(hdr->b_l1hdr.b_freeze_cksum);3300#endif3301HDR_SET_PSIZE(hdr, psize);3302HDR_SET_LSIZE(hdr, lsize);3303hdr->b_spa = spa;3304hdr->b_type = type;3305hdr->b_flags = 0;3306arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);3307arc_hdr_set_compress(hdr, compression_type);3308hdr->b_complevel = complevel;3309if (protected)3310arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);33113312hdr->b_l1hdr.b_state = arc_anon;3313hdr->b_l1hdr.b_arc_access = 0;3314hdr->b_l1hdr.b_mru_hits = 0;3315hdr->b_l1hdr.b_mru_ghost_hits = 0;3316hdr->b_l1hdr.b_mfu_hits = 0;3317hdr->b_l1hdr.b_mfu_ghost_hits = 0;3318hdr->b_l1hdr.b_buf = NULL;33193320ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));33213322return (hdr);3323}33243325/*3326* Transition between the two allocation states for the arc_buf_hdr struct.3327* The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without3328* (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller3329* version is used when a cache buffer is only in the L2ARC in order to reduce3330* memory usage.3331*/3332static arc_buf_hdr_t *3333arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new)3334{3335ASSERT(HDR_HAS_L2HDR(hdr));33363337arc_buf_hdr_t *nhdr;3338l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;33393340ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) ||3341(old == hdr_l2only_cache && new == hdr_full_cache));33423343nhdr = kmem_cache_alloc(new, KM_PUSHPAGE);33443345ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));3346buf_hash_remove(hdr);33473348memcpy(nhdr, hdr, HDR_L2ONLY_SIZE);33493350if (new == hdr_full_cache) {3351arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);3352/*3353* arc_access and arc_change_state need to be aware that a3354* header has just come out of L2ARC, so we set its state to3355* l2c_only even though it's about to change.3356*/3357nhdr->b_l1hdr.b_state = arc_l2c_only;33583359/* Verify previous threads set to NULL before freeing */3360ASSERT0P(nhdr->b_l1hdr.b_pabd);3361ASSERT(!HDR_HAS_RABD(hdr));3362} else {3363ASSERT0P(hdr->b_l1hdr.b_buf);3364#ifdef ZFS_DEBUG3365ASSERT0P(hdr->b_l1hdr.b_freeze_cksum);3366#endif33673368/*3369* If we've reached here, We must have been called from3370* arc_evict_hdr(), as such we should have already been3371* removed from any ghost list we were previously on3372* (which protects us from racing with arc_evict_state),3373* thus no locking is needed during this check.3374*/3375ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));33763377/*3378* A buffer must not be moved into the arc_l2c_only3379* state if it's not finished being written out to the3380* l2arc device. 
Otherwise, the b_l1hdr.b_pabd field3381* might try to be accessed, even though it was removed.3382*/3383VERIFY(!HDR_L2_WRITING(hdr));3384VERIFY0P(hdr->b_l1hdr.b_pabd);3385ASSERT(!HDR_HAS_RABD(hdr));33863387arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR);3388}3389/*3390* The header has been reallocated so we need to re-insert it into any3391* lists it was on.3392*/3393(void) buf_hash_insert(nhdr, NULL);33943395ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node));33963397mutex_enter(&dev->l2ad_mtx);33983399/*3400* We must place the realloc'ed header back into the list at3401* the same spot. Otherwise, if it's placed earlier in the list,3402* l2arc_write_buffers() could find it during the function's3403* write phase, and try to write it out to the l2arc.3404*/3405list_insert_after(&dev->l2ad_buflist, hdr, nhdr);3406list_remove(&dev->l2ad_buflist, hdr);34073408mutex_exit(&dev->l2ad_mtx);34093410/*3411* Since we're using the pointer address as the tag when3412* incrementing and decrementing the l2ad_alloc refcount, we3413* must remove the old pointer (that we're about to destroy) and3414* add the new pointer to the refcount. Otherwise we'd remove3415* the wrong pointer address when calling arc_hdr_destroy() later.3416*/34173418(void) zfs_refcount_remove_many(&dev->l2ad_alloc,3419arc_hdr_size(hdr), hdr);3420(void) zfs_refcount_add_many(&dev->l2ad_alloc,3421arc_hdr_size(nhdr), nhdr);34223423buf_discard_identity(hdr);3424kmem_cache_free(old, hdr);34253426return (nhdr);3427}34283429/*3430* This function is used by the send / receive code to convert a newly3431* allocated arc_buf_t to one that is suitable for a raw encrypted write. It3432* is also used to allow the root objset block to be updated without altering3433* its embedded MACs. Both block types will always be uncompressed so we do not3434* have to worry about compression type or psize.3435*/3436void3437arc_convert_to_raw(arc_buf_t *buf, uint64_t dsobj, boolean_t byteorder,3438dmu_object_type_t ot, const uint8_t *salt, const uint8_t *iv,3439const uint8_t *mac)3440{3441arc_buf_hdr_t *hdr = buf->b_hdr;34423443ASSERT(ot == DMU_OT_DNODE || ot == DMU_OT_OBJSET);3444ASSERT(HDR_HAS_L1HDR(hdr));3445ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);34463447buf->b_flags |= (ARC_BUF_FLAG_COMPRESSED | ARC_BUF_FLAG_ENCRYPTED);3448arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);3449hdr->b_crypt_hdr.b_dsobj = dsobj;3450hdr->b_crypt_hdr.b_ot = ot;3451hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?3452DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);3453if (!arc_hdr_has_uncompressed_buf(hdr))3454arc_cksum_free(hdr);34553456if (salt != NULL)3457memcpy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN);3458if (iv != NULL)3459memcpy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN);3460if (mac != NULL)3461memcpy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN);3462}34633464/*3465* Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller.3466* The buf is returned thawed since we expect the consumer to modify it.3467*/3468arc_buf_t *3469arc_alloc_buf(spa_t *spa, const void *tag, arc_buf_contents_t type,3470int32_t size)3471{3472arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size,3473B_FALSE, ZIO_COMPRESS_OFF, 0, type);34743475arc_buf_t *buf = NULL;3476VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE, B_FALSE,3477B_FALSE, B_FALSE, &buf));3478arc_buf_thaw(buf);34793480return (buf);3481}34823483/*3484* Allocate a compressed buf in the same manner as arc_alloc_buf. 
Don't use this3485* for bufs containing metadata.3486*/3487arc_buf_t *3488arc_alloc_compressed_buf(spa_t *spa, const void *tag, uint64_t psize,3489uint64_t lsize, enum zio_compress compression_type, uint8_t complevel)3490{3491ASSERT3U(lsize, >, 0);3492ASSERT3U(lsize, >=, psize);3493ASSERT3U(compression_type, >, ZIO_COMPRESS_OFF);3494ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS);34953496arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,3497B_FALSE, compression_type, complevel, ARC_BUFC_DATA);34983499arc_buf_t *buf = NULL;3500VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE,3501B_TRUE, B_FALSE, B_FALSE, &buf));3502arc_buf_thaw(buf);35033504/*3505* To ensure that the hdr has the correct data in it if we call3506* arc_untransform() on this buf before it's been written to disk,3507* it's easiest if we just set up sharing between the buf and the hdr.3508*/3509arc_share_buf(hdr, buf);35103511return (buf);3512}35133514arc_buf_t *3515arc_alloc_raw_buf(spa_t *spa, const void *tag, uint64_t dsobj,3516boolean_t byteorder, const uint8_t *salt, const uint8_t *iv,3517const uint8_t *mac, dmu_object_type_t ot, uint64_t psize, uint64_t lsize,3518enum zio_compress compression_type, uint8_t complevel)3519{3520arc_buf_hdr_t *hdr;3521arc_buf_t *buf;3522arc_buf_contents_t type = DMU_OT_IS_METADATA(ot) ?3523ARC_BUFC_METADATA : ARC_BUFC_DATA;35243525ASSERT3U(lsize, >, 0);3526ASSERT3U(lsize, >=, psize);3527ASSERT3U(compression_type, >=, ZIO_COMPRESS_OFF);3528ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS);35293530hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, B_TRUE,3531compression_type, complevel, type);35323533hdr->b_crypt_hdr.b_dsobj = dsobj;3534hdr->b_crypt_hdr.b_ot = ot;3535hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?3536DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);3537memcpy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN);3538memcpy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN);3539memcpy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN);35403541/*3542* This buffer will be considered encrypted even if the ot is not an3543* encrypted type. It will become authenticated instead in3544* arc_write_ready().3545*/3546buf = NULL;3547VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_TRUE, B_TRUE,3548B_FALSE, B_FALSE, &buf));3549arc_buf_thaw(buf);35503551return (buf);3552}35533554static void3555l2arc_hdr_arcstats_update(arc_buf_hdr_t *hdr, boolean_t incr,3556boolean_t state_only)3557{3558uint64_t lsize = HDR_GET_LSIZE(hdr);3559uint64_t psize = HDR_GET_PSIZE(hdr);3560uint64_t asize = HDR_GET_L2SIZE(hdr);3561arc_buf_contents_t type = hdr->b_type;3562int64_t lsize_s;3563int64_t psize_s;3564int64_t asize_s;35653566/* For L2 we expect the header's b_l2size to be valid */3567ASSERT3U(asize, >=, psize);35683569if (incr) {3570lsize_s = lsize;3571psize_s = psize;3572asize_s = asize;3573} else {3574lsize_s = -lsize;3575psize_s = -psize;3576asize_s = -asize;3577}35783579/* If the buffer is a prefetch, count it as such. */3580if (HDR_PREFETCH(hdr)) {3581ARCSTAT_INCR(arcstat_l2_prefetch_asize, asize_s);3582} else {3583/*3584* We use the value stored in the L2 header upon initial3585* caching in L2ARC. This value will be updated in case3586* an MRU/MRU_ghost buffer transitions to MFU but the L2ARC3587* metadata (log entry) cannot currently be updated. 
Having3588* the ARC state in the L2 header solves the problem of a3589* possibly absent L1 header (apparent in buffers restored3590* from persistent L2ARC).3591*/3592switch (hdr->b_l2hdr.b_arcs_state) {3593case ARC_STATE_MRU_GHOST:3594case ARC_STATE_MRU:3595ARCSTAT_INCR(arcstat_l2_mru_asize, asize_s);3596break;3597case ARC_STATE_MFU_GHOST:3598case ARC_STATE_MFU:3599ARCSTAT_INCR(arcstat_l2_mfu_asize, asize_s);3600break;3601default:3602break;3603}3604}36053606if (state_only)3607return;36083609ARCSTAT_INCR(arcstat_l2_psize, psize_s);3610ARCSTAT_INCR(arcstat_l2_lsize, lsize_s);36113612switch (type) {3613case ARC_BUFC_DATA:3614ARCSTAT_INCR(arcstat_l2_bufc_data_asize, asize_s);3615break;3616case ARC_BUFC_METADATA:3617ARCSTAT_INCR(arcstat_l2_bufc_metadata_asize, asize_s);3618break;3619default:3620break;3621}3622}362336243625static void3626arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr)3627{3628l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;3629l2arc_dev_t *dev = l2hdr->b_dev;36303631ASSERT(MUTEX_HELD(&dev->l2ad_mtx));3632ASSERT(HDR_HAS_L2HDR(hdr));36333634list_remove(&dev->l2ad_buflist, hdr);36353636l2arc_hdr_arcstats_decrement(hdr);3637if (dev->l2ad_vdev != NULL) {3638uint64_t asize = HDR_GET_L2SIZE(hdr);3639vdev_space_update(dev->l2ad_vdev, -asize, 0, 0);3640}36413642(void) zfs_refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr),3643hdr);3644arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);3645}36463647static void3648arc_hdr_destroy(arc_buf_hdr_t *hdr)3649{3650if (HDR_HAS_L1HDR(hdr)) {3651ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));3652ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);3653}3654ASSERT(!HDR_IO_IN_PROGRESS(hdr));3655ASSERT(!HDR_IN_HASH_TABLE(hdr));36563657if (HDR_HAS_L2HDR(hdr)) {3658l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;3659boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);36603661if (!buflist_held)3662mutex_enter(&dev->l2ad_mtx);36633664/*3665* Even though we checked this conditional above, we3666* need to check this again now that we have the3667* l2ad_mtx. This is because we could be racing with3668* another thread calling l2arc_evict() which might have3669* destroyed this header's L2 portion as we were waiting3670* to acquire the l2ad_mtx. If that happens, we don't3671* want to re-destroy the header's L2 portion.3672*/3673if (HDR_HAS_L2HDR(hdr)) {36743675if (!HDR_EMPTY(hdr))3676buf_discard_identity(hdr);36773678arc_hdr_l2hdr_destroy(hdr);3679}36803681if (!buflist_held)3682mutex_exit(&dev->l2ad_mtx);3683}36843685/*3686* The header's identify can only be safely discarded once it is no3687* longer discoverable. This requires removing it from the hash table3688* and the l2arc header list. 
After this point the hash lock can not3689* be used to protect the header.3690*/3691if (!HDR_EMPTY(hdr))3692buf_discard_identity(hdr);36933694if (HDR_HAS_L1HDR(hdr)) {3695arc_cksum_free(hdr);36963697while (hdr->b_l1hdr.b_buf != NULL)3698arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);36993700if (hdr->b_l1hdr.b_pabd != NULL)3701arc_hdr_free_abd(hdr, B_FALSE);37023703if (HDR_HAS_RABD(hdr))3704arc_hdr_free_abd(hdr, B_TRUE);3705}37063707ASSERT0P(hdr->b_hash_next);3708if (HDR_HAS_L1HDR(hdr)) {3709ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));3710ASSERT0P(hdr->b_l1hdr.b_acb);3711#ifdef ZFS_DEBUG3712ASSERT0P(hdr->b_l1hdr.b_freeze_cksum);3713#endif3714kmem_cache_free(hdr_full_cache, hdr);3715} else {3716kmem_cache_free(hdr_l2only_cache, hdr);3717}3718}37193720void3721arc_buf_destroy(arc_buf_t *buf, const void *tag)3722{3723arc_buf_hdr_t *hdr = buf->b_hdr;37243725if (hdr->b_l1hdr.b_state == arc_anon) {3726ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf);3727ASSERT(ARC_BUF_LAST(buf));3728ASSERT(!HDR_IO_IN_PROGRESS(hdr));3729VERIFY0(remove_reference(hdr, tag));3730return;3731}37323733kmutex_t *hash_lock = HDR_LOCK(hdr);3734mutex_enter(hash_lock);37353736ASSERT3P(hdr, ==, buf->b_hdr);3737ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);3738ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));3739ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon);3740ASSERT3P(buf->b_data, !=, NULL);37413742arc_buf_destroy_impl(buf);3743(void) remove_reference(hdr, tag);3744mutex_exit(hash_lock);3745}37463747/*3748* Evict the arc_buf_hdr that is provided as a parameter. The resultant3749* state of the header is dependent on its state prior to entering this3750* function. The following transitions are possible:3751*3752* - arc_mru -> arc_mru_ghost3753* - arc_mfu -> arc_mfu_ghost3754* - arc_mru_ghost -> arc_l2c_only3755* - arc_mru_ghost -> deleted3756* - arc_mfu_ghost -> arc_l2c_only3757* - arc_mfu_ghost -> deleted3758* - arc_uncached -> deleted3759*3760* Return total size of evicted data buffers for eviction progress tracking.3761* When evicting from ghost states return logical buffer size to make eviction3762* progress at the same (or at least comparable) rate as from non-ghost states.3763*3764* Return *real_evicted for actual ARC size reduction to wake up threads3765* waiting for it. For non-ghost states it includes size of evicted data3766* buffers (the headers are not freed there). For ghost states it includes3767* only the evicted headers size.3768*/3769static int64_t3770arc_evict_hdr(arc_buf_hdr_t *hdr, uint64_t *real_evicted)3771{3772arc_state_t *evicted_state, *state;3773int64_t bytes_evicted = 0;37743775ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));3776ASSERT(HDR_HAS_L1HDR(hdr));3777ASSERT(!HDR_IO_IN_PROGRESS(hdr));3778ASSERT0P(hdr->b_l1hdr.b_buf);3779ASSERT0(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt));37803781*real_evicted = 0;3782state = hdr->b_l1hdr.b_state;3783if (GHOST_STATE(state)) {37843785/*3786* l2arc_write_buffers() relies on a header's L1 portion3787* (i.e. 
its b_pabd field) during it's write phase.3788* Thus, we cannot push a header onto the arc_l2c_only3789* state (removing its L1 piece) until the header is3790* done being written to the l2arc.3791*/3792if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {3793ARCSTAT_BUMP(arcstat_evict_l2_skip);3794return (bytes_evicted);3795}37963797ARCSTAT_BUMP(arcstat_deleted);3798bytes_evicted += HDR_GET_LSIZE(hdr);37993800DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr);38013802if (HDR_HAS_L2HDR(hdr)) {3803ASSERT0P(hdr->b_l1hdr.b_pabd);3804ASSERT(!HDR_HAS_RABD(hdr));3805/*3806* This buffer is cached on the 2nd Level ARC;3807* don't destroy the header.3808*/3809arc_change_state(arc_l2c_only, hdr);3810/*3811* dropping from L1+L2 cached to L2-only,3812* realloc to remove the L1 header.3813*/3814(void) arc_hdr_realloc(hdr, hdr_full_cache,3815hdr_l2only_cache);3816*real_evicted += HDR_FULL_SIZE - HDR_L2ONLY_SIZE;3817} else {3818arc_change_state(arc_anon, hdr);3819arc_hdr_destroy(hdr);3820*real_evicted += HDR_FULL_SIZE;3821}3822return (bytes_evicted);3823}38243825ASSERT(state == arc_mru || state == arc_mfu || state == arc_uncached);3826evicted_state = (state == arc_uncached) ? arc_anon :3827((state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost);38283829/* prefetch buffers have a minimum lifespan */3830uint_t min_lifetime = HDR_PRESCIENT_PREFETCH(hdr) ?3831arc_min_prescient_prefetch : arc_min_prefetch;3832if ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) &&3833ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access < min_lifetime) {3834ARCSTAT_BUMP(arcstat_evict_skip);3835return (bytes_evicted);3836}38373838if (HDR_HAS_L2HDR(hdr)) {3839ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr));3840} else {3841if (l2arc_write_eligible(hdr->b_spa, hdr)) {3842ARCSTAT_INCR(arcstat_evict_l2_eligible,3843HDR_GET_LSIZE(hdr));38443845switch (state->arcs_state) {3846case ARC_STATE_MRU:3847ARCSTAT_INCR(3848arcstat_evict_l2_eligible_mru,3849HDR_GET_LSIZE(hdr));3850break;3851case ARC_STATE_MFU:3852ARCSTAT_INCR(3853arcstat_evict_l2_eligible_mfu,3854HDR_GET_LSIZE(hdr));3855break;3856default:3857break;3858}3859} else {3860ARCSTAT_INCR(arcstat_evict_l2_ineligible,3861HDR_GET_LSIZE(hdr));3862}3863}38643865bytes_evicted += arc_hdr_size(hdr);3866*real_evicted += arc_hdr_size(hdr);38673868/*3869* If this hdr is being evicted and has a compressed buffer then we3870* discard it here before we change states. 
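 * (Both the b_pabd copy and, if present, the raw encrypted copy are freed
 * via arc_hdr_free_abd() while the header is still in its current state.)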
This ensures that the3871* accounting is updated correctly in arc_free_data_impl().3872*/3873if (hdr->b_l1hdr.b_pabd != NULL)3874arc_hdr_free_abd(hdr, B_FALSE);38753876if (HDR_HAS_RABD(hdr))3877arc_hdr_free_abd(hdr, B_TRUE);38783879arc_change_state(evicted_state, hdr);3880DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr);3881if (evicted_state == arc_anon) {3882arc_hdr_destroy(hdr);3883*real_evicted += HDR_FULL_SIZE;3884} else {3885ASSERT(HDR_IN_HASH_TABLE(hdr));3886}38873888return (bytes_evicted);3889}38903891static void3892arc_set_need_free(void)3893{3894ASSERT(MUTEX_HELD(&arc_evict_lock));3895int64_t remaining = arc_free_memory() - arc_sys_free / 2;3896arc_evict_waiter_t *aw = list_tail(&arc_evict_waiters);3897if (aw == NULL) {3898arc_need_free = MAX(-remaining, 0);3899} else {3900arc_need_free =3901MAX(-remaining, (int64_t)(aw->aew_count - arc_evict_count));3902}3903}39043905static uint64_t3906arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker,3907uint64_t spa, uint64_t bytes, boolean_t *more)3908{3909multilist_sublist_t *mls;3910uint64_t bytes_evicted = 0, real_evicted = 0;3911arc_buf_hdr_t *hdr;3912kmutex_t *hash_lock;3913uint_t evict_count = zfs_arc_evict_batch_limit;39143915ASSERT3P(marker, !=, NULL);39163917mls = multilist_sublist_lock_idx(ml, idx);39183919for (hdr = multilist_sublist_prev(mls, marker); likely(hdr != NULL);3920hdr = multilist_sublist_prev(mls, marker)) {3921if ((evict_count == 0) || (bytes_evicted >= bytes))3922break;39233924/*3925* To keep our iteration location, move the marker3926* forward. Since we're not holding hdr's hash lock, we3927* must be very careful and not remove 'hdr' from the3928* sublist. Otherwise, other consumers might mistake the3929* 'hdr' as not being on a sublist when they call the3930* multilist_link_active() function (they all rely on3931* the hash lock protecting concurrent insertions and3932* removals). multilist_sublist_move_forward() was3933* specifically implemented to ensure this is the case3934* (only 'marker' will be removed and re-inserted).3935*/3936multilist_sublist_move_forward(mls, marker);39373938/*3939* The only case where the b_spa field should ever be3940* zero, is the marker headers inserted by3941* arc_evict_state(). It's possible for multiple threads3942* to be calling arc_evict_state() concurrently (e.g.3943* dsl_pool_close() and zio_inject_fault()), so we must3944* skip any markers we see from these other threads.3945*/3946if (hdr->b_spa == 0)3947continue;39483949/* we're only interested in evicting buffers of a certain spa */3950if (spa != 0 && hdr->b_spa != spa) {3951ARCSTAT_BUMP(arcstat_evict_skip);3952continue;3953}39543955hash_lock = HDR_LOCK(hdr);39563957/*3958* We aren't calling this function from any code path3959* that would already be holding a hash lock, so we're3960* asserting on this assumption to be defensive in case3961* this ever changes. Without this check, it would be3962* possible to incorrectly increment arcstat_mutex_miss3963* below (e.g. 
if the code changed such that we called3964* this function with a hash lock held).3965*/3966ASSERT(!MUTEX_HELD(hash_lock));39673968if (mutex_tryenter(hash_lock)) {3969uint64_t revicted;3970uint64_t evicted = arc_evict_hdr(hdr, &revicted);3971mutex_exit(hash_lock);39723973bytes_evicted += evicted;3974real_evicted += revicted;39753976/*3977* If evicted is zero, arc_evict_hdr() must have3978* decided to skip this header, don't increment3979* evict_count in this case.3980*/3981if (evicted != 0)3982evict_count--;39833984} else {3985ARCSTAT_BUMP(arcstat_mutex_miss);3986}3987}39883989multilist_sublist_unlock(mls);39903991/* Indicate if another iteration may be productive. */3992if (more)3993*more = (hdr != NULL);39943995/*3996* Increment the count of evicted bytes, and wake up any threads that3997* are waiting for the count to reach this value. Since the list is3998* ordered by ascending aew_count, we pop off the beginning of the3999* list until we reach the end, or a waiter that's past the current4000* "count". Doing this outside the loop reduces the number of times4001* we need to acquire the global arc_evict_lock.4002*4003* Only wake when there's sufficient free memory in the system4004* (specifically, arc_sys_free/2, which by default is a bit more than4005* 1/64th of RAM). See the comments in arc_wait_for_eviction().4006*/4007mutex_enter(&arc_evict_lock);4008arc_evict_count += real_evicted;40094010if (arc_free_memory() > arc_sys_free / 2) {4011arc_evict_waiter_t *aw;4012while ((aw = list_head(&arc_evict_waiters)) != NULL &&4013aw->aew_count <= arc_evict_count) {4014list_remove(&arc_evict_waiters, aw);4015cv_signal(&aw->aew_cv);4016}4017}4018arc_set_need_free();4019mutex_exit(&arc_evict_lock);40204021return (bytes_evicted);4022}40234024static arc_buf_hdr_t *4025arc_state_alloc_marker(void)4026{4027arc_buf_hdr_t *marker = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);40284029/*4030* A b_spa of 0 is used to indicate that this header is4031* a marker. This fact is used in arc_evict_state_impl().4032*/4033marker->b_spa = 0;40344035return (marker);4036}40374038static void4039arc_state_free_marker(arc_buf_hdr_t *marker)4040{4041kmem_cache_free(hdr_full_cache, marker);4042}40434044/*4045* Allocate an array of buffer headers used as placeholders during arc state4046* eviction.4047*/4048static arc_buf_hdr_t **4049arc_state_alloc_markers(int count)4050{4051arc_buf_hdr_t **markers;40524053markers = kmem_zalloc(sizeof (*markers) * count, KM_SLEEP);4054for (int i = 0; i < count; i++)4055markers[i] = arc_state_alloc_marker();4056return (markers);4057}40584059static void4060arc_state_free_markers(arc_buf_hdr_t **markers, int count)4061{4062for (int i = 0; i < count; i++)4063arc_state_free_marker(markers[i]);4064kmem_free(markers, sizeof (*markers) * count);4065}40664067typedef struct evict_arg {4068taskq_ent_t eva_tqent;4069multilist_t *eva_ml;4070arc_buf_hdr_t *eva_marker;4071int eva_idx;4072uint64_t eva_spa;4073uint64_t eva_bytes;4074uint64_t eva_evicted;4075} evict_arg_t;40764077static void4078arc_evict_task(void *arg)4079{4080evict_arg_t *eva = arg;4081uint64_t total_evicted = 0;4082boolean_t more;4083uint_t batches = zfs_arc_evict_batches_limit;40844085/* Process multiple batches to amortize taskq dispatch overhead. 
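 * Each dispatched task keeps calling arc_evict_state_impl() for up to
 * zfs_arc_evict_batches_limit batches, stopping early once its byte target
 * is met or no further progress can be made.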
 */
    do {
        total_evicted += arc_evict_state_impl(eva->eva_ml,
            eva->eva_idx, eva->eva_marker, eva->eva_spa,
            eva->eva_bytes - total_evicted, &more);
    } while (total_evicted < eva->eva_bytes && --batches > 0 && more);

    eva->eva_evicted = total_evicted;
}

static void
arc_evict_thread_init(void)
{
    if (zfs_arc_evict_threads == 0) {
        /*
         * Compute number of threads we want to use for eviction.
         *
         * Normally, it's log2(ncpus) + ncpus/32, which gets us to the
         * default max of 16 threads at ~256 CPUs.
         *
         * However, that formula goes to two threads at 4 CPUs, which
         * is still rather too low to be really useful, so we just go
         * with 1 thread at fewer than 6 cores.
         */
        if (max_ncpus < 6)
            zfs_arc_evict_threads = 1;
        else
            zfs_arc_evict_threads =
                (highbit64(max_ncpus) - 1) + max_ncpus / 32;
    } else if (zfs_arc_evict_threads > max_ncpus)
        zfs_arc_evict_threads = max_ncpus;

    if (zfs_arc_evict_threads > 1) {
        arc_evict_taskq = taskq_create("arc_evict",
            zfs_arc_evict_threads, defclsyspri, 0, INT_MAX,
            TASKQ_PREPOPULATE);
        arc_evict_arg = kmem_zalloc(
            sizeof (evict_arg_t) * zfs_arc_evict_threads, KM_SLEEP);
    }
}

/*
 * The minimum number of bytes we can evict at once is a block size.
 * So, SPA_MAXBLOCKSIZE is a reasonable minimal value per eviction task.
 * We use this value to compute a scaling factor for the eviction tasks.
 */
#define MIN_EVICT_SIZE (SPA_MAXBLOCKSIZE)

/*
 * Evict buffers from the given arc state, until we've removed the
 * specified number of bytes. Move the removed buffers to the
 * appropriate evict state.
 *
 * This function makes a "best effort". It skips over any buffers
 * it can't get a hash_lock on, and so, may not catch all candidates.
 * It may also return without evicting as much space as requested.
 *
 * If bytes is specified using the special value ARC_EVICT_ALL, this
 * will evict all available (i.e. unlocked and evictable) buffers from
 * the given arc state, which is used by arc_flush().
 */
static uint64_t
arc_evict_state(arc_state_t *state, arc_buf_contents_t type, uint64_t spa,
    uint64_t bytes)
{
    uint64_t total_evicted = 0;
    multilist_t *ml = &state->arcs_list[type];
    int num_sublists;
    arc_buf_hdr_t **markers;
    evict_arg_t *eva = NULL;

    num_sublists = multilist_get_num_sublists(ml);

    boolean_t use_evcttq = zfs_arc_evict_threads > 1;

    /*
     * If we've tried to evict from each sublist, made some
     * progress, but still have not hit the target number of bytes
     * to evict, we want to keep trying.
The markers allow us to4164* pick up where we left off for each individual sublist, rather4165* than starting from the tail each time.4166*/4167if (zthr_iscurthread(arc_evict_zthr)) {4168markers = arc_state_evict_markers;4169ASSERT3S(num_sublists, <=, arc_state_evict_marker_count);4170} else {4171markers = arc_state_alloc_markers(num_sublists);4172}4173for (int i = 0; i < num_sublists; i++) {4174multilist_sublist_t *mls;41754176mls = multilist_sublist_lock_idx(ml, i);4177multilist_sublist_insert_tail(mls, markers[i]);4178multilist_sublist_unlock(mls);4179}41804181if (use_evcttq) {4182if (zthr_iscurthread(arc_evict_zthr))4183eva = arc_evict_arg;4184else4185eva = kmem_alloc(sizeof (evict_arg_t) *4186zfs_arc_evict_threads, KM_NOSLEEP);4187if (eva) {4188for (int i = 0; i < zfs_arc_evict_threads; i++) {4189taskq_init_ent(&eva[i].eva_tqent);4190eva[i].eva_ml = ml;4191eva[i].eva_spa = spa;4192}4193} else {4194/*4195* Fall back to the regular single evict if it is not4196* possible to allocate memory for the taskq entries.4197*/4198use_evcttq = B_FALSE;4199}4200}42014202/*4203* Start eviction using a randomly selected sublist, this is to try and4204* evenly balance eviction across all sublists. Always starting at the4205* same sublist (e.g. index 0) would cause evictions to favor certain4206* sublists over others.4207*/4208uint64_t scan_evicted = 0;4209int sublists_left = num_sublists;4210int sublist_idx = multilist_get_random_index(ml);42114212/*4213* While we haven't hit our target number of bytes to evict, or4214* we're evicting all available buffers.4215*/4216while (total_evicted < bytes) {4217uint64_t evict = MIN_EVICT_SIZE;4218uint_t ntasks = zfs_arc_evict_threads;42194220if (use_evcttq) {4221if (sublists_left < ntasks)4222ntasks = sublists_left;42234224if (ntasks < 2)4225use_evcttq = B_FALSE;4226}42274228if (use_evcttq) {4229uint64_t left = bytes - total_evicted;42304231if (bytes == ARC_EVICT_ALL) {4232evict = bytes;4233} else if (left >= ntasks * MIN_EVICT_SIZE) {4234evict = DIV_ROUND_UP(left, ntasks);4235} else {4236ntasks = left / MIN_EVICT_SIZE;4237if (ntasks < 2)4238use_evcttq = B_FALSE;4239else4240evict = DIV_ROUND_UP(left, ntasks);4241}4242}42434244for (int i = 0; sublists_left > 0; i++, sublist_idx++,4245sublists_left--) {4246uint64_t bytes_evicted;42474248/* we've reached the end, wrap to the beginning */4249if (sublist_idx >= num_sublists)4250sublist_idx = 0;42514252if (use_evcttq) {4253if (i == ntasks)4254break;42554256eva[i].eva_marker = markers[sublist_idx];4257eva[i].eva_idx = sublist_idx;4258eva[i].eva_bytes = evict;42594260taskq_dispatch_ent(arc_evict_taskq,4261arc_evict_task, &eva[i], 0,4262&eva[i].eva_tqent);42634264continue;4265}42664267bytes_evicted = arc_evict_state_impl(ml, sublist_idx,4268markers[sublist_idx], spa, bytes - total_evicted,4269NULL);42704271scan_evicted += bytes_evicted;4272total_evicted += bytes_evicted;42734274if (total_evicted < bytes)4275kpreempt(KPREEMPT_SYNC);4276else4277break;4278}42794280if (use_evcttq) {4281taskq_wait(arc_evict_taskq);42824283for (int i = 0; i < ntasks; i++) {4284scan_evicted += eva[i].eva_evicted;4285total_evicted += eva[i].eva_evicted;4286}4287}42884289/*4290* If we scanned all sublists and didn't evict anything, we4291* have no reason to believe we'll evict more during another4292* scan, so break the loop.4293*/4294if (scan_evicted == 0 && sublists_left == 0) {4295/* This isn't possible, let's make that obvious */4296ASSERT3S(bytes, !=, 0);42974298/*4299* When bytes is ARC_EVICT_ALL, the only way to4300* break the loop is 
when scan_evicted is zero.4301* In that case, we actually have evicted enough,4302* so we don't want to increment the kstat.4303*/4304if (bytes != ARC_EVICT_ALL) {4305ASSERT3S(total_evicted, <, bytes);4306ARCSTAT_BUMP(arcstat_evict_not_enough);4307}43084309break;4310}43114312/*4313* If we scanned all sublists but still have more to do,4314* reset the counts so we can go around again.4315*/4316if (sublists_left == 0) {4317sublists_left = num_sublists;4318sublist_idx = multilist_get_random_index(ml);4319scan_evicted = 0;43204321/*4322* Since we're about to reconsider all sublists,4323* re-enable use of the evict threads if available.4324*/4325use_evcttq = (zfs_arc_evict_threads > 1 && eva != NULL);4326}4327}43284329if (eva != NULL && eva != arc_evict_arg)4330kmem_free(eva, sizeof (evict_arg_t) * zfs_arc_evict_threads);43314332for (int i = 0; i < num_sublists; i++) {4333multilist_sublist_t *mls = multilist_sublist_lock_idx(ml, i);4334multilist_sublist_remove(mls, markers[i]);4335multilist_sublist_unlock(mls);4336}43374338if (markers != arc_state_evict_markers)4339arc_state_free_markers(markers, num_sublists);43404341return (total_evicted);4342}43434344/*4345* Flush all "evictable" data of the given type from the arc state4346* specified. This will not evict any "active" buffers (i.e. referenced).4347*4348* When 'retry' is set to B_FALSE, the function will make a single pass4349* over the state and evict any buffers that it can. Since it doesn't4350* continually retry the eviction, it might end up leaving some buffers4351* in the ARC due to lock misses.4352*4353* When 'retry' is set to B_TRUE, the function will continually retry the4354* eviction until *all* evictable buffers have been removed from the4355* state. As a result, if concurrent insertions into the state are4356* allowed (e.g. if the ARC isn't shutting down), this function might4357* wind up in an infinite loop, continually trying to evict buffers.4358*/4359static uint64_t4360arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type,4361boolean_t retry)4362{4363uint64_t evicted = 0;43644365while (zfs_refcount_count(&state->arcs_esize[type]) != 0) {4366evicted += arc_evict_state(state, type, spa, ARC_EVICT_ALL);43674368if (!retry)4369break;4370}43714372return (evicted);4373}43744375/*4376* Evict the specified number of bytes from the state specified. This4377* function prevents us from trying to evict more from a state's list4378* than is "evictable", and to skip evicting altogether when passed a4379* negative value for "bytes". 
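 * (Negative values are common here: arc_evict() computes each state's
 * target delta as a signed difference, which can come out negative when
 * that state is already below its target.)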
In contrast, arc_evict_state() will4380* evict everything it can, when passed a negative value for "bytes".4381*/4382static uint64_t4383arc_evict_impl(arc_state_t *state, arc_buf_contents_t type, int64_t bytes)4384{4385uint64_t delta;43864387if (bytes > 0 && zfs_refcount_count(&state->arcs_esize[type]) > 0) {4388delta = MIN(zfs_refcount_count(&state->arcs_esize[type]),4389bytes);4390return (arc_evict_state(state, type, 0, delta));4391}43924393return (0);4394}43954396/*4397* Adjust specified fraction, taking into account initial ghost state(s) size,4398* ghost hit bytes towards increasing the fraction, ghost hit bytes towards4399* decreasing it, plus a balance factor, controlling the decrease rate, used4400* to balance metadata vs data.4401*/4402static uint64_t4403arc_evict_adj(uint64_t frac, uint64_t total, uint64_t up, uint64_t down,4404uint_t balance)4405{4406if (total < 32 || up + down == 0)4407return (frac);44084409/*4410* We should not have more ghost hits than ghost size, but they may4411* get close. To avoid overflows below up/down should not be bigger4412* than 1/5 of total. But to limit maximum adjustment speed restrict4413* it some more.4414*/4415if (up + down >= total / 16) {4416uint64_t scale = (up + down) / (total / 32);4417up /= scale;4418down /= scale;4419}44204421/* Get maximal dynamic range by choosing optimal shifts. */4422int s = highbit64(total);4423s = MIN(64 - s, 32);44244425ASSERT3U(frac, <=, 1ULL << 32);4426uint64_t ofrac = (1ULL << 32) - frac;44274428if (frac >= 4 * ofrac)4429up /= frac / (2 * ofrac + 1);4430up = (up << s) / (total >> (32 - s));4431if (ofrac >= 4 * frac)4432down /= ofrac / (2 * frac + 1);4433down = (down << s) / (total >> (32 - s));4434down = down * 100 / balance;44354436ASSERT3U(up, <=, (1ULL << 32) - frac);4437ASSERT3U(down, <=, frac);4438return (frac + up - down);4439}44404441/*4442* Calculate (x * multiplier / divisor) without unnecesary overflows.4443*/4444static uint64_t4445arc_mf(uint64_t x, uint64_t multiplier, uint64_t divisor)4446{4447uint64_t q = (x / divisor);4448uint64_t r = (x % divisor);44494450return ((q * multiplier) + ((r * multiplier) / divisor));4451}44524453/*4454* Evict buffers from the cache, such that arcstat_size is capped by arc_c.4455*/4456static uint64_t4457arc_evict(void)4458{4459uint64_t bytes, total_evicted = 0;4460int64_t e, mrud, mrum, mfud, mfum, w;4461static uint64_t ogrd, ogrm, ogfd, ogfm;4462static uint64_t gsrd, gsrm, gsfd, gsfm;4463uint64_t ngrd, ngrm, ngfd, ngfm;44644465/* Get current size of ARC states we can evict from. */4466mrud = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_DATA]) +4467zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_DATA]);4468mrum = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_METADATA]) +4469zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_METADATA]);4470mfud = zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_DATA]);4471mfum = zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_METADATA]);4472uint64_t d = mrud + mfud;4473uint64_t m = mrum + mfum;4474uint64_t t = d + m;44754476/* Get ARC ghost hits since last eviction. 
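 * The arcs_hits wmsums are cumulative, so the previous values are kept in
 * the static counters (ogrd, ogrm, ogfd, ogfm) above and only the deltas
 * accumulated since the last call are used.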
*/4477ngrd = wmsum_value(&arc_mru_ghost->arcs_hits[ARC_BUFC_DATA]);4478uint64_t grd = ngrd - ogrd;4479ogrd = ngrd;4480ngrm = wmsum_value(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA]);4481uint64_t grm = ngrm - ogrm;4482ogrm = ngrm;4483ngfd = wmsum_value(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA]);4484uint64_t gfd = ngfd - ogfd;4485ogfd = ngfd;4486ngfm = wmsum_value(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA]);4487uint64_t gfm = ngfm - ogfm;4488ogfm = ngfm;44894490/* Adjust ARC states balance based on ghost hits. */4491arc_meta = arc_evict_adj(arc_meta, gsrd + gsrm + gsfd + gsfm,4492grm + gfm, grd + gfd, zfs_arc_meta_balance);4493arc_pd = arc_evict_adj(arc_pd, gsrd + gsfd, grd, gfd, 100);4494arc_pm = arc_evict_adj(arc_pm, gsrm + gsfm, grm, gfm, 100);44954496uint64_t asize = aggsum_value(&arc_sums.arcstat_size);4497uint64_t ac = arc_c;4498int64_t wt = t - (asize - ac);44994500/*4501* Try to reduce pinned dnodes if more than 3/4 of wanted metadata4502* target is not evictable or if they go over arc_dnode_limit.4503*/4504int64_t prune = 0;4505int64_t dn = aggsum_value(&arc_sums.arcstat_dnode_size);4506int64_t nem = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_METADATA])4507+ zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_METADATA])4508- zfs_refcount_count(&arc_mru->arcs_esize[ARC_BUFC_METADATA])4509- zfs_refcount_count(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);4510w = wt * (int64_t)(arc_meta >> 16) >> 16;4511if (nem > w * 3 / 4) {4512prune = dn / sizeof (dnode_t) *4513zfs_arc_dnode_reduce_percent / 100;4514if (nem < w && w > 4)4515prune = arc_mf(prune, nem - w * 3 / 4, w / 4);4516}4517if (dn > arc_dnode_limit) {4518prune = MAX(prune, (dn - arc_dnode_limit) / sizeof (dnode_t) *4519zfs_arc_dnode_reduce_percent / 100);4520}4521if (prune > 0)4522arc_prune_async(prune);45234524/* Evict MRU metadata. */4525w = wt * (int64_t)(arc_meta * arc_pm >> 48) >> 16;4526e = MIN((int64_t)(asize - ac), (int64_t)(mrum - w));4527bytes = arc_evict_impl(arc_mru, ARC_BUFC_METADATA, e);4528total_evicted += bytes;4529mrum -= bytes;4530asize -= bytes;45314532/* Evict MFU metadata. */4533w = wt * (int64_t)(arc_meta >> 16) >> 16;4534e = MIN((int64_t)(asize - ac), (int64_t)(m - bytes - w));4535bytes = arc_evict_impl(arc_mfu, ARC_BUFC_METADATA, e);4536total_evicted += bytes;4537mfum -= bytes;4538asize -= bytes;45394540/* Evict MRU data. */4541wt -= m - total_evicted;4542w = wt * (int64_t)(arc_pd >> 16) >> 16;4543e = MIN((int64_t)(asize - ac), (int64_t)(mrud - w));4544bytes = arc_evict_impl(arc_mru, ARC_BUFC_DATA, e);4545total_evicted += bytes;4546mrud -= bytes;4547asize -= bytes;45484549/* Evict MFU data. */4550e = asize - ac;4551bytes = arc_evict_impl(arc_mfu, ARC_BUFC_DATA, e);4552mfud -= bytes;4553total_evicted += bytes;45544555/*4556* Evict ghost lists4557*4558* Size of each state's ghost list represents how much that state4559* may grow by shrinking the other states. Would it need to shrink4560* other states to zero (that is unlikely), its ghost size would be4561* equal to sum of other three state sizes. 
But excessive ghost size
 * may result in false ghost hits (too far back), which may never result
 * in real cache hits if several states are competing. So we choose an
 * arbitrary point of 1/2 of the other state sizes.
 */
    gsrd = (mrum + mfud + mfum) / 2;
    e = zfs_refcount_count(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]) -
        gsrd;
    (void) arc_evict_impl(arc_mru_ghost, ARC_BUFC_DATA, e);

    gsrm = (mrud + mfud + mfum) / 2;
    e = zfs_refcount_count(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]) -
        gsrm;
    (void) arc_evict_impl(arc_mru_ghost, ARC_BUFC_METADATA, e);

    gsfd = (mrud + mrum + mfum) / 2;
    e = zfs_refcount_count(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]) -
        gsfd;
    (void) arc_evict_impl(arc_mfu_ghost, ARC_BUFC_DATA, e);

    gsfm = (mrud + mrum + mfud) / 2;
    e = zfs_refcount_count(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]) -
        gsfm;
    (void) arc_evict_impl(arc_mfu_ghost, ARC_BUFC_METADATA, e);

    return (total_evicted);
}

static void
arc_flush_impl(uint64_t guid, boolean_t retry)
{
    ASSERT(!retry || guid == 0);

    (void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);
    (void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);

    (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);
    (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);

    (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);
    (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);

    (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);
    (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);

    (void) arc_flush_state(arc_uncached, guid, ARC_BUFC_DATA, retry);
    (void) arc_flush_state(arc_uncached, guid, ARC_BUFC_METADATA, retry);
}

void
arc_flush(spa_t *spa, boolean_t retry)
{
    /*
     * If retry is B_TRUE, a spa must not be specified since we have
     * no good way to determine if all of a spa's buffers have been
     * evicted from an arc state.
     */
    ASSERT(!retry || spa == NULL);

    arc_flush_impl(spa != NULL ?
spa_load_guid(spa) : 0, retry);4621}46224623static arc_async_flush_t *4624arc_async_flush_add(uint64_t spa_guid, uint_t level)4625{4626arc_async_flush_t *af = kmem_alloc(sizeof (*af), KM_SLEEP);4627af->af_spa_guid = spa_guid;4628af->af_cache_level = level;4629taskq_init_ent(&af->af_tqent);4630list_link_init(&af->af_node);46314632mutex_enter(&arc_async_flush_lock);4633list_insert_tail(&arc_async_flush_list, af);4634mutex_exit(&arc_async_flush_lock);46354636return (af);4637}46384639static void4640arc_async_flush_remove(uint64_t spa_guid, uint_t level)4641{4642mutex_enter(&arc_async_flush_lock);4643for (arc_async_flush_t *af = list_head(&arc_async_flush_list);4644af != NULL; af = list_next(&arc_async_flush_list, af)) {4645if (af->af_spa_guid == spa_guid &&4646af->af_cache_level == level) {4647list_remove(&arc_async_flush_list, af);4648kmem_free(af, sizeof (*af));4649break;4650}4651}4652mutex_exit(&arc_async_flush_lock);4653}46544655static void4656arc_flush_task(void *arg)4657{4658arc_async_flush_t *af = arg;4659hrtime_t start_time = gethrtime();4660uint64_t spa_guid = af->af_spa_guid;46614662arc_flush_impl(spa_guid, B_FALSE);4663arc_async_flush_remove(spa_guid, af->af_cache_level);46644665uint64_t elapsed = NSEC2MSEC(gethrtime() - start_time);4666if (elapsed > 0) {4667zfs_dbgmsg("spa %llu arc flushed in %llu ms",4668(u_longlong_t)spa_guid, (u_longlong_t)elapsed);4669}4670}46714672/*4673* ARC buffers use the spa's load guid and can continue to exist after4674* the spa_t is gone (exported). The blocks are orphaned since each4675* spa import has a different load guid.4676*4677* It's OK if the spa is re-imported while this asynchronous flush is4678* still in progress. The new spa_load_guid will be different.4679*4680* Also, arc_fini will wait for any arc_flush_task to finish.4681*/4682void4683arc_flush_async(spa_t *spa)4684{4685uint64_t spa_guid = spa_load_guid(spa);4686arc_async_flush_t *af = arc_async_flush_add(spa_guid, 1);46874688taskq_dispatch_ent(arc_flush_taskq, arc_flush_task,4689af, TQ_SLEEP, &af->af_tqent);4690}46914692/*4693* Check if a guid is still in-use as part of an async teardown task4694*/4695boolean_t4696arc_async_flush_guid_inuse(uint64_t spa_guid)4697{4698mutex_enter(&arc_async_flush_lock);4699for (arc_async_flush_t *af = list_head(&arc_async_flush_list);4700af != NULL; af = list_next(&arc_async_flush_list, af)) {4701if (af->af_spa_guid == spa_guid) {4702mutex_exit(&arc_async_flush_lock);4703return (B_TRUE);4704}4705}4706mutex_exit(&arc_async_flush_lock);4707return (B_FALSE);4708}47094710uint64_t4711arc_reduce_target_size(uint64_t to_free)4712{4713/*4714* Get the actual arc size. Even if we don't need it, this updates4715* the aggsum lower bound estimate for arc_is_overflowing().4716*/4717uint64_t asize = aggsum_value(&arc_sums.arcstat_size);47184719/*4720* All callers want the ARC to actually evict (at least) this much4721* memory. Therefore we reduce from the lower of the current size and4722* the target size. 
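 * For illustration (made-up numbers): if arc_c is 8 GB, arc_size is 2 GB,
 * arc_c_min is 1 GB and the caller asks to free 512 MB, we reduce from
 * 2 GB rather than from 8 GB, leaving arc_c at 1.5 GB.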
This way, even if arc_c is much higher than4723* arc_size (as can be the case after many calls to arc_freed(), we will4724* immediately have arc_c < arc_size and therefore the arc_evict_zthr4725* will evict.4726*/4727uint64_t c = arc_c;4728if (c > arc_c_min) {4729c = MIN(c, MAX(asize, arc_c_min));4730to_free = MIN(to_free, c - arc_c_min);4731arc_c = c - to_free;4732} else {4733to_free = 0;4734}47354736/*4737* Since dbuf cache size is a fraction of target ARC size, we should4738* notify dbuf about the reduction, which might be significant,4739* especially if current ARC size was much smaller than the target.4740*/4741dbuf_cache_reduce_target_size();47424743/*4744* Whether or not we reduced the target size, request eviction if the4745* current size is over it now, since caller obviously wants some RAM.4746*/4747if (asize > arc_c) {4748/* See comment in arc_evict_cb_check() on why lock+flag */4749mutex_enter(&arc_evict_lock);4750arc_evict_needed = B_TRUE;4751mutex_exit(&arc_evict_lock);4752zthr_wakeup(arc_evict_zthr);4753}47544755return (to_free);4756}47574758/*4759* Determine if the system is under memory pressure and is asking4760* to reclaim memory. A return value of B_TRUE indicates that the system4761* is under memory pressure and that the arc should adjust accordingly.4762*/4763boolean_t4764arc_reclaim_needed(void)4765{4766return (arc_available_memory() < 0);4767}47684769void4770arc_kmem_reap_soon(void)4771{4772size_t i;4773kmem_cache_t *prev_cache = NULL;4774kmem_cache_t *prev_data_cache = NULL;47754776#ifdef _KERNEL4777#if defined(_ILP32)4778/*4779* Reclaim unused memory from all kmem caches.4780*/4781kmem_reap();4782#endif4783#endif47844785for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {4786#if defined(_ILP32)4787/* reach upper limit of cache size on 32-bit */4788if (zio_buf_cache[i] == NULL)4789break;4790#endif4791if (zio_buf_cache[i] != prev_cache) {4792prev_cache = zio_buf_cache[i];4793kmem_cache_reap_now(zio_buf_cache[i]);4794}4795if (zio_data_buf_cache[i] != prev_data_cache) {4796prev_data_cache = zio_data_buf_cache[i];4797kmem_cache_reap_now(zio_data_buf_cache[i]);4798}4799}4800kmem_cache_reap_now(buf_cache);4801kmem_cache_reap_now(hdr_full_cache);4802kmem_cache_reap_now(hdr_l2only_cache);4803kmem_cache_reap_now(zfs_btree_leaf_cache);4804abd_cache_reap_now();4805}48064807static boolean_t4808arc_evict_cb_check(void *arg, zthr_t *zthr)4809{4810(void) arg, (void) zthr;48114812#ifdef ZFS_DEBUG4813/*4814* This is necessary in order to keep the kstat information4815* up to date for tools that display kstat data such as the4816* mdb ::arc dcmd and the Linux crash utility. These tools4817* typically do not call kstat's update function, but simply4818* dump out stats from the most recent update. Without4819* this call, these commands may show stale stats for the4820* anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even4821* with this call, the data might be out of date if the4822* evict thread hasn't been woken recently; but that should4823* suffice. The arc_state_t structures can be queried4824* directly if more accurate information is needed.4825*/4826if (arc_ksp != NULL)4827arc_ksp->ks_update(arc_ksp, KSTAT_READ);4828#endif48294830/*4831* We have to rely on arc_wait_for_eviction() to tell us when to4832* evict, rather than checking if we are overflowing here, so that we4833* are sure to not leave arc_wait_for_eviction() waiting on aew_cv.4834* If we have become "not overflowing" since arc_wait_for_eviction()4835* checked, we need to wake it up. 
We could broadcast the CV here, but
 * arc_wait_for_eviction() may not have gone to sleep yet. We
 * would need to use a mutex to ensure that this function doesn't
 * broadcast until arc_wait_for_eviction() has gone to sleep (e.g.
 * the arc_evict_lock). However, the lock ordering of such a lock
 * would necessarily be incorrect with respect to the zthr_lock,
 * which is held before this function is called, and is held by
 * arc_wait_for_eviction() when it calls zthr_wakeup().
 */
    if (arc_evict_needed)
        return (B_TRUE);

    /*
     * If we have buffers in uncached state, evict them periodically.
     */
    return ((zfs_refcount_count(&arc_uncached->arcs_esize[ARC_BUFC_DATA]) +
        zfs_refcount_count(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]) &&
        ddi_get_lbolt() - arc_last_uncached_flush > arc_min_prefetch / 2));
}

/*
 * Keep arc_size under arc_c by running arc_evict which evicts data
 * from the ARC.
 */
static void
arc_evict_cb(void *arg, zthr_t *zthr)
{
    (void) arg;

    uint64_t evicted = 0;
    fstrans_cookie_t cookie = spl_fstrans_mark();

    /* Always try to evict from uncached state. */
    arc_last_uncached_flush = ddi_get_lbolt();
    evicted += arc_flush_state(arc_uncached, 0, ARC_BUFC_DATA, B_FALSE);
    evicted += arc_flush_state(arc_uncached, 0, ARC_BUFC_METADATA, B_FALSE);

    /* Evict from other states only if told to. */
    if (arc_evict_needed)
        evicted += arc_evict();

    /*
     * If evicted is zero, we couldn't evict anything
     * via arc_evict(). This could be due to hash lock
     * collisions, but more likely due to the majority of
     * arc buffers being unevictable. Therefore, even if
     * arc_size is above arc_c, another pass is unlikely to
     * be helpful and could potentially cause us to enter an
     * infinite loop. Additionally, zthr_iscancelled() is
     * checked here so that if the arc is shutting down, the
     * broadcast will wake any remaining arc evict waiters.
     *
     * Note we cancel using zthr instead of arc_evict_zthr
     * because the latter may not yet be initialized when the
     * callback is first invoked.
     */
    mutex_enter(&arc_evict_lock);
    arc_evict_needed = !zthr_iscancelled(zthr) &&
        evicted > 0 && aggsum_compare(&arc_sums.arcstat_size, arc_c) > 0;
    if (!arc_evict_needed) {
        /*
         * We're either no longer overflowing, or we
         * can't evict anything more, so we should wake
         * arc_get_data_impl() sooner.
         */
        arc_evict_waiter_t *aw;
        while ((aw = list_remove_head(&arc_evict_waiters)) != NULL) {
            cv_signal(&aw->aew_cv);
        }
        arc_set_need_free();
    }
    mutex_exit(&arc_evict_lock);
    spl_fstrans_unmark(cookie);
}

static boolean_t
arc_reap_cb_check(void *arg, zthr_t *zthr)
{
    (void) arg, (void) zthr;

    int64_t free_memory = arc_available_memory();
    static int reap_cb_check_counter = 0;

    /*
     * If a kmem reap is already active, don't schedule more.
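 * (kmem_cache_reap_active() below reports whether a previously requested
 * reap is still running.)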
We must4920* check for this because kmem_cache_reap_soon() won't actually4921* block on the cache being reaped (this is to prevent callers from4922* becoming implicitly blocked by a system-wide kmem reap -- which,4923* on a system with many, many full magazines, can take minutes).4924*/4925if (!kmem_cache_reap_active() && free_memory < 0) {49264927arc_no_grow = B_TRUE;4928arc_warm = B_TRUE;4929/*4930* Wait at least zfs_grow_retry (default 5) seconds4931* before considering growing.4932*/4933arc_growtime = gethrtime() + SEC2NSEC(arc_grow_retry);4934return (B_TRUE);4935} else if (free_memory < arc_c >> arc_no_grow_shift) {4936arc_no_grow = B_TRUE;4937} else if (gethrtime() >= arc_growtime) {4938arc_no_grow = B_FALSE;4939}49404941/*4942* Called unconditionally every 60 seconds to reclaim unused4943* zstd compression and decompression context. This is done4944* here to avoid the need for an independent thread.4945*/4946if (!((reap_cb_check_counter++) % 60))4947zfs_zstd_cache_reap_now();49484949return (B_FALSE);4950}49514952/*4953* Keep enough free memory in the system by reaping the ARC's kmem4954* caches. To cause more slabs to be reapable, we may reduce the4955* target size of the cache (arc_c), causing the arc_evict_cb()4956* to free more buffers.4957*/4958static void4959arc_reap_cb(void *arg, zthr_t *zthr)4960{4961int64_t can_free, free_memory, to_free;49624963(void) arg, (void) zthr;4964fstrans_cookie_t cookie = spl_fstrans_mark();49654966/*4967* Kick off asynchronous kmem_reap()'s of all our caches.4968*/4969arc_kmem_reap_soon();49704971/*4972* Wait at least arc_kmem_cache_reap_retry_ms between4973* arc_kmem_reap_soon() calls. Without this check it is possible to4974* end up in a situation where we spend lots of time reaping4975* caches, while we're near arc_c_min. Waiting here also gives the4976* subsequent free memory check a chance of finding that the4977* asynchronous reap has already freed enough memory, and we don't4978* need to call arc_reduce_target_size().4979*/4980delay((hz * arc_kmem_cache_reap_retry_ms + 999) / 1000);49814982/*4983* Reduce the target size as needed to maintain the amount of free4984* memory in the system at a fraction of the arc_size (1/128th by4985* default). If oversubscribed (free_memory < 0) then reduce the4986* target arc_size by the deficit amount plus the fractional4987* amount. If free memory is positive but less than the fractional4988* amount, reduce by what is needed to hit the fractional amount.4989*/4990free_memory = arc_available_memory();4991can_free = arc_c - arc_c_min;4992to_free = (MAX(can_free, 0) >> arc_shrink_shift) - free_memory;4993if (to_free > 0)4994arc_reduce_target_size(to_free);4995spl_fstrans_unmark(cookie);4996}49974998#ifdef _KERNEL4999/*5000* Determine the amount of memory eligible for eviction contained in the5001* ARC. All clean data reported by the ghost lists can always be safely5002* evicted. Due to arc_c_min, the same does not hold for all clean data5003* contained by the regular mru and mfu lists.5004*5005* In the case of the regular mru and mfu lists, we need to report as5006* much clean data as possible, such that evicting that same reported5007* data will not bring arc_size below arc_c_min. Thus, in certain5008* circumstances, the total amount of clean data in the mru and mfu5009* lists might not actually be evictable.5010*5011* The following two distinct cases are accounted for:5012*5013* 1. The sum of the amount of dirty data contained by both the mru and5014* mfu lists, plus the ARC's other accounting (e.g. 
the anon list),5015* is greater than or equal to arc_c_min.5016* (i.e. amount of dirty data >= arc_c_min)5017*5018* This is the easy case; all clean data contained by the mru and mfu5019* lists is evictable. Evicting all clean data can only drop arc_size5020* to the amount of dirty data, which is greater than arc_c_min.5021*5022* 2. The sum of the amount of dirty data contained by both the mru and5023* mfu lists, plus the ARC's other accounting (e.g. the anon list),5024* is less than arc_c_min.5025* (i.e. arc_c_min > amount of dirty data)5026*5027* 2.1. arc_size is greater than or equal arc_c_min.5028* (i.e. arc_size >= arc_c_min > amount of dirty data)5029*5030* In this case, not all clean data from the regular mru and mfu5031* lists is actually evictable; we must leave enough clean data5032* to keep arc_size above arc_c_min. Thus, the maximum amount of5033* evictable data from the two lists combined, is exactly the5034* difference between arc_size and arc_c_min.5035*5036* 2.2. arc_size is less than arc_c_min5037* (i.e. arc_c_min > arc_size > amount of dirty data)5038*5039* In this case, none of the data contained in the mru and mfu5040* lists is evictable, even if it's clean. Since arc_size is5041* already below arc_c_min, evicting any more would only5042* increase this negative difference.5043*/50445045#endif /* _KERNEL */50465047/*5048* Adapt arc info given the number of bytes we are trying to add and5049* the state that we are coming from. This function is only called5050* when we are adding new content to the cache.5051*/5052static void5053arc_adapt(uint64_t bytes)5054{5055/*5056* Wake reap thread if we do not have any available memory5057*/5058if (arc_reclaim_needed()) {5059zthr_wakeup(arc_reap_zthr);5060return;5061}50625063if (arc_no_grow)5064return;50655066if (arc_c >= arc_c_max)5067return;50685069/*5070* If we're within (2 * maxblocksize) bytes of the target5071* cache size, increment the target cache size5072*/5073if (aggsum_upper_bound(&arc_sums.arcstat_size) +50742 * SPA_MAXBLOCKSIZE >= arc_c) {5075uint64_t dc = MAX(bytes, SPA_OLD_MAXBLOCKSIZE);5076if (atomic_add_64_nv(&arc_c, dc) > arc_c_max)5077arc_c = arc_c_max;5078}5079}50805081/*5082* Check if ARC current size has grown past our upper thresholds.5083*/5084static arc_ovf_level_t5085arc_is_overflowing(boolean_t lax, boolean_t use_reserve)5086{5087/*5088* We just compare the lower bound here for performance reasons. Our5089* primary goals are to make sure that the arc never grows without5090* bound, and that it can reach its maximum size. This check5091* accomplishes both goals. The maximum amount we could run over by is5092* 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block5093* in the ARC. In practice, that's in the tens of MB, which is low5094* enough to be safe.5095*/5096int64_t arc_over = aggsum_lower_bound(&arc_sums.arcstat_size) - arc_c -5097zfs_max_recordsize;5098int64_t dn_over = aggsum_lower_bound(&arc_sums.arcstat_dnode_size) -5099arc_dnode_limit;51005101/* Always allow at least one block of overflow. */5102if (arc_over < 0 && dn_over <= 0)5103return (ARC_OVF_NONE);51045105/* If we are under memory pressure, report severe overflow. */5106if (!lax)5107return (ARC_OVF_SEVERE);51085109/* We are not under pressure, so be more or less relaxed. */5110int64_t overflow = (arc_c >> zfs_arc_overflow_shift) / 2;5111if (use_reserve)5112overflow *= 3;5113return (arc_over < overflow ? 
ARC_OVF_SOME : ARC_OVF_SEVERE);5114}51155116static abd_t *5117arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, const void *tag,5118int alloc_flags)5119{5120arc_buf_contents_t type = arc_buf_type(hdr);51215122arc_get_data_impl(hdr, size, tag, alloc_flags);5123if (alloc_flags & ARC_HDR_ALLOC_LINEAR)5124return (abd_alloc_linear(size, type == ARC_BUFC_METADATA));5125else5126return (abd_alloc(size, type == ARC_BUFC_METADATA));5127}51285129static void *5130arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, const void *tag)5131{5132arc_buf_contents_t type = arc_buf_type(hdr);51335134arc_get_data_impl(hdr, size, tag, 0);5135if (type == ARC_BUFC_METADATA) {5136return (zio_buf_alloc(size));5137} else {5138ASSERT(type == ARC_BUFC_DATA);5139return (zio_data_buf_alloc(size));5140}5141}51425143/*5144* Wait for the specified amount of data (in bytes) to be evicted from the5145* ARC, and for there to be sufficient free memory in the system.5146* The lax argument specifies that caller does not have a specific reason5147* to wait, not aware of any memory pressure. Low memory handlers though5148* should set it to B_FALSE to wait for all required evictions to complete.5149* The use_reserve argument allows some callers to wait less than others5150* to not block critical code paths, possibly blocking other resources.5151*/5152void5153arc_wait_for_eviction(uint64_t amount, boolean_t lax, boolean_t use_reserve)5154{5155switch (arc_is_overflowing(lax, use_reserve)) {5156case ARC_OVF_NONE:5157return;5158case ARC_OVF_SOME:5159/*5160* This is a bit racy without taking arc_evict_lock, but the5161* worst that can happen is we either call zthr_wakeup() extra5162* time due to race with other thread here, or the set flag5163* get cleared by arc_evict_cb(), which is unlikely due to5164* big hysteresis, but also not important since at this level5165* of overflow the eviction is purely advisory. Same time5166* taking the global lock here every time without waiting for5167* the actual eviction creates a significant lock contention.5168*/5169if (!arc_evict_needed) {5170arc_evict_needed = B_TRUE;5171zthr_wakeup(arc_evict_zthr);5172}5173return;5174case ARC_OVF_SEVERE:5175default:5176{5177arc_evict_waiter_t aw;5178list_link_init(&aw.aew_node);5179cv_init(&aw.aew_cv, NULL, CV_DEFAULT, NULL);51805181uint64_t last_count = 0;5182mutex_enter(&arc_evict_lock);5183arc_evict_waiter_t *last;5184if ((last = list_tail(&arc_evict_waiters)) != NULL) {5185last_count = last->aew_count;5186} else if (!arc_evict_needed) {5187arc_evict_needed = B_TRUE;5188zthr_wakeup(arc_evict_zthr);5189}5190/*5191* Note, the last waiter's count may be less than5192* arc_evict_count if we are low on memory in which5193* case arc_evict_state_impl() may have deferred5194* wakeups (but still incremented arc_evict_count).5195*/5196aw.aew_count = MAX(last_count, arc_evict_count) + amount;51975198list_insert_tail(&arc_evict_waiters, &aw);51995200arc_set_need_free();52015202DTRACE_PROBE3(arc__wait__for__eviction,5203uint64_t, amount,5204uint64_t, arc_evict_count,5205uint64_t, aw.aew_count);52065207/*5208* We will be woken up either when arc_evict_count reaches5209* aew_count, or when the ARC is no longer overflowing and5210* eviction completes.5211* In case of "false" wakeup, we will still be on the list.5212*/5213do {5214cv_wait(&aw.aew_cv, &arc_evict_lock);5215} while (list_link_active(&aw.aew_node));5216mutex_exit(&arc_evict_lock);52175218cv_destroy(&aw.aew_cv);5219}5220}5221}52225223/*5224* Allocate a block and return it to the caller. 
If we are hitting the5225* hard limit for the cache size, we must sleep, waiting for the eviction5226* thread to catch up. If we're past the target size but below the hard5227* limit, we'll only signal the reclaim thread and continue on.5228*/5229static void5230arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, const void *tag,5231int alloc_flags)5232{5233arc_adapt(size);52345235/*5236* If arc_size is currently overflowing, we must be adding data5237* faster than we are evicting. To ensure we don't compound the5238* problem by adding more data and forcing arc_size to grow even5239* further past it's target size, we wait for the eviction thread to5240* make some progress. We also wait for there to be sufficient free5241* memory in the system, as measured by arc_free_memory().5242*5243* Specifically, we wait for zfs_arc_eviction_pct percent of the5244* requested size to be evicted. This should be more than 100%, to5245* ensure that that progress is also made towards getting arc_size5246* under arc_c. See the comment above zfs_arc_eviction_pct.5247*/5248arc_wait_for_eviction(size * zfs_arc_eviction_pct / 100,5249B_TRUE, alloc_flags & ARC_HDR_USE_RESERVE);52505251arc_buf_contents_t type = arc_buf_type(hdr);5252if (type == ARC_BUFC_METADATA) {5253arc_space_consume(size, ARC_SPACE_META);5254} else {5255arc_space_consume(size, ARC_SPACE_DATA);5256}52575258/*5259* Update the state size. Note that ghost states have a5260* "ghost size" and so don't need to be updated.5261*/5262arc_state_t *state = hdr->b_l1hdr.b_state;5263if (!GHOST_STATE(state)) {52645265(void) zfs_refcount_add_many(&state->arcs_size[type], size,5266tag);52675268/*5269* If this is reached via arc_read, the link is5270* protected by the hash lock. If reached via5271* arc_buf_alloc, the header should not be accessed by5272* any other thread. 
And, if reached via arc_read_done,5273* the hash lock will protect it if it's found in the5274* hash table; otherwise no other thread should be5275* trying to [add|remove]_reference it.5276*/5277if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {5278ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));5279(void) zfs_refcount_add_many(&state->arcs_esize[type],5280size, tag);5281}5282}5283}52845285static void5286arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size,5287const void *tag)5288{5289arc_free_data_impl(hdr, size, tag);5290abd_free(abd);5291}52925293static void5294arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, const void *tag)5295{5296arc_buf_contents_t type = arc_buf_type(hdr);52975298arc_free_data_impl(hdr, size, tag);5299if (type == ARC_BUFC_METADATA) {5300zio_buf_free(buf, size);5301} else {5302ASSERT(type == ARC_BUFC_DATA);5303zio_data_buf_free(buf, size);5304}5305}53065307/*5308* Free the arc data buffer.5309*/5310static void5311arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, const void *tag)5312{5313arc_state_t *state = hdr->b_l1hdr.b_state;5314arc_buf_contents_t type = arc_buf_type(hdr);53155316/* protected by hash lock, if in the hash table */5317if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {5318ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));5319ASSERT(state != arc_anon && state != arc_l2c_only);53205321(void) zfs_refcount_remove_many(&state->arcs_esize[type],5322size, tag);5323}5324(void) zfs_refcount_remove_many(&state->arcs_size[type], size, tag);53255326VERIFY3U(hdr->b_type, ==, type);5327if (type == ARC_BUFC_METADATA) {5328arc_space_return(size, ARC_SPACE_META);5329} else {5330ASSERT(type == ARC_BUFC_DATA);5331arc_space_return(size, ARC_SPACE_DATA);5332}5333}53345335/*5336* This routine is called whenever a buffer is accessed.5337*/5338static void5339arc_access(arc_buf_hdr_t *hdr, arc_flags_t arc_flags, boolean_t hit)5340{5341ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));5342ASSERT(HDR_HAS_L1HDR(hdr));53435344/*5345* Update buffer prefetch status.5346*/5347boolean_t was_prefetch = HDR_PREFETCH(hdr);5348boolean_t now_prefetch = arc_flags & ARC_FLAG_PREFETCH;5349if (was_prefetch != now_prefetch) {5350if (was_prefetch) {5351ARCSTAT_CONDSTAT(hit, demand_hit, demand_iohit,5352HDR_PRESCIENT_PREFETCH(hdr), prescient, predictive,5353prefetch);5354}5355if (HDR_HAS_L2HDR(hdr))5356l2arc_hdr_arcstats_decrement_state(hdr);5357if (was_prefetch) {5358arc_hdr_clear_flags(hdr,5359ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH);5360} else {5361arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);5362}5363if (HDR_HAS_L2HDR(hdr))5364l2arc_hdr_arcstats_increment_state(hdr);5365}5366if (now_prefetch) {5367if (arc_flags & ARC_FLAG_PRESCIENT_PREFETCH) {5368arc_hdr_set_flags(hdr, ARC_FLAG_PRESCIENT_PREFETCH);5369ARCSTAT_BUMP(arcstat_prescient_prefetch);5370} else {5371ARCSTAT_BUMP(arcstat_predictive_prefetch);5372}5373}5374if (arc_flags & ARC_FLAG_L2CACHE)5375arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);53765377clock_t now = ddi_get_lbolt();5378if (hdr->b_l1hdr.b_state == arc_anon) {5379arc_state_t *new_state;5380/*5381* This buffer is not in the cache, and does not appear in5382* our "ghost" lists. 
		 * Add it to the MRU or uncached state.
		 */
		ASSERT0(hdr->b_l1hdr.b_arc_access);
		hdr->b_l1hdr.b_arc_access = now;
		if (HDR_UNCACHED(hdr)) {
			new_state = arc_uncached;
			DTRACE_PROBE1(new_state__uncached, arc_buf_hdr_t *,
			    hdr);
		} else {
			new_state = arc_mru;
			DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
		}
		arc_change_state(new_state, hdr);
	} else if (hdr->b_l1hdr.b_state == arc_mru) {
		/*
		 * This buffer has been accessed once recently and either
		 * its read is still in progress or it is in the cache.
		 */
		if (HDR_IO_IN_PROGRESS(hdr)) {
			hdr->b_l1hdr.b_arc_access = now;
			return;
		}
		hdr->b_l1hdr.b_mru_hits++;
		ARCSTAT_BUMP(arcstat_mru_hits);

		/*
		 * If the previous access was a prefetch, then it already
		 * handled possible promotion, so nothing more to do for now.
		 */
		if (was_prefetch) {
			hdr->b_l1hdr.b_arc_access = now;
			return;
		}

		/*
		 * If more than ARC_MINTIME has passed since the previous
		 * hit, promote the buffer to the MFU state.
		 */
		if (ddi_time_after(now, hdr->b_l1hdr.b_arc_access +
		    ARC_MINTIME)) {
			hdr->b_l1hdr.b_arc_access = now;
			DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
			arc_change_state(arc_mfu, hdr);
		}
	} else if (hdr->b_l1hdr.b_state == arc_mru_ghost) {
		arc_state_t *new_state;
		/*
		 * This buffer has been accessed once recently, but was
		 * evicted from the cache. Had the MRU been bigger, this
		 * would have been an MRU hit, so handle it the same way,
		 * except we don't need to check the previous access time.
		 */
		hdr->b_l1hdr.b_mru_ghost_hits++;
		ARCSTAT_BUMP(arcstat_mru_ghost_hits);
		hdr->b_l1hdr.b_arc_access = now;
		wmsum_add(&arc_mru_ghost->arcs_hits[arc_buf_type(hdr)],
		    arc_hdr_size(hdr));
		if (was_prefetch) {
			new_state = arc_mru;
			DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);
		} else {
			new_state = arc_mfu;
			DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
		}
		arc_change_state(new_state, hdr);
	} else if (hdr->b_l1hdr.b_state == arc_mfu) {
		/*
		 * This buffer has been accessed more than once and is either
		 * still in the cache or being restored from one of the ghost
		 * states.
		 */
		if (!HDR_IO_IN_PROGRESS(hdr)) {
			hdr->b_l1hdr.b_mfu_hits++;
			ARCSTAT_BUMP(arcstat_mfu_hits);
		}
		hdr->b_l1hdr.b_arc_access = now;
	} else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) {
		/*
		 * This buffer has been accessed more than once recently, but
		 * has been evicted from the cache. Had the MFU been bigger,
		 * it would have stayed in the cache, so move it back to the
		 * MFU state.
		 */
		hdr->b_l1hdr.b_mfu_ghost_hits++;
		ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
		hdr->b_l1hdr.b_arc_access = now;
		wmsum_add(&arc_mfu_ghost->arcs_hits[arc_buf_type(hdr)],
		    arc_hdr_size(hdr));
		DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);
		arc_change_state(arc_mfu, hdr);
	} else if (hdr->b_l1hdr.b_state == arc_uncached) {
		/*
		 * This buffer is uncacheable, but we got a hit. Probably
		 * a demand read after prefetch.
Nothing more to do here.5474*/5475if (!HDR_IO_IN_PROGRESS(hdr))5476ARCSTAT_BUMP(arcstat_uncached_hits);5477hdr->b_l1hdr.b_arc_access = now;5478} else if (hdr->b_l1hdr.b_state == arc_l2c_only) {5479/*5480* This buffer is on the 2nd Level ARC and was not accessed5481* for a long time, so treat it as new and put into MRU.5482*/5483hdr->b_l1hdr.b_arc_access = now;5484DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);5485arc_change_state(arc_mru, hdr);5486} else {5487cmn_err(CE_PANIC, "invalid arc state 0x%p",5488hdr->b_l1hdr.b_state);5489}5490}54915492/*5493* This routine is called by dbuf_hold() to update the arc_access() state5494* which otherwise would be skipped for entries in the dbuf cache.5495*/5496void5497arc_buf_access(arc_buf_t *buf)5498{5499arc_buf_hdr_t *hdr = buf->b_hdr;55005501/*5502* Avoid taking the hash_lock when possible as an optimization.5503* The header must be checked again under the hash_lock in order5504* to handle the case where it is concurrently being released.5505*/5506if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr))5507return;55085509kmutex_t *hash_lock = HDR_LOCK(hdr);5510mutex_enter(hash_lock);55115512if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {5513mutex_exit(hash_lock);5514ARCSTAT_BUMP(arcstat_access_skip);5515return;5516}55175518ASSERT(hdr->b_l1hdr.b_state == arc_mru ||5519hdr->b_l1hdr.b_state == arc_mfu ||5520hdr->b_l1hdr.b_state == arc_uncached);55215522DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);5523arc_access(hdr, 0, B_TRUE);5524mutex_exit(hash_lock);55255526ARCSTAT_BUMP(arcstat_hits);5527ARCSTAT_CONDSTAT(B_TRUE /* demand */, demand, prefetch,5528!HDR_ISTYPE_METADATA(hdr), data, metadata, hits);5529}55305531/* a generic arc_read_done_func_t which you can use */5532void5533arc_bcopy_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,5534arc_buf_t *buf, void *arg)5535{5536(void) zio, (void) zb, (void) bp;55375538if (buf == NULL)5539return;55405541memcpy(arg, buf->b_data, arc_buf_size(buf));5542arc_buf_destroy(buf, arg);5543}55445545/* a generic arc_read_done_func_t */5546void5547arc_getbuf_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,5548arc_buf_t *buf, void *arg)5549{5550(void) zb, (void) bp;5551arc_buf_t **bufp = arg;55525553if (buf == NULL) {5554ASSERT(zio == NULL || zio->io_error != 0);5555*bufp = NULL;5556} else {5557ASSERT(zio == NULL || zio->io_error == 0);5558*bufp = buf;5559ASSERT(buf->b_data != NULL);5560}5561}55625563static void5564arc_hdr_verify(arc_buf_hdr_t *hdr, blkptr_t *bp)5565{5566if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {5567ASSERT0(HDR_GET_PSIZE(hdr));5568ASSERT3U(arc_hdr_get_compress(hdr), ==, ZIO_COMPRESS_OFF);5569} else {5570if (HDR_COMPRESSION_ENABLED(hdr)) {5571ASSERT3U(arc_hdr_get_compress(hdr), ==,5572BP_GET_COMPRESS(bp));5573}5574ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));5575ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp));5576ASSERT3U(!!HDR_PROTECTED(hdr), ==, BP_IS_PROTECTED(bp));5577}5578}55795580static void5581arc_read_done(zio_t *zio)5582{5583blkptr_t *bp = zio->io_bp;5584arc_buf_hdr_t *hdr = zio->io_private;5585kmutex_t *hash_lock = NULL;5586arc_callback_t *callback_list;5587arc_callback_t *acb;55885589/*5590* The hdr was inserted into hash-table and removed from lists5591* prior to starting I/O. We should find this header, since5592* it's in the hash table, and it should be legit since it's5593* not possible to evict it during the I/O. 
The only possible5594* reason for it not to be found is if we were freed during the5595* read.5596*/5597if (HDR_IN_HASH_TABLE(hdr)) {5598arc_buf_hdr_t *found;55995600ASSERT3U(hdr->b_birth, ==, BP_GET_PHYSICAL_BIRTH(zio->io_bp));5601ASSERT3U(hdr->b_dva.dva_word[0], ==,5602BP_IDENTITY(zio->io_bp)->dva_word[0]);5603ASSERT3U(hdr->b_dva.dva_word[1], ==,5604BP_IDENTITY(zio->io_bp)->dva_word[1]);56055606found = buf_hash_find(hdr->b_spa, zio->io_bp, &hash_lock);56075608ASSERT((found == hdr &&5609DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||5610(found == hdr && HDR_L2_READING(hdr)));5611ASSERT3P(hash_lock, !=, NULL);5612}56135614if (BP_IS_PROTECTED(bp)) {5615hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp);5616hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset;5617zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt,5618hdr->b_crypt_hdr.b_iv);56195620if (zio->io_error == 0) {5621if (BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG) {5622void *tmpbuf;56235624tmpbuf = abd_borrow_buf_copy(zio->io_abd,5625sizeof (zil_chain_t));5626zio_crypt_decode_mac_zil(tmpbuf,5627hdr->b_crypt_hdr.b_mac);5628abd_return_buf(zio->io_abd, tmpbuf,5629sizeof (zil_chain_t));5630} else {5631zio_crypt_decode_mac_bp(bp,5632hdr->b_crypt_hdr.b_mac);5633}5634}5635}56365637if (zio->io_error == 0) {5638/* byteswap if necessary */5639if (BP_SHOULD_BYTESWAP(zio->io_bp)) {5640if (BP_GET_LEVEL(zio->io_bp) > 0) {5641hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;5642} else {5643hdr->b_l1hdr.b_byteswap =5644DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));5645}5646} else {5647hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;5648}5649if (!HDR_L2_READING(hdr)) {5650hdr->b_complevel = zio->io_prop.zp_complevel;5651}5652}56535654arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED);5655if (l2arc_noprefetch && HDR_PREFETCH(hdr))5656arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE);56575658callback_list = hdr->b_l1hdr.b_acb;5659ASSERT3P(callback_list, !=, NULL);5660hdr->b_l1hdr.b_acb = NULL;56615662/*5663* If a read request has a callback (i.e. acb_done is not NULL), then we5664* make a buf containing the data according to the parameters which were5665* passed in. The implementation of arc_buf_alloc_impl() ensures that we5666* aren't needlessly decompressing the data multiple times.5667*/5668int callback_cnt = 0;5669for (acb = callback_list; acb != NULL; acb = acb->acb_next) {56705671/* We need the last one to call below in original order. */5672callback_list = acb;56735674if (!acb->acb_done || acb->acb_nobuf)5675continue;56765677callback_cnt++;56785679if (zio->io_error != 0)5680continue;56815682int error = arc_buf_alloc_impl(hdr, zio->io_spa,5683&acb->acb_zb, acb->acb_private, acb->acb_encrypted,5684acb->acb_compressed, acb->acb_noauth, B_TRUE,5685&acb->acb_buf);56865687/*5688* Assert non-speculative zios didn't fail because an5689* encryption key wasn't loaded5690*/5691ASSERT((zio->io_flags & ZIO_FLAG_SPECULATIVE) ||5692error != EACCES);56935694/*5695* If we failed to decrypt, report an error now (as the zio5696* layer would have done if it had done the transforms).5697*/5698if (error == ECKSUM) {5699ASSERT(BP_IS_PROTECTED(bp));5700error = SET_ERROR(EIO);5701if ((zio->io_flags & ZIO_FLAG_SPECULATIVE) == 0) {5702spa_log_error(zio->io_spa, &acb->acb_zb,5703BP_GET_PHYSICAL_BIRTH(zio->io_bp));5704(void) zfs_ereport_post(5705FM_EREPORT_ZFS_AUTHENTICATION,5706zio->io_spa, NULL, &acb->acb_zb, zio, 0);5707}5708}57095710if (error != 0) {5711/*5712* Decompression or decryption failed. Set5713* io_error so that when we call acb_done5714* (below), we will indicate that the read5715* failed. 
Note that in the unusual case5716* where one callback is compressed and another5717* uncompressed, we will mark all of them5718* as failed, even though the uncompressed5719* one can't actually fail. In this case,5720* the hdr will not be anonymous, because5721* if there are multiple callbacks, it's5722* because multiple threads found the same5723* arc buf in the hash table.5724*/5725zio->io_error = error;5726}5727}57285729/*5730* If there are multiple callbacks, we must have the hash lock,5731* because the only way for multiple threads to find this hdr is5732* in the hash table. This ensures that if there are multiple5733* callbacks, the hdr is not anonymous. If it were anonymous,5734* we couldn't use arc_buf_destroy() in the error case below.5735*/5736ASSERT(callback_cnt < 2 || hash_lock != NULL);57375738if (zio->io_error == 0) {5739arc_hdr_verify(hdr, zio->io_bp);5740} else {5741arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);5742if (hdr->b_l1hdr.b_state != arc_anon)5743arc_change_state(arc_anon, hdr);5744if (HDR_IN_HASH_TABLE(hdr))5745buf_hash_remove(hdr);5746}57475748arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);5749(void) remove_reference(hdr, hdr);57505751if (hash_lock != NULL)5752mutex_exit(hash_lock);57535754/* execute each callback and free its structure */5755while ((acb = callback_list) != NULL) {5756if (acb->acb_done != NULL) {5757if (zio->io_error != 0 && acb->acb_buf != NULL) {5758/*5759* If arc_buf_alloc_impl() fails during5760* decompression, the buf will still be5761* allocated, and needs to be freed here.5762*/5763arc_buf_destroy(acb->acb_buf,5764acb->acb_private);5765acb->acb_buf = NULL;5766}5767acb->acb_done(zio, &zio->io_bookmark, zio->io_bp,5768acb->acb_buf, acb->acb_private);5769}57705771if (acb->acb_zio_dummy != NULL) {5772acb->acb_zio_dummy->io_error = zio->io_error;5773zio_nowait(acb->acb_zio_dummy);5774}57755776callback_list = acb->acb_prev;5777if (acb->acb_wait) {5778mutex_enter(&acb->acb_wait_lock);5779acb->acb_wait_error = zio->io_error;5780acb->acb_wait = B_FALSE;5781cv_signal(&acb->acb_wait_cv);5782mutex_exit(&acb->acb_wait_lock);5783/* acb will be freed by the waiting thread. */5784} else {5785kmem_free(acb, sizeof (arc_callback_t));5786}5787}5788}57895790/*5791* Lookup the block at the specified DVA (in bp), and return the manner in5792* which the block is cached. A zero return indicates not cached.5793*/5794int5795arc_cached(spa_t *spa, const blkptr_t *bp)5796{5797arc_buf_hdr_t *hdr = NULL;5798kmutex_t *hash_lock = NULL;5799uint64_t guid = spa_load_guid(spa);5800int flags = 0;58015802if (BP_IS_EMBEDDED(bp))5803return (ARC_CACHED_EMBEDDED);58045805hdr = buf_hash_find(guid, bp, &hash_lock);5806if (hdr == NULL)5807return (0);58085809if (HDR_HAS_L1HDR(hdr)) {5810arc_state_t *state = hdr->b_l1hdr.b_state;5811/*5812* We switch to ensure that any future arc_state_type_t5813* changes are handled. This is just a shift to promote5814* more compile-time checking.5815*/5816switch (state->arcs_state) {5817case ARC_STATE_ANON:5818break;5819case ARC_STATE_MRU:5820flags |= ARC_CACHED_IN_MRU | ARC_CACHED_IN_L1;5821break;5822case ARC_STATE_MFU:5823flags |= ARC_CACHED_IN_MFU | ARC_CACHED_IN_L1;5824break;5825case ARC_STATE_UNCACHED:5826/* The header is still in L1, probably not for long */5827flags |= ARC_CACHED_IN_L1;5828break;5829default:5830break;5831}5832}5833if (HDR_HAS_L2HDR(hdr))5834flags |= ARC_CACHED_IN_L2;58355836mutex_exit(hash_lock);58375838return (flags);5839}58405841/*5842* "Read" the block at the specified DVA (in bp) via the5843* cache. 
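 * (As an illustrative sketch only -- flag, priority and error-handling
 * choices vary by caller -- a simple blocking read that hands the buffer
 * back through arc_getbuf_func() might look roughly like:
 *
 *	arc_buf_t *abuf = NULL;
 *	arc_flags_t aflags = ARC_FLAG_WAIT;
 *	int err = arc_read(NULL, spa, bp, arc_getbuf_func, &abuf,
 *	    ZIO_PRIORITY_SYNC_READ, ZIO_FLAG_CANFAIL, &aflags, zb);
 *
 * with the caller eventually dropping its hold via
 * arc_buf_destroy(abuf, &abuf).)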
If the block is found in the cache, invoke the provided5844* callback immediately and return. Note that the `zio' parameter5845* in the callback will be NULL in this case, since no IO was5846* required. If the block is not in the cache pass the read request5847* on to the spa with a substitute callback function, so that the5848* requested block will be added to the cache.5849*5850* If a read request arrives for a block that has a read in-progress,5851* either wait for the in-progress read to complete (and return the5852* results); or, if this is a read with a "done" func, add a record5853* to the read to invoke the "done" func when the read completes,5854* and return; or just return.5855*5856* arc_read_done() will invoke all the requested "done" functions5857* for readers of this block.5858*/5859int5860arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,5861arc_read_done_func_t *done, void *private, zio_priority_t priority,5862int zio_flags, arc_flags_t *arc_flags, const zbookmark_phys_t *zb)5863{5864arc_buf_hdr_t *hdr = NULL;5865kmutex_t *hash_lock = NULL;5866zio_t *rzio;5867uint64_t guid = spa_load_guid(spa);5868boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW_COMPRESS) != 0;5869boolean_t encrypted_read = BP_IS_ENCRYPTED(bp) &&5870(zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0;5871boolean_t noauth_read = BP_IS_AUTHENTICATED(bp) &&5872(zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0;5873boolean_t embedded_bp = !!BP_IS_EMBEDDED(bp);5874boolean_t no_buf = *arc_flags & ARC_FLAG_NO_BUF;5875arc_buf_t *buf = NULL;5876int rc = 0;5877boolean_t bp_validation = B_FALSE;58785879ASSERT(!embedded_bp ||5880BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA);5881ASSERT(!BP_IS_HOLE(bp));5882ASSERT(!BP_IS_REDACTED(bp));58835884/*5885* Normally SPL_FSTRANS will already be set since kernel threads which5886* expect to call the DMU interfaces will set it when created. System5887* calls are similarly handled by setting/cleaning the bit in the5888* registered callback (module/os/.../zfs/zpl_*).5889*5890* External consumers such as Lustre which call the exported DMU5891* interfaces may not have set SPL_FSTRANS. To avoid a deadlock5892* on the hash_lock always set and clear the bit.5893*/5894fstrans_cookie_t cookie = spl_fstrans_mark();5895top:5896if (!embedded_bp) {5897/*5898* Embedded BP's have no DVA and require no I/O to "read".5899* Create an anonymous arc buf to back it.5900*/5901hdr = buf_hash_find(guid, bp, &hash_lock);5902}59035904/*5905* Determine if we have an L1 cache hit or a cache miss. For simplicity5906* we maintain encrypted data separately from compressed / uncompressed5907* data. If the user is requesting raw encrypted data and we don't have5908* that in the header we will read from disk to guarantee that we can5909* get it even if the encryption keys aren't loaded.5910*/5911if (hdr != NULL && HDR_HAS_L1HDR(hdr) && (HDR_HAS_RABD(hdr) ||5912(hdr->b_l1hdr.b_pabd != NULL && !encrypted_read))) {5913boolean_t is_data = !HDR_ISTYPE_METADATA(hdr);59145915/*5916* Verify the block pointer contents are reasonable. 
This5917* should always be the case since the blkptr is protected by5918* a checksum.5919*/5920if (zfs_blkptr_verify(spa, bp, BLK_CONFIG_SKIP,5921BLK_VERIFY_LOG)) {5922mutex_exit(hash_lock);5923rc = SET_ERROR(ECKSUM);5924goto done;5925}59265927if (HDR_IO_IN_PROGRESS(hdr)) {5928if (*arc_flags & ARC_FLAG_CACHED_ONLY) {5929mutex_exit(hash_lock);5930ARCSTAT_BUMP(arcstat_cached_only_in_progress);5931rc = SET_ERROR(ENOENT);5932goto done;5933}59345935zio_t *head_zio = hdr->b_l1hdr.b_acb->acb_zio_head;5936ASSERT3P(head_zio, !=, NULL);5937if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) &&5938priority == ZIO_PRIORITY_SYNC_READ) {5939/*5940* This is a sync read that needs to wait for5941* an in-flight async read. Request that the5942* zio have its priority upgraded.5943*/5944zio_change_priority(head_zio, priority);5945DTRACE_PROBE1(arc__async__upgrade__sync,5946arc_buf_hdr_t *, hdr);5947ARCSTAT_BUMP(arcstat_async_upgrade_sync);5948}59495950DTRACE_PROBE1(arc__iohit, arc_buf_hdr_t *, hdr);5951arc_access(hdr, *arc_flags, B_FALSE);59525953/*5954* If there are multiple threads reading the same block5955* and that block is not yet in the ARC, then only one5956* thread will do the physical I/O and all other5957* threads will wait until that I/O completes.5958* Synchronous reads use the acb_wait_cv whereas nowait5959* reads register a callback. Both are signalled/called5960* in arc_read_done.5961*5962* Errors of the physical I/O may need to be propagated.5963* Synchronous read errors are returned here from5964* arc_read_done via acb_wait_error. Nowait reads5965* attach the acb_zio_dummy zio to pio and5966* arc_read_done propagates the physical I/O's io_error5967* to acb_zio_dummy, and thereby to pio.5968*/5969arc_callback_t *acb = NULL;5970if (done || pio || *arc_flags & ARC_FLAG_WAIT) {5971acb = kmem_zalloc(sizeof (arc_callback_t),5972KM_SLEEP);5973acb->acb_done = done;5974acb->acb_private = private;5975acb->acb_compressed = compressed_read;5976acb->acb_encrypted = encrypted_read;5977acb->acb_noauth = noauth_read;5978acb->acb_nobuf = no_buf;5979if (*arc_flags & ARC_FLAG_WAIT) {5980acb->acb_wait = B_TRUE;5981mutex_init(&acb->acb_wait_lock, NULL,5982MUTEX_DEFAULT, NULL);5983cv_init(&acb->acb_wait_cv, NULL,5984CV_DEFAULT, NULL);5985}5986acb->acb_zb = *zb;5987if (pio != NULL) {5988acb->acb_zio_dummy = zio_null(pio,5989spa, NULL, NULL, NULL, zio_flags);5990}5991acb->acb_zio_head = head_zio;5992acb->acb_next = hdr->b_l1hdr.b_acb;5993hdr->b_l1hdr.b_acb->acb_prev = acb;5994hdr->b_l1hdr.b_acb = acb;5995}5996mutex_exit(hash_lock);59975998ARCSTAT_BUMP(arcstat_iohits);5999ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),6000demand, prefetch, is_data, data, metadata, iohits);60016002if (*arc_flags & ARC_FLAG_WAIT) {6003mutex_enter(&acb->acb_wait_lock);6004while (acb->acb_wait) {6005cv_wait(&acb->acb_wait_cv,6006&acb->acb_wait_lock);6007}6008rc = acb->acb_wait_error;6009mutex_exit(&acb->acb_wait_lock);6010mutex_destroy(&acb->acb_wait_lock);6011cv_destroy(&acb->acb_wait_cv);6012kmem_free(acb, sizeof (arc_callback_t));6013}6014goto out;6015}60166017ASSERT(hdr->b_l1hdr.b_state == arc_mru ||6018hdr->b_l1hdr.b_state == arc_mfu ||6019hdr->b_l1hdr.b_state == arc_uncached);60206021DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);6022arc_access(hdr, *arc_flags, B_TRUE);60236024if (done && !no_buf) {6025ASSERT(!embedded_bp || !BP_IS_HOLE(bp));60266027/* Get a buf with the desired data in it. 
*/6028rc = arc_buf_alloc_impl(hdr, spa, zb, private,6029encrypted_read, compressed_read, noauth_read,6030B_TRUE, &buf);6031if (rc == ECKSUM) {6032/*6033* Convert authentication and decryption errors6034* to EIO (and generate an ereport if needed)6035* before leaving the ARC.6036*/6037rc = SET_ERROR(EIO);6038if ((zio_flags & ZIO_FLAG_SPECULATIVE) == 0) {6039spa_log_error(spa, zb, hdr->b_birth);6040(void) zfs_ereport_post(6041FM_EREPORT_ZFS_AUTHENTICATION,6042spa, NULL, zb, NULL, 0);6043}6044}6045if (rc != 0) {6046arc_buf_destroy_impl(buf);6047buf = NULL;6048(void) remove_reference(hdr, private);6049}60506051/* assert any errors weren't due to unloaded keys */6052ASSERT((zio_flags & ZIO_FLAG_SPECULATIVE) ||6053rc != EACCES);6054}6055mutex_exit(hash_lock);6056ARCSTAT_BUMP(arcstat_hits);6057ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),6058demand, prefetch, is_data, data, metadata, hits);6059*arc_flags |= ARC_FLAG_CACHED;6060goto done;6061} else {6062uint64_t lsize = BP_GET_LSIZE(bp);6063uint64_t psize = BP_GET_PSIZE(bp);6064arc_callback_t *acb;6065vdev_t *vd = NULL;6066uint64_t addr = 0;6067boolean_t devw = B_FALSE;6068uint64_t size;6069abd_t *hdr_abd;6070int alloc_flags = encrypted_read ? ARC_HDR_ALLOC_RDATA : 0;6071arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);6072int config_lock;6073int error;60746075if (*arc_flags & ARC_FLAG_CACHED_ONLY) {6076if (hash_lock != NULL)6077mutex_exit(hash_lock);6078rc = SET_ERROR(ENOENT);6079goto done;6080}60816082if (zio_flags & ZIO_FLAG_CONFIG_WRITER) {6083config_lock = BLK_CONFIG_HELD;6084} else if (hash_lock != NULL) {6085/*6086* Prevent lock order reversal6087*/6088config_lock = BLK_CONFIG_NEEDED_TRY;6089} else {6090config_lock = BLK_CONFIG_NEEDED;6091}60926093/*6094* Verify the block pointer contents are reasonable. This6095* should always be the case since the blkptr is protected by6096* a checksum.6097*/6098if (!bp_validation && (error = zfs_blkptr_verify(spa, bp,6099config_lock, BLK_VERIFY_LOG))) {6100if (hash_lock != NULL)6101mutex_exit(hash_lock);6102if (error == EBUSY && !zfs_blkptr_verify(spa, bp,6103BLK_CONFIG_NEEDED, BLK_VERIFY_LOG)) {6104bp_validation = B_TRUE;6105goto top;6106}6107rc = SET_ERROR(ECKSUM);6108goto done;6109}61106111if (hdr == NULL) {6112/*6113* This block is not in the cache or it has6114* embedded data.6115*/6116arc_buf_hdr_t *exists = NULL;6117hdr = arc_hdr_alloc(guid, psize, lsize,6118BP_IS_PROTECTED(bp), BP_GET_COMPRESS(bp), 0, type);61196120if (!embedded_bp) {6121hdr->b_dva = *BP_IDENTITY(bp);6122hdr->b_birth = BP_GET_PHYSICAL_BIRTH(bp);6123exists = buf_hash_insert(hdr, &hash_lock);6124}6125if (exists != NULL) {6126/* somebody beat us to the hash insert */6127mutex_exit(hash_lock);6128buf_discard_identity(hdr);6129arc_hdr_destroy(hdr);6130goto top; /* restart the IO request */6131}6132} else {6133/*6134* This block is in the ghost cache or encrypted data6135* was requested and we didn't have it. 
If it was6136* L2-only (and thus didn't have an L1 hdr),6137* we realloc the header to add an L1 hdr.6138*/6139if (!HDR_HAS_L1HDR(hdr)) {6140hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,6141hdr_full_cache);6142}61436144if (GHOST_STATE(hdr->b_l1hdr.b_state)) {6145ASSERT0P(hdr->b_l1hdr.b_pabd);6146ASSERT(!HDR_HAS_RABD(hdr));6147ASSERT(!HDR_IO_IN_PROGRESS(hdr));6148ASSERT0(zfs_refcount_count(6149&hdr->b_l1hdr.b_refcnt));6150ASSERT0P(hdr->b_l1hdr.b_buf);6151#ifdef ZFS_DEBUG6152ASSERT0P(hdr->b_l1hdr.b_freeze_cksum);6153#endif6154} else if (HDR_IO_IN_PROGRESS(hdr)) {6155/*6156* If this header already had an IO in progress6157* and we are performing another IO to fetch6158* encrypted data we must wait until the first6159* IO completes so as not to confuse6160* arc_read_done(). This should be very rare6161* and so the performance impact shouldn't6162* matter.6163*/6164arc_callback_t *acb = kmem_zalloc(6165sizeof (arc_callback_t), KM_SLEEP);6166acb->acb_wait = B_TRUE;6167mutex_init(&acb->acb_wait_lock, NULL,6168MUTEX_DEFAULT, NULL);6169cv_init(&acb->acb_wait_cv, NULL, CV_DEFAULT,6170NULL);6171acb->acb_zio_head =6172hdr->b_l1hdr.b_acb->acb_zio_head;6173acb->acb_next = hdr->b_l1hdr.b_acb;6174hdr->b_l1hdr.b_acb->acb_prev = acb;6175hdr->b_l1hdr.b_acb = acb;6176mutex_exit(hash_lock);6177mutex_enter(&acb->acb_wait_lock);6178while (acb->acb_wait) {6179cv_wait(&acb->acb_wait_cv,6180&acb->acb_wait_lock);6181}6182mutex_exit(&acb->acb_wait_lock);6183mutex_destroy(&acb->acb_wait_lock);6184cv_destroy(&acb->acb_wait_cv);6185kmem_free(acb, sizeof (arc_callback_t));6186goto top;6187}6188}6189if (*arc_flags & ARC_FLAG_UNCACHED) {6190arc_hdr_set_flags(hdr, ARC_FLAG_UNCACHED);6191if (!encrypted_read)6192alloc_flags |= ARC_HDR_ALLOC_LINEAR;6193}61946195/*6196* Take additional reference for IO_IN_PROGRESS. It stops6197* arc_access() from putting this header without any buffers6198* and so other references but obviously nonevictable onto6199* the evictable list of MRU or MFU state.6200*/6201add_reference(hdr, hdr);6202if (!embedded_bp)6203arc_access(hdr, *arc_flags, B_FALSE);6204arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);6205arc_hdr_alloc_abd(hdr, alloc_flags);6206if (encrypted_read) {6207ASSERT(HDR_HAS_RABD(hdr));6208size = HDR_GET_PSIZE(hdr);6209hdr_abd = hdr->b_crypt_hdr.b_rabd;6210zio_flags |= ZIO_FLAG_RAW;6211} else {6212ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);6213size = arc_hdr_size(hdr);6214hdr_abd = hdr->b_l1hdr.b_pabd;62156216if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {6217zio_flags |= ZIO_FLAG_RAW_COMPRESS;6218}62196220/*6221* For authenticated bp's, we do not ask the ZIO layer6222* to authenticate them since this will cause the entire6223* IO to fail if the key isn't loaded. 
Instead, we6224* defer authentication until arc_buf_fill(), which will6225* verify the data when the key is available.6226*/6227if (BP_IS_AUTHENTICATED(bp))6228zio_flags |= ZIO_FLAG_RAW_ENCRYPT;6229}62306231if (BP_IS_AUTHENTICATED(bp))6232arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH);6233if (BP_GET_LEVEL(bp) > 0)6234arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT);6235ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state));62366237acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);6238acb->acb_done = done;6239acb->acb_private = private;6240acb->acb_compressed = compressed_read;6241acb->acb_encrypted = encrypted_read;6242acb->acb_noauth = noauth_read;6243acb->acb_nobuf = no_buf;6244acb->acb_zb = *zb;62456246ASSERT0P(hdr->b_l1hdr.b_acb);6247hdr->b_l1hdr.b_acb = acb;62486249if (HDR_HAS_L2HDR(hdr) &&6250(vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {6251devw = hdr->b_l2hdr.b_dev->l2ad_writing;6252addr = hdr->b_l2hdr.b_daddr;6253/*6254* Lock out L2ARC device removal.6255*/6256if (vdev_is_dead(vd) ||6257!spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))6258vd = NULL;6259}62606261/*6262* We count both async reads and scrub IOs as asynchronous so6263* that both can be upgraded in the event of a cache hit while6264* the read IO is still in-flight.6265*/6266if (priority == ZIO_PRIORITY_ASYNC_READ ||6267priority == ZIO_PRIORITY_SCRUB)6268arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);6269else6270arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);62716272/*6273* At this point, we have a level 1 cache miss or a blkptr6274* with embedded data. Try again in L2ARC if possible.6275*/6276ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);62776278/*6279* Skip ARC stat bump for block pointers with embedded6280* data. The data are read from the blkptr itself via6281* decode_embedded_bp_compressed().6282*/6283if (!embedded_bp) {6284DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr,6285blkptr_t *, bp, uint64_t, lsize,6286zbookmark_phys_t *, zb);6287ARCSTAT_BUMP(arcstat_misses);6288ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),6289demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data,6290metadata, misses);6291zfs_racct_read(spa, size, 1,6292(*arc_flags & ARC_FLAG_UNCACHED) ?6293DMU_UNCACHEDIO : 0);6294}62956296/* Check if the spa even has l2 configured */6297const boolean_t spa_has_l2 = l2arc_ndev != 0 &&6298spa->spa_l2cache.sav_count > 0;62996300if (vd != NULL && spa_has_l2 && !(l2arc_norw && devw)) {6301/*6302* Read from the L2ARC if the following are true:6303* 1. The L2ARC vdev was previously cached.6304* 2. This buffer still has L2ARC metadata.6305* 3. This buffer isn't currently writing to the L2ARC.6306* 4. 
The L2ARC entry wasn't evicted, which may6307* also have invalidated the vdev.6308*/6309if (HDR_HAS_L2HDR(hdr) &&6310!HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr)) {6311l2arc_read_callback_t *cb;6312abd_t *abd;6313uint64_t asize;63146315DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);6316ARCSTAT_BUMP(arcstat_l2_hits);6317hdr->b_l2hdr.b_hits++;63186319cb = kmem_zalloc(sizeof (l2arc_read_callback_t),6320KM_SLEEP);6321cb->l2rcb_hdr = hdr;6322cb->l2rcb_bp = *bp;6323cb->l2rcb_zb = *zb;6324cb->l2rcb_flags = zio_flags;63256326/*6327* When Compressed ARC is disabled, but the6328* L2ARC block is compressed, arc_hdr_size()6329* will have returned LSIZE rather than PSIZE.6330*/6331if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&6332!HDR_COMPRESSION_ENABLED(hdr) &&6333HDR_GET_PSIZE(hdr) != 0) {6334size = HDR_GET_PSIZE(hdr);6335}63366337asize = vdev_psize_to_asize(vd, size);6338if (asize != size) {6339abd = abd_alloc_for_io(asize,6340HDR_ISTYPE_METADATA(hdr));6341cb->l2rcb_abd = abd;6342} else {6343abd = hdr_abd;6344}63456346ASSERT(addr >= VDEV_LABEL_START_SIZE &&6347addr + asize <= vd->vdev_psize -6348VDEV_LABEL_END_SIZE);63496350/*6351* l2arc read. The SCL_L2ARC lock will be6352* released by l2arc_read_done().6353* Issue a null zio if the underlying buffer6354* was squashed to zero size by compression.6355*/6356ASSERT3U(arc_hdr_get_compress(hdr), !=,6357ZIO_COMPRESS_EMPTY);6358rzio = zio_read_phys(pio, vd, addr,6359asize, abd,6360ZIO_CHECKSUM_OFF,6361l2arc_read_done, cb, priority,6362zio_flags | ZIO_FLAG_CANFAIL |6363ZIO_FLAG_DONT_PROPAGATE |6364ZIO_FLAG_DONT_RETRY, B_FALSE);6365acb->acb_zio_head = rzio;63666367if (hash_lock != NULL)6368mutex_exit(hash_lock);63696370DTRACE_PROBE2(l2arc__read, vdev_t *, vd,6371zio_t *, rzio);6372ARCSTAT_INCR(arcstat_l2_read_bytes,6373HDR_GET_PSIZE(hdr));63746375if (*arc_flags & ARC_FLAG_NOWAIT) {6376zio_nowait(rzio);6377goto out;6378}63796380ASSERT(*arc_flags & ARC_FLAG_WAIT);6381if (zio_wait(rzio) == 0)6382goto out;63836384/* l2arc read error; goto zio_read() */6385if (hash_lock != NULL)6386mutex_enter(hash_lock);6387} else {6388DTRACE_PROBE1(l2arc__miss,6389arc_buf_hdr_t *, hdr);6390ARCSTAT_BUMP(arcstat_l2_misses);6391if (HDR_L2_WRITING(hdr))6392ARCSTAT_BUMP(arcstat_l2_rw_clash);6393spa_config_exit(spa, SCL_L2ARC, vd);6394}6395} else {6396if (vd != NULL)6397spa_config_exit(spa, SCL_L2ARC, vd);63986399/*6400* Only a spa with l2 should contribute to l26401* miss stats. (Including the case of having a6402* faulted cache device - that's also a miss.)6403*/6404if (spa_has_l2) {6405/*6406* Skip ARC stat bump for block pointers with6407* embedded data. 
The data are read from the6408* blkptr itself via6409* decode_embedded_bp_compressed().6410*/6411if (!embedded_bp) {6412DTRACE_PROBE1(l2arc__miss,6413arc_buf_hdr_t *, hdr);6414ARCSTAT_BUMP(arcstat_l2_misses);6415}6416}6417}64186419rzio = zio_read(pio, spa, bp, hdr_abd, size,6420arc_read_done, hdr, priority, zio_flags, zb);6421acb->acb_zio_head = rzio;64226423if (hash_lock != NULL)6424mutex_exit(hash_lock);64256426if (*arc_flags & ARC_FLAG_WAIT) {6427rc = zio_wait(rzio);6428goto out;6429}64306431ASSERT(*arc_flags & ARC_FLAG_NOWAIT);6432zio_nowait(rzio);6433}64346435out:6436/* embedded bps don't actually go to disk */6437if (!embedded_bp)6438spa_read_history_add(spa, zb, *arc_flags);6439spl_fstrans_unmark(cookie);6440return (rc);64416442done:6443if (done)6444done(NULL, zb, bp, buf, private);6445if (pio && rc != 0) {6446zio_t *zio = zio_null(pio, spa, NULL, NULL, NULL, zio_flags);6447zio->io_error = rc;6448zio_nowait(zio);6449}6450goto out;6451}64526453arc_prune_t *6454arc_add_prune_callback(arc_prune_func_t *func, void *private)6455{6456arc_prune_t *p;64576458p = kmem_alloc(sizeof (*p), KM_SLEEP);6459p->p_pfunc = func;6460p->p_private = private;6461list_link_init(&p->p_node);6462zfs_refcount_create(&p->p_refcnt);64636464mutex_enter(&arc_prune_mtx);6465zfs_refcount_add(&p->p_refcnt, &arc_prune_list);6466list_insert_head(&arc_prune_list, p);6467mutex_exit(&arc_prune_mtx);64686469return (p);6470}64716472void6473arc_remove_prune_callback(arc_prune_t *p)6474{6475boolean_t wait = B_FALSE;6476mutex_enter(&arc_prune_mtx);6477list_remove(&arc_prune_list, p);6478if (zfs_refcount_remove(&p->p_refcnt, &arc_prune_list) > 0)6479wait = B_TRUE;6480mutex_exit(&arc_prune_mtx);64816482/* wait for arc_prune_task to finish */6483if (wait)6484taskq_wait_outstanding(arc_prune_taskq, 0);6485ASSERT0(zfs_refcount_count(&p->p_refcnt));6486zfs_refcount_destroy(&p->p_refcnt);6487kmem_free(p, sizeof (*p));6488}64896490/*6491* Helper function for arc_prune_async() it is responsible for safely6492* handling the execution of a registered arc_prune_func_t.6493*/6494static void6495arc_prune_task(void *ptr)6496{6497arc_prune_t *ap = (arc_prune_t *)ptr;6498arc_prune_func_t *func = ap->p_pfunc;64996500if (func != NULL)6501func(ap->p_adjust, ap->p_private);65026503(void) zfs_refcount_remove(&ap->p_refcnt, func);6504}65056506/*6507* Notify registered consumers they must drop holds on a portion of the ARC6508* buffers they reference. This provides a mechanism to ensure the ARC can6509* honor the metadata limit and reclaim otherwise pinned ARC buffers.6510*6511* This operation is performed asynchronously so it may be safely called6512* in the context of the arc_reclaim_thread(). 
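 * (Sketch of the consumer side, using a hypothetical my_prune_func() and
 * my_state purely for illustration:
 *
 *	arc_prune_t *ap = arc_add_prune_callback(my_prune_func, my_state);
 *	...
 *	arc_remove_prune_callback(ap);
 *
 * arc_prune_task() will then invoke my_prune_func(adjust, my_state), which
 * is expected to drop holds on roughly `adjust' worth of pinned buffers.)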
A reference is taken here6513* for each registered arc_prune_t and the arc_prune_task() is responsible6514* for releasing it once the registered arc_prune_func_t has completed.6515*/6516static void6517arc_prune_async(uint64_t adjust)6518{6519arc_prune_t *ap;65206521mutex_enter(&arc_prune_mtx);6522for (ap = list_head(&arc_prune_list); ap != NULL;6523ap = list_next(&arc_prune_list, ap)) {65246525if (zfs_refcount_count(&ap->p_refcnt) >= 2)6526continue;65276528zfs_refcount_add(&ap->p_refcnt, ap->p_pfunc);6529ap->p_adjust = adjust;6530if (taskq_dispatch(arc_prune_taskq, arc_prune_task,6531ap, TQ_SLEEP) == TASKQID_INVALID) {6532(void) zfs_refcount_remove(&ap->p_refcnt, ap->p_pfunc);6533continue;6534}6535ARCSTAT_BUMP(arcstat_prune);6536}6537mutex_exit(&arc_prune_mtx);6538}65396540/*6541* Notify the arc that a block was freed, and thus will never be used again.6542*/6543void6544arc_freed(spa_t *spa, const blkptr_t *bp)6545{6546arc_buf_hdr_t *hdr;6547kmutex_t *hash_lock;6548uint64_t guid = spa_load_guid(spa);65496550ASSERT(!BP_IS_EMBEDDED(bp));65516552hdr = buf_hash_find(guid, bp, &hash_lock);6553if (hdr == NULL)6554return;65556556/*6557* We might be trying to free a block that is still doing I/O6558* (i.e. prefetch) or has some other reference (i.e. a dedup-ed,6559* dmu_sync-ed block). A block may also have a reference if it is6560* part of a dedup-ed, dmu_synced write. The dmu_sync() function would6561* have written the new block to its final resting place on disk but6562* without the dedup flag set. This would have left the hdr in the MRU6563* state and discoverable. When the txg finally syncs it detects that6564* the block was overridden in open context and issues an override I/O.6565* Since this is a dedup block, the override I/O will determine if the6566* block is already in the DDT. If so, then it will replace the io_bp6567* with the bp from the DDT and allow the I/O to finish. When the I/O6568* reaches the done callback, dbuf_write_override_done, it will6569* check to see if the io_bp and io_bp_override are identical.6570* If they are not, then it indicates that the bp was replaced with6571* the bp in the DDT and the override bp is freed. This allows6572* us to arrive here with a reference on a block that is being6573* freed. So if we have an I/O in progress, or a reference to6574* this hdr, then we don't destroy the hdr.6575*/6576if (!HDR_HAS_L1HDR(hdr) ||6577zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) {6578arc_change_state(arc_anon, hdr);6579arc_hdr_destroy(hdr);6580mutex_exit(hash_lock);6581} else {6582mutex_exit(hash_lock);6583}65846585}65866587/*6588* Release this buffer from the cache, making it an anonymous buffer. 
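 * (For a rough sense of the calling pattern only -- a hypothetical caller,
 * not any specific one:
 *
 *	arc_release(buf, tag);		buf becomes anonymous
 *	... modify buf->b_data ...
 *	arc_write(pio, spa, txg, bp, buf, ...);
 *
 * i.e. the buffer is detached from its on-disk identity before its
 * contents are changed and rewritten.)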
This6589* must be done after a read and prior to modifying the buffer contents.6590* If the buffer has more than one reference, we must make6591* a new hdr for the buffer.6592*/6593void6594arc_release(arc_buf_t *buf, const void *tag)6595{6596arc_buf_hdr_t *hdr = buf->b_hdr;65976598/*6599* It would be nice to assert that if its DMU metadata (level >6600* 0 || it's the dnode file), then it must be syncing context.6601* But we don't know that information at this level.6602*/66036604ASSERT(HDR_HAS_L1HDR(hdr));66056606/*6607* We don't grab the hash lock prior to this check, because if6608* the buffer's header is in the arc_anon state, it won't be6609* linked into the hash table.6610*/6611if (hdr->b_l1hdr.b_state == arc_anon) {6612ASSERT(!HDR_IO_IN_PROGRESS(hdr));6613ASSERT(!HDR_IN_HASH_TABLE(hdr));6614ASSERT(!HDR_HAS_L2HDR(hdr));66156616ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf);6617ASSERT(ARC_BUF_LAST(buf));6618ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1);6619ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));66206621hdr->b_l1hdr.b_arc_access = 0;66226623/*6624* If the buf is being overridden then it may already6625* have a hdr that is not empty.6626*/6627buf_discard_identity(hdr);6628arc_buf_thaw(buf);66296630return;6631}66326633kmutex_t *hash_lock = HDR_LOCK(hdr);6634mutex_enter(hash_lock);66356636/*6637* This assignment is only valid as long as the hash_lock is6638* held, we must be careful not to reference state or the6639* b_state field after dropping the lock.6640*/6641arc_state_t *state = hdr->b_l1hdr.b_state;6642ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));6643ASSERT3P(state, !=, arc_anon);6644ASSERT3P(state, !=, arc_l2c_only);66456646/* this buffer is not on any list */6647ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0);66486649/*6650* Do we have more than one buf?6651*/6652if (hdr->b_l1hdr.b_buf != buf || !ARC_BUF_LAST(buf)) {6653arc_buf_hdr_t *nhdr;6654uint64_t spa = hdr->b_spa;6655uint64_t psize = HDR_GET_PSIZE(hdr);6656uint64_t lsize = HDR_GET_LSIZE(hdr);6657boolean_t protected = HDR_PROTECTED(hdr);6658enum zio_compress compress = arc_hdr_get_compress(hdr);6659arc_buf_contents_t type = arc_buf_type(hdr);66606661if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {6662ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);6663ASSERT(ARC_BUF_LAST(buf));6664}66656666/*6667* Pull the buffer off of this hdr and find the last buffer6668* in the hdr's buffer list.6669*/6670VERIFY3S(remove_reference(hdr, tag), >, 0);6671arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);6672ASSERT3P(lastbuf, !=, NULL);66736674/*6675* If the current arc_buf_t and the hdr are sharing their data6676* buffer, then we must stop sharing that block.6677*/6678if (ARC_BUF_SHARED(buf)) {6679ASSERT(!arc_buf_is_shared(lastbuf));66806681/*6682* First, sever the block sharing relationship between6683* buf and the arc_buf_hdr_t.6684*/6685arc_unshare_buf(hdr, buf);66866687/*6688* Now we need to recreate the hdr's b_pabd. Since we6689* have lastbuf handy, we try to share with it, but if6690* we can't then we allocate a new b_pabd and copy the6691* data from buf into it.6692*/6693if (arc_can_share(hdr, lastbuf)) {6694arc_share_buf(hdr, lastbuf);6695} else {6696arc_hdr_alloc_abd(hdr, 0);6697abd_copy_from_buf(hdr->b_l1hdr.b_pabd,6698buf->b_data, psize);6699}6700} else if (HDR_SHARED_DATA(hdr)) {6701/*6702* Uncompressed shared buffers are always at the end6703* of the list. Compressed buffers don't have the6704* same requirements. 
This makes it hard to6705* simply assert that the lastbuf is shared so6706* we rely on the hdr's compression flags to determine6707* if we have a compressed, shared buffer.6708*/6709ASSERT(arc_buf_is_shared(lastbuf) ||6710arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);6711ASSERT(!arc_buf_is_shared(buf));6712}67136714ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));67156716(void) zfs_refcount_remove_many(&state->arcs_size[type],6717arc_buf_size(buf), buf);67186719arc_cksum_verify(buf);6720arc_buf_unwatch(buf);67216722/* if this is the last uncompressed buf free the checksum */6723if (!arc_hdr_has_uncompressed_buf(hdr))6724arc_cksum_free(hdr);67256726mutex_exit(hash_lock);67276728nhdr = arc_hdr_alloc(spa, psize, lsize, protected,6729compress, hdr->b_complevel, type);6730ASSERT0P(nhdr->b_l1hdr.b_buf);6731ASSERT0(zfs_refcount_count(&nhdr->b_l1hdr.b_refcnt));6732VERIFY3U(nhdr->b_type, ==, type);6733ASSERT(!HDR_SHARED_DATA(nhdr));67346735nhdr->b_l1hdr.b_buf = buf;6736(void) zfs_refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);6737buf->b_hdr = nhdr;67386739(void) zfs_refcount_add_many(&arc_anon->arcs_size[type],6740arc_buf_size(buf), buf);6741} else {6742ASSERT(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);6743/* protected by hash lock, or hdr is on arc_anon */6744ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));6745ASSERT(!HDR_IO_IN_PROGRESS(hdr));67466747if (HDR_HAS_L2HDR(hdr)) {6748mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx);6749/* Recheck to prevent race with l2arc_evict(). */6750if (HDR_HAS_L2HDR(hdr))6751arc_hdr_l2hdr_destroy(hdr);6752mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx);6753}67546755hdr->b_l1hdr.b_mru_hits = 0;6756hdr->b_l1hdr.b_mru_ghost_hits = 0;6757hdr->b_l1hdr.b_mfu_hits = 0;6758hdr->b_l1hdr.b_mfu_ghost_hits = 0;6759arc_change_state(arc_anon, hdr);6760hdr->b_l1hdr.b_arc_access = 0;67616762mutex_exit(hash_lock);6763buf_discard_identity(hdr);6764arc_buf_thaw(buf);6765}6766}67676768int6769arc_released(arc_buf_t *buf)6770{6771return (buf->b_data != NULL &&6772buf->b_hdr->b_l1hdr.b_state == arc_anon);6773}67746775#ifdef ZFS_DEBUG6776int6777arc_referenced(arc_buf_t *buf)6778{6779return (zfs_refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt));6780}6781#endif67826783static void6784arc_write_ready(zio_t *zio)6785{6786arc_write_callback_t *callback = zio->io_private;6787arc_buf_t *buf = callback->awcb_buf;6788arc_buf_hdr_t *hdr = buf->b_hdr;6789blkptr_t *bp = zio->io_bp;6790uint64_t psize = BP_IS_HOLE(bp) ? 
0 : BP_GET_PSIZE(bp);6791fstrans_cookie_t cookie = spl_fstrans_mark();67926793ASSERT(HDR_HAS_L1HDR(hdr));6794ASSERT(!zfs_refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt));6795ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);67966797/*6798* If we're reexecuting this zio because the pool suspended, then6799* cleanup any state that was previously set the first time the6800* callback was invoked.6801*/6802if (zio->io_flags & ZIO_FLAG_REEXECUTED) {6803arc_cksum_free(hdr);6804arc_buf_unwatch(buf);6805if (hdr->b_l1hdr.b_pabd != NULL) {6806if (ARC_BUF_SHARED(buf)) {6807arc_unshare_buf(hdr, buf);6808} else {6809ASSERT(!arc_buf_is_shared(buf));6810arc_hdr_free_abd(hdr, B_FALSE);6811}6812}68136814if (HDR_HAS_RABD(hdr))6815arc_hdr_free_abd(hdr, B_TRUE);6816}6817ASSERT0P(hdr->b_l1hdr.b_pabd);6818ASSERT(!HDR_HAS_RABD(hdr));6819ASSERT(!HDR_SHARED_DATA(hdr));6820ASSERT(!arc_buf_is_shared(buf));68216822callback->awcb_ready(zio, buf, callback->awcb_private);68236824if (HDR_IO_IN_PROGRESS(hdr)) {6825ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED);6826} else {6827arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);6828add_reference(hdr, hdr); /* For IO_IN_PROGRESS. */6829}68306831if (BP_IS_PROTECTED(bp)) {6832/* ZIL blocks are written through zio_rewrite */6833ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG);68346835if (BP_SHOULD_BYTESWAP(bp)) {6836if (BP_GET_LEVEL(bp) > 0) {6837hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;6838} else {6839hdr->b_l1hdr.b_byteswap =6840DMU_OT_BYTESWAP(BP_GET_TYPE(bp));6841}6842} else {6843hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;6844}68456846arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);6847hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp);6848hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset;6849zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt,6850hdr->b_crypt_hdr.b_iv);6851zio_crypt_decode_mac_bp(bp, hdr->b_crypt_hdr.b_mac);6852} else {6853arc_hdr_clear_flags(hdr, ARC_FLAG_PROTECTED);6854}68556856/*6857* If this block was written for raw encryption but the zio layer6858* ended up only authenticating it, adjust the buffer flags now.6859*/6860if (BP_IS_AUTHENTICATED(bp) && ARC_BUF_ENCRYPTED(buf)) {6861arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH);6862buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;6863if (BP_GET_COMPRESS(bp) == ZIO_COMPRESS_OFF)6864buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;6865} else if (BP_IS_HOLE(bp) && ARC_BUF_ENCRYPTED(buf)) {6866buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;6867buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;6868}68696870/* this must be done after the buffer flags are adjusted */6871arc_cksum_compute(buf);68726873enum zio_compress compress;6874if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {6875compress = ZIO_COMPRESS_OFF;6876} else {6877ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));6878compress = BP_GET_COMPRESS(bp);6879}6880HDR_SET_PSIZE(hdr, psize);6881arc_hdr_set_compress(hdr, compress);6882hdr->b_complevel = zio->io_prop.zp_complevel;68836884if (zio->io_error != 0 || psize == 0)6885goto out;68866887/*6888* Fill the hdr with data. If the buffer is encrypted we have no choice6889* but to copy the data into b_radb. If the hdr is compressed, the data6890* we want is available from the zio, otherwise we can take it from6891* the buf.6892*6893* We might be able to share the buf's data with the hdr here. However,6894* doing so would cause the ARC to be full of linear ABDs if we write a6895* lot of shareable data. 
As a compromise, we check whether scattered6896* ABDs are allowed, and assume that if they are then the user wants6897* the ARC to be primarily filled with them regardless of the data being6898* written. Therefore, if they're allowed then we allocate one and copy6899* the data into it; otherwise, we share the data directly if we can.6900*/6901if (ARC_BUF_ENCRYPTED(buf)) {6902ASSERT3U(psize, >, 0);6903ASSERT(ARC_BUF_COMPRESSED(buf));6904arc_hdr_alloc_abd(hdr, ARC_HDR_ALLOC_RDATA |6905ARC_HDR_USE_RESERVE);6906abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize);6907} else if (!(HDR_UNCACHED(hdr) ||6908abd_size_alloc_linear(arc_buf_size(buf))) ||6909!arc_can_share(hdr, buf)) {6910/*6911* Ideally, we would always copy the io_abd into b_pabd, but the6912* user may have disabled compressed ARC, thus we must check the6913* hdr's compression setting rather than the io_bp's.6914*/6915if (BP_IS_ENCRYPTED(bp)) {6916ASSERT3U(psize, >, 0);6917arc_hdr_alloc_abd(hdr, ARC_HDR_ALLOC_RDATA |6918ARC_HDR_USE_RESERVE);6919abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize);6920} else if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF &&6921!ARC_BUF_COMPRESSED(buf)) {6922ASSERT3U(psize, >, 0);6923arc_hdr_alloc_abd(hdr, ARC_HDR_USE_RESERVE);6924abd_copy(hdr->b_l1hdr.b_pabd, zio->io_abd, psize);6925} else {6926ASSERT3U(zio->io_orig_size, ==, arc_hdr_size(hdr));6927arc_hdr_alloc_abd(hdr, ARC_HDR_USE_RESERVE);6928abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data,6929arc_buf_size(buf));6930}6931} else {6932ASSERT3P(buf->b_data, ==, abd_to_buf(zio->io_orig_abd));6933ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf));6934ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf);6935ASSERT(ARC_BUF_LAST(buf));69366937arc_share_buf(hdr, buf);6938}69396940out:6941arc_hdr_verify(hdr, bp);6942spl_fstrans_unmark(cookie);6943}69446945static void6946arc_write_children_ready(zio_t *zio)6947{6948arc_write_callback_t *callback = zio->io_private;6949arc_buf_t *buf = callback->awcb_buf;69506951callback->awcb_children_ready(zio, buf, callback->awcb_private);6952}69536954static void6955arc_write_done(zio_t *zio)6956{6957arc_write_callback_t *callback = zio->io_private;6958arc_buf_t *buf = callback->awcb_buf;6959arc_buf_hdr_t *hdr = buf->b_hdr;69606961ASSERT0P(hdr->b_l1hdr.b_acb);69626963if (zio->io_error == 0) {6964arc_hdr_verify(hdr, zio->io_bp);69656966if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {6967buf_discard_identity(hdr);6968} else {6969hdr->b_dva = *BP_IDENTITY(zio->io_bp);6970hdr->b_birth = BP_GET_PHYSICAL_BIRTH(zio->io_bp);6971}6972} else {6973ASSERT(HDR_EMPTY(hdr));6974}69756976/*6977* If the block to be written was all-zero or compressed enough to be6978* embedded in the BP, no write was performed so there will be no6979* dva/birth/checksum. 
The buffer must therefore remain anonymous6980* (and uncached).6981*/6982if (!HDR_EMPTY(hdr)) {6983arc_buf_hdr_t *exists;6984kmutex_t *hash_lock;69856986ASSERT0(zio->io_error);69876988arc_cksum_verify(buf);69896990exists = buf_hash_insert(hdr, &hash_lock);6991if (exists != NULL) {6992/*6993* This can only happen if we overwrite for6994* sync-to-convergence, because we remove6995* buffers from the hash table when we arc_free().6996*/6997if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {6998if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))6999panic("bad overwrite, hdr=%p exists=%p",7000(void *)hdr, (void *)exists);7001ASSERT(zfs_refcount_is_zero(7002&exists->b_l1hdr.b_refcnt));7003arc_change_state(arc_anon, exists);7004arc_hdr_destroy(exists);7005mutex_exit(hash_lock);7006exists = buf_hash_insert(hdr, &hash_lock);7007ASSERT0P(exists);7008} else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {7009/* nopwrite */7010ASSERT(zio->io_prop.zp_nopwrite);7011if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))7012panic("bad nopwrite, hdr=%p exists=%p",7013(void *)hdr, (void *)exists);7014} else {7015/* Dedup */7016ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);7017ASSERT(ARC_BUF_LAST(hdr->b_l1hdr.b_buf));7018ASSERT(hdr->b_l1hdr.b_state == arc_anon);7019ASSERT(BP_GET_DEDUP(zio->io_bp));7020ASSERT0(BP_GET_LEVEL(zio->io_bp));7021}7022}7023arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);7024VERIFY3S(remove_reference(hdr, hdr), >, 0);7025/* if it's not anon, we are doing a scrub */7026if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon)7027arc_access(hdr, 0, B_FALSE);7028mutex_exit(hash_lock);7029} else {7030arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);7031VERIFY3S(remove_reference(hdr, hdr), >, 0);7032}70337034callback->awcb_done(zio, buf, callback->awcb_private);70357036abd_free(zio->io_abd);7037kmem_free(callback, sizeof (arc_write_callback_t));7038}70397040zio_t *7041arc_write(zio_t *pio, spa_t *spa, uint64_t txg,7042blkptr_t *bp, arc_buf_t *buf, boolean_t uncached, boolean_t l2arc,7043const zio_prop_t *zp, arc_write_done_func_t *ready,7044arc_write_done_func_t *children_ready, arc_write_done_func_t *done,7045void *private, zio_priority_t priority, int zio_flags,7046const zbookmark_phys_t *zb)7047{7048arc_buf_hdr_t *hdr = buf->b_hdr;7049arc_write_callback_t *callback;7050zio_t *zio;7051zio_prop_t localprop = *zp;70527053ASSERT3P(ready, !=, NULL);7054ASSERT3P(done, !=, NULL);7055ASSERT(!HDR_IO_ERROR(hdr));7056ASSERT(!HDR_IO_IN_PROGRESS(hdr));7057ASSERT0P(hdr->b_l1hdr.b_acb);7058ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);7059if (uncached)7060arc_hdr_set_flags(hdr, ARC_FLAG_UNCACHED);7061else if (l2arc)7062arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);70637064if (ARC_BUF_ENCRYPTED(buf)) {7065ASSERT(ARC_BUF_COMPRESSED(buf));7066localprop.zp_encrypt = B_TRUE;7067localprop.zp_compress = HDR_GET_COMPRESS(hdr);7068localprop.zp_complevel = hdr->b_complevel;7069localprop.zp_byteorder =7070(hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?7071ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;7072memcpy(localprop.zp_salt, hdr->b_crypt_hdr.b_salt,7073ZIO_DATA_SALT_LEN);7074memcpy(localprop.zp_iv, hdr->b_crypt_hdr.b_iv,7075ZIO_DATA_IV_LEN);7076memcpy(localprop.zp_mac, hdr->b_crypt_hdr.b_mac,7077ZIO_DATA_MAC_LEN);7078if (DMU_OT_IS_ENCRYPTED(localprop.zp_type)) {7079localprop.zp_nopwrite = B_FALSE;7080localprop.zp_copies =7081MIN(localprop.zp_copies, SPA_DVAS_PER_BP - 1);7082localprop.zp_gang_copies =7083MIN(localprop.zp_gang_copies, SPA_DVAS_PER_BP - 1);7084}7085zio_flags |= ZIO_FLAG_RAW;7086} else if (ARC_BUF_COMPRESSED(buf)) {7087ASSERT3U(HDR_GET_LSIZE(hdr), 
!=, arc_buf_size(buf));7088localprop.zp_compress = HDR_GET_COMPRESS(hdr);7089localprop.zp_complevel = hdr->b_complevel;7090zio_flags |= ZIO_FLAG_RAW_COMPRESS;7091}7092callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);7093callback->awcb_ready = ready;7094callback->awcb_children_ready = children_ready;7095callback->awcb_done = done;7096callback->awcb_private = private;7097callback->awcb_buf = buf;70987099/*7100* The hdr's b_pabd is now stale, free it now. A new data block7101* will be allocated when the zio pipeline calls arc_write_ready().7102*/7103if (hdr->b_l1hdr.b_pabd != NULL) {7104/*7105* If the buf is currently sharing the data block with7106* the hdr then we need to break that relationship here.7107* The hdr will remain with a NULL data pointer and the7108* buf will take sole ownership of the block.7109*/7110if (ARC_BUF_SHARED(buf)) {7111arc_unshare_buf(hdr, buf);7112} else {7113ASSERT(!arc_buf_is_shared(buf));7114arc_hdr_free_abd(hdr, B_FALSE);7115}7116VERIFY3P(buf->b_data, !=, NULL);7117}71187119if (HDR_HAS_RABD(hdr))7120arc_hdr_free_abd(hdr, B_TRUE);71217122if (!(zio_flags & ZIO_FLAG_RAW))7123arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);71247125ASSERT(!arc_buf_is_shared(buf));7126ASSERT0P(hdr->b_l1hdr.b_pabd);71277128zio = zio_write(pio, spa, txg, bp,7129abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),7130HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,7131(children_ready != NULL) ? arc_write_children_ready : NULL,7132arc_write_done, callback, priority, zio_flags, zb);71337134return (zio);7135}71367137void7138arc_tempreserve_clear(uint64_t reserve)7139{7140atomic_add_64(&arc_tempreserve, -reserve);7141ASSERT((int64_t)arc_tempreserve >= 0);7142}71437144int7145arc_tempreserve_space(spa_t *spa, uint64_t reserve, uint64_t txg)7146{7147int error;7148uint64_t anon_size;71497150if (!arc_no_grow &&7151reserve > arc_c/4 &&7152reserve * 4 > (2ULL << SPA_MAXBLOCKSHIFT))7153arc_c = MIN(arc_c_max, reserve * 4);71547155/*7156* Throttle when the calculated memory footprint for the TXG7157* exceeds the target ARC size.7158*/7159if (reserve > arc_c) {7160DMU_TX_STAT_BUMP(dmu_tx_memory_reserve);7161return (SET_ERROR(ERESTART));7162}71637164/*7165* Don't count loaned bufs as in flight dirty data to prevent long7166* network delays from blocking transactions that are ready to be7167* assigned to a txg.7168*/71697170/* assert that it has not wrapped around */7171ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);71727173anon_size = MAX((int64_t)7174(zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_DATA]) +7175zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_METADATA]) -7176arc_loaned_bytes), 0);71777178/*7179* Writes will, almost always, require additional memory allocations7180* in order to compress/encrypt/etc the data. We therefore need to7181* make sure that there is sufficient available memory for this.7182*/7183error = arc_memory_throttle(spa, reserve, txg);7184if (error != 0)7185return (error);71867187/*7188* Throttle writes when the amount of dirty data in the cache7189* gets too large. We try to keep the cache less than half full7190* of dirty blocks so that our sync times don't grow too large.7191*7192* In the case of one pool being built on another pool, we want7193* to make sure we don't end up throttling the lower (backing)7194* pool when the upper pool is the majority contributor to dirty7195* data. 
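	 * (A hedged worked example, assuming the default tunable values of
	 * zfs_arc_dirty_limit_percent = 50, zfs_arc_anon_limit_percent = 25
	 * and zfs_arc_pool_dirty_percent = 20, with rarc_c = 4 GiB: we only
	 * consider throttling once total_dirty exceeds 2 GiB, anon_size
	 * exceeds 1 GiB and, as described next, this pool's own dirty data
	 * exceeds 20% of anon_size.)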
To insure we make forward progress during throttling, we7196* also check the current pool's net dirty data and only throttle7197* if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty7198* data in the cache.7199*7200* Note: if two requests come in concurrently, we might let them7201* both succeed, when one of them should fail. Not a huge deal.7202*/7203uint64_t total_dirty = reserve + arc_tempreserve + anon_size;7204uint64_t spa_dirty_anon = spa_dirty_data(spa);7205uint64_t rarc_c = arc_warm ? arc_c : arc_c_max;7206if (total_dirty > rarc_c * zfs_arc_dirty_limit_percent / 100 &&7207anon_size > rarc_c * zfs_arc_anon_limit_percent / 100 &&7208spa_dirty_anon > anon_size * zfs_arc_pool_dirty_percent / 100) {7209#ifdef ZFS_DEBUG7210uint64_t meta_esize = zfs_refcount_count(7211&arc_anon->arcs_esize[ARC_BUFC_METADATA]);7212uint64_t data_esize =7213zfs_refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);7214dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "7215"anon_data=%lluK tempreserve=%lluK rarc_c=%lluK\n",7216(u_longlong_t)arc_tempreserve >> 10,7217(u_longlong_t)meta_esize >> 10,7218(u_longlong_t)data_esize >> 10,7219(u_longlong_t)reserve >> 10,7220(u_longlong_t)rarc_c >> 10);7221#endif7222DMU_TX_STAT_BUMP(dmu_tx_dirty_throttle);7223return (SET_ERROR(ERESTART));7224}7225atomic_add_64(&arc_tempreserve, reserve);7226return (0);7227}72287229static void7230arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,7231kstat_named_t *data, kstat_named_t *metadata,7232kstat_named_t *evict_data, kstat_named_t *evict_metadata)7233{7234data->value.ui64 =7235zfs_refcount_count(&state->arcs_size[ARC_BUFC_DATA]);7236metadata->value.ui64 =7237zfs_refcount_count(&state->arcs_size[ARC_BUFC_METADATA]);7238size->value.ui64 = data->value.ui64 + metadata->value.ui64;7239evict_data->value.ui64 =7240zfs_refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);7241evict_metadata->value.ui64 =7242zfs_refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);7243}72447245static int7246arc_kstat_update(kstat_t *ksp, int rw)7247{7248arc_stats_t *as = ksp->ks_data;72497250if (rw == KSTAT_WRITE)7251return (SET_ERROR(EACCES));72527253as->arcstat_hits.value.ui64 =7254wmsum_value(&arc_sums.arcstat_hits);7255as->arcstat_iohits.value.ui64 =7256wmsum_value(&arc_sums.arcstat_iohits);7257as->arcstat_misses.value.ui64 =7258wmsum_value(&arc_sums.arcstat_misses);7259as->arcstat_demand_data_hits.value.ui64 =7260wmsum_value(&arc_sums.arcstat_demand_data_hits);7261as->arcstat_demand_data_iohits.value.ui64 =7262wmsum_value(&arc_sums.arcstat_demand_data_iohits);7263as->arcstat_demand_data_misses.value.ui64 =7264wmsum_value(&arc_sums.arcstat_demand_data_misses);7265as->arcstat_demand_metadata_hits.value.ui64 =7266wmsum_value(&arc_sums.arcstat_demand_metadata_hits);7267as->arcstat_demand_metadata_iohits.value.ui64 =7268wmsum_value(&arc_sums.arcstat_demand_metadata_iohits);7269as->arcstat_demand_metadata_misses.value.ui64 =7270wmsum_value(&arc_sums.arcstat_demand_metadata_misses);7271as->arcstat_prefetch_data_hits.value.ui64 =7272wmsum_value(&arc_sums.arcstat_prefetch_data_hits);7273as->arcstat_prefetch_data_iohits.value.ui64 =7274wmsum_value(&arc_sums.arcstat_prefetch_data_iohits);7275as->arcstat_prefetch_data_misses.value.ui64 =7276wmsum_value(&arc_sums.arcstat_prefetch_data_misses);7277as->arcstat_prefetch_metadata_hits.value.ui64 =7278wmsum_value(&arc_sums.arcstat_prefetch_metadata_hits);7279as->arcstat_prefetch_metadata_iohits.value.ui64 
=7280wmsum_value(&arc_sums.arcstat_prefetch_metadata_iohits);7281as->arcstat_prefetch_metadata_misses.value.ui64 =7282wmsum_value(&arc_sums.arcstat_prefetch_metadata_misses);7283as->arcstat_mru_hits.value.ui64 =7284wmsum_value(&arc_sums.arcstat_mru_hits);7285as->arcstat_mru_ghost_hits.value.ui64 =7286wmsum_value(&arc_sums.arcstat_mru_ghost_hits);7287as->arcstat_mfu_hits.value.ui64 =7288wmsum_value(&arc_sums.arcstat_mfu_hits);7289as->arcstat_mfu_ghost_hits.value.ui64 =7290wmsum_value(&arc_sums.arcstat_mfu_ghost_hits);7291as->arcstat_uncached_hits.value.ui64 =7292wmsum_value(&arc_sums.arcstat_uncached_hits);7293as->arcstat_deleted.value.ui64 =7294wmsum_value(&arc_sums.arcstat_deleted);7295as->arcstat_mutex_miss.value.ui64 =7296wmsum_value(&arc_sums.arcstat_mutex_miss);7297as->arcstat_access_skip.value.ui64 =7298wmsum_value(&arc_sums.arcstat_access_skip);7299as->arcstat_evict_skip.value.ui64 =7300wmsum_value(&arc_sums.arcstat_evict_skip);7301as->arcstat_evict_not_enough.value.ui64 =7302wmsum_value(&arc_sums.arcstat_evict_not_enough);7303as->arcstat_evict_l2_cached.value.ui64 =7304wmsum_value(&arc_sums.arcstat_evict_l2_cached);7305as->arcstat_evict_l2_eligible.value.ui64 =7306wmsum_value(&arc_sums.arcstat_evict_l2_eligible);7307as->arcstat_evict_l2_eligible_mfu.value.ui64 =7308wmsum_value(&arc_sums.arcstat_evict_l2_eligible_mfu);7309as->arcstat_evict_l2_eligible_mru.value.ui64 =7310wmsum_value(&arc_sums.arcstat_evict_l2_eligible_mru);7311as->arcstat_evict_l2_ineligible.value.ui64 =7312wmsum_value(&arc_sums.arcstat_evict_l2_ineligible);7313as->arcstat_evict_l2_skip.value.ui64 =7314wmsum_value(&arc_sums.arcstat_evict_l2_skip);7315as->arcstat_hash_elements.value.ui64 =7316as->arcstat_hash_elements_max.value.ui64 =7317wmsum_value(&arc_sums.arcstat_hash_elements);7318as->arcstat_hash_collisions.value.ui64 =7319wmsum_value(&arc_sums.arcstat_hash_collisions);7320as->arcstat_hash_chains.value.ui64 =7321wmsum_value(&arc_sums.arcstat_hash_chains);7322as->arcstat_size.value.ui64 =7323aggsum_value(&arc_sums.arcstat_size);7324as->arcstat_compressed_size.value.ui64 =7325wmsum_value(&arc_sums.arcstat_compressed_size);7326as->arcstat_uncompressed_size.value.ui64 =7327wmsum_value(&arc_sums.arcstat_uncompressed_size);7328as->arcstat_overhead_size.value.ui64 =7329wmsum_value(&arc_sums.arcstat_overhead_size);7330as->arcstat_hdr_size.value.ui64 =7331wmsum_value(&arc_sums.arcstat_hdr_size);7332as->arcstat_data_size.value.ui64 =7333wmsum_value(&arc_sums.arcstat_data_size);7334as->arcstat_metadata_size.value.ui64 =7335wmsum_value(&arc_sums.arcstat_metadata_size);7336as->arcstat_dbuf_size.value.ui64 =7337wmsum_value(&arc_sums.arcstat_dbuf_size);7338#if defined(COMPAT_FREEBSD11)7339as->arcstat_other_size.value.ui64 =7340wmsum_value(&arc_sums.arcstat_bonus_size) +7341aggsum_value(&arc_sums.arcstat_dnode_size) 
+7342wmsum_value(&arc_sums.arcstat_dbuf_size);7343#endif73447345arc_kstat_update_state(arc_anon,7346&as->arcstat_anon_size,7347&as->arcstat_anon_data,7348&as->arcstat_anon_metadata,7349&as->arcstat_anon_evictable_data,7350&as->arcstat_anon_evictable_metadata);7351arc_kstat_update_state(arc_mru,7352&as->arcstat_mru_size,7353&as->arcstat_mru_data,7354&as->arcstat_mru_metadata,7355&as->arcstat_mru_evictable_data,7356&as->arcstat_mru_evictable_metadata);7357arc_kstat_update_state(arc_mru_ghost,7358&as->arcstat_mru_ghost_size,7359&as->arcstat_mru_ghost_data,7360&as->arcstat_mru_ghost_metadata,7361&as->arcstat_mru_ghost_evictable_data,7362&as->arcstat_mru_ghost_evictable_metadata);7363arc_kstat_update_state(arc_mfu,7364&as->arcstat_mfu_size,7365&as->arcstat_mfu_data,7366&as->arcstat_mfu_metadata,7367&as->arcstat_mfu_evictable_data,7368&as->arcstat_mfu_evictable_metadata);7369arc_kstat_update_state(arc_mfu_ghost,7370&as->arcstat_mfu_ghost_size,7371&as->arcstat_mfu_ghost_data,7372&as->arcstat_mfu_ghost_metadata,7373&as->arcstat_mfu_ghost_evictable_data,7374&as->arcstat_mfu_ghost_evictable_metadata);7375arc_kstat_update_state(arc_uncached,7376&as->arcstat_uncached_size,7377&as->arcstat_uncached_data,7378&as->arcstat_uncached_metadata,7379&as->arcstat_uncached_evictable_data,7380&as->arcstat_uncached_evictable_metadata);73817382as->arcstat_dnode_size.value.ui64 =7383aggsum_value(&arc_sums.arcstat_dnode_size);7384as->arcstat_bonus_size.value.ui64 =7385wmsum_value(&arc_sums.arcstat_bonus_size);7386as->arcstat_l2_hits.value.ui64 =7387wmsum_value(&arc_sums.arcstat_l2_hits);7388as->arcstat_l2_misses.value.ui64 =7389wmsum_value(&arc_sums.arcstat_l2_misses);7390as->arcstat_l2_prefetch_asize.value.ui64 =7391wmsum_value(&arc_sums.arcstat_l2_prefetch_asize);7392as->arcstat_l2_mru_asize.value.ui64 =7393wmsum_value(&arc_sums.arcstat_l2_mru_asize);7394as->arcstat_l2_mfu_asize.value.ui64 =7395wmsum_value(&arc_sums.arcstat_l2_mfu_asize);7396as->arcstat_l2_bufc_data_asize.value.ui64 =7397wmsum_value(&arc_sums.arcstat_l2_bufc_data_asize);7398as->arcstat_l2_bufc_metadata_asize.value.ui64 =7399wmsum_value(&arc_sums.arcstat_l2_bufc_metadata_asize);7400as->arcstat_l2_feeds.value.ui64 =7401wmsum_value(&arc_sums.arcstat_l2_feeds);7402as->arcstat_l2_rw_clash.value.ui64 =7403wmsum_value(&arc_sums.arcstat_l2_rw_clash);7404as->arcstat_l2_read_bytes.value.ui64 =7405wmsum_value(&arc_sums.arcstat_l2_read_bytes);7406as->arcstat_l2_write_bytes.value.ui64 =7407wmsum_value(&arc_sums.arcstat_l2_write_bytes);7408as->arcstat_l2_writes_sent.value.ui64 =7409wmsum_value(&arc_sums.arcstat_l2_writes_sent);7410as->arcstat_l2_writes_done.value.ui64 =7411wmsum_value(&arc_sums.arcstat_l2_writes_done);7412as->arcstat_l2_writes_error.value.ui64 =7413wmsum_value(&arc_sums.arcstat_l2_writes_error);7414as->arcstat_l2_writes_lock_retry.value.ui64 =7415wmsum_value(&arc_sums.arcstat_l2_writes_lock_retry);7416as->arcstat_l2_evict_lock_retry.value.ui64 =7417wmsum_value(&arc_sums.arcstat_l2_evict_lock_retry);7418as->arcstat_l2_evict_reading.value.ui64 =7419wmsum_value(&arc_sums.arcstat_l2_evict_reading);7420as->arcstat_l2_evict_l1cached.value.ui64 =7421wmsum_value(&arc_sums.arcstat_l2_evict_l1cached);7422as->arcstat_l2_free_on_write.value.ui64 =7423wmsum_value(&arc_sums.arcstat_l2_free_on_write);7424as->arcstat_l2_abort_lowmem.value.ui64 =7425wmsum_value(&arc_sums.arcstat_l2_abort_lowmem);7426as->arcstat_l2_cksum_bad.value.ui64 =7427wmsum_value(&arc_sums.arcstat_l2_cksum_bad);7428as->arcstat_l2_io_error.value.ui64 
=7429wmsum_value(&arc_sums.arcstat_l2_io_error);7430as->arcstat_l2_lsize.value.ui64 =7431wmsum_value(&arc_sums.arcstat_l2_lsize);7432as->arcstat_l2_psize.value.ui64 =7433wmsum_value(&arc_sums.arcstat_l2_psize);7434as->arcstat_l2_hdr_size.value.ui64 =7435aggsum_value(&arc_sums.arcstat_l2_hdr_size);7436as->arcstat_l2_log_blk_writes.value.ui64 =7437wmsum_value(&arc_sums.arcstat_l2_log_blk_writes);7438as->arcstat_l2_log_blk_asize.value.ui64 =7439wmsum_value(&arc_sums.arcstat_l2_log_blk_asize);7440as->arcstat_l2_log_blk_count.value.ui64 =7441wmsum_value(&arc_sums.arcstat_l2_log_blk_count);7442as->arcstat_l2_rebuild_success.value.ui64 =7443wmsum_value(&arc_sums.arcstat_l2_rebuild_success);7444as->arcstat_l2_rebuild_abort_unsupported.value.ui64 =7445wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_unsupported);7446as->arcstat_l2_rebuild_abort_io_errors.value.ui64 =7447wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_io_errors);7448as->arcstat_l2_rebuild_abort_dh_errors.value.ui64 =7449wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_dh_errors);7450as->arcstat_l2_rebuild_abort_cksum_lb_errors.value.ui64 =7451wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors);7452as->arcstat_l2_rebuild_abort_lowmem.value.ui64 =7453wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_lowmem);7454as->arcstat_l2_rebuild_size.value.ui64 =7455wmsum_value(&arc_sums.arcstat_l2_rebuild_size);7456as->arcstat_l2_rebuild_asize.value.ui64 =7457wmsum_value(&arc_sums.arcstat_l2_rebuild_asize);7458as->arcstat_l2_rebuild_bufs.value.ui64 =7459wmsum_value(&arc_sums.arcstat_l2_rebuild_bufs);7460as->arcstat_l2_rebuild_bufs_precached.value.ui64 =7461wmsum_value(&arc_sums.arcstat_l2_rebuild_bufs_precached);7462as->arcstat_l2_rebuild_log_blks.value.ui64 =7463wmsum_value(&arc_sums.arcstat_l2_rebuild_log_blks);7464as->arcstat_memory_throttle_count.value.ui64 =7465wmsum_value(&arc_sums.arcstat_memory_throttle_count);7466as->arcstat_memory_direct_count.value.ui64 =7467wmsum_value(&arc_sums.arcstat_memory_direct_count);7468as->arcstat_memory_indirect_count.value.ui64 =7469wmsum_value(&arc_sums.arcstat_memory_indirect_count);74707471as->arcstat_memory_all_bytes.value.ui64 =7472arc_all_memory();7473as->arcstat_memory_free_bytes.value.ui64 =7474arc_free_memory();7475as->arcstat_memory_available_bytes.value.i64 =7476arc_available_memory();74777478as->arcstat_prune.value.ui64 =7479wmsum_value(&arc_sums.arcstat_prune);7480as->arcstat_meta_used.value.ui64 =7481wmsum_value(&arc_sums.arcstat_meta_used);7482as->arcstat_async_upgrade_sync.value.ui64 =7483wmsum_value(&arc_sums.arcstat_async_upgrade_sync);7484as->arcstat_predictive_prefetch.value.ui64 =7485wmsum_value(&arc_sums.arcstat_predictive_prefetch);7486as->arcstat_demand_hit_predictive_prefetch.value.ui64 =7487wmsum_value(&arc_sums.arcstat_demand_hit_predictive_prefetch);7488as->arcstat_demand_iohit_predictive_prefetch.value.ui64 =7489wmsum_value(&arc_sums.arcstat_demand_iohit_predictive_prefetch);7490as->arcstat_prescient_prefetch.value.ui64 =7491wmsum_value(&arc_sums.arcstat_prescient_prefetch);7492as->arcstat_demand_hit_prescient_prefetch.value.ui64 =7493wmsum_value(&arc_sums.arcstat_demand_hit_prescient_prefetch);7494as->arcstat_demand_iohit_prescient_prefetch.value.ui64 =7495wmsum_value(&arc_sums.arcstat_demand_iohit_prescient_prefetch);7496as->arcstat_raw_size.value.ui64 =7497wmsum_value(&arc_sums.arcstat_raw_size);7498as->arcstat_cached_only_in_progress.value.ui64 =7499wmsum_value(&arc_sums.arcstat_cached_only_in_progress);7500as->arcstat_abd_chunk_waste_size.value.ui64 
=7501wmsum_value(&arc_sums.arcstat_abd_chunk_waste_size);75027503return (0);7504}75057506/*7507* This function *must* return indices evenly distributed between all7508* sublists of the multilist. This is needed due to how the ARC eviction7509* code is laid out; arc_evict_state() assumes ARC buffers are evenly7510* distributed between all sublists and uses this assumption when7511* deciding which sublist to evict from and how much to evict from it.7512*/7513static unsigned int7514arc_state_multilist_index_func(multilist_t *ml, void *obj)7515{7516arc_buf_hdr_t *hdr = obj;75177518/*7519* We rely on b_dva to generate evenly distributed index7520* numbers using buf_hash below. So, as an added precaution,7521* let's make sure we never add empty buffers to the arc lists.7522*/7523ASSERT(!HDR_EMPTY(hdr));75247525/*7526* The assumption here, is the hash value for a given7527* arc_buf_hdr_t will remain constant throughout its lifetime7528* (i.e. its b_spa, b_dva, and b_birth fields don't change).7529* Thus, we don't need to store the header's sublist index7530* on insertion, as this index can be recalculated on removal.7531*7532* Also, the low order bits of the hash value are thought to be7533* distributed evenly. Otherwise, in the case that the multilist7534* has a power of two number of sublists, each sublists' usage7535* would not be evenly distributed. In this context full 64bit7536* division would be a waste of time, so limit it to 32 bits.7537*/7538return ((unsigned int)buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %7539multilist_get_num_sublists(ml));7540}75417542static unsigned int7543arc_state_l2c_multilist_index_func(multilist_t *ml, void *obj)7544{7545panic("Header %p insert into arc_l2c_only %p", obj, ml);7546}75477548#define WARN_IF_TUNING_IGNORED(tuning, value, do_warn) do { \7549if ((do_warn) && (tuning) && ((tuning) != (value))) { \7550cmn_err(CE_WARN, \7551"ignoring tunable %s (using %llu instead)", \7552(#tuning), (u_longlong_t)(value)); \7553} \7554} while (0)75557556/*7557* Called during module initialization and periodically thereafter to7558* apply reasonable changes to the exposed performance tunings. Can also be7559* called explicitly by param_set_arc_*() functions when ARC tunables are7560* updated manually. Non-zero zfs_* values which differ from the currently set7561* values will be applied.7562*/7563void7564arc_tuning_update(boolean_t verbose)7565{7566uint64_t allmem = arc_all_memory();75677568/* Valid range: 32M - <arc_c_max> */7569if ((zfs_arc_min) && (zfs_arc_min != arc_c_min) &&7570(zfs_arc_min >= 2ULL << SPA_MAXBLOCKSHIFT) &&7571(zfs_arc_min <= arc_c_max)) {7572arc_c_min = zfs_arc_min;7573arc_c = MAX(arc_c, arc_c_min);7574}7575WARN_IF_TUNING_IGNORED(zfs_arc_min, arc_c_min, verbose);75767577/* Valid range: 64M - <all physical memory> */7578if ((zfs_arc_max) && (zfs_arc_max != arc_c_max) &&7579(zfs_arc_max >= MIN_ARC_MAX) && (zfs_arc_max < allmem) &&7580(zfs_arc_max > arc_c_min)) {7581arc_c_max = zfs_arc_max;7582arc_c = MIN(arc_c, arc_c_max);7583if (arc_dnode_limit > arc_c_max)7584arc_dnode_limit = arc_c_max;7585}7586WARN_IF_TUNING_IGNORED(zfs_arc_max, arc_c_max, verbose);75877588/* Valid range: 0 - <all physical memory> */7589arc_dnode_limit = zfs_arc_dnode_limit ? 
zfs_arc_dnode_limit :7590MIN(zfs_arc_dnode_limit_percent, 100) * arc_c_max / 100;7591WARN_IF_TUNING_IGNORED(zfs_arc_dnode_limit, arc_dnode_limit, verbose);75927593/* Valid range: 1 - N */7594if (zfs_arc_grow_retry)7595arc_grow_retry = zfs_arc_grow_retry;75967597/* Valid range: 1 - N */7598if (zfs_arc_shrink_shift) {7599arc_shrink_shift = zfs_arc_shrink_shift;7600arc_no_grow_shift = MIN(arc_no_grow_shift, arc_shrink_shift -1);7601}76027603/* Valid range: 1 - N ms */7604if (zfs_arc_min_prefetch_ms)7605arc_min_prefetch = MSEC_TO_TICK(zfs_arc_min_prefetch_ms);76067607/* Valid range: 1 - N ms */7608if (zfs_arc_min_prescient_prefetch_ms) {7609arc_min_prescient_prefetch =7610MSEC_TO_TICK(zfs_arc_min_prescient_prefetch_ms);7611}76127613/* Valid range: 0 - 100 */7614if (zfs_arc_lotsfree_percent <= 100)7615arc_lotsfree_percent = zfs_arc_lotsfree_percent;7616WARN_IF_TUNING_IGNORED(zfs_arc_lotsfree_percent, arc_lotsfree_percent,7617verbose);76187619/* Valid range: 0 - <all physical memory> */7620if ((zfs_arc_sys_free) && (zfs_arc_sys_free != arc_sys_free))7621arc_sys_free = MIN(zfs_arc_sys_free, allmem);7622WARN_IF_TUNING_IGNORED(zfs_arc_sys_free, arc_sys_free, verbose);7623}76247625static void7626arc_state_multilist_init(multilist_t *ml,7627multilist_sublist_index_func_t *index_func, int *maxcountp)7628{7629multilist_create(ml, sizeof (arc_buf_hdr_t),7630offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), index_func);7631*maxcountp = MAX(*maxcountp, multilist_get_num_sublists(ml));7632}76337634static void7635arc_state_init(void)7636{7637int num_sublists = 0;76387639arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_METADATA],7640arc_state_multilist_index_func, &num_sublists);7641arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_DATA],7642arc_state_multilist_index_func, &num_sublists);7643arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],7644arc_state_multilist_index_func, &num_sublists);7645arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],7646arc_state_multilist_index_func, &num_sublists);7647arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_METADATA],7648arc_state_multilist_index_func, &num_sublists);7649arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_DATA],7650arc_state_multilist_index_func, &num_sublists);7651arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],7652arc_state_multilist_index_func, &num_sublists);7653arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],7654arc_state_multilist_index_func, &num_sublists);7655arc_state_multilist_init(&arc_uncached->arcs_list[ARC_BUFC_METADATA],7656arc_state_multilist_index_func, &num_sublists);7657arc_state_multilist_init(&arc_uncached->arcs_list[ARC_BUFC_DATA],7658arc_state_multilist_index_func, &num_sublists);76597660/*7661* L2 headers should never be on the L2 state list since they don't7662* have L1 headers allocated. Special index function asserts that.7663*/7664arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],7665arc_state_l2c_multilist_index_func, &num_sublists);7666arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],7667arc_state_l2c_multilist_index_func, &num_sublists);76687669/*7670* Keep track of the number of markers needed to reclaim buffers from7671* any ARC state. 
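 *
 * (The count accumulated above is simply the largest sublist count
 * reported for any single multilist, since arc_state_multilist_init()
 * folds each new value in with MAX.)
 *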
The markers will be pre-allocated so as to minimize7672* the number of memory allocations performed by the eviction thread.7673*/7674arc_state_evict_marker_count = num_sublists;76757676zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);7677zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);7678zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);7679zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);7680zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);7681zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);7682zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);7683zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);7684zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);7685zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);7686zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);7687zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);7688zfs_refcount_create(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]);7689zfs_refcount_create(&arc_uncached->arcs_esize[ARC_BUFC_DATA]);76907691zfs_refcount_create(&arc_anon->arcs_size[ARC_BUFC_DATA]);7692zfs_refcount_create(&arc_anon->arcs_size[ARC_BUFC_METADATA]);7693zfs_refcount_create(&arc_mru->arcs_size[ARC_BUFC_DATA]);7694zfs_refcount_create(&arc_mru->arcs_size[ARC_BUFC_METADATA]);7695zfs_refcount_create(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]);7696zfs_refcount_create(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]);7697zfs_refcount_create(&arc_mfu->arcs_size[ARC_BUFC_DATA]);7698zfs_refcount_create(&arc_mfu->arcs_size[ARC_BUFC_METADATA]);7699zfs_refcount_create(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]);7700zfs_refcount_create(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]);7701zfs_refcount_create(&arc_l2c_only->arcs_size[ARC_BUFC_DATA]);7702zfs_refcount_create(&arc_l2c_only->arcs_size[ARC_BUFC_METADATA]);7703zfs_refcount_create(&arc_uncached->arcs_size[ARC_BUFC_DATA]);7704zfs_refcount_create(&arc_uncached->arcs_size[ARC_BUFC_METADATA]);77057706wmsum_init(&arc_mru_ghost->arcs_hits[ARC_BUFC_DATA], 0);7707wmsum_init(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA], 0);7708wmsum_init(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA], 0);7709wmsum_init(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA], 0);77107711wmsum_init(&arc_sums.arcstat_hits, 0);7712wmsum_init(&arc_sums.arcstat_iohits, 0);7713wmsum_init(&arc_sums.arcstat_misses, 0);7714wmsum_init(&arc_sums.arcstat_demand_data_hits, 0);7715wmsum_init(&arc_sums.arcstat_demand_data_iohits, 0);7716wmsum_init(&arc_sums.arcstat_demand_data_misses, 0);7717wmsum_init(&arc_sums.arcstat_demand_metadata_hits, 0);7718wmsum_init(&arc_sums.arcstat_demand_metadata_iohits, 0);7719wmsum_init(&arc_sums.arcstat_demand_metadata_misses, 0);7720wmsum_init(&arc_sums.arcstat_prefetch_data_hits, 0);7721wmsum_init(&arc_sums.arcstat_prefetch_data_iohits, 0);7722wmsum_init(&arc_sums.arcstat_prefetch_data_misses, 0);7723wmsum_init(&arc_sums.arcstat_prefetch_metadata_hits, 0);7724wmsum_init(&arc_sums.arcstat_prefetch_metadata_iohits, 0);7725wmsum_init(&arc_sums.arcstat_prefetch_metadata_misses, 0);7726wmsum_init(&arc_sums.arcstat_mru_hits, 0);7727wmsum_init(&arc_sums.arcstat_mru_ghost_hits, 0);7728wmsum_init(&arc_sums.arcstat_mfu_hits, 0);7729wmsum_init(&arc_sums.arcstat_mfu_ghost_hits, 0);7730wmsum_init(&arc_sums.arcstat_uncached_hits, 0);7731wmsum_init(&arc_sums.arcstat_deleted, 0);7732wmsum_init(&arc_sums.arcstat_mutex_miss, 0);7733wmsum_init(&arc_sums.arcstat_access_skip, 0);7734wmsum_init(&arc_sums.arcstat_evict_skip, 
0);7735wmsum_init(&arc_sums.arcstat_evict_not_enough, 0);7736wmsum_init(&arc_sums.arcstat_evict_l2_cached, 0);7737wmsum_init(&arc_sums.arcstat_evict_l2_eligible, 0);7738wmsum_init(&arc_sums.arcstat_evict_l2_eligible_mfu, 0);7739wmsum_init(&arc_sums.arcstat_evict_l2_eligible_mru, 0);7740wmsum_init(&arc_sums.arcstat_evict_l2_ineligible, 0);7741wmsum_init(&arc_sums.arcstat_evict_l2_skip, 0);7742wmsum_init(&arc_sums.arcstat_hash_elements, 0);7743wmsum_init(&arc_sums.arcstat_hash_collisions, 0);7744wmsum_init(&arc_sums.arcstat_hash_chains, 0);7745aggsum_init(&arc_sums.arcstat_size, 0);7746wmsum_init(&arc_sums.arcstat_compressed_size, 0);7747wmsum_init(&arc_sums.arcstat_uncompressed_size, 0);7748wmsum_init(&arc_sums.arcstat_overhead_size, 0);7749wmsum_init(&arc_sums.arcstat_hdr_size, 0);7750wmsum_init(&arc_sums.arcstat_data_size, 0);7751wmsum_init(&arc_sums.arcstat_metadata_size, 0);7752wmsum_init(&arc_sums.arcstat_dbuf_size, 0);7753aggsum_init(&arc_sums.arcstat_dnode_size, 0);7754wmsum_init(&arc_sums.arcstat_bonus_size, 0);7755wmsum_init(&arc_sums.arcstat_l2_hits, 0);7756wmsum_init(&arc_sums.arcstat_l2_misses, 0);7757wmsum_init(&arc_sums.arcstat_l2_prefetch_asize, 0);7758wmsum_init(&arc_sums.arcstat_l2_mru_asize, 0);7759wmsum_init(&arc_sums.arcstat_l2_mfu_asize, 0);7760wmsum_init(&arc_sums.arcstat_l2_bufc_data_asize, 0);7761wmsum_init(&arc_sums.arcstat_l2_bufc_metadata_asize, 0);7762wmsum_init(&arc_sums.arcstat_l2_feeds, 0);7763wmsum_init(&arc_sums.arcstat_l2_rw_clash, 0);7764wmsum_init(&arc_sums.arcstat_l2_read_bytes, 0);7765wmsum_init(&arc_sums.arcstat_l2_write_bytes, 0);7766wmsum_init(&arc_sums.arcstat_l2_writes_sent, 0);7767wmsum_init(&arc_sums.arcstat_l2_writes_done, 0);7768wmsum_init(&arc_sums.arcstat_l2_writes_error, 0);7769wmsum_init(&arc_sums.arcstat_l2_writes_lock_retry, 0);7770wmsum_init(&arc_sums.arcstat_l2_evict_lock_retry, 0);7771wmsum_init(&arc_sums.arcstat_l2_evict_reading, 0);7772wmsum_init(&arc_sums.arcstat_l2_evict_l1cached, 0);7773wmsum_init(&arc_sums.arcstat_l2_free_on_write, 0);7774wmsum_init(&arc_sums.arcstat_l2_abort_lowmem, 0);7775wmsum_init(&arc_sums.arcstat_l2_cksum_bad, 0);7776wmsum_init(&arc_sums.arcstat_l2_io_error, 0);7777wmsum_init(&arc_sums.arcstat_l2_lsize, 0);7778wmsum_init(&arc_sums.arcstat_l2_psize, 0);7779aggsum_init(&arc_sums.arcstat_l2_hdr_size, 0);7780wmsum_init(&arc_sums.arcstat_l2_log_blk_writes, 0);7781wmsum_init(&arc_sums.arcstat_l2_log_blk_asize, 0);7782wmsum_init(&arc_sums.arcstat_l2_log_blk_count, 0);7783wmsum_init(&arc_sums.arcstat_l2_rebuild_success, 0);7784wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_unsupported, 0);7785wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_io_errors, 0);7786wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_dh_errors, 0);7787wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors, 0);7788wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_lowmem, 0);7789wmsum_init(&arc_sums.arcstat_l2_rebuild_size, 0);7790wmsum_init(&arc_sums.arcstat_l2_rebuild_asize, 0);7791wmsum_init(&arc_sums.arcstat_l2_rebuild_bufs, 0);7792wmsum_init(&arc_sums.arcstat_l2_rebuild_bufs_precached, 0);7793wmsum_init(&arc_sums.arcstat_l2_rebuild_log_blks, 0);7794wmsum_init(&arc_sums.arcstat_memory_throttle_count, 0);7795wmsum_init(&arc_sums.arcstat_memory_direct_count, 0);7796wmsum_init(&arc_sums.arcstat_memory_indirect_count, 0);7797wmsum_init(&arc_sums.arcstat_prune, 0);7798wmsum_init(&arc_sums.arcstat_meta_used, 0);7799wmsum_init(&arc_sums.arcstat_async_upgrade_sync, 0);7800wmsum_init(&arc_sums.arcstat_predictive_prefetch, 
0);7801wmsum_init(&arc_sums.arcstat_demand_hit_predictive_prefetch, 0);7802wmsum_init(&arc_sums.arcstat_demand_iohit_predictive_prefetch, 0);7803wmsum_init(&arc_sums.arcstat_prescient_prefetch, 0);7804wmsum_init(&arc_sums.arcstat_demand_hit_prescient_prefetch, 0);7805wmsum_init(&arc_sums.arcstat_demand_iohit_prescient_prefetch, 0);7806wmsum_init(&arc_sums.arcstat_raw_size, 0);7807wmsum_init(&arc_sums.arcstat_cached_only_in_progress, 0);7808wmsum_init(&arc_sums.arcstat_abd_chunk_waste_size, 0);78097810arc_anon->arcs_state = ARC_STATE_ANON;7811arc_mru->arcs_state = ARC_STATE_MRU;7812arc_mru_ghost->arcs_state = ARC_STATE_MRU_GHOST;7813arc_mfu->arcs_state = ARC_STATE_MFU;7814arc_mfu_ghost->arcs_state = ARC_STATE_MFU_GHOST;7815arc_l2c_only->arcs_state = ARC_STATE_L2C_ONLY;7816arc_uncached->arcs_state = ARC_STATE_UNCACHED;7817}78187819static void7820arc_state_fini(void)7821{7822zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);7823zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);7824zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);7825zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);7826zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);7827zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);7828zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);7829zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);7830zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);7831zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);7832zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);7833zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);7834zfs_refcount_destroy(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]);7835zfs_refcount_destroy(&arc_uncached->arcs_esize[ARC_BUFC_DATA]);78367837zfs_refcount_destroy(&arc_anon->arcs_size[ARC_BUFC_DATA]);7838zfs_refcount_destroy(&arc_anon->arcs_size[ARC_BUFC_METADATA]);7839zfs_refcount_destroy(&arc_mru->arcs_size[ARC_BUFC_DATA]);7840zfs_refcount_destroy(&arc_mru->arcs_size[ARC_BUFC_METADATA]);7841zfs_refcount_destroy(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]);7842zfs_refcount_destroy(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]);7843zfs_refcount_destroy(&arc_mfu->arcs_size[ARC_BUFC_DATA]);7844zfs_refcount_destroy(&arc_mfu->arcs_size[ARC_BUFC_METADATA]);7845zfs_refcount_destroy(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]);7846zfs_refcount_destroy(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]);7847zfs_refcount_destroy(&arc_l2c_only->arcs_size[ARC_BUFC_DATA]);7848zfs_refcount_destroy(&arc_l2c_only->arcs_size[ARC_BUFC_METADATA]);7849zfs_refcount_destroy(&arc_uncached->arcs_size[ARC_BUFC_DATA]);7850zfs_refcount_destroy(&arc_uncached->arcs_size[ARC_BUFC_METADATA]);78517852multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]);7853multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);7854multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]);7855multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);7856multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]);7857multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);7858multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]);7859multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);7860multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA]);7861multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_DATA]);7862multilist_destroy(&arc_uncached->arcs_list[ARC_BUFC_METADATA]);7863multilist_destroy(&arc_uncached->arcs_list[ARC_BUFC_DATA]);78647865wmsum_fini(&arc_mru_ghos
t->arcs_hits[ARC_BUFC_DATA]);7866wmsum_fini(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA]);7867wmsum_fini(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA]);7868wmsum_fini(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA]);78697870wmsum_fini(&arc_sums.arcstat_hits);7871wmsum_fini(&arc_sums.arcstat_iohits);7872wmsum_fini(&arc_sums.arcstat_misses);7873wmsum_fini(&arc_sums.arcstat_demand_data_hits);7874wmsum_fini(&arc_sums.arcstat_demand_data_iohits);7875wmsum_fini(&arc_sums.arcstat_demand_data_misses);7876wmsum_fini(&arc_sums.arcstat_demand_metadata_hits);7877wmsum_fini(&arc_sums.arcstat_demand_metadata_iohits);7878wmsum_fini(&arc_sums.arcstat_demand_metadata_misses);7879wmsum_fini(&arc_sums.arcstat_prefetch_data_hits);7880wmsum_fini(&arc_sums.arcstat_prefetch_data_iohits);7881wmsum_fini(&arc_sums.arcstat_prefetch_data_misses);7882wmsum_fini(&arc_sums.arcstat_prefetch_metadata_hits);7883wmsum_fini(&arc_sums.arcstat_prefetch_metadata_iohits);7884wmsum_fini(&arc_sums.arcstat_prefetch_metadata_misses);7885wmsum_fini(&arc_sums.arcstat_mru_hits);7886wmsum_fini(&arc_sums.arcstat_mru_ghost_hits);7887wmsum_fini(&arc_sums.arcstat_mfu_hits);7888wmsum_fini(&arc_sums.arcstat_mfu_ghost_hits);7889wmsum_fini(&arc_sums.arcstat_uncached_hits);7890wmsum_fini(&arc_sums.arcstat_deleted);7891wmsum_fini(&arc_sums.arcstat_mutex_miss);7892wmsum_fini(&arc_sums.arcstat_access_skip);7893wmsum_fini(&arc_sums.arcstat_evict_skip);7894wmsum_fini(&arc_sums.arcstat_evict_not_enough);7895wmsum_fini(&arc_sums.arcstat_evict_l2_cached);7896wmsum_fini(&arc_sums.arcstat_evict_l2_eligible);7897wmsum_fini(&arc_sums.arcstat_evict_l2_eligible_mfu);7898wmsum_fini(&arc_sums.arcstat_evict_l2_eligible_mru);7899wmsum_fini(&arc_sums.arcstat_evict_l2_ineligible);7900wmsum_fini(&arc_sums.arcstat_evict_l2_skip);7901wmsum_fini(&arc_sums.arcstat_hash_elements);7902wmsum_fini(&arc_sums.arcstat_hash_collisions);7903wmsum_fini(&arc_sums.arcstat_hash_chains);7904aggsum_fini(&arc_sums.arcstat_size);7905wmsum_fini(&arc_sums.arcstat_compressed_size);7906wmsum_fini(&arc_sums.arcstat_uncompressed_size);7907wmsum_fini(&arc_sums.arcstat_overhead_size);7908wmsum_fini(&arc_sums.arcstat_hdr_size);7909wmsum_fini(&arc_sums.arcstat_data_size);7910wmsum_fini(&arc_sums.arcstat_metadata_size);7911wmsum_fini(&arc_sums.arcstat_dbuf_size);7912aggsum_fini(&arc_sums.arcstat_dnode_size);7913wmsum_fini(&arc_sums.arcstat_bonus_size);7914wmsum_fini(&arc_sums.arcstat_l2_hits);7915wmsum_fini(&arc_sums.arcstat_l2_misses);7916wmsum_fini(&arc_sums.arcstat_l2_prefetch_asize);7917wmsum_fini(&arc_sums.arcstat_l2_mru_asize);7918wmsum_fini(&arc_sums.arcstat_l2_mfu_asize);7919wmsum_fini(&arc_sums.arcstat_l2_bufc_data_asize);7920wmsum_fini(&arc_sums.arcstat_l2_bufc_metadata_asize);7921wmsum_fini(&arc_sums.arcstat_l2_feeds);7922wmsum_fini(&arc_sums.arcstat_l2_rw_clash);7923wmsum_fini(&arc_sums.arcstat_l2_read_bytes);7924wmsum_fini(&arc_sums.arcstat_l2_write_bytes);7925wmsum_fini(&arc_sums.arcstat_l2_writes_sent);7926wmsum_fini(&arc_sums.arcstat_l2_writes_done);7927wmsum_fini(&arc_sums.arcstat_l2_writes_error);7928wmsum_fini(&arc_sums.arcstat_l2_writes_lock_retry);7929wmsum_fini(&arc_sums.arcstat_l2_evict_lock_retry);7930wmsum_fini(&arc_sums.arcstat_l2_evict_reading);7931wmsum_fini(&arc_sums.arcstat_l2_evict_l1cached);7932wmsum_fini(&arc_sums.arcstat_l2_free_on_write);7933wmsum_fini(&arc_sums.arcstat_l2_abort_lowmem);7934wmsum_fini(&arc_sums.arcstat_l2_cksum_bad);7935wmsum_fini(&arc_sums.arcstat_l2_io_error);7936wmsum_fini(&arc_sums.arcstat_l2_lsize);7937wmsum_fini(&arc_sums.arcstat_l2_psize);7938aggsu
m_fini(&arc_sums.arcstat_l2_hdr_size);7939wmsum_fini(&arc_sums.arcstat_l2_log_blk_writes);7940wmsum_fini(&arc_sums.arcstat_l2_log_blk_asize);7941wmsum_fini(&arc_sums.arcstat_l2_log_blk_count);7942wmsum_fini(&arc_sums.arcstat_l2_rebuild_success);7943wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_unsupported);7944wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_io_errors);7945wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_dh_errors);7946wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors);7947wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_lowmem);7948wmsum_fini(&arc_sums.arcstat_l2_rebuild_size);7949wmsum_fini(&arc_sums.arcstat_l2_rebuild_asize);7950wmsum_fini(&arc_sums.arcstat_l2_rebuild_bufs);7951wmsum_fini(&arc_sums.arcstat_l2_rebuild_bufs_precached);7952wmsum_fini(&arc_sums.arcstat_l2_rebuild_log_blks);7953wmsum_fini(&arc_sums.arcstat_memory_throttle_count);7954wmsum_fini(&arc_sums.arcstat_memory_direct_count);7955wmsum_fini(&arc_sums.arcstat_memory_indirect_count);7956wmsum_fini(&arc_sums.arcstat_prune);7957wmsum_fini(&arc_sums.arcstat_meta_used);7958wmsum_fini(&arc_sums.arcstat_async_upgrade_sync);7959wmsum_fini(&arc_sums.arcstat_predictive_prefetch);7960wmsum_fini(&arc_sums.arcstat_demand_hit_predictive_prefetch);7961wmsum_fini(&arc_sums.arcstat_demand_iohit_predictive_prefetch);7962wmsum_fini(&arc_sums.arcstat_prescient_prefetch);7963wmsum_fini(&arc_sums.arcstat_demand_hit_prescient_prefetch);7964wmsum_fini(&arc_sums.arcstat_demand_iohit_prescient_prefetch);7965wmsum_fini(&arc_sums.arcstat_raw_size);7966wmsum_fini(&arc_sums.arcstat_cached_only_in_progress);7967wmsum_fini(&arc_sums.arcstat_abd_chunk_waste_size);7968}79697970uint64_t7971arc_target_bytes(void)7972{7973return (arc_c);7974}79757976void7977arc_set_limits(uint64_t allmem)7978{7979/* Set min cache to 1/32 of all memory, or 32MB, whichever is more. */7980arc_c_min = MAX(allmem / 32, 2ULL << SPA_MAXBLOCKSHIFT);79817982/* How to set default max varies by platform. */7983arc_c_max = arc_default_max(arc_c_min, allmem);7984}79857986void7987arc_init(void)7988{7989uint64_t percent, allmem = arc_all_memory();7990mutex_init(&arc_evict_lock, NULL, MUTEX_DEFAULT, NULL);7991list_create(&arc_evict_waiters, sizeof (arc_evict_waiter_t),7992offsetof(arc_evict_waiter_t, aew_node));79937994arc_min_prefetch = MSEC_TO_TICK(1000);7995arc_min_prescient_prefetch = MSEC_TO_TICK(6000);79967997#if defined(_KERNEL)7998arc_lowmem_init();7999#endif80008001arc_set_limits(allmem);80028003#ifdef _KERNEL8004/*8005* If zfs_arc_max is non-zero at init, meaning it was set in the kernel8006* environment before the module was loaded, don't block setting the8007* maximum because it is less than arc_c_min, instead, reset arc_c_min8008* to a lower value.8009* zfs_arc_min will be handled by arc_tuning_update().8010*/8011if (zfs_arc_max != 0 && zfs_arc_max >= MIN_ARC_MAX &&8012zfs_arc_max < allmem) {8013arc_c_max = zfs_arc_max;8014if (arc_c_min >= arc_c_max) {8015arc_c_min = MAX(zfs_arc_max / 2,80162ULL << SPA_MAXBLOCKSHIFT);8017}8018}8019#else8020/*8021* In userland, there's only the memory pressure that we artificially8022* create (see arc_available_memory()). 
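 *
 * (Concretely, the assignment below pins arc_c_min to the larger of
 * arc_c_max / 2 and the same 32MB floor used by arc_set_limits()
 * above.)
 *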
Don't let arc_c get too8023* small, because it can cause transactions to be larger than8024* arc_c, causing arc_tempreserve_space() to fail.8025*/8026arc_c_min = MAX(arc_c_max / 2, 2ULL << SPA_MAXBLOCKSHIFT);8027#endif80288029arc_c = arc_c_min;8030/*8031* 32-bit fixed point fractions of metadata from total ARC size,8032* MRU data from all data and MRU metadata from all metadata.8033*/8034arc_meta = (1ULL << 32) / 4; /* Metadata is 25% of arc_c. */8035arc_pd = (1ULL << 32) / 2; /* Data MRU is 50% of data. */8036arc_pm = (1ULL << 32) / 2; /* Metadata MRU is 50% of metadata. */80378038percent = MIN(zfs_arc_dnode_limit_percent, 100);8039arc_dnode_limit = arc_c_max * percent / 100;80408041/* Apply user specified tunings */8042arc_tuning_update(B_TRUE);80438044/* if kmem_flags are set, lets try to use less memory */8045if (kmem_debugging())8046arc_c = arc_c / 2;8047if (arc_c < arc_c_min)8048arc_c = arc_c_min;80498050arc_register_hotplug();80518052arc_state_init();80538054buf_init();80558056list_create(&arc_prune_list, sizeof (arc_prune_t),8057offsetof(arc_prune_t, p_node));8058mutex_init(&arc_prune_mtx, NULL, MUTEX_DEFAULT, NULL);80598060arc_prune_taskq = taskq_create("arc_prune", zfs_arc_prune_task_threads,8061defclsyspri, 100, INT_MAX, TASKQ_PREPOPULATE | TASKQ_DYNAMIC);80628063arc_evict_thread_init();80648065list_create(&arc_async_flush_list, sizeof (arc_async_flush_t),8066offsetof(arc_async_flush_t, af_node));8067mutex_init(&arc_async_flush_lock, NULL, MUTEX_DEFAULT, NULL);8068arc_flush_taskq = taskq_create("arc_flush", MIN(boot_ncpus, 4),8069defclsyspri, 1, INT_MAX, TASKQ_DYNAMIC);80708071arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,8072sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);80738074if (arc_ksp != NULL) {8075arc_ksp->ks_data = &arc_stats;8076arc_ksp->ks_update = arc_kstat_update;8077kstat_install(arc_ksp);8078}80798080arc_state_evict_markers =8081arc_state_alloc_markers(arc_state_evict_marker_count);8082arc_evict_zthr = zthr_create_timer("arc_evict",8083arc_evict_cb_check, arc_evict_cb, NULL, SEC2NSEC(1), defclsyspri);8084arc_reap_zthr = zthr_create_timer("arc_reap",8085arc_reap_cb_check, arc_reap_cb, NULL, SEC2NSEC(1), minclsyspri);80868087arc_warm = B_FALSE;80888089/*8090* Calculate maximum amount of dirty data per pool.8091*8092* If it has been set by a module parameter, take that.8093* Otherwise, use a percentage of physical memory defined by8094* zfs_dirty_data_max_percent (default 10%) with a cap at8095* zfs_dirty_data_max_max (default 4G or 25% of physical memory).8096*/8097#ifdef __LP64__8098if (zfs_dirty_data_max_max == 0)8099zfs_dirty_data_max_max = MIN(4ULL * 1024 * 1024 * 1024,8100allmem * zfs_dirty_data_max_max_percent / 100);8101#else8102if (zfs_dirty_data_max_max == 0)8103zfs_dirty_data_max_max = MIN(1ULL * 1024 * 1024 * 1024,8104allmem * zfs_dirty_data_max_max_percent / 100);8105#endif81068107if (zfs_dirty_data_max == 0) {8108zfs_dirty_data_max = allmem *8109zfs_dirty_data_max_percent / 100;8110zfs_dirty_data_max = MIN(zfs_dirty_data_max,8111zfs_dirty_data_max_max);8112}81138114if (zfs_wrlog_data_max == 0) {81158116/*8117* dp_wrlog_total is reduced for each txg at the end of8118* spa_sync(). However, dp_dirty_total is reduced every time8119* a block is written out. 
Thus under normal operation,8120* dp_wrlog_total could grow 2 times as big as8121* zfs_dirty_data_max.8122*/8123zfs_wrlog_data_max = zfs_dirty_data_max * 2;8124}8125}81268127void8128arc_fini(void)8129{8130arc_prune_t *p;81318132#ifdef _KERNEL8133arc_lowmem_fini();8134#endif /* _KERNEL */81358136/* Wait for any background flushes */8137taskq_wait(arc_flush_taskq);8138taskq_destroy(arc_flush_taskq);81398140/* Use B_TRUE to ensure *all* buffers are evicted */8141arc_flush(NULL, B_TRUE);81428143if (arc_ksp != NULL) {8144kstat_delete(arc_ksp);8145arc_ksp = NULL;8146}81478148taskq_wait(arc_prune_taskq);8149taskq_destroy(arc_prune_taskq);81508151list_destroy(&arc_async_flush_list);8152mutex_destroy(&arc_async_flush_lock);81538154mutex_enter(&arc_prune_mtx);8155while ((p = list_remove_head(&arc_prune_list)) != NULL) {8156(void) zfs_refcount_remove(&p->p_refcnt, &arc_prune_list);8157zfs_refcount_destroy(&p->p_refcnt);8158kmem_free(p, sizeof (*p));8159}8160mutex_exit(&arc_prune_mtx);81618162list_destroy(&arc_prune_list);8163mutex_destroy(&arc_prune_mtx);81648165if (arc_evict_taskq != NULL)8166taskq_wait(arc_evict_taskq);81678168(void) zthr_cancel(arc_evict_zthr);8169(void) zthr_cancel(arc_reap_zthr);8170arc_state_free_markers(arc_state_evict_markers,8171arc_state_evict_marker_count);81728173if (arc_evict_taskq != NULL) {8174taskq_destroy(arc_evict_taskq);8175kmem_free(arc_evict_arg,8176sizeof (evict_arg_t) * zfs_arc_evict_threads);8177}81788179mutex_destroy(&arc_evict_lock);8180list_destroy(&arc_evict_waiters);81818182/*8183* Free any buffers that were tagged for destruction. This needs8184* to occur before arc_state_fini() runs and destroys the aggsum8185* values which are updated when freeing scatter ABDs.8186*/8187l2arc_do_free_on_write();81888189/*8190* buf_fini() must proceed arc_state_fini() because buf_fin() may8191* trigger the release of kmem magazines, which can callback to8192* arc_space_return() which accesses aggsums freed in act_state_fini().8193*/8194buf_fini();8195arc_state_fini();81968197arc_unregister_hotplug();81988199/*8200* We destroy the zthrs after all the ARC state has been8201* torn down to avoid the case of them receiving any8202* wakeup() signals after they are destroyed.8203*/8204zthr_destroy(arc_evict_zthr);8205zthr_destroy(arc_reap_zthr);82068207ASSERT0(arc_loaned_bytes);8208}82098210/*8211* Level 2 ARC8212*8213* The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.8214* It uses dedicated storage devices to hold cached data, which are populated8215* using large infrequent writes. The main role of this cache is to boost8216* the performance of random read workloads. 
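 *
 * (For reference, an L2ARC device is normally attached to a pool with
 * a command along the lines of "zpool add <pool> cache <device>"; the
 * pool and device names here are of course site-specific.)
 *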
The intended L2ARC devices8217* include short-stroked disks, solid state disks, and other media with8218* substantially faster read latency than disk.8219*8220* +-----------------------+8221* | ARC |8222* +-----------------------+8223* | ^ ^8224* | | |8225* l2arc_feed_thread() arc_read()8226* | | |8227* | l2arc read |8228* V | |8229* +---------------+ |8230* | L2ARC | |8231* +---------------+ |8232* | ^ |8233* l2arc_write() | |8234* | | |8235* V | |8236* +-------+ +-------+8237* | vdev | | vdev |8238* | cache | | cache |8239* +-------+ +-------+8240* +=========+ .-----.8241* : L2ARC : |-_____-|8242* : devices : | Disks |8243* +=========+ `-_____-'8244*8245* Read requests are satisfied from the following sources, in order:8246*8247* 1) ARC8248* 2) vdev cache of L2ARC devices8249* 3) L2ARC devices8250* 4) vdev cache of disks8251* 5) disks8252*8253* Some L2ARC device types exhibit extremely slow write performance.8254* To accommodate for this there are some significant differences between8255* the L2ARC and traditional cache design:8256*8257* 1. There is no eviction path from the ARC to the L2ARC. Evictions from8258* the ARC behave as usual, freeing buffers and placing headers on ghost8259* lists. The ARC does not send buffers to the L2ARC during eviction as8260* this would add inflated write latencies for all ARC memory pressure.8261*8262* 2. The L2ARC attempts to cache data from the ARC before it is evicted.8263* It does this by periodically scanning buffers from the eviction-end of8264* the MFU and MRU ARC lists, copying them to the L2ARC devices if they are8265* not already there. It scans until a headroom of buffers is satisfied,8266* which itself is a buffer for ARC eviction. If a compressible buffer is8267* found during scanning and selected for writing to an L2ARC device, we8268* temporarily boost scanning headroom during the next scan cycle to make8269* sure we adapt to compression effects (which might significantly reduce8270* the data volume we write to L2ARC). The thread that does this is8271* l2arc_feed_thread(), illustrated below; example sizes are included to8272* provide a better sense of ratio than this diagram:8273*8274* head --> tail8275* +---------------------+----------+8276* ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC8277* +---------------------+----------+ | o L2ARC eligible8278* ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer8279* +---------------------+----------+ |8280* 15.9 Gbytes ^ 32 Mbytes |8281* headroom |8282* l2arc_feed_thread()8283* |8284* l2arc write hand <--[oooo]--'8285* | 8 Mbyte8286* | write max8287* V8288* +==============================+8289* L2ARC dev |####|#|###|###| |####| ... |8290* +==============================+8291* 32 Gbytes8292*8293* 3. If an ARC buffer is copied to the L2ARC but then hit instead of8294* evicted, then the L2ARC has cached a buffer much sooner than it probably8295* needed to, potentially wasting L2ARC device bandwidth and storage. It is8296* safe to say that this is an uncommon case, since buffers at the end of8297* the ARC lists have moved there due to inactivity.8298*8299* 4. If the ARC evicts faster than the L2ARC can maintain a headroom,8300* then the L2ARC simply misses copying some buffers. This serves as a8301* pressure valve to prevent heavy read workloads from both stalling the ARC8302* with waits and clogging the L2ARC with writes. 
This also helps prevent8303* the potential for the L2ARC to churn if it attempts to cache content too8304* quickly, such as during backups of the entire pool.8305*8306* 5. After system boot and before the ARC has filled main memory, there are8307* no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru8308* lists can remain mostly static. Instead of searching from tail of these8309* lists as pictured, the l2arc_feed_thread() will search from the list heads8310* for eligible buffers, greatly increasing its chance of finding them.8311*8312* The L2ARC device write speed is also boosted during this time so that8313* the L2ARC warms up faster. Since there have been no ARC evictions yet,8314* there are no L2ARC reads, and no fear of degrading read performance8315* through increased writes.8316*8317* 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that8318* the vdev queue can aggregate them into larger and fewer writes. Each8319* device is written to in a rotor fashion, sweeping writes through8320* available space then repeating.8321*8322* 7. The L2ARC does not store dirty content. It never needs to flush8323* write buffers back to disk based storage.8324*8325* 8. If an ARC buffer is written (and dirtied) which also exists in the8326* L2ARC, the now stale L2ARC buffer is immediately dropped.8327*8328* The performance of the L2ARC can be tweaked by a number of tunables, which8329* may be necessary for different workloads:8330*8331* l2arc_write_max max write bytes per interval8332* l2arc_write_boost extra write bytes during device warmup8333* l2arc_noprefetch skip caching prefetched buffers8334* l2arc_headroom number of max device writes to precache8335* l2arc_headroom_boost when we find compressed buffers during ARC8336* scanning, we multiply headroom by this8337* percentage factor for the next scan cycle,8338* since more compressed buffers are likely to8339* be present8340* l2arc_feed_secs seconds between L2ARC writing8341*8342* Tunables may be removed or added as future performance improvements are8343* integrated, and also may become zpool properties.8344*8345* There are three key functions that control how the L2ARC warms up:8346*8347* l2arc_write_eligible() check if a buffer is eligible to cache8348* l2arc_write_size() calculate how much to write8349* l2arc_write_interval() calculate sleep delay between writes8350*8351* These three functions determine what to write, how much, and how quickly8352* to send writes.8353*8354* L2ARC persistence:8355*8356* When writing buffers to L2ARC, we periodically add some metadata to8357* make sure we can pick them up after reboot, thus dramatically reducing8358* the impact that any downtime has on the performance of storage systems8359* with large caches.8360*8361* The implementation works fairly simply by integrating the following two8362* modifications:8363*8364* *) When writing to the L2ARC, we occasionally write a "l2arc log block",8365* which is an additional piece of metadata which describes what's been8366* written. This allows us to rebuild the arc_buf_hdr_t structures of the8367* main ARC buffers. There are 2 linked-lists of log blocks headed by8368* dh_start_lbps[2]. We alternate which chain we append to, so they are8369* time-wise and offset-wise interleaved, but that is an optimization rather8370* than for correctness. 
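 *
 * For example (purely illustrative), after six log blocks lb1..lb6
 * have been committed in that time order, the two chains would be
 *
 *   lb6 --> lb4 --> lb2        and        lb5 --> lb3 --> lb1
 *
 * with dh_start_lbps[] holding pointers to lb6 and lb5, which is what
 * lets a rebuild keep two log block reads in flight at once.
 *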
The log block also includes a pointer to the8371* previous block in its chain.8372*8373* *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device8374* for our header bookkeeping purposes. This contains a device header,8375* which contains our top-level reference structures. We update it each8376* time we write a new log block, so that we're able to locate it in the8377* L2ARC device. If this write results in an inconsistent device header8378* (e.g. due to power failure), we detect this by verifying the header's8379* checksum and simply fail to reconstruct the L2ARC after reboot.8380*8381* Implementation diagram:8382*8383* +=== L2ARC device (not to scale) ======================================+8384* | ___two newest log block pointers__.__________ |8385* | / \dh_start_lbps[1] |8386* | / \ \dh_start_lbps[0]|8387* |.___/__. V V |8388* ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---|8389* || hdr| ^ /^ /^ / / |8390* |+------+ ...--\-------/ \-----/--\------/ / |8391* | \--------------/ \--------------/ |8392* +======================================================================+8393*8394* As can be seen on the diagram, rather than using a simple linked list,8395* we use a pair of linked lists with alternating elements. This is a8396* performance enhancement due to the fact that we only find out the8397* address of the next log block access once the current block has been8398* completely read in. Obviously, this hurts performance, because we'd be8399* keeping the device's I/O queue at only a 1 operation deep, thus8400* incurring a large amount of I/O round-trip latency. Having two lists8401* allows us to fetch two log blocks ahead of where we are currently8402* rebuilding L2ARC buffers.8403*8404* On-device data structures:8405*8406* L2ARC device header: l2arc_dev_hdr_phys_t8407* L2ARC log block: l2arc_log_blk_phys_t8408*8409* L2ARC reconstruction:8410*8411* When writing data, we simply write in the standard rotary fashion,8412* evicting buffers as we go and simply writing new data over them (writing8413* a new log block every now and then). This obviously means that once we8414* loop around the end of the device, we will start cutting into an already8415* committed log block (and its referenced data buffers), like so:8416*8417* current write head__ __old tail8418* \ /8419* V V8420* <--|bufs |lb |bufs |lb | |bufs |lb |bufs |lb |-->8421* ^ ^^^^^^^^^___________________________________8422* | \8423* <<nextwrite>> may overwrite this blk and/or its bufs --'8424*8425* When importing the pool, we detect this situation and use it to stop8426* our scanning process (see l2arc_rebuild).8427*8428* There is one significant caveat to consider when rebuilding ARC contents8429* from an L2ARC device: what about invalidated buffers? Given the above8430* construction, we cannot update blocks which we've already written to amend8431* them to remove buffers which were invalidated. Thus, during reconstruction,8432* we might be populating the cache with buffers for data that's not on the8433* main pool anymore, or may have been overwritten!8434*8435* As it turns out, this isn't a problem. Every arc_read request includes8436* both the DVA and, crucially, the birth TXG of the BP the caller is8437* looking for. 
So even if the cache were populated by completely rotten8438* blocks for data that had been long deleted and/or overwritten, we'll8439* never actually return bad data from the cache, since the DVA with the8440* birth TXG uniquely identify a block in space and time - once created,8441* a block is immutable on disk. The worst thing we have done is wasted8442* some time and memory at l2arc rebuild to reconstruct outdated ARC8443* entries that will get dropped from the l2arc as it is being updated8444* with new blocks.8445*8446* L2ARC buffers that have been evicted by l2arc_evict() ahead of the write8447* hand are not restored. This is done by saving the offset (in bytes)8448* l2arc_evict() has evicted to in the L2ARC device header and taking it8449* into account when restoring buffers.8450*/84518452static boolean_t8453l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)8454{8455/*8456* A buffer is *not* eligible for the L2ARC if it:8457* 1. belongs to a different spa.8458* 2. is already cached on the L2ARC.8459* 3. has an I/O in progress (it may be an incomplete read).8460* 4. is flagged not eligible (zfs property).8461*/8462if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) ||8463HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr))8464return (B_FALSE);84658466return (B_TRUE);8467}84688469static uint64_t8470l2arc_write_size(l2arc_dev_t *dev)8471{8472uint64_t size;84738474/*8475* Make sure our globals have meaningful values in case the user8476* altered them.8477*/8478size = l2arc_write_max;8479if (size == 0) {8480cmn_err(CE_NOTE, "l2arc_write_max must be greater than zero, "8481"resetting it to the default (%d)", L2ARC_WRITE_SIZE);8482size = l2arc_write_max = L2ARC_WRITE_SIZE;8483}84848485if (arc_warm == B_FALSE)8486size += l2arc_write_boost;84878488/* We need to add in the worst case scenario of log block overhead. */8489size += l2arc_log_blk_overhead(size, dev);8490if (dev->l2ad_vdev->vdev_has_trim && l2arc_trim_ahead > 0) {8491/*8492* Trim ahead of the write size 64MB or (l2arc_trim_ahead/100)8493* times the writesize, whichever is greater.8494*/8495size += MAX(64 * 1024 * 1024,8496(size * l2arc_trim_ahead) / 100);8497}84988499/*8500* Make sure the write size does not exceed the size of the cache8501* device. This is important in l2arc_evict(), otherwise infinite8502* iteration can occur.8503*/8504size = MIN(size, (dev->l2ad_end - dev->l2ad_start) / 4);85058506size = P2ROUNDUP(size, 1ULL << dev->l2ad_vdev->vdev_ashift);85078508return (size);85098510}85118512static clock_t8513l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)8514{8515clock_t interval, next, now;85168517/*8518* If the ARC lists are busy, increase our write rate; if the8519* lists are stale, idle back. 
This is achieved by checking8520* how much we previously wrote - if it was more than half of8521* what we wanted, schedule the next write much sooner.8522*/8523if (l2arc_feed_again && wrote > (wanted / 2))8524interval = (hz * l2arc_feed_min_ms) / 1000;8525else8526interval = hz * l2arc_feed_secs;85278528now = ddi_get_lbolt();8529next = MAX(now, MIN(now + interval, began + interval));85308531return (next);8532}85338534static boolean_t8535l2arc_dev_invalid(const l2arc_dev_t *dev)8536{8537/*8538* We want to skip devices that are being rebuilt, trimmed,8539* removed, or belong to a spa that is being exported.8540*/8541return (dev->l2ad_vdev == NULL || vdev_is_dead(dev->l2ad_vdev) ||8542dev->l2ad_rebuild || dev->l2ad_trim_all ||8543dev->l2ad_spa == NULL || dev->l2ad_spa->spa_is_exporting);8544}85458546/*8547* Cycle through L2ARC devices. This is how L2ARC load balances.8548* If a device is returned, this also returns holding the spa config lock.8549*/8550static l2arc_dev_t *8551l2arc_dev_get_next(void)8552{8553l2arc_dev_t *first, *next = NULL;85548555/*8556* Lock out the removal of spas (spa_namespace_lock), then removal8557* of cache devices (l2arc_dev_mtx). Once a device has been selected,8558* both locks will be dropped and a spa config lock held instead.8559*/8560spa_namespace_enter(FTAG);8561mutex_enter(&l2arc_dev_mtx);85628563/* if there are no vdevs, there is nothing to do */8564if (l2arc_ndev == 0)8565goto out;85668567first = NULL;8568next = l2arc_dev_last;8569do {8570/* loop around the list looking for a non-faulted vdev */8571if (next == NULL) {8572next = list_head(l2arc_dev_list);8573} else {8574next = list_next(l2arc_dev_list, next);8575if (next == NULL)8576next = list_head(l2arc_dev_list);8577}85788579/* if we have come back to the start, bail out */8580if (first == NULL)8581first = next;8582else if (next == first)8583break;85848585ASSERT3P(next, !=, NULL);8586} while (l2arc_dev_invalid(next));85878588/* if we were unable to find any usable vdevs, return NULL */8589if (l2arc_dev_invalid(next))8590next = NULL;85918592l2arc_dev_last = next;85938594out:8595mutex_exit(&l2arc_dev_mtx);85968597/*8598* Grab the config lock to prevent the 'next' device from being8599* removed while we are writing to it.8600*/8601if (next != NULL)8602spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);8603spa_namespace_exit(FTAG);86048605return (next);8606}86078608/*8609* Free buffers that were tagged for destruction.8610*/8611static void8612l2arc_do_free_on_write(void)8613{8614l2arc_data_free_t *df;86158616mutex_enter(&l2arc_free_on_write_mtx);8617while ((df = list_remove_head(l2arc_free_on_write)) != NULL) {8618ASSERT3P(df->l2df_abd, !=, NULL);8619abd_free(df->l2df_abd);8620kmem_free(df, sizeof (l2arc_data_free_t));8621}8622mutex_exit(&l2arc_free_on_write_mtx);8623}86248625/*8626* A write to a cache device has completed. 
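 *
 * (This is the done callback of the zio that l2arc_write_buffers()
 * issues for a write batch.)
 *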
Update all headers to allow8627* reads from these buffers to begin.8628*/8629static void8630l2arc_write_done(zio_t *zio)8631{8632l2arc_write_callback_t *cb;8633l2arc_lb_abd_buf_t *abd_buf;8634l2arc_lb_ptr_buf_t *lb_ptr_buf;8635l2arc_dev_t *dev;8636l2arc_dev_hdr_phys_t *l2dhdr;8637list_t *buflist;8638arc_buf_hdr_t *head, *hdr, *hdr_prev;8639kmutex_t *hash_lock;8640int64_t bytes_dropped = 0;86418642cb = zio->io_private;8643ASSERT3P(cb, !=, NULL);8644dev = cb->l2wcb_dev;8645l2dhdr = dev->l2ad_dev_hdr;8646ASSERT3P(dev, !=, NULL);8647head = cb->l2wcb_head;8648ASSERT3P(head, !=, NULL);8649buflist = &dev->l2ad_buflist;8650ASSERT3P(buflist, !=, NULL);8651DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,8652l2arc_write_callback_t *, cb);86538654/*8655* All writes completed, or an error was hit.8656*/8657top:8658mutex_enter(&dev->l2ad_mtx);8659for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) {8660hdr_prev = list_prev(buflist, hdr);86618662hash_lock = HDR_LOCK(hdr);86638664/*8665* We cannot use mutex_enter or else we can deadlock8666* with l2arc_write_buffers (due to swapping the order8667* the hash lock and l2ad_mtx are taken).8668*/8669if (!mutex_tryenter(hash_lock)) {8670/*8671* Missed the hash lock. We must retry so we8672* don't leave the ARC_FLAG_L2_WRITING bit set.8673*/8674ARCSTAT_BUMP(arcstat_l2_writes_lock_retry);86758676/*8677* We don't want to rescan the headers we've8678* already marked as having been written out, so8679* we reinsert the head node so we can pick up8680* where we left off.8681*/8682list_remove(buflist, head);8683list_insert_after(buflist, hdr, head);86848685mutex_exit(&dev->l2ad_mtx);86868687/*8688* We wait for the hash lock to become available8689* to try and prevent busy waiting, and increase8690* the chance we'll be able to acquire the lock8691* the next time around.8692*/8693mutex_enter(hash_lock);8694mutex_exit(hash_lock);8695goto top;8696}86978698/*8699* We could not have been moved into the arc_l2c_only8700* state while in-flight due to our ARC_FLAG_L2_WRITING8701* bit being set. 
Let's just ensure that's being enforced.8702*/8703ASSERT(HDR_HAS_L1HDR(hdr));87048705/*8706* Skipped - drop L2ARC entry and mark the header as no8707* longer L2 eligible.8708*/8709if (zio->io_error != 0) {8710/*8711* Error - drop L2ARC entry.8712*/8713list_remove(buflist, hdr);8714arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);87158716uint64_t psize = HDR_GET_PSIZE(hdr);8717l2arc_hdr_arcstats_decrement(hdr);87188719ASSERT(dev->l2ad_vdev != NULL);87208721bytes_dropped +=8722vdev_psize_to_asize(dev->l2ad_vdev, psize);8723(void) zfs_refcount_remove_many(&dev->l2ad_alloc,8724arc_hdr_size(hdr), hdr);8725}87268727/*8728* Allow ARC to begin reads and ghost list evictions to8729* this L2ARC entry.8730*/8731arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING);87328733mutex_exit(hash_lock);8734}87358736/*8737* Free the allocated abd buffers for writing the log blocks.8738* If the zio failed reclaim the allocated space and remove the8739* pointers to these log blocks from the log block pointer list8740* of the L2ARC device.8741*/8742while ((abd_buf = list_remove_tail(&cb->l2wcb_abd_list)) != NULL) {8743abd_free(abd_buf->abd);8744zio_buf_free(abd_buf, sizeof (*abd_buf));8745if (zio->io_error != 0) {8746lb_ptr_buf = list_remove_head(&dev->l2ad_lbptr_list);8747/*8748* L2BLK_GET_PSIZE returns aligned size for log8749* blocks.8750*/8751uint64_t asize =8752L2BLK_GET_PSIZE((lb_ptr_buf->lb_ptr)->lbp_prop);8753bytes_dropped += asize;8754ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize);8755ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count);8756zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize,8757lb_ptr_buf);8758(void) zfs_refcount_remove(&dev->l2ad_lb_count,8759lb_ptr_buf);8760kmem_free(lb_ptr_buf->lb_ptr,8761sizeof (l2arc_log_blkptr_t));8762kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t));8763}8764}8765list_destroy(&cb->l2wcb_abd_list);87668767if (zio->io_error != 0) {8768ARCSTAT_BUMP(arcstat_l2_writes_error);87698770/*8771* Restore the lbps array in the header to its previous state.8772* If the list of log block pointers is empty, zero out the8773* log block pointers in the device header.8774*/8775lb_ptr_buf = list_head(&dev->l2ad_lbptr_list);8776for (int i = 0; i < 2; i++) {8777if (lb_ptr_buf == NULL) {8778/*8779* If the list is empty zero out the device8780* header. 
Otherwise zero out the second log8781* block pointer in the header.8782*/8783if (i == 0) {8784memset(l2dhdr, 0,8785dev->l2ad_dev_hdr_asize);8786} else {8787memset(&l2dhdr->dh_start_lbps[i], 0,8788sizeof (l2arc_log_blkptr_t));8789}8790break;8791}8792memcpy(&l2dhdr->dh_start_lbps[i], lb_ptr_buf->lb_ptr,8793sizeof (l2arc_log_blkptr_t));8794lb_ptr_buf = list_next(&dev->l2ad_lbptr_list,8795lb_ptr_buf);8796}8797}87988799ARCSTAT_BUMP(arcstat_l2_writes_done);8800list_remove(buflist, head);8801ASSERT(!HDR_HAS_L1HDR(head));8802kmem_cache_free(hdr_l2only_cache, head);8803mutex_exit(&dev->l2ad_mtx);88048805ASSERT(dev->l2ad_vdev != NULL);8806vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);88078808l2arc_do_free_on_write();88098810kmem_free(cb, sizeof (l2arc_write_callback_t));8811}88128813static int8814l2arc_untransform(zio_t *zio, l2arc_read_callback_t *cb)8815{8816int ret;8817spa_t *spa = zio->io_spa;8818arc_buf_hdr_t *hdr = cb->l2rcb_hdr;8819blkptr_t *bp = zio->io_bp;8820uint8_t salt[ZIO_DATA_SALT_LEN];8821uint8_t iv[ZIO_DATA_IV_LEN];8822uint8_t mac[ZIO_DATA_MAC_LEN];8823boolean_t no_crypt = B_FALSE;88248825/*8826* ZIL data is never be written to the L2ARC, so we don't need8827* special handling for its unique MAC storage.8828*/8829ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG);8830ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));8831ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);88328833/*8834* If the data was encrypted, decrypt it now. Note that8835* we must check the bp here and not the hdr, since the8836* hdr does not have its encryption parameters updated8837* until arc_read_done().8838*/8839if (BP_IS_ENCRYPTED(bp)) {8840abd_t *eabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr,8841ARC_HDR_USE_RESERVE);88428843zio_crypt_decode_params_bp(bp, salt, iv);8844zio_crypt_decode_mac_bp(bp, mac);88458846ret = spa_do_crypt_abd(B_FALSE, spa, &cb->l2rcb_zb,8847BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp),8848salt, iv, mac, HDR_GET_PSIZE(hdr), eabd,8849hdr->b_l1hdr.b_pabd, &no_crypt);8850if (ret != 0) {8851arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr);8852goto error;8853}88548855/*8856* If we actually performed decryption, replace b_pabd8857* with the decrypted data. Otherwise we can just throw8858* our decryption buffer away.8859*/8860if (!no_crypt) {8861arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,8862arc_hdr_size(hdr), hdr);8863hdr->b_l1hdr.b_pabd = eabd;8864zio->io_abd = eabd;8865} else {8866arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr);8867}8868}88698870/*8871* If the L2ARC block was compressed, but ARC compression8872* is disabled we decompress the data into a new buffer and8873* replace the existing data.8874*/8875if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&8876!HDR_COMPRESSION_ENABLED(hdr)) {8877abd_t *cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr,8878ARC_HDR_USE_RESERVE);88798880ret = zio_decompress_data(HDR_GET_COMPRESS(hdr),8881hdr->b_l1hdr.b_pabd, cabd, HDR_GET_PSIZE(hdr),8882HDR_GET_LSIZE(hdr), &hdr->b_complevel);8883if (ret != 0) {8884arc_free_data_abd(hdr, cabd, arc_hdr_size(hdr), hdr);8885goto error;8886}88878888arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,8889arc_hdr_size(hdr), hdr);8890hdr->b_l1hdr.b_pabd = cabd;8891zio->io_abd = cabd;8892zio->io_size = HDR_GET_LSIZE(hdr);8893}88948895return (0);88968897error:8898return (ret);8899}890089018902/*8903* A read to a cache device completed. 
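* The data may still be in its on-disk form (compressed and/or encrypted), and the header may have been evicted from the L2ARC while the read was in flight.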
Validate buffer contents before8904* handing over to the regular ARC routines.8905*/8906static void8907l2arc_read_done(zio_t *zio)8908{8909int tfm_error = 0;8910l2arc_read_callback_t *cb = zio->io_private;8911arc_buf_hdr_t *hdr;8912kmutex_t *hash_lock;8913boolean_t valid_cksum;8914boolean_t using_rdata = (BP_IS_ENCRYPTED(&cb->l2rcb_bp) &&8915(cb->l2rcb_flags & ZIO_FLAG_RAW_ENCRYPT));89168917ASSERT3P(zio->io_vd, !=, NULL);8918ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);89198920spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);89218922ASSERT3P(cb, !=, NULL);8923hdr = cb->l2rcb_hdr;8924ASSERT3P(hdr, !=, NULL);89258926hash_lock = HDR_LOCK(hdr);8927mutex_enter(hash_lock);8928ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));89298930/*8931* If the data was read into a temporary buffer,8932* move it and free the buffer.8933*/8934if (cb->l2rcb_abd != NULL) {8935ASSERT3U(arc_hdr_size(hdr), <, zio->io_size);8936if (zio->io_error == 0) {8937if (using_rdata) {8938abd_copy(hdr->b_crypt_hdr.b_rabd,8939cb->l2rcb_abd, arc_hdr_size(hdr));8940} else {8941abd_copy(hdr->b_l1hdr.b_pabd,8942cb->l2rcb_abd, arc_hdr_size(hdr));8943}8944}89458946/*8947* The following must be done regardless of whether8948* there was an error:8949* - free the temporary buffer8950* - point zio to the real ARC buffer8951* - set zio size accordingly8952* These are required because zio is either re-used for8953* an I/O of the block in the case of the error8954* or the zio is passed to arc_read_done() and it8955* needs real data.8956*/8957abd_free(cb->l2rcb_abd);8958zio->io_size = zio->io_orig_size = arc_hdr_size(hdr);89598960if (using_rdata) {8961ASSERT(HDR_HAS_RABD(hdr));8962zio->io_abd = zio->io_orig_abd =8963hdr->b_crypt_hdr.b_rabd;8964} else {8965ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);8966zio->io_abd = zio->io_orig_abd = hdr->b_l1hdr.b_pabd;8967}8968}89698970ASSERT3P(zio->io_abd, !=, NULL);89718972/*8973* Check this survived the L2ARC journey.8974*/8975ASSERT(zio->io_abd == hdr->b_l1hdr.b_pabd ||8976(HDR_HAS_RABD(hdr) && zio->io_abd == hdr->b_crypt_hdr.b_rabd));8977zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */8978zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */8979zio->io_prop.zp_complevel = hdr->b_complevel;89808981valid_cksum = arc_cksum_is_equal(hdr, zio);89828983/*8984* b_rabd will always match the data as it exists on disk if it is8985* being used. Therefore if we are reading into b_rabd we do not8986* attempt to untransform the data.8987*/8988if (valid_cksum && !using_rdata)8989tfm_error = l2arc_untransform(zio, cb);89908991if (valid_cksum && tfm_error == 0 && zio->io_error == 0 &&8992!HDR_L2_EVICTED(hdr)) {8993mutex_exit(hash_lock);8994zio->io_private = hdr;8995arc_read_done(zio);8996} else {8997/*8998* Buffer didn't survive caching. Increment stats and8999* reissue to the original storage device.9000*/9001if (zio->io_error != 0) {9002ARCSTAT_BUMP(arcstat_l2_io_error);9003} else {9004zio->io_error = SET_ERROR(EIO);9005}9006if (!valid_cksum || tfm_error != 0)9007ARCSTAT_BUMP(arcstat_l2_cksum_bad);90089009/*9010* If there's no waiter, issue an async i/o to the primary9011* storage now. 
If there *is* a waiter, the caller must9012* issue the i/o in a context where it's OK to block.9013*/9014if (zio->io_waiter == NULL) {9015zio_t *pio = zio_unique_parent(zio);9016void *abd = (using_rdata) ?9017hdr->b_crypt_hdr.b_rabd : hdr->b_l1hdr.b_pabd;90189019ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);90209021zio = zio_read(pio, zio->io_spa, zio->io_bp,9022abd, zio->io_size, arc_read_done,9023hdr, zio->io_priority, cb->l2rcb_flags,9024&cb->l2rcb_zb);90259026/*9027* Original ZIO will be freed, so we need to update9028* ARC header with the new ZIO pointer to be used9029* by zio_change_priority() in arc_read().9030*/9031for (struct arc_callback *acb = hdr->b_l1hdr.b_acb;9032acb != NULL; acb = acb->acb_next)9033acb->acb_zio_head = zio;90349035mutex_exit(hash_lock);9036zio_nowait(zio);9037} else {9038mutex_exit(hash_lock);9039}9040}90419042kmem_free(cb, sizeof (l2arc_read_callback_t));9043}90449045/*9046* This is the list priority from which the L2ARC will search for pages to9047* cache. This is used within loops (0..3) to cycle through lists in the9048* desired order. This order can have a significant effect on cache9049* performance.9050*9051* Currently the metadata lists are hit first, MFU then MRU, followed by9052* the data lists. This function returns a locked list, and also returns9053* the lock pointer.9054*/9055static multilist_sublist_t *9056l2arc_sublist_lock(int list_num)9057{9058multilist_t *ml = NULL;9059unsigned int idx;90609061ASSERT(list_num >= 0 && list_num < L2ARC_FEED_TYPES);90629063switch (list_num) {9064case 0:9065ml = &arc_mfu->arcs_list[ARC_BUFC_METADATA];9066break;9067case 1:9068ml = &arc_mru->arcs_list[ARC_BUFC_METADATA];9069break;9070case 2:9071ml = &arc_mfu->arcs_list[ARC_BUFC_DATA];9072break;9073case 3:9074ml = &arc_mru->arcs_list[ARC_BUFC_DATA];9075break;9076default:9077return (NULL);9078}90799080/*9081* Return a randomly-selected sublist. This is acceptable9082* because the caller feeds only a little bit of data for each9083* call (8MB). Subsequent calls will result in different9084* sublists being selected.9085*/9086idx = multilist_get_random_index(ml);9087return (multilist_sublist_lock_idx(ml, idx));9088}90899090/*9091* Calculates the maximum overhead of L2ARC metadata log blocks for a given9092* L2ARC write size. l2arc_evict and l2arc_write_size need to include this9093* overhead in processing to make sure there is enough headroom available9094* when writing buffers.9095*/9096static inline uint64_t9097l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev)9098{9099if (dev->l2ad_log_entries == 0) {9100return (0);9101} else {9102ASSERT(dev->l2ad_vdev != NULL);91039104uint64_t log_entries = write_sz >> SPA_MINBLOCKSHIFT;91059106uint64_t log_blocks = (log_entries +9107dev->l2ad_log_entries - 1) /9108dev->l2ad_log_entries;91099110return (vdev_psize_to_asize(dev->l2ad_vdev,9111sizeof (l2arc_log_blk_phys_t)) * log_blocks);9112}9113}91149115/*9116* Evict buffers from the device write hand to the distance specified in9117* bytes. 
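* The distance is measured from the write hand (l2ad_hand); if it would run past l2ad_end we evict up to the end of the device, wrap both the write and evict hands back to l2ad_start, and continue from there.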
This distance may span populated buffers, it may span nothing.9118* This is clearing a region on the L2ARC device ready for writing.9119* If the 'all' boolean is set, every buffer is evicted.9120*/9121static void9122l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)9123{9124list_t *buflist;9125arc_buf_hdr_t *hdr, *hdr_prev;9126kmutex_t *hash_lock;9127uint64_t taddr;9128l2arc_lb_ptr_buf_t *lb_ptr_buf, *lb_ptr_buf_prev;9129vdev_t *vd = dev->l2ad_vdev;9130boolean_t rerun;91319132ASSERT(vd != NULL || all);9133ASSERT(dev->l2ad_spa != NULL || all);91349135buflist = &dev->l2ad_buflist;91369137top:9138rerun = B_FALSE;9139if (dev->l2ad_hand + distance > dev->l2ad_end) {9140/*9141* When there is no space to accommodate upcoming writes,9142* evict to the end. Then bump the write and evict hands9143* to the start and iterate. This iteration does not9144* happen indefinitely as we make sure in9145* l2arc_write_size() that when the write hand is reset,9146* the write size does not exceed the end of the device.9147*/9148rerun = B_TRUE;9149taddr = dev->l2ad_end;9150} else {9151taddr = dev->l2ad_hand + distance;9152}9153DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,9154uint64_t, taddr, boolean_t, all);91559156if (!all) {9157/*9158* This check has to be placed after deciding whether to9159* iterate (rerun).9160*/9161if (dev->l2ad_first) {9162/*9163* This is the first sweep through the device. There is9164* nothing to evict. We have already trimmed the9165* whole device.9166*/9167goto out;9168} else {9169/*9170* Trim the space to be evicted.9171*/9172if (vd->vdev_has_trim && dev->l2ad_evict < taddr &&9173l2arc_trim_ahead > 0) {9174/*9175* We have to drop the spa_config lock because9176* vdev_trim_range() will acquire it.9177* l2ad_evict already accounts for the label9178* size. To prevent vdev_trim_ranges() from9179* adding it again, we subtract it from9180* l2ad_evict.9181*/9182spa_config_exit(dev->l2ad_spa, SCL_L2ARC, dev);9183vdev_trim_simple(vd,9184dev->l2ad_evict - VDEV_LABEL_START_SIZE,9185taddr - dev->l2ad_evict);9186spa_config_enter(dev->l2ad_spa, SCL_L2ARC, dev,9187RW_READER);9188}91899190/*9191* When rebuilding L2ARC we retrieve the evict hand9192* from the header of the device. Of note, l2arc_evict()9193* does not actually delete buffers from the cache9194* device, but trimming may do so depending on the9195* hardware implementation. Thus keeping track of the9196* evict hand is useful.9197*/9198dev->l2ad_evict = MAX(dev->l2ad_evict, taddr);9199}9200}92019202retry:9203mutex_enter(&dev->l2ad_mtx);9204/*9205* We have to account for evicted log blocks. 
Run vdev_space_update()9206* on log blocks whose offset (in bytes) is before the evicted offset9207* (in bytes) by searching in the list of pointers to log blocks9208* present in the L2ARC device.9209*/9210for (lb_ptr_buf = list_tail(&dev->l2ad_lbptr_list); lb_ptr_buf;9211lb_ptr_buf = lb_ptr_buf_prev) {92129213lb_ptr_buf_prev = list_prev(&dev->l2ad_lbptr_list, lb_ptr_buf);92149215/* L2BLK_GET_PSIZE returns aligned size for log blocks */9216uint64_t asize = L2BLK_GET_PSIZE(9217(lb_ptr_buf->lb_ptr)->lbp_prop);92189219/*9220* We don't worry about log blocks left behind (ie9221* lbp_payload_start < l2ad_hand) because l2arc_write_buffers()9222* will never write more than l2arc_evict() evicts.9223*/9224if (!all && l2arc_log_blkptr_valid(dev, lb_ptr_buf->lb_ptr)) {9225break;9226} else {9227if (vd != NULL)9228vdev_space_update(vd, -asize, 0, 0);9229ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize);9230ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count);9231zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize,9232lb_ptr_buf);9233(void) zfs_refcount_remove(&dev->l2ad_lb_count,9234lb_ptr_buf);9235list_remove(&dev->l2ad_lbptr_list, lb_ptr_buf);9236kmem_free(lb_ptr_buf->lb_ptr,9237sizeof (l2arc_log_blkptr_t));9238kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t));9239}9240}92419242for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) {9243hdr_prev = list_prev(buflist, hdr);92449245ASSERT(!HDR_EMPTY(hdr));9246hash_lock = HDR_LOCK(hdr);92479248/*9249* We cannot use mutex_enter or else we can deadlock9250* with l2arc_write_buffers (due to swapping the order9251* the hash lock and l2ad_mtx are taken).9252*/9253if (!mutex_tryenter(hash_lock)) {9254/*9255* Missed the hash lock. Retry.9256*/9257ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);9258mutex_exit(&dev->l2ad_mtx);9259mutex_enter(hash_lock);9260mutex_exit(hash_lock);9261goto retry;9262}92639264/*9265* A header can't be on this list if it doesn't have L2 header.9266*/9267ASSERT(HDR_HAS_L2HDR(hdr));92689269/* Ensure this header has finished being written. */9270ASSERT(!HDR_L2_WRITING(hdr));9271ASSERT(!HDR_L2_WRITE_HEAD(hdr));92729273if (!all && (hdr->b_l2hdr.b_daddr >= dev->l2ad_evict ||9274hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) {9275/*9276* We've evicted to the target address,9277* or the end of the device.9278*/9279mutex_exit(hash_lock);9280break;9281}92829283if (!HDR_HAS_L1HDR(hdr)) {9284ASSERT(!HDR_L2_READING(hdr));9285/*9286* This doesn't exist in the ARC. Destroy.9287* arc_hdr_destroy() will call list_remove()9288* and decrement arcstat_l2_lsize.9289*/9290arc_change_state(arc_anon, hdr);9291arc_hdr_destroy(hdr);9292} else {9293ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only);9294ARCSTAT_BUMP(arcstat_l2_evict_l1cached);9295/*9296* Invalidate issued or about to be issued9297* reads, since we may be about to write9298* over this location.9299*/9300if (HDR_L2_READING(hdr)) {9301ARCSTAT_BUMP(arcstat_l2_evict_reading);9302arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);9303}93049305arc_hdr_l2hdr_destroy(hdr);9306}9307mutex_exit(hash_lock);9308}9309mutex_exit(&dev->l2ad_mtx);93109311out:9312/*9313* We need to check if we evict all buffers, otherwise we may iterate9314* unnecessarily.9315*/9316if (!all && rerun) {9317/*9318* Bump device hand to the device start if it is approaching the9319* end. 
l2arc_evict() has already evicted ahead for this case.9320*/9321dev->l2ad_hand = dev->l2ad_start;9322dev->l2ad_evict = dev->l2ad_start;9323dev->l2ad_first = B_FALSE;9324goto top;9325}93269327if (!all) {9328/*9329* In case of cache device removal (all) the following9330* assertions may be violated without functional consequences9331* as the device is about to be removed.9332*/9333ASSERT3U(dev->l2ad_hand + distance, <=, dev->l2ad_end);9334if (!dev->l2ad_first)9335ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict);9336}9337}93389339/*9340* Handle any abd transforms that might be required for writing to the L2ARC.9341* If successful, this function will always return an abd with the data9342* transformed as it is on disk in a new abd of asize bytes.9343*/9344static int9345l2arc_apply_transforms(spa_t *spa, arc_buf_hdr_t *hdr, uint64_t asize,9346abd_t **abd_out)9347{9348int ret;9349abd_t *cabd = NULL, *eabd = NULL, *to_write = hdr->b_l1hdr.b_pabd;9350enum zio_compress compress = HDR_GET_COMPRESS(hdr);9351uint64_t psize = HDR_GET_PSIZE(hdr);9352uint64_t size = arc_hdr_size(hdr);9353boolean_t ismd = HDR_ISTYPE_METADATA(hdr);9354boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);9355dsl_crypto_key_t *dck = NULL;9356uint8_t mac[ZIO_DATA_MAC_LEN] = { 0 };9357boolean_t no_crypt = B_FALSE;93589359ASSERT((HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&9360!HDR_COMPRESSION_ENABLED(hdr)) ||9361HDR_ENCRYPTED(hdr) || HDR_SHARED_DATA(hdr) || psize != asize);9362ASSERT3U(psize, <=, asize);93639364/*9365* If this data simply needs its own buffer, we simply allocate it9366* and copy the data. This may be done to eliminate a dependency on a9367* shared buffer or to reallocate the buffer to match asize.9368*/9369if (HDR_HAS_RABD(hdr)) {9370ASSERT3U(asize, >, psize);9371to_write = abd_alloc_for_io(asize, ismd);9372abd_copy(to_write, hdr->b_crypt_hdr.b_rabd, psize);9373abd_zero_off(to_write, psize, asize - psize);9374goto out;9375}93769377if ((compress == ZIO_COMPRESS_OFF || HDR_COMPRESSION_ENABLED(hdr)) &&9378!HDR_ENCRYPTED(hdr)) {9379ASSERT3U(size, ==, psize);9380to_write = abd_alloc_for_io(asize, ismd);9381abd_copy(to_write, hdr->b_l1hdr.b_pabd, size);9382if (asize > size)9383abd_zero_off(to_write, size, asize - size);9384goto out;9385}93869387if (compress != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) {9388cabd = abd_alloc_for_io(MAX(size, asize), ismd);9389uint64_t csize = zio_compress_data(compress, to_write, &cabd,9390size, MIN(size, psize), hdr->b_complevel);9391if (csize >= size || csize > psize) {9392/*9393* We can't re-compress the block into the original9394* psize. Even if it fits into asize, it does not9395* matter, since checksum will never match on read.9396*/9397abd_free(cabd);9398return (SET_ERROR(EIO));9399}9400if (asize > csize)9401abd_zero_off(cabd, csize, asize - csize);9402to_write = cabd;9403}94049405if (HDR_ENCRYPTED(hdr)) {9406eabd = abd_alloc_for_io(asize, ismd);94079408/*9409* If the dataset was disowned before the buffer9410* made it to this point, the key to re-encrypt9411* it won't be available. 
In this case we simply9412* won't write the buffer to the L2ARC.9413*/9414ret = spa_keystore_lookup_key(spa, hdr->b_crypt_hdr.b_dsobj,9415FTAG, &dck);9416if (ret != 0)9417goto error;94189419ret = zio_do_crypt_abd(B_TRUE, &dck->dck_key,9420hdr->b_crypt_hdr.b_ot, bswap, hdr->b_crypt_hdr.b_salt,9421hdr->b_crypt_hdr.b_iv, mac, psize, to_write, eabd,9422&no_crypt);9423if (ret != 0)9424goto error;94259426if (no_crypt)9427abd_copy(eabd, to_write, psize);94289429if (psize != asize)9430abd_zero_off(eabd, psize, asize - psize);94319432/* assert that the MAC we got here matches the one we saved */9433ASSERT0(memcmp(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN));9434spa_keystore_dsl_key_rele(spa, dck, FTAG);94359436if (to_write == cabd)9437abd_free(cabd);94389439to_write = eabd;9440}94419442out:9443ASSERT3P(to_write, !=, hdr->b_l1hdr.b_pabd);9444*abd_out = to_write;9445return (0);94469447error:9448if (dck != NULL)9449spa_keystore_dsl_key_rele(spa, dck, FTAG);9450if (cabd != NULL)9451abd_free(cabd);9452if (eabd != NULL)9453abd_free(eabd);94549455*abd_out = NULL;9456return (ret);9457}94589459static void9460l2arc_blk_fetch_done(zio_t *zio)9461{9462l2arc_read_callback_t *cb;94639464cb = zio->io_private;9465if (cb->l2rcb_abd != NULL)9466abd_free(cb->l2rcb_abd);9467kmem_free(cb, sizeof (l2arc_read_callback_t));9468}94699470/*9471* Find and write ARC buffers to the L2ARC device.9472*9473* An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid9474* for reading until they have completed writing.9475* The headroom_boost is an in-out parameter used to maintain headroom boost9476* state between calls to this function.9477*9478* Returns the number of bytes actually written (which may be smaller than9479* the delta by which the device hand has changed due to alignment and the9480* writing of log blocks).9481*/9482static uint64_t9483l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)9484{9485arc_buf_hdr_t *hdr, *head, *marker;9486uint64_t write_asize, write_psize, headroom;9487boolean_t full, from_head = !arc_warm;9488l2arc_write_callback_t *cb = NULL;9489zio_t *pio, *wzio;9490uint64_t guid = spa_load_guid(spa);9491l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;94929493ASSERT3P(dev->l2ad_vdev, !=, NULL);94949495pio = NULL;9496write_asize = write_psize = 0;9497full = B_FALSE;9498head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);9499arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);9500marker = arc_state_alloc_marker();95019502/*9503* Copy buffers for L2ARC writing.9504*/9505for (int pass = 0; pass < L2ARC_FEED_TYPES; pass++) {9506/*9507* pass == 0: MFU meta9508* pass == 1: MRU meta9509* pass == 2: MFU data9510* pass == 3: MRU data9511*/9512if (l2arc_mfuonly == 1) {9513if (pass == 1 || pass == 3)9514continue;9515} else if (l2arc_mfuonly > 1) {9516if (pass == 3)9517continue;9518}95199520uint64_t passed_sz = 0;9521headroom = target_sz * l2arc_headroom;9522if (zfs_compressed_arc_enabled)9523headroom = (headroom * l2arc_headroom_boost) / 100;95249525/*9526* Until the ARC is warm and starts to evict, read from the9527* head of the ARC lists rather than the tail.9528*/9529multilist_sublist_t *mls = l2arc_sublist_lock(pass);9530ASSERT3P(mls, !=, NULL);9531if (from_head)9532hdr = multilist_sublist_head(mls);9533else9534hdr = multilist_sublist_tail(mls);95359536while (hdr != NULL) {9537kmutex_t *hash_lock;9538abd_t *to_write = NULL;95399540hash_lock = HDR_LOCK(hdr);9541if (!mutex_tryenter(hash_lock)) {9542skip:9543/* Skip this buffer rather than waiting. 
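* It may still be picked up by a later feed cycle if it remains L2-eligible.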
*/9544if (from_head)9545hdr = multilist_sublist_next(mls, hdr);9546else9547hdr = multilist_sublist_prev(mls, hdr);9548continue;9549}95509551passed_sz += HDR_GET_LSIZE(hdr);9552if (l2arc_headroom != 0 && passed_sz > headroom) {9553/*9554* Searched too far.9555*/9556mutex_exit(hash_lock);9557break;9558}95599560if (!l2arc_write_eligible(guid, hdr)) {9561mutex_exit(hash_lock);9562goto skip;9563}95649565ASSERT(HDR_HAS_L1HDR(hdr));9566ASSERT3U(HDR_GET_PSIZE(hdr), >, 0);9567ASSERT3U(arc_hdr_size(hdr), >, 0);9568ASSERT(hdr->b_l1hdr.b_pabd != NULL ||9569HDR_HAS_RABD(hdr));9570uint64_t psize = HDR_GET_PSIZE(hdr);9571uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,9572psize);95739574/*9575* If the allocated size of this buffer plus the max9576* size for the pending log block exceeds the evicted9577* target size, terminate writing buffers for this run.9578*/9579if (write_asize + asize +9580sizeof (l2arc_log_blk_phys_t) > target_sz) {9581full = B_TRUE;9582mutex_exit(hash_lock);9583break;9584}95859586/*9587* We should not sleep with sublist lock held or it9588* may block ARC eviction. Insert a marker to save9589* the position and drop the lock.9590*/9591if (from_head) {9592multilist_sublist_insert_after(mls, hdr,9593marker);9594} else {9595multilist_sublist_insert_before(mls, hdr,9596marker);9597}9598multilist_sublist_unlock(mls);95999600/*9601* If this header has b_rabd, we can use this since it9602* must always match the data exactly as it exists on9603* disk. Otherwise, the L2ARC can normally use the9604* hdr's data, but if we're sharing data between the9605* hdr and one of its bufs, L2ARC needs its own copy of9606* the data so that the ZIO below can't race with the9607* buf consumer. To ensure that this copy will be9608* available for the lifetime of the ZIO and be cleaned9609* up afterwards, we add it to the l2arc_free_on_write9610* queue. 
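* That queue is drained by l2arc_do_free_on_write(), which l2arc_write_done() calls once the write zio completes.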
If we need to apply any transforms to the9611* data (compression, encryption) we will also need the9612* extra buffer.9613*/9614if (HDR_HAS_RABD(hdr) && psize == asize) {9615to_write = hdr->b_crypt_hdr.b_rabd;9616} else if ((HDR_COMPRESSION_ENABLED(hdr) ||9617HDR_GET_COMPRESS(hdr) == ZIO_COMPRESS_OFF) &&9618!HDR_ENCRYPTED(hdr) && !HDR_SHARED_DATA(hdr) &&9619psize == asize) {9620to_write = hdr->b_l1hdr.b_pabd;9621} else {9622int ret;9623arc_buf_contents_t type = arc_buf_type(hdr);96249625ret = l2arc_apply_transforms(spa, hdr, asize,9626&to_write);9627if (ret != 0) {9628arc_hdr_clear_flags(hdr,9629ARC_FLAG_L2CACHE);9630mutex_exit(hash_lock);9631goto next;9632}96339634l2arc_free_abd_on_write(to_write, asize, type);9635}96369637hdr->b_l2hdr.b_dev = dev;9638hdr->b_l2hdr.b_daddr = dev->l2ad_hand;9639hdr->b_l2hdr.b_hits = 0;9640hdr->b_l2hdr.b_arcs_state =9641hdr->b_l1hdr.b_state->arcs_state;9642/* l2arc_hdr_arcstats_update() expects a valid asize */9643HDR_SET_L2SIZE(hdr, asize);9644arc_hdr_set_flags(hdr, ARC_FLAG_HAS_L2HDR |9645ARC_FLAG_L2_WRITING);96469647(void) zfs_refcount_add_many(&dev->l2ad_alloc,9648arc_hdr_size(hdr), hdr);9649l2arc_hdr_arcstats_increment(hdr);9650vdev_space_update(dev->l2ad_vdev, asize, 0, 0);96519652mutex_enter(&dev->l2ad_mtx);9653if (pio == NULL) {9654/*9655* Insert a dummy header on the buflist so9656* l2arc_write_done() can find where the9657* write buffers begin without searching.9658*/9659list_insert_head(&dev->l2ad_buflist, head);9660}9661list_insert_head(&dev->l2ad_buflist, hdr);9662mutex_exit(&dev->l2ad_mtx);96639664boolean_t commit = l2arc_log_blk_insert(dev, hdr);9665mutex_exit(hash_lock);96669667if (pio == NULL) {9668cb = kmem_alloc(9669sizeof (l2arc_write_callback_t), KM_SLEEP);9670cb->l2wcb_dev = dev;9671cb->l2wcb_head = head;9672list_create(&cb->l2wcb_abd_list,9673sizeof (l2arc_lb_abd_buf_t),9674offsetof(l2arc_lb_abd_buf_t, node));9675pio = zio_root(spa, l2arc_write_done, cb,9676ZIO_FLAG_CANFAIL);9677}96789679wzio = zio_write_phys(pio, dev->l2ad_vdev,9680dev->l2ad_hand, asize, to_write,9681ZIO_CHECKSUM_OFF, NULL, hdr,9682ZIO_PRIORITY_ASYNC_WRITE,9683ZIO_FLAG_CANFAIL, B_FALSE);96849685DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,9686zio_t *, wzio);9687zio_nowait(wzio);96889689write_psize += psize;9690write_asize += asize;9691dev->l2ad_hand += asize;96929693if (commit) {9694/* l2ad_hand will be adjusted inside. */9695write_asize +=9696l2arc_log_blk_commit(dev, pio, cb);9697}96989699next:9700multilist_sublist_lock(mls);9701if (from_head)9702hdr = multilist_sublist_next(mls, marker);9703else9704hdr = multilist_sublist_prev(mls, marker);9705multilist_sublist_remove(mls, marker);9706}97079708multilist_sublist_unlock(mls);97099710if (full == B_TRUE)9711break;9712}97139714arc_state_free_marker(marker);97159716/* No buffers selected for writing? 
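* (pio is only created after the first buffer has been queued, so a NULL pio is the indicator.)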
*/9717if (pio == NULL) {9718ASSERT0(write_psize);9719ASSERT(!HDR_HAS_L1HDR(head));9720kmem_cache_free(hdr_l2only_cache, head);97219722/*9723* Although we did not write any buffers l2ad_evict may9724* have advanced.9725*/9726if (dev->l2ad_evict != l2dhdr->dh_evict)9727l2arc_dev_hdr_update(dev);97289729return (0);9730}97319732if (!dev->l2ad_first)9733ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict);97349735ASSERT3U(write_asize, <=, target_sz);9736ARCSTAT_BUMP(arcstat_l2_writes_sent);9737ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);97389739dev->l2ad_writing = B_TRUE;9740(void) zio_wait(pio);9741dev->l2ad_writing = B_FALSE;97429743/*9744* Update the device header after the zio completes as9745* l2arc_write_done() may have updated the memory holding the log block9746* pointers in the device header.9747*/9748l2arc_dev_hdr_update(dev);97499750return (write_asize);9751}97529753static boolean_t9754l2arc_hdr_limit_reached(void)9755{9756int64_t s = aggsum_upper_bound(&arc_sums.arcstat_l2_hdr_size);97579758return (arc_reclaim_needed() ||9759(s > (arc_warm ? arc_c : arc_c_max) * l2arc_meta_percent / 100));9760}97619762/*9763* This thread feeds the L2ARC at regular intervals. This is the beating9764* heart of the L2ARC.9765*/9766static __attribute__((noreturn)) void9767l2arc_feed_thread(void *unused)9768{9769(void) unused;9770callb_cpr_t cpr;9771l2arc_dev_t *dev;9772spa_t *spa;9773uint64_t size, wrote;9774clock_t begin, next = ddi_get_lbolt();9775fstrans_cookie_t cookie;97769777CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);97789779mutex_enter(&l2arc_feed_thr_lock);97809781cookie = spl_fstrans_mark();9782while (l2arc_thread_exit == 0) {9783CALLB_CPR_SAFE_BEGIN(&cpr);9784(void) cv_timedwait_idle(&l2arc_feed_thr_cv,9785&l2arc_feed_thr_lock, next);9786CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);9787next = ddi_get_lbolt() + hz;97889789/*9790* Quick check for L2ARC devices.9791*/9792mutex_enter(&l2arc_dev_mtx);9793if (l2arc_ndev == 0) {9794mutex_exit(&l2arc_dev_mtx);9795continue;9796}9797mutex_exit(&l2arc_dev_mtx);9798begin = ddi_get_lbolt();97999800/*9801* This selects the next l2arc device to write to, and in9802* doing so the next spa to feed from: dev->l2ad_spa. This9803* will return NULL if there are now no l2arc devices or if9804* they are all faulted.9805*9806* If a device is returned, its spa's config lock is also9807* held to prevent device removal. 
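* That lock is dropped again via spa_config_exit() at the bottom of each loop iteration, including the early-continue paths below.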
l2arc_dev_get_next()9808* will grab and release l2arc_dev_mtx.9809*/9810if ((dev = l2arc_dev_get_next()) == NULL)9811continue;98129813spa = dev->l2ad_spa;9814ASSERT3P(spa, !=, NULL);98159816/*9817* If the pool is read-only then force the feed thread to9818* sleep a little longer.9819*/9820if (!spa_writeable(spa)) {9821next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;9822spa_config_exit(spa, SCL_L2ARC, dev);9823continue;9824}98259826/*9827* Avoid contributing to memory pressure.9828*/9829if (l2arc_hdr_limit_reached()) {9830ARCSTAT_BUMP(arcstat_l2_abort_lowmem);9831spa_config_exit(spa, SCL_L2ARC, dev);9832continue;9833}98349835ARCSTAT_BUMP(arcstat_l2_feeds);98369837size = l2arc_write_size(dev);98389839/*9840* Evict L2ARC buffers that will be overwritten.9841*/9842l2arc_evict(dev, size, B_FALSE);98439844/*9845* Write ARC buffers.9846*/9847wrote = l2arc_write_buffers(spa, dev, size);98489849/*9850* Calculate interval between writes.9851*/9852next = l2arc_write_interval(begin, size, wrote);9853spa_config_exit(spa, SCL_L2ARC, dev);9854}9855spl_fstrans_unmark(cookie);98569857l2arc_thread_exit = 0;9858cv_broadcast(&l2arc_feed_thr_cv);9859CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */9860thread_exit();9861}98629863boolean_t9864l2arc_vdev_present(vdev_t *vd)9865{9866return (l2arc_vdev_get(vd) != NULL);9867}98689869/*9870* Returns the l2arc_dev_t associated with a particular vdev_t or NULL if9871* the vdev_t isn't an L2ARC device.9872*/9873l2arc_dev_t *9874l2arc_vdev_get(vdev_t *vd)9875{9876l2arc_dev_t *dev;98779878mutex_enter(&l2arc_dev_mtx);9879for (dev = list_head(l2arc_dev_list); dev != NULL;9880dev = list_next(l2arc_dev_list, dev)) {9881if (dev->l2ad_vdev == vd)9882break;9883}9884mutex_exit(&l2arc_dev_mtx);98859886return (dev);9887}98889889static void9890l2arc_rebuild_dev(l2arc_dev_t *dev, boolean_t reopen)9891{9892l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;9893uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;9894spa_t *spa = dev->l2ad_spa;98959896/*9897* After a l2arc_remove_vdev(), the spa_t will no longer be valid9898*/9899if (spa == NULL)9900return;99019902/*9903* The L2ARC has to hold at least the payload of one log block for9904* them to be restored (persistent L2ARC). The payload of a log block9905* depends on the amount of its log entries. We always write log blocks9906* with 1022 entries. How many of them are committed or restored depends9907* on the size of the L2ARC device. Thus the maximum payload of9908* one log block is 1022 * SPA_MAXBLOCKSIZE = 16GB. 
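* (SPA_MAXBLOCKSIZE is 16 MiB, so 1022 maximum-sized entries add up to just under 16 GiB of payload.)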
If the L2ARC device9909* is less than that, we reduce the amount of committed and restored9910* log entries per block so as to enable persistence.9911*/9912if (dev->l2ad_end < l2arc_rebuild_blocks_min_l2size) {9913dev->l2ad_log_entries = 0;9914} else {9915dev->l2ad_log_entries = MIN((dev->l2ad_end -9916dev->l2ad_start) >> SPA_MAXBLOCKSHIFT,9917L2ARC_LOG_BLK_MAX_ENTRIES);9918}99199920/*9921* Read the device header, if an error is returned do not rebuild L2ARC.9922*/9923if (l2arc_dev_hdr_read(dev) == 0 && dev->l2ad_log_entries > 0) {9924/*9925* If we are onlining a cache device (vdev_reopen) that was9926* still present (l2arc_vdev_present()) and rebuild is enabled,9927* we should evict all ARC buffers and pointers to log blocks9928* and reclaim their space before restoring its contents to9929* L2ARC.9930*/9931if (reopen) {9932if (!l2arc_rebuild_enabled) {9933return;9934} else {9935l2arc_evict(dev, 0, B_TRUE);9936/* start a new log block */9937dev->l2ad_log_ent_idx = 0;9938dev->l2ad_log_blk_payload_asize = 0;9939dev->l2ad_log_blk_payload_start = 0;9940}9941}9942/*9943* Just mark the device as pending for a rebuild. We won't9944* be starting a rebuild in line here as it would block pool9945* import. Instead spa_load_impl will hand that off to an9946* async task which will call l2arc_spa_rebuild_start.9947*/9948dev->l2ad_rebuild = B_TRUE;9949} else if (spa_writeable(spa)) {9950/*9951* In this case TRIM the whole device if l2arc_trim_ahead > 0,9952* otherwise create a new header. We zero out the memory holding9953* the header to reset dh_start_lbps. If we TRIM the whole9954* device the new header will be written by9955* vdev_trim_l2arc_thread() at the end of the TRIM to update the9956* trim_state in the header too. When reading the header, if9957* trim_state is not VDEV_TRIM_COMPLETE and l2arc_trim_ahead > 09958* we opt to TRIM the whole device again.9959*/9960if (l2arc_trim_ahead > 0) {9961dev->l2ad_trim_all = B_TRUE;9962} else {9963memset(l2dhdr, 0, l2dhdr_asize);9964l2arc_dev_hdr_update(dev);9965}9966}9967}99689969/*9970* Add a vdev for use by the L2ARC. 
By this point the spa has already9971* validated the vdev and opened it.9972*/9973void9974l2arc_add_vdev(spa_t *spa, vdev_t *vd)9975{9976l2arc_dev_t *adddev;9977uint64_t l2dhdr_asize;99789979ASSERT(!l2arc_vdev_present(vd));99809981/*9982* Create a new l2arc device entry.9983*/9984adddev = vmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);9985adddev->l2ad_spa = spa;9986adddev->l2ad_vdev = vd;9987/* leave extra size for an l2arc device header */9988l2dhdr_asize = adddev->l2ad_dev_hdr_asize =9989MAX(sizeof (*adddev->l2ad_dev_hdr), 1 << vd->vdev_ashift);9990adddev->l2ad_start = VDEV_LABEL_START_SIZE + l2dhdr_asize;9991adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);9992ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end);9993adddev->l2ad_hand = adddev->l2ad_start;9994adddev->l2ad_evict = adddev->l2ad_start;9995adddev->l2ad_first = B_TRUE;9996adddev->l2ad_writing = B_FALSE;9997adddev->l2ad_trim_all = B_FALSE;9998list_link_init(&adddev->l2ad_node);9999adddev->l2ad_dev_hdr = kmem_zalloc(l2dhdr_asize, KM_SLEEP);1000010001mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);10002/*10003* This is a list of all ARC buffers that are still valid on the10004* device.10005*/10006list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),10007offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));1000810009/*10010* This is a list of pointers to log blocks that are still present10011* on the device.10012*/10013list_create(&adddev->l2ad_lbptr_list, sizeof (l2arc_lb_ptr_buf_t),10014offsetof(l2arc_lb_ptr_buf_t, node));1001510016vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);10017zfs_refcount_create(&adddev->l2ad_alloc);10018zfs_refcount_create(&adddev->l2ad_lb_asize);10019zfs_refcount_create(&adddev->l2ad_lb_count);1002010021/*10022* Decide if dev is eligible for L2ARC rebuild or whole device10023* trimming. This has to happen before the device is added in the10024* cache device list and l2arc_dev_mtx is released. Otherwise10025* l2arc_feed_thread() might already start writing on the10026* device.10027*/10028l2arc_rebuild_dev(adddev, B_FALSE);1002910030/*10031* Add device to global list10032*/10033mutex_enter(&l2arc_dev_mtx);10034list_insert_head(l2arc_dev_list, adddev);10035atomic_inc_64(&l2arc_ndev);10036mutex_exit(&l2arc_dev_mtx);10037}1003810039/*10040* Decide if a vdev is eligible for L2ARC rebuild, called from vdev_reopen()10041* in case of onlining a cache device.10042*/10043void10044l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen)10045{10046l2arc_dev_t *dev = NULL;1004710048dev = l2arc_vdev_get(vd);10049ASSERT3P(dev, !=, NULL);1005010051/*10052* In contrast to l2arc_add_vdev() we do not have to worry about10053* l2arc_feed_thread() invalidating previous content when onlining a10054* cache device. The device parameters (l2ad*) are not cleared when10055* offlining the device and writing new buffers will not invalidate10056* all previous content. 
In worst case only buffers that have not had10057* their log block written to the device will be lost.10058* When onlining the cache device (ie offline->online without exporting10059* the pool in between) this happens:10060* vdev_reopen() -> vdev_open() -> l2arc_rebuild_vdev()10061* | |10062* vdev_is_dead() = B_FALSE l2ad_rebuild = B_TRUE10063* During the time where vdev_is_dead = B_FALSE and until l2ad_rebuild10064* is set to B_TRUE we might write additional buffers to the device.10065*/10066l2arc_rebuild_dev(dev, reopen);10067}1006810069typedef struct {10070l2arc_dev_t *rva_l2arc_dev;10071uint64_t rva_spa_gid;10072uint64_t rva_vdev_gid;10073boolean_t rva_async;1007410075} remove_vdev_args_t;1007610077static void10078l2arc_device_teardown(void *arg)10079{10080remove_vdev_args_t *rva = arg;10081l2arc_dev_t *remdev = rva->rva_l2arc_dev;10082hrtime_t start_time = gethrtime();1008310084/*10085* Clear all buflists and ARC references. L2ARC device flush.10086*/10087l2arc_evict(remdev, 0, B_TRUE);10088list_destroy(&remdev->l2ad_buflist);10089ASSERT(list_is_empty(&remdev->l2ad_lbptr_list));10090list_destroy(&remdev->l2ad_lbptr_list);10091mutex_destroy(&remdev->l2ad_mtx);10092zfs_refcount_destroy(&remdev->l2ad_alloc);10093zfs_refcount_destroy(&remdev->l2ad_lb_asize);10094zfs_refcount_destroy(&remdev->l2ad_lb_count);10095kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize);10096vmem_free(remdev, sizeof (l2arc_dev_t));1009710098uint64_t elapsed = NSEC2MSEC(gethrtime() - start_time);10099if (elapsed > 0) {10100zfs_dbgmsg("spa %llu, vdev %llu removed in %llu ms",10101(u_longlong_t)rva->rva_spa_gid,10102(u_longlong_t)rva->rva_vdev_gid,10103(u_longlong_t)elapsed);10104}1010510106if (rva->rva_async)10107arc_async_flush_remove(rva->rva_spa_gid, 2);10108kmem_free(rva, sizeof (remove_vdev_args_t));10109}1011010111/*10112* Remove a vdev from the L2ARC.10113*/10114void10115l2arc_remove_vdev(vdev_t *vd)10116{10117spa_t *spa = vd->vdev_spa;10118boolean_t asynchronous = spa->spa_state == POOL_STATE_EXPORTED ||10119spa->spa_state == POOL_STATE_DESTROYED;1012010121/*10122* Find the device by vdev10123*/10124l2arc_dev_t *remdev = l2arc_vdev_get(vd);10125ASSERT3P(remdev, !=, NULL);1012610127/*10128* Save info for final teardown10129*/10130remove_vdev_args_t *rva = kmem_alloc(sizeof (remove_vdev_args_t),10131KM_SLEEP);10132rva->rva_l2arc_dev = remdev;10133rva->rva_spa_gid = spa_load_guid(spa);10134rva->rva_vdev_gid = remdev->l2ad_vdev->vdev_guid;1013510136/*10137* Cancel any ongoing or scheduled rebuild.10138*/10139mutex_enter(&l2arc_rebuild_thr_lock);10140remdev->l2ad_rebuild_cancel = B_TRUE;10141if (remdev->l2ad_rebuild_began == B_TRUE) {10142while (remdev->l2ad_rebuild == B_TRUE)10143cv_wait(&l2arc_rebuild_thr_cv, &l2arc_rebuild_thr_lock);10144}10145mutex_exit(&l2arc_rebuild_thr_lock);10146rva->rva_async = asynchronous;1014710148/*10149* Remove device from global list10150*/10151ASSERT(spa_config_held(spa, SCL_L2ARC, RW_WRITER) & SCL_L2ARC);10152mutex_enter(&l2arc_dev_mtx);10153list_remove(l2arc_dev_list, remdev);10154l2arc_dev_last = NULL; /* may have been invalidated */10155atomic_dec_64(&l2arc_ndev);1015610157/* During a pool export spa & vdev will no longer be valid */10158if (asynchronous) {10159remdev->l2ad_spa = NULL;10160remdev->l2ad_vdev = NULL;10161}10162mutex_exit(&l2arc_dev_mtx);1016310164if (!asynchronous) {10165l2arc_device_teardown(rva);10166return;10167}1016810169arc_async_flush_t *af = arc_async_flush_add(rva->rva_spa_gid, 2);1017010171taskq_dispatch_ent(arc_flush_taskq, 
l2arc_device_teardown, rva,10172TQ_SLEEP, &af->af_tqent);10173}1017410175void10176l2arc_init(void)10177{10178l2arc_thread_exit = 0;10179l2arc_ndev = 0;1018010181mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);10182cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);10183mutex_init(&l2arc_rebuild_thr_lock, NULL, MUTEX_DEFAULT, NULL);10184cv_init(&l2arc_rebuild_thr_cv, NULL, CV_DEFAULT, NULL);10185mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);10186mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);1018710188l2arc_dev_list = &L2ARC_dev_list;10189l2arc_free_on_write = &L2ARC_free_on_write;10190list_create(l2arc_dev_list, sizeof (l2arc_dev_t),10191offsetof(l2arc_dev_t, l2ad_node));10192list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),10193offsetof(l2arc_data_free_t, l2df_list_node));10194}1019510196void10197l2arc_fini(void)10198{10199mutex_destroy(&l2arc_feed_thr_lock);10200cv_destroy(&l2arc_feed_thr_cv);10201mutex_destroy(&l2arc_rebuild_thr_lock);10202cv_destroy(&l2arc_rebuild_thr_cv);10203mutex_destroy(&l2arc_dev_mtx);10204mutex_destroy(&l2arc_free_on_write_mtx);1020510206list_destroy(l2arc_dev_list);10207list_destroy(l2arc_free_on_write);10208}1020910210void10211l2arc_start(void)10212{10213if (!(spa_mode_global & SPA_MODE_WRITE))10214return;1021510216(void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,10217TS_RUN, defclsyspri);10218}1021910220void10221l2arc_stop(void)10222{10223if (!(spa_mode_global & SPA_MODE_WRITE))10224return;1022510226mutex_enter(&l2arc_feed_thr_lock);10227cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */10228l2arc_thread_exit = 1;10229while (l2arc_thread_exit != 0)10230cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);10231mutex_exit(&l2arc_feed_thr_lock);10232}1023310234/*10235* Punches out rebuild threads for the L2ARC devices in a spa. 
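* One rebuild thread is created per cache vdev that still has l2ad_rebuild set.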
This should10236* be called after pool import from the spa async thread, since starting10237* these threads directly from spa_import() will make them part of the10238* "zpool import" context and delay process exit (and thus pool import).10239*/10240void10241l2arc_spa_rebuild_start(spa_t *spa)10242{10243ASSERT(spa_namespace_held());1024410245/*10246* Locate the spa's l2arc devices and kick off rebuild threads.10247*/10248for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {10249l2arc_dev_t *dev =10250l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);10251if (dev == NULL) {10252/* Don't attempt a rebuild if the vdev is UNAVAIL */10253continue;10254}10255mutex_enter(&l2arc_rebuild_thr_lock);10256if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {10257dev->l2ad_rebuild_began = B_TRUE;10258(void) thread_create(NULL, 0, l2arc_dev_rebuild_thread,10259dev, 0, &p0, TS_RUN, minclsyspri);10260}10261mutex_exit(&l2arc_rebuild_thr_lock);10262}10263}1026410265void10266l2arc_spa_rebuild_stop(spa_t *spa)10267{10268ASSERT(spa_namespace_held() ||10269spa->spa_export_thread == curthread);1027010271for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {10272l2arc_dev_t *dev =10273l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);10274if (dev == NULL)10275continue;10276mutex_enter(&l2arc_rebuild_thr_lock);10277dev->l2ad_rebuild_cancel = B_TRUE;10278mutex_exit(&l2arc_rebuild_thr_lock);10279}10280for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {10281l2arc_dev_t *dev =10282l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);10283if (dev == NULL)10284continue;10285mutex_enter(&l2arc_rebuild_thr_lock);10286if (dev->l2ad_rebuild_began == B_TRUE) {10287while (dev->l2ad_rebuild == B_TRUE) {10288cv_wait(&l2arc_rebuild_thr_cv,10289&l2arc_rebuild_thr_lock);10290}10291}10292mutex_exit(&l2arc_rebuild_thr_lock);10293}10294}1029510296/*10297* Main entry point for L2ARC rebuilding.10298*/10299static __attribute__((noreturn)) void10300l2arc_dev_rebuild_thread(void *arg)10301{10302l2arc_dev_t *dev = arg;1030310304VERIFY(dev->l2ad_rebuild);10305(void) l2arc_rebuild(dev);10306mutex_enter(&l2arc_rebuild_thr_lock);10307dev->l2ad_rebuild_began = B_FALSE;10308dev->l2ad_rebuild = B_FALSE;10309cv_signal(&l2arc_rebuild_thr_cv);10310mutex_exit(&l2arc_rebuild_thr_lock);1031110312thread_exit();10313}1031410315/*10316* This function implements the actual L2ARC metadata rebuild. 
It:10317* starts reading the log block chain and restores each block's contents10318* to memory (reconstructing arc_buf_hdr_t's).10319*10320* Operation stops under any of the following conditions:10321*10322* 1) We reach the end of the log block chain.10323* 2) We encounter *any* error condition (cksum errors, io errors)10324*/10325static int10326l2arc_rebuild(l2arc_dev_t *dev)10327{10328vdev_t *vd = dev->l2ad_vdev;10329spa_t *spa = vd->vdev_spa;10330int err = 0;10331l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;10332l2arc_log_blk_phys_t *this_lb, *next_lb;10333zio_t *this_io = NULL, *next_io = NULL;10334l2arc_log_blkptr_t lbps[2];10335l2arc_lb_ptr_buf_t *lb_ptr_buf;10336boolean_t lock_held;1033710338this_lb = vmem_zalloc(sizeof (*this_lb), KM_SLEEP);10339next_lb = vmem_zalloc(sizeof (*next_lb), KM_SLEEP);1034010341/*10342* We prevent device removal while issuing reads to the device,10343* then during the rebuilding phases we drop this lock again so10344* that a spa_unload or device remove can be initiated - this is10345* safe, because the spa will signal us to stop before removing10346* our device and wait for us to stop.10347*/10348spa_config_enter(spa, SCL_L2ARC, vd, RW_READER);10349lock_held = B_TRUE;1035010351/*10352* Retrieve the persistent L2ARC device state.10353* L2BLK_GET_PSIZE returns aligned size for log blocks.10354*/10355dev->l2ad_evict = MAX(l2dhdr->dh_evict, dev->l2ad_start);10356dev->l2ad_hand = MAX(l2dhdr->dh_start_lbps[0].lbp_daddr +10357L2BLK_GET_PSIZE((&l2dhdr->dh_start_lbps[0])->lbp_prop),10358dev->l2ad_start);10359dev->l2ad_first = !!(l2dhdr->dh_flags & L2ARC_DEV_HDR_EVICT_FIRST);1036010361vd->vdev_trim_action_time = l2dhdr->dh_trim_action_time;10362vd->vdev_trim_state = l2dhdr->dh_trim_state;1036310364/*10365* In case the zfs module parameter l2arc_rebuild_enabled is false10366* we do not start the rebuild process.10367*/10368if (!l2arc_rebuild_enabled)10369goto out;1037010371/* Prepare the rebuild process */10372memcpy(lbps, l2dhdr->dh_start_lbps, sizeof (lbps));1037310374/* Start the rebuild process */10375for (;;) {10376if (!l2arc_log_blkptr_valid(dev, &lbps[0]))10377break;1037810379if ((err = l2arc_log_blk_read(dev, &lbps[0], &lbps[1],10380this_lb, next_lb, this_io, &next_io)) != 0)10381goto out;1038210383/*10384* Our memory pressure valve. If the system is running low10385* on memory, rather than swamping memory with new ARC buf10386* hdrs, we opt not to rebuild the L2ARC. 
At this point,10387* however, we have already set up our L2ARC dev to chain in10388* new metadata log blocks, so the user may choose to offline/10389* online the L2ARC dev at a later time (or re-import the pool)10390* to reconstruct it (when there's less memory pressure).10391*/10392if (l2arc_hdr_limit_reached()) {10393ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem);10394cmn_err(CE_NOTE, "System running low on memory, "10395"aborting L2ARC rebuild.");10396err = SET_ERROR(ENOMEM);10397goto out;10398}1039910400spa_config_exit(spa, SCL_L2ARC, vd);10401lock_held = B_FALSE;1040210403/*10404* Now that we know that the next_lb checks out alright, we10405* can start reconstruction from this log block.10406* L2BLK_GET_PSIZE returns aligned size for log blocks.10407*/10408uint64_t asize = L2BLK_GET_PSIZE((&lbps[0])->lbp_prop);10409l2arc_log_blk_restore(dev, this_lb, asize);1041010411/*10412* log block restored, include its pointer in the list of10413* pointers to log blocks present in the L2ARC device.10414*/10415lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP);10416lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t),10417KM_SLEEP);10418memcpy(lb_ptr_buf->lb_ptr, &lbps[0],10419sizeof (l2arc_log_blkptr_t));10420mutex_enter(&dev->l2ad_mtx);10421list_insert_tail(&dev->l2ad_lbptr_list, lb_ptr_buf);10422ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize);10423ARCSTAT_BUMP(arcstat_l2_log_blk_count);10424zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf);10425zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf);10426mutex_exit(&dev->l2ad_mtx);10427vdev_space_update(vd, asize, 0, 0);1042810429/*10430* Protection against loops of log blocks:10431*10432* l2ad_hand l2ad_evict10433* V V10434* l2ad_start |=======================================| l2ad_end10435* -----|||----|||---|||----|||10436* (3) (2) (1) (0)10437* ---|||---|||----|||---|||10438* (7) (6) (5) (4)10439*10440* In this situation the pointer of log block (4) passes10441* l2arc_log_blkptr_valid() but the log block should not be10442* restored as it is overwritten by the payload of log block10443* (0). Only log blocks (0)-(3) should be restored. We check10444* whether l2ad_evict lies in between the payload starting10445* offset of the next log block (lbps[1].lbp_payload_start)10446* and the payload starting offset of the present log block10447* (lbps[0].lbp_payload_start). If true and this isn't the10448* first pass, we are looping from the beginning and we should10449* stop.10450*/10451if (l2arc_range_check_overlap(lbps[1].lbp_payload_start,10452lbps[0].lbp_payload_start, dev->l2ad_evict) &&10453!dev->l2ad_first)10454goto out;1045510456kpreempt(KPREEMPT_SYNC);10457for (;;) {10458mutex_enter(&l2arc_rebuild_thr_lock);10459if (dev->l2ad_rebuild_cancel) {10460mutex_exit(&l2arc_rebuild_thr_lock);10461err = SET_ERROR(ECANCELED);10462goto out;10463}10464mutex_exit(&l2arc_rebuild_thr_lock);10465if (spa_config_tryenter(spa, SCL_L2ARC, vd,10466RW_READER)) {10467lock_held = B_TRUE;10468break;10469}10470/*10471* L2ARC config lock held by somebody in writer,10472* possibly due to them trying to remove us. 
They'll10473* likely want us to shut down, so after a little10474* delay, we check l2ad_rebuild_cancel and retry10475* the lock again.10476*/10477delay(1);10478}1047910480/*10481* Continue with the next log block.10482*/10483lbps[0] = lbps[1];10484lbps[1] = this_lb->lb_prev_lbp;10485PTR_SWAP(this_lb, next_lb);10486this_io = next_io;10487next_io = NULL;10488}1048910490if (this_io != NULL)10491l2arc_log_blk_fetch_abort(this_io);10492out:10493if (next_io != NULL)10494l2arc_log_blk_fetch_abort(next_io);10495vmem_free(this_lb, sizeof (*this_lb));10496vmem_free(next_lb, sizeof (*next_lb));1049710498if (err == ECANCELED) {10499/*10500* In case the rebuild was canceled do not log to spa history10501* log as the pool may be in the process of being removed.10502*/10503zfs_dbgmsg("L2ARC rebuild aborted, restored %llu blocks",10504(u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));10505return (err);10506} else if (!l2arc_rebuild_enabled) {10507spa_history_log_internal(spa, "L2ARC rebuild", NULL,10508"disabled");10509} else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) > 0) {10510ARCSTAT_BUMP(arcstat_l2_rebuild_success);10511spa_history_log_internal(spa, "L2ARC rebuild", NULL,10512"successful, restored %llu blocks",10513(u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));10514} else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) == 0) {10515/*10516* No error but also nothing restored, meaning the lbps array10517* in the device header points to invalid/non-present log10518* blocks. Reset the header.10519*/10520spa_history_log_internal(spa, "L2ARC rebuild", NULL,10521"no valid log blocks");10522memset(l2dhdr, 0, dev->l2ad_dev_hdr_asize);10523l2arc_dev_hdr_update(dev);10524} else if (err != 0) {10525spa_history_log_internal(spa, "L2ARC rebuild", NULL,10526"aborted, restored %llu blocks",10527(u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));10528}1052910530if (lock_held)10531spa_config_exit(spa, SCL_L2ARC, vd);1053210533return (err);10534}1053510536/*10537* Attempts to read the device header on the provided L2ARC device and writes10538* it to `hdr'. 
On success, this function returns 0, otherwise the appropriate10539* error code is returned.10540*/10541static int10542l2arc_dev_hdr_read(l2arc_dev_t *dev)10543{10544int err;10545uint64_t guid;10546l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;10547const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;10548abd_t *abd;1054910550guid = spa_guid(dev->l2ad_vdev->vdev_spa);1055110552abd = abd_get_from_buf(l2dhdr, l2dhdr_asize);1055310554err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev,10555VDEV_LABEL_START_SIZE, l2dhdr_asize, abd,10556ZIO_CHECKSUM_LABEL, NULL, NULL, ZIO_PRIORITY_SYNC_READ,10557ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |10558ZIO_FLAG_SPECULATIVE, B_FALSE));1055910560abd_free(abd);1056110562if (err != 0) {10563ARCSTAT_BUMP(arcstat_l2_rebuild_abort_dh_errors);10564zfs_dbgmsg("L2ARC IO error (%d) while reading device header, "10565"vdev guid: %llu", err,10566(u_longlong_t)dev->l2ad_vdev->vdev_guid);10567return (err);10568}1056910570if (l2dhdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC))10571byteswap_uint64_array(l2dhdr, sizeof (*l2dhdr));1057210573if (l2dhdr->dh_magic != L2ARC_DEV_HDR_MAGIC ||10574l2dhdr->dh_spa_guid != guid ||10575l2dhdr->dh_vdev_guid != dev->l2ad_vdev->vdev_guid ||10576l2dhdr->dh_version != L2ARC_PERSISTENT_VERSION ||10577l2dhdr->dh_log_entries != dev->l2ad_log_entries ||10578l2dhdr->dh_end != dev->l2ad_end ||10579!l2arc_range_check_overlap(dev->l2ad_start, dev->l2ad_end,10580l2dhdr->dh_evict) ||10581(l2dhdr->dh_trim_state != VDEV_TRIM_COMPLETE &&10582l2arc_trim_ahead > 0)) {10583/*10584* Attempt to rebuild a device containing no actual dev hdr10585* or containing a header from some other pool or from another10586* version of persistent L2ARC.10587*/10588ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported);10589return (SET_ERROR(ENOTSUP));10590}1059110592return (0);10593}1059410595/*10596* Reads L2ARC log blocks from storage and validates their contents.10597*10598* This function implements a simple fetcher to make sure that while10599* we're processing one buffer the L2ARC is already fetching the next10600* one in the chain.10601*10602* The arguments this_lp and next_lp point to the current and next log block10603* address in the block chain. Similarly, this_lb and next_lb hold the10604* l2arc_log_blk_phys_t's of the current and next L2ARC blk.10605*10606* The `this_io' and `next_io' arguments are used for block fetching.10607* When issuing the first blk IO during rebuild, you should pass NULL for10608* `this_io'. This function will then issue a sync IO to read the block and10609* also issue an async IO to fetch the next block in the block chain. The10610* fetched IO is returned in `next_io'. On subsequent calls to this10611* function, pass the value returned in `next_io' from the previous call10612* as `this_io' and a fresh `next_io' pointer to hold the next fetch IO.10613* Prior to the call, you should initialize your `next_io' pointer to be10614* NULL. If no fetch IO was issued, the pointer is left set at NULL.10615*10616* On success, this function returns 0, otherwise it returns an appropriate10617* error code. On error the fetching IO is aborted and cleared before10618* returning from this function. 
/*
 * Reads L2ARC log blocks from storage and validates their contents.
 *
 * This function implements a simple fetcher to make sure that while
 * we're processing one buffer the L2ARC is already fetching the next
 * one in the chain.
 *
 * The arguments this_lbp and next_lbp point to the current and next log block
 * address in the block chain. Similarly, this_lb and next_lb hold the
 * l2arc_log_blk_phys_t's of the current and next L2ARC block.
 *
 * The `this_io' and `next_io' arguments are used for block fetching.
 * When issuing the first block IO during rebuild, you should pass NULL for
 * `this_io'. This function will then issue a sync IO to read the block and
 * also issue an async IO to fetch the next block in the block chain. The
 * fetched IO is returned in `next_io'. On subsequent calls to this
 * function, pass the value returned in `next_io' from the previous call
 * as `this_io' and a fresh `next_io' pointer to hold the next fetch IO.
 * Prior to the call, you should initialize your `next_io' pointer to be
 * NULL. If no fetch IO was issued, the pointer is left set at NULL.
 *
 * On success, this function returns 0, otherwise it returns an appropriate
 * error code. On error the fetching IO is aborted and cleared before
 * returning from this function. Therefore, if we return `success', the
 * caller can assume that we have taken care of cleanup of fetch IOs.
 */
static int
l2arc_log_blk_read(l2arc_dev_t *dev,
    const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp,
    l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
    zio_t *this_io, zio_t **next_io)
{
	int err = 0;
	zio_cksum_t cksum;
	uint64_t asize;

	ASSERT(this_lbp != NULL && next_lbp != NULL);
	ASSERT(this_lb != NULL && next_lb != NULL);
	ASSERT(next_io != NULL && *next_io == NULL);
	ASSERT(l2arc_log_blkptr_valid(dev, this_lbp));

	/*
	 * Check to see if we have issued the IO for this log block in a
	 * previous run. If not, this is the first call, so issue it now.
	 */
	if (this_io == NULL) {
		this_io = l2arc_log_blk_fetch(dev->l2ad_vdev, this_lbp,
		    this_lb);
	}

	/*
	 * Peek to see if we can start issuing the next IO immediately.
	 */
	if (l2arc_log_blkptr_valid(dev, next_lbp)) {
		/*
		 * Start issuing IO for the next log block early - this
		 * should help keep the L2ARC device busy while we
		 * decompress and restore this log block.
		 */
		*next_io = l2arc_log_blk_fetch(dev->l2ad_vdev, next_lbp,
		    next_lb);
	}

	/* Wait for the IO to read this log block to complete */
	if ((err = zio_wait(this_io)) != 0) {
		ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
		zfs_dbgmsg("L2ARC IO error (%d) while reading log block, "
		    "offset: %llu, vdev guid: %llu", err,
		    (u_longlong_t)this_lbp->lbp_daddr,
		    (u_longlong_t)dev->l2ad_vdev->vdev_guid);
		goto cleanup;
	}

	/*
	 * Make sure the buffer checks out.
	 * L2BLK_GET_PSIZE returns aligned size for log blocks.
	 */
	asize = L2BLK_GET_PSIZE((this_lbp)->lbp_prop);
	fletcher_4_native(this_lb, asize, NULL, &cksum);
	if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) {
		ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_lb_errors);
		zfs_dbgmsg("L2ARC log block cksum failed, offset: %llu, "
		    "vdev guid: %llu, l2ad_hand: %llu, l2ad_evict: %llu",
		    (u_longlong_t)this_lbp->lbp_daddr,
		    (u_longlong_t)dev->l2ad_vdev->vdev_guid,
		    (u_longlong_t)dev->l2ad_hand,
		    (u_longlong_t)dev->l2ad_evict);
		err = SET_ERROR(ECKSUM);
		goto cleanup;
	}

	/* Now we can take our time decoding this buffer */
	switch (L2BLK_GET_COMPRESS((this_lbp)->lbp_prop)) {
	case ZIO_COMPRESS_OFF:
		break;
	case ZIO_COMPRESS_LZ4: {
		abd_t *abd = abd_alloc_linear(asize, B_TRUE);
		abd_copy_from_buf_off(abd, this_lb, 0, asize);
		abd_t dabd;
		abd_get_from_buf_struct(&dabd, this_lb, sizeof (*this_lb));
		err = zio_decompress_data(
		    L2BLK_GET_COMPRESS((this_lbp)->lbp_prop),
		    abd, &dabd, asize, sizeof (*this_lb), NULL);
		abd_free(&dabd);
		abd_free(abd);
		if (err != 0) {
			err = SET_ERROR(EINVAL);
			goto cleanup;
		}
		break;
	}
	default:
		err = SET_ERROR(EINVAL);
		goto cleanup;
	}
	if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC))
		byteswap_uint64_array(this_lb, sizeof (*this_lb));
	if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) {
		err = SET_ERROR(EINVAL);
		goto cleanup;
	}
cleanup:
	/* Abort an in-flight fetch I/O in case of error */
	if (err != 0 && *next_io != NULL) {
		l2arc_log_blk_fetch_abort(*next_io);
		*next_io = NULL;
	}
	return (err);
}
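
/*
 * Sketch of the calling convention described above (pseudo-code, not part
 * of the build; `chain_not_done' stands in for the validity checks that
 * l2arc_rebuild() performs on lbps[0] before each iteration):
 *
 *	zio_t *this_io = NULL, *next_io = NULL;
 *
 *	while (chain_not_done) {
 *		err = l2arc_log_blk_read(dev, &lbps[0], &lbps[1],
 *		    this_lb, next_lb, this_io, &next_io);
 *		if (err != 0)
 *			break;		(*next_io was already aborted)
 *		... restore this_lb, advance lbps[] via lb_prev_lbp ...
 *		this_io = next_io;
 *		next_io = NULL;
 *	}
 *
 * On a clean exit any still-pending fetch is handed to
 * l2arc_log_blk_fetch_abort(), exactly as l2arc_rebuild() does above.
 */
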
/*
 * Restores the payload of a log block to ARC. This creates empty ARC hdr
 * entries which only contain an l2arc hdr, essentially restoring the
 * buffers to their L2ARC evicted state. This function also updates space
 * usage on the L2ARC vdev to make sure it tracks restored buffers.
 */
static void
l2arc_log_blk_restore(l2arc_dev_t *dev, const l2arc_log_blk_phys_t *lb,
    uint64_t lb_asize)
{
	uint64_t size = 0, asize = 0;
	uint64_t log_entries = dev->l2ad_log_entries;

	/*
	 * Usually arc_adapt() is called only for data, not headers, but
	 * since we may allocate a significant amount of memory here, let ARC
	 * grow its arc_c.
	 */
	arc_adapt(log_entries * HDR_L2ONLY_SIZE);

	for (int i = log_entries - 1; i >= 0; i--) {
		/*
		 * Restore goes in the reverse temporal direction to preserve
		 * correct temporal ordering of buffers in the l2ad_buflist.
		 * l2arc_hdr_restore also does a list_insert_tail instead of
		 * list_insert_head on the l2ad_buflist:
		 *
		 *              LIST     l2ad_buflist           LIST
		 *              HEAD <------ (time) ------      TAIL
		 * direction    +-----+-----+-----+-----+-----+     direction
		 * of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild
		 * fill         +-----+-----+-----+-----+-----+
		 *              ^                             ^
		 *              |                             |
		 *              |                             |
		 *      l2arc_feed_thread               l2arc_rebuild
		 *      will place new bufs here        restores bufs here
		 *
		 * During l2arc_rebuild() the device is not used by
		 * l2arc_feed_thread() as dev->l2ad_rebuild is set to true.
		 */
		size += L2BLK_GET_LSIZE((&lb->lb_entries[i])->le_prop);
		asize += vdev_psize_to_asize(dev->l2ad_vdev,
		    L2BLK_GET_PSIZE((&lb->lb_entries[i])->le_prop));
		l2arc_hdr_restore(&lb->lb_entries[i], dev);
	}

	/*
	 * Record rebuild stats:
	 *	size	Logical size of restored buffers in the L2ARC
	 *	asize	Aligned size of restored buffers in the L2ARC
	 */
	ARCSTAT_INCR(arcstat_l2_rebuild_size, size);
	ARCSTAT_INCR(arcstat_l2_rebuild_asize, asize);
	ARCSTAT_INCR(arcstat_l2_rebuild_bufs, log_entries);
	ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, lb_asize);
	ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, asize / lb_asize);
	ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks);
}

/*
 * Restores a single ARC buf hdr from a log entry.
 * The ARC buffer is put into a state indicating that it has been evicted to
 * L2ARC.
 */
static void
l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev)
{
	arc_buf_hdr_t *hdr, *exists;
	kmutex_t *hash_lock;
	arc_buf_contents_t type = L2BLK_GET_TYPE((le)->le_prop);
	uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,
	    L2BLK_GET_PSIZE((le)->le_prop));

	/*
	 * Do all the allocation before grabbing any locks, this lets us
	 * sleep if memory is full and we don't have to deal with failed
	 * allocations.
	 */
	hdr = arc_buf_alloc_l2only(L2BLK_GET_LSIZE((le)->le_prop), type,
	    dev, le->le_dva, le->le_daddr,
	    L2BLK_GET_PSIZE((le)->le_prop), asize, le->le_birth,
	    L2BLK_GET_COMPRESS((le)->le_prop), le->le_complevel,
	    L2BLK_GET_PROTECTED((le)->le_prop),
	    L2BLK_GET_PREFETCH((le)->le_prop),
	    L2BLK_GET_STATE((le)->le_prop));

	/*
	 * vdev_space_update() has to be called before arc_hdr_destroy() to
	 * avoid underflow since the latter also calls vdev_space_update().
	 */
	l2arc_hdr_arcstats_increment(hdr);
	vdev_space_update(dev->l2ad_vdev, asize, 0, 0);

	mutex_enter(&dev->l2ad_mtx);
	list_insert_tail(&dev->l2ad_buflist, hdr);
	(void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
	mutex_exit(&dev->l2ad_mtx);

	exists = buf_hash_insert(hdr, &hash_lock);
	if (exists) {
		/* Buffer was already cached, no need to restore it. */
		arc_hdr_destroy(hdr);
		/*
		 * If the buffer is already cached, check whether it has
		 * L2ARC metadata. If not, fill in the L2ARC metadata and
		 * update the flag. This is important in case of onlining a
		 * cache device, since we previously evicted all L2ARC
		 * metadata from ARC.
		 */
		if (!HDR_HAS_L2HDR(exists)) {
			arc_hdr_set_flags(exists, ARC_FLAG_HAS_L2HDR);
			exists->b_l2hdr.b_dev = dev;
			exists->b_l2hdr.b_daddr = le->le_daddr;
			exists->b_l2hdr.b_arcs_state =
			    L2BLK_GET_STATE((le)->le_prop);
			/* l2arc_hdr_arcstats_update() expects a valid asize */
			HDR_SET_L2SIZE(exists, asize);
			mutex_enter(&dev->l2ad_mtx);
			list_insert_tail(&dev->l2ad_buflist, exists);
			(void) zfs_refcount_add_many(&dev->l2ad_alloc,
			    arc_hdr_size(exists), exists);
			mutex_exit(&dev->l2ad_mtx);
			l2arc_hdr_arcstats_increment(exists);
			vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
		}
		ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached);
	}

	mutex_exit(hash_lock);
}

/*
 * Starts an asynchronous read IO to read a log block. This is used in log
 * block reconstruction to start reading the next block before we are done
 * decoding and reconstructing the current block, to keep the l2arc device
 * nice and hot with read IO to process.
 * The returned zio will contain newly allocated memory buffers for the IO
 * data which should then be freed by the caller once the zio is no longer
 * needed (i.e. due to it having completed).
 * If you wish to abort this zio, you should do so using
 * l2arc_log_blk_fetch_abort, which takes care of disposing of the allocated
 * buffers correctly.
 */
static zio_t *
l2arc_log_blk_fetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp,
    l2arc_log_blk_phys_t *lb)
{
	uint32_t asize;
	zio_t *pio;
	l2arc_read_callback_t *cb;

	/* L2BLK_GET_PSIZE returns aligned size for log blocks */
	asize = L2BLK_GET_PSIZE((lbp)->lbp_prop);
	ASSERT(asize <= sizeof (l2arc_log_blk_phys_t));

	cb = kmem_zalloc(sizeof (l2arc_read_callback_t), KM_SLEEP);
	cb->l2rcb_abd = abd_get_from_buf(lb, asize);
	pio = zio_root(vd->vdev_spa, l2arc_blk_fetch_done, cb,
	    ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY);
	(void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, asize,
	    cb->l2rcb_abd, ZIO_CHECKSUM_OFF, NULL, NULL,
	    ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL |
	    ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));

	return (pio);
}

/*
 * Aborts a zio returned from l2arc_log_blk_fetch and frees the data
 * buffers allocated for it.
 */
static void
l2arc_log_blk_fetch_abort(zio_t *zio)
{
	(void) zio_wait(zio);
}

/*
 * Creates a zio to update the device header on an l2arc device.
 */
void
l2arc_dev_hdr_update(l2arc_dev_t *dev)
{
	l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
	const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;
	abd_t *abd;
	int err;

	VERIFY(spa_config_held(dev->l2ad_spa, SCL_STATE_ALL, RW_READER));

	l2dhdr->dh_magic = L2ARC_DEV_HDR_MAGIC;
	l2dhdr->dh_version = L2ARC_PERSISTENT_VERSION;
	l2dhdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa);
	l2dhdr->dh_vdev_guid = dev->l2ad_vdev->vdev_guid;
	l2dhdr->dh_log_entries = dev->l2ad_log_entries;
	l2dhdr->dh_evict = dev->l2ad_evict;
	l2dhdr->dh_start = dev->l2ad_start;
	l2dhdr->dh_end = dev->l2ad_end;
	l2dhdr->dh_lb_asize = zfs_refcount_count(&dev->l2ad_lb_asize);
	l2dhdr->dh_lb_count = zfs_refcount_count(&dev->l2ad_lb_count);
	l2dhdr->dh_flags = 0;
	l2dhdr->dh_trim_action_time = dev->l2ad_vdev->vdev_trim_action_time;
	l2dhdr->dh_trim_state = dev->l2ad_vdev->vdev_trim_state;
	if (dev->l2ad_first)
		l2dhdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST;

	abd = abd_get_from_buf(l2dhdr, l2dhdr_asize);

	err = zio_wait(zio_write_phys(NULL, dev->l2ad_vdev,
	    VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, ZIO_CHECKSUM_LABEL, NULL,
	    NULL, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE));

	abd_free(abd);

	if (err != 0) {
		zfs_dbgmsg("L2ARC IO error (%d) while writing device header, "
		    "vdev guid: %llu", err,
		    (u_longlong_t)dev->l2ad_vdev->vdev_guid);
	}
}
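
/*
 * Rough on-disk picture of what the header written above covers (sketch
 * only, not to scale; assumes the usual layout where the L2ARC data area
 * begins right after the device header):
 *
 *	0          VDEV_LABEL_START_SIZE   l2ad_start              l2ad_end
 *	+----------+----------------------+------------------------+
 *	| labels   | l2arc device header  | log blocks and payload |
 *	+----------+----------------------+------------------------+
 *
 * l2arc_dev_hdr_read() reads the same offset with the same
 * ZIO_CHECKSUM_LABEL checksum, which is what lets an import validate the
 * header before trusting dh_start_lbps[].
 */
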
/*
 * Commits a log block to the L2ARC device. This routine is invoked from
 * l2arc_write_buffers when the log block fills up.
 * This function allocates some memory to temporarily hold the serialized
 * buffer to be written. This is then released in l2arc_write_done.
 */
static uint64_t
l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio, l2arc_write_callback_t *cb)
{
	l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk;
	l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
	uint64_t psize, asize;
	zio_t *wzio;
	l2arc_lb_abd_buf_t *abd_buf;
	abd_t *abd = NULL;
	l2arc_lb_ptr_buf_t *lb_ptr_buf;

	VERIFY3S(dev->l2ad_log_ent_idx, ==, dev->l2ad_log_entries);

	abd_buf = zio_buf_alloc(sizeof (*abd_buf));
	abd_buf->abd = abd_get_from_buf(lb, sizeof (*lb));
	lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP);
	lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t), KM_SLEEP);

	/* link the buffer into the block chain */
	lb->lb_prev_lbp = l2dhdr->dh_start_lbps[1];
	lb->lb_magic = L2ARC_LOG_BLK_MAGIC;

	/*
	 * l2arc_log_blk_commit() may be called multiple times during a single
	 * l2arc_write_buffers() call. Save the allocated abd buffers in a list
	 * so we can free them in l2arc_write_done() later on.
	 */
	list_insert_tail(&cb->l2wcb_abd_list, abd_buf);

	/* try to compress the buffer; we must save at least one sector */
	psize = zio_compress_data(ZIO_COMPRESS_LZ4,
	    abd_buf->abd, &abd, sizeof (*lb),
	    zio_get_compression_max_size(ZIO_COMPRESS_LZ4,
	    dev->l2ad_vdev->vdev_ashift,
	    dev->l2ad_vdev->vdev_ashift, sizeof (*lb)), 0);

	/* a log block is never entirely zero */
	ASSERT(psize != 0);
	asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
	ASSERT(asize <= sizeof (*lb));

	/*
	 * Update the start log block pointer in the device header to point
	 * to the log block we're about to write.
	 */
	l2dhdr->dh_start_lbps[1] = l2dhdr->dh_start_lbps[0];
	l2dhdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand;
	l2dhdr->dh_start_lbps[0].lbp_payload_asize =
	    dev->l2ad_log_blk_payload_asize;
	l2dhdr->dh_start_lbps[0].lbp_payload_start =
	    dev->l2ad_log_blk_payload_start;
	L2BLK_SET_LSIZE(
	    (&l2dhdr->dh_start_lbps[0])->lbp_prop, sizeof (*lb));
	L2BLK_SET_PSIZE(
	    (&l2dhdr->dh_start_lbps[0])->lbp_prop, asize);
	L2BLK_SET_CHECKSUM(
	    (&l2dhdr->dh_start_lbps[0])->lbp_prop,
	    ZIO_CHECKSUM_FLETCHER_4);
	if (asize < sizeof (*lb)) {
		/* compression succeeded */
		abd_zero_off(abd, psize, asize - psize);
		L2BLK_SET_COMPRESS(
		    (&l2dhdr->dh_start_lbps[0])->lbp_prop,
		    ZIO_COMPRESS_LZ4);
	} else {
		/* compression failed */
		abd_copy_from_buf_off(abd, lb, 0, sizeof (*lb));
		L2BLK_SET_COMPRESS(
		    (&l2dhdr->dh_start_lbps[0])->lbp_prop,
		    ZIO_COMPRESS_OFF);
	}

	/* checksum what we're about to write */
	abd_fletcher_4_native(abd, asize, NULL,
	    &l2dhdr->dh_start_lbps[0].lbp_cksum);

	abd_free(abd_buf->abd);

	/* perform the write itself */
	abd_buf->abd = abd;
	wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand,
	    asize, abd_buf->abd, ZIO_CHECKSUM_OFF, NULL, NULL,
	    ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
	DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
	(void) zio_nowait(wzio);

	dev->l2ad_hand += asize;
	vdev_space_update(dev->l2ad_vdev, asize, 0, 0);

	/*
	 * Include the committed log block's pointer in the list of pointers
	 * to log blocks present in the L2ARC device.
	 */
	memcpy(lb_ptr_buf->lb_ptr, &l2dhdr->dh_start_lbps[0],
	    sizeof (l2arc_log_blkptr_t));
	mutex_enter(&dev->l2ad_mtx);
	list_insert_head(&dev->l2ad_lbptr_list, lb_ptr_buf);
	ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize);
	ARCSTAT_BUMP(arcstat_l2_log_blk_count);
	zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf);
	zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf);
	mutex_exit(&dev->l2ad_mtx);

	/* bump the kstats */
	ARCSTAT_INCR(arcstat_l2_write_bytes, asize);
	ARCSTAT_BUMP(arcstat_l2_log_blk_writes);
	ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, asize);
	ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio,
	    dev->l2ad_log_blk_payload_asize / asize);

	/* start a new log block */
	dev->l2ad_log_ent_idx = 0;
	dev->l2ad_log_blk_payload_asize = 0;
	dev->l2ad_log_blk_payload_start = 0;

	return (asize);
}

/*
 * Validates an L2ARC log block address to make sure that it can be read
 * from the provided L2ARC device.
 */
boolean_t
l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp)
{
	/* L2BLK_GET_PSIZE returns aligned size for log blocks */
	uint64_t asize = L2BLK_GET_PSIZE((lbp)->lbp_prop);
	uint64_t end = lbp->lbp_daddr + asize - 1;
	uint64_t start = lbp->lbp_payload_start;
	boolean_t evicted = B_FALSE;

	/*
	 * A log block is valid if all of the following conditions are true:
	 * - it fits entirely (including its payload) between l2ad_start and
	 *   l2ad_end
	 * - it has a valid size
	 * - neither the log block itself nor part of its payload was evicted
	 *   by l2arc_evict():
	 *
	 *      l2ad_hand            l2ad_evict
	 *      |                    |          lbp_daddr
	 *      |     start          |          |   end
	 *      |     |              |          |   |
	 *      V     V              V          V   V
	 *   l2ad_start ============================================ l2ad_end
	 *              --------------------------||||
	 *              ^                          ^
	 *              |                          log block
	 *              payload
	 */

	evicted =
	    l2arc_range_check_overlap(start, end, dev->l2ad_hand) ||
	    l2arc_range_check_overlap(start, end, dev->l2ad_evict) ||
	    l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, start) ||
	    l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, end);

	return (start >= dev->l2ad_start && end <= dev->l2ad_end &&
	    asize > 0 && asize <= sizeof (l2arc_log_blk_phys_t) &&
	    (!evicted || dev->l2ad_first));
}
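
/*
 * A concrete reading of the check above (numbers are illustrative only):
 * with l2ad_start = 4M, l2ad_end = 1G, l2ad_hand = 600M and
 * l2ad_evict = 700M, a log block whose payload_start..end range sits
 * around 650M is rejected, because it overlaps the [l2ad_hand, l2ad_evict]
 * window that has already been reclaimed for upcoming writes, while a
 * block whose range sits entirely around 300M or 800M still validates.
 * When dev->l2ad_first is set nothing has been overwritten yet, so only
 * the bounds and size checks apply.
 */
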
/*
 * Inserts ARC buffer header `hdr' into the current L2ARC log block on
 * the device. The buffer being inserted must be present in L2ARC.
 * Returns B_TRUE if the L2ARC log block is full and needs to be committed
 * to L2ARC, or B_FALSE if it still has room for more ARC buffers.
 */
static boolean_t
l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *hdr)
{
	l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk;
	l2arc_log_ent_phys_t *le;

	if (dev->l2ad_log_entries == 0)
		return (B_FALSE);

	int index = dev->l2ad_log_ent_idx++;

	ASSERT3S(index, <, dev->l2ad_log_entries);
	ASSERT(HDR_HAS_L2HDR(hdr));

	le = &lb->lb_entries[index];
	memset(le, 0, sizeof (*le));
	le->le_dva = hdr->b_dva;
	le->le_birth = hdr->b_birth;
	le->le_daddr = hdr->b_l2hdr.b_daddr;
	if (index == 0)
		dev->l2ad_log_blk_payload_start = le->le_daddr;
	L2BLK_SET_LSIZE((le)->le_prop, HDR_GET_LSIZE(hdr));
	L2BLK_SET_PSIZE((le)->le_prop, HDR_GET_PSIZE(hdr));
	L2BLK_SET_COMPRESS((le)->le_prop, HDR_GET_COMPRESS(hdr));
	le->le_complevel = hdr->b_complevel;
	L2BLK_SET_TYPE((le)->le_prop, hdr->b_type);
	L2BLK_SET_PROTECTED((le)->le_prop, !!(HDR_PROTECTED(hdr)));
	L2BLK_SET_PREFETCH((le)->le_prop, !!(HDR_PREFETCH(hdr)));
	L2BLK_SET_STATE((le)->le_prop, hdr->b_l2hdr.b_arcs_state);

	dev->l2ad_log_blk_payload_asize += vdev_psize_to_asize(dev->l2ad_vdev,
	    HDR_GET_PSIZE(hdr));

	return (dev->l2ad_log_ent_idx == dev->l2ad_log_entries);
}
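
/*
 * Sketch of the intended usage (illustrative only; the real caller is
 * l2arc_write_buffers(), with pio/cb being its parent write zio and write
 * callback):
 *
 *	if (l2arc_log_blk_insert(dev, hdr))
 *		(void) l2arc_log_blk_commit(dev, pio, cb);
 *
 * i.e. append one log entry per buffer written to the cache device and
 * commit the log block as soon as l2arc_log_blk_insert() reports it full.
 */
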
|11179* +---------------+---------<check>11180*11181* top == bottom : Just a single address comparison.11182*/11183boolean_t11184l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)11185{11186if (bottom < top)11187return (bottom <= check && check <= top);11188else if (bottom > top)11189return (check <= top || bottom <= check);11190else11191return (check == top);11192}1119311194EXPORT_SYMBOL(arc_buf_size);11195EXPORT_SYMBOL(arc_write);11196EXPORT_SYMBOL(arc_read);11197EXPORT_SYMBOL(arc_buf_info);11198EXPORT_SYMBOL(arc_getbuf_func);11199EXPORT_SYMBOL(arc_add_prune_callback);11200EXPORT_SYMBOL(arc_remove_prune_callback);1120111202ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min, param_set_arc_min,11203spl_param_get_u64, ZMOD_RW, "Minimum ARC size in bytes");1120411205ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, max, param_set_arc_max,11206spl_param_get_u64, ZMOD_RW, "Maximum ARC size in bytes");1120711208ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, meta_balance, UINT, ZMOD_RW,11209"Balance between metadata and data on ghost hits.");1121011211ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, grow_retry, param_set_arc_int,11212param_get_uint, ZMOD_RW, "Seconds before growing ARC size");1121311214ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, shrink_shift, param_set_arc_int,11215param_get_uint, ZMOD_RW, "log2(fraction of ARC to reclaim)");1121611217#ifdef _KERNEL11218ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, pc_percent, UINT, ZMOD_RW,11219"Percent of pagecache to reclaim ARC to");11220#endif1122111222ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, average_blocksize, UINT, ZMOD_RD,11223"Target average block size");1122411225ZFS_MODULE_PARAM(zfs, zfs_, compressed_arc_enabled, INT, ZMOD_RW,11226"Disable compressed ARC buffers");1122711228ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min_prefetch_ms, param_set_arc_int,11229param_get_uint, ZMOD_RW, "Min life of prefetch block in ms");1123011231ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min_prescient_prefetch_ms,11232param_set_arc_int, param_get_uint, ZMOD_RW,11233"Min life of prescient prefetched block in ms");1123411235ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, write_max, U64, ZMOD_RW,11236"Max write bytes per interval");1123711238ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, write_boost, U64, ZMOD_RW,11239"Extra write bytes during device warmup");1124011241ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, headroom, U64, ZMOD_RW,11242"Number of max device writes to precache");1124311244ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, headroom_boost, U64, ZMOD_RW,11245"Compressed l2arc_headroom multiplier");1124611247ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, trim_ahead, U64, ZMOD_RW,11248"TRIM ahead L2ARC write size multiplier");1124911250ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_secs, U64, ZMOD_RW,11251"Seconds between L2ARC writing");1125211253ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_min_ms, U64, ZMOD_RW,11254"Min feed interval in milliseconds");1125511256ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, noprefetch, INT, ZMOD_RW,11257"Skip caching prefetched buffers");1125811259ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_again, INT, ZMOD_RW,11260"Turbo L2ARC warmup");1126111262ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, norw, INT, ZMOD_RW,11263"No reads during writes");1126411265ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, meta_percent, UINT, ZMOD_RW,11266"Percent of ARC size allowed for L2ARC-only headers");1126711268ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, rebuild_enabled, INT, ZMOD_RW,11269"Rebuild the L2ARC when importing a pool");1127011271ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, rebuild_blocks_min_l2size, U64, ZMOD_RW,11272"Min size in bytes to write rebuild log blocks in 
L2ARC");1127311274ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, mfuonly, INT, ZMOD_RW,11275"Cache only MFU data from ARC into L2ARC");1127611277ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, exclude_special, INT, ZMOD_RW,11278"Exclude dbufs on special vdevs from being cached to L2ARC if set.");1127911280ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, lotsfree_percent, param_set_arc_int,11281param_get_uint, ZMOD_RW, "System free memory I/O throttle in bytes");1128211283ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, sys_free, param_set_arc_u64,11284spl_param_get_u64, ZMOD_RW, "System free memory target size in bytes");1128511286ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit, param_set_arc_u64,11287spl_param_get_u64, ZMOD_RW, "Minimum bytes of dnodes in ARC");1128811289ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit_percent,11290param_set_arc_int, param_get_uint, ZMOD_RW,11291"Percent of ARC meta buffers for dnodes");1129211293ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, dnode_reduce_percent, UINT, ZMOD_RW,11294"Percentage of excess dnodes to try to unpin");1129511296ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, eviction_pct, UINT, ZMOD_RW,11297"When full, ARC allocation waits for eviction of this % of alloc size");1129811299ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, evict_batch_limit, UINT, ZMOD_RW,11300"The number of headers to evict per sublist before moving to the next");1130111302ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, evict_batches_limit, UINT, ZMOD_RW,11303"The number of batches to run per parallel eviction task");1130411305ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, prune_task_threads, INT, ZMOD_RW,11306"Number of arc_prune threads");1130711308ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, evict_threads, UINT, ZMOD_RD,11309"Number of threads to use for ARC eviction.");113101131111312