Path: blob/main/sys/contrib/openzfs/module/zfs/arc.c
48383 views
// SPDX-License-Identifier: CDDL-1.01/*2* CDDL HEADER START3*4* The contents of this file are subject to the terms of the5* Common Development and Distribution License (the "License").6* You may not use this file except in compliance with the License.7*8* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE9* or https://opensource.org/licenses/CDDL-1.0.10* See the License for the specific language governing permissions11* and limitations under the License.12*13* When distributing Covered Code, include this CDDL HEADER in each14* file and include the License file at usr/src/OPENSOLARIS.LICENSE.15* If applicable, add the following below this CDDL HEADER, with the16* fields enclosed by brackets "[]" replaced with your own identifying17* information: Portions Copyright [yyyy] [name of copyright owner]18*19* CDDL HEADER END20*/21/*22* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.23* Copyright (c) 2018, Joyent, Inc.24* Copyright (c) 2011, 2020, Delphix. All rights reserved.25* Copyright (c) 2014, Saso Kiselkov. All rights reserved.26* Copyright (c) 2017, Nexenta Systems, Inc. All rights reserved.27* Copyright (c) 2019, loli10K <[email protected]>. All rights reserved.28* Copyright (c) 2020, George Amanakis. All rights reserved.29* Copyright (c) 2019, 2024, 2025, Klara, Inc.30* Copyright (c) 2019, Allan Jude31* Copyright (c) 2020, The FreeBSD Foundation [1]32* Copyright (c) 2021, 2024 by George Melikov. All rights reserved.33*34* [1] Portions of this software were developed by Allan Jude35* under sponsorship from the FreeBSD Foundation.36*/3738/*39* DVA-based Adjustable Replacement Cache40*41* While much of the theory of operation used here is42* based on the self-tuning, low overhead replacement cache43* presented by Megiddo and Modha at FAST 2003, there are some44* significant differences:45*46* 1. The Megiddo and Modha model assumes any page is evictable.47* Pages in its cache cannot be "locked" into memory. This makes48* the eviction algorithm simple: evict the last page in the list.49* This also make the performance characteristics easy to reason50* about. Our cache is not so simple. At any given moment, some51* subset of the blocks in the cache are un-evictable because we52* have handed out a reference to them. Blocks are only evictable53* when there are no external references active. This makes54* eviction far more problematic: we choose to evict the evictable55* blocks that are the "lowest" in the list.56*57* There are times when it is not possible to evict the requested58* space. In these circumstances we are unable to adjust the cache59* size. To prevent the cache growing unbounded at these times we60* implement a "cache throttle" that slows the flow of new data61* into the cache until we can make space available.62*63* 2. The Megiddo and Modha model assumes a fixed cache size.64* Pages are evicted when the cache is full and there is a cache65* miss. Our model has a variable sized cache. It grows with66* high use, but also tries to react to memory pressure from the67* operating system: decreasing its size when system memory is68* tight.69*70* 3. The Megiddo and Modha model assumes a fixed page size. All71* elements of the cache are therefore exactly the same size. So72* when adjusting the cache size following a cache miss, its simply73* a matter of choosing a single page to evict. In our model, we74* have variable sized cache blocks (ranging from 512 bytes to75* 128K bytes). 
We therefore choose a set of blocks to evict to make76* space for a cache miss that approximates as closely as possible77* the space used by the new block.78*79* See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"80* by N. Megiddo & D. Modha, FAST 200381*/8283/*84* The locking model:85*86* A new reference to a cache buffer can be obtained in two87* ways: 1) via a hash table lookup using the DVA as a key,88* or 2) via one of the ARC lists. The arc_read() interface89* uses method 1, while the internal ARC algorithms for90* adjusting the cache use method 2. We therefore provide two91* types of locks: 1) the hash table lock array, and 2) the92* ARC list locks.93*94* Buffers do not have their own mutexes, rather they rely on the95* hash table mutexes for the bulk of their protection (i.e. most96* fields in the arc_buf_hdr_t are protected by these mutexes).97*98* buf_hash_find() returns the appropriate mutex (held) when it99* locates the requested buffer in the hash table. It returns100* NULL for the mutex if the buffer was not in the table.101*102* buf_hash_remove() expects the appropriate hash mutex to be103* already held before it is invoked.104*105* Each ARC state also has a mutex which is used to protect the106* buffer list associated with the state. When attempting to107* obtain a hash table lock while holding an ARC list lock you108* must use: mutex_tryenter() to avoid deadlock. Also note that109* the active state mutex must be held before the ghost state mutex.110*111* It as also possible to register a callback which is run when the112* metadata limit is reached and no buffers can be safely evicted. In113* this case the arc user should drop a reference on some arc buffers so114* they can be reclaimed. For example, when using the ZPL each dentry115* holds a references on a znode. These dentries must be pruned before116* the arc buffer holding the znode can be safely evicted.117*118* Note that the majority of the performance stats are manipulated119* with atomic operations.120*121* The L2ARC uses the l2ad_mtx on each vdev for the following:122*123* - L2ARC buflist creation124* - L2ARC buflist eviction125* - L2ARC write completion, which walks L2ARC buflists126* - ARC header destruction, as it removes from L2ARC buflists127* - ARC header release, as it removes from L2ARC buflists128*/129130/*131* ARC operation:132*133* Every block that is in the ARC is tracked by an arc_buf_hdr_t structure.134* This structure can point either to a block that is still in the cache or to135* one that is only accessible in an L2 ARC device, or it can provide136* information about a block that was recently evicted. If a block is137* only accessible in the L2ARC, then the arc_buf_hdr_t only has enough138* information to retrieve it from the L2ARC device. This information is139* stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t. A block140* that is in this state cannot access the data directly.141*142* Blocks that are actively being referenced or have not been evicted143* are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within144* the arc_buf_hdr_t that will point to the data block in memory. A block can145* only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC146* caches data in two ways -- in a list of ARC buffers (arc_buf_t) and147* also in the arc_buf_hdr_t's private physical data block pointer (b_pabd).148*149* The L1ARC's data pointer may or may not be uncompressed. 
The ARC has the150* ability to store the physical data (b_pabd) associated with the DVA of the151* arc_buf_hdr_t. Since the b_pabd is a copy of the on-disk physical block,152* it will match its on-disk compression characteristics. This behavior can be153* disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the154* compressed ARC functionality is disabled, the b_pabd will point to an155* uncompressed version of the on-disk data.156*157* Data in the L1ARC is not accessed by consumers of the ARC directly. Each158* arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it.159* Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC160* consumer. The ARC will provide references to this data and will keep it161* cached until it is no longer in use. The ARC caches only the L1ARC's physical162* data block and will evict any arc_buf_t that is no longer referenced. The163* amount of memory consumed by the arc_buf_ts' data buffers can be seen via the164* "overhead_size" kstat.165*166* Depending on the consumer, an arc_buf_t can be requested in uncompressed or167* compressed form. The typical case is that consumers will want uncompressed168* data, and when that happens a new data buffer is allocated where the data is169* decompressed for them to use. Currently the only consumer who wants170* compressed arc_buf_t's is "zfs send", when it streams data exactly as it171* exists on disk. When this happens, the arc_buf_t's data buffer is shared172* with the arc_buf_hdr_t.173*174* Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The175* first one is owned by a compressed send consumer (and therefore references176* the same compressed data buffer as the arc_buf_hdr_t) and the second could be177* used by any other consumer (and has its own uncompressed copy of the data178* buffer).179*180* arc_buf_hdr_t181* +-----------+182* | fields |183* | common to |184* | L1- and |185* | L2ARC |186* +-----------+187* | l2arc_buf_hdr_t188* | |189* +-----------+190* | l1arc_buf_hdr_t191* | | arc_buf_t192* | b_buf +------------>+-----------+ arc_buf_t193* | b_pabd +-+ |b_next +---->+-----------+194* +-----------+ | |-----------| |b_next +-->NULL195* | |b_comp = T | +-----------+196* | |b_data +-+ |b_comp = F |197* | +-----------+ | |b_data +-+198* +->+------+ | +-----------+ |199* compressed | | | |200* data | |<--------------+ | uncompressed201* +------+ compressed, | data202* shared +-->+------+203* data | |204* | |205* +------+206*207* When a consumer reads a block, the ARC must first look to see if the208* arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new209* arc_buf_t and either copies uncompressed data into a new data buffer from an210* existing uncompressed arc_buf_t, decompresses the hdr's b_pabd buffer into a211* new data buffer, or shares the hdr's b_pabd buffer, depending on whether the212* hdr is compressed and the desired compression characteristics of the213* arc_buf_t consumer. 
If the arc_buf_t ends up sharing data with the214* arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be215* the last buffer in the hdr's b_buf list, however a shared compressed buf can216* be anywhere in the hdr's list.217*218* The diagram below shows an example of an uncompressed ARC hdr that is219* sharing its data with an arc_buf_t (note that the shared uncompressed buf is220* the last element in the buf list):221*222* arc_buf_hdr_t223* +-----------+224* | |225* | |226* | |227* +-----------+228* l2arc_buf_hdr_t| |229* | |230* +-----------+231* l1arc_buf_hdr_t| |232* | | arc_buf_t (shared)233* | b_buf +------------>+---------+ arc_buf_t234* | | |b_next +---->+---------+235* | b_pabd +-+ |---------| |b_next +-->NULL236* +-----------+ | | | +---------+237* | |b_data +-+ | |238* | +---------+ | |b_data +-+239* +->+------+ | +---------+ |240* | | | |241* uncompressed | | | |242* data +------+ | |243* ^ +->+------+ |244* | uncompressed | | |245* | data | | |246* | +------+ |247* +---------------------------------+248*249* Writing to the ARC requires that the ARC first discard the hdr's b_pabd250* since the physical block is about to be rewritten. The new data contents251* will be contained in the arc_buf_t. As the I/O pipeline performs the write,252* it may compress the data before writing it to disk. The ARC will be called253* with the transformed data and will memcpy the transformed on-disk block into254* a newly allocated b_pabd. Writes are always done into buffers which have255* either been loaned (and hence are new and don't have other readers) or256* buffers which have been released (and hence have their own hdr, if there257* were originally other readers of the buf's original hdr). This ensures that258* the ARC only needs to update a single buf and its hdr after a write occurs.259*260* When the L2ARC is in use, it will also take advantage of the b_pabd. The261* L2ARC will always write the contents of b_pabd to the L2ARC. This means262* that when compressed ARC is enabled that the L2ARC blocks are identical263* to the on-disk block in the main data pool. This provides a significant264* advantage since the ARC can leverage the bp's checksum when reading from the265* L2ARC to determine if the contents are valid. However, if the compressed266* ARC is disabled, then the L2ARC's block must be transformed to look267* like the physical block in the main data pool before comparing the268* checksum and determining its validity.269*270* The L1ARC has a slightly different system for storing encrypted data.271* Raw (encrypted + possibly compressed) data has a few subtle differences from272* data that is just compressed. The biggest difference is that it is not273* possible to decrypt encrypted data (or vice-versa) if the keys aren't loaded.274* The other difference is that encryption cannot be treated as a suggestion.275* If a caller would prefer compressed data, but they actually wind up with276* uncompressed data the worst thing that could happen is there might be a277* performance hit. If the caller requests encrypted data, however, we must be278* sure they actually get it or else secret information could be leaked. Raw279* data is stored in hdr->b_crypt_hdr.b_rabd. An encrypted header, therefore,280* may have both an encrypted version and a decrypted version of its data at281* once. When a caller needs a raw arc_buf_t, it is allocated and the data is282* copied out of this header. 
To avoid complications with b_pabd, raw buffers283* cannot be shared.284*/285286#include <sys/spa.h>287#include <sys/zio.h>288#include <sys/spa_impl.h>289#include <sys/zio_compress.h>290#include <sys/zio_checksum.h>291#include <sys/zfs_context.h>292#include <sys/arc.h>293#include <sys/zfs_refcount.h>294#include <sys/vdev.h>295#include <sys/vdev_impl.h>296#include <sys/dsl_pool.h>297#include <sys/multilist.h>298#include <sys/abd.h>299#include <sys/dbuf.h>300#include <sys/zil.h>301#include <sys/fm/fs/zfs.h>302#include <sys/callb.h>303#include <sys/kstat.h>304#include <sys/zthr.h>305#include <zfs_fletcher.h>306#include <sys/arc_impl.h>307#include <sys/trace_zfs.h>308#include <sys/aggsum.h>309#include <sys/wmsum.h>310#include <cityhash.h>311#include <sys/vdev_trim.h>312#include <sys/zfs_racct.h>313#include <sys/zstd/zstd.h>314315#ifndef _KERNEL316/* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */317boolean_t arc_watch = B_FALSE;318#endif319320/*321* This thread's job is to keep enough free memory in the system, by322* calling arc_kmem_reap_soon() plus arc_reduce_target_size(), which improves323* arc_available_memory().324*/325static zthr_t *arc_reap_zthr;326327/*328* This thread's job is to keep arc_size under arc_c, by calling329* arc_evict(), which improves arc_is_overflowing().330*/331static zthr_t *arc_evict_zthr;332static arc_buf_hdr_t **arc_state_evict_markers;333static int arc_state_evict_marker_count;334335static kmutex_t arc_evict_lock;336static boolean_t arc_evict_needed = B_FALSE;337static clock_t arc_last_uncached_flush;338339static taskq_t *arc_evict_taskq;340static struct evict_arg *arc_evict_arg;341342/*343* Count of bytes evicted since boot.344*/345static uint64_t arc_evict_count;346347/*348* List of arc_evict_waiter_t's, representing threads waiting for the349* arc_evict_count to reach specific values.350*/351static list_t arc_evict_waiters;352353/*354* When arc_is_overflowing(), arc_get_data_impl() waits for this percent of355* the requested amount of data to be evicted. For example, by default for356* every 2KB that's evicted, 1KB of it may be "reused" by a new allocation.357* Since this is above 100%, it ensures that progress is made towards getting358* arc_size under arc_c. Since this is finite, it ensures that allocations359* can still happen, even during the potentially long time that arc_size is360* more than arc_c.361*/362static uint_t zfs_arc_eviction_pct = 200;363364/*365* The number of headers to evict in arc_evict_state_impl() before366* dropping the sublist lock and evicting from another sublist. A lower367* value means we're more likely to evict the "correct" header (i.e. the368* oldest header in the arc state), but comes with higher overhead369* (i.e. more invocations of arc_evict_state_impl()).370*/371static uint_t zfs_arc_evict_batch_limit = 10;372373/* number of seconds before growing cache again */374uint_t arc_grow_retry = 5;375376/*377* Minimum time between calls to arc_kmem_reap_soon().378*/379static const int arc_kmem_cache_reap_retry_ms = 1000;380381/* shift of arc_c for calculating overflow limit in arc_get_data_impl */382static int zfs_arc_overflow_shift = 8;383384/* log2(fraction of arc to reclaim) */385uint_t arc_shrink_shift = 7;386387#ifdef _KERNEL388/* percent of pagecache to reclaim arc to */389uint_t zfs_arc_pc_percent = 0;390#endif391392/*393* log2(fraction of ARC which must be free to allow growing).394* I.e. 
If there is less than arc_c >> arc_no_grow_shift free memory,395* when reading a new block into the ARC, we will evict an equal-sized block396* from the ARC.397*398* This must be less than arc_shrink_shift, so that when we shrink the ARC,399* we will still not allow it to grow.400*/401uint_t arc_no_grow_shift = 5;402403404/*405* minimum lifespan of a prefetch block in clock ticks406* (initialized in arc_init())407*/408static uint_t arc_min_prefetch_ms;409static uint_t arc_min_prescient_prefetch_ms;410411/*412* If this percent of memory is free, don't throttle.413*/414uint_t arc_lotsfree_percent = 10;415416/*417* The arc has filled available memory and has now warmed up.418*/419boolean_t arc_warm;420421/*422* These tunables are for performance analysis.423*/424uint64_t zfs_arc_max = 0;425uint64_t zfs_arc_min = 0;426static uint64_t zfs_arc_dnode_limit = 0;427static uint_t zfs_arc_dnode_reduce_percent = 10;428static uint_t zfs_arc_grow_retry = 0;429static uint_t zfs_arc_shrink_shift = 0;430uint_t zfs_arc_average_blocksize = 8 * 1024; /* 8KB */431432/*433* ARC dirty data constraints for arc_tempreserve_space() throttle:434* * total dirty data limit435* * anon block dirty limit436* * each pool's anon allowance437*/438static const unsigned long zfs_arc_dirty_limit_percent = 50;439static const unsigned long zfs_arc_anon_limit_percent = 25;440static const unsigned long zfs_arc_pool_dirty_percent = 20;441442/*443* Enable or disable compressed arc buffers.444*/445int zfs_compressed_arc_enabled = B_TRUE;446447/*448* Balance between metadata and data on ghost hits. Values above 100449* increase metadata caching by proportionally reducing effect of ghost450* data hits on target data/metadata rate.451*/452static uint_t zfs_arc_meta_balance = 500;453454/*455* Percentage that can be consumed by dnodes of ARC meta buffers.456*/457static uint_t zfs_arc_dnode_limit_percent = 10;458459/*460* These tunables are Linux-specific461*/462static uint64_t zfs_arc_sys_free = 0;463static uint_t zfs_arc_min_prefetch_ms = 0;464static uint_t zfs_arc_min_prescient_prefetch_ms = 0;465static uint_t zfs_arc_lotsfree_percent = 10;466467/*468* Number of arc_prune threads469*/470static int zfs_arc_prune_task_threads = 1;471472/* Used by spa_export/spa_destroy to flush the arc asynchronously */473static taskq_t *arc_flush_taskq;474475/*476* Controls the number of ARC eviction threads to dispatch sublists to.477*478* Possible values:479* 0 (auto) compute the number of threads using a logarithmic formula.480* 1 (disabled) one thread - parallel eviction is disabled.481* 2+ (manual) set the number manually.482*483* See arc_evict_thread_init() for how "auto" is computed.484*/485static uint_t zfs_arc_evict_threads = 0;486487/* The 7 states: */488arc_state_t ARC_anon;489arc_state_t ARC_mru;490arc_state_t ARC_mru_ghost;491arc_state_t ARC_mfu;492arc_state_t ARC_mfu_ghost;493arc_state_t ARC_l2c_only;494arc_state_t ARC_uncached;495496arc_stats_t arc_stats = {497{ "hits", KSTAT_DATA_UINT64 },498{ "iohits", KSTAT_DATA_UINT64 },499{ "misses", KSTAT_DATA_UINT64 },500{ "demand_data_hits", KSTAT_DATA_UINT64 },501{ "demand_data_iohits", KSTAT_DATA_UINT64 },502{ "demand_data_misses", KSTAT_DATA_UINT64 },503{ "demand_metadata_hits", KSTAT_DATA_UINT64 },504{ "demand_metadata_iohits", KSTAT_DATA_UINT64 },505{ "demand_metadata_misses", KSTAT_DATA_UINT64 },506{ "prefetch_data_hits", KSTAT_DATA_UINT64 },507{ "prefetch_data_iohits", KSTAT_DATA_UINT64 },508{ "prefetch_data_misses", KSTAT_DATA_UINT64 },509{ "prefetch_metadata_hits", KSTAT_DATA_UINT64 },510{ 
"prefetch_metadata_iohits", KSTAT_DATA_UINT64 },511{ "prefetch_metadata_misses", KSTAT_DATA_UINT64 },512{ "mru_hits", KSTAT_DATA_UINT64 },513{ "mru_ghost_hits", KSTAT_DATA_UINT64 },514{ "mfu_hits", KSTAT_DATA_UINT64 },515{ "mfu_ghost_hits", KSTAT_DATA_UINT64 },516{ "uncached_hits", KSTAT_DATA_UINT64 },517{ "deleted", KSTAT_DATA_UINT64 },518{ "mutex_miss", KSTAT_DATA_UINT64 },519{ "access_skip", KSTAT_DATA_UINT64 },520{ "evict_skip", KSTAT_DATA_UINT64 },521{ "evict_not_enough", KSTAT_DATA_UINT64 },522{ "evict_l2_cached", KSTAT_DATA_UINT64 },523{ "evict_l2_eligible", KSTAT_DATA_UINT64 },524{ "evict_l2_eligible_mfu", KSTAT_DATA_UINT64 },525{ "evict_l2_eligible_mru", KSTAT_DATA_UINT64 },526{ "evict_l2_ineligible", KSTAT_DATA_UINT64 },527{ "evict_l2_skip", KSTAT_DATA_UINT64 },528{ "hash_elements", KSTAT_DATA_UINT64 },529{ "hash_elements_max", KSTAT_DATA_UINT64 },530{ "hash_collisions", KSTAT_DATA_UINT64 },531{ "hash_chains", KSTAT_DATA_UINT64 },532{ "hash_chain_max", KSTAT_DATA_UINT64 },533{ "meta", KSTAT_DATA_UINT64 },534{ "pd", KSTAT_DATA_UINT64 },535{ "pm", KSTAT_DATA_UINT64 },536{ "c", KSTAT_DATA_UINT64 },537{ "c_min", KSTAT_DATA_UINT64 },538{ "c_max", KSTAT_DATA_UINT64 },539{ "size", KSTAT_DATA_UINT64 },540{ "compressed_size", KSTAT_DATA_UINT64 },541{ "uncompressed_size", KSTAT_DATA_UINT64 },542{ "overhead_size", KSTAT_DATA_UINT64 },543{ "hdr_size", KSTAT_DATA_UINT64 },544{ "data_size", KSTAT_DATA_UINT64 },545{ "metadata_size", KSTAT_DATA_UINT64 },546{ "dbuf_size", KSTAT_DATA_UINT64 },547{ "dnode_size", KSTAT_DATA_UINT64 },548{ "bonus_size", KSTAT_DATA_UINT64 },549#if defined(COMPAT_FREEBSD11)550{ "other_size", KSTAT_DATA_UINT64 },551#endif552{ "anon_size", KSTAT_DATA_UINT64 },553{ "anon_data", KSTAT_DATA_UINT64 },554{ "anon_metadata", KSTAT_DATA_UINT64 },555{ "anon_evictable_data", KSTAT_DATA_UINT64 },556{ "anon_evictable_metadata", KSTAT_DATA_UINT64 },557{ "mru_size", KSTAT_DATA_UINT64 },558{ "mru_data", KSTAT_DATA_UINT64 },559{ "mru_metadata", KSTAT_DATA_UINT64 },560{ "mru_evictable_data", KSTAT_DATA_UINT64 },561{ "mru_evictable_metadata", KSTAT_DATA_UINT64 },562{ "mru_ghost_size", KSTAT_DATA_UINT64 },563{ "mru_ghost_data", KSTAT_DATA_UINT64 },564{ "mru_ghost_metadata", KSTAT_DATA_UINT64 },565{ "mru_ghost_evictable_data", KSTAT_DATA_UINT64 },566{ "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 },567{ "mfu_size", KSTAT_DATA_UINT64 },568{ "mfu_data", KSTAT_DATA_UINT64 },569{ "mfu_metadata", KSTAT_DATA_UINT64 },570{ "mfu_evictable_data", KSTAT_DATA_UINT64 },571{ "mfu_evictable_metadata", KSTAT_DATA_UINT64 },572{ "mfu_ghost_size", KSTAT_DATA_UINT64 },573{ "mfu_ghost_data", KSTAT_DATA_UINT64 },574{ "mfu_ghost_metadata", KSTAT_DATA_UINT64 },575{ "mfu_ghost_evictable_data", KSTAT_DATA_UINT64 },576{ "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 },577{ "uncached_size", KSTAT_DATA_UINT64 },578{ "uncached_data", KSTAT_DATA_UINT64 },579{ "uncached_metadata", KSTAT_DATA_UINT64 },580{ "uncached_evictable_data", KSTAT_DATA_UINT64 },581{ "uncached_evictable_metadata", KSTAT_DATA_UINT64 },582{ "l2_hits", KSTAT_DATA_UINT64 },583{ "l2_misses", KSTAT_DATA_UINT64 },584{ "l2_prefetch_asize", KSTAT_DATA_UINT64 },585{ "l2_mru_asize", KSTAT_DATA_UINT64 },586{ "l2_mfu_asize", KSTAT_DATA_UINT64 },587{ "l2_bufc_data_asize", KSTAT_DATA_UINT64 },588{ "l2_bufc_metadata_asize", KSTAT_DATA_UINT64 },589{ "l2_feeds", KSTAT_DATA_UINT64 },590{ "l2_rw_clash", KSTAT_DATA_UINT64 },591{ "l2_read_bytes", KSTAT_DATA_UINT64 },592{ "l2_write_bytes", KSTAT_DATA_UINT64 },593{ "l2_writes_sent", KSTAT_DATA_UINT64 },594{ 
"l2_writes_done", KSTAT_DATA_UINT64 },595{ "l2_writes_error", KSTAT_DATA_UINT64 },596{ "l2_writes_lock_retry", KSTAT_DATA_UINT64 },597{ "l2_evict_lock_retry", KSTAT_DATA_UINT64 },598{ "l2_evict_reading", KSTAT_DATA_UINT64 },599{ "l2_evict_l1cached", KSTAT_DATA_UINT64 },600{ "l2_free_on_write", KSTAT_DATA_UINT64 },601{ "l2_abort_lowmem", KSTAT_DATA_UINT64 },602{ "l2_cksum_bad", KSTAT_DATA_UINT64 },603{ "l2_io_error", KSTAT_DATA_UINT64 },604{ "l2_size", KSTAT_DATA_UINT64 },605{ "l2_asize", KSTAT_DATA_UINT64 },606{ "l2_hdr_size", KSTAT_DATA_UINT64 },607{ "l2_log_blk_writes", KSTAT_DATA_UINT64 },608{ "l2_log_blk_avg_asize", KSTAT_DATA_UINT64 },609{ "l2_log_blk_asize", KSTAT_DATA_UINT64 },610{ "l2_log_blk_count", KSTAT_DATA_UINT64 },611{ "l2_data_to_meta_ratio", KSTAT_DATA_UINT64 },612{ "l2_rebuild_success", KSTAT_DATA_UINT64 },613{ "l2_rebuild_unsupported", KSTAT_DATA_UINT64 },614{ "l2_rebuild_io_errors", KSTAT_DATA_UINT64 },615{ "l2_rebuild_dh_errors", KSTAT_DATA_UINT64 },616{ "l2_rebuild_cksum_lb_errors", KSTAT_DATA_UINT64 },617{ "l2_rebuild_lowmem", KSTAT_DATA_UINT64 },618{ "l2_rebuild_size", KSTAT_DATA_UINT64 },619{ "l2_rebuild_asize", KSTAT_DATA_UINT64 },620{ "l2_rebuild_bufs", KSTAT_DATA_UINT64 },621{ "l2_rebuild_bufs_precached", KSTAT_DATA_UINT64 },622{ "l2_rebuild_log_blks", KSTAT_DATA_UINT64 },623{ "memory_throttle_count", KSTAT_DATA_UINT64 },624{ "memory_direct_count", KSTAT_DATA_UINT64 },625{ "memory_indirect_count", KSTAT_DATA_UINT64 },626{ "memory_all_bytes", KSTAT_DATA_UINT64 },627{ "memory_free_bytes", KSTAT_DATA_UINT64 },628{ "memory_available_bytes", KSTAT_DATA_INT64 },629{ "arc_no_grow", KSTAT_DATA_UINT64 },630{ "arc_tempreserve", KSTAT_DATA_UINT64 },631{ "arc_loaned_bytes", KSTAT_DATA_UINT64 },632{ "arc_prune", KSTAT_DATA_UINT64 },633{ "arc_meta_used", KSTAT_DATA_UINT64 },634{ "arc_dnode_limit", KSTAT_DATA_UINT64 },635{ "async_upgrade_sync", KSTAT_DATA_UINT64 },636{ "predictive_prefetch", KSTAT_DATA_UINT64 },637{ "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 },638{ "demand_iohit_predictive_prefetch", KSTAT_DATA_UINT64 },639{ "prescient_prefetch", KSTAT_DATA_UINT64 },640{ "demand_hit_prescient_prefetch", KSTAT_DATA_UINT64 },641{ "demand_iohit_prescient_prefetch", KSTAT_DATA_UINT64 },642{ "arc_need_free", KSTAT_DATA_UINT64 },643{ "arc_sys_free", KSTAT_DATA_UINT64 },644{ "arc_raw_size", KSTAT_DATA_UINT64 },645{ "cached_only_in_progress", KSTAT_DATA_UINT64 },646{ "abd_chunk_waste_size", KSTAT_DATA_UINT64 },647};648649arc_sums_t arc_sums;650651#define ARCSTAT_MAX(stat, val) { \652uint64_t m; \653while ((val) > (m = arc_stats.stat.value.ui64) && \654(m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \655continue; \656}657658/*659* We define a macro to allow ARC hits/misses to be easily broken down by660* two separate conditions, giving a total of four different subtypes for661* each of hits and misses (so eight statistics total).662*/663#define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \664if (cond1) { \665if (cond2) { \666ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \667} else { \668ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \669} \670} else { \671if (cond2) { \672ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \673} else { \674ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\675} \676}677678/*679* This macro allows us to use kstats as floating averages. 
Each time we680* update this kstat, we first factor it and the update value by681* ARCSTAT_AVG_FACTOR to shrink the new value's contribution to the overall682* average. This macro assumes that integer loads and stores are atomic, but683* is not safe for multiple writers updating the kstat in parallel (only the684* last writer's update will remain).685*/686#define ARCSTAT_F_AVG_FACTOR 3687#define ARCSTAT_F_AVG(stat, value) \688do { \689uint64_t x = ARCSTAT(stat); \690x = x - x / ARCSTAT_F_AVG_FACTOR + \691(value) / ARCSTAT_F_AVG_FACTOR; \692ARCSTAT(stat) = x; \693} while (0)694695static kstat_t *arc_ksp;696697/*698* There are several ARC variables that are critical to export as kstats --699* but we don't want to have to grovel around in the kstat whenever we wish to700* manipulate them. For these variables, we therefore define them to be in701* terms of the statistic variable. This assures that we are not introducing702* the possibility of inconsistency by having shadow copies of the variables,703* while still allowing the code to be readable.704*/705#define arc_tempreserve ARCSTAT(arcstat_tempreserve)706#define arc_loaned_bytes ARCSTAT(arcstat_loaned_bytes)707#define arc_dnode_limit ARCSTAT(arcstat_dnode_limit) /* max size for dnodes */708#define arc_need_free ARCSTAT(arcstat_need_free) /* waiting to be evicted */709710hrtime_t arc_growtime;711list_t arc_prune_list;712kmutex_t arc_prune_mtx;713taskq_t *arc_prune_taskq;714715#define GHOST_STATE(state) \716((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \717(state) == arc_l2c_only)718719#define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE)720#define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS)721#define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_FLAG_IO_ERROR)722#define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_FLAG_PREFETCH)723#define HDR_PRESCIENT_PREFETCH(hdr) \724((hdr)->b_flags & ARC_FLAG_PRESCIENT_PREFETCH)725#define HDR_COMPRESSION_ENABLED(hdr) \726((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC)727728#define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_FLAG_L2CACHE)729#define HDR_UNCACHED(hdr) ((hdr)->b_flags & ARC_FLAG_UNCACHED)730#define HDR_L2_READING(hdr) \731(((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) && \732((hdr)->b_flags & ARC_FLAG_HAS_L2HDR))733#define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITING)734#define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_FLAG_L2_EVICTED)735#define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD)736#define HDR_PROTECTED(hdr) ((hdr)->b_flags & ARC_FLAG_PROTECTED)737#define HDR_NOAUTH(hdr) ((hdr)->b_flags & ARC_FLAG_NOAUTH)738#define HDR_SHARED_DATA(hdr) ((hdr)->b_flags & ARC_FLAG_SHARED_DATA)739740#define HDR_ISTYPE_METADATA(hdr) \741((hdr)->b_flags & ARC_FLAG_BUFC_METADATA)742#define HDR_ISTYPE_DATA(hdr) (!HDR_ISTYPE_METADATA(hdr))743744#define HDR_HAS_L1HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR)745#define HDR_HAS_L2HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)746#define HDR_HAS_RABD(hdr) \747(HDR_HAS_L1HDR(hdr) && HDR_PROTECTED(hdr) && \748(hdr)->b_crypt_hdr.b_rabd != NULL)749#define HDR_ENCRYPTED(hdr) \750(HDR_PROTECTED(hdr) && DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))751#define HDR_AUTHENTICATED(hdr) \752(HDR_PROTECTED(hdr) && !DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot))753754/* For storing compression mode in b_flags */755#define HDR_COMPRESS_OFFSET (highbit64(ARC_FLAG_COMPRESS_0) - 1)756757#define HDR_GET_COMPRESS(hdr) ((enum zio_compress)BF32_GET((hdr)->b_flags, \758HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS))759#define 
HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \760HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp));761762#define ARC_BUF_LAST(buf) ((buf)->b_next == NULL)763#define ARC_BUF_SHARED(buf) ((buf)->b_flags & ARC_BUF_FLAG_SHARED)764#define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED)765#define ARC_BUF_ENCRYPTED(buf) ((buf)->b_flags & ARC_BUF_FLAG_ENCRYPTED)766767/*768* Other sizes769*/770771#define HDR_FULL_SIZE ((int64_t)sizeof (arc_buf_hdr_t))772#define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr))773774/*775* Hash table routines776*/777778#define BUF_LOCKS 2048779typedef struct buf_hash_table {780uint64_t ht_mask;781arc_buf_hdr_t **ht_table;782kmutex_t ht_locks[BUF_LOCKS] ____cacheline_aligned;783} buf_hash_table_t;784785static buf_hash_table_t buf_hash_table;786787#define BUF_HASH_INDEX(spa, dva, birth) \788(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)789#define BUF_HASH_LOCK(idx) (&buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])790#define HDR_LOCK(hdr) \791(BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth)))792793uint64_t zfs_crc64_table[256];794795/*796* Asynchronous ARC flush797*798* We track these in a list for arc_async_flush_guid_inuse().799* Used for both L1 and L2 async teardown.800*/801static list_t arc_async_flush_list;802static kmutex_t arc_async_flush_lock;803804typedef struct arc_async_flush {805uint64_t af_spa_guid;806taskq_ent_t af_tqent;807uint_t af_cache_level; /* 1 or 2 to differentiate node */808list_node_t af_node;809} arc_async_flush_t;810811812/*813* Level 2 ARC814*/815816#define L2ARC_WRITE_SIZE (32 * 1024 * 1024) /* initial write max */817#define L2ARC_HEADROOM 8 /* num of writes */818819/*820* If we discover during ARC scan any buffers to be compressed, we boost821* our headroom for the next scanning cycle by this percentage multiple.822*/823#define L2ARC_HEADROOM_BOOST 200824#define L2ARC_FEED_SECS 1 /* caching interval secs */825#define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */826827/*828* We can feed L2ARC from two states of ARC buffers, mru and mfu,829* and each of the state has two types: data and metadata.830*/831#define L2ARC_FEED_TYPES 4832833/* L2ARC Performance Tunables */834uint64_t l2arc_write_max = L2ARC_WRITE_SIZE; /* def max write size */835uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra warmup write */836uint64_t l2arc_headroom = L2ARC_HEADROOM; /* # of dev writes */837uint64_t l2arc_headroom_boost = L2ARC_HEADROOM_BOOST;838uint64_t l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */839uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval msecs */840int l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */841int l2arc_feed_again = B_TRUE; /* turbo warmup */842int l2arc_norw = B_FALSE; /* no reads during writes */843static uint_t l2arc_meta_percent = 33; /* limit on headers size */844845/*846* L2ARC Internals847*/848static list_t L2ARC_dev_list; /* device list */849static list_t *l2arc_dev_list; /* device list pointer */850static kmutex_t l2arc_dev_mtx; /* device list mutex */851static l2arc_dev_t *l2arc_dev_last; /* last device used */852static list_t L2ARC_free_on_write; /* free after write buf list */853static list_t *l2arc_free_on_write; /* free after write list ptr */854static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */855static uint64_t l2arc_ndev; /* number of devices */856857typedef struct l2arc_read_callback {858arc_buf_hdr_t *l2rcb_hdr; /* read header */859blkptr_t l2rcb_bp; /* original blkptr */860zbookmark_phys_t l2rcb_zb; /* 
original bookmark */861int l2rcb_flags; /* original flags */862abd_t *l2rcb_abd; /* temporary buffer */863} l2arc_read_callback_t;864865typedef struct l2arc_data_free {866/* protected by l2arc_free_on_write_mtx */867abd_t *l2df_abd;868size_t l2df_size;869arc_buf_contents_t l2df_type;870list_node_t l2df_list_node;871} l2arc_data_free_t;872873typedef enum arc_fill_flags {874ARC_FILL_LOCKED = 1 << 0, /* hdr lock is held */875ARC_FILL_COMPRESSED = 1 << 1, /* fill with compressed data */876ARC_FILL_ENCRYPTED = 1 << 2, /* fill with encrypted data */877ARC_FILL_NOAUTH = 1 << 3, /* don't attempt to authenticate */878ARC_FILL_IN_PLACE = 1 << 4 /* fill in place (special case) */879} arc_fill_flags_t;880881typedef enum arc_ovf_level {882ARC_OVF_NONE, /* ARC within target size. */883ARC_OVF_SOME, /* ARC is slightly overflowed. */884ARC_OVF_SEVERE /* ARC is severely overflowed. */885} arc_ovf_level_t;886887static kmutex_t l2arc_feed_thr_lock;888static kcondvar_t l2arc_feed_thr_cv;889static uint8_t l2arc_thread_exit;890891static kmutex_t l2arc_rebuild_thr_lock;892static kcondvar_t l2arc_rebuild_thr_cv;893894enum arc_hdr_alloc_flags {895ARC_HDR_ALLOC_RDATA = 0x1,896ARC_HDR_USE_RESERVE = 0x4,897ARC_HDR_ALLOC_LINEAR = 0x8,898};899900901static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, const void *, int);902static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, const void *);903static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, const void *, int);904static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, const void *);905static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, const void *);906static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size,907const void *tag);908static void arc_hdr_free_abd(arc_buf_hdr_t *, boolean_t);909static void arc_hdr_alloc_abd(arc_buf_hdr_t *, int);910static void arc_hdr_destroy(arc_buf_hdr_t *);911static void arc_access(arc_buf_hdr_t *, arc_flags_t, boolean_t);912static void arc_buf_watch(arc_buf_t *);913static void arc_change_state(arc_state_t *, arc_buf_hdr_t *);914915static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *);916static uint32_t arc_bufc_to_flags(arc_buf_contents_t);917static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);918static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags);919920static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *);921static void l2arc_read_done(zio_t *);922static void l2arc_do_free_on_write(void);923static void l2arc_hdr_arcstats_update(arc_buf_hdr_t *hdr, boolean_t incr,924boolean_t state_only);925926static void arc_prune_async(uint64_t adjust);927928#define l2arc_hdr_arcstats_increment(hdr) \929l2arc_hdr_arcstats_update((hdr), B_TRUE, B_FALSE)930#define l2arc_hdr_arcstats_decrement(hdr) \931l2arc_hdr_arcstats_update((hdr), B_FALSE, B_FALSE)932#define l2arc_hdr_arcstats_increment_state(hdr) \933l2arc_hdr_arcstats_update((hdr), B_TRUE, B_TRUE)934#define l2arc_hdr_arcstats_decrement_state(hdr) \935l2arc_hdr_arcstats_update((hdr), B_FALSE, B_TRUE)936937/*938* l2arc_exclude_special : A zfs module parameter that controls whether buffers939* present on special vdevs are eligibile for caching in L2ARC. 
If940* set to 1, exclude dbufs on special vdevs from being cached to941* L2ARC.942*/943int l2arc_exclude_special = 0;944945/*946* l2arc_mfuonly : A ZFS module parameter that controls whether only MFU947* metadata and data are cached from ARC into L2ARC.948*/949static int l2arc_mfuonly = 0;950951/*952* L2ARC TRIM953* l2arc_trim_ahead : A ZFS module parameter that controls how much ahead of954* the current write size (l2arc_write_max) we should TRIM if we955* have filled the device. It is defined as a percentage of the956* write size. If set to 100 we trim twice the space required to957* accommodate upcoming writes. A minimum of 64MB will be trimmed.958* It also enables TRIM of the whole L2ARC device upon creation or959* addition to an existing pool or if the header of the device is960* invalid upon importing a pool or onlining a cache device. The961* default is 0, which disables TRIM on L2ARC altogether as it can962* put significant stress on the underlying storage devices. This963* will vary depending of how well the specific device handles964* these commands.965*/966static uint64_t l2arc_trim_ahead = 0;967968/*969* Performance tuning of L2ARC persistence:970*971* l2arc_rebuild_enabled : A ZFS module parameter that controls whether adding972* an L2ARC device (either at pool import or later) will attempt973* to rebuild L2ARC buffer contents.974* l2arc_rebuild_blocks_min_l2size : A ZFS module parameter that controls975* whether log blocks are written to the L2ARC device. If the L2ARC976* device is less than 1GB, the amount of data l2arc_evict()977* evicts is significant compared to the amount of restored L2ARC978* data. In this case do not write log blocks in L2ARC in order979* not to waste space.980*/981static int l2arc_rebuild_enabled = B_TRUE;982static uint64_t l2arc_rebuild_blocks_min_l2size = 1024 * 1024 * 1024;983984/* L2ARC persistence rebuild control routines. */985void l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen);986static __attribute__((noreturn)) void l2arc_dev_rebuild_thread(void *arg);987static int l2arc_rebuild(l2arc_dev_t *dev);988989/* L2ARC persistence read I/O routines. */990static int l2arc_dev_hdr_read(l2arc_dev_t *dev);991static int l2arc_log_blk_read(l2arc_dev_t *dev,992const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp,993l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,994zio_t *this_io, zio_t **next_io);995static zio_t *l2arc_log_blk_fetch(vdev_t *vd,996const l2arc_log_blkptr_t *lp, l2arc_log_blk_phys_t *lb);997static void l2arc_log_blk_fetch_abort(zio_t *zio);998999/* L2ARC persistence block restoration routines. */1000static void l2arc_log_blk_restore(l2arc_dev_t *dev,1001const l2arc_log_blk_phys_t *lb, uint64_t lb_asize);1002static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le,1003l2arc_dev_t *dev);10041005/* L2ARC persistence write I/O routines. */1006static uint64_t l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio,1007l2arc_write_callback_t *cb);10081009/* L2ARC persistence auxiliary routines. */1010boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev,1011const l2arc_log_blkptr_t *lbp);1012static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev,1013const arc_buf_hdr_t *ab);1014boolean_t l2arc_range_check_overlap(uint64_t bottom,1015uint64_t top, uint64_t check);1016static void l2arc_blk_fetch_done(zio_t *zio);1017static inline uint64_t1018l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev);10191020/*1021* We use Cityhash for this. 
It's fast, and has good hash properties without1022* requiring any large static buffers.1023*/1024static uint64_t1025buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth)1026{1027return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth));1028}10291030#define HDR_EMPTY(hdr) \1031((hdr)->b_dva.dva_word[0] == 0 && \1032(hdr)->b_dva.dva_word[1] == 0)10331034#define HDR_EMPTY_OR_LOCKED(hdr) \1035(HDR_EMPTY(hdr) || MUTEX_HELD(HDR_LOCK(hdr)))10361037#define HDR_EQUAL(spa, dva, birth, hdr) \1038((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \1039((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \1040((hdr)->b_birth == birth) && ((hdr)->b_spa == spa)10411042static void1043buf_discard_identity(arc_buf_hdr_t *hdr)1044{1045hdr->b_dva.dva_word[0] = 0;1046hdr->b_dva.dva_word[1] = 0;1047hdr->b_birth = 0;1048}10491050static arc_buf_hdr_t *1051buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp)1052{1053const dva_t *dva = BP_IDENTITY(bp);1054uint64_t birth = BP_GET_PHYSICAL_BIRTH(bp);1055uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);1056kmutex_t *hash_lock = BUF_HASH_LOCK(idx);1057arc_buf_hdr_t *hdr;10581059mutex_enter(hash_lock);1060for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL;1061hdr = hdr->b_hash_next) {1062if (HDR_EQUAL(spa, dva, birth, hdr)) {1063*lockp = hash_lock;1064return (hdr);1065}1066}1067mutex_exit(hash_lock);1068*lockp = NULL;1069return (NULL);1070}10711072/*1073* Insert an entry into the hash table. If there is already an element1074* equal to elem in the hash table, then the already existing element1075* will be returned and the new element will not be inserted.1076* Otherwise returns NULL.1077* If lockp == NULL, the caller is assumed to already hold the hash lock.1078*/1079static arc_buf_hdr_t *1080buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp)1081{1082uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);1083kmutex_t *hash_lock = BUF_HASH_LOCK(idx);1084arc_buf_hdr_t *fhdr;1085uint32_t i;10861087ASSERT(!DVA_IS_EMPTY(&hdr->b_dva));1088ASSERT(hdr->b_birth != 0);1089ASSERT(!HDR_IN_HASH_TABLE(hdr));10901091if (lockp != NULL) {1092*lockp = hash_lock;1093mutex_enter(hash_lock);1094} else {1095ASSERT(MUTEX_HELD(hash_lock));1096}10971098for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL;1099fhdr = fhdr->b_hash_next, i++) {1100if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr))1101return (fhdr);1102}11031104hdr->b_hash_next = buf_hash_table.ht_table[idx];1105buf_hash_table.ht_table[idx] = hdr;1106arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE);11071108/* collect some hash table performance data */1109if (i > 0) {1110ARCSTAT_BUMP(arcstat_hash_collisions);1111if (i == 1)1112ARCSTAT_BUMP(arcstat_hash_chains);1113ARCSTAT_MAX(arcstat_hash_chain_max, i);1114}1115ARCSTAT_BUMP(arcstat_hash_elements);11161117return (NULL);1118}11191120static void1121buf_hash_remove(arc_buf_hdr_t *hdr)1122{1123arc_buf_hdr_t *fhdr, **hdrp;1124uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth);11251126ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));1127ASSERT(HDR_IN_HASH_TABLE(hdr));11281129hdrp = &buf_hash_table.ht_table[idx];1130while ((fhdr = *hdrp) != hdr) {1131ASSERT3P(fhdr, !=, NULL);1132hdrp = &fhdr->b_hash_next;1133}1134*hdrp = hdr->b_hash_next;1135hdr->b_hash_next = NULL;1136arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE);11371138/* collect some hash table performance data */1139ARCSTAT_BUMPDOWN(arcstat_hash_elements);1140if (buf_hash_table.ht_table[idx] &&1141buf_hash_table.ht_table[idx]->b_hash_next == 
NULL)1142ARCSTAT_BUMPDOWN(arcstat_hash_chains);1143}11441145/*1146* Global data structures and functions for the buf kmem cache.1147*/11481149static kmem_cache_t *hdr_full_cache;1150static kmem_cache_t *hdr_l2only_cache;1151static kmem_cache_t *buf_cache;11521153static void1154buf_fini(void)1155{1156#if defined(_KERNEL)1157/*1158* Large allocations which do not require contiguous pages1159* should be using vmem_free() in the linux kernel.1160*/1161vmem_free(buf_hash_table.ht_table,1162(buf_hash_table.ht_mask + 1) * sizeof (void *));1163#else1164kmem_free(buf_hash_table.ht_table,1165(buf_hash_table.ht_mask + 1) * sizeof (void *));1166#endif1167for (int i = 0; i < BUF_LOCKS; i++)1168mutex_destroy(BUF_HASH_LOCK(i));1169kmem_cache_destroy(hdr_full_cache);1170kmem_cache_destroy(hdr_l2only_cache);1171kmem_cache_destroy(buf_cache);1172}11731174/*1175* Constructor callback - called when the cache is empty1176* and a new buf is requested.1177*/1178static int1179hdr_full_cons(void *vbuf, void *unused, int kmflag)1180{1181(void) unused, (void) kmflag;1182arc_buf_hdr_t *hdr = vbuf;11831184memset(hdr, 0, HDR_FULL_SIZE);1185hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;1186zfs_refcount_create(&hdr->b_l1hdr.b_refcnt);1187#ifdef ZFS_DEBUG1188mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);1189#endif1190multilist_link_init(&hdr->b_l1hdr.b_arc_node);1191list_link_init(&hdr->b_l2hdr.b_l2node);1192arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS);11931194return (0);1195}11961197static int1198hdr_l2only_cons(void *vbuf, void *unused, int kmflag)1199{1200(void) unused, (void) kmflag;1201arc_buf_hdr_t *hdr = vbuf;12021203memset(hdr, 0, HDR_L2ONLY_SIZE);1204arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);12051206return (0);1207}12081209static int1210buf_cons(void *vbuf, void *unused, int kmflag)1211{1212(void) unused, (void) kmflag;1213arc_buf_t *buf = vbuf;12141215memset(buf, 0, sizeof (arc_buf_t));1216arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);12171218return (0);1219}12201221/*1222* Destructor callback - called when a cached buf is1223* no longer required.1224*/1225static void1226hdr_full_dest(void *vbuf, void *unused)1227{1228(void) unused;1229arc_buf_hdr_t *hdr = vbuf;12301231ASSERT(HDR_EMPTY(hdr));1232zfs_refcount_destroy(&hdr->b_l1hdr.b_refcnt);1233#ifdef ZFS_DEBUG1234mutex_destroy(&hdr->b_l1hdr.b_freeze_lock);1235#endif1236ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));1237arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS);1238}12391240static void1241hdr_l2only_dest(void *vbuf, void *unused)1242{1243(void) unused;1244arc_buf_hdr_t *hdr = vbuf;12451246ASSERT(HDR_EMPTY(hdr));1247arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);1248}12491250static void1251buf_dest(void *vbuf, void *unused)1252{1253(void) unused;1254(void) vbuf;12551256arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);1257}12581259static void1260buf_init(void)1261{1262uint64_t *ct = NULL;1263uint64_t hsize = 1ULL << 12;1264int i, j;12651266/*1267* The hash table is big enough to fill all of physical memory1268* with an average block size of zfs_arc_average_blocksize (default 8K).1269* By default, the table will take up1270* totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers).1271*/1272while (hsize * zfs_arc_average_blocksize < arc_all_memory())1273hsize <<= 1;1274retry:1275buf_hash_table.ht_mask = hsize - 1;1276#if defined(_KERNEL)1277/*1278* Large allocations which do not require contiguous pages1279* should be using vmem_alloc() in the linux kernel1280*/1281buf_hash_table.ht_table 
=1282vmem_zalloc(hsize * sizeof (void*), KM_SLEEP);1283#else1284buf_hash_table.ht_table =1285kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);1286#endif1287if (buf_hash_table.ht_table == NULL) {1288ASSERT(hsize > (1ULL << 8));1289hsize >>= 1;1290goto retry;1291}12921293hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE,12940, hdr_full_cons, hdr_full_dest, NULL, NULL, NULL, KMC_RECLAIMABLE);1295hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only",1296HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, NULL,1297NULL, NULL, 0);1298buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),12990, buf_cons, buf_dest, NULL, NULL, NULL, 0);13001301for (i = 0; i < 256; i++)1302for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)1303*ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);13041305for (i = 0; i < BUF_LOCKS; i++)1306mutex_init(BUF_HASH_LOCK(i), NULL, MUTEX_DEFAULT, NULL);1307}13081309#define ARC_MINTIME (hz>>4) /* 62 ms */13101311/*1312* This is the size that the buf occupies in memory. If the buf is compressed,1313* it will correspond to the compressed size. You should use this method of1314* getting the buf size unless you explicitly need the logical size.1315*/1316uint64_t1317arc_buf_size(arc_buf_t *buf)1318{1319return (ARC_BUF_COMPRESSED(buf) ?1320HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr));1321}13221323uint64_t1324arc_buf_lsize(arc_buf_t *buf)1325{1326return (HDR_GET_LSIZE(buf->b_hdr));1327}13281329/*1330* This function will return B_TRUE if the buffer is encrypted in memory.1331* This buffer can be decrypted by calling arc_untransform().1332*/1333boolean_t1334arc_is_encrypted(arc_buf_t *buf)1335{1336return (ARC_BUF_ENCRYPTED(buf) != 0);1337}13381339/*1340* Returns B_TRUE if the buffer represents data that has not had its MAC1341* verified yet.1342*/1343boolean_t1344arc_is_unauthenticated(arc_buf_t *buf)1345{1346return (HDR_NOAUTH(buf->b_hdr) != 0);1347}13481349void1350arc_get_raw_params(arc_buf_t *buf, boolean_t *byteorder, uint8_t *salt,1351uint8_t *iv, uint8_t *mac)1352{1353arc_buf_hdr_t *hdr = buf->b_hdr;13541355ASSERT(HDR_PROTECTED(hdr));13561357memcpy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN);1358memcpy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN);1359memcpy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN);1360*byteorder = (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?1361ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;1362}13631364/*1365* Indicates how this buffer is compressed in memory. If it is not compressed1366* the value will be ZIO_COMPRESS_OFF. It can be made normally readable with1367* arc_untransform() as long as it is also unencrypted.1368*/1369enum zio_compress1370arc_get_compression(arc_buf_t *buf)1371{1372return (ARC_BUF_COMPRESSED(buf) ?1373HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF);1374}13751376/*1377* Return the compression algorithm used to store this data in the ARC. If ARC1378* compression is enabled or this is an encrypted block, this will be the same1379* as what's used to store it on-disk. 
Otherwise, this will be ZIO_COMPRESS_OFF.1380*/1381static inline enum zio_compress1382arc_hdr_get_compress(arc_buf_hdr_t *hdr)1383{1384return (HDR_COMPRESSION_ENABLED(hdr) ?1385HDR_GET_COMPRESS(hdr) : ZIO_COMPRESS_OFF);1386}13871388uint8_t1389arc_get_complevel(arc_buf_t *buf)1390{1391return (buf->b_hdr->b_complevel);1392}13931394__maybe_unused1395static inline boolean_t1396arc_buf_is_shared(arc_buf_t *buf)1397{1398boolean_t shared = (buf->b_data != NULL &&1399buf->b_hdr->b_l1hdr.b_pabd != NULL &&1400abd_is_linear(buf->b_hdr->b_l1hdr.b_pabd) &&1401buf->b_data == abd_to_buf(buf->b_hdr->b_l1hdr.b_pabd));1402IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr));1403EQUIV(shared, ARC_BUF_SHARED(buf));1404IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf));14051406/*1407* It would be nice to assert arc_can_share() too, but the "hdr isn't1408* already being shared" requirement prevents us from doing that.1409*/14101411return (shared);1412}14131414/*1415* Free the checksum associated with this header. If there is no checksum, this1416* is a no-op.1417*/1418static inline void1419arc_cksum_free(arc_buf_hdr_t *hdr)1420{1421#ifdef ZFS_DEBUG1422ASSERT(HDR_HAS_L1HDR(hdr));14231424mutex_enter(&hdr->b_l1hdr.b_freeze_lock);1425if (hdr->b_l1hdr.b_freeze_cksum != NULL) {1426kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t));1427hdr->b_l1hdr.b_freeze_cksum = NULL;1428}1429mutex_exit(&hdr->b_l1hdr.b_freeze_lock);1430#endif1431}14321433/*1434* Return true iff at least one of the bufs on hdr is not compressed.1435* Encrypted buffers count as compressed.1436*/1437static boolean_t1438arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr)1439{1440ASSERT(hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY_OR_LOCKED(hdr));14411442for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) {1443if (!ARC_BUF_COMPRESSED(b)) {1444return (B_TRUE);1445}1446}1447return (B_FALSE);1448}144914501451/*1452* If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data1453* matches the checksum that is stored in the hdr. If there is no checksum,1454* or if the buf is compressed, this is a no-op.1455*/1456static void1457arc_cksum_verify(arc_buf_t *buf)1458{1459#ifdef ZFS_DEBUG1460arc_buf_hdr_t *hdr = buf->b_hdr;1461zio_cksum_t zc;14621463if (!(zfs_flags & ZFS_DEBUG_MODIFY))1464return;14651466if (ARC_BUF_COMPRESSED(buf))1467return;14681469ASSERT(HDR_HAS_L1HDR(hdr));14701471mutex_enter(&hdr->b_l1hdr.b_freeze_lock);14721473if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) {1474mutex_exit(&hdr->b_l1hdr.b_freeze_lock);1475return;1476}14771478fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc);1479if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc))1480panic("buffer modified while frozen!");1481mutex_exit(&hdr->b_l1hdr.b_freeze_lock);1482#endif1483}14841485/*1486* This function makes the assumption that data stored in the L2ARC1487* will be transformed exactly as it is in the main pool. Because of1488* this we can verify the checksum against the reading process's bp.1489*/1490static boolean_t1491arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio)1492{1493ASSERT(!BP_IS_EMBEDDED(zio->io_bp));1494VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr));14951496/*1497* Block pointers always store the checksum for the logical data.1498* If the block pointer has the gang bit set, then the checksum1499* it represents is for the reconstituted data and not for an1500* individual gang member. 
The zio pipeline, however, must be able to1501* determine the checksum of each of the gang constituents so it1502* treats the checksum comparison differently than what we need1503* for l2arc blocks. This prevents us from using the1504* zio_checksum_error() interface directly. Instead we must call the1505* zio_checksum_error_impl() so that we can ensure the checksum is1506* generated using the correct checksum algorithm and accounts for the1507* logical I/O size and not just a gang fragment.1508*/1509return (zio_checksum_error_impl(zio->io_spa, zio->io_bp,1510BP_GET_CHECKSUM(zio->io_bp), zio->io_abd, zio->io_size,1511zio->io_offset, NULL) == 0);1512}15131514/*1515* Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a1516* checksum and attaches it to the buf's hdr so that we can ensure that the buf1517* isn't modified later on. If buf is compressed or there is already a checksum1518* on the hdr, this is a no-op (we only checksum uncompressed bufs).1519*/1520static void1521arc_cksum_compute(arc_buf_t *buf)1522{1523if (!(zfs_flags & ZFS_DEBUG_MODIFY))1524return;15251526#ifdef ZFS_DEBUG1527arc_buf_hdr_t *hdr = buf->b_hdr;1528ASSERT(HDR_HAS_L1HDR(hdr));1529mutex_enter(&hdr->b_l1hdr.b_freeze_lock);1530if (hdr->b_l1hdr.b_freeze_cksum != NULL || ARC_BUF_COMPRESSED(buf)) {1531mutex_exit(&hdr->b_l1hdr.b_freeze_lock);1532return;1533}15341535ASSERT(!ARC_BUF_ENCRYPTED(buf));1536ASSERT(!ARC_BUF_COMPRESSED(buf));1537hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t),1538KM_SLEEP);1539fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL,1540hdr->b_l1hdr.b_freeze_cksum);1541mutex_exit(&hdr->b_l1hdr.b_freeze_lock);1542#endif1543arc_buf_watch(buf);1544}15451546#ifndef _KERNEL1547void1548arc_buf_sigsegv(int sig, siginfo_t *si, void *unused)1549{1550(void) sig, (void) unused;1551panic("Got SIGSEGV at address: 0x%lx\n", (long)si->si_addr);1552}1553#endif15541555static void1556arc_buf_unwatch(arc_buf_t *buf)1557{1558#ifndef _KERNEL1559if (arc_watch) {1560ASSERT0(mprotect(buf->b_data, arc_buf_size(buf),1561PROT_READ | PROT_WRITE));1562}1563#else1564(void) buf;1565#endif1566}15671568static void1569arc_buf_watch(arc_buf_t *buf)1570{1571#ifndef _KERNEL1572if (arc_watch)1573ASSERT0(mprotect(buf->b_data, arc_buf_size(buf),1574PROT_READ));1575#else1576(void) buf;1577#endif1578}15791580static arc_buf_contents_t1581arc_buf_type(arc_buf_hdr_t *hdr)1582{1583arc_buf_contents_t type;1584if (HDR_ISTYPE_METADATA(hdr)) {1585type = ARC_BUFC_METADATA;1586} else {1587type = ARC_BUFC_DATA;1588}1589VERIFY3U(hdr->b_type, ==, type);1590return (type);1591}15921593boolean_t1594arc_is_metadata(arc_buf_t *buf)1595{1596return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0);1597}15981599static uint32_t1600arc_bufc_to_flags(arc_buf_contents_t type)1601{1602switch (type) {1603case ARC_BUFC_DATA:1604/* metadata field is 0 if buffer contains normal data */1605return (0);1606case ARC_BUFC_METADATA:1607return (ARC_FLAG_BUFC_METADATA);1608default:1609break;1610}1611panic("undefined ARC buffer type!");1612return ((uint32_t)-1);1613}16141615void1616arc_buf_thaw(arc_buf_t *buf)1617{1618arc_buf_hdr_t *hdr = buf->b_hdr;16191620ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);1621ASSERT(!HDR_IO_IN_PROGRESS(hdr));16221623arc_cksum_verify(buf);16241625/*1626* Compressed buffers do not manipulate the b_freeze_cksum.1627*/1628if (ARC_BUF_COMPRESSED(buf))1629return;16301631ASSERT(HDR_HAS_L1HDR(hdr));1632arc_cksum_free(hdr);1633arc_buf_unwatch(buf);1634}16351636void1637arc_buf_freeze(arc_buf_t *buf)1638{1639if (!(zfs_flags & 
ZFS_DEBUG_MODIFY))
		return;

	if (ARC_BUF_COMPRESSED(buf))
		return;

	ASSERT(HDR_HAS_L1HDR(buf->b_hdr));
	arc_cksum_compute(buf);
}

/*
 * The arc_buf_hdr_t's b_flags should never be modified directly. Instead,
 * the following functions should be used to ensure that the flags are
 * updated in a thread-safe way. When manipulating the flags either
 * the hash_lock must be held or the hdr must be undiscoverable. This
 * ensures that we're not racing with any other threads when updating
 * the flags.
 */
static inline void
arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
{
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
	hdr->b_flags |= flags;
}

static inline void
arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags)
{
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));
	hdr->b_flags &= ~flags;
}

/*
 * Setting the compression bits in the arc_buf_hdr_t's b_flags is
 * done in a special way since we have to clear and set bits
 * at the same time. Consumers that wish to set the compression bits
 * must use this function to ensure that the flags are updated in a
 * thread-safe manner.
 */
static void
arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp)
{
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

	/*
	 * Holes and embedded blocks will always have a psize = 0 so
	 * we ignore the compression of the blkptr and simply mark
	 * them as uncompressed.
	 */
	if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) {
		arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
		ASSERT(!HDR_COMPRESSION_ENABLED(hdr));
	} else {
		arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC);
		ASSERT(HDR_COMPRESSION_ENABLED(hdr));
	}

	HDR_SET_COMPRESS(hdr, cmp);
	ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp);
}

/*
 * Looks for another buf on the same hdr which has the data decompressed,
 * copies from it, and returns true. If no such buf exists, returns false.
 */
static boolean_t
arc_buf_try_copy_decompressed_data(arc_buf_t *buf)
{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	boolean_t copied = B_FALSE;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT3P(buf->b_data, !=, NULL);
	ASSERT(!ARC_BUF_COMPRESSED(buf));

	for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL;
	    from = from->b_next) {
		/* can't use our own data buffer */
		if (from == buf) {
			continue;
		}

		if (!ARC_BUF_COMPRESSED(from)) {
			memcpy(buf->b_data, from->b_data, arc_buf_size(buf));
			copied = B_TRUE;
			break;
		}
	}

#ifdef ZFS_DEBUG
	/*
	 * There were no decompressed bufs, so there should not be a
	 * checksum on the hdr either.
	 */
	if (zfs_flags & ZFS_DEBUG_MODIFY)
		EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL);
#endif

	return (copied);
}

/*
 * Allocates an ARC buf header that's in an evicted & L2-cached state.
 * This is used during l2arc reconstruction to make empty ARC buffers
 * which circumvent the regular disk->arc->l2arc path and instead come
 * into being in the reverse order, i.e.
l2arc->arc.1745*/1746static arc_buf_hdr_t *1747arc_buf_alloc_l2only(size_t size, arc_buf_contents_t type, l2arc_dev_t *dev,1748dva_t dva, uint64_t daddr, int32_t psize, uint64_t asize, uint64_t birth,1749enum zio_compress compress, uint8_t complevel, boolean_t protected,1750boolean_t prefetch, arc_state_type_t arcs_state)1751{1752arc_buf_hdr_t *hdr;17531754ASSERT(size != 0);1755ASSERT(dev->l2ad_vdev != NULL);17561757hdr = kmem_cache_alloc(hdr_l2only_cache, KM_SLEEP);1758hdr->b_birth = birth;1759hdr->b_type = type;1760hdr->b_flags = 0;1761arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR);1762HDR_SET_LSIZE(hdr, size);1763HDR_SET_PSIZE(hdr, psize);1764HDR_SET_L2SIZE(hdr, asize);1765arc_hdr_set_compress(hdr, compress);1766hdr->b_complevel = complevel;1767if (protected)1768arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);1769if (prefetch)1770arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);1771hdr->b_spa = spa_load_guid(dev->l2ad_vdev->vdev_spa);17721773hdr->b_dva = dva;17741775hdr->b_l2hdr.b_dev = dev;1776hdr->b_l2hdr.b_daddr = daddr;1777hdr->b_l2hdr.b_arcs_state = arcs_state;17781779return (hdr);1780}17811782/*1783* Return the size of the block, b_pabd, that is stored in the arc_buf_hdr_t.1784*/1785static uint64_t1786arc_hdr_size(arc_buf_hdr_t *hdr)1787{1788uint64_t size;17891790if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF &&1791HDR_GET_PSIZE(hdr) > 0) {1792size = HDR_GET_PSIZE(hdr);1793} else {1794ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0);1795size = HDR_GET_LSIZE(hdr);1796}1797return (size);1798}17991800static int1801arc_hdr_authenticate(arc_buf_hdr_t *hdr, spa_t *spa, uint64_t dsobj)1802{1803int ret;1804uint64_t csize;1805uint64_t lsize = HDR_GET_LSIZE(hdr);1806uint64_t psize = HDR_GET_PSIZE(hdr);1807abd_t *abd = hdr->b_l1hdr.b_pabd;1808boolean_t free_abd = B_FALSE;18091810ASSERT(HDR_EMPTY_OR_LOCKED(hdr));1811ASSERT(HDR_AUTHENTICATED(hdr));1812ASSERT3P(abd, !=, NULL);18131814/*1815* The MAC is calculated on the compressed data that is stored on disk.1816* However, if compressed arc is disabled we will only have the1817* decompressed data available to us now. Compress it into a temporary1818* abd so we can verify the MAC. The performance overhead of this will1819* be relatively low, since most objects in an encrypted objset will1820* be encrypted (instead of authenticated) anyway.1821*/1822if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&1823!HDR_COMPRESSION_ENABLED(hdr)) {1824abd = NULL;1825csize = zio_compress_data(HDR_GET_COMPRESS(hdr),1826hdr->b_l1hdr.b_pabd, &abd, lsize, MIN(lsize, psize),1827hdr->b_complevel);1828if (csize >= lsize || csize > psize) {1829ret = SET_ERROR(EIO);1830return (ret);1831}1832ASSERT3P(abd, !=, NULL);1833abd_zero_off(abd, csize, psize - csize);1834free_abd = B_TRUE;1835}18361837/*1838* Authentication is best effort. We authenticate whenever the key is1839* available. 
If we succeed we clear ARC_FLAG_NOAUTH.1840*/1841if (hdr->b_crypt_hdr.b_ot == DMU_OT_OBJSET) {1842ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF);1843ASSERT3U(lsize, ==, psize);1844ret = spa_do_crypt_objset_mac_abd(B_FALSE, spa, dsobj, abd,1845psize, hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);1846} else {1847ret = spa_do_crypt_mac_abd(B_FALSE, spa, dsobj, abd, psize,1848hdr->b_crypt_hdr.b_mac);1849}18501851if (ret == 0)1852arc_hdr_clear_flags(hdr, ARC_FLAG_NOAUTH);1853else if (ret == ENOENT)1854ret = 0;18551856if (free_abd)1857abd_free(abd);18581859return (ret);1860}18611862/*1863* This function will take a header that only has raw encrypted data in1864* b_crypt_hdr.b_rabd and decrypt it into a new buffer which is stored in1865* b_l1hdr.b_pabd. If designated in the header flags, this function will1866* also decompress the data.1867*/1868static int1869arc_hdr_decrypt(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb)1870{1871int ret;1872abd_t *cabd = NULL;1873boolean_t no_crypt = B_FALSE;1874boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);18751876ASSERT(HDR_EMPTY_OR_LOCKED(hdr));1877ASSERT(HDR_ENCRYPTED(hdr));18781879arc_hdr_alloc_abd(hdr, 0);18801881ret = spa_do_crypt_abd(B_FALSE, spa, zb, hdr->b_crypt_hdr.b_ot,1882B_FALSE, bswap, hdr->b_crypt_hdr.b_salt, hdr->b_crypt_hdr.b_iv,1883hdr->b_crypt_hdr.b_mac, HDR_GET_PSIZE(hdr), hdr->b_l1hdr.b_pabd,1884hdr->b_crypt_hdr.b_rabd, &no_crypt);1885if (ret != 0)1886goto error;18871888if (no_crypt) {1889abd_copy(hdr->b_l1hdr.b_pabd, hdr->b_crypt_hdr.b_rabd,1890HDR_GET_PSIZE(hdr));1891}18921893/*1894* If this header has disabled arc compression but the b_pabd is1895* compressed after decrypting it, we need to decompress the newly1896* decrypted data.1897*/1898if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&1899!HDR_COMPRESSION_ENABLED(hdr)) {1900/*1901* We want to make sure that we are correctly honoring the1902* zfs_abd_scatter_enabled setting, so we allocate an abd here1903* and then loan a buffer from it, rather than allocating a1904* linear buffer and wrapping it in an abd later.1905*/1906cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr, 0);19071908ret = zio_decompress_data(HDR_GET_COMPRESS(hdr),1909hdr->b_l1hdr.b_pabd, cabd, HDR_GET_PSIZE(hdr),1910HDR_GET_LSIZE(hdr), &hdr->b_complevel);1911if (ret != 0) {1912goto error;1913}19141915arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,1916arc_hdr_size(hdr), hdr);1917hdr->b_l1hdr.b_pabd = cabd;1918}19191920return (0);19211922error:1923arc_hdr_free_abd(hdr, B_FALSE);1924if (cabd != NULL)1925arc_free_data_abd(hdr, cabd, arc_hdr_size(hdr), hdr);19261927return (ret);1928}19291930/*1931* This function is called during arc_buf_fill() to prepare the header's1932* abd plaintext pointer for use. This involves authenticated protected1933* data and decrypting encrypted data into the plaintext abd.1934*/1935static int1936arc_fill_hdr_crypt(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, spa_t *spa,1937const zbookmark_phys_t *zb, boolean_t noauth)1938{1939int ret;19401941ASSERT(HDR_PROTECTED(hdr));19421943if (hash_lock != NULL)1944mutex_enter(hash_lock);19451946if (HDR_NOAUTH(hdr) && !noauth) {1947/*1948* The caller requested authenticated data but our data has1949* not been authenticated yet. 
Verify the MAC now if we can.1950*/1951ret = arc_hdr_authenticate(hdr, spa, zb->zb_objset);1952if (ret != 0)1953goto error;1954} else if (HDR_HAS_RABD(hdr) && hdr->b_l1hdr.b_pabd == NULL) {1955/*1956* If we only have the encrypted version of the data, but the1957* unencrypted version was requested we take this opportunity1958* to store the decrypted version in the header for future use.1959*/1960ret = arc_hdr_decrypt(hdr, spa, zb);1961if (ret != 0)1962goto error;1963}19641965ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);19661967if (hash_lock != NULL)1968mutex_exit(hash_lock);19691970return (0);19711972error:1973if (hash_lock != NULL)1974mutex_exit(hash_lock);19751976return (ret);1977}19781979/*1980* This function is used by the dbuf code to decrypt bonus buffers in place.1981* The dbuf code itself doesn't have any locking for decrypting a shared dnode1982* block, so we use the hash lock here to protect against concurrent calls to1983* arc_buf_fill().1984*/1985static void1986arc_buf_untransform_in_place(arc_buf_t *buf)1987{1988arc_buf_hdr_t *hdr = buf->b_hdr;19891990ASSERT(HDR_ENCRYPTED(hdr));1991ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE);1992ASSERT(HDR_EMPTY_OR_LOCKED(hdr));1993ASSERT3PF(hdr->b_l1hdr.b_pabd, !=, NULL, "hdr %px buf %px", hdr, buf);19941995zio_crypt_copy_dnode_bonus(hdr->b_l1hdr.b_pabd, buf->b_data,1996arc_buf_size(buf));1997buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;1998buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;1999}20002001/*2002* Given a buf that has a data buffer attached to it, this function will2003* efficiently fill the buf with data of the specified compression setting from2004* the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr2005* are already sharing a data buf, no copy is performed.2006*2007* If the buf is marked as compressed but uncompressed data was requested, this2008* will allocate a new data buffer for the buf, remove that flag, and fill the2009* buf with uncompressed data. You can't request a compressed buf on a hdr with2010* uncompressed data, and (since we haven't added support for it yet) if you2011* want compressed data your buf must already be marked as compressed and have2012* the correct-sized data buffer.2013*/2014static int2015arc_buf_fill(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb,2016arc_fill_flags_t flags)2017{2018int error = 0;2019arc_buf_hdr_t *hdr = buf->b_hdr;2020boolean_t hdr_compressed =2021(arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);2022boolean_t compressed = (flags & ARC_FILL_COMPRESSED) != 0;2023boolean_t encrypted = (flags & ARC_FILL_ENCRYPTED) != 0;2024dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap;2025kmutex_t *hash_lock = (flags & ARC_FILL_LOCKED) ? NULL : HDR_LOCK(hdr);20262027ASSERT3P(buf->b_data, !=, NULL);2028IMPLY(compressed, hdr_compressed || ARC_BUF_ENCRYPTED(buf));2029IMPLY(compressed, ARC_BUF_COMPRESSED(buf));2030IMPLY(encrypted, HDR_ENCRYPTED(hdr));2031IMPLY(encrypted, ARC_BUF_ENCRYPTED(buf));2032IMPLY(encrypted, ARC_BUF_COMPRESSED(buf));2033IMPLY(encrypted, !arc_buf_is_shared(buf));20342035/*2036* If the caller wanted encrypted data we just need to copy it from2037* b_rabd and potentially byteswap it. We won't be able to do any2038* further transforms on it.2039*/2040if (encrypted) {2041ASSERT(HDR_HAS_RABD(hdr));2042abd_copy_to_buf(buf->b_data, hdr->b_crypt_hdr.b_rabd,2043HDR_GET_PSIZE(hdr));2044goto byteswap;2045}20462047/*2048* Adjust encrypted and authenticated headers to accommodate2049* the request if needed. 
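 * As a rough, illustrative summary of how the arc_fill_flags_t bits are
 * treated by this function (the surrounding code is authoritative):
 *
 *	ARC_FILL_ENCRYPTED	copy the raw b_rabd contents as-is
 *	ARC_FILL_COMPRESSED	the caller wants the compressed form
 *	ARC_FILL_NOAUTH		do not require authentication
 *	ARC_FILL_IN_PLACE	transform dnode bonus buffers in place
 *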
Dnode blocks (ARC_FILL_IN_PLACE) are2050* allowed to fail decryption due to keys not being loaded2051* without being marked as an IO error.2052*/2053if (HDR_PROTECTED(hdr)) {2054error = arc_fill_hdr_crypt(hdr, hash_lock, spa,2055zb, !!(flags & ARC_FILL_NOAUTH));2056if (error == EACCES && (flags & ARC_FILL_IN_PLACE) != 0) {2057return (error);2058} else if (error != 0) {2059if (hash_lock != NULL)2060mutex_enter(hash_lock);2061arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);2062if (hash_lock != NULL)2063mutex_exit(hash_lock);2064return (error);2065}2066}20672068/*2069* There is a special case here for dnode blocks which are2070* decrypting their bonus buffers. These blocks may request to2071* be decrypted in-place. This is necessary because there may2072* be many dnodes pointing into this buffer and there is2073* currently no method to synchronize replacing the backing2074* b_data buffer and updating all of the pointers. Here we use2075* the hash lock to ensure there are no races. If the need2076* arises for other types to be decrypted in-place, they must2077* add handling here as well.2078*/2079if ((flags & ARC_FILL_IN_PLACE) != 0) {2080ASSERT(!hdr_compressed);2081ASSERT(!compressed);2082ASSERT(!encrypted);20832084if (HDR_ENCRYPTED(hdr) && ARC_BUF_ENCRYPTED(buf)) {2085ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE);20862087if (hash_lock != NULL)2088mutex_enter(hash_lock);2089arc_buf_untransform_in_place(buf);2090if (hash_lock != NULL)2091mutex_exit(hash_lock);20922093/* Compute the hdr's checksum if necessary */2094arc_cksum_compute(buf);2095}20962097return (0);2098}20992100if (hdr_compressed == compressed) {2101if (ARC_BUF_SHARED(buf)) {2102ASSERT(arc_buf_is_shared(buf));2103} else {2104abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd,2105arc_buf_size(buf));2106}2107} else {2108ASSERT(hdr_compressed);2109ASSERT(!compressed);21102111/*2112* If the buf is sharing its data with the hdr, unlink it and2113* allocate a new data buffer for the buf.2114*/2115if (ARC_BUF_SHARED(buf)) {2116ASSERTF(ARC_BUF_COMPRESSED(buf),2117"buf %p was uncompressed", buf);21182119/* We need to give the buf its own b_data */2120buf->b_flags &= ~ARC_BUF_FLAG_SHARED;2121buf->b_data =2122arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);2123arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);21242125/* Previously overhead was 0; just add new overhead */2126ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr));2127} else if (ARC_BUF_COMPRESSED(buf)) {2128ASSERT(!arc_buf_is_shared(buf));21292130/* We need to reallocate the buf's b_data */2131arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr),2132buf);2133buf->b_data =2134arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf);21352136/* We increased the size of b_data; update overhead */2137ARCSTAT_INCR(arcstat_overhead_size,2138HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr));2139}21402141/*2142* Regardless of the buf's previous compression settings, it2143* should not be compressed at the end of this function.2144*/2145buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;21462147/*2148* Try copying the data from another buf which already has a2149* decompressed version. 
If that's not possible, it's time to2150* bite the bullet and decompress the data from the hdr.2151*/2152if (arc_buf_try_copy_decompressed_data(buf)) {2153/* Skip byteswapping and checksumming (already done) */2154return (0);2155} else {2156abd_t dabd;2157abd_get_from_buf_struct(&dabd, buf->b_data,2158HDR_GET_LSIZE(hdr));2159error = zio_decompress_data(HDR_GET_COMPRESS(hdr),2160hdr->b_l1hdr.b_pabd, &dabd,2161HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr),2162&hdr->b_complevel);2163abd_free(&dabd);21642165/*2166* Absent hardware errors or software bugs, this should2167* be impossible, but log it anyway so we can debug it.2168*/2169if (error != 0) {2170zfs_dbgmsg(2171"hdr %px, compress %d, psize %d, lsize %d",2172hdr, arc_hdr_get_compress(hdr),2173HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr));2174if (hash_lock != NULL)2175mutex_enter(hash_lock);2176arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);2177if (hash_lock != NULL)2178mutex_exit(hash_lock);2179return (SET_ERROR(EIO));2180}2181}2182}21832184byteswap:2185/* Byteswap the buf's data if necessary */2186if (bswap != DMU_BSWAP_NUMFUNCS) {2187ASSERT(!HDR_SHARED_DATA(hdr));2188ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS);2189dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr));2190}21912192/* Compute the hdr's checksum if necessary */2193arc_cksum_compute(buf);21942195return (0);2196}21972198/*2199* If this function is being called to decrypt an encrypted buffer or verify an2200* authenticated one, the key must be loaded and a mapping must be made2201* available in the keystore via spa_keystore_create_mapping() or one of its2202* callers.2203*/2204int2205arc_untransform(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb,2206boolean_t in_place)2207{2208int ret;2209arc_fill_flags_t flags = 0;22102211if (in_place)2212flags |= ARC_FILL_IN_PLACE;22132214ret = arc_buf_fill(buf, spa, zb, flags);2215if (ret == ECKSUM) {2216/*2217* Convert authentication and decryption errors to EIO2218* (and generate an ereport) before leaving the ARC.2219*/2220ret = SET_ERROR(EIO);2221spa_log_error(spa, zb, buf->b_hdr->b_birth);2222(void) zfs_ereport_post(FM_EREPORT_ZFS_AUTHENTICATION,2223spa, NULL, zb, NULL, 0);2224}22252226return (ret);2227}22282229/*2230* Increment the amount of evictable space in the arc_state_t's refcount.2231* We account for the space used by the hdr and the arc buf individually2232* so that we can add and remove them from the refcount individually.2233*/2234static void2235arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state)2236{2237arc_buf_contents_t type = arc_buf_type(hdr);22382239ASSERT(HDR_HAS_L1HDR(hdr));22402241if (GHOST_STATE(state)) {2242ASSERT0P(hdr->b_l1hdr.b_buf);2243ASSERT0P(hdr->b_l1hdr.b_pabd);2244ASSERT(!HDR_HAS_RABD(hdr));2245(void) zfs_refcount_add_many(&state->arcs_esize[type],2246HDR_GET_LSIZE(hdr), hdr);2247return;2248}22492250if (hdr->b_l1hdr.b_pabd != NULL) {2251(void) zfs_refcount_add_many(&state->arcs_esize[type],2252arc_hdr_size(hdr), hdr);2253}2254if (HDR_HAS_RABD(hdr)) {2255(void) zfs_refcount_add_many(&state->arcs_esize[type],2256HDR_GET_PSIZE(hdr), hdr);2257}22582259for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;2260buf = buf->b_next) {2261if (ARC_BUF_SHARED(buf))2262continue;2263(void) zfs_refcount_add_many(&state->arcs_esize[type],2264arc_buf_size(buf), buf);2265}2266}22672268/*2269* Decrement the amount of evictable space in the arc_state_t's refcount.2270* We account for the space used by the hdr and the arc buf individually2271* so that we can add and remove them from the refcount 
individually.2272*/2273static void2274arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state)2275{2276arc_buf_contents_t type = arc_buf_type(hdr);22772278ASSERT(HDR_HAS_L1HDR(hdr));22792280if (GHOST_STATE(state)) {2281ASSERT0P(hdr->b_l1hdr.b_buf);2282ASSERT0P(hdr->b_l1hdr.b_pabd);2283ASSERT(!HDR_HAS_RABD(hdr));2284(void) zfs_refcount_remove_many(&state->arcs_esize[type],2285HDR_GET_LSIZE(hdr), hdr);2286return;2287}22882289if (hdr->b_l1hdr.b_pabd != NULL) {2290(void) zfs_refcount_remove_many(&state->arcs_esize[type],2291arc_hdr_size(hdr), hdr);2292}2293if (HDR_HAS_RABD(hdr)) {2294(void) zfs_refcount_remove_many(&state->arcs_esize[type],2295HDR_GET_PSIZE(hdr), hdr);2296}22972298for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;2299buf = buf->b_next) {2300if (ARC_BUF_SHARED(buf))2301continue;2302(void) zfs_refcount_remove_many(&state->arcs_esize[type],2303arc_buf_size(buf), buf);2304}2305}23062307/*2308* Add a reference to this hdr indicating that someone is actively2309* referencing that memory. When the refcount transitions from 0 to 1,2310* we remove it from the respective arc_state_t list to indicate that2311* it is not evictable.2312*/2313static void2314add_reference(arc_buf_hdr_t *hdr, const void *tag)2315{2316arc_state_t *state = hdr->b_l1hdr.b_state;23172318ASSERT(HDR_HAS_L1HDR(hdr));2319if (!HDR_EMPTY(hdr) && !MUTEX_HELD(HDR_LOCK(hdr))) {2320ASSERT(state == arc_anon);2321ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));2322ASSERT0P(hdr->b_l1hdr.b_buf);2323}23242325if ((zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) &&2326state != arc_anon && state != arc_l2c_only) {2327/* We don't use the L2-only state list. */2328multilist_remove(&state->arcs_list[arc_buf_type(hdr)], hdr);2329arc_evictable_space_decrement(hdr, state);2330}2331}23322333/*2334* Remove a reference from this hdr. When the reference transitions from2335* 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's2336* list making it eligible for eviction.2337*/2338static int2339remove_reference(arc_buf_hdr_t *hdr, const void *tag)2340{2341int cnt;2342arc_state_t *state = hdr->b_l1hdr.b_state;23432344ASSERT(HDR_HAS_L1HDR(hdr));2345ASSERT(state == arc_anon || MUTEX_HELD(HDR_LOCK(hdr)));2346ASSERT(!GHOST_STATE(state)); /* arc_l2c_only counts as a ghost. */23472348if ((cnt = zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) != 0)2349return (cnt);23502351if (state == arc_anon) {2352arc_hdr_destroy(hdr);2353return (0);2354}2355if (state == arc_uncached && !HDR_PREFETCH(hdr)) {2356arc_change_state(arc_anon, hdr);2357arc_hdr_destroy(hdr);2358return (0);2359}2360multilist_insert(&state->arcs_list[arc_buf_type(hdr)], hdr);2361arc_evictable_space_increment(hdr, state);2362return (0);2363}23642365/*2366* Returns detailed information about a specific arc buffer. When the2367* state_index argument is set the function will calculate the arc header2368* list position for its arc state. Since this requires a linear traversal2369* callers are strongly encourage not to do this. 
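 * A minimal, illustrative sketch of the common case (the "ab" buffer
 * here is hypothetical, and state_index is simply left at 0):
 *
 *	arc_buf_info_t abi;
 *	arc_buf_info(ab, &abi, 0);
 *
 * after which fields such as abi.abi_flags and abi.abi_holds describe
 * the buffer.
 *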
However, it can be helpful2370* for targeted analysis so the functionality is provided.2371*/2372void2373arc_buf_info(arc_buf_t *ab, arc_buf_info_t *abi, int state_index)2374{2375(void) state_index;2376arc_buf_hdr_t *hdr = ab->b_hdr;2377l1arc_buf_hdr_t *l1hdr = NULL;2378l2arc_buf_hdr_t *l2hdr = NULL;2379arc_state_t *state = NULL;23802381memset(abi, 0, sizeof (arc_buf_info_t));23822383if (hdr == NULL)2384return;23852386abi->abi_flags = hdr->b_flags;23872388if (HDR_HAS_L1HDR(hdr)) {2389l1hdr = &hdr->b_l1hdr;2390state = l1hdr->b_state;2391}2392if (HDR_HAS_L2HDR(hdr))2393l2hdr = &hdr->b_l2hdr;23942395if (l1hdr) {2396abi->abi_bufcnt = 0;2397for (arc_buf_t *buf = l1hdr->b_buf; buf; buf = buf->b_next)2398abi->abi_bufcnt++;2399abi->abi_access = l1hdr->b_arc_access;2400abi->abi_mru_hits = l1hdr->b_mru_hits;2401abi->abi_mru_ghost_hits = l1hdr->b_mru_ghost_hits;2402abi->abi_mfu_hits = l1hdr->b_mfu_hits;2403abi->abi_mfu_ghost_hits = l1hdr->b_mfu_ghost_hits;2404abi->abi_holds = zfs_refcount_count(&l1hdr->b_refcnt);2405}24062407if (l2hdr) {2408abi->abi_l2arc_dattr = l2hdr->b_daddr;2409abi->abi_l2arc_hits = l2hdr->b_hits;2410}24112412abi->abi_state_type = state ? state->arcs_state : ARC_STATE_ANON;2413abi->abi_state_contents = arc_buf_type(hdr);2414abi->abi_size = arc_hdr_size(hdr);2415}24162417/*2418* Move the supplied buffer to the indicated state. The hash lock2419* for the buffer must be held by the caller.2420*/2421static void2422arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr)2423{2424arc_state_t *old_state;2425int64_t refcnt;2426boolean_t update_old, update_new;2427arc_buf_contents_t type = arc_buf_type(hdr);24282429/*2430* We almost always have an L1 hdr here, since we call arc_hdr_realloc()2431* in arc_read() when bringing a buffer out of the L2ARC. However, the2432* L1 hdr doesn't always exist when we change state to arc_anon before2433* destroying a header, in which case reallocating to add the L1 hdr is2434* pointless.2435*/2436if (HDR_HAS_L1HDR(hdr)) {2437old_state = hdr->b_l1hdr.b_state;2438refcnt = zfs_refcount_count(&hdr->b_l1hdr.b_refcnt);2439update_old = (hdr->b_l1hdr.b_buf != NULL ||2440hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));24412442IMPLY(GHOST_STATE(old_state), hdr->b_l1hdr.b_buf == NULL);2443IMPLY(GHOST_STATE(new_state), hdr->b_l1hdr.b_buf == NULL);2444IMPLY(old_state == arc_anon, hdr->b_l1hdr.b_buf == NULL ||2445ARC_BUF_LAST(hdr->b_l1hdr.b_buf));2446} else {2447old_state = arc_l2c_only;2448refcnt = 0;2449update_old = B_FALSE;2450}2451update_new = update_old;2452if (GHOST_STATE(old_state))2453update_old = B_TRUE;2454if (GHOST_STATE(new_state))2455update_new = B_TRUE;24562457ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));2458ASSERT3P(new_state, !=, old_state);24592460/*2461* If this buffer is evictable, transfer it from the2462* old state list to the new state list.2463*/2464if (refcnt == 0) {2465if (old_state != arc_anon && old_state != arc_l2c_only) {2466ASSERT(HDR_HAS_L1HDR(hdr));2467/* remove_reference() saves on insert. */2468if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {2469multilist_remove(&old_state->arcs_list[type],2470hdr);2471arc_evictable_space_decrement(hdr, old_state);2472}2473}2474if (new_state != arc_anon && new_state != arc_l2c_only) {2475/*2476* An L1 header always exists here, since if we're2477* moving to some L1-cached state (i.e. 
not l2c_only or2478* anonymous), we realloc the header to add an L1hdr2479* beforehand.2480*/2481ASSERT(HDR_HAS_L1HDR(hdr));2482multilist_insert(&new_state->arcs_list[type], hdr);2483arc_evictable_space_increment(hdr, new_state);2484}2485}24862487ASSERT(!HDR_EMPTY(hdr));2488if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr))2489buf_hash_remove(hdr);24902491/* adjust state sizes (ignore arc_l2c_only) */24922493if (update_new && new_state != arc_l2c_only) {2494ASSERT(HDR_HAS_L1HDR(hdr));2495if (GHOST_STATE(new_state)) {24962497/*2498* When moving a header to a ghost state, we first2499* remove all arc buffers. Thus, we'll have no arc2500* buffer to use for the reference. As a result, we2501* use the arc header pointer for the reference.2502*/2503(void) zfs_refcount_add_many(2504&new_state->arcs_size[type],2505HDR_GET_LSIZE(hdr), hdr);2506ASSERT0P(hdr->b_l1hdr.b_pabd);2507ASSERT(!HDR_HAS_RABD(hdr));2508} else {25092510/*2511* Each individual buffer holds a unique reference,2512* thus we must remove each of these references one2513* at a time.2514*/2515for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;2516buf = buf->b_next) {25172518/*2519* When the arc_buf_t is sharing the data2520* block with the hdr, the owner of the2521* reference belongs to the hdr. Only2522* add to the refcount if the arc_buf_t is2523* not shared.2524*/2525if (ARC_BUF_SHARED(buf))2526continue;25272528(void) zfs_refcount_add_many(2529&new_state->arcs_size[type],2530arc_buf_size(buf), buf);2531}25322533if (hdr->b_l1hdr.b_pabd != NULL) {2534(void) zfs_refcount_add_many(2535&new_state->arcs_size[type],2536arc_hdr_size(hdr), hdr);2537}25382539if (HDR_HAS_RABD(hdr)) {2540(void) zfs_refcount_add_many(2541&new_state->arcs_size[type],2542HDR_GET_PSIZE(hdr), hdr);2543}2544}2545}25462547if (update_old && old_state != arc_l2c_only) {2548ASSERT(HDR_HAS_L1HDR(hdr));2549if (GHOST_STATE(old_state)) {2550ASSERT0P(hdr->b_l1hdr.b_pabd);2551ASSERT(!HDR_HAS_RABD(hdr));25522553/*2554* When moving a header off of a ghost state,2555* the header will not contain any arc buffers.2556* We use the arc header pointer for the reference2557* which is exactly what we did when we put the2558* header on the ghost state.2559*/25602561(void) zfs_refcount_remove_many(2562&old_state->arcs_size[type],2563HDR_GET_LSIZE(hdr), hdr);2564} else {25652566/*2567* Each individual buffer holds a unique reference,2568* thus we must remove each of these references one2569* at a time.2570*/2571for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL;2572buf = buf->b_next) {25732574/*2575* When the arc_buf_t is sharing the data2576* block with the hdr, the owner of the2577* reference belongs to the hdr. 
Only2578* add to the refcount if the arc_buf_t is2579* not shared.2580*/2581if (ARC_BUF_SHARED(buf))2582continue;25832584(void) zfs_refcount_remove_many(2585&old_state->arcs_size[type],2586arc_buf_size(buf), buf);2587}2588ASSERT(hdr->b_l1hdr.b_pabd != NULL ||2589HDR_HAS_RABD(hdr));25902591if (hdr->b_l1hdr.b_pabd != NULL) {2592(void) zfs_refcount_remove_many(2593&old_state->arcs_size[type],2594arc_hdr_size(hdr), hdr);2595}25962597if (HDR_HAS_RABD(hdr)) {2598(void) zfs_refcount_remove_many(2599&old_state->arcs_size[type],2600HDR_GET_PSIZE(hdr), hdr);2601}2602}2603}26042605if (HDR_HAS_L1HDR(hdr)) {2606hdr->b_l1hdr.b_state = new_state;26072608if (HDR_HAS_L2HDR(hdr) && new_state != arc_l2c_only) {2609l2arc_hdr_arcstats_decrement_state(hdr);2610hdr->b_l2hdr.b_arcs_state = new_state->arcs_state;2611l2arc_hdr_arcstats_increment_state(hdr);2612}2613}2614}26152616void2617arc_space_consume(uint64_t space, arc_space_type_t type)2618{2619ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);26202621switch (type) {2622default:2623break;2624case ARC_SPACE_DATA:2625ARCSTAT_INCR(arcstat_data_size, space);2626break;2627case ARC_SPACE_META:2628ARCSTAT_INCR(arcstat_metadata_size, space);2629break;2630case ARC_SPACE_BONUS:2631ARCSTAT_INCR(arcstat_bonus_size, space);2632break;2633case ARC_SPACE_DNODE:2634aggsum_add(&arc_sums.arcstat_dnode_size, space);2635break;2636case ARC_SPACE_DBUF:2637ARCSTAT_INCR(arcstat_dbuf_size, space);2638break;2639case ARC_SPACE_HDRS:2640ARCSTAT_INCR(arcstat_hdr_size, space);2641break;2642case ARC_SPACE_L2HDRS:2643aggsum_add(&arc_sums.arcstat_l2_hdr_size, space);2644break;2645case ARC_SPACE_ABD_CHUNK_WASTE:2646/*2647* Note: this includes space wasted by all scatter ABD's, not2648* just those allocated by the ARC. But the vast majority of2649* scatter ABD's come from the ARC, because other users are2650* very short-lived.2651*/2652ARCSTAT_INCR(arcstat_abd_chunk_waste_size, space);2653break;2654}26552656if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE)2657ARCSTAT_INCR(arcstat_meta_used, space);26582659aggsum_add(&arc_sums.arcstat_size, space);2660}26612662void2663arc_space_return(uint64_t space, arc_space_type_t type)2664{2665ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);26662667switch (type) {2668default:2669break;2670case ARC_SPACE_DATA:2671ARCSTAT_INCR(arcstat_data_size, -space);2672break;2673case ARC_SPACE_META:2674ARCSTAT_INCR(arcstat_metadata_size, -space);2675break;2676case ARC_SPACE_BONUS:2677ARCSTAT_INCR(arcstat_bonus_size, -space);2678break;2679case ARC_SPACE_DNODE:2680aggsum_add(&arc_sums.arcstat_dnode_size, -space);2681break;2682case ARC_SPACE_DBUF:2683ARCSTAT_INCR(arcstat_dbuf_size, -space);2684break;2685case ARC_SPACE_HDRS:2686ARCSTAT_INCR(arcstat_hdr_size, -space);2687break;2688case ARC_SPACE_L2HDRS:2689aggsum_add(&arc_sums.arcstat_l2_hdr_size, -space);2690break;2691case ARC_SPACE_ABD_CHUNK_WASTE:2692ARCSTAT_INCR(arcstat_abd_chunk_waste_size, -space);2693break;2694}26952696if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE)2697ARCSTAT_INCR(arcstat_meta_used, -space);26982699ASSERT(aggsum_compare(&arc_sums.arcstat_size, space) >= 0);2700aggsum_add(&arc_sums.arcstat_size, -space);2701}27022703/*2704* Given a hdr and a buf, returns whether that buf can share its b_data buffer2705* with the hdr's b_pabd.2706*/2707static boolean_t2708arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf)2709{2710/*2711* The criteria for sharing a hdr's data are:2712* 1. the buffer is not encrypted2713* 2. the hdr's compression matches the buf's compression2714* 3. 
the hdr doesn't need to be byteswapped
	 * 4. the hdr isn't already being shared
	 * 5. the buf is either compressed or it is the last buf in the hdr list
	 *
	 * Criterion #5 maintains the invariant that shared uncompressed
	 * bufs must be the final buf in the hdr's b_buf list. Reading this, you
	 * might ask, "if a compressed buf is allocated first, won't that be the
	 * last thing in the list?", but in that case it's impossible to create
	 * a shared uncompressed buf anyway (because the hdr must be compressed
	 * to have the compressed buf). You might also think that #3 is
	 * sufficient to make this guarantee, however it's possible
	 * (specifically in the rare L2ARC write race mentioned in
	 * arc_buf_alloc_impl()) there will be an existing uncompressed buf that
	 * is shareable, but wasn't at the time of its allocation. Rather than
	 * allow a new shared uncompressed buf to be created and then shuffle
	 * the list around to make it the last element, this simply disallows
	 * sharing if the new buf isn't the first to be added.
	 */
	ASSERT3P(buf->b_hdr, ==, hdr);
	boolean_t hdr_compressed =
	    arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF;
	boolean_t buf_compressed = ARC_BUF_COMPRESSED(buf) != 0;
	return (!ARC_BUF_ENCRYPTED(buf) &&
	    buf_compressed == hdr_compressed &&
	    hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS &&
	    !HDR_SHARED_DATA(hdr) &&
	    (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf)));
}

/*
 * Allocate a buf for this hdr. If you care about the data that's in the hdr,
 * or if you want a compressed buffer, pass those flags in. Returns 0 if the
 * copy was made successfully, or an error code otherwise.
 */
static int
arc_buf_alloc_impl(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb,
    const void *tag, boolean_t encrypted, boolean_t compressed,
    boolean_t noauth, boolean_t fill, arc_buf_t **ret)
{
	arc_buf_t *buf;
	arc_fill_flags_t flags = ARC_FILL_LOCKED;

	ASSERT(HDR_HAS_L1HDR(hdr));
	ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);
	VERIFY(hdr->b_type == ARC_BUFC_DATA ||
	    hdr->b_type == ARC_BUFC_METADATA);
	ASSERT3P(ret, !=, NULL);
	ASSERT0P(*ret);
	IMPLY(encrypted, compressed);

	buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
	buf->b_hdr = hdr;
	buf->b_data = NULL;
	buf->b_next = hdr->b_l1hdr.b_buf;
	buf->b_flags = 0;

	add_reference(hdr, tag);

	/*
	 * We're about to change the hdr's b_flags. We must either
	 * hold the hash_lock or be undiscoverable.
	 */
	ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

	/*
	 * Only honor requests for compressed bufs if the hdr is actually
	 * compressed. This must be overridden if the buffer is encrypted since
	 * encrypted buffers cannot be decompressed.
	 */
	if (encrypted) {
		buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
		buf->b_flags |= ARC_BUF_FLAG_ENCRYPTED;
		flags |= ARC_FILL_COMPRESSED | ARC_FILL_ENCRYPTED;
	} else if (compressed &&
	    arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {
		buf->b_flags |= ARC_BUF_FLAG_COMPRESSED;
		flags |= ARC_FILL_COMPRESSED;
	}

	if (noauth) {
		ASSERT0(encrypted);
		flags |= ARC_FILL_NOAUTH;
	}

	/*
	 * If the hdr's data can be shared then we share the data buffer and
	 * set the appropriate bit in the hdr's b_flags to indicate the hdr is
	 * sharing its b_pabd with the arc_buf_t.
Otherwise, we allocate a new2802* buffer to store the buf's data.2803*2804* There are two additional restrictions here because we're sharing2805* hdr -> buf instead of the usual buf -> hdr. First, the hdr can't be2806* actively involved in an L2ARC write, because if this buf is used by2807* an arc_write() then the hdr's data buffer will be released when the2808* write completes, even though the L2ARC write might still be using it.2809* Second, the hdr's ABD must be linear so that the buf's user doesn't2810* need to be ABD-aware. It must be allocated via2811* zio_[data_]buf_alloc(), not as a page, because we need to be able2812* to abd_release_ownership_of_buf(), which isn't allowed on "linear2813* page" buffers because the ABD code needs to handle freeing them2814* specially.2815*/2816boolean_t can_share = arc_can_share(hdr, buf) &&2817!HDR_L2_WRITING(hdr) &&2818hdr->b_l1hdr.b_pabd != NULL &&2819abd_is_linear(hdr->b_l1hdr.b_pabd) &&2820!abd_is_linear_page(hdr->b_l1hdr.b_pabd);28212822/* Set up b_data and sharing */2823if (can_share) {2824buf->b_data = abd_to_buf(hdr->b_l1hdr.b_pabd);2825buf->b_flags |= ARC_BUF_FLAG_SHARED;2826arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);2827} else {2828buf->b_data =2829arc_get_data_buf(hdr, arc_buf_size(buf), buf);2830ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));2831}2832VERIFY3P(buf->b_data, !=, NULL);28332834hdr->b_l1hdr.b_buf = buf;28352836/*2837* If the user wants the data from the hdr, we need to either copy or2838* decompress the data.2839*/2840if (fill) {2841ASSERT3P(zb, !=, NULL);2842return (arc_buf_fill(buf, spa, zb, flags));2843}28442845return (0);2846}28472848static const char *arc_onloan_tag = "onloan";28492850static inline void2851arc_loaned_bytes_update(int64_t delta)2852{2853atomic_add_64(&arc_loaned_bytes, delta);28542855/* assert that it did not wrap around */2856ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);2857}28582859/*2860* Loan out an anonymous arc buffer. Loaned buffers are not counted as in2861* flight data by arc_tempreserve_space() until they are "returned". Loaned2862* buffers must be returned to the arc before they can be used by the DMU or2863* freed.2864*/2865arc_buf_t *2866arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size)2867{2868arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag,2869is_metadata ? 
ARC_BUFC_METADATA : ARC_BUFC_DATA, size);28702871arc_loaned_bytes_update(arc_buf_size(buf));28722873return (buf);2874}28752876arc_buf_t *2877arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize,2878enum zio_compress compression_type, uint8_t complevel)2879{2880arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag,2881psize, lsize, compression_type, complevel);28822883arc_loaned_bytes_update(arc_buf_size(buf));28842885return (buf);2886}28872888arc_buf_t *2889arc_loan_raw_buf(spa_t *spa, uint64_t dsobj, boolean_t byteorder,2890const uint8_t *salt, const uint8_t *iv, const uint8_t *mac,2891dmu_object_type_t ot, uint64_t psize, uint64_t lsize,2892enum zio_compress compression_type, uint8_t complevel)2893{2894arc_buf_t *buf = arc_alloc_raw_buf(spa, arc_onloan_tag, dsobj,2895byteorder, salt, iv, mac, ot, psize, lsize, compression_type,2896complevel);28972898atomic_add_64(&arc_loaned_bytes, psize);2899return (buf);2900}290129022903/*2904* Return a loaned arc buffer to the arc.2905*/2906void2907arc_return_buf(arc_buf_t *buf, const void *tag)2908{2909arc_buf_hdr_t *hdr = buf->b_hdr;29102911ASSERT3P(buf->b_data, !=, NULL);2912ASSERT(HDR_HAS_L1HDR(hdr));2913(void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag);2914(void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);29152916arc_loaned_bytes_update(-arc_buf_size(buf));2917}29182919/* Detach an arc_buf from a dbuf (tag) */2920void2921arc_loan_inuse_buf(arc_buf_t *buf, const void *tag)2922{2923arc_buf_hdr_t *hdr = buf->b_hdr;29242925ASSERT3P(buf->b_data, !=, NULL);2926ASSERT(HDR_HAS_L1HDR(hdr));2927(void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag);2928(void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag);29292930arc_loaned_bytes_update(arc_buf_size(buf));2931}29322933static void2934l2arc_free_abd_on_write(abd_t *abd, size_t size, arc_buf_contents_t type)2935{2936l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP);29372938df->l2df_abd = abd;2939df->l2df_size = size;2940df->l2df_type = type;2941mutex_enter(&l2arc_free_on_write_mtx);2942list_insert_head(l2arc_free_on_write, df);2943mutex_exit(&l2arc_free_on_write_mtx);2944}29452946static void2947arc_hdr_free_on_write(arc_buf_hdr_t *hdr, boolean_t free_rdata)2948{2949arc_state_t *state = hdr->b_l1hdr.b_state;2950arc_buf_contents_t type = arc_buf_type(hdr);2951uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr);29522953/* protected by hash lock, if in the hash table */2954if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {2955ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));2956ASSERT(state != arc_anon && state != arc_l2c_only);29572958(void) zfs_refcount_remove_many(&state->arcs_esize[type],2959size, hdr);2960}2961(void) zfs_refcount_remove_many(&state->arcs_size[type], size, hdr);2962if (type == ARC_BUFC_METADATA) {2963arc_space_return(size, ARC_SPACE_META);2964} else {2965ASSERT(type == ARC_BUFC_DATA);2966arc_space_return(size, ARC_SPACE_DATA);2967}29682969if (free_rdata) {2970l2arc_free_abd_on_write(hdr->b_crypt_hdr.b_rabd, size, type);2971} else {2972l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type);2973}2974}29752976/*2977* Share the arc_buf_t's data with the hdr. 
Whenever we are sharing the2978* data buffer, we transfer the refcount ownership to the hdr and update2979* the appropriate kstats.2980*/2981static void2982arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)2983{2984ASSERT(arc_can_share(hdr, buf));2985ASSERT0P(hdr->b_l1hdr.b_pabd);2986ASSERT(!ARC_BUF_ENCRYPTED(buf));2987ASSERT(HDR_EMPTY_OR_LOCKED(hdr));29882989/*2990* Start sharing the data buffer. We transfer the2991* refcount ownership to the hdr since it always owns2992* the refcount whenever an arc_buf_t is shared.2993*/2994zfs_refcount_transfer_ownership_many(2995&hdr->b_l1hdr.b_state->arcs_size[arc_buf_type(hdr)],2996arc_hdr_size(hdr), buf, hdr);2997hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf));2998abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd,2999HDR_ISTYPE_METADATA(hdr));3000arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA);3001buf->b_flags |= ARC_BUF_FLAG_SHARED;30023003/*3004* Since we've transferred ownership to the hdr we need3005* to increment its compressed and uncompressed kstats and3006* decrement the overhead size.3007*/3008ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr));3009ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));3010ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf));3011}30123013static void3014arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf)3015{3016ASSERT(arc_buf_is_shared(buf));3017ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);3018ASSERT(HDR_EMPTY_OR_LOCKED(hdr));30193020/*3021* We are no longer sharing this buffer so we need3022* to transfer its ownership to the rightful owner.3023*/3024zfs_refcount_transfer_ownership_many(3025&hdr->b_l1hdr.b_state->arcs_size[arc_buf_type(hdr)],3026arc_hdr_size(hdr), hdr, buf);3027arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);3028abd_release_ownership_of_buf(hdr->b_l1hdr.b_pabd);3029abd_free(hdr->b_l1hdr.b_pabd);3030hdr->b_l1hdr.b_pabd = NULL;3031buf->b_flags &= ~ARC_BUF_FLAG_SHARED;30323033/*3034* Since the buffer is no longer shared between3035* the arc buf and the hdr, count it as overhead.3036*/3037ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr));3038ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));3039ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf));3040}30413042/*3043* Remove an arc_buf_t from the hdr's buf list and return the last3044* arc_buf_t on the list. If no buffers remain on the list then return3045* NULL.3046*/3047static arc_buf_t *3048arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf)3049{3050ASSERT(HDR_HAS_L1HDR(hdr));3051ASSERT(HDR_EMPTY_OR_LOCKED(hdr));30523053arc_buf_t **bufp = &hdr->b_l1hdr.b_buf;3054arc_buf_t *lastbuf = NULL;30553056/*3057* Remove the buf from the hdr list and locate the last3058* remaining buffer on the list.3059*/3060while (*bufp != NULL) {3061if (*bufp == buf)3062*bufp = buf->b_next;30633064/*3065* If we've removed a buffer in the middle of3066* the list then update the lastbuf and update3067* bufp.3068*/3069if (*bufp != NULL) {3070lastbuf = *bufp;3071bufp = &(*bufp)->b_next;3072}3073}3074buf->b_next = NULL;3075ASSERT3P(lastbuf, !=, buf);3076IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf));30773078return (lastbuf);3079}30803081/*3082* Free up buf->b_data and pull the arc_buf_t off of the arc_buf_hdr_t's3083* list and free it.3084*/3085static void3086arc_buf_destroy_impl(arc_buf_t *buf)3087{3088arc_buf_hdr_t *hdr = buf->b_hdr;30893090/*3091* Free up the data associated with the buf but only if we're not3092* sharing this with the hdr. 
If we are sharing it with the hdr, the
	 * hdr is responsible for doing the free.
	 */
	if (buf->b_data != NULL) {
		/*
		 * We're about to change the hdr's b_flags. We must either
		 * hold the hash_lock or be undiscoverable.
		 */
		ASSERT(HDR_EMPTY_OR_LOCKED(hdr));

		arc_cksum_verify(buf);
		arc_buf_unwatch(buf);

		if (ARC_BUF_SHARED(buf)) {
			arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA);
		} else {
			ASSERT(!arc_buf_is_shared(buf));
			uint64_t size = arc_buf_size(buf);
			arc_free_data_buf(hdr, buf->b_data, size, buf);
			ARCSTAT_INCR(arcstat_overhead_size, -size);
		}
		buf->b_data = NULL;

		/*
		 * If we have no more encrypted buffers and we've already
		 * gotten a copy of the decrypted data we can free b_rabd
		 * to save some space.
		 */
		if (ARC_BUF_ENCRYPTED(buf) && HDR_HAS_RABD(hdr) &&
		    hdr->b_l1hdr.b_pabd != NULL && !HDR_IO_IN_PROGRESS(hdr)) {
			arc_buf_t *b;
			for (b = hdr->b_l1hdr.b_buf; b; b = b->b_next) {
				if (b != buf && ARC_BUF_ENCRYPTED(b))
					break;
			}
			if (b == NULL)
				arc_hdr_free_abd(hdr, B_TRUE);
		}
	}

	arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);

	if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {
		/*
		 * If the current arc_buf_t is sharing its data buffer with the
		 * hdr, then reassign the hdr's b_pabd to share it with the new
		 * buffer at the end of the list. The shared buffer is always
		 * the last one on the hdr's buffer list.
		 *
		 * There is an equivalent case for compressed bufs, but since
		 * they aren't guaranteed to be the last buf in the list and
		 * that is an exceedingly rare case, we just allow that space
		 * to be wasted temporarily. We must also be careful not to
		 * share encrypted buffers, since they cannot be shared.
		 */
		if (lastbuf != NULL && !ARC_BUF_ENCRYPTED(lastbuf)) {
			/* Only one buf can be shared at once */
			ASSERT(!arc_buf_is_shared(lastbuf));
			/* hdr is uncompressed so can't have compressed buf */
			ASSERT(!ARC_BUF_COMPRESSED(lastbuf));

			ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);
			arc_hdr_free_abd(hdr, B_FALSE);

			/*
			 * We must set up a new shared block between the
			 * last buffer and the hdr. The data would have
			 * been allocated by the arc buf so we need to transfer
			 * ownership to the hdr since it's now being shared.
			 */
			arc_share_buf(hdr, lastbuf);
		}
	} else if (HDR_SHARED_DATA(hdr)) {
		/*
		 * Uncompressed shared buffers are always at the end
		 * of the list. Compressed buffers don't have the
		 * same requirements.
This makes it hard to3169* simply assert that the lastbuf is shared so3170* we rely on the hdr's compression flags to determine3171* if we have a compressed, shared buffer.3172*/3173ASSERT3P(lastbuf, !=, NULL);3174ASSERT(arc_buf_is_shared(lastbuf) ||3175arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);3176}31773178/*3179* Free the checksum if we're removing the last uncompressed buf from3180* this hdr.3181*/3182if (!arc_hdr_has_uncompressed_buf(hdr)) {3183arc_cksum_free(hdr);3184}31853186/* clean up the buf */3187buf->b_hdr = NULL;3188kmem_cache_free(buf_cache, buf);3189}31903191static void3192arc_hdr_alloc_abd(arc_buf_hdr_t *hdr, int alloc_flags)3193{3194uint64_t size;3195boolean_t alloc_rdata = ((alloc_flags & ARC_HDR_ALLOC_RDATA) != 0);31963197ASSERT3U(HDR_GET_LSIZE(hdr), >, 0);3198ASSERT(HDR_HAS_L1HDR(hdr));3199ASSERT(!HDR_SHARED_DATA(hdr) || alloc_rdata);3200IMPLY(alloc_rdata, HDR_PROTECTED(hdr));32013202if (alloc_rdata) {3203size = HDR_GET_PSIZE(hdr);3204ASSERT0P(hdr->b_crypt_hdr.b_rabd);3205hdr->b_crypt_hdr.b_rabd = arc_get_data_abd(hdr, size, hdr,3206alloc_flags);3207ASSERT3P(hdr->b_crypt_hdr.b_rabd, !=, NULL);3208ARCSTAT_INCR(arcstat_raw_size, size);3209} else {3210size = arc_hdr_size(hdr);3211ASSERT0P(hdr->b_l1hdr.b_pabd);3212hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, size, hdr,3213alloc_flags);3214ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);3215}32163217ARCSTAT_INCR(arcstat_compressed_size, size);3218ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr));3219}32203221static void3222arc_hdr_free_abd(arc_buf_hdr_t *hdr, boolean_t free_rdata)3223{3224uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr);32253226ASSERT(HDR_HAS_L1HDR(hdr));3227ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));3228IMPLY(free_rdata, HDR_HAS_RABD(hdr));32293230/*3231* If the hdr is currently being written to the l2arc then3232* we defer freeing the data by adding it to the l2arc_free_on_write3233* list. The l2arc will free the data once it's finished3234* writing it to the l2arc device.3235*/3236if (HDR_L2_WRITING(hdr)) {3237arc_hdr_free_on_write(hdr, free_rdata);3238ARCSTAT_BUMP(arcstat_l2_free_on_write);3239} else if (free_rdata) {3240arc_free_data_abd(hdr, hdr->b_crypt_hdr.b_rabd, size, hdr);3241} else {3242arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, size, hdr);3243}32443245if (free_rdata) {3246hdr->b_crypt_hdr.b_rabd = NULL;3247ARCSTAT_INCR(arcstat_raw_size, -size);3248} else {3249hdr->b_l1hdr.b_pabd = NULL;3250}32513252if (hdr->b_l1hdr.b_pabd == NULL && !HDR_HAS_RABD(hdr))3253hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;32543255ARCSTAT_INCR(arcstat_compressed_size, -size);3256ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr));3257}32583259/*3260* Allocate empty anonymous ARC header. The header will get its identity3261* assigned and buffers attached later as part of read or write operations.3262*3263* In case of read arc_read() assigns header its identify (b_dva + b_birth),3264* inserts it into ARC hash to become globally visible and allocates physical3265* (b_pabd) or raw (b_rabd) ABD buffer to read into from disk. On disk read3266* completion arc_read_done() allocates ARC buffer(s) as needed, potentially3267* sharing one of them with the physical ABD buffer.3268*3269* In case of write arc_alloc_buf() allocates ARC buffer to be filled with3270* data. Then after compression and/or encryption arc_write_ready() allocates3271* and fills (or potentially shares) physical (b_pabd) or raw (b_rabd) ABD3272* buffer. 
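 * (A condensed, order-only sketch of the two cases described in this
 * comment; illustrative only, not a strict call graph:
 *
 *	read:	arc_hdr_alloc() -> arc_read(): identity, hash insert,
 *		b_pabd/b_rabd allocation -> disk -> arc_read_done(): bufs
 *	write:	arc_alloc_buf() -> arc_write_ready(): b_pabd/b_rabd ->
 *		disk -> arc_write_done(): identity, hash insert
 *
 * The partial overwrite case described below composes the two.)
 *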
On disk write completion arc_write_done() assigns the header its3273* new identity (b_dva + b_birth) and inserts into ARC hash.3274*3275* In case of partial overwrite the old data is read first as described. Then3276* arc_release() either allocates new anonymous ARC header and moves the ARC3277* buffer to it, or reuses the old ARC header by discarding its identity and3278* removing it from ARC hash. After buffer modification normal write process3279* follows as described.3280*/3281static arc_buf_hdr_t *3282arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize,3283boolean_t protected, enum zio_compress compression_type, uint8_t complevel,3284arc_buf_contents_t type)3285{3286arc_buf_hdr_t *hdr;32873288VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA);3289hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE);32903291ASSERT(HDR_EMPTY(hdr));3292#ifdef ZFS_DEBUG3293ASSERT0P(hdr->b_l1hdr.b_freeze_cksum);3294#endif3295HDR_SET_PSIZE(hdr, psize);3296HDR_SET_LSIZE(hdr, lsize);3297hdr->b_spa = spa;3298hdr->b_type = type;3299hdr->b_flags = 0;3300arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR);3301arc_hdr_set_compress(hdr, compression_type);3302hdr->b_complevel = complevel;3303if (protected)3304arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);33053306hdr->b_l1hdr.b_state = arc_anon;3307hdr->b_l1hdr.b_arc_access = 0;3308hdr->b_l1hdr.b_mru_hits = 0;3309hdr->b_l1hdr.b_mru_ghost_hits = 0;3310hdr->b_l1hdr.b_mfu_hits = 0;3311hdr->b_l1hdr.b_mfu_ghost_hits = 0;3312hdr->b_l1hdr.b_buf = NULL;33133314ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));33153316return (hdr);3317}33183319/*3320* Transition between the two allocation states for the arc_buf_hdr struct.3321* The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without3322* (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller3323* version is used when a cache buffer is only in the L2ARC in order to reduce3324* memory usage.3325*/3326static arc_buf_hdr_t *3327arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new)3328{3329ASSERT(HDR_HAS_L2HDR(hdr));33303331arc_buf_hdr_t *nhdr;3332l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;33333334ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) ||3335(old == hdr_l2only_cache && new == hdr_full_cache));33363337nhdr = kmem_cache_alloc(new, KM_PUSHPAGE);33383339ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));3340buf_hash_remove(hdr);33413342memcpy(nhdr, hdr, HDR_L2ONLY_SIZE);33433344if (new == hdr_full_cache) {3345arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR);3346/*3347* arc_access and arc_change_state need to be aware that a3348* header has just come out of L2ARC, so we set its state to3349* l2c_only even though it's about to change.3350*/3351nhdr->b_l1hdr.b_state = arc_l2c_only;33523353/* Verify previous threads set to NULL before freeing */3354ASSERT0P(nhdr->b_l1hdr.b_pabd);3355ASSERT(!HDR_HAS_RABD(hdr));3356} else {3357ASSERT0P(hdr->b_l1hdr.b_buf);3358#ifdef ZFS_DEBUG3359ASSERT0P(hdr->b_l1hdr.b_freeze_cksum);3360#endif33613362/*3363* If we've reached here, We must have been called from3364* arc_evict_hdr(), as such we should have already been3365* removed from any ghost list we were previously on3366* (which protects us from racing with arc_evict_state),3367* thus no locking is needed during this check.3368*/3369ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));33703371/*3372* A buffer must not be moved into the arc_l2c_only3373* state if it's not finished being written out to the3374* l2arc device. 
Otherwise, the b_l1hdr.b_pabd field3375* might try to be accessed, even though it was removed.3376*/3377VERIFY(!HDR_L2_WRITING(hdr));3378VERIFY0P(hdr->b_l1hdr.b_pabd);3379ASSERT(!HDR_HAS_RABD(hdr));33803381arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR);3382}3383/*3384* The header has been reallocated so we need to re-insert it into any3385* lists it was on.3386*/3387(void) buf_hash_insert(nhdr, NULL);33883389ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node));33903391mutex_enter(&dev->l2ad_mtx);33923393/*3394* We must place the realloc'ed header back into the list at3395* the same spot. Otherwise, if it's placed earlier in the list,3396* l2arc_write_buffers() could find it during the function's3397* write phase, and try to write it out to the l2arc.3398*/3399list_insert_after(&dev->l2ad_buflist, hdr, nhdr);3400list_remove(&dev->l2ad_buflist, hdr);34013402mutex_exit(&dev->l2ad_mtx);34033404/*3405* Since we're using the pointer address as the tag when3406* incrementing and decrementing the l2ad_alloc refcount, we3407* must remove the old pointer (that we're about to destroy) and3408* add the new pointer to the refcount. Otherwise we'd remove3409* the wrong pointer address when calling arc_hdr_destroy() later.3410*/34113412(void) zfs_refcount_remove_many(&dev->l2ad_alloc,3413arc_hdr_size(hdr), hdr);3414(void) zfs_refcount_add_many(&dev->l2ad_alloc,3415arc_hdr_size(nhdr), nhdr);34163417buf_discard_identity(hdr);3418kmem_cache_free(old, hdr);34193420return (nhdr);3421}34223423/*3424* This function is used by the send / receive code to convert a newly3425* allocated arc_buf_t to one that is suitable for a raw encrypted write. It3426* is also used to allow the root objset block to be updated without altering3427* its embedded MACs. Both block types will always be uncompressed so we do not3428* have to worry about compression type or psize.3429*/3430void3431arc_convert_to_raw(arc_buf_t *buf, uint64_t dsobj, boolean_t byteorder,3432dmu_object_type_t ot, const uint8_t *salt, const uint8_t *iv,3433const uint8_t *mac)3434{3435arc_buf_hdr_t *hdr = buf->b_hdr;34363437ASSERT(ot == DMU_OT_DNODE || ot == DMU_OT_OBJSET);3438ASSERT(HDR_HAS_L1HDR(hdr));3439ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);34403441buf->b_flags |= (ARC_BUF_FLAG_COMPRESSED | ARC_BUF_FLAG_ENCRYPTED);3442arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);3443hdr->b_crypt_hdr.b_dsobj = dsobj;3444hdr->b_crypt_hdr.b_ot = ot;3445hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?3446DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);3447if (!arc_hdr_has_uncompressed_buf(hdr))3448arc_cksum_free(hdr);34493450if (salt != NULL)3451memcpy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN);3452if (iv != NULL)3453memcpy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN);3454if (mac != NULL)3455memcpy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN);3456}34573458/*3459* Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller.3460* The buf is returned thawed since we expect the consumer to modify it.3461*/3462arc_buf_t *3463arc_alloc_buf(spa_t *spa, const void *tag, arc_buf_contents_t type,3464int32_t size)3465{3466arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size,3467B_FALSE, ZIO_COMPRESS_OFF, 0, type);34683469arc_buf_t *buf = NULL;3470VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE, B_FALSE,3471B_FALSE, B_FALSE, &buf));3472arc_buf_thaw(buf);34733474return (buf);3475}34763477/*3478* Allocate a compressed buf in the same manner as arc_alloc_buf. 
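 * An illustrative sketch only (the tag, sizes, and compression settings
 * shown here are hypothetical, not a recommendation):
 *
 *	arc_buf_t *abuf = arc_alloc_compressed_buf(spa, FTAG, psize,
 *	    lsize, ZIO_COMPRESS_LZ4, 0);
 *
 * The returned buf is thawed and sized to hold psize bytes of
 * compressed data (see arc_buf_size()).
 *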
Don't use this3479* for bufs containing metadata.3480*/3481arc_buf_t *3482arc_alloc_compressed_buf(spa_t *spa, const void *tag, uint64_t psize,3483uint64_t lsize, enum zio_compress compression_type, uint8_t complevel)3484{3485ASSERT3U(lsize, >, 0);3486ASSERT3U(lsize, >=, psize);3487ASSERT3U(compression_type, >, ZIO_COMPRESS_OFF);3488ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS);34893490arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize,3491B_FALSE, compression_type, complevel, ARC_BUFC_DATA);34923493arc_buf_t *buf = NULL;3494VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE,3495B_TRUE, B_FALSE, B_FALSE, &buf));3496arc_buf_thaw(buf);34973498/*3499* To ensure that the hdr has the correct data in it if we call3500* arc_untransform() on this buf before it's been written to disk,3501* it's easiest if we just set up sharing between the buf and the hdr.3502*/3503arc_share_buf(hdr, buf);35043505return (buf);3506}35073508arc_buf_t *3509arc_alloc_raw_buf(spa_t *spa, const void *tag, uint64_t dsobj,3510boolean_t byteorder, const uint8_t *salt, const uint8_t *iv,3511const uint8_t *mac, dmu_object_type_t ot, uint64_t psize, uint64_t lsize,3512enum zio_compress compression_type, uint8_t complevel)3513{3514arc_buf_hdr_t *hdr;3515arc_buf_t *buf;3516arc_buf_contents_t type = DMU_OT_IS_METADATA(ot) ?3517ARC_BUFC_METADATA : ARC_BUFC_DATA;35183519ASSERT3U(lsize, >, 0);3520ASSERT3U(lsize, >=, psize);3521ASSERT3U(compression_type, >=, ZIO_COMPRESS_OFF);3522ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS);35233524hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, B_TRUE,3525compression_type, complevel, type);35263527hdr->b_crypt_hdr.b_dsobj = dsobj;3528hdr->b_crypt_hdr.b_ot = ot;3529hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ?3530DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot);3531memcpy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN);3532memcpy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN);3533memcpy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN);35343535/*3536* This buffer will be considered encrypted even if the ot is not an3537* encrypted type. It will become authenticated instead in3538* arc_write_ready().3539*/3540buf = NULL;3541VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_TRUE, B_TRUE,3542B_FALSE, B_FALSE, &buf));3543arc_buf_thaw(buf);35443545return (buf);3546}35473548static void3549l2arc_hdr_arcstats_update(arc_buf_hdr_t *hdr, boolean_t incr,3550boolean_t state_only)3551{3552uint64_t lsize = HDR_GET_LSIZE(hdr);3553uint64_t psize = HDR_GET_PSIZE(hdr);3554uint64_t asize = HDR_GET_L2SIZE(hdr);3555arc_buf_contents_t type = hdr->b_type;3556int64_t lsize_s;3557int64_t psize_s;3558int64_t asize_s;35593560/* For L2 we expect the header's b_l2size to be valid */3561ASSERT3U(asize, >=, psize);35623563if (incr) {3564lsize_s = lsize;3565psize_s = psize;3566asize_s = asize;3567} else {3568lsize_s = -lsize;3569psize_s = -psize;3570asize_s = -asize;3571}35723573/* If the buffer is a prefetch, count it as such. */3574if (HDR_PREFETCH(hdr)) {3575ARCSTAT_INCR(arcstat_l2_prefetch_asize, asize_s);3576} else {3577/*3578* We use the value stored in the L2 header upon initial3579* caching in L2ARC. This value will be updated in case3580* an MRU/MRU_ghost buffer transitions to MFU but the L2ARC3581* metadata (log entry) cannot currently be updated. 
Having3582* the ARC state in the L2 header solves the problem of a3583* possibly absent L1 header (apparent in buffers restored3584* from persistent L2ARC).3585*/3586switch (hdr->b_l2hdr.b_arcs_state) {3587case ARC_STATE_MRU_GHOST:3588case ARC_STATE_MRU:3589ARCSTAT_INCR(arcstat_l2_mru_asize, asize_s);3590break;3591case ARC_STATE_MFU_GHOST:3592case ARC_STATE_MFU:3593ARCSTAT_INCR(arcstat_l2_mfu_asize, asize_s);3594break;3595default:3596break;3597}3598}35993600if (state_only)3601return;36023603ARCSTAT_INCR(arcstat_l2_psize, psize_s);3604ARCSTAT_INCR(arcstat_l2_lsize, lsize_s);36053606switch (type) {3607case ARC_BUFC_DATA:3608ARCSTAT_INCR(arcstat_l2_bufc_data_asize, asize_s);3609break;3610case ARC_BUFC_METADATA:3611ARCSTAT_INCR(arcstat_l2_bufc_metadata_asize, asize_s);3612break;3613default:3614break;3615}3616}361736183619static void3620arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr)3621{3622l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr;3623l2arc_dev_t *dev = l2hdr->b_dev;36243625ASSERT(MUTEX_HELD(&dev->l2ad_mtx));3626ASSERT(HDR_HAS_L2HDR(hdr));36273628list_remove(&dev->l2ad_buflist, hdr);36293630l2arc_hdr_arcstats_decrement(hdr);3631if (dev->l2ad_vdev != NULL) {3632uint64_t asize = HDR_GET_L2SIZE(hdr);3633vdev_space_update(dev->l2ad_vdev, -asize, 0, 0);3634}36353636(void) zfs_refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr),3637hdr);3638arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);3639}36403641static void3642arc_hdr_destroy(arc_buf_hdr_t *hdr)3643{3644if (HDR_HAS_L1HDR(hdr)) {3645ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));3646ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon);3647}3648ASSERT(!HDR_IO_IN_PROGRESS(hdr));3649ASSERT(!HDR_IN_HASH_TABLE(hdr));36503651if (HDR_HAS_L2HDR(hdr)) {3652l2arc_dev_t *dev = hdr->b_l2hdr.b_dev;3653boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx);36543655if (!buflist_held)3656mutex_enter(&dev->l2ad_mtx);36573658/*3659* Even though we checked this conditional above, we3660* need to check this again now that we have the3661* l2ad_mtx. This is because we could be racing with3662* another thread calling l2arc_evict() which might have3663* destroyed this header's L2 portion as we were waiting3664* to acquire the l2ad_mtx. If that happens, we don't3665* want to re-destroy the header's L2 portion.3666*/3667if (HDR_HAS_L2HDR(hdr)) {36683669if (!HDR_EMPTY(hdr))3670buf_discard_identity(hdr);36713672arc_hdr_l2hdr_destroy(hdr);3673}36743675if (!buflist_held)3676mutex_exit(&dev->l2ad_mtx);3677}36783679/*3680* The header's identify can only be safely discarded once it is no3681* longer discoverable. This requires removing it from the hash table3682* and the l2arc header list. 
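* (Note that HDR_LOCK() picks the lock from a hash of the header's
* identity, so it stops being meaningful once that identity is
* discarded.)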
After this point the hash lock can not3683* be used to protect the header.3684*/3685if (!HDR_EMPTY(hdr))3686buf_discard_identity(hdr);36873688if (HDR_HAS_L1HDR(hdr)) {3689arc_cksum_free(hdr);36903691while (hdr->b_l1hdr.b_buf != NULL)3692arc_buf_destroy_impl(hdr->b_l1hdr.b_buf);36933694if (hdr->b_l1hdr.b_pabd != NULL)3695arc_hdr_free_abd(hdr, B_FALSE);36963697if (HDR_HAS_RABD(hdr))3698arc_hdr_free_abd(hdr, B_TRUE);3699}37003701ASSERT0P(hdr->b_hash_next);3702if (HDR_HAS_L1HDR(hdr)) {3703ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));3704ASSERT0P(hdr->b_l1hdr.b_acb);3705#ifdef ZFS_DEBUG3706ASSERT0P(hdr->b_l1hdr.b_freeze_cksum);3707#endif3708kmem_cache_free(hdr_full_cache, hdr);3709} else {3710kmem_cache_free(hdr_l2only_cache, hdr);3711}3712}37133714void3715arc_buf_destroy(arc_buf_t *buf, const void *tag)3716{3717arc_buf_hdr_t *hdr = buf->b_hdr;37183719if (hdr->b_l1hdr.b_state == arc_anon) {3720ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf);3721ASSERT(ARC_BUF_LAST(buf));3722ASSERT(!HDR_IO_IN_PROGRESS(hdr));3723VERIFY0(remove_reference(hdr, tag));3724return;3725}37263727kmutex_t *hash_lock = HDR_LOCK(hdr);3728mutex_enter(hash_lock);37293730ASSERT3P(hdr, ==, buf->b_hdr);3731ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);3732ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));3733ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon);3734ASSERT3P(buf->b_data, !=, NULL);37353736arc_buf_destroy_impl(buf);3737(void) remove_reference(hdr, tag);3738mutex_exit(hash_lock);3739}37403741/*3742* Evict the arc_buf_hdr that is provided as a parameter. The resultant3743* state of the header is dependent on its state prior to entering this3744* function. The following transitions are possible:3745*3746* - arc_mru -> arc_mru_ghost3747* - arc_mfu -> arc_mfu_ghost3748* - arc_mru_ghost -> arc_l2c_only3749* - arc_mru_ghost -> deleted3750* - arc_mfu_ghost -> arc_l2c_only3751* - arc_mfu_ghost -> deleted3752* - arc_uncached -> deleted3753*3754* Return total size of evicted data buffers for eviction progress tracking.3755* When evicting from ghost states return logical buffer size to make eviction3756* progress at the same (or at least comparable) rate as from non-ghost states.3757*3758* Return *real_evicted for actual ARC size reduction to wake up threads3759* waiting for it. For non-ghost states it includes size of evicted data3760* buffers (the headers are not freed there). For ghost states it includes3761* only the evicted headers size.3762*/3763static int64_t3764arc_evict_hdr(arc_buf_hdr_t *hdr, uint64_t *real_evicted)3765{3766arc_state_t *evicted_state, *state;3767int64_t bytes_evicted = 0;3768uint_t min_lifetime = HDR_PRESCIENT_PREFETCH(hdr) ?3769arc_min_prescient_prefetch_ms : arc_min_prefetch_ms;37703771ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));3772ASSERT(HDR_HAS_L1HDR(hdr));3773ASSERT(!HDR_IO_IN_PROGRESS(hdr));3774ASSERT0P(hdr->b_l1hdr.b_buf);3775ASSERT0(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt));37763777*real_evicted = 0;3778state = hdr->b_l1hdr.b_state;3779if (GHOST_STATE(state)) {37803781/*3782* l2arc_write_buffers() relies on a header's L1 portion3783* (i.e. 
its b_pabd field) during it's write phase.3784* Thus, we cannot push a header onto the arc_l2c_only3785* state (removing its L1 piece) until the header is3786* done being written to the l2arc.3787*/3788if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) {3789ARCSTAT_BUMP(arcstat_evict_l2_skip);3790return (bytes_evicted);3791}37923793ARCSTAT_BUMP(arcstat_deleted);3794bytes_evicted += HDR_GET_LSIZE(hdr);37953796DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr);37973798if (HDR_HAS_L2HDR(hdr)) {3799ASSERT0P(hdr->b_l1hdr.b_pabd);3800ASSERT(!HDR_HAS_RABD(hdr));3801/*3802* This buffer is cached on the 2nd Level ARC;3803* don't destroy the header.3804*/3805arc_change_state(arc_l2c_only, hdr);3806/*3807* dropping from L1+L2 cached to L2-only,3808* realloc to remove the L1 header.3809*/3810(void) arc_hdr_realloc(hdr, hdr_full_cache,3811hdr_l2only_cache);3812*real_evicted += HDR_FULL_SIZE - HDR_L2ONLY_SIZE;3813} else {3814arc_change_state(arc_anon, hdr);3815arc_hdr_destroy(hdr);3816*real_evicted += HDR_FULL_SIZE;3817}3818return (bytes_evicted);3819}38203821ASSERT(state == arc_mru || state == arc_mfu || state == arc_uncached);3822evicted_state = (state == arc_uncached) ? arc_anon :3823((state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost);38243825/* prefetch buffers have a minimum lifespan */3826if ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) &&3827ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access <3828MSEC_TO_TICK(min_lifetime)) {3829ARCSTAT_BUMP(arcstat_evict_skip);3830return (bytes_evicted);3831}38323833if (HDR_HAS_L2HDR(hdr)) {3834ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr));3835} else {3836if (l2arc_write_eligible(hdr->b_spa, hdr)) {3837ARCSTAT_INCR(arcstat_evict_l2_eligible,3838HDR_GET_LSIZE(hdr));38393840switch (state->arcs_state) {3841case ARC_STATE_MRU:3842ARCSTAT_INCR(3843arcstat_evict_l2_eligible_mru,3844HDR_GET_LSIZE(hdr));3845break;3846case ARC_STATE_MFU:3847ARCSTAT_INCR(3848arcstat_evict_l2_eligible_mfu,3849HDR_GET_LSIZE(hdr));3850break;3851default:3852break;3853}3854} else {3855ARCSTAT_INCR(arcstat_evict_l2_ineligible,3856HDR_GET_LSIZE(hdr));3857}3858}38593860bytes_evicted += arc_hdr_size(hdr);3861*real_evicted += arc_hdr_size(hdr);38623863/*3864* If this hdr is being evicted and has a compressed buffer then we3865* discard it here before we change states. 
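* (arc_free_data_impl() charges the freed bytes against whatever state
* b_l1hdr.b_state points to at the time of the call.)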
This ensures that the3866* accounting is updated correctly in arc_free_data_impl().3867*/3868if (hdr->b_l1hdr.b_pabd != NULL)3869arc_hdr_free_abd(hdr, B_FALSE);38703871if (HDR_HAS_RABD(hdr))3872arc_hdr_free_abd(hdr, B_TRUE);38733874arc_change_state(evicted_state, hdr);3875DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr);3876if (evicted_state == arc_anon) {3877arc_hdr_destroy(hdr);3878*real_evicted += HDR_FULL_SIZE;3879} else {3880ASSERT(HDR_IN_HASH_TABLE(hdr));3881}38823883return (bytes_evicted);3884}38853886static void3887arc_set_need_free(void)3888{3889ASSERT(MUTEX_HELD(&arc_evict_lock));3890int64_t remaining = arc_free_memory() - arc_sys_free / 2;3891arc_evict_waiter_t *aw = list_tail(&arc_evict_waiters);3892if (aw == NULL) {3893arc_need_free = MAX(-remaining, 0);3894} else {3895arc_need_free =3896MAX(-remaining, (int64_t)(aw->aew_count - arc_evict_count));3897}3898}38993900static uint64_t3901arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker,3902uint64_t spa, uint64_t bytes)3903{3904multilist_sublist_t *mls;3905uint64_t bytes_evicted = 0, real_evicted = 0;3906arc_buf_hdr_t *hdr;3907kmutex_t *hash_lock;3908uint_t evict_count = zfs_arc_evict_batch_limit;39093910ASSERT3P(marker, !=, NULL);39113912mls = multilist_sublist_lock_idx(ml, idx);39133914for (hdr = multilist_sublist_prev(mls, marker); likely(hdr != NULL);3915hdr = multilist_sublist_prev(mls, marker)) {3916if ((evict_count == 0) || (bytes_evicted >= bytes))3917break;39183919/*3920* To keep our iteration location, move the marker3921* forward. Since we're not holding hdr's hash lock, we3922* must be very careful and not remove 'hdr' from the3923* sublist. Otherwise, other consumers might mistake the3924* 'hdr' as not being on a sublist when they call the3925* multilist_link_active() function (they all rely on3926* the hash lock protecting concurrent insertions and3927* removals). multilist_sublist_move_forward() was3928* specifically implemented to ensure this is the case3929* (only 'marker' will be removed and re-inserted).3930*/3931multilist_sublist_move_forward(mls, marker);39323933/*3934* The only case where the b_spa field should ever be3935* zero, is the marker headers inserted by3936* arc_evict_state(). It's possible for multiple threads3937* to be calling arc_evict_state() concurrently (e.g.3938* dsl_pool_close() and zio_inject_fault()), so we must3939* skip any markers we see from these other threads.3940*/3941if (hdr->b_spa == 0)3942continue;39433944/* we're only interested in evicting buffers of a certain spa */3945if (spa != 0 && hdr->b_spa != spa) {3946ARCSTAT_BUMP(arcstat_evict_skip);3947continue;3948}39493950hash_lock = HDR_LOCK(hdr);39513952/*3953* We aren't calling this function from any code path3954* that would already be holding a hash lock, so we're3955* asserting on this assumption to be defensive in case3956* this ever changes. Without this check, it would be3957* possible to incorrectly increment arcstat_mutex_miss3958* below (e.g. 
if the code changed such that we called3959* this function with a hash lock held).3960*/3961ASSERT(!MUTEX_HELD(hash_lock));39623963if (mutex_tryenter(hash_lock)) {3964uint64_t revicted;3965uint64_t evicted = arc_evict_hdr(hdr, &revicted);3966mutex_exit(hash_lock);39673968bytes_evicted += evicted;3969real_evicted += revicted;39703971/*3972* If evicted is zero, arc_evict_hdr() must have3973* decided to skip this header, don't increment3974* evict_count in this case.3975*/3976if (evicted != 0)3977evict_count--;39783979} else {3980ARCSTAT_BUMP(arcstat_mutex_miss);3981}3982}39833984multilist_sublist_unlock(mls);39853986/*3987* Increment the count of evicted bytes, and wake up any threads that3988* are waiting for the count to reach this value. Since the list is3989* ordered by ascending aew_count, we pop off the beginning of the3990* list until we reach the end, or a waiter that's past the current3991* "count". Doing this outside the loop reduces the number of times3992* we need to acquire the global arc_evict_lock.3993*3994* Only wake when there's sufficient free memory in the system3995* (specifically, arc_sys_free/2, which by default is a bit more than3996* 1/64th of RAM). See the comments in arc_wait_for_eviction().3997*/3998mutex_enter(&arc_evict_lock);3999arc_evict_count += real_evicted;40004001if (arc_free_memory() > arc_sys_free / 2) {4002arc_evict_waiter_t *aw;4003while ((aw = list_head(&arc_evict_waiters)) != NULL &&4004aw->aew_count <= arc_evict_count) {4005list_remove(&arc_evict_waiters, aw);4006cv_broadcast(&aw->aew_cv);4007}4008}4009arc_set_need_free();4010mutex_exit(&arc_evict_lock);40114012/*4013* If the ARC size is reduced from arc_c_max to arc_c_min (especially4014* if the average cached block is small), eviction can be on-CPU for4015* many seconds. To ensure that other threads that may be bound to4016* this CPU are able to make progress, make a voluntary preemption4017* call here.4018*/4019kpreempt(KPREEMPT_SYNC);40204021return (bytes_evicted);4022}40234024static arc_buf_hdr_t *4025arc_state_alloc_marker(void)4026{4027arc_buf_hdr_t *marker = kmem_cache_alloc(hdr_full_cache, KM_SLEEP);40284029/*4030* A b_spa of 0 is used to indicate that this header is4031* a marker. 
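* (Real headers store the spa's load guid in b_spa, which is never
* zero.)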
This fact is used in arc_evict_state_impl().4032*/4033marker->b_spa = 0;40344035return (marker);4036}40374038static void4039arc_state_free_marker(arc_buf_hdr_t *marker)4040{4041kmem_cache_free(hdr_full_cache, marker);4042}40434044/*4045* Allocate an array of buffer headers used as placeholders during arc state4046* eviction.4047*/4048static arc_buf_hdr_t **4049arc_state_alloc_markers(int count)4050{4051arc_buf_hdr_t **markers;40524053markers = kmem_zalloc(sizeof (*markers) * count, KM_SLEEP);4054for (int i = 0; i < count; i++)4055markers[i] = arc_state_alloc_marker();4056return (markers);4057}40584059static void4060arc_state_free_markers(arc_buf_hdr_t **markers, int count)4061{4062for (int i = 0; i < count; i++)4063arc_state_free_marker(markers[i]);4064kmem_free(markers, sizeof (*markers) * count);4065}40664067typedef struct evict_arg {4068taskq_ent_t eva_tqent;4069multilist_t *eva_ml;4070arc_buf_hdr_t *eva_marker;4071int eva_idx;4072uint64_t eva_spa;4073uint64_t eva_bytes;4074uint64_t eva_evicted;4075} evict_arg_t;40764077static void4078arc_evict_task(void *arg)4079{4080evict_arg_t *eva = arg;4081eva->eva_evicted = arc_evict_state_impl(eva->eva_ml, eva->eva_idx,4082eva->eva_marker, eva->eva_spa, eva->eva_bytes);4083}40844085static void4086arc_evict_thread_init(void)4087{4088if (zfs_arc_evict_threads == 0) {4089/*4090* Compute number of threads we want to use for eviction.4091*4092* Normally, it's log2(ncpus) + ncpus/32, which gets us to the4093* default max of 16 threads at ~256 CPUs.4094*4095* However, that formula goes to two threads at 4 CPUs, which4096* is still rather to low to be really useful, so we just go4097* with 1 thread at fewer than 6 cores.4098*/4099if (max_ncpus < 6)4100zfs_arc_evict_threads = 1;4101else4102zfs_arc_evict_threads =4103(highbit64(max_ncpus) - 1) + max_ncpus / 32;4104} else if (zfs_arc_evict_threads > max_ncpus)4105zfs_arc_evict_threads = max_ncpus;41064107if (zfs_arc_evict_threads > 1) {4108arc_evict_taskq = taskq_create("arc_evict",4109zfs_arc_evict_threads, defclsyspri, 0, INT_MAX,4110TASKQ_PREPOPULATE);4111arc_evict_arg = kmem_zalloc(4112sizeof (evict_arg_t) * zfs_arc_evict_threads, KM_SLEEP);4113}4114}41154116/*4117* The minimum number of bytes we can evict at once is a block size.4118* So, SPA_MAXBLOCKSIZE is a reasonable minimal value per an eviction task.4119* We use this value to compute a scaling factor for the eviction tasks.4120*/4121#define MIN_EVICT_SIZE (SPA_MAXBLOCKSIZE)41224123/*4124* Evict buffers from the given arc state, until we've removed the4125* specified number of bytes. Move the removed buffers to the4126* appropriate evict state.4127*4128* This function makes a "best effort". It skips over any buffers4129* it can't get a hash_lock on, and so, may not catch all candidates.4130* It may also return without evicting as much space as requested.4131*4132* If bytes is specified using the special value ARC_EVICT_ALL, this4133* will evict all available (i.e. 
unlocked and evictable) buffers from4134* the given arc state; which is used by arc_flush().4135*/4136static uint64_t4137arc_evict_state(arc_state_t *state, arc_buf_contents_t type, uint64_t spa,4138uint64_t bytes)4139{4140uint64_t total_evicted = 0;4141multilist_t *ml = &state->arcs_list[type];4142int num_sublists;4143arc_buf_hdr_t **markers;4144evict_arg_t *eva = NULL;41454146num_sublists = multilist_get_num_sublists(ml);41474148boolean_t use_evcttq = zfs_arc_evict_threads > 1;41494150/*4151* If we've tried to evict from each sublist, made some4152* progress, but still have not hit the target number of bytes4153* to evict, we want to keep trying. The markers allow us to4154* pick up where we left off for each individual sublist, rather4155* than starting from the tail each time.4156*/4157if (zthr_iscurthread(arc_evict_zthr)) {4158markers = arc_state_evict_markers;4159ASSERT3S(num_sublists, <=, arc_state_evict_marker_count);4160} else {4161markers = arc_state_alloc_markers(num_sublists);4162}4163for (int i = 0; i < num_sublists; i++) {4164multilist_sublist_t *mls;41654166mls = multilist_sublist_lock_idx(ml, i);4167multilist_sublist_insert_tail(mls, markers[i]);4168multilist_sublist_unlock(mls);4169}41704171if (use_evcttq) {4172if (zthr_iscurthread(arc_evict_zthr))4173eva = arc_evict_arg;4174else4175eva = kmem_alloc(sizeof (evict_arg_t) *4176zfs_arc_evict_threads, KM_NOSLEEP);4177if (eva) {4178for (int i = 0; i < zfs_arc_evict_threads; i++) {4179taskq_init_ent(&eva[i].eva_tqent);4180eva[i].eva_ml = ml;4181eva[i].eva_spa = spa;4182}4183} else {4184/*4185* Fall back to the regular single evict if it is not4186* possible to allocate memory for the taskq entries.4187*/4188use_evcttq = B_FALSE;4189}4190}41914192/*4193* Start eviction using a randomly selected sublist, this is to try and4194* evenly balance eviction across all sublists. Always starting at the4195* same sublist (e.g. 
index 0) would cause evictions to favor certain4196* sublists over others.4197*/4198uint64_t scan_evicted = 0;4199int sublists_left = num_sublists;4200int sublist_idx = multilist_get_random_index(ml);42014202/*4203* While we haven't hit our target number of bytes to evict, or4204* we're evicting all available buffers.4205*/4206while (total_evicted < bytes) {4207uint64_t evict = MIN_EVICT_SIZE;4208uint_t ntasks = zfs_arc_evict_threads;42094210if (use_evcttq) {4211if (sublists_left < ntasks)4212ntasks = sublists_left;42134214if (ntasks < 2)4215use_evcttq = B_FALSE;4216}42174218if (use_evcttq) {4219uint64_t left = bytes - total_evicted;42204221if (bytes == ARC_EVICT_ALL) {4222evict = bytes;4223} else if (left > ntasks * MIN_EVICT_SIZE) {4224evict = DIV_ROUND_UP(left, ntasks);4225} else {4226ntasks = DIV_ROUND_UP(left, MIN_EVICT_SIZE);4227if (ntasks == 1)4228use_evcttq = B_FALSE;4229}4230}42314232for (int i = 0; sublists_left > 0; i++, sublist_idx++,4233sublists_left--) {4234uint64_t bytes_remaining;4235uint64_t bytes_evicted;42364237/* we've reached the end, wrap to the beginning */4238if (sublist_idx >= num_sublists)4239sublist_idx = 0;42404241if (use_evcttq) {4242if (i == ntasks)4243break;42444245eva[i].eva_marker = markers[sublist_idx];4246eva[i].eva_idx = sublist_idx;4247eva[i].eva_bytes = evict;42484249taskq_dispatch_ent(arc_evict_taskq,4250arc_evict_task, &eva[i], 0,4251&eva[i].eva_tqent);42524253continue;4254}42554256if (total_evicted < bytes)4257bytes_remaining = bytes - total_evicted;4258else4259break;42604261bytes_evicted = arc_evict_state_impl(ml, sublist_idx,4262markers[sublist_idx], spa, bytes_remaining);42634264scan_evicted += bytes_evicted;4265total_evicted += bytes_evicted;4266}42674268if (use_evcttq) {4269taskq_wait(arc_evict_taskq);42704271for (int i = 0; i < ntasks; i++) {4272scan_evicted += eva[i].eva_evicted;4273total_evicted += eva[i].eva_evicted;4274}4275}42764277/*4278* If we scanned all sublists and didn't evict anything, we4279* have no reason to believe we'll evict more during another4280* scan, so break the loop.4281*/4282if (scan_evicted == 0 && sublists_left == 0) {4283/* This isn't possible, let's make that obvious */4284ASSERT3S(bytes, !=, 0);42854286/*4287* When bytes is ARC_EVICT_ALL, the only way to4288* break the loop is when scan_evicted is zero.4289* In that case, we actually have evicted enough,4290* so we don't want to increment the kstat.4291*/4292if (bytes != ARC_EVICT_ALL) {4293ASSERT3S(total_evicted, <, bytes);4294ARCSTAT_BUMP(arcstat_evict_not_enough);4295}42964297break;4298}42994300/*4301* If we scanned all sublists but still have more to do,4302* reset the counts so we can go around again.4303*/4304if (sublists_left == 0) {4305sublists_left = num_sublists;4306sublist_idx = multilist_get_random_index(ml);4307scan_evicted = 0;43084309/*4310* Since we're about to reconsider all sublists,4311* re-enable use of the evict threads if available.4312*/4313use_evcttq = (zfs_arc_evict_threads > 1 && eva != NULL);4314}4315}43164317if (eva != NULL && eva != arc_evict_arg)4318kmem_free(eva, sizeof (evict_arg_t) * zfs_arc_evict_threads);43194320for (int i = 0; i < num_sublists; i++) {4321multilist_sublist_t *mls = multilist_sublist_lock_idx(ml, i);4322multilist_sublist_remove(mls, markers[i]);4323multilist_sublist_unlock(mls);4324}43254326if (markers != arc_state_evict_markers)4327arc_state_free_markers(markers, num_sublists);43284329return (total_evicted);4330}43314332/*4333* Flush all "evictable" data of the given type from the arc state4334* specified. 
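* It is used both by arc_flush() to empty whole states and by the ARC
* eviction thread to periodically drain the arc_uncached state.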
This will not evict any "active" buffers (i.e. referenced).4335*4336* When 'retry' is set to B_FALSE, the function will make a single pass4337* over the state and evict any buffers that it can. Since it doesn't4338* continually retry the eviction, it might end up leaving some buffers4339* in the ARC due to lock misses.4340*4341* When 'retry' is set to B_TRUE, the function will continually retry the4342* eviction until *all* evictable buffers have been removed from the4343* state. As a result, if concurrent insertions into the state are4344* allowed (e.g. if the ARC isn't shutting down), this function might4345* wind up in an infinite loop, continually trying to evict buffers.4346*/4347static uint64_t4348arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type,4349boolean_t retry)4350{4351uint64_t evicted = 0;43524353while (zfs_refcount_count(&state->arcs_esize[type]) != 0) {4354evicted += arc_evict_state(state, type, spa, ARC_EVICT_ALL);43554356if (!retry)4357break;4358}43594360return (evicted);4361}43624363/*4364* Evict the specified number of bytes from the state specified. This4365* function prevents us from trying to evict more from a state's list4366* than is "evictable", and to skip evicting altogether when passed a4367* negative value for "bytes". In contrast, arc_evict_state() will4368* evict everything it can, when passed a negative value for "bytes".4369*/4370static uint64_t4371arc_evict_impl(arc_state_t *state, arc_buf_contents_t type, int64_t bytes)4372{4373uint64_t delta;43744375if (bytes > 0 && zfs_refcount_count(&state->arcs_esize[type]) > 0) {4376delta = MIN(zfs_refcount_count(&state->arcs_esize[type]),4377bytes);4378return (arc_evict_state(state, type, 0, delta));4379}43804381return (0);4382}43834384/*4385* Adjust specified fraction, taking into account initial ghost state(s) size,4386* ghost hit bytes towards increasing the fraction, ghost hit bytes towards4387* decreasing it, plus a balance factor, controlling the decrease rate, used4388* to balance metadata vs data.4389*/4390static uint64_t4391arc_evict_adj(uint64_t frac, uint64_t total, uint64_t up, uint64_t down,4392uint_t balance)4393{4394if (total < 32 || up + down == 0)4395return (frac);43964397/*4398* We should not have more ghost hits than ghost size, but they may4399* get close. To avoid overflows below up/down should not be bigger4400* than 1/5 of total. But to limit maximum adjustment speed restrict4401* it some more.4402*/4403if (up + down >= total / 16) {4404uint64_t scale = (up + down) / (total / 32);4405up /= scale;4406down /= scale;4407}44084409/* Get maximal dynamic range by choosing optimal shifts. 
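The shifts scale up and down to approximately up * 2^32 / total and
down * 2^32 / total, i.e. 32-bit fixed-point fractions of total,
without overflowing 64-bit arithmetic.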
*/4410int s = highbit64(total);4411s = MIN(64 - s, 32);44124413ASSERT3U(frac, <=, 1ULL << 32);4414uint64_t ofrac = (1ULL << 32) - frac;44154416if (frac >= 4 * ofrac)4417up /= frac / (2 * ofrac + 1);4418up = (up << s) / (total >> (32 - s));4419if (ofrac >= 4 * frac)4420down /= ofrac / (2 * frac + 1);4421down = (down << s) / (total >> (32 - s));4422down = down * 100 / balance;44234424ASSERT3U(up, <=, (1ULL << 32) - frac);4425ASSERT3U(down, <=, frac);4426return (frac + up - down);4427}44284429/*4430* Calculate (x * multiplier / divisor) without unnecesary overflows.4431*/4432static uint64_t4433arc_mf(uint64_t x, uint64_t multiplier, uint64_t divisor)4434{4435uint64_t q = (x / divisor);4436uint64_t r = (x % divisor);44374438return ((q * multiplier) + ((r * multiplier) / divisor));4439}44404441/*4442* Evict buffers from the cache, such that arcstat_size is capped by arc_c.4443*/4444static uint64_t4445arc_evict(void)4446{4447uint64_t bytes, total_evicted = 0;4448int64_t e, mrud, mrum, mfud, mfum, w;4449static uint64_t ogrd, ogrm, ogfd, ogfm;4450static uint64_t gsrd, gsrm, gsfd, gsfm;4451uint64_t ngrd, ngrm, ngfd, ngfm;44524453/* Get current size of ARC states we can evict from. */4454mrud = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_DATA]) +4455zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_DATA]);4456mrum = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_METADATA]) +4457zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_METADATA]);4458mfud = zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_DATA]);4459mfum = zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_METADATA]);4460uint64_t d = mrud + mfud;4461uint64_t m = mrum + mfum;4462uint64_t t = d + m;44634464/* Get ARC ghost hits since last eviction. */4465ngrd = wmsum_value(&arc_mru_ghost->arcs_hits[ARC_BUFC_DATA]);4466uint64_t grd = ngrd - ogrd;4467ogrd = ngrd;4468ngrm = wmsum_value(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA]);4469uint64_t grm = ngrm - ogrm;4470ogrm = ngrm;4471ngfd = wmsum_value(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA]);4472uint64_t gfd = ngfd - ogfd;4473ogfd = ngfd;4474ngfm = wmsum_value(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA]);4475uint64_t gfm = ngfm - ogfm;4476ogfm = ngfm;44774478/* Adjust ARC states balance based on ghost hits. */4479arc_meta = arc_evict_adj(arc_meta, gsrd + gsrm + gsfd + gsfm,4480grm + gfm, grd + gfd, zfs_arc_meta_balance);4481arc_pd = arc_evict_adj(arc_pd, gsrd + gsfd, grd, gfd, 100);4482arc_pm = arc_evict_adj(arc_pm, gsrm + gsfm, grm, gfm, 100);44834484uint64_t asize = aggsum_value(&arc_sums.arcstat_size);4485uint64_t ac = arc_c;4486int64_t wt = t - (asize - ac);44874488/*4489* Try to reduce pinned dnodes if more than 3/4 of wanted metadata4490* target is not evictable or if they go over arc_dnode_limit.4491*/4492int64_t prune = 0;4493int64_t dn = aggsum_value(&arc_sums.arcstat_dnode_size);4494int64_t nem = zfs_refcount_count(&arc_mru->arcs_size[ARC_BUFC_METADATA])4495+ zfs_refcount_count(&arc_mfu->arcs_size[ARC_BUFC_METADATA])4496- zfs_refcount_count(&arc_mru->arcs_esize[ARC_BUFC_METADATA])4497- zfs_refcount_count(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);4498w = wt * (int64_t)(arc_meta >> 16) >> 16;4499if (nem > w * 3 / 4) {4500prune = dn / sizeof (dnode_t) *4501zfs_arc_dnode_reduce_percent / 100;4502if (nem < w && w > 4)4503prune = arc_mf(prune, nem - w * 3 / 4, w / 4);4504}4505if (dn > arc_dnode_limit) {4506prune = MAX(prune, (dn - arc_dnode_limit) / sizeof (dnode_t) *4507zfs_arc_dnode_reduce_percent / 100);4508}4509if (prune > 0)4510arc_prune_async(prune);45114512/* Evict MRU metadata. 
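The target w is the wanted size wt scaled by the metadata fraction
(arc_meta) and by the MRU share of metadata (arc_pm), both kept as
32-bit fixed-point fractions.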
*/4513w = wt * (int64_t)(arc_meta * arc_pm >> 48) >> 16;4514e = MIN((int64_t)(asize - ac), (int64_t)(mrum - w));4515bytes = arc_evict_impl(arc_mru, ARC_BUFC_METADATA, e);4516total_evicted += bytes;4517mrum -= bytes;4518asize -= bytes;45194520/* Evict MFU metadata. */4521w = wt * (int64_t)(arc_meta >> 16) >> 16;4522e = MIN((int64_t)(asize - ac), (int64_t)(m - bytes - w));4523bytes = arc_evict_impl(arc_mfu, ARC_BUFC_METADATA, e);4524total_evicted += bytes;4525mfum -= bytes;4526asize -= bytes;45274528/* Evict MRU data. */4529wt -= m - total_evicted;4530w = wt * (int64_t)(arc_pd >> 16) >> 16;4531e = MIN((int64_t)(asize - ac), (int64_t)(mrud - w));4532bytes = arc_evict_impl(arc_mru, ARC_BUFC_DATA, e);4533total_evicted += bytes;4534mrud -= bytes;4535asize -= bytes;45364537/* Evict MFU data. */4538e = asize - ac;4539bytes = arc_evict_impl(arc_mfu, ARC_BUFC_DATA, e);4540mfud -= bytes;4541total_evicted += bytes;45424543/*4544* Evict ghost lists4545*4546* Size of each state's ghost list represents how much that state4547* may grow by shrinking the other states. Would it need to shrink4548* other states to zero (that is unlikely), its ghost size would be4549* equal to sum of other three state sizes. But excessive ghost4550* size may result in false ghost hits (too far back), that may4551* never result in real cache hits if several states are competing.4552* So choose some arbitraty point of 1/2 of other state sizes.4553*/4554gsrd = (mrum + mfud + mfum) / 2;4555e = zfs_refcount_count(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]) -4556gsrd;4557(void) arc_evict_impl(arc_mru_ghost, ARC_BUFC_DATA, e);45584559gsrm = (mrud + mfud + mfum) / 2;4560e = zfs_refcount_count(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]) -4561gsrm;4562(void) arc_evict_impl(arc_mru_ghost, ARC_BUFC_METADATA, e);45634564gsfd = (mrud + mrum + mfum) / 2;4565e = zfs_refcount_count(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]) -4566gsfd;4567(void) arc_evict_impl(arc_mfu_ghost, ARC_BUFC_DATA, e);45684569gsfm = (mrud + mrum + mfud) / 2;4570e = zfs_refcount_count(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]) -4571gsfm;4572(void) arc_evict_impl(arc_mfu_ghost, ARC_BUFC_METADATA, e);45734574return (total_evicted);4575}45764577static void4578arc_flush_impl(uint64_t guid, boolean_t retry)4579{4580ASSERT(!retry || guid == 0);45814582(void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry);4583(void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry);45844585(void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry);4586(void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry);45874588(void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry);4589(void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry);45904591(void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry);4592(void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry);45934594(void) arc_flush_state(arc_uncached, guid, ARC_BUFC_DATA, retry);4595(void) arc_flush_state(arc_uncached, guid, ARC_BUFC_METADATA, retry);4596}45974598void4599arc_flush(spa_t *spa, boolean_t retry)4600{4601/*4602* If retry is B_TRUE, a spa must not be specified since we have4603* no good way to determine if all of a spa's buffers have been4604* evicted from an arc state.4605*/4606ASSERT(!retry || spa == NULL);46074608arc_flush_impl(spa != NULL ? 
spa_load_guid(spa) : 0, retry);4609}46104611static arc_async_flush_t *4612arc_async_flush_add(uint64_t spa_guid, uint_t level)4613{4614arc_async_flush_t *af = kmem_alloc(sizeof (*af), KM_SLEEP);4615af->af_spa_guid = spa_guid;4616af->af_cache_level = level;4617taskq_init_ent(&af->af_tqent);4618list_link_init(&af->af_node);46194620mutex_enter(&arc_async_flush_lock);4621list_insert_tail(&arc_async_flush_list, af);4622mutex_exit(&arc_async_flush_lock);46234624return (af);4625}46264627static void4628arc_async_flush_remove(uint64_t spa_guid, uint_t level)4629{4630mutex_enter(&arc_async_flush_lock);4631for (arc_async_flush_t *af = list_head(&arc_async_flush_list);4632af != NULL; af = list_next(&arc_async_flush_list, af)) {4633if (af->af_spa_guid == spa_guid &&4634af->af_cache_level == level) {4635list_remove(&arc_async_flush_list, af);4636kmem_free(af, sizeof (*af));4637break;4638}4639}4640mutex_exit(&arc_async_flush_lock);4641}46424643static void4644arc_flush_task(void *arg)4645{4646arc_async_flush_t *af = arg;4647hrtime_t start_time = gethrtime();4648uint64_t spa_guid = af->af_spa_guid;46494650arc_flush_impl(spa_guid, B_FALSE);4651arc_async_flush_remove(spa_guid, af->af_cache_level);46524653uint64_t elapsed = NSEC2MSEC(gethrtime() - start_time);4654if (elapsed > 0) {4655zfs_dbgmsg("spa %llu arc flushed in %llu ms",4656(u_longlong_t)spa_guid, (u_longlong_t)elapsed);4657}4658}46594660/*4661* ARC buffers use the spa's load guid and can continue to exist after4662* the spa_t is gone (exported). The blocks are orphaned since each4663* spa import has a different load guid.4664*4665* It's OK if the spa is re-imported while this asynchronous flush is4666* still in progress. The new spa_load_guid will be different.4667*4668* Also, arc_fini will wait for any arc_flush_task to finish.4669*/4670void4671arc_flush_async(spa_t *spa)4672{4673uint64_t spa_guid = spa_load_guid(spa);4674arc_async_flush_t *af = arc_async_flush_add(spa_guid, 1);46754676taskq_dispatch_ent(arc_flush_taskq, arc_flush_task,4677af, TQ_SLEEP, &af->af_tqent);4678}46794680/*4681* Check if a guid is still in-use as part of an async teardown task4682*/4683boolean_t4684arc_async_flush_guid_inuse(uint64_t spa_guid)4685{4686mutex_enter(&arc_async_flush_lock);4687for (arc_async_flush_t *af = list_head(&arc_async_flush_list);4688af != NULL; af = list_next(&arc_async_flush_list, af)) {4689if (af->af_spa_guid == spa_guid) {4690mutex_exit(&arc_async_flush_lock);4691return (B_TRUE);4692}4693}4694mutex_exit(&arc_async_flush_lock);4695return (B_FALSE);4696}46974698uint64_t4699arc_reduce_target_size(uint64_t to_free)4700{4701/*4702* Get the actual arc size. Even if we don't need it, this updates4703* the aggsum lower bound estimate for arc_is_overflowing().4704*/4705uint64_t asize = aggsum_value(&arc_sums.arcstat_size);47064707/*4708* All callers want the ARC to actually evict (at least) this much4709* memory. Therefore we reduce from the lower of the current size and4710* the target size. 
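* For example (hypothetical numbers, assuming arc_c_min is well below
* 4 GiB): with arc_c = 10 GiB, arc_size = 4 GiB and to_free = 1 GiB we
* first clamp to 4 GiB and then set arc_c to 3 GiB, instead of merely
* lowering the target from 10 GiB to 9 GiB, which would not force any
* eviction.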
This way, even if arc_c is much higher than4711* arc_size (as can be the case after many calls to arc_freed(), we will4712* immediately have arc_c < arc_size and therefore the arc_evict_zthr4713* will evict.4714*/4715uint64_t c = arc_c;4716if (c > arc_c_min) {4717c = MIN(c, MAX(asize, arc_c_min));4718to_free = MIN(to_free, c - arc_c_min);4719arc_c = c - to_free;4720} else {4721to_free = 0;4722}47234724/*4725* Since dbuf cache size is a fraction of target ARC size, we should4726* notify dbuf about the reduction, which might be significant,4727* especially if current ARC size was much smaller than the target.4728*/4729dbuf_cache_reduce_target_size();47304731/*4732* Whether or not we reduced the target size, request eviction if the4733* current size is over it now, since caller obviously wants some RAM.4734*/4735if (asize > arc_c) {4736/* See comment in arc_evict_cb_check() on why lock+flag */4737mutex_enter(&arc_evict_lock);4738arc_evict_needed = B_TRUE;4739mutex_exit(&arc_evict_lock);4740zthr_wakeup(arc_evict_zthr);4741}47424743return (to_free);4744}47454746/*4747* Determine if the system is under memory pressure and is asking4748* to reclaim memory. A return value of B_TRUE indicates that the system4749* is under memory pressure and that the arc should adjust accordingly.4750*/4751boolean_t4752arc_reclaim_needed(void)4753{4754return (arc_available_memory() < 0);4755}47564757void4758arc_kmem_reap_soon(void)4759{4760size_t i;4761kmem_cache_t *prev_cache = NULL;4762kmem_cache_t *prev_data_cache = NULL;47634764#ifdef _KERNEL4765#if defined(_ILP32)4766/*4767* Reclaim unused memory from all kmem caches.4768*/4769kmem_reap();4770#endif4771#endif47724773for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {4774#if defined(_ILP32)4775/* reach upper limit of cache size on 32-bit */4776if (zio_buf_cache[i] == NULL)4777break;4778#endif4779if (zio_buf_cache[i] != prev_cache) {4780prev_cache = zio_buf_cache[i];4781kmem_cache_reap_now(zio_buf_cache[i]);4782}4783if (zio_data_buf_cache[i] != prev_data_cache) {4784prev_data_cache = zio_data_buf_cache[i];4785kmem_cache_reap_now(zio_data_buf_cache[i]);4786}4787}4788kmem_cache_reap_now(buf_cache);4789kmem_cache_reap_now(hdr_full_cache);4790kmem_cache_reap_now(hdr_l2only_cache);4791kmem_cache_reap_now(zfs_btree_leaf_cache);4792abd_cache_reap_now();4793}47944795static boolean_t4796arc_evict_cb_check(void *arg, zthr_t *zthr)4797{4798(void) arg, (void) zthr;47994800#ifdef ZFS_DEBUG4801/*4802* This is necessary in order to keep the kstat information4803* up to date for tools that display kstat data such as the4804* mdb ::arc dcmd and the Linux crash utility. These tools4805* typically do not call kstat's update function, but simply4806* dump out stats from the most recent update. Without4807* this call, these commands may show stale stats for the4808* anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even4809* with this call, the data might be out of date if the4810* evict thread hasn't been woken recently; but that should4811* suffice. The arc_state_t structures can be queried4812* directly if more accurate information is needed.4813*/4814if (arc_ksp != NULL)4815arc_ksp->ks_update(arc_ksp, KSTAT_READ);4816#endif48174818/*4819* We have to rely on arc_wait_for_eviction() to tell us when to4820* evict, rather than checking if we are overflowing here, so that we4821* are sure to not leave arc_wait_for_eviction() waiting on aew_cv.4822* If we have become "not overflowing" since arc_wait_for_eviction()4823* checked, we need to wake it up. 
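* Returning B_TRUE here makes the zthr invoke arc_evict_cb(), which
* performs that broadcast under arc_evict_lock.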
We could broadcast the CV here,4824* but arc_wait_for_eviction() may have not yet gone to sleep. We4825* would need to use a mutex to ensure that this function doesn't4826* broadcast until arc_wait_for_eviction() has gone to sleep (e.g.4827* the arc_evict_lock). However, the lock ordering of such a lock4828* would necessarily be incorrect with respect to the zthr_lock,4829* which is held before this function is called, and is held by4830* arc_wait_for_eviction() when it calls zthr_wakeup().4831*/4832if (arc_evict_needed)4833return (B_TRUE);48344835/*4836* If we have buffers in uncached state, evict them periodically.4837*/4838return ((zfs_refcount_count(&arc_uncached->arcs_esize[ARC_BUFC_DATA]) +4839zfs_refcount_count(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]) &&4840ddi_get_lbolt() - arc_last_uncached_flush >4841MSEC_TO_TICK(arc_min_prefetch_ms / 2)));4842}48434844/*4845* Keep arc_size under arc_c by running arc_evict which evicts data4846* from the ARC.4847*/4848static void4849arc_evict_cb(void *arg, zthr_t *zthr)4850{4851(void) arg;48524853uint64_t evicted = 0;4854fstrans_cookie_t cookie = spl_fstrans_mark();48554856/* Always try to evict from uncached state. */4857arc_last_uncached_flush = ddi_get_lbolt();4858evicted += arc_flush_state(arc_uncached, 0, ARC_BUFC_DATA, B_FALSE);4859evicted += arc_flush_state(arc_uncached, 0, ARC_BUFC_METADATA, B_FALSE);48604861/* Evict from other states only if told to. */4862if (arc_evict_needed)4863evicted += arc_evict();48644865/*4866* If evicted is zero, we couldn't evict anything4867* via arc_evict(). This could be due to hash lock4868* collisions, but more likely due to the majority of4869* arc buffers being unevictable. Therefore, even if4870* arc_size is above arc_c, another pass is unlikely to4871* be helpful and could potentially cause us to enter an4872* infinite loop. Additionally, zthr_iscancelled() is4873* checked here so that if the arc is shutting down, the4874* broadcast will wake any remaining arc evict waiters.4875*4876* Note we cancel using zthr instead of arc_evict_zthr4877* because the latter may not yet be initializd when the4878* callback is first invoked.4879*/4880mutex_enter(&arc_evict_lock);4881arc_evict_needed = !zthr_iscancelled(zthr) &&4882evicted > 0 && aggsum_compare(&arc_sums.arcstat_size, arc_c) > 0;4883if (!arc_evict_needed) {4884/*4885* We're either no longer overflowing, or we4886* can't evict anything more, so we should wake4887* arc_get_data_impl() sooner.4888*/4889arc_evict_waiter_t *aw;4890while ((aw = list_remove_head(&arc_evict_waiters)) != NULL) {4891cv_broadcast(&aw->aew_cv);4892}4893arc_set_need_free();4894}4895mutex_exit(&arc_evict_lock);4896spl_fstrans_unmark(cookie);4897}48984899static boolean_t4900arc_reap_cb_check(void *arg, zthr_t *zthr)4901{4902(void) arg, (void) zthr;49034904int64_t free_memory = arc_available_memory();4905static int reap_cb_check_counter = 0;49064907/*4908* If a kmem reap is already active, don't schedule more. 
We must4909* check for this because kmem_cache_reap_soon() won't actually4910* block on the cache being reaped (this is to prevent callers from4911* becoming implicitly blocked by a system-wide kmem reap -- which,4912* on a system with many, many full magazines, can take minutes).4913*/4914if (!kmem_cache_reap_active() && free_memory < 0) {49154916arc_no_grow = B_TRUE;4917arc_warm = B_TRUE;4918/*4919* Wait at least zfs_grow_retry (default 5) seconds4920* before considering growing.4921*/4922arc_growtime = gethrtime() + SEC2NSEC(arc_grow_retry);4923return (B_TRUE);4924} else if (free_memory < arc_c >> arc_no_grow_shift) {4925arc_no_grow = B_TRUE;4926} else if (gethrtime() >= arc_growtime) {4927arc_no_grow = B_FALSE;4928}49294930/*4931* Called unconditionally every 60 seconds to reclaim unused4932* zstd compression and decompression context. This is done4933* here to avoid the need for an independent thread.4934*/4935if (!((reap_cb_check_counter++) % 60))4936zfs_zstd_cache_reap_now();49374938return (B_FALSE);4939}49404941/*4942* Keep enough free memory in the system by reaping the ARC's kmem4943* caches. To cause more slabs to be reapable, we may reduce the4944* target size of the cache (arc_c), causing the arc_evict_cb()4945* to free more buffers.4946*/4947static void4948arc_reap_cb(void *arg, zthr_t *zthr)4949{4950int64_t can_free, free_memory, to_free;49514952(void) arg, (void) zthr;4953fstrans_cookie_t cookie = spl_fstrans_mark();49544955/*4956* Kick off asynchronous kmem_reap()'s of all our caches.4957*/4958arc_kmem_reap_soon();49594960/*4961* Wait at least arc_kmem_cache_reap_retry_ms between4962* arc_kmem_reap_soon() calls. Without this check it is possible to4963* end up in a situation where we spend lots of time reaping4964* caches, while we're near arc_c_min. Waiting here also gives the4965* subsequent free memory check a chance of finding that the4966* asynchronous reap has already freed enough memory, and we don't4967* need to call arc_reduce_target_size().4968*/4969delay((hz * arc_kmem_cache_reap_retry_ms + 999) / 1000);49704971/*4972* Reduce the target size as needed to maintain the amount of free4973* memory in the system at a fraction of the arc_size (1/128th by4974* default). If oversubscribed (free_memory < 0) then reduce the4975* target arc_size by the deficit amount plus the fractional4976* amount. If free memory is positive but less than the fractional4977* amount, reduce by what is needed to hit the fractional amount.4978*/4979free_memory = arc_available_memory();4980can_free = arc_c - arc_c_min;4981to_free = (MAX(can_free, 0) >> arc_shrink_shift) - free_memory;4982if (to_free > 0)4983arc_reduce_target_size(to_free);4984spl_fstrans_unmark(cookie);4985}49864987#ifdef _KERNEL4988/*4989* Determine the amount of memory eligible for eviction contained in the4990* ARC. All clean data reported by the ghost lists can always be safely4991* evicted. Due to arc_c_min, the same does not hold for all clean data4992* contained by the regular mru and mfu lists.4993*4994* In the case of the regular mru and mfu lists, we need to report as4995* much clean data as possible, such that evicting that same reported4996* data will not bring arc_size below arc_c_min. Thus, in certain4997* circumstances, the total amount of clean data in the mru and mfu4998* lists might not actually be evictable.4999*5000* The following two distinct cases are accounted for:5001*5002* 1. The sum of the amount of dirty data contained by both the mru and5003* mfu lists, plus the ARC's other accounting (e.g. 
the anon list),5004* is greater than or equal to arc_c_min.5005* (i.e. amount of dirty data >= arc_c_min)5006*5007* This is the easy case; all clean data contained by the mru and mfu5008* lists is evictable. Evicting all clean data can only drop arc_size5009* to the amount of dirty data, which is greater than arc_c_min.5010*5011* 2. The sum of the amount of dirty data contained by both the mru and5012* mfu lists, plus the ARC's other accounting (e.g. the anon list),5013* is less than arc_c_min.5014* (i.e. arc_c_min > amount of dirty data)5015*5016* 2.1. arc_size is greater than or equal arc_c_min.5017* (i.e. arc_size >= arc_c_min > amount of dirty data)5018*5019* In this case, not all clean data from the regular mru and mfu5020* lists is actually evictable; we must leave enough clean data5021* to keep arc_size above arc_c_min. Thus, the maximum amount of5022* evictable data from the two lists combined, is exactly the5023* difference between arc_size and arc_c_min.5024*5025* 2.2. arc_size is less than arc_c_min5026* (i.e. arc_c_min > arc_size > amount of dirty data)5027*5028* In this case, none of the data contained in the mru and mfu5029* lists is evictable, even if it's clean. Since arc_size is5030* already below arc_c_min, evicting any more would only5031* increase this negative difference.5032*/50335034#endif /* _KERNEL */50355036/*5037* Adapt arc info given the number of bytes we are trying to add and5038* the state that we are coming from. This function is only called5039* when we are adding new content to the cache.5040*/5041static void5042arc_adapt(uint64_t bytes)5043{5044/*5045* Wake reap thread if we do not have any available memory5046*/5047if (arc_reclaim_needed()) {5048zthr_wakeup(arc_reap_zthr);5049return;5050}50515052if (arc_no_grow)5053return;50545055if (arc_c >= arc_c_max)5056return;50575058/*5059* If we're within (2 * maxblocksize) bytes of the target5060* cache size, increment the target cache size5061*/5062if (aggsum_upper_bound(&arc_sums.arcstat_size) +50632 * SPA_MAXBLOCKSIZE >= arc_c) {5064uint64_t dc = MAX(bytes, SPA_OLD_MAXBLOCKSIZE);5065if (atomic_add_64_nv(&arc_c, dc) > arc_c_max)5066arc_c = arc_c_max;5067}5068}50695070/*5071* Check if ARC current size has grown past our upper thresholds.5072*/5073static arc_ovf_level_t5074arc_is_overflowing(boolean_t lax, boolean_t use_reserve)5075{5076/*5077* We just compare the lower bound here for performance reasons. Our5078* primary goals are to make sure that the arc never grows without5079* bound, and that it can reach its maximum size. This check5080* accomplishes both goals. The maximum amount we could run over by is5081* 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block5082* in the ARC. In practice, that's in the tens of MB, which is low5083* enough to be safe.5084*/5085int64_t arc_over = aggsum_lower_bound(&arc_sums.arcstat_size) - arc_c -5086zfs_max_recordsize;5087int64_t dn_over = aggsum_lower_bound(&arc_sums.arcstat_dnode_size) -5088arc_dnode_limit;50895090/* Always allow at least one block of overflow. */5091if (arc_over < 0 && dn_over <= 0)5092return (ARC_OVF_NONE);50935094/* If we are under memory pressure, report severe overflow. */5095if (!lax)5096return (ARC_OVF_SEVERE);50975098/* We are not under pressure, so be more or less relaxed. */5099int64_t overflow = (arc_c >> zfs_arc_overflow_shift) / 2;5100if (use_reserve)5101overflow *= 3;5102return (arc_over < overflow ? 
ARC_OVF_SOME : ARC_OVF_SEVERE);5103}51045105static abd_t *5106arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, const void *tag,5107int alloc_flags)5108{5109arc_buf_contents_t type = arc_buf_type(hdr);51105111arc_get_data_impl(hdr, size, tag, alloc_flags);5112if (alloc_flags & ARC_HDR_ALLOC_LINEAR)5113return (abd_alloc_linear(size, type == ARC_BUFC_METADATA));5114else5115return (abd_alloc(size, type == ARC_BUFC_METADATA));5116}51175118static void *5119arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, const void *tag)5120{5121arc_buf_contents_t type = arc_buf_type(hdr);51225123arc_get_data_impl(hdr, size, tag, 0);5124if (type == ARC_BUFC_METADATA) {5125return (zio_buf_alloc(size));5126} else {5127ASSERT(type == ARC_BUFC_DATA);5128return (zio_data_buf_alloc(size));5129}5130}51315132/*5133* Wait for the specified amount of data (in bytes) to be evicted from the5134* ARC, and for there to be sufficient free memory in the system.5135* The lax argument specifies that caller does not have a specific reason5136* to wait, not aware of any memory pressure. Low memory handlers though5137* should set it to B_FALSE to wait for all required evictions to complete.5138* The use_reserve argument allows some callers to wait less than others5139* to not block critical code paths, possibly blocking other resources.5140*/5141void5142arc_wait_for_eviction(uint64_t amount, boolean_t lax, boolean_t use_reserve)5143{5144switch (arc_is_overflowing(lax, use_reserve)) {5145case ARC_OVF_NONE:5146return;5147case ARC_OVF_SOME:5148/*5149* This is a bit racy without taking arc_evict_lock, but the5150* worst that can happen is we either call zthr_wakeup() extra5151* time due to race with other thread here, or the set flag5152* get cleared by arc_evict_cb(), which is unlikely due to5153* big hysteresis, but also not important since at this level5154* of overflow the eviction is purely advisory. Same time5155* taking the global lock here every time without waiting for5156* the actual eviction creates a significant lock contention.5157*/5158if (!arc_evict_needed) {5159arc_evict_needed = B_TRUE;5160zthr_wakeup(arc_evict_zthr);5161}5162return;5163case ARC_OVF_SEVERE:5164default:5165{5166arc_evict_waiter_t aw;5167list_link_init(&aw.aew_node);5168cv_init(&aw.aew_cv, NULL, CV_DEFAULT, NULL);51695170uint64_t last_count = 0;5171mutex_enter(&arc_evict_lock);5172if (!list_is_empty(&arc_evict_waiters)) {5173arc_evict_waiter_t *last =5174list_tail(&arc_evict_waiters);5175last_count = last->aew_count;5176} else if (!arc_evict_needed) {5177arc_evict_needed = B_TRUE;5178zthr_wakeup(arc_evict_zthr);5179}5180/*5181* Note, the last waiter's count may be less than5182* arc_evict_count if we are low on memory in which5183* case arc_evict_state_impl() may have deferred5184* wakeups (but still incremented arc_evict_count).5185*/5186aw.aew_count = MAX(last_count, arc_evict_count) + amount;51875188list_insert_tail(&arc_evict_waiters, &aw);51895190arc_set_need_free();51915192DTRACE_PROBE3(arc__wait__for__eviction,5193uint64_t, amount,5194uint64_t, arc_evict_count,5195uint64_t, aw.aew_count);51965197/*5198* We will be woken up either when arc_evict_count reaches5199* aew_count, or when the ARC is no longer overflowing and5200* eviction completes.5201* In case of "false" wakeup, we will still be on the list.5202*/5203do {5204cv_wait(&aw.aew_cv, &arc_evict_lock);5205} while (list_link_active(&aw.aew_node));5206mutex_exit(&arc_evict_lock);52075208cv_destroy(&aw.aew_cv);5209}5210}5211}52125213/*5214* Allocate a block and return it to the caller. 
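* This function itself only performs the adaptation, throttling and
* state size accounting; the memory is allocated by the
* arc_get_data_abd() and arc_get_data_buf() wrappers that call it.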
If we are hitting the5215* hard limit for the cache size, we must sleep, waiting for the eviction5216* thread to catch up. If we're past the target size but below the hard5217* limit, we'll only signal the reclaim thread and continue on.5218*/5219static void5220arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, const void *tag,5221int alloc_flags)5222{5223arc_adapt(size);52245225/*5226* If arc_size is currently overflowing, we must be adding data5227* faster than we are evicting. To ensure we don't compound the5228* problem by adding more data and forcing arc_size to grow even5229* further past it's target size, we wait for the eviction thread to5230* make some progress. We also wait for there to be sufficient free5231* memory in the system, as measured by arc_free_memory().5232*5233* Specifically, we wait for zfs_arc_eviction_pct percent of the5234* requested size to be evicted. This should be more than 100%, to5235* ensure that that progress is also made towards getting arc_size5236* under arc_c. See the comment above zfs_arc_eviction_pct.5237*/5238arc_wait_for_eviction(size * zfs_arc_eviction_pct / 100,5239B_TRUE, alloc_flags & ARC_HDR_USE_RESERVE);52405241arc_buf_contents_t type = arc_buf_type(hdr);5242if (type == ARC_BUFC_METADATA) {5243arc_space_consume(size, ARC_SPACE_META);5244} else {5245arc_space_consume(size, ARC_SPACE_DATA);5246}52475248/*5249* Update the state size. Note that ghost states have a5250* "ghost size" and so don't need to be updated.5251*/5252arc_state_t *state = hdr->b_l1hdr.b_state;5253if (!GHOST_STATE(state)) {52545255(void) zfs_refcount_add_many(&state->arcs_size[type], size,5256tag);52575258/*5259* If this is reached via arc_read, the link is5260* protected by the hash lock. If reached via5261* arc_buf_alloc, the header should not be accessed by5262* any other thread. 
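* (Such a header is still anonymous and has not yet been inserted into
* the hash table.)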
And, if reached via arc_read_done,5263* the hash lock will protect it if it's found in the5264* hash table; otherwise no other thread should be5265* trying to [add|remove]_reference it.5266*/5267if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {5268ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));5269(void) zfs_refcount_add_many(&state->arcs_esize[type],5270size, tag);5271}5272}5273}52745275static void5276arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size,5277const void *tag)5278{5279arc_free_data_impl(hdr, size, tag);5280abd_free(abd);5281}52825283static void5284arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, const void *tag)5285{5286arc_buf_contents_t type = arc_buf_type(hdr);52875288arc_free_data_impl(hdr, size, tag);5289if (type == ARC_BUFC_METADATA) {5290zio_buf_free(buf, size);5291} else {5292ASSERT(type == ARC_BUFC_DATA);5293zio_data_buf_free(buf, size);5294}5295}52965297/*5298* Free the arc data buffer.5299*/5300static void5301arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, const void *tag)5302{5303arc_state_t *state = hdr->b_l1hdr.b_state;5304arc_buf_contents_t type = arc_buf_type(hdr);53055306/* protected by hash lock, if in the hash table */5307if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) {5308ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt));5309ASSERT(state != arc_anon && state != arc_l2c_only);53105311(void) zfs_refcount_remove_many(&state->arcs_esize[type],5312size, tag);5313}5314(void) zfs_refcount_remove_many(&state->arcs_size[type], size, tag);53155316VERIFY3U(hdr->b_type, ==, type);5317if (type == ARC_BUFC_METADATA) {5318arc_space_return(size, ARC_SPACE_META);5319} else {5320ASSERT(type == ARC_BUFC_DATA);5321arc_space_return(size, ARC_SPACE_DATA);5322}5323}53245325/*5326* This routine is called whenever a buffer is accessed.5327*/5328static void5329arc_access(arc_buf_hdr_t *hdr, arc_flags_t arc_flags, boolean_t hit)5330{5331ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));5332ASSERT(HDR_HAS_L1HDR(hdr));53335334/*5335* Update buffer prefetch status.5336*/5337boolean_t was_prefetch = HDR_PREFETCH(hdr);5338boolean_t now_prefetch = arc_flags & ARC_FLAG_PREFETCH;5339if (was_prefetch != now_prefetch) {5340if (was_prefetch) {5341ARCSTAT_CONDSTAT(hit, demand_hit, demand_iohit,5342HDR_PRESCIENT_PREFETCH(hdr), prescient, predictive,5343prefetch);5344}5345if (HDR_HAS_L2HDR(hdr))5346l2arc_hdr_arcstats_decrement_state(hdr);5347if (was_prefetch) {5348arc_hdr_clear_flags(hdr,5349ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH);5350} else {5351arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH);5352}5353if (HDR_HAS_L2HDR(hdr))5354l2arc_hdr_arcstats_increment_state(hdr);5355}5356if (now_prefetch) {5357if (arc_flags & ARC_FLAG_PRESCIENT_PREFETCH) {5358arc_hdr_set_flags(hdr, ARC_FLAG_PRESCIENT_PREFETCH);5359ARCSTAT_BUMP(arcstat_prescient_prefetch);5360} else {5361ARCSTAT_BUMP(arcstat_predictive_prefetch);5362}5363}5364if (arc_flags & ARC_FLAG_L2CACHE)5365arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);53665367clock_t now = ddi_get_lbolt();5368if (hdr->b_l1hdr.b_state == arc_anon) {5369arc_state_t *new_state;5370/*5371* This buffer is not in the cache, and does not appear in5372* our "ghost" lists. 
Add it to the MRU or uncached state.5373*/5374ASSERT0(hdr->b_l1hdr.b_arc_access);5375hdr->b_l1hdr.b_arc_access = now;5376if (HDR_UNCACHED(hdr)) {5377new_state = arc_uncached;5378DTRACE_PROBE1(new_state__uncached, arc_buf_hdr_t *,5379hdr);5380} else {5381new_state = arc_mru;5382DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);5383}5384arc_change_state(new_state, hdr);5385} else if (hdr->b_l1hdr.b_state == arc_mru) {5386/*5387* This buffer has been accessed once recently and either5388* its read is still in progress or it is in the cache.5389*/5390if (HDR_IO_IN_PROGRESS(hdr)) {5391hdr->b_l1hdr.b_arc_access = now;5392return;5393}5394hdr->b_l1hdr.b_mru_hits++;5395ARCSTAT_BUMP(arcstat_mru_hits);53965397/*5398* If the previous access was a prefetch, then it already5399* handled possible promotion, so nothing more to do for now.5400*/5401if (was_prefetch) {5402hdr->b_l1hdr.b_arc_access = now;5403return;5404}54055406/*5407* If more than ARC_MINTIME have passed from the previous5408* hit, promote the buffer to the MFU state.5409*/5410if (ddi_time_after(now, hdr->b_l1hdr.b_arc_access +5411ARC_MINTIME)) {5412hdr->b_l1hdr.b_arc_access = now;5413DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);5414arc_change_state(arc_mfu, hdr);5415}5416} else if (hdr->b_l1hdr.b_state == arc_mru_ghost) {5417arc_state_t *new_state;5418/*5419* This buffer has been accessed once recently, but was5420* evicted from the cache. Would we have bigger MRU, it5421* would be an MRU hit, so handle it the same way, except5422* we don't need to check the previous access time.5423*/5424hdr->b_l1hdr.b_mru_ghost_hits++;5425ARCSTAT_BUMP(arcstat_mru_ghost_hits);5426hdr->b_l1hdr.b_arc_access = now;5427wmsum_add(&arc_mru_ghost->arcs_hits[arc_buf_type(hdr)],5428arc_hdr_size(hdr));5429if (was_prefetch) {5430new_state = arc_mru;5431DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);5432} else {5433new_state = arc_mfu;5434DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);5435}5436arc_change_state(new_state, hdr);5437} else if (hdr->b_l1hdr.b_state == arc_mfu) {5438/*5439* This buffer has been accessed more than once and either5440* still in the cache or being restored from one of ghosts.5441*/5442if (!HDR_IO_IN_PROGRESS(hdr)) {5443hdr->b_l1hdr.b_mfu_hits++;5444ARCSTAT_BUMP(arcstat_mfu_hits);5445}5446hdr->b_l1hdr.b_arc_access = now;5447} else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) {5448/*5449* This buffer has been accessed more than once recently, but5450* has been evicted from the cache. Would we have bigger MFU5451* it would stay in cache, so move it back to MFU state.5452*/5453hdr->b_l1hdr.b_mfu_ghost_hits++;5454ARCSTAT_BUMP(arcstat_mfu_ghost_hits);5455hdr->b_l1hdr.b_arc_access = now;5456wmsum_add(&arc_mfu_ghost->arcs_hits[arc_buf_type(hdr)],5457arc_hdr_size(hdr));5458DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr);5459arc_change_state(arc_mfu, hdr);5460} else if (hdr->b_l1hdr.b_state == arc_uncached) {5461/*5462* This buffer is uncacheable, but we got a hit. Probably5463* a demand read after prefetch. 
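 *
 * (Taken together, the branches of this function implement the
 * following moves: a new anonymous header goes to MRU, or to the
 * uncached state if requested; a second demand hit on MRU more than
 * ARC_MINTIME after the first promotes it to MFU; a hit on an
 * MRU-ghost header moves it to MFU, or back to MRU if it was a
 * prefetch; a hit on an MFU-ghost header moves it back to MFU; and an
 * L2-only header is treated as new and returns to MRU.)
 *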
Nothing more to do here.5464*/5465if (!HDR_IO_IN_PROGRESS(hdr))5466ARCSTAT_BUMP(arcstat_uncached_hits);5467hdr->b_l1hdr.b_arc_access = now;5468} else if (hdr->b_l1hdr.b_state == arc_l2c_only) {5469/*5470* This buffer is on the 2nd Level ARC and was not accessed5471* for a long time, so treat it as new and put into MRU.5472*/5473hdr->b_l1hdr.b_arc_access = now;5474DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr);5475arc_change_state(arc_mru, hdr);5476} else {5477cmn_err(CE_PANIC, "invalid arc state 0x%p",5478hdr->b_l1hdr.b_state);5479}5480}54815482/*5483* This routine is called by dbuf_hold() to update the arc_access() state5484* which otherwise would be skipped for entries in the dbuf cache.5485*/5486void5487arc_buf_access(arc_buf_t *buf)5488{5489arc_buf_hdr_t *hdr = buf->b_hdr;54905491/*5492* Avoid taking the hash_lock when possible as an optimization.5493* The header must be checked again under the hash_lock in order5494* to handle the case where it is concurrently being released.5495*/5496if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr))5497return;54985499kmutex_t *hash_lock = HDR_LOCK(hdr);5500mutex_enter(hash_lock);55015502if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) {5503mutex_exit(hash_lock);5504ARCSTAT_BUMP(arcstat_access_skip);5505return;5506}55075508ASSERT(hdr->b_l1hdr.b_state == arc_mru ||5509hdr->b_l1hdr.b_state == arc_mfu ||5510hdr->b_l1hdr.b_state == arc_uncached);55115512DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);5513arc_access(hdr, 0, B_TRUE);5514mutex_exit(hash_lock);55155516ARCSTAT_BUMP(arcstat_hits);5517ARCSTAT_CONDSTAT(B_TRUE /* demand */, demand, prefetch,5518!HDR_ISTYPE_METADATA(hdr), data, metadata, hits);5519}55205521/* a generic arc_read_done_func_t which you can use */5522void5523arc_bcopy_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,5524arc_buf_t *buf, void *arg)5525{5526(void) zio, (void) zb, (void) bp;55275528if (buf == NULL)5529return;55305531memcpy(arg, buf->b_data, arc_buf_size(buf));5532arc_buf_destroy(buf, arg);5533}55345535/* a generic arc_read_done_func_t */5536void5537arc_getbuf_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp,5538arc_buf_t *buf, void *arg)5539{5540(void) zb, (void) bp;5541arc_buf_t **bufp = arg;55425543if (buf == NULL) {5544ASSERT(zio == NULL || zio->io_error != 0);5545*bufp = NULL;5546} else {5547ASSERT(zio == NULL || zio->io_error == 0);5548*bufp = buf;5549ASSERT(buf->b_data != NULL);5550}5551}55525553static void5554arc_hdr_verify(arc_buf_hdr_t *hdr, blkptr_t *bp)5555{5556if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {5557ASSERT0(HDR_GET_PSIZE(hdr));5558ASSERT3U(arc_hdr_get_compress(hdr), ==, ZIO_COMPRESS_OFF);5559} else {5560if (HDR_COMPRESSION_ENABLED(hdr)) {5561ASSERT3U(arc_hdr_get_compress(hdr), ==,5562BP_GET_COMPRESS(bp));5563}5564ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));5565ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp));5566ASSERT3U(!!HDR_PROTECTED(hdr), ==, BP_IS_PROTECTED(bp));5567}5568}55695570static void5571arc_read_done(zio_t *zio)5572{5573blkptr_t *bp = zio->io_bp;5574arc_buf_hdr_t *hdr = zio->io_private;5575kmutex_t *hash_lock = NULL;5576arc_callback_t *callback_list;5577arc_callback_t *acb;55785579/*5580* The hdr was inserted into hash-table and removed from lists5581* prior to starting I/O. We should find this header, since5582* it's in the hash table, and it should be legit since it's5583* not possible to evict it during the I/O. 
The only possible5584* reason for it not to be found is if we were freed during the5585* read.5586*/5587if (HDR_IN_HASH_TABLE(hdr)) {5588arc_buf_hdr_t *found;55895590ASSERT3U(hdr->b_birth, ==, BP_GET_PHYSICAL_BIRTH(zio->io_bp));5591ASSERT3U(hdr->b_dva.dva_word[0], ==,5592BP_IDENTITY(zio->io_bp)->dva_word[0]);5593ASSERT3U(hdr->b_dva.dva_word[1], ==,5594BP_IDENTITY(zio->io_bp)->dva_word[1]);55955596found = buf_hash_find(hdr->b_spa, zio->io_bp, &hash_lock);55975598ASSERT((found == hdr &&5599DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||5600(found == hdr && HDR_L2_READING(hdr)));5601ASSERT3P(hash_lock, !=, NULL);5602}56035604if (BP_IS_PROTECTED(bp)) {5605hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp);5606hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset;5607zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt,5608hdr->b_crypt_hdr.b_iv);56095610if (zio->io_error == 0) {5611if (BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG) {5612void *tmpbuf;56135614tmpbuf = abd_borrow_buf_copy(zio->io_abd,5615sizeof (zil_chain_t));5616zio_crypt_decode_mac_zil(tmpbuf,5617hdr->b_crypt_hdr.b_mac);5618abd_return_buf(zio->io_abd, tmpbuf,5619sizeof (zil_chain_t));5620} else {5621zio_crypt_decode_mac_bp(bp,5622hdr->b_crypt_hdr.b_mac);5623}5624}5625}56265627if (zio->io_error == 0) {5628/* byteswap if necessary */5629if (BP_SHOULD_BYTESWAP(zio->io_bp)) {5630if (BP_GET_LEVEL(zio->io_bp) > 0) {5631hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;5632} else {5633hdr->b_l1hdr.b_byteswap =5634DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp));5635}5636} else {5637hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;5638}5639if (!HDR_L2_READING(hdr)) {5640hdr->b_complevel = zio->io_prop.zp_complevel;5641}5642}56435644arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED);5645if (l2arc_noprefetch && HDR_PREFETCH(hdr))5646arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE);56475648callback_list = hdr->b_l1hdr.b_acb;5649ASSERT3P(callback_list, !=, NULL);5650hdr->b_l1hdr.b_acb = NULL;56515652/*5653* If a read request has a callback (i.e. acb_done is not NULL), then we5654* make a buf containing the data according to the parameters which were5655* passed in. The implementation of arc_buf_alloc_impl() ensures that we5656* aren't needlessly decompressing the data multiple times.5657*/5658int callback_cnt = 0;5659for (acb = callback_list; acb != NULL; acb = acb->acb_next) {56605661/* We need the last one to call below in original order. */5662callback_list = acb;56635664if (!acb->acb_done || acb->acb_nobuf)5665continue;56665667callback_cnt++;56685669if (zio->io_error != 0)5670continue;56715672int error = arc_buf_alloc_impl(hdr, zio->io_spa,5673&acb->acb_zb, acb->acb_private, acb->acb_encrypted,5674acb->acb_compressed, acb->acb_noauth, B_TRUE,5675&acb->acb_buf);56765677/*5678* Assert non-speculative zios didn't fail because an5679* encryption key wasn't loaded5680*/5681ASSERT((zio->io_flags & ZIO_FLAG_SPECULATIVE) ||5682error != EACCES);56835684/*5685* If we failed to decrypt, report an error now (as the zio5686* layer would have done if it had done the transforms).5687*/5688if (error == ECKSUM) {5689ASSERT(BP_IS_PROTECTED(bp));5690error = SET_ERROR(EIO);5691if ((zio->io_flags & ZIO_FLAG_SPECULATIVE) == 0) {5692spa_log_error(zio->io_spa, &acb->acb_zb,5693BP_GET_PHYSICAL_BIRTH(zio->io_bp));5694(void) zfs_ereport_post(5695FM_EREPORT_ZFS_AUTHENTICATION,5696zio->io_spa, NULL, &acb->acb_zb, zio, 0);5697}5698}56995700if (error != 0) {5701/*5702* Decompression or decryption failed. Set5703* io_error so that when we call acb_done5704* (below), we will indicate that the read5705* failed. 
Note that in the unusual case5706* where one callback is compressed and another5707* uncompressed, we will mark all of them5708* as failed, even though the uncompressed5709* one can't actually fail. In this case,5710* the hdr will not be anonymous, because5711* if there are multiple callbacks, it's5712* because multiple threads found the same5713* arc buf in the hash table.5714*/5715zio->io_error = error;5716}5717}57185719/*5720* If there are multiple callbacks, we must have the hash lock,5721* because the only way for multiple threads to find this hdr is5722* in the hash table. This ensures that if there are multiple5723* callbacks, the hdr is not anonymous. If it were anonymous,5724* we couldn't use arc_buf_destroy() in the error case below.5725*/5726ASSERT(callback_cnt < 2 || hash_lock != NULL);57275728if (zio->io_error == 0) {5729arc_hdr_verify(hdr, zio->io_bp);5730} else {5731arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR);5732if (hdr->b_l1hdr.b_state != arc_anon)5733arc_change_state(arc_anon, hdr);5734if (HDR_IN_HASH_TABLE(hdr))5735buf_hash_remove(hdr);5736}57375738arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);5739(void) remove_reference(hdr, hdr);57405741if (hash_lock != NULL)5742mutex_exit(hash_lock);57435744/* execute each callback and free its structure */5745while ((acb = callback_list) != NULL) {5746if (acb->acb_done != NULL) {5747if (zio->io_error != 0 && acb->acb_buf != NULL) {5748/*5749* If arc_buf_alloc_impl() fails during5750* decompression, the buf will still be5751* allocated, and needs to be freed here.5752*/5753arc_buf_destroy(acb->acb_buf,5754acb->acb_private);5755acb->acb_buf = NULL;5756}5757acb->acb_done(zio, &zio->io_bookmark, zio->io_bp,5758acb->acb_buf, acb->acb_private);5759}57605761if (acb->acb_zio_dummy != NULL) {5762acb->acb_zio_dummy->io_error = zio->io_error;5763zio_nowait(acb->acb_zio_dummy);5764}57655766callback_list = acb->acb_prev;5767if (acb->acb_wait) {5768mutex_enter(&acb->acb_wait_lock);5769acb->acb_wait_error = zio->io_error;5770acb->acb_wait = B_FALSE;5771cv_signal(&acb->acb_wait_cv);5772mutex_exit(&acb->acb_wait_lock);5773/* acb will be freed by the waiting thread. */5774} else {5775kmem_free(acb, sizeof (arc_callback_t));5776}5777}5778}57795780/*5781* Lookup the block at the specified DVA (in bp), and return the manner in5782* which the block is cached. A zero return indicates not cached.5783*/5784int5785arc_cached(spa_t *spa, const blkptr_t *bp)5786{5787arc_buf_hdr_t *hdr = NULL;5788kmutex_t *hash_lock = NULL;5789uint64_t guid = spa_load_guid(spa);5790int flags = 0;57915792if (BP_IS_EMBEDDED(bp))5793return (ARC_CACHED_EMBEDDED);57945795hdr = buf_hash_find(guid, bp, &hash_lock);5796if (hdr == NULL)5797return (0);57985799if (HDR_HAS_L1HDR(hdr)) {5800arc_state_t *state = hdr->b_l1hdr.b_state;5801/*5802* We switch to ensure that any future arc_state_type_t5803* changes are handled. This is just a shift to promote5804* more compile-time checking.5805*/5806switch (state->arcs_state) {5807case ARC_STATE_ANON:5808break;5809case ARC_STATE_MRU:5810flags |= ARC_CACHED_IN_MRU | ARC_CACHED_IN_L1;5811break;5812case ARC_STATE_MFU:5813flags |= ARC_CACHED_IN_MFU | ARC_CACHED_IN_L1;5814break;5815case ARC_STATE_UNCACHED:5816/* The header is still in L1, probably not for long */5817flags |= ARC_CACHED_IN_L1;5818break;5819default:5820break;5821}5822}5823if (HDR_HAS_L2HDR(hdr))5824flags |= ARC_CACHED_IN_L2;58255826mutex_exit(hash_lock);58275828return (flags);5829}58305831/*5832* "Read" the block at the specified DVA (in bp) via the5833* cache. 
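 *
 * A typical synchronous caller might look roughly like the following
 * sketch (illustrative only; the priority, zio flags and ARC flags
 * depend on the caller):
 *
 *	arc_flags_t aflags = ARC_FLAG_WAIT;
 *	arc_buf_t *abuf = NULL;
 *
 *	int err = arc_read(NULL, spa, bp, arc_getbuf_func, &abuf,
 *	    ZIO_PRIORITY_SYNC_READ, ZIO_FLAG_CANFAIL, &aflags, zb);
 *	if (err == 0) {
 *		(consume arc_buf_size(abuf) bytes at abuf->b_data)
 *		arc_buf_destroy(abuf, &abuf);
 *	}
 *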
If the block is found in the cache, invoke the provided5834* callback immediately and return. Note that the `zio' parameter5835* in the callback will be NULL in this case, since no IO was5836* required. If the block is not in the cache pass the read request5837* on to the spa with a substitute callback function, so that the5838* requested block will be added to the cache.5839*5840* If a read request arrives for a block that has a read in-progress,5841* either wait for the in-progress read to complete (and return the5842* results); or, if this is a read with a "done" func, add a record5843* to the read to invoke the "done" func when the read completes,5844* and return; or just return.5845*5846* arc_read_done() will invoke all the requested "done" functions5847* for readers of this block.5848*/5849int5850arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,5851arc_read_done_func_t *done, void *private, zio_priority_t priority,5852int zio_flags, arc_flags_t *arc_flags, const zbookmark_phys_t *zb)5853{5854arc_buf_hdr_t *hdr = NULL;5855kmutex_t *hash_lock = NULL;5856zio_t *rzio;5857uint64_t guid = spa_load_guid(spa);5858boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW_COMPRESS) != 0;5859boolean_t encrypted_read = BP_IS_ENCRYPTED(bp) &&5860(zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0;5861boolean_t noauth_read = BP_IS_AUTHENTICATED(bp) &&5862(zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0;5863boolean_t embedded_bp = !!BP_IS_EMBEDDED(bp);5864boolean_t no_buf = *arc_flags & ARC_FLAG_NO_BUF;5865arc_buf_t *buf = NULL;5866int rc = 0;5867boolean_t bp_validation = B_FALSE;58685869ASSERT(!embedded_bp ||5870BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA);5871ASSERT(!BP_IS_HOLE(bp));5872ASSERT(!BP_IS_REDACTED(bp));58735874/*5875* Normally SPL_FSTRANS will already be set since kernel threads which5876* expect to call the DMU interfaces will set it when created. System5877* calls are similarly handled by setting/cleaning the bit in the5878* registered callback (module/os/.../zfs/zpl_*).5879*5880* External consumers such as Lustre which call the exported DMU5881* interfaces may not have set SPL_FSTRANS. To avoid a deadlock5882* on the hash_lock always set and clear the bit.5883*/5884fstrans_cookie_t cookie = spl_fstrans_mark();5885top:5886if (!embedded_bp) {5887/*5888* Embedded BP's have no DVA and require no I/O to "read".5889* Create an anonymous arc buf to back it.5890*/5891hdr = buf_hash_find(guid, bp, &hash_lock);5892}58935894/*5895* Determine if we have an L1 cache hit or a cache miss. For simplicity5896* we maintain encrypted data separately from compressed / uncompressed5897* data. If the user is requesting raw encrypted data and we don't have5898* that in the header we will read from disk to guarantee that we can5899* get it even if the encryption keys aren't loaded.5900*/5901if (hdr != NULL && HDR_HAS_L1HDR(hdr) && (HDR_HAS_RABD(hdr) ||5902(hdr->b_l1hdr.b_pabd != NULL && !encrypted_read))) {5903boolean_t is_data = !HDR_ISTYPE_METADATA(hdr);59045905/*5906* Verify the block pointer contents are reasonable. 
This5907* should always be the case since the blkptr is protected by5908* a checksum.5909*/5910if (zfs_blkptr_verify(spa, bp, BLK_CONFIG_SKIP,5911BLK_VERIFY_LOG)) {5912mutex_exit(hash_lock);5913rc = SET_ERROR(ECKSUM);5914goto done;5915}59165917if (HDR_IO_IN_PROGRESS(hdr)) {5918if (*arc_flags & ARC_FLAG_CACHED_ONLY) {5919mutex_exit(hash_lock);5920ARCSTAT_BUMP(arcstat_cached_only_in_progress);5921rc = SET_ERROR(ENOENT);5922goto done;5923}59245925zio_t *head_zio = hdr->b_l1hdr.b_acb->acb_zio_head;5926ASSERT3P(head_zio, !=, NULL);5927if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) &&5928priority == ZIO_PRIORITY_SYNC_READ) {5929/*5930* This is a sync read that needs to wait for5931* an in-flight async read. Request that the5932* zio have its priority upgraded.5933*/5934zio_change_priority(head_zio, priority);5935DTRACE_PROBE1(arc__async__upgrade__sync,5936arc_buf_hdr_t *, hdr);5937ARCSTAT_BUMP(arcstat_async_upgrade_sync);5938}59395940DTRACE_PROBE1(arc__iohit, arc_buf_hdr_t *, hdr);5941arc_access(hdr, *arc_flags, B_FALSE);59425943/*5944* If there are multiple threads reading the same block5945* and that block is not yet in the ARC, then only one5946* thread will do the physical I/O and all other5947* threads will wait until that I/O completes.5948* Synchronous reads use the acb_wait_cv whereas nowait5949* reads register a callback. Both are signalled/called5950* in arc_read_done.5951*5952* Errors of the physical I/O may need to be propagated.5953* Synchronous read errors are returned here from5954* arc_read_done via acb_wait_error. Nowait reads5955* attach the acb_zio_dummy zio to pio and5956* arc_read_done propagates the physical I/O's io_error5957* to acb_zio_dummy, and thereby to pio.5958*/5959arc_callback_t *acb = NULL;5960if (done || pio || *arc_flags & ARC_FLAG_WAIT) {5961acb = kmem_zalloc(sizeof (arc_callback_t),5962KM_SLEEP);5963acb->acb_done = done;5964acb->acb_private = private;5965acb->acb_compressed = compressed_read;5966acb->acb_encrypted = encrypted_read;5967acb->acb_noauth = noauth_read;5968acb->acb_nobuf = no_buf;5969if (*arc_flags & ARC_FLAG_WAIT) {5970acb->acb_wait = B_TRUE;5971mutex_init(&acb->acb_wait_lock, NULL,5972MUTEX_DEFAULT, NULL);5973cv_init(&acb->acb_wait_cv, NULL,5974CV_DEFAULT, NULL);5975}5976acb->acb_zb = *zb;5977if (pio != NULL) {5978acb->acb_zio_dummy = zio_null(pio,5979spa, NULL, NULL, NULL, zio_flags);5980}5981acb->acb_zio_head = head_zio;5982acb->acb_next = hdr->b_l1hdr.b_acb;5983hdr->b_l1hdr.b_acb->acb_prev = acb;5984hdr->b_l1hdr.b_acb = acb;5985}5986mutex_exit(hash_lock);59875988ARCSTAT_BUMP(arcstat_iohits);5989ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),5990demand, prefetch, is_data, data, metadata, iohits);59915992if (*arc_flags & ARC_FLAG_WAIT) {5993mutex_enter(&acb->acb_wait_lock);5994while (acb->acb_wait) {5995cv_wait(&acb->acb_wait_cv,5996&acb->acb_wait_lock);5997}5998rc = acb->acb_wait_error;5999mutex_exit(&acb->acb_wait_lock);6000mutex_destroy(&acb->acb_wait_lock);6001cv_destroy(&acb->acb_wait_cv);6002kmem_free(acb, sizeof (arc_callback_t));6003}6004goto out;6005}60066007ASSERT(hdr->b_l1hdr.b_state == arc_mru ||6008hdr->b_l1hdr.b_state == arc_mfu ||6009hdr->b_l1hdr.b_state == arc_uncached);60106011DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);6012arc_access(hdr, *arc_flags, B_TRUE);60136014if (done && !no_buf) {6015ASSERT(!embedded_bp || !BP_IS_HOLE(bp));60166017/* Get a buf with the desired data in it. 
*/6018rc = arc_buf_alloc_impl(hdr, spa, zb, private,6019encrypted_read, compressed_read, noauth_read,6020B_TRUE, &buf);6021if (rc == ECKSUM) {6022/*6023* Convert authentication and decryption errors6024* to EIO (and generate an ereport if needed)6025* before leaving the ARC.6026*/6027rc = SET_ERROR(EIO);6028if ((zio_flags & ZIO_FLAG_SPECULATIVE) == 0) {6029spa_log_error(spa, zb, hdr->b_birth);6030(void) zfs_ereport_post(6031FM_EREPORT_ZFS_AUTHENTICATION,6032spa, NULL, zb, NULL, 0);6033}6034}6035if (rc != 0) {6036arc_buf_destroy_impl(buf);6037buf = NULL;6038(void) remove_reference(hdr, private);6039}60406041/* assert any errors weren't due to unloaded keys */6042ASSERT((zio_flags & ZIO_FLAG_SPECULATIVE) ||6043rc != EACCES);6044}6045mutex_exit(hash_lock);6046ARCSTAT_BUMP(arcstat_hits);6047ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),6048demand, prefetch, is_data, data, metadata, hits);6049*arc_flags |= ARC_FLAG_CACHED;6050goto done;6051} else {6052uint64_t lsize = BP_GET_LSIZE(bp);6053uint64_t psize = BP_GET_PSIZE(bp);6054arc_callback_t *acb;6055vdev_t *vd = NULL;6056uint64_t addr = 0;6057boolean_t devw = B_FALSE;6058uint64_t size;6059abd_t *hdr_abd;6060int alloc_flags = encrypted_read ? ARC_HDR_ALLOC_RDATA : 0;6061arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);6062int config_lock;6063int error;60646065if (*arc_flags & ARC_FLAG_CACHED_ONLY) {6066if (hash_lock != NULL)6067mutex_exit(hash_lock);6068rc = SET_ERROR(ENOENT);6069goto done;6070}60716072if (zio_flags & ZIO_FLAG_CONFIG_WRITER) {6073config_lock = BLK_CONFIG_HELD;6074} else if (hash_lock != NULL) {6075/*6076* Prevent lock order reversal6077*/6078config_lock = BLK_CONFIG_NEEDED_TRY;6079} else {6080config_lock = BLK_CONFIG_NEEDED;6081}60826083/*6084* Verify the block pointer contents are reasonable. This6085* should always be the case since the blkptr is protected by6086* a checksum.6087*/6088if (!bp_validation && (error = zfs_blkptr_verify(spa, bp,6089config_lock, BLK_VERIFY_LOG))) {6090if (hash_lock != NULL)6091mutex_exit(hash_lock);6092if (error == EBUSY && !zfs_blkptr_verify(spa, bp,6093BLK_CONFIG_NEEDED, BLK_VERIFY_LOG)) {6094bp_validation = B_TRUE;6095goto top;6096}6097rc = SET_ERROR(ECKSUM);6098goto done;6099}61006101if (hdr == NULL) {6102/*6103* This block is not in the cache or it has6104* embedded data.6105*/6106arc_buf_hdr_t *exists = NULL;6107hdr = arc_hdr_alloc(guid, psize, lsize,6108BP_IS_PROTECTED(bp), BP_GET_COMPRESS(bp), 0, type);61096110if (!embedded_bp) {6111hdr->b_dva = *BP_IDENTITY(bp);6112hdr->b_birth = BP_GET_PHYSICAL_BIRTH(bp);6113exists = buf_hash_insert(hdr, &hash_lock);6114}6115if (exists != NULL) {6116/* somebody beat us to the hash insert */6117mutex_exit(hash_lock);6118buf_discard_identity(hdr);6119arc_hdr_destroy(hdr);6120goto top; /* restart the IO request */6121}6122} else {6123/*6124* This block is in the ghost cache or encrypted data6125* was requested and we didn't have it. 
If it was6126* L2-only (and thus didn't have an L1 hdr),6127* we realloc the header to add an L1 hdr.6128*/6129if (!HDR_HAS_L1HDR(hdr)) {6130hdr = arc_hdr_realloc(hdr, hdr_l2only_cache,6131hdr_full_cache);6132}61336134if (GHOST_STATE(hdr->b_l1hdr.b_state)) {6135ASSERT0P(hdr->b_l1hdr.b_pabd);6136ASSERT(!HDR_HAS_RABD(hdr));6137ASSERT(!HDR_IO_IN_PROGRESS(hdr));6138ASSERT0(zfs_refcount_count(6139&hdr->b_l1hdr.b_refcnt));6140ASSERT0P(hdr->b_l1hdr.b_buf);6141#ifdef ZFS_DEBUG6142ASSERT0P(hdr->b_l1hdr.b_freeze_cksum);6143#endif6144} else if (HDR_IO_IN_PROGRESS(hdr)) {6145/*6146* If this header already had an IO in progress6147* and we are performing another IO to fetch6148* encrypted data we must wait until the first6149* IO completes so as not to confuse6150* arc_read_done(). This should be very rare6151* and so the performance impact shouldn't6152* matter.6153*/6154arc_callback_t *acb = kmem_zalloc(6155sizeof (arc_callback_t), KM_SLEEP);6156acb->acb_wait = B_TRUE;6157mutex_init(&acb->acb_wait_lock, NULL,6158MUTEX_DEFAULT, NULL);6159cv_init(&acb->acb_wait_cv, NULL, CV_DEFAULT,6160NULL);6161acb->acb_zio_head =6162hdr->b_l1hdr.b_acb->acb_zio_head;6163acb->acb_next = hdr->b_l1hdr.b_acb;6164hdr->b_l1hdr.b_acb->acb_prev = acb;6165hdr->b_l1hdr.b_acb = acb;6166mutex_exit(hash_lock);6167mutex_enter(&acb->acb_wait_lock);6168while (acb->acb_wait) {6169cv_wait(&acb->acb_wait_cv,6170&acb->acb_wait_lock);6171}6172mutex_exit(&acb->acb_wait_lock);6173mutex_destroy(&acb->acb_wait_lock);6174cv_destroy(&acb->acb_wait_cv);6175kmem_free(acb, sizeof (arc_callback_t));6176goto top;6177}6178}6179if (*arc_flags & ARC_FLAG_UNCACHED) {6180arc_hdr_set_flags(hdr, ARC_FLAG_UNCACHED);6181if (!encrypted_read)6182alloc_flags |= ARC_HDR_ALLOC_LINEAR;6183}61846185/*6186* Take additional reference for IO_IN_PROGRESS. It stops6187* arc_access() from putting this header without any buffers6188* and so other references but obviously nonevictable onto6189* the evictable list of MRU or MFU state.6190*/6191add_reference(hdr, hdr);6192if (!embedded_bp)6193arc_access(hdr, *arc_flags, B_FALSE);6194arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);6195arc_hdr_alloc_abd(hdr, alloc_flags);6196if (encrypted_read) {6197ASSERT(HDR_HAS_RABD(hdr));6198size = HDR_GET_PSIZE(hdr);6199hdr_abd = hdr->b_crypt_hdr.b_rabd;6200zio_flags |= ZIO_FLAG_RAW;6201} else {6202ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);6203size = arc_hdr_size(hdr);6204hdr_abd = hdr->b_l1hdr.b_pabd;62056206if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) {6207zio_flags |= ZIO_FLAG_RAW_COMPRESS;6208}62096210/*6211* For authenticated bp's, we do not ask the ZIO layer6212* to authenticate them since this will cause the entire6213* IO to fail if the key isn't loaded. 
Instead, we6214* defer authentication until arc_buf_fill(), which will6215* verify the data when the key is available.6216*/6217if (BP_IS_AUTHENTICATED(bp))6218zio_flags |= ZIO_FLAG_RAW_ENCRYPT;6219}62206221if (BP_IS_AUTHENTICATED(bp))6222arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH);6223if (BP_GET_LEVEL(bp) > 0)6224arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT);6225ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state));62266227acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);6228acb->acb_done = done;6229acb->acb_private = private;6230acb->acb_compressed = compressed_read;6231acb->acb_encrypted = encrypted_read;6232acb->acb_noauth = noauth_read;6233acb->acb_nobuf = no_buf;6234acb->acb_zb = *zb;62356236ASSERT0P(hdr->b_l1hdr.b_acb);6237hdr->b_l1hdr.b_acb = acb;62386239if (HDR_HAS_L2HDR(hdr) &&6240(vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) {6241devw = hdr->b_l2hdr.b_dev->l2ad_writing;6242addr = hdr->b_l2hdr.b_daddr;6243/*6244* Lock out L2ARC device removal.6245*/6246if (vdev_is_dead(vd) ||6247!spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))6248vd = NULL;6249}62506251/*6252* We count both async reads and scrub IOs as asynchronous so6253* that both can be upgraded in the event of a cache hit while6254* the read IO is still in-flight.6255*/6256if (priority == ZIO_PRIORITY_ASYNC_READ ||6257priority == ZIO_PRIORITY_SCRUB)6258arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);6259else6260arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ);62616262/*6263* At this point, we have a level 1 cache miss or a blkptr6264* with embedded data. Try again in L2ARC if possible.6265*/6266ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize);62676268/*6269* Skip ARC stat bump for block pointers with embedded6270* data. The data are read from the blkptr itself via6271* decode_embedded_bp_compressed().6272*/6273if (!embedded_bp) {6274DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr,6275blkptr_t *, bp, uint64_t, lsize,6276zbookmark_phys_t *, zb);6277ARCSTAT_BUMP(arcstat_misses);6278ARCSTAT_CONDSTAT(!(*arc_flags & ARC_FLAG_PREFETCH),6279demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data,6280metadata, misses);6281zfs_racct_read(spa, size, 1,6282(*arc_flags & ARC_FLAG_UNCACHED) ?6283DMU_UNCACHEDIO : 0);6284}62856286/* Check if the spa even has l2 configured */6287const boolean_t spa_has_l2 = l2arc_ndev != 0 &&6288spa->spa_l2cache.sav_count > 0;62896290if (vd != NULL && spa_has_l2 && !(l2arc_norw && devw)) {6291/*6292* Read from the L2ARC if the following are true:6293* 1. The L2ARC vdev was previously cached.6294* 2. This buffer still has L2ARC metadata.6295* 3. This buffer isn't currently writing to the L2ARC.6296* 4. 
The L2ARC entry wasn't evicted, which may6297* also have invalidated the vdev.6298*/6299if (HDR_HAS_L2HDR(hdr) &&6300!HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr)) {6301l2arc_read_callback_t *cb;6302abd_t *abd;6303uint64_t asize;63046305DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);6306ARCSTAT_BUMP(arcstat_l2_hits);6307hdr->b_l2hdr.b_hits++;63086309cb = kmem_zalloc(sizeof (l2arc_read_callback_t),6310KM_SLEEP);6311cb->l2rcb_hdr = hdr;6312cb->l2rcb_bp = *bp;6313cb->l2rcb_zb = *zb;6314cb->l2rcb_flags = zio_flags;63156316/*6317* When Compressed ARC is disabled, but the6318* L2ARC block is compressed, arc_hdr_size()6319* will have returned LSIZE rather than PSIZE.6320*/6321if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&6322!HDR_COMPRESSION_ENABLED(hdr) &&6323HDR_GET_PSIZE(hdr) != 0) {6324size = HDR_GET_PSIZE(hdr);6325}63266327asize = vdev_psize_to_asize(vd, size);6328if (asize != size) {6329abd = abd_alloc_for_io(asize,6330HDR_ISTYPE_METADATA(hdr));6331cb->l2rcb_abd = abd;6332} else {6333abd = hdr_abd;6334}63356336ASSERT(addr >= VDEV_LABEL_START_SIZE &&6337addr + asize <= vd->vdev_psize -6338VDEV_LABEL_END_SIZE);63396340/*6341* l2arc read. The SCL_L2ARC lock will be6342* released by l2arc_read_done().6343* Issue a null zio if the underlying buffer6344* was squashed to zero size by compression.6345*/6346ASSERT3U(arc_hdr_get_compress(hdr), !=,6347ZIO_COMPRESS_EMPTY);6348rzio = zio_read_phys(pio, vd, addr,6349asize, abd,6350ZIO_CHECKSUM_OFF,6351l2arc_read_done, cb, priority,6352zio_flags | ZIO_FLAG_CANFAIL |6353ZIO_FLAG_DONT_PROPAGATE |6354ZIO_FLAG_DONT_RETRY, B_FALSE);6355acb->acb_zio_head = rzio;63566357if (hash_lock != NULL)6358mutex_exit(hash_lock);63596360DTRACE_PROBE2(l2arc__read, vdev_t *, vd,6361zio_t *, rzio);6362ARCSTAT_INCR(arcstat_l2_read_bytes,6363HDR_GET_PSIZE(hdr));63646365if (*arc_flags & ARC_FLAG_NOWAIT) {6366zio_nowait(rzio);6367goto out;6368}63696370ASSERT(*arc_flags & ARC_FLAG_WAIT);6371if (zio_wait(rzio) == 0)6372goto out;63736374/* l2arc read error; goto zio_read() */6375if (hash_lock != NULL)6376mutex_enter(hash_lock);6377} else {6378DTRACE_PROBE1(l2arc__miss,6379arc_buf_hdr_t *, hdr);6380ARCSTAT_BUMP(arcstat_l2_misses);6381if (HDR_L2_WRITING(hdr))6382ARCSTAT_BUMP(arcstat_l2_rw_clash);6383spa_config_exit(spa, SCL_L2ARC, vd);6384}6385} else {6386if (vd != NULL)6387spa_config_exit(spa, SCL_L2ARC, vd);63886389/*6390* Only a spa with l2 should contribute to l26391* miss stats. (Including the case of having a6392* faulted cache device - that's also a miss.)6393*/6394if (spa_has_l2) {6395/*6396* Skip ARC stat bump for block pointers with6397* embedded data. 
The data are read from the6398* blkptr itself via6399* decode_embedded_bp_compressed().6400*/6401if (!embedded_bp) {6402DTRACE_PROBE1(l2arc__miss,6403arc_buf_hdr_t *, hdr);6404ARCSTAT_BUMP(arcstat_l2_misses);6405}6406}6407}64086409rzio = zio_read(pio, spa, bp, hdr_abd, size,6410arc_read_done, hdr, priority, zio_flags, zb);6411acb->acb_zio_head = rzio;64126413if (hash_lock != NULL)6414mutex_exit(hash_lock);64156416if (*arc_flags & ARC_FLAG_WAIT) {6417rc = zio_wait(rzio);6418goto out;6419}64206421ASSERT(*arc_flags & ARC_FLAG_NOWAIT);6422zio_nowait(rzio);6423}64246425out:6426/* embedded bps don't actually go to disk */6427if (!embedded_bp)6428spa_read_history_add(spa, zb, *arc_flags);6429spl_fstrans_unmark(cookie);6430return (rc);64316432done:6433if (done)6434done(NULL, zb, bp, buf, private);6435if (pio && rc != 0) {6436zio_t *zio = zio_null(pio, spa, NULL, NULL, NULL, zio_flags);6437zio->io_error = rc;6438zio_nowait(zio);6439}6440goto out;6441}64426443arc_prune_t *6444arc_add_prune_callback(arc_prune_func_t *func, void *private)6445{6446arc_prune_t *p;64476448p = kmem_alloc(sizeof (*p), KM_SLEEP);6449p->p_pfunc = func;6450p->p_private = private;6451list_link_init(&p->p_node);6452zfs_refcount_create(&p->p_refcnt);64536454mutex_enter(&arc_prune_mtx);6455zfs_refcount_add(&p->p_refcnt, &arc_prune_list);6456list_insert_head(&arc_prune_list, p);6457mutex_exit(&arc_prune_mtx);64586459return (p);6460}64616462void6463arc_remove_prune_callback(arc_prune_t *p)6464{6465boolean_t wait = B_FALSE;6466mutex_enter(&arc_prune_mtx);6467list_remove(&arc_prune_list, p);6468if (zfs_refcount_remove(&p->p_refcnt, &arc_prune_list) > 0)6469wait = B_TRUE;6470mutex_exit(&arc_prune_mtx);64716472/* wait for arc_prune_task to finish */6473if (wait)6474taskq_wait_outstanding(arc_prune_taskq, 0);6475ASSERT0(zfs_refcount_count(&p->p_refcnt));6476zfs_refcount_destroy(&p->p_refcnt);6477kmem_free(p, sizeof (*p));6478}64796480/*6481* Helper function for arc_prune_async() it is responsible for safely6482* handling the execution of a registered arc_prune_func_t.6483*/6484static void6485arc_prune_task(void *ptr)6486{6487arc_prune_t *ap = (arc_prune_t *)ptr;6488arc_prune_func_t *func = ap->p_pfunc;64896490if (func != NULL)6491func(ap->p_adjust, ap->p_private);64926493(void) zfs_refcount_remove(&ap->p_refcnt, func);6494}64956496/*6497* Notify registered consumers they must drop holds on a portion of the ARC6498* buffers they reference. This provides a mechanism to ensure the ARC can6499* honor the metadata limit and reclaim otherwise pinned ARC buffers.6500*6501* This operation is performed asynchronously so it may be safely called6502* in the context of the arc_reclaim_thread(). 
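 *
 * A consumer registers for these notifications with
 * arc_add_prune_callback() and must eventually unregister with
 * arc_remove_prune_callback(), roughly as in this sketch (the callback
 * name and its pruning logic are the consumer's own):
 *
 *	static void
 *	my_prune_cb(uint64_t nr_to_scan, void *priv)
 *	{
 *		(drop up to nr_to_scan holds on ARC buffers)
 *	}
 *
 *	arc_prune_t *p = arc_add_prune_callback(my_prune_cb, priv);
 *	...
 *	arc_remove_prune_callback(p);
 *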
A reference is taken here6503* for each registered arc_prune_t and the arc_prune_task() is responsible6504* for releasing it once the registered arc_prune_func_t has completed.6505*/6506static void6507arc_prune_async(uint64_t adjust)6508{6509arc_prune_t *ap;65106511mutex_enter(&arc_prune_mtx);6512for (ap = list_head(&arc_prune_list); ap != NULL;6513ap = list_next(&arc_prune_list, ap)) {65146515if (zfs_refcount_count(&ap->p_refcnt) >= 2)6516continue;65176518zfs_refcount_add(&ap->p_refcnt, ap->p_pfunc);6519ap->p_adjust = adjust;6520if (taskq_dispatch(arc_prune_taskq, arc_prune_task,6521ap, TQ_SLEEP) == TASKQID_INVALID) {6522(void) zfs_refcount_remove(&ap->p_refcnt, ap->p_pfunc);6523continue;6524}6525ARCSTAT_BUMP(arcstat_prune);6526}6527mutex_exit(&arc_prune_mtx);6528}65296530/*6531* Notify the arc that a block was freed, and thus will never be used again.6532*/6533void6534arc_freed(spa_t *spa, const blkptr_t *bp)6535{6536arc_buf_hdr_t *hdr;6537kmutex_t *hash_lock;6538uint64_t guid = spa_load_guid(spa);65396540ASSERT(!BP_IS_EMBEDDED(bp));65416542hdr = buf_hash_find(guid, bp, &hash_lock);6543if (hdr == NULL)6544return;65456546/*6547* We might be trying to free a block that is still doing I/O6548* (i.e. prefetch) or has some other reference (i.e. a dedup-ed,6549* dmu_sync-ed block). A block may also have a reference if it is6550* part of a dedup-ed, dmu_synced write. The dmu_sync() function would6551* have written the new block to its final resting place on disk but6552* without the dedup flag set. This would have left the hdr in the MRU6553* state and discoverable. When the txg finally syncs it detects that6554* the block was overridden in open context and issues an override I/O.6555* Since this is a dedup block, the override I/O will determine if the6556* block is already in the DDT. If so, then it will replace the io_bp6557* with the bp from the DDT and allow the I/O to finish. When the I/O6558* reaches the done callback, dbuf_write_override_done, it will6559* check to see if the io_bp and io_bp_override are identical.6560* If they are not, then it indicates that the bp was replaced with6561* the bp in the DDT and the override bp is freed. This allows6562* us to arrive here with a reference on a block that is being6563* freed. So if we have an I/O in progress, or a reference to6564* this hdr, then we don't destroy the hdr.6565*/6566if (!HDR_HAS_L1HDR(hdr) ||6567zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) {6568arc_change_state(arc_anon, hdr);6569arc_hdr_destroy(hdr);6570mutex_exit(hash_lock);6571} else {6572mutex_exit(hash_lock);6573}65746575}65766577/*6578* Release this buffer from the cache, making it an anonymous buffer. 
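 *
 * The usual pattern, sketched loosely (names are placeholders for
 * whatever the caller is doing):
 *
 *	buf = (arc_buf_t obtained from an earlier arc_read)
 *	arc_release(buf, tag);		(buf is now anonymous)
 *	(modify buf->b_data in place)
 *	zio = arc_write(pio, spa, txg, bp, buf, ...);
 *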
This6579* must be done after a read and prior to modifying the buffer contents.6580* If the buffer has more than one reference, we must make6581* a new hdr for the buffer.6582*/6583void6584arc_release(arc_buf_t *buf, const void *tag)6585{6586arc_buf_hdr_t *hdr = buf->b_hdr;65876588/*6589* It would be nice to assert that if its DMU metadata (level >6590* 0 || it's the dnode file), then it must be syncing context.6591* But we don't know that information at this level.6592*/65936594ASSERT(HDR_HAS_L1HDR(hdr));65956596/*6597* We don't grab the hash lock prior to this check, because if6598* the buffer's header is in the arc_anon state, it won't be6599* linked into the hash table.6600*/6601if (hdr->b_l1hdr.b_state == arc_anon) {6602ASSERT(!HDR_IO_IN_PROGRESS(hdr));6603ASSERT(!HDR_IN_HASH_TABLE(hdr));6604ASSERT(!HDR_HAS_L2HDR(hdr));66056606ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf);6607ASSERT(ARC_BUF_LAST(buf));6608ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1);6609ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));66106611hdr->b_l1hdr.b_arc_access = 0;66126613/*6614* If the buf is being overridden then it may already6615* have a hdr that is not empty.6616*/6617buf_discard_identity(hdr);6618arc_buf_thaw(buf);66196620return;6621}66226623kmutex_t *hash_lock = HDR_LOCK(hdr);6624mutex_enter(hash_lock);66256626/*6627* This assignment is only valid as long as the hash_lock is6628* held, we must be careful not to reference state or the6629* b_state field after dropping the lock.6630*/6631arc_state_t *state = hdr->b_l1hdr.b_state;6632ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));6633ASSERT3P(state, !=, arc_anon);6634ASSERT3P(state, !=, arc_l2c_only);66356636/* this buffer is not on any list */6637ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0);66386639/*6640* Do we have more than one buf?6641*/6642if (hdr->b_l1hdr.b_buf != buf || !ARC_BUF_LAST(buf)) {6643arc_buf_hdr_t *nhdr;6644uint64_t spa = hdr->b_spa;6645uint64_t psize = HDR_GET_PSIZE(hdr);6646uint64_t lsize = HDR_GET_LSIZE(hdr);6647boolean_t protected = HDR_PROTECTED(hdr);6648enum zio_compress compress = arc_hdr_get_compress(hdr);6649arc_buf_contents_t type = arc_buf_type(hdr);66506651if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) {6652ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf);6653ASSERT(ARC_BUF_LAST(buf));6654}66556656/*6657* Pull the buffer off of this hdr and find the last buffer6658* in the hdr's buffer list.6659*/6660VERIFY3S(remove_reference(hdr, tag), >, 0);6661arc_buf_t *lastbuf = arc_buf_remove(hdr, buf);6662ASSERT3P(lastbuf, !=, NULL);66636664/*6665* If the current arc_buf_t and the hdr are sharing their data6666* buffer, then we must stop sharing that block.6667*/6668if (ARC_BUF_SHARED(buf)) {6669ASSERT(!arc_buf_is_shared(lastbuf));66706671/*6672* First, sever the block sharing relationship between6673* buf and the arc_buf_hdr_t.6674*/6675arc_unshare_buf(hdr, buf);66766677/*6678* Now we need to recreate the hdr's b_pabd. Since we6679* have lastbuf handy, we try to share with it, but if6680* we can't then we allocate a new b_pabd and copy the6681* data from buf into it.6682*/6683if (arc_can_share(hdr, lastbuf)) {6684arc_share_buf(hdr, lastbuf);6685} else {6686arc_hdr_alloc_abd(hdr, 0);6687abd_copy_from_buf(hdr->b_l1hdr.b_pabd,6688buf->b_data, psize);6689}6690} else if (HDR_SHARED_DATA(hdr)) {6691/*6692* Uncompressed shared buffers are always at the end6693* of the list. Compressed buffers don't have the6694* same requirements. 
This makes it hard to6695* simply assert that the lastbuf is shared so6696* we rely on the hdr's compression flags to determine6697* if we have a compressed, shared buffer.6698*/6699ASSERT(arc_buf_is_shared(lastbuf) ||6700arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF);6701ASSERT(!arc_buf_is_shared(buf));6702}67036704ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr));67056706(void) zfs_refcount_remove_many(&state->arcs_size[type],6707arc_buf_size(buf), buf);67086709arc_cksum_verify(buf);6710arc_buf_unwatch(buf);67116712/* if this is the last uncompressed buf free the checksum */6713if (!arc_hdr_has_uncompressed_buf(hdr))6714arc_cksum_free(hdr);67156716mutex_exit(hash_lock);67176718nhdr = arc_hdr_alloc(spa, psize, lsize, protected,6719compress, hdr->b_complevel, type);6720ASSERT0P(nhdr->b_l1hdr.b_buf);6721ASSERT0(zfs_refcount_count(&nhdr->b_l1hdr.b_refcnt));6722VERIFY3U(nhdr->b_type, ==, type);6723ASSERT(!HDR_SHARED_DATA(nhdr));67246725nhdr->b_l1hdr.b_buf = buf;6726(void) zfs_refcount_add(&nhdr->b_l1hdr.b_refcnt, tag);6727buf->b_hdr = nhdr;67286729(void) zfs_refcount_add_many(&arc_anon->arcs_size[type],6730arc_buf_size(buf), buf);6731} else {6732ASSERT(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 1);6733/* protected by hash lock, or hdr is on arc_anon */6734ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node));6735ASSERT(!HDR_IO_IN_PROGRESS(hdr));67366737if (HDR_HAS_L2HDR(hdr)) {6738mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx);6739/* Recheck to prevent race with l2arc_evict(). */6740if (HDR_HAS_L2HDR(hdr))6741arc_hdr_l2hdr_destroy(hdr);6742mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx);6743}67446745hdr->b_l1hdr.b_mru_hits = 0;6746hdr->b_l1hdr.b_mru_ghost_hits = 0;6747hdr->b_l1hdr.b_mfu_hits = 0;6748hdr->b_l1hdr.b_mfu_ghost_hits = 0;6749arc_change_state(arc_anon, hdr);6750hdr->b_l1hdr.b_arc_access = 0;67516752mutex_exit(hash_lock);6753buf_discard_identity(hdr);6754arc_buf_thaw(buf);6755}6756}67576758int6759arc_released(arc_buf_t *buf)6760{6761return (buf->b_data != NULL &&6762buf->b_hdr->b_l1hdr.b_state == arc_anon);6763}67646765#ifdef ZFS_DEBUG6766int6767arc_referenced(arc_buf_t *buf)6768{6769return (zfs_refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt));6770}6771#endif67726773static void6774arc_write_ready(zio_t *zio)6775{6776arc_write_callback_t *callback = zio->io_private;6777arc_buf_t *buf = callback->awcb_buf;6778arc_buf_hdr_t *hdr = buf->b_hdr;6779blkptr_t *bp = zio->io_bp;6780uint64_t psize = BP_IS_HOLE(bp) ? 
0 : BP_GET_PSIZE(bp);6781fstrans_cookie_t cookie = spl_fstrans_mark();67826783ASSERT(HDR_HAS_L1HDR(hdr));6784ASSERT(!zfs_refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt));6785ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);67866787/*6788* If we're reexecuting this zio because the pool suspended, then6789* cleanup any state that was previously set the first time the6790* callback was invoked.6791*/6792if (zio->io_flags & ZIO_FLAG_REEXECUTED) {6793arc_cksum_free(hdr);6794arc_buf_unwatch(buf);6795if (hdr->b_l1hdr.b_pabd != NULL) {6796if (ARC_BUF_SHARED(buf)) {6797arc_unshare_buf(hdr, buf);6798} else {6799ASSERT(!arc_buf_is_shared(buf));6800arc_hdr_free_abd(hdr, B_FALSE);6801}6802}68036804if (HDR_HAS_RABD(hdr))6805arc_hdr_free_abd(hdr, B_TRUE);6806}6807ASSERT0P(hdr->b_l1hdr.b_pabd);6808ASSERT(!HDR_HAS_RABD(hdr));6809ASSERT(!HDR_SHARED_DATA(hdr));6810ASSERT(!arc_buf_is_shared(buf));68116812callback->awcb_ready(zio, buf, callback->awcb_private);68136814if (HDR_IO_IN_PROGRESS(hdr)) {6815ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED);6816} else {6817arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);6818add_reference(hdr, hdr); /* For IO_IN_PROGRESS. */6819}68206821if (BP_IS_PROTECTED(bp)) {6822/* ZIL blocks are written through zio_rewrite */6823ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG);68246825if (BP_SHOULD_BYTESWAP(bp)) {6826if (BP_GET_LEVEL(bp) > 0) {6827hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64;6828} else {6829hdr->b_l1hdr.b_byteswap =6830DMU_OT_BYTESWAP(BP_GET_TYPE(bp));6831}6832} else {6833hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS;6834}68356836arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED);6837hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp);6838hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset;6839zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt,6840hdr->b_crypt_hdr.b_iv);6841zio_crypt_decode_mac_bp(bp, hdr->b_crypt_hdr.b_mac);6842} else {6843arc_hdr_clear_flags(hdr, ARC_FLAG_PROTECTED);6844}68456846/*6847* If this block was written for raw encryption but the zio layer6848* ended up only authenticating it, adjust the buffer flags now.6849*/6850if (BP_IS_AUTHENTICATED(bp) && ARC_BUF_ENCRYPTED(buf)) {6851arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH);6852buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;6853if (BP_GET_COMPRESS(bp) == ZIO_COMPRESS_OFF)6854buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;6855} else if (BP_IS_HOLE(bp) && ARC_BUF_ENCRYPTED(buf)) {6856buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED;6857buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED;6858}68596860/* this must be done after the buffer flags are adjusted */6861arc_cksum_compute(buf);68626863enum zio_compress compress;6864if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) {6865compress = ZIO_COMPRESS_OFF;6866} else {6867ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp));6868compress = BP_GET_COMPRESS(bp);6869}6870HDR_SET_PSIZE(hdr, psize);6871arc_hdr_set_compress(hdr, compress);6872hdr->b_complevel = zio->io_prop.zp_complevel;68736874if (zio->io_error != 0 || psize == 0)6875goto out;68766877/*6878* Fill the hdr with data. If the buffer is encrypted we have no choice6879* but to copy the data into b_radb. If the hdr is compressed, the data6880* we want is available from the zio, otherwise we can take it from6881* the buf.6882*6883* We might be able to share the buf's data with the hdr here. However,6884* doing so would cause the ARC to be full of linear ABDs if we write a6885* lot of shareable data. 
As a compromise, we check whether scattered6886* ABDs are allowed, and assume that if they are then the user wants6887* the ARC to be primarily filled with them regardless of the data being6888* written. Therefore, if they're allowed then we allocate one and copy6889* the data into it; otherwise, we share the data directly if we can.6890*/6891if (ARC_BUF_ENCRYPTED(buf)) {6892ASSERT3U(psize, >, 0);6893ASSERT(ARC_BUF_COMPRESSED(buf));6894arc_hdr_alloc_abd(hdr, ARC_HDR_ALLOC_RDATA |6895ARC_HDR_USE_RESERVE);6896abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize);6897} else if (!(HDR_UNCACHED(hdr) ||6898abd_size_alloc_linear(arc_buf_size(buf))) ||6899!arc_can_share(hdr, buf)) {6900/*6901* Ideally, we would always copy the io_abd into b_pabd, but the6902* user may have disabled compressed ARC, thus we must check the6903* hdr's compression setting rather than the io_bp's.6904*/6905if (BP_IS_ENCRYPTED(bp)) {6906ASSERT3U(psize, >, 0);6907arc_hdr_alloc_abd(hdr, ARC_HDR_ALLOC_RDATA |6908ARC_HDR_USE_RESERVE);6909abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize);6910} else if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF &&6911!ARC_BUF_COMPRESSED(buf)) {6912ASSERT3U(psize, >, 0);6913arc_hdr_alloc_abd(hdr, ARC_HDR_USE_RESERVE);6914abd_copy(hdr->b_l1hdr.b_pabd, zio->io_abd, psize);6915} else {6916ASSERT3U(zio->io_orig_size, ==, arc_hdr_size(hdr));6917arc_hdr_alloc_abd(hdr, ARC_HDR_USE_RESERVE);6918abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data,6919arc_buf_size(buf));6920}6921} else {6922ASSERT3P(buf->b_data, ==, abd_to_buf(zio->io_orig_abd));6923ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf));6924ASSERT3P(hdr->b_l1hdr.b_buf, ==, buf);6925ASSERT(ARC_BUF_LAST(buf));69266927arc_share_buf(hdr, buf);6928}69296930out:6931arc_hdr_verify(hdr, bp);6932spl_fstrans_unmark(cookie);6933}69346935static void6936arc_write_children_ready(zio_t *zio)6937{6938arc_write_callback_t *callback = zio->io_private;6939arc_buf_t *buf = callback->awcb_buf;69406941callback->awcb_children_ready(zio, buf, callback->awcb_private);6942}69436944static void6945arc_write_done(zio_t *zio)6946{6947arc_write_callback_t *callback = zio->io_private;6948arc_buf_t *buf = callback->awcb_buf;6949arc_buf_hdr_t *hdr = buf->b_hdr;69506951ASSERT0P(hdr->b_l1hdr.b_acb);69526953if (zio->io_error == 0) {6954arc_hdr_verify(hdr, zio->io_bp);69556956if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) {6957buf_discard_identity(hdr);6958} else {6959hdr->b_dva = *BP_IDENTITY(zio->io_bp);6960hdr->b_birth = BP_GET_PHYSICAL_BIRTH(zio->io_bp);6961}6962} else {6963ASSERT(HDR_EMPTY(hdr));6964}69656966/*6967* If the block to be written was all-zero or compressed enough to be6968* embedded in the BP, no write was performed so there will be no6969* dva/birth/checksum. 
The buffer must therefore remain anonymous6970* (and uncached).6971*/6972if (!HDR_EMPTY(hdr)) {6973arc_buf_hdr_t *exists;6974kmutex_t *hash_lock;69756976ASSERT0(zio->io_error);69776978arc_cksum_verify(buf);69796980exists = buf_hash_insert(hdr, &hash_lock);6981if (exists != NULL) {6982/*6983* This can only happen if we overwrite for6984* sync-to-convergence, because we remove6985* buffers from the hash table when we arc_free().6986*/6987if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {6988if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))6989panic("bad overwrite, hdr=%p exists=%p",6990(void *)hdr, (void *)exists);6991ASSERT(zfs_refcount_is_zero(6992&exists->b_l1hdr.b_refcnt));6993arc_change_state(arc_anon, exists);6994arc_hdr_destroy(exists);6995mutex_exit(hash_lock);6996exists = buf_hash_insert(hdr, &hash_lock);6997ASSERT0P(exists);6998} else if (zio->io_flags & ZIO_FLAG_NOPWRITE) {6999/* nopwrite */7000ASSERT(zio->io_prop.zp_nopwrite);7001if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp))7002panic("bad nopwrite, hdr=%p exists=%p",7003(void *)hdr, (void *)exists);7004} else {7005/* Dedup */7006ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);7007ASSERT(ARC_BUF_LAST(hdr->b_l1hdr.b_buf));7008ASSERT(hdr->b_l1hdr.b_state == arc_anon);7009ASSERT(BP_GET_DEDUP(zio->io_bp));7010ASSERT0(BP_GET_LEVEL(zio->io_bp));7011}7012}7013arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);7014VERIFY3S(remove_reference(hdr, hdr), >, 0);7015/* if it's not anon, we are doing a scrub */7016if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon)7017arc_access(hdr, 0, B_FALSE);7018mutex_exit(hash_lock);7019} else {7020arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS);7021VERIFY3S(remove_reference(hdr, hdr), >, 0);7022}70237024callback->awcb_done(zio, buf, callback->awcb_private);70257026abd_free(zio->io_abd);7027kmem_free(callback, sizeof (arc_write_callback_t));7028}70297030zio_t *7031arc_write(zio_t *pio, spa_t *spa, uint64_t txg,7032blkptr_t *bp, arc_buf_t *buf, boolean_t uncached, boolean_t l2arc,7033const zio_prop_t *zp, arc_write_done_func_t *ready,7034arc_write_done_func_t *children_ready, arc_write_done_func_t *done,7035void *private, zio_priority_t priority, int zio_flags,7036const zbookmark_phys_t *zb)7037{7038arc_buf_hdr_t *hdr = buf->b_hdr;7039arc_write_callback_t *callback;7040zio_t *zio;7041zio_prop_t localprop = *zp;70427043ASSERT3P(ready, !=, NULL);7044ASSERT3P(done, !=, NULL);7045ASSERT(!HDR_IO_ERROR(hdr));7046ASSERT(!HDR_IO_IN_PROGRESS(hdr));7047ASSERT0P(hdr->b_l1hdr.b_acb);7048ASSERT3P(hdr->b_l1hdr.b_buf, !=, NULL);7049if (uncached)7050arc_hdr_set_flags(hdr, ARC_FLAG_UNCACHED);7051else if (l2arc)7052arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE);70537054if (ARC_BUF_ENCRYPTED(buf)) {7055ASSERT(ARC_BUF_COMPRESSED(buf));7056localprop.zp_encrypt = B_TRUE;7057localprop.zp_compress = HDR_GET_COMPRESS(hdr);7058localprop.zp_complevel = hdr->b_complevel;7059localprop.zp_byteorder =7060(hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ?7061ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER;7062memcpy(localprop.zp_salt, hdr->b_crypt_hdr.b_salt,7063ZIO_DATA_SALT_LEN);7064memcpy(localprop.zp_iv, hdr->b_crypt_hdr.b_iv,7065ZIO_DATA_IV_LEN);7066memcpy(localprop.zp_mac, hdr->b_crypt_hdr.b_mac,7067ZIO_DATA_MAC_LEN);7068if (DMU_OT_IS_ENCRYPTED(localprop.zp_type)) {7069localprop.zp_nopwrite = B_FALSE;7070localprop.zp_copies =7071MIN(localprop.zp_copies, SPA_DVAS_PER_BP - 1);7072localprop.zp_gang_copies =7073MIN(localprop.zp_gang_copies, SPA_DVAS_PER_BP - 1);7074}7075zio_flags |= ZIO_FLAG_RAW;7076} else if (ARC_BUF_COMPRESSED(buf)) {7077ASSERT3U(HDR_GET_LSIZE(hdr), 
!=, arc_buf_size(buf));7078localprop.zp_compress = HDR_GET_COMPRESS(hdr);7079localprop.zp_complevel = hdr->b_complevel;7080zio_flags |= ZIO_FLAG_RAW_COMPRESS;7081}7082callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);7083callback->awcb_ready = ready;7084callback->awcb_children_ready = children_ready;7085callback->awcb_done = done;7086callback->awcb_private = private;7087callback->awcb_buf = buf;70887089/*7090* The hdr's b_pabd is now stale, free it now. A new data block7091* will be allocated when the zio pipeline calls arc_write_ready().7092*/7093if (hdr->b_l1hdr.b_pabd != NULL) {7094/*7095* If the buf is currently sharing the data block with7096* the hdr then we need to break that relationship here.7097* The hdr will remain with a NULL data pointer and the7098* buf will take sole ownership of the block.7099*/7100if (ARC_BUF_SHARED(buf)) {7101arc_unshare_buf(hdr, buf);7102} else {7103ASSERT(!arc_buf_is_shared(buf));7104arc_hdr_free_abd(hdr, B_FALSE);7105}7106VERIFY3P(buf->b_data, !=, NULL);7107}71087109if (HDR_HAS_RABD(hdr))7110arc_hdr_free_abd(hdr, B_TRUE);71117112if (!(zio_flags & ZIO_FLAG_RAW))7113arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF);71147115ASSERT(!arc_buf_is_shared(buf));7116ASSERT0P(hdr->b_l1hdr.b_pabd);71177118zio = zio_write(pio, spa, txg, bp,7119abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)),7120HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready,7121(children_ready != NULL) ? arc_write_children_ready : NULL,7122arc_write_done, callback, priority, zio_flags, zb);71237124return (zio);7125}71267127void7128arc_tempreserve_clear(uint64_t reserve)7129{7130atomic_add_64(&arc_tempreserve, -reserve);7131ASSERT((int64_t)arc_tempreserve >= 0);7132}71337134int7135arc_tempreserve_space(spa_t *spa, uint64_t reserve, uint64_t txg)7136{7137int error;7138uint64_t anon_size;71397140if (!arc_no_grow &&7141reserve > arc_c/4 &&7142reserve * 4 > (2ULL << SPA_MAXBLOCKSHIFT))7143arc_c = MIN(arc_c_max, reserve * 4);71447145/*7146* Throttle when the calculated memory footprint for the TXG7147* exceeds the target ARC size.7148*/7149if (reserve > arc_c) {7150DMU_TX_STAT_BUMP(dmu_tx_memory_reserve);7151return (SET_ERROR(ERESTART));7152}71537154/*7155* Don't count loaned bufs as in flight dirty data to prevent long7156* network delays from blocking transactions that are ready to be7157* assigned to a txg.7158*/71597160/* assert that it has not wrapped around */7161ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0);71627163anon_size = MAX((int64_t)7164(zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_DATA]) +7165zfs_refcount_count(&arc_anon->arcs_size[ARC_BUFC_METADATA]) -7166arc_loaned_bytes), 0);71677168/*7169* Writes will, almost always, require additional memory allocations7170* in order to compress/encrypt/etc the data. We therefore need to7171* make sure that there is sufficient available memory for this.7172*/7173error = arc_memory_throttle(spa, reserve, txg);7174if (error != 0)7175return (error);71767177/*7178* Throttle writes when the amount of dirty data in the cache7179* gets too large. We try to keep the cache less than half full7180* of dirty blocks so that our sync times don't grow too large.7181*7182* In the case of one pool being built on another pool, we want7183* to make sure we don't end up throttling the lower (backing)7184* pool when the upper pool is the majority contributor to dirty7185* data. 
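 *
 * As a rough illustration of the check below, assuming default
 * tunables of zfs_arc_dirty_limit_percent = 50,
 * zfs_arc_anon_limit_percent = 25 and zfs_arc_pool_dirty_percent = 20
 * (values assumed for this example, not taken from this file) and a
 * warm ARC with arc_c of 4 GiB: a write is throttled only when total
 * dirty data exceeds 2 GiB, anonymous buffers alone exceed 1 GiB, and
 * this pool's own dirty data exceeds 20% of that anonymous total.
 *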
To insure we make forward progress during throttling, we7186* also check the current pool's net dirty data and only throttle7187* if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty7188* data in the cache.7189*7190* Note: if two requests come in concurrently, we might let them7191* both succeed, when one of them should fail. Not a huge deal.7192*/7193uint64_t total_dirty = reserve + arc_tempreserve + anon_size;7194uint64_t spa_dirty_anon = spa_dirty_data(spa);7195uint64_t rarc_c = arc_warm ? arc_c : arc_c_max;7196if (total_dirty > rarc_c * zfs_arc_dirty_limit_percent / 100 &&7197anon_size > rarc_c * zfs_arc_anon_limit_percent / 100 &&7198spa_dirty_anon > anon_size * zfs_arc_pool_dirty_percent / 100) {7199#ifdef ZFS_DEBUG7200uint64_t meta_esize = zfs_refcount_count(7201&arc_anon->arcs_esize[ARC_BUFC_METADATA]);7202uint64_t data_esize =7203zfs_refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]);7204dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "7205"anon_data=%lluK tempreserve=%lluK rarc_c=%lluK\n",7206(u_longlong_t)arc_tempreserve >> 10,7207(u_longlong_t)meta_esize >> 10,7208(u_longlong_t)data_esize >> 10,7209(u_longlong_t)reserve >> 10,7210(u_longlong_t)rarc_c >> 10);7211#endif7212DMU_TX_STAT_BUMP(dmu_tx_dirty_throttle);7213return (SET_ERROR(ERESTART));7214}7215atomic_add_64(&arc_tempreserve, reserve);7216return (0);7217}72187219static void7220arc_kstat_update_state(arc_state_t *state, kstat_named_t *size,7221kstat_named_t *data, kstat_named_t *metadata,7222kstat_named_t *evict_data, kstat_named_t *evict_metadata)7223{7224data->value.ui64 =7225zfs_refcount_count(&state->arcs_size[ARC_BUFC_DATA]);7226metadata->value.ui64 =7227zfs_refcount_count(&state->arcs_size[ARC_BUFC_METADATA]);7228size->value.ui64 = data->value.ui64 + metadata->value.ui64;7229evict_data->value.ui64 =7230zfs_refcount_count(&state->arcs_esize[ARC_BUFC_DATA]);7231evict_metadata->value.ui64 =7232zfs_refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]);7233}72347235static int7236arc_kstat_update(kstat_t *ksp, int rw)7237{7238arc_stats_t *as = ksp->ks_data;72397240if (rw == KSTAT_WRITE)7241return (SET_ERROR(EACCES));72427243as->arcstat_hits.value.ui64 =7244wmsum_value(&arc_sums.arcstat_hits);7245as->arcstat_iohits.value.ui64 =7246wmsum_value(&arc_sums.arcstat_iohits);7247as->arcstat_misses.value.ui64 =7248wmsum_value(&arc_sums.arcstat_misses);7249as->arcstat_demand_data_hits.value.ui64 =7250wmsum_value(&arc_sums.arcstat_demand_data_hits);7251as->arcstat_demand_data_iohits.value.ui64 =7252wmsum_value(&arc_sums.arcstat_demand_data_iohits);7253as->arcstat_demand_data_misses.value.ui64 =7254wmsum_value(&arc_sums.arcstat_demand_data_misses);7255as->arcstat_demand_metadata_hits.value.ui64 =7256wmsum_value(&arc_sums.arcstat_demand_metadata_hits);7257as->arcstat_demand_metadata_iohits.value.ui64 =7258wmsum_value(&arc_sums.arcstat_demand_metadata_iohits);7259as->arcstat_demand_metadata_misses.value.ui64 =7260wmsum_value(&arc_sums.arcstat_demand_metadata_misses);7261as->arcstat_prefetch_data_hits.value.ui64 =7262wmsum_value(&arc_sums.arcstat_prefetch_data_hits);7263as->arcstat_prefetch_data_iohits.value.ui64 =7264wmsum_value(&arc_sums.arcstat_prefetch_data_iohits);7265as->arcstat_prefetch_data_misses.value.ui64 =7266wmsum_value(&arc_sums.arcstat_prefetch_data_misses);7267as->arcstat_prefetch_metadata_hits.value.ui64 =7268wmsum_value(&arc_sums.arcstat_prefetch_metadata_hits);7269as->arcstat_prefetch_metadata_iohits.value.ui64 
=7270wmsum_value(&arc_sums.arcstat_prefetch_metadata_iohits);7271as->arcstat_prefetch_metadata_misses.value.ui64 =7272wmsum_value(&arc_sums.arcstat_prefetch_metadata_misses);7273as->arcstat_mru_hits.value.ui64 =7274wmsum_value(&arc_sums.arcstat_mru_hits);7275as->arcstat_mru_ghost_hits.value.ui64 =7276wmsum_value(&arc_sums.arcstat_mru_ghost_hits);7277as->arcstat_mfu_hits.value.ui64 =7278wmsum_value(&arc_sums.arcstat_mfu_hits);7279as->arcstat_mfu_ghost_hits.value.ui64 =7280wmsum_value(&arc_sums.arcstat_mfu_ghost_hits);7281as->arcstat_uncached_hits.value.ui64 =7282wmsum_value(&arc_sums.arcstat_uncached_hits);7283as->arcstat_deleted.value.ui64 =7284wmsum_value(&arc_sums.arcstat_deleted);7285as->arcstat_mutex_miss.value.ui64 =7286wmsum_value(&arc_sums.arcstat_mutex_miss);7287as->arcstat_access_skip.value.ui64 =7288wmsum_value(&arc_sums.arcstat_access_skip);7289as->arcstat_evict_skip.value.ui64 =7290wmsum_value(&arc_sums.arcstat_evict_skip);7291as->arcstat_evict_not_enough.value.ui64 =7292wmsum_value(&arc_sums.arcstat_evict_not_enough);7293as->arcstat_evict_l2_cached.value.ui64 =7294wmsum_value(&arc_sums.arcstat_evict_l2_cached);7295as->arcstat_evict_l2_eligible.value.ui64 =7296wmsum_value(&arc_sums.arcstat_evict_l2_eligible);7297as->arcstat_evict_l2_eligible_mfu.value.ui64 =7298wmsum_value(&arc_sums.arcstat_evict_l2_eligible_mfu);7299as->arcstat_evict_l2_eligible_mru.value.ui64 =7300wmsum_value(&arc_sums.arcstat_evict_l2_eligible_mru);7301as->arcstat_evict_l2_ineligible.value.ui64 =7302wmsum_value(&arc_sums.arcstat_evict_l2_ineligible);7303as->arcstat_evict_l2_skip.value.ui64 =7304wmsum_value(&arc_sums.arcstat_evict_l2_skip);7305as->arcstat_hash_elements.value.ui64 =7306as->arcstat_hash_elements_max.value.ui64 =7307wmsum_value(&arc_sums.arcstat_hash_elements);7308as->arcstat_hash_collisions.value.ui64 =7309wmsum_value(&arc_sums.arcstat_hash_collisions);7310as->arcstat_hash_chains.value.ui64 =7311wmsum_value(&arc_sums.arcstat_hash_chains);7312as->arcstat_size.value.ui64 =7313aggsum_value(&arc_sums.arcstat_size);7314as->arcstat_compressed_size.value.ui64 =7315wmsum_value(&arc_sums.arcstat_compressed_size);7316as->arcstat_uncompressed_size.value.ui64 =7317wmsum_value(&arc_sums.arcstat_uncompressed_size);7318as->arcstat_overhead_size.value.ui64 =7319wmsum_value(&arc_sums.arcstat_overhead_size);7320as->arcstat_hdr_size.value.ui64 =7321wmsum_value(&arc_sums.arcstat_hdr_size);7322as->arcstat_data_size.value.ui64 =7323wmsum_value(&arc_sums.arcstat_data_size);7324as->arcstat_metadata_size.value.ui64 =7325wmsum_value(&arc_sums.arcstat_metadata_size);7326as->arcstat_dbuf_size.value.ui64 =7327wmsum_value(&arc_sums.arcstat_dbuf_size);7328#if defined(COMPAT_FREEBSD11)7329as->arcstat_other_size.value.ui64 =7330wmsum_value(&arc_sums.arcstat_bonus_size) +7331aggsum_value(&arc_sums.arcstat_dnode_size) 
+7332wmsum_value(&arc_sums.arcstat_dbuf_size);7333#endif73347335arc_kstat_update_state(arc_anon,7336&as->arcstat_anon_size,7337&as->arcstat_anon_data,7338&as->arcstat_anon_metadata,7339&as->arcstat_anon_evictable_data,7340&as->arcstat_anon_evictable_metadata);7341arc_kstat_update_state(arc_mru,7342&as->arcstat_mru_size,7343&as->arcstat_mru_data,7344&as->arcstat_mru_metadata,7345&as->arcstat_mru_evictable_data,7346&as->arcstat_mru_evictable_metadata);7347arc_kstat_update_state(arc_mru_ghost,7348&as->arcstat_mru_ghost_size,7349&as->arcstat_mru_ghost_data,7350&as->arcstat_mru_ghost_metadata,7351&as->arcstat_mru_ghost_evictable_data,7352&as->arcstat_mru_ghost_evictable_metadata);7353arc_kstat_update_state(arc_mfu,7354&as->arcstat_mfu_size,7355&as->arcstat_mfu_data,7356&as->arcstat_mfu_metadata,7357&as->arcstat_mfu_evictable_data,7358&as->arcstat_mfu_evictable_metadata);7359arc_kstat_update_state(arc_mfu_ghost,7360&as->arcstat_mfu_ghost_size,7361&as->arcstat_mfu_ghost_data,7362&as->arcstat_mfu_ghost_metadata,7363&as->arcstat_mfu_ghost_evictable_data,7364&as->arcstat_mfu_ghost_evictable_metadata);7365arc_kstat_update_state(arc_uncached,7366&as->arcstat_uncached_size,7367&as->arcstat_uncached_data,7368&as->arcstat_uncached_metadata,7369&as->arcstat_uncached_evictable_data,7370&as->arcstat_uncached_evictable_metadata);73717372as->arcstat_dnode_size.value.ui64 =7373aggsum_value(&arc_sums.arcstat_dnode_size);7374as->arcstat_bonus_size.value.ui64 =7375wmsum_value(&arc_sums.arcstat_bonus_size);7376as->arcstat_l2_hits.value.ui64 =7377wmsum_value(&arc_sums.arcstat_l2_hits);7378as->arcstat_l2_misses.value.ui64 =7379wmsum_value(&arc_sums.arcstat_l2_misses);7380as->arcstat_l2_prefetch_asize.value.ui64 =7381wmsum_value(&arc_sums.arcstat_l2_prefetch_asize);7382as->arcstat_l2_mru_asize.value.ui64 =7383wmsum_value(&arc_sums.arcstat_l2_mru_asize);7384as->arcstat_l2_mfu_asize.value.ui64 =7385wmsum_value(&arc_sums.arcstat_l2_mfu_asize);7386as->arcstat_l2_bufc_data_asize.value.ui64 =7387wmsum_value(&arc_sums.arcstat_l2_bufc_data_asize);7388as->arcstat_l2_bufc_metadata_asize.value.ui64 =7389wmsum_value(&arc_sums.arcstat_l2_bufc_metadata_asize);7390as->arcstat_l2_feeds.value.ui64 =7391wmsum_value(&arc_sums.arcstat_l2_feeds);7392as->arcstat_l2_rw_clash.value.ui64 =7393wmsum_value(&arc_sums.arcstat_l2_rw_clash);7394as->arcstat_l2_read_bytes.value.ui64 =7395wmsum_value(&arc_sums.arcstat_l2_read_bytes);7396as->arcstat_l2_write_bytes.value.ui64 =7397wmsum_value(&arc_sums.arcstat_l2_write_bytes);7398as->arcstat_l2_writes_sent.value.ui64 =7399wmsum_value(&arc_sums.arcstat_l2_writes_sent);7400as->arcstat_l2_writes_done.value.ui64 =7401wmsum_value(&arc_sums.arcstat_l2_writes_done);7402as->arcstat_l2_writes_error.value.ui64 =7403wmsum_value(&arc_sums.arcstat_l2_writes_error);7404as->arcstat_l2_writes_lock_retry.value.ui64 =7405wmsum_value(&arc_sums.arcstat_l2_writes_lock_retry);7406as->arcstat_l2_evict_lock_retry.value.ui64 =7407wmsum_value(&arc_sums.arcstat_l2_evict_lock_retry);7408as->arcstat_l2_evict_reading.value.ui64 =7409wmsum_value(&arc_sums.arcstat_l2_evict_reading);7410as->arcstat_l2_evict_l1cached.value.ui64 =7411wmsum_value(&arc_sums.arcstat_l2_evict_l1cached);7412as->arcstat_l2_free_on_write.value.ui64 =7413wmsum_value(&arc_sums.arcstat_l2_free_on_write);7414as->arcstat_l2_abort_lowmem.value.ui64 =7415wmsum_value(&arc_sums.arcstat_l2_abort_lowmem);7416as->arcstat_l2_cksum_bad.value.ui64 =7417wmsum_value(&arc_sums.arcstat_l2_cksum_bad);7418as->arcstat_l2_io_error.value.ui64 
=7419wmsum_value(&arc_sums.arcstat_l2_io_error);7420as->arcstat_l2_lsize.value.ui64 =7421wmsum_value(&arc_sums.arcstat_l2_lsize);7422as->arcstat_l2_psize.value.ui64 =7423wmsum_value(&arc_sums.arcstat_l2_psize);7424as->arcstat_l2_hdr_size.value.ui64 =7425aggsum_value(&arc_sums.arcstat_l2_hdr_size);7426as->arcstat_l2_log_blk_writes.value.ui64 =7427wmsum_value(&arc_sums.arcstat_l2_log_blk_writes);7428as->arcstat_l2_log_blk_asize.value.ui64 =7429wmsum_value(&arc_sums.arcstat_l2_log_blk_asize);7430as->arcstat_l2_log_blk_count.value.ui64 =7431wmsum_value(&arc_sums.arcstat_l2_log_blk_count);7432as->arcstat_l2_rebuild_success.value.ui64 =7433wmsum_value(&arc_sums.arcstat_l2_rebuild_success);7434as->arcstat_l2_rebuild_abort_unsupported.value.ui64 =7435wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_unsupported);7436as->arcstat_l2_rebuild_abort_io_errors.value.ui64 =7437wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_io_errors);7438as->arcstat_l2_rebuild_abort_dh_errors.value.ui64 =7439wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_dh_errors);7440as->arcstat_l2_rebuild_abort_cksum_lb_errors.value.ui64 =7441wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors);7442as->arcstat_l2_rebuild_abort_lowmem.value.ui64 =7443wmsum_value(&arc_sums.arcstat_l2_rebuild_abort_lowmem);7444as->arcstat_l2_rebuild_size.value.ui64 =7445wmsum_value(&arc_sums.arcstat_l2_rebuild_size);7446as->arcstat_l2_rebuild_asize.value.ui64 =7447wmsum_value(&arc_sums.arcstat_l2_rebuild_asize);7448as->arcstat_l2_rebuild_bufs.value.ui64 =7449wmsum_value(&arc_sums.arcstat_l2_rebuild_bufs);7450as->arcstat_l2_rebuild_bufs_precached.value.ui64 =7451wmsum_value(&arc_sums.arcstat_l2_rebuild_bufs_precached);7452as->arcstat_l2_rebuild_log_blks.value.ui64 =7453wmsum_value(&arc_sums.arcstat_l2_rebuild_log_blks);7454as->arcstat_memory_throttle_count.value.ui64 =7455wmsum_value(&arc_sums.arcstat_memory_throttle_count);7456as->arcstat_memory_direct_count.value.ui64 =7457wmsum_value(&arc_sums.arcstat_memory_direct_count);7458as->arcstat_memory_indirect_count.value.ui64 =7459wmsum_value(&arc_sums.arcstat_memory_indirect_count);74607461as->arcstat_memory_all_bytes.value.ui64 =7462arc_all_memory();7463as->arcstat_memory_free_bytes.value.ui64 =7464arc_free_memory();7465as->arcstat_memory_available_bytes.value.i64 =7466arc_available_memory();74677468as->arcstat_prune.value.ui64 =7469wmsum_value(&arc_sums.arcstat_prune);7470as->arcstat_meta_used.value.ui64 =7471wmsum_value(&arc_sums.arcstat_meta_used);7472as->arcstat_async_upgrade_sync.value.ui64 =7473wmsum_value(&arc_sums.arcstat_async_upgrade_sync);7474as->arcstat_predictive_prefetch.value.ui64 =7475wmsum_value(&arc_sums.arcstat_predictive_prefetch);7476as->arcstat_demand_hit_predictive_prefetch.value.ui64 =7477wmsum_value(&arc_sums.arcstat_demand_hit_predictive_prefetch);7478as->arcstat_demand_iohit_predictive_prefetch.value.ui64 =7479wmsum_value(&arc_sums.arcstat_demand_iohit_predictive_prefetch);7480as->arcstat_prescient_prefetch.value.ui64 =7481wmsum_value(&arc_sums.arcstat_prescient_prefetch);7482as->arcstat_demand_hit_prescient_prefetch.value.ui64 =7483wmsum_value(&arc_sums.arcstat_demand_hit_prescient_prefetch);7484as->arcstat_demand_iohit_prescient_prefetch.value.ui64 =7485wmsum_value(&arc_sums.arcstat_demand_iohit_prescient_prefetch);7486as->arcstat_raw_size.value.ui64 =7487wmsum_value(&arc_sums.arcstat_raw_size);7488as->arcstat_cached_only_in_progress.value.ui64 =7489wmsum_value(&arc_sums.arcstat_cached_only_in_progress);7490as->arcstat_abd_chunk_waste_size.value.ui64 
=7491wmsum_value(&arc_sums.arcstat_abd_chunk_waste_size);74927493return (0);7494}74957496/*7497* This function *must* return indices evenly distributed between all7498* sublists of the multilist. This is needed due to how the ARC eviction7499* code is laid out; arc_evict_state() assumes ARC buffers are evenly7500* distributed between all sublists and uses this assumption when7501* deciding which sublist to evict from and how much to evict from it.7502*/7503static unsigned int7504arc_state_multilist_index_func(multilist_t *ml, void *obj)7505{7506arc_buf_hdr_t *hdr = obj;75077508/*7509* We rely on b_dva to generate evenly distributed index7510* numbers using buf_hash below. So, as an added precaution,7511* let's make sure we never add empty buffers to the arc lists.7512*/7513ASSERT(!HDR_EMPTY(hdr));75147515/*7516* The assumption here, is the hash value for a given7517* arc_buf_hdr_t will remain constant throughout its lifetime7518* (i.e. its b_spa, b_dva, and b_birth fields don't change).7519* Thus, we don't need to store the header's sublist index7520* on insertion, as this index can be recalculated on removal.7521*7522* Also, the low order bits of the hash value are thought to be7523* distributed evenly. Otherwise, in the case that the multilist7524* has a power of two number of sublists, each sublists' usage7525* would not be evenly distributed. In this context full 64bit7526* division would be a waste of time, so limit it to 32 bits.7527*/7528return ((unsigned int)buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) %7529multilist_get_num_sublists(ml));7530}75317532static unsigned int7533arc_state_l2c_multilist_index_func(multilist_t *ml, void *obj)7534{7535panic("Header %p insert into arc_l2c_only %p", obj, ml);7536}75377538#define WARN_IF_TUNING_IGNORED(tuning, value, do_warn) do { \7539if ((do_warn) && (tuning) && ((tuning) != (value))) { \7540cmn_err(CE_WARN, \7541"ignoring tunable %s (using %llu instead)", \7542(#tuning), (u_longlong_t)(value)); \7543} \7544} while (0)75457546/*7547* Called during module initialization and periodically thereafter to7548* apply reasonable changes to the exposed performance tunings. Can also be7549* called explicitly by param_set_arc_*() functions when ARC tunables are7550* updated manually. Non-zero zfs_* values which differ from the currently set7551* values will be applied.7552*/7553void7554arc_tuning_update(boolean_t verbose)7555{7556uint64_t allmem = arc_all_memory();75577558/* Valid range: 32M - <arc_c_max> */7559if ((zfs_arc_min) && (zfs_arc_min != arc_c_min) &&7560(zfs_arc_min >= 2ULL << SPA_MAXBLOCKSHIFT) &&7561(zfs_arc_min <= arc_c_max)) {7562arc_c_min = zfs_arc_min;7563arc_c = MAX(arc_c, arc_c_min);7564}7565WARN_IF_TUNING_IGNORED(zfs_arc_min, arc_c_min, verbose);75667567/* Valid range: 64M - <all physical memory> */7568if ((zfs_arc_max) && (zfs_arc_max != arc_c_max) &&7569(zfs_arc_max >= MIN_ARC_MAX) && (zfs_arc_max < allmem) &&7570(zfs_arc_max > arc_c_min)) {7571arc_c_max = zfs_arc_max;7572arc_c = MIN(arc_c, arc_c_max);7573if (arc_dnode_limit > arc_c_max)7574arc_dnode_limit = arc_c_max;7575}7576WARN_IF_TUNING_IGNORED(zfs_arc_max, arc_c_max, verbose);75777578/* Valid range: 0 - <all physical memory> */7579arc_dnode_limit = zfs_arc_dnode_limit ? 
zfs_arc_dnode_limit :7580MIN(zfs_arc_dnode_limit_percent, 100) * arc_c_max / 100;7581WARN_IF_TUNING_IGNORED(zfs_arc_dnode_limit, arc_dnode_limit, verbose);75827583/* Valid range: 1 - N */7584if (zfs_arc_grow_retry)7585arc_grow_retry = zfs_arc_grow_retry;75867587/* Valid range: 1 - N */7588if (zfs_arc_shrink_shift) {7589arc_shrink_shift = zfs_arc_shrink_shift;7590arc_no_grow_shift = MIN(arc_no_grow_shift, arc_shrink_shift -1);7591}75927593/* Valid range: 1 - N ms */7594if (zfs_arc_min_prefetch_ms)7595arc_min_prefetch_ms = zfs_arc_min_prefetch_ms;75967597/* Valid range: 1 - N ms */7598if (zfs_arc_min_prescient_prefetch_ms) {7599arc_min_prescient_prefetch_ms =7600zfs_arc_min_prescient_prefetch_ms;7601}76027603/* Valid range: 0 - 100 */7604if (zfs_arc_lotsfree_percent <= 100)7605arc_lotsfree_percent = zfs_arc_lotsfree_percent;7606WARN_IF_TUNING_IGNORED(zfs_arc_lotsfree_percent, arc_lotsfree_percent,7607verbose);76087609/* Valid range: 0 - <all physical memory> */7610if ((zfs_arc_sys_free) && (zfs_arc_sys_free != arc_sys_free))7611arc_sys_free = MIN(zfs_arc_sys_free, allmem);7612WARN_IF_TUNING_IGNORED(zfs_arc_sys_free, arc_sys_free, verbose);7613}76147615static void7616arc_state_multilist_init(multilist_t *ml,7617multilist_sublist_index_func_t *index_func, int *maxcountp)7618{7619multilist_create(ml, sizeof (arc_buf_hdr_t),7620offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), index_func);7621*maxcountp = MAX(*maxcountp, multilist_get_num_sublists(ml));7622}76237624static void7625arc_state_init(void)7626{7627int num_sublists = 0;76287629arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_METADATA],7630arc_state_multilist_index_func, &num_sublists);7631arc_state_multilist_init(&arc_mru->arcs_list[ARC_BUFC_DATA],7632arc_state_multilist_index_func, &num_sublists);7633arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA],7634arc_state_multilist_index_func, &num_sublists);7635arc_state_multilist_init(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA],7636arc_state_multilist_index_func, &num_sublists);7637arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_METADATA],7638arc_state_multilist_index_func, &num_sublists);7639arc_state_multilist_init(&arc_mfu->arcs_list[ARC_BUFC_DATA],7640arc_state_multilist_index_func, &num_sublists);7641arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA],7642arc_state_multilist_index_func, &num_sublists);7643arc_state_multilist_init(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA],7644arc_state_multilist_index_func, &num_sublists);7645arc_state_multilist_init(&arc_uncached->arcs_list[ARC_BUFC_METADATA],7646arc_state_multilist_index_func, &num_sublists);7647arc_state_multilist_init(&arc_uncached->arcs_list[ARC_BUFC_DATA],7648arc_state_multilist_index_func, &num_sublists);76497650/*7651* L2 headers should never be on the L2 state list since they don't7652* have L1 headers allocated. Special index function asserts that.7653*/7654arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA],7655arc_state_l2c_multilist_index_func, &num_sublists);7656arc_state_multilist_init(&arc_l2c_only->arcs_list[ARC_BUFC_DATA],7657arc_state_l2c_multilist_index_func, &num_sublists);76587659/*7660* Keep track of the number of markers needed to reclaim buffers from7661* any ARC state. 
The markers will be pre-allocated so as to minimize7662* the number of memory allocations performed by the eviction thread.7663*/7664arc_state_evict_marker_count = num_sublists;76657666zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);7667zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]);7668zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);7669zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]);7670zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);7671zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);7672zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);7673zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);7674zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);7675zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);7676zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);7677zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);7678zfs_refcount_create(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]);7679zfs_refcount_create(&arc_uncached->arcs_esize[ARC_BUFC_DATA]);76807681zfs_refcount_create(&arc_anon->arcs_size[ARC_BUFC_DATA]);7682zfs_refcount_create(&arc_anon->arcs_size[ARC_BUFC_METADATA]);7683zfs_refcount_create(&arc_mru->arcs_size[ARC_BUFC_DATA]);7684zfs_refcount_create(&arc_mru->arcs_size[ARC_BUFC_METADATA]);7685zfs_refcount_create(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]);7686zfs_refcount_create(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]);7687zfs_refcount_create(&arc_mfu->arcs_size[ARC_BUFC_DATA]);7688zfs_refcount_create(&arc_mfu->arcs_size[ARC_BUFC_METADATA]);7689zfs_refcount_create(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]);7690zfs_refcount_create(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]);7691zfs_refcount_create(&arc_l2c_only->arcs_size[ARC_BUFC_DATA]);7692zfs_refcount_create(&arc_l2c_only->arcs_size[ARC_BUFC_METADATA]);7693zfs_refcount_create(&arc_uncached->arcs_size[ARC_BUFC_DATA]);7694zfs_refcount_create(&arc_uncached->arcs_size[ARC_BUFC_METADATA]);76957696wmsum_init(&arc_mru_ghost->arcs_hits[ARC_BUFC_DATA], 0);7697wmsum_init(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA], 0);7698wmsum_init(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA], 0);7699wmsum_init(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA], 0);77007701wmsum_init(&arc_sums.arcstat_hits, 0);7702wmsum_init(&arc_sums.arcstat_iohits, 0);7703wmsum_init(&arc_sums.arcstat_misses, 0);7704wmsum_init(&arc_sums.arcstat_demand_data_hits, 0);7705wmsum_init(&arc_sums.arcstat_demand_data_iohits, 0);7706wmsum_init(&arc_sums.arcstat_demand_data_misses, 0);7707wmsum_init(&arc_sums.arcstat_demand_metadata_hits, 0);7708wmsum_init(&arc_sums.arcstat_demand_metadata_iohits, 0);7709wmsum_init(&arc_sums.arcstat_demand_metadata_misses, 0);7710wmsum_init(&arc_sums.arcstat_prefetch_data_hits, 0);7711wmsum_init(&arc_sums.arcstat_prefetch_data_iohits, 0);7712wmsum_init(&arc_sums.arcstat_prefetch_data_misses, 0);7713wmsum_init(&arc_sums.arcstat_prefetch_metadata_hits, 0);7714wmsum_init(&arc_sums.arcstat_prefetch_metadata_iohits, 0);7715wmsum_init(&arc_sums.arcstat_prefetch_metadata_misses, 0);7716wmsum_init(&arc_sums.arcstat_mru_hits, 0);7717wmsum_init(&arc_sums.arcstat_mru_ghost_hits, 0);7718wmsum_init(&arc_sums.arcstat_mfu_hits, 0);7719wmsum_init(&arc_sums.arcstat_mfu_ghost_hits, 0);7720wmsum_init(&arc_sums.arcstat_uncached_hits, 0);7721wmsum_init(&arc_sums.arcstat_deleted, 0);7722wmsum_init(&arc_sums.arcstat_mutex_miss, 0);7723wmsum_init(&arc_sums.arcstat_access_skip, 0);7724wmsum_init(&arc_sums.arcstat_evict_skip, 
0);7725wmsum_init(&arc_sums.arcstat_evict_not_enough, 0);7726wmsum_init(&arc_sums.arcstat_evict_l2_cached, 0);7727wmsum_init(&arc_sums.arcstat_evict_l2_eligible, 0);7728wmsum_init(&arc_sums.arcstat_evict_l2_eligible_mfu, 0);7729wmsum_init(&arc_sums.arcstat_evict_l2_eligible_mru, 0);7730wmsum_init(&arc_sums.arcstat_evict_l2_ineligible, 0);7731wmsum_init(&arc_sums.arcstat_evict_l2_skip, 0);7732wmsum_init(&arc_sums.arcstat_hash_elements, 0);7733wmsum_init(&arc_sums.arcstat_hash_collisions, 0);7734wmsum_init(&arc_sums.arcstat_hash_chains, 0);7735aggsum_init(&arc_sums.arcstat_size, 0);7736wmsum_init(&arc_sums.arcstat_compressed_size, 0);7737wmsum_init(&arc_sums.arcstat_uncompressed_size, 0);7738wmsum_init(&arc_sums.arcstat_overhead_size, 0);7739wmsum_init(&arc_sums.arcstat_hdr_size, 0);7740wmsum_init(&arc_sums.arcstat_data_size, 0);7741wmsum_init(&arc_sums.arcstat_metadata_size, 0);7742wmsum_init(&arc_sums.arcstat_dbuf_size, 0);7743aggsum_init(&arc_sums.arcstat_dnode_size, 0);7744wmsum_init(&arc_sums.arcstat_bonus_size, 0);7745wmsum_init(&arc_sums.arcstat_l2_hits, 0);7746wmsum_init(&arc_sums.arcstat_l2_misses, 0);7747wmsum_init(&arc_sums.arcstat_l2_prefetch_asize, 0);7748wmsum_init(&arc_sums.arcstat_l2_mru_asize, 0);7749wmsum_init(&arc_sums.arcstat_l2_mfu_asize, 0);7750wmsum_init(&arc_sums.arcstat_l2_bufc_data_asize, 0);7751wmsum_init(&arc_sums.arcstat_l2_bufc_metadata_asize, 0);7752wmsum_init(&arc_sums.arcstat_l2_feeds, 0);7753wmsum_init(&arc_sums.arcstat_l2_rw_clash, 0);7754wmsum_init(&arc_sums.arcstat_l2_read_bytes, 0);7755wmsum_init(&arc_sums.arcstat_l2_write_bytes, 0);7756wmsum_init(&arc_sums.arcstat_l2_writes_sent, 0);7757wmsum_init(&arc_sums.arcstat_l2_writes_done, 0);7758wmsum_init(&arc_sums.arcstat_l2_writes_error, 0);7759wmsum_init(&arc_sums.arcstat_l2_writes_lock_retry, 0);7760wmsum_init(&arc_sums.arcstat_l2_evict_lock_retry, 0);7761wmsum_init(&arc_sums.arcstat_l2_evict_reading, 0);7762wmsum_init(&arc_sums.arcstat_l2_evict_l1cached, 0);7763wmsum_init(&arc_sums.arcstat_l2_free_on_write, 0);7764wmsum_init(&arc_sums.arcstat_l2_abort_lowmem, 0);7765wmsum_init(&arc_sums.arcstat_l2_cksum_bad, 0);7766wmsum_init(&arc_sums.arcstat_l2_io_error, 0);7767wmsum_init(&arc_sums.arcstat_l2_lsize, 0);7768wmsum_init(&arc_sums.arcstat_l2_psize, 0);7769aggsum_init(&arc_sums.arcstat_l2_hdr_size, 0);7770wmsum_init(&arc_sums.arcstat_l2_log_blk_writes, 0);7771wmsum_init(&arc_sums.arcstat_l2_log_blk_asize, 0);7772wmsum_init(&arc_sums.arcstat_l2_log_blk_count, 0);7773wmsum_init(&arc_sums.arcstat_l2_rebuild_success, 0);7774wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_unsupported, 0);7775wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_io_errors, 0);7776wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_dh_errors, 0);7777wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors, 0);7778wmsum_init(&arc_sums.arcstat_l2_rebuild_abort_lowmem, 0);7779wmsum_init(&arc_sums.arcstat_l2_rebuild_size, 0);7780wmsum_init(&arc_sums.arcstat_l2_rebuild_asize, 0);7781wmsum_init(&arc_sums.arcstat_l2_rebuild_bufs, 0);7782wmsum_init(&arc_sums.arcstat_l2_rebuild_bufs_precached, 0);7783wmsum_init(&arc_sums.arcstat_l2_rebuild_log_blks, 0);7784wmsum_init(&arc_sums.arcstat_memory_throttle_count, 0);7785wmsum_init(&arc_sums.arcstat_memory_direct_count, 0);7786wmsum_init(&arc_sums.arcstat_memory_indirect_count, 0);7787wmsum_init(&arc_sums.arcstat_prune, 0);7788wmsum_init(&arc_sums.arcstat_meta_used, 0);7789wmsum_init(&arc_sums.arcstat_async_upgrade_sync, 0);7790wmsum_init(&arc_sums.arcstat_predictive_prefetch, 
0);7791wmsum_init(&arc_sums.arcstat_demand_hit_predictive_prefetch, 0);7792wmsum_init(&arc_sums.arcstat_demand_iohit_predictive_prefetch, 0);7793wmsum_init(&arc_sums.arcstat_prescient_prefetch, 0);7794wmsum_init(&arc_sums.arcstat_demand_hit_prescient_prefetch, 0);7795wmsum_init(&arc_sums.arcstat_demand_iohit_prescient_prefetch, 0);7796wmsum_init(&arc_sums.arcstat_raw_size, 0);7797wmsum_init(&arc_sums.arcstat_cached_only_in_progress, 0);7798wmsum_init(&arc_sums.arcstat_abd_chunk_waste_size, 0);77997800arc_anon->arcs_state = ARC_STATE_ANON;7801arc_mru->arcs_state = ARC_STATE_MRU;7802arc_mru_ghost->arcs_state = ARC_STATE_MRU_GHOST;7803arc_mfu->arcs_state = ARC_STATE_MFU;7804arc_mfu_ghost->arcs_state = ARC_STATE_MFU_GHOST;7805arc_l2c_only->arcs_state = ARC_STATE_L2C_ONLY;7806arc_uncached->arcs_state = ARC_STATE_UNCACHED;7807}78087809static void7810arc_state_fini(void)7811{7812zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]);7813zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]);7814zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]);7815zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]);7816zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]);7817zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]);7818zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]);7819zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]);7820zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]);7821zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]);7822zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]);7823zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]);7824zfs_refcount_destroy(&arc_uncached->arcs_esize[ARC_BUFC_METADATA]);7825zfs_refcount_destroy(&arc_uncached->arcs_esize[ARC_BUFC_DATA]);78267827zfs_refcount_destroy(&arc_anon->arcs_size[ARC_BUFC_DATA]);7828zfs_refcount_destroy(&arc_anon->arcs_size[ARC_BUFC_METADATA]);7829zfs_refcount_destroy(&arc_mru->arcs_size[ARC_BUFC_DATA]);7830zfs_refcount_destroy(&arc_mru->arcs_size[ARC_BUFC_METADATA]);7831zfs_refcount_destroy(&arc_mru_ghost->arcs_size[ARC_BUFC_DATA]);7832zfs_refcount_destroy(&arc_mru_ghost->arcs_size[ARC_BUFC_METADATA]);7833zfs_refcount_destroy(&arc_mfu->arcs_size[ARC_BUFC_DATA]);7834zfs_refcount_destroy(&arc_mfu->arcs_size[ARC_BUFC_METADATA]);7835zfs_refcount_destroy(&arc_mfu_ghost->arcs_size[ARC_BUFC_DATA]);7836zfs_refcount_destroy(&arc_mfu_ghost->arcs_size[ARC_BUFC_METADATA]);7837zfs_refcount_destroy(&arc_l2c_only->arcs_size[ARC_BUFC_DATA]);7838zfs_refcount_destroy(&arc_l2c_only->arcs_size[ARC_BUFC_METADATA]);7839zfs_refcount_destroy(&arc_uncached->arcs_size[ARC_BUFC_DATA]);7840zfs_refcount_destroy(&arc_uncached->arcs_size[ARC_BUFC_METADATA]);78417842multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_METADATA]);7843multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]);7844multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_METADATA]);7845multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]);7846multilist_destroy(&arc_mru->arcs_list[ARC_BUFC_DATA]);7847multilist_destroy(&arc_mru_ghost->arcs_list[ARC_BUFC_DATA]);7848multilist_destroy(&arc_mfu->arcs_list[ARC_BUFC_DATA]);7849multilist_destroy(&arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]);7850multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_METADATA]);7851multilist_destroy(&arc_l2c_only->arcs_list[ARC_BUFC_DATA]);7852multilist_destroy(&arc_uncached->arcs_list[ARC_BUFC_METADATA]);7853multilist_destroy(&arc_uncached->arcs_list[ARC_BUFC_DATA]);78547855wmsum_fini(&arc_mru_ghos
t->arcs_hits[ARC_BUFC_DATA]);7856wmsum_fini(&arc_mru_ghost->arcs_hits[ARC_BUFC_METADATA]);7857wmsum_fini(&arc_mfu_ghost->arcs_hits[ARC_BUFC_DATA]);7858wmsum_fini(&arc_mfu_ghost->arcs_hits[ARC_BUFC_METADATA]);78597860wmsum_fini(&arc_sums.arcstat_hits);7861wmsum_fini(&arc_sums.arcstat_iohits);7862wmsum_fini(&arc_sums.arcstat_misses);7863wmsum_fini(&arc_sums.arcstat_demand_data_hits);7864wmsum_fini(&arc_sums.arcstat_demand_data_iohits);7865wmsum_fini(&arc_sums.arcstat_demand_data_misses);7866wmsum_fini(&arc_sums.arcstat_demand_metadata_hits);7867wmsum_fini(&arc_sums.arcstat_demand_metadata_iohits);7868wmsum_fini(&arc_sums.arcstat_demand_metadata_misses);7869wmsum_fini(&arc_sums.arcstat_prefetch_data_hits);7870wmsum_fini(&arc_sums.arcstat_prefetch_data_iohits);7871wmsum_fini(&arc_sums.arcstat_prefetch_data_misses);7872wmsum_fini(&arc_sums.arcstat_prefetch_metadata_hits);7873wmsum_fini(&arc_sums.arcstat_prefetch_metadata_iohits);7874wmsum_fini(&arc_sums.arcstat_prefetch_metadata_misses);7875wmsum_fini(&arc_sums.arcstat_mru_hits);7876wmsum_fini(&arc_sums.arcstat_mru_ghost_hits);7877wmsum_fini(&arc_sums.arcstat_mfu_hits);7878wmsum_fini(&arc_sums.arcstat_mfu_ghost_hits);7879wmsum_fini(&arc_sums.arcstat_uncached_hits);7880wmsum_fini(&arc_sums.arcstat_deleted);7881wmsum_fini(&arc_sums.arcstat_mutex_miss);7882wmsum_fini(&arc_sums.arcstat_access_skip);7883wmsum_fini(&arc_sums.arcstat_evict_skip);7884wmsum_fini(&arc_sums.arcstat_evict_not_enough);7885wmsum_fini(&arc_sums.arcstat_evict_l2_cached);7886wmsum_fini(&arc_sums.arcstat_evict_l2_eligible);7887wmsum_fini(&arc_sums.arcstat_evict_l2_eligible_mfu);7888wmsum_fini(&arc_sums.arcstat_evict_l2_eligible_mru);7889wmsum_fini(&arc_sums.arcstat_evict_l2_ineligible);7890wmsum_fini(&arc_sums.arcstat_evict_l2_skip);7891wmsum_fini(&arc_sums.arcstat_hash_elements);7892wmsum_fini(&arc_sums.arcstat_hash_collisions);7893wmsum_fini(&arc_sums.arcstat_hash_chains);7894aggsum_fini(&arc_sums.arcstat_size);7895wmsum_fini(&arc_sums.arcstat_compressed_size);7896wmsum_fini(&arc_sums.arcstat_uncompressed_size);7897wmsum_fini(&arc_sums.arcstat_overhead_size);7898wmsum_fini(&arc_sums.arcstat_hdr_size);7899wmsum_fini(&arc_sums.arcstat_data_size);7900wmsum_fini(&arc_sums.arcstat_metadata_size);7901wmsum_fini(&arc_sums.arcstat_dbuf_size);7902aggsum_fini(&arc_sums.arcstat_dnode_size);7903wmsum_fini(&arc_sums.arcstat_bonus_size);7904wmsum_fini(&arc_sums.arcstat_l2_hits);7905wmsum_fini(&arc_sums.arcstat_l2_misses);7906wmsum_fini(&arc_sums.arcstat_l2_prefetch_asize);7907wmsum_fini(&arc_sums.arcstat_l2_mru_asize);7908wmsum_fini(&arc_sums.arcstat_l2_mfu_asize);7909wmsum_fini(&arc_sums.arcstat_l2_bufc_data_asize);7910wmsum_fini(&arc_sums.arcstat_l2_bufc_metadata_asize);7911wmsum_fini(&arc_sums.arcstat_l2_feeds);7912wmsum_fini(&arc_sums.arcstat_l2_rw_clash);7913wmsum_fini(&arc_sums.arcstat_l2_read_bytes);7914wmsum_fini(&arc_sums.arcstat_l2_write_bytes);7915wmsum_fini(&arc_sums.arcstat_l2_writes_sent);7916wmsum_fini(&arc_sums.arcstat_l2_writes_done);7917wmsum_fini(&arc_sums.arcstat_l2_writes_error);7918wmsum_fini(&arc_sums.arcstat_l2_writes_lock_retry);7919wmsum_fini(&arc_sums.arcstat_l2_evict_lock_retry);7920wmsum_fini(&arc_sums.arcstat_l2_evict_reading);7921wmsum_fini(&arc_sums.arcstat_l2_evict_l1cached);7922wmsum_fini(&arc_sums.arcstat_l2_free_on_write);7923wmsum_fini(&arc_sums.arcstat_l2_abort_lowmem);7924wmsum_fini(&arc_sums.arcstat_l2_cksum_bad);7925wmsum_fini(&arc_sums.arcstat_l2_io_error);7926wmsum_fini(&arc_sums.arcstat_l2_lsize);7927wmsum_fini(&arc_sums.arcstat_l2_psize);7928aggsu
m_fini(&arc_sums.arcstat_l2_hdr_size);7929wmsum_fini(&arc_sums.arcstat_l2_log_blk_writes);7930wmsum_fini(&arc_sums.arcstat_l2_log_blk_asize);7931wmsum_fini(&arc_sums.arcstat_l2_log_blk_count);7932wmsum_fini(&arc_sums.arcstat_l2_rebuild_success);7933wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_unsupported);7934wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_io_errors);7935wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_dh_errors);7936wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_cksum_lb_errors);7937wmsum_fini(&arc_sums.arcstat_l2_rebuild_abort_lowmem);7938wmsum_fini(&arc_sums.arcstat_l2_rebuild_size);7939wmsum_fini(&arc_sums.arcstat_l2_rebuild_asize);7940wmsum_fini(&arc_sums.arcstat_l2_rebuild_bufs);7941wmsum_fini(&arc_sums.arcstat_l2_rebuild_bufs_precached);7942wmsum_fini(&arc_sums.arcstat_l2_rebuild_log_blks);7943wmsum_fini(&arc_sums.arcstat_memory_throttle_count);7944wmsum_fini(&arc_sums.arcstat_memory_direct_count);7945wmsum_fini(&arc_sums.arcstat_memory_indirect_count);7946wmsum_fini(&arc_sums.arcstat_prune);7947wmsum_fini(&arc_sums.arcstat_meta_used);7948wmsum_fini(&arc_sums.arcstat_async_upgrade_sync);7949wmsum_fini(&arc_sums.arcstat_predictive_prefetch);7950wmsum_fini(&arc_sums.arcstat_demand_hit_predictive_prefetch);7951wmsum_fini(&arc_sums.arcstat_demand_iohit_predictive_prefetch);7952wmsum_fini(&arc_sums.arcstat_prescient_prefetch);7953wmsum_fini(&arc_sums.arcstat_demand_hit_prescient_prefetch);7954wmsum_fini(&arc_sums.arcstat_demand_iohit_prescient_prefetch);7955wmsum_fini(&arc_sums.arcstat_raw_size);7956wmsum_fini(&arc_sums.arcstat_cached_only_in_progress);7957wmsum_fini(&arc_sums.arcstat_abd_chunk_waste_size);7958}79597960uint64_t7961arc_target_bytes(void)7962{7963return (arc_c);7964}79657966void7967arc_set_limits(uint64_t allmem)7968{7969/* Set min cache to 1/32 of all memory, or 32MB, whichever is more. */7970arc_c_min = MAX(allmem / 32, 2ULL << SPA_MAXBLOCKSHIFT);79717972/* How to set default max varies by platform. */7973arc_c_max = arc_default_max(arc_c_min, allmem);7974}79757976void7977arc_init(void)7978{7979uint64_t percent, allmem = arc_all_memory();7980mutex_init(&arc_evict_lock, NULL, MUTEX_DEFAULT, NULL);7981list_create(&arc_evict_waiters, sizeof (arc_evict_waiter_t),7982offsetof(arc_evict_waiter_t, aew_node));79837984arc_min_prefetch_ms = 1000;7985arc_min_prescient_prefetch_ms = 6000;79867987#if defined(_KERNEL)7988arc_lowmem_init();7989#endif79907991arc_set_limits(allmem);79927993#ifdef _KERNEL7994/*7995* If zfs_arc_max is non-zero at init, meaning it was set in the kernel7996* environment before the module was loaded, don't block setting the7997* maximum because it is less than arc_c_min, instead, reset arc_c_min7998* to a lower value.7999* zfs_arc_min will be handled by arc_tuning_update().8000*/8001if (zfs_arc_max != 0 && zfs_arc_max >= MIN_ARC_MAX &&8002zfs_arc_max < allmem) {8003arc_c_max = zfs_arc_max;8004if (arc_c_min >= arc_c_max) {8005arc_c_min = MAX(zfs_arc_max / 2,80062ULL << SPA_MAXBLOCKSHIFT);8007}8008}8009#else8010/*8011* In userland, there's only the memory pressure that we artificially8012* create (see arc_available_memory()). 
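* (For scale: assuming SPA_MAXBLOCKSHIFT is 24, i.e. a 16 MiB maximum
* block size, 2ULL << SPA_MAXBLOCKSHIFT below works out to a 32 MiB
* floor for arc_c_min, or half of arc_c_max if that is larger.)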
Don't let arc_c get too8013* small, because it can cause transactions to be larger than8014* arc_c, causing arc_tempreserve_space() to fail.8015*/8016arc_c_min = MAX(arc_c_max / 2, 2ULL << SPA_MAXBLOCKSHIFT);8017#endif80188019arc_c = arc_c_min;8020/*8021* 32-bit fixed point fractions of metadata from total ARC size,8022* MRU data from all data and MRU metadata from all metadata.8023*/8024arc_meta = (1ULL << 32) / 4; /* Metadata is 25% of arc_c. */8025arc_pd = (1ULL << 32) / 2; /* Data MRU is 50% of data. */8026arc_pm = (1ULL << 32) / 2; /* Metadata MRU is 50% of metadata. */80278028percent = MIN(zfs_arc_dnode_limit_percent, 100);8029arc_dnode_limit = arc_c_max * percent / 100;80308031/* Apply user specified tunings */8032arc_tuning_update(B_TRUE);80338034/* if kmem_flags are set, lets try to use less memory */8035if (kmem_debugging())8036arc_c = arc_c / 2;8037if (arc_c < arc_c_min)8038arc_c = arc_c_min;80398040arc_register_hotplug();80418042arc_state_init();80438044buf_init();80458046list_create(&arc_prune_list, sizeof (arc_prune_t),8047offsetof(arc_prune_t, p_node));8048mutex_init(&arc_prune_mtx, NULL, MUTEX_DEFAULT, NULL);80498050arc_prune_taskq = taskq_create("arc_prune", zfs_arc_prune_task_threads,8051defclsyspri, 100, INT_MAX, TASKQ_PREPOPULATE | TASKQ_DYNAMIC);80528053arc_evict_thread_init();80548055list_create(&arc_async_flush_list, sizeof (arc_async_flush_t),8056offsetof(arc_async_flush_t, af_node));8057mutex_init(&arc_async_flush_lock, NULL, MUTEX_DEFAULT, NULL);8058arc_flush_taskq = taskq_create("arc_flush", MIN(boot_ncpus, 4),8059defclsyspri, 1, INT_MAX, TASKQ_DYNAMIC);80608061arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,8062sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);80638064if (arc_ksp != NULL) {8065arc_ksp->ks_data = &arc_stats;8066arc_ksp->ks_update = arc_kstat_update;8067kstat_install(arc_ksp);8068}80698070arc_state_evict_markers =8071arc_state_alloc_markers(arc_state_evict_marker_count);8072arc_evict_zthr = zthr_create_timer("arc_evict",8073arc_evict_cb_check, arc_evict_cb, NULL, SEC2NSEC(1), defclsyspri);8074arc_reap_zthr = zthr_create_timer("arc_reap",8075arc_reap_cb_check, arc_reap_cb, NULL, SEC2NSEC(1), minclsyspri);80768077arc_warm = B_FALSE;80788079/*8080* Calculate maximum amount of dirty data per pool.8081*8082* If it has been set by a module parameter, take that.8083* Otherwise, use a percentage of physical memory defined by8084* zfs_dirty_data_max_percent (default 10%) with a cap at8085* zfs_dirty_data_max_max (default 4G or 25% of physical memory).8086*/8087#ifdef __LP64__8088if (zfs_dirty_data_max_max == 0)8089zfs_dirty_data_max_max = MIN(4ULL * 1024 * 1024 * 1024,8090allmem * zfs_dirty_data_max_max_percent / 100);8091#else8092if (zfs_dirty_data_max_max == 0)8093zfs_dirty_data_max_max = MIN(1ULL * 1024 * 1024 * 1024,8094allmem * zfs_dirty_data_max_max_percent / 100);8095#endif80968097if (zfs_dirty_data_max == 0) {8098zfs_dirty_data_max = allmem *8099zfs_dirty_data_max_percent / 100;8100zfs_dirty_data_max = MIN(zfs_dirty_data_max,8101zfs_dirty_data_max_max);8102}81038104if (zfs_wrlog_data_max == 0) {81058106/*8107* dp_wrlog_total is reduced for each txg at the end of8108* spa_sync(). However, dp_dirty_total is reduced every time8109* a block is written out. 
Thus under normal operation,8110* dp_wrlog_total could grow 2 times as big as8111* zfs_dirty_data_max.8112*/8113zfs_wrlog_data_max = zfs_dirty_data_max * 2;8114}8115}81168117void8118arc_fini(void)8119{8120arc_prune_t *p;81218122#ifdef _KERNEL8123arc_lowmem_fini();8124#endif /* _KERNEL */81258126/* Wait for any background flushes */8127taskq_wait(arc_flush_taskq);8128taskq_destroy(arc_flush_taskq);81298130/* Use B_TRUE to ensure *all* buffers are evicted */8131arc_flush(NULL, B_TRUE);81328133if (arc_ksp != NULL) {8134kstat_delete(arc_ksp);8135arc_ksp = NULL;8136}81378138taskq_wait(arc_prune_taskq);8139taskq_destroy(arc_prune_taskq);81408141list_destroy(&arc_async_flush_list);8142mutex_destroy(&arc_async_flush_lock);81438144mutex_enter(&arc_prune_mtx);8145while ((p = list_remove_head(&arc_prune_list)) != NULL) {8146(void) zfs_refcount_remove(&p->p_refcnt, &arc_prune_list);8147zfs_refcount_destroy(&p->p_refcnt);8148kmem_free(p, sizeof (*p));8149}8150mutex_exit(&arc_prune_mtx);81518152list_destroy(&arc_prune_list);8153mutex_destroy(&arc_prune_mtx);81548155if (arc_evict_taskq != NULL)8156taskq_wait(arc_evict_taskq);81578158(void) zthr_cancel(arc_evict_zthr);8159(void) zthr_cancel(arc_reap_zthr);8160arc_state_free_markers(arc_state_evict_markers,8161arc_state_evict_marker_count);81628163if (arc_evict_taskq != NULL) {8164taskq_destroy(arc_evict_taskq);8165kmem_free(arc_evict_arg,8166sizeof (evict_arg_t) * zfs_arc_evict_threads);8167}81688169mutex_destroy(&arc_evict_lock);8170list_destroy(&arc_evict_waiters);81718172/*8173* Free any buffers that were tagged for destruction. This needs8174* to occur before arc_state_fini() runs and destroys the aggsum8175* values which are updated when freeing scatter ABDs.8176*/8177l2arc_do_free_on_write();81788179/*8180* buf_fini() must proceed arc_state_fini() because buf_fin() may8181* trigger the release of kmem magazines, which can callback to8182* arc_space_return() which accesses aggsums freed in act_state_fini().8183*/8184buf_fini();8185arc_state_fini();81868187arc_unregister_hotplug();81888189/*8190* We destroy the zthrs after all the ARC state has been8191* torn down to avoid the case of them receiving any8192* wakeup() signals after they are destroyed.8193*/8194zthr_destroy(arc_evict_zthr);8195zthr_destroy(arc_reap_zthr);81968197ASSERT0(arc_loaned_bytes);8198}81998200/*8201* Level 2 ARC8202*8203* The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.8204* It uses dedicated storage devices to hold cached data, which are populated8205* using large infrequent writes. The main role of this cache is to boost8206* the performance of random read workloads. 
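* (Such a device is added to a pool as a "cache" vdev, e.g.
* "zpool add tank cache nvme0n1" -- the pool and device names here are
* placeholders only.)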
The intended L2ARC devices8207* include short-stroked disks, solid state disks, and other media with8208* substantially faster read latency than disk.8209*8210* +-----------------------+8211* | ARC |8212* +-----------------------+8213* | ^ ^8214* | | |8215* l2arc_feed_thread() arc_read()8216* | | |8217* | l2arc read |8218* V | |8219* +---------------+ |8220* | L2ARC | |8221* +---------------+ |8222* | ^ |8223* l2arc_write() | |8224* | | |8225* V | |8226* +-------+ +-------+8227* | vdev | | vdev |8228* | cache | | cache |8229* +-------+ +-------+8230* +=========+ .-----.8231* : L2ARC : |-_____-|8232* : devices : | Disks |8233* +=========+ `-_____-'8234*8235* Read requests are satisfied from the following sources, in order:8236*8237* 1) ARC8238* 2) vdev cache of L2ARC devices8239* 3) L2ARC devices8240* 4) vdev cache of disks8241* 5) disks8242*8243* Some L2ARC device types exhibit extremely slow write performance.8244* To accommodate for this there are some significant differences between8245* the L2ARC and traditional cache design:8246*8247* 1. There is no eviction path from the ARC to the L2ARC. Evictions from8248* the ARC behave as usual, freeing buffers and placing headers on ghost8249* lists. The ARC does not send buffers to the L2ARC during eviction as8250* this would add inflated write latencies for all ARC memory pressure.8251*8252* 2. The L2ARC attempts to cache data from the ARC before it is evicted.8253* It does this by periodically scanning buffers from the eviction-end of8254* the MFU and MRU ARC lists, copying them to the L2ARC devices if they are8255* not already there. It scans until a headroom of buffers is satisfied,8256* which itself is a buffer for ARC eviction. If a compressible buffer is8257* found during scanning and selected for writing to an L2ARC device, we8258* temporarily boost scanning headroom during the next scan cycle to make8259* sure we adapt to compression effects (which might significantly reduce8260* the data volume we write to L2ARC). The thread that does this is8261* l2arc_feed_thread(), illustrated below; example sizes are included to8262* provide a better sense of ratio than this diagram:8263*8264* head --> tail8265* +---------------------+----------+8266* ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC8267* +---------------------+----------+ | o L2ARC eligible8268* ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer8269* +---------------------+----------+ |8270* 15.9 Gbytes ^ 32 Mbytes |8271* headroom |8272* l2arc_feed_thread()8273* |8274* l2arc write hand <--[oooo]--'8275* | 8 Mbyte8276* | write max8277* V8278* +==============================+8279* L2ARC dev |####|#|###|###| |####| ... |8280* +==============================+8281* 32 Gbytes8282*8283* 3. If an ARC buffer is copied to the L2ARC but then hit instead of8284* evicted, then the L2ARC has cached a buffer much sooner than it probably8285* needed to, potentially wasting L2ARC device bandwidth and storage. It is8286* safe to say that this is an uncommon case, since buffers at the end of8287* the ARC lists have moved there due to inactivity.8288*8289* 4. If the ARC evicts faster than the L2ARC can maintain a headroom,8290* then the L2ARC simply misses copying some buffers. This serves as a8291* pressure valve to prevent heavy read workloads from both stalling the ARC8292* with waits and clogging the L2ARC with writes. 
This also helps prevent8293* the potential for the L2ARC to churn if it attempts to cache content too8294* quickly, such as during backups of the entire pool.8295*8296* 5. After system boot and before the ARC has filled main memory, there are8297* no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru8298* lists can remain mostly static. Instead of searching from tail of these8299* lists as pictured, the l2arc_feed_thread() will search from the list heads8300* for eligible buffers, greatly increasing its chance of finding them.8301*8302* The L2ARC device write speed is also boosted during this time so that8303* the L2ARC warms up faster. Since there have been no ARC evictions yet,8304* there are no L2ARC reads, and no fear of degrading read performance8305* through increased writes.8306*8307* 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that8308* the vdev queue can aggregate them into larger and fewer writes. Each8309* device is written to in a rotor fashion, sweeping writes through8310* available space then repeating.8311*8312* 7. The L2ARC does not store dirty content. It never needs to flush8313* write buffers back to disk based storage.8314*8315* 8. If an ARC buffer is written (and dirtied) which also exists in the8316* L2ARC, the now stale L2ARC buffer is immediately dropped.8317*8318* The performance of the L2ARC can be tweaked by a number of tunables, which8319* may be necessary for different workloads:8320*8321* l2arc_write_max max write bytes per interval8322* l2arc_write_boost extra write bytes during device warmup8323* l2arc_noprefetch skip caching prefetched buffers8324* l2arc_headroom number of max device writes to precache8325* l2arc_headroom_boost when we find compressed buffers during ARC8326* scanning, we multiply headroom by this8327* percentage factor for the next scan cycle,8328* since more compressed buffers are likely to8329* be present8330* l2arc_feed_secs seconds between L2ARC writing8331*8332* Tunables may be removed or added as future performance improvements are8333* integrated, and also may become zpool properties.8334*8335* There are three key functions that control how the L2ARC warms up:8336*8337* l2arc_write_eligible() check if a buffer is eligible to cache8338* l2arc_write_size() calculate how much to write8339* l2arc_write_interval() calculate sleep delay between writes8340*8341* These three functions determine what to write, how much, and how quickly8342* to send writes.8343*8344* L2ARC persistence:8345*8346* When writing buffers to L2ARC, we periodically add some metadata to8347* make sure we can pick them up after reboot, thus dramatically reducing8348* the impact that any downtime has on the performance of storage systems8349* with large caches.8350*8351* The implementation works fairly simply by integrating the following two8352* modifications:8353*8354* *) When writing to the L2ARC, we occasionally write a "l2arc log block",8355* which is an additional piece of metadata which describes what's been8356* written. This allows us to rebuild the arc_buf_hdr_t structures of the8357* main ARC buffers. There are 2 linked-lists of log blocks headed by8358* dh_start_lbps[2]. We alternate which chain we append to, so they are8359* time-wise and offset-wise interleaved, but that is an optimization rather8360* than for correctness. 
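* (Put differently: if log blocks are numbered in the order they are
* written, one chain ends up holding the even-numbered blocks and the
* other the odd-numbered ones.)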
The log block also includes a pointer to the8361* previous block in its chain.8362*8363* *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device8364* for our header bookkeeping purposes. This contains a device header,8365* which contains our top-level reference structures. We update it each8366* time we write a new log block, so that we're able to locate it in the8367* L2ARC device. If this write results in an inconsistent device header8368* (e.g. due to power failure), we detect this by verifying the header's8369* checksum and simply fail to reconstruct the L2ARC after reboot.8370*8371* Implementation diagram:8372*8373* +=== L2ARC device (not to scale) ======================================+8374* | ___two newest log block pointers__.__________ |8375* | / \dh_start_lbps[1] |8376* | / \ \dh_start_lbps[0]|8377* |.___/__. V V |8378* ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---|8379* || hdr| ^ /^ /^ / / |8380* |+------+ ...--\-------/ \-----/--\------/ / |8381* | \--------------/ \--------------/ |8382* +======================================================================+8383*8384* As can be seen on the diagram, rather than using a simple linked list,8385* we use a pair of linked lists with alternating elements. This is a8386* performance enhancement due to the fact that we only find out the8387* address of the next log block access once the current block has been8388* completely read in. Obviously, this hurts performance, because we'd be8389* keeping the device's I/O queue at only a 1 operation deep, thus8390* incurring a large amount of I/O round-trip latency. Having two lists8391* allows us to fetch two log blocks ahead of where we are currently8392* rebuilding L2ARC buffers.8393*8394* On-device data structures:8395*8396* L2ARC device header: l2arc_dev_hdr_phys_t8397* L2ARC log block: l2arc_log_blk_phys_t8398*8399* L2ARC reconstruction:8400*8401* When writing data, we simply write in the standard rotary fashion,8402* evicting buffers as we go and simply writing new data over them (writing8403* a new log block every now and then). This obviously means that once we8404* loop around the end of the device, we will start cutting into an already8405* committed log block (and its referenced data buffers), like so:8406*8407* current write head__ __old tail8408* \ /8409* V V8410* <--|bufs |lb |bufs |lb | |bufs |lb |bufs |lb |-->8411* ^ ^^^^^^^^^___________________________________8412* | \8413* <<nextwrite>> may overwrite this blk and/or its bufs --'8414*8415* When importing the pool, we detect this situation and use it to stop8416* our scanning process (see l2arc_rebuild).8417*8418* There is one significant caveat to consider when rebuilding ARC contents8419* from an L2ARC device: what about invalidated buffers? Given the above8420* construction, we cannot update blocks which we've already written to amend8421* them to remove buffers which were invalidated. Thus, during reconstruction,8422* we might be populating the cache with buffers for data that's not on the8423* main pool anymore, or may have been overwritten!8424*8425* As it turns out, this isn't a problem. Every arc_read request includes8426* both the DVA and, crucially, the birth TXG of the BP the caller is8427* looking for. 
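* (The ARC hash table is keyed on exactly this tuple: buf_hash() hashes
* the spa, the DVA and the birth TXG.)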
So even if the cache were populated by completely rotten8428* blocks for data that had been long deleted and/or overwritten, we'll8429* never actually return bad data from the cache, since the DVA with the8430* birth TXG uniquely identify a block in space and time - once created,8431* a block is immutable on disk. The worst thing we have done is wasted8432* some time and memory at l2arc rebuild to reconstruct outdated ARC8433* entries that will get dropped from the l2arc as it is being updated8434* with new blocks.8435*8436* L2ARC buffers that have been evicted by l2arc_evict() ahead of the write8437* hand are not restored. This is done by saving the offset (in bytes)8438* l2arc_evict() has evicted to in the L2ARC device header and taking it8439* into account when restoring buffers.8440*/84418442static boolean_t8443l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)8444{8445/*8446* A buffer is *not* eligible for the L2ARC if it:8447* 1. belongs to a different spa.8448* 2. is already cached on the L2ARC.8449* 3. has an I/O in progress (it may be an incomplete read).8450* 4. is flagged not eligible (zfs property).8451*/8452if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) ||8453HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr))8454return (B_FALSE);84558456return (B_TRUE);8457}84588459static uint64_t8460l2arc_write_size(l2arc_dev_t *dev)8461{8462uint64_t size;84638464/*8465* Make sure our globals have meaningful values in case the user8466* altered them.8467*/8468size = l2arc_write_max;8469if (size == 0) {8470cmn_err(CE_NOTE, "l2arc_write_max must be greater than zero, "8471"resetting it to the default (%d)", L2ARC_WRITE_SIZE);8472size = l2arc_write_max = L2ARC_WRITE_SIZE;8473}84748475if (arc_warm == B_FALSE)8476size += l2arc_write_boost;84778478/* We need to add in the worst case scenario of log block overhead. */8479size += l2arc_log_blk_overhead(size, dev);8480if (dev->l2ad_vdev->vdev_has_trim && l2arc_trim_ahead > 0) {8481/*8482* Trim ahead of the write size 64MB or (l2arc_trim_ahead/100)8483* times the writesize, whichever is greater.8484*/8485size += MAX(64 * 1024 * 1024,8486(size * l2arc_trim_ahead) / 100);8487}84888489/*8490* Make sure the write size does not exceed the size of the cache8491* device. This is important in l2arc_evict(), otherwise infinite8492* iteration can occur.8493*/8494size = MIN(size, (dev->l2ad_end - dev->l2ad_start) / 4);84958496size = P2ROUNDUP(size, 1ULL << dev->l2ad_vdev->vdev_ashift);84978498return (size);84998500}85018502static clock_t8503l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)8504{8505clock_t interval, next, now;85068507/*8508* If the ARC lists are busy, increase our write rate; if the8509* lists are stale, idle back. 
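* (The fast and slow rates are bounded by l2arc_feed_min_ms and
* l2arc_feed_secs; their usual defaults are 200 ms and 1 s respectively,
* but treat those numbers as illustrative.)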
This is achieved by checking8510* how much we previously wrote - if it was more than half of8511* what we wanted, schedule the next write much sooner.8512*/8513if (l2arc_feed_again && wrote > (wanted / 2))8514interval = (hz * l2arc_feed_min_ms) / 1000;8515else8516interval = hz * l2arc_feed_secs;85178518now = ddi_get_lbolt();8519next = MAX(now, MIN(now + interval, began + interval));85208521return (next);8522}85238524static boolean_t8525l2arc_dev_invalid(const l2arc_dev_t *dev)8526{8527/*8528* We want to skip devices that are being rebuilt, trimmed,8529* removed, or belong to a spa that is being exported.8530*/8531return (dev->l2ad_vdev == NULL || vdev_is_dead(dev->l2ad_vdev) ||8532dev->l2ad_rebuild || dev->l2ad_trim_all ||8533dev->l2ad_spa == NULL || dev->l2ad_spa->spa_is_exporting);8534}85358536/*8537* Cycle through L2ARC devices. This is how L2ARC load balances.8538* If a device is returned, this also returns holding the spa config lock.8539*/8540static l2arc_dev_t *8541l2arc_dev_get_next(void)8542{8543l2arc_dev_t *first, *next = NULL;85448545/*8546* Lock out the removal of spas (spa_namespace_lock), then removal8547* of cache devices (l2arc_dev_mtx). Once a device has been selected,8548* both locks will be dropped and a spa config lock held instead.8549*/8550mutex_enter(&spa_namespace_lock);8551mutex_enter(&l2arc_dev_mtx);85528553/* if there are no vdevs, there is nothing to do */8554if (l2arc_ndev == 0)8555goto out;85568557first = NULL;8558next = l2arc_dev_last;8559do {8560/* loop around the list looking for a non-faulted vdev */8561if (next == NULL) {8562next = list_head(l2arc_dev_list);8563} else {8564next = list_next(l2arc_dev_list, next);8565if (next == NULL)8566next = list_head(l2arc_dev_list);8567}85688569/* if we have come back to the start, bail out */8570if (first == NULL)8571first = next;8572else if (next == first)8573break;85748575ASSERT3P(next, !=, NULL);8576} while (l2arc_dev_invalid(next));85778578/* if we were unable to find any usable vdevs, return NULL */8579if (l2arc_dev_invalid(next))8580next = NULL;85818582l2arc_dev_last = next;85838584out:8585mutex_exit(&l2arc_dev_mtx);85868587/*8588* Grab the config lock to prevent the 'next' device from being8589* removed while we are writing to it.8590*/8591if (next != NULL)8592spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);8593mutex_exit(&spa_namespace_lock);85948595return (next);8596}85978598/*8599* Free buffers that were tagged for destruction.8600*/8601static void8602l2arc_do_free_on_write(void)8603{8604l2arc_data_free_t *df;86058606mutex_enter(&l2arc_free_on_write_mtx);8607while ((df = list_remove_head(l2arc_free_on_write)) != NULL) {8608ASSERT3P(df->l2df_abd, !=, NULL);8609abd_free(df->l2df_abd);8610kmem_free(df, sizeof (l2arc_data_free_t));8611}8612mutex_exit(&l2arc_free_on_write_mtx);8613}86148615/*8616* A write to a cache device has completed. 
Update all headers to allow8617* reads from these buffers to begin.8618*/8619static void8620l2arc_write_done(zio_t *zio)8621{8622l2arc_write_callback_t *cb;8623l2arc_lb_abd_buf_t *abd_buf;8624l2arc_lb_ptr_buf_t *lb_ptr_buf;8625l2arc_dev_t *dev;8626l2arc_dev_hdr_phys_t *l2dhdr;8627list_t *buflist;8628arc_buf_hdr_t *head, *hdr, *hdr_prev;8629kmutex_t *hash_lock;8630int64_t bytes_dropped = 0;86318632cb = zio->io_private;8633ASSERT3P(cb, !=, NULL);8634dev = cb->l2wcb_dev;8635l2dhdr = dev->l2ad_dev_hdr;8636ASSERT3P(dev, !=, NULL);8637head = cb->l2wcb_head;8638ASSERT3P(head, !=, NULL);8639buflist = &dev->l2ad_buflist;8640ASSERT3P(buflist, !=, NULL);8641DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,8642l2arc_write_callback_t *, cb);86438644/*8645* All writes completed, or an error was hit.8646*/8647top:8648mutex_enter(&dev->l2ad_mtx);8649for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) {8650hdr_prev = list_prev(buflist, hdr);86518652hash_lock = HDR_LOCK(hdr);86538654/*8655* We cannot use mutex_enter or else we can deadlock8656* with l2arc_write_buffers (due to swapping the order8657* the hash lock and l2ad_mtx are taken).8658*/8659if (!mutex_tryenter(hash_lock)) {8660/*8661* Missed the hash lock. We must retry so we8662* don't leave the ARC_FLAG_L2_WRITING bit set.8663*/8664ARCSTAT_BUMP(arcstat_l2_writes_lock_retry);86658666/*8667* We don't want to rescan the headers we've8668* already marked as having been written out, so8669* we reinsert the head node so we can pick up8670* where we left off.8671*/8672list_remove(buflist, head);8673list_insert_after(buflist, hdr, head);86748675mutex_exit(&dev->l2ad_mtx);86768677/*8678* We wait for the hash lock to become available8679* to try and prevent busy waiting, and increase8680* the chance we'll be able to acquire the lock8681* the next time around.8682*/8683mutex_enter(hash_lock);8684mutex_exit(hash_lock);8685goto top;8686}86878688/*8689* We could not have been moved into the arc_l2c_only8690* state while in-flight due to our ARC_FLAG_L2_WRITING8691* bit being set. 
Let's just ensure that's being enforced.8692*/8693ASSERT(HDR_HAS_L1HDR(hdr));86948695/*8696* Skipped - drop L2ARC entry and mark the header as no8697* longer L2 eligibile.8698*/8699if (zio->io_error != 0) {8700/*8701* Error - drop L2ARC entry.8702*/8703list_remove(buflist, hdr);8704arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR);87058706uint64_t psize = HDR_GET_PSIZE(hdr);8707l2arc_hdr_arcstats_decrement(hdr);87088709ASSERT(dev->l2ad_vdev != NULL);87108711bytes_dropped +=8712vdev_psize_to_asize(dev->l2ad_vdev, psize);8713(void) zfs_refcount_remove_many(&dev->l2ad_alloc,8714arc_hdr_size(hdr), hdr);8715}87168717/*8718* Allow ARC to begin reads and ghost list evictions to8719* this L2ARC entry.8720*/8721arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING);87228723mutex_exit(hash_lock);8724}87258726/*8727* Free the allocated abd buffers for writing the log blocks.8728* If the zio failed reclaim the allocated space and remove the8729* pointers to these log blocks from the log block pointer list8730* of the L2ARC device.8731*/8732while ((abd_buf = list_remove_tail(&cb->l2wcb_abd_list)) != NULL) {8733abd_free(abd_buf->abd);8734zio_buf_free(abd_buf, sizeof (*abd_buf));8735if (zio->io_error != 0) {8736lb_ptr_buf = list_remove_head(&dev->l2ad_lbptr_list);8737/*8738* L2BLK_GET_PSIZE returns aligned size for log8739* blocks.8740*/8741uint64_t asize =8742L2BLK_GET_PSIZE((lb_ptr_buf->lb_ptr)->lbp_prop);8743bytes_dropped += asize;8744ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize);8745ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count);8746zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize,8747lb_ptr_buf);8748(void) zfs_refcount_remove(&dev->l2ad_lb_count,8749lb_ptr_buf);8750kmem_free(lb_ptr_buf->lb_ptr,8751sizeof (l2arc_log_blkptr_t));8752kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t));8753}8754}8755list_destroy(&cb->l2wcb_abd_list);87568757if (zio->io_error != 0) {8758ARCSTAT_BUMP(arcstat_l2_writes_error);87598760/*8761* Restore the lbps array in the header to its previous state.8762* If the list of log block pointers is empty, zero out the8763* log block pointers in the device header.8764*/8765lb_ptr_buf = list_head(&dev->l2ad_lbptr_list);8766for (int i = 0; i < 2; i++) {8767if (lb_ptr_buf == NULL) {8768/*8769* If the list is empty zero out the device8770* header. 
Otherwise zero out the second log8771* block pointer in the header.8772*/8773if (i == 0) {8774memset(l2dhdr, 0,8775dev->l2ad_dev_hdr_asize);8776} else {8777memset(&l2dhdr->dh_start_lbps[i], 0,8778sizeof (l2arc_log_blkptr_t));8779}8780break;8781}8782memcpy(&l2dhdr->dh_start_lbps[i], lb_ptr_buf->lb_ptr,8783sizeof (l2arc_log_blkptr_t));8784lb_ptr_buf = list_next(&dev->l2ad_lbptr_list,8785lb_ptr_buf);8786}8787}87888789ARCSTAT_BUMP(arcstat_l2_writes_done);8790list_remove(buflist, head);8791ASSERT(!HDR_HAS_L1HDR(head));8792kmem_cache_free(hdr_l2only_cache, head);8793mutex_exit(&dev->l2ad_mtx);87948795ASSERT(dev->l2ad_vdev != NULL);8796vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0);87978798l2arc_do_free_on_write();87998800kmem_free(cb, sizeof (l2arc_write_callback_t));8801}88028803static int8804l2arc_untransform(zio_t *zio, l2arc_read_callback_t *cb)8805{8806int ret;8807spa_t *spa = zio->io_spa;8808arc_buf_hdr_t *hdr = cb->l2rcb_hdr;8809blkptr_t *bp = zio->io_bp;8810uint8_t salt[ZIO_DATA_SALT_LEN];8811uint8_t iv[ZIO_DATA_IV_LEN];8812uint8_t mac[ZIO_DATA_MAC_LEN];8813boolean_t no_crypt = B_FALSE;88148815/*8816* ZIL data is never be written to the L2ARC, so we don't need8817* special handling for its unique MAC storage.8818*/8819ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG);8820ASSERT(MUTEX_HELD(HDR_LOCK(hdr)));8821ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);88228823/*8824* If the data was encrypted, decrypt it now. Note that8825* we must check the bp here and not the hdr, since the8826* hdr does not have its encryption parameters updated8827* until arc_read_done().8828*/8829if (BP_IS_ENCRYPTED(bp)) {8830abd_t *eabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr,8831ARC_HDR_USE_RESERVE);88328833zio_crypt_decode_params_bp(bp, salt, iv);8834zio_crypt_decode_mac_bp(bp, mac);88358836ret = spa_do_crypt_abd(B_FALSE, spa, &cb->l2rcb_zb,8837BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp),8838salt, iv, mac, HDR_GET_PSIZE(hdr), eabd,8839hdr->b_l1hdr.b_pabd, &no_crypt);8840if (ret != 0) {8841arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr);8842goto error;8843}88448845/*8846* If we actually performed decryption, replace b_pabd8847* with the decrypted data. Otherwise we can just throw8848* our decryption buffer away.8849*/8850if (!no_crypt) {8851arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,8852arc_hdr_size(hdr), hdr);8853hdr->b_l1hdr.b_pabd = eabd;8854zio->io_abd = eabd;8855} else {8856arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr);8857}8858}88598860/*8861* If the L2ARC block was compressed, but ARC compression8862* is disabled we decompress the data into a new buffer and8863* replace the existing data.8864*/8865if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&8866!HDR_COMPRESSION_ENABLED(hdr)) {8867abd_t *cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr,8868ARC_HDR_USE_RESERVE);88698870ret = zio_decompress_data(HDR_GET_COMPRESS(hdr),8871hdr->b_l1hdr.b_pabd, cabd, HDR_GET_PSIZE(hdr),8872HDR_GET_LSIZE(hdr), &hdr->b_complevel);8873if (ret != 0) {8874arc_free_data_abd(hdr, cabd, arc_hdr_size(hdr), hdr);8875goto error;8876}88778878arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd,8879arc_hdr_size(hdr), hdr);8880hdr->b_l1hdr.b_pabd = cabd;8881zio->io_abd = cabd;8882zio->io_size = HDR_GET_LSIZE(hdr);8883}88848885return (0);88868887error:8888return (ret);8889}889088918892/*8893* A read to a cache device completed. 
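 * (If validation fails, the read is transparently reissued to the pool's
 * primary storage, so a bad or stale L2ARC copy degrades to an ordinary
 * pool read rather than an error for the caller.)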
Validate buffer contents before8894* handing over to the regular ARC routines.8895*/8896static void8897l2arc_read_done(zio_t *zio)8898{8899int tfm_error = 0;8900l2arc_read_callback_t *cb = zio->io_private;8901arc_buf_hdr_t *hdr;8902kmutex_t *hash_lock;8903boolean_t valid_cksum;8904boolean_t using_rdata = (BP_IS_ENCRYPTED(&cb->l2rcb_bp) &&8905(cb->l2rcb_flags & ZIO_FLAG_RAW_ENCRYPT));89068907ASSERT3P(zio->io_vd, !=, NULL);8908ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);89098910spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);89118912ASSERT3P(cb, !=, NULL);8913hdr = cb->l2rcb_hdr;8914ASSERT3P(hdr, !=, NULL);89158916hash_lock = HDR_LOCK(hdr);8917mutex_enter(hash_lock);8918ASSERT3P(hash_lock, ==, HDR_LOCK(hdr));89198920/*8921* If the data was read into a temporary buffer,8922* move it and free the buffer.8923*/8924if (cb->l2rcb_abd != NULL) {8925ASSERT3U(arc_hdr_size(hdr), <, zio->io_size);8926if (zio->io_error == 0) {8927if (using_rdata) {8928abd_copy(hdr->b_crypt_hdr.b_rabd,8929cb->l2rcb_abd, arc_hdr_size(hdr));8930} else {8931abd_copy(hdr->b_l1hdr.b_pabd,8932cb->l2rcb_abd, arc_hdr_size(hdr));8933}8934}89358936/*8937* The following must be done regardless of whether8938* there was an error:8939* - free the temporary buffer8940* - point zio to the real ARC buffer8941* - set zio size accordingly8942* These are required because zio is either re-used for8943* an I/O of the block in the case of the error8944* or the zio is passed to arc_read_done() and it8945* needs real data.8946*/8947abd_free(cb->l2rcb_abd);8948zio->io_size = zio->io_orig_size = arc_hdr_size(hdr);89498950if (using_rdata) {8951ASSERT(HDR_HAS_RABD(hdr));8952zio->io_abd = zio->io_orig_abd =8953hdr->b_crypt_hdr.b_rabd;8954} else {8955ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL);8956zio->io_abd = zio->io_orig_abd = hdr->b_l1hdr.b_pabd;8957}8958}89598960ASSERT3P(zio->io_abd, !=, NULL);89618962/*8963* Check this survived the L2ARC journey.8964*/8965ASSERT(zio->io_abd == hdr->b_l1hdr.b_pabd ||8966(HDR_HAS_RABD(hdr) && zio->io_abd == hdr->b_crypt_hdr.b_rabd));8967zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */8968zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */8969zio->io_prop.zp_complevel = hdr->b_complevel;89708971valid_cksum = arc_cksum_is_equal(hdr, zio);89728973/*8974* b_rabd will always match the data as it exists on disk if it is8975* being used. Therefore if we are reading into b_rabd we do not8976* attempt to untransform the data.8977*/8978if (valid_cksum && !using_rdata)8979tfm_error = l2arc_untransform(zio, cb);89808981if (valid_cksum && tfm_error == 0 && zio->io_error == 0 &&8982!HDR_L2_EVICTED(hdr)) {8983mutex_exit(hash_lock);8984zio->io_private = hdr;8985arc_read_done(zio);8986} else {8987/*8988* Buffer didn't survive caching. Increment stats and8989* reissue to the original storage device.8990*/8991if (zio->io_error != 0) {8992ARCSTAT_BUMP(arcstat_l2_io_error);8993} else {8994zio->io_error = SET_ERROR(EIO);8995}8996if (!valid_cksum || tfm_error != 0)8997ARCSTAT_BUMP(arcstat_l2_cksum_bad);89988999/*9000* If there's no waiter, issue an async i/o to the primary9001* storage now. 
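		 * The zio_read() below is issued against the original block
		 * pointer and still completes through arc_read_done(), so
		 * the original read is satisfied from the pool's main
		 * storage devices.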
If there *is* a waiter, the caller must9002* issue the i/o in a context where it's OK to block.9003*/9004if (zio->io_waiter == NULL) {9005zio_t *pio = zio_unique_parent(zio);9006void *abd = (using_rdata) ?9007hdr->b_crypt_hdr.b_rabd : hdr->b_l1hdr.b_pabd;90089009ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL);90109011zio = zio_read(pio, zio->io_spa, zio->io_bp,9012abd, zio->io_size, arc_read_done,9013hdr, zio->io_priority, cb->l2rcb_flags,9014&cb->l2rcb_zb);90159016/*9017* Original ZIO will be freed, so we need to update9018* ARC header with the new ZIO pointer to be used9019* by zio_change_priority() in arc_read().9020*/9021for (struct arc_callback *acb = hdr->b_l1hdr.b_acb;9022acb != NULL; acb = acb->acb_next)9023acb->acb_zio_head = zio;90249025mutex_exit(hash_lock);9026zio_nowait(zio);9027} else {9028mutex_exit(hash_lock);9029}9030}90319032kmem_free(cb, sizeof (l2arc_read_callback_t));9033}90349035/*9036* This is the list priority from which the L2ARC will search for pages to9037* cache. This is used within loops (0..3) to cycle through lists in the9038* desired order. This order can have a significant effect on cache9039* performance.9040*9041* Currently the metadata lists are hit first, MFU then MRU, followed by9042* the data lists. This function returns a locked list, and also returns9043* the lock pointer.9044*/9045static multilist_sublist_t *9046l2arc_sublist_lock(int list_num)9047{9048multilist_t *ml = NULL;9049unsigned int idx;90509051ASSERT(list_num >= 0 && list_num < L2ARC_FEED_TYPES);90529053switch (list_num) {9054case 0:9055ml = &arc_mfu->arcs_list[ARC_BUFC_METADATA];9056break;9057case 1:9058ml = &arc_mru->arcs_list[ARC_BUFC_METADATA];9059break;9060case 2:9061ml = &arc_mfu->arcs_list[ARC_BUFC_DATA];9062break;9063case 3:9064ml = &arc_mru->arcs_list[ARC_BUFC_DATA];9065break;9066default:9067return (NULL);9068}90699070/*9071* Return a randomly-selected sublist. This is acceptable9072* because the caller feeds only a little bit of data for each9073* call (8MB). Subsequent calls will result in different9074* sublists being selected.9075*/9076idx = multilist_get_random_index(ml);9077return (multilist_sublist_lock_idx(ml, idx));9078}90799080/*9081* Calculates the maximum overhead of L2ARC metadata log blocks for a given9082* L2ARC write size. l2arc_evict and l2arc_write_size need to include this9083* overhead in processing to make sure there is enough headroom available9084* when writing buffers.9085*/9086static inline uint64_t9087l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev)9088{9089if (dev->l2ad_log_entries == 0) {9090return (0);9091} else {9092ASSERT(dev->l2ad_vdev != NULL);90939094uint64_t log_entries = write_sz >> SPA_MINBLOCKSHIFT;90959096uint64_t log_blocks = (log_entries +9097dev->l2ad_log_entries - 1) /9098dev->l2ad_log_entries;90999100return (vdev_psize_to_asize(dev->l2ad_vdev,9101sizeof (l2arc_log_blk_phys_t)) * log_blocks);9102}9103}91049105/*9106* Evict buffers from the device write hand to the distance specified in9107* bytes. 
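 * Roughly speaking, the region cleared is [l2ad_hand, l2ad_hand + distance),
 * capped at l2ad_end; when that cap is hit the hands wrap back to l2ad_start
 * and another pass is made.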
This distance may span populated buffers, it may span nothing.9108* This is clearing a region on the L2ARC device ready for writing.9109* If the 'all' boolean is set, every buffer is evicted.9110*/9111static void9112l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)9113{9114list_t *buflist;9115arc_buf_hdr_t *hdr, *hdr_prev;9116kmutex_t *hash_lock;9117uint64_t taddr;9118l2arc_lb_ptr_buf_t *lb_ptr_buf, *lb_ptr_buf_prev;9119vdev_t *vd = dev->l2ad_vdev;9120boolean_t rerun;91219122ASSERT(vd != NULL || all);9123ASSERT(dev->l2ad_spa != NULL || all);91249125buflist = &dev->l2ad_buflist;91269127top:9128rerun = B_FALSE;9129if (dev->l2ad_hand + distance > dev->l2ad_end) {9130/*9131* When there is no space to accommodate upcoming writes,9132* evict to the end. Then bump the write and evict hands9133* to the start and iterate. This iteration does not9134* happen indefinitely as we make sure in9135* l2arc_write_size() that when the write hand is reset,9136* the write size does not exceed the end of the device.9137*/9138rerun = B_TRUE;9139taddr = dev->l2ad_end;9140} else {9141taddr = dev->l2ad_hand + distance;9142}9143DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,9144uint64_t, taddr, boolean_t, all);91459146if (!all) {9147/*9148* This check has to be placed after deciding whether to9149* iterate (rerun).9150*/9151if (dev->l2ad_first) {9152/*9153* This is the first sweep through the device. There is9154* nothing to evict. We have already trimmed the9155* whole device.9156*/9157goto out;9158} else {9159/*9160* Trim the space to be evicted.9161*/9162if (vd->vdev_has_trim && dev->l2ad_evict < taddr &&9163l2arc_trim_ahead > 0) {9164/*9165* We have to drop the spa_config lock because9166* vdev_trim_range() will acquire it.9167* l2ad_evict already accounts for the label9168* size. To prevent vdev_trim_ranges() from9169* adding it again, we subtract it from9170* l2ad_evict.9171*/9172spa_config_exit(dev->l2ad_spa, SCL_L2ARC, dev);9173vdev_trim_simple(vd,9174dev->l2ad_evict - VDEV_LABEL_START_SIZE,9175taddr - dev->l2ad_evict);9176spa_config_enter(dev->l2ad_spa, SCL_L2ARC, dev,9177RW_READER);9178}91799180/*9181* When rebuilding L2ARC we retrieve the evict hand9182* from the header of the device. Of note, l2arc_evict()9183* does not actually delete buffers from the cache9184* device, but trimming may do so depending on the9185* hardware implementation. Thus keeping track of the9186* evict hand is useful.9187*/9188dev->l2ad_evict = MAX(dev->l2ad_evict, taddr);9189}9190}91919192retry:9193mutex_enter(&dev->l2ad_mtx);9194/*9195* We have to account for evicted log blocks. 
Run vdev_space_update()9196* on log blocks whose offset (in bytes) is before the evicted offset9197* (in bytes) by searching in the list of pointers to log blocks9198* present in the L2ARC device.9199*/9200for (lb_ptr_buf = list_tail(&dev->l2ad_lbptr_list); lb_ptr_buf;9201lb_ptr_buf = lb_ptr_buf_prev) {92029203lb_ptr_buf_prev = list_prev(&dev->l2ad_lbptr_list, lb_ptr_buf);92049205/* L2BLK_GET_PSIZE returns aligned size for log blocks */9206uint64_t asize = L2BLK_GET_PSIZE(9207(lb_ptr_buf->lb_ptr)->lbp_prop);92089209/*9210* We don't worry about log blocks left behind (ie9211* lbp_payload_start < l2ad_hand) because l2arc_write_buffers()9212* will never write more than l2arc_evict() evicts.9213*/9214if (!all && l2arc_log_blkptr_valid(dev, lb_ptr_buf->lb_ptr)) {9215break;9216} else {9217if (vd != NULL)9218vdev_space_update(vd, -asize, 0, 0);9219ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize);9220ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count);9221zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize,9222lb_ptr_buf);9223(void) zfs_refcount_remove(&dev->l2ad_lb_count,9224lb_ptr_buf);9225list_remove(&dev->l2ad_lbptr_list, lb_ptr_buf);9226kmem_free(lb_ptr_buf->lb_ptr,9227sizeof (l2arc_log_blkptr_t));9228kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t));9229}9230}92319232for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) {9233hdr_prev = list_prev(buflist, hdr);92349235ASSERT(!HDR_EMPTY(hdr));9236hash_lock = HDR_LOCK(hdr);92379238/*9239* We cannot use mutex_enter or else we can deadlock9240* with l2arc_write_buffers (due to swapping the order9241* the hash lock and l2ad_mtx are taken).9242*/9243if (!mutex_tryenter(hash_lock)) {9244/*9245* Missed the hash lock. Retry.9246*/9247ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);9248mutex_exit(&dev->l2ad_mtx);9249mutex_enter(hash_lock);9250mutex_exit(hash_lock);9251goto retry;9252}92539254/*9255* A header can't be on this list if it doesn't have L2 header.9256*/9257ASSERT(HDR_HAS_L2HDR(hdr));92589259/* Ensure this header has finished being written. */9260ASSERT(!HDR_L2_WRITING(hdr));9261ASSERT(!HDR_L2_WRITE_HEAD(hdr));92629263if (!all && (hdr->b_l2hdr.b_daddr >= dev->l2ad_evict ||9264hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) {9265/*9266* We've evicted to the target address,9267* or the end of the device.9268*/9269mutex_exit(hash_lock);9270break;9271}92729273if (!HDR_HAS_L1HDR(hdr)) {9274ASSERT(!HDR_L2_READING(hdr));9275/*9276* This doesn't exist in the ARC. Destroy.9277* arc_hdr_destroy() will call list_remove()9278* and decrement arcstat_l2_lsize.9279*/9280arc_change_state(arc_anon, hdr);9281arc_hdr_destroy(hdr);9282} else {9283ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only);9284ARCSTAT_BUMP(arcstat_l2_evict_l1cached);9285/*9286* Invalidate issued or about to be issued9287* reads, since we may be about to write9288* over this location.9289*/9290if (HDR_L2_READING(hdr)) {9291ARCSTAT_BUMP(arcstat_l2_evict_reading);9292arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED);9293}92949295arc_hdr_l2hdr_destroy(hdr);9296}9297mutex_exit(hash_lock);9298}9299mutex_exit(&dev->l2ad_mtx);93009301out:9302/*9303* We need to check if we evict all buffers, otherwise we may iterate9304* unnecessarily.9305*/9306if (!all && rerun) {9307/*9308* Bump device hand to the device start if it is approaching the9309* end. 
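		 * The cache device is filled in a circular fashion, so both
		 * the write hand and the evict hand wrap back to l2ad_start
		 * here, and l2ad_first is cleared since this is no longer
		 * the first pass over the device.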
l2arc_evict() has already evicted ahead for this case.9310*/9311dev->l2ad_hand = dev->l2ad_start;9312dev->l2ad_evict = dev->l2ad_start;9313dev->l2ad_first = B_FALSE;9314goto top;9315}93169317if (!all) {9318/*9319* In case of cache device removal (all) the following9320* assertions may be violated without functional consequences9321* as the device is about to be removed.9322*/9323ASSERT3U(dev->l2ad_hand + distance, <=, dev->l2ad_end);9324if (!dev->l2ad_first)9325ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict);9326}9327}93289329/*9330* Handle any abd transforms that might be required for writing to the L2ARC.9331* If successful, this function will always return an abd with the data9332* transformed as it is on disk in a new abd of asize bytes.9333*/9334static int9335l2arc_apply_transforms(spa_t *spa, arc_buf_hdr_t *hdr, uint64_t asize,9336abd_t **abd_out)9337{9338int ret;9339abd_t *cabd = NULL, *eabd = NULL, *to_write = hdr->b_l1hdr.b_pabd;9340enum zio_compress compress = HDR_GET_COMPRESS(hdr);9341uint64_t psize = HDR_GET_PSIZE(hdr);9342uint64_t size = arc_hdr_size(hdr);9343boolean_t ismd = HDR_ISTYPE_METADATA(hdr);9344boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS);9345dsl_crypto_key_t *dck = NULL;9346uint8_t mac[ZIO_DATA_MAC_LEN] = { 0 };9347boolean_t no_crypt = B_FALSE;93489349ASSERT((HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF &&9350!HDR_COMPRESSION_ENABLED(hdr)) ||9351HDR_ENCRYPTED(hdr) || HDR_SHARED_DATA(hdr) || psize != asize);9352ASSERT3U(psize, <=, asize);93539354/*9355* If this data simply needs its own buffer, we simply allocate it9356* and copy the data. This may be done to eliminate a dependency on a9357* shared buffer or to reallocate the buffer to match asize.9358*/9359if (HDR_HAS_RABD(hdr)) {9360ASSERT3U(asize, >, psize);9361to_write = abd_alloc_for_io(asize, ismd);9362abd_copy(to_write, hdr->b_crypt_hdr.b_rabd, psize);9363abd_zero_off(to_write, psize, asize - psize);9364goto out;9365}93669367if ((compress == ZIO_COMPRESS_OFF || HDR_COMPRESSION_ENABLED(hdr)) &&9368!HDR_ENCRYPTED(hdr)) {9369ASSERT3U(size, ==, psize);9370to_write = abd_alloc_for_io(asize, ismd);9371abd_copy(to_write, hdr->b_l1hdr.b_pabd, size);9372if (asize > size)9373abd_zero_off(to_write, size, asize - size);9374goto out;9375}93769377if (compress != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) {9378cabd = abd_alloc_for_io(MAX(size, asize), ismd);9379uint64_t csize = zio_compress_data(compress, to_write, &cabd,9380size, MIN(size, psize), hdr->b_complevel);9381if (csize >= size || csize > psize) {9382/*9383* We can't re-compress the block into the original9384* psize. Even if it fits into asize, it does not9385* matter, since checksum will never match on read.9386*/9387abd_free(cabd);9388return (SET_ERROR(EIO));9389}9390if (asize > csize)9391abd_zero_off(cabd, csize, asize - csize);9392to_write = cabd;9393}93949395if (HDR_ENCRYPTED(hdr)) {9396eabd = abd_alloc_for_io(asize, ismd);93979398/*9399* If the dataset was disowned before the buffer9400* made it to this point, the key to re-encrypt9401* it won't be available. 
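		 * The spa_keystore_lookup_key() call below is what detects
		 * this case; its error is passed back to
		 * l2arc_write_buffers(), which then skips the buffer.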
In this case we simply9402* won't write the buffer to the L2ARC.9403*/9404ret = spa_keystore_lookup_key(spa, hdr->b_crypt_hdr.b_dsobj,9405FTAG, &dck);9406if (ret != 0)9407goto error;94089409ret = zio_do_crypt_abd(B_TRUE, &dck->dck_key,9410hdr->b_crypt_hdr.b_ot, bswap, hdr->b_crypt_hdr.b_salt,9411hdr->b_crypt_hdr.b_iv, mac, psize, to_write, eabd,9412&no_crypt);9413if (ret != 0)9414goto error;94159416if (no_crypt)9417abd_copy(eabd, to_write, psize);94189419if (psize != asize)9420abd_zero_off(eabd, psize, asize - psize);94219422/* assert that the MAC we got here matches the one we saved */9423ASSERT0(memcmp(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN));9424spa_keystore_dsl_key_rele(spa, dck, FTAG);94259426if (to_write == cabd)9427abd_free(cabd);94289429to_write = eabd;9430}94319432out:9433ASSERT3P(to_write, !=, hdr->b_l1hdr.b_pabd);9434*abd_out = to_write;9435return (0);94369437error:9438if (dck != NULL)9439spa_keystore_dsl_key_rele(spa, dck, FTAG);9440if (cabd != NULL)9441abd_free(cabd);9442if (eabd != NULL)9443abd_free(eabd);94449445*abd_out = NULL;9446return (ret);9447}94489449static void9450l2arc_blk_fetch_done(zio_t *zio)9451{9452l2arc_read_callback_t *cb;94539454cb = zio->io_private;9455if (cb->l2rcb_abd != NULL)9456abd_free(cb->l2rcb_abd);9457kmem_free(cb, sizeof (l2arc_read_callback_t));9458}94599460/*9461* Find and write ARC buffers to the L2ARC device.9462*9463* An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid9464* for reading until they have completed writing.9465* The headroom_boost is an in-out parameter used to maintain headroom boost9466* state between calls to this function.9467*9468* Returns the number of bytes actually written (which may be smaller than9469* the delta by which the device hand has changed due to alignment and the9470* writing of log blocks).9471*/9472static uint64_t9473l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)9474{9475arc_buf_hdr_t *hdr, *head, *marker;9476uint64_t write_asize, write_psize, headroom;9477boolean_t full, from_head = !arc_warm;9478l2arc_write_callback_t *cb = NULL;9479zio_t *pio, *wzio;9480uint64_t guid = spa_load_guid(spa);9481l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;94829483ASSERT3P(dev->l2ad_vdev, !=, NULL);94849485pio = NULL;9486write_asize = write_psize = 0;9487full = B_FALSE;9488head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE);9489arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR);9490marker = arc_state_alloc_marker();94919492/*9493* Copy buffers for L2ARC writing.9494*/9495for (int pass = 0; pass < L2ARC_FEED_TYPES; pass++) {9496/*9497* pass == 0: MFU meta9498* pass == 1: MRU meta9499* pass == 2: MFU data9500* pass == 3: MRU data9501*/9502if (l2arc_mfuonly == 1) {9503if (pass == 1 || pass == 3)9504continue;9505} else if (l2arc_mfuonly > 1) {9506if (pass == 3)9507continue;9508}95099510uint64_t passed_sz = 0;9511headroom = target_sz * l2arc_headroom;9512if (zfs_compressed_arc_enabled)9513headroom = (headroom * l2arc_headroom_boost) / 100;95149515/*9516* Until the ARC is warm and starts to evict, read from the9517* head of the ARC lists rather than the tail.9518*/9519multilist_sublist_t *mls = l2arc_sublist_lock(pass);9520ASSERT3P(mls, !=, NULL);9521if (from_head)9522hdr = multilist_sublist_head(mls);9523else9524hdr = multilist_sublist_tail(mls);95259526while (hdr != NULL) {9527kmutex_t *hash_lock;9528abd_t *to_write = NULL;95299530hash_lock = HDR_LOCK(hdr);9531if (!mutex_tryenter(hash_lock)) {9532skip:9533/* Skip this buffer rather than waiting. 
*/9534if (from_head)9535hdr = multilist_sublist_next(mls, hdr);9536else9537hdr = multilist_sublist_prev(mls, hdr);9538continue;9539}95409541passed_sz += HDR_GET_LSIZE(hdr);9542if (l2arc_headroom != 0 && passed_sz > headroom) {9543/*9544* Searched too far.9545*/9546mutex_exit(hash_lock);9547break;9548}95499550if (!l2arc_write_eligible(guid, hdr)) {9551mutex_exit(hash_lock);9552goto skip;9553}95549555ASSERT(HDR_HAS_L1HDR(hdr));9556ASSERT3U(HDR_GET_PSIZE(hdr), >, 0);9557ASSERT3U(arc_hdr_size(hdr), >, 0);9558ASSERT(hdr->b_l1hdr.b_pabd != NULL ||9559HDR_HAS_RABD(hdr));9560uint64_t psize = HDR_GET_PSIZE(hdr);9561uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,9562psize);95639564/*9565* If the allocated size of this buffer plus the max9566* size for the pending log block exceeds the evicted9567* target size, terminate writing buffers for this run.9568*/9569if (write_asize + asize +9570sizeof (l2arc_log_blk_phys_t) > target_sz) {9571full = B_TRUE;9572mutex_exit(hash_lock);9573break;9574}95759576/*9577* We should not sleep with sublist lock held or it9578* may block ARC eviction. Insert a marker to save9579* the position and drop the lock.9580*/9581if (from_head) {9582multilist_sublist_insert_after(mls, hdr,9583marker);9584} else {9585multilist_sublist_insert_before(mls, hdr,9586marker);9587}9588multilist_sublist_unlock(mls);95899590/*9591* If this header has b_rabd, we can use this since it9592* must always match the data exactly as it exists on9593* disk. Otherwise, the L2ARC can normally use the9594* hdr's data, but if we're sharing data between the9595* hdr and one of its bufs, L2ARC needs its own copy of9596* the data so that the ZIO below can't race with the9597* buf consumer. To ensure that this copy will be9598* available for the lifetime of the ZIO and be cleaned9599* up afterwards, we add it to the l2arc_free_on_write9600* queue. 
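			 * That queue is drained by l2arc_do_free_on_write()
			 * once the write zio has completed.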
If we need to apply any transforms to the9601* data (compression, encryption) we will also need the9602* extra buffer.9603*/9604if (HDR_HAS_RABD(hdr) && psize == asize) {9605to_write = hdr->b_crypt_hdr.b_rabd;9606} else if ((HDR_COMPRESSION_ENABLED(hdr) ||9607HDR_GET_COMPRESS(hdr) == ZIO_COMPRESS_OFF) &&9608!HDR_ENCRYPTED(hdr) && !HDR_SHARED_DATA(hdr) &&9609psize == asize) {9610to_write = hdr->b_l1hdr.b_pabd;9611} else {9612int ret;9613arc_buf_contents_t type = arc_buf_type(hdr);96149615ret = l2arc_apply_transforms(spa, hdr, asize,9616&to_write);9617if (ret != 0) {9618arc_hdr_clear_flags(hdr,9619ARC_FLAG_L2CACHE);9620mutex_exit(hash_lock);9621goto next;9622}96239624l2arc_free_abd_on_write(to_write, asize, type);9625}96269627hdr->b_l2hdr.b_dev = dev;9628hdr->b_l2hdr.b_daddr = dev->l2ad_hand;9629hdr->b_l2hdr.b_hits = 0;9630hdr->b_l2hdr.b_arcs_state =9631hdr->b_l1hdr.b_state->arcs_state;9632/* l2arc_hdr_arcstats_update() expects a valid asize */9633HDR_SET_L2SIZE(hdr, asize);9634arc_hdr_set_flags(hdr, ARC_FLAG_HAS_L2HDR |9635ARC_FLAG_L2_WRITING);96369637(void) zfs_refcount_add_many(&dev->l2ad_alloc,9638arc_hdr_size(hdr), hdr);9639l2arc_hdr_arcstats_increment(hdr);9640vdev_space_update(dev->l2ad_vdev, asize, 0, 0);96419642mutex_enter(&dev->l2ad_mtx);9643if (pio == NULL) {9644/*9645* Insert a dummy header on the buflist so9646* l2arc_write_done() can find where the9647* write buffers begin without searching.9648*/9649list_insert_head(&dev->l2ad_buflist, head);9650}9651list_insert_head(&dev->l2ad_buflist, hdr);9652mutex_exit(&dev->l2ad_mtx);96539654boolean_t commit = l2arc_log_blk_insert(dev, hdr);9655mutex_exit(hash_lock);96569657if (pio == NULL) {9658cb = kmem_alloc(9659sizeof (l2arc_write_callback_t), KM_SLEEP);9660cb->l2wcb_dev = dev;9661cb->l2wcb_head = head;9662list_create(&cb->l2wcb_abd_list,9663sizeof (l2arc_lb_abd_buf_t),9664offsetof(l2arc_lb_abd_buf_t, node));9665pio = zio_root(spa, l2arc_write_done, cb,9666ZIO_FLAG_CANFAIL);9667}96689669wzio = zio_write_phys(pio, dev->l2ad_vdev,9670dev->l2ad_hand, asize, to_write,9671ZIO_CHECKSUM_OFF, NULL, hdr,9672ZIO_PRIORITY_ASYNC_WRITE,9673ZIO_FLAG_CANFAIL, B_FALSE);96749675DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,9676zio_t *, wzio);9677zio_nowait(wzio);96789679write_psize += psize;9680write_asize += asize;9681dev->l2ad_hand += asize;96829683if (commit) {9684/* l2ad_hand will be adjusted inside. */9685write_asize +=9686l2arc_log_blk_commit(dev, pio, cb);9687}96889689next:9690multilist_sublist_lock(mls);9691if (from_head)9692hdr = multilist_sublist_next(mls, marker);9693else9694hdr = multilist_sublist_prev(mls, marker);9695multilist_sublist_remove(mls, marker);9696}96979698multilist_sublist_unlock(mls);96999700if (full == B_TRUE)9701break;9702}97039704arc_state_free_marker(marker);97059706/* No buffers selected for writing? 
*/9707if (pio == NULL) {9708ASSERT0(write_psize);9709ASSERT(!HDR_HAS_L1HDR(head));9710kmem_cache_free(hdr_l2only_cache, head);97119712/*9713* Although we did not write any buffers l2ad_evict may9714* have advanced.9715*/9716if (dev->l2ad_evict != l2dhdr->dh_evict)9717l2arc_dev_hdr_update(dev);97189719return (0);9720}97219722if (!dev->l2ad_first)9723ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict);97249725ASSERT3U(write_asize, <=, target_sz);9726ARCSTAT_BUMP(arcstat_l2_writes_sent);9727ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize);97289729dev->l2ad_writing = B_TRUE;9730(void) zio_wait(pio);9731dev->l2ad_writing = B_FALSE;97329733/*9734* Update the device header after the zio completes as9735* l2arc_write_done() may have updated the memory holding the log block9736* pointers in the device header.9737*/9738l2arc_dev_hdr_update(dev);97399740return (write_asize);9741}97429743static boolean_t9744l2arc_hdr_limit_reached(void)9745{9746int64_t s = aggsum_upper_bound(&arc_sums.arcstat_l2_hdr_size);97479748return (arc_reclaim_needed() ||9749(s > (arc_warm ? arc_c : arc_c_max) * l2arc_meta_percent / 100));9750}97519752/*9753* This thread feeds the L2ARC at regular intervals. This is the beating9754* heart of the L2ARC.9755*/9756static __attribute__((noreturn)) void9757l2arc_feed_thread(void *unused)9758{9759(void) unused;9760callb_cpr_t cpr;9761l2arc_dev_t *dev;9762spa_t *spa;9763uint64_t size, wrote;9764clock_t begin, next = ddi_get_lbolt();9765fstrans_cookie_t cookie;97669767CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);97689769mutex_enter(&l2arc_feed_thr_lock);97709771cookie = spl_fstrans_mark();9772while (l2arc_thread_exit == 0) {9773CALLB_CPR_SAFE_BEGIN(&cpr);9774(void) cv_timedwait_idle(&l2arc_feed_thr_cv,9775&l2arc_feed_thr_lock, next);9776CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);9777next = ddi_get_lbolt() + hz;97789779/*9780* Quick check for L2ARC devices.9781*/9782mutex_enter(&l2arc_dev_mtx);9783if (l2arc_ndev == 0) {9784mutex_exit(&l2arc_dev_mtx);9785continue;9786}9787mutex_exit(&l2arc_dev_mtx);9788begin = ddi_get_lbolt();97899790/*9791* This selects the next l2arc device to write to, and in9792* doing so the next spa to feed from: dev->l2ad_spa. This9793* will return NULL if there are now no l2arc devices or if9794* they are all faulted.9795*9796* If a device is returned, its spa's config lock is also9797* held to prevent device removal. 
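		 * The config lock is dropped again with spa_config_exit()
		 * at the end of each pass through this loop.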
l2arc_dev_get_next()9798* will grab and release l2arc_dev_mtx.9799*/9800if ((dev = l2arc_dev_get_next()) == NULL)9801continue;98029803spa = dev->l2ad_spa;9804ASSERT3P(spa, !=, NULL);98059806/*9807* If the pool is read-only then force the feed thread to9808* sleep a little longer.9809*/9810if (!spa_writeable(spa)) {9811next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz;9812spa_config_exit(spa, SCL_L2ARC, dev);9813continue;9814}98159816/*9817* Avoid contributing to memory pressure.9818*/9819if (l2arc_hdr_limit_reached()) {9820ARCSTAT_BUMP(arcstat_l2_abort_lowmem);9821spa_config_exit(spa, SCL_L2ARC, dev);9822continue;9823}98249825ARCSTAT_BUMP(arcstat_l2_feeds);98269827size = l2arc_write_size(dev);98289829/*9830* Evict L2ARC buffers that will be overwritten.9831*/9832l2arc_evict(dev, size, B_FALSE);98339834/*9835* Write ARC buffers.9836*/9837wrote = l2arc_write_buffers(spa, dev, size);98389839/*9840* Calculate interval between writes.9841*/9842next = l2arc_write_interval(begin, size, wrote);9843spa_config_exit(spa, SCL_L2ARC, dev);9844}9845spl_fstrans_unmark(cookie);98469847l2arc_thread_exit = 0;9848cv_broadcast(&l2arc_feed_thr_cv);9849CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */9850thread_exit();9851}98529853boolean_t9854l2arc_vdev_present(vdev_t *vd)9855{9856return (l2arc_vdev_get(vd) != NULL);9857}98589859/*9860* Returns the l2arc_dev_t associated with a particular vdev_t or NULL if9861* the vdev_t isn't an L2ARC device.9862*/9863l2arc_dev_t *9864l2arc_vdev_get(vdev_t *vd)9865{9866l2arc_dev_t *dev;98679868mutex_enter(&l2arc_dev_mtx);9869for (dev = list_head(l2arc_dev_list); dev != NULL;9870dev = list_next(l2arc_dev_list, dev)) {9871if (dev->l2ad_vdev == vd)9872break;9873}9874mutex_exit(&l2arc_dev_mtx);98759876return (dev);9877}98789879static void9880l2arc_rebuild_dev(l2arc_dev_t *dev, boolean_t reopen)9881{9882l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;9883uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;9884spa_t *spa = dev->l2ad_spa;98859886/*9887* After a l2arc_remove_vdev(), the spa_t will no longer be valid9888*/9889if (spa == NULL)9890return;98919892/*9893* The L2ARC has to hold at least the payload of one log block for9894* them to be restored (persistent L2ARC). The payload of a log block9895* depends on the amount of its log entries. We always write log blocks9896* with 1022 entries. How many of them are committed or restored depends9897* on the size of the L2ARC device. Thus the maximum payload of9898* one log block is 1022 * SPA_MAXBLOCKSIZE = 16GB. 
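	 * (As an illustration with assumed numbers: a 4 GiB cache device
	 * would be clamped by the MIN() below to roughly
	 * 4 GiB >> SPA_MAXBLOCKSHIFT = 256 log entries per block.)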
	 * If the L2ARC device is less than that, we reduce the number of
	 * committed and restored log entries per block so as to enable
	 * persistence.
	 */
	if (dev->l2ad_end < l2arc_rebuild_blocks_min_l2size) {
		dev->l2ad_log_entries = 0;
	} else {
		dev->l2ad_log_entries = MIN((dev->l2ad_end -
		    dev->l2ad_start) >> SPA_MAXBLOCKSHIFT,
		    L2ARC_LOG_BLK_MAX_ENTRIES);
	}

	/*
	 * Read the device header. If an error is returned, do not rebuild
	 * L2ARC.
	 */
	if (l2arc_dev_hdr_read(dev) == 0 && dev->l2ad_log_entries > 0) {
		/*
		 * If we are onlining a cache device (vdev_reopen) that was
		 * still present (l2arc_vdev_present()) and rebuild is enabled,
		 * we should evict all ARC buffers and pointers to log blocks
		 * and reclaim their space before restoring its contents to
		 * L2ARC.
		 */
		if (reopen) {
			if (!l2arc_rebuild_enabled) {
				return;
			} else {
				l2arc_evict(dev, 0, B_TRUE);
				/* start a new log block */
				dev->l2ad_log_ent_idx = 0;
				dev->l2ad_log_blk_payload_asize = 0;
				dev->l2ad_log_blk_payload_start = 0;
			}
		}
		/*
		 * Just mark the device as pending for a rebuild. We won't
		 * be starting a rebuild in line here as it would block pool
		 * import. Instead spa_load_impl will hand that off to an
		 * async task which will call l2arc_spa_rebuild_start.
		 */
		dev->l2ad_rebuild = B_TRUE;
	} else if (spa_writeable(spa)) {
		/*
		 * In this case TRIM the whole device if l2arc_trim_ahead > 0,
		 * otherwise create a new header. We zero out the memory
		 * holding the header to reset dh_start_lbps. If we TRIM the
		 * whole device the new header will be written by
		 * vdev_trim_l2arc_thread() at the end of the TRIM to update
		 * the trim_state in the header too. When reading the header,
		 * if trim_state is not VDEV_TRIM_COMPLETE and
		 * l2arc_trim_ahead > 0 we opt to TRIM the whole device again.
		 */
		if (l2arc_trim_ahead > 0) {
			dev->l2ad_trim_all = B_TRUE;
		} else {
			memset(l2dhdr, 0, l2dhdr_asize);
			l2arc_dev_hdr_update(dev);
		}
	}
}

/*
 * Add a vdev for use by the L2ARC.
By this point the spa has already9961* validated the vdev and opened it.9962*/9963void9964l2arc_add_vdev(spa_t *spa, vdev_t *vd)9965{9966l2arc_dev_t *adddev;9967uint64_t l2dhdr_asize;99689969ASSERT(!l2arc_vdev_present(vd));99709971/*9972* Create a new l2arc device entry.9973*/9974adddev = vmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);9975adddev->l2ad_spa = spa;9976adddev->l2ad_vdev = vd;9977/* leave extra size for an l2arc device header */9978l2dhdr_asize = adddev->l2ad_dev_hdr_asize =9979MAX(sizeof (*adddev->l2ad_dev_hdr), 1 << vd->vdev_ashift);9980adddev->l2ad_start = VDEV_LABEL_START_SIZE + l2dhdr_asize;9981adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd);9982ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end);9983adddev->l2ad_hand = adddev->l2ad_start;9984adddev->l2ad_evict = adddev->l2ad_start;9985adddev->l2ad_first = B_TRUE;9986adddev->l2ad_writing = B_FALSE;9987adddev->l2ad_trim_all = B_FALSE;9988list_link_init(&adddev->l2ad_node);9989adddev->l2ad_dev_hdr = kmem_zalloc(l2dhdr_asize, KM_SLEEP);99909991mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL);9992/*9993* This is a list of all ARC buffers that are still valid on the9994* device.9995*/9996list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),9997offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node));99989999/*10000* This is a list of pointers to log blocks that are still present10001* on the device.10002*/10003list_create(&adddev->l2ad_lbptr_list, sizeof (l2arc_lb_ptr_buf_t),10004offsetof(l2arc_lb_ptr_buf_t, node));1000510006vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand);10007zfs_refcount_create(&adddev->l2ad_alloc);10008zfs_refcount_create(&adddev->l2ad_lb_asize);10009zfs_refcount_create(&adddev->l2ad_lb_count);1001010011/*10012* Decide if dev is eligible for L2ARC rebuild or whole device10013* trimming. This has to happen before the device is added in the10014* cache device list and l2arc_dev_mtx is released. Otherwise10015* l2arc_feed_thread() might already start writing on the10016* device.10017*/10018l2arc_rebuild_dev(adddev, B_FALSE);1001910020/*10021* Add device to global list10022*/10023mutex_enter(&l2arc_dev_mtx);10024list_insert_head(l2arc_dev_list, adddev);10025atomic_inc_64(&l2arc_ndev);10026mutex_exit(&l2arc_dev_mtx);10027}1002810029/*10030* Decide if a vdev is eligible for L2ARC rebuild, called from vdev_reopen()10031* in case of onlining a cache device.10032*/10033void10034l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen)10035{10036l2arc_dev_t *dev = NULL;1003710038dev = l2arc_vdev_get(vd);10039ASSERT3P(dev, !=, NULL);1004010041/*10042* In contrast to l2arc_add_vdev() we do not have to worry about10043* l2arc_feed_thread() invalidating previous content when onlining a10044* cache device. The device parameters (l2ad*) are not cleared when10045* offlining the device and writing new buffers will not invalidate10046* all previous content. 
In worst case only buffers that have not had10047* their log block written to the device will be lost.10048* When onlining the cache device (ie offline->online without exporting10049* the pool in between) this happens:10050* vdev_reopen() -> vdev_open() -> l2arc_rebuild_vdev()10051* | |10052* vdev_is_dead() = B_FALSE l2ad_rebuild = B_TRUE10053* During the time where vdev_is_dead = B_FALSE and until l2ad_rebuild10054* is set to B_TRUE we might write additional buffers to the device.10055*/10056l2arc_rebuild_dev(dev, reopen);10057}1005810059typedef struct {10060l2arc_dev_t *rva_l2arc_dev;10061uint64_t rva_spa_gid;10062uint64_t rva_vdev_gid;10063boolean_t rva_async;1006410065} remove_vdev_args_t;1006610067static void10068l2arc_device_teardown(void *arg)10069{10070remove_vdev_args_t *rva = arg;10071l2arc_dev_t *remdev = rva->rva_l2arc_dev;10072hrtime_t start_time = gethrtime();1007310074/*10075* Clear all buflists and ARC references. L2ARC device flush.10076*/10077l2arc_evict(remdev, 0, B_TRUE);10078list_destroy(&remdev->l2ad_buflist);10079ASSERT(list_is_empty(&remdev->l2ad_lbptr_list));10080list_destroy(&remdev->l2ad_lbptr_list);10081mutex_destroy(&remdev->l2ad_mtx);10082zfs_refcount_destroy(&remdev->l2ad_alloc);10083zfs_refcount_destroy(&remdev->l2ad_lb_asize);10084zfs_refcount_destroy(&remdev->l2ad_lb_count);10085kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize);10086vmem_free(remdev, sizeof (l2arc_dev_t));1008710088uint64_t elapsed = NSEC2MSEC(gethrtime() - start_time);10089if (elapsed > 0) {10090zfs_dbgmsg("spa %llu, vdev %llu removed in %llu ms",10091(u_longlong_t)rva->rva_spa_gid,10092(u_longlong_t)rva->rva_vdev_gid,10093(u_longlong_t)elapsed);10094}1009510096if (rva->rva_async)10097arc_async_flush_remove(rva->rva_spa_gid, 2);10098kmem_free(rva, sizeof (remove_vdev_args_t));10099}1010010101/*10102* Remove a vdev from the L2ARC.10103*/10104void10105l2arc_remove_vdev(vdev_t *vd)10106{10107spa_t *spa = vd->vdev_spa;10108boolean_t asynchronous = spa->spa_state == POOL_STATE_EXPORTED ||10109spa->spa_state == POOL_STATE_DESTROYED;1011010111/*10112* Find the device by vdev10113*/10114l2arc_dev_t *remdev = l2arc_vdev_get(vd);10115ASSERT3P(remdev, !=, NULL);1011610117/*10118* Save info for final teardown10119*/10120remove_vdev_args_t *rva = kmem_alloc(sizeof (remove_vdev_args_t),10121KM_SLEEP);10122rva->rva_l2arc_dev = remdev;10123rva->rva_spa_gid = spa_load_guid(spa);10124rva->rva_vdev_gid = remdev->l2ad_vdev->vdev_guid;1012510126/*10127* Cancel any ongoing or scheduled rebuild.10128*/10129mutex_enter(&l2arc_rebuild_thr_lock);10130remdev->l2ad_rebuild_cancel = B_TRUE;10131if (remdev->l2ad_rebuild_began == B_TRUE) {10132while (remdev->l2ad_rebuild == B_TRUE)10133cv_wait(&l2arc_rebuild_thr_cv, &l2arc_rebuild_thr_lock);10134}10135mutex_exit(&l2arc_rebuild_thr_lock);10136rva->rva_async = asynchronous;1013710138/*10139* Remove device from global list10140*/10141ASSERT(spa_config_held(spa, SCL_L2ARC, RW_WRITER) & SCL_L2ARC);10142mutex_enter(&l2arc_dev_mtx);10143list_remove(l2arc_dev_list, remdev);10144l2arc_dev_last = NULL; /* may have been invalidated */10145atomic_dec_64(&l2arc_ndev);1014610147/* During a pool export spa & vdev will no longer be valid */10148if (asynchronous) {10149remdev->l2ad_spa = NULL;10150remdev->l2ad_vdev = NULL;10151}10152mutex_exit(&l2arc_dev_mtx);1015310154if (!asynchronous) {10155l2arc_device_teardown(rva);10156return;10157}1015810159arc_async_flush_t *af = arc_async_flush_add(rva->rva_spa_gid, 2);1016010161taskq_dispatch_ent(arc_flush_taskq, 
l2arc_device_teardown, rva,10162TQ_SLEEP, &af->af_tqent);10163}1016410165void10166l2arc_init(void)10167{10168l2arc_thread_exit = 0;10169l2arc_ndev = 0;1017010171mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);10172cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);10173mutex_init(&l2arc_rebuild_thr_lock, NULL, MUTEX_DEFAULT, NULL);10174cv_init(&l2arc_rebuild_thr_cv, NULL, CV_DEFAULT, NULL);10175mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);10176mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);1017710178l2arc_dev_list = &L2ARC_dev_list;10179l2arc_free_on_write = &L2ARC_free_on_write;10180list_create(l2arc_dev_list, sizeof (l2arc_dev_t),10181offsetof(l2arc_dev_t, l2ad_node));10182list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),10183offsetof(l2arc_data_free_t, l2df_list_node));10184}1018510186void10187l2arc_fini(void)10188{10189mutex_destroy(&l2arc_feed_thr_lock);10190cv_destroy(&l2arc_feed_thr_cv);10191mutex_destroy(&l2arc_rebuild_thr_lock);10192cv_destroy(&l2arc_rebuild_thr_cv);10193mutex_destroy(&l2arc_dev_mtx);10194mutex_destroy(&l2arc_free_on_write_mtx);1019510196list_destroy(l2arc_dev_list);10197list_destroy(l2arc_free_on_write);10198}1019910200void10201l2arc_start(void)10202{10203if (!(spa_mode_global & SPA_MODE_WRITE))10204return;1020510206(void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,10207TS_RUN, defclsyspri);10208}1020910210void10211l2arc_stop(void)10212{10213if (!(spa_mode_global & SPA_MODE_WRITE))10214return;1021510216mutex_enter(&l2arc_feed_thr_lock);10217cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */10218l2arc_thread_exit = 1;10219while (l2arc_thread_exit != 0)10220cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);10221mutex_exit(&l2arc_feed_thr_lock);10222}1022310224/*10225* Punches out rebuild threads for the L2ARC devices in a spa. 
This should10226* be called after pool import from the spa async thread, since starting10227* these threads directly from spa_import() will make them part of the10228* "zpool import" context and delay process exit (and thus pool import).10229*/10230void10231l2arc_spa_rebuild_start(spa_t *spa)10232{10233ASSERT(MUTEX_HELD(&spa_namespace_lock));1023410235/*10236* Locate the spa's l2arc devices and kick off rebuild threads.10237*/10238for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {10239l2arc_dev_t *dev =10240l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);10241if (dev == NULL) {10242/* Don't attempt a rebuild if the vdev is UNAVAIL */10243continue;10244}10245mutex_enter(&l2arc_rebuild_thr_lock);10246if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) {10247dev->l2ad_rebuild_began = B_TRUE;10248(void) thread_create(NULL, 0, l2arc_dev_rebuild_thread,10249dev, 0, &p0, TS_RUN, minclsyspri);10250}10251mutex_exit(&l2arc_rebuild_thr_lock);10252}10253}1025410255void10256l2arc_spa_rebuild_stop(spa_t *spa)10257{10258ASSERT(MUTEX_HELD(&spa_namespace_lock) ||10259spa->spa_export_thread == curthread);1026010261for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {10262l2arc_dev_t *dev =10263l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);10264if (dev == NULL)10265continue;10266mutex_enter(&l2arc_rebuild_thr_lock);10267dev->l2ad_rebuild_cancel = B_TRUE;10268mutex_exit(&l2arc_rebuild_thr_lock);10269}10270for (int i = 0; i < spa->spa_l2cache.sav_count; i++) {10271l2arc_dev_t *dev =10272l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]);10273if (dev == NULL)10274continue;10275mutex_enter(&l2arc_rebuild_thr_lock);10276if (dev->l2ad_rebuild_began == B_TRUE) {10277while (dev->l2ad_rebuild == B_TRUE) {10278cv_wait(&l2arc_rebuild_thr_cv,10279&l2arc_rebuild_thr_lock);10280}10281}10282mutex_exit(&l2arc_rebuild_thr_lock);10283}10284}1028510286/*10287* Main entry point for L2ARC rebuilding.10288*/10289static __attribute__((noreturn)) void10290l2arc_dev_rebuild_thread(void *arg)10291{10292l2arc_dev_t *dev = arg;1029310294VERIFY(dev->l2ad_rebuild);10295(void) l2arc_rebuild(dev);10296mutex_enter(&l2arc_rebuild_thr_lock);10297dev->l2ad_rebuild_began = B_FALSE;10298dev->l2ad_rebuild = B_FALSE;10299cv_signal(&l2arc_rebuild_thr_cv);10300mutex_exit(&l2arc_rebuild_thr_lock);1030110302thread_exit();10303}1030410305/*10306* This function implements the actual L2ARC metadata rebuild. 
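 * The chain is walked backwards in time: we start from the most recent log
 * block pointers saved in the device header (dh_start_lbps[]) and follow
 * each block's lb_prev_lbp to the block written before it.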
It:10307* starts reading the log block chain and restores each block's contents10308* to memory (reconstructing arc_buf_hdr_t's).10309*10310* Operation stops under any of the following conditions:10311*10312* 1) We reach the end of the log block chain.10313* 2) We encounter *any* error condition (cksum errors, io errors)10314*/10315static int10316l2arc_rebuild(l2arc_dev_t *dev)10317{10318vdev_t *vd = dev->l2ad_vdev;10319spa_t *spa = vd->vdev_spa;10320int err = 0;10321l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;10322l2arc_log_blk_phys_t *this_lb, *next_lb;10323zio_t *this_io = NULL, *next_io = NULL;10324l2arc_log_blkptr_t lbps[2];10325l2arc_lb_ptr_buf_t *lb_ptr_buf;10326boolean_t lock_held;1032710328this_lb = vmem_zalloc(sizeof (*this_lb), KM_SLEEP);10329next_lb = vmem_zalloc(sizeof (*next_lb), KM_SLEEP);1033010331/*10332* We prevent device removal while issuing reads to the device,10333* then during the rebuilding phases we drop this lock again so10334* that a spa_unload or device remove can be initiated - this is10335* safe, because the spa will signal us to stop before removing10336* our device and wait for us to stop.10337*/10338spa_config_enter(spa, SCL_L2ARC, vd, RW_READER);10339lock_held = B_TRUE;1034010341/*10342* Retrieve the persistent L2ARC device state.10343* L2BLK_GET_PSIZE returns aligned size for log blocks.10344*/10345dev->l2ad_evict = MAX(l2dhdr->dh_evict, dev->l2ad_start);10346dev->l2ad_hand = MAX(l2dhdr->dh_start_lbps[0].lbp_daddr +10347L2BLK_GET_PSIZE((&l2dhdr->dh_start_lbps[0])->lbp_prop),10348dev->l2ad_start);10349dev->l2ad_first = !!(l2dhdr->dh_flags & L2ARC_DEV_HDR_EVICT_FIRST);1035010351vd->vdev_trim_action_time = l2dhdr->dh_trim_action_time;10352vd->vdev_trim_state = l2dhdr->dh_trim_state;1035310354/*10355* In case the zfs module parameter l2arc_rebuild_enabled is false10356* we do not start the rebuild process.10357*/10358if (!l2arc_rebuild_enabled)10359goto out;1036010361/* Prepare the rebuild process */10362memcpy(lbps, l2dhdr->dh_start_lbps, sizeof (lbps));1036310364/* Start the rebuild process */10365for (;;) {10366if (!l2arc_log_blkptr_valid(dev, &lbps[0]))10367break;1036810369if ((err = l2arc_log_blk_read(dev, &lbps[0], &lbps[1],10370this_lb, next_lb, this_io, &next_io)) != 0)10371goto out;1037210373/*10374* Our memory pressure valve. If the system is running low10375* on memory, rather than swamping memory with new ARC buf10376* hdrs, we opt not to rebuild the L2ARC. 
At this point,10377* however, we have already set up our L2ARC dev to chain in10378* new metadata log blocks, so the user may choose to offline/10379* online the L2ARC dev at a later time (or re-import the pool)10380* to reconstruct it (when there's less memory pressure).10381*/10382if (l2arc_hdr_limit_reached()) {10383ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem);10384cmn_err(CE_NOTE, "System running low on memory, "10385"aborting L2ARC rebuild.");10386err = SET_ERROR(ENOMEM);10387goto out;10388}1038910390spa_config_exit(spa, SCL_L2ARC, vd);10391lock_held = B_FALSE;1039210393/*10394* Now that we know that the next_lb checks out alright, we10395* can start reconstruction from this log block.10396* L2BLK_GET_PSIZE returns aligned size for log blocks.10397*/10398uint64_t asize = L2BLK_GET_PSIZE((&lbps[0])->lbp_prop);10399l2arc_log_blk_restore(dev, this_lb, asize);1040010401/*10402* log block restored, include its pointer in the list of10403* pointers to log blocks present in the L2ARC device.10404*/10405lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP);10406lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t),10407KM_SLEEP);10408memcpy(lb_ptr_buf->lb_ptr, &lbps[0],10409sizeof (l2arc_log_blkptr_t));10410mutex_enter(&dev->l2ad_mtx);10411list_insert_tail(&dev->l2ad_lbptr_list, lb_ptr_buf);10412ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize);10413ARCSTAT_BUMP(arcstat_l2_log_blk_count);10414zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf);10415zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf);10416mutex_exit(&dev->l2ad_mtx);10417vdev_space_update(vd, asize, 0, 0);1041810419/*10420* Protection against loops of log blocks:10421*10422* l2ad_hand l2ad_evict10423* V V10424* l2ad_start |=======================================| l2ad_end10425* -----|||----|||---|||----|||10426* (3) (2) (1) (0)10427* ---|||---|||----|||---|||10428* (7) (6) (5) (4)10429*10430* In this situation the pointer of log block (4) passes10431* l2arc_log_blkptr_valid() but the log block should not be10432* restored as it is overwritten by the payload of log block10433* (0). Only log blocks (0)-(3) should be restored. We check10434* whether l2ad_evict lies in between the payload starting10435* offset of the next log block (lbps[1].lbp_payload_start)10436* and the payload starting offset of the present log block10437* (lbps[0].lbp_payload_start). If true and this isn't the10438* first pass, we are looping from the beginning and we should10439* stop.10440*/10441if (l2arc_range_check_overlap(lbps[1].lbp_payload_start,10442lbps[0].lbp_payload_start, dev->l2ad_evict) &&10443!dev->l2ad_first)10444goto out;1044510446kpreempt(KPREEMPT_SYNC);10447for (;;) {10448mutex_enter(&l2arc_rebuild_thr_lock);10449if (dev->l2ad_rebuild_cancel) {10450mutex_exit(&l2arc_rebuild_thr_lock);10451err = SET_ERROR(ECANCELED);10452goto out;10453}10454mutex_exit(&l2arc_rebuild_thr_lock);10455if (spa_config_tryenter(spa, SCL_L2ARC, vd,10456RW_READER)) {10457lock_held = B_TRUE;10458break;10459}10460/*10461* L2ARC config lock held by somebody in writer,10462* possibly due to them trying to remove us. 
			 * They'll likely want us to shut down, so after a
			 * little delay, we check l2ad_rebuild_cancel and
			 * retry the lock again.
			 */
			delay(1);
		}

		/*
		 * Continue with the next log block.
		 */
		lbps[0] = lbps[1];
		lbps[1] = this_lb->lb_prev_lbp;
		PTR_SWAP(this_lb, next_lb);
		this_io = next_io;
		next_io = NULL;
	}

	if (this_io != NULL)
		l2arc_log_blk_fetch_abort(this_io);
out:
	if (next_io != NULL)
		l2arc_log_blk_fetch_abort(next_io);
	vmem_free(this_lb, sizeof (*this_lb));
	vmem_free(next_lb, sizeof (*next_lb));

	if (err == ECANCELED) {
		/*
		 * In case the rebuild was canceled, do not log to the spa
		 * history log as the pool may be in the process of being
		 * removed.
		 */
		zfs_dbgmsg("L2ARC rebuild aborted, restored %llu blocks",
		    (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
		return (err);
	} else if (!l2arc_rebuild_enabled) {
		spa_history_log_internal(spa, "L2ARC rebuild", NULL,
		    "disabled");
	} else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) > 0) {
		ARCSTAT_BUMP(arcstat_l2_rebuild_success);
		spa_history_log_internal(spa, "L2ARC rebuild", NULL,
		    "successful, restored %llu blocks",
		    (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
	} else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) == 0) {
		/*
		 * No error but also nothing restored, meaning the lbps array
		 * in the device header points to invalid/non-present log
		 * blocks. Reset the header.
		 */
		spa_history_log_internal(spa, "L2ARC rebuild", NULL,
		    "no valid log blocks");
		memset(l2dhdr, 0, dev->l2ad_dev_hdr_asize);
		l2arc_dev_hdr_update(dev);
	} else if (err != 0) {
		spa_history_log_internal(spa, "L2ARC rebuild", NULL,
		    "aborted, restored %llu blocks",
		    (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count));
	}

	if (lock_held)
		spa_config_exit(spa, SCL_L2ARC, vd);

	return (err);
}

/*
 * Attempts to read the device header on the provided L2ARC device and writes
 * it to `hdr'.
On success, this function returns 0, otherwise the appropriate10529* error code is returned.10530*/10531static int10532l2arc_dev_hdr_read(l2arc_dev_t *dev)10533{10534int err;10535uint64_t guid;10536l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;10537const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;10538abd_t *abd;1053910540guid = spa_guid(dev->l2ad_vdev->vdev_spa);1054110542abd = abd_get_from_buf(l2dhdr, l2dhdr_asize);1054310544err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev,10545VDEV_LABEL_START_SIZE, l2dhdr_asize, abd,10546ZIO_CHECKSUM_LABEL, NULL, NULL, ZIO_PRIORITY_SYNC_READ,10547ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY |10548ZIO_FLAG_SPECULATIVE, B_FALSE));1054910550abd_free(abd);1055110552if (err != 0) {10553ARCSTAT_BUMP(arcstat_l2_rebuild_abort_dh_errors);10554zfs_dbgmsg("L2ARC IO error (%d) while reading device header, "10555"vdev guid: %llu", err,10556(u_longlong_t)dev->l2ad_vdev->vdev_guid);10557return (err);10558}1055910560if (l2dhdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC))10561byteswap_uint64_array(l2dhdr, sizeof (*l2dhdr));1056210563if (l2dhdr->dh_magic != L2ARC_DEV_HDR_MAGIC ||10564l2dhdr->dh_spa_guid != guid ||10565l2dhdr->dh_vdev_guid != dev->l2ad_vdev->vdev_guid ||10566l2dhdr->dh_version != L2ARC_PERSISTENT_VERSION ||10567l2dhdr->dh_log_entries != dev->l2ad_log_entries ||10568l2dhdr->dh_end != dev->l2ad_end ||10569!l2arc_range_check_overlap(dev->l2ad_start, dev->l2ad_end,10570l2dhdr->dh_evict) ||10571(l2dhdr->dh_trim_state != VDEV_TRIM_COMPLETE &&10572l2arc_trim_ahead > 0)) {10573/*10574* Attempt to rebuild a device containing no actual dev hdr10575* or containing a header from some other pool or from another10576* version of persistent L2ARC.10577*/10578ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported);10579return (SET_ERROR(ENOTSUP));10580}1058110582return (0);10583}1058410585/*10586* Reads L2ARC log blocks from storage and validates their contents.10587*10588* This function implements a simple fetcher to make sure that while10589* we're processing one buffer the L2ARC is already fetching the next10590* one in the chain.10591*10592* The arguments this_lp and next_lp point to the current and next log block10593* address in the block chain. Similarly, this_lb and next_lb hold the10594* l2arc_log_blk_phys_t's of the current and next L2ARC blk.10595*10596* The `this_io' and `next_io' arguments are used for block fetching.10597* When issuing the first blk IO during rebuild, you should pass NULL for10598* `this_io'. This function will then issue a sync IO to read the block and10599* also issue an async IO to fetch the next block in the block chain. The10600* fetched IO is returned in `next_io'. On subsequent calls to this10601* function, pass the value returned in `next_io' from the previous call10602* as `this_io' and a fresh `next_io' pointer to hold the next fetch IO.10603* Prior to the call, you should initialize your `next_io' pointer to be10604* NULL. If no fetch IO was issued, the pointer is left set at NULL.10605*10606* On success, this function returns 0, otherwise it returns an appropriate10607* error code. On error the fetching IO is aborted and cleared before10608* returning from this function. 
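 * A minimal sketch of the intended calling pattern, mirroring what
 * l2arc_rebuild() does (error handling and lbps/lb bookkeeping trimmed):
 *
 *	zio_t *this_io = NULL, *next_io = NULL;
 *	for (;;) {
 *		if (l2arc_log_blk_read(dev, &lbps[0], &lbps[1], this_lb,
 *		    next_lb, this_io, &next_io) != 0)
 *			break;
 *		// ... restore this_lb, advance lbps, swap this_lb/next_lb ...
 *		this_io = next_io;
 *		next_io = NULL;
 *	}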
 * returning from this function. Therefore, if we return `success', the
 * caller can assume that we have taken care of cleanup of fetch IOs.
 */
static int
l2arc_log_blk_read(l2arc_dev_t *dev,
    const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp,
    l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb,
    zio_t *this_io, zio_t **next_io)
{
	int err = 0;
	zio_cksum_t cksum;
	uint64_t asize;

	ASSERT(this_lbp != NULL && next_lbp != NULL);
	ASSERT(this_lb != NULL && next_lb != NULL);
	ASSERT(next_io != NULL && *next_io == NULL);
	ASSERT(l2arc_log_blkptr_valid(dev, this_lbp));

	/*
	 * Check to see if we have issued the IO for this log block in a
	 * previous run. If not, this is the first call, so issue it now.
	 */
	if (this_io == NULL) {
		this_io = l2arc_log_blk_fetch(dev->l2ad_vdev, this_lbp,
		    this_lb);
	}

	/*
	 * Peek to see if we can start issuing the next IO immediately.
	 */
	if (l2arc_log_blkptr_valid(dev, next_lbp)) {
		/*
		 * Start issuing IO for the next log block early - this
		 * should help keep the L2ARC device busy while we
		 * decompress and restore this log block.
		 */
		*next_io = l2arc_log_blk_fetch(dev->l2ad_vdev, next_lbp,
		    next_lb);
	}

	/* Wait for the IO to read this log block to complete */
	if ((err = zio_wait(this_io)) != 0) {
		ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors);
		zfs_dbgmsg("L2ARC IO error (%d) while reading log block, "
		    "offset: %llu, vdev guid: %llu", err,
		    (u_longlong_t)this_lbp->lbp_daddr,
		    (u_longlong_t)dev->l2ad_vdev->vdev_guid);
		goto cleanup;
	}

	/*
	 * Make sure the buffer checks out.
	 * L2BLK_GET_PSIZE returns aligned size for log blocks.
	 */
	asize = L2BLK_GET_PSIZE((this_lbp)->lbp_prop);
	fletcher_4_native(this_lb, asize, NULL, &cksum);
	if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) {
		ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_lb_errors);
		zfs_dbgmsg("L2ARC log block cksum failed, offset: %llu, "
		    "vdev guid: %llu, l2ad_hand: %llu, l2ad_evict: %llu",
		    (u_longlong_t)this_lbp->lbp_daddr,
		    (u_longlong_t)dev->l2ad_vdev->vdev_guid,
		    (u_longlong_t)dev->l2ad_hand,
		    (u_longlong_t)dev->l2ad_evict);
		err = SET_ERROR(ECKSUM);
		goto cleanup;
	}

	/* Now we can take our time decoding this buffer */
	switch (L2BLK_GET_COMPRESS((this_lbp)->lbp_prop)) {
	case ZIO_COMPRESS_OFF:
		break;
	case ZIO_COMPRESS_LZ4: {
		abd_t *abd = abd_alloc_linear(asize, B_TRUE);
		abd_copy_from_buf_off(abd, this_lb, 0, asize);
		abd_t dabd;
		abd_get_from_buf_struct(&dabd, this_lb, sizeof (*this_lb));
		err = zio_decompress_data(
		    L2BLK_GET_COMPRESS((this_lbp)->lbp_prop),
		    abd, &dabd, asize, sizeof (*this_lb), NULL);
		abd_free(&dabd);
		abd_free(abd);
		if (err != 0) {
			err = SET_ERROR(EINVAL);
			goto cleanup;
		}
		break;
	}
	default:
		err = SET_ERROR(EINVAL);
		goto cleanup;
	}
	if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC))
		byteswap_uint64_array(this_lb, sizeof (*this_lb));
	if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) {
		err = SET_ERROR(EINVAL);
		goto cleanup;
	}
cleanup:
	/* Abort an in-flight fetch I/O in case of error */
	if (err != 0 && *next_io != NULL) {
		l2arc_log_blk_fetch_abort(*next_io);
		*next_io = NULL;
	}
	return (err);
}
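
/*
 * A simplified sketch of the calling protocol described above (the real
 * caller is l2arc_rebuild(), whose loop detection, cancellation checks and
 * locking are omitted here; `lbps', `this_lb', `next_lb' and the zio
 * pointers are the caller-local variables used by that function):
 *
 *	zio_t *this_io = NULL, *next_io = NULL;
 *
 *	for (;;) {
 *		if (!l2arc_log_blkptr_valid(dev, &lbps[0]))
 *			break;
 *		if ((err = l2arc_log_blk_read(dev, &lbps[0], &lbps[1],
 *		    this_lb, next_lb, this_io, &next_io)) != 0)
 *			goto out;
 *		l2arc_log_blk_restore(dev, this_lb,
 *		    L2BLK_GET_PSIZE((&lbps[0])->lbp_prop));
 *		lbps[0] = lbps[1];
 *		lbps[1] = this_lb->lb_prev_lbp;
 *		PTR_SWAP(this_lb, next_lb);
 *		this_io = next_io;
 *		next_io = NULL;
 *	}
 *	if (this_io != NULL)
 *		l2arc_log_blk_fetch_abort(this_io);
 * out:
 *	if (next_io != NULL)
 *		l2arc_log_blk_fetch_abort(next_io);
 */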

/*
 * Restores the payload of a log block to ARC. This creates empty ARC hdr
 * entries which only contain an l2arc hdr, essentially restoring the
 * buffers to their L2ARC evicted state. This function also updates space
 * usage on the L2ARC vdev to make sure it tracks restored buffers.
 */
static void
l2arc_log_blk_restore(l2arc_dev_t *dev, const l2arc_log_blk_phys_t *lb,
    uint64_t lb_asize)
{
	uint64_t size = 0, asize = 0;
	uint64_t log_entries = dev->l2ad_log_entries;

	/*
	 * Usually arc_adapt() is called only for data, not headers, but
	 * since we may allocate a significant amount of memory here, let
	 * ARC grow its arc_c.
	 */
	arc_adapt(log_entries * HDR_L2ONLY_SIZE);

	for (int i = log_entries - 1; i >= 0; i--) {
		/*
		 * Restore goes in the reverse temporal direction to preserve
		 * correct temporal ordering of buffers in the l2ad_buflist.
		 * l2arc_hdr_restore also does a list_insert_tail instead of
		 * list_insert_head on the l2ad_buflist:
		 *
		 *              LIST          l2ad_buflist     LIST
		 *              HEAD  <------ (time) ------    TAIL
		 * direction    +-----+-----+-----+-----+-----+    direction
		 * of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild
		 * fill         +-----+-----+-----+-----+-----+
		 *              ^                             ^
		 *              |                             |
		 *              |                             |
		 *       l2arc_feed_thread              l2arc_rebuild
		 *       will place new bufs here       restores bufs here
		 *
		 * During l2arc_rebuild() the device is not used by
		 * l2arc_feed_thread() as dev->l2ad_rebuild is set to true.
		 */
		size += L2BLK_GET_LSIZE((&lb->lb_entries[i])->le_prop);
		asize += vdev_psize_to_asize(dev->l2ad_vdev,
		    L2BLK_GET_PSIZE((&lb->lb_entries[i])->le_prop));
		l2arc_hdr_restore(&lb->lb_entries[i], dev);
	}

	/*
	 * Record rebuild stats:
	 *	size		Logical size of restored buffers in the L2ARC
	 *	asize		Aligned size of restored buffers in the L2ARC
	 */
	ARCSTAT_INCR(arcstat_l2_rebuild_size, size);
	ARCSTAT_INCR(arcstat_l2_rebuild_asize, asize);
	ARCSTAT_INCR(arcstat_l2_rebuild_bufs, log_entries);
	ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, lb_asize);
	ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, asize / lb_asize);
	ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks);
}
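
/*
 * For example (illustrative numbers only): on a cache vdev with a 4 KB
 * minimum allocation size (ashift of 12), a log entry recording
 * lsize = 131072 and psize = 5632 contributes 131072 bytes to `size',
 * while vdev_psize_to_asize() rounds the 5632 bytes up to 8192 for
 * `asize'.
 */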

/*
 * Restores a single ARC buf hdr from a log entry. The ARC buffer is put
 * into a state indicating that it has been evicted to L2ARC.
 */
static void
l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev)
{
	arc_buf_hdr_t *hdr, *exists;
	kmutex_t *hash_lock;
	arc_buf_contents_t type = L2BLK_GET_TYPE((le)->le_prop);
	uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev,
	    L2BLK_GET_PSIZE((le)->le_prop));

	/*
	 * Do all the allocation before grabbing any locks; this lets us
	 * sleep if memory is full and we don't have to deal with failed
	 * allocations.
	 */
	hdr = arc_buf_alloc_l2only(L2BLK_GET_LSIZE((le)->le_prop), type,
	    dev, le->le_dva, le->le_daddr,
	    L2BLK_GET_PSIZE((le)->le_prop), asize, le->le_birth,
	    L2BLK_GET_COMPRESS((le)->le_prop), le->le_complevel,
	    L2BLK_GET_PROTECTED((le)->le_prop),
	    L2BLK_GET_PREFETCH((le)->le_prop),
	    L2BLK_GET_STATE((le)->le_prop));

	/*
	 * vdev_space_update() has to be called before arc_hdr_destroy() to
	 * avoid underflow since the latter also calls vdev_space_update().
	 */
	l2arc_hdr_arcstats_increment(hdr);
	vdev_space_update(dev->l2ad_vdev, asize, 0, 0);

	mutex_enter(&dev->l2ad_mtx);
	list_insert_tail(&dev->l2ad_buflist, hdr);
	(void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr);
	mutex_exit(&dev->l2ad_mtx);

	exists = buf_hash_insert(hdr, &hash_lock);
	if (exists) {
		/* Buffer was already cached, no need to restore it. */
		arc_hdr_destroy(hdr);
		/*
		 * If the buffer is already cached, check whether it has
		 * L2ARC metadata. If not, fill it in and update the flag.
		 * This is important in case of onlining a cache device, since
		 * we previously evicted all L2ARC metadata from ARC.
		 */
		if (!HDR_HAS_L2HDR(exists)) {
			arc_hdr_set_flags(exists, ARC_FLAG_HAS_L2HDR);
			exists->b_l2hdr.b_dev = dev;
			exists->b_l2hdr.b_daddr = le->le_daddr;
			exists->b_l2hdr.b_arcs_state =
			    L2BLK_GET_STATE((le)->le_prop);
			/* l2arc_hdr_arcstats_update() expects a valid asize */
			HDR_SET_L2SIZE(exists, asize);
			mutex_enter(&dev->l2ad_mtx);
			list_insert_tail(&dev->l2ad_buflist, exists);
			(void) zfs_refcount_add_many(&dev->l2ad_alloc,
			    arc_hdr_size(exists), exists);
			mutex_exit(&dev->l2ad_mtx);
			l2arc_hdr_arcstats_increment(exists);
			vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
		}
		ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached);
	}

	mutex_exit(hash_lock);
}

/*
 * Starts an asynchronous read IO to read a log block. This is used in log
 * block reconstruction to start reading the next block before we are done
 * decoding and reconstructing the current block, to keep the l2arc device
 * nice and hot with read IO to process.
 * The returned zio will contain newly allocated memory buffers for the IO
 * data which should then be freed by the caller once the zio is no longer
 * needed (i.e. due to it having completed). If you wish to abort this
 * zio, you should do so using l2arc_log_blk_fetch_abort, which takes
 * care of disposing of the allocated buffers correctly.
 */
static zio_t *
l2arc_log_blk_fetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp,
    l2arc_log_blk_phys_t *lb)
{
	uint32_t asize;
	zio_t *pio;
	l2arc_read_callback_t *cb;

	/* L2BLK_GET_PSIZE returns aligned size for log blocks */
	asize = L2BLK_GET_PSIZE((lbp)->lbp_prop);
	ASSERT(asize <= sizeof (l2arc_log_blk_phys_t));

	cb = kmem_zalloc(sizeof (l2arc_read_callback_t), KM_SLEEP);
	cb->l2rcb_abd = abd_get_from_buf(lb, asize);
	pio = zio_root(vd->vdev_spa, l2arc_blk_fetch_done, cb,
	    ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY);
	(void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, asize,
	    cb->l2rcb_abd, ZIO_CHECKSUM_OFF, NULL, NULL,
	    ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL |
	    ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE));

	return (pio);
}

/*
 * Aborts a zio returned from l2arc_log_blk_fetch and frees the data
 * buffers allocated for it.
 */
static void
l2arc_log_blk_fetch_abort(zio_t *zio)
{
	(void) zio_wait(zio);
}

/*
 * Synchronously updates the device header on an l2arc device.
 */
void
l2arc_dev_hdr_update(l2arc_dev_t *dev)
{
	l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
	const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize;
	abd_t *abd;
	int err;

	VERIFY(spa_config_held(dev->l2ad_spa, SCL_STATE_ALL, RW_READER));

	l2dhdr->dh_magic = L2ARC_DEV_HDR_MAGIC;
	l2dhdr->dh_version = L2ARC_PERSISTENT_VERSION;
	l2dhdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa);
	l2dhdr->dh_vdev_guid = dev->l2ad_vdev->vdev_guid;
	l2dhdr->dh_log_entries = dev->l2ad_log_entries;
	l2dhdr->dh_evict = dev->l2ad_evict;
	l2dhdr->dh_start = dev->l2ad_start;
	l2dhdr->dh_end = dev->l2ad_end;
	l2dhdr->dh_lb_asize = zfs_refcount_count(&dev->l2ad_lb_asize);
	l2dhdr->dh_lb_count = zfs_refcount_count(&dev->l2ad_lb_count);
	l2dhdr->dh_flags = 0;
	l2dhdr->dh_trim_action_time = dev->l2ad_vdev->vdev_trim_action_time;
	l2dhdr->dh_trim_state = dev->l2ad_vdev->vdev_trim_state;
	if (dev->l2ad_first)
		l2dhdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST;

	abd = abd_get_from_buf(l2dhdr, l2dhdr_asize);

	err = zio_wait(zio_write_phys(NULL, dev->l2ad_vdev,
	    VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, ZIO_CHECKSUM_LABEL, NULL,
	    NULL, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE));

	abd_free(abd);

	if (err != 0) {
		zfs_dbgmsg("L2ARC IO error (%d) while writing device header, "
		    "vdev guid: %llu", err,
		    (u_longlong_t)dev->l2ad_vdev->vdev_guid);
	}
}
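
/*
 * The header written above also persists dh_start_lbps[], maintained by
 * l2arc_log_blk_commit() below. On a later import, l2arc_rebuild() seeds
 * its walk of the log block chain from those two pointers, roughly:
 *
 *	memcpy(lbps, l2dhdr->dh_start_lbps, sizeof (lbps));
 *
 * which gives the double-buffered fetch in l2arc_log_blk_read() both a
 * current and a next block pointer to start with. (Sketch only; `lbps'
 * is the caller-local array used by l2arc_rebuild().)
 */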

/*
 * Commits a log block to the L2ARC device. This routine is invoked from
 * l2arc_write_buffers when the log block fills up.
 * This function allocates some memory to temporarily hold the serialized
 * buffer to be written. This is then released in l2arc_write_done.
 */
static uint64_t
l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio, l2arc_write_callback_t *cb)
{
	l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk;
	l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr;
	uint64_t psize, asize;
	zio_t *wzio;
	l2arc_lb_abd_buf_t *abd_buf;
	abd_t *abd = NULL;
	l2arc_lb_ptr_buf_t *lb_ptr_buf;

	VERIFY3S(dev->l2ad_log_ent_idx, ==, dev->l2ad_log_entries);

	abd_buf = zio_buf_alloc(sizeof (*abd_buf));
	abd_buf->abd = abd_get_from_buf(lb, sizeof (*lb));
	lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP);
	lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t), KM_SLEEP);

	/* link the buffer into the block chain */
	lb->lb_prev_lbp = l2dhdr->dh_start_lbps[1];
	lb->lb_magic = L2ARC_LOG_BLK_MAGIC;

	/*
	 * l2arc_log_blk_commit() may be called multiple times during a single
	 * l2arc_write_buffers() call. Save the allocated abd buffers in a list
	 * so we can free them in l2arc_write_done() later on.
	 */
	list_insert_tail(&cb->l2wcb_abd_list, abd_buf);

	/* try to compress the buffer; it must save at least one sector */
	psize = zio_compress_data(ZIO_COMPRESS_LZ4,
	    abd_buf->abd, &abd, sizeof (*lb),
	    zio_get_compression_max_size(ZIO_COMPRESS_LZ4,
	    dev->l2ad_vdev->vdev_ashift,
	    dev->l2ad_vdev->vdev_ashift, sizeof (*lb)), 0);

	/* a log block is never entirely zero */
	ASSERT(psize != 0);
	asize = vdev_psize_to_asize(dev->l2ad_vdev, psize);
	ASSERT(asize <= sizeof (*lb));

	/*
	 * Update the start log block pointer in the device header to point
	 * to the log block we're about to write.
	 */
	l2dhdr->dh_start_lbps[1] = l2dhdr->dh_start_lbps[0];
	l2dhdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand;
	l2dhdr->dh_start_lbps[0].lbp_payload_asize =
	    dev->l2ad_log_blk_payload_asize;
	l2dhdr->dh_start_lbps[0].lbp_payload_start =
	    dev->l2ad_log_blk_payload_start;
	L2BLK_SET_LSIZE(
	    (&l2dhdr->dh_start_lbps[0])->lbp_prop, sizeof (*lb));
	L2BLK_SET_PSIZE(
	    (&l2dhdr->dh_start_lbps[0])->lbp_prop, asize);
	L2BLK_SET_CHECKSUM(
	    (&l2dhdr->dh_start_lbps[0])->lbp_prop,
	    ZIO_CHECKSUM_FLETCHER_4);
	if (asize < sizeof (*lb)) {
		/* compression succeeded */
		abd_zero_off(abd, psize, asize - psize);
		L2BLK_SET_COMPRESS(
		    (&l2dhdr->dh_start_lbps[0])->lbp_prop,
		    ZIO_COMPRESS_LZ4);
	} else {
		/* compression failed */
		abd_copy_from_buf_off(abd, lb, 0, sizeof (*lb));
		L2BLK_SET_COMPRESS(
		    (&l2dhdr->dh_start_lbps[0])->lbp_prop,
		    ZIO_COMPRESS_OFF);
	}

	/* checksum what we're about to write */
	abd_fletcher_4_native(abd, asize, NULL,
	    &l2dhdr->dh_start_lbps[0].lbp_cksum);

	abd_free(abd_buf->abd);

	/* perform the write itself */
	abd_buf->abd = abd;
	wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand,
	    asize, abd_buf->abd, ZIO_CHECKSUM_OFF, NULL, NULL,
	    ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE);
	DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio);
	(void) zio_nowait(wzio);

	dev->l2ad_hand += asize;
	vdev_space_update(dev->l2ad_vdev, asize, 0, 0);

	/*
	 * Include the committed log block's pointer in the list of pointers
	 * to log blocks present in the L2ARC device.
	 */
	memcpy(lb_ptr_buf->lb_ptr, &l2dhdr->dh_start_lbps[0],
	    sizeof (l2arc_log_blkptr_t));
	mutex_enter(&dev->l2ad_mtx);
	list_insert_head(&dev->l2ad_lbptr_list, lb_ptr_buf);
	ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize);
	ARCSTAT_BUMP(arcstat_l2_log_blk_count);
	zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf);
	zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf);
	mutex_exit(&dev->l2ad_mtx);

	/* bump the kstats */
	ARCSTAT_INCR(arcstat_l2_write_bytes, asize);
	ARCSTAT_BUMP(arcstat_l2_log_blk_writes);
	ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, asize);
	ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio,
	    dev->l2ad_log_blk_payload_asize / asize);

	/* start a new log block */
	dev->l2ad_log_ent_idx = 0;
	dev->l2ad_log_blk_payload_asize = 0;
	dev->l2ad_log_blk_payload_start = 0;

	return (asize);
}

/*
 * Validates an L2ARC log block address to make sure that it can be read
 * from the provided L2ARC device.
 */
boolean_t
l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp)
{
	/* L2BLK_GET_PSIZE returns aligned size for log blocks */
	uint64_t asize = L2BLK_GET_PSIZE((lbp)->lbp_prop);
	uint64_t end = lbp->lbp_daddr + asize - 1;
	uint64_t start = lbp->lbp_payload_start;
	boolean_t evicted = B_FALSE;

	/*
	 * A log block is valid if all of the following conditions are true:
	 * - it fits entirely (including its payload) between l2ad_start and
	 *   l2ad_end
	 * - it has a valid size
	 * - neither the log block itself nor part of its payload was evicted
	 *   by l2arc_evict():
	 *
	 *                l2ad_hand           l2ad_evict
	 *                |     start         |          lbp_daddr
	 *                |     |             |          |   end
	 *                |     |             |          |   |
	 *                V     V             V          V   V
	 *   l2ad_start ============================================ l2ad_end
	 *                      --------------------------||||
	 *                      ^                         ^
	 *                      |                     log block
	 *                      payload
	 */

	evicted =
	    l2arc_range_check_overlap(start, end, dev->l2ad_hand) ||
	    l2arc_range_check_overlap(start, end, dev->l2ad_evict) ||
	    l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, start) ||
	    l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, end);

	return (start >= dev->l2ad_start && end <= dev->l2ad_end &&
	    asize > 0 && asize <= sizeof (l2arc_log_blk_phys_t) &&
	    (!evicted || dev->l2ad_first));
}
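
/*
 * Concretely (illustrative numbers): on a device with l2ad_start = 4 MB
 * and l2ad_end = 1 GB, a pointer with lbp_daddr = 100 MB and an aligned
 * size of 16 KB describes [100 MB, 100 MB + 16 KB). It is accepted only
 * if that range and the payload beginning at lbp_payload_start lie within
 * [l2ad_start, l2ad_end], the size is non-zero and no larger than
 * sizeof (l2arc_log_blk_phys_t), and neither the block nor its payload
 * falls in the window between l2ad_hand and l2ad_evict that has already
 * been reclaimed for upcoming writes (unless the device has not wrapped
 * around yet, i.e. l2ad_first is set).
 */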

/*
 * Inserts ARC buffer header `hdr' into the current L2ARC log block on
 * the device. The buffer being inserted must be present in L2ARC.
 * Returns B_TRUE if the L2ARC log block is full and needs to be committed
 * to L2ARC, or B_FALSE if it still has room for more ARC buffers.
 */
static boolean_t
l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *hdr)
{
	l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk;
	l2arc_log_ent_phys_t *le;

	if (dev->l2ad_log_entries == 0)
		return (B_FALSE);

	int index = dev->l2ad_log_ent_idx++;

	ASSERT3S(index, <, dev->l2ad_log_entries);
	ASSERT(HDR_HAS_L2HDR(hdr));

	le = &lb->lb_entries[index];
	memset(le, 0, sizeof (*le));
	le->le_dva = hdr->b_dva;
	le->le_birth = hdr->b_birth;
	le->le_daddr = hdr->b_l2hdr.b_daddr;
	if (index == 0)
		dev->l2ad_log_blk_payload_start = le->le_daddr;
	L2BLK_SET_LSIZE((le)->le_prop, HDR_GET_LSIZE(hdr));
	L2BLK_SET_PSIZE((le)->le_prop, HDR_GET_PSIZE(hdr));
	L2BLK_SET_COMPRESS((le)->le_prop, HDR_GET_COMPRESS(hdr));
	le->le_complevel = hdr->b_complevel;
	L2BLK_SET_TYPE((le)->le_prop, hdr->b_type);
	L2BLK_SET_PROTECTED((le)->le_prop, !!(HDR_PROTECTED(hdr)));
	L2BLK_SET_PREFETCH((le)->le_prop, !!(HDR_PREFETCH(hdr)));
	L2BLK_SET_STATE((le)->le_prop, hdr->b_l2hdr.b_arcs_state);

	dev->l2ad_log_blk_payload_asize += vdev_psize_to_asize(dev->l2ad_vdev,
	    HDR_GET_PSIZE(hdr));

	return (dev->l2ad_log_ent_idx == dev->l2ad_log_entries);
}
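
/*
 * A minimal usage sketch (simplified; the real caller is
 * l2arc_write_buffers(), and `pio' and `cb' stand for its write zio and
 * write callback): every buffer written to the device is also recorded
 * in the open log block, and the block is committed once it fills up:
 *
 *	if (l2arc_log_blk_insert(dev, hdr))
 *		(void) l2arc_log_blk_commit(dev, pio, cb);
 */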

/*
 * Checks whether a given L2ARC device address sits in a time-sequential
 * range. The trick here is that the L2ARC is a rotary buffer, so we can't
 * just do a range comparison; we need to handle the situation in which the
 * range wraps around the end of the L2ARC device. Arguments:
 *	bottom -- Lower end of the range to check (written to earlier).
 *	top    -- Upper end of the range to check (written to later).
 *	check  -- The address for which we want to determine if it sits in
 *		  between the top and bottom.
 *
 * The 3-way conditional below represents the following cases:
 *
 *   bottom < top : Sequentially ordered case:
 *     <check>--------+-------------------+
 *                    | (overlap here?)   |
 *     L2ARC dev      V                   V
 *     |--------------<bottom>============<top>--------------|
 *
 *   bottom > top: Looped-around case:
 *     <check>--------+------------------+
 *                    | (overlap here?)  |
 *     L2ARC dev      V                  V
 *     |==============<top>--------------<bottom>============|
 *     ^                                          ^
 *     |              (or here?)                  |
 *     +--------------------------------------------+-------<check>
 *
 *   top == bottom : Just a single address comparison.
 */
boolean_t
l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check)
{
	if (bottom < top)
		return (bottom <= check && check <= top);
	else if (bottom > top)
		return (check <= top || bottom <= check);
	else
		return (check == top);
}
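
/*
 * Worked example (illustrative numbers): with bottom = 900 and top = 100
 * the looped-around branch applies, so check = 950 (bottom <= check) and
 * check = 50 (check <= top) both report an overlap, while check = 500
 * does not. With bottom = 100 and top = 900 the sequential branch applies
 * and only 100 <= check <= 900 overlaps.
 */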
L2ARC");1126311264ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, mfuonly, INT, ZMOD_RW,11265"Cache only MFU data from ARC into L2ARC");1126611267ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, exclude_special, INT, ZMOD_RW,11268"Exclude dbufs on special vdevs from being cached to L2ARC if set.");1126911270ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, lotsfree_percent, param_set_arc_int,11271param_get_uint, ZMOD_RW, "System free memory I/O throttle in bytes");1127211273ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, sys_free, param_set_arc_u64,11274spl_param_get_u64, ZMOD_RW, "System free memory target size in bytes");1127511276ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit, param_set_arc_u64,11277spl_param_get_u64, ZMOD_RW, "Minimum bytes of dnodes in ARC");1127811279ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit_percent,11280param_set_arc_int, param_get_uint, ZMOD_RW,11281"Percent of ARC meta buffers for dnodes");1128211283ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, dnode_reduce_percent, UINT, ZMOD_RW,11284"Percentage of excess dnodes to try to unpin");1128511286ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, eviction_pct, UINT, ZMOD_RW,11287"When full, ARC allocation waits for eviction of this % of alloc size");1128811289ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, evict_batch_limit, UINT, ZMOD_RW,11290"The number of headers to evict per sublist before moving to the next");1129111292ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, prune_task_threads, INT, ZMOD_RW,11293"Number of arc_prune threads");1129411295ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, evict_threads, UINT, ZMOD_RD,11296"Number of threads to use for ARC eviction.");112971129811299