/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 */

/*
 * Copyright (c) 2017 by Delphix. All rights reserved.
 */

/*
 * Storage Pool Checkpoint
 *
 * A storage pool checkpoint can be thought of as a pool-wide snapshot or
 * a stable version of extreme rewind that guarantees no blocks from the
 * checkpointed state will have been overwritten. It remembers the entire
 * state of the storage pool (e.g. snapshots, dataset names, etc.) from the
 * point that it was taken and the user can rewind back to that point even if
 * they applied destructive operations on their datasets or even enabled new
 * zpool on-disk features. If a pool has a checkpoint that is no longer
 * needed, the user can discard it.
 *
 * == On disk data structures used ==
 *
 * - The pool has a new feature flag and a new entry in the MOS. The feature
 *   flag is set to active when we create the checkpoint and remains active
 *   until the checkpoint is fully discarded. The entry in the MOS config
 *   (DMU_POOL_ZPOOL_CHECKPOINT) is populated with the uberblock that
 *   references the state of the pool when we take the checkpoint. The entry
 *   remains populated until we start discarding the checkpoint or we rewind
 *   back to it.
 *
 * - Each vdev contains a vdev-wide space map while the pool has a checkpoint,
 *   which persists until the checkpoint is fully discarded. The space map
 *   contains entries that have been freed in the current state of the pool
 *   but that we want to keep around in case we decide to rewind to the
 *   checkpoint.
 *   [see vdev_checkpoint_sm]
 *
 * - Each metaslab's ms_sm space map behaves the same as without the
 *   checkpoint, with the only exception being the scenario when we free
 *   blocks that belong to the checkpoint. In this case, these blocks remain
 *   ALLOCATED in the metaslab's space map and they are added as FREE in the
 *   vdev's checkpoint space map.
 *
 * - Each uberblock has a field (ub_checkpoint_txg) which holds the txg at
 *   which the uberblock was checkpointed. For normal uberblocks this field
 *   is 0.
 *
 * == Overview of operations ==
 *
 * - To create a checkpoint, we first wait for the current TXG to be synced,
 *   so we can use the most recently synced uberblock (spa_ubsync) as the
 *   checkpointed uberblock. Then we use an early synctask to place that
 *   uberblock in MOS config, increment the feature flag for the checkpoint
 *   (marking it active), and set spa_checkpoint_txg (see its use below)
 *   to the TXG of the checkpointed uberblock. We use an early synctask for
 *   the aforementioned operations to ensure that no blocks were dirtied
 *   between the current TXG and the TXG of the checkpointed uberblock
 *   (i.e. the previous txg).
 *
 * - When a checkpoint exists, we need to ensure that the blocks that
 *   belong to the checkpoint are freed but never reused. This means that
 *   these blocks should never end up in the ms_allocatable or the ms_freeing
 *   trees of a metaslab. Therefore, whenever there is a checkpoint the new
 *   ms_checkpointing tree is used in addition to the aforementioned ones.
 *
 *   Whenever a block is freed and we find out that it is referenced by the
 *   checkpoint (we find out by comparing its birth to spa_checkpoint_txg),
 *   we place it in the ms_checkpointing tree instead of the ms_freeing tree.
 *   This way, we divide the blocks that are being freed into checkpointed
 *   and not-checkpointed blocks.
 *
 *   In order to persist these frees, we write the extents from the
 *   ms_freeing tree to the ms_sm as usual, and the extents from the
 *   ms_checkpointing tree to the vdev_checkpoint_sm. This way, these
 *   checkpointed extents will remain allocated in the metaslab's ms_sm space
 *   map, and therefore won't be reused [see metaslab_sync()]. In addition,
 *   when we discard the checkpoint, we can find the entries that have
 *   actually been freed in vdev_checkpoint_sm.
 *   [see spa_checkpoint_discard_thread_sync()]
 *
 * - To discard the checkpoint we use an early synctask to delete the
 *   checkpointed uberblock from the MOS config, set spa_checkpoint_txg to 0,
 *   and wake up the discarding zthr thread (an open-context async thread).
 *   We use an early synctask to ensure that the operation happens before any
 *   new data ends up in the checkpoint's data structures.
 *
 *   Once the synctask is done and the discarding zthr is awake, we discard
 *   the checkpointed data over multiple TXGs by having the zthr prefetch
 *   entries from vdev_checkpoint_sm and then start a synctask that places
 *   them as free blocks into their respective ms_allocatable and ms_sm
 *   space maps.
 *   [see spa_checkpoint_discard_thread()]
 *
 *   When there are no entries left in the vdev_checkpoint_sm of all
 *   top-level vdevs, a final synctask runs that decrements the feature flag.
 *
 * - To rewind to the checkpoint, we first use the current uberblock and
 *   open the MOS so we can access the checkpointed uberblock from the MOS
 *   config. After we retrieve the checkpointed uberblock, we use it as the
 *   current uberblock for the pool by writing it to disk with an updated
 *   TXG, opening its version of the MOS, and moving on as usual from there.
 *   [see spa_ld_checkpoint_rewind()]
 *
 *   An important note on rewinding to the checkpoint has to do with how we
 *   handle ZIL blocks. In the scenario of a rewind, we clear out any ZIL
 *   blocks that have not been claimed by the time we took the checkpoint
 *   as they should no longer be valid.
 *   [see comment in zil_claim()]
 *
 * == Miscellaneous information ==
 *
 * - In the hypothetical event that we take a checkpoint, remove a vdev,
 *   and attempt to rewind, the rewind would fail as the checkpointed
 *   uberblock would reference data in the removed device. For this reason
 *   and others of similar nature, we disallow the following operations that
 *   can change the config:
 *	vdev removal and attach/detach, mirror splitting, and pool reguid.
 *
 * - As most of the checkpoint logic is implemented in the SPA and doesn't
 *   distinguish datasets when it comes to space accounting, having a
 *   checkpoint can potentially break the boundaries set by dataset
 *   reservations.
 */
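
/*
 * Illustrative sketch, not part of the original file: the free-path
 * classification described in the overview, reduced to a pure function.
 * The names sketch_tree_t and sketch_classify_free() are hypothetical, and
 * the "born at or before the checkpoint txg" comparison is an assumption
 * about how "comparing its birth to spa_checkpoint_txg" is meant to work.
 * The real logic lives in the metaslab code and operates on range trees.
 */

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two destination trees of a free. */
typedef enum {
	SKETCH_TREE_FREEING,		/* would be persisted to ms_sm */
	SKETCH_TREE_CHECKPOINTING	/* would go to vdev_checkpoint_sm */
} sketch_tree_t;

/*
 * Decide which tree a freed block would be placed in.
 * A checkpoint_txg of 0 models "no checkpoint exists".
 */
sketch_tree_t
sketch_classify_free(uint64_t birth_txg, uint64_t checkpoint_txg)
{
	/* Without a checkpoint every free takes the normal path. */
	if (checkpoint_txg == 0)
		return (SKETCH_TREE_FREEING);

	/* Assumption: blocks born at or before the checkpoint belong to it. */
	if (birth_txg <= checkpoint_txg)
		return (SKETCH_TREE_CHECKPOINTING);

	/* Blocks born after the checkpoint can be reused as usual. */
	return (SKETCH_TREE_FREEING);
}
```

/*
 * Under these assumptions, a block born in txg 90 and freed while a
 * checkpoint at txg 100 exists stays allocated in ms_sm, while a block
 * born in txg 101 is freed normally.
 */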

#include <sys/dmu_tx.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_synctask.h>
#include <sys/metaslab_impl.h>
#include <sys/spa_impl.h>
#include <sys/spa_checkpoint.h>
#include <sys/vdev_impl.h>
#include <sys/zfeature.h>

/*
 * The following parameter limits the amount of memory to be used for the
 * prefetching of the checkpoint space map done on each vdev while
 * discarding the checkpoint.
 *
 * It exists because top-level vdevs with long checkpoint space maps can
 * potentially take up a lot of memory depending on the amount of
 * checkpointed data that has been freed within them while the pool had
 * a checkpoint.
 */
uint64_t zfs_spa_discard_memory_limit = 16 * 1024 * 1024;
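
/*
 * Illustrative sketch, not part of the original file: how the memory limit
 * bounds one discard pass. It mirrors the two computations that appear
 * later in this file (the per-synctask entry limit and the prefetch window
 * taken from the tail of the space map). The names sketch_max_entry_limit(),
 * sketch_prefetch_window() and sketch_window_t are hypothetical.
 */

```c
#include <assert.h>
#include <stdint.h>

/*
 * Entries processed per synctask: assume each entry takes two 64-bit
 * words, then halve the result to leave room for debug entries.
 */
uint64_t
sketch_max_entry_limit(uint64_t memory_limit)
{
	return ((memory_limit / (2 * sizeof (uint64_t))) >> 1);
}

/* Byte range of the space map to prefetch before the synctask runs. */
typedef struct sketch_window {
	uint64_t offset;
	uint64_t size;
} sketch_window_t;

/*
 * Prefetch at most memory_limit bytes, taken from the end of the space
 * map, since entries are destroyed from the end towards the beginning.
 */
sketch_window_t
sketch_prefetch_window(uint64_t sm_length, uint64_t memory_limit)
{
	sketch_window_t w;

	w.size = (sm_length < memory_limit) ? sm_length : memory_limit;
	w.offset = sm_length - w.size;
	return (w);
}
```

/*
 * With the default limit of 16 MiB this gives 524288 entries per pass;
 * a 40 MiB space map would be prefetched starting at byte offset 24 MiB.
 */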

int
spa_checkpoint_get_stats(spa_t *spa, pool_checkpoint_stat_t *pcs)
{
	if (!spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT))
		return (SET_ERROR(ZFS_ERR_NO_CHECKPOINT));

	bzero(pcs, sizeof (pool_checkpoint_stat_t));

	int error = zap_contains(spa_meta_objset(spa),
	    DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_ZPOOL_CHECKPOINT);
	ASSERT(error == 0 || error == ENOENT);

	/*
	 * The feature is active but the MOS entry is gone only while
	 * the checkpoint is being discarded.
	 */
	if (error == ENOENT)
		pcs->pcs_state = CS_CHECKPOINT_DISCARDING;
	else
		pcs->pcs_state = CS_CHECKPOINT_EXISTS;

	pcs->pcs_space = spa->spa_checkpoint_info.sci_dspace;
	pcs->pcs_start_time = spa->spa_checkpoint_info.sci_timestamp;

	return (0);
}

static void
spa_checkpoint_discard_complete_sync(void *arg, dmu_tx_t *tx)
{
	spa_t *spa = arg;

	spa->spa_checkpoint_info.sci_timestamp = 0;

	spa_feature_decr(spa, SPA_FEATURE_POOL_CHECKPOINT, tx);

	spa_history_log_internal(spa, "spa discard checkpoint", tx,
	    "finished discarding checkpointed state from the pool");
}

typedef struct spa_checkpoint_discard_sync_callback_arg {
	vdev_t *sdc_vd;
	uint64_t sdc_txg;
	uint64_t sdc_entry_limit;
} spa_checkpoint_discard_sync_callback_arg_t;

static int
spa_checkpoint_discard_sync_callback(space_map_entry_t *sme, void *arg)
{
	spa_checkpoint_discard_sync_callback_arg_t *sdc = arg;
	vdev_t *vd = sdc->sdc_vd;
	metaslab_t *ms = vd->vdev_ms[sme->sme_offset >> vd->vdev_ms_shift];
	uint64_t end = sme->sme_offset + sme->sme_run;

	/* Stop the iteration once the per-synctask entry budget is used up. */
	if (sdc->sdc_entry_limit == 0)
		return (EINTR);

	/*
	 * Since the space map is not condensed, we know that
	 * none of its entries is crossing the boundaries of
	 * its respective metaslab.
	 *
	 * That said, there is no fundamental requirement that
	 * the checkpoint's space map entries should not cross
	 * metaslab boundaries. So if needed we could add code
	 * that handles metaslab-crossing segments in the future.
	 */
	VERIFY3U(sme->sme_type, ==, SM_FREE);
	VERIFY3U(sme->sme_offset, >=, ms->ms_start);
	VERIFY3U(end, <=, ms->ms_start + ms->ms_size);

	/*
	 * At this point we should not be processing any
	 * other frees concurrently, so the lock is technically
	 * unnecessary. We use the lock anyway though to
	 * potentially save ourselves from future headaches.
	 */
	mutex_enter(&ms->ms_lock);
	if (range_tree_is_empty(ms->ms_freeing))
		vdev_dirty(vd, VDD_METASLAB, ms, sdc->sdc_txg);
	range_tree_add(ms->ms_freeing, sme->sme_offset, sme->sme_run);
	mutex_exit(&ms->ms_lock);

	ASSERT3U(vd->vdev_spa->spa_checkpoint_info.sci_dspace, >=,
	    sme->sme_run);
	ASSERT3U(vd->vdev_stat.vs_checkpoint_space, >=, sme->sme_run);

	vd->vdev_spa->spa_checkpoint_info.sci_dspace -= sme->sme_run;
	vd->vdev_stat.vs_checkpoint_space -= sme->sme_run;
	sdc->sdc_entry_limit--;

	return (0);
}

static void
spa_checkpoint_accounting_verify(spa_t *spa)
{
	vdev_t *rvd = spa->spa_root_vdev;
	uint64_t ckpoint_sm_space_sum = 0;
	uint64_t vs_ckpoint_space_sum = 0;

	for (uint64_t c = 0; c < rvd->vdev_children; c++) {
		vdev_t *vd = rvd->vdev_child[c];

		if (vd->vdev_checkpoint_sm != NULL) {
			ckpoint_sm_space_sum +=
			    -vd->vdev_checkpoint_sm->sm_alloc;
			vs_ckpoint_space_sum +=
			    vd->vdev_stat.vs_checkpoint_space;
			ASSERT3U(ckpoint_sm_space_sum, ==,
			    vs_ckpoint_space_sum);
		} else {
			ASSERT0(vd->vdev_stat.vs_checkpoint_space);
		}
	}
	ASSERT3U(spa->spa_checkpoint_info.sci_dspace, ==, ckpoint_sm_space_sum);
}

static void
spa_checkpoint_discard_thread_sync(void *arg, dmu_tx_t *tx)
{
	vdev_t *vd = arg;
	int error;

	/*
	 * The space map callback is applied only to non-debug entries.
	 * Because the number of debug entries is less than or equal to
	 * the number of non-debug entries, we want to ensure that we only
	 * read what we prefetched from open-context.
	 *
	 * Thus, we set the maximum entries that the space map callback
	 * will be applied to be half the entries that could fit in the
	 * imposed memory limit.
	 *
	 * Note that since this is a conservative estimate we also
	 * assume the worst case scenario in our computation where each
	 * entry is two-word.
	 */
	uint64_t max_entry_limit =
	    (zfs_spa_discard_memory_limit / (2 * sizeof (uint64_t))) >> 1;

	/*
	 * Iterate from the end of the space map towards the beginning,
	 * placing its entries on ms_freeing and removing them from the
	 * space map. The iteration stops if one of the following
	 * conditions is true:
	 *
	 * 1] We reached the beginning of the space map. At this point
	 *    the space map should be completely empty and
	 *    space_map_incremental_destroy should have returned 0.
	 *    The next step would be to free and close the space map
	 *    and remove its entry from its vdev's top zap. This allows
	 *    spa_checkpoint_discard_thread() to move on to the next vdev.
	 *
	 * 2] We reached the memory limit (amount of memory used to hold
	 *    space map entries in memory) and space_map_incremental_destroy
	 *    returned EINTR. This means that there are entries remaining
	 *    in the space map that will be cleared in a future invocation
	 *    of this function by spa_checkpoint_discard_thread().
	 */
	spa_checkpoint_discard_sync_callback_arg_t sdc;
	sdc.sdc_vd = vd;
	sdc.sdc_txg = tx->tx_txg;
	sdc.sdc_entry_limit = max_entry_limit;

	uint64_t words_before =
	    space_map_length(vd->vdev_checkpoint_sm) / sizeof (uint64_t);

	error = space_map_incremental_destroy(vd->vdev_checkpoint_sm,
	    spa_checkpoint_discard_sync_callback, &sdc, tx);

	uint64_t words_after =
	    space_map_length(vd->vdev_checkpoint_sm) / sizeof (uint64_t);

	spa_checkpoint_accounting_verify(vd->vdev_spa);

	zfs_dbgmsg("discarding checkpoint: txg %llu, vdev id %d, "
	    "deleted %llu words - %llu words are left",
	    tx->tx_txg, vd->vdev_id, (words_before - words_after),
	    words_after);

	if (error != EINTR) {
		if (error != 0) {
			zfs_panic_recover("zfs: error %d was returned "
			    "while incrementally destroying the checkpoint "
			    "space map of vdev %llu\n",
			    error, vd->vdev_id);
		}
		ASSERT0(words_after);
		ASSERT0(vd->vdev_checkpoint_sm->sm_alloc);
		ASSERT0(space_map_length(vd->vdev_checkpoint_sm));

		space_map_free(vd->vdev_checkpoint_sm, tx);
		space_map_close(vd->vdev_checkpoint_sm);
		vd->vdev_checkpoint_sm = NULL;

		VERIFY0(zap_remove(spa_meta_objset(vd->vdev_spa),
		    vd->vdev_top_zap, VDEV_TOP_ZAP_POOL_CHECKPOINT_SM, tx));
	}
}

static boolean_t
spa_checkpoint_discard_is_done(spa_t *spa)
{
	vdev_t *rvd = spa->spa_root_vdev;

	ASSERT(!spa_has_checkpoint(spa));
	ASSERT(spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT));

	for (uint64_t c = 0; c < rvd->vdev_children; c++) {
		if (rvd->vdev_child[c]->vdev_checkpoint_sm != NULL)
			return (B_FALSE);
		ASSERT0(rvd->vdev_child[c]->vdev_stat.vs_checkpoint_space);
	}

	return (B_TRUE);
}

boolean_t
spa_checkpoint_discard_thread_check(void *arg, zthr_t *zthr)
{
	spa_t *spa = arg;

	if (!spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT))
		return (B_FALSE);

	if (spa_has_checkpoint(spa))
		return (B_FALSE);

	return (B_TRUE);
}

int
spa_checkpoint_discard_thread(void *arg, zthr_t *zthr)
{
	spa_t *spa = arg;
	vdev_t *rvd = spa->spa_root_vdev;

	for (uint64_t c = 0; c < rvd->vdev_children; c++) {
		vdev_t *vd = rvd->vdev_child[c];

		while (vd->vdev_checkpoint_sm != NULL) {
			space_map_t *checkpoint_sm = vd->vdev_checkpoint_sm;
			int numbufs;
			dmu_buf_t **dbp;

			if (zthr_iscancelled(zthr))
				return (0);

			ASSERT3P(vd->vdev_ops, !=, &vdev_indirect_ops);

			uint64_t size = MIN(space_map_length(checkpoint_sm),
			    zfs_spa_discard_memory_limit);
			uint64_t offset =
			    space_map_length(checkpoint_sm) - size;

			/*
			 * Ensure that the part of the space map that will
			 * be destroyed by the synctask is prefetched in
			 * memory before the synctask runs.
			 */
			int error = dmu_buf_hold_array_by_bonus(
			    checkpoint_sm->sm_dbuf, offset, size,
			    B_TRUE, FTAG, &numbufs, &dbp);
			if (error != 0) {
				zfs_panic_recover("zfs: error %d was returned "
				    "while prefetching checkpoint space map "
				    "entries of vdev %llu\n",
				    error, vd->vdev_id);
			}

			VERIFY0(dsl_sync_task(spa->spa_name, NULL,
			    spa_checkpoint_discard_thread_sync, vd,
			    0, ZFS_SPACE_CHECK_NONE));

			dmu_buf_rele_array(dbp, numbufs, FTAG);
		}
	}

	VERIFY(spa_checkpoint_discard_is_done(spa));
	VERIFY0(spa->spa_checkpoint_info.sci_dspace);
	VERIFY0(dsl_sync_task(spa->spa_name, NULL,
	    spa_checkpoint_discard_complete_sync, spa,
	    0, ZFS_SPACE_CHECK_NONE));

	return (0);
}

static int
spa_checkpoint_check(void *arg, dmu_tx_t *tx)
{
	spa_t *spa = dmu_tx_pool(tx)->dp_spa;

	if (!spa_feature_is_enabled(spa, SPA_FEATURE_POOL_CHECKPOINT))
		return (SET_ERROR(ENOTSUP));

	if (!spa_top_vdevs_spacemap_addressable(spa))
		return (SET_ERROR(ZFS_ERR_VDEV_TOO_BIG));

	if (spa->spa_vdev_removal != NULL)
		return (SET_ERROR(ZFS_ERR_DEVRM_IN_PROGRESS));

	if (spa->spa_checkpoint_txg != 0)
		return (SET_ERROR(ZFS_ERR_CHECKPOINT_EXISTS));

	if (spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT))
		return (SET_ERROR(ZFS_ERR_DISCARDING_CHECKPOINT));

	return (0);
}

static void
spa_checkpoint_sync(void *arg, dmu_tx_t *tx)
{
	dsl_pool_t *dp = dmu_tx_pool(tx);
	spa_t *spa = dp->dp_spa;
	uberblock_t checkpoint = spa->spa_ubsync;

	/*
	 * At this point, there should not be a checkpoint in the MOS.
	 */
	ASSERT3U(zap_contains(spa_meta_objset(spa), DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_ZPOOL_CHECKPOINT), ==, ENOENT);

	ASSERT0(spa->spa_checkpoint_info.sci_timestamp);
	ASSERT0(spa->spa_checkpoint_info.sci_dspace);

	/*
	 * Since the checkpointed uberblock is the one that just got synced
	 * (we use spa_ubsync), its txg must be equal to the txg number of
	 * the txg we are syncing, minus 1.
	 */
	ASSERT3U(checkpoint.ub_txg, ==, spa->spa_syncing_txg - 1);

	/*
	 * Once the checkpoint is in place, we need to ensure that none of
	 * its blocks will be marked for reuse after they have been freed.
	 * When there is a checkpoint and a block is freed, we compare its
	 * birth txg to the txg of the checkpointed uberblock to see if the
	 * block is part of the checkpoint or not. Therefore, we have to set
	 * spa_checkpoint_txg before any frees happen in this txg (which is
	 * why this is done as an early_synctask as explained in the comment
	 * in spa_checkpoint()).
	 */
	spa->spa_checkpoint_txg = checkpoint.ub_txg;
	spa->spa_checkpoint_info.sci_timestamp = checkpoint.ub_timestamp;

	checkpoint.ub_checkpoint_txg = checkpoint.ub_txg;
	VERIFY0(zap_add(spa->spa_dsl_pool->dp_meta_objset,
	    DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_ZPOOL_CHECKPOINT,
	    sizeof (uint64_t), sizeof (uberblock_t) / sizeof (uint64_t),
	    &checkpoint, tx));

	/*
	 * Increment the feature refcount and thus activate the feature.
	 * Note that the feature will be deactivated when we've
	 * completely discarded all checkpointed state (both vdev
	 * space maps and uberblock).
	 */
	spa_feature_incr(spa, SPA_FEATURE_POOL_CHECKPOINT, tx);

	spa_history_log_internal(spa, "spa checkpoint", tx,
	    "checkpointed uberblock txg=%llu", checkpoint.ub_txg);
}

/*
 * Create a checkpoint for the pool.
 */
int
spa_checkpoint(const char *pool)
{
	int error;
	spa_t *spa;

	error = spa_open(pool, &spa, FTAG);
	if (error != 0)
		return (error);

	mutex_enter(&spa->spa_vdev_top_lock);

	/*
	 * Wait for current syncing txg to finish so the latest synced
	 * uberblock (spa_ubsync) has all the changes that we expect
	 * to see if we were to revert later to the checkpoint. In other
	 * words we want the checkpointed uberblock to include/reference
	 * all the changes that were pending at the time that we issued
	 * the checkpoint command.
	 */
	txg_wait_synced(spa_get_dsl(spa), 0);

	/*
	 * As the checkpointed uberblock references blocks from the previous
	 * txg (spa_ubsync) we want to ensure that we are not freeing any of
	 * these blocks in the same txg that the following synctask will
	 * run. Thus, we run it as an early synctask, so the dirty changes
	 * that are synced to disk afterwards during zios and other synctasks
	 * do not reuse checkpointed blocks.
	 */
	error = dsl_early_sync_task(pool, spa_checkpoint_check,
	    spa_checkpoint_sync, NULL, 0, ZFS_SPACE_CHECK_NORMAL);

	mutex_exit(&spa->spa_vdev_top_lock);

	spa_close(spa, FTAG);
	return (error);
}

static int
spa_checkpoint_discard_check(void *arg, dmu_tx_t *tx)
{
	spa_t *spa = dmu_tx_pool(tx)->dp_spa;

	if (!spa_feature_is_active(spa, SPA_FEATURE_POOL_CHECKPOINT))
		return (SET_ERROR(ZFS_ERR_NO_CHECKPOINT));

	if (spa->spa_checkpoint_txg == 0)
		return (SET_ERROR(ZFS_ERR_DISCARDING_CHECKPOINT));

	VERIFY0(zap_contains(spa_meta_objset(spa),
	    DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_ZPOOL_CHECKPOINT));

	return (0);
}

static void
spa_checkpoint_discard_sync(void *arg, dmu_tx_t *tx)
{
	spa_t *spa = dmu_tx_pool(tx)->dp_spa;

	VERIFY0(zap_remove(spa_meta_objset(spa), DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_ZPOOL_CHECKPOINT, tx));

	spa->spa_checkpoint_txg = 0;

	zthr_wakeup(spa->spa_checkpoint_discard_zthr);

	spa_history_log_internal(spa, "spa discard checkpoint", tx,
	    "started discarding checkpointed state from the pool");
}

/*
 * Discard the checkpoint from a pool.
 */
int
spa_checkpoint_discard(const char *pool)
{
	/*
	 * Similarly to spa_checkpoint(), we want our synctask to run
	 * before any pending dirty data are written to disk so they
	 * won't end up in the checkpoint's data structures (e.g.
	 * ms_checkpointing and vdev_checkpoint_sm) and re-create any
	 * space maps that the discarding open-context thread has
	 * already destroyed.
	 * [see spa_checkpoint_discard_sync and spa_checkpoint_discard_thread]
	 */
	return (dsl_early_sync_task(pool, spa_checkpoint_discard_check,
	    spa_checkpoint_discard_sync, NULL, 0,
	    ZFS_SPACE_CHECK_DISCARD_CHECKPOINT));
}