Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2016
Copyright (c) 2016 Intel Corporation
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
It can also be applied to FT/HA (Fault-tolerance/High Availability)
scenarios, where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpoint and COLO perform
consecutive checkpoints. The VM state of the Primary and Secondary VM is
identical right after a VM checkpoint, but diverges as the VM executes
until the next checkpoint. To support checkpointing of disk contents,
the modified disk contents in the Secondary VM must be buffered, and are
only dropped at the next checkpoint. To reduce the network transfer
effort during a vmstate checkpoint, the disk modification operations of
the Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                      |           ^
                  |                      |          (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests are copied and forwarded to the Secondary
       node.
    2) Before Primary write requests are written to the Secondary disk, the
       original sector content is read from the Secondary disk and
       buffered in the Disk buffer, but it does not overwrite existing
       sector content in the Disk buffer (which could come from either
       "Secondary Write Requests" or a previous COW of "Primary Write
       Requests").
    3) Primary write requests are written to the Secondary disk.
    4) Secondary write requests are buffered in the Disk buffer and
       overwrite the existing sector content in the buffer.

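The buffering rules in steps 2) to 4) can be illustrated with a small
Python model. This is only a sketch under stated assumptions, not QEMU
code: the class and method names are hypothetical, sectors are modelled
as dictionary keys, and the read path (buffer first, then disk) is an
assumption that follows from the buffer holding the Secondary's view.

```python
class SecondaryStorage:
    """Toy model of the Secondary side: a disk plus the Disk buffer."""

    def __init__(self, disk):
        self.disk = dict(disk)  # sector -> content (Secondary disk)
        self.buffer = {}        # sector -> content (Disk buffer)

    def primary_write(self, sector, data):
        # Step 2: preserve the original content, but never overwrite
        # content that is already buffered for this sector.
        if sector not in self.buffer:
            self.buffer[sector] = self.disk.get(sector)
        # Step 3: write the forwarded request to the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes always replace buffered content.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption: the Secondary VM sees buffered content first.
        if sector in self.buffer:
            return self.buffer[sector]
        return self.disk.get(sector)

    def checkpoint(self):
        # The buffered content is dropped at checkpoint time.
        self.buffer.clear()


storage = SecondaryStorage({0: "a", 1: "b"})
storage.primary_write(0, "P0")    # Secondary VM still reads "a"
storage.secondary_write(1, "S1")  # Secondary VM now reads "S1"
storage.primary_write(1, "P1")    # buffered "S1" is not overwritten
```

After storage.checkpoint(), reads of sectors 0 and 1 in this model return
the forwarded Primary content, because the buffer is empty again.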

== Architecture ==
Block replication is built from basic building blocks that already
exist in QEMU.

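[Diagram: on the Primary, a Quorum node (1) with an NBD client (3) as one
of its children. The NBD client forwards writes to the NBD server (3) on
the Secondary, which writes to the Secondary disk. The Secondary disk is
the backing file of hidden-disk (5), which in turn is the backing file of
active-disk (4). A replication filter (2) is part of the graph, and a
drive-backup job with sync=none (6) runs from the Secondary disk to
hidden-disk.]
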
1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM. The read pattern (fifo) for quorum can be extended
to make the primary always read from the local disk instead of going
through NBD.

2) The new block filter (named replication) controls block replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should start as an empty disk, and its format
should support bdrv_make_empty() and backing files.

5) The hidden-disk is created automatically. It buffers the original
content that is modified by the primary VM. It should also start as an
empty disk, and its driver should support bdrv_make_empty() and backing
files.

6) The drive-backup job (sync=none) is run to allow hidden-disk to buffer
any state that would otherwise be lost by the speculative write-through
of the NBD server into the secondary disk. For this to work, the primary
disk and secondary disk should contain the same data before block
replication starts.

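The backing relationship between active-disk, hidden-disk and the
secondary disk can be sketched as a chain of layers in which a read is
served by the first layer that holds the sector. The Python below is a
simplified illustration, not QEMU's block layer: the Layer class and its
methods are hypothetical, and make_empty() only stands in for what
bdrv_make_empty() is described as providing.

```python
class Layer:
    """One image in a backing chain; sectors holds only local data."""

    def __init__(self, name, backing=None):
        self.name = name
        self.sectors = {}
        self.backing = backing

    def read(self, sector):
        # Walk down the chain until some layer has the sector.
        layer = self
        while layer is not None:
            if sector in layer.sectors:
                return layer.sectors[sector], layer.name
            layer = layer.backing
        return None, None

    def make_empty(self):
        # Stand-in for emptying an image at checkpoint time.
        self.sectors.clear()


secondary_disk = Layer("secondary-disk")
hidden_disk = Layer("hidden-disk", backing=secondary_disk)
active_disk = Layer("active-disk", backing=hidden_disk)

secondary_disk.sectors[0] = "old"
first = active_disk.read(0)     # served by secondary-disk

# A primary write arrives: the backup job saves the old content in
# hidden-disk before the NBD server overwrites the secondary disk.
hidden_disk.sectors[0] = "old"
secondary_disk.sectors[0] = "new"
second = active_disk.read(0)    # served by hidden-disk

# A write from the secondary VM lands in active-disk.
active_disk.sectors[0] = "guest"
third = active_disk.read(0)     # served by active-disk
```

Emptying active-disk and hidden-disk in this model makes the next read
fall through to the secondary disk, which by then holds the primary's
content.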
== Failure Handling ==
There are 7 internal errors that can occur while block replication is
running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
7. Doing failover failed
In cases 1 and 5, we just report the error to the disk layer. In cases 2,
3, 4 and 6, we just report block replication's error to the FT/HA manager
(which decides when to do a new checkpoint and when to do failover).
In case 7, if active commit failed, we use the replication failover failed
state in the Secondary's write operation (which decides which target to
write to).

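The routing of the error cases can be written down as a small lookup.
This is an illustrative Python sketch only: the function name and the
string constants are hypothetical, and the case numbers are the ones used
in the text above.

```python
DISK_LAYER = "disk layer"
FT_HA_MANAGER = "FT/HA manager"
FAILOVER_FAILED_STATE = "replication failover failed state"


def route_error(case):
    """Return where an error case from the list above is handled."""
    if case in (1, 5):
        return DISK_LAYER
    if case in (2, 3, 4, 6):
        return FT_HA_MANAGER
    if case == 7:
        return FAILOVER_FAILED_STATE
    raise ValueError(f"unknown error case: {case}")
```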
== New block driver interface ==
We add four block driver interfaces to control block replication:
a. replication_start_all()
   Start block replication, called in migration/checkpoint thread.
   We must call replication_start_all() in secondary QEMU before
   calling replication_start_all() in primary QEMU. The caller
   must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
b. replication_do_checkpoint_all()
   This interface is called after all VM state is transferred to
   Secondary QEMU. The Disk buffer will be dropped in this interface.
   The caller must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
c. replication_get_error_all()
   This interface is called to check whether an error happened in
   replication. The caller must hold the I/O mutex lock if it is in
   migration/checkpoint thread.
d. replication_stop_all()
   It is called on failover. We will flush the Disk buffer into the
   Secondary Disk and stop block replication. The VM should be stopped
   before calling it if you use this API to shut down the guest, or for
   anything other than failover. The caller must hold the I/O mutex lock
   if it is in migration/checkpoint thread.

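The calling order of these interfaces can be sketched with a toy model in
Python. Nothing here is QEMU's implementation: ReplicationNode and
run_session are hypothetical names, the failover argument of stop_all()
is an assumption made for the sketch, locking is left out, and calling
the checkpoint only on the secondary node is a simplification.

```python
class ReplicationNode:
    """Toy stand-in for block replication state in one QEMU instance."""

    def __init__(self, name):
        self.name = name
        self.running = False
        self.error = None
        self.log = []

    def start_all(self):
        self.running = True
        self.log.append("start")

    def do_checkpoint_all(self):
        if not self.running:
            raise RuntimeError("checkpoint requested before start")
        self.log.append("checkpoint")

    def get_error_all(self):
        return self.error

    def stop_all(self, failover):
        self.running = False
        self.log.append("stop(failover)" if failover else "stop")


def run_session(primary, secondary, checkpoints):
    """Start both sides and run checkpoints until an error shows up."""
    # The secondary must be started before the primary.
    secondary.start_all()
    primary.start_all()
    done = 0
    for _ in range(checkpoints):
        # The VM state transfer would happen here.
        if primary.get_error_all() or secondary.get_error_all():
            break  # left to the FT/HA manager to decide what to do
        secondary.do_checkpoint_all()
        done += 1
    return done


primary = ReplicationNode("primary")
secondary = ReplicationNode("secondary")
completed = run_session(primary, secondary, checkpoints=2)
```

On failover the manager in this model would then call
secondary.stop_all(failover=True).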

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run qmp command in primary qemu:
    { 'execute': 'human-monitor-command',
      'arguments': {
          'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1'
      }
    }
    { 'execute': 'x-blockdev-change',
      'arguments': {
          'parent': 'colo1',
          'node': 'nbd_client1'
      }
    }

  Note:
  1. There should be only one NBD Client for each primary disk.
  2. host is the secondary physical machine's hostname or IP.
  3. Each disk must have its own export name.
  4. It is all a single argument to -drive and you should ignore the
     leading whitespace.
  5. The qmp command line must be run after running the qmp command line
     in secondary qemu.
  6. After failover we need to remove children.1 (replication driver).

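When these commands are sent from a script, it can help to build them
programmatically so that the host, port and export name stay consistent.
The following Python sketch builds the two primary-side commands as
dictionaries and serialises them with json.dumps, since QMP expects JSON
on the wire. The helper name and its parameters are hypothetical, the
values passed in the example are placeholders, and passing the quorum
node as 'parent' is an assumption based on the arguments that
x-blockdev-change takes.

```python
import json


def primary_qmp_commands(host, port, export, quorum_id, node_name):
    """Build the primary-side QMP commands as Python dictionaries."""
    drive_add = (
        "drive_add -n buddy driver=replication,mode=primary,"
        f"file.driver=nbd,file.host={host},file.port={port},"
        f"file.export={export},node-name={node_name}"
    )
    return [
        {
            "execute": "human-monitor-command",
            "arguments": {"command-line": drive_add},
        },
        {
            "execute": "x-blockdev-change",
            # Assumption: the quorum node is the parent of the new child.
            "arguments": {"parent": quorum_id, "node": node_name},
        },
    ]


commands = primary_qmp_commands(
    host="secondary.example",  # placeholder host name
    port=9999,                 # placeholder port
    export="colo1",
    quorum_id="colo1",
    node_name="nbd_client1",
)
wire_lines = [json.dumps(cmd) for cmd in commands]
```

Each entry of wire_lines is one JSON document that could be written to a
QMP connection after capability negotiation.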

Secondary:
  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
  -drive if=xxx,id=topxxx,driver=replication,mode=secondary,top-id=topxxx,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.backing=colo1

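  Then run qmp command in secondary qemu:
    { 'execute': 'nbd-server-start',
      'arguments': {
          'addr': {
              'type': 'inet',
              'data': {
                  'host': 'xxxx',
                  'port': 'xxxx'
              }
          }
      }
    }
    { 'execute': 'nbd-server-add',
      'arguments': {
          'device': 'colo1',
          'writable': true
      }
    }
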
  Note:
  1. The export name in the secondary QEMU command line is the secondary
     disk's id.
  2. The export name for the same disk must be the same on both sides.
  3. The qmp commands nbd-server-start and nbd-server-add must be run
     before running the qmp command migrate on primary QEMU.
  4. Active disk, hidden disk and nbd target's length should be the
     same.
  5. It is better to put active disk and hidden disk in ramdisk.
  6. It is all a single argument to -drive, and you should ignore
     the leading whitespace.

After Failover:
Primary:
  The secondary host is down, so we should run the following qmp commands
  to remove the nbd child from the quorum:
  { 'execute': 'x-blockdev-change',
    'arguments': {
        'parent': 'colo1',
        'child': 'children.1'
    }
  }
  { 'execute': 'human-monitor-command',
    'arguments': {
        'command-line': 'drive_del xxxx'
    }
  }
  Note: there is currently no qmp command to remove the blockdev.

Secondary:
  The primary host is down, so we should run the following qmp command:
  { 'execute': 'nbd-server-stop' }

1. Continuous block replication