block/nbd: allow drain during reconnect attempt
It should be safe to reenter qio_channel_yield() on io/channel read/write
path, so it's safe to reduce in_flight and allow attaching new aio
context. And no problem to allow drain itself: connection attempt is
not a guest request. Moreover, if remote server is down, we can hang
in negotiation, blocking drain section and provoking a dead lock.
How to reproduce the dead lock:
1. Create nbd-fault-injector.conf with the following contents:
[inject-error "mega1"]
event=data
io=readwrite
when=before
2. In one terminal run nbd-fault-injector in a loop, like this:
n=1; while true; do
echo $n; ((n++));
./nbd-fault-injector.py 127.0.0.1:10000 nbd-fault-injector.conf;
done
3. In another terminal run qemu-io in a loop, like this:
n=1; while true; do
echo $n; ((n++));
./qemu-io -c 'read 0 512' nbd://127.0.0.1:10000;
done
After some time, qemu-io will hang trying to drain, for example, like
this:
#3 aio_poll (ctx=0x55f006bdd890, blocking=true) at
util/aio-posix.c:600
#4 bdrv_do_drained_begin (bs=0x55f006bea710, recursive=false,
parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:427
#5 bdrv_drained_begin (bs=0x55f006bea710) at block/io.c:433
#6 blk_drain (blk=0x55f006befc80) at block/block-backend.c:1710
#7 blk_unref (blk=0x55f006befc80) at block/block-backend.c:498
#8 bdrv_open_inherit (filename=0x7fffba1563bc
"nbd+tcp://127.0.0.1:10000", reference=0x0, options=0x55f006be86d0,
flags=24578, parent=0x0, child_class=0x0, child_role=0,
errp=0x7fffba154620) at block.c:3491
#9 bdrv_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
block.c:3513
#10 blk_new_open (filename=0x7fffba1563bc "nbd+tcp://127.0.0.1:10000",
reference=0x0, options=0x0, flags=16386, errp=0x7fffba154620) at
block/block-backend.c:421
And connection_co stack like this:
#0 qemu_coroutine_switch (from_=0x55f006bf2650, to_=0x7fe96e07d918,
action=COROUTINE_YIELD) at util/coroutine-ucontext.c:302
#1 qemu_coroutine_yield () at util/qemu-coroutine.c:193
#2 qio_channel_yield (ioc=0x55f006bb3c20, condition=G_IO_IN) at
io/channel.c:472
#3 qio_channel_readv_all_eof (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
niov=1, errp=0x7fe96d729eb0) at io/channel.c:110
#4 qio_channel_readv_all (ioc=0x55f006bb3c20, iov=0x7fe96d729bf0,
niov=1, errp=0x7fe96d729eb0) at io/channel.c:143
#5 qio_channel_read_all (ioc=0x55f006bb3c20, buf=0x7fe96d729d28
"\300.\366\004\360U", buflen=8, errp=0x7fe96d729eb0) at
io/channel.c:247
#6 nbd_read (ioc=0x55f006bb3c20, buffer=0x7fe96d729d28, size=8,
desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
/work/src/qemu/master/include/block/nbd.h:365
#7 nbd_read64 (ioc=0x55f006bb3c20, val=0x7fe96d729d28,
desc=0x55f004f69644 "initial magic", errp=0x7fe96d729eb0) at
/work/src/qemu/master/include/block/nbd.h:391
#8 nbd_start_negotiate (aio_context=0x55f006bdd890,
ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
outioc=0x55f006bf19f8, structured_reply=true,
zeroes=0x7fe96d729dca, errp=0x7fe96d729eb0) at nbd/client.c:904
#9 nbd_receive_negotiate (aio_context=0x55f006bdd890,
ioc=0x55f006bb3c20, tlscreds=0x0, hostname=0x0,
outioc=0x55f006bf19f8, info=0x55f006bf1a00, errp=0x7fe96d729eb0) at
nbd/client.c:1032
#10 nbd_client_connect (bs=0x55f006bea710, errp=0x7fe96d729eb0) at
block/nbd.c:1460
#11 nbd_reconnect_attempt (s=0x55f006bf19f0) at block/nbd.c:287
#12 nbd_co_reconnect_loop (s=0x55f006bf19f0) at block/nbd.c:309
#13 nbd_connection_entry (opaque=0x55f006bf19f0) at block/nbd.c:360
#14 coroutine_trampoline (i0=
113190480, i1=22000) at
util/coroutine-ucontext.c:173
Note, that the hang may be
triggered by another bug, so the whole case is fixed only together with
commit "block/nbd: on shutdown terminate connection attempt".
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <
20200727184751.15704-3-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>