ctdb-scripts: Avoid flapping NFS services at startup
commit 578dfa576517b10d979c9aef539ac910b2f95381
author Martin Schwenke <mschwenke@ddn.com>
Sat, 29 Jun 2024 02:25:59 +0000 (29 12:25 +1000)
committer Martin Schwenke <martins@samba.org>
Tue, 20 Aug 2024 22:50:34 +0000 (20 22:50 +0000)
tree 5c57de1b1c72a3c77648ff80f176c4d21a559409
parent 18a29ed367278849889a846bb93f49afd0c045a8

If an NFS service check is set to, say, unhealthy_after=2 then a
failing service will always switch from the (default startup)
unhealthy state to healthy, even if there is a fatal problem, because
the first failure only increments the counter.  If all other
services/scripts appear OK then the node will become healthy.  When
the counter hits the limit the node will return to unhealthy.  This is
misleading.

Instead, never use the counter at startup, until the service becomes
healthy.  This stops services flapping unhealthy-healthy-unhealthy.
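
The rule above can be illustrated with a minimal shell sketch.  This is
not the code from 60.nfs.script or ctdb/config/functions: the function
monitor_service, the variables STATE_DIR and UNHEALTHY_AFTER, and the
marker/counter file names are all hypothetical, chosen only to show the
idea of ignoring the failure counter until the first successful check.

```sh
#!/bin/sh
# Hypothetical sketch only; names here are illustrative assumptions,
# not identifiers from the CTDB event scripts.

STATE_DIR="${STATE_DIR:-$(mktemp -d)}"
UNHEALTHY_AFTER=2

# monitor_service NAME CHECK_STATUS
# CHECK_STATUS is the exit status of the service check (0 = OK).
# Returns 0 if the service should be reported healthy, 1 otherwise.
monitor_service ()
{
	_name="$1"
	_status="$2"
	_seen_healthy="${STATE_DIR}/${_name}.seen_healthy"
	_count_file="${STATE_DIR}/${_name}.failcount"

	if [ "$_status" -eq 0 ] ; then
		# Check passed: remember it and reset the failure counter
		touch "$_seen_healthy"
		echo 0 >"$_count_file"
		return 0
	fi

	if [ ! -e "$_seen_healthy" ] ; then
		# Never healthy since startup: report the failure
		# immediately, without consulting the counter
		return 1
	fi

	# Healthy at least once: tolerate failures below the limit
	_count=0
	if [ -r "$_count_file" ] ; then
		read -r _count <"$_count_file"
	fi
	_count=$((_count + 1))
	echo "$_count" >"$_count_file"

	[ "$_count" -lt "$UNHEALTHY_AFTER" ]
}
```

In this sketch a service that fails its very first check is reported
unhealthy straight away.  Once a check has passed, a single failure is
tolerated and the second consecutive failure is reported as unhealthy,
which is the unhealthy_after=2 behaviour without the startup flap.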

A side-effect is that a service that starts in a broken state will
never be restarted to try to fix the problem.  This makes sense.  The
counting and restarting really exist to deal with problems that might
occur under load.  The first monitor events occur before public IPs
are hosted, so there can be no load.  If a service doesn't start
reliably the first time then the admin probably wants to know about
it.

nfs_iterate_test() is updated to run an initial monitor event to mark
the services as healthy.  This initialises the counter so it can be
used for the important part of the test.  Passing the -i option avoids
running the extra monitor event, so the first iteration will be the
initial monitor event.

Signed-off-by: Martin Schwenke <mschwenke@ddn.com>
Reviewed-by: Amitay Isaacs <amitay@gmail.com>
ctdb/config/events/legacy/60.nfs.script
ctdb/config/functions
ctdb/tests/UNIT/eventscripts/60.nfs.monitor.171.sh [new file with mode: 0755]
ctdb/tests/UNIT/eventscripts/60.nfs.monitor.172.sh [new file with mode: 0755]
ctdb/tests/UNIT/eventscripts/scripts/60.nfs.sh