Detect and kill orphaned workers using a heartbeat mechanism
Summary:
This diff builds upon D26241563, but instead of each worker writing its own heartbeat, only the central controller writes heartbeats. Each worker simply checks the latest heartbeat and decides whether to terminate.
The heartbeat timeout is currently hard-coded to 20 seconds. In my testing, this seemed to work robustly.
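A minimal sketch of the worker-side check described above, for reviewers who want the logic at a glance. The names `heartbeat_timeout_seconds`, `is_orphaned`, and `check_and_maybe_exit` are hypothetical and are not taken from the diff; the sketch assumes both times are available as seconds since the epoch.

```ocaml
(* Hypothetical names; the actual code in this diff may differ. *)

(* Hard-coded timeout from the summary: 20 seconds. *)
let heartbeat_timeout_seconds = 20.0

(* [now] and [last_heartbeat] are both seconds since the epoch.
   A worker counts as orphaned when the controller's latest
   heartbeat is older than the timeout. *)
let is_orphaned ~now ~last_heartbeat =
  now -. last_heartbeat > heartbeat_timeout_seconds

(* Terminate the worker process if the controller appears to be gone. *)
let check_and_maybe_exit ~now ~last_heartbeat =
  if is_orphaned ~now ~last_heartbeat then begin
    prerr_endline "controller heartbeat is stale; terminating worker";
    exit 1
  end
```

Because only the controller writes heartbeats, each worker needs one read per check and no write of its own.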
Since we don't have an OCaml SQL library and parsing timestamps seemed non-trivial, I delegated the timestamp-to-seconds conversion to the SQL query.
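As an illustration of that delegation, the query can return the heartbeat age as a plain number so the OCaml side only parses a numeric string. This is a sketch under assumptions, not the query used in this diff: it assumes a MySQL-style dialect (`UNIX_TIMESTAMP`, `NOW`), and the table name `heartbeats` and column name `heartbeat_time` are made up.

```ocaml
(* Assumptions: MySQL-style dialect; the table and column names are
   hypothetical and do not come from this diff. *)
let heartbeat_age_query =
  "SELECT UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(MAX(heartbeat_time)) \
   FROM heartbeats"

(* Parse the single value returned by the query. Returns [None] when the
   output is not a number, e.g. "NULL" when no heartbeat row exists. *)
let parse_age_seconds (raw : string) : float option =
  float_of_string_opt (String.trim raw)
```

With this shape, no timestamp parsing happens in OCaml; the caller only compares the parsed age against the timeout.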
You can find more context around the problem in T84570409.
Differential Revision: D26443087
fbshipit-source-id: 10e4af5f48efc23d44f084f6a64046362aa60f4e