[#473] signalc: respond gracefully to shutdowns
Closes #473
# context
- signalboost shuts down in 2 instances: (1) in response to failed healthchecks (to re-establish connections to all channel numbers), (2) upon restarting for a deploy or the daily backup job
- in these instances we want to ensure that we gracefully shut down all running signalc services. in particular, we want to ensure that no messages in flight are dropped and not resent due to a sudden restart
- this MR introduces logic to ensure that:
  - when we restart, either due to receiving an `abort` message from signalboost (in the healthcheck failure scenario) or receiving `SIGTERM` (in the restart scenario), we run a shutdown hook which:
    - shuts down all active resources (like db connections, sockets, etc.)
    - drains all messages in the `SignalReceiver` decryption queue and the `SignalSender` send queue before shutting down (or hitting a 2 minute timeout)
    - if we time out before the message queues have been drained, quits with `SIGKILL` and logs a message that we can monitor in grafana to alert us to the fact that our timeout should likely be longer!
# changes

## signalboost

### diagnostics module
- modify restart logic to branch on signald or signalc clients
- in the signalc case, restart differently insofar as:
  - we don't manually unsubscribe all channels (since signalc's shutdown sequence does this)
  - we wait for signalc to shut down before closing our socket connection (b/c it reports shutdown progress to us over the socket)
  - instead of waiting a fixed time before assuming signalc has restarted (which was janky to begin with, but could be particularly bad if we tried to wait 5 seconds while we were draining the message queue for 2 min!), we infer when precisely signalc has shut down by observing when it closes the socket connection with signalboost (which we in turn infer from noticing the absence of a file at the socket descriptor path -- sketched below)
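the actual detection lives in signalboost's (javascript) diagnostics module; purely as an illustration of the polling approach, here is a minimal sketch in kotlin -- the function name, path argument, interval, and timeout below are placeholders, not the real configuration:

```kotlin
import java.nio.file.Files
import java.nio.file.Paths
import kotlinx.coroutines.delay
import kotlinx.coroutines.withTimeoutOrNull

// illustrative only: poll for the disappearance of the socket descriptor file,
// which we treat as the signal that signalc has finished shutting down
suspend fun awaitSignalcShutdown(
    socketDescriptorPath: String,   // placeholder: wherever the unix socket descriptor lives
    pollIntervalMillis: Long = 100L,
    timeoutMillis: Long = 150_000L, // somewhat longer than the 2 minute drain timeout
): Boolean =
    withTimeoutOrNull(timeoutMillis) {
        while (Files.exists(Paths.get(socketDescriptorPath))) {
            delay(pollIntervalMillis)
        }
        true // the file is gone: signalc closed its socket and has shut down
    } ?: false // timed out while the descriptor still existed
```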
### socket module

- in order to support the above new logic for inferring signalc shutdown, we needed to add an `#awaitClose` method. as things stood, the only place to add this would have been in the `app` layer (along with `stopSocket` and `#restartSocket`)
- rather than add onto an already broken design, i borrowed from the refactoring of the `socket` module we did in `simulator`, and put all socket-related logic inside `app.socket`
- as a result of this refactor, the socket module is now stored as a stateful resource on `app` and is accessed via `app.sockets` (instead of accessing `socket.index#run` as a singleton and then later accessing `socket.write#write` as another singleton, and worrying about arcane import order dependencies when trying to figure out where it is safe or not safe to add new logic!)
- as a side effect of this refactor, we:
  - cleaned up the `sockets` module by adding some wrapper methods onto the `pools` collection returned by `run`
  - consolidated the logic for constructing new socket connection pools
  - made signalboost observe the same convention of naming individual socket connections as `connections` rather than `sockets`, as we do in signalc
  - replaced our old (very anemic!) tests with something slightly more substantial and less mock-heavy
- extracted `sharding` into its own top-level module (since it was somewhat oddly grouped with other "socket"-related code)
### signal module

- cleaned up the `signal` module
- modify `isAlive` to call the new `is_alive` command (supported by both signalc and a new patch to our fork of signald) instead of the vestigial `version` command
## signald

- support `is_alive`

## signalc

### message queues
- in both `SignalSender` and `SignalReceiver`, add an atomic `messagesInFlight` counter that we increment when starting to either send or decrypt a message (respectively) and decrement once sending or decrypting has completed
- expose a `drain` method on both classes (delegating to `MessageQueues#drain`) which polls the `messagesInFlight` counters every 200ms until they have reached 0 or a 2 minute timeout has been reached (whichever is sooner) -- see the sketch below
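to make the counter/drain mechanics concrete, here is a minimal, hypothetical sketch of the idea -- the class names and signatures are illustrative, not the actual signalc code:

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.delay
import kotlinx.coroutines.withTimeoutOrNull

// each queue bumps `messagesInFlight` when it starts sending or decrypting a
// message and decrements it once that work completes
class MessageQueue {
    val messagesInFlight = AtomicInteger(0)

    suspend fun <T> track(work: suspend () -> T): T {
        messagesInFlight.incrementAndGet()
        try {
            return work()
        } finally {
            messagesInFlight.decrementAndGet()
        }
    }
}

// poll the in-flight counters every 200ms until they all reach 0, or give up
// after a 2 minute timeout; returns true if the queues drained in time
suspend fun drain(queues: List<MessageQueue>, timeoutMillis: Long = 120_000L): Boolean =
    withTimeoutOrNull(timeoutMillis) {
        while (queues.any { it.messagesInFlight.get() > 0 }) {
            delay(200L)
        }
        true
    } ?: false
```

in the actual change, `SignalSender` and `SignalReceiver` each expose `drain`, delegating to a shared helper in `MessageQueues`.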
### signal receiver

- add an `#unsubscribeAll` method to call during the shutdown sequence
### application shutdown sequence

- add calls to unsubscribe from all incoming messages, then drain the send/receive message queues in `Application#stop` (reporting whether the drain succeeded and, if not, reporting how many messages were dropped)
- add a `shutDownHook` thread that is triggered by `SIGTERM` and calls `Application#stop` to give the application a chance to shut down gracefully before exiting (see the sketch below)
- modify `SocketReceiver#abort` to call `System.exit` to trigger stop and ensure we exit after it runs
- take advantage of this graceful shutdown hook to also close our db connections (which we didn't do before but should have!)
- in the process of making the above changes, i moved our `Mocks` singleton from an inner class on `Application` to its own standalone singleton, where it doesn't clutter up our app lifecycle logic!
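for orientation, here is a minimal, hypothetical sketch of how the hook fits together -- the names below are illustrative rather than the real `Application` code. because `System.exit` also runs JVM shutdown hooks, the `abort` path and the `SIGTERM` path both end up in the same stop logic:

```kotlin
import kotlin.concurrent.thread

object AppSketch {
    fun stop() {
        // unsubscribe from incoming messages, drain the send/receive queues,
        // then close db connections, sockets, and other resources
        println("draining queues and closing resources...")
    }

    fun run() {
        // runs when the JVM receives SIGTERM (or exits via System.exit),
        // giving the app a chance to shut down gracefully before the process dies
        Runtime.getRuntime().addShutdownHook(
            thread(start = false, name = "shutdown-hook") { stop() }
        )
        // ... boot the rest of the application ...
    }
}
```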
### socket receiver

- add `#isAlive` (which simply echoes back any messages of this type) to signal aliveness to signalboost after restart (in the healthcheck failure scenario) -- see the sketch below
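a minimal, hypothetical sketch of the echo behavior (the message shape and handler name are illustrative, not the actual socket protocol types):

```kotlin
// illustrative only: echo any is_alive request straight back over the socket so
// that signalboost knows signalc is up again after a restart
fun handleIsAlive(rawRequest: String, send: (String) -> Unit) {
    if (rawRequest.contains("\"is_alive\"")) {
        send(rawRequest)
    }
}
```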
### socket server

- add code to delete the socket descriptor in the boot sequence, which we used to do in the entrypoint script (but cannot cleanly accomplish when we run the app in a jar -- and which is accomplished in a much more self-contained/cohesive manner this way!); see the sketch below
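a minimal, hypothetical sketch of that boot-time cleanup -- the function name is illustrative, and the path argument stands in for whatever socket descriptor path signalboost and signalc share:

```kotlin
import java.nio.file.Files
import java.nio.file.Paths

// illustrative only: remove any stale unix socket descriptor left over from a
// previous run so the socket server can bind a fresh socket at the same path
fun deleteStaleSocketDescriptor(socketDescriptorPath: String) {
    if (Files.deleteIfExists(Paths.get(socketDescriptorPath))) {
        println("deleted stale socket descriptor at $socketDescriptorPath")
    }
}
```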
## tooling

### reseed scripts
- in order to QA this MR, i used the load test framework to simulate a shutdown in the midst of sending a large batch of messages (in order to see if they would be successfully drained)
- in so doing, i had to run against a database built from a sql dump that did not reflect our current schema
- to work around this (and provide a less brittle loadtest toolchain), i modified `setup` to run migrations on top of whatever the initial sql dump was (this should get us pretty far?)
### docker images

- in order to QA the `SIGTERM`/restart scenario above, it was necessary to create a docker image with a signalc jar running as the entrypoint (else when docker issued `SIGTERM` it would be handled by gradle instead of signalc, and our shutdown hook would never run)
- as such, i created a new `signalc-dev` image and a `signalc` image, which has a jar artifact of signalc baked in
- i also modified both our dev (`docker-compose-sc`) and sc loadtest tooling to allow for running either "dev" (gradle-built) or "pinned" (baked-in jar) versions of signalc
### discarded envelope code
- during the course of playing this card, i introduced, and then removed, a lot of logic for serializing, caching (and potentially re-dispatching) incoming envelopes (following the lead of signal-cli, et al.) before ultimately deciding that simply draining the receive queue was a better design for our context
- in the likely event that we need to serialize and store envelopes in the future (when implementing the store-and-replay logic for a "locked" protocol store as part of canopy), we will likely want to retrieve this code
- to make that easier, i removed it all in one atomic commit, which can be retrieved by reverting 0e8e3271