
[#473] signalc: respond gracefully to shutdowns

aguestuser requested to merge 473-sc-handle-restarts-gracefully into main

Closes #473

context

  • signalboost shuts down in 2 instances: (1) in response to failed healthchecks (to re-establish connections to all channel numbers), and (2) upon restarting for a deploy or the daily backup job
  • in these instances we want to ensure that we gracefully shut down all running signalc services. in particular, we want to ensure that no messages in flight are dropped and left unresent due to a sudden restart
  • this MR introduces logic to ensure that when we restart, whether due to receiving an abort message from signalboost (the healthcheck-failure scenario) or receiving SIGTERM (the restart scenario), we run a shutdown hook which:
    • shuts down all active resources (like db connections, sockets, etc.)
    • drains all messages in the SignalReceiver decryption queue and the SignalSender send queue before shutting down (or hits a 2 minute timeout)
    • if we time out before the message queues have been drained, quits with SIGKILL and logs a message that we can monitor in grafana to alert us that our timeout should likely be longer!
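roughly, the shape of that hook is something like the kotlin sketch below (Application, installShutdownHook, and drainAll are illustrative stand-ins, not the actual signalc identifiers):

```kotlin
import kotlinx.coroutines.runBlocking

// illustrative sketch only -- hypothetical names, not the actual signalc classes
class Application(private val resources: List<AutoCloseable>) {

    // register a JVM shutdown hook so that SIGTERM triggers a graceful stop
    fun installShutdownHook() =
        Runtime.getRuntime().addShutdownHook(Thread { runBlocking { stop() } })

    suspend fun stop() {
        // drainAll() polls in-flight counters until they reach 0 or a 2-minute budget expires
        val drained = drainAll()
        if (!drained) {
            // monitored log line: seeing this in grafana means the drain timeout is likely too short
            println("WARN: shutdown timed out before message queues were drained")
        }
        resources.forEach { it.close() } // db connections, sockets, etc.
    }

    private suspend fun drainAll(): Boolean = true // see the queue-drain sketch under "message queues"
}
```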

changes

signalboost

diagnostics module

  • modify restart logic to branch on signald vs. signalc clients
  • in the signalc case, we restart differently insofar as:
    • we don't manually unsubscribe all channels (since signalc's shutdown sequence does this)
    • we wait for signalc to shut down before closing our socket connection (b/c it reports shutdown progress to us over the socket)
    • instead of waiting a fixed time before assuming signalc has restarted (which was janky to begin with, but could be particularly bad if we tried to wait 5 seconds while signalc was draining its message queues for 2 min!), we infer precisely when signalc has shut down by observing when it closes its socket connection with signalboost (which we in turn infer from the absence of a file at the socket descriptor path)
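that last inference is just file-system polling. the actual implementation lives in signalboost's (javascript) socket module; the kotlin sketch below only illustrates the idea behind #awaitClose, with hypothetical parameter names:

```kotlin
import java.io.File
import kotlinx.coroutines.delay

// hypothetical sketch: poll until the unix socket descriptor file disappears,
// which we take to mean signalc has finished shutting down and closed the socket
suspend fun awaitClose(socketDescriptorPath: String, pollIntervalMs: Long = 100) {
    while (File(socketDescriptorPath).exists()) {
        delay(pollIntervalMs)
    }
}
```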

socket module

  • in order to support the new logic above for inferring signalc shutdown, we needed to add an #awaitClose method. as things stood, the only place to add this would have been in the app layer (along with stopSocket and #restartSocket)
  • rather than add onto an already broken design, i borrowed from the refactoring of the socket module we did in simulator and put all socket-related logic inside app.sockets
  • as a result of this refactor, the socket module is now stored as a stateful resource on app and accessed via app.sockets (instead of accessing socket.index#run as a singleton, later accessing socket.write#write as another singleton, and worrying about arcane import-order dependencies when trying to figure out where it is or isn't safe to add new logic!)
  • as a side effect of this refactor, we:
    • cleaned up the sockets module by adding some wrapper methods onto the pools collection returned by run
    • consolidated the logic for constructing new socket connection pools
    • made signalboost observe the same convention of naming individual socket connections as connections rather than sockets as we do in signalc
    • replaced our old (very anemic!) tests with something slightly more substantial and less mock-heavy
    • extracted sharding to its own top-level module (since it was somewhat oddly grouped with other "socket"-related code)

signal module

  • modify isAlive to call the new is_alive command (supported by both signalc and a new patch to our fork of signald) instead of the vestigial version

signald

  • support is_alive

signalc

message queues

  • in both SignalSender and SignalReceiver, add an atomic messagesInFlight counter that we increment when starting to either send or decrypt a message (respectively) and decrement once sending or decrypting has completed
  • expose a drain method on both classes (delegating to MessageQueues#drain) which polls the messagesInFlight counters every 200ms until they have reached 0 or a 2 minute timeout has been reached (whichever is sooner)
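a rough sketch of the counter-plus-polling idea (illustrative only: the names and timings follow the description above, but this is not the actual MessageQueues implementation):

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.delay
import kotlinx.coroutines.withTimeoutOrNull

// illustrative sketch of the drain logic described above
object DrainSketch {
    // incremented when a send/decrypt starts, decremented when it completes
    val messagesInFlight = AtomicInteger(0)

    // poll every 200ms until nothing is in flight, or give up after 2 minutes;
    // returns true if the queues drained, false if we hit the timeout
    suspend fun drain(pollIntervalMs: Long = 200, timeoutMs: Long = 120_000): Boolean =
        withTimeoutOrNull(timeoutMs) {
            while (messagesInFlight.get() > 0) delay(pollIntervalMs)
        } != null
}
```

in SignalSender and SignalReceiver, the send/decrypt work would then be bracketed by incrementAndGet() before the work and decrementAndGet() once it completes (e.g. in a finally block), so the counter reliably returns to 0.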

signal receiver

  • add an #unsubscribeAll method to call during shutdown sequence

application shutdown sequence

  • add calls to unsubscribe from all incoming messages, then drain the send/receive message queues in Application#stop (reporting whether the drain succeeded and, if not, how many messages were dropped)
  • add a shutDownHook thread that is triggered by SIGTERM and calls Application#stop to give the application a chance to shut down gracefully before exiting
  • modify SocketReceiver#abort to call System.exit to trigger stop and ensure we exit after it runs (see the sketch after this list)
  • take advantage of this graceful shutdown hook to also close our db connections (which we didn't do before but should have!)
  • in the process of making the above changes, i moved our Mocks singleton from an inner class on Application to its own standalone singleton, where it doesn't clutter up our app lifecycle logic!
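for the abort path specifically, the wiring is roughly the following (a hedged sketch with a simplified signature -- the real SocketReceiver#abort does more than this):

```kotlin
import kotlin.system.exitProcess

// sketch: the abort path (healthcheck failure) has no SIGTERM to rely on, so it exits
// the JVM itself; System.exit fires the shutdown hook, which runs Application#stop,
// so both paths share one graceful-shutdown sequence
fun abort() {
    println("received abort from signalboost; shutting down...")
    exitProcess(0) // delegates to System.exit, which triggers the shutdown hook
}
```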

socket receiver

  • add #isAlive (which simply echoes back any message of this type) to signal aliveness to signalboost after a restart (in the healthcheck-failure scenario)
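a minimal sketch of that echo behavior (hypothetical handler signature; the real socket receiver deals in typed request/response messages):

```kotlin
// sketch: an is_alive request is simply echoed back over the socket so signalboost
// can tell that signalc has come back up after a restart
fun handleIsAlive(rawMessage: String, respond: (String) -> Unit) =
    respond(rawMessage) // echo the message unchanged to signal aliveness
```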

socket server

  • add code to delete the socket descriptor in the boot sequence, which we used to do in the entrypoint script (but cannot cleanly accomplish when we run the app from a jar -- and which is accomplished in a much more self-contained/cohesive manner this way!)
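a hedged sketch of that boot-time cleanup (path handling and names are placeholders):

```kotlin
import java.nio.file.Files
import java.nio.file.Paths

// sketch: before binding the unix socket, remove any stale descriptor file left over
// from a previous run (previously done in the entrypoint script, now part of the boot sequence)
fun deleteStaleSocketDescriptor(socketPath: String) {
    Files.deleteIfExists(Paths.get(socketPath)) // no-op if the file is already gone
}
```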

tooling

reseed scripts

  • in order to QA this MR, i used the load test framework to simulate a shutdown in the midst of sending a large batch of messages (to see whether they would be successfully drained)
  • in so doing, i had to run against a database built from a sql dump that did not reflect our current schema
  • to work around this (and provide a less brittle loadtest toolchain), i modified setup to run migrations on top of whatever the initial sql dump was (this should get us pretty far?)

docker images

  • in order to QA the SIGTERM/restart scenario above, it was necessary to create a docker image with a signalc jar running as the entrypoint (else when docker issued SIGTERM it would be handled by gradle instead of signalc, and our shutdown hook would never run)
  • as such, i created a new signalc-dev image and a signalc image, the latter with a jar artifact of signalc baked in
  • i also modified both our dev (docker-compose-sc) and sc loadtest tooling to allow for running either "dev" (gradle-built) or "pinned" (baked-in jar) versions of signalc

discarded envelope code

  • during the course of playing this card, i introduced, then removed, a lot of logic for serializing, caching, and potentially re-dispatching incoming envelopes (following the lead of signal-cli, et al.) before ultimately deciding that simply draining the receive queue was a better design for our context
  • in the likely event that we need to serialize and store envelopes in the future (when implementing the store-and-replay logic for a "locked" protocol store in implementing canopy), we will likely want to retrieve this code
  • to make that easier, i removed it all in one atomic commit, which can be retrieved by reverting 0e8e3271