
[376] compartmentalize auto-restarts by socket shard

aguestuser requested to merge 376-compartmentalized-auto-restarts into main

context:

  • we would like to be more resilient to restarts so that we can push concurrency more heavily under load (which often triggers concurrency errors that crash a channel and trigger an auto-restart)
  • this MR introduces logic for auto-restarts that compartmentalizes each restart to the socket shard in which the channel failure occurred
  • upshot: the system is more resilient to restarts b/c fewer channels overall are affected by any given restart

changes:

  • if healthchecks fail, group fatal failures by socket id and pass each group to the refactored _restartAndNotify (which now accepts a socket id and the failed channel number), which in turn passes them to the refactored restart
  • restart no longer shuts down the entire app: it only unsubscribes from the channels in the affected shard, aborts the corresponding signald instance, restarts the corresponding socket pool (using new convenience methods on app), then re-subscribes to those channels
  • there is some fancy lodash footwork involved. perhaps we can simplify later! :)
  • side-effect: make healthchecks happen less frequently in dev to get cleaner logs
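
For reviewers who want the shape of the new flow without reading the diff, here is a rough sketch of the grouping and per-shard restart described above. It is not the code in this MR: the health check result shape (`channelPhoneNumber`, `socketId`, `isFatal`), the `deps` object, and every function name below are hypothetical stand-ins, and it uses plain JS instead of lodash. It also assumes a list of failed channels per socket, which may differ from what _restartAndNotify actually receives.

```javascript
// Hypothetical result shape: { channelPhoneNumber, socketId, isFatal }.
// None of these names are taken from the actual diff.

// Group fatal failures by the socket shard they occurred on.
// Returns a Map of socketId -> [channelPhoneNumber, ...].
const groupFatalFailuresBySocket = results =>
  results
    .filter(r => r.isFatal)
    .reduce((acc, r) => {
      const failed = acc.get(r.socketId) || []
      return acc.set(r.socketId, [...failed, r.channelPhoneNumber])
    }, new Map())

// Restart one shard only. `deps` stands in for the app's real services
// (assumed interface): channelsOnSocket, unsubscribe, abortSignald,
// restartSocketPool, subscribe.
const restartShard = async (socketId, deps) => {
  const channels = deps.channelsOnSocket(socketId)
  // 1. unsubscribe from every channel in the affected shard
  for (const c of channels) await deps.unsubscribe(c, socketId)
  // 2. abort the signald instance backing this shard
  await deps.abortSignald(socketId)
  // 3. restart the socket pool for this shard
  await deps.restartSocketPool(socketId)
  // 4. re-subscribe to the same channels
  for (const c of channels) await deps.subscribe(c, socketId)
  return channels
}

// Restart (and notify about) only the shards that had a fatal failure.
// Shards without fatal failures are never touched.
const restartFailedShards = async (results, deps) => {
  const failuresBySocket = groupFatalFailuresBySocket(results)
  for (const [socketId, failedChannels] of failuresBySocket) {
    await restartShard(socketId, deps)
    await deps.notify(socketId, failedChannels)
  }
  return [...failuresBySocket.keys()]
}
```

The point of the sketch is the blast radius: a fatal failure on one socket only cycles the channels on that socket, so channels on healthy shards keep their subscriptions.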

Closes #376 (closed)
