
[#322] Resolve "per-channel health checks"

aguestuser requested to merge 322-per-channel-health-checks into master

Closes #322 (closed)

context

value

  • as busy sysadmins of a system that is likely to be under stress for the medium term, we want an at-a-glance measure of whether any channels are down, so that we can engage in perf mitigations and communicate with users in the event of outages or service degradation

designed behavior

metrics gathering

  • WHEN the application starts
  • THEN a job runs EVERY 15 MINUTES that causes the DIAGNOSTICS channel to ping every channel and wait for a response (likely by sending INFO, possibly by adding a HEALTH command)
  • IF a response is received within 15 minutes, THEN the application sets a Prometheus gauge that records how long the response took
  • IF no response is received, THEN that gauge is set to -1, signaling a non-responsive channel
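
The recurring job described above could be wired up roughly as follows. This is a minimal sketch, not the code in this MR: `startHealthcheckJob`, its option names, and the injectable `schedule` parameter are all hypothetical, and `sendHealthchecks` is passed in as a stand-in for whatever pings the channels.

```javascript
// Hypothetical helper: names and defaults are illustrative only.
const FIFTEEN_MINUTES = 15 * 60 * 1000

// Starts a repeating job. `schedule` defaults to setInterval but can be
// swapped for a fake in tests so that no real time has to pass.
const startHealthcheckJob = ({ sendHealthchecks, intervalMs = FIFTEEN_MINUTES, schedule = setInterval }) => {
  const runOnce = async () => {
    try {
      await sendHealthchecks()
    } catch (err) {
      // Swallow errors so one failed round does not kill the recurring job.
      console.error('healthcheck round failed:', err)
    }
  }
  return schedule(runOnce, intervalMs)
}
```

Injecting the scheduler keeps the interval logic testable without waiting 15 minutes; in production the default `setInterval` would be used.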

metrics reading

  • WHEN a sysadmin logs into Grafana
  • THEN they will see a Signalboost Health dashboard that shows a gauge for every channel showing:
    • (1) whether the channel is responsive
    • (2) the current response latency (-1 if non-responsive)
    • (3) the history of previous response latencies

changes

new configs:

  • add DIAGNOSTICS_CHANNEL_NUMBER to .env and diagnosticsPhonenumber to configs.signal
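
As a rough illustration of how the env var might feed the config key, here is a sketch. Only the names `DIAGNOSTICS_CHANNEL_NUMBER` and `diagnosticsPhonenumber` come from this MR; the `buildSignalConfig` function and the fail-fast behavior on a missing value are assumptions.

```javascript
// Hypothetical loader: maps the .env entry onto the signal config key.
const buildSignalConfig = (env = process.env) => {
  const diagnosticsPhonenumber = env.DIAGNOSTICS_CHANNEL_NUMBER
  if (!diagnosticsPhonenumber) {
    // Assumption: failing fast is preferable to silently skipping healthchecks.
    throw new Error('DIAGNOSTICS_CHANNEL_NUMBER must be set to run healthchecks')
  }
  return { diagnosticsPhonenumber }
}
```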

in metrics:

  • add gauge support and use it to create a healthcheck gauge
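
For readers unfamiliar with gauges: unlike a counter, a gauge can move up or down, so each write overwrites the previous value for a given channel. The class below is a tiny in-memory stand-in written only to show that behavior. It is not a Prometheus client API, and the metric name is made up.

```javascript
// In-memory stand-in for a gauge, for illustration only.
class Gauge {
  constructor(name) {
    this.name = name
    this.values = new Map() // channel phone number -> latest value
  }

  // Overwrites whatever was recorded for this channel before.
  set(channelPhoneNumber, value) {
    this.values.set(channelPhoneNumber, value)
  }

  get(channelPhoneNumber) {
    return this.values.get(channelPhoneNumber)
  }
}

const healthcheckGauge = new Gauge('channel_health') // hypothetical metric name
healthcheckGauge.set('+15555550101', 3.2)
healthcheckGauge.set('+15555550101', 1.7) // replaces 3.2, does not add to it
```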

add diagnostics module with:

  • diagnostics.sendHealthchecks:
    • sends a health check from the diagnostics phone number to every other channel phone number
    • sets a gauge with the response time that is returned (sets -1 on timeout)
    • sends an alert to admins of the diagnostics channel if healthcheck times out
  • diagnostics.respondToHealthcheck:
    • converts a healthcheck to a healthcheck response
    • sends the response to the diagnostics phone number from the channel being checked
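
The sending half of the module can be sketched like this. Every dependency is injected so the example stays self-contained; the parameter names (`healthcheck`, `setGauge`, `alertAdmins`, and so on) are hypothetical and the real module presumably imports its collaborators instead.

```javascript
const TIMEOUT_SENTINEL = -1

// Checks every channel except the diagnostics channel itself, records each
// response time in the gauge, and alerts admins about any timeout.
const sendHealthchecks = async ({
  diagnosticsPhoneNumber,
  channelPhoneNumbers,
  healthcheck, // resolves with a response time, or -1 on timeout
  setGauge,
  alertAdmins,
}) => {
  const targets = channelPhoneNumbers.filter(n => n !== diagnosticsPhoneNumber)
  return Promise.all(
    targets.map(async channelPhoneNumber => {
      const responseTime = await healthcheck(channelPhoneNumber)
      setGauge(channelPhoneNumber, responseTime)
      if (responseTime === TIMEOUT_SENTINEL) await alertAdmins(channelPhoneNumber)
      return { channelPhoneNumber, responseTime }
    }),
  )
}
```

Running the checks through `Promise.all` means one slow channel does not delay the checks on the others.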

in signal module, add healthcheck function that:

  • is called by diagnostics.sendHealthchecks
  • sends a healthcheck message from the diagnostics channel to a checked channel
  • registers a callback to return its response time to diagnostics.sendHealthchecks
  • if the callback times out, catches the rejected promise and instead resolves with -1, a sentinel value that tells diagnostics.sendHealthchecks the healthcheck timed out
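
The "reject on timeout, then resolve with -1" pattern might look roughly like the sketch below. `send` and `awaitResponse` are hypothetical stand-ins for writing to the signald socket and registering a callback, and measuring in milliseconds is an assumption.

```javascript
const TIMEOUT_SENTINEL = -1

// Rejects after `ms` milliseconds unless `promise` settles first.
const withTimeout = (promise, ms) => {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('timeout')), ms)
  })
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

const healthcheck = async ({ send, awaitResponse, timeoutMs }) => {
  const start = Date.now()
  // Register interest in the response before sending, so a fast reply
  // cannot arrive while nothing is listening for it.
  const result = withTimeout(awaitResponse(), timeoutMs)
    .then(() => Date.now() - start) // response time in ms (assumed unit)
    .catch(() => TIMEOUT_SENTINEL) // note: any rejection maps to -1 here
  await send()
  return result
}
```

Resolving with a sentinel instead of rejecting lets the caller treat a timeout as an ordinary gauge value rather than as an exception.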

in callbacks module:

  • callback originator generates a uuid, includes it in the healthcheck body, and uses it as the id in the callback registry's compound id
  • we assume that the channel under check responds to a healthcheck message by echoing the incoming message, replacing the text "healthcheck" with "healthcheck_response" but keeping the uuid
  • payoff: the callback handler recognizes a healthcheck response as any incoming message with the text "healthcheck_response" in the body and a uuid that was registered for a pending healthcheck
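
The uuid round trip can be illustrated with a small registry. This is a sketch under assumptions: the wire format (`healthcheck <uuid>`), the compound id layout, and all function names are invented for the example, and real signald messages carry more structure than a bare string.

```javascript
const { randomUUID } = require('crypto')

const HEALTHCHECK = 'healthcheck'
const HEALTHCHECK_RESPONSE = 'healthcheck_response'

const registry = new Map() // compound id -> resolve function

// Hypothetical compound id: message type plus uuid.
const compoundId = uuid => `${HEALTHCHECK}-${uuid}`

// Originator side: mint a uuid, embed it in the body, register a callback.
const registerHealthcheck = () => {
  const uuid = randomUUID()
  const response = new Promise(resolve => registry.set(compoundId(uuid), resolve))
  return { body: `${HEALTHCHECK} ${uuid}`, response }
}

// Checked-channel side: echo the body, swapping only the keyword.
const toHealthcheckResponse = body => body.replace(HEALTHCHECK, HEALTHCHECK_RESPONSE)

// Handler side: returns true only if the message settles a pending healthcheck.
const handleIncoming = body => {
  const [keyword, uuid] = body.split(' ')
  const resolve = registry.get(compoundId(uuid))
  if (keyword !== HEALTHCHECK_RESPONSE || !resolve) return false
  registry.delete(compoundId(uuid))
  resolve()
  return true
}
```

Because the uuid survives the echo, the handler needs no other state to match a response to its originating check.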

in dispatcher module:

  • detect healthchecks and healthcheck responses coming off the wire from signald
  • route them to the appropriate diagnostics handler and prevent them from being processed as relayable messages
  • side-effect: make the increasingly crowded "detecting" section of dispatcher.dispatch more readable by extracting rate limit retry logic into a helper function
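
A stripped-down version of the detect-then-route step might look like this. The string-prefix detection and the handler names are assumptions made for the sketch; the real dispatcher inspects parsed signald messages.

```javascript
// Classifies an inbound message body. The response check must come first,
// because "healthcheck_response" also starts with "healthcheck".
const classify = body => {
  if (body.startsWith('healthcheck_response')) return 'HEALTHCHECK_RESPONSE'
  if (body.startsWith('healthcheck')) return 'HEALTHCHECK'
  return 'RELAYABLE'
}

// Routes diagnostics traffic to its handlers so it is never relayed to
// channel subscribers as an ordinary message.
const dispatch = (body, handlers) => {
  switch (classify(body)) {
    case 'HEALTHCHECK':
      return handlers.respondToHealthcheck(body)
    case 'HEALTHCHECK_RESPONSE':
      return handlers.handleHealthcheckResponse(body)
    default:
      return handlers.relay(body)
  }
}
```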

testing:

  • provide a fancy integration test for round-trip healthcheck -> healthcheck response behavior. it is our first test to simulate both message sending and response, i.e. it puts messages back on the wire instead of simply stubbing socket.write (likely useful for future integration tests!)
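
The idea of putting messages back on the wire can be sketched with a fake socket built on Node's `EventEmitter`. This is not the test helper from this MR: `makeLoopbackSocket` and the `respond` callback are hypothetical names.

```javascript
const { EventEmitter } = require('events')

// Fake socket: instead of stubbing write() to do nothing, it feeds the
// result of `respond(msg)` back as inbound 'data'.
const makeLoopbackSocket = respond => {
  const socket = new EventEmitter()
  socket.written = []
  socket.write = msg => {
    socket.written.push(msg)
    const reply = respond(msg)
    // Defer so the reply arrives after write() returns, as on a real socket.
    if (reply !== undefined) setImmediate(() => socket.emit('data', reply))
    return true
  }
  return socket
}
```

A test can then write a healthcheck, let `respond` play the role of the checked channel, and assert on what the application does when the reply arrives.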
