[#322] Resolve "per-channel health checks"
Closes #322
context
value
- as busy sysadmins of a system that is likely going to be under stress for the medium term, we want a way to get an at-a-glance measure of whether any channels are down, so that we can engage in perf mitigations and communicate with users in the event of any outages or service degradation
designed behavior
metrics gathering
- WHEN the application starts
- THEN a job runs EVERY 15 MINUTES that causes the DIAGNOSTICS channel to ping every channel and wait for a response (likely by sending INFO, possibly by adding a HEALTH command)
  - IF a response is received within 15 minutes, THEN the application sets a prometheus gauge that records the amount of time it took to get a response
  - IF no response is received, THEN that gauge is set to -1, which signals a non-responsive channel
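A minimal sketch of the recurring job described above, assuming a plain `setInterval` timer. `startHealthcheckJob` and its options are hypothetical names for illustration, not necessarily what this MR uses. Note that `setInterval` first fires one full interval after startup, not immediately.

```javascript
// Hypothetical names throughout; a sketch, not the actual implementation.
const FIFTEEN_MINUTES = 15 * 60 * 1000

// Starts a recurring healthcheck job and returns a function that stops it.
// The timer functions are injectable so the schedule can be tested without waiting.
const startHealthcheckJob = (
  runHealthchecks,
  { interval = FIFTEEN_MINUTES, setIntervalFn = setInterval, clearIntervalFn = clearInterval } = {},
) => {
  const timer = setIntervalFn(runHealthchecks, interval)
  return () => clearIntervalFn(timer)
}
```

Returning a stop function keeps the timer handle private and makes it easy to shut the job down in tests or on app teardown.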
metrics reading
- WHEN a sysadmin logs into grafana
- THEN they will see a Signalboost Health dashboard that shows a gauge for every channel showing:
  - (1) is the channel responsive or not
  - (2) how long is the current response latency (-1 if non-responsive)
  - (3) history of previous response latencies
changes
new configs:
- add DIAGNOSTICS_CHANNEL_NUMBER to .env and diagnosticsPhonenumber to configs.signal
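One way the new config could be wired up, as a sketch. Only DIAGNOSTICS_CHANNEL_NUMBER and diagnosticsPhonenumber come from this MR; the `buildSignalConfig` helper and the placeholder phone number are hypothetical.

```javascript
// Hypothetical helper: maps the new env var onto the signal config key.
// Takes `env` as an argument so it can be exercised without touching process.env.
const buildSignalConfig = (env = process.env) => ({
  diagnosticsPhonenumber: env.DIAGNOSTICS_CHANNEL_NUMBER,
})

// Example with a placeholder number:
const exampleConfig = buildSignalConfig({ DIAGNOSTICS_CHANNEL_NUMBER: '+15555550100' })
```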
in metrics:
- add gauge support and use it to create a healthcheck gauge
add diagnostics module with:
- diagnostics.sendHealthchecks:
  - sends a health check from the diagnostics phone number to every other channel phone number
  - sets a gauge with the response time that is returned (sets -1 on timeout)
  - sends an alert to admins of the diagnostics channel if healthcheck times out
- diagnostics.respondToHealthcheck:
  - converts a healthcheck to a healthcheck response
  - sends the response to the diagnostics phone number from the channel being checked
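The gauge-and-alert step of sendHealthchecks can be sketched roughly as below. This uses an in-memory stand-in rather than a real prometheus client, and `makeGauge`, `recordHealthcheck`, and the alert wording are all hypothetical.

```javascript
// Sentinel recorded when a channel does not respond in time (per the list above).
const TIMEOUT_SENTINEL = -1

// Minimal in-memory stand-in for a gauge labelled by channel phone number (hypothetical).
const makeGauge = () => {
  const values = new Map()
  return {
    set: (channelPhoneNumber, value) => values.set(channelPhoneNumber, value),
    get: channelPhoneNumber => values.get(channelPhoneNumber),
  }
}

// Records one healthcheck result on the gauge.
// Returns an alert message for diagnostics admins on timeout, otherwise null.
const recordHealthcheck = (gauge, channelPhoneNumber, responseTime) => {
  gauge.set(channelPhoneNumber, responseTime)
  return responseTime === TIMEOUT_SENTINEL
    ? `Channel ${channelPhoneNumber} failed to respond to healthcheck`
    : null
}
```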
in signal module, add healthcheck function that:
- is called by diagnostics.sendHealthchecks
- sends a healthcheck message from the diagnostics channel to a checked channel
- registers a callback to return its response time to diagnostics.sendHealthchecks
- if callback times out, catches rejected promise and instead resolves with -1 as a sentinel value to signal to diagnostics.sendHealthchecks that the healthcheck timed out
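The "catch the rejection, resolve with -1" pattern could look something like this. It is a sketch under assumptions: `awaitResponse` is a hypothetical function returning a promise of the response time, and the timeout is modelled with `Promise.race` rather than the actual callback registry.

```javascript
const TIMEOUT_SENTINEL = -1

// Resolves with the response time, or with -1 if no response arrives within `timeoutMs`.
// Caveat of this sketch: any rejection (not only a timeout) is mapped to the sentinel.
const healthcheck = (awaitResponse, timeoutMs) => {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('healthcheck timed out')), timeoutMs)
  })
  return Promise.race([awaitResponse(), timeout])
    .catch(() => TIMEOUT_SENTINEL)
    .finally(() => clearTimeout(timer))
}
```

Resolving with a sentinel instead of rejecting means the caller can treat every result as a number to write to the gauge, with no separate error path.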
in callbacks module:
- callback originator generates a uuid, includes it in the healthcheck body, and uses it as the id in the callback registry's compound id
- we assume that the channel under check responds to a healthcheck message by echoing the incoming healthcheck message but replacing the text "healthcheck" with "healthcheck_response" (keeping the uuid)
- payoff: callback handler recognizes a healthcheck response as any incoming message with the text "healthcheck_response" in the body and a uuid that was registered for a pending healthcheck
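A rough sketch of the uuid handshake described above. The message body format (keyword, space, uuid), the `healthcheck-<uuid>` registry key, and all function names are assumptions made for illustration.

```javascript
const { randomUUID } = require('crypto')

// Hypothetical registry keyed by a compound id of message type + uuid.
const makeCallbackRegistry = () => {
  const pending = new Map()
  const keyOf = id => `healthcheck-${id}`

  // Originator: generate a uuid, register a callback under it, return the body to send.
  const registerHealthcheck = onResponse => {
    const id = randomUUID()
    pending.set(keyOf(id), onResponse)
    return `healthcheck ${id}`
  }

  // Channel under check: echo the message, swapping the keyword but keeping the uuid.
  const toResponse = body => body.replace('healthcheck', 'healthcheck_response')

  // Handler: only a "healthcheck_response" carrying a pending uuid counts as a response.
  const handleIncoming = body => {
    const match = /^healthcheck_response (\S+)$/.exec(body)
    if (!match || !pending.has(keyOf(match[1]))) return false
    const onResponse = pending.get(keyOf(match[1]))
    pending.delete(keyOf(match[1]))
    onResponse()
    return true
  }

  return { registerHealthcheck, toResponse, handleIncoming }
}
```

Deleting the entry once it is handled means a replayed or duplicated response is ignored rather than firing the callback twice.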
in dispatcher module:
- detect healthchecks and healthcheck responses coming off the wire from signald
- route them to the appropriate diagnostics handler and prevent them from being processed as relayable messages
- side-effect: make the increasingly crowded "detecting" section of dispatcher.dispatch more readable by extracting rate limit retry logic into a helper function
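The detect-and-route step might be sketched as follows. The body format (keyword followed by a space and a uuid), the `classify` helper, and the handler names passed in are assumptions; the real dispatcher.dispatch handles many more cases.

```javascript
// Assumed wire format: "<keyword> <uuid>" in the message body.
const classify = body => {
  if (body.startsWith('healthcheck_response ')) return 'HEALTHCHECK_RESPONSE'
  if (body.startsWith('healthcheck ')) return 'HEALTHCHECK'
  return 'RELAYABLE'
}

// Routes diagnostics traffic to its handlers; only relayable messages reach `relay`.
const dispatch = (body, { respondToHealthcheck, handleHealthcheckResponse, relay }) => {
  switch (classify(body)) {
    case 'HEALTHCHECK':
      return respondToHealthcheck(body)
    case 'HEALTHCHECK_RESPONSE':
      return handleHealthcheckResponse(body)
    default:
      return relay(body)
  }
}
```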
testing:
- provide a fancy integration test for roundtrip healthcheck -> healthcheck response behavior that is our first test to simulate both message sending and response, i.e. it puts messages back on the wire instead of just stubbing socket.write (likely useful for future integration tests!)
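To illustrate the "put messages back on the wire" idea, here is a sketch of a loopback fake socket. It is not the test helper from this MR: `makeLoopbackSocket` and the `respond` callback are hypothetical, and the reply is emitted synchronously for simplicity where a real socket would be asynchronous.

```javascript
const { EventEmitter } = require('events')

// Fake socket: instead of a stub that swallows writes, each written message is passed
// to `respond`, and any non-null reply is emitted back as incoming 'data'.
const makeLoopbackSocket = respond => {
  const socket = new EventEmitter()
  socket.write = data => {
    const reply = respond(data)
    if (reply !== null) socket.emit('data', reply)
    return true
  }
  return socket
}
```

A test can then register a `data` listener, write a healthcheck, and assert on the response that comes back, exercising both the send and receive paths.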