context

value

as busy sysadmins of a system that is likely going to be under stress for the medium term, we want a way to get an at-a-glance measure of whether any channels are down, so that we can engage in perf mitigations and communicate with users in the event of any outages or service degradation

WHEN the application starts
THEN a job runs EVERY 15 MINUTES that causes the DIAGNOSTICS channel to ping every channel and wait for a response (likely by sending INFO, possibly by adding a HEALTH command)
IF a response is received within 15 minutes, THEN the application sets a prometheus gauge that records the amount of time it took to get a response
IF no response is received, THEN that gauge is set to -1, which signals a non-responsive channel
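
a minimal sketch of the recurring job described above, assuming the healthcheck run is exposed as a single async function. `startHealthcheckJob`, `runHealthchecks` and `onError` are hypothetical names, and the first run happening one full interval after startup (rather than immediately) is also an assumption:

```typescript
// 15 minutes, per the spec above.
const HEALTHCHECK_INTERVAL_MS = 15 * 60 * 1000;

// Hypothetical helper: starts the recurring job and returns a function that stops it.
// `runHealthchecks` stands in for whatever pings every channel and sets the gauges.
function startHealthcheckJob(
  runHealthchecks: () => Promise<void>,
  intervalMs: number = HEALTHCHECK_INTERVAL_MS,
  onError: (err: unknown) => void = (err) => console.error(err),
): () => void {
  const timer = setInterval(() => {
    // A failed run should not kill the job; report it and wait for the next tick.
    runHealthchecks().catch(onError);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

the returned stop function is mostly there so tests (and a graceful shutdown) can cancel the timer.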

WHEN a sysadmin logs into grafana
THEN they will see a Signalboost Health dashboard that shows a gauge for every channel showing:
- (1) is the channel responsive or not
- (2) how long is the current response latency (-1 if non-responsive)
- (3) history of previous response latencies

new configs:

add DIAGNOSTICS_CHANNEL_NUMBER to .env and diagnosticsPhonenumber to configs.signal

in metrics:

add diagnostics module with:

diagnostics.sendHealthchecks:
- sends a health check from the diagnostics phone number to every other channel phone number
- sets a gauge with the response time that is returned (sets -1 on timeout)
- sends an alert to admins of the diagnostics channel if healthcheck times out
diagnostics.respondToHealthcheck:
- converts a healthcheck to a healthcheck response
- sends the response to the diagnostics phone number from the channel being checked
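
a rough sketch of the `sendHealthchecks` fan-out, under stated assumptions: a plain `Map` stands in for the prometheus gauge, and the `healthcheck` and `alertAdmins` parameters are hypothetical stand-ins for the signal-module function and the admin notifier. only the -1 timeout value comes from the notes above:

```typescript
// Sentinel the signal module resolves with when a healthcheck times out.
const HEALTHCHECK_TIMEOUT = -1;

// Hypothetical signature: resolves with a response time, or -1 on timeout.
type HealthcheckFn = (from: string, to: string) => Promise<number>;

async function sendHealthchecks(
  diagnosticsNumber: string,
  channelNumbers: string[],
  healthcheck: HealthcheckFn,
  gauge: Map<string, number>, // stand-in for a per-channel prometheus gauge
  alertAdmins: (message: string) => void,
): Promise<void> {
  // Check every channel except the diagnostics channel itself, in parallel.
  const targets = channelNumbers.filter((n) => n !== diagnosticsNumber);
  await Promise.all(
    targets.map(async (to) => {
      const responseTime = await healthcheck(diagnosticsNumber, to);
      gauge.set(to, responseTime);
      if (responseTime === HEALTHCHECK_TIMEOUT) {
        alertAdmins(`channel ${to} failed to respond to a healthcheck`);
      }
    }),
  );
}
```

running the checks in parallel keeps one slow channel from delaying the gauge updates for the rest.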

in signal module, add healthcheck function that:

- is called by diagnostics.sendHealthchecks
- sends a healthcheck message from the diagnostics channel to a checked channel
- registers a callback to return its response time to diagnostics.sendHealthchecks
- if the callback times out, catches the rejected promise and instead resolves with -1 as a sentinel value to signal to diagnostics.sendHealthchecks that the healthcheck timed out
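
the timeout-to-sentinel behavior can be sketched roughly like this. `withTimeoutSentinel` is a hypothetical helper, not the actual signal-module function, and it assumes the registered callback is exposed as a promise of the response time:

```typescript
// Sentinel value that tells diagnostics.sendHealthchecks the check timed out.
const TIMEOUT_SENTINEL = -1;

// Hypothetical helper: resolves with the response time, or with -1 if
// `pending` rejects or does not settle within `timeoutMs`. Never rejects.
function withTimeoutSentinel(pending: Promise<number>, timeoutMs: number): Promise<number> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<number>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error("healthcheck timed out")), timeoutMs);
  });
  return Promise.race([pending, timeout]).then(
    (responseTime) => {
      clearTimeout(timer); // don't leave the timer running after a response
      return responseTime;
    },
    () => {
      clearTimeout(timer);
      return TIMEOUT_SENTINEL; // swallow the rejection, report the sentinel
    },
  );
}
```

note that in this sketch any rejection (not only a timeout) maps to -1, so a failed send would also show up as a non-responsive channel.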

in callbacks module:

- callback originator generates a uuid, includes it in the healthcheck body, and uses it as the id in the callback registry compound id
- we assume that the channel under check responds to a healthcheck message by echoing the incoming healthcheck message but replacing the text "healthcheck" with "healthcheck_response" (but keeping the uuid)
- payoff: callback handler recognizes a healthcheck response as any incoming message with the text "healthcheck_response" in the body and a uuid that was registered for a pending healthcheck
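
a sketch of the uuid round trip, with several assumptions: the message body is taken to be `healthcheck <uuid>`, the registry is a plain `Map`, and all function names are hypothetical. the caller passes the id in here; real code would generate a uuid:

```typescript
type Resolver = (responseTimeMs: number) => void;

// Pending healthchecks, keyed by the uuid carried in the message body.
const pendingHealthchecks = new Map<string, { sentAt: number; resolve: Resolver }>();

// Assumed wire format: "healthcheck <uuid>".
function buildHealthcheck(id: string): string {
  return `healthcheck ${id}`;
}

// Channel under check: echo the message, swapping only the leading keyword.
function toHealthcheckResponse(body: string): string {
  return body.replace(/^healthcheck\b/, "healthcheck_response");
}

function registerHealthcheck(id: string, sentAt: number, resolve: Resolver): void {
  pendingHealthchecks.set(id, { sentAt, resolve });
}

// Returns true if `body` was a response to a pending healthcheck (and resolves
// its callback with the elapsed time); false for anything else.
function handleHealthcheckResponse(body: string, now: number): boolean {
  const match = /^healthcheck_response (\S+)$/.exec(body);
  if (!match) return false;
  const entry = pendingHealthchecks.get(match[1]);
  if (!entry) return false; // unknown or already-handled uuid
  pendingHealthchecks.delete(match[1]);
  entry.resolve(now - entry.sentAt);
  return true;
}
```

requiring both the keyword and a registered uuid means a stray user message containing "healthcheck_response" is not mistaken for a real response.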

in dispatcher module:

- detect healthchecks and healthcheck responses coming off the wire from signald
- route them to the appropriate diagnostics handler and prevent them from being processed as relayable messages
- side-effect: make the increasingly crowded "detecting" section of dispatcher.dispatch more readable by extracting rate limit retry logic into a helper function
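
the detection and routing step might look something like the sketch below. it assumes message bodies of the form `healthcheck <uuid>` and `healthcheck_response <uuid>`; `classifyMessage`, `routeMessage` and the `Handlers` shape are hypothetical and not the actual dispatcher.dispatch code:

```typescript
type MessageKind = "healthcheck" | "healthcheck_response" | "relayable";

// Decide what an incoming message body is, based only on its leading keyword.
function classifyMessage(body: string): MessageKind {
  if (/^healthcheck_response\s+\S+/.test(body)) return "healthcheck_response";
  if (/^healthcheck\s+\S+/.test(body)) return "healthcheck";
  return "relayable";
}

// Hypothetical handler bundle; the real handlers live in the diagnostics
// and messenger code.
interface Handlers {
  onHealthcheck(body: string): void;
  onHealthcheckResponse(body: string): void;
  relay(body: string): void;
}

// Route diagnostics traffic to its handlers so it is never relayed to subscribers.
function routeMessage(body: string, handlers: Handlers): MessageKind {
  const kind = classifyMessage(body);
  switch (kind) {
    case "healthcheck":
      handlers.onHealthcheck(body);
      break;
    case "healthcheck_response":
      handlers.onHealthcheckResponse(body);
      break;
    default:
      handlers.relay(body);
  }
  return kind;
}
```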

testing:

provide a fancy integration test for round-trip healthcheck -> healthcheck response behavior. this is our first test to simulate both message sending and response, i.e. it puts messages back on the wire instead of simply stubbing socket.write (likely useful for future integration tests!)