per-channel health checks
value
- as busy sysadmins of a system that is likely going to be under stress for the medium term, we want an at-a-glance measure of whether any channels are down, so that we can engage in perf mitigations and communicate with users in the event of any outages or service degradation
behavior
metrics gathering
- WHEN the application starts
- THEN a job runs EVERY 15 MINUTES that causes the DIAGNOSTICS channel to ping every channel and wait for a response (likely by sending INFO, possibly by adding a HEALTH command)
- IF a response is received within 15 minutes, THEN the application sets a prometheus gauge that records the amount of time it took to get a response
- IF no response is received within that window, THEN the gauge is set to 0, which signals a non-responsive channel
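
The gathering rules above can be sketched roughly as follows. This is only an illustration of the gauge logic, not the application's implementation: `run_healthchecks`, the `ping` callback, and the example channel numbers are all hypothetical names, and a plain dict stands in for the prometheus gauge. The `ping` callback represents whatever the DIAGNOSTICS channel ends up doing (sending INFO or a new HEALTH command) and is assumed to return the response time in seconds, or None if nothing came back.

```python
from typing import Callable, Dict, Iterable, Optional

# The spec allows a channel up to 15 minutes to respond.
HEALTHCHECK_TIMEOUT_SECONDS = 15 * 60


def run_healthchecks(
    channels: Iterable[str],
    ping: Callable[[str, float], Optional[float]],
    timeout: float = HEALTHCHECK_TIMEOUT_SECONDS,
) -> Dict[str, float]:
    """Ping every channel once and return one gauge value per channel.

    `ping(channel, timeout)` is a hypothetical stand-in for the DIAGNOSTICS
    channel sending INFO (or HEALTH). It returns the response time in
    seconds, or None when no response arrived.
    """
    gauges: Dict[str, float] = {}
    for channel in channels:
        latency = ping(channel, timeout)
        if latency is not None and latency <= timeout:
            # Responsive: the gauge records how long the response took.
            gauges[channel] = latency
        else:
            # No response in time: 0 signals a non-responsive channel.
            gauges[channel] = 0.0
    return gauges


def fake_ping(channel: str, timeout: float) -> Optional[float]:
    # Hypothetical canned responses, for illustration only.
    responses = {"+15550000001": 2.5, "+15550000002": None}
    return responses[channel]


gauges = run_healthchecks(["+15550000001", "+15550000002"], fake_ping)
```

In the real job, a scheduler would call something like `run_healthchecks` every 15 minutes and copy each value into a per-channel gauge. One consequence of using 0 as the non-responsive marker is that a dashboard cannot distinguish "no response" from an (implausible) instantaneous response, which seems acceptable here.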
metrics reading
- WHEN a sysadmin logs into grafana
- THEN they will see a Signalboost Health dashboard that shows a gauge for every channel showing:
- (1) is the channel responsive (green) or not (red)
- (2) how long is the current response latency (zero if non-responsive)
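
For illustration, the mapping from gauge value to what the dashboard shows could look like the sketch below. In practice this would most likely be configured in grafana itself rather than written as code; `dashboard_rows` and the example channel names are hypothetical, and the only rule taken from the spec is that a gauge of 0 means non-responsive.

```python
from typing import Dict, List, Tuple


def dashboard_rows(gauges: Dict[str, float]) -> List[Tuple[str, str, float]]:
    """Turn per-channel gauge values into (channel, status, latency) rows.

    A gauge of 0 means the channel did not respond, so it is shown as red
    with a latency of zero; any positive value is shown as green.
    """
    rows = []
    for channel, latency in sorted(gauges.items()):
        status = "green" if latency > 0 else "red"
        rows.append((channel, status, latency))
    return rows


rows = dashboard_rows({"channel-b": 0.0, "channel-a": 1.5})
```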
Edited by aguestuser