per-channel health checks

value

as busy sysadmins of a system that is likely going to be under stress for the medium term, we want an at-a-glance measure of whether any channels are down, so that we can engage in perf mitigations and communicate with users in the event of any outages or service degradation

behavior

metrics gathering

WHEN the application starts
THEN a job runs EVERY 15 MINUTES that causes the DIAGNOSTICS channel to ping every channel and wait for a response (likely by sending INFO, possibly by adding a HEALTH command)
IF a response is received within 15 minutes, THEN the application sets a Prometheus gauge that records the amount of time it took to get a response
IF no response is received within 15 minutes, THEN that gauge is set to 0, which signals a non-responsive channel
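
A minimal Python sketch of the gathering rules above, not the project's implementation. Everything named here is an assumption made for illustration: `ping` stands for whatever the DIAGNOSTICS channel ends up sending (INFO or a new HEALTH command), and `response_time_gauge` is a plain dict standing in for a Prometheus gauge labelled by channel.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Iterable

# Both values come from the spec: run every 15 minutes, wait up to 15 minutes.
HEALTHCHECK_INTERVAL_SECONDS = 15 * 60
HEALTHCHECK_TIMEOUT_SECONDS = 15 * 60

# Hypothetical stand-in for a Prometheus gauge labelled by channel.
response_time_gauge: Dict[str, float] = {}

# Hypothetical signature: sends a ping to a channel and returns once it responds.
Ping = Callable[[str], Awaitable[None]]


async def check_channel(channel: str, ping: Ping,
                        timeout: float = HEALTHCHECK_TIMEOUT_SECONDS) -> float:
    """Ping one channel and record its response time, or 0 if it times out."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(ping(channel), timeout=timeout)
        elapsed = time.monotonic() - start
    except asyncio.TimeoutError:
        elapsed = 0.0  # 0 marks a non-responsive channel
    response_time_gauge[channel] = elapsed
    return elapsed


async def check_all_channels(channels: Iterable[str], ping: Ping,
                             timeout: float = HEALTHCHECK_TIMEOUT_SECONDS) -> Dict[str, float]:
    """Ping every channel concurrently and return the recorded values."""
    names = list(channels)
    values = await asyncio.gather(*(check_channel(c, ping, timeout) for c in names))
    return dict(zip(names, values))


async def run_healthchecks_forever(channels: Iterable[str], ping: Ping) -> None:
    """Start at application boot; repeats the check every 15 minutes."""
    names = list(channels)
    while True:
        await check_all_channels(names, ping)
        await asyncio.sleep(HEALTHCHECK_INTERVAL_SECONDS)
```

Because 0 doubles as the non-responsive sentinel, this sketch relies on every real response taking a measurable amount of time.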

metrics reading

WHEN a sysadmin logs into Grafana
THEN they will see a Signalboost Health dashboard that shows a gauge for every channel showing:
- (1) is the channel responsive (green) or not (red)
- (2) how long is the current response latency (zero if non-responsive)
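
The two readings above can both be derived from the single gauge value. The Python sketch below shows that mapping under the assumption that 0 means non-responsive and any positive value is a latency in seconds; `ChannelStatus`, `channel_status`, and `dashboard_rows` are hypothetical names, and a real dashboard would more likely express the same rule as a panel threshold.

```python
from typing import Dict, NamedTuple


class ChannelStatus(NamedTuple):
    responsive: bool
    color: str              # "green" if responsive, "red" otherwise
    latency_seconds: float  # zero if non-responsive


def channel_status(gauge_value: float) -> ChannelStatus:
    """Interpret one gauge value as the two readings shown per channel."""
    responsive = gauge_value > 0
    return ChannelStatus(
        responsive=responsive,
        color="green" if responsive else "red",
        latency_seconds=gauge_value if responsive else 0.0,
    )


def dashboard_rows(gauges: Dict[str, float]) -> Dict[str, ChannelStatus]:
    """Build one status entry per channel from the recorded gauge values."""
    return {channel: channel_status(value) for channel, value in gauges.items()}
```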

Edited Aug 10, 2020 by aguestuser
