[#340] Resolve "only alert after 2 consecutive healthcheck failures"
Closes #340 (closed)
Context
- we have a somewhat subtle bug causing occasional (very non-deterministic) false alarm healthcheck failures
- however, in all observed cases, only channels that fail 2 consecutive health checks are actually in a broken state
- thus: as a workaround to allow us to focus on other issues rather than solving the bug, we modify the healthcheck logic to cache failures and only alert maintainers if a channel has failed healthcheck twice in a row
- to compensate for the delay in alerting, we decrease the healthcheck interval from 15 min to 10 min (so that it will take 20 min to find out if a channel is broken). if this proves too slow, we can shorten the interval again to 5 min.
Changes
update logic in diagnostics.sendHealthcheck
- cache healthcheck failures in a set (delete them after 2 rounds of healthchecks)
- when a channel fails healthcheck, consult the set to see if it failed last round: if so, alert admins; if not, wait until next round
- refactor: use new notifier.notifyMaintainers helper in sendHealthchecks failure handler
- do some fancy integration testing to make sure it works
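a minimal sketch of the "two consecutive failures" rule described above, for reviewers who want the logic at a glance. this is not the actual code in diagnostics.sendHealthcheck: the FailureCache class, the recordRound method, and the channel ids are all hypothetical, and the sketch swaps the cached set once per round instead of deleting entries after 2 rounds.

```typescript
// Hypothetical illustration only; names and shapes are assumptions,
// not the real implementation in diagnostics.sendHealthcheck.
type ChannelId = string;

class FailureCache {
  // Channels that failed the previous round of healthchecks.
  private failedLastRound = new Set<ChannelId>();

  // Takes the channels that failed this round and returns the ones that
  // also failed the previous round, i.e. the ones worth alerting on.
  recordRound(failedThisRound: ChannelId[]): ChannelId[] {
    const toAlert = failedThisRound.filter((id) =>
      this.failedLastRound.has(id)
    );
    // Remember only this round's failures, so a single isolated failure
    // is forgotten once the channel passes a later round.
    this.failedLastRound = new Set<ChannelId>(failedThisRound);
    return toAlert;
  }
}
```

in this sketch a channel that keeps failing is returned on every round after its first failure, so the caller would decide whether to alert repeatedly or only once.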
side-effects:
- fix bug in testApp.metricsResource: it omitted a gauges field, causing integration tests that called metrics.setGauge to incorrectly (and confusingly!) fail
- update localdev network comments in test docker-compose files
Edited by aguestuser