[#340] Resolve "only alert after 2 consecutive healthcheck failures"
Closes #340 (closed)
- we have a subtle bug that causes occasional, non-deterministic false-alarm healthcheck failures
- however, in all observed cases, only channels that fail 2 consecutive health checks are actually in a broken state
- thus, as a workaround that lets us focus on other issues rather than solving the bug, we modify the healthcheck logic to cache failures and only alert maintainers if a channel has failed healthcheck twice in a row
- to compensate for the delay in alerting, we decrease the healthcheck interval from 15 min to 10 min, so that it takes up to 20 min (2 rounds of 10 min) to find out that a channel is broken. if this proves too slow, we can shorten the interval again to 5 min.
update logic in
- cache healthcheck failures in a set (delete them after 2 rounds of healthchecks)
- when a channel fails healthcheck, consult the set to see if it also failed last round: if so, alert admins; if not, wait until the next round
- refactor: use new
- do some fancy integration testing to make sure it works
- fix bug in testApp.metricsResource: it omitted a gauges field, causing integration tests that called metrics.setGauge to incorrectly (and confusingly!) fail
- update localdev network comments in test docker-compose files
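
For reviewers, the two-consecutive-failures rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this change: `ConsecutiveFailureTracker`, `processRound`, and `HealthcheckResult` are hypothetical names, and dropping a channel from the set once it has triggered an alert is an assumption about what "delete them after 2 rounds" means.

```typescript
// Hypothetical sketch: none of these names come from the codebase.
interface HealthcheckResult {
  channel: string; // assumed channel identifier
  healthy: boolean; // true if the channel passed this round's healthcheck
}

class ConsecutiveFailureTracker {
  // Channels whose first failure happened in the previous round.
  private failedLastRound = new Set<string>();

  // Process one full round of healthchecks and return the channels
  // that should trigger an alert (i.e. failed twice in a row).
  processRound(results: HealthcheckResult[]): string[] {
    const failedThisRound = new Set<string>();
    const toAlert: string[] = [];

    for (const { channel, healthy } of results) {
      if (healthy) continue;
      if (this.failedLastRound.has(channel)) {
        // Second consecutive failure: alert (entry is then dropped).
        toAlert.push(channel);
      } else {
        // First failure: remember it, but do not alert yet.
        failedThisRound.add(channel);
      }
    }

    // Replacing the set means no failure is remembered for more than
    // one round after it was recorded.
    this.failedLastRound = failedThisRound;
    return toAlert;
  }
}
```

Under these assumptions, a channel that fails once and then recovers never alerts, while a channel that stays broken alerts on every second round.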