Context

we have a somewhat subtle bug causing occasional (very non-deterministic) false alarm healthcheck failures
however, in all observed cases, only channels that fail 2 consecutive health checks are actually in a broken state
thus: as a workaround to allow us to focus on other issues rather than solving the bug, we modify the healthcheck logic to cache failures and only alert maintainers if a channel has failed healthcheck twice in a row
to compensate for the delay in alerting, we decrease the healthcheck interval from 15 min to 10 min (so that it will take 20 min to find out if a channel is broken.) if this proves bad, we can shorten again to 5 min.

Changes

update logic in diagnostics.sendHealthcheck

cache healthcheck failures in a set (delete them after 2 rounds of healthchecks)
when a channel fails healthcheck, consult set to see if it failed alst round, if so, alert admins, if not, wait until next round
refactor: use new notifier.notifyMaintainers helper in sendHealthchecks failure handler
do some fancy integration testing to make sure it works

side-effects:

fix bug in testApp.metricsResource it omitted a gauges field, causing integration tests that called metrics.setGauge to incorrectly (and confusingly!) fail
update localdev network comments in test docker-compose files

Edited Sep 16, 2020 by aguestuser