[#340] Resolve "only alert after 2 consecutive healthcheck failures"

Closes #340 (closed)

Context

  • we have a somewhat subtle bug that occasionally (and very non-deterministically) causes false-alarm healthcheck failures
  • however, in all observed cases, only channels that fail 2 consecutive healthchecks are actually in a broken state
  • thus: as a workaround to allow us to focus on other issues rather than solving the bug, we modify the healthcheck logic to cache failures and only alert maintainers if a channel has failed healthcheck twice in a row
  • to compensate for the delay in alerting, we decrease the healthcheck interval from 15 min to 10 min (so that it takes up to 20 min to find out that a channel is broken). if this proves too slow, we can shorten the interval again to 5 min.
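
The interval trade-off above can be checked with a little arithmetic. This is only an illustrative sketch: the function names are hypothetical and not part of the codebase, and it assumes a broken channel fails every healthcheck from the moment it breaks.

```typescript
// Hypothetical helpers for reasoning about alert delay; not part of the codebase.

// Worst case: the channel breaks just after a healthcheck has run, so it has to
// wait a full interval for each of the required consecutive failures.
function worstCaseAlertDelay(intervalMinutes: number, failuresRequired: number): number {
  return intervalMinutes * failuresRequired;
}

// Best case: the channel breaks just before a healthcheck runs, so the first
// failure is (almost) immediate and only the remaining failures cost an interval.
function bestCaseAlertDelay(intervalMinutes: number, failuresRequired: number): number {
  return intervalMinutes * (failuresRequired - 1);
}
```

Under these assumptions, the old setup (15 min interval, alert on first failure) alerts within 0 to 15 min, the new setup (10 min interval, 2 failures) within 10 to 20 min, and the 5 min fallback within 5 to 10 min.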

Changes

update logic in diagnostics.sendHealthcheck

  • cache healthcheck failures in a set (delete them after 2 rounds of healthchecks)
  • when a channel fails healthcheck, consult the set to see if it also failed last round: if so, alert admins; if not, wait until next round
  • refactor: use new notifier.notifyMaintainers helper in sendHealthchecks failure handler
  • do some fancy integration testing to make sure it works
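
The "alert only on the second consecutive failure" rule described in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual diagnostics code: `HealthcheckResult`, `makeFailureTracker`, and `processRound` are hypothetical names, and the real handler presumably works asynchronously and hands the alerts to notifier.notifyMaintainers.

```typescript
// Hypothetical types and names; only the "did it also fail last round?" rule
// is taken from the description above.
type HealthcheckResult = { channel: string; healthy: boolean };

// Returns a function that processes one round of healthcheck results and
// reports which channels should trigger an alert.
function makeFailureTracker(): (results: HealthcheckResult[]) => string[] {
  // Channels whose first failure was recorded in the previous round.
  let failedLastRound = new Set<string>();

  return function processRound(results: HealthcheckResult[]): string[] {
    const failedThisRound = new Set<string>();
    const toAlert: string[] = [];

    for (const { channel, healthy } of results) {
      if (healthy) continue;
      if (failedLastRound.has(channel)) {
        // Second consecutive failure: alert now.
        toAlert.push(channel);
      } else {
        // First failure: cache it and wait for the next round.
        failedThisRound.add(channel);
      }
    }

    // Replacing the set each round drops every cached failure after it has
    // been consulted once, so no entry outlives 2 rounds of healthchecks.
    failedLastRound = failedThisRound;
    return toAlert;
  };
}
```

One consequence of this particular sketch (an assumption, not something the description specifies): a channel that stays broken is re-alerted every second round rather than every round, because its cache entry is dropped once the alert fires.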

side-effects:

  • fix bug in testApp.metricsResource: it omitted a gauges field, causing integration tests that called metrics.setGauge to fail incorrectly (and confusingly!)
  • update localdev network comments in test docker-compose files
Edited by aguestuser
