TECH TASK: monitoring

STUB:

High level:

Use Prometheus to scrape basic health metrics.

Decide whether to run Prometheus on a separate box or on the same box as prod. The same box is simpler; a separate box scales better (the transition to watching many app instances is easier because you're already watching one remotely). If you choose the same box, watch out for disk space consumption by Prometheus and memory consumption by Docker. Add a Prometheus counter for messages sent on each channel.
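
For the same-box option, a back-of-envelope disk estimate may help. The sketch below applies the sizing rule of thumb from the Prometheus storage docs (retention time x ingested samples per second x bytes per sample, at roughly 1-2 bytes per sample). The function name and all input numbers are hypothetical placeholders, not measurements from our deployment.

```javascript
// Rough Prometheus disk estimate:
//   needed_disk_space = retention_seconds * ingested_samples_per_second * bytes_per_sample
// seriesCount, scrapeIntervalSeconds, and retentionDays are assumptions to be
// replaced with real values once we know what we scrape.
function estimatePrometheusDiskBytes({
  seriesCount,
  scrapeIntervalSeconds,
  retentionDays,
  bytesPerSample = 2, // upper end of the documented 1-2 byte average
}) {
  const samplesPerSecond = seriesCount / scrapeIntervalSeconds;
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

// Hypothetical example: 1000 series, scraped every 15s, kept for 15 days.
const estimatedBytes = estimatePrometheusDiskBytes({
  seriesCount: 1000,
  scrapeIntervalSeconds: 15,
  retentionDays: 15,
});
console.log(`${(estimatedBytes / 1024 ** 2).toFixed(1)} MiB`);
```

If the estimate is small relative to free disk on the prod box, the same-box option looks less risky; the Docker memory concern still needs to be checked separately.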

Then visualize the metrics with Grafana.

We would like to track:

counts of broadcast, hotline, and command messages over time (currently incremented in non-time-series fashion in the messageCounts table)
counts of subscriber lists over time
memory consumption, CPU usage, and disk space usage over time
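
To illustrate what "over time" buys us compared with a single running total: if a cumulative count is sampled at regular intervals, the per-interval increase and rate fall out of adjacent samples. Prometheus does this for us with PromQL functions such as rate() and increase(); the sketch below is only a simplified, dependency-free illustration of the idea, and the sample shape ({ t, value }) is an assumption.

```javascript
// Turn periodic samples of a cumulative counter into per-interval increases.
// Each sample is { t: seconds, value: runningTotal } (hypothetical shape).
// A drop in value is treated as a counter reset (e.g. a process restart), so
// the new value is counted as the increase since the reset.
function counterIncreases(samples) {
  const out = [];
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const cur = samples[i];
    const delta = cur.value >= prev.value ? cur.value - prev.value : cur.value;
    out.push({
      t: cur.t,
      increase: delta,
      perSecond: delta / (cur.t - prev.t),
    });
  }
  return out;
}

// Hypothetical samples taken every 15 seconds, with a restart before the last one.
console.log(
  counterIncreases([
    { t: 0, value: 10 },
    { t: 15, value: 40 },
    { t: 30, value: 5 },
  ])
);
```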

Implementation Notes

(@zig's notes from chat w/ @aguestuser):

i found a helpful way in by reading the section on monitoring in nat's book: https://www.bookdepository.com/Real-World-SRE-Nat-Welch/9781788628884

his example is in go, but prometheus has clients in basically every language (we'd want JS, since you write the instrumentation code inline with your application code). it was roughly 20-30 lines of instrumentation code to get basic monitoring of memory/disk usage, request rates, error rates, etc.
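
as a rough idea of what the basic process metrics look like in JS, here is a sketch using only Node.js built-ins (process.memoryUsage, process.cpuUsage, process.uptime). the function name and metric names are hypothetical; a prometheus client library would collect similar values under its own names, and disk usage is not covered here.

```javascript
// Snapshot of basic process health, using only Node.js built-ins.
// Metric names are made up for illustration.
function processHealthSnapshot() {
  const mem = process.memoryUsage(); // sizes in bytes
  const cpu = process.cpuUsage(); // user/system CPU time in microseconds
  return {
    process_resident_memory_bytes: mem.rss,
    process_heap_used_bytes: mem.heapUsed,
    process_cpu_seconds_total: (cpu.user + cpu.system) / 1e6,
    process_uptime_seconds: process.uptime(),
  };
}

console.log(processHealthSnapshot());
```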

then you can make custom "counter" instrumentations (which are cached in memory and scraped every SCRAPE_INTERVAL by prometheus). the only custom one i think we'd want is a counter incremented whenever a message is sent on a channel, so we can produce a graph of messages sent on each channel over time (and group the busiest channels into their own box if needed down the line).
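
a minimal sketch of the per-channel counter idea, written without any dependencies so the mechanics are visible: counts live in memory, keyed by channel, and are rendered in the prometheus text exposition format when scraped. in practice a JS prometheus client library would provide the counter and the rendering; the class name, metric name, and channel ids below are all hypothetical.

```javascript
// In-memory counter with one "channel" label, rendered as Prometheus text format.
// This is a stand-in for illustration, not a client library's API.
class ChannelMessageCounter {
  constructor(name, help) {
    this.name = name;
    this.help = help;
    this.counts = new Map(); // channel id -> running total
  }

  // Call this wherever a message is sent on a channel.
  inc(channel, by = 1) {
    this.counts.set(channel, (this.counts.get(channel) || 0) + by);
  }

  // Escape backslash, double quote, and newline in label values.
  static escapeLabel(value) {
    return String(value)
      .replace(/\\/g, "\\\\")
      .replace(/"/g, '\\"')
      .replace(/\n/g, "\\n");
  }

  // Text a /metrics endpoint would return on each scrape.
  render() {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const [channel, value] of this.counts) {
      const label = ChannelMessageCounter.escapeLabel(channel);
      lines.push(`${this.name}{channel="${label}"} ${value}`);
    }
    return lines.join("\n") + "\n";
  }
}

const messagesSent = new ChannelMessageCounter(
  "messages_sent_total",
  "Messages sent, by channel"
);
messagesSent.inc("channel-a");
messagesSent.inc("channel-a");
messagesSent.inc("channel-b");
console.log(messagesSent.render());
```

one thing to keep an eye on with this design: every distinct channel becomes its own time series, so the number of series grows with the number of channels.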

Check for a blog post on all this coming from Nat Welch soon! :)

Edited Jun 01, 2020 by aguestuser
