TECH TASK: monitoring
STUB:
High level:
Use Prometheus to scrape basic health metrics.
Decide whether to run Prometheus on a separate box or on the same box as prod. The same box is simpler; a separate box scales better (moving to watching many app instances is easier because you're already watching one remotely). If you choose the same box, watch out for disk space consumption by Prometheus and memory consumption by Docker. Add a Prometheus counter for messages sent on each channel.
Then build dashboards on top of the Prometheus data with Grafana.
We would like to track:
- counts of broadcast, hotline, and command messages over time (currently incremented in non-time-series fashion in the messageCounts table)
- counts of subscriber lists over time
- memory consumption, CPU usage, and disk space usage over time
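As a rough illustration of the last item, here is a dependency-free sketch that samples memory, CPU, and disk figures from inside a Node process and renders them as plain "name value" lines in the Prometheus text format. The metric names and the sampleHealth / renderSamples helpers are made up for this example; a Prometheus client library would normally collect equivalent default metrics for us.

```javascript
// Sketch only: metric names and helper names are hypothetical.
const os = require('os');
const fs = require('fs');

// Take one sample of memory, CPU, and (where supported) disk usage.
function sampleHealth(diskPath = '/') {
  const cpu = process.cpuUsage(); // user/system CPU time so far, in microseconds
  const sample = {
    app_memory_rss_bytes: process.memoryUsage().rss,
    app_cpu_seconds_total: (cpu.user + cpu.system) / 1e6,
    host_memory_free_bytes: os.freemem(),
  };
  // fs.statfsSync only exists in newer Node releases, so guard the call.
  if (typeof fs.statfsSync === 'function') {
    try {
      const stats = fs.statfsSync(diskPath);
      sample.host_disk_free_bytes = stats.bavail * stats.bsize;
    } catch (err) {
      // diskPath is not readable here; skip the disk metric.
    }
  }
  return sample;
}

// Render a sample as one "name value" line per metric.
function renderSamples(sample) {
  return Object.entries(sample)
    .map(([name, value]) => `${name} ${value}`)
    .join('\n') + '\n';
}
```

Calling renderSamples(sampleHealth()) on each scrape would give Prometheus a fresh reading every time, which is what turns these into "over time" series.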
Implementation Notes
(@zig's notes from chat w/ @aguestuser):
i found reading the section on monitoring in nat's book a helpful way in: https://www.bookdepository.com/Real-World-SRE-Nat-Welch/9781788628884
his example is in go, but prometheus has clients in basically every language (we'd want JS, since you write the instrumentation code inline with your application code). it was roughly 20-30 lines of instrumentation code to get basic monitoring of memory/disk usage, request rates, error rates, etc.
then you can make custom "counter" instrumentations (which are cached in memory and scraped every SCRAPE_INTERVAL by prometheus). the only custom one i think we'd want is a counter incremented whenever a message is sent on a channel, so we can produce a graph of messages sent on each channel over time (and group the busiest channels into their own box if needed down the line).
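a minimal sketch of that per-channel counter, assuming nothing about our actual code: the metric name messages_sent_total, the channel/type labels, and the recordMessageSent / renderMetrics / startMetricsServer helpers are all hypothetical. it is written against the node stdlib only to show the mechanics (counts held in memory, served as text when prometheus scrapes); a prometheus JS client library would replace the hand-rolled parts.

```javascript
// Sketch only: hand-rolled stand-in for a client library's labelled counter.
const http = require('http');

// In-memory counts keyed by a JSON-encoded [channel, type] pair.
const sentCounts = new Map();

// Call this wherever the app sends a message on a channel.
function recordMessageSent(channel, type) {
  const key = JSON.stringify([channel, type]);
  sentCounts.set(key, (sentCounts.get(key) || 0) + 1);
}

// Escape backslashes, quotes, and newlines inside a label value.
function escapeLabel(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Render all counts in the Prometheus text exposition format.
function renderMetrics() {
  const lines = [
    '# HELP messages_sent_total Messages sent, by channel and message type.',
    '# TYPE messages_sent_total counter',
  ];
  for (const [key, count] of sentCounts) {
    const [channel, type] = JSON.parse(key);
    lines.push(
      `messages_sent_total{channel="${escapeLabel(channel)}",type="${escapeLabel(type)}"} ${count}`
    );
  }
  return lines.join('\n') + '\n';
}

// Serve the current counts at /metrics for prometheus to scrape.
function startMetricsServer(port) {
  return http
    .createServer((req, res) => {
      if (req.url !== '/metrics') {
        res.writeHead(404);
        res.end();
        return;
      }
      res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4' });
      res.end(renderMetrics());
    })
    .listen(port);
}
```

the app would call recordMessageSent(channelId, 'broadcast') (or 'hotline' / 'command') at send time and startMetricsServer once at boot on whatever port we pick; prometheus then turns the ever-increasing totals into per-channel rates at query time.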
Check for a blog post on all this coming from Nat Welch soon! :)