[#465] Resolve "load: identify choke point in signalc send path"
Closes #465 (closed)
We have been experiencing a bottleneck in the signalc send message path, first observed as slow message delivery and delivery rates that fell short of 100%. This PR instruments both signalc and libsignal with Prometheus metrics in order to identify the bottleneck, and adds custom dispatchers to signalc, which vastly improved delivery rates.
After many experiments run through the lag test and observed through our metrics, we were able to pinpoint the bottleneck to the getEncryptedMessages function inside libsignal's SignalServiceMessageSender. This function makes several database calls to the Session table, which adds up to a large number of database queries over the course of one test run. To resolve this, we are planning on moving to an in-memory database (H2), which performs significantly better than PostgreSQL in this situation. For example, in a lag test sending out 1000 messages, our system completed the run in ~5 seconds with H2, versus ~15 seconds with PostgreSQL.
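As an illustration of the kind of instrumentation described above, here is a minimal sketch of a counter and a timing histogram wrapped around a hot code path. It assumes the Prometheus Java simpleclient (`io.prometheus.client`) is on the classpath; the metric names, the `Metrics` object, and the `timed` helper are hypothetical and are not taken from this PR.

```kotlin
import io.prometheus.client.Counter
import io.prometheus.client.Histogram

// Hypothetical metric registry; names are illustrative assumptions.
object Metrics {
    // Counts every message handed to the send path.
    val messagesSent: Counter = Counter.build()
        .name("signalc_messages_sent_total")
        .help("Number of messages passed to the send path")
        .register()

    // Records how long the suspected hot spot takes, in seconds.
    val encryptDuration: Histogram = Histogram.build()
        .name("libsignal_get_encrypted_messages_seconds")
        .help("Time spent in the encryption step of a send")
        .register()
}

// Runs `block`, records its wall-clock duration in `histogram`, and returns its result.
fun <T> timed(histogram: Histogram, block: () -> T): T {
    val timer = histogram.startTimer()
    try {
        return block()
    } finally {
        timer.observeDuration()
    }
}

fun main() {
    repeat(100) {
        // Thread.sleep stands in for the real (blocking) work being measured.
        timed(Metrics.encryptDuration) { Thread.sleep(1) }
        Metrics.messagesSent.inc()
    }
    println("messages counted: ${Metrics.messagesSent.get()}")
}
```

Comparing a histogram like this across code paths (and across database backends) is one way to narrow a slowdown to a single function.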
Additionally, we identified that our system was thread-starved at various moments of a lag test with many messages, leading to deadlock and decreased delivery rates. With the introduction of custom cached thread pools for our coroutines to use, the deadlocks appear to be resolved and we are once again consistently delivering 100% of messages over a lag test run.
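A minimal sketch of the idea, assuming kotlinx.coroutines is available: a cached thread pool is wrapped as a coroutine dispatcher so that blocking work gets a thread on demand. The `Concurrency` object, `blockingSend`, and `sendAll` are hypothetical names for illustration and do not reflect signalc's actual code.

```kotlin
import kotlinx.coroutines.ExecutorCoroutineDispatcher
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.runBlocking
import java.util.concurrent.Executors

// Hypothetical holder for the custom dispatcher.
object Concurrency {
    // A cached thread pool creates threads on demand and reuses idle ones,
    // so a burst of blocking calls is not capped by a small fixed pool.
    val General: ExecutorCoroutineDispatcher =
        Executors.newCachedThreadPool().asCoroutineDispatcher()
}

// Stand-in for a blocking send (e.g. one that performs database queries).
fun blockingSend(id: Int): Int {
    Thread.sleep(50)
    return id
}

// Launches one coroutine per message on the custom dispatcher and waits for all of them.
fun sendAll(n: Int): List<Int> = runBlocking {
    (1..n).map { id ->
        async(Concurrency.General) { blockingSend(id) }
    }.awaitAll()
}

fun main() {
    val delivered = sendAll(100)
    println("delivered ${delivered.size} of 100")
    Concurrency.General.close() // shuts down the underlying executor
}
```

One trade-off to keep in mind: a cached thread pool is unbounded, so it trades the risk of starvation for the risk of creating a very large number of threads under heavy load.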
There are a few screenshots in the comments below showing various results of our instrumentation.
- Add metrics to signalc + libsignal
  - Add Prometheus + Grafana services to load test docker-compose
  - Add Prometheus scrape config to scrape sender-signalc
  - Add Prometheus web server to signalc
  - Add various counters + histograms to understand bottlenecks in send message path
- Add custom cachedThreadPool dispatcher to signalc
  - 1 type currently (General)
  - Switch all coroutines to this dispatcher
  - Alleviates the thread deadlock that we have experienced
- Add custom Maven package location for libsignal (hosted on 0xacab)
- Add mavenLocal() to build.gradle.kts
  - Allows us to create local builds of libsignal for faster cycles
  - Requires us to mount ~/.m2 directory of dev machine into sender-signalc container
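For reference, the repository changes in the last two items might look roughly like the following fragment of build.gradle.kts. This is a sketch only: the repository URL is a placeholder, since the actual 0xacab package location is not given here.

```kotlin
// build.gradle.kts (fragment)
repositories {
    // Resolves artifacts from ~/.m2/repository, e.g. a locally built libsignal.
    mavenLocal()

    // Custom package registry for libsignal; the URL below is a placeholder, not the real location.
    maven {
        url = uri("https://example.invalid/path/to/libsignal/maven")
    }
}
```

Because Gradle consults repositories in declaration order, listing mavenLocal() first lets a locally published build be found before the remote package.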