Skip to content

[#415] load: reproduce channel crashes in signald

aguestuser requested to merge 415-reproduce-signald-crashes into main

Closes #415 (closed)

context

  • in #201 (closed) we validated that signalc can send messages from a single channel with higher throughput than signald
  • with this MR, we would like to validate that signalc does not suffer from the tendency to crash under heavy load (or high parallelism) that we observe in prod with signald*
  • to accomplish this, we attempt to direct a highly concurrent stream of messages to signald to cause it to crash, then send the same stream through signalc and show that it does not crash

* In prod, signald channels crash non-deterministically when under heavy load or when we try to send messages through them in parallel by leveraging a larger pool of socket connections btw/ signalboost and signald. These causes tend to coincide with ConcurrentModificationExceptions in the Signal Protocol Store, as signald attempts to read from the keystore at the same time as it tries to modify it. After such a "crash" occurs, the channel becomes unresponsive, and both (1) does not receive or acknowledge incoming messages from admins or subscribers ("one check") or transmit messages to anyone. Thus: we will accept as having reproduced this bug, any load test scenario that causes signald to drop messages in a way that coincides with concurrent modification errors.

findings

As detailed below, we validated our hypothesis, but only partially and raised some new concerns about signalc's performance under heavily concurrent load.

Test design:

  • We created 3 sender channels, each of which sent out 3 batches of 100 messages each at the same time (for 900 messages overall)
  • We observed message lag and whether or not concurrent modification errors caused signald to drop messages

Results:

  • In 3 out of 9 batches, signald dropped several messages (1 out of 100, 3 out of 100, and 7 out of 100 messages respectively)
  • Signalc successfully transmitted all 100 messages in 9 out of 9 batches

Observations requiring explanation:

  • (1) The "crashes" above were not reproducible. They happened on a "fresh" run of signald (after seeding the keystore), but not thereafter. I could reproduce crashing by reseeding the keystore, but not by running again "hot" over an existing keystore.
  • (2) Signalc appears to be slower than Signald for some batches (~20 sec elapsed time per "batch" vs. signald, which ranges from ~7 sec to ~20 sec)

Possibilities for further inquiry:

  • (1) Might be explainable by the fact that signald is bootstrapping the keystore and is thus more likely to be reading to and writing from it at the same time. To test this, we could design a new experiment in which we have the "receiver" in the test send messages back. (Thus simulating a scenario in which a channel is both receiving messages from admins and relaying them to susbscribers or vice versa).
  • (2) Might be explainable by the fact that signald tends to send messages in sequence, whereas signalc parallelizes more heavily. The "overall" batch time for signalc might reflect the overall transmission time for all 3 batches, while the elapsed time for a signald batch might reflect 1/9th of the overall trial time. To test this hypothesis, we could modify the test to record the overall trial time, and report that alongside the "batch** times, included below.

code changes

  • add testConcurrency (and make/bash tooling necessary to run it)
  • extract load test helpers into simulator.helpers:
    • testSendingN returns a report (rather than printing it), so that reports may be aggregated and printed at the same time at the end of a multi-batch trial
    • the calling test uses print to print the report
    • numBots is now private, so that different tests may set different values for numRecipients, but we always subscribe to the full numBots (to avoid situation in which we want to run a "hot" test that is trying to send to numbers to which the receiver has not subscribed)
  • combine all socket code into a single module addressable by calling the socketPool field on app after it has been initialized
    • down payment on a refactor i'd like to make on prod!
    • upshot: get rid of the weird split between socket/index (that initializes the sockets) and socket/socketWriter (that uses it, and is prone to circular import errors)

detailed findings

Signalc Lag In Sec (under 3 x 3 x 100 concurrency):

[{
  "client": "SIGNALC",
  "senderNumber": "+20000000000",
  "numRecipients": 100,
  "socketPoolSize": "32",
  "timestamp": "2021-03-31T05:34:11.636Z",
  "percentDelivered": 100,
  "minElapsed": 1.095,
  "maxElapsed": 21.413,
  "totalElapsed": 21.551
},
{
  "client": "SIGNALC",
  "senderNumber": "+20000000001",
  "numRecipients": 100,
  "socketPoolSize": "32",
  "timestamp": "2021-03-31T05:34:11.637Z",
  "percentDelivered": 100,
  "minElapsed": 3.97,
  "maxElapsed": 22.072,
  "totalElapsed": 22.203
},
{
  "client": "SIGNALC",
  "senderNumber": "+20000000002",
  "numRecipients": 100,
  "socketPoolSize": "32",
  "timestamp": "2021-03-31T05:34:11.637Z",
  "percentDelivered": 100,
  "minElapsed": 8.991,
  "maxElapsed": 21.904,
  "totalElapsed": 22.036
},
{
  "client": "SIGNALC",
  "numRecipients": 100,
  "socketPoolSize": "32",
  "timestamp": "2021-03-31T05:34:11.638Z",
  "percentDelivered": 100,
  "minElapsed": 1.067,
  "maxElapsed": 19.073,
  "totalElapsed": 19.209
},
{
  "client": "SIGNALC",
  "numRecipients": 100,
  "socketPoolSize": "32",
  "timestamp": "2021-03-31T05:34:11.638Z",
  "percentDelivered": 100,
  "minElapsed": 3.989,
  "maxElapsed": 20.884,
  "totalElapsed": 21.009
},
{
  "client": "SIGNALC",
  "numRecipients": 100,
  "socketPoolSize": "32",
  "timestamp": "2021-03-31T05:34:11.638Z",
  "percentDelivered": 100,
  "minElapsed": 8.592,
  "maxElapsed": 21.776,
  "totalElapsed": 21.905
},
{
  "client": "SIGNALC",
  "numRecipients": 100,
  "socketPoolSize": "32",
  "timestamp": "2021-03-31T05:34:11.639Z",
  "percentDelivered": 100,
  "minElapsed": 1.034,
  "maxElapsed": 18.5,
  "totalElapsed": 18.626
},
{
  "client": "SIGNALC",
  "numRecipients": 100,
  "socketPoolSize": "32",
  "timestamp": "2021-03-31T05:34:11.639Z",
  "percentDelivered": 100,
  "minElapsed": 4.146,
  "maxElapsed": 20.465,
  "totalElapsed": 20.579
},
{
  "client": "SIGNALC",
  "numRecipients": 100,
  "socketPoolSize": "32",
  "timestamp": "2021-03-31T05:34:11.640Z",
  "percentDelivered": 100,
  "minElapsed": 11.462,
  "maxElapsed": 21.325,
  "totalElapsed": 21.438
}]

Signald Lag In Sec and % delivered (under 3 x 3 x 100 concurrency)

[{
  "client": "SIGNALD",
  "senderNumber": "+23000000000",
  "numRecipients": 100,
  "socketPoolSize": "16",
  "timestamp": "2021-03-31T05:46:57.234Z",
  "percentDelivered": 93,
  "minElapsed": 0.9,
  "maxElapsed": 4.623,
  "totalElapsed": 120.225
},
{
  "client": "SIGNALD",
  "senderNumber": "+23000000001",
  "numRecipients": 100,
  "socketPoolSize": "16",
  "timestamp": "2021-03-31T05:46:57.234Z",
  "percentDelivered": 99,
  "minElapsed": 3.672,
  "maxElapsed": 5.849,
  "totalElapsed": 120.21
},
{
  "client": "SIGNALD",
  "senderNumber": "+23000000002",
  "numRecipients": 100,
  "socketPoolSize": "16",
  "timestamp": "2021-03-31T05:46:57.235Z",
  "percentDelivered": 100,
  "minElapsed": 4.71,
  "maxElapsed": 7.083,
  "totalElapsed": 7.219
},
{
  "client": "SIGNALD",
  "numRecipients": 100,
  "socketPoolSize": "16",
  "timestamp": "2021-03-31T05:46:57.235Z",
  "percentDelivered": 100,
  "minElapsed": 6.508,
  "maxElapsed": 10.831,
  "totalElapsed": 10.966
},
{
  "client": "SIGNALD",
  "numRecipients": 100,
  "socketPoolSize": "16",
  "timestamp": "2021-03-31T05:46:57.235Z",
  "percentDelivered": 100,
  "minElapsed": 9.973,
  "maxElapsed": 12.012,
  "totalElapsed": 12.144
},
{
  "client": "SIGNALD",
  "numRecipients": 100,
  "socketPoolSize": "16",
  "timestamp": "2021-03-31T05:46:57.235Z",
  "percentDelivered": 100,
  "minElapsed": 11.063,
  "maxElapsed": 13.291,
  "totalElapsed": 13.424
},
{
  "client": "SIGNALD",
  "numRecipients": 100,
  "socketPoolSize": "16",
  "timestamp": "2021-03-31T05:46:57.236Z",
  "percentDelivered": 97,
  "minElapsed": 12.64,
  "maxElapsed": 16.033,
  "totalElapsed": 120.161
},
{
  "client": "SIGNALD",
  "numRecipients": 100,
  "socketPoolSize": "16",
  "timestamp": "2021-03-31T05:46:57.236Z",
  "percentDelivered": 100,
  "minElapsed": 15.287,
  "maxElapsed": 18.343,
  "totalElapsed": 18.471
},
{
  "client": "SIGNALD",
  "numRecipients": 100,
  "socketPoolSize": "16",
  "timestamp": "2021-03-31T05:46:57.236Z",
  "percentDelivered": 100,
  "minElapsed": 17.181,
  "maxElapsed": 20.485,
  "totalElapsed": 20.61
}]
Edited by aguestuser

Merge request reports