[477] fix prekey bundle bug
Closes #477 (closed)
context
symptom:
- in load test env, we observe that messages to signal users who have never sent a message back to a channel are always interpreted as
PREKEY_BUNDLE
envelopes rather thanCIPHERTEXT
envelopes. (in dev env we have observed that messages from such users are also always interpreted asPREKEY_BUNDLE
s - why we care: our fix for refreshing prekeys in #404 (closed) rests on being able to confidently infer that receipt of a
PREKEY_BUNDLE
envelopes indicates a completely new session with a user with whom an account has never exchanged messages. (Which would mean it has depleted our store of prekeys by asking for one to start a new session with us, making this a good hook to go check whether we should refresh our prekeys!)
likely cause
- in libsignal,
SessionCipher#encrypt
's call toSessionBuilder.process
looks (in native code) forsession_state.unacknowledged_pre_key_message_items()
when determining whether to encode the data it is about to encrypt as aCiphertextMessage
or aPreKeyMesssage
- this collection is only ever set to empty by
#decrypt
(which would have received an acknowledgement of a prekey bundle from an interlocutor) - however, we are failing to set these collections to empty in most cases, because we don't decrypt envelopes of types other than
CIPHERTEXT
. notably: we do not decrypt messages of typeRECEIPT
that are the most likely ways that most interlocutors would signal acknowledgement of a prekey bundle message- as such, we never realize that our prekey bundle message has been acknowledged and we never clear the flag
fix
- this MR begins by re-establishing the 2 enum variants for
PREKEY_BUNDLE
andCIPHERTEXT
(that i had previously collapsed because i did not properly understand the failure behavior in which i only ever sawPREKEY_BUNDLE
messages in the load tests and assumed that since they were handled likeCIPHERTEXT
messages, they must also beCIPHERTEXT
messages - it then proceeds to reimplement
SignalSender#drop
to always decrypt all messages so that we will have a chance to observe and react to acknowledgment of prekey bundles from interlocutors at the earliest possible moment (and not waiting for them to send us a message -- which they might never do! -- to do so.) - in turn it establishes confidence that we are always correctly sending/receiving
PREKEY_BUNDLE
envelopes when we should
side-effects
- fix up our docker setup as follows:
- decouple the build and push steps for signalc docker build
- introduce a
dev
tag and tag all "pinned" image builds with it - use this
dev
tag in loadtest and signalc dev docker-compose stacks (so we don't have to manually update the pinned commit every time we make a change) - provisionally eliminate any usage of the old
signalc_dev
image (now alldocker-compose
files simply usesignalc
which is the image former specified assignalc_pinned
)- if we ever want to use
signalc_dev
again for faster iterations we can simply revert 5e8ce6b4
- if we ever want to use
- modify the max prekey id to follow what signal android uses (an int smaller than
Int.MAX_VALUE
)
further work
in the course of QA-ing this work, i realized a few things:
- in the absence of this bugfix, all new messages sent to a new channel would produce an
InvalidMessageException
upon receipt:
openwhispersystems.libsignal.InvalidMessageException: Message from +<REDACTED>:1 failed to decrypt; sender ratchet public key 0910fa5a7b3e3f6c1ae450c7fab44975a038fbfa662935bcb606155e3d1cda25 message counter 1
signalc_pinned | Candidate session 0 failed with 'invalid ciphertext message', had 0 receiv
- this was surprising to me, since i am not sure why decrypting messages (that were never sent) would cause the message to go away. however, the only way i can get that bug to disappear is by introducing the call to
decrypt
on every message - notably, even in the presence of this bugfix, the (unanticipated) symptom disappears, but the original context (in the loadtest receiver) has a new bug. namely, it never proceeds past receiving
PREKEY_BUNDLE
s because it always throws the sameInvalidMessageException
as above - why... we might ask does receiving (or sending -- also tested!) a first message to an actual phone work okay, but sending one to a bot running singalc fails? where did this sneak bug sneak in during the last month of work? all great questions, which are deferred until #478 to solve :)
Edited by aguestuser