
fix #458 - more proper encoding handling

ng requested to merge fix-458-revisit-character-encoding into master

This revisits the fixes for #409 (closed) (merged in !301 (merged)): unencrypted but signed UTF-8 mails failed to parse, which in turn generated another error. More details follow below.

With this fix we are changing the approach:

  1. We switch to UTF-8 as the default input encoding.
  2. If the input is not valid UTF-8, we try to convert it to UTF-8.
  3. If the conversion fails, we scrub the input, so that at least the parts that are proper UTF-8 are passed on.
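
The three steps above can be sketched roughly as follows. This is only an illustration using Ruby's standard library, not the code from this merge request: the method name `normalize_to_utf8` and the `fallback_encoding` parameter are hypothetical, and the fixed fallback stands in for the encoding detection that the merge request delegates to `charlock_holmes`.

```ruby
# Hypothetical helper; name and fallback parameter are assumptions for illustration.
def normalize_to_utf8(raw, fallback_encoding = Encoding::ISO_8859_1)
  # Step 1: treat the bytes as UTF-8 (dup so the caller's string is untouched).
  text = raw.dup.force_encoding(Encoding::UTF_8)
  return text if text.valid_encoding?

  begin
    # Step 2: reinterpret the bytes in the guessed encoding and transcode to UTF-8.
    raw.dup.force_encoding(fallback_encoding).encode(Encoding::UTF_8)
  rescue EncodingError
    # Step 3: conversion failed, so drop the invalid byte sequences and keep the rest.
    text.scrub('')
  end
end
```

With a Latin-1 fallback, the bytes `"caf\xE9"` would be converted to a valid UTF-8 string; with a fallback that cannot represent the byte (for example US-ASCII), the invalid byte is dropped instead.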

ASCII-8BIT is a BINARY encoding, and the mail library will only force the mail to have CRLF line endings if it contains nothing but ASCII content. UTF-8, by contrast, is not a BINARY encoding, so the mail will get CRLF whenever it is validly encoded. Getting CRLF is important for mails such as the one in signed_utf8.eml, as otherwise the parts detection fails and the whole body ends up in the prologue, with no parts.

Enforcing UTF-8 will still make some of our charset mails fail, as they are (validly) not UTF-8. By using a new dependency, 'charlock_holmes', we are able to detect the actual encoding and thus try to convert the input to UTF-8. If everything fails, we just drop the invalid characters.
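
To illustrate why the input encoding matters for the CRLF decision, here is a small sketch of the rule as described above. The predicate `crlf_conversion_safe?` is hypothetical and only mirrors the described behaviour; it is not the mail library's actual API.

```ruby
# Hypothetical predicate mirroring the rule described above.
def crlf_conversion_safe?(string)
  # Pure ASCII is always safe; otherwise the string must be non-binary and validly encoded.
  string.ascii_only? ||
    (string.encoding != Encoding::BINARY && string.valid_encoding?)
end

body = "Gr\u00FC\u00DFe\n"                        # UTF-8 text with non-ASCII characters

utf8_ok      = crlf_conversion_safe?(body)        # valid UTF-8
binary_ok    = crlf_conversion_safe?(body.b)      # same bytes tagged as ASCII-8BIT
ascii_bin_ok = crlf_conversion_safe?("Hello\n".b) # binary, but ASCII only
```

Under this rule the same bytes are eligible for CRLF normalisation when tagged as UTF-8, but not when tagged as ASCII-8BIT, which matches the failure seen with signed_utf8.eml.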
