fix #458 - more proper encoding handling
This revisits the fix for #409 (merged in !301): unencrypted but signed UTF-8 mails failed to parse and thus generated another error. More details below.
With this fix we are changing the approach:
- We switch to UTF-8 as default input
- If the input is not valid UTF-8, we try to convert it to UTF-8
- If the conversion fails, we scrub the input, so that at least the valid UTF-8 parts are passed on.
ASCII-8BIT is a BINARY encoding, and the mail library will only
force the mail to have CRLF if it contains only ASCII content.
UTF-8, on the other hand, is not a BINARY encoding and thus will
get CRLF if it is a valid encoding. Getting CRLF is important for
mails such as the one in signed_utf8.eml, as otherwise the parts
detection will fail and the whole body will end up in the prologue,
with no parts.
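
To illustrate the distinction, here is a small sketch using only core Ruby. The helper `crlf_normalizable?` is a hypothetical stand-in that mirrors the rule described above; it is not the mail library's own code.

```ruby
# Raw bytes of a short UTF-8 body, tagged as ASCII-8BIT (alias BINARY).
raw = "Gr\xC3\xBC\xC3\x9Fe\n".b
raw.encoding      # ASCII-8BIT
raw.ascii_only?   # false, the umlaut bytes are >= 0x80

# The same bytes, tagged as UTF-8 instead.
utf8 = raw.dup.force_encoding(Encoding::UTF_8)
utf8.valid_encoding?  # true

# Hypothetical helper mirroring the rule described above:
# binary strings qualify only if they are pure ASCII,
# other strings qualify if their encoding is valid.
def crlf_normalizable?(str)
  if str.encoding == Encoding::ASCII_8BIT
    str.ascii_only?
  else
    str.valid_encoding?
  end
end
```

Under this rule the binary-tagged string would be left without CRLF, while the UTF-8-tagged copy of the same bytes would be normalized.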
Enforcing UTF-8 will still make some of our charset mails fail,
as they are - validly - not UTF-8. By using a new dependency,
'charlock_holmes', we are able to detect the actual encoding and
thus try to convert it to UTF-8. If everything fails, we just
drop the invalid characters.
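
The fallback chain can be sketched roughly as follows. This is a minimal illustration and not the code from this change: `to_utf8` and its `detect:` parameter are hypothetical names, and the detector is passed in as a callable so that the sketch runs with core Ruby alone.

```ruby
# Hypothetical helper illustrating the three steps described above.
# `detect` is an assumed callable that returns an encoding name or nil.
def to_utf8(raw, detect: ->(_bytes) { nil })
  # Step 1: treat the input as UTF-8 and keep it if it is valid.
  str = raw.dup.force_encoding(Encoding::UTF_8)
  return str if str.valid_encoding?

  # Step 2: ask the detector for the actual encoding and transcode.
  if (source = detect.call(raw))
    begin
      return raw.dup.force_encoding(source).encode(Encoding::UTF_8)
    rescue EncodingError, ArgumentError
      # Unknown encoding name or failed conversion: fall through to scrubbing.
    end
  end

  # Step 3: drop whatever is not valid UTF-8.
  str.scrub('')
end
```

With charlock_holmes, the detector could presumably be something like `->(bytes) { (CharlockHolmes::EncodingDetector.detect(bytes) || {})[:encoding] }`. Ruby does not necessarily know every encoding name the detector reports, which is why the sketch also rescues `ArgumentError`.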