Decode email headers with quoted encoded words
commit f9943a3ac7f21d4c62e06e212d606691ff184ade
author Peter Grayson <jpgrayson@gmail.com>
Sun, 28 Jan 2018 05:26:52 +0000 (28 00:26 -0500)
committer Peter Grayson <jpgrayson@gmail.com>
Sun, 28 Jan 2018 05:30:04 +0000 (28 00:30 -0500)
tree 96b354fa429a8948cff05bee23c369c699091bcd
parent a56e7126a7c554fac840adc40ffa834b44247c59
Decode email headers with quoted encoded words

Although not clearly RFC-2047 compliant, there are cases in the wild where
email headers contain encoded words within quotes. E.g.:

  From: "=?UTF-8?q?Christian=20K=C3=B6nig?=" <name@example.com>

Python2 and Python3 behave differently when parsing such a header with
email.header.decode_header(). Python3 decodes the encoded words between
the quotes, whereas Python2 leaves them in their encoded state. The goal
of this change is to make Python2 behave like Python3.
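
As a rough illustration of the Python3 behavior described above, the
following sketch passes the example header to
email.header.decode_header() and collects the parts that carry a charset.
It assumes a Python3 interpreter; the variable names are illustrative and
are not taken from stgit.

```python
import email.header

# Example header value from above, with the encoded word inside quotes.
header = '"=?UTF-8?q?Christian=20K=C3=B6nig?=" <name@example.com>'

# decode_header() returns a list of (decoded, charset) pairs.
parts = email.header.decode_header(header)

# Keep only the parts that were recognized as encoded words and turn
# their bytes into text using the charset named in the header.
# Under an unpatched Python2 the header would come back undecoded, so
# this list would be empty.
names = [
    word.decode(charset)
    for word, charset in parts
    if charset is not None
]
print(names)
```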

A regular expression (email.header.ecre) is used in
email.header.decode_header() to detect and parse encoded words. This regex
differs subtly between Python2 and Python3: the Python2 regex requires
whitespace or end-of-string after an encoded word, whereas the Python3
regex does not. Monkey-patching the Python2 email.header.ecre regex so that
it no longer requires trailing whitespace is sufficient to make Python2
behave the same as Python3.
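
The difference between the two regexes can be sketched as follows. The
patterns below are simplified reconstructions written for illustration,
not copies of the standard library source or of the stgit patch; the names
ENCODED_WORD, strict_ecre and relaxed_ecre are hypothetical.

```python
import quopri
import re

# Core of an RFC 2047 encoded word: =?charset?encoding?text?=
ENCODED_WORD = (
    r'=\?(?P<charset>[^?]*?)'
    r'\?(?P<encoding>[qQbB])'
    r'\?(?P<encoded>.*?)'
    r'\?='
)

# Python2-like variant: the encoded word must be followed by
# whitespace or the end of the string.
strict_ecre = re.compile(ENCODED_WORD + r'(?=[ \t]|$)', re.MULTILINE)

# Python3-like variant: no constraint on what follows.
relaxed_ecre = re.compile(ENCODED_WORD, re.MULTILINE)

header = '"=?UTF-8?q?Christian=20K=C3=B6nig?=" <name@example.com>'

# The closing quote directly after "?=" defeats the strict lookahead.
strict_match = strict_ecre.search(header)

# The relaxed pattern still finds the encoded word inside the quotes.
relaxed_match = relaxed_ecre.search(header)

# Decode the quoted-printable payload using the charset from the match.
payload = relaxed_match.group('encoded').encode('ascii')
decoded = quopri.decodestring(payload, header=True).decode(
    relaxed_match.group('charset'))

print(strict_match is None, decoded)
```

Under these assumptions, the monkey-patch described above amounts to
replacing email.header.ecre on Python2 with a pattern that lacks the
trailing lookahead.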

A new test is added to t1800-import.sh to verify this behavior.

Signed-off-by: Peter Grayson <jpgrayson@gmail.com>
stgit/commands/common.py
t/t1800-import.sh
t/t1800-import/email-quoted-from [new file with mode: 0644]