I'm borrowing the following code to parse email headers, and additionally to add a header further down the line. Admittedly, I don't fully understand the reason for all the scaffolding around what should be straightforward usage of the email.Header module.
Noteworthy is that Header is not instantiated; rather, its decode_header function is called:
    class DecodedHeader(object):

        def __init__(self, s, folder):
            self.msg = email.message_from_string(s[1])
            self.info = parseList(s[0])
            self.folder = folder

        def __getitem__(self, name):
            if name.lower() == 'folder':
                return self.folder
            elif name.lower() == 'uid':
                return self.info[1][3]
            elif name.lower() == 'flags':
                return ','.join(self.info[1][1])
            elif name.lower() == 'internal-date':
                ds = self.info[1][5]
                if Options.dateFormat:
                    ds = time.strftime(Options.dateFormat,
                                       imaplib.Internaldate2tuple('INTERNALDATE "' + ds + '"'))
                return ds
            elif name.lower() == 'size':
                return self.info[1][7]
            val = self.msg.__getitem__(name)
            if val == None:
                return None
            return self._convert(email.Header.decode_header(val), name)

        def get(self, key, default=None):
            return self.__getitem__(key)

        def _convert(self, list, name):
            l = []
            for s, encoding in list:
                try:
                    if (encoding != None):
                        s = unicode(s, encoding, 'replace').encode(Options.encoding, 'replace')
                except Exception, e:
                    print >>sys.stderr, "Encoding error", e
                l.append(s)
            res = "".join(l)
            if Options.addr and name.lower() in ('from', 'to', 'cc', 'return-path', 'reply-to'):
                res = self._modifyAddr(res)
            if Options.dateFormat and name.lower() in ('date'):
                res = self._formatDate(res)
            return res
Here's the problem: When the header (val) contains non-ASCII characters such as Ä and ä, I get:
    Traceback (most recent call last):
      File "v12.py", line 434, in <module>
        main()
      File "v12.py", line 396, in main
        writer.writerow(msg)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 152, in writerow
        return self.writer.writerow(self._dict_to_list(rowdict))
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 149, in _dict_to_list
        return [rowdict.get(key, self.restval) for key in self.fieldnames]
      File "v12.py", line 198, in get
        return self.__getitem__(key)
      File "v12.py", line 196, in __getitem__
        return self._convert(email.Header.decode_header(val),name)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/email/header.py", line 76, in decode_header
        header = str(header)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
where u'\xe4' is ä.
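For comparison, here is a minimal sketch of the kind of input decode_header is designed for: an ASCII-only RFC 2047 encoded-word. It is not taken from v12.py; the sample header value and the names raw, parts and decoded are made up for illustration, and the lowercase email.header spelling is an assumption chosen because it imports on both Python 2.7 and Python 3. The traceback above fails at str(header), which suggests that val reaches decode_header as a unicode object already containing a raw ä rather than an encoded-word like this one.

```python
from email.header import decode_header

# Hypothetical sample: "Järvinen" as an RFC 2047 encoded-word (pure ASCII on the wire).
raw = '=?iso-8859-1?q?J=E4rvinen?='

# decode_header returns a list of (value, charset) pairs; it does not build text itself.
parts = decode_header(raw)

decoded = []
for value, charset in parts:
    # Encoded-word parts come back as byte strings plus the charset that was declared.
    if isinstance(value, bytes) and charset:
        value = value.decode(charset, 'replace')
    decoded.append(value)
```

If this runs cleanly while the real header fails, the difference is presumably in what type and content val has by the time it is passed in, not in decode_header itself.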
I've tried a few things:
- Adding # -*- coding: utf-8 -*- to the top of header.py
- Calling unicode() on val before passing it to decode_header()
- Calling .encode('utf-8') on val before passing it to decode_header()
- Calling .encode('ISO-8859-1') on val before passing it to decode_header()
No joy with any of the above. What is the cause here? Given that I'm looking to maintain the usage of email.Header as above (with Header not instantiated directly), how do we ensure that non-ASCII characters get successfully decoded by decode_header?