First, let me say that I'm a complete beginner at Python. I've never learned the language, I just thought "how hard can it be" when Google turned up nothing but Python snippets to solve my problem. :)
I have a bunch of mailboxes in Maildir format (a backup from the mail server on my old web host), and I need to extract the emails from these. So far, the simplest way I've found has been to convert them to the mbox format, which Thunderbird supports, and it seems Python has a few classes for reading/writing both formats. Seems perfect.
The Python docs even have this little code snippet doing exactly what I need:
import mailbox

src = mailbox.Maildir('maildir', factory=None)
dest = mailbox.mbox('/tmp/mbox')
for msg in src:  #1
    dest.add(msg)  #2
Except it doesn't work. And here's where my complete lack of knowledge about Python sets in.
On a few messages, I get a UnicodeDecodeError during the iteration (that is, when it's trying to read msg from src, on line #1). On others, I get a UnicodeEncodeError when trying to add msg to dest (line #2).
Clearly it makes some wrong assumptions about the encoding used. But I have no clue how to specify an encoding on the mailbox (For that matter, I don't know what the encoding should be either, but I can probably figure that out once I find a way to actually specify an encoding).
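One way to sidestep the encoding question entirely might be to copy each message as raw bytes instead of letting the mailbox classes decode it. The following is only a minimal sketch under some assumptions: it needs a Python version where `Mailbox.get_bytes()` exists and `add()` accepts bytes (3.2 or later, so not the 3.0 install shown in the traces), and the function name `copy_maildir_to_mbox` and the choice to collect failed keys are made up for illustration.

```python
import mailbox


def copy_maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message as raw bytes; return (copied_count, failed_keys)."""
    # create=False makes a wrong path fail instead of creating an empty Maildir.
    src = mailbox.Maildir(maildir_path, factory=None, create=False)
    dest = mailbox.mbox(mbox_path)
    copied, failed = 0, []
    try:
        for key in src.iterkeys():
            try:
                # get_bytes() returns the message undecoded, so no codec is
                # involved when reading or when appending to the mbox.
                dest.add(src.get_bytes(key))
                copied += 1
            except (OSError, UnicodeError):
                # Remember the key so the problem message can be inspected later.
                failed.append(key)
    finally:
        dest.close()  # flushes pending writes to disk
        src.close()
    return copied, failed


# Example call, using the same paths as the snippet from the docs:
# copied, failed = copy_maildir_to_mbox('maildir', '/tmp/mbox')
```

Handling errors per message means one bad mail does not abort the whole conversion, and the returned keys point at the files that still need a look.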
I get stack traces similar to the following:
File "E:\Python30\lib\mailbox.py", line 102, in itervalues
value = self[key]
File "E:\Python30\lib\mailbox.py", line 74, in __getitem__
return self.get_message(key)
File "E:\Python30\lib\mailbox.py", line 317, in get_message
msg = MaildirMessage(f)
File "E:\Python30\lib\mailbox.py", line 1373, in __init__
Message.__init__(self, message)
File "E:\Python30\lib\mailbox.py", line 1345, in __init__
self._become_message(email.message_from_file(message))
File "E:\Python30\lib\email\__init__.py", line 46, in message_from_file
return Parser(*args, **kws).parse(fp)
File "E:\Python30\lib\email\parser.py", line 68, in parse
data = fp.read(8192)
File "E:\Python30\lib\io.py", line 1733, in read
eof = not self._read_chunk()
File "E:\Python30\lib\io.py", line 1562, in _read_chunk
self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
File "E:\Python30\lib\io.py", line 1295, in decode
output = self.decoder.decode(input, final=final)
File "E:\Python30\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 37: character maps to <undefined>
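The last frames of that trace can be reproduced in isolation. This is a small sketch, not taken from the actual mails: the sample bytes are invented (they happen to be UTF-8 and end in 0x9d), and `try_decode` is a hypothetical helper. It shows that cp1252 has no character assigned to byte 0x9d, while other codecs accept the same bytes.

```python
# Invented sample: UTF-8 bytes for 'café ”'; the final byte is 0x9d.
raw = b"caf\xc3\xa9 \xe2\x80\x9d"


def try_decode(data, encoding):
    """Return the decoded text, or a short failure note if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason


for enc in ("cp1252", "utf-8", "latin-1"):
    print(enc, "->", try_decode(raw, enc))
```

Note that latin-1 maps every byte to some character, so it never raises, but that does not mean its result is the text the sender intended.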
And on the UnicodeEncodeErrors:
File "E:\Python30\lib\email\message.py", line 121, in __str__
return self.as_string()
File "E:\Python30\lib\email\message.py", line 136, in as_string
g.flatten(self, unixfrom=unixfrom)
File "E:\Python30\lib\email\generator.py", line 76, in flatten
self._write(msg)
File "E:\Python30\lib\email\generator.py", line 108, in _write
self._write_headers(msg)
File "E:\Python30\lib\email\generator.py", line 141, in _write_headers
header_name=h, continuation_ws='\t')
File "E:\Python30\lib\email\header.py", line 189, in __init__
self.append(s, charset, errors)
File "E:\Python30\lib\email\header.py", line 262, in append
input_bytes = s.encode(input_charset, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 16: ordinal not in range(128)
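The second error can also be shown on its own: a header value containing 'å' (U+00E5) cannot be encoded as ASCII. The sketch below uses an invented header value and a hypothetical helper, `ascii_safe`, which falls back to an RFC 2047 encoded word via `email.header.Header` with an explicit charset. It illustrates the mechanism and is not a claim about what the mailbox module does internally.

```python
from email.header import Header, decode_header

# Invented header value containing 'å' (U+00E5), the character from the error.
name = "Bj\xe5rn"


def ascii_safe(value):
    """Return value unchanged if it is pure ASCII, else an RFC 2047 encoded word."""
    try:
        value.encode("ascii")
        return value
    except UnicodeEncodeError:
        # Declaring the charset explicitly lets Header encode non-ASCII text.
        return Header(value, charset="utf-8").encode()


encoded = ascii_safe(name)
print(encoded)

# Round trip: decode_header() yields (bytes, charset) pairs for encoded words.
payload, charset = decode_header(encoded)[0]
print(payload.decode(charset))
```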
Anyone able to help me out here? (Suggestions for completely different solutions not involving Python are obviously welcome too. I just need a way to import the mails from these Maildir files.)
Updates:
sys.getdefaultencoding() returns 'utf-8'
I uploaded sample messages which cause both errors. This one throws UnicodeEncodeError, and this throws UnicodeDecodeError
I tried running the same script in Python 2.6, and got TypeErrors instead:
File "c:\python26\lib\mailbox.py", line 529, in add
self._toc[self._next_key] = self._append_message(message)
File "c:\python26\lib\mailbox.py", line 665, in _append_message
offsets = self._install_message(message)
File "c:\python26\lib\mailbox.py", line 724, in _install_message
self._dump_message(message, self._file, self._mangle_from_)
File "c:\python26\lib\mailbox.py", line 220, in _dump_message
raise TypeError('Invalid message type: %s' % type(message))
TypeError: Invalid message type: <type 'instance'>
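Regarding the first update: as far as I understand, `sys.getdefaultencoding()` is not the value that text-mode file reads use. Unless an encoding is passed explicitly (or UTF-8 mode is enabled), `open()` falls back to the locale's preferred encoding, which would explain the cp1252 frames in the first trace. A small sketch to compare the two values; `encoding_report` is a made-up helper name:

```python
import locale
import sys


def encoding_report():
    """Collect the two encodings that are easy to confuse."""
    return {
        # Used for implicit str/bytes conversions; 'utf-8' on Python 3.
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        # Used by open() in text mode when no encoding argument is given.
        "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
    }


for label, value in encoding_report().items():
    print(label, "=", value)
```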
sys.getdefaultencoding() (import sys before you do that; I never worked with 3.0, Windows box). I am able to reproduce the 2nd error, not the 1st. – Illative