How can I parse an email header with python?
B

5

5

Here's an example email header:

header = """
From: Media Temple user ([email protected])
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: [email protected]
Return-Path: <[email protected]>
Envelope-To: [email protected]
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <[email protected]>) id 1KDoNH-0000f0-RL for [email protected]; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <[email protected]>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""

The header is stored as a string. How do I parse it into a dictionary whose keys are the header field names and whose values are the field values?

I want a dictionary like this,

header_dict = {
    'From': 'Media Temple user ([email protected])',
    'Subject': 'article: A sample header',
    'Date': 'January 25, 2011 3:30:58 PM PDT',
    # ... and so on for the remaining fields
}

I made a list of the required fields:

header_reqd = ['From:','Subject:','Date:','To:','Return-Path:','Envelope-To:','Delivery-Date:','Received:','Dkim-Signature:','Domainkey-Signature:','Message-Id:','Mime-Version:','Content-Type:','X-Spam-Status:','X-Spam-Level:','Message Body:']

These list items can likely serve as the keys of the dictionary.

Boiling answered 14/5, 2015 at 14:38 Comment(1)
Check out docs.python.org/3/library/email.parser.html - Caeoma
P
7

Most of these answers seem to have overlooked the Python email parser, and their output is not quite right: the values keep a leading space. The OP has also probably made a typo by starting the header string with a newline, which must be stripped for the email parser to work.

from email.parser import HeaderParser
header = header.strip() # Fix incorrect formatting
email_message = HeaderParser().parsestr(header)
dict(email_message)

Output (truncated):

>>> from pprint import pprint
>>> pprint(dict(email_message))
{'Content-Type': 'multipart/alternative; '
                 'boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': 'January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': 'Tue, 25 Jan 2011 15:31:01 -0700',
 ...
 'Subject': 'article: A sample header',
 'To': '[email protected]',
 'X-Spam-Level': '***',
 'X-Spam-Status': 'score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
                  'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
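
To see why the strip() call matters, here is a minimal sketch that parses the same text with and without it. The two-field header and the example.com address are made up for illustration; the point is that a blank first line ends the header section before any field is read.

```python
from email.parser import HeaderParser

# Hypothetical two-field header with the same leading newline as the question.
header = """
From: sender@example.com
Subject: article: A sample header
"""

# The blank first line is read as the end of the header section.
unstripped = HeaderParser().parsestr(header)
# After strip() the first line is a real header field.
stripped = HeaderParser().parsestr(header.strip())

print(dict(unstripped))  # no header fields are found
print(dict(stripped))
```

Without the strip, the remaining lines are treated as the message payload rather than as headers.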

Duplicate header keys

Be aware that email message headers can contain duplicate keys, as mentioned in the Python documentation for email.message:

Headers are stored and returned in case-preserving form, but field names are matched case-insensitively. Unlike a real dict, there is an ordering to the keys, and there can be duplicate keys. Additional methods are provided for working with headers that have duplicate keys.

For example, when the following email message is converted to a Python dict, only the first Received value is retained.

headers = HeaderParser().parsestr("""Received: by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <[email protected]>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <[email protected]>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)""")

dict(headers)
{'Received': 'by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)'}

Use the get_all method to check for duplicates:

headers.get_all('Received')
['by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <[email protected]>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <[email protected]>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)']
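
If you want a dictionary that keeps every occurrence, one option is to map each field name to a list of values. This is only a sketch under assumptions: the sample header text and the names grouped and headers_multi are invented for the example, and it relies on items() returning every (name, value) pair, repeated names included.

```python
from collections import defaultdict
from email.parser import HeaderParser

# Hypothetical sample with a repeated Received field (host names are made up).
raw = (
    "Received: by mx.example.net with SMTP id abc123\n"
    "Received: from mail.example.org by mx.example.net\n"
    "Subject: Duplicate header demo\n"
)

message = HeaderParser().parsestr(raw)

# Collect every value under its field name, in the order it appears.
grouped = defaultdict(list)
for name, value in message.items():
    grouped[name].append(value)

headers_multi = dict(grouped)
print(headers_multi)
```

Note that the keys are kept exactly as spelled in the message, so fields that differ only in letter case would end up under separate keys in this sketch.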
Personalism answered 30/6, 2021 at 11:1 Comment(0)
S
1

You can split the string on newlines, then split each line once on ":":

>>> my_header = {}
>>> for x in header.strip().split("\n"):
...     x = x.split(":", 1)
...     my_header[x[0]] = x[1]
... 
Slipperwort answered 14/5, 2015 at 14:43 Comment(2)
Will 'Date': 'January 25, 2011 3:30:58 PM PDT' work with your code? After the split, x[0] is the key and x[1] is the value, so the result will be 'Date': ' January 25, 2011 3'. - Flagellate
@VivekSable I hadn't seen that date format; updated now :), thanks. - Slipperwort
F
1

split will work for you:

Demo:

>>> result = {}
>>> for i in header.split("\n"):
...     i = i.strip()
...     if i:
...         k, v = i.split(":", 1)
...         result[k] = v

output:

>>> import pprint
>>> pprint.pprint(result)
{'Content-Type': ' multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': ' January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
 'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
 'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
 'Envelope-To': ' [email protected]',
 'From': ' Media Temple user ([email protected])',
 'Message Body': ' **The email message body**',
 'Message-Id': ' <[email protected]>',
 'Mime-Version': ' 1.0',
 'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <[email protected]>) id 1KDoNH-0000f0-RL for [email protected]; Tue, 25 Jan 2011 15:31:01 -0700',
 'Return-Path': ' <[email protected]>',
 'Subject': ' article: A sample header',
 'To': ' [email protected]',
 'X-Spam-Level': ' ***',
 'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}
Flagellate answered 14/5, 2015 at 14:46 Comment(6)
You can use header.splitlines() and it will remove the newlines too. - Melody
@PadraicCunningham: Yes. It removes the last blank line but not the first, e.g. >>> s = """\n1\n2\n3\n""" >>> s.splitlines() ['', '1', '2', '3']. So it is best to strip before splitting. Correct? - Flagellate
The first newline is probably not actually there; it is just how the OP posted the input. """From: Media Temple user ([email protected]) would be the actual start of the string. Plus one anyway, you got the split correct. - Melody
@PadraicCunningham: OK. Can you explain more about your code, or share a link? A generator object is created and then you create the dictionary from it. - Flagellate
Each line is split into a list: Subject: article: A sample header -> ["Subject", " article: A sample header"]. Try running dict([["Subject", " article: A sample header"]]) from an interpreter and you will see what happens. In my code you simply have multiple sublists. - Melody
@PadraicCunningham: Yes. - Flagellate
M
1
header = """From: Media Temple user ([email protected])
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: [email protected]
Return-Path: <[email protected]>
Envelope-To: [email protected]
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <[email protected]>) id 1KDoNH-0000f0-RL for [email protected]; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <[email protected]>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""   

Split the string into individual lines, then split each line once on ":":

from pprint import pprint as pp
pp(dict(line.split(":",1) for line in header.splitlines()))

Output:

{'Content-Type': ' multipart/alternative; '
                 'boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': ' January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
 'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; '
                   's=gamma; '
                   'h=domainkey-signature:received:received:message-id:date:from:to '
                   ':subject:mime-version:content-type; '
                   'bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; '
                   'b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea '
                   'LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m '
                   'CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
 'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; '
                        'h=message-id:date:from:to:subject:mime-version:content-type; '
                        'b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH '
                        '36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB '
                        '6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
 'Envelope-To': ' [email protected]',
 'From': ' Media Temple user ([email protected])',
 'Message Body': ' **The email message body**',
 'Message-Id': ' '
               '<[email protected]>',
 'Mime-Version': ' 1.0',
 'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by '
             'cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from '
             '<[email protected]>) id 1KDoNH-0000f0-RL for '
             '[email protected]; Tue, 25 Jan 2011 15:31:01 -0700',
 'Return-Path': ' <[email protected]>',
 'Subject': ' article: A sample header',
 'To': ' [email protected]',
 'X-Spam-Level': ' ***',
 'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
                  'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}

line.split(":",1) makes sure we split only once on :, so any : in the values is left untouched. Each line becomes a sublist holding a key/value pair, and calling dict builds the dictionary from those pairs.
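
The effect of the split limit is easy to check on a single line. The sketch below also includes a small helper, split_header_line, which is a hypothetical name introduced here: it uses str.partition instead of split and trims the leading space that the plain split leaves in the value.

```python
def split_header_line(line):
    """Split 'Name: value' on the first colon only and trim whitespace."""
    name, _, value = line.partition(":")
    return name.strip(), value.strip()

line = "Subject: article: A sample header"

# Without a limit, every colon splits, so the value is cut into pieces.
print(line.split(":"))          # ['Subject', ' article', ' A sample header']
# With a limit of 1 only the first colon splits; the value keeps a leading space.
print(line.split(":", 1))       # ['Subject', ' article: A sample header']
# Hypothetical helper: same first-colon split, with whitespace trimmed.
print(split_header_line(line))  # ('Subject', 'article: A sample header')
```

Like the one-liner above, this handles only one field per physical line; it does not join header values that are folded across several lines.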

Melody answered 14/5, 2015 at 14:50 Comment(1)
@VivekSable, that is probably because the OP has a newline before the first line; use header.splitlines()[1:]. - Melody
I
0

To parse an email, you can use the Python standard email library. In particular, the Parser API can be used to load an email (from either memory or file) and create the corresponding EmailMessage object.

For example:

from email.parser import Parser
from email.policy import default as DefaultPolicy

raw_message = """From: [email protected]
Subject: Subject test
Date: January 25, 2011 3:30:58 PM PDT
To: [email protected]
Content-Type: text/plain; charset="utf-8"

Email message body test."""

message = Parser(policy=DefaultPolicy).parsestr(raw_message)

headers = {}
# Unique headers:
for header in ["Content-Type", "Date", "From", "Reply-To", "Sender", "Subject", "To"]:
    headers[header] = message.get(header) if header in message else ""
# Duplicated headers:
for header in ["Received"]:
    headers[header] = message.get_all(header) if header in message else []
print(f"Headers: {headers}")  # Headers: {'Content-Type': 'text/plain; charset="utf-8"', 'Date': 'Tue, 25 Jan 2011 03:30:58 -0000', 'From': '[email protected]', 'Reply-To': '', 'Sender': '', 'Subject': 'Subject test', 'To': '[email protected]', 'Received': []}

body = message.get_body()
print(f"Body: {body}")  # Body: From: [email protected][...]

NOTE: Keep in mind that, as shown in the example above, some headers may appear more than once (e.g., "Received"). Accessing those via message.get(header) or message[header] will not return all their occurrences; you will need to use message.get_all(header) instead.

If the email is retrieved in bytes, rather than in a string, you can use BytesParser rather than Parser:

from email.parser import BytesParser
from email.policy import default as DefaultPolicy

raw_message = b"""From: [email protected]
Subject: Subject test
Date: January 25, 2011 3:30:58 PM PDT
To: [email protected]
Content-Type: text/plain; charset="utf-8"

Email message body test."""

message = BytesParser(policy=DefaultPolicy).parsebytes(raw_message)
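
If only the headers are of interest, as in the original question, the standard library also provides BytesHeaderParser, which stops parsing at the end of the header section. The following is a sketch rather than a recommendation: the message text and the example.com addresses are placeholders, and converting each value with str() is just one way to get plain strings into the dictionary.

```python
from email.parser import BytesHeaderParser
from email.policy import default as DefaultPolicy

# Hypothetical raw message; the addresses are placeholders.
raw_message = b"""From: sender@example.com
Subject: Subject test
To: recipient@example.com
Content-Type: text/plain; charset="utf-8"

Email message body test."""

# Only the header section is parsed; the body is kept as an unparsed payload.
message = BytesHeaderParser(policy=DefaultPolicy).parsebytes(raw_message)

# Convert the header objects to plain strings, one entry per field name.
headers = {name: str(value) for name, value in message.items()}
print(headers)
```

As with dict(message), this comprehension keeps a single value per name, so repeated fields still need get_all().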

Finally, keep in mind that some emails may be "multipart". To parse the various email parts, you can use walk(). For example:

print(f"Is multipart? {message.is_multipart()}")  # Is multipart? False
for part in message.walk():
    print(f"Charset: {part.get_content_charset()}")
    print(f"Content-Disposition: {part.get_content_disposition()}")
    print(f"Content-Type: {part.get_content_type()}")
    print(f"Is attachment? {part.is_attachment()}")
    if part.is_attachment():
        print(f"Filename: {part.get_filename()}")
Inaptitude answered 25/9 at 15:18 Comment(0)
