Get the Gmail attachment filename without downloading it

I'm trying to list all the messages in a Gmail account, some of which may contain large attachments (about 30MB). I just need the attachment names, not the files themselves. I found a piece of code that gets a message and the attachment's name, but it downloads the file and only then reads its name:

import email
import imaplib

# log in and select the inbox
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('username', 'password')
mail.select('inbox')

# get the UIDs of all messages
result, data = mail.uid('search', None, 'ALL')
uids = data[0].split()

# read the latest message (RFC822 downloads the whole message, attachments included)
result, data = mail.uid('fetch', uids[-1], '(RFC822)')
m = email.message_from_bytes(data[0][1])

if m.get_content_maintype() == 'multipart':  # multipart messages only
    for part in m.walk():
        # find the attachment part
        if part.get_content_maintype() == 'multipart': continue
        if part.get('Content-Disposition') is None: continue

        # save the attachment in the program directory
        filename = part.get_filename()
        with open(filename, 'wb') as fp:
            fp.write(part.get_payload(decode=True))
        print('%s saved!' % filename)

I have to do this once a minute, so I can't download hundreds of MB of data each time. I am new to web scripting, so could anyone help me? I don't need to use imaplib specifically; any Python library will do.

Best regards

Sublimation answered 1/12, 2012 at 20:57 Comment(2)
You can send only 20MB in Gmail, are you aware of that? - Matchbox
I mean all the attachments in all the messages combined. - Sublimation
E
9

Rather than fetch RFC822, which is the full content, you could specify BODYSTRUCTURE.

The resulting data structure from imaplib is pretty confusing, but you should be able to find the filename, content-type and sizes of each part of the message without downloading the entire thing.
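
A minimal sketch of one way to make sense of that structure: BODYSTRUCTURE is a nested, parenthesized list, so it can be turned into nested Python lists and searched for FILENAME parameters. The helper names `parse_sexp` and `find_filenames` and the sample response are made up for illustration, and the sketch assumes a well-formed response that the server sent as quoted strings rather than IMAP literals.

```python
import re

# Tokens: "(" or ")", a quoted string (backslash escapes allowed), or a bare atom.
_TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')


def parse_sexp(text):
    """Parse an IMAP parenthesized list into nested Python lists (hypothetical helper)."""
    stack = [[]]
    for tok in _TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            finished = stack.pop()
            stack[-1].append(finished)
        elif tok.startswith('"'):
            # Drop the surrounding quotes and undo backslash escapes.
            stack[-1].append(re.sub(r"\\(.)", r"\1", tok[1:-1]))
        elif tok.upper() == "NIL":
            stack[-1].append(None)
        else:
            stack[-1].append(tok)
    return stack[0]


def find_filenames(node):
    """Collect the value that follows every FILENAME key, at any nesting depth."""
    names = []
    for i, item in enumerate(node):
        if isinstance(item, list):
            names.extend(find_filenames(item))
        elif (isinstance(item, str) and item.lower() == "filename"
              and i + 1 < len(node) and isinstance(node[i + 1], str)):
            names.append(node[i + 1])
    return names


# Invented sample of what data[0] might look like after
# mail.uid('fetch', uid, '(BODYSTRUCTURE)'); a real response will differ.
sample = (
    b'1 (UID 42 BODYSTRUCTURE (("TEXT" "PLAIN" ("CHARSET" "UTF-8") NIL NIL '
    b'"7BIT" 12 1 NIL NIL NIL)("APPLICATION" "PDF" ("NAME" "ad cde gh.pdf") '
    b'NIL NIL "BASE64" 31457280 NIL ("ATTACHMENT" ("FILENAME" "ad cde gh.pdf")) '
    b'NIL) "MIXED" ("BOUNDARY" "abc") NIL NIL))'
)

print(find_filenames(parse_sexp(sample.decode("ascii"))))
```

Because the quoted strings are tokenized as a whole, filenames containing spaces or parentheses survive intact. The part size sits in the same nested list as the filename (the number after the transfer encoding), so the same walk could be extended to report sizes.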

Evildoer answered 1/12, 2012 at 21:31 Comment(4)
And that's what I was looking for... The result is truly confusing, but it works. Thank you so much! - Sublimation
That's exactly what I was looking for too. But do you have any clues as to how to parse that crazy result string? @mopsiok, how did you deal with it? - Marchpast
I ran some tests with it, but the results weren't very nice. Actually, I found that a list of attachments alone is not enough for my application. In the end I fetch the whole message and extract the text and all attachments by walking through it. I no longer have the parsing code; as I said, it was ineffective. Sorry... - Sublimation
New readers, refer to the EDIT of #13664172. - Jansen
O
3

If you know something about the file name, you can use the X-GM-RAW Gmail extension to the IMAP SEARCH command. This extension lets you use any Gmail advanced search query to filter the messages. This way you can restrict the downloads to the matching messages, or exclude messages you don't want.

mail.uid('search', None, 'X-GM-RAW',
       'has:attachment filename:pdf in:inbox -label:parsed')

The above searches for messages with PDF attachments in the inbox that are not labeled "parsed".

Some pro tips:

  • Label the messages you have already parsed, so you don't need to fetch them again (the -label:parsed filter in the above example).
  • Always use the UID version of the commands instead of the standard sequential IDs (you are already doing this).
  • Unfortunately MIME is messy: a lot of clients do weird (or plain wrong) things. You could try to download and parse only the headers, but is it worth the trouble?

[edit]

If you label a message after parsing it, you can skip the messages you have parsed already. This should be reasonable enough to monitor your class mailbox.

Perhaps you live in a corner of the world where internet bandwidth is more expensive than programmer time; in that case, you can fetch only the headers and look for a "Content-Disposition" value such as "attachment; filename=somefilename.ext".
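
As a rough sketch of that header check, the standard library's email parser can read a block of MIME headers and report the declared filename without any body being present. The function name `filename_from_headers` and the sample headers are invented for illustration, and `get_content_disposition()` assumes Python 3.5 or newer.

```python
from email import message_from_bytes


def filename_from_headers(raw_headers):
    """Return the attachment filename declared in a block of MIME headers, or None."""
    part = message_from_bytes(raw_headers)
    # get_content_disposition() yields 'attachment', 'inline' or None.
    if part.get_content_disposition() != "attachment":
        return None
    return part.get_filename()


# Invented headers of a single attachment part; no payload follows them.
sample_headers = (
    b'Content-Type: application/pdf; name="report 2012.pdf"\r\n'
    b'Content-Disposition: attachment; filename="report 2012.pdf"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
)

print(filename_from_headers(sample_headers))
```

Letting the parser handle the header avoids hand-written string comparisons, which tend to break on quoting and capitalization differences between mail clients.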

Octamerous answered 1/12, 2012 at 21:1 Comment(5)
It's cool, but the problem is I don't know anything about the attachment. I'm writing a script to "scan" the whole Gmail inbox of my class's account and tell me whether there is something new, including info about attachments (name and size). Searching for unread messages wouldn't work because the account is used by 30 people. - Sublimation
At least you can skip the messages without attachments and the messages you have already parsed; note that you can also filter by size. - Octamerous
Of course I can, and skipping the messages I have already parsed is not a problem. The problem is parsing the next 20 messages, each with a 20MB attachment, in one minute. - Sublimation
Hi Paulo, I used the advanced search. My problem is that I want to search for xls files, so I used filename:xls, but I ended up with both xls and xlsx files. Do you know how to search for xls files only? - Dishonor
@Cacheing: perhaps this is worth asking as a new question; the comment system is messing up my answer. - Octamerous
D
2

A FETCH of the RFC822 message data item is functionally equivalent to BODY[]. IMAP4 supports other message data items, listed in section 6.4.5 of RFC 3501.

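Try requesting a different set of message data items to get just the information that you need. For example, you could try RFC822.HEADER, or maybe BODY.PEEK[1.MIME] for the MIME headers of a single part (in RFC 3501 the MIME specifier must be prefixed by a part number).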
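
To illustrate what the top-level headers can and cannot tell you, here is a small sketch that parses a header-only response, such as the payload a FETCH of RFC822.HEADER would return. The function name `summarize_top_level` and the sample headers are invented. For a multipart message the attachment filenames normally live in the headers of the individual parts, so the top-level headers usually only reveal that the message is multipart.

```python
from email import message_from_bytes


def summarize_top_level(raw_headers):
    """Summarize what a header-only fetch reveals about a message."""
    msg = message_from_bytes(raw_headers)
    return {
        "subject": msg["Subject"],
        "content_type": msg.get_content_type(),
        # None unless the message itself (not one of its parts) declares a filename.
        "filename": msg.get_filename(),
    }


# Invented top-level headers of a multipart message.
sample = (
    b"From: teacher@example.com\r\n"
    b"Subject: Homework\r\n"
    b'Content-Type: multipart/mixed; boundary="abc"\r\n'
    b"\r\n"
)

print(summarize_top_level(sample))
```

A multipart content type is a cheap hint that attachments may be present; the per-part names then have to come from the part headers or from BODYSTRUCTURE.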

Diffractometer answered 1/12, 2012 at 21:16 Comment(0)
B
1

Old question, but I just wanted to share the solution I came up with today. It searches for all emails with attachments and outputs the UID, sender, subject, and a formatted list of attachments. Below is the relevant part of the code, edited to show how to parse BODYSTRUCTURE:

    data      = mailobj.uid('fetch', mail_uid, '(BODYSTRUCTURE)')[1]
    struct    = data[0].decode('ascii').split()  # imaplib returns bytes on Python 3
    filenames = []                               # holds the attachment filenames

    for j, k in enumerate(struct):
        if k == '("FILENAME"':
            # the value is expected to end with '"))', so look for the quote at [-3]
            count = 1
            val = struct[j + count]
            while val[-3] != '"':
                count += 1
                val += " " + struct[j + count]
            filenames.append(val[1:-3])
        elif k == '"FILENAME"':
            # the value is followed by another token, so it ends with the closing quote
            count = 1
            val = struct[j + count]
            while val[-1] != '"':
                count += 1
                val += " " + struct[j + count]
            filenames.append(val[1:-1])

I've also published it on GitHub.

EDIT

The solution above works, but the logic that extracts the attachment filename from the payload is not robust. It fails when the filename contains a space and its first word has only two characters,

for example: "ad cde gh.png".

Try this:

import re  # somewhere at the top

result, data = mailobj.uid("fetch", mail_uid, "BODYSTRUCTURE")

# raw string, so the backslash reaches the regex engine unchanged
itr = re.finditer(r'("FILENAME" "([^\/:*?"<>|]+)")', data[0].decode("ascii"))

for match in itr:
    print(f"File name: {match.group(2)}")

Test Regex here.

Blackshear answered 27/6, 2017 at 20:26 Comment(0)
