PHP: fetch over 20,000 IMAP emails

I'm trying to export several mailboxes to a database. My current script connects via IMAP and simply loops over all messages. With larger mailboxes, though, this won't work: it slows down or even stops.

The idea is to run the script daily to "copy" all messages that are not yet in the database. What's the best way to fetch large amounts of e-mail (20k messages spread over about 40-50 folders)?

Eventually this will need to work from a single server scanning hundreds or even thousands of accounts daily (so imagine the amount of data). It will store each mail (UID and subject) in the database and create a package to be stored on the data server (so it also needs to fetch the attachments).

Internode answered 1/3, 2013 at 20:26 Comment(0)

So you want to perform email backup via IMAP. There are professional software tools that do this.

Let's start with something simple: downloading an email for one specific user from the inbox folder. This requires you to (a) log in with the user's credentials, (b) select the INBOX folder, and (c) download the message (let's assume that you already know its UID, which is 55). You do this in IMAP as follows (requests only - responses not shown):

01 LOGIN username password
02 SELECT INBOX
03 UID FETCH 55 BODY[]
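With PHP's imap extension, those three steps look roughly like this (a sketch: the host, credentials, and UID 55 are placeholders, and selecting INBOX happens as part of the mailbox string passed to imap_open):

```php
<?php
// Connect, log in, and select INBOX in one call.
$imap = imap_open('{imap.example.com:993/imap/ssl}INBOX', 'username', 'password');
if ($imap === false) {
    die('Connection failed: ' . imap_last_error());
}

// Equivalent of "UID FETCH 55 BODY[]": an empty section string
// with FT_UID fetches the entire raw message by UID.
$raw = imap_fetchbody($imap, 55, '', FT_UID);

imap_close($imap);
```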

Each message in a particular folder is given a UID. This is a unique identifier for the message that never changes and is never reused by any other message in that folder. New messages always get a higher UID than previous ones. This makes the UID a useful tool to determine whether you have already downloaded a message.

Next step: let us now look at downloading all new messages in the INBOX folder. Let's assume that you're downloading messages for the first time, and the INBOX currently has messages with UIDs 54, 55 and 57. You can download these messages all at once using a command such as:

03 UID FETCH 54,55,57 BODY[]

(You might want to break this up into batches (e.g. 30 at a time) if there are a lot to download.) After doing that, you store the highest UID you have downloaded so far. Next time, you can check for higher UIDs as follows:
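The batching itself is plain array work in PHP. A sketch (the batch size of 30 is just the example figure from above):

```php
<?php
// Split a list of UIDs into batches and build one comma-separated
// UID set per batch, ready to use in a "UID FETCH" command.
function uidBatches(array $uids, int $size = 30): array
{
    $sets = [];
    foreach (array_chunk($uids, $size) as $chunk) {
        // IMAP accepts comma-separated UID sets, e.g. "54,55,57".
        $sets[] = implode(',', $chunk);
    }
    return $sets;
}
```

For example, `uidBatches([54, 55, 57], 2)` yields `['54,55', '57']`.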

04 UID FETCH 58:* UID

That retrieves only the UIDs of messages from 58 onwards. If you get results, you download those messages and again store the highest UID. And so on.
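In PHP, checking for UIDs above the last one you stored could look like this (a sketch: $imap is an open connection with the folder already selected, and $lastUid would come from your database):

```php
<?php
$lastUid = 57; // highest UID stored so far (placeholder)

// Equivalent of "UID FETCH 58:* UID": each overview entry has a ->uid field.
$overview = imap_fetch_overview($imap, ($lastUid + 1) . ':*', FT_UID);

$newUids = [];
foreach ($overview as $msg) {
    // For "N:*" the server may still return the highest existing message
    // even when its UID is below N, so filter out UIDs we already have.
    if ($msg->uid > $lastUid) {
        $newUids[] = $msg->uid;
    }
}
```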

There is one catch. The UIDs in a folder are only valid as long as the folder's UIDVALIDITY attribute (included in the response to the SELECT command) does not change. If it changes for whatever reason, the UIDs are invalidated and you need to download all messages in that folder all over again.
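PHP exposes UIDVALIDITY through imap_status. A sketch of the check (the mailbox string is a placeholder, and $storedUidValidity would come from your database):

```php
<?php
$status = imap_status($imap, '{imap.example.com:993/imap/ssl}INBOX', SA_UIDVALIDITY);

if ($status !== false && $status->uidvalidity !== $storedUidValidity) {
    // UIDVALIDITY changed: every cached UID for this folder is now
    // meaningless, so reset the stored state and re-download everything.
    $storedUidValidity = $status->uidvalidity;
    // ... clear stored UIDs for this folder and do a full fetch ...
}
```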

Finally, you want to extend this to work for all folders for all users. In order to get all folders for a particular user, you use the IMAP LIST command:

05 LIST "" "*"

You will need to know the credentials for the users beforehand and loop over them.
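A rough PHP outline of that outer loop (a sketch: the server string and account list are placeholders, and OP_HALFOPEN is used so the initial connection selects no mailbox):

```php
<?php
$server = '{imap.example.com:993/imap/ssl}';
$accounts = [
    ['user' => 'alice', 'pass' => 'secret1'],
    ['user' => 'bob',   'pass' => 'secret2'],
];

foreach ($accounts as $acct) {
    // OP_HALFOPEN: connect and authenticate without selecting a folder.
    $imap = imap_open($server, $acct['user'], $acct['pass'], OP_HALFOPEN);
    if ($imap === false) {
        continue; // log imap_last_error() and move on to the next account
    }

    // Equivalent of 'LIST "" "*"': returns the full mailbox names.
    foreach (imap_list($imap, $server, '*') as $folder) {
        imap_reopen($imap, $folder); // select this folder
        // ... check UIDVALIDITY and fetch new UIDs as shown above ...
    }

    imap_close($imap);
}
```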

This is the IMAP theory behind what you need to do. A full PHP implementation is left as an exercise.

Strangulate answered 4/3, 2013 at 19:48 Comment(11)
I can download them in batches with something like a cronjob which runs each minute and checks if there is a batch to process. Though my application will need to check over 1k-5k IMAP boxes and retrieve all new mails at least once a day. Assuming the IMAP boxes have around 10k messages spread over 50 folders on average, won't the import take way too long? Is there any way I can speed this up with PHP? Will 25x the cronjob (each starting at a different box) speed the process up by 25x? – Internode
You want to download large amounts of data, and you will have to deal with the limitations of that. You will no doubt be limited by the amount of processing power and bandwidth available to you... so you can try some optimisations (like you mentioned, running tasks in parallel), but you will still hit a limit at some point and not be able to speed things up further. I'd recommend running small jobs more regularly (rather than once a day) so the amount to download is relatively small and incremental. – Strangulate
Also, I presume that by 10k messages you are referring to a first-time download of all messages. Yes, that will take quite a while. You are still going to be limited by your resources though. Maybe have multiple machines archiving different accounts in parallel. – Strangulate
Is it possible that I have several e-mails which don't have a UID? Currently I have fingerprinted them by creating an MD5 hash (from the fields: fromaddress, toaddress, subject, date). – Internode
No, all messages MUST have a UID according to RFC 3501. Hashing is not the best way... especially if you aren't taking the body into consideration. It is possible (although statistically unlikely) that two different messages resolve to the same hash - it's called a collision. – Strangulate
The problem I'm still facing is that I can currently fetch the bigger IMAP boxes (10k+ messages) easily in the background. But when I want to sync, say once a day, I'm still having problems: do I have to scan through all folders to find any new messages? – Internode
No, you don't scan every message in the folders. You just keep track of the highest UID in each folder, and retrieve messages from there onwards. If the highest UID is N, then you want to fetch messages in the range N:* (as I already explained). – Strangulate
But as you said, the UID is unique per folder. What happens when a user deletes all messages from folderB and moves a single message from folderA to folderB? – Internode
If the user deletes all messages from a folder, then new UIDs are still bound to be greater than those that were there before (see UIDNEXT in the response to SELECT). When you move a message, you aren't moving it in the way we understand moving files on a hard disk. A client moves a message by deleting it from folderA and APPENDing it to folderB. – Strangulate
How can I be sure a single message is unique if I fetch all folders? Currently I'm fetching everything the first time and then (lastUID):* on subsequent runs. But this way a single message could end up in my database multiple times. – Internode
As far as IMAP is concerned, those are different messages. If they're identical, and you want identical messages to be stored once in your database (to save space), you will have to check the message contents. You can use a hashing mechanism (as you suggested) but that is only a preliminary check for equality. If the hash matches, you then have to check that the messages are actually the same, because different messages may possibly have the same hash (collisions, as I explained earlier). – Strangulate
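The hash-then-verify check discussed in these comments could be sketched like this in PHP (the field choice follows the earlier comment; the array shape and function names are invented for illustration):

```php
<?php
// Cheap fingerprint as a preliminary duplicate check.
function fingerprint(string $from, string $to, string $subject, string $date): string
{
    return md5($from . "\0" . $to . "\0" . $subject . "\0" . $date);
}

// Only treat two messages as duplicates when the full content also
// matches, since different messages can share a hash (a collision).
function isDuplicate(array $a, array $b): bool
{
    return fingerprint($a['from'], $a['to'], $a['subject'], $a['date'])
            === fingerprint($b['from'], $b['to'], $b['subject'], $b['date'])
        && $a['body'] === $b['body'];
}
```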

Are you using imap_ping?

imap_ping() pings the stream to see if it's still active. It may discover new mail; this is the preferred method for a periodic "new mail check" as well as a "keep alive" for servers which have inactivity timeout.

Other ones to look at: imap_timeout and imap_reopen.

The fact that there is a method called reopen suggests something, doesn't it? :)
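A keep-alive pattern using those functions might look like this (a sketch: the timeout value and mailbox string are placeholders):

```php
<?php
// Raise the open timeout for slow servers before connecting.
imap_timeout(IMAP_OPENTIMEOUT, 60);

$mailbox = '{imap.example.com:993/imap/ssl}INBOX';
$imap = imap_open($mailbox, 'username', 'password');

// ... long-running processing ...

// Periodically ping the connection and re-establish it if it dropped.
if (!imap_ping($imap)) {
    imap_reopen($imap, $mailbox);
}
```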

Another option that comes to mind, if you just can't seem to keep the connection, is to export the data to mbox format and process it locally. That might be faster for a huge mailbox and removes the timeout/connection issues.

Celestine answered 1/3, 2013 at 20:31 Comment(4)
It will have to be able to connect to several different servers, so locally is not an option, unfortunately. The idea is to get all "new" e-mails once. But how can I be sure without looping through them all again (and checking whether they exist in the database)? – Internode
Ah, that is easy: IMAP has a 'seen' flag, right? Also, there should be a sequence... msgno? Alternatively, forward the emails to an archive you can process and expunge once handled. You describe a common use case for IMAP. – Celestine
The mailboxes will just be "archived" to the database, so the seen/unseen flags won't work. – Internode
See the second argument of imap_fetch_overview: "A message sequence description. You can enumerate desired messages with the X,Y syntax, or retrieve all messages within an interval with the X:Y syntax." I would be surprised if you can't do what you describe with out-of-the-box functions. – Celestine

© 2022 - 2024 — McMap. All rights reserved.