How to split an mbox file into n-MB big chunks using the terminal?

I've read through this question on SO, but it doesn't quite help me. I want to import a Gmail-generated mbox file into another webmail service, but the service only allows files of up to 40 MB per import.

So I somehow have to split the mbox file into files of at most 40 MB each and import them one after another. How would you do this?

My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into files of about 40 MB, but I still wouldn't know how to do this in the terminal.

I also looked at the split command, but I'm afraid it would cut off mails. Thanks for any help!

Impostume answered 23/1, 2015 at 13:0 Comment(0)
B
15

If your mbox is in standard format, each message will begin with From and a space:

From [email protected]

So, you could COPY YOUR MBOX TO A TEMPORARY DIRECTORY and try using awk to process it on a message-by-message basis, splitting only at the start of a message. Let's say we went for 1,000 messages per output file:

awk 'BEGIN{chunk=0} /^From /{msgs++;if(msgs==1000){msgs=0;chunk++}}{print > ("chunk_" chunk ".txt")}' mbox

Then you will get output files called chunk_0.txt to chunk_n.txt, each containing up to 1,000 messages.

If you are unfortunate enough to be on Windows (whose command prompt does not treat single quotes as quoting characters), you will need to save the following in a file called awk.txt

BEGIN{chunk=0} /^From /{msgs++;if(msgs==1000){msgs=0;chunk++}}{print > ("chunk_" chunk ".txt")}

and then type

awk -f awk.txt mbox
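
Because a fixed message count only indirectly controls file size, it can help to check the resulting chunks before uploading. The following is a minimal sketch, assuming the chunk_*.txt names used above; the function name oversized_chunks and the byte limit passed to it are illustrative and not part of the original answer:

```shell
# List every given file whose size in bytes exceeds the limit.
# oversized_chunks is a hypothetical helper name.
oversized_chunks() {
    limit=$1
    shift
    for f in "$@"; do
        # Arithmetic expansion strips the padding some wc versions print.
        size=$(($(wc -c < "$f")))
        if [ "$size" -gt "$limit" ]; then
            printf '%s %d\n' "$f" "$size"
        fi
    done
}

# Example call (prints nothing if every chunk fits):
#   oversized_chunks 40000000 chunk_*.txt
```

If anything is listed, remove the chunk files and rerun the split with a smaller message count.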
Benfield answered 23/1, 2015 at 14:7 Comment(18)
How do I make sure they're each under about 40 MB? - Impostume
Try with 10,000 messages and, if the files are too big, remove the chunk files and decrease the 10,000 (say, to 5,000) before running again. It is not scientific, but I guess you don't have to do it every day, so you may need to experiment a bit. - Benfield
Can I run this right in the console? And mbox is the file URL? - Impostume
If you haven't got the mbox file on your local machine, you will need to download it first. You can use curl "http://yourprovider/somepath/mbox" > mbox.local, or FTP, or click some link that your provider gives you. - Benfield
Of course I have it :) I was just wondering what mbox is in your script. - Impostume
So do I make a .sh file of that, or what? - Impostume
When I make a .sh file of it, I get an error: awk: syntax error at source line 2 context is /^From / >>> {msgs++;if((msgs==1000){ <<< awk: illegal statement at source line 2 awk: illegal statement at source line 2 missing ) - Impostume
I have simplified it - just copy the one line, paste it into the Terminal and press Enter. - Benfield
Copying and executing the line in the terminal gives me this: awk: illegal statement at source line 1 - Impostume
It expects a file called mbox in the directory where you are running the command. Type ls -l mbox and see if you can see the mbox file. - Benfield
You aren't on Windows, are you? - Benfield
LOL, no, I'm not :D I'm on an MBP, OS X 10.10. - Impostume
I renamed the file to just mbox; ls -lah lists it, just like ls -l mbox, but the command still throws a syntax error. Is there a missing ( or something? - Impostume
Nope, still the same error: awk: illegal statement at source line 1. I even started to edit your answer to make sure that I didn't copy unwanted characters. There just seems to be something wrong. I have no clue, but to me print > "chunk_" chunk ".txt" looks somehow weird. Is this correct syntax? - Impostume
Let us continue this discussion in chat. - Benfield
Is there a way of re-combining these files back into a single .mbox? - Pede
@Pede Sure, just loop through all the chunks and use cat to append them together in a new mbox file. - Benfield
@Impostume I'm a little late to the table but ... that illegal statement error occurs because an unparenthesized expression on the right side of input or output redirection, such as print > "chunk_" chunk ".txt", is undefined behavior per POSIX, so it'll fail in some awks. It needs to be print > ("chunk_" chunk ".txt") to work in all awks (but adding close() and using a variable, close(out); out="chunk_" chunk ".txt"; ... print > out, would be better). - Medeah
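
The last two suggestions in these comments (keeping the output name in a variable with close(), and re-joining the chunks with cat) can be sketched roughly as follows. This is only an illustration under assumptions: the function names split_by_count and rejoin_chunks are made up here, and the chunk_N.txt naming follows the answer above.

```shell
# Split an mbox into chunks of per_chunk messages each, closing every
# chunk file before the next one is opened. Writes chunk_0.txt,
# chunk_1.txt, ... into the current directory.
split_by_count() {
    mbox=$1
    per_chunk=$2
    awk -v n="$per_chunk" '
        BEGIN { out = "chunk_0.txt" }
        /^From / {
            if (msgs == n) {              # current chunk is full
                close(out)
                out = "chunk_" (++chunk) ".txt"
                msgs = 0
            }
            msgs++
        }
        { print > out }                   # a plain variable is a safe redirection target
    ' "$mbox"
}

# Write the chunks to standard output in numeric order. A plain
# "cat chunk_*.txt" would sort chunk_10.txt before chunk_2.txt.
rejoin_chunks() {
    i=0
    while [ -f "chunk_$i.txt" ]; do
        cat "chunk_$i.txt"
        i=$((i + 1))
    done
}

# Example calls:
#   split_by_count mbox 1000
#   rejoin_chunks > rejoined.mbox
```

Comparing the result with cmp mbox rejoined.mbox is a quick way to confirm that nothing was lost.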
B
19

I just improved the script from Mark Sechell's answer. As we can see, that script splits the mbox file based on the number of emails per chunk. This improved script splits the mbox file based on a defined maximum size for each chunk.
So, if you have a size limit when uploading or importing the mbox file, you can try the script below to split it into chunks of a specified size (see the notes below).
Save the script below to a text file, e.g. mboxsplit.txt, in the directory that contains the mbox file (e.g. named mbox):

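BEGIN { chunk = 0; filesize = 0 }
/^From / {
    # Switch chunks only at a message boundary, once the current chunk is full
    if (filesize >= 40000000) {    # chunk size limit in bytes
        close("chunk_" chunk ".txt")
        filesize = 0
        chunk++
    }
}
{ filesize += length($0) + 1 }    # +1 for the newline; length() counts characters, which may differ from bytes in multibyte locales
{ print > ("chunk_" chunk ".txt") }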

Then run this line in that directory (the one containing mboxsplit.txt and the mbox file):

  awk -f mboxsplit.txt mbox

Please note:

  • A resulting chunk may be larger than the defined size, because the size is only checked at the start of each message: the last email added to a chunk can push it past the limit.
  • It will not split an email across chunks.
  • A chunk may contain only one email if that email is larger than the specified chunk size.

I suggest setting the chunk size somewhat below the maximum upload/import size.

Burlington answered 6/3, 2017 at 11:23 Comment(9)
If I use this, is there a way to re-combine the split files? Thanks - Pede
Of course, you can combine them! Remember that the split files are text files, which can simply be combined. - Burlington
It worked perfectly with the huge mbox files generated by Google Takeout. I use them to import mail into Horde (GoDaddy email accounts). I just changed the size of the chunk mbox files from 40 MB to 100 MB. - Unmake
Excellent answer. So fast as well. - Ciceronian
An improvement to consider: use sprintf to zero-pad the file name indexes. Something like: {print > ("chunk_" sprintf("%03d",chunk) ".txt");} - Sabina
Fantastic solution, thank you! This came in handy for the 255 MB import upload limit on Roundcube. - Etalon
Second @Sabina's comment. OP should modify the original script or mention this modification in the answer. - Overtrump
Many thanks for this very good answer. I modified it as follows: {print > ("chunk_" sprintf("%06d",chunk) ".txt");} but probably ran into a sprintf function overflow. I posted it as a separate question: #78256911. Does someone know how to fix the code? - Solidus
Change if(filesize>=40000000) to if((filesize+length())>=40000000) or similar to ensure the output files are under 40MB. - Medeah
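
The zero-padding and size-check suggestions from these comments can be combined in one variant. The sketch below is an illustration, not the answer author's script: it buffers one whole message in memory and decides where it goes before writing it, so a chunk only exceeds the limit when a single message is larger than the limit by itself. The function name split_by_size and the chunk_NNN.txt pattern are assumptions.

```shell
# Split an mbox into chunk_000.txt, chunk_001.txt, ... so that each chunk
# stays at or below max_bytes, unless a single message is larger than that.
split_by_size() {
    mbox=$1
    max_bytes=$2
    # LC_ALL=C asks awk to count bytes rather than multibyte characters
    # (assumed to hold for the awk implementation in use).
    LC_ALL=C awk -v max="$max_bytes" '
        function flush_msg() {
            if (msg == "") return
            # Open a new chunk if this message would push the current one over the limit.
            if (size > 0 && size + msgsize > max) {
                close(out)
                chunk++
                size = 0
            }
            out = sprintf("chunk_%03d.txt", chunk)
            printf "%s", msg > out
            size += msgsize
            msg = ""
            msgsize = 0
        }
        /^From / { flush_msg() }          # a new message starts: write out the previous one
        { msg = msg $0 "\n"; msgsize += length($0) + 1 }
        END { flush_msg() }
    ' "$mbox"
}

# Example call:
#   split_by_size mbox 40000000
```

Thanks to the zero-padded names, cat chunk_*.txt > rejoined.mbox puts the pieces back in the right order for up to 1,000 chunks.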
K
4

formail is perfectly suited for this task. Have a look at formail's +skip and -total options:

Options
...
+skip
Skip the first skip messages while splitting.
-total
Output at most total messages while splitting.

Depending on the size of your mailbox and mails, you may try

formail -100 -s <google.mbox >import-01.mbox
formail +100 -100 -s <google.mbox >import-02.mbox
formail +200 -100 -s <google.mbox >import-03.mbox

etc.

The parts need not be of equal size, of course. If there's one large e-mail, you may have only formail +100 -60 -s <google.mbox >import-02.mbox, or if there are many small messages, maybe formail +100 -500 -s <google.mbox >import-02.mbox.

To look for an initial number of mails per chunk, try

formail -100 -s <google.mbox | wc
formail -500 -s <google.mbox | wc
formail -1000 -s <google.mbox | wc

You may need to experiment a bit to find numbers that fit your mailbox. On the other hand, since this seems to be a one-time task, you may not want to spend too much time on this.
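
Typing the +skip/-total commands by hand gets tedious for a large mailbox. As a rough sketch, the helper below only prints the formail command lines for a fixed number of messages per chunk, so they can be reviewed before being run. The name print_formail_commands is made up, and counting lines that start with "From " is an approximation that may not match formail's own message detection exactly.

```shell
# Print (do not run) one formail command per chunk of per_chunk messages.
print_formail_commands() {
    mbox=$1
    per_chunk=$2
    # Approximate message count: lines beginning with "From ".
    total=$(grep -c '^From ' "$mbox")
    skip=0
    n=1
    while [ "$skip" -lt "$total" ]; do
        if [ "$skip" -eq 0 ]; then
            printf 'formail -%d -s <%s >import-%02d.mbox\n' "$per_chunk" "$mbox" "$n"
        else
            printf 'formail +%d -%d -s <%s >import-%02d.mbox\n' "$skip" "$per_chunk" "$mbox" "$n"
        fi
        skip=$((skip + per_chunk))
        n=$((n + 1))
    done
}

# Example call:
#   print_formail_commands google.mbox 100
```

The printed lines have the same shape as the commands shown above and can be pasted into the terminal once the chunk sizes look right.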

Kazukokb answered 15/4, 2019 at 8:24 Comment(0)
S
0

My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into files of about 40 MB, but I still wouldn't know how to do this in the terminal.

If I understand you correctly, you want to split the file up, then combine the pieces back into one big file before importing. That sounds like what split and cat were meant to do. split splits a file according to your size specification, given in lines or bytes, and adds a suffix to the output files to keep them in order. You then use cat to put the files back together:

$ split -b 40m -a 5 mbox mbox.  # this makes mbox.aaaaa, mbox.aaaab, etc.

Once you get the files on the other system:

$ cat mbox.* > mbox

You wouldn't do this if each piece has to be imported into the new mail system on its own, one at a time, because then no message may be split between files.
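
For the transfer-and-recombine case, the round trip can be checked before the pieces are sent anywhere. A minimal sketch, assuming a POSIX split and cmp; the function name split_and_verify and the ".part." prefix are illustrative choices, not part of the commands above:

```shell
# Split src into pieces of the given size, then confirm that
# concatenating the pieces reproduces the original byte for byte.
split_and_verify() {
    src=$1
    size=$2
    split -b "$size" -a 5 "$src" "$src.part." || return 1
    # The shell expands the glob in sorted order: aaaaa, aaaab, ...
    cat "$src".part.* | cmp -s - "$src"
}

# Example call (returns 0 when the pieces match the original):
#   split_and_verify mbox 40m
```

Pieces cut this way are only useful when re-joined, since individual messages may be cut in half.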

Sanches answered 23/1, 2015 at 14:56 Comment(2)
Not quite. I thought formail was a good idea for exporting each email into its own text file and then building chunks of about < 40 MB from those, so that I can import them. split might cut the file right in the middle of an email, so that it would not get imported correctly. - Impostume
split will split an email across two separate files. But you made it sound like you're recombining the files before importing. If that's the case, it doesn't matter that split split up an individual email, because cat would have patched it back together. Note that split only breaks on line boundaries with -l; with -b it cuts at exact byte offsets. - Sanches
T
0

I came across this problem when trying to import an MBOX file from Gmail into Thunderbird (on Windows). ImportExportTools told me that the file was larger than 2 GB and stopped.

The solution was surprisingly easy, as Thunderbird itself uses MBOX files for local email storage.

  1. Create a new local folder in Thunderbird (e.g., "Import"), then close Thunderbird.
  2. Locate Thunderbird's mail folder, for example C:\Users\uf501ap\AppData\Roaming\Thunderbird\Profiles\xyz.default-release\Mail\Local Folders
  3. Replace the empty file that corresponds to your newly created folder ("Import" in this example) with the MBOX file you want to import.
  4. Open Thunderbird and select the newly created folder. Thunderbird will then start indexing the file and, after a minute or so, show the emails contained in the MBOX file.
Tsushima answered 13/11, 2023 at 13:52 Comment(0)
M
0

To export everything from Gmail and import it into Thunderbird:

  1. Use Google Takeout to export Gmail.
  2. In Thunderbird, under Local Folders, create a subfolder with a distinctive name.
  3. Exit Thunderbird completely and search recursively under C:\Users\(name)\AppData\Roaming\Thunderbird\Profiles for a file with this same distinctive name.
  4. Replace this file with the mbox file found in the zip file exported by Google Takeout.
  5. Open Thunderbird and select the subfolder again.
  6. If it's a large mbox (e.g. 6 GB), nothing seems to happen at first, but a small message at the bottom indicates that the index is being built.
  7. Once the index is built, the Thunderbird folder will be full and will match Gmail.
Mull answered 16/8 at 22:26 Comment(1)
If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context. - From Review - Lei
