I have a very big .mbox file (several GB) that I want to split into as many text files as there are emails. I followed the answer by Oki Erie Rinaldi to the following SO question: How to split an mbox file into n-MB big chunks using the terminal?
Content of mboxsplit.txt:
BEGIN { chunk = 0; filesize = 0 }
/^From / {
    if (filesize >= 40000000) { #file size per chunk in byte
        close("chunk_" chunk ".txt");
        filesize = 0;
        chunk++;
    }
}
{ filesize += length() }
{ print > ("chunk_" chunk ".txt") }
Then run this line in the directory that contains mboxsplit.txt and the mbox file:
awk -f mboxsplit.txt The_large_mbox_file.mbox
To get one file per email, I replaced the third line of mboxsplit.txt as follows:
if(filesize>=0){#file size per chunk in byte
and the last line of mboxsplit.txt as follows:
{print > ("chunk_" sprintf("%04d",chunk) ".txt");}
However, I get the following error:
awk: can't open file chunk_0253.txt input record number 924244, file The_large_mbox_file.mbox source line number 10
N.B., the code works for
{print > ("chunk_" sprintf("%03d",chunk) ".txt");}
but not for
{print > ("chunk_" sprintf("%04d",chunk) ".txt");}
{print > ("chunk_" sprintf("%05d",chunk) ".txt");}
{print > ("chunk_" sprintf("%06d",chunk) ".txt");}
As the file number in the error message (chunk_0253.txt) is close to 256, it might be related to an overflow in the sprintf function. Or is it something else?
Comments:

[...] /proc/$$/fd directory (where $$ represents the process's PID). This shows the handles opened by the process. I don't know awk, so I can't help more. – Glia

[...] seq 300 | awk '{ print $RN > ( "a" $RN ".txt" ) }'. Even with MUCH larger sequences (e.g. 3000). Linux in WSL2 on Win11.) – Glia

[...] {print > ("chunk_" chunk ".txt")} to {print > ("chunk_" sprintf("%04d",chunk) ".txt");} (and another unrelated change). They said they do not experience the problem without this change (contrary to what you claim), and I don't see a problem with the code they said works (contrary to what you claim). – Glia
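
A possible explanation, offered as a guess rather than a confirmed diagnosis: if only the two lines described in the question were changed, the close() call still refers to the unpadded name ("chunk_" chunk ".txt"), which no longer matches the padded name that print writes to. In that case no output file is ever closed, and awk would stop once it reaches its limit on simultaneously open files. That would fit an error near file 253 better than an overflow in sprintf. Below is a minimal sketch of one way around it, assuming one output file per message is the goal: build the name once and close the previous file by that same name. The sample.mbox content and the variable name out are made up for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so the sketch does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny three-message stand-in for The_large_mbox_file.mbox (made-up content).
printf '%s\n' \
  'From alice@example.com Mon Jan  1 00:00:00 2024' 'Subject: one' '' 'body one' \
  'From bob@example.com Tue Jan  2 00:00:00 2024' 'Subject: two' '' 'body two' \
  'From carol@example.com Wed Jan  3 00:00:00 2024' 'Subject: three' '' 'body three' \
  > sample.mbox

# On every "From " line: close the file opened for the previous message
# (using the exact same name it was opened with), then pick the next name.
awk '
/^From / {
    if (out != "") close(out)
    out = sprintf("chunk_%04d.txt", chunk++)
}
# Lines before the first "From " line (if any) are skipped.
out != "" { print > out }
' sample.mbox

# List the files that were produced.
ls chunk_*.txt
```

Keeping the name in a single variable means the padding width can be changed in one place without the close() and print targets drifting apart.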