Buffer overflow of sprintf using awk (or something else?)?
I have a very big .mbox file (several GB) that I want to split into as many text files as there are emails. I followed the answer by Oki Erie Rinaldi to the following SO question: How to split an mbox file into n-MB big chunks using the terminal?

content of mboxsplit.txt:

BEGIN{chunk=0;filesize=0;}
/^From /{
    if(filesize>=40000000){ #file size per chunk in bytes
        close("chunk_" chunk ".txt");
        filesize=0;
        chunk++;
    }
}
{filesize+=length()}
{print > ("chunk_" chunk ".txt")}

Then run this command in the directory that contains mboxsplit.txt and the mbox file:

  awk -f mboxsplit.txt The_large_mbox_file.mbox

Since I want one file per email, I replaced the third line of mboxsplit.txt as follows (so that a new chunk starts at every From line):

if(filesize>=0){ #always true, so a new chunk starts at every From line

and the last line of mboxsplit.txt as follows:

{print > ("chunk_" sprintf("%04d",chunk) ".txt");}

However, I get the following error:

awk: can't open file chunk_0253.txt input record number 924244, file The_large_mbox_file.mbox source line number 10

N.B., the code works for

{print > ("chunk_" sprintf("%03d",chunk) ".txt");}

but not for

{print > ("chunk_" sprintf("%04d",chunk) ".txt");}
{print > ("chunk_" sprintf("%05d",chunk) ".txt");}
{print > ("chunk_" sprintf("%06d",chunk) ".txt");}

As the file number in the error message (chunk_0253.txt) is close to 256, might this be related to an overflow in the function sprintf, or is it something else?

Plangent asked 1/4 at 15:45 Comment(7)
This is virtually guaranteed to be related to the max number of open file descriptors. 3 for stdin, stdout and stderr, plus 253 equals 256. And a limit of 255 open file descriptors for the process sounds normal. (Actually, 256 sounds more normal, so maybe there's another handle open by awk itself.) This would suggest that your handles aren't getting closed as you'd expect. On Linux, you could verify this by listing the files in the /proc/$$/fd directory (where $$ represents the process's pid). This shows the handles opened by the process. Don't know awk, so can't help more.Glia
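For instance, one way to watch the descriptor count of the running awk on Linux, as the comment suggests (illustrative shell commands; assumes the splitter is running in another terminal and pgrep finds it):

    pid=$(pgrep -n awk)          # PID of the most recent awk process (assumed to be the splitter)
    ls /proc/$pid/fd | wc -l     # number of descriptors it has open right now
    ulimit -n                    # the shell's per-process open-file limit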
(Can't replicate with seq 300 | awk '{ print $RN > ( "a" $RN ".txt" ) }'. Even with MUCH larger sequences (e.g. 3000). Linux in WSL2 on Win11.)Glia
Dear @ikegami, many thanks for your comments. I don't know what the mechanism is, but I found a workaround in the answer below. Feel free to add some perspective if you understand the reason why it did not work.Plangent
What that code does depends on which awk variant you're running and the limits set on the Unix box you're executing it on. As @Glia points out, you aren't closing files as you go and so are exceeding the limit on how many simultaneously open files your process can have. In that situation most awks will just fail as you saw, but GNU awk will try to manage opening/closing files for you as needed to stay under the Unix limit; the consequence is that it won't fail but it will slow down, which I expect is what ikegami is seeing.Adapt
You said "I Have a very big .mbox (several GB) that I want to split into as many text files as there are email" but that's not exactly what the script in your question does, it splits the input into multiple text files but 1 output file could contain any number of emails, e.g. if 10 emails in a row are 4000000 characters long each then you'll get 1 output file containing all 10 of those emails rather than 10 separate output files. Is that what you really want?Adapt
ah, I didn't realize the code you posted was working code. It helps to post the code with which you have a problem.Glia
@Ed Morton, the OP says they only have the problem after changing {print > ("chunk_" chunk ".txt")} to {print > ("chunk_" sprintf("%04d",chunk) ".txt");} (and another unrelated change). They said they do not experience the problem without this change (contrary to what you claim), and I don't see a problem with the code they said works (contrary to what you claim).Glia

From The GNU Awk User's Guide

If you use more files than the system allows you to have open, gawk attempts to multiplex the available open files among your data files. gawk’s ability to do this depends upon the facilities of your operating system, so it may not always work. It is therefore both good practice and good portability advice to always use close() on your files when you are done with them.
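As a quick illustration of that advice (hypothetical output names), closing each file as soon as you are done with it keeps the descriptor count at one no matter how many files are created:

    seq 1000 | awk '{ f = sprintf("out_%04d.txt", $1); print > f; close(f) }'

Without the close(f), a non-GNU awk would typically fail with the same "can't open file" error once the descriptor limit is reached.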

If you changed

{print > ("chunk_" chunk ".txt")}

to one of

{print > ("chunk_" sprintf("%04d",chunk) ".txt");}
{print > ("chunk_" sprintf("%05d",chunk) ".txt");}
{print > ("chunk_" sprintf("%06d",chunk) ".txt");}

but kept

close("chunk_" chunk ".txt");

You are attempting to close a different file: e.g. when chunk = 3 and the format is %04d, you are printing to chunk_0003.txt but closing chunk_3.txt, which was never opened. The files you actually opened therefore stay open, and you eventually hit the per-process limit on open files.
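The mismatch is easy to demonstrate:

    awk 'BEGIN{ chunk = 3
        print "chunk_" chunk ".txt"                   # chunk_3.txt    -- what close() is given
        print "chunk_" sprintf("%04d",chunk) ".txt"   # chunk_0003.txt -- what print writes to
    }'

This also explains why %03d happened to work: the first hundred files (chunk_000 through chunk_099) leak, but from chunk 100 onward the %03d-formatted name coincides with the unpadded name, so the close() calls start succeeding and the process stays at roughly 100 open descriptors, safely below the limit. With %04d the names would not coincide until chunk 1000, and the ~256 limit is hit near chunk 253 first.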

You might store the filename in a variable and thus avoid the need to alter two places if you later decide you want more zeros, that is (note the name for the first chunk must be set in BEGIN, before any line is printed):

BEGIN{chunk=0;filesize=0;filename=sprintf("chunk_%04d.txt",chunk)} # name the first chunk up front
/^From /{
    if(filesize>=40000000){ # file size per chunk in bytes
        if(filename){close(filename)};
        chunk++;                                  # advance BEFORE building the new name
        filename=sprintf("chunk_%04d.txt",chunk); # so we never reopen the file just closed
        filesize=0;
    }
}
{filesize+=length()}
{print > filename}

Not tested due to lack of access to an mbox file.
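A quick way to sanity-check it without a multi-GB mbox is to generate a tiny synthetic one and lower the threshold (file names here are hypothetical):

    # Build a fake mbox with 10 short messages.
    awk 'BEGIN{ for(i=1;i<=10;i++){ print "From sender@example.com Mon Jan  1 00:00:00 2024"
                for(j=1;j<=5;j++) print "body line " j } }' > fake.mbox

    # With the 40000000 threshold in mboxsplit.txt lowered to, say, 100,
    # each chunk_NNNN.txt should then hold roughly one message:
    awk -f mboxsplit.txt fake.mbox
    ls chunk_*.txt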

Plexiglas answered 1/4 at 16:49 Comment(1)
if(filename){close(filename)}; should just be close(filename) as close() will do nothing with an empty string as input, but that if() would behave undesirably for the first file if you decided to just give your files numbers as names.Adapt
