bash: process list of files in chunks
The setting:

I have about 150 files, named input0.dat, input1.dat, ..., input150.dat, which I need to process using a command cmd (which basically merges the contents of all the files). cmd takes the output filename as its first argument, followed by the list of all input filenames:

./cmd output.dat input1.dat input2.dat [...] input150.dat

The problem:

The problem is that cmd can only handle about 10 files at a time due to memory issues (don't blame me for that). Thus, instead of using bash wildcard expansion like

./cmd output.dat *dat

I need to do something like

./cmd temp_output0.dat file0.dat file1.dat [...] file9.dat
[...]
./cmd temp_outputN.dat fileN0.dat fileN1.dat [...] fileN9.dat

Afterwards I can merge the temporary outputs.

./cmd output.dat output0.dat [...] outputN.dat

How do I script this efficiently in bash?

I tried, without success, e.g.

for filename in `echo *dat | xargs -n 3`; do [...]; done

The problem is that the grouping is lost: the loop iterates over single filenames, because the shell's word splitting treats the newlines between xargs' output lines just like the spaces within them.
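
A minimal demonstration of that flattening:

for filename in $(echo *dat | xargs -n 3); do
    echo "$filename"    # one filename per iteration; the groups of three are gone
done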

EDIT: Note that I need to specify an output filename as first command line argument when calling cmd!

Aretha answered 20/1, 2012 at 17:4 Comment(0)
C
5

edit: Without a pipe or process substitution (requires bash). This can deal with files with spaces in their names: use a bash array and extract it in slices. Note that the echos make this a dry run; remove them to actually invoke ./cmd:

i=0
infiles=(*dat)
opfiles=()
while ((${#infiles[@]})); do
    threefiles=("${infiles[@]:0:3}")
    echo ./cmd tmp_output$i.dat "${threefiles[@]}"
    opfiles+=("tmp_output$i.dat")
    ((i++))
    infiles=("${infiles[@]:3}")
done
echo ./cmd output.dat "${opfiles[@]}"
rm "${opfiles[@]}"

Using a fifo (this cannot handle spaces in filenames):

i=0
opfiles=
mkfifo /tmp/foo
echo *dat | xargs -n 3 >/tmp/foo&
while read threefiles; do
    ./cmd tmp_output$i.dat $threefiles
    opfiles="$opfiles tmp_output$i.dat"
    ((i++)) 
done </tmp/foo
rm -f /tmp/foo
wait
./cmd output.dat $opfiles
rm $opfiles

The fifo is needed because each segment of a pipeline runs in a subshell: piping xargs straight into the while loop would discard the value of i, as well as the opfiles list needed for the final concatenation, as soon as the loop ended.
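
A two-line demonstration of that subshell behaviour:

i=0
printf 'a\nb\n' | while read -r line; do ((i++)); done
echo "$i"    # prints 0: the increments happened in a pipeline subshell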

If you want, you can background the inner invocations of ./cmd; the wait before the last invocation of cmd then ensures they have all finished:

i=0
opfiles=
mkfifo /tmp/foo
echo *dat | xargs -n 3 >/tmp/foo&
while read threefiles; do
    ./cmd tmp_output$i.dat $threefiles&
    opfiles="$opfiles tmp_output$i.dat"
    ((i++)) 
done </tmp/foo
rm -f /tmp/foo
wait
./cmd output.dat $opfiles
rm $opfiles
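
Note that this backgrounds every chunk at once. If memory is the real constraint, you may want to cap the number of concurrent jobs; here is a sketch using wait -n (bash 4.3+) and the process-substitution form from the update below (maxjobs is my addition):

maxjobs=4                      # cap on concurrent ./cmd invocations
i=0
opfiles=()
while read threefiles; do
    while (( $(jobs -rp | wc -l) >= maxjobs )); do
        wait -n                # bash 4.3+: block until any one job exits
    done
    ./cmd tmp_output$i.dat $threefiles &
    opfiles+=("tmp_output$i.dat")
    ((i++))
done < <(echo *dat | xargs -n 3)
wait                           # let the remaining jobs finish
./cmd output.dat "${opfiles[@]}"
rm "${opfiles[@]}"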

update If you want to avoid using a fifo entirely, you can use process substitution to emulate it, rewriting the first fifo version as:

i=0
opfiles=()
while read threefiles; do
    ./cmd tmp_output$i.dat $threefiles
    opfiles+=("tmp_output$i.dat")
    ((i++)) 
done < <(echo *dat | xargs -n 3)
./cmd output.dat "${opfiles[@]}"
rm "${opfiles[@]}"

Again, this avoids piping into the while loop, reading from a redirection instead, so the opfiles variable is still set after the loop.
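
For completeness: bash 4.2 and later also offer shopt -s lastpipe, which runs the last segment of a pipeline in the current shell (in scripts, where job control is off), so a plain pipe keeps its variables too. A sketch:

#!/bin/bash
shopt -s lastpipe              # bash >= 4.2; needs job control off (true in scripts)
i=0
opfiles=()
echo *dat | xargs -n 3 | while read threefiles; do
    ./cmd tmp_output$i.dat $threefiles
    opfiles+=("tmp_output$i.dat")
    ((i++))
done
./cmd output.dat "${opfiles[@]}"
rm "${opfiles[@]}"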

Competitor answered 20/1, 2012 at 17:13 Comment(4)
Yes! That's what I've been looking for. Thanks. – Aretha
This is rather more convoluted than it really needs to be. The temporary file can be avoided -- just pipe xargs to while read. The background processing might be nice, but could also complicate things needlessly, depending on how heavy the job is, etc. – Mesial
This doesn't work if the filenames have spaces in them. – Pointed
True, I've added an update that works with files with spaces in their names as well - it uses arrays, which is pretty much the only way to fly if you're trying to keep safe. – Competitor
S
5

Try the following, it should work for you:

echo *dat | xargs -n3 ./cmd output.dat

EDIT: In response to your comment:

for i in {0..9}; do
    echo file${i}*.dat | xargs -n3 ./cmd output${i}.dat
done

That would send no more than three files at a time to ./cmd, while going over all files from file00.dat to file99.dat, producing 10 different output files, output0.dat to output9.dat.

Stenosis answered 20/1, 2012 at 17:6 Comment(2)
I see, I added what I think could work for you. Is that what you meant? – Stenosis
No, actually, it does not quite do the right thing, because you're using the same output name multiple times for different input files. – Aretha
C
3

I know that this question was answered and accepted a long time ago, but I find there is a simpler solution than those offered so far.

find -name '*.dat' | xargs -n3 | xargs -n3 your_command

For more fine-grained control, or to manipulate the argument string further, use the following form (substitute your shell of choice for sh):

find -name '*.dat' | xargs -n3 | xargs -n3 -I{} sh -c 'your_command {}'

To parallelize the work (say, across 2 parallel processes):

find -name '*.dat' | xargs -n3 | xargs -P2 -n3 -I{} sh -c 'your_command {}'

NOTE: This will not work for files that have spaces in them.
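
If filenames with spaces (or even newlines) must be handled, a null-delimited read loop is a safe alternative; here is a sketch, using the tmp_output naming from the question:

i=0
batch=()
while IFS= read -r -d '' f; do
    batch+=("$f")
    if ((${#batch[@]} == 3)); then
        ./cmd "tmp_output$i.dat" "${batch[@]}"   # flush a full chunk
        ((i++))
        batch=()
    fi
done < <(find . -name '*.dat' -print0)
((${#batch[@]})) && ./cmd "tmp_output$i.dat" "${batch[@]}"   # leftover chunk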

Cassock answered 27/3, 2018 at 5:54 Comment(5)
What does piping it through xargs twice do, exactly? – Thorvaldsen
@Thorvaldsen the first xargs call creates the chunks and the second xargs call parallelizes the chunks – Cassock
Isn't the first xargs redundant? Shouldn't setting both -P and -n on the final xargs already handle splitting into chunks and parallelising? Also, that is likely to fail on filenames with whitespace in them, especially with -n present. -L could be better, but one should almost always use xargs -d $'\n' ... to split arguments on newlines, or better use find ... -print0 | xargs -0 ... to delimit with null bytes. – Thorvaldsen
I also found many ways to execute a function on a chunk of files, but I really struggle to make it work with files that have spaces in them too… :( – Austronesia
It's a shame that it isn't possible to use a placeholder (as with xargs's -I option) alongside the -L option, with the placeholder then standing in for multiple args. – Austronesia
P
3

I'm using this quick solution, found via the bash manpage. It looks like other approaches exist too. Unlike xargs -n, this should handle spaces in filenames properly.

ls *dat | while readarray -tn 10 tenfiles && ((${#tenfiles[@]}))
do
  cmd output.dat "${tenfiles[@]}"
done
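
A variant that avoids parsing ls output and numbers the temporary outputs (the counter is my addition; it survives across iterations because the whole loop runs in a single subshell, though it is gone once the loop ends):

i=0
printf '%s\n' *dat | while readarray -tn 10 tenfiles && ((${#tenfiles[@]}))
do
    ./cmd "tmp_output$((i++)).dat" "${tenfiles[@]}"
done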
Pointed answered 5/3, 2020 at 16:43 Comment(1)
Great solution - the docs for readarray can be found with help mapfile. – Cerargyrite
E
1

GNU Parallel is excellent at "chunking things up" and generating input/output filenames and counters. This will take 3 files at a time (-N3) and generate an intermediate output file that is sequentially numbered and contains the merged contents. And it does it in parallel for you - making use of all those CPU cores that you paid Intel so handsomely for:

parallel -N3 cmd output.{#} {} ::: {1..150}.dat

To see it in action, use the --dry-run option:

parallel --dry-run -N3 cmd output.{#} {} ::: {1..150}.dat

Sample Output

cmd output.1 1.dat 2.dat 3.dat
cmd output.2 4.dat 5.dat 6.dat
cmd output.3 7.dat 8.dat 9.dat
cmd output.4 10.dat 11.dat 12.dat
cmd output.5 13.dat 14.dat 15.dat
cmd output.6 16.dat 17.dat 18.dat
cmd output.7 19.dat 20.dat 21.dat
...
...
cmd output.49 145.dat 146.dat 147.dat
cmd output.50 148.dat 149.dat 150.dat
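
Applied to the question's actual filenames, with the final merge step added (a sketch; the tmp_output names are placeholders, and checking with --dry-run first is prudent):

parallel -N10 ./cmd tmp_output{#}.dat {} ::: input*.dat
./cmd output.dat tmp_output*.dat
rm tmp_output*.dat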
Elinaelinor answered 5/3, 2020 at 17:14 Comment(1)
I think this really is the best solution because it gives you easy access to the counter while chunking. That's super important when setting the output file path. – Nagpur
