Parallel processing in awk?

Awk processes a file line by line. Assuming each line's processing has no dependency on other lines, is there any way to make awk process multiple lines at a time in parallel? Is there any other text processing tool which automatically exploits parallelism and processes the data quicker?

Rossi answered 1/12, 2013 at 3:37 Comment(0)

The only awk implementation that attempted to provide parallelism was parallel-awk, but that project now looks dead.

Otherwise, one way to parallelize awk is to split your input into chunks and process the chunks in parallel. However, splitting the input is itself single-threaded, which can defeat the performance goal: the main issue is that the standard split command cannot split at line boundaries without reading each and every line.

If you have GNU split available, or a version that supports the -n l/* option, here is one optimized way to process your file in parallel, assuming here you have 8 vCPUs:

inputfile=input.txt
outputfile=output.txt
script=script.awk
count=8

# Split the input into $count chunks, at line boundaries
split -n l/$count "$inputfile" /tmp/_pawk$$

# Process each chunk with its own background awk process
for file in /tmp/_pawk$$*; do
    awk -f "$script" "$file" > "${file}.out" &
done
wait

# Reassemble the results and clean up
cat /tmp/_pawk$$*.out > "$outputfile"
rm /tmp/_pawk$$*
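
Because split generates its suffixes in lexicographic order (aa, ab, ...) and the shell expands the /tmp/_pawk$$*.out glob in the same order, the final cat reassembles the output in the original line order.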
Aragon answered 1/12, 2013 at 9:55 Comment(0)

You can use GNU Parallel for this purpose.

Suppose you are summing the numbers in a big file:

cat rands20M.txt | awk '{s+=$1} END {print s}'

With GNU Parallel you can split the work across multiple processes:

cat rands20M.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}'
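
Each parallel job sums only the chunk of lines it receives and prints a partial sum, which is why the output must be piped through a second awk that adds the partial sums together. Here is a sketch of the same pipeline with an explicit chunk size and job count (--block and -j are standard GNU Parallel options; the values 10M and 8 are arbitrary choices here):

cat rands20M.txt | parallel --pipe --block 10M -j 8 "awk '{s+=\$1} END {print s}'" | awk '{s+=$1} END {print s}'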

Edythedythe answered 24/12, 2014 at 13:20 Comment(1)
Should the parallel command be just cat rands20M.txt | parallel --pipe awk \'{s+=\$1} END {print s}\'? Do you really need to pipe the parallel awk to the ~same command? - Wendiwendie

GNU AWK forks

With gawk, one can use the standard fork extension to spawn parallel processes, much as in C, e.g.:

#!/usr/bin/gawk -f

@load "fork"

# Maximum number of lines processed by each child
BEGIN { CHUNKSIZE = 100 }

# At the start of each chunk, fork a child and record the first
# line number the child should NOT process. The parent does no
# per-line work itself; it only forks at chunk boundaries.
NR % CHUNKSIZE == 1 {
  if (pid == "" || pid > 0) {
    pid = fork()
    stop = NR + CHUNKSIZE
  }
}

# Children exit once they finish their chunk
NR >= stop { exit }

# Children do the slow per-line work
pid == 0 {
  system("sleep 0.002")
}

# The parent waits for all children to finish
END {
  if (pid > 0) {
    while (wait() > 0) {}
  }
}

You can test the speedup by comparing this with a plain awk '{system("sleep 0.002")}' on a file with 1000 lines.
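
For instance, a minimal timing comparison (the file and script names are placeholders; forked.awk is the script above saved to a file):

seq 1000 > lines.txt                          # 1000-line test input
time awk '{system("sleep 0.002")}' lines.txt  # serial: roughly 1000 x 2 ms of sleeping
time gawk -f forked.awk lines.txt             # forked version: the chunks sleep in parallel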

Note that with this approach the processes cannot communicate with each other while running. So, to summarize the output of the different children, you would need to pipe it into another awk script.
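
For example, a hypothetical sketch: suppose each child counted its own lines instead of sleeping and printed the count on exit; a second awk could then merge the per-child results:

# In the script above, replace the per-line rule and extend END, e.g.:
#   pid == 0 { n++ }                      # each child counts its lines
#   END      { if (pid == 0) print n }    # ...and prints its partial count
gawk -f forked_count.awk input.txt | awk '{total += $1} END {print total}'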

frawk

Another recent awk implementation worth keeping an eye on is frawk, which claims to support parallelization and to be "a good deal faster than gawk or mawk", at the cost of deviating slightly from the POSIX standard.
I have not tested it myself.

Lactalbumin answered 26/1 at 15:40 Comment(2)
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review - Intracardiac
@LTyrone -- I can delete my answer if it does not follow the guidelines, but I think it is relevant here. First, the OP asks "Is there any other text processing tool which automatically exploits parallelism and processes the data quicker?" -- and my suggestion fits that request while still allowing awk-like syntax. Second, the question "is there any way to make awk process multiple lines at a time in parallel?" can also potentially be solved by frawk -- it was not explicitly stated that awk should be a piece of software and not a language here. - Lactalbumin
