Bash/awk/parallel: select a process for each line of a huge file
I am trying to send different lines of a very big file to different processes. To show my problem I built a toy example: a file with 10 categories, where I want to compute the standard deviation (sd) of the second column for each category. Please keep in mind that my real file has millions of very long lines, and the sd computation stands in for a more complex computation.

STEP 1: build a test file

seq 1 1000 | awk '{print int(10*rand()),int(100*rand())}' > testfile

STEP 2: split according to column 1 (I want to compute the sd of the second column for each distinct value of the first field)

cat testfile | awk '{print $2 >> "file"$1}'
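
(With the 2377 categories of my real data, keeping that many output files open can exhaust an awk's file-descriptor limit; a sketch that closes each file after writing avoids this, at the cost of reopening in append mode:)

awk '{ f = "file" $1; print $2 >> f; close(f) }' testfile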

STEP 3

So now I can compute each sd in parallel:

for i in $(seq 0 9); do
    cat file$i | awk '{s+=$1;ss+=$1*$1}END{a=s/NR;print sqrt((ss-a*a)/NR)}' > sd$i &
done
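
(And remember to wait for the background jobs before reading the results; a minimal addition:)

wait      # block until all ten background awk jobs have written sd0..sd9
cat sd*   # now it is safe to read the results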

What I would like is to skip the file$i step and send my numbers directly to 10 processes while reading the initial file.

In a way it's a bit like using parallel, but instead of sending blocks of lines to processes, it uses a field to send specific lines to specific processes.

To give an idea of the latest data I had to deal with: I have 13 million lines in 2377 categories, and each line has 30K fields on which I compute stats using a specific bash command.

Please also help me formulate my question!

Prospectus answered 23/2, 2023 at 15:7 Comment(4)
please update the question with a reduced set of data, eg, seq 1 20 | awk (replace 10 with 3); then update the question with the expected output for those 20 lines of input so that we have something to compare our results againstKopje
a bit more detail on your real problem may also be of benefit as it may affect the design of a solution; in the sample case you have 10 categories ... for the real data how many categories will you have ... 10? 100? 1000? more? also, will the more complex computation be performed in awk or will some other process/binary/program need to be called?Kopje
you've also stated the real file has very long lines ... some idea of what's in these lines and how they come into play re: the calculation may be of help in coming up with a solution; the sample deals with a simple pair of numbers and so a solution dealing with two numbers is going to be relatively simple; but a solution dealing with several (dozens? hundreds? more?) numbers could very well end up being something other than simple; also, the expected max size (MBytes) of the real data file will help us determine if an in-memory solution will be viableKopje
case in point: the sample provided here could be processed in a single awk script (eg, use a set of 10-entry arrays) and likely be more efficient than spawning 10 OS background processes; this same simple solution may not be viable for the real fileKopje

Parallelize stream processing using sed

(Full bash script using sed at end of this post!)

Using sed for stream filtering

a bit like using parallel, but instead of sending blocks of lines to processes, it uses a field to send specific lines to specific processes.

In this use case, having to filter a stream to distribute it to many subtasks, sed should be the quickest way: sed is a lot lighter than perl, and parallel is a perl script. Using sed will be noticeably quicker and lighter, and will consume fewer resources! See the comparison at the end of this post!

First, prepare the sed command line:

printf -v sedcmd ' -e \47};/^%d/{s/^. //;w/dev/fd/%d\47 %d> >(exec \
    awk \47{c++;s+=$1;ss+=$1*$1}END{a=s/c;print %d,sqrt((ss-a*a)/c)}\47) ' $(
    for ((i=0;i<10;i++)) { echo $i $((i+4)) $((i+4)) $i  ; })

Then the command is eval sed -n "${sedcmd/\};} -e '};'" (the substitution ${sedcmd/\};} strips the spurious leading };):

eval sed -n "${sedcmd/\};} -e '};'" <testfile

or, to make the shell wait until every awk has finished (their outputs feed the pipe, so cat sees EOF only when all of them exit):

eval sed -n "${sedcmd/\};} -e '};'" <testfile | cat

where $sedcmd looks like:

$ echo -- "$sedcmd"
--  -e '};/^0/{s/^. //;w/dev/fd/4' 4> >(exec \
    awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 0,sqrt((ss-a*a)/c)}')  -e '};/^1/{s/^. //;w/dev/fd/5' 5> >(exec \
    awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 1,sqrt((ss-a*a)/c)}')  -e '};/^2/{s/^. //;w/dev/fd/6' 6> >(exec \
    awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 2,sqrt((ss-a*a)/c)}')  -e '};/^3/{s/^. //;w/dev/fd/7' 7> >(exec \
    awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 3,sqrt((ss-a*a)/c)}')  -e '};/^4/{s/^. //;w/dev/fd/8' 8> >(exec \
    awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 4,sqrt((ss-a*a)/c)}')  -e '};/^5/{s/^. //;w/dev/fd/9' 9> >(exec \
    awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 5,sqrt((ss-a*a)/c)}')  -e '};/^6/{s/^. //;w/dev/fd/10' 10> >(exec \
    awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 6,sqrt((ss-a*a)/c)}')  -e '};/^7/{s/^. //;w/dev/fd/11' 11> >(exec \
    awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 7,sqrt((ss-a*a)/c)}')  -e '};/^8/{s/^. //;w/dev/fd/12' 12> >(exec \
    awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 8,sqrt((ss-a*a)/c)}')  -e '};/^9/{s/^. //;w/dev/fd/13' 13> >(exec \
    awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print 9,sqrt((ss-a*a)/c)}') 

Where

  • 4> >(exec awk ...) tells bash to open file descriptor 4 on a process substitution running awk
  • -e "/^0/{s/^. //;w/dev/fd/4" -e "}" tells sed to strip the leading tag and space from lines which begin with 0 and write them to fd/4.
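
To see these two pieces work together in isolation, here is a minimal sketch (the sample lines and the group-0 filter are illustrative):

printf '%s\n' '0 12' '1 34' '0 56' |
    sed -n -e '/^0/{s/^. //;w/dev/fd/4' -e '}' \
        4> >(exec awk '{s+=$1} END {print "group 0 sum:", s}')
# only the two lines tagged 0 reach awk, minus their tag: prints 68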

parallel.sh full bash script (draft)

Here is a full parallelFiltering bash script using sed:

#!/bin/bash
# parallel.sh - bash script for filtering/parallelising using sed.
# (C) 2023 Felix Hauri - [email protected]
# Licensed under terms of GPL v3. www.gnu.org

prog=${0##*/}
usage() {
    cat <<-EOUsage
        Usage: $prog -t <tags> [-b <re>] [-a <re>] command args
          -h                 show this
          -t <tags>   comma-separated list of tags to send to separate tasks
                           or a single tag; the '-t' option can be submitted multiple times
          -b <re>     sed regex to match before tags
          -a <re>     sed regex to match after tags
          command     Any command to be run once for each tag.
                        Special string "<RE>" will be replaced by the current tag.
        EOUsage
}
die() {
    echo >&2 "ERROR $prog: $*"
    exit 1
}
while getopts "ht:a:b:" opt; do
    case $opt in
        h ) usage; exit ;;
        t ) IFS=, read -a crttags <<<"$OPTARG"
            tags+=("${crttags[@]}");;  # keep every comma-separated tag
        b ) before=$OPTARG ;;
        a ) after=$OPTARG ;;
        *) die Wrong argument. ;;
    esac
done
shift $((OPTIND-1))

[[ -v tags ]] || die "No tags submitted"
(( $# )) || die "No command submitted"

sedcmd='' paren=''
declare -i crtFd=4                      # first fd handed to a subtask
for re in "${tags[@]}";do
    # build the subtask command, substituting the current tag for <RE>
    printf -v crtcmd '%q ' "${@//\<RE\>/$re}"
    # one sed -e fragment per tag: match, strip the tag, write to the fd
    # that bash connects to the subtask via process substitution
    printf -v crtcmd ' -e \47%s/%s/{s/%s//;w/dev/fd/%d\47 %d> >(exec %s) ' \
           "$paren" "$before$re$after"{,} $crtFd $crtFd "$crtcmd"
    paren='};'
    sedcmd+="$crtcmd" crtFd+=1
done
sedcmd+=" -e '$paren'"

eval sed -n "$sedcmd" 
Run with -h, it prints:

Usage: parallel.sh -t <tags> [-b <re>] [-a <re>] command args
  -h            show this
  -t <tags>   comma-separated list of tags to send to separate tasks
                or a single tag; the '-t' option can be submitted multiple times
  -b <re>     sed regex to match before tags
  -a <re>     sed regex to match after tags
  command     Any command to be run once for each tag.
                Special string "<RE>" will be replaced by the current tag.

This script can be found here: parallel.sh.

Tested on your use case with:

 ./parallel.sh -t{0..9} -b ^ awk '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print <RE>,sqrt((ss-a*a)/c)}' <testfile

Notice the only change from your command line is print <RE>,sqrt..., where <RE> will be replaced by each tag (-t) in its respective subtask.

9 55.6751
8 58.0447
7 55.6755
6 58.3663
5 58.696
4 58.2724
3 54.9797
2 57.5355
1 54.6131
0 57.1334

Comparison with GNU parallel

Of course this is line-buffered filtering, not suitable for block-buffered distribution!!

I've tested with a simple 1000-line random file:

for ((i=1000;i--;)){ echo $((RANDOM%10)) $((RANDOM%100));} >testfile

then using parallel:

sd() {
  awk '{s+=$'$1';ss+=$'$1'*$'$1'}END{a=s/NR;print sqrt((ss-a*a)/NR)}'
}
export -f sd
time parallel -j10 --colsep ' ' --bin 1 --pipe \
    --tagstring {%} sd 2 <testfile |sort 
10      58.3703
1       50.7911
2       56.9009
3       55.0832
4       52.5365
5       65.0864
6       61.4079
7       55.5353
8       62.337
9       51.2512

real    0m0.488s
user    0m1.158s
sys     0m0.272s

and using sed + bash:

time ./parallel.sh -t{0..9} -b ^ awk \
  '{c++;s+=$1;ss+=$1*$1}END{a=s/c;print <RE>,sqrt((ss-a*a)/c)}' <testfile |
    sort
0 58.3703
1 50.7911
2 56.9009
3 55.0832
4 52.5365
5 65.0864
6 61.4079
7 55.5353
8 62.337
9 51.2512

real    0m0.010s
user    0m0.009s
sys     0m0.000s

Fortunately the computed results are the same! (The parallel version tags lines by job slot, so it prints 10 where this prints 0.)

The bash+sed version

  • uses tags instead of numbers
  • uses far fewer system resources
  • is somewhat quicker

Test with bigger and smaller files:

                   Number of lines     Real     User   System
parallel.sh            100'000'000   73.117   72.598    0.416
parallel (perl)        100'000'000  129.264  383.701   36.319

parallel.sh              1'000'000    0.744    0.728    0.013
parallel (perl)          1'000'000    1.798    5.571    0.613

parallel.sh                 10'000    0.018    0.007    0.009
parallel (perl)             10'000    0.523    1.148    0.269

Here is the output of ps --tty pts/4 fw while parallel.sh was running in pts/4:

   5352 pts/4    Ss     0:00 -bash
   5983 pts/4    S+     0:00  \_ /bin/bash ./parallel.sh -t0 -t1 -t2..
   5985 pts/4    R+     0:13  |   \_ sed -n -e /^0/{s/^0//;w/dev/fd/..
   5986 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
   5987 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
   5988 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
   5989 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
   5990 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
   5991 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
   5992 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
   5993 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
   5994 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
   5995 pts/4    S+     0:00  |       \_ awk {c++;s+=$1;ss+=$1*$1}EN..
   5984 pts/4    S+     0:00  \_ sort

Here bash executes sed, which feeds 10 awk processes, piped to sort. Looks OK!

Here is the output of ps --tty pts/4 fw while parallel (perl) was running:

   5352 pts/4    Ss     0:00 -bash
   5777 pts/4    S+     0:00  \_ /usr/bin/perl /usr/bin/parallel -j1..
   5780 pts/4    R+     0:17  |   \_ /usr/bin/perl /usr/bin/parallel..
   5956 pts/4    R      0:16  |   |   \_ perl -e  use B; my $sep = s..
   5957 pts/4    R      0:16  |   |   \_ perl -e  use B; my $sep = s..
snip 7 lines
   5965 pts/4    R      0:16  |   |   \_ perl -e  use B; my $sep = s..
   5793 pts/4    S      0:00  |   \_ /usr/bin/bash -c perl -e '{use ..
   5794 pts/4    S      0:00  |   |   \_ perl -e {use POSIX qw(:errn..
   5795 pts/4    S      0:00  |   |   \_ /usr/bin/bash -c perl -e '{..
   5796 pts/4    S      0:01  |   |       \_ awk {s+=$2;ss+=$2*$2}EN..
snip 33 lines
   5852 pts/4    S      0:00  |   \_ /usr/bin/bash -c perl -e '{use ..
   5867 pts/4    S      0:00  |       \_ perl -e {use POSIX qw(:errn..
   5868 pts/4    S      0:00  |       \_ /usr/bin/bash -c perl -e '{..
   5870 pts/4    S      0:01  |           \_ awk {s+=$2;ss+=$2*$2}EN..
   5778 pts/4    S+     0:00  \_ sort

Well!! 52 processes are spawned just to fan one stream out to 10 subprocesses!! Each subprocess requires 5 subtasks?!

Use case:

Quick demo on log file:

{ tags=($(cut -d\[ -f 1 | cut -d\  -f 5 | sort -u)) ;} <daemon.log 
./parallel.sh "${tags[@]/#/-t}" -b \\b -a \\[ bash -c \
    $'printf \47 - %-20s  %8d %8d %8d\\n\47 "$1" $(wc)' -- "<RE>" <daemon.log  |
    sort

This will run as many tasks as there are tags, then output wc counts for each sub-stream.

Note: the syntax ${tags[@]/#/-t} will be expanded to -tdhclient -tdnsmasq -tsystemd ...; see the sketch after the output below.

 - accton                      14      154     1165
 - dbus-daemon                 80     1273    13731
 - dhclient                  6480    79920   542160
 - dnsmasq                   6480    49680   401760
 - systemd                 154608  1474418 10664639
...
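
To see that expansion on its own (tag names taken from the demo above):

tags=(dhclient dnsmasq systemd)
echo "${tags[@]/#/-t}"    # prints: -tdhclient -tdnsmasq -tsystemd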

But you could create a different filter for each target:

tags=( dhclient dnsmasq systemd )
./parallel.sh ${tags[@]/#/-t} -b \\b -a \\[ \
        "./filter-<RE>.sh" <daemon.log

This will run 3 different tasks, ./filter-dnsmasq.sh, ./filter-dhclient.sh and ./filter-systemd.sh, then parse the log file to send the watched lines to their specific tasks.
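
Each filter is just a program reading its share of the lines on stdin. A hypothetical ./filter-dnsmasq.sh (my illustration, not from the post) could be as simple as:

#!/bin/bash
# filter-dnsmasq.sh - hypothetical consumer: receives only the dnsmasq
# lines on stdin and reports how many of them mention a query
printf 'dnsmasq query lines: %s\n' "$(grep -c query)"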

Remark about parallel output:

Regarding Ole Tange's comment, this seems clear to me: if you ask many tasks to speak together on the same single STDOUT, you may observe strange mixed lines!

If your filter has to output continuously, you have to direct its output properly!

./parallel.sh -t{0..9} -b ^ -a ' ' <testfile sh \
    -c 'sed "s/^/group <RE> /" >/tmp/file-<RE>.txt'
cat /tmp/file-?.txt

Nota: I've finally made a successful test in the spirit of the comments:
parallel.sh ... | sort | uniq -c....
For this to run, the script has to be modified by adding FIFOs, merged by a cat at the end of the script, something like:

+++ parallel.sh   2023-03-07 19:17:08.976802098 +0100
@@ -40,5 +40,8 @@
 for re in "${tags[@]}";do
+    pasteFd=/tmp/fifo-pp-r$crtFd
+    mkfifo $pasteFd
+    pasteFds+=($pasteFd)
     printf -v crtcmd '%q ' "${@//\<RE\>/$re}"
-    printf -v crtcmd ' -e \47%s/%s/{s/%s//;w/dev/fd/%d\47 %d> >(exec %s) ' \
-    "$paren" "$before$re$after"{,} $crtFd $crtFd "$crtcmd"
+    printf -v crtcmd ' -e \47%s/%s/{s/%s//;w/dev/fd/%d\47 %d> >(exec %s >%s) '\
+    "$paren" "$before$re$after"{,} $crtFd $crtFd "$crtcmd" $pasteFd
     paren='};'
@@ -47,3 +50,4 @@
 sedcmd+=" -e '$paren'"
-
-eval sed -n "$sedcmd" 
+eval sed -n "$sedcmd" &
+parcat ${pasteFds[@]}
+rm ${pasteFds[@]}
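
parcat here is the line-atomic merge tool shipped with GNU parallel; assuming its documented usage (plain file/FIFO arguments), a tiny standalone check of the idea, with illustrative FIFO names:

mkfifo /tmp/fifo-pp-r4 /tmp/fifo-pp-r5
parcat /tmp/fifo-pp-r4 /tmp/fifo-pp-r5 &   # merge both streams line by line
echo 'from 4' > /tmp/fifo-pp-r4 &
echo 'from 5' > /tmp/fifo-pp-r5
wait                                       # parcat exits once both writers close
rm /tmp/fifo-pp-r4 /tmp/fifo-pp-r5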

I will probably add some options to do this properly in the final script (on my website).

Castled answered 23/2, 2023 at 17:3 Comment(16)
OK, here it is (-;! Wow, mind blown! What a wonderful puzzle-box solution! Sorry I can only upvote it once. Good luck to all!Quantify
It seems very impressive; I am not sure I understand everything. I did update my question a bit, but I would add that the filtering should be done only once, there are a lot of tags, and the sd computation is a very long and specific awk commandProspectus
@Raphtheraph Ok, then the second part of my answer, using sed, is the only one to be considered. I will try to improve this shortly, but as with perl, the command you want to run doesn't matter.Castled
@Raphtheraph Answering your comment, but my bins are not numeric, I think my multiple -t <sedRe> way could match your need!! (U could try echo -t{0..9} to see how bash expand this!)Castled
I really think the sed option is the way to go. I can't figure out how to print a file for each subprocess. Would you consider developing this option alone in another answer? What about a script where you give <thefile> <separator> <column> <bash command name> <out_rootname> and it distributes each new tag in this column to a new output, writing to a file with the tag appended.Prospectus
My script is based on sed, so it uses sed regexes. By specifying -b (before) and -a (after), you can specify where to find your tag. With sed, column counting can be done by something like \([^ ]\+ \+\)\{4\} for skipping 4 columns.Castled
@Raphtheraph If you really think the sed option is the way to go, why did you not accept my answer?Castled
The solution seems to be unsafe. Try: ./parallel.sh -t{0..9} -b ^ seq 100 <testfile | sort | uniq -c | grep -v 10. This generates 1..100 10 times and removes all numbers that occur 10 times. This works fine (giving no output), but replace 100 with 100000 and suddenly you get numbers that you have never asked for. Looks like a race condition to me.Quadrate
@OleTange I didn't understand the meaning of seq 100 as a target filter for this streamed testfile? seq doesn't accept input!Castled
@F.Hauri-GiveUpGitHub Feel free to replace it with any target filter of your choice that also prints out the numbers 1..100000. The problem seems to be that the solution assumes the output of the filter is small, and breaks down if the output of the filter is in the order of 1 MB.Quadrate
@OleTange This is not a race condition; you are just merging the output of 10 subtasks together into one file descriptor, without any kind of ordering. Seeing characters become mixed seems completely normal to me.Castled
@OleTange You could try specifying where to store the output; there is no race condition: ./parallel.sh -t{0..9} -b ^ bash -c 'seq 100 >/tmp/file-<RE>.txt' <testfileCastled
This also gives the wrong result for me: ./parallel.sh -t{0..9} -b ^ bash -c 'echo -n a;sleep 1; echo b;wc' <testfileQuadrate
Let us continue this discussion in chat.Quadrate
@OleTange Thanks for driving me to this. To me it's so clear that each filter has to direct its own output that I didn't consider this a bug before your comment! But as this will slow the whole parallelism down, I will add it as an option, only for when the user wants to collect the output!Castled
My new version, using parcat (perl too ;), is still a lot quicker than GNU parallel! I repeat: for line streaming, there is no competitor to sed!!! More details in chat room 252346Castled

GNU Parallel has --bin for this.

seq 1 100000 | awk '{print int(10*rand()),int(100*rand()),int(15*rand())}' > testfile

sd() {
    # sd 2 = sd of column 2                                                               
    awk '{s+=$'$1';ss+=$'$1'*$'$1'}END{a=s/NR;print sqrt((ss-a*a)/NR)}'
}
export -f sd

# bin on col 1, sd on col 2
cat testfile | parallel -j10 --colsep ' ' --bin 1 --pipe --tagstring {%} sd 2
# bin on col 2, sd on col 3
cat testfile | parallel -j100 --colsep ' ' --bin 2 --pipe --tagstring {%} sd 3
# bin on col 3, sd on col 1
cat testfile | parallel -j15 --colsep ' ' --bin 3 --pipe --tagstring {%} sd 1
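
Because the bins run concurrently, the tagged result lines arrive on stdout in arbitrary order; piping through sort gives stable output (a minimal sketch reusing the sd helper above):

# collect and order the per-bin results; {%} tags each line with its
# job-slot number, so a numeric sort groups them predictably
cat testfile | parallel -j10 --colsep ' ' --bin 1 --pipe --tagstring {%} sd 2 | sort -n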
Quadrate answered 23/2, 2023 at 19:31 Comment(8)
I'm on a creaky Debian SBC. + cat testfile + parallel -j15 --colsep ' ' --bin 3 --pipe --tagstring '{%}' sd 1 parallel: invalid option -- '-' .... The +s are the result of running under set -vx. No version info available. Is that yours (-;? (sudo apt-get install gnu-parallel reports Unable to locate package gnu-parallel )-;)Quantify
@Quantify Please read oletange.wordpress.com/2018/03/28/…Quadrate
It s a good answer but my bins are not numeric and I may have more category than jobs ... I will update my questionProspectus
plus installing gnu-parallel is a nigthmare on the cluster I am working withProspectus
@OleTange There is a reason not listed in your link: sed is lighter and quicker then perl (and is clearly installed everywhere!)Castled
@OleTange : Tried installing per your instructions, but getting worrisome msgs gpg: /home/odroid/.gnupg/trustdb.gpg: trustdb created gpg: error reading key: No public key gpg: key E20F2022FFFFFFF1: public key "Ole Tange <[email protected]>" imported gpg: Total number processed: 1 gpg: imported: 1 gpg: key 069ED49388888888: 3 signatures reordered gpg: key 069ED49388888888: public key "XiaoDouZiQwQ <[email protected]>" imported gpg: Total number processed: 1 gpg: imported: 1 Tried the alternates in your git README. Is there a better place to discuss? (sorry!)Quantify
@Quantify Ouch! Yeah, that is no good. I have fixed www.pi.dk/3, so it should work now. Checksums in gnu.org/software/parallel/checksumsQuadrate
Clean install now! Thanks!Quantify

With awk, do you really need to make it parallel at all?

( time ( nice mawk2 -v ___='98766669' '
              BEGIN { srand()
                    srand()
                  CONVFMT = OFMT = "%.250g"
                  __ = (_+=_^=_<_)+_^++_
                 ___*=(_=__^(++_+_))^!_;
                   _ = 61277761 * 65537

                  while(___--) { print int(__*rand()) %__,
                                       int( _*rand())  } }' | pvE0 |

  mawk2 '
  BEGIN { 
      ___ += ___ = _^= SUBSEP = ""
      CONVFMT = OFMT = "%.250g" 
  } { 
      __ = $(_ =  ___)
      if ( ((_ = $--_) in _____)==(_<_) ) { 
          
          _____[_] = sprintf(" Grp =[ %15s ]= ",_)
      } 
      ____[_"|"]++            # counter
      ____[_"]"]+= __         # sum
      ____[_"["]+= __*__      # sum of squares

  } END { 
      for (______ = _<_; ______!~".."; ______++) {
         _ = ______
         printf(" %s\f\r\t\t| %23.f #\f\r\t\t| "\
                "%37.13f avg.\f\r\t\t| %37.13f st.dv.\n",
                         _____[_], 
                 ___=__ = ____[_"|"], 
                     __ = ____[_"]"] * (___^= -_^(_<_)),          # inverting counter 
                 (___ * ( ____[_"["] -  __^(_+=_^=_<_)))^_^-_^!_) # n^(1/2) == sqrt 

  } }' ) ) 
      in0: 1.45GiB 0:00:29 [49.9MiB/s] [49.9MiB/s] [ <=>                       ]
  Grp =[               0 ]= 
        |                 9874382 #
        |           2007927772624.7209472656250 avg.
        |           2318542792000.9663085937500 st.dv.
  Grp =[               1 ]= 
        |                 9877831 #
        |           2007986083790.7338867187500 avg.
        |           2318611292854.9780273437500 st.dv.
  Grp =[               2 ]= 
        |                 9873525 #
        |           2008346714329.9284667968750 avg.
        |           2318968400134.7607421875000 st.dv.
  Grp =[               3 ]= 
        |                 9877464 #
        |           2007416025675.7121582031250 avg.
        |           2318105286630.4780273437500 st.dv.
  Grp =[               4 ]= 
        |                 9878524 #
        |           2007843011030.0527343750000 avg.
        |           2318514145523.2456054687500 st.dv.
  Grp =[               5 ]= 
        |                 9875712 #
        |           2008091963744.2180175781250 avg.
        |           2318593468063.0859375000000 st.dv.
  Grp =[               6 ]= 
        |                 9875784 #
        |           2008134171756.2131347656250 avg.
        |           2318721989221.3188476562500 st.dv.
  Grp =[               7 ]= 
        |                 9881377 #
        |           2007915282626.9929199218750 avg.
        |           2318585484730.8193359375000 st.dv.
  Grp =[               8 ]= 
        |                 9877341 #
        |           2008109181888.2607421875000 avg.
        |           2318760855885.9106445312500 st.dv.
  Grp =[               9 ]= 
        |                 9874729 #
        |           2008153989929.5683593750000 avg.
        |           2318791539162.9497070312500 st.dv.

( nice mawk2 -v ___='98766669'  | pvE 0.1 in0 | mawk2 ; )  

50.38s user 0.99s system 172% cpu 29.700 total

It took awk merely 29.7 seconds end-to-end to generate 98.7 million rows of random integers between 0 (inclusive) and the composite of these 2 primes (exclusive):

  •      65,537 : 1 + 4^8
  •  61,277,761 : 8^8 with its digits reversed

then aggregating the 1.45 GB output from step 1 and calculating the relevant stats per group.
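
For readers who find the golfed variable names heavy going, here is a plain-awk sketch of the same one-pass, per-group aggregation (my naming, not the author's; note it uses the textbook population formula sqrt(ss/n - mean^2), whereas the toy expression in the question divides differently, so its numbers will not match the output above):

awk '{
    grp = $1; val = $2
    cnt[grp]++                  # per-group counter
    sum[grp]   += val           # per-group sum
    sumsq[grp] += val * val     # per-group sum of squares
} END {
    for (grp in cnt) {
        n    = cnt[grp]
        mean = sum[grp] / n
        printf "Grp =[ %15s ]=  n=%d  avg=%.4f  st.dv=%.4f\n",
               grp, n, mean, sqrt(sumsq[grp] / n - mean * mean)
    }
}' testfile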

Trailer answered 23/2, 2023 at 21:52 Comment(8)
An even more fascinating puzzle. Am I reading your code correctly ... Do I need something called pvE0? A quick internet search didn't reveal any obvious candidates except stuff that looked to be part of proxmox. Can I use your code without that?Quantify
@Quantify : oh sorry that's just a formatting wrapper for the pv shell utility (http://www.ivarch.com/programs/pv.shtml) to show progress through the pipes. It's totally optional. or just change it to pv and get the default output formatting from it if you're unsureTrailer
LOL, my poor SBC uname_-srv=Linux 4.14.241+ #1 SMP PREEMPT Wed Jul 28 16:55:16 UTC 2021 .. time output: real 9m39.678s user 12m17.774s sys 0m20.981s )-;! Now I have to figure your code out! Good luck to all. `Quantify
Oh, and I made a hard link from mawk to mawk2 and it worked, but maybe that contributed to my SBC's slow performance. sudo apt-get install mawk2 didn't find anything on my repositories. Just curious.Quantify
@Quantify : cuz it's a perpetual beta. you'll have to manually compile it from this github source https://github.com/mikebrennan000/mawk-2/blob/master/mawk-1.9.9.6.tar.gz. but even though it's labeled "beta", it's more stable and reliable than just about any other awk known. and performance of it is ridiculously faster at times.Trailer
significantly better : real 7m44.346s user 9m58.668s sys 0m12.790s but given a 4-5 yr old SBC I didn't expect/require much more. Thanks for followup.Quantify
@Quantify : it really comes down to whether the NAS is tying your hands behind your back with ghetto things like busybox awkTrailer
no busybox ... thank goodness! Ciao for Niao!Quantify

If I understand your requirement, this may help

seq 1 1000 | awk '{print int(10*rand()),int(100*rand())}' \
| awk '{
    sArr[$1]+=$2; sCnt[$1]++;
    ssArr[$1]+=$2*$2
  }
  END{
    for(i in sCnt){
      if ( dbg) { print "processing group "i }
      a=sArr[i]/sCnt[i]
      print sqrt((ssArr[i]-a*a)/sCnt[i]) > "sd"i
      close("sd"i)
    }
  }'

But, I'm in a rush and my math skills are very rusty, so take with a grain of salt (-; Thanks to markp-fuso for improving comments!

Instead of using the $1 value to write an intermediate file, use it as the key for storing the values of interest in arrays. Even though you're reading very large files, this pares memory usage down to just the per-group accumulators.
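
The same idea can read an existing file directly and print one line per group instead of one file per group (a minimal variant keeping the post's formula, not the author's exact script):

awk '{ sArr[$1]+=$2; sCnt[$1]++; ssArr[$1]+=$2*$2 }
     END { for (i in sCnt) { a=sArr[i]/sCnt[i]; print i, sqrt((ssArr[i]-a*a)/sCnt[i]) } }' testfile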

P.S. Whitespace after the continuation character (\) seen in several places will cause an error.

$ ls -l sd*
-rw-r--r-- 1 Neil_2 None 8 Feb 23 09:51 sd0
-rw-r--r-- 1 Neil_2 None 8 Feb 23 09:51 sd1
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd2
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd3
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd4
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd5
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd6
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd7
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd8
-rw-r--r-- 1 shellter None 8 Feb 23 09:51 sd9


$ awk '{print FILENAME " " $0}' sd*
sd0 58.8944
sd1 57.7931
sd2 43.1367
sd3 45.3593
sd4 69.5813
sd5 58.2163
sd6 39.3107
sd7 53.4005
sd8 60.4184
sd9 65.4446
Quantify answered 23/2, 2023 at 15:41 Comment(3)
But there is no parallelisation!?Castled
Ok, I've added my last edit today to my post. A little strong, but I hope this could be understandable!Castled
The point is that the values cannot all be stored because it's big, big data, sorry! But an interesting tryProspectus
