How to (trivially) parallelize with the Linux shell by starting one task per Linux core?

Today's CPUs typically comprise several physical cores. These may also be multi-threaded, so the Linux kernel sees a fairly large number of logical cores and schedules work on each of them. When several tasks run on a Linux system, the scheduler normally distributes the total workload well across all logical cores (several of which may belong to the same physical core).

Now, say, I have a large number of files to process with the same executable. I usually do this with the "find" command:

find <path> <option> <exec>

However, this starts just one task at a time and waits for it to complete before starting the next. Thus only one core is in use at any time, which leaves the majority of the cores idle (if this find command is the only task running on the system). It would be much better to launch N tasks at the same time, where N is the number of cores seen by the Linux kernel.

Is there a command that would do that?

Litt answered 24/1, 2012 at 16:59 Comment(3)
Have a look at the GNU parallel utility. I don't know how it fits into your particular problem, but have a read: gnu.org/software/parallel - Uraemia
Yes, you are right. GNU parallel is indeed intended for this usage. It can be used as a replacement for "xargs". - Litt
@Daniel: Seems like you should post that as an answer. - Centro

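Use find with the -print0 option and pipe its output to xargs with the -0 option. xargs also accepts the -P option to specify the number of processes to run in parallel. Combine -P with -n or -L; otherwise xargs may pack all arguments into a single invocation and start only one process.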

Read man xargs for more information.

An example command (with -type f so that grep is not handed directories): find . -type f -print0 | xargs -0 -P4 -n4 grep searchstring
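
To start one process per core, as the question asks, the -P value can be taken from nproc instead of being hard-coded. The following is a minimal sketch, assuming GNU find, xargs and coreutils are available; the function name grep_parallel is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: list the files under DIR that contain PATTERN,
# running as many grep processes at once as nproc reports cores.
grep_parallel() {
    pattern=$1
    dir=$2
    # nproc prints the number of processing units available to this process.
    ncores=$(nproc)
    # -type f skips directories; -print0 / -0 keep unusual file names intact;
    # -n 4 gives each grep at most four files; -l prints only matching file names.
    find "$dir" -type f -print0 | xargs -0 -P "$ncores" -n 4 grep -l -- "$pattern"
}
```

It could be called as grep_parallel searchstring . from the directory to search. Output from the parallel grep processes can arrive in any order, and xargs returns a non-zero status if any single grep found no match.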

Centro answered 24/1, 2012 at 17:15 Comment(1)
Many thanks for the answer! Since GNU parallel is not a component of my distribution, xargs is the choice (at the moment!). - Litt

If you have GNU Parallel (http://www.gnu.org/software/parallel/) installed, you can do this:

find | parallel do stuff {} --option_a\; do more stuff {}

You can install GNU Parallel simply by:

wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem

Watch the intro videos for GNU Parallel to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
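
As a more concrete variant of the find | parallel pipeline above, here is a minimal sketch that compresses every file in a directory tree. It assumes GNU Parallel and gzip are installed; gzip_parallel is a hypothetical name chosen for this example. According to its documentation, GNU Parallel runs one job per CPU by default, so no job count is given.

```shell
#!/bin/sh
# Hypothetical helper: gzip every regular file under the directory given as $1.
# Without -j, GNU Parallel picks the number of simultaneous jobs itself.
gzip_parallel() {
    # -print0 and -0 pass file names NUL-separated, so spaces are safe;
    # {} is replaced by one file name per job.
    find "$1" -type f -print0 | parallel -0 gzip {}
}
```

The number of simultaneous jobs can be set explicitly with -j, for example parallel -j 4.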

Syringa answered 24/1, 2012 at 21:1 Comment(2)
Great introductory videos. Many thanks!! Unfortunately GNU parallel has not yet made it into Ubuntu 11.10. Unbelievable, such a great tool! However, I have read hints that it will make it into 12.04. Good news!! - Litt
It is not in ubuntuupdates.org/package_metas/list?name=parallel so I wonder where you read those hints. - Syringa

GNU parallel or xargs -P is probably a better way to handle this, but you can also write a rudimentary multitasking framework in bash. It is a little messy and unreliable, however, because bash lacks certain facilities.

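#!/bin/bash
# (bash rather than sh: the ${1//...} substitution below is a bash feature)

MAXJOBS=3   # maximum number of concurrent jobs
CJ=0        # number of jobs currently running
SJ=""       # PID of the sleeper job that startj waits on, if any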

# Turn a job spec such as "[2]-" or "[2]+" into the bare job number.
gj() {
    echo ${1//[][+-]/}
}

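# SIGCHLD handler: reap one finished job and wake up a waiting startj.
endj() {
    trap "" sigchld                 # ignore further SIGCHLD while handling this one
    ej=$(gj $(jobs | grep Done))    # job number of a finished job
    jobs %$ej
    wait %$ej
    CJ=$(( $CJ - 1 ))
    if [ -n "$SJ" ]; then
        kill $SJ                    # stop the sleeper so that startj's wait returns
        SJ=""
    fi
}

# Start the given command in the background, blocking first
# while MAXJOBS jobs are already running.
startj() {
    j=$*
    while [ $CJ -ge $MAXJOBS ]; do
        sleep 1000 &                # sleeper job: something to wait on
        SJ=$!
        echo too many jobs running: $CJ
        echo waiting for sleeper job [$SJ]
        trap endj sigchld
        wait $SJ 2>/dev/null
    done
    CJ=$(( $CJ + 1 ))
    echo $CJ jobs running.  starting: $j
    eval "$j &"
}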

set -m

# test
startj sleep 2
startj sleep 10
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 1
startj sleep 2
startj sleep 10

wait
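
In newer versions of bash (4.3 and later), wait -n returns as soon as any one background job finishes, which removes the need for the SIGCHLD trap and the sleeper job. The following is a minimal sketch of that alternative, not the script above rewritten line by line; run_limited is a hypothetical name, and nproc from GNU coreutils is assumed for the default job limit.

```shell
#!/bin/bash
# Sketch of a job limiter built on `wait -n` (needs bash 4.3 or later).
# The default limit is one job per core reported by nproc.
MAXJOBS=${MAXJOBS:-$(nproc)}

# Start "$@" in the background, blocking first while MAXJOBS jobs are running.
run_limited() {
    # jobs -rp prints one PID per running background job.
    while [ "$(jobs -rp | wc -l)" -ge "$MAXJOBS" ]; do
        wait -n    # returns as soon as any one background job exits
    done
    "$@" &
}

# Demo: four short jobs, never more than MAXJOBS at once.
for secs in 2 1 1 1; do
    run_limited sleep "$secs"
done
wait    # wait for the jobs that are still running
```

Because the command is passed as "$@" rather than through eval, arguments containing spaces reach the job unchanged.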
Notch answered 25/1, 2012 at 1:1 Comment(0)
