How can I modify my perl script to use multiple processors?
Hi, I have a simple script that takes a file and runs another Perl script on it. The script does this for every picture file in the current folder. This is running on a machine with two quad-core Xeon processors, 16 GB of RAM, running Red Hat Linux.

The first script, work.pl, basically calls magicplate.pl, passing some parameters and the name of the file for magicplate.pl to process. Magic Plate takes about a minute to process each image. Because work.pl is performing the same function over 100 times, and because the system has multiple processors and cores, I was thinking about splitting the task up so that it could run multiple times in parallel. I could split the images up into different folders if necessary. Any help would be great. Thank you

Here is what I have so far:

use strict;
use warnings;


my @initialImages = <*>;

foreach my $file (@initialImages) {

    if($file =~ /.png/){
        print "processing $file...\n";
        my @tmp=split(/\./,$file);
        my $name="";
        for(my $i=0;$i<(@tmp-1);$i++) {
            if($name eq "") { $name = $tmp[$i]; } else { $name=$name.".".$tmp[$i];}
        }

        my $exten=$tmp[(@tmp-1)];
        my $orig=$name.".".$exten;

        system("perl magicPlate.pl -i ".$orig." -min 4 -max 160 -d 1");
     }
}       
Nathan answered 13/12, 2010 at 14:1 Comment(0)

You could use Parallel::ForkManager (set $MAX_PROCESSES to the number of files processed at the same time):

use strict;
use warnings;
use Parallel::ForkManager;

# Maximum number of files processed at the same time
my $MAX_PROCESSES = 8;

# Create the fork manager once, outside the loop
my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

my @initialImages = <*>;

foreach my $file (@initialImages) {

    if($file =~ /.png/){
        print "processing $file...\n";
        my @tmp=split(/\./,$file);
        my $name="";
        for(my $i=0;$i<(@tmp-1);$i++) {
            if($name eq "") { $name = $tmp[$i]; } else { $name=$name.".".$tmp[$i];}
        }

        my $exten=$tmp[(@tmp-1)];
        my $orig=$name.".".$exten;

        my $pid = $pm->start and next;   # fork; the parent moves on to the next file
        system("perl magicPlate.pl -i ".$orig." -min 4 -max 160 -d 1");
        $pm->finish;                     # terminates the child process
    }
}

$pm->wait_all_children;                  # wait for every child to exit

But as suggested by Hugmeir, running the perl interpreter again and again for each new file is not a good idea.
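
If magicPlate.pl can be refactored into a module, the forked children can run that code directly instead of shelling out. Here is a rough sketch, assuming a hypothetical MagicPlate module that exposes a process_image() function (both the module and the function name are made up for illustration):

use strict;
use warnings;
use Parallel::ForkManager;
use MagicPlate;    # hypothetical module refactored out of magicPlate.pl

my $pm = Parallel::ForkManager->new(8);

for my $file (grep { /\.png$/ } <*>) {
    $pm->start and next;      # fork; the parent moves on to the next file
    # The child runs the processing code in-process -- no new perl interpreter
    MagicPlate::process_image($file, min => 4, max => 160, d => 1);
    $pm->finish;
}

$pm->wait_all_children;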

Galore answered 13/12, 2010 at 14:23 Comment(3)
" running perl interpreter again and again for each new file is not a good idea" - Yes, but when you fork, you aren't starting up a new perl interpreter. Fork copies the parent process, and Linux uses CoW, so it's even cheaper than a full copy.Spaak
Also, why are you starting a new interpreter after you fork? Run the perl code in the new child process.Spaak
@JimB: I mean the system call, not forking. And I use a system call because the original code used it.Galore

You should consider NOT creating a new process for each file that you want to process -- it's horribly inefficient, and probably what is taking most of your time here. Just loading Perl and whatever modules you use, over and over, is bound to add overhead. I recall a poster on PerlMonks who did something similar and ended up turning his second script into a module, cutting the work time from an hour to a couple of minutes. Not that you should expect such a dramatic improvement, but one can dream...

With the second script refactored as a module, here's an example of thread usage, in which BrowserUK creates a thread pool, feeding it jobs through a queue.
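
That pattern (a fixed number of worker threads pulling file names off a shared queue) looks roughly like the sketch below. It again assumes a hypothetical MagicPlate module with a process_image() function; only the threads/Thread::Queue plumbing is the point here:

use strict;
use warnings;
use threads;
use Thread::Queue;
use MagicPlate;                 # hypothetical module refactored out of magicPlate.pl

my $WORKERS = 8;                # roughly one worker per core
my $queue   = Thread::Queue->new;

# Each worker pulls file names off the queue until it sees undef
my @workers = map {
    threads->create(sub {
        while (defined(my $file = $queue->dequeue)) {
            MagicPlate::process_image($file, min => 4, max => 160, d => 1);
        }
    });
} 1 .. $WORKERS;

$queue->enqueue($_) for grep { /\.png$/ } <*>;
$queue->enqueue(undef) for @workers;    # one termination marker per worker
$_->join for @workers;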

Gennygeno answered 13/12, 2010 at 14:18 Comment(3)
Starting up a new perl interpreter is horribly inefficient, but creating a new process with fork() is very fast (especially since Linux uses CoW).Spaak
No. If your job is going to use 1 minute of CPU time, the time spent starting up the new task is going to be fairly negligible. Perl might use, say, 1 second of CPU to startup its environment (if you have quite a lot of modules loaded; I have seen this) but after that, it's all yours. Read the question carefully.Sitra
NB: Perl threads suck. Really, they do. They create loads of copies of everything (not CoW copies, real copies). They don't work right in some cases, but still use up heaps of unnecessary resources. Fork instead, it's way more efficient and more likely to work.Sitra
  • Import "maigcplate" and use threading.
  • Start magicplate.pl in the background (you would need to add process throttling)
  • Import "magicplate" and use fork (add process throttling and a kiddy reaper)
  • Make "maigcplate" a daemon with a pool of workers = # of CPUs
    • use an MQ implementation for communication
    • use sockets for communication
  • Use webserver(nginx, apache, ...) and wrap in REST for a webservice
  • etc...

All of these center around creating multiple workers that can each run on their own CPU. Certain implementations will use resources better (those that don't start a new process for every file) and will be easier to implement and maintain.
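
For example, the fork-based option from the list above might be sketched like this, with a simple counter for throttling and waitpid() to reap finished children. It assumes the same kind of hypothetical MagicPlate module exposing a process_image() function:

use strict;
use warnings;
use MagicPlate;                 # hypothetical module refactored out of magicPlate.pl

my $MAX_WORKERS = 8;            # roughly one worker per core
my $running     = 0;

for my $file (grep { /\.png$/ } <*>) {
    # Throttle: wait for a child to finish once the limit is reached
    if ($running >= $MAX_WORKERS) {
        waitpid(-1, 0);
        $running--;
    }

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {            # child: do the work, then exit
        MagicPlate::process_image($file, min => 4, max => 160, d => 1);
        exit 0;
    }
    $running++;                 # parent keeps count of live children
}

# Reap whatever is still running
while ($running > 0) {
    waitpid(-1, 0);
    $running--;
}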

Charlyncharm answered 13/12, 2010 at 15:3 Comment(0)
