I have to write a script that fetches some URLs in parallel and does some work. In the past I have always used Parallel::ForkManager for such things, but now I wanted to learn something new and try asynchronous programming with AnyEvent (and AnyEvent::HTTP or AnyEvent::Curl::Multi) ... but I'm having problems understanding AnyEvent and writing a script that should:
- open a file (every line is a separate URL)
- (from now on in parallel, but with a limit of e.g. 10 concurrent requests)
- read the file line by line (I don't want to load the whole file into memory - it might be big)
- make an HTTP request for that URL
- read the response
- update the MySQL record accordingly
- (next file line)
I have read many manuals and tutorials, but it's still hard for me to understand the difference between blocking and non-blocking code. I found a similar script at http://perlmaven.com/fetching-several-web-pages-in-parallel-using-anyevent, where Mr. Szabo explains the basics, but I still can't understand how to implement something like:
...
open my $fh, "<", $file;
while ( my $line = <$fh> ) {
    # http request, read response, update MySQL
}
close $fh;
...
... and add a concurrency limit in this case.
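From the tutorials I have a vague idea that the pattern should look roughly like the sketch below (untested, just to show what I'm aiming for; "urls.txt", the limit of 10 and the callback body are stand-ins for my real logic):

#!/usr/bin/perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

my $max_workers = 10;            # concurrency limit
my $active      = 0;             # requests currently in flight
my $cv          = AnyEvent->condvar;

open my $fh, "<", "urls.txt" or die "urls.txt: $!";

sub start_next {
    # top the pool up to the limit, reading the file lazily
    while ( $active < $max_workers and defined( my $url = <$fh> ) ) {
        chomp $url;
        $active++;
        $cv->begin;
        http_get $url, sub {
            my ( $body, $hdr ) = @_;
            # read the response; the MySQL UPDATE (via DBI) would go here too
            print "$hdr->{Status} $url\n";
            $active--;
            $cv->end;
            start_next();        # a slot freed up, start another request
        };
    }
}

$cv->begin;                      # keep the condvar open while priming the pool
start_next();
$cv->end;

$cv->recv;                       # run the event loop until all requests finish
close $fh;

Is that the right direction, or am I misusing begin/end here?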
I would be very grateful for help ;)
UPDATE
Following Ikegami's advice I gave Net::Curl::Multi a try. I'm very pleased with the results. After years of using Parallel::ForkManager just for grabbing thousands of URLs concurrently, Net::Curl::Multi seems to be awesome.
Here is my code with a while loop on the filehandle. It seems to work as it should, but considering it's my first time writing something like this, I would like to ask more experienced Perl users to take a look and tell me if there are any potential bugs, something I missed, etc.
Also, if I may ask: as I don't fully understand how Net::Curl::Multi's concurrency works, please tell me whether I should expect any problems with putting a MySQL UPDATE command (via DBI) inside the RESPONSE loop (besides higher server load, obviously - I expect the final script to run with about 50 concurrent N::C::M workers, maybe more). A sketch of what I mean is after the script.
#!/usr/bin/perl
use strict;
use warnings;

use Net::Curl::Easy  qw( :constants );
use Net::Curl::Multi qw( );

sub make_request {
    my ( $url ) = @_;
    my $easy = Net::Curl::Easy->new();
    $easy->{url} = $url;                                    # keep the URL for reporting later
    $easy->setopt( CURLOPT_URL,        $url );
    $easy->setopt( CURLOPT_HEADERDATA, \$easy->{head} );    # collect the headers here
    $easy->setopt( CURLOPT_FILE,       \$easy->{body} );    # collect the body here
    return $easy;
}

my $maxWorkers = 10;

my $multi   = Net::Curl::Multi->new();
my $workers = 0;

my $i = 1;
open my $fh, "<", "urls.txt" or die "urls.txt: $!";
LINE: while ( my $url = <$fh> ) {
    chomp( $url );
    $url .= "?$i";                   # cache-busting suffix, just for testing
    print "($i) $url\n";

    my $easy = make_request( $url );
    $multi->add_handle( $easy );
    $workers++;

    my $running = 0;
    do {
        # wait until libcurl has something for us to do
        my ( $r, $w, $e ) = $multi->fdset();
        my $timeout = $multi->timeout();
        select( $r, $w, $e, $timeout / 1000 )
            if $timeout > 0;

        $running = $multi->perform();

        # collect every transfer that has finished
        RESPONSE: while ( my ( $msg, $easy, $result ) = $multi->info_read() ) {
            $multi->remove_handle( $easy );
            $workers--;
            printf( "%s getting %s\n", $easy->getinfo( CURLINFO_RESPONSE_CODE ), $easy->{url} );
        }

        # don't max out the CPU while waiting
        select( undef, undef, undef, 0.01 );
    } while ( $workers == $maxWorkers || ( eof && $running ) );
    # keep draining while the pool is full, and at EOF until everything has finished

    $i++;
}
close $fh;

print "got it!\n";