perl Mojo and JSON for simultaneous requests
Asked Answered
B

3

3

I'm usually no Perl coder. However I've got to complete this task.

The following code works for me:

#!/usr/bin/perl

use LWP::UserAgent;
use JSON;
use strict;

my $md5 = $ARGV[0];
$md5 =~ s/[^A-Fa-f0-9 ]*//g;
die "invalid MD5" unless ( length($md5) == 32 );

my $ua = LWP::UserAgent->new(ssl_opts => { verify_hostname => 1 }, timeout => 10);
my $key="12345...7890";
my $url='https://www.virustotal.com/vtapi/v2/file/report';
my $response = $ua->post( $url, ['apikey' => $key, 'resource' => $md5] );
die "$url error: ", $response->status_line unless $response->is_success;
my $results=$response->content;

my $json = JSON->new->allow_nonref;
my $decjson = $json->decode( $results);

print "md5: ",$md5,"\n";
print "positives: ", $decjson->{"positives"}, "\n";
print "total: ", $decjson->{"total"}, "\n";
print "date: ", $decjson->{"scan_date"}, "\n";

Now I would like to recode the above for using asynchronous http using Mojo. I'm trying this:

#!/usr/bin/perl

use warnings;
use strict;
use Mojo;
use Mojo::UserAgent;

my $md5 = $ARGV[0];
$md5 =~ s/[^A-Fa-f0-9 ]*//g;
die "invalid MD5" unless ( length($md5) == 32 );

my ($vt_positives, $vt_scandate, $response_vt);
my $url='https://www.virustotal.com/vtapi/v2/file/report';
my $key="12345...7890";
my $ua = Mojo::UserAgent->new;
my $delay = Mojo::IOLoop->delay;

$ua->max_redirects(0)->connect_timeout(3)->request_timeout(6);
$ua->max_redirects(5);
$delay->begin;

$response_vt = $ua->post( $url => ['apikey' => $key, 'resource' => $md5] => sub {
    my ($ua, $tx) = @_;
    $vt_positives=$tx->res->json->{"positives"};
    print "Got response: $vt_positives\n";
    });

Mojo::IOLoop->start unless Mojo::IOLoop->is_running;

The first code is OK, the second isn't working. I must be doing something wrong when sending the request since I seem to get a 403 response (incorrect API usage). I also tried -> json calls but it didn't work out.

And even if I had done the request correctly, I'm not sure if I'm correctly decoding the json results with Mojo.

Help will be appreciated!

Boiler answered 1/3, 2013 at 7:23 Comment(0)
F
5

EDIT

It seems that we missed the real question, how to post forms. Oops sorry about that.

Posting forms depends on which version of Mojolicious you are using. Until recently (v3.85 -- 2013-02-13) there was a post_form method. On reflection however, it was decided there should either be *_form methods for every request type, or we should do something smarter, and thus the form generator was born.

$response_vt = $ua->post( 
  $url, 
  form => {'apikey' => $key, 'resource' => $md5}, 
  sub { ... }
);

It can be added to any request method, making it much more consistent than the old form. Also note that it should be a hashref, not an arrayref as LWP allows. BTW there is also a json generator that works like this too, or you can even add your own!

I'm leaving my original answer, showing non-blocking usage, which you may now amend given the above.

ORIGINAL

Building off the logic from creaktive, this is how I would start. The major difference is that there isn't a monitor watching to be sure that there are works going, rather when one finishes it checks to be sure that there are no idlers.

I have also made some changes in the parsing logic, but nothing major.

#!/usr/bin/env perl
use Mojo::Base -strict;
use utf8::all;

use Mojo::URL;
use Mojo::UserAgent;

# FIFO queue
my @urls = qw(
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
);

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent
    ->new(max_redirects => 5)
    ->detect_proxy;

start_urls($ua, \@urls, \&get_callback);

sub start_urls {
  my ($ua, $queue, $cb) = @_;

  # Limit parallel connections to 4
  state $idle = 4;
  state $delay = Mojo::IOLoop->delay(sub{say @$queue ? "Loop ended before queue depleated" : "Finished"});

  while ( $idle and my $url = shift @$queue ) {
    $idle--;
    print "Starting $url, $idle idle\n\n";

    $delay->begin;

    $ua->get($url => sub{ 
      $idle++; 
      print "Got $url, $idle idle\n\n"; 
      $cb->(@_, $queue); 

      # refresh worker pool
      start_urls($ua, $queue, $cb); 
      $delay->end; 
    });

  }

  # Start event loop if necessary
  $delay->wait unless $delay->ioloop->is_running;
}

sub get_callback {
    my ($ua, $tx, $queue) = @_;

    # Parse only OK HTML responses
    return unless 
        $tx->res->is_status_class(200)
        and $tx->res->headers->content_type =~ m{^text/html\b}ix;

    # Request URL
    my $url = $tx->req->url;
    say "Processing $url";
    parse_html($url, $tx, $queue);
}

sub parse_html {
    my ($url, $tx, $queue) = @_;

    state %visited;

    my $dom = $tx->res->dom;
    say $dom->at('html title')->text;

    # Extract and enqueue URLs
    $dom->find('a[href]')->each(sub{

        # Validate href attribute
        my $link = Mojo::URL->new($_->{href});
        return unless eval { $link->isa('Mojo::URL') };

        # "normalize" link
        $link = $link->to_abs($url)->fragment(undef);
        return unless grep { $link->protocol eq $_ } qw(http https);

        # Don't go deeper than /a/b/c
        return if @{$link->path->parts} > 3;

        # Access every link only once
        return if $visited{$link->to_string}++;

        # Don't visit other hosts
        return if $link->host ne $url->host;

        push @$queue, $link;
        say " -> $link";
    });
    say '';

    return;
}
Freedwoman answered 1/3, 2013 at 21:7 Comment(8)
Very interesting, indeed! I didn't thought about using the delay as a "condvar". My recurring is actually a quick & dirty hack for emulating an event loop on top of other event loop... Your state usage and passing $queue around as a reference makes the code a lot cleaner, also.Katrinka
Here the same applies as before: Nice, but nothing to do with the original question, which was about that http POST didn't work. This code is helpful (and it is!) only with http GET.Boiler
Oh, I think we misunderstood the nature of your failing, thinking that it was the non-blocking nature. Let me read up a little and I will updateFreedwoman
OK, thanks a lot, that helped. I'm using OpenBSD. As you may know OpenBSDs ports and packages stuff doesn't always come with the latest versions. OpenBSD52, which I'm using, provides p5-Mojolicious-2.76.tgz. I know I could update to newer versions, but that would somewhat go against our policy to stick with the distro as much as possible. However, a friend helped me, so this is how to use it (I assume in all Mojolicious versions prior to 3.85): $ua->post_form($url, { apikey => $key, resource => $md5 }, sub {Boiler
I don't remember too much of the specifics of Mojolicious before about v3.0, so I can't help you there. Use your local perldoc command to read the documentation that came with your package or use Metacpan to read the docs for a given version. metacpan.org/module/SRI/Mojolicious-2.76/lib/Mojolicious.pmFreedwoman
That said, there are nice new goodies for you if you upgrade! Mojolicious development is very fast moving.Freedwoman
speaking of fast moving, $delay->end no longer works. Need to use my $end = $delay->begin; and then a call to $end->() in the callback. Synopsis updated but change not otherwise documented, other than single terse line in Changes file.Superfecundation
Its true that Mojolicious development moves fast. I have actually kept a modularized version of this answer working as a github gist however: gist.github.com/jberger/5153008Freedwoman
O
0

LWP::UserAgent takes arguments to post as either ref to array or ref to hash format.

http://search.cpan.org/~gaas/libwww-perl-6.04/lib/LWP/UserAgent.pm#REQUEST_METHODS

$ua->post( $url, \%form )
$ua->post( $url, \@form )

which you provide in the first script in the ref to array format "\@form"

my $response = $ua->post( $url, ['apikey' => $key, 'resource' => $md5] );

being as it is a hash this is probably better written in the hash format "\%form"

my $response = $ua->post( $url, {'apikey' => $key, 'resource' => $md5} );

In Mojo::UserAgent the arguments to post are a little more complex, but essentially appear to be a "string" of hash refs to hash keys, with which I am unfamiliar. However you may find using the hash ref format provides the expected arguments correctly.

http://search.cpan.org/~sri/Mojolicious-3.87/lib/Mojo/UserAgent.pm#post

POST

my $tx = $ua->post('kraih.com');
my $tx = $ua->post('http://kraih.com' => {DNT => 1} => 'Hi!');
my $tx = $ua->post('http://kraih.com' => {DNT => 1} => form => {a => 'b'});
my $tx = $ua->post('http://kraih.com' => {DNT => 1} => json => {a => 'b'});

try this ?:

$response_vt = $ua->post( $url => form => {'apikey' => $key, 'resource' => $md5} => sub {... });
Onionskin answered 1/3, 2013 at 13:35 Comment(4)
I appreciate the attempt but unfortunately this didn't work either. Changing the post command to the version using curly brackets didn't change anything. Still get a 403 error. I also tried $response_vt = $ua->post( $url => form => {'apikey' => $key, 'resource' => $md5} => sub {...} unsuccessful.Boiler
sorry this did not help, being unfamiliar with chaining hash elements in arguments. I had hoped the syntax correction may do the trick. oh hang on the digraph is just a fancy comma=> right?Onionskin
a hashref without a generator key (form or json) is used as headers. I have edited this answer for the correct usage. Note however that the generator are relatively recent, before which the post_form method was used.Freedwoman
and yes the => is called the fat comma, which is just a comma, but will also quote the left side argument IF it looks like a word.Freedwoman
K
0

Take a look at this concurrent-requesting Mojolicious-based web crawler I wrote to illustrate my article Web Scraping with Modern Perl:

#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);

use Mojo::UserAgent;

# FIFO queue
my @urls = map { Mojo::URL->new($_) } qw(
    http://sysd.org/page/1/
    http://sysd.org/page/2/
    http://sysd.org/page/3/
);

# Limit parallel connections to 4
my $max_conn = 4;

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent
    ->new(max_redirects => 5)
    ->detect_proxy;

# Keep track of active connections
my $active = 0;

Mojo::IOLoop->recurring(
    0 => sub {
        for ($active + 1 .. $max_conn) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop)
                unless my $url = shift @urls;

            # Fetch non-blocking just by adding
            # a callback and marking as active
            ++$active;
            $ua->get($url => \&get_callback);
        }
    }
);

# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;

sub get_callback {
    my (undef, $tx) = @_;

    # Deactivate
    --$active;

    # Parse only OK HTML responses
    return
        if not $tx->res->is_status_class(200)
        or $tx->res->headers->content_type !~ m{^text/html\b}ix;

    # Request URL
    my $url = $tx->req->url;

    say $url;
    parse_html($url, $tx);

    return;
}

sub parse_html {
    my ($url, $tx) = @_;

    say $tx->res->dom->at('html title')->text;

    # Extract and enqueue URLs
    for my $e ($tx->res->dom('a[href]')->each) {

        # Validate href attribute
        my $link = Mojo::URL->new($e->{href});
        next if 'Mojo::URL' ne ref $link;

        # "normalize" link
        $link = $link->to_abs($tx->req->url)->fragment(undef);
        next unless grep { $link->protocol eq $_ } qw(http https);

        # Don't go deeper than /a/b/c
        next if @{$link->path->parts} > 3;

        # Access every link only once
        state $uniq = {};
        ++$uniq->{$url->to_string};
        next if ++$uniq->{$link->to_string} > 1;

        # Don't visit other hosts
        next if $link->host ne $url->host;

        push @urls, $link;
        say " -> $link";
    }
    say '';

    return;
}
Katrinka answered 1/3, 2013 at 18:9 Comment(4)
Why the recurring? Why not just start n workers and have the callback start another if the queue is not empty? Perhaps there isn't much difference, and I'm not an expert on the non-blocking side of things, but it seems more straightforward to me.Freedwoman
Sometimes @urls starts with only one item :)Katrinka
I have posted a snippet based on yours, I'm curious what your comments are.Freedwoman
This all is very nice, but it doesn't solve the problem of the OP. the initial problem is not how to do asynchronous http but how to do POST (instead of GET) with Mojo and submit form data. Might sound like a simple problem, but the documentation doesn't provide an example and all my attempts fail.Boiler

© 2022 - 2024 — McMap. All rights reserved.