Perl - HTTP::Proxy: capturing XHR/JSON communication

The site http://openbook.etoro.com/#/main/ has a live feed that is generated by JavaScript via XHR keep-alive requests; the server answers with gzip-compressed JSON strings.

I want to capture the feed into a file.

The usual way (WWW::Mech..) is probably not viable, because reverse engineering all the JavaScript in the page and simulating the browser is a really hard task, so I'm looking for an alternative solution.

My idea is to use a man-in-the-middle tactic: the browser does its work, and I capture the communication via a Perl proxy dedicated only to this task.

I'm able to catch the initial communication, but not the feed itself. The proxy is working OK, because the feed keeps running in the browser; only my filters don't work.

use strict;
use warnings;

use HTTP::Proxy;
use HTTP::Proxy::HeaderFilter::simple;
use HTTP::Proxy::BodyFilter::simple;
use Data::Dumper;

my $proxy = HTTP::Proxy->new(
    port => 3128, max_clients => 100, max_keep_alive_requests => 100
);

# dump every set of response headers
my $hfilter = HTTP::Proxy::HeaderFilter::simple->new(
    sub {
        my ( $self, $headers, $message ) = @_;
        print STDERR "headers", Dumper($headers);
    }
);

# dump every chunk of the response body
my $bfilter = HTTP::Proxy::BodyFilter::simple->new(
    filter => sub {
        my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
        print STDERR "dataref", Dumper($dataref);
    }
);

$proxy->push_filter( response => $hfilter );    # header dumper
$proxy->push_filter( response => $bfilter );    # body dumper
$proxy->start;

Firefox is configured to use the above proxy for all communication.

The feed is running in the browser, so the proxy is feeding it with data. (When I stop the proxy, the feed stops too.) Randomly (I can't figure out when) I get the following error:

[Tue Jul 10 17:13:58 2012] (42289) ERROR: Getting request failed: Client closed

Can anybody show me how to construct the correct HTTP::Proxy filters to Dumper all communication between the browser and the server, regardless of the keep-alive XHR?

Riviera answered 10/7, 2012 at 15:29 Comment(3)
You're reinventing the wheel. Press Ctrl+Shift+I to open Firefox Firebug/Opera Dragonfly/Chromium Inspector and look in the network panel to see what the HTTP request/response pairs look like. Alternatively, use Wireshark: complete a capture, filter with the expression http in the combo box near the top, select the packet that starts a request, then menu Analyze → Follow TCP Stream to see the text representation of an HTTP request/response pair.Corabella
Sorry @daxim, but this is not a solution. Of course I can use Firebug or any other browser control panel (and I used it for analysis). I can use tcpdump and/or tcpflow too. I want to capture exactly the feed (for later work) on a headless server (no X), no browser. Thanks for your answer anyway - but if I wanted to capture plain packets, I would not be asking for a Perl solution.Riviera
@daxim, I understand your point of view, but the question is legitimate and shows a real problem. (And IMO, it is much better than the usual SO Perl questions like how to use tr/// :) I tried the script, and I don't know the answer either - can you help?Disburse

Here's something that I think does what you're after:

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use HTTP::Proxy;
use HTTP::Proxy::BodyFilter::complete;
use HTTP::Proxy::BodyFilter::simple;
use JSON::XS     qw( decode_json );
use Data::Dumper qw( Dumper );

my $proxy = HTTP::Proxy->new(
    port                     => 3128,
    max_clients              => 100,
    max_keep_alive_requests  => 100,
);

my $filter = HTTP::Proxy::BodyFilter::simple->new(
    sub {
        my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
        return unless $$dataref;
        my $content_type = $message->headers->content_type or return;
        say "\nContent-type: $content_type";
        my $data = decode_json( $$dataref );
        say Dumper( $data );
    }
);

$proxy->push_filter(
    method   => 'GET',
    mime     => 'application/json',
    response => HTTP::Proxy::BodyFilter::complete->new,
    response => $filter
);

$proxy->start;

I don't think you need a separate header filter because you can access any headers you want to look at using $message->headers in the body filter.
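For example, a body filter can read any response header directly; this fragment would slot into the same script (the variable name $header_peek and the choice of Content-Encoding are just illustrative):

my $header_peek = HTTP::Proxy::BodyFilter::simple->new(
    sub {
        my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
        # $message is the HTTP::Response being filtered, so its headers
        # are available here without a separate header filter
        my $encoding = $message->headers->header('Content-Encoding') || '';
        say "Content-Encoding: $encoding" if $encoding;
    }
);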

You'll note that I pushed two filters onto the pipeline. The first one is of type HTTP::Proxy::BodyFilter::complete and its job is to collect up the chunks of the response and ensure that the real filter that follows always gets a complete message in $dataref. However, for each chunk that's received and buffered, the following filter is still called and passed an empty $dataref. My filter ignores these by returning early.

I also set up the filter pipeline to ignore everything except GET requests that resulted in JSON responses - since these seem to be the most interesting.
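Since the original goal was to capture the feed into a file, the same kind of filter can also append each complete JSON body to disk instead of dumping it to STDERR. A minimal sketch under the same pipeline, pushed in place of $filter above (the variable name $capture and the filename feed-capture.log are just illustrative, and this variant is untested against the live site):

my $capture = HTTP::Proxy::BodyFilter::simple->new(
    sub {
        my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
        return unless $$dataref;    # skip the empty calls made while ::complete buffers
        return unless eval { decode_json( $$dataref ) };    # keep only bodies that parse as JSON
        open my $fh, '>>', 'feed-capture.log' or die "open feed-capture.log: $!";
        print {$fh} $$dataref, "\n";    # one JSON document per line, for later processing
        close $fh;
    }
);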

Thanks for asking this question - it was an interesting little problem and you seemed to have done most of the hard work already.

Middy answered 11/7, 2012 at 9:46 Comment(1)
Yes! You solved both of the problems: a) getting application/json and b) the fragmentation too. Thank you very, very much. :)Riviera

Set the mime parameter; the default is to filter text types only.

$proxy->push_filter(response => $hfilter, mime => 'application/json');
$proxy->push_filter(response => $bfilter, mime => 'application/json');
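If the JSON bodies arrive in several chunks (as discussed in the accepted answer), it may also help to push HTTP::Proxy::BodyFilter::complete in front of the body dumper; a sketch combining both ideas, assuming the $bfilter from the question:

use HTTP::Proxy::BodyFilter::complete;

$proxy->push_filter(
    mime     => 'application/json',
    response => HTTP::Proxy::BodyFilter::complete->new,
    response => $bfilter,
);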
Corabella answered 11/7, 2012 at 9:5 Comment(1)
Thank you, daxim, this is the solution for the majority of the problem ;)Riviera
