Tor Web Crawler

6

10

OK, here's what I need. I have a PHP-based web crawler, accessible here: http://rz7ocnxxu7ka6ncv.onion/ My problem is that the spider that actually crawls pages has to do so through the SOCKS proxy on port 9050. I have to tunnel its connection through Tor so that it can resolve .onion domains, which is what I'm indexing (only domains ending in .onion). I call this script from the command line using php crawl.php, adding the appropriate parameters to crawl the page. Here is what I'm thinking: is there any way to force the script to use Tor? Or can I force my ENTIRE MACHINE to tunnel everything through Tor (for example, by forcing all traffic through 127.0.0.1:9050), and how? Perhaps if I set up global proxy settings, PHP would respect them?

If any of my solutions work, how would I do it? (Step by step instructions please, I am a noob.)

I just want to create my own Tor search engine. (Don't recommend P2P search engines to me; that's not what I want for this. I know they exist, I did my homework.) Here is the crawler source, if you are interested in taking a look: http://pastebin.com/kscGJCc5 Perhaps someone with a kind heart can modify it to use 127.0.0.1:9050 for all crawling requests?

Perverted answered 11/2, 2012 at 3:3 Comment(2)
"perhaps if i set up global proxy settings, php would respect them?" doubtful. Don't fopen($url). Use cURL with CURLOPT_PROXY. Not sure how DNS lookups would work though.Rola
How do I do that? I'm a total noob at this.Perverted
V
10

cURL also supports SOCKS connections; try this:

<?php

$ch = curl_init('http://google.com'); 
curl_setopt($ch, CURLOPT_HEADER, 1); 
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1); 

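// SOCKS5 proxy on Tor's default SOCKS port.
// CURLPROXY_SOCKS5_HOSTNAME hands the hostname to the proxy for resolution,
// which .onion addresses need; plain CURLPROXY_SOCKS5 resolves DNS locally.
curl_setopt($ch, CURLOPT_PROXY, 'localhost:9050');
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME);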

curl_exec($ch); 
curl_close($ch);
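
For a crawler, it may be handier to wrap this in a function that returns the page body instead of printing it. The following is a minimal sketch, not the asker's code: the names tor_curl_options and fetch_via_tor are hypothetical, the timeouts are arbitrary, and it assumes Tor's SOCKS listener is on 127.0.0.1:9050 and a PHP build that defines CURLPROXY_SOCKS5_HOSTNAME. That proxy type hands the hostname to Tor for resolution, which .onion addresses require.

```php
<?php

/**
 * Build the cURL options for one request through a local Tor SOCKS proxy.
 * Hypothetical helper; host and port are assumptions (Tor's default is 9050).
 */
function tor_curl_options($url, $proxy = '127.0.0.1:9050')
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_PROXY          => $proxy,
        // Let the proxy resolve the hostname; needed for .onion addresses.
        CURLOPT_PROXYTYPE      => CURLPROXY_SOCKS5_HOSTNAME,
        CURLOPT_CONNECTTIMEOUT => 30,     // arbitrary; Tor circuits can be slow to build
        CURLOPT_TIMEOUT        => 120,    // arbitrary overall limit in seconds
    );
}

/**
 * Fetch a URL through Tor. Returns the body as a string, or false on failure.
 */
function fetch_via_tor($url, $proxy = '127.0.0.1:9050')
{
    $ch = curl_init();
    curl_setopt_array($ch, tor_curl_options($url, $proxy));
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    return $body;
}
```

In the crawler, a call such as file_get_contents($url) could then become fetch_via_tor($url), with a check for a false return value.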
Vanthe answered 11/7, 2012 at 10:46 Comment(0)
S
9

Unless I'm missing something, the answer is yes, and here is some documentation on the Tor site. The instructions are pretty specific. I've not set Tor up as a proxy, though it's something I've considered; this is the place I would start.

EDIT: It is dead simple to set up Tor on Linux and use it as a proxy, as the documentation suggests.

sudo apt-get install tor
sudo /etc/init.d/tor start

netstat -ant | grep 9050 # verify Tor is running
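
The same check can be made from PHP before the crawler starts. This is a minimal sketch, assuming Tor's SOCKS listener is on 127.0.0.1:9050; port_is_open is a hypothetical helper name, and an open port only shows that something is listening there, not that it is Tor.

```php
<?php

/**
 * Return true if something accepts TCP connections on $host:$port.
 * Hypothetical helper; it only proves a listener exists, not that it is Tor.
 */
function port_is_open($host = '127.0.0.1', $port = 9050, $timeout = 2.0)
{
    $errno = 0;
    $errstr = '';
    // The @ suppresses the warning PHP emits when the connection is refused.
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }
    fclose($socket);
    return true;
}
```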

Now, after looking through the OP's code, we see calls to file_get_contents. While it is the easiest method to use at first, file_get_contents becomes cumbersome when you want to start parametrizing the request, because you have to use stream contexts.

The first suggestion is to move to cURL, but more reading on how SOCKS works with HTTP is probably in order to truly answer this question. To answer the question technically, though (how to send an HTTP request to a Tor SOCKS proxy on localhost), that is again easy:

<?php  
$ch = curl_init('http://google.com'); 
curl_setopt($ch, CURLOPT_HEADER, 1); 
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1); 
curl_setopt($ch, CURLOPT_PROXY, 'https://127.0.0.1:9050/'); // treats Tor's SOCKS port as an HTTP proxy
curl_exec($ch); 
curl_close($ch);

But what does Tor tell us?

HTTP/1.0 501 Tor is not an HTTP Proxy

Content-Type: text/html; charset=iso-8859-1

Basically, learn more about SOCKS & HTTP. Another option is to google around for PHP SOCKS clients. A quick inspection reveals a library that claims it can send HTTP requests over SOCKS.

EDIT:

Alright, one more edit! Seconds after finishing my last post, I found a way to do it. This article shows us how to set up something called Privoxy, an HTTP proxy that can forward the requests it receives to a SOCKS proxy. Put that in front of Tor and blamo, we're sending proxied HTTP requests through Tor!
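
With Privoxy in front of Tor, even the file_get_contents calls in the crawler can stay, because PHP's HTTP stream context accepts an HTTP proxy. This is a minimal sketch, assuming Privoxy listens on its default 127.0.0.1:8118 and that its configuration forwards to Tor with a line such as: forward-socks5 / 127.0.0.1:9050 . The helper names privoxy_context and get_via_privoxy are hypothetical, and the timeout is arbitrary.

```php
<?php

/**
 * Build a stream context that sends HTTP requests through an HTTP proxy.
 * Hypothetical helper; the default address assumes Privoxy's usual listen port.
 */
function privoxy_context($proxy = 'tcp://127.0.0.1:8118', $timeout = 120)
{
    return stream_context_create(array(
        'http' => array(
            'proxy'           => $proxy,
            // Send the full URL in the request line, as HTTP proxies expect.
            'request_fulluri' => true,
            'timeout'         => $timeout,
        ),
    ));
}

/**
 * Replacement for file_get_contents($url) that goes through the proxy.
 * Returns the body as a string, or false on failure.
 */
function get_via_privoxy($url)
{
    // The @ hides the warning PHP raises when the request fails.
    return @file_get_contents($url, false, privoxy_context());
}
```

The design choice here is to keep the crawler's existing fetch calls and change only the context they run in, rather than porting everything to cURL.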

Seigneur answered 11/2, 2012 at 4:58 Comment(4)
I've read that article hundreds of times over the past week. It does not work, trust me.Perverted
I updated my answer. It's super-easy to send requests to Tor on localhost, but the challenge is sending HTTP requests over a SOCKS connection. See the end of the revised answer that points to a library claiming it can do just that.Seigneur
OK, seconds later I found something called Privoxy, now sending proxied HTTP requests through Tor. Thanks for pushing me, this is something I'd wanted to figure out anyways.Seigneur
Privoxy is the only thing I have yet to try. I am going to see if I can start that PHP crawler through TOR requests. I'll report back if your method works. :)Perverted
G
2

You have to intercept the DNS lookup requests from the PHP script by configuring Tor with the DNSPort directive. Then you have to configure a TransPort for Tor and a VirtualAddrNetworkIPv4 range (VirtualAddrNetwork in older Tor versions). What happens when your PHP script does a DNS lookup through Tor is that Tor sees a request for an onion address and answers with an IP address from the virtual address range. You then have to redirect the traffic going to this address to the port defined with TransPort. Read the torrc manual on AutomapHostsOnResolve, VirtualAddrNetworkIPv4, DNSPort, and TransPort.
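
The directives named here could fit together roughly as follows. This is a sketch under assumptions, not a tested setup: the address range and port numbers are example values, and the two firewall rules in the comments are one possible way to do the redirect on a Linux machine with iptables.

```
# torrc fragment (example values)
VirtualAddrNetworkIPv4 10.192.0.0/10
AutomapHostsOnResolve 1
TransPort 9040
DNSPort 5353

# Not part of torrc: run as root in a shell to redirect traffic to Tor.
# iptables -t nat -A OUTPUT -p udp --dport 53 -j REDIRECT --to-ports 5353
# iptables -t nat -A OUTPUT -p tcp -d 10.192.0.0/10 -j REDIRECT --to-ports 9040
```

With something like this in place, the PHP script needs no proxy settings at all: its DNS lookup for an .onion name returns an address from 10.192.0.0/10, and connections to that address are handed to Tor's TransPort.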

Geronto answered 25/1, 2014 at 12:27 Comment(1)
Adding an example would be great; putting all that together may be harder for an inexperienced user than seeing an example.Damson
K
1

I think it is as simple as running your command-line request with the usewithtor or torify wrapper. For example:

$ usewithtor php crawl.php

The script will then be able to interact with .onion sites. Having built a crawler for Tor myself, I definitely would not go this route for production use; I use Python, PySocks, and other crawler libraries instead of cURL. Hopefully this answers your question and gives you some ideas for other implementation strategies moving forward.

Thanks

Knapp answered 27/5, 2015 at 14:23 Comment(0)
M
0

I searched for how to do the same thing in PHP with cURL. I read many topics and examples, but none of them worked. Then I came across another post on Stack Overflow that may be of interest: How can I connect to a Tor hidden service using cURL in PHP?

I managed to find a workaround that works for me in PHP.

A small example with https://blockchainbdgpzk.onion/

exec('curl -k --socks5-hostname 127.0.0.1:9150 "https://blockchainbdgpzk.onion/tobtc?currency=EUR&value=5"', $a);

print_r( $a );

This returns: Array ( [0] => 0.0029577 )
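
When a crawler feeds arbitrary URLs into exec, quoting each argument becomes important. This is a minimal sketch under assumptions: tor_curl_command and exec_tor_curl are hypothetical helper names, 127.0.0.1:9150 is the proxy address from the example in this answer, and the sketch leaves out -k so that certificate verification stays on.

```php
<?php

/**
 * Build the curl command line for a request through a Tor SOCKS proxy.
 * Hypothetical helper; the default proxy address is an assumption.
 */
function tor_curl_command($url, $proxy = '127.0.0.1:9150')
{
    // escapeshellarg() quotes each value so that characters such as & or ;
    // in a crawled URL cannot be interpreted by the shell.
    return 'curl -s --socks5-hostname ' . escapeshellarg($proxy)
         . ' ' . escapeshellarg($url);
}

/**
 * Run the command and return the output lines, or false if curl failed.
 */
function exec_tor_curl($url, $proxy = '127.0.0.1:9150')
{
    $lines = array();
    $status = 0;
    exec(tor_curl_command($url, $proxy), $lines, $status);
    // curl exits with a non-zero status when the transfer fails.
    return $status === 0 ? $lines : false;
}
```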

As I'm on Windows, I copied curl.exe and its certificate bundle into the folder c:\windows\system32.

Alternatively, this works too; just add these two options (the equivalent of -k):

curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);

Source: PHP CURL CURLOPT_SSL_VERIFYPEER ignored

$url = "https://blockchainbdgpzk.onion/tobtc?currency=EUR&value=5";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
// 7 is CURLPROXY_SOCKS5_HOSTNAME: the proxy resolves the hostname, as .onion needs
curl_setopt($ch, CURLOPT_PROXYTYPE, 7);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:9150');
// Equivalent of curl -k: disables certificate checks
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);

ob_start();

curl_exec ($ch);
curl_close ($ch);

$result = ob_get_contents();
ob_end_clean();

var_dump($result);

This returns: string '0.00296787' (length=10)

It's not perfect, but maybe it can help someone. Sorry for my shit English, friends.

Mightily answered 16/7, 2017 at 17:49 Comment(0)
F
-2

Just make your own HTTP proxy:

<?php

/**
* Proxy script that performs any HTTP request requested.
*/

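// Check key
$key = 'YOUR_API_KEY';
// Reject requests with a missing or wrong API key (hash_equals avoids timing leaks)
if (!isset($_GET['key']) || !is_string($_GET['key']) || !hash_equals($key, $_GET['key'])) die;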

// Check URL
$url = isset($_GET['url']) ? trim(base64_decode($_GET['url'])) : '';
if(!$url || !filter_var($url, FILTER_VALIDATE_URL)) die; // Incorrect URL

class MyCurl {

    /**
    * CURL resource link
    * 
    * @var resource
    */
    protected $resource;

    /**
    * Constructor
    * 
    * @param String $url
    * @return MyCurl
    */
    public function __construct($url = 'localhost'){
        $this->resource = curl_init();
        $this->setUrl($url);
        $this->setOptions(array(
//          CURLOPT_RETURNTRANSFER => TRUE,
            CURLOPT_AUTOREFERER => TRUE,
            CURLOPT_FOLLOWLOCATION => TRUE,
            CURLOPT_REFERER => 'http://www.google.com/',
            CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; MSIE 5.01; Windows NT 5.0)',
            CURLOPT_SSL_VERIFYHOST => FALSE,
            CURLOPT_SSL_VERIFYPEER => FALSE,
        ));
    }

    /**
    * Set URL for the next request
    * 
    * @param String $url
    */
    public function setUrl($url = 'localhost') {
        $this->setOption(CURLOPT_URL, $url);
    }

    /**
    * Sets option to the CURL resource.
    * See http://www.php.net/manual/en/function.curl-setopt.php for option description
    * 
    * @param int $name Option identifier
    * @param mixed $value Option value
    * @return MyCurl Returns itself for sugar-code
    */
    public function & setOption($name, $value){
        curl_setopt($this->resource, $name, $value);
        return $this;
    }

    /**
    * Sets multiple CURL options at once
    * 
    * @param array $options Associative array of options
    * @return MyCurl Returns itself for sugar-code
    */
    public function & setOptions($options){
        curl_setopt_array($this->resource, $options);
        return $this;
    }

    /**
    * Set User-Agent header of the browser
    * 
    * @param String $useragent Defaults to Mozilla browser
    */
    public function setUserAgent($useragent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0') {
        $this->setOption(CURLOPT_USERAGENT, $useragent);
    }

    /**
    * Get curl request info
    * 
    * @return array
    */
    public function info() {
        return curl_getinfo($this->resource);
    }

    /**
    * Return sent headers if CURLINFO_HEADER_OUT option was enabled
    * 
    * @return String Headers
    */
    public function headersSent() {
        return curl_getinfo($this->resource, CURLINFO_HEADER_OUT);
    }

    /**
    * Executes CURL request
    *
    * @return mixed Returns CURL execution result
    */
    public function execute(){
        return curl_exec($this->resource);
    }

    /**
    * Cleans CURL connection
    */
    function __destruct(){
        curl_close($this->resource);
    }

}

$curl = new MyCurl($url);
$curl->execute();
Flem answered 30/7, 2013 at 15:7 Comment(2)
This does not answer the question.Engen
I actually like this because some may only have access to lots of places that run php instead of having access to one dedicated/VPS where they can install privoxy. If you have say a dozen hosting accounts with different ips you could set up your own small proxy network.Desist
