Node request for certain site results in ETIMEDOUT error most of the time

Specs

Here's some background info on the system I'm running:

  • Ubuntu v 14.04

  • Node v4.4.0

  • Node request module v2.69.0

All of this runs on a DigitalOcean droplet/server in a New York-based data center.

 

Problem Description

So I run the following js file:

var request = require('request');

var url = 'http://www.supremenewyork.com/';

request(url, function(err, res, body) { 
  if (err) {
    console.log(err);
    return;
  }

  console.log('body:', body);
});

I run this on my droplet. Initially, roughly 70-80% of my attempts failed; now every single attempt gives me an ETIMEDOUT error like so:

{ [Error: connect ETIMEDOUT 52.6.25.180:80]
  code: 'ETIMEDOUT',
  errno: 'ETIMEDOUT',
  syscall: 'connect',
  address: '52.6.25.180',
  port: 80 }

Of note, the errors seem to come in 'waves'. That is, I'll manage to get a handful of requests through for a certain period of time, followed by a string of ETIMEDOUT errors. Errors happen more often than I am able to get my requests through by a ratio of approximately 3:1 errors to successes.
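One way to put numbers on the 'waves' described above is to record the outcome of each attempt and summarize the sequence. The following is only a minimal sketch: `summarizeAttempts` is a hypothetical helper (it is not part of the request module), and it assumes each attempt has been recorded as `true` (got a response) or `false` (ETIMEDOUT).

```javascript
// Hypothetical helper: summarizes a sequence of attempt outcomes,
// where true = got a response and false = the request timed out.
function summarizeAttempts(outcomes) {
  var failures = 0;
  var longestFailureRun = 0;
  var currentRun = 0;

  outcomes.forEach(function (ok) {
    if (ok) {
      currentRun = 0; // a success ends the current streak of failures
    } else {
      failures += 1;
      currentRun += 1;
      if (currentRun > longestFailureRun) {
        longestFailureRun = currentRun;
      }
    }
  });

  return {
    total: outcomes.length,
    failures: failures,
    failureRate: outcomes.length ? failures / outcomes.length : 0,
    longestFailureRun: longestFailureRun
  };
}
```

A long `longestFailureRun` next to a moderate `failureRate` would be consistent with failures arriving in bursts rather than being spread evenly across attempts.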

On my own computer (Mac running OS X El Capitan), running the js file for the given site works with 100% success (i.e. I've never run into this problem before)... so I'm not sure why the problem is contained to my droplet.

Any pointers would be appreciated.

 

Research/Similar Posts:

 

Additional Info

I also feel that it's worth mentioning that the site I'm trying to make requests to actively has a problem with scripts and web scrapers, so I wouldn't be surprised if they tried everything in the book to prevent this from taking place.

 

Possible Causes

  • IP address blocking --> I originally ruled this out because I would still occasionally get responses from the server, but I am no longer able to get any sort of response from it. This might be the cause, but I am really confused about how they might be doing this. No issues on my local machine, no issues requesting their page from a browser on my droplet, but then this.

  • 'Rate-limiting' of my requests --> if this is somehow the case, I would like to know why this is happening specifically on my server and not, say, on my local machine

  • The manner in which I'm making my requests (i.e. not through a browser). --> I don't think this is the case because I can run the first script with a 100% response rate on my local computer (unless there is something my local computer does before sending my request to their server).

  • The system itself. I've only tested the first script on my Mac. Perhaps the code runs differently on different OS's/systems..?

 

Diagnosing with traceroute

So as per @RabeeAbdelWahab's suggestion, I attempted to diagnose the problem with traceroute. However, I have practically no knowledge of networks so I'm not sure how to proceed. Here's an example output:

traceroute to <> (XXX.XXX.XXX.XXX), 30 hops max, 60 byte packets
 1  45.55.192.254 (45.55.192.254)  8.903 ms  8.879 ms  8.865 ms
 2  162.243.188.229 (162.243.188.229)  1.028 ms 162.243.188.233 (162.243.188.233)  0.986 ms  1.004 ms
 3  xe-0-9-0-17.r08.nycmny01.us.bb.gin.ntt.net (129.250.204.113)  1.923 ms  1.918 ms nyk-b3-link.telia.net (62.115.45.5)  1.587 ms
 4  ae-11.amazon.nycmny01.us.bb.gin.ntt.net (129.250.201.138)  1.935 ms ae-10.amazon.nycmny01.us.bb.gin.ntt.net (129.250.201.134)  1.586 ms *
 5  nyk-b5-link.telia.net (213.155.131.137)  1.822 ms * *
 6  * * 62.115.32.130 (62.115.32.130)  1.361 ms
 7  * * *
 8  * * *
 9  * * *
10  54.239.110.157 (54.239.110.157)  33.817 ms * 54.239.110.133 (54.239.110.133)  27.683 ms
11  54.239.111.17 (54.239.111.17)  8.193 ms 205.251.244.128 (205.251.244.128)  7.883 ms 54.239.111.23 (54.239.111.23)  9.319 ms
12  205.251.245.55 (205.251.245.55)  8.253 ms 54.239.110.175 (54.239.110.175)  24.601 ms 205.251.244.195 (205.251.244.195)  8.250 ms
13  * 54.239.111.27 (54.239.111.27)  9.319 ms 54.239.111.29 (54.239.111.29)  9.290 ms
14  * * *
15  54.239.111.23 (54.239.111.23)  9.136 ms * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *

 

So after running traceroute several more times, I notice the following patterns:

  • The "***" outputs begin at some point on or slightly after the 15th hop.

  • The last IP address before the "* * *" hops mostly seems to alternate between the same two addresses: 205.251.XXX.XXX (slightly more often the case) or 54.239.XXX.XXX. In a few select instances I'll get an address like 72.21.222.155.

In addition, I have seen no differences when:

  • Running traceroute with the -m 255 option (i.e. max number of hops).

  • Running traceroute with the -I option.

  • Running traceroute with the -e option.

  • Running traceroute with the -p 80 or -p 25 options.

  • Running traceroute on a different droplet located in the same data center as the droplet in question.

 

Diagnosing with ping

Using ping, here's a running list of sites I can and cannot connect to:

Can connect

  • google.com

  • facebook.com

  • reddit.com

  • github.com

  • stackoverflow.com

  • youtube.com

  • twitter.com

Can't connect:

  • amazon.com

  • microsoft.com

  • apple.com

  • walmart.com

  • paypal.com

  • cnn.com

  • nyt.org

  • wolframalpha.com

Observations: Is there a reason why I seem to be able to connect to sites that have 'social' features (and otherwise not)?

 

Apparently, it's common for sites not to reply to ICMP (which is what ping and traceroute -I use). Please disregard the above...
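Since the error shown earlier comes from the `connect` syscall on port 80, a TCP-level probe is closer to what the request module actually does than an ICMP ping. Below is a minimal sketch using Node's built-in `net` module. `probeTcp` is a hypothetical helper name, and the timeout is enforced with an ordinary timer rather than relying on the OS-level connect timeout.

```javascript
var net = require('net');

// Hypothetical helper (not from the original post): tries to open a plain TCP
// connection and reports the outcome via callback(err, elapsedMs).
function probeTcp(host, port, timeoutMs, callback) {
  var start = Date.now();
  var finished = false;
  var socket = net.connect({ host: host, port: port });

  function finish(err) {
    if (finished) return; // report only the first outcome
    finished = true;
    clearTimeout(timer);
    socket.destroy();
    callback(err, Date.now() - start);
  }

  // Give up ourselves instead of waiting for the OS-level connect timeout.
  var timer = setTimeout(function () {
    var err = new Error('no TCP connection within ' + timeoutMs + ' ms');
    err.code = 'ETIMEDOUT';
    finish(err);
  }, timeoutMs);

  socket.once('connect', function () { finish(null); });
  socket.once('error', finish);
}
```

For example, `probeTcp('www.supremenewyork.com', 80, 5000, function (err, ms) { console.log(err ? err.code : 'connected', ms + ' ms'); });` could be run on both the droplet and the local machine to compare results at the TCP level.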

 

Additional findings

So I've noticed that if I modify my request to take an additional 'User-Agent' header (code example provided below), I'm able to initially get back the html response.

var request = require('request');

var requestOptions = 
{
    url: 'http://www.supremenewyork.com/some/route',
    headers: {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
};

request(requestOptions, function(err, res, body) { 
  if (err) {
    console.log(err);
    return;
  }

  console.log('body:', body);
});

I'm actually able to get back a response using the above method a few times. Afterwards, it seems all my connections lead to the aforementioned ETIMEDOUT error. Then I'll have to wait some lengthy period of time and it's rinse, wash, and repeat.

I actually performed a simple two-tailed proportional test for the above (i.e. receiving a response with and without a 'User-Agent' header) and got a p-value of 0.8493... so no statistical significance between the two. Again, please disregard the aforementioned...
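For reference, a two-tailed test comparing two proportions can be sketched as follows. This is an illustration only: the post does not give the underlying counts or say which variant of the test was used, so the pooled z-test and the function names here are assumptions. The normal CDF is approximated with the Abramowitz-Stegun formula 7.1.26 for the error function.

```javascript
// Approximates erf(x) for x >= 0 using Abramowitz-Stegun formula 7.1.26
// (absolute error of roughly 1.5e-7).
function erf(x) {
  var t = 1 / (1 + 0.3275911 * x);
  var poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
             t * (-1.453152027 + t * 1.061405429))));
  return 1 - poly * Math.exp(-x * x);
}

// Two-tailed two-proportion z-test with a pooled standard error.
// x1/n1 and x2/n2 are successes/attempts in each group, e.g. responses
// received with and without the 'User-Agent' header.
// Assumes the pooled proportion is strictly between 0 and 1.
function twoProportionTest(x1, n1, x2, n2) {
  var p1 = x1 / n1;
  var p2 = x2 / n2;
  var pooled = (x1 + x2) / (n1 + n2);
  var se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  var z = (p1 - p2) / se;
  var cdf = 0.5 * (1 + erf(Math.abs(z) / Math.SQRT2));
  return { z: z, pValue: 2 * (1 - cdf) };
}
```

A large p-value, such as the 0.8493 reported above, means the observed difference in response rates is well within what chance alone would produce, so the data give no evidence that the header changes the outcome.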

Rube answered 1/4, 2016 at 13:46 Comment(16)
Have you tried running traceroute from your droplet and comparing it to the traceroute from your local machine? - Goldy
@RabeeAbdelWahab - no I haven't. I'm actually not sure what that is. Would you mind explaining what it does briefly, and point me to a resource to learn more? Thanks! - Rube
howtogeek.com/134132/… and this is a DO one: digitalocean.com/community/tutorials/… - Goldy
@RabeeAbdelWahab - ah ok I see. Sounds useful, I'll check it out. Thanks! - Rube
Hey @RabeeAbdelWahab - just figured it might be worth adding that the question now has a bounty - in case you had any ideas. - Rube
The suggestion then ended up with both requests giving the same outcome. I will try to see whether I can add something else that could be helpful. - Goldy
Have you checked this: #23633414? - Goldy
@RabeeAbdelWahab - yes I have. Unfortunately, my issue is trying to figure out the cause of the error, not what to do with it. - Rube
The issue may be originating from the server you are trying to call. Do you have access to it? Is there any logging that could be useful in this case? - Goldy
@RabeeAbdelWahab - no, I do not have access to the server in question. - Rube
Your DO droplet is hosted in NY; what about your current location, are you in NY too? What happens when you increase your timeout settings? Same error after waiting some more time? Have you tried using Node.js's http.ClientRequest instead of the request package? - Schumer
@Schumer - I've been able to successfully run the sample code from where I live, which is not in New York (and just for reference, the website name is somewhat misleading; they are a national brand although yes they are biggest in NY). As for the timeout duration, I've used the longer default before (which I believe lasted at least 1 minute, perhaps more) - same error. I have tried using the native http module (i.e. http.get), but to no avail. Would there be a difference between what I tried and what you suggested? - Rube
There isn't. http.ClientRequest is the object returned by http.get. You could also try creating more DO droplets in different locations and with other OSes. Or even create one in a different DC, like NY2 or NY3. It's trial and error now. - Schumer
@Schumer - agreed. It should be fun though. Thanks for your help. - Rube
@Rube how did you resolve the issue? I'm facing the same problem: it works fine locally, but when deployed onto OKD I get a timeout error :( I've been searching for a fix for the past few days. - Quickie
@Quickie sorry man, I think I ended up deciding that pursuing this wasn't worth the effort. I'm kinda convinced that the website I was scraping was using some software which had measures to prevent people from doing what I was doing. Best of luck, and if you find anything, do report back. - Rube

Since you said they had issues and are trying to prevent scraping or something, you may be subject to those efforts. Why would you need to keep hitting their page so often?

I think if you really want it to work you are going to need to fool their anti-scraping systems (firewall or whatever). So you can try using a droplet in a different data center/city and also try adding headers to imitate a web browser. User-Agent would be the first I would try.

var options = {
  url: "http://www.supremenewyork.com/", // request needs a full URL, including the protocol
  headers: { "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" }
};

Also make sure you don't hit their site too often and get rate limited.
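One way to avoid hitting the site too often is to space out retries with an increasing delay. The sketch below is only an illustration of that idea: `backoffDelays` and `retryWithBackoff` are hypothetical helpers, the delay values are arbitrary, and nothing here is known to match whatever limits the site actually applies.

```javascript
// Hypothetical helper: delay (in ms) before each retry, doubling from baseMs
// and capped at maxMs.
function backoffDelays(retries, baseMs, maxMs) {
  var delays = [];
  for (var i = 0; i < retries; i++) {
    delays.push(Math.min(maxMs, baseMs * Math.pow(2, i)));
  }
  return delays;
}

// Hypothetical helper: calls operation(cb) and, while it keeps failing,
// retries after each delay in turn. Once the delays run out, it gives up
// and passes the last error to callback.
function retryWithBackoff(operation, delays, callback) {
  var attempt = 0;

  function run() {
    operation(function (err, result) {
      if (!err) return callback(null, result);
      if (attempt >= delays.length) return callback(err);
      setTimeout(run, delays[attempt]);
      attempt += 1;
    });
  }

  run();
}
```

Here `operation` would wrap the actual call, for example `function (cb) { request(options, function (err, res, body) { cb(err, body); }); }`, with `backoffDelays(4, 5000, 60000)` supplying the waits between attempts.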

Dramaturgy answered 7/4, 2016 at 9:26 Comment(3)
To address your first question, I don't believe I am making a whole lot of requests. Definitely not in the hundreds, but on the other hand way more than a typical user interfacing with their site via a browser. Either way, I am definitely trying to minimize the number of requests I am making, in case they do have some sort of 'detection' system set in place (although I am inclined to believe they are not that capable). - Rube
To address your main point: In my post you will find that I actually have tried setting the User-Agent (under 'Additional Findings'). However, it does not seem to yield better results than sending a request without that header attached. This, combined with the fact that I have never had problems making requests from my local machine (i.e. I get a 100% response rate using the script mentioned in my post), leads me to suspect that HOW I'm making the requests is not the problem. That said, I'm also not sure if changing the droplet's location will do what I'm looking for. - Rube
Anyway, I appreciate the time you took to make a reply, and I DEFINITELY welcome ANY more suggestions and ideas you might have. Thanks a lot!! :) - Rube
