rvest Error in open.connection(x, "rb") : Timeout was reached
I'm trying to scrape the content from http://google.com, but this error message comes up:

library(rvest)  
html("http://google.com")

Error in open.connection(x, "rb") :
  Timeout was reached
In addition: Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")

Since I'm on a company network, this may be caused by a firewall or proxy. I tried to use set_config, but it's not working.
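For reference, what I tried with set_config looks roughly like this; the proxy host and port are placeholders for whatever the company network actually uses:

library(httr)
# Placeholder proxy address -- replace with your company's real proxy host/port
set_config(use_proxy(url = "http://proxy.mycompany.com", port = 8080))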

Leadership answered 23/10, 2015 at 5:54 Comment(5)
Have you also tried the read_html command, since the error message says html is deprecated? This might not solve your problem, but maybe the output is more helpful. - Botsford
Yes, the message is: Error in open.connection(x, "rb") : Timeout was reached In addition: Warning message: closing unused connection 3 (google.com) - Leadership
Actually, this code works fine on my home network, but when I try to use it on the company network, the error comes up. - Leadership
Seems not reproducible as a code issue; this returns a result for me. If you figure out what was going on with the network and how to work around it, you could post that as an answer. - Flashing
Same issue for me; apparently from the network I am using, Google asks for proof of not being a bot, and the page of course times out when the scraper runs. - Ahmedahmedabad
I encountered the same Error in open.connection(x, "rb") : Timeout was reached issue when working behind a proxy on the office network.

Here's what worked for me:

library(rvest)

url <- "http://google.com"
# Download the page with download.file(), which goes through the system's
# proxy settings, then parse the local copy
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
content <- read_html("scrapedpage.html")

Credit: https://stackoverflow.com/a/38463559
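A small variation, if you'd rather not leave scrapedpage.html in your working directory, is downloading to a temporary file instead (same idea, just a throwaway path):

tmp <- tempfile(fileext = ".html")   # lives in the session temp dir, cleaned up when R exits
download.file(url, destfile = tmp, quiet = TRUE)
content <- read_html(tmp)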

Camshaft answered 3/3, 2017 at 1:46 Comment(2)
That worked for me as well. In my case I found a more permanent solution to be setting the proxy environment variables. Here are the steps: https://mcmap.net/q/669404/-how-to-configure-the-curl-package-in-r-with-default-web-proxy-settings - Explicate
Thank you - that worked for me on a company network. - Hanser
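For anyone following the environment-variable route from the comment above, a minimal sketch looks like this; the proxy URL is a placeholder, so substitute whatever your network admin gives you:

# Placeholder proxy URL -- libcurl picks these variables up for later requests
Sys.setenv(http_proxy  = "http://proxy.mycompany.com:8080",
           https_proxy = "http://proxy.mycompany.com:8080")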
This is probably an issue with your call to read_html (or html in your case) not properly identifying itself to the server it's trying to retrieve content from, which is the default behaviour. Using curl, add a user agent to the handle argument of read_html so your scraper identifies itself.

library(rvest)
library(curl)
# Supply a user-agent string via a curl handle so the request identifies itself
read_html(curl("http://google.com", handle = new_handle(useragent = "Mozilla/5.0")))
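If you prefer httr over a raw curl handle, a roughly equivalent sketch (same user-agent idea) would be:

library(httr)
library(rvest)

# Send the request with an explicit user agent, then parse the body text
resp <- GET("http://google.com", user_agent("Mozilla/5.0"))
page <- read_html(content(resp, as = "text", encoding = "UTF-8"))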
Fierro answered 4/8, 2016 at 16:43 Comment(0)
I ran into this issue because my VPN was switched on. Immediately after turning it off, I retried, and the issue was resolved.

Barghest answered 30/9, 2017 at 3:49 Comment(0)
I was facing a similar problem, and a small hack solved it. Two characters in the hyperlink were causing the problem, so I replaced "è" with "e" and "é" with "e" and it worked. Just make sure the hyperlink remains valid afterwards.
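A minimal sketch of that substitution, assuming url holds the offending hyperlink (the address below is made up):

library(rvest)
url <- "http://example.com/café-mère"  # hypothetical link containing é and è
url_fixed <- chartr("éè", "ee", url)   # swap both accented characters for plain e
content <- read_html(url_fixed)

Alternatively, utils::URLencode(url) may handle such characters without editing the visible link, provided the server accepts percent-encoded paths.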

Epistle answered 8/4, 2018 at 14:36 Comment(0)
I got the error message when my laptop was connected to my router over wifi, but my ISP was having some sort of outage:

read_html(brand_url)
Error in open.connection(x, "rb") : 
  Timeout was reached: [somewebsite.com.au] Operation timed out after 10024 milliseconds with 0 out of 0 bytes received

In the above case, my wifi was still connected to the modem, but pages wouldn't load via rvest (nor in a browser). It was temporary and lasted ~2 minutes.

It may also be worth noting that a different error message is received when wifi is turned off entirely:

brand_page <- read_html(brand_url)
Error in open.connection(x, "rb") : 
  Could not resolve host: somewebsite.com.au
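If a script needs to survive these outages rather than abort, one pattern (a sketch, reusing brand_url from the snippets above) is to wrap the call in tryCatch:

page <- tryCatch(
  read_html(brand_url),
  error = function(e) {
    # Catches both "Timeout was reached" and "Could not resolve host"
    message("Fetch failed: ", conditionMessage(e))
    NULL  # caller can test is.null(page) and retry later
  }
)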
Hilel answered 5/8, 2020 at 2:8 Comment(0)
