I wrote a simple web scraper for expedia.com. Using the Java Selenium HtmlUnitDriver, I was able to scrape data from the site successfully when running it locally.
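For context, the core of the scraper is roughly this (simplified; the class name is just for illustration, and the URL is the one from the output below):

```java
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class ExpediaScraper {
    public static void main(String[] args) {
        // Enable JavaScript so the page renders as it would in a real browser
        HtmlUnitDriver driver = new HtmlUnitDriver(true);
        driver.get("https://www.expedia.com/things-to-do/search?location=Paris&pageNumber=1");

        String htmlString = driver.getPageSource();
        System.out.println("Current URL is " + driver.getCurrentUrl());
        System.out.println("The htmlString:\n" + htmlString);

        driver.quit();
    }
}
```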
However, when I deploy it to an EC2 server, Expedia always detects it as a bot and returns a page with a captcha to prove a human is accessing the site.
I suspect this has something to do with the IP addresses of EC2 servers having been blacklisted by expedia.com somehow.
I've also tried scraping other websites that don't run this kind of human check.
Any idea how to get around this?
Things I tried that were still detected as a bot (configuration sketch below):
- Changing the user agent to the one I use in my local browser
- Setting a proxy
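For reference, this is roughly how I set both, assuming a recent HtmlUnit (the `BrowserVersionBuilder` API); the user-agent string and the proxy host/port are placeholders:

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class DriverFactory {
    static HtmlUnitDriver buildDriver() {
        // Spoof the user agent of my local browser (placeholder UA string)
        BrowserVersion browser = new BrowserVersion.BrowserVersionBuilder(BrowserVersion.CHROME)
                .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                        + "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
                .build();

        // Second argument enables JavaScript
        HtmlUnitDriver driver = new HtmlUnitDriver(browser, true);
        // Route requests through a proxy (placeholder host/port)
        driver.setProxy("proxy.example.com", 8080);
        return driver;
    }
}
```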
Update: setting a proxy server actually gives me a different error:
Current URL is https://www.expedia.com/things-to-do/search?location=Paris&pageNumber=1
The htmlString:
```html
<!--?xml version="1.0" encoding="ISO-8859-1"?-->
<html>
<head>
<title>
500 Internal Server Error
</title>
</head>
<body>
<h1> Internal Server Error </h1>
<p> The server encountered an internal error or misconfiguration and was unable to complete your request. </p>
<p> Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error. </p>
<p> More information about this error may be available in the server error log. </p>
<hr>
<address> Apache/2.4.18 (Ubuntu) Server at www.expedia.com Port 443 </address>
</body>
</html>
```
A captcha can't be automated; if it could be, it would fail at being a captcha. – Coupling