Selenium HtmlUnitDriver Web Scrape Got Captcha Page From EC2 Server

I wrote a simple web scraper to scrape expedia.com. Using Java Selenium HtmlUnitDriver, I was able to scrape data from the site successfully when I ran it locally.

However, when I deploy it to an EC2 server, it always gets back the page where Expedia has detected it as a bot and displays a captcha to prove that a human is accessing the site.

I think it might have something to do with the IP addresses of EC2 servers, which Expedia may have blacklisted somehow?

I've also tried scraping other websites that don't care about bots or don't run a human test.

Any idea how to get around this?

Things I tried that were still detected as a bot:

  • Changing the user agent to the one I use in my local browser
  • Setting a proxy

Update: Actually setting a proxy server gives me a different error:

Current URL is https://www.expedia.com/things-to-do/search?location=Paris&pageNumber=1

The htmlString:

<!--?xml version="1.0" encoding="ISO-8859-1"?-->
<html>
 <head> 
  <title>
      500 Internal Server Error
    </title> 
 </head> 
 <body> 
  <h1> Internal Server Error </h1> 
  <p> The server encountered an internal error or misconfiguration and was unable to complete your request. </p> 
  <p> Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error. </p> 
  <p> More information about this error may be available in the server error log. </p> 
  <hr> 
  <address> Apache/2.4.18 (Ubuntu) Server at www.expedia.com Port 443 </address>   
 </body>
</html>
Irving answered 1/8, 2018 at 13:54 Comment(5)
You can ask your devs, or the devs responsible for the site, to give you a test environment that bypasses the captcha. Basically, a captcha can't be automated; if it could be, it would fail at being a captcha. - Coupling
Do you mean talk to the Expedia devs in this case? - Irving
Well, if that's an external client, then nobody can help. :( - Coupling
If you suspect it may be because of the IP, you can also try setting a proxy in a different IP range from the ones EC2 uses (your own, perhaps; I don't know how to set this up off the top of my head). Additionally, try to modify or spoof your user-agent string to something more common. - Crawfish
The only thing that differs between running it locally and running it on my EC2 server is the IP, isn't that right? If it works locally, that means the site cannot detect HtmlUnitDriver scraping it, so it has to be the IP. Is there a scalable way to set up a proxy? - Irving

Have you covered these points?

-Which user agent are you using? Make sure it is the same one you would use when browsing as a human; more details in this link.

-Are you inserting waits in your navigation? If you try to click or navigate as soon as a page loads, you are not simulating regular navigation. More details.

-Which driver are you using? There is a trick with chromedriver: rename the internal variable "cdc_" to something else, such as "aaa_", so that any JavaScript code from the server that tries to detect this variable (cdc_) will fail. More details.

-There are more things to look into if you really need to avoid detection by the server:

-Is there a honeypot in place?
-Is your IP (the EC2 IP) already blocked? You could route your traffic through a VPN tunnel.
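
The point about waits can be sketched in Java using only the standard library. `HumanPause` and its methods are hypothetical names introduced for illustration and are not part of Selenium. The delay range is an arbitrary assumption, not a value known to satisfy any particular site.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not a Selenium class) for pausing a random time between page actions. */
final class HumanPause {
    private HumanPause() {}

    /** Returns a random duration between minMillis and maxMillis, both inclusive. */
    static Duration randomDelay(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis || maxMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("need 0 <= minMillis <= maxMillis < Long.MAX_VALUE");
        }
        // The upper bound of nextLong is exclusive, so add 1 to make maxMillis reachable.
        long millis = ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
        return Duration.ofMillis(millis);
    }

    /** Sleeps for a random duration in the given range and returns the duration chosen. */
    static Duration pause(long minMillis, long maxMillis) throws InterruptedException {
        Duration delay = randomDelay(minMillis, maxMillis);
        Thread.sleep(delay.toMillis());
        return delay;
    }
}
```

One possible use is to call `HumanPause.pause(2000, 5000)` after each `driver.get(...)` or click. Randomized delays only make the timing look less mechanical; they would not help if the block is based purely on the IP address.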

Interesting articles:

https://www.kdnuggets.com/2018/02/web-scraping-tutorial-python.html

https://antoinevastel.com/bot%20detection/2017/08/05/detect-chrome-headless.html

https://intoli.com/blog/making-chrome-headless-undetectable/

Bufford answered 1/8, 2018 at 17:4 Comment(2)
I tried setting the user agent to the one I use in my local browser, and I tried setting a proxy too. Neither worked. - Irving
The only thing that differs between running it locally and running it on my EC2 server is the IP, isn't that right? If it works locally, that means the site cannot detect HtmlUnitDriver scraping it, so it has to be the IP. Is there a scalable way to set up a proxy? - Irving
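
On the question of a scalable proxy setup, one common approach is to rotate requests through a pool of proxies rather than relying on a single one. Below is a minimal sketch using only the Java standard library. `ProxyPool` is a hypothetical helper introduced for illustration, and the `proxy-a.example` style host names are placeholders for proxies you would have to provide yourself.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin pool of HTTP proxies (illustration only). */
final class ProxyPool {
    private final List<Proxy> proxies = new ArrayList<>();
    private final AtomicInteger counter = new AtomicInteger();

    /** Each entry must have the form "host:port", e.g. "proxy-a.example:8080" (placeholder). */
    ProxyPool(List<String> hostPorts) {
        if (hostPorts.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        for (String hostPort : hostPorts) {
            int colon = hostPort.lastIndexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("expected host:port but got " + hostPort);
            }
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            // createUnresolved stores the host name without doing a DNS lookup here.
            proxies.add(new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port)));
        }
    }

    /** Returns the next proxy in round-robin order; safe to call from several threads. */
    Proxy nextProxy() {
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }
}
```

Each scrape could then take `pool.nextProxy()`, read the host and port from its address, and pass them to the driver's proxy setting (HtmlUnitDriver has a `setProxy(host, port)` method, as far as I know). Whether this gets past the captcha depends on the proxies being outside whatever ranges the site blocks.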