PhantomJS returning empty web page (python, Selenium)
I'm trying to screen-scrape a web site from a Python script (using Selenium) without launching a visible browser instance. I can do this with Chrome or Firefox (I've tried it, and it works), but I want to use PhantomJS so it's headless.

The code looks like this:

import sys
import traceback
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)

try:
    # Choose our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #browser = webdriver.PhantomJS()
    #browser = webdriver.Firefox()
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    browser.get("https://www.whatever.com")

    # For debug, see what we got back
    html_source = browser.page_source
    with open('out.html', 'w') as f:
        f.write(html_source)

    # PROCESS THE PAGE (code removed)

except Exception as e:
    browser.save_screenshot('screenshot.png')
    traceback.print_exc(file=sys.stdout)

finally:
    browser.quit()  # quit() also shuts down the PhantomJS process; close() only closes the window

The output is merely:

<html><head></head><body></body></html>

But when I use the Chrome or Firefox options, it works fine. I thought maybe the web site was returning junk based on the user agent, so I tried faking that out. No difference.
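(A quick way to check the user-agent theory independent of Selenium is a plain stdlib request, so the raw server response can be compared with what PhantomJS sees. Python 3 urllib shown; the URL is the placeholder from above:)

```python
import urllib.request

# Build a request carrying the same spoofed User-Agent the script uses.
req = urllib.request.Request(
    "https://www.whatever.com",  # placeholder URL from the question
    headers={
        "User-Agent": (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
            "(KHTML, like Gecko) Chrome/15.0.87"
        )
    },
)

# Uncomment to actually fetch and inspect the raw HTML:
# html = urllib.request.urlopen(req).read()
# print(html[:500])
```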

What am I missing?

UPDATED: I will try to keep the snippet below updated until it works. What follows is what I'm currently trying.

import sys
import traceback
import time
import re

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87")

try:
    # Set up our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    print "getting web page..."
    browser.get("https://www.website.com")

    # Need to wait for the page to load
    timeout = 10
    print "waiting %s seconds..." % timeout
    wait = WebDriverWait(browser, timeout)
    element = wait.until(EC.element_to_be_clickable((By.ID,'the_id')))
    print "done waiting. Response:"

    # Rest of the code snipped. It fails at the wait.until() call above.
Cutcheon answered 5/4, 2015 at 23:54 Comment(0)

I was facing the same problem, and no amount of code to make the driver wait was helping.
The problem is SSL on HTTPS web sites; telling PhantomJS to ignore SSL errors does the trick.

Call the PhantomJS driver as:

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])

This solved the problem for me.
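A likely reason the --ssl-protocol flag matters: PhantomJS builds of that era reportedly defaulted to SSLv3, which many servers disabled after the POODLE vulnerability, so the TLS handshake failed silently and the page rendered empty. If the flags are passed from more than one place, a tiny helper keeps them consistent (this helper is hypothetical, not part of Selenium):

```python
def phantomjs_service_args(ignore_ssl_errors=True, ssl_protocol="TLSv1"):
    """Assemble the PhantomJS command-line flags used in the answer above."""
    args = []
    if ignore_ssl_errors:
        args.append("--ignore-ssl-errors=true")
    if ssl_protocol:
        args.append("--ssl-protocol=%s" % ssl_protocol)
    return args

# driver = webdriver.PhantomJS(service_args=phantomjs_service_args())
```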

Dordrecht answered 22/3, 2016 at 15:58 Comment(2)
This worked for me; the difference from the other answer is the '--ssl-protocol=TLSv1' part. Do you know why this caused it to work? – Meridith
I ran into this issue today as well. My pages stopped working and were returning <html><head></head><body></body></html>. The ssl-protocol=TLSv1 flag solved it. Amazing find. – Chuckchuckfull

You need to wait for the page to load. Usually, it is done by using an Explicit Wait to wait for a key element to be present or visible on a page. For instance:

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


# ...
browser.get("https://www.whatever.com")

wait = WebDriverWait(browser, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.content")))

html_source = browser.page_source
# ...

Here, we'll wait up to 10 seconds for a div element with class="content" to become visible before getting the page source.
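Under the hood, WebDriverWait is essentially a polling loop. A minimal stdlib sketch of the same idea (the wait_until helper is illustrative, not Selenium API):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value, or raise on timeout."""
    deadline = time.time() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.time() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll)

# With Selenium, the condition could be a lambda such as:
# wait_until(lambda: browser.find_elements(By.CSS_SELECTOR, "div.content"), timeout=10)
```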


Additionally, you may need to ignore SSL errors:

browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])

That said, I'm fairly sure this is related to known redirect issues in PhantomJS; there is an open ticket about it in the PhantomJS bug tracker.

Cattier answered 6/4, 2015 at 0:0 Comment(13)
Okay, I'll give that a try... but how useful is the "get" command if it doesn't wait for the page load to complete before returning? Seems like that should be built in. Is there a non-timed wait command you can use that waits for the "page loaded" event (or whatever it's called)? – Cutcheon
@Cutcheon nope, Selenium will not wait for outstanding async requests or async code execution in the browser. Using an explicit wait should solve the problem. – Cattier
We're getting close, but still no cigar. I added the wait, but waiting for an ID to be present timed out, though I know that ID should be there. Code output and screenshot are still empty. Traceback (most recent call last): File "scrape_CS.py", line 35, in <module> element = wait.until(EC.element_to_be_clickable((By.ID,'loginField'))) File "/Users/carey/anaconda/lib/python2.7/site-packages/selenium/webdriver/support/wait.py", line 75, in until raise TimeoutException(message, screen, stacktrace) TimeoutException: Message: Screenshot: available via screen – Cutcheon
@Cutcheon okay, thanks for trying it out. I've updated the answer, please check. – Cattier
That had the same result, unfortunately. Do I need all the "dcap" stuff, by the way? If not, I'll remove it. And can you explain why you think ignore-ssl-errors is the problem? I did see a warning about this in Chrome, but it still works there. It just won't work in PhantomJS. – Cutcheon
@Cutcheon I've seen ignoring SSL errors help others before. Can you share the URL you are using in your code so I can try reproducing the problem? Thanks. – Cattier
BTW, it works when I go back to Chrome. I captured the warning from Chrome, too; it says "You are using an unsupported command-line flag: --ignore-certificate-errors. Stability and security will suffer." This was there before and after the change you asked me to make. @Cattier – Cutcheon
@Cutcheon thanks, that was helpful. I was able to reproduce the problem. Please see the update. – Cattier
Thanks @Cattier for running that down. That issue was raised a long time ago, though... it seems they haven't fixed it. I haven't read all the comments yet; maybe there are workarounds in there. I want to run this as a cron job, and I don't want a real browser UI instance popping up every time and getting in my way. I really wanted headless. Not sure what to do now. – Cutcheon
@Cutcheon just switch to Chrome or Firefox with a virtual display (see xvfb); it will save you from current and future issues with PhantomJS. – Cattier
I've been looking into that, but it doesn't seem like xvfb is supported on OS X, since OS X doesn't do X11. Still reading, though... – Cutcheon
@Cutcheon I'm in the same situation, trying to set up PhantomJS because I need the script to run both on OS X and a Linux server, and xvfb was giving problems on macOS. Did you find a workaround for this? – Lissy
@AlexanderFradiani I think you don't need xvfb if you are running PhantomJS; it is already headless. – Cattier

This worked for me:

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])

Decomposition answered 11/10, 2018 at 13:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.