Selenium with Chrome on Colab exits unexpectedly - how to fix this?

I am trying to get data from a page.

URL: https://clutch.co/il/it-services

The website I am trying to scrape has some sort of anti-bot protection (Cloudflare or a similar service), so the scraper needs to use Selenium with a headless browser like headless Chrome or PhantomJS. Selenium automates a real browser, which can often get through Cloudflare's anti-bot pages the way a human user's browser does.

Here is how I use Selenium to imitate a real browser session. On Google Colab, however, it does not work properly:

import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

url = "https://clutch.co/il/it-services"
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Your scraping logic goes here
company_names = soup.select(".directory-list div.provider-info--header .company_info a")
locations = soup.select(".locality")

company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]

data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.index += 1
print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data.csv", index=False)

driver.quit()
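
A side note on the snippet above, separate from the Chrome startup error: `pd.DataFrame` raises a `ValueError` when the two lists in `data` have different lengths, which can happen if some entries on the page have no `.locality` element. A minimal sketch of one way to guard against that, assuming padding with an empty string is acceptable (the helper name `pair_columns` and the sample values are hypothetical, not from the page):

```python
from itertools import zip_longest


def pair_columns(names, locations, fill=""):
    """Pair two scraped lists row by row, padding the shorter one with `fill`."""
    return [
        {"Company Name": name, "Location": location}
        for name, location in zip_longest(names, locations, fillvalue=fill)
    ]


# Hypothetical sample values, only to show the padding behaviour.
rows = pair_columns(["Acme", "Globex", "Initech"], ["Tel Aviv", "Haifa"])
print(rows)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)`. Padding only prevents the length error; it can still pair a company with the wrong location, so selecting each listing's container first and reading both fields from it is likely the more robust approach.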

Question: can I use Selenium to imitate real browser interaction on Google Colab too? How do I fix the issue I am facing?

Here is the result I get (also at https://pastebin.com/FpEDLNiA):

SessionNotCreatedException                Traceback (most recent call last)
<ipython-input-4-ffdb44a94ddd> in <cell line: 9>()
      7 options = Options()
      8 options.headless = True
----> 9 driver = webdriver.Chrome(options=options)
     10 
     11 url = "https://clutch.co/il/it-services"

5 frames
/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    227                 alert_text = value["alert"].get("text")
    228             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
--> 229         raise exception_class(message, screen, stacktrace)

SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /root/.cache/selenium/chrome/linux64/120.0.6099.109/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x56d4ca1b8f83 <unknown>
#1 0x56d4c9e71cf7 <unknown>
#2 0x56d4c9ea960e <unknown>
#3 0x56d4c9ea626e <unknown>
#4 0x56d4c9ef680c <unknown>
#5 0x56d4c9eeae53 <unknown>
#6 0x56d4c9eb2dd4 <unknown>
#7 0x56d4c9eb41de <unknown>
#8 0x56d4ca17d531 <unknown>
#9 0x56d4ca181455 <unknown>
#10 0x56d4ca169f55 <unknown>
#11 0x56d4ca1820ef <unknown>
#12 0x56d4ca14d99f <unknown>
#13 0x56d4ca1a6008 <unknown>
#14 0x56d4ca1a61d7 <unknown>
#15 0x56d4ca1b8124 <unknown>
#16 0x79bb253feac3 <unknown>

My Colab notebook: https://colab.research.google.com/drive/1WilnQwzDq45zjpJmgdjoyU5wTVAgJqvd#scrollTo=pyd0BcMaPxkJ

Scamander answered 22/1 at 11:21

Comments (3)
The Colab kernel needs a running webdriver and an installed browser. These are no longer there by default, but you can fix it: #51046954 - Gift
Hi there, many thanks for the heads-up. I have tried to fix it but have had no luck so far. Can you give a hint on how exactly to fix it, with the additional install of Selenium? My Colab is here: colab.research.google.com/drive/… (I shared the link). I would be more than happy if you could help me here. - Scamander
Many thanks, @MortemB, for the heads-up. I am trying to get Selenium up and running on Colab. I look forward to hearing from you. Have a great day. ;) - Scamander
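
For reference, the "DevToolsActivePort file doesn't exist" message in the traceback is commonly seen when Chrome is launched as root inside a container without `--no-sandbox`, or without a working headless mode. In recent Selenium 4 releases the `Options.headless` property was deprecated and later removed, so `options.headless = True` may have no effect. The following is a minimal sketch under those assumptions, not a confirmed fix: the names `COLAB_CHROME_FLAGS` and `build_colab_driver` are hypothetical, and it assumes Selenium can already locate a Chrome binary that starts on the Colab VM.

```python
# Flags commonly needed when Chrome runs as root in a container such as Colab.
COLAB_CHROME_FLAGS = [
    "--headless=new",           # headless mode passed as a Chrome argument
    "--no-sandbox",             # Chrome refuses to start as root with the sandbox on
    "--disable-dev-shm-usage",  # avoid crashes caused by a small /dev/shm
]


def build_colab_driver(flags=COLAB_CHROME_FLAGS):
    """Create a Chrome WebDriver with the given command-line flags."""
    # Imported lazily so the flag list can be inspected without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for flag in flags:
        options.add_argument(flag)
    return webdriver.Chrome(options=options)
```

In the notebook this would replace the three `options` / `webdriver.Chrome` lines: call `driver = build_colab_driver()`, then `driver.get(url)` as before. If Chrome still exits immediately, the browser binary may be missing system libraries on the VM, which is what the first comment points to.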
