I am trying to get data from this page:
url = "https://clutch.co/il/it-services"
The website I am trying to scrape has some sort of anti-bot protection from Cloudflare or a similar service, so the scraper needs to use Selenium with a headless browser like Headless Chrome or PhantomJS. Selenium automates a real browser, which can navigate Cloudflare's anti-bot pages just like a human user.
Here's how I use Selenium to imitate real human browser interaction, but on Google Colab it does not work properly:
import pandas as pd
from bs4 import BeautifulSoup
from tabulate import tabulate
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
url = "https://clutch.co/il/it-services"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
# Your scraping logic goes here
company_names = soup.select(".directory-list div.provider-info--header .company_info a")
locations = soup.select(".locality")
company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]
data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.index += 1
print(tabulate(df, headers="keys", tablefmt="psql"))
df.to_csv("it_services_data.csv", index=False)
driver.quit()
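One thing I suspect matters for the Cloudflare check is giving the page time to finish the challenge before reading page_source. A minimal sketch of an explicit wait I might slot in right before the html = driver.page_source line above (the CSS selector is the same one from my code; the 20-second timeout is just my guess, not tested):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until the provider list is actually rendered before grabbing the HTML
wait = WebDriverWait(driver, 20)  # timeout in seconds is an assumption
wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, ".directory-list div.provider-info--header .company_info a")))
html = driver.page_source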
Question: can I use Selenium to imitate real human browser interaction on Google Colab too? How do I fix the issues I am facing?
See my results: https://pastebin.com/FpEDLNiA
SessionNotCreatedException Traceback (most recent call last)
<ipython-input-4-ffdb44a94ddd> in <cell line: 9>()
7 options = Options()
8 options.headless = True
----> 9 driver = webdriver.Chrome(options=options)
10
11 url = "https://clutch.co/il/it-services"
5 frames
/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
227 alert_text = value["alert"].get("text")
228 raise exception_class(message, screen, stacktrace, alert_text) # type: ignore[call-arg] # mypy is not smart enough here
--> 229 raise exception_class(message, screen, stacktrace)
SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
(session not created: DevToolsActivePort file doesn't exist)
(The process started from chrome location /root/.cache/selenium/chrome/linux64/120.0.6099.109/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x56d4ca1b8f83 <unknown>
#1 0x56d4c9e71cf7 <unknown>
#2 0x56d4c9ea960e <unknown>
#3 0x56d4c9ea626e <unknown>
#4 0x56d4c9ef680c <unknown>
#5 0x56d4c9eeae53 <unknown>
#6 0x56d4c9eb2dd4 <unknown>
#7 0x56d4c9eb41de <unknown>
#8 0x56d4ca17d531 <unknown>
#9 0x56d4ca181455 <unknown>
#10 0x56d4ca169f55 <unknown>
#11 0x56d4ca1820ef <unknown>
#12 0x56d4ca14d99f <unknown>
#13 0x56d4ca1a6008 <unknown>
#14 0x56d4ca1a61d7 <unknown>
#15 0x56d4ca1b8124 <unknown>
#16 0x79bb253feac3 <unknown>
BTW, see my Colab: https://colab.research.google.com/drive/1WilnQwzDq45zjpJmgdjoyU5wTVAgJqvd#scrollTo=pyd0BcMaPxkJ
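From what I've read, the "DevToolsActivePort file doesn't exist" error usually means Chrome cannot start inside Colab's container without the --no-sandbox and --disable-dev-shm-usage flags, and a matching browser/chromedriver pair has to be installed in the runtime first. A minimal sketch of what I'm planning to try, assuming chromium-browser and chromium-chromedriver are installed via apt in the Colab session (the binary and driver paths below are my assumption, not verified):

# Assumes these were run in a Colab cell beforehand:
#   !apt-get update
#   !apt-get install -y chromium-browser chromium-chromedriver
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("--headless=new")           # headless mode for the Colab VM
options.add_argument("--no-sandbox")             # Colab runs as root; Chrome won't start sandboxed
options.add_argument("--disable-dev-shm-usage")  # /dev/shm is small inside the container
options.binary_location = "/usr/bin/chromium-browser"  # assumed install path

service = Service("/usr/bin/chromedriver")       # assumed chromedriver path from the apt package
driver = webdriver.Chrome(service=service, options=options)

driver.get("https://clutch.co/il/it-services")
print(driver.title)  # quick check that the page actually loaded
driver.quit()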