Parse HTML and Read HTML Table with Selenium Python
I am converting some of my web-scraping code from R to Python (I can't get geckodriver to work with R, but it's working with Python). Anyway, I am trying to understand how to parse and read HTML tables with Python. As quick background, here is my code for R:

doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")

WebElem <- readHTMLTable(doc, stringsAsFactors = FALSE)[[7]]

I would parse the HTML page to the doc object. Then I would start with doc[[1]], and move through higher numbers until I saw the data I wanted. In this case I got to doc[[7]] and saw the data I wanted. I then would read that HTML table and assign it to the WebElem object. Eventually I would turn this into a dataframe and play with it.

So what I am doing in Python is this:

from bs4 import BeautifulSoup

html = driver.page_source
doc = BeautifulSoup(html, "html.parser")

Then I started to play with doc.get_text but I don't really know how to get just the data I want to see. The data I want to see is like a 10x10 matrix. When I used R, I would just use doc[[7]] and that matrix would almost be in a perfect structure for me to convert it to a dataframe. However, I just can't seem to do that with Python. Any advice would be much appreciated.
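To show what reading a table row by row amounts to without any third-party library, here is a minimal standard-library sketch (the sample HTML is invented, mirroring the kind of 10x10 matrix described above) that collects each tr into a list of cell strings, the kind of structure a dataframe could be built from:

```python
# A minimal sketch using only the standard library; the sample HTML below
# is made up for illustration.
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # only record text that appears inside a cell
        if self._in_cell:
            self._row.append(data.strip())

html = """
<table>
  <tr><th>BREED</th><th>2015</th></tr>
  <tr><td>Retrievers (Labrador)</td><td>1</td></tr>
</table>
"""
parser = TableParser()
parser.feed(html)
print(parser.rows)  # [['BREED', '2015'], ['Retrievers (Labrador)', '1']]
```

In practice you would feed it driver.page_source instead of a literal string; the resulting list of rows can be handed straight to a dataframe constructor.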

UPDATE:

I have been able to get the data I want using Python--I followed this blog for creating a dataframe with python: Python Web-Scraping. Here is the website that we are scraping in that blog: Most Popular Dog Breeds. In that blog post, you have to work your way through the elements, create a dict, loop through each row of the table and store the data in each column, and then you are able to create a dataframe.

With R, the only code I had to write was:

doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")

df <- as.data.frame(readHTMLTable(doc, stringsAsFactors = FALSE))

With just that, I have a pretty nice dataframe where I only need to adjust the column names and data types. It looks like this with just that code:

                  NULL.V1 NULL.V2 NULL.V3 NULL.V4
1                   BREED    2015    2014    2013
2   Retrievers (Labrador)       1       1       1
3    German Shepherd Dogs       2       2       2
4     Retrievers (Golden)       3       3       3
5                Bulldogs       4       4       5
6                 Beagles       5       5       4
7         French Bulldogs       6       9      11
8      Yorkshire Terriers       7       6       6
9                 Poodles       8       7       8
10            Rottweilers       9      10       9

Is there not something available in Python to make this a bit simpler, or is this just simpler in R because R is more built for dataframes (at least that's how it seems to me, but I could be wrong)?

Luminary answered 19/12, 2016 at 1:33 Comment(7)
Most important advice: always add the URL to your data. Every page is different and we have to see the HTML to give any advice. – Tulatulip
Hi @furas, I would have added it but it's a private URL. I know this makes it difficult. Would it be helpful for me to create a similar matrix in my post? – Luminary
I'll look for something similar on a public site and update my post tonight, thanks @Tulatulip – Luminary
I haven't been able to do any comparisons to R because I can't get RSelenium to work now. Basically, what I have done to get the data I want is parse the column headers into a dict with blank values and then append the values with another parse, then save it as a dataframe. It seems like with R I was able to just reference an HTML table location like I explained above and it was almost already in a dataframe format. I'll leave this question open and clarify/answer it when I can get RSelenium to work again, when there is an update to RSelenium. – Luminary
As I said before: add some example data/HTML to the question (it doesn't have to be a link, just simple HTML/text) which you want to parse. R and pandas are not identical, so this may need a different solution, and every page is different, so every page/example may need a different solution. It doesn't matter how you do it in R; what matters most is the data you have, and we have to see it. – Tulatulip
I came across this blog which helped me out a bit: link. The HTML referenced at: link is similar to the HTML code I'm looking at. I followed the steps of the blog to create a dataframe. This is where I wish I had RSelenium running so that I could show the comparison. – Luminary
@furas, I hope my updated question is helpful. If not, let me know what else I could do to make it better. Thanks. – Luminary

OK, after some hefty digging around, I feel like I came to a good solution matching that of R. If you are looking at the HTML provided in the link above, Dog Breeds, and you have the web driver running for that link, you can run the following code:

import pandas as pd

tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')

df = pd.read_html(tbl)

Then you are looking at a pretty nice dataframe after only a couple of lines of code:

In [145]: df
Out[145]:
[                       0     1     2       3
 0                  BREED  2015  2014  2013.0
 1  Retrievers (Labrador)     1     1     1.0
 2   German Shepherd Dogs     2     2     2.0
 3    Retrievers (Golden)     3     3     3.0
 4               Bulldogs     4     4     5.0
 5                Beagles     5     5     4.0

I feel like this is much easier than working through the tags, creating a dict, and looping through each row of data as the blog suggests. It might not be the most correct way of doing things (I'm new to Python), but it gets the job done quickly. I hope this helps out some fellow web-scrapers.
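For readers on Selenium 4+ and a recent pandas, two details have since changed: find_element_by_xpath was removed in favor of find_element(By.XPATH, ...), and read_html now expects literal HTML wrapped in StringIO. A hedged sketch with the table HTML inlined (invented sample rows echoing the dog-breeds table) so it runs without a browser:

```python
# Sketch assuming Selenium 4+ and a recent pandas. With a live driver you
# would fetch the table HTML like this instead:
#     from selenium.webdriver.common.by import By
#     tbl = driver.find_element(By.XPATH, "//table").get_attribute("outerHTML")
from io import StringIO

import pandas as pd

tbl = """
<table>
  <tr><th>BREED</th><th>2015</th><th>2014</th></tr>
  <tr><td>Retrievers (Labrador)</td><td>1</td><td>1</td></tr>
  <tr><td>German Shepherd Dogs</td><td>2</td><td>2</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> in the markup;
# wrapping the string in StringIO avoids the FutureWarning on newer pandas
df = pd.read_html(StringIO(tbl))[0]
print(df.shape)  # (2, 3)
```

Indexing with [0] picks out the single table, so df here is a DataFrame rather than the one-element list shown in the Out[145] output above.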

Luminary answered 29/12, 2016 at 20:12 Comment(2)
import pandas as pd – Luminary
If you're getting FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object, then you need from io import StringIO; df = pd.read_html(StringIO(tbl)). – Synge
import pandas as pd

tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')
df  = pd.read_html(tbl)

It worked pretty well.

Encephalitis answered 22/10, 2019 at 21:7 Comment(0)

First, read Selenium with Python; you will get a basic idea of how Selenium works with Python.

Then, if you want to locate an element in Python, there are two ways:

  1. Use the Selenium API; you can refer to Locating Elements
  2. Use BeautifulSoup; there is a nice document you can read: BeautifulSoup Documentation
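As a small illustration of the second option, a sketch assuming BeautifulSoup is installed (the one-row table is invented; the Selenium lines are left as comments since they need a live driver):

```python
# Hedged sketch of the two approaches named above.
from bs4 import BeautifulSoup

# 1. Selenium API (modern Selenium 4 spelling):
#     from selenium.webdriver.common.by import By
#     table = driver.find_element(By.TAG_NAME, "table")

# 2. BeautifulSoup on the page source:
html = "<table><tr><td>Beagles</td><td>5</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)  # ['Beagles', '5']
```

With a live driver you would pass driver.page_source to BeautifulSoup instead of the literal string.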
Insanitary answered 19/12, 2016 at 1:53 Comment(0)
