How to extract tables from websites in Python
On this page:

http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500

there is a table. My goal is to extract the table and save it to a CSV file. I wrote this code:

import urllib
import os

web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")

s = web.read()
web.close()

ff = open(r"D:\ex\python_ex\urllib\output.txt", "w")
ff.write(s)
ff.close()

I'm lost from here. Can anyone help with this? Thanks!

Cheng answered 11/5, 2012 at 17:33 Comment(0)
So essentially you want to parse an HTML file to extract elements from it. You can use BeautifulSoup or lxml for this task.

You already have solutions using BeautifulSoup. I'll post a solution using lxml:

from lxml import etree
import urllib.request

web = urllib.request.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()

html = etree.HTML(s)

## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')

## The 'th' header cells are inside the first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]

## Get the text from all remaining 'tr' rows
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
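
To produce the CSV file the question asks for, you can then write the extracted header and rows out with the standard csv module. A minimal sketch (the output filename is just an example):

import csv

# Write the header row followed by the data rows extracted above.
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(td_content)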
Lacedaemon answered 11/5, 2012 at 17:41 Comment(1)
I'm trying to follow your code to extract a table from this website, but on the line tr_nodes = html.xpath('//table[@id="DataTables_Table_0"]/tr') (a modification based on that page's content), I get empty tr_nodes. Any idea what the problem could be?Corona
Pandas can do this right out of the box, saving you from having to parse the HTML yourself. read_html() extracts all tables from your HTML and puts them in a list of DataFrames. to_csv() can be used to write each DataFrame to a CSV file. For the web page in your example, the relevant table is the last one, which is why I used df_list[-1] in the code below.

import requests
import pandas as pd

url = 'http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_csv('my data.csv')

It's simple enough to do in one line, if you prefer:

pd.read_html(requests.get(<url>).content)[-1].to_csv(<csv file>)
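
If selecting the table by position feels brittle, read_html also accepts a match argument that keeps only tables whose text matches a given string or regex. A small sketch (the header text below is only a guess at what appears in this page's table):

# Keep only tables containing the given text; [0] takes the first match.
df = pd.read_html(html, match='Income Level')[0]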

P.S. Just make sure you have the lxml, html5lib, and BeautifulSoup4 packages installed in advance.

Cohesive answered 12/6, 2017 at 18:36 Comment(5)
One of the best solutions available on the internet. Cheers man.Cesura
Excellent solution! Note to others that pandas requires additional dependencies to accomplish this: beautifulsoup4, html5lib, and lxmlPeneus
Missing parentheses!! Amended as print(df)Mazel
I tried your code snippet on this website, but I get a ValueError: No tables found error. Any idea how to solve it?Corona
@Foad If you look at that page's source files, you will see that the table you are after is loaded dynamically, which is why it is missing from requests.get(url).content. You will need to look into web-scraping methods that support running scripts first. Alternatively, you can save the webpage to a file beforehand and then do pd.read_html(<file path>).Cohesive
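
To illustrate that suggestion: a minimal sketch using Selenium to let a real browser run the page's scripts before handing the rendered HTML to pandas (assuming Selenium and a Chrome driver are installed; the URL is a placeholder):

import time
import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://example.com/page-with-dynamic-table')
time.sleep(5)  # crude wait for the scripts to run; WebDriverWait is more robust
df = pd.read_html(driver.page_source)[0]
driver.quit()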
I would recommend BeautifulSoup as it has the most functionality. I modified a table parser that I found online which can extract all tables from a webpage, as long as there are no nested tables. Some of the code is specific to the problem I was trying to solve, but it should be pretty easy to modify for your usage. Here is the pastebin link:

http://pastebin.com/RPNbtX8Q

You could use it as follows:

import csv
from urllib2 import Request, urlopen, URLError
from TableParser import TableParser

url_addr = 'http://foo/bar'
req = Request(url_addr)
url = urlopen(req)
tp = TableParser()
tp.feed(url.read())

# NOTE: Here you need to know exactly how many tables are on the page and which
# one you want. Let's say it's the first table.
my_table = tp.get_tables()[0]

filename = 'table_as_csv.csv'
with open(filename, 'wb') as f:  # 'wb' is correct for the csv module on Python 2
    writer = csv.writer(f)
    for row in my_table:
        writer.writerow(row)

The code above is an outline, but if you use the table parser from the pastebin link you should be able to get where you want to go.

Amara answered 11/5, 2012 at 18:56 Comment(0)
You need to parse the table into an internal data structure and then output it in CSV form.

Use BeautifulSoup to parse the table. This question shows how to do that (the accepted answer uses version 3.0.8, which is out of date by now, but you can still use it, or convert the instructions to work with BeautifulSoup version 4).

Once you have the table in a data structure (probably a list of lists in this case), you can write it out with csv.writer, as in the sketch below.
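
A minimal sketch of that pipeline, assuming BeautifulSoup 4 and reusing the table id from the page in the question:

import csv
import requests
from bs4 import BeautifulSoup

url = 'http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
table = soup.find('table', id='Report1_dgReportDemographic')

# Build a list of lists: one inner list of cell texts per table row.
rows = [[cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')]

with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)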

Hau answered 11/5, 2012 at 17:42 Comment(0)
Look at the BeautifulSoup module. In its documentation you will find many examples of parsing HTML.

Also, for CSV output you have a ready-made solution: the csv module.

It should be quite easy.

Moribund answered 11/5, 2012 at 17:42 Comment(0)
Look at this answer on parsing a table with BeautifulSoup and writing it to a text file. Also try searching Google for "python beautifulsoup".

Cess answered 11/5, 2012 at 17:42 Comment(0)
