Extracting and Constructing Tables from HTML Files using Julia

Asked 20/3, 2017 at 23:41 Answered 9/2, 2018 at 17:7

Here's a public link to an example html file. I would like to extract each set of CAN and yearly tax information (example highlighted in red in the image below) from the file and construct a dataframe that looks like the one below.

Target Fields

Example DataFrame

| Row |          CAN | Crtf_NoCrtf | Tax_Year | Land_Value | Improv_Value | Total_Value | Total_Tax |
|-----+--------------+-------------+----------+------------+--------------+-------------+-----------|
|   1 | 184750010210 | Yes         |     2016 |      16720 |       148330 |      165050 | 4432.24   |
|   2 | 184750010210 | Yes         |     2015 |      16720 |       128250 |      144970 | 3901.06   |
|   3 | 184750010210 | Yes         |     2014 |      16720 |       109740 |      126460 | 3412.63   |
|   4 | 184750010210 | Yes         |     2013 |      16720 |       111430 |      128150 | 3474.46   |
|   5 | 184750010210 | Yes         |     2012 |      16720 |        99340 |      116060 | 3146.17   |
|   6 | 184750010210 | Yes         |     2011 |      16720 |       102350 |      119070 | 3218.80   |
|   7 | 184750010210 | Yes         |     2010 |      16720 |       108440 |      125160 | 3369.97   |
|   8 | 184750010210 | Yes         |     2009 |      16720 |       113870 |      130590 | 3458.14   |
|   9 | 184750010210 | Yes         |     2008 |      16720 |       122390 |      139110 | 3629.85   |
|  10 | 184750010210 | Yes         |     2007 |      16720 |       112820 |      129540 | 3302.72   |
|  11 | 184750010210 | Yes         |     2006 |      12380 |       112760 |             | 3623.12   |
|  12 | 184750010210 | Yes         |     2005 |      19800 |       107400 |             | 3882.24   |

Additional Information

If it is not possible to insert the CAN to each row that is okay, I can export the CAN numbers separately and find a way to attach them to the dataframe containing the tax values. I have looked into using beautiful soup for python, but I am an absolute novice with python and the rest of the scripts I am writing are in Julia, so I would prefer to keep everything in one language.

Is there any way to achieve what I am trying to achieve? I have looked at Gumbo.jl but can not find any detailed documentation/tutorials.

Statutory answered 20/3, 2017 at 23:41 Comment(4)

The downvote you got is perhaps a little harsh, but I'm guessing it was given because the question lacks evidence that you've tried anything other than looking for some docs (and yes, whoever gave it should have left a comment explaining it). Maybe try a few things, and then post that code here, and you'll get a better response. – Cobalt 20/3, 2017 at 23:54

Thanks for taking the time @ColinTBowers, I will go through some code I have tried and see if I can find some snippets that are in the correct general direction. – Statutory 20/3, 2017 at 23:59

Check github.com/bicycle1885/EzXML.jl – Armentrout 23/3, 2017 at 13:36

EzXML looks quite a bit more mature, but if for some reason it doesn't meet you needs, Gumbo.jl by yours truly is another option to consider. – Glut 26/3, 2017 at 2:5

So Gumbo.jl will parse the HTML and give you a programatic representation of the structure of the HTML file (called a DOM - Document Object Model). This is typically a tree of html tags, which you can traverse and extract the data you need.

To make this easier, what you really want is a way to query the DOM, so that you can extract the data you need without having to traverse the entire tree yourself. The Cascadia.jl project does this for you. It is built on top of Gumbo, and uses CSS selectors as the query language.

So for your example, you could use something like the following to extract all the CAN fields:

julia> using Gumbo

julia> using Cascadia

julia> h=parsehtml(read("/Users/aviks/Download/z1.html", String))

julia> c = matchall(Selector("td:containsOwn(\"CAN:\") + td span"), h.root)
  13-element Array{Gumbo.HTMLNode,1}:
  Gumbo.HTMLElement{:span}:
  <span class="value">184750010210</span> 
   ...
#print all the CAN values
julia> for x in c
               println( x.children[1].text )
            end

  184750010210
  186170040070
  175630130020
  172640020290
  168330020230
  156340030160
  118210000020
  190490040500
  173480080430
  161160010050
  153510060090
  050493000250
  050470630910

Hopefully this gives you an idea of how to extract all the data you need.

Impassable answered 27/3, 2017 at 14:32 Comment(0)

The current answer is a bit out of date since the readall() function no longer exists. I'll update his answer below.

Here's a general breakdown of the package ecosystem for Julia (as of the time of writing this answer):

Requests.jl is used to download the HTML file itself (note that in avik's answer, he reads the HTML file from his local machine)
Cascadia.jl is required to search for CSS tags (e.g. the tag that you would find if you were to use Selector Gadget).
Gumbo.jl is required to parse the resulting HTML

The key thing to remember is that Gumbo stores objects in tree format as HTMLNodes or HTMLElements. So most objects have "parents" and "children." To get the data you need, it's simply a matter of filtering with the right selector (using Cascadia) and then going to the correct point in the Gumbo tree.

An updated version of avik's answer:

using Requests, Cascadia, Gumbo

# r = get(url) # Normally, you'd put a url here, but I couldn't find a way to grab it without having to download it and read it locally
# h = parsehtml(String(r.data)) # Then normally you'd execute this

# Instead, I'm going to read in the html file as a string and give it to Gumbo
h = parsehtml(readstring("z1.html"))

# Exploring with the various structure of Gumbo objects:
println(fieldnames(h.root))
println(fieldnames(h.root.children))
println(size(h.root.children))

# aviks code:
c = matchall(Selector("td:containsOwn(\"CAN:\") + td span"), h.root);
for x in c
    println( x.children[1].text )
end

This particular webpage is more difficult to scrape than most, since it doesn't have a great CSS structure.

There's some nice documentation on workflow on the Cascadia README, but I still had some questions after reading it. For anyone else (like me, yesterday) who comes to this page looking for guidance on web scraping in Julia, I've created a jupyter notebook with a simple example that will hopefully help you understand the workflow in greater detail.

Mohammadmohammed answered 9/2, 2018 at 17:7 Comment(3)

HI @Tyler R, if you have any questions on Cascadia, please raise an issue on the repository. Improvements to the documentation are much appreciated. – Impassable 23/2, 2018 at 11:51

Requests.jl is DEPRECATED in favor of HTTP.jl github.com/JuliaWeb/HTTP.jl – Ketti 19/6, 2019 at 15:8

Note from Cascadia.js's page: "The top level matching function name has changed from matchall in v0.6 to eachmatch in v0.7 and higher to reflect the change in Julia base." Could you please update the code provided? – Accepter 9/11, 2020 at 21:12

Target Fields

Example DataFrame

Additional Information

Recommended topics

Hot tags