Scrape HTML tables from a given URL into CSV

I seek a tool that can be run on the command line like so:

tablescrape 'http://someURL.foo.com' [n]

If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.

Potential additional features:

  • To be really fancy you could parse a table within a table, but for my purposes -- fetching data from wikipedia pages and the like -- that's overkill.
  • An option to asciify any unicode.
  • An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table.

What would you use to cobble something like this together? The Perl module HTML::TableExtract might be a good place to start and can even handle the case of nested tables. This might also be a pretty short Python script with BeautifulSoup. Would YQL be a good starting point? Or, ideally, have you written something similar and have a pointer to it? (I'm surely not the first person to need this.)
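
For concreteness, here's a rough sketch of what the BeautifulSoup version might look like (modern Python 3; untested, and the summary format and edge-case handling are guesses at the spec above, not a finished tool):

#!/usr/bin/env python3
# rough sketch of the tool described above (Python 3 + BeautifulSoup 4);
# nested tables, encodings, and error handling are all glossed over
import csv
import sys
import urllib.request

from bs4 import BeautifulSoup  # pip install beautifulsoup4

url = sys.argv[1]
n = int(sys.argv[2]) if len(sys.argv) > 2 else None

html = urllib.request.urlopen(url).read()
tables = BeautifulSoup(html, "html.parser").find_all("table")

if n is None and len(tables) > 1:
    # no table chosen: summarize each one (header row, total rows)
    for i, table in enumerate(tables, 1):
        rows = table.find_all("tr")
        cells = rows[0].find_all(["th", "td"]) if rows else []
        header = " | ".join(c.get_text(" ", strip=True) for c in cells)
        print(f"{i}: [{len(cells):3d} cols, {len(rows):3d} rows] {header[:60]}")
else:
    # a table chosen (or only one present): dump it to stdout as CSV
    table = tables[(n or 1) - 1]
    writer = csv.writer(sys.stdout)
    for row in table.find_all("tr"):
        writer.writerow(c.get_text(" ", strip=True)
                        for c in row.find_all(["th", "td"]))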

Resultant answered 9/4, 2010 at 22:40 Comment(5)
Sorry @dreeves, but Stack Overflow isn't a "free programmers for hire" service. – Enchain
But where's the Question? – Enchain
Edit the question if you would like to reformulate it. – Enchain
(done; thanks. note that i ask this in case a tool like this exists. and in case not, i'm writing it myself and will share it here) – Resultant
If you already have jq and pup installed (both useful tools in their own right), then see this answer for a one-liner that will very likely be what you're looking for. If you want that exact syntax, wrap it in a shell script that passes the first argument to an :nth-of-type() CSS selector. – Haas

This is my first attempt:

http://yootles.com/outbox/tablescrape.py

It needs a bit more work, like better asciifying (more on that at the end of this answer), but it's usable. For example, if you point it at this list of Olympic records:

./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics

it tells you that there are 8 tables available and it's clear that the 2nd and 3rd ones (men's and women's records) are the ones you want:

1: [  1 cols,   1 rows] Contents 1 Men's rec
2: [  7 cols,  25 rows] Event | Record | Name | Nation | Games | Date | Ref
3: [  7 cols,  24 rows] Event | Record | Name | Nation | Games | Date | Ref
[...]

Then if you run it again, asking for the 2nd table,

./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics 2

you get a reasonable plaintext data table:

100 metres | 9.69 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 16, 2008 | [ 8 ]
200 metres | 19.30 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 20, 2008 | [ 8 ]
400 metres | 43.49 | Michael Johnson | United States (USA) | 1996 Atlanta | July 29, 1996 | [ 9 ]
800 metres | 1:42.58 | Vebjørn Rodal | Norway (NOR) | 1996 Atlanta | July 31, 1996 | [ 10 ]
1,500 metres | 3:32.07 | Noah Ngeny | Kenya (KEN) | 2000 Sydney | September 29, 2000 | [ 11 ]
5,000 metres | 12:57.82 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 23, 2008 | [ 12 ]
10,000 metres | 27:01.17 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 17, 2008 | [ 13 ]
Marathon | 2:06:32 | Samuel Wanjiru | Kenya (KEN) | 2008 Beijing | August 24, 2008 | [ 14 ]
[...]
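
Aside, on the "better asciifying" mentioned above: a minimal version using only the standard library might look like the sketch below. Note it simply drops characters like the ø in "Vebjørn" that have no ASCII decomposition, which is exactly why it needs more work.

# minimal asciify sketch: NFKD splits accented letters into a base
# letter plus combining marks, then the ASCII encode drops the leftovers
import unicodedata

def asciify(s):
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

print(asciify("Marie-José Pérec"))  # -> Marie-Jose Perec
print(asciify("Vebjørn Rodal"))     # -> Vebjrn Rodal  (the ø is dropped)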
Resultant answered 12/4, 2010 at 15:50 Comment(0)

Using TestPlan, I produced a rough script. Given the complexity of web tables, it will likely need to be tailored for each site.

This first script lists the tables on the page:

# A simple table scraping example. It lists the tables on a page
#
# Cmds.Site = the URL to scan
default %Cmds.Site% http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
GotoURL %Cmds.Site%

set %Count% 1
foreach %Table% in (response //table)
    Notice Table #%Count%
    # find a suitable name, look back for a header
    set %Check% ./preceding::*[name()='h1' or name()='h2' or name()='h3'][1]
    if checkIn %Table% %Check%
        Notice (selectIn %Table% %Check%)
    end

    set %Count% as binOp %Count% + 1
end
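
(TestPlan itself has since disappeared -- see the comments below -- so for the record, a rough equivalent of this listing script in Python with lxml might look like the following; the XPath expressions mirror the ones above:)

# rough Python/lxml equivalent of the listing script above: number each
# table, then look back for the nearest preceding h1/h2/h3 as its "name"
import sys
from lxml import html  # pip install lxml

tree = html.parse(sys.argv[1])  # lxml accepts a URL or a local file here
for count, table in enumerate(tree.xpath("//table"), 1):
    print(f"Table #{count}")
    heading = table.xpath(
        "./preceding::*[name()='h1' or name()='h2' or name()='h3'][1]")
    if heading:
        print("   ", heading[0].text_content().strip())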

The second script then extracts the data of one table into a CSV file.

# Generic extract of contents of a table in a webpage
# Use list_tables to get the list of tables and their indexes
#
# Cmds.Site = the URL to scan
# Cmds.Index = Table index to scan
default %Cmds.Site% http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
default %Cmds.Index% 2

GotoURL %Cmds.Site%

set %Headers% //table[%Cmds.Index%]/tbody/tr[1]
set %Rows% //table[%Cmds.Index%]/tbody/tr[position()>1]

# Get and clean up the header fields
set %Fields% withvector
end
foreach %Header% in (response %Headers%/*)
    putin %Fields% (trim %Header%)
end
Notice %Fields%

# Create an output CSV
call unit.file.CreateDataFile with
    %Name% %This:Dir%/extract_table.csv
    %Format% csv
    %Fields% %Fields%
end
set %DataFile% %Return:Value%

# Now extract each row
foreach %Row% in (response %Rows%)
    set %Record% withvector
    end
    foreach %Cell% in (selectIn %Row% ./td)
        putin %Record% (trim %Cell%)
    end

    call unit.file.WriteDataFile with
        %DataFile% %DataFile%
        %Record% %Record%
    end
end

call unit.file.CloseDataFile with
    %DataFile% %DataFile%
end

My CSV file looks like the one below. Note that Wikipedia embeds extra information in each cell. There are many ways to get rid of it, but not in a generic fashion.

Shot put,22.47 m,"Timmermann, UlfUlf Timmermann",East Germany (GDR),1988 1988 Seoul,"01988-09-23 September 23, 1988",[25]
Discus throw,69.89 m,"Alekna, VirgilijusVirgilijus Alekna",Lithuania (LTU),2004 2004 Athens,"02004-08-23 August 23, 2004",[26]
Hammer throw,84.80 m,"Litvinov, SergeySergey Litvinov",Soviet Union (URS),1988 1988 Seoul,"01988-09-26 September 26, 1988",[27]
Javelin throw,90.57 m,"Thorkildsen, AndreasAndreas Thorkildsen",Norway (NOR),2008 2008 Beijing,"02008-08-23 August 23, 2008",[28]
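
(For what it's worth, that duplicated text comes from hidden sort-key elements Wikipedia puts inside sortable table cells. If you control the parsing step, one reasonably generic fix is to delete hidden elements before reading the cell text. A sketch with BeautifulSoup, assuming the sort keys are tagged with class="sortkey" or an inline display:none style, which matches how these tables were marked up:)

# sketch: strip Wikipedia's hidden sort keys before extracting cell text
# (assumes they carry class="sortkey" or are hidden via display:none)
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def cell_text(cell):
    for hidden in cell.find_all(class_="sortkey"):
        hidden.decompose()  # delete the element from the parse tree
    for hidden in cell.find_all(style=lambda s: s and "display:none" in s.replace(" ", "")):
        hidden.decompose()
    return cell.get_text(" ", strip=True)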
Siegbahn answered 23/4, 2010 at 10:47 Comment(2)
Thanks! This looks nice. In the version I wrote I ended up putting in special cases to remove what you're calling wikipedia's extra information. – Resultant
RIP TestPlan, I guess. The last Wayback Machine capture is from 2013. – Haas

Using jq and pup, and a tip of the hat to this SO answer:

#!/bin/bash
# tablescrape - convert nth HTML table on a page to CSV or tab-delimited
# author: https://stackoverflow.com/users/785213
# source: https://stackoverflow.com/a/77031218
set -u

input=${1:?"Expected a file, URL, or '-' as the first argument."}
nth=${2:-1}
mode=${3:-csv}

(
    if [[ -r $input || $input == - ]]; then
        cat "$input"
    else
        # '--location' means "follow redirects"
        curl --silent --show-error --location "$input"
    fi
) \
  | pup "table.wikitable:nth-of-type($nth) tr json{}" \
  | jq --raw-output '.[]
      | [
          .children[]                            # all .children of <tr>s
            | select(.tag=="td" or .tag=="th")   # that are <td>s or <th>s
            | [ .. | .text? ]                    # recurse, looking for .text
            | map(select(.))                     # filter out empty nodes
            | join(" ")                          # concatenate .texts together
        ]                                        # all <td>s/<th>s as an array
      | @'$mode                                  # join together as CSV or TSV

Both jq and pup are super-useful on their own. It seemed like one of those tools (or else xidel) should be able to extract HTML tables directly to a delimited text file, but I guess it isn't so. Fortunately, pipes, man. They're so good!

Usage

RECORDS='https://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics'

# read from a URL
./tablescrape $RECORDS 2

# read from a pipe or redirection
curl -sS $RECORDS | ./tablescrape - 1 tsv

curl -sS $RECORDS > records.html
< records.html ./tablescrape - 1 tsv

# read from a file
./tablescrape records.html 1 tsv

Update: Turns out Xidel (0.9.8) can do it. Compliant CSV would be tricky (escaping delimiters and quoting quotes, oh my), but tab-delimited is pretty straightforward, and could be converted by another tool like Miller or LibreOffice Calc. An advantage of the tab-delimited format is that many other Unix text-processing tools already understand it (cut -d $'\t', sort -t $'\t', awk -F '\t'), and in a pinch, you can write a nearly-foolproof parser yourself, e.g., in shell script (see the sketch at the end of this answer).

# NB: uses Bash's "ANSI-C quoting" feature for a literal tab character
xidel -s https://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics \
  -e $'//table[2]//tr/string-join(td|th,"\t")' \
  | column -t -s$'\t'

Output:

Event                 Record      Athlete(s)                                                   Nation                Games                Date                Ref(s)
100 metres            10.61       Elaine Thompson Herah                                         Jamaica (JAM)        2020 Tokyo           July 31, 2021       [32]
200 metres            ♦21.34      Florence Griffith Joyner                                      United States (USA)  1988 Seoul           September 29, 1988  [33][34]
400 metres            48.25       Marie-José Pérec                                              France (FRA)         1996 Atlanta         July 29, 1996       [35]
⋮
3,000 m steeplechase  8:58.81     Gulnara Galkina-Samitova                                      Russia (RUS)         2008 Beijing         August 17, 2008     [41]
4×100 m relay         ♦40.82      Tianna MadisonAllyson FelixBianca KnightCarmelita Jeter       United States (USA)  2012 London          August 10, 2012     [42]
⋮

That smooshes the names together in the relay events, and there are some pesky leading spaces on the "Nation" column, but it gets you pretty close.
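
As for hand-rolling that "nearly-foolproof" parser yourself, a minimal sketch in Bash is below (records.tsv stands in for wherever you saved the tool's tab-delimited output):

# minimal TSV reader in Bash: IFS=$'\t' splits fields on tabs only, and
# read -r keeps backslashes literal; extra columns all land in "$rest"
while IFS=$'\t' read -r event record athlete rest; do
    printf '%s set %s in the %s\n' "$athlete" "$record" "$event"
done < records.tsv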

Haas answered 3/9, 2023 at 6:54 Comment(3)
Doesn't work for me, but maybe that's because my column is too old. This can all be done with xidel, however: removing the no-break spaces and excessive whitespace, string-joining all items to create "|"-separated values, and even whitespace-aligned columns. See this Gist. – Virginavirginal
@Virginavirginal It's impressive to see what Xidel is capable of, and I struggle with it without good examples, so thanks for sharing that Gist. The fact remains, though, that I'm going to remember column -t, but not the Xidel syntax. ;) – Haas
The column -t -s$'\t' is using a Bashism (ANSI-C quoting), which could be why you had trouble? It's just for better visual appearance in the answer anyway, and can be dropped in practice, or you can pipe into something else like mlr --itsv --opprint --barred cat or xsv table, if you happen to have those. – Haas
