How to parse a HTML table with Nokogiri?

I'm trying to parse a table, but I don't know how to save the data from it. I want the data from each row saved so that it looks like:

['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]

The sample table is:

html = <<EOT
    <table class="open">
        <tr>
            <th>Table name</th>
            <th>Column name 1</th>
            <th>Column name 2</th>
            <th>Column name 3</th>
            <th>Column name 4</th>
            <th>Column name 5</th>
        </tr>
        <tr>
            <th>Raw name 1</th>
            <td>2,094</td>
            <td>0,017</td>
            <td>0,098</td>
            <td>0,113</td>
            <td>0,452</td>         
        </tr>
        .
        .
        .
        <tr>
            <th>Raw name 5</th>
            <td>2,094</td>
            <td>0,017</td>
            <td>0,098</td>
            <td>0,113</td>
            <td>0,452</td>         
        </tr>
    </table>
EOT

My scraper's code is:

  doc = Nokogiri::HTML(open(html), nil, 'UTF-8')
  tables = doc.css('div.open')

  @tablesArray = []

  tables.each do |table|
    title = table.css('tr[1] > th').text
    cell_data = table.css('tr > td').text
    raw_name = table.css('tr > th').text
    @tablesArray << Table.new(cell_data, raw_name)
  end

  render template: 'scrape_krasecology'
  end
  end

When I try to display the data in the HTML page, all the column names end up stored in a single array element, and all the cell data ends up the same way.

Oneself answered 14/1, 2016 at 4:9 Comment(5)
Please reduce your code to the bare minimum necessary to demonstrate the problem. Also supply, in the question itself, a minimal example of the HTML that demonstrates the problem. Don't ask us to go to the page to extract the HTML or to build the surrounding code needed to test yours. Read "How to Ask", "minimal reproducible example" and codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question - Brig
@the-tin-man Thanks. I've updated my code. I believe it looks much better now? - Oneself
General information for people looking into this subject: ruby.bastardsbook.com/chapters/web-crawling - Chelicera
While it is more readable, it's still not testable, or even runnable. That's the point of the links mentioned above; we need to be able to test your code to duplicate the problem. We can remove some of your code and make it runnable, but we shouldn't have to. Your HTML doesn't have any divs, but your code shows you're trying to find them. What is Table? Why do you have render template and the two terminating ends? We have to remove that stuff to test. Show a minimal sample of the data you're returning, minus the use of the custom class. - Brig
Also, your desired output format isn't likely to give you the results you want. Ever. Paste it into IRb and look at what Ruby thinks it means. Programming is a very exacting science; you have to describe it (ask questions) in equally exacting terms. - Brig
M
25

The crux of the problem is that calling #text on a set of several results returns the #text of each individual element concatenated into one string.

Let's examine what each step does:

# Finds all <table>s with class open
# I'm assuming you have only one <table> so
#  you don't actually have to loop through
#  all tables, instead you can just operate
#  on the first one. If that is not the case,
#  you can use a loop the way you did
tables = doc.css('table.open')

# The text of all <th>s in the first <tr> of the table
title = table.css('tr[1] > th').text

# The text of all <td>s in all <tr>s in the table
# You obviously wanted just the <td>s in one <tr>
cell_data = table.css('tr > td').text

# The text of all <th>s in all <tr>s in the table
# You obviously wanted just the <th>s in one <tr>
raw_name = table.css('tr > th').text

Now that we know what is wrong, here is a possible solution:

html = <<EOT
    <table class="open">
        <tr>
            <th>Table name</th>
            <th>Column name 1</th>
            <th>Column name 2</th>
            <th>Column name 3</th>
            <th>Column name 4</th>
            <th>Column name 5</th>
        </tr>
        <tr>
            <th>Raw name 1</th>
            <td>1001</td>
            <td>1002</td>
            <td>1003</td>
            <td>1004</td>
            <td>1005</td>         
        </tr>
        <tr>
            <th>Raw name 2</th>
            <td>2001</td>
            <td>2002</td>
            <td>2003</td>
            <td>2004</td>
            <td>2005</td>         
        </tr>
        <tr>
            <th>Raw name 3</th>
            <td>3001</td>
            <td>3002</td>
            <td>3003</td>
            <td>3004</td>
            <td>3005</td>         
        </tr>
    </table>
EOT

doc = Nokogiri::HTML(html, nil, 'UTF-8')

# Fetches only the first <table>. If you have
#  more than one, you can loop the way you
#  originally did.
table = doc.css('table.open').first

# Fetches all rows (<tr>s)
rows = table.css('tr')

# The column names are the first row (shift returns
#  the first element and removes it from the NodeSet).
# On that row we get the text of each individual <th>
# This will be Table name, Column name 1, Column name 2...
column_names = rows.shift.css('th').map(&:text)

# On each of the remaining rows
text_all_rows = rows.map do |row|

  # We get the name (<th>)
  # On the first row this will be Raw name 1
  #  on the second - Raw name 2, etc.
  row_name = row.css('th').text

  # We get the text of each individual value (<td>)
  # On the first row this will be 1001, 1002, 1003...
  #  on the second - 2001, 2002, 2003... etc
  row_values = row.css('td').map(&:text)

  # We map the name, followed by all the values
  [row_name, *row_values]
end

p column_names  # => ["Table name", "Column name 1", "Column name 2",
                #     "Column name 3", "Column name 4", "Column name 5"]
p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
                #     ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
                #     ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]

# If you want to combine them
text_all_rows.each do |row_as_text|
  p column_names.zip(row_as_text).to_h
end # =>
    # {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
    # {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
    # {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}
Malamud answered 17/1, 2016 at 9:16 Comment(2)
Instead of css(...).first use at_css(...) or one of its siblings. It's more readable and shorter. Also, don't get in the habit of using css('...').text. It can bite you badly. See https://mcmap.net/q/744692/-how-to-avoid-joining-all-text-from-nodes-when-scraping/128421 for more information. - Brig
Thank you for this. It really helped me. - Nightfall
B
2

Your desired output is not valid Ruby:

['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]
# ~> -:1: Invalid octal digit
# ~> ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452]

I'll assume you want quoted numbers.
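To see why Ruby rejects that line: a comma inside an array literal separates elements, and an integer literal with a leading zero is read as octal, so 094 and 098 contain invalid digits. A small sketch of those two rules (the variable names are only for illustration):

```ruby
# A leading zero makes an integer literal octal, so 017 is 1*8 + 7.
octal_value = 017

# Inside [], a comma separates elements: this is two integers, not "0,017".
pair = [0,017]

# Kernel#Integer applies the same prefix rule to strings, which lets us
# observe the failure at run time instead of as a syntax error.
error_class =
  begin
    Integer("094") # 9 is not an octal digit
  rescue ArgumentError => e
    e.class
  end

p octal_value # => 15
p pair        # => [0, 15]
p error_class # => ArgumentError
```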

After stripping the parts that keep the code from working and reducing the HTML to a more manageable example, this is what I ran:

require 'nokogiri'

html = <<EOT
    <table class="open">
        <tr>
            <th>Table name</th>
            <th>Column name 1</th>
            <th>Column name 2</th>
        </tr>
        <tr>
            <th>Raw name 1</th>
            <td>2,094</td>
            <td>0,017</td>
        </tr>
        <tr>
            <th>Raw name 5</th>
            <td>2,094</td>
            <td>0,017</td>
        </tr>
    </table>
EOT


doc = Nokogiri::HTML(html)
tables = doc.css('table.open')

tables_data = []

tables.each do |table|
  title = table.css('tr[1] > th').text # !> assigned but unused variable - title
  cell_data = table.css('tr > td').text
  raw_name = table.css('tr > th').text
  tables_data << [cell_data, raw_name]
end

Which results in:

tables_data
# => [["2,0940,0172,0940,017",
#      "Table nameColumn name 1Column name 2Raw name 1Raw name 5"]]

The first thing to notice is you're not using title though you assign to it. Possibly that happened when you were cleaning up your code as an example.

css, like search and xpath, returns a NodeSet, which is akin to an array of Nodes. When you use text or inner_text on a NodeSet it returns the text of each node concatenated into a single string:

Get the inner text of all contained Node objects.

This is its behavior:

require 'nokogiri'

doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')

doc.css('p').text # => "foobar"

Instead, you should iterate over each node found, and extract its text individually. This is covered many times here on SO:

doc.css('p').map{ |node| node.text } # => ["foo", "bar"]

That can be reduced to:

doc.css('p').map(&:text) # => ["foo", "bar"]

See "How to avoid joining all text from Nodes when scraping" also.

The docs say this about content, text and inner_text when used with a Node:

Returns the content for this Node.

Instead, you need to go after the individual node's text:

require 'nokogiri'

html = <<EOT
    <table class="open">
        <tr>
            <th>Table name</th>
            <th>Column name 1</th>
            <th>Column name 2</th>
            <th>Column name 3</th>
            <th>Column name 4</th>
            <th>Column name 5</th>
        </tr>
        <tr>
            <th>Raw name 1</th>
            <td>2,094</td>
            <td>0,017</td>
            <td>0,098</td>
            <td>0,113</td>
            <td>0,452</td>         
        </tr>
        <tr>
            <th>Raw name 5</th>
            <td>2,094</td>
            <td>0,017</td>
            <td>0,098</td>
            <td>0,113</td>
            <td>0,452</td>         
        </tr>
    </table>
EOT


tables_data = []

doc = Nokogiri::HTML(html)

doc.css('table.open').each do |table|

  # find all rows in the current table, then iterate over the second all the way to the final one...
  table.css('tr')[1..-1].each do |tr|

    # collect the cell data and raw names from the remaining rows' cells...
    raw_name = tr.at('th').text
    cell_data = tr.css('td').map(&:text)

    # aggregate it...
    tables_data += [raw_name, cell_data]
  end
end

Which now results in:

tables_data
# => ["Raw name 1",
#     ["2,094", "0,017", "0,098", "0,113", "0,452"],
#     "Raw name 5",
#     ["2,094", "0,017", "0,098", "0,113", "0,452"]]

You can figure out how to coerce the quoted numbers into decimals acceptable to Ruby, or manipulate the inner arrays however you want.
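One way to do that, as a rough sketch: treat the comma as a decimal separator, swap it for a dot, and call Float(). This assumes the values never use a comma as a thousands separator; `to_decimal` is a hypothetical helper name, and the sample data is copied from the output above.

```ruby
# Flat result from above: a name followed by its array of cell strings.
tables_data = [
  "Raw name 1", ["2,094", "0,017", "0,098", "0,113", "0,452"],
  "Raw name 5", ["2,094", "0,017", "0,098", "0,113", "0,452"]
]

# Hypothetical helper: assumes "," is the decimal separator.
# Float() raises ArgumentError on anything that is not a number.
def to_decimal(text)
  Float(text.strip.tr(",", "."))
end

# Pair each name with its cells, then build one array per row.
rows = tables_data.each_slice(2).map do |raw_name, cell_data|
  [raw_name, *cell_data.map { |cell| to_decimal(cell) }]
end

p rows.first # => ["Raw name 1", 2.094, 0.017, 0.098, 0.113, 0.452]
```

Using Float() rather than String#to_f is a deliberate choice here: to_f silently returns 0.0 for unparseable text, while Float() fails loudly.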

Brig answered 19/1, 2016 at 17:20 Comment(1)
Thanks a lot for the answer and explanation! The answer is very useful and helped me! - Oneself
H
0

I assume you borrowed some code from here or from another related reference (apologies if this is the wrong reference): http://quabr.com/34781600/ruby-nokogiri-parse-html-table.

However, if you want to capture all the rows, you can change the code as follows.

I hope this helps you solve your problem.

# html is a String in the question, so it is passed to the parser directly rather than through open
doc = Nokogiri::HTML(html, nil, 'UTF-8')

# We select 'table.open tr' because we want to capture every row of that specific table

@tablesArray = doc.css('table.open tr').reduce([]) do |array, row|
  # Collecting the text of each <th> and <td> gives one array per row, close to your illustrated
  # ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452], with every value as a String
  array << row.css('th, td').map(&:text)
end

render template: 'scrape_krasecology'
Haughty answered 24/1, 2016 at 3:25 Comment(0)
