Hpricot, Get all text from document

Asked 7/8, 2009 at 9:27 Answered 31/10, 2011 at 18:45

I have just started learning Ruby. Very cool language, liking it a lot.

I am using the very handy Hpricot HTML parser.

What I am looking to do is grab all the text from the page, excluding the HTML tags.

Example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
    <head>
        <title>Data Protection Checks</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
    <body>
        <div>
        This is what I want to grab.
        </div>
        <p>
        I also want to grab this text
        </p>
    </body>
</html>

Basically, I want to grab only the text, so that I end up with a string like this:

"This is what I want to grab. I also want to grab this text"

What would be the best method of doing this?

Cheers

Eef

Lineation answered 7/8, 2009 at 9:27 Comment(0)

You can do this using the XPath text() selector.

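require 'hpricot'
require 'open-uri'

# Fetch the page and parse it. On Ruby 3.0+ use URI.open here,
# because open-uri no longer patches Kernel#open.
doc  = open("http://stackoverflow.com/") { |f| Hpricot(f) }
text = (doc/"//*/text()") # array of text nodes
puts text.join("\n")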

However, this is a fairly expensive operation. A better solution might be available.
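
The text nodes returned by a query like the one above usually carry the indentation and newlines of the source HTML, so joining them with "\n" will not give the single clean string the question asks for. Below is a minimal sketch of one way to tidy them up in plain Ruby. The helper name join_text_nodes and the sample array are made up for illustration; the array is hand-written to stand in for something like (doc/"//*/text()").map { |t| t.to_s }, not actual Hpricot output.

```ruby
# Hypothetical helper (not part of Hpricot): turn raw text-node strings
# into one tidy, space-separated string.
# `nodes` is assumed to be an array of strings.
def join_text_nodes(nodes)
  nodes
    .map { |t| t.gsub(/\s+/, " ").strip } # collapse whitespace runs, trim the ends
    .reject(&:empty?)                      # drop nodes that were whitespace only
    .join(" ")
end

# Hand-written sample standing in for the text nodes of the example page body.
sample = [
  "\n        This is what I want to grab.\n        ",
  "\n        ",
  "\n        I also want to grab this text\n        "
]

puts join_text_nodes(sample)
```

On the real example page the text of the title element would likely show up as well, unless the query is restricted to the body.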

Moonstone answered 7/8, 2009 at 9:41 Comment(1)

@Eef, you may need to remove the JavaScript code before collecting the text array: (doc/"script").each {|js| js.inner_html=''} – Amphi 14/12, 2010 at 16:7

You might want to try inner_text.

Like this:

h = Hpricot("<html><body><a href='http://yoursite.com?utm=trackmeplease'>http://yoursite.com</a> is <strong>awesome</strong>")
puts h.inner_text
# prints: http://yoursite.com is awesome

Flaviaflavian answered 31/10, 2011 at 18:45 Comment(0)

@weppos: This will be a bit better:

text = doc/"//p|div/text()" # array of text values

Endlong answered 7/8, 2009 at 11:1 Comment(1)

yeah, but this assumes he only wants p and div. I think he wants everything. – Superannuation 7/8, 2009 at 11:4
