Convert HTML to plain text (with inclusion of <br>s)
R

5

10

Is it possible to convert HTML to plain text with Nokogiri? I also want each <br /> tag to show up as a line break in the output.

For example, given this HTML:

<p>ala ma kota</p> <br /> <span>i kot to idiota </span>

I want this output:

ala ma kota
i kot to idiota

When I just call Nokogiri::HTML(my_html).text, the <br /> tag is dropped and everything ends up on one line:

ala ma kota i kot to idiota
Renaissance answered 13/4, 2012 at 16:30 Comment(0)
R
17

Instead of writing a complex regexp, I used Nokogiri.

Working solution (K.I.S.S.!):

require 'nokogiri'

def strip_html(str)
  document = Nokogiri::HTML.parse(str)
  # Replace every <br> with a newline before extracting the text
  document.css("br").each { |node| node.replace("\n") }
  document.text
end
Renaissance answered 16/4, 2012 at 12:48 Comment(0)
F
8

Nothing like this exists by default, but you can easily hack something together that comes close to the desired output:

require 'nokogiri'
def render_to_ascii(node)
  blocks = %w[p div address]                      # els to put newlines after
  swaps  = { "br"=>"\n", "hr"=>"\n#{'-'*70}\n" }  # content to swap out
  dup = node.dup                                  # don't munge the original

  # Get rid of superfluous whitespace in the source
  dup.xpath('.//text()').each{ |t| t.content=t.text.gsub(/\s+/,' ') }

  # Swap out the swaps
  dup.css(swaps.keys.join(',')).each{ |n| n.replace( swaps[n.name] ) }

  # Slap a couple newlines after each block level element
  dup.css(blocks.join(',')).each{ |n| n.after("\n\n") }

  # Return the modified text content
  dup.text
end

frag = Nokogiri::HTML.fragment "<p>It is the end of the world
  as         we
  know it<br>and <i>I</i> <strong>feel</strong>
  <a href='blah'>fine</a>.</p><div>Capische<hr>Buddy?</div>"

puts render_to_ascii(frag)
#=> It is the end of the world as we know it
#=> and I feel fine.
#=> 
#=> Capische
#=> ----------------------------------------------------------------------
#=> Buddy?
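
Because the function appends two newlines after every block element, nested blocks or a trailing block can leave runs of blank lines in the result. A small post-processing sketch in plain Ruby (tidy_blank_lines is a hypothetical name, not part of the code above) could clean that up:

```ruby
# Hypothetical helper: normalizes the whitespace of text produced by
# something like render_to_ascii. Uses only core String methods.
def tidy_blank_lines(text)
  text
    .gsub(/[ \t]+$/, '')     # strip trailing spaces/tabs on each line
    .gsub(/\n{3,}/, "\n\n")  # collapse runs of blank lines into one
    .strip                   # drop leading/trailing whitespace
end

p tidy_blank_lines("fine.\n\n\n\nCapische \nBuddy?\n\n")
# prints "fine.\n\nCapische\nBuddy?"
```

It would be called as tidy_blank_lines(render_to_ascii(frag)).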
Forecastle answered 13/4, 2012 at 17:8 Comment(0)
F
0

Try replacing the <br /> tags with newlines before parsing:

Nokogiri::HTML(my_html.gsub('<br />',"\n")).text
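
The gsub call above matches only the literal string <br />, so <br>, <br/> or <BR> would slip through. A slightly more tolerant sketch uses a regexp instead. The helper name br_to_newline is hypothetical, and the pattern still does not cover tags with attributes such as <br class="x">.

```ruby
# Hypothetical helper: replaces the common spellings of the <br> tag
# with a newline, using only String#gsub from core Ruby.
def br_to_newline(html)
  # <br, optional whitespace, optional slash, >; case-insensitive
  html.gsub(/<br\s*\/?>/i, "\n")
end

p br_to_newline("one<br>two<BR/>three<br />four")
# prints "one\ntwo\nthree\nfour"
```

The result can then be passed to Nokogiri::HTML(...).text exactly as in the line above.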
Farci answered 13/4, 2012 at 17:32 Comment(0)
L
0

Nokogiri's text output keeps only the link text and drops the URL, so I run this first to preserve the links in the text version:

html_version.gsub!(/<a href[^>]*?(https?:[^"']+)[^>]*>(.*?)<\/a>/i) { "#{$2}\n#{$1}" }

That will turn this:

<a href = "http://google.com">link to google</a>

into this:

link to google
http://google.com
Likable answered 13/4, 2012 at 17:57 Comment(0)
I
0

If you use HAML, you can output stored HTML unescaped by passing it through the raw helper, e.g.:

      = raw @product.short_description
Inexplicable answered 6/5, 2016 at 7:23 Comment(0)
