Is there a way to use readability and python to extract just text, not HTML?
B

3

7

I need to extract pure text from an arbitrary web page at runtime, on the server side. I use Google App Engine and a Python port of Readability. There are a number of those:

  1. An early version by gfxmonk, based on BeautifulSoup.
  2. A version by minvolai, based on gfxmonk's but using lxml instead of BeautifulSoup, which makes it faster (according to minvolai, see the project page) at the cost of a dependency on lxml.
  3. A version by Yuri Baburov, aka buriy. Like minvolai's, it depends on lxml. It also depends on chardet to detect the encoding.

I use Yuri's version, as it is the most recent and seems to be in active development. I managed to make it run on Google App Engine using Python 2.7. Now the "problem" is that it returns HTML, whereas I need pure text.

The advice in this Stack Overflow article about link extraction is to use BeautifulSoup. I will, if there is no other choice, but BeautifulSoup would be yet another dependency, as I use the lxml-based version.

My questions:

  • Is there a way to get pure text from the Python Readability version that I use without forking the code?
  • Is there a way to easily retrieve pure text from the HTML result of Python Readability, e.g. by using lxml, BeautifulSoup, a regex, or something else?
  • If the answer to the above is no, or yes but not easily, what is the way to modify Python Readability? Is such a modification desirable enough (to enough people) to make the extension official?
Bordello answered 22/6, 2012 at 6:15 Comment(3)
Do you mean strip out the html tags, resulting in only text? #753552Castellatus
It's desirable to have a tool like this. I think there is scope for a good tool to be developed. Hope you start working towards it.Eisenach
Right, I mean to have text only. I would like to annotate a link to the page with its first paragraph or two, so the person can make a better informed decision about whether to follow the link.Bordello
B
4

So as not to leave this open, here is my current solution:

  1. I did not find a way to get plain text from the Readability ports themselves.
  2. I decided to use Beautiful Soup, version 4.
  3. BS has one simple function to extract text.

code:

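from bs4 import BeautifulSoup

# html holds the markup to convert (e.g. the HTML returned by Readability)
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
text = soup.get_text()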
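
Since the lxml-based port already depends on lxml, a similar result may be possible without adding Beautiful Soup. The following is only a minimal sketch under that assumption; `html_to_text` is a hypothetical helper name, not part of lxml or Readability, and joining chunks with a space will also split words that are broken up by inline tags.

```python
import lxml.html


def html_to_text(html_snippet):
    """Return the text of an HTML fragment, one space between text chunks."""
    doc = lxml.html.fromstring(html_snippet)
    # itertext() yields every text node in document order; entities are
    # already decoded by the parser.
    chunks = (chunk.strip() for chunk in doc.itertext())
    return " ".join(chunk for chunk in chunks if chunk)


print(html_to_text("<div><p>some text<br>other text</p><p>a &gt; b</p></div>"))
```

`doc.text_content()` is shorter still, but it concatenates adjacent text without any separator.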
Bordello answered 28/6, 2012 at 6:17 Comment(0)
M
5

You can use html2text, a nifty tool that converts HTML into Markdown-formatted text.

Here is a link on how to use it together with the Python Readability tool; the combination is called read2text.

http://brettterpstra.com/scripting-readability-markdownify-for-clipping-web-pages/

Hope this helps :)

Mell answered 22/6, 2012 at 6:21 Comment(0)
O
3

First, extract the HTML content with readability:

from readability import Document

html_snippet = Document(html).summary()

Then use a library to remove the HTML tags. There are two caveats: 1) you probably need to preserve spacing, so "<p>some text<br>other text" shouldn't become "some textother text", and you may want list items converted into " - "; 2) "&#39;" should be displayed as "'", and "&gt;" should be displayed as ">". This is called HTML entity replacement (see below).

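I usually use a library called bleach to clean out unnecessary tags and attributes. Pass strip=True, because by default bleach escapes disallowed tags instead of removing them:

cleaned_text = bleach.clean(html_snippet, tags=[], strip=True)

or, to keep a few inline tags:

cleaned_text = bleach.clean(html_snippet, tags=['i', 'b'], strip=True)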

If you want to remove all tags and still get well-formatted text, use an html2text-style library, or implement a custom formatting procedure yourself.

But I think you get the rough idea now.

Simple text formatting can be done with bleach alone. For example, if you want paragraphs separated by "\n" and list items rendered as "\n - ":

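import bleach

# Keep only the tags that carry formatting; strip=True drops all others
norm_html = bleach.clean(html_snippet, tags=['p', 'br', 'li'], strip=True)
replaced_html = norm_html.replace('<p>', '\n').replace('</p>', '\n')
replaced_html = replaced_html.replace('<br>', '\n').replace('<li>', '\n - ')
# Remove whatever tags are left (e.g. </li>)
cleaned_text = bleach.clean(replaced_html, tags=[], strip=True)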
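
The "custom formatting procedure" idea can also be sketched without bleach, using only the standard library. This is an illustration under stated assumptions, not the approach from the answer above: it assumes Python 3, `TextExtractor` and `html_to_text` are hypothetical names, and it makes no attempt to skip script or style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text, adding line breaks for <p>/<br> and ' - ' for <li>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &gt; and &#39;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("p", "br"):
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n - ")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html_snippet):
    parser = TextExtractor()
    parser.feed(html_snippet)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


print(html_to_text("<p>some text<br>other text</p>"))
```

Both caveats are handled in one pass here: tags turn into separators as they are encountered, and the parser decodes entities before the text reaches `handle_data`.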

For a regexp that only strips HTML tags and does entity replacement ("&gt;" should become ">" and so on), take a look at https://mcmap.net/q/92394/-strip-html-from-strings-in-python
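
A minimal sketch of that regexp idea follows; it is not the code from the linked answer. It assumes Python 3 for `html.unescape`, and `TAG_RE` and `strip_tags` are hypothetical names. A pattern this simple can be fooled by comments or by attribute values containing ">", so it is best reserved for already-cleaned markup such as Readability output.

```python
import html
import re

# Matches anything that looks like a tag: "<", then non-">" characters, then ">"
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html_snippet):
    """Remove tags first, then decode entities such as &gt; and &#39;."""
    without_tags = TAG_RE.sub("", html_snippet)
    return html.unescape(without_tags)


print(strip_tags("<p>1 &gt; 0 &amp; it&#39;s <b>fine</b></p>"))
```

Decoding entities after removing tags matters: an escaped "&lt;b&gt;" in the text then survives as a literal "<b>" instead of being mistaken for markup.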

Ohaus answered 4/6, 2016 at 18:13 Comment(0)
