I need to extract pure text form a random web page at runtime, on the server side. I use Google App Engine, and Readability python port. There are a number of those.
- early version by gfxmonk, based on BeautifulSoup
- version by minvolai based on gfxmonk's except uses lxml and not BeautifulSoap, making it (according to minvolai, see the project page) faster, albeit introducing dependency on lxml.
- version by Yuri Baburov aka buriy. Same as minvolai's, depens on lxml. Also depends on chardet to detect encoding.
I use Yuri's version, as it is most recent, and seems to be in active development. I managed to make it run on Google App Engine using Python 2.7. Now the "problem" is that it returns HTML, whereas I need pure text.
The advice in this Stackoverflow article about links extraction, is to use BeatifulSoup. I will, if there is no other choice. BeatifulSoup would be yet another dependency, as I use lxml based version.
My questions:
- Is there a way to get pure text from Python Readability version that I use without forking the code?
- Is there a way to easily retrive pure text from the HTML result of Python Readability e.g. by using lxml, or BeatifulSoap, or RegEx, or something else
- If answer to the above is no, or yes but not easily, what is the way to modify Python Readability. Is such modification even desirable enough (to enough people) to make such extension official?