Is there a way to use readability (text extraction algorithm) and a custom algorithm in python to extract links from text?

I'd like to extract only the links that appear in the body text of an article.

1.) I use readability in Python: https://github.com/gfxmonk/python-readability

2.) I'd like to somehow compare the extracted text with the original HTML, so that I can pick out the links that sit in the actual body of the article.

Coparcener answered 3/1, 2011 at 23:20 Comment(0)

It looks like summary() returns a BeautifulSoup tree, so you should be able to do something like:

article = page.summary()       # Extract the article using readability
links = article.findAll("a")   # List of all <a> tags in the article
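
If summary() turns out to give you an HTML string rather than a tree, the same idea can be sketched with only the standard library. This is a rough illustration, not part of readability: LinkCollector, extract_links, full_page and article_html are made-up names, and the two HTML snippets are stand-ins for the original page and for whatever readability returned. It also shows one simple way to compare against the original page, by listing the links that did not make it into the article body.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the href values of all <a> tags in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical inputs: full_page is the original HTML, article_html stands in
# for the pared-down article that readability produced.
full_page = (
    '<div id="nav"><a href="/home">Home</a></div>'
    '<div id="content"><p>See <a href="http://example.com/a">this</a>.</p></div>'
)
article_html = '<p>See <a href="http://example.com/a">this</a>.</p>'

# Links inside the article body
body_links = extract_links(article_html)

# Links on the page that are not in the article body (navigation, ads, ...)
dropped_links = [u for u in extract_links(full_page) if u not in body_links]
```

Comparing by href is the simplest choice here; if the same URL can appear both in the navigation and in the article, you would need a stricter comparison.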
Clinandrium answered 4/1, 2011 at 0:5 Comment(2)
BeautifulSoup is definitely the way to go. - Revest
@Sri: Readability already uses BeautifulSoup. It's designed to pare a page down to the content, minus advertising, navigation and so forth. - Clinandrium
