HTML article content extraction - Alchemy API alternative

I've been researching the best way to build an application that extracts the main article content from almost any HTML webpage. I have a C program that uses libxml2 to parse the HTML, but I came across Alchemy API, which appears to do what I want.

However, it only has an online API, and I want to keep the application in-house without relying on any external calls.

Does anybody have tips? I'm hoping for an offline alternative (paid or free) that does what Alchemy API does.

My fallback is to parse the HTML myself and use NLP (Natural Language Processing) techniques and other heuristics to get at the main article content. The application will mostly be used on websites with a news section or a blog.
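
As a rough illustration of that parse-and-filter fallback, here is a minimal sketch in Python using only the standard library. It is not the C/libxml2 program, and it is not how Alchemy API works. It assumes that article text sits in `<p>` elements that are reasonably long and contain little link text. The names `ParagraphCollector` and `extract_article`, and the thresholds `min_chars` and `max_link_density`, are hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Elements whose contents are assumed to be boilerplate (an assumption, not a rule).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> and how many of its characters are link text."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self.link_depth = 0   # > 0 while inside an <a> element
        self.in_p = False
        self.parts = []
        self.link_chars = 0
        self.paragraphs = []  # list of (text, link_chars)

    def flush(self):
        # Normalize whitespace and store the paragraph if it has any text.
        text = " ".join("".join(self.parts).split())
        if text:
            self.paragraphs.append((text, self.link_chars))
        self.parts, self.link_chars, self.in_p = [], 0, False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p":
            if self.in_p:
                self.flush()  # previous <p> was never closed
            self.in_p = True
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p" and self.in_p:
            self.flush()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.in_p and not self.skip_depth:
            self.parts.append(data)
            if self.link_depth:
                self.link_chars += len(" ".join(data.split()))


def extract_article(html, min_chars=80, max_link_density=0.3):
    """Keep paragraphs that are long enough and not dominated by link text."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    if collector.in_p:
        collector.flush()
    kept = [
        text
        for text, link_chars in collector.paragraphs
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
    return "\n\n".join(kept)
```

Calling `extract_article(page_source)` on a page's HTML string returns the surviving paragraphs joined by blank lines. The thresholds would need tuning per site, and a heuristic this simple will miss articles that do not use `<p>` for body text.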

Marxmarxian answered 8/11, 2010 at 14:3 Comment(1)
I believe you've tagged this question incorrectly. The "Alchemy" tag refers to Adobe Alchemy. I'm guessing that you're talking about alchemyapi.com. - Munafo

There are a few open-source tools that perform similar article extraction. One is Goose, https://github.com/jiminoc/goose, which was open-sourced by Gravity.com.

The wiki has documentation, the source is available to browse, and there are dozens of unit tests that show the text extracted from various articles.

Lithopone answered 8/5, 2011 at 16:6 Comment(1)
Do you know of any other alternatives similar to Goose, but in PHP? - Pictor

AlchemyAPI also offers an on-premise solution, so you don't have to access it online. Our customers generally choose the on-premise solution when they have special security or latency requirements. More information on on-premise solutions can be found here: http://www.alchemyapi.com/products/on-premise/

Underweight answered 8/8, 2013 at 15:10 Comment(1)
The link is now invalid; the on-premise solution has been discontinued. - Plumber
