What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice?

Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for.

Postscript the first: To be precise, the kind of survey I'm after would be a paper (published, unpublished, whatever) that discusses both criteria from the scholarly literature, and a number of existing implementations, and analyses how unsuccessful the implementations are from the viewpoint of the criteria. And, really, a post to a mailing list would work for me too.

Postscript the second: To be clear, after Peter Rowell's answer, which I have accepted, we can see that this question leads to two subquestions: (i) the solved problem of cleaning up non-conformant HTML, for which Beautiful Soup is the most recommended solution, and (ii) the unsolved problem of separating cruft (mostly site-added boilerplate and promotional material) from meat (the content that the kind of people who think the page might be interesting in fact find relevant). To address the state of the art, new answers need to address the cruft-from-meat problem explicitly.

Intern answered 26/12, 2009 at 1:22 Comment(3)
"Scholarly Surveys"? Like in scholarly journals? Any particular journals? What's wrong with citeseer? Not scholarly enough? Seriously -- what searches have you used?Goods
How goes the research? Any new insights?Hent
@Harry: Nothing very neat, I'm afraid.Intern

Extraction can mean different things to different people. It's one thing to be able to deal with all of the mangled HTML out there, and Beautiful Soup is a clear winner in this department. But BS won't tell you what is cruft and what is meat.

Things look different (and ugly) when considering content extraction from the point of view of a computational linguist. When analyzing a page I'm interested only in the specific content of the page, minus all of the navigation/advertising/etc. cruft. And you can't begin to do the interesting stuff -- co-occurrence analysis, phrase discovery, weighted attribute vector generation, etc. -- until you have gotten rid of the cruft.
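
To make "weighted attribute vector generation" concrete: a toy sketch (illustrative only; scikit-learn and the sample strings are my stand-ins, not anything this answer prescribes) that builds weighted term vectors over pages whose cruft has already been stripped.

```python
# Illustrative only: "weighted attribute vector generation" as TF-IDF term
# vectors computed over already-de-crufted page text. scikit-learn and the
# sample strings are stand-ins, not anything the answer prescribes.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

clean_pages = [
    "senate passes budget bill after lengthy debate",
    "local team wins championship in overtime thriller",
    "new budget bill faces court challenge in the senate",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(clean_pages)   # rows = pages, columns = weighted terms

# Show the top-weighted terms/phrases for the first page.
weights = tfidf[0].toarray().ravel()
terms = np.array(vectorizer.get_feature_names_out())
print(terms[weights.argsort()[::-1][:5]])
```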

The first paper referenced by the OP indicates that this was what they were trying to achieve -- analyze a site, determine the overall structure, then subtract that out and Voila! you have just the meat -- but they found it was harder than they thought. They were approaching the problem from an improved accessibility angle, whereas I was an early search engine guy, but we both came to the same conclusion:

Separating cruft from meat is hard. And (to read between the lines of your question) even once the cruft is removed, without carefully applied semantic markup it is extremely difficult to determine 'author intent' of the article. Getting the meat out of a site like citeseer (cleanly & predictably laid out with a very high Signal-to-Noise Ratio) is 2 or 3 orders of magnitude easier than dealing with random web content.

BTW, if you're dealing with longer documents you might be particularly interested in work done by Marti Hearst (now a prof at UC Berkeley). Her PhD thesis and other papers on doing subtopic discovery in large documents gave me a lot of insight into doing something similar in smaller documents (which, surprisingly, can be more difficult to deal with). But you can only do this after you get rid of the cruft.
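
For the curious, Hearst's subtopic-segmentation idea (TextTiling) has an off-the-shelf implementation in NLTK. A minimal sketch, assuming you already have a de-crufted, paragraph-delimited document to feed it; the filename is a placeholder, and NLTK's English stopword list must be downloaded first.

```python
# A minimal sketch of Hearst-style subtopic segmentation via NLTK's TextTiling
# implementation. "article.txt" is a placeholder for a longish, already
# de-crufted document with blank lines between paragraphs (TextTiling needs
# both). Requires nltk plus its stopword list: nltk.download("stopwords").
from nltk.tokenize import TextTilingTokenizer

with open("article.txt", encoding="utf-8") as f:
    document = f.read()

tt = TextTilingTokenizer()            # Hearst's block-comparison method
segments = tt.tokenize(document)      # list of subtopic passages

for i, segment in enumerate(segments, start=1):
    print(f"--- subtopic segment {i}: {len(segment.split())} words ---")
```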


For the few who might be interested, here's some backstory (probably Off Topic, but I'm in that kind of mood tonight):

In the 80's and 90's our customers were mostly government agencies whose eyes were bigger than their budgets and whose dreams made Disneyland look drab. They were collecting everything they could get their hands on and then went looking for a silver bullet technology that would somehow ( giant hand wave ) extract the 'meaning' of the document. Right. They found us because we were this weird little company doing "content similarity searching" in 1986. We gave them a couple of demos (real, not faked) which freaked them out.

One of the things we already knew (and it took a long time for them to believe us) was that every collection is different and needs its own special scanner to deal with those differences. For example, if all you're doing is munching straight newspaper stories, life is pretty easy. The headline mostly tells you something interesting, and the story is written in inverted-pyramid style - the first paragraph or two has the meat of who/what/where/when, and then following paras expand on that. Like I said, this is the easy stuff.

How about magazine articles? Oh God, don't get me started! The titles are almost always meaningless and the structure varies from one mag to the next, and even from one section of a mag to the next. Pick up a copy of Wired and a copy of Atlantic Monthly. Look at a major article and try to figure out a meaningful one-paragraph summary of what the article is about. Now try to describe how a program would accomplish the same thing. Does the same set of rules apply across all articles? Even articles from the same magazine? No, they don't.

Sorry to sound like a curmudgeon on this, but this problem is genuinely hard.

Strangely enough, a big reason for Google being as successful as it is (from a search engine perspective) is that they place a lot of weight on the words in and surrounding a link from another site. That link-text represents a sort of mini-summary done by a human of the site/page it's linking to, exactly what you want when you are searching. And it works across nearly all genre/layout styles of information. It's a positively brilliant insight and I wish I had had it myself. But it wouldn't have done my customers any good because there were no links from last night's Moscow TV listings to some random teletype message they had captured, or to some badly OCR'd version of an Egyptian newspaper.
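
To make the anchor-text observation concrete, here is a toy sketch (an editorial illustration, not part of the original rant) that harvests link text and a little surrounding context from a linking page, using Beautiful Soup as recommended elsewhere in this thread; the HTML snippet is a stand-in.

```python
# Toy illustration of the "anchor text as mini-summary" idea: harvest the
# text of links (plus some surrounding context) from a page that links to the
# document you care about. The HTML here is a stand-in.
from bs4 import BeautifulSoup

linking_page = """
<p>For background, see <a href="http://example.com/article">this excellent
survey of HTML content extraction</a>, which covers the main heuristics.</p>
"""

soup = BeautifulSoup(linking_page, "html.parser")
for a in soup.find_all("a", href=True):
    anchor_text = a.get_text(" ", strip=True)
    context = a.parent.get_text(" ", strip=True)   # surrounding sentence/paragraph
    print(a["href"])
    print("  anchor text :", anchor_text)
    print("  context     :", context)
```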

/mini-rant-and-trip-down-memory-lane

Caudate answered 26/12, 2009 at 3:5 Comment(2)
This is pure gold. I haven't quite found what I was after yet, but I am much closer. One quibble: I am not persuaded about the meaninglessness of magazine titles - don't they get cited, usually verbatim, allowing you to do Google's trick?Intern
Well, it's complicated. First of all, Google always shows you what was in the TITLE tag of the page. Google does phrase searching when you put " around your query, thereby forcing it to do fairly standard SE type of stuff. But when you don't put quotes around the query, that's when they can start to have fun. Try googling sheep shearing snoring (no quotes) and click on the Cached link for the second article. It turns out "snoring" does not occur on that page, only on pages that linked to it. Shoot me an email if you want to pursue this further: [email protected]Caudate

One word: boilerpipe.

For the news domain, on a representative corpus, we're now at 98% / 99% extraction accuracy (avg/median)

Also quite language independent (today, I've learned it works for Nepali, too).

Disclaimer: I am the author of this work.
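
For readers who want to try it: boilerpipe itself is a Java library (the core call is ArticleExtractor.INSTANCE.getText(...)); the sketch below uses the commonly cited Python wrapper package. The wrapper's import path and constructor are assumptions that may vary between versions, and the URL is a placeholder.

```python
# Minimal usage sketch (not from the answer's author). boilerpipe is a Java
# library; this goes through the common Python wrapper package `boilerpipe`,
# whose exact import path and constructor may vary by version -- treat them
# as assumptions. The URL is a placeholder.
from boilerpipe.extract import Extractor

extractor = Extractor(extractor="ArticleExtractor",
                      url="http://example.com/some-news-article")
print(extractor.getText())   # main article text, boilerplate stripped
```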

Derosier answered 25/4, 2012 at 20:29 Comment(1)
Tried it on a few URLs with all the Extractors. Good, but not perfect. E.g. Great on this. Fails on this.Subjoinder

Have you seen boilerpipe? Found it mentioned in a similar question.

Strohben answered 21/12, 2010 at 17:27 Comment(2)
I hadn't seen this until about a week ago. I have a friend now using it on a project and it seems to work pretty well. Unfortunately the code itself is ... not beautiful. One of my projects this Fall will be recreating boilerpipe's better features in a Python environment.Caudate
Yes, it's not perfect, but the majority of the time it "just works", which is pretty amazing considering, as the highest-rated answer to this question points out, it is a hard problem.Strohben

I have come across http://www.keyvan.net/2010/08/php-readability/

Last year I ported Arc90's Readability to use in the Five Filters project. It's been over a year now and Readability has improved a lot -- thanks to Chris Dary and the rest of the team at Arc90.

As part of an update to the Full-Text RSS service I started porting a more recent version (1.6.2) to PHP and the code is now online.

For anyone not familiar, Readability was created for use as a browser addon (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.

It’s also very handy for content extraction, which is why I wanted to port it to PHP in the first place.
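
To give a feel for how Readability-family extractors decide what counts as content, here is a heavily simplified sketch of the scoring idea (text length, comma count, link density). It is not the Arc90 or Five Filters code, just an illustration of the heuristic family, written in Python rather than PHP to match the rest of this thread; thresholds and the sample HTML are invented.

```python
# A much-simplified sketch of Readability-style scoring (text length, comma
# count, link density), NOT the actual Arc90/Five Filters code. Uses
# Beautiful Soup purely for parsing; thresholds are illustrative.
from bs4 import BeautifulSoup

def score_block(tag):
    text = tag.get_text(" ", strip=True)
    if not text:
        return 0.0
    link_text = " ".join(a.get_text(" ", strip=True) for a in tag.find_all("a"))
    link_density = len(link_text) / max(len(text), 1)
    score = min(len(text) / 100.0, 3.0)   # reward longer text, capped
    score += text.count(",")              # commas suggest real prose
    score *= (1.0 - link_density)         # penalise link-heavy blocks (nav, footers)
    return score

def extract_main_text(html):
    soup = BeautifulSoup(html, "html.parser")
    candidates = soup.find_all(["p", "div", "td", "article"])
    best = max(candidates, key=score_block, default=None)
    return best.get_text(" ", strip=True) if best is not None else ""

html = "<div id='nav'><a href='/'>Home</a> <a href='/about'>About</a></div>" \
       "<div id='story'><p>The actual article text, with several sentences, " \
       "clauses, and commas, goes here and keeps going for a while.</p></div>"
print(extract_main_text(html))
```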

Corum answered 20/12, 2010 at 12:39 Comment(1)
Ah! The project behind Safari Reader.Intern

There are a few open source tools available that do similar article extraction tasks, e.g., https://github.com/jiminoc/goose, which was open-sourced by Gravity.com.

It has info on the wiki as well as the source you can view. There are dozens of unit tests that show the text extracted from various articles.
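
A minimal sketch using the Python port of Goose (python-goose) rather than Gravity's original Scala code; the import path and attribute names follow that port's commonly documented API and should be treated as assumptions, and the URL is a placeholder.

```python
# Sketch using the Python port of Goose (python-goose), not Gravity's
# original Scala code. The import path and attributes are as commonly
# documented for that port -- treat them as assumptions -- and the URL is a
# placeholder.
from goose import Goose

g = Goose()
article = g.extract(url="http://example.com/some-news-story")
print(article.title)
print(article.cleaned_text[:500])
```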

Rihana answered 8/5, 2011 at 16:7 Comment(0)

I've worked with Peter Rowell down through the years on a wide variety of information retrieval projects, many of which involved very difficult text extraction from a diversity of markup sources.

Currently I'm focused on knowledge extraction from "firehose" sources such as Google, including their RSS pipes that vacuum up huge amounts of local, regional, national and international news articles. In many cases titles are rich and meaningful, but are only "hooks" used to draw traffic to a Web site where the actual article is a meaningless paragraph. This appears to be a sort of "spam in reverse" designed to boost traffic ratings.

To rank articles even with the simplest metric of article length you have to be able to extract content from the markup. The exotic markup and scripting that dominates Web content these days breaks most open source parsing packages such as Beautiful Soup when applied to the large volumes characteristic of Google and similar sources. I've found that 30% or more of mined articles break these packages as a rule of thumb. This has caused us to refocus on developing very low-level, intelligent, character-based parsers to separate the raw text from the markup and scripting. The more fine-grained your parsing (i.e., partitioning of content), the more intelligent (and hand-made) your tools must be. To make things even more interesting, you have a moving target as web authoring continues to morph and change with the development of new scripting approaches, markup, and language extensions. This tends to favor service-based information delivery as opposed to "shrink wrapped" applications.
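
To make the "low-level, character-based parser" point concrete, here is a toy state machine (purely illustrative, nothing like the production parsers described above) that walks raw markup character by character, dropping tags and script/style bodies without relying on a full HTML parser.

```python
# A toy character-level state machine that strips tags and <script>/<style>
# bodies from raw markup without a full HTML parser. This is only the shape
# of the idea the answer describes: real pages need far more states
# (comments, CDATA, broken attributes, entities, ...).
def strip_markup(raw):
    out = []
    i, n = 0, len(raw)
    in_tag = False
    skip_until = None            # closing tag whose entire body we are discarding
    while i < n:
        c = raw[i]
        if skip_until:
            if raw[i:i + len(skip_until)].lower() == skip_until:
                i += len(skip_until)
                skip_until = None
            else:
                i += 1
        elif in_tag:
            if c == ">":
                in_tag = False
            i += 1
        elif c == "<":
            lowered = raw[i:i + 8].lower()
            if lowered.startswith("<script"):
                skip_until = "</script>"
            elif lowered.startswith("<style"):
                skip_until = "</style>"
            else:
                in_tag = True
            i += 1
        else:
            out.append(c)
            i += 1
    return " ".join("".join(out).split())    # collapse runs of whitespace

print(strip_markup("<html><script>var x=1;</script><p>Hello, <b>world</b>!</p></html>"))
```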

Looking back over the years there appear to have been very few scholarly papers written about the low-level mechanics (i.e., the "practice of the latter" you refer to) of such extraction, probably because it's so domain and content specific.

Ecotone answered 26/7, 2011 at 14:30 Comment(2)
These are very good points. Obviously, some people have better techniques for content extraction, as Apple's success with Safari Reader shows.Intern
Indeed, it's a big world out there filled with really impressive talent. The increasing complexity (nonlinear) of both our operating environments and open source tools, however, is making it more difficult these days for small fast moving groups to construct complex hierarchies of high level open source code to solve such problems (i.e. content extraction) while living within increasingly difficult financial and schedule constraints. When these packages are not regularly maintained the architectures that depend upon them rapidly become very brittle.Ecotone

Beautiful Soup is a robust HTML parser written in Python.

It gracefully handles HTML with bad markup and is also well-engineered as a Python library, supporting generators for iteration and search, dot-notation for child access (e.g., access `<foo><bar/></foo>` using `doc.foo.bar`), and seamless Unicode.
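
A minimal sketch using the current bs4 package (this answer predates it; the original import was `from BeautifulSoup import BeautifulSoup`), showing lenient parsing of broken markup, dot-notation access, and search; the sample HTML is invented.

```python
# Minimal sketch with the current bs4 package. It shows lenient parsing of
# sloppy markup, dot-notation child access, and search/iteration.
from bs4 import BeautifulSoup

bad_html = "<html><body><foo><bar>hello<p>unclosed paragraph<br>line"
soup = BeautifulSoup(bad_html, "html.parser")   # tolerates the broken markup

print(soup.foo.bar.get_text())        # dot-notation child access
for p in soup.find_all("p"):          # search and iteration
    print(p.get_text(strip=True))
```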

Imine answered 26/12, 2009 at 1:32 Comment(1)
I agree with Peter Rowell's assessment, which I'd rephrase so - there are two subproblems to my question: One, html content extraction from properly constructed html; and Two, handling the deviations that typical html has from good html. Beautiful Soup looks like an important technology in our toolbox for the latter, but it doesn't have anything to say about the former.Intern

If you are out to extract content from pages that heavily utilize JavaScript, Selenium Remote Control can do the job. It works for more than just testing. The main downside of doing this is that you'll end up using a lot more resources. The upside is you'll get a much more accurate data feed from rich pages/apps.
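
The answer refers to Selenium Remote Control, the older API; a minimal sketch with the current Selenium WebDriver Python bindings looks like this. Driver and browser setup vary by environment, and the URL is a placeholder.

```python
# Minimal sketch with the current Selenium WebDriver Python bindings (the
# answer refers to the older Selenium Remote Control API). Renders a
# JavaScript-heavy page headlessly, then hands the resulting DOM to whatever
# extractor you prefer. The URL is a placeholder.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")    # run without a visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://example.com/js-heavy-page")
    rendered_html = driver.page_source    # DOM after scripts have run
finally:
    driver.quit()

# rendered_html can now go into Beautiful Soup, boilerpipe, Readability, etc.
print(len(rendered_html), "characters of rendered markup")
```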

Phares answered 16/8, 2011 at 0:6 Comment(0)
