Extracting webpage information based on a template in Java

Asked 4/3, 2013 at 12:45 Answered 12/3, 2013 at 12:53

Solved java text-extraction named-entity-extraction

Right now I use Jsoup to extract certain information (not all the text) from some third-party webpages, and I do it periodically. This works fine until the HTML of a webpage changes; each such change forces a change in the existing Java code. This is tedious because these webpages change very frequently, and it requires a programmer to fix the Java code. Here is an example of the HTML I am interested in on a webpage:

<div>
<p><strong>Score:</strong>2.5/5</p>
<p><strong>Director:</strong> Bryan Singer</p>
</div>
<div>some other info which I dont need</div>

Here is what I want to do: save this webpage (an HTML file) locally and create a template out of it, like:

<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
<div>some other info which I dont need</div>

Along with the actual URLs of the webpages, these HTML templates will be the input to the Java program, which will find the location of these predefined keywords (e.g. {MOVIE_RATING}, {MOVIE_DIRECTOR}) and extract the values from the actual webpages.

This way I wouldn't have to modify the Java program every time a webpage changes. I would just save the webpage's HTML and replace the data with these keywords, and the rest would be taken care of by the program. For example, in future the actual HTML code may look like this:

<div>
<div><b>Rating:</b>**1/2</div>
<div><i>Director:</i>Singer, Bryan</div>
</div>

and the corresponding template will look like this:

<div>
<div><b>Rating:</b>{MOVIE_RATING}</div>
<div><i>Director:</i>{MOVIE_DIRECTOR}</div>
</div>

Also, creating this kind of template can be done by a non-programmer, i.e. anyone who can edit a file.

Now the question is, how can I achieve this in Java and is there any existing and better approach to this problem?

Note: While googling I found some research papers, but most of them require some prior learning data and accuracy is also a matter of concern.

Vestibule answered 4/3, 2013 at 12:45 Comment(0)

The approach you gave is pretty much similar to the Gilbert's except the regex part. I don't want to step into the ugly regex world, I am planning to use template approach for many other areas apart from movie info e.g. prices, product specs extraction etc.

1. The template you describe is not actually a "template" in the normal sense of the word: a set of static content that is dumped to the output with a bunch of dynamic content inserted within it. Instead, it is the "reverse" of a template - it is a parsing pattern that is slurped up & discarded, leaving the desired parameters to be found.
2. Because your web pages change regularly, you don't want to hard-code the content to be parsed too precisely, but want to "zoom in" on its essential features, making the minimum of assumptions, i.e. you want to commit to literally matching key text such as "Rating:" and treat interleaving markup such as "<b/>" in a much more flexible manner - ignoring it and allowing it to change without breaking.
3. When you combine (1) and (2), you can give the result any name you like, but IT IS parsing using regular expressions, i.e. the template approach IS the parsing approach using a regular expression - they are one and the same. The question is: what form should the regular expression take?

3A. If you hand-code the parsing in Java, then the obvious answer is that the regular expression format should just be the java.util.regex format. Anything else is a development burden, is "non-standard" and will be hard to maintain.

3B. If you want to use an HTML-aware parser, then jsoup is a good solution. The problem is that you need more text/regular expression handling and flexibility than jsoup seems to provide. It seems too locked into specific HTML tags and structures and so breaks when pages change.

3C. You can use a much more powerful grammar-controlled general text parser such as ANTLR - a form of Backus-Naur inspired grammar is used to control the parsing, and generator code is inserted to process the parsed data. Here, the parsing grammar expressions can be very powerful indeed, with complex rules for how text is ordered on the page and how text fields and values relate to each other. That power is beyond your requirements because you are not processing a language. And there's no escaping the fact that you still need to describe the ugly bits to skip, such as markup tags. And wrestling with ANTLR for the first time involves an educational investment before you get a productivity payback.

3D. Is there a java tool that just uses a simple template type approach to give a simple answer? Well a google search doesn't give too much hope https://www.google.com/search?q=java+template+based+parser&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a. I believe that any attempt to create such a beast will degenerate into either basic regex parsing or more advanced grammar-controlled parsing because the basic requirements for matching/ignoring/replacing text drive the solution in those directions. Anything else would be too simple to actually work. Sorry for the negative view - it just reflects the problem space.

My vote is for (3A) as the simplest, most powerful and flexible solution to your needs.
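
To make points (1) to (3) concrete, here is a minimal sketch of how a {KEYWORD} template could be compiled into a java.util.regex pattern that commits to the key text and skips whatever markup surrounds it. The class name TemplateRegex, the GAP sub-pattern and the rule "a value is the text up to the next tag" are assumptions made for illustration, not part of any library, and a page that renames "Score:" to "Rating:" would still need a new template.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from any library): turns a {PLACEHOLDER} template
// into a java.util.regex pattern that keeps key text and ignores markup.
class TemplateRegex {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([A-Z_]+)\\}");
    // Any run of whitespace and tags may sit between two pieces of key text.
    private static final String GAP = "(?:\\s|<[^>]*>)*";

    private final List<String> names = new ArrayList<>();
    private final Pattern pattern;

    TemplateRegex(String template) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            appendKeyText(regex, template.substring(last, m.start()));
            regex.append("([^<]+)"); // assumed rule: a value is the text up to the next tag
            names.add(m.group(1));
            last = m.end();
        }
        appendKeyText(regex, template.substring(last));
        pattern = Pattern.compile(regex.toString());
    }

    // Drops the tags of a template fragment and keeps only its words,
    // each one preceded by a gap that tolerates arbitrary markup.
    private static void appendKeyText(StringBuilder regex, String fragment) {
        String text = fragment.replaceAll("<[^>]*>", " ").trim();
        if (!text.isEmpty()) {
            for (String word : text.split("\\s+")) {
                regex.append(GAP).append(Pattern.quote(word));
            }
        }
        regex.append(GAP);
    }

    // Returns keyword -> value, or an empty map when the page does not match.
    Map<String, String> extract(String html) {
        Map<String, String> values = new LinkedHashMap<>();
        Matcher m = pattern.matcher(html);
        if (m.find()) {
            for (int i = 0; i < names.size(); i++) {
                values.put(names.get(i), m.group(i + 1).trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        TemplateRegex t = new TemplateRegex(
                "<div><p><strong>Score:</strong>{MOVIE_RATING}</p>"
                + "<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p></div>");
        // Same key text, different markup: the template still matches.
        System.out.println(t.extract(
                "<div><div><b>Score:</b> 2.5/5</div>"
                + "<div><i>Director:</i>Singer, Bryan</div></div>"));
    }
}
```

The non-programmer still edits only the template file; the regular expression is generated from it and never has to be read by a human.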

Pulcheria answered 11/3, 2013 at 1:31 Comment(2)

Thanks for the detailed answer. What do you think about Igor Spasic's approach? Right now it seems like I don't have many options, so I'm leaning more towards Igor's than the regex-based approach. – Vestibule 12/3, 2013 at 12:43

In Igor's answer, he gives the example of jQuery text matching $.find("div#movie").find("div:nth-child(2)")....text(); That's totally dependent on the structure of the tags in the page - what you don't want. An example of extracting fields with regex: String stripped = htmlString.replaceAll("</?(div|br|i|b|strong)>", ""); Pattern pattern = Pattern.compile("Rating:\\s*([*1/2]+)\\s*Director:\\s*([a-zA-Z,.\\-' ]+)"); Matcher matcher = pattern.matcher(stripped); while (matcher.find()) { String rating = matcher.group(1); String director = matcher.group(2); } – Pulcheria 14/3, 2013 at 4:12

Not really a template-based approach here, but jsoup can still be a workable solution if you just externalize your Selector queries to a configuration file.

Your non-programmer doesn't even have to see HTML, just update the selectors in the configuration file. Something like SelectorGadget will make it easier to pick out what selector to actually use.

Roguish answered 7/3, 2013 at 2:54 Comment(0)

How can I achieve this in Java and is there any existing and better approach to this problem?

The template approach is a good approach. You gave all of the reasons why in your question.

Your templates would consist of just the HTML you want to process, and nothing else. Here's my example based on your example.

<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>

Basically, you would use Jsoup to process your templates. Then, as you use Jsoup to process the web pages, you check all of your processed templates to see if there's a match.

On a template match, you find the keywords in the processed template, then you find the corresponding values in the processed web page.

Yes, this would be a lot of coding, and more difficult than my description indicates. Your Java programmer will have to break this description down into simpler and simpler tasks until she or he can code the tasks.
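
As a rough illustration of the matching step, the sketch below walks a template and a page in lockstep and records the page text wherever the template holds a {KEYWORD}. It is only a sketch under strong assumptions: both inputs are well-formed XHTML so that the JDK's built-in XML parser can stand in for Jsoup, and the class name TemplateDomMatcher and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: compares a template tree with a page tree node by node.
class TemplateDomMatcher {

    // Assumes well-formed XHTML; real-world HTML would need an HTML parser.
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XHTML", e);
        }
    }

    // Returns true when the page has the same structure as the template.
    // Values found so far are written to 'out' even if a later node fails.
    static boolean match(Node template, Node page, Map<String, String> out) {
        if (template.getNodeType() != page.getNodeType()) {
            return false;
        }
        if (template.getNodeType() == Node.TEXT_NODE) {
            String expected = template.getTextContent().trim();
            String actual = page.getTextContent().trim();
            if (expected.matches("\\{[A-Z_]+\\}")) {
                out.put(expected.substring(1, expected.length() - 1), actual);
                return true;
            }
            return expected.equals(actual);
        }
        if (!template.getNodeName().equals(page.getNodeName())) {
            return false;
        }
        List<Node> t = significantChildren(template);
        List<Node> p = significantChildren(page);
        if (t.size() != p.size()) {
            return false;
        }
        for (int i = 0; i < t.size(); i++) {
            if (!match(t.get(i), p.get(i), out)) {
                return false;
            }
        }
        return true;
    }

    // Keeps elements and non-blank text; drops indentation, comments, etc.
    private static List<Node> significantChildren(Node parent) {
        List<Node> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            boolean element = child.getNodeType() == Node.ELEMENT_NODE;
            boolean text = child.getNodeType() == Node.TEXT_NODE
                    && !child.getTextContent().trim().isEmpty();
            if (element || text) {
                result.add(child);
            }
        }
        return result;
    }
}
```

A caller would parse each stored template once, call match against the relevant fragment of every fetched page, and use the first template for which it returns true.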

Helix answered 4/3, 2013 at 19:42 Comment(2)

I thought about it and realized that using Jsoup for this problem may not be a good idea. Jsoup depends heavily on the position of the DOM elements: the Jsoup Java code will have to be written with respect to a template, so when a webpage's HTML changes, the corresponding template will also require a change, and that will lead to Java code changes. Another issue is that with a large number of templates the number of comparisons will increase, e.g. for 100 templates there will be around 100^2 comparisons, which will be time-consuming. – Vestibule 5/3, 2013 at 8:26

10,000 comparisons aren't that time consuming. I've never used Jsoup. However, any HTML parser will maintain the position of the elements. I thought that was the point of the templates. – Helix 5/3, 2013 at 9:45

If the web page changes frequently, then you'll probably want to confine your search for the fields like MOVIE_RATING to the smallest possible part of the page, and ignore everything else. There are two possibilities: you could either use a regular expression for each field, or you could use some kind of CSS selector. I think either would work and either "template" can consist of a simple list of search expressions, regex or css, that you would apply. Just roll through the list and extract what you can, and fail if some particular field isn't found because the page changed.

For example, the regex could look like this:

"Score:"(.)*[0-9]\.[0-9]\/[0-9]

(I haven't tested this.)
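
A minimal Java sketch of this "list of search expressions" idea follows. The two patterns and the class name FieldRules are illustrative assumptions, not tested against any real site; each rule captures its value in group 1 and the extraction fails loudly when a field is no longer found.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rule list: one regular expression per field.
class FieldRules {
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // Label, then optional tags/whitespace, then a value such as 2.5/5.
        RULES.put("MOVIE_RATING", Pattern.compile(
                "Score:\\s*(?:<[^>]*>\\s*)*([0-9](?:\\.[0-9])?/[0-9])"));
        // Label, then optional tags/whitespace, then text up to the next tag.
        RULES.put("MOVIE_DIRECTOR", Pattern.compile(
                "Director:\\s*(?:<[^>]*>\\s*)*([^<]+)"));
    }

    // Applies every rule in order; throws when a field cannot be found,
    // which signals that the page changed and a rule needs updating.
    static Map<String, String> extract(String html) {
        Map<String, String> values = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(html);
            if (!m.find()) {
                throw new IllegalStateException("Field not found: " + rule.getKey());
            }
            values.put(rule.getKey(), m.group(1).trim());
        }
        return values;
    }
}
```

In practice the rules would more likely be loaded from a text file than hard-coded, so that they can be edited without recompiling.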

Lytton answered 7/3, 2013 at 0:9 Comment(7)

The approach you gave is pretty much similar to Gilbert's except for the regex part. I don't want to step into the ugly regex world; I am planning to use the template approach for many other areas apart from movie info, e.g. prices, product specs extraction etc. Also, in the present case a reviewer can give a rating like 3.5/5, A-, *** or two and a half, so I'll have to create multiple regexes to get this one value. – Vestibule 7/3, 2013 at 8:7

A regex is just a way of expressing the rules for extracting a piece of text. You have to express those rules one way or another. You'll have to do it in code, or as css selectors, or in a regex. You could certainly simplify the regex I suggested: "Score:</b>"~"</div>". That would capture all of the scores, regardless of format, at the cost of relying on the existence of a trailing "</div>". – Lytton 7/3, 2013 at 17:14

I will have to update the regex in case HTML changes from <p><strong>Score:</strong>2.5/5</p> to <p>Rating: A-</p>, this is just what I am trying to avoid. Just to emphasize the point I made about the regexes: #1732848 – Vestibule 7/3, 2013 at 17:47

My point still stands. If the HTML changes, then something has to change in your scraper code or template or regex. There's no magic that will read the page and understand it semantically. Google "java screen scraper" to get an idea of how others have solved the problem. BTW, bobince is wrong. Regex is entirely appropriate for locating really small portions of a page where you don't care about the dom. – Lytton 7/3, 2013 at 19:50

It's there in the original question: the whole template will change when the HTML of a webpage changes. The template will have the same HTML code as the original webpage, but with keywords in place of the real data. Can you please provide a link where someone has solved a similar problem using a screen scraper? I am open to all languages, not just Java. – Vestibule 7/3, 2013 at 20:17

Your approach will fail if there is any variable data in the page. If the page displays today's date, for example, the whole template fails. This is why you need rules. Again, try google for screen scrapers. – Lytton 7/3, 2013 at 20:26

I didn't get your point about the date. Can you please elaborate on it in your post? I googled screen scrapers and didn't find anything useful. – Vestibule 7/3, 2013 at 20:30

Or you can try a different approach, using what I would call 'rules' instead of templates: for each piece of information that you need from the page, you can define jQuery expression(s) that extract the text. Often, when the page change is small, the same well-written jQuery expressions will still give the same results.

Then you can use Jerry (jQuery in Java), with almost the same expressions, to fetch the text you are looking for. So it's not only about selectors; you also have other jQuery methods for walking/filtering the DOM tree.

For example, a rule for some Director text would be (in a sort of pseudo-Java-Jerry code):

$.find("div#movie").find("div:nth-child(2)")....text();

There could be more (and more complex) expressions in the rule, spread across several lines, that for example iterate some nodes etc.

If you are an OO person, each rule may be defined in its own implementation. If you are a Groovy person, you can even rewrite rules when needed, without recompiling your project, while still staying in Java. Etc.

As you see, the core idea here is to define rules for how to find your text, and not to match against patterns, as that may be fragile to minor changes - imagine if just a space has been added between two divs :). In this example of mine, I've used jQuery-alike syntax (actually, it's Jerry-alike syntax, since we are in Java) to define rules. This is only because jQuery is popular and simple, and known by your web developer too; in the end you can define your own syntax (depending on the parsing tool you are using): for example, you may parse the HTML into a DOM tree and then write rules, using your own helper methods, for how to traverse it to the place of interest. Jerry also gives you access to the underlying DOM tree.

Hope this helps.

Fantoccini answered 7/3, 2013 at 21:4 Comment(2)

This sounds interesting, will it be possible to use Rhino with these kind of rules? If yes in that case I can just write these rules in the form of key:value pair e.g. movie_rating:$.find("div#movie").find("div:nth-child(2)") – Vestibule 11/3, 2013 at 16:13

For a start, I would try to skip Rhino (it's big and potentially slow). I would instead try to use Jerry - if that makes sense for you, of course - as it is in Java and you can write jQuery-alike syntax with it (see the docs). If that for some reason does not work for you, then yes, you probably could use Rhino and fire the JavaScript event. – Fantoccini 11/3, 2013 at 16:27

I used the following approach to do something similar in a personal project of mine that generates an RSS feed out of the leading real estate website in Spain.

Using this tool I found the rented place I'm currently living in ;-)

1. Get the HTML code from the page.
2. Transform the HTML into XHTML. I used this library; I guess there might be better options available today.
3. Use XPath to navigate the XHTML to the information you're interested in.

Of course every time they change the original page you will have to change the XPath expression. The other approach I can think of -semantic analysis of the original HTML source- is far, far beyond my humble skills ;-)
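
A minimal sketch of the XPath step, using only the XPath support that ships with the JDK, might look like this. It assumes the page has already been converted to well-formed XHTML without a namespace declaration (the conversion step is not shown); the class name XPathFields and the expressions used with it are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates one XPath expression per field.
class XPathFields {

    // 'expressions' maps a field name to an XPath expression. A field whose
    // expression selects nothing comes back as an empty string.
    static Map<String, String> extract(String xhtml, Map<String, String> expressions) {
        try {
            // Parse once, then run every expression against the same document.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> values = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : expressions.entrySet()) {
                values.put(e.getKey(), xpath.evaluate(e.getValue(), doc).trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the page", e);
        }
    }
}
```

For the movie example in the question, an expression such as //p[strong='Score:']/text() selects the text that follows the "Score:" label. Keeping the expressions in a map means they could be moved to a configuration file and updated without touching the Java code.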

Agon answered 12/3, 2013 at 12:53 Comment(0)
