Algorithms for recognizing a physical address on a webpage
D

9

7

What are the best algorithms for recognizing structured data on an HTML page?

For example, Google recognizes a home or company address in an email and offers a map to that address.

Detrition answered 8/12, 2008 at 9:6 Comment(2)
Somebody edit this to say location or physical address, since this is still pretty ambiguous. – Neologism
Thanks, I realize now that the question is ambiguous. – Detrition
N
12

A named-entity extraction framework such as GATE has at least tackled the information extraction problem for locations, assisted by a gazetteer of known places to help resolve common issues. Unless the pages were machine generated from a common source, you're going to find regular expressions a bit weak for the job.

Neologism answered 8/12, 2008 at 11:59 Comment(1)
Can anyone tell what Google uses? – Monumental
J
4

If you have the actual markup, and not just the text from the page, I second the Beautiful Soup suggestion above. In particular, the address tag should provide the lowest of low-hanging fruit. Also look into the adr microformat. I'd only fall back to regexes if the first two didn't pull enough info or if I didn't have the markup needed to look for them.
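As a rough sketch of that markup-first approach, the snippet below collects the text of `<address>` elements and of elements carrying the `adr` microformat class. It uses Python's standard-library `html.parser` instead of Beautiful Soup so that it runs without extra installs; the class name `AddressTagParser` and the helper `extract_marked_up_addresses` are made up for illustration.

```python
from html.parser import HTMLParser


class AddressTagParser(HTMLParser):
    """Collect text inside <address> elements and elements with class "adr"."""

    def __init__(self):
        super().__init__()
        self.addresses = []   # finished address strings
        self._tag = None      # tag name of the element currently being captured
        self._depth = 0       # nesting depth of that tag name
        self._chunks = []     # text pieces seen inside the element

    def handle_starttag(self, tag, attrs):
        if self._tag is None:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "address" or "adr" in classes:
                self._tag, self._depth, self._chunks = tag, 1, []
        elif tag == self._tag:
            self._depth += 1              # same tag nested inside the capture
        elif tag == "br":
            self._chunks.append(" ")      # treat a line break as a space

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and collapse runs of whitespace.
                text = " ".join("".join(self._chunks).split())
                if text:
                    self.addresses.append(text)
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)


def extract_marked_up_addresses(html):
    """Return the text of every <address> or class="adr" element in html."""
    parser = AddressTagParser()
    parser.feed(html)
    parser.close()
    return parser.addresses
```

This only finds addresses that the page author marked up explicitly; anything left in plain paragraphs still needs one of the text-based approaches discussed in the other answers.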

Jambalaya answered 8/12, 2008 at 12:7 Comment(0)
P
3

If you also have to handle international addresses, you're in for a world of headaches; international address formats are amazingly varied.

Pycno answered 8/12, 2008 at 9:23 Comment(0)
D
2

Do not use regular expressions to parse the HTML itself. Use an existing HTML parser; in Python, for example, I strongly recommend BeautifulSoup. That holds even if you then apply a regular expression to the contents of the elements BeautifulSoup grabs.

If you do it with your own regexes, you not only have to worry about finding the data you require, you also have to worry about things like invalid HTML and lots of other very non-obvious problems you'll stumble over.

Dubenko answered 8/12, 2008 at 10:16 Comment(0)
L
2

I'd guess that Google takes a two-step approach to the problem (at least that's what I would do). First they use some fairly general search pattern to pick out everything that could be an address, and then they use their map database to look up that string and see if they get any matches. If they do, it's probably an address; if they don't, it probably isn't. If you can use a map database in your code, that will probably make your life easier.

Unless you can limit the geographic location of the addresses, I'm guessing that it's pretty much impossible to identify a string as an address just by parsing it, simply due to the huge variation of address formats used around the world.
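A minimal sketch of that two-step idea, assuming the input is limited to US-style street addresses. The loose pattern, the `KNOWN_STREETS` set (a stand-in for a real map database), and the function names are all hypothetical; this is not a description of what Google actually does.

```python
import re

# Step 1: a deliberately loose pattern: a number followed by one to four
# capitalized words. It will also pick up many non-addresses.
CANDIDATE = re.compile(r"\b\d{1,5}(?:\s+[A-Z][a-z]+){1,4}")

# Step 2: stand-in for a map database lookup (hypothetical data).
KNOWN_STREETS = {"main street", "elm avenue"}


def find_candidates(text):
    """Return every substring that loosely looks like an address."""
    return [m.group(0) for m in CANDIDATE.finditer(text)]


def street_part(candidate):
    """Drop the leading house number and lowercase the rest for the lookup."""
    return re.sub(r"^\d+\s+", "", candidate).lower()


def find_addresses(text, known_streets=KNOWN_STREETS):
    """Keep only the candidates whose street is found in the lookup set."""
    return [c for c in find_candidates(text) if street_part(c) in known_streets]
```

With a real geocoder or gazetteer in place of the set, the second step carries most of the weight, which is why the first pattern can afford to be sloppy.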

Lawrenson answered 8/12, 2008 at 11:34 Comment(0)
L
1

What you're asking is really quite a hard problem if you want to get it perfect. While a simple regexp will get it mostly right most of the time, writing one that will get it exactly right every time is fiendishly hard. There are plenty of strange corner cases, and in several cases there is no single unambiguous answer. Most web sites that I've seen do a pretty bad job handling all but the simplest URLs.

If you want to go down the regexp route, your best bet is probably to check out the source code of http://metacpan.org/pod/Regexp::Common::URI::http

Lawrenson answered 8/12, 2008 at 10:6 Comment(1)
1) Link no longer works. 2) Is this answer about regex for postal address? – Maziar
H
0

Again, regular expressions should do the trick.

Because of the wide variety of address formats, you can only guess whether a string is an address, using an expression like "(number), (name) Street|Boulevard|Main", etc.

You can also look into some Firefox extensions that aim to map addresses found in text, to see how they work.
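One way to write such an expression in Python is sketched below. The list of street types (`Street`, `Boulevard`, `Avenue`, `Road`) and the limit of one to three capitalized name words are assumptions chosen for illustration, not a complete grammar.

```python
import re

STREET_ADDRESS = re.compile(
    r"""
    \b(?P<number>\d{1,5}),?\s+               # house number, optional comma
    (?P<name>(?:[A-Z][a-z]+\s+){1,3})        # one to three capitalized words
    (?P<kind>Street|Boulevard|Avenue|Road)\b # street type (assumed list)
    """,
    re.VERBOSE,
)


def guess_addresses(text):
    """Return (number, name, kind) tuples for substrings that look like addresses."""
    return [
        (m.group("number"), m.group("name").strip(), m.group("kind"))
        for m in STREET_ADDRESS.finditer(text)
    ]
```

It remains a guess: a phrase such as "3 Great Road trips" matches just as readily as a real address.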

Humber answered 8/12, 2008 at 9:56 Comment(0)
C
0

You can check this USA address extraction example: http://code.google.com/p/graph-expression/wiki/USAAddressExtraction

Caplin answered 20/5, 2011 at 5:23 Comment(0)
C
0
  1. It depends on your requirements.

For email and contact details, regex is more than enough. For addresses, regex alone will not help; think about NLP techniques such as named-entity recognition (NER) and part-of-speech (POS) tagging. For finding information about people, you can't do anything without NER.

  • If you need information like paragraphs, get the contents by using tags.
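
For the email case mentioned above, a simplified pattern along these lines is usually workable. It is an illustrative sketch, not a full RFC 5322 matcher, and the names `EMAIL` and `find_emails` are made up here.

```python
import re

# Local part, "@", one or more dot-separated labels, then an alphabetic TLD.
EMAIL = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL.findall(text)
```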
Chandachandal answered 27/7, 2017 at 14:14 Comment(0)
