algorithm for checking addresses for matches?

Asked 20/5, 2010 at 16:23 Answered 20/5, 2010 at 18:17

Solved algorithm street-address user-data

I'm working on a survey program where people will be given promotional considerations the first time they fill out a survey. In a lot of scenarios, the only way we can stop people from cheating the system and getting a promotion they don't deserve is to check street address strings against each other.

I was looking at using levenshtein distance to give me a number to measure similarity, and consider those below a certain threshold a duplicate.

However, if someone were looking to game the system, they could easily write "S 5th St" instead of "South Fifth Street", and levenshtein would consider those strings to be very different. So then I was thinking to convert all strings to a 'standard address form' i.e. 'South' becomes 's', 'Fifth' becomes '5th', etc.

Then I was thinking this is hopeless, and too much effort to get it working robustly. Is it?

I'm working with PHP/MySql, so I have the limitations inherent in that system.

Lida answered 20/5, 2010 at 16:23 Comment(8)

What if instead of "S. 5th St." someone enters "S. 4th St."? This couldn't be used to game the system (assuming you're mailing the promotional stuff), but it could disqualify people for living one block over. Just an edge case to test. – Dried 20/5, 2010 at 16:27

@Bill that scenario is not a problem because then they wouldn't recieve their promotional consideration. Unless they're in cahoots with the folks who reside on that house address on 4th street, but there's only so many households they can conspire with. It's self-limiting, I think :) – Lida 20/5, 2010 at 16:39

@user15841: No, I mean what if those two people legitimately sign up independently of each other? Your algorithm needs to be smart enough to see the difference between those two addresses, but also smart enough that it sees the original examples you gave as the same. – Dried 20/5, 2010 at 16:49

You mean, if someone accidentally gives someone else's address? Yeah, that's a problem, but I don't see how the system could address it without being open to more gaming ("Are you sure you meant 4th street? We already have one for that address. Care to try again?" ) – Lida 20/5, 2010 at 16:51

No, I meant if two people living at very similar but different addresses both sign up, one of them might not get their prize. – Dried 20/5, 2010 at 16:55

Is street address only identifier? How about email ID / cookies / IP based detection? – Oke 20/5, 2010 at 16:59

@Oke they could be using any computer anywhere - at home, at the library, at grandma's house. – Lida 20/5, 2010 at 17:4

@Bill well, I think the street name has to be one of the major identifiers. If the algorithm is going to confuse Fourth Street with Fifth Street, that obviously won't work. No different from confusing Main Street with Town Street :) – Lida 20/5, 2010 at 17:5

I think your second idea is better than using Levenshtein distance. If you try to compare the addresses for similarity, then two different people who live nearby each other might accidentally "cheat" one another out of their prize. If I live at "S. 4th St." but my neighbor at "S. 5th St." already signed up, those two addresses might seem too similar by Lev distance.

You could reduce (but probably not eliminate) a lot of potential cheating by running addresses through a synonym normalizer. Before you check for equality, just convert

North -> N.
East -> E.
...
First -> 1st
Second -> 2nd
Third -> 3rd
...
Street -> St.
Avenue -> Ave.

The longer the list of synonyms you come up with, the better it will be at catching matches. It will be a little bit slower at processing, but addresses are tiny.

This is similar to converting strings to all lower (or upper) case before comparing them. (Which I also recommend, naturally.)

Dried answered 20/5, 2010 at 16:54 Comment(4)

Oh, finally I understand what you're saying! I haven't used levenshtein, so I wasn't familiar enough with it to forsee how that situation would arise :) – Lida 21/5, 2010 at 13:11

Additionally, it must be noted that multiple residents may be living in the same building... That's where it gets tricky, even after standardization. For example "511 N 15th St, Unit 123" vs "511 NORTH 15th St, Apt 124" – Halfcocked 7/7, 2014 at 18:54

You should consider using string distance comparison on the synonyms too. Otherwise "South" will become "S", but "Soith" (typo) will not, and "Soith"->"S" is not similar. Also, be mindful that there are thousands of unicode characters that will produce characters that look like a-z characters but aren't. Furthermore, "Street"->"St." can lead to a false positive for the 'Saint' abbreviation "St.". – Depend 5/3, 2016 at 23:30

Furthermore, it significantly helps to extract the number parts and the text parts, and weight similar numbers higher than dissimilar text tokens. (In this situation, I'd consider the Jaro Winkler edit distance over the Levenshtein distance for comparing the text tokens, since 'Ave'-'Avenue' is within a sensible threshold, whereas it wouldn't be for lev. distance.) – Depend 5/3, 2016 at 23:46

You can use Google Map API (or any other mapping API) to normalize addresses as geo-location (lat/long).

Incorruptible answered 20/5, 2010 at 16:27 Comment(2)

Wouldn't work because the majority of these geospatial api's fail to consider the apartment numbers (e.g., 12th floor 6th room). – Rosarosabel 20/5, 2010 at 16:33

Yes, such normalization wouldn't be 100% accurate and you would need to do additional checks as well. – Incorruptible 20/5, 2010 at 16:55

See these questions for related discussion.

Normalize your data first as much as possible:

avenue -> ave road -> rd Rd. -> rd

first -> 1 1st -> 1

You could look into SOUNDEX or something similar to catch cases where words sound same but have different spelling (e.g. Schmitt, Schmitd, Smith). SOUNDEX works on word level, so you'll need to split address to words first, and compare the SOUNDEX values.

You could also feed the addresses to some geo location service such as Google Maps, store resulting longitude and latitude to your database. When a new address is entered, you just get its longitude/latitude and compare against existing locations in your database. See this question for details.

Sarson answered 20/5, 2010 at 18:17 Comment(0)

Recommended topics

Hot tags