Which data structure should I use for geocoding?

Asked 12/4, 2012 at 12:56 Answered 15/4, 2012 at 15:19

Solved python openstreetmap geocoding large-data-volumes large-data

I am trying to create a Python script which will take an address as input and will spit out its latitude and longitude, or latitudes and longitudes in case of multiple matches, quite like Nominatim.

So, the possible input and outputs could be:-

In: New York, USA => Out: New York (lat:x1 lon:y1)
In: New York => Out: New York (lat:x1 lon:y1)
In: Pearl Street, New York, USA => Out: Pearl Street (lat:x2 lon:y2)
In: Pearl Street, USA => Out: Pearl Street (lat:x2 lon:y2), Pearl Street (lat:x3 lon:y3)
In: Pearl Street => Out: Pearl Street (lat:x2 lon:y2), Pearl Street (lat:x3 lon:y3)
In: 103 Alkazam, New York, USA => Out: New York (lat:x1 lon:y1)

In 6 above, New York was returned since no place was found with address 103 Alkazam, New York, USA, but it could at least find New York, USA.

Initially I thought of building a tree representing the hierarchy relation where siblings are sorted alphabetically. It could have been like:-

                                     GLOBAL
                                       |
                   ---------------------------------------------
                   |            | ...
                  USA
             ---------------
             |        | ...
         CALIFORNIA  NEW YORK 
            |         |
     -----------    -------------
     |        |..   |          |....
 PEARL STREET      PEARL STREET

But the problem was user can provide incomplete address as in 2, 4 and 5.

So, I next thought of using a search tree and store the fully qualified address in each node. But this too is quite bad since:-

This will store highly redundant data in each node. Since this will be a really big data so, space conservation matters.
It won't be able to leverage the fact that user has narrowed down the search space.

I have one additional requirement. I need to detect misspellings. I guess that will have to be dealt as a separate problem and can treat each node as generic strings.

Update 1

A little elaboration. The input would be a list, where the item on lower index is parent of the item in higher index; and they of course may or may not be immediate parent or child. So for query 1, input would be ["USA", "NEW YORK"]. So, it is perfectly fine that USA, New York returns no result.

The user should be able to locate a building if he has the address and our data is so detailed.

Update 2 (Omission Case)

If user queries Pearl Street, USA, so our algo should be able to locate the address since it knows Pearl Street has New York as parent and USA is its parent.

Update 3 (Surplus Case)

Suppose the user queries for 101 C, Alley A, Pearl Street, New York. Also suppose our data does know of 101 C but not about Alley A. According to it 101 C is a immediate child of Pearl Street. Even in this case it should be able to locate the address.

Selhorst answered 12/4, 2012 at 12:56 Comment(8)

So are the only things with locations streets, or is it streets and towns/cities, or is it places on streets (i.e. 63 Pearl street), streets and towns/cities, or something more? – Inharmonious 12/4, 2012 at 13:21

It can be flat no., street, town/city, state, country. Any part could be missing. – Selhorst 12/4, 2012 at 13:30

I think the tag [missing-data] would be appropriate here. – Lowman 12/4, 2012 at 13:44

By missing data I meant it is missing in user's query. Check query 4, above, for example. It doesn't have New York. Our data may or may not be very detailed. So, in this case user said, fetch me Pearl Street which is in USA, which should work since our data knows that although it is not directly in USA, but via New York it is. – Selhorst 12/4, 2012 at 13:49

@Selhorst I sometimes find it easier to make progress by solving parts of a problem. For example, how would a structure of all town/cities be made, and how would a structure of all countries, which only contain town/cities be made? Then how would a structure of all streets be made to refer to all the town/cities that contain that street? Would it also need a structure of streets for each country, or could that be handled by looking up the cities? Is the key & value clear in each case? Maybe treat numbers in a different way; for example, what information might the '7' in "7, Boston, USA" add? – Inharmonious 12/4, 2012 at 14:0

@Inharmonious Totally agree. But Update 2 and 3 are valid scenarios. Particularly Update 2. I am treating 101 C just like a name; a string. The number part has no special meaning. The point of Updates 2 and 3 are that user might miss/skip some levels, and our data might have some levels missing too. So from the codes perspective the second case is like, user has entered some non-existant levels. – Selhorst 12/4, 2012 at 14:12

@Inharmonious On second thought. It seems you meant that I should limit and define the exact levels I want. Yes, that should simplify things a lot. You could have said that in simple words. ;-) – Selhorst 12/4, 2012 at 14:18

@Selhorst - I try to strike a balance between giving too much of an answer to a question, and taking away all the pleasure of solving it, or not being helpful enough, and causing frustration (everyone has their own philosophy). The fact that you :-)'d suggests that maybe I got it about right, you got a 'Eureka' moment :-) – Inharmonious 12/4, 2012 at 14:31

Thanks to all for their answers, their answers were helpful, but did not address everything I needed. I finally found an approach which took care of all my cases. The approach is the modified version of what I suggested in the question.

The Basic Approach

Here I will refer to something called 'node', it is a class object which will contain the geo information like a place entity's latitude, longitude, maybe dimension too, and its full address.

If the address of the entity is '101 C, Pearl Street, New York, USA', then this means our data structure will have at least four nodes for - '101 C', 'Pearl Street', 'New York' and 'USA'. Each node will have a name and one address part. For '101 C', name will be '101 C' and address will be 'Pearl Street, New York, USA'.

The basic idea is to have a search tree of these nodes, where node name will be used as the key for the search. We may get multiple matches, so later we need to rank the results on how good the node's address match with the queried one.

                                    EARTH
                                      |
                ---------------------------------------------
                |                                           |
               USA                                        INDIA
                |                                           |
        ---------------------------                     WEST BENGAL
        |                         |                         |
     NEW YORK                 CALIFORNIA                 KOLKATA
        |                         |                         |
   ---------------            PEARL STREET              BARA BAZAR
   |             |                                          |
PEARL STREET   TIME SQUARE                                 101 C
   |             |
  101 C         101 C

Suppose we have a geographic data as above. So, a search for '101 C, NEW YORK' will not only return the '101 C' nodes in 'NEW YORK' but also the one in 'INDIA'. This is because the algo will only use the name, i.e. '101 C' here, for searching the nodes. Later we can grade the the quality of the result by measuring the difference of the node's address from the queried address. We are not using an exact match since the user is allowed to provide incomplete address, as in this case.

Grading Search Result

To grade the quality of the result we can use Longest Common Subsequence. The 'Omission' and 'Surplus' cases are well taken care of in this algo.

It is best if I let the code do the talking. Below is a Python implementation tailored for this purpose.

def _lcs_diff_cent(s1, s2):
    """
    Calculates Longest Common Subsequence Count Difference in percentage between two strings or lists.

    LCS reference: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem.
    Returns an integer from 0-100. 0 means that `s1` and `s2` have 0% difference, i.e. they are same.
    """
    m = len(s1)
    n = len(s2)

    if s1 == s2:
        return 0
    if m == 0: # When user given query is empty then that is like '*'' (match all)
        return 0
    if n == 0:
        return 100

    matrix = [[0] * (n + 1)] * (m + 1)
    for i in range(1, m+1):
        for j in range(1, n+1):
            if s1[i-1] == s2[j-1]:
                matrix[i][j] = matrix[i-1][j-1] + 1
            else:
                matrix[i][j] = max(matrix[i][j-1], matrix[i-1][j])

    return int( ( 1 - float(matrix[m][n]) / m ) * 100 )

Optimized Approach

I bailed out on the above (basic) approach since it forced redundancy, and it could not cut leverage the fact that if user has provided 'USA' in his query then we need not look into nodes in 'INDIA'.

This optimized approach addresses both the above problems to quite an extent. The solution is not to have one big search tree. We can partition the search space into say 'USA' and 'INDIA'. Later we can further repartition those search spaces state-wise. This is what I call 'slicing'.

In the below diagram - SearchSlice represents a 'slice', and SearchPool represents a search tree.

                            SearchSlice()
                                  |
            ---------------------------------------------
            |                                           |
        SearchSlice(USA)                           SearchSlice(INDIA)
            |                                           |
    ---------------------------                  SearchPool(WEST BENGAL)
    |                         |                   |
 SearchPool(NEW YORK)     SearchPool(CALIFORNIA)  |- KOLKATA
    |                         |                   |- BARA BAZAR, KOLKATA
    |- PEARL STREET           |- PEARL STREET     |- 101 C, BARA BAZAR, KOLKATA
    |- TIME SQUARE
    |- 101 C, PEARL STREET
    |- 101 C, TIME SQUARE

Few key points to notice above...

Each slice is only a single level deep. Well this not really apparent above.
Sliced level's name does not appear in the address of its children. For example, SearchSlice(USA) maintains a slice of states in 'USA'. So, nodes under 'NEW YORK' does not include the name 'NEW YORK' or 'USA' in their address. Same goes for other regions too. The hierarchy relation implicitly defines the full address.
'101 C' address includes its parent's name too since they are not sliced.

Scaling Possibilities

Where there is a bucket (pool), there is an implicit scaling possibility. We (say) divide geo data for 'USA' into two groups. Both can be on different systems. So, it is perfectly fine if 'NEW YORk' pool is on System A but 'CALIFORNIA' pool is on System B, since they do not share any data, except for the parents of course.

Here is the caveat. We need to duplicate the parents which will always be a slice. Since slices are meant to be limited in number so the hierarchy won't be too deep, so it should not be too redundant to duplicate them.

The Working Code

Please refer to my GitHub for a working demo Python code.

Selhorst answered 15/4, 2012 at 15:19 Comment(0)

how about using a key-value storage map and fulltext search.

key for location string
value for location_level and lat&lon data.
search by:
- split user input string to single locations words （not only by comma)
- search each words in map
- return the lat&lon of smallest location level

python.dict,memcached,mongodb .... will meet ur need.

if u have too many location words, split location_level as a new map, two searches will speed up
forget location levels, traet as fulltext search
huge data? hash key to short string or numbers

some questiongs to consider:

how to store the data in database
how to init your searching tree from data, if any
how to extend/edit searching tree in runtime
fault-tolerant for input/storage
storage space > speed? or speed > storage?

so, more usable test case for user input

101 C, Time Square, New York, US
101 C, Pearl street, New York, US

101 C, Time Square, SomeCity, Mars
101 C
101 C, US
101 C, New York, US

101 C, New York, Time Square, US

North Door, 101 C, Time Square, New York, US
South Door, 101 C, Time Square, New York, US

for the situation:

fast-speed with huge-data;
full-fault-tolerant;
easy-adjust with storage and runtime

the best solution:(also complex-est)

flat key-value map storage
fulltext search
- or hash key with B-tree search

your program/website maybe able to run as fast as google.

Champ answered 12/4, 2012 at 13:15 Comment(2)

You mean key would be full location string? Please note that the 'full location' as per the data may not be actually the full address. (Please see 'Update 3'). – Selhorst 12/4, 2012 at 13:59

@Selhorst I am making things too complicated. u got ur runable solution. – Champ 16/4, 2012 at 8:23

If you try to create a data structure for this problem, I think you will have data redundancy. Rather you can use tree/graph & try to implement a search algorithm that searches the words from user's input against the node values. Fuzzy matching can help you to generate most likely results, and you can suggest/show user a top few of them according to the confidence level of their resemblance-quotent.

This can also take care of misspells etc.

Yoshi answered 12/4, 2012 at 14:18 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++