I am trying to create a Python script which will take an address as input and will spit out its latitude and longitude, or latitudes and longitudes in case of multiple matches, quite like Nominatim.
So, the possible inputs and outputs could be:
1. In: New York, USA => Out: New York (lat:x1 lon:y1)
2. In: New York => Out: New York (lat:x1 lon:y1)
3. In: Pearl Street, New York, USA => Out: Pearl Street (lat:x2 lon:y2)
4. In: Pearl Street, USA => Out: Pearl Street (lat:x2 lon:y2), Pearl Street (lat:x3 lon:y3)
5. In: Pearl Street => Out: Pearl Street (lat:x2 lon:y2), Pearl Street (lat:x3 lon:y3)
6. In: 103 Alkazam, New York, USA => Out: New York (lat:x1 lon:y1)
In 6 above, New York was returned since no place was found with address 103 Alkazam, New York, USA, but it could at least find New York, USA.
Initially I thought of building a tree representing the hierarchy relation where siblings are sorted alphabetically. It could have been like:-
GLOBAL
|
+-- USA
|   |
|   +-- CALIFORNIA
|   |   +-- PEARL STREET
|   |   +-- ...
|   |
|   +-- NEW YORK
|   |   +-- PEARL STREET
|   |   +-- ...
|   |
|   +-- ...
|
+-- ...
But the problem is that the user can provide an incomplete address, as in 2, 4 and 5.
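To make the limitation concrete, here is a minimal sketch of the tree above with a strict top-down lookup. The `Place` class, the `strict_lookup` helper and the coordinate values are hypothetical placeholders, and a plain dict stands in for the alphabetically sorted siblings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Place:
    """One node of the hierarchy (hypothetical structure)."""
    name: str
    coords: Optional[Tuple[float, float]] = None  # (lat, lon), placeholder values
    children: Dict[str, "Place"] = field(default_factory=dict)

    def add(self, name: str, coords=None) -> "Place":
        child = Place(name, coords)
        self.children[name] = child
        return child


def strict_lookup(root: Place, path: List[str]) -> Optional[Place]:
    """Walk down from the root, one level per path item; None if a level is missing."""
    node = root
    for name in path:
        node = node.children.get(name)
        if node is None:
            return None
    return node


root = Place("GLOBAL")
usa = root.add("USA")
ny = usa.add("NEW YORK", (1.0, 1.0))
ca = usa.add("CALIFORNIA")
ny.add("PEARL STREET", (2.0, 2.0))
ca.add("PEARL STREET", (3.0, 3.0))

# A fully qualified path works.
print(strict_lookup(root, ["USA", "NEW YORK", "PEARL STREET"]).coords)
# A path that skips NEW YORK fails, which is exactly the problem.
print(strict_lookup(root, ["USA", "PEARL STREET"]))
```

With this layout every query has to name every level from the top, so an input like Pearl Street, USA finds nothing.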
So, I next thought of using a search tree and storing the fully qualified address in each node. But this is also quite bad, since:
- It would store highly redundant data in each node. Since the data set will be really big, space conservation matters.
- It would not be able to leverage the fact that the user has narrowed down the search space.
I have one additional requirement: I need to detect misspellings. I guess that will have to be dealt with as a separate problem, and each node name can be treated as a generic string.
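As a rough illustration of treating node names as generic strings, the sketch below uses the standard library's `difflib.get_close_matches` to map a misspelled level to a known name. The `KNOWN_NAMES` list, the `correct` helper and the 0.8 cutoff are assumptions made for the example, not part of my data or a tuned setting.

```python
import difflib

# Hypothetical list of every distinct node name in the data.
KNOWN_NAMES = ["USA", "NEW YORK", "CALIFORNIA", "PEARL STREET"]


def correct(name, known=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name.upper(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(correct("Pearl Stret"))  # PEARL STREET
print(correct("Alkazam"))      # None
```

This compares the input against every known name, so for a really big data set it would probably need a smarter index; it only shows that spelling correction can be kept separate from the hierarchy search.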
Update 1
A little elaboration. The input would be a list in which the item at a lower index is a parent of the item at a higher index; they may or may not be immediate parent and child. So for query 1, the input would be ["USA", "NEW YORK"]. So, it is perfectly fine that USA, New York returns no result.
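For clarity, this is roughly how a comma-separated address could be turned into that list. The `to_query` helper is hypothetical, and the upper-casing is only an assumption so that the items match names stored like "NEW YORK".

```python
def to_query(address):
    """Split 'child, parent, grandparent' and return it parent-first."""
    parts = [part.strip().upper() for part in address.split(",")]
    # Reverse so that the broadest level comes first; drop empty pieces.
    return [part for part in reversed(parts) if part]


print(to_query("New York, USA"))                # ['USA', 'NEW YORK']
print(to_query("Pearl Street, New York, USA"))  # ['USA', 'NEW YORK', 'PEARL STREET']
```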
The user should be able to locate a building if they have the address and our data is that detailed.
Update 2 (Omission Case)
If the user queries Pearl Street, USA, our algo should be able to locate the address, since it knows that Pearl Street has New York as its parent and that USA is New York's parent.
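One way to picture the omission case: index the nodes by name, keep a parent pointer in each node, and accept a candidate when the query items appear, in order, somewhere in its ancestor chain. This is only a sketch under those assumptions; `Node`, `ancestors`, `is_subsequence`, `find` and the coordinates are hypothetical.

```python
from collections import defaultdict


class Node:
    def __init__(self, name, parent=None, coords=None):
        self.name = name
        self.parent = parent
        self.coords = coords  # (lat, lon), placeholder values


def ancestors(node):
    """Names from the top-level ancestor down to the node itself."""
    chain = []
    while node is not None:
        chain.append(node.name)
        node = node.parent
    return chain[::-1]


def is_subsequence(query, chain):
    """True if every query item occurs in chain, in the same order."""
    remaining = iter(chain)
    return all(item in remaining for item in query)


def find(index, query):
    """All nodes named query[-1] whose ancestor chain contains query in order."""
    candidates = index.get(query[-1], [])
    return [n for n in candidates if is_subsequence(query, ancestors(n))]


usa = Node("USA")
ny = Node("NEW YORK", usa, (1.0, 1.0))
ca = Node("CALIFORNIA", usa)
pearl_ny = Node("PEARL STREET", ny, (2.0, 2.0))
pearl_ca = Node("PEARL STREET", ca, (3.0, 3.0))

index = defaultdict(list)  # name -> all nodes with that name
for node in (usa, ny, ca, pearl_ny, pearl_ca):
    index[node.name].append(node)

# NEW YORK is omitted, so both Pearl Streets under USA match.
print([n.coords for n in find(index, ["USA", "PEARL STREET"])])
# Naming NEW YORK narrows the result to one.
print([n.coords for n in find(index, ["NEW YORK", "PEARL STREET"])])
```

Each node stores only its own name and a parent reference, so the fully qualified address is never duplicated, and every extra level the user supplies filters the candidate list further.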
Update 3 (Surplus Case)
Suppose the user queries for 101 C, Alley A, Pearl Street, New York. Also suppose our data knows of 101 C but not of Alley A. According to it, 101 C is an immediate child of Pearl Street. Even in this case it should be able to locate the address.
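A possible sketch of the surplus case: drop every query level that the data has never heard of, then match what is left in order against each candidate's ancestor chain. The `PLACES` table, the helper names and the ids are hypothetical, and the sketch only covers levels that are absent from the data entirely, not levels that exist somewhere else in the hierarchy.

```python
# Hypothetical data: node id -> (name, parent id or None)
PLACES = {
    1: ("USA", None),
    2: ("NEW YORK", 1),
    3: ("PEARL STREET", 2),
    4: ("101 C", 3),
}


def chain(node_id):
    """Names from the top-level ancestor down to the node itself."""
    names = []
    while node_id is not None:
        name, node_id = PLACES[node_id]
        names.append(name)
    return names[::-1]


def in_order(query, names):
    """True if every query item occurs in names, in the same order."""
    remaining = iter(names)
    return all(item in remaining for item in query)


def locate(query):
    """Ids of the nodes matching the query after unknown levels are dropped."""
    known = {name for name, _ in PLACES.values()}
    kept = [item for item in query if item in known]  # surplus levels removed
    if not kept:
        return []
    return [node_id for node_id, (name, _) in PLACES.items()
            if name == kept[-1] and in_order(kept, chain(node_id))]


# ALLEY A is unknown, so it is skipped and 101 C is still found.
print(locate(["NEW YORK", "PEARL STREET", "ALLEY A", "101 C"]))  # [4]
# 103 ALKAZAM is unknown, so the best known level, NEW YORK, is returned.
print(locate(["USA", "NEW YORK", "103 ALKAZAM"]))  # [2]
```

The same filtering also gives the fallback from example 6, since an unknown last level simply falls away and the query ends at New York.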
New York. Our data may or may not be very detailed. So, in this case the user said, fetch me Pearl Street which is in USA, which should work since our data knows that although it is not directly in USA, via New York it is. – Selhorst
101 C just like a name; a string. The number part has no special meaning. The point of Updates 2 and 3 is that the user might miss/skip some levels, and our data might have some levels missing too. So from the code's perspective the second case is like the user having entered some non-existent levels. – Selhorst