Clustering Lat/Longs in a Database

I'm trying to see if anyone knows how to cluster some Lat/Long results, using a database, to reduce the number of results sent over the wire to the application.

There are a number of resources about how to cluster, either on the client side or on the server (application) side .. but not on the database side :(

This is a similar question, asked by a fellow S.O. member. The solutions there are server-side based (i.e. C# code-behind).

Has anyone had any luck or experience solving this in a database? Are there any database gurus out there who are after a hawt and sexy DB challenge?

please help :)

EDIT 1: Clarification - by clustering, I'm hoping to group x number of points into a single point, for an area. So, if I say cluster everything in a 1 mile / 1 km square, then all the results in that 'square' are grouped into a single result (say ... the middle of the square).

EDIT 2: I'm using MS SQL 2008, but I'm open to hearing about solutions in other DBs.

Libava answered 1/12, 2008 at 4:36 Comment(4)
What are you looking for exactly--a reduced set of lat/long points that represent the data set well, a set of points near a given "test" point, or something else entirely?Yugoslavia
Added clarification to the opening post.Libava
I'm having the same issue. Did u find a solution?Aluino
@Aluino try googling for openlayers (it's a mapping JS library) and seeing if that can cluster.Libava

I'd probably use a modified* version of k-means clustering using the cartesian (e.g. WGS-84 ECF) coordinates for your points. It's easy to implement, converges quickly, and adapts to your data no matter what it looks like. Plus, you can pick k to suit your bandwidth requirements (note that k-means controls the number of clusters, not how many points land in each one).

I'd make a table of cluster centroids, and add a field to the original data table to indicate which cluster it belongs to. You'd obviously want to update the clustering periodically if your data is at all dynamic. I don't know if you could do that with a stored procedure & trigger, but perhaps.

*The "modification" would be to adjust the length of the computed centroid vectors so they'd be on the surface of the earth. Otherwise you'd end up with a bunch of points with negative altitude (when converted back to LLH).
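A rough sketch of what that modified k-means could look like (in Ruby, since that's the language already used elsewhere on this page): points become unit vectors on the sphere, and each recomputed centroid is renormalised back onto the surface, which is the "modification" described above. The seeding, function names, and sample data are all illustrative assumptions, not the answerer's code.

```ruby
# Sketch of k-means on spherical unit vectors, with centroids renormalised
# back onto the sphere. Illustrative only: naive deterministic seeding.
DEG = Math::PI / 180.0

# lat/lon in degrees -> unit cartesian vector
def to_vec(lat, lon)
  [Math.cos(lat * DEG) * Math.cos(lon * DEG),
   Math.cos(lat * DEG) * Math.sin(lon * DEG),
   Math.sin(lat * DEG)]
end

# unit cartesian vector -> lat/lon in degrees
def to_lat_lon(v)
  [Math.asin(v[2]) / DEG, Math.atan2(v[1], v[0]) / DEG]
end

def normalise(v)
  len = Math.sqrt(v.sum { |c| c * c })
  v.map { |c| c / len }
end

def kmeans_sphere(points, k, iterations = 20)
  vecs = points.map { |lat, lon| to_vec(lat, lon) }
  centroids = (0...k).map { |i| vecs[i * vecs.size / k] }  # spread-out seeds
  labels = nil
  iterations.times do
    # assign each point to its nearest centroid (squared euclidean distance)
    labels = vecs.map do |v|
      (0...k).min_by { |i| v.zip(centroids[i]).sum { |a, b| (a - b)**2 } }
    end
    # recompute each centroid, then push it back onto the unit sphere
    (0...k).each do |i|
      members = vecs.each_index.select { |j| labels[j] == i }
      next if members.empty?
      mean = (0..2).map { |d| members.sum { |j| vecs[j][d] } / members.size }
      centroids[i] = normalise(mean)
    end
  end
  [centroids.map { |c| to_lat_lon(c) }, labels]
end
```

Run against two tight groups of points (say, two near Paris and two near London) with k = 2, it assigns each pair to its own cluster and returns centroids sitting on the surface rather than inside the earth.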

Yugoslavia answered 1/12, 2008 at 5:23 Comment(1)
kewlies! ... er .. i have no idea how to do this .. but i sorta get what you're saying. hmm.. the data isn't too dynamic. but i'll still have to think about how (and how often) i'll need to calc this stuff. hmm. so tough!Libava

If you're clustering on geographic location, and I can't imagine it being anything else :-), you could store the "cluster ID" in the database along with the lat/long co-ordinates.

What I mean by that is to divide the world map into (for example) a 100x100 matrix (10,000 clusters) and each co-ordinate gets assigned to one of those clusters.

Then, you can detect very close coordinates by selecting those in the same square and moderately close ones by selecting those in adjacent squares.

The size of your squares (and therefore the number of them) will be decided by how accurate you need the clustering to be. Obviously, if you only have a 2x2 matrix, you could get some clustering of co-ordinates that are a long way apart.

You will always have the edge cases such as two points close together but in different clusters (one northernmost in one cluster, the other southernmost in another) but you could adjust the cluster size OR post-process the results on the client side.
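A sketch of this grid idea in plain Ruby (in the database this would just be a computed cluster-ID column plus a GROUP BY; the 100x100 grid is the example figure from above, and collapsing each cell to its mean point is what the question's EDIT 1 asks for — all names here are assumptions):

```ruby
# Fixed-grid clustering sketch: world map divided into GRID x GRID cells,
# each point assigned a cluster ID from its cell.
GRID = 100

def cluster_id(lat, lon, grid = GRID)
  row = (((lat +  90.0) / 180.0) * grid).floor.clamp(0, grid - 1)
  col = (((lon + 180.0) / 360.0) * grid).floor.clamp(0, grid - 1)
  row * grid + col
end

# Collapse each occupied cell to a single representative point (the mean),
# i.e. the Ruby equivalent of GROUP BY cluster_id with AVG(lat), AVG(lon).
def collapse(points, grid = GRID)
  points.group_by { |lat, lon| cluster_id(lat, lon, grid) }
        .map do |id, pts|
          [id,
           pts.sum { |p| p[0] } / pts.size.to_f,
           pts.sum { |p| p[1] } / pts.size.to_f]
        end
end
```

Two points a few hundred metres apart land in the same cell and come back as one averaged result, while a point in another city gets its own cell. Adjacent-square lookups then just mean querying the eight neighbouring cluster IDs.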

Greenbrier answered 1/12, 2008 at 4:48 Comment(1)
With MS SQL Server 2008, they have spatial indexes. Maybe one of these indexes could be harnessed as the clusterID, then group results into this clusterID index?Libava

I did a similar thing for a geographic application where I wanted to ensure I could cache point sets easily. My geohashing code looks like this:

# Combine the two cell indices into a single integer chunk ID
# (the longitude index occupies the high bits, the latitude index the low 12).
def compute_chunk(latitude, longitude)
  (floor_lon(longitude) * 0x1000) | floor_lat(latitude)
end

# Longitude bucketed into 0.1-degree cells, shifted to be non-negative.
def floor_lon(longitude)
  ((longitude + 180) * 10).to_i
end

# Latitude bucketed into 0.1-degree cells, shifted to be non-negative.
def floor_lat(latitude)
  ((latitude + 90) * 10).to_i
end

Everything got really easy from there. I had some code for grabbing all of the chunks from a given point to a given radius that would translate into a single memcache multiget (and some code to backfill that when it was missing).
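That chunk-fetching code isn't shown in the answer, but it might look something like the following guess, which reuses the functions above. The name `chunks_near` and the degree-based radius are my assumptions, not the answerer's actual API (the helper functions are repeated here so the snippet stands alone).

```ruby
# Copied from the answer above.
def compute_chunk(latitude, longitude)
  (floor_lon(longitude) * 0x1000) | floor_lat(latitude)
end

def floor_lon(longitude)
  ((longitude + 180) * 10).to_i
end

def floor_lat(latitude)
  ((latitude + 90) * 10).to_i
end

# Hypothetical: enumerate every chunk ID overlapping a square window of
# +/- radius_deg degrees around a point -- one memcache key per chunk.
def chunks_near(latitude, longitude, radius_deg)
  step = 0.1  # the grid above uses 0.1-degree cells
  chunks = []
  (-radius_deg).step(radius_deg, step) do |dlat|
    (-radius_deg).step(radius_deg, step) do |dlon|
      chunks << compute_chunk(latitude + dlat, longitude + dlon)
    end
  end
  chunks.uniq
end
```

The resulting list of chunk IDs is what would be handed to a single memcache multiget, with a backfill pass for any missing keys.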

Rubbico answered 1/12, 2008 at 4:52 Comment(2)
Hi Dustin - i don't get it. is this some type of DB sql code? or some php or something? i can't see how it's related to a db?Libava
My app is written in ruby and this is library code. I use it to compute a hash for a given latitude and longitude and store that in a column along with the point. Every point edit recalculates the hash and invalidates the cache of all points for a given hash.Rubbico

For movielandmarks.com I used the clustering code from Mike Purvis, one of the authors of Beginning Google Maps Applications with PHP and AJAX. It builds trees of clusters/points for different zoom levels using PHP and MySQL, storing it in the database so that recall is very fast. Some of it may be useful to you even if you are using a different database.

Garish answered 1/12, 2008 at 5:38 Comment(1)
Posting this in case someone gets here from google as i did.. you can find the thread mentioned above using archive.org - it includes links to source files. Appears the heavy lifting is done via php tho - perhaps not the best approach, but worth a read.Nuke

Why not test multiple approaches?

  1. translate the Weka library to the .NET CLI with IKVM.NET
  2. merge the assembly built from your code with weka.dll (use ILMerge) and load it into your database

Make some tests, that is. No single clustering algorithm works better than all the others.

Troostite answered 15/1, 2010 at 11:43 Comment(2)
whoa dude. i have no idea what you mean :(Libava
There are many algorithms for clustering. Every algorithm has its own parameters. It's quite impossible to provide the best answer. Test some clustering algorithms (k-means, fuzzy-c means etc) from weka library. To not translate the whole code, you can embed an assembly that includes weka in your database server (sql 2008 accepts .NET assemblies). Thus, you can test multiple variants.Troostite

I believe you can use MSSQL's spatial data types. If they are similar to other spatial data types I know, they will store your points in a tree of rectangles, and then you can go to the lower-resolution rectangles to get implicit clusters.

Almeta answered 1/12, 2008 at 6:47 Comment(2)
I'm currently using the GEOGRAPHY type with a spatial index. But I'm not sure how to use that to get a grouped/clustered result. Do u have some sql code examples?Libava
I was wrong in assuming that GEOGRAPHY explicitly gives you a tree. I believe you can use Drew Hall's suggestion, using GEOGRAPHY.STDistance as the distance function needed for k-means.Almeta

If you end up wanting to explore Geohashes (which were invented at almost exactly the same time you posted this question), here's a more fleshed-out implementation of Geohash-related functions for SQL Server's T-SQL that you might find interesting.

I have used the Integer version of the Geohash extensively to cluster results to reduce data sent to a client for a limited viewport.
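To give a feel for the integer-geohash trick (in Ruby, the language used elsewhere on this page, rather than the linked T-SQL): interleave the bits of the two coordinates into one integer, then cluster by grouping on the hash right-shifted to a coarser precision. The bit depth and function names below are my assumptions, not taken from the linked implementation.

```ruby
# Integer geohash sketch: lat and lon each quantised to BITS bits, then
# bit-interleaved into a single integer. Dropping low-order bit pairs
# yields a coarser cell, so GROUP BY (hash >> n) clusters the points.
BITS = 26  # bits per coordinate; 52-bit combined hash

def geohash_int(lat, lon, bits = BITS)
  lat_i = (((lat +  90.0) / 180.0) * (1 << bits)).to_i.clamp(0, (1 << bits) - 1)
  lon_i = (((lon + 180.0) / 360.0) * (1 << bits)).to_i.clamp(0, (1 << bits) - 1)
  h = 0
  (bits - 1).downto(0) do |b|   # interleave: lon bit, then lat bit
    h = (h << 1) | ((lon_i >> b) & 1)
    h = (h << 1) | ((lat_i >> b) & 1)
  end
  h
end

# Cluster key at a coarser precision: keep only the top bit pairs.
def cluster_key(hash, keep_bit_pairs)
  hash >> (2 * (BITS - keep_bit_pairs))
end
```

At a coarse precision two nearby points share a cluster key while a distant one doesn't, so the server can group and collapse rows per key before anything crosses the wire, which is the bandwidth-limiting behaviour described above.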

Suzisuzie answered 6/3, 2021 at 22:39 Comment(0)
