Finding a density peak / cluster centre in a 2D grid / point process

I have a dataset with minute-by-minute GPS coordinates recorded by a person's cellphone, i.e. 1440 rows of lon/lat values. Based on these data I would like a point estimate (a lon/lat value) of where the participant's home is. Let's assume that home is the single location where they spend most of their time in a given 24-hour interval. Furthermore, the GPS sensor usually has quite high accuracy, but sometimes it is completely off, resulting in gigantic outliers.

I think the best way to go about this is to treat it as a point process and use 2D density estimation to find the peak. Is there a native way to do this in R? I looked into kde2d (MASS), but it didn't really seem to do the trick. kde2d creates a 25x25 grid over the data range with density values. However, in my data the person can easily travel 100 miles or more per day, so these grid cells are generally far too coarse. I could shrink the bounding box or use a much finer grid, but I am sure there must be a better way to get a point estimate.
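
For illustration, a minimal sketch of the brute-force finer-grid route with MASS::kde2d on simulated data (the 500 x 500 grid and the default bandwidth are just guesses, not a recommendation):

library(MASS)

set.seed(1)
lon <- c(rnorm(1000, 5.120, 0.001), rnorm(440, 5.300, 0.010))   # fake "home" plus "elsewhere"
lat <- c(rnorm(1000, 52.090, 0.001), rnorm(440, 52.200, 0.010))

dens <- kde2d(lon, lat, n = 500)                          # 500 x 500 grid instead of the default 25 x 25
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
c(lon = dens$x[peak[1]], lat = dens$y[peak[2]])           # location of the highest-density cell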

Dodds asked 6/6, 2012 at 4:12 Comment(5)
If you are looking for the centroid of the coordinates, then a clustering algorithm might be a reasonable approach: kmeans with one center? The flexclust package has a kmedians option, which would alleviate some of the problems with outliers. – Tarsometatarsus
Well, I am a bit worried that kmeans will cluster too many points together (e.g. find a single cluster containing the home, but also the nearby supermarket and coffee shop), and then the center of this cluster will actually be off. It would be better to find a high-density location directly, without classifying points into groups. – Dodds
Yes, it is perhaps too simplistic a solution. You could define a bounding box by looking at the range of locations visited and then choose the number of grid points in kde2d to reflect a sensible spatial resolution. – Tarsometatarsus
You could round the data (say, to hundreds of meters, assuming that the house is less than 100 meters wide) and use table to find the position with the most points (see the sketch after these comments). – Porter
See residenceTime in adehabitatLT for another option. – Lafollette
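
A minimal sketch of the rounding-and-table idea from the comment above, reusing the simulated lon and lat vectors from the kde2d sketch (three decimal places is roughly 100 m of latitude; treat the resolution as a tuning knob):

bins <- table(lon = round(lon, 3), lat = round(lat, 3))   # count fixes per rounded cell
home <- which(bins == max(bins), arr.ind = TRUE)[1, ]     # cell with the most fixes
c(lon = as.numeric(rownames(bins)[home[1]]),
  lat = as.numeric(colnames(bins)[home[2]]))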

There are "time spent" functions in the trip package (I'm the author). You can create objects from the track data that understand the underlying track process over time, and simply process the points assuming straight line segments between fixes. If "home" is where the largest value pixel is, i.e. when you break up all the segments based on the time duration and sum them into cells, then it's easy to find it. A "time spent" grid from the tripGrid function is a SpatialGridDataFrame with the standard sp package classes, and a trip object can be composed of one or many tracks.

Using rgdal you can easily transform coordinates to an appropriate map projection if lon/lat is not appropriate for your extent, but it makes no difference to the grid/time-spent calculation of line segments.
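
For example, a hedged sketch of reprojecting lon/lat points with rgdal before building the trip (the equal-area projection parameters below are placeholders, not a recommendation):

library(sp)
library(rgdal)

pts <- data.frame(lon = c(5.12, 5.13), lat = c(52.09, 52.10),
                  tms = Sys.time() + 1:2, id = "a")
coordinates(pts) <- ~lon + lat
proj4string(pts) <- CRS("+proj=longlat +datum=WGS84")
pts_m <- spTransform(pts, CRS("+proj=laea +lat_0=52 +lon_0=5 +units=m"))   # now in metres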

There is a simple speedfilter to remove fixes that imply implausibly fast movement, but it is very simplistic and can introduce new problems; in general, updating or filtering tracks for unlikely movement can be very complicated. (In my experience a basic time-spent gridding gets you as good an estimate as many sophisticated models that just open up new complications.) The filter works with Cartesian or long/lat coordinates, using tools in sp to calculate distances (long/lat is reliable, whereas a poor map projection choice can introduce problems - over short distances like humans on land it's probably no big deal).
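
For example (a sketch only; I believe max.speed is in km/h and that speedfilter returns a logical vector of fixes to keep, but check ?speedfilter), with a trip object tr like the one built below:

keep <- speedfilter(tr, max.speed = 150)   # flag fixes implying > 150 km/h as bad
tr_clean <- tr[keep, ]                     # subsetting should still return a trip object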

(The function tripGrid calculates the exact components of the straight line segments using pixellate.psp, but that detail is hidden in the implementation).

In terms of data preparation, trip is strict about a sensible sequence of times and will prevent you from creating an object if the data have duplicates, are out of order, etc. There is an example of reading data from a text file in ?trip, and a very simple example with (really) dummy data is:

library(trip)

## ten fixes one second apart, all with the same id, i.e. a single track
d <- data.frame(x = 1:10, y = rnorm(10), tms = Sys.time() + 1:10, id = gl(1, 5))
coordinates(d) <- ~x+y              # promote to a SpatialPointsDataFrame
tr <- trip(d, c("tms", "id"))       # declare which columns hold time and id
g <- tripGrid(tr)                   # SpatialGridDataFrame of time spent per cell

pt <- coordinates(g)[which.max(g$z), ]              # centre of the cell with the most time
image(g, col = c("transparent", heat.colors(16)))   # plot the time-spent grid
lines(tr, col = "black")                            # overlay the track
points(pt[1], pt[2], pch = "+", cex = 2)            # mark the "home" estimate

That dummy track has no overlapping regions, but it shows that finding the max point in "time spent" is simple enough.

Lafollette answered 6/6, 2012 at 7:2 Comment(0)

How about using the location that minimises the sum of squared distances to all the events? This might be close to the supremum of any kernel smoothing, if my brain is working right.

If your data comprise two clusters (home and work), then I think the location will be in the biggest cluster rather than between them. It's not the same as the simple mean of the x and y coordinates.

For an uncertainty on that, jitter your data by whatever your positional uncertainty is (it would be great if you had that value from the GPS, otherwise guess - 50 metres?) and recompute. Do that 100 times, do a kernel smoothing of those locations, and find the 95% contour.

Not rigorous, and I need to experiment with this minimum distance/kernel supremum thing...
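
A minimal sketch of the minimise-and-jitter idea, assuming numeric vectors lon and lat of the fixes (the simulated data from the kde2d sketch in the question would do) and roughly 50 m of positional noise; coordinate-wise quantiles stand in for the kernel-smoothing/contour step:

## the objective from this answer: sum of squared distances to all fixes
## (for squared distances the minimiser is just the centroid; using
## un-squared distances instead gives the more outlier-resistant spatial median)
obj <- function(p) sum((lon - p[1])^2 + (lat - p[2])^2)
fit <- optim(c(median(lon), median(lat)), obj)
fit$par                                       # point estimate

## crude uncertainty: jitter by ~0.0005 degrees (~50 m of latitude) and refit 100 times
reps <- t(replicate(100, {
  jl <- lon + rnorm(length(lon), 0, 0.0005)
  jt <- lat + rnorm(length(lat), 0, 0.0005)
  optim(c(median(jl), median(jt)),
        function(p) sum((jl - p[1])^2 + (jt - p[2])^2))$par
}))
apply(reps, 2, quantile, c(0.025, 0.975))     # rough 95% interval per coordinate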

Diastase answered 6/6, 2012 at 8:18 Comment(1)
No, but thanks for reminding me it's her birthday tomorrow. I meant the location of the maximum of the kernel smoothing surface... Might have time to think about this today... – Diastase

In response to Spacedman: I am pretty sure least squares won't work. Least squares is best known for bowing to the demands of outliers, without giving much weight to things that are "nearby". This is the opposite of what is desired here.

The bisquare estimator would probably work better, in my opinion - but I have never used it. I think it also requires some tuning.

It's more or less a least-squares estimator up to a certain distance from zero, and then the weighting is constant beyond that. So once a point becomes an outlier, its penalty is constant. We don't want outliers to weigh more and more as we move away from them; we would rather give them constant weight and let the optimization focus on fitting the points in the vicinity of the cluster.
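
A hedged sketch of a bisquare location estimate, done per coordinate with an intercept-only MASS::rlm fit (lon and lat are assumed numeric vectors of the fixes, as in the sketches above; a bivariate alternative would be something like MASS::cov.rob):

library(MASS)

## intercept-only robust regressions: Tukey-bisquare estimates of the
## "typical" lon and lat, downweighting outlying fixes
home_lon <- coef(rlm(lon ~ 1, psi = psi.bisquare))
home_lat <- coef(rlm(lat ~ 1, psi = psi.bisquare))
c(lon = unname(home_lon), lat = unname(home_lat))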

Igniter answered 12/2, 2014 at 9:35 Comment(0)
