Problem: We need a big data method for calculating distances between points. We outline what we'd like to do below with a five-observation dataframe. However, this particular method is infeasible as the number of rows gets large (> 1 million). In the past, we've used SAS to do this kind of analysis, but we'd prefer R if possible. (Note: I'm not going to show code because, while I outline a way to do this on smaller datasets below, this is basically an impossible method to use with data on our scale.)
We start with a dataframe of stores, each of which has a latitude and longitude (though this is not a spatial file, nor do we want to use a spatial file).
# you can think of x and y in this example as Cartesian coordinates
stores <- data.frame(id = 1:5,
x = c(1, 0, 1, 2, 0),
y = c(1, 2, 0, 2, 0))
stores
id x y
1 1 1 1
2 2 0 2
3 3 1 0
4 4 2 2
5 5 0 0
For each store, we want to know the number of stores within x distance. In a small dataframe, this is straightforward. Create another dataframe of all coordinates, merge back in, calculate distances, create an indicator if the distance is less than x and add up the indicators (minus one for the store itself, which is at distance 0). This would result in a dataset that looks like this:
id x y s1.dist s2.dist s3.dist s4.dist s5.dist
1: 1 1 1 0.000000 1.414214 1.000000 1.414214 1.414214
2: 2 0 2 1.414214 0.000000 2.236068 2.000000 2.000000
3: 3 1 0 1.000000 2.236068 0.000000 2.236068 1.000000
4: 4 2 2 1.414214 2.000000 2.236068 0.000000 2.828427
5: 5 0 0 1.414214 2.000000 1.000000 2.828427 0.000000
When you count (arbitrarily) under 1.45 as "close," you end up with indicators that look like this:
# don't include the store itself in the total
id x y s1.close s2.close s3.close s4.close s5.close total.close
1: 1 1 1 1 1 1 1 1 4
2: 2 0 2 1 1 0 0 0 1
3: 3 1 0 1 0 1 0 1 2
4: 4 2 2 1 0 0 1 0 1
5: 5 0 0 1 0 1 0 1 2
The final product should look like this:
id total.close
1: 1 4
2: 2 1
3: 3 2
4: 4 1
5: 5 2
All advice appreciated.
Thank you very much
data.table
package is your best bet. I'm not sure what metric you're looking for between coordinates (i.e., haversine, Vincenty, euclidean, etc.) or the scale (i.e., miles, kilometers, etc.), I can't offer much more than a package name! – Swisher