The tricky part about clustering two different types of information (location and rating) is determining how they should relate to each other. It's simple to ask when it is just one domain and you are comparing the same units. My approach would be to look at how to relate rows within a domain and then determine some interaction between the domains. This could be done using scaling options like MinMaxScaler mentioned, however, I think this is a bit heavy handed and we could use our knowledge of the domains to cluster better.
Handling Location
Location distance is best handled directly as this has real world meaning that we can precalculate distances for. The meaning of meters apart is direct to what we
You could use the scaling option mentioned in the previous answer but this risks distorting the location data. For example, if you have a long and thin set of locations, MinMaxScaling would give more importance to variation on the thin axis than the long axis. If you are going to use scaling, do it on the computed distance matrix, not on the lat lon themselves.
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
points_in_radians = df[['business_lat','business_lng']].apply(np.radians).values
distances_in_km = haversine_distances(points_in_radians) * 6371
Adding in Rating
We can think of the problem through asking a couple of questions that relate rating to distance. We could ask, how different must ratings be to separate observations in the same place? What is the meter difference to rating difference ratio? With an idea of ratio, we can calculate another distance matrix for the rating difference for all observations and use this to scale or add on the original location distance matrix or we could increase the distance for every gap in rating. This location-plus-ratings-difference matrix can then be clustered on.
from sklearn.metrics.pairwise import euclidean_distances
added_km_per_rating_gap = 1
rating_distances = euclidean_distances(df[['business_rating']].values) * added_km_per_rating_gap
We can then simply add these together and cluster on the resulting matrix.
from sklearn.cluster import DBSCAN
distance_matrix = rating_distances + distances_in_km
clustering = DBSCAN(metric='precomputed', eps=1, min_samples=2)
clustering.fit(distance_matrix)
What we have done is cluster by location, adding a penalty for ratings difference. Making that penalty direct and controllable allows for optimisation to find the best clustering.
Testing
The problem I'm finding is that (with my test data at least) DBSCAN has a tendency to 'walk' from observation to observation forming clusters that either blend ratings together because the penalty is not high enough or separates into single rating groups. It might be that DBSCAN is not suitable for this type of clustering. If I had more time, I would look for some open data to test this on and try other clustering methods.
Here is the code I used to test. I used the square of the ratings distance to emphasise larger gaps.
import random
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=6, cluster_std=0.60, random_state=0)
ratings = np.array([random.randint(1,4) for _ in range(len(X)//2)] \
+[random.randint(2,5) for _ in range(len(X)//2)]).reshape(-1, 1)
distances_in_km = euclidean_distances(X)
rating_distances = euclidean_distances(ratings)
def build_clusters(multiplier, eps):
rating_addition = (rating_distances ** 2) * multiplier
distance_matrix = rating_addition + distances_in_km
clustering = DBSCAN(metric='precomputed', eps=eps, min_samples=10)
clustering.fit(distance_matrix)
return clustering.labels_