scikit-learn DBSCAN memory usage

Asked 5/5, 2013 at 5:4 Answered 22/8, 2024 at 11:23

Solved python scikit-learn cluster-analysis data-mining dbscan

UPDATED: In the end, the solution I opted to use for clustering my large dataset was one suggested by Anony-Mousse below. That is, using ELKI's DBSCAN implimentation to do my clustering rather than scikit-learn's. It can be run from the command line and with proper indexing, performs this task within a few hours. Use the GUI and small sample datasets to work out the options you want to use and then go to town. Worth looking into. Anywho, read on for a description of my original problem and some interesting discussion.

I have a dataset with ~2.5 million samples, each with 35 features (floating point values) that I'm trying to cluster. I've been trying to do this with scikit-learn's implementation of DBSCAN, using the Manhattan distance metric and a value of epsilon estimated from some small random samples drawn from the data. So far, so good. (here is the snippet, for reference)

db = DBSCAN(eps=40, min_samples=10, metric='cityblock').fit(mydata)

My issue at the moment is that I easily run out of memory. (I'm currently working on a machine with 16 GB of RAM)

My question is, is DBSCAN calculating the pairwise distance matrix on the fly as it runs, and that's what's gobbling up my memory? (2.5 million ^ 2) * 8 bytes is obviously stupidly large, I would understand that. Should I not be using the fit() method? And more generally, is there a way around this issue, or am I generally barking up the wrong tree here?

Apologies if the answer winds up being obvious. I've been puzzling over this for a few days. Thanks!

Addendum: Also if anyone could explain the difference between fit(X) and fit_predict(X) to me more explicitly I'd also appreciate that--I'm afraid I just don't quite get it.

Addendum #2: To be sure, I just tried this on a machine with ~550 GB of RAM and it still blew up, so I feel like DBSCAN is likely trying to make a pairwise distance matrix or something I clearly don't want it to do. I guess now the big question is how to stop that behavior, or find other methods that might suit my needs more. Thanks for bearing with me here.

Addendum #3(!): I forgot to attach the traceback, here it is,

Traceback (most recent call last):
  File "tDBSCAN.py", line 34, in <module>
    db = DBSCAN(eps=float(sys.argv[2]), min_samples=10, metric='cityblock').fit(mydata)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/base.py", line 329, in fit_predict
    self.fit(X)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 186, in fit
    **self.get_params())
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/cluster/dbscan_.py", line 69, in dbscan
    D = pairwise_distances(X, metric=metric)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 651, in pairwise_distances
    return func(X, Y, **kwds)
  File "/home/jtownsend/.local/lib/python2.6/site-packages/sklearn/metrics/pairwise.py", line 237, in manhattan_distances
    D = np.abs(X[:, np.newaxis, :] - Y[np.newaxis, :, :])
MemoryError

Derick answered 5/5, 2013 at 5:4 Comment(0)

The problem apparently is a non-standard DBSCAN implementation in scikit-learn.

DBSCAN does not need a distance matrix. The algorithm was designed around using a database that can accelerate a regionQuery function, and return the neighbors within the query radius efficiently (a spatial index should support such queries in O(log n)).

The implementation in scikit however, apparently, computes the full O(n^2) distance matrix, which comes at a cost both memory-wise and runtime-wise.

So I see two choices:

You may want to try the DBSCAN implementation in ELKI instead, which when used with an R*-tree index usually is substantially faster than a naive implementation.
Otherwise, you may want to reimplement DBSCAN, as the implementation in scikit apparently isn't too good. Don't be scared of that: DBSCAN is really simple to implement yourself. The trickiest part of a good DBSCAN implementation is actually the regionQuery function. If you can get this query fast, DBSCAN will be fast. And you can actually reuse this function for other algorithms, too.

Update: by now, sklearn no longer computes a distance matrix and can, e.g., use a kd-tree index. However, because of "vectorization" it will still precompute the neighbors of every point, so the memory usage of sklearn for large epsilon is O(n²), whereas to my understanding the version in ELKI will only use O(n) memory. So if you run out of memory, choose a smaller epsilon and/or try ELKI.

Comp answered 5/5, 2013 at 16:35 Comment(6)

Actually it seems that it would not be too hard to improve the sklearn implementation. We have a ball-tree data structure which exactly supports the radius query. I am not very familiar with dbscan so I didn't know it only needed these queries. We should definitely improve there. – Tommietommy 6/5, 2013 at 8:12

I think that the sklearn implementation has significantly improved with sklearn 0.14: The ball-tree implementation now supports a good selection of metrics and DBSCAN has been adapted to not internally compute the entire pairwise distance matrix. So it seems to be an option again, unfortunately haversine distance is still not supported by the pairwise metrics package. Relevant github ticket (beware, the changes are spread out over many pull requests and tickets): github.com/scikit-learn/scikit-learn/issues/1938 – Felt 21/1, 2014 at 22:7

I agree, sklearn has improved its DBSCAN. Still, ELKI remains to be more powerful when it comes to index acceleration and cluster analysis. For example, it also has OPTICS, and other DBSCAN-derivatives. – Comp 22/1, 2014 at 8:49

The problem is ELKI doesn't have good documentation or a 'hello world' example. – Spectrohelioscope 27/1, 2018 at 7:48

I found the tutorials such as thr mouse example on the web site "hello world" enough. And the javadoc is pretty good, too. – Comp 27/1, 2018 at 9:11

isn't one problem with ELKI the fact that it's written in Java? Whereas sci-kit learn is Python. – Tyrus 19/5, 2021 at 4:49

You can do this using scikit-learn's DBSCAN with the haversine metric and ball-tree algorithm. You do not need to precompute a distance matrix.

This example clusters over a million GPS latitude-longitude points with DBSCAN/haversine and avoids memory usage problems:

df = pd.read_csv('gps.csv')
coords = df.as_matrix(columns=['lat', 'lon'])
db = DBSCAN(eps=eps, min_samples=ms, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))

Note that this specifically uses scikit-learn v0.15, as some earlier/later versions seem to require a full distance matrix to be computed, which blows up your RAM real quick. But if you use Anaconda, you can quickly set this up with:

conda install scikit-learn=0.15

Or, create a clean virtual environment for this clustering task:

conda create -n clusterenv python=3.4 scikit-learn=0.15 matplotlib pandas jupyter
activate clusterenv

Bixler answered 2/8, 2016 at 23:10 Comment(1)

confirmed, sklearn v0.15.2 requires far less memory than v0.17.1 to run the same model fit – Jobie 12/8, 2016 at 16:13

This issue with sklearn is discussed here:

https://github.com/scikit-learn/scikit-learn/issues/5275

There are two options presented there;

One is to use OPTICS (which requires sklearn v21+), which is an alternative but closely related algorithm to DBSCAN:

https://scikit-learn.org/dev/modules/generated/sklearn.cluster.OPTICS.html

The others are to precompute the adjacency matrix, or to use sample weights. Some more details on these options can be found under Notes here:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Scenarist answered 4/12, 2018 at 10:6 Comment(1)

Nice nice, I just tried with the OPTICS one and it worked, took around 2 minutes with a ndarray of 43000 lines, with DBSCAN with the same ndarray I was getting the memory crash error. – Vaclav 27/8, 2019 at 17:31

I faced the same problem when i was using old version on sklearn 0.19.1 because the complexity was O(N^2).

But now the problem has been resolved in new version 0.20.2 and no memory error anymore, and the complexity become O(n.d) where d is the average number of neighbours. it's not the ideal complexity but much better than old versions.

Check the notes in this release, to avoid high memory usage: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Selfcommand answered 18/2, 2019 at 10:18 Comment(0)

The DBSCAN algorithm actually does compute the distance matrix, so no chance here. For this much data, I would recommend using MiniBatchKMeans. You can not use the Manhattan metric there out of the box, but you could do your own implementation. Maybe try the standard implementation with the euclidean metric first.

I don't know many clustering algorithms that don't perform pairwise distances.

Using the newly embedded cheat-sheet bottom center: though luck.

Tommietommy answered 5/5, 2013 at 14:31 Comment(5)

There's no way to compute them on the fly? The way I understand DBSCAN I'm not clear on why I couldn't just start with a random point, compute its distance to some other point, and compare it to epsilon, chucking it out or adding it as a neighbor over and over again... – Derick 5/5, 2013 at 15:44

@JamesT: while it would be possible, the current scikit-learn implementation simply doesn't do that. It doesn't really scale up to large numbers of samples because it takes quadratic space and time. – Convolvulaceous 5/5, 2013 at 15:46

Incorrect. DBSCAN does not need a distance matrix (and in particular, not a matrix). A good implementation should use a spatial index, to significantly reduce the number of distance computations needed. It should be implemented in O(n) memory and O(n log n) runtime. – Comp 5/5, 2013 at 16:5

The DBSCAN algorithm in itself does not require to compute the whole distance matrix. See for instance the basic pseudocode on Wikipedia en.wikipedia.org/wiki/DBSCAN#Algorithm Previous versions on scikit relied on the full computation of the distance matrix but it is no longer the case – Oscitancy 25/2, 2016 at 22:49

@Oscitancy in my experience v0.15.2 requires far less memory than v0.17.1 to run the same code. Any idea why? – Jobie 12/8, 2016 at 16:13

There is the DBSCAN package available which implements Theoretically-Efficient and Practical Parallel DBSCAN. It's lightening quick compared to scikit-learn and doesn't suffer from the memory issue.

Hypnos answered 22/8, 2024 at 11:23 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags