Tuning leaf_size to decrease time consumption in Scikit-Learn KNN
Asked Answered
B

1

9

I was trying to implement KNN for handwritten character recognition where I found out that the execution of code was taking a lot of time. When added parameter leaf_size with value 400, I observed that time taken by code to execute was significantly reduced.

Original Code:

knn = KNeighborsClassifier(n_neighbors=3)

New Code:

knn = KNeighborsClassifier(n_neighbors=3,leaf_size=400)

I have read few documentation and articles regarding the leaf_size parameter of the KDtree/Balltree but couldn't find any good enough reference on how to safely tune this parameter without any accuracy and information loss.

It would be very kind if someone could share some insights regarding the above issue.

Related Articles I referred to:
1.) http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html
2.) https://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-searches-in-python/
3.) http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Burford answered 21/4, 2018 at 8:49 Comment(0)
T
10

couldn't find any good enough reference on how to safely tune this parameter without any accuracy and information loss.

In general, the larger leaf_size, the closer neighbors the algorithm picks (note this is not the same as higher accuracy). This is because the tree's main purpose is to reduce the number of candidates for the neighbours. Neither KD-tree, nor Ball Tree guarantees that the split will keep true nearest points close in a tree. In this sense, the use of a tree implies "information loss", and the larger the tree, the higher the chance to put the true neighbors into a wrong branch. In extreme case, there's simply no tree at all (called brute force), so all points are the candidates for the neighbours, as a result the algorithm is guaranteed to find the exact solution.

However, it's possible, at least in theory, that even the wrong choice of the neighbors actually results in higher classification accuracy. But it's almost impossible to say in advance how the change of leaf_size affects the final accuracy.

When added parameter leaf_size with value 400, I observed that time taken by code to execute was significantly reduced.

This is interesting, because leaf_size increase (default is 30) usually results in query time reduction. But it is possible that most of that time was spent on construction, in this case the larger the size of a leaf, the faster this phase is completed. Remember, that the Ball Tree doesn't guarantee that the result tree is balanced. When it's highly unbalanced the construction can even take O(N^2) time. In this case the leaf_size increase makes perfect sense. This post contains very interesting experiments on this matter.

Tributary answered 22/4, 2018 at 19:3 Comment(2)
If I want to query accurately the neighbors of a point that are within a given radius r, then should I increase or decrease the leafsize ?Johnathanjohnathon
@Johnathanjohnathon I am working on the same topic of KDtree for sampling wanna have a small chat?Ejaculation

© 2022 - 2024 — McMap. All rights reserved.