Binning data via sklearn's DecisionTreeClassifier?
Suppose I have a data set:

    X     y
   20     0
   22     0
   24     1
   27     0
   30     1
   40     1
   20     0
   ...

I want to discretize X into a few bins by minimizing the entropy, so I did the following:

from sklearn import tree
import numpy as np

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf.fit(X.values.reshape(-1, 1), y.values)

# Leaf nodes store -2 (sklearn's TREE_UNDEFINED) as their threshold;
# dropping them keeps only the actual split points.
threshold = clf.tree_.threshold[clf.tree_.threshold > -2]
threshold = np.sort(threshold)
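As a sanity check, here is a self-contained sketch of the same idea on synthetic data (the data set and variable names below are illustrative, not from the question):

```python
import numpy as np
from sklearn import tree

# Synthetic one-feature data with a binary target that flips at 28
rng = np.random.default_rng(0)
X = rng.uniform(20, 40, size=200)
y = (X > 28).astype(int)

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf.fit(X.reshape(-1, 1), y)

# Leaf nodes have threshold == -2 (TREE_UNDEFINED); keep only real splits
thresholds = np.sort(clf.tree_.threshold[clf.tree_.threshold > -2])

# Assign each value to a bin bounded by the learned split points
bins = np.digitize(X, thresholds)
```

Since the target is perfectly separable at 28, the tree should learn a single split near that value, and the resulting bins should reproduce the target.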

'threshold' should give the splitting points. Is this a correct way of binning data?

Any suggestions?

Samarskite answered 20/6, 2017 at 6:0 Comment(2)
This might be a silly question, but why are there so many -2 thresholds, and why just exclude them? I might be missing an obvious Google search that would reveal this (so apologies for the ignorance), but I have not found anything so far. – Autocephalous
@Autocephalous - did you find out why there are so many -2 values? I have the same problem. – Guild
First, what you did is correct.

There are many ways to bin your data:

  1. based on the values of the column, e.g. dividing the range between the column's min and max into 10 equal-width groups.
  2. based on the distribution of the column values, e.g. 10 groups based on the deciles of the column (pandas.qcut is better suited for this).
  3. based on the target, like you did. I found this blog relevant, and I think your method for finding the best splits works just fine: https://towardsdatascience.com/discretisation-using-decision-trees-21910483fa4b
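A minimal sketch of all three options on a toy version of the question's data (the exact values, bin counts, and tree depth are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn import tree

X = pd.Series([20, 22, 24, 27, 30, 40, 20, 35, 26, 29])
y = pd.Series([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])

# 1. Equal-width bins between the min and max of the column
equal_width = pd.cut(X, bins=4)

# 2. Quantile-based (equal-frequency) bins, here quartiles
quantiles = pd.qcut(X, q=4, duplicates='drop')

# 3. Supervised bins from a shallow decision tree, as in the question
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
clf.fit(X.values.reshape(-1, 1), y.values)
splits = np.sort(clf.tree_.threshold[clf.tree_.threshold > -2])
supervised = np.digitize(X, splits)
```

The first two ignore the target entirely; only the third picks bin edges that separate the classes.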
Dunkle answered 5/2, 2019 at 7:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.