Binning data via sklearn's DecisionTreeClassifier?
Suppose I have a data set:

    X     y
   20     0
   22     0
   24     1
   27     0
   30     1
   40     1
   20     0
   ...

I want to discretize X into a few bins by minimizing the entropy, so I did the following:

from sklearn import tree
import numpy as np

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf.fit(X.values.reshape(-1, 1), y.values)

# Leaf nodes store -2 (sklearn's TREE_UNDEFINED) as their threshold;
# dropping them keeps only the actual split points.
threshold = clf.tree_.threshold[clf.tree_.threshold > -2]
threshold = np.sort(threshold)
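As a sanity check, here is a self-contained sketch of the same idea on synthetic data (the data set and variable names below are illustrative, not from the question):

```python
import numpy as np
from sklearn import tree

# Synthetic one-feature data with a binary target that flips at 28
rng = np.random.default_rng(0)
X = rng.uniform(20, 40, size=200)
y = (X > 28).astype(int)

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf.fit(X.reshape(-1, 1), y)

# Leaf nodes have threshold == -2 (TREE_UNDEFINED); keep only real splits
thresholds = np.sort(clf.tree_.threshold[clf.tree_.threshold > -2])

# Assign each value to a bin bounded by the learned split points
bins = np.digitize(X, thresholds)
```

Since the target is perfectly separable at 28, the tree should learn a single split near that value, and the resulting bins should reproduce the target.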

'threshold' should give the splitting points. Is this a correct way of binning data?

Any suggestions?

Samarskite answered 20/6, 2017 at 6:0 Comment(2)
This might be a silly question, but why are there so many -2 thresholds, and why just exclude them? I might be missing an obvious Google search that would reveal this (so apologies for the ignorance), but I have not found anything so far. – Autocephalous
@Autocephalous - did you find out why there are so many -2 values? I have the same problem. – Guild
First, what you did is correct.

There are many ways to bin your data:

  1. based on the values of the column, e.g. dividing the range between the column's min and max into 10 equal-width groups.
  2. based on the distribution of the column values, e.g. 10 groups based on the deciles of the column (pandas.qcut is better suited for this).
  3. based on the target, like you did. I found this blog relevant, and I think your method for finding the best splits works just fine: https://towardsdatascience.com/discretisation-using-decision-trees-21910483fa4b
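A minimal sketch of all three options on a toy version of the question's data (the exact values, bin counts, and tree depth are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn import tree

X = pd.Series([20, 22, 24, 27, 30, 40, 20, 35, 26, 29])
y = pd.Series([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])

# 1. Equal-width bins between the min and max of the column
equal_width = pd.cut(X, bins=4)

# 2. Quantile-based (equal-frequency) bins, here quartiles
quantiles = pd.qcut(X, q=4, duplicates='drop')

# 3. Supervised bins from a shallow decision tree, as in the question
clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=2)
clf.fit(X.values.reshape(-1, 1), y.values)
splits = np.sort(clf.tree_.threshold[clf.tree_.threshold > -2])
supervised = np.digitize(X, splits)
```

The first two ignore the target entirely; only the third picks bin edges that separate the classes.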
Dunkle answered 5/2, 2019 at 7:28 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.