In scikit-learn, how do I deal with data that mixes numerical and nominal values?


Asked 27/7, 2012 at 15:26 Answered 27/7, 2012 at 15:49

python machine-learning scikit-learn data-mining mixed

I know that the computation in scikit-learn is based on NumPy so everything is a matrix or array.

How does this package handle mixed data (numerical and nominal values)?

For example, a product could have the attributes 'color' and 'price', where color is nominal and price is numerical. I noticed there is a class called 'DictVectorizer' that encodes nominal data numerically. For example, given two products:

products = [{'color':'black','price':10}, {'color':'green','price':5}]

And the result from 'DictVectorizer' could be:

[[1,0,10],
 [0,1,5]]
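A minimal sketch of that encoding, assuming scikit-learn is installed. The variable names `vec` and `encoded` are illustrative, and `sparse=False` is passed only so the result prints as a dense array:

```python
from sklearn.feature_extraction import DictVectorizer

products = [{'color': 'black', 'price': 10}, {'color': 'green', 'price': 5}]

# String-valued features are one-hot encoded; numeric features pass through.
vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(products)

# Column names learned during fitting, in column order.
print(vec.feature_names_)
print(encoded)
```

By default (`sparse=True`), DictVectorizer returns a SciPy sparse matrix instead, so the many zero entries are not stored explicitly, although the number of columns still grows with the number of distinct colors.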

If the attribute 'color' has many distinct values, the matrix will be very sparse, and the resulting high-dimensional feature vectors will degrade the performance of some algorithms, such as decision trees.

Is there any way to use the nominal values without creating dummy codes?

Fortune asked 27/7, 2012 at 15:26 Comment(2)

It's worth noting that Weka Instances store nominal values as floating point numbers corresponding to the index of the nominal in the attribute's definition. You could simply follow this same strategy to generate a numeric dataset for use with scikit-learn. – Lauzon 6/11, 2012 at 0:31
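A rough sketch of that index-based strategy in plain Python. The attribute definition `color_values` and the names `color_index` and `rows` are hypothetical, introduced only for illustration:

```python
# Hypothetical attribute definition: the ordered list of allowed nominal values.
color_values = ['black', 'green', 'red']

# Map each nominal value to its index in the definition, stored as a float.
color_index = {value: float(i) for i, value in enumerate(color_values)}

products = [{'color': 'black', 'price': 10}, {'color': 'green', 'price': 5}]

# One numeric row per product: [color index, price].
rows = [[color_index[p['color']], float(p['price'])] for p in products]
print(rows)
```

Note that index codes impose an arbitrary order on the colors, which an estimator may treat as a meaningful numeric scale.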

Thanks a lot for enlarging my knowledge. – Fortune 6/11, 2012 at 14:31

The DecisionTree class in scikit-learn will need some refactoring to deal efficiently with high-cardinality categorical features (and maybe even with naturally sparse data such as text TF-IDF vectors).

Nobody is working on that yet AFAIK.

Breadbasket answered 27/7, 2012 at 15:49 Comment(2)

Thanks a lot. In scikit-learn, is there a smarter way to do this refactoring than manual operation? – Fortune 30/7, 2012 at 15:59

My answer states that the current state of affairs is a limitation of the current Decision Tree implementation in scikit-learn. I know of no easy fix to remove that limitation. I don't understand what you mean by "manual operation". – Breadbasket 30/7, 2012 at 16:44
