Handling continuous data in Spark NaiveBayes - McMap

Asked 11/8, 2017 at 4:0 Answered 11/8, 2017 at 11:51

apache-spark apache-spark-mllib naivebayes

According to the official documentation for Spark NaiveBayes:

It supports Multinomial NB (see here) which can handle finitely supported discrete data.

How can I handle continuous data (for example, the percentage of some in some document) in Spark NaiveBayes?

Kibe answered 11/8, 2017 at 4:0 Comment(0)

The current implementation handles only discrete features (counts for the multinomial model, binary values for the Bernoulli model), so for good results you'll have to discretize and encode your data. For discretization you can use either Bucketizer or QuantileDiscretizer. The former is less expensive, since it applies splits you supply rather than computing them from the data, and might be a better fit when you want to use domain-specific knowledge.
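The difference between the two discretizers can be sketched in plain Python. This is an illustration of the idea only, not Spark's API: `bucketize` and `quantile_splits` are hypothetical helpers, the sample values are made up, and Spark's QuantileDiscretizer uses approximate quantiles where this sketch computes exact ones.

```python
from bisect import bisect_right
from statistics import quantiles


def bucketize(value, splits):
    """Map a value to a bucket index given sorted split points.

    Follows the Bucketizer convention: bucket i covers [splits[i], splits[i+1]),
    except the last bucket, which also includes its upper bound.
    """
    if not splits[0] <= value <= splits[-1]:
        raise ValueError(f"{value} is outside the range covered by the splits")
    return min(bisect_right(splits, value) - 1, len(splits) - 2)


def quantile_splits(values, num_buckets):
    """Derive split points from the data itself (this needs a pass over the data)."""
    inner = quantiles(values, n=num_buckets)  # num_buckets - 1 cut points
    return [float("-inf"), *inner, float("inf")]


percentages = [1, 2, 3, 4, 5, 6, 40, 70, 95]  # made-up, skewed sample

# Bucketizer style: splits come from domain knowledge, no fitting step.
domain_splits = [0, 10, 50, 100]
by_domain = [bucketize(p, domain_splits) for p in percentages]

# QuantileDiscretizer style: splits are learned so buckets are roughly equal-sized.
learned_splits = quantile_splits(percentages, 3)
by_quantile = [bucketize(p, learned_splits) for p in percentages]

print(by_domain)
print(by_quantile)
```

On skewed data like this, fixed splits leave most values in one bucket, while quantile-based splits spread them evenly at the cost of computing the quantiles first.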

For encoding you can use dummy encoding with OneHotEncoder, with its dropLast Param adjusted.
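What dropLast changes can be shown with a small plain-Python sketch. `one_hot` is a hypothetical helper written for illustration, not part of Spark; it only mimics the effect of the Param, which defaults to true in Spark's OneHotEncoder.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0/1 indicators.

    With drop_last=True the last category gets no column of its own and is
    represented by all zeros, so the vector has num_categories - 1 entries.
    """
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} is not in [0, {num_categories})")
    size = num_categories - 1 if drop_last else num_categories
    return [1 if i == index else 0 for i in range(size)]


# Three buckets from a discretizer, encoded both ways.
with_drop = [one_hot(i, 3, drop_last=True) for i in range(3)]
without_drop = [one_hot(i, 3, drop_last=False) for i in range(3)]

print(with_drop)
print(without_drop)
```

Keeping every column (dropLast set to false) gives each bucket its own indicator feature, which is presumably the adjustment meant here.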

So overall you'll need:

QuantileDiscretizer or Bucketizer -> OneHotEncoder for each continuous feature.
StringIndexer* -> OneHotEncoder for each discrete feature.
VectorAssembler to combine all of the above.

* Or predefined column metadata.

Stranger answered 11/8, 2017 at 11:51 Comment(0)
