How to encode categorical features in Apache Spark

I have a dataset from which I want to build a classification model. Each row has the following form:

user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1

There are about 1M users, 2 classes, and 1M products. What I would like to do next is create sparse vectors (something MLlib already supports), but in order to apply that function I first have to create the dense vectors (with the 0s). In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?
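
For illustration, with the sample above and columns ordered as (product1, product2, product5) (the column order is arbitrary), I would expect one binarized row per (user, class) pair, something like:

user1,class1 -> (1.0, 1.0, 1.0)
user2,class1 -> (0.0, 1.0, 1.0)
user3,class2 -> (1.0, 0.0, 0.0)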

Given that I am a newbie with MLlib, may I ask you to provide a concrete example? I am using MLlib 1.2.

EDIT

I have ended up with the following piece of code, but it turns out to be really slow... Any other ideas, given that I can only use MLlib 1.2?

import org.apache.spark.mllib.linalg.Vectors

// test11: RDD of (user, class, product) rows;
// test12: a local collection of all distinct products (presumably collected beforehand)
val data = test11
  .map(x => ((x(0), x(1)), x(2)))   // key by (user, class)
  .groupByKey()
  .map(x => (x._1, x._2.toArray))
  .map { x =>
    // one dense slot per known product, initialized to 0.0
    val lt: Array[Double] = new Array[Double](test12.size)
    val id = x._1._1
    val cl = x._1._2
    val dt = x._2
    var i = -1
    // linear scan over every product for every (user, class) pair -- this is the slow part
    test12.foreach { y => i += 1; lt(i) = if (dt contains y) 1.0 else 0.0 }
    val vs = Vectors.dense(lt)
    (id, cl, vs)
  }
Cinerarium answered 7/8, 2015 at 7:53 Comment(7)
Can you give an example of what you'd like the dense vector output to look like for that input? – Unexceptional
What kind of classification do you want to do exactly? I.e., if userX and classY then it most probably will be productZ, or something else? – Gaming
Not really. I am gonna use binary classification where userX is a sparse vector of values and classY is the corresponding class. – Cinerarium
@Cinerarium Is userX an actual object or a very simple string? And are you taking product into account in any way during the classification? – Gaming
What I am asking here, I think, is very straightforward for someone who has worked with machine learning before. I am just trying to find out the best way to implement this in MLlib. Have a look here for a similar example in scikit-learn: scikit-learn.org/stable/modules/… – Cinerarium
What language do you use? – Softboiled
I use Scala and MLlib 1.2. Please have a look at my edit. – Cinerarium

You can use spark.ml's OneHotEncoder.

You first use:

OneHotEncoder.categories(rdd, categoricalFields)

where categoricalFields is the sequence of indexes at which your RDD contains categorical data. categories, given a dataset and the indexes of the columns that hold categorical variables, returns a structure that, for each field, describes the values present in the dataset. That map is meant to be used as input to the encode method:

OneHotEncoder.encode(rdd, categories)

which returns your vectorized RDD[Array[T]].
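
For reference, later Spark releases (1.4 and up, through the 2.x line) ship OneHotEncoder as a spark.ml Transformer with a different, DataFrame-based API, typically chained after a StringIndexer. A minimal sketch, assuming a DataFrame df with a string column "product" (both names are hypothetical):

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Map each product string to a numeric index.
val indexer = new StringIndexer()
  .setInputCol("product")
  .setOutputCol("productIndex")

// Expand each index into a sparse 0/1 vector.
val encoder = new OneHotEncoder()
  .setInputCol("productIndex")
  .setOutputCol("productVec")

val encoded = encoder.transform(indexer.fit(df).transform(df))

(In Spark 3.x, OneHotEncoder became an Estimator, so it needs a fit step as well.)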

Callista answered 7/8, 2015 at 8:40 Comment(3)
which is not available in MLlib 1.2 :-) – Gaming
Yeah, it is not, and I cannot upgrade, unfortunately... Please have a look at my edit. – Cinerarium
This doesn't even seem to be available in 1.4. – Inhaul

If using the built-in OneHotEncoder is not an option and you have only a single variable, implementing a poor man's one-hot encoding is more or less straightforward. First, let's create some example data:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

val rdd = sc.parallelize(List(
    Array("user1", "class1", "product1"),
    Array("user1", "class1", "product2"),
    Array("user1", "class1", "product5"),
    Array("user2", "class1", "product2"),
    Array("user2", "class1", "product5"),
    Array("user3", "class2", "product1")))

Next, we have to create a mapping from each value to an index:

val prodMap = sc.broadcast(rdd.map(_(2)).distinct.zipWithIndex.collectAsMap)

and a simple encoding function:

def encodeProducts(products: Iterable[String]): Vector = {
    Vectors.sparse(
        prodMap.value.size,
        products.map(product => (prodMap.value(product).toInt, 1.0)).toSeq
    )
}

Finally we can apply it to the dataset:

rdd.map(x => ((x(0), x(1)), x(2))).groupByKey.mapValues(encodeProducts)

It is relatively easy to extend the above to handle multiple variables.
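
Since the original goal is binary classification in MLlib 1.2, the encoded vectors can then be wrapped into LabeledPoints. A sketch, assuming the label mapping class1 -> 0.0 and class2 -> 1.0 (that mapping is an assumption, not part of the question):

import org.apache.spark.mllib.regression.LabeledPoint

val labeled = rdd
  .map(x => ((x(0), x(1)), x(2)))
  .groupByKey
  .mapValues(encodeProducts)
  .map { case ((_, cls), vec) =>
    // hypothetical label mapping: class1 -> 0.0, anything else -> 1.0
    LabeledPoint(if (cls == "class1") 0.0 else 1.0, vec)
  }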

Edit:

If the number of products is too large for broadcasting to be useful, it should be possible to use a join instead. First, we can create a similar mapping from product to index, but keep it as an RDD:

import org.apache.spark.HashPartitioner

val nPartitions = ???

val prodMapRDD = rdd
     .map(_(2))
     .distinct
     .zipWithIndex
     .partitionBy(new HashPartitioner(nPartitions))
     .cache

val nProducts = prodMapRDD.count // Should be < Int.MaxValue

Next, we reshape the input RDD to get a PairRDD keyed by product:

val pairs = rdd
    .map(rec => (rec(2), (rec(0), rec(1))))
    .partitionBy(new HashPartitioner(nPartitions))

Finally, we can join the two:

def indicesToVec(n: Int)(indices: Iterable[Long]): Vector = {
     Vectors.sparse(n, indices.map(x => (x.toInt, 1.0)).toSeq)
}

pairs.join(prodMapRDD)
   .values
   .groupByKey
   .mapValues(indicesToVec(nProducts.toInt))
Softboiled answered 7/8, 2015 at 13:16 Comment(2)
+1 for the general solution. Do you have another solution which does not use broadcast? I use a solution like yours, but sometimes it does not work because the prodMap is too big to broadcast. – Rochet
@emeth It is much more expensive, but it should be possible to use joins. Please see the edit for details. – Softboiled

The original question asks for the easiest way to separate categorical features from non-categorical ones.

In Spark ML, you can use VectorIndexer's setMaxCategories method, where you do not have to specify the fields yourself; instead, it will treat as categorical any field whose cardinality is less than or equal to the given maximum (10 in the example below).

import org.apache.spark.ml.feature.VectorIndexer

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(10)
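
A minimal usage sketch, assuming df is a DataFrame with a Vector column named "features" (a hypothetical name):

// VectorIndexer is an Estimator: fit scans the data and decides which
// columns are categorical; transform rewrites their values as category indexes.
val indexerModel = indexer.fit(df)
val indexedData = indexerModel.transform(df)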

Please see this reply for details.

Myceto answered 14/12, 2017 at 11:12 Comment(0)
