I have a set of data based on which I want to create a classification model. Each row has the following form:
user1,class1,product1
user1,class1,product2
user1,class1,product5
user2,class1,product2
user2,class1,product5
user3,class2,product1
There are about 1M users, 2 classes, and 1M products. What I would like to do next is create the sparse vectors (something already supported by MLlib) BUT in order to apply that function I have to create the dense vectors (with the 0s), first. In other words, I have to binarize my data. What's the easiest (or most elegant) way of doing that?
Given that I am a newbie in regards to MLlib, may I ask you to provide a concrete example? I am using MLlib 1.2.
EDIT
I have ended up with the following piece of code but is turns out to be really slow... Any other ideas provided that I can only use MLlib 1.2?
val data = test11.map(x=> ((x(0) , x(1)) , x(2))).groupByKey().map(x=> (x._1 , x._2.toArray)).map{x=>
var lt : Array[Double] = new Array[Double](test12.size)
val id = x._1._1
val cl = x._1._2
val dt = x._2
var i = -1
test12.foreach{y => i += 1; lt(i) = if(dt contains y) 1.0 else 0.0}
val vs = Vectors.dense(lt)
(id , cl , vs)
}
userX
andclassY
then it most probably will beproductZ
or something else? – GaminguserX
is a sparse vector of values andclassY
is the corresponding class. – CinerariumuserX
an actual object or a very simple string? and are you taking into accountproduct
in any way during the classification? – Gaming