MLlib to Breeze vectors/matrices are private to org.apache.spark.mllib scope?

Asked 30/10, 2014 at 22:8 Answered 20/1, 2019 at 22:32

apache-spark apache-spark-mllib scala-breeze

I have read somewhere that MLlib local vectors/matrices currently wrap Breeze implementations, but the methods that convert MLlib vectors/matrices to Breeze are private to the org.apache.spark.mllib scope. The suggested workaround is to write your code in an org.apache.spark.mllib.something package.

Is there a better way to do this? Can you cite some relevant examples?

Thanks and regards,

Cotsen answered 30/10, 2014 at 22:8 Comment(0)

I used the same solution that @dlwh suggested. Here is the code that does it:

package org.apache.spark.mllib.linalg

object VectorPub {

  implicit class VectorPublications(val vector : Vector) extends AnyVal {
    def toBreeze : breeze.linalg.Vector[scala.Double] = vector.toBreeze

  }

  implicit class BreezeVectorPublications(val breezeVector : breeze.linalg.Vector[Double]) extends AnyVal {
    def fromBreeze : Vector = Vectors.fromBreeze(breezeVector)
  }
}

Notice that each implicit class extends AnyVal, which prevents allocation of a new wrapper object when those methods are called.

Bournemouth answered 15/11, 2014 at 16:42 Comment(3)

This code is placed inside the spark mllib.linalg package. That is not a viable general solution for clients of the mllib framework: they should not be touching the framework classes and packages. – Drin 2/2, 2015 at 18:48

It's in the spark.mllib.linalg package, but Spark doesn't have to be recompiled for this. Just create a new assembly that wraps the existing Spark assembly, and add this class there. It's kind of hacky, but it's the best I've found. – Bournemouth 3/2, 2015 at 3:30

Stuff like this is a bit dangerous. For instance, if you take a slice of your breeze vector and attempt to wrap it with fromBreeze, it will fail. – Indraft 28/8, 2016 at 23:9

My solution is kind of a hybrid of those of @barclar and @lev, above. You don't need to put your code in the org.apache.spark.mllib.linalg package if you don't make use of the spark-ml implicit conversions. You can define your own implicit conversions in your own package, like:

package your.package  // placeholder: substitute your own package name ("package" itself is a reserved word in Scala)

import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.ml.linalg.Vector
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}

object BreezeConverters
{
    implicit def toBreeze( dv: DenseVector ): BDV[Double] =
        new BDV[Double](dv.values)

    implicit def toBreeze( sv: SparseVector ): BSV[Double] =
        new BSV[Double](sv.indices, sv.values, sv.size)

    implicit def toBreeze( v: Vector ): BV[Double] =
        v match {
            case dv: DenseVector => toBreeze(dv)
            case sv: SparseVector => toBreeze(sv)
        }

    implicit def fromBreeze( dv: BDV[Double] ): DenseVector =
        new DenseVector(dv.toArray)

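    implicit def fromBreeze( sv: BSV[Double] ): SparseVector =
        // Breeze may over-allocate index/data, so keep only the active entries
        new SparseVector(sv.length, sv.index.take(sv.activeSize), sv.data.take(sv.activeSize))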

    implicit def fromBreeze( bv: BV[Double] ): Vector =
        bv match {
            case dv: BDV[Double] => fromBreeze(dv)
            case sv: BSV[Double] => fromBreeze(sv)
        }
}

Then you can import these implicits into your code with:

import your.package.BreezeConverters._
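One thing to watch when converting a Breeze SparseVector back to Spark: its index and data arrays can be longer than the number of entries actually stored, so passing them through untouched can produce an invalid Spark vector. The following is a minimal sketch of a conversion that copies only the active entries. The helper name fromBreezeTrimmed is made up for this example, and it assumes the org.apache.spark.ml.linalg classes used above.

```scala
import breeze.linalg.{SparseVector => BSV}
import org.apache.spark.ml.linalg.SparseVector

// Hypothetical helper (not part of Spark or Breeze): keep only the
// first activeSize slots of Breeze's backing arrays.
def fromBreezeTrimmed(sv: BSV[Double]): SparseVector = {
  val n = sv.activeSize
  new SparseVector(sv.length, sv.index.slice(0, n), sv.data.slice(0, n))
}

// Build a Breeze sparse vector incrementally; its backing arrays may
// end up with spare capacity beyond the two stored entries.
val bsv = BSV.zeros[Double](5)
bsv(1) = 2.0
bsv(3) = 4.0

val sv = fromBreezeTrimmed(bsv)
```

The slicing copies the arrays, so the resulting Spark vector no longer shares storage with the Breeze one.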

Latea answered 20/1, 2019 at 22:32 Comment(0)

As I understand it, the Spark people do not want to expose third-party APIs (including Breeze), so that it's easier to change them if they decide to move away from them.

You could always put just a simple implicit conversion class in that package and write the rest of your code in your own package. Not much better than just putting everything in there, but it makes it a little more obvious why you're doing it.

Willianwillie answered 31/10, 2014 at 17:6 Comment(5)

putting code in the mllib.linalg package is not a viable solution for clients of the mllib framework – Drin 1/2, 2015 at 1:35

I agree it's dumb, but you only have to put one little class (as witnessed by @lev), and it's the best workaround that doesn't involve needless creation of extra arrays, like your solution below. – Willianwillie 1/2, 2015 at 2:44

(I of course think they should just expose Breeze as "experimental" if they want to reserve the right to change it, but it's out of my hands.) – Willianwillie 1/2, 2015 at 2:45

But adding to mllib/linalg is an "out of bounds" solution for a general client (who should not modify that package): it is a non-starter. I don't prefer my solution in terms of convenience either, but at least it is "legal". If you have an idea for a better, generally permissible solution, I am all for it. – Drin 1/2, 2015 at 5:43

Anything that's a better solution requires politicking, I'm afraid. – Willianwillie 1/2, 2015 at 18:8

Here is the best I have so far. Note to @dlwh: please do provide any improvements you might have to this.

The solution I could come up with that does not put code inside the mllib.linalg package is to convert each Vector to a new Breeze DenseVector.

import breeze.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors

val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(4.0, 5.0, 6.0)
val bv1 = new DenseVector(v1.toArray)
val bv2 = new DenseVector(v2.toArray)
val vectout = Vectors.dense((bv1 + bv2).toArray)
// vectout: org.apache.spark.mllib.linalg.Vector = [5.0,7.0,9.0]

Drin answered 1/2, 2015 at 1:25 Comment(1)

This seems to be a good solution, and it worked for my purpose, but v1.toArray collects all the elements of v1, which could cause problems when, for example, v1 is huge and cannot fit in RAM! – Roundy 2/7, 2015 at 5:35

This solution avoids putting code into Spark's packages and avoids converting sparse to dense vectors:

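// Assumed import; the same code also works with org.apache.spark.ml.linalg
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

def toBreeze(vector: Vector): breeze.linalg.Vector[Double] = vector match {
  case sv: SparseVector => new breeze.linalg.SparseVector[Double](sv.indices, sv.values, sv.size)
  case dv: DenseVector => new breeze.linalg.DenseVector[Double](dv.values)
}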
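A quick self-contained check of this kind of conversion, as a sketch that assumes the org.apache.spark.mllib.linalg classes (the import aliases BDV, BSV and BV are just local shorthand): a sparse input stays sparse, and a dense input is wrapped without copying.

```scala
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

// Same pattern match as above, repeated so the snippet runs on its own.
def toBreeze(vector: Vector): BV[Double] = vector match {
  case sv: SparseVector => new BSV[Double](sv.indices, sv.values, sv.size)
  case dv: DenseVector  => new BDV[Double](dv.values)
}

// A sparse vector with 1000 slots and only two stored entries.
val sparse = Vectors.sparse(1000, Array(3, 42), Array(1.0, 2.0))
val b = toBreeze(sparse)

// The dense case wraps the underlying array rather than copying it,
// so a write through Breeze is visible from the Spark vector.
val d = Vectors.dense(1.0, 2.0)
val bd = toBreeze(d)
bd(0) = 9.0
```

Because the arrays are shared, treat the Breeze view as read-only unless mutating the original Spark vector is intended.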

Lorianne answered 4/2, 2017 at 21:3 Comment(0)

This is a method I wrote to convert an MLlib DenseMatrix to a Breeze matrix; maybe it helps.

import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.Matrix

// Copies an MLlib Matrix element by element into a new Breeze DenseMatrix.
def toBreez(X: Matrix): BDM[Double] = {
  val m = BDM.zeros[Double](X.numRows, X.numCols)
  for (i <- 0 until X.numRows; j <- 0 until X.numCols) {
    m(i, j) = X(i, j)
  }
  m
}
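As a possible alternative to the element-by-element loop, here is a sketch that relies on Matrix.toArray returning the values in column-major order, which is also the layout Breeze's DenseMatrix constructor expects. The name toBreezeViaArray is made up for this example.

```scala
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Hypothetical helper: hand the column-major array straight to Breeze.
def toBreezeViaArray(m: Matrix): BDM[Double] =
  new BDM[Double](m.numRows, m.numCols, m.toArray)

// 2 x 3 matrix given in column-major order:
// 1.0  3.0  5.0
// 2.0  4.0  6.0
val sm = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
val bm = toBreezeViaArray(sm)
```

Like the loop version, this always produces a dense Breeze matrix, so a large sparse input is fully materialized.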

Sienese answered 12/5, 2018 at 21:34 Comment(0)
