How to convert a mllib matrix to a spark dataframe?

I want to pretty-print the result of a correlation in a Zeppelin notebook:

val Row(coeff: Matrix) = Correlation.corr(data, "features").head

One way to achieve this is to convert the result into a DataFrame with each value in a separate column and call z.show().

However, looking at the Matrix API, I don't see any way to do this.

Is there another straightforward way to achieve this?

Edit:

The DataFrame has 50 columns. Just converting the matrix to a string would not help, as the output gets truncated.

Subjectify asked 25/2, 2018 at 18:50

Using the toString method should be the easiest and fastest way if you simply want to print the matrix. You can control the output by passing the maximum number of lines to print and the maximum line width. You can also change the formatting by splitting the string on newlines and whitespace and rejoining the values with ",". For example:

val matrix = Matrices.dense(2,3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
matrix.toString
  .split("\n")
  .map(_.trim.split(" ").filter(_ != "").mkString("[", ",", "]"))
  .mkString("\n")

which will give the following:

[1.0,3.0,5.0]
[2.0,4.0,6.0]
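
The size limits mentioned above can be passed straight to toString. The following is a minimal sketch, assuming the matrix comes from org.apache.spark.ml.linalg (the type returned by Correlation.corr); the limits of 100 lines and 1000 characters are arbitrary values chosen for illustration, and the name `wide` is made up for this example.

```scala
import org.apache.spark.ml.linalg.{Matrices, Matrix}

// Same 2x3 example matrix as above (values are stored column-major)
val matrix: Matrix = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

// Allow up to 100 lines of output, each up to 1000 characters wide
// (both limits are illustrative; pick values that fit your matrix)
val wide: String = matrix.toString(100, 1000)
println(wide)
```

For a 50-column matrix, a generous line width should keep each row on a single line, although wide rows may still be hard to read in a notebook cell.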

However, if you want to convert the matrix to a DataFrame, the easiest way would be to first create an RDD and then use toDF(). Outside spark-shell or Zeppelin, toDF() requires import spark.implicits._ to be in scope.

val matrixRows = matrix.rowIter.toSeq.map(_.toArray)
val df = spark.sparkContext.parallelize(matrixRows).toDF("Row")

Then, to put each value in a separate column, you can do the following:

val numOfCols = matrixRows.head.length
val df2 = (0 until numOfCols).foldLeft(df)((df, num) => 
    df.withColumn("Col" + num, $"Row".getItem(num)))
  .drop("Row")
df2.show(false)

Result using the example data:

+----+----+----+
|Col0|Col1|Col2|
+----+----+----+
|1.0 |3.0 |5.0 |
|2.0 |4.0 |6.0 |
+----+----+----+
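
As an alternative to adding the columns one by one, the same DataFrame can probably be built in a single step with an explicit schema. This is only a sketch, not the method from the answer above: `matrixToDF` is a hypothetical helper name, the matrix is assumed to be an org.apache.spark.ml.linalg.Matrix, and the local master setting only matters outside Zeppelin, where getOrCreate would otherwise have no existing session to return.

```scala
import org.apache.spark.ml.linalg.{Matrices, Matrix}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical helper (not part of Spark): one DataFrame column per matrix column.
def matrixToDF(spark: SparkSession, m: Matrix, names: Seq[String]): DataFrame = {
  require(names.length == m.numCols, "need exactly one name per matrix column")
  // Every column holds doubles, so the schema is a list of DoubleType fields
  val schema = StructType(names.map(n => StructField(n, DoubleType, nullable = false)))
  // rowIter yields one Vector per matrix row; each becomes a Row of doubles
  val rows = m.rowIter.map(v => Row.fromSeq(v.toArray.toSeq)).toList
  spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
}

// In Zeppelin or spark-shell, getOrCreate returns the session that already exists
val spark = SparkSession.builder().master("local[*]").appName("matrix-to-df").getOrCreate()
val matrix = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
val df = matrixToDF(spark, matrix, Seq("Col0", "Col1", "Col2"))
df.show(false)
```

Passing the original feature names instead of Col0, Col1, ... makes a 50-column correlation table easier to read, and in Zeppelin the result can be handed to z.show(df).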
Alagez answered 26/2, 2018 at 6:8
Unfortunately, none of this helps me. The DataFrame has 50 columns and they get truncated in the output. To be sure of a clean output, I want to convert the Matrix into a DataFrame that holds each column of the matrix in an individual column. – Subjectify
@djWann: Did you try using df.show(false)? It won't truncate the output. – Alagez
Yeah, it doesn't truncate the output, but in the Zeppelin cell it's very hard to understand because of the 50 columns. – Subjectify
50 columns in a WrappedArray, I mean. – Subjectify
@djWann: I see. I changed the solution to build a DataFrame that has each value as a separate column. Hope it helps! – Alagez
