Creating Spark dataframe from numpy matrix
It is my first time with PySpark (Spark 2), and I'm trying to create a toy dataframe for a Logit model. I ran the tutorial successfully and would like to pass my own data into it.

I've tried this:

%pyspark
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(df,["label", "features"])

but I cannot get rid of:

TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector

I'm using the ML library for the vector, and the input is a double array, so what's the catch? According to the documentation, it should be fine.

Many thanks.

Toothache answered 12/7, 2017 at 16:55 Comment(0)
W
8

You are mixing functionality from ML and MLlib, which are not necessarily compatible. You don't need a LabeledPoint when using spark-ml:

sc.version
# u'2.1.1'

import numpy as np
from pyspark.ml.linalg import Vectors

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
dff = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(dff,schema=["label", "features"])

mydf.show(5)
# +-----+-------------+ 
# |label|     features| 
# +-----+-------------+ 
# |    1|[0.0,0.0,0.0]| 
# |    0|[0.0,1.0,1.0]| 
# |    0|[0.0,1.0,0.0]| 
# |    1|[0.0,0.0,1.0]| 
# |    0|[0.0,1.0,0.0]|
# +-----+-------------+

PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]
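Since the end goal was a Logit model, here is a minimal sketch of fitting one with the DataFrame-based spark.ml API, assuming mydf has been built as above with label and features columns (the lr and model names, and maxIter=10, are just illustrative):

from pyspark.ml.classification import LogisticRegression

# Sketch only: fit a logistic regression on the (label, features) DataFrame built above
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = lr.fit(mydf)

# Inspect the fitted coefficients and intercept
print(model.coefficients)
print(model.intercept)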

Whatnot answered 12/7, 2017 at 18:17 Comment(3)
I think it's very important to clear these things up, because this is where the mess begins. It's not the first time OPs have mixed them up, and at some point they ask themselves which one they should use.Cordie
Yeah, when you see it for the first time it gets a bit confusing :) nodalpoint.com/spark-classificationToothache
I think you want to use column_stack instead of concatenate (see the sketch below).Snips
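For reference, a minimal sketch of that suggestion (the data variable name is just illustrative): np.column_stack lays the four 1000-element arrays out as columns of a (1000, 4) matrix, so x[0] is the 0/1 label and x[1:] are the three features:

import numpy as np

# Stack the label column and the three feature columns side by side -> shape (1000, 4)
data = np.column_stack([np.random.randint(0, 2, size=1000),
                        np.random.randn(1000),
                        3 * np.random.randn(1000) + 2,
                        6 * np.random.randn(1000) - 2])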
L
12

From NumPy to Pandas to Spark:

import numpy as np
import pandas as pd

data = np.random.rand(4, 4)
df = pd.DataFrame(data, columns=list('abcd'))
spark.createDataFrame(df).show()

Output:

+-------------------+-------------------+------------------+-------------------+
|                  a|                  b|                 c|                  d|
+-------------------+-------------------+------------------+-------------------+
| 0.8026427193838694|0.16867056812634307|0.2284873209015007|0.17141853164400833|
| 0.2559088794287595| 0.3896957084615589|0.3806810025185623| 0.9362280141470332|
|0.41313827425060257| 0.8087580640179158|0.5547653674054028| 0.5386190454838264|
| 0.2948395900484454| 0.4085807623354264|0.6814694724946697|0.32031773805256325|
+-------------------+-------------------+------------------+-------------------+
Lancelle answered 18/10, 2017 at 20:2 Comment(1)
The thing is, if you are going to continue processing this data with Spark ML, you will need something like VectorAssembler further downstream to convert the 4 columns into a single one, like features in my answer, since Spark ML needs the features in that form (see the sketch below).Whatnot
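A minimal sketch of that downstream step, assuming the four-column DataFrame from this answer (the assembler and assembled names are just illustrative):

from pyspark.ml.feature import VectorAssembler

# Combine the four numeric columns into a single vector column named "features"
assembler = VectorAssembler(inputCols=["a", "b", "c", "d"], outputCol="features")
assembled = assembler.transform(spark.createDataFrame(df))
assembled.select("features").show(4, truncate=False)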
F
2

The problem is easy to solve: you are using the ml and the mllib APIs at the same time. Stick to one; otherwise you get this error.

This is the solution for the mllib API:

import numpy as np
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(df,["label", "features"])

For the ml API, you don't really need LabeledPoint anymore. Here is an example. I would suggest using the ml API, since the mllib API will be deprecated soon.

Frech answered 12/7, 2017 at 18:19 Comment(2)
Thanks a lot for your answer as well! I awarded desertnaut's and upvoted yours.Toothache
Upvoted too, since it is complementary to mine (mllib). It is not visible now, but our answers came within just 2 minutes of each other - cool... :)Whatnot
