AttributeError: 'DataFrame' object has no attribute 'map'
I wanted to convert the Spark data frame to an RDD using the code below:

from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

The detailed error message is:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
      1 from pyspark.mllib.clustering import KMeans
      2 spark_df = sqlContext.createDataFrame(pandas_df)
----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
      4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
    842         if name not in self.columns:
    843             raise AttributeError(
--> 844                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
    845         jc = self._jdf.apply(name)
    846         return Column(jc)

AttributeError: 'DataFrame' object has no attribute 'map'

Does anyone know what I did wrong here? Thanks!

Tuberous answered 16/9, 2016 at 15:44
Keep in mind that MLlib is built around RDDs, while ML is generally built around DataFrames. Since you appear to be using Spark 2.0, I would suggest you look up the KMeans from ML: spark.apache.org/docs/latest/ml-clustering.html - Rosendorosene
@JeffL: I checked ml, and I noticed that the input has to be a Dataset, not a data frame. So do we need another layer of conversion from data frame to Dataset in order to use ml? - Tuberous
I'm not 100% clear on the distinction any more, though in Python I believe it's nearly moot. In fact, if you browse the GitHub code: in 1.6.1 the various DataFrame methods are in a dataframe module, while in 2.0 those same methods are in a dataset module and there is no dataframe module. So I don't think you would face any conversion issues between DataFrame and Dataset, at least in the Python API. - Rosendorosene
You can't map a DataFrame, but you can convert the DataFrame to an RDD and map that by calling spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map aliased spark_df.rdd.map(); with Spark 2.0, you must explicitly call .rdd first.

Hummock answered 16/9, 2016 at 16:28
Right on, this is one of the main changes to DataFrames in Spark 2.0. - Vernice
'RDD' object has no attribute 'collectAsList' - Scranton
This has major downsides: "Converting to RDD breaks Dataframe lineage, there is no predicate pushdown, no column pruning, no SQL plan and less efficient PySpark transformations." See my answer for more details and alternatives. - Broderick
Fair that you should stick with DataFrames if possible, and it should be possible nearly all of the time. Depending on what you're trying to do, there will be different techniques to accomplish that. But there still may be use cases for converting to RDDs. - Hummock
You can use df.rdd.map(), as DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:

Converting to an RDD breaks the DataFrame lineage: there is no predicate pushdown, no column pruning, no SQL plan, and PySpark transformations are less efficient.

What should you do instead?

Keep in mind that the high-level DataFrame API is equipped with many alternatives. First, you can use select or selectExpr.

Another example is using explode instead of flatMap (which exists on RDDs); in Scala:

df.select($"name",explode($"knownLanguages"))
    .show(false)

Result:

+-------+------+
|name   |col   |
+-------+------+
|James  |Java  |
|James  |Scala |
|Michael|Spark |
|Michael|Java  |
|Michael|null  |
|Robert |CSharp|
|Robert |      |
+-------+------+

You can also use withColumn or a UDF, depending on the use case, or another option in the DataFrame API.

Broderick answered 6/2, 2021 at 20:39 Comment(0)
