AttributeError: 'DataFrame' object has no attribute 'map'
I wanted to convert the Spark data frame to an RDD using the code below:

from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

The detailed error message is:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
      1 from pyspark.mllib.clustering import KMeans
      2 spark_df = sqlContext.createDataFrame(pandas_df)
----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
      4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
    842         if name not in self.columns:
    843             raise AttributeError(
--> 844                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
    845         jc = self._jdf.apply(name)
    846         return Column(jc)

AttributeError: 'DataFrame' object has no attribute 'map'

Does anyone know what I did wrong here? Thanks!

Tuberous answered 16/9, 2016 at 15:44
Keep in mind that MLlib is built around RDDs, while ML is generally built around DataFrames. Since you appear to be using Spark 2.0, I would suggest you look up the KMeans from ML: spark.apache.org/docs/latest/ml-clustering.html - Rosendorosene
@JeffL: I checked ml, and I noticed that the input has to be a Dataset, not a data frame. So do we need another layer of conversion from data frame to Dataset in order to use ml? - Tuberous
I'm not 100% clear on the distinction any more, though in Python I believe it's nearly moot. In fact, if you browse the GitHub code: in 1.6.1 the various DataFrame methods are in a dataframe module, while in 2.0 those same methods are in a dataset module and there is no dataframe module. So I don't think you would face any conversion issues between DataFrame and Dataset, at least in the Python API. - Rosendorosene
You can't map a DataFrame, but you can convert the DataFrame to an RDD and map that by calling spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map aliased spark_df.rdd.map(); with Spark 2.0, you must explicitly call .rdd first.

Hummock answered 16/9, 2016 at 16:28
Right on, this is one of the main changes to DataFrames in Spark 2.0. - Vernice
'RDD' object has no attribute 'collectAsList' - Scranton
This has major downsides: "Converting to RDD breaks Dataframe lineage, there is no predicate pushdown, no column pruning, no SQL plan and less efficient PySpark transformations." See my answer for more details and alternatives. - Broderick
Fair that you should stick with DataFrames if possible, and it should be possible nearly all of the time. Depending on what you're trying to do, there will be different techniques to accomplish that. But there still may be use cases for converting to RDDs. - Hummock
You can use df.rdd.map(), as DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:

Converting to an RDD breaks the DataFrame lineage: there is no predicate pushdown, no column pruning, no SQL plan, and PySpark transformations are less efficient.

What should you do instead?

Keep in mind that the high-level DataFrame API is equipped with many alternatives. First, you can use select or selectExpr.

Another example is using explode instead of flatMap (which exists on RDDs); in Scala:

df.select($"name",explode($"knownLanguages"))
    .show(false)

Result:

+-------+------+
|name   |col   |
+-------+------+
|James  |Java  |
|James  |Scala |
|Michael|Spark |
|Michael|Java  |
|Michael|null  |
|Robert |CSharp|
|Robert |      |
+-------+------+

You can also use withColumn or a UDF, depending on the use case, or another option in the DataFrame API.

Broderick answered 6/2, 2021 at 20:39 Comment(0)
