Converting a list of rows to a PySpark dataframe

I have the following list of Rows that I want to convert to a PySpark DataFrame:

data = [Row(id=u'1', probability=0.0, thresh=10, prob_opt=0.45),
        Row(id=u'2', probability=0.4444444444444444, thresh=60, prob_opt=0.45),
        Row(id=u'3', probability=0.0, thresh=10, prob_opt=0.45),
        Row(id=u'80000000808', probability=0.0, thresh=100, prob_opt=0.45)]

I have tried data.toDF(), but it fails with:

AttributeError: 'list' object has no attribute 'toDF'

Stomachache answered 19/8, 2019 at 15:29 Comment(0)

This seems to work:

spark.createDataFrame(data)

Test results:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

data = [Row(id=u'1', probability=0.0, thresh=10, prob_opt=0.45),
        Row(id=u'2', probability=0.4444444444444444, thresh=60, prob_opt=0.45),
        Row(id=u'3', probability=0.0, thresh=10, prob_opt=0.45),
        Row(id=u'80000000808', probability=0.0, thresh=100, prob_opt=0.45)]

df = spark.createDataFrame(data)
df.show()
#  +-----------+------------------+------+--------+
#  |         id|       probability|thresh|prob_opt|
#  +-----------+------------------+------+--------+
#  |          1|               0.0|    10|    0.45|
#  |          2|0.4444444444444444|    60|    0.45|
#  |          3|               0.0|    10|    0.45|
#  |80000000808|               0.0|   100|    0.45|
#  +-----------+------------------+------+--------+
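
If you would rather not rely on schema inference, createDataFrame also accepts an explicit schema. A minimal sketch (the field names are taken from the Rows above; the types are one reasonable mapping, and it assumes Spark 3.x, where Row preserves the keyword order):

from pyspark.sql.types import (StructType, StructField,
                               StringType, DoubleType, IntegerType)

# Declare the column names and types up front instead of inferring them
schema = StructType([
    StructField("id", StringType(), True),
    StructField("probability", DoubleType(), True),
    StructField("thresh", IntegerType(), True),
    StructField("prob_opt", DoubleType(), True),
])

df = spark.createDataFrame(data, schema=schema)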
Grani answered 17/6, 2021 at 14:28 Comment(0)

You can try the following code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext  # toDF() on an RDD needs an active SparkSession

rdd = sc.parallelize(data)
df = rdd.toDF()  # column names and types are inferred from the Row objects
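
Note that going through an RDD is optional here: since the elements of data are already Row objects, spark.createDataFrame(data) accepts the list directly and skips the parallelize step.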
Pham answered 19/8, 2019 at 18:19 Comment(0)

Found the answer!

sc = spark.sparkContext  # assuming an existing SparkSession named spark

rdd = sc.parallelize(data)

df = spark.createDataFrame(rdd, ['id', 'probability', 'thresh', 'prob_opt'])
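
The column names can also be passed to toDF directly instead of createDataFrame; a sketch, assuming the same rdd as above:

df = rdd.toDF(['id', 'probability', 'thresh', 'prob_opt'])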
Stomachache answered 19/8, 2019 at 15:43 Comment(0)
