How to overwrite entire existing column in Spark dataframe with new column?

Asked 19/6, 2017 at 6:21 Answered 14/5, 2018 at 22:59

apache-spark dataframe pyspark apache-spark-sql apache-spark-mllib

I want to overwrite a spark column with a new column which is a binary flag.

I tried directly overwriting the column id2 but why is it not working like a inplace operation in Pandas?

How to do it without using withcolumn() to create new column and drop() to drop the old column?

I know that spark dataframe is immutable, is that the reason or there is a different way to overwrite without using withcolumn() & drop()?

    df2 = spark.createDataFrame(
        [(1, 1, float('nan')), (1, 2, float(5)), (1, 3, float('nan')), (1, 4, float('nan')), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
        ('session', "timestamp1", "id2"))

    df2.select(df2.id2 > 0).show()

+---------+
|(id2 > 0)|
+---------+
|     true|
|     true|
|     true|
|     true|
|     true|
|     true|
|     true|
+---------+
 # Attempting to overwriting df2.id2
    df2.id2=df2.select(df2.id2 > 0).withColumnRenamed('(id2 > 0)','id2')
    df2.show()
#Overwriting unsucessful
+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
|      1|         1| NaN|
|      1|         2| 5.0|
|      1|         3| NaN|
|      1|         4| NaN|
|      1|         5|10.0|
|      1|         6| NaN|
|      1|         6| NaN|
+-------+----------+----+

Wheresoever answered 19/6, 2017 at 6:21 Comment(1)

spark can't write inplace like pandas if that is what you are looking for. DataFrames are immutable structure, they cannot be overwritten. – Broody 19/6, 2017 at 6:40

You can use

d1.withColumnRenamed("colName", "newColName")
d1.withColumn("newColName", $"colName")

The withColumnRenamed renames the existing column to new name.

The withColumn creates a new column with a given name. It creates a new column with same name if there exist already and drops the old one.

In your case changes are not applied to the original dataframe df2, it changes the name of column and return as a new dataframe which should be assigned to new variable for the further use.

d3 = df2.select((df2.id2 > 0).alias("id2"))

Above should work fine in your case.

Hope this helps!

Kathrynkathryne answered 19/6, 2017 at 6:28 Comment(0)

As stated above it's not possible to overwrite DataFrame object, which is immutable collection, so all transformations return new DataFrame.

The fastest way to achieve your desired effect is to use withColumn:

df = df.withColumn("col", some expression)

where col is name of column which you want to "replace". After running this value of df variable will be replaced by new DataFrame with new value of column col. You might want to assign this to new variable.

In your case it can look:

df2 = df2.withColumn("id2", (df2.id2 > 0) & (df2.id2 != float('nan')))

I've added comparison to nan, because I'm assuming you don't want to treat nan as greater than 0.

Marque answered 19/6, 2017 at 7:11 Comment(2)

id2 column originally exists in df2. So when we try to create a new column in df2 with the existing name id2 using withColumn(), why is it not throwing an conflict error saying "id2 already existing so cannot be changed" or something since dataframe is immutable? Please explain this and how withcolumn() works – Wheresoever 19/6, 2017 at 7:43

withColumn() does not throws exception if the column already exists, if the column exist, it drops the old column and create a new column with the change data. – Kathrynkathryne 19/6, 2017 at 8:23

If you're working with multiple columns of the same name in different joined tables you can use the table alias in the colName in withColumn.

Eg. df1.join(df2, df1.id = df2.other_id).withColumn('df1.my_col', F.greatest(df1.my_col, df2.my_col))

And if you only want to keep the columns from df1 you can also call .select('df1.*')

If you instead do df1.join(df2, df1.id = df2.other_id).withColumn('my_col', F.greatest(df1.my_col, df2.my_col))

I think it overwrites the last column which is called my_col. So it outputs: id, my_col (df1.my_col original value), id, other_id, my_col (newly computed my_col)

Insurmountable answered 14/5, 2018 at 22:59 Comment(0)

Recommended topics

Hot tags