Pyspark removing multiple characters in a dataframe column

Looking at pyspark, I see translate and regexp_replace, which help me replace a single character that exists in a dataframe column.

I was wondering if there is a way to supply multiple strings in the regexp_replace or translate so that it would parse them and replace them with something else.

Use case: remove all $, #, and comma (,) characters in a column A

Julissa answered 8/6, 2018 at 18:27 Comment(0)

You can use pyspark.sql.functions.translate() to make multiple replacements. Pass in a string of characters to replace and another string of equal length that represents the replacement values.

For example, let's say you had the following DataFrame:

import pyspark.sql.functions as f
df = spark.createDataFrame([("$100,00",), ("#foobar",), ("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#|                 A|
#+------------------+
#|           $100,00|
#|           #foobar|
#|foo, bar, #, and $|
#+------------------+

and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:

df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#|                 A|          replaced|
#+------------------+------------------+
#|           $100,00|           X100Z00|
#|           #foobar|           Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+

If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().

df.select("A", f.regexp_replace(f.col("A"), r"[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#|                 A|     replaced|
#+------------------+-------------+
#|           $100,00|        10000|
#|           #foobar|       foobar|
#|foo, bar, #, and $|foo bar  and |
#+------------------+-------------+

The pattern r"[\$#,]" means match any one of the characters inside the brackets (the raw string keeps Python from trying to interpret the backslash). The $ is escaped because it has a special meaning in regex.
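
As a side note, escaping is optional inside a character class, so this variant of the same call (a sketch using the same df as above) should produce the same result:

df.select("A", f.regexp_replace(f.col("A"), "[$#,]", "").alias("replaced")).show()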

Baggett answered 8/6, 2018 at 19:4 Comment(2)
For removing all instances, you can also use translate; switching to regexp_replace is not needed. I believe the following would do the job: df.select("A", f.translate(f.col("A"), "$#,", "").alias("replaced")).show() - Blastoff
@Sheldore, your solution does not work properly. It replaces characters with spaces. - English

If someone needs to do this in Scala, you can do it as in the code below:

import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF and the $"col" syntax

val df = Seq(("Test$", 19), ("$#,", 23), ("Y#a", 20), ("ZZZ,,", 21)).toDF("Name", "age")
val df1 = df.withColumn("NewName", translate($"Name", "$#,", "xyz"))
display(df1)  // Databricks helper; use df1.show() elsewhere

You can see the output below:
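
A sketch of what it should look like, reconstructed from the translate mapping and shown as df1.show() output rather than the Databricks display table:

+-----+---+-------+
| Name|age|NewName|
+-----+---+-------+
|Test$| 19|  Testx|
|  $#,| 23|    xyz|
|  Y#a| 20|    Yya|
|ZZZ,,| 21|  ZZZzz|
+-----+---+-------+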

Beggar answered 5/4, 2021 at 13:2 Comment(0)
