How to remove elements from an array Column in Spark?

I have a Seq of stop words and a DataFrame. The DataFrame contains a column of array type, and I am trying to remove from that column the elements that appear in the Seq.

For example:

val stop_words = Seq("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

    +---------------------------------------------------+
    |sorted_items                                       |
    +---------------------------------------------------+
    |[flannel, and, for, s, shirts, sleeve, warm]       |
    |[3, 5, kitchenaid, s]                              |
    |[5, 6, case, flip, inch, iphone, on, xs]           |
    |[almonds, chocolate, covered, dark, joe, s, the]   |
    |null                                               |
    |[]                                                 |
    |[animation, book]                                  |
    +---------------------------------------------------+

Expected output:

+---------------------------------------------------+
|sorted_items                                       |
+---------------------------------------------------+
|[flannel, shirts, sleeve, warm]                    |
|[3, 5, kitchenaid]                                 |
|[5, 6, case, flip, inch, iphone, xs]               |
|[almonds, chocolate, covered, dark, joe]           |
|null                                               |
|[]                                                 |
|[animation, book]                                  |
+---------------------------------------------------+

How can this be done in an efficient and optimized way?

Departure answered 17/5, 2019 at 6:38 Comment(0)

Use StopWordsRemover from the MLlib package. Custom stop words can be set with the setStopWords function. StopWordsRemover does not handle null values, so those need to be dealt with before use, for example by coalescing to an empty array:

import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{array, coalesce}
import spark.implicits._ // for the $ column syntax

// Replace nulls with an empty array so StopWordsRemover does not fail on them
val df2 = df.withColumn("sorted_items", coalesce($"sorted_items", array()))

val remover = new StopWordsRemover()
  .setStopWords(stop_words.toArray) // use the custom list instead of the defaults
  .setInputCol("sorted_items")
  .setOutputCol("filtered")

val df3 = remover.transform(df2)
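
To match the expected output, the filtered column can then replace the original one. A minimal follow-up, assuming the variable names above:

// Keep the filtered result under the original column name
val result = df3.drop("sorted_items").withColumnRenamed("filtered", "sorted_items")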
Whom answered 17/5, 2019 at 7:4 Comment(1)
#56179775 If you see what I am doing wrong here, please let me know. – Departure

Use array_except from org.apache.spark.sql.functions (available since Spark 2.4):

import org.apache.spark.sql.{functions => F}

val stopWords = Array("a", "and", "for", "in", "of", "on", "the", "with", "s", "t")

// array_except drops from sorted_items every element that also appears in the
// stop-word array; F.lit on an Array[String] builds the array literal column
val newDF = df.withColumn("sorted_items", F.array_except(df("sorted_items"), F.lit(stopWords)))

newDF.show(false)

Output:

+----------------------------------------+
|sorted_items                            |
+----------------------------------------+
|[flannel, shirts, sleeve, warm]         |
|[3, 5, kitchenaid]                      |
|[5, 6, case, flip, inch, iphone, xs]    |
|[almonds, chocolate, covered, dark, joe]|
|null                                    |
|[]                                      |
|[animation, book]                       |
+----------------------------------------+
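
If the cluster runs a Spark version older than 2.4, where array_except is not available, a UDF gives the same result. A minimal sketch, assuming the stopWords array above:

import org.apache.spark.sql.functions.udf

// A Set makes the membership test O(1) per element
val stopWordSet = stopWords.toSet
val removeStopWords = udf { (items: Seq[String]) =>
  if (items == null) null else items.filterNot(stopWordSet.contains)
}

val filteredDF = df.withColumn("sorted_items", removeStopWords(df("sorted_items")))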
Halfassed answered 17/5, 2019 at 7:11 Comment(1)
How can this be translated into SQL? – Calender
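
For reference, array_except is also exposed in Spark SQL, so the comment above can be answered with a plain query. A minimal sketch, assuming the DataFrame is registered under the hypothetical view name items:

df.createOrReplaceTempView("items")

val sqlDF = spark.sql("""
  SELECT array_except(sorted_items,
                      array('a', 'and', 'for', 'in', 'of', 'on', 'the', 'with', 's', 't')) AS sorted_items
  FROM items
""")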