How do I split a column by using delimiters from another column in Spark/Scala

I have another question that is related to the split function. I am new to Spark/Scala.

Below is the sample data frame -


+-------------------+---------+
|             VALUES|Delimiter|
+-------------------+---------+
|       50000.0#0#0#|        #|
|          0@1000.0@|        @|
|                 1$|        $|
|1000.00^Test_string|        ^|
+-------------------+---------+

and I want the output to be -

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

I tried to split this manually -

dept.select(split(col("VALUES"), "#|@|\\$|\\^")).show()

and the output is -

+-----------------------+
|split(VALUES,#|@|\$|\^)|
+-----------------------+
|      [50000.0, 0, 0, ]|
|          [0, 1000.0, ]|
|                  [1, ]|
|   [1000.00, Test_st...|
+-----------------------+


But I want to pick up the delimiter from the Delimiter column automatically for a large dataset.

Addict answered 14/7, 2021 at 15:41 Comment(6)
Please add what you have tried and what failed for you in the question. You should provide a minimal reproducible example with your question. – Lamblike
I know that it can be done manually based on the above sample data frame - ``` dept.select(split(col("VALUES"), "#|@|\\$|\\^")).show() ``` and the output does match, but I don't want to put the delimiters in manually. – Addict
What is the logic for the extra column? – Grim
@Grim I have edited the question. I am not sure about the logic for the extra column as of now. My main concern is with the automatic fetching of the delimiters for large data sets. – Addict
A UDF would probably work here. – Donn
@Shashwat - Please see the answer below - Please do not hesitate to accept and upvote in case you are happy with the answer :) – Grim

You need to use expr() with split() to make the split dynamic:

from pyspark.sql import functions as F

df = spark.createDataFrame([("50000.0#0#0#", "#"), ("0@1000.0@", "@")], ["VALUES", "Delimiter"])
df = df.withColumn("split", F.expr("""split(VALUES, Delimiter)"""))
df.show()

+------------+---------+-----------------+
|      VALUES|Delimiter|            split|
+------------+---------+-----------------+
|50000.0#0#0#|        #|[50000.0, 0, 0, ]|
|   0@1000.0@|        @|    [0, 1000.0, ]|
+------------+---------+-----------------+
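
Since the question asks for Spark/Scala, the same idea can be sketched in Scala (a translation of the snippet above, not from the original answer; dept is the DataFrame from the question):

import org.apache.spark.sql.functions.expr

// split() is evaluated per row, so the delimiter is read from the Delimiter column
val result = dept.withColumn("split_values", expr("split(VALUES, Delimiter)"))
result.show(false)

As the comments point out, delimiters that are regex metacharacters ('$', '^') still need to be escaped with this approach.
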
Grim answered 14/7, 2021 at 16:23 Comment(5)
This is a really cool answer. What is the import for F? Is it regex.F? I couldn't get it working. – Lamblike
Hi @PubuduSitinamaluwa, expr() is under pyspark.sql.functions. You can import expr directly or use an alias to call it. Hope this helps. – Addict
Hi @Grim, I think we have to include escape characters to handle other cases like '$', '^', etc. – Addict
@PubuduSitinamaluwa from pyspark.sql import functions as F; from pyspark.sql import types as T; from pyspark.sql import Window as W – Grim
@Shashwat Will appreciate it if you can accept the answer :) Glad that it helped you. – Grim

EDIT: Please check the bottom of the answer for the Scala version.

You can use a custom user-defined function (pyspark.sql.functions.udf) to achieve this.

from typing import List

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType


def split_col(value: str, delimiter: str) -> List[str]:
    return str(value).split(str(delimiter))


udf_split = udf(lambda x, y: split_col(x, y), ArrayType(StringType()))

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('50000.0#0#0#', '#'), ('0@1000.0@', '@'), ('1$', '$'), ('1000.00^Test_string', '^')
], schema='VALUES String, Delimiter String')

df = df.withColumn("split_values", udf_split(df['VALUES'], df['Delimiter']))

df.show(truncate=False)

Output

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Note that the split_values column contains a list of strings. You can also update the split_col function to do further transformations on the values.

EDIT: Scala version

import org.apache.spark.sql.functions.udf

import spark.implicits._

val data = Seq(("50000.0#0#0#", "#"), ("0@1000.0@", "@"), ("1$", "$"), ("1000.00^Test_string", "^"))
var df = data.toDF("VALUES", "Delimiter")

val udf_split_col = udf {(x:String,y:String)=> x.split(y)}

df = df.withColumn("split_values", udf_split_col(df.col("VALUES"), df.col("Delimiter")))

df.show(false)
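
One difference worth noting (my observation, not part of the original answer): Java/Scala String.split drops trailing empty strings by default, while Python's str.split keeps them, so the Scala UDF above returns [50000.0, 0, 0] rather than [50000.0, 0, 0, ]. If you want the trailing empty entries to match the PySpark output, pass -1 as the limit:

// keep trailing empty strings, mirroring Python's str.split behaviour
val udf_split_col_keep_empty = udf { (x: String, y: String) => x.split(y, -1) }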

Edit 2

To avoid the issue with special characters used in regexes, you can pass the delimiter as a Char instead of a String to the split() method, as follows.

val udf_split_col = udf { (x: String, y: String) => x.split(y.charAt(0)) }
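
For example (a quick illustration, not from the original answer): split(String) treats its argument as a regular expression, while split(Char) matches the character literally:

"1$".split("$")                      // Array(1$)  -- "$" is a regex anchor, so nothing is split
"1$".split('$')                      // Array(1)   -- the literal character is used as the separator
"1000.00^Test_string".split('^')     // Array(1000.00, Test_string)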
Lamblike answered 14/7, 2021 at 16:31 Comment(9)
OP is asking for Scala ;) – Donn
Thank you for pointing that out. But the concepts are the same. – Lamblike
Sure, but not everyone is able to "translate" Python to Scala. – Donn
I agree. Let me come up with a Scala version of it. – Lamblike
@Shashwat Hope you were able to solve the issue. I added a sample Scala version of it. – Lamblike
Thanks @PubuduSitinamaluwa, both methods work really well. – Addict
Hi @PubuduSitinamaluwa, I think we have to include escape characters to handle other cases like '$', '^', etc. – Addict
Good point. Or you can change the Delimiter column type to char if it is always a single character. – Lamblike
Let us continue this discussion in chat. – Addict

This is another way of handling this, using Spark SQL:

df.createOrReplaceTempView("test")

spark.sql("""select VALUES,delimiter,split(values,case when delimiter in ("$","^") then concat("\\",delimiter) else delimiter end) as split_value from test""").show(false)

Note that I included the case when statement to add escape characters to handle the '$' and '^' cases; otherwise the split doesn't work for them.

+-------------------+---------+----------------------+
|VALUES             |delimiter|split_value           |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
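
If you prefer to stay in the DataFrame API instead of registering a temp view, the same expression can be passed through selectExpr (a sketch of an equivalent call, not part of the original answer; df is the same input DataFrame):

// escape the regex metacharacters '$' and '^' before handing the delimiter to split()
df.selectExpr(
  "VALUES",
  "Delimiter",
  """split(VALUES, case when Delimiter in ('$', '^') then concat('\\', Delimiter) else Delimiter end) as split_value"""
).show(false)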

Preferential answered 14/7, 2021 at 16:59 Comment(0)

This is my latest solution:

import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}

val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
val solution = dept.withColumn("split_values", split_udf(col("VALUES"), col("Delimiter")))
solution.show(truncate = false)

It treats special characters in the Delimiter column literally, so the split works even when the delimiter is itself a regex metacharacter. The other answers do not work for

("50000.0\\0\\0\\", "\\")

and linusRian's answer needs the special characters to be escaped manually.
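
For reference (an illustration, not part of the original answer): Pattern.quote wraps the delimiter in \Q...\E so the regex engine matches it literally, and the -1 limit keeps trailing empty strings, which is what produces entries like [1, ] in the expected output.

import java.util.regex.Pattern

Pattern.quote("$")                    // "\Q$\E"     -- matched literally, not as a regex anchor
"1$".split(Pattern.quote("$"), -1)    // Array(1, "") -- limit -1 keeps the trailing empty string
"1$".split(Pattern.quote("$"))        // Array(1)     -- the default limit drops it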

Vitia answered 28/9, 2021 at 9:47 Comment(0)

Enhancing @dsk's answer to handle two more cases.


Scala Code

First Case: To handle special characters ('$', '^') we need to prefix the Delimiter column with the escape character backslash ('\\') before splitting, like below:

import org.apache.spark.sql.functions.expr
val new_deptArray = dept.withColumn("split_values",expr("""split(VALUES, Concat('\\',Delimiter))"""))
new_deptArray.show(truncate = false)

And this will be the output:

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Second Case: Notice that a few rows got an extra empty entry at the end, and ideally we don't want them. Since the new split_values column is of type Array<String>, we can just use array_remove to drop these empty strings.

import org.apache.spark.sql.functions.{expr, array_remove}
val new_deptArray = dept.withColumn("split_values",array_remove(expr("""split(VALUES, Concat('\\',Delimiter))"""), ""))
new_deptArray.show(truncate = false)

It will give this output:

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0]       |
|0@1000.0@          |@        |[0, 1000.0]           |
|1$                 |$        |[1]                   |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

PySpark Code

It works almost the same in PySpark; the only difference is that we need four backslashes ('\\\\') instead of the two that were sufficient in Scala, because a Python triple-quoted string still treats the backslash as an escape character, whereas a Scala triple-quoted string does not.

from pyspark.sql.functions import expr, array_remove
new_deptArray = dept.withColumn("split_values",array_remove(expr("""split(VALUES, concat("\\\\",Delimiter))"""), ""))
new_deptArray.show(truncate = False)

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0]       |
|0@1000.0@          |@        |[0, 1000.0]           |
|1$                 |$        |[1]                   |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Thanks.

Lacuna answered 16/3 at 18:42 Comment(0)
