How do I split a column by using delimiters from another column in Spark/Scala

I have another question that is related to the split function. I am new to Spark/Scala.

Below is the sample data frame -


+-------------------+---------+
|             VALUES|Delimiter|
+-------------------+---------+
|       50000.0#0#0#|        #|
|          0@1000.0@|        @|
|                 1$|        $|
|1000.00^Test_string|        ^|
+-------------------+---------+

and I want the output to be -

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

I tried to split this manually -

dept.select(split(col("VALUES"), "#|@|\\$|\\^")).show()

and the output is -

+-----------------------+
|split(VALUES,#|@|\$|\^)|
+-----------------------+
|      [50000.0, 0, 0, ]|
|          [0, 1000.0, ]|
|                  [1, ]|
|   [1000.00, Test_st...|
+-----------------------+


But I want to pick up the delimiter from the Delimiter column automatically for a large dataset.

Addict answered 14/7, 2021 at 15:41 Comment(6)
Please add what you have tried and what failed for you in the question. You should provide a minimal reproducible example with your question. – Lamblike
I know that it can be done manually based on the above sample data frame - ``` dept.select(split(col("VALUES"), "#|@|\\$|\\^")).show() ``` and the output does match, but I don't want to put the delimiters in manually. – Addict
What is the logic for the extra column? – Grim
@Grim I have edited the question. I am not sure about the logic for the extra column as of now. My main concern is with the automatic fetching of the delimiters for large data sets. – Addict
A UDF would probably work here. – Donn
@Shashwat - Please see the answer below - Please do not hesitate to accept and upvote in case you are happy with the answer :) – Grim

You need to use expr() with split() to make the split dynamic:

from pyspark.sql import functions as F

df = spark.createDataFrame([("50000.0#0#0#", "#"), ("0@1000.0@", "@")], ["VALUES", "Delimiter"])
df = df.withColumn("split", F.expr("""split(VALUES, Delimiter)"""))
df.show()

+------------+---------+-----------------+
|      VALUES|Delimiter|            split|
+------------+---------+-----------------+
|50000.0#0#0#|        #|[50000.0, 0, 0, ]|
|   0@1000.0@|        @|    [0, 1000.0, ]|
+------------+---------+-----------------+
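
Since the question asks for Spark/Scala, the same idea can be sketched in Scala (a translation of the snippet above, not from the original answer; dept is the DataFrame from the question):

import org.apache.spark.sql.functions.expr

// split() is evaluated per row, so the delimiter is read from the Delimiter column
val result = dept.withColumn("split_values", expr("split(VALUES, Delimiter)"))
result.show(false)

As the comments point out, delimiters that are regex metacharacters ('$', '^') still need to be escaped with this approach.
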
Grim answered 14/7, 2021 at 16:23 Comment(5)
This is a really cool answer. What is the import for F? Is it regex.F? I couldn't get it working. – Lamblike
Hi @PubuduSitinamaluwa, expr() is under pyspark.sql.functions. You can import expr directly or use an alias to call it. Hope this helps. – Addict
Hi @Grim, I think we have to include escape characters to handle other cases like '$', '^', etc. – Addict
@PubuduSitinamaluwa from pyspark.sql import functions as F; from pyspark.sql import types as T; from pyspark.sql import Window as W – Grim
@Shashwat Will appreciate it if you can accept the answer :) Glad that it helped you. – Grim

EDIT: Please check the bottom of the answer for the Scala version.

You can use a custom user-defined function (pyspark.sql.functions.udf) to achieve this.

from typing import List

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType


def split_col(value: str, delimiter: str) -> List[str]:
    return str(value).split(str(delimiter))


udf_split = udf(lambda x, y: split_col(x, y), ArrayType(StringType()))

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ('50000.0#0#0#', '#'), ('0@1000.0@', '@'), ('1$', '$'), ('1000.00^Test_string', '^')
], schema='VALUES String, Delimiter String')

df = df.withColumn("split_values", udf_split(df['VALUES'], df['Delimiter']))

df.show(truncate=False)

Output

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Note that the split_values column contains a list of strings. You can also update the split_col function to do further transformations on the values.

EDIT: Scala version

import org.apache.spark.sql.functions.udf

import spark.implicits._

val data = Seq(("50000.0#0#0#", "#"), ("0@1000.0@", "@"), ("1$", "$"), ("1000.00^Test_string", "^"))
var df = data.toDF("VALUES", "Delimiter")

val udf_split_col = udf {(x:String,y:String)=> x.split(y)}

df = df.withColumn("split_values", udf_split_col(df.col("VALUES"), df.col("Delimiter")))

df.show(false)
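
One difference worth noting (my observation, not part of the original answer): Java/Scala String.split drops trailing empty strings by default, while Python's str.split keeps them, so the Scala UDF above returns [50000.0, 0, 0] rather than [50000.0, 0, 0, ]. If you want the trailing empty entries to match the PySpark output, pass -1 as the limit:

// keep trailing empty strings, mirroring Python's str.split behaviour
val udf_split_col_keep_empty = udf { (x: String, y: String) => x.split(y, -1) }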

Edit 2

To avoid the issue with special characters used in regexes, you can pass the delimiter as a Char instead of a String to the split() method, as follows.

val udf_split_col = udf { (x: String, y: String) => x.split(y.charAt(0)) }
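
For example (a quick illustration, not from the original answer): split(String) treats its argument as a regular expression, while split(Char) matches the character literally:

"1$".split("$")                      // Array(1$)  -- "$" is a regex anchor, so nothing is split
"1$".split('$')                      // Array(1)   -- the literal character is used as the separator
"1000.00^Test_string".split('^')     // Array(1000.00, Test_string)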
Lamblike answered 14/7, 2021 at 16:31 Comment(9)
OP is asking for Scala ;) – Donn
Thank you for pointing that out. But the concepts are the same. – Lamblike
Sure, but not everyone is able to "translate" Python to Scala. – Donn
I agree. Let me come up with a Scala version of it. – Lamblike
@Shashwat Hope you were able to solve the issue. I added a sample Scala version of it. – Lamblike
Thanks @PubuduSitinamaluwa, both methods work really well. – Addict
Hi @PubuduSitinamaluwa, I think we have to include escape characters to handle other cases like '$', '^', etc. – Addict
Good point. Or you can change the Delimiter column type to char if it is always a single character. – Lamblike
Let us continue this discussion in chat. – Addict

This is another way of handling this, using Spark SQL:

df.createOrReplaceTempView("test")

spark.sql("""select VALUES,delimiter,split(values,case when delimiter in ("$","^") then concat("\\",delimiter) else delimiter end) as split_value from test""").show(false)

Note that I included the case when statement to add escape characters to handle the '$' and '^' cases; otherwise the split doesn't work for them.

+-------------------+---------+----------------------+
|VALUES             |delimiter|split_value           |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+
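
If you prefer to stay in the DataFrame API instead of registering a temp view, the same expression can be passed through selectExpr (a sketch of an equivalent call, not part of the original answer; df is the same input DataFrame):

// escape the regex metacharacters '$' and '^' before handing the delimiter to split()
df.selectExpr(
  "VALUES",
  "Delimiter",
  """split(VALUES, case when Delimiter in ('$', '^') then concat('\\', Delimiter) else Delimiter end) as split_value"""
).show(false)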

Preferential answered 14/7, 2021 at 16:59 Comment(0)

This is my latest solution:

import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}

val split_udf = udf((value: String, delimiter: String) => value.split(Pattern.quote(delimiter), -1))
val solution = dept.withColumn("split_values", split_udf(col("VALUES"), col("Delimiter")))
solution.show(truncate = false)

It treats special characters in the Delimiter column literally, so the split works even when the delimiter is itself a regex metacharacter. The other answers do not work for

("50000.0\\0\\0\\", "\\")

and linusRian's answer needs the special characters to be escaped manually.
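
For reference (an illustration, not part of the original answer): Pattern.quote wraps the delimiter in \Q...\E so the regex engine matches it literally, and the -1 limit keeps trailing empty strings, which is what produces entries like [1, ] in the expected output.

import java.util.regex.Pattern

Pattern.quote("$")                    // "\Q$\E"     -- matched literally, not as a regex anchor
"1$".split(Pattern.quote("$"), -1)    // Array(1, "") -- limit -1 keeps the trailing empty string
"1$".split(Pattern.quote("$"))        // Array(1)     -- the default limit drops it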

Vitia answered 28/9, 2021 at 9:47 Comment(0)

Enhancing @dsk's answer to handle two more cases.


Scala Code

First Case: To handle special characters ('$', '^') we need to prefix the Delimiter column with the escape character backslash ('\\') before splitting, like below:

import org.apache.spark.sql.functions.expr
val new_deptArray = dept.withColumn("split_values",expr("""split(VALUES, Concat('\\',Delimiter))"""))
new_deptArray.show(truncate = false)

And this will be the output:

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0, ]     |
|0@1000.0@          |@        |[0, 1000.0, ]         |
|1$                 |$        |[1, ]                 |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Second Case: Notice that a few rows got an extra empty entry at the end, and ideally we don't want them. Since the new split_values column is of type Array<String>, we can just use array_remove to drop these empty strings.

import org.apache.spark.sql.functions.{expr, array_remove}
val new_deptArray = dept.withColumn("split_values",array_remove(expr("""split(VALUES, Concat('\\',Delimiter))"""), ""))
new_deptArray.show(truncate = false)

It will give this output:

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0]       |
|0@1000.0@          |@        |[0, 1000.0]           |
|1$                 |$        |[1]                   |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

PySpark Code

It works almost the same in PySpark; the only difference is that we need four backslashes ('\\\\') instead of the two that were sufficient in Scala, because a Python triple-quoted string still treats the backslash as an escape character, whereas a Scala triple-quoted string does not.

from pyspark.sql.functions import expr, array_remove
new_deptArray = dept.withColumn("split_values",array_remove(expr("""split(VALUES, concat("\\\\",Delimiter))"""), ""))
new_deptArray.show(truncate = False)

+-------------------+---------+----------------------+
|VALUES             |Delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0]       |
|0@1000.0@          |@        |[0, 1000.0]           |
|1$                 |$        |[1]                   |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

Thanks.

Lacuna answered 16/3 at 18:42 Comment(0)
