use length function in substring in spark
Asked Answered
L

5

18

I am trying to use the length function inside a substring function in a DataFrame but it gives error

val substrDF = testDF.withColumn("newcol", substring($"col", 1, length($"col")-1))

below is the error

 error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: Int

I am using 2.1.

Lunna answered 21/9, 2017 at 21:21 Comment(0)
D
30

Function "expr" can be used:

val data = List("first", "second", "third")
val df = sparkContext.parallelize(data).toDF("value")
val result = df.withColumn("cutted", expr("substring(value, 1, length(value)-1)"))
result.show(false)

output:

+------+------+
|value |cutted|
+------+------+
|first |firs  |
|second|secon |
|third |thir  |
+------+------+
Duro answered 22/9, 2017 at 9:39 Comment(1)
Why use length at all here? https://mcmap.net/q/650861/-use-length-function-in-substring-in-sparkRoane
P
18

You could also use $"COLUMN".substr

val substrDF = testDF.withColumn("newcol", $"col".substr(lit(1), length($"col")-1))

Output:

val testDF = sc.parallelize(List("first", "second", "third")).toDF("col")
val result = testDF.withColumn("newcol", $"col".substr(org.apache.spark.sql.functions.lit(1), length($"col")-1))
result.show(false)
+------+------+
|col   |newcol|
+------+------+
|first |firs  |
|second|secon |
|third |thir  |
+------+------+
Pimental answered 24/3, 2018 at 19:27 Comment(1)
Imho this is a much better solution as it allows you to build custom functions taking a column and returning a column. E.g. in pyspark def foo(in:Column)->Column: return in.substr(2, length(in)) Without relying on aliases of the column (which you would have to with the expr as in the accepted answer.Sheik
M
2

You get that error because you the signature of substring is

def substring(str: Column, pos: Int, len: Int): Column 

The len argument that you are passing is a Column, and should be an Int.

You may probably want to implement a simple UDF to solve that problem.

val strTail = udf((str: String) => str.substring(1))
testDF.withColumn("newCol", strTail($"col"))
Morello answered 21/9, 2017 at 21:39 Comment(1)
Only implement customs udfs if it is really necessary as they are slower that built in functions. For this case this is not necessary.Elisabetta
K
1

If all you want is to remove the last character of the string, you can do that without UDF as well. By using regexp_replace :

testDF.show
+---+----+
| id|name|
+---+----+
|  1|abcd|
|  2|qazx|
+---+----+

testDF.withColumn("newcol", regexp_replace($"name", ".$" , "") ).show
+---+----+------+
| id|name|newcol|
+---+----+------+
|  1|abcd|   abc|
|  2|qazx|   qaz|
+---+----+------+
Kirsten answered 22/9, 2017 at 7:3 Comment(1)
I think a regex is an overkill for this.Spaghetti
A
-1

You have to use the SUBSTR function to achieve this.

val substrDF = testDF.withColumn("newcol", 'col.substr(lit(1), length('col)-1))

The first parameter is the position from which you want the data to be trimmed, the second parameter is the length of the trimmed field. (startPos: Int,len: Int)

Armilda answered 28/3, 2021 at 5:28 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.