convert scientific notation in string format to numeric in spark dataframe [duplicate]
Asked Answered
C

1

5
Day_Date,timeofday_desc,Timeofday_hour,Timeofday_minute,Timeofday_second,value
2017-12-18,12:21:02 AM,0,21,2,“1.779209040E+08”
2017-12-19,12:21:02 AM,0,21,2,“1.779209040E+08”
2017-12-20,12:30:52 AM,0,30,52,“1.779209040E+08”
2017-12-21,12:30:52 AM,0,30,52,“1.779209040E+08”
2017-12-22,12:47:10 AM,0,47,10,“1.779209040E+08”
2017-12-23,12:47:10 AM,0,47,10,“1.779209040E+08”
2017-12-24,02:46:59 AM,2,46,59,“1.779209040E+08”
2017-12-25,02:46:59 AM,2,46,59,“1.779209040E+08”
2017-12-26,03:10:27 AM,3,10,27,“1.779209040E+08”
2017-12-27,03:10:27 AM,3,10,27,“1.779209040E+08”
2017-12-28,03:52:08 AM,3,52,8,“1.779209040E+08”

I am trying to convert value column to 177920904

val df1 = df.withColumn("s", 'value.cast("Decimal(10,4)")).drop("value").withColumnRenamed("s", "value")

also tried casting value as Float, Double. Always get null as output

df1.select("value").show()


+-----------+
|   value   |
+-----------+
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|
|       null|

df.printSchema

root
 |-- Day_Date: string (nullable = true)
 |-- timeofday_desc: string (nullable = true)
 |-- Timeofday_hour: string (nullable = true)
 |-- Timeofday_minute: string (nullable = true)
 |-- Timeofday_second: string (nullable = true)
 |-- value: string (nullable = true)
Claudetteclaudia answered 30/4, 2019 at 5:58 Comment(0)
O
8

Just need to cast it to decimal with enough room to fit the number.

Decimal is Decimal(precision, scale), so Decimal(10, 4) means 10 digits in total, 6 at the left of the dot, and 4 to the right, so the number does not fit in your Decimal type.

From the documentation

precision represents the total number of digits that can be represented

scale represents the number of fractional digits. This value must be less than or equal to precision. A scale of 0 produces integral values, with no fractional part

Since you don't want any number to the right, you can try this

df.withColumn("s", 'value.cast("Decimal(10,0)"))

If you want to keep 4 decimal digits, you can just change it to

df.withColumn("s", 'value.cast("Decimal(14,4)"))

INPUT

df.show
+---------------+
|          value|
+---------------+
|1.779209040E+08|
+---------------+

OUTPUT

scala> df.withColumn("s", 'value.cast("Decimal(10,0)")).show
+---------------+---------+
|          value|        s|
+---------------+---------+
|1.779209040E+08|177920904|
+---------------+---------+

FULL SOLUTION

Without dropping nor renamig

val df1 = df.withColumn("value", 'value.cast("Decimal(10,0)"))

FIX INPUT DATA

As I said in the comment, the problem is that your numbers contain some weird characters around them, you should remove it before casting

ORIGINAL

scala> df.show
+----------+--------------+--------------+----------------+----------------+-----------------+
|  Day_Date|timeofday_desc|Timeofday_hour|Timeofday_minute|Timeofday_second|            value|
+----------+--------------+--------------+----------------+----------------+-----------------+
|2017-12-18|   12:21:02 AM|             0|              21|               2| ?1.779209040E+08|
|2017-12-19|   12:21:02 AM|             0|              21|               2|?1.779209040E+08?|
|2017-12-20|   12:30:52 AM|             0|              30|              52| ?1.779209040E+08|
|2017-12-21|   12:30:52 AM|             0|              30|              52| ?1.779209040E+08|
|2017-12-22|   12:47:10 AM|             0|              47|              10| ?1.779209040E+08|
|2017-12-23|   12:47:10 AM|             0|              47|              10| ?1.779209040E+08|
|2017-12-24|   02:46:59 AM|             2|              46|              59| ?1.779209040E+08|
|2017-12-25|   02:46:59 AM|             2|              46|              59| ?1.779209040E+08|
|2017-12-26|   03:10:27 AM|             3|              10|              27| ?1.779209040E+08|
|2017-12-27|   03:10:27 AM|             3|              10|              27| ?1.779209040E+08|
|2017-12-28|   03:52:08 AM|             3|              52|               8| ?1.779209040E+08|
+----------+--------------+--------------+----------------+----------------+-----------------+

There are many ways to remove them, a quick one is with an UDF and a regular expression to remove everything but numbers, letters, dot, + and -

 def clean(input: String) = input.replaceAll("[^a-zA-Z0-9\\+\\.-]", "")
 val cleanUDF = udf(clean _ )
df.withColumn("value", cleanUDF($"value").cast(DecimalType(10,0))).show
+----------+--------------+--------------+----------------+----------------+---------+
|  Day_Date|timeofday_desc|Timeofday_hour|Timeofday_minute|Timeofday_second|    value|
+----------+--------------+--------------+----------------+----------------+---------+
|2017-12-18|   12:21:02 AM|             0|              21|               2|177920904|
|2017-12-19|   12:21:02 AM|             0|              21|               2|177920904|
|2017-12-20|   12:30:52 AM|             0|              30|              52|177920904|
|2017-12-21|   12:30:52 AM|             0|              30|              52|177920904|
|2017-12-22|   12:47:10 AM|             0|              47|              10|177920904|
|2017-12-23|   12:47:10 AM|             0|              47|              10|177920904|
|2017-12-24|   02:46:59 AM|             2|              46|              59|177920904|
|2017-12-25|   02:46:59 AM|             2|              46|              59|177920904|
|2017-12-26|   03:10:27 AM|             3|              10|              27|177920904|
|2017-12-27|   03:10:27 AM|             3|              10|              27|177920904|
|2017-12-28|   03:52:08 AM|             3|              52|               8|177920904|
+----------+--------------+--------------+----------------+----------------+---------+
Ocampo answered 30/4, 2019 at 6:8 Comment(7)
don't know why but i am still getting nullClaudetteclaudia
With the code i posted? Post it yours again please, just to check if there are some minor errors or typos. Which is the data type of the numbers in the original dataframe? Strings or what?Ocampo
used same code df.withColumn("s", 'value.cast("Decimal(10,0)")).show, and my column s is still none. can you please post how do you read the csv?Claudetteclaudia
I created one from scratch in the shell. Please print a df.show and a df.printSchemaOcampo
I think the problem may be with those weirs quotation marks around the numberOcampo
if you print df.show you'll see some weird characters around your number, those are the ones which cause the problem when castingOcampo
ok i got the problem, its because of “1.779209040E+08” double quotes because of the excel it got different type of double quotes. with normal double quotes or by replacing these double quotes you can easily cast it with decimal.Claudetteclaudia

© 2022 - 2024 — McMap. All rights reserved.