How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?

I use Spark 2.2.0

I am reading a CSV file as follows:

val dataFrame = spark.read.option("inferSchema", "true")
                          .option("header", true)
                          .option("dateFormat", "yyyyMMdd")
                          .csv(pathToCSVFile)

There is one date column in this file, and all records have a value of 20171001 in this particular column.

The issue is that Spark infers the type of this column as integer rather than date. When I remove the "inferSchema" option, the type of that column is string.

There are no null values and no wrongly formatted lines in this file.

What is the reason/solution for this issue?

Palua asked 2/10, 2017 at 16:8 Comment(3)
You can try disabling ("inferSchema", "true") and providing a custom schema to read the CSV file. – Caiman
I can, but I guess that the "dateFormat" option is made to avoid doing what you suggested, right? – Palua
It needs to be escaped with double quotes if my memory is still good. – Highness

If my understanding is correct, the code implies the following order of type inference (the types listed first are tried first):

  • NullType
  • IntegerType
  • LongType
  • DecimalType
  • DoubleType
  • TimestampType
  • BooleanType
  • StringType

With that, I think the issue is that 20171001 matches IntegerType before TimestampType is even considered (and TimestampType uses the timestampFormat option, not dateFormat).
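
For illustration, a hedged sketch (assuming a hypothetical header column named date; the question does not name it): even handing inference the matching pattern via timestampFormat does not help, because the value parses as an Int first.

    // "20171001" parses successfully as an Int, so inference stops at
    // IntegerType and never reaches the TimestampType attempt.
    val inferred = spark.read
      .option("inferSchema", "true")
      .option("header", true)
      .option("timestampFormat", "yyyyMMdd")
      .csv(pathToCSVFile)

    inferred.printSchema()
    // root
    //  |-- date: integer (nullable = true)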

One solution would be to define the schema yourself and use it with the schema operator (of DataFrameReader), or to let Spark SQL infer the schema and then use the cast operator; the former is sketched below.

I'd choose the former if the number of fields is not high.
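
A minimal sketch of that approach, assuming a hypothetical two-column file with a date column (the question does not list the actual fields). With an explicit schema, DateType columns are parsed using the dateFormat option:

    import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

    // Explicit schema: the date column is declared as DateType, so the CSV
    // reader parses it with the dateFormat pattern instead of inferring Int.
    val schema = StructType(Seq(
      StructField("date", DateType),
      StructField("value", StringType)))

    val dataFrame = spark.read
      .option("header", true)
      .option("dateFormat", "yyyyMMdd")
      .schema(schema)
      .csv(pathToCSVFile)

    dataFrame.printSchema()
    // root
    //  |-- date: date (nullable = true)
    //  |-- value: string (nullable = true)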

Radically answered 2/10, 2017 at 16:54 Comment(0)

In this case you simply cannot depend on the schema inference due to format ambiguity.

Since the input can be parsed both as IntegerType (or any higher-precision numeric type) and as TimestampType, and the former has higher precedence (internally Spark tries IntegerType -> LongType -> DecimalType -> DoubleType -> TimestampType), the inference mechanism will never reach the TimestampType case.

To be specific, with schema inference enabled, Spark will call tryParseInteger, which will successfully parse the input and stop there. Each subsequent call will match the IntegerType case (the second case of the pattern match) and finish at the same tryParseInteger call.
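
Consistent with that analysis, a workaround is to let the column come back as integer and convert it afterwards. A minimal sketch, again assuming a hypothetical column name date; note that to_date(col, fmt) is available as of Spark 2.2, the asker's version:

    import org.apache.spark.sql.functions.{col, to_date}

    // The inferred column is an Int; cast it to string first, then parse
    // it with the yyyyMMdd pattern to obtain a proper DateType column.
    val withDate = dataFrame.withColumn(
      "date",
      to_date(col("date").cast("string"), "yyyyMMdd"))

    withDate.printSchema()
    // root
    //  |-- date: date (nullable = true)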

Bluebird answered 2/10, 2017 at 16:52 Comment(2)
"inference mechanism will never reach TimestampType case." I don't think that holds given github.com/apache/spark/blob/master/sql/core/src/main/scala/org/… – Radically
It holds. The order of clauses in the pattern match you've linked is not relevant. – Bluebird
