Custom delimiter CSV reader in Spark

I would like to read a file with the following structure into Apache Spark.

628344092\t20070220\t200702\t2007\t2007.1370

The delimiter is \t. How can I specify this delimiter when using spark.read.csv()?

The CSV is much too big for pandas, which takes ages to read it. Is there some way that works similarly to

pandas.read_csv(file, sep='\t')

Thanks a lot!

Donatus answered 21/9, 2017 at 17:20 Comment(0)

Use spark.read.option("delimiter", "\t").csv(file), or use sep instead of delimiter.

If the delimiter is literally the two-character sequence \t, not the tab special character, escape the backslash: spark.read.option("delimiter", "\\t").csv(file)
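
As a minimal sketch (assuming an existing SparkSession named spark and a hypothetical tab-separated file at data.tsv):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "delimiter" and "sep" are interchangeable CSV options in Spark
df = spark.read.option("delimiter", "\t").csv("data.tsv")
df.show(5)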

Therese answered 21/9, 2017 at 17:21 Comment(5)
Is there a website with documentation for spark.read and the other readers? Thanks for the answer! :)Donatus
CSV support was merged in from this project: github.com/databricks/spark-csv It has some documentation. I'm personally just checking the code :)Alcmene
What's the difference between sep and delimiter?Signorino
@Signorino None, both mean the same thing :)Alcmene
Has this changed in newer Spark versions, so that the pandas-style call at the top is also possible?Banderole

This works for me and is much clearer (to me). As you mentioned, in pandas you would do:

df_pandas = pandas.read_csv(file_path, sep='\t')

In spark:

df_spark = spark.read.csv(file_path, sep='\t', header=True)

Please note that if the first row of your CSV does not contain the column names, you should set header = False, like this:

df_spark = spark.read.csv(file_path, sep='\t', header=False)

You can change the separator (sep) to fit your data.
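
Putting it together, a minimal sketch (assuming an existing SparkSession named spark and a hypothetical file_path) that also infers column types the way pandas does:

# First row holds column names; inferSchema guesses each column's type
df_spark = spark.read.csv(file_path, sep='\t', header=True, inferSchema=True)

# No header row: Spark assigns default names _c0, _c1, ...
df_headerless = spark.read.csv(file_path, sep='\t', header=False)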

Officialdom answered 21/10, 2021 at 14:27 Comment(0)

If you are using Spark SQL, you can use the DDL below with the OPTIONS clause to specify your delimiter.

CREATE TABLE sample_table
USING CSV
OPTIONS ('delimiter'='\t')
AS SELECT ...
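
The same statement can be issued from PySpark through spark.sql; a minimal sketch, assuming an existing SparkSession named spark and a hypothetical source file (the table is created directly over the file rather than via AS SELECT):

# The r-string keeps \t as the two-character sequence that Spark's
# CSV reader expands into a tab; 'delimiter' is the same option
# that spark.read uses.
spark.sql(r"""
    CREATE TABLE sample_table
    USING CSV
    OPTIONS ('delimiter' = '\t', 'path' = '/path/to/data.tsv')
""")
spark.sql("SELECT * FROM sample_table LIMIT 5").show()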

SparkSQL Documentation

Sibelle answered 16/11, 2022 at 17:41 Comment(0)
