Provide schema while reading csv file as a dataframe in Scala Spark
I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be since I know my csv file. Also, I am using the spark-csv package to read the file. I am trying to specify the schema like below.

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("schema","project: string ,article: string ,requests: integer ,bytes_served: long")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

But when I check the schema of the dataframe I created, it seems to have taken its own schema. Am I doing anything wrong? How do I make Spark pick up the schema I specified?

> pagecount.printSchema
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
Gord answered 7/10, 2016 at 22:2 Comment(1)
Which version of Spark are you using?Twopenny
104

Try the code below; you need not specify the schema. When you set inferSchema to true, Spark should infer the schema from your csv file.

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

If you want to manually specify the schema, you can do it as below:

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("project", StringType, true),
  StructField("article", StringType, true),
  StructField("requests", IntegerType, true),
  StructField("bytes_served", DoubleType, true))
)

val pagecount = sqlContext.read.format("csv")
  .option("delimiter"," ").option("quote","")
  .option("header", "true")
  .schema(customSchema)
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
Twopenny answered 8/10, 2016 at 0:7 Comment(4)
I tried executing the code but it gave me this error: <console>:30: error: not found: value StructType.Gord
Theoretically I know we can specify the schema, but I am lost on the syntax for doing so. Is there any reference I can look up? I checked the official documentation, but it does not cover this case and does not have many examples.Gord
This doesn't answer the question although it is helpful. How can he provide the schema?Planography
When I tried the option("inferSchema", "true"), I got java.lang.UnsupportedOperationException: Schema inference is not supported for format: csv. Please specify the schema. This was for a readStream though, so maybe that's the difference.Lymphocytosis
24

For those interested in doing this in Python, here is a working version.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

customSchema = StructType([
    StructField("IDGC", StringType(), True),
    StructField("SEARCHNAME", StringType(), True),
    StructField("PRICE", DoubleType(), True)
])
productDF = spark.read.load('/home/ForTesting/testProduct.csv', format="csv", header="true", sep='|', schema=customSchema)

testProduct.csv
ID|SEARCHNAME|PRICE
6607|EFKTON75LIN|890.88
6612|EFKTON100HEN|55.66

Hope this helps.

Conspecific answered 9/8, 2018 at 9:3 Comment(0)
18

I'm using the solution provided by Arunakiran Nulu in my analysis (see the code). Although it assigns the correct types to the columns, all the values returned are null. Previously, I tried the option .option("inferSchema", "true") and it returned the correct values in the dataframe (although with different types).

val customSchema = StructType(Array(
    StructField("numicu", StringType, true),
    StructField("fecha_solicitud", TimestampType, true),
    StructField("codtecnica", StringType, true),
    StructField("tecnica", StringType, true),
    StructField("finexploracion", TimestampType, true),
    StructField("ultimavalidacioninforme", TimestampType, true),
    StructField("validador", StringType, true)))

val df_explo = spark.read
        .format("csv")
        .option("header", "true")
        .option("delimiter", "\t")
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss") 
        .schema(customSchema)
        .load(filename)

Result

root
 |-- numicu: string (nullable = true)
 |-- fecha_solicitud: timestamp (nullable = true)
 |-- codtecnica: string (nullable = true)
 |-- tecnica: string (nullable = true)
 |-- finexploracion: timestamp (nullable = true)
 |-- ultimavalidacioninforme: timestamp (nullable = true)
 |-- validador: string (nullable = true)

and the table is:

|numicu|fecha_solicitud|codtecnica|tecnica|finexploracion|ultimavalidacioninforme|validador|
+------+---------------+----------+-------+--------------+-----------------------+---------+
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
|  null|           null|      null|   null|          null|                   null|     null|
Motorbus answered 14/3, 2018 at 14:59 Comment(4)
It looks like .option("timestampFormat", "yyyy/mm/dd HH:mm:ss") should instead be .option("timestampFormat", "yyyy/MM/dd HH:mm:ss"). [Notice the capital MM for month] Otherwise it will be interpreting the month digits as the minutes of the timestamp.Frothy
Yes! You are right! I didn't notice. I'll edit my answer. ThanksMotorbus
If you have a DateType column which may contain 'null' values, set .option("nullValue", "null") otherwise it considers entire row with null values.Mesmerism
I'm experiencing the same problem - did you ever get this working in scala?Curricle
16

The previous solutions have used a custom StructType.

With spark-sql 2.4.5 (Scala version 2.12.10) it is now possible to specify the schema as a string using the schema function:

import org.apache.spark.sql.SparkSession;

val sparkSession = SparkSession.builder()
            .appName("sample-app")
            .master("local[2]")
            .getOrCreate();

val pageCount = sparkSession.read
  .format("csv")
  .option("delimiter","|")
  .option("quote","")
  .schema("project string ,article string ,requests integer ,bytes_served long")
  .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
Wryneck answered 10/4, 2020 at 16:56 Comment(0)
14

Thanks to the answer by @Nulu, it works for pyspark with minimal tweaking (note the pyspark syntax: a Python list instead of Array, True instead of true, and type constructors called with parentheses):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

customSchema = StructType([
    StructField("project", StringType(), True),
    StructField("article", StringType(), True),
    StructField("requests", IntegerType(), True),
    StructField("bytes_served", DoubleType(), True)])

pagecount = sqlContext.read.format("com.databricks.spark.csv") \
         .option("delimiter", " ") \
         .option("quote", "") \
         .option("header", "false") \
         .schema(customSchema) \
         .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
Olshausen answered 30/8, 2017 at 1:54 Comment(2)
The code as originally posted didn't work in pyspark. What worked for me is: >>> customSchema = StructType().add("MovieID", IntegerType(), True).add("Title", StringType(), True).add("Genres", StringType(), True) >>> df = sqlContext.read.format("csv").option("delimiter",",").option("header", "true").schema(customSchema).load("movies.csv") >>> df DataFrame[MovieID: int, Title: string, Genres: string] >>>Checkerwork
Array is not a PySpark type; in Python you use a plain list [] instead of Array, and True instead of true.Deuce
9

Schema definition as a simple string

Just in case someone is interested in defining the schema as a simple string with date and timestamp columns:

Create the data file from a terminal or shell:

echo " 
2019-07-02 22:11:11.000999, 01/01/2019, Suresh, abc  
2019-01-02 22:11:11.000001, 01/01/2020, Aadi, xyz 
" > data.csv

Define the schema as a string:

    user_schema = 'timesta TIMESTAMP,date DATE,first_name STRING , last_name STRING'

Read the data:

    df = spark.read.csv(path='data.csv', schema = user_schema, sep=',', dateFormat='MM/dd/yyyy',timestampFormat='yyyy-MM-dd HH:mm:ss.SSSSSS')

    df.show(10, False)

    +-----------------------+----------+----------+---------+
    |timesta                |date      |first_name|last_name|
    +-----------------------+----------+----------+---------+
    |2019-07-02 22:11:11.999|2019-01-01| Suresh   | abc     |
    |2019-01-02 22:11:11.001|2020-01-01| Aadi     | xyz     |
    +-----------------------+----------+----------+---------+

Please note that defining the schema explicitly, instead of letting Spark infer it, also improves read performance, since Spark can skip the extra pass over the data that inference requires.

Greenlaw answered 13/6, 2019 at 20:33 Comment(1)
val user_schema = "timesta TIMESTAMP,date DATE,first_name STRING , last_name STRING" val mydf1 = spark.read.schema(user_schema).option("dateFormat","MM/dd/yyyy").option("timestampFormat","yyyy-MM-dd HH:mm:ss.SSSSSS").csv("data.csv")Terminology
4

Here's how you can work with a custom schema, a complete demo:

Shell code:

echo "
Slingo, iOS 
Slingo, Android
" > game.csv

Scala code:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{asc, desc}

val customSchema = StructType(Array(
  StructField("game_id", StringType, true),
  StructField("os_id", StringType, true)
))

val csv_df = spark.read.format("csv").schema(customSchema).load("game.csv")
csv_df.show 

csv_df.orderBy(asc("game_id"), desc("os_id")).show
csv_df.createOrReplaceTempView("game_view")
val sort_df = spark.sql("select * from game_view order by game_id, os_id desc")
sort_df.show 
Optics answered 9/1, 2018 at 23:31 Comment(1)
mujiang.blogspot.ca/2018/01/…Earreach
0
// import Library
import java.io.StringReader ;

import au.com.bytecode.opencsv.CSVReader

//filename

var train_csv = "/Path/train.csv";

//read as text file

val train_rdd = sc.textFile(train_csv)   

//use string reader to convert in proper format

var full_train_data  = train_rdd.map{line =>  var csvReader = new CSVReader(new StringReader(line)) ; csvReader.readNext();  }   

//declares  types

type s = String

// declare case class for schema

case class trainSchema (Loan_ID :s ,Gender :s, Married :s, Dependents :s,Education :s,Self_Employed :s,ApplicantIncome :s,CoapplicantIncome :s,
    LoanAmount :s,Loan_Amount_Term :s, Credit_History :s, Property_Area :s,Loan_Status :s)

//create a DataFrame from the RDD with the case-class schema (toDF needs `import spark.implicits._` outside spark-shell)

var full_train_data_with_schema = full_train_data.mapPartitionsWithIndex{ (idx, itr) =>
                     val rows = if (idx == 0) itr.drop(1) else itr  // skip the header row in the first partition
                     rows.toList.map(x => trainSchema(x(0),x(1),x(2),x(3),x(4),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12))).iterator }.toDF

Sandbox answered 30/3, 2019 at 18:7 Comment(0)
0

In pyspark 2.4 onwards, you can simply use the header parameter to pick up the column names from the file's header row:

data = spark.read.csv('data.csv', header=True)

Similarly, if using Scala you can use the header option as well, as in the sketch below.
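
A minimal Scala equivalent (a sketch, assuming an existing SparkSession named spark and the same data.csv):

// Sketch: read a CSV using the first line as column names (Scala)
val data = spark.read
  .option("header", "true")       // take column names from the header row
  .option("inferSchema", "true")  // optionally let Spark infer the column types as well
  .csv("data.csv")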

Trilingual answered 11/7, 2019 at 2:21 Comment(0)
0

You can also do it like this, using a SparkSession and implicits:

import org.apache.spark.sql.DataFrame
import sparkSession.implicits._
val pagecount: DataFrame = sparkSession.read
.option("delimiter"," ")
.option("quote","")
.option("inferSchema","true")
.csv("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")
.toDF("project","article","requests","bytes_served")

Corrosion answered 5/12, 2019 at 12:11 Comment(0)
0

If your Spark version is 3.0.1, you can use the following Scala script:

val df = spark.read.format("csv").option("delimiter",",").option("header",true).load("file:///LOCAL_CSV_FILE_PATH")

But this way, all column datatypes will be set to String, since no schema is given and inferSchema is not enabled. If you need typed columns, see the sketch below.
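
If typed columns are needed, a DDL-style schema string can be supplied instead (a sketch, reusing the column names from the question and the placeholder path above):

// Sketch: same read, but with an explicit DDL schema so the columns get proper types
val typedDf = spark.read.format("csv")
  .option("delimiter", ",")
  .option("header", true)
  .schema("project STRING, article STRING, requests INT, bytes_served LONG")
  .load("file:///LOCAL_CSV_FILE_PATH")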

Syd answered 25/6, 2021 at 7:45 Comment(0)
-1

This is one option where we can pass the column names to the dataframe while loading the CSV.

import pandas
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv("C:/Users/NS00606317/Downloads/Iris.csv", names=names, header=0)
print(dataset.head(10))

Output

    sepal-length  sepal-width  petal-length  petal-width        class
1            5.1          3.5           1.4          0.2  Iris-setosa
2            4.9          3.0           1.4          0.2  Iris-setosa
3            4.7          3.2           1.3          0.2  Iris-setosa
4            4.6          3.1           1.5          0.2  Iris-setosa
5            5.0          3.6           1.4          0.2  Iris-setosa
6            5.4          3.9           1.7          0.4  Iris-setosa
7            4.6          3.4           1.4          0.3  Iris-setosa
8            5.0          3.4           1.5          0.2  Iris-setosa
9            4.4          2.9           1.4          0.2  Iris-setosa
10           4.9          3.1           1.5          0.1  Iris-setosa
Edwinedwina answered 15/3, 2019 at 10:4 Comment(1)
Good answer, but the question is referring to Spark DataFrame, not the one for PandasDorladorlisa
-1

Here is my solution:

import org.apache.spark.sql.types._
import org.apache.spark.sql.DataFrame

val spark = org.apache.spark.sql.SparkSession.builder.
  master("local[*]").
  appName("Spark CSV Reader").
  getOrCreate()

val movie_rating_schema = StructType(Array(
  StructField("UserID", IntegerType, true),
  StructField("MovieID", IntegerType, true),
  StructField("Rating", DoubleType, true),
  StructField("Timestamp", TimestampType, true)))

val df_ratings: DataFrame = spark.read.format("csv").
  option("header", "true").
  option("mode", "DROPMALFORMED").
  option("delimiter", ",").
  //option("inferSchema", "true").
  option("nullValue", "null").
  schema(movie_rating_schema).
  load(args(0)) //"file:///home/hadoop/spark-workspace/data/ml-20m/ratings.csv"

val movie_avg_scores = df_ratings.rdd.map(_.toString()).
  map(line => {
    // drop "[", "]" and then split the string
    val fields = line.substring(1, line.length() - 1).split(",")
    // extract (movie id, rating)
    (fields(1).toInt, fields(2).toDouble)
  }).
  groupByKey().
  map(data => {
    val avg: Double = data._2.sum / data._2.size
    (data._1, avg)
  })
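
For comparison, the same aggregation can also be expressed with the DataFrame API instead of round-tripping through row strings (a sketch based on the df_ratings above):

// Sketch: average rating per movie using DataFrame operations
import org.apache.spark.sql.functions.avg

val movie_avg_scores_df = df_ratings
  .groupBy("MovieID")
  .agg(avg("Rating").as("avg_rating"))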
Prader answered 16/5, 2019 at 6:27 Comment(0)
