Reading Avro into spark using spark-avro

I'm unable to read Avro files using the spark-avro library. Here are the steps I took:

  • Got the jar from: http://mvnrepository.com/artifact/com.databricks/spark-avro_2.10/0.1
  • Invoked spark-shell using spark-shell --jars avro/spark-avro_2.10-0.1.jar
  • Executed commands as given in the git readme:

    import com.databricks.spark.avro._
    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    val episodes = sqlContext.avroFile("episodes.avro") 
    
  • The action sqlContext.avroFile("episodes.avro") fails with the following error:

    scala> val episodes = sqlContext.avroFile("episodes.avro")
    java.lang.IncompatibleClassChangeError: class com.databricks.spark.avro.AvroRelation has interface org.apache.spark.sql.sources.TableScan as super class
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    
Saltwater answered 7/8, 2015 at 18:20 Comment(1)
Instead of downloading the .jar file yourself, you can reference the package and the shell will pull down the package and any of its dependencies for you: spark-shell --packages com.databricks:spark-avro_2.10:1.1.0 (comment by Lukey)

My bad. The readme clearly says:

Versions

Spark changed how it reads / writes data in 1.4, so please use the correct version of this dedicated for your spark version

1.3 -> 1.0.0

1.4+ -> 1.1.0-SNAPSHOT

I was using Spark 1.3.1 with spark-avro 1.1.0. When I switched to spark-avro 1.0.0, it worked.
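
To make the version table above concrete, here is a minimal sketch of a lookup that maps a Spark version string to the spark-avro version listed in the quoted readme. The object `SparkAvroVersions` and the function `sparkAvroVersionFor` are hypothetical names for illustration and are not part of spark-avro; the sketch only encodes the two rows quoted above and returns `None` for anything else.

```scala
import scala.util.Try

// Hypothetical helper, not part of spark-avro: encodes only the two rows
// of the version table quoted from the readme (1.3 -> 1.0.0, 1.4+ -> 1.1.0-SNAPSHOT).
object SparkAvroVersions {
  def sparkAvroVersionFor(sparkVersion: String): Option[String] =
    sparkVersion.split("\\.").toList match {
      // Only the leading "major.minor" components matter, e.g. "1.3.1" -> (1, 3)
      case major :: minor :: _ =>
        Try((major.toInt, minor.toInt)).toOption.flatMap {
          case (1, 3)           => Some("1.0.0")
          case (1, m) if m >= 4 => Some("1.1.0-SNAPSHOT")
          case _                => None // not covered by the quoted table
        }
      case _ => None // not a "major.minor[.patch]" version string
    }
}
```

In spark-shell, the running Spark version is available as `sc.version`, so `SparkAvroVersions.sparkAvroVersionFor(sc.version)` would show which row of the table applies before you pick a jar.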

Saltwater answered 7/8, 2015 at 18:30 Comment(0)

Since the spark-avro module is external, there is no .avro API in DataFrameReader or DataFrameWriter.

To load/save data in Avro format, you need to specify the data source option format as avro.

Example:

val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
Wesla answered 30/1, 2020 at 13:53 Comment(0)
import org.apache.spark.sql.SparkSession

// Placeholder values; set these for your environment
val appName = "AvroExample"
val master = "local[*]"

val spark = SparkSession.builder()
  .appName(appName)
  .master(master)
  .getOrCreate()

// Avro files embed their schema, so the CSV-style options
// "header" and "inferSchema" are not needed here
val episodes = spark.read
  .format("com.databricks.spark.avro")
  .load("episodes.avro")

episodes.show(10)
Adolfoadolph answered 10/7, 2018 at 13:9 Comment(0)