What are the common practices for writing Avro files with Spark (using the Scala API) in a flow like this:
- parse some logs files from HDFS
- for each log file, apply some business logic and generate an Avro file (or maybe merge multiple files)
- write Avro files to HDFS
I tried to use spark-avro, but it doesn't help much. Here is roughly what I have:

import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._  // enables .avro(...) on DataFrameWriter

val someLogs = sc.textFile(inputPath)
val rowRDD = someLogs.map { line =>
  createRow(...)
}

val sqlContext = new SQLContext(sc)
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
dataFrame.write.avro(outputPath)
This fails with the following error:
org.apache.spark.sql.AnalysisException:
Reference 'StringField' is ambiguous, could be: StringField#0, StringField#1, StringField#2, StringField#3, ...
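Judging from the duplicated StringField#0, StringField#1, ... in the message, my guess is that the schema contains the same field name more than once. Here is a self-contained sketch that I believe reproduces the problem; the toy schema, local master, and made-up rows are my stand-ins, not the real job:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.databricks.spark.avro._  // enables .avro(...) on DataFrameWriter

// Toy schema standing in for the real one: the same field name appears
// twice, which I suspect is what makes the column reference ambiguous
val schema = StructType(Seq(
  StructField("StringField", StringType),
  StructField("StringField", StringType)
))

val sc = new SparkContext(
  new SparkConf().setAppName("avro-repro").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val rowRDD = sc.parallelize(Seq(Row("a", "b"), Row("c", "d")))
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)

// I suspect this is the line that throws the AnalysisException;
// renaming the fields so they are unique should avoid the ambiguity
dataFrame.write.avro("/tmp/avro-out")
```

If duplicate field names are indeed the cause, then part of my question is what the idiomatic way is to rename or deduplicate columns before writing Avro.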