How to read/write protocol buffer messages with Apache Spark?

I want to Read/write protocol buffer messages from/to HDFS with Apache Spark. I found these suggested ways:

1) Convert protobuf messsages to Json with Google's Gson Library and then read/write them by SparkSql. This solution is explained in this link But I think doing that (convert to json) is an extra task.

2) Convert to Parquet file. There are parquet-mr and sparksql-protobuf github projects for this way but I don't want parquet file because I always work with all columns (not some columns) and in this way Parquet Format does not give me any gain (at least I think).

3) ScalaPB. May be it's what I am looking for. but in scala language that I don't know anything about it. I am looking for a java-based solution. This youtube video introduce scalaPB and explain how to use it (for scala developers).

4) Through the use of the sequence file and this is what I looking for, but found nothing about that. So, my question is: How can I write protobuf messages to sequence file on HDFS and from that? Any other suggestion will be useful.

5) Through twitter's Elephant-bird Library.

// Importing org.apache.hadoop.io package import org.apache.hadoop.io._ // As we need data in sequence file format to read. Let us see how to write first // Reading data from text file format val dataRDD = sc.textFile("/public/retail_db/orders") // Using null as key and value will be of type Text while saving in sequence file format // By Int and String, we do not need to convert types into IntWritable and Text // But for others we need to convert to writable object // For example, if the key/value is of type Long, we might have to // type cast by saying new LongWritable(object) dataRDD. map(x => (NullWritable.get(), x)). saveAsSequenceFile("/user/`whoami`/orders_seq") // Make sure to replace `whoami` with the appropriate OS user id // Saving in sequence file with key of type Int and value of type String dataRDD. map(x => (x.split(",")(0).toInt, x.split(",")(1))). saveAsSequenceFile("/user/`whoami`/orders_seq") // Make sure to replace `whoami` with the appropriate OS user id

Recommended topics

Hot tags