I want to Read/write protocol buffer messages from/to HDFS with Apache Spark. I found these suggested ways:
1) Convert protobuf messsages to Json with Google's Gson Library and then read/write them by SparkSql. This solution is explained in this link But I think doing that (convert to json) is an extra task.
2) Convert to Parquet file. There are parquet-mr and sparksql-protobuf github projects for this way but I don't want parquet file because I always work with all columns (not some columns) and in this way Parquet Format does not give me any gain (at least I think).
3) ScalaPB. May be it's what I am looking for. but in scala language that I don't know anything about it. I am looking for a java-based solution. This youtube video introduce scalaPB and explain how to use it (for scala developers).
4) Through the use of the sequence file and this is what I looking for, but found nothing about that. So, my question is: How can I write protobuf messages to sequence file on HDFS and from that? Any other suggestion will be useful.
5) Through twitter's Elephant-bird Library.