Write POJOs to a Parquet file using reflection

Hi, I'm looking for APIs to write Parquet files from POJOs that I have. I was able to generate an Avro schema using reflection and then create the Parquet schema using AvroSchemaConverter. However, I am not able to find a way to convert the POJOs to GenericRecords (Avro); otherwise I could use AvroParquetWriter to write the POJOs out to Parquet files. Any suggestions?

Haugen answered 18/10, 2014 at 0:53 Comment(0)

If you want to go through Avro, you have two options:

1) Let Avro generate your POJOs (see the Avro tutorial). The generated classes extend SpecificRecord and can then be used directly with AvroParquetWriter.

2) Write the conversion from your POJO to GenericRecord yourself. You can do this by hand, or, as a more generic solution, via reflection (a minimal hand-written conversion is sketched below). However, I ran into difficulties with this approach when I tried to read the data back: based on the supplied schema, Avro found the POJO class on the classpath and tried to instantiate a SpecificRecord instead of a GenericRecord. For that reason I went with option 1.
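
For option 2, a minimal hand-written conversion could look something like the sketch below. The User class, its fields, and the converter class are made up purely for illustration; your own POJO and field mapping would go in their place.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.reflect.ReflectData;

public class UserToAvroConverter {

    // Hypothetical POJO, used only for illustration.
    public static class User {
        public String name;
        public int age;
    }

    // Schema derived from the POJO via reflection, as the question already does.
    private static final Schema SCHEMA = ReflectData.AllowNull.get().getSchema(User.class);

    // Hand-written field-by-field mapping from the POJO to a GenericRecord.
    public static GenericRecord toGenericRecord(User user) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("name", user.name);
        record.put("age", user.age);
        return record;
    }
}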

Parquet now also supports writing POJOs directly; there is a pull request for this on the Parquet GitHub page. However, I don't think it is part of an official release yet; in other words, I did not find this code in Maven.

Mendelssohn answered 24/10, 2014 at 21:39 Comment(2)
Thank you for the suggestions, Karolovbrat. I will try these out. Ideally I would love to see the pull request released. I see there is a Parquet release 2.x coming out. – Haugen
You can do the following to fix the reader issue that @Mendelssohn stated: Configuration conf = new Configuration(); conf.set(AVRO_DATA_SUPPLIER, GenericDataSupplier.class.getName()); ParquetReader<GenericRecord> parquetReader = AvroParquetReader.<GenericRecord>builder(new Path(this.file.getAbsolutePath())).withConf(conf).build(); – Albaugh
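
Formatted for readability, and assuming that AVRO_DATA_SUPPLIER refers to AvroReadSupport.AVRO_DATA_SUPPLIER, the snippet from the comment above could be wrapped up roughly as follows (the helper class and method are made-up names, not part of any library):

import java.io.File;
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.avro.GenericDataSupplier;
import org.apache.parquet.hadoop.ParquetReader;

public class GenericRecordParquetReaders {

    // Open a reader that yields GenericRecord instances even when the original
    // POJO class is on the classpath, avoiding the SpecificRecord issue above.
    public static ParquetReader<GenericRecord> open(File file) throws IOException {
        Configuration conf = new Configuration();
        conf.set(AvroReadSupport.AVRO_DATA_SUPPLIER, GenericDataSupplier.class.getName());
        return AvroParquetReader.<GenericRecord>builder(new Path(file.getAbsolutePath()))
                .withConf(conf)
                .build();
    }
}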

DISCLAIMER: The following code was written when I was in a hurry. It is not efficient, and future versions of Parquet will surely address this more directly. That being said, it is a lightweight (if inefficient) approach to what you need. The strategy is POJO -> Avro -> Parquet:

  1. POJO -> Avro: declare a schema via reflection and create a writer and a reader based on that schema. At conversion time, write the object to a byte stream and read it back as Avro.
  2. Avro -> Parquet: use the AvroParquetWriter included in the parquet-mr project (a short sketch of this step follows the code below).

// These members live inside YOURCLASS, i.e. the POJO being converted.
// (Imports come from org.apache.avro.* and java.io.*.)

// Null-friendly Avro schema derived from the POJO via reflection.
private static final Schema avroSchema = ReflectData.AllowNull.get().getSchema(YOURCLASS.class);
// Serializes the POJO to Avro binary using reflection.
private static final ReflectDatumWriter<YOURCLASS> reflectDatumWriter = new ReflectDatumWriter<>(avroSchema);
// Deserializes that binary back into a generic (schema-only) record.
private static final GenericDatumReader<Object> genericRecordReader = new GenericDatumReader<>(avroSchema);

public GenericRecord toAvroGenericRecord() throws IOException {
    // Round-trip: write this POJO as Avro bytes, then read the bytes back as a GenericRecord.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    reflectDatumWriter.write(this, EncoderFactory.get().directBinaryEncoder(bytes, null));
    return (GenericRecord) genericRecordReader.read(null, DecoderFactory.get().binaryDecoder(bytes.toByteArray(), null));
}
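
For step 2 (Avro -> Parquet), a minimal sketch of pushing the resulting GenericRecords through AvroParquetWriter could look like the following. It assumes a reasonably recent parquet-avro with the builder API; the class and method names are made up, and the records are assumed to come from toAvroGenericRecord() above.

import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class GenericRecordParquetWriters {

    // Write already-converted GenericRecords (step 1 above) to a Parquet file.
    public static void write(List<GenericRecord> records, Schema avroSchema, String outputPath)
            throws IOException {
        try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path(outputPath))
                             .withSchema(avroSchema)
                             .build()) {
            for (GenericRecord record : records) {
                writer.write(record);
            }
        }
    }
}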

One more thing: the Parquet writers currently seem to be very strict about null fields, so make sure none of your fields are null before attempting to write to Parquet.

Pasteur answered 27/3, 2015 at 4:13 Comment(0)

I wasn't able to find an existing solution, so I implemented it myself. Here is the link to the implementation: https://gist.github.com/alexeygrigorev/eab72e40c6051e0163a6693054906d66

In short, it does the following:

  • uses reflection to get an Avro schema from the POJO
  • using the schema and reflection, it converts POJOs to GenericRecord objects (roughly as sketched below)
  • reflection is applied recursively if the POJO contains other POJOs or lists of POJOs
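
The gist itself is the authoritative version; very roughly, the non-recursive core of the field-copying idea could look like the sketch below (the class name is made up, and nested POJOs and lists are not handled here).

import java.lang.reflect.Field;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.reflect.ReflectData;

public class ReflectivePojoToAvro {

    // Copy each declared field of the POJO into a GenericRecord by name.
    // The real gist also recurses into nested POJOs and lists of POJOs.
    public static GenericRecord toGenericRecord(Object pojo) throws IllegalAccessException {
        Schema schema = ReflectData.AllowNull.get().getSchema(pojo.getClass());
        GenericRecord record = new GenericData.Record(schema);
        for (Field field : pojo.getClass().getDeclaredFields()) {
            if (schema.getField(field.getName()) == null) {
                continue; // skip fields that did not make it into the schema (e.g. static fields)
            }
            field.setAccessible(true);
            record.put(field.getName(), field.get(pojo));
        }
        return record;
    }
}
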
Lichfield answered 17/1, 2018 at 13:49 Comment(0)
