I am currently using the code below to write parquet via Avro. This code writes it to a file system but I want to write to S3.
try {
StopWatch sw = StopWatch.createStarted();
Schema avroSchema = AvroSchemaBuilder.build("pojo", message.getTransformedMessage().get(0));
final String parquetFile = "parquet/data.parquet";
final Path path = new Path(parquetFile);
ParquetWriter writer = AvroParquetWriter.<GenericData.Record>builder(path)
.withSchema(avroSchema)
.withConf(new org.apache.hadoop.conf.Configuration())
.withCompressionCodec(CompressionCodecName.SNAPPY)
.withWriteMode(Mode.OVERWRITE)//probably not good for prod. (overwrites files).
.build();
for (Map<String, Object> row : message.getTransformedMessage()) {
StopWatch stopWatch = StopWatch.createStarted();
final GenericRecord record = new GenericData.Record(avroSchema);
row.forEach((k, v) -> {
record.put(k, v);
});
writer.write(record);
}
//todo: Write to S3. We should probably write via the AWS objects. This does not show that.
//https://mcmap.net/q/266144/-how-to-generate-parquet-file-using-pure-java-including-date-amp-decimal-types-and-upload-to-s3-windows-no-hdfs
writer.close();
System.out.println("Total Time: " + sw);
} catch (Exception e) {
//do somethign here. retryable? non-retryable? Wrap this excetion in one of these?
transformedParquetMessage.getOriginalMessage().getMetaData().addException(e);
}
This writes to a file fine, but how do I get it to stream it into the AmazonS3 api? I have found some code on the web using the Hadoop-aws jar, but that requires some Windows exe files to work and, of course, we want to avoid that. Currently I am using only:
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.9.2</version>
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-avro</artifactId>
<version>1.8.1</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>1.2.1</version>
</dependency>
So the question is, is there a way to intercept the output stream on the AvroParquetWriter so I can stream it to S3? The main reason I want to do this is for retries. S3 automagically retries up to 3 times. This would help us out a lot.