Is it possible to read and write Parquet using Java without a dependency on Hadoop and HDFS?

I've been hunting around for a solution to this question.

It appears to me that there is no way to embed reading and writing Parquet format in a Java program without pulling in dependencies on HDFS and Hadoop. Is this correct?

I want to read and write on a client machine, outside of a Hadoop cluster.

I started to get excited about Apache Drill, but it appears that it must run as a separate process. What I need is an in-process ability to read and write a file using the Parquet format.

Dwyer asked 6/2, 2017 at 22:53 Comment(0)

You can write the Parquet format outside a Hadoop cluster using the Java Parquet client API.

Here is sample code in Java that writes the Parquet format to the local disk.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.avro.AvroWriteSupport;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

public class Test {
    void test() throws IOException {
        // Parse the Avro schema that describes the records to be written.
        final String schemaLocation = "/tmp/avro_format.json";
        final Schema avroSchema = new Schema.Parser().parse(new File(schemaLocation));
        final MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
        final WriteSupport<GenericRecord> writeSupport = new AvroWriteSupport<>(parquetSchema, avroSchema);

        // Write to the local filesystem; no HDFS cluster is involved.
        final String parquetFile = "/tmp/parquet/data.parquet";
        final Path path = new Path(parquetFile);
        final ParquetWriter<GenericRecord> parquetWriter = new ParquetWriter<>(
                path, writeSupport, CompressionCodecName.SNAPPY,
                ParquetWriter.DEFAULT_BLOCK_SIZE, ParquetWriter.DEFAULT_PAGE_SIZE);

        // Build one record matching the schema and write it out.
        final GenericRecord record = new GenericData.Record(avroSchema);
        record.put("id", 1);
        record.put("age", 10);
        record.put("name", "ABC");
        record.put("place", "BCD");
        parquetWriter.write(record);
        parquetWriter.close();
    }
}
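
As noted in the comments below, the constructor-based API used above is deprecated in newer Parquet releases in favour of builders. The following is a minimal sketch of a builder-based equivalent, not from the original answer, assuming parquet-avro's AvroParquetWriter is available:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class BuilderWriteTest {
    // Builder-based equivalent of the deprecated ParquetWriter constructor above (a sketch).
    void write(Schema avroSchema, GenericRecord record) throws IOException {
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/parquet/data.parquet"))
                     .withSchema(avroSchema)
                     .withCompressionCodec(CompressionCodecName.SNAPPY)
                     .build()) {
            writer.write(record);
        }
    }
}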

avro_format.json:

{
   "type":"record",
   "name":"Pojo",
   "namespace":"com.xx.test",
   "fields":[
      {
         "name":"id",
         "type":[
            "int",
            "null"
         ]
      },
      {
         "name":"age",
         "type":[
            "int",
            "null"
         ]
      },
      {
         "name":"name",
         "type":[
            "string",
            "null"
         ]
      },
      {
         "name":"place",
         "type":[
            "string",
            "null"
         ]
      }
   ]
}
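
The question also asks about reading. As a minimal sketch (not part of the original answer, and assuming the file and schema written above), parquet-avro's AvroParquetReader can read the file back in-process:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadTest {
    public static void main(String[] args) throws Exception {
        // Read the records back from the local file written by the writer above.
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path("/tmp/parquet/data.parquet")).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}

Regarding the dependency question raised in the comments: typically org.apache.parquet:parquet-avro plus the Hadoop client jars (e.g. org.apache.hadoop:hadoop-client, which supplies org.apache.hadoop.fs.Path and the mapreduce output-format classes) need to be on the classpath, even for purely local reads and writes.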

Hope this helps.

Recall answered 14/2, 2017 at 10:53 Comment(8)
OK. This works (on Windows) if I have winutils.exe. I should have worded the question differently. I don't think I'm going to have winutils.exe available where I want to do this write (and read). However, as asked, this answers the question (though I will still need to figure out the read). Thank you. – Dwyer
I should also add that I found some deprecated APIs in your answer. I think one is expected to use the builders to create the AvroWriteSupport and ParquetWriter objects. – Dwyer
Yes, the constructors are deprecated, not the classes. As you said, we should use the builders. – Recall
But in our case the ParquetWriter only has an abstract builder. – Recall
Code examples aren't good without the proper "import" statements and dependency jars. – Nahuatlan
@Nahuatlan I've added the imports. – Giffer
@jon-hanson What jar/lib dependencies does this code need? – Lenity
Hello, I'm getting a java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat error. Any idea? – Undercut
