How to read and write Map<String, Object> from/to a Parquet file in Java or Scala?

Looking for a concise example of how to read and write a Map<String, Object> from/to a Parquet file in Java or Scala.

Here is the expected structure, using com.fasterxml.jackson.databind.ObjectMapper as the serializer in Java (i.e. I'm looking for the equivalent using Parquet):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public static Map<String, Object> read(InputStream inputStream) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();
    return objectMapper.readValue(inputStream, new TypeReference<Map<String, Object>>() {});
}

public static void write(OutputStream outputStream, Map<String, Object> map) throws IOException {
    ObjectMapper objectMapper = new ObjectMapper();
    objectMapper.writeValue(outputStream, map);
}
Halvorson answered 1/6, 2015 at 4:10 Comment(4)
Check this: github.com/Parquet/parquet-mr/blob/master/parquet-pig/src/main/… – Seizing
This question does not show any research effort. – Unbuild
@Unbuild The question was purposefully distilled to its essence, but I'll add a level of detail. – Halvorson
@sid_dude that's a multi-file project for reading Apache Pig files (with all the schema "magic"); I need an example for Map<String, Object> without extraneous dependencies (also, use the answer box below to get credit ;) – Halvorson

I'm not all that familiar with Parquet, but from here:

import java.io.File;

import com.google.common.collect.ImmutableMap;
import com.google.common.io.Resources;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;

Schema schema = new Schema.Parser().parse(Resources.getResource("map.avsc").openStream());

File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
tmp.deleteOnExit();
tmp.delete();
Path file = new Path(tmp.getPath());

AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);

// Write a record with an empty map.
ImmutableMap<String, Integer> emptyMap = new ImmutableMap.Builder<String, Integer>().build();
GenericData.Record record = new GenericRecordBuilder(schema)
    .set("mymap", emptyMap).build();
writer.write(record);
writer.close();

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();

assertNotNull(nextRecord);
assertEquals(emptyMap, nextRecord.get("mymap"));
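
The map.avsc file itself isn't shown above. A minimal schema consistent with this code might look like the following, inlined as a string for illustration; the record name MapRecord is an assumption, and only the mymap field name comes from the code above:

import org.apache.avro.Schema;

// A guess at what map.avsc could contain: a record with a single
// int-valued map field called "mymap" (matching the code above).
Schema schema = new Schema.Parser().parse(
      "{ \"type\": \"record\", \"name\": \"MapRecord\", \"fields\": ["
    + "  { \"name\": \"mymap\", \"type\": { \"type\": \"map\", \"values\": \"int\" } }"
    + "] }");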

In your situation, replace ImmutableMap (from Google Guava) with a standard Map, as below:

Schema schema = new Schema.Parser().parse(Resources.getResource("map.avsc").openStream());

File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
tmp.deleteOnExit();
tmp.delete();
Path file = new Path(tmp.getPath());

AvroParquetWriter<GenericRecord> writer =
    new AvroParquetWriter<GenericRecord>(file, schema);

// This time the map is no longer empty.
Map<String, Object> map = new HashMap<String, Object>();
map.put("SOMETHING", new SOMETHING()); // SOMETHING is a placeholder for your own value type

GenericData.Record record = new GenericRecordBuilder(schema)
    .set("mymap", map).build();
writer.write(record);
writer.close();

AvroParquetReader<GenericRecord> reader = new AvroParquetReader<GenericRecord>(file);
GenericRecord nextRecord = reader.read();

assertNotNull(nextRecord);
assertEquals(map, nextRecord.get("mymap"));

I haven't tested the code, but give it a try.

Ouellette answered 5/6, 2015 at 10:13 Comment(1)
I think that would not work; the key is to serialize Map<String, Object>, and I am not sure map.avsc (the Avro schema?) can support the Object part (i.e. it must be a concrete type, which defeats the purpose). – Halvorson

I doubt a readily available solution exists. For maps, it is still possible to create an Avro schema, provided the map's values are of a primitive type, or of a complex type that in turn has only primitive-type fields.

In your case:

  • If you have a Map<String, Integer>, the generated schema will have map values of type int.
  • If you have a Map<String, CustomObject>:
    • a. If CustomObject has only primitive-type fields (int, float, char, ...), schema generation will be valid and can then be used to successfully convert to Parquet (see the sketch after this list).
    • b. If CustomObject has fields which are non-primitive, the generated schema will be malformed and the resulting ParquetWriter will fail.
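
To illustrate case a., here is a minimal sketch using Avro's reflection-based schema generation; the class names SchemaFromPojo and CustomObject (with its fields) are made up for the example:

import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class SchemaFromPojo {
    // Hypothetical value type with only primitive fields (case a. above)
    public static class CustomObject {
        int count;
        float ratio;
    }

    public static void main(String[] args) {
        // Reflection-based schema generation succeeds because every field is primitive
        Schema schema = ReflectData.get().getSchema(CustomObject.class);
        System.out.println(schema.toString(true));
    }
}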

To resolve this issue, you can try converting your object into JSON and then use the Apache Spark libraries to convert it to Parquet, as sketched below.
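
A minimal sketch of that round trip, assuming Spark 2.2+ on the classpath; the app name, the output path /tmp/map.parquet, and the class name are illustrative:

import java.util.Collections;
import java.util.Map;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MapViaSparkDemo {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("map-to-parquet")
                .master("local[*]")
                .getOrCreate();

        ObjectMapper mapper = new ObjectMapper();
        Map<String, Object> map = Collections.<String, Object>singletonMap("answer", 42);

        // Map -> JSON string -> DataFrame (Spark infers the schema) -> Parquet
        String json = mapper.writeValueAsString(map);
        Dataset<Row> df = spark.read().json(
                spark.createDataset(Collections.singletonList(json), Encoders.STRING()));
        df.write().mode("overwrite").parquet("/tmp/map.parquet");

        // Parquet -> DataFrame -> JSON string -> Map
        String jsonBack = spark.read().parquet("/tmp/map.parquet").toJSON().first();
        Map<String, Object> restored =
                mapper.readValue(jsonBack, new TypeReference<Map<String, Object>>() {});
        System.out.println(restored);

        spark.stop();
    }
}

Note that Spark infers a fixed schema from the JSON, so the flexibility of Object values is still bounded by what the inferred schema can represent.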

Spark answered 22/7, 2019 at 5:30 Comment(0)

Apache Drill is your answer!

Convert to Parquet: you can use the CTAS (CREATE TABLE AS SELECT) feature in Drill. By default, Drill creates a folder of Parquet files when executing the query below. You can substitute any query, and Drill writes the output of your query into Parquet files:

create table file_parquet as select * from dfs.`/data/file.json`;

Convert from Parquet: we also use the CTAS feature here, but we ask Drill to use a different format when writing the output:

alter session set `store.format`='json';
create table file_json as select * from dfs.`/data/file.parquet`;

Refer to http://drill.apache.org/docs/create-table-as-ctas-command/ for more information.

Beasley answered 6/8, 2015 at 17:27 Comment(1)
This does not really meet the spirit of the question, i.e. a standalone program; Drill is yet another heavy-ish framework dependency. – Tatty
