Convert file of JSON objects to Parquet file

Asked 11/2, 2014 at 0:54 Answered 15/7, 2015 at 12:7

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.

Is there any way to do this without first loading the data into Hive, etc and then using one of the Parquet connectors to generate an output file?

Ballesteros answered 11/2, 2014 at 0:54 Comment(1)

related : #30566010 – Cinque 5/6, 2015 at 5:13

Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.

First, you would infer the schema of your JSON:

kite-dataset json-schema sample-file.json -o schema.avsc

Then you can use that file to create a Parquet Hive table:

kite-dataset create mytable --schema schema.avsc --format parquet

And finally, you can load your JSON into the dataset.

kite-dataset json-import sample-file.json mytable

You can also import an entire directly stored in HDFS. In that case, Kite will use a MR job to do the import.

Archer answered 28/5, 2015 at 17:20 Comment(2)

FYI - I tried this and it needs a hadoop installation. I'm also trying to convert the files for use with drill and I don't have hadoop. – Tepefy 15/7, 2015 at 10:57

Correct. The link above is for installing on a Hadoop cluster. If you want, there is also a tarball distribution in maven central that includes the dependencies. The trouble with that one is that you have to decide what those dependencies should be, which varies by Hadoop distribution. That's why I recommend using the instructions that get the Hadoop dependencies from the cluster you're running on. – Archer 5/8, 2015 at 23:46

You can actually use Drill itself to create a parquet file from the output of any query.

create table student_parquet as select * from `student.json`;

The above line should be good enough. Drill interprets the types based on the data in the fields. You can substitute your own query and create a parquet file.

Dullard answered 17/1, 2015 at 5:52 Comment(0)

To complete the answer of @rahul, you can use drill to do this - but I needed to add more to the query to get it working out of the box with drill.

create table dfs.tmp.`filename.parquet` as select * from dfs.`/tmp/filename.json` t

I needed to give it the storage plugin (dfs) and the "root" config can read from the whole disk and is not writable. But the tmp config (dfs.tmp) is writable and writes to /tmp. So I wrote to there.

But the problem is that if the json is nested or perhaps contains unusual characters, I would get a cryptic

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IndexOutOfBoundsException:

If I have a structure that looks like members: {id:123, name:"joe"} I would have to change the select to

select members.id as members_id, members.name as members_name

select members.id as `members.id`, members.name as `members.name`

to get it to work.

I assume the reason is that parquet is a "column" store so you need columns. JSON isn't by default so you need to convert it.

The problem is I have to know my json schema and I have to build the select to include all the possibilities. I'd be happy if some knows a better way to do this.

Tepefy answered 15/7, 2015 at 12:7 Comment(1)

No need to select the individual columns. A simple select * should work. Since you are seeing an error, I guess your 'members' structure has different number of fields in different records. And also be aware drill as of today cannot handle schema changes within the same column. So if you have a column 'val1' which has a combination of string and integers for different records, then drill will not be able to handle it. – Dullard 6/8, 2015 at 17:18

Recommended topics

Hot tags