How do I parse JSON in Pig?
I have a lot of gzipped log files in S3 that contain three types of log lines: b, c, and i. Types i and c are both single-level JSON:

{"this":"that","test":"4"}

Type b is deeply nested JSON. I came across a gist that talks about compiling a JAR to make this work. Since my Java skills are less than stellar, I didn't really know what to do from there.

{"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}

Since the fields in types i and c are not always in the same order, specifying everything in the GENERATE regex is difficult. Is handling JSON (in a gzipped file) possible with Pig? I am using whichever version of Pig comes built into an Amazon Elastic MapReduce instance.

This boils down to two questions: 1) Can I parse JSON with Pig (and if so, how)? 2) If I can parse JSON (from a gzip'd logfile), can I parse nested JSON objects?

Bullbat answered 16/2, 2011 at 5:59 Comment(0)
8

After a lot of workarounds and working through things, I was able to get this done. I did a write-up on my blog about how to do it. It is available here: http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/

Bullbat answered 1/3, 2011 at 21:30 Comment(1)
"Error establishing a database connection" on the link – Cottar
17

Pig 0.10 comes with the built-in JsonStorage and JsonLoader functions.

See the Pig documentation for JSON load/store.
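A minimal sketch of the built-in pair, assuming Pig 0.10+; the path and field names here are hypothetical, and note that the built-in JsonLoader needs a schema (either declared inline, as below, or read from the .pig_schema side file that JsonStorage writes):

```pig
-- Load flat JSON records like {"this":"that","test":"4"},
-- declaring the schema up front
A = LOAD 'input.json' USING JsonLoader('this:chararray, test:chararray');

-- Write the records back out as JSON
STORE A INTO 'output_dir' USING JsonStorage();
```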

Ilka answered 3/7, 2012 at 22:0 Comment(1)
This question brings up a good concern: #15397050 – Module
5

Pig comes with a JSON loader. To load, you use:

A = LOAD 'data.json'
    USING PigJsonLoader();

To store, you can use:

STORE A INTO 'output.json'
    USING PigJsonLoader();

However, I'm not sure it supports gzipped data...
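On the gzip question: Hadoop-based load functions generally decompress input transparently based on the file extension, so pointing the loader at a .gz path often just works. A sketch, with the caveat that the path is hypothetical and whether PigJsonLoader in particular supports this is an assumption, not confirmed by its docs:

```pig
-- Hadoop input formats typically recognize the .gz extension
-- and decompress on the fly (assumption for this loader)
logs = LOAD 's3://bucket/logs/part-0000.json.gz'
       USING PigJsonLoader();
```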

Nahama answered 24/2, 2011 at 11:53 Comment(2)
Where/what version? 0.8.0 doesn't seem to know about it by default. – Aver
PigJsonLoader seems to be a separate package? github.com/mmay/PigJsonLoader – Sika
3

Please try this: https://github.com/a-b/elephant-bird

Muna answered 27/12, 2011 at 19:45 Comment(0)
2

We can do it by using JsonLoader, but we have to specify the schema for our JSON data, or else it may raise an error. Just follow the link below:

http://joshualande.com/read-write-json-apache-pig/

We can also do it by writing a UDF to parse it...
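For the deeply nested type-b records from the question, the declared schema can mirror the nesting with tuple syntax. A sketch assuming the built-in JsonLoader and the question's example record (path and field names taken from the question; whether this loader handles every nesting shape is not guaranteed):

```pig
-- Nested JSON like {"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}
B = LOAD 'type_b.json'
    USING JsonLoader('this: (foo:chararray, baz:(test:chararray), total:chararray)');

-- Project a nested field out of the tuple
totals = FOREACH B GENERATE this.total;
```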

Tilefish answered 3/7, 2014 at 10:49 Comment(0)
0

You can try using the Twitter elephant-bird JSON loader; it handles JSON data dynamically. But you have to be very precise with the schema.

api_data = LOAD 'file name' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

Budding answered 9/9, 2015 at 20:44 Comment(0)
0

I have seen usage of Twitter's elephant-bird increase a lot, and it is quickly becoming the go-to library for JSON parsing in Pig.

Example :

DEFINE TwitterJsonLoader com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true');

JsonInput = LOAD 'input_path' USING TwitterJsonLoader() AS (entity: map[]);

InputObjects = FOREACH JsonInput GENERATE (map[]) entity#'Object' AS JsonObject;

InputIds = FOREACH InputObjects GENERATE JsonObject#'id' AS id;
Carlson answered 29/2, 2016 at 23:22 Comment(0)
