Filtering null values with pig
Asked Answered
B

4

14

It looks like a silly problem, but I can´t find a way to filter null values from my rows. This is the result when I dump the object geoinfo:

DUMP geoinfo;
([longitude#70.95853,latitude#30.9773])
([longitude#-9.37944507,latitude#38.91780853])
(null)
(null)
(null)
([longitude#-92.64416,latitude#16.73326])
(null)
(null)
([longitude#-9.15199849,latitude#38.71179122])
([longitude#-9.15210796,latitude#38.71195131])

here is the description

DESCRIBE geoinfo;
geoinfo: {geoLocation: bytearray}

What I'm trying to do is to filter null values like this:

geoinfo_no_nulls = FILTER geoinfo BY geoLocation is not null;

but the result remains the same. nothing is filtered.

I also tried something like this

geoinfo_no_nulls = FILTER geoinfo BY geoLocation != 'null';

and I got an error

org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot convert a map to a String

What am I doing wrong?

details, running on ubuntu, hadoop-1.0.3 with pig 0.9.3

pig -version Apache Pig version 0.9.3-SNAPSHOT (rexported) compiled Oct 24 2012, 19:04:03

java version "1.6.0_24" OpenJDK Runtime Environment (IcedTea6 1.11.4) (6b24-1.11.4-1ubuntu0.12.04.1) OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Busiek answered 31/10, 2012 at 18:26 Comment(1)
Please post your complete script.Octoroon
B
4

Thank you for your answers guys. It help me to find the path.

In the end the problem seams to be with the JsonLoader I was using. I don't know why exactly, but it seams to have a bug with null strings.

I finally changed my code to use https://github.com/kevinweil/elephant-bird.

the code now looks like this:

    register 'elephant-bird-core-3.0.0.jar'
    register 'elephant-bird-pig-3.0.0.jar'
    register 'google-collections-1.0.jar'
    register 'json-simple-1.1.jar'

    json_lines = LOAD '/twitterecho/tweets/stream/v1/json/2012_10_10/08' USING com.twitter.elephantbird.pig.load.JsonLoader();

    geo_tweets = FOREACH json_lines GENERATE (CHARARRAY) $0#'id' AS id, (CHARARRAY) $0#'geoLocation' AS geoLocation;

    tweets_grp = GROUP geo_tweets BY id;
    unique_tweets = FOREACH tweets_grp {
          first_tweet = LIMIT inpt 1;
          GENERATE FLATTEN(first_tweet);
    };

    only_not_nulls = FILTER geo_tweets BY geoLocation is not null;
    store only_not_nulls into '/twitter_data/results/geo_tweets';

cheers

Busiek answered 31/10, 2012 at 18:26 Comment(0)
T
3

I was having similar issue and something like this worked for me:

geoinfo_no_nulls = FILTER geoinfo BY geoLocation != '';
Toddtoddie answered 31/10, 2012 at 18:40 Comment(2)
I got the error >ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot convert a map to a StringBusiek
Try giving it as geoinfo.$0!='' or geoinfo.geoLocation!=''. Either of them should workToddtoddie
L
0

Assuming you want your geolocation to be a map with longitude and latitude as keys, you could LOAD your data as map and check for nulls as follows:

A = LOAD 'data' AS (f1:map[]);
B = FILTER A BY f1#longitude is not null and f1#latitude is not null;

It's kind of a workaround only, though.

Lacerate answered 31/10, 2012 at 22:55 Comment(0)
L
0

Try this:

geoinfo_no_nulls = FILTER geoinfo BY (geomap#'geoLocation' is not null);

As I can see from the error it is recognizing the type as a map. You need to specify the particular key of the map. If that doesn't work, you are normally not getting that data properly as a map. Please try as Fred indicates.

Laxation answered 7/11, 2012 at 12:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.