Pig local mode, group, or join = java.lang.OutOfMemoryError: Java heap space

Using Apache Pig version 0.10.1.21 (reported), CentOS release 6.3 (Final), jdk1.6.0_31 (the Hortonworks Sandbox v1.2 on VirtualBox, with 3.5 GB RAM).

$ cat data.txt
11,11,22
33,34,35
47,0,21
33,6,51
56,6,11
11,25,67

$ cat GrpTest.pig
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
DESCRIBE B;
DUMP B;

$ pig -x local GrpTest.pig

[Thread-12] WARN  org.apache.hadoop.mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
[Thread-12] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
[Thread-13] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@19a9bea3
[Thread-13] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
[Thread-13] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local_0002
java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
[main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B

The java.lang.OutOfMemoryError: Java heap space error occurs each time I use GROUP or JOIN in a pig script executed in local mode. There is no error when the script is executed in mapreduce mode on HDFS.

Question 1: How come there is an OutOfMemory error when the data sample is minuscule and local mode is supposed to use fewer resources than mapreduce mode?

Question 2: Is there a way to successfully run a small Pig script with GROUP or JOIN in local mode?

Determine answered 11/5, 2013 at 16:36 Comment(4)
I've never had any trouble doing groups or joins in local mapreduce mode, even on very large datasets... I imagine either your JVM's settings are screwed up, or your local Pig/Hadoop set some sort of maximum-memory setting to 0. Are you sure it's just GROUP and JOIN that fail locally? If you use a large amount of memory in a non-Pig-related Java program, what happens? (Retaliate)
Hi, this is the Hortonworks Sandbox. When I connect via the GUI (the Hortonworks Hue GUI, which lets you run Pig and Hive from a web browser against HDFS on the same sandbox), all the demos run fine with a much bigger dataset (10 MB), so I suppose the JVM handles a bigger load fine. This is clearly a bug in local mode: as soon as there is a GROUP or JOIN, Pig fails with the Java OutOfMemory error, regardless of the data sample size, and whether run from the Grunt shell or a Pig script. (Determine)
I don't know anything about Hortonworks, but if you do a query with DISTINCT locally, does that work fine? There are a bunch of Hadoop/Pig-related settings for the maximum memory allowed for shuffles, sorting, joining, etc. My guess is still that one of those is 0. (Retaliate)
For people who found this post while looking for ERROR 1066: Unable to open iterator for alias, here is a generic solution. (Pressmark)

Solution: force Pig to allocate less memory for the Java property io.sort.mb. I set it to 10 MB here and the error disappears. I'm not sure what the best value would be, but at least this allows practicing Pig syntax in local mode.

$ cat GrpTest.pig
--avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
set io.sort.mb 10;

A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = GROUP A BY f1;
DESCRIBE B;
DUMP B;
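
With io.sort.mb lowered, the same run completes in local mode. Assuming the data.txt sample above, the output should look roughly like this (the order of the groups, and of the tuples inside each bag, may differ):

$ pig -x local GrpTest.pig
B: {group: int,A: {(f1: int,f2: int,f3: int)}}
(11,{(11,11,22),(11,25,67)})
(33,{(33,34,35),(33,6,51)})
(47,{(47,0,21)})
(56,{(56,6,11)})
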
Determine answered 18/5, 2013 at 18:15 Comment(1)
Can I keep set io.sort.mb 10; in mapreduce mode as well, or should I remove it? (Quaquaversal)

The reason is that you have less memory allocated to Java locally than you do on your Hadoop cluster machines. This is actually a pretty common error in Hadoop. It mostly occurs when you create a really long relation in Pig at any point, and happens because Pig always wants to load an entire relation into memory rather than lazy-loading it in any way.

When you do something like GROUP BY where the tuple you're grouping by is non-sparse over many records, you frequently wind up creating single long relations at least temporarily since you're basically taking a whole bunch of individual relations and cramming them all into one single long relation. Either change your code so you don't wind up creating single very long relations at any point (i.e. group by something more sparse), or increase the memory available to Java.
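
For the second option (increasing the memory available to Java), here is a minimal sketch, assuming the stock Apache Pig launcher script: bin/pig reads the PIG_HEAPSIZE environment variable (heap size in MB) and appends PIG_OPTS to the JVM options it builds, and in local mode the whole job runs in that same process via LocalJobRunner, so a bigger heap directly helps. Check your distribution's bin/pig if these variables are ignored; the 512 MB figure below is just an example value.

# give the local-mode JVM a 512 MB heap for this run only
$ PIG_HEAPSIZE=512 pig -x local GrpTest.pig

# equivalently, pass the JVM flag yourself
$ PIG_OPTS="-Xmx512m" pig -x local GrpTest.pig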

Brendonbrenk answered 15/5, 2013 at 3:29 Comment(1)
Please have a look at my initial post. The entire size of the data used in the example is less than 100 bytes; in other words, less than the length of this comment. Regardless of the smartness of Pig's underlying plumbing, there is no excuse for it to fail with OutOfMemory when there is absolutely no memory pressure. This is clearly a bug. (Determine)
