Pig Latin: Load multiple files from a date range (part of the directory structure)
S

11

29

I have the following scenario-

Pig version used: 0.7.0

Sample HDFS directory structure:

/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>

As you can see in the paths listed above, one of the directory names is a date stamp.

Problem: I want to load files from a date range say from 20100810 to 20100813.

I can pass the 'from' and 'to' of the date range as parameters to the Pig script, but how do I make use of these parameters in the LOAD statement? I am able to do the following:

temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);

The following works with hadoop:

hadoop fs -ls /user/training/test/{20100810..20100813}

But it fails when I try the same with LOAD inside the pig script. How do I make use of the parameters passed to the Pig script to load data from a date range?

Error log follows:

Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
        ... 14 more



Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
        at org.apache.pig.PigServer.openIterator(PigServer.java:521)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)

Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?

cheers

Snowblink answered 18/8, 2010 at 18:39 Comment(1)
For people who found this post when looking for ERROR 1066: Unable to open iterator for alias here is a generic solution.Outsell
S
21

Pig is processing your file name pattern using the Hadoop file glob utilities, not the shell's glob utilities. Hadoop's are documented here. As you can see, Hadoop does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,date3,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script that generates the list for you. Building such a list from a date range should be a trivial task in the scripting language of your choice. For my application, I've gone with the generated-list route, and it's working fine (CDH3 distribution).
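
As a rough illustration of the wrapper-script route, here is a minimal Python sketch that turns a 'from' and 'to' date into the brace list. The function name date_range_glob is made up for this example, and it assumes the directory names are plain YYYYMMDD stamps as in the question.

```python
from datetime import datetime, timedelta


def date_range_glob(start, end, fmt="%Y%m%d"):
    """Return a brace list such as '{20100810,20100811}' covering start..end inclusive.

    Hypothetical helper for illustration; assumes directory names are date stamps in `fmt`.
    """
    first = datetime.strptime(start, fmt).date()
    last = datetime.strptime(end, fmt).date()
    if last < first:
        raise ValueError("end date is before start date")
    days = (last - first).days
    # Step through real calendar days so month boundaries are handled correctly.
    stamps = [(first + timedelta(days=i)).strftime(fmt) for i in range(days + 1)]
    return "{" + ",".join(stamps) + "}"


if __name__ == "__main__":
    print("/user/training/test/" + date_range_glob("20100810", "20100813"))
```

The printed path can then be handed to the Pig script as a parameter value. Keep it quoted on the command line so the shell does not expand the braces itself.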

Shortbread answered 16/2, 2011 at 15:57 Comment(1)
link update hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/…Evette
H
31

As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is to simply use Pig parameters (which is a good way to make your script more reusable anyway):

shell:

pig -f script.pig -param input=/user/training/test/{20100810..20100812}

script.pig:

temp = LOAD '$input' USING SomeLoader() AS (...);
Harper answered 24/9, 2010 at 18:7 Comment(2)
This doesn't work at all, but it can appear to work because pig has appalling command line handling. You're using bash to generate a command line that will invoke pig. But it expands to pig -f script.pig -param input=/user/training/test/20100810 input=/user/training/test/20100811 input=/user/training/test/20100812 (change the pig to echo if you want to see it). Only the first input= is preceded by -param; the rest aren't pig parameter bindings at all. But pig simply stops processing command line arguments at the first unrecognised input=... and runs only the first date!Karlykarlyn
You can prove that pig's ignoring everything after the second input=... by using a pig script with a second parameter (say output), and putting the -param output=... after the input binding. You get an error about Undefined parameter : output.Karlykarlyn
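
To make the expansion described in these comments concrete, here is a small Python sketch that imitates what bash does to an unquoted {a..b} word. It is only a rough model written for this example (ascending integer ranges, one range per word), not bash's actual implementation; expand_numeric_range is a hypothetical name.

```python
import re
from typing import List


def expand_numeric_range(word: str) -> List[str]:
    """Imitate bash's {a..b} sequence expansion for one unquoted word.

    Simplified model: ascending integer ranges only, padded to the width of `a`.
    """
    m = re.search(r"\{(\d+)\.\.(\d+)\}", word)
    if m is None:
        return [word]
    lo, hi = int(m.group(1)), int(m.group(2))
    width = len(m.group(1))
    prefix, suffix = word[:m.start()], word[m.end():]
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(lo, hi + 1)]


# The command line as typed, split into words.
typed = ["pig", "-f", "script.pig", "-param",
         "input=/user/training/test/{20100810..20100812}"]

# What the program actually receives after the shell has expanded each word.
argv = [out for word in typed for out in expand_numeric_range(word)]
for arg in argv:
    print(arg)
```

Only the first input=... word directly follows -param; the other two arrive as separate, stray arguments. The range is also counted as plain integers, not calendar dates, so a range that crosses a month boundary would produce stamps that are not valid dates.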
N
10

I ran across this answer when I was having trouble creating a file glob in a script and then passing it as a parameter into a Pig script.

None of the current answers applied to my situation, but I did find a general answer that might be helpful here.

In my case, the shell expansion was happening before the value was passed into the script, which understandably caused problems with the Pig parser.

Simply surrounding the glob in double quotes protects it from being expanded by the shell, so it is passed as-is into the command.

WON'T WORK:

$ pig -f my-pig-file.pig -p INPUTFILEMASK='/logs/file{01,02,06}.log' -p OTHERPARAM=6

WILL WORK

$ pig -f my-pig-file.pig -p INPUTFILEMASK="/logs/file{01,02,06}.log" -p OTHERPARAM=6

I hope this saves someone some pain and agony.
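
If the pig command is launched from a wrapper script rather than typed at a prompt, another way to avoid shell expansion is to skip the shell entirely. The sketch below assumes a Python wrapper; build_pig_command and run_pig are hypothetical helper names, and the -f / -p flags are the ones used in the commands above.

```python
import subprocess


def build_pig_command(script, params):
    """Build an argv list for pig; each parameter becomes its own '-p NAME=value' pair."""
    cmd = ["pig", "-f", script]
    for name, value in params.items():
        cmd += ["-p", f"{name}={value}"]
    return cmd


def run_pig(script, params):
    """Launch pig without a shell, so braces and wildcards reach Pig untouched.

    Requires pig on PATH; not called in this sketch.
    """
    return subprocess.run(build_pig_command(script, params), check=True)


cmd = build_pig_command(
    "my-pig-file.pig",
    {"INPUTFILEMASK": "/logs/file{01,02,06}.log", "OTHERPARAM": "6"},
)
print(cmd)
```

Because subprocess.run receives a list and no shell is involved, each list element is passed to pig exactly as written and no quoting is needed.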

Navicular answered 15/12, 2011 at 22:34 Comment(0)
F
6

So since this works:

temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader()

but this does not work:

temp = LOAD '/user/training/test/{20100810..20100812}' USING SomeLoader()

Passing a full list to LOAD is inelegant, to say the least, if the date range spans, say, 300 days. I came up with this, and it works.

Say you want to load data from 2012-10-08 to today, 2013-02-14. You can do:

temp = LOAD '/user/training/test/{201210*,201211*,201212*,2013*}' USING SomeLoader()

then do a filter after that

filtered = FILTER temp BY (the_date>='2012-10-08')
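
For long ranges, the month-level wildcard list can be generated rather than typed. This is a minimal Python sketch of that idea; month_prefix_glob is a made-up name, and it assumes directory names start with a YYYYMM prefix.

```python
from datetime import date


def month_prefix_glob(start: date, end: date) -> str:
    """Return a brace list of 'YYYYMM*' prefixes covering every month from start to end.

    Hypothetical helper; it selects whole months, so partial first and last
    months are over-selected and still need filtering afterwards.
    """
    if end < start:
        raise ValueError("end date is before start date")
    prefixes = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        prefixes.append(f"{year:04d}{month:02d}*")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return "{" + ",".join(prefixes) + "}"


print("/user/training/test/" + month_prefix_glob(date(2012, 10, 8), date(2013, 2, 14)))
```

The glob deliberately matches whole months, so the FILTER step is still what trims the result to the exact start and end dates.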
Feline answered 14/2, 2013 at 22:21 Comment(1)
the wildcard in braces syntax e.g. "/test/{201210*, 201211*}" is very efficient and easy to implementScansorial
H
4

I found that this behaviour is caused by the Linux shell. The shell expands

 {20100810..20100812} 

to

  20100810 20100811 20100812

so the command you actually run is

bin/hadoop fs -ls 20100810 20100811 20100812

The HDFS API, however, does not expand this expression for you.

Harlandharle answered 15/9, 2010 at 10:12 Comment(0)
P
4
temp = LOAD '/user/training/test/2010081*/*' USING SomeLoader() AS (...);
This loads the data for 20100810 through 20100819.
temp = LOAD '/user/training/test/2010081{0,1,2}/*' USING SomeLoader() AS (...);
This loads the data for 20100810 through 20100812.

If the variable part is in the middle of the file path, concatenate the subfolder name or use '*' to match all files.
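
To check which directory names a pattern like these would pick up before running the job, a rough local preview can help. The Python sketch below only approximates the glob rules (it expands comma-separated braces by hand and uses fnmatch for '*' and '?'); it is not Hadoop's implementation, and expand_braces and preview are hypothetical names.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand comma-separated {a,b,c} alternatives into a list of plain patterns."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if m is None:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    expanded = []
    for alt in m.group(1).split(","):
        expanded.extend(expand_braces(head + alt + tail))
    return expanded


def preview(pattern, names):
    """Return the names matched by the pattern (approximation of glob matching)."""
    patterns = expand_braces(pattern)
    return [n for n in names if any(fnmatch.fnmatchcase(n, p) for p in patterns)]


# Example directory names: 20100808 .. 20100821
names = [str(n) for n in range(20100808, 20100822)]

print(preview("2010081*", names))
print(preview("2010081{0,1,2}", names))
print(preview("{20100810..20100813}", names))
```

In this model the '..' form is treated as one literal alternative and matches nothing, which is consistent with the "matches 0 files" error in the question.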

Procurator answered 24/7, 2011 at 13:35 Comment(0)
M
4

Thanks to dave campbell. Some of the answers above are wrong, even though they got some votes.

Here are my test results:

  • Works

    • pig -f test.pig -param input="/test_{20120713,20120714}.txt"
      • Cannot have space before or after "," in the expression
    • pig -f test.pig -param input="/test_201207*.txt"
    • pig -f test.pig -param input="/test_2012071?.txt"
    • pig -f test.pig -param input="/test_20120713.txt,/test_20120714.txt"
    • pig -f test.pig -param input=/test_20120713.txt,/test_20120714.txt
      • Cannot have space before or after "," in the expression
  • Doesn't Work

    • pig -f test.pig -param input="/test_{20120713..20120714}.txt"
    • pig -f test.pig -param input=/test_{20120713,20120714}.txt
    • pig -f test.pig -param input=/test_{20120713..20120714}.txt
Mellicent answered 23/7, 2012 at 1:42 Comment(0)
P
1

Do I need to make use of a higher language like Python to capture all date stamps in the range and pass them to LOAD as a comma separated list?

Probably you don't - this can be done using a custom Load UDF, or try rethinking your directory structure (this will work well if your ranges are mostly static).

Additionally: Pig accepts parameters, which may help you (maybe you could write a function that loads data for one day and unions it into the resulting set, but I don't know if that's possible).

Edit: writing a simple Python or bash script that generates the list of dates (folders) is probably the easiest solution; you then just have to pass it to Pig, and this should work fine.

Pincenez answered 18/8, 2010 at 19:57 Comment(1)
Thanks Wojtek. Well, the grid is already in place and it's not feasible to change the directory structure. I see that temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...); and hadoop fs -ls /user/training/test/{20100810,20100811,20100812} work fine. hadoop fs -ls /user/training/test/{20100810..20100812} also works, but temp = LOAD '/user/training/test/{20100810..20100812}' USING SomeLoader() AS (...); fails at dump temp or store temp.Snowblink
R
1

Building on Romain's answer: if you just want to parameterize the date, the shell command looks like this:

pig -param input="$(echo {20100810..20100812} | tr ' ' ,)" -f script.pig

pig:

temp = LOAD '/user/training/test/{$input}' USING SomeLoader() AS (...);

Please note the quotes.

Ropedancer answered 2/3, 2016 at 14:6 Comment(0)
H
0

Pig supports HDFS glob status,

so I think Pig can handle the pattern /user/training/test/{20100810,20100811,20100812}.

Could you paste the error logs?

Harlandharle answered 20/8, 2010 at 6:14 Comment(1)
Hi zjffdu, I have copied the error log into the question. ThanksSnowblink
T
0

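Here's a script I'm using to generate a list of dates and then pass this list to the Pig script as a parameter. It's a bit tricky, but it works for me.

For example:

DT=20180101   # first date (YYYYMMDD)
DAYS=6        # number of days after DT to include (example value)
DT_LIST=''
for ((i=0; i<=DAYS; i++))
do
    # GNU date: add i days to the start date
    d=$(date +%Y%m%d -d "${DT} +$i days")
    DT_LIST=${DT_LIST}$d','
done

# strip the trailing comma
DT_LIST=${DT_LIST%,}

# double quotes let the shell substitute DT_LIST without brace-expanding it;
# the braces make the comma-separated dates a single glob under xxx/yyy/
pig -p input_data="xxx/yyy/{${DT_LIST}}" script.pig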

Tittivate answered 27/6, 2020 at 0:34 Comment(0)
