How Can I Load Every File In a Folder Using PIG?

I have a folder of files created daily that all store the same type of information. I'd like to make a script that loads the newest 10 of them, UNIONs them, and then runs some other code on them. Since Pig already has an ls command, I was wondering if there is a simple way for me to get the 10 most recently created files and load them all under generic names using the same loader and options. I'm guessing it would look something like:

REGISTER /usr/local/lib/hadoop/hadoop-lzo-0.4.13.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-2.0.5.jar;
FOREACH file in some_path:
    file = LOAD 'file' 
    USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\t') 
    AS (i1, i2, i3);
Louanneloucks asked 7/9, 2011 at 20:38 Comment(0)

This is not something I've been able to do with Pig out of the box, but it can be done outside of the script with some sort of wrapper or helper script (bash, perl, etc.). Say you write a script called last10.sh that outputs your last 10 files, comma separated:

$ ./last10.sh
/input/file38,/input/file39,...,/input/file48

Something like this should do the trick for the most recent 10 files:

hadoop fs -ls /input/ | sort -k6,7 | tail -n10 | awk '{print $8}' | tr '\n' ','
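
Wrapped up as last10.sh, a minimal sketch might look like the following (assuming the input directory is /input; the final sed just strips the trailing comma that tr leaves behind):

#!/bin/bash
# last10.sh: print the 10 most recently modified files under /input
# as a single comma-separated list (no trailing comma).
hadoop fs -ls /input/ \
  | sort -k6,7 \
  | tail -n10 \
  | awk '{print $8}' \
  | tr '\n' ',' \
  | sed 's/,$//'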

Then you could do:

$ pig -p files="`last10.sh`" my_mr.pig

Then, in your pig script, do:

data = LOAD '$files'
       USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\t')
       AS (i1, i2, i3);

Pig loads up the separate files if they are comma separated like this. This would be equivalent to doing:

data = LOAD '/input/file38,/input/file39,...,/input/file48'
       USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\t')
       AS (i1, i2, i3);
Plane answered 7/9, 2011 at 22:26 Comment(5)
Sweet! Would be nicer if PIG provided me a way to do it directly, but this definitely works. Thanks!Louanneloucks
I agree. Pig is great at doing the analytic stuff, but when it comes to any sort of real integration outside of the analytic, it doesn't have much. My team has pretty much conceded that all of our pig scripts need to be wrapped in bash.Plane
Nevermind. Turns out pig does not like spaces, so something like pig -p files="file1 file2" script.pig does not work and dies with an "Encountered unexpected arguments on command line" error. Do you have a workaround for that?Louanneloucks
Oops! My bad! It likes commas, not spaces. I'm updating my answer to replace newlines with commas. Let me know if that works. I know some people that use { } around the paths like {file1,file2,file3}, but I think they have the same effect.Plane
Cool! My only suggestion is that you add one more pipe to awk at the end to get rid of the comma that is produced at the end. This would be your answer above with an added: | awk '{ print substr( $0, 1, length($0)-1 ) }'Louanneloucks

Donald Miner's answer still works perfectly well, but IMO there's a better approach to this now using Embedded Pig in Python. O'Reilly has a brief explanation here. There's also a presentation on why this is something you'd want to do, and how it works, here. Long story short, there's a lot of functionality it would be nice to have access to before running a Pig script in order to determine parts of the script. Wrapping and/or dynamically generating parts of the script in Jython lets you do that. Rejoice!
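
As a rough illustration of the idea (not taken from the linked material), an embedded script run with pig myscript.py could use Jython to find the newest files and bind them into a parameterized Pig script. The /input and /output paths, the schema, and the 10-file cutoff below are just placeholders borrowed from this question:

# myscript.py - a sketch of Embedded Pig (Pig 0.9+); run it with: pig myscript.py
from org.apache.pig.scripting import Pig
from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path

# Find the 10 most recently modified files under /input (placeholder path).
fs = FileSystem.get(Configuration())
statuses = sorted(fs.listStatus(Path('/input')),
                  key=lambda s: s.getModificationTime())
newest = [str(s.getPath()) for s in statuses[-10:]]

# Compile a parameterized Pig script, bind the comma-separated list, and run it.
P = Pig.compile(r"""
REGISTER /usr/local/lib/hadoop/hadoop-lzo-0.4.13.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-2.0.5.jar;
data = LOAD '$files'
       USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\\t')
       AS (i1, i2, i3);
-- ...whatever else you do with data...
STORE data INTO '$output';
""")
result = P.bind({'files': ','.join(newest), 'output': '/output/latest10'}).runSingle()
if not result.isSuccessful():
    raise RuntimeError('embedded Pig job failed')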

Louanneloucks answered 8/3, 2013 at 22:4 Comment(0)

I like the above 2 approaches. Just wanted to give one more option for Oozie enthusiasts. A Java action in Oozie writes a properties file to the location configured by "oozie.action.output.properties", and the Pig action takes that output and passes it to the Pig script. This is definitely not an elegant solution compared to the above 2, but I had trouble configuring embedded Pig scheduled from Java in Oozie, so I had to go with this solution. A rough sketch of the Java side is shown after the workflow below.

<workflow-app xmlns='uri:oozie:workflow:0.1' name='java-wf'>
<start to='java1' />

<action name='java1'>
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
           <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <main-class>org.apache.oozie.test.MyTest</main-class>
        <arg>${outputFileName}</arg>
        <capture-output/>
    </java>
    <ok to="pig1" />
    <error to="fail" />
</action>


<action name='pig1'>
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <script>script.pig</script>
        <param>MY_VAR=${wf:actionData('java1')['PASS_ME']}</param>
    </pig>
    <ok to="end" />
    <error to="fail" />
</action>

<kill name="fail">
    <message>Pig failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
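
For completeness, here is a rough sketch of what a main class like org.apache.oozie.test.MyTest could do (the class name and the PASS_ME key come from the workflow above; the /input path and the 10-file cutoff are assumptions): list the directory, pick the newest files, and write the comma-separated list into the properties file Oozie names via "oozie.action.output.properties", so the Pig action can read it through wf:actionData.

package org.apache.oozie.test;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: the /input path and the 10-file limit are assumptions.
public class MyTest {
    public static void main(String[] args) throws Exception {
        // args[0] would be ${outputFileName} from the workflow; unused in this sketch.
        // List /input and sort by modification time, newest last.
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] files = fs.listStatus(new Path("/input"));
        Arrays.sort(files, new Comparator<FileStatus>() {
            public int compare(FileStatus a, FileStatus b) {
                return Long.valueOf(a.getModificationTime())
                           .compareTo(Long.valueOf(b.getModificationTime()));
            }
        });

        // Build a comma-separated list of the 10 newest paths.
        StringBuilder list = new StringBuilder();
        for (int i = Math.max(0, files.length - 10); i < files.length; i++) {
            if (list.length() > 0) list.append(',');
            list.append(files[i].getPath().toString());
        }

        // Hand the list back to Oozie for <capture-output/> under the PASS_ME key.
        Properties props = new Properties();
        props.setProperty("PASS_ME", list.toString());
        File out = new File(System.getProperty("oozie.action.output.properties"));
        OutputStream os = new FileOutputStream(out);
        props.store(os, "");
        os.close();
    }
}

script.pig would then reference the list as $MY_VAR in its LOAD statement, just like $files in the answer above.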

Yelena answered 8/6, 2013 at 2:32 Comment(0)
