How to force STORE (overwrite) to HDFS in Pig?

When developing Pig scripts that use the STORE command, I have to delete the output directory before every run or the script halts with:

2012-06-19 19:22:49,680 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: Output Location Validation Failed for: 'hdfs://[server]/user/[user]/foo/bar More info to follow:
Output directory hdfs://[server]/user/[user]/foo/bar already exists
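
For reference, a minimal script that triggers this when foo/bar already exists (the relation name and input path are placeholders):

Rel = LOAD 'input.tsv';
STORE Rel INTO 'foo/bar'; -- fails with ERROR 6000 if foo/bar already exists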

So I'm searching for an in-Pig way to remove the directory automatically, one that doesn't choke if the directory doesn't exist at call time.

In the Pig Latin Reference I found the shell command invoker fs. Unfortunately, the Pig script breaks whenever an invoked command returns an error, so I can't use

fs -rmr foo/bar

(i.e. remove recursively), since it fails if the directory doesn't exist. For a moment I thought I might use

fs -test -e foo/bar

which is only a test and shouldn't break, or so I thought. However, Pig again interprets the test's return code on a non-existent directory as a failure and breaks.

There is a JIRA ticket for the Pig project that addresses my problem and suggests an optional OVERWRITE or FORCE_WRITE parameter for the STORE command. However, I'm stuck with Pig 0.8.1 out of necessity, and it has no such parameter.

Riff answered 19/6, 2012 at 22:28 Comment(0)

At last I found a solution on grokbase. Since it took me far too long to find, I will reproduce it here and add to it.

Suppose you want to store your output using the statement

STORE Relation INTO 'foo/bar';

Then, to delete the directory, you can call the following at the start of the script:

rmf foo/bar

No ";" or quotations required since it is a shell command.

I cannot reproduce it now, but at some point I got an error message (something about missing files) that I can only attribute to rmf interfering with map/reduce. So I recommend putting the call before any relation declaration; after SETs, REGISTERs and defaults should be fine.

Example:

-- setup: scheduler pool, UDF jar, parameter default
SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
-- delete the old output before any relation is declared (rmf does not fail if the path is missing)
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';
Riff answered 20/6, 2012 at 9:32 Comment(4)
Although this is indeed nice, it's not atomic. I would rather do it in three steps: 1) store into 'foobar-tmp', 2) rmf foo/bar, 3) mv 'foobar-tmp' to foo/bar (a sketch of this appears after these comments). – Mandamus
@MiguelPing: It looks to me like your approach should run into my initial problem, but for foobar-tmp instead of foo/bar. Storing first may also produce that elusive error I tentatively attributed to map/reduce. If your solution works on your side, could you turn it into an answer with an example script and provide your Pig version number? – Riff
@Riff my solution is similar to yours, I just added an extra step to guarantee that if something happens between the rmf and the STORE (say, an exception) you don't lose data. Pig scripts can fail at any time, so my solution isn't atomic either, but at least you don't run the risk of losing data. – Mandamus
Thank you so much for this! I was trying to look for a similar function, but somehow I couldn't locate it in the official documentation. – Clearway
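
A minimal sketch of the three-step approach from the comments above, using the placeholder paths from the accepted answer. Note that Pig may execute fs/rmf commands as soon as it parses them while deferring STOREs under multi-query execution, so the move may need to run with multi-query disabled (pig -M / -no_multiquery) or go into a follow-up script.

-- 1) store into a temporary directory (cleared first so the STORE cannot fail)
rmf foo/bar-tmp
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar-tmp';
-- 2) drop the old output, 3) move the new output into place
rmf foo/bar
fs -mv foo/bar-tmp foo/bar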

Once you use the fs command, there are a lot of ways to do this. For an individual file, I wound up adding this to the beginning of my scripts:

-- Delete a file (won't work for output, which will be a directory,
-- but will work for a file that gets copied or moved during
-- the script). touchz first makes sure the file exists,
-- so the rm does not fail when the file is missing.
fs -touchz top_100
rm top_100

For a directory:

-- Delete dir
fs -rm -r out
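
If your Hadoop version's fs -rm supports the -f flag (an assumption about your cluster; check hadoop fs -help), that also keeps the script from failing when the directory is missing:

-- -f: do not report an error if the target does not exist
fs -rm -r -f out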
Airminded answered 23/12, 2013 at 20:29 Comment(0)
