Splitting input into substrings in PIG (Hadoop)

Asked 9/9, 2009 at 14:42 Answered 17/6, 2010 at 16:36

Assume I have the following input in Pig:

some

And I would like to convert that into:

s
so
som
some

I've not (yet) found a way to iterate over a chararray in pig latin. I have found the TOKENIZE function but that splits on word boundries. So can "pig latin" do this or is this something that requires a Java class to do that?

Trista answered 9/9, 2009 at 14:42 Comment(0)

Niels, TOKENIZE takes a delimiter argument, so you can make it split each letter; however I can't think of a way to make it produce overlapping tokens.

It's pretty straightforward to write a UDF in Pig, though. You just implement a simple interface called EvalFunc (details here: http://wiki.apache.org/pig/UDFManual ). Pig was built around the idea of users writing their own functions to process most anything, and writing your own UDF is therefore a common and natural thing to do.

An even easier option, although not as efficient, is to use Pig streaming to pass your data through a script (I find whipping up a quick Perl or Python script to be faster than implementing Java classes for one-off jobs). There is an example of this here: http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/ -- it demonstrates the use of a pre-existing library, a Perl script, a UDF, and even an on-the-fly awk script.

Divinadivination answered 9/9, 2009 at 16:10 Comment(0)

Here is how you might do it with pig streaming and python without writing custom UDFs:

Suppose your data is just 1 column of words. The python script (lets call it wordSeq.py) to process things would be:

#!/usr/bin/python
### wordSeq.py ### [don't forget to chmod u+x wordSeq.py !]
import sys
for word in sys.stdin:
  word = word.rstrip()
  sys.stdout.write('\n'.join([word[:i+1] for i in xrange(len(word))]) + '\n')

Then, in your pig script, you tell pig you are using streaming with the above script and that you want to ship your script as necessary:

-- wordSplitter.pig ---
DEFINE CMD `wordSeq.py` ship('wordSeq.py');
W0 = LOAD 'words';
W = STREAM W0 THROUGH CMD as (word: chararray);

Feaster answered 13/11, 2009 at 8:59 Comment(0)

Use the piggybank library.

http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/piggybank/evaluation/string/SUBSTRING.html

Use like this:

REGISTER /path/to/piggybank.jar;
DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

OUTPUT = FOREACH INPUT GENERATE SUBSTRING((chararray)$0, 0, 10);

Gehlbach answered 17/6, 2010 at 16:36 Comment(0)

Recommended topics

Hot tags