Using PIG to load a file

I am very new to PIG and I am having what feels like a very basic problem. I have a line of code that reads:

A = load 'Sites/trial_clustering/shortdocs/*'
      AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);

where each file is basically a single line of 4 comma-separated words. However, PIG is not splitting the line into the 4 words. When I run DUMP A, I get:

(Money, coins, loans, debt,,,)

I have tried googling, but I cannot find what format my file needs to be in for PIG to interpret it properly. Please help!

Spangler answered 11/11, 2011 at 19:36 Comment(0)

Your problem is that Pig, by default, loads files delimited by tab, not comma. What's happening is that the whole line "Money, coins, loans, debt" ends up in your first column, word1. When you print it, you get the illusion of multiple columns, but really the first column holds your whole line and the other three are null.

To fix this, tell PigStorage to split on commas:

A = LOAD '...' USING PigStorage(',') AS (...);
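
As a fuller sketch of the same fix, the statement below reuses the path and schema from the question. The TRIM step is an assumption on my part: the sample line has a space after each comma, so splitting on ',' alone would most likely leave a leading space on word2 through word4. The alias B is only illustrative.

```pig
-- Split each line on commas instead of the default tab delimiter.
-- Path and schema are copied from the question.
A = LOAD 'Sites/trial_clustering/shortdocs/*' USING PigStorage(',')
      AS (word1:chararray, word2:chararray, word3:chararray, word4:chararray);

-- Assumption: fields are separated by ", " (comma plus space),
-- so strip the surrounding whitespace from every field.
B = FOREACH A GENERATE TRIM(word1) AS word1, TRIM(word2) AS word2,
                       TRIM(word3) AS word3, TRIM(word4) AS word4;

DUMP B;
```

If the files contain no spaces after the commas, the FOREACH step can be dropped and A used directly.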
Bendicta answered 12/11, 2011 at 1:39 Comment(3)
Thank you! This worked! Now I have a new question: how do I deal with a file delimited by newlines? I have tried - Spangler
Thank you, this worked. Now I have a new question: I cannot seem to make this work with a file delimited by newlines. A = LOAD '...' USING PigStorage('\n') AS (...); does not work, and neither does A = LOAD '...' USING PigStorage('\\n') AS (...); Thank you! - Spangler
PigStorage will treat each new line as another tuple. There is no way to specify that X lines should form one tuple. - Arly
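
To illustrate the point about newlines, here is a minimal sketch that assumes the files now hold one word per line (a hypothetical layout, not the one in the original question). Each line then arrives as its own single-field tuple, and one possible workaround is to collect the words afterwards with GROUP ... ALL. Note that this gathers the words from every matched file into a single bag, so it does not reproduce one tuple per file.

```pig
-- Assumption: each input line holds exactly one word.
-- With the default loader, every line becomes a one-field tuple.
W = LOAD 'Sites/trial_clustering/shortdocs/*' AS (word:chararray);

-- Collect all loaded words into one bag (across all matched files).
G = GROUP W ALL;

DUMP G;
```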
