Skipping the header while loading a text file in Pig Latin
I have a text file whose first row contains the header. I want to do some operations on the data, but when loading the file using PigStorage it takes the header too. I just want to skip the header. Is it possible to do so (directly or through a UDF)?

This is the command I'm using to load the data:

input_file = LOAD '/home/hadoop/smdb_tracedata.csv'
    USING PigStorage(',')
    AS (trans:chararray, carrier:chararray, aainday:chararray);
Maximomaximum answered 1/10, 2013 at 11:44 Comment(4)
Please post the code you have tried. And before you do, take a brief look at sscce.org. – Gwennie
Dude, that goes in the question, not the comments. Also, this does not look like Python to me at all. Why did you tag the question with "python"? – Gwennie
Please pay attention to what I said in my two comments; otherwise the negative votes will just keep coming. I would also draw your attention to the fact that Stack Overflow has perfectly nice formatting features, so please use them; it's hard to read what you posted otherwise. – Gwennie
@ErikAllik Just so you know, he likely tagged the question with python because Pig functions can be written in Python. Also, for questions like this in Pig, it is very difficult to produce an SSCCE because of the documentation. – Hemispheroid
If you have Pig version 0.11, you could try this:

input_file = LOAD '/home/hadoop/smdb_tracedata.csv' USING PigStorage(',') AS (trans:chararray, carrier:chararray, aainday:chararray);

ranked = RANK input_file;

NoHeader = FILTER ranked BY (rank_input_file > 1);

Ordered = ORDER NoHeader BY rank_input_file;

New_input_file = FOREACH Ordered GENERATE trans, carrier, aainday;

This gets rid of the first row, leaving New_input_file exactly the same as the original without the header row (assuming the header is the first row in the file). Please note that the RANK operator is only available in Pig 0.11 and later, so if you have an earlier version you will need to find another way.

Edit: added the Ordered line to make sure New_input_file maintains the same order as the original input file.

Reinertson answered 1/10, 2013 at 12:42 Comment(3)
Note that this won't work if you need to load multiple CSV files. Also, are you sure that the lines in input_file will still be in the same order as in the file? – Hemispheroid
No, it will not work on multiple files (I hadn't thought of that when I responded). The given code will not preserve order; however, if you need to preserve order (and don't have the multiple-files problem), just add the line ordered = ORDER NoHeader BY rank_input_file; to get it in order. If you use NoHeader instead of New_input_file for later operations, you can use the rank to get the data back into the original order at any point in the code where you require it, by using ORDER BY rank. – Reinertson
There is a much easier way if you have Pig 0.12 or newer. See my answer using CSVExcelStorage. – Lamoree
Usually the way I solve this problem is to use a FILTER on something I know is in the header. For example, consider the following sample data:

STATE,NAME
MD,Bob
VA,Larry

I'll do:

B = FILTER A BY state != 'STATE';
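A fuller sketch of this approach, assuming the sample file above is stored at /tmp/people.csv (the path and relation names here are illustrative, not from the question):

```pig
-- Load the CSV including its header row; the two column names match the sample above
A = LOAD '/tmp/people.csv' USING PigStorage(',') AS (state:chararray, name:chararray);

-- Keep every row whose first field is not the header label
B = FILTER A BY state != 'STATE';

DUMP B;
-- (MD,Bob)
-- (VA,Larry)
```

Unlike the RANK approach, this also works when loading multiple files (each file's header row matches the filter), but only as long as no data row can legitimately contain the header value in that column.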
Mila answered 1/10, 2013 at 15:20 Comment(2)
This seems to be the only answer that works for multiline headers in multiple files. – Hultgren
This only works if the header has the same columns as the data. – Koenraad
Here is another way of doing this:

  • Load the complete file, including the header record, into a relation

    fileAllRecords = LOAD 'csvfilename' using PigStorage(',');
    
  • Use the Linux tail command to stream through only the data records

    fileDataRecords = STREAM fileAllRecords THROUGH `tail -n +2` AS (f1:chararray ..);
    
  • To verify that the header record has been removed, use the following commands:

    firstFewRecords = STREAM fileDataRecords THROUGH `head -20`;
    DUMP firstFewRecords;
    
Castanon answered 8/3, 2014 at 22:3 Comment(0)
You want to use CSVExcelStorage, found in piggybank. It allows you to set parameters for how to handle headers, line endings, quoted fields, and other CSV options. The constructor you want is only available in Pig versions 0.12 and later, and has the signature:

CSVExcelStorage(String delimiter, String multilineTreatmentStr, String eolTreatmentStr, String headerTreatmentStr) 

The Pig code is below:

REGISTER /usr/lib/pig/piggybank.jar;

input_file = LOAD '/home/hadoop/smdb_tracedata.csv'
    USING CSVExcelStorage(',', 'default', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    AS (trans:chararray, carrier:chararray, aainday:chararray);
Lamoree answered 15/10, 2015 at 17:20 Comment(0)
This kind of error generally occurs when you try to convert incompatible data types. I faced a similar issue, and the reason was that the file I was trying to load contained a header row, which produced the error. Other probable causes are the presence of NAs or spaces in the columns.

Carabin answered 26/3, 2016 at 5:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.