Add folder name to output Pig Latin

I have the following directory structure in HDFS:

logs_folder
   |---2021-03-01
          |---log1
          |---log2
          |---log3
   |---2021-03-02
          |---log1
          |---log2
   |---2021-03-03
          |---log1
          |---log2
...

The logs are plain text. The records contain no date because the date is already in the folder name. I want to read all the logs and save them in the following format:

date    id

where id is a field from the log and the date must be taken from the folder name. Expected output:

2021-03-01    id1
2021-03-01    id2
...
2021-03-02    id234
2021-03-02    id456
...

How can I add the date from the folder name to the output?


I found a closely related question that shows how to add the full file path to each record on load:

A = LOAD '/logs_folder/*' USING PigStorage(',', '-tagPath');
DUMP A;

How can I incorporate the current input filename into my Pig Latin script?

That is very close, but how do I get only the parent folder name instead of the full path?

Jason answered 30/3, 2021 at 15:23

In the end I used this approach:

  1. Load the data with the `-tagPath` option, which adds a column containing the full path of the source file to every loaded record
  2. Use a regex to extract only the parent folder name from that path

Code example:

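-- -tagPath prepends the full file path as the first column, so the schema lists four fields
hadoop_data = LOAD '/logs_folder/*' USING PigStorage(',', '-tagPath') AS (filepath:chararray, id:chararray, feature:chararray, value:chararray);
-- Capture the name of the folder that directly contains the file (the date folder)
hadoop_data = FOREACH hadoop_data GENERATE id, (chararray)REGEX_EXTRACT(filepath, '.*\\/(.*)\\/', 1) AS path,
    feature, value;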

My data consists of three fields (id, feature, value), but the schema above declares four because `-tagPath` adds the filepath field as the first column.
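
To get exactly the `date    id` layout asked for in the question, the same idea can be reduced to two output columns. This is a minimal sketch, not a tested script: it assumes every log file sits directly inside its date folder, as in the directory tree above, and the alias names (`logs`, `dated`), the `log_date` field name, and the `/logs_output` path are hypothetical. The pattern is written without backslashes or `$` to avoid escaping and parameter-substitution issues in Pig string literals.

```pig
-- -tagPath prepends the full file path to each record as the first column
logs = LOAD '/logs_folder/*' USING PigStorage(',', '-tagPath')
       AS (filepath:chararray, id:chararray, feature:chararray, value:chararray);

-- Group 1 captures the path component just before the file name,
-- e.g. .../logs_folder/2021-03-01/log1 gives 2021-03-01
dated = FOREACH logs GENERATE
            REGEX_EXTRACT(filepath, '.*/([^/]+)/[^/]+', 1) AS log_date,
            id;

-- Write tab-separated (log_date, id) pairs; the output path is hypothetical
STORE dated INTO '/logs_output' USING PigStorage('\t');
```

If the files were nested more deeply under the date folder, the pattern would need to be adjusted, since it always takes the immediate parent of the file.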

Jason answered 14/4, 2021 at 9:13
