StrSplit in Pig functions
Can someone explain how to get the output below in a Pig script?

My input file is below:

a.txt

aaa.kyl,data,data
bbb.kkk,data,data
cccccc.hj,data,data
qa.dff,data,data

I am writing the Pig script like this:

A = LOAD 'a.txt' USING PigStorage(',') AS(a1:chararray,a2:chararray,a3:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(a1)),a2,a3;

I don't know how to proceed from here. I need output like the one below. Basically, I need all the characters after the dot in the first field.

(kyl,data,data)
(kkk,data,data)
(hj,data,data)
(dff,data,data)

Can someone give me the code for this?

Ebberta answered 27/7, 2014 at 13:26 Comment(0)
H
10

Here is what you need to do.

There is an escaping problem in the Pig parsing routines when they encounter the dot, because it is treated as an operator; see the Dot Operator documentation for more information.

You can use a Unicode escape sequence for the dot instead: \u002E. However, it must also be backslash-escaped and put in a single-quoted string.

The code below will do the job, and you can fine-tune it as needed:

A = LOAD 'a.txt' USING PigStorage(',') AS(a1:chararray,a2:chararray,a3:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(a1,'\\u002E')) as (a1:chararray, a1of1:chararray),a2,a3;
C = FOREACH B GENERATE a1of1,a2,a3;
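
As an alternative, here is a minimal sketch that avoids backslashes altogether. It assumes the same a.txt and schema as above, and relies on STRSPLIT treating its second argument as a Java regular expression, in which the character class [.] matches a literal dot. The optional third argument (the limit) is set to 2 so that the split happens at the first dot only. The aliases prefix and suffix are made up for this illustration.

```pig
-- Sketch only: same input file and schema as in the code above.
A = LOAD 'a.txt' USING PigStorage(',') AS (a1:chararray, a2:chararray, a3:chararray);

-- '[.]' is a regex character class containing only the dot, so it matches a literal '.'.
-- The limit of 2 yields at most two parts: the text before the first dot and the rest.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(a1, '[.]', 2)) AS (prefix:chararray, suffix:chararray), a2, a3;

-- Keep only the part after the dot, plus the two untouched columns.
C = FOREACH B GENERATE suffix, a2, a3;
DUMP C;
```

The limit matters only when a1 can contain more than one dot; without it, each additional dot would produce another field after FLATTEN.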

Hope this helps.

Hornstone answered 27/7, 2014 at 16:17 Comment(2)
Nice explanation. Do we need to apply this kind of Unicode escape sequence for the pipe symbol and the comma as well? - Ebberta
No, not for pipe and comma; check the Pig documentation. - Hornstone
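
A hedged note on the comment exchange above: STRSPLIT interprets its delimiter as a Java regular expression. In that syntax a comma has no special meaning, but a pipe means alternation, so in my understanding a bare '|' would not split on the pipe character. The sketch below uses a character class to sidestep any escaping question. The file b.txt and the aliases line, head, and rest are hypothetical.

```pig
-- Hypothetical input b.txt: one value per line, containing commas or pipes but no tabs,
-- so the default loader puts the whole line into the first field.
A = LOAD 'b.txt' AS (line:chararray);

-- A comma is not a regex metacharacter, so it can be used as-is.
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, ',', 2)) AS (head:chararray, rest:chararray);

-- A pipe is the regex alternation operator; '[|]' matches it as a literal character.
C = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '[|]', 2)) AS (head:chararray, rest:chararray);
```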
T
4

You can also do this without STRSPLIT(), using SUBSTRING() and INDEXOF() as follows:

A = LOAD 'C:\\Users\\Ren\\Desktop\\file' USING PigStorage(',') AS(a1:chararray,a2:chararray,a3:chararray);

B = foreach A generate SUBSTRING(a1,INDEXOF(a1,'.',0)+1,(int)SIZE(a1)),a2,a3;
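
In the same spirit, a single built-in call could probably do the extraction as well. This is only a sketch, assuming that REGEX_EXTRACT(string, regex, index) returns the capture group with the given index. The pattern is anchored at both ends so that it behaves the same whether the function matches the whole string or searches within it. The alias after_dot is made up for this example.

```pig
-- Sketch only: same a.txt input and schema as in the question.
A = LOAD 'a.txt' USING PigStorage(',') AS (a1:chararray, a2:chararray, a3:chararray);

-- ^[^.]*  skips everything up to the first dot
-- [.]     matches the dot itself
-- (.*)$   captures the rest of the field as group 1
B = FOREACH A GENERATE REGEX_EXTRACT(a1, '^[^.]*[.](.*)$', 1) AS after_dot, a2, a3;
DUMP B;
```

One difference from the SUBSTRING() version: when a1 contains no dot, INDEXOF returns -1 and SUBSTRING keeps the whole value, whereas a regex that does not match should give null here.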
Timid answered 28/7, 2014 at 9:31 Comment(0)
C
1
A = LOAD 'a.txt' USING PigStorage(',') AS(a1:chararray,a2:chararray,a3:chararray);

B = FOREACH A GENERATE FLATTEN(STRSPLIT(a1,'\\u002E')),a2,a3;

This separates a1 into two parts, the text before the dot and the text after it. From these you can select the part after the dot:

C = foreach B generate $1,$2,$3;

where $1 is the part after the dot.

Chateau answered 28/7, 2014 at 1:34 Comment(0)
