Regexp matching in Pig

Using Apache Pig and the text

hahahah.  my brother just didnt do anything wrong. He cheated on a test? no way!

I'm trying to match "my brother just didnt do anything wrong."

Ideally, I'd want to match anything beginning with "my brother just" and ending with either punctuation (end of sentence) or EOL.

Looking at the pig docs, and then following the link to java.util.regex.Pattern, I figure I should be able to use

extrctd = FOREACH fltr GENERATE FLATTEN(EXTRACT(txt,'(my brother just .*\\p{Punct})')) as (txt:chararray);

But that seems to match until the end of the line. Any suggestions for performing this match? I'm ready to pull my hair out, and by pull my hair out, I mean switch to Python streaming.

Heywood answered 19/7, 2010 at 21:3 Comment(0)

By default quantifiers are greedy. This means they match as much as possible. In this case you want to match only up to the first punctuation mark. In other words you want to match as little as possible.

So to solve your problem you should make the quantifier non-greedy by adding a ? immediately after it:

my brother just .*?\\p{Punct}
                  ^

Note that the use of ? here is different from its use as a quantifier where it means 'match zero or one'.
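To see the difference outside Pig, here is a minimal Java sketch using java.util.regex directly, which is the engine the Pig docs link to. The class and helper names (GreedyVsLazy, firstGroup) are made up for illustration, and it uses Matcher.find() rather than whatever Pig's EXTRACT does internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Sample text from the question.
    static final String TEXT =
        "hahahah.  my brother just didnt do anything wrong. He cheated on a test? no way!";

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy: .* runs to the end of the line, then backs up to the LAST punctuation mark.
        System.out.println(firstGroup("(my brother just .*\\p{Punct})", TEXT));
        // Lazy: .*? grows one character at a time and stops at the FIRST punctuation mark.
        System.out.println(firstGroup("(my brother just .*?\\p{Punct})", TEXT));
    }
}
```

On the sample text the greedy version returns everything through the final "!", while the lazy version stops at "wrong."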

Conflux answered 19/7, 2010 at 21:8 Comment(4)
Would you mind explaining the greedy part? I thought I would just be matching from the word just, any following text, up to the first instance of punctuation. - Heywood
Without the non-greedy modifier, it does not stop at the first instance of punctuation; it matches until the last one. - Hamitosemitic
What if there is no punctuation? I'd like it to match either through the end of the sentence or to EOL if there is no punctuation. - Heywood
A more intuitive answer might be to match everything that is not punctuation after "my brother just" and then match a punctuation character. That way the "not punctuation" part will match every word and space and stop at the first punctuation character. - Iodometry
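The two follow-up ideas in the comments (fall back to end of line when there is no punctuation, and use a negated character class) can be sketched in plain Java as follows. This is only an illustration under some assumptions: each input is a single line, the class name SentenceOrEol and the second sample line are made up, and the patterns have not been run through Pig itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceOrEol {
    // Lazy match that stops at the first punctuation mark, or at end of input if there is none.
    static final Pattern LAZY_OR_EOL =
        Pattern.compile("(my brother just .*?(?:\\p{Punct}|$))");

    // Negated class: consume non-punctuation characters, then an optional punctuation mark.
    static final Pattern NEGATED_CLASS =
        Pattern.compile("(my brother just [^\\p{Punct}]*\\p{Punct}?)");

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String extract(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String withPunct =
            "hahahah.  my brother just didnt do anything wrong. He cheated on a test? no way!";
        String noPunct = "lol my brother just left"; // made-up line with no punctuation
        for (Pattern p : new Pattern[] {LAZY_OR_EOL, NEGATED_CLASS}) {
            System.out.println(extract(p, withPunct));
            System.out.println(extract(p, noPunct));
        }
    }
}
```

One caveat with either pattern: \p{Punct} includes the apostrophe, so a sentence containing "didn't" would be cut off at the apostrophe. If that matters, a narrower class such as [.!?] for sentence-ending marks may be a better fit.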

Have you tried: .*(my brother just .*\\p{Punct})

It looks like your expression wanted the "my brother" part to be the beginning of the string, but in your example it's in the middle of the string, so you have to account for everything before "my brother".
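Whether the text before "my brother" has to be accounted for depends, in plain java.util.regex, on which Matcher method is called: find() looks for the pattern anywhere in the input, while matches() requires the pattern to cover the whole input. I have not checked which of the two Pig's EXTRACT uses, so the following is only a sketch of the Java behaviour; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatches {
    // Sample text from the question.
    static final String TEXT =
        "hahahah.  my brother just didnt do anything wrong. He cheated on a test? no way!";

    // Pattern with no allowance for surrounding text.
    static final Pattern BARE =
        Pattern.compile("(my brother just .*?\\p{Punct})");

    // Same capture group, padded with .* on both sides to cover the whole line.
    static final Pattern PREFIXED =
        Pattern.compile(".*(my brother just .*?\\p{Punct}).*");

    // find(): the pattern may occur anywhere in the input.
    static String viaFind(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // matches(): the pattern must cover the entire input.
    static String viaMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(viaFind(BARE, TEXT));        // sentence found mid-string
        System.out.println(viaMatches(BARE, TEXT));     // null: pattern does not span the line
        System.out.println(viaMatches(PREFIXED, TEXT)); // sentence, thanks to the .* padding
    }
}
```

Note that padding with .* only affects where the match may start. The quantifier inside the group still needs to be lazy (.*?) to stop at the first punctuation mark.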

Romney answered 19/7, 2010 at 21:7 Comment(0)

You are matching .* which is... everything. Try [a-z]* to match letters only.

Gunter answered 19/7, 2010 at 21:9 Comment(0)
