Extract some part of text separated by a delimiter using a regex

I have a sample input file as follows, with columns Id, Name, start date, end date, Age, Description, and Location:

220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad
221;Paul;30;23/11/2008;22/12/2008;He is a software engineer at MNC;Bangalore
222;Emma;23/11/2008;22/12/200825;Working as a mechanical engineer;Chennai

It contains 30 lines of data. My requirement is to only extract descriptions from the above text file.

My output should contain

Working as a professor in University

He is a software engineer at MNC

working as a mechanical engineer

I need to find a regular expression to extract the Description, and have tried many kinds, but I haven't been able to find the solution. How can I do it?

Loony answered 19/2, 2013 at 4:53 Comment(9)
The delimiter in the above input file is ";" - Loony
I may have messed up on my edit, did you mean to have the semicolons and commas in there? - Sukin
OK, please re-edit with them. Sorry, thinking about databases too much. - Sukin
Why do you want a regex? Just split by semicolon and grab the 4th column and you're done. Also, you should tag with what language you are using. - Womankind
My requirement is to use regex... - Loony
You mean your homework assignment? - Womankind
AQL (Annotated Query Language) - Loony
The data is a mess. John has two dates then a number (age); Paul has a number and two dates; Emma has a date and a date scrunched up with the number. The columns listed don't include either of the date columns. (Someone can't spell 'engineer', or 'Bangalore'). How will the regex know to convert Working to working? That's tremendously fiddly! - Diversification
Sorry for my English; it's Working, not working, in the output. - Loony
W
24

You can use this regex:

[^;]+(?=;[^;]*$)

[^;] matches any character except ;

+ is a quantifier that matches the preceding character or group one to many times

* is a quantifier that matches the preceding character or group zero to many times

$ is the end of the string

(?=pattern) is a lookahead which checks that a particular pattern occurs ahead without making it part of the match
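Put together, the pattern matches a run of non-semicolon characters that is followed by exactly one more semicolon-delimited field before the end of the string, i.e. the second-to-last field. A minimal sketch of that behaviour in Python's re module, using the sample rows from the question (the helper name extract_description is made up for illustration, and AQL's regex engine may differ in details):

```python
import re

# Pattern from the answer: a run of non-semicolon characters followed by
# exactly one more semicolon-delimited field before the end of the string.
SECOND_TO_LAST = re.compile(r"[^;]+(?=;[^;]*$)")

# Sample rows copied from the question.
rows = [
    "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad",
    "221;Paul;30;23/11/2008;22/12/2008;He is a software engineer at MNC;Bangalore",
    "222;Emma;23/11/2008;22/12/200825;Working as a mechanical engineer;Chennai",
]


def extract_description(row):
    """Return the second-to-last field of a row, or None if nothing matches."""
    match = SECOND_TO_LAST.search(row)
    return match.group(0) if match else None


descriptions = [extract_description(row) for row in rows]
for description in descriptions:
    print(description)
```

Because the pattern is anchored at the end of the line, it does not depend on how many fields come before the description, which matters here since the three sample rows do not have the same number of fields.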

Washer answered 19/2, 2013 at 5:27 Comment(1)
([^;]+(?=;[^;]*(\r?\n|$))) - Stormystorting
W
5

/^(?:[^;]+;){3}([^;]+)/ will grab the fourth group between semicolons, which is the description when each row has the form Id;Name;Age;Description;Location.

Although as stated in my comment, you should just split the string by semicolon and grab the fourth element of the split...that's the whole point of a delimited file - you don't need complex pattern matching.
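For comparison, the split approach might look like the following in Python. This is only a sketch: the function name description_by_split is hypothetical, and the default index of -2 (second-to-last field) is an assumption chosen because the sample rows as currently shown have differing numbers of fields before the description. Pass index=3 to take the fourth field instead.

```python
def description_by_split(row, index=-2):
    """Split a row on semicolons and return one field.

    index=-2 (an assumption for this sketch) picks the second-to-last
    field; index=3 would pick the fourth field counted from the left.
    Returns None when the row has too few fields.
    """
    fields = row.rstrip("\r\n").split(";")
    try:
        return fields[index]
    except IndexError:
        return None


row = "222;Emma;23/11/2008;22/12/200825;Working as a mechanical engineer;Chennai"
print(description_by_split(row))
```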

Example implementation in Perl using your input example:

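# Open the input file for reading (three-argument form of open)
open(my $IN, '<', 'input.txt') or die "Cannot open input.txt: $!";

while (my $line = <$IN>) {
    # Skip three semicolon-terminated fields, then capture the fourth
    my ($desc) = $line =~ /^(?:[^;]+;){3}([^;]+)/;
    print "'$desc'\n" if defined $desc;
}
close $IN;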

yields:

'Working as a professor in University'
'He is a software engineer at MNC'
'Working as a mechanical engineer'
Womankind answered 19/2, 2013 at 5:13 Comment(3)
I can only use a regex (//) in my code; I cannot use the code above. - Loony
What I provided is a regex. And since you didn't indicate what language you were using, I provided a sample implementation making use of the regex. - Womankind
I am using the AQL language for BigInsights text analytics. - Loony
C
1

This should work:

/^[^\s]+\s+[^\s]+\s+[^\s]+\s+(.+)\s+[^\s]+$/m

Or as lone shepherd pointed out:

/^\S+\s+\S+\s+\S+\s+(.+)\s+\S+$/m

Or with semicolons:

/^[^;]+;[^;]+;+[^;]+;+(.+);+[^;]+$/m
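One thing to watch with the semicolon variant is that (.+) is greedy and is preceded by a fixed number of fields, so what it captures depends on how many fields each row has. The sketch below tries it in Python's re module on a five-field row (taken from the comments) and on the seven-field row from the question; the variable names are made up for illustration, and other regex engines may differ in details.

```python
import re

# The semicolon variant from the answer, with the m (multiline) flag.
pattern = re.compile(r"^[^;]+;[^;]+;+[^;]+;+(.+);+[^;]+$", re.MULTILINE)

# Five fields: Id;Name;Age;Description;Location
five_fields = "220;John;28;Working as a Professor in University;Hyderabad"
# Seven fields, as in the question's current sample
seven_fields = (
    "220;John;23/11/2008;22/12/2008;28;"
    "Working as a professor in University;Hyderabad"
)

captured_five = pattern.search(five_fields).group(1)
captured_seven = pattern.search(seven_fields).group(1)

print(captured_five)   # only the description
print(captured_seven)  # the description plus the fields in front of it
```

With extra fields in front, the capture group swallows everything between the third field and the last one, so this pattern only isolates the description when exactly three fields precede it.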
Centaury answered 19/2, 2013 at 5:1 Comment(8)
\S is the same as [^\s] - Womankind
No, it's not working: 220;John;28;Working as a Professor in University;Hyderabad - Loony
This almost works if you can use a line modifier (m in PHP), so that ^ represents the beginning of the line while $ represents the end. In the previous example, though, I was just missing one column. /^[^\s]+\s+[^\s]+\s+[^\s]+\s+(.+)\s+[^\s]+$/m - Centaury
And now I see you reverted back to semicolons. /^[^;]+;[^;]+;+[^;]+;+(.+);+[^;]+$/m - Centaury
No, it works great in PHP using preg_replace. You of course never even specified if it's a Perl regular expression that you needed, let alone what language this is for. - Centaury
I am using Annotated Query Language, used to extract data from text files; it is a language for IBM BigInsights text analytics. - Loony
According to the documentation I'm reading on that language, it should work. Of course, that was without the date added in there. This one should work as long as there is only 1 column after the text you want: /^.*;([^;]+);+[^;]+$/m (you don't need the m) - Centaury
/^.*;([^;]+);+[^;]+$/ is also not extracting my output; it is extracting the whole data in a single line. - Loony
R
0

It seems relatively straightforward:

https://regex101.com/r/W9nfsd/2

.*;(.*);.*$

It is similar to Anirudha's answer, but a little simpler.
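A rough Python sketch of this variant, for comparison with the lookahead version. The description ends up in capture group 1 rather than being the whole match; the helper name second_to_last is made up for illustration.

```python
import re

# Greedy variant from this answer: the leading .* consumes as much as it can,
# so the capture group ends up holding the second-to-last field.
GREEDY = re.compile(r".*;(.*);.*$")


def second_to_last(row):
    """Return capture group 1 (the second-to-last field), or None."""
    match = GREEDY.search(row)
    return match.group(1) if match else None


row = "221;Paul;30;23/11/2008;22/12/2008;He is a software engineer at MNC;Bangalore"
print(second_to_last(row))
```

One difference worth noting: because (.*) can match nothing, a row with an empty second-to-last field such as a;;c yields an empty string here, whereas [^;]+ in the lookahead version requires at least one character.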

Riemann answered 4/1, 2019 at 4:30 Comment(0)
