Extract some part of text separated by a delimiter using a regex

Asked 19/2, 2013 at 4:53 Answered 4/1, 2019 at 4:30

I have a sample input file as follows, with columns Id, Name, start date, end date, Age, Description, and Location:

220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad
221;Paul;30;23/11/2008;22/12/2008;He is a software engineer at MNC;Bangalore
222;Emma;23/11/2008;22/12/200825;Working as a mechanical engineer;Chennai

It contains 30 lines of data. My requirement is to only extract descriptions from the above text file.

My output should contain

Working as a professor in University

He is a software engineer at MNC

working as a mechanical engineer

I need to find a regular expression to extract the Description, and have tried many kinds, but I haven't been able to find the solution. How can I do it?

Loony answered 19/2, 2013 at 4:53 Comment(9)

the delimiter in the above input file is ";" – Loony 19/2, 2013 at 5:3

I may have messed up on my edit, did you mean to have the semicolons and commas in there? – Sukin 19/2, 2013 at 5:4

OK, please re-edit with them. Sorry, thinking about databases too much. – Sukin 19/2, 2013 at 5:4

Why do you want a regex? Just split by semicolon and grab the 4th column and you're done. Also, you should tag with what language you are using. – Womankind 19/2, 2013 at 5:8

my requirement is to use regex...... – Loony 19/2, 2013 at 5:16

you mean your homework assignment? – Womankind 19/2, 2013 at 5:17

AQL (Annotation Query Language) – Loony 19/2, 2013 at 5:34

The data is a mess. John has two dates then a number (age); Paul has a number and two dates; Emma has a date and a date scrunched up with the number. The columns listed don't include either of the date columns. (Someone can't spell 'engineer', or 'Bangalore'). How will the regex know to convert Working to working? That's tremendously fiddly! – Diversification 20/2, 2013 at 5:7

Sorry for my English; it's "Working", not "working", in the output. – Loony 20/2, 2013 at 5:17

You can use this regex:

[^;]+(?=;[^;]*$)

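[^;] matches any single character except ;

+ is a quantifier that matches the preceding character or group one or more times

* is a quantifier that matches the preceding character or group zero or more times

$ is the end of the string

(?=pattern) is a positive lookahead, which asserts that pattern matches immediately after the current position without consuming any characters

Taken together, the expression matches a run of non-semicolon characters that is followed by exactly one more semicolon-delimited field before the end of the string, that is, the second-to-last field.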
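As a minimal sketch of how this pattern behaves, the Python snippet below runs it over the three sample rows from the question. Python is an assumption here (the question targets AQL), and the names PENULTIMATE_FIELD, rows, and descriptions are made up for illustration.

```python
import re

# The lookahead requires exactly one more ";field" before the end of the
# string, so the match is the second-to-last field of each row.
PENULTIMATE_FIELD = re.compile(r"[^;]+(?=;[^;]*$)")

# Sample rows copied from the question (illustrative input only).
rows = [
    "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad",
    "221;Paul;30;23/11/2008;22/12/2008;He is a software engineer at MNC;Bangalore",
    "222;Emma;23/11/2008;22/12/200825;Working as a mechanical engineer;Chennai",
]

descriptions = []
for row in rows:
    match = PENULTIMATE_FIELD.search(row)
    if match:  # rows with fewer than two fields produce no match
        descriptions.append(match.group(0))

for text in descriptions:
    print(text)
```

Because the pattern is anchored to the end of the string rather than the start, it does not care how many columns precede the description, which matters for rows as irregular as the ones in the question.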

Washer answered 19/2, 2013 at 5:27 Comment(1)

([^;]+(?=;[^;]*(\r?\n|$))) – Stormystorting 19/3, 2018 at 12:23

/^(?:[^;]+;){3}([^;]+)/ will grab the fourth group between semicolons.

Although, as stated in my comment, you should just split the string by semicolon and grab the fourth element of the split. That's the whole point of a delimited file: you don't need complex pattern matching.

Example implementation in Perl using your input example:

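# Three-argument open keeps the mode separate from the file name
open(my $IN, '<', 'input.txt') or die $!;

while (my $line = <$IN>) {
    # Capture the fourth semicolon-delimited field
    my ($desc) = $line =~ /^(?:[^;]+;){3}([^;]+)/;
    next unless defined $desc;    # skip lines that do not match
    print "'$desc'\n";
}
close $IN;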

yields:

'Working as a professor in University'
'He is a software engineer at MNC'
'Working as a mechanical engineer'
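For comparison, here is a hedged Python sketch of the split-based approach. Python and the helper names fourth_field and description_by_split are assumptions, not part of the answer. The fourth field is the description only in the five-column layout quoted in the comments (220;John;28;...); for the seven-column rows shown in the question, the sketch assumes the description is always the second-to-last field and indexes from the end instead.

```python
def fourth_field(row: str) -> str:
    """Return the fourth semicolon-delimited field (index 3)."""
    return row.split(";")[3]


def description_by_split(row: str) -> str:
    """Return the second-to-last field, assumed here to be the description."""
    return row.rstrip("\r\n").split(";")[-2]


# Five-column row as quoted in the comments, seven-column row from the question.
five_columns = "220;John;28;Working as a Professor in University;Hyderabad"
seven_columns = "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad"

print(fourth_field(five_columns))           # the description
print(fourth_field(seven_columns))          # the end date, not the description
print(description_by_split(five_columns))
print(description_by_split(seven_columns))
```

Counting from the end is the split equivalent of anchoring a regex at the end of the line, and it tolerates extra or missing columns before the description.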

Womankind answered 19/2, 2013 at 5:13 Comment(3)

I can only use a regex (//) in my code; I cannot use the code above. – Loony 19/2, 2013 at 5:27

What I provided is a regex. And since you didn't indicate what language you were using, I provided a sample implementation making use of the regex. – Womankind 19/2, 2013 at 5:28

I am using AQL for BigInsights Text Analytics. – Loony 19/2, 2013 at 5:33

This should work:

/^[^\s]+\s+[^\s]+\s+[^\s]+\s+(.+)\s+[^\s]+$/m

Or as lone shepherd pointed out:

/^\S+\s+\S+\s+\S+\s+(.+)\s+\S+$/m

Or with semicolons:

/^[^;]+;[^;]+;+[^;]+;+(.+);+[^;]+$/m

Centaury answered 19/2, 2013 at 5:1 Comment(8)

\S is the same as [^\s] – Womankind 19/2, 2013 at 5:3

No, it's not working: 220;John;28;Working as a Professor in University;Hyderabad – Loony 19/2, 2013 at 5:5

This almost works if you can use the multiline modifier (m in PHP), so that ^ represents the beginning of a line while $ represents the end. In the previous example, though, I was just missing one column. /^[^\s]+\s+[^\s]+\s+[^\s]+\s+(.+)\s+[^\s]+$/m – Centaury 19/2, 2013 at 5:14

And now I see you reverted back to semi-colons. /^[^;]+;[^;]+;+[^;]+;+(.+);+[^;]+$/m – Centaury 19/2, 2013 at 5:19

No, it works great in PHP using preg_replace. Of course, you never even specified whether it's a Perl regular expression that you needed, let alone what language this is for. – Centaury 19/2, 2013 at 5:34

I am using Annotation Query Language (AQL), which is used to extract data from text files. It is a language for IBM BigInsights Text Analytics. – Loony 19/2, 2013 at 5:39

According to the documentation I'm reading on that language, it should work. Of course, that was without the date added in there. This one should work as long as there is only one column after the text you want: /^.*;([^;]+);+[^;]+$/m (you don't need the m) – Centaury 19/2, 2013 at 5:46

/^.*;([^;]+);+[^;]+$/ is also not extracting my output; it is extracting the whole data in a single line. – Loony 19/2, 2013 at 6:27

It seems relatively straightforward:

https://regex101.com/r/W9nfsd/2

.*;(.*);.*$

It is similar to Anirudha's answer, but a little simpler.
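A rough illustration of why this works, assuming Python's re module rather than AQL: the first .* is greedy, so it gives back only as much as needed to leave two semicolons for the rest of the pattern, which makes the capture group the second-to-last field. With re.MULTILINE, $ also matches at each line end, and since . does not cross newlines the pattern can be applied to the whole text at once. The variable names are illustrative.

```python
import re

# Sample rows copied from the question, joined into one multi-line string.
text = "\n".join([
    "220;John;23/11/2008;22/12/2008;28;Working as a professor in University;Hyderabad",
    "221;Paul;30;23/11/2008;22/12/2008;He is a software engineer at MNC;Bangalore",
    "222;Emma;23/11/2008;22/12/200825;Working as a mechanical engineer;Chennai",
])

# findall returns the contents of the single capture group for each line.
descriptions = re.findall(r".*;(.*);.*$", text, flags=re.MULTILINE)
print(descriptions)
```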

Riemann answered 4/1, 2019 at 4:30 Comment(0)
