Select the next line after match regex
Asked Answered
D

3

12

I'm currently using a scanning software "Drivve Image" to extract certain information from each paper. This software enables certain Regex code to be run if needed. It seems to be run with the UltraEdit Regex Engine.

I get the following scanned result:

 1. 21Sid1
 2. Ordernr
 3. E17222
 4. By
 5. Seller

I need to search the string for the text Ordernr and then pick the following line E17222 which in the end will be said filename of the scanned document. I will never know the exact position of these two values in the string. That is why I need to focus on Ordernr because the text I need will always follow as the next line.

My requirements are such that I need the E17222 to be the only thing in the match result for this to work. I am only allowed to type plain regular expressions.

There is a great thread already: Regex to get the words after matching string

I've tested " \bOrdernr\s+\K\S+ "which works great..

Had it not been that the software don't allow for /K to be used. Are there any other ways of implementing \K?

Continuation

Though If the sample text involves a character behind "Ordernr" the current answer doesn't work to the extent I need. Like this sample:

21Sid1

Ordernr 1

E17222

By

Seller

The current solution picks up "1" and not the "next line" which would be "E17222". in the matched group. Needed to point that out for further involvement on the issue.

Dunkle answered 30/5, 2016 at 12:55 Comment(3)
I doubt the lookbehind will work, but try (?<=\bOrdernr\r\n)\S+ (adjust the linebreak to \r or \n if necessary).Afloat
can you include the exact sample text you're working with?Wadewadell
It looks like Drivve Image supports .NET regex flavor (link to download the PDF Manual), so my regex above can be written as (?<=\bOrdernr[\r\n]+)\S+. However, ClasG's suggestion sounds correct to me.Afloat
D
8

Did some googling and from what I can grasp, the last parameter to the REGEXP.MATCH is the capture group to use. That means that you could use you own regex, without the \K, and just add a capture group to the number you want to extract.

 \bOrdernr\s+(\S+)

This means that the number ends up in capture group 1 (the whole match is in 0 which I assume you've used).

The documentation isn't crystal clear, but I guess the syntax is

REGEXP.MATCH(<ZoneName>, "REGEX", CaptureGroup)

meaning you should use

REGEXP.MATCH(<ZoneName>, "\bOrdernr\s+(\S+)", 1)

There's a fair amount of guessing here though... ;)

Disconcerted answered 30/5, 2016 at 13:33 Comment(3)
I can confirm that this answer works for my Software. @ClasGDunkle
@Dunkle That's great. Please consider marking it as accepted to make it easier for people with similar questions to find it.Disconcerted
Thought I've added a "Contiunation" to my post which involves characters behind Ordernr.Dunkle
W
47

Description

ordernr[\r\n]+([^\r\n]+)

Regular expression visualization

This regular expression will do the following:

  • find the ordernr substring
  • place the line following ordernr capture group 1

Example

Live Demo

https://regex101.com/r/dQ0gR6/1

Sample text

 1. 21Sid1
 2. Ordernr
 3. E17222
 4. By
 5. Seller

Sample Matches

[0][0] = Ordernr
 3. E17222
[0][1] =  3. E17222

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  ordernr                  'ordernr'
----------------------------------------------------------------------
  [\r\n]+                  any character of: '\r' (carriage return),
                           '\n' (newline) (1 or more times (matching
                           the most amount possible))
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^\r\n]+                 any character except: '\r' (carriage
                             return), '\n' (newline) (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------

Alternatively

To just capture the line using a look-around so that ordernr is not included in capture group 0 and to accommodate all the variation of \r and \n

(?<=ordernr\r|ordernr\n|ordernr\r\n)[^\r\n]+

Regular expression visualization

Live Demo

https://regex101.com/r/pA4fD4/2

Wadewadell answered 30/5, 2016 at 13:6 Comment(5)
I can also confirm that this answer workes aswell. Though my knowledge of regex is limited and thus I'm not really sure which answer is the better one.. "if there is such a thing". @Ro Yo MiDunkle
The subtle difference between this and the accepted answer is that this constrains the match to only allow end of line characters, where as the accepted answer allows the matching of all white space characters and assumes spaces and new line characters are interchangeable which could lead to some odd edge cases.Wadewadell
From this example, is it possible to split based on the new line after the matched substring? I have a script that needs to split by new line, however due to a system limit if the file is too long it wont work. so i wanted to chunk it up by matching as substring and then finding the end of that line. I am using '\r\n|\r|\n' as my new line regex. @RoYoMiHand
Wow, great explanation. Thanks.Var
This is a hell of an answer! great job!Yuriyuria
D
8

Did some googling and from what I can grasp, the last parameter to the REGEXP.MATCH is the capture group to use. That means that you could use you own regex, without the \K, and just add a capture group to the number you want to extract.

 \bOrdernr\s+(\S+)

This means that the number ends up in capture group 1 (the whole match is in 0 which I assume you've used).

The documentation isn't crystal clear, but I guess the syntax is

REGEXP.MATCH(<ZoneName>, "REGEX", CaptureGroup)

meaning you should use

REGEXP.MATCH(<ZoneName>, "\bOrdernr\s+(\S+)", 1)

There's a fair amount of guessing here though... ;)

Disconcerted answered 30/5, 2016 at 13:33 Comment(3)
I can confirm that this answer works for my Software. @ClasGDunkle
@Dunkle That's great. Please consider marking it as accepted to make it easier for people with similar questions to find it.Disconcerted
Thought I've added a "Contiunation" to my post which involves characters behind Ordernr.Dunkle
O
0

In general, if you are bound to use a regex solution to return a whole line after a line with your specific pattern match, you can use

PATTERN.*\n(.*)
PATTERN[^\r\n]*[\r\n]+([^\r\n]*)

and if the \K operator (that you will need if you cannot access capturing groups) is supported:

PATTERN.*\n\K.*
PATTERN[^\r\n]*[\r\n]+\K[^\r\n]*

or, if variable-width lookbehind patterns are supported (as in .NET, or PyPi regex Python library):

(?<=PATTERN.*\n).*
(?<=PATTERN[^\r\n]*[\r\n]+)[^\r\n]*

where PATTERN is the pattern you want to match in the preceding line. Use [^\r\n] version when you do not want the . to capture carriage returns in regex libraries that have this behavior (e.g. .NET).

NOTE: When reading the contents from the file, you must make sure you read the whole file into a single string. E.g. in Python, you must do it with .read(), not .readlines(), otherwise, you won't be able to perform a multiline match. In Perl, you can slurp the file in, use Get-Content -Raw in PowerShell, etc.

So, to get the whole line after a line with your pattern, you can use

with open("data.txt", "r") as f:
    match = re.search(r'Ordernr.*\n(.*)', f.read())
    if match:
        print(match.group(1))

See this regex demo.

To only get the E17222, you may precise the (.*) pattern: say, you want to match the first whole word starting with an uppercase letter and then any digits after it till the end of the word. It would then be .*?\b([A-Z][0-9]+):

with open("data.txt", "r") as f:
    match = re.search(r'Ordernr.*\n.*?\b([A-Z][0-9]+)', f.read())
    if match:
        print(match.group(1))

See this regex demo.

Odelsting answered 23/4 at 10:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.