Regexextract over multiple lines within one cell
Asked Answered
T

3

5

In Google Sheets, I have this in one cell:

Random stuff blah blah 123456789
<Surname, Name><123456><A><100><B><200>
<Surname2, Name2><456789><A><300><B><400>
Some more random stuff

And would like to match the strings within <> brackets. With = REGEXEXTRACT(A4, "<(.*)>") I got thus far:

Surname, Name><123456><A><100><B><200

which is nice, but it is only the first line. The desired output would be this (maybe including the <> at the beginning/end, it doesn't really matter):

Surname, Name><123456><A><100><B><200>
<Surname2, Name2><456789><A><300><B><400

or simply:

Surname, Name><123456><A><100><B><200><Surname2, Name2><456789><A><300><B><400

How to get there?

Tuberculate answered 24/2, 2015 at 15:47 Comment(1)
This question should be merged to webapps.stackexchange.comBlouse
S
2

Please try:

=SUBSTITUTE(regexextract(substitute(A4,char(10)," "),"<(.*)>"),"> <",">"&char(10)&"<")

Starting in the middle, the substitute replaces line breaks (char(10)) with spaces. This enables the regexextract the complete (ie multi-line) string to work on, with the same pattern as already familiar to OP. SUBSTITUTE then reinstates the relevant space (identified as being immediately surrounded by > and <) with a line break.

Sachi answered 24/2, 2015 at 16:38 Comment(3)
While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.Gathering
@Sachi Looks nice, and thanks for the explanation! I'm gonna try it, but just before I do, a quick inquiry: Since I don't really care about the result being distributed over several lines (which was not obvious from my original question), could this be simplified somehow, ideally using just regex? (please see updated question).Tuberculate
While it is helpful to know how to substitute out the line breaks, it is specific to an OS and overly complicated when you can use the regex multi-line flags instead.Sociability
B
4

Google sheets uses RE2 syntax. You can set the multi-line and s flags in order to match multiple lines. The following will match all characters over multiple lines in cell A2.

=REGEXEXTRACT(A2, "(?ms)^(.*)$")
Bloodstock answered 17/11, 2018 at 16:47 Comment(0)
A
3
REGEXEXTRACT(A1,"text1(?ms)(.*)text2")

So, in this case:

REGEXEXTRACT(A1,"<(?ms)(.*)>")
Absorption answered 15/10, 2020 at 2:33 Comment(1)
Can you add context?Laundryman
S
2

Please try:

=SUBSTITUTE(regexextract(substitute(A4,char(10)," "),"<(.*)>"),"> <",">"&char(10)&"<")

Starting in the middle, the substitute replaces line breaks (char(10)) with spaces. This enables the regexextract the complete (ie multi-line) string to work on, with the same pattern as already familiar to OP. SUBSTITUTE then reinstates the relevant space (identified as being immediately surrounded by > and <) with a line break.

Sachi answered 24/2, 2015 at 16:38 Comment(3)
While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.Gathering
@Sachi Looks nice, and thanks for the explanation! I'm gonna try it, but just before I do, a quick inquiry: Since I don't really care about the result being distributed over several lines (which was not obvious from my original question), could this be simplified somehow, ideally using just regex? (please see updated question).Tuberculate
While it is helpful to know how to substitute out the line breaks, it is specific to an OS and overly complicated when you can use the regex multi-line flags instead.Sociability

© 2022 - 2024 — McMap. All rights reserved.