How to extract all regex matches in a file using Vim?
A

5

22

Consider the following example:

case Foo:
    ...
    break;
case Bar:
    ...
    break;
case More: case Complex:
    ...
    break;
...

Say, we would like to retrieve all matches of the regex case \([^:]*\): (the whole matching text or, even better, the part between \( and \)), which should give us (preferably in a new buffer) something like this:

Foo
Bar
More
Complex
...

Another example of a use case would be extraction of some fragments of an HTML file, for instance, image URLs.

Is there a simple way to collect all regex matches and take them out to a separate buffer in Vim?

Note: It’s similar to the question “How to extract text matching a regex using Vim?”. However, unlike the setting in that question, I’m also interested in removing the lines that don’t match, preferably without a hugely complicated regex.

Alcantar answered 31/1, 2012 at 12:33 Comment(6)
Do you mean backreferences? :%s/^\vcase ([^:]+):/\1/ Use \1 to get the first capturing group. - Paulina
If you just want to extract these to a new file (it's unclear from your question), you could do this more easily with sed or grep; sed example: sed -n '/^\s*case\s\+/{s/\s*case\s\+\([^:]\+\):/\1/;p}' file - Elyssa
@beerbajay: Yes in a new file it's fine. I agree sed would do it well, just I would have to start a command prompt and find the file again, so I'm looking for a Vim solution. - Alcantar
@mathematical.coffee: Not at all. The issue is not search & replace (unless you include new lines) but grabbing all matches and putting them in another buffer. - Alcantar
This is very similar to this question: #4504248 - Deandra
@PeterRincker: You're right. The question was formulated differently but it's pretty much the same goal. Seems there is no "simple" answer. :( - Alcantar
D
32

There is a general way of collecting pattern matches throughout a piece of text. The technique takes advantage of the substitute with an expression feature of the :substitute command (see :help sub-replace-\=). The key idea is to use a substitution enumerating all of the pattern matches to evaluate an expression storing them without replacement.

First, let us consider saving the matches. In order to keep a sequence of matching text fragments, it is convenient to use a list (see :help List). However, it is not possible to modify a list straightforwardly, using the :let command, since there is no way to run Ex commands in expressions (including \= substitute expressions). Yet, we can call one of the functions that modify a list in place, for example, the add() function that appends a given item to a list (see :help add()).

Another problem is how to avoid text modifications while running a substitution. One approach is to make the pattern always have a zero-width match by prepending \ze or by appending \zs atoms to it (see :help /\zs, :help /\ze). The pattern modified in this way captures an empty string preceding or succeeding an occurrence of the original pattern in text (such matches are called zero-width matches in Vim; see :help /zero-width). Then, if the replacement text is also empty, substitution effectively changes nothing: it just replaces a zero-width match with an empty string.
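
As a quick illustration of zero-width matching, the following sketch uses the substitute() function on a sample string to show where the empty match lands. The sample string and the <here> marker are made up for the demonstration and are not part of the technique itself.

```vim
" Hypothetical sample line; <here> only marks where the empty match is.
let line = 'case Foo:'
" \zs appended: the match is the empty string right after 'case Foo:'.
echo substitute(line, 'case \w\+:\zs', '<here>', '')
" \ze prepended: the match is the empty string right before 'case Foo:'.
echo substitute(line, '\zecase \w\+:', '<here>', '')
```

The first call should print case Foo:<here> and the second <here>case Foo:, with the original text left untouched in both cases.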

Since the add() function, like most of the list modifying functions, returns the reference to the changed list, for our technique to work we need to somehow get an empty string from it. The simplest way is to extract a sublist of zero length from it by specifying a range of indices such that a starting index is greater than an ending one.
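
The behavior of add() and of the empty sublist can be checked in isolation. This is a minimal sketch with made-up items, assuming a legacy (non-Vim9) script context:

```vim
let m = []
" add() appends the item and returns the list itself.
echo add(m, 'Foo')
" Slicing with a start index greater than the end index yields an empty
" list, which a \= expression turns into an empty replacement string.
echo add(m, 'Bar')[1:0]
" Both items were still stored.
echo m
```

The three commands should print ['Foo'], then [], then ['Foo', 'Bar'].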

Combining the aforementioned ideas, we obtain the following Ex command:

:let m=[] | %s/\<case\s\+\(\w\+\):\zs/\=add(m,submatch(1))[1:0]/g

After its execution, all matches of the first subgroup are accumulated in the list referenced by the variable m, and can be used as is or processed in some way. For instance, to paste the contents of the list one by one on separate lines in Insert mode, type

Ctrl+R =m Enter

To do the same in Normal mode, simply use the :put command:

:put=m

Starting with version 7.4 (see :helpg Patch 7.3.627), Vim evaluates a \= expression in the replacement string of a substitution command for every match of the pattern, even when the n flag is given (which instructs it to simply count the number of matches without substituting—see :help :s_n). What the expression evaluates to does not matter in that case, because the resulting value is being discarded anyway, as no substitution takes place during counting.

This allows us to take advantage of the side effects of an expression without worrying about leaving the contents of the buffer intact in the process, so all the trickery with zero-width matching and empty-sublist indexing can be elided:

:let m=[] | %s/\<case\s\+\(\w\+\):/\=add(m,submatch(1))/gn

Conveniently, the buffer does not even get marked as modified after running this command.
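
Since the question asks for the matches to end up in a separate buffer, the pieces above can be put together roughly as follows. This is only a sketch, assuming Vim 7.4 or later and a legacy script context; filling the new buffer with setline() is one possible choice, and :put =m would work as well.

```vim
" Collect the names without touching the buffer (needs Vim 7.4 or later).
let m = []
%s/\<case\s\+\(\w\+\):/\=add(m, submatch(1))/gn
" Open a new window with an empty buffer and fill it with the matches.
new
call setline(1, m)
```

If the pattern might not occur at all, adding the e flag (/gne) suppresses the "pattern not found" error.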

Datestamp answered 31/1, 2012 at 13:6 Comment(4)
Nice answer. I especially like the little trick with extend() in the replace expression. - Seismoscope
@HerbertSitz: Thanks, I just have noticed that it is possible to use the add() function instead of extend(). By the way, I have rewritten the answer to explain the technique in more detail. - Datestamp
Nice trick. Since the substitution has the side effect of setting 'modified', anyway, we can alternatively have add() return the last added element [-1]; this saves us from the zero-width match and capture: :let t=[] | %s/\<case\s\+\(\w\+\):/\=add(t,submatch(0))[-1]/g - Plaything
@Ingo: But then we will end up with the list containing case Foo:, case Bar:, etc, and not Foo, Bar, etc, as required. It seems that we can't solve the problem correctly without changing boundaries of the match using \zs or \ze anyway. - Datestamp
M
3

Though it's possible to write a one-liner to accomplish your example, it's hard to type commands such as :%s/case \([^:]*\):/\=.../ interactively.

I prefer using vim-grex with the following steps:

  1. Use / to check that the regular expression matches the expected lines. For example: /^\s*\<case\s\+\([^:]*\):.*$<Enter>
  2. Execute :Grey. It yanks the lines matching the current search pattern.
  3. Open a new buffer with :new, etc.
  4. Put the yanked lines with p, etc.
  5. Trim the uninteresting parts with :%s//\1/.
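
For readers without the plugin, the same steps can be approximated with built-in commands. This is a rough sketch and not how vim-grex itself works: it assumes the search from step 1 is still the last search pattern, and it collects the lines into a list (the name lines is arbitrary) instead of yanking them.

```vim
" Gather every line matching the last search pattern into a list.
let lines = []
g//call add(lines, getline('.'))
" Open a new buffer and fill it with the collected lines.
new
call setline(1, lines)
" Reduce each line to the first capture group of the same pattern.
%s//\1/
```

Like the plugin-based steps, this keeps only one match per line, so a line such as case More: case Complex: yields just More.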
Mcclintock answered 1/2, 2012 at 9:25 Comment(0)
H
2

How to use a Vim regex to extract the filetype word (the value after ft=) from the following line, given that 'help' might be any word like 'rust' or 'perlang'.

vim:tw=78:ts=8:ft=help:norl:

Solution:

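" Sample input; in practice foo would hold a line taken from the buffer.
let foo = 'vim:tw=78:ts=8:ft=help:norl:'
let foo = substitute(foo, '^\s*vim:.*:ft=\([a-z]\+\).*:\s*$', '\1', '')
echo "foo: '" . foo . "'"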

Prints:

foo: 'help'

Guru meditation: What's going on here?

Take the string in the variable foo and match it against the pattern: assert the beginning of the line, then any amount of whitespace, the literal vim and a literal colon, then any number of any characters followed by :ft=, then a word of one or more lowercase letters, then anything up to a final colon, optional whitespace, and the end of the line. The escaped parentheses \( and \) capture only the filetype word as group 1. Since the pattern matches the entire string, substitute() replaces all of it with \1 (the third argument), leaving just the captured word.
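
When only the captured group is needed, matchlist() may be a shorter route, since the pattern then does not have to cover the whole line. This is a sketch under the same assumptions about the input; the variable names line and ft are illustrative.

```vim
" Hypothetical sample modeline, same shape as the one above.
let line = 'vim:tw=78:ts=8:ft=help:norl:'
" matchlist() returns [whole match, group 1, group 2, ...] or [] on failure;
" get() supplies '' instead of raising an error when nothing matched.
let ft = get(matchlist(line, ':ft=\([a-z]\+\)'), 1, '')
echo "ft: '" . ft . "'"
```

For the sample line this should print ft: 'help'.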

As a general philosophy, any regex longer than your finger on the screen is an epic fail, so decrease screen resolution until it fits.

Heliotropism answered 17/1, 2019 at 19:37 Comment(0)
A
1

As a small addition to ib.'s accepted answer, which works well as is: it seems that the n flag alone is enough to avoid the issues with unwanted substitution.

:let t=[] | %s/\<case\s\+\(\w\+\):/\=add(t,submatch(1))/gn

From :help :s_flags:

[n] Report the number of matches, do not actually substitute. The [c] flag is ignored. The matches are reported as if 'report' is zero. Useful to count-items. If \= sub-replace-expression is used, the expression will be evaluated in the sandbox at every match.

Accoutre answered 14/1, 2020 at 16:22 Comment(2)
I just came across this behavior of the n flag while scanning :help :s_flags for something else! After going back and updating my answer to take advantage of this feature, I noticed that you have already discovered it since then, too. Great job on catching that! - Datestamp
Turns out, it was introduced during the development of Vim 7.4 (see :helpg Patch 7.3.627), and when I was writing my original answer, it did not exist yet (it was committed to Vim repository eight months later in August 2012 and released with version 7.4 one more year after that in August 2013). I wish I had learned about it earlier. - Datestamp
T
0
:g/^case\s\L\l\+\scase.*/s/case/\r&/g
:let @a=''|g/^case\s\L\l\+:/y A

Now open a new buffer or tmp file, and apply:

"ap
:%s_^\vcase ([^:]+):_\1_

Or if you don't care for your current buffer (you can undo this of course) (updated for the complex example):

:g/^case\s\L\l\+\scase.*/s/case/\r&/g
:v/^case\s\L\l\+:/d
:%s_^\vcase ([^:]+):_\1_
Thusly answered 31/1, 2012 at 13:0 Comment(4)
There are definitely some errors in the commands listed in the first code snippet. Have you run them before posting? Neither of those two commands will even run! What you probably meant is something like :let@a=''|g/^case\s\L\l\+:/y A. - Datestamp
:v/.../d or :g!/.../d is a nice trick, since it deletes all non-matching lines. However, it's not really extracting the text matched by the regex. It's extracting the matching lines, and then, supposing there is a single match per line, the second search & replace would work. It wouldn't work in the general case. I'll update my sample. - Alcantar
@Datestamp thanks for pointing it out, you are right. This happens when I'm on windows, in front of excel... updating the answer. - Thusly
@Wernight, OK, I had updated my answer for your special case. - Thusly