How to extract string following a pattern with grep, regex or perl [duplicate]

Asked 22/2, 2011 at 16:34 Answered 1/12, 2017 at 22:56

Solved regex perl sed html-parsing text-extraction

122

I have a file that looks something like this:

    <table name="content_analyzer" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer2" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer_items" primary-key="id">
      <type="global" />
    </table>

I need to extract anything within the quotes that follow name=, i.e., content_analyzer, content_analyzer2 and content_analyzer_items.

I am doing this on a Linux box, so a solution using sed, perl, grep or bash is fine.

Toilet answered 22/2, 2011 at 16:34 Comment(3)

no need to be shy, welcome here! – Nat 22/2, 2011 at 16:42

I feel that it would be wrong not to link to #1732848 – Saltant 22/2, 2011 at 16:42

Thanks everyone for the useful comments. I apologize for the XML not being properly formatted. I deleted some tags for simplification. – Toilet 24/2, 2011 at 15:20

223

Since you need to match text without including it in the result (name=" must match, but it is not part of the desired output), some form of zero-width matching or group capturing is required. This can be done easily with the following tools:

Perl

With Perl you could use the -n option to loop over the input line by line and print the content of a capturing group whenever the line matches:

perl -ne 'print "$1\n" if /name="(.*?)"/' filename

GNU grep

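If you have an improved version of grep, such as GNU grep, you may have the -P option available. This option enables Perl-compatible regular expressions, which lets you use \K as a shorthand for a lookbehind. \K resets the start of the reported match, so anything matched before it is left out of the output. The trailing (?=") is a lookahead: it requires the closing quote without including it in the match.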

grep -Po 'name="\K.*?(?=")' filename

The -o option makes grep print only the matched text instead of the whole line.

Vim - Text Editor

Another way is to use a text editor directly. With Vim, one of the various ways of accomplishing this would be to delete lines without name= and then extract the content from the resulting lines:

:v/.*name="\v([^"]+).*/d|%s//\1

Standard grep

If you don't have access to these tools for some reason, something similar can be achieved with standard grep. However, without lookarounds the output still includes name=" and the closing quote, so it will require some cleanup later:

grep -o 'name="[^"]*"' filename
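The cleanup step is not shown above. A minimal sketch of one way to do it, assuming each fragment printed by grep has the form name="value", is to pipe the fragments through cut and keep the field between the quotes. The function name extract_names and the sample variable are made up for this illustration.

```shell
# Hypothetical helper (not part of the original answer).
# Reads text on stdin and prints the value of every name="..." attribute.
extract_names() {
    # Step 1: keep only the name="..." fragments.
    # Step 2: split each fragment on double quotes and keep field 2.
    grep -o 'name="[^"]*"' | cut -d '"' -f 2
}

# Sample input copied from the question.
sample='<table name="content_analyzer" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
  <type="global" />
</table>'

printf '%s\n' "$sample" | extract_names
```

With a real file, the same function would be called as extract_names < filename.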

A note about saving results

All of the commands above send their results to stdout. You can save them by redirecting the output to a file, appending:

> result

to the end of the command.

Chaworth answered 22/2, 2011 at 17:21 Comment(11)

Lookarounds (in GNU grep): grep -Po '.*name="\K.*?(?=".*)' – Lodestar 22/2, 2011 at 19:54

@Dennis Williamson, great. I updated the answer accordingly, but left both .* aside, I hope you don't get angry with me. I'd like to ask, do you see any benefits from un-greedy match over "anything except ""? Don't take this as a fight, I'm just curious and I'm not a regex expert. Also, the \K tip, really nice. Thanks Dennis. – Chaworth 22/2, 2011 at 23:44

Why would I be angry? Without the .*, you can do grep -Po '(?<=name=").*?(?=")'. The \K can be used for shorthand, but it's really only needed if the match to its left is variable length. In cases like this, the reason for using lookarounds is fairly obvious. Ungreedy operations look a little neater ([^"]* versus .*?) and you don't have to repeat the anchor character. I don't know about speed. That depends a lot on the context, I think. I hope that's helpful. – Lodestar 23/2, 2011 at 0:59

@Dennis Williamson: certainly, sir, a lot of helpful information here. I think the reason I kept the \K (after researching it) and removed the .* was the same: to make it look prettier (simpler). And I had never thought of using .*? instead of the "traditional way" I learned somewhere. But un-greedy really makes sense here. Thanks Dennis, best wishes. – Chaworth 23/2, 2011 at 1:33

+1 for describing the command. Would appreciate it if you could update your answer to explain the "[...]" part of the regex. – Butterworth 4/3, 2014 at 16:4

@Butterworth Thank you. That is a character class; when it begins with ^, it matches anything except its contents. So [^"] means any character that's not a quote. I dropped it from the later version of the answer in favor of the ungreedy version, .*?. The previous version was greedy, so I used that class to match everything that is not a quote, with the intention of stopping at the first quote, which is the same as matching anything "ungreedily" up to a quote. I hope this helps; let me know if I can clarify any part. – Chaworth 7/3, 2014 at 15:11

The -P flag does not seem to be supported on OS X: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]] [-e pattern] [-f file] [--binary-files=value] [--color=when] [--context[=num]] [--directories=action] [--label] [--line-buffered] [--null] [pattern] [file ...] – Knap 14/1, 2015 at 10:15

@PerQuestedAronsson it is cited in the manual as an extension. Not sure how well documented it is, but I'm on OS X as well and it works here. – Chaworth 14/1, 2015 at 23:2

@sidyll: I found this article: "Perl Regex Removed From Grep in Mountain Lion" (dirtdon.com/?p=1452 ). I'm on Yosemite myself, but the article seems to be valid for that as well. – Knap 15/1, 2015 at 7:18

On OS X, simply install grep via homebrew and use that instead of the default one. It should work. – Margarettamargarette 29/5, 2015 at 23:43

grep -Po 'look-ahead \K capture' made my day. Slick. – Carolinecarolingian 20/1, 2017 at 1:30

The regular expression would be:

.+name="([^"]+)"

The captured name is then available as group 1 (\1).
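One way to apply this expression from the shell is sed's substitute command, sketched below under a few assumptions: the leading .+ is relaxed to .* so that name= may also start a line, the group is written \( \) because sed uses basic regular expressions by default, and each line holds at most one name= attribute. The function name sed_names is made up for this illustration.

```shell
# Hypothetical helper (not part of the original answer).
# Reads text on stdin and prints the quoted value after name= on each line.
sed_names() {
    # -n suppresses the default printing of every line.
    # The substitution replaces the whole line with capture group \1,
    # and the trailing p prints only lines where it succeeded.
    sed -n 's/.*name="\([^"]*\)".*/\1/p'
}

# Sample input copied from the question.
sample='<table name="content_analyzer" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
  <type="global" />
</table>'

printf '%s\n' "$sample" | sed_names
```

Because the leading .* is greedy, a line with several name= attributes would yield only the last one.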

Christean answered 22/2, 2011 at 16:39 Comment(0)

If you're using Perl, download a module to parse the XML: XML::Simple, XML::Twig, or XML::LibXML. Don't re-invent the wheel.

Morganne answered 22/2, 2011 at 16:43 Comment(1)

Note that the example the OP gave is not well-formed (<type="global" for instance), so most XML parsers will just complain and die. – Setose 22/2, 2011 at 17:20

An HTML parser should be used for this purpose rather than regular expressions. A Perl program that makes use of HTML::TreeBuilder:

Program

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file( \*DATA );
my @elements = $tree->look_down(
    sub { defined $_[0]->attr('name') }
);

for (@elements) {
    print $_->attr('name'), "\n";
}

__DATA__
<table name="content_analyzer" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
  <type="global" />
</table>

Output

content_analyzer
content_analyzer2
content_analyzer_items

Anticlockwise answered 22/2, 2011 at 17:12 Comment(0)

This could do it:

perl -ne 'if(m/name="(.*?)"/){ print $1 . "\n"; }'

Nat answered 22/2, 2011 at 16:39 Comment(0)

Here's a solution using HTML tidy & xmlstarlet:

htmlstr='
<table name="content_analyzer" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
<type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
<type="global" />
</table>
'

echo "$htmlstr" | tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
sed '/type="global"/d' |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n

Martyrize answered 16/3, 2011 at 17:49 Comment(0)

Oops, the sed command has to precede the tidy command of course:

echo "$htmlstr" | 
sed '/type="global"/d' |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null |
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:table" -v '@name' -n

Martyrize answered 16/3, 2011 at 17:59 Comment(0)

If the structure of your XML (or text in general) is fixed, the easiest way is to use cut. For your specific case:

echo '<table name="content_analyzer" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer2" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
  <type="global" />
</table>' | grep name= | cut -f2 -d '"'
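The same fixed-structure idea can be sketched with awk alone, which filters and splits in one step. This is a variation rather than the command above, and it relies on the same assumption: the name value is the first quoted string on the line. The function name awk_names is made up for this illustration.

```shell
# Hypothetical helper (a variation on the grep | cut pipeline).
# Reads text on stdin and prints the first quoted value on lines with name=.
awk_names() {
    # -F '"' splits each line on double quotes, so field 2 is the text
    # between the first pair of quotes; /name=/ selects the lines to print.
    awk -F '"' '/name=/ { print $2 }'
}

# Sample input copied from the question.
sample='<table name="content_analyzer" primary-key="id">
  <type="global" />
</table>
<table name="content_analyzer_items" primary-key="id">
  <type="global" />
</table>'

printf '%s\n' "$sample" | awk_names
```

If another quoted attribute ever came before name= on the line, field 2 would no longer be the name, so this only fits input with a fixed layout.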

Paraguay answered 1/12, 2017 at 22:56 Comment(0)
