How can I extract the text in an XML tag in Perl?
M

3

5

I'm trying to parse/extract data from an XML file and retrieve necessary data.

For example:

<about>
    This is an XML file
    that I want to
    extract data from
</about>
<message>Hello, this is a message.</message>
<this>Blah</this>
<that>Blahh</that>
<person> 
    <name>Jack</name>
    <age>27</age>
    <email>[email protected]</email>
</person>

I'm having trouble getting the content within the <about> tags.

This is what I have so far:

/(<\w*>)[\s*]?([\s*]?.*)(<\/\w*>)/m

I'm only trying to extract the tag name and content, which is why I have the parentheses there: ($tag = $1) =~ s/[<>]//g to get the tag name, and $tagcontent = $2 to get the tag's contents. I'm using \s for the whitespace characters (space, tab, newline), and the ? because the whitespace may or may not occur.

I was testing this through http://www.regexe.com/, but had no luck with the matching.

Meemeece answered 20/6, 2014 at 20:8 Comment(1)
See: #1732848 - Alathia
K
6

XML is not a regular language and cannot be accurately parsed using regular expressions. Use an XML parser instead: it handles nesting, attributes, comments, and CDATA sections correctly, and will not break if the layout of the markup changes in the future.

However, if you're absolutely sure of the format, you could get away with the following regex:

/<(\w+)>\s*(.*?)\s*<\/\1>/s

Explanation:

  • / - Starting delimiter
  • <(\w+)> - The opening tag
  • \s* - Match optional whitespace in between
  • (.*?) - Match the contents inside the tag
  • \s* - Match optional whitespace in between
  • <\/\1> - Match the closing tag. \1 here is a backreference which contains what was matched by the first capturing group.
  • / - Ending delimiter

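Note that the s and m modifiers do entirely different things: s makes . match newline characters as well, while m makes ^ and $ match at the start and end of every line. The pattern in the question used m, so its .* could not span the lines inside <about>. See this answer for more information about what each does.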
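
To see the pattern at work, here is a minimal sketch that walks over every simple element in a string and collects the tag name and the trimmed content. The sample string and the names $xml and @pairs are made up for illustration, and the approach still assumes flat tags with no attributes and no nesting.

```perl
use strict;
use warnings;

# Hypothetical sample input: flat tags only, no attributes or nesting.
my $xml = <<'END_XML';
<about>
    This is an XML file
    that I want to
    extract data from
</about>
<message>Hello, this is a message.</message>
<this>Blah</this>
END_XML

my @pairs;

# /s lets . match newlines; /g steps through every match in turn.
while ($xml =~ /<(\w+)>\s*(.*?)\s*<\/\1>/sg) {
    my ($tag, $content) = ($1, $2);
    push @pairs, [$tag, $content];
    print "$tag: $content\n";
}
```

Because (.*?) is lazy and is followed by \s*, the captured content comes back without the leading and trailing whitespace around it.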

Regex101 Demo

Kendrakendrah answered 20/6, 2014 at 20:15 Comment(4)
Yea, I should be using XML::Parser, I thought it might be handy to know how to use regex for this too. Thanks for the very clear explanation! - Meemeece
Glad I could help, @lkisac. Since you're new here, I'll say this: If one of the answers below fixes your issue, you should accept it (click the check mark next to the appropriate answer). That does two things. It lets everyone know your issue has been resolved, and it gives the person that helps you credit for the assist. See this post for more information. - Kendrakendrah
Thank you. I initially tried to vote up and it wouldn't allow me to without enough reputation. Didn't know I had to click accept on my end. Thanks for the tip! - Meemeece
+1, but how about a (\S*) in place of the (.*?) to appease the anti-dot-star monster? - Garb
C
6

I advise you not to use a regular expression for parsing XML; use an actual XML parser instead.

The following uses XML::LibXML to display the text in the 'about' node. Another excellent framework is XML::Twig.

use strict;
use warnings;

use XML::LibXML;

my $xml = XML::LibXML->load_xml(IO => \*DATA);

my $about = $xml->findvalue('//about');

print $about, "\n";

__DATA__
<root>
<about>
    This is an XML file
    that I want to
    extract data from
</about>
<message>Hello, this is a message.</message>
<this>Blah</this>
<that>Blahh</that>
<person> 
    <name>Jack</name>
    <age>27</age>
    <email>[email protected]</email>
</person>
</root>

Outputs:

    This is an XML file
    that I want to
    extract data from
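
Since the question asks for both the tag name and its content, a small variation may help: a sketch that visits every direct child of the root element and prints its name and trimmed text. It assumes the same kind of <root> wrapper as above and that XML::LibXML is installed; the shortened sample data, the %text_for hash, and the whitespace trimming are illustrative choices rather than part of the original answer.

```perl
use strict;
use warnings;

use XML::LibXML;

# Hypothetical sample input wrapped in a single <root> element.
my $doc = XML::LibXML->load_xml(string => <<'END_XML');
<root>
<about>
    This is an XML file
</about>
<message>Hello, this is a message.</message>
<person>
    <name>Jack</name>
    <age>27</age>
</person>
</root>
END_XML

my %text_for;

# '/root/*' selects every element that is a direct child of <root>.
for my $node ($doc->findnodes('/root/*')) {
    my $name = $node->nodeName;
    my $text = $node->textContent;   # includes the text of nested elements
    $text =~ s/^\s+|\s+$//g;         # trim leading and trailing whitespace
    $text_for{$name} = $text;
    print "$name: $text\n";
}
```

For an element with children, such as <person>, textContent concatenates the text of all descendants, so a second loop over its child nodes would be needed to pull out individual fields.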
Cuticula answered 20/6, 2014 at 21:3 Comment(1)
#1732848 - Acquainted
T
-1

I agree with the other answers: you should definitely use an XML parser. However, if you are trying to avoid external libraries and just need to quickly grab everything inside the about tag, something like this would work.

#!/usr/bin/perl
use strict;
use warnings;

# I agree with the other comments: use an XML parser for anything
# other than quick code to grab everything inside the about tag.

my $entireFile = "";    # holds the whole input; fine for small files only
while (<>) {
    $entireFile .= $_;  # append each line as it is read
}

# [\w\W] matches any character, including the newlines that a plain . skips
if ($entireFile =~ /<about>([\w\W]*)<\/about>/) {
    print "Match is\n$1\n";
}

The output looks like this:

$ perl xmlMatch.pl xmlMatch.txt
Match is
    This is an XML file
    that I want to
    extract data from
Twophase answered 10/9, 2024 at 12:0 Comment(1)
Now try this on a file with more than one <about> tag. And if you want to slurp the file, you can simply undefine the input record separator: my $entire_file = do { local $/; <> }, with none of the other hijinks. But this is one of the many reasons the previous respondents said not to do this. Imagine this with a 4 GB XML file, because that's a thing that's out there. - Planography
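
The concern about multiple <about> tags can be illustrated with a short sketch. The input string and variable names below are hypothetical; the code only contrasts a greedy capture such as [\w\W]* with a non-greedy one applied repeatedly, and is not a recommendation to parse real XML this way.

```perl
use strict;
use warnings;

# Hypothetical input with two <about> elements.
my $xml = "<about>first</about>\n<note>x</note>\n<about>second</about>\n";

# Greedy: [\w\W]* runs to the LAST </about>, swallowing everything between.
my ($greedy) = $xml =~ /<about>([\w\W]*)<\/about>/;

# Non-greedy with /g in list context: one capture per <about> element.
my @each = $xml =~ /<about>(.*?)<\/about>/sg;

print "greedy: $greedy\n";
print "each:   @each\n";
```

The greedy capture returns one string that spans both elements and the markup between them, while the non-greedy version returns the two contents separately. Either way the whole file still has to fit in memory.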
