Parse and count numeric-only XML text, including values written with e-00 or e+01

I am a Python newbie. I am trying to parse an XML file and count all text values that are entirely numeric, including values in scientific notation using e- or e+. For example, given the sample file below (jerry.xml):

<data>
<country name="Liechtenstein">
    <rank updated="yes">2</rank>
    <language>english</language>
    <currency>1.21$/kg</currency> 
    <gdppc>141100</gdppc>
    <gdpnp>2.304e+0150</gdpnp>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
    <rank updated="yes">5</rank>
    <language>english</language>
    <currency>4.1$/kg</currency> 
    <gdppc>59900</gdppc>
    <gdpnp>5.2e-015</gdpnp>
    <neighbor name="Malaysia" direction="N"/>
</country>
</data>

I would like to return 6, having counted 2, 141100, 2.304e+0150, 5, 59900 and 5.2e-015, while omitting english, 1.21$/kg and 4.1$/kg.

Any help would be appreciated. For now I have the following.

import xml.etree.ElementTree as ET
tree = ET.parse("jerry.xml")
root = tree.getroot()
for text in root.itertext():
    print repr(text)   
charlie = file.writelines(root.itertext())
count = sum(element.firstChild.nodeValue.find(r'\d+$'') for element in xmldoc.getElementsByTagName('jerry.xml'))
Methacrylate answered 17/2, 2015 at 18:44 Comment(0)

You can simply try to convert each inner text element to a float, and ignore any errors.

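import xml.etree.ElementTree as ET

tree = ET.parse("temp.txt")
root = tree.getroot()
nums = []

# itertext() yields every text node, including the whitespace between tags
for e in root.itertext():
    try:
        nums.append(float(e))
    except ValueError:
        # not a number (e.g. "english" or "1.21$/kg"), so skip it
        pass

print(nums)
print(len(nums))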
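
One caveat with the float() approach: float() also accepts strings such as "nan" and "inf", which may or may not count as numeric for your purposes. If that matters, a stricter check with a regular expression is one option. The following is only a sketch under that assumption; the names SAMPLE_XML, NUMERIC and count_numeric_texts are made up for illustration, and the sample data is the question's file with the root element closed.

```python
import re
import xml.etree.ElementTree as ET

# Sample data modelled on the question's jerry.xml, with the root element closed.
SAMPLE_XML = """<data>
<country name="Liechtenstein">
    <rank updated="yes">2</rank>
    <language>english</language>
    <currency>1.21$/kg</currency>
    <gdppc>141100</gdppc>
    <gdpnp>2.304e+0150</gdpnp>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
    <rank updated="yes">5</rank>
    <language>english</language>
    <currency>4.1$/kg</currency>
    <gdppc>59900</gdppc>
    <gdpnp>5.2e-015</gdpnp>
    <neighbor name="Malaysia" direction="N"/>
</country>
</data>"""

# Optional sign, digits with an optional fraction, optional e/E exponent.
NUMERIC = re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?")


def count_numeric_texts(root):
    """Count text nodes that consist of a single decimal or e-notation number."""
    # strip() removes surrounding whitespace; fullmatch() rejects "1.21$/kg".
    return sum(1 for text in root.itertext() if NUMERIC.fullmatch(text.strip()))


root = ET.fromstring(SAMPLE_XML)
print(count_numeric_texts(root))
```

On the sample data this should print 6, the count asked for in the question. Whether the regex or plain float() is the better fit depends on which strings you want to treat as numbers.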

As requested, here is a probably inefficient but working method that keeps track of where each number was found:

def extractNumbers(path, node):
    nums = []

    # Extend the path with this node's tag, plus its name attribute if present
    path += '/' + node.tag
    if 'name' in node.keys():
        path += '=' + node.attrib['name']

    try:
        num = float(node.text)
        nums.append( (path, num) )
    except (ValueError, TypeError):
        # ValueError: text is not numeric; TypeError: node.text is None
        pass

    # Recurse into the child elements
    for e in list(node):
        nums.extend( extractNumbers(path, e) )

    return nums

tree = ET.parse('temp.txt')
nums = extractNumbers('', tree.getroot())
print(len(nums))
print(nums)

for n in nums:
    print(n[0], n[1])
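
The paths built above are descriptive labels, not something ElementTree can look up again. If the goal is to read the text back from the tree later, one possible variation is to record each location as an ElementTree path with 1-based positional predicates, which root.find() accepts. This is only a sketch of that idea, not a tested or tuned solution; numeric_paths, walk and SAMPLE_XML are hypothetical names, and the sample is a shortened version of the question's data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened sample modelled on the question's data (an assumption for illustration).
SAMPLE_XML = """<data>
<country name="Liechtenstein">
    <rank updated="yes">2</rank>
    <language>english</language>
    <gdpnp>2.304e+0150</gdpnp>
</country>
<country name="Singapore">
    <rank updated="yes">5</rank>
    <currency>4.1$/kg</currency>
    <gdpnp>5.2e-015</gdpnp>
</country>
</data>"""


def numeric_paths(root):
    """Return (path, value) pairs for elements whose text parses as a float."""
    found = []

    def walk(node, path):
        try:
            found.append((path, float(node.text)))
        except (ValueError, TypeError):
            # ValueError: text is not numeric; TypeError: node.text is None
            pass
        seen = Counter()  # how many children with each tag we have passed
        for child in node:
            seen[child.tag] += 1
            # tag[n] selects the n-th child with that tag (1-based)
            walk(child, "%s/%s[%d]" % (path, child.tag, seen[child.tag]))

    walk(root, ".")
    return found


root = ET.fromstring(SAMPLE_XML)
pairs = numeric_paths(root)
for path, value in pairs:
    # Read the text back from the tree using the stored path.
    print(path, value, root.find(path).text)
```

Each stored path, for example ./country[2]/rank[1], can be passed back to root.find() on the same tree to reach the element again, as long as the document structure has not changed in between.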
Pave answered 17/2, 2015 at 20:46 Comment(3)
Hello robert_x44. What if I need to extract the path of each of these floats as a separate row or column alongside the value? I tried 'tree.getroot().itertext()' but it doesn't seem to work. - Methacrylate
I'm sure there is an infinitely more elegant way to do this, but I've added a quick and dirty method to my answer. - Pave
What is an efficient way to keep track of the location of the elements? I need to read the texts back from the XML based on the extracted location, but I am having trouble with that. Please see post - Methacrylate
