Edit XML file text based on path

Asked 1/4, 2015 at 2:56 Answered 24/4, 2015 at 16:21

I have an XML file (e.g. jerry.xml) which contains some data as given below.

<data>
<country name="Peru">
    <rank updated="yes">2</rank>
    <language>english</language>
    <currency>1.21$/kg</currency> 
    <gdppc month="06">141100</gdppc>
    <gdpnp month="10">2.304e+0150</gdpnp>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
    <rank updated="yes">5</rank>
    <language>english</language>
    <currency>4.1$/kg</currency> 
    <gdppc month="05">59900</gdppc>
    <gdpnp month="08">1.9e-015</gdpnp>
    <neighbor name="Malaysia" direction="N"/>
</country>

I extracted the full paths of some selected texts from the xml above using the code below. The reasons are given in this post.

def extractNumbers(path, node):
    nums = []

    if 'month' in node.attrib:
        if node.attrib['month'] in ['05', '06']:
            return nums

    path += '/' + node.tag
    if 'name' in node.keys():
        path += '=' + node.attrib['name']

    elif 'year' in node.keys():
        path += ' ' + 'month' + '=' + node.attrib['month']
    try:
        num = float(node.text)
        nums.append( (path, num) )
    except (ValueError, TypeError):
        pass
    for e in list(node):
        nums.extend( extractNumbers(path, e) )
    return nums

tree = ET.parse('jerry.xml')
nums = extractNumbers('', tree.getroot())
print len(nums)
print nums

This gives me the location of the elements I need to change, as shown in column 1 of the csv below (e.g. hrong.csv).

Path                                                      Text1       Text2       Text3       Text4       Text5 
'/data/country name=singapore/gdpnp month=08';            5.2e-015;   2e-05;      8e-06;      9e-04;      0.4e-05;   
'/data/country name=peru/gdppc month=06';                 0.04;       0.02;       0.15;       3.24;       0.98;

I would like to replace the text of the elements of the original XML file (jerry.xml) by those in column 2 of the hrong.csv above, based on the location of the elements in column 1.

I am new to Python and realize I might not be using the best approach, so I would appreciate any pointers. Basically, I need to parse only selected text nodes of an XML file, modify them, and save each resulting file.

Thanks

Hypogynous answered 1/4, 2015 at 2:56 Comment(8)

"need to parse only some selected text nodes"-- which ones? how do you select them? – Thickset 18/4, 2015 at 13:32

@Thickset I need to only consider nodes whose texts are float-able. Please see earlier post for more clarification link – Hypogynous 19/4, 2015 at 18:53

But it also looks like you're selecting only months '05' and '06'? Are those the only months? other months like '08' and '10' don't apply??? – Thickset 19/4, 2015 at 20:16

what do you mean by "save each file"? is every line in the csv a new file? is every column a new file? you said replace text "by those in column 2". what about columns 3,4,5,etc.? the statement of the problem and desired output is still very confusing, as well as proper XPath notation. – Thickset 19/4, 2015 at 22:48

@Thickset Yes, only certain months need to be selected. Also I need to edit the xml based on the values in each column of the csv. So each column of the csv would correspond to a new xml file. I am trying to do a monte carlo simulation based on the original xml file. This involves changing certain parameters in the xml file. – Hypogynous 20/4, 2015 at 19:24

The 'hrong.csv' isn't a valid csv, even if you consider ';' as the delimiter. Assuming it was a valid file, where is the code to read the csv and create files based on the columns? Are you asking in this question for someone to write all the code to do the entire application? – Thickset 20/4, 2015 at 23:14

I'd also add that the xml is not valid, nor is the python. The code added here should be usable. XML needs a trailing </data> and .py needs an import xml.etree.ElementTree as ET. That said, I'm checking this, if there's anything I can add I will in a moment – Vaticinate 24/4, 2015 at 13:51

In addition to the above problems, the algorithm you've posted does not generate the path string in the csv that you've posted. A path with month=6 will never be generated by your algorithm. You posted 'I need help getting an exact solution to this problem' , and yet your question is full of errors which will prevent an exact solution from being created. I'll work to answer, as well as possible – Vaticinate 24/4, 2015 at 14:31

You should be able to use the XPath capabilities of the module to do this:

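import xml.etree.ElementTree as ET

csv_value = '5.2e-015'  # placeholder: the replacement text read from the csv

tree = ET.parse('jerry.xml')
root = tree.getroot()
# Attribute matching is case-sensitive: the XML uses name="Singapore"
for data in root.findall(".//country[@name='Singapore']/gdpnp[@month='08']"):
    data.text = csv_value

tree.write("filename.xml")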

So you need to rewrite the path in the csv to match the XPath rules defined for the module (see Supported XPath rules).
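
As a rough sketch of that rewrite, the helper below converts a path in the format used in the question's csv (for example '/data/country name=singapore/gdpnp month=08') into an expression ElementTree accepts. The function name to_xpath is made up for illustration, and it assumes that each step carries at most one attribute=value pair and that the first step is the root element. Attribute matching is case-sensitive, so name=singapore only matches if the XML spells it the same way.

```python
import xml.etree.ElementTree as ET

def to_xpath(csv_path):
    """Turn '/data/country name=x/gdpnp month=08' into an ElementTree XPath (hypothetical helper)."""
    steps = [s for s in csv_path.strip().strip("'").split('/') if s]
    parts = []
    for step in steps[1:]:  # drop the root element; the path is relative to it
        tag, _, attr = step.partition(' ')
        if attr:
            key, _, value = attr.partition('=')
            parts.append("{}[@{}='{}']".format(tag, key, value))
        else:
            parts.append(tag)
    return './' + '/'.join(parts)

# Small in-memory document so the example runs without jerry.xml
root = ET.fromstring(
    '<data><country name="singapore"><gdpnp month="08">1.9e-015</gdpnp></country></data>'
)
xpath = to_xpath("'/data/country name=singapore/gdpnp month=08'")
for node in root.findall(xpath):
    node.text = '5.2e-015'  # Text1 value from the question's csv
```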

Trial answered 17/4, 2015 at 17:21 Comment(4)

Is there a way to retrieve the XPath defined paths of text nodes automatically? I used the method described here link. – Hypogynous 19/4, 2015 at 19:0

No, but the answer in the link contains already all the information for the xpath rule. You only need to rewrite it a bit so it matches my example above. – Trial 19/4, 2015 at 20:52

Quick question @rfkortekaas. Is it possible to dynamically name the written files? The problem is that I have to write >10,000 of these created xml files. Thanks – Hypogynous 4/5, 2015 at 15:8

Yes that's possible. You can just give a variable to tree.write which you are changing for the correct file name. – Trial 16/5, 2015 at 15:37
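
One possible way to do that, assuming the file names only need to be unique and sortable, is to build each name from a counter with str.format and pass it to tree.write. The output_dir location, the jerry_{:05d}.xml pattern and the in-memory XML are placeholders for illustration; the values are taken from the first csv row in the question.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

tree = ET.ElementTree(ET.fromstring(
    '<data><country name="Singapore"><gdpnp month="08">1.9e-015</gdpnp></country></data>'
))
values = ['5.2e-015', '2e-05', '8e-06']  # one replacement value per output file
output_dir = tempfile.mkdtemp()          # placeholder output location

written = []
for i, value in enumerate(values, start=1):
    for node in tree.findall(".//country[@name='Singapore']/gdpnp[@month='08']"):
        node.text = value
    # A zero-padded counter keeps >10,000 files in order when sorted by name
    filename = os.path.join(output_dir, 'jerry_{:05d}.xml'.format(i))
    tree.write(filename)
    written.append(filename)
```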

First of all, see the documentation on how to modify an XML file. Now, here is my own example:

import xml.etree.ElementTree as ET

s = """
<root>
    <parent attribute="value">
        <child_1 other_attr="other_value">child text</child_1>
        <child_2 yet_another_attr="another_value">more child text</child_2>
    </parent>
</root>
"""

root = ET.fromstring(s)

for parent in root:  # iterate directly; getchildren() was removed in Python 3.9
    parent.attrib['attribute'] = 'new value'
    for child in parent:
        child.attrib['new_attrib'] = 'new attribute for {}'.format(child.tag)
        child.text += ', appended text!'

>>> ET.dump(root)
<root>
    <parent attribute="new value">
        <child_1 new_attrib="new attribute for child_1" other_attr="other_value">child text, appended text!</child_1>
        <child_2 new_attrib="new attribute for child_2" yet_another_attr="another_value">more child text, appended text!</child_2>
    </parent>
</root>

You can do this with XPath as well:

>>> root.find('parent/child_1[@other_attr]').attrib['other_attr'] = 'found it!'
>>> ET.dump(root)
<root>
    <parent attribute="new value">
        <child_1 new_attrib="new attribute for child_1" other_attr="found it!">child text, appended text!</child_1>
        <child_2 new_attrib="new attribute for child_2" yet_another_attr="another_value">more child text, appended text!</child_2>
    </parent>
</root>

Furniture answered 22/4, 2015 at 8:53 Comment(0)

I've altered your extractNumbers function and the other code to generate relative XPath expressions from the file that is read in.

import xml.etree.ElementTree as ET

def extractNumbers(path, node):
    nums = []
    # You'll want to store a relative, rather than an absolute path.
    if not path: # This is the root node: start with './/' so the search covers all of the root's descendants.
        path = ".//"
    else: # This is not the root node
        if 'month' in node.attrib:
            if node.attrib['month'] in ['05', '06']:
                return nums

        path += node.tag
        if 'name' in node.keys():
            path += '[@name="{:s}"]/'.format(node.attrib['name'])
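        # Note: no element in the sample XML has a 'year' attribute, so this
        # branch never runs and the generated paths carry no month predicate.
        elif 'year' in node.keys():
            path += '[@month="{:s}"]/'.format(node.attrib['month'])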
        try:
            num = float(node.text)
            nums.append((path, num) )
        except (ValueError, TypeError):
            pass
    # Descend into the node's child nodes
    for e in list(node):
        nums.extend( extractNumbers(path, e) )
    return nums

tree = ET.parse('jerry.xml')
nums = extractNumbers('', tree.getroot())

At this point you have a nums list populated with (path, num) tuples. You'll want to write the path into your csv. In the following, I've assumed that you know the Text1, Text2, and Text3 values beforehand, so I've written 'foo', 'bar', 'baz' into each row.

import csv
# Write the CSV file with the data found from extractNumbers
with open('records.csv', 'w', newline='') as records:  # newline='' is what the csv module expects on Python 3
    writer = csv.writer(records, delimiter=';')
    writer.writerow(['Path', 'Text1', 'Text2', 'Text3'])
    for entry in nums:
        # Ensure that you're writing a relative xpath
        rel_path = entry[0]
        # Replace 'foo' (the Text1 column) with a real value; it is what gets written into the XML below
        writer.writerow([rel_path, 'foo', 'bar', 'baz'])

You will now have the following CSV file:

Path;Text1;Text2;Text3
".//country[@name=""Peru""]/rank";foo;bar;baz
".//country[@name=""Peru""]/gdpnp";foo;bar;baz
".//country[@name=""Singapore""]/rank";foo;bar;baz
".//country[@name=""Singapore""]/gdpnp";foo;bar;baz

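The doubled quotes in the Path column are CSV escaping: each path contains double quotes, so csv.writer wraps the field in quotes and doubles the inner ones. A quick check, using an in-memory string in place of records.csv, shows csv.reader turning the field back into the plain XPath:

```python
import csv
import io

# One data line as it appears in records.csv
line = '".//country[@name=""Peru""]/rank";foo;bar;baz\n'
row = next(csv.reader(io.StringIO(line), delimiter=';'))
print(row[0])  # the XPath with its ordinary double quotes restored
```

So the value in row[0] can be passed to findall as-is, with no manual unescaping.
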
In the following code, you read the CSV file and use the Path column to alter the appropriate values:

import csv
import xml.etree.ElementTree as ET
with open('records.csv', 'r') as records:
    reader = csv.reader(records, delimiter=';')
    for row in reader:
        if reader.line_num == 1: continue # skip the row of headers
        for data in tree.findall(row[0]):
            data.text = row[1]
tree.write('jerry_new.xml')

You'll have the following results in jerry_new.xml:

<data>
    <country name="Peru">
        <rank updated="yes">foo</rank>
        <language>english</language>
        <currency>1.21$/kg</currency>
        <gdppc month="06">141100</gdppc>
        <gdpnp month="10">foo</gdpnp>
        <neighbor direction="E" name="Austria" />
        <neighbor direction="W" name="Switzerland" />
    </country>
    <country name="Singapore">
        <rank updated="yes">foo</rank>
        <language>english</language>
        <currency>4.1$/kg</currency>
        <gdppc month="05">59900</gdppc>
        <gdpnp month="08">foo</gdpnp>
        <neighbor direction="N" name="Malaysia" />
    </country>
</data>

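The code above only applies the Text1 column. The question's comments ask for one XML file per csv column, so a possible extension is sketched below. It is not part of the code above: the function name xml_per_column is hypothetical, the XML and CSV strings (including the values 10, 11, 20, 21) are made-up in-memory stand-ins for jerry.xml and records.csv, and each column is applied to a fresh copy of the tree so that earlier columns do not leak into later ones.

```python
import copy
import csv
import io
import xml.etree.ElementTree as ET

XML = ('<data><country name="Peru"><rank>2</rank></country>'
       '<country name="Singapore"><rank>5</rank></country></data>')
CSV = ('Path;Text1;Text2\n'
       '".//country[@name=""Peru""]/rank";10;11\n'
       '".//country[@name=""Singapore""]/rank";20;21\n')

def xml_per_column(root, csv_file):
    """Return {column name: modified copy of root}, one copy per value column."""
    rows = list(csv.reader(csv_file, delimiter=';'))
    header, body = rows[0], rows[1:]
    results = {}
    for col in range(1, len(header)):        # column 0 holds the XPath
        variant = copy.deepcopy(root)        # leave the original tree untouched
        for row in body:
            for node in variant.findall(row[0]):
                node.text = row[col]
        results[header[col]] = variant
    return results

variants = xml_per_column(ET.fromstring(XML), io.StringIO(CSV))
for name, variant in variants.items():
    print(name, ET.tostring(variant, encoding='unicode'))
```

To save files instead of printing, each copy could be wrapped as ET.ElementTree(variant) and written with a file name derived from the column name.
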
Vaticinate answered 24/4, 2015 at 16:21 Comment(0)
