Converting xml to dictionary using ElementTree
S

12

43

I'm looking for an XML-to-dictionary parser based on ElementTree. I have already found some, but they drop the attributes, and my documents have a lot of attributes.

Sangfroid answered 7/10, 2011 at 7:38 Comment(0)
S
32
# Python 3 compatible; iterating the element works with both lxml and xml.etree
def etree_to_dict(t):
    d = {t.tag: [etree_to_dict(child) for child in t]}
    d.update(('@' + k, v) for k, v in t.attrib.items())
    d['text'] = t.text
    return d

Call as

from lxml import etree

tree = etree.parse("some_file.xml")
etree_to_dict(tree.getroot())

This works as long as no element is actually tagged text (attributes are prefixed with @, so they cannot collide with the text key); if one is, change the third line in the function body to use a different key. Also, this cannot handle mixed content.

(Tested on LXML.)
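
To make the mixed-content limitation concrete: text that follows a child element is stored in that child's tail attribute, which the function above never reads. The following is only a sketch of one way to keep it, using the standard library's xml.etree.ElementTree rather than lxml; the function name etree_to_dict_with_tail and the 'tail' key are illustrative choices, not part of the original answer.

```python
import xml.etree.ElementTree as ET

def etree_to_dict_with_tail(t):
    """Hypothetical variant: also records text that follows an element."""
    d = {t.tag: [etree_to_dict_with_tail(child) for child in t]}
    d.update(('@' + k, v) for k, v in t.attrib.items())
    d['text'] = t.text
    # .tail holds the text between this element's end tag and the next tag
    if t.tail and t.tail.strip():
        d['tail'] = t.tail.strip()
    return d

root = ET.fromstring('<p>Hello <b>bold</b> world</p>')
result = etree_to_dict_with_tail(root)
# result == {'p': [{'b': [], 'text': 'bold', 'tail': 'world'}], 'text': 'Hello '}
```

The 'tail' key can collide with an element tagged tail, in the same way the 'text' key can; a different key name avoids that.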

Spitler answered 7/10, 2011 at 8:7 Comment(3)
I got an error on iterchildren, so I changed it to getchildren. With this example I get the attributes, but the node values are empty, for example: {'Tag': 'Lidars', 'lidars_list': [{'positive_towards_LOS': 'false', 'scanner_3D': 'true', 'lidar': [{'name': []}, where the name is LNAC but I get an empty dictionary. - Sangfroid
@OHLÁLÁ Hi, did you manage to modify the code to convert XML to a dictionary? Thanks - Inez
This returns a map (whatever that is) as the value of the first key of the dictionary, not a nested dictionary. - Inez
S
62

The following XML-to-Python-dict snippet parses entities as well as attributes following this XML-to-JSON "specification":

from collections import defaultdict

def etree_to_dict(t):
    d = {t.tag: {} if t.attrib else None}
    children = list(t)
    if children:
        dd = defaultdict(list)
        for dc in map(etree_to_dict, children):
            for k, v in dc.items():
                dd[k].append(v)
        d = {t.tag: {k: v[0] if len(v) == 1 else v
                     for k, v in dd.items()}}
    if t.attrib:
        d[t.tag].update(('@' + k, v)
                        for k, v in t.attrib.items())
    if t.text:
        text = t.text.strip()
        if children or t.attrib:
            if text:
                d[t.tag]['#text'] = text
        else:
            d[t.tag] = text
    return d

It is used:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
e = ET.XML('''
<root>
  <e />
  <e>text</e>
  <e name="value" />
  <e name="value">text</e>
  <e> <a>text</a> <b>text</b> </e>
  <e> <a>text</a> <a>text</a> </e>
  <e> text <a>text</a> </e>
</root>
''')

from pprint import pprint

d = etree_to_dict(e)

pprint(d)

The output of this example (as per above-linked "specification") should be:

{'root': {'e': [None,
                'text',
                {'@name': 'value'},
                {'#text': 'text', '@name': 'value'},
                {'a': 'text', 'b': 'text'},
                {'a': ['text', 'text']},
                {'#text': 'text', 'a': 'text'}]}}

Not necessarily pretty, but it is unambiguous, and simpler XML inputs result in simpler JSON. :)
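
Because the result contains only dicts, lists, strings and None, it can be passed straight to the standard json module. A minimal sketch, using a hand-written excerpt of the output above rather than a live conversion:

```python
import json

# Hand-copied excerpt of the structure shown above (not produced by a live conversion here)
d = {'root': {'e': [None, 'text', {'@name': 'value'}, {'#text': 'text', '@name': 'value'}]}}

# None becomes JSON null; the '@' and '#' prefixes are ordinary characters in JSON keys
as_json = json.dumps(d, sort_keys=True)

# Loading the JSON back gives an equal structure
restored = json.loads(as_json)
```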


Update

If you want to do the reverse and build an XML element (which can then be serialized to a string) from a JSON-style dict, you can use:

try:
  basestring
except NameError:  # python3
  basestring = str

def dict_to_etree(d):
    def _to_etree(d, root):
        if not d:
            pass
        elif isinstance(d, str):
            root.text = d
        elif isinstance(d, dict):
            for k,v in d.items():
                assert isinstance(k, str)
                if k.startswith('#'):
                    assert k == '#text' and isinstance(v, str)
                    root.text = v
                elif k.startswith('@'):
                    assert isinstance(v, str)
                    root.set(k[1:], v)
                elif isinstance(v, list):
                    for e in v:
                        _to_etree(e, ET.SubElement(root, k))
                else:
                    _to_etree(v, ET.SubElement(root, k))
        else:
            assert d == 'invalid type', (type(d), d)
    assert isinstance(d, dict) and len(d) == 1
    tag, body = next(iter(d.items()))
    node = ET.Element(tag)
    _to_etree(body, node)
    return node

print(ET.tostring(dict_to_etree(d)))
Seadog answered 9/4, 2012 at 17:3 Comment(10)
This code throws an error if a node has no text (such as the first <e> node): you get AttributeError: 'NoneType' object has no attribute 'strip'. - Cockayne
Is there any example of the reverse (dict -> xml) conversion? - Lacework
This is one of the best xml -> dict converters I have ever tried (and there are a lot of them: xmltodict, several recipes on several websites, etc.). - Halfpint
@Lacework I added what I believe is the inverse of the above function. Sorry, a bit late. :P - Seadog
As @Halfpint said, this is the best XML -> dict implementation I have ever tried. And I have tried many. - Samaniego
Very good. But I got confused for a bit until I realised that for dict_to_etree to truly be an inverse, it should return an etree, not a string, i.e. the last line should be return node. - Angleworm
@Angleworm Thanks, fixed. SE lets anyone edit answers, though. :) - Seadog
In my experience, this solution did not account for xmlns at the root of the document. An easy solution is to just strip it out. I found this question to be a helpful addition. - Electrograph
Note: In Python 3, dict.iteritems() doesn't exist anymore. Change the 3 instances of that method to just items() and all is well again. - Drudgery
I know this is many years later, but this code fails when t.text is set but not t.attrib. To fix it, replace these lines: "def etree_to_dict(t): d = {t.tag: {}} # type Dict[Any, Any] ... return d" with these: "def etree_to_dict(t): # type: (ET.Element) -> Dict d = {t.tag: {}} # type Dict[Any, Any] ... return {t.tag: None} if len(d[t.tag]) == 0 else d". Note: the type hints resolve errors reported by Pylance but are optional. - Endpaper
N
7

For transforming XML from/to python dictionaries, xmltodict has worked great for me:

import xmltodict

xml = '''
<root>
  <e />
  <e>text</e>
  <e name="value" />
  <e name="value">text</e>
  <e> <a>text</a> <b>text</b> </e>
  <e> <a>text</a> <a>text</a> </e>
  <e> text <a>text</a> </e>
</root>
'''

xdict = xmltodict.parse(xml)

xdict will now look like

OrderedDict([('root',
              OrderedDict([('e',
                            [None,
                             'text',
                             OrderedDict([('@name', 'value')]),
                             OrderedDict([('@name', 'value'),
                                          ('#text', 'text')]),
                             OrderedDict([('a', 'text'), ('b', 'text')]),
                             OrderedDict([('a', ['text', 'text'])]),
                             OrderedDict([('a', 'text'),
                                          ('#text', 'text')])])]))])

If your XML data is not in raw string/bytes form but in some ElementTree object, you just need to serialize it to a string and call xmltodict.parse again. For instance, if you are using lxml to process the XML documents, then

from lxml import etree
e = etree.XML(xml)
xmltodict.parse(etree.tostring(e))

will produce the same dictionary as above.

Newbold answered 27/12, 2017 at 9:24 Comment(0)
I
3

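Based on @larsmans's answer: if you don't need attributes, this will give you a tighter dictionary:

def etree_to_dict(t):
    # Build a real list so that "or" can fall back to the text for leaf elements
    # (on Python 3 a bare map object is always truthy)
    return {t.tag: [etree_to_dict(child) for child in t] or t.text}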
Interlay answered 24/10, 2013 at 4:37 Comment(0)
L
3

Several answers already, but here's one compact solution that maps attributes, text value and children using dict-comprehension:

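import xml.etree.ElementTree as ET

def etree_to_dict(t):
    # Accept a whole tree as well as a single element
    if isinstance(t, ET.ElementTree):
        return etree_to_dict(t.getroot())
    # Note: repeated child tags overwrite each other in this compact form
    return {
        **t.attrib,
        'text': t.text,
        **{e.tag: etree_to_dict(e) for e in t}
    }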
Limiter answered 22/6, 2021 at 11:36 Comment(0)
T
2

The lxml documentation gives an example of how to map an XML tree into a dict of dicts:

def recursive_dict(element):
    return element.tag, dict(map(recursive_dict, element)) or element.text

Note that this beautiful quick-and-dirty converter expects children to have unique tag names and will silently overwrite any data contained in preceding siblings with the same name. For any real-world application of XML-to-dict conversion, you would be better off writing your own, longer version of this.
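
The overwriting is easy to reproduce. The sketch below runs the same two-line converter on a tiny document with two same-named siblings; it uses the standard library's xml.etree.ElementTree instead of lxml (an assumption made only to keep the example dependency-free), and the sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET

def recursive_dict(element):
    return element.tag, dict(map(recursive_dict, element)) or element.text

# Two <Person> siblings: dict() keeps only the last value for a repeated key
root = ET.fromstring('<Data><Person>John</Person><Person>Jane</Person></Data>')
tag, body = recursive_dict(root)
# tag == 'Data', body == {'Person': 'Jane'}; 'John' is lost without any warning
```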

You could create a custom dictionary to deal with preceding siblings with the same name being overwritten:

from collections import UserDict, namedtuple
from lxml.etree import QName

class XmlDict(UserDict):
    """Custom dict to avoid preceding siblings with the same name being overwritten."""

    __ROOTELM = namedtuple('RootElm', ['tag', 'node'])

    def __setitem__(self, key, value):
        if key in self:
            if type(self.data[key]) is list:
                self.data[key].append(value)
            else:
                self.data[key] = [self.data[key], value]
        else:
            self.data[key] = value

    @staticmethod
    def xml2dict(element):
        """Converts an ElementTree Element to a dictionary."""
        elm = XmlDict.__ROOTELM(
            tag=QName(element).localname,
            node=XmlDict(map(XmlDict.xml2dict, element)) or element.text,
        )
        return elm

Usage

from lxml import etree
from pprint import pprint

xml_f = b"""<?xml version="1.0" encoding="UTF-8"?>
            <Data>
              <Person>
                <First>John</First>
                <Last>Smith</Last>
              </Person>
              <Person>
                <First>Jane</First>
                <Last>Doe</Last>
              </Person>
            </Data>"""

elm = etree.fromstring(xml_f)
d = XmlDict.xml2dict(elm)

Output

In [3]: pprint(d)
RootElm(tag='Data', node={'Person': [{'First': 'John', 'Last': 'Smith'}, {'First': 'Jane', 'Last': 'Doe'}]})

In [4]: pprint(d.node)
{'Person': [{'First': 'John', 'Last': 'Smith'},
            {'First': 'Jane', 'Last': 'Doe'}]}
Thousandfold answered 12/8, 2020 at 2:33 Comment(2)
For me this returns a tuple, not a dictionary. - Inez
For sure this can be improved. - Thousandfold
T
2

I enhanced the accepted answer for Python 3 so that it emits a JSON-style list when all children share the same tag. It also has an option to wrap the dict in the root tag or not.

from collections import OrderedDict
from typing import Union
from xml.etree.ElementTree import ElementTree, Element

def etree_to_dict(root: Union[ElementTree, Element], include_root_tag=False):
    root = root.getroot() if isinstance(root, ElementTree) else root
    result = OrderedDict()
    if len(root) > 1 and len({child.tag for child in root}) == 1:
        result[next(iter(root)).tag] = [etree_to_dict(child) for child in root]
    else:
        for child in root:
            result[child.tag] = etree_to_dict(child) if len(list(child)) > 0 else (child.text or "")
    result.update(('@' + k, v) for k, v in root.attrib.items())
    return {root.tag: result} if include_root_tag else result

import xml.etree.ElementTree as ET

d = etree_to_dict(ET.parse('data.xml'), True)
Teacake answered 9/9, 2021 at 9:46 Comment(2)
A bit overcomplicated, but good. I didn't check ElementTree, but in lxml an element object can be treated as a sequence and already has a length. For example, instead of children = list(root) followed by if len(children) > 1 and len({child.tag for child in children}) == 1, you could use if len(root) > 1 and len({child.tag for child in root}) == 1. - Anchorage
Thanks @SergeyNudnov! I updated my code. - Teacake
N
1

Here is a simple data structure in xml (save as file.xml):

<?xml version="1.0" encoding="UTF-8"?>
<Data>
  <Person>
    <First>John</First>
    <Last>Smith</Last>
  </Person>
  <Person>
    <First>Jane</First>
    <Last>Doe</Last>
  </Person>
</Data>

Here is the code to create a list of dictionary objects from it.

from lxml import etree
tree = etree.parse('file.xml')
root = tree.getroot()
datadict = []
for item in root:
    d = {}
    for elem in item:
        d[elem.tag]=elem.text
    datadict.append(d)

datadict now contains:

[{'First': 'John', 'Last': 'Smith'}, {'First': 'Jane', 'Last': 'Doe'}]

and can be accessed like so:

datadict[0]['First']
'John'
datadict[1]['Last']
'Doe'
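
If the records contain nested child elements, the flat loop above stores only the (mostly whitespace) text of the parent element. One possible way to handle this is to recurse whenever an element has children. This is only a sketch: it uses the standard library's xml.etree.ElementTree, the helper name element_to_value and the sample XML are made up for illustration, and repeated tags inside one parent would still overwrite each other.

```python
import xml.etree.ElementTree as ET

def element_to_value(elem):
    """Return the text of a leaf element, or a dict of its children."""
    children = list(elem)
    if not children:
        return elem.text
    return {child.tag: element_to_value(child) for child in children}

xml = """<Data>
  <Person>
    <First>John</First>
    <extra><details1><married>yes</married><status>rich</status></details1></extra>
  </Person>
  <Person>
    <First>Jane</First>
    <extra><details2><property>yes</property></details2></extra>
  </Person>
</Data>"""

root = ET.fromstring(xml)
datadict = [element_to_value(person) for person in root]
# datadict[0]['extra']['details1']['status'] is 'rich'
```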
Nineteenth answered 18/1, 2017 at 18:16 Comment(2)
If there are nested child tags, how can we do this? - Lactometer
Consider something like this: <?xml version="1.0" encoding="UTF-8"?> <Data> <Person> <First>John</First> <Last>Smith</Last> <extra> <details1> <married>yes</married> <status>rich</status> </details1> </extra> </Person> <Person> <First>Jane</First> <Last>Doe</Last> <extra> <details1> <married>yes</married> <status>rich</status> </details1> <details2> <property>yes</property> </details2> </extra> </Person> </Data> - Lactometer
P
1

You can use this snippet, which converts flat XML directly to a dictionary:

import xml.etree.ElementTree as ET

xml = ('<xml>' +
       '<first_name>Dean Christian</first_name>' +
       '<middle_name>Christian</middle_name>' +
       '<last_name>Armada</last_name>' +
       '</xml>')
root = ET.fromstring(xml)

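# Iterate over the element itself rather than the private _children attribute,
# which current Python 3 builds do not expose
x = {child.tag: child.text for child in root}
# returns {'first_name': 'Dean Christian', 'middle_name': 'Christian', 'last_name': 'Armada'}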
Pipes answered 10/5, 2017 at 11:7 Comment(0)
B
0
from lxml import etree, objectify
def formatXML(parent):
    """
    Recursive operation which returns a tree formatted
    as dicts and lists.
    A list is used when the word 'List' appears
    in the parent tag.
    """
    ret = {}
    if parent.items(): ret.update(dict(parent.items()))
    if parent.text: ret['__content__'] = parent.text
    if ('List' in parent.tag):
        ret['__list__'] = []
        for element in parent:
            ret['__list__'].append(formatXML(element))
    else:
        for element in parent:
            ret[element.tag] = formatXML(element)
    return ret
Burnette answered 15/8, 2012 at 7:57 Comment(0)
C
0

Building on @larsmans's answer: if the resulting keys contain XML namespace info, you can remove it before writing to the dict. Set a variable xmlns to the namespace (including the braces) and strip that prefix from each tag.

xmlns = '{http://foo.namespaceinfo.com}'

def etree_to_dict(t):
    # Remove the namespace prefix; str.lstrip would strip a set of characters, not a prefix
    tag = t.tag[len(xmlns):] if t.tag.startswith(xmlns) else t.tag
    d = {tag: [etree_to_dict(child) for child in t]}
    d.update(('@' + k, v) for k, v in t.attrib.items())
    d['text'] = t.text
    return d
Carruth answered 27/1, 2016 at 14:58 Comment(0)
L
0

If you have a schema, the xmlschema package already implements multiple XML-to-dict converters that honor the schema and attribute types. Quoting the following from the docs:

Available converters

The library includes some converters. The default converter xmlschema.XMLSchemaConverter is the base class of other converter types. Each derived converter type implements a well-known convention for converting XML to the JSON data format:

  • xmlschema.ParkerConverter: Parker convention
  • xmlschema.BadgerFishConverter: BadgerFish convention
  • xmlschema.AbderaConverter: Apache Abdera project convention
  • xmlschema.JsonMLConverter: JsonML (JSON Mark-up Language) convention

Documentation of these different conventions is available here: http://wiki.open311.org/JSON_and_XML_Conversion/

Usage of the converters is straightforward, e.g.:

from xmlschema import ParkerConverter, XMLSchema, to_dict

xml = '...'
schema = XMLSchema('...')
to_dict(xml, schema=schema, converter=ParkerConverter)
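
To give a feel for what such a convention prescribes, here is a rough hand-rolled sketch of the core BadgerFish rules (attributes stored under '@name', text under '$', repeated siblings collected into a list) using only the standard library. It is not how xmlschema implements its converter, it ignores namespaces and schema types, and the function name badgerfish is hypothetical.

```python
import xml.etree.ElementTree as ET

def badgerfish(elem):
    """Illustrative, simplified BadgerFish-style conversion (no namespace handling)."""
    body = {'@' + k: v for k, v in elem.attrib.items()}
    if elem.text and elem.text.strip():
        body['$'] = elem.text.strip()
    for child in elem:
        value = badgerfish(child)[child.tag]
        if child.tag in body:
            # Second and later siblings with the same tag turn the entry into a list
            if not isinstance(body[child.tag], list):
                body[child.tag] = [body[child.tag]]
            body[child.tag].append(value)
        else:
            body[child.tag] = value
    return {elem.tag: body}

result = badgerfish(ET.fromstring('<alice charlie="david"><bob>1</bob><bob>2</bob></alice>'))
# result == {'alice': {'@charlie': 'david', 'bob': [{'$': '1'}, {'$': '2'}]}}
```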
Lobster answered 9/9, 2022 at 14:7 Comment(0)
