How to open this XML file to create dataframe in Python?
Asked Answered
I

3

9

Does anyone have a suggestion for the best way to open the xml data on the site below to put it in a dataframe (I prefer working with pandas) in python? The file is on the "Data - XML (sdmx/zip)" link on this site:

http://www.federalreserve.gov/pubs/feds/2006/200628/200628abs.html

I've tried using the following by copying from http://timhomelab.blogspot.com/2014/01/how-to-read-xml-file-into-dataframe.html, and it seems I'm getting close:

from lxml import objectify
import pandas as pd

path = 'feds200628.xml'
xml = objectify.parse(open(path))
root = xml.getroot()
root.getchildren()[0].getchildren()
df = pd.DataFrame(columns=('id', 'name'))

for i in range(0,4):
    obj = root.getchildren()[i].getchildren()
    row = dict(zip(['id', 'name'], [obj[0].text, obj[1].text]))
    row_s = pd.Series(row)
    row_s.name = i
    df = df.append(row_s)

Still, I don't know enough about xml to get me the rest of the way.

Any help would be awesome - I don't even need it to be in a dataframe, I just need to figure out how to parse this content in python somehow.

Implore answered 27/9, 2014 at 17:22 Comment(1)
Possible duplicate of How can I open an Excel file in Python?Etymon
A
10

XML is a tree-like structure, while a Pandas DataFrame is a 2D table-like structure. So there is no automatic way to convert between the two. You have to understand the XML structure and know how you want to map its data onto a 2D table. Thus, every XML-to-DataFrame problem is different.

Your XML has 2 DataSets, each containing a number of Series. Each Series contains a number of Obs elements.

Each Series has a NAME attribute, and each Obs has OBS_STATUS, TIME_PERIOD and OBS_VALUE attributes. So perhaps it would be reasonable to create a table with NAME, OBS_STATUS, TIME_PERIOD, and OBS_VALUE columns.

I found pulling the desired data out of the XML a bit complicated, which makes me doubtful that I've found the best way to do it. But here is one way (PS. Thomas Maloney's idea of starting with the 2D table-like XLS data should be way simpler):

import lxml.etree as ET
import pandas as pd

path = 'feds200628.xml'

def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    https://mcmap.net/q/196988/-using-python-iterparse-for-large-xml-files (unutbu)
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

data = list()
obs_keys = ['OBS_STATUS', 'TIME_PERIOD', 'OBS_VALUE']
columns = ['NAME'] + obs_keys

def process_obs(elem, name):
    dct = elem.attrib
    # print(dct)
    data.append([name] + [dct[key] for key in obs_keys])

def process_series(elem):
    dct = elem.attrib
    # print(dct)
    context = ET.iterwalk(
        elem, events=('end', ),
        tag='{http://www.federalreserve.gov/structure/compact/common}Obs'
        )
    fast_iter(context, process_obs, dct['SERIES_NAME'])

def process_dataset(elem):
    nsmap = elem.nsmap
    # print(nsmap)
    context = ET.iterwalk(
        elem, events=('end', ),
        tag='{{{prefix}}}Series'.format(prefix=elem.nsmap['kf'])
        )
    fast_iter(context, process_series)

with open(path, 'rb') as f:
    context = ET.iterparse(
        f, events=('end', ),
        tag='{http://www.federalreserve.gov/structure/compact/common}DataSet'
        )
    fast_iter(context, process_dataset)
    df = pd.DataFrame(data, columns=columns)

yields

            NAME OBS_STATUS TIME_PERIOD   OBS_VALUE
0        SVENY01          A  1961-06-14      2.9825
1        SVENY01          A  1961-06-15      2.9941
2        SVENY01          A  1961-06-16      3.0012
3        SVENY01          A  1961-06-19      2.9949
4        SVENY01          A  1961-06-20      2.9833
5        SVENY01          A  1961-06-21      2.9993
6        SVENY01          A  1961-06-22      2.9837
...
1029410     TAU2          A  2014-09-19  3.72896779
1029411     TAU2          A  2014-09-22  3.12836171
1029412     TAU2          A  2014-09-23  3.20146575
1029413     TAU2          A  2014-09-24  3.29972110
Augustinaaugustine answered 27/9, 2014 at 19:9 Comment(0)
U
6

I would export the XLS formatted file to a CSV file (using a freely available program like Gnumeric or LibreOffice, or if you have it, Excel), and then read the CSV file into pandas. I know this is not exactly an answer to your final question, but parsing XML is an overly complicated solution to what you're trying to do.

Regarding parsing XML in Python, the lxml library is my favorite library to use. I find using the XPath query language together with an lxml parser to be the best route.

Unashamed answered 27/9, 2014 at 18:51 Comment(0)
G
-1

This code it work to transform to df this type of Excel XML file:

import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = [  ]
        self.cells = [  ]
        self.rows = [  ]
        self.tables = [  ]
    def characters(self, content):
        self.chars.append(content)
    def startElement(self, name, atts):
        if name=="Cell":
            self.chars = [  ]
        elif name=="Row":
            self.cells=[  ]
        elif name=="Table":
            self.rows = [  ]
    def endElement(self, name):
        if name=="Cell":
            self.cells.append(''.join(self.chars))
        elif name=="Row":
            self.rows.append(self.cells)
        elif name=="Table":
            self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('feds200628.xls', excelHandler)
df1 = pd.DataFrame(excelHandler.tables[0][10:], columns=excelHandler.tables[0][9])
print df1.head()

I can't make comment (low reputation), but the answer of this question about "How to open Excel XML file programmatically" (with python and pandas) it should work.

Galengalena answered 7/11, 2015 at 14:31 Comment(2)
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient reputation you will be able to comment on any post.Yacov
@Yacov Ok, I added the code. I just changed the filename and the number of the row where the data start.Galengalena

© 2022 - 2024 — McMap. All rights reserved.