How to get results from the XML SAX parser in Python

I am working with the xml sax parser to parse XML files, and below is my code.

xml file code:

<job>
    <title>Registered Nurse-Epilepsy</title>
    <job-code>881723</job-code>
    <detail-url>http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance
    </detail-url>
    <job-category>Neuroscience Nursing</job-category>
    <description>
        <summary>
            <div class='descriptionheader'>Description</div><P STYLE="margin-top:0px;margin-bottom:0px"><SPAN STYLE="font-family:Arial;font-size:small">Utilizing the standards set forth for Nursing Practice by the ANA and ONS, the RN will organize, modify, evaluate, document and maintain the plan of care for Epilepsy and/or Neurological patients. It will include individualized, family centered, holistic, supportive, and safe age-specific care.</SPAN></P><div class='qualificationsheader'>Qualifications</div><UL STYLE="list-style-type:disc"> <LI>Graduate of an accredited school of Professional Nursing.</LI> <LI>BSN preferred </LI> <LI>Current licensure with the Board of Nurse Examiners for the State of Texas</LI> <LI>Experience in Epilepsy Monitoring and/or Neurological background preferred.</LI> <LI>ACLS preferred, within 6 months of hire</LI> <LI>PALS required upon hire</LI> </UL>
       </summary>
    </description>
    <posted-date>2012-07-26</posted-date>
    <location>
       <address>7777 Forest Lane</address>
       <city>Dallas</city>
       <state>TX</state>
       <zip>75230</zip>
       <country>US</country>
    </location>
    <company>
       <name>Medical City (Dallas, TX)</name>
      <url>http://www.hcanorthtexas.com/careers/search-jobs.dot</url>
    </company>
</job> 

Python code (partial, up to the startElement function, to clarify my doubt):

from xml.sax.handler import ContentHandler
import xml.sax
import xml.parsers.expat
import ConfigParser

class Exact(xml.sax.handler.ContentHandler):
  def __init__(self):
    self.curpath = []

  def startElement(self, name, attrs):
    print name,attrs
    self.clearFields()


  def endElement(self, name):
    pass

  def characters(self, data):
    self.buffer += data

  def clearFields(self):
    self.fields = {}
    self.fields['title'] = None
    self.fields['job-code'] = None
    self.fields['detail-url'] = None
    self.fields['job-category'] = None
    self.fields['description'] = None
    self.fields['summary'] = None
    self.fields['posted-date'] = None
    self.fields['location'] = None
    self.fields['address'] = None
    self.fields['city'] = None
    self.fields['state'] = None
    self.fields['zip'] = None
    self.fields['country'] = None
    self.fields['company'] = None
    self.fields['name'] = None
    self.fields['url'] = None
    
    self.buffer = ''
      
if __name__ == '__main__':
  parser = xml.sax.make_parser()
  handler = Exact()
  parser.setContentHandler(handler)
  parser.parse(open('/path/to/xml_file.xml'))

Result: the output of the print statement above is given below.

job     <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
title   <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-code <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
detail-url <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
job-category <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
description  <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
summary       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
posted-date   <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
location      <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
address       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
city          <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
state         <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
zip           <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
country       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
company       <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
name          <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>
url           <xml.sax.xmlreader.AttributesImpl instance at 0x2c0ba70>

As you can see above, the print statement gives me the name and attrs of each element, but what I actually want is the value of each element. How do I fetch the values for all of the tags above? I am getting only the node names, not their values.

Edited Code:

I am really confused about how to map the data from the nodes to the keys in the dictionary shown above.

Cabasset answered 4/9, 2012 at 11:59 Comment(2)
I assume that by "values" you mean the character content of the nodes, e.g. 'TX' for the 'state' element? - Adman
@Adman: Exactly. I need that data, and later I will store it in a dictionary. First of all, how do I get that data? - Cabasset
A
9

To get the content of an element, you need to override the characters method. Add this to your handler class:

def characters(self, data):
    print data

Be careful with this, though: the parser is not required to give you all the data in a single chunk. You should use an internal buffer and read it when needed. In most of my xml/sax code I do something like this:

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []

    def _flushCharBuffer(self):
        s = ''.join(self._charBuffer)
        self._charBuffer = []
        return s

    def characters(self, data):
        self._charBuffer.append(data)

... and then call the flush method on the end of elements where I need the data.
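
As a minimal sketch of that pattern (written in Python 3 syntax; the class name BufferedHandler, the texts list, and the inline sample document are made up for illustration), the buffer can be flushed in endElement so that each element's text arrives as one string no matter how the parser chunked it:

```python
import xml.sax
import xml.sax.handler


class BufferedHandler(xml.sax.handler.ContentHandler):
    """Collects (element name, text) pairs; all names here are illustrative."""

    def __init__(self):
        super().__init__()
        self._char_buffer = []
        self.texts = []

    def _flush_char_buffer(self):
        # Join all chunks received so far and reset the buffer.
        s = ''.join(self._char_buffer)
        self._char_buffer = []
        return s

    def startElement(self, name, attrs):
        # Discard any text (usually whitespace) seen before this start tag.
        self._flush_char_buffer()

    def characters(self, data):
        # May be called several times for a single element's text.
        self._char_buffer.append(data)

    def endElement(self, name):
        text = self._flush_char_buffer().strip()
        if text:
            self.texts.append((name, text))


handler = BufferedHandler()
xml.sax.parseString(
    b"<job><title>Registered Nurse-Epilepsy</title><state>TX</state></job>",
    handler)
print(handler.texts)
```

Flushing in startElement as well is a design choice of this sketch: it keeps the whitespace between tags from leaking into the next element's text.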

For your whole use case - assuming you have a file containing multiple job descriptions and want a list which holds the jobs with each job being a dictionary of the fields, do something like this:

class MyHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self._charBuffer = []
        self._result = []

    def _getCharacterData(self):
        data = ''.join(self._charBuffer)
        self._charBuffer = []
        return data.strip() #remove strip() if whitespace is important

    def parse(self, f):
        xml.sax.parse(f, self)
        return self._result

    def characters(self, data):
        self._charBuffer.append(data)

    def startElement(self, name, attrs):
        if name == 'job': self._result.append({})

    def endElement(self, name):
        # container elements such as <location> end up as empty strings
        if name != 'job': self._result[-1][name] = self._getCharacterData()

jobs = MyHandler().parse("job-file.xml") #a list of all jobs

If you just need to parse a single job at a time, you can simplify the list part and throw away the startElement method - just set _result to a dict and assign to it directly in endElement.
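
A rough sketch of that single-job variant, in Python 3 syntax. The class name SingleJobHandler, the parse_string helper, and the shortened inline sample are assumptions made for illustration, not part of the code above:

```python
import xml.sax
import xml.sax.handler


class SingleJobHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self._char_buffer = []
        self._result = {}  # one dict instead of a list of dicts

    def _get_character_data(self):
        data = ''.join(self._char_buffer).strip()
        self._char_buffer = []
        return data

    def characters(self, data):
        self._char_buffer.append(data)

    def endElement(self, name):
        # Container elements such as <location> end up as empty strings.
        if name != 'job':
            self._result[name] = self._get_character_data()

    def parse_string(self, xml_bytes):
        # Hypothetical helper: parses from memory instead of from a file.
        xml.sax.parseString(xml_bytes, self)
        return self._result


sample = b"""<job>
    <title>Registered Nurse-Epilepsy</title>
    <location>
        <city>Dallas</city>
        <state>TX</state>
    </location>
</job>"""

job = SingleJobHandler().parse_string(sample)
print(job['title'], job['city'], job['state'])
```

Note that nested field names are flattened into one dictionary, so two elements with the same name in different containers would overwrite each other.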

Adman answered 4/9, 2012 at 12:18 Comment(4)
Thank you very much. I am really stuck on how to map this data to the tags; please see my edited code above. - Cabasset
How do I map the results from the nodes to their respective tag names by creating a dictionary with node names as keys and node values as values? That is what I am actually trying to do. - Cabasset
Can you please look at my edited code above? I am really stuck on the mapping. - Cabasset
I was on my way back from work, give me a moment to edit my answer. - Adman
P
3

To get the text content of a node, you need to implement a characters method. E.g.

class Exact(xml.sax.handler.ContentHandler):
  def __init__(self):
    self.curpath = []

  def startElement(self, name, attrs):
    print name,attrs


  def endElement(self, name):
    print 'end ' + name

  def characters(self, content):
    print content

Would output:

job <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9baec>



title <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb0c>
Registered Nurse-Epilepsy
end title



job-code <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
881723
end job-code



detail-url <xml.sax.xmlreader.AttributesImpl instance at 0xb6d9bb2c>
http://search.careers-hcanorthtexas.com/s/Job-Details/Registered-Nurse-Epilepsy-Job/Medical-City/xjdp-cl289619-jf120-ct2181-jid4041800?s_cid=Advance



end detail-url

(snipped)

Papa answered 4/9, 2012 at 12:16 Comment(0)
A
2

You need to implement a characters handler too:

def characters(self, content):
    print content

but this potentially gives you text in chunks instead of as one block per tag.
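
To see the chunking for yourself, here is a small sketch in Python 3 syntax (the ChunkRecorder class and the sample text are made up for illustration). Where the boundaries fall depends on the underlying parser, so the number of chunks is not guaranteed; an entity reference is one place where a split commonly shows up. Joining the chunks always gives the full text:

```python
import xml.sax
import xml.sax.handler


class ChunkRecorder(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Record every call separately to make the chunking visible.
        self.chunks.append(content)


recorder = ChunkRecorder()
xml.sax.parseString(b"<name>Tom &amp; Jerry</name>", recorder)
print(recorder.chunks)           # chunk boundaries depend on the parser
print(''.join(recorder.chunks))  # the complete text of <name>
```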

Do yourself a big favour though and use the ElementTree API instead; that API is far more Pythonic and easier to use than the XML DOM API.

from xml.etree import ElementTree as ET

etree = ET.parse('/path/to/xml_file.xml')
jobtitle = etree.find('title').text  # <job> is the root element of the sample file

If all you want is a straight conversion to a dictionary, take a look at this handy ActiveState Python Cookbook recipe: Converting XML to dictionary and back. Note that it uses the ElementTree API as well.

If you have a set of existing elements you want to look for, just use these in the find() method:

fieldnames = [
    'title', 'job-code', 'detail-url', 'job-category', 'description',
    'summary', 'posted-date', 'location', 'address', 'city', 'state',
    'zip', 'country', 'company', 'name', 'url']
fields = {}

etree = ET.parse('/path/to/xml_file.xml')

for field in fieldnames:
    # './/' also matches nested elements such as <address> inside <location>
    elem = etree.find('.//' + field)
    if elem is not None and elem.text is not None:
        fields[field] = elem.text
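
If the file holds several <job> elements under a common root, the same lookup can be repeated per job. This is only a sketch under that assumption, in Python 3 syntax: the <jobs> wrapper, the second job, and the shortened FIELDNAMES list are made up for illustration:

```python
from xml.etree import ElementTree as ET

# Subset of the field names from the question, kept short for the example.
FIELDNAMES = ['title', 'job-code', 'city', 'state']

# Hypothetical input: a <jobs> root wrapping several <job> elements.
sample = """<jobs>
  <job>
    <title>Registered Nurse-Epilepsy</title>
    <job-code>881723</job-code>
    <location><city>Dallas</city><state>TX</state></location>
  </job>
  <job>
    <title>Staff Nurse</title>
    <location><city>Plano</city></location>
  </job>
</jobs>"""

root = ET.fromstring(sample)
jobs = []
for job in root.iter('job'):
    fields = {}
    for field in FIELDNAMES:
        # './/' searches all descendants of this <job>, not just its children.
        elem = job.find('.//' + field)
        if elem is not None and elem.text is not None:
            fields[field] = elem.text.strip()
    jobs.append(fields)

print(jobs)
```

Fields missing from a job are simply left out of its dictionary.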
Adriatic answered 4/9, 2012 at 12:16 Comment(16)
Martijn Pieters: I edited my code above, please have a look. After fetching the values, how do I map them into a dictionary with node names as keys and node values as values? - Cabasset
What is the data here, is it the content passed to the characters function? The fields in the dictionary are hard-coded because I will run this code against multiple XML URLs; if a field in the dictionary matches an XML node, the node's content must be mapped to that dictionary field. That is the overall idea. - Cabasset
Not sure what you are asking there, but the code above is easy enough to generalize. Note that SO is not a code writing service. :-) - Adriatic
I am sorry, and that's true, Martijn. I understand, and you helped a lot. What I am trying to do is declare some fields in a dictionary and, after fetching data using xml sax, map the fetched data to those fields whenever an XML node matches a field in the dictionary. - Cabasset
Yup, I understood, and I showed you a much easier way to do this with ElementTree instead. Sorry, I rarely if ever use the SAX interfaces. - Adriatic
OK, I will use etree then. For example, I need to parse the following URL, northshorelij.jobs/feed/xml, which has a huge amount of data, and map the nodes to their data. - Cabasset
Does etree help to parse these kinds of URLs? - Cabasset
@Kouripm: why don't you try it? etree is a pythonic API to parse XML, so it'll parse that URL. It certainly will be easier than using the SAX API. - Adriatic
Do yourself a big favour and learn how to use a SAX parser. Element Tree has its own uses, but this one is clearly a case for SAX. - Adman
@l4mpi: Are the files that big? There are incremental ElementTree parsers available too, I'd rather use those still. - Adriatic
You have a huge list of field names that gets searched for - this is completely unnecessary with a SAX parser. I've got a boilerplate SAX parser class which solves his problems in ~3 lines, without knowing the field names. Also, what do you do if he has multiple <job> elements in his file? - Adman
@l4mpi: The 'huge' requirement wasn't given in the OP's initial question, and I was trying to teach a man to fish, not just drown in an API that he doesn't understand. Sure, SAX has its place, but the initial (small) problem wasn't it. - Adriatic
@l4mpi: Also, meta note: the large list came from the OP's post as well; he wanted just a way to put things into a dict quickly. That's not a solution for the large file case, of course. - Adriatic
This is my point: for just putting things into a dict quickly, SAX is the way to go. Well, maybe I'm biased; I'd use SAX for everything but cases where DOM manipulation is required. - Adman
@l4mpi: Right. I've written DOM parsers, and I definitely prefer easier-to-use APIs when I can get away with them. :-) As this particular (bad) question progressed, it only gradually became clear SAX might be a better fit, but I might try lxml iterparse instead. - Adriatic
I would have recommended ElementTree to him too, but the question is about SAX parsing. Oh, 8 years ago... - Preceptive
