How can I get the text between tags using the Python SAX parser?

All I need is to get the text of the corresponding tag and persist it to a database. Since the XML file is big (4.5 GB), I'm using SAX. I used the characters method to get the text and put it in a dictionary. However, when I print the text in the endElement method, I get a newline instead of the text.

Here is my code:

def characters(self, content):
    text = unescape(content)
    self.map[self.tag] = text

def startElement(self, name, attrs):
    self.tag = name

def endElement(self, name):
    if name == "sometag":
        print(self.map[name])

Thanks in advance.

Cupo asked 14/2, 2010 at 20:11

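The SAX parser delivers the text inside a tag in chunks, so characters may be called multiple times for a single element. If each call overwrites the previous value, you end up with only the last chunk, which can be just a newline.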
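
To see the chunking for yourself, a minimal sketch along these lines records every call to characters. The ChunkRecorder class, the record_chunks helper, and the sample XML are made up for illustration. Where the text gets split depends on the parser and the input, so only the joined result is predictable.

```python
import xml.sax


class ChunkRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records every chunk passed to characters()."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # May be called several times for the text of a single element.
        self.chunks.append(content)


def record_chunks(xml_bytes):
    """Parse xml_bytes and return the list of chunks the parser reported."""
    handler = ChunkRecorder()
    xml.sax.parseString(xml_bytes, handler)
    return handler.chunks


chunks = record_chunks(b"<sometag>first line\nsecond &amp; third</sometag>")
print(chunks)           # the split points vary by parser
print("".join(chunks))  # the joined chunks are the element's full text
```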

You need to do something like:

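def startElement(self, name, attrs):
    # Start with an empty buffer for each new element.
    self.map[name] = ''
    self.tag = name

def characters(self, content):
    # Append rather than assign: this may be called several times per element.
    self.map[self.tag] += content

def endElement(self, name):
    # For an element that contains only text, the buffer is complete here.
    print(self.map[name])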
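
Put together as a self-contained handler, the idea might look like the sketch below. The TextCollector class, the collect_text helper, and the sample document are assumptions for illustration, and the results are collected in a list instead of printed so that a database write can be plugged in at that point. It only covers elements that contain text and no child elements. The parser already resolves entities such as &lt;, so no extra unescape call appears here.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: collects the full text of one leaf tag."""

    def __init__(self, wanted):
        xml.sax.ContentHandler.__init__(self)
        self.wanted = wanted   # tag name whose text we want
        self.parts = None      # list of chunks while inside the tag, else None
        self.results = []      # completed texts, one per occurrence

    def startElement(self, name, attrs):
        if name == self.wanted:
            self.parts = []    # start a fresh buffer for this element

    def characters(self, content):
        if self.parts is not None:
            self.parts.append(content)   # accumulate, never overwrite

    def endElement(self, name):
        if name == self.wanted and self.parts is not None:
            # All chunks for this element have arrived; join them once.
            self.results.append("".join(self.parts))
            self.parts = None  # ignore text outside the wanted tag


def collect_text(xml_bytes, wanted):
    """Parse xml_bytes and return the text of every `wanted` element."""
    handler = TextCollector(wanted)
    xml.sax.parseString(xml_bytes, handler)
    return handler.results


sample = (b"<root>\n<sometag>a &lt; b</sometag>\n"
          b"<sometag>two\nlines</sometag>\n</root>")
texts = collect_text(sample, "sometag")
print(texts)
```

For a file of several gigabytes, the same handler can be passed to xml.sax.parse together with a filename, which reads the input as a stream instead of loading it into memory.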
Sulfapyrazine answered 14/2, 2010 at 20:18 Comment(5)
Thanks! Is the code below a typo, or is it supposed to be like that? self.map[name] == '' - Cupo
Where can I find a reference for this behavior? "The text in the tag is chunked by the SAX processor. characters might be called multiple times." - Cupo
The behaviour is described in the docs: localhost/doc/python2.6-doc/html/library/… - Sulfapyrazine
Guess my local copy won't be very helpful... docs.python.org/library/… - Sulfapyrazine
From the SAX handler documentation: “The Parser will call this method to report each chunk of character data.” - Equable
