Extract DOCX Comments
Asked Answered
L

4

9

I'm a teacher. I want a list of all the students who commented on the essay I assigned, and what they said. The Drive API stuff was too challenging for me, but I figured I could download them as a zip and parse the XML.

The comments are tagged in w:comment tags, with w:t for the comment text and . It should be easy, but XML (etree) is killing me.

via the tutorial (and official Python docs):

z = zipfile.ZipFile('test.docx')
x = z.read('word/comments.xml')
tree = etree.XML(x)

Then I do this:

children = tree.getiterator()
for c in children:
    print(c.attrib)

Resulting in this:

{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}author': 'Joe Shmoe', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id': '1', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}date': '2017-11-17T16:58:27Z'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidR': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidDel': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidP': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRDefault': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr': '00000000'}
{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}

And after this I am totally stuck. I've tried element.get() and element.findall() with no luck. Even when I copy/paste the value ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'), I get None in return.

Can anyone help?

Lanai answered 20/11, 2017 at 11:26 Comment(2)
The text content of an element is in the text property. Does print(c.text) produce anything of interest?Goshawk
a = tree.get('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val') print(a.text) Results in AttributeError for c in children: print(c.text) Results in the comments! Do you know how I would access the the other fields?Lanai
D
13

You got remarkably far considering that OOXML is such a complex format.

Here's some sample Python code showing how to access the comments of a DOCX file via XPath:

from lxml import etree
import zipfile

ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

def get_comments(docxFileName):
  docxZip = zipfile.ZipFile(docxFileName)
  commentsXML = docxZip.read('word/comments.xml')
  et = etree.XML(commentsXML)
  comments = et.xpath('//w:comment',namespaces=ooXMLns)
  for c in comments:
    # attributes:
    print(c.xpath('@w:author',namespaces=ooXMLns))
    print(c.xpath('@w:date',namespaces=ooXMLns))
    # string value of the comment:
    print(c.xpath('string(.)',namespaces=ooXMLns))
Distinguished answered 20/11, 2017 at 15:35 Comment(2)
An answer and a compliment! Thank you!Lanai
On getting the text the comment relates to: 1) the hardway with accepted xpath solution is to read docxZip.read('word/document.xml'), then parse it to find comment anchors (w:commentRangeStart) by the comment's id...so much work for a one off solution :-) 2) the easy way is to use the example with win32 api, since we are talking not about server deployment, but a local desktop script for personal use to parse few school essays... Coding time matters. The win32 example just does it.Monition
H
12

Thank you @kjhughes for this amazing answer for extracting all the comments from the document file. I was facing same issue like others in this thread to get the text that the comment relates to. I took the code from @kjhughes as a base and try to solve this using python-docx. So here is my take at this.

Sample document. enter image description here

I will extract the comment and the paragraph which it was referenced in the document.

from docx import Document
from lxml import etree
import zipfile
ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
#Function to extract all the comments of document(Same as accepted answer)
#Returns a dictionary with comment id as key and comment string as value
def get_document_comments(docxFileName):
    comments_dict={}
    docxZip = zipfile.ZipFile(docxFileName)
    commentsXML = docxZip.read('word/comments.xml')
    et = etree.XML(commentsXML)
    comments = et.xpath('//w:comment',namespaces=ooXMLns)
    for c in comments:
        comment=c.xpath('string(.)',namespaces=ooXMLns)
        comment_id=c.xpath('@w:id',namespaces=ooXMLns)[0]
        comments_dict[comment_id]=comment
    return comments_dict
#Function to fetch all the comments in a paragraph
def paragraph_comments(paragraph,comments_dict):
    comments=[]
    for run in paragraph.runs:
        comment_reference=run._r.xpath("./w:commentReference")
        if comment_reference:
            comment_id=comment_reference[0].xpath('@w:id',namespaces=ooXMLns)[0]
            comment=comments_dict[comment_id]
            comments.append(comment)
    return comments
#Function to fetch all comments with their referenced paragraph
#This will return list like this [{'Paragraph text': [comment 1,comment 2]}]
def comments_with_reference_paragraph(docxFileName):
    document = Document(docxFileName)
    comments_dict=get_document_comments(docxFileName)
    comments_with_their_reference_paragraph=[]
    for paragraph in document.paragraphs:  
        if comments_dict: 
            comments=paragraph_comments(paragraph,comments_dict)  
            if comments:
                comments_with_their_reference_paragraph.append({paragraph.text: comments})
    return comments_with_their_reference_paragraph
if __name__=="__main__":
    document="test.docx"  #filepath for the input document
    print(comments_with_reference_paragraph(document))

Output for the sample document look like this enter image description here

I have done this at a paragraph level. This could be done at a python-docx run level as well. Hopefully it will be of help.

Hiatus answered 8/12, 2020 at 15:30 Comment(1)
This solution is superior solution since it has the reference text the comment was made on. Well done.Cochlea
V
5

I used Word Object Model to extract comments with replies from a Word document. Documentation on Comments object can be found here. This documentation uses Visual Basic for Applications (VBA). But I was able to use the functions in Python with slight modifications. Only issue with Word Object Model is that I had to use win32com package from pywin32 which works fine on Windows PC, but I'm not sure if it will work on macOS.

Here's the sample code I used to extract comments with associated replies:

    import win32com.client as win32
    from win32com.client import constants

    word = win32.gencache.EnsureDispatch('Word.Application')
    word.Visible = False 
    filepath = "path\to\file.docx"

    def get_comments(filepath):
        doc = word.Documents.Open(filepath) 
        doc.Activate()
        activeDoc = word.ActiveDocument
        for c in activeDoc.Comments: 
            if c.Ancestor is None: #checking if this is a top-level comment
                print("Comment by: " + c.Author)
                print("Comment text: " + c.Range.Text) #text of the comment
                print("Regarding: " + c.Scope.Text) #text of the original document where the comment is anchored 
                if len(c.Replies)> 0: #if the comment has replies
                    print("Number of replies: " + str(len(c.Replies)))
                    for r in range(1, len(c.Replies)+1):
                        print("Reply by: " + c.Replies(r).Author)
                        print("Reply text: " + c.Replies(r).Range.Text) #text of the reply
        doc.Close()
Valorize answered 16/7, 2018 at 21:18 Comment(3)
This should be the real answer, as the previous one doesn't deal with the referenced text.Enclose
But this is windows-focusedThoreau
@lucid_dreamer: Word Automation requires Microsoft Word and is unreliable in server deployments. If you need the referenced text in a purely python solution that doesn't require Word, ask a new question.Distinguished
E
0

If you want also the text the comments relates to :

def get_document_comments(docxFileName):
       comments_dict = {}
       comments_of_dict = {}
       docx_zip = zipfile.ZipFile(docxFileName)
       comments_xml = docx_zip.read('word/comments.xml')
       comments_of_xml = docx_zip.read('word/document.xml')
       et_comments = etree.XML(comments_xml)
       et_comments_of = etree.XML(comments_of_xml)
       comments = et_comments.xpath('//w:comment', namespaces=ooXMLns)
       comments_of = et_comments_of.xpath('//w:commentRangeStart', namespaces=ooXMLns)
       for c in comments:
          comment = c.xpath('string(.)', namespaces=ooXMLns)
          comment_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
          comments_dict[comment_id] = comment
       for c in comments_of:
          comments_of_id = c.xpath('@w:id', namespaces=ooXMLns)[0]
          parts = et_comments_of.xpath(
            "//w:r[preceding-sibling::w:commentRangeStart[@w:id=" + comments_of_id + "] and following-sibling::w:commentRangeEnd[@w:id=" + comments_of_id + "]]",
            namespaces=ooXMLns)
          comment_of = ''
          for part in parts:
             comment_of += part.xpath('string(.)', namespaces=ooXMLns)
             comments_of_dict[comments_of_id] = comment_of
        return comments_dict, comments_of_dict
Ensure answered 19/1, 2023 at 8:45 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Dogmatize

© 2022 - 2024 — McMap. All rights reserved.