Is there a glossary of Word .docx XML tags?
I'm trying to create a parser to find the tracked changes and author of a Word .docx file...

I found the document.xml but there are so many tags! Is there a glossary somewhere to what all those tags stand for?

I'd like to avoid brute forcing my way through this if possible.

Careaga answered 12/10, 2017 at 16:6 Comment(0)
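C
2
"w:ins" marks content that was inserted while tracked changes were enabled.
"w:del" marks content that was deleted while tracked changes were enabled.
"w:commentRangeStart" marks the start of a commented range.
"w:commentRangeEnd" marks the end of that range.

Both "w:ins" and "w:del" carry a "w:author" attribute naming who made the change.

All visible text is found inside "w:t" tags; text inside a tracked deletion is stored in "w:delText" instead.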
Careaga answered 26/10, 2017 at 22:30 Comment(0)
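As a rough illustration of how those tags could be used to find tracked changes and their authors, here is a minimal Python sketch that relies only on the standard library. The function names `read_document_xml` and `tracked_changes` and the `SAMPLE` fragment are made up for this example (the fragment is hand-written, not real Word output). The sketch ignores moved text and formatting-only changes.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"  # ElementTree spells namespaced names as {uri}local


def read_document_xml(docx):
    """Return the raw bytes of word/document.xml from a .docx (a ZIP archive)."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read("word/document.xml")


def tracked_changes(document_xml):
    """Return (kind, author, text) tuples for w:ins / w:del, in document order."""
    root = ET.fromstring(document_xml)
    changes = []
    for elem in root.iter():
        if elem.tag == W + "ins":
            kind, text_tag = "ins", W + "t"
        elif elem.tag == W + "del":
            kind, text_tag = "del", W + "delText"
        else:
            continue
        text = "".join(t.text or "" for t in elem.iter(text_tag))
        # w:ins / w:del can also mark a paragraph mark and then hold no text;
        # this sketch simply skips those.
        if text:
            changes.append((kind, elem.get(W + "author"), text))
    return changes


# Hypothetical, hand-written fragment used only to exercise the function.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:ins w:id="1" w:author="Alice" w:date="2017-10-12T16:06:00Z">'
    '<w:r><w:t>brave </w:t></w:r></w:ins>'
    '<w:del w:id="2" w:author="Bob" w:date="2017-10-12T16:07:00Z">'
    '<w:r><w:delText>old </w:delText></w:r></w:del>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)

if __name__ == "__main__":
    for kind, author, text in tracked_changes(SAMPLE):
        print(kind, author, repr(text))
```

For a real file, something like `tracked_changes(read_document_xml("report.docx"))` should work, where the file name is only a placeholder.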
U
4

You can start with the Stack Overflow docx tag wiki itself.

.docx files (as well as other newer MS Office files such as .xlsx) use the OOXML format.

In particular:

Microsoft Office Open XML WordprocessingML is mostly standardized in ECMA-376 and ISO/IEC 29500.

You can get the relevant ECMA standard specification here: http://www.ecma-international.org/news/TC45_current_work/TC45_available_docs.htm

The specific document you are probably looking for is Office Open XML, Part 4: Markup Language Reference.

But of course, it is huge (5219 pages!).

I strongly recommend pinpointing the functionality you need and looking at existing open source libraries that already do part of that job.

Ulick answered 12/10, 2017 at 16:17 Comment(0)
D
1

The "Office Open XML" format and its XML vocabularies are described in detail in http://www.ecma-international.org/publications/standards/Ecma-376.htm .

To give you an idea, the following piece of XSLT should extract just the effective result text, without tracked deletions, from a WordprocessingML document such as the one stored under word/document.xml in a .docx file (which is a ZIP archive).

<!-- Match and output text spans except when
     appearing in w:delText child content -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="text"/>
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="w:delText"/>
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>

For your application to extract changes instead, you'd also have to take care of w:ins elements.
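For comparison, here is a rough Python counterpart to the stylesheet, extended to show why w:ins matters. It is only a sketch under a few assumptions: `text_view` and `SAMPLE` are names invented for this example, the fragment is hand-written, and moved text (w:moveFrom / w:moveTo) is not handled. With `accept_changes=True` it mirrors the XSLT (keep w:t, drop w:delText). With `accept_changes=False` it approximates the text before the changes, by skipping anything under w:ins and keeping w:delText.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def text_view(document_xml, accept_changes=True):
    """Concatenate run text as if all tracked changes were accepted or rejected."""
    root = ET.fromstring(document_xml)
    parts = []

    def walk(elem, inside_ins):
        # ElementTree has no parent pointers, so carry the w:ins context down.
        if elem.tag == W + "ins":
            inside_ins = True
        if elem.tag == W + "t":
            # Inserted text only counts when changes are accepted.
            if accept_changes or not inside_ins:
                parts.append(elem.text or "")
        elif elem.tag == W + "delText":
            # Deleted text only counts when changes are rejected, and never
            # when it sits inside an insertion that is itself rejected.
            if not accept_changes and not inside_ins:
                parts.append(elem.text or "")
        for child in elem:
            walk(child, inside_ins)

    walk(root, False)
    return "".join(parts)


# Hypothetical, hand-written fragment used only to exercise the function.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:ins w:id="1" w:author="Alice"><w:r><w:t>brave </w:t></w:r></w:ins>'
    '<w:del w:id="2" w:author="Bob"><w:r><w:delText>old </w:delText></w:r></w:del>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)

if __name__ == "__main__":
    print(text_view(SAMPLE, accept_changes=True))
    print(text_view(SAMPLE, accept_changes=False))
```

Like the stylesheet, this emits no paragraph breaks; a fuller version would probably add a newline at each w:p.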

Duplication answered 12/10, 2017 at 17:22 Comment(0)
A
1

You can use my docx4j webapp, specifically http://webapp.docx4java.org/OnlineDemo/PartsList.html

With that you can click on a tag and it will take you to the corresponding definition in the spec.

Assumptive answered 13/10, 2017 at 9:51 Comment(0)
