Best place for unique ID in DOCX (Open XML WordprocessingDocument)

I am looking for a way to identify DOCX files after they have been moved or renamed. The reason: I am experimenting with the Open XML SDK and building a hyperlink checker.

It works well so far: it can add or update hyperlinks in a document.

The problem is that if I rename a linked file (source.docx links to target.docx, which is then renamed to targetB.docx), the link breaks. I can detect broken links simply by checking whether the linked file still exists at its given location.

But I want more. I want to be able to recover these lost links by searching all documents (docx) in a directory and checking whether one of them is the "target". The simplest way should be a GUID stored somewhere in the document properties, which does not change when the document is renamed or edited (a checksum is not applicable, because it changes with every edit).

I would then keep a separate list of links and their corresponding IDs, and if a document is renamed, I just update the link. I hope the concept is clear.

So there are a few basic questions:

  • Is there a "best practice" for storing this kind of custom information in an Open XML document?
  • Does a WordprocessingDocument (DOCX) already have a unique identifier created by Word?
  • Where would you save the mapping (GUID of the hyperlink target)?

I hope the question is clear. If not, leave a comment and I will try to clarify.

Thanks, Chris

Lauryn answered 14/3, 2009 at 4:59 Comment(0)

As this was asked five years ago, I'm hoping you found an answer. In case anyone else is interested: the best bet would be to create a custom document property, which is stored in the part docProps/custom.xml inside the DOCX (ZIP) archive, and keep your metadata in that. The easiest way to see how these work is to create one in the Word UI; you'll end up with a custom.xml inside the DOCX archive that looks something like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="test">
    <vt:lpwstr>chris</vt:lpwstr>
  </property>
</Properties>

How these work is all documented in ECMA-376, the standard that defines the file format.
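
For anyone who wants to read such a property back without the Open XML SDK, here is a minimal sketch in Python using only the standard library. It assumes the property is a string (`vt:lpwstr`) and that the part sits at its usual location, docProps/custom.xml; the function name `read_custom_property` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Conventional location of the custom properties part (an assumption;
# strictly speaking the package relationships decide where it lives).
CUSTOM_PART = "docProps/custom.xml"
NS = {
    "cp": "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties",
    "vt": "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes",
}


def read_custom_property(docx, name):
    """Return the string value of custom property `name`, or None if absent."""
    with zipfile.ZipFile(docx) as archive:
        if CUSTOM_PART not in archive.namelist():
            return None  # the document has no custom properties part
        root = ET.fromstring(archive.read(CUSTOM_PART))
    for prop in root.findall("cp:property", NS):
        if prop.get("name") == name:
            value = prop.find("vt:lpwstr", NS)
            return value.text if value is not None else None
    return None
```

With the custom.xml shown above, `read_custom_property("some.docx", "test")` would return `"chris"`. Writing a property is more involved, because a new custom.xml part also has to be registered in the package's content types and relationships, so that is probably better left to the SDK.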

As far as I know, Word does not store any GUIDs to uniquely identify a file.

Manzanilla answered 4/1, 2014 at 0:1 Comment(0)

Since Office 2013, MS Word generates a unique ID (GUID) when it creates a new document. It stores the ID in the file word/settings.xml, inside the <w:settings> element, as the w15:docId element.

For instance in MS Word 2016:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:settings xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" mc:Ignorable="w14 w15 w16se">
    <w15:docId w15:val="{982A3D80-A23D-4148-8230-4160F3D87FF5}"/>
</w:settings>

Note that MS Word doesn't change this ID when a copy of the file is made, so a copy carries the same ID as its original. It is therefore a reliable identifier only if each new file is created from scratch rather than copied from another.
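
If you just want to pull that value out of a file, a minimal sketch in Python (standard library only, not the Open XML SDK) could look like the following. The function name `read_doc_id` is made up, and the code assumes the settings part is at word/settings.xml as in the example above.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace bound to the w15 prefix in the settings.xml example above.
W15 = "http://schemas.microsoft.com/office/word/2012/wordml"


def read_doc_id(docx):
    """Return the w15:docId GUID string from word/settings.xml, or None."""
    with zipfile.ZipFile(docx) as archive:
        try:
            data = archive.read("word/settings.xml")
        except KeyError:
            return None  # no settings part in this archive
    root = ET.fromstring(data)
    node = root.find(f"{{{W15}}}docId")  # direct child of <w:settings>
    if node is None:
        return None  # e.g. a file that never received a docId
    return node.get(f"{{{W15}}}val")
```

For the settings.xml above this would return `"{982A3D80-A23D-4148-8230-4160F3D87FF5}"`. Files without the element give `None`, so a fallback such as a custom property is still needed for them.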

Flivver answered 20/8, 2019 at 8:10 Comment(0)

Acrobat/PDF has something similar. Look up Bates numbering, which is used to identify documents by stamping a unique number on them.

You should typically place this in the metadata section, if there is one. Or add a custom part to the docx file that keeps the mapping (while, of course, remaining within the bounds of the spec). (I am not very familiar with the docx format, so you'll have to figure this out.)
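
The mapping could also live outside the documents. As a rough sketch of the recovery idea from the question, in Python: index a folder by document ID, then look up the new path of any link target that has gone missing. Everything here is hypothetical, including the names `build_id_index` and `relocate`; `read_id` stands for whatever function extracts the ID from a file and is passed in by the caller.

```python
from pathlib import Path


def build_id_index(folder, read_id):
    """Map each document ID found under `folder` to the path that carries it.

    `read_id` is a caller-supplied function taking a path and returning the
    document's ID string, or None if the document has no ID.
    """
    index = {}
    for path in sorted(Path(folder).rglob("*.docx")):
        doc_id = read_id(path)
        if doc_id is not None:
            index.setdefault(doc_id, path)  # keep the first path on duplicate IDs
    return index


def relocate(link_ids, index):
    """Return {old target path: new path} for targets that no longer exist.

    `link_ids` maps each recorded link target path to the ID stored for it.
    """
    moved = {}
    for old_path, doc_id in link_ids.items():
        if not Path(old_path).exists() and doc_id in index:
            moved[old_path] = index[doc_id]
    return moved
```

Duplicate IDs (for example from copied files) are resolved here by simply keeping the first path in sorted order; a real tool would probably want to report them instead.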

Substrate answered 14/3, 2009 at 6:40 Comment(0)
