I am looking for a way to add machine-readable metadata to DOCX reports I generate. The goal is to allow users to modify the document's styles and re-upload it to the system while preserving the metadata.
In my first attempt, I naively tried storing the metadata in comments, but I noticed that some editors, specifically Microsoft Word, remove my comments and generate a DOCX file without them after modification.
I also experimented with Structured Document Tags, but both Google Docs and Microsoft Word remove them after the styles are modified.
Lastly, I tried using custom XML, but both Google Docs and Microsoft Word stripped the attributes and tags I added.
I have searched extensively but couldn't find any solution that works. Has anyone dealt with a similar issue and can share some advice?
PS1
Because there are too many lines even in small DOCX files, I created a minimalistic repo to better show what I’ve tried so far. Each attempt is placed in a separate directory. Every directory contains:
Repo: https://github.com/kishieel/docx-metadata
In the first attempt, I added metadata using comments. This worked well with Google Docs, where comments were preserved even when the text was moved using cut and paste. However, Microsoft Word removed all the comments. Maybe Word expects a different way of creating comments?
Example input:
<!-- 1_comments/Document/word/document.xml -->
<w:document ...>
  <w:body>
    <w:p>
      <w:commentRangeStart w:id="0" />
      <w:r>
        <w:t xml:space="preserve">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris quis mollis tellus. Aenean at maximus nunc.</w:t>
      </w:r>
      <w:commentRangeEnd w:id="0" />
    </w:p>
  </w:body>
</w:document>
<!-- 1_comments/Document/word/comments.xml -->
<w:comments ...>
  <w:comment w:id="0" w:date="2025-04-07T09:10:21.783Z">
    <w:p>
      <w:r>
        <w:t xml:space="preserve">Some metadata #1</w:t>
      </w:r>
    </w:p>
  </w:comment>
</w:comments>
In the second approach, I tried using SDTs. In this case, Microsoft Word preserved them (though it split each sentence into separate words, which might be the default behavior or a sign that something went wrong). Google Docs removed them entirely from the modified file.
Example input:
<!-- 2_structured_document_tags/Document/word/document.xml -->
<w:document ...>
  <w:body>
    <w:p>
      <w:sdt>
        <w:sdtPr>
          <w:tag w:val="Some metadata #1" />
          <w:alias w:val="Some alias #1" />
        </w:sdtPr>
        <w:sdtContent>
          <w:r>
            <w:t xml:space="preserve">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris quis mollis tellus. Aenean at maximus nunc.</w:t>
          </w:r>
        </w:sdtContent>
      </w:sdt>
    </w:p>
  </w:body>
</w:document>
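To check after a round trip whether the tags survived, one option is to scan word/document.xml for <w:sdt> elements. Below is a minimal Python sketch using only the standard library; the function name sdt_tags and the inline SAMPLE document are illustrative assumptions, not part of the repo. Joining every <w:t> inside the content means the text is recovered even when an editor has split it into several runs.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def sdt_tags(document_xml):
    """Return (tag value, text) pairs for every <w:sdt> that has a <w:tag>."""
    root = ET.fromstring(document_xml)
    found = []
    for sdt in root.iter(W + "sdt"):
        tag = sdt.find(W + "sdtPr/" + W + "tag")
        content = sdt.find(W + "sdtContent")
        if tag is None or content is None:
            continue
        # Concatenate all runs so split runs still yield the full text.
        text = "".join(t.text or "" for t in content.iter(W + "t"))
        found.append((tag.get(W + "val"), text))
    return found


# Hypothetical sample: one SDT whose content was split into two runs.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:sdt>
        <w:sdtPr><w:tag w:val="Some metadata #1" /></w:sdtPr>
        <w:sdtContent>
          <w:r><w:t xml:space="preserve">Lorem </w:t></w:r>
          <w:r><w:t xml:space="preserve">ipsum</w:t></w:r>
        </w:sdtContent>
      </w:sdt>
    </w:p>
  </w:body>
</w:document>"""

if __name__ == "__main__":
    print(sdt_tags(SAMPLE))
```

For a real file, the XML could be read with zipfile.ZipFile(path).read("word/document.xml") and passed to the same function.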
I will provide the custom XML example when it's ready.
I don't have recent real-world experience of preserving metadata across the 3 things you mention (Microsoft Word, Google Docs and LibreOffice Writer), but I tried the various approaches to storing material that I know of in Word.
For testing I have been using Microsoft 365 MSO (Version 2503 Build 16.0.18623.20116) 64-bit on Windows 10, LibreOffice Writer Version: 24.2.7.2 (X86_64) / LibreOffice Community Build ID: 420(Build:2) on Linux, and the current free version of Google Docs (I do not know if Google has a more capable pay-for version). I create documents directly in Word, using VBA as necessary, e.g. to add a Custom XML Part or Document Variable. I never tried saving or downloading in any format other than .docx. There is plenty of documentation on the .docx format and some quite good MS documentation on the Word implementation, but I haven't looked for equivalent documentation for Google Docs and LibreOffice. You probably do need to know what features MS, LibreOffice and Google "officially support".
The sort of things available in Word for storing metadata are either "document wide" or "associated with a location in the document". For "document wide" there are
- { DOCVARIABLE } fields
- { DOCPROPERTY } fields.
For "associated with a location in the text", you could in principle use at least the following:
- { DOCPROPERTY } fields with Custom Document Properties (visible)
- { DOCVARIABLE } fields with Document Variables (visible)
- { SET } fields (can be visible, depending on the user's settings and actions), e.g. { SET ABookMarkName "to some metadata" }
I know you tried Comments, and I think they could work despite what you found, but I don't think they are really that easy to keep hidden. Nor are Hidden text, footnotes, endnotes, or Content Controls, so I haven't really pursued any of those. In addition, a lot of footnotes or endnotes tend to interfere with document layout.
LibreOffice successfully round-tripped most of the things I tried in Word. It did however pop up a "some of this stuff may not save correctly" type of box when saving.
Google Docs lost most of the things I tried, but does preserve at least Comments, Custom Document Properties, and even the { DOCPROPERTY } fields you need if you want to insert those values in the document. It removed:
- { DOCVARIABLE } field codes
- { SET } field codes
which suggests to me that the only thing that has a good chance of working with Google Docs is Custom Document Properties. They do have limitations (I think there is a limit to the number you can have, and to either the length per Property or the total length).
For "document level" metadata, you might need to split your data up into smaller chunks.
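Splitting into chunks could look roughly like the following Python sketch. The 255-character limit is only an assumption taken from the figure mentioned elsewhere in this answer, and the "name_index" naming scheme and both function names are made up for illustration.

```python
def split_into_properties(name, value, max_len=255):
    """Split one long value into numbered properties of at most max_len chars.

    max_len=255 is an assumed limit, not a verified one.
    """
    chunks = [value[i:i + max_len] for i in range(0, len(value), max_len)] or [""]
    return {f"{name}_{index}": chunk for index, chunk in enumerate(chunks)}


def join_properties(name, props):
    """Reassemble the value from name_0, name_1, ... until a gap is found."""
    parts = []
    index = 0
    while f"{name}_{index}" in props:
        parts.append(props[f"{name}_{index}"])
        index += 1
    return "".join(parts)


if __name__ == "__main__":
    props = split_into_properties("meta", "x" * 600)
    print({key: len(chunk) for key, chunk in props.items()})
    print(join_properties("meta", props) == "x" * 600)
```

Each resulting key/value pair would then be written as its own Custom Document Property, and the joining step run on re-upload.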
For "positional" metadata, there could well be a problem with these maxima. Even if not, marking the position using the appropriate { DOCPROPERTY } field means you display the property value. If you don't want to do that, you would probably have to do something like use { DOCPROPERTY mymark } to mark the location, with the property value set to a single space, so you just have a result with a single space.
In Word it is possible to use the fact that Word is not very picky about extra information in field codes, so you could have a single blank Property called blank and a field code { DOCPROPERTY blank myprop }, but unfortunately Google Docs will remove the "myprop" part.
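In the .docx markup, a simple field can be written as a <w:fldSimple> element whose w:instr attribute holds the field code. The Python sketch below builds such a marker for { DOCPROPERTY mymark } with a single space as the cached result. The helper name docproperty_field is hypothetical, and this only shows the simple-field form; an editor may rewrite the field in a different form when it saves, which I have not checked.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"
W = "{" + W_NS + "}"

# Serialize with the conventional "w:" prefix instead of "ns0:".
ET.register_namespace("w", W_NS)


def docproperty_field(prop_name, result=" "):
    """Build a <w:fldSimple> for { DOCPROPERTY prop_name } with a cached result."""
    field = ET.Element(W + "fldSimple", {W + "instr": f" DOCPROPERTY {prop_name} "})
    run = ET.SubElement(field, W + "r")
    # xml:space="preserve" keeps the single-space result from being trimmed.
    text = ET.SubElement(run, W + "t", {"{" + XML_NS + "}space": "preserve"})
    text.text = result
    return field


if __name__ == "__main__":
    print(ET.tostring(docproperty_field("mymark"), encoding="unicode"))
```

The element would be placed inside a <w:p> at the position to be marked, alongside a matching "mymark" entry in docProps/custom.xml.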
And that's about it.
Just to cover some of the points I originally made in the Comments:
In your "comments_1" example, the reason why no comments appear in Word in the initial version (the XML code you posted and the related Document.docx) is that Word needs a <w:commentReference> element for the Comments to show up in the UI.
For example, if you change the markup you posted in your Question to this and recreate the .docx, you should see the first comment when you open the .docx in Word.
<w:document ...>
  <w:body>
    <w:p>
      <w:commentRangeStart w:id="0" />
      <w:r>
        <w:t xml:space="preserve">Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris quis mollis tellus. Aenean at maximus nunc.</w:t>
      </w:r>
      <w:commentRangeEnd w:id="0" />
      <w:r>
        <w:rPr>
          <w:rStyle w:val="CommentReference"/>
        </w:rPr>
        <w:commentReference w:id="0"/>
      </w:r>
    </w:p>
  </w:body>
</w:document>
(You don't have to have the <w:rPr> element but Word inserts one.)
The version of LibreOffice I have here does display the comments even without the <w:commentReference> element.
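If the documents are generated programmatically, the missing <w:commentReference> runs could be added in a post-processing step. The following Python sketch does this with the standard library only; the function name add_comment_references and the SAMPLE input are illustrative assumptions. Note that ElementTree rewrites namespace prefixes it does not know about, so for a real document.xml (which declares many namespaces) this is a starting point rather than a drop-in tool.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Keep the conventional "w:" prefix when serializing.
ET.register_namespace("w", W_NS)


def add_comment_references(document_xml):
    """Insert a run with <w:commentReference> after each unreferenced
    <w:commentRangeEnd> and return the updated XML as a string."""
    root = ET.fromstring(document_xml)
    referenced = {ref.get(W + "id") for ref in root.iter(W + "commentReference")}
    for parent in list(root.iter()):
        # Walk children backwards so inserts do not shift pending indexes.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag != W + "commentRangeEnd":
                continue
            comment_id = child.get(W + "id")
            if comment_id in referenced:
                continue
            run = ET.Element(W + "r")
            ET.SubElement(run, W + "commentReference", {W + "id": comment_id})
            parent.insert(index + 1, run)
    return ET.tostring(root, encoding="unicode")


# Hypothetical input mirroring the markup from the question.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:commentRangeStart w:id="0" />
      <w:r><w:t xml:space="preserve">Lorem ipsum</w:t></w:r>
      <w:commentRangeEnd w:id="0" />
    </w:p>
  </w:body>
</w:document>"""

if __name__ == "__main__":
    print(add_comment_references(SAMPLE))
```

The sketch leaves out the optional <w:rPr> with the CommentReference style, since Word inserts it itself.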
As I originally mentioned in a comment to your question, the reason your 1_comments .docx does not open is that the docProps/custom.xml file contains two elements with the same FMTID and name (which is not allowed). The property value is also nearly 600 characters long; although I thought Word had a limit of 255 characters for a custom document property, it doesn't seem to error or truncate.
So here, I also changed custom.xml to the following to fix this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
            xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="xbrl">
    <vt:lpwstr>abc</vt:lpwstr>
  </property>
</Properties>
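A generator could guard against this kind of duplication before zipping the .docx. Below is a small Python sketch that reports repeated name and pid values in a custom.xml string. The function name duplicate_properties and the BROKEN sample are assumptions for illustration; the sample is not a copy of the file from the repo.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"


def duplicate_properties(custom_xml):
    """Return the name and pid values that occur more than once."""
    root = ET.fromstring(custom_xml)
    props = root.findall("{" + CP_NS + "}property")
    names = Counter(p.get("name", "") for p in props)
    pids = Counter(p.get("pid", "") for p in props)
    return {
        "names": sorted(n for n, count in names.items() if count > 1),
        "pids": sorted(p for p, count in pids.items() if count > 1),
    }


# Hypothetical broken input: the same property appears twice.
BROKEN = """<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
            xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="xbrl">
    <vt:lpwstr>abc</vt:lpwstr>
  </property>
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="xbrl">
    <vt:lpwstr>def</vt:lpwstr>
  </property>
</Properties>"""

if __name__ == "__main__":
    print(duplicate_properties(BROKEN))
```

An empty list under both keys would mean no duplicates were found.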