Version-controlling zipped files (docx, odt)
V

4

21

There are formats that are actually zip files in disguise, e.g. docx or odt. If I store them directly in version control, they are handled as binary files. My ideal solution would be

  • have a hook that creates a foo.docx/ directory for each foo.docx file before commit, unzipping all files into it
  • optionally, have a hook that reindents the xml files
  • have a hook that recreates foo.docx from the stored files after update

I don't want the docx files themselves to be version-controlled. (I am aware of a related question where a different approach with a custom diff was suggested.)

Is this doable? Is this doable with mercurial?

UPDATE:

I know about hooks. I am interested in the specifics. Here is a session to demonstrate the expected behavior.

> hg add foo.docx
> hg status
A foo.docx
> hg commit
> # Change foo.docx with external editor
> hg status
M foo.docx
> hg diff
+++ foo.docx/word/document.xml
- <w:t>An idea</w:t>
+ <w:t>A much better idea</w:t>
Vidal answered 21/9, 2010 at 22:41 Comment(4)
git has the hook behavior that will allow this, but I don't know about hg. - Bereave
Regarding your second point: Be aware that these document formats (especially .xlsx and ODF) don't treat whitespace as specified by the XML standard but - mostly for practical purposes - preserve whitespace even if this is not indicated. Therefore reindenting a file might change its contents. - Milkandwater
Why exactly don't you want the zip-format files put into revision control? What is the problem you want to solve? - Lusatian
@Lusatian - I want to see meaningful changes. I don't want a huge repository just because I make small changes to a docx file every day. - Vidal
A
6

If you can get past the hurdle of successfully unzipping and zipping the OpenOffice documents, then you should be able to use the filter system we have in Mercurial. That lets you transform files on every read/write from/to the repository.

You will unfortunately have to do more than just unzip the foo.docx file. The problem is that a filter needs to generate a single file as output, so perhaps you can unzip foo.docx and then tar up the generated files. You'll then be versioning the tarball, which should work since a tarball is just an uncompressed concatenation of all the individual files with some meta information. Come to think of it, a simpler solution would be to zip the unpacked foo.docx file again but specify no compression. That should give results similar to using tar.
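As a rough illustration of the "zip again with no compression" idea, here is a minimal Python sketch using only the standard zipfile module. The function name rezip is made up for this example and is not part of Mercurial or any extension; how it would be wired into the filter system (for instance as a script that reads the file on stdin and writes the result to stdout) is left as an assumption.

```python
import io
import zipfile


def rezip(data: bytes, compression: int = zipfile.ZIP_STORED) -> bytes:
    """Rewrite a zip archive held in memory with a different compression method.

    With the default ZIP_STORED the members are stored uncompressed, so the
    XML inside appears as plain bytes. Passing zipfile.ZIP_DEFLATED goes the
    other way. `rezip` is a hypothetical helper name.
    """
    src = zipfile.ZipFile(io.BytesIO(data))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", compression=compression) as dst:
        # infolist() preserves the original member order.
        for info in src.infolist():
            dst.writestr(info, src.read(info), compress_type=compression)
    return out.getvalue()
```

Calling rezip(data) on the way into the repository and rezip(data, zipfile.ZIP_DEFLATED) on the way out would keep the working copy compressed while the stored revisions stay delta-friendly. Whether a given office suite accepts the re-zipped file still needs to be checked on real documents.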

Solving this problem is something I've wanted to do myself, so please report back by sending a mail to the Mercurial mailing list.

Americium answered 24/9, 2010 at 11:23 Comment(2)
Zipping with no compression seems to work both for odt and for docx files, thanks for the tip. - Vidal
The zipdoc extension unzips then zips with no compression, and vice versa. I am here to find out how to diff them, though. I am getting them reported as an undiffable binary. - Caspian
A
14

I was wondering the same thing, and just came across the ZipDoc extension/filter for Mercurial, which seems to do exactly this!

Haven't tried it yet, but it looks promising!

Abalone answered 17/6, 2011 at 12:8 Comment(3)
Do you need to hg rm and then re-add the file after you have installed the extension? Thanks! - Blackfoot
@Blackfoot Not sure; I didn't actually get around to trying it! Should be easy enough to test out in a test repo :-) - Abalone
Is there something similar for git? - Headlight
O
4

You can use a precommit hook to unzip, and an update hook to zip. See the definitive guide on how to use hooks.

Be careful about rename. If you rename foo.docx to bar.docx, your precommit hook will need to delete foo.docx/ and add bar.docx/.
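The two hook actions could look roughly like the Python sketch below. It only covers the unzip and zip steps; registering them as Mercurial hooks and handling renames is not shown. The function names unpack and repack and the directory naming are assumptions made for this example. One detail that is not from this thread: ODF packages expect the mimetype member to come first and to be stored uncompressed, so the sketch writes it that way.

```python
import os
import zipfile


def unpack(doc_path, out_dir):
    """Extract every member of doc_path into out_dir (the pre-commit step)."""
    with zipfile.ZipFile(doc_path) as zf:
        zf.extractall(out_dir)


def repack(src_dir, doc_path):
    """Rebuild doc_path from the files under src_dir (the post-update step)."""
    members = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            full = os.path.join(root, name)
            # Zip member names always use forward slashes.
            arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
            members.append((arcname, full))
    # Put 'mimetype' first (needed for ODF), then the rest in a stable order.
    members.sort(key=lambda m: (m[0] != "mimetype", m[0]))
    with zipfile.ZipFile(doc_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for arcname, full in members:
            stored = arcname == "mimetype"
            method = zipfile.ZIP_STORED if stored else zipfile.ZIP_DEFLATED
            zf.write(full, arcname, compress_type=method)
```

A hook script could call unpack("foo.docx", "foo.docx.d") before commit and repack("foo.docx.d", "foo.docx") after update; the ".d" suffix is hypothetical and only avoids a file and a directory sharing one name.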


UPDATE (sorry for giving an entry-level answer to a 1k-rep user)

If you want to use the unpacked docx for core hg operations like diff (status can work with the packed file), you'd have to go with an extension. I think you can take an approach similar to the keyword extension and wrap the repo object with your own.

I have written some extensions but not at that hard core level, so I can't provide more details.

If you want to get crazy you could even merge the unpacked files. But it's probably safer to treat the document as binary and use an external tool to diff and merge.
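For the external diff route, a small standalone tool is enough to see what changed inside two versions of a document. The Python sketch below is one possible shape, using the standard zipfile and difflib modules; the name diff_documents is invented for this example and it is not tied to any Mercurial API.

```python
import difflib
import zipfile


def diff_documents(old_doc, new_doc):
    """Return unified-diff lines for every member that differs between two archives.

    old_doc and new_doc may be file names or open binary file objects.
    """
    lines = []
    with zipfile.ZipFile(old_doc) as old, zipfile.ZipFile(new_doc) as new:
        old_names, new_names = set(old.namelist()), set(new.namelist())
        for name in sorted(old_names | new_names):
            a = old.read(name) if name in old_names else b""
            b = new.read(name) if name in new_names else b""
            if a == b:
                continue
            try:
                a_text = a.decode("utf-8").splitlines()
                b_text = b.decode("utf-8").splitlines()
            except UnicodeDecodeError:
                # Images and other binary parts cannot be shown line by line.
                lines.append(f"Binary member {name} differs")
                continue
            lines.extend(
                difflib.unified_diff(
                    a_text, b_text, f"a/{name}", f"b/{name}", lineterm=""
                )
            )
    return lines
```

Office XML is often written as one very long line, so the output is only readable if the XML has been reindented first, with the whitespace caveat from the comments on the question in mind.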

Osage answered 22/9, 2010 at 1:24 Comment(2)
I found out that at least OpenOffice is very picky about how the files are zipped. A simple unzip->zip cycle can be sufficient to corrupt an .od* file. - Lusatian
@Lusatian Do you have more info: what zip tool was used, what happened, etc.? - Caspian
B
2

I've been struggling with this exact problem for the last few days and have written a small .NET utility to extract and normalise Excel files in such a way that they're much easier to store in source control. I've published the executable here:

https://bitbucket.org/htilabs/ooxmlunpack/downloads/OoXmlUnpack.exe

...and the source here:

https://bitbucket.org/htilabs/ooxmlunpack

If there's any interest I'm happy to make this more configurable, but at the moment, you should put the executable in a folder (e.g. the root of your source repository) and when you run it, it will:

  • Scan the folder and its subfolders for any .xlsx and .xlsm files
  • Take a copy of the file as *.orig
  • Unzip each file and re-zip it with no compression
  • Pretty-print any files in the archive which are valid XML
  • Delete the calcchain.xml file from the archive (since it changes a lot and doesn't affect the content of the file)
  • Inline any unformatted text values (otherwise these are kept in a lookup table which causes big changes in the internal XML if even a single cell is modified)
  • Delete the values from any cells which contain formulas (since they can just be calculated when the sheet is next opened)
  • Create a subfolder *.extracted, containing the extracted zip archive contents

Clearly not all of these things are necessary, but the end result is a spreadsheet file that will still open in Excel but which is much more amenable to diffing and incremental compression. Also, storing the extracted files as well makes it much more obvious in the version history what changes have been applied in each version.
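To give an idea of two of the steps listed above (re-zipping with no compression and pretty-printing the XML members), here is a minimal Python sketch. It is not the OoXmlUnpack implementation, which is a .NET tool; the function name normalise is made up, and the Excel-specific steps (calcchain removal, inlining strings, clearing formula values) are deliberately left out.

```python
import io
import zipfile
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def normalise(data: bytes) -> bytes:
    """Store all members uncompressed and reindent those that parse as XML."""
    src = zipfile.ZipFile(io.BytesIO(data))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            payload = src.read(info)
            try:
                doc = minidom.parseString(payload)
                payload = doc.toprettyxml(indent="  ", encoding="utf-8")
            except ExpatError:
                pass  # not well-formed XML: keep the member byte for byte
            dst.writestr(info, payload, compress_type=zipfile.ZIP_STORED)
    return out.getvalue()
```

Reindenting inserts whitespace between elements, which some office formats treat as significant, so the output should be opened in the target application before trusting it.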

If there's any appetite out there, I'm happy to make the tool more configurable since I guess not everyone will want the contents extracted, or possibly the values removed from formula cells, but these are both very useful to me at the moment.

In tests, a 2MB spreadsheet 'unpacks' to 21MB, but I was then able to store five versions of it, with small changes between each, in a 1.9MB Mercurial data file, and visualise the differences between versions effectively using Beyond Compare in text mode.

Burd answered 10/6, 2014 at 15:33 Comment(1)
The tool is working great. Unfortunately the VBA Project won't be extracted (will result in one file: vbaProject.bin). Do you know how to do that? - Chifley
