Uncompress OpenOffice files for better storage in version control

E

6

16

I've heard that OpenOffice (ODF) files are compressed zip archives of XML and other data. A tiny change to the document can therefore change most of the bytes in the file, so delta compression doesn't work well in version control systems.
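To make the concern concrete, here is a minimal sketch using Python's zlib module, which implements the same deflate method that zip archives normally use. The sample data and the `differing_bytes` helper are made up for illustration, and a position-by-position comparison is only a crude proxy for what a real delta algorithm sees.

```python
import zlib


def differing_bytes(a: bytes, b: bytes) -> int:
    """Count byte positions where a and b differ, plus any length difference."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


# A stand-in for an XML part such as content.xml (the markup is only illustrative).
words = " ".join(f"word{i}" for i in range(2000))
original = f"<text:p>{words}</text:p>".encode()

# A "tiny change": capitalise one word near the start of the text.
edited = original.replace(b"word10 ", b"WORD10 ", 1)

raw_diff = differing_bytes(original, edited)
packed_diff = differing_bytes(zlib.compress(original), zlib.compress(edited))

print("bytes differing uncompressed:", raw_diff)
print("bytes differing compressed:  ", packed_diff)
```

The uncompressed versions differ only in the edited word, while the compressed streams tend to diverge from the point of the edit onwards.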

I've done basic testing on an OpenOffice file, unzipping it and then rezipping it with zero compression. I used the Linux zip utility for my testing. OpenOffice will still happily open it.

So I'm wondering if it's worth developing a small utility to run on ODF files each time just before I commit to version control. Any thoughts on this idea? Possible better alternatives?

Secondly, what would be a good and robust way to implement this little utility? Bash shell that calls zip (probably Linux only)? Python? Any gotchas you can think of? Obviously I don't want to accidentally mangle a file, and there are several ways that could happen.

Possible gotchas I can think of:

Insufficient disk space
Some other permissions issue that prevents writing the file or temporary files
ODF document is encrypted (probably should just leave these alone; the encryption probably also causes large file changes and thus prevents efficient delta compression)
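One way to guard against these gotchas, sketched under some assumptions: write the new archive to a temporary file in the same directory, test it, and only then replace the original. The function names are hypothetical, and the encryption check is a heuristic that assumes encrypted ODF packages carry encryption-data entries in META-INF/manifest.xml.

```python
import os
import shutil
import tempfile
import zipfile


def looks_encrypted(path):
    """Heuristic: look for encryption-data entries in the ODF manifest."""
    with zipfile.ZipFile(path) as archive:
        try:
            manifest = archive.read("META-INF/manifest.xml")
        except KeyError:
            return False  # no manifest, so nothing says it is encrypted
    return b"encryption-data" in manifest


def store_uncompressed(path):
    """Rewrite path with every member stored; return False if it was skipped."""
    if looks_encrypted(path):
        return False  # leave encrypted documents alone

    # Same directory as the original, so the final rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    handle, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(handle)
    try:
        with zipfile.ZipFile(path) as source, \
                zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_STORED) as target:
            for info in source.infolist():
                data = source.read(info)
                info.compress_type = zipfile.ZIP_STORED
                target.writestr(info, data)

        # Verify the new archive before touching the original.
        with zipfile.ZipFile(tmp_path) as check:
            if check.testzip() is not None:
                raise zipfile.BadZipFile("rewritten archive failed its CRC check")

        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)       # swap in the new file in one step
    finally:
        # Only still present if something went wrong before the replace.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return True
```

If the disk fills up or the directory is not writable, the exception is raised while the original file is still untouched.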

Edward answered 10/6, 2009 at 12:1 Comment(0)

B

6

You could consider storing documents in FODT format, the flat XML format.
This is a relatively new alternative.

The document is simply stored as a single, uncompressed XML file.

More info is available at https://wiki.documentfoundation.org/Libreoffice_and_subversion.
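For existing documents, a conversion can probably be scripted. This is only a sketch: it assumes a LibreOffice `soffice` binary is on the PATH and relies on its headless `--convert-to` option. The function names are hypothetical.

```python
import subprocess


def build_convert_command(path, outdir=".", target="fodt"):
    """Build the soffice command line that converts path to the target format."""
    return ["soffice", "--headless", "--convert-to", target, "--outdir", outdir, path]


def convert_to_flat_xml(path, outdir="."):
    """Run the conversion; raises CalledProcessError if soffice reports failure."""
    subprocess.run(build_convert_command(path, outdir), check=True)
```

For example, `convert_to_flat_xml("report.odt", "flat")` should leave the flat XML version in the `flat` directory; for spreadsheets, pass `target="fods"` to `build_convert_command`.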

Boney answered 10/3, 2015 at 4:19 Comment(1)

Using the *.fodt and *.fods formats for the documents is the easiest way to keep LibreOffice Calc and Writer files in version control. No need for any utilities or fancy commit hooks, and the benefits of plain text version control are all there. – Gies 20/6, 2016 at 20:54

I

14

First, the version control system you want to use should support hooks that transform a file between the version stored in the repository and the one in the working area, for example the clean / smudge filters that Git configures through gitattributes.

Second, instead of writing such a filter yourself, you can find an existing one, for example rezip from the "Management of opendocument (openoffice.org) files in git" thread on the git mailing list (but see the warning in "Followup: management of OO files - warning about "rezip" approach").

You can also browse answers in "Tracking OpenOffice files/other compressed files with Git" thread, or try to find the answer inside "[PATCH 2/2] Add keyword unexpansion support to convert.c" thread.
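To illustrate the clean-filter idea from the first point, here is a minimal sketch of a filter that reads a zip archive on stdin and writes a stored (uncompressed) copy to stdout. It is not the rezip script from the mailing list; the script name `rezip_filter.py`, the filter name `rezip` and the `clean` argument are assumptions made for this example.

```python
import io
import sys
import zipfile

# Assumed setup (hypothetical names):
#   .gitattributes:  *.odt filter=rezip
#   git config filter.rezip.clean "python3 /path/to/rezip_filter.py clean"


def rezip_stored(data: bytes) -> bytes:
    """Return a copy of the zip archive in data with every member stored."""
    source = zipfile.ZipFile(io.BytesIO(data))
    buffer = io.BytesIO()
    with source, zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as target:
        for info in source.infolist():
            payload = source.read(info)
            info.compress_type = zipfile.ZIP_STORED
            target.writestr(info, payload)
    return buffer.getvalue()


def main() -> None:
    # Git pipes the working-tree file in and stores whatever comes out.
    sys.stdout.buffer.write(rezip_stored(sys.stdin.buffer.read()))


if __name__ == "__main__" and sys.argv[1:] == ["clean"]:
    main()
```

With only a clean filter configured, a fresh checkout contains the stored version of the file, which OpenOffice opens without complaint according to the question.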

Hope That Helps

Increasing answered 10/6, 2009 at 14:23 Comment(1)

Terrific information. I'm most interested in Subversion and Mercurial at the moment. I don't think Subversion has clean/smudge type feature. No idea for Mercurial—I'm relatively new to that. – Edward 10/6, 2009 at 16:16

I

3

I've modified the Python program in Craig McQueen's answer just a bit. Changes include:

Actually checking the return value of testzip (according to the docs, it appears that the original program will happily proceed past the check step with a corrupt zip file).
Rewriting the for-loop that checks for already-uncompressed files as a single if-statement.

Here is the new program:

#!/usr/bin/python
# Note, written for Python 2.6

import sys
import shutil
import zipfile

# Get a single command-line argument containing filename
commandlineFileName = sys.argv[1]

backupFileName = commandlineFileName + ".bak"
inFileName = backupFileName
outFileName = commandlineFileName
checkFilename = commandlineFileName

# Check input file
# First, check it is valid (not corrupted)
checkZipFile = zipfile.ZipFile(checkFilename)

if checkZipFile.testzip() is not None:
    raise Exception("Zip file is corrupted")

# Second, check that it's not already uncompressed
if all(f.compress_type==zipfile.ZIP_STORED for f in checkZipFile.infolist()):
    raise Exception("File is already uncompressed")

checkZipFile.close()

# Copy to "backup" file and use that as the input
shutil.copy(commandlineFileName, backupFileName)
inputZipFile = zipfile.ZipFile(inFileName)

outputZipFile = zipfile.ZipFile(outFileName, "w", zipfile.ZIP_STORED)

# Copy each input file's data to output, making sure it's uncompressed
for fileObject in inputZipFile.infolist():
    fileData = inputZipFile.read(fileObject)
    outFileObject = fileObject
    outFileObject.compress_type = zipfile.ZIP_STORED
    outputZipFile.writestr(outFileObject, fileData)

outputZipFile.close()
inputZipFile.close()
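After running the program, it may be worth confirming that the rewritten file still holds the same data as the ".bak" copy. A minimal sketch, where `same_contents` is a hypothetical helper and not part of the program above:

```python
import zipfile


def same_contents(path_a, path_b):
    """True if both archives hold the same member names, in order, with equal data."""
    with zipfile.ZipFile(path_a) as a, zipfile.ZipFile(path_b) as b:
        if a.namelist() != b.namelist():
            return False
        return all(a.read(name) == b.read(name) for name in a.namelist())
```

For example, `same_contents("report.odt", "report.odt.bak")` should be true before the backup is deleted.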

Imbricate answered 6/3, 2010 at 19:47 Comment(0)

E

2

Here's another program I stumbled across: store_zippies_uncompressed by Mirko Friedenhagen.

The wiki also shows how to integrate it with Mercurial.

Edward answered 16/3, 2010 at 7:43 Comment(0)

E

1

Here is a Python script that I've put together. It has had only basic testing so far, in Python 2.6. I prefer Python for this because it should abort with an exception if any error occurs, whereas a bash script may not.

This first checks that the input file is valid and not already uncompressed. Then it copies the input file to a "backup" file with ".bak" extension. Then it uncompresses the original file, overwriting it.

I'm sure there are things I've overlooked. Please feel free to give feedback.


#!/usr/bin/python
# Note, written for Python 2.6

import sys
import shutil
import zipfile

# Get a single command-line argument containing filename
commandlineFileName = sys.argv[1]

backupFileName = commandlineFileName + ".bak"
inFileName = backupFileName
outFileName = commandlineFileName
checkFilename = commandlineFileName

# Check input file
# First, check it is valid (not corrupted)
checkZipFile = zipfile.ZipFile(checkFilename)
checkZipFile.testzip()

# Second, check that it's not already uncompressed
isCompressed = False
for fileObject in checkZipFile.infolist():
    if fileObject.compress_type != zipfile.ZIP_STORED:
        isCompressed = True
if isCompressed == False:
    raise Exception("File is already uncompressed")

checkZipFile.close()

# Copy to "backup" file and use that as the input
shutil.copy(commandlineFileName, backupFileName)
inputZipFile = zipfile.ZipFile(inFileName)

outputZipFile = zipfile.ZipFile(outFileName, "w", zipfile.ZIP_STORED)

# Copy each input file's data to output, making sure it's uncompressed
for fileObject in inputZipFile.infolist():
    fileData = inputZipFile.read(fileObject)
    outFileObject = fileObject
    outFileObject.compress_type = zipfile.ZIP_STORED
    outputZipFile.writestr(outFileObject, fileData)

outputZipFile.close()
inputZipFile.close()

This is in a Mercurial repository in BitBucket.

Edward answered 13/6, 2009 at 14:8 Comment(0)

D

0

If you don't need the storage savings, but just want to be able to diff OpenOffice.org files stored in your version control system, you can follow the instructions on the oodiff page, which explains how to make oodiff the default diff for OpenDocument formats under Git and Mercurial. (It also mentions SVN, but it's been so long since I used SVN regularly that I'm not sure if those are instructions or limitations.)

(I found this using Mirko Friedenhagen's page (cited by Craig McQueen above))
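To show the general idea of diffing the text rather than the zip bytes, here is a rough sketch. It is not oodiff itself: it assumes a text document whose content.xml uses the standard ODF text namespace, and the helper names are made up.

```python
import difflib
import zipfile
import xml.etree.ElementTree as ET

# Namespace of ODF text elements such as text:p and text:h.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraphs(path):
    """Return the plain text of each paragraph and heading in content.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    wanted = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}
    return ["".join(elem.itertext()) for elem in root.iter() if elem.tag in wanted]


def text_diff(old_path, new_path):
    """Unified diff of the paragraph text of two documents, as a list of lines."""
    return list(difflib.unified_diff(
        paragraphs(old_path), paragraphs(new_path),
        fromfile=old_path, tofile=new_path, lineterm=""))
```

Printing `"\n".join(text_diff("old.odt", "new.odt"))` gives a readable diff regardless of how the two files were compressed.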

Daliladalis answered 15/7, 2012 at 1:22 Comment(0)
