How to perform better document version control on Excel files and SQL schema files
Asked Answered
F

10

131

I am in charge of several Excel files and SQL schema files. How should I perform better document version control on these files?

I need to know the part modified (different part) in these files and keep all the versions for reference. Currently I am appending the time stamp on the file name, but I found it seemed to be inefficient.

Is there a way or good practice to do better document version control?

By the way, editors send me the files via email.

Folium answered 13/6, 2013 at 9:22 Comment(2)
I may convert these Excel files into CSV files, and then track them using git so that I can use diff to see the modification. Is there any other good practice?Folium
See the other answers, which I think are better than the one you accepted.Wornout
A
54

Since you've tagged your question with I assume you are asking about Git usage for this.

Well, SQL dumps are normal text files so it makes perfect sense to track them with Git. Just create a repository and store them in it. When you get a new version of a file, simply overwrite it and commit, Git will figure out everything for you, and you'll be able to see modification dates, checkout specific versions of this file and compare different versions.

The same is true for .xlsx if you decompress them. .xlsx files are zipped up directories of XML files (See How to properly assemble a valid xlsx file from its internal sub-components?). Git will view them as binary unless decompressed. It is possible to unzip the .xlsx and track the changes to the individual XML files inside of the archive.

You could also do this with .xls files, but the problem here is that .xls format is binary, so you can't get meaningful diffs from it. But you'll still be able to see modification history and checkout specific versions.

Andro answered 13/6, 2013 at 9:51 Comment(4)
Yes, I know git. I think git is good when tracking SQL schemes. As for Excel files (.xlsx and .xls), because they are binary files, tracking them using git cannot show me what has been modified in human perspective. This is what I'm confusing.Folium
@MarcusThornton .xlsx is XML, so should work fine. In general, there is no way to easily compare two .xls files. You probably could add a pre-commit hook that will put a .csv near it and you'll be able to diff those.Andro
What about ’.xlsm'?Urticaceous
the repository is not available anymoreTattered
P
102

The answer I have written here can be applied in this case. A tool called xls2txt can provide human-readable output from .xls files. So in short, you should put this to your .gitattributes file:

*.xls diff=xls

And in the .git/config:

[diff "xls"]
    binary = true
    textconv = /path/to/xls2txt

Of course, I'm sure you can find similar tools for other file types as well, making git diff a very useful tool for office documents. This is what I currently have in my global .gitconfig:

[diff "xls"]
    binary = true
    textconv = /usr/bin/py_xls2txt
[diff "pdf"]
    binary = true
    textconv = /usr/bin/pdf2txt
[diff "doc"]
    binary = true
    textconv = /usr/bin/catdoc
[diff "docx"]
    binary = true
    textconv = /usr/bin/docx2txt

The Pro Git book has a good chapter on the subject: 8.2 Customizing Git - Git Attributes

Pecksniffian answered 14/6, 2013 at 10:6 Comment(7)
it doesn't work for me on windows7. i've downloaded the catdoc verion for Windows from here: blog.brush.co.nz/2009/09/catdoc-windows than edit gitconfig and attributes as described above. but i still get: diff --git a/src/Reports/src/main/etc/templates/nbcu.xls b/src/Reports/src/main/etc/templates/nbcu.xls index 2476319..1daec86 100644 Binary files a/src/.../test.xls and b/src/.../test.xls differ GIT version: 1.7.6.msysgit.1Frostbite
Is it still storing the doc as a doc file or as a text file? If it's a text file, how do you recover the doc?Amenra
@Amenra This has no effect on how the file is stored, only the output of diff command is affected.Pecksniffian
So it's still storing the entire file, not the diffs?Amenra
Re: xls2txt: extremely reluctant to install a closed-source tool from a Polish website. This might be the same thing? github.com/hroptatyr/xls2txt No README though...Ladylike
This answer doesn't work for the new format of Excel files xlsx. If someone is looking to diff Excel files in xlsx format, you could save time not trying this.Nusku
Indeed, the particular program mentioned here is only for xls files. If you replace it with something that can produce text output from xlsx file, it works.Pecksniffian
A
54

Since you've tagged your question with I assume you are asking about Git usage for this.

Well, SQL dumps are normal text files so it makes perfect sense to track them with Git. Just create a repository and store them in it. When you get a new version of a file, simply overwrite it and commit, Git will figure out everything for you, and you'll be able to see modification dates, checkout specific versions of this file and compare different versions.

The same is true for .xlsx if you decompress them. .xlsx files are zipped up directories of XML files (See How to properly assemble a valid xlsx file from its internal sub-components?). Git will view them as binary unless decompressed. It is possible to unzip the .xlsx and track the changes to the individual XML files inside of the archive.

You could also do this with .xls files, but the problem here is that .xls format is binary, so you can't get meaningful diffs from it. But you'll still be able to see modification history and checkout specific versions.

Andro answered 13/6, 2013 at 9:51 Comment(4)
Yes, I know git. I think git is good when tracking SQL schemes. As for Excel files (.xlsx and .xls), because they are binary files, tracking them using git cannot show me what has been modified in human perspective. This is what I'm confusing.Folium
@MarcusThornton .xlsx is XML, so should work fine. In general, there is no way to easily compare two .xls files. You probably could add a pre-commit hook that will put a .csv near it and you'll be able to diff those.Andro
What about ’.xlsm'?Urticaceous
the repository is not available anymoreTattered
M
24

I've been struggling with this exact problem for the last few days and have written a small .NET utility to extract and normalise Excel files in such a way that they're much easier to store in source control. I've published the executable here:

https://bitbucket.org/htilabs/ooxmlunpack/downloads/OoXmlUnpack.exe

..and the source here:

https://bitbucket.org/htilabs/ooxmlunpack

If there's any interest I'm happy to make this more configurable, but at the moment, you should put the executable in a folder (e.g. the root of your source repository) and when you run it, it will:

  • Scan the folder and its subfolders for any .xlsx and .xlsm files
  • Take a copy of the file as *.orig.
  • Unzip each file and re-zip it with no compression.
  • Pretty-print any files in the archive which are valid XML.
  • Delete the calcchain.xml file from the archive (since it changes a lot and doesn't affect the content of the file).
  • Inline any unformatted text values (otherwise these are kept in a lookup table which causes big changes in the internal XML if even a single cell is modified).
  • Delete the values from any cells which contain formulas (since they can just be calculated when the sheet is next opened).
  • Create a subfolder *.extracted, containing the extracted zip archive contents.

Clearly not all of these things are necessary, but the end result is a spreadsheet file that will still open in Excel, but which is much more amenable to diffing and incremental compression. Also, storing the extracted files as well makes it much more obvious in the version history what changes have been applied in each version.

If there's any appetite out there, I'm happy to make the tool more configurable since I guess not everyone will want the contents extracted, or possibly the values removed from formula cells, but these are both very useful to me at the moment.

In tests, a 2 MB spreadsheet 'unpacks' to 21 MB, but then I was able to store five versions of it with small changes between each, in a 1.9 MB Mercurial data file, and visualise the differences between versions effectively using Beyond Compare in text mode.

NB: although I'm using Mercurial, I read this question while researching my solution and there's nothing Mercurial-specific about the solution, should work fine for Git or any other VCS.

Minervamines answered 10/6, 2014 at 16:12 Comment(5)
I actually haven't tried, but I assume it would - if you do give that a try it would be great to knowMinervamines
@JonG I can't get it to work with LibreOffice and there's no Issues tab in the bitbucket repository. I'd love to contribute if we could get an issue going!Modulation
Hi @christian-droulers, I've enabled issue on the Repo, feel free to add something there!Minervamines
@JonG This looks great, having a diff-able version history could be really useful in a lot of document related scenarios! But why is it important that the file opens in Excel? Can't you just use the .orig file? And do you think the normalizations can be configurable/dynamic so that the code can be used for docx/pptx as well?Cabrilla
@Jon G, is your tool still available somewhere? The link in your post seems to be dead by now.Mariselamarish
F
12

Tante recommended a very simple approach in Managing ZIP-based file formats in Git:

Open your ~/.gitconfig file (create if not existing already) and add the following stanza:

[diff "zip"]
textconv = unzip -c -a
Fulmination answered 6/2, 2017 at 21:32 Comment(1)
then, Peng Xu extended the solution, allowing to versioning zip-based files using filter, in addition to only view diff changes: tante.cc/2010/06/23/managing-zip-based-file-formats-in-git/…Fulmination
M
6

Use the open document extension .fods. It's a plain, uncompressed XML markup format that both Excel and LibreOffice can open, and the diffs will look good.

Meingoldas answered 24/6, 2018 at 3:6 Comment(1)
I do not think Excel supports .fods (it does support the compressed, `.ods' format, which is not much help here).Hyperion
P
2

This Excel utility works very well for me:

Version Control for Excel

It is a quite straightforward versioning tool for workbooks and VBA macros. Once you commit a version, it is saved to a Git repository on your PC. I never tried it re. SQL schema files, but I'm sure there's a way around.

Prolific answered 12/4, 2016 at 17:24 Comment(1)
This is the only tool I've found that works with modules embedded in .xlsm files. The only alternative I know of is running a macro to export every module to its own file, committing them, and then running a macro to import them all again after pulling and merging. xltrail is a lot easier than that.Lappet
E
2

We've built an open-source Git command line extension for Excel workbooks: https://www.xltrail.com/git-xltrail.

In a nutshell, the main feature is that it makes git diff work on any workbook file formats so that it shows the diff on the workbook's VBA content (at some point, we'll make this work for the worksheets content, too).

It's still early days but it might help.

Ensepulcher answered 19/2, 2018 at 8:41 Comment(4)
and over two years later it still only handles VBA, while many of the other solutions handle the whole spreadsheet. I haven't actually cared about the VBA content of a spreadsheet in over a decade (or more correctly, I've actively tried to avoid having any...).Nummulite
Is it possible to integrate your tool to work from Visual Studio on Windows?Buzzer
@AndreyBelykh - what sort of VS integration do you have in mind?Ensepulcher
Either something on the server or client to be able to see diff from the Visual Studio UIBuzzer
P
1

As mentioned in another answer's comment, .xlsx files are just XML.

To get to the XML directory (which is git-able), you have to "unzip" the .xlsx file to a directory. A quick way see this on Windows is to rename the file <filename>.xlsx to <filename>.zip, and you'll see the inner contents. I'd store this along with the binary so that when you checkout, you do not have to do other steps in order to open the document in Excel.

Placentation answered 13/6, 2013 at 11:29 Comment(1)
At least the zip tool I use (7-zip) allows to open/extract all files - you don't have to rename them.Progressionist
D
1

My approach with Excel files is similar to Jon's, but instead of working with the raw Excel text data I export to more friendly formats.

Here is the tool that I use: https://github.com/stenci/ExcelToGit/tree/master

All you need is to download the .xlsm file (click the View Raw link on this page.) Don't forget to check the Excel setting as described in the readme. You can also add the code to export SQL data to text files.

The workbook is both a converter from binary Excel to text files and a launcher of the windows Git tools, and it can be used also with non Excel related projects.

My working version is configured with dozens of Excel workbooks. I use the file also to open Git-gui for non Excel projects, just adding the git folder by hand.

Dwayne answered 10/10, 2015 at 23:27 Comment(0)
B
0

If you are not willing to download third-party converters into your project, another option is to write your own script to convert the Excel file for you. I added my own converter using Python. The python script:

from pandas import DataFrame, read_excel
from argparse import ArgumentParser
from os import remove


def change_file_format_to_csv(filename):
    filename = filename.split(".")
    filename[-1] = 'csv'


if __name__ == "__main__":
    # Parse arguments
    parser = ArgumentParser()
    parser.add_argument("-i", "--input", default="", required=False,
                        help="Input file to be converted")
    args = parser.parse_args()

    # Load input
    print(args.input)
    content = read_excel(args.input)

    # Change filename to csv
    filename_output = args.input.split(".")
    filename_output[-1] = "csv"
    filename_output = '.'.join(filename_output)

    # Store input as CSV
    content.to_csv(filename_output)

    # Cat output to command line
    with open(filename_output, 'r') as f:
        print(f.read())

    # Remove temporary file
    remove(filename_output) 

Then, you add (as explained in the other answers) the name of the python script to your .git/config and .gitattributes files.

.gitattributes:

*.xls diff=xls

.git/config:

[diff "xlsx"]
        textconv = python ./utils/xlsx_to_csv.py --input
Bradford answered 13/5, 2022 at 7:50 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.