Getting readable diff displays in Mercurial on Unicode files (MS Windows)

Asked 10/6, 2010 at 14:52 Answered 14/11, 2010 at 23:31

windows unicode mercurial diff tortoisehg

I'm trying to store some Windows PowerShell scripts in a Mercurial repository. It seems the PowerShell editor likes to save files as UTF-16 Unicode. This means that there are lots of \0 bytes, which is what Mercurial uses to distinguish between "text" and "binary" files. I understand that this makes no difference to how Mercurial stores the data, but it does mean that it displays binary diffs, which are kind of hard to read. Is there a way to tell Mercurial that these really are text files? Presumably I would need to convince Mercurial to use an external Unicode-aware diff program for particular file types.
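To make the NUL-byte point concrete, here is a small Python sketch. It assumes the check amounts to "does the data contain a \0 byte", as described above; `looks_binary` is a simplified stand-in rather than Mercurial's own function, and the sample script line is made up.

```python
import codecs

# Hypothetical one-line PowerShell script, used only for illustration.
script = 'Write-Host "hello"\r\n'

utf8_bytes = script.encode('utf-8')
# BOM followed by UTF-16LE data, which is what a UTF-16 file on Windows
# typically looks like on disk.
utf16_bytes = codecs.BOM_UTF16_LE + script.encode('utf-16-le')

def looks_binary(data):
    """Simplified stand-in for a NUL-byte 'is this binary?' heuristic."""
    return b'\0' in data

print(looks_binary(utf8_bytes))   # False
print(looks_binary(utf16_bytes))  # True
# Every ASCII character carries one NUL byte in UTF-16.
print(utf16_bytes.count(b'\0') == len(script))  # True
```

The same text is classed as "text" in UTF-8 and as "binary" in UTF-16, purely because of the padding bytes.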

Herbherbaceous answered 10/6, 2010 at 14:52 Comment(3)

Specifically, my problem is with the "Text diff" page in the "Commit" tool using TortoiseHg, which usually shows a nice summary of the changes in the selected file, but shows junk with UTF-16 files. – Herbherbaceous 10/6, 2010 at 14:59

@orad: As of 9/22/2010, I still have not found an answer. – Herbherbaceous 22/9, 2010 at 15:7

The BOM.py answer will work. Just copy the whole thing into a file and then edit (or create) your users\yourname\Mercurial.ini file and under the line "[extensions]" (add it, if there's no such line), add a line with a name = file (like "bom = C:\path\to\the\bom.py"). – Morale 3/7, 2011 at 2:41

This may not be relevant to you; read the last paragraph if it doesn't sound like it is.

I'm not sure whether this is what you need, but I've needed diffs of UTF-16LE content that show more than just "binary files are different". When I searched around for it some months ago I found a thread and a bug discussing it; here's part of it. I can't find the original source of this mini-extension now (though it does just what that patch does), but what I got was an extension, BOM.py:

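#!/usr/bin/env python
"""Treat files that start with a Unicode BOM as text rather than binary."""

import codecs

from mercurial import util

boms = [
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE,
    ]

def binary(s):
    if s:
        # Anything that starts with a BOM is treated as text.
        for bom in boms:
            if s.startswith(bom):
                return False
        # Otherwise fall back to the usual NUL-byte test.
        return b'\0' in s
    return False


def reposetup(ui, repo):
    # Swap in the BOM-aware check when a repository is set up.
    util.binary = binary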

This gets loaded in the .hgrc (or your users\username\mercurial.ini) like this:

[extensions]
bom = ~/.hgexts/BOM.py

Note the path will vary between Windows and Linux; on my Windows copy I put the path as \...\whatever (it's on a USB disk where the drive letter can change). Unfortunately relative paths are taken relative to the current working directory rather than the repository root or any such thing, but if you are saving it on your C: drive, you can just put the full path.
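If you want to see what the replaced check does before wiring it into Mercurial, the logic can be tried on its own. This is a standalone sketch of the same idea, not the extension itself: `is_binary` and the sample byte strings are made up for illustration, and nothing here imports Mercurial.

```python
import codecs

BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE,
)

def is_binary(data):
    """Return False for empty or BOM-prefixed data, else apply the NUL test."""
    if not data:
        return False
    if data.startswith(BOMS):  # startswith accepts a tuple of prefixes
        return False
    return b'\0' in data

with_bom = codecs.BOM_UTF16_LE + 'Get-ChildItem\r\n'.encode('utf-16-le')
without_bom = 'Get-ChildItem\r\n'.encode('utf-16-le')

print(is_binary(with_bom))      # False: BOM found, treated as text
print(is_binary(without_bom))   # True: no BOM, NUL bytes present
print(is_binary(b'plain text')) # False: no NUL bytes at all
```

Note that UTF-16 data saved without a BOM is still classed as binary by this check, so it only helps for editors that write the BOM.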

In Linux (my main development environment), this works well; in Command Prompt (which I still use regularly), it generally works well. I've never tried it in PowerShell, but I would expect it to be better than Command Prompt in its support for arbitrary null bytes in the command line.

I'm not sure if this is what you want at all; from the way you've said "binary diffs" I suspect you may already either have this or be doing hg diff -a, which achieves the same thing. In that case, all I can think of is writing another extension which takes the UTF-16LE and attempts to decode it to UTF-8. I'm not sure of the syntax for such an extension, but I might try that out.

Edit: having now trawled the Mercurial source through commands.py, cmdutil.py, patch.py and mdiff.py, I see that binary diffs are done with a base85 encoding (patch.b85diff) rather than the normal diff. I wasn't aware of that; I thought it just forced a normal diff. In that case, perhaps this text is relevant after all. I await a response to see if it is!
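To give a feel for why a base85-encoded diff is unreadable, here is a rough illustration using Python's standard base64.b85encode as a stand-in. It is not Mercurial's patch.b85diff and does not reproduce the framing of a real binary patch; the two sample "file versions" are made up.

```python
import base64
import codecs

# Two hypothetical revisions of a tiny UTF-16LE script.
old = codecs.BOM_UTF16_LE + 'Write-Host "hello"\r\n'.encode('utf-16-le')
new = codecs.BOM_UTF16_LE + 'Write-Host "hello, world"\r\n'.encode('utf-16-le')

old_b85 = base64.b85encode(old).decode('ascii')
new_b85 = base64.b85encode(new).decode('ascii')

# Printable, but the changed words are not recognisable in the output.
print(old_b85)
print(new_b85)
```

The encoding is lossless (base64.b85decode gives the original bytes back), but a human reader cannot see which words changed.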

Honkytonk answered 14/11, 2010 at 22:12 Comment(1)

Beware! While this extension works for diffing on the commandline, I have had issues with corruption when creating MQ patches via qnew. – Wordage 11/6, 2012 at 20:41

I have worked around this by creating a new file with Notepad++ and saving it as a PowerShell file (.ps1 extension). Notepad++ will create the file as a plain text ANSI file. Once created, I can open the file in the PowerShell editor and make any changes as necessary without the editor modifying the file encoding.

Disclaimer: I encountered this just moments ago and so I am not sure if there are any repercussions but so far my scripts appear to work as normal and my diffs are showing up nicely.
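For scripts that already exist as UTF-16, the same effect can probably be had by converting them instead of recreating them. The sketch below is an assumption-laden example, not something from the steps above: `convert_utf16_to_utf8` is a hypothetical helper, it writes UTF-8 without a BOM (byte-identical to ANSI only for pure-ASCII scripts), and it is demonstrated on a throwaway temp file.

```python
import codecs
import os
import tempfile

def convert_utf16_to_utf8(path):
    """Rewrite a BOM-prefixed UTF-16 file in place as UTF-8 without a BOM.

    Returns True if the file was converted, False if it was left alone.
    """
    with open(path, 'rb') as f:
        raw = f.read()
    # The UTF-32LE BOM begins with the UTF-16LE BOM bytes, so rule it out first.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return False
    if not raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False
    text = raw.decode('utf-16')  # this codec reads and drops the BOM
    with open(path, 'wb') as f:  # binary mode keeps \r\n line endings as-is
        f.write(text.encode('utf-8'))
    return True

# Demo on a temporary .ps1 file.
fd, demo_path = tempfile.mkstemp(suffix='.ps1')
os.close(fd)
with open(demo_path, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + 'Write-Host "hello"\r\n'.encode('utf-16-le'))

converted = convert_utf16_to_utf8(demo_path)
with open(demo_path, 'rb') as f:
    result = f.read()
os.remove(demo_path)

print(converted)  # True
print(result)     # b'Write-Host "hello"\r\n'
```

If a script contains non-ASCII characters, check that PowerShell still reads it correctly after conversion; that part is untested here.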

Oligochaete answered 14/11, 2010 at 16:53 Comment(1)

Converting to UTF-8 also works for .strings files in Xcode (genstrings generates UTF-16LE by default) – Deibel 23/6, 2014 at 10:18

If my other answer does not do what you want, I think this one may; although I haven't tested it on Windows at all yet, it's working well in Linux. It does what is potentially a nasty thing, in wrapping mercurial.mdiff.unidiff with a new function which converts utf-16le to utf-8. This will not affect hg st, but will affect hg diff. One potential pitfall is that the BOM will also be changed from UTF-16LE BOM to the UTF-8 BOM.

Anyway, I think it may be useful to you, so here it is.

Extension file utf16decodediff.py:

import codecs
from mercurial import mdiff

unidiff = mdiff.unidiff

def to_utf8(text):
    """Re-encode UTF-16LE text (detected by its BOM) as UTF-8."""
    # text can be None when a file exists on only one side of the diff;
    # bytes is the same type as str on Python 2.
    if isinstance(text, bytes) and text.startswith(codecs.BOM_UTF16_LE):
        try:
            # Gets reencoded as utf-8 to be a str rather than a unicode; some
            # extensions may expect a str and may break if it's wrong.
            return text.decode('utf-16le').encode('utf-8')
        except UnicodeDecodeError:
            pass
    return text

def new_unidiff(a, ad, b, bd, fn1, fn2, r=None, opts=mdiff.defaultopts):
    """
    A simple wrapper around mercurial.mdiff.unidiff which first decodes
    UTF-16LE text.
    """
    return unidiff(to_utf8(a), ad, to_utf8(b), bd, fn1, fn2, r, opts)

mdiff.unidiff = new_unidiff

In .hgrc:

[extensions]
utf16decodediff = ~/.hgexts/utf16decodediff.py

(Or equivalent paths.)
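The BOM pitfall mentioned above can be seen in isolation, without Mercurial. This is only a sketch of the conversion step: `utf16le_to_utf8` is a name made up for the example, and the None guard is an extra assumption for the case where one side of a diff has no file.

```python
import codecs

def utf16le_to_utf8(data):
    """Re-encode BOM-prefixed UTF-16LE bytes as UTF-8; return anything else unchanged."""
    if not isinstance(data, bytes) or not data.startswith(codecs.BOM_UTF16_LE):
        return data
    try:
        # 'utf-16-le' keeps the BOM as U+FEFF, which then becomes the UTF-8 BOM.
        return data.decode('utf-16-le').encode('utf-8')
    except UnicodeDecodeError:
        return data

sample = codecs.BOM_UTF16_LE + 'Write-Host "hi"\r\n'.encode('utf-16-le')
converted = utf16le_to_utf8(sample)

print(converted.startswith(codecs.BOM_UTF8))  # True: the BOM was carried over
print(b'\0' in converted)                     # False: no NUL padding left
print(utf16le_to_utf8(None))                  # None: non-bytes input is passed through
```

So the diff will show a UTF-8 BOM at the start of the first line even though the file on disk still has the UTF-16LE one.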

Honkytonk answered 14/11, 2010 at 23:31 Comment(2)

Unfortunately, this approach suffers from a memory issue: the files are slurped up (by Mercurial, not this extension), so if memory is tight you can run out. To reach the wrapper you also need to set --config diff.nobinary=True. From personal experience I know that this will screw up mq patches, so I don't recommend keeping it enabled permanently. – Thithia 4/11, 2015 at 21:47

I would also recommend adding if isinstance(a, str): and if isinstance(b, str): checks, because when doing a diff where one version lacks files that the other has, those variables can be NoneType and cause the extension to crash Mercurial. – Sweetie 23/8, 2017 at 3:27
