I'm trying to store some Windows PowerShell scripts in a Mercurial repository. It seems the PowerShell editor likes to save files as UTF-16 Unicode. This means that there are lots of \0
bytes, which is what Mercurial uses to distinguish between "text" and "binary" files. I understand that this makes no difference to how Mercurial stores the data, but it does mean that it displays binary diffs, which are kind of hard to read. Is there a way to tell Mercurial that these really are text files? Presumably I would need to convince Mercurial to use an external Unicode-aware diff program for particular file types.
This may not be relevant to you; read the last paragraph if it doesn't sound like it is.
I'm not sure whether this is what you're needing, but I've needed diffs with UTF-16LE content more than just the "binary files are different" - when I searched around some months ago for it I found a thread and bug discussing it; here's part of it. I can't find the original source of this mini-extension now (though it's doing just what that patch does), but what I got was an extension, BOM.py
:
#!/usr/bin/env python
from mercurial import hg, util
import codecs
boms = [
codecs.BOM_UTF8,
codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE,
codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE
]
def binary(s):
if s:
for bom in boms:
if s.startswith(bom):
return False
return '\0' in s
return False
def reposetup(ui, repo):
util.binary = binary
This gets loaded in the .hgrc (or your users\username\mercurial.ini) like this:
[extensions]
bom = ~/.hgexts/BOM.py
Note the path will vary between Windows and Linux; on my Windows copy I put the path as \...\whatever
(it's on a USB disk where the drive letter can change). Unfortunately relative paths are taken relative to the current working directory rather than the repository root or any such thing, but if you are saving it on your C: drive, you can just put the full path.
In Linux (my main development environment), this works well; in Command Prompt (which I still use regularly), it generally works well. I've never tried it in PowerShell, but I would expect it to be better than Command Prompt in its support for arbitrary null bytes in the command line.
I'm not sure if this is what you want at all; by the way you've said "binary diffs" I suspect you may already either have this or be doing hg diff -a
which is achieving the same thing. In that case, all I can think of is writing another extension which takes the UTF-16LE and attempts to decode it to UTF-8. I'm not sure of the syntax for such an extension, but I might try that out.
Edit: having now trawled the mercurial source through commands.py, cmdutil.py, patch.py and mdiff.py, I see that binary diffs are done with a base85 encoding (patch.b85diff) rather than the normal diff. I wasn't aware of that, I thought it just forced it to diff it. In that case, perhaps this text is relevant after all. I await a response to see if it is!
qnew
. –
Wordage I have worked around this by creating a new file with NotePad++ and saving it as a PowerShell file (.ps1 extension). NotePad++ will create the file as a plain text ANSI file. Once created I can open the file in the PowerShell editor and make any changes as necessary without the editor modifying the file encoding.
Disclaimer: I encountered this just moments ago and so I am not sure if there are any repercussions but so far my scripts appear to work as normal and my diffs are showing up nicely.
If my other answer does not do what you want, I think this one may; although I haven't tested it on Windows at all yet, it's working well in Linux. It does what is potentially a nasty thing, in wrapping mercurial.mdiff.unidiff
with a new function which converts utf-16le to utf-8. This will not affect hg st
, but will affect hg diff
. One potential pitfall is that the BOM will also be changed from UTF-16LE BOM to the UTF-8 BOM.
Anyway, I think it may be useful to you, so here it is.
Extension file utf16decodediff.py
:
import codecs
from mercurial import mdiff
unidiff = mdiff.unidiff
def new_unidiff(a, ad, b, bd, fn1, fn2, r=None, opts=mdiff.defaultopts):
"""
A simple wrapper around mercurial.mdiff.unidiff which first decodes
UTF-16LE text.
"""
if a.startswith(codecs.BOM_UTF16_LE):
try:
# Gets reencoded as utf-8 to be a str rather than a unicode; some
# extensions may expect a str and may break if it's wrong.
a = a.decode('utf-16le').encode('utf-8')
except UnicodeDecodeError:
pass
if b.startswith(codecs.BOM_UTF16_LE):
try:
b = b.decode('utf-16le').encode('utf-8')
except UnicodeDecodeError:
pass
return unidiff(a, ad, b, bd, fn1, fn2, r, opts)
mdiff.unidiff = new_unidiff
In .hgrc
:
[extensions]
utf16decodediff = ~/.hgexts/utf16decodediff.py
(Or equivalent paths.)
--config diff.nobinary=True
(from personal experience I know that this will screw up mq patches so I don't recommend keeping it enabled permanently) to reach the wrapper. –
Thithia if isinstance(a, str):
and if isinstance(b, str)
because when doing a diff where one version lacks files that the other has those variables can be NoneType and cause the extension to crash mercurial –
Sweetie © 2022 - 2024 — McMap. All rights reserved.