How can I tell TortoiseHg to display a UTF-16 file as non-binary?

3

15

In a Microsoft Access 2007 project, the Access form objects are exported to files by dedicated software that uses the built-in function "SaveAsText". This is necessary because Access doesn't store any of its code modules in separate files on its own.

The file starts with the bytes "FF FE" (the UTF-16 byte order mark, according to http://de.wikipedia.org/wiki/Byte_Order_Mark). I presume that Hg treats this file as binary because of the many NUL characters in it. Hence the diff pane in the TortoiseHg workbench always reports:

File or diffs not displayed: File is binary.

which is quite understandable under this assumption. Nevertheless, this file is just ordinary source code. I can view it in Windows Notepad, for example, without any problems.

Is there any way to tell Mercurial that this particular file should be treated as text, not binary?

Edit: In addition to the accepted answer below, I decided not to change the saving behaviour, but to use the "Visual Diff" command (select the file, then press Ctrl+D) instead.

Gibbsite answered 4/7, 2011 at 15:22 Comment(1)
I tweaked the title and tags to show this is about the TortoiseHg UI, which isn't a part of Mercurial. - Gallard
6

I'm guessing that you frequently or occasionally export the form objects in order to track source code changes.

The only way to convince Mercurial that a file is not binary is to avoid NUL bytes.

You may want to convert the source code files to ASCII (or maybe ANSI) encoding as an additional step in your export in order to avoid the NUL bytes. If the source code files contain Unicode characters, you might try UTF-8, which uses multi-byte sequences only when necessary and single bytes otherwise, so it avoids NUL bytes as well. I tried it out briefly and Mercurial handles UTF-8: it doesn't show "File is binary", but the actual diff. I committed on the command line, but viewed the diff in TortoiseHg. I have a link about command-line encoding challenges below.
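As a rough illustration of such a post-export step, here is a minimal Python sketch. The function names `utf16_to_utf8` and `utf8_to_utf16` are made up for this example, and it assumes the exported file starts with a byte order mark, as described in the question. Whether Access needs the UTF-16 form back for re-import is also an assumption; the reverse helper is included only in case it does.

```python
import codecs
from pathlib import Path


def utf16_to_utf8(src, dst):
    """Re-encode a UTF-16 file (with BOM) as UTF-8 without a BOM."""
    raw = Path(src).read_bytes()
    # The "utf-16" codec uses the BOM to pick the byte order and drops it.
    text = raw.decode("utf-16")
    # Working on bytes keeps the original line endings (CR LF) untouched.
    Path(dst).write_bytes(text.encode("utf-8"))


def utf8_to_utf16(src, dst):
    """Reverse step: UTF-8 back to UTF-16 little-endian with an FF FE BOM."""
    text = Path(src).read_bytes().decode("utf-8")
    Path(dst).write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))
```

Running `utf16_to_utf8("Form_Main.txt", "Form_Main.utf8.txt")` (hypothetical file names) after each export and committing only the UTF-8 copy should give Mercurial a file without NUL bytes.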

The hgrc encode/decode sections might be particularly useful in helping to filter the UTF-16 files into something that works better.

A couple other pages on Mercurial and encoding:

TortoiseHg 2.1 + Mercurial 1.9

Hairtail answered 5/7, 2011 at 17:26 Comment(2)
Many thanks for this hint! You are correct in assuming that I export those objects when appropriate. Unfortunately I didn't write the export function myself; it's part of the Access application, which is closed source. In the meantime I found this Python script, which may give me a chance to write an en-/decoder: #3016042 - Putative
I didn't mean changing Access's export function, I meant adding a step to process the files that Access gives you. I'll re-word a bit. - Hairtail
3

From https://www.mercurial-scm.org/wiki/BinaryFiles:

The question naturally arises, what is a binary file anyway? It turns out there's really no good answer to this question, so Mercurial uses the same heuristic that programs like diff(1) use. The test is simply if there are any NUL bytes in a file.

For diff, export, and annotate, this will get things right almost all of the time and it will not attempt to process files it thinks are binary. If necessary, you can force these commands to treat files as text with -a.
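The heuristic described in the quote is easy to reproduce. The following is only a sketch of the idea, not Mercurial's actual code, and `looks_binary` is a name invented for this example. It shows why a UTF-16 export of plain ASCII source text trips the check while a UTF-8 copy of the same text does not.

```python
import codecs


def looks_binary(data: bytes) -> bool:
    """Return True if the data contains at least one NUL byte."""
    return b"\x00" in data


source = "Option Compare Database\r\n"

# UTF-16 little-endian with BOM: every ASCII character is followed by 0x00.
as_utf16 = codecs.BOM_UTF16_LE + source.encode("utf-16-le")
# UTF-8: ASCII characters stay single bytes, so no NUL bytes appear.
as_utf8 = source.encode("utf-8")

print(looks_binary(as_utf16))  # True
print(looks_binary(as_utf8))   # False
```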

Tellurion answered 4/7, 2011 at 15:26 Comment(3)
Thanks for your really quick answer :-) I had realized this (but didn't mention it, sorry), but I see no way to tell the TortoiseHg workbench to remember this setting for this file (or for files matching a given filter condition). - Putative
@Christoph Jüngling Did you check the files for a NUL using a hex editor (if I understand correctly there shouldn't be one)? If there's none, it might be worth reporting it to the TortoiseHg staff as a bug. - Chingchinghai
I used wxHexEditor to check. And yes, there are many 00 bytes in it, every second one I guess. So I understand why Hg or TortoiseHg treats this file as binary. Nevertheless, I'd like to tell them that this is indeed an ordinary source code file which has just been saved as UTF-16. The "-a" option mentioned above works with "hg diff -a filename", but even then every second character is a dot, which makes the diff quite unreadable. TortoiseHg completely refuses to display the diff (or the content). - Putative
1

This didn't exist at the time the question was asked, but now there's the msaccess-vcs-integration project, which exports/imports MS Access objects so that they can be version controlled.

Quote from the project's readme:

Encoding

For Access objects which are normally exported in UCS-2-little-endian encoding, the included module automatically converts the source code to and from UTF-8 encoding during export/import; this is to ensure that you don't have trouble branching, merging, and comparing in tools such as Mercurial which treat any file containing 0x00 bytes as a non-diffable binary file.

If you export your forms and modules with this instead of directly using Access's SaveAsText function, Mercurial will not treat the files as binary.

Hazem answered 8/11, 2018 at 21:24 Comment(1)
Thanks for this hint, @christian-specht. For a couple of years now I have been using OASIS, which does nearly the same. - Putative
