WinMerge: How to compare files with the same content but different encodings?
P

6

18

Motivation: I am rewriting a doc -- text files to be processed later. The new sources now use UTF-8. Large portions of the sources are the same. I need to find differences.

Details: The old doc sources use the cp1250 encoding, the new sources use UTF-8. Both new and old sources use the same line endings (CR+LF). I am using the Unicode version of the WinMerge application (WinMergeU.exe), version 2.12.4.0.

It almost works, but... When lines differ, the block is initially marked in dark yellow, and the differing portions are marked in a lighter colour. When the red block cursor is moved there, the panes below show the differing part.

However, a block of text is also marked in dark yellow in cases when (the Unicode representation of) the text is the same. The red block cursor moves to those portions of the files as well. In such a case, the two panes below (that show the differences) contain the same text and nothing is marked as different. See the picture below:

Example of the line that should not differ.

The very first line differs -- this is OK. But the second line has visually the same content. The only character outside of the ASCII range there is Ú. It has a different representation in the encoded sources. This causes the line to be marked as different, but the panes below do not mark anything on the line as different.

See also the following paragraphs, which are exactly the same (only the encoding of the sources differs; the same line endings are used).

It looks as if the initial comparison were based on binary representation of the lines. Is there any setting to tell WinMerge that the comparison (I mean the block marking) should be based on Unicode content?

I have tried hard, but no luck yet.

Update: The above question was for the latest stable 2.12.4. The beta version 2.13.22 works just perfectly for me. See my answer below.

Phytopathology answered 9/1, 2013 at 12:44 Comment(0)
R
8

I think it really should not be the task of a merge tool to allow the merging of files stored in different encodings.

An encoding is a function that maps bytes (stored on the disk or in memory) to characters (displayed on screen). Unfortunately, by default the encoding of a file is not stored together with the file. Therefore, any program that wants to open the file and display its contents needs to guess the encoding. While this sometimes works, it is also an error-prone procedure.
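To make the bytes-versus-characters point concrete, here is a minimal Python sketch using the two encodings from the question (cp1250 and UTF-8) and the character Ú that the asker mentions. It only illustrates the idea and says nothing about how WinMerge works internally.

```python
# The same character stored under the two encodings from the question.
text = "\u00da"  # U+00DA, LATIN CAPITAL LETTER U WITH ACUTE

cp1250_bytes = text.encode("cp1250")
utf8_bytes = text.encode("utf-8")

print(cp1250_bytes.hex())  # da
print(utf8_bytes.hex())    # c39a

# A byte-level comparison sees two different values ...
bytes_equal = cp1250_bytes == utf8_bytes

# ... while a comparison after decoding each side with its own codec does not.
chars_equal = cp1250_bytes.decode("cp1250") == utf8_bytes.decode("utf-8")

print(bytes_equal, chars_equal)  # False True
```

A tool that compares the stored bytes would therefore flag such a line, even though the decoded text is identical.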

Now, the character sets of different encodings do not overlap in general. So what is the merge tool supposed to do if you merge a character C from file A in encoding X into a file B in encoding Y, if character C is not part of the character set of encoding Y?
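A small sketch of that situation, assuming Python's standard codecs; the helper name `can_store` is made up for this illustration. It checks whether a character from one file could be written into a file that uses another encoding.

```python
def can_store(text: str, encoding: str) -> bool:
    """Return True if every character of `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+00DA (Ú) exists in cp1250, so it can be merged into a cp1250 file.
print(can_store("\u00da", "cp1250"))  # True

# U+03A9 (Greek capital omega) is not part of cp1250 ...
print(can_store("\u03a9", "cp1250"))  # False

# ... but UTF-8 can represent it.
print(can_store("\u03a9", "utf-8"))   # True
```

So merging from the 8-bit file into the UTF-8 file is always representable, while the opposite direction can fail for individual characters.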

Thus, I think the task of a merge tool should be to merge the binary content. Anything else is a dirty hack and doomed to fail at some level. (A merge tool maker may decide to provide character-level merging, which also might work most of the time. But there is some guesswork involved.)

Therefore, I'd also recommend you first translate the old files to UTF-8 and then merge those with the new versions.
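If you want to preview what the diff looks like once both sides are in one representation, without touching the files, something like the following Python sketch may help. The function names and the default encodings (cp1250 for the old file, UTF-8 for the new one, taken from the question) are assumptions for illustration only.

```python
import difflib


def read_lines(path, encoding):
    """Decode a text file; newline="" keeps the CR+LF endings untouched."""
    with open(path, encoding=encoding, newline="") as f:
        return list(f)


def diff_decoded(old_path, new_path, old_encoding="cp1250", new_encoding="utf-8"):
    """Return unified-diff lines computed on decoded text, not on raw bytes."""
    old = read_lines(old_path, old_encoding)
    new = read_lines(new_path, new_encoding)
    return list(
        difflib.unified_diff(old, new, fromfile=str(old_path), tofile=str(new_path))
    )
```

An empty list means the two files hold the same text even though their bytes differ. Note that the encodings have to be supplied by you; the sketch does no guessing.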

Raine answered 29/1, 2013 at 15:7 Comment(7)
Well, it depends. For me, the ability to compare text files is the ability to compare their content. WinMerge correctly recognizes the content and correctly displays it. I would expect at least some option to decide whether text files are to be compared in binary mode or not. - Phytopathology
As I argued, comparing files on the character level is not a failproof procedure. That's why many diff tools do not support it and KDiff gives a loud warning. And it's not an easy business, either, and that's probably why WinMerge does not handle the situation gracefully. Maybe converting all old files to UTF-8 seems tedious, but it is definitely the right solution to your problem. (In particular, the solution is independent of the diff tool you use.) - Raine
My observation is that WinMerge works just fine when comparing the texts. It really finds the differences between texts with different encodings. The only problem is that it does not mark blocks based on the difference. "...the character sets of different encodings do not overlap in general. So what is the merge tool supposed to do..." If it cannot be done based on abstract characters, it must not be done using binary representation either. - Phytopathology
My +1 anyway. Nothing is perfect. I will consider converting my scripts to Unicode also for the older sources. - Phytopathology
Thanks for the kudos. Yet, I disagree with "If it cannot be done based on abstract characters, it must not be done using binary representation either." It is perfectly sensible to compare two files if they have the same encoding. Then, comparing characters and comparing bytes is the exact same thing. Also, it is perfectly sensible to change the encoding of a single file. The important point is, both of these tasks are nontrivial, can lead to unexpected problems and require your attention. If you mix them up, and you have a problem, you are much less likely to be able to solve it. - Raine
You might contact the developers of WinMerge, so they can tackle that problem. In the end, if WinMerge provides the possibility to merge different encodings, it should handle such situations gracefully. Therefore your observations qualify as a bug of WinMerge. You can submit bug reports here. Get involved. - Raine
The majority of programs do quite nontrivial things. Once the files are detected as text files and the encoding is recognized, both contents can be treated internally as Unicode. I agree that merging a block from the Unicode-encoded file into the other, 8-bit encoded file may be a problem if there is some character outside the 8-bit encoding. But the process is interactive and the problem could be reported. // There is WinMerge 3 activity. I can imagine the problem will simply never be solved for WinMerge 2 :) - Phytopathology
T
10

This doesn't really answer your question about WinMerge, but have you considered using another diff program? One of my favorites is kdiff - http://kdiff3.sourceforge.net/

When I do a compare in KDiff using one UTF-8 file and another Unicode file, I get the following:

KDiff Compare Warning

Here is the compare screen - note that the encodings on the files are different, but the files are considered to be equal from a text standpoint:

KDiff Compare Results

Tiphanie answered 22/1, 2013 at 20:3 Comment(1)
+1 Thanks. But I hope someone who knows WinMerge better than I do is going to look and tell me: "Hey, you spotted a bug!" or "Well, there is a magic option hidden here. It is because this is a counterexample... We were thinking a lot about the problem." :) - Phytopathology
P
5

Just for your information: the question was about the latest stable version, 2.12.4. I have tried the beta version 2.13.22, and it works just perfectly for me. See the difference for exactly the same files -- only the first lines in the files were removed. (My big thanks to the authors.)

enter image description here

Phytopathology answered 30/1, 2013 at 14:24 Comment(0)
O
4
  1. Edit -> Options
  2. Select 'Compare' from categories pane on left.
  3. Check box 'Ignore carriage return differences' (UNIX, Windows, Mac)
Ogpu answered 14/2, 2020 at 9:37 Comment(2)
Hi Jamil, the question is rather old, and it was about different encodings, not about different line endings. Anyway, thanks for the info. - Phytopathology
@Phytopathology actually this solved my problem! Thanks JamilShahMoltke
D
1

I would recommend converting the files to the same encoding before diffing.

If you are working with a version control system I'd recommend the following:

  1. Create a fresh checkout of the files
  2. Convert all files to UTF-8
  3. Commit the files
  4. Copy your new files over
  5. Use WinMerge

That way you end up with two commits in the history - one for the encoding change and another for the content changes and WinMerge will work as expected.
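Step 2 could be scripted roughly as follows. This is only a sketch: the function name, the `*.txt` pattern and the cp1250 default are assumptions (cp1250 comes from the question), and the script rewrites files in place, so run it only on the fresh checkout from step 1.

```python
from pathlib import Path


def convert_tree_to_utf8(root, pattern="*.txt", source_encoding="cp1250"):
    """Re-encode every matching file under `root` to UTF-8, in place.

    Files are handled as bytes, so line endings (e.g. CR+LF) are not altered.
    Returns the list of converted paths.
    """
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        raw = path.read_bytes()
        # Raises UnicodeDecodeError if a byte is undefined in the source codec.
        text = raw.decode(source_encoding)
        path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted
```

Be careful not to run it twice on the same tree: UTF-8 bytes usually decode as cp1250 without an error, so a second pass would silently produce garbage.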

Decrypt answered 29/1, 2013 at 14:10 Comment(1)
The problem is that the older doc is still alive and must be kept in its original encoding. The new doc is also alive, and different tools are used to process the sources. - Phytopathology
S
0

What about the option File -> File Encoding... in WinMerge? It lets you set the encoding for each file independently.

Substitutive answered 29/1, 2013 at 14:51 Comment(1)
This works. The file panes display correct letters in both files. Only the comparison does not know about it. - Phytopathology
