sort: string comparison failed Invalid or incomplete multibyte or wide character

I'm trying to use the following command on a text file:

$ sort <m.txt | uniq -c | sort -nr >m.dict 

However I get the following error message:

sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were ‘enwedig\r’ and ‘mwy\r’.

I'm using Cygwin on Windows 7 and was having trouble earlier editing m.txt to put each word within the file on a new line. Please see:

Using AWK to place each word in a text file on a new line

I'm not sure whether I'm getting these errors because of this, or because m.txt contains characters from the Welsh alphabet (when I was working with Welsh text in Python, I had to change the encoding to 'Latin-1').

I tried following the error message's advice and setting LC_ALL='C', but this has not helped. Can anyone explain the errors I'm receiving and advise how I might go about solving this problem?

UPDATE:

When I tried dos2unix, it reported invalid characters at certain lines. It turns out these were not Welsh characters but other strange characters (arrows etc.). I went through my text file removing these characters until dos2unix ran without error. However, after running dos2unix all the text appeared concatenated (no spaces or newlines, whereas each word in the file should have been on a separate line). I then ran unix2dos and the text file was back to normal. How can I put each word on its own line and use the sort command without it giving me errors about '\r' characters?

Lisle answered 29/3, 2016 at 18:29 Comment(2)
dos2unix doesn't lead to one long line; it's only the Windows tools that don't understand Unix line endings. Don't use a Windows editor to look at a Unix file, use a Unix editor such as vi and you'll see each word on one line. And make sure you use the cygwin sort program, not the Windows sort program. Use /usr/bin/sort to be sure. – Assemble
Ah I see. My problem is still not quite solved but I think now it has drifted too far from the original question so I've created another. I will close this question now, thanks for the help. – Lisle

I know it's an old question, but just running the command export LC_ALL='C' does the trick, as suggested by the error message itself ("sort: Set LC_ALL='C' to work around the problem.").
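
For anyone unsure how to apply that setting, here is a minimal sketch, assuming a POSIX shell. The file sample.txt and its contents are made up for illustration and stand in for m.txt. Typing LC_ALL='C' on a line by itself normally only sets a shell variable (unless LC_ALL was already exported), so sort never sees it; the variable has to be exported or prefixed to each command.

```shell
# Hypothetical sample data: CRLF line endings plus one non-UTF-8 byte
# (octal 342 = 0xE2, which is "â" in Latin-1).
workdir=$(mktemp -d)
printf 'mwy\r\nenwedig\r\nm\342r\r\nmwy\r\n' > "$workdir/sample.txt"

# Option 1: prefix the assignment; it applies only to the command it precedes,
# so every stage of the pipeline needs its own prefix.
LC_ALL=C sort < "$workdir/sample.txt" | LC_ALL=C uniq -c | LC_ALL=C sort -nr > "$workdir/sample.dict"

# Option 2: export once; every later command started from this shell inherits it.
export LC_ALL=C
sort < "$workdir/sample.txt" | uniq -c | sort -nr > "$workdir/sample2.dict"
```

In the C locale sort compares raw bytes, so it no longer tries to decode the input as multibyte text. The '\r' characters are still part of each line, though; this only silences the comparison error.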

Mangle answered 31/1, 2017 at 9:28 Comment(1)
Same here. LC_ALL=C sed (...) enabled sed to consider non-ASCII characters for the .* pattern I used. – Pierides

Looks like a Windows line-ending related problem (\r\n versus \n). You can convert m.txt to Unix line-endings with

dos2unix m.txt

and then rerun your command.
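
If dos2unix is not available or refuses the file, the carriage returns can also be stripped with tr. This is only a sketch under some assumptions: the only unwanted bytes are CR (\r) and the DOS end-of-file marker 0x1A, and sample.txt is a made-up stand-in for m.txt.

```shell
# Hypothetical sample data: CRLF line endings and a trailing Ctrl-Z (octal 032 = 0x1A).
workdir=$(mktemp -d)
printf 'mwy\r\nenwedig\r\nmwy\r\n\032' > "$workdir/sample.txt"

# Delete every CR and Ctrl-Z byte; the LF line endings are left untouched.
tr -d '\r\032' < "$workdir/sample.txt" > "$workdir/sample.unix.txt"

# Same counting pipeline as in the question, now on LF-only input.
# LC_ALL=C makes sort compare raw bytes instead of decoding characters.
LC_ALL=C sort < "$workdir/sample.unix.txt" | LC_ALL=C uniq -c | LC_ALL=C sort -nr > "$workdir/sample.dict"
```

Writing to a new file rather than editing in place keeps the original around in case the encoding turns out to matter.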

Assemble answered 29/3, 2016 at 19:47 Comment(6)
Hi, this gives the following message: "dos2unix: Binary symbol 0x1A found at line 11451024 dos2unix: Skipping binary file m.txt", and when I then try the original command I get the same error. Any ideas? – Lisle
@Lisle Do you know the encoding of the file? I.e. is it UTF-8, Windows code page X, some other encoding? How was this file created? Does it look fine when opened with a Windows editor? – Assemble
It looks fine when opened in a text editor (Notepad). I'm not entirely sure of the encoding, but it contains Welsh language characters such as: â, ê, î, ô, û, ŵ, ŷ. I also tried dos2unix with the -f option and it runs, but when I then try the sort it's the same error. – Lisle
You can check whether any of the UTF-8 locales works. List the available locales with locale -a, then use e.g. export LC_ALL=en_US.UTF-8. Verify the setting with locale, then run the pipe again. If you suspect the encoding is some ISO8859, do the same with an appropriate locale. – Assemble
I believe Welsh would be part of 'ISO/IEC 8859-14'. How can I change the locale to that? It doesn't show when listing locales with 'locale -a'. – Lisle
If locale -a doesn't show it, then the C library does not support that locale. In that case, maybe the iconv codeset converter can convert it to a usable encoding. Failing that, it's time to think outside the box: delete the Welsh lines; create the file with a usable encoding; do it with the Windows tools,... – Assemble
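
The iconv suggestion could look roughly like the sketch below. It assumes the file really is ISO-8859-14 (a guess from the comments, not something confirmed in this thread) and that the installed iconv knows that encoding name, which iconv -l should show. The file sample.txt and its contents are made up for illustration.

```shell
# Hypothetical sample data: octal 360 (0xF0) is "ŵ" in ISO-8859-14.
workdir=$(mktemp -d)
printf 'd\360r\nmwy\nd\360r\n' > "$workdir/sample.txt"

# Convert from the assumed source encoding to UTF-8.
iconv -f ISO-8859-14 -t UTF-8 "$workdir/sample.txt" > "$workdir/sample.utf8.txt"

# Byte-wise comparison (C locale) is enough for counting duplicates,
# so no UTF-8 locale has to be installed for this step.
LC_ALL=C sort < "$workdir/sample.utf8.txt" | LC_ALL=C uniq -c | LC_ALL=C sort -nr > "$workdir/sample.dict"
```

If the guess about the source encoding is wrong, iconv will either stop with an error or produce the wrong accented letters, so it is worth checking a few known words in the converted file before trusting the counts.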
