I'm trying to use the following command on a text file:
$ sort <m.txt | uniq -c | sort -nr >m.dict
However I get the following error message:
sort: string comparison failed: Invalid or incomplete multibyte or wide character
sort: Set LC_ALL='C' to work around the problem.
sort: The strings compared were ‘enwedig\r’ and ‘mwy\r’.
I'm using Cygwin on Windows 7 and was having trouble earlier editing m.txt to put each word within the file on a new line. Please see:
Using AWK to place each word in a text file on a new line
I'm not sure if I'm getting these errors because of this, or because m.txt contains characters from the Welsh alphabet (when I was working with Welsh text in Python, I was required to change the encoding to 'Latin-1').
I tried following the error message's advice and setting LC_ALL='C', but this has not helped. Can anyone elaborate on the errors I'm receiving and provide any advice on how I might go about solving this problem?
UPDATE:
When trying dos2unix, errors were displayed about invalid characters at certain lines. It turns out these were not Welsh characters, but other strange characters (arrows etc.). I went through my text file removing these characters until I was able to run dos2unix without error. However, after running dos2unix all the text appeared concatenated (no spaces, newlines or anything, whereas each word in the file should have been on a separate line). I then used unix2dos and the text file was back to normal. How can I get each word on its own line and use the sort command without it giving me errors about '\r' characters?
dos2unix doesn't lead to one long line; it's only the Windows tools that don't understand Unix line endings. Don't use a Windows editor to look at a Unix file; use a Unix editor such as vi and you'll see each word on one line. And make sure you use the Cygwin sort program, not the Windows sort program. Use /usr/bin/sort to be sure. – Assemble
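To make the comment's advice concrete, here is a minimal sketch, not a confirmed fix for the original file. It assumes the only problems are the trailing '\r' characters and the locale-aware comparison; sample.txt and sample.dict are hypothetical stand-ins for m.txt and m.dict. One possible reason that setting LC_ALL='C' appeared not to help is that a plain assignment is not passed to child programs unless the variable is exported, so the sketch exports it.

```shell
# Hypothetical sample standing in for m.txt: one word per line, CRLF endings.
printf 'mwy\r\nenwedig\r\nmwy\r\n' > sample.txt

# Export LC_ALL so sort and uniq actually see it and compare raw bytes
# instead of decoding multibyte characters.
export LC_ALL=C

# Strip the carriage returns with tr, then count and rank the words.
# On Cygwin, /usr/bin/sort can be written out in full to avoid picking up
# the Windows sort program from PATH.
tr -d '\r' < sample.txt | sort | uniq -c | sort -nr > sample.dict

cat sample.dict
```

With byte-wise comparison the ordering follows byte values rather than Welsh alphabetical order, which does not matter for counting duplicates. Opening the result in a Unix editor such as vi should show each word on its own line.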