How can I be sure of the file encoding?

I have a PHP file that I created with Vim, but I'm not sure what its encoding is.

When I use the terminal and check the encoding with the command file -bi foo (my operating system is Ubuntu 11.04), it gives me the following result:

text/html; charset=us-ascii

But when I open the file with gedit, it says its encoding is UTF-8.

Which one is correct? I want the file to be encoded in UTF-8.

My guess is that there's no BOM in the file and that file -bi reads the file, doesn't find any non-ASCII characters, and therefore assumes it's ASCII, even though in reality it's encoded in UTF-8.
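
One way to check for a BOM is to dump the file's first three bytes; a UTF-8 BOM would show up as the hex sequence ef bb bf:

$ head -c 3 foo | od -An -tx1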

Algarroba answered 13/6, 2012 at 16:6 Comment(2)
What non-ASCII characters are in your file? - Vaughn
There is a good solution using Microsoft Visual Studio Code, described at askubuntu.com/a/681803 - Eleusis

Well, first of all, note that ASCII is a subset of UTF-8, so if your file contains only ASCII characters, it's correct to say that it's encoded in ASCII and it's correct to say that it's encoded in UTF-8.

That being said, file typically only examines a short segment at the beginning of the file to determine its type, so it might be declaring it us-ascii if there are non-ASCII characters but they are beyond the initial segment of the file. On the other hand, gedit might say that the file is UTF-8 even if it's ASCII because UTF-8 is gedit's preferred character encoding and it intends to save the file with UTF-8 if you were to add any non-ASCII characters during your edit session. Again, if that's what gedit is saying, it wouldn't be wrong.
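
For example (how many bytes file inspects varies by version, so this misdetection may not reproduce on every system; demo.txt is a throwaway name):

    $ head -c 10000000 /dev/zero | tr '\000' 'a' > demo.txt   # 10 MB of ASCII
    $ printf '\xc3\xa9' >> demo.txt                           # a UTF-8 'é' at the very end
    $ file -bi demo.txt
    text/plain; charset=us-ascii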

Now to your question:

  1. Run this command:

    tr -d \\000-\\177 < your-file | wc -c
    

    If the output says "0", then the file contains only ASCII characters. It's in ASCII (and it's also valid UTF-8). End of story.

  2. Run this command:

    iconv -f utf-8 -t ucs-4 < your-file >/dev/null
    

    If you get an error, the file does not contain valid UTF-8 (or at least, some part of it is corrupted).

    If you get no error, the file is extremely likely to be UTF-8. That's because UTF-8 has properties that make it very hard to mistake typical text in any other commonly used character encoding for valid UTF-8. (A short sample session demonstrating both checks follows this list.)
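
As a quick sanity check of both steps, here is a sample session (hypothetical file names; the exact iconv error message varies by implementation):

    $ printf 'hello\n' > ascii.txt
    $ tr -d \\000-\\177 < ascii.txt | wc -c
    0
    $ printf '\xc3\xa9\n' > utf8.txt           # UTF-8 'é' (two bytes)
    $ tr -d \\000-\\177 < utf8.txt | wc -c
    2
    $ printf '\xe9\n' > latin1.txt             # 'é' in ISO-8859-1
    $ iconv -f utf-8 -t ucs-4 < latin1.txt >/dev/null
    iconv: illegal input sequence at position 0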

Mineraloid answered 13/6, 2012 at 18:49 Comment(5)
The first command returned 0, and the second command didn't return an error, so we can say it's UTF-8. Thanks! - Algarroba
It is giving me 1120, what does this mean? - Termless
What is giving you 1120? The wc? If so then I guess you have 1120 non-ASCII bytes in the file. - Mineraloid
Using tr -d is a very nice solution, since it allows for recognition of EBCDIC as well, using tr -d \\100-\\377. Neither file nor chardet can do EBCDIC properly! - Dominga
The first command returned 3 and the second nothing. :) - Scrubby
$ file --mime my.txt 
my.txt: text/plain; charset=iso-8859-1
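
If you only want the charset field, recent versions of file can print just that (same hypothetical file):

$ file -b --mime-encoding my.txt
iso-8859-1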
Isoelectronic answered 19/1, 2015 at 2:52 Comment(1)
I find it important to note that, as @Celada has already mentioned, file cannot guarantee that its detection is 100% correct. - Proportionable

(on Linux)

$ chardet <filename>

It also reports a confidence level in the range [0, 1] for its guess.
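
Sample output (hypothetical file name; depending on the package, the command may be installed as chardetect):

$ chardet foo.php
foo.php: utf-8 with confidence 0.99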

Quass answered 11/3, 2016 at 11:19 Comment(1)
chardet seems to be a Python wrapper around uchardet, the "Universal" character encoding detector. uchardet is available on macOS via Homebrew, although it doesn't give a confidence level. - Rupe

Based on @Celada's answer and @Arthur Zennig's, I have created this simple script:

#!/bin/bash

if [ "$#" -lt 1 ]
then
  echo "Usage: utf8-check filename"
  exit 1
fi

# Let chardet print its guess and confidence level first.
chardet "$1"

# Count bytes outside the ASCII range (0-127).
countchars="$(tr -d \\000-\\177 < "$1" | wc -c)"
if [ "$countchars" -eq 0 ]
then
  echo "ASCII"
  exit 0
fi

# Validate the file as UTF-8; iconv exits non-zero on an invalid sequence.
if iconv -f utf-8 -t ucs-4 < "$1" >/dev/null 2>&1
then
  echo "UTF-8"
else
  echo "not UTF-8 or corrupted"
fi
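
Usage, assuming the script is saved as utf8-check and made executable (the first output line comes from the chardet call inside the script; exact output will vary):

$ chmod +x utf8-check
$ ./utf8-check foo.php
foo.php: utf-8 with confidence 0.99
UTF-8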
Myriad answered 18/6, 2016 at 15:19 Comment(0)
