Why does VIM disregard my file's BOM?

I need a file that I want to make sure is encoded with utf8.

So, I create the file

c:\> gvim umlaute.txt

In VIM I type the Umlaute:

äöü

I check the encoding ...

:set enc

(VIM echoes encoding=latin1)

and then I check the file encoding ...

:set fenc

(VIM echoes fileencoding=)

Then I write the file

:w

And check the file's size on the harddisk:

!dir umlaute.txt

(The size is 5 bytes.) That is of course expected: 3 bytes for the text and 2 for the line ending \x0d \x0a.

Ok, so I now set the encoding to

:set enc=utf8

The buffer gets weird

<e4><f6><fc>

I guess this is the hex representation of the ascii characters I previously typed in. So I rewrite them

äöü

Writing, checking size:

:w
:!dir umlaute.txt

This time, it's 8 bytes. I guess that makes sense: 2 bytes for every character plus \x0d \x0a.

Ok, so I want to make sure the next time I open the file it will be opened with encoding=utf8.

:set bomb
:w

:!dir umlaute.txt

11 Bytes. This is of course 8 (previous) Bytes + 3 Bytes for the BOM (ef bb bf).

So I

:quit

vim and open the file again

and check, if the encoding is set:

:set enc

But VIM insists it's encoding=latin1.

So, why is that? I would have expected the BOM to tell VIM that this is a UTF8 file.

Hurds answered 26/8, 2011 at 11:47 Comment(3)
I add this string to my files # vim: set fileencoding=utf-8 to make sure I get utf-8 in vimRendering
The problem is: I have so many (old and legacy) files that I am not sure if I want to mess with them just for this particular one.Lynnette
UTF-8 isn’t very BOM friendly. It is neither required nor recommended to put a BOM in a UTF-8 file. Everything I know that reads UTF-8 streams will treat a BOM as an extra character at the start of the data, not as a BOM metadatum.Prevocalic

You are confusing 'encoding', which is a global Vim setting, with 'fileencoding', which is local to each buffer.

When opening a file, the option 'fileencodings' (note the final s) determines which encodings Vim will try, in order, to open the file with. If the list starts with ucs-bom, then any file with a BOM will be opened properly, provided it parses correctly.
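To illustrate that lookup order, here is a rough Python sketch of the idea: check for a BOM first, then fall back to the remaining candidates until one decodes cleanly. This is not Vim's implementation. The function `detect`, the `CANDIDATES` list and the codec names are hypothetical and use Python's naming, not Vim's.

```python
import codecs

# Hypothetical stand-in for a 'fileencodings' list. "ucs-bom" is handled
# specially; the other entries are Python codec names, not Vim names.
CANDIDATES = ["ucs-bom", "utf-8", "latin-1"]

# BOM signatures, UTF-32 first so that ff fe 00 00 is not taken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect(data, candidates=CANDIDATES):
    """Return (encoding, has_bom, text) for the first candidate that fits."""
    for name in candidates:
        if name == "ucs-bom":
            for bom, enc in BOMS:
                if data.startswith(bom):
                    try:
                        # Strip the BOM and decode the rest.
                        return enc, True, data[len(bom):].decode(enc)
                    except UnicodeDecodeError:
                        break  # BOM found but the body does not parse
            continue
        try:
            return name, False, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits")


bom_file = codecs.BOM_UTF8 + "\u00e4\u00f6\u00fc".encode("utf-8") + b"\r\n"
print(detect(bom_file)[:2])               # BOM check comes first
print(detect(b"\xe4\xf6\xfc\r\n")[:2])    # no BOM, not UTF-8, falls through
print(detect(bom_file, ["latin-1"])[:2])  # without "ucs-bom" the BOM is ignored
```

The last call shows the situation to avoid: if the BOM check is not in the list, the three BOM bytes are simply decoded as text.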

If you want to change the encoding of a file, use :set fenc=<foo>. To add or remove the BOM, use :set bomb or :set nobomb. Then use :w to save.

Avoid changing enc after having opened a buffer; it could mess things up. enc determines what characters Vim can work with, and it has nothing to do with the files that you are working with.

Details

c:\> gvim umlaute.txt

You are opening vim, with a nonexistent file name. Vim creates a buffer, gives it that name, and sets fenc to an empty value since there is no file associated with it.

:set enc

(VIM echoes encoding=latin1)

This means that Vim stores the buffer contents internally as latin1, that is, ISO-8859-1.

and then I check the file encoding ...

:set fenc

(VIM echoes fileencoding=)

This is normal, there is no file for the moment.

Then I write the file

:w

Since 'fileencoding' is empty, it will write it to the disk using the internal encoding, latin1.

And check the file's size on the harddisk:

!dir umlaute.txt

(The size is 5 bytes.) That is of course expected: 3 bytes for the text and 2 for the line ending \x0d \x0a.

Ok, so I now set the encoding to

:set enc=utf8

WRONG! You are telling Vim that it must interpret the buffer contents as UTF-8. The buffer contains, in hexadecimal, e4 f6 fc plus the line ending, and the first three bytes are not a valid UTF-8 sequence. You should have typed :set fenc=utf-8 instead. That leaves the buffer alone and converts the text to UTF-8 when the file is written.
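The byte-level difference is easy to check outside Vim. The following Python sketch (the helper `hex_bytes` is just an illustrative name) shows that the latin1 bytes cannot be read as UTF-8, and what a proper conversion produces instead:

```python
TEXT = "\u00e4\u00f6\u00fc"  # the three umlauts: a-umlaut, o-umlaut, u-umlaut


def hex_bytes(data):
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


latin1_bytes = TEXT.encode("latin-1")
print(hex_bytes(latin1_bytes))  # e4 f6 fc

# Reinterpreting the same bytes as UTF-8 fails; nothing gets converted.
try:
    latin1_bytes.decode("utf-8")
    reinterpret_ok = True
except UnicodeDecodeError as err:
    reinterpret_ok = False
    print("not valid UTF-8:", err.reason)

# Converting means decoding with the old encoding, then encoding with the new.
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
print(hex_bytes(utf8_bytes))  # c3 a4 c3 b6 c3 bc
```

Reinterpreting keeps the bytes and changes their meaning; converting keeps the meaning and changes the bytes. Changing enc on a loaded buffer is the former, setting fenc before writing is the latter.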

The buffer gets weird

That's what happens when you force Vim to interpret text that is not valid UTF-8 as UTF-8.

I guess this is the hex representation of the ascii characters I previously typed in. So I rewrite them

äöü

Writing, checking size:

:w
:!dir umlaute.txt

This time, it's 8 bytes. I guess that makes sense: 2 bytes for every character plus \x0d \x0a.

Ok, so I want to make sure the next time I open the file it will be opened with encoding=utf8.

:set bomb
:w

:!dir umlaute.txt

11 Bytes. This is of course 8 (previous) Bytes + 3 Bytes for the BOM (ef bb bf).
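As an aside, the three sizes reported so far (5, 8 and 11 bytes) can be reproduced with a few lines of Python. This is only a sanity check of the arithmetic; `file_size` is a hypothetical helper, and it assumes a single line ending in CR LF, as in the question.

```python
import codecs

TEXT = "\u00e4\u00f6\u00fc"  # the three umlauts typed in the question
EOL = b"\r\n"                # DOS line ending: \x0d \x0a


def file_size(encoding, bom=False):
    """Size in bytes of TEXT plus one line ending, optionally with a UTF-8 BOM."""
    prefix = codecs.BOM_UTF8 if bom else b""  # ef bb bf
    return len(prefix + TEXT.encode(encoding) + EOL)


print(file_size("latin-1"))          # 5:  3 one-byte characters + 2
print(file_size("utf-8"))            # 8:  3 two-byte characters + 2
print(file_size("utf-8", bom=True))  # 11: 3-byte BOM + 8
```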

So I

:quit

vim and open the file again

and check, if the encoding is set:

:set enc

But VIM insists it's encoding=latin1.

Run :set fenc? to see which encoding Vim detected for your file. And if you want Vim to be able to work with Unicode files, set 'enc' to utf-8 in your vimrc.

Leaves answered 26/8, 2011 at 12:8 Comment(2)
Thanks a lot for your explanations. I appreciated them very much!Lynnette
@René Nyfenegger: You are welcome. I have the feeling that I help you on Vim while you help me on SQL*Plus (I have browsed your web site a lot Yesterday).Leaves

After many attempts, here is a working example:

    setglobal bomb 
    set fileencodings=ucs-bom,utf-8,cp1251,koi8-r,cp866
    set nobin
    set fileencoding=utf-8 bomb

and if you want to create a new file with a BOM:

    c:\> gvim umlaute.txt

it is working now!

Zeidman answered 22/12, 2011 at 6:31 Comment(0)

:help bomb reveals the following information:

When writing a file and the following conditions are met, a BOM (Byte Order Mark) is prepended to the file:

  • this option is on (edit: i.e. ':set bomb')
  • the 'binary' option is off
  • 'fileencoding' is "utf-8", "ucs-2", "ucs-4" or one of the little/big endian variants.

Some applications use the BOM to recognize the encoding of the file. Often used for UCS-2 files on MS-Windows. For other applications it causes trouble, for example: "cat file1 file2" makes the BOM of file2 appear halfway the resulting file. Gcc doesn't accept a BOM.

When Vim reads a file and 'fileencodings' starts with "ucs-bom", a check for the presence of the BOM is done and 'bomb' set accordingly. Unless 'binary' is set, it is removed from the first line, so that you don't see it when editing. When you don't change the options, the BOM will be restored when writing the file.

So try setting this in your .vimrc:

set fileencodings=ucs-bom,utf-8,latin1
set nobin
setglobal fileencoding=utf-8
Tobitobiah answered 26/8, 2011 at 12:5 Comment(0)
