How to get vim to show a byte-by-byte representation of file data

3

14

I don't want vim ever to interpret my data in an encoding-specific way. In other words, when I'm in vim, I want the character under my cursor to correspond to one actual byte, not to a UTF-* (or other) decoding of that byte.

I need to use vim to analyze issues caused by Unicode conversion errors made by other people (using other software) so it's important that I see what is actually there.

For example, in Cygwin's vim, I have been able to see UTF-8 BOMs as

ï»¿ [START OF FILE DATA]

This is perfect. I recognize this as a UTF-8 BOM and if I want to know what the hex for each character is, I can put the cursor on the characters and use 'ga'.

I recently got a proper Linux machine (Fedora). In /etc/vimrc, this line exists

set fileencodings=ucs-bom,utf-8,latin1

When I look at a UTF-8 BOM on this machine, the BOM is completely hidden.

When I add the following line to ~/.vimrc

set fileencodings=latin1

I see



The first 3 characters are the BOM (when ga is used against them). I don't know what the last 3 characters are.

At one point, I even saw the UTF-8 BOM represented as "feff" - the UTF-16 BOM.
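The three displays described above (hidden, three separate characters, and "feff") are all consistent with the same three bytes being decoded in different ways. A minimal Python sketch of that idea follows; it is an analogy only, not a description of Vim's internals, and the sample content `hello` is made up for illustration.

```python
# The UTF-8 BOM is the code point U+FEFF encoded as three bytes.
BOM = b"\xef\xbb\xbf"
data = BOM + b"hello"  # hypothetical file content

# A decoder that recognises and strips the BOM: nothing visible remains.
hidden = data.decode("utf-8-sig")

# Plain UTF-8: the BOM survives as the single code point U+FEFF.
as_utf8 = data.decode("utf-8")

# Latin-1: each byte becomes its own character.
as_latin1 = data.decode("latin-1")

print(repr(hidden))                          # 'hello'
print(hex(ord(as_utf8[0])))                  # 0xfeff
print([hex(ord(c)) for c in as_latin1[:3]])  # ['0xef', '0xbb', '0xbf']
```

Under this reading, "feff" is plausibly the code point number of the decoded BOM rather than a sign that the file holds UTF-16 bytes.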

Anyway, you see my problem. I need to see exactly what is in my file without vim interpreting the bytes for me. I know I could use xxd, od, etc., but vim has always been very convenient as an analysis tool. Plus, I want to be able to edit the files and save them without any conversion problems.

Thanks for your help.

Toh answered 31/8, 2012 at 17:19 Comment(1)
Mind you: whenever someone writes, says or even thinks "UTF-8 BOM", a kitten gets killed. - Rosco
18

Use 'binary' mode:

:edit ++bin file

or

vim -b file

From :help 'binary':

The 'fileencoding' and 'fileencodings' options will not be used, the file is read without conversion.

Phila answered 31/8, 2012 at 17:48 Comment(1)
Thanks. It's a very logical suggestion but I get the same results. - Toh
7

I get some good mileage from doing :e ++enc=latin1 after loading the file (Vim's initial guess at the encoding isn't important at this stage).
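One reason latin1 suits this kind of inspection: it maps each of the 256 byte values to exactly one character, so decoding cannot fail and re-encoding gives back the same bytes. A short Python sketch of that property (an illustration only; it does not exercise Vim itself):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 decodes each byte to the code point with the same number.
text = all_bytes.decode("latin-1")
assert len(text) == 256
assert all(ord(ch) == b for ch, b in zip(text, all_bytes))

# Encoding back is lossless, so untouched bytes survive a decode/encode cycle.
round_trip = text.encode("latin-1")
print(round_trip == all_bytes)  # True

# By contrast, strict UTF-8 rejects byte sequences that are not valid UTF-8.
try:
    b"\xff\xfe".decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```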

Inimical answered 13/2, 2015 at 16:34 Comment(1)
This was super helpful. - Showdown
6

The sequence Ã¯Â»Â¿ is actually U+FEFF (the BOM) encoded as UTF-8, decoded as latin1, encoded as UTF-8 again, and decoded as latin1 again. ï»¿ is U+FEFF (the BOM) encoded as UTF-8 and decoded as latin1 once. You can't get away from encodings. Those aren't the actual bytes; they are the latin1 characters displayed from an incorrect decoding. If you want bytes, use a hex editor; otherwise, use the correct decoding.
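The encode/decode chain described here can be reproduced in a few lines. This is a minimal Python sketch, assuming strict codecs throughout; the variable names are illustrative.

```python
bom = "\ufeff"

# One wrong decode: the UTF-8 bytes read as Latin-1 give three characters.
once = bom.encode("utf-8").decode("latin-1")

# Repeat the mistake: re-encode those characters as UTF-8, read as Latin-1 again.
twice = once.encode("utf-8").decode("latin-1")

print([hex(ord(c)) for c in once])    # ['0xef', '0xbb', '0xbf']
print(len(twice))                     # 6
print(twice.encode("latin-1").hex())  # c3afc2bbc2bf

# Undoing the steps in reverse order recovers the original code point.
restored = (
    twice.encode("latin-1").decode("utf-8")
         .encode("latin-1").decode("utf-8")
)
print(restored == bom)  # True
```

The six-character result is what a doubly misdecoded BOM looks like, which is why the number of visible characters alone hints at how many times the data was converted.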

Sherlocke answered 1/9, 2012 at 0:59 Comment(0)
