How can I make Notepad save text in UTF-8 without the BOM?

7

29

I have a CSV file containing accented characters, and I save it in Notepad with the UTF-8 encoding selected. When I read the file in Java, the BOM characters are read too.

So I want Notepad to save this file as UTF-8 without prepending a BOM.

Otherwise, is there a built-in class in Java that strips the BOM characters present at the beginning of a file when reading its contents?

Tallbot answered 8/12, 2011 at 14:32 Comment(2)
Perhaps... don't use Notepad to deal with UTF-8 text? Try any of the multitude of other text editors, like Notepad++ or jEdit. - Sportsmanship
I want this feature in Notepad itself, as it is the only editor that comes with Microsoft Windows :) - Tallbot
38
  1. Use Notepad++ - it is free and much better than Notepad. It can save text without a BOM via Encoding -> Encode in UTF-8 without BOM:

    Notepad++ v6 and older: Screenshot of the Notepad++ Menubar -> Encoding -> Encode in UTF-8 without BOM menu in Notepad++ v6.7.9.2

    Notepad++ v7+:
    Screenshot of the Notepad++ Menubar -> Encoding -> Encode in UTF-8 without BOM menu in Notepad++ v7+

  2. When I encountered this problem in Java, I didn't find any library to parse these first three bytes (BOM). So my advice:

    • Use PushbackInputStream(in, 3).
    • Read the first three bytes.
    • If they are not the BOM (EF BB BF), push them back.
    • Process the stream as UTF-8.
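
The steps above could be sketched roughly as follows. This is a minimal illustration, not a library API: the class name BomSkipper and the method skipUtf8Bom are made up for the example, and readNBytes assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for illustration only.
final class BomSkipper {

    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        // readNBytes (Java 9+) keeps reading until 3 bytes arrive or the stream ends.
        int n = pushback.readNBytes(head, 0, head.length);
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            // Not a BOM: put back exactly the bytes that were read.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ',', 'b'};
        try (Reader reader = new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8)) {
            System.out.println((char) reader.read()); // first character after the BOM
        }
    }
}
```

Because the bytes are only dropped when they match EF BB BF exactly, a file saved without a BOM (or shorter than three bytes) passes through unchanged.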
Workingman answered 8/12, 2011 at 14:40 Comment(6)
I'm looking into this now. I will post here if I find a better way than stripping off bytes. The problem with stripping bytes blindly is that I can't be sure the files are always saved as UTF-8; they may be encoded in ANSI too. - Tallbot
You don't need to strip blindly. If you analyze the first two bytes and they are a BOM, there is a 99% probability that the file is in UTF-8. Only in that case should you cut them off. Anyway, please post your solution here when you find it. - Workingman
Worked for me! As soon as I saved it in Notepad++ the UTF-8 errors went away. - Babirusa
Erm... did anyone notice that the UTF-8 BOM is 3 bytes long and not 2 bytes? ;) It's 0xEF 0xBB 0xBF, so you will need to strip the first 3 bytes of the file! - Overpraise
@Tallbot The file command can detect UTF-8 without a BOM. Probably because there are byte sequences that are valid UTF-8 but not valid ASCII, like d7 90 (fileformat.info/info/unicode/char/05d0/index.htm). d7 isn't valid ASCII because ASCII (extended ASCII aside) is 0-127, so 0-7f doesn't include d7. - Birdie
Currently Notepad++ describes that (UTF-8 without BOM) as UTF-8, in contrast to UTF-8 with BOM. imgur.com/a/wALij - Birdie
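
The BOM-less detection discussed in these comments can be approximated in Java by decoding strictly and seeing whether it fails. This is only a sketch: Utf8Check and isValidUtf8 are names invented for the example, and the check cannot tell UTF-8 from "ANSI" for pure ASCII input, where the two are byte-identical anyway.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for illustration only.
final class Utf8Check {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] hebrewAlef = {(byte) 0xD7, (byte) 0x90}; // U+05D0 encoded as UTF-8
        byte[] ansiEAcute = {'c', 'a', 'f', (byte) 0xE9}; // "café" in a single-byte code page
        System.out.println(isValidUtf8(hebrewAlef));
        System.out.println(isValidUtf8(ansiEAcute));
    }
}
```

A file that fails this check is not UTF-8; a file that passes and contains bytes above 0x7F is very likely UTF-8, since text in a single-byte code page rarely happens to form valid multi-byte sequences.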
11

I just learned from this Stack Overflow post, as @martin-geisler points out, that you can save files without the BOM in Windows Notepad, by selecting ANSI as the encoding.

I assume this won't work for more advanced uses, because the resulting file is not in the intended encoding but actually in ANSI; still, I tested and confirmed that it works for saving a very small .php script without a BOM using only Notepad.
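
To illustrate why the ANSI trick is safe for an ASCII-only script but not for accented text, here is a small Java comparison. It is a sketch under one stated assumption: ISO-8859-1 stands in for Notepad's "ANSI" code page (typically Windows-1252), which encodes 0x00-0x7F identically. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, named for illustration only.
final class AnsiVsUtf8 {

    /** Returns true if the text has the same bytes in a single-byte code page and in UTF-8. */
    static boolean sameBytes(String text) {
        // Assumption: ISO-8859-1 is used here as a stand-in for Notepad's "ANSI" encoding.
        byte[] ansi = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ansi, utf8);
    }

    public static void main(String[] args) {
        System.out.println(sameBytes("<?php echo 1; ?>")); // ASCII only
        System.out.println(sameBytes("caf\u00e9"));        // contains U+00E9 (e with acute)
    }
}
```

For the ASCII-only script the two encodings produce the same bytes, so the file saved as ANSI is also valid BOM-less UTF-8. For the accented word they differ (one byte 0xE9 versus the two bytes 0xC3 0xA9), so a UTF-8 reader would misread the ANSI file.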

I learned the long, hard way that Windows' Notepad is not a true editor, although I'd like to point out for others that, despite this, it is misleadingly called up when you type "editor" on newer Windows machines, at least on one of mine.

I am currently using Emacs and other editors to solve this problem.

Goodhen answered 11/5, 2013 at 14:4 Comment(3)
Choosing ANSI in Notepad++ worked for me, but encoding it as UTF-8 without BOM didn't. - Pitterpatter
I've found that special characters in text files can change the encoding if edited in Word. For example, we had an .xml file with a comment that someone had copied and pasted from an email/MS Word, which caused the UTF-8 file to change to UTF-8-BOM. I removed the special characters and was able to verify that Notepad then saved the file as UTF-8 without BOM. - Mixup
Note that for any file containing only the base 128 ASCII characters (0x00-0x7F), UTF-8 is exactly identical to "ANSI". - Monet
9

Use Notepad++ instead. See my personal blog post on it. From within Notepad++, choose the "Encoding" menu, then "Encode in UTF-8 without BOM".

Spectator answered 8/12, 2011 at 14:38 Comment(2)
I am aware of Notepad2 and Notepad++. I want to do that in Notepad itself. - Tallbot
Standard Windows Notepad is not a true editor, and doesn't support any options around the BOM functionality. If you don't want to use another editor, you will need to follow the advice of one of the other answers here to properly handle the BOM within the Java code. - Spectator
9

Notepad on Windows 10 version 1903 (May 2019 update) and later versions supports saving to UTF-8 without a BOM. In fact, UTF-8 is the default file format now.

Screenshot of Notepad

Reference: Windows 10 Notepad is Getting Better UTF-8 Encoding Support

Bobcat answered 25/7, 2019 at 21:51 Comment(0)
0

The answer is: Not at all. Notepad can't do that.

In Java you can just skip the first three bytes (the BOM) in your InputStream and be done.
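
An alternative to counting bytes is to decode the stream as UTF-8 first and drop a leading U+FEFF character, which is what the BOM decodes to, since Java's UTF-8 decoder passes it through rather than stripping it. A minimal sketch, with BomAwareReader and openUtf8 being names invented for the example:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named for illustration only.
final class BomAwareReader {

    /** Opens the stream as UTF-8 text and skips a leading U+FEFF if present. */
    static BufferedReader openUtf8(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);                 // remember the start of the text
        if (reader.read() != '\uFEFF') {
            reader.reset();             // first character was real data (or end of stream)
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd', ',', 'n', 'a', 'm', 'e'};
        try (BufferedReader reader = openUtf8(new ByteArrayInputStream(csv))) {
            System.out.println(reader.readLine()); // header line without the BOM character
        }
    }
}
```

As with the byte-level approach, nothing is skipped unless the BOM is actually there, so files saved without one are read unchanged.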

Stotts answered 8/12, 2011 at 14:37 Comment(6)
Notepad adds some invisible bytes at the beginning of the file to identify the byte order in which the current file is encoded. - Tallbot
Then just skip the appropriate bytes. If Notepad adds them and you want to stick with Notepad, then skip them and everything is fine. - Stotts
I will look for a solution other than stripping off bytes. If nothing else is feasible, then I must strip off the bytes. I can't be sure the files are always saved as UTF-8; they may be encoded in ANSI too. - Tallbot
@Tallbot Then you want the BOM to be there, so you can distinguish between UTF-8 and ANSI. - Stotts
@Tallbot It's not so much that Notepad adds the BOM to Unicode files, as it is that Windows in general frequently tends to use the various Unicode BOMs as a general-purpose Unicode signature, effectively turning them into magic numbers that serve as its preferred way to detect Unicode encodings when applicable. This is probably because checking for 2-4 specific bytes is more efficient than using heuristics to detect Unicode, but annoying because it breaks anything that doesn't understand the BOM; the option should be provided to save without the BOM. - Monet
It is strange, though, that Notepad can run heuristics to detect Unicode (including UTF-8) even without a BOM, but doesn't provide the option to create Unicode files without BOMs. - Monet
0

You might want to try out Notepad2 or Notepad++. Those Notepad replacements let you choose whether to output a BOM.

As for a Java solution, as far as I know, Java's UTF-8 handling does not recognize and strip the BOM on its own. I googled and found "Java's UTF-8 and Unicode writing is broken - Use this fix", which might be the solution.

Homerus answered 8/12, 2011 at 14:39 Comment(0)
0

We're using the utility BOMStripperInputStream.java to strip the BOM from our input if present.

Liquorice answered 8/12, 2011 at 14:42 Comment(0)