Adding BOM to UTF-8 files
72

I'm looking (so far without success) for a script that can run as a batch job and prepend a BOM to a UTF-8 text file if it doesn't already have one.

Neither the language it is written in (Perl, Python, C, bash) nor the OS it runs on matters to me. I have access to a wide range of computers.

I've found a lot of scripts that do the reverse (strip the BOM), which seems kind of silly to me, since many Windows programs have trouble reading UTF-8 text files that lack a BOM.

Did I miss the obvious?

Aircrewman answered 27/6, 2010 at 13:14 Comment(0)
66

The easiest way I found for this is

#!/usr/bin/env bash

# Add the BOM to the new file
printf '\xEF\xBB\xBF' > with_bom.txt

# Append the content of the source file to the new file
cat source_file.txt >> with_bom.txt

I know it uses an external program (cat), but it does the job easily in bash.

Tested on OS X; it should work on Linux as well.

NOTE: this assumes that the file doesn't already have a BOM (!)
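
One way to guard against that assumption is to look at the first three bytes before prepending anything. Here is a minimal Python sketch of such a check; the function name has_utf8_bom is made up for illustration, while codecs.BOM_UTF8 is the standard-library constant for the bytes EF BB BF:

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        # Read only as many bytes as the BOM is long (three).
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

A wrapper script could call this first and run the printf/cat step only when it returns False. It only inspects the first three bytes; it does not verify that the rest of the file is valid UTF-8.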

Nettienetting answered 24/5, 2016 at 22:48 Comment(1)
Tip: printf '\xEF\xBB\xBF' | cat - source.txt | sponge source.txt does it in place, using the "sponge" tool from moreutils. - Jordanna
53

I wrote this addbom.sh using the 'file' command and ICU's 'uconv' command.

#!/bin/sh

if [ $# -eq 0 ]
then
        echo usage $0 files ...
        exit 1
fi

for file in "$@"
do
        echo "# Processing: $file" 1>&2
        if [ ! -f "$file" ]
        then
                echo Not a file: "$file" 1>&2
                exit 1
        fi
        TYPE=$(file - < "$file" | cut -d: -f2)
        if echo "$TYPE" | grep -q '(with BOM)'
        then
                echo "# $file already has BOM, skipping." 1>&2
        else
                # Keep the original as "$file~"; braces (not a subshell) make exit 1 stop the script
                ( mv "${file}" "${file}~" && uconv -f utf-8 -t utf-8 --add-signature < "${file}~" > "${file}" ) || { echo "Error processing $file" 1>&2 ; exit 1; }
        fi
done

edit: Added quotes around the mv arguments. Thanks @DirkR and glad this script has been so helpful!

Illusory answered 20/7, 2010 at 19:58 Comment(5)
Absolutely perfect! A lot better than what I came up with. Many thanks. - Aircrewman
"$@" is better than $* here. This will keep arguments with spaces intact (useful on Windows + Cygwin). - Modiolus
The mv also needs "" or it won't work with path names containing spaces. Nice script, thanks! - Pogge
A question came in about how to use this on subdirectories. You can probably use it like this: find . -type f -print0 | xargs -0 addbom.sh, which will call the addbom.sh script on every file in the current directory and all its subdirectories. - Illusory
If you are on macOS, you need to install icu4c via Homebrew (brew install icu4c), use brew list icu4c | grep uconv to find the path to the uconv executable, and replace it in the script. - Pattiepattin
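
For the recursive case raised in the comments, it can help to see which files would be touched before converting anything. The following Python sketch walks a directory tree and lists the files that do not start with the UTF-8 BOM. The function name files_without_bom is hypothetical, and the sketch does not check that a file is actually UTF-8 text:

```python
import codecs
import os


def files_without_bom(root):
    """Return the sorted paths of files under `root` that lack a UTF-8 BOM."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                # Compare the first three bytes with EF BB BF.
                if f.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
                    missing.append(path)
    return sorted(missing)
```

The resulting list could then be handed to addbom.sh or a similar tool.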
29

(Answer based on https://mcmap.net/q/275915/-how-can-i-re-add-a-unicode-byte-order-marker-in-linux by yingted)

To add BOMs to all the files whose names start with "foo-", you can use sed. sed's -i option can also keep a backup of each file if you give it a suffix (e.g. -i.bak).

sed -i '1s/^\(\xef\xbb\xbf\)\?/\xef\xbb\xbf/' foo-*

If you know for sure there is no BOM already, you can simplify the command:

sed -i '1s/^/\xef\xbb\xbf/' foo-*

Make sure UTF-8 is really what you need, because e.g. UTF-16 uses a different BOM (otherwise check How can I re-add a unicode byte order marker in linux?)

Succussion answered 4/3, 2016 at 22:19 Comment(5)
For UTF-8 use \xef\xbb\xbf; for UTF-16 little-endian use \xff\xfe; for UTF-16 big-endian use \xfe\xff. See w3.org/International/questions/qa-byte-order-mark for details. - Foxworth
This did not work for me on Mac. The command line sed -i '1s/^/\xef\xbb\xbf/' temp.csv gave me the error: sed: 1: "temp.csv": undefined label 'emp.csv'. - Granulation
@PerLundberg you could try to troubleshoot. Try sed '1s/asdfasdfasdf//' blah.csv. Leaving out -i makes it very safe, because the input file stays unchanged and the result goes to the console. That command looks at line one, searches for the string asdfasdfasdf and replaces it with nothing, i.e. deletes that string. Then try making it ^asdfasdfasdf. The ^ marks the beginning of the line; maybe that's causing the issue for some reason. Perhaps you need to pass a switch to sed to get it to use the ^, maybe -E, though I don't know. - Snowbound
@PerLundberg I had the same problem with macOS 10.13, and after a lot of fiddling I found that sed -i '' $'1s/^/\xef\xbb\xbf/' foo-* works. - Isotope
I'm probably doing something wrong, but it doesn't seem to work for me on Mac:
tom@vogon sbf-cpp % ls -l temp2.cpp
-rw-r--r-- 1 tom staff 9 Apr 21 22:20 temp2.cpp
tom@vogon sbf-cpp % sed -i '' '1s/^\(\xef\xbb\xbf\)\?/\xef\xbb\xbf/' *.cpp
tom@vogon sbf-cpp % ls -l temp2.cpp
-rw-r--r-- 1 tom staff 9 Apr 21 22:21 temp2.cpp
tom@vogon sbf-cpp % sed -i '' '1s/^\(\xef\xbb\xbf\)\?/\xef\xbb\xbf/' temp2.cpp
tom@vogon sbf-cpp % ls -l temp2.cpp
-rw-r--r-- 1 tom staff 9 Apr 21 22:21 temp2.cpp
- Antimonous
26

As an improvement on Yaron U.'s solution, you can do it all on a single line:

printf '\xEF\xBB\xBF' | cat - source.txt > source-with-bom.txt

The "cat -" part tells cat to read standard input first, so the bytes piped in from the printf command end up in front of the contents of source.txt. Tested on OS X and Ubuntu.

Underwater answered 6/11, 2018 at 2:5 Comment(2)
Tip: printf '\xEF\xBB\xBF' | cat - source.txt | sponge source.txt does it in place, using the "sponge" tool from moreutils. - Jordanna
I hadn't seen sponge before. It doesn't look like it's natively part of macOS. Additionally, "Unlike a shell redirect, sponge soaks up all its input before opening the output file. This allows constructing pipelines that read from and write to the same file." Cool. - Underwater
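
Where sponge isn't available, the same "finish writing before touching the original" idea can be approximated by writing to a temporary file and renaming it over the source. This is a rough Python sketch under that assumption, not a description of how sponge works internally; the name prepend_bom_in_place is hypothetical, and like the one-liner it assumes the file has no BOM yet:

```python
import codecs
import os
import shutil
import tempfile


def prepend_bom_in_place(path):
    """Put a UTF-8 BOM in front of the file at `path`; assumes it has none yet."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp, open(path, "rb") as src:
            tmp.write(codecs.BOM_UTF8)    # the three BOM bytes first
            shutil.copyfileobj(src, tmp)  # then the original contents
        os.replace(tmp_path, path)        # swap the finished copy into place
    except BaseException:
        os.remove(tmp_path)               # drop the partial temp file on failure
        raise
```

Note that the replacement file gets the temp file's permissions, so the original file mode is not preserved by this sketch.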
5

Open the file in Notepad. Click Save As. Under Encoding, select "UTF-8(BOM)" (it is listed below plain "UTF-8").

Reticulum answered 29/7, 2022 at 0:42 Comment(2)
Hi Timothy, welcome to Stack Overflow. You are right that this could be an approach, though the author asks for a script and not a manual step. - Sulfonal
@AmitDash this is a common misunderstanding. Answers that address only the headline, and not what the OP asked in more detail, are also perfectly OK, because of how people find these articles through Google search. - Nidify
3

I find it pretty simple. Assuming the file is always UTF-8 (you're not detecting the encoding, you know the encoding):

Read the first three bytes. Compare them to the UTF-8 BOM sequence (Wikipedia says it's 0xEF, 0xBB, 0xBF). If they match, write them to the new file and then copy everything else from the original file to the new file. If they differ, first write the BOM, then write the three bytes you read, and only then copy everything else from the original file to the new file.

In C, fopen/fclose/fread/fwrite should be enough.
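
The steps above can be sketched in a few lines. The answer suggests C; the version below uses Python instead, purely as an illustration, and the function name copy_with_bom is made up. It assumes the source and destination are different paths:

```python
import codecs
import shutil


def copy_with_bom(src_path, dst_path):
    """Copy src_path to dst_path so that dst_path starts with a UTF-8 BOM."""
    bom = codecs.BOM_UTF8  # the bytes EF BB BF
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        head = src.read(len(bom))     # first three bytes (fewer if the file is shorter)
        if head != bom:
            dst.write(bom)            # no BOM yet: write one first
        dst.write(head)               # then the bytes already read
        shutil.copyfileobj(src, dst)  # then everything else
```

Because the files are opened in binary mode, no decoding takes place; the content is copied byte for byte, with the BOM added only when it is missing.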

Trophic answered 27/6, 2010 at 13:18 Comment(0)
1

In VBA (Access):

    Dim name As String
    Dim tmpName As String
    
    tmpName = "tmp1.txt"
    name = "final.txt"

    Dim file As Object
    Dim finalFile As Object
    Set file = CreateObject("Scripting.FileSystemObject")

    Set finalFile = file.CreateTextFile(name)
 
    
    'Add BOM
    finalFile.Write Chr(239)
    finalFile.Write Chr(187)
    finalFile.Write Chr(191)
    
    'transfer text from tmp to final file:
    Dim tmpFile As Object
    Set tmpFile = file.OpenTextFile(tmpName, 1)
    finalFile.Write tmpFile.ReadAll
    finalFile.Close
    tmpFile.Close
    file.DeleteFile tmpName
Midyear answered 27/11, 2020 at 11:9 Comment(0)
0

I've created a script based on Steven R. Loomis's code. https://github.com/Vdragon/addUTF-8bomb

Check out https://github.com/Vdragon/C_CPP_project_template/blob/development/Tools/convertSourceCodeToUTF-8withBOM.bash.sh for an example of using this script.

Brunhilde answered 23/6, 2014 at 9:8 Comment(0)
0

Here is the batch file I use for this purpose in Windows. It should be saved with ANSI (Windows-1252) encoding for the /p= part.

@echo off
if [%~1]==[] goto usage
if not exist "%~1" goto notfound

setlocal
set /p AREYOUSURE="Adding UTF-8 BOM to '%~1'. Are you sure (Y/[N])? "
if /i "%AREYOUSURE%" neq "Y" goto canceled

:: Main code is here. Create a temp file containing the BOM, then append the requested file contents, and finally overwrite the original file
:: The three characters after /p= are the BOM bytes EF BB BF as displayed in Windows-1252
(echo|set /p=ï»¿)>"%~1.temp"
type "%~1">>"%~1.temp"
move /y "%~1.temp" "%~1" >nul

@echo Added UTF-8 BOM to "%~1"
pause
exit /b 0

:usage
@echo Usage: %0 ^<FILE_NAME^>
goto end

:notfound
@echo File not found: "%~1"
goto end

:canceled
@echo Operation canceled.
goto end

:end
pause
exit /b 1

You can save the file as e.g. C:\addbom.bat and use the following .reg file to add it to right-click context menu of all files:

Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\*\Shell\Add UTF-8 BOM]

[HKEY_CLASSES_ROOT\*\Shell\Add UTF-8 BOM\command]
@="C:\\addbom.bat \"%1\""

Kellda answered 19/12, 2022 at 14:56 Comment(0)
-2

This is a one-line solution that works natively, without you having to manage any temp files:

macOS:

sed -i '' $'1s/^/\xEF\xBB\xBF/' filename.txt

Other Unix systems:

sed -i '1s/^/\xEF\xBB\xBF/' filename.txt

There's a quirk in how macOS's implementation of sed handles -i: it expects a backup suffix as a separate argument, and passing an empty string ('') as above tells it not to keep a backup.

Note: ChatGPT 4 helped with this.

Cajun answered 13/9, 2023 at 3:26 Comment(0)
