How can I re-add a unicode byte order marker in linux?
H

7

14

I have a rather large SQL file which starts with the byte order marker FFFE. I have split this file into 100,000-line chunks using the Unicode-aware Linux split tool. But when I pass these back to Windows, it rejects every part other than the first one, because only the first part still carries the FFFE byte order marker.

How can I add this two byte code using echo (or any other bash command)?

Haematite answered 25/6, 2009 at 15:31 Comment(0)
I
3

Something like this (back up first):

for i in $(ls *.sql)
do
  cp "$i" "$i.temp"
  printf '\xFF\xFE' > "$i"
  cat "$i.temp" >> "$i"
  rm "$i.temp"
done
Indoeuropean answered 25/6, 2009 at 15:37 Comment(2)
printf! Thanks mate, I think I'd have been googling until the end of time!Haematite
The BOM codepoint is U+FEFF but its literal representation in UTF-8 is EF BB BF (three bytes). This would only work if the file was already in UTF-16, little endian order. See en.wikipedia.org/wiki/…Warmblooded
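As the comment above points out, prepending FF FE only makes sense for files that are already UTF-16 little-endian. A minimal sketch of a re-runnable variant follows: it checks the first two bytes and skips files that already start with the BOM. The helper name add_utf16le_bom is made up for illustration, and the sketch assumes head -c and mktemp are available (they are on GNU and BSD systems, but neither is required by POSIX). Octal escapes are used instead of \xFF\xFE because not every printf understands \x.

```shell
#!/bin/sh
# Sketch: prepend the UTF-16LE BOM (FF FE) unless the file already has it.
# add_utf16le_bom is a hypothetical helper name, not part of the answer above.
add_utf16le_bom() {
    file=$1
    # First two bytes as a hex string, e.g. "fffe".
    first=$(head -c 2 "$file" | od -An -tx1 | tr -d ' \n')
    if [ "$first" = "fffe" ]; then
        return 0    # BOM already present, nothing to do
    fi
    # Create the temp file next to the original so mv is a plain rename.
    tmp=$(mktemp "$file.XXXXXX") || return 1
    # \377\376 is FF FE in octal.
    if { printf '\377\376' && cat "$file"; } > "$tmp" && mv "$tmp" "$file"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}
```

Usage would be something like: for f in *.sql; do add_utf16le_bom "$f"; done. Note that the replaced file takes on the permissions of the mktemp file, so you may want to chmod it afterwards.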
S
17

Based on Anonymous's sed solution, sed -i '1s/^/\xef\xbb\xbf/' foo adds the BOM to the UTF-8 encoded file foo. Usefully, it also turns plain ASCII files into UTF-8 with BOM.

Synchromesh answered 21/9, 2012 at 15:20 Comment(0)
L
13

For a general-purpose solution—something that sets the correct byte-order mark regardless of whether the file is UTF-8, UTF-16, or UTF-32—I would use vim’s 'bomb' option:

$ echo 'hello' > foo
$ xxd < foo
0000000: 6865 6c6c 6f0a                           hello.
$ vim -e -s -c ':set bomb' -c ':wq' foo
$ xxd < foo
0000000: efbb bf68 656c 6c6f 0a                   ...hello.

(-e means run in ex mode instead of visual mode; -s means don’t print status messages; -c means “do this”)
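If you would rather not read the xxd dump by eye, the check can be scripted. The following is only a sketch: detect_bom is a made-up helper name, and it assumes head -c is available. It looks at the first four bytes and reports which BOM, if any, the file starts with.

```shell
#!/bin/sh
# Sketch: report which BOM, if any, a file starts with.
# detect_bom is a hypothetical helper name.
detect_bom() {
    # First four bytes as a hex string, e.g. "efbbbf68".
    hex=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case $hex in
        efbbbf*)  echo "UTF-8" ;;
        # The UTF-32LE BOM begins with the UTF-16LE one, so test it first.
        # A UTF-16LE file whose first character is U+0000 is reported as
        # UTF-32LE; the two cannot be told apart from the bytes alone.
        fffe0000) echo "UTF-32LE" ;;
        0000feff) echo "UTF-32BE" ;;
        fffe*)    echo "UTF-16LE" ;;
        feff*)    echo "UTF-16BE" ;;
        *)        echo "none" ;;
    esac
}
```

For the foo file in the example above, detect_bom foo would be expected to print none before the vim command and UTF-8 after it.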

Lekishalela answered 10/7, 2009 at 2:9 Comment(0)
S
12

To add BOMs to all the files that start with "foo-", you can use sed. sed has an option to make a backup.

sed -i '1s/^\(\xff\xfe\)\?/\xff\xfe/' foo-*

straceing this shows sed creates a temp file with a name starting with "sed". If you know for sure there is no BOM already, you can simplify the command:

sed -i '1s/^/\xff\xfe/' foo-*

Make sure you actually need the UTF-16 BOM, because other encodings (e.g. UTF-8) use a different byte sequence.

Stepha answered 22/3, 2012 at 1:7 Comment(3)
For UTF-8 use \xef\xbb\xbf; for UTF-16 little-endian use \xff\xfe; for UTF-16 big-endian use \xfe\xff. See w3.org/International/questions/qa-byte-order-markArmoire
Upvoting this answer because this is what I use myself. Mac OS and other BSD users should beware that the -i,--in-place option is not specified by POSIX and is only available with GNU sed.Superadd
BTW, the g (global) modifier doesn't do anything here.Superadd
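The byte sequences listed in the first comment can be wrapped in a small function so the encoding is chosen by name rather than by remembering the bytes. This is just a sketch: bom_for is a hypothetical helper, and the encoding names it accepts are my own choice. It uses octal escapes because \x is a GNU extension in sed and is not supported by every printf either.

```shell
#!/bin/sh
# Sketch: print the BOM bytes for a named encoding.
# bom_for and its accepted names are hypothetical, chosen for illustration.
bom_for() {
    case $1 in
        utf-8)    printf '\357\273\277' ;;   # EF BB BF
        utf-16le) printf '\377\376' ;;       # FF FE
        utf-16be) printf '\376\377' ;;       # FE FF
        *) echo "bom_for: unknown encoding: $1" >&2; return 1 ;;
    esac
}

# Usage: write a BOM-prefixed copy next to the original, e.g.
#   { bom_for utf-16le && cat foo-aa; } > foo-aa.bom
```

Writing to a separate file and renaming it afterwards also works with a sed that lacks -i.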
D
6

Try uconv

uconv --add-signature
Drumfire answered 2/8, 2012 at 19:34 Comment(1)
uconv needs to be installed (in Debian, it's in the libicu-dev package). Note that --add-signature doesn't work if the file is otherwise in a different encoding.Absent
S
2

Matthew Flaschen's answer is a good one; however, it has a few flaws.

  • There's no check that the copy succeeded before the original file is truncated. It would be better to make everything contingent on a successful copy, or to test for the existence of the temporary file, or to operate on the copy. If you're a belt-and-suspenders kind of person, you'd do a combination of these, as I've illustrated below.
  • The ls is unnecessary.
  • I'd use a better variable name than "i" - perhaps "file".

Of course, you could be very paranoid and check for the existence of the temporary file at the beginning so you don't accidentally overwrite it and/or use a UUID or a generated file name. One of mktemp, tempfile or uuidgen would do the trick.

td=$TMPDIR
export TMPDIR=

usertemp=~/temp            # set this to use a temp directory on the same filesystem
                           # you could use ./temp to ensure that it's on the same one
                           # you can use mktemp -d to create the dir instead of mkdir

if [[ ! -d $usertemp ]]    # if this user temp directory doesn't exist
then                       # then create it, unless you can't 
    mkdir "$usertemp" || export TMPDIR=$td  # if you can't create it and TMPDIR is/was
fi                                          # empty then mktemp automatically falls
                                            # back to /tmp

for file in *.sql
do
    # TMPDIR if set overrides the argument to -p
    temp=$(mktemp -p "$usertemp") || { echo "$0: Unable to create temp file."; exit 1; }

    { printf '\xFF\xFE' > "$temp" &&
    cat "$file" >> "$temp"; } || { echo "$0: Write failed on $file"; exit 1; }

    { rm "$file" && 
    mv "$temp" "$file"; } || { echo "$0: Replacement failed for $file"; exit 1; }
done
export TMPDIR=$td

Traps might be better than all the separate error handlers I've added.

No doubt all this extra caution is overkill for a one-shot script, but these techniques can save you when push comes to shove, especially in a multi-file operation.
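For what it's worth, the trap idea mentioned above could look roughly like this. It is a sketch rather than a drop-in replacement: prepend_bom_all is a made-up name, the temp file is created next to the original with mktemp "$file.XXXXXX" instead of in a separate temp directory, and the original is replaced by a single mv instead of rm followed by mv. The EXIT trap removes any half-written temp file however the function ends.

```shell
#!/bin/sh
# Sketch of a trap-based variant; prepend_bom_all is a hypothetical name.
# The body runs in a subshell (note the parentheses), so the trap and the
# exit calls stay local to the function and do not affect the caller.
prepend_bom_all() (
    temp=
    # Remove a leftover temp file whenever the subshell exits.
    trap '[ -n "$temp" ] && rm -f "$temp"' EXIT
    for file in "$@"; do
        # Same directory as the original, so mv is a rename, not a copy.
        temp=$(mktemp "$file.XXXXXX") || exit 1
        # \377\376 is FF FE in octal (the UTF-16LE BOM).
        { printf '\377\376' && cat "$file"; } > "$temp" || exit 1
        mv "$temp" "$file" || exit 1
        temp=    # nothing left to clean up for this file
    done
)
```

Call it as prepend_bom_all *.sql and check its exit status; a non-zero status means the file being processed at that point was left untouched.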

Silk answered 25/6, 2009 at 19:37 Comment(2)
The "cp" command is not needed. Also "mktemp" returns a name in /tmp; it would be better to write the temp file on the same filesystem so that "mv" will not have to copy it.Tectonics
@mark4o: You are correct on both counts. I've updated my answer accordingly.Silk
D
1
$ printf '\xEF\xBB\xBF' > bom.txt

Then check:

$ grep -rl $'\xEF\xBB\xBF' .
./bom.txt
Duralumin answered 19/10, 2017 at 2:35 Comment(0)
