How to remove multiple UTF-8 BOM sequences

I'm using PHP 5 (CGI) to output template files from the filesystem, and I'm having trouble getting the raw HTML out cleanly.

private function fetch($name) {
    $path = $this->j->config['template_path'] . $name . '.html';
    if (!file_exists($path)) {
        dbgerror('Could not find the template "' . $name . '" in ' . $path);
    }
    $f = fopen($path, 'r');
    $t = fread($f, filesize($path));
    fclose($f);
    if (substr($t, 0, 3) == b'\xef\xbb\xbf') {
        $t = substr($t, 3);
    }
    return $t;
}

Even though I've added the BOM fix I'm still having problems with Firefox accepting it. You can see a live copy here: http://ircb.in/jisti/ (and the template file I threw at http://ircb.in/jisti/home.html if you want to check it out)

Any idea how to fix this? o_o

Shearer answered 24/4, 2012 at 2:4 Comment(2)
A UTF-8 file shouldn't need a BOM. If your editor inserts one, there should be a setting to omit it; if your editor won't let you save without a BOM, replace your editor.Castrato
yeah. I use n++, and I tried without BOMShearer
L
180

You can use the following code to remove a UTF-8 BOM:

//Remove UTF8 Bom

function remove_utf8_bom($text)
{
    $bom = pack('H*','EFBBBF');
    $text = preg_replace("/^$bom/", '', $text);
    return $text;
}
Laurasia answered 15/3, 2013 at 2:55 Comment(5)
For some reason in the Google+ API, this BOM shows up at the end of the content variable, so I needed to tweak this to remove it from the end of the string.Anastatius
Can someone explain how the pack function is used here? I know it converts a string to a binary representation, but I'm struggling to understand how this helps with identifying the BOM Unicode character.Bittencourt
This worked great for my requirement to read the CSV output from SSRS and append to a larger file.Refuse
I used this with trim to cleanse copy/pasted form data like this: $bom = pack('H*','EFBBBF'); $replacementChars = " \n\r\t\v\0" . $bom; $cleanVar = trim($dirtyVar, $replacementChars);.Irriguous
@fsociety The BOM is three bytes: 0xef 0xbb 0xbf. So pack is using a format of H*, which means "interpret the whole string as hexadecimal digits, two per byte". I prefer o1max's answer (although it has a lower score), which simply uses a string with escape characters: "\xEF\xBB\xBF"Twayblade
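
To make the pack discussion in the comments concrete, here is a minimal sketch (variable names are illustrative only). It checks that pack('H*', 'EFBBBF') yields the same three bytes as the escaped string literal, and it shows a caveat of the trim variant: trim treats its second argument as a set of single bytes, not as a sequence, so it can cut into a multi-byte character that happens to end in one of the BOM bytes.

```php
<?php
// pack('H*', ...) turns a hex string into raw bytes, two hex digits per byte.
$bom = pack('H*', 'EFBBBF');

var_dump(strlen($bom));             // int(3)
var_dump($bom === "\xEF\xBB\xBF");  // bool(true)
var_dump(bin2hex($bom));            // string(6) "efbbbf"

// Caveat for the trim() variant: the character list is a set of bytes.
// "\xC2\xBF" is the UTF-8 encoding of U+00BF (inverted question mark).
$chars = " \n\r\t\v\0" . $bom;
$text  = "\xC2\xBF";
var_dump(bin2hex(trim($text, $chars))); // string(2) "c2", no longer valid UTF-8
```

If the input may start or end with such characters, comparing the first three bytes as a whole is the safer check.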
S
58

try:

// -------- read the file-content ----
$str = file_get_contents($source_file); 

// -------- remove the utf-8 BOM ----
$str = str_replace("\xEF\xBB\xBF",'',$str); 

// -------- get the Object from JSON ---- 
$obj = json_decode($str); 

:)

Shanta answered 18/9, 2013 at 11:19 Comment(0)
A
20

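Another way to remove the BOM, which is the Unicode code point U+FEFF (note that this pattern is not anchored, so it removes every U+FEFF in the string, not only a leading one):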

$str = preg_replace('/\x{FEFF}/u', '', $file);
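
If only leading BOMs should go, the same idea can be anchored to the start of the string. This is a sketch rather than a definitive implementation; the function name strip_leading_bom_chars is made up for illustration.

```php
<?php
// Hypothetical helper: strips one or more U+FEFF code points at the start only.
function strip_leading_bom_chars(string $text): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{FEFF} matches the three-byte sequence EF BB BF.
    $result = preg_replace('/^\x{FEFF}+/u', '', $text);

    // preg_replace() returns null on error (for example, invalid UTF-8 input);
    // fall back to the untouched input in that case.
    return $result ?? $text;
}
```

A U+FEFF in the middle of the text is left alone, and input that is not valid UTF-8 is returned unchanged instead of becoming null.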
Arsonist answered 19/6, 2014 at 17:3 Comment(0)
C
8

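In single quotes, escape sequences are not interpreted, so b'\xef\xbb\xbf' is the literal 12-character string \xef\xbb\xbf (backslashes included), not three bytes. If you want to check for a BOM, you need to use double quotes, so the \x sequences are actually interpreted into bytes: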

"\xef\xbb\xbf"
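
The difference is easy to verify. The following sketch uses made-up variable names and a sample string standing in for the template contents:

```php
<?php
$single = '\xef\xbb\xbf';  // escapes are NOT interpreted in single quotes
$double = "\xef\xbb\xbf";  // escapes become the bytes EF BB BF

var_dump(strlen($single)); // int(12)
var_dump(strlen($double)); // int(3)

// Sample input standing in for a template that starts with a BOM.
$t = "\xEF\xBB\xBF<!DOCTYPE html>";
var_dump(substr($t, 0, 3) == $single); // bool(false): this check never matches
var_dump(substr($t, 0, 3) == $double); // bool(true)
```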

Your files also seem to contain a lot more garbage than just a single leading BOM:

$ curl http://ircb.in/jisti/ | xxd

0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef  ................
0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068  .....<!DOCTYPE h
0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561  tml>.<html>.<hea
...
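
Since the dump shows several BOMs in a row, removing a single one is not enough for this output. A minimal sketch that strips every leading BOM without a regular expression (the function name strip_leading_boms is hypothetical):

```php
<?php
// Hypothetical helper: removes every BOM at the start of $text,
// however many there are.
function strip_leading_boms(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    $offset = 0;

    // Advance three bytes at a time while the next three bytes are a BOM.
    while (substr($text, $offset, 3) === $bom) {
        $offset += 3;
    }

    // Cast guards against substr() returning false on older PHP versions.
    return $offset > 0 ? (string) substr($text, $offset) : $text;
}
```

This only treats the symptom; it is still worth finding out where the extra bytes come from.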
Clougher answered 24/4, 2012 at 2:7 Comment(3)
if I was using n++, why would it cause this? it's saving it as unix/utf8 -bomShearer
Save it as UTF-8 NO BOM (or whatever it's called in N++).Clougher
I did and I'm still getting the same result. I curl'd the direct file (curl ircb.in/jisti/home.html | xxd) and got no leading characters, but curl'ing the PHP script adds the excess data in the front and all I'm using is print to output the data.Shearer
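
One possible explanation, offered as an assumption and not a diagnosis: if the template file is clean but the script output is not, the BOMs may sit in the PHP source files themselves. Anything before the opening PHP tag, including a BOM, is sent to the browser as output, once per included file. The sketch below (the helper name file_has_utf8_bom is made up) lists every file loaded so far that starts with a BOM.

```php
<?php
// Hypothetical helper: true if the file at $path starts with the UTF-8 BOM.
function file_has_utf8_bom(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $head = fread($handle, 3);
    fclose($handle);

    return $head === "\xEF\xBB\xBF";
}

// Report every script loaded so far (including this one) that starts with a BOM.
foreach (get_included_files() as $file) {
    if (file_has_utf8_bom($file)) {
        echo "BOM found in: $file\n";
    }
}
```

Placing the loop at the end of the main script would cover all files pulled in by include or require during the request.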
C
6

If you are importing a CSV file, the code below is useful:

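$header = fgetcsv($handle);
$bom = pack('H*','EFBBBF'); // the three UTF-8 BOM bytes, built once
foreach($header as $key => $val) {
     // Strip a leading BOM (in practice only the first cell can carry one)
     $header[$key] = preg_replace("/^$bom/", '', $val);
}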
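
An alternative is to skip the BOM on the stream before fgetcsv ever sees it, so no cell needs cleaning afterwards. This is only a sketch under the assumption that the CSV is read from a seekable local file; the function name open_csv_without_bom and the path in the usage comment are placeholders.

```php
<?php
// Hypothetical helper: opens a CSV file and leaves the stream positioned
// just after a leading UTF-8 BOM, if there is one.
// Returns the file handle, or false if the file cannot be opened.
function open_csv_without_bom(string $path)
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    // Consume the first three bytes; go back to the start if they are not a BOM.
    if (fread($handle, 3) !== "\xEF\xBB\xBF") {
        rewind($handle);
    }

    return $handle;
}

// Usage (placeholder path):
//   $handle = open_csv_without_bom('import.csv');
//   $header = fgetcsv($handle, 0, ',', '"', '\\');
```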
Camire answered 18/7, 2018 at 6:10 Comment(0)
L
5

This global function normalizes a string for systems whose base charset is UTF-8. Thanks!

function prepareCharset($str) {

    // set default encode
    mb_internal_encoding('UTF-8');

    // pre filter
    if (empty($str)) {
        return $str;
    }

    // get charset
    $charset = mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8', 'ASCII'));

    if (stristr($charset, 'utf') || stristr($charset, 'iso')) {
        $str = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', utf8_decode($str));
    } else {
        $str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    }

    // remove the byte pair C2 81 (U+0081); note that a UTF-8 BOM would urlencode to %EF%BB%BF instead
    $str = urldecode(str_replace("%C2%81", '', urlencode($str)));

    // prepare string
    return $str;
}
Leesaleese answered 22/6, 2016 at 15:13 Comment(0)
M
4

An extra method to do the same job:

function remove_utf8_bom_head($text) {
    if(substr(bin2hex($text), 0, 6) === 'efbbbf') {
        $text = substr($text, 3);
    }
    return $text;
}

The other methods I found did not work in my case.

Hope it helps in some special case.

Mytilene answered 7/11, 2016 at 4:53 Comment(0)
D
3

A solution without pack function:

$a = "\xEF\xBB\xBF1"; // "1" preceded by a UTF-8 BOM, which is invisible when pasted literally
var_dump($a); // string(4) "1"

function deleteBom($text)
{
    return preg_replace("/^\xEF\xBB\xBF/", '', $text);
}

var_dump(deleteBom($a)); // string(1) "1"
Dewan answered 18/2, 2019 at 9:6 Comment(1)
If they can show up more than once, you might want to use "/^(\xEF\xBB\xBF)+/"Crepuscule
K
2

I'm not so fond of using preg_replace or preg_match for simple tasks. What about this alternative method of detecting and removing the BOM?

function remove_utf8_bom(string $text): string
{
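    // Pass the encoding explicitly so the check does not depend on mb_internal_encoding()
    $bomStart = mb_substr($text, 0, 1, 'UTF-8');
    return ($bomStart === pack('H*','EFBBBF')) ?
        mb_substr($text, 1, null, 'UTF-8') :
        $text;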
}
Kraus answered 5/7, 2021 at 8:59 Comment(0)
C
2

How about this:

  function removeUTF8BomHeader($data) {
    if (substr($data, 0, 3) == pack('CCC', 0xef, 0xbb, 0xbf)) {
      $data = substr($data, 3);
    }

    return $data;
  }

I have tested this extensively and it works without any issues.

Cyclograph answered 12/11, 2022 at 6:49 Comment(0)
F
1

If you are reading from an API using file_get_contents and get an inexplicable NULL from json_decode, check the value of json_last_error(): sometimes the value returned from file_get_contents has an extraneous BOM that is almost invisible when you inspect the string, but makes json_last_error() return JSON_ERROR_SYNTAX (4).

>>> $json = file_get_contents("http://api-guiaserv.seade.gov.br/v1/orgao/all");
=> "\t{"orgao":[{"Nome":"Tribunal de Justi\u00e7a","ID_Orgao":"59","Condicao":"1"}, ...]}"
>>> json_decode($json);
=> null
>>>

In this case, check the first 3 bytes - echoing them is not very useful because the BOM is invisible on most settings:

>>> substr($json, 0, 3)
=> "  "
>>> substr($json, 0, 3) == pack('H*','EFBBBF');
=> true
>>>

If the line above returns TRUE for you, then a simple test may fix the problem:

>>> json_decode($json[0] == "{" ? $json : substr($json, 3))
=> {#204
     +"orgao": [
       {#203
         +"Nome": "Tribunal de Justiça",
         +"ID_Orgao": "59",
         +"Condicao": "1",
       },
     ],
     ...
   }
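
The one-liner above drops three bytes whenever the string does not start with "{", which would also cut into a payload that starts with "[". A slightly more defensive sketch checks for the BOM itself and raises an error on failure instead of returning null silently. The function name json_decode_without_bom is made up for illustration.

```php
<?php
// Hypothetical helper: strips a leading UTF-8 BOM before decoding and
// throws instead of silently returning null on invalid JSON.
function json_decode_without_bom(string $json)
{
    // Only remove the first three bytes if they really are a BOM.
    if (strncmp($json, "\xEF\xBB\xBF", 3) === 0) {
        $json = substr($json, 3);
    }

    $data = json_decode($json);

    // A null result is only an error if json_last_error() says so;
    // the JSON literal "null" legitimately decodes to null.
    if ($data === null && json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('JSON decode failed: ' . json_last_error_msg());
    }

    return $data;
}
```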
Flowery answered 12/7, 2017 at 17:14 Comment(0)
N
0

Some faulty software adds another BOM every time the file is saved, so the BOMs pile up at the start of the file.

So I am using this to get rid of it.

function remove_utf8_bom($text) {
    $bom = pack('H*','EFBBBF');
    while (preg_match("/^$bom/", $text)) {
        $text = preg_replace("/^$bom/", '', $text);
    }
    return $text;
}
Nutgall answered 9/6, 2019 at 8:49 Comment(0)
