Reading files with a BOM in Go
N

5

15

I need to read Unicode files that may or may not contain a byte-order mark. I could of course check the first few bytes of the file myself, and discard a BOM if I find one. But before I do, is there any standard way of doing this, either in the core libraries or a third party?

Naamana answered 27/1, 2014 at 1:37 Comment(0)
H
13

No standard way, IIRC (and the standard library would really be the wrong layer to implement such a check in), so here are two examples of how you could deal with it yourself.

One is to use a buffered reader above your data stream:

package main

import (
    "bufio"
    "log"
    "os"
)

func main() {
    fd, err := os.Open("filename")
    if err != nil {
        log.Fatal(err)
    }
    defer fd.Close()

    br := bufio.NewReader(fd)
    r, _, err := br.ReadRune()
    if err != nil {
        log.Fatal(err) // Note: an empty file yields io.EOF here
    }
    if r != '\uFEFF' {
        // Not a BOM -- put the rune back
        if err := br.UnreadRune(); err != nil {
            log.Fatal(err)
        }
    }
    // Now work with br as you would do with fd
    // ...
}

Another approach, which works with objects implementing the io.Seeker interface, is to read the first three bytes and, if they're not a BOM, Seek() back to the beginning, as in:

package main

import (
    "io"
    "log"
    "os"
)

func main() {
    fd, err := os.Open("filename")
    if err != nil {
        log.Fatal(err)
    }
    defer fd.Close()

    var bom [3]byte
    _, err = io.ReadFull(fd, bom[:])
    // A file shorter than three bytes cannot start with a BOM; that is not a failure
    if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
        log.Fatal(err)
    }
    if bom[0] != 0xef || bom[1] != 0xbb || bom[2] != 0xbf {
        _, err = fd.Seek(0, io.SeekStart) // Not a BOM -- seek back to the beginning
        if err != nil {
            log.Fatal(err)
        }
    }
    // The next read operation on fd will read real data
    // ...
}

This is possible since instances of *os.File (what os.Open() returns) support seeking and hence implement io.Seeker. Note that that's not the case for, say, the Body reader of an HTTP response, since you can't "rewind" it. bufio.Reader works around this limitation of non-seekable streams by performing some buffering (obviously); that's what allows you to call UnreadRune() on it.

Note that both examples assume the file we're dealing with is encoded in UTF-8. If you need to deal with other (or unknown) encodings, things get more complicated.

Huei answered 27/1, 2014 at 7:45 Comment(8)
The bufio approach worked, and I like that it considers the BOM as a single rune rather than a set of bytes. - Naamana
@Huei could you explain how to derive the condition in the second approach? Thanks. - Nonstriated
@Anuruddha, sorry, I failed to parse your question, in particular the "derive the condition" part, which appears to be essential. Care to elaborate? Maybe in a few sentences. - Huei
@Huei the condition you used: bom[0] != 0xef || bom[1] != 0xbb || bom[1] != 0xbf - Nonstriated
@Anuruddha, oh, there's an error in it: the last part must check the byte at index 2: bom[2] != 0xbf. Still, I fail to interpret the "derive" in your question: to derive from what? Maybe start by reading this? (Thanks for spotting the error; I'm going to fix it now.) - Huei
@Huei thanks. I was confused because of the error you corrected, and because it uses the OR operator. Shouldn't it be AND? - Nonstriated
@Anuruddha, no, a UTF-8-encoded BOM is three bytes with specific values in a specific order, so the logic is "if any of the bytes in the sequence has a value it must not have at that position, that's not a BOM". - Huei
I like the bufio approach. Before reading this article I was doing a similar thing with a piped function because I hadn't realised I could pass a bufio reader on to the csv reader - that's very cool. - Singband
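The OR-versus-AND question in the comments above comes down to De Morgan's law: "not (all three bytes match)" is the same as "at least one byte differs". A small sketch that compares the forms follows; the function names are made up for illustration.

```go
package main

import "fmt"

// notBOM mirrors the condition from the answer: any mismatching byte
// is enough to rule out a UTF-8 BOM.
func notBOM(b [3]byte) bool {
	return b[0] != 0xef || b[1] != 0xbb || b[2] != 0xbf
}

// isBOM is the positive form: all three bytes must match.
func isBOM(b [3]byte) bool {
	return b[0] == 0xef && b[1] == 0xbb && b[2] == 0xbf
}

// notBOMWithAnd is the variant asked about in the comments. It is wrong:
// it only reports "not a BOM" when every byte differs.
func notBOMWithAnd(b [3]byte) bool {
	return b[0] != 0xef && b[1] != 0xbb && b[2] != 0xbf
}

func main() {
	samples := [][3]byte{
		{0xef, 0xbb, 0xbf}, // a real UTF-8 BOM
		{0xef, 0xbb, 0x41}, // only the last byte differs
		{0x41, 0x42, 0x43}, // plain "ABC"
	}
	for _, s := range samples {
		fmt.Printf("% x  notBOM=%v  !isBOM=%v  withAnd=%v\n",
			s[:], notBOM(s), !isBOM(s), notBOMWithAnd(s))
	}
}
```

For the second sample, notBOM and !isBOM agree that the bytes are not a BOM, while the AND variant misses it.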
C
6

You can use the utfbom package. It wraps an io.Reader and detects and discards the BOM as necessary. It can also return the encoding detected from the BOM.

Capitalization answered 25/3, 2017 at 20:6 Comment(0)
R
4

I thought I would add here the way to strip the Byte Order Mark sequence from a string -- rather than messing around with bytes directly (as shown above).

package main

import (
    "fmt"
    "strings"
)

func main() {
    s := "\uFEFF is a string that starts with a Byte Order Mark"
    fmt.Printf("before: '%v' (len=%v)\n", s, len(s))

    ByteOrderMarkAsString := string('\uFEFF')

    if strings.HasPrefix(s, ByteOrderMarkAsString) {

        fmt.Printf("Found leading Byte Order Mark sequence!\n")
        
        s = strings.TrimPrefix(s, ByteOrderMarkAsString)
    }
    fmt.Printf("after: '%v' (len=%v)\n", s, len(s)) 
}

Other "strings" functions should work as well.

And this is what prints out:

before: ' is a string that starts with a Byte Order Mark' (len=50)
Found leading Byte Order Mark sequence!
after: ' is a string that starts with a Byte Order Mark' (len=47)
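If the data is a byte slice rather than a string (for instance, a whole file read with os.ReadFile), the "bytes" package offers the same function. A minimal sketch, where stripBOM is a hypothetical helper name and the sample data is made up:

```go
package main

import (
	"bytes"
	"fmt"
)

// stripBOM (hypothetical helper) removes a leading UTF-8 BOM from data, if
// present. []byte("\uFEFF") is the three bytes EF BB BF.
func stripBOM(data []byte) []byte {
	return bytes.TrimPrefix(data, []byte("\uFEFF"))
}

func main() {
	// data stands in for file contents, e.g. the result of os.ReadFile.
	data := []byte("\uFEFFname,value\n")
	fmt.Println(len(data), len(stripBOM(data))) // prints: 14 11
}
```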

Cheers!

Roeser answered 14/1, 2021 at 17:16 Comment(0)
S
4

We used the golang.org/x/text/transform package to read CSV files (which may have been saved from Excel as UTF-8, UTF-8 with BOM, or UTF-16) as follows:

import (
    "encoding/csv"
    "io"

    "golang.org/x/text/encoding"
    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

// BOMAwareCSVReader will detect a UTF BOM (Byte Order Mark) at the
// start of the data and transform to UTF8 accordingly.
// If there is no BOM, it will read the data without any transformation.
func BOMAwareCSVReader(reader io.Reader) *csv.Reader {
    var transformer = unicode.BOMOverride(encoding.Nop.NewDecoder())
    return csv.NewReader(transform.NewReader(reader, transformer))
}

We are using Go 1.18.

Sage answered 15/4, 2023 at 16:24 Comment(0)
B
3

There's no standard way of doing this in the Go core packages. Follow the Unicode standard.

Unicode Byte Order Mark (BOM) FAQ

Bencher answered 27/1, 2014 at 3:12 Comment(1)
If I were generating all the files and streams myself, I would of course follow the Unicode standard to the letter. But like many people in the world, I'm stuck consuming data produced by somebody else.Naamana
