I need to read Unicode files that may or may not contain a byte-order mark. I could of course check the first few bytes of the file myself, and discard a BOM if I find one. But before I do, is there any standard way of doing this, either in the core libraries or a third party?
No standard way, IIRC (and the standard library would really be a wrong layer to implement such a check in) so here are two examples of how you could deal with it yourself.
One is to use a buffered reader above your data stream:
package main

import (
	"bufio"
	"log"
	"os"
)

func main() {
	fd, err := os.Open("filename")
	if err != nil {
		log.Fatal(err)
	}
	defer fd.Close()

	br := bufio.NewReader(fd)
	r, _, err := br.ReadRune()
	if err != nil {
		log.Fatal(err)
	}
	if r != '\uFEFF' {
		br.UnreadRune() // Not a BOM -- put the rune back
	}
	// Now work with br as you would do with fd
	// ...
}
Another approach, which works with objects implementing the io.Seeker
interface, is to read the first three bytes and, if they are not a BOM, Seek()
back to the beginning, like in:
package main

import (
	"io"
	"log"
	"os"
)

func main() {
	fd, err := os.Open("filename")
	if err != nil {
		log.Fatal(err)
	}
	defer fd.Close()

	var bom [3]byte
	// Note: io.ReadFull returns an error for files shorter than three bytes
	_, err = io.ReadFull(fd, bom[:])
	if err != nil {
		log.Fatal(err)
	}
	if bom[0] != 0xef || bom[1] != 0xbb || bom[2] != 0xbf {
		_, err = fd.Seek(0, io.SeekStart) // Not a BOM -- seek back to the beginning
		if err != nil {
			log.Fatal(err)
		}
	}
	// The next read operation on fd will read real data
	// ...
}
This is possible since instances of *os.File (what os.Open() returns) support seeking and hence implement io.Seeker. Note that that's not the case for, say, the Body reader of HTTP responses, since you can't "rewind" it. bufio.Reader works around this limitation of non-seekable streams by performing some buffering (obviously); that's what allows you to UnreadRune() on it.
Note that both examples assume the file we're dealing with is encoded in UTF-8. If you need to deal with another (or an unknown) encoding, things get more complicated.
You can use the utfbom package. It wraps an io.Reader, and detects and discards the BOM as necessary. It can also return the encoding detected from the BOM.
I thought I would add here the way to strip the Byte Order Mark sequence from a string -- rather than messing around with bytes directly (as shown above).
package main

import (
	"fmt"
	"strings"
)

func main() {
	s := "\uFEFF is a string that starts with a Byte Order Mark"
	fmt.Printf("before: '%v' (len=%v)\n", s, len(s))
	ByteOrderMarkAsString := string('\uFEFF')
	if strings.HasPrefix(s, ByteOrderMarkAsString) {
		fmt.Printf("Found leading Byte Order Mark sequence!\n")
		s = strings.TrimPrefix(s, ByteOrderMarkAsString)
	}
	fmt.Printf("after: '%v' (len=%v)\n", s, len(s))
}
Other "strings" functions should work as well.
And this is what prints out:
before: ' is a string that starts with a Byte Order Mark' (len=50)
Found leading Byte Order Mark sequence!
after: ' is a string that starts with a Byte Order Mark' (len=47)
Cheers!
We used the transform package to read CSV files (which may have been saved from Excel in UTF8, UTF8-with-BOM, UTF16) as follows:
import (
	"encoding/csv"
	"io"

	"golang.org/x/text/encoding"
	"golang.org/x/text/encoding/unicode"
	"golang.org/x/text/transform"
)

// BOMAwareCSVReader will detect a UTF BOM (Byte Order Mark) at the
// start of the data and transform to UTF8 accordingly.
// If there is no BOM, it will read the data without any transformation.
func BOMAwareCSVReader(reader io.Reader) *csv.Reader {
	var transformer = unicode.BOMOverride(encoding.Nop.NewDecoder())
	return csv.NewReader(transform.NewReader(reader, transformer))
}
We are using Go 1.18.
There's no standard way of doing this in the Go core packages. Follow the Unicode standard.