How can I compare two files in golang?
G

10

13

In Python I can do the following:

equals = filecmp.cmp(file_old, file_new)

Is there a builtin function that does this in Go? I googled it without success.

I could use a hash function from the hash/crc32 package, but that is more work than the Python code above.

Giuliana answered 8/4, 2015 at 2:52 Comment(2)
Can you clarify the question? It's asking for two different things (a replacement for filecmp.cmp and a way to see if two files contain the same bytes).Dialectics
Sure. I wrote a diff tool in Python (to teach myself Python) that builds patches by comparing files, using filecmp.cmp to compare the old and the new file. I'm now writing the same tool in Go and cannot find an equivalent function. So my question is whether there is a builtin function to compare files; if there isn't, my suggestion was to use a hash function or write a byte-by-byte comparison function. Sorry for my English.Giuliana
W
11

I am not sure that function does what you think it does. From the docs,

Unless shallow is given and is false, files with identical os.stat() signatures are taken to be equal.

Your call is comparing only the signature of os.stat, which only includes:

  1. File mode
  2. Modified Time
  3. Size

You can learn all three of these things in Go from the os.Stat function. This really would only indicate that they are literally the same file, or symlinks to the same file, or a copy of that file.

If you want to go deeper you can open both files and compare them (python version reads 8k at a time).

You could use a CRC or MD5 hash of both files, but if the files differ near the beginning of a long file, you want to stop early, and hashing always reads everything. I would recommend reading some number of bytes at a time from each reader and comparing the chunks with bytes.Compare (or bytes.Equal, since only equality matters).

Willy answered 8/4, 2015 at 5:0 Comment(0)
Y
14

To complete @captncraig's answer: if you want to know whether the two paths refer to the same file, you can use the SameFile(fi1, fi2 FileInfo) function from the os package.

SameFile reports whether fi1 and fi2 describe the same file. For example, on Unix this means that the device and inode fields of the two underlying structures are identical;

Otherwise, if you want to check the files' contents, here is a solution that compares the two files line by line without loading the entire files into memory.

First try: https://play.golang.org/p/NlQZRrW1dT


EDIT: Read in chunks of bytes and fail fast if the files do not have the same size. https://play.golang.org/p/YyYWuCRJXV

const chunkSize = 64000

func deepCompare(file1, file2 string) bool {
    // Check file size ...

    f1, err := os.Open(file1)
    if err != nil {
        log.Fatal(err)
    }
    defer f1.Close()

    f2, err := os.Open(file2)
    if err != nil {
        log.Fatal(err)
    }
    defer f2.Close()

    for {
        b1 := make([]byte, chunkSize)
        _, err1 := f1.Read(b1)

        b2 := make([]byte, chunkSize)
        _, err2 := f2.Read(b2)

        if err1 != nil || err2 != nil {
            if err1 == io.EOF && err2 == io.EOF {
                return true
            } else if err1 == io.EOF || err2 == io.EOF {
                return false
            } else {
                log.Fatal(err1, err2)
            }
        }

        if !bytes.Equal(b1, b2) {
            return false
        }
    }
}
Yeoman answered 4/5, 2015 at 19:39 Comment(6)
Why the overhead of a scanner? That needs to parse the bytes looking for line separators which you don't care about. It also may not do what you expect for binary files. You can just read "chunks" into a pair of reasonably sized buffers and use bytes.Equal as you go (which is what @captncraig suggests).Zamir
BTW, it definitely won't work for binary files without frequent enough 0x0A bytes: "Scanning stops unrecoverably at EOF, the first I/O error, or a token too large to fit in the buffer." (From bufio.Scanner).Zamir
Thanks for your feedback. I edited my answer to follow your advice. Do you have an idea of a good chunk size default ?Yeoman
4k, 8k, 64k, or 128k are likely choices for "real" code reading from files but anything is fine as an example. In general with an io.Reader you'd also have to handle short reads (or use io.ReadFull and deal with io.ErrUnexpectedEOF); os.File doesn't seem to guarantee it won't give a short read. All the corner cases start to get annoying :(. Probably not worth dealing with in an SO example, however.Zamir
Readers are allowed to return a partially filled buffer even if more data will be available later as the docs say If some data is available but not len(p) bytes, Read conventionally returns what is available instead of waiting for more. So here f1 and f2 could get out of sync while reading.Meddle
There are several things wrong with this code: 1. What happens if Read reads a different number of bytes for the two files? 2. You shouldn't allocate large reusable buffers in a loop. 3. From the io docs: "Callers should always process the n > 0 bytes returned before considering the error err"Amish
P
11

How about using bytes.Equal?

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "log"
)

func main() {
    // per comment, better to not read an entire file into memory
    // this is simply a trivial example.
    f1, err1 := ioutil.ReadFile("lines1.txt")

    if err1 != nil {
        log.Fatal(err1)
    }

    f2, err2 := ioutil.ReadFile("lines2.txt")

    if err2 != nil {
        log.Fatal(err2)
    }

    fmt.Println(bytes.Equal(f1, f2)) // Per comment, this is significantly more performant.
}
Prosper answered 9/4, 2015 at 2:33 Comment(6)
Two problems with this post. 1. you are encouraging loading all data into memory. 2. DeepEqual uses reflection and is slow. It makes more sense to use bytes.Equal and if such a function did not exist, I would recommend a for loop.Porshaport
Updated per @StephenWeinberg, 1. good point. 2. bytes.Equal does exist and you're right, it's significantly faster than reflecting, updated code snippet.Prosper
Updated per @Dave C 3. I was "lazy" in this example (I also didn't have a package declaration or a main function, so this code would error if someone copy-pasted it), so I handled the errors and updated any code that wouldn't have compiled and ran. Hope that satisfies your problem with my answer.Prosper
You did not solve problem 1. You are still loading both files completely into memory. You did solve problems 2 and 3.Porshaport
Sorry, I didn't mean to imply I solved your problem, but made it clear in the comments that it was just a trivial example on the chance that someone copy/pasted the example and had problems. What's an alternative solution you'd like to propose? I'm happy to delete my response if you think it needs to be removed because it's encouraging bad conduct and it's bad enough to be worthy of down votes.Prosper
All bytes.Equal() does is: return string(a) == string(b). See github.com/golang/go/blob/master/src/bytes/bytes.goKerin
B
2

After checking the existing answers, I whipped up a simple package for comparing arbitrary (finite) io.Readers, with a convenience function for files: https://github.com/hlubek/readercomp

Example:

package main

import (
    "fmt"
    "log"
    "os"

    "github.com/hlubek/readercomp"
)

func main() {
    result, err := readercomp.FilesEqual(os.Args[1], os.Args[2])
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(result)
}
Braggart answered 6/1, 2021 at 16:48 Comment(0)
O
1

You can use a package like equalfile

Main API:

func CompareFile(path1, path2 string) (bool, error)

Godoc: https://godoc.org/github.com/udhos/equalfile

Example:

package main

import (
    "fmt"
    "os"

    "github.com/udhos/equalfile"
)

func main() {
    if len(os.Args) != 3 {
        fmt.Printf("usage: equal file1 file2\n")
        os.Exit(2)
    }

    file1 := os.Args[1]
    file2 := os.Args[2]

    equal, err := equalfile.CompareFile(file1, file2)
    if err != nil {
        fmt.Printf("equal: error: %v\n", err)
        os.Exit(3)
    }

    if equal {
        fmt.Println("equal: files match")
        os.Exit(0)
    }

    fmt.Println("equal: files differ")
    os.Exit(1)
}
Orthoclase answered 18/11, 2016 at 17:31 Comment(2)
const defaultMaxSize = 10000000000 // Only the first 10^10 bytes are compared. what the hellAc
This default max size is a protection against a possibly unlimited stream that would cause a never-ending comparison. You can override it by using the option 'Options.MaxSize'. If you have a better strategy for handling infinite streams, please open a pull request.Orthoclase
G
1

This does a piece-by-piece comparison of the two files, quitting as soon as it knows the two files are different. It only needs standard library functions.

It's an improvement on an earlier answer (https://stackoverflow.com/a/30038571, see the code comment) that handles the short-read problem raised by mat007 and christopher by using io.ReadFull(). It also avoids reallocating the buffers.

package util

import (
    "bytes"
    "io"
    "os"
)

// Decide if two files have the same contents or not.
// chunkSize is the size of the blocks to scan by; pass 0 to get a sensible default.
// *Follows* symlinks.
//
// May return an error if something else goes wrong; in this case, you should ignore the value of 'same'.
//
// derived from https://stackoverflow.com/a/30038571
// under CC-BY-SA-4.0 by several contributors
func FileCmp(file1, file2 string, chunkSize int) (same bool, err error) {

    if chunkSize <= 0 { // also guards against a negative size, which would make make() panic
        chunkSize = 4 * 1024
    }

    // shortcuts: check file metadata
    stat1, err := os.Stat(file1)
    if err != nil {
        return false, err
    }

    stat2, err := os.Stat(file2)
    if err != nil {
        return false, err
    }

    // are the inputs literally the same file?
    if os.SameFile(stat1, stat2) {
        return true, nil
    }

    // do inputs at least have the same size?
    if stat1.Size() != stat2.Size() {
        return false, nil
    }

    // long way: compare contents
    f1, err := os.Open(file1)
    if err != nil {
        return false, err
    }
    defer f1.Close()

    f2, err := os.Open(file2)
    if err != nil {
        return false, err
    }
    defer f2.Close()

    b1 := make([]byte, chunkSize)
    b2 := make([]byte, chunkSize)
    for {
        n1, err1 := io.ReadFull(f1, b1)
        n2, err2 := io.ReadFull(f2, b2)

        // https://pkg.go.dev/io#Reader
        // > Callers should always process the n > 0 bytes returned
        // > before considering the error err. Doing so correctly
        // > handles I/O errors that happen after reading some bytes
        // > and also both of the allowed EOF behaviors.

        if !bytes.Equal(b1[:n1], b2[:n2]) {
            return false, nil
        }

        if (err1 == io.EOF && err2 == io.EOF) || (err1 == io.ErrUnexpectedEOF && err2 == io.ErrUnexpectedEOF) {
            return true, nil
        }

        // some other error, like a dropped network connection or a bad transfer
        if err1 != nil {
            return false, err1
        }
        if err2 != nil {
            return false, err2
        }
    }
}

It surprised me that this wasn't anywhere in the standard library.

Giro answered 19/8, 2022 at 4:52 Comment(0)
L
0

Here's an io.Reader I whipped up. You can call _, err := io.Copy(ioutil.Discard, newCompareReader(a, b)) to get an error if the two streams don't have equal contents. This implementation is optimized for performance by limiting unnecessary data copying. Note that the code relies on the builtin min function, which is available since Go 1.21.

package main

import (
    "bytes"
    "errors"
    "fmt"
    "io"
)

type compareReader struct {
    a    io.Reader
    b    io.Reader
    bBuf []byte // need buffer for comparing B's data with one that was read from A
}

func newCompareReader(a, b io.Reader) io.Reader {
    return &compareReader{
        a: a,
        b: b,
    }
}

func (c *compareReader) Read(p []byte) (int, error) {
    if c.bBuf == nil {
        // assuming p's len() stays the same, so we can optimize for both of their buffer
        // sizes to be equal
        c.bBuf = make([]byte, len(p))
    }

    // read only as much data as we can fit in both p and bBuf
    readA, errA := c.a.Read(p[0:min(len(p), len(c.bBuf))])
    if readA > 0 {
        // bBuf is guaranteed to have at least readA space
        if _, errB := io.ReadFull(c.b, c.bBuf[0:readA]); errB != nil { // docs: "EOF only if no bytes were read"
            if errB == io.ErrUnexpectedEOF {
                return readA, errors.New("compareReader: A had more data than B")
            } else {
                return readA, fmt.Errorf("compareReader: read error from B: %w", errB)
            }
        }

        if !bytes.Equal(p[0:readA], c.bBuf[0:readA]) {
            return readA, errors.New("compareReader: bytes not equal")
        }
    }
    if errA == io.EOF {
        // In the happy case we expect EOF from B as well. The ReadFull call above
        // reports no error once it has filled the requested bytes, so check here.
        readB, errB := c.b.Read(c.bBuf)
        if readB > 0 {
            return readA, errors.New("compareReader: B had more data than A")
        }

        if errB != io.EOF {
            return readA, fmt.Errorf("compareReader: got EOF from A but not from B: %w", errB)
        }
    }

    return readA, errA
}
Lawannalawbreaker answered 23/10, 2020 at 10:22 Comment(0)
L
0

The standard way is to stat them and use os.SameFile.

-- https://groups.google.com/g/golang-nuts/c/G-5D6agvz2Q/m/2jV_6j6LBgAJ

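os.SameFile is the closest standard-library relative of Python's filecmp.cmp(f1, f2) in its default shallow mode, in that it only looks at the file info obtained by stat. It is stricter, though: it reports whether both FileInfo values describe the very same file, not whether two distinct files have matching metadata.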

func SameFile(fi1, fi2 FileInfo) bool

SameFile reports whether fi1 and fi2 describe the same file. For example, on Unix this means that the device and inode fields of the two underlying structures are identical; on other systems the decision may be based on the path names. SameFile only applies to results returned by this package's Stat. It returns false in other cases.

But if you actually want to compare the file's content, you'll have to do it yourself.

Looper answered 17/11, 2020 at 4:28 Comment(0)
T
0

This is an optimized function for comparing io.Reader values. It handles the case where a reader returns fewer bytes than the buffer holds without being at EOF. It is optimized to fail fast by not using io.ReadFull or io.ReadAtLeast, which keep trying to read from slow sources even when there is already data available that could be compared.

If less data is retrieved from the second reader, whatever is retrieved is compared before more is read into the buffer.

const chunkSize = 64000

func readersEqual(r io.Reader, t io.Reader) (bool, error) {
    rBuf := make([]byte, chunkSize)
    tBuf := make([]byte, chunkSize)

    for {
        readFromR, errR := r.Read(rBuf)
        if errR != nil && !errors.Is(errR, io.EOF) {
            return false, errR
        }

        readFromT := 0
        tCmpBuf := tBuf[:readFromR]

        if readFromR == 0 && errors.Is(errR, io.EOF) {
            readFromT, errT := t.Read(tBuf[:1])
            if readFromT == 0 && errors.Is(errT, io.EOF) {
                return true, nil
            } else {
                return false, errT
            }
        }

        for readFromR > readFromT {
            nextReadFromT, errT := t.Read(tCmpBuf[readFromT:])
            if errT != nil && !errors.Is(errT, io.EOF) {
                return false, errT
            }
            prevReadFromT := readFromT
            readFromT = prevReadFromT + nextReadFromT
            if !bytes.Equal(rBuf[prevReadFromT:readFromT], tCmpBuf[prevReadFromT:readFromT]) {
                return false, nil
            }
            if errors.Is(errR, io.EOF) && errors.Is(errT, io.EOF) {
                return true, nil
            }
            if errors.Is(errR, io.EOF) || errors.Is(errT, io.EOF) {
                return false, nil
            }
        }
    }
}
Toadinthehole answered 24/9, 2023 at 7:2 Comment(0)
M
-1

Something like this should do the trick, and should be memory-efficient compared to the other answers. I looked at github.com/udhos/equalfile and it seemed a bit overkill to me. Before you call compare() here, you should do two os.Stat() calls and compare file sizes for an early out fast path.

The reason to use this implementation over the other answers is because you don't want to hold the entirety of both files in memory if you don't have to. You can read an amount from A and B, compare, and then continue reading the next amount, one buffer-load from each file at a time until you are done. You just have to be careful because you may read 50 bytes from A and then 60 bytes from B because your read may have blocked for some reason.

This implementation assumes a Read() call will not return N > 0 (some bytes read) at the same time as an error != nil. This is how os.File behaves, but not how other implementations of Read may behave, such as net.TCPConn.

import (
  "bytes"
  "errors"
  "fmt"
  "io"
  "os"
)

var errNotSame = errors.New("File contents are different")

func compare(p1, p2 string) error {
    var (
        buf1 [8192]byte
        buf2 [8192]byte
    )

    fh1, err := os.Open(p1)
    if err != nil {
        return err
    }
    defer fh1.Close()

    fh2, err := os.Open(p2)
    if err != nil {
        return err
    }
    defer fh2.Close()

    for {
        n1, err1 := fh1.Read(buf1[:])
        n2, err2 := fh2.Read(buf2[:])

        if err1 == io.EOF && err2 == io.EOF {
            // files are the same!
            return nil
        }
        if err1 == io.EOF || err2 == io.EOF {
            return errNotSame
        }
        if err1 != nil {
            return err1
        }
        if err2 != nil {
            return err2
        }

        // short read on n1
        for n1 < n2 {
            more, err := fh1.Read(buf1[n1:n2])
            if err == io.EOF {
                return errNotSame
            }
            if err != nil {
                return err
            }
            n1 += more
        }
        // short read on n2
        for n2 < n1 {
            more, err := fh2.Read(buf2[n2:n1])
            if err == io.EOF {
                return errNotSame
            }
            if err != nil {
                return err
            }
            n2 += more
        }
        if n1 != n2 {
            // should never happen
            return fmt.Errorf("file compare reads out of sync: %d != %d", n1, n2)
        }

        if !bytes.Equal(buf1[:n1], buf2[:n2]) {
            return errNotSame
        }
    }
}
Medlock answered 14/12, 2020 at 11:22 Comment(3)
This code looks good at first sight but has some issues due to the semantics of io.Reader, e.g.: 1. If the first call to Read returns io.EOF and a non-zero count of bytes read - it is not necessarily true that the files are the same for files < 8K. It is allowed that a read that hits EOF can return the error and a non-zero number of bytes read in the same call. So it must be compared anyway. 2. If one of the reads returns io.EOF and the other is not, it may not be true that the files differ since one could be a "short read".Braggart
@Braggart Aha! good catch, though I think most implementations of Read() such as the one in "os" for os.File will never return(n > 0, EOF). They instead return(n > 0, nil), and then on the next call to read they return (0, EOF). But it looks like you're right about TCP connections in the base "net" package-- those may return some bytes, and an error, if I understand the docs correctly.Medlock
@Braggart I updated the text to make sure to note that caveat. Thanks!Medlock
