How a scanner can be implemented with a custom split
S

5

6

I have a log file, and I need to parse each record in it using Go. Each record begins with "#", and a record can span one or more lines:

# Line1
# Line2
Continued line2
Continued line2
# line3
.....

Some code :), I'm a beginner

    f, _ := os.Open(mylog)
    scanner := bufio.NewScanner(f)
    var queryRec string

    for scanner.Scan() {
        line := scanner.Text()

        if strings.HasPrefix(line, "# ") && len(queryRec) == 0 {
            queryRec = line
        } else if !strings.HasPrefix(line, "# ") && len(queryRec) == 0 {
            fmt.Println("There is a big problem!!!")
        } else if !strings.HasPrefix(line, "# ") && len(queryRec) != 0 {
            queryRec += line
        } else if strings.HasPrefix(line, "# ") && len(queryRec) != 0 {
            queryRec = line
        }
    }

Thanks,

Synonymize answered 11/10, 2015 at 18:34 Comment(6)
Show us the code you have already written, explain what problems you are having, and why you think it might be hard. – Armet
I'm trying to read each record and put it in a MySQL database @PedroLobito – Synonymize
Added some code @Armet – Synonymize
Now, what's the problem? Errors? Wrong output? – Kahle
My questions: 1. How can I handle the last line of the log file? 2. Is there a more optimized/elegant way to achieve the same goal? – Synonymize
You could have asked this question in a more general sense by asking how a scanner can be implemented with a custom split. It would apply to more people who are interested in a permutation of this problem. Consider editing your question to make it general and use your particular instance as an example. – Familist
F
18

The Scanner type has a method called Split, which takes a SplitFunc that determines how the scanner breaks its input into tokens. The default SplitFunc is bufio.ScanLines; its implementation in the bufio source is a good starting point. From there you can write your own SplitFunc that breaks up the reader's content according to your specific format.

func crunchSplitFunc(data []byte, atEOF bool) (advance int, token []byte, err error) {

    // Return nothing if at end of file and no data passed
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }

    // Find the index of a newline followed by a pound sign. Advancing by
    // i + 1 skips the newline but keeps the "#" for the next record.
    if i := strings.Index(string(data), "\n#"); i >= 0 {
        return i + 1, data[0:i], nil
    }

    // If at end of file with data return the data
    if atEOF {
        return len(data), data, nil
    }

    // No separator yet and not at EOF: request more data.
    return 0, nil, nil
}

You can see the full implementation of the example at https://play.golang.org/p/ecCYkTzme4. The documentation provides all the insight needed to implement something like this.

Familist answered 11/10, 2015 at 20:23 Comment(2)
Thanks @benjic. Elegant solution that works as expected :) – Synonymize
This will be problematic when the input data contains multiple newlines before #s. – Littlest
T
12

A slightly optimized version of the solutions by Ben Campbell and sto-b-doo.

Converting the byte slice to a string on every call turned out to be quite a heavy operation.

In my log-processing app it became a bottleneck.

Just keeping the data as bytes gave my app a ~1500% performance boost.

func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {
    searchBytes := []byte(substring)
    searchLen := len(searchBytes)
    return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        dataLen := len(data)

        // Return nothing if at end of file and no data passed
        if atEOF && dataLen == 0 {
            return 0, nil, nil
        }

        // Find next separator and return token
        if i := bytes.Index(data, searchBytes); i >= 0 {
            return i + searchLen, data[0:i], nil
        }

        // If we're at EOF, we have a final, non-terminated line. Return it.
        if atEOF {
            return dataLen, data, nil
        }

        // Request more data.
        return 0, nil, nil
    }
}
Twotime answered 27/7, 2019 at 13:41 Comment(0)
V
2

Ben Campbell's answer wrapped into a func that returns a splitfunc for a substring:

demo on play.golang.org

Improvement suggestions welcome

// SplitAt returns a bufio.SplitFunc closure, splitting at a substring
// scanner.Split(SplitAt("\n# "))
func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {

    return func(data []byte, atEOF bool) (advance int, token []byte, err error) {

        // Return nothing if at end of file and no data passed
        if atEOF && len(data) == 0 {
            return 0, nil, nil
        }

        // Find the index of the separator substring in the data
        if i := strings.Index(string(data), substring); i >= 0 {
            return i + len(substring), data[0:i], nil
        }

        // If at end of file with data return the data
        if atEOF {
            return len(data), data, nil
        }

        return
    }
}
Vitebsk answered 1/7, 2018 at 23:19 Comment(0)
Y
0

Hopefully an improvement (maybe in readability) over stu0292's improvements. It also uses the final-token signal, bufio.ErrFinalToken.

// SplitAt returns a bufio.SplitFunc closure, splitting at a substring
// scanner.Split(SplitAt("\n#"))
func SplitAt(substring string) func(data []byte, atEOF bool) (advance int, token []byte, err error) {

    return func(data []byte, atEOF bool) (advance int, token []byte, err error) {

        // Find the index of the separator substring in the data
        if i := strings.Index(string(data), substring); i >= 0 {
            return i + len(substring), data[0:i], nil
        }

        // No separator yet and not at EOF: request more data
        if !atEOF {
            return 0, nil, nil
        }

        // At EOF: deliver what is left as the final token and stop scanning
        return len(data), data, bufio.ErrFinalToken
    }
}
Yod answered 10/8, 2021 at 2:29 Comment(0)
S
0

Most of the responses so far convert the data to a string before looking up the index of the substring. This is inefficient because the string conversion creates a copy of the underlying data. I suggest using the following code, which works the same way but avoids the unnecessary string conversions.

func SplitAt(substring []byte) func(data []byte, atEOF bool) (advance int, token []byte, err error) {
    return func(data []byte, atEOF bool) (advance int, token []byte, err error) {

        // Return nothing if at the end of the file and no data passed
        if atEOF && len(data) == 0 {
            return 0, nil, nil
        }

        // Find the index of the separator substring in the data
        if i := bytes.Index(data, substring); i >= 0 {
            return i + len(substring), data[0:i], nil
        }

        // If at the end of the file with data, return the data
        if atEOF {
            return len(data), data, nil
        }

        return
    }
}

In the calling code you can then specify the delimiter as a []byte.

Sadism answered 8/9, 2024 at 10:45 Comment(0)
