Counting characters in golang string
I am trying to count "characters" in go. That is, if a string contains one printable "glyph", or "composed character" (or what someone would ordinarily think of as a character), I want it to count 1. For example, the string "Hello, δΈ–πŸ––πŸΏπŸ––η•Œ", should count 11, since there are 11 characters, and a human would look at this and say there are 11 glyphs.

utf8.RuneCountInString() works well in most cases, including ascii, accents, asian characters and even emojis. However, as I understand it runes correspond to code points, not characters. When I try to use basic emojis it works, but when I use emojis that have different skin tones, I get the wrong count: https://play.golang.org/p/aFIGsB6MsO

From what I read here and here the following should work, but I still don't seem to be getting the right results (it over-counts):

func CountCharactersInString(str string) int {
    var ia norm.Iter // norm is golang.org/x/text/unicode/norm
    ia.InitString(norm.NFC, str)
    nc := 0
    for !ia.Done() {
        nc++
        ia.Next()
    }
    return nc
}

This doesn't work either:

func GraphemeCountInString(str string) int {
    re := regexp.MustCompile("\\PM\\pM*|.")
    return len(re.FindAllString(str, -1))
}

I am looking for something similar to this in Objective C:

+ (NSInteger)countCharactersInString:(NSString *)string {
    // Count the composed character sequences entered by the user
    NSInteger count = 0;
    NSUInteger index = 0;
    while (index < string.length) {
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        count++;
        index += range.length;
    }
    return count;
}
Commonwealth answered 29/4, 2016 at 1:41 Comment(5)
You're looking for an implementation of the "Grapheme Cluster Boundary" algorithm from UAX #29. – Nickell
I believe that's right. I tried both implementations for grapheme counting from this answer https://mcmap.net/q/134633/-how-to-get-the-number-of-characters-in-a-string, but I run into the same trouble, but perhaps grapheme cluster boundary counting is more what I want? – Commonwealth
The answers to that question confuse "grapheme clusters" with "character normalisation" (all have serious errors in them). – Nickell
Were you able to find a solution to this? The problem is the skin-tone modifier is being counted as a separate character and norm does not "count" it as 1 character with the hand. – Glossotomy
Never found a correct solution, so I had to loosen my requirements. – Commonwealth
A
14

I wrote a package that allows you to do this: https://github.com/rivo/uniseg. It breaks strings according to the rules specified in Unicode Standard Annex #29 which is what you are looking for. Here is how you would use it in your case:

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    fmt.Println(uniseg.GraphemeClusterCount("Hello, δΈ–πŸ––πŸΏπŸ––η•Œ"))
}

This will print 11 as you expect.

Afraid answered 13/3, 2019 at 17:54 Comment(2)
Best solution. All the other solutions result in counting some emojis as 1 character and other emojis as 2 characters. – Domitian
There is a difference between bytes, runes, and graphemes, and it seems many people confuse the three. (In most use cases, it doesn't matter anyway.) For example, πŸ³οΈβ€πŸŒˆ (rainbow flag emoji) is 1 grapheme, 4 runes, and 14 bytes. The Go stdlib only has built-in functions for bytes and runes but not for graphemes. – Afraid
T
18

The straightforward, native approach is utf8.RuneCountInString():

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, δΈ–πŸ––πŸ––η•Œ"
    fmt.Println("counts =", utf8.RuneCountInString(str))
}
Tribadism answered 31/10, 2020 at 11:32 Comment(4)
or even more directly with utf8.RuneCountInString – Bridoon
Thanks for the modification @mvndaai. RuneCountInString is like RuneCount but its input is a string instead of a byte slice. – Tribadism
This is the best answer because it uses the internal utf8 package instead of an external one – Roger
Go doesn't need a package to understand unicode. Just make sure you count runes and not bytes; len([]rune("Hello, δΈ–πŸ––πŸ––η•Œ")). – Shing
P
11

Have you tried strings.Count?

package main

import (
    "fmt"
    "strings"
)

func main() {
    fmt.Println(strings.Count("Hello, δΈ–πŸ––πŸ––η•Œ", "πŸ––")) // Returns 2
}
Parahydrogen answered 29/4, 2016 at 13:42 Comment(1)
In the example "Hello, δΈ–πŸ––πŸ––η•Œ", I would want it to count 11, since there are 11 characters, not 2. I will edit my question to clarify. – Commonwealth
S
5

See the example in the API documentation: https://golang.org/pkg/unicode/utf8/#example_DecodeLastRuneInString

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, δΈ–πŸ––η•Œ"
    count := 0
    for len(str) > 0 {
        r, size := utf8.DecodeLastRuneInString(str)
        count++
        fmt.Printf("%c %v\n", r, size)

        str = str[:len(str)-size]
    }
    fmt.Println("count:", count)
}
Selfappointed answered 29/4, 2016 at 2:23 Comment(9)
That counts runes, not graphemes: str := "πŸ‡¦πŸ‡½" counts 2 instead of 1. – Nickell
What is "AX" and why should it count as 1? – Selfappointed
It's U+1F1E6 U+1F1FD, which should render as the flag of the Åland Islands. Any other regional indicator symbol will have the same result (perhaps πŸ‡«πŸ‡· renders better on your system?). – Nickell
but U+1F1E6 and U+1F1FD can be two separate characters too, am I right? – Selfappointed
Yes, but in a regional indicator sequence they form one grapheme (or "one printable 'glyph'" as the original question put it). – Nickell
Apparently there is a 'unicode/norm' package to normalize unicode grapheme, is that what's needed here : blog.golang.org/normalization ? – Ambulator
How could we think a colorful flag picture is one "glyph" or "character"? And I find that the Objective-C function rangeOfComposedCharacterSequenceAtIndex @Bjorn Roche used behaves differently on different systems (#32831955). I'm totally confused by complex emoji! – Selfappointed
@phtrivier, yes, the examples I gave in my question use the unicode/norm package, but I still get the wrong answer sometimes, such as for the πŸ––πŸΏ glyph. – Commonwealth
there is a standard function - utf8.RuneCountInString – Bridoon
C
-2

I think the easiest way to do this would be like this:

package main

import "fmt"

func main() {
    str := "Hello, δΈ–πŸ––πŸ––η•Œ"
    var counter int
    for range str {
        counter++
    }
    fmt.Println(counter)
}

This prints 11. (Like utf8.RuneCountInString, it counts runes, so it will still over-count emoji that use modifiers.)

Comorin answered 1/10, 2020 at 18:14 Comment(0)
