Counting characters in golang string
I am trying to count "characters" in go. That is, if a string contains one printable "glyph", or "composed character" (or what someone would ordinarily think of as a character), I want it to count 1. For example, the string "Hello, δΈ–πŸ––πŸΏπŸ––η•Œ", should count 11, since there are 11 characters, and a human would look at this and say there are 11 glyphs.

utf8.RuneCountInString() works well in most cases, including ascii, accents, asian characters and even emojis. However, as I understand it runes correspond to code points, not characters. When I try to use basic emojis it works, but when I use emojis that have different skin tones, I get the wrong count: https://play.golang.org/p/aFIGsB6MsO

From what I read here and here the following should work, but I still don't seem to be getting the right results (it over-counts):

func CountCharactersInString(str string) int {
    var ia norm.Iter // norm is golang.org/x/text/unicode/norm
    ia.InitString(norm.NFC, str)
    nc := 0
    for !ia.Done() {
        nc++
        ia.Next()
    }
    return nc
}

This doesn't work either:

func GraphemeCountInString(str string) int {
    re := regexp.MustCompile("\\PM\\pM*|.")
    return len(re.FindAllString(str, -1))
}

I am looking for something similar to this in Objective C:

+ (NSInteger)countCharactersInString:(NSString *)string {
    // Count the composed character sequences entered by the user
    NSInteger count = 0;
    NSUInteger index = 0;
    while (index < string.length) {
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        count++;
        index += range.length;
    }
    return count;
}
Commonwealth answered 29/4, 2016 at 1:41 Comment(5)
You're looking for an implementation of the "Grapheme Cluster Boundary" algorithm from UAX #29. – Nickell
I believe that's right. I tried both implementations for grapheme counting from this answer https://mcmap.net/q/134633/-how-to-get-the-number-of-characters-in-a-string, but I run into the same trouble, but perhaps grapheme cluster boundary counting is more what I want? – Commonwealth
The answers to that question confuse "grapheme clusters" with "character normalisation" (all have serious errors in them). – Nickell
Were you able to find a solution to this? The problem is the skin-tone modifier is being counted as a separate character and norm does not "count" it as 1 character with the hand. – Glossotomy
Never found a correct solution, so I had to loosen my requirements. – Commonwealth
A
14

I wrote a package that allows you to do this: https://github.com/rivo/uniseg. It breaks strings according to the rules specified in Unicode Standard Annex #29 which is what you are looking for. Here is how you would use it in your case:

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    fmt.Println(uniseg.GraphemeClusterCount("Hello, δΈ–πŸ––πŸΏπŸ––η•Œ"))
}

This will print 11 as you expect.

Afraid answered 13/3, 2019 at 17:54 Comment(2)
Best solution. All the other solutions result in counting some emojis as 1 character and other emojis as 2 characters. – Domitian
There is a difference between bytes, runes, and graphemes, and it seems many people confuse the three. (In most use cases, it doesn't matter anyway.) For example, πŸ³οΈβ€πŸŒˆ (rainbow flag emoji) is 1 grapheme, 4 runes, and 14 bytes. The Go stdlib only has built-in functions for bytes and runes but not for graphemes. – Afraid
T
18

The straightforward, native approach is utf8.RuneCountInString():

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, δΈ–πŸ––πŸ––η•Œ"
    fmt.Println("counts =", utf8.RuneCountInString(str))
}
Tribadism answered 31/10, 2020 at 11:32 Comment(4)
or even more directly with utf8.RuneCountInString – Bridoon
Thanks for the modification @mvndaai. RuneCountInString is like RuneCount but its input is a string instead of a byte slice. – Tribadism
This is the best answer because it uses the internal utf8 package instead of an external one – Roger
Go doesn't need a package to understand unicode. Just make sure you count runes and not bytes; len([]rune("Hello, δΈ–πŸ––πŸ––η•Œ")). – Shing
P
11

Have you tried strings.Count?

package main

import (
    "fmt"
    "strings"
)

func main() {
    fmt.Println(strings.Count("Hello, δΈ–πŸ––πŸ––η•Œ", "πŸ––")) // Returns 2
}
Parahydrogen answered 29/4, 2016 at 13:42 Comment(1)
In the example "Hello, δΈ–πŸ––πŸ––η•Œ", I would want it to count 11, since there are 11 characters, not 2. I will edit my question to clarify. – Commonwealth
S
5

See the example in the API documentation: https://golang.org/pkg/unicode/utf8/#example_DecodeLastRuneInString

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, δΈ–πŸ––η•Œ"
    count := 0
    for len(str) > 0 {
        r, size := utf8.DecodeLastRuneInString(str)
        count++
        fmt.Printf("%c %v\n", r, size)

        str = str[:len(str)-size]
    }
    fmt.Println("count:", count)
}
Selfappointed answered 29/4, 2016 at 2:23 Comment(9)
That counts runes, not graphemes: str := "πŸ‡¦πŸ‡½" counts 2 instead of 1. – Nickell
What is "AX" and why should it count as 1? – Selfappointed
It's U+1F1E6 U+1F1FD, which should render as the flag of the Åland Islands. Any other regional indicator symbol will have the same result (perhaps πŸ‡«πŸ‡· renders better on your system?). – Nickell
but U+1F1E6 and U+1F1FD can be two separate characters too, am I right? – Selfappointed
Yes, but in a regional indicator sequence they form one grapheme (or "one printable 'glyph'" as the original question put it). – Nickell
Apparently there is a 'unicode/norm' package to normalize unicode grapheme, is that what's needed here : blog.golang.org/normalization ? – Ambulator
How could we think a colorful flag picture is one "glyph" or "character"? And I find that the Objective-C function rangeOfComposedCharacterSequenceAtIndex @Bjorn Roche used behaves differently on different systems (#32831955). I'm totally confused by complex emoji! – Selfappointed
@phtrivier, yes, the examples I gave in my question use the unicode/norm package, but I still get the wrong answer sometimes, such as for the πŸ––πŸΏ glyph. – Commonwealth
there is a standard function - utf8.RuneCountInString – Bridoon
C
-2

I think the easiest way to do this would be like this:

package main

import "fmt"

func main() {
    str := "Hello, δΈ–πŸ––πŸ––η•Œ"
    var counter int
    for range str {
        counter++
    }
    fmt.Println(counter)
}

This prints 11. (Like utf8.RuneCountInString, it counts runes, so it will still over-count emoji that use modifiers.)

Comorin answered 1/10, 2020 at 18:14 Comment(0)
