tl;dr
Use code points, not char.
Map < Integer, Integer > frequencies =
        input
                .codePoints ( )  // Generate an `IntStream` of `int` primitive values, one per code point in the input text.
                .boxed ( )  // Convert each `int` primitive into an `Integer` object.
                .collect (
                        Collectors.toMap (
                                Function.identity ( ) ,  // Key: the code point itself.
                                ( Integer a ) -> 1 ,  // Value: an initial count of one.
                                Integer :: sum  // Increment the frequency count, the value for this entry in our Map.
                        )
                );
Code point
Unfortunately, the char type has been essentially broken since Java 2, and legacy since Java 5. As a 16-bit value, char is physically incapable of representing most characters. Example: try "😷".length(), which returns 2 rather than 1.
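To see the problem concretely, here is a minimal sketch comparing what the char-based and code-point-based methods report for that one-character string. The class name CharVsCodePoint and its two helper methods are illustrative names only, not part of any library.

```java
public class CharVsCodePoint {

    // Number of UTF-16 code units (char values) in the text.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the text.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String mask = "😷";  // U+1F637, a character outside the 16-bit range of char.
        System.out.println("chars       = " + charCount(mask));       // 2
        System.out.println("code points = " + codePointCount(mask));  // 1
        System.out.println("code point  = " + mask.codePointAt(0));   // 128567
    }
}
```

The single visible character occupies two char values, so any logic that counts or indexes by char sees two items where a reader sees one.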
Instead, use code point integer numbers to work with individual characters.
Here is a code-point-savvy version of the Collectors.toMap approach seen in the Answers by WJS and by Cardinal System.
As the key in our map, we use Integer to represent each code point.
The IntStream returned by String#codePoints yields a series of int primitive values, one per code point found in the input string. We convert each code point from an int primitive to an Integer object because a map key must be an object rather than a primitive. The int to Integer conversion is performed by the call to .boxed().
String input = "😷🦜zwdddaaaaacbb🦜";
Map < Integer, Integer > frequencies =
        input
                .codePoints ( )
                .boxed ( )
                .collect (
                        Collectors.toMap (
                                ( Integer codePoint ) -> codePoint ,
                                ( Integer a ) -> 1 ,
                                Integer :: sum
                        )
                );
System.out.println ( "frequencies = " + frequencies );
frequencies.forEach ( ( Integer codePoint , Integer count ) -> System.out.println ( Character.toString ( codePoint ) + " = " + count ) );
Function.identity
That first argument to toMap says: for every code point, use that code point itself as the key. A shorter way of writing that is Function.identity().
String input = "😷🦜zwdddaaaaacbb🦜";
Map < Integer, Integer > frequencies =
        input
                .codePoints ( )  // Generate an `IntStream` of `int` primitive values, one per code point in the input text.
                .boxed ( )  // Convert each `int` primitive into an `Integer` object.
                .collect (
                        Collectors.toMap (
                                Function.identity ( ) ,  // Key: the code point itself.
                                ( Integer a ) -> 1 ,  // Value: an initial count of one.
                                Integer :: sum  // Increment the frequency count, the value for this entry in our Map.
                        )
                );
System.out.println ( "frequencies = " + frequencies );
frequencies.forEach ( ( Integer codePoint , Integer count ) -> System.out.println ( Character.toString ( codePoint ) + " = " + count ) );
Result
When run:
frequencies = {97=5, 98=2, 99=1, 100=3, 128567=1, 119=1, 122=1, 129436=2}
a = 5
b = 2
c = 1
d = 3
😷 = 1
w = 1
z = 1
🦜 = 2
In contrast, if we change .codePoints() to .chars(), we get incorrect results:
frequencies = {97=5, 98=2, 99=1, 100=3, 119=1, 56887=1, 122=1, 56732=2, 55357=1, 55358=2}
a = 5
b = 2
c = 1
d = 3
w = 1
? = 1
z = 1
? = 2
? = 1
? = 2
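The extra keys in that incorrect output are not random. They are the UTF-16 surrogate halves of the two emoji, which .chars() reports as separate values. A small sketch makes the connection visible; the class name SurrogateDemo and the helper utf16Units are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

public class SurrogateDemo {

    // Returns the UTF-16 code units needed to encode one code point, widened to int.
    static int[] utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);  // One char for the basic plane, two otherwise.
        int[] result = new int[units.length];
        for (int i = 0; i < units.length; i++) {
            result[i] = units[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // 128567 is 😷 and 129436 is 🦜, the keys seen in the correct output; 97 is the letter a.
        for (int codePoint : new int[] { 128567, 129436, 97 }) {
            System.out.println(codePoint + " -> " + Arrays.toString(utf16Units(codePoint)));
        }
    }
}
```

This should print 128567 -> [55357, 56887], then 129436 -> [55358, 56732], then 97 -> [97]. Those pairs are exactly the four stray keys in the .chars() output, and none of them is a printable character on its own, which is why each shows up as a question mark.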
Character key->key is not proper lambda syntax. – Forgetful