What's the difference between a character, a code point, a glyph and a grapheme?

Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes - concepts which in the simplest case, when dealing with English text using ASCII characters, all have a one-to-one relationship with each other - is causing me trouble.

Seeing how these terms get used in documents like Matthias Bynens' JavaScript has a unicode problem or Wikipedia's piece on Han unification, I've gathered that these concepts are not the same thing and that it's dangerous to conflate them, but I'm kind of struggling to grasp what each term means.

The Unicode Consortium offers a glossary to explain this stuff, but it's full of "definitions" like this:

Abstract Character. A unit of information used for the organization, control, or representation of textual data. ...

...

Character. ... (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. ...

...

Glyph. (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character.

...

Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system. ...

Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard.

So I seek the arcane wisdom of those more learned than I. How exactly do each of these concepts differ from each other, and in what circumstances would they not have a one-to-one relationship with each other?

Thigmotaxis answered 6/12, 2014 at 12:44 Comment(2)
There are many very different writing systems, for many different languages. Thus there are different views on the problem of writing, and there's also a long history behind it. IMHO it's useful to keep that in mind, because Unicode tries to cover everything. (Is cursive same or different character? Kanji radicals? Hangul? Diacritics? Skin-colored emoji??...)Workwoman
"Most of these definitions possess the quality of sounding very academic and formal, but lack the quality of meaning anything, or else defer the problem of definition to yet another glossary entry or section of the standard." 😆🤣Kingdom
T
419
  • Character is an overloaded term that can mean many things.

  • A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.

  • A code unit is the unit of storage of a part of an encoded code point. In UTF-8 this means 8 bits, in UTF-16 this means 16 bits. A single code unit may represent a full code point, or part of a code point. For example, the snowman glyph (☃) is a single code point but 3 UTF-8 code units, and 1 UTF-16 code unit.

  • A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).

  • A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.
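
The code point versus code unit counts quoted above can be checked with a short Python sketch. The helper name `count_units` is hypothetical and not part of any library; the sketch relies on Python 3 strings being sequences of code points, and on the `-le` codec variants, which write no byte order mark.

```python
def count_units(text: str) -> dict:
    """Count code points and code units of `text` in three encoding forms."""
    return {
        "code_points": len(text),                            # Python 3 str is indexed by code point
        "utf8_units": len(text.encode("utf-8")),             # 8-bit code units
        "utf16_units": len(text.encode("utf-16-le")) // 2,   # 16-bit code units
        "utf32_units": len(text.encode("utf-32-le")) // 4,   # 32-bit code units
    }

snowman = "\u2603"     # SNOWMAN, a single code point
a_umlaut = "a\u0308"   # 'a' + COMBINING DIAERESIS: one grapheme, two code points

print(count_units(snowman))
print(count_units(a_umlaut))
```

The snowman comes out as one code point, three UTF-8 code units and one UTF-16 code unit, while the decomposed ä is a single grapheme made of two code points.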

Typesetter answered 6/12, 2014 at 12:53 Comment(29)
@MicahZoltu: Meh, I'm not sure. Now you just threw a ton of terms in there that aren't defined, and the example lacks explanatory detail and abuses the term "character"...Typesetter
Sorry for the ambiguous use of "character". I am a big advocate of not using "character" yet it is a trap I still fall into regularly. :/ As for the terms, I'm not sure which you are referring to? Code Unit is a well defined term in this space I believe, see en.wikipedia.org/wiki/Character_encoding#Terminology as a starting point. Perhaps you are referring to some other term I used?Galarza
I also removed comments about Endianness, which are out of scope for this Q&A I think.Galarza
@MicahZoltu: Well, UTF-{8, 16, 32} for starters. I had so far completely avoided considering encoding transformation schemes, because I think they're too much of a detail that doesn't warrant that kind of top billing. Also, comparing ASCII and UTF-? is a bit off, because ASCII is an encoding just like Unicode (an assignment of meaning to numbers), but UTF-? is something else (a method for representing numbers).Typesetter
I can get on board with keeping this Q&A as high level as possible as I think it is valuable to have a fairly "simplistic" answer to the question. However, I do think that code units play a very meaningful/important role and often people looking for answers to this class of questions are looking for that bit of information. I'm not sure how to best describe what a code point is without a meaningful example though, and in this case UTF-8, UTF-16 are pragmatic examples (especially with the snowman boundary condition). I'll remove ASCII and UTF-32, let me know how that sits with you.Galarza
@MicahZoltu: Wait, how are UTF-{8, 16, 32} relevant to describing "code points"?! I can describe code points entirely without reference to UTF (which indeed I did in my original answer).Typesetter
Sorry, typo in my comment. Meant code unit in all places in comment.Galarza
Let us continue this discussion in chat.Galarza
@KerrekSB "ASCII is an encoding just like Unicode (an assignment of meaning to numbers), but UTF-? is something else (a method for representing numbers)" - your usage of the term "encoding" here doesn't match what I'm used to - usually in this space "encoding" is used to mean converting some abstract concept of text to bytes. The terms in the unicode glossary seem to use it both in your sense (e.g. an "encoded character" still has nothing to do with bytes) and the one I'm used to (e.g. an "encoding scheme" is a scheme for mapping "textual information" to bytes).Thigmotaxis
I agree that @MicahZoltu's "code unit" edit feels slightly out of place - in particular because it introduces a distinct perspective (one focused on encoding stuff to bytes, rather than just Unicode's abstract model of what text is) from which a code point is very much not "atomic" right before the claim that "a code point is the atomic unit of information". I do think there's value to at least mentioning code units here purely to avoid readers confusing the two terms with each other, but perhaps code units should be introduced after code points, briefly and tangentially.Thigmotaxis
I just submitted an edit that re-arranged the order of code-point and code-unit. I agree with you that code-unit should come second. As for being "out of place", I suspect you see this answer as serving a different purpose than I do. I think there is great value in having all 5 of these terms in one place. The last thing I want is to google for "what is the difference between glyph, grapheme, code unit and code point" and have to get the answer in two places. In a lot of discussions these terms are all used in the discussion, rarely do I see a discussion with the other 4 but not code-unit.Galarza
So for example '\uD83D\uDC0A' (which shows a crocodile emoji) what are the code points, graphemes, etc? In particular, how does it relate to .length, .codePointAt(0),.codePointAt(1),.charCodeAt(0) and .charCodeAt(1) results?Cassella
@qbolec: Those are two UTF-16 code units expressing a single code point (U+1F40A), and given that it's an emoji, it's presumably its own, single grapheme.Typesetter
Thanks Kerrek SB! Now I see in the MDN js documentation that str.length is a number of code units in UTF-16, while I falsely believed up to this point that it is the number of code points. Also it seems that the argument pos to str.codePointAt(pos) is expressed in code units as well, and that there is nothing preventing one from passing a pos which is not aligned to a code point boundary, which added to my confusion :)Cassella
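
The surrogate pair from the crocodile example can be reproduced outside JavaScript. Below is a minimal Python sketch, not a definitive implementation: it splits the UTF-16 serialization into 16-bit code units with the standard `struct` module and then recombines the pair by hand. Note that Python's `len` counts code points, unlike the JavaScript `.length` discussed above.

```python
import struct

crocodile = "\U0001F40A"  # CROCODILE, one code point outside the Basic Multilingual Plane

# Serialize as UTF-16 (little-endian, no BOM) and split into 16-bit code units.
raw = crocodile.encode("utf-16-le")
code_units = struct.unpack(f"<{len(raw) // 2}H", raw)

print(len(crocodile))                  # number of code points
print([hex(u) for u in code_units])    # the two UTF-16 code units (a surrogate pair)

# Recombine the high and low surrogate to recover the original code point.
high, low = code_units
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))
```

The two code units are 0xD83D and 0xDC0A, matching the '\uD83D\uDC0A' literal in the comment, and they recombine to U+1F40A.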
@qbolec: Yes, Windows is in the very unfortunate situation of having adopted UTF-16 as a sort of standard text encoding, which is the worst of both worlds, being both variable-width and not byte-based (so you get endianness and BOM issues on top of surrogate handholding).Typesetter
You mentioned that the ä grapheme may be represented by multiple code points. So, is it two code points, or is it not? Can you clarify what determines that? Thank you.Sayed
@TomPažourek: In decomposed canonicalization, it's represented by two codepoints (a plus "combining diacritic"); in composed canonicalization it's represented by a single codepoint (ä from the old legacy Latin-1 range). Unicode canonicalization is the subject you want to investigate if this interests you. In a blank-slate world, there would only be base and combining characters and no prebuilt composites.Typesetter
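
The composed and decomposed forms of ä can be compared with Python's standard `unicodedata` module. This is a small sketch; it assumes that the "composed" and "decomposed" canonical forms mentioned above correspond to the normalization forms the Unicode standard calls NFC and NFD.

```python
import unicodedata

composed = "\u00e4"      # ä as one precomposed code point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

# The two strings render the same but are different code point sequences.
print(composed == decomposed)

nfc = unicodedata.normalize("NFC", decomposed)   # compose where possible
nfd = unicodedata.normalize("NFD", composed)     # decompose fully

print(nfc == composed, len(nfc))
print(nfd == decomposed, len(nfd))
```

Normalizing both strings to the same form before comparing them is what makes the two spellings of ä compare equal.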
It might be helpful to clarify that the meanings of "grapheme" and "glyph" are independent of text encoding whereas the meanings of "code unit" and "code point" are specific to the context of Unicode.Foochow
@Praxeolitic: That's true, but do you think the answer currently suggests such a false connection?Typesetter
No, I don't think any part of the answer is actively misleading. I just wouldn't be surprised if readers mistakenly assume that connection.Foochow
@KerrekSB The Unicode Glossary says that grapheme can mean either (1) "... minimally distinctive unit of writing ..." or (2) "What a user thinks of as a character." Meaning (2) of grapheme seems to match user-perceived character. What you describe in your explanation ("a sequence of one or more code points.") seems closer to grapheme cluster - which is an approximation of user-perceived characters.Torchwood
Maybe you could write something like: "The word grapheme can either stand for [...] minimally distinctive unit of writing [...] The word grapheme can alternatively stand for a user-perceived character, which can be approximated as a grapheme cluster." The explanation for grapheme cluster can then be your current explanation of grapheme. I could edit myself, but without asking you first, the edit would probably be rejected by reviewers...Torchwood
(Also if you edit: Could you include the references for grapheme cluster, code point and code unit (1 , 2)? You could also possibly mention, that in most programming languages string indexing and string length are code unit based.) Oh, by the way: Great answer! Sorry for nitpicking.Torchwood
Sorry, it seems, that I didn't link to the definitions, but to the short introductions... Better also link to the definitions: code point: definition D10 ; code unit: definition D77 ; grapheme cluster: definition D60 PS: Sorry for the noiseTorchwood
@Kaushik: I'm not sure what you mean: a code unit is a unit of storage, yes, but a code point in general requires multiple code units for storage (except in UTF-32).Typesetter
You should be writing the documentation for unicode. This is much better than the official documentation.Also
"In UTF-8 this means 8 bits" Hi I think this might not be the case strictly speaking, isn't UTF-8 variable length and can go to more than 1 byte?Poignant
@XuShaoyang UTF-8 is variable width for encoding code points, per wikipedia: "UTF-8 is capable of encoding all 1,112,064 valid character code points in Unicode using one to four one-byte (8-bit) code units." However, "A UTF maps each Unicode code point to a unique code unit sequence. A code unit is the minimal bit combination that can represent a character. Each UTF uses a different code unit size." (From ibm.com/docs/en/db2-for-zos/12?topic=unicode-utfs, which also has a brief comparison and examples of the various UTFs+ASCII encodings for a few code points.)Tressietressure
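
To make the distinction in this exchange concrete: the UTF-8 code unit is always 8 bits, while the number of code units per code point varies from one to four. A rough Python sketch, with sample code points chosen here purely for illustration:

```python
# One sample code point for each possible UTF-8 length.
samples = ["\u0041", "\u00e4", "\u2603", "\U0001F40A"]  # A, ä, snowman, crocodile

widths = []
for ch in samples:
    encoded = ch.encode("utf-8")          # each byte is one 8-bit code unit
    widths.append(len(encoded))
    print(f"U+{ord(ch):04X}", len(encoded), encoded.hex())

print(widths)
```

Each sample is a single code point; only the number of 8-bit code units needed to store it changes.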
"For example, the snowman glyph..." should surely be, "For example, the snowman grapheme..."?Pessa
P
6

Outside the Unicode standard a character is an individual unit of text composed of one or more graphemes. What the Unicode standard defines as "characters" is actually a mix of graphemes and characters. Unicode provides rules for the interpretation of juxtaposed graphemes as individual characters.

A Unicode code point is a unique number assigned to each Unicode character (which is either a character or a grapheme).

Unfortunately, the Unicode rules allow some juxtaposed graphemes to be interpreted as other graphemes that already have their own code points (precomposed forms). This means that there is more than one way in Unicode to represent a character. Unicode normalization addresses this issue.

A glyph is the visual representation of a character. A font provides a set of glyphs for a certain set of characters (not Unicode characters). For every character, there is an infinite number of possible glyphs.

A Reply to Mark Amery

First, as I stated, there is an infinite number of possible glyphs for each character so no, a character is not "always represented by a single glyph". Unicode doesn't concern itself much with glyphs, and the things it defines in its code charts are certainly not glyphs. The problem is that neither are they all characters. So what are they?

Which is the greater entity, the grapheme or the character? What does one call those graphic elements in text that are not letters or punctuation? One term that springs quickly to mind is "grapheme". It's a word that precisely conjures up the idea of "a graphical unit in a text". I offer this definition: A grapheme is the smallest distinct component in a written text.

One could go the other way and say that graphemes are composed of characters, but then they would be called "Chinese graphemes", and all those bits and pieces Chinese graphemes are composed of would have to be called "characters" instead. However, that's all backwards. Graphemes are the distinct little bits and pieces. Characters are more developed. The phrase "glyphs are composable", would be better stated in the context of Unicode as "characters are composable".

Unicode defines characters but it also defines graphemes that are to be composed with other graphemes or characters. Those monstrosities you composed are a fine example of this. If they catch on maybe they'll get their own code points in a later version of Unicode ;)

There's a recursive element to all this. At higher levels, graphemes become characters become graphemes, but it's graphemes all the way down.

A Reply to T S

Chapter 1 of the standard states: "The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols equivalently, which means they can be used in any mixture and with equal facility". Given this statement, we should be prepared for some conflation of terms in the standard. Sometimes the proper terminology only becomes clear in retrospect as a standard develops.

It often happens in formal definitions of a language that two fundamental things are defined in terms of each other. For example, in XML an element is defined as a starting tag possibly followed by content, followed by an ending tag. Content is defined in turn as either an element, character data, or a few other possible things. A pattern of self-referential definitions is also implicit in the Unicode standard:

A grapheme is a code point or a character.

A character is composed from a sequence of one or more graphemes.

When first confronted with these two definitions the reader might object to the first definition on the grounds that a code point is a character, but that's not always true. A sequence of two code points sometimes encodes a single code point under normalization, and that encoded code point represents the character, as illustrated in figure 2.7. Sequences of code points that encode other code points. This is getting a little tricky and we haven't even reached the layer where character encoding schemes such as UTF-8 are used to encode code points into byte sequences.

In some contexts, for example a scholarly article on diacritics, an individual part of a character might show up in the text by itself. In that context, the individual character part could be considered a character, so it makes sense that the Unicode standard remain flexible as well.

As Mark Amery pointed out, a character can be composed into a more complex thing. That is, each character can serve as a grapheme if desired. The final result of all composition is a thing that "the user thinks of as a character". There doesn't seem to be any real resistance, either in the standard or in this discussion, to the idea that at the highest level there are these things in the text that the user thinks of as individual characters. To avoid overloading that term, we can use "grapheme" in all cases where we want to refer to parts used to compose a character.

At times the Unicode standard is all over the place with its terminology. For example, Chapter 3 defines UTF-8 as an "encoding form" whereas the glossary defines "encoding form" as something else, and UTF-8 as a "Character Encoding Scheme". Another example is "Grapheme_Base" and "Grapheme_Extend", which are acknowledged to be mistakes but that persist because purging them is a bit of a task. There is still work to be done to tighten up the terminology employed by the standard.

The Proposal for addition of COMBINING GRAPHEME JOINER got it wrong when it stated that "Graphemes are sequences of one or more encoded characters that correspond to what users think of as characters." It should instead read, "A sequence of one or more graphemes composes what the user thinks of as a character." Then it could use the term "grapheme sequence" distinctly from the term "character sequence". Both terms are useful. "grapheme sequence" neatly implies the process of building up a character from smaller pieces. "character sequence" means what we all typically intuit it to mean: "A sequence of things the user thinks of as characters."

Sometimes a programmer really does want to operate at the level of grapheme sequences, so mechanisms to inspect and manipulate those sequences should be available, but generally, when processing text, it is sufficient to operate on "character sequences" (what the user thinks of as a character) and let the system manage the lower-level details.

In every case covered so far in this discussion, it's cleaner to use "grapheme" to refer to the indivisible components and "character" to refer to the composed entity. This usage also better reflects the long-established meanings of both terms.

Puncheon answered 12/4, 2018 at 9:57 Comment(8)
A cautious -1; I think this is wrong. You imply that a character can be composed of many graphemes, but always will be represented by a single glyph; I think in fact it is the other way around. Pages like en.wikipedia.org/wiki/N-diaeresis suggest that the combination of a letter with a diacritic (at least one that changes its meaning) forms a distinct new grapheme, and that the diacritic is not a grapheme on its own. Meanwhile, glyphs are clearly composable s͈̘̻̗̝i̙̳̩̯̮̥ͅn̪̭̹̝c̪̣̗̞̜e̥̖̮̫̣̯ͅ ̯ͅI̪͉̜̼̼̣̟̣ ̰̟̥̞̹c͈͔͇̼a̙̹̼̦̲̞n̙̺̳̟ͅ ̤̗d̘̭̙̪̦o̬̲̜̺ ̲̬̝t̺̖̗̩̱h̟̟̱i̹s̹̱.̯̖̝̯̟̜̥Thigmotaxis
I appreciate the reply, which I just saw. However, I still think that your definition of graphemes is in fact incorrect, or at least at odds with how Unicode defines the word. You dismiss the idea of a grapheme being composed of characters as being "all backwards", but I did a little digging and found unicode.org/L2/L2000/00274-N2236-grapheme-joiner.htm which literally begins with the statement "Graphemes are sequences of one or more encoded characters".Thigmotaxis
And that statement continues, "...that correspond to what users think of as characters." Even the term "grapheme-joiner", as well as the mechanism behind the term, is illustrative of what I stated at the beginning of my answer: What the Unicode standard defines as "characters" is actually a mix of graphemes and characters. It's cleaner to call graphemes "graphemes" and characters "characters" rather than inventing contortions such as "precomposed characters" and "grapheme clusters".Puncheon
@PoorYorick You claim, that "... a character is an individual unit of text composed of one or more graphemes" and "Graphemes are the distinct little bits and pieces. Characters are more developed". Do you have any reference that supports these claims? Because I somehow doubt, that the Unicode consortium deliberately decided to define their names somehow "inverted".Torchwood
@PoorYorick I understood your point, before you added this further explanation, that's not why I asked. I simply asked for references (lexicon, scientific articles, technical standard, ...), because I didn't know any document, that uses grapheme the way you interpret it. (The first few google results for grapheme also don't use the word in your way). You added a blockquote "A grapheme is a code point or a character. [...]" - but where is it from? At the end you say "long-established meanings of both terms." - If it's established then link to something, that's using the term this way.Torchwood
Regarding UTF-8: There is an encoding form named UTF-8 (see definition D92) and an encoding scheme named UTF-8 (see definition D95)Torchwood
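
The form versus scheme distinction is easier to see with UTF-16 than with UTF-8, because the byte order becomes visible. Below is a hedged Python sketch. Mapping Python's codec names onto the standard's encoding schemes is an assumption made here for illustration, not something stated in the thread.

```python
import codecs

snowman = "\u2603"  # one code point, one 16-bit code unit (0x2603) in the UTF-16 encoding form

# Encoding schemes serialize that same code unit to bytes in a fixed byte order.
big_endian = snowman.encode("utf-16-be")
little_endian = snowman.encode("utf-16-le")

# Python's plain "utf-16" codec prepends a byte order mark (BOM) in native order.
with_bom = snowman.encode("utf-16")

print(big_endian.hex(), little_endian.hex(), with_bom.hex())

# UTF-8 code units are single bytes, so no byte order question arises.
print(snowman.encode("utf-8").hex())
```

The code unit is the same in both UTF-16 cases; only its byte serialization differs.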
@MarkAmery how did you write " s͈̘̻̗̝i̙̳̩̯̮̥ͅn̪̭̹̝c̪̣̗̞̜e̥̖̮̫̣̯ͅ ̯ͅI̪͉̜̼̼̣̟̣ ̰̟̥̞̹c͈͔͇̼a̙̹̼̦̲̞n̙̺̳̟ͅ ̤̗d̘̭̙̪̦o̬̲̜̺ ̲̬̝t̺̖̗̩̱h̟̟̱i̹s̹̱.̯̖̝̯̟̜̥"?Diaspore
@DavidKlempfner eeemo.net. It basically adds bajillions of accents and similar modifiers to the text you feed it.Thigmotaxis
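
For the curious, the stacking effect can be imitated by appending combining marks to a base letter. This is only an illustrative Python sketch with three arbitrarily chosen marks; it makes no claim about how eeemo.net itself works.

```python
import unicodedata

# A base letter followed by three combining marks (chosen arbitrarily for this example).
stacked = "s\u0301\u0323\u0308"

# unicodedata.combining() returns a non-zero combining class for these marks.
marks = [ch for ch in stacked if unicodedata.combining(ch) != 0]

print(len(stacked))                           # total code points
print(len(marks))                             # how many of them are combining marks
print([unicodedata.name(ch) for ch in marks])
```

The string holds four code points, yet a reader sees one decorated letter.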
