Outside the Unicode standard a character is an individual unit of text composed of one or more graphemes. What the Unicode standard defines as "characters" is actually a mix of graphemes and characters. Unicode provides rules for the interpretation of juxtaposed graphemes as individual characters.
A Unicode code point is a unique number assigned to each Unicode character (which is either a character or a grapheme).
Unfortunately, the Unicode rules allow some juxtaposed graphemes to be interpreted as other graphemes that already have their own code points (precomposed forms). This means that there is more than one way in Unicode to represent a character. Unicode normalization addresses this issue.
A glyph is the visual representation of a character. A font provides a set of glyphs for a certain set of characters (not Unicode characters). For every character, there is an infinite number of possible glyphs.
A Reply to Mark Amery
First, as I stated, there is an infinite number of possible glyphs for each character so no, a character is not "always represented by a single glyph". Unicode doesn't concern itself much with glyphs, and the things it defines in its code charts are certainly not glyphs. The problem is that neither are they all characters. So what are they?
Which is the greater entity, the grapheme or the character? What does one call those graphic elements in text that are not letters or punctuation? One term that springs quickly to mind is "grapheme". It's a word that precisely conjure up the idea of "a graphical unit in a text". I offer this definition: A grapheme is the smallest distinct component in a written text.
One could go the other way and say that graphemes are composed of characters, but then they would be called "Chinese graphemes", and all those bits and pieces Chinese graphemes are composed of would have to be called "characters" instead. However, that's all backwards. Graphemes are the distinct little bits and pieces. Characters are more developed. The phrase "glyphs are composable", would be better stated in the context of Unicode as "characters are composable".
Unicode defines characters but it also defines graphemes that are to be composed with other graphemes or characters. Those monstrosities you composed are a fine example of this. If they catch on maybe they'll get their own code points in a later version of Unicode ;)
There's a recursive element to all this. At higher levels, graphemes become characters become graphemes, but it's graphemes all the way down.
A Reply to T S
Chapter 1 of the
standard states: "The Unicode character encoding treats alphabetic characters,
ideographic characters, and symbols equivalently, which means they can be used
in any mixture and with equal facility". Given this statement, we should be
prepared for some conflation of terms in the standard. Sometimes the proper
terminology only becomes clear in retrospect as a standard develops.
It often happens in formal definitions of a language that two fundamental
things are defined in terms of each other. For example, in
XML an element is defined as a starting tag
possibly followed by content, followed by an ending tag. Content is defined in
turn as either an element, character data, or a few other possible things. A
pattern of self-referential definitions is also implicit in the Unicode
standard:
A grapheme is a code point or a character.
A character is composed from a sequence of one or more graphemes.
When first confronted with these two definitions the reader might object to the
first definition on the grounds that a code point is a character, but
that's not always true. A sequence of two code points sometimes encodes a
single code point under
normalization, and that
encoded code point represents the character, as illustrated in
figure 2.7. Sequences of
code points that encode other code points. This is getting a little tricky and
we haven't even reached the layer where where character encoding schemes such
as UTF-8 are used to
encode code points into byte sequences.
In some contexts, for example a scholarly article on
diacritics, and individual
part of a character might show up in the text by itself. In that context, the
individual character part could be considered a character, so it makes sense
that the Unicode standard remain flexible as well.
As Mark Avery pointed out, a character can be composed into a more complex
thing. That is, each character can can serve as a grapheme if desired. The
final result of all composition is a thing that "the user thinks of as a
character". There doesn't seem to be any real resistance, either in the
standard or in this discussion, to the idea that at the highest level there are
these things in the text that the user thinks of as individual characters. To
avoid overloading that term, we can use "grapheme" in all cases where we want
to refer to parts used to compose a character.
At times the Unicode standard is all over the place with its terminology. For
example, Chapter 3
defines UTF-8 as an "encoding form" whereas the glossary defines "encoding
form" as something else, and UTF-8 as a "Character Encoding Scheme". Another
example is "Grapheme_Base" and "Grapheme_Extend", which are
acknowledged to be
mistakes but that persist because purging them is a bit of a task. There is
still work to be done to tighten up the terminology employed by the standard.
The Proposal for addition of COMBINING GRAPHEME
JOINER got it
wrong when it stated that "Graphemes are sequences of one or more encoded
characters that correspond to what users think of as characters." It should
instead read, "A sequence of one or more graphemes composes what the user
thinks of as a character." Then it could use the term "grapheme sequence"
distinctly from the term "character sequence". Both terms are useful.
"grapheme sequence" neatly implies the process of building up a character from
smaller pieces. "character sequence" means what we all typically intuit it to
mean: "A sequence of things the user thinks of as characters."
Sometimes a programmer really does want to operate at the level of grapheme
sequences, so mechanisms to inspect and manipulate those sequences should be
available, but generally, when processing text, it is sufficient to operate on
"character sequences" (what the user thinks of as a character) and let the
system manage the lower-level details.
In every case covered so far in this discussion, it's cleaner to use "grapheme"
to refer to the indivisible components and "character" to refer to the composed
entity. This usage also better reflects the long-established meanings of both
terms.