Difficulties inherent in ASCII and Extended ASCII, and Unicode Compatibility?

ASCII

ASCII was one of the earliest widely adopted character encodings. In the days when a byte was very expensive and 1 MHz was extremely fast, the ASCII character set covered only the characters found on US typewriters of the time (and on the average US keyboard today). This includes the basic Latin alphabet (A-Z, in both lowercase and uppercase), the digits (0-9), punctuation (space, period, comma, colon, etcetera) and some special characters (the at sign, the number sign, the dollar sign, etcetera), plus 33 non-printing control characters such as tab and line feed. Together they fill the space of 7 bits, half of the room a byte provides, for a total of 128 characters.
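As a small illustration of the 7-bit range, here is a minimal Python sketch. The helper name `is_ascii_text` is made up for this example and is not part of any library.

```python
def is_ascii_text(text: str) -> bool:
    """Return True if every character fits in ASCII's 7-bit range (0-127)."""
    return all(ord(ch) < 128 for ch in text)


# Each ASCII character encodes to exactly one byte.
encoded = "Hello, World!".encode("ascii")

# Because only 7 bits are used, the highest bit of every byte is 0.
high_bits_clear = all(byte & 0x80 == 0 for byte in encoded)
```

A string containing a character outside that range, such as an accented letter, makes `is_ascii_text` return `False`, and encoding it with `"ascii"` would raise `UnicodeEncodeError`.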

Extended ASCII and ISO 8859

Later, the remaining bit of the byte was used for Extended ASCII, which provides room for a total of 256 characters. Most of the additional room was used for special characters, such as letters with diacritics and line-drawing characters. But because everyone used the additional room in their own way (IBM, Commodore, universities, organizations, etcetera), the results were not interchangeable. Characters originally encoded using encoding X show up as Mojibake when they are decoded using a different encoding Y. Later, ISO published standard definitions for 8-bit ASCII extensions, resulting in the well-known ISO 8859 family of character encodings built on top of ASCII, such as ISO 8859-1, which made text far more interchangeable.
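The Mojibake effect is easy to reproduce. The sketch below picks ISO 8859-1 as "encoding X" and IBM's code page 437 as "encoding Y"; that pairing is only an assumption for illustration, and any two mismatched 8-bit encodings would show the same problem.

```python
original = "caf\u00e9"  # "café"

# Encode with ISO 8859-1: the letter é becomes the single byte 0xE9.
raw = original.encode("iso-8859-1")

# Decode the same bytes with a different 8-bit encoding (IBM code page 437).
# There, byte 0xE9 means the Greek capital letter theta, so the text is garbled.
garbled = raw.decode("cp437")

# Decoding with the encoding that was actually used restores the text.
restored = raw.decode("iso-8859-1")
```

The bytes themselves never change; only the interpretation does, which is why the encoding must travel with the data or be agreed upon in advance.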

Unicode

8 bits may be enough for languages that use the Latin alphabet, but it is certainly not enough to cover the world's other scripts and languages, such as Chinese, Japanese, Hebrew, Cyrillic, Sanskrit, Arabic, etcetera, let alone all of them together. Their users developed their own non-ISO character encodings, which were, again, not interchangeable, such as Guobiao, BIG5, JIS, KOI, MIK, TSCII, etcetera. Finally, a new character encoding standard was established to cover all characters used in the world so that text is interchangeable everywhere: Unicode. Its first 256 code points match ISO 8859-1. It provides room for over a million characters (1,114,112 code points), of which only a fraction (somewhat over 10%) is currently assigned. UTF-8 is one of the encodings that represent Unicode code points as bytes.
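To make the relationship between ISO 8859-1, Unicode and UTF-8 concrete, here is a minimal Python sketch. The sample characters are arbitrary choices for illustration.

```python
# Every ISO 8859-1 byte value decodes to the Unicode code point
# with the same number, so the first 256 code points coincide.
latin1_matches = all(
    bytes([value]).decode("iso-8859-1") == chr(value) for value in range(256)
)

# UTF-8 is a variable-length encoding: 1 to 4 bytes per code point.
samples = {
    "A": "A",                 # U+0041, plain ASCII
    "e-acute": "\u00e9",      # U+00E9, Latin letter with diacritic
    "CJK": "\u4e2d",          # U+4E2D, a common Chinese character
    "emoji": "\U0001F600",    # U+1F600, an emoji
}
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
```

ASCII text keeps its one-byte-per-character form under UTF-8, which is a large part of why UTF-8 is a convenient default.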

Unicode Planes

The Unicode code space is divided into seventeen planes, each providing room for 65,536 code points (16 bits).

Plane 0: Basic Multilingual Plane (BMP). It contains the characters of almost all modern languages in the world.
Plane 1: Supplementary Multilingual Plane (SMP). It contains historic scripts as well as musical and mathematical symbols.
Plane 2: Supplementary Ideographic Plane (SIP). It contains "special" CJK (Chinese/Japanese/Korean) characters, of which there are quite a lot but which are very seldom used in modern writing. The "normal" CJK characters are already present in the BMP.
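Planes 3-13: mostly unused. Plane 3, the Tertiary Ideographic Plane (TIP), has held additional rare CJK ideographs since Unicode 13.0; planes 4-13 are unassigned.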
Plane 14: Supplementary Special-purpose Plane (SSP). So far it contains only some tag characters and glyph variation selectors. The tag characters were deprecated for their original language-tagging purpose; because Unicode does not remove characters once they are assigned, they remain in the standard. The glyph variation selectors act as a kind of metadata that you append to an existing character to instruct the renderer to give that character a slightly different glyph.
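Planes 15-16: Private Use Planes (Supplementary Private Use Area-A and -B). They provide room for organizations or user initiatives to define their own special characters or symbols. These code points carry no standard meaning, so they are interchangeable only between parties that agree on their interpretation. For example, Japanese mobile carriers originally encoded Emoji (Japanese-style smilies/emoticons) with private-use code points; today Emoji are standardized, mostly in the SMP.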
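Since every plane spans 65,536 code points, the plane of a character follows directly from its code point. The sketch below shows one way to compute it in Python; `PLANE_SIZE` and `plane_of` are hypothetical names chosen for this example.

```python
import sys

PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(ch: str) -> int:
    """Return the number (0-16) of the Unicode plane containing character ch."""
    return ord(ch) // PLANE_SIZE


# Seventeen planes together span code points U+0000 through U+10FFFF.
total_code_points = 17 * PLANE_SIZE
highest_code_point = sys.maxunicode  # 0x10FFFF in Python 3.3 and later

planes = {
    "latin letter": plane_of("A"),                     # BMP
    "musical symbol": plane_of("\U0001D11E"),          # SMP (G clef)
    "rare CJK ideograph": plane_of("\U00020000"),      # SIP
    "variation selector": plane_of("\U000E0100"),      # SSP
    "private use": plane_of("\U000F0000"),             # first Private Use Plane
}
```

Integer division by `0x10000` is equivalent to shifting the code point right by 16 bits, which matches the "16 bits per plane" description above.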

Usually, you will mostly be dealing with characters in the BMP, and you should use UTF-8 as the standard character encoding throughout your entire application. UTF-8 can encode every plane, so characters outside the BMP are covered as well.
