What is the difference between UTF-8 and Unicode?

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page.

They are the same thing, aren't they? Can someone clarify?

Numeration answered 13/3, 2009 at 17:6 Comment(5)
What this WIKI writes about unicode and the UTFs is ok in my opinion. Some comments on it are weird: "It is possible in UTF-8 (or any other multi-byte encoding) to split or truncate a string in the middle of a character, which may result in an invalid string." So a string that gets UTF-8 encoded is no longer a string but a byte array or byte stream. The characters that make up the string get encoded. Of course it can be decoded as well. Now of course you can cut a UTF-8 sequence after the start byte or after a following byte, but why would someone do this?Ambiversion
This article about string data types is educational: mortoray.com/2013/11/27/the-string-type-is-broken -- sometimes when working with strings and their byte-level components, you may inadvertently chop a character in half.Espinosa
@Ambiversion If that byte stream is being transmitted over a network that packetises it then the string could get split across two packets - i.e. at a place other than a UTF-8 boundary (i.e. where the next byte is not one with MSBits of 0, 110, 1110, 11110 or 10)...Grecize
@Grecize Do you talk about a byte stream or a string? Of course a byte stream can be split across two packets, but it's the job of TCP to re-create the puzzle at the destination, e.g. every packet has its sequence number and the receiver acknowledges each packet received. Of course, if a TCP/IP session gets disconnected ungracefully, only parts of a - let's say UTF-8 encoded - byte stream arrive at the destination.Ambiversion
Both! I code mainly for a MUD client application and, in the absence of additional (so-called "Go-Ahead" or "End-of-record") signalling, packets can and do get split as they traverse the Internet - if the client does not wait long enough for any further packets...Grecize

To expand on the answers others have given:

We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.

Computers deal with such numbers as bytes... skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.

Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, numbers and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this expansion would be mapped to different characters depending on the language being displayed. The ISO 8859 family of standards is the most common form of this mapping; ISO-8859-1 (also known as ISO-Latin-1 or latin1) and its later revision ISO-8859-15 are the variants most often seen for Western European languages, and there are many more parts of the ISO 8859 standard besides those two.

But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work.

There are essentially two different types of encodings: one expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.

The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings for this are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. The standard then defines some of these bits as flags: if they're set, then the next unit in a sequence of units is to be considered part of the same character. If they're not set, this unit represents one character fully. Thus the most common (English) characters only occupy one byte in UTF-8 (two in UTF-16, 4 in UTF-32), but characters from other scripts can occupy up to four bytes (the original UTF-8 design allowed sequences of up to six bytes, but RFC 3629 restricted UTF-8 to four).
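
To see the variable unit counts in practice, here is a minimal sketch (my own illustration, assuming Python 3; the '-be' codecs are used so no byte-order mark is added) that prints how many bytes a few characters need under each UTF encoding:

    # How many bytes one character needs in each UTF encoding
    for ch in ["A", "é", "汉", "😀"]:
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")   # big-endian, no byte-order mark
        utf32 = ch.encode("utf-32-be")
        print(f"U+{ord(ch):04X}: UTF-8={len(utf8)}, UTF-16={len(utf16)}, UTF-32={len(utf32)} bytes")
    # U+0041: UTF-8=1, UTF-16=2, UTF-32=4 bytes
    # U+00E9: UTF-8=2, UTF-16=2, UTF-32=4 bytes
    # U+6C49: UTF-8=3, UTF-16=2, UTF-32=4 bytes
    # U+1F600: UTF-8=4, UTF-16=4, UTF-32=4 bytes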

Multi-byte encodings (I should say multi-unit after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the characters to Unicode code points before such operations can be performed (there are some shortcuts, though).

Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. And that's your relationship between them.

Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communications protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 can give better average space/processing performance for text dominated by East Asian scripts, although for mixed or markup-heavy text UTF-8 is usually at least as compact.

The Unicode standard defines far fewer code points than can be represented in 32 bits. Thus for all practical purposes, UTF-32 and UCS4 are the same encoding: every code point fits in a single 32-bit unit, so you never have to deal with multi-unit sequences in UTF-32.

Hope that fills in some details.

Perpetuity answered 13/3, 2009 at 17:37 Comment(11)
Conceptually, UCS-2 and UCS-4 are character sets, not character encodings (hence the name).Astromancy
@Tuukka Errors in this posting are legion. There are more than just 2 versions of ISO 8859. ASCII didn’t work for English, missing things like curly quotes, cent signs, accents,& a whole lot more—Unicode is not just about non-English; English needs it, too!! No codepoints occupy more than 4 bytes in ANY encoding; this 6-byte business is flat-out wrong. You can’t UTF-encode any Unicode scalar value as this says: surrogates & the 66 other noncharacters are all forbidden. UCS-4 and UTF-32 aren’t the same. There is no multi-unit UTF-32. UTF-16 is not as efficient as they pretend — &c&c&c!Seaden
ASCII also does not contain the pound sign £, and of course does not contain the euro sign € (which is considerably younger than ASCII).Opiumism
I see that here we are stating UTF-8 uses just 8 bits. Wiki Says en.wikipedia.org/wiki/Unicode UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters.Door
@Seaden Looks that 6 bytes are not improbable after all. See this: joelonsoftware.com/articles/Unicode.html which denotes that there IS a character space from 0x04000000 to 0x7FFFFFFF, or in binary it's 1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv - and that's indeed 6 bytes. However, 6 bytes IS the maximum, and not as the article confusingly claims "six bytes or more".Noelnoelani
Ok. but please don't be too harsh. Because the majority still regards this article worth reading, no matter when it was written (2003, I know that.) Maybe you would like to elaborate on what is wrong in that article, as well? That way it will help open the eyes of many. — Ah, and to the "you of all people" thing: I frankly didn't know what a luminary you are, since I've never bothered reading your profile — up to now. Mea culpa. :) O'Reilly? Yes, really. :P Chapeau, Sire. :DNoelnoelani
The info about the flag bits was something I have been looking for but was hard to find. Very nice.Bridesmaid
@syntaxerror: " Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes." was accurate when written, but later that same year (twelve years ago) it was invalidated. en.wikipedia.org/wiki/UTF-8 says "The original specification covered numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences."Speedball
8-bit characters? You are not going back far enough. In the '60s, the norm was to have 6-bit characters. Lower case characters were not available. A quick history.Indiscrete
I don't know about the other errors that @TuukkaMustonen refers to, but UTF-32 is indeed fixed-length: "UTF-32 is a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings."Aspirant
UTF-32 is a fixed length codepoint encoding - but a codepoint is NOT a Grapheme - it can take more than one codepoint to represent a single "visual linguistic unit"...Grecize

Let me use an example to illustrate this topic:

A Chinese character:      汉
its Unicode value:        U+6C49
convert 6C49 to binary:   01101100 01001001

Nothing magical so far, it's very simple. Now, let's say we decide to store this character on our hard drive. To do that, we need to store the character in binary format. We can simply store it as is '01101100 01001001'. Done!

But wait a minute, is '01101100 01001001' one character or two characters? You knew this is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of encoding to tell the computer to treat it as one.

This is where the rules of UTF-8 come in: https://www.fileformat.info/info/unicode/utf8.htm

Binary format of bytes in sequence

1st Byte    2nd Byte    3rd Byte    4th Byte    Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx                                                7             007F hex (127)
110xxxxx    10xxxxxx                                (5+6)=11          07FF hex (2047)
1110xxxx    10xxxxxx    10xxxxxx                  (4+6+6)=16          FFFF hex (65535)
11110xxx    10xxxxxx    10xxxxxx    10xxxxxx    (3+6+6+6)=21          10FFFF hex (1,114,111)

According to the table above, if we want to store this character using the UTF-8 format, we need to prefix our character with some 'headers'. Our Chinese character is 16 bits long (count the binary value yourself), so we will use the format on row 3 above as it provides enough space:

Header  Place holder    Fill in our Binary   Result         
1110    xxxx            0110                 11100110
10      xxxxxx          110001               10110001
10      xxxxxx          001001               10001001

Writing out the result in one line:

11100110 10110001 10001001

This is the UTF-8 binary value of the Chinese character! See for yourself: https://www.fileformat.info/info/unicode/char/6c49/index.htm

Summary

A Chinese character:      汉
its Unicode value:        U+6C49
convert 6C49 to binary:   01101100 01001001
encode 6C49 as UTF-8:     11100110 10110001 10001001
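
You can reproduce this worked example with a short Python sketch (my own addition, assuming Python 3.8+ for bytes.hex with a separator):

    # Reproduce the worked example above
    ch = "\u6C49"                                 # 汉
    encoded = ch.encode("utf-8")
    print(encoded.hex(" "))                       # e6 b1 89
    print(" ".join(f"{b:08b}" for b in encoded))  # 11100110 10110001 10001001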

P.S. If you want to learn this topic in Python, click here.

Nock answered 14/1, 2015 at 9:7 Comment(10)
"But wait a minute, is '01101100 01001001' one character or two characters? You knew this is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of "encoding" to tell the computer to treat it as one." Well ok, but the computer still does not know it should encode it with utf-8 ?Underwrite
@KorayTugay The computer does not know what encoding it should use. You have to tell it when you save a character to a file and also when you read a character from a file.Nock
How does the computer know to choose the "format on row 3"? How does it know that the Chinese character needs 16 bits?Bronder
@Bronder The computer does not know what format to use. When you save the document, the text editor has to explicitly set its encoding to be utf-8 or whatever format the user wants to use. Also, when a text editor program reads a file, it needs to select a text encoding scheme to decode it correctly. Same goes when you are typing and entering a letter, the text editor needs to know what scheme you use so that it will save it correctly.Nock
So how are those headers interpreted? Looking at the first table: if a byte starts with bit 0, the character is represented by 1 byte (the current one); if it starts with 110, the character is represented by 2 bytes (the current one and the next one, i.e. the remaining bits after the 10 prefix); if it starts with 1110, the character is represented by 3 bytes, the current one and the next 2 bytes (the remaining bits after the 10 prefixes).Loriannlorianna
In UTF-8, most Chinese characters take 3 bytes each. A few take 4 bytes, hence have a Unicode "code point" bigger than 65K. New Emoji characters also need 4 bytes.Indiscrete
Read 10 articles on UTF-8; after reading this I understood within 10 seconds:)Spindly
If UTF-8 can encode at most 21 bits of payload into 32 bit result, why is there only 1,114,112 distinct payload values? Shouldn't it be 2^21, which is 2,097,152 in decimal?Abreaction
And why 2nd to 4th bytes need their own headers? Shouldn't the header of the first byte be enough to tell the computer to read n bytest as a single character?Abreaction
Found the answer to my second question hereAbreaction

"Unicode" is unfortunately used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them.

UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set. ASCII is encoded as a single byte per character, and other characters take more bytes depending on their exact code point (up to 4 bytes for all currently defined code points, i.e. up to U-0010FFFF, and indeed 4 bytes could cope with up to U-001FFFFF).

When "Unicode" is used as the name of a character encoding (e.g. as the .NET Encoding.Unicode property) it usually means UTF-16, which encodes most common characters as two bytes. Some platforms (notably .NET and Java) use UTF-16 as their "native" character encoding. This leads to hairy problems if you need to worry about characters which can't be encoded in a single UTF-16 value (they're encoded as "surrogate pairs") - but most developers never worry about this, IME.

Some references on Unicode:

Pyrophyllite answered 13/3, 2009 at 17:11 Comment(16)
I think UTF-16 only equals "Unicode" on Windows platforms. People tend to use UTF-8 by default on *nix. +1 though, good answerRene
I'll clarify that it's not a unicode standard but a subset of ISO 8859-1, and implemented as 1 byte unicodeDriftage
@Chris: No, ISO-8859-1 is not UTF-8. UTF-8 encodes U+0080 to U+00FF as two bytes, not one. Windows 1252 and ISO-8859-1 are mostly the same, but they differ between values 0x80 and 0x99 if I remember correctly, where ISO 8859-1 has a "hole" but CP1252 defines characters.Pyrophyllite
Some of your sources are out of date: UTF-8 uses a maximum of four bytes per character, not six. I believe it was reduced primarily to eliminate the "overlong forms" problem described by Markus Kuhn in his FAQ.Cytolysis
Alan: I originally had it as 4 (see edits) but then read the wrong bit of the document I was reading. Doh. U-04000000 – U-7FFFFFFF would take 6 bytes, but there are no characters above U-001FFFFF - at least at the moment...Pyrophyllite
Last I heard, the maximum Unicode code point is U+0010FFFF -- so there's even more room to grow. It's going to be a while before we have to graft surrogate pairs onto UTF-32, as the author of the accepted answer seems to think is the case. ;-)Cytolysis
The idea of calling UTF-16 "Unicode" sits uneasily with me due to its potential to confuse - even though this was clearly pointed out as a .NET convention only. UTF-16 is a way of representing Unicode, but it is not "The Unicode encoding".Postulate
@thomasrutter: It's not just a .NET convention. I've seen it in plenty of places. For example, open Notepad and do "Save As" and one of the encoding options is "Unicode". I know it's confusing and inaccurate, but it's worth being aware that it's in fairly widespread use for that meaning.Pyrophyllite
@Alan M: To quote myself: "The Unicode standard defines fewer code points than can be represented in 32 bits." The point is that the UTF family of encodings allow for surrogate pairs, while other encodings do not.Perpetuity
@unwesen: UTF-8 doesn't need surrogate pairs. It just represents non-BMP characters using progressively longer byte sequences.Pyrophyllite
@unwesen: My point was that, unlike UTF-8 and UTF-16, UTF-32 has always been a fixed-width encoding and always will be. Whether it's in the BMP or one of the supplemental planes, every code point is represented by exactly four bytes.Cytolysis
As for using "Unicode" to mean UTF-16, you're right, Jon: that's a Microsoft convention rather than a .NET convention, and I hate it too. This stuff is difficult enough to explain without MS exposing all its customers to this blatantly incorrect usage.Cytolysis
+1 for explaining the difference between character sets (UCS-4) and character encodings (UTF-8, -16, -32).Astromancy
In Unicode's own terminology, UTF stands for Unicode Transformation Format, so they prefer to say that UTF-8 is a transformation format than a character encoding, since the latter term has been made ambiguous by many people using it in multiple conflicting ways over the years.Shrinkage
@JonSkeet Hi. So the meaning here i.stack.imgur.com/MO0Cs.png ( when saving notepad file) is to "save the file with utf-16" ? cause ( as you said) unicode - is just a table of code points. whereas utf-x exists for "how" to store the unicode code point.....am I correct?Gobbler
@RoyiNamir: Yes, "Unicode" is unfortunately often used to mean "UTF-16" particularly in Windows.Pyrophyllite

They're not the same thing - UTF-8 is a particular way of encoding Unicode.

There are lots of different encodings you can choose from depending on your application and the data you intend to use. The most common are UTF-8, UTF-16 and UTF-32, as far as I know.

Billye answered 13/3, 2009 at 17:9 Comment(1)
However, the point is that some editors propose to save the file as "Unicode" OR "UTF-8". So I believe it is necessary to mention that "Unicode" in that case means UTF-16.Traci

Unicode only defines code points, that is, a number which represents a character. How you store these code points in memory depends on the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others.

Fullblown answered 13/3, 2009 at 17:14 Comment(2)
However, the point is that some editors propose to save the file as "Unicode" OR "UTF-8". So I believe it is necessary to mention that "Unicode" in that case means UTF-16.Traci
ASCII also assigns a number to each character, just like Unicode does.Ambiversion

Unicode is a standard that defines, along with ISO/IEC 10646, the Universal Character Set (UCS), which is a superset of all existing characters required to represent practically all known languages.

Unicode assigns a Name and a Number (Character Code, or Code-Point) to each character in its repertoire.

UTF-8 encoding is a way to represent these characters digitally in computer memory. UTF-8 maps each code point into a sequence of octets (8-bit bytes).

For example,

UCS Character = Unicode Han Character 𤭢

UCS code-point = U+24B62

UTF-8 encoding = F0 A4 AD A2 (hex) = 11110000 10100100 10101101 10100010 (bin)
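
As a quick sanity check, the same mapping can be reproduced in Python (a sketch of mine, assuming Python 3.8+):

    # Verify the 4-byte UTF-8 encoding of U+24B62
    ch = chr(0x24B62)
    encoded = ch.encode("utf-8")
    print(encoded.hex(" ").upper())               # F0 A4 AD A2
    print(" ".join(f"{b:08b}" for b in encoded))  # 11110000 10100100 10101101 10100010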

Augustaugusta answered 24/2, 2013 at 18:36 Comment(6)
No, UTF-8 maps only code points greater than 127 into a multi-byte sequence. Everything from 0 to 127 is not a sequence but a single byte. B.t.w., ASCII also assigns each character a name and a number, so this is the same as what Unicode does. But Unicode doesn't stop at code point 127; it goes up to 0x10ffff.Ambiversion
@brightly I differ. Ascii characters are indeed mapped to a single byte sequence. The first bit, which is 0 in the case of code for ascii characters, indicates how many bytes follow - zero. http://www.wikiwand.com/en/UTF-8#/Description Have a look at the first row.Augustaugusta
Well, for me a sequence consists of more than one byte. An ASCII character within UTF-8 is a single byte as is, with the most significant bit set to 0. Code points higher than 127 then need sequences, which always have a start byte and one, two or three following bytes. So why would you call a single byte a "sequence"?Ambiversion
Well... Many times English language lawyers can get baffled over it's intentional misuse in software. It's the same case here. You can argue over it. But that won't make it any clearer.Augustaugusta
@Ambiversion Hmmm, In mathematics, a sequence of 0 elements its OK. A sequence of 1 element is fine here too.Armoury
@chux When using utf-8 and storing just a single byte that makes up an ASCII character, we still can call it an utf-8 sequence of course. Indeed, utf-8 is often explained as a byte-sequence, no matter how many characters the sequence contains.Ambiversion

UTF-8 is an encoding scheme for Unicode text. It is becoming the best supported and best known text encoding for Unicode text in many contexts, especially the web, and is the text encoding used by default in JSON and XML.

Unicode is a broad-scoped standard which defines over 149,000 characters and allocates each a numerical code (a code point). It also defines rules for how to sort this text, normalise it, change its case, and more. A character in Unicode is represented by a code point from zero up to 0x10FFFF inclusive, though some code points are reserved and cannot be used for characters.

There is more than one way that a string of Unicode code points can be encoded into a binary stream. These are called "encodings". The most straightforward encoding is UTF-32, which simply stores each code point as a 32-bit integer, with each being 4 bytes wide. Since code points only go up to 0x10FFFF (requiring 21 bits), this encoding is somewhat wasteful.

UTF-8 is another encoding, and has gained its popularity over other encodings due to a number of advantages. UTF-8 encodes each code point as a sequence of either 1, 2, 3 or 4 byte values. Code points in the ASCII range are encoded as a single byte value, leaving them fully compatible with ASCII. Code points outside this range use either 2, 3, or 4 bytes each, depending on what range they are in.

UTF-8 has been designed with these properties in mind:

  • Characters also present in the ASCII encoding are encoded in exactly the same way as they are in ASCII, such that any ASCII string is naturally also a valid UTF-8 string representing the same characters.

  • More efficient: Text strings in UTF-8 almost always occupy less space than the same strings in either UTF-32 or UTF-16, with just a few exceptions.

  • Binary sorting: Sorting UTF-8 strings using a binary sort will still result in all code points being sorted in numerical order.

  • When a code point uses multiple bytes, none of those bytes (not even the first one) contain values in the ASCII range, ensuring that no part of them could be mistaken for an ASCII character. This is a feature that is important for security, especially when using UTF-8 encoded text in systems originally designed for 8-bit encodings.

  • UTF-8 can be easily validated to verify that it is valid UTF-8. Text in other 8-bit or multi-byte encodings will very rarely also validate as UTF-8 by chance due to the very specific structure of UTF-8.

  • Random access: At any point in a UTF-8 string it is possible to tell if the byte at that position is the first byte of a character or not, and to find the start of the next or current character, without needing to scan forwards or backwards more than 3 bytes or to know how far into the string we started reading from.
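
A small Python sketch (my own illustration, assuming Python 3) of the random-access property above: continuation bytes always look like 10xxxxxx, so from any byte offset you can step back at most three bytes to find the start of the current character:

    # Find the start of the UTF-8 character containing byte offset i
    def char_start(data: bytes, i: int) -> int:
        # Continuation bytes have the form 10xxxxxx (0x80..0xBF); lead bytes do not.
        while i > 0 and (data[i] & 0xC0) == 0x80:
            i -= 1
        return i

    data = "a汉😀".encode("utf-8")   # 1 + 3 + 4 bytes
    print(char_start(data, 2))       # 1 (inside 汉, which starts at byte 1)
    print(char_start(data, 6))       # 4 (inside 😀, which starts at byte 4)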

Postulate answered 26/9, 2017 at 5:5 Comment(6)
A couple of minor points: [1] Shouldn't "ASCII characters are encoded exactly as they are in ASCII" be changed to "ASCII characters are encoded exactly as they are in UTF-8"? [2] The phrase "The codes in Unicode..." is unclear (to me). Do you mean "Unicode code points..."?Perrone
@Perrone for point 1, I meant that the encoding of characters within the ASCII range is identical for ASCII and for UTF-8.Postulate
For point 2, that's a fair point and I'll edit that to make it clearerPostulate
Re: your recent edit, tonsky.me/blog/unicode cites 170,000 allocated code points.Takahashi
@Takahashi I think this comes down to the difference between code points and characters - your figure probably includes code points used for things like private use or surrogate encodings, whereas mine is just characters. My source is the unicode.org faq which is correct as of Unicode 15Postulate
For your interest, character counts are here unicode.org/versions/stats/charcountv15_0.htmlPostulate

Unicode is just a standard that defines a character set (UCS) and encodings (UTF) to encode this character set. But in general, "Unicode" refers to the character set and not the standard.

Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Unicode In 5 Minutes.

Romano answered 13/3, 2009 at 17:37 Comment(2)
@serhio: I know. Although there are three different UTF-16 encodings: The two explicit UTF-16LE and UTF-16BE and the implicit UTF-16 where the endianness is specified with a BOM.Romano
@Gumbo: The lack of a BOM does not mean it's a different encoding. There's only two encodings.Speedball

The existing answers already explain a lot of details, but here's a very short answer with the most direct explanation and example.

Unicode is the standard that maps characters to codepoints.
Each character has a unique codepoint (identification number), which is a number like 9731.

UTF-8 is an encoding of the code points.
In order to store all characters on disk (in a file), UTF-8 splits characters into up to 4 octets (8-bit sequences) - bytes. UTF-8 is one of several encodings (methods of representing data). For example, in Unicode, the (decimal) code point 9731 represents a snowman (☃), which is encoded as 3 bytes in UTF-8: E2 98 83
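
To connect the decimal code point, the hexadecimal notation and the bytes, a quick Python sketch (my own addition, assuming Python 3.8+):

    # Decimal code point 9731 -> U+2603 -> UTF-8 bytes
    print(hex(9731))                                    # 0x2603
    print(chr(9731))                                    # ☃
    print(chr(9731).encode("utf-8").hex(" ").upper())   # E2 98 83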

Here's a sorted list with some random examples.

Slobbery answered 19/5, 2014 at 13:57 Comment(0)

1. Unicode

There are lots of characters around the world, like "$,&,h,a,t,?,张,1,=,+...".

Then there comes an organization that is dedicated to these characters.

They made a standard called "Unicode".

The standard is as follows:

  • create a form in which each position is called a "code point", or "code position".
  • The whole range of positions runs from U+0000 to U+10FFFF;
  • Up until now, some positions are filled with characters, and other positions are reserved or empty.
  • For example, the position "U+0024" is filled with the character "$".

PS: Of course there's another organization called ISO maintaining another standard -- "ISO 10646" -- which is nearly the same.

2. UTF-8

As above, U+0024 is just a position, so we can't save "U+0024" in a computer for the character "$".

There must be an encoding method.

Then there come encoding methods, such as UTF-8, UTF-16, UTF-32, UCS-2....

Under UTF-8, the code point "U+0024" is encoded into 00100100.

00100100 is the value we save in the computer for "$".
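
A short Python sketch (my own addition, assuming Python 3.8+) showing that the ASCII range really is stored as a single byte under UTF-8, while other characters need more:

    # "$" (U+0024) is one byte in UTF-8, "张" is three
    print("$".encode("utf-8"))             # b'$' -> the single byte 0x24 = 00100100
    print(format(ord("$"), "08b"))         # 00100100
    print("张".encode("utf-8").hex(" "))   # e5 bc a0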

Dramshop answered 5/1, 2015 at 9:28 Comment(2)
In general, UTF-8 is the only variant anyone uses today.Indiscrete
ISO 10646 is an identical standard to the Unicode character set. Unicode defines a lot of things other than the character set, such as rules for sorting, cases, etc. ISO 10646 is just the character set (of which there are currently over 130,000). The Unicode Consortium and ISO develop Unicode jointly, with ISO concerned only with the character set and its encodings, and Unicode also defining character properties and rules for processing text.Postulate

My explanation, after reading numerous posts and articles about this topic:

1 - The Unicode Character Table

"Unicode" is a giant table, that is 21bits wide, these 21bits provide room for 1,114,112 code points / values / fields / places to store characters in.

Out of those 1,114,112 code points, 1,111,998 are able to store Unicode characters, because there are 2048 code points reserved as surrogates, and 66 code points reserved as non-characters. So, there are 1,111,998 code points that can store a unique character, symbol, emoji and etc.

However, as of now, only 144,697 out of those 1,114,112 code points have been used. These 144,697 code points contain characters that cover all of the languages, as well as symbols, emojis, etc.

Each character in "Unicode" is assigned to a specific code point, aka has a specific value / Unicode number. For example, the character "❤" uses exactly one code point out of the 1,114,112 code points. It has the value (aka Unicode number) of "U+2764". This is a hexadecimal code point that fits in two bytes, which in binary is represented as 00100111 01100100. But to represent this code point, the UTF-8 encoding uses 3 bytes (24 bits), which in binary is 11100010 10011101 10100100 (the spaces between the bytes are there for readability only and are not part of the encoding).

Now, how is our computer supposed to know if those 3 bytes "11100010 10011101 10100100" are to be read separately or together? If those 3 bytes were read separately and each interpreted under a legacy single-byte encoding, the result would be three unrelated characters, which is quite a difference compared to our heart emoji "❤".

2 - Encoding Standards (UTF-8, ISO-8859, Windows-1251 and etc)

In order to solve this problem people have invented the encoding standards, the most popular one since 2008 being UTF-8. UTF-8 accounts for an average of 97.6% of all web pages; that is why we will use UTF-8 for the example below.

2.1 - What is Encoding?

Encoding, simply put, means to convert something from one thing to another. In our case we are converting data, more specifically bytes, to the UTF-8 format. I would also like to rephrase that sentence as: "converting bytes to UTF-8 bytes", although it might not be technically correct.

2.2 Some information about the UTF-8 format, and why it's so important

UTF-8 uses a minimum of 1 byte to store a character and a maximum of 4 bytes. Thanks to the UTF-8 format we can have characters which take more than 1 byte of information.

This is very important, because if it was not for the UTF-8 format, we would not be able to have such a vast diversity of alphabets, since the letters of some alphabets can't fit into 1 byte. We also wouldn't have emojis at all, since each one requires at least 3 bytes. I am pretty sure you got the point by now, so let's continue forward.

2.3 Example of Encoding a Chinese character to UTF-8

Now, let's say we have the Chinese character "汉".

This character takes exactly 16 binary bits "01101100 01001001", thus, as we discussed above, we cannot read this character unless we encode it to UTF-8, because the computer will have no way of knowing if these 2 bytes are to be read separately or together.

Converting this "汉" character's 2 bytes into what I like to call UTF-8 bytes will result in the following:

(Normal Bytes) "01101100 01001001" -> (UTF-8 Encoded Bytes) "11100110 10110001 10001001"

Now, how did we end up with 3 bytes instead of 2? How is that supposed to be UTF-8 Encoding, turning 2 bytes into 3?

In order to explain how the UTF-8 encoding works, I am going to literally copy the reply of @MatthiasBraun, a big shoutout to him for his terrific explanation.

2.4 How does the UTF-8 encoding actually work?

What we have here is the template for Encoding bytes to UTF-8. This is how Encoding happens, pretty exciting if you ask me!

Now, take a good look at the table below and then we are going to go through it together.

        Binary format of bytes in sequence:

        1st Byte    2nd Byte    3rd Byte    4th Byte    Number of Free Bits   Maximum Expressible Unicode Value
        0xxxxxxx                                                7             007F hex (127)
        110xxxxx    10xxxxxx                                (5+6)=11          07FF hex (2047)
        1110xxxx    10xxxxxx    10xxxxxx                  (4+6+6)=16          FFFF hex (65535)
        11110xxx    10xxxxxx    10xxxxxx    10xxxxxx    (3+6+6+6)=21          10FFFF hex (1,114,111)
  1. The "x" characters in the table above represent the number of "Free Bits", those bits are empty and we can write to them.

  2. The other bits are reserved for the UTF-8 format, they are used as headers / markers. Thanks to these headers, when the bytes are being read using the UTF-8 encoding, the computer knows, which bytes to read together and which separately.

  3. The byte size of your character, after being encoded using the UTF-8 format, depends on how many bits you need to write.

  • In our case the "汉" character is exactly 2 bytes or 16 bits:

  • "01101100 01001001"

  • thus the size of our character after being encoded to UTF-8 will be 3 bytes or 24 bits

  • "11100110 10110001 10001001"

  • because "3 UTF-8 bytes" have 16 Free Bits, which we can write to

  4. Solution, step by step below:

2.5 Solution:

        Header  Place holder    Fill in our Binary   Result         
        1110    xxxx            0110                 11100110
        10      xxxxxx          110001               10110001
        10      xxxxxx          001001               10001001 

2.6 Summary:

        A Chinese character:      汉
        its Unicode value:        U+6C49
        convert 6C49 to binary:   01101100 01001001
        encode 6C49 as UTF-8:     11100110 10110001 10001001

3 - The difference between UTF-8, UTF-16 and UTF-32

Original explanation of the difference between the UTF-8, UTF-16 and UTF-32 encodings: https://javarevisited.blogspot.com/2015/02/difference-between-utf-8-utf-16-and-utf.html

The main difference between UTF-8, UTF-16, and UTF-32 character encodings is how many bytes they require to represent a character in memory:

UTF-8 uses a minimum of 1 byte, but if the character is bigger, then it can use 2, 3 or 4 bytes. UTF-8 is also compatible with the ASCII table.

UTF-16 uses a minimum of 2 bytes. UTF-16 can not take 3 bytes, it can either take 2 or 4 bytes. UTF-16 is not compatible with the ASCII table.

UTF-32 always uses 4 bytes.

Remember: UTF-8 and UTF-16 are variable-length encodings, where UTF-8 can take 1 to 4 bytes, while UTF-16 takes either 2 or 4 bytes. UTF-32 is a fixed-width encoding; it always takes 32 bits.
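
To illustrate the difference, a small Python sketch (my own addition, assuming Python 3; the '-be' codecs avoid adding a byte-order mark) comparing how many bytes the same short string needs under each encoding:

    # The same string under UTF-8, UTF-16 and UTF-32
    s = "A汉😀"
    print(len(s.encode("utf-8")))       # 8  = 1 + 3 + 4 bytes
    print(len(s.encode("utf-16-be")))   # 8  = 2 + 2 + 4 bytes
    print(len(s.encode("utf-32-be")))   # 12 = 4 + 4 + 4 bytes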

Diabolize answered 15/1, 2022 at 1:23 Comment(2)
How can you find that it is exactly 2 bytes or 16 bits: 01101100 01001001, in Unicode? Can I see the table?Headsman
This is much more confusing than it needs to be. I was lost already at »The value "U+2764" looks like that in binary: "11100010 10011101 10100100"«. First of all it’s not obvious that 2764 is hexadecimal. Secondly in binary that should be something like 00100111 01100100. Really that sentence should say »U+2764 is a hexadecimal codepoint consisting of two bytes. To represent this codepoint, the UTF-8 encoding however uses 3 bytes. How and why UTF-8 gets from 00100111 01100100 to 11100010 10011101 10100100 is explained in the following.«Apostles

This article explains all the details http://kunststube.net/encoding/

WRITING TO BUFFER

if you write the symbol あ to a 4 byte buffer with UTF8 encoding, your binary will look like this:

00000000 11100011 10000001 10000010

if you write the symbol あ to a 4 byte buffer with UTF16 encoding, your binary will look like this:

00000000 00000000 00110000 01000010

As you can see, depending on what language your content uses, this will affect your memory accordingly.

e.g. For this particular symbol あ, UTF16 encoding is more efficient since we have 2 spare bytes to use for the next symbol. But it doesn't mean that you must use UTF16 for Japanese text.

READING FROM BUFFER

Now if you want to read the above bytes, you have to know in what encoding it was written to and decode it back correctly.

e.g. If you decode these bytes: 00000000 11100011 10000001 10000010 as UTF16, you will not end up with あ but with an unrelated character.

Note: Encoding and Unicode are two different things. Unicode is the big table with each symbol mapped to a unique code point, e.g. the symbol (letter) あ has the code point 30 42 (hex). Encoding, on the other hand, is an algorithm that converts symbols to a more suitable representation for storing on hardware.

30 42 (hex) -> UTF8 encoding -> E3 81 82 (hex), which is the first result above, in binary.

30 42 (hex) -> UTF16 encoding -> 30 42 (hex), which is the second result above, in binary.
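
The same round trip as a Python sketch (my own addition, assuming Python 3.8+):

    # あ (U+3042) under UTF-8 and UTF-16
    ch = "\u3042"                            # あ
    print(ch.encode("utf-8").hex(" "))       # e3 81 82
    print(ch.encode("utf-16-be").hex(" "))   # 30 42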


Malmsey answered 12/10, 2019 at 4:30 Comment(2)
For that Chinese character in UTF-8, why is saved as 3 bytes and not 2, in the same format as UTF-16?Brunella
finally, a good articleKirovograd

If I may summarise what I gathered from this thread:

Unicode assigns characters to ordinal numbers (in decimal form). (These numbers are called code points.)

à -> 224

UTF-8 is an encoding that 'translates' these ordinal numbers (in decimal form) to binary representations.

224 -> 11000011 10100000

Note that we're talking here about the UTF-8 representation of the code point 224, which is different from the plain binary form of the number 224, namely 0b11100000.
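
In Python (a sketch of mine, assuming Python 3) the two steps look like this:

    # Code point of à, and its UTF-8 byte sequence
    print(ord("à"))                               # 224
    encoded = "à".encode("utf-8")
    print(" ".join(f"{b:08b}" for b in encoded))  # 11000011 10100000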

Lithotomy answered 18/7, 2019 at 7:17 Comment(0)

I have checked the links in Gumbo's answer, and I wanted to paste some part of those things here to exist on Stack Overflow as well.

"...Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole other story..."

"...Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041...."

"...OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message..."

"...That's where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ? ..."

Imidazole answered 30/5, 2011 at 9:37 Comment(1)
In ASCII, a letter maps to a code point too, not just in Unicode.Ambiversion

They are the same thing, aren't they?

No, they aren't.


I think the first sentence of the Wikipedia page you referenced gives a nice, brief summary:

UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.

To elaborate:

  • Unicode is a standard, which defines a map from characters to numbers, the so-called code points, (like in the example below). For the full mapping, you can have a look here.

    ! -> U+0021 (21),  
    " -> U+0022 (22),  
    # -> U+0023 (23)
    
  • UTF-8 is one of the ways to encode these code points in a form a computer can understand, aka bits. In other words, it's a way/algorithm to convert each of those code points to a sequence of bits or convert a sequence of bits to the equivalent code points. Note that there are a lot of alternative encodings for Unicode.


Joel gives a really nice explanation and an overview of the history here.

Recalescence answered 11/1, 2018 at 19:12 Comment(0)

UTF-8 is a method for encoding Unicode characters using 8-bit sequences.

Unicode is a standard for representing a great variety of characters from many languages.

Kazak answered 26/1, 2018 at 13:35 Comment(3)
"8-bit sequences"…? Might want to specify that more precisely…Extension
"8-bit sequences" mean, it can be presented 8bit format. Like these, 01000001 , or 11010011 10000101 or 11100101 10100011 10000110 or 11110001 10110001 10000010 10110001 . As you can see, when it comes to UTF-8, it can be minimum 1 byte, maximum 4byte.Headsman
Notice that when you want to use 1 byte, the first bit is 0. When you want to use 2 bytes, the first 3 bits are 110. When you want to use 3 bytes, the first 4 bits are 1110. When you want to use 4 bytes, the first 5 bits are 11110. Huh, are you getting it? :)Headsman

A simple answer that gets straight to the point:

  • Unicode is a standard for representing characters from many human languages.
  • UTF-8 is a method for encoding Unicode characters.

* I'm overlooking the inner workings of UTF-8 on purpose.

Jello answered 10/11, 2021 at 21:52 Comment(1)
this really answers the question in terms of the concept of Unicode vs UTF-8 and their roles.Perusal

So you end up here usually from Google, and want to try different stuff.
But how do you print and convert all these character sets?

Here I list a few useful one-liners.

In Powershell:

# Print character with the Unicode point (U+<hexcode>) using this: 
[char]0x2550

# With Python installed, you can print the unicode character from U+xxxx with:
python -c 'print(u"\u2585")'

If you have more Powershell trix or shortcuts, please comment.

In Bash, you'd appreciate the iconv, hexdump and xxd from the libiconv and util-linux packages (probably named differently on other *nix distros.)

# To print the 3-byte hex code for a Unicode character:
printf "\\\x%s" $(printf '═'|xxd -p -c1 -u)
#\xE2\x95\x90

# To print the Unicode character represented by hex string:
printf '\xE2\x96\x85'
#▅

# To convert from UTF-16LE to Unicode
echo -en "════"| iconv -f UTF-16LE -t UNICODEFFFE

# To convert a string into hex: 
echo -en '═�'| xxd -g 1
#00000000: e2 95 90 ef bf bd

# To convert a string into binary:
echo -en '═�\n'| xxd -b
#00000000: 11100010 10010101 10010000 11101111 10111111 10111101  ......
#00000006: 00001010

# To convert a binary string into hex:
printf  '%x\n' "$((2#111000111000000110000010))"
#e38182

Indetermination answered 4/1, 2022 at 14:50 Comment(0)

Does Unicode just mean UTF-8?

No, but ...

If you're even asking the question, you can pretend they're the same, and that will probably be good enough for now. (This is even more true in 2024 than it was in 2009.)

If you're going to be writing code, then you should treat Unicode as an opaque handle (pointer). Unicode just means text, and the actual physical encoding -- even how much memory you need -- is magic that you should never ever see.

Because if you do see it, you will be tempted to take short cuts, and they will seem to work, and the data corruption will be subtle enough that you won't catch it until it is too late to fix properly.

Asking about UTF-8 (or other physical encodings) is like asking which pixels are used to display an a. Even if you get the answer right, it will fail when someone zooms in, or changes the font.

That said, if you're sending text to another system, you do need to agree on the message format, and UTF-8 is probably the best default. Treat it as a magic token and plug it in as a constant wherever an API asks for an encoding and you can't simply pass through whatever magic name your own caller gave you. As a general rule, unless the document (or at least the code) is older than your computer, it will default to UTF-8, or at least a (possibly mislabeled) close variant.

Personally, I really want to know how the stuff is represented in memory. It is hard to trust an API if I can't see what it is doing. If you also suffer from this problem, then the next step is to read the code, looking for terms like "encoding", "encoder", "decoder", "codec", "character", and "charset". ("char" will produce a lot of false alarms.)

If you want to know how to do it right, perhaps in a new system, then ... well, you're sort of out of luck. The first approximation is to treat "Unicode" as dark magic, and to use UTF-8 as the encoding for talking to the rest of the world. The next level is to dig in to the actual Unicode specification. Do not be tempted by shorter/simpler explanations of UTF-8; that way lies subtle data corruption.

And how do you find the Unicode specification? Uh ... that also turns out to be surprisingly complicated, but your difficulties will be good preparation for the standard itself. Some parts will seem bizarrely complicated. Some parts will seem insane. Some parts will actually be insane, because of logically inconsistent requirements. But as a start at finding the standard, https://www.unicode.org/versions/Unicode15.1.0/ provides the now-current standard. Except that you really do have to look at the various annexes and technical reports and such. https://www.unicode.org/reports/

Given your interest in UTF-8, you're probably interested in Section 3.9, Unicode Encoding Forms, which you can find from the table of contents by happening to know that it is part of Section 3, Conformance. Or maybe by going to the superseded Unicode Standard Annex 19 (UAX #19, tr19) and noting what it was superseded by. Come to think of it, you might just want to look up UTF-8, UTF-16, and UTF-32 elsewhere to get a general understanding first. But remember not to stop with those simpler explanations -- they do tend to simplify things in a way that permits (usually) subtle data corruption.

If you're wondering why this is so complicated, you might want to read about collation (trying to alphabetize "normally" is one of those logically impossible issues, particularly across languages) and canonicalization (because of course there are multiple canonical forms) and legacy charsets/character sets/encodings. Or take a peek at how to tell when you can add a line break, or how to tell which direction the writing goes, or look up the Turkish I, or ...

And if you needed to get some work done soon, go back to "Unicode is just an opaque handle (pointer), and I don't touch the physical layout directly."

Fungistat answered 4/2, 2024 at 19:28 Comment(0)
