What's the difference between encoding and charset?

I am confused about text encoding and charsets. For many reasons, I have to learn non-Unicode, non-UTF-8 stuff in my upcoming work.

I see the word "charset" in email headers, as in "ISO-2022-JP", but there's no such encoding in text editors. (I looked through several different text editors.)

What's the difference between text encoding and charset? I'd appreciate it if you could show me some use case examples.

Nauseous answered 17/2, 2010 at 14:55 Comment(1)
See this post: #13743750Pargeting

Basically:

  1. charset is the set of characters you can use
  2. encoding is a way these characters are stored in memory

People sometimes use charset to refer both to the character repertoire and the encoding scheme. The Unicode Standard charset has multiple encodings, e.g., UTF-8, UTF-16, UTF-32, UCS-4, UTF-EBCDIC, Punycode, and GB18030.
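
For example, here's a minimal Java sketch of that split (the class and method names are mine, purely for illustration): the string draws its character from the Unicode charset, and each encoding stores that same character as different bytes.

    import java.nio.charset.StandardCharsets;

    public class CharsetVsEncoding {
        public static void main(String[] args) {
            String s = "é";  // one character from the Unicode charset (U+00E9)

            // Two encodings of the same charset store that character differently:
            print(s.getBytes(StandardCharsets.UTF_8));     // C3 A9
            print(s.getBytes(StandardCharsets.UTF_16BE));  // 00 E9
        }

        // Print a byte array as space-separated, unsigned hex values.
        static void print(byte[] bytes) {
            for (byte b : bytes) System.out.printf("%02X ", b & 0xFF);
            System.out.println();
        }
    }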

Fahy answered 17/2, 2010 at 15:0 Comment(0)

Every encoding has a particular charset associated with it, but there can be more than one encoding for a given charset. A charset is simply what it sounds like, a set of characters. There are a large number of charsets, including many that are intended for particular scripts or languages.

However, we are well along in the transition to Unicode, whose character set is capable of representing almost all of the world's scripts. Even so, there are multiple encodings for Unicode. An encoding is a way of mapping a string of characters to a string of bytes. Examples of Unicode encodings include UTF-8, UTF-16 BE, and UTF-16 LE. Each of these has advantages for particular applications or machine architectures.
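
As a rough Java illustration of the BE/LE pair (illustrative class name, not from the answer), the two charsets map the same character to the same 16-bit value and differ only in which byte comes first:

    import java.nio.charset.StandardCharsets;

    public class Utf16ByteOrder {
        public static void main(String[] args) {
            String s = "A";  // U+0041 in the Unicode character set

            // The same code point, serialized with opposite byte orders:
            print(s.getBytes(StandardCharsets.UTF_16BE));  // 00 41
            print(s.getBytes(StandardCharsets.UTF_16LE));  // 41 00
        }

        // Print a byte array as space-separated, unsigned hex values.
        static void print(byte[] bytes) {
            for (byte b : bytes) System.out.printf("%02X ", b & 0xFF);
            System.out.println();
        }
    }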

Agronomy answered 17/2, 2010 at 14:59 Comment(3)
Note that javadoc wrongly uses "charset" instead of "encoding", for example in InputStreamReader, we read "An InputStreamReader is a bridge from byte streams to character streams: It reads bytes and decodes them into characters using a specified charset. The charset that it uses may be specified by name or may be given explicitly, or the platform's default charset may be accepted.". However, what they mean is "encoding".Flexible
Thanks for your explanation. Unicode is a character set; UTF-8 is one encoding of Unicode, and UTF-16 is another.Griego
Thanks! This answer helped my understanding much more than the (currently) accepted answer.Cantus

Throwing some more light on this for people visiting henceforth; hopefully it will be helpful.


Character Set

Each language has its own characters, and the collection of those characters forms the “character set” of that language. When a character is encoded, it is assigned a unique identifier, a number called a code point. In a computer, these code points are represented by one or more bytes.

Examples of character sets: ASCII (covers all English characters), ISO/IEC 646, and Unicode (covers characters from all living languages in the world).

Coded Character Set

A coded character set is a set in which a unique number is assigned to each character. That unique number is called a "code point".
Coded character sets are sometimes called code pages.

Encoding

Encoding is the mechanism that maps code points to bytes, so that a character can be read and written uniformly across different systems that use the same encoding scheme.

Examples of encodings: ASCII, and the Unicode encoding schemes such as UTF-8, UTF-16, and UTF-32.

Elaboration of the above three concepts

  • Consider this: the character 'क' in the Devanagari character set has a decimal code point of 2325, which will be represented by two bytes (09 15) when using the UTF-16 encoding.
  • In the “ISO-8859-1” encoding scheme, “ü” (which is simply a character in the Latin character set) is represented as the hexadecimal value FC, while in “UTF-8” it is represented as C3 BC, and in UTF-16 (big-endian, with a byte order mark) as FE FF 00 FC.
  • Different encoding schemes may use the same code point to represent different characters. For example, in “ISO-8859-1” (also called Latin-1) the decimal code point value for the letter ‘é’ is 233, while in ISO 8859-5 the same code point represents the Cyrillic character ‘щ’.
  • On the other hand, a single code point in the Unicode character set can be mapped to different byte sequences, depending on which encoding was used for the document. The Devanagari character क, with code point 2325 (0x0915 in hexadecimal notation), will be represented by two bytes when using the UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four bytes with UTF-32 (00 00 09 15), as the sketch after this list shows.
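
The values quoted in these bullets can be checked with a small Java sketch like the one below (the class name is mine; it also assumes the JRE provides a UTF-32BE charset, which stock JDK builds do):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class CodePointToBytes {
        public static void main(String[] args) {
            String ka = "क";  // Devanagari KA, code point U+0915 (decimal 2325)
            print(ka.getBytes(StandardCharsets.UTF_16BE));    // 09 15
            print(ka.getBytes(StandardCharsets.UTF_8));       // E0 A4 95
            print(ka.getBytes(Charset.forName("UTF-32BE")));  // 00 00 09 15

            String ue = "ü";  // U+00FC, also a character of the Latin-1 charset
            print(ue.getBytes(StandardCharsets.ISO_8859_1));  // FC
            print(ue.getBytes(StandardCharsets.UTF_8));       // C3 BC
        }

        // Print a byte array as space-separated, unsigned hex values.
        static void print(byte[] bytes) {
            for (byte b : bytes) System.out.printf("%02X ", b & 0xFF);
            System.out.println();
        }
    }
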
Dexterous answered 8/10, 2015 at 23:33 Comment(0)

A character encoding consists of:

  1. The set of supported characters
  2. A mapping between characters and integers ("code points")
  3. How code points are encoded as a series of "code units" (e.g., 16-bit units for UTF-16)
  4. How code units are encoded into bytes (e.g., big-endian or little-endian)

Step #1 by itself is a "character repertoire" or abstract "character set", and #1 + #2 = a "coded character set".

But back before Unicode became popular, when everyone (except East Asians) was using a single-byte encoding, steps #3 and #4 were trivial (code point = code unit = byte). Thus, older protocols didn't clearly distinguish between "character encoding" and "coded character set", and they use charset when they really mean encoding.
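
Here is a rough Java sketch of steps #2 to #4 (illustrative class name), using a character outside the Basic Multilingual Plane so that a single code point needs two UTF-16 code units:

    import java.nio.charset.StandardCharsets;

    public class CodeUnitsAndBytes {
        public static void main(String[] args) {
            String s = "\uD834\uDD1E";  // MUSICAL SYMBOL G CLEF (𝄞)

            // Step #2: the character corresponds to a single code point
            System.out.printf("U+%X%n", s.codePointAt(0));       // U+1D11E
            System.out.println(s.codePointCount(0, s.length())); // 1

            // Step #3: that code point takes two 16-bit UTF-16 code units (a surrogate pair)
            System.out.println(s.length());                                          // 2
            System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1));  // D834 DD1E

            // Step #4: the code units are serialized to bytes, big- or little-endian
            print(s.getBytes(StandardCharsets.UTF_16BE));  // D8 34 DD 1E
            print(s.getBytes(StandardCharsets.UTF_16LE));  // 34 D8 1E DD
        }

        // Print a byte array as space-separated, unsigned hex values.
        static void print(byte[] bytes) {
            for (byte b : bytes) System.out.printf("%02X ", b & 0xFF);
            System.out.println();
        }
    }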

Peltry answered 23/6, 2010 at 5:29 Comment(1)
Would that be why we can read charset='utf-8' in the HTML META tag? Because it was defined long ago.Berga

A character set, or character repertoire, is simply a set (an unordered collection) of characters. A coded character set assigns an integer (a "code point") to each character in the repertoire. An encoding is a way of representing code points unambiguously as a stream of bytes.

Diarrhoea answered 17/2, 2010 at 15:1 Comment(2)
This should be the accepted answer. It clearly defines three concepts: character set, coded character set, and encoding.Roundtree
Given that we are making sure not to allow ambiguity due to poorly chosen English terms, we should stick to not using the term "byte", since a byte doesn't have a fixed size (at least not in C). If you mean 8-bit units, then please use the term "octets" instead of "bytes".Load

Googled for it. http://en.wikipedia.org/wiki/Character_encoding

The difference seems to be subtle. The term charset actually doesn't apply to Unicode. Unicode goes through a series of abstractions: abstract characters -> code points -> encoding of code points to bytes.

Charsets actually skip this and jump directly from characters to bytes: sequence of bytes <-> sequence of characters.

In short:
encoding: code points -> bytes
charset: characters -> bytes

Priesthood answered 17/2, 2010 at 15:15 Comment(1)
Further confusion is "charset" vs. "character set". Are they the same? Or is charset what is literally in an HTML document and should not be used as a clipping of "character set"? Peter O.'s answer alludes to this.Mucilaginous

In my opinion, a charset is part of an encoding (a component): an encoding has a charset attribute, so a charset can be used in many encodings. For example, Unicode is a charset used in encodings like UTF-8, UTF-16, and so on.


The char in charset doesn't mean the char type of the programming world; it means a character in the real world. In English the two may look the same, but in other languages they don't. In Chinese, '我' is an inseparable 'char' in charsets such as Unicode and GB (used in GBK and GB2312), just as 'a' is a char in charsets such as ASCII, ISO-8859, and Unicode.
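
A small Java sketch of that idea (the class name is mine; it assumes the JRE bundles the GBK charset, as stock Oracle/OpenJDK builds do): the same real-world character '我' is a single 'char' in both charsets, but its bytes differ per encoding:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class OneCharacterTwoCharsets {
        public static void main(String[] args) {
            String wo = "我";  // U+6211 in Unicode; also a single character in the GB charsets

            print(wo.getBytes(StandardCharsets.UTF_8));  // E6 88 91  (Unicode charset, UTF-8 encoding)
            print(wo.getBytes(Charset.forName("GBK")));  // CE D2     (GB charset, GBK encoding)
        }

        // Print a byte array as space-separated, unsigned hex values.
        static void print(byte[] bytes) {
            for (byte b : bytes) System.out.printf("%02X ", b & 0xFF);
            System.out.println();
        }
    }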

Apperceive answered 27/6, 2019 at 6:28 Comment(0)

A charset is just a set; it either contains, e.g. the Euro sign, or else it doesn't. That's all.

An encoding is a bijective mapping from a character set to a set of integers. If it supports the Euro sign, it must assign a specific integer to that character and to no other.
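
In Java terms this membership test looks roughly like the sketch below (illustrative class name); CharsetEncoder.canEncode answers the "does this charset contain the character?" question:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class EuroSignMembership {
        public static void main(String[] args) {
            char euro = '\u20AC';  // the Euro sign

            // A charset either contains the Euro sign or it doesn't:
            System.out.println(StandardCharsets.UTF_8.newEncoder().canEncode(euro));          // true
            System.out.println(Charset.forName("windows-1252").newEncoder().canEncode(euro)); // true (mapped to 0x80)
            System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(euro));     // false
            System.out.println(StandardCharsets.US_ASCII.newEncoder().canEncode(euro));       // false
        }
    }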

Godsey answered 17/2, 2010 at 15:3 Comment(3)
Does it have to be bijective?Blemish
Well, encoding and decoding should be deterministic, so there really can't be any ambiguous mappings. I suppose you could have a non-contiguous set of integers as the codomain, but that would waste space when you store text, and engineers hate wasted space.Godsey
Legacy character encodings are often not bijective. For example, in IBM437, both ß and β are represented by 0xE1.Peltry

In my opinion, the word "charset" should be limited to identifying the parameter used in HTTP, MIME, and similar standards to specify a character encoding (a mapping from a series of text characters to a sequence of bytes) by name. For example: charset=utf-8.

I'm aware, though, that MySQL, Java, and other places may use the word "charset" to mean a character encoding.

Kinsley answered 10/1, 2016 at 20:54 Comment(1)
Agreed. "charset" should not be used as a clipping of "character set" - it adds to the confusion.Mucilaginous

An encoding is a mapping between bytes and characters from a character set, so it will be helpful to discuss and understand the difference between bytes and characters.

Think of bytes as numbers between 0 and 255, whereas characters are abstract things like "a", "1", "$" and "Ä". The set of all characters that are available is called a character set.

Each character has a sequence of one or more bytes that are used to represent it; however, the exact number and value of the bytes depends on the encoding used and there are many different encodings.

Most encodings are based on an old character set and encoding called ASCII, which uses a single byte per character (actually, only 7 bits) and contains 128 characters, including a lot of the common characters used in US English.

For example, here are 6 characters in the ASCII character set that are represented by the values 60 to 65.

Extract of ASCII Table 60-65
╔══════╦══════════════╗
║ Byte ║  Character   ║
╠══════╬══════════════╣
║  60  ║      <       ║
║  61  ║      =       ║
║  62  ║      >       ║
║  63  ║      ?       ║
║  64  ║      @       ║
║  65  ║      A       ║
╚══════╩══════════════╝

In the full ASCII set, the lowest value used is zero and the highest is 127 (both of these are non-printing control characters).

However, once you start needing more characters than basic ASCII provides (for example, letters with accents, currency symbols, graphic symbols, etc.), ASCII is not suitable and you need something more extensive. You need more characters (a different character set) and you need a different encoding, since 128 code values are not enough to fit all the characters in. Some encodings use one byte per character (at most 256 characters), while others use multiple bytes per character.

Over time a lot of encodings have been created. In the Windows world, there is CP1252, or ISO-8859-1, whereas Linux users tend to favour UTF-8. Java uses UTF-16 natively.

One sequence of byte values for a character in one encoding might stand for a completely different character in another encoding, or might even be invalid.

For example, in ISO 8859-1, â is represented by one byte of value 226, whereas in UTF-8 it is two bytes: 195, 162. However, in ISO 8859-1, 195, 162 would be two characters, Ã, ¢.

When computers store data about characters internally or transmit it to another system, they store or send bytes. Imagine a system opening a file or receiving a message and seeing the bytes 195, 162. How does it know what characters these are?

In order for the system to interpret those bytes as actual characters (and so display them or convert them to another encoding), it needs to know the encoding used. That is why encoding appears in XML headers or can be specified in a text editor. It tells the system the mapping between bytes and characters.
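
A short Java sketch of exactly that situation (illustrative class name): the same two bytes, 195 and 162, decoded once with the encoding that produced them and once with the wrong one:

    import java.nio.charset.StandardCharsets;

    public class WrongEncoding {
        public static void main(String[] args) {
            byte[] bytes = "â".getBytes(StandardCharsets.UTF_8);  // 195, 162 (hex C3 A2)

            // Decoded with the encoding that produced the bytes, the original character comes back:
            System.out.println(new String(bytes, StandardCharsets.UTF_8));       // â

            // Decoded with the wrong encoding, the same bytes become two different characters:
            System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));  // Ã¢
        }
    }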

Pargeting answered 24/4, 2018 at 12:49 Comment(0)
