What is the difference between UTF-8 and ISO-8859-1? [closed]

8

510

What is the difference between UTF-8 and ISO-8859-1?

Cowell answered 13/8, 2011 at 5:21 Comment(0)
S
404

UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
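
As a rough illustration of the three cases above (ASCII, the rest of the first 256 code points, and everything else), here is a minimal Python sketch. The helper name `try_encode` and the sample characters are my own choices for illustration, not anything from the answer.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# "A" is ASCII, U+00E9 (e with acute) is within the first 256 code points,
# U+20AC (euro sign) is outside them.
samples = ["A", "\u00e9", "\u20ac"]

results = {
    ch: (try_encode(ch, "utf-8"), try_encode(ch, "iso-8859-1"))
    for ch in samples
}

for ch, (utf8, latin1) in results.items():
    print(f"U+{ord(ch):04X}  utf-8={utf8!r}  iso-8859-1={latin1!r}")
```

The ASCII character gets identical bytes in both encodings, the second one gets two bytes in UTF-8 and one in ISO 8859-1, and the third cannot be encoded in ISO 8859-1 at all.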

Swane answered 13/8, 2011 at 5:26 Comment(6)
One thing to note is that ASCII extends from 0 to 127 only. The MSB is always 0.Mandymandych
When code points above 127 are defined, the encoding system is a version of Extended ASCII.Histoplasmosis
@RohanBhale Don't use the phrase Extended ASCII; it'll only cause confusion.Maier
But extended ascii might be the correct term. I read it on multiple resourcesHistoplasmosis
I always heard it as High ASCII.Diplopod
In over 30 years of MS-DOS, Windows, *nix, and the internet I've never heard "high" ASCII ever mentioned. It's always been "Extended ASCII"Quilt
M
162

Wikipedia explains both reasonably well: UTF-8 vs Latin-1 (ISO-8859-1). The former is a variable-length encoding, the latter a single-byte, fixed-length encoding. Latin-1 encodes just the first 256 code points of the Unicode character set, whereas UTF-8 can be used to encode all code points. At the physical encoding level, only code points 0-127 are encoded identically; code points 128-255 become 2-byte sequences in UTF-8, whereas they are single bytes in Latin-1.
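
The claim about 0-127 versus 128-255 is easy to check exhaustively. The following is only a sketch using Python's built-in codecs; the variable names are mine.

```python
same = []       # code points whose bytes are identical in both encodings
two_byte = []   # code points that take two bytes in UTF-8

for cp in range(256):
    ch = chr(cp)
    latin1 = ch.encode("iso-8859-1")
    utf8 = ch.encode("utf-8")
    # In Latin-1 the byte value is simply the code point.
    assert latin1 == bytes([cp])
    if utf8 == latin1:
        same.append(cp)
    elif len(utf8) == 2:
        two_byte.append(cp)

print(f"identical: {same[0]}-{same[-1]}")
print(f"two bytes in UTF-8: {two_byte[0]}-{two_byte[-1]}")
```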

Mercia answered 13/8, 2011 at 5:30 Comment(2)
@mu maybe my statement was ambiguous, but it is not incorrect -- I was not talking about encoded byte sequences, but rather character sets being encoded; meaning that ISO-8859-1 is used to encode first 256 code points of the Unicode character set.Mercia
Your clarification works for me and "ambiguous" would have been a better word choice than "incorrect".Yorgos
N
121

UTF

UTF is a family of multi-byte encoding schemes that can represent Unicode code points. As originally designed, UTF-8 could use up to 6 bytes to address up to 2^31 [roughly 2 billion] code points. As currently standardized (RFC 3629), UTF-8 is a flexible encoding system that uses between 1 and 4 bytes, which leaves room for 2^21 [roughly 2 million] values; Unicode itself stops at U+10FFFF.

Long story short: any character with a code point/ordinal representation of 127 or below, aka 7-bit-safe ASCII, is represented by the same 1-byte sequence as in most other single-byte encodings. Any character with a code point above 127 is represented by a sequence of two or more bytes, with the particulars of the encoding best explained here.
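
To make the 1-to-4-byte rule concrete, here is a small sketch that encodes the code points sitting on either side of each length boundary. The list of boundary values is my own selection; the lengths come from Python's built-in UTF-8 codec.

```python
# Code points on either side of each UTF-8 length boundary.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

# Number of bytes each code point occupies once encoded as UTF-8.
lengths = [len(chr(cp).encode("utf-8")) for cp in boundaries]

for cp, n in zip(boundaries, lengths):
    print(f"U+{cp:06X} -> {n} byte(s)")
```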

ISO-8859

ISO-8859 is a family of single-byte encoding schemes used to represent alphabets that fit within the range of 128 to 255. These various alphabets are defined as "parts" in the format ISO-8859-n, the most familiar of these likely being ISO-8859-1 aka 'Latin-1'. As with UTF-8, 7-bit-safe ASCII remains unaffected regardless of the encoding family used.

The drawback to this encoding scheme is its inability to accommodate languages that need more than 128 symbols, or to safely display more than one family of symbols at one time. As well, ISO-8859 encodings have fallen out of favor with the rise of UTF. The ISO "Working Group" in charge of it disbanded in 2004, leaving maintenance up to its parent subcommittee.
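
The "one family of symbols at a time" limitation can be sketched as follows: the same high byte means a different letter in each part, while ASCII bytes mean the same thing everywhere. The byte 0xE4 and the three parts are just examples I picked; the codec names are the ones Python ships with.

```python
raw = bytes([0xE4])  # one byte in the upper half of the 8-bit range
parts = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]  # Latin-1, Cyrillic, Greek

# The same byte decodes to a different character in each part.
decoded = {name: raw.decode(name) for name in parts}

# An ASCII byte is unaffected by the choice of part.
ascii_decoded = {name: b"A".decode(name) for name in parts}

for name in parts:
    ch = decoded[name]
    print(f"{name}: 0xE4 -> U+{ord(ch):04X} {ch}")
```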

Windows Code Pages

It's worth mentioning that Microsoft also maintains a set of character encodings with limited compatibility with ISO-8859, usually denoted as "cp####". Microsoft seems to be pushing its recent product releases toward Unicode in one form or another, but for legacy and/or interoperability reasons you're still likely to run into these code pages.

For example, cp1252 is a superset of ISO-8859-1, containing additional printable characters in the 0x80-0x9F range, notably the Euro symbol and the much maligned "smart quotes" “”. This frequently leads to a mismatch where 8859-1 can be displayed as 1252 perfectly fine, and 1252 may seem to display fine as 8859-1, but will misbehave when one of those extra symbols shows up.
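
Here is a short sketch of that mismatch, using Python's `cp1252` and `iso-8859-1` codecs. The sample bytes are mine: 0x80 is the Euro sign in cp1252, and 0x93/0x94 are the smart quotes.

```python
plain = b"caf\xe9"                  # only bytes outside 0x80-0x9F
extras = bytes([0x80, 0x93, 0x94])  # bytes inside 0x80-0x9F

# Outside 0x80-0x9F both encodings agree.
plain_cp1252 = plain.decode("cp1252")
plain_latin1 = plain.decode("iso-8859-1")

# Inside 0x80-0x9F they diverge: printable characters vs. C1 control codes.
extras_cp1252 = extras.decode("cp1252")
extras_latin1 = extras.decode("iso-8859-1")

print(plain_cp1252 == plain_latin1)
print([f"U+{ord(c):04X}" for c in extras_cp1252])
print([f"U+{ord(c):04X}" for c in extras_latin1])
```

Decoding never raises an error in the ISO-8859-1 case, which is why the problem tends to go unnoticed until one of the extra symbols is displayed.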

Aside from cp1252, the Turkish cp1254 is a similar superset of ISO-8859-9, but all other Windows Code Pages have at least some fundamental conflicts, if not differing entirely from their 8859 equivalent.

Nianiabi answered 23/8, 2016 at 19:15 Comment(2)
+1 for answering the question but going beyond and offering info about related encodings. Re: code points for UTF-8, according to https://mcmap.net/q/75179/-how-many-characters-can-utf-8-encode, UTF-8 supports 2^21 code points. Is that an error, or might a fix be needed here?Hans
Unicode is actually 17 planes of 2^16 code points, 0x00_0000 to 0x10_FFFF. The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment, about 1 million. See How many characters can UTF-8 encode?.Epimenides
D
42
  • ASCII: 7 bits. 128 code points.

  • ISO-8859-1: 8 bits. 256 code points.

  • UTF-8: 8-32 bits (1-4 bytes). 1,112,064 code points.

Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1:

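#!/usr/bin/env python3

# U+00A9 COPYRIGHT SIGN lies in the 128-255 range, where the two encodings differ.
c = chr(0xa9)
print(c)
print(c.encode('utf-8'))       # two bytes in UTF-8
print(c.encode('iso-8859-1'))  # one byte in ISO-8859-1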

Output:

©
b'\xc2\xa9'
b'\xa9'
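
What that incompatibility looks like in practice depends on the direction of the mix-up. The following is only a sketch with a made-up sample string: UTF-8 bytes read as ISO-8859-1 decode without error but produce mojibake, while ISO-8859-1 bytes read as UTF-8 are rejected outright.

```python
original = "\u00a9 2011"  # copyright sign followed by ASCII text

# UTF-8 bytes misread as ISO-8859-1: no error, but an extra character appears.
misread = original.encode("utf-8").decode("iso-8859-1")

# ISO-8859-1 bytes misread as UTF-8: a lone 0xA9 is not a valid sequence.
strict_error = None
try:
    original.encode("iso-8859-1").decode("utf-8")
except UnicodeDecodeError as exc:
    strict_error = exc

print(repr(misread))
print(strict_error)
```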
Disgrace answered 28/10, 2018 at 23:4 Comment(0)
A
27

ISO-8859-1 is a legacy standard from back in the 1980s. It can only represent 256 characters, so it is only suitable for some languages of the Western world. Even for many supported languages, some characters are missing. If you create a text file in this encoding and try to copy/paste some Chinese characters into it, you will see weird results. So in other words, don't use it. Unicode has taken over the world and UTF-8 is pretty much the standard these days, unless you have some legacy reasons (like HTTP headers, which need to be compatible with everything).

Assentor answered 3/6, 2016 at 19:31 Comment(6)
I had seen claims that umlauts are supposedly not converted with UTF-8. We saw examples of this, and in searching we found ISO-8859-1, which seems to work. We work with a lot of German scientists.Fannie
Umlauts are represented as two bytes in UTF-8. They convert fine and work well. The problem comes from programs that expect 1 byte per character. For these legacy programs, ISO-8859-1 has 1-byte umlauts.Adze
"So in other words, don't use it." I wouldn't say so, because there are use cases where ISO-8859-1 suits much better than UTF-8, because a single byte and 256 chars can be sufficient, resulting in faster processing and less payload.Petua
Just as an example of where single byte encoding is preferred, SMS messages have a limit of 140 bytes and primarily use single-byte encoding. If you were a business that sends automated SMS messages, you don't want to double your cost just to not use a legacy standard.Nates
@CalebMcNevin Unless you're sending t̸̡̧̢̬̼͈̯̰͎̫̯̖̻͎͖̱͉̠̟̪͈͓̫̥͉̞̠̠̮͎͒́ḣ̶̡̢̢̛͍͖̺̘̫̻͚̦͚̯̬͖̗̫̗̄̚ͅí̷͕̪̤͊̀̀͌̅́̄̈́̅̈́̊͂̓͂̇͐͒̈́̀̕s̶̛͕̹̞̦͔͚̠̟͈̳͇̪̓̒̀̋̎̒͌̋̀̍̊̿̄͛̿͛͂̌̐̕͜ or čřž it would still be 1 byte per character in UTF-8.Skullcap
@OleMorud Actually it's easier than you'd think. At my last job my manager would write SMS message templates in Word, and send them to us like that. Some strange stuff would happen, like with quotation marks - Word is "smart" and uses a different character than the default. “ & ” instead of ", very hard to spot. Literally doubled the cost for the first run when we copy pasted directly. We had to make a whole process to clean all the weird Word idiosyncrasies out of the text.Nates
C
4

One more important thing to realise: if you see iso-8859-1, it probably refers to Windows-1252 rather than ISO/IEC 8859-1. They differ in the range 0x80–0x9F, where ISO 8859-1 has the C1 control codes, and Windows-1252 has useful visible characters instead.

For example, ISO 8859-1 has 0x85 as a control character (in Unicode, U+0085 NEXT LINE), while Windows-1252 has a horizontal ellipsis (in Unicode, U+2026 HORIZONTAL ELLIPSIS, …).

The WHATWG Encoding spec (as used by HTML) expressly declares iso-8859-1 to be a label for windows-1252, and web browsers do not support ISO 8859-1 in any way: the HTML spec says that all encodings in the Encoding spec must be supported, and no more.

Also of interest, HTML numeric character references essentially use Windows-1252 for 8-bit values rather than Unicode code points; per https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state, the numeric character reference for 0x85 (for example &#x85;) will produce U+2026 rather than U+0085.
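
Python's standard-library `html.unescape` is documented as following the HTML 5 rules for character references, so it can serve as a quick, non-browser way to observe this remapping. This is only a sketch; the sample references are my own.

```python
import html

# Numeric references in the 0x80-0x9F range are remapped Windows-1252 style.
ref_hex = html.unescape("&#x85;")
ref_dec = html.unescape("&#133;")

# References outside that range map straight to the Unicode code point.
ref_plain = html.unescape("&#xe9;")

for label, value in [("&#x85;", ref_hex), ("&#133;", ref_dec), ("&#xe9;", ref_plain)]:
    print(f"{label} -> U+{ord(value):04X}")
```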

Cayes answered 19/3, 2020 at 9:43 Comment(0)
S
3

From another perspective, files that both UTF-8 and ASCII decoders fail to read because they contain a byte 0xc0 seem to be read properly as iso-8859-1. The caveat, of course, is that the file shouldn't contain any characters outside ISO-8859-1.
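
A minimal sketch of that situation in Python, with made-up sample bytes and a hypothetical helper `try_decode`: the byte 0xc0 is never valid in UTF-8 and is outside ASCII, but ISO-8859-1 assigns a character to every byte value, so decoding cannot fail.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

data = b"\xc0 la carte"  # starts with the byte 0xc0

decoded = {enc: try_decode(data, enc) for enc in ("utf-8", "ascii", "iso-8859-1")}

for enc, text in decoded.items():
    print(f"{enc}: {text!r}")
```

A successful decode only means every byte mapped to some character, not that ISO-8859-1 was the encoding the file was actually written in.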

Smashed answered 15/4, 2018 at 5:49 Comment(0)
L
0

My reason for researching this question was to find out in what way the two are compatible. The Latin1 charset (iso-8859) is 100% compatible to be stored in a utf8 datastore. All ascii & extended-ascii chars will be stored as single-byte.

Going the other way, from utf8 to Latin1 charset may or may not work. If there are any 2-byte chars (chars beyond extended-ascii 255) they will not store in a Latin1 datastore.
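
Both directions can be sketched with Python's codecs, treating a "datastore" as nothing more than a bytes value (that simplification is my assumption). Note the byte count: characters above 127 survive the trip into UTF-8, but each takes two bytes there rather than one.

```python
# Every possible Latin-1 byte, converted to UTF-8 and back again.
all_bytes = bytes(range(256))
as_utf8 = all_bytes.decode("iso-8859-1").encode("utf-8")
back = as_utf8.decode("utf-8").encode("iso-8859-1")
round_trip_ok = back == all_bytes

# Going the other way fails for any character beyond code point 255.
can_store_euro = True
try:
    "\u20ac".encode("iso-8859-1")
except UnicodeEncodeError:
    can_store_euro = False

print(round_trip_ok, len(all_bytes), len(as_utf8))
print(can_store_euro)
```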

Loseff answered 2/9, 2016 at 14:20 Comment(4)
Helpful, but I think you meant 127 instead of 255 in extended-ascii 255?Mallee
Latin-1, or iso-8859-1 is not 100% compatible to be stored in utf8. Any Latin-n or iso-8859-n character above 127 will not be translated to a single byte utf-8 character. However, for values 1-127, they will translate exactly.Marvelmarvella
This answer is a bit confusing in its use of the term "extended ascii", which just is a term to refer to any character encoding that is not ASCII. UTF-8 and latin-1 are examples of extended-ASCII encodings. But, non-ascii latin-1 characters (ie. code points above 127) cannot be encoded as a single byte in UTF-8.Residual
In UTF-8 2 byte encodings begin at 128. However there are matching characters in both, so it is possible to go: ISO 8859-1 -> UTF-8 -> ISO 8859-1 losslessly but if there are any characters in a UTF-8 document greater than 255 then it cannot be converted losslessly.Consistence

© 2022 - 2024 — McMap. All rights reserved.