Why does UTF-32 exist when only 21 bits are necessary to encode every character?

We know that codepoints lie in the interval 0..10FFFF, which is less than 2^21 (0x10FFFF is 1,114,111, while 2^21 is 2,097,152). So why do we need UTF-32 when all codepoints can be represented in 3 bytes? UTF-24 should be enough.

Garwin answered 14/6, 2011 at 6:15 Comment(0)

Computers are generally much better at dealing with data on 4 byte boundaries. The benefits in terms of reduced memory consumption are relatively small compared with the pain of working on 3-byte boundaries.

(I speculate there was also a reluctance to have a limit that was "only what we can currently imagine being useful" when coming up with the original design. After all, that's caused a lot of problems in the past, e.g. with IPv4. While I can't see us ever needing more than 24 bits, if 32 bits is more convenient anyway then it seems reasonable to avoid having a limit which might just be hit one day, via reserved ranges etc.)

I guess this is a bit like asking why we often have 8-bit, 16-bit, 32-bit and 64-bit integer datatypes (byte, int, long, whatever) but not 24-bit ones. I'm sure there are lots of occasions where we know that a number will never go beyond 2^21, but it's just simpler to use int than to create a 24-bit type.
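
To make the "pain of 3-byte boundaries" a bit more concrete, here is a minimal C sketch of what a hand-rolled 24-bit code unit might look like (the u24 type and the helper names are invented for illustration; nothing like this exists in the standard library):

```c
#include <stdint.h>

/* C gives us uint8_t, uint16_t, uint32_t and uint64_t, but no uint24_t,
 * so a 24-bit code unit has to be emulated by hand. */
typedef struct { uint8_t b[3]; } u24;   /* hypothetical packed code unit */

/* Reading it back means reassembling the value from bytes... */
static uint32_t u24_get(u24 v) {
    return (uint32_t)v.b[0]
         | (uint32_t)v.b[1] << 8
         | (uint32_t)v.b[2] << 16;
}

/* ...whereas a 32-bit code unit is just a plain load. */
static uint32_t u32_get(uint32_t v) { return v; }
```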

Cunaxa answered 14/6, 2011 at 6:19 Comment(9)
To expand beyond 21 bits we'd need a new 'UTF-16 compatible' encoding. Or we'd just abandon UTF-16. I wouldn't mind that but all the applications and libraries and systems that treat Unicode as synonymous with UTF-16 probably wouldn't be happy.Binny
What about stuffing 3 code points into a 64-bit integer? Three 21-bit numbers fit neatly in a 64-bit integer (signed or unsigned), with a bit to spare; there's a rough sketch of this after these comments.Mete
@ColeJohnson: That would work, but only until we find that 21 bits aren't enough... and it still ends up being less easily-handled in terms of requiring bitshifting etc. It could be a useful implementation in some cases though.Cunaxa
Think of UTF-32 as equivalent to the 32-bit padded RGBx pixel formats often used for images without alpha channels to keep the pixels word-aligned. It's just another artifact of the CPU time vs. memory footprint trade-off that permeates software design.Charlatanism
Even on an 8-bit computer, where 4-byte alignment isn't really a thing, you wind up having to multiply by 3 to index an array of UTF-24 characters. Using 6502 machines like the Commodore 64 as an example, multiplying a single-byte value by 3 takes four instructions totaling 6 bytes and 10 clock cycles; multiplying it by 4 instead takes only two instructions occupying two bytes and only 4 cycles.Swage
"Allow for future expansion" is just plain wrong. Unicode will not be expanded. It has 1,112,064 code points out of which 144,697 code points are in use. More code points will be assigned a character in the future, but none will be created. And if more code points are created, UTF-16 will not be able to represent them, since it has a limit of 1,114,112 characters (!!!). That's a margin on only 2,048 code points!Snowmobile
@SenhorLucas: Just because it turns out that it won't be used doesn't mean that wasn't the reason behind it originally. I'll edit the answer to indicate that this was speculation on my part though.Cunaxa
unicode.org/faq/utf_bom.html > Will UTF-16 ever be extended to more than a million characters? > No. Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111).Snowmobile
@SenhorLucas: Yes, they do now... but that doesn't mean it was never an aim. Anyway, I hope the edit I made yesterday satisfies you.Cunaxa
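
For what it's worth, here is a rough sketch of the packing idea from the comment above, assuming 21 bits per codepoint (the function names are invented for illustration):

```c
#include <stdint.h>

/* Pack three 21-bit codepoints into one 64-bit word (63 bits used,
 * one bit to spare). */
static uint64_t pack3(uint32_t a, uint32_t b, uint32_t c) {
    return (uint64_t)a | (uint64_t)b << 21 | (uint64_t)c << 42;
}

/* Extract codepoint i (0, 1 or 2) from a packed word. */
static uint32_t unpack(uint64_t w, unsigned i) {
    return (uint32_t)(w >> (21 * i)) & 0x1FFFFF;
}
```

Fetching the Nth codepoint of a string stored this way means unpack(words[n / 3], n % 3), which is the extra bit-shifting referred to in the reply.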

First there were two character encoding schemes: UCS-4, which encoded each character in 32 bits as an unsigned integer in the range 0x00000000 - 0x7FFFFFFF, and UCS-2, which used 16 bits for each codepoint.

Later it turned out that the 65,536 codepoints of UCS-2 would not be enough anyway, but many programs (Windows, cough) relied on wide characters being 16 bits wide, so UTF-16 was created. UTF-16 encodes the codepoints in the range U+0000 - U+FFFF just like UCS-2, and the range U+10000 - U+10FFFF using surrogate pairs, i.e. a pair of 16-bit values.
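
As a rough illustration of how the surrogate-pair scheme works, here is a minimal C sketch (not a validating encoder; for instance, it does not reject codepoints that fall in the surrogate range themselves):

```c
#include <stdint.h>

/* Encode one codepoint (up to U+10FFFF) as UTF-16 code units.
 * Returns the number of 16-bit units written (1 or 2). */
static int utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {                         /* BMP: a single unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* now a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate     */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate      */
    return 2;
}
```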

As this was a bit complicated, UTF-32 was introduced as a simple one-to-one mapping, also for characters beyond U+FFFF. Now, since UTF-16 can only encode up to U+10FFFF, it was decided that this will be the maximum value ever assigned, so that there will be no further compatibility problems; hence UTF-32 effectively only needs 21 bits. As an added bonus, UTF-8, which was initially planned as a 1-6-byte encoding, now never needs more than 4 bytes per code point. Therefore it can easily be shown that UTF-8 never requires more storage than UTF-32.
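
The "never more than 4 bytes" part follows directly from the U+10FFFF cap; here is a minimal sketch of the per-codepoint byte count in C (the function name is invented for illustration):

```c
#include <stdint.h>

/* Bytes needed to encode one codepoint in UTF-8. With codepoints capped
 * at U+10FFFF the result is never more than 4, i.e. never more than the
 * fixed 4 bytes per codepoint of UTF-32. */
static int utf8_len(uint32_t cp) {
    if (cp < 0x80)    return 1;   /* ASCII               */
    if (cp < 0x800)   return 2;   /* up to U+07FF        */
    if (cp < 0x10000) return 3;   /* rest of the BMP     */
    return 4;                     /* U+10000 .. U+10FFFF */
}
```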

It is true that a hypothetical UTF-24 format would save memory compared to UTF-32. However, its savings would be dubious anyway, as it would mostly consume more storage than UTF-8: ASCII takes 1 byte in UTF-8 versus 3 in UTF-24, everything else below U+10000 takes at most 3 bytes in UTF-8, and only codepoints above U+FFFF take 4. So UTF-24 only wins for blasts of emoji or such - and not many interesting texts of significant length consist solely of emojis.

But UTF-32 is useful as an in-memory representation for text in programs that need simply-indexed access to codepoints: it is the only encoding where the Nth element in a C array is also the Nth codepoint. UTF-24 would do the same with 25 % memory savings, but at the cost of more complicated element access.
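
To illustrate the difference, a minimal sketch (the UTF-24 layout here is a made-up little-endian packing, since no such encoding is actually standardized):

```c
#include <stddef.h>
#include <stdint.h>

/* UTF-32: the Nth codepoint is a plain array access. */
static uint32_t nth_utf32(const uint32_t *s, size_t n) {
    return s[n];
}

/* Hypothetical UTF-24: the index has to be scaled by 3 and the value
 * reassembled from individual bytes. */
static uint32_t nth_utf24(const uint8_t *s, size_t n) {
    const uint8_t *p = s + 3 * n;
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 | (uint32_t)p[2] << 16;
}
```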

Materialism answered 22/10, 2017 at 8:38 Comment(0)

It's true that only 21 bits are required (reference), but modern computers are good at moving 32-bit units of things around and generally interacting with them. I don't think I've ever used a programming language that had a 24-bit integer or character type, nor a platform where 24 bits was a multiple of the processor's word size (not since I last used an 8-bit computer; UTF-24 would be reasonable on an 8-bit machine), though naturally there have been some.

Talkathon answered 14/6, 2011 at 6:19 Comment(3)
I used a processor with 24-bit words not too long ago. It was a Sigmatel product, maybe? I can't remember now.Arrester
Although 24-bit words don't map very well to most processors, 24-bit can nevertheless be a very efficient storage base. It is very widely used in media applications, both audio (24 being the standard bit depth for studio recordings) and video (three colour channels à 8 bits). And it's not like performance isn't important in those applications!Ancalin
And now that we're in 2018, nearly all computers can easily work with 64-bit data types at full performance.Underbodice

UTF-32 is a multiple of 16 bits. Working with 32-bit quantities is much more common than working with 24-bit quantities, and is usually better supported. It also helps keep each character 4-byte aligned (assuming the entire string is 4-byte aligned). Going from 1 byte to 2 bytes to 4 bytes is the most "logical" progression.

Apart from that: at the time, the Unicode standard was still growing, and codepoints outside of that range could conceivably have been assigned one day. (In practice they will not be: as the Unicode FAQ quoted in the comments above notes, Unicode and ISO 10646 now formally limit code assignment to the range UTF-16 can express, and there is still a huge number of unassigned codepoints within it.)

Careworn answered 14/6, 2011 at 6:19 Comment(0)
