Why does base64 encoding require padding if the input length is not divisible by 3?

What is the purpose of padding in base64 encoding? The following is an extract from Wikipedia:

"An additional pad character is allocated which may be used to force the encoded output into an integer multiple of 4 characters (or equivalently when the unencoded binary text is not a multiple of 3 bytes) ; these padding characters must then be discarded when decoding but still allow the calculation of the effective length of the unencoded text, when its input binary length would not be not a multiple of 3 bytes (the last non-pad character is normally encoded so that the last 6-bit block it represents will be zero-padded on its least significant bits, at most two pad characters may occur at the end of the encoded stream)."

I wrote a program that can base64-encode any string and decode any base64-encoded string. What problem does padding solve?

Myrtia answered 2/11, 2010 at 18:35 Comment(0)
375

Your conclusion that padding is unnecessary is right. It's always possible to determine the length of the input unambiguously from the length of the encoded sequence.
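To make the length claim concrete, here is a minimal Python sketch. The helper name decoded_length is my own (hypothetical), not part of any library. It assumes the standard 6-bits-per-character alphabet: every unpadded character carries 6 bits, and any leftover bits that do not fill a byte are discarded.

```python
import base64


def decoded_length(unpadded_len: int) -> int:
    """Bytes represented by `unpadded_len` base64 characters (no '=' present)."""
    if unpadded_len % 4 == 1:
        # A single leftover character carries only 6 bits: less than one byte.
        raise ValueError("an unpadded base64 length cannot be 1 more than a multiple of 4")
    return unpadded_len * 6 // 8


# Cross-check against the standard library for a range of input sizes.
for n in range(30):
    unpadded = base64.b64encode(bytes(n)).rstrip(b"=")
    assert decoded_length(len(unpadded)) == n
```

So a decoder that knows where the encoded sequence ends never needs the "=" characters to work out how many bytes to produce.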

However, padding is useful in situations where base64 encoded strings are concatenated in such a way that the lengths of the individual sequences are lost, as might happen, for example, in a very simple network protocol.

If unpadded strings are concatenated, it's impossible to recover the original data because information about the number of odd bytes at the end of each individual sequence is lost. However, if padded sequences are used, there's no ambiguity, and the sequence as a whole can be decoded correctly.

Edit: An Illustration

Suppose we have a program that base64-encodes words, concatenates them and sends them over a network. It encodes "I", "AM" and "TJM", sandwiches the results together without padding and transmits them.

  • I encodes to SQ (SQ== with padding)
  • AM encodes to QU0 (QU0= with padding)
  • TJM encodes to VEpN (still VEpN with padding, since a 3-byte input needs none)

So the transmitted data is SQQU0VEpN. The receiver base64-decodes this as I\x04\x14\xd1Q) instead of the intended IAMTJM. The result is nonsense because the sender has destroyed information about where each word ends in the encoded sequence. If the sender had sent SQ==QU0=VEpN instead, the receiver could have decoded this as three separate base64 sequences which would concatenate to give IAMTJM.
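The illustration can be reproduced with a short Python sketch. The helper decode_concatenated is a hypothetical name of mine. It decodes the padded stream 4 characters at a time, because (as far as I know) many library decoders do not accept padding in the middle of their input.

```python
import base64

words = [b"I", b"AM", b"TJM"]

padded = [base64.b64encode(w) for w in words]      # SQ==, QU0=, VEpN
unpadded = [p.rstrip(b"=") for p in padded]        # SQ, QU0, VEpN

joined_unpadded = b"".join(unpadded)               # SQQU0VEpN
joined_padded = b"".join(padded)                   # SQ==QU0=VEpN

# Only the first 8 characters of the unpadded stream form complete
# 4-character groups; the trailing "N" is left dangling.
garbled = base64.b64decode(joined_unpadded[:8])


def decode_concatenated(data: bytes) -> bytes:
    """Decode a stream of padded base64 sequences, one 4-character group at a time."""
    out = bytearray()
    for i in range(0, len(data), 4):
        out += base64.b64decode(data[i:i + 4])
    return bytes(out)


print(garbled)                              # the nonsense result
print(decode_concatenated(joined_padded))   # the intended words, run together
```

Decoding group by group works here only because padding guarantees that every individual sequence ends on a 4-character boundary.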

Why Bother with Padding?

Why not just design the protocol to prefix each word with an integer length? Then the receiver could decode the stream correctly and there would be no need for padding.

That's a great idea, as long as we know the length of the data we're encoding before we start encoding it. But what if, instead of words, we were encoding chunks of video from a live camera? We might not know the length of each chunk in advance.

If the protocol used padding, there would be no need to transmit a length at all. The data could be encoded as it came in from the camera, each chunk terminated with padding, and the receiver would be able to decode the stream correctly.

Obviously that's a very contrived example, but perhaps it illustrates why padding might conceivably be helpful in some situations.

Annabelle answered 29/10, 2014 at 13:55 Comment(10)
+1 The only answer that actually provides a reasonable answer besides "because we like verbosity and redundancy for some inexplicable reason". - Sufferance
This works OK for chunks that are encoded distinctly, but are expected to be indivisibly concatenated after decoding. If you send U0FNSQ==QU0=, you can reconstruct the sentence, but you lose the words that make up the sentence. Better than nothing, I guess. Notably, the GNU base64 program automatically handles concatenated encodings. - Cohere
What if the length of words was a multiple of 3? This dumb way of concatenation destroys information (endings of words), not the removal of padding. - Monitor
Base64 concatenation allows encoders to process large chunks in parallel without the burden of aligning the chunk sizes to a multiple of three. Similarly, as an implementation detail, there might be an encoder out there that needs to flush an internal data buffer of a size that is not a multiple of three. - Scalenus
@MarceloCantos That's right, the use of padding doesn't allow the words to be recovered individually, it only ensures that the whole sequence can be decoded properly. - Annabelle
@Monitor I'm not implying above that padding preserves the length of the individual words - it doesn't, since, as you pointed out, if the length of a word is a multiple of 3 it needs no padding. Rather, for words whose lengths are not multiples of 3, padding prevents the encoding of the 1 or 2 dangling bytes at the end of the word from getting mixed up with the encoding of the next word, so the whole sequence can be decoded as intended. - Annabelle
This answer could make you think that you can decode something like "SQ==QU0=VEpN" by just giving it to a decoder. Actually, it seems you can't; for example, the implementations in JavaScript and PHP don't support this. Starting with a concatenated string, you either have to decode 4 bytes at a time or split the string after padding characters. It seems like those implementations just ignore the padding chars, even when they are in the middle of a string. - Rosanne
@Roman, sadly, it seems that decode_base64() from MIME::Base64 in Perl, and base64.b64decode in Python also stop after the first set of padding characters. RFC 4648 seems to basically say that the treatment of padding characters is up to the specification of whatever uses Base64 (and it would be allowable to define padding characters in the middle to be totally ignored, even). Too bad the decoders don't seem to have documented the exact behaviour. - Recountal
Great concise and precise explanation, thanks >> "padding is useful in situations where base64 encoded strings are concatenated in such a way that the lengths of the individual sequences are lost" - Norven
Again, if it is really for this purpose, then, as @Rosanne Starkov said, "why not single = but the necessity for three different padding cases". Maybe it's for the simplicity of the decoding algorithm (memory allocation and layout, etc.), so we can just deal with chunks of 4 characters. - Jacksmelt
68

On a related note, here's an arbitrary base converter I created for you. Enjoy! https://convert.zamicol.com

What are Padding Characters?

Padding characters help satisfy length requirements and carry no other meaning.

Decimal Example of Padding: Given the arbitrary requirement that all strings be 8 characters long, the number 640 can meet this requirement by using leading zeros as padding characters, since they carry no meaning: "00000640".

Binary Encoding

The Byte Paradigm: For encoding, the byte is the de facto standard unit of measurement and any scheme must relate back to bytes.

Base256 fits exactly into the byte paradigm. One byte is equal to one character in base256.

Base16, hexadecimal or hex, uses 4 bits for each character. One byte can represent two base16 characters.

Base64 does not fit evenly into the byte paradigm (nor does base32), unlike base256 and base16. All base64 characters can be represented in 6 bits, 2 bits short of a full byte.

We can represent base64 encoding versus the byte paradigm as a fraction: 6 bits per character over 8 bits per byte. Reduced, this fraction is 3/4, that is, 3 bytes for every 4 characters.

This ratio, 3 bytes for every 4 base64 characters, is the rule we want to follow when encoding base64. Base64 encoding can only promise an even fit with 3-byte bundles, unlike base16 and base256, where every byte can stand on its own.
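As a quick worked check of that ratio, here is a small Python sketch. The function names padded_length and unpadded_length are hypothetical helpers of mine. It computes the encoded length for a given byte count, with and without padding, and compares the results with the standard library encoder.

```python
import base64


def padded_length(n_bytes: int) -> int:
    """Encoded length with '=' padding: 4 characters per (possibly partial) 3-byte bundle."""
    return 4 * ((n_bytes + 2) // 3)


def unpadded_length(n_bytes: int) -> int:
    """Encoded length without padding: enough 6-bit characters to hold 8 * n_bytes bits."""
    return (8 * n_bytes + 5) // 6


for n in range(7):
    encoded = base64.b64encode(b"x" * n)
    assert len(encoded) == padded_length(n)
    assert len(encoded.rstrip(b"=")) == unpadded_length(n)
    print(n, padded_length(n), unpadded_length(n))
```

The two lengths agree exactly when the byte count is a multiple of 3, which is the only case where no padding is emitted.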

So why is padding encouraged even though encoding could work just fine without the padding characters?

If the length of a stream is unknown, or if it could be helpful to know exactly when a data stream ends, use padding. The padding characters communicate explicitly that those extra spots should be empty and rule out any ambiguity. Even if the length is unknown, with padding you'll know where your data stream ends.

As a counterexample, some standards like JOSE don't allow padding characters. In this case, if something is missing, a cryptographic signature won't verify or other non-base64 characters will be missing (like the "."). Although no assumptions about length are made, padding isn't needed because if something is wrong it simply won't work.
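For the unpadded case, a common approach is to strip the "=" characters after encoding and restore them before decoding. The sketch below uses Python's URL-safe alphabet, which is the variant JOSE uses. The helper names b64url_encode_nopad and b64url_decode_nopad are hypothetical, not a library API.

```python
import base64


def b64url_encode_nopad(data: bytes) -> str:
    """Encode with the URL-safe alphabet and drop trailing '=' padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode_nopad(text: str) -> bytes:
    """Restore the padding implied by the length, then decode."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


for raw in (b"f", b"fo", b"foo", b"foob"):
    token = b64url_encode_nopad(raw)
    assert "=" not in token
    assert b64url_decode_nopad(token) == raw
```

This only works because the receiver knows exactly where the token ends (for example, at the "." separator), so the length can stand in for the padding.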

And this is exactly what the base64 RFC says,

In some circumstances, the use of padding ("=") in base-encoded data is not required or used. In the general case, when assumptions about the size of transported data cannot be made, padding is required to yield correct decoded data.

[...]

The padding step in base 64 [...] if improperly implemented, lead to non-significant alterations of the encoded data. For example, if the input is only one octet for a base 64 encoding, then all six bits of the first symbol are used, but only the first two bits of the next symbol are used. These pad bits MUST be set to zero by conforming encoders, which is described in the descriptions on padding below. If this property do not hold, there is no canonical representation of base-encoded data, and multiple base-encoded strings can be decoded to the same binary data. If this property (and others discussed in this document) holds, a canonical encoding is guaranteed.
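The "pad bits" the RFC mentions are the unused low bits of the last data character. Here is a rough Python sketch of a check for them. The function has_zero_pad_bits is a hypothetical helper written for illustration, and it assumes the standard (not URL-safe) alphabet.

```python
import string

# Standard base64 alphabet: index in this string == 6-bit value.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def has_zero_pad_bits(encoded: str) -> bool:
    """True if the unused low bits of the final data character are all zero."""
    body = encoded.rstrip("=")
    leftover = len(body) % 4
    if leftover == 0:
        return True          # complete groups only: no unused bits at all
    if leftover == 1:
        raise ValueError("a single leftover character cannot encode a byte")
    # 2 characters = 12 bits for 1 byte  -> 4 unused bits
    # 3 characters = 18 bits for 2 bytes -> 2 unused bits
    unused = 4 if leftover == 2 else 2
    last_value = ALPHABET.index(body[-1])
    return last_value & ((1 << unused) - 1) == 0


print(has_zero_pad_bits("Zg=="))   # canonical encoding of "f"
print(has_zero_pad_bits("Zh=="))   # same leading 8 bits, but non-zero pad bits
```

A decoder that ignores those bits would treat both strings as the same byte, which is exactly the non-canonical situation the RFC warns about.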

Padding allows us to decode base64 with the promise of no lost bits. Without padding, there is no longer an explicit acknowledgement of measuring in three-byte bundles, and you may not be able to guarantee exact reproduction of the original encoding without additional information, usually from somewhere else in your stack, such as TCP, checksums, or other methods.

An alternative to bucket conversion schemes like base64 is radix conversion, which has no arbitrary bucket sizes and, for left-to-right readers, is left-padded. The "iterative divide by radix" method is typically employed for radix conversions.
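For comparison, here is a minimal Python sketch of the iterative divide-by-radix idea. The function to_radix and its alphabet parameter are names I made up for this example. It treats the input as one big integer rather than as fixed-size buckets, so any padding goes on the left.

```python
def to_radix(number: int, alphabet: str) -> str:
    """Convert a non-negative integer to a string of digits drawn from `alphabet`."""
    if number < 0:
        raise ValueError("number must be non-negative")
    base = len(alphabet)
    if number == 0:
        return alphabet[0]
    digits = []
    while number:
        # Each division peels off the least significant digit.
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
    return "".join(reversed(digits))


decimal = to_radix(640, "0123456789")
print(decimal.rjust(8, "0"))                  # left-padded to 8 characters
print(to_radix(255, "0123456789abcdef"))
```

Unlike base64, the digits here do not line up with byte boundaries, which is why this style of conversion needs the whole value before it can emit the first character.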

Examples

Here is the example from RFC 4648 (https://www.rfc-editor.org/rfc/rfc4648#section-8)

Each character inside the "BASE64" function uses one byte (base256). We then translate that to base64.

BASE64("")       = ""           (No bytes used. 0 % 3 = 0)
BASE64("f")      = "Zg=="       (One byte used. 1 % 3 = 1)
BASE64("fo")     = "Zm8="       (Two bytes.     2 % 3 = 2)
BASE64("foo")    = "Zm9v"       (Three bytes.   3 % 3 = 0)
BASE64("foob")   = "Zm9vYg=="   (Four bytes.    4 % 3 = 1)
BASE64("fooba")  = "Zm9vYmE="   (Five bytes.    5 % 3 = 2)
BASE64("foobar") = "Zm9vYmFy"   (Six bytes.     6 % 3 = 0)
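These test vectors can be checked with Python's standard base64 module. This is just a quick sketch; the dictionary name RFC4648_VECTORS is my own.

```python
import base64

# Test vectors from RFC 4648, section 10 of the examples above.
RFC4648_VECTORS = {
    b"": b"",
    b"f": b"Zg==",
    b"fo": b"Zm8=",
    b"foo": b"Zm9v",
    b"foob": b"Zm9vYg==",
    b"fooba": b"Zm9vYmE=",
    b"foobar": b"Zm9vYmFy",
}

for raw, expected in RFC4648_VECTORS.items():
    encoded = base64.b64encode(raw)
    assert encoded == expected
    assert base64.b64decode(encoded) == raw
    # The number of pad characters depends only on len(raw) % 3.
    print(f"{len(raw)} % 3 = {len(raw) % 3} -> {encoded.count(b'=')} pad character(s)")
```

A remainder of 1 gives two pad characters, a remainder of 2 gives one, and a remainder of 0 gives none.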

Here's an encoder that you can play around with: http://www.motobit.com/util/base64-decoder-encoder.asp

Commandeer answered 29/8, 2013 at 18:36 Comment(7)
-1 It's a nice and thorough post on how number systems work, but it doesn't explain why padding is used when the encoding would work perfectly without. - Sodamide
Did you even read the question? You don't need padding to decode correctly. - Fourfold
I think this answer did in fact explain the reason as stated here: "we can no longer guarantee exact reproduction of original encoding without additional information". It's simple really: the padding lets us know that we received the complete encoding. Every time you have 3 bytes, you can safely assume it's OK to go ahead and decode it; you don't worry that, hum... maybe one more byte is going to come, possibly changing the encoding. - Shredding
@DidierA. How do you know that there aren't 3 more bytes in a base64 substring? To decode a char*, you need either the size of the string or a null terminator. Padding is redundant. Hence, OP's question. - Fourfold
@Fourfold If you are stream-decoding the base64 bytes, you do not know the length. With the 3-byte padding, you know that every time you get 3 bytes you can process the 4 characters, until you reach the end of the stream. Without it, you might need to backtrack, because the next byte could cause the previous character to change, so you can only be sure you decoded it properly once you've reached the end of the stream. So, it's not very useful, but it has a few edge cases where you might want it on. - Shredding
A string with equals signs anywhere but at the end is non-conforming, though some implementations will accept it. It would have been helpful to have such strings classified as "non-canonical" and recognize contexts where applications would or would not be expected to accept such strings, since there are times when being able to concatenate strings is useful, but other times when it's more important to be able to compare two strings for equality. - Coney
@DidierA. If you're decoding base64, then you would need to process 4 bytes at a time, which would represent between 1 and 3 decoded bytes (characters in your words). - Hluchy
13

There is not much benefit to it in the modern day. So let's look at this as a question of what the original historical purpose may have been.

Base64 encoding makes its first appearance in RFC 1421 dated 1993. This RFC is actually focused on encrypting email, and base64 is described in one small section 4.3.2.4.

This RFC does not explain the purpose of the padding. The closest we have to a mention of the original purpose is this sentence:

A full encoding quantum is always completed at the end of a message.

It does not suggest concatenation (top answer here), nor ease of implementation, as an explicit purpose for the padding. However, considering the entire description, it is not unreasonable to assume that this may have been intended to help the decoder read the input in 32-bit units ("quanta"). That is of no benefit today; however, in 1993, unsafe C code would very likely have taken advantage of this property.

Dionysiac answered 21/3, 2011 at 11:1 Comment(3)
In the absence of padding, an attempt to concatenate two strings when the first string's length is not a multiple of three would often yield a seemingly-valid string, but the contents of the second string would decode incorrectly. Adding the padding ensures that does not occur. - Coney
@Coney If that were the goal, wouldn't it be easier to end every base64 string with just a single "="? The average length would be shorter, and it would still prevent erroneous concatenations. - Dionysiac
The average length of b'Zm9vYmFyZm9vYg==', b'Zm9vYmFyZm9vYmE=', b'Zm9vYmFyZm9vYmFy', b'Zm9vYmFyZm9vYmFyZg==', b'Zm9vYmFyZm9vYmFyZm8=', b'Zm9vYmFyZm9vYmFyZm9v' is the same as that of b'Zm9vYmFyZm9vYg=', b'Zm9vYmFyZm9vYmE=', b'Zm9vYmFyZm9vYmFy=', b'Zm9vYmFyZm9vYmFyZg=', b'Zm9vYmFyZm9vYmFyZm8=', b'Zm9vYmFyZm9vYmFyZm9v=' - Weisbrodt
6

With padding, a base64 string always has a length that is a multiple of 4 (if it doesn't, the string has certainly been corrupted), so code can process that string in a loop that handles 4 characters at a time, always converting 4 input characters to three or fewer output bytes. Padding therefore makes sanity checking easy (length % 4 != 0 means an error, since that is impossible with padding), and it makes processing simpler and more efficient.
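Here is a rough Python sketch of that loop, written by hand rather than with a library decoder so the 4-characters-in, up-to-3-bytes-out step is visible. The function decode_padded is a hypothetical name and the standard alphabet is assumed. It is meant as an illustration, not a production decoder.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
INDEX = {ch: value for value, ch in enumerate(ALPHABET)}


def decode_padded(encoded: str) -> bytes:
    """Decode padded base64 strictly 4 characters at a time."""
    if len(encoded) % 4 != 0:
        raise ValueError("length is not a multiple of 4, input is corrupted")
    out = bytearray()
    for i in range(0, len(encoded), 4):
        group = encoded[i:i + 4]
        pads = len(group) - len(group.rstrip("="))
        if pads > 2:
            raise ValueError("a group may contain at most two pad characters")
        value = 0
        for ch in group[:4 - pads]:
            if ch not in INDEX:
                raise ValueError(f"invalid base64 character: {ch!r}")
            value = (value << 6) | INDEX[ch]
        value <<= 6 * pads                 # treat pad positions as zero bits
        out += value.to_bytes(3, "big")[:3 - pads]
    return bytes(out)


print(decode_padded("Zm9vYmE="))
```

Every iteration has the same shape, and the only variation is how many of the three output bytes are kept. The same loop also handles padding that appears at the end of any group, not just the last one.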

I know what people will think: Even without padding, I can process all 4-byte chunks in a loop and then just add special handling for the last 1 to 3 bytes, if those exist. It's just a few lines of extra code and the speed difference will be too tiny to even measure. Probably true, but you are thinking in terms of C (or higher-level languages) and a powerful CPU with plenty of RAM. What if you need to decode base64 in hardware, using a simple DSP that has very limited processing power and no RAM storage, and you have to write the code in very limited micro-assembly? What if you cannot use code at all and everything has to be done with just transistors stacked together (a hardwired hardware implementation)? With padding that's way simpler than without.

Decolorize answered 19/7, 2022 at 12:14 Comment(0)
0

Padding fills the output length to a multiple of four bytes in a defined way.

Weir answered 1/1, 2022 at 11:15 Comment(0)
