Storing binary data in QR codes
T

7

24

I'm trying to store binary data in a QR code. Apparently QR codes do support storing raw binary data (or ISO-8859-1 / Latin1). Here is what I want to encode (hex):

d1 50 01 00 00 00 f6 5f 05 2d 8f 0b 40 e2 01

I've tried the following encoders:

  1. qr.js

(image: qrjs)

  2. Google Charts

(image: charts)

  3. qrcode.js

(image: qrcode.js)

Decoding with zxing.org produces various incorrect results. The two JavaScript ones produce this (it's wrong; the first text character should be Ñ):

(image: qr.js and qrcode.js)

Whereas Google Charts produces this:

(image: charts what)

What is going on? Are any of these correct? What's really weird is that if I encode the following sequence (with the JS ones at least), it works fine. I would have thought the issue was non-ASCII characters, but Ñ (0xd1) is non-ASCII.

d1 50 01 00 00 00 01 02 03 04 05 06 40 e2 01

Does anyone know what is going on?

Update

It occurred to me to try scanning them with a ZBar-based scanner app I found. It scans both JS versions ok (at least they start with ÑP). The Google Charts one is just wrong. So it seems like the issue is with ZXing (which is surprisingly shit - I wouldn't recommend it to anyone).

Update 2

ZBar can't handle null bytes. :-(

Tyrrell answered 23/6, 2016 at 15:38 Comment(5)
All kinds of encodings are possible, just a matter of interpretation. I guess you'd use binary mode (0100) so your input clearly should work, but the output would require you to code it yourself. Like Binary2Hex..Capwell
Well qrcode.js at least claims to only support 8-bit mode. I'm pretty sure it is just bugs in the decoders (ZXing is totally screwed, and ZBar uses null-terminated strings; yeay C).Tyrrell
I think that the above QR codes are impossible. They were artificially created via those JavaScript libraries, but are impossible to generate from real (non-JavaScript) QR code libraries. The presence of a 00 byte will cut the data off right away, so those QR codes will not be generated. Many have said that they could not be read, but I say that those QR codes could not have been created in the first place.Ingot
@Tyrrell the library you used for your first image does not support binary: github.com/neocotic/qrious/issues/47 I am very curious how on earth you managed to feed your data to it...Crone
I have no idea, sorry! I would have guessed that I just used a string where each character was one byte... but then I don't know why I would have reported that issue. I guess we will never know!Tyrrell
T
2

It turned out that ZXing is just crap, and ZBar does some weird stuff with the data (converting it to UTF-8 for example). I managed to get it to output the raw data including null bytes though. Here is a patch for the best Android ZBar library I found, that has now been merged.

Tyrrell answered 12/7, 2016 at 8:35 Comment(0)
T
11

"What is going on? Are any of these correct?"

Except for the google chart (which is just empty), your QR codes are correct.

You can see the binary data from zxing is what you would expect:

4: Byte mode indicator  
0f: length of 15 bytes  
d15001...: your 15 bytes of data  
ec11 is just padding  
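
The breakdown above can be checked with a small bit reader. This is only a sketch under stated assumptions: the codeword string in `main` is reconstructed from the breakdown (mode nibble, length byte, the 15 data bytes, a 4-bit terminator, then the `ec11` padding) rather than copied from a real decoder, and the 8-bit length field only applies to QR versions 1 to 9. The class and method names are made up for illustration.

```java
import java.util.Arrays;

public class ByteModeSegmentSketch {

    /** Reads `count` bits, most significant bit first, starting at bit offset `pos`. */
    static int readBits(byte[] data, int pos, int count) {
        int value = 0;
        for (int i = 0; i < count; i++) {
            int bitIndex = pos + i;
            int bit = (data[bitIndex / 8] >> (7 - bitIndex % 8)) & 1;
            value = (value << 1) | bit;
        }
        return value;
    }

    /**
     * Extracts the payload of a single byte-mode segment.
     * Assumption: QR version 1 to 9, where the byte-mode length field is 8 bits wide.
     */
    static byte[] extractByteSegment(byte[] codewords) {
        int mode = readBits(codewords, 0, 4);
        if (mode != 0b0100) {
            throw new IllegalArgumentException("Not a byte-mode segment, mode = " + mode);
        }
        int length = readBits(codewords, 4, 8);
        byte[] payload = new byte[length];
        for (int i = 0; i < length; i++) {
            // Payload bytes start right after the 4-bit mode and the 8-bit length.
            payload[i] = (byte) readBits(codewords, 12 + 8 * i, 8);
        }
        return payload;
    }

    /** Parses a hex string (no separators) into bytes. */
    static byte[] hex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // Reconstructed codewords: 4 | 0f | 15 data bytes | 0 (terminator) | ec 11 (padding)
        byte[] codewords = hex("40fd15001000000f65f052d8f0b40e2010ec11");
        byte[] payload = extractByteSegment(codewords);
        System.out.println(Arrays.equals(payload, hex("d15001000000f65f052d8f0b40e201")));
    }
}
```

The point of the sketch is that the payload is recovered without ever going through a `String`, which is what a binary-safe decoder has to do.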

The problem comes from the decoding, because most decoders will try to interpret the data as text. Since it's binary data, you should not try to handle it as text. Even if you think you can convert it from text back to binary, as you saw, this may cause issues with values that are not valid text.

So the solution is to use a decoder that outputs the binary data rather than text.

Now, about interpreting the QR code binary data as text: you said the first character should be 'Ñ', which is true if the data is interpreted as ISO-8859-1. According to the QR code standard, that is what should be done when no ECI mode is defined.

But in practice, most smartphone QR code readers will interpret it as UTF-8 in this case (or at least try to auto-detect the encoding).

Even though this is not the standard, it has become common practice: binary mode with no ECI, UTF-8 encoded text.

Maybe the reason behind it is that no one wants to waste these precious bytes adding an ECI mode specifying UTF-8. And actually, not all decoders support ECI.
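
To make the text-versus-binary problem concrete, here is a small sketch (class and method names are made up for illustration) that pushes the 15 bytes from the question through a `String` using each charset. With ISO-8859-1 every byte maps to exactly one character and back, so the round trip is lossless in Java. With UTF-8 the invalid sequences are replaced by U+FFFD, which re-encodes as three bytes, so the original data is lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTripSketch {

    /** Decodes the bytes as text in the given charset, then encodes that text back to bytes. */
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    /** The 15 bytes from the question. */
    static byte[] sample() {
        int[] values = {0xd1, 0x50, 0x01, 0x00, 0x00, 0x00, 0xf6, 0x5f,
                        0x05, 0x2d, 0x8f, 0x0b, 0x40, 0xe2, 0x01};
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = sample();

        // ISO-8859-1: text starts with U+00D1 (Ñ) followed by 'P', and the bytes survive.
        String latin1 = new String(data, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.startsWith("\u00d1P"));
        System.out.println(Arrays.equals(roundTrip(data, StandardCharsets.ISO_8859_1), data));

        // UTF-8: 0xd1 followed by 0x50 is malformed, so the first character becomes U+FFFD
        // and the round trip no longer reproduces the input.
        String utf8 = new String(data, StandardCharsets.UTF_8);
        System.out.println(utf8.charAt(0) == '\uFFFD');
        System.out.println(Arrays.equals(roundTrip(data, StandardCharsets.UTF_8), data));
    }
}
```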

Tallinn answered 25/2, 2019 at 10:47 Comment(0)
N
10

There are two issues that you have to overcome to store binary data in QR codes.

  1. ISO-8859-1 does not define characters for bytes in the ranges 00-1F and 7F-9F. If you need to encode these bytes anyway, quote or encode them, i.e. use quoted-printable or Base-64 encoding to avoid these ranges.

  2. Since you are trying to store binary data in QR codes, you have to rely only on your own scanner that will handle this binary data. You should not rely on other software, such as the web application at zxing.org, to display the text from your QR codes, because most QR decoders, including that of zxing.org, use heuristics to detect the character set used. These heuristics may detect a character set other than ISO-8859-1 and thus fail to properly display your binary data. Some scanners use heuristics to detect a character set even if the character set is explicitly given by ECI. This is why providing ECI may not help much: scanners still use heuristics even with ECI.

So using only printable US-ASCII characters (e.g., binary data encoded in Base64 before passing it to a QR code generator) is the safest defense against the heuristics. This also avoids another complication: ISO-8859-1 was not the default encoding in the earlier QR code standard published in 2000 (ISO/IEC 18004:2000). That standard specified the 8-bit Latin/Kana character set in accordance with JIS X 0201 (JIS8 also known as ISO-2022-JP) as the default encoding for 8-bit mode, while the updated standard published in 2005 changed the default to ISO-8859-1.

As an alternative to Base-64, you can encode each byte as two hexadecimal characters (0-9, A-F), so that your data is stored in the QR code in alphanumeric mode rather than 8-bit mode. This will disable all heuristics for sure and produces a QR code of about the same size as with Base-64: alphanumeric mode packs two characters into 11 bits (5.5 bits per character), so each byte costs 11 bits as hex, versus about 10.7 bits as Base-64 in 8-bit mode.
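
As a rough check of the size trade-off between the two text encodings, the following sketch counts the data bits each one would need for the 15 bytes from the question. It assumes Base-64 text is stored in 8-bit mode (8 bits per character) and uppercase hex is stored in alphanumeric mode (11 bits per pair of characters, 6 bits for a trailing single character), and it ignores the mode indicator, length field and padding. The class and method names are made up for illustration.

```java
import java.util.Base64;

public class TextEncodingCostSketch {

    /** Data bits needed when the Base-64 text is stored in 8-bit (byte) mode. */
    static int base64ByteModeBits(byte[] data) {
        return Base64.getEncoder().encodeToString(data).length() * 8;
    }

    /** Uppercase hex, because QR alphanumeric mode has no lowercase letters. */
    static String toUpperHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Data bits needed in alphanumeric mode: 11 bits per character pair, 6 for a leftover character. */
    static int alphanumericBits(String text) {
        return (text.length() / 2) * 11 + (text.length() % 2) * 6;
    }

    /** The 15 bytes from the question. */
    static byte[] sample() {
        int[] values = {0xd1, 0x50, 0x01, 0x00, 0x00, 0x00, 0xf6, 0x5f,
                        0x05, 0x2d, 0x8f, 0x0b, 0x40, 0xe2, 0x01};
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = sample();
        String hex = toUpperHex(data);
        System.out.println("raw bits:    " + data.length * 8);
        System.out.println("Base-64:     " + base64ByteModeBits(data));
        System.out.println("hex (" + hex + "): " + alphanumericBits(hex));
    }
}
```

For this payload the two text encodings land within a few bits of each other, so the choice between them is mostly about which characters your scanner pipeline handles reliably.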

Nigrify answered 4/4, 2020 at 23:30 Comment(4)
Where did you find the range 00-1F and 7F-9F? The bytes that were rejected and got replaced with 3 bytes are not of the range that you indicated. For example, these bytes stayed intact: 42, 03, 24, 6A, 6E, 77. But these bytes are gone: EE, DB, C1, ...Ingot
If you look at the character map at the code page layout en.wikipedia.org/wiki/ISO/IEC_8859-1#Code_page_layout you will see that values 00-1F and 7F-9F are "Undefined", i.e. do not encode any defined character. ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1", consisting of 191 characters from the Latin script.Nigrify
Thank you! Upvoting answer.Ingot
At least in .NET Framework I have confirmed that the encoding ISO-8859-1 round-trips all characters in the range from 0 to 255, in both directions. (Technically, the encoding is Windows-1252.)Postmark
A
4

Update - Apr. 24, 2023: I just rewrote my Base45 Library to be compliant with the RFC-9285 Base45 Standard, and it is no longer dependent upon ZXing. The code in this answer is old; the repo uses a more efficient algorithm, which results in much simpler code and reduces the storage efficiency loss to only 3% behind raw binary.

See: v2.1.0 https://github.com/yurelle/Base45Encoder


Update - Nov. 13, 2021: I recently went back and published the referenced code as a project on GitHub for anyone who wants to use it. https://github.com/yurelle/Base45Encoder


This is a bit necro, but I just hit this problem, and figured out a solution.

The problem with reading QR Codes with ZXING is that it assumes all QR Payloads are Strings. If you're willing to generate the QR Code in Java with ZXING, I developed a solution which enables storing a binary payload in ZXING QR Codes with a storage efficiency loss of only 8%; better than the 33% inflation from Base64.

It exploits an internal compression optimization of the ZXING library based around pure Alphanum Strings. If you want a full explanation, with math and Unit Tests, check out my other answer.

But the short answer is this:

Solution

I implemented it as a self-contained static utility class, so all you have to do is call:

//Encode
final byte[] myBinaryData = ...;
final String encodedStr = BinaryToBase45Encoder.encodeToBase45QrPayload(myBinaryData);

//Decode
final byte[] decodedBytes = BinaryToBase45Encoder.decodeBase45QrPayload(encodedStr);

Alternatively, you can also do it via InputStreams:

//Encode
final InputStream in_1 = ... ;
final String encodedStr = BinaryToBase45Encoder.encodeToBase45QrPayload(in_1);

//Decode
final InputStream in_2 = ... ;
final byte[] decodedBytes = BinaryToBase45Encoder.decodeBase45QrPayload(in_2);

Here's the implementation

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

/**
 * For some reason none of the Java QR Code libraries support binary payloads. At least, none that
 * I could find anyway. The commonly suggested workaround for this is to use Base64 encoding.
 * However, this results in a 33% payload size inflation. If your payload is already near the size
 * limit of QR codes, this is a lot.
 *
 * This class implements an encoder which takes advantage of a built-in compression optimization
 * of the ZXING QR Code library, to enable the storage of Binary data into a QR Code, with a
 * storage efficiency loss of only 8%.
 *
 * The built-in optimization is this: ZXING will automatically detect if your String payload is
 * purely AlphaNumeric (by their own definition), and if so, it will automatically compress 2
 * AlphaNumeric characters into 11 bits.
 *
 *
 * ----------------------
 *
 *
 * The included ALPHANUMERIC_TABLE is the conversion table used by the ZXING library as a reverse
 * index for determining if a given input data should be classified as alphanumeric.
 *
 * See:
 *
 *      com.google.zxing.qrcode.encoder.Encoder.chooseMode(String content, String encoding)
 *
 * which scans through the input string one character at a time and passes them to:
 *
 *      getAlphanumericCode(int code)
 *
 * in the same class, which uses that character as a numeric index into the
 * ALPHANUMERIC_TABLE.
 *
 * If you examine the values, you'll notice that it ignores / disqualifies certain values, and
 * effectively converts the input into base 45 (0 -> 44; -1 is interpreted by the calling code
 * to mean a failure). This is confirmed in the function:
 *
 *      appendAlphanumericBytes(CharSequence content, BitArray bits)
 *
 * where they pack 2 of these base 45 digits into 11 bits. This presents us with an opportunity.
 * If we can take our data, and convert it into a compatible base 45 alphanumeric representation,
 * then the QR Encoder will automatically pack that data into sub-byte chunks.
 *
 * 2 digits in base 45 is 2,025 possible values. 11 bits has a maximum storage capacity of 2,048
 * possible states. This is only a loss of 1.1% in storage efficiency behind raw binary.
 *
 *      45 ^ 2 = 2,025
 *      2 ^ 11 = 2,048
 *      2,048 - 2,025 = 23
 *      23 / 2,048 = 0.01123046875 = 1.123%
 *
 * However, this is the ideal / theoretical efficiency. This implementation processes data in
 * chunks, using a Long as a computational buffer. However, since Java Longs are signed, we
 * can only use the lower 7 bytes. The conversion code requires continuously positive values;
 * using the highest 8th byte would contaminate the sign bit and randomly produce negative
 * values.
 *
 *
 * Real-World Test:
 *
 * Using a 7 byte Long to encode a 2KB buffer of random bytes, we get the following results.
 *
 *      Raw Binary Size:        2,048
 *      Encoded String Size:    3,218
 *      QR Code Alphanum Size:  2,213 (after the QR Code compresses 2 base45 digits to 11 bits)
 *
 * This is a real-world storage efficiency loss of only 8%.
 *
 *      2,213 - 2,048 = 165
 *      165 / 2,048 = 0.08056640625 = 8.0566%
 */
public class BinaryToBase45Encoder {
    public final static int[] ALPHANUMERIC_TABLE;

    /*
     * You could probably just copy & paste the array literal from the ZXING source code; it's only
     * an array definition. But I was unsure of the licensing issues with posting it on the internet,
     * so I did it this way.
     */
    static {
        final Field SOURCE_ALPHANUMERIC_TABLE;
        int[] tmp;

        //Copy lookup table from ZXING Encoder class
        try {
            SOURCE_ALPHANUMERIC_TABLE = com.google.zxing.qrcode.encoder.Encoder.class.getDeclaredField("ALPHANUMERIC_TABLE");
            SOURCE_ALPHANUMERIC_TABLE.setAccessible(true);
            tmp = (int[]) SOURCE_ALPHANUMERIC_TABLE.get(null);
        } catch (NoSuchFieldException e) {
            e.printStackTrace();//Shouldn't happen
            tmp = null;
        } catch (IllegalAccessException e) {
            e.printStackTrace();//Shouldn't happen
            tmp = null;
        }

        //Store
        ALPHANUMERIC_TABLE = tmp;
    }

    public static final int NUM_DISTINCT_ALPHANUM_VALUES = 45;
    public static final char[] alphaNumReverseIndex = new char[NUM_DISTINCT_ALPHANUM_VALUES];

    static {
        //Build AlphaNum Index
        final int len = ALPHANUMERIC_TABLE.length;
        for (int x = 0; x < len; x++) {
            // The base45 result which the alphanum lookup table produces.
            // i.e. the base45 digit value which String characters are
            // converted into.
            //
            // We use this value to build a reverse lookup table to find
            // the String character we have to send to the encoder, to
            // make it produce the given base45 digit value.
            final int base45DigitValue = ALPHANUMERIC_TABLE[x];

            //Ignore the -1 records
            if (base45DigitValue > -1) {
                //The index into the lookup table which produces the given base45 digit value.
                //
                //i.e. to produce a base45 digit with the numeric value in base45DigitValue, we need
                //to send the Encoder a String character with the numeric value in x.
                alphaNumReverseIndex[base45DigitValue] = (char) x;
            }
        }
    }

    /*
     * The storage capacity of one digit in the number system; i.e. the maximum
     * possible number of distinct values which can be stored in 1 logical digit
     */
    public static final int QR_PAYLOAD_NUMERIC_BASE = NUM_DISTINCT_ALPHANUM_VALUES;

    /*
     * We can't use all 8 bytes, because the Long is signed, and the conversion math
     * requires consistently positive values. If we populated all 8 bytes, then the
     * last byte has the potential to contaminate the sign bit, and break the
     * conversion math. So, we only use the lower 7 bytes, and avoid this problem.
     */
    public static final int LONG_USABLE_BYTES = Long.BYTES - 1;

    //The following mapping was determined by brute-forcing -1 Long (all bits 1), and compressing to base45 until it hit zero.
    public static final int[] BINARY_TO_BASE45_DIGIT_COUNT_CONVERSION = new int[] {0,2,3,5,6,8,9,11,12};
    public static final int NUM_BASE45_DIGITS_PER_LONG = BINARY_TO_BASE45_DIGIT_COUNT_CONVERSION[LONG_USABLE_BYTES];
    public static final Map<Integer, Integer> BASE45_TO_BINARY_DIGIT_COUNT_CONVERSION = new HashMap<>();

    static {
        //Build Reverse Lookup
        int len = BINARY_TO_BASE45_DIGIT_COUNT_CONVERSION.length;
        for (int x=0; x<len; x++) {
            int numB45Digits = BINARY_TO_BASE45_DIGIT_COUNT_CONVERSION[x];
            BASE45_TO_BINARY_DIGIT_COUNT_CONVERSION.put(numB45Digits, x);
        }
    }

    public static String encodeToBase45QrPayload(final byte[] inputData) throws IOException {
        return encodeToBase45QrPayload(new ByteArrayInputStream(inputData));
    }

    public static String encodeToBase45QrPayload(final InputStream in) throws IOException {
        //Init conversion state vars
        final StringBuilder strOut = new StringBuilder();
        int data;
        long buf = 0;

        // Process all input data in chunks of size LONG.BYTES, this allows for economies of scale
        // so we can process more digits of arbitrary size before we hit the wall of the binary
        // chunk size in a power of 2, and have to transmit a sub-optimal chunk of the "crumbs"
        // left over; i.e. the slack space between where the multiples of QR_PAYLOAD_NUMERIC_BASE
        // and the powers of 2 don't quite line up.
        while(in.available() > 0) {
            //Fill buffer
            int numBytesStored = 0;
            while (numBytesStored < LONG_USABLE_BYTES && in.available() > 0) {
                //Read next byte
                data = in.read();

                //Push byte into buffer
                buf = (buf << 8) | data; //8 bits per byte

                //Increment
                numBytesStored++;
            }

            //Write out in lower base
            final StringBuilder outputChunkBuffer = new StringBuilder();
            final int numBase45Digits = BINARY_TO_BASE45_DIGIT_COUNT_CONVERSION[numBytesStored];
            int numB45DigitsProcessed = 0;
            while(numB45DigitsProcessed < numBase45Digits) {
                //Chunk out a digit
                final byte digit = (byte) (buf % QR_PAYLOAD_NUMERIC_BASE);

                //Drop digit data from buffer
                buf = buf / QR_PAYLOAD_NUMERIC_BASE;

                //Write Digit
                outputChunkBuffer.append(alphaNumReverseIndex[(int) digit]);

                //Track output digits
                numB45DigitsProcessed++;
            }

            /*
             * The way this code works, the processing output results in a First-In-Last-Out digit
             * reversal. So, we need to buffer the chunk output, and feed it to the OutputStream
             * backwards to correct this.
             *
             * We could probably get away with writing the bytes out in inverted order, and then
             * flipping them back on the decode side, but just to be safe, I'm always keeping
             * them in the proper order.
             */
            strOut.append(outputChunkBuffer.reverse().toString());
        }

        //Return
        return strOut.toString();
    }

    public static byte[] decodeBase45QrPayload(final String inputStr) throws IOException {
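        //Prep for InputStream. The payload contains only ASCII characters, so convert it
        //explicitly as US-ASCII instead of relying on the platform default charset.
        final byte[] buf = inputStr.getBytes(java.nio.charset.StandardCharsets.US_ASCII);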

        return decodeBase45QrPayload(new ByteArrayInputStream(buf));
    }

    public static byte[] decodeBase45QrPayload(final InputStream in) throws IOException {
        //Init conversion state vars
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        int data;
        long buf = 0;
        int x=0;

        // Process all input data in chunks of size LONG.BYTES, this allows for economies of scale
        // so we can process more digits of arbitrary size before we hit the wall of the binary
        // chunk size in a power of 2, and have to transmit a sub-optimal chunk of the "crumbs"
        // left over; i.e. the slack space between where the multiples of QR_PAYLOAD_NUMERIC_BASE
        // and the powers of 2 don't quite line up.
        while(in.available() > 0) {
            //Convert & Fill Buffer
            int numB45Digits = 0;
            while (numB45Digits < NUM_BASE45_DIGITS_PER_LONG && in.available() > 0) {
                //Read in next char
                char c = (char) in.read();

                //Translate back through lookup table
                int digit = ALPHANUMERIC_TABLE[(int) c];

                //Shift buffer up one digit to make room
                buf *= QR_PAYLOAD_NUMERIC_BASE;

                //Append next digit
                buf += digit;

                //Increment
                numB45Digits++;
            }

            //Write out in higher base
            final LinkedList<Byte> outputChunkBuffer = new LinkedList<>();
            final int numBytes = BASE45_TO_BINARY_DIGIT_COUNT_CONVERSION.get(numB45Digits);
            int numBytesProcessed = 0;
            while(numBytesProcessed < numBytes) {
                //Chunk out 1 byte
                final byte chunk = (byte) buf;

                //Shift buffer to next byte
                buf = buf >> 8; //8 bits per byte

                //Write byte to output
                //
                //Again, we need to invert the order of the bytes, so as we chunk them off, push
                //them onto a FILO stack; inverting their order.
                outputChunkBuffer.push(chunk);

                //Increment
                numBytesProcessed++;
            }

            //Write chunk buffer to output stream (in reverse order)
            while (outputChunkBuffer.size() > 0) {
                out.write(outputChunkBuffer.pop());
            }
        }

        //Return
        out.flush();
        out.close();
        return out.toByteArray();
    }
}
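
The digit-count table and the 8% figure in the class comment can be re-derived independently. The sketch below is not part of the original code and its names are made up for illustration. It computes, for each chunk size, the smallest number of base-45 digits that can hold that many bytes, then estimates the alphanumeric-mode size of a 2,048-byte payload split into 7-byte chunks, assuming 11 bits per pair of characters and 6 bits for a leftover character.

```java
import java.math.BigInteger;

public class Base45ChunkMathSketch {

    /** Smallest number of base-45 digits whose capacity covers all values of `numBytes` bytes. */
    static int digitsForBytes(int numBytes) {
        BigInteger states = BigInteger.valueOf(256).pow(numBytes);
        BigInteger capacity = BigInteger.ONE;
        int digits = 0;
        while (capacity.compareTo(states) < 0) {
            capacity = capacity.multiply(BigInteger.valueOf(45));
            digits++;
        }
        return digits;
    }

    /** Number of base-45 characters produced when the payload is split into fixed-size chunks. */
    static int encodedChars(int payloadBytes, int chunkBytes) {
        int fullChunks = payloadBytes / chunkBytes;
        int remainder = payloadBytes % chunkBytes;
        return fullChunks * digitsForBytes(chunkBytes) + digitsForBytes(remainder);
    }

    /** Bytes occupied in alphanumeric mode, rounded up to a whole byte. */
    static int alphanumericBytes(int chars) {
        int bits = (chars / 2) * 11 + (chars % 2) * 6;
        return (bits + 7) / 8;
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 8; n++) {
            System.out.println(n + " bytes -> " + digitsForBytes(n) + " base-45 digits");
        }
        int chars = encodedChars(2048, 7);
        int qrBytes = alphanumericBytes(chars);
        System.out.println("encoded chars: " + chars);
        System.out.println("QR alphanumeric bytes: " + qrBytes);
        System.out.println("overhead: " + (qrBytes - 2048) * 100.0 / 2048 + "%");
    }
}
```

Most of the overhead comes from rounding each 7-byte chunk up to 11 whole digits, not from the 1.1% gap between 2,025 and 2,048.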
Assentation answered 13/11, 2020 at 8:24 Comment(7)
Doesn't base64 only have 33% (4/3) inflation?Grommet
It's been a couple years since I messed with QR Codes, but unless ZXING does something special for Base64 and does some kind of special packing, then no. Base64 only has 64 possible values per character, but that single ASCII character takes up a whole byte. That byte has the capacity to store 256 possible unique values, but you're only using 64 of them for data, and the rest are wasted. So, a single base64 digit can only hold 64/256=25% of what a raw binary byte can, and thus you need (4x) Base64 digits to store all 256 possible values of a byte.Assentation
uhhh... 64 = 6 bits, 256 = 8 bits. converting bytes to bits makes the data 8x longer, converting bits to base64 makes the data 6x shorter. for a total of 8/6 (= 4/3).Grommet
otherwise what you're saying is, if we use 4-bit ints to store data, it's a whopping 16 million times shorter (but each unit is 4 bytes instead of one, so it's only... 4 million times shorter. meaning you can compress 16MB into 4 bytes)Grommet
I don't get how you twisted what I said into asserting that you could compress 16MB into 4 bytes, but yes, I was wrong. Base64 is only ~33% expansion; I just wrote a quick Java program to test it and: byte[] buf = new byte[128]; new Random().nextBytes(buf); Base64.getEncoder().encodeToString(buf).length(); is 172.Assentation
For some reason I was thinking about the base64 digits as being independent rather than stretching the data across the usable bit space within multiple digits; i.e. addition rather than multiplication. Base64 is 6 bits, but the 2 remaining bits (7&8), can index 4 unique variants of those 6 bits within a single byte. But, as you pointed out, you can just overflow the remaining 2 bits into the next digit. Which I should have known, because that's exactly what my algorithm does with Base45.Assentation
I think back when I was doing that QR stuff, I had binary data that should have been able to fit in the QR code, but when I tried encoding to Base64, it was too big. And, without thinking too much, I did that rough calculation on the Base64 expansion, and just accepted the 4x inflation without thinking too much about it. Then discovered that Base45 exploit, and started working on that.Assentation
S
3

Just at a glance, the qr formats are different. I'd compare the qr formats to see if it's a problem of error correction or encoding or something else.


Selfsacrifice answered 23/6, 2016 at 15:44 Comment(2)
This should be a comment because it doesn't answer the question.Capwell
I'm pretty sure 1. and 3. are both 8-bit mode. The Enc bits seem to be error-corrected so you can't read them directly.Tyrrell
S
2

I used System.Convert.ToBase64String to convert the supplied sample byte array into a Base64-encoded string, then I used ZXing to create a QRCode image.

Next I called ZXing to read the string back from the generated QRCode, and then called System.Convert.FromBase64String to convert the string back into a byte array.

I confirm that the data completed the round trip successfully.

Scriber answered 7/9, 2017 at 16:25 Comment(0)
P
1

The informational RFC 9285 - The Base45 Data Encoding document describing the optimal scheme for storing binary data within the constraints of QR Alphanumeric Mode was recently published by the IETF.

(one positive side-effect of ongoing standardization work surrounding Health Certificate QR-codes)
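
For reference, the scheme in RFC 9285 is small enough to sketch directly. The version below is a minimal illustration, not the code from any library mentioned on this page: it takes two bytes at a time as a 16-bit number and writes three base-45 digits, least significant first, with a trailing single byte becoming two digits. The class name is made up; the alphabet is the QR alphanumeric character set in its usual order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class Rfc9285Base45Sketch {
    static final String ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:";

    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        for (; i + 1 < data.length; i += 2) {
            // Two bytes form one number in 0..65535, written as three digits (lowest first).
            int n = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            sb.append(ALPHABET.charAt(n % 45));
            sb.append(ALPHABET.charAt((n / 45) % 45));
            sb.append(ALPHABET.charAt(n / (45 * 45)));
        }
        if (i < data.length) {
            // A single trailing byte is written as two digits.
            int n = data[i] & 0xFF;
            sb.append(ALPHABET.charAt(n % 45));
            sb.append(ALPHABET.charAt(n / 45));
        }
        return sb.toString();
    }

    static byte[] decode(String text) {
        if (text.length() % 3 == 1) {
            throw new IllegalArgumentException("Invalid Base45 length: " + text.length());
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i += 3) {
            int groupSize = Math.min(3, text.length() - i);
            int n = 0;
            int weight = 1;
            for (int j = 0; j < groupSize; j++) {
                int digit = ALPHABET.indexOf(text.charAt(i + j));
                if (digit < 0) {
                    throw new IllegalArgumentException("Invalid Base45 character at index " + (i + j));
                }
                n += digit * weight;
                weight *= 45;
            }
            if (groupSize == 3) {
                if (n > 0xFFFF) throw new IllegalArgumentException("Group value out of range");
                out.write(n >> 8);
                out.write(n & 0xFF);
            } else {
                if (n > 0xFF) throw new IllegalArgumentException("Group value out of range");
                out.write(n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String encoded = encode("Hello!!".getBytes(StandardCharsets.US_ASCII));
        System.out.println(encoded);
        System.out.println(new String(decode(encoded), StandardCharsets.US_ASCII));
    }
}
```

Every three characters carry 16 bits and cost 16.5 bits in alphanumeric mode, which is where the roughly 3% overhead mentioned in an earlier answer comes from.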

Phenomena answered 15/8, 2022 at 9:29 Comment(0)
