Delimiting binary sequences
Asked Answered
W

6

20

I need to be able to delimit a stream of binary data. I was thinking of using something like the ASCII EOT (End of Transmission) character to do this.

However I'm a bit concerned -- how can I know for sure that the particular binary sequence used for this (0b00000100) won't appear in my own binary sequences, thus giving a false positive on delimitation?

In other words, how is binary delimiting best handled?

EDIT: ...Without using a length header. Sorry guys, should have mentioned this before.

Whithersoever answered 17/12, 2011 at 0:38 Comment(0)
P
22

You've got five options:

  • Use a delimiter character that is unlikely to occur. This runs the risk of you guessing incorrectly. I don't recommend this approach.
  • Use a delimiter character and an escape sequence to include the delimiter. You may need to double the escape character, depending upon what makes for easier parsing. (Think of the C \0 to include an ASCII NUL in some content.)
  • Use a delimiter phrase that you can determine does not occur. (Think of the mime message boundaries.)
  • Prepend a length field of some sort, so you know to read the following N bytes as data. This has the downside of requiring you to know this length before writing the data, which is sometimes difficult or impossible.
  • Use something far more complicated, like ASN.1, to completely describe all your content for you. (I don't know if I'd actually recommend this unless you can make good use of it -- ASN.1 is awkward to use in the best of circumstances, but it does allow completely unambiguous binary data interpretation.)
Porphyroid answered 17/12, 2011 at 0:50 Comment(0)
P
10

Usually, you wrap your binary data in a well known format, for example with a fixed header that describes the subsequent data. If you are trying to find delimeters in an unknown stream of data, usually you need an escape sequence. For example, something like HDLC, where 0x7E is the frame delimeter. Data must be encoded such that if there is 0x7E inside the data, it is replaced with 0x7D followed by an XOR of the original data. 0x7D in the data stream is similarly escaped.

Peterkin answered 17/12, 2011 at 0:45 Comment(1)
Can you explain the logic here? Given the choice of 0x7E as frame delimiter, why replace this byte in the stream by 0x7D and a XORed value? Why not simply replace with 0x7E0x7E? How is the XORed value calculated exactly? What happens when a stream contains 0x7D? How do you discriminate between this "naturally" in the stream and as a replacement for 0x7E?Valetudinarian
L
4

If the binary records can really contain any data, try adding a length before the data instead of a marker after the data. This is sometimes called a prefix length because the length comes before the data.

Otherwise, you'd have to escape the delimiter in the byte stream (and escape the escape sequence).

Landowska answered 17/12, 2011 at 0:43 Comment(0)
T
4

You can prepend the size of the binary data before it. If you are dealing with streamed data and don't know its size beforehand, you can divide it into chunks and have each chunk begin with size field.

If you set a maximum size for a chunk, you will end up with all but the last chunk the same length which will simplify random access should you require it.

Trimer answered 17/12, 2011 at 0:43 Comment(0)
B
1

As a space-efficient and fixed-overhead alternative to prepending your data with size fields and escaping the delimiter character, the escapeless encoding can be used to trim off that delimiter character, probably together with other characters that should have special meaning, from your data.

Buiron answered 4/6, 2019 at 12:39 Comment(0)
P
1

@sarnold's answer is excellent, and here I want to share some code to illustrate it.

First here is a wrong way to do it: using a \n delimiter. Don't do it! the binary data could contain \n, and it would be mixed up with the delimiters:

import os, random
with open('test', 'wb') as f:
    for i in range(100):                         # create 100 binary sequences of random
        length = random.randint(2, 100)          # length (between 2 and 100)
        f.write(os.urandom(length) + b'\n')      # separated with the character b"\n"
with open('test', 'rb') as f:
    for i, l in enumerate(f):
        print(i, l)                              # oops we get 123 sequences! wrong!

...
121 b"L\xb1\xa6\xf3\x05b\xc9\x1f\x17\x94'\n"
122 b'\xa4\xf6\x9f\xa5\xbc\x91\xbf\x15\xdc}\xca\x90\x8a\xb3\x8c\xe2\x07\x96<\xeft\n'

Now the right way to do it (option #4 in sarnold's answer):

import os, random
with open('test', 'wb') as f:
    for i in range(100):
        length = random.randint(2, 100)
        f.write(length.to_bytes(2, byteorder='little'))   # prepend the data with the length of the next data chunk, packed in 2 bytes
        f.write(os.urandom(length))
with open('test', 'rb') as f:
    i = 0
    while True:
        l = f.read(2)     # read the length of the next chunk
        if l == b'':      # end of file
            break
        length = int.from_bytes(l, byteorder='little') 
        s = f.read(length)
        print(i, s)
        i += 1

...
98 b"\xfa6\x15CU\x99\xc4\x9f\xbe\x9b\xe6\x1e\x13\x88X\x9a\xb2\xe8\xb7(K'\xf9+X\xc4"
99 b'\xaf\xb4\x98\xe2*HInHp\xd3OxUv\xf7\xa7\x93Qf^\xe1C\x94J)'

Paris answered 19/11, 2020 at 22:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.