Writing a stream protocol: Message size field or Message delimiter?
Asked Answered
D

7

17

I am about to write a message protocol going over a TCP stream. The receiver needs to know where the message boundaries are.

I can either send 1) fixed length messages, 2) size fields so the receiver knows how big the message is, or 3) a unique message terminator (I guess this can't be used anywhere else in the message).

I won't use #1 for efficiency reasons.

I like #2 but is it possible for the stream to get out of sync?

I don't like idea #3 because it means the receiver can't know the size of the message ahead of time, and it also requires that the terminator doesn't appear elsewhere in the message.

With #2, if it's possible to get out of sync, can I add a terminator or am I guaranteed to never get out of sync as long as the sender program is correct in what it sends? Is it necessary to do #2 AND #3?

Please let me know.

Thanks, jbu

Dziggetai answered 25/6, 2009 at 23:5 Comment(1)
For option #3, look into byte stuffing for a way to use the delimiter value in the message body. I'm not saying that you should use option #3, just pointing out how delimiters can be made unambiguous in a stream of bytes.Provision
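For anyone curious what byte stuffing looks like concretely, here is a minimal sketch in Python; the PPP/HDLC-style escape values (0x7E terminator, 0x7D escape, XOR with 0x20) and the function names are just illustrative, not something from the comment above:

```python
FLAG = 0x7E   # message terminator
ESC = 0x7D    # escape byte
XOR = 0x20    # escaped bytes are stored XOR-ed with this value

def stuff(payload: bytes) -> bytes:
    """Escape any FLAG/ESC bytes in the payload, then append the terminator."""
    out = bytearray()
    for b in payload:
        if b in (FLAG, ESC):
            out.append(ESC)
            out.append(b ^ XOR)
        else:
            out.append(b)
    out.append(FLAG)          # unambiguous end-of-message marker
    return bytes(out)

def unstuff(frame: bytes) -> bytes:
    """Reverse of stuff(); expects a single frame ending with FLAG."""
    assert frame.endswith(bytes([FLAG]))
    out = bytearray()
    it = iter(frame[:-1])
    for b in it:
        out.append(next(it) ^ XOR if b == ESC else b)
    return bytes(out)
```

With this, the delimiter value can still occur in the original payload; it simply never appears unescaped on the wire.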
S
7

I agree with sigjuice. If you have a size field, it's not necessary to add an end-of-message delimiter -- however, it's a good idea. Having both makes things much more robust and easier to debug.

Consider using the standard netstring format, which includes both a size field and an end-of-string character. Because it has a size field, it's OK for the end-of-string character to be used inside the message.
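To make the netstring idea concrete, here is a rough sketch in Python (my own illustration of the format, not code from this answer): the wire form is the decimal length, a colon, the payload, and a trailing comma, e.g. b"5:hello,".

```python
def encode_netstring(payload: bytes) -> bytes:
    # "<decimal length>:<payload>," -- size field plus trailing terminator.
    return str(len(payload)).encode() + b":" + payload + b","

def decode_netstring(buf: bytes) -> tuple[bytes, bytes]:
    """Return (payload, remaining bytes); raise if the frame is malformed or incomplete."""
    head, sep, rest = buf.partition(b":")
    if not sep:
        raise ValueError("incomplete netstring header")
    n = int(head)                      # raises ValueError on a non-numeric length
    if len(rest) < n + 1:
        raise ValueError("incomplete netstring payload")
    if rest[n:n + 1] != b",":
        raise ValueError("missing ',' terminator -- stream out of sync")
    return rest[:n], rest[n + 1:]
```

Because the length comes first, the ',' terminator is never ambiguous even when commas occur in the payload.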

Sucre answered 25/6, 2009 at 23:5 Comment(0)
I
5

You are using TCP, so packet delivery is reliable. Either the connection drops or times out, or you will read the whole message. So option #2 is OK.

Involucre answered 25/6, 2009 at 23:14 Comment(1)
I think even TCP data can get corrupted.Irresolute
T
4

If you are developing both the transmit and receive code from scratch, it wouldn't hurt to use both length headers and delimiters. This would provide robustness and error detection. Consider the case where you just use #2: if you write a length field of N to the TCP stream but end up sending a message of a size different from N, the receiving end wouldn't know any better and would end up confused.

If you use both #2 and #3, while not foolproof, the receiver can have a greater degree of confidence that it received the message correctly if it encounters the delimiter after consuming N bytes from the TCP stream. You can also safely use the delimiter inside your message.
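As a sketch of what #2 plus #3 could look like in Python; the 4-byte big-endian length field and the newline delimiter are arbitrary choices for illustration:

```python
import socket
import struct

DELIM = b"\n"  # illustrative terminator appended after every framed message

def send_msg(sock: socket.socket, payload: bytes) -> None:
    # 4-byte big-endian length (#2), then the payload, then a delimiter (#3).
    sock.sendall(struct.pack("!I", len(payload)) + payload + DELIM)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping over recv() as needed."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf.extend(chunk)
    return bytes(buf)

def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    payload = recv_exact(sock, length)
    if recv_exact(sock, 1) != DELIM:
        # Length and terminator disagree: the framing can no longer be trusted.
        raise ValueError("framing error: delimiter not found after payload")
    return payload
```

If the delimiter check ever fails, the safest reaction is usually to drop the connection, since you no longer know where the next message starts.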

Take a look at HTTP Chunked Transfer Coding for a real world example of using both #2 and #3.

Turbofan answered 25/6, 2009 at 23:34 Comment(0)
S
3

Depending on the level at which you're working, #2 may actually not have any issues with going out of sync (TCP has sequence numbering in its packets and reassembles the stream in the correct order for you if it arrives out of order).

Thus, #2 is probably your best bet. In addition, knowing the message size early on in the transmission will make it easier to allocate memory on the receiving end.
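For example, a receiver can allocate the buffer up front as soon as the size field has been read; the 4-byte length prefix and the 1 MiB cap below are assumptions for illustration, with the cap guarding against trusting an absurdly large size field:

```python
import socket
import struct

MAX_MESSAGE_SIZE = 1 << 20  # arbitrary 1 MiB cap so a bogus size field can't exhaust memory

def _fill(sock: socket.socket, view: memoryview) -> None:
    """Fill the whole buffer from the socket, or raise if the peer closes early."""
    got = 0
    while got < len(view):
        n = sock.recv_into(view[got:])
        if n == 0:
            raise ConnectionError("connection closed mid-message")
        got += n

def read_message(sock: socket.socket) -> bytes:
    header = bytearray(4)
    _fill(sock, memoryview(header))
    (length,) = struct.unpack("!I", header)   # 4-byte big-endian size field
    if length > MAX_MESSAGE_SIZE:
        raise ValueError(f"declared length {length} exceeds limit")
    payload = bytearray(length)               # allocate once, since the size is known up front
    _fill(sock, memoryview(payload))
    return bytes(payload)
```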

Saline answered 25/6, 2009 at 23:16 Comment(2)
In addition, knowing the message size early on in the transmission will make it easier to allocate memory on the receiving end. A word of care: Make sure to limit how much memory gets allocated. Otherwise, you are susceptible to DDoS attacks with custom packets that have a size field of 2^32-1 (or however large your integers are), quickly filling up your memory.Disfranchise
If the length gets corrupted, for example becomes larger than expected, things will go very wrong. TCP data can still get corrupted in some cases, by the way.Irresolute
U
2

Interesting there is no clear answer here. #2 is generally safe over TCP, and is done "in the real world" quite often. This is because TCP guarantees that all data arrives both uncorrupted* and in the order that it was sent.

*Unless corrupted in such a way that the TCP checksum still passes.

Umberto answered 9/10, 2012 at 6:37 Comment(1)
Actually, TCP does not guarantee data to arrive uncorrupted.Irresolute
D
1

Answering an old message, since there is stuff to correct:

Unlike what many answers here claim, TCP does not guarantee that data arrives uncorrupted. Not even in practice.

The TCP protocol has a 2-byte checksum (a simple ones'-complement sum, not a CRC), so it has roughly a 1-in-65536 chance of missing corruption when more than one bit flips. This is such a small chance that it will never be encountered in tests, but if you are developing something that either transmits large amounts of data and/or is used by very many end users, that die gets thrown trillions of times (not kidding, YouTube throws it about 30 times a second per user).

Option #2 (a size field) is the only practical option, for the reasons you yourself listed. Fixed-length messages would be wasteful, and delimiter marks necessitate running the entire payload through some sort of encode-decode stage to replace at least three different symbols: the start symbol, the end symbol, and the replacement symbol that signals a replacement has occurred.

In addition to this, one will most likely want to use some sort of error checking with a serious checksum, probably implemented in tandem with the encryption protocol as a message-validity check.


As to the possibility of getting out of sync: This is possible per message, but has a remedy.

A useful scheme is to start each message with a header. This header can be quite short (<30 bytes) and contain the message payload length, the expected checksum of the payload, and a checksum for that first portion of the header itself. Messages will also have a maximum length. Such a short header can also be delimited with known symbols.

Now the receiving end will always be in one of two states:

  1. Waiting for new message header to arrive
  2. Receiving more data for an ongoing message, whose length and checksum are known.

This way the receiver will, in any situation, be out of sync for at most the maximum length of one message (assuming a corrupted header whose length field was itself corrupted).

With this scheme all messages arrive as discrete payloads, the receiver cannot get stuck forever even with maliciously corrupted data in between, the length of arriving payloads is known in advance, and a successfully transmitted payload has been verified by an additional, longer checksum, and that checksum itself has been verified. The overhead for all this can be a mere 26-byte header containing three 64-bit fields and two delimiting symbols.

(The header does not require replacement encoding, since it is expected only in a state without an ongoing message, and the entire 26 bytes can be processed at once.)
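As a rough Python sketch of such a header (the 2-byte marker value and the use of CRC-32 for both checksums are my own assumptions; the answer only fixes the overall shape of two delimiting bytes plus three 64-bit fields):

```python
import struct
import zlib

MAGIC = b"\xAA\x55"                            # illustrative 2-byte header delimiter
HEADER_FMT = "!QQQ"                            # payload length, payload checksum, header checksum
HEADER_LEN = 2 + struct.calcsize(HEADER_FMT)   # 26 bytes total
MAX_PAYLOAD = 1 << 24                          # arbitrary maximum message length

def build_frame(payload: bytes) -> bytes:
    body = struct.pack("!QQ", len(payload), zlib.crc32(payload))
    head_sum = zlib.crc32(MAGIC + body)        # checksum over the first portion of the header
    return MAGIC + body + struct.pack("!Q", head_sum) + payload

def try_parse_header(buf: bytes):
    """Return (payload_len, payload_crc) if buf starts with a valid header, else None."""
    if len(buf) < HEADER_LEN or not buf.startswith(MAGIC):
        return None
    length, payload_crc, head_sum = struct.unpack(HEADER_FMT, buf[2:HEADER_LEN])
    if head_sum != zlib.crc32(buf[:HEADER_LEN - 8]) or length > MAX_PAYLOAD:
        return None                            # corrupt header: caller should scan ahead to resync
    return length, payload_crc
```

A receiver in the "waiting for a header" state can scan forward for MAGIC and call try_parse_header at each candidate position; because a corrupted header fails its own checksum, desynchronization stays bounded roughly as described above.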

Diapositive answered 18/11, 2020 at 0:21 Comment(3)
"30 times a second per user"? Really? Any reference?Christiansand
My writing is probably a bit convoluted. What I mean is that a user (who is getting video data for HD video) gets ~30 tcp packets a second. Each packet is in essence a dice throw in the sense that if it was corrupted, the crc could match accidentally. A small fraction is corrupted, and a smaller fraction is not caught.Diapositive
The header does not require replacement-encoding: when you are out of sync and searching for headers, a header symbol appearing inside a message body will mislead you.Irresolute
V
0

There is a fourth alternative: a self-describing protocol such as XML.

Vintager answered 4/7, 2010 at 5:15 Comment(1)
This is unsuitable for pure binary messages. It's basically option #3 but much worse than a single delimiter.Shroudlaid
