Portable C binary serialization primitives
A

3

10

As far as I know, the C library provides no help in serializing numeric values into a non-text byte stream. Correct me if I'm wrong.

The most standard tool in use is htonl et al from POSIX. These functions have shortcomings:

  • There is no 64-bit support.
  • There is no floating-point support.
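  • There are no versions for signed types. When deserializing, converting an unsigned value above the signed maximum back to a signed type gives an implementation-defined result (or raises an implementation-defined signal), so it is not portable.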
  • Their names do not state the size of the datatype.
  • They depend on 8-bit bytes and the presence of the exact-size uintN_t types.
  • The input types are the same as the output types, instead of referring to a byte stream.
    • This requires the user to perform a pointer typecast which is possibly unsafe in alignment.
    • Having performed that typecast, the user is likely to attempt to convert and output a structure in its native memory layout, a poor practice which results in unexpected errors.
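To illustrate the signed-type and byte-stream points above, here is a minimal sketch of what a stream-oriented pair of primitives could look like. The names put_be_i32 and get_be_i32 are hypothetical and not part of POSIX or the C library. The sketch assumes int32_t and uint32_t exist, writes one octet per unsigned char, and decodes without relying on an out-of-range unsigned-to-signed conversion.

```c
#include <stdint.h>

/* Hypothetical helpers, for illustration only. */

/* Write v as four big-endian octets in two's complement form. */
void put_be_i32(unsigned char *out, int32_t v)
{
    /* Signed-to-unsigned conversion is defined modulo 2^32, so u holds
       the two's complement bit pattern on any implementation. */
    uint32_t u = (uint32_t)v;
    out[0] = (unsigned char)((u >> 24) & 0xFFu);
    out[1] = (unsigned char)((u >> 16) & 0xFFu);
    out[2] = (unsigned char)((u >> 8) & 0xFFu);
    out[3] = (unsigned char)(u & 0xFFu);
}

/* Read four big-endian octets and return the signed value. */
int32_t get_be_i32(const unsigned char *in)
{
    uint32_t u = ((uint32_t)(in[0] & 0xFFu) << 24)
               | ((uint32_t)(in[1] & 0xFFu) << 16)
               | ((uint32_t)(in[2] & 0xFFu) << 8)
               |  (uint32_t)(in[3] & 0xFFu);

    if (u <= (uint32_t)INT32_MAX)
        return (int32_t)u;              /* value fits, conversion is exact */

    /* u is in [2^31, 2^32 - 1]. UINT32_MAX - u is in [0, 2^31 - 1], which
       fits in int32_t, so the arithmetic below never overflows. */
    return -1 - (int32_t)(UINT32_MAX - u);
}
```

Because the functions take a pointer to bytes rather than a same-typed integer, the caller never needs a pointer cast and the buffer offset can have any alignment.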

An interface for serializing arbitrary-size char to 8-bit standard bytes would fall in between the C standard, which doesn't really acknowledge 8-bit bytes, and whatever standards (ITU?) set the octet as the fundamental unit of transmission. But the older standards aren't getting revised.

Now that C11 has many optional components, a binary serialization extension could be added alongside things like threads without placing demands on existing implementations.

Would such an extension be useful, or is worrying about non-two's-complement machines just that pointless?

Animation answered 16/7, 2012 at 8:15 Comment(13)
About hton?/ntoh? and size in name, the size actually is in the name. The last letter is the size, s for short and l for long. – Servia
@JoachimPileborg Except long is 64 bits on many systems, and a system with non-8-bit bytes is unlikely to match exactly 16-bit short or 32-bit long. The C standard is purposely ambiguous about it. – Animation
It could be useful as an external library, for the platforms where it might work. I don't see how the language has anything to win from this. If I am on my IBM mainframe, receiving your IEEE floating point format is not useful. A text representation would be much easier to handle. – Sparkle
@BoPersson It's not? We have a standard network byte order and IEEE binary format, so there's already pretty much just one way to send floats. Text has problems in particular with FP, where you trade precision against a lot of bloat. – Animation
No, IBM's float format was designed long before IEEE, so the values will have to be converted anyway. Receiving them in binary is no advantage at all, in that case. – Sparkle
@BoPersson That's completely missing the point, which is portability. File formats exist regardless of whether you want them. If you have a stream containing IEEE 754 floats according to RFC 791 byte order, that is a standard-compliant interface and it would be reasonable to expect C to mesh with it. Storing or transmitting in text would likely double the size and might not be a viable solution even disregarding the notion of portability. – Animation
Your portability requires non-IEEE systems to support conversion to and from non-native floating point formats. That's not something I would like in a language standard. – Sparkle
@BoPersson So it's OK for C to define a new text format to portably represent FP, but not to refer to already standardized (and universally adopted) description of a binary format? – Animation
htonl et al do not "operate in place"; they return the converted result without changing the input value. – Theretofore
@Potatoswatter: Bo's point is precisely that IEEE format is not universally adopted. Specifically, it is not supported by IBM zSeries mainframes, and quite probably not by other manufacturers' mainframes. All the world is not a PC (nor an iPad). – Wynn
@JonathanLeffler Native computation is irrelevant, we're talking about data exchange. It's universally adopted in the world of interoperable binary data transmission. – Animation
@DavidGelhar Took me a few minutes to remember what I actually meant :vP See edit. – Animation
@JonathanLeffler If a mainframe has no library to deserialize such a file, that perfectly illustrates my point. – Animation
D
6

I've never used them, but I think Google's Protocol Buffers satisfy your requirements.

  • 64 bit types, signed/unsigned, and floating point types are all supported.
  • The API generated is typesafe
  • Serialisation can be done to/from streams

This tutorial seems like a pretty good introduction, and you can read about the actual binary storage format here.


From their web page:

What Are Protocol Buffers?

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages – Java, C++, or Python.

There's no official implementation in pure C (only C++), but there are two C ports that might fit your needs:

I don't know how they fare in the presence of non-8-bit bytes, but it should be relatively easy to find out.

Dreary answered 24/7, 2012 at 1:50 Comment(0)
O
4

In my opinion the main drawback of functions like htonl() is that they do only half the work of serialization. All they do is swap the bytes of a multi-byte integer if your machine is little-endian. The other important thing that must be done when serializing is handling alignment, and these functions don't do that.

Many CPUs cannot (efficiently) access a multi-byte integer stored at an address that is not a multiple of the integer's size in bytes. This is the reason never to use struct overlays to (de)serialize network packets. I'm not sure if this is what you mean by 'in-place conversion'.

I work a lot with embedded systems, and I have functions in my own library which I always use when generating or parsing network packets (or any other I/O: disk, RS232, etc):

/* Serialize an integer into a little or big endian byte buffer, resp. */
void SerializeLeInt(uint64_t value, uint8_t *buffer, size_t nrBytes);
void SerializeBeInt(uint64_t value, uint8_t *buffer, size_t nrBytes);

/* Deserialize an integer from a little or big endian byte buffer, resp. */
uint64_t DeserializeLeInt(const uint8_t *buffer, size_t nrBytes);
uint64_t DeserializeBeInt(const uint8_t *buffer, size_t nrBytes);

Along with these functions, a bunch of macros are defined, such as:

#define SerializeBeInt16(value, buffer)     SerializeBeInt(value, buffer, sizeof(int16_t))
#define SerializeBeUint16(value, buffer)    SerializeBeInt(value, buffer, sizeof(uint16_t))
#define DeserializeBeInt16(buffer)          DeserializeBeType(buffer, int16_t)
#define DeserializeBeUint16(buffer)         DeserializeBeType(buffer, uint16_t)

The (de)serialize functions read or write the values byte by byte, so alignment problems cannot occur. You don't need to worry about signedness either. First, practically all systems these days use two's complement (besides a few ADCs maybe, but then you wouldn't use these functions). Second, it should work even on a system using ones' complement, because converting a signed integer to unsigned is defined modulo 2^N, which yields the two's complement bit pattern (and the functions accept/return unsigned integers).
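A minimal sketch of how functions with the prototypes above could be written byte by byte follows. This is an illustration under the assumption that nrBytes is at most 8, not the actual code from my library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: assumes nrBytes <= sizeof(uint64_t). */

void SerializeBeInt(uint64_t value, uint8_t *buffer, size_t nrBytes)
{
    assert(nrBytes <= sizeof(uint64_t));
    /* Fill from the end so the most significant byte lands at buffer[0]. */
    for (size_t i = nrBytes; i-- > 0; ) {
        buffer[i] = (uint8_t)(value & 0xFFu);
        value >>= 8;
    }
}

void SerializeLeInt(uint64_t value, uint8_t *buffer, size_t nrBytes)
{
    assert(nrBytes <= sizeof(uint64_t));
    /* Least significant byte goes first. */
    for (size_t i = 0; i < nrBytes; i++) {
        buffer[i] = (uint8_t)(value & 0xFFu);
        value >>= 8;
    }
}

uint64_t DeserializeBeInt(const uint8_t *buffer, size_t nrBytes)
{
    assert(nrBytes <= sizeof(uint64_t));
    uint64_t value = 0;
    /* buffer[0] is the most significant byte. */
    for (size_t i = 0; i < nrBytes; i++)
        value = (value << 8) | buffer[i];
    return value;
}

uint64_t DeserializeLeInt(const uint8_t *buffer, size_t nrBytes)
{
    assert(nrBytes <= sizeof(uint64_t));
    uint64_t value = 0;
    /* The last byte in the buffer is the most significant one. */
    for (size_t i = nrBytes; i-- > 0; )
        value = (value << 8) | buffer[i];
    return value;
}
```

Only single bytes are ever read or written, so the buffer pointer may have any alignment, and the host's endianness never enters into it.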

Another of your objections is that they depend on 8-bit bytes and the presence of exact-size uintN_t types. This also applies to my functions, but in my opinion it is not a problem (those types are always defined for the systems and compilers I work with). You could tweak the function prototypes to use unsigned char instead of uint8_t and something like unsigned long long or uint_least64_t instead of uint64_t if you like.
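As a rough sketch of that tweak, the big-endian pair could look like this. The names SerializeBeOctets and DeserializeBeOctets are hypothetical. Each unsigned char carries exactly one octet, and the masking keeps that true even where CHAR_BIT is larger than 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant that needs neither uint8_t nor uint64_t.
   Assumes nrOctets <= 8. */

void SerializeBeOctets(uint_least64_t value, unsigned char *buffer, size_t nrOctets)
{
    for (size_t i = nrOctets; i-- > 0; ) {
        /* Store only the low 8 bits, whatever the width of unsigned char. */
        buffer[i] = (unsigned char)(value & 0xFFu);
        value >>= 8;
    }
}

uint_least64_t DeserializeBeOctets(const unsigned char *buffer, size_t nrOctets)
{
    uint_least64_t value = 0;
    for (size_t i = 0; i < nrOctets; i++) {
        /* Ignore any bits above the octet in each buffer element. */
        value = (value << 8) | (uint_least64_t)(buffer[i] & 0xFFu);
    }
    return value;
}
```

uint_least64_t is required by C99 and later, unlike the optional exact-width types.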

Observation answered 23/7, 2012 at 13:56 Comment(2)
Yep, you pretty much describe my ideal interface… – Animation
It is true that a signed integer is converted to two's complement when cast to unsigned. Converting in the other direction, from an unsigned value in two's complement representation to a signed number in machine representation, is harder to do portably. It can be done, but it's not as trivial as a cast. – Impolicy
I
1

See the XDR library and the XDR standards, RFC 1014 and RFC 4506.

Impolicy answered 23/7, 2012 at 14:12 Comment(1)
Interesting, but it appears to be more like a high-level wrapper around htonl than a more portable alternative. For some reason they didn't name the datatypes with numbers. I don't see how the Linux API can be portable because xdr_long takes a long argument, not even unsigned long. That can't implement hyper integer from the RFC. Also each datum is placed in a four-byte code unit for the sake of performance, which seems to be confused about the role of serialization. – Animation
