C#/.NET - Custom Binary File Formats - Where to Start?

Asked 27/4, 2009 at 19:40 Answered 29/4, 2015 at 14:6

Solved .net file binary file-format binary-data

I need to be able to store some data in a custom binary file format. I've never designed my own file format before. It needs to be a friendly format for traveling between the C#, Java and Ruby/Perl/Python worlds.

To start with, the file will consist of records, each with a GUID field and a JSON/YAML/XML packet field. I'm not sure what to use as delimiters. A comma, tab or newline kind of thing seems too fragile. What does Excel do? Or the pre-XML OpenOffice formats? Should you use ASCII chars 0 or 1? Not sure where to begin. Any articles or books on the topic?

This file format may expand later to include a "header section".

Note: To start with I'll be working in .NET, but I'd like the format to be easily portable.

UPDATE:
The processing of the "packets" can be slow, but navigation within the file format cannot. So I think XML is off the table.

Bleach answered 27/4, 2009 at 19:40 Comment(2)

Re the edit: what is the use-case here? In many circumstances you choose not to navigate within the file, but to deserialize it into an object model and then work within that. Anything more, and you might as well use a database file of some (common) kind. – Numbat 27/4, 2009 at 20:5

I should add that this file will be too large to serialize. So I would never want to have all the data in memory at one time. It could be a List<SomeObject> serialized, but I need a delimiter so I don't have to read in the entire list at one time. – Bleach 27/4, 2009 at 22:9

I'll try to add some general hints for creating a portable binary file format.

Note that inventing a binary file format means documenting how the bits in it are laid out and what they mean. It's not coding, but documentation.

Now the hints:

Decide what to do about endianness. A good, simple approach is to decide it once and for all. Little-endian is the preferable choice when the files are mostly used on common PCs (that is, x86), since it avoids byte-order conversions (performance).
Create a header. Yes, it is a good idea to always have a header. The first bytes of the file should tell you what format you are dealing with.
- Start with a magic value so that your format can be recognized (an ASCII string will do the trick).
- Add a version. A format version costs nothing to add, and it lets you support backward compatibility later.
Finally, add the data. The format of the data will be specific to your exact needs. Basically, the data will be stored as a binary image of some data structure. That data structure is what you need to come up with.
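
A minimal Python sketch of the header hints above (Python being one of the languages the question wants the format to reach). The magic value `MYFM`, the 16-bit version field and the function names are invented for illustration; the `<` in the `struct` format string fixes the byte order to little-endian whatever machine the code runs on.

```python
import io
import struct

MAGIC = b"MYFM"                 # hypothetical 4-byte ASCII magic
VERSION = 1                     # hypothetical format version
HEADER = struct.Struct("<4sH")  # little-endian: 4-byte magic, uint16 version


def write_header(stream):
    """Write the fixed-size header at the current position."""
    stream.write(HEADER.pack(MAGIC, VERSION))


def read_header(stream):
    """Read and validate the header; return the file's version number."""
    raw = stream.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError("file too short to contain a header")
    magic, version = HEADER.unpack(raw)
    if magic != MAGIC:
        raise ValueError("not a MYFM file")
    if version > VERSION:
        raise ValueError("file was written by a newer version of the format")
    return version


buf = io.BytesIO()
write_header(buf)
buf.seek(0)
version = read_header(buf)
```

Rejecting unknown magic and newer versions up front is one possible policy; a reader could equally accept newer versions and ignore what it does not understand.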

If you need random access to your data by some sort of index, B-trees are the way to go, while if you just need to write a lot of numbers and then read them all back, an "array" will do the trick.

Additionally, you might use a TLV (Type-Length-Value) concept for forward compatibility.
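
To make the TLV idea concrete, here is a small Python sketch. The field widths (16-bit type, 32-bit length) and the type codes are assumptions chosen for the example, not part of any standard. The point it illustrates is forward compatibility: a reader that meets a type it does not know can use the length to step over it.

```python
import io
import struct

TLV_HEADER = struct.Struct("<HI")  # little-endian: uint16 type, uint32 length

# Hypothetical type codes for this sketch.
TYPE_GUID = 1
TYPE_PACKET = 2
KNOWN_TYPES = {TYPE_GUID, TYPE_PACKET}


def write_tlv(stream, type_code, value):
    """Write one type-length-value entry."""
    stream.write(TLV_HEADER.pack(type_code, len(value)))
    stream.write(value)


def read_known_tlvs(stream):
    """Yield (type, value) for known types; seek past unknown ones."""
    while True:
        head = stream.read(TLV_HEADER.size)
        if not head:
            return  # clean end of stream
        if len(head) != TLV_HEADER.size:
            raise ValueError("truncated TLV header")
        type_code, length = TLV_HEADER.unpack(head)
        if type_code in KNOWN_TYPES:
            value = stream.read(length)
            if len(value) != length:
                raise ValueError("truncated TLV value")
            yield type_code, value
        else:
            stream.seek(length, io.SEEK_CUR)  # skip the value without reading it


buf = io.BytesIO()
write_tlv(buf, TYPE_GUID, bytes(16))
write_tlv(buf, 99, b"added by a future version")  # unknown to this reader
write_tlv(buf, TYPE_PACKET, b'{"a": 1}')
buf.seek(0)
items = list(read_known_tlvs(buf))
```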

Pinckney answered 29/4, 2015 at 14:6 Comment(2)

Any suggestions on building my knowledge on "pages" inside file formats? Articles or books I should read? – Bleach 29/4, 2015 at 14:38

When I say "pages" I mean like database pages. SQLite was a little hard to follow the C code. Maybe Java or C# examples I could follow more clearly. – Bleach 29/4, 2015 at 14:40

How about looking at "protocol buffers"? Designed as an efficient, portable, version-tolerant, general-purpose binary format, it gives you C++, Java and Python in the Google library, and C#, Perl, Ruby and others in the community ports.

Note that Guid doesn't have a specific data type, but you can shim it as a message with (essentially) a byte[].

Normally for .NET work, I'd recommend protobuf-net (but as the author, I'm somewhat biased). However, if you intend to use other languages later, you might do better (long term) using Jon's dotnet-protobufs; that'll give you a familiar API across the platforms (whereas protobuf-net uses .NET idioms).

Numbat answered 27/4, 2009 at 19:49 Comment(2)

And Python - that's one of the languages Google provides directly. – Lan 27/4, 2009 at 19:55

I'm wondering if I should take on a dependency on the Protocol Buffers stuff (even if it is Apache License). Or should I just learn about binary file formats from what the Protocol Buffers stuff is doing? I think I'm already picking up one dependency on Json.NET, and that's MIT license. – Bleach 27/4, 2009 at 22:21

ASCII chars 0 or 1 each take up a full byte (just like any other character), so if you're storing it like that your "binary" file will be several times larger than it should be. A text file of zeros and ones is not exactly a binary file :)

You can use the BinaryWriter to write raw data directly to a file stream. The only part you need to figure out is translating your in-memory format (usually some kind of object graph) into a byte sequence that the BinaryWriter can consume.

However, if your primary interest is portability, I recommend against a binary format at all. ~~XML is precisely designed to solve the portability and interoperability problem. It's verbose and weighty as a file format, but that's the trade-off you make to get those problems solved for you.~~ If a human-readable format is off the table, Marc's answer is the way to go. No need to reinvent the portability wheel!

Caves answered 27/4, 2009 at 19:44 Comment(6)

There's no need to trade speed and size in order to get portability - see Marc's Protocol Buffers answer. You lose human readability (while in encoded form - you can dump a PB to text) and you need to specify the structure up-front, but you get size, speed and backward/forward compatibility for free. – Lan 27/4, 2009 at 19:51

You bring up a good point about the ASCII comment. How do most people delimit the beginning or ending of a string in a binary format? I know my GUID is going to have a standard length but my "packet data" will be string based. I've heard of the term "null terminated" string. What is that? My lack of a proper CS degree is showing. – Bleach 27/4, 2009 at 20:6

@Jon Skeet that is a good point. To me the question between protocol buffers and a human-readable format like XML is just the degree of portability, flexibility and openness one needs. My professional experience has tended to swing toward needing hugely open formats, so I'll always recommend something XML-ish first :) – Caves 27/4, 2009 at 20:6

@Tundall - with string/array data, your best bet is to prefix the data with the size. Then you can skip it if you don't need it. The alternative is to use some special marker (such as 0, which doesn't happen in regular text) as the end - but of course, you can't use this in binary data (like guids), because 0 is a perfectly valid and expected binary value. So length prefix becomes the best option. – Numbat 27/4, 2009 at 20:14
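
A short Python sketch of the length-prefix approach applied to the records in the question (a GUID followed by a text packet). The 32-bit little-endian prefix, the UTF-8 encoding and the function names are assumptions made for the example. Because the reader knows the packet size before touching the packet, it can walk the GUIDs without ever decoding the payloads, and a zero byte inside a payload causes no trouble.

```python
import io
import struct
import uuid

LENGTH = struct.Struct("<I")  # little-endian uint32 length prefix


def write_record(stream, guid, packet_text):
    """Write a 16-byte GUID followed by a length-prefixed UTF-8 packet."""
    payload = packet_text.encode("utf-8")
    # uuid.UUID.bytes uses big-endian field order; .NET's Guid.ToByteArray()
    # orders the first three fields differently, so a real spec must pick one.
    stream.write(guid.bytes)
    stream.write(LENGTH.pack(len(payload)))
    stream.write(payload)


def read_guid_skip_packet(stream):
    """Return the next record's GUID and jump over its packet."""
    guid = uuid.UUID(bytes=stream.read(16))
    (size,) = LENGTH.unpack(stream.read(LENGTH.size))
    stream.seek(size, io.SEEK_CUR)
    return guid


first = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
second = uuid.UUID("ffeeddcc-bbaa-9988-7766-554433221100")
buf = io.BytesIO()
write_record(buf, first, "text with an embedded \x00 byte")
write_record(buf, second, "{}")
buf.seek(0)
guids = [read_guid_skip_packet(buf), read_guid_skip_packet(buf)]
```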

For example (from the protocol buffers encoding document) - 12 07 74 65 73 74 69 6e 67 represents "field 2 as a string" (the 12) "7 bytes" (the 07), "testing" (the rest of the data in UTF8). I won't try to explain the "12", or what happens for long strings (which need more than 1 byte to specify the length) - but it is all well defined. – Numbat 27/4, 2009 at 20:17
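
For the curious, the nine bytes quoted above can be taken apart by hand in a few lines of Python. In the protocol buffers encoding, the low three bits of the tag byte carry the wire type and the remaining bits carry the field number. This sketch only handles the simple case shown here, where both the tag and the length fit in a single byte; longer values use a multi-byte varint that is not decoded below.

```python
data = bytes.fromhex("12 07 74 65 73 74 69 6e 67")

tag = data[0]
field_number = tag >> 3   # upper bits of the tag byte: the field number
wire_type = tag & 0x07    # lower 3 bits: the wire type (2 = length-delimited)
length = data[1]          # single-byte length, valid here because it is < 128
text = data[2:2 + length].decode("utf-8")
```

So `0x12` is binary `00010 010`: field 2, wire type 2.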

Think I will check out the protocol buffers encoding document tonight. Think I'm also going to take a look at the SQLite file format, if I can make heads or tails of it in a binary editor. – Bleach 27/4, 2009 at 22:38

It depends on what type of data you will be writing into the binary file and what the purpose of the binary file is. Are they class objects or just record data? If it is record data, I would recommend putting it in XML format. That way you can include schema validation to check that the file conforms to your standards. There are tools in both Java and .NET to import and export data from/to XML format.

Porfirioporgy answered 27/4, 2009 at 19:53 Comment(0)

Suppose your format is:

    struct Format
    {
        struct Header // 1
        {
            byte a;
            bool b1, b2, b3, b4, b5, b6, b7, b8;
            string name;
        }
        struct Container // 1...*
        {
            MyTypeEnum Type;
            byte[] data;
        }
    }

    enum MyTypeEnum
    {
        Sound,
        Video,
        Image
    }

Then I'd have a sequential file with:

    byte    // a
    byte    // b (b1..b8 packed into one byte, one bit each)
    int     // name size
    char[]  // name (which has the size specified above; remember a char is 16 bits in .NET)
    int     // MyTypeEnum type
    int     // data size
    byte[]  // data (which has the size specified above)

Then you can repeat the last three lines as many times as you want.
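
The same layout can be read and written from Python, which is a quick way to check that the format really is portable. This is a sketch under a few assumptions that the layout above leaves open: the eight flags are packed with `b1` in the lowest bit, the name is stored as UTF-16 little-endian with its size counted in 16-bit units, and all integers are 32-bit little-endian. The reader yields one container at a time, so the whole file never has to be in memory.

```python
import io
import struct
from enum import IntEnum


class MyTypeEnum(IntEnum):
    SOUND = 0
    VIDEO = 1
    IMAGE = 2


HEADER_FIXED = struct.Struct("<BBi")   # a, packed flags b1..b8, name size
CONTAINER_HEAD = struct.Struct("<ii")  # type, data size


def write_file(stream, a, flags, name, containers):
    """Write the header, then each (type, data) container in order."""
    packed = sum(1 << i for i, flag in enumerate(flags) if flag)  # b1 -> bit 0
    encoded = name.encode("utf-16-le")
    stream.write(HEADER_FIXED.pack(a, packed, len(encoded) // 2))
    stream.write(encoded)
    for kind, data in containers:
        stream.write(CONTAINER_HEAD.pack(kind, len(data)))
        stream.write(data)


def read_header(stream):
    """Read the header fields and unpack the eight flags."""
    a, packed, name_size = HEADER_FIXED.unpack(stream.read(HEADER_FIXED.size))
    flags = [bool(packed >> i & 1) for i in range(8)]
    name = stream.read(name_size * 2).decode("utf-16-le")
    return a, flags, name


def read_containers(stream):
    """Yield one container at a time until the stream ends."""
    while True:
        head = stream.read(CONTAINER_HEAD.size)
        if not head:
            return
        kind, size = CONTAINER_HEAD.unpack(head)
        yield MyTypeEnum(kind), stream.read(size)


buf = io.BytesIO()
flags = [True, False, True, False, False, False, False, True]
write_file(buf, 7, flags, "demo",
           [(MyTypeEnum.SOUND, b"\x01\x02"), (MyTypeEnum.IMAGE, b"")])
buf.seek(0)
header = read_header(buf)
containers = list(read_containers(buf))
```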

To read you use the BinaryReader which has support for reading bytes, integers and series of bytes. There is also a BinaryWriter.

Further, remember that Microsoft .NET (thus on a Windows/Intel machine) is little-endian. BinaryReader and BinaryWriter also read and write multi-byte values in little-endian order.

Brittnybritton answered 27/4, 2009 at 21:5 Comment(2)

See my other comment on this thread about file size. I think I understand the BinaryReader/Writer, but would this allow me to go through the file a little at a time? I don't need to deserialize this thing all at once right? – Bleach 27/4, 2009 at 22:16

A BinaryReader/BinaryWriter is just a helper for any .NET Stream. It is unbuffered, so you can just go to the BaseStream and seek to where you want the BinaryReader to read, or the BinaryWriter to write. A FileStream supports seeking forwards and backwards. So having an index somewhere in your header would probably help you to read only the index and then seek to the position you'd want to read. – Brittnybritton 27/4, 2009 at 22:37
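
The "index, then seek" idea is not specific to .NET. Below is a Python sketch of one possible layout: a record count, a table of absolute 64-bit offsets, then length-prefixed records. The layout and the function names are assumptions for illustration only. Reading record `n` touches just the count, one index entry and the record itself.

```python
import io
import struct

COUNT = struct.Struct("<I")   # number of records
OFFSET = struct.Struct("<Q")  # absolute file offset of one record
LENGTH = struct.Struct("<I")  # length prefix of one record


def write_indexed(stream, payloads):
    """Write count, offset table, then length-prefixed records (from offset 0)."""
    stream.write(COUNT.pack(len(payloads)))
    position = COUNT.size + OFFSET.size * len(payloads)  # where record 0 starts
    for payload in payloads:
        stream.write(OFFSET.pack(position))
        position += LENGTH.size + len(payload)
    for payload in payloads:
        stream.write(LENGTH.pack(len(payload)))
        stream.write(payload)


def read_record(stream, n):
    """Read only record n: one seek into the index, one seek to the record."""
    stream.seek(0)
    (count,) = COUNT.unpack(stream.read(COUNT.size))
    if not 0 <= n < count:
        raise IndexError(n)
    stream.seek(COUNT.size + OFFSET.size * n)
    (offset,) = OFFSET.unpack(stream.read(OFFSET.size))
    stream.seek(offset)
    (size,) = LENGTH.unpack(stream.read(LENGTH.size))
    return stream.read(size)


buf = io.BytesIO()
write_indexed(buf, [b"alpha", b"bravo-bravo", b"charlie"])
third = read_record(buf, 2)
```

This sketch needs every record size before it writes the index. For a file that grows over time, the index could instead be appended at the end and located through a fixed-size field in the header.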
