Are there any known parser combinator libraries in F# that can parse binary (not text) files?
I am familiar with some of the basics of fparsec but it seems to be geared towards text files or streams.

Are there any other F# libraries that can efficiently parse binary files? Or can fparsec be easily modified to work efficiently with binary streams?

Raviv asked 17/10, 2011 at 22:57 Comment(1)
I figured I'd mention this, 2 years later, but I have been working on a project that does just this. I have a sample mp4 container binary parser written using fparsec-style combinators. github.com/devshorts/ParsecClone – Uraemia
You may be interested in pickler combinators. These are a bit like parser combinators, but are more focused on simpler binary formats (picklers allow you to produce binary data and unpicklers parse it back). There is a quite readable article about the idea (PDF) by Andrew Kennedy (the author of units of measure).

I don't have much experience with these myself, but I just realized it may be relevant for you. The idea is used in the F# compiler for generating some binary resources (like quotations stored in resources), although I'm not sure the F# compiler implementation is any good (it is one of those things from the early days of the F# compiler).
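To make the idea concrete, here is a minimal sketch of the pickler style in F#. The type and combinator names are made up for illustration and this is not Kennedy's exact interface (his paper, and the F# compiler, use richer stream-based versions), but the shape is the same: every value type gets a pickle/unpickle pair, and combinators build bigger picklers out of smaller ones.

    // A pickler pairs a writer and a reader for the same type. This is a
    // toy, in-memory, byte-list version; a real implementation would work
    // over streams and report errors properly.
    type Pickler<'a> =
        { Pickle   : 'a -> byte list -> byte list    // prepend this value's bytes onto an accumulator
          Unpickle : byte list -> 'a * byte list }   // read a value, return it with the remaining bytes

    // Primitive pickler for a single byte.
    let pByte : Pickler<byte> =
        { Pickle   = fun b rest -> b :: rest
          Unpickle = function
                     | b :: rest -> b, rest
                     | []        -> failwith "unexpected end of input" }

    // Combine two picklers into a pickler for a pair.
    let pPair (pa: Pickler<'a>) (pb: Pickler<'b>) : Pickler<'a * 'b> =
        { Pickle   = fun (a, b) rest -> pa.Pickle a (pb.Pickle b rest)
          Unpickle = fun bytes ->
                         let a, rest1 = pa.Unpickle bytes
                         let b, rest2 = pb.Unpickle rest1
                         (a, b), rest2 }

    // Re-use an existing pickler for a different type via two conversions.
    let pWrap (toT: 'a -> 'b) (ofT: 'b -> 'a) (pa: Pickler<'a>) : Pickler<'b> =
        { Pickle   = fun v rest -> pa.Pickle (ofT v) rest
          Unpickle = fun bytes -> let a, rest = pa.Unpickle bytes in toT a, rest }

    // Round-trip a pair of bytes.
    let encoded  = (pPair pByte pByte).Pickle (0x45uy, 0x78uy) []
    let value, _ = (pPair pByte pByte).Unpickle encoded   // value = (0x45uy, 0x78uy)

The nice property is the symmetry: pPair, pWrap and friends keep the writer and the reader in lockstep, so a format described once can be both produced and parsed.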

Licentiate answered 17/10, 2011 at 23:57 Comment(6)
I think I remember the term 'pickler' from the Expert F# book... Thanks Tomas, I'll check them out too. That reminds me, I was going to check out the resumptions you mentioned in the F# compiler too, wasn't I? – Raviv
@Tomas Looking through the document in the link you submitted, you have answered the question you asked me about the utility of worker/wrapper transformations. See the discussion about implementing picklers in ML toward the end of your link. – Electioneer
I'm going for monadic pickles :-) – Raviv
@Ryan Hm, that sounds pretty interesting; it looks like worker/wrapper transformations could actually be quite useful. I wish I had more time to play with this :-). – Licentiate
The link to the PDF is dead. Is this what it's supposed to point to? – Amalamalbena
@Amalamalbena Yes! Thanks - I fixed the link in the answer. – Licentiate
The problem with working with binary streams is not a parser problem per se, it's a lexing problem. The lexer is what turns the raw data into elements that the parser can handle.

Almost any parsing system has few problems letting you supply your own lexer, and if that's the case you can, ideally, readily write a compatible lexer that works on your binary stream.

The problem, however, is that most parsing and lexing systems today are themselves created from a higher-level tool, and THAT tool most likely is not designed to work with binary streams. That is, it's not practical for you to specify the tokens and grammar of the binary stream that can be used to create the subsequent parser and lexer. There is also likely no support whatsoever for the higher-level concept of multi-byte binary numbers (shorts, longs, floats, etc.) that you are likely to encounter in a binary stream, nor is the generated parser likely to work well on them if you actually need their values, again because these systems are mostly designed for text-based tokens, where the underlying runtime handles the details of converting that text into something the machine can use (such as sequences of ASCII numerals into actual binary integers).
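To illustrate the multi-byte point: where a text-oriented lexer accumulates ASCII digits and lets the runtime convert them, a binary lexer has to assemble the value from raw bytes itself, minding width and endianness. A small F# sketch (readInt32LE is a made-up helper name for this example):

    open System

    // Read a 32-bit integer starting at the given offset.
    // (BitConverter.ToInt32 follows the machine's byte order, which is
    // little-endian on common platforms.)
    let readInt32LE (data: byte[]) (offset: int) : int =
        BitConverter.ToInt32(data, offset)

    // The same assembled by hand, which makes the byte order explicit.
    let readInt32ByHand (data: byte[]) (offset: int) : int =
        (int data.[offset])
        ||| (int data.[offset + 1] <<< 8)
        ||| (int data.[offset + 2] <<< 16)
        ||| (int data.[offset + 3] <<< 24)

Contrast that with a text grammar, where the token is just a run of digit characters and the conversion to a machine integer happens after lexing.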

All that said, you can probably still use the parsing section of the tool, since parsers work more on the abstract tokens that are fed to them by the lexer. Once you create your grammar at a symbolic level, you would need to redo the lexer to create the proper tokens from the binary stream to feed into the parser.

This is actually good, because the parser tends to be far more complicated than the basic lexer, so the toolkit would handle much of the "hard part" for you. But you would still need to deal with creating your own lexer and interfacing it properly to the generated parser. Not an insurmountable task, and if the grammar is of any real complexity, likely worth your effort in the long run.

If it's all mostly simple, then you're likely just better off doing it yourself by hand. Off the top of my head, it's hard to imagine a difficult binary grammar, since the major selling point of a binary format is that it's much closer to the machine, in contrast to the text that most parsers are designed to work with. But I don't know your use case.

But consider the case of a disassembler. That's a simple lexer that may be able to understand, at a high level, the different instruction types (such as instructions that take no operands, those that take a single byte as an operand, or a word), and feed that to a parser that can then be used to convert the instructions into their mnemonics and operands in the normal assembler syntax, as well as handle label references and such.

It's a contrived case, as a disassembler typically doesn't separate the lexing and parsing phases (it's usually not complicated enough to bother), but it's one way to look at the problem.
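As a rough F# illustration of that split (the opcodes and instruction set below are invented for the example, and there is no bounds checking on truncated input): the "lexer" classifies raw bytes into instruction tokens, and a trivial "parser" turns the tokens into mnemonics.

    // Tokens the "lexer" produces; not a real instruction set.
    type Instr =
        | Nop                    // no operand
        | LoadByte of byte       // one-byte operand
        | JumpWord of uint16     // two-byte (word) operand

    // Lexer: walk the raw bytes and classify them into instruction tokens.
    let rec lex (data: byte[]) (pos: int) : Instr list =
        if pos >= data.Length then []
        else
            match data.[pos] with
            | 0x00uy -> Nop :: lex data (pos + 1)
            | 0x01uy -> LoadByte data.[pos + 1] :: lex data (pos + 2)
            | 0x02uy ->
                let word = uint16 data.[pos + 1] ||| (uint16 data.[pos + 2] <<< 8)
                JumpWord word :: lex data (pos + 3)
            | b -> failwithf "unknown opcode 0x%02X at offset %d" b pos

    // "Parser": map the tokens to assembler-style text.
    let toMnemonic = function
        | Nop        -> "nop"
        | LoadByte b -> sprintf "ld  0x%02X" b
        | JumpWord w -> sprintf "jmp 0x%04X" w

    // lex [| 0x00uy; 0x01uy; 0x2Auy; 0x02uy; 0x34uy; 0x12uy |] 0 |> List.map toMnemonic
    // -> ["nop"; "ld  0x2A"; "jmp 0x1234"]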

Addenda:

If you have enough information to convert the binary stream into text to feed to the engine, then you have enough information to skip the text entirely: instead of creating text, you can create the actual tokens that the parser would want to see from the lexer.

That said, what you could do is take your text format, use that as the basis for your parsing tool and grammar, and have it create the lexer and parser machines for you, and then, by hand, you can test your parser and its processing using "text tests".

But when you get around to reading the binary, rather than creating text to then be lexed and parsed, simply create the tokens that the lexer would create (these should be simple objects) and feed them to the parser directly. This saves you the lexing step and some processing time.

Affected answered 17/10, 2011 at 23:14 Comment(1)
First of all, thanks for the detailed answer. I was going to work on the premise that I would be working with a subset of low-level grammar and composing it upwards into blocks that could be handled by something like fparsec, i.e. (pbyte 0x45) >>= fun x -> (pbyte 0x78), etc. Then use those blocks to produce maybe text, which could then be handled by fparsec. – Raviv
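For reference, pbyte-style combinators like the ones in this comment are easy to hand-roll over a byte array. The sketch below is not FParsec's actual API (FParsec is built around char streams); pbyte and >>= are defined from scratch here, just to show the shape the comment describes:

    // A binary parser takes the input and a position and, on success,
    // returns a value plus the next position.
    type BinParser<'a> = byte[] -> int -> ('a * int) option

    // Succeed, advancing by one, only if the byte at the current
    // position equals the expected byte.
    let pbyte (expected: byte) : BinParser<byte> =
        fun data pos ->
            if pos < data.Length && data.[pos] = expected
            then Some (expected, pos + 1)
            else None

    // Monadic bind: run p, then feed its result to f.
    let (>>=) (p: BinParser<'a>) (f: 'a -> BinParser<'b>) : BinParser<'b> =
        fun data pos ->
            match p data pos with
            | Some (v, next) -> f v data next
            | None           -> None

    // The two-byte sequence from the comment: 0x45 followed by 0x78.
    let header = pbyte 0x45uy >>= fun _ -> pbyte 0x78uy

    // header [| 0x45uy; 0x78uy; 0xFFuy |] 0   // Some (0x78uy, 2)

From there, the usual many/choice/map combinators can be layered on in the same style, or a library such as ParsecClone (linked in the question's comments) can be used instead.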
