YAML parsing - lex or hand-rolled?
Asked Answered
R

1

8

I am trying to write a simple YAML parser, I read the spec from yaml.org, before I start, I was wondering if it is better to write a hand-rolled parser, or use lex (flex/bison). I looked at the libyaml (C library) - doesn't seem to use lex/yacc. YAML (excluding the flow styles), seems to be more line-oriented, so, is it easier to write a hand-rolled parser, or use flex/bison Thanks.

Raynell answered 19/9, 2011 at 19:18 Comment(4)
Have you considered just using a standard, off-the-shelf YAML parser? Or are you specifically interested in building your own? Also, note that lex and flex are scanner generators, not parser generators; to do parsing, you'd want to use yacc or bison.Calaboose
@Calaboose I am interested in building my own.Raynell
@Calaboose I probably didn't make my question clear. I understand lex is just a tokenizer. I wanted to know if structure of YAML better suits flex/bison or hand rolled parserRaynell
I should warn that flex/bison will not work to parse YAML. YAML1.2 has over 200 syntax rules, most of which require indentation matching that is near impossible to correctly implement with a flex tokenizer (specifically, multiple de-dents cannot be matched, even with sophisticated flex hacks). You need a tokenizer that respects indentation rules. If I may recommend one: RE/flex includes a YAML1.2 tokenizer and parser example.Kathernkatheryn
T
7

This answer is basically an answer to the question: "Should I roll my own parser or use parser generator?" and has not much to do with YAML. But nevertheless it will "answer" your question.

The question you need to ask is not "does this work with this given language/grammar", but "do I feel confident to implement this". The truth of the matter is that most formats you want to parse will just work with a generated parser. The other truth is that it is feasible to parse even complex languages with a simple hand written recursive descent parser.

I have written among others, a recursive descent parser for EDDL (C and structured elements) and a bison/flex parser for INI. I picked these examples, because they go against intuition and exterior requirements dictated the decision.

Since I established on a technical level it is possible, why would you pick one over the other? This is really hard question to answer, here are some thoughts on the subject:

  • Writing a good lexer is really hard. In most cases it makes sense to use flex to generate the lexer. There is little use of hand-rolling your own lexer, unless you have really exotic input formats.
  • Using bison or similar generators make the grammar used for parsing explicitly visible. The primary gain here is that the developer maintaining your parser in five years will immediately see the grammar used and can compare it with any specs.
  • Using a recursive descent parser makes is quite clear what happens in the parser. This provides the easy means to gracefully handle harry conflicts. You can write a simple if, instead of rearranging the entire grammar to be LALR1.
  • While developing the parser you can "gloss over details" with a hand written parser, using bison this is almost impossible. In bison the grammar must work or the generator will not do anything.
  • Bison is awesome at pointing out formal flaws in the grammar. Unfortunately you are left alone to fix them. When hand-rolling a parser you will only find the flaws when the parser reads nonsense.

This is not a definite answer for one or the other, but it points you in the right direction. Since it appears that you are writing the parser for fun, I think you should have written both types of parser.

Thurgau answered 17/6, 2014 at 13:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.