I am working on a parser for an indentation-based language. So far it has involved a lot of hand-written code. I want to try the INDENT/DEDENT approach that Python is said to use, but it's kind of hard, TBH.
I divided it into 3 phases:
- Tokenizing
- Directing
- Treeifying
The tokenization phase is about 100 lines, plus the definitions of the regexps. It spits out a sequence of helpful tokens.
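To make the tokenizing idea concrete, here is a minimal sketch of a tokenizer that emits Python-style INDENT/DEDENT tokens from an indentation stack. It is not the actual tokenizer from this project: the token names (`WORD`, `NEWLINE`, `INDENT`, `DEDENT`), the `tokenize` function, and the single `WORD` regexp are all assumptions made for illustration, and it only counts leading spaces (no tabs).

```python
import re
from typing import Iterator, Tuple

# Hypothetical token shape for this sketch: (kind, text).
Token = Tuple[str, str]

# Assumption: every run of non-whitespace characters is one WORD token.
WORD = re.compile(r"\S+")


def tokenize(source: str) -> Iterator[Token]:
    """Yield WORD/NEWLINE tokens plus INDENT/DEDENT when the indentation changes."""
    indents = [0]  # stack of indentation widths that are currently open
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines do not affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > indents[-1]:
            # Deeper than the current level: open one new level.
            indents.append(width)
            yield ("INDENT", "")
        while width < indents[-1]:
            # Shallower: close levels until we are back at a known width.
            indents.pop()
            yield ("DEDENT", "")
        if width != indents[-1]:
            raise ValueError(f"inconsistent dedent in line: {line!r}")
        for match in WORD.finditer(line):
            yield ("WORD", match.group())
        yield ("NEWLINE", "")
    # Close any levels still open at the end of the input.
    while len(indents) > 1:
        indents.pop()
        yield ("DEDENT", "")
```

The stack is the important part: one INDENT is emitted per new level, but a single line can trigger several DEDENTs if it jumps back more than one level.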
The "directing" phase (or "folding" phase) takes the tokens and emits what are basically push and pop instructions, which describe how to build a tree data structure. It essentially folds the flat list into a tree.
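One way the directing step could look is sketched below. It assumes tokens are `(kind, text)` pairs with the hypothetical kinds `WORD`, `NEWLINE`, `INDENT`, and `DEDENT`, that INDENT/DEDENT arrive before the words of the line they apply to, and that each line becomes one node. The `direct` function and the `("push", label)` / `("pop", "")` instruction shape are made up for this example, not taken from the real code.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]        # (kind, text), hypothetical shape
Instruction = Tuple[str, str]  # ("push", label) or ("pop", "")


def direct(tokens: Iterable[Token]) -> List[Instruction]:
    """Turn a flat token stream into push/pop instructions, one node per line."""
    out: List[Instruction] = []
    words: List[str] = []
    # True when a node at the current level is still open and waiting to be
    # closed by the next sibling, a DEDENT, or the end of the input.
    pending = False
    for kind, text in tokens:
        if kind == "WORD":
            words.append(text)
        elif kind == "NEWLINE":
            if pending:
                out.append(("pop", ""))  # close the previous sibling
            out.append(("push", " ".join(words)))
            words = []
            pending = True
        elif kind == "INDENT":
            # The open node becomes the parent; the new level has no sibling yet.
            pending = False
        elif kind == "DEDENT":
            if pending:
                out.append(("pop", ""))  # close the last child of the block
            pending = True  # the parent is now the open node at this level
        else:
            raise ValueError(f"unknown token kind: {kind!r}")
    if pending:
        out.append(("pop", ""))  # close the last open node
    return out


sample = [
    ("WORD", "a"), ("NEWLINE", ""),
    ("INDENT", ""),
    ("WORD", "b"), ("NEWLINE", ""),
    ("WORD", "c"), ("NEWLINE", ""),
    ("DEDENT", ""),
    ("WORD", "d"), ("NEWLINE", ""),
]
instructions = direct(sample)
```

The trick in this sketch is deferring the pop: a line's node stays open until the next token shows whether it gets children (INDENT) or is finished (a sibling line, a DEDENT, or end of input).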
The final "treeifying" phase takes the tree instructions and actually builds the tree. It turns out to be pretty hard: it takes real mental gymnastics to think about how the list of tokens becomes a tree. I spent all weekend trying to get it to work, but I still have a ways to go before the output tree lines up properly with the push and pop instructions.
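For reference, building a tree from push/pop instructions can be done with an explicit stack of open nodes. The sketch below assumes instructions shaped like `("push", label)` and `("pop", "")`; the `Node` class and `build_tree` function are hypothetical names for illustration, not the project's actual code.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

Instruction = Tuple[str, str]  # ("push", label) or ("pop", ""), hypothetical shape


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def build_tree(instructions: Iterable[Instruction]) -> Node:
    """Build a tree from push/pop instructions using a stack of open nodes."""
    root = Node("root")  # synthetic root that holds the top-level nodes
    stack = [root]
    for op, label in instructions:
        if op == "push":
            node = Node(label)
            stack[-1].children.append(node)  # attach to the innermost open node
            stack.append(node)               # the new node is now innermost
        elif op == "pop":
            if len(stack) == 1:
                raise ValueError("pop without a matching push")
            stack.pop()
        else:
            raise ValueError(f"unknown instruction: {op!r}")
    if len(stack) != 1:
        raise ValueError("push without a matching pop")
    return root


tree = build_tree([
    ("push", "a"), ("push", "b"), ("pop", ""), ("push", "c"), ("pop", ""), ("pop", ""),
    ("push", "d"), ("pop", ""),
])
```

Checking that the stack is back down to just the root at the end is a cheap way to catch misaligned push and pop instructions coming out of the earlier phase.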
It should serve as a real-world example of how to build an indentation-based parser, though the code isn't that great.