Writing a Z80 assembler - lexing ASM and building a parse tree using composition?

Asked 20/8, 2009 at 9:50 Answered 26/10, 2009 at 0:49

Solved parsing assembly lexical-analysis z80

I'm very new to the concept of writing an assembler and even after reading a great deal of material, I'm still having difficulties wrapping my head around a couple of concepts.

What is the process to actually break up a source file into tokens? I believe this process is called lexing, and I've searched high and low for a real code examples that make sense, but I can't find a thing so simple code examples very welcome ;)
When parsing, does information ever need to be passed up or down the tree? The reason I ask is as follows, take:

LD BC, nn

It needs to be turned into the following parse tree once tokenized(???)

  ___ LD ___
  |        |
 BC        nn

Now, when this tree is traversed it needs to produce the following machine code:

01 n n

If the instruction had been:

LD DE,nn

Then the output would need to be:

11 n n

Meaning that it raises the question, does the LD node return something different based on the operand or is it the operand that returns something? And how is this achieved? More simple code examples would be excellent if time permits.

I'm most interested in learning some of the raw processes here rather than looking at advanced existing tools so please bear that in mind before sending me to Yacc or Flex.

Kartis answered 20/8, 2009 at 9:50 Comment(0)

Well, the structure of the tree you really want for an instruction that operates on a register and an memory addressing mode involing an offset displacement and an index register would look like this:

    INSTRUCTION-----+
    |      |        |
  OPCODE  REG     OPERAND
                  |     |
                OFFSET  INDEXREG

And yes, you want want to pass values up and down the tree. A method for formally specifying such value passing is called "attribute grammars", and you decorate the grammar for your langauge (in your case, your assembler syntax) with the value-passing and the computations over those values. For more background, see Wikipedia on attribute grammars.

In a related question you asked, I discussed a tool, DMS, which handles expression grammars and building trees. As language manipulation tool, DMS faces exactly these same up-and-down the tree information flows issues. It shouldn't surprise you, that as a high-end language manipulation tool, it can handle attribute grammar computations directly.

Moonstone answered 23/8, 2009 at 20:27 Comment(12)

This was intended to be illustrative of the trees he needs to build, rather than matching a specific Z80 instrution. I coded a LOT of Z80 assembler back in the 1980s; I may know more about it than you think. – Moonstone 23/8, 2009 at 20:36

Then demonstrate your knowledge. Have you written a Z80 assembler? I Haven't, but I have written 8080 and 6809 assemblers, and your advice seems to me to be dead wrong. – Mopboard 23/8, 2009 at 20:48

Your experience is probably different than mine. I've coded assemblers for the Collins 8311 (circa 1968), a 12 and 16 bit machines that didn't make it commercially so you havent heard of them, microcode assemblers, and the 6809 assembler you saw in the other message. I've also built a variety of compiler front ends(check out my bio information and website). My early assemblers were built by ad hoc means and worked. Stuff since the 80s I've built using lexers and parsers because it is far easier to specify, to maintain, to extend, you name it. – Moonstone 23/8, 2009 at 21:7

My experience may be qualitatively different than yours along another axis. I've always wanted tools to help me specify/analyze/manipulate languages. I've spent the last 25 years figuring how to go far beyond lex and yacc, and now have tools that let me (and others if they wish) do so. I use them daily, and they are effective. Without experiencing such tools, you might not be convinced. OK, let many flowers bloom. – Moonstone 23/8, 2009 at 21:21

Hmm. I'm totally in favour of tools, but some tools are fairly crappy. Along with Bjarne Stroustrup, I'm a big fan of recursive descent when it comes to parsing as opposed to table driven tools like yacc/bison. And when it comes to parsing assembler mnemonics, I'm an even bigger fan of ad hoc parsing. Assembler mnemonics are really not that regular. But you obviously do know what you are talking about, so I've deleted the initial comment (which was a bit rude) and removed the downvote. – Mopboard 23/8, 2009 at 21:50

When recursive descent works, it works great. But it only works well on modest sized grammars that don't change fast (when you build DSLs sometimes the grammar does change fast) and which don't require any lookahead or any type information to resolve the parse (your beloved C++ has a bad lookahead problem here). When you aren't in that window, more powerful (read "full context free") parsing engines are better, and explicit specifications of the full grammar make it a lot easier to revise and maintain. And once you have such machinery, its easy to kill gnats, too, and you should. – Moonstone 23/8, 2009 at 22:34

Well, the C++ grammar problems are of course a difficulty for both recursive descent and table driven parsers. But Stroustrup is on record (the D&E book, chapter 2) as saying that he should have used a recursive descent parser rather than a yacc based one to implement cfront (the first c++ compiler), which ended up with both table driven and RD code. And was horrible - I speak as a user. – Mopboard 23/8, 2009 at 22:51

DMS has a full C++ parser (even handles GNU and MS dialects) and uses a table driven scheme (called GLR, a generalization of LR) based off exactly the same grammar style as you see in the related question to this one. That C++ parser has been used to carry out massive automated transformations on large embedded software systems. The parser is as beautiful as the C++ reference manual will allow it to be :-} I speak as a front end builder. – Moonstone 23/8, 2009 at 23:29

See this answer about parsing C++ with good tools: #243883 – Moonstone 30/1, 2015 at 20:23

Hi Ira, I'm familiar with your work and it's extremely impressive. Without any hesitation I'd consider you an authority on the matter. I'm somewhat confused by your answer here though, I have written a few small assemblers ( not Z80 mind you ) and I'm not sure why you would suggest using a tree structure in parsing a regular grammar like an assembler. What you've demonstrated above seems better represented by a series of nested specific structures rather than a tree. – Internationale 1/2, 2019 at 21:48

Nested structures and trees are simply isomorphic representations. So, its take your choice. The good thing about the tree is that you can get it directly from the parser. If you want to convert it to nested structures you can do that, and it might make navigation of the nested structures seem simpler. But, to do the conversion, you have to walk the tree anyway, so you already have navigation. If you access the nested structures heavily, they will win out in performance... but surely when building an assembler you are only going to consult each "field" of the instruction once. – Moonstone 1/2, 2019 at 22:3

Thanks for your answer, that clarifies matters somewhat. I think my misunderstanding is stemming from the fact that the assemblers I have written implemented ad-hoc parsing and I have been attempting to apply your explanation to the mental model for parsing the assembly in that manner. It just reinforces the fact that I have much to learn in this area. – Internationale 3/2, 2019 at 0:36

It is not necessary to build a parse tree. Z80 op codes are very simple. They consist of the op code and 0, 1 or 2 operands, separated by commas. You just need to split the opcode up into the (maximum of 3) components with a very simple parser - no tree is needed.

Mopboard answered 20/8, 2009 at 9:54 Comment(6)

What about relative memory addresses> surely some kind of symbolic tree is needed for this? perhaps even a table – Gametocyte 20/8, 2009 at 10:21

The questioner was asking about parsing the OP codes. You certainly need a symbol table to assist in creating the assembler's actual machine code output, but that is a completely different issue. – Mopboard 20/8, 2009 at 10:23

@Darknight: Would there still not be a valid point to creating a parse tree. For example, other occurrences could play a part such as here are some ASM code strings that have variations in the syntax: Labels, Comments, Brackets, Global Variables, .DB, .DS, .DW Would these not also become more simplified if a parse tree was used? @Neil: Could you elaborate? Regards, GP – Kartis 20/8, 2009 at 12:59

@Gary You could do that if you want to complicate your life. I've written a couple of assemblers (though not one for the Z80) and they are normally easily implemented via a simple tokenising scheme (as I suggested in my answer) and an expression evaluator to evaluate address and other expressions. The expression evlauator may use a tree (though a stack based one is enough in my experience) but there is no need to make the parse tree your basic data structure. – Mopboard 20/8, 2009 at 13:14

@Neil Thanks for your feedback... You don't happen to have a code example do you? Would probably clarify alot! :) – Kartis 20/8, 2009 at 13:35

@Gary Sorry, the last assembler I wrote was in REXX on IBM's VM/CMS mainframe OS. Even if I had the code, I don't think it would be useful to you. – Mopboard 20/8, 2009 at 15:19

Actually, the opcodes do have not a byte base, but an octal base. The best description I know is DECODING Z80 OPCODES.

Deductible answered 26/10, 2009 at 0:49 Comment(1)

Ya, right on. Octal lets you translate directly, and a recursive call can deal with the offset registers. One does not need to build a syntax tree at all. Each instruction and be directly mapped. – Kiele 29/7, 2011 at 2:7

Recommended topics

Hot tags