lexers vs parsers

Are lexers and parsers really that different in theory?

It seems fashionable to hate regular expressions: coding horror, another blog post.

However, popular lexing-based tools (pygments, geshi, prettify) all use regular expressions. They seem to lex anything...

When is lexing enough, when do you need EBNF?

Has anyone used the tokens produced by these lexers with bison or antlr parser generators?

Brach answered 16/5, 2010 at 6:47 Comment(7)
yes. I am trying to parse autohotkey. I was able to build a syntax highlighter using pygments really fast. But antlr is taking much longer... I haven't seen a lot of cross pollination between the two tools.Brach
It's only fashionable to hate regular expressions when they are misused. Many people try to use regular expressions when context-free parsing is needed. They always fail. And they blame regular expression technology. That's much like complaining that your hammer is a crummy saw. True, but you won't get a lot of sympathy.Zebulun
I am starting to pick up some speed with antlr, thankfully. A lot of lexing is context-free and sometimes even context dependent also by the way.Brach
One fundamental aspect of the lexer vs parser issue is that lexers are based on finite automata (FSA), or more precisely finite transducers (FST). Most parsing formalisms (not just context-free) are closed under intersection with FSA or application of FST. Hence using the simpler regular expression based formalism for the lexer does not increase the complexity of syntactic structures of the more complex parser formalisms. This is an absolutely major modularity issue when defining structure and semantics of languages, happily ignored by the high voted answers.Horton
It should be noted that lexers and parsers do not have to be different, e.g. LLLPG and earlier versions of ANTLR use the same LL(k) parsing system for both lexers and parsers. The main difference is that regexes are usually sufficient for lexers but not parsers.Angelika
Possible duplicate of Looking for a clear definition of what a "tokenizer", 'parser" and "lexers" are and how they are related to each other and used?Erechtheus
Useful: A Guide to Parsing: Algorithms and TerminologyTestosterone
C
561

What parsers and lexers have in common:

  1. They read symbols of some alphabet from their input.

    • Hint: The alphabet doesn't necessarily have to consist of letters. But it has to consist of symbols which are atomic for the language understood by the parser/lexer.
    • Symbols for the lexer: ASCII characters.
    • Symbols for the parser: the particular tokens, which are terminal symbols of their grammar.
  2. They analyse these symbols and try to match them with the grammar of the language they understand.

    • Here's where the real difference usually lies. See below for more.
    • Grammar understood by lexers: regular grammar (Chomsky's level 3).
    • Grammar understood by parsers: context-free grammar (Chomsky's level 2).
  3. They attach semantics (meaning) to the language pieces they find.

    • Lexers attach meaning by classifying lexemes (strings of symbols from the input) as the particular tokens. E.g. All these lexemes: *, ==, <=, ^ will be classified as "operator" token by the C/C++ lexer.
    • Parsers attach meaning by classifying strings of tokens from the input (sentences) as the particular nonterminals and building the parse tree. E.g. all these token strings: [number][operator][number], [id][operator][id], [id][operator][number][operator][number] will be classified as "expression" nonterminal by the C/C++ parser.
  4. They can attach some additional meaning (data) to the recognized elements.

    • When a lexer recognizes a character sequence constituting a proper number, it can convert it to its binary value and store with the "number" token.
    • Similarly, when a parser recognizes an expression, it can compute its value and store it with the "expression" node of the syntax tree.
  5. They both produce on their output proper sentences of the language they recognize.

    • Lexers produce tokens, which are sentences of the regular language they recognize. Each token can have an inner syntax (though level 3, not level 2), but that doesn't matter for the output data nor for the component that reads them.
    • Parsers produce syntax trees, which are representations of sentences of the context-free language they recognize. Usually it's only one big tree for the whole document/source file, because the whole document/source file is a proper sentence for them. But there isn't any reason why a parser couldn't produce a series of syntax trees on its output. E.g. it could be a parser which recognizes SGML tags stuck into plain text. So it'll tokenize the SGML document into a series of tokens: [TXT][TAG][TAG][TXT][TAG][TXT]....

As you can see, parsers and tokenizers have much in common. One parser can be a tokenizer for another parser, which reads its input tokens as symbols from its own alphabet (tokens are simply symbols of some alphabet) in the same way as sentences from one language can be alphabetic symbols of some other, higher-level language. For example, if . and - are the symbols of the alphabet M (as "Morse code symbols"), then you can build a parser which recognizes strings of these dots and dashes as letters encoded in Morse code. The sentences in the language "Morse Code" could be tokens for some other parser, for which these tokens are atomic symbols of its language (e.g. the "English Words" language). And these "English Words" could be tokens (symbols of the alphabet) for some higher-level parser which understands the "English Sentences" language. And all these languages differ only in the complexity of the grammar. Nothing more.
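
To make the numbered points above concrete, here is a minimal, hypothetical Python sketch (the token names and the toy grammar are invented for illustration, not taken from any particular tool): a regex-based lexer classifies lexemes as tokens and attaches values (points 3 and 4), and a tiny recursive-descent parser reads those tokens as the symbols of its own alphabet and groups them into "expression" nodes (point 5).

import re

# Lexer: classifies lexemes (strings of characters) as tokens and may attach
# extra data, e.g. the numeric value of a NUMBER lexeme (points 3 and 4).
TOKEN_SPEC = [
    ("NUMBER",   r"\d+"),
    ("ID",       r"[A-Za-z_]\w*"),
    ("OPERATOR", r"==|<=|[*+^<>-]"),
    ("SKIP",     r"\s+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {lexeme!r}")
        yield (kind, int(lexeme) if kind == "NUMBER" else lexeme)

# Parser: reads the tokens as the symbols of its own alphabet and matches them
# against a context-free rule  expression -> operand (OPERATOR operand)*
def parse_expression(token_stream):
    tokens = list(token_stream)
    pos = 0

    def operand():
        nonlocal pos
        kind, value = tokens[pos]
        if kind not in ("NUMBER", "ID"):
            raise SyntaxError(f"expected operand, got {kind}")
        pos += 1
        return (kind, value)

    tree = operand()
    while pos < len(tokens) and tokens[pos][0] == "OPERATOR":
        op = tokens[pos][1]
        pos += 1
        tree = ("expression", op, tree, operand())
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree

print(list(lex("x == 42")))          # [('ID', 'x'), ('OPERATOR', '=='), ('NUMBER', 42)]
print(parse_expression(lex("1 + 2 <= y")))

Operator precedence is deliberately ignored here; the point is only the division of labour: characters in, tokens out; tokens in, tree out.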

So what is it all about, these "Chomsky grammar levels"? Well, Noam Chomsky classified grammars into four levels depending on their complexity:

  • Level 3: Regular grammars

    They use regular expressions, that is, they can consist only of the symbols of the alphabet (a, b), their concatenations (ab, aba, bbb etc.), or alternatives (e.g. a|b).
    They can be implemented as finite state automata (FSA), like NFA (Nondeterministic Finite Automaton) or, better, DFA (Deterministic Finite Automaton).
    Regular grammars can't handle nested syntax, e.g. properly nested/matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks etc. That's because a finite automaton would need infinitely many states to handle infinitely many nesting levels (see the sketch after this list).
  • Level 2: Context-free grammars

    They can have nested, recursive, self-similar branches in their syntax trees, so they can handle nested structures well.
    They can be implemented as a state automaton with a stack (a pushdown automaton). This stack is used to represent the nesting level of the syntax. In practice, they're usually implemented as a top-down, recursive-descent parser which uses the machine's procedure call stack to track the nesting level, and uses recursively called procedures/functions for every non-terminal symbol in the syntax.
    But they can't handle context-sensitive syntax. E.g. when you have an expression x+3, in one context this x could be the name of a variable, and in another context it could be the name of a function, etc.
  • Level 1: Context-sensitive grammars

  • Level 0: Unrestricted grammars
    Also called recursively enumerable grammars.
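
As a small aside of mine (not part of the original answer), here is the counting limitation from the Level 3 item in code: a finite automaton has only finitely many states, so it cannot check arbitrarily deep nesting, but a single unbounded counter, standing in for the parser's stack, can.

def balanced(text):
    """Recognize properly nested parentheses such as (()()(()())).

    A finite automaton (and hence a plain regular expression) cannot do this
    for arbitrary depth, because it would need a distinct state for every
    possible nesting level. One unbounded counter, standing in for the
    parser's stack, is enough.
    """
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1          # "push"
        elif ch == ")":
            depth -= 1          # "pop"
            if depth < 0:       # a ')' with no matching '('
                return False
        else:
            return False        # only parentheses belong to this toy language
    return depth == 0           # everything opened was also closed

assert balanced("(()()(()()))")
assert not balanced("(()")
assert not balanced("())(")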

Chromite answered 1/9, 2010 at 3:53 Comment(13)
Oh yeah? So what are those "words or tokens"? They're just sentences in the regular language, consisting of letters of the alphabet. And what are those "constructs" or "trees" in the parser? They're also sentences, but in a different, higher-level language, for which the particular tokens are alphabetical symbols. The difference is not what you've said, but in the COMPLEXITY OF THE LANGUAGE USED. Confront your -1 with any handbook about the parsing theory.Chromite
@Chromite Would it be fair to say that both Lexers and Parsers take some grammar and a series of tokens as input ?Axle
Quite so. They both take a series of symbols from the alphabet they recognize. For the lexer, this alphabet consists just of plain characters. For the parser, the alphabet consists of terminal symbols, however they are defined. They could be characters, too, if you don't use a lexer and use one-character identifiers and one-digit numbers etc (quite useful at first stages of development). But they're usually tokens (lexical classes) because tokens are a good abstraction: you can change the actual lexemes (strings) they stand for, and the parser doesn't see the change.Chromite
For example, you can use a terminal symbol STMT_END in your syntax (for the parser) to denote the end of instructions. Now you can have a token with the same name associated with it, generated by the lexer. But you can change the actual lexeme it stands for. Eg. you can define STMT_END as ; to have C/C++-like source code. Or you can define it as end to have it somehow similar to Pascal-style. Or you can define it as just '\n' to end the instruction with the end of line, like in Python. But the syntax of instruction (and the parser) stays unchanged :-) Only lexer needs to be changed.Chromite
Hours on wikipedia and google didn't help, but you explained Chomsky's grammars in 3 minutes. Thank you.Icecap
You're welcome ;-) I'd add that these levels of grammars differ by the number of separate things they can COUNT at once. Regular grammars cannot count, because the only thing they count is their current state (ie. their context in the time passed). But they cannot, for example, count how many open parentheses were there in the input, to match the same number of closing parentheses in the input. But context-free grammars can do that, because they can count one more thing (using the stack). They cannot count more than one thing though (eg. matching () and {} at the same time). And so on.Chromite
@SasQ: Could you define the word 'context' more? It looks as if the parser matches input symbols based on the 'context' of a given symbol ~'within a rule' (~within a sentence). But at the same time, a context-free parser (as the name suggests) skips 'the environment' of symbols that surround the symbols that are part of the [matched] rule (~the parser matches against single sentences, but skips analyzing sequences of sentences). Is that the case or did I get it wrong? (looks like I did, because I could just build a bigger context-free parser that would parse those sequences of sentences as well)Prendergast
A related question would be: in a context-free syntax, can one atomic symbol, i.e. x, be interpreted as different tokens, depending on which of two rules matches x, provided that I defined two separate rules that both match x, but in different 'contexts' (contexts within one definition of a rule)? If not, would that (the possibility of one input symbol meaning two or more different things) be a difference between context-free and context-sensitive grammars?Prendergast
Saying that lexers use regular language grammars and parsers use context-free grammars is a false distinction with lots of exceptions. If a parser can be a lexer for another parser, as you state, then a lexer can use a context-free grammar, and there are lots of lexer examples which use context-free or even context-sensitive logic. Saying that this is the "real difference" between lexers and parsers is misleading at best.Envious
Well, if you think it's wrong, you can always provide your own answer. But if you're saying this answer is invalid, then I guess you have to tell that to all those scholars who work on parsing theory, because it's THEM who define these concepts this way. Lexers are, as the name says, for LEXICAL ANALYSIS, so they work on "word level" (regular grammars), not on "sentence level" (context-free grammars and up). The latter is a job for the parser that, as the name suggests, PARSES the grammar of the sentence, treating words as "indivisible" (terminal) symbols (that is, its "alphabet").Chromite
Your explanation of regular grammars might be misleading. They can be "implemented" as finite state automata, yes, but that "implementation" only tells you whether a stream of characters forms a valid stream of tokens – it doesn't tell you what those tokens are. In other words, even if a lexical grammar is regular, it can't be lexed with just a FSA, nor with just regular expressions. With that in mind, it's unclear why that classification of grammatical complexity would be relevant to the division between lexer and parser.Endogamy
The EQUIVALENCE between regular grammars, regular expressions and finite automata is not something I made up – people smarter than me figured it out, and you can find their rigorous proofs in the literature, no need to take my word on that. But again, if you believe that something that I wrote is incorrect or misleading, you can always provide your own answer where you explain it better, and even get some points for that. But judging by the fact that my answer was accepted and so many people find it useful (judging by the number of upvotes), I don't think that I should change anything in it.Chromite
•Level 1: Context-sensitive grammars. XML is an excellent example.Buell

Yes, they are very different in theory, and in implementation.

Lexers are used to recognize "words" that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers.

Parsers are used to recognize the "structure" of a language's phrases. Such structure is generally far beyond what "regular expressions" can recognize, so one needs "context sensitive" parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use "context-free" grammars and add hacks to the parsers ("symbol tables", etc.) to handle the context-sensitive part.
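
As a much-simplified, hypothetical illustration of that compromise (all names invented), consider the classic C ambiguity where T * x; is a declaration if T names a type and an expression otherwise: the context-free rule alone cannot decide, so the parser consults a symbol table filled in as declarations are seen.

# Much-simplified sketch of the "symbol table hack": the context-free rule
# alone cannot tell whether  T * x ;  declares a pointer or multiplies two
# values in C; that depends on whether T was declared as a type earlier.
symbol_table = set()                # names known so far to be type names

def note_typedef(name):
    symbol_table.add(name)          # filled in while parsing "typedef ... name;"

def classify_statement(tokens):
    # tokens come in as e.g. ["T", "*", "x", ";"] -- an invented token shape
    name, star, var, semi = tokens
    if star == "*" and semi == ";":
        if name in symbol_table:
            return "declaration: " + var + " is a pointer to " + name
        return "expression: " + name + " multiplied by " + var
    return "something else"

note_typedef("size_t")
print(classify_statement(["size_t", "*", "x", ";"]))   # declaration
print(classify_statement(["a", "*", "x", ";"]))        # expression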

Neither lexing nor parsing technology is likely to go away soon.

They may be unified by deciding to use "parsing" technology to recognize "words", as is currently explored by so-called scannerless GLR parsers. That has a runtime cost, as you are applying more general machinery to what is often a problem that doesn't need it, and usually you pay for that in overhead. Where you have lots of free cycles, that overhead may not matter. If you process a lot of text, then the overhead does matter and classical regular expression parsers will continue to be used.

Zebulun answered 17/5, 2010 at 20:52 Comment(10)
Nice explanation, Ira. Adding to your analogy: While lexers are about getting the words right, parsers are about getting the sentences right. "See spot run" and "spot run See" are both valid as far as a lexer is concerned. It takes a parser to determine that phrase structure is wrong (in English grammar).Waaf
i guess a parser is to a lexer as a tree walker is to a parser. I am not convinced that the theory is that different: antlr.org/wiki/display/~admin/ANTLR+v4+lexers but I am starting to understand the differences in convention between them...Brach
The theory is very different. Most parser technologies are trying to handle context-free languages to some degree (some do only part, e.g., LALR, some do it all, e.g., GLR). Most lexer technologies only try to do regular expressions.Zebulun
The theory is different, because it has been proposed by many different people and use different terminology and algorithms. But if you look them closely, you can spot the similarities. For example, the problem of left recursion is very similar to the problem of non-determinism in NFAs, and removing left recursion is similar to removing non-determinism and converting NFA into DFA. Tokens are sentences for the tokenizer (output), but alphabetical symbols for the parser (input). I don't deny the differences (Chomsky levels), but similarities help a lot in design.Chromite
My officemate was into category theory. He showed how the categorical theory notion of sheaves covered all kinds of pattern matching, and was able to derive LR parsing from an abstract categorical specification. So in fact, if you go abstract enough, you can find such commonalities. The point of category theory is you can often abstract "all the way up"; I'm sure you could build a category theory parser that erased the differences. But any practical uses of it have to instantiate down to the specific problem domain, and then the differences show up as real.Zebulun
How does the state machine of a regex differ from that of a parser? A grammar for a language can be expressed by "raw" characters as well, where a character represents a word. E.g. "A sentence is valid if it starts with word A, has a 0 in the middle and ends with A" is technically the same as "A word is valid if it starts with character A, has a 0 in the middle and ends with A". You claim that a lexer detects words. But even in NLP, the difference between words isn't that clear cut. (Take many Asian languages for example, where there aren't spaces).Syllabize
@The19thFighter: Regexes imply state machines. Parsers do not: a parser may have a state machine (e.g., LR parsers) or not (hand-coded recursive descent). Yes, you can write a grammar for a language at the character level; then nonterminals in that language may be considered "words", e.g., NUMBER = MANTISSA EXPONENT; MANTISSA = DIGITS '.' DIGITS; EXPONENT = [+-]DIGITS. If you do this, you'll find yourself running a relatively expensive parser to collect "words". Yes, that works but the breaking of syntax checking into lexing then parsing is considerably more efficient. Chinese? dunno.Zebulun
As someone who tries to write a Emacs mode parser with SMIE, I'm still kinda confused. You say a lexer may get away with just regexps. But imagine a python line s = "hello world" What should lexer return? Well, it would be s, = — but what next? It's probably ", but then you should have complete hello world as a token, right? That is because whitespace is significant in a string content. And then there may be embedded quotes,escaped escapes,escaped newlines (perhaps of different types like LF, CRLF, CR), all this gets hairy pretty quickly for regexps. Is this still a job of a lexer?Yep
What the lexer should return is a "token IDENTIFIER 's'", "token =", "token STRING 'hello world'". Yes this is the job for the lexer and easily done, especially if you have a lexer-generator program that will accept separate regexes for each token type. You have no idea what "hairy" is :-{{ We produce COBOL parsers, which have lexers for COBOL tokens feeding them. Our COBOL lexer (which handles 7 full dialects of COBOL) has 22,000 (YES!) lines of regex definitions of all the different tokens that all 7 COBOL dialects can produce. Yes, this was hard to build and test even with good toolsZebulun
... but it doesn't mean you can't do it. You just need enthusiasm.Zebulun

When is lexing enough, when do you need EBNF?

EBNF really doesn't add much to the power of grammars. It's just a convenience / shortcut notation / "syntactic sugar" over the standard Chomsky's Normal Form (CNF) grammar rules. For example, the EBNF alternative:

S --> A | B

you can achieve in CNF by just listing each alternative production separately:

S --> A      // `S` can be `A`,
S --> B      // or it can be `B`.

The optional element from EBNF:

S --> X?

you can achieve in CNF by using a nullable production, that is, the one which can be replaced by an empty string (denoted by just empty production here; others use epsilon or lambda or crossed circle):

S --> B       // `S` can be `B`,
B --> X       // and `B` can be just `X`,
B -->         // or it can be empty.

A production in a form like the last one for B above is called "erasure", because it can erase whatever it stands for in other productions (produce an empty string instead of something else).

Zero-or-more repetition from EBNF:

S --> A*

you can obtain by using a recursive production, that is, one which embeds itself somewhere in it. It can be done in two ways. The first one is left recursion (which usually should be avoided, because top-down recursive descent parsers cannot parse it):

S --> S A    // `S` is just itself ended with `A` (which can be done many times),
S -->        // or it can begin with empty-string, which stops the recursion.

Knowing that it generates just an empty string (ultimately) followed by zero or more As, the same language (though with different parse trees!) can be expressed using right recursion:

S --> A S    // `S` can be `A` followed by itself (which can be done many times),
S -->        // or it can be just empty-string end, which stops the recursion.

And when it comes to + for one-or-more repetition from EBNF:

S --> A+

it can be done by factoring out one A and using * as before:

S --> A A*

which you can express in CNF as such (I use right recursion here; try to figure out the other one yourself as an exercise):

S --> A S   // `S` can be one `A` followed by `S` (which stands for more `A`s),
S --> A     // or it could be just one single `A`.
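
If you want to convince yourself that these two productions really generate the same strings as A+, a few illustrative lines of Python (my sketch, not part of the original answer) can enumerate derivations and check them against the equivalent regular expression:

import re

def derive_a_plus(n):
    # Apply  S --> A S   (n - 1 times) and then  S --> A  once.
    s = "S"
    for _ in range(n - 1):
        s = s.replace("S", "AS", 1)    # S --> A S
    return s.replace("S", "A", 1)      # S --> A

for n in range(1, 5):
    word = derive_a_plus(n)
    assert re.fullmatch(r"A+", word)   # exactly the strings matched by A+
    print(n, word)                     # 1 A, 2 AA, 3 AAA, 4 AAAA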

Knowing that, you can now probably recognize a grammar for a regular expression (that is, a regular grammar) as one which can be expressed in a single EBNF production consisting only of terminal symbols. More generally, you can recognize regular grammars when you see productions similar to these:

A -->        // Empty (nullable) production (AKA erasure).
B --> x      // Single terminal symbol.
C --> y D    // Simple state change from `C` to `D` when seeing input `y`.
E --> F z    // Simple state change from `E` to `F` when seeing input `z`.
G --> G u    // Left recursion.
H --> v H    // Right recursion.

That is, using only empty strings, terminal symbols, and simple non-terminals for substitutions and state changes, and using recursion only to achieve repetition (iteration, which is just linear recursion, the kind that doesn't branch tree-like). If there's nothing more advanced than these, you can be sure the syntax is regular and you can go with just a lexer for that.
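
To see how little machinery such productions need, here is an illustrative sketch (the grammar and state names are invented): a right-linear grammar for optionally signed integers run as a plain finite automaton, with no stack, agreeing with the equivalent regular expression:

import re

# The right-linear grammar
#   S --> '+' A | '-' A | digit A | digit
#   A --> digit A | digit
# generates exactly the language of the regular expression [+-]?\d+ ,
# so a finite automaton (no stack at all) is enough to recognize it.
TRANSITIONS = {                       # current state -> {input class -> next state}
    "S":  {"sign": "A0", "digit": "A"},
    "A0": {"digit": "A"},             # a sign must be followed by at least one digit
    "A":  {"digit": "A"},             # A --> digit A  (loop)  |  digit  (accept)
}
ACCEPTING = {"A"}

def input_class(ch):
    if ch in "+-":
        return "sign"
    if ch.isdigit():
        return "digit"
    return None

def dfa_accepts(text):
    state = "S"
    for ch in text:
        cls = input_class(ch)
        if cls is None or cls not in TRANSITIONS[state]:
            return False
        state = TRANSITIONS[state][cls]
    return state in ACCEPTING

for s in ["42", "-7", "+", "", "12a"]:
    assert dfa_accepts(s) == bool(re.fullmatch(r"[+-]?\d+", s))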

But when your syntax uses recursion in a non-trivial way, to produce tree-like, self-similar, nested structures, like the following one:

S --> a S b    // `S` can be itself "parenthesized" by `a` and `b` on both sides.
S -->          // or it could be (ultimately) empty, which ends recursion.

then you can easily see that this cannot be done with a regular expression, because you cannot resolve it into one single EBNF production in any way; you'll end up substituting for S indefinitely, which will always add another a and b on both sides. Lexers (more specifically: the finite state automata used by lexers) cannot count to an arbitrary number (they are finite, remember?), so they don't know how many as there were, to match them evenly with as many bs. Grammars like this are called context-free grammars (at the very least), and they require a parser.
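
By contrast, a dozen lines of recursive descent handle this grammar easily, because the procedure call stack does the counting. A hypothetical sketch:

def parse_s(text, pos=0):
    """Recursive-descent recognizer for  S --> a S b | (empty).

    Each recursive call remembers one pending 'b' on the call stack -- exactly
    the unbounded counting that a finite automaton (and therefore a plain
    regular expression) cannot do. Returns the position just after the
    recognized S, or None on failure.
    """
    if pos < len(text) and text[pos] == "a":
        after_inner = parse_s(text, pos + 1)       # parse the nested S
        if after_inner is not None and after_inner < len(text) and text[after_inner] == "b":
            return after_inner + 1                 # consume the matching 'b'
        return None
    return pos                                     # the empty alternative

def matches(text):
    return parse_s(text) == len(text)

assert matches("")            # zero levels of nesting
assert matches("aaabbb")      # three as matched by exactly three bs
assert not matches("aaabb")   # unbalanced
assert not matches("abab")    # not of the form a^n b^n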

There are well-known techniques for parsing context-free grammars, so they are widely used for describing programming languages' syntax. But there's more. Sometimes a more general grammar is needed -- when you have more things to count at the same time, independently. For example, when you want to describe a language where one can use round parentheses and square brackets interleaved, but they have to be paired up correctly with each other (brackets with brackets, round with round). This kind of grammar is called context-sensitive. You can recognize it by the fact that it has more than one symbol on the left-hand side (before the arrow). For example:

A R B --> A S B

You can think of these additional symbols on the left as a "context" for applying the rule. There could be some preconditions, postconditions etc. For example, the above rule will rewrite R into S, but only when it's in between A and B, leaving those A and B themselves unchanged. This kind of syntax is really hard to parse: in general it needs a linear-bounded automaton, and the unrestricted (level 0) grammars need a full-blown Turing machine. It's a whole other story, so I'll end here.
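
What follows is not a grammar-driven context-sensitive parser (building one is exactly the hard part); it is just a hand-rolled recognizer sketch for the interleaved-brackets language described above, showing the two independent counts that a single pushdown stack cannot keep once the bracket kinds may cross:

def interleaved_ok(text):
    """Check that '(' matches ')' and '[' matches ']' independently, while
    allowing the two kinds to interleave, e.g. "([)]" is accepted here.
    """
    round_depth = square_depth = 0
    for ch in text:
        if ch == "(":
            round_depth += 1
        elif ch == ")":
            round_depth -= 1
        elif ch == "[":
            square_depth += 1
        elif ch == "]":
            square_depth -= 1
        else:
            return False
        if round_depth < 0 or square_depth < 0:
            return False          # a closer with no earlier opener of its kind
    return round_depth == 0 and square_depth == 0

assert interleaved_ok("([)]")     # crossing pairs are allowed in this language
assert interleaved_ok("()[]([])")
assert not interleaved_ok("(]")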

Chromite answered 16/5, 2010 at 6:47 Comment(1)
You state that EBNF is "just a convenience / shortcut notation / "syntactic sugar" over the standard Chomsky's Normal Form (CNF) grammar rules". But CNF has hardly anything to do with the topic at hand. EBNF can easily be transformed into standard BNF. Period. It is syntactic sugar for standard BNF.Horton

To answer the question as asked (without repeating unduly what appears in other answers)

Lexers and parsers are not very different, as suggested by the accepted answer. Both are based on simple language formalisms: regular languages for lexers and, almost always, context-free (CF) languages for parsers. They both are associated with fairly simple computational models, the finite state automaton and the pushdown automaton. Regular languages are a special case of context-free languages, so that lexers could be produced with the somewhat more complex CF technology. But it is not a good idea, for at least two reasons.

A fundamental point in programming is that a system component should be built with the most appropriate technology, so that it is easy to produce, to understand and to maintain. The technology should not be overkill (using techniques much more complex and costly than needed), nor should it be at the limit of its power, thus requiring technical contortions to achieve the desired goal.

That is why "It seems fashionable to hate regular expressions". Though they can do a lot, they sometimes require very unreadable coding to achieve it, not to mention the fact that various extensions and restrictions in implementation somewhat reduce their theoretical simplicity. Lexers do not usually do that, and are usually a simple, efficient, and appropriate technology for parsing tokens. Using CF parsers for tokens would be overkill, though it is possible.

Another reason not to use the CF formalism for lexers is that it might then be tempting to use the full CF power. But that might raise structural problems regarding the reading of programs.

Fundamentally, most of the structure of program text, from which meaning is extracted, is a tree structure. It expresses how the parse sentence (program) is generated from syntax rules. Semantics is derived by compositional techniques (homomorphism for the mathematically oriented) from the way syntax rules are composed to build the parse tree. Hence the tree structure is essential. The fact that tokens are identified with a regular set based lexer does not change the situation, because CF composed with regular still gives CF (I am speaking very loosely about regular transducers, that transform a stream of characters into a stream of token).

However, CF composed with CF (via CF transducers ... sorry for the math) does not necessarily give CF, and might make things more general, but less tractable in practice. So CF is not the appropriate tool for lexers, even though it can be used.

One of the major differences between regular and CF is that regular languages (and transducers) compose very well with almost any formalism in various ways, while CF languages (and transducers) do not, not even with themselves (with a few exceptions).

(Note that regular transducers may have other uses, such as formalization of some syntax error handling techniques.)

BNF is just a specific syntax for presenting CF grammars.

EBNF is a syntactic sugar for BNF, using the facilities of regular notation to give terser version of BNF grammars. It can always be transformed into an equivalent pure BNF.

However, the regular notation is often used in EBNF only to emphasize those parts of the syntax that correspond to the structure of lexical elements and should be recognized with the lexer, while the rest is presented in straight BNF. But it is not an absolute rule.

To summarize, the simpler structure of tokens is better analyzed with the simpler technology of regular languages, while the tree-oriented structure of the language (of program syntax) is better handled by CF grammars.

I would suggest also looking at AHR's answer.

But this leaves a question open: Why trees?

Trees are a good basis for specifying syntax because

  • they give a simple structure to the text

  • they are very convenient for associating semantics with the text on the basis of that structure, with a mathematically well understood technology (compositionality via homomorphisms), as indicated above. It is a fundamental algebraic tool to define the semantics of mathematical formalisms.

Hence it is a good intermediate representation, as shown by the success of Abstract Syntax Trees (AST). Note that ASTs are often different from parse trees because the parsing technology used by many professionals (such as LL or LR) applies only to a subset of CF grammars, thus forcing grammatical distortions which are later corrected in the AST. This can be avoided with more general parsing technology (based on dynamic programming) that accepts any CF grammar.
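
A tiny invented illustration of that parse-tree/AST gap: an LL grammar has to express subtraction right-recursively, so the raw parse tree of 8 - 3 - 2 groups the wrong way, and the left-associative structure the language actually means is restored only in the AST.

# The LL-friendly grammar  E --> NUM ('-' E)?  is right-recursive (LL parsers
# cannot handle left recursion), so the raw parse tree of "8 - 3 - 2" leans
# the wrong way:
parse_tree = ("E", 8, "-", ("E", 3, "-", ("E", 2)))   # shaped like 8 - (3 - 2)

# The AST restores the left-associative structure the language actually means:
ast = ("-", ("-", 8, 3), 2)                           # shaped like (8 - 3) - 2

def eval_ast(node):
    if isinstance(node, tuple):
        _, left, right = node
        return eval_ast(left) - eval_ast(right)
    return node

print(eval_ast(ast))   # 3, i.e. (8 - 3) - 2, and not 8 - (3 - 2) = 7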

Statements that programming languages are context-sensitive (CS) rather than CF are arbitrary and disputable.

The problem is that the separation of syntax and semantics is arbitrary. Checking declarations or type agreement may be seen as either part of syntax, or part of semantics. The same would be true of gender and number agreement in natural languages. But there are natural languages where plural agreement depends on the actual semantic meaning of words, so that it does not fit well with syntax.

Many definitions of programming languages in denotational semantics place declarations and type checking in the semantics. So stating as done by Ira Baxter that CF parsers are being hacked to get a context sensitivity required by syntax is at best an arbitrary view of the situation. It may be organized as a hack in some compilers, but it does not have to be.

Also, it is not just that CS parsers (in the sense used in other answers here) are hard to build and less efficient. They are also inadequate for expressing perspicuously the kind of context-sensitivity that might be needed. And they do not naturally produce a syntactic structure (such as parse trees) that is convenient for deriving the semantics of the program, i.e. for generating the compiled code.

Horton answered 11/6, 2014 at 14:19 Comment(10)
Yes, parse trees and ASTs are different, but pretty much not in a really useful way. See my discussion of this: https://mcmap.net/q/82595/-what-is-the-difference-between-an-abstract-syntax-tree-and-a-concrete-syntax-treeZebulun
@IraBaxter I do not agree with you, but I do not really have time now to draft a clean answer to your post. Basically, you are taking a pragmatic point of view (and also defending your own system, I think). This is even easier because you are using general CF parsers (however GLR may not be the most efficient), rather than deterministic ones as in some systems. I consider AST as the reference representation, which lends itself to formally defined treatment, provably correct transformations, mathematical proofs, unparsing to multiple concrete representations, etc.Horton
The "pragmatic" view is the reason I claim they aren't very different in a useful way. And I simply do not believe that using an (ad hoc AST) gives you "provably correct transforms"; your ad hoc AST has no obvious relation to the actual grammar of the langauge being processed (and here, yes, my system is defensible in that our "AST" is provably an isomorphic equivalent to the BNF). Ad hoc ASTs dont give you any additional capability to unparse to "multiple concrete representations). You objection to GLR (not most efficient) seems pretty pointless. Nor are they nondeterministic.Zebulun
So in fact I don't understand any part of your objection to my comment. You'll have to write that "clean answer".Zebulun
@IraBaxter Comments are too constrained for a proper answer (suggestion?). "Ad hoc" is not a proper qualifiers for AST I advocate, which should be (sometimes is) the reference syntax. This is historically true, looking both at the history of the concept of AST in computer science, and at the history of formal systems as terms (trees) in a sorted algebra, together with interpretation. AST is the reference form, not a derived one. See also modern proof systems and automatic program generation. You may be biased by the fact that you have to work from a concrete syntax designed by others.Horton
@ira Regarding parsers, GLR is non-deterministic (but I did not say it) in the sense that it can parse non-deterministic CF languages. I was opposing GLR to traditional DPDA technique, such as LR(k), LL(k), which parse only deterministic languages. Allowing for more flexible general CF parsing may allows simpler CF syntax, which may lend more naturally to automatic "AST" derivation as you do. I had noticed elsewhere that you use GLR, and my point is that there may be simpler algorithms that may also be a bit more efficient in time and space. LR is not as relevant here as often believed.Horton
A language has a concrete syntax presumably defined by a standards document. You can define a reference syntax as an AST, but I don't see how you can gaurantee it properly models the defined concrete syntax by any formal method; it certainly isn't derived formally from the standard. I have yet to see a language standard defined starting with a reference AST syntax (an interesting idea). I don't know what you mean by "non-deterministic CF languages"; not a definition I've ever encountered, and if I use the classic definition of non-determinism, then GLR parsers are completely deterministic.Zebulun
@IraBaxter Proofs of consistency of concrete and abstract syntax depend of course on the way they are defined. But it should presumably be done by structural induction. The first example in history of a programming language defined by a reference AST is Lisp (1958). Regarding "non-deterministic CF languages", it refers to CF languages that cannot be parsed by a deterministic pushdown automaton (see wikipedia). GLR is an algorithm intended to mimic a pushdown automaton obtained by the LR techniques, keeping conflicts as non-deterministic steps, so that it works for all CF languages.Horton
Hmm The notion that you might demonstrate some kind of equivalence between an "ad hoc" AST and a language defined grammar... OK, I guess see that as possible. (We generate our ASTs in a way that ensures a purely constructive proof). The question I raised was, if you had such an AST, how is it empirically more useful that the generated kind? I guess the phrase "nondeterministic CF language" threw me because is really means "non [deterministic PDA parseable]" and not "nondeterministic"; seems like a poorly chosen (not by you) phrase precisely because it induces this confusion.Zebulun
@IraBaxter The difference is that you think as an engineer who has to provide tools that have to be easily used by people who think primarily in terms of concrete syntax (strings), whereas my own experience is more in attempting to understand the most appropriate organization of knowledge for formal processing of languages and programs. The design of abstract syntax must ideally be driven by the formal concepts and entities used in the language and their interactions. Concrete syntax is only a somewhat arbitrary linearization that must be invertible (parseable), and may not be unique.Horton

There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing ( syntax analysis) phases.

  1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and white space as syntactic units would be considerably more complex than one that can assume comments and white space have already been removed by the lexical analyzer (see the sketch after this list). If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
  2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
  3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.
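
A minimal sketch of point 1 in practice (token names are invented): the lexical analyzer recognizes and discards whitespace and comments, so the parser's grammar never has to mention them.

import re

# The lexical analyzer recognizes whitespace and comments and throws them
# away, so the grammar the parser works with never has to mention them.
TOKENS = re.compile(r"""
      (?P<SKIP>\s+|//[^\n]*)       # whitespace and line comments: discarded here
    | (?P<NUMBER>\d+)
    | (?P<ID>[A-Za-z_]\w*)
    | (?P<PUNCT>[-+*/=;()])
""", re.VERBOSE)

def tokens(source):
    for m in TOKENS.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokens("x = 2 + 2;  // the parser never sees this comment")))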

Resource: Compilers: Principles, Techniques, and Tools (2nd Edition), by Alfred V. Aho (Columbia University), Monica S. Lam (Stanford University), Ravi Sethi (Avaya), and Jeffrey D. Ullman (Stanford University).

Cassius answered 10/3, 2014 at 17:40 Comment(0)

The parser will typically combine the tokens produced by the lexer and group them.

Parsing is defined as the analysis of an input to organize the data according to the rules of a grammar, while the lexer transforms a sequence of characters into a sequence of tokens.

Let’s look at the following example and imagine that we are trying to parse an addition.

437 + 734

The lexer scans the text and finds 4, 3, 7 and then a space ( ). The job of the lexer is to recognize that the characters 437 constitute one token of type NUM.


Then the lexer finds a + symbol, which corresponds to a second token of type PLUS, and lastly it finds another token of type NUM.
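
To make the example concrete, here are a few hypothetical lines showing the token stream just described and the grouping step the parser then performs:

import re

def lex(text):
    # NUM: a maximal run of digits (converted to its value); PLUS: the '+' symbol.
    return [("NUM", int(m.group())) if m.group().isdigit() else ("PLUS", m.group())
            for m in re.finditer(r"\d+|\+", text)]

tokens = lex("437 + 734")
print(tokens)                       # [('NUM', 437), ('PLUS', '+'), ('NUM', 734)]

# The parser then groups the three tokens under a single addition node:
(_, left), _plus, (_, right) = tokens
addition_node = ("ADD", left, right)
print(addition_node)                # ('ADD', 437, 734)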

See more detail:
A Guide to Parsing: Algorithms and Terminology

Musick answered 25/4, 2022 at 15:23 Comment(0)
