Looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other and used?

I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program that will go through C/H source files to extract data declarations and definitions.

I have been looking for examples and can find some info, but I am really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax trees, and how they interrelate. Eventually these concepts need to be stored in an actual program, but 1) what do they look like, and 2) are there common implementations?

I have been looking at Wikipedia on these topics and at programs like Lex and Yacc, but, having never gone through a compiler class (I was an EE major), I am finding it difficult to fully understand what is going on.

Solvency answered 19/12, 2008 at 9:14 Comment(1)
Useful: A Guide to Parsing: Algorithms and TerminologyCuirassier

A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).

A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.

A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing (usually) the program described by the original text.
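As a rough sketch of the lexer stage (the token names and the regex set here are illustrative, not from any particular tool), a lexer can be little more than a list of regular expressions tried in order:

```python
import re

# Token categories are illustrative; a real lexer for C would need many more.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("EQ",     r"=="),   # must be tried before ASSIGN
    ("ASSIGN", r"="),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),
]

def lex(text):
    """Yield (kind, value) pairs -- a tokenizer that also tags each token."""
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if kind != "SKIP":  # drop whitespace
                    yield (kind, m.group())
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r}")

print(list(lex("x == 42;")))
# [('IDENT', 'x'), ('EQ', '=='), ('NUMBER', '42'), ('SEMI', ';')]
```

Splitting purely on whitespace would be a plain tokenizer; tagging each match with a kind ('NUMBER', 'IDENT', ...) is the extra context that makes it a lexer in the sense above.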

Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".

Basrelief answered 19/12, 2008 at 9:25 Comment(12)
No doubt "The Dragon Book" is a good book, but it does require the reader to have a good grounding in CS. Some books with more practical appeal would be "Writing Compilers and Interpreters" by Ronald Mak; "Modern Compiler Implementation" by Andrew Appel; "Compiler Construction" by Niklaus Wirth; "Compiling with C# and Java" and "Compilers and Compiler Generators: an Introduction with C++" by Pat Terry; and, of course, "The Definitive ANTLR Reference" by Terence Parr.Sanson
Sure. Last I checked, I was doing a CS degree :-) I defer to your more-recent recommendations.Basrelief
Just to be sure, I am not knocking your recommendation. "The Dragon Book" was my first book on compiler tech, but it was hard going compared to, say, Wirth's book, which is a book you can grok in a few hours. Back then I had few options as it was the only book I could get my hands on (it being 1991, before Amazon and the WWW). I had that and a collection of text files produced by Jack W. Crenshaw called "LET'S BUILD A COMPILER" (thanks Jack!). This is still the book to get for a more complete understanding of the principles, but most programmers just need a pragmatic introduction.Sanson
I wouldn't agree that a parser /by definition/ produces an abstract syntax tree. Parsers can produce all sorts of different outputs. For example, it is common that a parser produces a sequence of calls to some builder interface -- see the Builder Pattern in the Gang of Four patterns book. The key point is that the parser analyses a sequence of tokens to determine whether or not the sequence conforms to some (usually context free) grammar and may produce some output based on the sequence's grammatical structure.Shark
"Let's Build a Compiler" is here: compilers.iecc.com/crenshaw. I found the link from here: prog21.dadgum.com/30.htmlBasrelief
A function that takes a user's input and creates a data structure from it -- should it be named a parser or a tokenizer?Rozalin
@Pithkos: if those are the only constraints, all you've said is that the function takes an input in one unnamed (mathematical) domain and produces an output in another unnamed domain, e.g., F(X) -> Y. Pretty much this means you can only call this a "function". If you insist that the domain of X is <StreamOfCharacter, Grammar> and the domain of Y is Tree with the property that it reflects the shape of the grammar, then F(X,G) -> T would be something I would call a parser. Often we curry F with respect to G because G doesn't change often, so F[G](X) -> T is what you commonly see as a parser.Coppola
@IraBaxter, do you agree with the distinction being made here between a tokenizer and a lexer? A book I'm reading says: "The difference between tokenizing and lexing is subtle and academic, but it centers on whether or not these tokens are identified in a stateless or stateful way. Put simply, if the tokenizer were to invoke stateful parsing rules to figure out whether "a" should be considered a distinct token or just part of another token, that would be lexing." I'm not sure if the answer above is saying the same thing, or what is meant by "stateful parsing rules".Davinadavine
I've never heard that distinction. You can accept those definitions if you want, but I've never bothered to worry about it. Adding state to your "tokenizer" is easy and very useful, especially lexing modes which handle parts of languages that have different syntax in different sections, such as COBOL. Consequently I prefer to build lexers from the point of view of that definition; it's easy to ignore the state if you don't need it. In practice, most people haven't heard of this distinction, so when I talk to folk I use the terms interchangeably, but I mean "lexer".Coppola
... and I've never found any real use for a tokenizer under this definition. Perhaps that's because I'm only interested in parsing-producing-manipulable-trees for real programming languages, and to do that sensibly you have to be able to distinguish among the lexeme types. A further very practical enhancement of the notion of "lexer" is the idea that the lexer should capture the semantic value of the token in the native form of the machine that is going to use it, e.g., a floating point number is recognized as a distinct lexeme with an attached IEEE binary floating point equivalent value.Coppola
This breaking of a stream into tokens by splitting on whitespace cannot commonly be used in real parsers, though, right? Because if you were to tokenize an expression like int x= 343; by that method, you would end up with the tokens int, x= and 343;. It seems to me that a real-world parser would have to consume input pretty much character by character. Am I missing something?Colligan
@antred, depends on your language grammar. If it requires whitespace around tokens, you're golden. I said "usually" in my answer, because breaking on whitespace is the easiest way to think about tokenisation, even if that's not what your language actually does. But, no, you're not really missing anything. Most commonly-used languages don't require whitespace between tokens where, say, the character class changes (from alphanumeric to punctuation, e.g.).Basrelief
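The character-class point from the comments above can be sketched quickly (illustrative Python; the regex is a toy covering only this example, not a full C tokenizer):

```python
import re

# Splitting on whitespace alone mis-tokenizes "int x= 343;" ...
text = "int x= 343;"
print(text.split())  # ['int', 'x=', '343;'] -- 'x=' and '343;' stay fused

# ... while scanning by character class (identifier | number | punctuation)
# splits tokens wherever the class changes, with or without whitespace:
tokens = re.findall(r"[A-Za-z_]\w*|\d+|[=;]", text)
print(tokens)        # ['int', 'x', '=', '343', ';']
```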

Example:

int x = 1;

A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';'.

A parser will take those tokens and use them to understand the input in some way:

  • we have a statement
  • it's a definition of an integer
  • the integer is called 'x'
  • 'x' should be initialised with the value 1
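Those four bullet points can be sketched as a tiny hand-written parser (the token kinds and field names are hypothetical, just to make the idea concrete):

```python
# Tokens for "int x = 1;" as a lexer might emit them.
tokens = [("TYPE", "int"), ("IDENT", "x"), ("ASSIGN", "="),
          ("NUMBER", "1"), ("SEMI", ";")]

def parse_declaration(toks):
    """Recognize 'TYPE IDENT = NUMBER ;' and build a small dict-based AST."""
    kinds = [kind for kind, _ in toks]
    if kinds != ["TYPE", "IDENT", "ASSIGN", "NUMBER", "SEMI"]:
        raise SyntaxError("not a simple initialized declaration")
    return {
        "statement": "definition",   # we have a statement: a definition
        "type": toks[0][1],          # ... of an integer
        "name": toks[1][1],          # the integer is called 'x'
        "init": int(toks[3][1]),     # initialised with the value 1
    }

print(parse_declaration(tokens))
# {'statement': 'definition', 'type': 'int', 'name': 'x', 'init': 1}
```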
Moxie answered 26/3, 2009 at 17:37 Comment(1)
A lexer will note that "int", "=", and ";" are tokens without further meaning, that "x" is an identifier name (or something like it) with value "x", and that "1" is an integer or number with value "1". A tokenizer won't necessarily do that.Fallonfallout

I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.

I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.

Titular answered 19/12, 2008 at 9:25 Comment(1)
With PEG parsers the distinction between tokenizer and parser is even less clear.Sanson

(adding to the given answers)

  • A tokenizer will also remove any comments, and only return tokens to the lexer.
  • A lexer will also define scopes for those tokens (variables/functions).
  • A parser will then build the code/program structure.
Stakeout answered 9/3, 2017 at 12:9 Comment(5)
Hello @downvoter, can you elaborate on why you downvoted?Eggshaped
I'm not the downvoter, but I think the downvote may have been because your answer does not seem correct. A tokenizer may remove noise (typically whitespace, but maybe also comments), but it often does not feed the lexer. A DFA-based lexer will tokenize and identify what tokens are (e.g. a number, a string, an identifier, but also whitespace or a comment), but it cannot scope these, since that would require the syntax tree which is later built by the parser.Banner
1) I don't understand your apparent distinction between "lexer" and "tokenizer". I've built parsers for 50+ languages and I have never had two separate mechanisms that break the source text into atoms, so for me these are just synonyms. 2) If you are compiling, removing comments and whitespace makes sense in the lexer. If you are building source-to-source transformation tools, you cannot lose comments, because they must reappear in the transformed text. So ALWAYS removing comments is wrong; we can argue about how one manages to preserve whitespace. ...Coppola
... [The tools I build (see my bio) capture both with adequate fidelity to reproduce them in the transformed code; we go further, and capture the format of the atoms, including weird things like the quotes used on character strings and the radix/leading-zero count on numbers, all in service of avoiding the user rejecting the transformed result. So what you missed is that not only do lexers not necessarily strip information, but in fact they may need to capture information above and beyond the raw token.] ...Coppola
... 3) Lexers only define "scopes" in hopelessly awkward parsers which have a hard time handling syntactic ambiguities. C and C++ parsers are the canonical example (see my discussion at https://mcmap.net/q/81747/-why-can-39-t-c-be-parsed-with-a-lr-1-parser). One doesn't have to do it that (ugly) way. So I find your answer simply misguided.Coppola

© 2022 - 2024 — McMap. All rights reserved.