What exactly is a token, in relation to parsing?

Asked 12/4, 2011 at 17:24 Answered 12/4, 2011 at 18:12

I have to use a parser and writer in C++. I am trying to implement the functions, but I do not understand what a token is. One of my functions/operations is to check whether there are more tokens to produce:

bool Parser::hasMoreTokens()

How exactly do I go about this? Please help.

SO!

I am opening a text file with text in it; all words are lowercase. How do I go about checking whether it has more tokens?

This is what I have:

bool Parser::hasMoreTokens() {
    while (source.peek() != NULL) {
        return true;
    }
    return false;
}

Overuse answered 12/4, 2011 at 17:24 Comment(1)

Please do not expect Stack Overflow to write your code for you. Especially if it's for homework (is it? it sounds like it). Show us what you've tried. If you simply have no idea what to do, and if (as I'm guessing) this is homework, then you should probably ask your teacher / professor / TA and they can (e.g.) point you to the relevant bit of your notes or textbook. – Sontich 12/4, 2011 at 17:38

Tokens are the output of lexical analysis and the input to parsing. Typically they are things like

numbers
variable names
parentheses
arithmetic operators
statement terminators

That is, roughly, the biggest things that can be unambiguously identified by code that just looks at its input one character at a time.
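To make that concrete, here is a minimal sketch of a lexer that looks at its input one character at a time and produces the kinds of tokens listed above. The names TokenKind, Token and tokenize are made up for illustration, and the rules (integers only, single-character operators, unknown characters silently skipped) are simplifying assumptions, not what any particular language requires.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories, chosen to mirror the list above.
enum class TokenKind { Number, Identifier, Paren, Operator, Terminator };

struct Token {
    TokenKind kind;
    std::string text;
};

// Scan the input left to right, deciding from the current character which
// kind of token starts there, then consuming as many characters as belong
// to that token.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace only separates tokens
        } else if (std::isdigit(c)) {
            std::size_t start = i;
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.push_back({TokenKind::Number, input.substr(start, i - start)});
        } else if (std::isalpha(c) || c == '_') {
            std::size_t start = i;
            while (i < input.size() &&
                   (std::isalnum(static_cast<unsigned char>(input[i])) ||
                    input[i] == '_')) {
                ++i;
            }
            tokens.push_back({TokenKind::Identifier, input.substr(start, i - start)});
        } else if (c == '(' || c == ')') {
            tokens.push_back({TokenKind::Paren, std::string(1, input[i])});
            ++i;
        } else if (c == '+' || c == '-' || c == '*' || c == '/' || c == '=') {
            tokens.push_back({TokenKind::Operator, std::string(1, input[i])});
            ++i;
        } else if (c == ';') {
            tokens.push_back({TokenKind::Terminator, ";"});
            ++i;
        } else {
            ++i;  // unknown character: a real lexer would report an error here
        }
    }
    return tokens;
}
```

With these assumptions, tokenize("x = (12 + y1);") yields eight tokens: x, =, (, 12, +, y1, ) and ;.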

One note, which you should feel free to ignore if it confuses you: The boundary between lexical analysis and parsing is a little fuzzy. For instance:

Some programming languages have complex-number literals that look, say, like 2+3i or 3.2e8-17e6i. If you were parsing such a language, you could make the lexer gobble up a whole complex number and make it into a token; or you could have a simpler lexer and a more complicated parser, and make (say) 3.2e8, -, 17e6i be separate tokens; it would then be the parser's job (or even the code generator's) to notice that what it's got is really a single literal.
In some programming languages, the lexer may not be able to tell whether a given token is a variable name or a type name. (This happens in C, for instance.) But the grammar of the language may distinguish between the two, so that you'd like "variable foo" and "type name foo" to be different tokens. (This also happens in C.) In this case, it may be necessary for some information to be fed back from the parser to the lexer so that it can produce the right sort of token in each case.
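The feedback in the second case can be sketched roughly like this. It is only an illustration under assumed names: SymbolTable, NameKind and classifyName are hypothetical, not taken from any real compiler. The idea is that the parser records type names as it sees declarations, and the lexer consults that record when it has to decide what kind of token a name is.

```cpp
#include <set>
#include <string>

// Hypothetical token kinds for a name, as the grammar would like to see them.
enum class NameKind { Variable, TypeName };

// Hypothetical table shared by parser and lexer. The parser calls
// declareType() when it has parsed something like a typedef.
class SymbolTable {
public:
    void declareType(const std::string& name) { typeNames_.insert(name); }
    bool isType(const std::string& name) const {
        return typeNames_.count(name) != 0;
    }

private:
    std::set<std::string> typeNames_;
};

// The lexer asks the table which kind of token to emit for a name.
NameKind classifyName(const std::string& name, const SymbolTable& symbols) {
    return symbols.isType(name) ? NameKind::TypeName : NameKind::Variable;
}
```

The same spelling foo therefore becomes a different token depending on what the parser has already seen.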

So "what exactly is a token?" may not always have a perfectly well defined answer.

Sontich answered 12/4, 2011 at 17:28 Comment(1)

Wouldn't the names of types be also considered tokens? I think that's a big one to explicitly call out. It looks like you mention it in item 2. To generalize the actual definition of tokens, they seem to be all the most atomic units of input (e.g. a contiguous block of characters, or a single character, that must exist on its own and cannot further be reduced to something smaller as it'd defile the underlying meaning). – Daffi 31/7 at 20:8

A token is whatever you want it to be. Traditionally (and for good reasons), language specifications broke the analysis into two parts: the first part broke the input stream into tokens, and the second parsed the tokens. (Theoretically, I think you can write any grammar in only a single level, without using tokens—or what is the same thing, using individual characters as tokens. I wouldn't like to see the results of that for a language like C++, however.) But the definition of what a token is depends entirely on the language you are parsing: most languages, for example, treat white space as a separator (but not Fortran); most languages will predefine a set of punctuation/operators using punctuation characters, and not allow these characters in symbols (but not COBOL, where "abc-def" would be a single symbol). In some cases (including in the C++ preprocessor), what is a token depends on context, so you may need some feedback from the parser. (Hopefully not; that sort of thing is for very experienced programmers.)

One thing is probably sure (unless each character is a token): you'll have to read ahead in the stream. You typically can't tell whether there are more tokens by just looking at a single character. I've generally found it useful, in fact, for the tokenizer to read a whole token at a time, and keep it until the parser needs it. A function like hasMoreTokens would in fact scan a complete token.
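A minimal sketch of that read-ahead approach, assuming tokens are simply whitespace-separated words (which seems to match the text file in the question). WordTokenizer, nextToken and collectTokens are names invented for this example; they are not the asker's Parser class.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tokenizer that scans one whole token ahead and keeps it
// until the caller asks for it.
class WordTokenizer {
public:
    explicit WordTokenizer(std::istream& source) : source_(source) {}

    // Scans a complete token if none is buffered yet. Returns false only
    // when no further token can be read from the stream.
    bool hasMoreTokens() {
        if (!hasBuffered_) {
            hasBuffered_ = static_cast<bool>(source_ >> buffered_);
        }
        return hasBuffered_;
    }

    // Hands out the buffered token, scanning one first if necessary.
    // Returns an empty string when the input is exhausted.
    std::string nextToken() {
        if (!hasMoreTokens()) {
            return std::string();
        }
        hasBuffered_ = false;
        return std::move(buffered_);
    }

private:
    std::istream& source_;
    std::string buffered_;
    bool hasBuffered_ = false;
};

// Usage helper: collects every token of a string with the class above.
std::vector<std::string> collectTokens(const std::string& text) {
    std::istringstream in(text);
    WordTokenizer tokenizer(in);
    std::vector<std::string> out;
    while (tokenizer.hasMoreTokens()) {
        out.push_back(tokenizer.nextToken());
    }
    return out;
}
```

Calling hasMoreTokens several times in a row is harmless here, because the token it scanned stays buffered until nextToken takes it.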

(And while I'm at it, if source is an istream: istream::peek does not return a pointer, but an int.)
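For the peek point, a hedged sketch of what a correct comparison could look like, again assuming whitespace-separated tokens. It is written as a free function taking the stream, plus a small helper textHasToken for trying it on a string; both names are invented for this example.

```cpp
#include <istream>
#include <sstream>
#include <string>

// peek() returns an int_type: either the next character or the special
// end-of-file value, so the result is compared with traits_type::eof()
// rather than with NULL.
bool hasMoreTokens(std::istream& source) {
    source >> std::ws;  // discard whitespace before the next token, if any
    return source.peek() != std::istream::traits_type::eof();
}

// Usage helper: does the given text contain at least one token?
bool textHasToken(const std::string& text) {
    std::istringstream in(text);
    return hasMoreTokens(in);
}
```

Skipping whitespace first matters: without it, trailing spaces or a final newline would make the function report a token that is not there.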

Valer answered 12/4, 2011 at 18:12 Comment(0)

A token is the smallest unit of a programming language that has a meaning. A parenthesis (, a name foo, an integer 123, are all tokens. Reducing a text to a series of tokens is generally the first step of parsing it.

Basal answered 12/4, 2011 at 17:28 Comment(1)

Ahhhhh....now this is the definition I was seeking and the one I most agree with. It hit me yesterday when I was walking around Macon, GA after work. I'm like I should have a natural derivation of many of these concepts I got initially exposed to in college. This is it!! – Daffi 31/7 at 20:9

When you split a large unit (long string) into a group of sub-units (smaller strings), each of the sub-units (smaller strings) is referred to as a "token". If there are no more sub-units, then you are done parsing.

How do I tokenize a string in C++?

Queenstown answered 12/4, 2011 at 17:27 Comment(0)

A token is usually akin to a word in spoken language. In C++, int, float, 5.523 and const are all tokens. It is the minimal unit of text that constitutes a semantic element.

Programme answered 12/4, 2011 at 17:27 Comment(0)

A token is a terminal in a grammar: a sequence of one or more symbols that is defined by the sequence itself, i.e. it does not derive from any other production defined in the grammar.

Silicon answered 12/4, 2011 at 17:30 Comment(0)
