Making a lexical Analyzer
Asked Answered
V

6

30

I'm working with a Lexical Analyzer program right now and I'm using Java. I've been researching for answers on this problem but until now I failed to find any. Here's my problem:

Input:

System.out.println ("Hello World");

Desired Output:

Lexeme----------------------Token

System [Key_Word]

.       [Object_Accessor]

out   [Key_Word]

. [Object_Accessor]

println  [Key_Word]

(  [left_Parenthesis]

"Hello World"    [String_Literal]

)   [right_Parenthesis]

;  [statement_separator]

I'm still a beginner so I hope you guys can help me on this. Thanks.

Viscount answered 25/7, 2013 at 2:51 Comment(0)
S
66

You need neither ANTLR nor the Dragon book to write a simple lexical analyzer by hand. Even lexical analyzers for fuller languages (like Java) aren't terribly complicated to write by hand. Obviously if you have an industrial task you might want to consider industrial strength tools like ANTLR or some lex variant, but for the sake of learning how lexical analysis works, writing one by hand would likely prove to be a useful exercise. I'm assuming that this is the case, since you said you're still a beginner.

Here's a simple lexical analyzer, written in Java, for a subset of a Scheme-like language, that I wrote after seeing this question. I think the code is relatively easy to understand even if you've never seen a lexer before, simply because breaking a stream of characters (in this case a String) into a stream of tokens (in this case a List<Token>) isn't that hard. If you have questions I can try to explain in more depth.

import java.util.List;
import java.util.ArrayList;

/*
 * Lexical analyzer for Scheme-like minilanguage:
 * (define (foo x) (bar (baz x)))
 */
public class Lexer {
    public static enum Type {
        // This Scheme-like language has three token types:
        // open parens, close parens, and an "atom" type
        LPAREN, RPAREN, ATOM;
    }
    public static class Token {
        public final Type t;
        public final String c; // contents mainly for atom tokens
        // could have column and line number fields too, for reporting errors later
        public Token(Type t, String c) {
            this.t = t;
            this.c = c;
        }
        public String toString() {
            if(t == Type.ATOM) {
                return "ATOM<" + c + ">";
            }
            return t.toString();
        }
    }

    /*
     * Given a String, and an index, get the atom starting at that index
     */
    public static String getAtom(String s, int i) {
        int j = i;
        for( ; j < s.length(); ) {
            if(Character.isLetter(s.charAt(j))) {
                j++;
            } else {
                return s.substring(i, j);
            }
        }
        return s.substring(i, j);
    }

    public static List<Token> lex(String input) {
        List<Token> result = new ArrayList<Token>();
        for(int i = 0; i < input.length(); ) {
            switch(input.charAt(i)) {
            case '(':
                result.add(new Token(Type.LPAREN, "("));
                i++;
                break;
            case ')':
                result.add(new Token(Type.RPAREN, ")"));
                i++;
                break;
            default:
                if(Character.isWhitespace(input.charAt(i))) {
                    i++;
                } else {
                    String atom = getAtom(input, i);
                    i += atom.length();
                    result.add(new Token(Type.ATOM, atom));
                }
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        if(args.length < 1) {
            System.out.println("Usage: java Lexer \"((some Scheme) (code to) lex)\".");
            return;
        }
        List<Token> tokens = lex(args[0]);
        for(Token t : tokens) {
            System.out.println(t);
        }
    }
}

Example use:

~/code/scratch $ java Lexer ""
~/code/scratch $ java Lexer "("
LPAREN
~/code/scratch $ java Lexer "()"
LPAREN
RPAREN
~/code/scratch $ java Lexer "(foo)"
LPAREN
ATOM<foo>
RPAREN
~/code/scratch $ java Lexer "(foo bar)"
LPAREN
ATOM<foo>
ATOM<bar>
RPAREN
~/code/scratch $ java Lexer "(foo (bar))"
LPAREN
ATOM<foo>
LPAREN
ATOM<bar>
RPAREN
RPAREN

Once you've written one or two simple lexers like this, you will get a pretty good idea of how this problem decomposes. Then it would be interesting to explore how to use automated tools like lex. The theory behind regular expression based matchers is not too difficult, but it does take a while to fully understand. I think writing lexers by hand motivates that study and helps you come to grips with the problem better than diving into the theory behind converting regular expressions to finite automate (first NFAs, then NFAs to DFAs), etc... wading into that theory can be a lot to take in at once, and it is easy to get overwhelmed.

Personally, while the Dragon book is good and very thorough, the coverage might not be the easiest to understand because it aims to be complete, not necessarily accessible. You might want to try some other compiler texts before opening up the Dragon book. Here are a few free books, which have pretty good introductory coverage, IMHO:

http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

http://www.diku.dk/~torbenm/Basics/

Some articles on the implementation of regular expressions (automated lexical analysis usually uses regular expressions)

http://swtch.com/~rsc/regexp/

Shults answered 25/7, 2013 at 3:50 Comment(9)
THANK YOU VERY MUCH SIR. It helped me a lot. I would just like to ask if it's necessary for me to study these things as a Computer Science student. What is it's relevance to my major?Viscount
Lexical analysis is the first step that a compiler or interpreter will do, before parsing. Compilers (and interpreters) are very useful, and without them we would have to write machine code all day. I won't comment on whether a CS student should or shouldn't study compilers. I think they are interesting in their own right, and if you're a curious programmer you may be inclined to wonder how they work. There are a lot of topics in CS though, and understanding compilation may not be interesting to you. That's ok too. That said, compilers are certainly relevant to CS in general.Shults
Thank you very much Sir for sharing your thoughts. I'm interested in studying the compiler / compilation process because I dreamed of designing one someday. What i'm afraid of is that I may not be able understand it very well. As I said i'm still a beginner. I started studying Computer Science without any background about computer programming. I always wonder where to start.Viscount
There are a lot of very accessible resources for getting started with and learning about compilers (and interpreters). If you're interested, don't be afraid or intimidated -- you too can write a compiler! Feel free to get in touch over email (in my profile) if you want to discuss this further or need some help finding a good place to start.Shults
Thanks man! Your basic example helped me understand the process a bit better. All the best!Cysticercus
@michiakig When I try to run this code I get an OutOfMemoryError error. Wouldn't it potentially be better to avoid creating a new object for each token to save memory?Carbonate
@Carbonate This example was a demonstration of how to write a lexical analyzer without tools, and was never intended to be performant or memory efficient. I threw it together in probably half an hour while writing this answer. As I wrote 5 years ago, if you have an industrial lexing task to perform, consider industrial strength tools. But if you are studying the implementation of compilers, simple examples can be useful.Shults
Hi! I am looking to your implementation for inspiration on a project I am working on currently, where I take assembly code and interpret it in java. In the project in question, there are 5 tokens instead of the given 3, and I am looking for Strings corresponding to instructions like "LOAD" and "ADD". With this in mind, how would my implementation differ from the above if at all?Alleyn
@Alleyn depending on the details you might have to look ahead more than 1 char to differentiate the keywords if you want to have different token types for them. notice the switch statement in the post looks at 1 character, and has 3 cases. if your language permits an identifier (like a label) like "LOADX" you need to peek ahead up to the next whitespace to tell that this is not a "LOAD" keyword but an identifier. this is a whole subtopic of lexical analyser generators i think, but it's been a long time since i played around with compilers (other hobbies these days). good luckShults
I
4

ANTLR 4 will do exactly this with the Java.g4 reference grammar. You have two options depending on how closely you want the handling of Unicode escape sequences to follow the language specification.

Edit: The names of the tokens produced by this grammar differ slightly from your table.

  • Your Key_Word token is Identifier
  • Your Object_Accessor token is DOT
  • Your left_Parenthesis token is LPAREN
  • Your String_Literal token is StringLiteral
  • Your right_Parenthesis token is RPAREN
  • Your statement_separator token is SEMI
Inbreed answered 25/7, 2013 at 3:1 Comment(0)
U
3

Lexical analysis is a topic by itself that usually goes together with compiler design and analysis. You should read up about it before trying to code anything. My favourite book on this topic is the Dragon book which should give you a good introduction to compiler design and even provides pseudocodes for all compiler phases which you can easily translate to Java and move from there.

In short, the main idea is to parse the input and divide it into tokens which belong to certain classes (parentheses or keywords, for example, in your desired output) using a finite state machine. Process of state machine building is actually the only hard part of this analysis and the Dragon book will provide you with great insight into this thing.

Undershoot answered 25/7, 2013 at 2:55 Comment(1)
Thanks Sir! I really appreciate your suggestions. There really is a big need for me to study lexical analysis deeper for me to understand the concept better.Viscount
B
2

You can use libraries like Lex & Bison in C or Antlr in Java. Lexical analysis can be done through making automata. I'll give you small example:

Suppose you need to tokenize a string where keywords (language) are {'echo', '.', ' ', 'end'). By keywords I mean language consists of following keywords only. So if I input

echo .
end .

My lexer should output

echo ECHO
 SPACE
. DOT
end END
 SPACE
. DOT

Now to build automata for such a tokenizer, I can start by

  ->(SPACE) (Back)
 |   
(S)-------------E->C->H->O->(ECHO) (Back)
 |              |
 .->(DOT)(Back)  ->N->D ->(END) (Back to Start)

Above diagram is prolly very bad, but idea is that you have a start state represented by S now you consume E and go to some other state, now you expect N or C to come for END and ECHO respectively. You keep consuming characters and reach different states within this simple finite state machine. Ultimately, you reach certain Emit state, for example after consuming E, N, D you reach emit state for END which emits the token out and then you go back to start state. This cycle continues forever as far as you have characters stream coming to your tokenizer. On invalid character you can either thrown an error or ignore depending on the design.

Breaker answered 25/7, 2013 at 3:11 Comment(0)
H
0

CookCC ( https://github.com/coconut2015/cookcc ) generates a very fast, small, zero-dependency lexer for Java.

Hobbyhorse answered 10/8, 2018 at 15:30 Comment(0)
L
-2

Write a program to make a simple lexical analyzer that will build a symbol table from given stream of chars. You will need to read a file named “input.txt” to collect all chars. For simplicity, input file will be a C/Java/Python program without headers and methods(body of the main progrm). Then you will identify all the numerical values, identifiers, keywords, math operators, logical operators and others[distinct]. See the example for more details. You can assume that, there will be a space after each keyword.

#include<stdio.h>
    #include<stdlib.h>
    #include<string.h>

    int main(){
    /* By Ashik Rabbani
    Daffodil International University,CSE43 */
    keyword_check();
    identifier_check();
    math_operator_check();
    logical_operator_check();
    numerical_check();
    others_check();


        return 0;
    }


    void math_operator_check()
    {

        char ch, string_input[15], operators[] = "+-*/%";
        FILE *fp;
        char tr[20];
        int i,j=0;

        fp = fopen("input.txt","r");

        if(fp == NULL){
            printf("error while opening the file\n");
            exit(0);
        }
       printf("\nMath Operators : ");
        while((ch = fgetc(fp)) != EOF){
               for(i = 0; i < 6; ++i){
                   if(ch == operators[i])
                       printf("%c ", ch);

               }
               }
                printf("\n");



        fclose(fp);
    }


    void logical_operator_check()
    {

        char ch, string_input[15], operators[] = "&&||<>";
        FILE *fp;
        char tr[20];
        int i,j=0;

        fp = fopen("input.txt","r");

        if(fp == NULL){
            printf("error while opening the file\n");
            exit(0);
        }
       printf("\nLogical Operators : ");
        while((ch = fgetc(fp)) != EOF){
               for(i = 0; i < 6; ++i){
                   if(ch == operators[i])
                       printf("%c ", ch);

               }
               }
                printf("\n");



        fclose(fp);
    }

    void numerical_check()
    {

        char ch, string_input[15], operators[] ={'0','1','2','3','4','5','6','7','8','9'};
        FILE *fp;

        int i,j=0;

        fp = fopen("input.txt","r");

        if(fp == NULL){
            printf("error while opening the file\n");
            exit(0);
        }
       printf("\nNumerical Values : ");
        while((ch = fgetc(fp)) != EOF){
               for(i = 0; i < 6; ++i){
                   if(ch == operators[i])
                       printf("%c ", ch);

               }
               }
                printf("\n");



        fclose(fp);
    }

    void others_check()
    {
        char ch, string_input[15], symbols[] = "(){}[]";
        FILE *fp;
        char tr[20];
        int i,j=0;

        fp = fopen("input.txt","r");

        if(fp == NULL){
            printf("error while opening the file\n");
            exit(0);
        }
       printf("\nOthers : ");
        while((ch = fgetc(fp)) != EOF){
               for(i = 0; i < 6; ++i){
                   if(ch == symbols[i])
                       printf("%c ", ch);

               }
               }
               printf("\n");



        fclose(fp);
    }

    void identifier_check()
    {
        char ch, string_input[15];
        FILE *fp;
    char    operators[] ={'0','1','2','3','4','5','6','7','8','9'};
        int i,j=0;

        fp = fopen("input.txt","r");

        if(fp == NULL){
            printf("error while opening the file\n");
            exit(0);
        }

        printf("\nIdentifiers : ");
        while((ch = fgetc(fp)) != EOF){

               if(isalnum(ch)){
                   string_input[j++] = ch;
               }
               else if((ch == ' ' || ch == '\n') && (j != 0)){
                       string_input[j] = '\0';
                       j = 0;

                       if(isKeyword(string_input) == 1)
                       {

                       }

                       else
                           printf("%s ", string_input);
               }

               }

                printf("\n");


        fclose(fp);
    }

    int isKeyword(char string_input[]){
        char keywords[32][10] = {"auto","break","case","char","const","continue","default",
                                "do","double","else","enum","extern","float","for","goto",
                                "if","int","long","register","return","short","signed",
                                "sizeof","static","struct","switch","typedef","union",
                                "unsigned","void","volatile","while"};
        int i, flag = 0;

        for(i = 0; i < 32; ++i){
            if(strcmp(keywords[i], string_input) == 0){
                flag = 1;
                break;
            }
        }

        return flag;
    }

    void keyword_check()
    {

        char ch, string_input[15], operators[] = "+-*/%=";
        FILE *fp;
        char tr[20];
        int i,j=0;

        printf(" Token Identification using C \n By Ashik-E-Rabbani \n 161-15-7093\n\n");

        fp = fopen("input.txt","r");

        if(fp == NULL){
            printf("error while opening the file\n");
            exit(0);
        }

        printf("\nKeywords : ");
        while((ch = fgetc(fp)) != EOF){

               if(isalnum(ch)){
                   string_input[j++] = ch;
               }
               else if((ch == ' ' || ch == '\n') && (j != 0)){
                       string_input[j] = '\0';
                       j = 0;

                       if(isKeyword(string_input) == 1)
                           printf("%s ", string_input);

               }

               }

     printf("\n");


        fclose(fp);
    }
Leuco answered 10/12, 2018 at 12:48 Comment(2)
could you please add some explainations about what does your code?Horten
Write a program to make a simple lexical analyzer that will build a symbol table from given stream of chars. You will need to read a file named “input.txt” to collect all chars. For simplicity, input file will be a C/Java/Python program without headers and methods(body of the main progrm). Then you will identify all the numerical values, identifiers, keywords, math operators, logical operators and others[distinct]. See the example for more details. You can assume that, there will be a space after each keyword.Leuco

© 2022 - 2024 — McMap. All rights reserved.