Writing a transpiler to the point where the actual mapping takes place
I want to understand how a transpiler works. The best way to do this is to write one, of course.

I've been looking into a few resources to understand how this works theoretically, and I understand the following:

From what I understand, I basically need to write two classes:

  1. A Lexical Analyzer
  2. A Parser

Lexical Analyzer

The Lexical Analyzer takes the source code of a file as input (the input stream). For example, the following code:

if (someVar == 20) {
    MessageBox("Hello World!");
}

Then the Lexical Analyzer creates chunks of data out of this:

[if]
[ ]
[(]
[someVar]
[ ]
[==]
[ ]
[20]
[)]
[ ]
[{]
[\n]
[\t]
[MessageBox]
[(]
["]
[Hello World!]
["]
[)]
[;]
[\n]
[\t]
[}]

This will then be sent to the Parser class.
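A minimal sketch of such a lexer could look like the following. The rule names (`Keyword`, `Identifier`, etc.) and the regex-table approach are illustrative assumptions, not a fixed design; this version also classifies tokens as it scans and skips whitespace, which is closer to how lexers usually work.

```javascript
// Hypothetical lexer sketch: scan the input with a table of regexes,
// emitting classified tokens. Rule names and order are assumptions.
const rules = [
  { type: "Keyword",    re: /^(if|def|public|function)\b/ },
  { type: "Number",     re: /^\d+/ },
  { type: "String",     re: /^"[^"]*"/ },    // the whole quoted literal is one token
  { type: "Identifier", re: /^[A-Za-z_]\w*/ },
  { type: "Operator",   re: /^==/ },
  { type: "Punct",      re: /^[(){};=]/ },
  { type: "Whitespace", re: /^\s+/ },
];

function lex(input) {
  const tokens = [];
  while (input.length > 0) {
    let matched = false;
    for (const { type, re } of rules) {
      const m = input.match(re);
      if (m) {
        // Skip whitespace instead of emitting it as a token.
        if (type !== "Whitespace") tokens.push({ type, value: m[0] });
        input = input.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("Unexpected character: " + input[0]);
  }
  return tokens;
}
```

For the example above, `lex('if (someVar == 20) { ... }')` would produce tokens such as `{ type: "Keyword", value: "if" }` and `{ type: "Identifier", value: "someVar" }`.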

The Parser

The Parser class will then read all the chunks of tokens(?) and specify what each token(?) means. It will assign a certain type to each one. So the result of the above string would be identified as something like:

[if]      // Keyword
[ ]       // Whitespace
[(]       // L_Parenthesis
[someVar] // Identifier
[ ]       // Whitespace
[==]      // Operator
[ ]       // Whitespace
[20]      // Number (or Integer)
[)]       // R_Parenthesis
[ ]       // Whitespace
[{]       // L_Bracket
[\n]      // Whitespace
[\t]      // Whitespace
[MessageBox] // Keyword
[(]       // L_Parenthesis
["]       // Not yet sure where this would go
[Hello World!] // Same..
["]       // Same...
[)]       // R_Parenthesis
[;]       // Semicolon
[\n]      // Whitespace
[\t]      // Whitespace
[}]       // R_Bracket

As you can see, I haven't fully sorted out exactly which types go where, but this should be the basic idea.

Now the next thing I'd like to do is convert that source code to another source code, thus transpiling it. But how does that work? I can't find any direct tutorials or explanations about that.

Suppose I have the following custom code:

def myVar = true;

public function myFunc ( def arg1 )
{
    if ( arg1 == true ) {
        MessageBox("My Message");
    }
}

Then the lexical process will parse this code. But then how do I convert that to something like JavaScript?

var myVar = true;

myFunc = function ( arg1 )
{
    if ( arg1 == true ) {
        alert("My Message");
    }
}

How does the mapping work, going from my custom code to a language like JavaScript? Take the function declaration: my lexical parser has the tokens public, function, myFunc. How can it know that it should map these to myFunc = function ...?

Does anyone have any good, practical information on how this should be done in a transpiler? Or am I going about this the wrong way by writing a lexical analyzer for this job?

Edit

So obviously my idea of how the lexer/parser works isn't exactly right. Any "pseudo" information on how this process works (with pseudo examples) is more than welcome.
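To make the "mapping" question concrete: the translation usually happens in a code-generation step that walks a tree (built by the parser), not on the raw token stream. A hedged sketch, where the node shapes (`VarDecl`, `FuncDecl`, ...) and the builtin-rename table are my own illustrative assumptions:

```javascript
// Sketch of code generation from an AST. Node kinds and the builtins
// table are hypothetical names for illustration only.
const builtins = { MessageBox: "alert" }; // custom name -> JavaScript name

function emit(node) {
  switch (node.kind) {
    case "VarDecl":   // def myVar = true;  ->  var myVar = true;
      return "var " + node.name + " = " + emit(node.init) + ";";
    case "FuncDecl":  // public function f(a) {...}  ->  f = function (a) {...}
      return node.name + " = function (" + node.params.join(", ") + ") {\n" +
             node.body.map(emit).join("\n") + "\n}";
    case "If":
      return "if (" + emit(node.cond) + ") { " + node.then.map(emit).join(" ") + " }";
    case "Call":      // rename builtins such as MessageBox -> alert
      return (builtins[node.callee] || node.callee) +
             "(" + node.args.map(emit).join(", ") + ");";
    case "BinOp":
      return emit(node.left) + " " + node.op + " " + emit(node.right);
    case "Literal":   // raw source text (identifiers, numbers, strings)
      return node.value;
  }
}
```

So "public function myFunc" never maps token-by-token; the parser groups those tokens into one `FuncDecl` node, and the generator decides what that whole node looks like in the target language.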

Uam answered 5/10, 2012 at 20:10 Comment(6)
What you call a lexer does not strip enough superfluous information (e.g. the whitespace), and the classification of each token (which you describe as the job of the parser) is part of the lexer's output. A pass which just breaks the input into pieces without classifying them is not useful.Helbonna
@delnan The classification is done by the parser, which in return passes it back to the lexical parser. Either way, everything will be classified. And what do you mean by "not stripping enough superfluous information, like whitespace"? This is just a basic example/idea. And besides that, I need to preserve the whitespace when I need to do a look-behind or look-ahead. That's how the old Fortran lexical analyzer got into trouble, by ignoring whitespace...Uam
The parser does higher level work (grouping tokens into expressions/statements/declarations etc., figuring out precedence, and so on). The tokens coming out of the lexer are already classified. I'd like to see a source for your definition, as I have more sources than I can recall for mine, and I have already written lexers and parsers. As for ignoring whitespace specifically: While some languages require information about whitespace, there are generally some bits of input which can and should be ignored completely. Also note that parsing is just the very first step towards generating code.Helbonna
See my SO answer "how to translate between programming languages" https://mcmap.net/q/216872/-what-kinds-of-patterns-could-i-enforce-on-the-code-to-make-it-easier-to-translate-to-another-programming-language-closed I think you'll find that learning how to do this is better started by doing some serious reading on compiler technology, than by attempting to build with little or no background.Probabilism
@delnan I don't have any source; I'm still trying to think this entire process through. And I'm sure you know what you're talking about. I'm not doubting that in any way.Uam
@Uam The "Dragon Book" is an excellent classic source for how to write a compiler (or transpiler): amazon.ca/Compilers-Principles-Techniques-Tools-2nd/dp/…Sedda
I'd highly recommend taking a look at the Decaf compiler project.

Generally speaking, a transpiler or translator is just a compiler without optimization.

Here is a broad outline of the phases of a compiler (or transpiler in this case).

  1. Lexical Analysis: This takes the input stream and converts it into tokens. Tokens are the smallest meaningful units within the language: for example, keywords, braces, and identifiers. Tokens can typically be described with regular languages (or equivalently, regular expressions), so a lexer typically uses regular expressions to scan the input stream and classify tokens.

  2. Parsing: The parser, generally speaking, needs to convert the stream of tokens into Abstract Syntax Trees (ASTs). This allows us to gather more meaning and structure from the input. For example, if "(id:x) (operator:=) (id:y)" were the incoming token stream, the parser would output a tree with (operator:=) as the root, (id:x) as the left child, and (id:y) as the right child. The structure of the AST depends on the grammar of the language; generally speaking, your parser enforces the language's grammar.

  3. Semantic Analysis: This is where you go through your ASTs and collect what other info you need from the input. A typical example would be type checking.

  4. Optimization: After semantic analysis a compiler typically performs some level of optimization. In the case of a transpiler/translator, optimization is probably not wanted.

  5. Code Generation: This is the part where you would "map" one language to another. The generator takes each AST and generates the code it dictates. A normal compiler generates assembly or machine code here; in a translator/transpiler you would instead write generators for the specific target language you want.

Of course this list isn't perfectly exhaustive, but I hope it helps you understand your project better!
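The phases above can be strung together end to end. Here is a toy pass for one statement form, "def NAME = NUMBER;", translated to "var NAME = NUMBER;". All function names and node shapes are illustrative, not from any real compiler:

```javascript
// Toy end-to-end pipeline over the phases above (names are assumptions).
function lex(src) {                  // 1. lexical analysis
  return src.match(/\w+|[=;]/g);
}
function parse(tokens) {             // 2. parsing into a tiny AST
  const [def, name, eq, value, semi] = tokens;
  if (def !== "def" || eq !== "=" || semi !== ";")
    throw new Error("syntax error");
  return { kind: "VarDecl", name, value };
}
function check(ast) {                // 3. semantic analysis (here: a type check stub)
  if (!/^\d+$/.test(ast.value)) throw new Error("expected a number");
  return ast;
}
function generate(ast) {             // 5. code generation (step 4, optimization, skipped)
  return "var " + ast.name + " = " + ast.value + ";";
}

const js = generate(check(parse(lex("def answer = 42;"))));
// js === "var answer = 42;"
```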

Circle answered 23/9, 2019 at 19:6 Comment(0)

You have already done some experiments. I advise you to pick ANTLR. Grammars for both languages you specified are present in ANTLR v4's GitHub repo. The next step is to map both grammars: variable to variable, function to function, and so on, meaning all language constructs (keywords, symbols, etc.). Any extra features need to be hard-coded, otherwise transpilation, compilation, or interpretation will not work 100%.

Imprison answered 5/11, 2019 at 12:59 Comment(0)
