Neural Networks For Generating New Programming Language Grammars

I have recently had the need to create an ANTLR language grammar for the purpose of a transpiler (converting one scripting language to another). It occurs to me that Google Translate does a pretty good job translating natural language. We have all manner of recurrent neural network models and LSTMs, and GPT-2 is generating grammatically correct text.

Question: Is there a model sufficient to train on grammar/code example combinations for the purpose of then outputting a new grammar file given an arbitrary example source-code?

Smoke answered 15/5, 2019 at 2:8 Comment(5)
No, unfortunately programmers still have to think in order to write programs. Of course, if your job is writing programs, you might consider that to be A Good Thing.Linden
I mean generating an ANTLR grammar file. Not writing new software.Smoke
Two relevant links: arxiv.org/pdf/1703.05698.pdf | becominghuman.ai/…Smoke
maybe this one? ml4code.github.io/publications/rabinovich2017abstractExtortionate
Didn't Facebook have to shut down their AI because of this exact problem? If it is automatically generated, then it might not be understandable.Antepast

I doubt any such model exists.

The main issue is that a language is generated from its grammar, and it is next to impossible to go the other way: an infinite number of grammars (and hence parse trees) are consistent with any given collection of source code.

So in your case, say you train on Python code (1,000 sample programs): the target grammar is the same for every sample, so the model will always generate that same grammar irrespective of the example source code you feed it.

If you use training samples from a number of languages, the model still can't generate the grammar of an unseen language, because the space of candidate grammars is infinite.
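
To make that concrete, here is a minimal pure-Python sketch (the function names and the toy grammars are my own, for illustration only): two structurally different grammars that generate exactly the same strings. Example programs alone cannot distinguish them, which is why recovering "the" grammar from code samples is ill-posed.

# Two different grammars that generate the same language, so no amount
# of example strings can tell them apart.
from itertools import product

def derive(symbol, rules, depth):
    # Enumerate all terminal strings derivable from `symbol`
    # within `depth` rule expansions.
    if symbol not in rules:                 # terminal symbol
        return {symbol}
    if depth == 0:
        return set()
    out = set()
    for rhs in rules[symbol]:
        parts = [derive(s, rules, depth - 1) for s in rhs]
        for combo in product(*parts):
            out.add("".join(combo))
    return out

left_recursive  = {"E": [["E", "+", "1"], ["1"]]}   # E -> E + 1 | 1
right_recursive = {"E": [["1", "+", "E"], ["1"]]}   # E -> 1 + E | 1

a = derive("E", left_recursive, 5)
b = derive("E", right_recursive, 5)
print(a == b)  # True: same strings, different grammars and parse trees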

Your example of Google Translate works for real-life translation because small errors are acceptable, but those models don't rely on recovering the underlying grammar of each language. There are some tools that can translate between programming languages, for example, but they don't generate the grammar; they work from a grammar that already exists.

Update

How to learn a grammar from code.

After comparing this to some NLP concepts, I have a list of issues that may arise and ways to counter them.

  • Dealing with variable names, coding structures and tokens.

    To understand the grammar, we'll have to break the code down to its bare-minimum form. This means understanding what each and every term in the code means. Have a look at this example:

[Figure: parse tree of a simple arithmetic expression]

The already simple expression is reduced to its parse tree. We can see that the tree breaks the expression down and tags each number as a factor. This is really important for getting rid of the human element of the code (such as variable names) and diving into the actual grammar. In NLP this concept is known as Part-of-Speech tagging. You'll have to develop your own method to do the tagging, but it's easy given that you know the grammar for the language.

  • Understanding the relations

    For this, you can tokenize the reduced code and train a model suited to the output you are looking for. If you want to generate code, use an n-gram model or an LSTM, as in this example. The model will learn the grammar, but extracting it is not a simple task: you'll have to run separate code to pull out all the relations the model has learned. A small counting sketch of the n-gram idea follows this list.
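
As a minimal sketch of that n-gram idea (the tag sequences here are assumed inputs, produced by a tagger like the one sketched after the example below), even counting adjacent tag pairs already surfaces candidate grammar relations:

# Count tag bigrams over tagged code; frequent pairs hint at grammar
# rules, e.g. ("operator", "factor") suggests an assignment's
# right-hand side can start with a factor.
from collections import Counter

tagged = [
    ["int", "variable", "operator", "factor", "expr", "factor", "end"],
    ["int", "variable", "operator", "factor", "end"],
]

bigrams = Counter(
    (a, b)
    for seq in tagged
    for a, b in zip(seq, seq[1:])
)
print(bigrams.most_common(3))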

Example

Code snippet

// Sample code (C++)
#include <iostream>
int main() {
    int a = 1 + 2;
    std::cout << a;
}

Tags

# Sample tags and tokens
 int      a          =         1       +       2       ;
[int] [variable] [operator] [factor] [expr] [factor] [end]

Leaving the operator, expr, and keyword tags in shouldn't matter if there is enough data present; they will simply become part of the learned grammar.
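
For concreteness, here is a minimal Python sketch of the tagging step (the regex lexer and tag names are my own assumptions, not a real compiler front end or an ANTLR lexer); it reproduces the tag row above for the sample statement:

import re

KEYWORDS  = {"int", "cout"}
OPERATORS = {"=", "<<"}

def tag(token):
    if token in KEYWORDS:   return token      # keywords tag as themselves, e.g. [int]
    if token in OPERATORS:  return "operator"
    if token == "+":        return "expr"     # matching the labels above
    if token == ";":        return "end"
    if token.isdigit():     return "factor"
    return "variable"                         # fallback: treat as an identifier

def tokenize(code):
    # Crude lexer: numbers, identifiers, `<<`, or single punctuation chars.
    return re.findall(r"\d+|\w+|<<|[^\s\w]", code)

print([(t, tag(t)) for t in tokenize("int a = 1 + 2;")])
# [('int', 'int'), ('a', 'variable'), ('=', 'operator'),
#  ('1', 'factor'), ('+', 'expr'), ('2', 'factor'), (';', 'end')]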

This is a sample to help convey my idea. You can improve on it by taking a deeper look at the Theory of Computation and at how automata and the different classes of grammars work.

Littlest answered 24/5, 2019 at 21:14 Comment(4)
Your point about the large search space is well taken, but I have seen a neural network that can generate C code that passes the compiler's checks. The code doesn't do anything specific (it's random statements), but doesn't the fact that it compiles at all mean that a neural network could in fact generate useful ANTLR grammars (or an arbitrary transpiler of its own)?Smoke
The thing is, parser/grammars are sort of one-to-one mappings of a textual representation to a language construct. So doesn't that greatly narrow the learning task in a general sense? My intuition (I could be wrong) is that it's not some infinite explosion of possibility.Smoke
Please see the other answer by @Jakub . I have increased the bounty.Smoke
I had a deeper look at some NLP concepts and how they could be applied to overcome the shortcomings I had previously mentioned.Littlest

What you're describing is 'just' learning the structure of context-free grammars.

I'm not sure if this approach will actually work for your case, but it's a long-standing problem in NLP: grammar induction for context-free grammars. An introduction to tackling this problem with statistical learning methods can be found in Charniak's Statistical Language Learning.
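
The supervised end of grammar induction is easy to demo with NLTK (a real library; the toy trees below are my own, and inducing a grammar from raw strings without trees is the much harder problem referred to above):

import nltk
from nltk import Tree, Nonterminal

# Toy parse trees for a tiny expression language (assumed data).
trees = [
    Tree.fromstring("(E (E (F 1)) (OP +) (F 2))"),
    Tree.fromstring("(E (F 3))"),
]

# Collect the rewrite rules each tree exhibits and estimate rule
# probabilities from their frequencies.
productions = [p for t in trees for p in t.productions()]
grammar = nltk.induce_pcfg(Nonterminal("E"), productions)
print(grammar)  # a PCFG: E -> E OP F [0.33], E -> F [0.67], ...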

Note that what I described applies to CFGs in general, but you might want to look into induction for LL grammars specifically, because parser generators mostly use that class of grammar.

Quince answered 5/6, 2019 at 19:26 Comment(2)
Right. This is why I don't think it's as difficult a task as 'Learning to program' as it is 'Learning the grammar of programming languages' instead. Those are two very different tasks. This answer might be the best I get. We will see in the next few days. Thanks.Smoke
Well, comparing this to learning to program is just silly, because programming actually needs semantics, and what you're asking for is purely syntactic. What you're interested in should actually be easier than the NLP problem, because most programming languages have unambiguous grammars, and the fact of being LL adds additional structure - but I'm not sure if anyone has indeed done anything that exploits these facts.Quince

I know nothing about ANTLR, but there are pretty good examples of translating natural language into, e.g., valid SQL queries: http://nlpprogress.com/english/semantic_parsing.html#sql-parsing.

Quintie answered 5/6, 2019 at 19:15 Comment(0)
