Implement Language Auto-Completion based on ANTLR4 Grammar

Asked 6/5, 2016 at 9:12 Answered 7/5, 2016 at 9:39

I am wondering if there are any examples (I haven't found any by googling) of TAB auto-completion for a command-line interface (console) that use ANTLR4 grammars to predict the next term (as in a REPL).

I've written a PL/SQL grammar for an open source database, and now I would like to implement a command-line interface to the database that lets the user complete statements according to the grammar, or eventually discover the proper database object name to use (e.g. a table name, a trigger name, the name of a column, etc.).

Thanks for pointing me in the right direction.

Convoy answered 6/5, 2016 at 9:12 Comment(1)

See: #19738939 – Pharyngeal 6/5, 2016 at 11:13

Actually it is possible! (How feasible depends, of course, on the complexity of your grammar.) The problem with auto-completion and ANTLR is that you want to parse an expression that is not yet complete. If you had the complete expression, it would be no big problem to know what kind of element sits at each place and what can be used there. But you do not have the complete expression, and you cannot parse an incomplete one. So what you need to do is wrap the input in a wrapper/helper that completes the expression to create a parseable one. Note that nothing added only to complete the expression matters to you: you will only ask for members up to the last character the user actually typed.

So:

A) Create the wrapper that will change this (Excel formula) '=If(' into '=If()'

B) Parse the wrapped input

C) Realize that you are in the IF function at the first parameter

D) Return all that can go into that place.
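Step A can be sketched as follows. This is only a minimal illustration, assuming the language uses `()`, `[]`, `{}` and double-quoted strings; the function name `complete_fragment` is hypothetical and is not part of ANTLR. It merely balances delimiters, so a real wrapper would still have to supply missing operands and keywords before the parser accepts the result.

```python
# Map each opening delimiter to the closer that must eventually follow it.
CLOSERS = {"(": ")", "[": "]", "{": "}"}


def complete_fragment(fragment: str) -> str:
    """Append the closers needed to balance `fragment` (hypothetical helper)."""
    stack = []          # closers still owed, innermost last
    in_string = False   # True while inside a double-quoted string literal
    for ch in fragment:
        if in_string:
            if ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in CLOSERS:
            stack.append(CLOSERS[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
    suffix = '"' if in_string else ""
    # Close the innermost delimiter first.
    return fragment + suffix + "".join(reversed(stack))


print(complete_fragment("=If("))
```

The completed text is what would be handed to the parser in step B; the position of the last character the user typed has to be remembered separately so that step C looks at the right place.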

It actually works; I have built IntelliSense editors for several simple languages this way. There is much more infrastructure than this, but the basic idea is as I described it. Be careful, though: writing the wrapper is hard, if not impossible, when the grammar is really complex. In that case, look at the Papa Carlo project: http://lakhin.com/projects/papa-carlo/

Nicolais answered 6/5, 2016 at 12:5 Comment(6)

The problem with this idea is that it supposes you know how to extend the code fragment you have, so that you can give the extended fragment to the parser. But if you know how to complete the fragment, you already know what comes next. You might be able to do this manually for a small grammar. I pity the guy that tries this with PL/SQL, whose grammar is enormous and complicated. – Facetious 6/5, 2016 at 14:51

@IraBaxter I agree with you on the principle: predictable analysis has more to do with AI rather than text analysis/parsing. Anyway, an ANTLR grammar is a finite state machine, so in theory it shouldn't be extremely complex to do. In fact, the alternatives at each terminal reduce to a forced state. Anyway, if I had the answer already, I wouldn't have asked :) – Convoy 6/5, 2016 at 15:24

1) Grammars are not finite state machines, and are not representable as FSAs. 2) "Alternatives at each terminal.." eh? You'd be much better equipped to do this task if you understood how parsers and (more importantly) parser generators work. 1) and 2) make me think you do not. – Facetious 6/5, 2016 at 16:28

@IraBaxter That is not totally true. It is divide and conquer. It is much easier to complete the expression and then parse it to get a model than to scrutinize the original input as is, wondering all the time whether you are in a comment or a string, or whether it is a string inside a comment, etc. But I do agree that the approach is usable only for simpler grammars. – Nicolais 9/5, 2016 at 7:3

You have to take any (assumed valid) fragment and guess what is needed to continue it before consulting the parser. This means that for every partial phrase, your completer has to have a good idea of what comes next. To do that, it must have some detailed concept of the grammar, and so grammar complexity has a huge impact. OP wants to do PL/SQL, which has a huge grammar. Also, consider continuing from the middle of an expression, e.g. A*(B+C[, which means you have to guess a sequence of tokens. Assuming you do that and hand it to the parser, it says "yes"; what tokens (plural) come next in the middle? – Facetious 9/5, 2016 at 7:32

What the autocompleter has to do is compute FOLLOWS(substring) because it wants to offer that set to the user. Your proposal is fundamentally producing a bad approximation by choosing a member of FOLLOWS by ad hoc methods to complete the string. a) that's just an approximation which gets worse ("ad hoc" always loses) as the grammar grows big b) it won't produce the whole set. I just don't see this as practical. – Facetious 9/5, 2016 at 7:40

As already mentioned auto completion is based on the follow set at a given position, simply because this is what we defined in the grammar to be valid language. But that's only a small part of the task. What you need is context (as Sam Harwell wrote: it's a semantic process, not a syntactic one). And this information is independent of the parser. And since a parser is made to parse valid input (and during auto completion you have most of the time invalid input), it's not the right tool for this task.

Knowing what token can follow at a given position is useful to control the entire process (e.g. you don't want to show suggestions if only a string can appear), but is most of the time not what you actually want to suggest (except for keywords). If an ID is possible at the current position, it doesn't tell you what ID is actually allowed (a variable name? a namespace? etc.). So what you need is essentially 3 things:

1. A symbol table that provides you with all possible names sorted by scope. Creating this depends heavily on the parsed language. But this is a task where a parser is very helpful. You may want to cache this info as it is time consuming to run this analysis step.
2. Determine in which scope you are when invoking auto completion. You could use a parser as well here (maybe in conjunction with step 1).
3. Determine what type of symbol(s) you want to show. Many people think this is where a parser can give you all necessary information (the follow set). But as mentioned above that's not true (keywords aside).
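To make the first two items concrete, here is a minimal sketch of a scoped symbol table. It assumes a simple parent-linked scope model; the class `Scope`, the function `suggest`, and the kind labels ("table", "trigger", "variable") are all hypothetical names chosen for illustration, and a real implementation would populate the table from a parse of the code or from the database catalog.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scope:
    """One lexical scope; `symbols` maps a symbol name to its kind."""
    name: str
    parent: Optional["Scope"] = None
    symbols: dict = field(default_factory=dict)

    def define(self, symbol: str, kind: str) -> None:
        self.symbols[symbol] = kind

    def visible(self, kind: str) -> list:
        """Names of `kind` visible from here, innermost scope first."""
        found, seen, scope = [], set(), self
        while scope is not None:
            for symbol, symbol_kind in sorted(scope.symbols.items()):
                # A name already found in an inner scope is not repeated.
                if symbol_kind == kind and symbol not in seen:
                    seen.add(symbol)
                    found.append(symbol)
            scope = scope.parent
        return found


def suggest(scope: Scope, kind: str, prefix: str) -> list:
    """Visible names of `kind` starting with `prefix` (case-insensitive)."""
    wanted = prefix.lower()
    return [s for s in scope.visible(kind) if s.lower().startswith(wanted)]


schema = Scope("schema")
schema.define("employees", "table")
schema.define("departments", "table")
schema.define("emp_audit", "trigger")
block = Scope("block", parent=schema)
block.define("emp_count", "variable")

print(suggest(block, "table", "emp"))
```

Which `kind` to ask for is exactly the third item in the list: the symbol table can answer "which tables are visible here?", but something else has to decide that a table is what belongs at the caret.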

In my blog post Universal Code Completion using ANTLR3 I mainly addressed the 3rd step. There I don't use a parser but simulate one, with one difference: I don't stop where a parser would, but when the caret position is reached (so it is essential that the input is valid syntax up to that point). After reaching the caret the collection process starts, which not only collects terminal nodes (for keywords) but also looks at the rule names to learn what else needs to be collected. Using specific rule names is my way of putting context into the grammar: when the collection code finds a rule table_ref, it knows that it doesn't need to go further down the rule chain (to the ultimate ID token), but can instead use this information to provide a list of tables as suggestions.

With ANTLR4 things might become even simpler. I haven't used it myself yet, but the parser interpreter could be a big help here, as it is essentially doing what I do manually in my implementation (with the ANTLR3 backend).
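The idea of walking the grammar up to the caret and then collecting can be sketched on a toy grammar. This is not the code from the blog post and it does not use any ANTLR API; the grammar encoding (a dict of rule name to alternatives), the function `collect_candidates`, and the rule and token names are assumptions made for the illustration. It also assumes the grammar has no left recursion and that the token list is valid up to the caret.

```python
# Toy grammar: rule name -> list of alternatives, each a list of symbols.
# A symbol that is not a key of the dict is treated as a token type.
GRAMMAR = {
    "statement": [["SELECT", "column_list", "FROM", "table_ref", "where_clause"]],
    "column_list": [["column_ref", "more_columns"]],
    "more_columns": [["COMMA", "column_ref", "more_columns"], []],
    "where_clause": [["WHERE", "column_ref", "EQ", "ID"], []],
    "column_ref": [["ID"]],
    "table_ref": [["ID"]],
}
# Rules whose name already says what to suggest; do not descend to ID.
STOP_RULES = {"column_ref", "table_ref"}


def collect_candidates(grammar, start, tokens, stop_rules):
    """Return the token types / rule names that may appear after `tokens`."""
    candidates = set()

    def match_seq(seq, pos):
        # Set of token positions reachable after matching `seq` from `pos`.
        positions = {pos}
        for symbol in seq:
            positions = set().union(*(match_symbol(symbol, p) for p in positions))
            if not positions:
                break
        return positions

    def match_symbol(symbol, pos):
        at_caret = pos == len(tokens)
        if symbol in grammar:  # a rule
            if at_caret and symbol in stop_rules:
                candidates.add(symbol)
                return set()
            return set().union(*(match_seq(alt, pos) for alt in grammar[symbol]))
        if at_caret:  # a token type expected exactly at the caret
            candidates.add(symbol)
            return set()
        return {pos + 1} if tokens[pos] == symbol else set()

    match_seq([start], 0)
    return candidates


print(sorted(collect_candidates(GRAMMAR, "statement", ["SELECT", "ID"], STOP_RULES)))
```

A caller would then turn `table_ref` or `column_ref` into actual names by asking the symbol table, and offer the remaining entries (here `COMMA`, `FROM`, `WHERE`) as keywords or punctuation.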

Zindman answered 7/5, 2016 at 9:39 Comment(0)

This is probably pretty hard to do.

Fundamentally you want to use some parser to predict "what comes next" to display as auto-completion. This has to at least predict what the FIRST token is at the point where the user's input stops.

For ANTLR, I think this will be very difficult. The reason is that ANTLR generates essentially procedural, recursive descent parsers. So at runtime, when you need to figure out what FIRST tokens are, you have to inspect the procedural source code of the generated parser. That way lies madness.

This blog entry claims to achieve autocompletion by collecting error reports rather than inspecting the parser code. It's sort of an interesting idea, but I do not understand how his method really works, and I cannot see how it would offer all possible FIRST tokens; it might acquire some of them. This SO answer confirms my intuition.

Sam Harwell discusses how he has tackled this; he is one of the ANTLR4 implementers and if anybody can make this work, he can. It wouldn't surprise me if he reached inside ANTLR to extract the information he needs; as an ANTLR implementer he would certainly know where to tap in. You are not likely to be so well positioned. Even so, he doesn't really describe what he did in detail. Good luck replicating. You might ask him what he really did.

What you want is a parsing engine for which that FIRST token information is either directly available (the parser generator could produce it) or computable based on the parser state. This is actually possible to do with bottom-up parsers such as LALR(k); you can build an algorithm that walks the state tables and computes this information. (We do this with our DMS Software Reengineering Toolkit for its GLR parser precisely to produce syntax error reports that say "missing token, could be any of these [set]".)
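For readers unfamiliar with the term, the following sketch shows the textbook fixed-point computation of FIRST sets for a small context-free grammar. It is not the state-table walk described above and has nothing to do with DMS or ANTLR internals; the grammar encoding, the function names, and the toy rules are assumptions made for the example. It only shows what "the set of tokens that can start a rule" means when a generator computes it ahead of time.

```python
EPSILON = ""  # marks "this rule can match nothing"

# Toy grammar: rule name -> alternatives; non-keys are token types.
GRAMMAR = {
    "statement": [["select_stmt"], ["delete_stmt"]],
    "select_stmt": [["hint", "SELECT", "ID"]],
    "delete_stmt": [["DELETE", "FROM", "ID"]],
    "hint": [["HINT"], []],
}


def first_of_sequence(seq, first, grammar):
    """FIRST of a symbol sequence, given the current FIRST sets of the rules."""
    result = set()
    for symbol in seq:
        if symbol not in grammar:  # token type: it starts the sequence
            result.add(symbol)
            return result
        result |= first[symbol] - {EPSILON}
        if EPSILON not in first[symbol]:
            return result
    result.add(EPSILON)  # every symbol could vanish, or seq was empty
    return result


def first_sets(grammar):
    """Iterate until no FIRST set grows any further."""
    first = {rule: set() for rule in grammar}
    changed = True
    while changed:
        changed = False
        for rule, alternatives in grammar.items():
            for alt in alternatives:
                before = len(first[rule])
                first[rule] |= first_of_sequence(alt, first, grammar)
                changed = changed or len(first[rule]) != before
    return first


print(sorted(first_sets(GRAMMAR)["statement"]))
```

These sets are per rule, not per input position; offering completions at an arbitrary point in the user's input additionally needs the parser state at that point, which is the hard part discussed above.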

Facetious answered 6/5, 2016 at 11:18 Comment(0)
