Asked 13/9, 2010 at 20:45 Answered 10/10, 2023 at 11:32

Solved algorithm language-agnostic artificial-intelligence interpreter pseudocode

Like lots of you guys on SO, I often write in several languages. And when it comes to planning stuff, (or even answering some SO questions), I actually think and write in some unspecified hybrid language. Although I used to be taught to do this using flow diagrams or UML-like diagrams, in retrospect, I find "my" pseudocode language has components of C, Python, Java, bash, Matlab, perl, Basic. I seem to unconsciously select the idiom best suited to expressing the concept/algorithm.

Common idioms might include Java-like braces for scope, pythonic list comprehensions or indentation, C++like inheritance, C#-style lambdas, matlab-like slices and matrix operations.

I noticed that it's actually quite easy for people to recognise exactly what I'm triying to do, and quite easy for people to intelligently translate into other languages. Of course, that step involves considering the corner cases, and the moments where each language behaves idiosyncratically.

But in reality, most of these languages share a subset of keywords and library functions which generally behave identically - maths functions, type names, while/for/if etc. Clearly I'd have to exclude many 'odd' languages like lisp, APL derivatives, but...

So my questions are,

Does code already exist that recognises the programming language of a text file? (Surely this must be a less complicated task than eclipse's syntax trees or than google translate's language guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?
Is it theoretically possible to create a single interpreter or compiler that recognises what language idiom you're using at any moment and (maybe "intelligently") executes or translates to a runnable form. And flags the corner cases where my syntax is ambiguous with regards to behaviour. Immediate difficulties I see include: knowing when to switch between indentation-dependent and brace-dependent modes, recognising funny operators (like *pointer vs *kwargs) and knowing when to use list vs array-like representations.
Is there any language or interpreter in existence, that can manage this kind of flexible interpreting?
Have I missed an obvious obstacle to this being possible?

edit

Thanks all for your answers and ideas. I am planning to write a constraint-based heuristic translator that could, potentially, "solve" code for the intended meaning and translate into real python code. It will notice keywords from many common languages, and will use syntactic clues to disambiguate the human's intentions - like spacing, brackets, optional helper words like let or then, context of how variables are previously used etc, plus knowledge of common conventions (like capital names, i for iteration, and some simplistic limited understanding of naming of variables/methods e.g containing the word get, asynchronous, count, last, previous, my etc). In real pseudocode, variable naming is as informative as the operations themselves!

Using these clues it will create assumptions as to the implementation of each operation (like 0/1 based indexing, when should exceptions be caught or ignored, what variables ought to be const/global/local, where to start and end execution, and what bits should be in separate threads, notice when numerical units match / need converting). Each assumption will have a given certainty - and the program will list the assumptions on each statement, as it coaxes what you write into something executable!

For each assumption, you can 'clarify' your code if you don't like the initial interpretation. The libraries issue is very interesting. My translator, like some IDE's, will read all definitions available from all modules, use some statistics about which classes/methods are used most frequently and in what contexts, and just guess! (adding a note to the program to say why it guessed as such...) I guess it should attempt to execute everything, and warn you about what it doesn't like. It should allow anything, but let you know what the several alternative interpretations are, if you're being ambiguous.

It will certainly be some time before it can manage such unusual examples like @Albin Sunnanbo's ImportantCustomer example. But I'll let you know how I get on!

Katheryn answered 13/9, 2010 at 20:45 Comment(9)

Do you know why quite a few programming languages can be parsed with a LL(1) parser (i.e. only looks at the next token) while natural language parsing still doesn't really work? A programming language (even Perl) has fixed semantics associated with certain syntax. You're asking for a program that reads random glibberish and makes up the semantics the writer had in mind. You might as well ask for a strong AI. – Dynasty 13/9, 2010 at 20:52

In one sense, yes! But actually I was trying to simplify the problem of 'intelligence' into 1. recognising a line or phrase as a specific language, then 2. binding the variable names in the general stack frame. In another sense, I'm asking what is it that you and I do (algorithmically, in our head) more-or-less effortlessly, when reading pseudocode. – Katheryn 13/9, 2010 at 20:58

I always thought this was a pretty good pseudocode interpreter – Dreiser 13/9, 2010 at 21:8

I hate to be a naysayer, because we don't have enough crazy ideas in this field. But even if this could work, it would be unpleasant to use, for the same reason that, say, AppleScript is unpleasant to use. The ultimate hybrid language you implement would be highly non-compact: it would be difficult to predict what anything would do, and difficult to figure out how to specify a given behavior. Historically, compact, easily-modeled languages have won over complicated languages. – Painting 13/9, 2010 at 21:9

I guess the most practical step would be design your own language that takes all these features that you like and then build an interpreter for it. Instead of trying to recognize and apply a different parser for each program segment, design a language with a unified and consistent syntax that supports these features. – Cowbell 13/9, 2010 at 21:33

There are a number of similar topics already on SO (in fact I could have sword we did this already, but I can't find it). Related: What programming language best bridges the gap between pseudocode and code? [Pseudocode: a clear definition? ](stackoverflow.com/questions/2489745) – Gustav 14/9, 2010 at 2:35

Yukihiro Matsumoto had this exact same problem. He had things he liked in most languages (from BASIC to Pascal to lisp) but not all in a single language. His solution? Invent his own pseudocode syntax that combines all the idea he liked and then wrote an interpreter for it. The result: Ruby. It looks like you already have a syntax/semantics in mind just like Matz did when he started. And just like your crazy mixed up language feels natural to you, Ruby is a style of pseudocode-turned-real-code that feels natural to Matz. I'm not saying use Ruby. I'm saying write an interpreter. – Kylander 14/9, 2010 at 4:2

I have created an English-to-Python translator called EngScript, which is the closest thing I've found to a pseudocode converter. – Stench 19/2, 2014 at 20:37

@AndersonGreen Thanks! that looks great. I'll take a look. Maybe I can modify it to accept C, fortran, java syntax too? – Katheryn 3/3, 2014 at 18:9

To detect what programming language is used: Detecting programming language from a snippet
I think it should be possible. The approach in 1. could be leveraged to do this, I think. I would try to do it iteratively: detect the syntax used in the first line/clause of code, "compile" it to intermediate form based on that detection, along with any important syntax (e.g. begin/end wrappers). Then the next line/clause etc. Basically write a parser that attempts to recognize each "chunk". Ambiguity could be flagged by the same algorithm.
I doubt that this has been done ... seems like the cognitive load of learning to write e.g. python-compatible pseudocode would be much easier than trying to debug the cases where your interpreter fails.
a. I think the biggest problem is that most pseudocode is invalid in any language. For example, I might completely skip object initialization in a block of pseudocode because for a human reader it is almost always straightforward to infer. But for your case it might be completely invalid in the language syntax of choice, and it might be impossible to automatically determine e.g. the class of the object (it might not even exist). Etc.
b. I think the best you can hope for is an interpreter that "works" (subject to 4a) for your pseudocode only, no-one else's.

Note that I don't think that 4a,4b are necessarily obstacles to it being possible. I just think it won't be useful for any practical purpose.

Eri answered 13/9, 2010 at 21:25 Comment(1)

To parse pseudocode accurately, you'd need to generate a parser that can handle ambiguous grammars. It's fairly easy to do this using an Earley parser generator. – Stench 27/8, 2020 at 19:25

I think that is quite useless for everything but toy examples and strict mathematical algorithms. For everything else the language is not just the language. There are lots of standard libraries and whole environments around the languages. I think I write almost as many lines of library calls as I write "actual code".

In C# you have .NET Framework, in C++ you have STL, in Java you have some Java libraries, etc.

The difference between those libraries are too big to be just syntactic nuances.

<subjective>
There has been attempts at unifying language constructs of different languages to a "unified syntax". That was called 4GL language and never really took of.
</subjective>

As a side note I have seen a code example about a page long that was valid as c#, Java and Java script code. That can serve as an example of where it is impossible to determine the actual language used.

Edit:

Besides, the whole purpose of pseudocode is that it does not need to compile in any way. The reason you write pseudocode is to create a "sketch", however sloppy you like.

foreach c in ImportantCustomers{== OrderValue >=$1M}
    SendMailInviteToSpecialEvent(c)

Now tell me what language it is and write an interpreter for that.

Quill answered 13/9, 2010 at 21:5 Comment(1)

I think that the point is that the OP swims in such a language soup that even he doesn't know what language he's writing in, and he wants to feel free to write pseudocode and have the computer fix it (translate to a target language). So, if you look at the answer the OP himself posted, you will see that "Github Copilot and ChatGPT" have solved that problem for him. Your last code example doesn't look like any of these languages -- "I find "my" pseudocode language has components of C, Python, Java, bash, Matlab, perl, Basic." – Evvy 26/5, 2024 at 0:57

Recognizing what language a program is in is really not that big a deal. Recognizing the language of a snippet is more difficult, and recognizing snippets that aren't clearly delimited (what do you do if four lines are Python and the next one is C or Java?) is going to be really difficult.

Assuming you got the lines assigned to the right language, doing any sort of compilation would require specialized compilers for all languages that would cooperate. This is a tremendous job in itself.

Moreover, when you write pseudo-code you aren't worrying about the syntax. (If you are, you're doing it wrong.) You'll wind up with code that simply can't be compiled because it's incomplete or even contradictory.

And, assuming you overcame all these obstacles, how certain would you be that the pseudo-code was being interpreted the way you were thinking?

What you would have would be a new computer language, that you would have to write correct programs in. It would be a sprawling and ambiguous language, very difficult to work with properly. It would require great care in its use. It would be almost exactly what you don't want in pseudo-code. The value of pseudo-code is that you can quickly sketch out your algorithms, without worrying about the details. That would be completely lost.

If you want an easy-to-write language, learn one. Python is a good choice. Use pseudo-code for sketching out how processing is supposed to occur, not as a compilable language.

Amaras answered 13/9, 2010 at 21:45 Comment(0)

An interesting approach would be a "type-as-you-go" pseudocode interpreter. That is, you would set the language to be used up front, and then it would attempt to convert the pseudo code to real code, in real time, as you typed. An interactive facility could be used to clarify ambiguous stuff and allow corrections. Part of the mechanism could be a library of code which the converter tried to match. Over time, it could learn and adapt its translation based on the habits of a particular user.

People who program all the time will probably prefer to just use the language in most cases. However, I could see the above being a great boon to learners, "non-programmer programmers" such as scientists, and for use in brainstorming sessions with programmers of various languages and skill levels.

-Neil

Seward answered 14/9, 2010 at 17:8 Comment(0)

Programs interpreting human input need to be given the option of saying "I don't know." The language PL/I is a famous example of a system designed to find a reasonable interpretation of anything resembling a computer program that could cause havoc when it guessed wrong: see http://horningtales.blogspot.com/2006/10/my-first-pli-program.html

Note that in the later language C++, when it resolves possible ambiguities it limits the scope of the type coercions it tries, and that it will flag an error if there is not a unique best interpretation.

Untune answered 14/9, 2010 at 18:27 Comment(1)

My recollection of the PL/I compiler I used is that it would attempt to muddle through a compilation pass in the presence of errors, but any errors would cause it to stop after that. Given that feeding code into the compiler required having an operator physically load a deck of cards into the machine, it was desirable to get as many useful diagnostics as possible out of each submission, even if that meant the compiler would also output a lot of useful ones. Very different from early Borland compilers which would simply stop at the first error (but take almost no time getting there). – Seta 24/1, 2014 at 16:44

I have a feeling that the answer to 2. is NO. All I need to prove it false is a code snippet that can be interpreted in more than one way by a competent programmer.

Foiled answered 13/9, 2010 at 23:52 Comment(2)

Surely this would be possible to spot with the appropriate tools, and "Flag up as ambiguous" as I suggeted? Or not? Compilers for C etc have rules for ambiguity within the language; why not have such rules for multiple languages? If such checking rules were available, even our programming in "standard" languages might improve, as the compiler would pick up on language-dependent tricks. – Katheryn 14/9, 2010 at 0:36

Nonetheless, there are several programs that can automatically recognize programming languages. – Stench 16/4, 2016 at 8:1

Does code already exist that recognises the programming language of a text file?

Yes, the Unix file command.

(Surely this must be a less complicated task than eclipse's syntax trees or than google translate's language guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?

As far as I can tell, SO has a one-size-fits-all syntax highlighter that tries to combine the keywords and comment syntax of every major language. Sometimes it gets it wrong:

def median(seq):
    """Returns the median of a list."""
    seq_sorted = sorted(seq)
    if len(seq) & 1:
        # For an odd-length list, return the middle item
        return seq_sorted[len(seq) // 2]
    else:
        # For an even-length list, return the mean of the 2 middle items
        return (seq_sorted[len(seq) // 2 - 1] + seq_sorted[len(seq) // 2]) / 2

Note that SO's highlighter assumes that // starts a C++-style comment, but in Python it's the integer division operator.

This is going to be a major problem if you try to combine multiple languages into one. What do you do if the same token has different meanings in different languages? Similar situations are:

Is ^ exponentiation like in BASIC, or bitwise XOR like in C?
Is || logical OR like in C, or string concatenation like in SQL?
What is 1 + "2"? Is the number converted to a string (giving "12"), or is the string converted to a number (giving 3)?

Is there any language or interpreter in existence, that can manage this kind of flexible interpreting?

On another forum, I heard a story of a compiler (IIRC, for FORTRAN) that would compile any program regardless of syntax errors. If you had the line

= Y + Z

The compiler would recognize that a variable was missing and automatically convert the statement to X = Y + Z, regardless of whether you had an X in your program or not.

This programmer had a convention of starting comment blocks with a line of hyphens, like this:

C ----------------------------------------

But one day, they forgot the leading C, and the compiler choked trying to add dozens of variables between what it thought was subtraction operators.

"Flexible parsing" is not always a good thing.

Nostomania answered 17/9, 2010 at 4:15 Comment(1)

Thanks for all these examples! Vv interesting and useful for what I'm going to do. Well- all these are examples of what I'd call "inflexible parsing"! So, the meaning of ^ is context dependent - guessed by how you use the variable elsewhere - e.g. with other logical ops/flags later in the file, or whether it is used as a drawing coordinate etc.. Basically, humans rarely have problems knowing what was meant in pseudocode. So the compiler will remind you where there are ambiguities, which assumption it has made and why, and whether you want to clarify it or leave it if the meaning is obvious. – Katheryn 17/9, 2010 at 16:30

To create a "pseudocode interpreter," it might be necessary to design a programming language that allows user-defined extensions to its syntax. There already are several programming languages with this feature, such as Coq, Seed7, Agda, and Lever. A particularly interesting example is the Inform programming language, since its syntax is essentially "structured English."

The Coq programming language allows "syntax extensions", so the language can be extended to parse new operators:

Notation "A /\ B" := (and A B).

Similarly, the Seed7 programming language can be extended to parse "pseudocode" using "structured syntax definitions." The while loop in Seed7 is defined in this way:

syntax expr: .while.().do.().end.while is -> 25;

Alternatively, it might be possible to "train" a statistical machine translation system to translate pseudocode into a real programming language, though this would require a large corpus of parallel texts.

Stench answered 3/4, 2019 at 16:14 Comment(0)

Finally - solved!

As of 2023, my question from 2013 has now been answered of course. Many of you answered with great ideas. I did try building the parser, using an "antibody-antigen" matching system, where syntax-matching antibodies "bind" probabilistically to regions of the code. However it rapidly got too complex, and had exponential solve times.

Github Copilot (arxiv paper) and ChatGPT have remarkably solved the problem, in a way I could not have imagined before. I tried Albin Sunnanbo's question in Copilot:

# pseudocode: 
# foreach c in ImportantCustomers{== OrderValue >=$1M}
#    SendMailInviteToSpecialEvent(c)
# python:

output

for c in ImportantCustomers:
    if c.OrderValue >= 1000000:
        SendMailInviteToSpecialEvent(c)

and even

# c# code:
var importantCustomers = from c in customers
                         where c.OrderValue >= 1000000
                         select c;
foreach (var c in importantCustomers)
{
    SendMailInviteToSpecialEvent(c);
}

So after 10 years I consider the problem solved!

Katheryn answered 10/10, 2023 at 11:32 Comment(1)

Thanks for following up! I was thinking that ChatGPT etc would work well in this capacity for you. Did you try Seed7? – Evvy 26/5, 2024 at 0:59

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

edit

Edit:

Finally - solved!

So after 10 years I consider the problem solved!

Recommended topics

Hot tags