Code for identifying programming language in a text file [closed]

I'm supposed to write code that, given a text file (source code) as input, will output which programming language it is. This is the most basic definition of the problem. More constraints follow:

  • I must write this in C++.
  • A wide variety of languages should be recognized - HTML, PHP, Perl, Ruby, C, C++, Java, C#...
  • The number of false positives (wrong recognitions) should be low - it is better to output "unknown" than a wrong result (it would then appear in the list of probabilities as, for example, unknown: 100%; see below).
  • The output should be a list of probabilities for each language the code knows, so if it knows C, Java and Perl, the output could be, for example: C: 70%, Java: 50%, Perl: 30% (note that the probabilities need not sum to 100%). A rough sketch of the interface I have in mind follows this list.
  • It should have a good accuracy/speed ratio (speed is favored a bit more).
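
Roughly, this is the shape of the interface I have in mind (all names here are placeholders of my own, and the body is just a stub):

    #include <iostream>
    #include <map>
    #include <string>

    // Placeholder interface: per-language confidence in [0, 1].
    using Scores = std::map<std::string, double>;

    Scores classify(const std::string& sourceText) {
        // Stub - the real classifier goes here; "unknown" is the safe default.
        (void)sourceText;
        return {{"unknown", 1.0}};
    }

    int main() {
        for (const auto& [lang, p] : classify("int main() { return 0; }"))
            std::cout << lang << ": " << p * 100 << "%\n";
    }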

It would be very nice if the code could be written in such a way that adding new languages for recognition is fairly easy and involves just adding "settings/data" for that particular language. I can use anything available - a heuristic, a neural network, black magic, anything. I'm even allowed to use existing solutions, but: the solution must be free, open source and allow commercial usage. It must come in the form of easily integrable source code or a static library - no DLLs. However, I'd prefer writing my own code or just using fragments of another solution; I'm fed up with integrating other people's code. One last note: some of you may suggest FANN (Fast Artificial Neural Network Library) - this is the only thing I cannot use, since it is what we use ALREADY and we want to replace it.

Now the question is: how would you handle such a task? What would you do? Any suggestions on how to implement this or what to use?

EDIT: Based on the comments and answers, I must emphasize some things I forgot: speed is crucial, since the code will be given thousands of files and is supposed to answer fast. Looking at a thousand files should produce answers for all of them within a few seconds at most (the files will be small, of course - a few kB each), so trying to compile each one is out of the question. And I really do want probabilities for each language - I'd rather learn that a file is likely C or C++ but that the chance it is a bash script is very low. Due to code obfuscation, comments etc., I think aiming for 100% accuracy is a bad idea and in fact is not the goal of this.

Colp answered 30/8, 2010 at 12:18 Comment(11)
A rather tongue-in-cheek idea - run it through a compiler for each language, and pick the one that doesn't error? ;) (Yes, I know - probably slow, prone to be totally wrong if the code doesn't compile, or if the user is writing polyglots... etc.)Artificer
+1: Nice question. But I think the "probabilities" part doesn't make sense: the input is either legal in a particular language or it is not. I don't see what it would mean for it to have a higher probability of belonging to language A than to language B.Vince
I would expect this to be a very easy problem, with the only obstacle being similarities in C/C++. This is a scenario which could be solved simply or vastly over-complicated.Nicaea
@Job: If you do not use the official lexer for the language and version in question, there is a probability that you are wrong. It is a Good Thing[tm] to be aware of this. Depending on the application, you might want to report an identification even if the user has minor syntax errors. And finally in the case of polyglots, you will have a high probability/identification for several languages.Cittern
what about polyglot programs?Buskus
Thanks for all the answers folks, it amazes me how fast the community here can answer. I have updated the question with a few more details.Colp
See here? #475533Petersburg
@Philip Potter: this is one of the reasons I want probabilities, not one definitive answer.Colp
@Colp A polyglot would still give confusing answers (assuming a naive Bayes classifier), btw - because 99% of the code could be a comment in one language, whereas 100% of the code could be legal in another language - so the chances are that the classifier would say "80-100% likely that it's language A, 1% that it's language B", despite the fact that it's actually 100% for both. I don't think it'd be easy to teach a naive Bayes classifier about comment lines...Artificer
@Artificer For a polyglot program a naive Bayes classifier will give something like 50-50 exactly because it doesn't understand comments. It will see just the keywords in both languages.Bulahbulawayo
@djv No, it might not. A polyglot for languages A and B doesn't have to have its text split 50-50 between A and B. It might have a massive program written in A and a tiny one written in B. In this case, many words will appear which are noted to be of language A, and far fewer which are noted to be of language B, and therefore an NB classifier will expect it to be language A...Artificer
Answer by Bulahbulawayo (score 11)

You have a document classification problem. I suggest you read about naive Bayes classifiers and support vector machines. The articles link to libraries that implement these algorithms, many of which have C++ interfaces.
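
To make that concrete, here is a bare-bones naive Bayes word-count scorer (the models, counts and vocabulary size are made up for illustration; a real classifier is trained on actual source files per language):

    #include <cmath>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // One model per language: word -> occurrences seen in training data.
    struct LanguageModel {
        std::string name;
        std::map<std::string, int> wordCounts;
        int totalWords = 0;
    };

    std::vector<std::string> tokenize(const std::string& text) {
        std::istringstream in(text);
        std::vector<std::string> words;
        for (std::string w; in >> w;) words.push_back(w);
        return words;
    }

    // Log-likelihood of the document under one model, with add-one (Laplace)
    // smoothing so unseen words do not zero the whole product out.
    double logLikelihood(const LanguageModel& m,
                         const std::vector<std::string>& words, int vocab) {
        double ll = 0.0;
        for (const auto& w : words) {
            auto it = m.wordCounts.find(w);
            int c = (it == m.wordCounts.end()) ? 0 : it->second;
            ll += std::log((c + 1.0) / (m.totalWords + vocab));
        }
        return ll;
    }

    int main() {
        // Made-up counts; train on real per-language corpora in practice.
        std::vector<LanguageModel> models = {
            {"C++",  {{"#include", 50}, {"std::cout", 30}, {"template", 20}}, 100},
            {"Perl", {{"my", 60}, {"sub", 25}, {"use", 15}}, 100},
        };
        const int vocab = 1000; // rough vocabulary-size estimate for smoothing

        auto words = tokenize("#include <iostream> int main() { std::cout << 1; }");
        for (const auto& m : models)
            std::cout << m.name << ": " << logLikelihood(m, words, vocab) << "\n";
    }

The per-language log-likelihoods can be turned into the requested percentages by exponentiating and normalizing, or reported as-is with a threshold for "unknown".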

Bulahbulawayo answered 30/8, 2010 at 12:52 Comment(1)
Indeed, I'd say a naive Bayes classifier with a quick check at the start for definite "must be" words (e.g. a Python or Ruby env line).Artificer
Answer by Countersignature (score 8)

One simple solution I can think of is to identify the keywords used in different languages. Each identified keyword would score +1. Then calculate ratio = identified_words / total_words. The language that gets the highest score is the winner. Of course there are problems, like keywords appearing in comments etc., but I think this is a very simple solution that should work in most cases.
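
A rough sketch of that idea (the keyword tables are deliberately tiny and illustrative; real ones would be much larger):

    #include <iostream>
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    // Score each language by the fraction of tokens matching its keyword list.
    int main() {
        std::map<std::string, std::set<std::string>> keywords = {
            {"C++",  {"#include", "class", "template", "namespace"}},
            {"Java", {"import", "public", "class", "extends"}},
            {"Perl", {"my", "sub", "use", "foreach"}},
        };

        std::string source = "#include <vector>\nint main() { return 0; }";
        std::istringstream in(source);
        std::vector<std::string> tokens;
        for (std::string t; in >> t;) tokens.push_back(t);

        for (const auto& [lang, words] : keywords) {
            int hits = 0;
            for (const auto& t : tokens)
                if (words.count(t)) ++hits;
            double ratio = tokens.empty() ? 0.0 : double(hits) / tokens.size();
            std::cout << lang << ": " << ratio * 100 << "%\n";
        }
    }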

Countersignature answered 30/8, 2010 at 12:25 Comment(3)
Following on from this idea, you could maybe try a naive Bayesian classifier, like the early spam filters. That might well give very good results.Vamp
+1 for both the original and the comment--this is the best way I know of to get quick and reasonably accurate performance.Homunculus
The only downside of this approach I can see is that some keywords are much more characteristic of certain languages than others, so some statistical analysis will be required beforehand. But I might do just that; it seems it could work very well.Colp
Answer by Fractional (score 2)

If you know that the source files will conform to standards, file extensions are unique to just about every language. I assume that you've already considered this and ruled it out based on some other information.

If you can't use file extensions, the best way would be to find the things that differ most between languages and use those to determine the file type. For example, for-loop syntax won't vary much between languages, but package include statements will. If a file includes java.util.*, then you know it's a Java file.
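
A table of such "telltale" patterns is easy to keep as plain data (the patterns below are invented for illustration):

    #include <iostream>
    #include <string>
    #include <vector>

    // Substrings that pin a file to one language outright; adding a language
    // means adding a row.
    struct Telltale { std::string pattern, language; };

    std::string identify(const std::string& source) {
        static const std::vector<Telltale> telltales = {
            {"import java.util", "Java"},
            {"#!/usr/bin/perl",  "Perl"},
            {"<?php",            "PHP"},
            {"using System;",    "C#"},
        };
        for (const auto& t : telltales)
            if (source.find(t.pattern) != std::string::npos) return t.language;
        return "unknown";
    }

    int main() {
        std::cout << identify("import java.util.*;\npublic class A {}") << "\n";
    }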

Fractional answered 30/8, 2010 at 12:21 Comment(4)
You can't calculate a probability for each language this way.Stalder
Wondering what a loop would look like in brainf*** or Haskell.Countersignature
@Lieven You can't calculate a probability based only on my suggestions, but you can certainly get a good start. If you rule out certain languages based on syntactic differences, it becomes easier to determine probabilities. Plus, if you can definitively identify a language based on an include statement, then probabilities are unnecessary.Fractional
+1 for the file extension queryVivian
Answer by Vivian (score 2)

I'm sorry, but if you have to parse thousands of files, then your best bet is to look at the file extension. Don't over-engineer a simple problem or put burdensome requirements on a simple task.

It sounds like you have thousands of files of source code and no idea what programming language they were written in. What kind of programming environment do you work in? (Ruling out the possibility of an artificial homework requirement.) I mean, one of the basics of software engineering that I can always rely on is that C++ code files have the .cpp extension, Java code files have the .java extension, C code files have the .c extension, etc. Is your company playing fast and loose with these conventions? If so, I would be really worried.
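
That lookup is trivial to implement and extend, e.g.:

    #include <iostream>
    #include <map>
    #include <string>

    // Map a file extension to a language; "unknown" when the extension is
    // missing or unrecognized. A new language is one more table entry.
    std::string languageFromExtension(const std::string& filename) {
        static const std::map<std::string, std::string> table = {
            {".cpp", "C++"}, {".c", "C"}, {".java", "Java"},
            {".pl", "Perl"}, {".rb", "Ruby"}, {".php", "PHP"}, {".html", "HTML"},
        };
        auto dot = filename.rfind('.');
        if (dot == std::string::npos) return "unknown";
        auto it = table.find(filename.substr(dot));
        return it == table.end() ? "unknown" : it->second;
    }

    int main() {
        std::cout << languageFromExtension("widget.cpp") << "\n"; // C++
    }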

Vivian answered 30/8, 2010 at 13:1 Comment(3)
Who says his "files" are even files with names? Perhaps he is trying to do correct syntax highlighting on code snippets on a forum like SO. Or maybe he is trying to figure out if some code files have the wrong extension!Manhour
Sorry, but relying on file extensions is not possible here. Anyway thanks for your suggestion!Colp
Then what type of problem are you trying to solve?Vivian
Answer by Varix (score 2)

As dmckee suggested, you might want to have a look at the Unix file program, whose source is available. The heuristics used by this utility might be a great source of inspiration. Since it is written in C, I guess it qualifies for C++. :) You do not get confidence percentages directly, though; maybe they are used internally?
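
If shelling out is acceptable (probably too slow for thousands of files, per the question's constraints), you could also just ask file for its verdict; a quick POSIX sketch:

    #include <cstdio>
    #include <iostream>
    #include <string>

    // Run `file -b <path>` (-b omits the filename from the output) and
    // capture its one-line verdict. Uses POSIX popen, so it is not portable
    // everywhere, and forking per file is slow for large batches.
    std::string fileVerdict(const std::string& path) {
        std::string cmd = "file -b '" + path + "'";
        FILE* pipe = popen(cmd.c_str(), "r");
        if (!pipe) return "error";
        std::string out;
        char buf[256];
        while (fgets(buf, sizeof buf, pipe)) out += buf;
        pclose(pipe);
        return out;
    }

    int main() {
        std::cout << fileVerdict("/etc/hostname");
    }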

Varix answered 30/8, 2010 at 13:39 Comment(0)
Answer by Ownership (score 1)

Take a look at nedit. It has a syntax highlighting recognition system, under Syntax Highlighting->Recognition Patterns. You can browse sample recognition patterns here, or download the program and check out the standard ones.

Here's a description of the highlighting system.
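
In the same data-driven spirit, weighted regex recognition patterns per language would keep "adding a language" down to adding table rows (the patterns and weights below are invented for illustration):

    #include <iostream>
    #include <iterator>
    #include <map>
    #include <regex>
    #include <string>
    #include <vector>

    // Each rule: a language, a pattern characteristic of it, and a weight.
    struct Rule { std::string language; std::regex pattern; double weight; };

    int main() {
        std::vector<Rule> rules = {
            {"C++",  std::regex(R"(#include\s*<\w+>)"), 2.0},
            {"C++",  std::regex(R"(\bstd::\w+)"),       1.5},
            {"Perl", std::regex(R"(\bmy\s+[$@%])"),     2.0},
            {"PHP",  std::regex(R"(<\?php)"),           3.0},
        };

        std::string source = "#include <iostream>\nint main() { std::cout << 1; }";
        std::map<std::string, double> scores;
        for (const auto& r : rules) {
            auto begin = std::sregex_iterator(source.begin(), source.end(), r.pattern);
            scores[r.language] += r.weight * std::distance(begin, std::sregex_iterator());
        }
        for (const auto& [lang, s] : scores)
            std::cout << lang << ": " << s << "\n";
    }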

Ownership answered 30/8, 2010 at 12:28 Comment(2)
The unix utility file has its own set of heuristics, though they may be too simple.Casta
@dmckee, true, yet for really short programs it will fail :/Ownership
Answer by Petersburg (score 1)

Since the list of languages is known up front, you know the syntax/grammar of each of them. Hence you can, for example, write a function to extract reserved words from the provided source code.

Build a binary tree that holds all the reserved words of all the languages you support, and then walk that tree with the reserved words extracted in the previous step.

If in the end you have only one possibility left, that is your language. If you reach the end of the program too soon, then (from wherever you stopped) you can analyse your position in the tree to work out which languages are still possibilities.
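
A simplified rendering of that elimination walk, with plain sets standing in for the explicit tree (the tree is an optimization of the same idea; the word lists are tiny and illustrative):

    #include <iostream>
    #include <map>
    #include <set>
    #include <sstream>
    #include <string>

    int main() {
        // Reserved-word sets per language.
        std::map<std::string, std::set<std::string>> reserved = {
            {"C++",  {"int", "class", "template", "return"}},
            {"Java", {"int", "class", "extends", "return"}},
            {"Perl", {"my", "sub", "use", "return"}},
        };

        std::set<std::string> candidates;
        for (const auto& entry : reserved) candidates.insert(entry.first);

        // Walk the extracted words; a candidate survives only while it
        // reserves every word that is reserved somewhere (deliberately strict).
        std::istringstream in("template class int return");
        for (std::string word; in >> word && candidates.size() > 1;) {
            bool reservedSomewhere = false;
            for (const auto& entry : reserved)
                if (entry.second.count(word)) reservedSomewhere = true;
            if (!reservedSomewhere) continue; // probably an identifier
            for (auto it = candidates.begin(); it != candidates.end();) {
                if (reserved[*it].count(word)) ++it;
                else it = candidates.erase(it);
            }
        }
        for (const auto& lang : candidates)
            std::cout << lang << " is still a possibility\n";
    }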

Petersburg answered 30/8, 2010 at 12:41 Comment(2)
+1: that would probably be quite accurate, but is it fast enough given thousands of files?Lustrum
Thanks, Axel. You can probably combine the two stages: as you extract reserved words, feed them into the tree straight away. In some cases you will detect the language well before reaching the end of the input source code. The slowest part is actually applying the syntax rules of all the languages (i.e. you need to build a lot of L-strings etc. - this is almost like writing parsers for all supported languages)Petersburg
Answer by Intermezzo (score 0)

You could try to think about the differences between languages and model them with a binary tree, like "is feature X found?" - if yes, proceed in one direction; if not, proceed in the other.

By constructing this search tree efficiently you could end up with rather fast code.

Intermezzo answered 30/8, 2010 at 12:27 Comment(3)
Not every Perl program exhibits every Perl feature. There is no Perl feature X such that every valid Perl program exhibits X.Buskus
Then you will find that out by eliminating the other possibilitiesIntermezzo
Then Perl programs will appear multiple times in the binary tree. As a result, the binary tree will not have a guaranteed size of O(log N).Buskus
Answer by Cumin (score 0)

This one is not fast and may not satisfy your requirements, but it's just an idea. It should be easy to implement and should give a 100% accurate result.

You could try to compile/execute the input text with different compilers/interpreters (open source or free) and check for errors behind the scenes.
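
A crude sketch of that approach via each tool's syntax-check mode (assuming gcc, g++, perl and php are on the PATH; exit status 0 is read as "parses as this language"):

    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    // Run each language's syntax checker on the file; gcc/g++ -fsyntax-only,
    // perl -c and php -l all parse without producing output files. Far too
    // slow for thousands of files, as the question notes.
    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: guess <file>\n"; return 1; }
        std::string f = argv[1];
        std::vector<std::pair<std::string, std::string>> checkers = {
            {"C",    "gcc -fsyntax-only -x c "   + f + " 2>/dev/null"},
            {"C++",  "g++ -fsyntax-only -x c++ " + f + " 2>/dev/null"},
            {"Perl", "perl -c "                  + f + " >/dev/null 2>&1"},
            {"PHP",  "php -l "                   + f + " >/dev/null 2>&1"},
        };
        for (const auto& [lang, cmd] : checkers)
            if (std::system(cmd.c_str()) == 0)
                std::cout << "parses as " << lang << "\n";
    }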

Cumin answered 30/8, 2010 at 12:39 Comment(3)
What if it compiles as more than one language? E.g. nyx.net/~gthompso/poly/micah.txt works as both C and PerlHither
Then you say 50% probability C, 50% - Perl. This is OK according to the question author.Petersburg
Exactly. If it compiles then who can prove the opposite?Cumin
Answer by Haggadah (score 0)

The Sequitur algorithm infers context-free grammars from sequences of terminal symbols. Perhaps you could use that to compare against a set of known production rules for each language.

Haggadah answered 30/8, 2010 at 12:47 Comment(0)
