Detecting programming language from a snippet [closed]

17

133

What would be the best way to detect what programming language is used in a snippet of code?

Superficial answered 23/1, 2009 at 23:16 Comment(7)
There are practically an infinite number of languages out there... do you want to detect ANY of them? Or are we just talking the popular ones?Galeiform
Just the popular ones (C/C++, C#, Java, Pascal, Python, VB.NET, PHP, JavaScript and maybe Haskell).Worldwide
Well Haskell can't be popular since I've never heard of it. ;-)Alkyne
You probably don't know much about programming languages if you haven't heard of Haskell.Butterscotch
There is this online service which does it: algorithmia.com/algorithms/PetiteProgrammer/…Low
Clearly the popularity of Haskell is growing ...Beautify
@BennyNeugebauer Can't get it to work, at least off the cuff. You did post that many years ago, however, so that might be the reason.Caduceus
104

I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurrences of these words with known snippets, and compute the probability that this snippet is written in language X for every language you're interested in.

http://en.wikipedia.org/wiki/Bayesian_spam_filtering

If you have the basic mechanism then it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.

I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:

print "Hello"

Let me find the code.

I couldn't find the code so I made a new one. It's a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it's likely to say that this code:

def foo
   puts "hi"
end

is Python code (although it really is Ruby). This is because Python has a def keyword too. So if it has seen 1000x def in Python and 100x def in Ruby then it may still say Python even though puts and end are Ruby-specific. You could fix this by keeping track of the words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).

class Classifier
  def initialize
    @data = {}
    @totals = Hash.new(1)
  end

  def words(code)
    code.split(/[^a-z]/).reject{|w| w.empty?}
  end

  def train(code,lang)
    @totals[lang] += 1
    @data[lang] ||= Hash.new(1)
    words(code).each {|w| @data[lang][w] += 1 }
  end

  def classify(code)
    ws = words(code)
    @data.keys.max_by do |lang|
      # We really want to multiply here but I use logs 
      # to avoid floating point underflow
      # (adding logs is equivalent to multiplication)
      Math.log(@totals[lang]) +
      ws.map{|w| Math.log(@data[lang][w])}.reduce(0, :+) # initial 0 avoids nil on an empty snippet
    end
  end
end

# Example usage

c = Classifier.new

# Train from files
c.train(open("code.rb").read, :ruby)
c.train(open("code.py").read, :python)
c.train(open("code.cs").read, :csharp)

# Test it on another file
c.classify(open("code2.py").read) # => :python (hopefully)
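For what it's worth, here is a minimal sketch of the normalization fix mentioned above: dividing each word count by the total number of words seen per language, so a language with more training data does not dominate. The class name and details are mine, not part of the original code.

```ruby
# Hypothetical variant of the classifier above: word counts are divided
# by the total number of words seen per language, so a language with
# more training data does not dominate the scores.
class NormalizedClassifier
  def initialize
    @data = {}                  # lang => Hash of word counts
    @word_counts = Hash.new(1)  # lang => total words seen (1 avoids log(0))
  end

  def words(code)
    code.split(/[^a-z]/).reject { |w| w.empty? }
  end

  def train(code, lang)
    @data[lang] ||= Hash.new(1)
    words(code).each do |w|
      @data[lang][w] += 1
      @word_counts[lang] += 1
    end
  end

  def classify(code)
    ws = words(code)
    @data.keys.max_by do |lang|
      # log P(w|lang) ~ log(count(w, lang) / total_words(lang))
      ws.map { |w| Math.log(@data[lang][w].to_f / @word_counts[lang]) }.sum
    end
  end
end
```

With this variant, feeding it ten times more Python than Ruby training data still classifies the def/puts/end snippet as Ruby, because the Python counts are scaled down by Python's larger word total.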
Expansible answered 23/1, 2009 at 23:21 Comment(10)
I also need to use it in forum software. Thanks for the tip about the Bayesian filtering.Worldwide
I did something like this in my NLP class, but we took it a step further. You don't just look at frequencies of a single word, but pairs and triples of words. For example, "public" might be a keyword in many languages, but "public static void" is more common to C#. If the triple can't be found, you fall back to 2, and then 1.Want
Might also want to think about where you're splitting the words. In PHP, variables start with $, so maybe you shouldn't be splitting on word boundaries, because the $ should stick with the variable. Operators like => and := should be stuck together as a single token, but OTOH you probably should split around {s because they always stand on their own.Want
Yep. A way to avoid splitting at all is to use n-grams: you take every length-n substring. For example the 5-grams of "puts foo" are "puts ", "uts f", "ts fo" and "s foo". This strategy may seem weird but it works better than you'd think; it's just not how a human would solve the problem. To decide which method works better you'll have to test both...Expansible
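The character n-gram extraction described in that comment can be sketched in a few lines (illustrative only):

```ruby
# Every substring of length n becomes a token; no word splitting needed.
def ngrams(text, n)
  return [] if text.length < n
  (0..text.length - n).map { |i| text[i, n] }
end

ngrams("puts foo", 5) # => ["puts ", "uts f", "ts fo", "s foo"]
```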
Some languages have very little syntax, though. I'm also speculating that common variable names would dominate over the language's keywords. Basically, if you have a piece of C code written by a Hungarian, with variable names and comments in Hungarian, in your training data, then any other source with Hungarian in it is likely to be determined to be "similar".Sector
@data[lang][w]/@totals[w] should be @totals[w]/@data[lang][w]Attenuation
Glaslos, I don't think so. Actually you can remove the /@totals[w] completely and the algorithm will still behave identically. That is because it is dividing by the same number for each different language, so it doesn't matter for the relative ordering. But @data[lang][w] should be used normally, not divided by. That is so because the more times we saw a word w in lang, the higher the probability that that word belongs to lang, not lower.Expansible
I just want to point out that this is a Machine Learning topic with much scientific literature on it, and that it is my understanding that Hidden Markov Models or Long Short Term Memory (LSTM) Artificial Neural Networks (ANN) is what would be used by data science professionals to do sequence classification of this sort.Batsheva
I sure hope not! For difficult classification tasks, yes. However since classifying programming snippets is such an easy problem that would be massive overkill. Naive Bayes works fine.Expansible
If anyone wants to start without training : github.com/anvaka/common-wordsCombustion
27

Language detection solved by others:

Ohloh's approach: https://github.com/blackducksw/ohcount/

Github's approach: https://github.com/github/linguist

Lenalenard answered 12/3, 2012 at 22:11 Comment(2)
I examined both of these solutions and neither will do exactly what was asked. They mainly look at the file extensions to determine the language, so they can't necessarily examine a snippet without a clue from the extension.Ullyot
Github's approach now includes a Bayesian classifier too. It primarily detects a language candidate based on file extension, but when a file extension matches multiple candidates (e.g. ".h" --> C,C++,ObjC), it will tokenize the input code sample and classify against a pre-trained set of data. The Github version can be forced to scan the code always without looking at the extension too.Constitutionally
11

Guesslang is a possible solution:

http://guesslang.readthedocs.io/en/latest/index.html

There's also SourceClassifier:

https://github.com/chrislo/sourceclassifier/tree/master

I became interested in this problem after finding some code in a blog article which I couldn't identify. Adding this answer since this question was the first search hit for "identify programming language".

Biographer answered 27/2, 2018 at 10:20 Comment(0)
6

An alternative is to use highlight.js, which performs syntax highlighting but uses the success-rate of the highlighting process to identify the language. In principle, any syntax highlighter codebase could be used in the same way, but the nice thing about highlight.js is that language detection is considered a feature and is used for testing purposes.

UPDATE: I tried this and it didn't work that well. Compressed JavaScript completely confused it; apparently the tokenizer is whitespace-sensitive. Generally, just counting highlight hits does not seem very reliable. A stronger parser, or perhaps counts of unmatched sections, might work better.

Greaser answered 12/6, 2012 at 9:42 Comment(2)
The language data included in highlight.js is limited to the values needed for highlighting, which turns out to be quite insufficient for language detection (especially for small amounts of code).Tab
I think it is fine, check with this fiddle jsfiddle.net/3tgjnz10Combustion
4

First, I would try to find keywords specific to a language, e.g.

"package, class, implements" => Java
"<?php" => PHP
"include main fopen strcmp stdout" => C
"cout" => C++
etc...
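That keyword-marker idea could be sketched like this; the keyword lists here are illustrative, not exhaustive:

```ruby
# Score each language by how many of its characteristic keywords
# appear in the snippet, and return the best-scoring language.
MARKERS = {
  java: %w[package class implements],
  php:  ["<?php"],
  c:    %w[include main fopen strcmp stdout],
  cpp:  %w[cout]
}

def guess(snippet)
  MARKERS.max_by { |_lang, kws| kws.count { |kw| snippet.include?(kw) } }.first
end

guess("#include <stdio.h>\nint main() { fopen(\"f\", \"r\"); }") # => :c
```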
Haiti answered 23/1, 2009 at 23:21 Comment(1)
Problem is that those keywords can still appear in any language, either as variable names or in strings. That, and there's a lot of overlap in keywords used. You'd have to do more than just look a keywords.Want
4

It's very hard and sometimes impossible. Which language is this short snippet from?

int i = 5;
int k = 0;
for (int j = 100 ; j > i ; i++) {
    j = j + 1000 / i;
    k = k + i * j;
}

(Hint: It could be any one out of several.)

You can try to analyze various languages and decide using frequency analysis of keywords. If certain sets of keywords occur with certain frequencies in a text, it's likely the language is Java, etc. But I don't think you will get anything completely foolproof, since you could, for example, give a variable in C the same name as a keyword in Java, and the frequency analysis would be fooled.

If you take it up a notch in complexity you could look for structures: if a certain keyword always comes after another one, that will give you more clues. But it will also be much harder to design and implement.

Spanish answered 23/1, 2009 at 23:27 Comment(2)
Well, if several languages are possible, the detector can just give all the possible candidates.Otic
Or, it can give the first one that matches. If the real-world use case is something like syntax highlighting, then it really wouldn't make a difference. Meaning that any of the matching languages would result in highlighting the code correctly.Bolster
2

It would depend on what type of snippet you have, but I would run it through a series of tokenizers and see which language's BNF it came up as valid against.

Puberty answered 23/1, 2009 at 23:20 Comment(1)
Not all languages can even be described by a BNF. If you're allowed to redefine keywords and create macros it gets much harder. Also, as we're talking about a snippet, you would have to do a partial match against a BNF, which is harder and more error-prone.Spanish
2

I needed this so I created my own: https://github.com/bertyhell/CodeClassifier

It's very easily extendable by adding a training file in the correct folder. It's written in C#, but I imagine the code is easily converted to any other language.

Calmative answered 8/6, 2015 at 6:17 Comment(0)
2

Best solution I have come across is using the linguist gem in a Ruby on Rails app. It's kind of a specific way to do it, but it works. This was mentioned above by @nisc but I will tell you my exact steps for using it. (Some of the following command-line commands are specific to Ubuntu but should be easy to translate to other OSes.)

If you have any Rails app that you don't mind temporarily messing with, create a new file in it to insert your code snippet in question. (If you don't have Rails installed there's a good guide here, although for Ubuntu I recommend this. Then run rails new <name-your-app-dir> and cd into that directory. Everything you need to run a Rails app is already there.)

After you have a rails app to use this with, add gem 'github-linguist' to your Gemfile (literally just called Gemfile in your app directory, no ext).

Then install ruby-dev (sudo apt-get install ruby-dev)

Then install cmake (sudo apt-get install cmake)

Now you can run gem install github-linguist (if you get an error that says icu required, do sudo apt-get install libicu-dev and try again)

(You may need to do a sudo apt-get update or sudo apt-get install make or sudo apt-get install build-essential if the above did not work)

Now everything is set up. You can use this any time you want to check code snippets. In a text editor, open the file you've made to insert your code snippet (let's just say it's app/test.tpl, but if you know the extension of your snippet, use that instead of .tpl; if you don't know the extension, don't use one). Now paste your code snippet in this file. Go to the command line and run bundle install (must be in your application's directory). Then run linguist app/test.tpl (more generally, linguist <path-to-code-snippet-file>). It will tell you the type, MIME type, and language. For multiple files (or for general use with a Ruby/Rails app) you can run bundle exec linguist --breakdown in your application's directory.

It seems like a lot of extra work, especially if you don't already have Rails, but you don't actually need to know ANYTHING about Rails if you follow these steps, and I really haven't found a better way to detect the language of a file/code snippet.

Variegated answered 27/7, 2015 at 17:34 Comment(0)
2

This site seems to be pretty good at identifying languages, if you want a quick way to paste a snippet into a web form, rather than doing it programmatically: http://dpaste.com/

Unreligious answered 13/7, 2020 at 12:12 Comment(5)
It failed on all languages I've tried, even simple ones like the hello world code from Wikipedia; that got detected as Java. An alert('hello world!'); got guessed as C++Windrow
alert('hello world!'); could be valid C++ if you had a function called alert()... It probably guesses from the syntax, not each language's standard library.Unreligious
It didn't even consider that it was a simple string in single quotes...Windrow
Oh, my mistake - I hadn't noticed they were single quotes, so that wouldn't be valid C++.Unreligious
Actually, if you compile with g++ -fpermissive, a "string" of more than one character surrounded by single quotes generates a warning, not an error - so that is valid C++, even though it's bad.Unreligious
1

Prettify is a JavaScript package that does an okay job of detecting programming languages:

http://code.google.com/p/google-code-prettify/

It is mainly a syntax highlighter, but there is probably a way to extract the detection part for the purposes of detecting the language from a snippet.

Ullyot answered 5/4, 2012 at 15:15 Comment(2)
Upon further inspection it seems prettify doesn't actually detect the language, but it highlights according to the syntax of each element.Ullyot
Hawkee is correct. The feature page claims autodetection, the source code shows they use a "default-code" syntax when the syntax isn't given explicitly.Eurydice
0

Nice puzzle.

I think it is impossible to detect all languages, but you could trigger on key tokens (certain reserved words and often-used character combinations).

But there are a lot of languages with similar syntax, so it depends on the size of the snippet.

Mackinnon answered 23/1, 2009 at 23:22 Comment(0)
0

I wouldn't think there would be an easy way of accomplishing this. I would probably generate lists of symbols/common keywords unique to certain languages or classes of languages (e.g. curly brackets for C-style languages, the Dim and Sub keywords for BASIC languages, the def keyword for Python, the let keyword for functional languages). You then might be able to use basic syntax features to narrow it down even further.

Vladimir answered 23/1, 2009 at 23:22 Comment(0)
0

I think the biggest distinction between languages is their structure. So my idea would be to look at certain common elements across all languages and see how they differ. For example, you could use regexes to pick out things such as:

  • function definitions
  • variable declarations
  • class declarations
  • comments
  • for loops
  • while loops
  • print statements

And maybe a few other things that most languages should have. Then use a point system: award at most 1 point for each element if the regex is found. Obviously, some languages will use the exact same syntax (for loops are often written like for(int i=0; i<x; ++i), so multiple languages could each score a point for the same thing, but at least you're reducing the likelihood of it being an entirely different language). Some of them might score 0s across the board (the snippet doesn't contain a function at all, for example) but that's perfectly fine.

Combine this with Jules' solution, and it should work pretty well. Maybe also look for frequencies of keywords for an extra point.
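A rough sketch of that point system might look like this; the regexes are illustrative guesses for a few language families, not a tested rule set:

```ruby
# One regex per structural element per language family; award one
# point for each element whose regex matches the snippet.
RULES = {
  ruby:   [/\bdef\s+\w+\s*$/, /\bend\b/, /\bputs\b/],
  python: [/\bdef\s+\w+\(.*\):/, /\bprint\(/, /^\s*#/],
  c_like: [/\bfor\s*\(.*;.*;.*\)/, /\/\/|\/\*/, /[{};]/]
}

def scores(snippet)
  RULES.map { |lang, rs| [lang, rs.count { |r| snippet =~ r }] }.to_h
end

scores("def foo\n  puts 'hi'\nend") # => {:ruby=>3, :python=>0, :c_like=>0}
```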

Want answered 28/11, 2010 at 0:37 Comment(0)
0

Interesting. I have a similar task to recognize text in different formats. YAML, JSON, XML, or Java properties? Even with syntax errors, for example, I should tell apart JSON from XML with confidence.

I figure how we model the problem is critical. As Mark said, single-word tokenization is necessary but likely not enough; we will need bigrams, or even trigrams. But I think we can go further, knowing that we are looking at programming languages. I notice that almost any programming language has two distinctive types of tokens: symbols and keywords. Symbols are relatively easy to recognize (some symbols might be literals, not part of the language). Then bigrams or trigrams of symbols will pick up unique syntax structures around symbols. Keywords are another easy target if the training set is big and diverse enough; a useful feature could be bigrams around possible keywords. Another interesting type of token is whitespace. If we tokenize in the usual way by whitespace, we will lose this information. I'd say, for analyzing programming languages, we keep the whitespace tokens, as they may carry useful information about the syntax structure.
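A quick sketch of tokenizing while keeping whitespace runs as tokens, as suggested above, so indentation information is not thrown away (illustrative only):

```ruby
# Split into identifiers, whitespace runs, and runs of other symbols,
# keeping all three kinds of token.
def tokens(code)
  code.scan(/[a-zA-Z_]\w*|\s+|[^\w\s]+/)
end

tokens("def foo:\n    return") # => ["def", " ", "foo", ":", "\n    ", "return"]
```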

Finally, if I choose a classifier like a random forest, I will crawl GitHub and gather public source code. Most source code files can be labeled by file suffix. For each file, I will randomly split it at empty lines into snippets of various sizes, then extract the features and train the classifier using the labeled snippets. After training is done, the classifier can be tested for precision and recall.

Zephan answered 29/10, 2014 at 7:55 Comment(0)
-1

I believe that there is no single solution that could possibly identify what language a snippet is in based only on that single snippet. Take the keyword print: it appears in any number of languages, used for different purposes and with different syntax.

I do have some advice. I'm currently writing a small piece of code for my website that can be used to identify programming languages. As most of the other posts point out, there is a huge range of programming languages you simply haven't heard of; you can't account for them all.

What I have done is identify each language by a selection of keywords. For example, Python could be identified in a number of ways. It's probably easier if you pick 'traits' that are almost certainly unique to the language. For Python, I chose the trait of using colons to start a set of statements, which I believe is a fairly unique trait (correct me if I'm wrong).

If, in my example, you can't find a colon to start a statement set, then move on to another possible trait, let's say using the def keyword to define a function. Now this can cause some problems, because Ruby also uses the keyword def to define a function. The key to telling the two (Python and Ruby) apart is to use various levels of filtering to get the best match. Ruby uses the keyword end to finish a function, whereas Python doesn't have anything to finish a function, just a de-indent, but you don't want to go there. But again, end could also be Lua, yet another programming language to add to the mix.

You can see that programming languages simply overlap too much. A keyword in one language could happen to be a keyword in another language. Using a combination of keywords that often go together, like Java's public static void main(String[] args), helps to eliminate those problems.

Like I've already said, your best chance is looking for relatively unique keywords or sets of keywords to separate one from the other. And, if you get it wrong, at least you had a go.
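The layered trait filtering described in this answer might be sketched as follows; the regex traits are hypothetical simplifications, checked from more specific to less:

```ruby
# Check distinctive traits in order: a colon at end of line (and no
# "end") suggests Python; def plus end suggests Ruby; end alone, Lua.
def identify(snippet)
  return :python if snippet =~ /:\s*$/ && snippet !~ /\bend\b/
  return :ruby   if snippet =~ /\bdef\b/ && snippet =~ /\bend\b/
  return :lua    if snippet =~ /\bend\b/
  :unknown
end

identify("def foo():\n    pass") # => :python
```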

Speedboat answered 4/2, 2016 at 21:9 Comment(0)
-2

Set up the random scrambler like

S = matrix(GF(2), k, [random() < 0.5 for _ in range(k^2)])
while rank(S) < k:
    S[floor(k*random()), floor(k*random())] += 1
Tumbledown answered 16/2, 2016 at 4:27 Comment(0)
