How does GitHub figure out a project's language?
L

5

93

I was recently working on a GitHub project written in both JavaScript and C++, and noticed that GitHub tagged the project as C++. If you have to pick a single language, this is probably the correct designation, since the C++ code is compiled as a JavaScript library. But it made me wonder: how does GitHub figure out which language to tag each project with?

Luxury answered 15/3, 2011 at 21:55 Comment(6)
You can consider yourself lucky. I'm writing a Ruby on Rails project but since I'm using Twitter Bootstrap, Github thinks my project is Javascript, instead of the intended RubySurgical
@davblayn I think that github.com/github/linguist/blob/master/lib/linguist/vendor.yml would solve your problem. Also using a CDN for bootstrap would work.Thay
This question appears to be off-topic because it is not about programming. See What topics can I ask about here in the Help Center. Perhaps Web Apps Stack Exchange would be a better place to ask.Dorladorlisa
Also see Misidentified Language tag on Github tracker for Linguist.Dorladorlisa
@Dorladorlisa - I disagree - the question is at this point essentially about an open source library identifying one programming language instead of another. How is that "not about programming"?Luxury
You can tell the stats engine lies about file types to fudge the result. See https://mcmap.net/q/160640/-how-can-i-change-the-language-of-a-repository-on-githubKerbstone
M
89

Update April 2013, by nuclearsandwich (GitHub support team or "supportocat"):

If your desired language is not receiving syntax highlighting you can contribute to the Linguist library to add it.


(Original answer, Oct. 2012)

This thread on GitHub support explains it:

It just sums up file sizes for each extension. Largest one "wins".

We'd like to avoid opening files up and parsing their content, as both would slow down the process... but that might be the only method of resolving conflicts like this one.
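As a rough illustration of the "sum file sizes per extension, largest wins" rule quoted above, here is a minimal Python sketch. The function name `largest_extension` and the sample file list are made up for this example; this is not GitHub's code, only the idea as the support thread describes it.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def largest_extension(files):
    """Return the extension with the largest total byte count.

    `files` is an iterable of (path, size_in_bytes) pairs.
    Returns None when no file has an extension.
    """
    totals = defaultdict(int)
    for path, size in files:
        ext = PurePosixPath(path).suffix.lower()
        if ext:  # files without an extension are ignored in this sketch
            totals[ext] += size
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hypothetical repository mixing C++ and JavaScript, as in the question.
repo = [("src/core.cc", 48_000), ("src/util.cc", 12_000), ("web/app.js", 35_000)]
print(largest_extension(repo))
```

Note that nothing here opens a file, which is why a repository that bundles a large third-party library can end up with the "wrong" label under this rule.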

Since this is not 100% accurate, it has led some to add:

I, too, would vote for a simple manual-override switch for the cases where the guess is wrong.


Note: as Mark Rushakoff mentions in his answer (upvoted), the guessing has improved since then thanks to the linguist project (open-sourced in June 2011).
You can see there are still issues though: GitHub Linguist Issues.
See here for more details:

Once the language has been detected, it is passed to Albino, a Pygments wrapper, which does the actual syntax highlighting.

And you can add linguist directives in a .gitattributes file.

Monagan answered 15/3, 2011 at 22:7 Comment(4)
Thanks for the info. I guess there's still no way to modify the language manually.Hypoploid
This is no longer the case! The answers below regarding linguist are closer to the mark. Check out My repository is marked as the wrong language and Why isn't my favorite language recognized on help.github.com . Disclaimer: I work on GitHub's support team.Jamshedpur
@Jamshedpur Excellent, I have updated the answer, completing your edit. Note: I will be at GitHub headquarters Friday, May 10th, meeting with John Greet and other supportocats :)Monagan
I just want to add that not marking repository or letting the user choose the main language would be way more convenient than automatically guessing, because my repository github.com/salda/file_scraper is mainly in C++ with a bit of C, but marked as 70% Objective-C.Holohedral
L
14

Currently, GitHub's linguist project is what determines language statistics, as described in this GitHub blog post (which came out a few months after this question was originally asked).

Lederer answered 6/4, 2012 at 18:23 Comment(1)
Excellent, I didn't see it at the time of my answer. +1Monagan
S
7

First, know that you can override the language detected for files in your repository using Linguist overrides.

Now, in a nutshell,

  1. Each repository is tagged with the first language from language statistics.
  2. Language statistics count the total size of files for each detected programming or markup language. Vendored, documentation, and generated files are not counted.
  3. The language of each file is detected by the open source project Linguist.
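The three points above can be sketched in Python as follows. This is an illustration under assumptions, not Linguist's implementation: the `FileInfo` record, its fields, and the sample files are hypothetical, and per-file detection is assumed to have already happened.

```python
from collections import Counter
from dataclasses import dataclass

# Only these language types are counted, per the answer above.
COUNTED_TYPES = {"programming", "markup"}


@dataclass
class FileInfo:
    path: str
    size: int              # bytes
    language: str          # already detected per file
    language_type: str     # "programming", "markup", "data" or "prose"
    excluded: bool = False  # vendored, documentation or generated


def language_stats(files):
    """Total bytes per counted language, largest first."""
    totals = Counter()
    for f in files:
        if f.excluded or f.language_type not in COUNTED_TYPES:
            continue
        totals[f.language] += f.size
    return totals.most_common()


def repository_language(files):
    """The repository is tagged with the first language in the statistics."""
    stats = language_stats(files)
    return stats[0][0] if stats else None


# Hypothetical Rails project that bundles a large JavaScript library.
files = [
    FileInfo("app/models/user.rb", 20_000, "Ruby", "programming"),
    FileInfo("vendor/assets/bootstrap.js", 90_000, "JavaScript", "programming", excluded=True),
    FileInfo("app/assets/app.js", 5_000, "JavaScript", "programming"),
    FileInfo("config/settings.yml", 50_000, "YAML", "data"),
]
print(repository_language(files))
```

With the bundled library marked as excluded, the project is tagged Ruby even though JavaScript accounts for more bytes on disk.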

How does Linguist detect languages?

Linguist relies on the following strategies, in order, and returns the language as soon as it finds a perfect match (a strategy that returns a single language).

  1. Look for Emacs and Vim modelines.
  2. Known filename. Some filenames are associated with specific languages (think Makefile).
  3. Look for a shebang. A file with a #!/bin/bash shebang will be classified as Shell.
  4. Known file extension. Languages have a set of extensions associated with them. There are, however, lots of conflicts with this strategy. The conflicting results (think C++, C and Objective-C for .h) are refined by the subsequent strategies.
  5. A set of heuristic rules. They usually rely on regular expressions over the content of files to try and identify the language (e.g., ^[^#]+:- for Prolog).
  6. A naive Bayesian classifier trained on sample files. Last strategy, lowest accuracy. The Bayesian classifier always takes a subset of languages as input; it is not meant to classify among all languages. The best match found by the classifier is returned.
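The cascade described in the list above can be sketched like this. It is a simplified illustration in Python, not Linguist's code (Linguist is written in Ruby): the lookup tables are tiny hypothetical stand-ins, the Objective-C pattern is invented for the example, and the modeline and Bayesian classifier strategies are left out, so the sketch returns `None` where the real tool would fall back to the classifier.

```python
import re
from pathlib import PurePosixPath

# Tiny illustrative tables; the real data lives in Linguist's configuration.
FILENAMES = {"Makefile": ["Makefile"]}
INTERPRETERS = {"bash": ["Shell"], "sh": ["Shell"], "python3": ["Python"]}
EXTENSIONS = {
    ".h": ["C", "C++", "Objective-C"],
    ".pl": ["Perl", "Prolog"],
    ".js": ["JavaScript"],
}
HEURISTICS = {
    "Prolog": re.compile(r"^[^#]+:-", re.MULTILINE),
    "Objective-C": re.compile(r"^\s*@(interface|implementation|end)\b", re.MULTILINE),
}


def by_filename(path, content, candidates):
    return FILENAMES.get(PurePosixPath(path).name, [])


def by_shebang(path, content, candidates):
    first_line = content.split("\n", 1)[0]
    if not first_line.startswith("#!"):
        return []
    parts = first_line[2:].split()
    if not parts:
        return []
    name = PurePosixPath(parts[0]).name
    if name == "env" and len(parts) > 1:  # handle "#!/usr/bin/env python3"
        name = parts[1]
    return INTERPRETERS.get(name, [])


def by_extension(path, content, candidates):
    return EXTENSIONS.get(PurePosixPath(path).suffix.lower(), [])


def by_heuristics(path, content, candidates):
    # Only refines the candidates left over from earlier strategies.
    return [
        lang for lang in candidates
        if lang in HEURISTICS and HEURISTICS[lang].search(content)
    ]


STRATEGIES = [by_filename, by_shebang, by_extension, by_heuristics]


def detect(path, content):
    """Run strategies in order; stop at the first one returning one language."""
    candidates = []
    for strategy in STRATEGIES:
        result = strategy(path, content, candidates)
        if len(result) == 1:
            return result[0]
        if len(result) > 1:
            candidates = result  # narrowed set is passed to the next strategy
    return None  # unresolved; the classifier step is omitted in this sketch


print(detect("run", "#!/bin/bash\necho hi\n"))
print(detect("rules.pl", "parent(X, Y) :- father(X, Y).\n"))
```

The key design point is that a strategy returning several languages does not fail; it narrows the candidate set that later, more expensive strategies have to choose from.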

What are vendored and documentation files?

Linguist considers some files as vendored, meaning they are not included in language statistics. These include third-party libraries such as jQuery and are defined in the vendor.yml configuration file. You can also vendor or unvendor files in your repository using Linguist overrides.
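To give a feel for path-based vendoring, here is a small Python sketch that flags paths matching a list of regular expressions. The patterns below are illustrative assumptions written for this example, not copied from vendor.yml, and `is_vendored` is a hypothetical helper name.

```python
import re

# Illustrative patterns only; consult vendor.yml for the real list.
VENDOR_PATTERNS = [
    r"(^|/)node_modules/",
    r"(^|/)vendor/",
    r"(^|/)jquery[^/]*\.js$",
    r"(^|/)bootstrap[^/]*\.(js|css)$",
]

# Combine into one expression; each pattern is wrapped so "|" stays local.
_VENDOR_RE = re.compile("|".join(f"(?:{p})" for p in VENDOR_PATTERNS))


def is_vendored(path):
    """True if the repository-relative path matches any vendor pattern."""
    return _VENDOR_RE.search(path) is not None


for path in ["public/js/jquery-1.9.1.min.js", "app/models/user.rb"]:
    print(path, is_vendored(path))
```

Files flagged this way would simply be skipped when sizes are summed, which is why bundling a library under a recognized path keeps it out of the statistics.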

Similarly, documentation files are defined in documentation.yml and can be changed using Linguist overrides.

How are generated files detected?

Linguist relies on simple rules to detect generated files, using both the paths and the content of files. Generated files are not counted in language statistics and are not displayed in diffs on github.com.

What about programming and markup languages?

In Linguist, each language is given a type. These types can be found in the main configuration file, languages.yml. Only the programming and markup languages are counted in statistics.

Satire answered 20/8, 2017 at 10:59 Comment(0)
D
2

After some tinkering with linguist I have noticed this.

For files with a shebang, the shebang is considered when determining the language, but it seems to be weighted evenly against other tokens. This looks like a significant flaw, because the shebang should definitively determine the language of the file.

This can cause issues with highlighting.

Davey answered 21/12, 2012 at 2:45 Comment(1)
This answer has several broken links. This is also true of this answer as it appears on stack exchange: webapps.stackexchange.com/a/40110. A shame, as I'd like to look at those links!Menarche
T
-1

File extensions are the first thing that comes to my mind.

Twila answered 15/3, 2011 at 22:1 Comment(1)
Of course, but... my project contained both .js and .cc files, among other extensions.Luxury
