Counting lines of code
H

4

7

I was doing some research on line counters for C++ projects, and I'm very interested in the algorithms they use. Does anyone know where I can look at implementations of such algorithms?

Hide answered 4/7, 2012 at 15:14 Comment(12)
pardon my ignorance, but what is line counting?Constitutional
?? Title is Algorithms for line counting.Emmieemmit
What is your exact definition of a line?Chuppah
#BoBTFish yeah, and body of the post explains what I mean by it. Or you didn't bother to read the body and have read just the title?Hide
@Chuppah just get any C++ file and see what it looks like. It has lines with code and lines with comments. A line of code is one where there is code on the line.Hide
@smallB: ok, so go through each line of a file, determine if that condition matches, and if yes, increase a counter. isnt much of an algorithm I would guess...Chuppah
@Chuppah and I'm asking about algorithms to see how they are implemented. And your guess would be incorrect, because I've tried a few line counters for C++ and all of them count lines incorrectly. So it's not as trivial as the naive user might think.Hide
@smallB: What does "incorrectly" mean here? what is the correct definition of what is a line and who defines that? It looks like the definition you just gave is incorrect too, so without a proper definition, no one can suggest an "algorithm". And given the proper definition, an "algorithm" is trivial, and as outlined. So the key point is: What is the exact definition of a line. If you can't give it, you can't write code that implements it.Chuppah
@Chuppah I could think of a couple of definitions that would require non-trivial algorithms:-). (But until smallB tells us what he's trying to do, who knows whether it is trivial or not.)Freehanded
@JamesKanze I'm trying to find some algorithms which would count lines of code in C++ file. Line of code is a line with code in it.Hide
@smallB: then again, I outlined the algorithm as above. Go through source, determine for each line if it contains code, and if yes, increase counter. You dont need more than this one algorithm for it. But you said this is wrong. so please enlighten us how an algorithm that counts lines that have code in it can be wrong when you want to count lines that have code in it.Chuppah
I think what the SO was trying to ask is to identify the number of lines (character stream delimited by a new line) which includes specifically C++ code. So, for example, if you give the algorithm a directory as an input, one that includes .java, .cpp, .c, .html, .js files etc. it would only count how many lines of C++ there is. That's what I searched for and ended up on this post anyway. @stelonix's answered my question (using cloc).Ecclesiology
F
26

There's cloc, which is a free open-source source lines of code counter. It has support for many languages, including C++. I personally use it to get the line count of my projects.

On its SourceForge page you can find the Perl source code for download.

Furnishings answered 4/7, 2012 at 16:43 Comment(0)
F
4

Well, if by line counters, you mean programs which count lines, then the algorithm is pretty trivial: just count the number of '\n' in the code. If, on the other hand, you mean programs which count C++ statements, or produce other metrics... Although not 100% accurate, I've gotten pretty good results in the past just by counting '}' and ';' (ignoring those in comments and string and character literals, of course). Anything more accurate would probably require parsing the actual C++.

Freehanded answered 4/7, 2012 at 15:20 Comment(12)
I meant algorithms which would count physical lines of code in a C++ file. I'm afraid counting '}' and ';' is too primitive for anything but the simplest cases.Hide
@Hide It depends on what you're trying to measure. If it really is lines, just use wc -l. And the algorithm really is to just count the '\n'; it's even more primitive than counting '}' and ';' (which does give a good first order approximation of the number of statements in the program).Freehanded
#James Kanze counting lines of code has nothing to do with counting '\n'. What about comments? A comment is not a line of code, yet it would get counted if just parsing for '\n', ';' or '}'. I am interested in algorithms which are able to count lines of code in C++ files.Hide
@Hide Counting lines counts lines. It's not clear what you want. Strip comments and count non-empty lines? What is a line of code, given that C++ isn't line oriented?Freehanded
as I've explained to PlasmaHH line of code is line with code in it. How simpler can it get?Hide
@Hide So if ( a > b ) std::cout << a; can be one line or two, depending on how the programmer formats it? And what about #define, defined on several lines? The concept doesn't seem simple to me. (Or rather, I can't see where the simple interpretation could provide any useful information.)Freehanded
yes you're right, it depends on how programmer formats it. Don't worry about the information, it's for me to decide what's useful. I'm asking about algorithms which I can look at.Hide
@smallB: How can we provide you with an algorithm if you won't tell us what is important? How's this for an algorithm: start at top of file; while not eof: if line includes c++, add 1 to count; increment pointer; loopCarpal
@Carpal Lines of code are important. Line of code is a line with code in it.Hide
@Daniel, please don't try to invent algorithm for non trivial things in a sec or two. Your naive solution wouldn't work.Hide
line includes int a = 0; would you count it as a line of code (according to your naive algorithm)?Hide
@smallB: see, that is the problem. /You/ are the one who decides what a line of code is, since /you/ are the one asking for an algorithm. Above you complain that it's simple ("line of code is line with code in it. How simpler can it get?"), yet you fail to define what a line with code in it is for YOU. Apparently the difficulties come from the fact that you don't have a proper definition, which is why you have to ask if "int a = 0;" is code. But the point is that /you/ want an algorithm and are rejecting lots of existing tools by saying they don't do what you want. But you never tell us what you want.Chuppah
G
3

You don't need to actually parse the code to count lines of code; it's enough to tokenise it.

The algorithm could look like:

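#include <vector>

struct Token { bool isComment; int line; };  // assumed minimal shape; line = where the token starts

// Count the distinct lines on which at least one non-comment token starts.
int countCodeLines(const std::vector<Token>& tokens) {
    int lastLine = -1;
    int lines = 0;
    for (const Token& token : tokens) {
        if (!token.isComment && lastLine != token.line) {
            ++lines;
            lastLine = token.line;
        }
    }
    return lines;
}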

The only information you need to collect during tokenisation is:

  • what type of a token it is (an operator, an identifier, a comment...) You don't need to get very precise here actually, as you only need to distinguish "non-code tokens" (comments) and "code tokens" (anything else)
  • at which line in the file the token occurs.

As for how to tokenise, that's for you to figure out, but hand-writing a tokeniser for such a simple case shouldn't be hard. You could use flex, but that's probably overkill.


EDIT

I've mentioned "tokenisation", let me describe it for you quickly:

Tokenisation is the first stage of compilation. The input of tokenisation is text (multi-line program), and the output is a sequence of "tokens", as in: symbols with some meaning. For instance, the following program:

#include "something.h"

/*
This is my program.
It is quite useless.
*/
int main() {
    return something(2+3); // this is equal to 5
}

could look like:

PreprocessorDirective("include")
StringLiteral("something.h")
PreprocessorDirectiveEnd
MultiLineComment(...)
Keyword(INT)
Identifier("main")
Symbol(LeftParen)
Symbol(RightParen)
Symbol(LeftBrace)
Keyword(RETURN)
Identifier("something")
Symbol(LeftParen)
NumericLiteral(2)
Operator(PLUS)
NumericLiteral(3)
Symbol(RightParen)
Symbol(Semicolon)
SingleLineComment(" this is equal to 5")
Symbol(RightBrace)

Et cetera.

Tokens, depending on their type, may have arbitrary metadata attached to them (e.g. the symbol type, the operator type, the identifier text, or the number of the line where the token was found).

Such a stream of tokens is then fed to the parser, which uses grammar production rules written in terms of these tokens to build, for instance, a syntax tree.

Writing a full parser that would give you a complete syntax tree of the code is challenging, especially if it's C++ we're talking about. However, tokenising (or "lexing", or "lexical analysis") is easier, especially when you don't care about much detail, and you should be able to write a tokeniser yourself using a finite state machine.

As for how to actually use the output to count lines of code (i.e. lines on which at least one "code" token, meaning any token except a comment, starts), see the algorithm I've described earlier.
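To make the finite state machine idea concrete, here is a minimal sketch that folds tokenisation and counting into a single pass over the characters. It is an illustration only, not code from any real tool: the name countLinesWithCode is made up, raw string literals and digit separators are not handled, and it counts every line that contains some code character, so a code token spanning several lines is counted once per line rather than only on the line where it starts.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch (not from the thread): count physical lines that
// contain at least one character of code, i.e. anything other than
// whitespace or comment text.
std::size_t countLinesWithCode(const std::string& src) {
    enum class State { Code, LineComment, BlockComment, String, Char };
    State state = State::Code;
    std::size_t lines = 0;
    bool lineHasCode = false;  // has the current line contributed code yet?
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        if (c == '\n') {                       // end of a physical line
            if (lineHasCode) ++lines;
            lineHasCode = false;
            if (state == State::LineComment) state = State::Code;
            continue;
        }
        switch (state) {
        case State::Code:
            if (c == '/' && next == '/') { state = State::LineComment; ++i; }
            else if (c == '/' && next == '*') { state = State::BlockComment; ++i; }
            else if (c != ' ' && c != '\t' && c != '\r') {
                lineHasCode = true;
                if (c == '"') state = State::String;
                else if (c == '\'') state = State::Char;
            }
            break;
        case State::LineComment:
            break;                             // ignore until end of line
        case State::BlockComment:
            if (c == '*' && next == '/') { state = State::Code; ++i; }
            break;
        case State::String:
        case State::Char:
            lineHasCode = true;                // literal text is code
            if (c == '\\') ++i;                // skip the escaped character
            else if (c == (state == State::String ? '"' : '\'')) state = State::Code;
            break;
        }
    }
    if (lineHasCode) ++lines;                  // last line without a trailing '\n'
    return lines;
}
```

Run on the example program shown earlier, this counts the #include line, the "int main() {" line, the return line and the closing brace; the multi-line comment and the blank line contribute nothing.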

Gerardogeratology answered 4/7, 2012 at 16:51 Comment(9)
this wouldn't work for anything but the simplest cases. What about multiline comments? What I need to see is not some algorithm made up by someone in five minutes, which I can see doesn't work after looking at it for 2 sec., but a real world algorithm which is actually employed in a real application.Hide
I'm afraid you haven't understood what I described. I'll try to elaborateGerardogeratology
I've expanded my answer, I hope you understand now; also please try to remain polite on SO and avoid disdaining answerers who dedicate their time to help youGerardogeratology
thanks for your answer, still I believe that your algorithm wouldn't count every line correctly. What if you have a multiline comment spanning a few lines? Your algorithm doesn't take this into account.Hide
And in which place was I impolite?Hide
After tokenisation any comment, be it one-line or multi-line, is represented as one token. See the exampleGerardogeratology
@Gerardogeratology The way I do it, comments are reduced to white space, and the tokenizer doesn't return it. I used to have a tokenizer for C++ floating around, written in lex; I'd pop a simple DFA on top of it, and use it to calculate code metrics. (Of course, using lex, it didn't handle non-ascii. But that wasn't an issue in the code I used it on.)Freehanded
@JamesKanze when I worked on Eclipse CDT, IIRC the lexer grabbed comments too; this was useful for keeping track of documentation for indexed symbols. There was even a time when the AST itself contained comment nodes (they gave it up though).Gerardogeratology
@Gerardogeratology It depends on what you're trying to do. The lexer for Doxygen or JavaDoc must grab comments. If you're trying to develop statistics on actual code, however, you probably don't want it to grab the comments.Freehanded
C
2

I think part of the reason people are having so much trouble understanding your problem is because "Count the lines of c++" is itself an algorithm. Perhaps what you're trying to ask is "How do I identify a line of c++ in a file?" That is an entirely different question which Kos seems to have done a pretty good job trying to explain.

Carpal answered 5/7, 2012 at 21:21 Comment(5)
His (Kos's) algorithm is incorrect. Wouldn't count correctly lines of code in more complicated scenarios.Hide
@smallB: His solution covers that just fine. All you have to do is once you see a code token in a line, stop reading that line.Carpal
but his algorithm doesn't do that does it? And there are also other scenarios which his algorithm is not cut (yes not cut) for. So when I say that his algorithm is incorrect this means that his algorithm in the form he presented it would NOT count correctly lines of code in every possible scenario.Hide
@Hide I am going to remove myself from this conversation.Carpal
@Hide you are incorrect in asserting that the answer Kos provided does not work. The algorithm as described finds unique lines with at least one non-comment token. Since multi-line comments collate to a single token, this covers all lines with actual code. Your example of where it should fail is incorrect; it would count exactly one line. (Yeah I know I'm late, but this thread actually pointed to some useful tools and deserved some cleanups.)Bessbessarabia
