I was doing some research on line counters for C++ projects and I'm very interested in the algorithms they use. Does anyone know where I can look at some implementations of such algorithms?
There's cloc, which is a free open-source source lines of code counter. It has support for many languages, including C++. I personally use it to get the line count of my projects.
On its SourceForge page you can find the Perl source code for download.
Well, if by line counters you mean programs which count lines, then the algorithm is pretty trivial: just count the number of '\n' in the code. If, on the other hand, you mean programs which count C++ statements, or produce other metrics... Although not 100% accurate, I've gotten pretty good results in the past just by counting '}' and ';' (ignoring those in comments and string and character literals, of course). Anything more accurate would probably require parsing the actual C++.
wc -l. And the algorithm really is to just count the '\n'; it's even more primitive than counting '}' and ';' (which does give a good first-order approximation of the number of statements in the program). – Freehanded
if ( a > b ) std::cout << a; can be one line or two, depending on how the programmer formats it? And what about a #define defined on several lines? The concept doesn't seem simple to me. (Or rather, I can't see where the simple interpretation could provide any useful information.) – Freehanded
You don't need to actually parse the code to count lines of code; it's enough to tokenise it.
The algorithm could look like:
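int lastLine = -1;   // line of the most recently counted code token
int lines = 0;       // number of lines containing code so far
for each token {     // tokens are visited in source order
    // count a line only once, the first time a code token starts on it
    if (isCode(token) && lastLine != token.line) {
        ++lines;
        lastLine = token.line;
    }
}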
The only information you need to collect during tokenisation is:
- what type of token it is (an operator, an identifier, a comment...). You don't actually need to be very precise here, as you only need to distinguish "non-code tokens" (comments) from "code tokens" (anything else)
- on which line of the file the token occurs.
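As a rough illustration, the counting loop could be written in C++ like this. The Token and TokenKind types are hypothetical and carry only the two pieces of information listed above; the sketch also assumes the tokens arrive in source order.

```cpp
#include <vector>

// Hypothetical token model: just enough for counting lines of code.
enum class TokenKind { Comment, Code };

struct Token {
    TokenKind kind;  // comment or anything else
    int line;        // 1-based line on which the token starts
};

// A token counts as code unless it is a comment.
bool isCode(const Token& token) {
    return token.kind != TokenKind::Comment;
}

// Counts the lines on which at least one code token starts.
// Assumes the tokens are in source order (non-decreasing line numbers).
int countCodeLines(const std::vector<Token>& tokens) {
    int lastLine = -1;
    int lines = 0;
    for (const Token& token : tokens) {
        if (isCode(token) && lastLine != token.line) {
            ++lines;
            lastLine = token.line;
        }
    }
    return lines;
}
```

The tokeniser itself is left out here; anything that can fill a std::vector<Token> would do.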
How to tokenise is for you to figure out, but hand-writing a tokeniser for such a simple case shouldn't be hard. You could use flex, but that's probably overkill.
EDIT
I've mentioned "tokenisation", let me describe it for you quickly:
Tokenisation is the first stage of compilation. The input of tokenisation is text (multi-line program), and the output is a sequence of "tokens", as in: symbols with some meaning. For instance, the following program:
#include "something.h"
/*
This is my program.
It is quite useless.
*/
int main() {
    return something(2+3); // this is equal to 5
}
could look like:
PreprocessorDirective("include")
StringLiteral("something.h")
PreprocessorDirectiveEnd
MultiLineComment(...)
Keyword(INT)
Identifier("main")
Symbol(LeftParen)
Symbol(RightParen)
Symbol(LeftBrace)
Keyword(RETURN)
Identifier("something")
Symbol(LeftParen)
NumericLiteral(2)
Operator(PLUS)
NumericLiteral(3)
Symbol(RightParen)
Symbol(Semicolon)
SingleLineComment(" this is equal to 5")
Symbol(RightBrace)
Et cetera.
Tokens, depending on their type, may have arbitrary metadata attached to them (e.g. the symbol type, the operator type, the identifier text, or perhaps the number of the line where the token was found).
Such a stream of tokens is then fed to the parser, which uses grammar production rules written in terms of these tokens to build, for instance, a syntax tree.
Writing a full parser that would give you a complete syntax tree of the code is challenging, and especially challenging if it's C++ we're talking about. However, tokenising (or "lexing" or "lexical analysis") is easier, especially when you're not concerned with many details, and you should be able to write a tokeniser yourself using a finite state machine.
As for how to actually use the output to count lines of code (i.e. lines on which at least one "code" token, i.e. any token except a comment, starts), see the algorithm I've described earlier.
lex; I'd pop a simple DFA on top of it, and use it to calculate code metrics. (Of course, using lex, it didn't handle non-ASCII. But that wasn't an issue in the code I used it on.) – Freehanded
I think part of the reason people are having so much trouble understanding your problem is that "Count the lines of C++" is itself an algorithm. Perhaps what you're trying to ask is "How do I identify a line of C++ in a file?" That is an entirely different question, which Kos seems to have done a pretty good job of explaining.
Algorithms for line counting. – Emmieemmit