How to count lines of code in Python excluding comments and docstrings? [closed]
J

4

21

I want to count the lines of code in a multi-file Python project as accurately as possible, but without including comments, docstrings or blank lines in the total.

I first tried using cloc, which is available as a Debian package. But cloc treats most docstrings as code - even though they are comments. (Update: no longer - recent versions of cloc now treat Python docstrings as comments.)

I notice some comments below saying that docstrings should be included in the total because they might be used by the code to influence behaviour at runtime and hence count as part of the program's code/data/config. A prominent example of this is 'ply', which asks you to write functions with docstrings which, as I recall, contain grammar and regular expressions which are central to the program's operation. However, this seems to me to be very much a rare exception. Most of the time docstrings act just like comments. Specifically, I know for a fact that this is true for all the code I want to measure. So I want to exclude them as such from my line counts.

Justinn answered 31/1, 2012 at 8:43 Comment(11)
I'd say counting comments is the right way, because in general the comments are just as valuable as the actual code linesBottle
@Bottle I must say I've had the opposite experience in 20 years of programming - comments are generally worthless because the compiler never checks them :-)Staging
Python docstrings are code - they become the __doc__ attribute of the function and can contain tests. Maybe you need to define what you mean by 'lines of code'Authoritarian
@AdrianCornish: LOC count is pretty worthless, too, so that works out just fine then.Mcminn
@AdrianCornish WTF are you talking about .. python compiler? and in your 20 years of programming you learned that "comments are generally worthless"?Funnel
Docstrings are code but not all multiline strings are docstrings, they may be used as a multiline comment without generating code.Sidereal
@Funnel I can't count the number of times I've read completely outdated and misleading comments. I believe that was Adrian Cornish's point.Calton
At one place I worked (FWIW, the best software engineering team I've ever worked on) we called comments "lies". As in "This code has no tests, but he did write lots of lies about it."Surround
The correct way to answer this question is to examine the parsed Python bytecode, or maybe the AST. Any other approaches are fraught with peril and will fail to work properly in many different circumstances. I don't have a full working solution however - just that vague hunch.Surround
@Bottle Say you are counting lines to get a measure on feature creep. You want to keep the number of code lines limited, but it would be counter productive to punish comments.Heptarchy
@ThomasAhle that's a perfect example of "you get what you measure" - counting lines is quite a stupid way to measure feature creep...Bottle
A
7

It is probably correct to include Python docstrings in a "lines of code" count. Normally a comment would be discarded by the compiler, but docstrings are parsed:

See PEP 257 - Docstring Conventions:

A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute of that object.

...

String literals occurring elsewhere in Python code may also act as documentation. They are not recognized by the Python bytecode compiler and are not accessible as runtime object attributes.

In other words, docstrings are compiled and constitute, in a very real way, the code of the program. Additionally, they're commonly used by the doctest module for unit testing, as usage strings for command line utilities, and so on.
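The PEP 257 distinction quoted above can be checked with a small sketch. It assumes a default interpreter run (no -OO), and the two function names are made up for illustration:

```python
def documented():
    """Return the answer."""
    return 42


def undocumented():
    x = 1
    """Not a docstring: it is not the first statement in the body."""
    return x


# Only a string literal that is the first statement becomes __doc__.
print(repr(documented.__doc__))
print(repr(undocumented.__doc__))
```

The second string literal is a bare expression statement that never reaches a runtime attribute, which matches the second PEP 257 paragraph.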

Authoritarian answered 31/1, 2012 at 9:8 Comment(8)
Disagree. While docstrings are compiled and can be used by the code, their use and semantic is as comments. They should be excluded from any meaningful line count.Surround
@JonathanHartley personally I think that "compiled and can be used by the code" is a good argument for it being counted.Authoritarian
Hey. I guess I feel the opposite because even though they can be used by the code, they almost never are. By which I mean, yes they are used by 'pydoc' et al, but I think the only program I've seen that stores data in docstrings and then examines that data is David Beazley's 'Ply'. So it's very rare. If you're comparing two modules to see which contains more code, and one has docstrings but the other does not, it seems most useful to me to exclude the docstrings and get the result 'they are about the same'.Surround
No part of my project relies on the docstrings in any way whatsoever. I just want to get the number of Python instructions in my program, without my huge docstrings included, so I can check my ratio of production code to test code. For my purposes it makes zero sense to include them in a "lines of code" count.Dipsomania
@leo-the-manic ok, well do whatever you need to do for "your purposes". Note that "number of Python instructions" is totally different to LOC. "number of instructions" might be a better metric, for some definitions of "better".Authoritarian
I'm not interested in counting "instructions," which I assume implies some kind of inspection of the bytecode. I want lines of non-blank, non-comment source code.Dipsomania
@leo-the-manic You're the one that said "I just want to get the number of Python instructions in my program" :). If you know what you want to count, count that. Not sure what the problem is.Authoritarian
My problem is the question asks how to get a line count without docstrings, and the first answer is, "Well you probably want the docstrings." No, I don't. On a side note, I don't see how docstrings could reasonably constitute 'the code of the program' when they are stripped away if you run Python with the -OO flag. :) A more direct answer to the question is located further down the page.Dipsomania
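The -OO remark in the last comment can be reproduced without restarting the interpreter. This is only a sketch: it relies on the built-in compile() accepting an optimize argument, where level 2 corresponds to -OO, and the source snippet and variable names are hypothetical:

```python
source = 'def f():\n    """Doc."""\n    return 1\n'

normal_ns, stripped_ns = {}, {}

# optimize=0 keeps docstrings; optimize=2 mirrors running Python with -OO.
exec(compile(source, "<demo>", "exec", optimize=0), normal_ns)
exec(compile(source, "<demo>", "exec", optimize=2), stripped_ns)

print(repr(normal_ns["f"].__doc__))
print(repr(stripped_ns["f"].__doc__))
```

Both versions of f behave identically when called; only the __doc__ attribute differs.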
C
8

Tahar doesn't count the docstrings. Here's its count_loc function:

def count_loc(lines):
    nb_lines  = 0
    docstring = False
    for line in lines:
        line = line.strip()

        if line == "" \
           or line.startswith("#") \
           or docstring and not (line.startswith('"""') or line.startswith("'''"))\
           or (line.startswith("'''") and line.endswith("'''") and len(line) >3)  \
           or (line.startswith('"""') and line.endswith('"""') and len(line) >3) :
            continue

        # this is either a starting or ending docstring
        elif line.startswith('"""') or line.startswith("'''"):
            docstring = not docstring
            continue

        else:
            nb_lines += 1

    return nb_lines
Calton answered 5/1, 2013 at 10:17 Comment(9)
Thank you for the reasonable recommendation, and for not making preposterous and pontificating claims, like your fellow responders, about docstrings being code. Lines of code is a valid (and in fact the best: herraiz.org/blog/2010/11/22/making-software-is-out) measure of code complexity and when I need that complexity to reflect the raw source code (rather than my copious amount of math notes in docstrings), I need to omit docstrings!Oread
I believe that the doc in docstrings is for documentation.Calton
The above code would fail on docstrings which use single quotes, or on some regular strings which use triple quotes. The right way to solve this problem is to look at the AST.Surround
@JonathanHartley Can you provide an example where the code would possibly fail ?Calton
@ychaouche. Hey. Any docstring which doesn't use triple quotes will be counted as a code line. Conversely, any regular code which uses triple-quotes will be counted as a docstring (see the example in wim's answer.)Surround
...and, on reflection, it fails if the docstring ends with a line containing text then the closing triple quotes. And in lines consisting of just four or five consecutive quote chars. This simply isn't a suitable way to try and detect docstrings.Surround
@JonathanHartley I tried the function on OP's code and correctly outputs 2, discarding the docstring.Calton
But you are correct it won't work if the docstring doesn't use triple (simple or double) quotes, it only knows about """ docstring """ and ''' docstring '''.Calton
@Calton The OP's code doesn't fall into any of the categories I enumerated.Surround
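The AST route suggested in these comments could look roughly like the sketch below. It is not Tahar's method and not anything posted by the commenters. It assumes Python 3.8+ (for end_lineno), uses ast to locate docstrings and tokenize to find lines that carry real tokens, and the names count_code_lines, docstring_lines and sample are invented for this example:

```python
import ast
import io
import tokenize

# Token types that never make a line count as code.
NON_CODE = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENCODING,
    tokenize.ENDMARKER,
}


def docstring_lines(tree):
    """Return the line numbers occupied by module/class/function docstrings."""
    lines = set()
    owners = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(tree):
        if not isinstance(node, owners) or not node.body:
            continue
        first = node.body[0]
        # A docstring is a bare string constant as the first statement.
        if (isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)):
            lines.update(range(first.lineno, first.end_lineno + 1))
    return lines


def count_code_lines(source):
    """Count lines holding at least one token that is not a comment or docstring."""
    doc_lines = docstring_lines(ast.parse(source))
    code_lines = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in NON_CODE:
            continue
        if tok.type == tokenize.STRING and tok.start[0] in doc_lines:
            continue  # string token that belongs to a docstring
        code_lines.update(range(tok.start[0], tok.end[0] + 1))
    return len(code_lines)


sample = '''"""Module docstring."""

# a comment
def f(x):
    """Docstring
    spanning lines."""
    y = """not a
    docstring"""
    return x  # trailing comment
'''

print(count_code_lines(sample))
```

Because the docstring test is structural, single-quoted docstrings are excluded and triple-quoted strings used as ordinary values are still counted, which covers the failure cases listed above. A multi-line string value counts once per physical line it spans; whether that is desirable depends on what the count is for.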
F
7

Comment lines can be lines of code in python. See doctest for example.

Moreover, you will have trouble finding a sensible, reliable way to decide whether a case like this is a comment or code:

foo = ('spam', 
       '''eggs
          eggs
          eggs'''
       '''more spam''',
       'spam')

Just count the comment lines as well, I think most programmers will agree it is as good a measure for whatever you are actually trying to measure.

Funnel answered 31/1, 2012 at 9:8 Comment(1)
Disagree. While technically docstrings are compiled and accessible from code, the vast predominance of their usage and semantics is as comments. They should be excluded from line counts. The way to detect ambiguous looking cases like the one in this answer is to do the line count using the AST.Surround
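As a rough check of the AST suggestion against the foo example from this answer, the sketch below parses that snippet and asks the ast module what it sees. It assumes Python 3.8+ for end_lineno; the variable names are illustrative only:

```python
import ast

source = """foo = ('spam',
       '''eggs
          eggs
          eggs'''
       '''more spam''',
       'spam')
"""

tree = ast.parse(source)
stmt = tree.body[0]

# The module has no docstring: its first statement is an assignment.
print(ast.get_docstring(tree))
# The whole tuple, triple-quoted strings included, is one Assign statement.
print(type(stmt).__name__, stmt.lineno, stmt.end_lineno)
```

The parser reports a single assignment spanning all six lines and no docstring, so a counter built on the AST would treat every one of those lines as code.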
S
3

Have you looked at http://www.ohloh.net/p/ohcount - always been pretty on the money for me - although I do not use python

Staging answered 31/1, 2012 at 8:45 Comment(1)
Thanks, but like cloc this tool also counts docstrings with triple apostrophes as code, so it's also not really Python-aware.Justinn
