How to detect code duplication during development? [closed]

13

88

We have a fairly large code base, 400K LOC of C++, and code duplication is something of a problem. Are there any tools which can effectively detect duplicated blocks of code?

Ideally this would be something that developers could use during development rather than just run occasionally to see where the problems are. It would also be nice if we could integrate such a tool with CruiseControl to give a report after each check-in.

I looked at Duploc some time ago. It showed a nice graph, but it requires a Smalltalk environment to run, which makes running it automatically rather difficult.

Free tools would be nice, but if there are some good commercial tools I would also be interested.

Sublime answered 10/10, 2008 at 14:34 Comment(2)
Whenever somebody uses the paste button :-} - Panhellenism
Related question - #2491384 - Bayne
40

Simian detects duplicate code in C++ projects.

Update: It also works with Java, C#, C, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, Groovy source code, and even plain text files.

Chanticleer answered 10/10, 2008 at 14:40 Comment(5)
Would Simian seamlessly check .mm files, a.k.a. Objective-C++? - Wolsky
@Wolsky Simian can do plain text checks, so it can detect code duplication in any language. - Hyla
Note that it's not free for commercial use. - Acescent
It does not seem to work recursively when I just run it with the default settings on a directory in Linux. - Aristides
I followed the link but the installation instructions do not show how to download it. - Denaturalize
21

I've used PMD's Copy/Paste Detector (CPD) and integrated it into CruiseControl using the wrapper script below (be sure to have the PMD jar on the classpath).

Our check runs nightly. If you wish to limit the output to files from the current change set, you will probably need some custom programming. The idea: check all files, then list only those duplicates in which at least one changed file is involved. You have to check all files because a change could copy code from an unchanged file. This should be doable by using the XML output and parsing the result. Don't forget to post that script when it's done ;)
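One way to do that filtering, as a rough Python sketch rather than a finished script. It assumes CPD's XML report contains `duplication` elements with a `lines` attribute and child `file` elements carrying `path` and `line` attributes (check the report your CPD version actually writes). The function name `duplications_touching` and the idea of passing the change set as a list of paths are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}duplication'."""
    return tag.rsplit("}", 1)[-1]


def duplications_touching(cpd_xml, changed_files):
    """Return duplications where at least one occurrence is in a changed file.

    Each result is a tuple (lines, [(path, start_line), ...]).
    Assumed report layout: <duplication lines="N"> with <file path= line=> children.
    """
    changed = set(changed_files)
    results = []
    root = ET.fromstring(cpd_xml)
    for dup in root.iter():
        if local_name(dup.tag) != "duplication":
            continue
        occurrences = [
            (f.get("path"), int(f.get("line")))
            for f in dup
            if local_name(f.tag) == "file"
        ]
        # Keep the duplication if any of its occurrences is in the change set.
        if any(path in changed for path, _ in occurrences):
            results.append((int(dup.get("lines")), occurrences))
    return results
```

Feed it the report text and the paths from your change set. The paths must be spelled the same way CPD writes them (absolute vs. relative), so you may need to normalize them first.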

For starters the "text" output should be OK, but you will want to display the results in a user-friendly way. For that I use a Perl script to generate HTML files from the "xml" output of CPD. They are made accessible by publishing them to the Tomcat instance where CruiseControl's reporting JSP resides. The developers can view them from there and see the results of their dirty hacking :)
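If you would rather not write Perl, the same idea can be sketched in Python. This is only an illustration, not the script described above. It assumes the XML report contains `duplication` elements with a `lines` attribute and `file` children with `path` and `line` attributes, and `cpd_xml_to_html` is a made-up name.

```python
import html
import xml.etree.ElementTree as ET


def cpd_xml_to_html(cpd_xml):
    """Render a CPD-style XML report as a minimal HTML table (assumed layout)."""
    root = ET.fromstring(cpd_xml)
    rows = []
    for dup in root.iter():
        # Compare local names so a namespaced report still matches.
        if dup.tag.rsplit("}", 1)[-1] != "duplication":
            continue
        places = [
            "{}:{}".format(f.get("path"), f.get("line"))
            for f in dup
            if f.tag.rsplit("}", 1)[-1] == "file"
        ]
        rows.append(
            "<tr><td>{}</td><td>{}</td></tr>".format(
                html.escape(dup.get("lines", "?")),
                "<br>".join(html.escape(p) for p in places),
            )
        )
    return (
        "<html><body><h1>Duplicated code</h1>\n"
        "<table border=\"1\">\n"
        "<tr><th>Lines</th><th>Occurrences</th></tr>\n"
        + "\n".join(rows)
        + "\n</table></body></html>"
    )
```

Write the returned string to a file in whatever directory your reporting web application serves.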

It runs quite fast: less than 2 seconds on 150 KLOC of code (empty lines and comments not counted in that number).

duplicatecheck.xml:

<project name="duplicatecheck" default="cpd">

<property name="files.dir" value="dir containing your sources"/>
<property name="output.dir" value="dir containing results for publishing"/>

<target name="cpd">
    <taskdef name="cpd" classname="net.sourceforge.pmd.cpd.CPDTask"/>
    <cpd minimumTokenCount="100" 
         language="cpp" 
         outputFile="${output.dir}/duplicates.txt"
         ignoreLiterals="false"
         ignoreIdentifiers="false"
         format="text">
        <fileset dir="${files.dir}/">
            <include name="**/*.h"/>
            <include name="**/*.cpp"/>
                <!-- exclude third-party stuff -->
            <exclude name="boost/"/>
            <exclude name="cppunit/"/>
        </fileset>
    </cpd>
</target>

</project>

Mira answered 24/11, 2008 at 17:12 Comment(3)
Be sure to use the latest version from the SourceForge page! Their documentation page suggests a version from 2011, while there is active development. In my case version 5.5 works much better than the version 4.2 they link on their homepage. - Solorio
There is no mention of C++ support in their documentation anymore either. - Aristides
The latest link to PMD's duplicate code page is pmd.github.io/latest/pmd_userdocs_cpd.html#supported-languages - Webbing
7

duplo appears to be a C implementation of the algorithm used in Duploc. It is simple to compile and install, and while the options are limited it seems to more or less work out-of-the-box.

Millrun answered 17/12, 2008 at 4:54 Comment(1)
I just tried it on a rather old legacy file containing a great many near-duplicates and compared that with a recent file containing a few exact duplicates. The legacy file was rated perfect; the new file was rated bad. - Xeric
6

These Debian packages seem to do something along these lines:

P.S. There ought to be a debtags tag for all tools related to finding [near] duplication. (But what would it be called?)

Kokoruda answered 13/3, 2012 at 22:4 Comment(1)
Perfect answer. The package similarity-tester has a comprehensive man page (man 1 sim), works out of the box, and produces convincing results. - Scholiast
5

Look at the PMD project.

I've never used it, but have always wanted to.

Unsnap answered 10/10, 2008 at 14:43 Comment(1)
I thought that PMD was just for Java, but I now see that CPD (which is part of PMD) can be used for C++ as well. - Sublime
3

Well, you can run a clone detector on your source code base every night.

Many clone detectors work by comparing source lines, and can only find exact duplicate code.
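To make the line-based approach concrete, here is a toy sketch in Python (not taken from any of the tools mentioned here). It slides a fixed-size window over each file's non-blank lines, ignoring only leading and trailing whitespace, and reports windows that occur more than once. `find_duplicate_blocks` and the `window` parameter are names invented for this example.

```python
from collections import defaultdict


def find_duplicate_blocks(files, window=3):
    """Map each repeated run of `window` lines to the places it occurs.

    `files` maps a file name to its text. Lines are stripped of surrounding
    whitespace and blank lines are skipped before comparison.
    """
    seen = defaultdict(list)
    for name, text in files.items():
        # Keep the original 1-based line number next to each normalized line.
        lines = [
            (number, line.strip())
            for number, line in enumerate(text.splitlines(), start=1)
            if line.strip()
        ]
        for i in range(len(lines) - window + 1):
            block = tuple(content for _, content in lines[i:i + window])
            seen[block].append((name, lines[i][0]))
    # Only blocks seen at least twice count as duplicates.
    return {block: places for block, places in seen.items() if len(places) > 1}
```

Rename a single variable in one copy and the match disappears, which is exactly the limitation of comparing lines.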

CCFinder, mentioned in another answer, works by comparing language tokens, so it isn't sensitive to whitespace changes. It can detect clones that are variants of the original code if there are only single-token changes (e.g., changing a variable X to Y in the clone).
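A toy illustration of the token-based idea, in Python. This is not CCFinder's actual algorithm, just a sketch of the general principle: split the source into tokens, replace every identifier with a placeholder, then look for repeated token runs. The regex tokenizer, the tiny keyword list, and the names `normalized_tokens` and `find_token_clones` are all assumptions made for the example; a real tool uses a proper lexer for each language.

```python
import re
from collections import defaultdict

# Crude tokenizer: identifiers, integer literals, or any single other character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
# Deliberately tiny keyword list, for illustration only.
KEYWORDS = {"int", "return", "if", "else", "for", "while", "void"}


def normalized_tokens(source):
    """Tokenize crudely and replace every non-keyword identifier with 'ID'."""
    tokens = []
    for tok in TOKEN.findall(source):
        if (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            tokens.append("ID")
        else:
            tokens.append(tok)
    return tokens


def find_token_clones(files, min_tokens=8):
    """Map each repeated run of `min_tokens` tokens to (file, token offset) pairs."""
    seen = defaultdict(list)
    for name, text in files.items():
        tokens = normalized_tokens(text)
        for i in range(len(tokens) - min_tokens + 1):
            seen[tuple(tokens[i:i + min_tokens])].append((name, i))
    return {seq: places for seq, places in seen.items() if len(places) > 1}
```

Because whitespace never becomes a token and identifiers are all mapped to the same placeholder, reformatted or renamed copies still match. Positions are token offsets rather than line numbers, so a real report would need to map them back to lines.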

Ideally you want the above plus the ability to find clones where the variations are allowed to be relatively arbitrary, e.g., replacing a variable by an expression, a statement by a block, etc.

Our CloneDR clone detector does this for Java, C#, C++, COBOL, VB.net, VB6, Fortran and a variety of other languages. It can be seen at: http://www.semdesigns.com/Products/Clone/index.html

As well as being able to handle multiple languages, the CloneDR engine is capable of handling a variety of input encoding styles, including ASCII, ISO-8859-1, UTF8, UTF16, EBCDIC, a number of Microsoft encodings, and (Japanese) Shift-JIS.

The site has several clone detection run example reports, including one for C++.

EDIT Feb 2014: Now handles all of C++14.

Panhellenism answered 28/6, 2009 at 19:27 Comment(0)
2

CCFinderX is a free (for in-house use) cloned code detector that supports multiple programming languages (Java, C, C++, COBOL, VB, C#).

Cabala answered 11/10, 2008 at 4:55 Comment(1)
Thanks for this link. I will definitely look at it. What is even better is that there is a Japanese version (all other developers on the project apart from me are Japanese). - Sublime
2

Same (http://sourceforge.net/projects/same/) is extremely plain, but it works on text lines instead of tokens, which is useful if you're using a language that isn't supported by one of the fancier clone finders.

Trimeter answered 25/8, 2009 at 16:10 Comment(0)
2

There is also Simian, which supports Java, C#, C++, C, Objective-C, JavaScript...

It's supported by Hudson (like CPD).

Unless you're an open source project, you must pay for Simian.

Dappled answered 15/7, 2010 at 22:18 Comment(0)
2

ConQAT is a great tool which supports C++ code analysis. It can find duplicates while ignoring whitespace, and it has extremely handy GUI and console interfaces. Because of its flexibility it is not easy to set up. I've found this blog post very useful for setting up a C++ project.

Downpipe answered 3/8, 2013 at 14:48 Comment(0)
2

You can use our SourceMeter tool for detecting code duplication. It is a command-line tool (very similar to a compiler), so you can easily integrate it into continuous integration tools such as CruiseControl, which you mentioned, or Jenkins.

Amiraamis answered 31/7, 2015 at 16:12 Comment(1)
SourceMeter is free, but is quite unfriendly during setup, or probably it just doesn't support VS2019 with v142. - Mirage
1

Finding "identical" code snippets is relatively easy; there are existing tools that already do this (see other answers).

Removing duplication is sometimes a good thing and sometimes not. Done at too fine a "level", it can bog down development: you refactor so much code that you lose sight of your goal (and probably bust your milestones and schedules).

What is harder is finding multiple functions or methods that do the same thing but with different (but similar) inputs and/or algorithms, and without proper documentation.

If you have two or more different methods that do the same thing, and a programmer fixes one instance but forgets to fix the others (or does not know they exist), you increase the risk to your software.

Supercargo answered 24/11, 2008 at 17:25 Comment(3)
... and as a practical matter, you aren't going to be able to detect that two pieces of code do the same thing if they are implemented differently. There's a Turing machine standing in your way. - Panhellenism
"What is harder is to find multiple function/method that do the same thing but with different (but similar) inputs and/or algorithm without proper documentation." Right. And if they DO the same thing, they should be NAMED the same, since the name should describe why that code exists in the first place. So step one might be to make sure that all functions/methods are accurately named and documented. If the name truly describes what it does, similarities and identities will quickly become obvious. - Pyrophosphate
Trouble is, even a "does the same thing" oracle (which I believe to be significantly more powerful than a halting oracle?) wouldn't help you figure out if two names express (or are intended to express) the same idea. (Besides which, there would tend to be a lot of false positives.) - Kokoruda
-3

TeamCity has a powerful code duplication engine for .NET and Java that can effortlessly run as part of your build system.

Cottingham answered 17/11, 2008 at 16:20 Comment(1)
Neither .NET nor Java is C++, so while this may be effortless to run, it is also fruitless. - Tasteless
