Automated-refactoring tool to find similar duplicate source code for Java/Javascript? [closed]

Asked 25/11, 2016 at 6:49 Answered 28/11, 2016 at 3:14

I'm looking for a tool to find duplicate or similar code in Java/JavaScript. I can't give an exact definition of "similar", but I'd like the tool to be smart enough to give me advice on refactoring the code, e.g.,

(1) Class A and class B have similar methods (e.g., 5 methods with the same name, the same arguments, and similar implementations appear in both classes); the tool should then advise moving these similar methods into a base class.
(2) Class A has similar lines of code repeated in several places; the tool should advise moving these lines into a single method.

I tried PMD, which can find duplicate lines of code, but it's not clever enough. It did not find the similar code that is widely spread across one of my projects.

Is there such a tool?

Agro answered 25/11, 2016 at 6:49 Comment(2)

afaik PMD has options to define the degree of similarity it accepts. Did you try these? – Kovrov 25/11, 2016 at 7:41

IDEA is the boss! – Majorette 18/4, 2017 at 20:17

Our CloneDR tool finds duplicated code by comparing abstract syntax trees from parsers. (It comes in language-specific versions for many languages, including Java and JavaScript).

This means it can find cloned code in spite of format changes and modifications of the body of the clone, both of which are often done while cloning. Found clones match language concepts such as expression, declaration, statements, functions, and even classes. Clones that are similar are reported along with the differences/variation points as proposed parameters.
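To illustrate why comparing syntax trees is insensitive to formatting, here is a minimal sketch using Python's standard `ast` module. It is not CloneDR's implementation; the function name `same_structure` and the sample snippets are assumptions made up for the example, and Python stands in for Java/JavaScript only because its parser ships with the language.

```python
import ast

def same_structure(src_a: str, src_b: str) -> bool:
    """Return True if both snippets parse to the same AST.

    ast.dump() omits line/column attributes by default, so whitespace,
    comments, and redundant parentheses do not affect the comparison.
    """
    return ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b))

# Hypothetical snippets: `b` is a reformatted copy of `a`, `c` changes a constant.
a = "def f(x):\n    return x * 2 + 1\n"
b = "def f( x ):   # reformatted copy\n    return (x*2) + 1\n"
c = "def f(x):\n    return x * 3 + 1\n"

print(same_structure(a, b))  # reformatted copy still matches
print(same_structure(a, c))  # a modified body is no longer an exact match
```

An exact-match test like this only catches identical trees; tolerating small modifications requires a similarity measure rather than equality.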

It can find clone sets with multiple instances (we've some applications with hundreds of clones of a single bit of code), and it can find clones across many source files.

It produces HTML reports that are directly readable by people, and XML reports that can be processed by other downstream tools. (You can see some sample HTML reports via the link).

Similarity is hard to define, and in fact you can define it in many ways. CloneDR defines it as the number of identical elements (technically, AST nodes) across a clone set divided by the total number of elements across the clone set. This ratio is a value between 0 and 1. It is compared against a threshold; we've found that 0.95 (95%) is surprisingly robust as a threshold in terms of the quality of reported clones.
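The ratio described above can be approximated for a pair of snippets as follows. This is a rough sketch under stated assumptions, not CloneDR's algorithm: it flattens each tree into a sequence of node labels and aligns the sequences with `difflib.SequenceMatcher`, whereas a real tool compares trees directly. The names `node_labels`, `similarity`, and `is_clone` are hypothetical.

```python
import ast
from difflib import SequenceMatcher

def node_labels(source: str) -> list:
    """Flatten a snippet's AST into a list of node labels (assumed encoding)."""
    labels = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            labels.append(f"Name:{node.id}")
        elif isinstance(node, ast.Constant):
            labels.append(f"Constant:{node.value!r}")
        else:
            labels.append(type(node).__name__)
    return labels

def similarity(src_a: str, src_b: str) -> float:
    """Identical elements across both instances / total elements across both."""
    a, b = node_labels(src_a), node_labels(src_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # Each matched element is counted once per instance, hence the factor 2.
    return 2 * matched / (len(a) + len(b))

def is_clone(src_a: str, src_b: str, threshold: float = 0.95) -> bool:
    """Apply the threshold mentioned in the answer (0.95 by default)."""
    return similarity(src_a, src_b) >= threshold

# Hypothetical near-duplicate statements that differ in one identifier.
first = "total = price * qty + tax"
second = "total = price * qty + fee"
print(similarity(first, second))
```

On snippets this small a single differing identifier moves the ratio a lot, which is one reason a minimum clone size is needed as well.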

It is useful to establish a minimum size for interesting clones. a*b is a clone of x*y (with 2 parameters) but isn't useful to report because it is too small. CloneDR also uses a size threshold which we call "line count", but in fact is the size of the clone in elements divided by the average number of elements per line across the entire code base. This produces clones which usually have more lines than the threshold, but it will find clones for enormous expressions that are within a line. We've found that 5-6 "lines" is also fairly robust in terms of reported clone quality.
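The "line count" normalization can be sketched like this. It is an illustration of the arithmetic only, with assumed names (`element_count`, `average_elements_per_line`, `effective_line_count`, `is_interesting`) and a tiny made-up code base; blank lines are excluded from the line total as an assumption of this sketch.

```python
import ast

def element_count(source: str) -> int:
    """Number of AST nodes ("elements") in a snippet."""
    return sum(1 for _ in ast.walk(ast.parse(source)))

def average_elements_per_line(code_base: list) -> float:
    """Average elements per non-blank line across all files in the code base."""
    total_elements = sum(element_count(src) for src in code_base)
    total_lines = sum(
        1 for src in code_base for line in src.splitlines() if line.strip()
    )
    return total_elements / total_lines

def effective_line_count(clone_source: str, elements_per_line: float) -> float:
    """Clone size in elements, expressed in 'average lines'."""
    return element_count(clone_source) / elements_per_line

def is_interesting(clone_source: str, elements_per_line: float,
                   min_lines: float = 5) -> bool:
    return effective_line_count(clone_source, elements_per_line) >= min_lines

# Hypothetical code base of two tiny "files".
code_base = ["a = 1\nb = 2\n", "c = a * b\n"]
avg = average_elements_per_line(code_base)

small = "y = a * b"                              # too small to report
huge = "x = " + " + ".join(["a * b"] * 40)       # one physical line, many elements

print(is_interesting(small, avg))
print(is_interesting(huge, avg))
```

The second case shows the effect described above: a single very long expression can pass the size threshold even though it occupies only one physical line.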

This table shows how effective the AST matching approach of CloneDR is compared to many other clone detection tools (ranking it “very well”). The only one that comes close is CCDIML …. which is an academic re-implementation of the CloneDR approach. There are other approaches (namely PDG-based approaches) which can detect clones that are scattered about more effectively, but in practice, in my personal experience, people that clone code don’t usually cut the cloned part into a bunch of separate parts to scatter them about; they are just too lazy. YMMV.

[Table from: Roy, Cordy, Koschke: Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach , Science of Computer Programming, Volume 74 Issue 7, May, 2009. This paper sketches many different clone detection approaches and evaluates their effectiveness.]

[PMD isn't listed, but it apparently uses Rabin-Karp string matching, which makes it "text based" according to the above table, rather than AST matching.]
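For contrast, a text-based detector in the Rabin-Karp style can be sketched as a rolling hash over fixed-size token windows. This is a generic illustration, not PMD's actual code; the function name `find_duplicate_windows`, the whitespace tokenization, and the hash parameters are all assumptions.

```python
from collections import defaultdict

def find_duplicate_windows(tokens, window=4, base=257, mod=(1 << 61) - 1):
    """Return lists of start indices whose `window`-token runs are identical."""
    if len(tokens) < window:
        return []
    # Map each distinct token to a small integer code.
    ids = {}
    codes = [ids.setdefault(tok, len(ids) + 1) for tok in tokens]

    high = pow(base, window - 1, mod)  # weight of the token leaving the window
    h = 0
    for code in codes[:window]:
        h = (h * base + code) % mod

    buckets = defaultdict(list)
    buckets[h].append(0)
    for i in range(1, len(codes) - window + 1):
        # Roll the hash: drop codes[i - 1], append codes[i + window - 1].
        h = ((h - codes[i - 1] * high) * base + codes[i + window - 1]) % mod
        buckets[h].append(i)

    groups = []
    for starts in buckets.values():
        # Re-check the actual tokens to rule out hash collisions.
        by_text = defaultdict(list)
        for s in starts:
            by_text[tuple(tokens[s:s + window])].append(s)
        groups.extend(g for g in by_text.values() if len(g) > 1)
    return groups

# Hypothetical token stream with one repeated statement.
tokens = "a = b + c ; x = 1 ; a = b + c ;".split()
print(find_duplicate_windows(tokens, window=6))
```

Because the comparison is on raw tokens, renaming a variable inside a copy is enough to hide it from this kind of matcher, which is the limitation the question ran into.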

Re OP's requirements:

CloneDR (like every other tool I know of) will NOT find a set of similar methods across multiple classes if those methods occur in different orders in the different classes. In this case, CloneDR is more likely to report the individual methods as clones; the net result is the same. It will find such a set if the members occur sequentially in the same order in the different classes, as happens when one class body has been wholesale copied from another.

Similar code blocks across multiple methods are quite commonly detected. The generated report shows how the similar code blocks are related, including an abstracted version of the code which is essentially the parameterized code block you need for a method body.

Weidner answered 25/11, 2016 at 9:53 Comment(4)

1. this 'CloneDR' tool is NOT open source, at best it's freeware = downvote. 2. your link is broken, here is a new one – Listing 8/9, 2020 at 10:44

He didn't ask for an open source tool. Thanks for the downvote for no good reason. 1. He asked for tools that were better than what he could get. CloneDR is a very good tool for this kind of job, based on the technical paper analysis, whether you believe that or not. – Weidner 8/9, 2020 at 19:17

2. Thanks for pointing out the broken link. It used to work; the internet changed some rules. Your link points to a tool that is what CloneDR is based on; I fixed the link to point (again) to the CloneDR tool itself. – Weidner 8/9, 2020 at 19:22

tried to use clone-doctor to analyze some ES6 code. problem: CDR only supports some ancient dialect of javascript called 'MicrosoftNetscape'. transpiling the ES6 code is not an option. could you at least open-source the parser interface? so we can add an ES6 parser, or build a bridge to the ANTLR parser framework – Listing 26/9, 2020 at 12:57

I have not used IntelliJ IDEA before, but I just found that it supports analyzing duplicates in its Ultimate edition.

Agro answered 28/11, 2016 at 3:14 Comment(0)

You can definitely take a look at simian.
As far as I know, it is more or less only usable with a build server.
But you can also execute it via the command line or integrate it within a local build tool like ant.

It has multiple configuration options, such as how many lines are necessary to "identify" a duplicate, and many more. The only thing that is not great, IMHO, is the generated output (XML), but I think this is also configurable.

Hope this helps!

EDIT: PMD is indeed a very good tool; maybe try using it together with simian. PMD also has very good support for integration into IDEs or editors as plugins.

Dogfish answered 25/11, 2016 at 6:55 Comment(0)
