How to detect the amount of almost-repetition in a text file?

I am a programming teacher, and I would like to write a script that detects the amount of repetition in a C/C++/Python file. I guess I can treat any file as pure text.

The script's output would be the number of similar sequences that repeat. Ultimately, I am only interested in a DRY metric (how much the code satisfies the DRY principle).

Naively, I tried a simple autocorrelation, but it is hard to find the proper threshold.

import numpy as np
import matplotlib.pyplot as plt

u = open("find.c").read()
v = [ord(x) for x in u]                  # file as a numeric signal
y = np.correlate(v, v, mode="same")      # autocorrelation
y = y[: int(len(y) / 2)]                 # keep one half (it is symmetric)

x = range(len(y))
z = np.polyval(np.polyfit(x, y, 3), x)   # cubic trend

f = (y - z)[:-5]                         # detrended autocorrelation
plt.plot(f)
plt.show()

[plot: detrended autocorrelation of the file]

So I am looking at different strategies... I also tried to compare the similarities between each line, each group of 2 lines, each group of 3 lines, and so on:

import difflib
import numpy as np

lines = open("b.txt").readlines()
lines = [line.strip() for line in lines]

n = 3          # size of the line group being compared
d = []
for i in range(len(lines)):
    a = lines[i:i+n]
    group_size = sum(len(x) for x in a)
    if group_size < 5:                  # skip short groups
        continue

    for j in range(len(lines)):
        if i == j:                      # skip comparing a group with itself
            continue
        b = lines[j:j+n]
        ratio = 0
        for u, v in zip(a, b):
            r = difflib.SequenceMatcher(None, u, v).ratio()
            ratio += r if r > 0.7 else 0
        d.append(ratio)

dry = sum(d) / len(lines)

In the following, we can identify some repetition at a glance:

w = max(1, len(d) // 100)                       # smoothing window
e = np.convolve(d, np.ones(w), "valid") / w * 10
plt.plot(range(len(d)), d, range(len(e)), e)
plt.show()

[plot: per-group similarity ratios and their smoothed average]

Why not use:

d = np.exp(np.array(d))

[plot: the same curve after applying np.exp]

Thus, the difflib module looks promising; SequenceMatcher does some magic (Levenshtein?), but I would also need some magic constants (the 0.7 threshold)... However, this code is O(n^2) or worse and runs very slowly on long files.

What is funny is that the amount of repetition is quite easily identified by an attentive eye (sorry to this student for using his code as a good bad example):

[screenshot: the student's code, where the repeated blocks are visible at a glance]

I am sure there is a more clever solution out there.

Any hint?

Otterburn answered 25/6, 2022 at 10:52 Comment(9)
• There's a whole field of how to do things that humans find easy but that we can't figure out how to get a computer to do (machine learning) • Nevertheless, there are some existing "code plagiarism checker" programs; try starting your research there.Fact
I would try to unpack what "quite easily identified with attentive eyes" means in more detail and try to code elements of that. But yeah, extra layers of abstraction of the code would be needed. I don't think simple string comparisons would be enough.Extemporary
It would be great to have an AST, but all the parsers I’ve tried failed on student code, as it doesn’t always buildOtterburn
You really only want to analyze plagiarism in code that builds. If they didn't even get it to a buildable state, then why bother? Why would they even send code that doesn't build.Toaster
I do not want to detect plagiarismOtterburn
Stuff that's repetitive compresses easily, right? How about you use a compression function as a driver, where you see how well the thing compresses after an increment of text has been added. If you add a bunch of lines but it barely takes up more space in compressed form, maybe that's a DRY candidate?Bucksaw
Regarding your difflib code, you can make it faster by caching the results of difflib.SequenceMatcher (only need to call it once for each pair of lines). You can also ignore pairs of lines if their lengths are considerably different.Lorentz
@Bucksaw it looks indeed promisingOtterburn
To me the code duplication in this example is more a matter of style than a real problem. The duplicated code blocks are different #if-variants, so this style makes it easy to look at one of them in isolation instead of trying to weave together the result of different #if-paths. What you clearly don't want is people trying to fool the duplication checker by making the variants spuriously different.Black

I would build a system based on compressibility, because compressibility is essentially a measure of repetition. Modern compression algorithms are already looking for ways to reduce repetition, so let's piggyback on that work.

  • Things that are similar will compress well under any reasonable compression algorithm, e.g. LZ. Under the hood, the compressed output is essentially the text plus references back into itself, which you might be able to pull out.

  • Write a program that feeds lines [0:n] into the compression algorithm and compares the output length with that for lines [0:n+1].

  • When the incremental length of the compressed output increases by a lot less than the incremental input, note down that you potentially have a DRY candidate at that location; and if you can figure out the format, you can see what previous text it was deemed similar to (see the sketch after this list).

  • If you can figure out the compression format, you don't need to rely on the "size doesn't grow as much" heuristic; you can just pull out the references directly.

  • If needed, you can find similar structures with different names by pre-processing the input, for instance by normalizing the names. However, I foresee this getting a bit messy, so it's a v2 feature. Pre-processing can also be used to normalize the formatting.
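
A minimal sketch of the incremental-compression idea, using zlib from the Python standard library (the file name, the helper name and the thresholds are illustrative assumptions; recompressing the whole prefix at every step is quadratic, but it is enough to show the signal):

import zlib

def compression_profile(path, level=9):
    """Compress growing prefixes of the file and record, for each added line,
    how many raw bytes it added versus how much the compressed size grew."""
    with open(path) as fp:
        lines = fp.readlines()
    profile = []
    prev_size = len(zlib.compress(b"", level))
    prefix = ""
    for i, line in enumerate(lines):
        prefix += line
        size = len(zlib.compress(prefix.encode(), level))
        profile.append((i, len(line.encode()), size - prev_size))
        prev_size = size
    return profile

# Lines whose compressed growth is much smaller than their raw size are
# candidates for repetition (the 0.2 factor is an arbitrary illustration).
for i, raw, grown in compression_profile("find.c"):
    if raw > 10 and grown < 0.2 * raw:
        print(f"line {i + 1}: +{raw} raw bytes, +{grown} compressed -> possible repeat")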

Bucksaw answered 1/7, 2022 at 21:4 Comment(4)
Great way to easily extract a metric from a possibly very complex system.Holding
Do you slide the window down, or do you always keep the start of the comparison window at line 0? Also I suspect that normalization may be critical since you'd want any changes in compressibility to come from the logic structure, not noise like variable and method names, etc.Pettus
Also do modern compression algos work this way (I have no idea; it's an honest question)? Meaning if you add more and more lines and you get a big compression jump, you know that its from the most recent lines being highly repetitive and not from the compression algo figuring out how to better compress the whole file up to that point thanks to the newest lines?Pettus
Good question, I suspect there's more than one way to do itBucksaw

Looks like you're choosing a long path. I wouldn't go there.

I would look into minifying the code before analyzing it, to completely remove any influence of variable names, extra spacing, formatting and even slight logic reshuffling.

Another approach would be to compare the byte-code of the students. But it may not be a very good idea, since the result will likely have to be additionally cleaned up.

The dis module would be an interesting option.

I would, most likely, settle on comparing their ASTs. But the AST is likely to give false positives for short functions, because their structure may be too similar; so consider checking short functions with something else, something trivial.

On top of that, I would consider using Levenshtein distance or something similar to numerically calculate the differences between the byte-codes/sources/ASTs/dis outputs of the students. That would be what, almost O(N^2)? It shouldn't matter.

Or, if needed, make it more complex and calculate the distance between each function of student A and each function of student B, highlighting cases where the distance is too small. It may not be needed, though.

With simplification and normalization of the input, more algorithms should start returning good results. If a student is good enough to take someone's code and reshuffle not only the variables but also the logic, and maybe even improve the algorithm, then that student understands the code well enough to defend it and use it without help in the future. I guess that's the kind of help a teacher would want to see exchanged between students.
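
For Python submissions, here is a minimal sketch of the AST comparison, using only the standard ast and difflib modules (the normalization choices, i.e. blanking out identifiers and constants, and the file names are illustrative assumptions, not a prescribed method):

import ast
import difflib

def normalized_dump(source: str) -> str:
    # Parse the source and blank out identifiers and literal values,
    # so only the structure of the code remains in the dump.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            node.name = "_"
        elif isinstance(node, ast.Constant):
            node.value = 0
    return ast.dump(tree)

def structural_similarity(src_a: str, src_b: str) -> float:
    # 1.0 means the normalized ASTs are textually identical.
    return difflib.SequenceMatcher(
        None, normalized_dump(src_a), normalized_dump(src_b)
    ).ratio()

print(structural_similarity(open("student_a.py").read(),
                            open("student_b.py").read()))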

Toaster answered 26/6, 2022 at 4:27 Comment(0)

You can treat this as a variant of the longest common subsequence problem between the input and itself, where the trivial matching of each element with itself is disallowed. This retains the optimal substructure of the standard algorithm, since it can be phrased as a non-transitive “equality” and the algorithm never relies on transitivity.

As such, we can write this trivial implementation:

import operator

class Repeat:
  def __init__(self,l):
    self.l=list(l)
    self.memo={}

  def __call__(self,m,n):
    # Memoized LCS of l[:m] and l[:n], with the diagonal m==n excluded
    # so an element can never be matched with itself.
    l=self.l
    memo=self.memo
    k=m,n
    ret=memo.get(k)
    if not ret:
      if not m or not n: ret=0,None
      elif m!=n and l[m-1]==l[n-1]:  # critical change here!
        z,tail=self(m-1,n-1)
        ret=z+1,((m-1,n-1),tail)
      else: ret=max(self(m-1,n),self(m,n-1),key=operator.itemgetter(0))
      memo[k]=ret
    return ret

  def go(self):
    # Walk the traceback and return the matched (i, j) index pairs in order.
    n=len(self.l)
    v=self(n,n)[1]
    ret=[]
    while v:
      x,v=v
      ret.append(x)
    ret.reverse()
    return ret

def repeat(l): return Repeat(l).go()

You might want to canonicalize lines of code by removing whitespace (except perhaps between letters), removing comments, and/or replacing each unique identifier with a standardized label. You might also want to omit trivial lines like } in C/C++ to reduce noise. Finally, the symmetry should allow only cases with, say, m>=n to be treated.
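
For example, a usage sketch for repeat() along those lines (the canonicalization rules, the file name and the recursion-limit bump are assumptions for illustration; the memoized recursion gets deep on long files):

import re
import sys

sys.setrecursionlimit(10000)   # __call__ recurses roughly 2*len(lines) deep

def canonicalize(line):
    line = re.sub(r"//.*|#.*", "", line)   # drop simple end-of-line comments
    return re.sub(r"\s+", "", line)        # drop all whitespace

lines = [canonicalize(l) for l in open("find.c")]
lines = [l for l in lines if l not in ("", "{", "}")]   # drop trivial lines

pairs = repeat(lines)   # list of (i, j) index pairs the LCS matched together
print(f"{len(pairs)} repeated lines out of {len(lines)}")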

Of course, there are also "real" answers and real research on this issue!

Magdalen answered 4/7, 2022 at 1:9 Comment(0)

Frame challenge: I’m not sure you should do this

It’d be a fun programming challenge for yourself, but if you intend to use it as a teaching tool, I’m not sure I would. There’s no good definition of “repeat” from the DRY principle that would be easy to test for fully in a computer program. The human definition, which I’d say is basically “failure to properly abstract your code at an appropriate level, manifested via some type of repetition of code, whether repeating exact blocks, repeating the same idea over and over again, or somewhere in between”, isn’t something I think anyone will be able to get working well enough at this time to use as a tool that teaches good habits with respect to DRY without confusing the student or teaching bad habits too. For now I’d argue this is a job for humans, because it’s easy for us and hard for computers, at least for now…

That said, if you want to give it a try, first define for yourself requirements for what errors you want to catch, what they’ll look like, and what good code looks like; then define acceptable false positive and false negative rates, and test your code on a wide variety of representative inputs, validating it against human judgement to see if it performs well enough for your intended use. But I’m guessing you’re really looking for more than simple repetition of tokens, and if you want to have a chance at succeeding I think you need to clearly define what you’re looking for, how you’ll measure success, and then validate your code. A teaching tool can do great harm if it doesn’t actually teach the correct lesson: for example, if your tool simply encourages students to obfuscate their code so it doesn’t get flagged as violating DRY, or if the tool doesn’t flag bad code so the student assumes it’s OK, or if it flags code that is actually very well written.

More specifically, what types of repetition are ok and what aren’t? Is it good or bad to use “if” or “for” or other syntax repeatedly in code? Is it ok for variables and functions/methods to have names with common substrings (e.g. average_age, average_salary, etc.?). How many times is repetition ok before abstraction should happen, and when it does what kind of abstraction is needed and at what level (e.g. a simple method, or a functor, or a whole other class, or a whole other module?). Is more abstraction always better or is perfect sometimes the enemy of on time on budget? This is a really interesting problem, but it’s also a very hard problem, and honestly I think a research problem, which is the reason for my frame challenge.

Edit: Or if you definitely want to try this anyway, you can make it a teaching tool--not necessarily as you may have intended, but rather by showing your students your adherence to DRY in the code you write when creating your tool, and by introducing them to the nuances of DRY and the shortcomings of automated code quality assessment by being transparent with them about the limitations of your quality assessment tool. What I wouldn’t do is use it like some professors use plagiarism detection tools, as a digital oracle whose assessment of the quality of the students’ code is unquestioned. That approach is likely to cause more harm than good toward the students.

Pettus answered 3/7, 2022 at 21:12 Comment(0)

I suggest the following approach: let's say that repetitions should be at least 3 lines long. Then we hash every group of 3 consecutive lines. If a hash repeats, we write down the line number where it occurred. All that is left is to join together adjacent duplicated line numbers to get longer sequences.

For example, if you have duplicate blocks on lines 100-103 and 200-203, you will get {HASH1: (100, 200), HASH2: (101, 201)} (lines 100-102 and 200-202 produce the same value HASH1, and HASH2 covers lines 101-103 and 201-203). When you join the results, it produces the sequence (100, 101, 200, 201). Finding the monotonic subsequences in it, you get ((100, 101), (200, 201)).

As no nested loops are used, the time complexity is linear (hashing each group and inserting it into the dictionary are constant-time per line, so O(n) overall).

Algorithm:

  • read the text line by line
  • transform it by
    • removing blanks
    • removing empty lines, saving a mapping to the original line numbers for later
  • for each 3 consecutive transformed lines, join them and calculate a hash of the result
  • keep the lines whose hashes occur more than once (these are repetitions of at least 3 lines)
  • find the longest sequences and present the repetitive text

Code:

from itertools import groupby, cycle
import re

def sequences(l):
    # Group consecutive integers in the sorted list l and yield each run,
    # e.g. [100, 101, 200, 201] -> (100, 101), (200, 201).
    x2 = cycle(l)
    next(x2)
    grps = groupby(l, key=lambda j: j + 1 == next(x2))
    yield from (tuple(v) + (next((next(grps)[1])),) for k, v in grps if k)

with open('program.cpp') as fp:
    text = fp.readlines()

# remove white spaces
processed, text_map = [], {}
proc_ix = 0
for ix, line in enumerate(text):
    line = re.sub(r"\s+", "", line, flags=re.UNICODE)
    if line:
        processed.append(line)
        text_map[proc_ix] = ix
        proc_ix += 1

# calc hashes
hashes, hpos = [], {}
for ix in range(len(processed)-2):
    h = hash(''.join(processed[ix:ix+3])) # join 3 lines 
    hashes.append(h)
    hpos.setdefault(h, []).append(ix)  # this list will reflect lines that are duplicated

# filter duplicated three liners 
seqs = []
for k, v in hpos.items():
    if len(v) > 1:
        seqs.extend(v) 
seqs = sorted(list(set(seqs)))

# find longer sequences
result = {}
for seq in sequences(seqs):
    end = text_map.get(seq[-1] + 3, len(text))  # the block may end at the file's last line
    result.setdefault(hashes[seq[0]], []).append((text_map[seq[0]], end))

print('Duplicates found:')
for v in result.values():
    print('-'*20)
    vbeg, vend = v[0]
    print(''.join(text[vbeg:vend]))
    print(f'Found {len(v)} duplicates, lines')
    for line_numbers in v:
        print(f'{1+line_numbers[0]} : {line_numbers[1]}')
Artie answered 2/7, 2022 at 21:46 Comment(2)
What about near but not exact repeats, say repeating a block of logic with a different variable or a different comparison value? Also, what about repetitions of blocks longer or shorter than three lines?Pettus
The issue of slight changes in code, like a variable name or constant change, could be addressed at the preprocessing stage by applying a syntax parser. The processed code does not have to comply with the language rules, so in the case of C++ curly braces might be removed, variable names could be set to be the same, etc. Blocks longer than three lines are handled by joining sequences of duplicates (see the example). Sequences shorter than 3 lines are not practical to report on, but theoretically one can hash every single line; the algorithm does not really change.Artie
