Difficulty understanding Paul Heckel's Diff Algorithm
Asked Answered
E

2

6

I have been looking at Paul Heckel's Diff Algorithm and I don't seem to understand it fully.

I copied steps 1-5 as shown in Python code but I can't get it to show the differences using the final step of the algorithm. I would be grateful if someone explained the final step along with Python code.

Also, I don't fully understand why you need a reference to the table lines in step 4 and 5, so an explanation of that would be amazing too!

Many thanks

Here's my current code:

def find_diff(current_file_as_list, different_file_as_list):

N = current_file_as_list
O = different_file_as_list

table = {}

OA = []
NA = []
for i in O:
    OA.append(i)
for i in N:
    NA.append(i)

# First pass
i = 0

for line in N:
    if not line in table:
        table[line] = {}
        table[line]["NC"] = 1
    else:
        if table[line]["NC"] == 1:
            table[line]["NC"] = 2
        else:
            table[line]["NC"] = "many"
    NA[i] = table[line]
    i += 1

# second pass
j = 0

for line in O:
    if not line in table:
        table[line] = {}
        table[line]["OC"] = 1
    else:
        if not "OC" in table[line]:
            table[line]["OC"] = 1
        elif table[line]["OC"] == 1:
            table[line]["OC"] = 2
        else:
            table[line]["OC"] = "many"
    table[line]["OLNO"] = j  # Gets overwritten with multiple occurrences.
    # Check to see if this is the intended implementation.
    # Maybe only relevant for "OC" == "NC" == 1
    OA[j] = table[line]
    j += 1

# third pass
i = 0

for i in range(0, len(NA)):
    # Check if they appear in both files
    if "OC" in NA[i] and "NC" in NA[i]:
        # Check if they appear exactly once
        if NA[i]["OC"] == NA[i]["NC"] == 1:
            olno = NA[i]["OLNO"]
            NA[i], OA[olno] = olno, i
    i += 1

# fourth pass
# ascending
for i in range(0, len(NA)):
    for j in range(0 , len(OA)):
        if NA[i] == OA[j] and i + 1 < len(NA) and j + 1 < len(OA) and NA[i + 1] == OA[j + 1]:
            OA[j + 1] = table[O[i + 1]]
            NA[i + 1] = table[N[j + 1]]

# fifth pass
# descending
for i in range(len(NA) - 1, 0, -1):
    for j in range(len(OA) - 1, 0, -1):
        if NA[i] == OA[j] and i - 1 > 0 and j - 1 > 0 and NA[i - 1] == OA[j - 1]:
            OA[j - 1] = table[O[i - 1]]
            NA[i - 1] = table[N[j - 1]]

# final step implementation should go here but I'm not sure how to approach it but this is my current attempt (which I am certain is wrong):
k = 0

array = []

for i in range(0, len(NA)):

    if isinstance(NA[i], int):
        array.append("= " + str(N[i]))
        k = NA[i] + 1
    elif isinstance(NA[i], dict):
        array.append("+ " + N[i])

    for j in range(k, len(OA)):
        k = j + 1
        print("j - " + str(j))
        if not isinstance(OA[j], int):
            array.append("- " + O[j])
        else:
            break

You can pass any two string's or list of string's as input to the function eg. find_diff("hello", "hell")

Ellis answered 13/3, 2017 at 0:47 Comment(0)
N
6

I'm not sure where you found this explanation and code, but it has several mistakes in it. One of the Wikipedia page for data comparison's references was a reference to Paul's paper, which proved most helpful in understanding the algorithm.

First of all, as far as I can see your implementation of the last step is correct (assuming the previous steps were done correctly).

Let's begin with a syntax/language issue: maybe I'm missing something, but I don't understand why you (and the code you linked to) increment the self-incrementing index i in the third pass.

Regarding the counters of the table's entries: in the linked code there is a commented question - why do we need the 2 value at all? The answer is - we don't! In the paper itself Heckel explicitly writes that the only values that the counters are ought to have are 0, 1 and many. You can see that we never use or query the counters for a 2 value. I am wild-guessing this mistake comes from implementing the algorithm in a language that is more flexible than the ones Heckel had in mind while writing the algorithm, as the query whether a counter exists for a specific table entry is synonymous with querying if the counter's value is 0.

Lastly and most importantly, the fourth and fifth passes in this implementation are wrong. Here I believe the phrasing of the passes in the paper can be confusing, and that whoever wrote the linked code got it wrong. Your second question reveals it already. The fourth pass is in ascending order over NA and for each position with its value pointing to a position in OA (meaning it is of type int in the discussed implementation) we check whether the next positions's values in both arrays point to the same table entry. If they do we replace those pointers with each other's position (overriding the pointers with ints. So your second question was on point - we don't use table entry pointers here at all). This way, we have our unique lines, discovered in the third pass, as anchors to find unalteted lines that come immediately after them and are part of their"block" but are not unique in the files. The same happens on the fifth pass, but backwards, so the identical lines preceding the unaltered unique lines would be categorised as unaltered as well.

Here's the fourth and fifth passes as I described them:

# fourth pass
# ascending
for i in range(0, len(NA) - 1):
    if isinstance(NA[i], int) and (NA[i] + 1) < len(OA) and NA[i + 1] == OA[NA[i] + 1]:
        NA[i + 1] = NA[i] + 1
        OA[NA[i] + 1] = i + 1

# fifth pass
# descending
for i in range(len(NA) - 1, 0, -1):
    if isinstance(NA[i], int) and (NA[i] - 1) >= 0 and NA[i - 1] == OA[NA[i] - 1]:
        NA[i - 1] = NA[i] - 1
        OA[NA[i] - 1] = i - 1
Nazi answered 20/3, 2017 at 16:31 Comment(4)
Thanks you. Your explanation is very informative. I look forward to seeing the updated code. With regards to the syntax/language issue in the third step 3, I was using a dictionary before similar to the link so it was used to keep a reference to the current element, however I've switched to a list later on during my implementation and the definition and incrementation of i is wrong as you've stated.Ellis
Added the code. Please let me know if you find anything unclear or confusing.Nazi
And if this answer is what you were looking for - please accept it as the correct answer (click the tick sign below the answer's score, on the left hand side).Nazi
Hi @Nazi it's exactly what I was trying to achieve. Thanks for your help! I've accepted this as the answerEllis
S
1

I've been looking for the same problem's solution. I ended up implementing Heckel's algorithm in Python from scratch. Here is implementation of first 5 steps of Heckel's algorithm and implementation of 6th step (extracting diff reperesentation).

You can also use mdiff package in Your program to detect text diff with block movement using Heckel's Algorithm:

from mdiff import HeckelSequenceMatcher

a = ['line1', 'line2', 'line3', 'line4', 'line5']
b = ['line1', 'line3', 'line2', 'line4', 'line6']
sm = HeckelSequenceMatcher(a, b)
opcodes = sm.get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format(tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2]))
equal     a[0:1] --> b[0:1]  ['line1'] --> ['line1']
move      a[1:2] --> b[2:2]  ['line2'] --> []
equal     a[2:3] --> b[1:2]  ['line3'] --> ['line3']
moved     a[1:1] --> b[2:3]         [] --> ['line2']
equal     a[3:4] --> b[3:4]  ['line4'] --> ['line4']
replace   a[4:5] --> b[4:5]  ['line5'] --> ['line6']

Also there is simple GUI app in package that allows to visualise and test the algorithm.

Supervision answered 14/4, 2022 at 7:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.