Python Git diff parser

I would like to parse a git diff with Python, and I want to get the following information from the diff parser:

  1. The content and line number of each deleted or added line.
  2. The file name.
  3. The file status: whether it was deleted, renamed, or added.

I am using unidiff 0.5.2 for this purpose and I wrote the following code:

from unidiff import PatchSet
import git
import os

commit_sha1 = 'b4defafcb26ab86843bbe3464a4cf54cdc978696'
repo_directory_address = '/my/git/repo'
repository = git.Repo(repo_directory_address)
commit = repository.commit(commit_sha1)
diff_index = commit.diff(commit_sha1+'~1', create_patch=True)
diff_text = reduce(lambda x, y: str(x)+os.linesep+str(y), diff_index).split(os.linesep)
patch = PatchSet(diff_text)
print patch[0].is_added_file

I am using GitPython to generate the Git diff. The code above raises the following error:

current_file = PatchedFile(source_file, target_file,
UnboundLocalError: local variable 'source_file' referenced before assignment

I would appreciate any help fixing this error.

Genoa answered 10/9, 2016 at 6:4 Comment(5)
There are a few things wrong with this question. First, you're asking for us to recommend a library or other off-site resource. Second, you haven't told us what your requirements are at all ... What does it mean to "parse [a] git diff string"? Much better would be to say "I have this git diff string and I'd like to get the following information out of it. Here's what I've tried and here is why it isn't working..." - Greaser
Thank you for your comments. I will rewrite the question. - Genoa
You say that you can't use unidiff -- I'm not sure that I understand that assertion. According to the pypi page you linked, unidiff works with file-like objects. If you have a string, it's easy enough to construct a file-like object with it using StringIO or io in the standard library. - Greaser
Unfortunately, there is no good documentation for unidiff. I have checked the source code. It actually accepts any iterable (please check the __init__ method of PatchSet). I converted my string to a list of lines, but I cannot make it work. Could you please let me know how I can construct a file-like object from a Python string? - Genoa
@mgilson, I tried StringIO; unfortunately, I got the same error as above. - Genoa
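For reference, the file-like object mentioned in the comments can be built from a string with the standard library's io.StringIO. The sketch below uses a made-up two-line diff (the file name example.txt is purely illustrative) and only shows that iterating the stream yields one line at a time, which is the kind of input the comments say PatchSet accepts. It does not by itself fix the error above, which comes from the content of the diff text rather than from how it is wrapped.

```python
import io

# Hypothetical diff text, used only for illustration
diff_text = (
    "--- a/example.txt\n"
    "+++ b/example.txt\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
)

# Wrap the string in a file-like object; iterating it yields lines
stream = io.StringIO(diff_text)
lines = list(stream)

for line in lines:
    print(repr(line))
```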

Update:
My old answer no longer works. Here is the new solution.
It requires the GitPython (imported as git) and unidiff packages.

import git
from unidiff import PatchSet

from io import StringIO  # Python 3; the original used cStringIO (Python 2)

commit_sha1 = 'commit_sha'
repo_directory_address = "your/repo/address"

repository = git.Repo(repo_directory_address)
commit = repository.commit(commit_sha1)

# Diff from the parent commit to the commit itself, ignoring blank-line
# changes and whitespace at the end of lines
uni_diff_text = repository.git.diff(commit_sha1 + '~1', commit_sha1,
                                    ignore_blank_lines=True,
                                    ignore_space_at_eol=True)

# In Python 3 the diff is already text (str), so no encoding is passed
patch_set = PatchSet(StringIO(uni_diff_text))

change_list = []  # list of changes
                  # [(file_name, [row_number_of_deleted_lines],
                  # [row_number_of_added_lines]), ... ]

for patched_file in patch_set:
    file_path = patched_file.path  # file name
    print('file name: ' + file_path)
    # Removed lines are numbered in the source (old) file
    del_line_no = [line.source_line_no
                   for hunk in patched_file for line in hunk
                   if line.is_removed and
                   line.value.strip() != '']  # row numbers of deleted lines
    print('deleted lines: ' + str(del_line_no))
    # Added lines are numbered in the target (new) file
    ad_line_no = [line.target_line_no for hunk in patched_file
                  for line in hunk if line.is_added and
                  line.value.strip() != '']   # row numbers of added lines
    print('added lines: ' + str(ad_line_no))
    change_list.append((file_path, del_line_no, ad_line_no))
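To make the line-number bookkeeping concrete: removed lines are numbered in the old (source) file and added lines in the new (target) file, and both counters start from the values in the @@ hunk header. Below is a minimal stdlib-only sketch of that idea. It is not how unidiff is implemented; the helper name changed_line_numbers and the sample diff are made up for illustration, and the function assumes the text contains the hunks of a single file.

```python
import re

# "@@ -old_start[,old_len] +new_start[,new_len] @@"; a missing length means 1
HUNK_HEADER = re.compile(r'^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@')


def changed_line_numbers(diff_text):
    """Return (deleted, added) line numbers for one file's unified diff."""
    deleted, added = [], []
    old_no = new_no = old_left = new_left = 0
    for line in diff_text.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a hunk: skip everything except the next hunk header
            header = HUNK_HEADER.match(line)
            if header:
                old_start, old_len, new_start, new_len = header.groups()
                old_no, new_no = int(old_start), int(new_start)
                old_left = int(old_len) if old_len is not None else 1
                new_left = int(new_len) if new_len is not None else 1
            continue
        if line.startswith('\\'):
            continue  # "\ No newline at end of file" marker
        if line.startswith('-'):
            deleted.append(old_no)   # numbered in the old file
            old_no += 1
            old_left -= 1
        elif line.startswith('+'):
            added.append(new_no)     # numbered in the new file
            new_no += 1
            new_left -= 1
        else:
            # Context line: present in both files
            old_no += 1
            new_no += 1
            old_left -= 1
            new_left -= 1
    return deleted, added


# Hypothetical sample diff
SAMPLE = """--- a/example.txt
+++ b/example.txt
@@ -1,3 +1,4 @@
 alpha
-beta
+BETA
+gamma
 delta
"""

deleted, added = changed_line_numbers(SAMPLE)
print('deleted lines:', deleted)
print('added lines:', added)
```

For real repositories a library such as unidiff is the better choice; the sketch only shows where the numbers come from.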

Old Solution (This solution may not work anymore)

Finally, I found the solution. The output of GitPython differs slightly from standard git diff output. In standard git diff output, the source-file line starts with ---, but in GitPython's output it starts with ------, as you can see in the output of the following Python code (this example was generated with the elasticsearch repository):

import git

repo_directory_address = '/your/elasticsearch/repository/address'
revision = "ace83d9d2a97cfe8a8aa9bdd7b46ce71713fb494"
repository = git.Repo(repo_directory_address)
commit = repository.commit(rev=revision)
# Tell git to ignore whitespace at the end of lines and blank lines,
# and to filter out renamed and copied files
diff_index = commit.diff(revision+'~1', create_patch=True, ignore_blank_lines=True,
                         ignore_space_at_eol=True, diff_filter='cr')

print(''.join(str(diff) for diff in diff_index))

Part of the output looks like this:

core/src/main/java/org/elasticsearch/action/index/IndexRequest.java
=======================================================
lhs: 100644 | f8b0ce6c13fd819a02b1df612adc929674749220
rhs: 100644 | b792241b56ce548e7dd12ac46068b0bcf4649195
------ a/core/src/main/java/org/elasticsearch/action/index/IndexRequest.java
+++ b/core/src/main/java/org/elasticsearch/action/index/IndexRequest.java
@@ -20,16 +20,18 @@
package org.elasticsearch.action.index;

 import org.elasticsearch.ElasticsearchGenerationException;
+import org.elasticsearch.Version;
 import org.elasticsearch.action.ActionRequestValidationException;
 import org.elasticsearch.action.DocumentRequest;
 import org.elasticsearch.action.RoutingMissingException;
 import org.elasticsearch.action.TimestampParsingException;
 import org.elasticsearch.action.support.replication.ReplicationRequest;
 import org.elasticsearch.client.Requests;
+import org.elasticsearch.cluster.metadata.IndexMetaData;
 import org.elasticsearch.cluster.metadata.MappingMetaData;
 import org.elasticsearch.cluster.metadata.MetaData;
 import org.elasticsearch.common.Nullable;
-import org.elasticsearch.common.UUIDs;
+import org.elasticsearch.common.Strings;
 import org.elasticsearch.common.bytes.BytesArray;
 import org.elasticsearch.common.bytes.BytesReference;

As you can see, the source-file line in this output (the one beginning ------ a/core/...) starts with ------ rather than ---. To fix the problem, you need to edit the regular expression in the unidiff 0.5.2 source, which you can find in /unidiff/constants.py, from:

RE_SOURCE_FILENAME = re.compile(
                      r'^--- (?P<filename>[^\t\n]+)(?:\t(?P<timestamp>[^\n]+))?')

to:

RE_SOURCE_FILENAME = re.compile(
                   r'^------ (?P<filename>[^\t\n]+)(?:\t(?P<timestamp>[^\n]+))?')

PS: If the source file is renamed, GitPython generates a diff that starts with ---. This does not raise an error here because I filtered renamed files out of the diff (diff_filter='cr').
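An alternative that avoids editing the library is to normalize the diff text before parsing it. The sketch below copies the RE_SOURCE_FILENAME pattern quoted above to show that it rejects a ------ line, then rewrites such lines to the standard --- form. The helper normalize_source_headers and the path src/Example.java are hypothetical, and the code assumes every affected line looks like "------ a/<path>" as in the sample output. A removed line whose own content happens to start with "----- a/" would be rewritten too, so treat this as a rough workaround.

```python
import re

# Pattern as quoted from unidiff 0.5.2's constants.py
RE_SOURCE_FILENAME = re.compile(
    r'^--- (?P<filename>[^\t\n]+)(?:\t(?P<timestamp>[^\n]+))?')


def normalize_source_headers(diff_text):
    """Rewrite '------ a/...' source lines to the standard '--- a/...' form.

    Hypothetical helper; assumes the source path always starts with 'a/'.
    """
    return re.sub(r'^------ (?=a/)', '--- ', diff_text, flags=re.MULTILINE)


# Hypothetical header lines in the style of the output shown above
raw = "------ a/src/Example.java\n+++ b/src/Example.java\n"

print(RE_SOURCE_FILENAME.match(raw))  # the six-dash line is not recognized
fixed = normalize_source_headers(raw)
print(RE_SOURCE_FILENAME.match(fixed).group('filename'))
```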

Genoa answered 7/10, 2016 at 4:29 Comment(4)
Maybe you can use diff_index[i].diff, which strips anything before @@. - Surge
It's a really bad idea to modify the source code of libraries you're using. You won't be able to upgrade them to a newer version or you will always have to perform your modifications after the upgrade. - Athalie
I have updated the solution, @tdihp, thank you for your comments. - Genoa
@DawidFerenczyRogožan, My new solution does not change the source code anymore. Thank you for your comments. - Genoa

Use diff_index[i].diff, as tdihp recommended, and also prepend the source and target file lines to the diff; otherwise unidiff will throw. Here is my working code sample:

import os
from unidiff import PatchSet

# commit and prev_commit are GitPython Commit objects obtained elsewhere
diffs = []
diff_index = commit.diff(prev_commit, create_patch=True)
# Keep only modified ('M') JavaScript files
for diff in diff_index.iter_change_type('M'):
  if diff.a_path.endswith(".js"):
    diffs.append(diff)

for d in diffs:
  # d.diff starts at the first @@ line, so rebuild the file header lines
  a_path = "--- " + d.a_rawpath.decode('utf-8')
  b_path = "+++ " + d.b_rawpath.decode('utf-8')

  # Get detailed info
  patch = PatchSet(a_path + os.linesep + b_path + os.linesep + d.diff.decode('utf-8'))

  for h in patch[0]:
    for l in h:
      print("  " + str(l.source_line_no) + " <-> " + str(l.target_line_no))
    print("")
Nephology answered 3/5, 2019 at 16:30 Comment(0)
