git word diff regex strange behaviour

I'm using Git to version prose and have been trying git diff --word-diff to see changes within lines. I want to use the results generated in a script.

But the default way that --word-diff identifies a word seems flawed. So I've been experimenting with --word-diff-regex= options.

Problem

Here are the two main flaws I'm trying to deal with:

  1. Added whitespace seems to be ignored. But whitespace can be quite important if trying to use the results programmatically.

    For example, take this header from a Markdown (.md) file:

    # Test file
    

    Now, let's add some text to the end of it:

    # Test file in Markdown
    

    If I run git diff --word-diff on this, I get:

    # Test file {+in Markdown+}
    

    But the space before the word "in" has not been included as part of the diff.

  2. Empty lines are completely ignored.

    Here's a standard git diff for the content of a file where I've removed a line and also added a couple of new lines -- one empty, the other with the text "Here's a new line."

     This is a test file to see how word diff responds in certain situations.
    -
     I'll try removing lines and adding them to see what happens.
    
     Here's another line so we can see what happens with line removals and additions. I want to see how `git diff --word-diff` handles it all!
    +
    +Here's a new line.
    

    But here's git diff --word-diff for the same content:

    This is a test file to see how word diff responds in certain situations.
    
    I'll try removing lines and adding them to see what happens.
    
    Here's another line so we can see what happens with line removals and additions. I want to see how `git diff --word-diff` handles it all!
    
    {+Here's a new line.+}
    

    The removed and added empty lines are completely ignored.

Desired results

Putting the two examples above together, here's what I'd like to see:

# Test file{+ in Markdown+}

This is a test file to see how word diff responds in certain situations.
{--}
I'll try removing lines and adding them to see what happens.

Here's another line so we can see what happens with line removals and additions. I want to see how `git diff --word-diff` handles it all!
{++}
{+Here's a new line.+}

Things I've tried:

  • git diff --word-diff-regex='.' seems too granular for when new words share characters with existing words
  • git diff --word-diff-regex='[^ ]+|[ ]' seems to solve the first problem but, to be honest, I'm not actually sure why.
  • git diff --word-diff-regex='[^ ]+|[ ]|^$' -- I was hoping the ^$ on the end would help capture empty lines -- but it doesn't and, worse, it then seems to ignore the change that follows.
  • git diff --word-diff-regex='[^ ]+|[ ]|.{0}' creates the same problem as the one before.
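
A rough way to see why the second pattern helps: per the git diff documentation, every non-overlapping match of the regex is considered a word, and anything between matches is treated as whitespace and ignored when finding differences. The sketch below only imitates that tokenizing step using Python's re module, which is not the regex engine Git uses, although these simple patterns should behave the same way. The tokenize helper is a hypothetical name for illustration, not part of Git.

```python
import re

def tokenize(text, word_regex=r"[^ ]+|[ ]"):
    """Return every non-overlapping match of word_regex, in order.

    Text that falls between matches is dropped, which mimics how
    word diff treats unmatched text as ignorable whitespace.
    """
    return [m.group(0) for m in re.finditer(word_regex, text)]

line = "# Test file in Markdown"

# Stand-in for the default behaviour (runs of non-whitespace):
# the spaces are not tokens, so they cannot show up in the diff.
print(tokenize(line, r"\S+"))
# ['#', 'Test', 'file', 'in', 'Markdown']

# With '[^ ]+|[ ]' each space is a token of its own and can be
# reported as added or removed like any other word.
print(tokenize(line))
# ['#', ' ', 'Test', ' ', 'file', ' ', 'in', ' ', 'Markdown']
```

By the same logic, a pattern that only ever matches the letter n would leave "Here's a new line." with just two tokens, the n in "new" and the n in "line". That appears consistent with the {+new lin+}e output mentioned in the comments, assuming the marked span runs from the first changed word to the last.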

I'd be grateful if anyone could shed any light on how to do this, or at least share some knowledge on what's going on under the hood with --word-diff-regex.

Guano answered 5/10, 2019 at 14:52 Comment(5)
Here's a stranger finding... which means bonus questions! If I try --word-diff-regex='\n', the last line of my example displays Here's a {+new lin+}e. Odd. Firstly, the regex flavour Git is using doesn't seem to recognise \n as a newline character (the same is true of an escaped version, \\\n). So which flavour is Git using? Secondly, is the regex here really defining a word? Note that the diff bit picked out begins and ends with a literal n. So is it looking for boundaries instead, and does that affect the way we should write regexes for --word-diff-regex=? - Guano
It's been a long time since I glanced at the code, but if I remember right, Git first finds diffs the usual way, then takes the differing lines and applies the word regex to do sub-matches within those lines, then discards whitespace. I think the last step accounts for the {+new lin+}e result: the regex matched the newline, which the word diff code discarded after the whitespace was removed, so the word diff code discarded the e in line. - Idealistic
Thanks for the reply. What you say about Git finding diffs the usual way first (line by line) makes sense to me and is what I suspected was happening. That may explain why empty lines just don't get tracked. I'm still not sure I understand why the "e" in particular at the end of that example gets discarded. I suppose I'd expected to see just {+n+} show up twice. But because the characters in between those n's show up, I was thinking the regex must have something to do with defining word boundaries rather than defining what a word looks like. I still haven't fully understood this though. - Guano
Git uses POSIX ERE, which doesn't consider \n to indicate a newline. If you want to match a newline, you have to include it literally. - Schnapps
@Schnapps Thanks for this. I couldn't seem to find it anywhere in Git's documentation, so that's good info to have. I'm still unclear as to why --word-diff-regex='^$' or --word-diff-regex='.{0}' would not capture empty lines, since both those patterns would be compliant with POSIX ERE. But I'm starting to assume there's something in the Git script itself which deliberately skips empty lines rather than attempting to match them to the regex pattern provided. - Guano

The main obstacle to getting what you want is this caveat from the documentation at https://git-scm.com/docs/diff-options:

A match that contains a newline is silently truncated(!) at the newline.

This means that word diffs will always ignore line-level changes such as added or removed empty lines. I don't think you're going to achieve the results you want short of writing a custom diff generator.
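
As a possible starting point for such a custom generator, here is a minimal sketch in Python, assuming both versions of the text are available as strings. It treats newlines and runs of spaces or tabs as tokens of their own so that they take part in the comparison, and it marks changes with Git's default [-removed-] and {+added+} delimiters. TOKEN and word_diff are illustrative names, and difflib's matching is not the algorithm Git uses, so the grouping of changes may differ from what git diff would show.

```python
import difflib
import re

# Hypothetical token pattern: a single newline, a run of spaces/tabs,
# or a run of any other characters. Every character lands in a token.
TOKEN = re.compile(r"\n|[ \t]+|[^ \t\n]+")

def word_diff(old, new):
    """Return new-vs-old as one string with inline change markers."""
    a = TOKEN.findall(old)
    b = TOKEN.findall(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
            continue
        # A "replace" is emitted as a removal followed by an addition.
        if tag in ("delete", "replace"):
            out.append("[-" + "".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(out)

print(word_diff("# Test file\n", "# Test file in Markdown\n"))
# # Test file{+ in Markdown+}
```

With this approach an added or removed empty line shows up as a newline inside the markers rather than as a marker pair on a line of its own, so the output format would still need adjusting to match the desired results in the question exactly.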

Bobo answered 3/9, 2020 at 22:8 Comment(0)
