Git merging within a line

Asked 7/4, 2011 at 21:2 Answered 15/10, 2011 at 22:19

Preamble

I'm using git as a version control system for a paper that my lab is writing, in LaTeX. There are several people collaborating.

I'm running into git being stubborn about how it merges. Let's say two people have made single-word changes to a line, and then attempt to merge them. Though git diff --word-diff seems capable of SHOWING the difference between the branches word-by-word, git merge seems unable to perform the merge word-by-word, and instead requires a manual merge.

With a LaTeX document this is particularly annoying, as the common habit when writing LaTeX is to write a full paragraph per line and let your text editor handle word wrapping for display. We are working around this for now by adding a newline after each sentence, so that git can at least merge changes to different sentences within a paragraph. But it still gets confused by multiple changes within a sentence, and of course the text no longer wraps nicely.

The Question

Is there a way to git merge two files "word by word" rather than "line by line"?

Canaliculus answered 7/4, 2011 at 21:2 Comment(3)

Just as an aside, I personally think this is one case where a human should intervene in a merge. Two independent changes to different words of a sentence has the potential to completely change the meaning from what either editor intended. I would be too worried about missing something like that to leave the decision to a computer. Also, two different people frequently changing the exact same sentence at the same time brings up concerns about your process for dividing up work. If it's working out for you, more power to you. Just be careful and do some heavy proofreading at the end. – Burtonburty 8/4, 2011 at 4:23

If only we could set core.eol to any regular expression... – Nanoid 7/7, 2014 at 17:34

A general text merge algorithm question (not necessarily Git): #1204225 – Nanoid 7/7, 2014 at 20:3

Here's a solution in the same vein as sehe's, with a few changes which hopefully will address your comments:

This solution merges by sentence rather than by word, as you had been doing by hand, except that the user now sees a single line per paragraph while git sees paragraphs broken into sentences. This seems more logical: adding or removing a sentence may be compatible with other changes in the paragraph, but when two commits edit the same sentence a manual merge is probably preferable. It also has the benefit that the "clean" snapshots remain somewhat human readable (and LaTeX compilable!).
The filters are one-line commands, which should make it easier to port this to collaborators.

As in sehe's solution, create (or append to) a .gitattributes file:

    *.tex filter=sentencebreak

Now to implement the clean and smudge filters:

    git config filter.sentencebreak.clean "perl -pe \"s/[.]*?(\\?|\\!|\\.|'') /$&%NL%\\n/g unless m/%/||m/^[\\ *\\\\\\]/\""
    git config filter.sentencebreak.smudge "perl -pe \"s/%NL%\n//gm\""

I've created a test file with the following contents; notice the one-line paragraph.

    \chapter{Tumbling Tumbleweeds. Intro}
    A way out west there was a fella, fella I want to tell you about, fella by the name of Jeff Lebowski.  At least, that was the handle his lovin' parents gave him, but he never had much use for it himself. This Lebowski, he called himself the Dude. Now, Dude, that's a name no one would self-apply where I come from.  But then, there was a lot about the Dude that didn't make a whole lot of sense to me.  And a lot about where he lived, like- wise.  But then again, maybe that's why I found the place s'durned innarestin'.

    This line has two sentences. But it also ends with a comment. % here

After we commit it to the local repo, we can see the raw contents.

    $ git show HEAD:test.tex

    \chapter{Tumbling Tumbleweeds. Intro}
    A way out west there was a fella, fella I want to tell you about, fella by the name of Jeff Lebowski. %NL%
     At least, that was the handle his lovin' parents gave him, but he never had much use for it himself. %NL%
    This Lebowski, he called himself the Dude. %NL%
    Now, Dude, that's a name no one would self-apply where I come from. %NL%
     But then, there was a lot about the Dude that didn't make a whole lot of sense to me. %NL%
     And a lot about where he lived, like- wise. %NL%
     But then again, maybe that's why I found the place s'durned innarestin'.

    This line has two sentences. But it also ends with a comment. % here

So the rule of the clean filter is: whenever it finds text ending with . or ? or ! or '' (that's the LaTeX way to close double quotes) followed by a space, it adds %NL% and a newline character. But it ignores lines that start with \ (LaTeX commands) or contain a comment anywhere (so that comments cannot become part of the main text).

The smudge filter removes %NL% and the newline.

Diffing and merging is done on the 'clean' files so changes to paragraphs are merged sentence by sentence. This is the desired behavior.

The nice thing is that the latex file should compile in either the clean or smudged state, so there is some hope for collaborators to not need to do anything. Finally, you could put the git config commands in a shell script that is part of the repo so a collaborator would just have to run it in the root of the repo to get configured.

    #!/bin/bash

    git config filter.sentencebreak.clean "perl -pe \"s/[.]*?(\\?|\\!|\\.|'') /$&%NL%\\n/g unless m/%/||m/^[\\ *\\\\\\]/\""
    git config filter.sentencebreak.smudge "perl -pe \"s/%NL%\n//gm\""

    # Smudge the .tex files that are already checked out (quoted so that
    # file names containing spaces survive).
    find . -iname "*.tex" -print0 | while IFS= read -r -d '' file
    do
        perl -pe "s/%NL%\n//gm" < "$file" > temp && mv temp "$file"
    done

That last little bit is a hack because when this script is first run, the branch is already checked out (in the clean form) and it doesn't get smudged automatically.

You can add this script and the .gitattributes file to the repo, then new users just need to clone, then run the script in the root of the repo.

I think this script even runs on windows git if done in git bash.

Drawbacks:

This doesn't handle lines with comments smartly; it just ignores them.
%NL% is kind of ugly.
The filters may screw up some equations (not sure about this).

Override answered 15/10, 2011 at 22:19 Comment(2)

Would be nice to exclude inline maths, i.e., anything between $...$ or $...$. Actually one could even consider putting them on a line of their own. Is that possible? (Unfortunately I don't speak regexp) – Holmberg 2/4, 2013 at 8:19

“Edits must be at least 6 characters” – okay, so the typo in the filename remains. This really is a very st^H^H … disadvantageous rule on stackoverflow. – Rheotropism 30/10, 2018 at 11:25

You could try this:

Instead of swapping out the merge engine (hard), you can do some kind of 'normalization' (canonicalization, if you will). I don't speak LaTeX, but let me illustrate as follows:

Say you have input like test.raw:

    curve ball well received {misfit} whatever
    proprietary format extinction {benefit}.

You want it to diff/merge word-by-word. Add the following .gitattributes file:

    *.raw     filter=wordbyword

Then

    git config --global filter.wordbyword.clean /home/username/bin/wordbyword.clean
    git config --global filter.wordbyword.smudge /home/username/bin/wordbyword.smudge

A minimalist implementation of the filters would be

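/home/username/bin/wordbyword.clean

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<>)
    {
        # Emit each whitespace-terminated word on its own line ...
        print "$_\n" foreach (m/(.*?\s+)/go);
        # ... then mark where the original line ended.
        print '#@#DELIM#@#' . "\n";
    }

/home/username/bin/wordbyword.smudge

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<>)
    {
        # Rejoin the words; the delimiter line becomes the original newline.
        chomp; '#@#DELIM#@#' eq $_ and print "\n" or print;
    }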

After committing the file, inspect the raw contents of the committed blob with `git show HEAD:test.raw`:

    curve 
    ball 
    well 
    received 
    {misfit} 
    whatever

    #@#DELIM#@#
    proprietary 
    format 
    extinction 
    {benefit}.

    #@#DELIM#@#

After changing the contents of test.raw to

    curve ball welled repreived {misfit} whatever
    proprietary extinction format {benefit}.

the output of git diff --patch-with-stat will probably be what you wanted:

     test.raw |    6 +++---
     1 files changed, 3 insertions(+), 3 deletions(-)

    diff --git a/test.raw b/test.raw
    index b0b0b88..ed8c393 100644
    --- a/test.raw
    +++ b/test.raw
    @@ -1,14 +1,14 @@
     curve 
     ball 
    -well 
    -received 
    +welled 
    +repreived 
     {misfit} 
     whatever

     #@#DELIM#@#
     proprietary 
    -format 
     extinction 
    +format 
     {benefit}.

     #@#DELIM#@#

You can see how this would work magically for merges resulting in word-by-word diffing and merging. Q.E.D.

(I hope you like my creative use of .gitattributes. If not, I enjoyed making this little exercise)

Sapphira answered 7/4, 2011 at 22:51 Comment(1)

I like the creative use of .gitattributes. Seems like this solution has some problematic side effects though... 1) The files will be stored in git's commit snapshots as one-word-per-line, correct? So all of the collaborators will need to have the script (and a working perl installation or whatever script engine we go with) otherwise they will be looking at basically gibberish in their working tree? 2) Commits diffs will also be messy to read, since they will also be one-word-per-line. – Canaliculus 8/4, 2011 at 17:50

I believe the git merge algorithm is quite simple (even though you can make it work harder with the "patience" diff algorithm, e.g. git merge -X patience).
Its unit of work will remain the line.

But the general idea is to delegate any fine-grained detection/resolution mechanism to a third-party tool, which you can set up through git's mergetool configuration (git config merge.tool).
If some words within a long line differ, that external tool (KDiff3, DiffMerge, ...) will be able to pick up that change and present it to you.
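A minimal configuration sketch, assuming KDiff3 is installed and that your git version knows it under the tool name kdiff3 (git mergetool --tool-help lists the names it recognizes):

```shell
# Choose the tool that git mergetool should launch (kdiff3 is an assumption here).
git config --global merge.tool kdiff3

# Later, when a merge has stopped with conflicts, run this inside the repository
# to step through the conflicted files in that tool:
git mergetool
```

This does not make git merge within a line on its own. It only changes how the conflicts that git reports are presented to you.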

Wine answered 7/4, 2011 at 21:36 Comment(4)

If I am understanding you correctly, git mergetool (and you can choose a user friendly word-aware tool) can be used to solve the merge conflicts detected by git, but there is no way to have git silently merge separate changes within a line without considering those changes to be conflicts? – Canaliculus 7/4, 2011 at 21:45

@akeshet: yes, that is why git is referenced as a "stupid content manager": the kind of intelligence you ask for is for you to provide, either through a merge tool or through some kind of script, like sehe illustrates in his answer. – Wine 8/4, 2011 at 3:51

There are also merge drivers – Hellbox 22/7, 2017 at 17:18

@Hellbox Yes, I suppose at the time (6 years ago) I wasn't as aware of merge driver (filter/smudge/clean) as I am now. – Wine 23/7, 2017 at 4:18
