Git: Confusion about merge algorithm, conflict format, and interplay with mergetools
Asked Answered
P

2

10

I don't know the details, but as far as I understand the process of merging and conflict resolution, it goes as follows (assume there is only one file in the repository, modified in two branches):

  1. The user issues a git merge command.
  2. Git applies some git-specific algorithm to automatically merge the two modified files. For this purpose it creates a BASE, LOCAL, OTHER and BACKUP version of the file.
  3. It then writes the merge result into the original tracked file (call it MERGED).
  4. Assume there are conflicts. Git uses some format to represent the conflict (<<<<<<<, |||||||, =======, >>>>>>> markers). It then sets its status to 'merging' or similar.
  5. If the user then issues git mergetool ... the configured external merge tool opens, with arguments pointing to the BASE, LOCAL, OTHER and of course MERGED.

There are a few points I'm confused about:

  • Will the tool always understand Git's conflict format? Is it standardized? What about the diff3 option? Is it also commonly understood by external tools?
  • Will the tool apply its own (and maybe different) merge algorithm and trash the output of Git entirely?
  • When Git needs to perform a recursive merge (because of several merge bases)—and the intermediate merge creates conflicts—will it treat inner conflict markers as plain text just as any other non-conflicting text? Or is the conflict format recursive itself?

I couldn't find any explanation that really tells the whole story.

Pinnatisect answered 27/5, 2017 at 3:24 Comment(0)
A
5

The full answer is complicated. Edward Thomson's covers much of it. Here is considerably more detail.

Let's start, though, with this: git mergetool runs—I should say, you run itafter all the rest of git merge is done. Your merge tools do not even enter the picture until git merge has completed (and failed due to conflicts). This changes a lot of the way you will think about these.

How (recursive and resolve) merge works

The user issues a git merge command.

So far so good.

Git applies some git-specific algorithm to automatically merge the two modified files.

Whoops, no, we've already derailed and the train may be heading off the cliff. :-)

The first step at this point is to choose a merge strategy. Let's pick the default (-s recursive) strategy. If we pick some other strategy, the next step may be different (it is entirely different for -s ours, and somewhat different for -s octopus, but none of those are interesting right now anyway).

The next step is to find all the merge bases. With any luck there is only one. We'll come back to the recursion issue later. There could be no merge base, though. Older versions of Git used an empty tree as a fake merge base. Newer ones—2.9 or later—demand that you add --allow-unrelated-histories here (and then proceed in the same way). With an empty tree, every file is added, in both non-base commits.

If there is one merge base, it might be the same as either branch tip. If so, there is no merge to perform. There are two sub-cases here as well, though. There may be nothing to merge, because the merge base is the other commit and the other commit is "behind" (is an ancestor of) the current commit. In this case, Git always does nothing. Or, the other commit may be ahead of (a descendant of) the current commit. In this case, Git normally does a fast-forward operation, unless you specify --no-ff. In both cases (fast-forward or --no-ff), no actual merging happens. Instead, the further-ahead commit gets extracted. It either becomes the current commit (fast-forward merge: whatever branch you are on, it now points to the further-ahead commit), or Git makes a new commit using that commit's tree, and the new commit become the current commit.

A real merge: merging one merge base with two commits

We are now at a phase where we have a single merge base commit B, and two commits L (local or left-side, --ours) and R (remote or right-side, --theirs). Now, the two normal (-s recursive and -s resolve) strategies do a pair of git diff --name-status operations with rename detection enabled, to see if there are files in the B-to-L change that change their names, and if there are files in the B-to-R change that change their names. This also finds out if there are newly added files in either L or R, and if files are deleted in either L or R. All of this information is combined to produce file identities, so that Git knows which sets of changes to combine. There may be conflicts here: a file whose path was PB in the base, but is now both PL and PR, has a rename/rename conflict, for instance.

Any conflicts at this point—I call them high level conflicts—lie outside the domain of file-level merging: they will make Git end this merge process with a conflict, regardless of whatever else occurs. In the meantime, though, we end up with "identified files", as I said above, without quite defining it. Loosely, what this means is that just because some path P got changed, doesn't mean it's a new file. If there was a file base in the base commit B, and it's now called renamed in L but still called base in R, Git will use the new name, but compare B:base with L:renamed and B:base with R:base when Git goes to combine changes at the file level.

In other words, the file identity we compute at this stage tells us (and Git) which files in B match which files in L and/or R. This identity is not necessarily by path name. It's just usually the case that all three paths match.

There are a few small tweaks you can insert during this first diff phase:

  • Renormalization (merge.renormalize): you can make Git apply text conversions from .gitattributes and/or core.eol settings. The .gitattributes settings include the ident filter and any smudge and clean filters (though only the smudge direction applies here).

    (I assumed Git did this early, since it may affect rename detection. I have not actually tested this, though, and I just looked through the Git source and it seems to not use this at this stage. So perhaps merge.renormalize does not apply here, even though a smudge filter could radically rewrite a file. Consider a filter-pair that encrypts and decrypts, for instance. This is probably a bug, albeit a small one. Fortunately EOL conversion has no effect at all on similarity index values.)

  • You can set the similarity index for when Git will consider files to be renamed, or disable rename detection entirely. This is the -X find-renames=n extended strategy option, previously called the rename threshold. It is the same as the git diff -M or --find-renames option.

  • Git currently has no way to set the "break" threshold a la git diff -B. This also affects file identity computation, but if you can't set it, it doesn't really matter. (You probably should be able to set it: another minor buglet.)

Merging individual files

Now that we have our files identified and have decided which ones match up with which other ones, we finally proceed to the file-merging level. Note that here, if you are using the built-in merge driver, the remaining settable diff options will start to matter.

Let me quote this bit again, as it's relevant:

Git applies some ... algorithm to automatically merge the two modified files. For this purpose it creates a BASE, LOCAL, OTHER and BACKUP version of the file.

There are three (not four) files involved at this point, but Git does not create any of them. They are the files from B, L, and R. These three files exist as blob objects in the repository. (If Git is renormalizing files, it does have to create the renormalized ones as blob objects at this point, but then they live in the repository, and Git just sort of pretends they were in the original commits.)

The next step is quite critical, and it is where the index comes into the picture. The hash IDs of those three blob objects are HB, HL, and HR. Git gets ready to place these three hashes into the index, in slots 1, 2, and 3 respectively, but now uses the rules described in the git read-tree documentation under the 3-Way Merge section:

  • If all three hashes are equal, the file is already merged and nothing happens: the hash goes into slot zero. Even if only the second and third hashes are equal, the file is still already merged: both L and R make the same change with respect to B. The new hash goes into slot zero and the file-merge is complete.
  • If HB = HL and HB ≠ HR, the right side (remote/other/--theirs) file should be the result. This hash goes into slot zero and the file-merge is complete.
  • If HB ≠ HL and HB = HR, the left side (local/--ours) file should be the result. This hash goes into slot zero and the file-merge is complete.
  • This leaves only the case where all three hashes differ. Now the files really do need to be merged. Git places all three hashes into the three index slots.

There are a few special cases that can apply at this point, all having to do with higher-level conflicts. It's possible that one or two index slots are left empty for some path names, because the index is carefully managed in a way that keeps it synchronized with the work-tree (so that it can play its role as a cache that speeds up Git a lot). But in principle, especially when we are concerned with merge drivers, we can think of this as just "all three slots"—they just may be three slots spread across several names, in the case of renamed files.

Invoking merge drivers (.gitattributes)

At this point, we have an actual file-level merge to perform. We have three input files. Their actual contents are stored in the repository, as blob objects. Their hash IDs are stored in the index, in slots 1 through 3 (usually of a single index entry, but in the case of renames, maybe using more than one index entry). We may now:

  • Use git's built in file merge (which is also available as an external command, git merge-file).

    The built in file merge works directly from the index (though if we want to run it via git merge-file we must extract the blobs into the file system). It extracts the files, does its thing to merge them, and optionally—depending on extended-strategy-options -X ours or -X theirs—writes conflict markers as well. It drops its final result into the work-tree, under whatever path name Git chose as the final path name, and is finished.

  • Use a merge driver (via .gitattributes). A merge driver is run with arguments. However, these arguments are constructed by having Git extract the three blob objects to three temporary files.

    The arguments are expanded from whatever we put in as %O, %A, %B, %L, and %P. These argument letters don't quite match what we have been using: %O is the name of the base file, %A is the name of the left-side / local / --ours version, %B is the name of the right-side / other / remote / --theirs version, %L is the conflict-marker-size setting (default 7), and %P is the path that Git wants to use to save the final result in the work-tree.

    Note that %O, %A, and %B are all the names of temporary files that Git created (to hold the blob contents). None of them match %P. Git expects the merge driver to leave the result of the merge in the path %A (which Git will then rename to %P on its own).

In all cases, the merged file goes into the work-tree, at this point. If the merge went well, the higher-numbered slots in the index are cleaned out: Git, in effect, runs git add on the work-tree file, writing the data into the repository as a blob object, and getting a hash ID that goes into slot zero. If the merge failed with conflicts, the higher-numbered slots remain in place; slot zero is left empty.

The end result of all this is that the work-tree holds the merged files, perhaps with conflict markers, and the index holds the result of the merge, perhaps with conflicts that should be resolved.

Using git mergetool

This works much the same way as a merge driver. Aside from running only after the merge has completed with its results in the index and work-tree, though, the main differences are:

  • git mergetool will make extra copies of files (the .orig files).
  • It knows exactly how to run each known tool, i.e., what arguments to pass to make that tool do something useful. There is no equivalent to a driver %O placeholder, for instance.
  • It can run commands on all the as-yet-unmerged files in some directory.

In fact, git mergetool is a big shell script: it uses git ls-files -u to find the unmerged index entries, and git checkout-index to extract each stage from the index. It even has special cases for the higher level conflicts such as add/add or rename/delete.

There's an additional driver shell-script fragment per known tool: look in

$ ls $(git --exec-path)/mergetools

to see all the individual tool drivers. These are passed a flag, $base_present, for handling add/add conflicts. (They are sourced, i.e., run with . "$MERGE_TOOLS_DIR/$tool", so that they can override shell functions defined in the script.)

For unknown tools, you use the shell's variable names $BASE, $LOCAL, and $REMOTE to know where the script has put the three files extracted from the index, and you write your result to $MERGED (which is in fact the work-tree name for the file). The script does this:

setup_user_tool () {
        merge_tool_cmd=$(get_merge_tool_cmd "$tool")
        test -n "$merge_tool_cmd" || return 1

        diff_cmd () {
                ( eval $merge_tool_cmd )
        }

        merge_cmd () {
                ( eval $merge_tool_cmd )
        }
}

i.e., evals your tool command in a sub-shell, so that you cannot override things the way the known tools can.

Recursive merge

When Git needs to perform a recursive merge ...

Most of this question is kind of moot at this point. A merge tool never sees this situation at all, because git mergetool is invoked after Git itself has finished the recursive merge and left the result in the index and work-tree. However, merge drivers do get a say here.

When the -s recursive merge strategy is merging merge-bases to make a new "virtual commit", it invokes another git merge—well, more precisely, just calls itself recursively—on the merge base commits (but see below). This inner git merge knows that it's being invoked recursively, so when it is about to apply a .gitattributes merge driver, it checks the recursive = setting there. This determines whether the merge driver is used again, or some other merge driver is used for the inner merge. For the built-in merge driver, Git turns off the extended strategy options, i.e., neither -X ours nor -X theirs is in effect.

When an inner merge completes, its result—all the files that would be left in the work-tree, were this not an inner, recursive merge—is actually saved as a real commit. This is true even if there were unresolved conflicts. These unresolved conflicts may even contain conflict markers. Nonetheless, this is the new "virtual merge base" commit, and it is a true commit; it just has no external name by which you can find its commit hash.

If there are three or more merge bases at this particular level, rather than just two merge bases, this new virtual merge base is now merged with the next remaining merge base, iteratively. Logically, Git could use a divide-and-conquer strategy here: if there were 32 merge bases initially, it could merge them two at a time to produce 16 commits, merge those two at a time to produce 8, and so on. Aside from doing ceil(log2(N)) merges instead of N-1 merges, though, it's not clear that this would buy much: it's already quite rare to have N > 1.

Anemophilous answered 27/5, 2017 at 16:29 Comment(7)
Of course, +1. On the index and its stage, you also wrote https://mcmap.net/q/22501/-how-to-stage-a-rename-without-subsequent-edits-in-gitHoneywell
@VonC: yes, but that other answer is about normal, non-merge-y index entries.Anemophilous
Needless to say, this is an excellent answer! Exactly the level of detail I was looking for. So thank you very much for the effort! I still have open questions: Is the format of the conflict markers somehow standardized? And: Do external merge tools make any use of the already produced (by Git) markers in the merged file? As I understand, they use $MERGED only as a write-target. And just to confirm: Inner merge conflict markers are therefore treated as "normal" file contents, right?Pinnatisect
Do external merge tools make any use of the already produced (by Git) markers in the merged file? I doubt it, although it's possible (since each tool has its own script, and can do whatever it wants). Is the format of the conflict markers somehow standardized? Git itself only writes one kind, but the length varies and it has both merge and diff3 conflict-style settings. Inner merge conflict markers are therefore treated as "normal" file contents, right? They become part of the new commit that is the next input, so, yes; but I doubt they play well with each other, so this [cont'd]Anemophilous
... so this seems like a candidate for future improvement, should conflicts occur often in virtual bases in the future (not that I see that as likely).Anemophilous
@torek, I see! And the return code of the external merge tool actually determines whether or not the file pointed to by $MERGED should be considered resolved, right? However, git could also distrust it and instead ask the user explicitly? (mergetool.<tool>.trustExitCode)Pinnatisect
Yes—and in the latter case (no trust-exit-code setting), the script checks to make sure that the file named $MERGED is newer than a backup copy the script makes.Anemophilous
H
4

Merge tools don't parse the file in the working directory with the conflict markers. They read the ancestor, ours and theirs files that git mergetool creates from the index and places on disk for them.

They will use their own logic to produce a merge result and will overwrite the file created by Git.

Hallie answered 27/5, 2017 at 4:39 Comment(2)
How can they read the index? Then they would need to understand Git internals or issue Git commands in the background. Do they even know Git? And why does Git then create all those file versions (like LOCAL) on the disk?Pinnatisect
No, they don't know anything about Git. The git mergetool command creates all those file versions for them.Hallie

© 2022 - 2024 — McMap. All rights reserved.