git combining two files into one with history preserved
Imagine that you have two files in a git repository, say A.txt and B.txt.

Is it possible to concatenate the two files into a third one, A+B.txt, remove the original A.txt and B.txt, and commit it all so that the history is still preserved?

That is, if I ran git log --follow A+B.txt, would I see that the content originated from the A.txt and B.txt files?

I've tried putting the files on two different branches and then merging them into a new file (while removing the old ones), but to no avail.

Crossfade answered 6/10, 2017 at 17:59 Comment(3)
You can try renaming A.txt to A+B.txt, adding in the contents of B.txt, deleting B.txt, and then committing that. Ursulina
Why not state in the commit message that creates A+B.txt that it is a concatenation of A.txt and B.txt? Gorden
Reverse operation - Preserving Git history while splitting fileDiagonal
S
13

The long answer is 'yes'!

Full credit to Raymond Chen's article Combining two files into one while preserving line history:

Imagine you had two files: fruits & veggies

git blame for both fruits and veggies

The naïve way of combining the files would be to do it in a single commit, but you'll lose line history on one of the files (or both).

You could tweak the git blame algorithm with options like -M and -C to get it to try harder, but in practice you often don't have control over those options (e.g. the git blame may be performed on a server).
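To see what those options do with a naive single-commit combine, here is a minimal sketch for a throwaway repository. The file contents, the identity settings and the temporary directory are all made up for illustration. What each invocation attributes depends on Git's similarity thresholds and on file size, so the script only prints both results side by side rather than promising a particular outcome.

```shell
#!/bin/sh
set -eu

# Scratch repository in a temporary directory (hypothetical demo data).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'apple\nbanana\ncherry\n' > fruits
git add fruits && git commit -q -m "add fruits"
printf 'carrot\nleek\nonion\n' > veggies
git add veggies && git commit -q -m "add veggies"

# Naive combine: concatenate and delete the originals in a single commit.
cat fruits veggies > produce
git rm -q fruits veggies
git add produce
git commit -q -m "combine fruits and veggies"

# Default blame versus blame asked to look for moved and copied lines.
# -C given twice also considers other files in the commit that created produce.
plain=$(git blame -s produce)
harder=$(git blame -s -M -C -C produce)
printf 'default blame:\n%s\n\nblame -M -C -C:\n%s\n' "$plain" "$harder"
```

Comparing the commit IDs in the two listings shows how much the extra options recover on your Git version and with your file sizes.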

The trick is to use a merge with two forked branches

  • In one branch, we rename veggies to produce.
  • In the other branch, we rename fruits to produce.
git checkout -b rename-veggies
git mv veggies produce
git commit -m "rename veggies to produce"
git checkout -
git mv fruits produce
git commit -m "rename fruits to produce"

Then merge the first branch into the second:

git merge -m "combine fruits and veggies" rename-veggies

This will produce a merge conflict, which is expected. Take the contents of each branch's produce file and combine them into one. Here's a simple concatenation (but resolve the merge conflict however you please):

cat "produce~HEAD" "produce~rename-veggies" >produce
git add produce
git merge --continue

The resulting produce file was created by a merge, so git knows to look in both parents of the merge to learn what happened.

git blame for produce

And that’s where it sees that each parent contributed half of the file, and it also sees that the files in each branch were themselves created via renames of other files, so it can chase the history back into both of the original files.

Each line should be correctly attributed to the person who introduced it in the original file, whether it’s fruits or veggies. People investigating the produce file get a more accurate history of who last touched each line of the file.
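The whole sequence can be tried end to end in a throwaway repository. This is a minimal sketch under a few assumptions: the file contents and identity settings are invented, the conflict is resolved by reading each side's produce from its commit with git show (so the script does not depend on how a given Git version names the conflicted files in the working tree), and GIT_EDITOR=true is set only so that git merge --continue does not open an editor in a script.

```shell
#!/bin/sh
set -eu

# Scratch repository with two hypothetical files committed separately.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'apple\nbanana\n' > fruits
git add fruits && git commit -q -m "add fruits"
printf 'carrot\nleek\n' > veggies
git add veggies && git commit -q -m "add veggies"

# Pure rename on each of two forked branches.
git checkout -q -b rename-veggies
git mv veggies produce
git commit -q -m "rename veggies to produce"
git checkout -q -
git mv fruits produce
git commit -q -m "rename fruits to produce"

# The merge is expected to stop with a conflict, so do not let set -e abort.
git merge -q -m "combine fruits and veggies" rename-veggies || true

# Resolve by concatenating both sides, read straight from the two commits.
{ git show HEAD:produce; git show rename-veggies:produce; } > produce
rm -f produce~*          # leftover side files, if this Git version wrote any
git add produce
GIT_EDITOR=true git merge --continue

# Collect the commit summaries that blame assigns to the lines of produce.
blame_summaries=$(git blame --line-porcelain produce | sed -n 's/^summary //p' | sort -u)
printf '%s\n' "$blame_summaries"
```

The final listing names the commits that blame credits for the lines of produce, which makes it easy to compare against the original "add" commits.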

For best results, your rename commit should be a pure rename. Resist the temptation to edit the file’s contents at the same time you rename it. A pure rename ensures that git’s rename detection will find the match. If you edit the file in the same commit as the rename, then whether the rename is detected as such will depend on git’s “similar files” heuristic.
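One way to check that a rename commit really is pure is to ask git diff for its rename status. The sketch below assumes a scratch repository with an invented fruits file; a pure rename is reported with a similarity score of 100.

```shell
#!/bin/sh
set -eu

# Scratch repository with one hypothetical file.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'apple\nbanana\n' > fruits
git add fruits && git commit -q -m "add fruits"

# Rename without touching the contents.
git mv fruits produce
git commit -q -m "rename fruits to produce"

# --name-status with -M prints R<score>, the old path and the new path.
git diff --name-status -M HEAD~1 HEAD
status=$(git diff --name-status -M HEAD~1 HEAD | cut -f1)
echo "rename status: $status"
```

A score below 100, or a separate delete and add, means the commit changed the contents as well and is relying on the similarity heuristic.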

Check out the full article for a step-by-step breakdown and more explanation.


Originally, I had thought this might be a use case for git merge-file doing something like this:

: > base    # empty file to act as the common ancestor
git merge-file -p --union fruits base veggies > produce
rm base
git rm fruits veggies
git add produce
git commit -m "combine fruits and veggies"

However, all this does is simulate the merge diffing algorithm against two different files. Once committed, the result is identical to editing the file by hand and committing the changes manually.

Sirius answered 4/11, 2019 at 22:34 Comment(4)
Combine both files in each branch so the merge has no conflicts. Otherwise rebasing revives the conflict every time.Carlettacarley
That's awesome! But how would you go about merging file A into file B, without renaming file B? Do I first have to rename the target file to something else in a separate commit? Or is there a nicer way to do it?Drillstock
@Hubro, the trick is you have to rename both A & B, that way they both get to be considered parents, and there's no single winner.Sirius
To see full line history, e.g. after splitting and rejoining a file, use git blame -C40. To get the same effect in the TortoiseGitBlame window, set "Detect moved or copied lines" to "From modified files". Chavarria
T
5

The short answer is "no" (or perhaps even Mu). (But for a way to get useful synthesized line history for a combined file via git blame, see KyleMit's answer.)

History, in Git, is the set of commits. There is no such thing as "file history": you either have a commit, or you don't, and that commit has one or more parents, or it doesn't. This means that "file history" as a thing doesn't exist—and yet, git log --follow exists. This is self-contradictory: How can git log --follow produce a file history, if file history doesn't exist?

The answer is that git log --follow cheats. It doesn't really find file history. It looks through history and constructs a sub-history by changing the (single) name of the file it is looking for. It looks at each commit, one at a time, and runs a (sped-up, limited) git diff --find-renames of that commit against its parent.[1] If the diff says that file X.txt in the parent was renamed to A.txt in the child, and you're running git log --follow A.txt, the code in git log now starts looking for X.txt.
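The name-switching behaviour is easy to observe in a scratch repository. This sketch uses the X.txt and A.txt names from the paragraph above; the contents and identity settings are invented.

```shell
#!/bin/sh
set -eu

# Scratch repository: create X.txt, then rename it to A.txt.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

printf 'one\ntwo\n' > X.txt
git add X.txt && git commit -q -m "add X"
git mv X.txt A.txt
git commit -q -m "rename X to A"

# Without --follow, only commits touching the path A.txt are listed.
without_follow=$(git log --format=%s -- A.txt)
# With --follow, the search switches to X.txt once the rename is found.
with_follow=$(git log --follow --format=%s -- A.txt)

printf 'without --follow:\n%s\n\nwith --follow:\n%s\n' "$without_follow" "$with_follow"
```

Without --follow only the rename commit is listed; with it, the commit that created X.txt shows up as well.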

Since there's no code to start looking for more than one file at a time, you can't get this particular cheat to accommodate your desired situation, which is to go from looking for one particular file to more-than-one file. (There are actually two problems here. One is that, due to the rather limited internal implementation,[2] git log --follow can only look at one file at a time. The other is that rename detection does not include "combine detection": there is a form of "split detection", in which Git will do copy-finding, enabled with --find-copies and --find-copies-harder. The latter is very compute-intensive, and both are working in the wrong direction here, although it could be made to do the right thing simply by reversing the order of the diff.)


[1] As this implies, --follow doesn't look at merge diffs at all, at least by default. See also `git log --follow --graph` skips commits.

[2] a.k.a. "cheesy hack"

Thynne answered 6/10, 2017 at 18:23 Comment(2)
looks like one vehicle is to create identical file names and merge from one branch into another so git will make an attempt to look at the merged parents when assigning line by line attribution (see below)Sirius
@KyleMit: that is a nice hack. It doesn't make the one file have two file histories—there's still no real file history at all, and git log won't help you here. But it does make git blame on the merge commit, and commits that follow it, much more useful. The blame (or annotate) command is synthesizing line history from commit history, and this makes it do a much better job of it.Thynne
C
0

The article by Raymond Chen, cited by KyleMit, is the best answer. Below is a solution that ultimately keeps only half of the line history, but I'm going to leave it up for reference/education.

Instead of merging the branches, just use cherry-pick to pull in the commit. This will still cause a conflict to be resolved, but the result will be a single commit without a merge commit, and a simpler history for future operations (at the cost of one file's line history).

git checkout -b temp
git mv A.txt AB.txt
git commit -am "moving A to AB"
git switch main
git mv B.txt AB.txt
git commit -am "moving B to AB"
git cherry-pick temp

Resolve the conflict in AB.txt, then:

git add AB.txt
git cherry-pick --continue

AB.txt will keep the blame history of only one of the original files, as noted above:

git blame AB.txt
Copula answered 7/8, 2022 at 21:58 Comment(0)
