hg to git conversion and subrepo merge

Asked 10/5, 2016 at 17:43 Answered 15/8, 2016 at 19:19

git version-control mercurial mercurial-subrepos

Despite involving two subparts, I'm asking this as a combined question because the way it's broken down into parts isn't what's important. I'm open to different ways to achieve what I want as long as the end result retains all the meaningful history and ability to check out, study, and build/test historical versions. The goal is to retire hg and the subrepo model that's been used so far and move to a unified tree in git, but without sacrificing history.

What I'm starting with is a Mercurial repository that consists of some top-level code and a number of subrepositories where the bulk of interesting history lies. The subrepos have some branching/merges, but nothing too crazy. The final result I want to achieve is a single git repository, with no submodules, such that:

For each commit in the original top-level hg repo, there is a git commit that checks out exactly the same tree as you'd get checking out the corresponding hg commit with all its referenced subrepo commits.
These git commits corresponding to successive top-level hg commits are descendants of each other, with commits corresponding to all relevant subrepo commits in between.

The basic idea I have for how to achieve this is to iterate over all top-level hg commits, and for each top-level commit that changes .hgsubstate, also iterate over all paths from the old revision to the new revision for the subrepo (possibly involving branching). At each step:

Check out the appropriate hg revisions for top-level and all subrepos.
Delete everything from the git index.
Stage everything checked out from hg to the git index.
Use git-write-tree and git-commit-tree to generate a commit with the desired parents, using authorship, date, and commit message from the corresponding hg commit.
Record the correspondence between the new git commit and hg commits for use in generating future commits' parents.
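
The steps above could be sketched in shell roughly as follows. This is only a sketch under stated assumptions: the function names hg_date_to_git and commit_snapshot are made up for illustration, and it assumes .hg is listed in .git/info/exclude so that Mercurial metadata is not staged. The date helper exists because hg's {date|hgdate} template prints the offset in seconds west of UTC, while git's raw date format expects +hhmm/-hhmm east of UTC.

```shell
# Convert "{date|hgdate}" output ("<unixtime> <seconds west of UTC>") into
# git's raw date format ("<unixtime> <+|-hhmm>", positive meaning east of UTC).
hg_date_to_git() {
    set -- $1                       # split "unixtime offset" into $1 and $2
    ts=$1 off=$2
    if [ "$off" -le 0 ]; then sign='+'; off=$(( -off )); else sign='-'; fi
    printf '%s %s%02d%02d\n' "$ts" "$sign" $(( off / 3600 )) $(( off % 3600 / 60 ))
}

# Snapshot the current working directory as a git commit and print its id.
# usage: commit_snapshot "<hgdate>" "<name>" "<email>" "<message>" [parent-sha ...]
# Assumption: run inside the git repo, with the hg checkout as its work tree.
commit_snapshot() {
    date=$(hg_date_to_git "$1"); name=$2; email=$3; msg=$4
    shift 4
    parents=''
    for p in "$@"; do parents="$parents -p $p"; done
    git read-tree --empty &&        # delete everything from the index
    git add -A . &&                 # stage everything checked out from hg
    tree=$(git write-tree) &&
    GIT_AUTHOR_NAME="$name" GIT_AUTHOR_EMAIL="$email" GIT_AUTHOR_DATE="$date" \
    GIT_COMMITTER_NAME="$name" GIT_COMMITTER_EMAIL="$email" GIT_COMMITTER_DATE="$date" \
        git commit-tree $parents -m "$msg" "$tree"
}
```

The commit id printed by commit_snapshot can be appended, together with the hg node it came from, to a plain two-column map file, which is then enough to look up parents for later commits.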

Should this work? Is there a better way to achieve what I want, perhaps doing the subrepo collapse with hg first? The biggest thing I'm not clear on is how to perform the desired iteration, so practical advice for how to achieve it would be great.
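
As a starting point for the iteration, it may help to work out which subrepos actually moved between two top-level commits. The sketch below compares two saved copies of .hgsubstate (each line is a node hash followed by the subrepo path). The function name changed_subrepos is hypothetical, and subrepos that were removed between the two snapshots are deliberately not reported.

```shell
# Print "<path> <old-node> <new-node>" for every subrepo whose pinned revision
# differs between two .hgsubstate snapshots ("-" means the subrepo is new).
# usage: changed_subrepos OLD_HGSUBSTATE NEW_HGSUBSTATE
changed_subrepos() {
    awk '
        { node = $1; path = substr($0, length($1) + 2) }   # path may contain spaces
        FILENAME == ARGV[1] { old[path] = node; next }     # first file: remember old pins
        {
            prev = (path in old) ? old[path] : "-"
            if (prev != node) print path, prev, node
        }
    ' "$1" "$2"
}
```

The two snapshots could be produced with something like hg cat -r 'p1(REV)' .hgsubstate and hg cat -r REV .hgsubstate, where REV stands for the top-level commit being processed.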

One additional constraint: the original repos involve content which can't be published (this is an additional git-filter-branch step once the basic conversion is done), so solutions that involve uploading the repo for processing by a third party are not viable.

Attar answered 10/5, 2016 at 17:43 Comment(15)

It looks to me like git fast-import was made for jobs like this. – Geum 17/5, 2016 at 16:43

@jthill: Can you elaborate? I don't see anything about using it for merging subrepo history which is a highly nontrivial task even at a high level. – Attar 17/5, 2016 at 17:14

The thing to understand is, the commit-history dag is all there is. It isn't an abstraction, and there is no global state. You can use .hgsubs and .hgsubstate to find the subrepositories, and recursively import them into your main git repository, starting from e.g. hg manifest --debug output. Once you've got them all in one git repo, you can construct arbitrary additional histories any way you want. This is going to be very much faster than the read-tree/write-tree manipulations. The elaboration needed at that point is only exactly what do you want as your resulting history? – Geum 17/5, 2016 at 17:45

@jthill: I think you're missing what's hard about the problem. As far as I can tell, there is no trivial or even canonical way to merge multiple commit-history dags into one such that any reasonable properties are maintained. In a repo with subrepos, the full tree state is defined for every commit in the top-level repo by which subrepo revisions it references, but between these commits, if more than one subrepo has changed or nonlinear changes have been made in a subrepo, there are lots of degrees of freedom for how you represent that in the unified history... – Attar 17/5, 2016 at 17:58

...and choosing a way that has nice properties does not seem trivial. One "easy" choice is treating each subrepo's changes between top-level revisions as one or more branches, branching from the full-tree revision at the parent full-tree revision, and merging them all with a big multi-parent merge commit to achieve the full-tree state of the next revision of the top-level input repo. But this seems to yield a lot of gratuitous branching/merge structure and doesn't represent concurrent, possibly related changes to multiple subrepos... – Attar 17/5, 2016 at 18:1

If commits in the subrepos are interrelated, where and how is that relation represented currently? The simplest way represents all the information in the histories you've mentioned so far, does it not? I don't see anything gratuitous about including detailed histories you've explicitly asked to have included. – Geum 17/5, 2016 at 19:13

@jthill: It's not represented except perhaps in commit timestamps. It likely doesn't matter in the vast majority of cases, but even if it doesn't, I think it would be much nicer to have commits to separate parts of the code in a linear, chronological history where reading the log and bisecting are easy, rather than in a gratuitously complex branch/merge structure. – Attar 17/5, 2016 at 19:42

If the subrepo histories between the mainline commits are linear, you can linearize the resulting history. If the subrepo histories aren't, you can't, not without additional information that apparently doesn't exist. – Geum 17/5, 2016 at 20:19

What happens if you designate the first-parent line from each submodule commit as "mainline" and carry all the subrepo mainline commits through to the master mainline history in time sequence? If first-parent submodule history gets you from the submodule commit in this master commit to the one in the previous one, that seems like a pretty safe bet. If first-parent ancestry doesn't get to the submodule commit in the previous master, leaving the incoming history as a merge parent seems at least reasonable. But all of this is going to depend on the actual histories you're talking about. – Geum 17/5, 2016 at 23:5

@R What about submodules? You could turn the subrepos into submodules. What about that? – Pris 19/5, 2016 at 17:18

Get all the subrepos as submodules, and assign them to the top level that's in hg. – Pris 19/5, 2016 at 17:20

@khrm: Getting rid of the subrepo/submodule structure, which was largely a mistake, is a big part of the goal. A single unified repo contains strictly more information about development history (in form of an ordering/structure between commits from different components) than a subrepo structure. This is of course why the conversion is hard - it's having to recreate some structure that's lost by having used subrepos. – Attar 19/5, 2016 at 17:52

@jthill: That sounds like an interesting/viable strategy. – Attar 19/5, 2016 at 17:54

How about using .hgsubstate: go to the first commit, then take all the subrepos one by one until you reach the next commit that changes .hgsubstate? Of course, there won't be any ordering between subrepos. – Pris 19/5, 2016 at 18:8

@khrm: That sounds viable and reasonably easy to do. jthill's approach of interleaving them chronologically via first-parent relationships sounds more difficult but like it might be mildly better (or maybe worse, depending on the content of the commits). – Attar 19/5, 2016 at 20:0

What you have written might or might not solve the issue, but it isn't simple. The main issue is that you need the commits in order so that your subrepos and main repo are consistent. I recreated this problem on a small scale and was able to keep the subrepos consistent with each other as well.

My solution:

Using the hg convert extension, I converted the main repo to a repo without subrepos (and without the related metadata).

cd main
awk '{ print  $1}'  .hgsub | xargs -n 1 echo 'exclude'  > ../filemap
echo exclude .hgsub >> ../filemap
echo exclude .hgsubstate >> ../filemap
cd ..
hg convert --filemap filemap  main mainConv
cd mainConv
hg update
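
The awk one-liner above assumes every line of .hgsub has the form "path = source" with spaces around the equals sign. A slightly more defensive sketch of the same filemap generation is shown below. The function name make_exclude_filemap is made up for illustration, and it assumes subrepo paths contain no "=" and that everything after a section header such as [subpaths] is not a subrepo definition.

```shell
# Print an hg-convert filemap that excludes every subrepo path listed in a
# .hgsub file, plus the subrepo bookkeeping files themselves.
# usage: make_exclude_filemap main/.hgsub > filemap
make_exclude_filemap() {
    awk -F ' *= *' '
        /^\[/            { skip = 1; next }   # section header, e.g. [subpaths]
        skip             { next }             # ignore everything inside a section
        /^[ \t]*(#|$)/   { next }             # skip comments and blank lines
        NF >= 2          { print "exclude " $1 }
    ' "$1"
    printf 'exclude %s\n' .hgsub .hgsubstate
}
```

The output can be passed to hg convert --filemap exactly like the filemap built by the commands above.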

Convert each subrepo, using a rename rule in --filemap to move its contents under the subrepo's path.

cd ..
echo rename . subRepo > subFileMap
hg convert --filemap subFileMap main/subRepo subRepoConv
cd subRepoConv
hg update

Pull subrepos to converted main repo.

cd ../mainConv
hg pull -f ../subRepoConv

You will notice multiple heads in the repo after pulling (because each subrepo has its own head). Merge them:
```
 hg heads
 hg merge <RevID from subrepo (not main repo)>
 hg ci -mMergeOfSubRepo
```

Repeat the pull and merge steps above for every subrepo (after converting each one as shown earlier).

But the commits won't be sorted by date, so put them in order as done here: https://stackoverflow.com/a/16012597

cd ..
hg clone -r 0 mainConv mainOrdered
cd mainOrdered
# pull the merged repo's revisions one at a time, oldest commit date first
for REV in $(hg log -R ../mainConv -r 'sort(1:tip, date)' --template '{rev}\n')
do
    hg pull ../mainConv -r "$REV"
done

Now convert this ordered mercurial repo to git using http://repo.or.cz/w/fast-export.git:

cd ..
git clone git://repo.or.cz/fast-export.git
git init mainGit
cd mainGit
../fast-export/hg-fast-export.sh -r ../mainOrdered
git checkout HEAD

Pris answered 13/5, 2016 at 21:57 Comment(3)

Any complications I should be aware of for putting them in order when the history is not entirely linear? I don't understand how that part is supposed to work. – Attar 14/5, 2016 at 1:58

Yes, I have assumed that the dates are all from the same time zone and are nearly synchronized for all committers. History is not required to be linear for sorting by date. That part sorts all revisions by date and then pulls them, one by one, into the new repo. It might increase the repo size, as mentioned in the link I shared, because of delta computation. (For me it decreased, but I don't think that would generally be the case.) – Pris 14/5, 2016 at 2:58

This answer might be onto something, but I don't understand from the current contents whether/how it preserves history structure or meets the constraints I asked for in the question. My lack of familiarity with hg might be part of the cause, but in any case I don't feel right awarding the bounty at this point. (I think SO might assign it by default when I let it expire, though. If not, and this answer turns out to be what I end up using, I'll just run another bounty and award it.) – Attar 20/5, 2016 at 18:33

Yes. Your best bet is creating the commits manually with git commit-tree. There are many conversion tools, but they will never give you exactly what you want. On the other hand a hand-written script will give you all the flexibility that you need.

I've written many of these scripts, including git remote-hg itself.

Pinot answered 18/5, 2016 at 18:3 Comment(3)

I suspect this is the approach I'll end up having to take, but this answer doesn't really add anything new to make it worthy of the bounty. If it had more detail and advice on tools to use to make the scripting easy, I'd consider it, but it's too late now unless I open a new bounty later. – Attar 20/5, 2016 at 18:35

So? You asked if git write-tree and git commit-tree was the right approach, I'm telling you it is. Do you want me to tell you to use git remote-hg, or something that "adds" something new? It won't help you. I told you: you have to check out each Mercurial commit and create the commits by hand. It's simple, and it works. – Pinot 21/5, 2016 at 0:50

I didn't mean to offend or be hostile. Sorry if my comment came across that way. – Attar 21/5, 2016 at 4:39

Unrelated offtopic

I'm sure you have chosen the worst kind of migration (from Mercurial to Git), but in the end it's your choice and your responsibility.

Migration course

My knowledge of Git is rather weak, so for Mercurial+subrepo -> monolithic Git I can see and describe only this route:

Mercurial+subrepo -> monolithic Mercurial -> monolithic Git repo

To merge the subrepos' history with the wrapper repo's history, you can (with the correction from alexis's comment) use my idea from an earlier question about the Convert extension.
A monolithic Mercurial repo with additionally polished history (one root, no anonymous heads without at least linked bookmarks) can easily be pushed to an empty Git repo using hg-git.

Checker answered 11/5, 2016 at 5:20 Comment(0)

It seems what I was missing from my question and discussion of possible solutions was a proper understanding of the graph theory involved. Ideas like "iterate over all paths from the old revision to the new revision" were not really well-defined, or at least didn't reflect what I expected them to reflect. Coming at it from a more rigorous standpoint, I think I have an approach that works.

To begin with, the problem: Subrepo revisions only represent the state of their own subtrees at a given point in history. I want to map them to revisions that represent the state of the whole combined tree. Then the subrepo DAGs can be merged with the top-level DAG in a meaningful way.

For a given subrepo revision R, we can ask what top-level-repo (or parent-repo, if we had multiple levels of subrepos) revisions include R or any descendant of R. Assuming a single root, this set of revisions has a Lowest Common Ancestor (or maybe more than one), which seems like a good candidate. Indeed, if the top-level revision S we use with R is not a common ancestor of revisions which use R or its descendants (but the mapping is otherwise reasonable), then R will have a descendant R' whose associated top-level revision S' is not a descendant of S. In other words, the history derived from the subrepo will have confusing/nonsensical jumps between revisions of the top-level tree.

Now, if we want to choose a common ancestor, the lowest one makes sense from a standpoint of making these revisions something that can be checked-out, built, and tested, and from a standpoint of giving a reasonable idea what the state of the top-level repo (and other subrepos) was at the time the changes in the subrepo were made. The root of the whole top-level DAG would of course also work, but it would not give meaningful, usable revisions that could be checked out; choosing the root would be equivalent (from a usability standpoint) to a naive repo-merge that has one root per subrepo and just merges from the subrepo histories whenever the top-level repo updates the revisions it's using.
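
For readers who prefer symbols, the choice of T(R) argued for above can be written down as follows, for one fixed subrepo. The names σ, D, U and CA are notation introduced here purely for illustration, and the definition assumes that every subrepo revision is, or has a descendant that is, referenced by at least one top-level revision.

```latex
% \sigma(S):    subrepo revision recorded in .hgsubstate at top-level revision S
% D(R):         R together with all of its descendants in the subrepo DAG
% A \preceq S:  A is S itself or an ancestor of S in the top-level DAG
\begin{align*}
U(R)  &= \{\, S \mid \sigma(S) \in D(R) \,\} \\
CA(R) &= \{\, A \mid A \preceq S \ \text{for all } S \in U(R) \,\} \\
T(R)  &\in \max\nolimits_{\preceq} CA(R)
        \quad \text{(the maximal elements, i.e. the lowest common ancestors)}
\end{align*}
% If R' is a descendant of R, then D(R') \subseteq D(R), hence
% U(R') \subseteq U(R) and CA(R) \subseteq CA(R').
% When the lowest common ancestor is unique, this gives T(R) \preceq T(R').
```

The last remark is the property the argument relies on: walking down the subrepo history never makes the associated top-level revision jump to something that is not a descendant of the previous one.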

So, if we can use the LCA to assign a top-level revision T(R) to each subrepo revision R, how does that translate into the conversion?

Whenever a subrepo revision R has T(R) distinct from T(P) for each parent P of R, it's effectively merging new changes from the top-level repo (and other subrepos) into the subrepo history. The conversion should represent this as two commits:

(1) The actual subrepo commit R, using an old top-level revision. If R has a single parent P (not a merge commit), this will be T(P). If R had multiple parents, it's not clear whether there's a perfect choice of which one to use, but T(P) for any parent P should be reasonable.
(2) A merge commit merging back the conversion C(T(R)) of the top-level-repo commit T(R) associated with R, where C(T(R)) itself just merged (1) above.

Aside from C(T(R)), which references (1) as a merge parent, all other references to R in the conversion should use (2). This includes the conversions of any descendants of T(R) in the top-level repo which use revision R of this subrepo, and the conversions of direct children of R itself.

I believe the above (albeit poorly worded) description specifies all that's needed for merging the top-level and subrepo DAGs. Each subrepo revision gets a full version of the tree, and ends up connected into a unified DAG for the converted repo via "merge commits" (when the subrepo merges a new associated top-level revision, and when the top-level merges subrepo revisions that have changed).

The final step of producing the git repo, then, is simply replaying the merged DAG, either in topologically sorted form or via a depth-first walk, such that each git commit-tree already has all the parent revisions it needs present.
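
One way to get such an ordering without writing a graph walk by hand is the standard tsort utility. The sketch below assumes the merged DAG has been written to a file of "parent child" pairs, one per line. The file layout and the function names replay_order and replay are hypothetical, and the echo is only a placeholder for the real git commit-tree call.

```shell
# Print the nodes of the merged DAG so that every parent comes before its
# children. The edges file holds one "parent child" pair per line; a node
# with no edges at all can be listed as "node node".
replay_order() {
    tsort "$1"
}

# Walk the DAG in that order and show which parents each commit would get.
replay() {
    replay_order "$1" | while read -r node; do
        parents=$(awk -v n="$node" '$2 == n && $1 != n { print $1 }' "$1" | tr '\n' ' ')
        echo "commit $node parents: $parents"   # placeholder for git commit-tree
    done
}
```

Because parents are always emitted first, a map from node to new git commit id can be filled in as the loop runs and consulted for each later commit.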

Attar answered 15/8, 2016 at 19:19 Comment(0)

This is what I did to solve a similar problem:

Convert each Mercurial repository with fast-export.
Add the converted sub-repositories as remotes in the parent repo.
In the parent repo, use git checkout -b to create a named branch for each subrepo.
Run git read-tree --prefix=pathsubrepo/ -u subrepobranch for each subrepo.

This is more or less what I did in a bit more detail (adapted from bash history... but not actually run)

Step 1

cd ~
git clone git://repo.or.cz/fast-export.git
git init parent_repo
cd parent_repo
~/fast-export/hg-fast-export.sh -r /path/to/old/mercurial/parent
git checkout HEAD
cd ~
git init subrepo1
cd subrepo1
~/fast-export/hg-fast-export.sh -r /path/to/old/mercurial/parent/subrepo1
git checkout HEAD
cd ~
git init subrepo2
cd subrepo2
~/fast-export/hg-fast-export.sh -r /path/to/old/mercurial/parent/subrepo2
git checkout HEAD

Step 2

cd ~/parent_repo
# -f fetches right away, so sub1/master and sub2/master exist for the next step
git remote add -f sub1 "$HOME/subrepo1/"
git remote add -f sub2 "$HOME/subrepo2/"

Step 3

cd ~/parent_repo
git checkout -b sub1master sub1/master
git checkout -b sub2master sub2/master

Step 4

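cd ~/parent_repo
git checkout master    # step 3 left HEAD on sub2master; go back to the parent's branch
git read-tree --prefix=subrepo1/ -u sub1master
git read-tree --prefix=subrepo2/ -u sub2master
git commit -m "Import subrepo1 and subrepo2"    # read-tree only stages the trees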

Once done, you can git branch -D sub1master and git branch -D sub2master since you don't need them anymore.

Thoth answered 19/5, 2016 at 16:8 Comment(1)

I don't see how this works. As soon as you convert the repos to git without merging them first, you've lost all information about which subrepo revisions are associated with a top-level revision, since the .hgsubstate file references hg revisions, not git revisions. Unless of course you keep a mapping between the two -- but I don't see that anywhere in the procedure you described. – Attar 19/5, 2016 at 21:3


Try Facebook's Hg<->Git converter: FbShipIt. Most of what you described should work well with this commit converter tool, which copies the commits between Mercurial and Git.

FbShipIt has a caveat: it doesn't understand merge commits, but it can be worked around via git rebase.

Candiscandle answered 20/5, 2016 at 17:20 Comment(1)

I don't see any indication that this answer is specific to the question. Nothing about merging subrepos (the hard part), and while the question indicates there are merge commits, the answer offers git rebase as a workaround for the tool not supporting merge commits, despite the fact that the source repo is hg, not git, and thus git tools would be useless to resolve issues incompatible with the conversion tool before doing the conversion. – Attar 20/5, 2016 at 17:58
