When to use git subtree?

What problem does git subtree solve? When and why should I use that feature?

I've read that it is used for repository separation. But why would I not just create two independent repositories instead of sticking two unrelated ones into one?

This GitHub tutorial explains how to perform Git subtree merges.

I kind of know how to use it, but not when (use cases) and why, and how it relates to git submodule. I'd use submodules when I have a dependency on another project or library.

Micah answered 4/9, 2015 at 22:54 Comment(4)
"repository separation" != "unrelated repositories" think dependencies in your repo and you don't want to use submodules (for some reason, maybe you don't like that they're not transparent and that the paths in the commits in the submodule don't match your path in the main git repo).Joejoeann
@cyphar: Are you saying that both submodule and subtree are more or less achieving the same goal which is incorporating related projects and that the only difference is that submodule might be a bit less transparent and updating submodules is a two step operation and that the drawback of subtree is that commit messages will be all mixed up between the two projects?Micah
Well, it's not really a drawback in certain cases. For example, if you need to bisect a repository that has subtrees and a bug was introduced in a dependency, you'll find the exact commit in the subtree that introduced the bug. With submodules, you'll only find that the commit that rev'd the submodule causes the bug and you're sort of SOL if you want to quickly find which commit in a submodule causes a bug in your main project.Joejoeann
Here's an article that compares git subtree and git submodule with practical examples nering.dev/2016/git-submodules-vs-subtreesRevenant
L
95

You should be careful to note explicitly what you are talking about when you use the term 'subtree' in the context of git as there are actually two separate but related topics here:

git-subtree and git subtree merge strategy.

The TL;DR

Both subtree-related concepts effectively allow you to manage multiple repositories in one. This is in contrast to git-submodule, where only metadata is stored in the root repository (in the form of .gitmodules) and you must manage the external repositories separately.

More Details

The git subtree merge strategy is basically the more manual method; it uses the plain Git commands from the tutorial you referenced.
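As a rough illustration of that manual method, here is a minimal sketch that follows the general shape of the Git "subtree merge" howto. The repository names (`lib`, `app`), the remote name `libremote`, the prefix `vendor/lib` and the throwaway identity are all made up for the example. For the later update it uses the explicit `-X subtree=<path>` option rather than `-s subtree`, which would try to guess the prefix instead.

```shell
#!/bin/sh
set -e
# Throwaway identity and workspace so the sketch does not depend on local config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Hypothetical dependency "lib" and super project "app", one commit each.
for name in lib app; do
    git init -q "$work/$name"
    git -C "$work/$name" symbolic-ref HEAD refs/heads/main
    echo "v1" > "$work/$name/$name.txt"
    git -C "$work/$name" add .
    git -C "$work/$name" commit -q -m "$name: v1"
done

cd "$work/app"
# 1. Add the dependency as a remote and fetch it right away (-f).
git remote add -f libremote "$work/lib"
# 2. Record a merge of its history without touching our files yet.
git merge -s ours --no-commit --allow-unrelated-histories libremote/main
# 3. Read lib's tree into the index and working tree under vendor/lib/.
git read-tree --prefix=vendor/lib/ -u libremote/main
git commit -q -m "Merge lib as vendor/lib"

# Upstream moves on...
echo "v2" > "$work/lib/lib.txt"
git -C "$work/lib" commit -q -am "lib: v2"

# 4. Later updates: fetch, then merge with the tree shifted to the prefix.
git fetch -q libremote
git merge -q --no-edit -X subtree=vendor/lib libremote/main
cat vendor/lib/lib.txt
```

After step 3 the dependency's files are ordinary tracked files of `app`, which is why anyone cloning `app` gets them without extra commands.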

git-subtree is a wrapper shell script that provides a more natural syntax. It is actually still part of contrib and is not fully integrated into git with the usual man pages. The documentation is instead stored alongside the script.

Here is the usage info:

NAME
----
git-subtree - Merge subtrees together and split repository into subtrees


SYNOPSIS
--------
[verse]
'git subtree' add   -P <prefix> <commit>
'git subtree' add   -P <prefix> <repository> <ref>
'git subtree' pull  -P <prefix> <repository> <ref>
'git subtree' push  -P <prefix> <repository> <ref>
'git subtree' merge -P <prefix> <commit>
'git subtree' split -P <prefix> [OPTIONS] [<commit>]
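To make the synopsis above a bit more concrete, here is a small sketch of `add` and `pull` against throwaway repositories. The names `lib`, `app` and the prefix `vendor/lib` are hypothetical, and `--squash` is just one common choice, not a requirement. Because git-subtree lives in contrib, it may not be installed with every Git, so the sketch checks for the script first.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_MERGE_AUTOEDIT=no   # never open an editor for merge commits
work=$(mktemp -d)

# Hypothetical dependency "lib" and super project "app", one commit each.
for name in lib app; do
    git init -q "$work/$name"
    git -C "$work/$name" symbolic-ref HEAD refs/heads/main
    echo "$name v1" > "$work/$name/$name.txt"
    git -C "$work/$name" add .
    git -C "$work/$name" commit -q -m "$name: v1"
done

cd "$work/app"
if [ -e "$(git --exec-path)/git-subtree" ]; then
    have_subtree=yes
    # add: import lib's main branch under vendor/lib/ as one squashed commit.
    git subtree add --prefix=vendor/lib --squash "$work/lib" main

    # Upstream moves on...
    echo "lib v2" > "$work/lib/lib.txt"
    git -C "$work/lib" commit -q -am "lib: v2"

    # pull: fetch the new upstream state and merge it into vendor/lib/.
    git subtree pull --prefix=vendor/lib --squash "$work/lib" main
    cat vendor/lib/lib.txt
else
    have_subtree=no
    echo "git-subtree is not installed with this Git" >&2
fi
```

Note that the repository URL has to be repeated on every `pull`; nothing in the super project remembers it unless you add a remote yourself.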

I have come across a pretty good number of resources on the subject of subtrees, as I was planning on writing a blog post of my own. I will update this post if I do, but for now here is some relevant information to the question at hand:

Much of what you are seeking can be found on this Atlassian blog by Nicola Paolucci; the relevant section is quoted below:

Why use subtree instead of submodule?

There are several reasons why you might find subtree better to use:

  • Management of a simple workflow is easy.
  • Older versions of git are supported (even before v1.5.2).
  • The sub-project’s code is available right after the clone of the super project is done.
  • subtree does not require users of your repository to learn anything new; they can ignore the fact that you are using subtree to manage dependencies.
  • subtree does not add new metadata files like submodules do (i.e. .gitmodules).
  • Contents of the module can be modified without having a separate repository copy of the dependency somewhere else.

In my opinion the drawbacks are acceptable:

  • You must learn about a new merge strategy (i.e. subtree).
  • Contributing code back upstream for the sub-projects is slightly more complicated.
  • The responsibility of not mixing super and sub-project code in commits lies with you.

I would agree with much of this as well. I would recommend checking out the article as it goes over some common usage.

You may have noticed that he has also written a follow up here where he mentions an important detail that is left off with this approach...

git-subtree currently fails to include the remote!

This shortsightedness is probably due to the fact that people often add a remote manually when dealing with subtrees, but that isn't stored in git either. The author details a patch he has written to add this metadata to the commit that git-subtree already generates. Until this makes it into the official git mainline, you could do something similar by modifying the commit message or storing it in another commit.

I also find this blog post very informative. The author adds a third subtree method, which he calls git-stree, to the mix. The article is worth a read, as he does a pretty good job of comparing the three approaches. He gives his personal opinion of what he does and doesn't like, and explains why he created the third approach.

Closing Thoughts

This topic shows both the power of git and the segmentation that can occur when a feature just misses the mark.

I personally have developed a distaste for git-submodule, as I find it more confusing for contributors to understand. I also prefer to keep ALL of my dependencies managed within my projects, to facilitate an easily reproducible environment without having to manage multiple repositories. git-submodule, however, is currently much better known, so it is good to be aware of it, and depending on your audience that may sway your decision.

Leggy answered 7/11, 2015 at 4:16 Comment(1)
git-stree's author says in the same blog post that he's in favour of git-subrepo since 2016.Piranha
V
17

First off: I believe your question tends to get strongly opinionated answers and may be considered off-topic here. However, I don't like that SO policy and would push the border of being on-topic a bit outward, so I'd like to answer anyway and hope others do as well.

On the GitHub tutorial that you pointed to there's a link to How to use the subtree merge strategy which gives a viewpoint on advantages/disadvantages:

Comparing subtree merge with submodules

The benefit of using subtree merge is that it requires less administrative burden from the users of your repository. It works with older (before Git v1.5.2) clients and you have the code right after clone.

However if you use submodules then you can choose not to transfer the submodule objects. This may be a problem with the subtree merge.

Also, in case you make changes to the other project, it is easier to submit changes if you just use submodules.

Here's my viewpoint based on the above:

I often work with folks (= committers) who are not regular git users; some still struggle (and forever will) with version control. Educating them about how to use the subtree merge strategy is basically impossible. It involves the concepts of additional remotes, merging, and branches, and then mixing it all into one workflow. Pulling from upstream and pushing upstream is a two-stage process. Since branches are difficult for them to understand, this is all hopeless.

With submodules it's still too complicated for them (sigh) but it is easier to understand: It's just a repo within a repo (they are familiar with hierarchy) and you can do your pushing and pulling as usual.

Providing simple wrapper scripts is easier imho for the submodule workflow.

For large super-repos with many sub-repos the point of choosing not to clone data of some sub-repos is an important advantage of the submodules. We can limit this based on work requirements and disk space usage.

Access control might be different. Haven't had this issue yet, but if different repos require different access controls, effectively banning some users from some sub-repos, I wonder if that's easier to accomplish with the submodule approach.

Personally I'm undecided what to use myself. So I share your confusion :o]

Vig answered 5/9, 2015 at 12:37 Comment(4)
This answer being the most strongly opinionated one I have seen, despite the contradiction, as it is the only answer, and self fulfilling prophecy. The exasperated sigh, the doomsayer attitude on others ability to learn, this is a very arrogant answer. Your opinion on policy probably belongs on Meta where it could be helpful. The answer itself, outside of the self-serving fluff, is pretty good though.Peace
@vgoff: Your critique is correct. Sorry for being seemingly arrogant - it's just >15 years of work experience with people who've been trained over that time by different people in different version control systems and still copy text files to multitudes of .backup.<timestamp>. I think I made it clear at the start it's going to be opinionated. Others hopefully are able to provide a more factual insight, and I am surprised no one has yet.Vig
I still don't get it. Are you saying that submodule is the deprecated old way of incorporating used libraries and subtree is the new shiny way?Micah
No. The docs at least don't mention that any of the two is deprecated. And to me the docs have the final say (except for bugs). It's just two different workflows to accomplish a similar thing. Both have advantages and disadvantages. To me the fact that no one of the git gurus has answered yet is a confirmation that to the expert the differences are negligible. Most probably use the subtree merge strategy because it's the one that was implemented earlier and people are familiar with read-tree (and branching/merging/remotes anyway). submodules was added onVig
W
9

A real use case that we have where git subtree was a salvation:

The main product of our company is highly modular and is developed as several projects in separate repositories. Each module has its own roadmap. The whole product is composed of specific versions of all the modules.

In parallel, a specific version of the whole product is customized for each of our clients, with separate branches for each module. Sometimes a customization has to be made in several projects at once (cross-module customization).

To give the customized product its own life cycle (maintenance and feature branches), we introduced git subtree. We have one git-subtree repository for all customized modules. Every day, our customizations are pushed back with 'git subtree push' to customization branches in the original repositories.

This way we avoid managing many repos and many branches. git-subtree increased our productivity several times over!

UPDATE

More details about the solution, originally posted in the comments:

We created a brand new repository. Then we added each project that had a client branch to that new repo as a subtree. A Jenkins job regularly pushed changes from master back to the client branch of each original repository. We worked only in the "client repo", using a typical git flow with feature and maintenance branches.
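A setup along those lines could look roughly like the sketch below. It is only a guess at the shape of such a workflow, not the actual configuration described in this answer: the module names `modA` and `modB`, the branch name `client-x`, and the local paths are invented, and the final loop stands in for whatever the scheduled job really ran. It also assumes git-subtree is installed and checks for it first.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_MERGE_AUTOEDIT=no
work=$(mktemp -d)

# Two hypothetical product modules, each in its own repository.
for mod in modA modB; do
    git init -q "$work/$mod"
    git -C "$work/$mod" symbolic-ref HEAD refs/heads/main
    echo "$mod core" > "$work/$mod/$mod.txt"
    git -C "$work/$mod" add .
    git -C "$work/$mod" commit -q -m "$mod: core"
done

# Brand new "client" repository holding the build scripts and all modules.
git init -q "$work/client"
cd "$work/client"
git symbolic-ref HEAD refs/heads/main
echo "build script placeholder" > build.sh
git add build.sh
git commit -q -m "client: build scripts"

if [ -e "$(git --exec-path)/git-subtree" ]; then
    have_subtree=yes
    # Import every module, with its history, under a directory of the same name.
    for mod in modA modB; do
        git subtree add --prefix="$mod" "$work/$mod" main
    done

    # One cross-module customization, committed once in the client repo.
    echo "client tweak" >> modA/modA.txt
    echo "client tweak" >> modB/modB.txt
    git commit -q -am "Customize modA and modB for the client"

    # Scheduled job stand-in: publish each module's share of the history
    # to a customization branch in that module's original repository.
    for mod in modA modB; do
        git subtree push --prefix="$mod" "$work/$mod" client-x
    done
    git -C "$work/modA" log --oneline client-x
else
    have_subtree=no
    echo "git-subtree is not installed with this Git" >&2
fi
```

The single cross-module commit ends up as one commit per module on each `client-x` branch, each containing only that module's files.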

Our 'client' repo also contained build scripts, which we adapted for this particular client.

However, the solution has a pitfall.

As we drifted farther and farther from the core development of the product, upgrading that particular client became more and more difficult. In our case that was acceptable, because the project had already diverged far from the main path before we introduced subtree, so subtree at least brought order and made it possible to introduce the default git flow.

Wench answered 18/11, 2015 at 23:33 Comment(6)
Marek, I'm faced with what sounds like the same situation and I'm relatively new to git and floundering in the possibilities. I would like to know more about your setup.Mcleroy
I created a brand new repository. Then I added each project that had client branch to that repo as subtree. We had a jenkins job that was pushing back changes to original repositories to client branch. On our client repo we were working normally on master with feature, maintenance branches.Wench
The pitfall was that we were going farther and farther from the main core development of product. So the possible upgrade for that particular client was more and more difficult. In our case it was ok as the state of project before subtree had been already far a way of main path, so the subtree introduce at least order and possibility to introduce default git flow.Wench
One more thing that our 'client' repo had also building scripts that we were also adapting for this particular client.Wench
Excellent example! I wish I could give you more upvotes.Defant
I'd like to recommend you incorporate your additional info from the comments into your answer; they definitely make this a better answer.Termitarium
U
9

Basically, git-subtree is an alternative to the git-submodule approach. Submodules have several drawbacks, or rather, you need to be very careful when using them. For example, say you have a repo "one" and inside it you have added another repo "two" as a submodule. Things you need to take care of:

  • When you change something in "two", you need to commit and push inside "two"; if you are in the top-level directory (i.e. in "one"), your changes won't be highlighted.

  • When another user clones your "one" repo, that user must then update the submodules to get the "two" repo.

These are some of the points and for better understanding I would recommend you to watch this video: https://www.youtube.com/watch?v=UQvXst5I41I
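The two submodule caveats listed above can be reproduced with throwaway repositories. This is only a sketch: the repositories are local directories created for the demo, and `-c protocol.file.allow=always` is there because recent Git versions refuse to clone submodules from local paths by default.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Repos "one" (super project) and "two" (dependency), one commit each.
for name in one two; do
    git init -q "$work/$name"
    git -C "$work/$name" symbolic-ref HEAD refs/heads/main
    echo "$name" > "$work/$name/$name.txt"
    git -C "$work/$name" add .
    git -C "$work/$name" commit -q -m "$name: initial"
done

cd "$work/one"
git -c protocol.file.allow=always submodule add "$work/two" two
git commit -q -m "Add two as a submodule"

# Caveat 1: an edit inside "two" shows up in "one" only as a dirty submodule
# entry, not as a changed file; it has to be committed inside "two".
echo "local edit" >> two/two.txt
status_out=$(git status --short)
echo "$status_out"
git -C two checkout -q -- two.txt   # discard the edit again

# Caveat 2: a plain clone of "one" leaves the submodule directory empty
# until the submodules are initialised and updated.
git clone -q "$work/one" "$work/clone"
files_before=$(ls -A "$work/clone/two")
git -C "$work/clone" -c protocol.file.allow=always submodule update --init
files_after=$(ls -A "$work/clone/two")
echo "before: '$files_before'  after: '$files_after'"
```

Cloning with `git clone --recurse-submodules` performs the second step automatically, but every user still has to know to ask for it.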

  • The subtree approach was invented to overcome such problems. For the basics of git-subtree, have a look at this: https://www.youtube.com/watch?v=t3Qhon7burE

  • I find the subtree approach more reliable and practical compared to submodules :) (though I am very much a beginner to be saying such things)

Cheers!

Unscrew answered 7/5, 2017 at 21:11 Comment(0)
F
4

To add to the above answers, an additional drawback of using subtree is the repo size compared to submodules.

I don't have any real-world metrics, but each time a push is made on a module, every repository that uses that module gets its own copy of the same change once the subtree is updated there.

So if a code base is heavily modularised, that will add up quite quickly.

However, given storage prices are always coming down, that may not be a significant factor.

Fogg answered 7/6, 2018 at 1:54 Comment(2)
Storage is not the problem. Entropy is the problem! For example, you have 1000 tools which are of 10KB to 100KB each sharing a common codebase of say 35 GB (because it contains vast number of modules from different sources). With submodules you transfer around 36 GB for all, but probably over 1 TB with git subtree! Also note that submodule has a clearly unfair advantage if it comes to git gc and ZFS dedup (object packs). Hence AFAICS smaller codebases (repo size wise not repo count wise) should go with submodules, bigger ones with monorepo. I did not find any use for subtree yet.Landaulet
@tino Git will dedup subtrees with common code just fine. I just ran some experiments to confirm. For the checked out code, you'd need to run something like ZFS. But submodules are no different.Guevara
W
1

I had to stop using submodules because a single mistake screws up your repo for yourself and for others, and it's hard to fix ("gee, which submodule commit was the one that matches the main code, and where did I forget to check that one in?"). I arrived at the conclusion that submodules would be very easy to use if git simply defaulted to always checking out the entire codebase, including any submodules, and to committing and pushing the entire codebase, including submodules, unless specifically told not to. It should treat the entire set of nested repos as one big virtual repo by default (why are you including a submodule if it's not essential to the project?) unless you explicitly need to do something different, which is likely the exception. Optimizing for size, which is what git currently does, is a bit dated in a world where a terabyte of disk space costs $10 and a lot of people have gigabit Internet.

Winnebago answered 10/11, 2023 at 17:35 Comment(0)
