Multiple copies of the same repository on a machine

I have a local computer on which there are multiple copies of the same Git repository, one copy per user. So it may look like this:

/home/userA/BigRepository
/home/userB/BigRepository
/home/userC/BigRepository
/home/userD/BigRepository
/home/userE/BigRepository

Let's say each repository uses ~2-3 GB; 20 users will then hold 40-60 GB of unnecessarily redundant data. Users may work on their private branches developing something, but the majority of the data remains redundant. That's why I'd like to optimize disk usage.

I wonder what would be the best way to approach it.


What I've checked already:

  • git clone --local - Each repository will share .git/objects with a bare repository, but this implies the bare repository must be available locally (so it can't be GitHub, right?)
  • git clone --depth <n> - which will reduce the size of the repo, but also reduce local history to n objects.
  • git clone --shallow-since - as I understand it, this will work similarly to the --depth option but will store commits since the specified time.
  • git clone --separate-dir - my own idea: use the same place to store all .git directories, so each of the 20 repositories would link to the same place when making a clone. (I don't know yet if this is possible; I'm just sharing ideas.)

Will --depth imply that repositories will have at most n commits, or is it checked only when cloning, and then the repository can grow with time?

Bottomry answered 29/1, 2020 at 7:9

Comment: I suppose it boils down to what you're using those repositories for, and how users are using the machine (and thus the repos). If it's for active development, keeping copies per-user is appropriate. If it's to compile and use the output, you're better off doing that once and installing the application to /usr/bin to make it available to all users. – Indiscreet

  • git clone --local - Each repository will share .git/objects with a bare repository, but this implies the bare repository must be available locally (so it can't be GitHub, right?)

Not really right, no. You can use this with any local clone, bare or not. But in general, in cases where this works at all, you don't need --local either: you can just clone from a local path name.

For instance, suppose userA, whose home directory is /home/userA, clones the GitHub repository, making a full and non-bare clone. Suppose further that userB can read from /home/userA. User B can therefore do:

git clone /home/userA/BigRepository

to create ./BigRepository. If he does this in his home directory, he winds up with /home/userB/BigRepository, which contains all the same commits as userA's clone.

Because Git will make hard links, if user A now removes his repository, he does not regain his space (so if disk quotas are in effect, user A does not get his quota back). User B still has links to the files owned by user A. Everything still works; it's just that whoever made that first clone has "paid for" the initial storage for the repository proper.

(User B "pays for" his own work-tree. He shares the .git/objects files, including pack files, that user A created. Those files are all read-only, at all times, whether or not user B is sharing user A's files, so the fact that user B can't write to those files is unimportant.)

The one drawback to this process, which is quite small, is that user B will probably want to change his origin URL to point to the GitHub repository rather than to user A's clone; until he does so, he will not see the same set of remote-tracking names (origin/* names) that user A sees.
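
A minimal sketch of that fixup, where the GitHub URL shown is a placeholder for the real one:

# point origin back at GitHub instead of at user A's clone
git remote set-url origin https://github.com/example/BigRepository.git
# refresh the origin/* remote-tracking names from the new origin
git fetch origin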

User C can repeat this process with either preceding repository.

  • git clone --depth <n> - which will reduce the size of the repo, but also reduce local history to n objects.

Mostly, yes. Technically wrong in terms of the number n though:

Will --depth imply that repositories will have at most n commits, or is it checked only when cloning, and then the repository can grow with time?

They not only grow over time; the number n also does not mean what you're suggesting. It is the depth, not the number of commits. Depth in this case is a technical term referring to graph traversal.

Remember that Git uses the commit as its basic storage unit. (Commits can be broken down further, but for our purpose here they're the unit.) Each commit has a unique hash ID, and can be represented as a node or vertex in a graph. Each commit also stores the hash ID of its immediate predecessor commit(s): these form one-way edges or arcs linking the nodes, and hence form the rest of the graph.

We can draw bits of the graph like this:

... <-F <-G <-H

where each letter stands in for a commit hash ID. The stored hash IDs in each commit act as pointers to earlier commits. To find the end of this chain easily, we—or Git—establish a branch name, or some other form of name, that points to the last commit in the chain:

...--F--G--H   <-- master

(where we get lazy and draw the connecting arcs as lines, for the simple reason that no commit can ever be changed, so it doesn't really matter at this point which way the arrows go—though at other times, it's important to remember that they inherently point backwards, which forces Git to work backwards at all times).
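
You can view this same structure in any real repository with a graph-aware log:

git log --graph --oneline --all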

Now, a graph with these kinds of backwards-pointing arrows can have forks and joins in it:

          o--o         o--o--H   <-- branch1
         /    \       /
...--o--o--o---o--o--o--o--K   <-- branch2
         \          /
          o--o--o--o

When we traverse this graph, we start at an end—in normal graphs we start at a start, but Git works backwards—like commit H, as pointed-to by name branch1. If we choose --depth 3, Git will pick up H and two earlier commits, and K and two earlier commits:

          o--o--H   <-- branch1
         /
<snip>--o--o--K   <-- branch2

Our --depth 3 got six commits, because going back 3 from each end got us these commits out of the full graph. If we go to --depth 4 we get:

               o--o--H   <-- branch1
              /
  <snip>--o--o--o--K   <-- branch2
         /
<snip>--o

Each of these "snip" points represents a shallow graft, where we know that there were more commits, but we've deliberately omitted those commits. The hash IDs of the omitted commits get written to .git/shallow and Git knows, when it visits a commit whose parents are listed in .git/shallow, not to try to find the parent commits.

The --depth argument chooses the snip points. This happens at the time of the git fetch: git clone is a fancy six-part wrapper that includes git fetch as its fifth step. The snip points remain where they are unless and until you run a git fetch with a specific argument to deepen the repository, or to make it even shallower. New commits get added in the usual way and make the graph deeper, including commits brought in by any git fetch operations that any of the users run.
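
As a sketch of the mechanics, with a placeholder URL:

git clone --depth 3 https://github.com/example/BigRepository.git
cd BigRepository
cat .git/shallow          # the hash IDs of the snip points
git fetch --deepen=10     # later: push the snip points further back
git fetch --unshallow     # or: remove the snip points entirely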

  • git clone --shallow-since - as I understand it will work similarly to --depth option but will store commits since the specified time.

Yes: it's just a more useful, as well as less-confusing, way to set the "snip" points.
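
For instance, again with a placeholder URL:

git clone --shallow-since=2020-01-01 https://github.com/example/BigRepository.git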

  • git clone --separate-dir

You mean --separate-git-dir. There is no real point to this: the directory you specify here gets created and filled by the clone operation. If combined with any of the earlier options, that would help reduce space needed, but otherwise it just separates the work-tree from the repository proper.

In a standard setup, the repository itself appears in the work-tree in a subdirectory named .git. With --separate-git-dir, .git still appears in the work-tree, but this time it is a file containing the path in which the repository is kept. Either way, each user pays the storage cost independently, unless using --local as implied by cloning some other user's repository.
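
A small sketch of what --separate-git-dir actually does; all paths here are hypothetical:

git clone --separate-git-dir=/srv/gitdirs/BigRepository.git \
    https://github.com/example/BigRepository.git /home/userB/BigRepository
cat /home/userB/BigRepository/.git
# prints: gitdir: /srv/gitdirs/BigRepository.git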

It's important that each user have his own actual repository

If and when user A makes a new commit, his Git must write one or more new objects into his .git/objects. (Since a commit is always unique, the operation needs to at least write that object. It probably also needs to write some tree objects, and to get to this point, Git probably had to create some blob objects.)

Meanwhile, if and when user B makes a new commit, his Git must write one or more new objects into his .git/objects. If users A and B literally share the Git repository, then A and B must have write permission on the other users' files and directories. This mode can be made to work, but it has an additional drawback: each user must be very careful not to step on the other users by accident. While the bulk of a repository—including the proposed-to-be-shared .git/objects parts—consists of objects that are never changed once written, other parts, including the special file .git/HEAD and numerous other files such as branch head data and reflogs, must be private to each user, or else—and this alternative is generally unworkable—only one user can be doing any real work at any time.
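
To see this object traffic concretely, here is a small demonstration you can run in any clone (the file name is arbitrary):

echo demo >> file.txt
git add file.txt
git commit -m "demo commit"
git count-objects -v      # the loose-object count rises: new blob, tree, and commit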

In theory, git worktree add could be used here

However, it's not designed for this kind of use. You can experiment with it if you like: add a work-tree for each user, then give that user permission on all the files associated with that user (the extra files are in subdirectories within .git).
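
If you do want to experiment, a sketch of the setup might look like this, with hypothetical paths and branch name, and with the permission-granting step left out:

cd /home/userA/BigRepository
git worktree add -b userB-branch /home/userB/BigRepository
# user B's private files (HEAD, index, and so on) now live under
# /home/userA/BigRepository/.git/worktrees/BigRepository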

The thing that is designed for this is --reference

What is designed for dealing with this is the --reference option. Using --reference, you, as the administrator of the machine, would first make a full clone of the GitHub repository. You could make this --bare or not—it's not really important—but you might want to make it a --mirror clone so that it gets every ref and can be updated more easily. (I experimented with this a bit at a previous job, and there are some issues here that make updating it tricky, so this might not be as useful as you would think at first.)
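
For instance, the administrator's one-time setup might be, with a placeholder path and URL:

git clone --mirror https://github.com/example/BigRepository.git /srv/reference/BigRepository.git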

Once this "reference clone" exists, each user can do:

git clone --reference <path> <github-url>

Their Git will contact the Git at GitHub and get from it the information they would need to make a full clone. But then, instead of actually making a full clone, their Git checks the reference clone to see if it already has the objects they want. Whenever and wherever the reference clone already has those objects, their Git will merely use those existing objects, in that existing reference clone.

What this means is that the git clone itself goes very fast, and uses almost no additional disk space. It may take a few minutes or even a few hours to make the original ~3GB reference clone, but when one of the users does this git clone --reference operation, it should finish in seconds. Moreover, it works "cleanly" in that if there are new objects they need from GitHub, they just get them from GitHub as usual. Because no commit—no Git object of any kind, really—can ever be changed, the reference clone merely serves to provide all the objects that you put in it initially. New objects gradually expand each user's repository.
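
Under the hood, such a clone simply records the borrowed object store in an alternates file, which you can inspect:

cat BigRepository/.git/objects/info/alternates
# prints the path to the reference clone's objects directory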

(You can, in the future, update the reference clone. The individual users can then re-clone to reduce their disk usage. The tricky part here is that you must make sure that no object, and no pack file, disappear from the reference clone between the time you update it and the time they re-clone. You could instead just make a new reference clone, wait until all users have re-cloned the new reference clone, then delete the original reference, to avoid this trickiness.)

Scandura answered 29/1, 2020 at 8:4

You can try to symlink the .git directory from one location into all the other workspaces:

git clone git@server:BigRepository /home/userA/BigRepository
mkdir /home/userB/BigRepository/
ln -s /home/userA/BigRepository/.git /home/userB/BigRepository/.git

However, everybody will be changing everybody else's branches, i.e. your master branch might unexpectedly move. Your work-tree will not change, so your files behave as expected, but Git will suddenly report differences, because the shared branches no longer match what is checked out.

Lydia answered 29/1, 2020 at 7:21
