What is a good workflow for git-annex?
Our development team has been using git for version control, with git-annex to store large binary files (data binaries, images, test binaries, etc.). Although we have been able to set it up and use it, we have had our share of troubles.

A common action that we frequently perform that has given us trouble is:

  1. Developer 1 adds tests for a new feature, along with the corresponding data for the tests, using git-annex.

    git add <test-file>
    git annex add <data-file>
    git annex copy <data-file> --to=<remote>   # our remote is S3, if that is relevant
    git commit -m 'Tests with data'
    git push
    git annex sync
    
  2. The work is reviewed and merged. (We host on GitHub and follow a forking model: all work is done by a developer on their own fork and merged into the main repository through pull requests.)

  3. Developer 2 fetches/merges from upstream and tries to run the tests on their machine.

    git fetch upstream
    git merge upstream/<branch>
    git annex sync
    git annex get
    

We often end up with the test data either not tracked in git at all, or tracked but failing to download from the remote location.
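When this happens, a few standard git-annex query commands show whether a file is annexed at all and which repositories hold its content (file names are placeholders):

    # Which repositories hold this file's content?
    git annex whereis <data-file>

    # Which remotes does this clone know about?
    git annex info

    # Show content presence per repository.
    git annex list <data-file>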

What is a good way to use git-annex in our workflow?

As an aside, what are other options that might make such a workflow better/easier to manage?

Microscope answered 20/12, 2014 at 17:12

Comments:

Maybe explain "We often end up with the test data either not being tracked in git or unable to be downloaded from the remote location" better. What causes the problem? People forgetting to use git-annex? S3 not being available? Something else? – Lassitude

In your use case you probably often didn't use git add to track the files, and you also didn't use git annex sync --content to sync the file contents themselves; you only synced the metadata. – Pyrethrin

OK, here we go:

Manual git-annex v6 use:

On both Server1 and Server2:

mkdir testdata
cd testdata
git init
# Give each location a descriptive name of its own.
git annex init "LocationNameIdentifier"
# Upgrade the repository to the v6 format, which supports unlocked files.
git annex upgrade
# Point each server at the other one.
git remote add OtherServerLocationNameIdentifier ssh://otherserver.com/thedir
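Once the remotes are added, it is worth confirming that each side can see the other. These are standard commands; the names in the output will be whatever you chose above:

git remote -v
git annex info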

When this setup is ready and there are no extra files in the directory, you can run

git annex sync --content

on both locations. If there are already files in either location, you first need to run

git add --all

in both locations to track the current files as so-called unlocked files.

After

git annex sync --content

has run on both locations a few times (say, three), everything is merged. From then on you can run git annex sync --content from cron in both locations, and both will keep the same files in their worktrees.

If you want to track new files you put in a location, use git add, not git annex add: git annex add adds the files as so-called locked files, which leads to a totally different workflow. A sketch of the difference follows below.
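To make the locked/unlocked distinction concrete, here is a small sketch (file names are placeholders; in a v6 repository, git add stores a file unlocked so it stays editable in place, while git annex add stores it locked behind a read-only symlink):

# Unlocked: the file remains a regular, writable file in the worktree.
echo "new data" > report.bin
git add report.bin

# Locked: the file is replaced by a read-only symlink into .git/annex.
echo "other data" > archive.bin
git annex add archive.bin
ls -l archive.bin    # -> symlink into .git/annex/objects/...

And a cron entry for the periodic sync could look like this (the path is a placeholder):

*/5 * * * * cd /path/to/testdata && git annex sync --content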

Pyrethrin answered 18/3, 2016 at 10:55

This will give you a git repo "myrepo" with a related S3 bucket that holds all of the big files you don't really want in your git repository.

Set up the repo:

# Clone your repo "myrepo"
git clone git@github.com:me/myrepo.git
cd myrepo

# Initialize it to work with git-annex.  
# This creates .git/annex directory in the repo, 
# and a `git-annex` metadata branch the tools use behind the scenes.
git annex init                  

# The first time you use the repo with git-annex, someone must link it to S3.
# Be sure to have the AWS_* env vars set.
# Select a name that is fitting to be a top-level bucket name.
# Note: initremote requires an encryption setting; encryption=none stores
# content unencrypted (use encryption=shared or a keyid to encrypt).
# This creates the bucket s3://myrepo-annexfiles-SOME_UUID.
git annex initremote myrepo-annexfiles type=S3 encryption=none

# Save the repo updates related to attaching your git annex remote.
# Warning: this commits and pushes to origin both this branch and the git-annex branch.
# It will ALSO grab other things so make sure you have committed
# or stashed those to keep them out of the commit.
git annex sync    
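If that behavior feels too aggressive, git annex sync accepts options that limit what it does; the following flags are documented in the git-annex-sync man page (a sketch):

# Check what would be swept into the commit first.
git status

# Commit and pull, but don't push anywhere yet.
git annex sync --no-push

# Or skip the automatic commit entirely.
git annex sync --no-commit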

Add some files to the annex:

# These examples are small for demo.
mkdir mybigfiles
cd mybigfiles
echo 123 > file1
echo 456 > file2

# This is the alternative to `git add`
# It replaces the files with symlinks into .git/annex/.../SOME_SHA256.
# It also does `git add` on the symlinks, but not the targets.
git annex add file*             

# Look at the symlinks with wonder (we are still inside mybigfiles).
ls -l file*

# This moves the content into S3, keyed by SHA256, via the "special remote"
# you attached (move also removes the local copies; use copy to keep them):
git annex move file* --to myrepo-annexfiles 

# Again, this will do a lot of committing and pushing so be prepared.
git annex sync                  

With git-annex, the git repo itself only holds broken symlinks whose targets encode a SHA256 key for the real file content; the tooling brings the big files down on demand.
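You can see the key by reading one of the symlinks from the top of the repo; the exact object path varies by git-annex version, so the target shown here is only illustrative:

readlink mybigfiles/file1
# -> ../.git/annex/objects/<xx>/<yy>/SHA256E-s4--<hash>/SHA256E-s4--<hash>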

Later, when someone else clones the repo and wants the files:

git clone git@github.com:me/myrepo.git
cd myrepo

# Enable access to the S3 annex files.
# NOTE: This will put out a warning about ssh because the origin above is ssh.
# This is ONLY telling you that it can't push the big annex files there.
# In this example we are using git-annex specifically to ensure that.
# It is good that it has configured your origin to NOT participate here.
git annex enableremote myrepo-annexfiles

# Get all of the file content from S3:
git annex get mybigfiles/*

When done with the files, get your disk space back:

git annex drop mybigfiles/*
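Note that drop is safe by default: it refuses to remove content unless it can verify that enough copies remain elsewhere (here, in S3). The required count is the standard numcopies setting (a sketch):

# Require at least 2 verified copies before any drop succeeds.
git annex numcopies 2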

Check to see where everything really lives, and what is really downloaded where:

git annex whereis mybigfiles/file*

Note that git-annex is a super flexible tool. I found that distilling down a simpler recipe for the common case required a bit of study of the docs. Hope this helps others.

Unintentional answered 6/2, 2020 at 21:38
