What does the git index contain EXACTLY?

7

238

What does the Git index exactly contain, and what command can I use to view the content of the index?


Thanks for all your answers. I know that the index acts as a staging area, and what is committed is in the index rather than the working tree. I am just curious about what an index object consists of. I guess it might be a list of filename/directory names, SHA-1 pairs, a kind of virtual tree maybe?

Is there, in Git terminology, any plumbing command that I can use to list the contents of the index?

Perr answered 3/11, 2010 at 7:20 Comment(4)
Follow up: ralfebert.de/blog/tools/visual_git_tutorial_1 - Diplococcus
you should read and watch diagrams - very helpful: gitguys.com/topics/whats-the-deal-with-the-git-index - Legatee
@Legatee the domain has expired. Not very helpful anymore. - Abscission
updated link: web.archive.org/web/20160822072849/http://www.gitguys.com/… - Eichler
213

The Git book contains an article on what an index includes:

The index is a binary file (generally kept in .git/index) containing a sorted list of path names, each with permissions and the SHA1 of a blob object; git ls-files can show you the contents of the index:

$ git ls-files --stage
100644 63c918c667fa005ff12ad89437f2fdc80926e21c 0   .gitignore
100644 5529b198e8d14decbe4ad99db3f7fb632de0439d 0   .mailmap
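For scripting, each line of that output can be split into its four fields. The following is a minimal Python sketch, assuming the default (non `-z`) output in which mode, object name, and stage are whitespace-separated and the path comes last; `StageEntry` and `parse_stage_line` are hypothetical names, not part of Git:

```python
from typing import NamedTuple


class StageEntry(NamedTuple):
    mode: int   # file mode, parsed from octal (e.g. 0o100644)
    sha: str    # hex object name of the blob
    stage: int  # 0 normally, 1-3 during a merge conflict
    path: str


def parse_stage_line(line: str) -> StageEntry:
    """Parse one line of `git ls-files --stage` output."""
    # Git separates the path with a tab; splitting on any whitespace
    # (at most 3 times) also handles copies pasted with spaces.
    mode, sha, stage, path = line.rstrip("\n").split(None, 3)
    return StageEntry(int(mode, 8), sha, int(stage), path)


sample = "100644 63c918c667fa005ff12ad89437f2fdc80926e21c 0   .gitignore"
entry = parse_stage_line(sample)
print(oct(entry.mode), entry.sha, entry.stage, entry.path)
```

Paths containing unusual characters may be quoted in this output format, so a robust script would probably prefer `git ls-files --stage -z` and split on NUL bytes instead.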

The Racy git problem gives some more details on that structure:

The index is one of the most important data structures in git.
It represents a virtual working tree state by recording list of paths and their object names and serves as a staging area to write out the next tree object to be committed.
The state is "virtual" in the sense that it does not necessarily have to, and often does not, match the files in the working tree.


Nov. 2021: see also "Make your monorepo feel small with Git’s sparse index" from Derrick Stolee (Microsoft/GitHub)

[Figure 1: working directory, index, and commit history]

The Git index is a critical data structure in Git. It serves as the “staging area” between the files you have on your filesystem and your commit history.

  • When you run git add, the files from your working directory are hashed and stored as objects in the index, leading them to be “staged changes”.
  • When you run git commit, the staged changes as stored in the index are used to create that new commit.
  • When you run git checkout, Git takes the data from a commit and writes it to the working directory and the index.

In addition to storing your staged changes, the index also stores filesystem information about your working directory.
This helps Git report changed files more quickly.
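To illustrate why cached filesystem information speeds things up, here is a rough Python sketch of a stat-based check. It is only an analogy under simplifying assumptions: `snapshot` and `maybe_changed` are hypothetical helpers, and Git's real comparison uses more fields and handles edge cases this sketch ignores.

```python
import os
import tempfile


def snapshot(path: str) -> tuple:
    """Record cheap stat fields, loosely like what an index entry caches."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ino)


def maybe_changed(path: str, cached: tuple) -> bool:
    """True if the stat data differs, so the content needs a closer look."""
    return snapshot(path) != cached


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "b")
    with open(path, "w") as f:
        f.write("a\n")
    cached = snapshot(path)
    unchanged = maybe_changed(path, cached)  # nothing touched the file
    with open(path, "a") as f:
        f.write("more\n")                    # the size changes
    changed = maybe_changed(path, cached)

print(unchanged, changed)
```

The point of the design is that a single `stat` call per file is far cheaper than re-reading and re-hashing every file on each status check.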


To see more, cf. "git/git/blob/master/Documentation/gitformat-index.txt":

The Git index file has the following format

All binary numbers are in network byte order.
Version 2 is described here unless stated otherwise.

  • A 12-byte header consisting of:
    • 4-byte signature:
      The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
    • 4-byte version number:
      The current supported versions are 2, 3 and 4.
    • 32-bit number of index entries.
  • A number of sorted index entries.
  • Extensions:
    Extensions are identified by signature.
    Optional extensions can be ignored if Git does not understand them.
    Git currently supports cached tree and resolve undo extensions.
    • 4-byte extension signature. If the first byte is 'A'..'Z' the extension is optional and can be ignored.
    • 32-bit size of the extension
    • Extension data
  • 160-bit SHA-1 over the content of the index file before this checksum.
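The 12-byte header described above can be read with Python's `struct` module. This is a minimal sketch, assuming only the layout quoted here (big-endian signature, version, entry count); `parse_index_header` is a hypothetical helper name:

```python
import struct

# Big-endian ("network byte order"): 4-byte signature, version, entry count.
HEADER = struct.Struct(">4sII")


def parse_index_header(data: bytes) -> tuple:
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    if len(data) < HEADER.size:
        raise ValueError("index file too short for a header")
    signature, version, count = HEADER.unpack_from(data, 0)
    if signature != b"DIRC":
        raise ValueError(f"bad signature: {signature!r}")
    if version not in (2, 3, 4):
        raise ValueError(f"unsupported index version: {version}")
    return version, count


# Hand-built header for illustration: "DIRC", version 2, one entry.
example = b"DIRC" + (2).to_bytes(4, "big") + (1).to_bytes(4, "big")
print(parse_index_header(example))
```

On a real repository, the same function could be fed the bytes of `.git/index` read in binary mode.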

mljrg comments:

If the index is the place where the next commit is prepared, why doesn't "git ls-files -s" return nothing after commit?

Because the index represents what is being tracked, and right after a commit, what is being tracked is identical to the last commit (git diff --cached returns nothing).

So git ls-files -s lists all files tracked (object name, mode bits and stage number in the output).

That list (of tracked elements) is initialized with the content of a commit.
When you switch branches, the index content is reset to the commit referenced by the branch you just switched to.


Git 2.20 (Q4 2018) adds an Index Entry Offset Table (IEOT):

See commit 77ff112, commit 3255089, commit abb4bb8, commit c780b9c, commit 3b1d9e0, commit 371ed0d (10 Oct 2018) by Ben Peart (benpeart).
See commit 252d079 (26 Sep 2018) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit e27bfaa, 19 Oct 2018)

ieot: add Index Entry Offset Table (IEOT) extension

This patch enables addressing the CPU cost of loading the index by adding additional data to the index that will allow us to efficiently multi-thread the loading and conversion of cache entries.

It accomplishes this by adding an (optional) index extension that is a table of offsets to blocks of cache entries in the index file.

To make this work for V4 indexes, when writing the cache entries, it periodically "resets" the prefix-compression by encoding the current entry as if the path name for the previous entry is completely different and saves the offset of that entry in the IEOT.
Basically, with V4 indexes, it generates offsets into blocks of prefix-compressed entries.

With the new index.threads config setting, the index loading is now faster.
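The idea of resetting the prefix compression per block can be illustrated with a toy model. This is a simplified sketch and not Git's on-disk encoding: `compress_paths`, `decompress_block`, and `block_size` are hypothetical names, entries are Python tuples rather than bytes, and the "offset table" holds list positions instead of file offsets. Each entry stores how many trailing characters to drop from the previous path plus the new suffix; the first entry of every block is encoded against an empty previous path, so each block can be decoded on its own.

```python
def compress_paths(paths, block_size):
    """Prefix-compress sorted paths, restarting every block_size entries."""
    entries, block_starts = [], []
    previous = ""
    for i, path in enumerate(paths):
        if i % block_size == 0:
            block_starts.append(len(entries))  # toy stand-in for an IEOT offset
            previous = ""                      # "reset": pretend nothing came before
        common = 0
        limit = min(len(previous), len(path))
        while common < limit and previous[common] == path[common]:
            common += 1
        # (characters to strip from the previous path, suffix to append)
        entries.append((len(previous) - common, path[common:]))
        previous = path
    return entries, block_starts


def decompress_block(entries, start, end):
    """Decode entries[start:end] without looking at any earlier block."""
    paths, previous = [], ""
    for strip, suffix in entries[start:end]:
        previous = previous[: len(previous) - strip] + suffix
        paths.append(previous)
    return paths


paths = ["src/a.c", "src/a.h", "src/b.c", "test/t1.sh", "test/t2.sh"]
entries, block_starts = compress_paths(paths, block_size=2)
print(entries)
print(block_starts)
```

Because every block starts from a clean state, a loader could hand each `(start, end)` range to a separate worker, which is the property the commit message describes.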


As a result (of using IEOT), commit 7bd9631 cleans up the read-cache.c load_cache_entries_threaded() function for Git 2.23 (Q3 2019).

See commit 8373037, commit d713e88, commit d92349d, commit 113c29a, commit c95fc72, commit 7a2a721, commit c016579, commit be27fb7, commit 13a1781, commit 7bd9631, commit 3c1dce8, commit cf7a901, commit d64db5b, commit 76a7bc0 (09 May 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit c0e78f7, 13 Jun 2019)

read-cache: drop unused parameter from threaded load

The load_cache_entries_threaded() function takes a src_offset parameter that it doesn't use. This has been there since its inception in 77ff112 (read-cache: load cache entries on worker threads, 2018-10-10, Git v2.20.0-rc0).

Digging on the mailing list, that parameter was part of an earlier iteration of the series, but became unnecessary when the code switched to using the IEOT extension.


With Git 2.29 (Q4 2020), the format description adjusts to the recent SHA-256 work.

See commit 8afa50a, commit 0756e61, commit 123712b, commit 5b6422a (15 Aug 2020) by Martin Ågren (none).
(Merged by Junio C Hamano -- gitster -- in commit 74a395c, 19 Aug 2020)

index-format.txt: document SHA-256 index format

Signed-off-by: Martin Ågren

Document that in SHA-1 repositories, we use SHA-1 and in SHA-256 repositories, we use SHA-256, then replace all other uses of "SHA-1" with something more neutral.
Avoid referring to "160-bit" hash values.

technical/index-format now includes in its man page:

All binary numbers are in network byte order.
In a repository using the traditional SHA-1, checksums and object IDs (object names) mentioned below are all computed using SHA-1.
Similarly, in SHA-256 repositories, these values are computed using SHA-256.

Version 2 is described here unless stated otherwise.
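As a hedged illustration of what this means for the trailing checksum, the sketch below checks the last bytes of an index against a hash of everything before them, with the digest size depending on the repository's hash algorithm. `index_checksum_ok` and the `algo` parameter are hypothetical names, and the sample data is a hand-built stand-in rather than a real index file.

```python
import hashlib

# The two repository hash algorithms mentioned in the format description.
HASHES = {"sha1": hashlib.sha1, "sha256": hashlib.sha256}


def index_checksum_ok(data: bytes, algo: str = "sha1") -> bool:
    """Check that the trailing checksum matches a hash of everything before it."""
    hasher = HASHES[algo]
    size = hasher().digest_size  # 20 bytes for SHA-1, 32 for SHA-256
    if len(data) < size:
        return False
    body, trailer = data[:-size], data[-size:]
    return hasher(body).digest() == trailer


# Hand-built stand-in for an index: a header-like body plus its checksum.
body = b"DIRC" + (2).to_bytes(4, "big") + (0).to_bytes(4, "big")
sha1_index = body + hashlib.sha1(body).digest()
sha256_index = body + hashlib.sha256(body).digest()
print(index_checksum_ok(sha1_index), index_checksum_ok(sha256_index, "sha256"))
```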


Commit 4950aca from commit cf4a3bd, Git 2.44, Q1 2024, details the block management in a Git index.

Nitid answered 3/11, 2010 at 12:23 Comment(12)
About the importance of the index in the Git model, see stackoverflow.com/questions/1450348/… - Nitid
The first link above points to a version of git-scm which does not have an article on the index. I think the intent was to point here: schacon.github.io/gitbook/7_the_git_index.html - Bosley
@Nitid If the index is the place where the next commit is prepared, why doesn't "git ls-files -s" return nothing after commit? There must be something more about the index than you have put in your answer. - Laplace
@Laplace not sure I follow you: after a commit, the stage (where the commit was being prepared) would be empty, since the commit has been done, wouldn't it be? - Nitid
@Nitid well I think it would, but I did a "git commit -a" and then I did a "git ls-files -s" and got the list of all the files at the HEAD of my current branch instead of an empty index ... (that's why I think the index does much more than just supporting the preparation of the next commit). - Laplace
@Laplace you are correct, the output isn't empty. After a commit, its content is identical to what has just been committed, though. I have edited my answer. - Nitid
@Nitid I would bet more than "the index represents what is being tracked" and say that the index knows the files that exist at the head of the current branch, so if you switch branches you should see a different index if there are different files at the head of both branches (I've not tested this because I am just beginning to learn Git and don't know how to create branches yet). - Laplace
@Laplace yes, it does change: the content of the index is reset to the content of the commit it is based on (like the commit referenced by the branch you just switched to). Then, any change added to the index helps git to detect differences between that commit and what you modified. That, in turn, helps prepare the next commit. - Nitid
Could you explain what tree ls-tree is working on? See also https://mcmap.net/q/20054/-what-does-the-git-index-contain-exactly - Buchanan
Isn't it misleading that git ls-files --stage lists all files, regardless of whether or not they are staged? The name of the option "--stage" would suggest that only files with staged changes would be listed, and yet, here we are... - Aconcagua
@Telescope True, but the term "stage" here does not refer to files that are merely staged for commit. Instead, it is showing information about the staging area, where all the information about what is going to be committed is stored. - Nitid
@Nitid Thanks for the response. What happened was I was confused by the relationship of the terms staging area (the next snapshot to be committed), staged file (a file which has changes recorded in the staging area), and staged change (a change which has been accounted for in the staging area). In particular, the terminology "staged file" implied to me that the staging area did not contain every file; that there were "unstaged files" not contained by the staging area. Now I know that the staging area represents the whole working tree, and thus contains all tracked files. - Aconcagua
70

Bit by bit analysis

I've decided to do a little testing to better understand the format and research some of the fields in more detail.

Results below are the same for Git versions 1.8.5.2 and 2.3.

I have marked points which I'm not sure / haven't found with TODO: please feel free to complement those points.

As others mentioned, the index is stored under .git/index, not as a standard tree object, and its format is binary and documented at: https://github.com/git/git/blob/master/Documentation/technical/index-format.txt

The major structs that define the index are at cache.h, because the index is a cache for creating commits.

Setup

When we start a test repository with:

git init
echo a > b
git add b
tree --charset=ascii

The .git directory looks like:

.git/objects/
|-- 78
|   `-- 981922613b2afb6025042ff6bd878ac1994e85
|-- info
`-- pack

And if we get the content of the only object:

git cat-file -p 78981922613b2afb6025042ff6bd878ac1994e85

We get a. This indicates that:

  • the index points to the file contents, since git add b created a blob object
  • it stores the metadata in the index file, not in a tree object, since there was only a single object: the blob (on regular Git objects, blob metadata is stored on the tree)
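The object name of that blob can be reproduced outside Git. A minimal sketch, assuming a SHA-1 repository and the standard loose-object convention of hashing a `blob <size>` header, a NUL byte, and the content; `git_blob_sha1` is a hypothetical helper name:

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Object name Git assigns to a blob in a SHA-1 repository."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# `echo a > b` writes the two bytes "a\n".
print(git_blob_sha1(b"a\n"))
```

The printed value should match the object path under .git/objects shown above (directory 78 plus the remaining 38 hex digits).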

hd analysis

Now let's look at the index itself:

hd .git/index

Gives:

00000000  44 49 52 43 00 00 00 02  00 00 00 01 54 09 76 e6  |DIRC.... ....T.v.|
00000010  1d 81 6f c6 54 09 76 e6  1d 81 6f c6 00 00 08 05  |..o.T.v. ..o.....|
00000020  00 e4 2e 76 00 00 81 a4  00 00 03 e8 00 00 03 e8  |...v.... ........|
00000030  00 00 00 02 78 98 19 22  61 3b 2a fb 60 25 04 2f  |....x.." a;*.`%./|
00000040  f6 bd 87 8a c1 99 4e 85  00 01 62 00 ee 33 c0 3a  |......N. ..b..3.:|
00000050  be 41 4b 1f d7 1d 33 a9  da d4 93 9a 09 ab 49 94  |.AK...3. ......I.|
00000060

Next we will conclude:

  | 0           | 4            | 8           | C              |
  |-------------|--------------|-------------|----------------|
0 | DIRC        | Version      | File count  | ctime       ...| 0
  | ...         | mtime                      | device         |
2 | inode       | mode         | UID         | GID            | 2
  | File size   | Entry SHA-1                              ...|
4 | ...                        | Flags       | Index SHA-1 ...| 4
  | ...                                                       |

First comes the header, defined at: struct cache_header:

  • 44 49 52 43: DIRC. TODO: why is this necessary?

  • 00 00 00 02: format version: 2. The index format has evolved over time. Currently, versions up to 4 exist. The format of the index should not be an issue when collaborating between different computers on GitHub because bare repositories don't store the index: it is generated at clone time.

  • 00 00 00 01: count of files on the index: just one, b.

Next starts a list of index entries, defined by struct cache_entry. Here we have just one. It contains:

  • a bunch of file metadata: 8 byte ctime, 8 byte mtime, then 4 byte: device, inode, mode, UID and GID.

    Note how:

    • ctime and mtime are the same (54 09 76 e6 1d 81 6f c6) as expected since we haven't modified the file

      The first bytes are seconds since EPOCH in hex:

      date --date="@$(printf "%d" "0x540976e6")"
      

      Gives:

      Fri Sep  5 10:40:06 CEST 2014
      

      Which is when I made this example.

      The second 4 bytes are nanoseconds.

    • UID and GID are 00 00 03 e8, i.e. 0x3e8 = 1000 in decimal: a common value for single user setups.

    All of this metadata, most of which is not present in tree objects, allows Git to check if a file has changed quickly without comparing the entire contents.

  • at offset 0x30 of the dump: 00 00 00 02: file size: 2 bytes (a and \n from echo)

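  • 78 98 19 22 ... c1 99 4e 85: 20 byte SHA-1 object name of the blob the entry points to: it is the same 78981922613b2afb6025042ff6bd878ac1994e85 that git add b created above, so it is a hash of the file content, not of the entry metadata. Consistently, my experiments with the assume valid flag show that the flags that follow it do not affect this SHA-1.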

  • 2 byte flags: 00 01

    • 1 bit: assume valid flag. My investigations indicate that this poorly named flag is where git update-index --assume-unchanged stores its state: https://mcmap.net/q/20312/-where-does-quot-git-update-index-assume-unchanged-file-quot-actually-save-this-information-to

    • 1 bit extended flag. Determines if the extended flags are present or not. Must be 0 on version 2 which does not have extended flags.

    • 2 bit stage flag used during merge. Stages are documented in man git-merge:

      • 0: regular file, not in a merge conflict
      • 1: base
      • 2: ours
      • 3: theirs

      During a merge conflict, all stages from 1-3 are stored in the index to allow operations like git checkout --ours.

      If you git add, then a stage 0 is added to the index for the path, and Git will know that the conflict has been marked as solved. TODO: check this.

    • 12 bit length of the path that will follow: 0 01: 1 byte only since the path was b

  • 2 byte extended flags. Only present in index version 3 or later, and only if the "extended flag" was set on the basic flags, so they do not appear in this version 2 dump. TODO.

  • 62 (ASCII b): variable length path. Length determined in the previous flags, here just 1 byte, b.
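The bit layout of the 2 byte flags field listed above can be sketched in Python as follows. `decode_flags` is a hypothetical helper, and the sketch assumes the field order given in the list (assume valid, extended, stage, path length, from the highest bit down):

```python
def decode_flags(flags: int) -> dict:
    """Split the 16-bit flags field of a version 2 index entry."""
    return {
        "assume_valid": bool(flags & 0x8000),  # highest bit
        "extended": bool(flags & 0x4000),      # must be 0 in version 2
        "stage": (flags >> 12) & 0x3,          # 0 normal, 1 base, 2 ours, 3 theirs
        "name_length": flags & 0x0FFF,         # low 12 bits
    }


# 00 01 from the dump: no flags set, stage 0, path length 1.
print(decode_flags(0x0001))
```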

Then comes a 00: 1-8 bytes of zero padding so that the path is null-terminated and the entry length is a multiple of 8 bytes. This only happens before index version 4.

No extensions were used. Git knows this because there would not be enough space left in the file for the checksum.

Finally there is a 20 byte checksum ee 33 c0 3a .. 09 ab 49 94 over the content of the index.
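The walk-through above can be cross-checked with a short Python sketch that decodes the same 96 bytes. It assumes a version 2 index with a single entry and no extended flags; `parse_first_entry` and the dictionary keys are hypothetical names chosen for illustration.

```python
import struct
from datetime import datetime, timezone

# The 96 bytes of the hexdump above.
DUMP = bytes.fromhex(
    "44 49 52 43 00 00 00 02 00 00 00 01 54 09 76 e6 "
    "1d 81 6f c6 54 09 76 e6 1d 81 6f c6 00 00 08 05 "
    "00 e4 2e 76 00 00 81 a4 00 00 03 e8 00 00 03 e8 "
    "00 00 00 02 78 98 19 22 61 3b 2a fb 60 25 04 2f "
    "f6 bd 87 8a c1 99 4e 85 00 01 62 00 ee 33 c0 3a "
    "be 41 4b 1f d7 1d 33 a9 da d4 93 9a 09 ab 49 94"
)

HEADER_SIZE = 12
# ctime s/ns, mtime s/ns, dev, inode, mode, uid, gid, size, 20-byte SHA-1, flags
ENTRY_FIXED = struct.Struct(">10I20sH")


def parse_first_entry(data: bytes) -> dict:
    """Decode the first entry of a version 2 index (no extended flags)."""
    (ctime_s, ctime_ns, mtime_s, mtime_ns, dev, ino,
     mode, uid, gid, size, sha, flags) = ENTRY_FIXED.unpack_from(data, HEADER_SIZE)
    name_len = flags & 0x0FFF
    path_start = HEADER_SIZE + ENTRY_FIXED.size
    path = data[path_start:path_start + name_len].decode()
    # Entries are NUL-padded to a multiple of 8 bytes, with at least one NUL.
    entry_len = (ENTRY_FIXED.size + name_len + 8) & ~7
    return {
        "mtime": datetime.fromtimestamp(mtime_s, tz=timezone.utc),
        "mode": oct(mode),
        "uid": uid,
        "gid": gid,
        "size": size,
        "sha": sha.hex(),
        "path": path,
        "entry_end": HEADER_SIZE + entry_len,
    }


entry = parse_first_entry(DUMP)
for key, value in entry.items():
    print(f"{key}: {value}")
```

The decoded object name is the blob created by git add b, and the 20 bytes left after `entry_end` are the trailing checksum, which is why there is no room for extensions in this file.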

Carlita answered 12/9, 2014 at 10:42 Comment(10)
Very Interesting. +1. That illustrates my own answer nicely. I wonder if those results would change with the latest Git 2.1+. - Nitid
Is there a reason you reverse-engineered the file like this? You could also look at Git's source to know what was actually in there and what the purpose is, right? - Parasang
@NielsBom yes, that would work also. When interpreting programs, I prefer to take two approaches: first empirical to see what outputs it generates, and only then read the source. Otherwise one might get caught up into source code edge cases which don't even appear on simple outputs. Of course, I did look at the source structs to help guide me, and every TODO can be solved by reading how those structs are manipulated, which is the hard part. - Carlita
@CiroSantilli六四事件法轮功纳米比亚威视 : If I modify the index in a hex editor and update its 20 byte checksum, is there a command to update the sha1 which is stored in other objects? (git complains sha1 signature of index is corrupt). Also, is the index data stored in a completely different way when sent over push requests? - Reactive
@Reactive 1) Command to update sha1: I think not, sounds too internalish, I'd just manipulate it with head, recalculate the SHA-1 with sha1sum and concatenate. I'm curious: why do you want to do that? 2) The index is never sent, only required objects. I don't know the Git protocol, but I think it sends packfiles, which is an encoding for diffs. Related: stackoverflow.com/questions/9478023/… - Carlita
@CiroSantilli六四事件法轮功纳米比亚威视 : Security purposes. Just looking for the well-known kind of raster image file attacks applied to git database/objects. (of course I know most implementations recently took care of that perspective, but probably not all) So I’m especially searching for binary data structures that tell the length of an array. (concerning text buffers it seems null termination is the norm for telling the number of rows) - Reactive
Here is a date printing command that works on Mac OS Darwin: date -r $(printf "%d" "0x540976e6") - Uxmal
Regarding git add, per your TODO: you are correct. If you have high-stage index entries (a conflict) at a given path, when you git add that path, all high-stage index entries will be removed and the working directory copy will be added at stage 0. (Resolving the conflict). - Octachord
Could you explain what tree ls-tree is working on? See also https://mcmap.net/q/20054/-what-does-the-git-index-contain-exactly - Buchanan
@Buchanan I'm not sure if it uses index or not (I'd just test with a git add). But for sure it looks at committed files going through commit object -> tree objects, see also: stackoverflow.com/questions/22968856/… - Carlita
12

The Git index is a staging area between your working directory and your repository. You can use the index to build up a set of changes that you want to commit together. When you create a commit, what is committed is what is currently in this index, not what is in your working directory.

To see what is inside the index, issue the command:

git status

When you run git status, you can see which files are staged (currently in your index), which are modified but not yet staged, and which are completely untracked.

You can read this. A Google search throws up many links, which should be fairly self sufficient.

Diplococcus answered 3/11, 2010 at 7:25 Comment(5)
git status does not list all files from index. It only lists those files which differ between index and working directory. To see all files in index, you need to use git ls-files. - Algerian
@AkashAgrawal, git status does in fact list index files, irrespective of whether they differ between index and workdir. - Gauldin
yes, it lists SOME of the index files, but it doesn't show you everything that is inside the index, which is what his statement in his answer says. That's like saying there are 2 green balls and 3 red balls inside a box. To see what's inside the box, pull out the 2 green balls. What Akash said is most accurate, to see all the files in the index, use git ls-files. - Serb
Indeed. git status lists files that are in the index, yes, but does not list all files in the index. Explaining how git status actually works would be a beneficial answer to some question, though probably not this one. - Octachord
git status shows the working tree status (difference between working tree and index). It doesn't actually show the index. git-scm.com/docs/git-status - Oulu
4

Git index is a binary file (generally kept in .git/index) containing a sorted list of path names, each with permissions and the SHA1 of a blob object;

git ls-files can show you the contents of the index. Please note that the words index, stage, and cache refer to the same thing in Git: they are used interchangeably.


Git index, or Git cache, has 3 important properties:

  1. The index contains all the information necessary to generate a single (uniquely determined) tree object.
  2. The index enables fast comparisons between the tree object it defines and the working tree.
  3. It can efficiently represent information about merge conflicts between different tree objects, allowing each pathname to be associated with sufficient information about the trees involved that you can create a three-way merge between them.

Source:

  1. https://mincong.io/2018/04/28/git-index/
  2. https://medium.com/hackernoon/understanding-git-index-4821a0765cf
Brail answered 12/7, 2020 at 14:3 Comment(1)
Could you explain what tree ls-tree is working on? See also https://mcmap.net/q/20054/-what-does-the-git-index-contain-exactly - Buchanan
2

In response to @ciro-santilli-郝海东冠状病六四事件法轮功's detailed, in-depth look at the index, I am sharing output for one of the TODOs.

"If you git add, then a stage 0 is added to the index for the path, and Git will know that the conflict has been marked as solved. TODO: check this."

And, more specifically, the different merge stages.

  • 0: regular file, not in a merge conflict
  • 1: base
  • 2: ours
  • 3: theirs

Details on the numerical representation of the various stages, in this case a file with conflict.

$ git ls-files -s
100644 f72d68f0d10f6efdb8adc8553a1df9c0444a0bec 0       vars/buildComponent.groovy

$ git stash list
stash@{0}: WIP on master: c40172e turn off notifications, temporarily

$ git stash apply
Auto-merging vars/commonUtils.groovy
Auto-merging vars/buildComponent.groovy
CONFLICT (content): Merge conflict in vars/buildComponent.groovy

$ git ls-files -s
100644 bc48727339d36f5d54e14081f8357a0168f4c665 1       vars/buildComponent.groovy
100644 f72d68f0d10f6efdb8adc8553a1df9c0444a0bec 2       vars/buildComponent.groovy
100644 24dd5be1783633bbb049b35fc01e8e88facb20e2 3       vars/buildComponent.groovy
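Output like this can be grouped by path to list conflicted files and their three versions. A minimal Python sketch, assuming whitespace-separated fields as in the listing above; `conflicted_paths` and `STAGE_NAMES` are hypothetical names:

```python
from collections import defaultdict

STAGE_NAMES = {1: "base", 2: "ours", 3: "theirs"}


def conflicted_paths(ls_files_output: str) -> dict:
    """Map each conflicted path to {stage name: object name}."""
    stages = defaultdict(dict)
    for line in ls_files_output.splitlines():
        if not line.strip():
            continue
        _mode, sha, stage, path = line.split(None, 3)
        if int(stage) != 0:  # stage 0 means "not in a merge conflict"
            stages[path][STAGE_NAMES[int(stage)]] = sha
    return dict(stages)


output = """\
100644 bc48727339d36f5d54e14081f8357a0168f4c665 1       vars/buildComponent.groovy
100644 f72d68f0d10f6efdb8adc8553a1df9c0444a0bec 2       vars/buildComponent.groovy
100644 24dd5be1783633bbb049b35fc01e8e88facb20e2 3       vars/buildComponent.groovy
"""
result = conflicted_paths(output)
print(result)
```

Note that the "ours" object name (stage 2) is the same f72d68f... that was at stage 0 before the stash was applied.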
Scroggins answered 10/10, 2020 at 3:43 Comment(0)
1

Here is exactly what you need: use this command.

$ binwalk index

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
1717          0x6B5           Unix path: /company/user/user/delete.php
1813          0x715           Unix path: /company/user/user/get.php
1909          0x775           Unix path: /company/user/user/post.php
2005          0x7D5           Unix path: /company/user/user/put.php
3373          0xD2D           Unix path: /urban-airship/channel/channel/post.php
3789          0xECD           Unix path: /urban-airship/named-user/named-user/post.php
3901          0xF3D           Unix path: /user/categories/categories/delete.php
4005          0xFA5           Unix path: /user/categories/categories/get.php
4109          0x100D          Unix path: /user/categories/categories/put.php
4309          0x10D5          Unix path: /user/favorites/favorites/delete.php
Cropper answered 24/1, 2019 at 16:47 Comment(0)
0

Just wanted to put git ls-tree in the ring.

The index is one of the most important data structures in git.
It represents a virtual working tree state by recording list of paths and their object names and serves as a staging area to write out the next tree object to be committed.
The state is "virtual" in the sense that it does not necessarily have to, and often does not, match the files in the working tree.

Would it be true to say git ls-tree tells me exactly what working files/objects should be present if I checked out a specific commit? What kind of tree do we speak of in the context of ls-tree?

Examples

git ls-tree -r -l HEAD
git ls-tree -r -l commit-hash

BTW: ls-tree works also for repositories cloned without checkout (-n) where ls-files returns nothing.

https://mcmap.net/q/20313/-what-does-quot-git-ls-files-quot-do-exactly-and-how-do-we-remove-a-file-from-it


Buchanan answered 17/5, 2021 at 17:41 Comment(1)
Is this a question? - Leanora
