What does "git ls-files" do exactly and how do we remove a file from it?

Does it show files from the local repository, the staging repository, the remote repository or from somewhere else?

I'm constantly seeing a file that is present in git ls-files. That file was deleted from the remote repository. After which I tried doing a git pull. However, that file still shows up in this command list. It should not be present here because it's not present in the remote repository either.

Fluctuant answered 21/5, 2019 at 9:30 Comment(4)
What is your git status showing? - Acidulous
@NghiaBui It shows that the file X was deleted, in red (which implies that it's being tracked). X is the file that should not appear in "git ls-files". This file doesn't exist in the remote repository. I tried doing a git pull + git fetch + git reset --hard origin/branch_name. None of these resolved the issue. - Fluctuant
git help ls-files says: "git-ls-files - Show information about files in the index and the working tree" - Teamster
@Teamster Yes. I've read that and it's still not clear whether they're talking about the local repository, the staging repository, or the list of local files. It's also unclear how to remove a file that is showing up in this list but isn't present either locally or in the remote repository. - Fluctuant
36

Summary

You need to wrap your head around the idea that Git stores at least three, and sometimes up to five active copies of each file: one in the current commit, one (or two or three!) in the index, and one—the only one you can see and work with—in your work-tree. The git ls-files command looks at these copies, then tells you something about some of them, depending on the flags you supply to git ls-files.

Without this idea of three-to-five copies of each file, lots of things in Git will never make any sense. (Well, some things are still tricky even with it, but that's another problem entirely. 😀)

Long

I think there are two issues here. One requires some terminology and then the other should fall into place:

Does [git ls-files] show files from the local repository,

Sort of, but:

the staging repository,

Git does not have a staging repository. Each repository has something that is called, in different Git documentation, either the index or the staging area. (There's an obsoleted third name, cache, that also appears in the Git glossary.)

the remote repository

Definitely not: there need not be any remote repositories—i.e., other Gits with their own repositories—at all, and if there are, only git fetch and git push have your Git call up their Git and exchange data with them. (Well, git ls-remote does the first little bit of git fetch, and git pull runs git fetch, so these two also exchange data with a remote. But git ls-files doesn't.)

or from somewhere else?

Yes, sort of. That gets us back to the first part. So let's take these three bits of terminology as defined in the Git glossary. The definition text under each term below is quoted directly from that documentation:

  • repository

    A collection of refs together with an object database containing all objects which are reachable from the refs, possibly accompanied by meta data from one or more porcelains. A repository can share an object database with other repositories via alternates mechanism. (all links theirs)

    This of course is full of yet more terminology. To attempt to de-mystify it a bit, what they're saying here is that the repository proper doesn't include the index and work-tree: it's mostly made up of the commits (and their contents). Of course, that requires that we define "index" and "work-tree", so let's move on to:

  • index

    A collection of files with stat information, whose contents are stored as objects. The index is a stored version of your working tree. Truth be told, it can also contain a second, and even a third version of a working tree, which are used when merging.

  • working tree (I usually call this work-tree):

    The tree of actual checked out files. The working tree normally contains the contents of the HEAD commit’s tree, plus any local changes that you have made but not yet committed.

Commits are frozen forever

When you run git commit, Git makes a snapshot of all of your files—well, all of your tracked files, anyway—and stores that, plus some metadata like your name and email address, in a commit. This commit is mostly permanent—you can get rid of commits, usually with a fair bit of difficulty, but just think of them as permanent for convenience—and is totally, completely, 100% read-only. It's read-only like this on purpose, because that allows other commits to share identical copies of files, so that if you commit the same file once, ten times, or even a million times, there's really only one copy of that file in the repository. It's only when you change the file to a new version that Git has to commit a new, separate copy.
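Here's a minimal sketch of that sharing, assuming a throwaway repository in a temporary directory (the file names, commit messages, and user identity are made up for illustration). Two successive commits that contain the same unchanged file.ext both refer to one and the same stored object:

```shell
set -e
repo=$(mktemp -d)            # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"

echo "same contents" > file.ext
git add file.ext
git commit -q -m "first"

# The second commit adds a different file and leaves file.ext alone
echo "something else" > other.txt
git add other.txt
git commit -q -m "second"

# Ask each commit which object it uses for file.ext
first_blob=$(git rev-parse HEAD~1:file.ext)
second_blob=$(git rev-parse HEAD:file.ext)
echo "first commit:  $first_blob"
echo "second commit: $second_blob"
```

Both lines print the same hash ID, so the repository holds only one copy of that file's contents.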

The commits are numbered, but not by a nice easy sequential numbering system. That is, we might draw them as a series of simple numbered or lettered things:

... <-C4 <-C5 <-C6 ...

where each later commit points back to its immediate predecessor. But their actual names are big ugly hash IDs. Each one is guaranteed to be unique, which is why they have to be so big and ugly and random-looking. Each hash ID is actually a cryptographic checksum, calculated over the commit's contents, so that every Git everywhere in the universe will agree that that commit, and only that commit, gets that checksum. That's the other reason you—and even Git—can't change it: if you take a commit out of the repository database, tinker with it, and change even one single bit and then put it back into the database, what you get is a new commit with a new and different hash ID.
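To see this for yourself, here is a small sketch, again assuming a scratch repository with made-up names. It "changes" a commit with git commit --amend and shows that what you really get is a new commit with a different hash ID, while the old commit is still in the database under its old ID:

```shell
set -e
repo=$(mktemp -d)            # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"

echo "v1" > file.ext
git add file.ext
git commit -q -m "original message"
old_id=$(git rev-parse HEAD)

# Show what the hash is computed over: tree, author, committer, message
git cat-file -p "$old_id"

# "Editing" the commit really builds a new one
git commit -q --amend -m "edited message"
new_id=$(git rev-parse HEAD)

echo "old: $old_id"
echo "new: $new_id"

# The old commit is untouched and still readable by its hash ID
old_subject=$(git log -1 --format=%s "$old_id")
echo "old commit still says: $old_subject"
```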

So commits are totally frozen, forever. The files inside them are frozen forever as well, and compressed, and in a special Git-only format. I like to call these files "freeze-dried". What this means is that, hey, they're great for archiving, but they are utterly useless for getting any new work done ... and that means that Git must provide some way of taking these freeze-dried files and rehydrating them into a useful form.

The work-tree provides the useful-form copies

Things don't really get much simpler than this: the work-tree has the useful-form, rehydrated copies of your files. Because they're just ordinary everyday files on your computer, you can see them, use them, change them around however you like, and otherwise work with them. They're technically not in the repository at all—they are more just right next to it. In a typical setup, the repository itself is in the .git directory/folder of the top level of your work-tree.

Obviously, if there's a commit you've extracted to make the work-tree, there must now be two copies of each file: the freeze-dried committed one, plus the regular working one. Git could stop here. Mercurial does stop here: if you use Mercurial instead of Git, you don't need to concern yourself with a third copy, because there is no third copy. But Git goes on to store yet more copies of the files.

The index / staging-area sits between the commit and the work-tree

What Git does here is to interpose a third copy of each file, between the freeze-dried committed copy and the work-tree copy. This third copy is in the committed-file format (i.e., pre-dehydrated), but because it is not in a commit, it's not actually totally frozen: it can be replaced at any time. That's what git add does: git add takes the ordinary copy of the file from the work-tree, compresses it down into the freeze-dried format, and replaces the copy that's in the index. Or, if the file wasn't in the index at all, it puts a copy into the index.
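A rough way to watch git add replace the index copy, assuming a scratch repository and a made-up file name: git ls-files --stage prints the hash ID of the index copy, which stays put when you edit the work-tree file and only changes when you run git add again.

```shell
set -e
repo=$(mktemp -d)            # hypothetical scratch repository
cd "$repo"
git init -q

echo "version 1" > file.ext
git add file.ext
before=$(git ls-files --stage file.ext)

# Editing the work-tree copy does not touch the index copy
echo "version 2" > file.ext
after_edit=$(git ls-files --stage file.ext)

# git add replaces the index copy with a new one
git add file.ext
after_add=$(git ls-files --stage file.ext)

echo "after first add: $before"
echo "after edit:      $after_edit"
echo "after second add: $after_add"
```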

This is why you have to git add files all the time. In Mercurial, you only hg add a file once. After that, you just run hg commit, and Mercurial looks at all the files it knows about, and freezes them into a new commit. This can take a long time, in a big repository. Git, by contrast, already has all the files it's supposed to know about, and already dehydrated, in the index, so git commit can just package up those dehydrated files into a new frozen commit. The cost of this speed is git add, but if you get into playing clever tricks with the index copies—e.g., using git add -p—you get more benefits than just the speedup.

As the Git glossary mentioned in its description of the index, the index takes on an expanded role during a conflicted merge. When you do a merge operation—whether that's from git merge, or from git revert or git cherry-pick or any other Git command that uses the merge engine—and it doesn't go smoothly, Git winds up putting all three inputs for each file into the index, so that instead of just one copy of file.ext, you get three. But as long as you're not in the middle of a merge, there's only one copy in the index.
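As an illustration, the sketch below manufactures a conflict in a scratch repository (the branch name side, the file name, and the user identity are made up) and then lists the index. During the conflict, file.ext occupies staging slots 1 (merge base), 2 (the current branch's version), and 3 (the other branch's version):

```shell
set -e
repo=$(mktemp -d)            # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"

echo "base" > file.ext
git add file.ext
git commit -q -m "base"
main_branch=$(git symbolic-ref --short HEAD)   # master or main, depending on setup

git checkout -q -b side
echo "side change" > file.ext
git commit -q -am "side"

git checkout -q "$main_branch"
echo "main change" > file.ext
git commit -q -am "main"

# The merge stops with a conflict; "|| true" keeps set -e from aborting
git merge side >/dev/null 2>&1 || true

# One index entry per staging slot: mode, hash, slot number, name
git ls-files --stage file.ext
stages=$(git ls-files --stage file.ext | awk '{printf "%s ", $3}')
echo "slots in use: $stages"
```

Once you resolve the conflict and git add file.ext, the three entries collapse back into a single slot-0 entry.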

Usually the index copy matches the HEAD frozen copy, or matches the work-tree copy, or both. For instance, after a fresh git checkout, all three copies match. Then you modify file.ext in the work-tree: now the commit and the index match, but they're not the same as the work-tree copy. Then you git add file.ext, and now the index and work-tree match, but they're different from the frozen copy. Then you git commit to make a new commit, which becomes the current commit, and all three copies match again.

Note that you can modify the work-tree copy:

vim file.ext

then copy the updated one into the index:

git add file.ext

then edit it again:

vim file.ext

and that way, you can make all three copies different. If you do that, git status will say that you have changes staged for commit, because the index copy is different from the current-commit copy, and say that you have changes not staged for commit, because the work-tree copy is different from the index copy.
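The same sequence as a non-interactive sketch, using echo instead of an editor (the scratch repository, file contents, and user identity are assumptions for illustration). The three copies can be read back with git show HEAD:file.ext (commit), git show :file.ext (index), and cat (work-tree):

```shell
set -e
repo=$(mktemp -d)            # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"

echo "committed" > file.ext
git add file.ext
git commit -q -m "initial"

echo "staged" > file.ext      # edit the work-tree copy
git add file.ext              # copy it into the index
echo "work-tree" > file.ext   # edit the work-tree copy again

in_commit=$(git show HEAD:file.ext)
in_index=$(git show :file.ext)
in_worktree=$(cat file.ext)
echo "commit: $in_commit / index: $in_index / work-tree: $in_worktree"

# Short status: first column is commit-vs-index, second is index-vs-work-tree
status=$(git status --short file.ext)
echo "$status"
```

The short status line shows M in both columns, which is the compact form of "changes to be committed" plus "changes not staged for commit".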

The work-tree can contain files that aren't in the index at all

The index is initially just a copy of the current commit. Git then also copies those files to the work-tree, so that you can use them. But you can create files in the work-tree and not run git add on them. Those files aren't in the index now, and if you run git commit, they won't be in the new commit either, because Git builds the new commit from the index.

You can also remove files from the index, without removing them from the work-tree:

git rm --cached file.ext

removes the index copy. It can't touch the current commit frozen copy, of course, but if you now make a new commit, the new commit won't have file.ext in it at all. (The previous commit still does, of course.)

Any file that is in your work-tree right now, and is not in your index right now, is an untracked file. Its untracked-ness comes from the fact that it's not in your index. Put that file into your index and it's tracked, no matter how you got it into your index. Remove it from your index and it's untracked, no matter how you got it out of your index. So that's the last role of the index: to determine which files are tracked, and will therefore be in the next commit.
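A short sketch of that tracked/untracked flip, assuming a scratch repository with a made-up file name and user identity. After git rm --cached, the file is still on disk and still in the old commit, but it is gone from the index and so shows up as untracked:

```shell
set -e
repo=$(mktemp -d)            # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"

echo "data" > file.ext
git add file.ext
git commit -q -m "add file.ext"

# Remove only the index copy
git rm -q --cached file.ext

tracked=$(git ls-files)                      # names in the index
untracked=$(git ls-files --others)           # in the work-tree, not in the index
in_head=$(git ls-tree -r --name-only HEAD)   # names in the current commit

echo "tracked:   [$tracked]"
echo "untracked: [$untracked]"
echo "in HEAD:   [$in_head]"
ls file.ext                                  # the work-tree copy is still there
```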

Now we can see clearly what git ls-files does

What git ls-files does is to read the index and, depending on the flags, the work-tree. It does not compare anything against the HEAD commit. Depending on what arguments you give to git ls-files, it then prints the names of some or all files that are in the index and/or in the work-tree:

git ls-files --stage

lists the files that are in the index / staging-area, along with their staging slot numbers. (It says nothing about the copies in the HEAD commit and the work-tree.) Or:

git ls-files --others

lists the (names of the) files that are in the work-tree, but not in the index. (It says nothing about the copies in the HEAD commit.) Or:

git ls-files --modified

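lists the (names of the) files that are in the index and whose work-tree copies differ from the index copies, i.e., files with unstaged modifications (a file deleted from the work-tree counts as modified, too). It says nothing about the copies in the HEAD commit. With no options: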

git ls-files

lists the (names of the) files that are in the index, with no regard for what files are in the HEAD commit or the work-tree.
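To compare the flags side by side, here is a sketch in a scratch repository with made-up file names: one file left alone, one edited, one deleted from the work-tree, and one never added. As far as I can tell from the git-ls-files documentation, --modified compares the work-tree against the index, and --deleted lists index entries whose work-tree file is missing.

```shell
set -e
repo=$(mktemp -d)            # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"

echo one > tracked.txt
echo two > changed.txt
echo three > gone.txt
git add .
git commit -q -m "initial"

echo "edit" >> changed.txt   # unstaged modification
rm gone.txt                  # unstaged deletion
echo new > untracked.txt     # never added

all=$(git ls-files)                  # everything in the index
others=$(git ls-files --others)      # in the work-tree but not in the index
modified=$(git ls-files --modified)  # work-tree copy differs from index copy
deleted=$(git ls-files --deleted)    # in the index, missing from the work-tree

printf 'index:\n%s\n\nothers:\n%s\n\nmodified:\n%s\n\ndeleted:\n%s\n' \
    "$all" "$others" "$modified" "$deleted"
```

Note that gone.txt still appears in the plain git ls-files output: deleting the work-tree file does nothing to the index.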

Mulish answered 21/5, 2019 at 16:43 Comment(9)
I am not sure that the index holds a copy of a file; instead, the index holds the name (a 40-character SHA-1) which can be found in the .git/objects folder. For example: 100644 802992c4220de19a90767f3000a79a31b98d0df7 0 README.md. This line is extracted from the index, and it is not a copy of the file README.md. It is just the name of the file in the work-tree plus a hash value, which is the key Git uses to find the blob in the objects folder. - Retrorse
@IvanRuski: yes, the index holds a hash name and a reference to the contents. But the file in your local file system is likely a name and a reference to the contents. Do you then say "my directory doesn't have any files in it, it only has file names"? :-) That would be technically correct, but it doesn't get any work done. It's useful to know at times, but mostly, we just say that our directories have the files in them. - Mulish
Thank you for this excellent answer. I've known about each of these topics individually, but your post really connects them together! - Paddlefish
So in a freshly cloned Git repository, what would be the 'git ls-files ...' command to show exactly the same files as 'find . -type f' run from the repository root (excluding entries from the .git directory)? What Git terms would we be speaking about here? Indexed files also present in the working tree? - Towne
@grenix: git ls-files would show the files in Git's index, if any. If you ran git clone -n (no checkout), the index would be empty so this would show nothing. Otherwise they would be the files that Git shoved into its index during the checkout, which would be the same set of files that appear in your working tree, yes. Note that you can, after the checkout, remove some or all of those working tree files without affecting the index copies. Git will gripe a bit but you can still make new commits containing all the files! - Mulish
You can also manipulate Git's index without doing anything with your working tree. For instance, git read-tree, git reset, and git restore can all write to an index without modifying a working tree. (Ever since Git 2.5, Git has officially supported multiple index-and-working-tree pairs, and even before that, you could use GIT_INDEX_FILE and --work-tree to fake it: there was a contrib script that was like a poor-man's git worktree.) - Mulish
OK, I realize now my question should have been, in Git terms: can I have a preview of what files/objects would be created by 'git checkout' in a repository created with 'git clone -n', or by 'git checkout .' in a repository where everything except the .git folder was deleted? BTW: it was surprising to me that Git can also create links and empty directories (e.g. github.com/nvie/gitflow.git). - Towne
@grenix: yes, though remember to check for error cases (it's possible for HEAD to be a symbolic reference to a nonexistent branch name). Read the commit hash ID, via HEAD, and then examine the corresponding tree objects. From a shell, you can use git ls-tree -r HEAD to do this. Git can create symlinks but normally won't create an empty directory; the empty-directory case is just for gitlinks (submodule parts). - Mulish
Thanks for your explanations and the hint about ls-tree. I posted my experiences with ls-tree as an answer (https://mcmap.net/q/20313/-what-does-quot-git-ls-files-quot-do-exactly-and-how-do-we-remove-a-file-from-it). - Towne
2

git ls-files is working correctly in your case. Your git status shows the X file as deleted from the working directory, which means the file still exists in the index. That's why git ls-files shows X: the command shows the contents of the index.

To remove that file from the index, just run:

git rm --cached <pathToXFile>
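A sketch that reproduces the situation in a scratch repository (the file name X is taken from the comments; the repository and user identity are made up). The file is deleted from the working directory only, so it keeps showing up in git ls-files until it is removed from the index as well:

```shell
set -e
repo=$(mktemp -d)            # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"

echo "x" > X
git add X
git commit -q -m "add X"

rm X                          # delete from the working directory only
git status --short            # shows X as an unstaged deletion
listed_before=$(git ls-files) # X is still in the index

git rm -q --cached X          # now remove it from the index too
listed_after=$(git ls-files)
git status --short            # the deletion is now staged for the next commit

echo "before: [$listed_before]  after: [$listed_after]"
```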
Acidulous answered 21/5, 2019 at 11:14 Comment(1)
I had deleted the file locally and tried to push the changes: git add (deleted file), git commit -m (some message about deleting this), git push (which failed due to a server issue). To back that out I had to use "git reset HEAD^ --soft" (save your changes, go back to the last commit) (https://mcmap.net/q/20481/-git-your-branch-is-ahead-of-39-origin-master-39-by-1-commit). Adding this comment here so that it helps out anyone else who might be stuck at this point. - Fluctuant
2

Just wanted to share:

Referring to the accepted answer https://mcmap.net/q/20313/-what-does-quot-git-ls-files-quot-do-exactly-and-how-do-we-remove-a-file-from-it and the discussion with https://stackoverflow.com/users/1256452/torek:

If the question were "how do I find out what files/objects should be there if I checked out a specific commit?", another answer might be something like:

git ls-tree -r -l HEAD

Torek also mentioned "(it's possible for HEAD to be a symbolic reference to a nonexistent branch name)", but I don't understand that for now.

More generally:

git ls-tree -r -l commit-hash

This also works in repositories cloned with the -n (no checkout) option.
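A sketch of both points, assuming scratch repositories in a temporary directory (all paths and names are made up). In a clone made with -n, git ls-files prints nothing because the index is empty, while git ls-tree still lists what a checkout would create. In a brand-new repository, HEAD names a branch that has no commits yet, which is my reading of the "nonexistent branch name" case, so there is nothing for git ls-tree to read:

```shell
set -e
work=$(mktemp -d)            # hypothetical scratch area

# A small "origin" repository with one commit
git init -q "$work/origin"
cd "$work/origin"
git config user.name "Example User"
git config user.email "example@example.com"
echo "hello" > README.md
git add README.md
git commit -q -m "initial"

# Clone without checking anything out
git clone -q -n "$work/origin" "$work/copy"
cd "$work/copy"
index_files=$(git ls-files)                     # empty: nothing in the index yet
tree_files=$(git ls-tree -r --name-only HEAD)   # what a checkout would create
git ls-tree -r -l HEAD
echo "index: [$index_files]  HEAD tree: [$tree_files]"

# A fresh repository: HEAD points at a branch with no commits
git init -q "$work/empty"
cd "$work/empty"
git symbolic-ref HEAD                           # prints the branch name HEAD refers to
if git rev-parse --verify -q HEAD >/dev/null; then
    unborn=no
else
    unborn=yes                                  # no commit to list yet
fi
echo "HEAD unborn: $unborn"
```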

I'm just wondering where the meaning of the output is documented.

Here is an extract of the output from a repo cloned with: git clone -n https://github.com/nvie/gitflow.git

100755 blob fd16d5168d671b8f9a8a8a6a140d3f7b5dacdccd    git-flow
100644 blob 55198ad82cbfe7249951aa75f1373a476997d33a    git-flow-feature
100644 blob ba485f6fe4b7d9c35bc01d2a6bd4ae201bccc9bd    git-flow-hotfix
100644 blob 5b4e7e807423279d5983c28b16307e40dfdb51d7    git-flow-init
100644 blob cb95bd486deb7089939362705d78b2197893f578    git-flow-release
100644 blob cdbfc717c0f1eb9e653a4d10d7c4df261ed40eab    git-flow-support
100644 blob 8c314996c0ac31f1396c48af5c6511124002dab7    git-flow-version
100644 blob 33274053347f4eec2f27dd8bceca967b89ae02d5    gitflow-common
120000 blob 7b736c183c7f6400b20ea613183d74a55ead78b5    gitflow-shFlags
160000 commit 2fb06af13de884e9680f14a00c82e52a67c867f1  shFlags

My interpretation:

The hashes seem to be "blob checksums" (not commit hashes), except for the 160000 entry, whose type is commit. The same checksum can appear more than once, for example when two files have identical contents. The last three digits of e.g. 100644 look like Linux file permissions in octal notation (rw-r--r--). The first three digits are not 100 if the object is not a regular file. In real life gitflow-shFlags is a symlink and shFlags is a submodule directory.

EDIT: Just stumbled over https://github.com/git/git/blob/master/Documentation/technical/index-format.txt (GOOGLE: git --index-info, STACKOVERFLOW: What does the git index contain EXACTLY?)

32-bit mode, split into (high to low bits)

  4-bit object type
  valid values in binary are 1000 (regular file), 1010 (symbolic link)
  and 1110 (gitlink)

  3-bit unused

  9-bit unix permission. Only 0755 and 0644 are valid for regular files.
  Symbolic links and gitlinks have value 0 in this field.

So if you expand each octal digit of the mode into binary:

100644: 1'000' 000'110'100'100 --> object type is regular file

120000: 1'010' 000'000'000'000 --> object type is symbolic link

160000: 1'110' 000'000'000'000 --> object type is gitlink
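The same decoding as a small shell sketch. The function name decode_mode is made up for illustration; it relies only on shell arithmetic, where a leading 0 marks an octal constant. The top 4 bits give the object type and the low 9 bits the permissions, as in the index-format excerpt above:

```shell
# decode_mode is a hypothetical helper, not part of Git
decode_mode() {
    mode=$(( 0$1 ))                # leading 0 makes the shell read it as octal
    type=$(( (mode >> 12) & 15 ))  # top 4 bits: object type
    perms=$(( mode & 0777 ))       # low 9 bits: unix permissions
    case $type in
        8)  kind="regular file" ;;   # binary 1000
        10) kind="symbolic link" ;;  # binary 1010
        14) kind="gitlink" ;;        # binary 1110
        *)  kind="unknown" ;;
    esac
    printf '%s: type=%s perms=%04o\n' "$1" "$kind" "$perms"
}

decode_mode 100644
decode_mode 100755
decode_mode 120000
decode_mode 160000
```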

OMG: Why is it so hard extracting such information from the git man pages directly?

Next questions: What is 'gitlink'? Is it only associated with git submodules?

Towne answered 17/5, 2021 at 9:21 Comment(1)
The modes are a puzzle unless or until you notice that they're derived from Linux/Unix "inode" modes. The gitlink mode is special and is indeed for submodules. - Mulish
1

With Git 2.35 (Q1 2022), "git ls-files" learns the "--sparse" option to help debugging.

It is used with a sparse index, after a git sparse-checkout command.

See commit 408c51f, commit c2a2940, commit 3a9a6ac, commit 7808709, commit 5a4e054 (22 Dec 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 3c0e417, 10 Jan 2022)

ls-files: add --sparse option

Signed-off-by: Derrick Stolee

Existing callers to 'git ls-files' are expecting file names, not directories. It is best to expand a sparse index to show all of the contained files in this case.

However, expert users may want to inspect the contents of the index itself including which directories are sparse.
Add a --sparse option to allow users to request this information.

During testing, I noticed that options such as --modified did not affect the output when the files in question were outside the sparse-checkout definition.

git ls-files now includes in its man page:

--sparse

If the index is sparse, show the sparse directories without expanding to the contained files.
Sparse directories will be shown with a trailing slash, such as "x/" for a sparse directory "x".

Choochoo answered 14/1, 2022 at 11:26 Comment(0)
0

I'm constantly seeing a file that is present in "git ls-files". That file was deleted from the remote repository. After which I tried doing a git pull.

You added that file to your index and haven't committed or removed it, so Git carries it for you until you decide what to do with it.

If you don't want it in your index, remove it. The usual way is git rm --cached, or, if you also want it gone from your work-tree, just git rm.

Often enough while you're working you'll find some stupid little bug that needs fixing but isn't really part of your current task. Git makes handling things like this very easy: check out a bugfix branch off your maintenance base, commit just that fix, go back to what you were doing and merge that fix.

If at all possible (and it's often so trivial Git just does it, silently) Git does this without in the least disturbing whatever other changes you had in flight.

You'll find other cases where Git's way of handling in-flight work avoids useless churn. The important thing is that this is how Git handles in-flight work: it stays in the index until you decide what to do with it. So long as you don't tell Git to put something else there, Git silently carries what you added.

Teamster answered 21/5, 2019 at 17:41 Comment(0)
