How does software like Voidtools' Everything index more than 100k files in less than a second?

There is a piece of software called "Everything": it indexes all the files on your machine and, once the files are indexed, finds anything very fast.

I would expect the indexing phase to take a few minutes, but no: it takes a few seconds to index a full computer with multiple TB. How is that possible? A simple loop over the files would take much longer.

What am I missing?

Crashing answered 16/5, 2021 at 7:37 Comment(1)
Apart from the answers, try looking into how the OS itself actually keeps track of files and processes – Sesame

Enumerating files one by one through the official API would indeed take ages. But Everything reads the Master File Table directly (and later updates watch the USN Change Journal), according to the author himself, thereby bypassing the slow file-enumeration API.
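To illustrate why scanning a table of fixed-size records is so much faster than per-file API calls, here is a minimal Python sketch. The record layout below is invented for illustration (real MFT file records are a binary NTFS format, typically 1 KiB per record); the point is that a contiguous table can be decoded with plain sequential reads and no per-file system calls:

```python
import struct

RECORD_SIZE = 1024  # real MFT file records are typically 1 KiB as well

def pack_record(name: str) -> bytes:
    """Pack a file name into a fixed-size synthetic record.
    Layout (invented for illustration): 2-byte name length,
    UTF-16-LE name bytes, zero padding up to RECORD_SIZE."""
    encoded = name.encode("utf-16-le")
    assert len(encoded) <= RECORD_SIZE - 2
    return struct.pack("<H", len(encoded)) + encoded + b"\x00" * (RECORD_SIZE - 2 - len(encoded))

def scan_table(table: bytes):
    """Sequentially decode every record in one contiguous buffer:
    just offset arithmetic over memory, no system calls per file."""
    for offset in range(0, len(table), RECORD_SIZE):
        (name_len,) = struct.unpack_from("<H", table, offset)
        yield table[offset + 2 : offset + 2 + name_len].decode("utf-16-le")

# Build a ~100 MB table of 100,000 records, then scan all of it.
table = b"".join(pack_record(f"file_{i}.txt") for i in range(100_000))
names = list(scan_table(table))
print(len(names))   # 100000
print(names[0])     # file_0.txt
```

The scan completes in well under a second on ordinary hardware, which is the core of the trick: the real work is one bulk sequential read plus cheap in-memory decoding.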

a full computer with multiple TB

The total size of the files is not relevant, because Everything does not index file contents. MFT file records are 1 KB each, so for 100K files you can expect to read on the order of 0.1 GB to build an index from scratch (somewhat more because of non-file entries, but the same order of magnitude, and much less when updating an existing index). That's not a lot of data, and it should be possible to read it in under a second.
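The back-of-envelope arithmetic works out as follows (the 500 MB/s throughput is an assumed mid-range SATA SSD figure, not a measurement):

```python
RECORD_SIZE = 1024       # bytes per MFT file record
NUM_FILES = 100_000
SSD_THROUGHPUT = 500e6   # bytes/s, assumed mid-range SATA SSD

total_bytes = RECORD_SIZE * NUM_FILES
print(total_bytes / 1e9)             # ~0.1 GB of metadata to read
print(total_bytes / SSD_THROUGHPUT)  # ~0.2 s of sequential I/O
```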

Processing 100K entries to build an index may seem like a task that could be slow, but for a sense of scale, compare it to the (tens of) billions of instructions that a contemporary computer can execute per second. "4 GHz" does not exactly mean "4 billion instructions per second"; in fact it is better than that, since even an old CPU like the original Pentium could execute more than one instruction per cycle. On that scale alone, it is not unthinkable to build an index of 100K entries in a few seconds. Minutes would be excessive: that would correspond to millions of instructions per item, which is bad even for an O(n log n) algorithm (the base-2 log of 100K is about 17); surely we can do better than that.
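A quick sanity check of that scale argument; the in-memory sort of 100K names is only a stand-in for real index construction, which does more work but at a similar order of magnitude:

```python
import math
import random
import time

n = 100_000
print(round(math.log2(n), 1))  # ~16.6, the "log n" factor for n = 100K

# Sorting 100K random names is a rough stand-in for building a
# simple sorted index; it typically finishes in well under a second.
names = [f"file_{random.randrange(10**9)}" for _ in range(n)]
start = time.perf_counter()
names.sort()
elapsed = time.perf_counter() - start
print(f"{elapsed:.3f} s")
```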

Muskeg answered 16/5, 2021 at 8:35 Comment(2)
"takes ages" is a relative term. On my SSD (212 GB used for 400,000 files), a full scan via Windows API takes 12 seconds.Alverta
@AxelKemper that's interesting, it has been a lot slower than that (several minutes) when I last experimented with it, I may have to retryMuskeg

Threading/multiprocessing can drastically improve speed, and they are probably taking advantage of multiple cores. You said "a simple loop over the files", so I am assuming you don't know about threading/multiprocessing.

Ingenious answered 16/5, 2021 at 8:0 Comment(4)
By that logic, using 8 cores should only reduce the runtime of a simple loop by a factor of 8, which cannot turn multiple minutes into a second – Sesame
What logic? I didn't say anything about it reducing time in proportion to the number of cores... – Ingenious
Even then your statement makes even less sense, because with something like multiprocessing in Python you can spawn more processes than cores and it can still be faster, even though that is not recommended. That means it wouldn't matter if you had 8 cores; you can run more than 8 processes... – Ingenious
Try reading up on theoretical vs. actual speedup. Threading/multiprocessing has limitations – Sesame
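The theoretical-vs-actual speedup point can be made concrete with Amdahl's law: only the parallelizable fraction of the work scales with core count, and the serial remainder caps the benefit. A minimal sketch (the 90% parallel fraction is an arbitrary assumption for illustration):

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speedup from Amdahl's law: the serial part
    (1 - parallel_fraction) limits the benefit of adding cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# Even with 90% of the work parallelizable (an assumed figure),
# 8 cores give well under an 8x speedup:
print(round(amdahl_speedup(0.9, 8), 2))  # 4.71
```

So even generous parallelism cannot, by itself, turn a minutes-long enumeration into a sub-second one; the algorithmic change (reading the MFT in bulk) is what does the heavy lifting.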
