What makes Everything's file search and index so efficient?

Everything is a file searching program. As its author hasn't released the source code, I am wondering how it works.

  • How could it index files so efficiently?
  • What data structures does it use for file searching?
  • How can its file searching be so fast?

To quote its FAQ,

"Everything" only indexes file and folder names and generally takes a few seconds to build its database. A fresh install of Windows 10 (about 120,000 files) will take about 1 second to index. 1,000,000 files will take about 1 minute.

If it takes only one second to index a fresh install of Windows 10, and only one minute to index one million files, does this mean it can index 120,000 files per second?

To make the search that fast, there must be a special data structure. Searching by file name doesn't match only from the start of the name; in most cases it must match from the middle as well. That makes some widely used indexing structures, such as tries and red–black trees, ineffective.
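For scale, consider even the naive fallback: a linear strstr() scan over every indexed name held in memory (as one of the comments below points out, that alone is surprisingly fast for a million names). A toy illustration with made-up names, not Everything's actual method:

    /* Toy baseline, not Everything's method: substring search over an
     * in-memory list of file names. strstr() matches anywhere in the
     * name, which is exactly what prefix structures like tries cannot
     * serve directly. */
    #include <stdio.h>
    #include <string.h>

    /* made-up name table standing in for a real index */
    static const char *names[] = {
        "C:\\Windows\\System32\\notepad.exe",
        "C:\\Users\\XXX\\Documents\\report.docx",
        "C:\\Program Files\\Everything\\Everything.exe",
    };

    static void search(const char *needle)
    {
        for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
            if (strstr(names[i], needle))   /* match from the middle, too */
                printf("%s\n", names[i]);
    }

    int main(void)
    {
        search("pad");  /* finds notepad.exe by a mid-name fragment */
        return 0;
    }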

The FAQ clarifies further.

Does "Everything" hog my system resources?

No, "Everything" uses very little system resources. A fresh install of Windows 10 (about 120,000 files) will use about 14 MB of ram and less than 9 MB of disk space. 1,000,000 files will use about 75 MB of ram and 45 MB of disk space.

Caulicle answered 7/12, 2017 at 2:21 Comment(6)
Supposedly it doesn't index file contents. This makes the mystery a lot less interesting.Breadbasket
Note that 1,000,000 files / 1 minute is only about 20k files per second; presumably a fresh Windows install allows some files to be skipped or handled faster.Breadbasket
A basic Regex search over a text file can run at many GB/s under optimal conditions; handling a query over a few tens of MB of memory isn't that impressive even with a linear scan.Breadbasket
Using just a simple text file? Um... that can hardly be considered a good idea.Caulicle
Why not? If it works, it works.Breadbasket
I'm voting to close this question as off-topic because it's asking us how a specific closed-source application works. I would suggest rephrasing it to not ask what this program specifically does, but rather how to approach the problem in general (but you're asking multiple questions here, which is discouraged, and one of them appears to be purely a disk I/O question and has nothing to do with data structures and, for the other, you may be looking for a "suffix tree").Shackelford

Short Answer: MFT (Master File Table)

Getting the data

Many search engines used to recursively walk the entire disk structure to find all the files, so indexing took a long time to complete (even when contents were not indexed). If contents were also indexed, it took far longer.

From my analysis, Everything does not recurse at all. Judging by the speed, it indexed an entire 1 TB drive (SSD) in about 5 seconds. If it had to recurse, it would take longer, since there are thousands of small files, each with its own file size, date, etc., all spread across the disk.

Instead, Everything does not even touch the actual data; it reads and parses the 'Index' of the hard drive. For NTFS, the MFT stores all the file names and their physical locations (similar in concept to inodes on Linux). So all of that metadata sits in one small contiguous area (a file), and the indexer does not have to waste time finding where the info about the next file is; it does not have to seek. The MFT is contiguous by design (in the rare case where there are many more files and the MFT fills up or is corrupt, it links to a new one, which does cost a seek, but that edge case is very rare).

The MFT is not plain text; it needs to be parsed. The folks at Everything have designed a super-fast parser and decoder for the MFT, and hence all is well.

FSCTL_GET_NTFS_VOLUME_DATA (declared in winioctl.h) will get you the cluster locations for the MFT. Alternatively, you could use NTFSInfo (Microsoft Sysinternals - NTFSInfo v1.2), which reports, for example:

MFT zone clusters : 90400352 - 90451584
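The same numbers can be pulled programmatically. A minimal sketch of that ioctl (my own illustration, not Everything's code); opening the raw volume normally requires administrator rights:

    /* Sketch: query MFT location and geometry for volume C: via
     * FSCTL_GET_NTFS_VOLUME_DATA. Run elevated. */
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE vol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE,
                                 NULL, OPEN_EXISTING, 0, NULL);
        if (vol == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "open volume failed: %lu\n", GetLastError());
            return 1;
        }

        NTFS_VOLUME_DATA_BUFFER vdb;
        DWORD got;
        if (DeviceIoControl(vol, FSCTL_GET_NTFS_VOLUME_DATA, NULL, 0,
                            &vdb, sizeof vdb, &got, NULL)) {
            printf("MFT starts at cluster %lld\n", vdb.MftStartLcn.QuadPart);
            printf("MFT zone clusters: %lld - %lld\n",
                   vdb.MftZoneStart.QuadPart, vdb.MftZoneEnd.QuadPart);
            printf("bytes per cluster: %lu\n", vdb.BytesPerCluster);
        }
        CloseHandle(vol);
        return 0;
    }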

Storing and retrieving

The .db file from my index is at C:\Users\XXX\AppData\Local\Everything. I assume this is a regular NoSQL file-based database. Since it uses a DB and not a flat file, that contributes to the speed. Also, at program start it loads this db file into RAM, so queries never hit the DB on disk, only RAM. All of this combined makes it slick.

Batholith answered 16/6, 2022 at 14:31 Comment(5)
It probably would help to lead off with the fact that MFT stands for "Master File Table". And then since this is a programming-focused site, name some of the APIs used for MFT access. As it stands, this would be a great answer on SuperUser, but here it misses the point.Lucillalucille
Very helpful that you pointed this out. Will edit my answer to add the same.Batholith
One can also take advantage of the MFT without writing a custom parser, by using the FSCTL_ENUM_USN_DATA ioctl. This sacrifices some performance in exchange for portability to future revisions of the filesystem that may change the on-disk format.Lucillalucille
I believe the Everything.db file is an SQLite database. When I attempt to open it I'm prompted for a password.Heelandtoe
I agree and felt the same. From the data, I am quite sure it is SQLite or a derivative format, but I could not confirm it since I could not open the file with a regular viewer. I assume the dev may have customized the format. Thanks for sharing!Batholith
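Following up on the FSCTL_ENUM_USN_DATA suggestion in the comments, here is a rough sketch of that enumeration (my own illustration, untested against edge cases; MFT_ENUM_DATA_V0 needs a reasonably recent Windows SDK). It assumes an elevated volume handle opened as in the previous listing:

    /* Sketch: list every file name on an NTFS volume through
     * FSCTL_ENUM_USN_DATA, letting the driver walk the MFT instead of
     * parsing the raw on-disk format ourselves. */
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    void enum_names(HANDLE vol)        /* vol: handle to \\.\C: as above */
    {
        MFT_ENUM_DATA_V0 in = { 0 };   /* start at file reference 0 */
        in.HighUsn = MAXLONGLONG;      /* accept records at any USN */

        static BYTE buf[64 * 1024];
        DWORD got;

        while (DeviceIoControl(vol, FSCTL_ENUM_USN_DATA, &in, sizeof in,
                               buf, sizeof buf, &got, NULL)) {
            /* first 8 bytes of the output say where to resume next call */
            in.StartFileReferenceNumber = *(DWORDLONG *)buf;

            USN_RECORD *rec = (USN_RECORD *)(buf + sizeof(USN));
            while ((BYTE *)rec < buf + got) {
                /* FileName is counted UTF-16, not NUL-terminated */
                wprintf(L"%.*ls\n",
                        (int)(rec->FileNameLength / sizeof(WCHAR)),
                        (WCHAR *)((BYTE *)rec + rec->FileNameOffset));
                rec = (USN_RECORD *)((BYTE *)rec + rec->RecordLength);
            }
        }
    }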
  • How could it index files so efficiently?

First, it indexes only file/directory names, not contents.

I don't know if it's efficient enough for your needs, but the ordinary way is the FindFirstFile function. Write a simple C program to list folders/files recursively (see the sketch below), and see if it's fast enough for you. The next optimization step would be running threads in parallel, but disk access may well be the bottleneck, in which case multiple threads would add little benefit.
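A minimal sketch of that recursive listing (my own illustration; error handling and long-path support omitted):

    /* Sketch: recursively list a directory tree with the
     * FindFirstFile/FindNextFile API. Paths longer than MAX_PATH and
     * most error handling are ignored for brevity. */
    #include <windows.h>
    #include <stdio.h>
    #include <wchar.h>

    void walk(const wchar_t *dir)
    {
        wchar_t pattern[MAX_PATH];
        swprintf(pattern, MAX_PATH, L"%s\\*", dir);

        WIN32_FIND_DATAW fd;
        HANDLE h = FindFirstFileW(pattern, &fd);
        if (h == INVALID_HANDLE_VALUE)
            return;

        do {
            if (wcscmp(fd.cFileName, L".") == 0 ||
                wcscmp(fd.cFileName, L"..") == 0)
                continue;   /* skip the self and parent entries */

            wchar_t path[MAX_PATH];
            swprintf(path, MAX_PATH, L"%s\\%s", dir, fd.cFileName);
            wprintf(L"%ls\n", path);

            if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
                walk(path);   /* one recursion per subdirectory */
        } while (FindNextFileW(h, &fd));

        FindClose(h);
    }

    int main(void)
    {
        walk(L"C:\\Windows");   /* arbitrary starting folder */
        return 0;
    }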

If this is not enough, you could finally try digging into the lower-level Native API functions; I have no experience with these, so I can't give you further advice. You'd be pretty close to the metal, and maybe the Linux NTFS project has some concepts you'd need to learn.

  • What data structures does it use for file searching?
  • How can its file searching be so fast?

Well, you know there are many different data structures designed for fast searching... probably the author ran a lot of benchmarks.

Purism answered 7/12, 2017 at 2:54 Comment(0)
