Why does adding an index to a database field speed up searching over that field?
I'm new to databases and have been reading that adding an index to a field you need to search over can dramatically speed up search times. I understand that this is true, but am curious how it actually works. I've searched a bit on the subject, but haven't found a good, concise, not overly technical answer to how it works.

I've read the analogy of it being like an index at the back of a book, but in the case of a data field of unique elements (such as e-mail addresses in a user database), the back-of-the-book analogy would seem to provide the same linear lookup time as a non-indexed search.

What is going on here to speed up search times so much? I've read a little bit about searching using B+ trees, but the descriptions were a bit too in-depth. What I'm looking for is a high-level overview of what is going on, something to help my conceptual understanding of it, not technical details.

Isleana answered 27/9, 2012 at 1:19 Comment(0)
Expanding on the search-algorithm efficiencies, a key factor in database performance is how fast the data can be accessed. In general, reading data from disk is much slower than reading data from memory.

To illustrate the point, let's assume everything is stored on disk. If you need to search through every row of data in a table looking for certain values in a field, you still need to read the entire row of data from disk to see if it matches - this is commonly referred to as a 'table scan'.
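A table scan can be sketched in plain Python (a toy illustration, not how a real engine stores rows; the schema here is made up):

```python
# Toy "table": a list of rows, each row a dict (hypothetical schema).
table = [
    {"id": 1, "email": "alice@example.com", "name": "Alice"},
    {"id": 2, "email": "bob@example.com",   "name": "Bob"},
    {"id": 3, "email": "carol@example.com", "name": "Carol"},
]

def table_scan(table, field, value):
    """Read every row and test the field -- touches all n rows in the worst case."""
    for row in table:
        if row[field] == value:
            return row
    return None

print(table_scan(table, "email", "bob@example.com"))
```

On disk, "read every row" means pulling every page of the table off the disk, which is where the 100MB figure below comes from.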

If your table is 100MB, that's 100MB you need to read from disk.

If you now index the column you want to search on, in simplistic terms the index will store each unique value of the data and a reference to the exact location of the corresponding full row of data. This index may now only be 10MB compared to 100MB for the entire table.
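In the same toy model, an index can be sketched as a sorted list of (value, row location) pairs - far smaller than the rows themselves, since it holds only one column plus a pointer (again a simplification; real engines use structures like B+ trees):

```python
import bisect

# Toy table: rows are in no particular order on "disk".
table = [
    {"id": 1, "email": "carol@example.com"},
    {"id": 2, "email": "alice@example.com"},
    {"id": 3, "email": "bob@example.com"},
]

# Build the index once: each entry is (indexed value, row position).
index = sorted((row["email"], pos) for pos, row in enumerate(table))

def index_lookup(index, table, value):
    """Binary-search the small index, then fetch only the one matching row."""
    keys = [k for k, _ in index]
    i = bisect.bisect_left(keys, value)
    if i < len(index) and index[i][0] == value:
        return table[index[i][1]]
    return None

print(index_lookup(index, table, "bob@example.com"))
```

The search reads only the compact index plus one row, instead of every row in the table.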

Reading 10MB of data from the disk (and maybe a bit extra to read the full row data for each match) is roughly 10 times faster than reading the 100MB.

Different databases will store indexes or data in memory in different ways to make these operations much faster. However, if your data set is large and doesn't fit in memory, then disk speed can have a huge impact and indexing can show huge gains. Even fully in memory, indexing can still yield large performance gains (among other efficiencies).

In general, that's why you may not notice any tangible difference with indexing a small dataset which easily fits in memory.

The underlying details will vary between systems and actually will be a lot more complicated, but I've always found the disk reads vs. memory reads an easily understandable way of explaining this.

Pompadour answered 27/9, 2012 at 5:6 Comment(0)
Okay, after a bit of research and discussion, here is what I have learned:

Conceptually, an index is a sorted copy of the data field it is indexing, where each index value points to its original (unsorted) row. Because the database knows the values are sorted, it can apply more sophisticated search algorithms than just looking for the value from start to finish. The binary search algorithm is a simple example of a search algorithm for sorted lists, and it reduces the maximum search time from O(n) to O(log n).
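A minimal sketch of binary search over a sorted list (the standard algorithm, written out in Python for illustration):

```python
def binary_search(sorted_list, target):
    """Return the position of target in sorted_list, or -1 if absent.
    Each comparison halves the remaining range, so it takes O(log n) steps."""
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return -1

emails = sorted(["carol@x.com", "alice@x.com", "bob@x.com"])
print(binary_search(emails, "bob@x.com"))  # 1
```

For a million sorted values this is at most ~20 comparisons, versus up to a million for a start-to-finish scan.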

As a side note: a decent sorting algorithm generally takes O(n log n) to complete, which means (as we've all probably heard before) you should only put indexes on fields you will search often, since adding the index (which includes a sort) is more expensive than doing a full scan a few times. For example, in a large database of >1,000,000 entries it's on the order of 20x more expensive to sort than to scan once.
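That 20x figure falls out of comparing the O(n log n) sort cost against a single O(n) linear scan - the ratio is log₂ n (treating the big-O constants as equal, which is a rough back-of-the-envelope assumption):

```python
import math

n = 1_000_000
sort_cost = n * math.log2(n)   # ~ comparisons to build the sorted index
scan_cost = n                  # ~ comparisons for one full linear scan

print(sort_cost / scan_cost)   # log2(1_000_000), roughly 19.93
```

So around n = 1,000,000 the sort costs about 20 scans' worth of work, and it only pays off once the field is searched more often than that.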

Edit: See @Jarod Elliott's answer for a more in-depth look at search efficiencies, specifically in regard to read from disk operations.

Isleana answered 27/9, 2012 at 4:48 Comment(0)
To continue your back-of-the-book analogy, if the pages were in order by that element it would be the same look-up time as a non-indexed search, yes.

However, what if your book were a list of book reviews ordered by author, but you only knew the ISBN. The ISBN is unique, yes, but you'd still have to scan each review to find the one you are looking for.

Now, add an index at the back of the book, sorted by ISBN. Boom, fast search time. This is analogous to the database index, going from the index key (ISBN) to the actual data row (in this case a page number of your book).

Exchequer answered 27/9, 2012 at 1:29 Comment(7)
This still doesn't provide a sufficient answer. In a table, things are stored as fields (columns), so we can think of a data field as a chapter in a book. So if we go to the email chapter of the book, it's still just as fast to look there for an e-mail as it is in the index of the book. We don't scan the whole table for an item we want to find... just the relevant field. – Isleana
So you are suggesting to store ALL the data again for each row in each chapter? Such that you have a "last name" chapter, sorted by last name, listing first name, last name, DOB, birthplace, username, email, and a 1000-word biography. Then you have a "username" chapter, sorted by username, again listing first name, last name, DOB, birthplace, username, email, and a 1000-word biography. Then you have an "email" chapter, sorted by email, listing first name, last name, DOB, birthplace, username, email, and a 1000-word biography. This seems like a highly inefficient use of space... – Exchequer
Ok, think of it this way. We have a book that consists only of unique e-mail addresses (no repeats). That is it, no other content. In this book, if we had an index, it would be an exact copy of the book's contents, only sorted somehow (though that depends on whoever makes the index). So, in this case, searching for an e-mail address in the book or the index is equivalent. This is why I say the book-index analogy fails. There obviously is more to it than that, as an indexed database search will find an e-mail much quicker than a full scan. – Isleana
That's because it has no way of knowing the emails are in order until you put the index on it. Without the index, it would have to check every row from start to finish. With the index, it can just find it. Consider the same book analogy, but, whether they are or are not, you didn't know the emails were listed in order. How would you find the one you were looking for? Naturally, you would have to start at the beginning and scan every line on every page until you found it, right? – Exchequer
I never said the e-mails are in any particular order. Also, if you have an index using the back-of-the-book analogy, you still have to scan every item in the index until you find the value you want. It's still a full-table scan... no gains. I understand the book analogy quite well, and it simply doesn't work here. – Isleana
That's not true, and I hope you don't scan every item in the index when you use one in real life. You start somewhere in the middle and compare where your value should be vs. where you opened to. If it's earlier, you flip somewhere earlier and repeat the process. Google "binary search" if you want help understanding this, but try looking up "xylophone" in a paper dictionary and see what you do to find it. – Exchequer
Let us continue this discussion in chat. – Isleana

© 2022 - 2024 — McMap. All rights reserved.