How does database indexing work? [closed]

Given that indexing is so important as your data set increases in size, can someone explain how indexing works at a database-agnostic level?

For information on queries to index a field, check out How do I index a database column.

Napoleonnapoleonic answered 4/8, 2008 at 10:7 Comment(0)

Why is it needed?

When data is stored on disk-based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists: both contain a section for data and a pointer to the location of the next node (or block), and neither needs to be stored contiguously.

Because a set of records can only be physically sorted on one field at a time, searching on a field that the table is not sorted by requires a linear search, which needs (N+1)/2 block accesses on average, where N is the number of blocks that the table spans. If that field is a non-key field (i.e. it doesn’t contain unique entries), then the entire tablespace must be searched, at N block accesses.

With a sorted field, a binary search may be used instead, which requires log2 N block accesses. And because the data is sorted, even on a non-key field the rest of the table doesn’t need to be searched for duplicate values once a higher value is found. The performance increase is therefore substantial.
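To make the arithmetic concrete, here is a minimal Python sketch (not part of the original explanation; the block count N is just an assumed input) that computes the average block-access counts described above:

import math

def linear_search_accesses(n_blocks, unique_field=True):
    # A scan over a unique (key) field can stop at the first match,
    # so on average it reads about half the blocks; a non-key field
    # forces reading every block.
    return (n_blocks + 1) / 2 if unique_field else n_blocks

def binary_search_accesses(n_blocks):
    # Worst-case block accesses for a binary search over sorted blocks.
    return math.ceil(math.log2(n_blocks))

print(linear_search_accesses(1_000_000))         # 500000.5
print(linear_search_accesses(1_000_000, False))  # 1000000
print(binary_search_accesses(1_000_000))         # 20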

What is indexing?

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.
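As an illustration only (a real engine keeps the index on disk, typically in a B-tree, rather than in an in-memory list), the shape of the structure is a set of sorted (value, pointer) pairs that can be binary searched; the table rows and names below are made up:

import bisect

table = [                       # the heap: rows in insertion order
    {"id": 1, "firstName": "Zoe"},
    {"id": 2, "firstName": "Adam"},
    {"id": 3, "firstName": "Mia"},
]

# The "index": sorted (field value, row position) pairs.
name_index = sorted((row["firstName"], pos) for pos, row in enumerate(table))
keys = [value for value, _ in name_index]

def lookup(name):
    # Binary-search the index, then follow the pointer back to the table row.
    i = bisect.bisect_left(keys, name)
    if i < len(keys) and keys[i] == name:
        return table[name_index[i][1]]
    return None

print(lookup("Mia"))   # {'id': 3, 'firstName': 'Mia'}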

The downside to indexing is that these indexes require additional space on disk. With the MyISAM engine the indexes for a table are stored together in one file, so this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

How does it work?

Firstly, let’s outline a sample database table schema;

Field name       Data type      Size on disk
id (Primary key) Unsigned INT   4 bytes
firstName        Char(50)       50 bytes
lastName         Char(50)       50 bytes
emailAddress     Char(100)      100 bytes

Note: char was used in place of varchar to allow for an accurate size on disk value. This sample database contains five million rows and is unindexed. The performance of several queries will now be analyzed. These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field).

Example 1 - sorted vs unsorted fields

Given our sample database of r = 5,000,000 fixed-size records with a record length of R = 204 bytes, stored in a table using the MyISAM engine with the default block size B = 1,024 bytes, the blocking factor of the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.

A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see this is a drastic improvement.

Now the firstName field is neither sorted nor a key field, so a binary search is impossible and the values are not unique; thus the table will require searching to the end, an exact N = 1,000,000 block accesses. It is this situation that indexing aims to correct.
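The figures in this example can be reproduced with a few lines of Python (a sketch using the record and block sizes assumed above):

import math

r = 5_000_000   # records
R = 204         # record length in bytes (4 + 50 + 50 + 100)
B = 1_024       # block size in bytes

bfr = B // R                  # blocking factor: 5 records per block
N = math.ceil(r / bfr)        # 1,000,000 blocks to hold the table

print(N / 2)                     # id, key but unsorted: 500,000 accesses on average
print(math.ceil(math.log2(N)))   # id, sorted: binary search, 20 accesses
print(N)                         # firstName, non-key and unsorted: full scan, 1,000,000 accesses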

Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below;

Field name       Data type      Size on disk
firstName        Char(50)       50 bytes
(record pointer) Special        4 bytes

Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.

Example 2 - indexing

Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and using the default block size B = 1,024 bytes. The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks.

Now a search using the firstName field can utilize the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19 block accesses. Finding the address of the actual record then requires a further block access to read it, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required to find a firstName match in the non-indexed table.
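The same arithmetic applied to the index (again a sketch, using the sizes assumed above):

import math

r = 5_000_000   # records
R = 54          # index record length: 50-byte firstName + 4-byte pointer
B = 1_024       # block size in bytes

bfr = B // R                              # 18 index records per block
N = math.ceil(r / bfr)                    # 277,778 blocks to hold the index

index_accesses = math.ceil(math.log2(N))  # 19 block accesses to binary-search the index
print(index_accesses + 1)                 # +1 to fetch the actual record = 20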

When should it be used?

Given that creating an index requires additional disk space (277,778 extra blocks in the above example, a ~28% increase), and that too many indexes can cause issues arising from the file system's size limits, careful thought must be given to selecting the correct fields to index.

Since indexes are only used to speed up the search for a matching field within the records, it stands to reason that indexing fields used only for output is simply a waste of disk space and of processing time on insert and delete operations, and thus should be avoided. Also, given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas with a cardinality of 1,000 each value would still match roughly 5,000 of the 5,000,000 records. With such a low cardinality the effectiveness approaches that of a linear search, and the query optimizer will typically avoid using the index if the cardinality is less than roughly 30% of the record count, effectively making the index a waste of space.
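A rough way to see why cardinality matters (a sketch that assumes values are evenly distributed, which real data rarely is):

rows = 5_000_000

def rows_per_value(cardinality):
    # Estimated number of rows matching any one indexed value.
    return rows // cardinality

print(rows_per_value(2))          # 2,500,000 rows per value: the index barely narrows anything
print(rows_per_value(1_000))      # 5,000 rows per value
print(rows_per_value(5_000_000))  # 1 row per value: ideal for an index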

Napoleonnapoleonic answered 4/8, 2008 at 10:41 Comment(35)
binary search can be done when the data is unique, am i right? although you mentioned that minimum cardinality is important, the algorithm wouldn't be a simple binary search, how would this approximation (~log2 n) affect the process time?Scrutiny
This is also a great read: kylebanker.com/blog/2010/09/21/the-joy-of-mongodb-indexesChurchyard
@XenphYan - So an index is just a way to sort data in a column and keep that sort order handy for quickly accessing the column elements ? If we update a non-indexed column, then performance should not be affected, right ? Related question - #16125190Lindemann
"it stands to reason that indexing fields used only for output would be simply a waste" This is not entirely true if you consider covering indexes. A covering index will respond to queries by pulling values only from the index without needing to look up matching records.Cahra
@AbhishekShivkumar: Great question! I think the index table will have as many rows as there are in the data table. And as this field will have only 2 values (boolean with true/false), and say you want a record with value true, then you can only halve the result set in the first pass; in the second pass all your records have value true so there is no basis to differentiate, and you have to search the data table in linear fashion. Hence he said cardinality should be considered while deciding the indexed column. In this case, it's worthless to index on such a column. Hope I'm correct :)Cording
shouldn't the number of block accesses in the average case be (N+1)/2. If we sum the number of block accesses for all possible cases, and divide it by the number of cases, then we have N*(N+1)/(2*n) which comes out to be (N+1)/2.Caledonian
Higher cardinality fields can make use of bitmap indexes. It is designed specifically for fields which hold lots of duplicate values, e.g., a gender field.Versus
Finally clarified. I have one more question. If a table is small and will stay small (example a country list) it looks like not indexing at all is both a win in time and disk space. On huge tables indexing CHAR fields should be avoided if possible, especially if searches on CHAR fields are rather rare. That's probably why big forums only allow one keyword search every minute.Malodorous
I think there are a few typos in this answer, for example, in the sentence: "a far cry from the 277,778 block accesses required by the non-indexed table." doesn't the author mean 1,000,000 block accesses? 277,778 is the number of blocks required by the index itself. There seems to be a couple of other inaccuracies too :(Ingulf
The answer doesn't seem to explain that the index itself is sorted? I.e. they just jump to this explanation: "Now a search using the firstName field can utilise the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19 block accesses" - the index MUST be sorted in order for binary search to be possible, but I can't see that explained anywhere.Ingulf
It would also be great if there was an explanation of how the indexing works on multiple fields, the answer only explains indexing of a single field, the firstNameIngulf
@Ingulf He explained it in the "What is indexing section" - "Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it."Leekgreen
In the second example, why do we need to store the firstName? Isn't the pointer enough? We'll just have to simply change the comparison function to (*pointer).firstName.Immoral
Can you tell us the references from which one can learn these ?Brant
fyi, binary search is not used on b-tree lookups. The performance would be terrible when the index does not fit in memory. The reason for 'b-tree indexes' is to be fast even when not all the index fits in available memory. This is about 'fan out' of keys to where to find data. Binary tree has a 'fan out' of 2. 'b-trees' can have fan-outs of 'rather more' than that 10's or greater - it depends on key size and 'page size'.Treasurehouse
Can you explain why "searching on a field that isn’t sorted requires a Linear Search which requires N/2 block accesses (on average)"? Why is it N/2 and not N?Ringtailed
@HeyJude I think he means Average.Bushido
@Ingulf (and future readers) Re: comment about 1,000,000 vs 277,778 - I just reversed an edit making this change because I thought the reference was correct. On re-reading both the answer and your comment more carefully though, you are absolutely correct, it's talking about the original unsorted field and not the index size. I changed the number back (oops) to 1 million and re-worded things slightly (e.g. replaced 'table' with 'index') to clarify the context of the 277,778Harbor
Am I missing something or When should it be used? point doesn't answer the question. It rather describes "When should't it be used?". So when it should be used?Invertase
Just to be clear, when we say Creating an index on a field in a table creates another data structure which holds the field value, and pointer to the record it relates to, this is for a non-clustered index. For a clustered index, of which there can be only one per table, this is not true, since the original data itself is stored in sorted order using it.Thorny
How can we run binary serch on sorted data if they are not stored in contiguous blocks ?Sidon
I tried indexing on a non-key field, things are going worse for select query. How do I understand this?Squiffy
I lost you after first paragraph Due to the fact that a number of records can only be sorted on one field, can you explain this or correct the sentence.Manipular
index can be hash index, not necessary has to be sorted. What you mentioned is just a kind of index.Bea
Thank you for the brilliant, thorough explanation. If I could posit an explanation of my own (probably inaccurate and wrong): indexing on firstName is like adding a list of sorted ids to firstName that correlate to the entries of that field, which enables Binary Search?Maneuver
Can someone explain why searching on a field that isn’t sorted but has unique entries requires N/2 block accesses on average and searching on a field that is a non-key field (doesn't contain unique entries) requires N block accesses? Wouldn't they both require N since it's a linear search?Compete
@mouscous Suppose you have a list of unique numbers in random order. If you want to know if a value is in there you look at the 1st one, then the 2nd one, 3rd etc... until you find a match. Once found there is no need to continue looking, the value is unique and only present once. Sometimes you'll have a match early on, sometimes it takes a while to find it but on average you'll traverse half the list. If the values are not unique you may find it in the first spot, but it could be there's another occurrence further on! To find them all you need to go through the entire list every time. Thus N.Selfloading
@rouscous if you look up on a list with unique values for all records with a certain value, then you start from the beginning and search each record one by one. When you find the record with the value you want, then... you stop. Why do you stop? Because you know the value is unique and therefore no more records with that value exist. Since the values are no sorted there is equal propability to find that record in the first, second, etc till the last. This means you may stop searching after the 1st read. Sometimes you may stop after the 100th read etc. On average you will stop around n/2 reads.Darelldarelle
@rouscous But when the values are not unique you start searching from the 1st record. You go to the 2nd. Then, in the i-th record you find the value you are looking for. You are ready to stop searching but then you think... hmm, those values are not unique.... Maybe there are other records with that value... I better keep searching.... So effectively, you end up searching all of the records each time! So on average (actually it is the same every time) you search n records. So with unique value you stop searching when you find the one. With non unique, you don't stop until you search them all.Darelldarelle
@AnmolSinghJaggi Accessing data on the file system is not the same as accessing it in memory. (*pointer).firstName implies that you will read one block (to read the pointer) and again another block (to read the firstName). Moreover it beats the very purpose of the index. An index is a sorted catalog (simplistic view but needed to understand it).Darelldarelle
@AnmolSinghJaggi For instance, imagine this structure. A person has a name (the search field), an address (the pointer that points where you can find that person) and an age. You want to know the age of mr John Smith. The proper way to do that is to take the address catalog (a huge book that maps names to addresses) which is the INDEX in our case. You look for mr smith in the catalog. Because it is sorted, you don't need to look the entire of it. You can find it with 3-5 page turns. Then you know the address. You go there, and extract your information.Darelldarelle
Hmm. I'm confused. This answer talks about "Given that an index record contains only the indexed field and a pointer to the original record," but what if the index is a column that does not have unique values? (e.g. out of 1 million records, 37000 have the value USA and they're scattered throughout the 1 million records rather than distributed evenly)Thenceforward
This answer tackles disk size issue, but skirts over the (most often) bigger issue: Writes and deletes take longer when there are more indexes. That's the real reason to not just index every field.Dandiprat
Will not using multi-level index shorten the access time? Since the blocking factor is 18, the extra memory required in the previous level would be ( 277,778/18). This would give a GP, whose sum will be less than 277,778 * 19/18. So, by spending 5% extra space, we would be able to reduce the number of accesses from 20 to log(18) 277,778 = 5, A 4- fold improvement in access time.Ac
When indexing happens on the database?Brachio

Classic example "Index in Books"

Consider a "Book" of 1000 pages, divided by 10 Chapters, each section with 100 pages.

Simple, huh?

Now, imagine you want to find a particular chapter that contains the word "Alchemist". Without an index page, you have no option other than scanning through the entire book, i.e. all 1,000 pages.

This approach is known as a "full table scan" in the database world.


But with an index page, you know where to go! To look up any particular chapter, you only need to look over the index page each time. After finding the matching index entry you can efficiently jump to that chapter, skipping the rest.

But then, in addition to the actual 1,000 pages, you will need another ~10 pages for the index, so 1,010 pages in total.

Thus, the index is a separate section that stores the values of the indexed column, plus a pointer to the indexed row, in sorted order for efficient look-ups.

Things are simple in school, aren't they? :P

Landeros answered 23/4, 2017 at 14:43 Comment(6)
really nice analogy! funny i didn't make the connection between a book index and a db indexVigor
This makes me think Library or Grocery Store Could you image not having an index at a grocery store? Where's The Beef?!? Oh its next to the Restrooms, a mop, and makeupCounteraccusation
"But with an index page at the beginning, you are there." What does "you are there" mean?Maneuver
Indices usually go at the back of books, while a table of contents goes in the front. But, that makes the analogy even better, since column order shouldn't matter.Zingaro
I still do not exactly understand, so if there are n unique words how would index help me? it creates pointer for each word? If so it takes a lot of time to find that pointer maybe even same time then just scroll everything and find it in a default wayProdigious
@Prodigious If there are n unique words in a field A, the index would still help in that they get to store pointers to each of the unique words in field A but they (indexes) are sorted and stored in a field B. Naturally, I believe a sorted field will be faster to traverse in search of a record (word) than an unsorted one. I hope that helps.Insectile

An index is just a data structure that makes searching faster for a specific column in a database. This structure is usually a B-tree or a hash table, but it can be any other logical structure.

Thiazole answered 20/2, 2014 at 14:40 Comment(3)
+1 times a million for this answer, as I found this listing while trying to find a simple explanation what indexing essentially is.Angeliaangelic
Let's note that "just a data structure" doesn't mean "additional to the data". Some times it is (e.g. "non-clustered index"), some times it determines the layout of the data (e.g. "clustered index").Sexpartite
This is the best answer, an Index is basically like a Hashmap in which a get has O(1) complexity, whereas searching in a List is O(N)Razorback

The first time I read this it was very helpful to me. Thank you.

Since then I have gained some insight into the downside of creating indexes: if you write into a table (UPDATE or INSERT) that has one index, you actually have two write operations in the file system: one for the table data and another for the index data (and the re-sorting of it, and, if clustered, the re-sorting of the table data). If the table and index are located on the same hard disk, this costs more time. Thus a table without an index (a heap) allows for quicker write operations. (If you had two indexes you would end up with three write operations, and so on.)

However, defining two different locations on two different hard disks for index data and table data can reduce or eliminate the problem of this increased time cost. This requires defining additional file groups with corresponding files on the desired hard disks and defining the table/index location as desired.

Another problem with indexes is their fragmentation over time as data is inserted. REORGANIZE helps, but you must write routines to have it done.

In certain scenarios a heap is more helpful than a table with indexes, e.g. if you have lots of competing writes but only one nightly read outside business hours for reporting.

Also, a differentiation between clustered and non-clustered indexes is rather important.

Helped me:- What do Clustered and Non clustered index actually mean?

Peon answered 30/4, 2013 at 14:31 Comment(4)
I think, these indexing issues can be resolved by maintaining two different databases, just as Master and Slave. Where Master can be used to insert or update records. Without indexing. And slave can be used to read with proper indexing right???Refulgent
no, wrong, sorry. not just the content of the tables must be updated, but also the index structure and content (b-tree, nodes). your concept of master and slave makes no sense here. what can be feasable though is replicating or mirroring to a second database on which analytics take place to take that workload away from the first database. that second database would hold copies of data and indexes on that data.Peon
Ya...! Try to read my comment and understand it properly. I also said the same, I referred to master and slave (whatever) as "eplicating or mirroring to a second database on which analytics take place to take that workload away from the first database. that second database would hold copies of data and indexes on that data"Refulgent
the second database - to which mirroring or replicating is done, the slave - would experience all the data manipulation as the first one does. with each dml-operation the indexes on that second database would experience "these indexing issues". i don't see the gain in that, where ever the indexes are needed and built for quick analysis they need to be kept up to date.Peon

Now, let’s say that we have an Employee table and we want to run a query to find all the details of any employees who are named ‘Abc’:

SELECT * FROM Employee 
WHERE Employee_Name = 'Abc'

What would happen without an index?

Database software would literally have to look at every single row in the Employee table to see if the Employee_Name for that row is ‘Abc’. And, because we want every row with the name ‘Abc’ inside it, we cannot just stop looking once we find one row with that name, because there could be other rows with the name ‘Abc’. So, every row up until the last row must be searched, which means thousands of rows in this scenario will have to be examined by the database to find the rows with the name ‘Abc’. This is what is called a full table scan.

How a database index can help performance

The whole point of having an index is to speed up search queries by essentially cutting down the number of records/rows in a table that need to be examined. An index is a data structure (most commonly a B-tree) that stores the values for a specific column in a table.

How does a B-tree index work?

The reason B-trees are the most popular data structure for indexes is that they are time efficient: look-ups, deletions, and insertions can all be done in logarithmic time. Another major reason B-trees are so commonly used is that the data stored inside a B-tree can be kept sorted. The RDBMS typically determines which data structure is actually used for an index, but in some scenarios, with certain RDBMSs, you can specify which data structure you want your database to use when you create the index itself.

How does a hash table index work?

The reason hash indexes are used is because hash tables are extremely efficient when it comes to just looking up values. So, queries that compare for equality to a string can retrieve values very fast if they use a hash index.

For instance, the query we discussed earlier could benefit from a hash index created on the Employee_Name column. The way a hash index would work is that the column value will be the key into the hash table and the actual value mapped to that key would just be a pointer to the row data in the table. Since a hash table is basically an associative array, a typical entry would look something like "Abc => 0x28939", where 0x28939 is a reference to the table row where Abc is stored in memory. Looking up a value like "Abc" in a hash table index and getting back a reference to the row in memory is obviously a lot faster than scanning the table to find all the rows with a value of "Abc" in the Employee_Name column.

The disadvantages of a hash index

Hash tables are not sorted data structures, and there are many types of queries which hash indexes cannot help with. For instance, suppose you want to find all of the employees who are less than 40 years old. How could you do that with a hash table index? It's not possible, because a hash table is only good for looking up key-value pairs, which means it only helps queries that check for equality.
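Here is a small sketch (not from the original text; the rows and ages are made up) contrasting the two: a Python dict stands in for a hash index, and a sorted list for a tree-style index that supports range scans:

from bisect import bisect_left

rows = [("Abc", 35), ("Def", 52), ("Ghi", 28), ("Abc", 41)]   # (name, age) rows

# Hash-style index on name: great for equality lookups, even with duplicate values.
hash_index = {}
for pos, (name, _) in enumerate(rows):
    hash_index.setdefault(name, []).append(pos)
print([rows[p] for p in hash_index.get("Abc", [])])   # both "Abc" rows, no table scan

# But a hash index cannot answer "age < 40"; that needs a sorted (tree-like) index.
age_index = sorted((age, pos) for pos, (_, age) in enumerate(rows))
cutoff = bisect_left(age_index, (40, -1))
print([rows[pos] for _, pos in age_index[:cutoff]])   # rows with age < 40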

What exactly is inside a database index? So, now you know that a database index is created on a column in a table, and that the index stores the values in that specific column. But, it is important to understand that a database index does not store the values in the other columns of the same table. For example, if we create an index on the Employee_Name column, this means that the Employee_Age and Employee_Address column values are not also stored in the index. If we did just store all the other columns in the index, then it would be just like creating another copy of the entire table – which would take up way too much space and would be very inefficient.

How does a database know when to use an index? When a query like “SELECT * FROM Employee WHERE Employee_Name = ‘Abc’ ” is run, the database will check to see if there is an index on the column(s) being queried. Assuming the Employee_Name column does have an index created on it, the database will have to decide whether it actually makes sense to use the index to find the values being searched – because there are some scenarios where it is actually less efficient to use the database index, and more efficient just to scan the entire table.

What is the cost of having a database index?

It takes up space, and the larger your table, the larger your index. Another performance hit with indexes is the fact that whenever you add, delete, or update rows in the corresponding table, the same operations will have to be done to your index. Remember that an index needs to contain the same up-to-the-minute data as whatever is in the table column(s) that the index covers.

As a general rule, an index should only be created on a table if the data in the indexed column will be queried frequently.

See also

  1. What columns generally make good indexes?
  2. How do database indexes work
Tereus answered 13/8, 2016 at 18:36 Comment(9)
"a database index does not store the values in the other columns " -- not true.Eudemon
@mustaccio: Index stores reference of row with the indexed columns only (as far I know). I might be wrong. Do you have any reference which says index stores other columns values?Tereus
@To Downvoters : Can you just explain what's wrong so that I can improve?Tereus
Check for example SQL Server clustering indexes or DB2's CREATE INDEX ... INCLUDE clause. You have too many generalizations in your answer, in my view.Eudemon
@mustaccio: So by default create index does not include the other columns and why it should. If we did just store all the other columns in the index, then it would be just like creating another copy of the entire table, which would take up way too much space and would be very inefficient.. This is more generalized version of indexes. CREATE INDEX ... INCLUDE is the newer version by considering other columns. Post I have explained is considering more generalized version. How indexes work would be one book if we consider all the databases? Isn't it? Do you think answer deserves downvote?Tereus
This is basically extracted from programmerinterview.com/index.php/database-sql/what-is-an-index Perhaps you could add thatThiazole
@SomnathMuluk there's a concept called "covering index". Sometimes we would want to add more fields to our index, if we know that many queries will use SELECT with those fields only, thus allowing us to avoid reading the actual row. Naturally, this have a storage-cost (larger index size).Sanchez
copy paste. Very unethical way to get upVotes.Blissful
"Looking up a value like “Abc” in a hash table index and getting back a reference to the row in memory" -- you mention "to the row", but that only works if the value is unique; how do indexes work when there are multiple values?Thenceforward

Simple Description!

An index is nothing but a data structure that stores the values for a specific column in a table; it is created on a column of that table.

Example: We have a database table called User with three columns – Name, Age and Address. Assume that the User table has thousands of rows.

Now, let’s say that we want to run a query to find all the details of any users who are named 'John'. If we run the following query:

SELECT * FROM User 
WHERE Name = 'John'

The database software would literally have to look at every single row in the User table to see if the Name for that row is ‘John’. This will take a long time.

This is where an index helps us: an index is used to speed up search queries by essentially cutting down the number of records/rows in a table that need to be examined.

How to create an index:

CREATE INDEX name_index
ON User (Name)

An index consists of column values (e.g. John) from one table, and those values are stored in a data structure.

So now the database will use the index to find users named John, because the index will presumably be sorted alphabetically by the user's Name. And because it is sorted, searching for a name is a lot faster, since all names starting with a "J" will be right next to each other in the index!
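That "all the Js sit next to each other" property is easy to demonstrate with a plain sorted list (an illustrative sketch, not how the engine actually stores the index):

from bisect import bisect_left, bisect_right

name_index = sorted(["Alice", "Bob", "Jack", "James", "John", "Judy", "Zara"])

# Two binary searches bound the contiguous run of names starting with "J".
lo = bisect_left(name_index, "J")
hi = bisect_right(name_index, "J\uffff")
print(name_index[lo:hi])   # ['Jack', 'James', 'John', 'Judy']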

Scholasticism answered 2/8, 2016 at 1:30 Comment(4)
An index doesn't imply sorting order on the columnWallop
Thanks. This helped my understanding. So basically an index is a replica of the column data that has been sorted. Normally the column data is just in the order the data was inserted.Pessa
does this mean internally, a separate table is maintained for each name, eg Name=John has it's own tableVirgilvirgilia
"The index is nothing but a data structure that stores the values for a specific column in a table" -- why do you say that? I don't think the value is enough; instead it would have to store a reference to a row/record in the table. If I have a table with 10 columns and one of them is COUNTRY_CODE, the index can't just store the values of COUNTRY_CODE, it would have to store a reference to the table rows. Otherwise if you do a SELECT of another column but join/select on COUNTRY_CODE you would not be able to make use of the COUNTRY_CODE values alone.Thenceforward

Just think of Database Index as Index of a book.

If you have a book about dogs and you want to find information about, let's say, German Shepherds, you could of course flip through all the pages of the book and find what you are looking for, but this is time-consuming and not very fast.

Another option is to go to the index section of the book and find what you are looking for by using the name of the entity you are looking for (in this instance, German Shepherds), and then look at the page number to quickly find the relevant pages.

In a database, the page number is referred to as a pointer, which directs the database to the address on the disk where the entity is located. Using the same German Shepherd analogy, we could have something like ("German Shepherd", 0x77129), where 0x77129 is the address on the disk where the row data for German Shepherd is stored.

In short, an index is a data structure that stores the values for a specific column in a table so as to speed up query search.

Immortality answered 21/12, 2016 at 17:16 Comment(0)
