What's the difference between an inverted index and a plain old index?
P

9

124

In software engineering we create indexes all the time (e.g., in databases) but I also hear a lot of people talk about inverted indices. Is there something fundamentally different between the two? They sound like the same thing.

Principle answered 11/10, 2011 at 14:30 Comment(3)
en.wikipedia.org/wiki/Inverted_index - Piet
To clarify, you're asking: what's different about a normal index (en.wikipedia.org/wiki/Index_%28database%29) that breaks down a table based on data that already exist in that table? Is that correct? - Galimatias
@Principle What everyone failed to mention (though normalocity partially describes it by examples and lovesh is pretty much on the button) is that inverted indexes "invert" the basic data to be more efficient (e.g. swap keys/data to search from different perspective or ordering alphabetically/numerically to allow fast search algorithms), whereas a standard index stores data as it finds it. The "backward/forward" references and literal meaning of the word "invert" do not apply here, instead it refers to inversion of data to produce an efficient format specific to the task at hand. - Abamp
O
257

One common use is "...to allow fast full-text searching."

The two types denote directionality. One takes you forward through the index, and the other takes you backward (the inverse) through the index. That's it. There's no mystery to uncover here. Otherwise the two types are identical, it's just a question of what information you have, and as a result what information you're trying to find.

To address your inquiry, I don't think there's actually a way to know why the use is what it is today. The only reason it's important to define which is forward and which one is inverted is so that we can all have a conversation about them, and everyone knows which direction we're talking about. Think about the terms "left" and "right": they are relative. Which is which doesn't matter, except that everyone needs to agree which one is "left" and which one is "right" in order for the words to have meaning. If, as a culture, we decided to flip left and right, then you'd have the same issue figuring out what a "right turn" vs a "left turn" is since the agreed upon meaning had changed. However, the naming is arbitrary, so which one is which (in and of itself) doesn't matter - what matters is that we all agree on the meaning.

In your comment where you ask, "please don't just define the terms", you're missing the point, and I think you're just getting hung up on the wording when there is absolutely no difference between them.


For the benefit of future readers, I will now provide several "forward" and "inverted" index examples:

Example 1: Web search

If you're thinking that the inverse of an index is something like the inverse of a function in mathematics, where the inverse is a special thing that has a different form, then you're mistaken: that's not the case here.

In a search engine you have a list of documents (pages on web sites), where you enter some keywords and get results back.

A forward index (or just index) is the list of documents, and which words appear in them. In the web search example, Google crawls the web, building the list of documents, figuring out which words appear in each page.

The inverted index is the list of words, and the documents in which they appear. In the web search example, you provide the list of words (your search query), and Google produces the documents (search result links).

They are both indexes - it's just a question of which direction you're going. Forward is from documents->to->words, inverted is from words->to->documents.
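As a rough illustration of the two directions, here is a minimal Python sketch. The page names and word lists are made up, and a real search engine stores far more than bare word lists, so treat this as a toy rather than how Google does it:

```python
from collections import defaultdict

# Forward index: document -> words it contains (toy data, purely illustrative).
forward_index = {
    "page1.html": ["cats", "chase", "mice"],
    "page2.html": ["dogs", "chase", "cats"],
    "page3.html": ["mice", "eat", "cheese"],
}


def invert(forward):
    """Build the inverted index: word -> sorted list of documents containing it."""
    inverted = defaultdict(set)
    for doc, words in forward.items():
        for word in words:
            inverted[word].add(doc)
    return {word: sorted(docs) for word, docs in inverted.items()}


inverted_index = invert(forward_index)

# Forward lookup: which words are on page1.html?
print(forward_index["page1.html"])  # ['cats', 'chase', 'mice']
# Inverted lookup: which pages mention "cats"?
print(inverted_index["cats"])       # ['page1.html', 'page2.html']
```

Both structures are plain mappings; the only difference is which side is the key.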

Example 2: DNS

Another example is a DNS lookup (which takes a host name, and returns an IP address) and a reverse lookup (which takes an IP address, and gives you the host name).

Example 3: A book

The index in the back of a book is actually an inverted index, as defined by the examples above - a list of words, and where to find them in the book. In a book, the table of contents is like a forward index: it's a list of documents (chapters) which the book contains, except instead of listing the words in those sections, the table of contents just gives a name/general description of what's contained in those documents (chapters).

Example 4: Your cell phone

The forward index in your cell phone is your list of contacts, and which phone numbers (cell, home, work) are associated with those contacts. The inverted index is what allows you to manually enter a phone number, and when you hit "dial" you see the person's name, rather than the number, because your phone has taken the phone number and found you the contact associated with it.
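The phone example can be sketched the same way. The names, numbers, and the `describe_number` helper below are hypothetical, chosen only to show the reverse lookup:

```python
# Forward index: contact -> phone numbers (hypothetical names and numbers).
contacts = {
    "Alice": {"cell": "555-0101", "home": "555-0102"},
    "Bob": {"cell": "555-0199"},
}

# Inverted index: phone number -> (contact, label), built once from the contact list.
caller_id = {
    number: (name, label)
    for name, numbers in contacts.items()
    for label, number in numbers.items()
}


def describe_number(number):
    """Return what the dialer would display for a number."""
    if number in caller_id:
        name, label = caller_id[number]
        return f"{name} ({label})"
    return number  # unknown numbers are shown as typed


print(describe_number("555-0102"))  # Alice (home)
print(describe_number("555-0000"))  # 555-0000
```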

Outmoded answered 2/12, 2011 at 18:32 Comment(15)
Thank you for your time, but your answer is still uninformative. As I mentioned in my bounty request, I DO understand what the terms involved mean and why they arise. My question was: "why did the people who named inverted indexes call them inverted when we have a long standing tradition which calls them just plain indexes? For example, indexes at the end of books, as you point out, are actually inverted. Going by historical perspective, the indexes at the end of books came before web indexes. Then why invert tradition?". My guess is that it was just one of those things that just happened... - Miliary
Yup, and I read that part of your post. My examples are more for the benefit of future readers of your question. - Outmoded
As far as trying to answer the "why" question, I addressed that in my second and third paragraphs. Simply put, I don't think it's possible to know why without conducting a historical examination of the use of the terms, and how that use changed. I've also (just now) made a few adjustments to my original answer that reflect these comments. - Outmoded
Pointing out the traditional use is valid only if you are arguing which one people "should" use now, but it doesn't help you know why one was chosen over the other, or why it was flipped on its head at some later time. The naming is arbitrary, and has no meaning in and of itself, and therefore, there is no answer to "why". The only purpose of the naming is to facilitate conversation about the two concepts, such that when a person says "inverted" or "forward" we all have the same concept in mind. - Outmoded
I won't be awarding you the bounty since you didn't really answer the why part, but you should be getting half of the bounty anyway (since yours is the highest voted answer). And I think you deserve it too, since going by all the other answers on this page you seem to be the only person who is at least getting the intent of the question :) And again, thank you for your time. - Miliary
"I don't think it's possible to know why without conducting a historical examination of the use of the terms" -- I'd have hoped someone would conduct such a historical examination and give an answer. :-) Because this being opposite to the common-language meaning of "index" is surprising. (One possible answer is that when the phrase "inverted index" was first thought of, the phrase "index" was already being used for some "index" inverted wrt "inverted index", i.e., inverted wrt the real-life meaning of "index". In that case, it would be useful to know why the forward "index" got the strange name.) - Mb
@Outmoded Just wondering why forward indexing should be used at all. I am talking specifically about the web search example. If Google, as part of forward indexing, builds the list of documents <-> words in them, but ultimately uses the list of words <-> documents in its search, why build the list of documents <-> words at all? In other words: a user cannot ask Google which words appear in a particular page (document); they mainly ask in which pages their keywords occur. So why do forward indexing? - Fill
How is the TOC of a book a forward index and the Index at the back a reverse index? In both cases we are searching for a word (or words), and we get the page number as the result, right? So they can't be different. Either both are forward, or both inverted. My thought is they are both inverted indices. - Hysterogenic
So in the context of relational databases there's no inverted index? Or are those indexes actually 'inverted indexes'? The problem with "agreed" terms in the literature is ignorance, mistakes, or deliberate choices by a few pioneers or corporations who start a different convention, which part of the community then follows. Everyone gets confused after some time. I'm sure there are many terms in software that were originally meant to be, let's say, A, but that a different community deliberately or mistakenly takes as A' or B (syntactically, of course). It still confuses the hell out of new learners. - Arvad
@jefflunt, Google's white paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", mentions the forward index and the inverted index. In the context of implementing search engine indexing, do we remove entries from the forward index file after using their information to build the inverted index? - Gallager
@Roylee - I've not read that white paper. I think what you're asking is, "Do you update the inverted index when updating the forward index?" If that's your question, then the answer is yes. - Outmoded
@Outmoded Forgive me for my lack of clarity. It's actually "Do you remove the forward index when the inverted index has been updated?" - Gallager
@Roylee - either way, I think the answer is the same. The directionality isn't important, and both sides of the index should match one another. I'm sorry - I'm not sure I'm actually answering your question. - Outmoded
@Arvad I agree - it seems to me that by the definition of 'inverted index', standard indexes on RDBMS columns are inverted indexes, because you use the column values to give you a list of rows in which they appear. I think the confusion arises because the index that the 'inverted index' is the inversion of is not what one usually thinks of as 'index' but rather a straightforward list of documents/pages and the words they contain. - Downtoearth
People tend to get caught up on "forward" versus "inverted" when thinking about inverted indices rather than realizing that the difference between regular indices and inverted indices does not involve a mathematical inversion of the query process, but instead a different indexing algorithm. Take a look at @Bery's answer, as it actually does spell out the algorithm known as "inverted index" rather than just trying to explain what the inverse of a forward index would look like. - Desjardins
P
39

They called it inverted simply because a forward index already existed. Take a search engine as an example. It is composed of two parts: the first is the "web crawler and parser", which builds an index from document to word; the second is the search database, which builds an index from word to document. Because the first index exists, we naturally call the second one the inverted index.

If you call the TOC (table of contents) of a book an index, then you should call the index at the end of the book an "inverted index". Or, the other way around, you can call the TOC the inverted index.

Philine answered 5/12, 2011 at 20:5 Comment(3)
This should be the accepted answer as it answers the question of why we call an index "inverted" even if it is just what everybody thinks of as a "normal index". A SQL B-tree index stores for each word a pointer to all rows ("documents") containing it. There we call it "index". But in search engines we suddenly call this exact same procedure "inverted index". Not because it's fundamentally different, but because we first created a "forward index" (split text) and then "invert" it. So, all in all, the name "inverted" comes from the process of creating it, not from the final structure of the index. - Signally
@Philine thanks for the insights. Quick question: Is it practical to remove entries from forward index file after inverted index is built from it? - Gallager
I agree with @FooBar. This answer should be chosen as the right answer. It answers why we invented a new term, "inverted index", even though all the normal indexes in everyday life are already inverted. - Bolero
B
8

Typically, when speaking about an index, you mean some added calculations or stored results of procedures that have been done in order to speed up an application (e.g. MySQL or another RDBMS; consult the MySQL docs). Indexing can also be related to caching, etc.

Building an inverted index creates files with a structure that is primarily intended for (full-text) searching.

An inverted index consists of two main files:

  • Vocabulary
  • Occurrences

The vocabulary holds the common words extracted from the text (after filtering out blacklisted words such as pronouns). The occurrences file holds the connection between words and documents (word1 appears in doc1 and doc2, not in doc3). It is represented in the form of a matrix.

[Figure: Indexing process - inverted index]

The figure above shows the process of creating the two files mentioned.
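To make the two structures concrete, here is a small Python sketch that builds a vocabulary and a word-by-document occurrence matrix. The corpus and the stop-word list are assumptions invented for the example, and the dense matrix is only practical at toy scale; production systems generally keep a sparse list of documents per word instead:

```python
# Toy corpus and stop-word list; both are assumptions made for illustration.
docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the cat ate the fish",
    "doc3": "a dog sat on the log",
}
STOP_WORDS = {"a", "the", "on"}

# Vocabulary: every distinct non-stop word, in sorted order.
vocabulary = sorted(
    {word for text in docs.values() for word in text.split() if word not in STOP_WORDS}
)
doc_ids = sorted(docs)

# Occurrences: one row per word, one column per document (1 = word appears).
occurrences = [
    [1 if word in docs[doc_id].split() else 0 for doc_id in doc_ids]
    for word in vocabulary
]

for word, row in zip(vocabulary, occurrences):
    print(f"{word:5} {row}")
```

In this toy data the row for "cat" is `[1, 1, 0]`, which matches the "appears in doc1 and doc2, not in doc3" case described above.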

If you are further interested in this topic, I can recommend a great book by Ricardo Baeza-Yates, Modern Information Retrieval (see it on Amazon), around page 200, I think.

Hope it helps :-)

Boyt answered 5/12, 2011 at 16:47 Comment(2)
This is a very good answer as it explains what an inverted index really is. It gets past the idea of forward indexing and inverse indexing, which is different from the algorithm that is used for a search capability that is enabled by creating an inverted index. - Desjardins
I like this answer best of all the ones here. I'll also add that you can think of an inverted index as a de-normalized index. A single index can reference multiple fields or entities, letting you search many things in a single lookup (very useful for search engines). - Herzog
S
8

normalocity has already wonderfully differentiated between a forward and an inverted index. As for the question of why one is called a forward index and the other an inverted index, maybe this is why:

Taking the example of search engine crawling and indexing (or building an index for a book), a forward index can be built while you are crawling the web pages (or reading the book), that is, while going forward. So if you have 10 web pages to crawl (or 10 chapters in a book), you can crawl the first web page (read the first chapter), make a list of the words that appear in it, and continue this process for the other web pages (chapters). By the time you have crawled all 10 web pages (read all 10 chapters), your forward index is complete, with each web page (chapter) pointing to the list of words it contains.

But to make an inverted index you have to crawl all 10 web pages (read all 10 chapters) and then take each word from each document's list and figure out which documents contain that word. So this is like going backward once you have crawled the web pages (read the chapters of the book). That is why it's called an inverted index.

This is just my speculation.

Southwesterly answered 3/5, 2012 at 11:41 Comment(0)
F
8

The term "Inverted Word Index" refers to the change in relationship of a single-document containing many-words, to each unique word containing (or identifying) a list of many-documents. This is effectively taking a One-to-Many Relationship (Docs to Words) and Inverting (or reversing) it such that a new "Inverted" One-to-Many Relationship now exists, which is each-unique-word relating to Many-Documents (i.e., all that contain that word). Its origin really is that simple, and the term "inverted index" was used to describe manual indexes of the same type long before computers and electronic high-speed indexing even existed (yes, admittedly, I'm an old, geezer programmer, almost old enough to have considered Grace Hopper a "sweet young lady" age appropriate for courting back when COBOL was a shiny new language). Please don't discard us geezers just yet, as we may occasionally provide a useful, and possibly even valuable, historical tidbit or two - when our personal RAM is still working, that is. [grin]

Fiester answered 28/4, 2018 at 8:43 Comment(0)
L
5

There are many types of index, for example B-tree, R-tree, hash... For different purposes, we must choose the correct index.

An inverted index is a special one. It is usually used in full-text search engines. Using an inverted index, we can find a word's location in a document (or set of documents) as fast as possible. Given the limits of memory and CPU, other index types can't do this job as well.

You can read the Lucene documentation for more details. It's an open source search engine library. http://lucene.apache.org/java/docs/index.html

Luby answered 2/12, 2011 at 19:7 Comment(0)
L
2

In inverted indexes, we have the following form:

word1-> list of docs it occurs in (sorted order)

word2-> list of docs it occurs in (sorted order)

It is very useful for search engine query processing, as it allows us to find the docs that a word occurs in.
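One reason the per-word lists are kept sorted is that a multi-word query can then be answered with a linear merge. A minimal sketch, assuming integer document ids and made-up posting lists (the `intersect` helper is illustrative, not taken from any particular engine):

```python
# Hypothetical posting lists: word -> sorted list of document ids.
postings = {
    "inverted": [1, 3, 4, 7, 9],
    "index": [2, 3, 7, 8, 9, 12],
}


def intersect(a, b):
    """Return ids present in both sorted lists using a linear two-pointer merge."""
    i, j, result = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


# Documents containing both "inverted" AND "index".
print(intersect(postings["inverted"], postings["index"]))  # [3, 7, 9]
```

Each list is walked once, so the cost grows with the combined length of the two lists rather than with the size of the whole collection.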

You can use supervised machine learning to build this inverted index.

Lamkin answered 11/10, 2011 at 14:33 Comment(4)
That sounds like an index to me, what's inverted about it? - Principle
@Principle An inverted index is the inversion of a forward index. A forward index stores a list of words for each doc, e.g. Doc->w1,w2 - Lamkin
I still don't see any difference between a forward and an inverted index (in terms of how they work, leaving the naming aside). Both, to me, look like an index that maps a field to a bunch of document ids. This is how I understood the Oracle B-tree (otherwise referred to as a forward index) to organise the data. I don't see any difference from the inverted index's principles. Mapping Doc -> w1, w2, w3 looks like an inefficient proposition to me in terms of search. I wonder why it exists in the first place. That leaves me back at square one. :-) - Lorettalorette
@Lamkin Quick question: Is it practical to remove entries from forward index file after inverted index is built from it? - Gallager
O
0

One more difference:

Handling updates with an inverted index is expensive in comparison with a forward index.

A forward index handles updates easily by reflecting the changes only in the corresponding document's entry, whereas in an inverted index the same change has to be reflected in multiple positions across the index.
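A small Python sketch of that asymmetry, under the simplifying assumption that both indexes are in-memory dictionaries and that "cost" is just the number of entries touched (the `update_document` helper is hypothetical):

```python
from collections import defaultdict

forward = {}                 # doc id -> set of words
inverted = defaultdict(set)  # word -> set of doc ids


def update_document(doc_id, text):
    """Replace a document's text; return how many index entries were touched."""
    old_words = forward.get(doc_id, set())
    new_words = set(text.split())

    forward[doc_id] = new_words          # forward index: one entry rewritten
    forward_touched = 1

    inverted_touched = 0
    for word in old_words - new_words:   # remove doc from stale posting lists
        inverted[word].discard(doc_id)
        if not inverted[word]:
            del inverted[word]
        inverted_touched += 1
    for word in new_words - old_words:   # add doc to new posting lists
        inverted[word].add(doc_id)
        inverted_touched += 1
    return forward_touched, inverted_touched


print(update_document("doc1", "red green blue"))    # (1, 3)
print(update_document("doc1", "red yellow black"))  # (1, 4)
```

The forward side always rewrites a single entry, while the inverted side touches one posting list per word that was added or removed.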

Och answered 28/8, 2017 at 7:5 Comment(0)
N
0

The way to interpret it is based on "what points to what".

For example, an entity has many attributes. We first need to have the entity at hand to be able to find what attributes it has.

A practical example: a search engine, in its ingress/web crawler phase, first crawls and has access to a web page. The web page here is the entity. The index that it creates to map a web page to the different words it contains is a forward index. The mapping would be document -> words.

To facilitate the lookup from an attribute to the entity, we need an inverted index. In my example it would be the mapping of words -> web page documents.

Nutter answered 25/2, 2024 at 0:40 Comment(0)
