Forward Index vs Inverted index Why?

Asked 1/8, 2015 at 11:18 Answered 13/5, 2022 at 14:20

solr elasticsearch lucene inverted-index forward-indexing

I was reading about inverted index (used by the text search engines like Solr, Elastic Search etc) and as I understand (if we take "Person" as an example):

The attribute to Person relationship is inverted:

John -> PersonId(1), PersonId(2), PersonId(3)
London -> PersonId(1), PersonId(2), PersonId(5)

I can now search the person records for 'John who lives in London'

Doesn't this solve all the problems? Why do we have the forward (or regular database index) at all? Or in other words, in what cases the regular indexing is useful? Please explain. Thanks.

Chemisette answered 1/8, 2015 at 11:18 Comment(1)

See also #7728186 – Arsis 21/6, 2022 at 18:59

The point that you're missing is that there is no real technical distinction between a forward index and an inverted index. "Forward" and "inverted" in this case are just descriptive terms to distinguish between:

A list of words contained in a document.
A list of documents containing a word.

The concept of an inverted index only makes sense if the concept of a regular (forward) index already exists. In the context of a search engine, a forward index would be the term vector; a list of terms contained within a particular document. The inverted index would be a list of documents containing a given term.

When you understand that the terms "forward" and "inverted" are really just relative terms used to describe the nature of the index you're talking about - and that really an index is just an index - your question doesn't really make sense any more.

Fricke answered 1/8, 2015 at 12:34 Comment(9)

Thanks for that. I understand that it's a way to distinguish from what's already there. But I still don't find any difference between Forward and Inverted index (in terms of how it works). Both to me, looks like an index that maps a field to a bunch of document ids. This is how I understood how the oracle btree (otherwise referred to forward index) organises the data. I don't see any difference to the inverted index's principles. That leaves me back to square one. :-) – Chemisette 6/4, 2016 at 4:20

That's my point - there is no functional difference. An inverted index is just an index... but backwards. A forward index would store { Document1: ["Hello", "this", "is", "a", "document"] }, an inverted index would store (for example) { "Hello": [Document1], "this": [Document1, Document40] } ... one lets you look up a document and find the contents, the other lets you look up a word and get a list of documents. – Fricke 6/4, 2016 at 12:54

Mapping a Doc -> w1, w2, w3 looks like an inefficient proposition to me in terms of search. Wonder why this existed in the first place? What are they actually used for? – Chemisette 7/4, 2016 at 5:15

I think you might be a bit too focused on the terminology - the "inverted index" is really just a name intended to describe how it works (much like 'reverse proxy' doesn't necessarily imply a forward proxy in the same context). That said, the most obvious candidate for a forward index in this case is to map document IDs to their content - so you can retrieve documents by ID! Remember that the difference is conceptual (retrieving the content for the document, rather than retrieving the documents for a piece of content) not technical. Ultimately an index is an index. – Fricke 7/4, 2016 at 7:2

quick question: do we remove entries in the forward index after we have consumed and turned it into inverted index? – Spermic 22/4, 2016 at 2:52

@Roylee there's nothing to imply that an inverted index is generated by consuming a forward index. Again, the term doesn't literally mean that you have inverted an index, it is just a descriptive term for the "direction" of the key/value pairs. You're not going to be able to rationalise this as a well defined technical term, because it isn't one. – Fricke 22/4, 2016 at 7:14

It's also worth noting that the term is usually applied specifically in the context of full text search where all of the content of the document is in some way broken up and treated as a set of keys in the index. – Fricke 22/4, 2016 at 7:19

@AntP thanks for pointing out, yes you are right, I asked the question in context of full text search. In fact, the question came up to me after I've read (Google) white-paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine, based on the paper, the inverted-index file was derived from forward-index file. Hence I was wondering if its rational to keep forward-index file after consumed? – Spermic 22/4, 2016 at 8:49

This answer requires clarification. In the terminology of inverted indexes, the (forward) index is no usual index, as one would expect, e.g. a BTree index. It is rather the table data itself. So an inverted index is really a regular index, as we are used to it. Still, it has technical differences compared to BTree or other index types. So in my opinion, the statement "there is no real technical distinction between a forward index and an inverted index" is wrong. Look here https://mcmap.net/q/748041/-b-tree-index-vs-inverted-index – Feminine 19/3, 2023 at 18:34

Here's an explanation of inverted index, from Elasticsearch:

Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears. https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html

Inverted indexing is for fast full text search. Regular indexing is less efficient, because the engine looks through all entries for a term, but very fast with indexing!

You can say this:

Forward index: fast indexing, less efficient query's
Inverted index: fast query, slower indexing

But, it's always context related. If you compare it with MySQL: myisam has fast read, innodb has fast insert/update and slower read.

Simonnesimonpure answered 1/8, 2015 at 12:27 Comment(0)

In forward index, the input is a document and the output is words contained in the document.

{
  doc1: [word1, word2, word3],
  doc2: [word4, word5]
}

In the reverse/inverted index, the input is a word, and the output is all the documents in which the words are contained.

{
  word1: [doc1, doc10, doc3],
  word2: [doc5, doc3]
}

Search engines make use of reverse/inverted index to get us documents from keywords.

Vestryman answered 13/5, 2022 at 14:20 Comment(0)

Recommended topics

Hot tags