It seems your question more about index merging than about indexing itself.
Indexing process is quite simple if you ignore low-level details. Lucene form what is called "inverted index" from documents. So if document with text "To be or not to be" and id=1 comes in, inverted index would look like:
[to] → 1
[be] → 1
[or] → 1
[not] → 1
This is basically it – the index from the word to the list of documents containing given word. Each line of this index (word) is called posting list. This index is persisted on long-term storage then.
In reality of course things are more complicated:
- Lucene may skip some words based on the particular Analyzer given;
- words can be preprocessed using stemming algorithm to reduce flexia of the language;
- posting list can contains not only identifiers of the documents, but also offset of the given word inside document (potentially several instances) and some other additional information.
There are many more complications which are not so important for basic understanding.
It's important to understand though, that Lucene index is append only. In some point in time application decide to commit (publish) all the changes in the index. Lucene finish all service operations with index and close it, so it's available for searching. After commit index basically immutable. This index (or index part) is called segment. When Lucene execute search for a query it search in all available segments.
So the question arise – how can we change already indexed document?
New documents or new versions of already indexed documents are indexed in new segments and old versions invalidated in previous segments using so called kill list. Kill list is the only part of committed index which can change. As you might guess, index efficiency drops with time, because old indexes might contain mostly removed documents.
This is where merging comes in. Merging – is the process of combining several indexes to make more efficient index overall. What is basically happens during merge is live documents copied to the new segment and old segments removed entirely.
Using this simple process Lucene is able to maintain index in good shape in terms of a search performance.
Hope it'll helps.