What are segments in Lucene?

3

49

What are segments in Lucene?

What are the benefits of segments?

Lobachevsky answered 24/4, 2010 at 6:11 Comment(0)
71

The Lucene index is split into smaller chunks called segments. Each segment is its own index. Lucene searches all of them in sequence.
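To make "each segment is its own index" concrete, here is a minimal toy sketch in Python. It is not Lucene's API or file format: `build_segment` and `search` are hypothetical names, and the global document ids are a simplification.

```python
from collections import defaultdict


def build_segment(docs):
    """Build a tiny inverted index (term -> sorted doc ids) for one segment."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(segments, term):
    """Query every segment in turn and concatenate the hits."""
    hits = []
    for segment in segments:
        hits.extend(segment.get(term, []))
    return hits


# Two independent segments, each with its own term dictionary.
segments = [
    build_segment({0: "search me", 1: "lucene index"}),
    build_segment({2: "search lucene"}),
]
print(search(segments, "search"))  # [0, 2]
```

Note that the term "search" is stored in both segments' dictionaries in this sketch, because each segment is indexed on its own.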

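A new segment is written whenever a writer flushes the documents it has buffered, which happens when it commits or is closed (and also when its in-memory buffer fills up). Each newly opened writer therefore writes to fresh segments rather than appending to existing ones.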

The advantage of this design is that the files of a segment never have to be modified once the segment is written. New documents are added to a new segment; previous segments are left untouched.

Deleting a document only records, in a separate file, which document of a segment is deleted; physically, the document stays in the segment until that segment is merged. Documents in Lucene aren't really updated in place, either: the previous version is marked as deleted in its original segment and the new version is added to the current segment. Because existing files are not rewritten on every change, the risk of corrupting the index is reduced. It also makes it easy to back up the index and to synchronize it across machines.
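The delete-then-add behaviour can be sketched with a toy model in Python. This is an illustration under simplifying assumptions, not Lucene's implementation: the `Segment` and `Index` classes are hypothetical, and every added document gets its own segment for brevity.

```python
class Segment:
    """A toy segment: documents are fixed at creation; only delete marks change."""

    def __init__(self, docs):
        self.docs = tuple(docs)  # (key, text) pairs, never modified afterwards
        self.deleted = set()     # positions of documents marked as deleted

    def live_docs(self):
        return [d for i, d in enumerate(self.docs) if i not in self.deleted]


class Index:
    def __init__(self):
        self.segments = []

    def add(self, key, text):
        # New documents always go into a new segment.
        self.segments.append(Segment([(key, text)]))

    def delete(self, key):
        # Deleting only records a mark; the document stays in its segment.
        for seg in self.segments:
            for i, (k, _) in enumerate(seg.docs):
                if k == key:
                    seg.deleted.add(i)

    def update(self, key, text):
        # An update is a delete of the old version plus an add of the new one.
        self.delete(key)
        self.add(key, text)

    def live_docs(self):
        return [d for seg in self.segments for d in seg.live_docs()]


index = Index()
index.add("a", "first version")
index.add("b", "another document")
index.update("a", "second version")
print(index.live_docs())  # [('b', 'another document'), ('a', 'second version')]
```

After the update, the first segment still physically contains ("a", "first version"); it is only hidden by the delete mark.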

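However, at some point Lucene may decide to merge some segments into a larger one. A merge can also be triggered explicitly with an optimize (IndexWriter.optimize(), which later Lucene versions replaced with forceMerge).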
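What a merge does can be sketched as follows. This is a toy Python model, not Lucene's merge code: a segment is represented here as a hypothetical `(docs, deleted_positions)` pair, and `merge` is an illustrative name.

```python
def merge(segments):
    """Combine segments into one new segment, skipping documents marked deleted."""
    merged_docs = []
    for docs, deleted in segments:
        for position, doc in enumerate(docs):
            if position not in deleted:
                merged_docs.append(doc)
    # The merged segment starts with no delete marks.
    return (tuple(merged_docs), frozenset())


seg0 = (("doc a v1", "doc b"), {0})  # position 0 was marked deleted
seg1 = (("doc a v2",), set())
merged = merge([seg0, seg1])
print(merged[0])  # ('doc b', 'doc a v2')
```

The input segments are not modified; the merge writes a new segment, and only then can the old ones be discarded. This is the point at which deleted documents are physically dropped.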

Retrad answered 24/4, 2010 at 16:3 Comment(1)
Does that mean there is a term dictionary in every single segment? If two segments both contain the words "search" and "me", are there two dictionaries holding them? Wouldn't that make each segment large because of the duplicated terms? – Torticollis
27

A segment is very simply a section of the index. The idea is that you can add documents to the index that's currently being served by creating a new segment with only new documents in it. This way, you don't have to go to the expensive trouble of rebuilding your entire index frequently in order to add new documents to the index.

Staten answered 24/4, 2010 at 6:54 Comment(0)
8

The benefits of segments have already been covered by the other answers, so I will add an ASCII diagram of a Lucene index.

Lucene Segment

A Lucene segment is part of an index. Each segment is composed of several index files. If you look inside any of these files, you will see that it holds one or more Lucene documents.

+- Index 5 ------------------------------------------+
|                                                    |
|  +- Segment _0 ---------------------------------+  |
|  |                                              |  |
|  |  +- file 1 -------------------------------+  |  |
|  |  |                                        |  |  |
|  |  | +- L.Doc1-+  +- L.Doc2-+  +- L.Doc3-+  |  |  |
|  |  | |         |  |         |  |         |  |  |  |
|  |  | | field 1 |  | field 1 |  | field 1 |  |  |  |
|  |  | | field 2 |  | field 2 |  | field 2 |  |  |  |
|  |  | | field 3 |  | field 3 |  | field 3 |  |  |  |
|  |  | |         |  |         |  |         |  |  |  |
|  |  | +---------+  +---------+  +---------+  |  |  |
|  |  |                                        |  |  |
|  |  +----------------------------------------+  |  |
|  |                                              |  |
|  |                                              |  |
|  |  +- file 2 -------------------------------+  |  |
|  |  |                                        |  |  |
|  |  | +- L.Doc4-+  +- L.Doc5-+  +- L.Doc6-+  |  |  |
|  |  | |         |  |         |  |         |  |  |  |
|  |  | | field 1 |  | field 1 |  | field 1 |  |  |  |
|  |  | | field 2 |  | field 2 |  | field 2 |  |  |  |
|  |  | | field 3 |  | field 3 |  | field 3 |  |  |  |
|  |  | |         |  |         |  |         |  |  |  |
|  |  | +---------+  +---------+  +---------+  |  |  |
|  |  |                                        |  |  |
|  |  +----------------------------------------+  |  |
|  |                                              |  |
|  +----------------------------------------------+  |
|                                                    |
|  +- Segment _1 (optional) ----------------------+  |
|  |                                              |  |
|  +----------------------------------------------+  |
+----------------------------------------------------+

Reference

Lucene in Action, Second Edition. Manning Publications, July 2010.

Seve answered 14/1, 2017 at 11:51 Comment(4)
The segment does not hold the documents themselves; it is only a part of the inverted index, which contains a reference to the document (such as an id). – Drab
Hello BornToCode. In trying to answer both parts of the question, I used the word "document" with two meanings: a sequence of fields and a source document. I have now left only the ASCII diagram, which uses a single meaning of "document": a Lucene document, which is a sequence of fields. I hope that is clearer? – Seve
It's better, but if I hadn't read your comment I would still be confused. I think the main point to show is that the segments contain the terms that make up the inverted index, plus a reference id to the documents containing those terms (if I understand this correctly). Also, where did you take the notion of "Lucene document" from? – Drab
Re: where did you take the notion of "Lucene document" from? I took it from the definitions section of the Apache Lucene documentation, which says: "An index contains a sequence of documents. A document is a sequence of fields." lucene.apache.org/core/2_9_4/fileformats.html#Segments – Seve
