Elasticsearch index sharding explanation
Asked Answered
O

1

12

I am trying to figure out the concept of elastic search index and quite don't understand it. I want to make a couple of points in advance. I understand how inverse document index works (mapping terms to document ids), I also understand how the document ranking works based on TF-IDF. What I don't understand is the data structure of the actual index. When referring to the elastic search documentation it describes the index as a "table with mappings to the documents". So, here comes sharding!! When you look at typical picture of the elastic search index, it is represented like this: Elastic search index What the picture doesn't show is how the actual partitioning happens and how this [table -> document] link is split across multiple shards. For instance, does each shard split the table vertically? Meaning the inverted index table only contains terms that are present on the shard. For instance, lets assume we have 3 shards, meaning the first one will contain document1, the second shard only contains document 2 and the 3rd shard is document 3. Now, would the first shard index only contain terms that are present in document1? In this case [Blue, bright, butterfly, breeze, hangs]. If so, what if someone searches for [forget], how does elastic search "knows" not to search in shard 1, or it searches all shards every time? When you look at the cluster image: enter image description here

It is not clear what exactly is in shard1, shard2 and shard3. We go from Term -> DocumentId -> Document to a "rectangular" shard, but what does the shard contain exactly?

I would appreciate if someone can explain it from the picture above.

Outnumber answered 29/10, 2017 at 17:46 Comment(0)
U
11

Theory

Elastichsarch built on top of Lucene. Every shard is simply a Lucene index. Lucene index, if simplified, is the inverted index. Every Elasticsearch index is a bunch of shards or Lucene indices. When you query for a document, Elasticsearch will subquery all shards, merge results and return it to you. When you index document to Elasticsearch, the Elasticsearch will calculate in which shard document should be written using the formula

shard = hash(routing) % number_of_primary_shards

By default as a routing, the Elasticsearch uses the document id. If you specify routing param, it will be used instead of id. You can use routing param both in search queries and in requests for indexing, deleting or updating a document. By default as a hash function used MurmurHash3

Elasticsearch index

Example

Create index with 3 shards

$ curl -XPUT localhost:9200/so -d '
{ 
    "settings" : { 
        "index" : { 
            "number_of_shards" : 3, 
            "number_of_replicas" : 0 
        } 
    } 
}'

Index document

$ curl -XPUT localhost:9200/so/question/1 -d '
{ 
    "number" : 47011047, 
    "title" : "need elasticsearch index sharding explanation" 
}'

Query without routing

$ curl "localhost:9200/so/question/_search?&pretty"

Response

Look at _shards.total - this is a number of shards that were queried. Also note that we found the document

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "so",
        "_type" : "question",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "number" : 47011047,
          "title" : "need elasticsearch index sharding explanation"
        }
      }
    ]
  }
}

Query with correct routing

$ curl "localhost:9200/so/question/_search?explain=true&routing=1&pretty"

Response

_shards.total now 1, because we specified routing and elasticsearch know which shard to ask for document. With param explain=true I ask elasticsearch to give me additional information about query. Pay attention to hits._shard - it's setted to [so][2]. It means that our document stored in second shard of so index.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_shard" : "[so][2]",
        "_node" : "2skA6yiPSVOInMX0ZsD91Q",
        "_index" : "so",
        "_type" : "question",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "number" : 47011047,
          "title" : "need elasticsearch index sharding explanation"
        },
        ...
}

Query with uncorrect routing

$ curl "localhost:9200/so/question/_search?explain=true&routing=2&pretty"

Response

_shards.total again 1. But Elasticsearch return nothing to our query because we specify wrong routing and Elasticsearch query the shard, where there is no document.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Additional information

Unreflecting answered 30/10, 2017 at 8:47 Comment(3)
Thank you Nikita for the detailed answer, but I don't think it actually addresses the question. I understand how Lucene index works, how inverted index works and how document is routed to a shard based on a document id. What i cannot grasp is what being stored as part of the inverted index table / terms (look at the image). Please refer to the original question: Would the first shard index only contain terms that are present in document1? In this case [Blue, bright, butterfly, breeze, hangs].Outnumber
@AlexPryiomka In short - the first shard index will only contains terms which were obtained after an analysis of the documents that were routing to him. Elasticsearch delegates the entire work to Lucene. If document1 was routed to shard 1, than shard 1 will contain terms of document1. If document1 was not routed to shard1, than shard1 will not contains terms of document1. Every shard is a separate Lucene index. Elasticsearch serves only as aggregator. I updated the examples - try to execute them and a little more carefully read the answerUnreflecting
Inverted index is not part of Elasticsearch, it's part of LuceneUnreflecting

© 2022 - 2024 — McMap. All rights reserved.