Schemaless Support for Elasticsearch Queries
Our REST API allows users to add custom schemaless JSON to some of our REST resources, and we need it to be searchable in Elasticsearch. This custom data and its structure can be completely different across resources of the same type.

Consider this example document:

{
  "givenName": "Joe",
  "username": "joe",
  "email": "[email protected]",
  "customData": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  } 
}

All fields except customData adhere to a schema. customData is always a JSON Object, but all the fields and values within that Object can vary dramatically from resource to resource. There is no guarantee that any given field name or value (or even value type) within customData is the same across any two resources as users can edit these fields however they wish.

What is the best way to support search for this?

We thought one solution would be to simply not create any mapping for customData when the index is created, but then it becomes unqueryable (which is contrary to what the ES docs say). This would be ideal if queries on non-mapped properties worked and the approach had no performance problems. However, after running multiple tests, we haven't been able to get it to work.

Is this something that needs any special configuration? Or are the docs incorrect? Some clarification as to why it is not working would be greatly appreciated.

Since this is not currently working for us, we've thought of a couple of alternative solutions:

  1. Reindexing: this would be costly, as we would need to reindex every index that contains the document, and do so every time a user updates a property with a different value type. That is really bad for performance, so this is likely not a real option.

  2. Use a multi-match query: we would append a random string to the customData field name every time there is a change in the customData object. For example, this is what the document being indexed would look like:

    {
      "givenName": "Joe",
      "username": "joe",
      "email": "[email protected]",
      "customData_03ae8b95-2496-4c8d-9330-6d2058b1bbb9": {
        "favoriteColor": "red",
        "someObject": {
          "someKey": "someValue"
        }
      }
    }
    

    This means ES would create a new mapping for each "random" field, and we would query with a phrase multi-match query that uses a "starts with" wildcard on the field names. For example:

    curl -XPOST 'eshost:9200/test/_search?pretty' -d '
    {
      "query": {
        "multi_match": {
          "query" : "red",
          "type" :  "phrase",
          "fields" : ["customData_*.favoriteColor"]
        }
      }
    }'
    

    This could be a viable solution, but we are concerned that having too many mappings like this could affect performance. Are there any performance repercussions for having too many mappings on an index? Maybe periodic reindexing could alleviate having too many mappings?

    This also just feels like a hack and something that should be handled by ES natively. Am I missing something?
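For illustration, option 2 could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the helper names `rename_custom_data` and `phrase_query` are hypothetical, and the query body simply mirrors the curl example above rather than anything verified against a cluster.

```python
import uuid


def rename_custom_data(doc):
    """Return a copy of doc with customData moved under a randomly suffixed key.

    Hypothetical helper: the suffix is a fresh UUID each time, so every call
    produces a field name that Elasticsearch has not mapped before.
    """
    out = dict(doc)
    custom = out.pop("customData", None)
    if custom is not None:
        out["customData_" + str(uuid.uuid4())] = custom
    return out


def phrase_query(path, text):
    """Build the multi_match body from the example above for a given sub-field."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "phrase",
                # The wildcard matches every randomly suffixed customData_* field.
                "fields": ["customData_*." + path],
            }
        }
    }
```

Each call to `rename_custom_data` introduces a brand-new top-level field, which is exactly why the number of mappings grows with every update.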

Any suggestions about any of this would be much appreciated.

Thanks!

Bechtold answered 1/7, 2015 at 21:26 Comment(0)
M
3

You're correct that Elasticsearch is not truly schemaless. If no mapping is specified, Elasticsearch infers each field's type from the first value it sees for that field. Your non-deterministic customData object can therefore get you in trouble: if Elasticsearch first sees "favoriteColor": 10, the field is mapped as a number, and a later document with "favoriteColor": "red" will fail to index.
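The conflict described above can be illustrated with a toy model. To be clear, this is not Elasticsearch's implementation: `ToyIndex`, `infer_type`, and the compatibility rules are simplified assumptions that only mimic the "first value fixes the type" behavior.

```python
def infer_type(value):
    """Guess a field type from a JSON value (simplified assumption)."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "object"
    return "string"


def compatible(fixed, seen):
    """Toy rule: a string field accepts any scalar, a double accepts a long."""
    if fixed == seen:
        return True
    if fixed == "string" and seen != "object":
        return True
    return fixed == "double" and seen == "long"


class ToyIndex:
    """Remembers the type first seen for each field path and rejects conflicts."""

    def __init__(self):
        self.mapping = {}

    def index(self, doc, prefix=""):
        for key, value in doc.items():
            path = prefix + key
            seen = infer_type(value)
            # The first document to mention a path decides its type.
            fixed = self.mapping.setdefault(path, seen)
            if not compatible(fixed, seen):
                raise ValueError(
                    "%s is mapped as %s but got a %s value" % (path, fixed, seen)
                )
            if seen == "object":
                self.index(value, path + ".")
```

Indexing `{"customData": {"favoriteColor": 10}}` and then `{"customData": {"favoriteColor": "red"}}` into the same `ToyIndex` raises a `ValueError`, while the reverse order is accepted in this model because the field was first fixed as a string.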

For your requirements, you should take a look at the SIREn Solutions Elasticsearch plugin, which provides a schemaless solution coupled with an advanced query language (using Twig) and a custom Lucene index format to speed up indexing and search operations for non-deterministic data.

Moorish answered 12/8, 2015 at 17:14 Comment(1)
Thanks for the comment Peter - we'll try it out and award the answer if it works as expected. - Eon
P
0

Fields with the same mapping are stored as the same Lucene field in the Lucene index (an Elasticsearch shard). Each distinct Lucene field has its own inverted index (term dictionary and index entries) and its own doc values. Lucene is highly optimized to store the values of a single field in a compressed way. Using a mapping with a different field for each document prevents Lucene from applying these optimizations.

You should use Elasticsearch nested documents to search efficiently. The underlying technology is Lucene BlockJoin, which indexes parent and child documents as a single document block.
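One possible reading of this suggestion, sketched in Python, is to flatten customData into a list of key/value pairs stored under a single nested field, so that the set of Lucene fields stays fixed no matter what users put in. The details are assumptions rather than part of the answer: the helper names are hypothetical, customData is assumed to be mapped as a nested type, and `key` and `value` are assumed to be exact-match (non-analyzed) string fields. Values are stringified to avoid type conflicts, and lists are not handled.

```python
def to_nested_pairs(custom_data, prefix=""):
    """Flatten an arbitrary JSON object into [{"key": path, "value": str}, ...]."""
    pairs = []
    for key, value in custom_data.items():
        path = prefix + key
        if isinstance(value, dict):
            # Recurse into sub-objects, joining key names with dots.
            pairs.extend(to_nested_pairs(value, path + "."))
        else:
            # Every value becomes a string, so "value" always has one type.
            pairs.append({"key": path, "value": str(value)})
    return pairs


def nested_term_query(key, value):
    """Build a nested query matching one key/value pair inside customData."""
    return {
        "query": {
            "nested": {
                "path": "customData",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"customData.key": key}},
                            {"term": {"customData.value": str(value)}},
                        ]
                    }
                },
            }
        }
    }
```

With this layout, the example document's customData would be indexed as two pairs, `favoriteColor` / `red` and `someObject.someKey` / `someValue`, and `nested_term_query("favoriteColor", "red")` would look for the first of them. The trade-off is that numeric range queries on values are lost unless additional typed value fields are introduced.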

Pentecostal answered 21/7, 2015 at 7:56 Comment(1)
Are you saying that Nested Document can handle the described non-deterministic nature of the customData object? - Eon
