Indexing/Searching "complex" JSON in elasticsearch

I have some JSON that looks like the following. Let's call that field metadata:

{
  "somekey1": "val1",
  "someotherkey2": "val2",
  "more_data": {
    "contains_more": [
      {
        "foo": "val5",
        "bar": "val6"
      },
      {
        "foo": "val66",
        "baz": "val44"
      }
    ],
    "even_more": {
      "foz": 1234
    }
  }
}

This is just a simple example; the real data can be even more complex. Keys can appear multiple times, as can values, and values can be int or str.

The first problem is that I'm not sure how to index this correctly in Elasticsearch so that I can find documents with specific requests.

I am using Django/Haystack where the index looks like this:

class FooIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    metadata = indexes.CharField(model_attr='get_metadata')
    # and some more specific fields

And the template:

{
    "foo": {{ object.foo }},
    "metadata": {{ object.metadata}},
    # and some more
}

The metadata will then be filled with the sample above and the result will look like this:

  {
    "foo": "someValue",
    "metadata": { 
      "somekey1": "val1",
      "someotherkey2": "val2",
      "more_data": { 
        "contains_more": [
          { 
            "foo": "val5",
            "bar": "val6"
          },
          { 
            "foo": "val66",
            "baz": "val44"
          }
        ],
        "even_more": {
          "foz" : 1234
        }
      }
    }
  }

This goes into the 'text' field in Elasticsearch.

So the goal is now to be able to search for things like:

  • foo: val5
  • foz: 12*
  • bar: val*
  • somekey1: val1
  • and so on

The second problem: when I search for e.g. foo: val5, it matches all objects that merely have the key "foo" and all objects that have val5 somewhere else in their structure.

This is how I search in Django:

self.searchqueryset.auto_query(self.cleaned_data['q'])

Sometimes the results are "okayish", sometimes they are completely useless.

I could use a pointer in the right direction and would like to know what mistakes I made here. Thank you!

Edit: I added my final solution as an answer below!

Larainelarboard answered 20/5, 2015 at 16:48 Comment(3)
Preface: I'm no Django user, just ES. My guess: the content field is populated with all the data, making it impossible to make field-specific matches. If you want to do that, you need to express it in your filters/queries (but my guess is: not using auto_query). - Bolen
Does your metadata field always have the same structure? - Sharynshashlik
@juliendangers Sometimes it has more fields or contains multiple elements in the array, and sometimes there is no array and it can be quite flat. The keys, however, are known beforehand, and there can be e.g. 30+ different ones. - Larainelarboard

It took a while to figure out a solution that works for me.

It was a mix of both answers provided by @juliendangers and @Val, plus some more customizing.

  1. Replaced Haystack with the more specific django-simple-elasticsearch
  2. Added a custom get_type_mapping method to the model

    @classmethod
    def get_type_mapping(cls):
      # Mapping for this model's Elasticsearch type
      return {
        "properties": {
          "somekey": {
            "type": "<specific_type>",
            "format": "<specific_format>",
          },
          "more_data": {
            "type": "nested",
            "include_in_parent": True,
            "properties": {
              "even_more": {
                "type": "nested",
                "include_in_parent": True,
              },
              # and so on for each level you care about
            }
          }
        }
      }
    
  3. Added a custom get_document method to the model

    @classmethod
    def get_document(cls, obj):
      return {
        'somekey': obj.somekey,
        'more_data': obj.more_data,
        # and so on
      }
    
  4. Added a custom search form

    class Searchform(ElasticsearchForm):
      q = forms.CharField(required=False)
    
      def get_index(self):
        return 'your_index'
    
      def get_type(self):
        return 'your_model'
    
      def prepare_query(self):
        if not self.cleaned_data['q']:
          q = "*"
        else:
          q = str(self.cleaned_data['q'])
    
        return {
          "query": {
            "query_string": {
              "query": q
            }
          }
        }
    
      def search(self):
        esp = ElasticsearchProcessor(self.es)
        esp.add_search(self.prepare_query, page=1, page_size=25, index=self.get_index(), doc_type=self.get_type())
        responses = esp.search()
        return responses[0]
    

So this is what worked for me and covers my use cases. Maybe it will help someone.

Larainelarboard answered 7/7, 2015 at 9:32 Comment(0)

The one thing that is certain is that you first need to craft a custom mapping based on your specific data and your query needs. My advice is that contains_more should be of nested type so that you can issue more precise queries on its fields.
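
To illustrate why nested matters for an array of objects, here is a minimal Python sketch. It does not talk to Elasticsearch; it only simulates the two behaviours, and the function names are made up for this example. With a plain object mapping, the values of all array elements end up in the same multi-valued fields, so a query combining values from two different elements still matches.

```python
# Simulation only: mimics how an array of objects behaves under an
# "object" mapping versus a "nested" mapping.

contains_more = [
    {"foo": "val5", "bar": "val6"},
    {"foo": "val66", "baz": "val44"},
]


def matches_flattened(items, conditions):
    """Object mapping: each key becomes one multi-valued field, so the
    link between values of the same array element is lost."""
    merged = {}
    for item in items:
        for key, value in item.items():
            merged.setdefault(key, set()).add(value)
    return all(value in merged.get(key, set()) for key, value in conditions.items())


def matches_nested(items, conditions):
    """Nested mapping: every array element is matched on its own."""
    return any(
        all(item.get(key) == value for key, value in conditions.items())
        for item in items
    )


# "foo" comes from the first element, "baz" from the second one.
query = {"foo": "val5", "baz": "val44"}
print(matches_flattened(contains_more, query))  # True
print(matches_nested(contains_more, query))     # False
```

If you never need to combine conditions on the same array element, the plain object mapping may already be enough.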

I don't know the exact names of your fields, but based on what you showed, one possible mapping could be something like this.

{
  "your_type_name": {
    "properties": {
      "foo": {
        "type": "string"
      },
      "metadata": {
        "type": "object",
        "properties": {
          "somekey1": {
            "type": "string"
          },
          "someotherkey2": {
            "type": "string"
          },
          "more_data": {
            "type": "object",
            "properties": {
              "contains_more": {
                "type": "nested",
                "properties": {
                  "foo": {
                    "type": "string"
                  },
                  "bar": {
                    "type": "string"
                  },
                  "baz": {
                    "type": "string"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Then, as already mentioned by mark in his comment, auto_query won't cut it, mainly because of the multiple nesting levels. As far as I know, Django/Haystack doesn't support nested queries out of the box, but you can extend Haystack to support it. Here is a blog post that explains how to tackle this: http://www.stamkracht.com/extending-haystacks-elasticsearch-backend. Not sure if this helps, but you should give it a try and let us know if you need more help.
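
For reference, this is roughly what the body of a nested query against the mapping above could look like, built as a plain Python dict. It is only a sketch: the helper name nested_match_query is made up, and how you send the body (raw query through an extended Haystack backend, or another client) depends on your setup.

```python
import json


def nested_match_query(path, field, value):
    """Build the body of an Elasticsearch `nested` query.

    `path` is the nested field from the mapping, `field` is a key inside it.
    Fields inside a nested query are referenced by their full path.
    """
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "match": {"{0}.{1}".format(path, field): value}
                },
            }
        }
    }


body = nested_match_query("metadata.more_data.contains_more", "foo", "val5")
print(json.dumps(body, indent=2))
```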

Gyronny answered 23/5, 2015 at 12:1 Comment(3)
Does this mean I have to define the mapping for all possible 'keys' as well as their structure? As I wrote in another comment, there could be 30+ different ones. - Larainelarboard
Well, the more you instruct your mapping, the more precise and powerful your queries can be. 30 fields is not a killer, I'd say. I have documents with hundreds of fields and they are all mapped properly and accurately for what I need them to do. Best is to give it a try and see how it goes for you in your particular case. - Gyronny
Thank you. I'll try it and report back! - Larainelarboard

Indexing:

First of all, you should use dynamic templates if you want to define a specific mapping based on key names, or if your documents do not all have the same structure.

But 30 keys isn't that many, and you should prefer defining your own mapping over letting Elasticsearch guess it for you (if incorrect data happens to be indexed first, the mapping will be derived from that data).
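
As a rough idea of what a dynamic template could look like, here is a sketch written as a Python dict. The type name, the template name "metadata_strings" and the choice of not_analyzed are assumptions for illustration only; it uses the "string" type of the Elasticsearch versions current at the time of writing, so adapt it to your version and needs.

```python
def build_mapping(type_name):
    """Return a type mapping in which every string found anywhere under
    `metadata` is indexed as a single, non-analyzed term."""
    return {
        type_name: {
            "dynamic_templates": [
                {
                    # Template name is arbitrary (hypothetical here).
                    "metadata_strings": {
                        "path_match": "metadata.*",
                        "match_mapping_type": "string",
                        "mapping": {"type": "string", "index": "not_analyzed"},
                    }
                }
            ]
        }
    }


mapping = build_mapping("your_type_name")
```

Fields that need a specific type can still be listed explicitly under "properties" next to the template.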

Searching:

You can't search for

foz: 12*

since the key "foz" doesn't exist.

But the key "metadata.more_data.even_more.foz" does: all your keys are flattened from the root of your document.

So you'll have to search for

metadata.more_data.contains_more.foo: val5
metadata.more_data.even_more.foz: 12*
metadata.more_data.contains_more.bar: val*
metadata.somekey1: val1
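
The flattening described above can be sketched in plain Python. This is not what Elasticsearch runs internally, just a small illustration (the function name flatten is made up) of which field names you end up with for the sample document.

```python
def flatten(document):
    """Return {dotted_path: [values]} for a nested JSON-like structure.

    Arrays do not add an index to the path; the values of all their
    elements are collected under the same field name.
    """
    fields = {}

    def walk(node, path):
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(node, list):
            for child in node:
                walk(child, path)
        else:
            fields.setdefault(path, []).append(node)

    walk(document, "")
    return fields


doc = {
    "foo": "someValue",
    "metadata": {
        "somekey1": "val1",
        "someotherkey2": "val2",
        "more_data": {
            "contains_more": [
                {"foo": "val5", "bar": "val6"},
                {"foo": "val66", "baz": "val44"},
            ],
            "even_more": {"foz": 1234},
        },
    },
}

for name, values in sorted(flatten(doc).items()):
    print(name, values)
```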

Using query_string, for example:

"query_string": {
    "default_field": "metadata.more_data.even_more.foz",
    "query": "12*"
}

Or, if you want to search in multiple fields:

"query_string": {
    "fields" : ["metadata.more_data.contains_more.bar", "metadata.somekey1"],
    "query": "val*"
}
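
If your users type short keys such as "foz: 12*", one option is to translate them into the full paths before building the query. The following is a minimal sketch in Python; the lookup table FIELD_PATHS and the function build_query are hypothetical, and it assumes the fields are reachable by their dotted path (plain object mapping, or nested with include_in_parent).

```python
# Hypothetical lookup table: short key typed by the user -> full field path.
FIELD_PATHS = {
    "foz": "metadata.more_data.even_more.foz",
    "bar": "metadata.more_data.contains_more.bar",
    "somekey1": "metadata.somekey1",
}


def build_query(user_input):
    """Turn 'key: value' into a query_string body on the full field path.

    Input without a known key falls back to a plain query_string search.
    """
    key, sep, value = user_input.partition(":")
    key, value = key.strip(), value.strip()
    if sep and key in FIELD_PATHS and value:
        return {
            "query": {
                "query_string": {
                    "default_field": FIELD_PATHS[key],
                    "query": value,
                }
            }
        }
    return {"query": {"query_string": {"query": user_input.strip() or "*"}}}


print(build_query("foz: 12*"))
```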
Sharynshashlik answered 26/5, 2015 at 11:28 Comment(2)
So arrays will be flattened as well (e.g. no need to use metadata.more_data.contains_more.0.key)? - Larainelarboard
Yes, Elasticsearch will detect the array, and "contains_more.foo" and "contains_more.bar" will become multi-valued fields. - Sharynshashlik
