Best way to index arbitrary attribute-value pairs in Elasticsearch

I am trying to index documents in Elasticsearch that have arbitrary attribute-value pairs. Example documents:

{
    id: 1,
    name: "metamorphosis",
    author: "franz kafka"
}

{
    id: 2,
    name: "techcorp laptop model x",
    type: "computer",
    memorygb: 4
}

{
    id: 3,
    name: "ss2014 formal shoe x",
    color: "black",
    size: 42,
    price: 124.99
}

Then, I need queries like:

1. "author" EQUALS "franz kafka"
2. "type" EQUALS "computer" AND "memorygb" GREATER THAN 4
3. "color" EQUALS "black" OR ("size" EQUALS 42 AND price LESS THAN 200.00)

What is the best way to store these documents so that they can be queried efficiently? Should I store them exactly as shown in the examples? Or should I store them like:

{
    fields: [
        { "type": "computer" },
        { "memorygb": 4 }
    ]
}

or like:

{
    fields: [
        { "key": "type", "value": "computer" },
        { "key": "memorygb", "value": 4 }
    ]
}

And how should I map my indices to support both the equality and the range queries?

Mammy answered 18/2, 2015 at 12:42 Comment(0)

If someone is still looking for an answer, I wrote a post about how to index arbitrary data into Elasticsearch and then search it by specific fields and values, all without blowing up your index mapping.

The post: http://smnh.me/indexing-and-searching-arbitrary-json-data-using-elasticsearch/

In short, you will need to create the special index described in the post. Then you will need to flatten your data using the flattenData function (https://gist.github.com/smnh/30f96028511e1440b7b02ea559858af4). The flattened data can then be safely indexed into the Elasticsearch index.
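The post describes the exact mapping to create. As a rough sketch of the idea, the index maps flatData as a nested field with one typed value field per data type, along these lines (the PUT my_index request and index name are illustrative; the real mapping in the post also defines further value types such as value_double, and on Elasticsearch 5.x/6.x the mapping needs an extra document-type level):

{
    "mappings": {
        "properties": {
            "flatData": {
                "type": "nested",
                "properties": {
                    "key": {"type": "keyword"},
                    "type": {"type": "keyword"},
                    "key_type": {"type": "keyword"},
                    "value_string": {"type": "text"},
                    "value_long": {"type": "long"}
                }
            }
        }
    }
}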

For example:

flattenData({
    id: 1,
    name: "metamorphosis",
    author: "franz kafka"
});

Will produce:

[
    {
        "key": "id",
        "type": "long",
        "key_type": "id.long",
        "value_long": 1
    },
    {
        "key": "name",
        "type": "string",
        "key_type": "name.string",
        "value_string": "metamorphosis"
    },
    {
        "key": "author",
        "type": "string",
        "key_type": "author.string",
        "value_string": "franz kafka"
    }
]

And

flattenData({
    id: 2,
    name: "techcorp laptop model x",
    type: "computer",
    memorygb: 4
});

Will produce:

[
    {
        "key": "id",
        "type": "long",
        "key_type": "id.long",
        "value_long": 2
    },
    {
        "key": "name",
        "type": "string",
        "key_type": "name.string",
        "value_string": "techcorp laptop model x"
    },
    {
        "key": "type",
        "type": "string",
        "key_type": "type.string",
        "value_string": "computer"
    },
    {
        "key": "memorygb",
        "type": "long",
        "key_type": "memorygb.long",
        "value_long": 4
    }
]
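Note that the queries below assume the flattened array is indexed under a field named flatData, i.e. the document actually sent to Elasticsearch looks like this (shape illustrative, using the laptop example above):

{
    "flatData": [
        {"key": "id", "type": "long", "key_type": "id.long", "value_long": 2},
        {"key": "name", "type": "string", "key_type": "name.string", "value_string": "techcorp laptop model x"},
        {"key": "type", "type": "string", "key_type": "type.string", "value_string": "computer"},
        {"key": "memorygb", "type": "long", "key_type": "memorygb.long", "value_long": 4}
    ]
}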

Then you can build Elasticsearch queries to query your data. Every query should specify both the key and the type of the value. If you are unsure what keys or types the index holds, you can run an aggregation to find out; this is also discussed in the post.
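As an illustration, a nested terms aggregation along these lines (a sketch; the post shows the exact request) lists every key in the index together with the value types it was indexed with:

{
    "size": 0,
    "aggs": {
        "flat_data": {
            "nested": {"path": "flatData"},
            "aggs": {
                "keys": {
                    "terms": {"field": "flatData.key"},
                    "aggs": {
                        "types": {"terms": {"field": "flatData.type"}}
                    }
                }
            }
        }
    }
}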

For example, to find a document where author == "franz kafka" you need to execute the following query:

{
    "query": {
        "nested": {
            "path": "flatData",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"flatData.key": "author"}},
                        {"match": {"flatData.value_string": "franz kafka"}}
                    ]
                }
            }
        }
    }
}

To find documents where type == "computer" and memorygb > 4 you need to execute the following query:

{
    "query": {
        "bool": {
            "must": [
                {
                    "nested": {
                        "path": "flatData",
                        "query": {
                            "bool": {
                                "must": [
                                    {"term": {"flatData.key": "type"}},
                                    {"match": {"flatData.value_string": "computer"}}
                                ]
                            }
                        }
                    }
                },
                {
                    "nested": {
                        "path": "flatData",
                        "query": {
                            "bool": {
                                "must": [
                                    {"term": {"flatData.key": "memorygb"}},
                                    {"range": {"flatData.value_long": {"gt": 4}}}
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}

Here, because we want the same document to match both conditions, we use an outer bool query with a must clause wrapping two nested queries.
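The same pattern extends to the OR query from the question, "color" EQUALS "black" OR ("size" EQUALS 42 AND price LESS THAN 200.00). A sketch, assuming the flatten function stores floating-point numbers in a value_double field (check the gist for the exact field name):

{
    "query": {
        "bool": {
            "should": [
                {
                    "nested": {
                        "path": "flatData",
                        "query": {
                            "bool": {
                                "must": [
                                    {"term": {"flatData.key": "color"}},
                                    {"match": {"flatData.value_string": "black"}}
                                ]
                            }
                        }
                    }
                },
                {
                    "bool": {
                        "must": [
                            {
                                "nested": {
                                    "path": "flatData",
                                    "query": {
                                        "bool": {
                                            "must": [
                                                {"term": {"flatData.key": "size"}},
                                                {"term": {"flatData.value_long": 42}}
                                            ]
                                        }
                                    }
                                }
                            },
                            {
                                "nested": {
                                    "path": "flatData",
                                    "query": {
                                        "bool": {
                                            "must": [
                                                {"term": {"flatData.key": "price"}},
                                                {"range": {"flatData.value_double": {"lt": 200.0}}}
                                            ]
                                        }
                                    }
                                }
                            }
                        ]
                    }
                }
            ],
            "minimum_should_match": 1
        }
    }
}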

Gusta answered 20/10, 2017 at 20:59 Comment(4)
This is very useful if the number of fields/attributes is very high, say thousands. But if the fields are arbitrary yet still limited to, say, 100 or even 200, then I think Elasticsearch's dynamic mapping would be good enough.Lewie
Using the nested field type will add a performance penalty.Pommard
@Pommard indeed. Every software problem can be solved in multiple ways, with different trade-offs. In this case, I found that when I need to index hundreds, or even thousands, of different documents, the cost of the performance penalty for querying nested fields is lower than the cost of the gigabytes of RAM needed to hold all the index data in memory for fast access, rather than letting the machine "page" chunks of the index in from disk.Gusta
Of course, if memory cost is not an issue and you have the budget to afford, let's say, EC2 machines with 32 GB or even 64 GB of memory, then indexing documents as-is might work for you. But even then, if your data is highly disjoint, the 64 GB may not be enough to hold all the indexes in memory, and Elasticsearch will need to read indexes from disk, in which case your search query might take longer than a nested query.Gusta

Elasticsearch is a schema-less data store that allows dynamic indexing of new attributes, and there is no performance impact in having optional fields. Your first mapping is absolutely fine, and you can build boolean queries around your dynamic attributes. There is no inherent performance benefit in making them nested fields; they will be flattened at indexing time anyway, into fields like fields.type, fields.memorygb, etc.
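For example, with the documents indexed as-is, the second query from the question is a plain bool query (a sketch; on Elasticsearch 5+, an exact match on an analyzed string field would target the type.keyword sub-field instead):

{
    "query": {
        "bool": {
            "must": [
                {"term": {"type": "computer"}},
                {"range": {"memorygb": {"gt": 4}}}
            ]
        }
    }
}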

On the contrary, your last mapping, where you store the attributes as key-value pairs, will have a performance impact, since you will have to query on two different indexed fields, i.e. where key = 'memorygb' and value = 4.

Have a look at the documentation about dynamic mapping:

One of the most important features of Elasticsearch is its ability to be schema-less. There is no performance overhead if an object is dynamic, the ability to turn it off is provided as a safety mechanism so "malformed" objects won’t, by mistake, index data that we do not wish to be indexed.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-object-type.html

Parlin answered 18/2, 2015 at 13:47 Comment(1)
The first mapping is fine up to a certain point; if the number of arbitrary keys keeps increasing (an example is using UUIDs as field names), Elasticsearch can encounter a mapping explosion, which is why, starting from ES ~5, the default limit is 1000 fields per mapping (set in index.mapping.total_fields.limit).Thallic
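For reference, that limit is a per-index setting and can be raised when the field count is known to be bounded, e.g. with a PUT to my_index/_settings (index name illustrative):

{
    "index.mapping.total_fields.limit": 2000
}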

You need a filtered query; look here:

You have to use a range query together with a match query.
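In the Elasticsearch 1.x syntax current at the time, that combination looks roughly like this (a sketch; the filtered query was deprecated in 2.0 in favour of bool with a filter clause):

{
    "query": {
        "filtered": {
            "query": {"match": {"type": "computer"}},
            "filter": {"range": {"memorygb": {"gt": 4}}}
        }
    }
}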

Antofagasta answered 18/2, 2015 at 12:48 Comment(2)
That is kind of obvious, but I am looking not for how to query my documents, but for how to structure them for high-performance queries.Mammy
Look here https://mcmap.net/q/1315204/-elasticsearch-improve-query-performance and I suggest you use Hadoop. My friend solved the problem of slowness with Hadoop and recommended tips like loggly.com/blog/…Antofagasta
