aggregate a field in elasticsearch-dsl using python
Asked Answered
H

2

19

Can someone tell me how to write Python statements that will aggregate (sum and count) stuff about my documents?


SCRIPT

from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])

s = Search(using=client, index="attendance")
s = s.execute()

for tag in s.aggregations.per_tag.buckets:
    print (tag.key)

OUTPUT

File "/Library/Python/2.7/site-packages/elasticsearch_dsl/utils.py", line 106, in __getattr__
'%r object has no attribute %r' % (self.__class__.__name__, attr_name))
AttributeError: 'Response' object has no attribute 'aggregations'

What is causing this? Is the "aggregations" keyword wrong? Is there some other package I need to import? If a document in the "attendance" index has a field called emailAddress, how would I count which documents have a value for that field?

Herbivore answered 31/3, 2015 at 23:19 Comment(2)
May I ask you, whether you managed to answer your own questions or not? I'm facing exactly the same problems now - I do not know how to make count aggregation in elasticsearch-dslBlitzkrieg
Yes. I have hurdled a few obstacles since then. With help from coders of the dsl, I'm using what I consider a work-around to do this in Python. Unfortunately, I haven't had the time to use the purely DSL-way of doing this, but instead have been utilizing to_dict. I'll try to paste a good example.Herbivore
H
34

First of all. I notice now that what I wrote here, actually has no aggregations defined. The documentation on how to use this is not very readable for me. Using what I wrote above, I'll expand. I'm changing the index name to make for a nicer example.

from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])

s = Search(using=client, index="airbnb", doc_type="sleep_overs")
s = s.execute()

# invalid! You haven't defined an aggregation.
#for tag in s.aggregations.per_tag.buckets:
#    print (tag.key)

# Lets make an aggregation
# 'by_house' is a name you choose, 'terms' is a keyword for the type of aggregator
# 'field' is also a keyword, and 'house_number' is a field in our ES index
s.aggs.bucket('by_house', 'terms', field='house_number', size=0)

Above we're creating 1 bucket per house number. Therefore, the name of the bucket will be the house number. ElasticSearch (ES) will always give a document count of documents fitting into that bucket. Size=0 means to give use all results, since ES has a default setting to return 10 results only (or whatever your dev set it up to do).

# This runs the query.
s = s.execute()

# let's see what's in our results

print s.aggregations.by_house.doc_count
print s.hits.total
print s.aggregations.by_house.buckets

for item in s.aggregations.by_house.buckets:
    print item.doc_count

My mistake before was thinking an Elastic Search query had aggregations by default. You sort of define them yourself, then execute them. Then your response can be split b the aggregators you mentioned.

The CURL for the above should look like:
NOTE: I use SENSE an ElasticSearch plugin/extension/add-on for Google Chrome. In SENSE you can use // to comment things out.

POST /airbnb/sleep_overs/_search
{
// the size 0 here actually means to not return any hits, just the aggregation part of the result
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {
// the size 0 here means to return all results, not just the the default 10 results
                "field": "house_number",
                "size": 0
            }
        }
    }
}

Work-around. Someone on the GIT of DSL told me to forget translating, and just use this method. It's simpler, and you can just write the tough stuff in CURL. That's why I call it a work-around.

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])
s = Search(using=client, index="airbnb", doc_type="sleep_overs")

# how simple we just past CURL code here
body = {
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {
                "field": "house_number",
                "size": 0
            }
        }
    }
}

s = Search.from_dict(body)
s = s.index("airbnb")
s = s.doc_type("sleepovers")
body = s.to_dict()

t = s.execute()

for item in t.aggregations.by_house.buckets:
# item.key will the house number
    print item.key, item.doc_count

Hope this helps. I now design everything in CURL, then use Python statement to peel away at the results to get what I want. This helps for aggregations with multiple levels (sub-aggregations).

Herbivore answered 25/6, 2015 at 0:45 Comment(4)
Thank you, VISQL! I find your answer very informative and useful!Blitzkrieg
Thanks for this explanation. It really helped me! A couple of things might make the response easier to learn from. For one, after edits, there are unused imports that could be deleted. Second, in the final snippet, there's no need to reinstantiate Search. Replace "s = Search.from_dict(body)" with "s = s.from_dict(body)" and you won't need the next two lines of code resetting index and doc_type. Also, in the current example, "body = s.to_dict()" is not used for anything. Perhaps add a comment saying what you would do with the 'body' variable?Rodomontade
how can you add different aggregators? should you use metric or pipeline in case that you want something like aggs: [{}, {}, ...]Laniferous
has size changed its functionality? using it in version 7 returns me an error: [terms] failed to parse field [size]. how can i reproduce the same behaviour? thanksTortuga
O
2

I do not have the rep to comment yet but wanted to make a small fix on Matthew's comment on VISQL's answer regarding from_dict. If you want to maintain the search properties, use update_from_dict rather the from_dict.

According to the Docs , from_dict creates a new search object but update_from_dict will modify in place, which is what you want if Search already has properties such as index, using, etc

So you would want to declare the query body before the search and then create the search like this:

query_body = {
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {
                "field": "house_number",
                "size": 0
            }
        }
    }
}

s = Search(using=client, index="airbnb", doc_type="sleep_overs").update_from_dict(query_body)
Orvas answered 18/3, 2019 at 22:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.