Elasticsearch decay score based on occurrence
I'm trying to find a way to prevent multiple posts by the same author from appearing in a page of search results. So far I've tried random scoring, which lets me maintain pagination; however, I can still end up with up to 4 posts from the same author in a given page of 10 results.
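For reference, the random scoring mentioned above can be expressed with a `function_score` query using `random_score`. A minimal sketch of the request body, written as a Python dict (the index and page size here are placeholders, not from the original post):

```python
# Sketch of the random-scoring approach described above, expressed as an
# Elasticsearch function_score request body built as a plain Python dict.
# Using the same seed on every page keeps the shuffled order stable, which
# is what preserves pagination.

def random_score_query(seed, page=0, size=10):
    """Build a request body that shuffles all documents deterministically
    for the given seed, paging with from/size."""
    return {
        "from": page * size,
        "size": size,
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "random_score": {"seed": seed},
            }
        },
    }

# Page 2 of the shuffled results, same seed as page 1:
body = random_score_query(seed=42, page=1)
```

This only randomizes order; as the question notes, nothing in it can react to how often an author has already appeared in the result set.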

Is there any way to score a document based on how many times a certain field value occurs in the result set? As far as I'm aware, you cannot persist a variable or object across documents in a scoring script.

I've looked into several approaches, but each has significant drawbacks. One is to remove the duplicates and issue a second query that excludes the authors already shown; but that second result set can itself contain duplicate authors, so I'm left querying one by one to replace each duplicate in the page. That breaks deep pagination, because the replacement result set eventually runs out of pages before the main search does. I've also tried aggregations, which are not pageable.

Is there any functionality to spread out or reduce a document's score based on how many times a document with the same author (or any field value) already occurs?

Marva answered 8/12, 2014 at 23:53 Comment(0)
You cannot diversify Elasticsearch sorting. You can only score the documents with a random seed and hope for the best. You can use something like a top_hits aggregation to build one bucket per author, but you cannot paginate across a group of buckets, which breaks pagination.
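The top_hits-per-author aggregation the answer refers to can be sketched like this, again as a Python dict forming the request body (the field name `author_id` is an assumption for illustration):

```python
# Sketch of a terms aggregation with a top_hits sub-aggregation: one bucket
# per author, each holding that author's single best post. The limitation
# described above is that "from" applies inside a bucket, not across the
# set of buckets, so the buckets themselves cannot be paginated.

def one_post_per_author_agg(max_authors=100):
    return {
        "size": 0,  # suppress the raw hit list; we only want the buckets
        "aggs": {
            "by_author": {
                "terms": {"field": "author_id", "size": max_authors},
                "aggs": {
                    "best_post": {"top_hits": {"size": 1}},
                },
            }
        },
    }

agg_body = one_post_per_author_agg()
```

With 90 distinct authors this always returns 90 buckets, which is exactly the pagination problem discussed in the comments below.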

See here for more information

Marva answered 17/12, 2014 at 20:19 Comment(0)
Any reason you can't use grouping? Just group by user and define the order for the group.

Calcariferous answered 15/12, 2014 at 9:58 Comment(1)
If you're referring to buckets, you cannot paginate buckets. Think about it: I can create a bucket for each author and grab one hit per bucket. Let's say there are 90 authors (and this value changes); that query will give me 90 results each time, in 90 different buckets, so every page would contain one post per author. Each bucket itself is pageable, but a group of buckets is not: I can set from and size within a bucket, but not on the set of buckets. – Marva
EDIT: before you downvote this answer just because it is Lucene-related and not a direct answer to the question: 1. Elasticsearch is Lucene-based. 2. What the OP wants to do is genuinely hard, and I was just trying to help.

You could try to play around with decay from here:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-function-score-query.html

However, this doesn't allow back-referencing the previous hits of the current query, which is what your use case would require.
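For completeness, a decay clause from that page looks roughly like the sketch below (field name and values are placeholders). Note that it decays the score based on each document's own field value, here a date, not on how often an author has already appeared:

```python
# Sketch of a gauss decay inside a function_score query: the score of each
# document decays with the distance of its own "post_date" from "origin".
# As noted above, a decay function cannot reference the other hits in the
# same result set, so it cannot deduplicate authors.

def recency_decay_query(origin="2014-12-01", scale="10d"):
    return {
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "functions": [
                    {"gauss": {"post_date": {"origin": origin, "scale": scale}}}
                ],
            }
        },
    }

decay_body = recency_decay_query()
```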

I ran into a similar problem in a webapp where we used Lucene/Hibernate Search. I never got a satisfying result, and it still bothers me.

I think your best bet for a good user experience is to implement the ordering in another way.

Foxhound answered 13/12, 2014 at 23:18 Comment(3)
However, I would be really glad if you found a way and posted it here :) – Foxhound
And by the way, in my webapp I ended up pre-sorting things in my Java code and then setting the sort order on the query by hand. Since you are using Elasticsearch and not Lucene/Hibernate Search, that unfortunately won't work here. – Foxhound
This was my question back in the day: #21528991 – Foxhound

© 2022 - 2024 — McMap. All rights reserved.