Document Similarity in ElasticSearch
I want to calculate the similarity between two documents indexed in Elasticsearch. I know this can be done in Lucene using term vectors. What is the most direct way to do it?

I found that there is a similarity module doing exactly this: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html

How do I integrate this into my system? I am using pyelasticsearch to issue Elasticsearch commands, but I am open to using the REST API for similarity if needed.

Laggard answered 24/4, 2014 at 10:56 Comment(3)
Javanna has a great post about the difference between the MLT query and MLT API. This should help clarify the differences and give you more information on how it can work. https://mcmap.net/q/525621/-elasticsearch-quot-more-like-this-quot-api-vs-more_like_this-query – Afoul
I hope my answer helped; let me know if you have any questions. – Afoul
@Michaelatqbox.io the answer did not solve the issue I am facing. Both the MLT query and the MLT API help you search for "close" documents. I want to measure the closeness between two given documents. One can see that the first problem is more difficult, but I do not have a way to solve the second one. Looking forward to your reply. The problem is also described here: grokbase.com/t/gg/elasticsearch/131b9aa8xg/… – Laggard
I think the Elasticsearch documentation can easily be misinterpreted.

Here "similarity" is not a comparison of documents or fields but rather a mechanism for scoring matching documents based on matching terms from the query.

The documentation states:

A similarity (scoring / ranking model) defines how matching documents are scored.

The similarity algorithms that Elasticsearch supports are probabilistic models based on term distribution in the corpus (index).
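For illustration, index settings that define such a similarity (a ranking model applied when scoring query matches, not a document-comparison tool) might look roughly like this. The names `my_bm25` and `body` are made up for the example, the `k1`/`b` values are just BM25 defaults, and the exact mapping shape varies across Elasticsearch versions:

```python
# Sketch only: settings body defining a custom "similarity" (a scoring model).
# This tunes how matching documents are *ranked*; it does not compare two
# documents to each other. Names "my_bm25" and "body" are hypothetical.
import json

settings = {
    "settings": {
        "index": {
            "similarity": {
                "my_bm25": {
                    "type": "BM25",
                    "k1": 1.2,   # term-frequency saturation
                    "b": 0.75,   # document-length normalization
                }
            }
        }
    },
    "mappings": {
        # Typeless (7.x+) mapping shape; older versions nest this under a type.
        "properties": {
            "body": {"type": "text", "similarity": "my_bm25"}
        }
    },
}

print(json.dumps(settings, indent=2))
```

This body would be sent when creating the index (e.g. `PUT /my_index` over the REST API), after which scores for queries against `body` use the configured model.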

In regard to term vectors, this can also be misinterpreted.

Here "term vectors" refers to statistics for the terms of a document that can easily be queried. Any similarity measurement across term vectors would then have to be done in your application after the query. The documentation on term vectors states:

Returns information and statistics on terms in the fields of a particular document.
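As a minimal sketch of that client-side step: suppose you have fetched the `{term: term_freq}` maps for a field of two documents from the `_termvectors` endpoint (response shape and endpoint name vary by version; the maps below are hard-coded stand-ins for real responses). Cosine similarity over those maps is then a few lines of plain Python:

```python
# Sketch: compute cosine similarity in the application, post-query, from the
# per-document term-frequency maps that the _termvectors API exposes.
import math

def cosine_similarity(tv_a, tv_b):
    """Cosine similarity between two {term: frequency} maps."""
    dot = sum(freq * tv_b.get(term, 0) for term, freq in tv_a.items())
    norm_a = math.sqrt(sum(f * f for f in tv_a.values()))
    norm_b = math.sqrt(sum(f * f for f in tv_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Stand-ins for the {term: term_freq} maps of two indexed documents:
doc1 = {"search": 2, "engine": 1, "lucene": 1}
doc2 = {"search": 1, "engine": 1, "elasticsearch": 2}

print(cosine_similarity(doc1, doc2))  # 0.5 (identical documents give 1.0)
```

Weighting by raw `term_freq` is the simplest choice; the same endpoint can also return field statistics you could use for TF-IDF-style weights.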

If you need a performant (fast) similarity metric over a very large corpus, you might consider storing a low-rank embedding of each document in an index and running approximate nearest-neighbor (KNN) searches against it. The KNN lookup greatly reduces the candidate set, after which you can apply more costly metric calculations for ranking.
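The two-stage idea can be sketched as follows, with a crude random-projection embedding standing in for a real low-rank embedding, and a brute-force scan standing in for a proper ANN index; the corpus, dimension, and document IDs are all invented for illustration:

```python
# Sketch of the two-stage approach:
#  1) project documents into a low-dimensional space and find cheap
#     nearest-neighbor candidates there;
#  2) re-rank the small candidate set with a costlier exact metric.
# Random projection stands in for a learned low-rank embedding; a production
# system would use an ANN library instead of the brute-force scan shown here.
import math
import random

DIM = 32  # low-rank embedding size (illustrative)

def embed(term_freqs):
    """Hash each term to a fixed random direction and sum the weighted
    directions: a crude random-projection embedding of a {term: freq} map."""
    vec = [0.0] * DIM
    for term, freq in term_freqs.items():
        rng = random.Random(term)  # deterministic direction per term
        for i in range(DIM):
            vec[i] += freq * rng.gauss(0, 1)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = {
    "d1": {"search": 2, "engine": 1},
    "d2": {"search": 1, "engine": 1, "lucene": 1},
    "d3": {"cooking": 3, "recipes": 2},
}
index = {doc_id: embed(tf) for doc_id, tf in corpus.items()}

query = embed({"search": 1, "lucene": 1})
# Stage 1: cheap candidate lookup in the embedded space (top 2 of 3 here).
candidates = sorted(index, key=lambda d: -cosine(query, index[d]))[:2]
# Stage 2: apply the expensive similarity metric only to the candidates.
print(candidates)
```

The key property is that stage 1 only needs to be approximately right: as long as the true nearest documents survive into the candidate set, the exact metric in stage 2 fixes the final ordering.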

Here is an excellent resource for evaluation of approximate KNN solutions: https://github.com/erikbern/ann-benchmarks

Uphill answered 14/9, 2016 at 3:47 Comment(0)
