Dense vector array and cosine similarity

I would like to store an array of dense_vector values in a single document, but this does not work the way it does for other data types. For example:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_vectors": {
        "type": "dense_vector",
        "dims": 3  
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [[0.5, 10, 6], [-0.5, 10, 10]]
}

returns:

'1 document(s) failed to index.',
    {'_index': 'my_index', '_type': '_doc', '_id': 'some_id', 'status': 400, 'error': 
      {'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': 
        {'type': 'parsing_exception', 
         'reason': 'Failed to parse object: expecting token of type [VALUE_NUMBER] but found [START_ARRAY]'
        }
      }
    }

How do I achieve this? Different documents will have a variable number of vectors but never more than a handful.

Also, I would then like to query it by performing a cosineSimilarity for each value in that array. The code below is how I normally do it when I have only one vector in the doc.

"script_score": {
    "query": {
        "match_all": {}
    },
    "script": {
        "source": "(1.0+cosineSimilarity(params.query_vector, doc['my_vectors']))",
        "params": {"query_vector": query_vector}
    }
}

Ideally, I would like the closest similarity or an average.

Radome answered 22/4, 2020 at 22:43 Comment(0)
C
13

The dense_vector datatype expects one array of numeric values per document like so:

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [0.5, 10, 6]
}

To store any number of vectors, you could make the my_vectors field a "nested" type that holds an array of objects, each containing one vector:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_vectors": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "dense_vector",
            "dims": 3  
          }
        }
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [
    {"vector": [0.5, 10, 6]}, 
    {"vector": [-0.5, 10, 10]}
  ]
}
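
If the vectors start out as a plain list of lists (as in the question), they have to be wrapped into objects before indexing. A minimal sketch of that step in Python, assuming the nested mapping above; the helper name to_nested_vectors and its dims argument are hypothetical and not part of any Elasticsearch client:

```python
from typing import Dict, List, Sequence


def to_nested_vectors(
    vectors: Sequence[Sequence[float]], dims: int = 3
) -> List[Dict[str, List[float]]]:
    """Wrap each vector in an object so it fits the nested mapping."""
    wrapped = []
    for i, vec in enumerate(vectors):
        # The mapping declares a fixed "dims", so reject mismatched vectors early.
        if len(vec) != dims:
            raise ValueError(f"vector {i} has {len(vec)} dims, expected {dims}")
        wrapped.append({"vector": [float(x) for x in vec]})
    return wrapped


# Document body keyed by the mapped field name "my_vectors".
doc = {
    "my_text": "text1",
    "my_vectors": to_nested_vectors([[0.5, 10, 6], [-0.5, 10, 10]]),
}
```

The resulting doc dictionary can be sent as the body of the PUT request with whatever HTTP or Elasticsearch client you already use.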

EDIT

Then, to query the documents, you can use the following (as of ES v7.6.1):

{
  "query": {
    "nested": {
      "path": "my_vectors",
      "score_mode": "max", 
      "query": {
        "function_score": {
          "script_score": {
            "script": {
              "source": "(1.0+cosineSimilarity(params.query_vector, 'my_vectors.vector'))",
              "params": {"query_vector": query_vector}
            }
          }
        }
      }
    }
  }
}
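
Note that query_vector in the request above is a placeholder for an actual list of floats. A minimal sketch of building the same body in Python, assuming the my_vectors / vector field names from the mapping; the function name build_nested_cosine_query is hypothetical:

```python
import json
from typing import Any, Dict, Sequence

# score_mode values accepted by the nested query.
VALID_SCORE_MODES = {"avg", "max", "min", "none", "sum"}


def build_nested_cosine_query(
    query_vector: Sequence[float], score_mode: str = "max"
) -> Dict[str, Any]:
    """Return the search body with the query vector filled in."""
    if score_mode not in VALID_SCORE_MODES:
        raise ValueError(f"unsupported score_mode: {score_mode}")
    return {
        "query": {
            "nested": {
                "path": "my_vectors",
                "score_mode": score_mode,
                "query": {
                    "function_score": {
                        "script_score": {
                            "script": {
                                "source": "(1.0+cosineSimilarity(params.query_vector, 'my_vectors.vector'))",
                                "params": {
                                    "query_vector": [float(x) for x in query_vector]
                                },
                            }
                        }
                    }
                },
            }
        }
    }


body = build_nested_cosine_query([0.5, 10, 6])
print(json.dumps(body, indent=2))  # JSON to send to my_index/_search
```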

A few things to note:

  • The query needs to be wrapped in a nested query, because the vectors are stored as nested objects.
  • Because nested objects are separate Lucene documents, each nested object is scored individually. By default, the parent document is assigned the average score of its matching nested documents. You can set the nested query's score_mode to change this. For your case, "max" scores the parent by its largest cosine similarity, that is, by its most similar vector.
  • If you're interested in seeing the score of each nested vector, you can use the nested query's inner_hits option.
  • If anyone is curious why 1.0 is added to the cosine similarity: cosine similarity takes values in [-1, 1], but Elasticsearch does not allow negative scores, so the scores are shifted to [0, 2].
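
To make the scoring described above concrete, here is a small offline sketch in Python. It only mimics the arithmetic (1.0 + cosine similarity per nested vector, then "max" or "avg" across them); it is not what Elasticsearch runs internally, and the function names are hypothetical:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity; assumes equal length and non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def parent_score(
    query_vector: Sequence[float],
    nested_vectors: Sequence[Sequence[float]],
    score_mode: str = "max",
) -> float:
    """Combine per-vector scores the way score_mode "max" or "avg" would."""
    # Each nested vector gets its own score, shifted from [-1, 1] to [0, 2].
    scores = [1.0 + cosine_similarity(query_vector, v) for v in nested_vectors]
    if score_mode == "max":
        return max(scores)
    if score_mode == "avg":
        return sum(scores) / len(scores)
    raise ValueError(f"unsupported score_mode: {score_mode}")


query = [0.5, 10, 6]
vectors = [[0.5, 10, 6], [-0.5, 10, 10]]
best = parent_score(query, vectors, "max")      # close to 2.0: first vector equals the query
average = parent_score(query, vectors, "avg")   # lower, pulled down by the second vector
```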
Consequently answered 13/6, 2020 at 0:2 Comment(2)
I tried a nested mapping similar to the one in this answer, but I got the error "BadRequestError(400, 'illegal_argument_exception', "[dense_vector] fields cannot be indexed if they're within [nested] mappings")". As the following link says, dense vector fields cannot be indexed if they are within nested mappings. Am I missing something? elastic.co/guide/en/elasticsearch/reference/current/… Perquisite
Is it possible to combine a nested field and non-nested fields in the script source? I have the same field as in your example and also a dense_vector field named "title", and I'd like something similar to this: (1.0+cosineSimilarity(params.query_vector, 'my_vectors.vector') + cosineSimilarity(params.query_vector, 'title')), but it says that the title field is empty (I guess it's trying to search in the nested object). Fillander
A
0

According to the documentation, the dense_vector datatype

stores dense vectors of float values .... A dense_vector field is a single-valued field.

In your example, you want to index multiple vectors in the same property. But as the documentation says, the field must be single-valued. If you have multiple vectors per document, they need to be split across different properties.

No workaround :(

So you need to put the vectors in different fields, then loop over them in your script and keep the best value.
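
A rough sketch of what this could look like, written in Python that generates the script source. Instead of a loop inside the script, it unrolls one cosineSimilarity call per field and combines them with Math.max, since it is an assumption whether a field name held in a script variable would be accepted. The field names my_vector_1 and my_vector_2 and the helper build_max_cosine_script are hypothetical, and the generated script has not been tested against a cluster:

```python
from typing import List


def build_max_cosine_script(fields: List[str], missing_value: float = -1.0) -> str:
    """Build a script expression: 1.0 + the best cosine similarity over the fields."""
    if not fields:
        raise ValueError("at least one vector field is required")
    # Documents may lack some of the fields; fall back to the lowest
    # possible cosine value so a missing vector never wins the max.
    terms = [
        f"(doc['{f}'].size() == 0 ? {missing_value} : "
        f"cosineSimilarity(params.query_vector, '{f}'))"
        for f in fields
    ]
    expr = terms[0]
    for term in terms[1:]:
        expr = f"Math.max({expr}, {term})"
    return f"1.0 + {expr}"


# Hypothetical mapping with one dense_vector field per slot.
mapping = {
    "mappings": {
        "properties": {
            "my_vector_1": {"type": "dense_vector", "dims": 3},
            "my_vector_2": {"type": "dense_vector", "dims": 3},
            "my_text": {"type": "keyword"},
        }
    }
}

source = build_max_cosine_script(["my_vector_1", "my_vector_2"])
```

The resulting source string would go into the "source" of a script_score query, with query_vector passed in params as usual. This only works when the maximum number of vectors per document is small and known in advance.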

Alterable answered 23/4, 2020 at 12:34 Comment(1)
Thank you for your reply... I'd love to see an example of what you mean. Radome
F
0

I found this post while trying to store a set of vectors in my document.

When I do this:

"mappings": {
    "properties": {
        "vectors": {
            "type": "nested",
            "properties": {
                "vector": {
                    "type": "dense_vector",
                    "dims": 768,
                    "index": "true",
                    "similarity": "cosine"
                }
            }   
        },
        "my_text" : {
            "type" : "keyword"
        }
    }
}

I get:

BadRequestError: BadRequestError(400, 'illegal_argument_exception', "[dense_vector] fields cannot be indexed if they're within [nested] mappings")

If I remove "index": "true" and "similarity": "cosine", the problem goes away (but then I can't use kNN search, which is my main goal).

Hopefully this helps someone.

Frisk answered 17/3, 2023 at 19:5 Comment(0)
