How does the ChromaDB querying system work?

Asked 23/7, 2023 at 18:22 Answered 25/9, 2023 at 12:40

I am currently learning the ChromaDB vector database.

I can't understand how the querying process works.

When I query using text, it returns all documents.

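import chromadb

# In-memory client; the collection name here is arbitrary.
client = chromadb.Client()
collection = client.create_collection("my_collection")

collection.add(
    documents=["This is a document about cat", "This is a document about car"],
    metadatas=[{"category": "animal"}, {"category": "vehicle"}],
    ids=["id1", "id2"]
)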

results = collection.query(
    query_texts=["vehicle"],
    n_results=2
)

results

The output is:

{'ids': [['id2', 'id1']],
 'distances': [[0.8069301247596741, 1.648103952407837]],
 'metadatas': [[{'category': 'vehicle'}, {'category': 'animal'}]],
 'embeddings': None,
 'documents': [['This is a document about car',
   'This is a document about cat']]}

Even when I enter a word that is not present anywhere, it still returns all the documents.

Why does this happen?

Sculley answered 23/7, 2023 at 18:22 Comment(5)

Did you mean to use where={'category': 'vehicle'}? A simple query like what you did is always going to return the whole collection, and the 'distances' tells you how close the document was to your query text. query_texts doesn't look at the metadata. – Outmarch 23/7, 2023 at 18:25

@TimRoberts Okay. And how do I do a text similarity search inside the documents? For example, "What are the documents about car?" should return only "This is a document about car". – Sculley 23/7, 2023 at 18:38

No, it returns ALL the documents, but it tells you how likely it is that each document is about a car. Actually, it only returns the top n_results results. – Outmarch 23/7, 2023 at 18:43

@TimRoberts So the lower the distance, the better the match, right? – Sculley 23/7, 2023 at 19:0

I have no idea. The documentation doesn't say, and the authors apparently felt their code needed no comments. – Outmarch 23/7, 2023 at 22:40

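ChromaDB performs a similarity search on the embeddings stored as vectors, and cosine similarity is one of the metrics it supports. So it does not just match the word "vehicle" literally; it compares the meaning of the query, as captured by its embedding, with the embeddings of the documents you pass in. That is why a query word that appears in no document still gets results: every document has some distance to the query. You can read more about how cosine similarity works here - https://www.geeksforgeeks.org/cosine-similarity/#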

As for the embeddings, they are generated by default using all-MiniLM-L6-v2. You can read more about it in their documentation - https://docs.trychroma.com/embeddings
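
To make the distance values more concrete, here is a minimal sketch in plain Python. It is not Chroma's own code: the three-dimensional vectors are made-up stand-ins for real embeddings (all-MiniLM-L6-v2 produces 384-dimensional ones), and the helper names are hypothetical. It illustrates that for unit-length vectors, ranking by squared L2 distance and ranking by cosine similarity give the same order, because squared L2 = 2 * (1 - cosine similarity).

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance (0.0 means identical vectors)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy "embeddings" (assumed values, chosen only for illustration).
query = normalize([0.9, 0.1, 0.2])
docs = {
    "car": normalize([0.8, 0.2, 0.1]),
    "cat": normalize([0.1, 0.9, 0.3]),
}

# Smaller distance means a closer match, so sort ascending.
ranked = sorted(docs, key=lambda name: squared_l2(query, docs[name]))

for name in ranked:
    print(name, squared_l2(query, docs[name]), cosine_similarity(query, docs[name]))
```

Every document gets a distance, including the unrelated one, which is why a plain query never comes back empty.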

Heteropolar answered 3/8, 2023 at 18:19 Comment(0)

When given a query, ChromaDB retrieves the most similar vectors based on a similarity metric, such as cosine similarity or Euclidean distance. It returns the top n_results documents for each query. If you want to search for a specific string or filter on a metadata field, you can use the where and where_document arguments:

Source : https://docs.trychroma.com/usage-guide

collection.query(
    query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)

Dialogize answered 25/9, 2023 at 12:40 Comment(0)

I don't believe the OP is asking about the algorithm, but rather the rationale for the values returned.

Since the only input you provided for your query was text, it returned the number that you told it to, in the order in which the entries matched.

It does not say whether, or how well, the second result matched, just that it was second and, in this case, last. Had you included a "where" clause indicating that the only records you wanted were those for which category == vehicle (⬅️ Not valid ChromaDB syntax, BTW), then it would have honored that predicate.

To guide you in learning more about vector databases, I'll plant the seed that, when querying, it's not looking for documents that "match or don't match"; it's rating how semantically similar the input is to each document stored in the DB collection. With this in mind, you can understand that for some documents, the similarity might be 0, or close to it. But even 0 is somewhere in the ordered list of scores, right?
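
If you want only the close matches rather than the full ranking, one option is to apply a distance cutoff to the results yourself. The sketch below is plain Python operating on the result dictionary from the question; filter_by_distance is a hypothetical helper, not part of ChromaDB, and the cutoff of 1.0 is an arbitrary assumption that you would need to tune for your own data and embedding model.

```python
def filter_by_distance(results, max_distance):
    """Keep only (id, document, distance) triples within max_distance.

    `results` is a dict shaped like the output of collection.query():
    one inner list per query text.
    """
    kept = []
    for ids, docs, dists in zip(
        results["ids"], results["documents"], results["distances"]
    ):
        kept.append(
            [
                (doc_id, doc, dist)
                for doc_id, doc, dist in zip(ids, docs, dists)
                if dist <= max_distance
            ]
        )
    return kept


# Result copied from the question (query text: "vehicle").
results = {
    "ids": [["id2", "id1"]],
    "distances": [[0.8069301247596741, 1.648103952407837]],
    "documents": [
        ["This is a document about car", "This is a document about cat"]
    ],
}

# Assumed cutoff; lower distance means a closer match.
close = filter_by_distance(results, max_distance=1.0)
print(close)
```

With this cutoff only the "car" document survives; raising it to 2.0 would let the "cat" document back in.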

Althorn answered 23/9, 2023 at 21:2 Comment(0)
