ChromaDb add single document, only if it doesn't exist

I'm working with langchain and ChromaDb using python.

Now, I know how to use document loaders. For instance, the below loads a bunch of documents into ChromaDb:

from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

from langchain.vectorstores import Chroma
db = Chroma.from_documents(docs, embeddings, persist_directory='db')
db.persist()

But what if I wanted to add a single document at a time? More specifically, I want to check if a document exists before I add it. This is so I don't keep adding duplicates.

If a document does not exist, only then do I want to get embeddings and add it.

How do I do this using langchain? I think I mostly understand langchain but have no idea how to do seemingly basic tasks like this.

Justajustemilieu answered 16/5, 2023 at 17:15 Comment(0)
8

Filter based solely on the Document's Content

Here is an alternative filter that deduplicates by content. It uses a list comprehension trick that relies on the short-circuit evaluation of Python's and/or operators:

import uuid

# Create a deterministic id for each document based on its content
ids = [str(uuid.uuid5(uuid.NAMESPACE_DNS, doc.page_content)) for doc in docs]

# Keep only the first document for each id, preserving the original order
seen_ids = set()
unique_docs = [doc for doc, id in zip(docs, ids) if id not in seen_ids and (seen_ids.add(id) or True)]

# Deduplicate the ids in first-seen order so that unique_ids[i] belongs to unique_docs[i]
# (list(set(ids)) would scramble the order and pair documents with the wrong ids)
unique_ids = list(dict.fromkeys(ids))

# Add the unique documents to your database
db = Chroma.from_documents(unique_docs, embeddings, ids=unique_ids, persist_directory='db')

In the first step, a UUID is generated for each document with the uuid.uuid5() function, which derives the UUID from the SHA-1 hash of a namespace identifier and a name string (in this case, the content of the document). Identical content therefore always produces the same ID, and different content produces a different one.
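To illustrate why content-derived IDs are suitable for deduplication, here is a small sketch that uses only the standard library. The helper name content_id and the sample strings are made up for this example.

```python
import uuid

def content_id(text: str) -> str:
    """Return a deterministic version-5 UUID derived from the text."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, text))

# Identical content always maps to the same id ...
same_a = content_id("hello world")
same_b = content_id("hello world")
# ... while different content maps to a different id.
other = content_id("hello world!")

print(same_a == same_b)  # True
print(same_a == other)   # False
```

Because the ID depends only on the text, it stays stable across runs, which is what makes it usable as a key for "have I stored this already?" checks.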

The if condition in the list comprehension checks whether the ID of the current document exists in the seen_ids set:

  • If it doesn't exist, this implies the document is unique. It gets added to seen_ids using seen_ids.add(id), and the document gets included in unique_docs.
  • If it does exist, the document is a duplicate and gets ignored.

The or True at the end is necessary to always return a truthy value to the if condition, because seen_ids.add(id) returns None (which is falsy) even when an element is successfully added.
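If the one-liner is hard to follow, the same filtering can be written as an explicit loop. This is only a sketch: plain strings stand in for the documents and their ids, and dedupe_by_id is a hypothetical helper name.

```python
def dedupe_by_id(items, ids):
    """Keep the first item for each id, preserving the original order."""
    seen_ids = set()
    unique_items = []
    for item, item_id in zip(items, ids):
        if item_id not in seen_ids:
            seen_ids.add(item_id)
            unique_items.append(item)
    return unique_items

# set.add() returns None, which is why the comprehension needs "or True"
print(set().add("a") is None)  # True

items = ["doc A", "doc B", "doc A again"]
ids = ["id-1", "id-2", "id-1"]
print(dedupe_by_id(items, ids))  # ['doc A', 'doc B']

# The comprehension form yields the same list
seen = set()
same = [x for x, i in zip(items, ids) if i not in seen and (seen.add(i) or True)]
print(same == dedupe_by_id(items, ids))  # True
```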

This approach is more practical than generating IDs using URLs or other document metadata, as it directly prevents the addition of duplicate documents based on content rather than relying on metadata or manual checks.

Waxy answered 19/6, 2023 at 11:29 Comment(4)
Can this UUID5 be used as the id for the doc in the chromaDB? If so, I should be able to query the DB for all the IDs when I am making future additions to the DB, then remove docs I have loaded from my source dir that have IDs already present in the DB. Would something like this work? - Turpin
Tested this and it works perfectly. Kudos @Justin Dehorty - Turpin
Even though it's a fancy trick, I would argue something like unique_docs = [] for doc, id in zip(docs, ids): if id not in seen_ids: unique_docs.append(doc) seen_ids.add(id) is way more readable - Weathercock
I have difficulty understanding the logic of the assignment of unique_docs. - Urethritis
11

I think there are better ways to do that but here's what I found after reading the library:

If you look at the Chroma.from_documents() method, it takes an ids param.

def from_documents(
        cls: Type[Chroma],
        documents: List[Document],
        embedding: Optional[Embeddings] = None,
        ids: Optional[List[str]] = None, # <--------------- here
        collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
        persist_directory: Optional[str] = None,
        client_settings: Optional[chromadb.config.Settings] = None,
        client: Optional[chromadb.Client] = None,
        **kwargs: Any,
    ) -> Chroma:

Using this param you can set your predefined id for your documents. If you don't pass any ids, it will create some random ids. See the ref below from the langchain library:

# TODO: Handle the case where the user doesn't provide ids on the Collection
if ids is None:
    ids = [str(uuid.uuid1()) for _ in texts]

So, the workaround is to assign a unique id/key to each document when you store it. In my case, I took the unique URL of each document, converted it to a hash, and passed the hashes in the ids param. Later, when you store documents again, check for each document whether its id already exists in the DB, remove those documents from docs (from your sample code), and finally call Chroma.from_documents() with the duplicates removed from the list. See the sample below, which builds on your sample code.

# step 1: generate some unique ids for your docs
# step 2: check your Chroma DB and remove duplicates
# step 3: store the docs without duplicates

# assuming your docs ids are in the ids list and your docs are in the docs list

db = Chroma.from_documents(docs, embeddings, ids=ids, persist_directory='db')
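A minimal sketch of the three steps might look like the following. It assumes the store offers a get(ids=...) call that returns a dict with an "ids" key, as a Chroma collection does. The names hash_url, filter_new, and FakeCollection are hypothetical; the in-memory stand-in is there only so the example runs without a database, and the URLs are placeholders.

```python
import hashlib

def hash_url(url: str) -> str:
    """Step 1: derive a stable id from a document's unique URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def filter_new(collection, docs, ids):
    """Step 2: drop docs whose id is already stored in the collection."""
    existing = set(collection.get(ids=ids)["ids"])
    kept = [(d, i) for d, i in zip(docs, ids) if i not in existing]
    return [d for d, _ in kept], [i for _, i in kept]

class FakeCollection:
    """Hypothetical in-memory stand-in for a Chroma collection."""
    def __init__(self, stored_ids):
        self._stored = set(stored_ids)

    def get(self, ids):
        # Mimics returning only the ids that are actually present
        return {"ids": [i for i in ids if i in self._stored]}

urls = ["https://example.com/a", "https://example.com/b"]
docs = ["text of a", "text of b"]
ids = [hash_url(u) for u in urls]

collection = FakeCollection(stored_ids=[ids[0]])  # "a" was stored earlier
new_docs, new_ids = filter_new(collection, docs, ids)
print(new_docs)  # ['text of b']

# Step 3 (not run here): store only what is new, for example
# if new_docs:
#     db = Chroma.from_documents(new_docs, embeddings, ids=new_ids, persist_directory='db')
```

Filtering before the store call means embeddings are only computed for documents that are actually new.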
Pronator answered 17/5, 2023 at 8:1 Comment(6)
Thanks so much! This is why I love StackOverflow. Super useful, and super well-explained! - Justajustemilieu
Oh, man. Stuck again (didn't take long!). How do you get single documents from Chromadb using the id? - Justajustemilieu
Ah, got it: db._collection.get('id') - Justajustemilieu
cheers to that! - Pronator
Are the ids list and docs list paired? Like ids[0] is for docs[0]? And you store the ids to check every time if there's a new file or not? - Stogner
Yes, they are synced. LangChain iterates through the documents and embeds them at identical indices. @RChang - Pronator
3

Assuming that you load the document via LangChain like this:

from langchain.document_loaders import TextLoader
loader = TextLoader("hello_directory/world.txt")

Each document stored in Chroma DB will then carry metadata that looks like this:

metadata={'source': 'hello_directory/world.txt'}

So before indexing a new [text] file, first query whether Chroma already has entries for that file path:

results = chroma_collection.get(
  where={"source": "hello_directory/world.txt"},
  include=["metadatas"],
)

Then you can decide whether to proceed with indexing that document:

if len(results["ids"]) > 0:
    print("Document already exists. Skipping...")
else:
    print("Loading document...")
    # Index logic here
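The check above can be wrapped in a small helper, sketched here under the assumption that the collection's get() accepts where and include as in the snippet above. source_already_indexed and FakeCollection are hypothetical names; the stand-in class only imitates that call so the example runs without a database.

```python
def source_already_indexed(collection, source: str) -> bool:
    """Return True if any stored document has this 'source' metadata value."""
    results = collection.get(where={"source": source}, include=["metadatas"])
    return len(results["ids"]) > 0

class FakeCollection:
    """Hypothetical stand-in imitating get(where=..., include=...)."""
    def __init__(self, metadatas):
        self._metadatas = metadatas  # maps id -> metadata dict

    def get(self, where, include=None):
        matches = [doc_id for doc_id, meta in self._metadatas.items()
                   if all(meta.get(k) == v for k, v in where.items())]
        return {"ids": matches,
                "metadatas": [self._metadatas[m] for m in matches]}

collection = FakeCollection({"1": {"source": "hello_directory/world.txt"}})

for path in ["hello_directory/world.txt", "hello_directory/new.txt"]:
    if source_already_indexed(collection, path):
        print(f"{path}: already exists, skipping")
    else:
        print(f"{path}: loading")
```

Note that this only detects a file path that was indexed before; it does not notice if the file's content has changed since then.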
Bordie answered 10/10, 2023 at 22:55 Comment(0)
