There may be better ways to do this, but here's what I found after reading the library:
If you look at the Chroma.from_documents() method, it takes an ids param.
def from_documents(
    cls: Type[Chroma],
    documents: List[Document],
    embedding: Optional[Embeddings] = None,
    ids: Optional[List[str]] = None,  # <--------------- here
    collection_name: str = _LANGCHAIN_DEFAULT_COLLECTION_NAME,
    persist_directory: Optional[str] = None,
    client_settings: Optional[chromadb.config.Settings] = None,
    client: Optional[chromadb.Client] = None,
    **kwargs: Any,
) -> Chroma:
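Using this param, you can set your own predefined id for each document. If you don't pass any ids, the library generates a new UUID for every text, so storing the same document twice gives it two different ids and it ends up duplicated. See the excerpt below from the langchain library: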
# TODO: Handle the case where the user doesn't provide ids on the Collection
if ids is None:
    ids = [str(uuid.uuid1()) for _ in texts]
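To see why that default leads to duplicates, here is a minimal sketch using only the standard library. The texts list is made up for illustration and stands in for your documents; the two comprehensions mimic two separate calls that store the same texts.

```python
import uuid

# Hypothetical stand-in for the texts of your documents.
texts = ["same document", "another document"]

# Mimic two separate store calls over the same texts.
first_run = [str(uuid.uuid1()) for _ in texts]
second_run = [str(uuid.uuid1()) for _ in texts]

# The ids never match, so the store cannot tell the texts were seen before.
print(first_run == second_run)
```

Since the ids differ on every call, the same text is simply stored again under a new id.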
So the workaround is to set a unique id/key for each document when you store it. In my case, I took the unique URL of each document, converted it to a hash, and passed the hashes in the ids param. When you store documents again, check the store for each document's id, remove the ones that already exist in the DB from docs (from your sample code), and finally call Chroma.from_documents() with the duplicates removed from the list. See the sample below, which refers to your sample code.
# step 1: generate some unique ids for your docs
# step 2: check your Chroma DB and remove duplicates
# step 3: store the docs without duplicates
# assuming your docs ids are in the ids list and your docs are in the docs list
db = Chroma.from_documents(docs, embeddings, ids=ids, persist_directory='db')
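Steps 1 and 2 could look roughly like the sketch below. It is only an illustration under a few assumptions: Doc is a stand-in for langchain's Document so the snippet runs on its own, the URL is assumed to live under a "source" metadata key, and url_to_id and drop_existing are helper names I made up. The existing_ids set is hard-coded here; in your code it would come from whatever your store returns.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain's Document, so the sketch runs on its own."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def url_to_id(url: str) -> str:
    """Step 1: derive a stable id from a document's unique URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def drop_existing(docs, existing_ids):
    """Step 2: keep only docs whose id is not stored yet.

    Repeats within `docs` itself are dropped as well.
    """
    seen = set(existing_ids)
    new_docs, new_ids = [], []
    for doc in docs:
        # Assumption: the unique URL is kept under the "source" key.
        doc_id = url_to_id(doc.metadata["source"])
        if doc_id not in seen:
            seen.add(doc_id)
            new_docs.append(doc)
            new_ids.append(doc_id)
    return new_docs, new_ids


docs = [
    Doc("first page", {"source": "https://example.com/a"}),
    Doc("second page", {"source": "https://example.com/b"}),
]
# Pretend the first URL was already stored on an earlier run.
existing_ids = {url_to_id("https://example.com/a")}

new_docs, new_ids = drop_existing(docs, existing_ids)
print([d.metadata["source"] for d in new_docs])

# Step 3 (not run here): pass the filtered lists to Chroma, e.g.
# if new_docs:
#     db = Chroma.from_documents(new_docs, embeddings, ids=new_ids, persist_directory='db')
```

The hash is deterministic, so the same URL always maps to the same id, which is what makes the existence check possible. To fill existing_ids from the real store, the Chroma wrapper appears to offer a get(ids=...) call that returns a dict with an "ids" key, but please verify that against the langchain version you have installed.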