Langchain / ChromaDB: Why does VectorStore return so many duplicates?

Asked 27/11, 2023 at 8:1 Answered 21/4, 2024 at 4:11

Solved python openai-api langchain py-langchain chromadb

import os
from langchain.llms import OpenAI
import bs4
import langchain
from langchain import hub
from langchain.document_loaders import UnstructuredFileLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

os.environ["OPENAI_API_KEY"] = "KEY"

loader = UnstructuredFileLoader(
    'path_to_file'
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

retrieved_docs = retriever.get_relevant_documents(
    "What is X?"
)

This returns:

[Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
 Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
 Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
 Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
 Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932}),
 Document(page_content="...", metadata={'source': 'path_to_text', 'start_index': 16932})]

These all appear to be the same document.

When I first ran this code in Google Colab / Jupyter Notebook, it returned different documents. As I ran it more, it started returning the same document repeatedly. That makes me suspect a database issue, where the same entries are inserted into the database on every run.

How do I return 6 different unique documents?

Poleaxe answered 27/11, 2023 at 8:1 Comment(0)

The issue is here:

Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

Every time you execute the file, you insert the same documents into the database again.

If you are inserting from the same file, you could comment out that part of the code. Alternatively, you could detect near-identical vectors with EmbeddingsRedundantFilter, which is described as:

Filter that drops redundant documents by comparing their embeddings.
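Another way to make reruns harmless is to give every chunk a stable ID, so that inserting the same chunk again targets the same record instead of creating a new one. Below is a minimal sketch of that idea. `make_chunk_ids` is a hypothetical helper, not part of LangChain or Chroma, and the sample chunks are made up for illustration.

```python
import hashlib
import json


def make_chunk_ids(chunks):
    """Return one deterministic ID per (text, metadata) pair.

    Hypothetical helper: the same text and metadata always hash to the
    same ID, so repeated runs produce identical IDs.
    """
    ids = []
    for text, metadata in chunks:
        # sort_keys makes the serialization independent of dict ordering
        payload = json.dumps(
            {"text": text, "metadata": metadata}, sort_keys=True, default=str
        )
        ids.append(hashlib.sha256(payload.encode("utf-8")).hexdigest())
    return ids


# Made-up chunks standing in for the output of the text splitter
chunks = [
    ("first chunk", {"source": "path_to_file", "start_index": 0}),
    ("second chunk", {"source": "path_to_file", "start_index": 800}),
]
ids = make_chunk_ids(chunks)
```

With the code from the question, the pairs would be `[(d.page_content, d.metadata) for d in all_splits]`, and the result could be passed as `ids=ids` to `Chroma.from_documents`. My understanding is that the LangChain Chroma wrapper upserts when IDs are supplied, so a rerun should overwrite existing entries rather than append new ones. Check this against the version you have installed.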

Maltreat answered 29/11, 2023 at 3:25 Comment(3)

Is there a way to reset Chroma / drop the database? – Poleaxe 29/11, 2023 at 5:43

vectorstore has delete_collection – Maltreat 29/11, 2023 at 5:47

@Poleaxe if you persist to disk you can just delete the folder containing the database. – Leucocratic 27/1, 2024 at 2:4
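For the persisted case mentioned in the last comment, a small sketch of removing the database folder before rebuilding the store. `reset_persist_dir` and the `chroma_db` folder name are hypothetical; use whatever path you passed as `persist_directory`. The demo runs against a throwaway temporary folder rather than a real database.

```python
import shutil
import tempfile
from pathlib import Path


def reset_persist_dir(persist_dir):
    """Delete the folder holding a persisted store, if it exists.

    Returns True when something was removed, False otherwise.
    """
    path = Path(persist_dir)
    if path.exists():
        shutil.rmtree(path)
        return True
    return False


# Demo on a temporary folder standing in for a persisted database
demo_dir = Path(tempfile.mkdtemp()) / "chroma_db"
demo_dir.mkdir()
(demo_dir / "placeholder.txt").write_text("stand-in for database files")

removed = reset_persist_dir(demo_dir)        # folder existed, so it is removed
removed_again = reset_persist_dir(demo_dir)  # nothing left to remove
```

It is probably safest to do this before the store is created in the current process (or after restarting the notebook kernel), since an open client may still hold the database files. For an in-memory store, `delete_collection` from the comment above is the alternative.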

I wrote this simple function to list the unique source files behind the embedded documents in a Chroma vector store. It collects the source of every stored entry, including duplicated ones, and prints each source once:

## Get the list of all source files in the vector db

def get_unique_files(db):
    # Fetch the stored data once instead of calling db.get() on every iteration
    data = db.get()
    print("\nEmbedding keys:", data.keys())
    print("\nNumber of embedded docs:", len(data["ids"]))

    # Collect the source file recorded in each document's metadata
    file_list = [metadata["source"] for metadata in data["metadatas"]]

    # dict.fromkeys keeps each source once, in first-seen order
    unique_list = list(dict.fromkeys(file_list))

    print("\nList of unique files in db:\n")
    for unique_file in unique_list:
        print(unique_file)

Call the function with your vector store (named vectordb here):

get_unique_files(vectordb)

This prints each source file behind the embedded content exactly once:

Embedding keys: dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data'])

Number of embedded docs: 140

List of unique files in db:

pdf-files/leadership-team.pdf
pdf-files/report-summary.pdf
csv-files/small-csv.csv
ppt-content/presentation.pptx
csv-files/dataset-04-17-2024.csv
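To check whether the store actually holds repeated chunks (the symptom in the question), the same `db.get()` data can be grouped by source, start index, and text. This is a rough sketch: `count_duplicate_chunks` is a hypothetical helper, and `sample` is made-up data shaped like the dictionary shown above, so it runs without a real database.

```python
from collections import Counter


def count_duplicate_chunks(store_data):
    """Map each repeated (source, start_index, text) key to its count."""
    keys = [
        # metadata can be missing for some entries, so fall back to {}
        ((meta or {}).get("source"), (meta or {}).get("start_index"), text)
        for meta, text in zip(store_data["metadatas"], store_data["documents"])
    ]
    return {key: n for key, n in Counter(keys).items() if n > 1}


# Made-up data in the shape returned by db.get()
sample = {
    "ids": ["a", "b", "c"],
    "metadatas": [
        {"source": "path_to_text", "start_index": 16932},
        {"source": "path_to_text", "start_index": 16932},
        {"source": "path_to_text", "start_index": 0},
    ],
    "documents": ["same text", "same text", "other text"],
}
duplicates = count_duplicate_chunks(sample)
```

With a real store you would call `count_duplicate_chunks(vectordb.get())`. A non-empty result suggests the same chunks were inserted more than once.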

Floaty answered 21/4, 2024 at 4:11 Comment(0)
