LangChain Chroma - load data from Vector Database

I have written LangChain code that uses Chroma DB as a vector store for data from a website URL. It currently fetches the data from the URL, stores it in the project folder, and then uses that data to respond to a user prompt. I figured out how to make the data persist after the run, but I can't figure out how to load it for future prompts. The goal is that when user input is received, the program uses the OpenAI LLM to generate a response based on the existing database files, rather than creating and writing those files on each run. How can this be done?

I tried the following, since it seemed like the ideal solution:

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=vectordb)

But from_chain_type() doesn't accept a vectorstore argument, so this doesn't work.

Twostep answered 12/5, 2023 at 0:15 Comment(0)
14

You need to define a retriever and pass it to the chain. The chain will then use your previously persisted DB for queries.

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

Lewd answered 19/5, 2023 at 19:52 Comment(5)
How do I see the context that is generated from the vector DB for a given prompt and passed to the QA model? - Twostep
retriever.get_relevant_documents("my query") should return 4 (the default) documents from the store that match your query. - Ostracod
Is it possible to do this with the JavaScript version of LangChain? - Lourielouse
@LucaFoppiano - why 4? If I want a summary of the entire corpus, it should read all documents. - Bridgettebridgewater
@Nguaial 4 is the default number of relevant documents returned. You can change it with, for example, retriever = db.as_retriever(search_kwargs={"k": 10}). - Ostracod
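
To make the role of k in the comments above more concrete, here is a minimal pure-Python sketch of top-k retrieval. It is not how Chroma is implemented (a real store uses an index and its own distance metric); the function names, the cosine metric, and the toy 2-dimensional "embeddings" are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=4):
    """Return the texts of the k stored entries most similar to the query.

    `stored` is a list of (text, vector) pairs. k=4 mirrors the default
    mentioned in the comments; pass a larger k to retrieve more chunks.
    """
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors invented for this example (hypothetical data).
stored = [
    ("chunk about cats", [1.0, 0.0]),
    ("chunk about dogs", [0.9, 0.1]),
    ("chunk about birds", [0.7, 0.7]),
    ("chunk about fish", [0.5, 0.9]),
    ("chunk about boats", [0.1, 1.0]),
    ("chunk about cars", [0.0, 1.0]),
]

print(top_k([1.0, 0.0], stored))       # four most similar chunks
print(top_k([1.0, 0.0], stored, k=2))  # only the two most similar
```

Raising k returns more chunks, but a retriever still only returns the closest matches, so summarizing an entire corpus generally needs a different approach than similarity search.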
7

All the answers I have seen are missing one crucial step: calling persist() on the DB. For a complete solution, you need to perform the following steps.

Create the DB the first time and persist it using the lines below.

vectordb = Chroma.from_documents(data, embedding=embeddings, persist_directory = persist_directory)
vectordb.persist()

The DB can then be loaded using the line below.

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
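
The two steps above can be combined into one helper that only builds the DB when nothing has been persisted yet, which is what the question asks for. This is a sketch under assumptions: has_persisted_data, open_or_build_store, and the load_documents callable are hypothetical names, the "directory is non-empty" check is a simple heuristic, and the import path matches the LangChain versions used in this thread.

```python
import os


def has_persisted_data(persist_directory):
    """Return True if the directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and bool(os.listdir(persist_directory))


def open_or_build_store(persist_directory, embeddings, load_documents):
    """Load the persisted store if present; otherwise build and persist it.

    `load_documents` is a hypothetical zero-argument callable returning the
    documents to index. It is only called when nothing is persisted yet.
    """
    # Imported lazily so the directory check works without LangChain installed.
    from langchain.vectorstores import Chroma

    if has_persisted_data(persist_directory):
        # Reuse the existing files instead of re-embedding everything.
        return Chroma(persist_directory=persist_directory, embedding_function=embeddings)

    vectordb = Chroma.from_documents(
        load_documents(), embedding=embeddings, persist_directory=persist_directory
    )
    # An explicit persist() is assumed to be needed, as in this answer;
    # newer Chroma releases may write to disk automatically.
    if hasattr(vectordb, "persist"):
        vectordb.persist()
    return vectordb
```

On the first run this would embed and store the documents; on later runs it would skip straight to loading, so the website data does not need to be fetched again.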
Nyx answered 14/9, 2023 at 16:26 Comment(2)
I think even this has become outdated right now. - Kieffer
@Kieffer indeed - how so though? - Abdel
3

I have tried to use the Chroma vector store loader as well, but my code won't load the DB from the disk. Here is what I did:

from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFDirectoryLoader
import os
import json

def load_api_key(secrets_file="secrets.json"):
    with open(secrets_file) as f:
        secrets = json.load(f)
    return secrets["OPENAI_API_KEY"]

# Setup
api_key = load_api_key()
os.environ["OPENAI_API_KEY"] = api_key

# load the documents
loader = PyPDFDirectoryLoader("LINK TO FOLDER WITH PDF")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load docs into Chroma DB
db = Chroma.from_documents(docs, embedding_function)

# query the DB (use a separate name so the chunk list in `docs` is not overwritten)
query = "MY QUERY"
results = db.similarity_search(query)

# print results
print(results[0].page_content)

# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")

So far no problems! Then when I load the DB with this code:

# load from disk
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
db3.get() 
docs = db3.similarity_search(query)
print(docs[0].page_content)

The db3.get() call already shows that db3 contains no data. It returns:

{'ids': [], 'embeddings': None, 'documents': [], 'metadatas': []}

Any ideas why this could be?

Percolation answered 3/8, 2023 at 14:16 Comment(5)
When I use FAISS instead of Chroma as the vector store, it works. Simply replace the respective lines with db = FAISS.from_documents(docs, embedding_function), db2 = db.save_local("faiss_index") and db3 = FAISS.load_local("faiss_index", embedding_function). - Percolation
If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context. - From Review - Bollworm
I get the same issue. - Matthews
I didn't find the right solution, but I found a workaround: call db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db") in the place where you need it (and then db2.similarity_search(query)). - Matthews
You're not calling db.persist(). I'm pretty sure you also need to specify a collection name both when you create the initial object and when you load it from disk. - Mareld
0

I just found that the following works (chromadb_client and langchain_embedding_function are defined elsewhere):

def fetch_embeddings(collection_name):
    collection = chromadb_client.get_collection(
        name=collection_name, embedding_function=langchain_embedding_function
    )
    embeddings = collection.get(include=["embeddings"])

    print(collection.get(include=["embeddings", "documents", "metadatas"]))

    return embeddings

Reference: https://docs.trychroma.com/usage-guide

Matthews answered 17/8, 2023 at 19:33 Comment(0)
0

Chroma provides get_collection; see

https://docs.trychroma.com/reference/Client#get_collection

Here's an example of my code to query an existing vector store:

def get(embedding_function):
    db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
    print(db.get().keys())
    print(len(db.get()["ids"]))

The code's output, with 7580 chunks as an example:

Using embedded DuckDB with persistence: data will be stored in: ./chroma_db
dict_keys(['ids', 'embeddings', 'documents', 'metadatas'])
7580
Truc answered 20/8, 2023 at 9:42 Comment(0)
0

RetrievalQA is itself a chain. This is how we import it:

from langchain.chains import RetrievalQA

Every chain has two important components: a PromptTemplate and an LLM. RetrievalQA needs to get documents and stuff them into its own PromptTemplate. That is what this argument is for:

chain_type="stuff",

RetrievalQA has another keyword argument, retriever. It is the link between the RetrievalQA chain and different vector stores: RetrievalQA retrieves documents from a vector store through the retriever, and the vector store does the similarity search and returns the documents to RetrievalQA. You created

vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

Now have RetrievalQA communicate with this vector store through a retriever:

qa = RetrievalQA.from_chain_type(llm=llm, 
                                 chain_type="stuff",
                                 # this will make similarity search in vectordb
                                 retriever=vectordb.as_retriever())
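
As a rough illustration of what "stuffing" means here, the sketch below joins the retrieved texts into one context string and places it in a prompt template. The template wording and the build_stuff_prompt helper are hypothetical and are not LangChain's actual prompt or API; they only show the idea.

```python
# Hypothetical template, not LangChain's actual prompt text.
TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(documents, question):
    """Put every retrieved text into a single prompt (the "stuff" idea)."""
    context = "\n\n".join(documents)
    return TEMPLATE.format(context=context, question=question)


# Texts that a retriever might have returned (invented for this example).
retrieved = ["Chroma stores embeddings on disk.", "A retriever returns the closest chunks."]
print(build_stuff_prompt(retrieved, "Where are embeddings stored?"))
```

Because all retrieved chunks go into one prompt, this approach only works while the combined text fits in the model's context window.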
Anterior answered 24/10, 2023 at 0:41 Comment(0)
0
def load_api_key(secrets_file="secrets.json"):
    with open(secrets_file) as f:
        secrets = json.load(f)
    return secrets["OPENAI_API_KEY"]

Instead of doing this, you can create a .env file (a secrets file) and put your OpenAI key in it, like this:

OPENAI_API_KEY = "<your_key>"

Then import it in your main file:

from dotenv import load_dotenv

and call it in your main function:

load_dotenv()
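
For readers curious what loading a .env file involves, here is a rough standard-library approximation. It is not python-dotenv's implementation and handles only simple KEY = "value" lines; load_env_file is a hypothetical helper written for this example.

```python
import os


def load_env_file(path=".env"):
    """Read simple KEY=VALUE lines from `path` into os.environ.

    Existing environment variables are left untouched. Returns a dict of
    the key/value pairs that were parsed from the file.
    """
    loaded = {}
    with open(path) as f:
        for raw_line in f:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            # Drop surrounding whitespace and simple surrounding quotes.
            value = value.strip().strip("\"'")
            loaded[key] = value
            os.environ.setdefault(key, value)
    return loaded
```

After loading, the key can be read with os.environ["OPENAI_API_KEY"], the same variable the original load_api_key function set by hand.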
Lichter answered 17/11, 2023 at 6:59 Comment(0)
