Get all documents from ChromaDb using Python and langchain
D

4

15

I'm using langchain to process a whole bunch of documents which are in a MongoDB database.

I can load all documents fine into the chromadb vector storage using langchain. Nothing fancy being done here. This is my code:


from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

from langchain.vectorstores import Chroma
db = Chroma.from_documents(docs, embeddings, persist_directory='db')
db.persist()

Now, after storing the data, I want to get a list of all the documents and embeddings with their IDs.

This is so I can store them back into MongoDb.

I also want to put them through Bertopic to get the topic categories.

Question 1 is: how do I get all documents I've just stored in the Chroma database? I want the documents, and all the metadata.

Many thanks for your help!

Dentate answered 5/5, 2023 at 17:9 Comment(0)
D
19

Looking at the source code (https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/chroma.py)

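You can just call:

db.get()

and you will get a dict with the IDs, documents and metadata. In the Chroma versions I'm aware of, the embeddings are only filled in if you ask for them, e.g. db.get(include=["embeddings", "documents", "metadatas"]).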
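
Since the goal is to write the data back to MongoDB, here is a minimal sketch of reshaping that result into one record per document. It assumes the result is a dict of parallel lists under "ids", "documents", "metadatas" and (optionally) "embeddings"; the sample values, the to_records helper and the "_id"/"text" field names are made up for illustration.

```python
# Stand-in for the dict returned by db.get(); the values here are made up.
result = {
    "ids": ["a1", "b2"],
    "documents": ["first text", "second text"],
    "metadatas": [{"source": "one.csv"}, {"source": "two.csv"}],
    "embeddings": None,  # stays None unless embeddings were requested
}


def to_records(result):
    """Turn the parallel lists into one dict per stored document."""
    ids = result["ids"]
    embeddings = result.get("embeddings")
    if embeddings is None:
        # No vectors were returned, so pad with None to keep zip() aligned.
        embeddings = [None] * len(ids)
    records = []
    for doc_id, text, metadata, embedding in zip(
        ids, result["documents"], result["metadatas"], embeddings
    ):
        records.append(
            {
                "_id": doc_id,  # hypothetical choice: reuse the Chroma ID
                "text": text,
                "metadata": metadata or {},
                # Plain floats so the record can be serialised easily.
                "embedding": None
                if embedding is None
                else [float(x) for x in embedding],
            }
        )
    return records


records = to_records(result)
print(records[0])
```

Each record could then be handed to whatever insert call your MongoDB driver provides.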

Discontinue answered 6/5, 2023 at 1:34 Comment(4)
Nice. However, when I run that, I get: AttributeError: 'Chroma' object has no attribute 'get' - Dentate
OK, so you probably just need to update langchain, as get was introduced just 4 days ago. Run 'pip install --upgrade langchain' and try again. - Discontinue
How do you do this in another file? - Baker
You can create a ChromaDB client separately and perform any operations on its collections. Make sure you pass the correct persist directory. - Salisbury
S
9

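Once the DB is created, you can create a client separately, pointing it at the DB persist directory:

import chromadb
from chromadb.config import Settings

# Replace the placeholders with your own persist directory and collection name.
# LangChain's Chroma wrapper names the collection "langchain" unless you set one.
client = chromadb.Client(Settings(is_persistent=True,
                                  persist_directory="<PERSIST_DIR_NAME>"))
coll = client.get_collection("<name of the collection>")
coll.get()  # Gets all the data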

You get a dict with the IDs, metadata (including the source) and documents. Add "embeddings" to the include argument of get() if you also need the vectors.
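
For a large collection it may be easier to read the data in pages rather than in a single call. The sketch below relies on the limit and offset arguments of the collection's get(); iter_batches and FakeCollection are hypothetical names, and FakeCollection is only an in-memory stand-in so the example runs without a database.

```python
def iter_batches(collection, batch_size=1000):
    """Yield get() results page by page until the collection is exhausted."""
    offset = 0
    while True:
        batch = collection.get(
            limit=batch_size,
            offset=offset,
            include=["documents", "metadatas"],
        )
        if not batch["ids"]:
            break  # an empty page means everything has been read
        yield batch
        offset += len(batch["ids"])


class FakeCollection:
    """In-memory stand-in for a Chroma collection (not part of chromadb)."""

    def __init__(self, n):
        self._ids = [f"id-{i}" for i in range(n)]

    def get(self, limit=None, offset=0, include=None):
        ids = self._ids[offset : offset + limit]
        return {
            "ids": ids,
            "documents": [f"text for {i}" for i in ids],
            "metadatas": [{"source": "demo"} for _ in ids],
        }


sizes = [len(batch["ids"]) for batch in iter_batches(FakeCollection(5), batch_size=2)]
print(sizes)
```

With a real collection you would pass coll instead of FakeCollection(5) and process each batch as it arrives.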

Salisbury answered 15/9, 2023 at 16:24 Comment(0)
X
1

This worked for me. I just needed a list of the file names from the source key in the Chroma DB; I didn't want all the other metadata, just the source files.

## get list of all file URLs in vector db

vectordb = Chroma.from_documents(texts, embeddings, persist_directory="db2")

# Fetch everything once instead of calling get() on every loop iteration
data = vectordb.get()
print(data.keys())
print(len(data["ids"]))

# Print the list of source files
for metadata in data["metadatas"]:
    print(metadata["source"])

outputs:

dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data'])
140

csv-files/small-csv.csv
csv-files/dataset-04-17-2024.csv
ppt-content/presentation.pptx
csv-files/events-small-csv.csv
csv-files/small-csv.csv
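
As the output shows, the same source appears once per stored chunk, so the list contains repeats. A small sketch of collapsing it to unique files with a chunk count follows; the metadatas list is a stand-in for db.get()["metadatas"] built from the paths printed above, and chunks_per_source is just an illustrative name.

```python
from collections import Counter

# Stand-in for db.get()["metadatas"], using the source paths printed above.
metadatas = [
    {"source": "csv-files/small-csv.csv"},
    {"source": "csv-files/dataset-04-17-2024.csv"},
    {"source": "ppt-content/presentation.pptx"},
    {"source": "csv-files/events-small-csv.csv"},
    {"source": "csv-files/small-csv.csv"},
]

# Count chunks per source; entries without a "source" key are skipped.
chunks_per_source = Counter(
    m["source"] for m in metadatas if m and "source" in m
)

# One line per unique file, sorted by path.
for source, count in sorted(chunks_per_source.items()):
    print(f"{count:>4}  {source}")
```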

To list all documents and their content, call:

db.get()

Xebec answered 21/4 at 3:43 Comment(0)
R
0

Try this. I usually do it with the chromadb library directly.

import chromadb
from chromadb.utils import embedding_functions

# Create the open-source embedding function
embedding_function1 = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# Open the persisted database and the collection
client = chromadb.PersistentClient(path="./chroma_db1")
col = client.get_or_create_collection(name="test1", embedding_function=embedding_function1)

# limit=5 returns only the first five entries; drop it to fetch everything
all_data = col.get(
    include=["documents", "metadatas"],
    limit=5
)

all_data
Rosemaryrosemond answered 22/7 at 2:24 Comment(0)
