I'm building a LangChain Q&A bot and serving it with a Python Dash app.
Error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 4.00 GiB total capacity; 3.44 GiB already allocated; 0 bytes free; 3.44 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Everything runs fine on CPU; I'm trying to get CUDA working for scalability.
What I tried:
- Setting PYTORCH_CUDA_ALLOC_CONF to max_split_size_mb:512 (roughly as in the sketch below).
- Introducing batch_size=1.
- Switching chain_type between 'stuff' and 'map_reduce'.
None of the above solved the issue.
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import AzureOpenAI
from langchain.vectorstores import Chroma

# The instructor-xl embedding model is the only component running on the
# GPU; the AzureOpenAI calls are remote.
vector_db = Chroma(
    persist_directory="",
    embedding_function=HuggingFaceInstructEmbeddings(
        model_name="hkunlp/instructor-xl",
        model_kwargs={"device": "cuda"},
    ),
)

# Deployment name elided.
llm = AzureOpenAI(deployment_name="", batch_size=1)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=vector_db.as_retriever(search_kwargs={"k": 1}),
    return_source_documents=True,
)
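For diagnosis, a quick way to see how much VRAM remains after the embeddings load (a minimal sketch; assumes torch.cuda.mem_get_info is available in this torch build):

import torch

# Report free vs. total VRAM on GPU 0 after loading the embeddings.
free, total = torch.cuda.mem_get_info(0)
print(f"free: {free / 1024**3:.2f} GiB / total: {total / 1024**3:.2f} GiB")

# Detailed allocator stats (allocated vs. reserved, fragmentation).
print(torch.cuda.memory_summary(device=0, abbreviated=True))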