I'm building a LangChain Q&A bot and serving it with a Python Dash app.
Error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 4.00 GiB total capacity; 3.44 GiB already allocated; 0 bytes free; 3.44 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Everything runs fine on CPU; I'm trying to get CUDA working for scalability.
What I tried:
- Setting PYTORCH_CUDA_ALLOC_CONF to max_split_size_mb:512 (roughly as in the sketch below).
- Introducing batch_size=1.
- Switching chain_type between 'stuff' and 'map_reduce'.
None of the above solved the issue.
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import AzureOpenAI
from langchain.vectorstores import Chroma

# The instructor-xl embedding model is the only component running on the
# GPU; the AzureOpenAI calls are remote.
vector_db = Chroma(
    persist_directory="",
    embedding_function=HuggingFaceInstructEmbeddings(
        model_name="hkunlp/instructor-xl",
        model_kwargs={"device": "cuda"},
    ),
)

# Deployment name elided.
llm = AzureOpenAI(deployment_name="", batch_size=1)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=vector_db.as_retriever(search_kwargs={"k": 1}),
    return_source_documents=True,
)
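For diagnosis, a quick way to see how much VRAM remains after the embeddings load (a minimal sketch; assumes torch.cuda.mem_get_info is available in this torch build):

import torch

# Report free vs. total VRAM on GPU 0 after loading the embeddings.
free, total = torch.cuda.mem_get_info(0)
print(f"free: {free / 1024**3:.2f} GiB / total: {total / 1024**3:.2f} GiB")

# Detailed allocator stats (allocated vs. reserved, fragmentation).
print(torch.cuda.memory_summary(device=0, abbreviated=True))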