Using Vicuna + langchain + llama_index to create a self-hosted LLM
I want to create a self-hosted LLM that can use my own custom data (Slack conversations, in this case) as context.

I've heard Vicuna is a great alternative to ChatGPT, so I wrote the code below:

from llama_index import SimpleDirectoryReader, LangchainEmbedding, GPTListIndex, \
    GPTSimpleVectorIndex, PromptHelper, LLMPredictor, Document, ServiceContext
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import os
import torch
from langchain.llms.base import LLM
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

# Set the allocator config from Python before CUDA is initialized;
# a `!export` in a notebook runs in a subshell and has no effect on this process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

class CustomLLM(LLM):
    model_name = "eachadea/vicuna-13b-1.1"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Pass torch_dtype here: model_kwargs on the pipeline are ignored when an
    # already-instantiated model is supplied.
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    # Vicuna is a decoder-only (causal) model, so the task is "text-generation",
    # not "text2text-generation".
    pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

    def _call(self, prompt, stop=None):
        return self.pipeline(prompt, max_length=9999)[0]["generated_text"]

    @property
    def _identifying_params(self):
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self):
        return "custom"


llm_predictor = LLMPredictor(llm=CustomLLM())

But sadly I'm hitting the error below:

OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB (GPU 0; 22.03 GiB total capacity; 21.65 GiB 
already allocated; 94.88 MiB free; 21.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated 
memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and 
PYTORCH_CUDA_ALLOC_CONF

Here's the output of !nvidia-smi (before running anything):

Thu Apr 20 18:04:00 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                     Off| 00000000:00:1E.0 Off |                    0 |
|  0%   23C    P0               52W / 300W|      0MiB / 23028MiB |     18%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Any idea how to modify my code to make it work?

Errick answered 20/4, 2023 at 18:14 Comment(1)
model_name = "google/tapas-small-finetuned-wikisql-supervised" I am getting following error for above code, any idea ValueError: Unrecognized configuration class <class 'transformers.models.tapas.configuration_tapas.TapasConfig'> for this kind of AutoModel: AutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, FalconConfig etc..Catwalk

max_length=9999 is too long and will consume a huge amount of GPU RAM, especially with a 13B model. Try the 7B model. Also try something like PEFT/bitsandbytes to reduce GPU RAM usage; setting load_in_8bit=True is a good start.
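
As a rough sketch of what that could look like (this assumes the bitsandbytes and accelerate packages are installed; the 7B checkpoint name is an example, not taken from the question):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "eachadea/vicuna-7b-1.1"  # example 7B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize weights to int8 at load time (needs bitsandbytes)
    device_map="auto",   # let accelerate place layers on the available GPU(s)
)

Note that load_in_8bit is an argument to from_pretrained, i.e. it affects model loading; passing it as a generation kwarg will not work.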

By answered 21/4, 2023 at 14:17 Comment(2)
Howdy, is there anywhere that explains how to set all the parameters passed to the LLMs based on the hardware being used? - Dietrich
ValueError: The following model_kwargs are not used by the model: ['load_in_8bit'] (note: typos in the generate arguments will also show up in this list) - Stewardson

As explained in this similar issue, my problem was that VRAM usage was doubled. The solution is to put the creation of the model and the tokenizer before the class, as sketched below. It should also be possible to solve the problem by putting the creation of the model in an __init__ of the class.
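
A minimal sketch of that rearrangement, reusing the names from the question (whether the 13B weights fit on the card is a separate issue):

import torch
from langchain.llms.base import LLM
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_name = "eachadea/vicuna-13b-1.1"

# Load the heavy objects once at module level so the weights exist in
# memory only once instead of being duplicated through the class.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

class CustomLLM(LLM):
    def _call(self, prompt, stop=None):
        # The class now only wraps the pipeline created above.
        return pipe(prompt, max_length=9999)[0]["generated_text"]

    @property
    def _llm_type(self):
        return "custom"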

Maggee answered 5/6, 2023 at 16:5 Comment(1)
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review - Dupleix
