Use LlamaIndex to load custom LLM model
Asked Answered
B

1

2

I am testing LlamaIndex using the Vicuna-7b or 13b models. I have encountered an issue where the model's memory usage appears to be normal when loaded into CPU memory. However, when I place it on the GPU, the VRAM usage seems to double. This prevents me from using the 13b model. However, when using FastChat's CLI, the 13b model can be used, and both VRAM and memory usage are around 25GB.

# define prompt helper
# set maximum input size
max_input_size = 2048
# set number of output tokens
num_output = 256 
# set maximum chunk overlap
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

def model_size(model: torch.nn.Module):
    return sum(p.numel() for p in model.parameters()) 

def model_memory_size(model: torch.nn.Module, dtype: torch.dtype=torch.float16):
    # Get the number of elements for each parameter
    num_elements = sum(p.numel() for p in model.parameters())
    # Get the number of bytes for the dtype
    dtype_size = torch.tensor([], dtype=dtype).element_size()
    return num_elements * dtype_size / (1024 ** 2)  # return in MB

class CustomLLM(LLM):
    model_name = "vicuna-7b"
    model_path = "../../../SharedData/vicuna-7b/"
    kwargs = {"torch_dtype": torch.float16}
    tokenizer_vicuna = AutoTokenizer.from_pretrained(model_path, use_fast=False)
    model_vicuna = AutoModelForCausalLM.from_pretrained(
            model_path, low_cpu_mem_usage=True, **kwargs 
    )
    # device = "cuda" 
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 
    print(device)
    print(f"Model size: {model_size(model_vicuna)/1e6} million parameters")
    dtype_current = next(model_vicuna.parameters()).dtype
    print(f"Model memory size: {model_memory_size(model_vicuna,dtype_current)} MB")
    print("Press any key to continue...")
    input()
    model_vicuna.to(device)
    
    @torch.inference_mode()
    def generate_response(self, prompt: str, max_new_tokens=num_output, temperature=0.7, top_k=0, top_p=1.0):
        encoded_prompt = self.tokenizer_vicuna.encode(prompt, return_tensors='pt').to(self.device)
        max_length = len(encoded_prompt[0]) + max_new_tokens
        with torch.no_grad():
            output = self.model_vicuna.generate(encoded_prompt, 
                                                max_length=max_length,
                                                temperature=temperature, 
                                                top_k=top_k, 
                                                top_p=top_p, 
                                                do_sample=True)
        response = self.tokenizer_vicuna.decode(output[0], skip_special_tokens=True)
        return response

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        prompt_length = len(prompt)
        response = self.generate_response(prompt)
        # only return newly generated tokens
        return response[prompt_length:]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"


Here is the output:

cuda
Model size: 6738.415616 million parameters
Model memory size: 12852.5078125 MB

Here is the result of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:17:00.0 Off |                  Off |
| 30%   39C    P2    69W / 300W |  26747MiB / 48682MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2205      G   /usr/libexec/Xorg                   9MiB |
|    0   N/A  N/A      2527      G   /usr/bin/gnome-shell                5MiB |
|    0   N/A  N/A   2270925      C   python                          26728MiB |
+-----------------------------------------------------------------------------+

26747MiB in GPU memory, and approx 12852MB before in CPU memory. And then, if i use 13b model, that will cause OUT of memory of cuda of cause.

Do you have some suggestion where i can continue to debug? Thanks in advance !

I have tried to confirm the model dtype

Billyebilobate answered 30/5, 2023 at 11:42 Comment(1)
I found the solution. This might be due to my lack of understanding of how Python works. If I put the creation of the model and the tokenizer before the "class", there is ok, no problem of double VRAM. I think it could be possible to solve the problem also if I put the creation of the model in an init of class.Billyebilobate
K
2

What I would recommend is:

  1. Enable 8-bit compression because it can reduce the memory usage around half with no sensible model quality effects. Use --load-8bit
  2. In addition to the above point, you can add a --cpu-offloading that offload weights that don't fit on your GPU onto the CPU memory.
Kellam answered 30/5, 2023 at 12:6 Comment(1)
Thank you for the suggestion. I think if I use 8-bit compression or cpu-offloading that will help, but that will not fix the problem of VRAM is double as VAM used for loading a model.Billyebilobate

© 2022 - 2024 — McMap. All rights reserved.