Implement multimodal models (LLaVA) by monatis · Pull Request #3436 · ggml-org/llama.cpp · GitHub


@monatis
Collaborator

@monatis monatis commented Oct 2, 2023

closes #3332

This is still WIP and highly experimental.

The work started in lmm.cpp, but it turned out to be fine to implement it in this repo as well, which I believe will be much simpler.

The plan is to perform surgery on LLaVA models and export:

  1. a regular llama.gguf file,
  2. a custom CLIP model with multimodal projector on top of it.
  • GGUF support for CLIP and LLaVA model surgery is already done.
  • E2E inference of LLaVA V1.5.
  • Use the GGML allocator API and clean up the code.
  • Better CLI args handling in the llava executable.
  • Upload pre-converted models and write a readme.

usage:

  • Build with cmake.
  • From this link, download mmproj-model-f16.gguf and one of ggml-model-[f16|q5_k|q4_k].gguf.
  • Run:
./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg

This will output the detailed description of the image.

Note: You can override the default textual prompt "Describe the image in detail." by adding -p "custom prompt comes here". Run ./bin/llava for other options.

Note: A lower temperature value like 0.1 is recommended. Add --temp 0.1 to your command to do so.
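
For anyone scripting the steps above, here is a minimal Python sketch that assembles the same command line. It only uses the flags mentioned in this description (-m, --mmproj, --image, -p, --temp); the helper name build_llava_command and the example prompt are made up for illustration and are not part of this PR.

```python
import shlex

def build_llava_command(model, mmproj, image, prompt=None, temp=None,
                        binary="./bin/llava"):
    """Assemble the argv list for the llava example (hypothetical helper)."""
    argv = [binary, "-m", model, "--mmproj", mmproj, "--image", image]
    if prompt is not None:
        argv += ["-p", prompt]          # overrides the default textual prompt
    if temp is not None:
        argv += ["--temp", str(temp)]   # e.g. 0.1, as recommended in the note
    return argv

cmd = build_llava_command(
    "ggml-model-q5_k.gguf",
    "mmproj-model-f16.gguf",
    "path/to/an/image.jpg",
    prompt="What is in this picture?",  # illustrative prompt, not from the PR
    temp=0.1,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed to subprocess.run(cmd, check=True); building a list rather than a single string avoids quoting problems when the prompt contains spaces.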

@staviq
Contributor

staviq commented Oct 2, 2023

Some time ago I was playing with the idea of allowing images to be uploaded via the server web UI. I had a working PoC, but dropped the idea since nobody was working on multimodal functionality back then.

Would it be helpful for testing if I made a PR with this change?

The idea was to import images client-side in the browser, draw them on a hidden canvas, and export them as PPM. This would allow such an image to be processed server-side without relying on any external libraries or dependencies.

I could add image upload to the server UI and a simple image wrapper class/functions on the C++ side.

Let me know if you are interested.
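
To illustrate why PPM needs no external libraries, here is a rough Python sketch of the binary PPM (P6) layout: a short ASCII header followed by raw RGB bytes. The function names are hypothetical, and the decoder is deliberately minimal (it assumes no comment lines and a maxval of 255), so treat it as an outline of the format rather than the PoC described above.

```python
def encode_ppm(width, height, rgb):
    """Serialize raw 8-bit RGB bytes as a binary PPM (P6) image."""
    if len(rgb) != width * height * 3:
        raise ValueError("expected width * height * 3 bytes of RGB data")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(rgb)

def decode_ppm(data):
    """Parse a P6 PPM as written by encode_ppm (no comments, maxval 255)."""
    # Split only on the first three newlines; pixel bytes may contain b"\n".
    magic, dims, maxval, pixels = data.split(b"\n", 3)
    if magic != b"P6" or maxval != b"255":
        raise ValueError("unsupported PPM variant")
    width, height = (int(v) for v in dims.split())
    if len(pixels) != width * height * 3:
        raise ValueError("truncated pixel data")
    return width, height, pixels

# A 2x1 image: one red pixel and one dark grey pixel.
pixels = bytes([255, 0, 0, 10, 10, 10])
blob = encode_ppm(2, 1, pixels)
print(decode_ppm(blob) == (2, 1, pixels))
```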

@monatis
Collaborator Author

monatis commented Oct 2, 2023

Thanks @staviq! We can already work with images thanks to a single-header C library included in this branch (stb-image.h), but integration with the UI would be great once this PR matures. The CLIP inference code, copied from another repo of mine, seems to need some refactoring because of the different GGML versions used. I'm currently debugging and fixing it; once that's done, I can move faster and we can collaborate on the UI integration.

@staviq
Contributor

staviq commented Oct 2, 2023

> Thanks @staviq! We can already work with images thanks to a single-header C library included in this branch (stb-image.h), but integration with the UI would be great once this PR matures. The CLIP inference code, copied from another repo of mine, seems to need some refactoring because of the different GGML versions used. I'm currently debugging and fixing it; once that's done, I can move faster and we can collaborate on the UI integration.

I completely missed that stb is licensed under MIT; that's cool. No format shenanigans necessary then.

OK, take your time then. I'll wait until you feel comfortable starting on the UI integration.

@ggerganov added the "model" (Model specific) label Oct 3, 2023
@monatis
Collaborator Author

monatis commented Oct 7, 2023

Sorry for the delay here. There was an issue with evaluating embedding input that I needed to debug, and it was too painful to do so on my physical machine, which is slow at generation. I obtained a faster VM in the cloud and hope to move faster this weekend.

@monatis
Collaborator Author

monatis commented Oct 7, 2023

This is now working with the recently published LLaVA V1.5. The CLIP part consumes a huge amount of memory; I'll optimize it with ggml_allocr and clean up the implementation tomorrow.

@monatis
Collaborator Author

monatis commented Oct 8, 2023

@josephilome this shouldn't be that hard; I can implement it once the current implementation is optimized.

@monatis
Collaborator Author

monatis commented Oct 9, 2023

There are still some tasks to do but I think this is ready for testing / feedback / reviews.

A pre-converted model can be found here.

You need to download one of the ggml-model-[f16|q5_k|q4_k].gguf models and mmproj-model-f16.gguf (the image encoder). This two-file format lets us move faster right now, but we can think about a single-file format in the future. Also see the readme.

I'll add more documentation, clean up the code, and address reviews this afternoon. Any feedback is welcome.

@monatis requested a review from ggerganov October 9, 2023 06:55
@monatis marked this pull request as ready for review October 9, 2023 06:55
@ggerganov
Member

ggerganov commented Oct 9, 2023

@monatis Awesome stuff!

I haven't had a detailed look or run tests yet, but looking at the progress, it's quite amazing to have something that can understand images. Looking forward to giving this a try!

Just curious, how much of the total compute is done by CLIP? I.e. is it a bottleneck?

@ggerganov added the "high priority" (Very important issue) label Oct 9, 2023
@ExtReMLapin
Contributor

Any plans to update the GGUF for LLaVA 1.6?

@Green-Sky
Collaborator

Green-Sky commented Jan 31, 2024

oh they released them https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2

a few days ago i only saw the 1.6 preview in their hf space, but no mention of it anywhere else on the internet :)

edit: blog post https://llava-vl.github.io/blog/2024-01-30-llava-1-6/

@ExtReMLapin
Contributor

ExtReMLapin commented Feb 1, 2024

Even if you convert the safetensors file into a torch .bin file, you will get this error when trying to convert to GGUF:


  File "/opt/LLaVA/llama.cpp/convert.py", line 1474, in <module>
    main()
  File "/opt/LLaVA/llama.cpp/convert.py", line 1460, in main
    model   = convert_model_names(model, params)
  File "/opt/LLaVA/llama.cpp/convert.py", line 1198, in convert_model_names
    raise Exception(f"Unexpected tensor name: {name}")
Exception: Unexpected tensor name: model.image_newline

@gamester2665

gamester2665 commented Feb 1, 2024

yup.. can confirm following #2948 doesn't yield valid llava-v1.6-mistral-7b-GGUF... any suggestions?


$ python llama.cpp/convert.py llava-hf \
>   --outfile llava-v1.6-mistral-7b-GGUF.gguf \
>   --outtype f32
Loading model file llava-hf\model-00001-of-00004.safetensors
Loading model file llava-hf\model-00001-of-00004.safetensors
Loading model file llava-hf\model-00002-of-00004.safetensors
Loading model file llava-hf\model-00003-of-00004.safetensors
Loading model file llava-hf\model-00004-of-00004.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=32768, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=1000000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.AllF32: 0>, path_model=WindowsPath('llava-hf'))
Found vocab files: {'tokenizer.model': WindowsPath('llava-hf/tokenizer.model'), 'vocab.json': None, 'tokenizer.json': WindowsPath('llava-hf/tokenizer.json')}
Loading vocab file 'llava-hf\tokenizer.model', type 'spm'
Vocab info: <SentencePieceVocab with 32000 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0, 'pad': 0}, add special tokens {'bos': True, 'eos': False}>
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
Permuting layer 4
Permuting layer 5
Permuting layer 6
Permuting layer 7
Permuting layer 8
Permuting layer 9
Permuting layer 10
Permuting layer 11
Permuting layer 12
Permuting layer 13
Permuting layer 14
Permuting layer 15
Permuting layer 16
Permuting layer 17
Permuting layer 18
Permuting layer 19
Permuting layer 20
Permuting layer 21
Permuting layer 22
Permuting layer 23
Permuting layer 24
Permuting layer 25
Permuting layer 26
Permuting layer 27
Permuting layer 28
Permuting layer 29
Permuting layer 30
Permuting layer 31
model.embed_tokens.weight                        -> token_embd.weight                        | BF16   | [32000, 4096]
Traceback (most recent call last):
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1474, in <module>
    main()
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1460, in main
    model   = convert_model_names(model, params)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1198, in convert_model_names
    raise Exception(f"Unexpected tensor name: {name}")
Exception: Unexpected tensor name: model.image_newline
(llama-new) 
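
The traceback above suggests the converter maps each checkpoint tensor name through a fixed table and refuses anything it does not recognize. The following Python sketch imitates that behaviour under that assumption; map_tensor_names and the one-entry KNOWN_NAMES table are hypothetical stand-ins and not the actual convert.py code.

```python
# Hypothetical, heavily reduced name table; only the mapping printed in the
# log above is included. The real converter knows many more names.
KNOWN_NAMES = {
    "model.embed_tokens.weight": "token_embd.weight",
}

def map_tensor_names(tensor_names, skip=()):
    """Map checkpoint tensor names to GGUF names, failing on unknown ones."""
    mapped = {}
    for name in tensor_names:
        if name in skip:
            continue  # the caller explicitly chose to leave this tensor out
        if name not in KNOWN_NAMES:
            raise Exception(f"Unexpected tensor name: {name}")
        mapped[name] = KNOWN_NAMES[name]
    return mapped

names = ["model.embed_tokens.weight", "model.image_newline"]

try:
    map_tensor_names(names)
except Exception as err:
    message = str(err)
    print(message)

# Leaving the LLaVA-specific tensor out lets the remaining names go through.
result = map_tensor_names(names, skip={"model.image_newline"})
print(result)
```

Skipping a tensor only silences the name check; whether the model still behaves correctly without it is a separate question.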

@ExtReMLapin
Contributor

And that's the first one that fails (pretty much the first or second layer lmao)

@chigkim

chigkim commented Feb 1, 2024

Looping in @haotian-liu and @cmp-nct in case they could help with Llava V1.6.

@cjpais
Contributor

cjpais commented Feb 1, 2024

I've got a hacked up script that works for 1.6, will share shortly on a fork

raw script (breaks llava 1.5 support): llava1.6-surgery-hack.py

  • loads safetensors
  • removes "model.image_newline" for convert.py, I don't know the impact of this
  • splits mm_projector into new file
  • saves the updated safetensors files that were modified

note: the location of the mmproj is different between 34b and 7b, probably best to do a search for all of the mmproj tensors, split them all out, save them, and resave each checkpoint without them
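
The steps and the note above can be outlined roughly as follows. This is a sketch in plain Python, not the linked script: it uses ordinary dicts in place of loaded safetensors files, split_checkpoint is a hypothetical name, the substring "mm_projector" is assumed to identify projector tensors, and the tensor names in the example are illustrative.

```python
def split_checkpoint(shards, projector_key="mm_projector",
                     drop=("model.image_newline",)):
    """Pull projector tensors out of every shard and drop unwanted names.

    shards maps a shard file name to a {tensor_name: tensor} dict.
    Returns (cleaned_shards, projector_tensors, dropped_names).
    """
    cleaned, projector, dropped = {}, {}, []
    for shard_name, tensors in shards.items():
        kept = {}
        for name, tensor in tensors.items():
            if projector_key in name:
                projector[name] = tensor   # destined for the separate mmproj file
            elif name in drop:
                dropped.append(name)       # removed so the converter accepts the rest
            else:
                kept[name] = tensor
        cleaned[shard_name] = kept
    return cleaned, projector, dropped

# Placeholder strings stand in for real tensors.
shards = {
    "model-00001-of-00004.safetensors": {
        "model.embed_tokens.weight": "T0",
        "model.image_newline": "T1",
    },
    "model-00004-of-00004.safetensors": {
        "model.mm_projector.0.weight": "T2",  # illustrative projector tensor name
        "lm_head.weight": "T3",
    },
}
cleaned, projector, dropped = split_checkpoint(shards)
print(sorted(projector), dropped)
```

Scanning every shard, instead of assuming the projector lives in a fixed file, is what makes the same routine usable for both the 7b and 34b layouts mentioned in the note.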

@cmp-nct
Contributor

cmp-nct commented Feb 1, 2024

I'm also halfway there but occupied with real-world stuff.
The main task for 1.6 is to implement the new 'unpad' mechanism.

I've created a draft pull request to use as a base for 1.6: #5267
It uses a clean surgery script which should work with all variants of llava. It also supports searching for stuff (though it currently searches only for the ViT, not the projector).
The projector gguf file is also prepared for the new features (spatial_unpad); the new tensor is moved in there.

Right now I am struggling with the new ViT:
size mismatch for vision_model.encoder.layers.1.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([13824]).
That's ffn_down and ffn_up
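
A quick way to list every tensor affected by this kind of size mismatch is to compare the shapes stored in the checkpoint with the shapes the instantiated model expects. The Python sketch below does that with plain dicts of shape tuples; find_shape_mismatches is a hypothetical helper, and the layer_norm1 entry is made up to show a tensor that matches.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for tensors that disagree."""
    return {
        name: (shape, model_shapes[name])
        for name, shape in checkpoint_shapes.items()
        if name in model_shapes and model_shapes[name] != shape
    }

fc1_bias = "vision_model.encoder.layers.1.mlp.fc1.bias"
norm = "vision_model.encoder.layers.1.layer_norm1.weight"  # illustrative entry

checkpoint_shapes = {fc1_bias: (4096,), norm: (1024,)}     # shapes in the file
model_shapes = {fc1_bias: (13824,), norm: (1024,)}         # shapes the model expects

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, (have, want) in mismatches.items():
    print(f"{name}: checkpoint {have} vs model {want}")
```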

Even without the correct ViT I could already test llava-1.6, and despite not including the proper image manipulation and resolution, it is already very good.

@cjpais
Contributor

cjpais commented Feb 2, 2024

not sure if okay to share here...
for those who are looking here are initial gguf quants for llava 1.6

please note they are very early, built from the hacked surgery script. improvements coming in #5267 from @cmp-nct, will try to contribute where I can but I am nothing close to an expert

7b mistral
34b

@gamester2665

awesome! thanks @cjpais .. throwing into LMStudio for testing now

@BBC-Esq

BBC-Esq commented Feb 2, 2024

Did it work in LM Studio?

@gamester2665

@BBC-Esq Yes! cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf working successfully in LMStudio.

@BBC-Esq

BBC-Esq commented Feb 2, 2024

You guys move fast. I'm considering moving my stuff from ctranslate2 to llama.cpp, any good issues/discussions to see if you move that fast with whisper.cpp?

@ExtReMLapin
Contributor

> • removes "model.image_newline" for convert.py, I don't know the impact of this

bruh moment

@aymenabid-lab

I'm using LLaVA.

How can I modify the batch size to avoid this error?

  • From Python within a terminal:
    python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path /home/dl_g15/llava-v1.5-13b
    =>
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB. GPU 0 has a total capacty of 7.75 GiB of which 8.06 MiB is free. Including non-PyTorch memory, this process has 7.73 GiB memory in use. Of the allocated memory 7.60 GiB is allocated by PyTorch, and 7.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  • From Anaconda:
    model_path = "/home/dl_g15/llava-v1.5-13b"

    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path)
    )
    =>
    RuntimeError: CUDA error: out of memory
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
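
As a rough sanity check on the error above, the back-of-the-envelope Python calculation below estimates the memory needed for the weights alone. It assumes about 13 billion parameters (read off the model name) stored as 16-bit values; both figures are assumptions, and activations and caches are ignored.

```python
GIB = 1024 ** 3

def weights_gib(n_params, bytes_per_param):
    """Memory taken by the weights alone, in GiB."""
    return n_params * bytes_per_param / GIB

n_params = 13e9    # assumption: roughly 13 billion parameters
gpu_total = 7.75   # GiB, as reported in the error message above

fp16 = weights_gib(n_params, 2)  # 2 bytes per parameter for 16-bit weights
print(f"16-bit weights: about {fp16:.1f} GiB, GPU total: {gpu_total} GiB")
print("fits" if fp16 <= gpu_total else "does not fit")
```

Under these assumptions the weights already exceed the GPU's capacity before any batch is processed, which would explain why the failure occurs while loading the model, so changing the batch size is unlikely to help.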

@cebtenzzre
Collaborator

> I'm using LLaVA.

You're almost certainly looking for https://github.com/haotian-liu/LLaVA. This is the llama.cpp repo.


Labels

  • high priority: Very important issue
  • llava: LLaVa and multimodal
  • model: Model specific
  • need feedback: Testing and feedback with results are needed

Development

Successfully merging this pull request may close these issues.

llama : add multimodal support (LLaVA)