[model] Support InternVL2.5-3 Series by Kuangdd01 · Pull Request #7258 · hiyouga/LLaMA-Factory · GitHub

Conversation

@Kuangdd01 (Collaborator) commented Mar 11, 2025

What does this PR do?

Reopened PR #7077

May fix #6322, #6432, #6236, #3802

  • This PR is built on the Transformers PR yonigozlan:add-intern-vl.
  • The integration of this model into Transformers is still incomplete and may undergo further modifications.
  • The current code has been validated on the 1B small-size model and the demo data.

Before submitting

Former version: some demo experiments on `InternVL2_5-1B-MPO-hf`

1. video lora sft

``` yaml
### model
model_name_or_path: kingsley01/InternVL3-1B-hf
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: mllm_video_demo
# dataset: mllm_demo  # text: identity,alpaca_en_demo # video: mllm_video_demo
template: intern_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 4

### output
output_dir: saves/internvl-1b/lora/sft-test-demo-video
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 30.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
![image](https://github.com/user-attachments/assets/33661f4d-0fa4-4afe-a82c-0238a545244f)

2. mix data full sft

``` yaml
### model
model_name_or_path: kingsley01/InternVL3-1B-hf
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full


### dataset
dataset: mllm_demo, identity, alpaca_en_demo  # video: mllm_video_demo
template: intern_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 4

### output
output_dir: saves/internvl-1b/full/sft-test-demo
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 30.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
```

(screenshot)

Aligned with the latest Transformers now! 😄

We now support post-training for the InternVL2.5 and InternVL3 series!

Important

We should use the latest Hugging Face Transformers code (main branch) instead of the released version, together with the OpenGVLab/InternVL3-xB-hf checkpoints.
For processor issues, please check your transformers version and model.processor.config.

For now, please install the latest transformers from source (either of the following):

pip install git+https://github.com/huggingface/transformers.git@main

pip install git+https://github.com/Kuangdd01/transformers.git@hf-internvl
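Before training, a quick sanity check can save a debugging round. The snippet below is a minimal sketch (not part of this PR) that assumes the source build installed above exposes the InternVL classes referenced in this thread:

```python
# Minimal sanity check (assumption: the transformers source build installed
# above includes the InternVL port referenced in this PR).
import transformers

# Importing the class raises an ImportError on releases without the port.
from transformers import AutoProcessor, InternVLForConditionalGeneration

print("transformers version:", transformers.__version__)
```

If the import fails, the installed transformers does not yet include the InternVL port.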

We support direct use of several small-sized checkpoints: [InternVL2.5-1/2/4/8B, InternVL3-1/2/8B]. Download the InternVL models from Hugging Face or ModelScope.
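As an optional step, here is a hedged sketch for pre-downloading one of the supported checkpoints with `huggingface_hub` (the `local_dir` below is a hypothetical path, not something this PR prescribes):

```python
# Sketch: pre-download a supported -hf checkpoint (assumes huggingface_hub is
# installed; repo_id is one of the checkpoints listed above, local_dir is a
# hypothetical location).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="OpenGVLab/InternVL3-1B-hf",
    local_dir="./InternVL3-1B-hf",
)
print("checkpoint downloaded to:", local_path)
```

Point `model_name_or_path` in the config below at either the Hub id or the downloaded path.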

``` yaml
### model
model_name_or_path: OpenGVLab/InternVL3-8B-hf # careful with -hf suffix
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: mllm_video_demo
template: intern_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 4

### output
output_dir: saves/internvl3-8b-s/lora/sft-mixup-video
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
flash_attn: auto
video_maxlen: 4
```
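Before launching a long run, a hedged smoke test (again not part of this PR) can confirm that the checkpoint and its processor load cleanly under the installed transformers build:

```python
# Hedged smoke test (assumptions: an -hf checkpoint as above and a transformers
# build that includes the InternVL port). Loads the processor and model once so
# processor/config mismatches surface before training starts.
from transformers import AutoProcessor, InternVLForConditionalGeneration

model_id = "OpenGVLab/InternVL3-8B-hf"  # careful with the -hf suffix
processor = AutoProcessor.from_pretrained(model_id)
model = InternVLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

print(processor.__class__.__name__, model.config.model_type)
```

The config itself is launched the usual LLaMA-Factory way, e.g. `llamafactory-cli train your_config.yaml` (the filename is a placeholder).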

@hiyouga added the `pending` label (This problem is yet to be addressed) on Mar 12, 2025
@husk-huz commented Mar 27, 2025

I know, we need to use yonigozlan/InternVL2_5-1B-MPO-hf instead of the original OpenGVLab/InternVL2_5-1B-MPO!

Edit: Good job! But I followed this and found something wrong with the tokenizer (screenshot). I use a yaml like this (screenshot). Transformers has already been changed to huggingface/transformers#35968 (comment); no error was found when importing.

Edit: The processor loaded but failed here (screenshots). What should I do to load it correctly?

@Kuangdd01 (Collaborator, Author) replied:

> What should I do to load it correctly?

Can you use this model card for testing one more time? I can't guarantee that the latest version is available. Feel free to report bugs; this PR is not complete for now. :[

@liuchengyuan123 commented:

How can I convert an InternVL2_5-1B-MPO checkpoint to hf if I have a customized, pre-trained internvl model? Please help!

cvt_internvl_weights_to_hf.py seems to require an intern_vl_hf_implem/tokenizer_internvl_llama_fast.

@Kuangdd01 (Collaborator, Author) replied:

> cvt_internvl_weights_to_hf.py seems to require an intern_vl_hf_implem/tokenizer_internvl_llama_fast.

In my case, just replace that code with

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", return_token_type_ids=False, trust_remote_code=True)

This will replace the tokenizer with InternLM2's.
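A minimal sketch of that workaround in context (the `output_dir` is a hypothetical placeholder for the converted checkpoint; this is not the exact conversion script):

```python
# Sketch of the tokenizer swap suggested above. `output_dir` is a hypothetical
# path where the converted -hf checkpoint is written; adapt it to your setup.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "internlm/internlm2-chat-7b",
    return_token_type_ids=False,
    trust_remote_code=True,
)

output_dir = "InternVL2_5-1B-MPO-hf"  # hypothetical output of the conversion
tokenizer.save_pretrained(output_dir)  # ship the InternLM2 tokenizer with the converted weights
```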

@liuchengyuan123 replied:

> This will replace the tokenizer with InternLM2's.

Thanks a lot for your timely response!!

When I use `model = InternVLForConditionalGeneration.from_pretrained("InternVL2_5-8B-MPO-hf")`, the following warning is raised:

Loading checkpoint shards: 100%|████████████████| 4/4 [00:23<00:00,  5.83s/it]
Some weights of the model checkpoint at InternVL2_5-8B-MPO-hf were not used when initializing InternVLForConditionalGeneration: ['vision_tower.encoder.layer.0.attention.attention.key.bias', 'vision_tower.encoder.layer.0.attention.attention.key.weight', 'vision_tower.encoder.layer.0.attention.attention.query.bias', ..., 'vision_tower.encoder.layer.9.output.dense.bias', 'vision_tower.encoder.layer.9.output.dense.weight']
(the attention.attention.{query,key,value}, attention.output.dense, intermediate.dense and output.dense weights and biases are listed for every vision_tower.encoder.layer.0-23)
- This IS expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of InternVLForConditionalGeneration were not initialized from the model checkpoint at InternVL2_5-8B-MPO-hf and are newly initialized: ['vision_tower.encoder.layer.0.attention.key.bias', 'vision_tower.encoder.layer.0.attention.key.weight', 'vision_tower.encoder.layer.0.attention.output.bias', ..., 'vision_tower.encoder.layer.9.mlp.up_proj.bias', 'vision_tower.encoder.layer.9.mlp.up_proj.weight']
(the attention.{query,key,value,output} and mlp.{up_proj,down_proj} weights and biases are listed for every vision_tower.encoder.layer.0-23)
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

It seems that none of the vision parts are initialized?

@Kuangdd01 (Collaborator, Author) commented Apr 16, 2025

> Thanks a lot for your timely response!!
>
> When I use `model = InternVLForConditionalGeneration.from_pretrained("InternVL2_5-8B-MPO-hf")`, the weight-mismatch warning quoted above is raised. It seems that none of the vision parts are initialized?
'vision_tower.encoder.layer.13.attention.output.weight', 'vision_tower.encoder.layer.13.attention.query.bias', 'vision_tower.encoder.layer.13.attention.query.weight', 'vision_tower.encoder.layer.13.attention.value.bias', 'vision_tower.encoder.layer.13.attention.value.weight', 'vision_tower.encoder.layer.13.mlp.down_proj.bias', 'vision_tower.encoder.layer.13.mlp.down_proj.weight', 'vision_tower.encoder.layer.13.mlp.up_proj.bias', 'vision_tower.encoder.layer.13.mlp.up_proj.weight', 'vision_tower.encoder.layer.14.attention.key.bias', 'vision_tower.encoder.layer.14.attention.key.weight', 'vision_tower.encoder.layer.14.attention.output.bias', 'vision_tower.encoder.layer.14.attention.output.weight', 'vision_tower.encoder.layer.14.attention.query.bias', 'vision_tower.encoder.layer.14.attention.query.weight', 'vision_tower.encoder.layer.14.attention.value.bias', 'vision_tower.encoder.layer.14.attention.value.weight', 'vision_tower.encoder.layer.14.mlp.down_proj.bias', 'vision_tower.encoder.layer.14.mlp.down_proj.weight', 'vision_tower.encoder.layer.14.mlp.up_proj.bias', 'vision_tower.encoder.layer.14.mlp.up_proj.weight', 'vision_tower.encoder.layer.15.attention.key.bias', 'vision_tower.encoder.layer.15.attention.key.weight', 'vision_tower.encoder.layer.15.attention.output.bias', 'vision_tower.encoder.layer.15.attention.output.weight', 'vision_tower.encoder.layer.15.attention.query.bias', 'vision_tower.encoder.layer.15.attention.query.weight', 'vision_tower.encoder.layer.15.attention.value.bias', 'vision_tower.encoder.layer.15.attention.value.weight', 'vision_tower.encoder.layer.15.mlp.down_proj.bias', 'vision_tower.encoder.layer.15.mlp.down_proj.weight', 'vision_tower.encoder.layer.15.mlp.up_proj.bias', 'vision_tower.encoder.layer.15.mlp.up_proj.weight', 'vision_tower.encoder.layer.16.attention.key.bias', 'vision_tower.encoder.layer.16.attention.key.weight', 'vision_tower.encoder.layer.16.attention.output.bias', 'vision_tower.encoder.layer.16.attention.output.weight', 'vision_tower.encoder.layer.16.attention.query.bias', 'vision_tower.encoder.layer.16.attention.query.weight', 'vision_tower.encoder.layer.16.attention.value.bias', 'vision_tower.encoder.layer.16.attention.value.weight', 'vision_tower.encoder.layer.16.mlp.down_proj.bias', 'vision_tower.encoder.layer.16.mlp.down_proj.weight', 'vision_tower.encoder.layer.16.mlp.up_proj.bias', 'vision_tower.encoder.layer.16.mlp.up_proj.weight', 'vision_tower.encoder.layer.17.attention.key.bias', 'vision_tower.encoder.layer.17.attention.key.weight', 'vision_tower.encoder.layer.17.attention.output.bias', 'vision_tower.encoder.layer.17.attention.output.weight', 'vision_tower.encoder.layer.17.attention.query.bias', 'vision_tower.encoder.layer.17.attention.query.weight', 'vision_tower.encoder.layer.17.attention.value.bias', 'vision_tower.encoder.layer.17.attention.value.weight', 'vision_tower.encoder.layer.17.mlp.down_proj.bias', 'vision_tower.encoder.layer.17.mlp.down_proj.weight', 'vision_tower.encoder.layer.17.mlp.up_proj.bias', 'vision_tower.encoder.layer.17.mlp.up_proj.weight', 'vision_tower.encoder.layer.18.attention.key.bias', 'vision_tower.encoder.layer.18.attention.key.weight', 'vision_tower.encoder.layer.18.attention.output.bias', 'vision_tower.encoder.layer.18.attention.output.weight', 'vision_tower.encoder.layer.18.attention.query.bias', 'vision_tower.encoder.layer.18.attention.query.weight', 'vision_tower.encoder.layer.18.attention.value.bias', 'vision_tower.encoder.layer.18.attention.value.weight', 
'vision_tower.encoder.layer.18.mlp.down_proj.bias', 'vision_tower.encoder.layer.18.mlp.down_proj.weight', 'vision_tower.encoder.layer.18.mlp.up_proj.bias', 'vision_tower.encoder.layer.18.mlp.up_proj.weight', 'vision_tower.encoder.layer.19.attention.key.bias', 'vision_tower.encoder.layer.19.attention.key.weight', 'vision_tower.encoder.layer.19.attention.output.bias', 'vision_tower.encoder.layer.19.attention.output.weight', 'vision_tower.encoder.layer.19.attention.query.bias', 'vision_tower.encoder.layer.19.attention.query.weight', 'vision_tower.encoder.layer.19.attention.value.bias', 'vision_tower.encoder.layer.19.attention.value.weight', 'vision_tower.encoder.layer.19.mlp.down_proj.bias', 'vision_tower.encoder.layer.19.mlp.down_proj.weight', 'vision_tower.encoder.layer.19.mlp.up_proj.bias', 'vision_tower.encoder.layer.19.mlp.up_proj.weight', 'vision_tower.encoder.layer.2.attention.key.bias', 'vision_tower.encoder.layer.2.attention.key.weight', 'vision_tower.encoder.layer.2.attention.output.bias', 'vision_tower.encoder.layer.2.attention.output.weight', 'vision_tower.encoder.layer.2.attention.query.bias', 'vision_tower.encoder.layer.2.attention.query.weight', 'vision_tower.encoder.layer.2.attention.value.bias', 'vision_tower.encoder.layer.2.attention.value.weight', 'vision_tower.encoder.layer.2.mlp.down_proj.bias', 'vision_tower.encoder.layer.2.mlp.down_proj.weight', 'vision_tower.encoder.layer.2.mlp.up_proj.bias', 'vision_tower.encoder.layer.2.mlp.up_proj.weight', 'vision_tower.encoder.layer.20.attention.key.bias', 'vision_tower.encoder.layer.20.attention.key.weight', 'vision_tower.encoder.layer.20.attention.output.bias', 'vision_tower.encoder.layer.20.attention.output.weight', 'vision_tower.encoder.layer.20.attention.query.bias', 'vision_tower.encoder.layer.20.attention.query.weight', 'vision_tower.encoder.layer.20.attention.value.bias', 'vision_tower.encoder.layer.20.attention.value.weight', 'vision_tower.encoder.layer.20.mlp.down_proj.bias', 'vision_tower.encoder.layer.20.mlp.down_proj.weight', 'vision_tower.encoder.layer.20.mlp.up_proj.bias', 'vision_tower.encoder.layer.20.mlp.up_proj.weight', 'vision_tower.encoder.layer.21.attention.key.bias', 'vision_tower.encoder.layer.21.attention.key.weight', 'vision_tower.encoder.layer.21.attention.output.bias', 'vision_tower.encoder.layer.21.attention.output.weight', 'vision_tower.encoder.layer.21.attention.query.bias', 'vision_tower.encoder.layer.21.attention.query.weight', 'vision_tower.encoder.layer.21.attention.value.bias', 'vision_tower.encoder.layer.21.attention.value.weight', 'vision_tower.encoder.layer.21.mlp.down_proj.bias', 'vision_tower.encoder.layer.21.mlp.down_proj.weight', 'vision_tower.encoder.layer.21.mlp.up_proj.bias', 'vision_tower.encoder.layer.21.mlp.up_proj.weight', 'vision_tower.encoder.layer.22.attention.key.bias', 'vision_tower.encoder.layer.22.attention.key.weight', 'vision_tower.encoder.layer.22.attention.output.bias', 'vision_tower.encoder.layer.22.attention.output.weight', 'vision_tower.encoder.layer.22.attention.query.bias', 'vision_tower.encoder.layer.22.attention.query.weight', 'vision_tower.encoder.layer.22.attention.value.bias', 'vision_tower.encoder.layer.22.attention.value.weight', 'vision_tower.encoder.layer.22.mlp.down_proj.bias', 'vision_tower.encoder.layer.22.mlp.down_proj.weight', 'vision_tower.encoder.layer.22.mlp.up_proj.bias', 'vision_tower.encoder.layer.22.mlp.up_proj.weight', 'vision_tower.encoder.layer.23.attention.key.bias', 'vision_tower.encoder.layer.23.attention.key.weight', 
'vision_tower.encoder.layer.23.attention.output.bias', 'vision_tower.encoder.layer.23.attention.output.weight', 'vision_tower.encoder.layer.23.attention.query.bias', 'vision_tower.encoder.layer.23.attention.query.weight', 'vision_tower.encoder.layer.23.attention.value.bias', 'vision_tower.encoder.layer.23.attention.value.weight', 'vision_tower.encoder.layer.23.mlp.down_proj.bias', 'vision_tower.encoder.layer.23.mlp.down_proj.weight', 'vision_tower.encoder.layer.23.mlp.up_proj.bias', 'vision_tower.encoder.layer.23.mlp.up_proj.weight', 'vision_tower.encoder.layer.3.attention.key.bias', 'vision_tower.encoder.layer.3.attention.key.weight', 'vision_tower.encoder.layer.3.attention.output.bias', 'vision_tower.encoder.layer.3.attention.output.weight', 'vision_tower.encoder.layer.3.attention.query.bias', 'vision_tower.encoder.layer.3.attention.query.weight', 'vision_tower.encoder.layer.3.attention.value.bias', 'vision_tower.encoder.layer.3.attention.value.weight', 'vision_tower.encoder.layer.3.mlp.down_proj.bias', 'vision_tower.encoder.layer.3.mlp.down_proj.weight', 'vision_tower.encoder.layer.3.mlp.up_proj.bias', 'vision_tower.encoder.layer.3.mlp.up_proj.weight', 'vision_tower.encoder.layer.4.attention.key.bias', 'vision_tower.encoder.layer.4.attention.key.weight', 'vision_tower.encoder.layer.4.attention.output.bias', 'vision_tower.encoder.layer.4.attention.output.weight', 'vision_tower.encoder.layer.4.attention.query.bias', 'vision_tower.encoder.layer.4.attention.query.weight', 'vision_tower.encoder.layer.4.attention.value.bias', 'vision_tower.encoder.layer.4.attention.value.weight', 'vision_tower.encoder.layer.4.mlp.down_proj.bias', 'vision_tower.encoder.layer.4.mlp.down_proj.weight', 'vision_tower.encoder.layer.4.mlp.up_proj.bias', 'vision_tower.encoder.layer.4.mlp.up_proj.weight', 'vision_tower.encoder.layer.5.attention.key.bias', 'vision_tower.encoder.layer.5.attention.key.weight', 'vision_tower.encoder.layer.5.attention.output.bias', 'vision_tower.encoder.layer.5.attention.output.weight', 'vision_tower.encoder.layer.5.attention.query.bias', 'vision_tower.encoder.layer.5.attention.query.weight', 'vision_tower.encoder.layer.5.attention.value.bias', 'vision_tower.encoder.layer.5.attention.value.weight', 'vision_tower.encoder.layer.5.mlp.down_proj.bias', 'vision_tower.encoder.layer.5.mlp.down_proj.weight', 'vision_tower.encoder.layer.5.mlp.up_proj.bias', 'vision_tower.encoder.layer.5.mlp.up_proj.weight', 'vision_tower.encoder.layer.6.attention.key.bias', 'vision_tower.encoder.layer.6.attention.key.weight', 'vision_tower.encoder.layer.6.attention.output.bias', 'vision_tower.encoder.layer.6.attention.output.weight', 'vision_tower.encoder.layer.6.attention.query.bias', 'vision_tower.encoder.layer.6.attention.query.weight', 'vision_tower.encoder.layer.6.attention.value.bias', 'vision_tower.encoder.layer.6.attention.value.weight', 'vision_tower.encoder.layer.6.mlp.down_proj.bias', 'vision_tower.encoder.layer.6.mlp.down_proj.weight', 'vision_tower.encoder.layer.6.mlp.up_proj.bias', 'vision_tower.encoder.layer.6.mlp.up_proj.weight', 'vision_tower.encoder.layer.7.attention.key.bias', 'vision_tower.encoder.layer.7.attention.key.weight', 'vision_tower.encoder.layer.7.attention.output.bias', 'vision_tower.encoder.layer.7.attention.output.weight', 'vision_tower.encoder.layer.7.attention.query.bias', 'vision_tower.encoder.layer.7.attention.query.weight', 'vision_tower.encoder.layer.7.attention.value.bias', 'vision_tower.encoder.layer.7.attention.value.weight', 
'vision_tower.encoder.layer.7.mlp.down_proj.bias', 'vision_tower.encoder.layer.7.mlp.down_proj.weight', 'vision_tower.encoder.layer.7.mlp.up_proj.bias', 'vision_tower.encoder.layer.7.mlp.up_proj.weight', 'vision_tower.encoder.layer.8.attention.key.bias', 'vision_tower.encoder.layer.8.attention.key.weight', 'vision_tower.encoder.layer.8.attention.output.bias', 'vision_tower.encoder.layer.8.attention.output.weight', 'vision_tower.encoder.layer.8.attention.query.bias', 'vision_tower.encoder.layer.8.attention.query.weight', 'vision_tower.encoder.layer.8.attention.value.bias', 'vision_tower.encoder.layer.8.attention.value.weight', 'vision_tower.encoder.layer.8.mlp.down_proj.bias', 'vision_tower.encoder.layer.8.mlp.down_proj.weight', 'vision_tower.encoder.layer.8.mlp.up_proj.bias', 'vision_tower.encoder.layer.8.mlp.up_proj.weight', 'vision_tower.encoder.layer.9.attention.key.bias', 'vision_tower.encoder.layer.9.attention.key.weight', 'vision_tower.encoder.layer.9.attention.output.bias', 'vision_tower.encoder.layer.9.attention.output.weight', 'vision_tower.encoder.layer.9.attention.query.bias', 'vision_tower.encoder.layer.9.attention.query.weight', 'vision_tower.encoder.layer.9.attention.value.bias', 'vision_tower.encoder.layer.9.attention.value.weight', 'vision_tower.encoder.layer.9.mlp.down_proj.bias', 'vision_tower.encoder.layer.9.mlp.down_proj.weight', 'vision_tower.encoder.layer.9.mlp.up_proj.bias', 'vision_tower.encoder.layer.9.mlp.up_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

It seems that none of the vision weights are initialized?
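If it helps, the same information can be pulled out programmatically; a minimal diagnostic sketch, using the transformers branch from this PR (the checkpoint path is the local one from the report above, and `output_loading_info` is a standard `from_pretrained` flag):

```python
from transformers import InternVLForConditionalGeneration

# Load the converted checkpoint and keep the loading report instead of only printing it.
model, loading_info = InternVLForConditionalGeneration.from_pretrained(
    "InternVL2_5-8B-MPO-hf",   # locally converted checkpoint from the report above
    output_loading_info=True,
)

# Checkpoint keys the model never consumed (old naming: attention.attention.*, intermediate.dense, ...)
print(len(loading_info["unexpected_keys"]))
# Model keys that had to be freshly (randomly) initialized (new naming: attention.*, mlp.up_proj, ...)
print(len(loading_info["missing_keys"]))
# If both lists are dominated by vision_tower.encoder.layer.*, the whole vision tower is random.
```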

Could you check which `path` is being used in your environment?

def write_tokenizer(save_dir: str, push_to_hub: bool = False, path: str = None, hub_dir: str = None):
    if LM_TYPE_CORRESPONDENCE[path] == "qwen2":
        tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", return_token_type_ids=False)
        tokenizer.model_max_length = CONTEXT_LENGTH

For InternVL2_5-8B, the code shouldn't go through this branch.
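For a llama-type path the other branch runs instead; roughly, with the InternLM2 tokenizer swap mentioned earlier in this thread, it would look like the sketch below (names are illustrative, not the exact script code):

```python
from transformers import AutoTokenizer

CONTEXT_LENGTH = 8192  # placeholder; use the constant defined in the conversion script

def write_llama_type_tokenizer(save_dir: str) -> None:
    # Rough sketch of the non-qwen2 branch: fall back to the InternLM2 tokenizer
    # instead of the bundled fast-tokenizer files.
    tokenizer = AutoTokenizer.from_pretrained(
        "internlm/internlm2-chat-7b",
        return_token_type_ids=False,
        trust_remote_code=True,
    )
    tokenizer.model_max_length = CONTEXT_LENGTH
    tokenizer.save_pretrained(save_dir)
```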

@liuchengyuan123


Actually, I am using a custom-pretrained OpenGVLab/InternVL2_5-8B model. In my opinion it should be identical to OpenGVLab/InternVL2_5-8B-MPO except for the parameter values.

According to your project:

LM_TYPE_CORRESPONDENCE = {
    "OpenGVLab/InternVL2_5-1B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-2B-MPO": "llama",
    "OpenGVLab/InternVL2_5-4B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-8B-MPO": "llama",
    "OpenGVLab/InternVL2_5-26B-MPO": "llama",
    "OpenGVLab/InternVL2_5-38B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-78B-MPO": "qwen2",
}

I suppose I should use a path that maps to "llama"?
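If so, one possible way to cover a custom checkpoint path (a hedged sketch, not something verified in this PR) is to register it under "llama" before running the conversion:

```python
# Hypothetical: map the local custom-pretrained checkpoint to the "llama" LM type,
# so write_tokenizer never enters the qwen2 (Qwen2.5-VL tokenizer) branch.
LM_TYPE_CORRESPONDENCE["/path/to/custom-InternVL2_5-8B"] = "llama"  # path is a placeholder
```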

@Kuangdd01
Collaborator Author

Yes! Please wait a moment, I am going to reproduce it.

@liuchengyuan123


When I use model = InternVLForConditionalGeneration.from_pretrained("InternVL2_5-8B-MPO-hf"), the following error is raised:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [00:23<00:00,  5.83s/it]
Some weights of the model checkpoint at InternVL2_5-8B-MPO-hf were not used when initializing InternVLForConditionalGeneration: ['vision_tower.encoder.layer.0.attention.attention.key.bias', 'vision_tower.encoder.layer.0.attention.attention.key.weight', 'vision_tower.encoder.layer.0.attention.attention.query.bias', 'vision_tower.encoder.layer.0.attention.attention.query.weight', 'vision_tower.encoder.layer.0.attention.attention.value.bias', 'vision_tower.encoder.layer.0.attention.attention.value.weight', 'vision_tower.encoder.layer.0.attention.output.dense.bias', 'vision_tower.encoder.layer.0.attention.output.dense.weight', 'vision_tower.encoder.layer.0.intermediate.dense.bias', 'vision_tower.encoder.layer.0.intermediate.dense.weight', 'vision_tower.encoder.layer.0.output.dense.bias', 'vision_tower.encoder.layer.0.output.dense.weight', 'vision_tower.encoder.layer.1.attention.attention.key.bias', 'vision_tower.encoder.layer.1.attention.attention.key.weight', 'vision_tower.encoder.layer.1.attention.attention.query.bias', 'vision_tower.encoder.layer.1.attention.attention.query.weight', 'vision_tower.encoder.layer.1.attention.attention.value.bias', 'vision_tower.encoder.layer.1.attention.attention.value.weight', 'vision_tower.encoder.layer.1.attention.output.dense.bias', 'vision_tower.encoder.layer.1.attention.output.dense.weight', 'vision_tower.encoder.layer.1.intermediate.dense.bias', 'vision_tower.encoder.layer.1.intermediate.dense.weight', 'vision_tower.encoder.layer.1.output.dense.bias', 'vision_tower.encoder.layer.1.output.dense.weight', 'vision_tower.encoder.layer.10.attention.attention.key.bias', 'vision_tower.encoder.layer.10.attention.attention.key.weight', 'vision_tower.encoder.layer.10.attention.attention.query.bias', 'vision_tower.encoder.layer.10.attention.attention.query.weight', 'vision_tower.encoder.layer.10.attention.attention.value.bias', 'vision_tower.encoder.layer.10.attention.attention.value.weight', 'vision_tower.encoder.layer.10.attention.output.dense.bias', 'vision_tower.encoder.layer.10.attention.output.dense.weight', 'vision_tower.encoder.layer.10.intermediate.dense.bias', 'vision_tower.encoder.layer.10.intermediate.dense.weight', 'vision_tower.encoder.layer.10.output.dense.bias', 'vision_tower.encoder.layer.10.output.dense.weight', 'vision_tower.encoder.layer.11.attention.attention.key.bias', 'vision_tower.encoder.layer.11.attention.attention.key.weight', 'vision_tower.encoder.layer.11.attention.attention.query.bias', 'vision_tower.encoder.layer.11.attention.attention.query.weight', 'vision_tower.encoder.layer.11.attention.attention.value.bias', 'vision_tower.encoder.layer.11.attention.attention.value.weight', 'vision_tower.encoder.layer.11.attention.output.dense.bias', 'vision_tower.encoder.layer.11.attention.output.dense.weight', 'vision_tower.encoder.layer.11.intermediate.dense.bias', 'vision_tower.encoder.layer.11.intermediate.dense.weight', 'vision_tower.encoder.layer.11.output.dense.bias', 'vision_tower.encoder.layer.11.output.dense.weight', 'vision_tower.encoder.layer.12.attention.attention.key.bias', 'vision_tower.encoder.layer.12.attention.attention.key.weight', 'vision_tower.encoder.layer.12.attention.attention.query.bias', 'vision_tower.encoder.layer.12.attention.attention.query.weight', 'vision_tower.encoder.layer.12.attention.attention.value.bias', 'vision_tower.encoder.layer.12.attention.attention.value.weight', 'vision_tower.encoder.layer.12.attention.output.dense.bias', 'vision_tower.encoder.layer.12.attention.output.dense.weight', 
'vision_tower.encoder.layer.12.intermediate.dense.bias', 'vision_tower.encoder.layer.12.intermediate.dense.weight', 'vision_tower.encoder.layer.12.output.dense.bias', 'vision_tower.encoder.layer.12.output.dense.weight', 'vision_tower.encoder.layer.13.attention.attention.key.bias', 'vision_tower.encoder.layer.13.attention.attention.key.weight', 'vision_tower.encoder.layer.13.attention.attention.query.bias', 'vision_tower.encoder.layer.13.attention.attention.query.weight', 'vision_tower.encoder.layer.13.attention.attention.value.bias', 'vision_tower.encoder.layer.13.attention.attention.value.weight', 'vision_tower.encoder.layer.13.attention.output.dense.bias', 'vision_tower.encoder.layer.13.attention.output.dense.weight', 'vision_tower.encoder.layer.13.intermediate.dense.bias', 'vision_tower.encoder.layer.13.intermediate.dense.weight', 'vision_tower.encoder.layer.13.output.dense.bias', 'vision_tower.encoder.layer.13.output.dense.weight', 'vision_tower.encoder.layer.14.attention.attention.key.bias', 'vision_tower.encoder.layer.14.attention.attention.key.weight', 'vision_tower.encoder.layer.14.attention.attention.query.bias', 'vision_tower.encoder.layer.14.attention.attention.query.weight', 'vision_tower.encoder.layer.14.attention.attention.value.bias', 'vision_tower.encoder.layer.14.attention.attention.value.weight', 'vision_tower.encoder.layer.14.attention.output.dense.bias', 'vision_tower.encoder.layer.14.attention.output.dense.weight', 'vision_tower.encoder.layer.14.intermediate.dense.bias', 'vision_tower.encoder.layer.14.intermediate.dense.weight', 'vision_tower.encoder.layer.14.output.dense.bias', 'vision_tower.encoder.layer.14.output.dense.weight', 'vision_tower.encoder.layer.15.attention.attention.key.bias', 'vision_tower.encoder.layer.15.attention.attention.key.weight', 'vision_tower.encoder.layer.15.attention.attention.query.bias', 'vision_tower.encoder.layer.15.attention.attention.query.weight', 'vision_tower.encoder.layer.15.attention.attention.value.bias', 'vision_tower.encoder.layer.15.attention.attention.value.weight', 'vision_tower.encoder.layer.15.attention.output.dense.bias', 'vision_tower.encoder.layer.15.attention.output.dense.weight', 'vision_tower.encoder.layer.15.intermediate.dense.bias', 'vision_tower.encoder.layer.15.intermediate.dense.weight', 'vision_tower.encoder.layer.15.output.dense.bias', 'vision_tower.encoder.layer.15.output.dense.weight', 'vision_tower.encoder.layer.16.attention.attention.key.bias', 'vision_tower.encoder.layer.16.attention.attention.key.weight', 'vision_tower.encoder.layer.16.attention.attention.query.bias', 'vision_tower.encoder.layer.16.attention.attention.query.weight', 'vision_tower.encoder.layer.16.attention.attention.value.bias', 'vision_tower.encoder.layer.16.attention.attention.value.weight', 'vision_tower.encoder.layer.16.attention.output.dense.bias', 'vision_tower.encoder.layer.16.attention.output.dense.weight', 'vision_tower.encoder.layer.16.intermediate.dense.bias', 'vision_tower.encoder.layer.16.intermediate.dense.weight', 'vision_tower.encoder.layer.16.output.dense.bias', 'vision_tower.encoder.layer.16.output.dense.weight', 'vision_tower.encoder.layer.17.attention.attention.key.bias', 'vision_tower.encoder.layer.17.attention.attention.key.weight', 'vision_tower.encoder.layer.17.attention.attention.query.bias', 'vision_tower.encoder.layer.17.attention.attention.query.weight', 'vision_tower.encoder.layer.17.attention.attention.value.bias', 'vision_tower.encoder.layer.17.attention.attention.value.weight', 
'vision_tower.encoder.layer.17.attention.output.dense.bias', 'vision_tower.encoder.layer.17.attention.output.dense.weight', 'vision_tower.encoder.layer.17.intermediate.dense.bias', 'vision_tower.encoder.layer.17.intermediate.dense.weight', 'vision_tower.encoder.layer.17.output.dense.bias', 'vision_tower.encoder.layer.17.output.dense.weight', 'vision_tower.encoder.layer.18.attention.attention.key.bias', 'vision_tower.encoder.layer.18.attention.attention.key.weight', 'vision_tower.encoder.layer.18.attention.attention.query.bias', 'vision_tower.encoder.layer.18.attention.attention.query.weight', 'vision_tower.encoder.layer.18.attention.attention.value.bias', 'vision_tower.encoder.layer.18.attention.attention.value.weight', 'vision_tower.encoder.layer.18.attention.output.dense.bias', 'vision_tower.encoder.layer.18.attention.output.dense.weight', 'vision_tower.encoder.layer.18.intermediate.dense.bias', 'vision_tower.encoder.layer.18.intermediate.dense.weight', 'vision_tower.encoder.layer.18.output.dense.bias', 'vision_tower.encoder.layer.18.output.dense.weight', 'vision_tower.encoder.layer.19.attention.attention.key.bias', 'vision_tower.encoder.layer.19.attention.attention.key.weight', 'vision_tower.encoder.layer.19.attention.attention.query.bias', 'vision_tower.encoder.layer.19.attention.attention.query.weight', 'vision_tower.encoder.layer.19.attention.attention.value.bias', 'vision_tower.encoder.layer.19.attention.attention.value.weight', 'vision_tower.encoder.layer.19.attention.output.dense.bias', 'vision_tower.encoder.layer.19.attention.output.dense.weight', 'vision_tower.encoder.layer.19.intermediate.dense.bias', 'vision_tower.encoder.layer.19.intermediate.dense.weight', 'vision_tower.encoder.layer.19.output.dense.bias', 'vision_tower.encoder.layer.19.output.dense.weight', 'vision_tower.encoder.layer.2.attention.attention.key.bias', 'vision_tower.encoder.layer.2.attention.attention.key.weight', 'vision_tower.encoder.layer.2.attention.attention.query.bias', 'vision_tower.encoder.layer.2.attention.attention.query.weight', 'vision_tower.encoder.layer.2.attention.attention.value.bias', 'vision_tower.encoder.layer.2.attention.attention.value.weight', 'vision_tower.encoder.layer.2.attention.output.dense.bias', 'vision_tower.encoder.layer.2.attention.output.dense.weight', 'vision_tower.encoder.layer.2.intermediate.dense.bias', 'vision_tower.encoder.layer.2.intermediate.dense.weight', 'vision_tower.encoder.layer.2.output.dense.bias', 'vision_tower.encoder.layer.2.output.dense.weight', 'vision_tower.encoder.layer.20.attention.attention.key.bias', 'vision_tower.encoder.layer.20.attention.attention.key.weight', 'vision_tower.encoder.layer.20.attention.attention.query.bias', 'vision_tower.encoder.layer.20.attention.attention.query.weight', 'vision_tower.encoder.layer.20.attention.attention.value.bias', 'vision_tower.encoder.layer.20.attention.attention.value.weight', 'vision_tower.encoder.layer.20.attention.output.dense.bias', 'vision_tower.encoder.layer.20.attention.output.dense.weight', 'vision_tower.encoder.layer.20.intermediate.dense.bias', 'vision_tower.encoder.layer.20.intermediate.dense.weight', 'vision_tower.encoder.layer.20.output.dense.bias', 'vision_tower.encoder.layer.20.output.dense.weight', 'vision_tower.encoder.layer.21.attention.attention.key.bias', 'vision_tower.encoder.layer.21.attention.attention.key.weight', 'vision_tower.encoder.layer.21.attention.attention.query.bias', 'vision_tower.encoder.layer.21.attention.attention.query.weight', 
'vision_tower.encoder.layer.21.attention.attention.value.bias', 'vision_tower.encoder.layer.21.attention.attention.value.weight', 'vision_tower.encoder.layer.21.attention.output.dense.bias', 'vision_tower.encoder.layer.21.attention.output.dense.weight', 'vision_tower.encoder.layer.21.intermediate.dense.bias', 'vision_tower.encoder.layer.21.intermediate.dense.weight', 'vision_tower.encoder.layer.21.output.dense.bias', 'vision_tower.encoder.layer.21.output.dense.weight', 'vision_tower.encoder.layer.22.attention.attention.key.bias', 'vision_tower.encoder.layer.22.attention.attention.key.weight', 'vision_tower.encoder.layer.22.attention.attention.query.bias', 'vision_tower.encoder.layer.22.attention.attention.query.weight', 'vision_tower.encoder.layer.22.attention.attention.value.bias', 'vision_tower.encoder.layer.22.attention.attention.value.weight', 'vision_tower.encoder.layer.22.attention.output.dense.bias', 'vision_tower.encoder.layer.22.attention.output.dense.weight', 'vision_tower.encoder.layer.22.intermediate.dense.bias', 'vision_tower.encoder.layer.22.intermediate.dense.weight', 'vision_tower.encoder.layer.22.output.dense.bias', 'vision_tower.encoder.layer.22.output.dense.weight', 'vision_tower.encoder.layer.23.attention.attention.key.bias', 'vision_tower.encoder.layer.23.attention.attention.key.weight', 'vision_tower.encoder.layer.23.attention.attention.query.bias', 'vision_tower.encoder.layer.23.attention.attention.query.weight', 'vision_tower.encoder.layer.23.attention.attention.value.bias', 'vision_tower.encoder.layer.23.attention.attention.value.weight', 'vision_tower.encoder.layer.23.attention.output.dense.bias', 'vision_tower.encoder.layer.23.attention.output.dense.weight', 'vision_tower.encoder.layer.23.intermediate.dense.bias', 'vision_tower.encoder.layer.23.intermediate.dense.weight', 'vision_tower.encoder.layer.23.output.dense.bias', 'vision_tower.encoder.layer.23.output.dense.weight', 'vision_tower.encoder.layer.3.attention.attention.key.bias', 'vision_tower.encoder.layer.3.attention.attention.key.weight', 'vision_tower.encoder.layer.3.attention.attention.query.bias', 'vision_tower.encoder.layer.3.attention.attention.query.weight', 'vision_tower.encoder.layer.3.attention.attention.value.bias', 'vision_tower.encoder.layer.3.attention.attention.value.weight', 'vision_tower.encoder.layer.3.attention.output.dense.bias', 'vision_tower.encoder.layer.3.attention.output.dense.weight', 'vision_tower.encoder.layer.3.intermediate.dense.bias', 'vision_tower.encoder.layer.3.intermediate.dense.weight', 'vision_tower.encoder.layer.3.output.dense.bias', 'vision_tower.encoder.layer.3.output.dense.weight', 'vision_tower.encoder.layer.4.attention.attention.key.bias', 'vision_tower.encoder.layer.4.attention.attention.key.weight', 'vision_tower.encoder.layer.4.attention.attention.query.bias', 'vision_tower.encoder.layer.4.attention.attention.query.weight', 'vision_tower.encoder.layer.4.attention.attention.value.bias', 'vision_tower.encoder.layer.4.attention.attention.value.weight', 'vision_tower.encoder.layer.4.attention.output.dense.bias', 'vision_tower.encoder.layer.4.attention.output.dense.weight', 'vision_tower.encoder.layer.4.intermediate.dense.bias', 'vision_tower.encoder.layer.4.intermediate.dense.weight', 'vision_tower.encoder.layer.4.output.dense.bias', 'vision_tower.encoder.layer.4.output.dense.weight', 'vision_tower.encoder.layer.5.attention.attention.key.bias', 'vision_tower.encoder.layer.5.attention.attention.key.weight', 
'vision_tower.encoder.layer.5.attention.attention.query.bias', 'vision_tower.encoder.layer.5.attention.attention.query.weight', 'vision_tower.encoder.layer.5.attention.attention.value.bias', 'vision_tower.encoder.layer.5.attention.attention.value.weight', 'vision_tower.encoder.layer.5.attention.output.dense.bias', 'vision_tower.encoder.layer.5.attention.output.dense.weight', 'vision_tower.encoder.layer.5.intermediate.dense.bias', 'vision_tower.encoder.layer.5.intermediate.dense.weight', 'vision_tower.encoder.layer.5.output.dense.bias', 'vision_tower.encoder.layer.5.output.dense.weight', 'vision_tower.encoder.layer.6.attention.attention.key.bias', 'vision_tower.encoder.layer.6.attention.attention.key.weight', 'vision_tower.encoder.layer.6.attention.attention.query.bias', 'vision_tower.encoder.layer.6.attention.attention.query.weight', 'vision_tower.encoder.layer.6.attention.attention.value.bias', 'vision_tower.encoder.layer.6.attention.attention.value.weight', 'vision_tower.encoder.layer.6.attention.output.dense.bias', 'vision_tower.encoder.layer.6.attention.output.dense.weight', 'vision_tower.encoder.layer.6.intermediate.dense.bias', 'vision_tower.encoder.layer.6.intermediate.dense.weight', 'vision_tower.encoder.layer.6.output.dense.bias', 'vision_tower.encoder.layer.6.output.dense.weight', 'vision_tower.encoder.layer.7.attention.attention.key.bias', 'vision_tower.encoder.layer.7.attention.attention.key.weight', 'vision_tower.encoder.layer.7.attention.attention.query.bias', 'vision_tower.encoder.layer.7.attention.attention.query.weight', 'vision_tower.encoder.layer.7.attention.attention.value.bias', 'vision_tower.encoder.layer.7.attention.attention.value.weight', 'vision_tower.encoder.layer.7.attention.output.dense.bias', 'vision_tower.encoder.layer.7.attention.output.dense.weight', 'vision_tower.encoder.layer.7.intermediate.dense.bias', 'vision_tower.encoder.layer.7.intermediate.dense.weight', 'vision_tower.encoder.layer.7.output.dense.bias', 'vision_tower.encoder.layer.7.output.dense.weight', 'vision_tower.encoder.layer.8.attention.attention.key.bias', 'vision_tower.encoder.layer.8.attention.attention.key.weight', 'vision_tower.encoder.layer.8.attention.attention.query.bias', 'vision_tower.encoder.layer.8.attention.attention.query.weight', 'vision_tower.encoder.layer.8.attention.attention.value.bias', 'vision_tower.encoder.layer.8.attention.attention.value.weight', 'vision_tower.encoder.layer.8.attention.output.dense.bias', 'vision_tower.encoder.layer.8.attention.output.dense.weight', 'vision_tower.encoder.layer.8.intermediate.dense.bias', 'vision_tower.encoder.layer.8.intermediate.dense.weight', 'vision_tower.encoder.layer.8.output.dense.bias', 'vision_tower.encoder.layer.8.output.dense.weight', 'vision_tower.encoder.layer.9.attention.attention.key.bias', 'vision_tower.encoder.layer.9.attention.attention.key.weight', 'vision_tower.encoder.layer.9.attention.attention.query.bias', 'vision_tower.encoder.layer.9.attention.attention.query.weight', 'vision_tower.encoder.layer.9.attention.attention.value.bias', 'vision_tower.encoder.layer.9.attention.attention.value.weight', 'vision_tower.encoder.layer.9.attention.output.dense.bias', 'vision_tower.encoder.layer.9.attention.output.dense.weight', 'vision_tower.encoder.layer.9.intermediate.dense.bias', 'vision_tower.encoder.layer.9.intermediate.dense.weight', 'vision_tower.encoder.layer.9.output.dense.bias', 'vision_tower.encoder.layer.9.output.dense.weight']
- This IS expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of InternVLForConditionalGeneration were not initialized from the model checkpoint at InternVL2_5-8B-MPO-hf and are newly initialized: ['vision_tower.encoder.layer.0.attention.key.bias', 'vision_tower.encoder.layer.0.attention.key.weight', 'vision_tower.encoder.layer.0.attention.output.bias', 'vision_tower.encoder.layer.0.attention.output.weight', 'vision_tower.encoder.layer.0.attention.query.bias', 'vision_tower.encoder.layer.0.attention.query.weight', 'vision_tower.encoder.layer.0.attention.value.bias', 'vision_tower.encoder.layer.0.attention.value.weight', 'vision_tower.encoder.layer.0.mlp.down_proj.bias', 'vision_tower.encoder.layer.0.mlp.down_proj.weight', 'vision_tower.encoder.layer.0.mlp.up_proj.bias', 'vision_tower.encoder.layer.0.mlp.up_proj.weight', 'vision_tower.encoder.layer.1.attention.key.bias', 'vision_tower.encoder.layer.1.attention.key.weight', 'vision_tower.encoder.layer.1.attention.output.bias', 'vision_tower.encoder.layer.1.attention.output.weight', 'vision_tower.encoder.layer.1.attention.query.bias', 'vision_tower.encoder.layer.1.attention.query.weight', 'vision_tower.encoder.layer.1.attention.value.bias', 'vision_tower.encoder.layer.1.attention.value.weight', 'vision_tower.encoder.layer.1.mlp.down_proj.bias', 'vision_tower.encoder.layer.1.mlp.down_proj.weight', 'vision_tower.encoder.layer.1.mlp.up_proj.bias', 'vision_tower.encoder.layer.1.mlp.up_proj.weight', 'vision_tower.encoder.layer.10.attention.key.bias', 'vision_tower.encoder.layer.10.attention.key.weight', 'vision_tower.encoder.layer.10.attention.output.bias', 'vision_tower.encoder.layer.10.attention.output.weight', 'vision_tower.encoder.layer.10.attention.query.bias', 'vision_tower.encoder.layer.10.attention.query.weight', 'vision_tower.encoder.layer.10.attention.value.bias', 'vision_tower.encoder.layer.10.attention.value.weight', 'vision_tower.encoder.layer.10.mlp.down_proj.bias', 'vision_tower.encoder.layer.10.mlp.down_proj.weight', 'vision_tower.encoder.layer.10.mlp.up_proj.bias', 'vision_tower.encoder.layer.10.mlp.up_proj.weight', 'vision_tower.encoder.layer.11.attention.key.bias', 'vision_tower.encoder.layer.11.attention.key.weight', 'vision_tower.encoder.layer.11.attention.output.bias', 'vision_tower.encoder.layer.11.attention.output.weight', 'vision_tower.encoder.layer.11.attention.query.bias', 'vision_tower.encoder.layer.11.attention.query.weight', 'vision_tower.encoder.layer.11.attention.value.bias', 'vision_tower.encoder.layer.11.attention.value.weight', 'vision_tower.encoder.layer.11.mlp.down_proj.bias', 'vision_tower.encoder.layer.11.mlp.down_proj.weight', 'vision_tower.encoder.layer.11.mlp.up_proj.bias', 'vision_tower.encoder.layer.11.mlp.up_proj.weight', 'vision_tower.encoder.layer.12.attention.key.bias', 'vision_tower.encoder.layer.12.attention.key.weight', 'vision_tower.encoder.layer.12.attention.output.bias', 'vision_tower.encoder.layer.12.attention.output.weight', 'vision_tower.encoder.layer.12.attention.query.bias', 'vision_tower.encoder.layer.12.attention.query.weight', 'vision_tower.encoder.layer.12.attention.value.bias', 'vision_tower.encoder.layer.12.attention.value.weight', 'vision_tower.encoder.layer.12.mlp.down_proj.bias', 'vision_tower.encoder.layer.12.mlp.down_proj.weight', 'vision_tower.encoder.layer.12.mlp.up_proj.bias', 'vision_tower.encoder.layer.12.mlp.up_proj.weight', 'vision_tower.encoder.layer.13.attention.key.bias', 'vision_tower.encoder.layer.13.attention.key.weight', 'vision_tower.encoder.layer.13.attention.output.bias', 
'vision_tower.encoder.layer.13.attention.output.weight', 'vision_tower.encoder.layer.13.attention.query.bias', 'vision_tower.encoder.layer.13.attention.query.weight', 'vision_tower.encoder.layer.13.attention.value.bias', 'vision_tower.encoder.layer.13.attention.value.weight', 'vision_tower.encoder.layer.13.mlp.down_proj.bias', 'vision_tower.encoder.layer.13.mlp.down_proj.weight', 'vision_tower.encoder.layer.13.mlp.up_proj.bias', 'vision_tower.encoder.layer.13.mlp.up_proj.weight', 'vision_tower.encoder.layer.14.attention.key.bias', 'vision_tower.encoder.layer.14.attention.key.weight', 'vision_tower.encoder.layer.14.attention.output.bias', 'vision_tower.encoder.layer.14.attention.output.weight', 'vision_tower.encoder.layer.14.attention.query.bias', 'vision_tower.encoder.layer.14.attention.query.weight', 'vision_tower.encoder.layer.14.attention.value.bias', 'vision_tower.encoder.layer.14.attention.value.weight', 'vision_tower.encoder.layer.14.mlp.down_proj.bias', 'vision_tower.encoder.layer.14.mlp.down_proj.weight', 'vision_tower.encoder.layer.14.mlp.up_proj.bias', 'vision_tower.encoder.layer.14.mlp.up_proj.weight', 'vision_tower.encoder.layer.15.attention.key.bias', 'vision_tower.encoder.layer.15.attention.key.weight', 'vision_tower.encoder.layer.15.attention.output.bias', 'vision_tower.encoder.layer.15.attention.output.weight', 'vision_tower.encoder.layer.15.attention.query.bias', 'vision_tower.encoder.layer.15.attention.query.weight', 'vision_tower.encoder.layer.15.attention.value.bias', 'vision_tower.encoder.layer.15.attention.value.weight', 'vision_tower.encoder.layer.15.mlp.down_proj.bias', 'vision_tower.encoder.layer.15.mlp.down_proj.weight', 'vision_tower.encoder.layer.15.mlp.up_proj.bias', 'vision_tower.encoder.layer.15.mlp.up_proj.weight', 'vision_tower.encoder.layer.16.attention.key.bias', 'vision_tower.encoder.layer.16.attention.key.weight', 'vision_tower.encoder.layer.16.attention.output.bias', 'vision_tower.encoder.layer.16.attention.output.weight', 'vision_tower.encoder.layer.16.attention.query.bias', 'vision_tower.encoder.layer.16.attention.query.weight', 'vision_tower.encoder.layer.16.attention.value.bias', 'vision_tower.encoder.layer.16.attention.value.weight', 'vision_tower.encoder.layer.16.mlp.down_proj.bias', 'vision_tower.encoder.layer.16.mlp.down_proj.weight', 'vision_tower.encoder.layer.16.mlp.up_proj.bias', 'vision_tower.encoder.layer.16.mlp.up_proj.weight', 'vision_tower.encoder.layer.17.attention.key.bias', 'vision_tower.encoder.layer.17.attention.key.weight', 'vision_tower.encoder.layer.17.attention.output.bias', 'vision_tower.encoder.layer.17.attention.output.weight', 'vision_tower.encoder.layer.17.attention.query.bias', 'vision_tower.encoder.layer.17.attention.query.weight', 'vision_tower.encoder.layer.17.attention.value.bias', 'vision_tower.encoder.layer.17.attention.value.weight', 'vision_tower.encoder.layer.17.mlp.down_proj.bias', 'vision_tower.encoder.layer.17.mlp.down_proj.weight', 'vision_tower.encoder.layer.17.mlp.up_proj.bias', 'vision_tower.encoder.layer.17.mlp.up_proj.weight', 'vision_tower.encoder.layer.18.attention.key.bias', 'vision_tower.encoder.layer.18.attention.key.weight', 'vision_tower.encoder.layer.18.attention.output.bias', 'vision_tower.encoder.layer.18.attention.output.weight', 'vision_tower.encoder.layer.18.attention.query.bias', 'vision_tower.encoder.layer.18.attention.query.weight', 'vision_tower.encoder.layer.18.attention.value.bias', 'vision_tower.encoder.layer.18.attention.value.weight', 
'vision_tower.encoder.layer.18.mlp.down_proj.bias', 'vision_tower.encoder.layer.18.mlp.down_proj.weight', 'vision_tower.encoder.layer.18.mlp.up_proj.bias', 'vision_tower.encoder.layer.18.mlp.up_proj.weight', 'vision_tower.encoder.layer.19.attention.key.bias', 'vision_tower.encoder.layer.19.attention.key.weight', 'vision_tower.encoder.layer.19.attention.output.bias', 'vision_tower.encoder.layer.19.attention.output.weight', 'vision_tower.encoder.layer.19.attention.query.bias', 'vision_tower.encoder.layer.19.attention.query.weight', 'vision_tower.encoder.layer.19.attention.value.bias', 'vision_tower.encoder.layer.19.attention.value.weight', 'vision_tower.encoder.layer.19.mlp.down_proj.bias', 'vision_tower.encoder.layer.19.mlp.down_proj.weight', 'vision_tower.encoder.layer.19.mlp.up_proj.bias', 'vision_tower.encoder.layer.19.mlp.up_proj.weight', 'vision_tower.encoder.layer.2.attention.key.bias', 'vision_tower.encoder.layer.2.attention.key.weight', 'vision_tower.encoder.layer.2.attention.output.bias', 'vision_tower.encoder.layer.2.attention.output.weight', 'vision_tower.encoder.layer.2.attention.query.bias', 'vision_tower.encoder.layer.2.attention.query.weight', 'vision_tower.encoder.layer.2.attention.value.bias', 'vision_tower.encoder.layer.2.attention.value.weight', 'vision_tower.encoder.layer.2.mlp.down_proj.bias', 'vision_tower.encoder.layer.2.mlp.down_proj.weight', 'vision_tower.encoder.layer.2.mlp.up_proj.bias', 'vision_tower.encoder.layer.2.mlp.up_proj.weight', 'vision_tower.encoder.layer.20.attention.key.bias', 'vision_tower.encoder.layer.20.attention.key.weight', 'vision_tower.encoder.layer.20.attention.output.bias', 'vision_tower.encoder.layer.20.attention.output.weight', 'vision_tower.encoder.layer.20.attention.query.bias', 'vision_tower.encoder.layer.20.attention.query.weight', 'vision_tower.encoder.layer.20.attention.value.bias', 'vision_tower.encoder.layer.20.attention.value.weight', 'vision_tower.encoder.layer.20.mlp.down_proj.bias', 'vision_tower.encoder.layer.20.mlp.down_proj.weight', 'vision_tower.encoder.layer.20.mlp.up_proj.bias', 'vision_tower.encoder.layer.20.mlp.up_proj.weight', 'vision_tower.encoder.layer.21.attention.key.bias', 'vision_tower.encoder.layer.21.attention.key.weight', 'vision_tower.encoder.layer.21.attention.output.bias', 'vision_tower.encoder.layer.21.attention.output.weight', 'vision_tower.encoder.layer.21.attention.query.bias', 'vision_tower.encoder.layer.21.attention.query.weight', 'vision_tower.encoder.layer.21.attention.value.bias', 'vision_tower.encoder.layer.21.attention.value.weight', 'vision_tower.encoder.layer.21.mlp.down_proj.bias', 'vision_tower.encoder.layer.21.mlp.down_proj.weight', 'vision_tower.encoder.layer.21.mlp.up_proj.bias', 'vision_tower.encoder.layer.21.mlp.up_proj.weight', 'vision_tower.encoder.layer.22.attention.key.bias', 'vision_tower.encoder.layer.22.attention.key.weight', 'vision_tower.encoder.layer.22.attention.output.bias', 'vision_tower.encoder.layer.22.attention.output.weight', 'vision_tower.encoder.layer.22.attention.query.bias', 'vision_tower.encoder.layer.22.attention.query.weight', 'vision_tower.encoder.layer.22.attention.value.bias', 'vision_tower.encoder.layer.22.attention.value.weight', 'vision_tower.encoder.layer.22.mlp.down_proj.bias', 'vision_tower.encoder.layer.22.mlp.down_proj.weight', 'vision_tower.encoder.layer.22.mlp.up_proj.bias', 'vision_tower.encoder.layer.22.mlp.up_proj.weight', 'vision_tower.encoder.layer.23.attention.key.bias', 'vision_tower.encoder.layer.23.attention.key.weight', 
'vision_tower.encoder.layer.23.attention.output.bias', 'vision_tower.encoder.layer.23.attention.output.weight', 'vision_tower.encoder.layer.23.attention.query.bias', 'vision_tower.encoder.layer.23.attention.query.weight', 'vision_tower.encoder.layer.23.attention.value.bias', 'vision_tower.encoder.layer.23.attention.value.weight', 'vision_tower.encoder.layer.23.mlp.down_proj.bias', 'vision_tower.encoder.layer.23.mlp.down_proj.weight', 'vision_tower.encoder.layer.23.mlp.up_proj.bias', 'vision_tower.encoder.layer.23.mlp.up_proj.weight', 'vision_tower.encoder.layer.3.attention.key.bias', 'vision_tower.encoder.layer.3.attention.key.weight', 'vision_tower.encoder.layer.3.attention.output.bias', 'vision_tower.encoder.layer.3.attention.output.weight', 'vision_tower.encoder.layer.3.attention.query.bias', 'vision_tower.encoder.layer.3.attention.query.weight', 'vision_tower.encoder.layer.3.attention.value.bias', 'vision_tower.encoder.layer.3.attention.value.weight', 'vision_tower.encoder.layer.3.mlp.down_proj.bias', 'vision_tower.encoder.layer.3.mlp.down_proj.weight', 'vision_tower.encoder.layer.3.mlp.up_proj.bias', 'vision_tower.encoder.layer.3.mlp.up_proj.weight', 'vision_tower.encoder.layer.4.attention.key.bias', 'vision_tower.encoder.layer.4.attention.key.weight', 'vision_tower.encoder.layer.4.attention.output.bias', 'vision_tower.encoder.layer.4.attention.output.weight', 'vision_tower.encoder.layer.4.attention.query.bias', 'vision_tower.encoder.layer.4.attention.query.weight', 'vision_tower.encoder.layer.4.attention.value.bias', 'vision_tower.encoder.layer.4.attention.value.weight', 'vision_tower.encoder.layer.4.mlp.down_proj.bias', 'vision_tower.encoder.layer.4.mlp.down_proj.weight', 'vision_tower.encoder.layer.4.mlp.up_proj.bias', 'vision_tower.encoder.layer.4.mlp.up_proj.weight', 'vision_tower.encoder.layer.5.attention.key.bias', 'vision_tower.encoder.layer.5.attention.key.weight', 'vision_tower.encoder.layer.5.attention.output.bias', 'vision_tower.encoder.layer.5.attention.output.weight', 'vision_tower.encoder.layer.5.attention.query.bias', 'vision_tower.encoder.layer.5.attention.query.weight', 'vision_tower.encoder.layer.5.attention.value.bias', 'vision_tower.encoder.layer.5.attention.value.weight', 'vision_tower.encoder.layer.5.mlp.down_proj.bias', 'vision_tower.encoder.layer.5.mlp.down_proj.weight', 'vision_tower.encoder.layer.5.mlp.up_proj.bias', 'vision_tower.encoder.layer.5.mlp.up_proj.weight', 'vision_tower.encoder.layer.6.attention.key.bias', 'vision_tower.encoder.layer.6.attention.key.weight', 'vision_tower.encoder.layer.6.attention.output.bias', 'vision_tower.encoder.layer.6.attention.output.weight', 'vision_tower.encoder.layer.6.attention.query.bias', 'vision_tower.encoder.layer.6.attention.query.weight', 'vision_tower.encoder.layer.6.attention.value.bias', 'vision_tower.encoder.layer.6.attention.value.weight', 'vision_tower.encoder.layer.6.mlp.down_proj.bias', 'vision_tower.encoder.layer.6.mlp.down_proj.weight', 'vision_tower.encoder.layer.6.mlp.up_proj.bias', 'vision_tower.encoder.layer.6.mlp.up_proj.weight', 'vision_tower.encoder.layer.7.attention.key.bias', 'vision_tower.encoder.layer.7.attention.key.weight', 'vision_tower.encoder.layer.7.attention.output.bias', 'vision_tower.encoder.layer.7.attention.output.weight', 'vision_tower.encoder.layer.7.attention.query.bias', 'vision_tower.encoder.layer.7.attention.query.weight', 'vision_tower.encoder.layer.7.attention.value.bias', 'vision_tower.encoder.layer.7.attention.value.weight', 
'vision_tower.encoder.layer.7.mlp.down_proj.bias', 'vision_tower.encoder.layer.7.mlp.down_proj.weight', 'vision_tower.encoder.layer.7.mlp.up_proj.bias', 'vision_tower.encoder.layer.7.mlp.up_proj.weight', 'vision_tower.encoder.layer.8.attention.key.bias', 'vision_tower.encoder.layer.8.attention.key.weight', 'vision_tower.encoder.layer.8.attention.output.bias', 'vision_tower.encoder.layer.8.attention.output.weight', 'vision_tower.encoder.layer.8.attention.query.bias', 'vision_tower.encoder.layer.8.attention.query.weight', 'vision_tower.encoder.layer.8.attention.value.bias', 'vision_tower.encoder.layer.8.attention.value.weight', 'vision_tower.encoder.layer.8.mlp.down_proj.bias', 'vision_tower.encoder.layer.8.mlp.down_proj.weight', 'vision_tower.encoder.layer.8.mlp.up_proj.bias', 'vision_tower.encoder.layer.8.mlp.up_proj.weight', 'vision_tower.encoder.layer.9.attention.key.bias', 'vision_tower.encoder.layer.9.attention.key.weight', 'vision_tower.encoder.layer.9.attention.output.bias', 'vision_tower.encoder.layer.9.attention.output.weight', 'vision_tower.encoder.layer.9.attention.query.bias', 'vision_tower.encoder.layer.9.attention.query.weight', 'vision_tower.encoder.layer.9.attention.value.bias', 'vision_tower.encoder.layer.9.attention.value.weight', 'vision_tower.encoder.layer.9.mlp.down_proj.bias', 'vision_tower.encoder.layer.9.mlp.down_proj.weight', 'vision_tower.encoder.layer.9.mlp.up_proj.bias', 'vision_tower.encoder.layer.9.mlp.up_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

It seems that none of the vision parts are initialized?

Additionally, this error occurs when I directly attempt to load your published model, without running the weight-conversion script. I don't know what could cause the mismatch between the parameter names.

I updated to the newest transformers version by pip installing the following branch: https://github.com/yonigozlan/transformers/tree/add-intern-vl
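For reference, a minimal way to list the two naming schemes side by side (just a sketch; it assumes the converted checkpoint directory is local and sharded into safetensors, as the loading bar suggests, and that the transformers branch providing `InternVLForConditionalGeneration` is installed):

```python
import json
import os

from transformers import InternVLForConditionalGeneration

ckpt_dir = "InternVL2_5-8B-MPO-hf"  # hypothetical local path to the converted checkpoint

# Parameter names stored in the sharded safetensors checkpoint.
with open(os.path.join(ckpt_dir, "model.safetensors.index.json")) as f:
    ckpt_keys = set(json.load(f)["weight_map"])

# Parameter names the installed transformers implementation expects.
model_keys = set(InternVLForConditionalGeneration.from_pretrained(ckpt_dir).state_dict())

# Old-style names only in the checkpoint vs. new-style names only in the model.
print(sorted(k for k in ckpt_keys - model_keys if "vision_tower" in k)[:4])
print(sorted(k for k in model_keys - ckpt_keys if "vision_tower" in k)[:4])
```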

@Kuangdd01
Collaborator Author

How to convert an InternVL2_5-1B-MPO model checkpoint to hf if I have a customized-pretrained InternVL model? Please help!
cvt_internvl_weights_to_hf.py seems to require an intern_vl_hf_implem/tokenizer_internvl_llama_fast.

In my case, just replace that code with:

```python
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", return_token_type_ids=False, trust_remote_code=True)
```

This will replace the tokenizer with InternLM2's.

Thanks a lot for your timely response!!
When I use `model = InternVLForConditionalGeneration.from_pretrained("InternVL2_5-8B-MPO-hf")`, an error is raised:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [00:23<00:00,  5.83s/it]
Some weights of the model checkpoint at InternVL2_5-8B-MPO-hf were not used when initializing InternVLForConditionalGeneration: ['vision_tower.encoder.layer.0.attention.attention.key.bias', 'vision_tower.encoder.layer.0.attention.attention.key.weight', 'vision_tower.encoder.layer.0.attention.attention.query.bias', 'vision_tower.encoder.layer.0.attention.attention.query.weight', 'vision_tower.encoder.layer.0.attention.attention.value.bias', 'vision_tower.encoder.layer.0.attention.attention.value.weight', 'vision_tower.encoder.layer.0.attention.output.dense.bias', 'vision_tower.encoder.layer.0.attention.output.dense.weight', 'vision_tower.encoder.layer.0.intermediate.dense.bias', 'vision_tower.encoder.layer.0.intermediate.dense.weight', 'vision_tower.encoder.layer.0.output.dense.bias', 'vision_tower.encoder.layer.0.output.dense.weight', 'vision_tower.encoder.layer.1.attention.attention.key.bias', 'vision_tower.encoder.layer.1.attention.attention.key.weight', 'vision_tower.encoder.layer.1.attention.attention.query.bias', 'vision_tower.encoder.layer.1.attention.attention.query.weight', 'vision_tower.encoder.layer.1.attention.attention.value.bias', 'vision_tower.encoder.layer.1.attention.attention.value.weight', 'vision_tower.encoder.layer.1.attention.output.dense.bias', 'vision_tower.encoder.layer.1.attention.output.dense.weight', 'vision_tower.encoder.layer.1.intermediate.dense.bias', 'vision_tower.encoder.layer.1.intermediate.dense.weight', 'vision_tower.encoder.layer.1.output.dense.bias', 'vision_tower.encoder.layer.1.output.dense.weight', 'vision_tower.encoder.layer.10.attention.attention.key.bias', 'vision_tower.encoder.layer.10.attention.attention.key.weight', 'vision_tower.encoder.layer.10.attention.attention.query.bias', 'vision_tower.encoder.layer.10.attention.attention.query.weight', 'vision_tower.encoder.layer.10.attention.attention.value.bias', 'vision_tower.encoder.layer.10.attention.attention.value.weight', 'vision_tower.encoder.layer.10.attention.output.dense.bias', 'vision_tower.encoder.layer.10.attention.output.dense.weight', 'vision_tower.encoder.layer.10.intermediate.dense.bias', 'vision_tower.encoder.layer.10.intermediate.dense.weight', 'vision_tower.encoder.layer.10.output.dense.bias', 'vision_tower.encoder.layer.10.output.dense.weight', 'vision_tower.encoder.layer.11.attention.attention.key.bias', 'vision_tower.encoder.layer.11.attention.attention.key.weight', 'vision_tower.encoder.layer.11.attention.attention.query.bias', 'vision_tower.encoder.layer.11.attention.attention.query.weight', 'vision_tower.encoder.layer.11.attention.attention.value.bias', 'vision_tower.encoder.layer.11.attention.attention.value.weight', 'vision_tower.encoder.layer.11.attention.output.dense.bias', 'vision_tower.encoder.layer.11.attention.output.dense.weight', 'vision_tower.encoder.layer.11.intermediate.dense.bias', 'vision_tower.encoder.layer.11.intermediate.dense.weight', 'vision_tower.encoder.layer.11.output.dense.bias', 'vision_tower.encoder.layer.11.output.dense.weight', 'vision_tower.encoder.layer.12.attention.attention.key.bias', 'vision_tower.encoder.layer.12.attention.attention.key.weight', 'vision_tower.encoder.layer.12.attention.attention.query.bias', 'vision_tower.encoder.layer.12.attention.attention.query.weight', 'vision_tower.encoder.layer.12.attention.attention.value.bias', 'vision_tower.encoder.layer.12.attention.attention.value.weight', 'vision_tower.encoder.layer.12.attention.output.dense.bias', 'vision_tower.encoder.layer.12.attention.output.dense.weight', 
'vision_tower.encoder.layer.12.intermediate.dense.bias', 'vision_tower.encoder.layer.12.intermediate.dense.weight', 'vision_tower.encoder.layer.12.output.dense.bias', 'vision_tower.encoder.layer.12.output.dense.weight', 'vision_tower.encoder.layer.13.attention.attention.key.bias', 'vision_tower.encoder.layer.13.attention.attention.key.weight', 'vision_tower.encoder.layer.13.attention.attention.query.bias', 'vision_tower.encoder.layer.13.attention.attention.query.weight', 'vision_tower.encoder.layer.13.attention.attention.value.bias', 'vision_tower.encoder.layer.13.attention.attention.value.weight', 'vision_tower.encoder.layer.13.attention.output.dense.bias', 'vision_tower.encoder.layer.13.attention.output.dense.weight', 'vision_tower.encoder.layer.13.intermediate.dense.bias', 'vision_tower.encoder.layer.13.intermediate.dense.weight', 'vision_tower.encoder.layer.13.output.dense.bias', 'vision_tower.encoder.layer.13.output.dense.weight', 'vision_tower.encoder.layer.14.attention.attention.key.bias', 'vision_tower.encoder.layer.14.attention.attention.key.weight', 'vision_tower.encoder.layer.14.attention.attention.query.bias', 'vision_tower.encoder.layer.14.attention.attention.query.weight', 'vision_tower.encoder.layer.14.attention.attention.value.bias', 'vision_tower.encoder.layer.14.attention.attention.value.weight', 'vision_tower.encoder.layer.14.attention.output.dense.bias', 'vision_tower.encoder.layer.14.attention.output.dense.weight', 'vision_tower.encoder.layer.14.intermediate.dense.bias', 'vision_tower.encoder.layer.14.intermediate.dense.weight', 'vision_tower.encoder.layer.14.output.dense.bias', 'vision_tower.encoder.layer.14.output.dense.weight', 'vision_tower.encoder.layer.15.attention.attention.key.bias', 'vision_tower.encoder.layer.15.attention.attention.key.weight', 'vision_tower.encoder.layer.15.attention.attention.query.bias', 'vision_tower.encoder.layer.15.attention.attention.query.weight', 'vision_tower.encoder.layer.15.attention.attention.value.bias', 'vision_tower.encoder.layer.15.attention.attention.value.weight', 'vision_tower.encoder.layer.15.attention.output.dense.bias', 'vision_tower.encoder.layer.15.attention.output.dense.weight', 'vision_tower.encoder.layer.15.intermediate.dense.bias', 'vision_tower.encoder.layer.15.intermediate.dense.weight', 'vision_tower.encoder.layer.15.output.dense.bias', 'vision_tower.encoder.layer.15.output.dense.weight', 'vision_tower.encoder.layer.16.attention.attention.key.bias', 'vision_tower.encoder.layer.16.attention.attention.key.weight', 'vision_tower.encoder.layer.16.attention.attention.query.bias', 'vision_tower.encoder.layer.16.attention.attention.query.weight', 'vision_tower.encoder.layer.16.attention.attention.value.bias', 'vision_tower.encoder.layer.16.attention.attention.value.weight', 'vision_tower.encoder.layer.16.attention.output.dense.bias', 'vision_tower.encoder.layer.16.attention.output.dense.weight', 'vision_tower.encoder.layer.16.intermediate.dense.bias', 'vision_tower.encoder.layer.16.intermediate.dense.weight', 'vision_tower.encoder.layer.16.output.dense.bias', 'vision_tower.encoder.layer.16.output.dense.weight', 'vision_tower.encoder.layer.17.attention.attention.key.bias', 'vision_tower.encoder.layer.17.attention.attention.key.weight', 'vision_tower.encoder.layer.17.attention.attention.query.bias', 'vision_tower.encoder.layer.17.attention.attention.query.weight', 'vision_tower.encoder.layer.17.attention.attention.value.bias', 'vision_tower.encoder.layer.17.attention.attention.value.weight', 
'vision_tower.encoder.layer.17.attention.output.dense.bias', 'vision_tower.encoder.layer.17.attention.output.dense.weight', 'vision_tower.encoder.layer.17.intermediate.dense.bias', 'vision_tower.encoder.layer.17.intermediate.dense.weight', 'vision_tower.encoder.layer.17.output.dense.bias', 'vision_tower.encoder.layer.17.output.dense.weight', 'vision_tower.encoder.layer.18.attention.attention.key.bias', 'vision_tower.encoder.layer.18.attention.attention.key.weight', 'vision_tower.encoder.layer.18.attention.attention.query.bias', 'vision_tower.encoder.layer.18.attention.attention.query.weight', 'vision_tower.encoder.layer.18.attention.attention.value.bias', 'vision_tower.encoder.layer.18.attention.attention.value.weight', 'vision_tower.encoder.layer.18.attention.output.dense.bias', 'vision_tower.encoder.layer.18.attention.output.dense.weight', 'vision_tower.encoder.layer.18.intermediate.dense.bias', 'vision_tower.encoder.layer.18.intermediate.dense.weight', 'vision_tower.encoder.layer.18.output.dense.bias', 'vision_tower.encoder.layer.18.output.dense.weight', 'vision_tower.encoder.layer.19.attention.attention.key.bias', 'vision_tower.encoder.layer.19.attention.attention.key.weight', 'vision_tower.encoder.layer.19.attention.attention.query.bias', 'vision_tower.encoder.layer.19.attention.attention.query.weight', 'vision_tower.encoder.layer.19.attention.attention.value.bias', 'vision_tower.encoder.layer.19.attention.attention.value.weight', 'vision_tower.encoder.layer.19.attention.output.dense.bias', 'vision_tower.encoder.layer.19.attention.output.dense.weight', 'vision_tower.encoder.layer.19.intermediate.dense.bias', 'vision_tower.encoder.layer.19.intermediate.dense.weight', 'vision_tower.encoder.layer.19.output.dense.bias', 'vision_tower.encoder.layer.19.output.dense.weight', 'vision_tower.encoder.layer.2.attention.attention.key.bias', 'vision_tower.encoder.layer.2.attention.attention.key.weight', 'vision_tower.encoder.layer.2.attention.attention.query.bias', 'vision_tower.encoder.layer.2.attention.attention.query.weight', 'vision_tower.encoder.layer.2.attention.attention.value.bias', 'vision_tower.encoder.layer.2.attention.attention.value.weight', 'vision_tower.encoder.layer.2.attention.output.dense.bias', 'vision_tower.encoder.layer.2.attention.output.dense.weight', 'vision_tower.encoder.layer.2.intermediate.dense.bias', 'vision_tower.encoder.layer.2.intermediate.dense.weight', 'vision_tower.encoder.layer.2.output.dense.bias', 'vision_tower.encoder.layer.2.output.dense.weight', 'vision_tower.encoder.layer.20.attention.attention.key.bias', 'vision_tower.encoder.layer.20.attention.attention.key.weight', 'vision_tower.encoder.layer.20.attention.attention.query.bias', 'vision_tower.encoder.layer.20.attention.attention.query.weight', 'vision_tower.encoder.layer.20.attention.attention.value.bias', 'vision_tower.encoder.layer.20.attention.attention.value.weight', 'vision_tower.encoder.layer.20.attention.output.dense.bias', 'vision_tower.encoder.layer.20.attention.output.dense.weight', 'vision_tower.encoder.layer.20.intermediate.dense.bias', 'vision_tower.encoder.layer.20.intermediate.dense.weight', 'vision_tower.encoder.layer.20.output.dense.bias', 'vision_tower.encoder.layer.20.output.dense.weight', 'vision_tower.encoder.layer.21.attention.attention.key.bias', 'vision_tower.encoder.layer.21.attention.attention.key.weight', 'vision_tower.encoder.layer.21.attention.attention.query.bias', 'vision_tower.encoder.layer.21.attention.attention.query.weight', 
'vision_tower.encoder.layer.21.attention.attention.value.bias', 'vision_tower.encoder.layer.21.attention.attention.value.weight', 'vision_tower.encoder.layer.21.attention.output.dense.bias', 'vision_tower.encoder.layer.21.attention.output.dense.weight', 'vision_tower.encoder.layer.21.intermediate.dense.bias', 'vision_tower.encoder.layer.21.intermediate.dense.weight', 'vision_tower.encoder.layer.21.output.dense.bias', 'vision_tower.encoder.layer.21.output.dense.weight', 'vision_tower.encoder.layer.22.attention.attention.key.bias', 'vision_tower.encoder.layer.22.attention.attention.key.weight', 'vision_tower.encoder.layer.22.attention.attention.query.bias', 'vision_tower.encoder.layer.22.attention.attention.query.weight', 'vision_tower.encoder.layer.22.attention.attention.value.bias', 'vision_tower.encoder.layer.22.attention.attention.value.weight', 'vision_tower.encoder.layer.22.attention.output.dense.bias', 'vision_tower.encoder.layer.22.attention.output.dense.weight', 'vision_tower.encoder.layer.22.intermediate.dense.bias', 'vision_tower.encoder.layer.22.intermediate.dense.weight', 'vision_tower.encoder.layer.22.output.dense.bias', 'vision_tower.encoder.layer.22.output.dense.weight', 'vision_tower.encoder.layer.23.attention.attention.key.bias', 'vision_tower.encoder.layer.23.attention.attention.key.weight', 'vision_tower.encoder.layer.23.attention.attention.query.bias', 'vision_tower.encoder.layer.23.attention.attention.query.weight', 'vision_tower.encoder.layer.23.attention.attention.value.bias', 'vision_tower.encoder.layer.23.attention.attention.value.weight', 'vision_tower.encoder.layer.23.attention.output.dense.bias', 'vision_tower.encoder.layer.23.attention.output.dense.weight', 'vision_tower.encoder.layer.23.intermediate.dense.bias', 'vision_tower.encoder.layer.23.intermediate.dense.weight', 'vision_tower.encoder.layer.23.output.dense.bias', 'vision_tower.encoder.layer.23.output.dense.weight', 'vision_tower.encoder.layer.3.attention.attention.key.bias', 'vision_tower.encoder.layer.3.attention.attention.key.weight', 'vision_tower.encoder.layer.3.attention.attention.query.bias', 'vision_tower.encoder.layer.3.attention.attention.query.weight', 'vision_tower.encoder.layer.3.attention.attention.value.bias', 'vision_tower.encoder.layer.3.attention.attention.value.weight', 'vision_tower.encoder.layer.3.attention.output.dense.bias', 'vision_tower.encoder.layer.3.attention.output.dense.weight', 'vision_tower.encoder.layer.3.intermediate.dense.bias', 'vision_tower.encoder.layer.3.intermediate.dense.weight', 'vision_tower.encoder.layer.3.output.dense.bias', 'vision_tower.encoder.layer.3.output.dense.weight', 'vision_tower.encoder.layer.4.attention.attention.key.bias', 'vision_tower.encoder.layer.4.attention.attention.key.weight', 'vision_tower.encoder.layer.4.attention.attention.query.bias', 'vision_tower.encoder.layer.4.attention.attention.query.weight', 'vision_tower.encoder.layer.4.attention.attention.value.bias', 'vision_tower.encoder.layer.4.attention.attention.value.weight', 'vision_tower.encoder.layer.4.attention.output.dense.bias', 'vision_tower.encoder.layer.4.attention.output.dense.weight', 'vision_tower.encoder.layer.4.intermediate.dense.bias', 'vision_tower.encoder.layer.4.intermediate.dense.weight', 'vision_tower.encoder.layer.4.output.dense.bias', 'vision_tower.encoder.layer.4.output.dense.weight', 'vision_tower.encoder.layer.5.attention.attention.key.bias', 'vision_tower.encoder.layer.5.attention.attention.key.weight', 
'vision_tower.encoder.layer.5.attention.attention.query.bias', 'vision_tower.encoder.layer.5.attention.attention.query.weight', 'vision_tower.encoder.layer.5.attention.attention.value.bias', 'vision_tower.encoder.layer.5.attention.attention.value.weight', 'vision_tower.encoder.layer.5.attention.output.dense.bias', 'vision_tower.encoder.layer.5.attention.output.dense.weight', 'vision_tower.encoder.layer.5.intermediate.dense.bias', 'vision_tower.encoder.layer.5.intermediate.dense.weight', 'vision_tower.encoder.layer.5.output.dense.bias', 'vision_tower.encoder.layer.5.output.dense.weight', 'vision_tower.encoder.layer.6.attention.attention.key.bias', 'vision_tower.encoder.layer.6.attention.attention.key.weight', 'vision_tower.encoder.layer.6.attention.attention.query.bias', 'vision_tower.encoder.layer.6.attention.attention.query.weight', 'vision_tower.encoder.layer.6.attention.attention.value.bias', 'vision_tower.encoder.layer.6.attention.attention.value.weight', 'vision_tower.encoder.layer.6.attention.output.dense.bias', 'vision_tower.encoder.layer.6.attention.output.dense.weight', 'vision_tower.encoder.layer.6.intermediate.dense.bias', 'vision_tower.encoder.layer.6.intermediate.dense.weight', 'vision_tower.encoder.layer.6.output.dense.bias', 'vision_tower.encoder.layer.6.output.dense.weight', 'vision_tower.encoder.layer.7.attention.attention.key.bias', 'vision_tower.encoder.layer.7.attention.attention.key.weight', 'vision_tower.encoder.layer.7.attention.attention.query.bias', 'vision_tower.encoder.layer.7.attention.attention.query.weight', 'vision_tower.encoder.layer.7.attention.attention.value.bias', 'vision_tower.encoder.layer.7.attention.attention.value.weight', 'vision_tower.encoder.layer.7.attention.output.dense.bias', 'vision_tower.encoder.layer.7.attention.output.dense.weight', 'vision_tower.encoder.layer.7.intermediate.dense.bias', 'vision_tower.encoder.layer.7.intermediate.dense.weight', 'vision_tower.encoder.layer.7.output.dense.bias', 'vision_tower.encoder.layer.7.output.dense.weight', 'vision_tower.encoder.layer.8.attention.attention.key.bias', 'vision_tower.encoder.layer.8.attention.attention.key.weight', 'vision_tower.encoder.layer.8.attention.attention.query.bias', 'vision_tower.encoder.layer.8.attention.attention.query.weight', 'vision_tower.encoder.layer.8.attention.attention.value.bias', 'vision_tower.encoder.layer.8.attention.attention.value.weight', 'vision_tower.encoder.layer.8.attention.output.dense.bias', 'vision_tower.encoder.layer.8.attention.output.dense.weight', 'vision_tower.encoder.layer.8.intermediate.dense.bias', 'vision_tower.encoder.layer.8.intermediate.dense.weight', 'vision_tower.encoder.layer.8.output.dense.bias', 'vision_tower.encoder.layer.8.output.dense.weight', 'vision_tower.encoder.layer.9.attention.attention.key.bias', 'vision_tower.encoder.layer.9.attention.attention.key.weight', 'vision_tower.encoder.layer.9.attention.attention.query.bias', 'vision_tower.encoder.layer.9.attention.attention.query.weight', 'vision_tower.encoder.layer.9.attention.attention.value.bias', 'vision_tower.encoder.layer.9.attention.attention.value.weight', 'vision_tower.encoder.layer.9.attention.output.dense.bias', 'vision_tower.encoder.layer.9.attention.output.dense.weight', 'vision_tower.encoder.layer.9.intermediate.dense.bias', 'vision_tower.encoder.layer.9.intermediate.dense.weight', 'vision_tower.encoder.layer.9.output.dense.bias', 'vision_tower.encoder.layer.9.output.dense.weight']
- This IS expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of InternVLForConditionalGeneration were not initialized from the model checkpoint at InternVL2_5-8B-MPO-hf and are newly initialized: ['vision_tower.encoder.layer.0.attention.key.bias', 'vision_tower.encoder.layer.0.attention.key.weight', 'vision_tower.encoder.layer.0.attention.output.bias', 'vision_tower.encoder.layer.0.attention.output.weight', 'vision_tower.encoder.layer.0.attention.query.bias', 'vision_tower.encoder.layer.0.attention.query.weight', 'vision_tower.encoder.layer.0.attention.value.bias', 'vision_tower.encoder.layer.0.attention.value.weight', 'vision_tower.encoder.layer.0.mlp.down_proj.bias', 'vision_tower.encoder.layer.0.mlp.down_proj.weight', 'vision_tower.encoder.layer.0.mlp.up_proj.bias', 'vision_tower.encoder.layer.0.mlp.up_proj.weight', 'vision_tower.encoder.layer.1.attention.key.bias', 'vision_tower.encoder.layer.1.attention.key.weight', 'vision_tower.encoder.layer.1.attention.output.bias', 'vision_tower.encoder.layer.1.attention.output.weight', 'vision_tower.encoder.layer.1.attention.query.bias', 'vision_tower.encoder.layer.1.attention.query.weight', 'vision_tower.encoder.layer.1.attention.value.bias', 'vision_tower.encoder.layer.1.attention.value.weight', 'vision_tower.encoder.layer.1.mlp.down_proj.bias', 'vision_tower.encoder.layer.1.mlp.down_proj.weight', 'vision_tower.encoder.layer.1.mlp.up_proj.bias', 'vision_tower.encoder.layer.1.mlp.up_proj.weight', 'vision_tower.encoder.layer.10.attention.key.bias', 'vision_tower.encoder.layer.10.attention.key.weight', 'vision_tower.encoder.layer.10.attention.output.bias', 'vision_tower.encoder.layer.10.attention.output.weight', 'vision_tower.encoder.layer.10.attention.query.bias', 'vision_tower.encoder.layer.10.attention.query.weight', 'vision_tower.encoder.layer.10.attention.value.bias', 'vision_tower.encoder.layer.10.attention.value.weight', 'vision_tower.encoder.layer.10.mlp.down_proj.bias', 'vision_tower.encoder.layer.10.mlp.down_proj.weight', 'vision_tower.encoder.layer.10.mlp.up_proj.bias', 'vision_tower.encoder.layer.10.mlp.up_proj.weight', 'vision_tower.encoder.layer.11.attention.key.bias', 'vision_tower.encoder.layer.11.attention.key.weight', 'vision_tower.encoder.layer.11.attention.output.bias', 'vision_tower.encoder.layer.11.attention.output.weight', 'vision_tower.encoder.layer.11.attention.query.bias', 'vision_tower.encoder.layer.11.attention.query.weight', 'vision_tower.encoder.layer.11.attention.value.bias', 'vision_tower.encoder.layer.11.attention.value.weight', 'vision_tower.encoder.layer.11.mlp.down_proj.bias', 'vision_tower.encoder.layer.11.mlp.down_proj.weight', 'vision_tower.encoder.layer.11.mlp.up_proj.bias', 'vision_tower.encoder.layer.11.mlp.up_proj.weight', 'vision_tower.encoder.layer.12.attention.key.bias', 'vision_tower.encoder.layer.12.attention.key.weight', 'vision_tower.encoder.layer.12.attention.output.bias', 'vision_tower.encoder.layer.12.attention.output.weight', 'vision_tower.encoder.layer.12.attention.query.bias', 'vision_tower.encoder.layer.12.attention.query.weight', 'vision_tower.encoder.layer.12.attention.value.bias', 'vision_tower.encoder.layer.12.attention.value.weight', 'vision_tower.encoder.layer.12.mlp.down_proj.bias', 'vision_tower.encoder.layer.12.mlp.down_proj.weight', 'vision_tower.encoder.layer.12.mlp.up_proj.bias', 'vision_tower.encoder.layer.12.mlp.up_proj.weight', 'vision_tower.encoder.layer.13.attention.key.bias', 'vision_tower.encoder.layer.13.attention.key.weight', 'vision_tower.encoder.layer.13.attention.output.bias', 
'vision_tower.encoder.layer.13.attention.output.weight', 'vision_tower.encoder.layer.13.attention.query.bias', 'vision_tower.encoder.layer.13.attention.query.weight', 'vision_tower.encoder.layer.13.attention.value.bias', 'vision_tower.encoder.layer.13.attention.value.weight', 'vision_tower.encoder.layer.13.mlp.down_proj.bias', 'vision_tower.encoder.layer.13.mlp.down_proj.weight', 'vision_tower.encoder.layer.13.mlp.up_proj.bias', 'vision_tower.encoder.layer.13.mlp.up_proj.weight', 'vision_tower.encoder.layer.14.attention.key.bias', 'vision_tower.encoder.layer.14.attention.key.weight', 'vision_tower.encoder.layer.14.attention.output.bias', 'vision_tower.encoder.layer.14.attention.output.weight', 'vision_tower.encoder.layer.14.attention.query.bias', 'vision_tower.encoder.layer.14.attention.query.weight', 'vision_tower.encoder.layer.14.attention.value.bias', 'vision_tower.encoder.layer.14.attention.value.weight', 'vision_tower.encoder.layer.14.mlp.down_proj.bias', 'vision_tower.encoder.layer.14.mlp.down_proj.weight', 'vision_tower.encoder.layer.14.mlp.up_proj.bias', 'vision_tower.encoder.layer.14.mlp.up_proj.weight', 'vision_tower.encoder.layer.15.attention.key.bias', 'vision_tower.encoder.layer.15.attention.key.weight', 'vision_tower.encoder.layer.15.attention.output.bias', 'vision_tower.encoder.layer.15.attention.output.weight', 'vision_tower.encoder.layer.15.attention.query.bias', 'vision_tower.encoder.layer.15.attention.query.weight', 'vision_tower.encoder.layer.15.attention.value.bias', 'vision_tower.encoder.layer.15.attention.value.weight', 'vision_tower.encoder.layer.15.mlp.down_proj.bias', 'vision_tower.encoder.layer.15.mlp.down_proj.weight', 'vision_tower.encoder.layer.15.mlp.up_proj.bias', 'vision_tower.encoder.layer.15.mlp.up_proj.weight', 'vision_tower.encoder.layer.16.attention.key.bias', 'vision_tower.encoder.layer.16.attention.key.weight', 'vision_tower.encoder.layer.16.attention.output.bias', 'vision_tower.encoder.layer.16.attention.output.weight', 'vision_tower.encoder.layer.16.attention.query.bias', 'vision_tower.encoder.layer.16.attention.query.weight', 'vision_tower.encoder.layer.16.attention.value.bias', 'vision_tower.encoder.layer.16.attention.value.weight', 'vision_tower.encoder.layer.16.mlp.down_proj.bias', 'vision_tower.encoder.layer.16.mlp.down_proj.weight', 'vision_tower.encoder.layer.16.mlp.up_proj.bias', 'vision_tower.encoder.layer.16.mlp.up_proj.weight', 'vision_tower.encoder.layer.17.attention.key.bias', 'vision_tower.encoder.layer.17.attention.key.weight', 'vision_tower.encoder.layer.17.attention.output.bias', 'vision_tower.encoder.layer.17.attention.output.weight', 'vision_tower.encoder.layer.17.attention.query.bias', 'vision_tower.encoder.layer.17.attention.query.weight', 'vision_tower.encoder.layer.17.attention.value.bias', 'vision_tower.encoder.layer.17.attention.value.weight', 'vision_tower.encoder.layer.17.mlp.down_proj.bias', 'vision_tower.encoder.layer.17.mlp.down_proj.weight', 'vision_tower.encoder.layer.17.mlp.up_proj.bias', 'vision_tower.encoder.layer.17.mlp.up_proj.weight', 'vision_tower.encoder.layer.18.attention.key.bias', 'vision_tower.encoder.layer.18.attention.key.weight', 'vision_tower.encoder.layer.18.attention.output.bias', 'vision_tower.encoder.layer.18.attention.output.weight', 'vision_tower.encoder.layer.18.attention.query.bias', 'vision_tower.encoder.layer.18.attention.query.weight', 'vision_tower.encoder.layer.18.attention.value.bias', 'vision_tower.encoder.layer.18.attention.value.weight', 
'vision_tower.encoder.layer.18.mlp.down_proj.bias', 'vision_tower.encoder.layer.18.mlp.down_proj.weight', 'vision_tower.encoder.layer.18.mlp.up_proj.bias', 'vision_tower.encoder.layer.18.mlp.up_proj.weight', 'vision_tower.encoder.layer.19.attention.key.bias', 'vision_tower.encoder.layer.19.attention.key.weight', 'vision_tower.encoder.layer.19.attention.output.bias', 'vision_tower.encoder.layer.19.attention.output.weight', 'vision_tower.encoder.layer.19.attention.query.bias', 'vision_tower.encoder.layer.19.attention.query.weight', 'vision_tower.encoder.layer.19.attention.value.bias', 'vision_tower.encoder.layer.19.attention.value.weight', 'vision_tower.encoder.layer.19.mlp.down_proj.bias', 'vision_tower.encoder.layer.19.mlp.down_proj.weight', 'vision_tower.encoder.layer.19.mlp.up_proj.bias', 'vision_tower.encoder.layer.19.mlp.up_proj.weight', 'vision_tower.encoder.layer.2.attention.key.bias', 'vision_tower.encoder.layer.2.attention.key.weight', 'vision_tower.encoder.layer.2.attention.output.bias', 'vision_tower.encoder.layer.2.attention.output.weight', 'vision_tower.encoder.layer.2.attention.query.bias', 'vision_tower.encoder.layer.2.attention.query.weight', 'vision_tower.encoder.layer.2.attention.value.bias', 'vision_tower.encoder.layer.2.attention.value.weight', 'vision_tower.encoder.layer.2.mlp.down_proj.bias', 'vision_tower.encoder.layer.2.mlp.down_proj.weight', 'vision_tower.encoder.layer.2.mlp.up_proj.bias', 'vision_tower.encoder.layer.2.mlp.up_proj.weight', 'vision_tower.encoder.layer.20.attention.key.bias', 'vision_tower.encoder.layer.20.attention.key.weight', 'vision_tower.encoder.layer.20.attention.output.bias', 'vision_tower.encoder.layer.20.attention.output.weight', 'vision_tower.encoder.layer.20.attention.query.bias', 'vision_tower.encoder.layer.20.attention.query.weight', 'vision_tower.encoder.layer.20.attention.value.bias', 'vision_tower.encoder.layer.20.attention.value.weight', 'vision_tower.encoder.layer.20.mlp.down_proj.bias', 'vision_tower.encoder.layer.20.mlp.down_proj.weight', 'vision_tower.encoder.layer.20.mlp.up_proj.bias', 'vision_tower.encoder.layer.20.mlp.up_proj.weight', 'vision_tower.encoder.layer.21.attention.key.bias', 'vision_tower.encoder.layer.21.attention.key.weight', 'vision_tower.encoder.layer.21.attention.output.bias', 'vision_tower.encoder.layer.21.attention.output.weight', 'vision_tower.encoder.layer.21.attention.query.bias', 'vision_tower.encoder.layer.21.attention.query.weight', 'vision_tower.encoder.layer.21.attention.value.bias', 'vision_tower.encoder.layer.21.attention.value.weight', 'vision_tower.encoder.layer.21.mlp.down_proj.bias', 'vision_tower.encoder.layer.21.mlp.down_proj.weight', 'vision_tower.encoder.layer.21.mlp.up_proj.bias', 'vision_tower.encoder.layer.21.mlp.up_proj.weight', 'vision_tower.encoder.layer.22.attention.key.bias', 'vision_tower.encoder.layer.22.attention.key.weight', 'vision_tower.encoder.layer.22.attention.output.bias', 'vision_tower.encoder.layer.22.attention.output.weight', 'vision_tower.encoder.layer.22.attention.query.bias', 'vision_tower.encoder.layer.22.attention.query.weight', 'vision_tower.encoder.layer.22.attention.value.bias', 'vision_tower.encoder.layer.22.attention.value.weight', 'vision_tower.encoder.layer.22.mlp.down_proj.bias', 'vision_tower.encoder.layer.22.mlp.down_proj.weight', 'vision_tower.encoder.layer.22.mlp.up_proj.bias', 'vision_tower.encoder.layer.22.mlp.up_proj.weight', 'vision_tower.encoder.layer.23.attention.key.bias', 'vision_tower.encoder.layer.23.attention.key.weight', 
'vision_tower.encoder.layer.23.attention.output.bias', 'vision_tower.encoder.layer.23.attention.output.weight', 'vision_tower.encoder.layer.23.attention.query.bias', 'vision_tower.encoder.layer.23.attention.query.weight', 'vision_tower.encoder.layer.23.attention.value.bias', 'vision_tower.encoder.layer.23.attention.value.weight', 'vision_tower.encoder.layer.23.mlp.down_proj.bias', 'vision_tower.encoder.layer.23.mlp.down_proj.weight', 'vision_tower.encoder.layer.23.mlp.up_proj.bias', 'vision_tower.encoder.layer.23.mlp.up_proj.weight', 'vision_tower.encoder.layer.3.attention.key.bias', 'vision_tower.encoder.layer.3.attention.key.weight', 'vision_tower.encoder.layer.3.attention.output.bias', 'vision_tower.encoder.layer.3.attention.output.weight', 'vision_tower.encoder.layer.3.attention.query.bias', 'vision_tower.encoder.layer.3.attention.query.weight', 'vision_tower.encoder.layer.3.attention.value.bias', 'vision_tower.encoder.layer.3.attention.value.weight', 'vision_tower.encoder.layer.3.mlp.down_proj.bias', 'vision_tower.encoder.layer.3.mlp.down_proj.weight', 'vision_tower.encoder.layer.3.mlp.up_proj.bias', 'vision_tower.encoder.layer.3.mlp.up_proj.weight', 'vision_tower.encoder.layer.4.attention.key.bias', 'vision_tower.encoder.layer.4.attention.key.weight', 'vision_tower.encoder.layer.4.attention.output.bias', 'vision_tower.encoder.layer.4.attention.output.weight', 'vision_tower.encoder.layer.4.attention.query.bias', 'vision_tower.encoder.layer.4.attention.query.weight', 'vision_tower.encoder.layer.4.attention.value.bias', 'vision_tower.encoder.layer.4.attention.value.weight', 'vision_tower.encoder.layer.4.mlp.down_proj.bias', 'vision_tower.encoder.layer.4.mlp.down_proj.weight', 'vision_tower.encoder.layer.4.mlp.up_proj.bias', 'vision_tower.encoder.layer.4.mlp.up_proj.weight', 'vision_tower.encoder.layer.5.attention.key.bias', 'vision_tower.encoder.layer.5.attention.key.weight', 'vision_tower.encoder.layer.5.attention.output.bias', 'vision_tower.encoder.layer.5.attention.output.weight', 'vision_tower.encoder.layer.5.attention.query.bias', 'vision_tower.encoder.layer.5.attention.query.weight', 'vision_tower.encoder.layer.5.attention.value.bias', 'vision_tower.encoder.layer.5.attention.value.weight', 'vision_tower.encoder.layer.5.mlp.down_proj.bias', 'vision_tower.encoder.layer.5.mlp.down_proj.weight', 'vision_tower.encoder.layer.5.mlp.up_proj.bias', 'vision_tower.encoder.layer.5.mlp.up_proj.weight', 'vision_tower.encoder.layer.6.attention.key.bias', 'vision_tower.encoder.layer.6.attention.key.weight', 'vision_tower.encoder.layer.6.attention.output.bias', 'vision_tower.encoder.layer.6.attention.output.weight', 'vision_tower.encoder.layer.6.attention.query.bias', 'vision_tower.encoder.layer.6.attention.query.weight', 'vision_tower.encoder.layer.6.attention.value.bias', 'vision_tower.encoder.layer.6.attention.value.weight', 'vision_tower.encoder.layer.6.mlp.down_proj.bias', 'vision_tower.encoder.layer.6.mlp.down_proj.weight', 'vision_tower.encoder.layer.6.mlp.up_proj.bias', 'vision_tower.encoder.layer.6.mlp.up_proj.weight', 'vision_tower.encoder.layer.7.attention.key.bias', 'vision_tower.encoder.layer.7.attention.key.weight', 'vision_tower.encoder.layer.7.attention.output.bias', 'vision_tower.encoder.layer.7.attention.output.weight', 'vision_tower.encoder.layer.7.attention.query.bias', 'vision_tower.encoder.layer.7.attention.query.weight', 'vision_tower.encoder.layer.7.attention.value.bias', 'vision_tower.encoder.layer.7.attention.value.weight', 
'vision_tower.encoder.layer.7.mlp.down_proj.bias', 'vision_tower.encoder.layer.7.mlp.down_proj.weight', 'vision_tower.encoder.layer.7.mlp.up_proj.bias', 'vision_tower.encoder.layer.7.mlp.up_proj.weight', 'vision_tower.encoder.layer.8.attention.key.bias', 'vision_tower.encoder.layer.8.attention.key.weight', 'vision_tower.encoder.layer.8.attention.output.bias', 'vision_tower.encoder.layer.8.attention.output.weight', 'vision_tower.encoder.layer.8.attention.query.bias', 'vision_tower.encoder.layer.8.attention.query.weight', 'vision_tower.encoder.layer.8.attention.value.bias', 'vision_tower.encoder.layer.8.attention.value.weight', 'vision_tower.encoder.layer.8.mlp.down_proj.bias', 'vision_tower.encoder.layer.8.mlp.down_proj.weight', 'vision_tower.encoder.layer.8.mlp.up_proj.bias', 'vision_tower.encoder.layer.8.mlp.up_proj.weight', 'vision_tower.encoder.layer.9.attention.key.bias', 'vision_tower.encoder.layer.9.attention.key.weight', 'vision_tower.encoder.layer.9.attention.output.bias', 'vision_tower.encoder.layer.9.attention.output.weight', 'vision_tower.encoder.layer.9.attention.query.bias', 'vision_tower.encoder.layer.9.attention.query.weight', 'vision_tower.encoder.layer.9.attention.value.bias', 'vision_tower.encoder.layer.9.attention.value.weight', 'vision_tower.encoder.layer.9.mlp.down_proj.bias', 'vision_tower.encoder.layer.9.mlp.down_proj.weight', 'vision_tower.encoder.layer.9.mlp.up_proj.bias', 'vision_tower.encoder.layer.9.mlp.up_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

It seems that none of the vision parts are initialized?

Could you check which code path is hit in your environment?

def write_tokenizer(save_dir: str, push_to_hub: bool = False, path: str = None, hub_dir: str = None):
    if LM_TYPE_CORRESPONDENCE[path] == "qwen2":
        tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", return_token_type_ids=False)
        tokenizer.model_max_length = CONTEXT_LENGTH

For InternVL2_5, the code shouldn't reach that branch.

Actually, I am using a custom-pretrained OpenGVLab/InternVL2_5-8B model. In my opinion it should be identical to OpenGVLab/InternVL2_5-8B-MPO except for the parameter values.

According to your project:

LM_TYPE_CORRESPONDENCE = {
    "OpenGVLab/InternVL2_5-1B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-2B-MPO": "llama",
    "OpenGVLab/InternVL2_5-4B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-8B-MPO": "llama",
    "OpenGVLab/InternVL2_5-26B-MPO": "llama",
    "OpenGVLab/InternVL2_5-38B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-78B-MPO": "qwen2",
}

I suppose I should use a path that maps to "llama"?
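Concretely, that would mean giving the custom checkpoint its own entry that maps to "llama" (a minimal sketch; the local path below is hypothetical):

# Sketch only: a custom InternVL2_5-8B checkpoint is LLaMA-based like the official
# 8B-MPO entry, so the converter needs a mapping from its path to "llama".
LM_TYPE_CORRESPONDENCE = {
    "OpenGVLab/InternVL2_5-8B-MPO": "llama",
    "/path/to/custom-InternVL2_5-8B": "llama",  # hypothetical local path for the custom weights
}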

A quick check: it seems that your transformers version is not the latest. Try updating it.

How do I convert an InternVL2_5-1B-MPO checkpoint to HF if I have a custom-pretrained InternVL model? Please help!
cvt_internvl_weights_to_hf.py seems to require an intern_vl_hf_implem/tokenizer_internvl_llama_fast.

In my case, just replace that code with

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", return_token_type_ids=False, trust_remote_code=True)

It replaces the tokenizer with InternLM2's.
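Putting it together, a sketch of how the patched write_tokenizer could look (the save/push logic and the CONTEXT_LENGTH value are assumptions modeled on the conversion script, not copied from it):

from transformers import AutoTokenizer

CONTEXT_LENGTH = 8192  # assumed; use the value defined in the conversion script

def write_tokenizer(save_dir: str, push_to_hub: bool = False, path: str = None, hub_dir: str = None):
    # Workaround for LLaMA-based InternVL2_5 checkpoints: load InternLM2's tokenizer
    # from the Hub instead of the local intern_vl_hf_implem/tokenizer_internvl_llama_fast files.
    tokenizer = AutoTokenizer.from_pretrained(
        "internlm/internlm2-chat-7b", return_token_type_ids=False, trust_remote_code=True
    )
    tokenizer.model_max_length = CONTEXT_LENGTH
    if push_to_hub:
        tokenizer.push_to_hub(hub_dir)
    else:
        tokenizer.save_pretrained(save_dir)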

Thanks a lot for your timely response!!

When I use model = InternVLForConditionalGeneration.from_pretrained("InternVL2_5-8B-MPO-hf"), an error is raised:

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [00:23<00:00,  5.83s/it]
Some weights of the model checkpoint at InternVL2_5-8B-MPO-hf were not used when initializing InternVLForConditionalGeneration: ['vision_tower.encoder.layer.0.attention.attention.key.bias', 'vision_tower.encoder.layer.0.attention.attention.key.weight', 'vision_tower.encoder.layer.0.attention.attention.query.bias', 'vision_tower.encoder.layer.0.attention.attention.query.weight', 'vision_tower.encoder.layer.0.attention.attention.value.bias', 'vision_tower.encoder.layer.0.attention.attention.value.weight', 'vision_tower.encoder.layer.0.attention.output.dense.bias', 'vision_tower.encoder.layer.0.attention.output.dense.weight', 'vision_tower.encoder.layer.0.intermediate.dense.bias', 'vision_tower.encoder.layer.0.intermediate.dense.weight', 'vision_tower.encoder.layer.0.output.dense.bias', 'vision_tower.encoder.layer.0.output.dense.weight', 'vision_tower.encoder.layer.1.attention.attention.key.bias', 'vision_tower.encoder.layer.1.attention.attention.key.weight', 'vision_tower.encoder.layer.1.attention.attention.query.bias', 'vision_tower.encoder.layer.1.attention.attention.query.weight', 'vision_tower.encoder.layer.1.attention.attention.value.bias', 'vision_tower.encoder.layer.1.attention.attention.value.weight', 'vision_tower.encoder.layer.1.attention.output.dense.bias', 'vision_tower.encoder.layer.1.attention.output.dense.weight', 'vision_tower.encoder.layer.1.intermediate.dense.bias', 'vision_tower.encoder.layer.1.intermediate.dense.weight', 'vision_tower.encoder.layer.1.output.dense.bias', 'vision_tower.encoder.layer.1.output.dense.weight', 'vision_tower.encoder.layer.10.attention.attention.key.bias', 'vision_tower.encoder.layer.10.attention.attention.key.weight', 'vision_tower.encoder.layer.10.attention.attention.query.bias', 'vision_tower.encoder.layer.10.attention.attention.query.weight', 'vision_tower.encoder.layer.10.attention.attention.value.bias', 'vision_tower.encoder.layer.10.attention.attention.value.weight', 'vision_tower.encoder.layer.10.attention.output.dense.bias', 'vision_tower.encoder.layer.10.attention.output.dense.weight', 'vision_tower.encoder.layer.10.intermediate.dense.bias', 'vision_tower.encoder.layer.10.intermediate.dense.weight', 'vision_tower.encoder.layer.10.output.dense.bias', 'vision_tower.encoder.layer.10.output.dense.weight', 'vision_tower.encoder.layer.11.attention.attention.key.bias', 'vision_tower.encoder.layer.11.attention.attention.key.weight', 'vision_tower.encoder.layer.11.attention.attention.query.bias', 'vision_tower.encoder.layer.11.attention.attention.query.weight', 'vision_tower.encoder.layer.11.attention.attention.value.bias', 'vision_tower.encoder.layer.11.attention.attention.value.weight', 'vision_tower.encoder.layer.11.attention.output.dense.bias', 'vision_tower.encoder.layer.11.attention.output.dense.weight', 'vision_tower.encoder.layer.11.intermediate.dense.bias', 'vision_tower.encoder.layer.11.intermediate.dense.weight', 'vision_tower.encoder.layer.11.output.dense.bias', 'vision_tower.encoder.layer.11.output.dense.weight', 'vision_tower.encoder.layer.12.attention.attention.key.bias', 'vision_tower.encoder.layer.12.attention.attention.key.weight', 'vision_tower.encoder.layer.12.attention.attention.query.bias', 'vision_tower.encoder.layer.12.attention.attention.query.weight', 'vision_tower.encoder.layer.12.attention.attention.value.bias', 'vision_tower.encoder.layer.12.attention.attention.value.weight', 'vision_tower.encoder.layer.12.attention.output.dense.bias', 'vision_tower.encoder.layer.12.attention.output.dense.weight', 
'vision_tower.encoder.layer.12.intermediate.dense.bias', 'vision_tower.encoder.layer.12.intermediate.dense.weight', 'vision_tower.encoder.layer.12.output.dense.bias', 'vision_tower.encoder.layer.12.output.dense.weight', 'vision_tower.encoder.layer.13.attention.attention.key.bias', 'vision_tower.encoder.layer.13.attention.attention.key.weight', 'vision_tower.encoder.layer.13.attention.attention.query.bias', 'vision_tower.encoder.layer.13.attention.attention.query.weight', 'vision_tower.encoder.layer.13.attention.attention.value.bias', 'vision_tower.encoder.layer.13.attention.attention.value.weight', 'vision_tower.encoder.layer.13.attention.output.dense.bias', 'vision_tower.encoder.layer.13.attention.output.dense.weight', 'vision_tower.encoder.layer.13.intermediate.dense.bias', 'vision_tower.encoder.layer.13.intermediate.dense.weight', 'vision_tower.encoder.layer.13.output.dense.bias', 'vision_tower.encoder.layer.13.output.dense.weight', 'vision_tower.encoder.layer.14.attention.attention.key.bias', 'vision_tower.encoder.layer.14.attention.attention.key.weight', 'vision_tower.encoder.layer.14.attention.attention.query.bias', 'vision_tower.encoder.layer.14.attention.attention.query.weight', 'vision_tower.encoder.layer.14.attention.attention.value.bias', 'vision_tower.encoder.layer.14.attention.attention.value.weight', 'vision_tower.encoder.layer.14.attention.output.dense.bias', 'vision_tower.encoder.layer.14.attention.output.dense.weight', 'vision_tower.encoder.layer.14.intermediate.dense.bias', 'vision_tower.encoder.layer.14.intermediate.dense.weight', 'vision_tower.encoder.layer.14.output.dense.bias', 'vision_tower.encoder.layer.14.output.dense.weight', 'vision_tower.encoder.layer.15.attention.attention.key.bias', 'vision_tower.encoder.layer.15.attention.attention.key.weight', 'vision_tower.encoder.layer.15.attention.attention.query.bias', 'vision_tower.encoder.layer.15.attention.attention.query.weight', 'vision_tower.encoder.layer.15.attention.attention.value.bias', 'vision_tower.encoder.layer.15.attention.attention.value.weight', 'vision_tower.encoder.layer.15.attention.output.dense.bias', 'vision_tower.encoder.layer.15.attention.output.dense.weight', 'vision_tower.encoder.layer.15.intermediate.dense.bias', 'vision_tower.encoder.layer.15.intermediate.dense.weight', 'vision_tower.encoder.layer.15.output.dense.bias', 'vision_tower.encoder.layer.15.output.dense.weight', 'vision_tower.encoder.layer.16.attention.attention.key.bias', 'vision_tower.encoder.layer.16.attention.attention.key.weight', 'vision_tower.encoder.layer.16.attention.attention.query.bias', 'vision_tower.encoder.layer.16.attention.attention.query.weight', 'vision_tower.encoder.layer.16.attention.attention.value.bias', 'vision_tower.encoder.layer.16.attention.attention.value.weight', 'vision_tower.encoder.layer.16.attention.output.dense.bias', 'vision_tower.encoder.layer.16.attention.output.dense.weight', 'vision_tower.encoder.layer.16.intermediate.dense.bias', 'vision_tower.encoder.layer.16.intermediate.dense.weight', 'vision_tower.encoder.layer.16.output.dense.bias', 'vision_tower.encoder.layer.16.output.dense.weight', 'vision_tower.encoder.layer.17.attention.attention.key.bias', 'vision_tower.encoder.layer.17.attention.attention.key.weight', 'vision_tower.encoder.layer.17.attention.attention.query.bias', 'vision_tower.encoder.layer.17.attention.attention.query.weight', 'vision_tower.encoder.layer.17.attention.attention.value.bias', 'vision_tower.encoder.layer.17.attention.attention.value.weight', 
'vision_tower.encoder.layer.17.attention.output.dense.bias', 'vision_tower.encoder.layer.17.attention.output.dense.weight', 'vision_tower.encoder.layer.17.intermediate.dense.bias', 'vision_tower.encoder.layer.17.intermediate.dense.weight', 'vision_tower.encoder.layer.17.output.dense.bias', 'vision_tower.encoder.layer.17.output.dense.weight', 'vision_tower.encoder.layer.18.attention.attention.key.bias', 'vision_tower.encoder.layer.18.attention.attention.key.weight', 'vision_tower.encoder.layer.18.attention.attention.query.bias', 'vision_tower.encoder.layer.18.attention.attention.query.weight', 'vision_tower.encoder.layer.18.attention.attention.value.bias', 'vision_tower.encoder.layer.18.attention.attention.value.weight', 'vision_tower.encoder.layer.18.attention.output.dense.bias', 'vision_tower.encoder.layer.18.attention.output.dense.weight', 'vision_tower.encoder.layer.18.intermediate.dense.bias', 'vision_tower.encoder.layer.18.intermediate.dense.weight', 'vision_tower.encoder.layer.18.output.dense.bias', 'vision_tower.encoder.layer.18.output.dense.weight', 'vision_tower.encoder.layer.19.attention.attention.key.bias', 'vision_tower.encoder.layer.19.attention.attention.key.weight', 'vision_tower.encoder.layer.19.attention.attention.query.bias', 'vision_tower.encoder.layer.19.attention.attention.query.weight', 'vision_tower.encoder.layer.19.attention.attention.value.bias', 'vision_tower.encoder.layer.19.attention.attention.value.weight', 'vision_tower.encoder.layer.19.attention.output.dense.bias', 'vision_tower.encoder.layer.19.attention.output.dense.weight', 'vision_tower.encoder.layer.19.intermediate.dense.bias', 'vision_tower.encoder.layer.19.intermediate.dense.weight', 'vision_tower.encoder.layer.19.output.dense.bias', 'vision_tower.encoder.layer.19.output.dense.weight', 'vision_tower.encoder.layer.2.attention.attention.key.bias', 'vision_tower.encoder.layer.2.attention.attention.key.weight', 'vision_tower.encoder.layer.2.attention.attention.query.bias', 'vision_tower.encoder.layer.2.attention.attention.query.weight', 'vision_tower.encoder.layer.2.attention.attention.value.bias', 'vision_tower.encoder.layer.2.attention.attention.value.weight', 'vision_tower.encoder.layer.2.attention.output.dense.bias', 'vision_tower.encoder.layer.2.attention.output.dense.weight', 'vision_tower.encoder.layer.2.intermediate.dense.bias', 'vision_tower.encoder.layer.2.intermediate.dense.weight', 'vision_tower.encoder.layer.2.output.dense.bias', 'vision_tower.encoder.layer.2.output.dense.weight', 'vision_tower.encoder.layer.20.attention.attention.key.bias', 'vision_tower.encoder.layer.20.attention.attention.key.weight', 'vision_tower.encoder.layer.20.attention.attention.query.bias', 'vision_tower.encoder.layer.20.attention.attention.query.weight', 'vision_tower.encoder.layer.20.attention.attention.value.bias', 'vision_tower.encoder.layer.20.attention.attention.value.weight', 'vision_tower.encoder.layer.20.attention.output.dense.bias', 'vision_tower.encoder.layer.20.attention.output.dense.weight', 'vision_tower.encoder.layer.20.intermediate.dense.bias', 'vision_tower.encoder.layer.20.intermediate.dense.weight', 'vision_tower.encoder.layer.20.output.dense.bias', 'vision_tower.encoder.layer.20.output.dense.weight', 'vision_tower.encoder.layer.21.attention.attention.key.bias', 'vision_tower.encoder.layer.21.attention.attention.key.weight', 'vision_tower.encoder.layer.21.attention.attention.query.bias', 'vision_tower.encoder.layer.21.attention.attention.query.weight', 
'vision_tower.encoder.layer.21.attention.attention.value.bias', 'vision_tower.encoder.layer.21.attention.attention.value.weight', 'vision_tower.encoder.layer.21.attention.output.dense.bias', 'vision_tower.encoder.layer.21.attention.output.dense.weight', 'vision_tower.encoder.layer.21.intermediate.dense.bias', 'vision_tower.encoder.layer.21.intermediate.dense.weight', 'vision_tower.encoder.layer.21.output.dense.bias', 'vision_tower.encoder.layer.21.output.dense.weight', 'vision_tower.encoder.layer.22.attention.attention.key.bias', 'vision_tower.encoder.layer.22.attention.attention.key.weight', 'vision_tower.encoder.layer.22.attention.attention.query.bias', 'vision_tower.encoder.layer.22.attention.attention.query.weight', 'vision_tower.encoder.layer.22.attention.attention.value.bias', 'vision_tower.encoder.layer.22.attention.attention.value.weight', 'vision_tower.encoder.layer.22.attention.output.dense.bias', 'vision_tower.encoder.layer.22.attention.output.dense.weight', 'vision_tower.encoder.layer.22.intermediate.dense.bias', 'vision_tower.encoder.layer.22.intermediate.dense.weight', 'vision_tower.encoder.layer.22.output.dense.bias', 'vision_tower.encoder.layer.22.output.dense.weight', 'vision_tower.encoder.layer.23.attention.attention.key.bias', 'vision_tower.encoder.layer.23.attention.attention.key.weight', 'vision_tower.encoder.layer.23.attention.attention.query.bias', 'vision_tower.encoder.layer.23.attention.attention.query.weight', 'vision_tower.encoder.layer.23.attention.attention.value.bias', 'vision_tower.encoder.layer.23.attention.attention.value.weight', 'vision_tower.encoder.layer.23.attention.output.dense.bias', 'vision_tower.encoder.layer.23.attention.output.dense.weight', 'vision_tower.encoder.layer.23.intermediate.dense.bias', 'vision_tower.encoder.layer.23.intermediate.dense.weight', 'vision_tower.encoder.layer.23.output.dense.bias', 'vision_tower.encoder.layer.23.output.dense.weight', 'vision_tower.encoder.layer.3.attention.attention.key.bias', 'vision_tower.encoder.layer.3.attention.attention.key.weight', 'vision_tower.encoder.layer.3.attention.attention.query.bias', 'vision_tower.encoder.layer.3.attention.attention.query.weight', 'vision_tower.encoder.layer.3.attention.attention.value.bias', 'vision_tower.encoder.layer.3.attention.attention.value.weight', 'vision_tower.encoder.layer.3.attention.output.dense.bias', 'vision_tower.encoder.layer.3.attention.output.dense.weight', 'vision_tower.encoder.layer.3.intermediate.dense.bias', 'vision_tower.encoder.layer.3.intermediate.dense.weight', 'vision_tower.encoder.layer.3.output.dense.bias', 'vision_tower.encoder.layer.3.output.dense.weight', 'vision_tower.encoder.layer.4.attention.attention.key.bias', 'vision_tower.encoder.layer.4.attention.attention.key.weight', 'vision_tower.encoder.layer.4.attention.attention.query.bias', 'vision_tower.encoder.layer.4.attention.attention.query.weight', 'vision_tower.encoder.layer.4.attention.attention.value.bias', 'vision_tower.encoder.layer.4.attention.attention.value.weight', 'vision_tower.encoder.layer.4.attention.output.dense.bias', 'vision_tower.encoder.layer.4.attention.output.dense.weight', 'vision_tower.encoder.layer.4.intermediate.dense.bias', 'vision_tower.encoder.layer.4.intermediate.dense.weight', 'vision_tower.encoder.layer.4.output.dense.bias', 'vision_tower.encoder.layer.4.output.dense.weight', 'vision_tower.encoder.layer.5.attention.attention.key.bias', 'vision_tower.encoder.layer.5.attention.attention.key.weight', 
'vision_tower.encoder.layer.5.attention.attention.query.bias', 'vision_tower.encoder.layer.5.attention.attention.query.weight', 'vision_tower.encoder.layer.5.attention.attention.value.bias', 'vision_tower.encoder.layer.5.attention.attention.value.weight', 'vision_tower.encoder.layer.5.attention.output.dense.bias', 'vision_tower.encoder.layer.5.attention.output.dense.weight', 'vision_tower.encoder.layer.5.intermediate.dense.bias', 'vision_tower.encoder.layer.5.intermediate.dense.weight', 'vision_tower.encoder.layer.5.output.dense.bias', 'vision_tower.encoder.layer.5.output.dense.weight', 'vision_tower.encoder.layer.6.attention.attention.key.bias', 'vision_tower.encoder.layer.6.attention.attention.key.weight', 'vision_tower.encoder.layer.6.attention.attention.query.bias', 'vision_tower.encoder.layer.6.attention.attention.query.weight', 'vision_tower.encoder.layer.6.attention.attention.value.bias', 'vision_tower.encoder.layer.6.attention.attention.value.weight', 'vision_tower.encoder.layer.6.attention.output.dense.bias', 'vision_tower.encoder.layer.6.attention.output.dense.weight', 'vision_tower.encoder.layer.6.intermediate.dense.bias', 'vision_tower.encoder.layer.6.intermediate.dense.weight', 'vision_tower.encoder.layer.6.output.dense.bias', 'vision_tower.encoder.layer.6.output.dense.weight', 'vision_tower.encoder.layer.7.attention.attention.key.bias', 'vision_tower.encoder.layer.7.attention.attention.key.weight', 'vision_tower.encoder.layer.7.attention.attention.query.bias', 'vision_tower.encoder.layer.7.attention.attention.query.weight', 'vision_tower.encoder.layer.7.attention.attention.value.bias', 'vision_tower.encoder.layer.7.attention.attention.value.weight', 'vision_tower.encoder.layer.7.attention.output.dense.bias', 'vision_tower.encoder.layer.7.attention.output.dense.weight', 'vision_tower.encoder.layer.7.intermediate.dense.bias', 'vision_tower.encoder.layer.7.intermediate.dense.weight', 'vision_tower.encoder.layer.7.output.dense.bias', 'vision_tower.encoder.layer.7.output.dense.weight', 'vision_tower.encoder.layer.8.attention.attention.key.bias', 'vision_tower.encoder.layer.8.attention.attention.key.weight', 'vision_tower.encoder.layer.8.attention.attention.query.bias', 'vision_tower.encoder.layer.8.attention.attention.query.weight', 'vision_tower.encoder.layer.8.attention.attention.value.bias', 'vision_tower.encoder.layer.8.attention.attention.value.weight', 'vision_tower.encoder.layer.8.attention.output.dense.bias', 'vision_tower.encoder.layer.8.attention.output.dense.weight', 'vision_tower.encoder.layer.8.intermediate.dense.bias', 'vision_tower.encoder.layer.8.intermediate.dense.weight', 'vision_tower.encoder.layer.8.output.dense.bias', 'vision_tower.encoder.layer.8.output.dense.weight', 'vision_tower.encoder.layer.9.attention.attention.key.bias', 'vision_tower.encoder.layer.9.attention.attention.key.weight', 'vision_tower.encoder.layer.9.attention.attention.query.bias', 'vision_tower.encoder.layer.9.attention.attention.query.weight', 'vision_tower.encoder.layer.9.attention.attention.value.bias', 'vision_tower.encoder.layer.9.attention.attention.value.weight', 'vision_tower.encoder.layer.9.attention.output.dense.bias', 'vision_tower.encoder.layer.9.attention.output.dense.weight', 'vision_tower.encoder.layer.9.intermediate.dense.bias', 'vision_tower.encoder.layer.9.intermediate.dense.weight', 'vision_tower.encoder.layer.9.output.dense.bias', 'vision_tower.encoder.layer.9.output.dense.weight']
- This IS expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of InternVLForConditionalGeneration were not initialized from the model checkpoint at InternVL2_5-8B-MPO-hf and are newly initialized: ['vision_tower.encoder.layer.0.attention.key.bias', 'vision_tower.encoder.layer.0.attention.key.weight', 'vision_tower.encoder.layer.0.attention.output.bias', 'vision_tower.encoder.layer.0.attention.output.weight', 'vision_tower.encoder.layer.0.attention.query.bias', 'vision_tower.encoder.layer.0.attention.query.weight', 'vision_tower.encoder.layer.0.attention.value.bias', 'vision_tower.encoder.layer.0.attention.value.weight', 'vision_tower.encoder.layer.0.mlp.down_proj.bias', 'vision_tower.encoder.layer.0.mlp.down_proj.weight', 'vision_tower.encoder.layer.0.mlp.up_proj.bias', 'vision_tower.encoder.layer.0.mlp.up_proj.weight', 'vision_tower.encoder.layer.1.attention.key.bias', 'vision_tower.encoder.layer.1.attention.key.weight', 'vision_tower.encoder.layer.1.attention.output.bias', 'vision_tower.encoder.layer.1.attention.output.weight', 'vision_tower.encoder.layer.1.attention.query.bias', 'vision_tower.encoder.layer.1.attention.query.weight', 'vision_tower.encoder.layer.1.attention.value.bias', 'vision_tower.encoder.layer.1.attention.value.weight', 'vision_tower.encoder.layer.1.mlp.down_proj.bias', 'vision_tower.encoder.layer.1.mlp.down_proj.weight', 'vision_tower.encoder.layer.1.mlp.up_proj.bias', 'vision_tower.encoder.layer.1.mlp.up_proj.weight', 'vision_tower.encoder.layer.10.attention.key.bias', 'vision_tower.encoder.layer.10.attention.key.weight', 'vision_tower.encoder.layer.10.attention.output.bias', 'vision_tower.encoder.layer.10.attention.output.weight', 'vision_tower.encoder.layer.10.attention.query.bias', 'vision_tower.encoder.layer.10.attention.query.weight', 'vision_tower.encoder.layer.10.attention.value.bias', 'vision_tower.encoder.layer.10.attention.value.weight', 'vision_tower.encoder.layer.10.mlp.down_proj.bias', 'vision_tower.encoder.layer.10.mlp.down_proj.weight', 'vision_tower.encoder.layer.10.mlp.up_proj.bias', 'vision_tower.encoder.layer.10.mlp.up_proj.weight', 'vision_tower.encoder.layer.11.attention.key.bias', 'vision_tower.encoder.layer.11.attention.key.weight', 'vision_tower.encoder.layer.11.attention.output.bias', 'vision_tower.encoder.layer.11.attention.output.weight', 'vision_tower.encoder.layer.11.attention.query.bias', 'vision_tower.encoder.layer.11.attention.query.weight', 'vision_tower.encoder.layer.11.attention.value.bias', 'vision_tower.encoder.layer.11.attention.value.weight', 'vision_tower.encoder.layer.11.mlp.down_proj.bias', 'vision_tower.encoder.layer.11.mlp.down_proj.weight', 'vision_tower.encoder.layer.11.mlp.up_proj.bias', 'vision_tower.encoder.layer.11.mlp.up_proj.weight', 'vision_tower.encoder.layer.12.attention.key.bias', 'vision_tower.encoder.layer.12.attention.key.weight', 'vision_tower.encoder.layer.12.attention.output.bias', 'vision_tower.encoder.layer.12.attention.output.weight', 'vision_tower.encoder.layer.12.attention.query.bias', 'vision_tower.encoder.layer.12.attention.query.weight', 'vision_tower.encoder.layer.12.attention.value.bias', 'vision_tower.encoder.layer.12.attention.value.weight', 'vision_tower.encoder.layer.12.mlp.down_proj.bias', 'vision_tower.encoder.layer.12.mlp.down_proj.weight', 'vision_tower.encoder.layer.12.mlp.up_proj.bias', 'vision_tower.encoder.layer.12.mlp.up_proj.weight', 'vision_tower.encoder.layer.13.attention.key.bias', 'vision_tower.encoder.layer.13.attention.key.weight', 'vision_tower.encoder.layer.13.attention.output.bias', 
'vision_tower.encoder.layer.13.attention.output.weight', 'vision_tower.encoder.layer.13.attention.query.bias', 'vision_tower.encoder.layer.13.attention.query.weight', 'vision_tower.encoder.layer.13.attention.value.bias', 'vision_tower.encoder.layer.13.attention.value.weight', 'vision_tower.encoder.layer.13.mlp.down_proj.bias', 'vision_tower.encoder.layer.13.mlp.down_proj.weight', 'vision_tower.encoder.layer.13.mlp.up_proj.bias', 'vision_tower.encoder.layer.13.mlp.up_proj.weight', 'vision_tower.encoder.layer.14.attention.key.bias', 'vision_tower.encoder.layer.14.attention.key.weight', 'vision_tower.encoder.layer.14.attention.output.bias', 'vision_tower.encoder.layer.14.attention.output.weight', 'vision_tower.encoder.layer.14.attention.query.bias', 'vision_tower.encoder.layer.14.attention.query.weight', 'vision_tower.encoder.layer.14.attention.value.bias', 'vision_tower.encoder.layer.14.attention.value.weight', 'vision_tower.encoder.layer.14.mlp.down_proj.bias', 'vision_tower.encoder.layer.14.mlp.down_proj.weight', 'vision_tower.encoder.layer.14.mlp.up_proj.bias', 'vision_tower.encoder.layer.14.mlp.up_proj.weight', 'vision_tower.encoder.layer.15.attention.key.bias', 'vision_tower.encoder.layer.15.attention.key.weight', 'vision_tower.encoder.layer.15.attention.output.bias', 'vision_tower.encoder.layer.15.attention.output.weight', 'vision_tower.encoder.layer.15.attention.query.bias', 'vision_tower.encoder.layer.15.attention.query.weight', 'vision_tower.encoder.layer.15.attention.value.bias', 'vision_tower.encoder.layer.15.attention.value.weight', 'vision_tower.encoder.layer.15.mlp.down_proj.bias', 'vision_tower.encoder.layer.15.mlp.down_proj.weight', 'vision_tower.encoder.layer.15.mlp.up_proj.bias', 'vision_tower.encoder.layer.15.mlp.up_proj.weight', 'vision_tower.encoder.layer.16.attention.key.bias', 'vision_tower.encoder.layer.16.attention.key.weight', 'vision_tower.encoder.layer.16.attention.output.bias', 'vision_tower.encoder.layer.16.attention.output.weight', 'vision_tower.encoder.layer.16.attention.query.bias', 'vision_tower.encoder.layer.16.attention.query.weight', 'vision_tower.encoder.layer.16.attention.value.bias', 'vision_tower.encoder.layer.16.attention.value.weight', 'vision_tower.encoder.layer.16.mlp.down_proj.bias', 'vision_tower.encoder.layer.16.mlp.down_proj.weight', 'vision_tower.encoder.layer.16.mlp.up_proj.bias', 'vision_tower.encoder.layer.16.mlp.up_proj.weight', 'vision_tower.encoder.layer.17.attention.key.bias', 'vision_tower.encoder.layer.17.attention.key.weight', 'vision_tower.encoder.layer.17.attention.output.bias', 'vision_tower.encoder.layer.17.attention.output.weight', 'vision_tower.encoder.layer.17.attention.query.bias', 'vision_tower.encoder.layer.17.attention.query.weight', 'vision_tower.encoder.layer.17.attention.value.bias', 'vision_tower.encoder.layer.17.attention.value.weight', 'vision_tower.encoder.layer.17.mlp.down_proj.bias', 'vision_tower.encoder.layer.17.mlp.down_proj.weight', 'vision_tower.encoder.layer.17.mlp.up_proj.bias', 'vision_tower.encoder.layer.17.mlp.up_proj.weight', 'vision_tower.encoder.layer.18.attention.key.bias', 'vision_tower.encoder.layer.18.attention.key.weight', 'vision_tower.encoder.layer.18.attention.output.bias', 'vision_tower.encoder.layer.18.attention.output.weight', 'vision_tower.encoder.layer.18.attention.query.bias', 'vision_tower.encoder.layer.18.attention.query.weight', 'vision_tower.encoder.layer.18.attention.value.bias', 'vision_tower.encoder.layer.18.attention.value.weight', 
'vision_tower.encoder.layer.18.mlp.down_proj.bias', 'vision_tower.encoder.layer.18.mlp.down_proj.weight', 'vision_tower.encoder.layer.18.mlp.up_proj.bias', 'vision_tower.encoder.layer.18.mlp.up_proj.weight', 'vision_tower.encoder.layer.19.attention.key.bias', 'vision_tower.encoder.layer.19.attention.key.weight', 'vision_tower.encoder.layer.19.attention.output.bias', 'vision_tower.encoder.layer.19.attention.output.weight', 'vision_tower.encoder.layer.19.attention.query.bias', 'vision_tower.encoder.layer.19.attention.query.weight', 'vision_tower.encoder.layer.19.attention.value.bias', 'vision_tower.encoder.layer.19.attention.value.weight', 'vision_tower.encoder.layer.19.mlp.down_proj.bias', 'vision_tower.encoder.layer.19.mlp.down_proj.weight', 'vision_tower.encoder.layer.19.mlp.up_proj.bias', 'vision_tower.encoder.layer.19.mlp.up_proj.weight', 'vision_tower.encoder.layer.2.attention.key.bias', 'vision_tower.encoder.layer.2.attention.key.weight', 'vision_tower.encoder.layer.2.attention.output.bias', 'vision_tower.encoder.layer.2.attention.output.weight', 'vision_tower.encoder.layer.2.attention.query.bias', 'vision_tower.encoder.layer.2.attention.query.weight', 'vision_tower.encoder.layer.2.attention.value.bias', 'vision_tower.encoder.layer.2.attention.value.weight', 'vision_tower.encoder.layer.2.mlp.down_proj.bias', 'vision_tower.encoder.layer.2.mlp.down_proj.weight', 'vision_tower.encoder.layer.2.mlp.up_proj.bias', 'vision_tower.encoder.layer.2.mlp.up_proj.weight', 'vision_tower.encoder.layer.20.attention.key.bias', 'vision_tower.encoder.layer.20.attention.key.weight', 'vision_tower.encoder.layer.20.attention.output.bias', 'vision_tower.encoder.layer.20.attention.output.weight', 'vision_tower.encoder.layer.20.attention.query.bias', 'vision_tower.encoder.layer.20.attention.query.weight', 'vision_tower.encoder.layer.20.attention.value.bias', 'vision_tower.encoder.layer.20.attention.value.weight', 'vision_tower.encoder.layer.20.mlp.down_proj.bias', 'vision_tower.encoder.layer.20.mlp.down_proj.weight', 'vision_tower.encoder.layer.20.mlp.up_proj.bias', 'vision_tower.encoder.layer.20.mlp.up_proj.weight', 'vision_tower.encoder.layer.21.attention.key.bias', 'vision_tower.encoder.layer.21.attention.key.weight', 'vision_tower.encoder.layer.21.attention.output.bias', 'vision_tower.encoder.layer.21.attention.output.weight', 'vision_tower.encoder.layer.21.attention.query.bias', 'vision_tower.encoder.layer.21.attention.query.weight', 'vision_tower.encoder.layer.21.attention.value.bias', 'vision_tower.encoder.layer.21.attention.value.weight', 'vision_tower.encoder.layer.21.mlp.down_proj.bias', 'vision_tower.encoder.layer.21.mlp.down_proj.weight', 'vision_tower.encoder.layer.21.mlp.up_proj.bias', 'vision_tower.encoder.layer.21.mlp.up_proj.weight', 'vision_tower.encoder.layer.22.attention.key.bias', 'vision_tower.encoder.layer.22.attention.key.weight', 'vision_tower.encoder.layer.22.attention.output.bias', 'vision_tower.encoder.layer.22.attention.output.weight', 'vision_tower.encoder.layer.22.attention.query.bias', 'vision_tower.encoder.layer.22.attention.query.weight', 'vision_tower.encoder.layer.22.attention.value.bias', 'vision_tower.encoder.layer.22.attention.value.weight', 'vision_tower.encoder.layer.22.mlp.down_proj.bias', 'vision_tower.encoder.layer.22.mlp.down_proj.weight', 'vision_tower.encoder.layer.22.mlp.up_proj.bias', 'vision_tower.encoder.layer.22.mlp.up_proj.weight', 'vision_tower.encoder.layer.23.attention.key.bias', 'vision_tower.encoder.layer.23.attention.key.weight', 
'vision_tower.encoder.layer.23.attention.output.bias', 'vision_tower.encoder.layer.23.attention.output.weight', 'vision_tower.encoder.layer.23.attention.query.bias', 'vision_tower.encoder.layer.23.attention.query.weight', 'vision_tower.encoder.layer.23.attention.value.bias', 'vision_tower.encoder.layer.23.attention.value.weight', 'vision_tower.encoder.layer.23.mlp.down_proj.bias', 'vision_tower.encoder.layer.23.mlp.down_proj.weight', 'vision_tower.encoder.layer.23.mlp.up_proj.bias', 'vision_tower.encoder.layer.23.mlp.up_proj.weight', 'vision_tower.encoder.layer.3.attention.key.bias', 'vision_tower.encoder.layer.3.attention.key.weight', 'vision_tower.encoder.layer.3.attention.output.bias', 'vision_tower.encoder.layer.3.attention.output.weight', 'vision_tower.encoder.layer.3.attention.query.bias', 'vision_tower.encoder.layer.3.attention.query.weight', 'vision_tower.encoder.layer.3.attention.value.bias', 'vision_tower.encoder.layer.3.attention.value.weight', 'vision_tower.encoder.layer.3.mlp.down_proj.bias', 'vision_tower.encoder.layer.3.mlp.down_proj.weight', 'vision_tower.encoder.layer.3.mlp.up_proj.bias', 'vision_tower.encoder.layer.3.mlp.up_proj.weight', 'vision_tower.encoder.layer.4.attention.key.bias', 'vision_tower.encoder.layer.4.attention.key.weight', 'vision_tower.encoder.layer.4.attention.output.bias', 'vision_tower.encoder.layer.4.attention.output.weight', 'vision_tower.encoder.layer.4.attention.query.bias', 'vision_tower.encoder.layer.4.attention.query.weight', 'vision_tower.encoder.layer.4.attention.value.bias', 'vision_tower.encoder.layer.4.attention.value.weight', 'vision_tower.encoder.layer.4.mlp.down_proj.bias', 'vision_tower.encoder.layer.4.mlp.down_proj.weight', 'vision_tower.encoder.layer.4.mlp.up_proj.bias', 'vision_tower.encoder.layer.4.mlp.up_proj.weight', 'vision_tower.encoder.layer.5.attention.key.bias', 'vision_tower.encoder.layer.5.attention.key.weight', 'vision_tower.encoder.layer.5.attention.output.bias', 'vision_tower.encoder.layer.5.attention.output.weight', 'vision_tower.encoder.layer.5.attention.query.bias', 'vision_tower.encoder.layer.5.attention.query.weight', 'vision_tower.encoder.layer.5.attention.value.bias', 'vision_tower.encoder.layer.5.attention.value.weight', 'vision_tower.encoder.layer.5.mlp.down_proj.bias', 'vision_tower.encoder.layer.5.mlp.down_proj.weight', 'vision_tower.encoder.layer.5.mlp.up_proj.bias', 'vision_tower.encoder.layer.5.mlp.up_proj.weight', 'vision_tower.encoder.layer.6.attention.key.bias', 'vision_tower.encoder.layer.6.attention.key.weight', 'vision_tower.encoder.layer.6.attention.output.bias', 'vision_tower.encoder.layer.6.attention.output.weight', 'vision_tower.encoder.layer.6.attention.query.bias', 'vision_tower.encoder.layer.6.attention.query.weight', 'vision_tower.encoder.layer.6.attention.value.bias', 'vision_tower.encoder.layer.6.attention.value.weight', 'vision_tower.encoder.layer.6.mlp.down_proj.bias', 'vision_tower.encoder.layer.6.mlp.down_proj.weight', 'vision_tower.encoder.layer.6.mlp.up_proj.bias', 'vision_tower.encoder.layer.6.mlp.up_proj.weight', 'vision_tower.encoder.layer.7.attention.key.bias', 'vision_tower.encoder.layer.7.attention.key.weight', 'vision_tower.encoder.layer.7.attention.output.bias', 'vision_tower.encoder.layer.7.attention.output.weight', 'vision_tower.encoder.layer.7.attention.query.bias', 'vision_tower.encoder.layer.7.attention.query.weight', 'vision_tower.encoder.layer.7.attention.value.bias', 'vision_tower.encoder.layer.7.attention.value.weight', 
'vision_tower.encoder.layer.7.mlp.down_proj.bias', 'vision_tower.encoder.layer.7.mlp.down_proj.weight', 'vision_tower.encoder.layer.7.mlp.up_proj.bias', 'vision_tower.encoder.layer.7.mlp.up_proj.weight', 'vision_tower.encoder.layer.8.attention.key.bias', 'vision_tower.encoder.layer.8.attention.key.weight', 'vision_tower.encoder.layer.8.attention.output.bias', 'vision_tower.encoder.layer.8.attention.output.weight', 'vision_tower.encoder.layer.8.attention.query.bias', 'vision_tower.encoder.layer.8.attention.query.weight', 'vision_tower.encoder.layer.8.attention.value.bias', 'vision_tower.encoder.layer.8.attention.value.weight', 'vision_tower.encoder.layer.8.mlp.down_proj.bias', 'vision_tower.encoder.layer.8.mlp.down_proj.weight', 'vision_tower.encoder.layer.8.mlp.up_proj.bias', 'vision_tower.encoder.layer.8.mlp.up_proj.weight', 'vision_tower.encoder.layer.9.attention.key.bias', 'vision_tower.encoder.layer.9.attention.key.weight', 'vision_tower.encoder.layer.9.attention.output.bias', 'vision_tower.encoder.layer.9.attention.output.weight', 'vision_tower.encoder.layer.9.attention.query.bias', 'vision_tower.encoder.layer.9.attention.query.weight', 'vision_tower.encoder.layer.9.attention.value.bias', 'vision_tower.encoder.layer.9.attention.value.weight', 'vision_tower.encoder.layer.9.mlp.down_proj.bias', 'vision_tower.encoder.layer.9.mlp.down_proj.weight', 'vision_tower.encoder.layer.9.mlp.up_proj.bias', 'vision_tower.encoder.layer.9.mlp.up_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

It seems that none of the vision parts are initialized?

To fix this, update transformers to align with the commit.

# test loading a locally converted InternVL3-8B-hf checkpoint
from transformers import InternVLForConditionalGeneration
model = InternVLForConditionalGeneration.from_pretrained("./InternVL3-8B-hf", device_map="auto")
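To double-check that the fix took effect, from_pretrained can report what it skipped or re-initialized (a sketch, assuming the same locally converted checkpoint as above):

from transformers import InternVLForConditionalGeneration

model, loading_info = InternVLForConditionalGeneration.from_pretrained(
    "./InternVL3-8B-hf", device_map="auto", output_loading_info=True
)
# With a transformers build that matches the converted checkpoint, both lists should be empty;
# a long list of vision_tower.* keys here means the converter and the modeling code disagree.
print(loading_info["missing_keys"])
print(loading_info["unexpected_keys"])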

@Kuangdd01 Kuangdd01 changed the title [model][WIP] Support InternVL2_5 Series [model]Support InternVL2_5-3 Series Apr 16, 2025
@Kuangdd01 Kuangdd01 requested a review from hiyouga April 16, 2025 11:39
@murray-z

murray-z commented Apr 22, 2025

Hey guys, we should now update transformers to the latest version and use the official OpenGVLab model cards with the -hf suffix. For the sizes that have not been converted yet, please refer to /transformers/src/transformers/models/internvl/convert_internvl_weights_to_hf.py to convert the original checkpoints to the HF version. Thanks to @yonigozlan!

Does "the latest version of transformers" refer to 4.51.3 or to git+https://github.com/Kuangdd01/transformers.git@hf-internvl? I tried 4.51.3, but it still didn't work. @haonan3's solution works for me.

:) We should use the latest code of transformers instead of the latest release.

pip install git+https://github.com/huggingface/transformers.git@main

Using pip install git+https://github.com/huggingface/transformers.git@main, the latest LLaMA-Factory, and OpenGVLab/InternVL3-1B-hf:
File "/llama-factory/lib/python3.10/site-packages/llamafactory/data/mm_plugin.py", line 557, in _get_mm_inputs
mm_inputs["pixel_values"] = torch.stack(pixel_values_list)
TypeError: expected Tensor as element 0 in argument 0, but got numpy.ndarray

@hiyouga
Owner

hiyouga commented Apr 22, 2025

@Kuangdd01 we should add return_tensors="pt" to image processor's process func

@Kuangdd01
Collaborator Author

@Kuangdd01 we should add return_tensors="pt" to image processor's process func

yes, my bad. Wait for a second
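For reference, a minimal sketch of the behavior behind the fix (the model id and dummy image are placeholders, and a transformers build with InternVL support is assumed):

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("OpenGVLab/InternVL3-1B-hf")
image = Image.new("RGB", (448, 448))

# With return_tensors="pt", pixel_values comes back as a torch.Tensor rather than
# a list of numpy arrays, so torch.stack() in _get_mm_inputs has tensors to work with.
inputs = processor.image_processor(images=[image], return_tensors="pt")
print(type(inputs["pixel_values"]))  # <class 'torch.Tensor'>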

@Kuangdd01 Kuangdd01 mentioned this pull request Apr 22, 2025
@zhangshuyue-neu

An error is reported after the model is fine-tuned: [Unrecognized configuration class <class 'transformers.models.internvl.configuration_internvl.InternVLConfig'> for this kind of AutoModel: AutoModel.]
My transformers version is transformers 4.52.0.dev0.
My model is from kingsley01/InternVL3-1B-hf.

@Kuangdd01
Collaborator Author

Kuangdd01 commented Apr 23, 2025

An error is reported after the model is fine-tuned: [Unrecognized configuration class <class 'transformers.models.internvl.configuration_internvl.InternVLConfig'> for this kind of AutoModel: AutoModel.] My transformers version is transformers 4.52.0.dev0, and my model is from kingsley01/InternVL3-1B-hf.

There was a mismatch between transformers and the model checkpoint.
kingsley01/* checkpoints -> pip install git+https://github.com/Kuangdd01/transformers.git@hf-internvl
OpenGVLab/* checkpoints -> pip install git+https://github.com/huggingface/transformers.git@main
We now recommend the second option, because some of the larger checkpoints need q/k_norm.
huggingface/transformers#37620
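A quick way to check which build is installed and whether it ships the InternVL classes (a sketch; the 1B checkpoint id is just an example):

import transformers
from transformers import AutoConfig, InternVLForConditionalGeneration  # this import fails on builds without InternVL support

print(transformers.__version__)  # expect a dev build from main (e.g. 4.52.0.dev0) for OpenGVLab/*-hf checkpoints
config = AutoConfig.from_pretrained("OpenGVLab/InternVL3-1B-hf")
print(type(config).__name__)     # InternVLConfig when the installed build and the checkpoint match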

@ChetTaylor-hub

kingsley01/InternVL2_5-1B-MPO-hf

Great! Thank you for your prompt reply, and I’m glad to hear that you’ll be updating the docs with OpenGVLab’s official model cards.

In the meantime, I’ve been able to run the code with the following configuration—hope this helps:

Model config:

model_name_or_path: kingsley01/InternVL2_5-1B-MPO-hf
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

Transformers version: I am using:

pip install git+https://github.com/Kuangdd01/transformers.git@hf-internvl

I’m not entirely sure whether I should switch back to the main branch with: pip install -U transformers

Codebase change: In src/llamafactory/model/loader.py, I commented out:

if processor is not None and "Processor" not in processor.__class__.__name__:

I followed these steps, but I get an error:
ValueError: Image features and image tokens do not match: tokens: 0, features 256

@Kuangdd01
Collaborator Author

(quotes the previous comment and its "Image features and image tokens do not match" error)

Refer to the version-to-checkpoint correspondence in #7258 (comment), update the LLaMA-Factory code, and try again.

Salmon-f42 pushed a commit to IshiKura-a/LLaMA-Factory that referenced this pull request Apr 29, 2025
[assets] update wechat (hiyouga#7288)

[dataset] fix ultrachat_200k dataset (hiyouga#7259)

The `HuggingFaceH4/ultrachat_200k` dataset doesn't contain the default "train" split. The correct split is "train_sft".

[data] gemma3 plugin pan and scan (hiyouga#7294)

* gemma3 pan and scan

* add test case

* fix test

[inference] support sglang backend (hiyouga#7278)

* Mimic SGLang offline Engine

* Add more tests and args

* Pass all current tests

* Clean Code

* fix sample_params

* clean code

* Fix Stream Chat

* change sglang from engine mode to server mode

* fix

* Fix Review Issues

* Use SGLang Built-In Utilities

* Fix test SGLang

* Some Doc Issue

* fix sglang engine

* add readme

---------

Co-authored-by: Jin Pan <jpan236@wisc.edu>
Co-authored-by: hiyouga <hiyouga@buaa.edu.cn>

[model] support hunyuan 7b (hiyouga#7317)

* [Model]supported tencent-hunyuan model

* [Model]supported tencent-hunyuan model(fix)

* [Model]supported tencent-hunyuan model(fix)

[assets] update videos (hiyouga#7340)

* Update README.md

* Update README_zh.md

[data] fix template (hiyouga#7349)

[misc] set dev version (hiyouga#7351)

[assets] update wechat (hiyouga#7361)

[version] fix minicpmo (hiyouga#7378)

[3rdparty] fix redundant process group destroy for ray (hiyouga#7395)

* fix redundant process group destroy for ray

* Update tuner.py

---------

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[misc] fix sglang deps (hiyouga#7432)

* feat: Add transformer version requirement for sglang

* feat: add srt to sglang which is required for running sglang

Other options are srt_hip, srt_xpu, srt_npu, srt_hpu, srt_cpu, for different computation architectures.

[deps] upgrade vllm to 0.8 (hiyouga#7436)

[deps] upgrade transformers to 4.50.0 (hiyouga#7437)

* upgrade transformers

* fix hf cache

* fix dpo trainer

[scripts] support compute score on vllm's predictions (hiyouga#7419)

* enable manual bleu&rouge eval by adding `scripts/eval_bleu_rouge.py`

* added libraries check

* update: use the datasets library's multiprocessing to speed up processing

* update:
- use fire.Fire
- adjust code formatting

* Update eval_bleu_rouge.py: correctly uses fire

Deleted the code of using sys.argv

* Update eval_bleu_rouge.py

---------

Co-authored-by: SnowFox4004 <manba@out>
Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[misc] fix license (hiyouga#7440)

[misc] fix ci (hiyouga#7441)

* fix ci

* improve ci

[docker] upgrade to torch 2.6 (hiyouga#7442)

[trainer] fix vlm loss for transformers 4.49 (hiyouga#7448)

[assets] fix gemma3 readme (hiyouga#7449)

[assets] update wechat (hiyouga#7455)

[misc] enable liger kernel for gemma3 (hiyouga#7462)

[misc] enable liger kernel for gemma3 text and paligemma (hiyouga#7466)

* add gemma3 text

* add paligemma (1,2 and 2 mix)

[misc] update liger-kernel's monkey patch (hiyouga#7453)

* Update liger_kernel.py

* Update setup.py

[model] fix lora on quant models (hiyouga#7456)

Co-authored-by: root <root@ai>

[model] add qwen2vl 32b & upgrade peft (hiyouga#7469)

* add qwen2vl 32b

* fix ci

* upgrade peft to 0.15

* fix ci

* fix ci

[trainer] fix wsd scheduler (hiyouga#7304)

* [trainer] Warmup_stable_decay supports setting the number of stable and decay steps according to the warmup_ratio ratio

* Update trainer_utils.py

---------

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[3rdparty] support swanlab lark notification (hiyouga#7481)

[data] fix pixtral plugin (hiyouga#7505)

* preserve `image_sizes`

* add comments

[assets] update wechat (hiyouga#7523)

[deps] pin pydantic to 2.10.6 (hiyouga#7546)

[model] add Qwen2.5-Omni model (hiyouga#7537)

* preserve image_sizes

* preserve image_sizes

* init plugin

* support audio-text2text lora

* nit

* support image/video-text2text, audio-text2text

* remove args

* remove lines

* add docs && nit

* remove some comments

* fix && add merge part script

* add license

[data] fix qwen2.5 omni collator (hiyouga#7553)

[trainer] new kto mismatch pair creation strategy (hiyouga#7509)

[data] shard the dataset to allow multiprocessing when streaming is enabled (hiyouga#7530)

* Shard the dataset when streaming to allow multiprocessing

* Allow user to not set dataset_shards to ensure backward compatibility

[webui] fix launch with proxy (hiyouga#7332)

[data] specify position_ids in PackedSupervisedDatasetProcessor for neat_packing (hiyouga#7318)

* use position_ids for neat_packing with fa2

* revert fa2 changes

[model] fix use_cache patching for gemma3 multimodal (hiyouga#7500)

[model] fix kv cache (hiyouga#7564)

[infer] vllm video/audio inference (hiyouga#7566)

[trainer] fix batch processing in PPO trainer (hiyouga#7576)

[data] fix qwen2.5 omni plugin (hiyouga#7573)

* align key with qwen2vl

* nit && change scripts

[data] fix qwen2.5 omni plugin (hiyouga#7578)

* specific entry

* Update mm_plugin.py

* fix fps cal

---------

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[assets] update wechat (hiyouga#7594)

[model] add llama4 (hiyouga#7611)

[assets] update readme (hiyouga#7612)

[misc] fix packing and eval plot (hiyouga#7623)

[sglang] support transformers 4.51.0 (hiyouga#7639)

[trainer] fix key error (hiyouga#7635)

[data] Fix bugs of `use_audio_in_video` in Qwen2.5 Omni (hiyouga#7638)

* cache _mm_inputs

* nit

* support for use_audio_in_video

* remove cache

* fix data

* Update mllm_video_audio_demo.json

[assets] update readme (hiyouga#7644)

[assets] update readme (hiyouga#7654)

[data] add coig-p dataset (hiyouga#7657)

[misc] fix cuda warn on intel GPU (hiyouga#7655)

[bugfix] enable_gemma_liger_kernel (hiyouga#7660)

- The `enable_liger_kernel` function for the Gemma model series was not executed due to the existing `if` statement in the code.
- Changed the line to an `elif` statement so that the `apply_liger_kernel` function is executed properly.

resolved: hiyouga#7628

[ray] allow for specifying ray.init kwargs (i.e. runtime_env) (hiyouga#7647)

* ray init kwargs

* Update trainer_utils.py

* fix ray args

---------

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[data] support for specifying a dataset in cloud storage (hiyouga#7567)

* add support for loading datasets from s3/gcs

* add comments to readme

* run linter and address comments

* add option to pass in kwargs to ray init (i.e. runtime env)

* address comment

* revert mixed up changes

[assets] update wechat (hiyouga#7674)

[deps] fix uv conflicts (hiyouga#7686)

* fix hiyouga#7678

* Update setup.py

* Update tests.yml

* Update publish.yml

* Update Makefile

[model] add GLM-4-0414 (hiyouga#7695)

* Update README_zh.md

* update

[deps] upgrade transformers (hiyouga#7704)

[misc] upgrade cli (hiyouga#7714)

[misc] fix env vars (hiyouga#7715)

[model] Support Kimi_VL thinking/instruct (hiyouga#7719)

* add kimi_vl

* patch config

* check version

* Update mm_plugin.py

* Update mm_plugin.py

---------

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[assets] update model readme (hiyouga#7724)

[docker] patch docker-rocm (hiyouga#7725)

* Update Dockerfile

* Fix typo

* Fix syntax for /bin/sh conditional

* Add build args to docker-compose

* Change shell to /bin/bash

This is required for "==" syntax in conditional string comparison

[deps] upgrade vllm (hiyouga#7728)

[api] fix chat messages (hiyouga#7732)

[assets] wechat (hiyouga#7740)

[infer] support vllm-ascend (hiyouga#7739)

[misc] improve entrypoint (hiyouga#7345)

* Purely a cleanup of the entrypoint code, since there were too many if/else branches

* Update cli.py

---------

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[model] support intern-VL 2.5-3 series (hiyouga#7258)

* add internvl and rebase

* fix for internvl2&3

* remove lines

* fix video_inputs & lint

* nit

* add constants

* remove lines

* fix

* fix error

* pass ci

* pass ci

* skip internvl & nit

[infer] set env for vllm ascend (hiyouga#7745)

[breaking] bump transformers to 4.45.0 & improve ci (hiyouga#7746)

* update ci

* fix

* fix

* fix

* fix

* fix

[trainer] fix pt loss (hiyouga#7748)

* fix pt loss

* robust

* fix

* test

[assets] update wechat (hiyouga#7792)

[misc] fix bug in constant (hiyouga#7765)

Co-authored-by: Sachin Beldona <sbeldona@cs.cmu.edu>

[model] fix gemma3 export (hiyouga#7786)

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[misc] fix new tokens adding (hiyouga#7253)

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[data] Fix wrong position ids with packed attention masks (hiyouga#7754)

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[parser] support omegaconf (hiyouga#7793)

[trainer] Add Muon Optimizer (hiyouga#7749)

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[example] add bash usage (hiyouga#7794)

[data] improve mmplugin (hiyouga#7795)

[trainer] support early stop (hiyouga#7797)

[misc] update internvl constants (hiyouga#7801)

[model] add arch check for InternVL (hiyouga#7803)

[assets] update model readme (hiyouga#7804)

[data] fix internvl plugin (hiyouga#7817)

[model] fix moe zero3 (hiyouga#7826)

Merge commit from fork

[model] fix vit gradient checkpointing (hiyouga#7830)

[assets] update wechat (hiyouga#7840)

[ray] add storage filesystem to ray config (hiyouga#7854)

fix attn patch for kimivl (hiyouga#7867)

[data] fix minicpmo vllm infer (hiyouga#7870)

[trainer] make projector trainable in freeze training (hiyouga#7872)

Co-authored-by: hoshi-hiyouga <hiyouga@buaa.edu.cn>

[data] fix qwen2 omni plugin (hiyouga#7875)

[model] fix dsv3 leaf node (hiyouga#7879)

[data] fix qwen2.5 omni template (hiyouga#7883)

[model] add qwen3 (hiyouga#7885)

support lora sft dsv3

update code

update eval yaml

rebase sync w/ major branch

update baseline
yoonseok312 pushed a commit to pensieve-ai/LLaMA-Factory-vlm that referenced this pull request Apr 29, 2025
* add internvl and rebase

* fix for internvl2&3

* remove lines

* fix video_inputs & lint

* nit

* add constants

* remove lines

* fix

* fix error

* pass ci

* pass ci

* skip internvl & nit
@Elenore1997

(quotes the full PR description above)

Hi, thanks for your contribution! May I ask if you have an example script for fine-tuning on a text-image-to-text dataset?

@hiyouga
Owner

hiyouga commented Apr 30, 2025

@Elenore1997 use dataset: mllm_demo

@Elenore1997

@Elenore1997 use dataset: mllm_demo

And are the remaining hyperparameters the same as mentioned above?

@Kuangdd01
Collaborator Author

@Elenore1997 use dataset: mllm_demo

And are the remaining hyperparameters the same as mentioned above?

That depends on your fine-tuning method and your dataset size.

@Elenore1997

@Elenore1997 use dataset: mllm_demo

And are the remaining hyperparameters the same as mentioned above?

That depends on your fine-tuning method and your dataset size.

I pulled the latest training code; the model repo I used is the 8B-hf one, and transformers was installed via pip install git+https://github.com/huggingface/transformers.git@main.
The following error occurred during fine-tuning:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/project/LLaMA-Factory/src/train.py", line 30, in <module>
[rank0]:     main()
[rank0]:   File "/home/project/LLaMA-Factory/src/train.py", line 19, in main
[rank0]:     run_exp()
[rank0]:   File "/home/project/LLaMA-Factory/src/llamafactory/train/tuner.py", line 110, in run_exp
[rank0]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank0]:   File "/home/project/LLaMA-Factory/src/llamafactory/train/tuner.py", line 72, in _training_function
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/home/project/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 105, in run_sft
[rank0]:     train_result = trainer.train()
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/transformers/trainer.py", line 2239, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/transformers/trainer.py", line 2554, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/transformers/trainer.py", line 3746, in training_step
[rank0]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]:   File "/home/project/LLaMA-Factory/src/llamafactory/train/sft/trainer.py", line 103, in compute_loss
[rank0]:     return super().compute_loss(model, inputs, *args, **kwargs)
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/transformers/trainer.py", line 3811, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   (repeated torch.nn.Module _wrapped_call_impl/_call_impl dispatch frames omitted below)
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1914, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/transformers/models/internvl/modeling_internvl.py", line 955, in forward
[rank0]:     image_features = self.get_image_features(
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/transformers/models/internvl/modeling_internvl.py", line 831, in get_image_features
[rank0]:     vision_features = self.vision_tower(pixel_values=pixel_values).last_hidden_state
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/transformers/utils/generic.py", line 969, in wrapper
[rank0]:     output = func(self, *args, **kwargs)
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/transformers/models/internvl/modeling_internvl.py", line 576, in forward
[rank0]:     embedding_output, _ = self.embeddings(pixel_values, bool_masked_pos=bool_masked_pos)
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/transformers/models/internvl/modeling_internvl.py", line 368, in forward
[rank0]:     embeddings, (patch_height, patch_width) = self.patch_embeddings(pixel_values)
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/transformers/models/internvl/modeling_internvl.py", line 285, in forward
[rank0]:     embeddings = self.projection(pixel_values)
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 554, in forward
[rank0]:     return self._conv_forward(input, self.weight, self.bias)
[rank0]:   File "/home/kas/.conda/envs/internvl3/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
[rank0]:     return F.conv2d(
[rank0]: RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
My training script is as follows:
[screenshot of the training config]
What could be the cause? This script has never hit dtype issues when training other multimodal models. Is it related to flash attention? I have not installed flash attention.

@Kuangdd01
Collaborator Author

Kuangdd01 commented May 2, 2025

Unrelated to flash attention. The error means exactly what it says: your pixel values are float32 while the ViT weights are bf16, so the dtypes don't match.
You need an invasive code change to add a dtype check. Ideally the upstream source should include a check like the one below; you can patch it locally for now:
https://github.com/huggingface/transformers/blob/2932f318a20d9e54cc7aea052e040164d85de7d6/src/transformers/models/internvl/modeling_internvl.py#L278-L289
=>

batch_size, num_channels, height, width = pixel_values.shape
if num_channels != self.num_channels:
    raise ValueError(
        "Make sure that the channel dimension of the pixel values match with the one set in the configuration."
    )
target_dtype = self.projection.weight.dtype
if pixel_values.dtype != target_dtype:
    pixel_values = pixel_values.to(dtype=target_dtype)
embeddings = self.projection(pixel_values)
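If you'd rather not edit the installed transformers package, a non-invasive workaround may also work; the sketch below is not part of the official fix, and the attribute path is inferred from the traceback above, so it may differ across transformers revisions. It casts pixel_values to the vision tower's dtype via a forward pre-hook:

```python
def cast_pixel_values_hook(module, args, kwargs):
    # Cast pixel_values to the vision tower's parameter dtype (bf16 here).
    target_dtype = next(module.parameters()).dtype
    if "pixel_values" in kwargs and kwargs["pixel_values"].dtype != target_dtype:
        kwargs["pixel_values"] = kwargs["pixel_values"].to(target_dtype)
    return args, kwargs

# `model` is the loaded InternVL model; adjust the attribute path to your
# transformers revision (the traceback shows the call going through `self.vision_tower`).
model.vision_tower.register_forward_pre_hook(cast_pixel_values_hook, with_kwargs=True)
```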

@Elenore1997
(quoting Kuangdd01's dtype-check patch above)

Thanks for the quick reply; training now runs successfully.

@FloSophorae

An error is reported after the model is fine-tuned: "Unrecognized configuration class <class 'transformers.models.internvl.configuration_internvl.InternVLConfig'> for this kind of AutoModel: AutoModel." My transformers version is 4.52.0.dev0 and my model is kingsley01/InternVL3-1B-hf.

There was a mismatch between transformers and the model checkpoint. kingsley01 checkpoints -> pip install git+https://github.com/Kuangdd01/transformers.git@hf-internvl; OpenGVLab checkpoints -> pip install git+https://github.com/huggingface/transformers.git@main. We now recommend the second one because some bigger checkpoints need q/k_norm. huggingface/transformers#37620

Hi, I am doing LoRA SFT on InternVL3-8B. The model is OpenGVLab/InternVL3-8B converted to the hf format with transformers/src/transformers/models/internvl/convert_internvl_weights_to_hf.py, and I trained with transformers installed via pip install git+https://github.com/Kuangdd01/transformers.git@hf-internvl. Training works fine, but when I serve the exported model with vLLM I get:
raise ValueError("`limit_mm_per_prompt` is only supported for " ValueError: `limit_mm_per_prompt` is only supported for multimodal models. It looks like an architecture/version issue with the export? Could you advise?
My vLLM command is:

CUDA_VISIBLE_DEVICES=0,1 vllm serve $MODEL_PATH \
    --tensor-parallel-size 2 \
    --port $MODEL_PROT \
    --host 0.0.0.0 \
    --dtype float16 \
    --max-model-len 65536 \
    --limit-mm-per-prompt image=30,video=0 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.6 \
    --block-size 16 > "$VLLM_LOG" 

This happens both with the model I converted from OpenGVLab/InternVL3-8B to the hf format and with the model exported after LoRA SFT. Could you help? Thanks a lot @Kuangdd01

@Kuangdd01
Collaborator Author

Kuangdd01 commented May 7, 2025

(quoting FloSophorae's vLLM question above)

The InternVL format currently supported by vLLM is InternVLChat, not internvl-hf. To serve with vLLM you would have to convert the -hf checkpoint back to the original parameter naming and use the original config files. In the meantime you can use the hf engine as the inference backend, though it will be much slower.
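For reference, a minimal inference sketch with plain transformers (not from this thread; the model id, image URL, and generation settings are placeholders, and the class/processor API assumes a recent transformers main with InternVL-hf support):

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "OpenGVLab/InternVL3-8B-hf"  # or your exported SFT checkpoint (placeholder path)
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)  # BatchFeature.to only casts floating-point tensors

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```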

@FloSophorae

FloSophorae commented May 7, 2025

(quoting the vLLM discussion above)

Hi, how do I convert it back? I could not find a corresponding script in transformers. Could you explain how to do this? I tried directly changing model_type, but that did not work. Thanks a lot @Kuangdd01

@Kuangdd01
Collaborator Author

Kuangdd01 commented May 7, 2025

(quoting the conversion question above)

I have not tried this myself, but you can reverse the process based on transformers/src/transformers/models/internvl/convert_internvl_weights_to_hf.py. If your InternVL uses a Qwen LLM backbone, you only need to rename the visual part, i.e. apply the inverse renaming to the current hf-version parameter names:
https://github.com/huggingface/transformers/blob/c8607a17cbbfc20b6bae046e3e2f72e1749cf0fc/src/transformers/models/internvl/convert_internvl_weights_to_hf.py#L56-L70
as well as the MLP part:
https://github.com/huggingface/transformers/blob/c8607a17cbbfc20b6bae046e3e2f72e1749cf0fc/src/transformers/models/internvl/convert_internvl_weights_to_hf.py#L84-L89
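A rough sketch of such a reverse rename (untested; the mapping entries below are illustrative placeholders, so take the real pattern/replacement pairs from the conversion script linked above and invert them):

```python
import re
from safetensors.torch import load_file, save_file

# Illustrative inverse of the hf conversion mapping; replace these entries with the
# inverted pairs from convert_internvl_weights_to_hf.py for your checkpoint.
HF_TO_ORIGINAL = {
    r"^model\.vision_tower\.": "vision_model.",
    r"^model\.multi_modal_projector\.": "mlp1.",
    r"^model\.language_model\.": "language_model.model.",
    r"^lm_head\.": "language_model.lm_head.",
}

def rename_back(state_dict):
    """Map hf-style parameter names back to the original InternVLChat naming."""
    renamed = {}
    for key, tensor in state_dict.items():
        new_key = key
        for pattern, replacement in HF_TO_ORIGINAL.items():
            new_key = re.sub(pattern, replacement, new_key)
        renamed[new_key] = tensor
    return renamed

# Example: rewrite one safetensors shard (repeat for every shard, and swap in the
# original config/tokenizer files afterwards). File names are placeholders.
shard = load_file("model-00001-of-00004.safetensors")
save_file(rename_back(shard), "converted-00001-of-00004.safetensors")
```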

stephen-nju pushed a commit to stephen-nju/Llmtrain that referenced this pull request May 10, 2025
* add internvl and rebase

* fix for internvl2&3

* remove lines

* fix video_inputs & lint

* nit

* add constants

* remove lines

* fix

* fix error

* pass ci

* pass ci

* skip internvl & nit
liu-qingyuan pushed a commit to liu-qingyuan/LLaMA-Factory-Megafake that referenced this pull request Jun 6, 2025
* add internvl and rebase

* fix for internvl2&3

* remove lines

* fix video_inputs & lint

* nit

* add constants

* remove lines

* fix

* fix error

* pass ci

* pass ci

* skip internvl & nit
@liuchaos03

pip install git+https://github.com/huggingface/transformers.git@main currently installs transformers-4.55.0.dev0,
but the latest llamafactory 0.9.4.dev0 requires transformers!=4.52.0,<=4.52.4,>=4.49.0; sys_platform != "darwin", but you have transformers 4.55.0.dev0 which is incompatible.

@Kuangdd01
Collaborator Author

but you have transformers 4.55.0.dev0 which is incompatible.

Run with the environment variable DISABLE_VERSION_CHECK=1 to skip the dependency version check (e.g. DISABLE_VERSION_CHECK=1 llamafactory-cli train your_config.yaml).

@FantasyJXF

(quoting the transformers 4.55.0.dev0 / llamafactory version-conflict message above)

When training the InternVL3.5 series I also hit RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same, but with one difference: the error only appears when eval is enabled; training alone runs fine. What could be the reason?

P.S. With llamafactory==0.9.3 and transformers==4.52.3, the same config for the InternVL3 series does not raise this error even with eval enabled.

### train
per_device_train_batch_size: 32
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
flash_attn: fa2
enable_liger_kernel: true

## eval  --- for InternVL3.5, commenting out the section below lets training run normally
val_size: 0.05
per_device_eval_batch_size: 32
eval_strategy: steps
eval_steps: 500
metric_for_best_model: eval_loss  # choices: [eval_accuracy, eval_loss, rouge-l, etc.]
greater_is_better: false


Labels

solved: This problem has been already solved


Development

Successfully merging this pull request may close these issues.

InternVL2.5-8B finetuning