[model] Support InternVL2.5-3 Series #7258
Conversation
Force-pushed from 4c01ba8 to 20eeb05.
I know, we need to use yonigozlan/InternVL2_5-1B-MPO-hf instead of the original OpenGVLab/InternVL2_5-1B-MPO!
Can you use this model card for testing one more time? I can't guarantee that the latest version is available. Feel free to report bugs; this PR is not complete for now. :[
How do I convert an InternVL2_5-1B-MPO checkpoint to HF format if I have a customized-pretrained InternVL model? Please help!
In my case, just replace that tokenizer-loading code with tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", return_token_type_ids=False, trust_remote_code=True). It will replace the tokenizer with InternLM2's.
Thanks a lot for your timely response!! When I load the converted checkpoint, I get:
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [00:23<00:00, 5.83s/it]
Some weights of the model checkpoint at InternVL2_5-8B-MPO-hf were not used when initializing InternVLForConditionalGeneration: ['vision_tower.encoder.layer.0.attention.attention.key.bias', 'vision_tower.encoder.layer.0.attention.attention.key.weight', 'vision_tower.encoder.layer.0.attention.attention.query.bias', 'vision_tower.encoder.layer.0.attention.attention.query.weight', 'vision_tower.encoder.layer.0.attention.attention.value.bias', 'vision_tower.encoder.layer.0.attention.attention.value.weight', 'vision_tower.encoder.layer.0.attention.output.dense.bias', 'vision_tower.encoder.layer.0.attention.output.dense.weight', 'vision_tower.encoder.layer.0.intermediate.dense.bias', 'vision_tower.encoder.layer.0.intermediate.dense.weight', 'vision_tower.encoder.layer.0.output.dense.bias', 'vision_tower.encoder.layer.0.output.dense.weight', ...] (the same attention query/key/value, attention output, intermediate, and output weights and biases are listed for every one of the 24 vision encoder layers)
- This IS expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing InternVLForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of InternVLForConditionalGeneration were not initialized from the model checkpoint at InternVL2_5-8B-MPO-hf and are newly initialized: ['vision_tower.encoder.layer.0.attention.key.bias', 'vision_tower.encoder.layer.0.attention.key.weight', 'vision_tower.encoder.layer.0.attention.output.bias', 'vision_tower.encoder.layer.0.attention.output.weight', 'vision_tower.encoder.layer.0.attention.query.bias', 'vision_tower.encoder.layer.0.attention.query.weight', 'vision_tower.encoder.layer.0.attention.value.bias', 'vision_tower.encoder.layer.0.attention.value.weight', 'vision_tower.encoder.layer.0.mlp.down_proj.bias', 'vision_tower.encoder.layer.0.mlp.down_proj.weight', 'vision_tower.encoder.layer.0.mlp.up_proj.bias', 'vision_tower.encoder.layer.0.mlp.up_proj.weight', ...] (the same attention query/key/value, attention output, and MLP up/down projection weights and biases are listed for every one of the 24 vision encoder layers)
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
It seems that the vision parts are not initialized at all?
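For reference, one way to confirm that this is a parameter-naming mismatch rather than genuinely missing weights is to inspect the keys stored in the converted checkpoint. This is a minimal sketch, assuming a sharded safetensors checkpoint with the usual index file:

```python
# Sketch: list the vision-tower keys recorded in the converted checkpoint's shard index.
# Keys such as "...attention.attention.key.weight" indicate the checkpoint was produced for an
# older parameter naming than the installed transformers expects ("...attention.key.weight").
import json

with open("InternVL2_5-8B-MPO-hf/model.safetensors.index.json") as f:
    ckpt_keys = sorted(json.load(f)["weight_map"])

for key in ckpt_keys:
    if key.startswith("vision_tower.encoder.layer.0."):
        print(key)
```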
Could you check the write_tokenizer function in the conversion script?

```python
def write_tokenizer(save_dir: str, push_to_hub: bool = False, path: str = None, hub_dir: str = None):
    if LM_TYPE_CORRESPONDENCE[path] == "qwen2":
        tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", return_token_type_ids=False)
        tokenizer.model_max_length = CONTEXT_LENGTH
```
Actually, I am using a customized-pretrained model. According to your project:

```python
LM_TYPE_CORRESPONDENCE = {
    "OpenGVLab/InternVL2_5-1B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-2B-MPO": "llama",
    "OpenGVLab/InternVL2_5-4B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-8B-MPO": "llama",
    "OpenGVLab/InternVL2_5-26B-MPO": "llama",
    "OpenGVLab/InternVL2_5-38B-MPO": "qwen2",
    "OpenGVLab/InternVL2_5-78B-MPO": "qwen2",
}
```

I suppose I should use the path pointing at "llama"?
Yes! Wait for a moment; I am going to reproduce it.
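Concretely, for a customized InternLM2-based ("llama") checkpoint, the tokenizer branch of the conversion script could be adapted along these lines. This is a sketch only: the local path, the CONTEXT_LENGTH value, and the final save call are placeholders, and the real write_tokenizer in convert_internvl_weights_to_hf.py does more than this:

```python
# Sketch: route a custom InternLM2-based checkpoint through the "llama" branch and use the
# InternLM2 tokenizer, as suggested earlier in this thread. The local path is a placeholder.
from transformers import AutoTokenizer

CONTEXT_LENGTH = 8192  # assumed value; in practice taken from the conversion script
LM_TYPE_CORRESPONDENCE = {"/path/to/my-internvl2_5-8b-mpo": "llama"}

def write_tokenizer(save_dir: str, push_to_hub: bool = False, path: str = None, hub_dir: str = None):
    if LM_TYPE_CORRESPONDENCE[path] == "qwen2":
        tokenizer = AutoTokenizer.from_pretrained(
            "Qwen/Qwen2.5-VL-7B-Instruct", return_token_type_ids=False
        )
    else:  # "llama": InternLM2-based checkpoints
        tokenizer = AutoTokenizer.from_pretrained(
            "internlm/internlm2-chat-7b", return_token_type_ids=False, trust_remote_code=True
        )
    tokenizer.model_max_length = CONTEXT_LENGTH
    tokenizer.save_pretrained(save_dir)

write_tokenizer("./my-internvl2_5-8b-mpo-hf", path="/path/to/my-internvl2_5-8b-mpo")
```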
Additionally, this error occurs when I directly attempt to load your published model, without running the weight-conversion script. I don't know what could cause the mismatch between parameter names. I updated to the newest transformers version with the following pip install:
A quick check: it seems that your transformers version does not match this checkpoint.
To fix it, update transformers to align with the commit.

```python
# test with the converted InternVL3-8B-hf
from transformers import InternVLForConditionalGeneration

model = InternVLForConditionalGeneration.from_pretrained("./InternVL3-8B-hf", device_map="auto")
```
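To see which build is actually installed (the dev version string changes with every main-branch commit), one can simply print it:

```python
# Print the installed transformers build; a main-branch install shows a ".dev0" version string.
import transformers

print(transformers.__version__)
```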
Use pip install git+https://github.com/huggingface/transformers.git@main, the latest LLaMA-Factory, and OpenGVLab/InternVL3-1B-hf.
@Kuangdd01 we should add
Yes, my bad. Wait a second.
An error is reported after the model is fine-tuned: "Unrecognized configuration class <class 'transformers.models.internvl.configuration_internvl.InternVLConfig'> for this kind of AutoModel: AutoModel."
There was a mismatch between the transformers version and the model checkpoint.
I followed these steps, but got an error.
Refer to the version/model correspondence in #7258 (comment), update the LLaMA-Factory code, and try again.
Force-pushed: rebased onto the latest main branch, pulling in the upstream commits merged since this PR was opened. The PR's own commits: add internvl and rebase; fix for internvl2&3; remove lines; fix video_inputs & lint; nit; add constants; remove lines; fix; fix error; pass ci; pass ci; skip internvl & nit.
Hi, thanks for your contribution! May I ask if you have an example script for fine-tuning on a text-image-to-text dataset?
@Elenore1997 use
And are the remaining hyper-parameters the same as mentioned above?
That depends on your fine-tuning method and your dataset size.
I pulled the latest training code; the model repo I'm using is the 8B-hf one, and transformers was installed with pip install git+https://github.com/huggingface/transformers.git@main
That is unrelated; it means exactly what it says: your pixel values are float32 while the ViT is bf16, so the dtypes don't match. This is where the cast happens:

```python
batch_size, num_channels, height, width = pixel_values.shape
if num_channels != self.num_channels:
    raise ValueError(
        "Make sure that the channel dimension of the pixel values match with the one set in the configuration."
    )
target_dtype = self.projection.weight.dtype
if pixel_values.dtype != target_dtype:
    pixel_values = pixel_values.to(dtype=target_dtype)
embeddings = self.projection(pixel_values)
```
Thanks for the prompt reply; training now runs successfully~
Hello, I'd like to ask: I'm doing LoRA SFT with InternVL3-8B, using the OpenGVLab/InternVL3-8B model. Whether MODEL_PATH is set to my OpenGVLab/InternVL3-8B checkpoint converted to HF format or to the model exported after LoRA SFT training, this problem appears. Could you help me with it? Thanks a lot @Kuangdd01
The InternVL model format currently supported by vLLM is InternVLChat rather than internvl-hf. If you want to serve it with vLLM, you need to convert the -hf checkpoint back to the original parameter naming and use the original config files. For now you can use the HF engine as the inference backend, but it will be much slower.
Hello, may I ask how to convert it back? I couldn't find a corresponding script in transformers either. Could you advise how to do this? I tried directly modifying model_type myself, but that didn't work. Thanks a lot @Kuangdd01
I haven't tried this myself, but you can follow transformers/src/transformers/models/internvl/convert_internvl_weights_to_hf.py to do the reverse process. If the LLM backbone of your InternVL model is Qwen, you only need to rename the visual part, i.e. apply the reverse renaming to the parameter names of the current HF version.
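As an illustration of that reverse process, one could invert the renaming rules from the HF conversion script and apply them to the HF state-dict keys. The two substitution rules below are placeholders; the real mapping has to be read off convert_internvl_weights_to_hf.py:

```python
# Illustrative sketch of renaming HF-style keys back to the original InternVL naming.
# The rules here are placeholders; derive the real ones by inverting the mapping used in
# transformers/src/transformers/models/internvl/convert_internvl_weights_to_hf.py.
import re

HF_TO_ORIGINAL = [
    (r"^vision_tower\.", "vision_model."),                    # placeholder rule
    (r"\.encoder\.layer\.(\d+)\.", r".encoder.layers.\1."),   # placeholder rule
]

def rename_back(hf_key: str) -> str:
    for pattern, replacement in HF_TO_ORIGINAL:
        hf_key = re.sub(pattern, replacement, hf_key)
    return hf_key

print(rename_back("vision_tower.encoder.layer.0.attention.q_proj.weight"))
# -> vision_model.encoder.layers.0.attention.q_proj.weight (with the real rules, the full original name)
```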
pip install git+https://github.com/huggingface/transformers.git@main now corresponds to transformers-4.55.0.dev0.
Run with the following env var
I also ran into this when training the InternVL3.5 series.










What does this PR do?
Reopened PR #7077
May fix #6322, #6432, #6236, #3802
Before submitting
former version
## some demo experiment on `InternVL2_5-1B-MPO-hf`

1. video lora sft

```yaml
### model
model_name_or_path: kingsley01/InternVL3-1B-hf
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all

### dataset
dataset: mllm_video_demo
dataset: mllm_demo # text: identity,alpaca_en_demo # video: mllm_video_demo
template: intern_vl
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 4

### output
output_dir: saves/internvl-1b/lora/sft-test-demo-video
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 30.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
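To run this config, one could save it to a YAML file and launch it through the LLaMA-Factory CLI; the file name below is just a placeholder (a minimal sketch):

```python
# Hypothetical launcher: assumes the YAML above was saved to the path below and that
# llamafactory-cli is on PATH; it simply starts LoRA SFT via the CLI.
import subprocess

CONFIG = "internvl3_1b_video_lora_sft.yaml"  # placeholder path for the config shown above
subprocess.run(["llamafactory-cli", "train", CONFIG], check=True)
```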
Align with the latest Transformers Now! 😄
NOW we support InternVL2.5-InternVL3 series post-training!
Important
We should use the latest Hugging Face code instead of its release, together with the OpenGVLab/InternVL3-xB-hf checkpoints.
For processor issues, please check your transformers version and model.processor.config.
For now, please install a specific version of the latest transformers:
pip install git+https://github.com/Kuangdd01/transformers.git@hf-internvl
We support direct use of several small-sized checkpoints: InternVL2.5-1/2/4/8B and InternVL3-1/2/8B. Download the InternVL models from Hugging Face or ModelScope.
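As a quick smoke test of one of those checkpoints, loading and generating might look roughly like this. This is a sketch, not the official example: the model id and image path are illustrative, and the chat-template call follows the generic transformers VLM pattern, which may differ slightly across versions:

```python
# Rough smoke-test sketch for an -hf InternVL checkpoint; model id and image path are examples.
import torch
from PIL import Image
from transformers import AutoProcessor, InternVLForConditionalGeneration

model_id = "OpenGVLab/InternVL3-1B-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = InternVLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
image = Image.open("example.jpg")  # any local test image

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```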