Fix non-FA2 tests after FA2 installed in CI docker image #40430
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
model = Glm4vForConditionalGeneration.from_pretrained(
    "THUDM/GLM-4.1V-9B-Thinking", dtype=torch.float16, device_map="auto"
)
questions = ["Describe this video."] * 2
```
Don't use batch 2, otherwise it OOMs. It's fine to simply test batch 1 here; the goal is just to check that video generation works.
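i.e. roughly something like this (illustrative only, the exact change in the diff may differ):

```python
# batch size 1 is enough to exercise the video path without hitting OOM
questions = ["Describe this video."]
```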
Hmm, we have the same for images at
transformers/tests/models/glm4v/test_modeling_glm4v.py, lines 314 to 370 in 8828b2e:

```python
@slow
def test_small_model_integration_test(self):
    model = Glm4vForConditionalGeneration.from_pretrained(
        "THUDM/GLM-4.1V-9B-Thinking", dtype="auto", device_map="auto"
    )
    inputs = self.processor.apply_chat_template(
        self.message, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    )
    expected_input_ids = [151331, 151333, 151336, 198, 151339, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343, 151343]  # fmt: skip
    assert expected_input_ids == inputs.input_ids[0].tolist()[:17]

    expected_pixel_slice = torch.tensor(
        [
            [-0.0988, -0.0842, -0.0842],
            [-0.5660, -0.5514, -0.4200],
            [-0.0259, -0.0259, -0.0259],
            [-0.1280, -0.0988, -0.2010],
            [-0.4638, -0.5806, -0.6974],
            [-1.2083, -1.2229, -1.2083],
        ],
        dtype=torch.float32,
        device="cpu",
    )
    assert torch.allclose(expected_pixel_slice, inputs.pixel_values[:6, :3], atol=3e-3)

    # verify generation
    inputs = inputs.to(torch_device)
    output = model.generate(**inputs, max_new_tokens=30)
    EXPECTED_DECODED_TEXT = "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks"
    self.assertEqual(
        self.processor.decode(output[0], skip_special_tokens=True),
        EXPECTED_DECODED_TEXT,
    )

@slow
def test_small_model_integration_test_batch(self):
    model = Glm4vForConditionalGeneration.from_pretrained(
        "THUDM/GLM-4.1V-9B-Thinking", dtype="auto", device_map="auto"
    )
    batch_messages = [self.message] * 2
    inputs = self.processor.apply_chat_template(
        batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    ).to(torch_device)

    # it should not matter whether two images are the same size or not
    output = model.generate(**inputs, max_new_tokens=30)
    EXPECTED_DECODED_TEXT = [
        "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
        "\nWhat kind of dog is this?\n<think>Got it, let's look at the image. The animal in the picture is not a dog; it's a cat. Specifically, it looks",
    ]  # fmt: skip
    self.assertEqual(
        self.processor.batch_decode(output, skip_special_tokens=True),
        EXPECTED_DECODED_TEXT,
    )
```
Would it be possible to have something similar for videos? Fine with keeping this as is too, just not sure whether batching several videos could hit different issues than batching several images. Something roughly like the sketch below is what I had in mind.
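Untested sketch only; `self.video_message` and the equality check are placeholders I made up, not fixtures from the actual test file:

```python
@slow
def test_small_model_integration_test_batch_video(self):
    model = Glm4vForConditionalGeneration.from_pretrained(
        "THUDM/GLM-4.1V-9B-Thinking", dtype=torch.float16, device_map="auto"
    )
    batch_messages = [self.video_message] * 2  # hypothetical video chat-template fixture
    inputs = self.processor.apply_chat_template(
        batch_messages, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    ).to(torch_device)
    output = model.generate(**inputs, max_new_tokens=30)
    decoded = self.processor.batch_decode(output, skip_special_tokens=True)
    # with two identical video prompts, the two generations should match each other
    self.assertEqual(decoded[0], decoded[1])
```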
I will just leave it as is. I don't have enough bandwidth, and this test never passed anyway.
```python
model_id = "mistralai/Mistral-7B-v0.1"
EXPECTED_COMPLETIONS = [
    "This is a nice place. This is a nice place. This is a nice place. This is",
    "scenery, scenery, scenery, scenery, scenery,",
```
These change due to the 800 --> 682 change below.
Repetition goes wild 😆
```python
if attn_implementation in ["flex_attention", "eager"]:
    input_text = input_text[:1]
```
eager still OOMs with 682, so just make it batch size 1 there.
Small nit: would be good to add a short comment explaining this ^
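For example (suggested wording only, not taken from the PR):

```python
# flex_attention / eager still OOM at the full batch size even after the
# 800 -> 682 reduction, so run those implementations with a single sample
if attn_implementation in ["flex_attention", "eager"]:
    input_text = input_text[:1]
```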
```diff
 @pytest.mark.flash_attn_test
 def test_model_600m_long_prompt(self):
-    EXPECTED_OUTPUT_TOKEN_IDS = [306, 338]
+    EXPECTED_OUTPUT_TOKEN_IDS = [198, 198]
```
This test was never run before. The SDPA version of this test already uses 198.
Maybe move them into the same test and parametrize instead? Looks like another SDPA/flash parity check that was hidden. Something roughly like the sketch below is what I mean.
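Untested sketch only; the checkpoint, prompt construction, and class name are placeholders, not the actual test code:

```python
# Rough sketch of merging the SDPA and FA2 variants with `parameterized`.
import unittest

import torch
from parameterized import parameterized

from transformers import AutoModelForCausalLM
from transformers.testing_utils import require_torch_gpu, slow, torch_device


class LongPromptIntegrationTest(unittest.TestCase):
    @parameterized.expand([("sdpa",), ("flash_attention_2",)])
    @require_torch_gpu
    @slow
    def test_model_600m_long_prompt(self, attn_implementation):
        EXPECTED_OUTPUT_TOKEN_IDS = [198, 198]
        model = AutoModelForCausalLM.from_pretrained(
            "Qwen/Qwen3-0.6B",  # assumed checkpoint, the real test may use a different one
            dtype=torch.float16,
            device_map="auto",
            attn_implementation=attn_implementation,
        )
        # long dummy prompt to exercise long-context attention
        input_ids = torch.tensor([[306] * 4096], device=torch_device)
        generated = model.generate(input_ids, max_new_tokens=2, do_sample=False)
        self.assertEqual(EXPECTED_OUTPUT_TOKEN_IDS, generated[0][-2:].tolist())
```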
Looking at the 2 tests, one does more work than the other, and the loading is also different (4-bit or not). I will simply keep them as they are.
I see, makes sense, didn't look in detail myself ^^
LGTM overall! Just some smaller things, and good to see that we now run these previously hidden tests.
[For maintainers] Suggested jobs to run (before merge): run-slow: glm4v, mistral, qwen2, qwen3
What does this PR do?