[ESM] support attention API #40370
Conversation
run-slow: esm
run-slow: esm
This comment contains run-slow, running the specified jobs: models: ['models/esm']
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
run-slow: esm, evolla
This comment contains run-slow, running the specified jobs: models: ['models/esm', 'models/evolla']
Some initial thoughts: it would be nice if we could align things with #38301, as we can then refactor more easily afterwards.
```diff
 if self.config._attn_implementation != "flash_attention_2":
     batch_size, seq_length = inputs_embeds.shape[:-1]
     if attention_mask is None:
         attention_mask = torch.ones(((batch_size, seq_length)), device=inputs_embeds.device)

 else:
     # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
     # ourselves in which case we just need to make it broadcastable to all heads.
-    extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)
+    attention_mask: torch.Tensor = self.get_extended_attention_mask(
+        attention_mask, input_shape=(batch_size, seq_length)
+    )

 # If a 2D or 3D attention mask is provided for the cross-attention
 # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
 if self.config.is_decoder and encoder_hidden_states is not None:
     encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
     encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
     if encoder_attention_mask is None:
-        encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
+        encoder_attention_mask = torch.ones(encoder_hidden_shape, device=inputs_embeds.device)
     encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
 else:
     encoder_extended_attention_mask = None
```
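For background on why the extended mask is gated on the attention implementation: the flash_attention_2 path consumes the raw 2D padding mask directly, while the eager and sdpa paths expect a broadcastable additive 4D mask. A small standalone illustration of the two shapes (not code from this PR):

```python
import torch

# 2D padding mask, passed through unchanged for flash_attention_2: [batch, seq_len]
padding_mask = torch.tensor([[1, 1, 1, 0]])

# 4D additive mask for eager/sdpa: [batch, 1, 1, seq_len], 0.0 where attention is
# allowed and a large negative value where it is masked out.
extended_mask = (1.0 - padding_mask[:, None, None, :].float()) * torch.finfo(torch.float32).min
print(padding_mask.shape, extended_mask.shape)  # torch.Size([1, 4]) torch.Size([1, 1, 1, 4])
```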
Can we align the mask creation with what is done in #38301? It will make refactoring easier, and those mask creations are "more proven".
Yeah, I also thought of that, but since the model supports a non-causal mask as well, I think we need a cleaner approach for it in the future: some kind of small function that decides which mask to construct and returns it.
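To make the idea concrete, here is a minimal, hypothetical sketch of such a helper; the name `build_attention_mask` and its signature are invented for illustration and are not part of this PR or of the transformers mask utilities:

```python
from typing import Optional

import torch


def build_attention_mask(
    attention_mask: Optional[torch.Tensor],
    inputs_embeds: torch.Tensor,
    is_causal: bool,
    dtype: torch.dtype = torch.float32,
) -> torch.Tensor:
    """Hypothetical helper: decide which additive 4D mask to build and return it."""
    batch_size, seq_length = inputs_embeds.shape[:2]
    device = inputs_embeds.device
    if attention_mask is None:
        attention_mask = torch.ones(batch_size, seq_length, device=device)

    # [batch_size, 1, 1, key_len] boolean "may attend" mask from the 2D padding mask
    allowed = attention_mask[:, None, None, :].to(torch.bool)
    if is_causal:
        causal = torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool, device=device))
        allowed = allowed & causal[None, None, :, :]

    # Additive form used by eager/sdpa attention: 0 where allowed, large negative where masked
    mask = torch.zeros(allowed.shape, dtype=dtype, device=device)
    return mask.masked_fill(~allowed, torch.finfo(dtype).min)


# Example: a bidirectional (encoder-style) mask for two sequences, one of them padded.
embeds = torch.randn(2, 4, 8)
padding = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
print(build_attention_mask(padding, embeds, is_causal=False).shape)  # torch.Size([2, 1, 1, 4])
print(build_attention_mask(padding, embeds, is_causal=True).shape)   # torch.Size([2, 1, 4, 4])
```

The self-attention path would call such a helper with is_causal derived from config.is_decoder, and the cross-attention path with is_causal=False.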
I don't mean the attention mask interface yet, as that definitely needs an update for non-causal variants :D I'm thinking of transformers/src/transformers/models/bert/modeling_bert.py, lines 990 to 1034 at e0f1e83:
```python
if attention_mask is None:
    # required mask seq length can be calculated via length of past cache
    mask_seq_length = past_key_values_length + seq_length
    attention_mask = torch.ones(batch_size, mask_seq_length, device=device)
if self.config.is_decoder and encoder_hidden_states is not None and encoder_attention_mask is None:
    encoder_attention_mask = torch.ones(encoder_hidden_states.shape[:2], device=device)

if attention_mask.dim() == 2:
    if self.config.is_decoder:
        attention_mask = create_causal_mask(
            config=self.config,
            input_embeds=embedding_output,
            attention_mask=attention_mask,
            cache_position=cache_position,
            past_key_values=past_key_values,
        )
    else:
        attention_mask = self._update_full_mask(
            attention_mask,
            embedding_output,
        )
elif attention_mask.dim() == 3:
    if self.config._attn_implementation in ["flash_attention_2", "flex_attention"]:
        raise ValueError(
            "Passing attention mask with a 3D/4D shape does not work with type "
            f"{self.config._attn_implementation} - please use either `sdpa` or `eager` instead."
        )
    attention_mask = self.get_extended_attention_mask(attention_mask, input_shape)

if encoder_attention_mask is not None:
    if encoder_attention_mask.dim() == 2:
        encoder_attention_mask = self._update_cross_attn_mask(
            encoder_hidden_states,
            encoder_attention_mask,
            embedding_output.shape[:2],
            embedding_output,
        )
    else:
        if self.config._attn_implementation in ["flash_attention_2", "flex_attention"]:
            raise ValueError(
                "Passing attention mask with a 3D/4D shape does not work with type "
                f"{self.config._attn_implementation} - please use either `sdpa` or `eager` instead."
            )
        encoder_attention_mask = self.invert_attention_mask(encoder_attention_mask)
```
(without the checks on the dims). It would make it easy to remove all those functions later on if we keep things consistent across the models that have yet to get the mask interface. (And I have tested it quite thoroughly on all attention variations.)
run-slow: esm, evolla
This comment contains run-slow, running the specified jobs: models: ['models/esm', 'models/evolla']
Oops, I commented on the modular-generated file, but this should be carried over to esm. My bad ^^'
Just one small nit (the order of relative scaling), and could you change the mask creation per https://github.com/huggingface/transformers/pull/40370/files#r2297912836?
run-slow: esm, evolla
This comment contains run-slow, running the specified jobs: models: ['models/esm', 'models/evolla']
run-slow: esm, evolla
```python
self.scaling = 1.0  # For BC we apply scaling before RoPE
self.is_decoder = config.is_decoder
self.layer_idx = layer_idx
self.is_causal = self.is_decoder  # used only in FA2/FA3
```
SDPA uses this as well :D This is probably incorrect in the case of cross-attention; can you add an optional argument here instead?
Ah right, it also has cross-attention. I copied this from the current ESM attention, assuming it was working on the main branch.
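A minimal sketch of the suggested fix, assuming an optional constructor argument; the class name and exact signature below are illustrative, not the code that was merged:

```python
from types import SimpleNamespace
from typing import Optional

from torch import nn


class EsmStyleAttention(nn.Module):
    """Illustrative attention stub: `is_causal` can be overridden per layer."""

    def __init__(self, config, layer_idx: int, is_causal: Optional[bool] = None):
        super().__init__()
        self.scaling = 1.0  # For BC we apply scaling before RoPE
        self.is_decoder = config.is_decoder
        self.layer_idx = layer_idx
        # Default to causal only for decoder self-attention; cross-attention layers
        # pass is_causal=False explicitly (the flag is read by SDPA and FA2/FA3 paths).
        self.is_causal = self.is_decoder if is_causal is None else is_causal


config = SimpleNamespace(is_decoder=True)
self_attn = EsmStyleAttention(config, layer_idx=0)                     # is_causal == True
cross_attn = EsmStyleAttention(config, layer_idx=0, is_causal=False)   # is_causal == False
print(self_attn.is_causal, cross_attn.is_causal)
```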
[For maintainers] Suggested jobs to run (before merge)
run-slow: esm, evolla
Original PR #40370 by zucchini-nlp: huggingface/transformers#40370
Merged from original PR #40370: huggingface/transformers#40370
What does this PR do?
Addresses #34954 and updates ESM to support the attention API and modeling outputs.
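For readers unfamiliar with the pattern, the sketch below shows roughly what supporting the attention API means: the per-backend attention math is factored into interchangeable callables, and the module dispatches on config._attn_implementation instead of hard-coding eager attention. The function names and the dispatch dict here are a simplified, standalone illustration, not the exact transformers internals or this PR's code.

```python
from typing import Callable, Dict

import torch
import torch.nn.functional as F


def eager_attention_forward(query, key, value, attention_mask, scaling):
    # query/key/value: [batch, num_heads, seq_len, head_dim]
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask  # additive 4D mask
    attn_weights = F.softmax(attn_weights, dim=-1)
    return torch.matmul(attn_weights, value), attn_weights


def sdpa_attention_forward(query, key, value, attention_mask, scaling):
    out = F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask, scale=scaling)
    return out, None


# The model picks the backend from config._attn_implementation instead of
# hard-coding a single attention implementation per model.
ATTENTION_FUNCTIONS: Dict[str, Callable] = {
    "eager": eager_attention_forward,
    "sdpa": sdpa_attention_forward,
}

q = k = v = torch.randn(1, 4, 6, 8)  # [batch, heads, seq, head_dim]
attn_output, attn_weights = ATTENTION_FUNCTIONS["sdpa"](q, k, v, attention_mask=None, scaling=8 ** -0.5)
print(attn_output.shape)  # torch.Size([1, 4, 6, 8])
```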