[deepspeed / m2m_100] make deepspeed zero-3 work with layerdrop #16717

stas00 · 2022-04-12T00:55:48Z

This is the same fix I had to make in wav2vec2, and it looks like it should eventually go into all models that use LayerDrop. At the moment DeepSpeed is not capable of randomly skipping layers, so this PR uses the same, now well-tested workaround I used in wav2vec2: all layers always run when DeepSpeed ZeRO-3 is detected, but a layer's results are ignored if it was meant to be skipped.

transformers/src/transformers/models/wav2vec2/modeling_wav2vec2.py

Lines 817 to 849 in 69233cf

    
    deepspeed_zero3_is_enabled = is_deepspeed_zero3_enabled()

    for layer in self.layers:
        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
        dropout_probability = np.random.uniform(0, 1)

        skip_the_layer = True if self.training and (dropout_probability < self.config.layerdrop) else False
        if not skip_the_layer or deepspeed_zero3_is_enabled:
            # under deepspeed zero3 all gpus must run in sync
            if self.gradient_checkpointing and self.training:
                # create gradient checkpointing function
                def create_custom_forward(module):
                    def custom_forward(*inputs):
                        return module(*inputs, output_attentions)

                    return custom_forward

                layer_outputs = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(layer),
                    hidden_states,
                    attention_mask,
                )
            else:
                layer_outputs = layer(
                    hidden_states, attention_mask=attention_mask, output_attentions=output_attentions
                )
            hidden_states = layer_outputs[0]

        if skip_the_layer:
            layer_outputs = (None, None)

Perhaps one day DeepSpeed will be able to randomly skip layers. For now, this solution is not the most efficient one, since layers that are meant to be skipped still run. I made a request.

When ZeRO-3 is not used, the original code path is taken.
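The control flow described above can be sketched without PyTorch or DeepSpeed. This is a toy illustration, not the transformers implementation: `run_layers`, the `zero3_enabled` flag, and the integer "layers" are hypothetical stand-ins, and the sketch assumes that a skipped layer's output is simply discarded.

```python
import random


def run_layers(layers, hidden, layerdrop, training, zero3_enabled, rng):
    """Toy LayerDrop loop; returns (final value, indices of executed layers)."""
    executed = []
    for idx, layer in enumerate(layers):
        # LayerDrop decision: only ever skip during training
        skip_the_layer = training and rng.random() < layerdrop
        if not skip_the_layer or zero3_enabled:
            # with the ZeRO-3 flag set, every layer is executed on every step
            output = layer(hidden)
            executed.append(idx)
            if not skip_the_layer:
                hidden = output  # the output of a skipped layer is ignored
    return hidden, executed


# hypothetical "layers": add 1, 2, 3 and 4 to the running value
layers = [lambda x, k=k: x + k for k in range(1, 5)]

hidden_zero3, executed_zero3 = run_layers(
    layers, 0, layerdrop=0.5, training=True, zero3_enabled=True, rng=random.Random(0)
)
hidden_plain, executed_plain = run_layers(
    layers, 0, layerdrop=0.5, training=True, zero3_enabled=False, rng=random.Random(0)
)
```

With the same seed both calls make the same skip decisions, so they end with the same value; the only difference is that the flagged run executes every layer, while the plain run executes only the layers that were not skipped.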

The test exercising this code path will be merged as part of the large additional-tests PR #12695, which is long overdue.

For posterity, the error for this issue will look something like:

RuntimeError: tracing error at step 42: expected the next 2 parameters in the parameter fetch queue to be 
({'id': 26, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}, {'id': 27, 'status': 'AVAILABLE', 'numel': 1024, 'ds_numel': 1024, 'shape': (1024,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {24}}) 
but got 
({'id': 115, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1024, 'shape': (0,), 'ds_shape': (1024,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': set()}, {'id': 116, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 1048576, 'shape': (0,), 'ds_shape': (1024, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set()}).
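
The wording of this message suggests that the order of parameter fetches was recorded on an earlier step and that a later step fetched parameters in a different order. Below is a toy model of that failure mode, assuming a simple "record once, then compare" scheme. `FetchTrace` and `run_step` are hypothetical names for illustration only and are not DeepSpeed APIs.

```python
class FetchTrace:
    """Hypothetical recorder: stores the first step's fetch order, checks later steps."""

    def __init__(self):
        self.recorded = None

    def run_step(self, fetched_ids):
        fetched = list(fetched_ids)
        if self.recorded is None:
            # the first step defines the expected fetch order
            self.recorded = fetched
            return
        if fetched != self.recorded:
            raise RuntimeError(
                f"tracing error: expected {self.recorded} but got {fetched}"
            )


trace = FetchTrace()
trace.run_step([0, 1, 2, 3])  # step 1: all four layers ran

try:
    trace.run_step([0, 2, 3])  # step 2: LayerDrop skipped layer 1
    mismatch = False
except RuntimeError:
    mismatch = True
```

Running every layer on every step, as this PR does under ZeRO-3, keeps the fetch order identical from step to step in this toy model.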

Fixes: #16688

@patil-suraj, @sgugger

HuggingFaceDocBuilderDev · 2022-04-12T01:09:33Z

The documentation is not available anymore as the PR was closed or merged.

src/transformers/models/m2m_100/modeling_m2m_100.py

patil-suraj

Thanks a lot for fixing this!

sgugger

Thanks for working on this!

[deepspeed / m2m_100] make deepspeed 3 work with layerdrop

2349e44

stas00 mentioned this pull request Apr 12, 2022

Cannot train M2M100 using run_translation.py and DeepSpeed ZeRO stage 3 #16688

Closed

stas00 self-assigned this Apr 12, 2022

stas00 added the DeepSpeed label Apr 12, 2022

stas00 changed the title ~~[deepspeed / m2m_100] make deepspeed 3 work with layerdrop~~ [deepspeed / m2m_100] make deepspeed zero-3 work with layerdrop Apr 12, 2022

stas00 added 2 commits April 11, 2022 19:38

fix

e127e4f

revert last

8236f6a

stas00 commented Apr 12, 2022

patil-suraj approved these changes Apr 12, 2022

sgugger approved these changes Apr 12, 2022

Merge remote-tracking branch 'origin/main' into ds-m2m-layerdrop

2709ba6

stas00 merged commit c21e107 into main Apr 14, 2022

stas00 deleted the ds-m2m-layerdrop branch April 14, 2022 13:51

elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022

[deepspeed / m2m_100] make deepspeed zero-3 work with layerdrop (hugg…

4f3c722

…ingface#16717) * [deepspeed / m2m_100] make deepspeed 3 work with layerdrop * fix * revert last
