Fix NJT linear_backward() memory usage by jbschlosser · Pull Request #141163 · pytorch/pytorch

Conversation

@jbschlosser (Contributor) commented on Nov 20, 2024

Stack from ghstack (oldest at bottom):

Fixes #141112

The formula we're using for `linear_backward()` is inefficient for higher-dim input sizes, even when the input is only trivially higher-dim (e.g. via `unsqueeze()`). This PR updates the formula to match the more efficient version employed by NST (the strided nested tensor implementation). Specifically, note the leading-dim collapse of `grad_output`'s values before the various matmuls are computed:
https://github.com/pytorch/pytorch/blob/d5ee1d1b581da8399d604bd661ea5fe454b485d6/aten/src/ATen/native/nested/NestedTensorBackward.cpp#L37-L70

std::tuple<Tensor, Tensor, Tensor> nested_linear_backward(
    const Tensor& input,
    const Tensor& grad_output,
    const Tensor& weight,
    std::array<bool, 3> output_mask) {
  if (!grad_output.defined()) {
    return std::tuple<Tensor, Tensor, Tensor>{Tensor(), Tensor(), Tensor()};
  }
  Tensor grad_input, grad_weight, grad_bias;
  auto grad_output_contiguous = grad_output.contiguous();
  auto* nt_grad_output = get_nested_tensor_impl(grad_output_contiguous);
  auto* nt_input = get_nested_tensor_impl(input);
  TORCH_INTERNAL_ASSERT(nt_grad_output != nullptr);
  TORCH_INTERNAL_ASSERT(nt_input != nullptr);
  TORCH_INTERNAL_ASSERT(nested_tensor_impl_is_contiguous(nt_grad_output));
  auto grad_output_buffer = nt_grad_output->get_buffer();
  auto input_buffer = nt_input->get_buffer();
  // Collapse all leading dims of grad_output's values into one so the matmuls
  // below operate on 2D tensors of shape (total_rows, out_features).
  auto reshaped_grad = grad_output_buffer.reshape({-1, weight.size(0)});
  if (output_mask[0]) {
    // grad_input = grad_output @ W, re-wrapped with the input's nested sizes.
    auto grad_input_buffer = at::mm(reshaped_grad, weight).view({-1});
    auto grad_input_nt_size = nt_input->get_nested_sizes().clone();
    grad_input = wrap_buffer(grad_input_buffer, grad_input_nt_size);
  }
  if (output_mask[1]) {
    // grad_weight = grad_output^T @ input, computed on the flattened buffers.
    grad_weight =
        at::mm(reshaped_grad.t(), input_buffer.reshape({-1, weight.size(1)}));
  }
  if (output_mask[2]) {
    // grad_bias = sum of grad_output over all collapsed leading dims.
    grad_bias = reshaped_grad.sum(0);
  }
  return std::tuple<Tensor, Tensor, Tensor>{grad_input, grad_weight, grad_bias};
}
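For intuition, here is a small dense-tensor check (not the actual kernel; the sizes are made up) showing that collapsing all leading dims before a single 2D matmul gives the same grad_weight as contracting over each leading dim explicitly:

import torch

# Made-up leading dims (B, T, U) and feature sizes; purely illustrative.
B, T, U, in_f, out_f = 2, 3, 5, 8, 4
inp = torch.randn(B, T, U, in_f)
grad_out = torch.randn(B, T, U, out_f)

# Naive formula: contract over every leading dim explicitly.
grad_w_naive = torch.einsum("btuo,btui->oi", grad_out, inp)

# Collapsed formula (what the code above does): flatten all leading dims,
# then perform a single 2D matmul.
grad_w_collapsed = grad_out.reshape(-1, out_f).t() @ inp.reshape(-1, in_f)

torch.testing.assert_close(grad_w_naive, grad_w_collapsed)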

Testing for correctness is done via the existing gradcheck tests (e.g. `test_backward_nn_functional_linear`). I added a memory usage test as well, though there is likely a better way to do this.
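For context, a minimal sketch of the kind of input the linked issue is about: an NJT made trivially higher-dim via `unsqueeze()` before the linear. The shapes below are assumptions (not taken from #141112), and op support may vary by PyTorch version; treat this as a sketch rather than a verified repro.

import torch
import torch.nn.functional as F

# Jagged NJT with logical shape (B, j1, D): a batch of variable-length sequences.
nt = torch.nested.nested_tensor(
    [torch.randn(s, 64) for s in (3, 5, 7)],
    layout=torch.jagged,
    requires_grad=True,
)
weight = torch.randn(64, 64, requires_grad=True)

# Trivially higher-dim input: (B, j1, D) -> (B, j1, 1, D).
out = F.linear(nt.unsqueeze(2), weight)

# The backward through linear is where the old NJT formula was inefficient;
# the fix collapses the leading dims of grad_output's values before the matmuls.
out.values().sum().backward()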

@pytorch-bot (bot) commented on Nov 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141163

Note: Links to docs will display an error until the docs builds have been completed.


❌ 1 New Failure

As of commit ddfa80b with merge base a440a01:

NEW FAILURE - The following job has failed:

  • linux-binary-manywheel / manywheel-py3_9-cuda12_6-test / test (gh)
    RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jbschlosser added the "topic: bug fixes" (topic category) and "release notes: nested tensor" (Changes that have a direct impact on nested tensors) labels on Nov 20, 2024
@jbschlosser requested a review from cpuhrsch on November 20, 2024 at 20:42
@jbschlosser (Contributor, Author) commented on Nov 20, 2024

Discussed offline: reset the max memory stat via `torch.cuda.reset_max_memory_allocated()` and measure the max afterwards. If it's too high (in practice, I see over 3 GB allocated during the backward call), fail the test. Assuming this stat is process-isolated, this should work fine (we don't run CI tests multi-threaded, only multi-process). If the test fails later on, we can revisit this, but at least the fix is in :)
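Roughly, that approach might look like the sketch below (the helper name, sizes, and threshold are assumptions for illustration; the actual test added in the PR may differ):

import torch
import torch.nn.functional as F

def check_njt_linear_backward_peak_memory(max_allowed_bytes=2 * 1024**3):
    # Build a reasonably large jagged NJT on CUDA so a regression is visible.
    nt = torch.nested.nested_tensor(
        [torch.randn(s, 128) for s in (1024, 2048, 4096)],
        layout=torch.jagged,
        device="cuda",
        requires_grad=True,
    )
    weight = torch.randn(128, 128, device="cuda", requires_grad=True)
    out = F.linear(nt.unsqueeze(2), weight)

    # Reset the peak-allocation stat so only the backward call is measured.
    torch.cuda.reset_max_memory_allocated()
    out.values().sum().backward()

    # With the old formula, over 3 GB was reportedly allocated here; fail if
    # the peak gets anywhere near that again.
    peak = torch.cuda.max_memory_allocated()
    assert peak < max_allowed_bytes, f"peak memory too high: {peak} bytes"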

@jbschlosser (Contributor, Author) commented:

@pytorchbot merge

@pytorch-bot added the "ciflow/trunk" (Trigger trunk jobs on your pull request) label on Nov 20, 2024
@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status on the linked workflow run.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 mandatory check(s) failed.

Dig deeper by viewing the failures on HUD.

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

jbschlosser added a commit that referenced this pull request Nov 20, 2024
ghstack-source-id: 3785f3a
Pull Request resolved: #141163
@jbschlosser (Contributor, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status on the linked workflow run.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 job has failed: linux-binary-manywheel / manywheel-py3_9-cuda12_6-test / test

Details for Dev Infra team: raised by workflow job.

@jbschlosser (Contributor, Author) commented:

@pytorchbot merge -i

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged while ignoring the following 1 check: linux-binary-manywheel / manywheel-py3_9-cuda12_6-test / test

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status on the linked workflow run.

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Pull Request resolved: pytorch#141163
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch, https://github.com/soulitzer
@github-actions bot deleted the gh/jbschlosser/202/head branch on December 22, 2024 at 02:10

Labels: ciflow/trunk (Trigger trunk jobs on your pull request) · Merged · release notes: nested tensor (Changes that have a direct impact on nested tensors) · topic: bug fixes (topic category)
