KEMBAR78
[pipelining] allow multiple backward grads by H-Huang · Pull Request #140981 · pytorch/pytorch · GitHub
Skip to content

Conversation

@H-Huang
Copy link
Member

@H-Huang H-Huang commented Nov 18, 2024

Stack from ghstack (oldest at bottom):

fixes #139404. The input grads get saved in a new self.bwd_cache container and get popped off after they are used in backward_one_chunk

python test/distributed/pipelining/test_schedule_multiproc.py -k test_pipeline_schedule_runtime_custom_sched

cc @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140981

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 9b5b50d with merge base b379a28 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

  • linux-binary-manywheel / manywheel-py3_9-cuda12_6-test / test (gh) (similar failure)
    RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

H-Huang added a commit that referenced this pull request Nov 18, 2024
ghstack-source-id: 57fdd80
Pull Request resolved: #140981
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Nov 18, 2024
@H-Huang H-Huang added the release notes: distributed (pipeline) release notes category label Nov 18, 2024
@H-Huang H-Huang marked this pull request as ready for review November 19, 2024 20:22
@H-Huang H-Huang requested review from kwen2501 and wconstab and removed request for kwen2501 November 19, 2024 20:22
_Action(2, F, 1),
None,
None,
_Action(2, B, 0),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm probably just missing something but isn't this schedule ordered in the usual order, i.e. not reordered B?

the usual order would execute B's in increasing microbatch_id sequence right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, I realized the test is not good because the error would only surface if we use send for mb A after we did a backward for mb B, which the test is not catching. So i updated it to reorder the send operations in the new schedule

Copy link
Contributor

@wconstab wconstab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code change LGTM. just want to make sure the test case tests what it's supposed to

fixes #139404. The input grads get saved in a new `self.bwd_cache` container and get popped off after they are used in `backward_one_chunk`

`python test/distributed/pipelining/test_schedule_multiproc.py -k test_pipeline_schedule_runtime_custom_sched`

cc awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
@H-Huang
Copy link
Member Author

H-Huang commented Nov 22, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 22, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

fixes #139404. The input grads get saved in a new `self.bwd_cache` container and get popped off after they are used in `backward_one_chunk`

`python test/distributed/pipelining/test_schedule_multiproc.py -k test_pipeline_schedule_runtime_custom_sched`

cc awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Nov 22, 2024
ghstack-source-id: 2a976f7
Pull Request resolved: #140981
@H-Huang
Copy link
Member Author

H-Huang commented Nov 22, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this pull request Dec 2, 2024
fixes pytorch#139404. The input grads get saved in a new `self.bwd_cache` container and get popped off after they are used in `backward_one_chunk`

`python test/distributed/pipelining/test_schedule_multiproc.py -k test_pipeline_schedule_runtime_custom_sched`

Pull Request resolved: pytorch#140981
Approved by: https://github.com/wconstab
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
fixes pytorch#139404. The input grads get saved in a new `self.bwd_cache` container and get popped off after they are used in `backward_one_chunk`

`python test/distributed/pipelining/test_schedule_multiproc.py -k test_pipeline_schedule_runtime_custom_sched`

Pull Request resolved: pytorch#140981
Approved by: https://github.com/wconstab
@github-actions github-actions bot deleted the gh/H-Huang/155/head branch December 23, 2024 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (pipeline) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants