KEMBAR78
fix sequence number for group by GSSBMW · Pull Request #134578 · pytorch/pytorch · GitHub
Skip to content

Conversation

@GSSBMW
Copy link

@GSSBMW GSSBMW commented Aug 27, 2024

Summary:
Fix sequence number in execution trace dump for matching between collective/p2p op and wait in execution trace replay.

ProcessGroupNCCL has 2 sequence number counter, seqCollective_ and seqP2P_.

uint64_t seqCollective_{0};
// Counting for the sequential number of NCCL P2P calls.
uint64_t seqP2P_{0};

However, WorkNCCL only has one sequence number member seq_.
We need to match collective and p2p with wait separately.
facebookresearch/param@29b5a46

Depend on: #135132

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision:

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134578

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1ad6e7a with merge base de4c2a3 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Aug 27, 2024
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Aug 27, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: GSSBMW / name: Sanshan Gao (1ad6e7a)

@GSSBMW
Copy link
Author

GSSBMW commented Aug 27, 2024

@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Aug 27, 2024
@GSSBMW
Copy link
Author

GSSBMW commented Sep 16, 2024

@shengfukevin Has updated and depend on #134619. Please help review!

Copy link
Contributor

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR adds isP2P_ to WorkNCCL. Does WorkNCCL use it somewhere?

@kwen2501
Copy link
Contributor

Do you mind adding in the PR description what issue this PR fixes and how it fixes the issue?
Thanks a lot!

@GSSBMW
Copy link
Author

GSSBMW commented Sep 18, 2024

The PR adds isP2P_ to WorkNCCL. Does WorkNCCL use it somewhere?
Do you mind adding in the PR description what issue this PR fixes and how it fixes the issue?

Hi, @kwen2501 I'm working with @shengfukevin on Execution Trace replay.
Add isP2P_ to WorkNCCL is for matching p2p op and work.wait. facebookresearch/param#180
Has updated the description.

Comment on lines 720 to 722
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I see. You put isP2P_ into work because of this RECORD_PARAM_COMMS call in work.wait().
Do you think we can infer isP2P from work.opType? (i.e. when it is SEND or RECV?). Or do you think there isn't a mapping at the moment?

Copy link
Contributor

@kwen2501 kwen2501 Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this example work for you?

bool isP2P = isP2POp(opType);

Copy link
Author

@GSSBMW GSSBMW Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't because we can't deduce whether it's collective or p2p from OpType::COALESCED .

return endCoalescing(OpType::COALESCED);

group._start_coalescing(device)

Comment on lines 2261 to +2294
Copy link
Contributor

@kwen2501 kwen2501 Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious -- is there a corresponding change to RECORD_PARAM_COMMS that accepts this new argument type?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's here.
I can't get the code position link of this PR, so put screen shot here.
image

Copy link
Contributor

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Giving approval to unblock.
Please be advised that we cannot guarantee compatibility going forward with the trace player you mentioned -- we may change the sequence number as we see fit.

Comment on lines -2431 to +2469
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fduwjj @c-p-i-o do you think we can consolidate the bool isP2P = isP2POp(opType) logic below with this change? It may not be so trivial though. See this comment: https://github.com/pytorch/pytorch/pull/134578/files#r1764430474

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I can try to address this in a subsequent change.

@GSSBMW
Copy link
Author

GSSBMW commented Sep 23, 2024

Thanks, @kwen2501 @shengfukevin
Then, what's the next step for merging?

@kwen2501
Copy link
Contributor

It seems there are lint issues. Can you run
lintrunner -a
locally to fix them?

If CI looks good, you can hit:
pytorchbot merge

@GSSBMW
Copy link
Author

GSSBMW commented Oct 9, 2024

Hi, @kwen2501
I have fixed the issues of lint and CLA, rebase from main. Still this blocking issue. How should I fix it? Or need your approve again? Thanks!!
image

@GSSBMW
Copy link
Author

GSSBMW commented Oct 9, 2024

cc+ @shengfukevin @c-p-i-o

@c-p-i-o
Copy link
Contributor

c-p-i-o commented Oct 9, 2024

cc+ @shengfukevin @c-p-i-o

Can you run:
"@pytorchbot merge"

This should get this PR into the main branch after CI passes.

@GSSBMW
Copy link
Author

GSSBMW commented Oct 9, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 9, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

shengfukevin added a commit to shengfukevin/param that referenced this pull request Oct 28, 2024
Summary: The recent DIFF pytorch/pytorch#134578 change the sequence id in ET from sequence id ( an integer) to a [sequence id, is_p2p_op], an array of two integers. The replay code need to make the similar change.

Differential Revision: D65099031
facebook-github-bot pushed a commit to facebookresearch/param that referenced this pull request Oct 29, 2024
Summary:
Pull Request resolved: #185

The recent DIFF pytorch/pytorch#134578 change the sequence id in ET from sequence id ( an integer) to a [sequence id, is_p2p_op], an array of two integers. The replay code need to make the similar change.

Reviewed By: briancoutinho

Differential Revision: D65099031

fbshipit-source-id: 91174f27cf3ced1d25d0c11748adea88b87bdc03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants