KEMBAR78
[C10D] Add better profiling title for NCCL barrier, nccl:all_reduce to nccl:all_reduce_barrier by mori360 · Pull Request #140785 · pytorch/pytorch · GitHub
Skip to content

Conversation

@mori360
Copy link
Contributor

@mori360 mori360 commented Nov 15, 2024

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Nov 15, 2024
@pytorch-bot
Copy link

pytorch-bot bot commented Nov 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140785

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 84058f6 with merge base 452e1a7 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link
Contributor

@wconstab wconstab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the profiling title for all allreduce operations, which is not correct. We should only change the title for barrier operations. It might be trickier to do the correct thing.


// All reduce to achieve the barrier
auto work = allreduce_impl(barrierTensor);
auto work = allreduce_impl<true>(barrierTensor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does this work? Its nice to avoid passing mysterious booleans without names in c++ using comments auto work = allreduce_impl</*isBarrier*/true>(barrierTensor);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question on top of Will's comment. How do you verify your change is working?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I am searching the test methods, would update when test results are ready.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProfilingTitle should be used in multiple places in UTs? add a barrier call there to verify? And you can check the github issue which originally complains about the profileTitle being confusing, and re-pro the same test?

@kwen2501
Copy link
Contributor

No need to use template. You can just extend the allreduce_impl function to take an argument with default value:
In ProcessGroupNCCL.hpp:

  c10::intrusive_ptr<Work> allreduce_impl(
      at::Tensor& tensor,
      const AllreduceOptions& opts = AllreduceOptions(),
      bool isBarrier = false);

@kwen2501
Copy link
Contributor

@fduwjj @c-p-i-o Would a change of profiling title impacts Flight Recorder's analysis?

@mori360
Copy link
Contributor Author

mori360 commented Nov 15, 2024

No need to use template. You can just extend the allreduce_impl function to take an argument with default value: In ProcessGroupNCCL.hpp:

  c10::intrusive_ptr<Work> allreduce_impl(
      at::Tensor& tensor,
      const AllreduceOptions& opts = AllreduceOptions(),
      bool isBarrier = false);

Thank you for the review.
There's already an optional argument opts with default value.
c10::intrusive_ptr<Work> allreduce_impl(at::Tensor& tensor, const AllreduceOptions& opts = AllreduceOptions());
auto work = allreduce_impl<true>(barrierTensor); here just takes the default opts
Not sure if adding the second optional argument would be good.

#endif
}

template <bool IsBarrier>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: start with lowerCase variable name (isBarrier) to be consistent with other variable definition?

@kwen2501
Copy link
Contributor

I am not sure if C++ does not support multiple optional arguments.
If specifying true / false is easier, we can put isBarrier as the first optional argument, and AllreduceOptions the second.
Then it would be valid to call: allreduce_impl(barrierTensor, true).

Using template here doubles the binary size for a very minor difference, that's why I would prefer argument form.

@mori360
Copy link
Contributor Author

mori360 commented Dec 10, 2024

I am not sure if C++ does not support multiple optional arguments. If specifying true / false is easier, we can put isBarrier as the first optional argument, and AllreduceOptions the second. Then it would be valid to call: allreduce_impl(barrierTensor, true).

Using template here doubles the binary size for a very minor difference, that's why I would prefer argument form.

Discussed offline, to apply an extra "profiling name" input, and order it as the first optional input

@mori360 mori360 marked this pull request as ready for review December 10, 2024 18:18
@mori360 mori360 requested a review from wconstab December 10, 2024 18:18
@mori360 mori360 changed the title Change nccl:all_reduce to nccl:all_reduce_barrier [C10D] Add better profiling title for NCCL barrier, nccl:all_reduce to nccl:all_reduce_barrier Dec 10, 2024
Copy link
Contributor

@wconstab wconstab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the updates, lgtm

@mori360
Copy link
Contributor Author

mori360 commented Dec 10, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 10, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C10D] Add better profiling title for NCCL barrier

6 participants