feat: Add support for benchmarking individual gemms in MOE benchmark #6080
Conversation
Force-pushed from e79ba96 to d723c5c
Force-pushed from d723c5c to 9b53b24
/bot run

PR_Github #12026 [ run ] triggered by Bot
PR_Github #12026 [ run ] completed with state
Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
Force-pushed from 9b53b24 to 98a9407
Walkthrough

The updates introduce a new profiling mode for the Mixture of Experts (MoE) backend benchmark, allowing selective profiling of individual GEMM operations or the full layer. The benchmark configuration generation script and fixture are refactored to support this, with new parameters, buffer management, and tactic selection logic. Documentation and help text are updated with cautionary disclaimers.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant BenchmarkLauncher
    participant BenchmarkFixture
    participant GemmProfilerBackend
    User->>BenchmarkLauncher: Provide benchmark config (incl. gemm_to_profile)
    BenchmarkLauncher->>BenchmarkFixture: Initialize benchmark with config
    BenchmarkFixture->>BenchmarkFixture: Allocate buffers (based on gemm_to_profile)
    BenchmarkFixture->>GemmProfilerBackend: Prepare profiler (if profiling GEMM_1 or GEMM_2)
    BenchmarkFixture->>BenchmarkFixture: Select tactics (per GEMM)
    loop For each benchmark iteration
        BenchmarkFixture->>GemmProfilerBackend: Run selected GEMM (if GEMM_1 or GEMM_2)
        BenchmarkFixture->>BenchmarkFixture: Or run full MoE layer (if LAYER)
    end
    BenchmarkFixture-->>BenchmarkLauncher: Report results
    BenchmarkLauncher-->>User: Output benchmark summary
```
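As a rough illustration of what the new selector looks like from the user's side, the sketch below builds one benchmark configuration entry. Apart from `gemm_to_profile` (GEMM_1 / GEMM_2 / LAYER, as in the diagram above), the field names and values are assumptions for the example rather than the exact schema produced by gen-moe-benchmark-file.py.

```python
# Sketch of a single benchmark config entry carrying the new profiling selector.
# Only "gemm_to_profile" is taken from this PR's description; the remaining
# keys and values are assumed for illustration and may not match the real schema.
import json

config = {
    "num_experts": 128,
    "k": 8,
    "hidden_size": 4096,
    "inter_size": 14336,
    "tp_size": 2,
    "ep_size": 4,
    "gemm_to_profile": "GEMM_1",  # one of GEMM_1, GEMM_2, LAYER
}

print(json.dumps([config], indent=2))
```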
/bot run
Actionable comments posted: 2
♻️ Duplicate comments (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1)
848-851: Verify the expert count calculation change for tensor parallelism scenarios.

The removal of `tp_size` from the `mNumExpertsPerNode` calculation appears correct given that `inter_size` is already divided by `tp_size`. However, this needs verification to ensure it doesn't cause issues in TP>1 scenarios. Please confirm that this change:
- Correctly calculates the number of experts per node for all TP/EP configurations
- Addresses the autotuner issues mentioned in the past review comments for TP>1
```bash
#!/bin/bash
# Search for other usages of mNumExpertsPerNode to understand the impact
rg -A 5 -B 5 "mNumExpertsPerNode" --type cpp

# Look for test cases or benchmarks that validate expert count calculations
fd -e cpp -e cu | xargs rg -l "GemmProfilerBackend.*init" | head -20
```
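To make the arithmetic in this comment concrete, here is a minimal sketch assuming the usual MoE split, where experts are sharded across expert-parallel (EP) ranks and `inter_size` across tensor-parallel (TP) ranks; the names and values are illustrative, not the actual C++ members.

```python
# Minimal sketch of the expert-count arithmetic under discussion, assuming the
# usual MoE split: experts are partitioned across expert-parallel (EP) ranks,
# while inter_size is partitioned across tensor-parallel (TP) ranks.
def experts_per_node(num_experts: int, ep_size: int) -> int:
    # TP does not reduce the expert count, because inter_size is already
    # divided by tp_size elsewhere; only EP shards the experts.
    assert num_experts % ep_size == 0
    return num_experts // ep_size

def inter_size_per_rank(inter_size: int, tp_size: int) -> int:
    assert inter_size % tp_size == 0
    return inter_size // tp_size

# Example: 128 experts, EP=4, TP=2 -> 32 experts per node, inter_size halved.
print(experts_per_node(128, 4), inter_size_per_rank(14336, 2))
```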
🧹 Nitpick comments (2)
cpp/micro_benchmarks/gen-moe-benchmark-file.py (1)
72-101: Configuration generation logic looks good with comprehensive parameter coverage.

The nested loops effectively test different parallelism configurations, and the conditional routing strategy is appropriate. The `gemm_to_profile` parameter correctly iterates over all profiling modes.

Consider logging when configurations are skipped due to alignment issues:
```diff
 if inter_size % (tp_size * 128) != 0:
+    print(f"Skipping config: ep_size={ep_size}, tp_size={tp_size}, inter_size={inter_size} - insufficient alignment")
     continue  # Insufficient alignment
```
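For context, the generation pattern under review might resemble the following sketch; the real script's loop ranges, emitted fields, and routing logic may differ, so treat everything here as an assumption apart from the alignment guard quoted above.

```python
# Illustrative sketch of a nested parallelism sweep with an alignment guard;
# the actual gen-moe-benchmark-file.py loop bounds and fields may differ.
configs = []
for ep_size in (1, 2, 4, 8):
    for tp_size in (1, 2, 4):
        for gemm_to_profile in ("GEMM_1", "GEMM_2", "LAYER"):
            inter_size = 14336  # assumed value for the example
            if inter_size % (tp_size * 128) != 0:
                print(f"Skipping config: ep_size={ep_size}, tp_size={tp_size}, "
                      f"inter_size={inter_size} - insufficient alignment")
                continue  # Insufficient alignment
            configs.append({
                "ep_size": ep_size,
                "tp_size": tp_size,
                "inter_size": inter_size,
                "gemm_to_profile": gemm_to_profile,
            })

print(f"Generated {len(configs)} configs")
```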
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (1)

955-1009: Well-structured profiling dispatch logic.

The switch statement cleanly separates the profiling logic for individual GEMMs vs full layer execution.
Consider refactoring the duplicate MoE runner calls to reduce code duplication:
```diff
+auto runMoeLayer = [&]() {
+    mMoERunner.runMoe(mInputTensor + mInputTensorSize * mBufferIndex, nullptr,
+        mSelectedExperts + mSelectedExpertsSize * mBufferIndex,
+        mUseFinalScale ? mScaleProbs + mScaleProbsSize * mBufferIndex : nullptr,
+        mExpertWeight1 + mExpertWeight1Size * mBufferIndex, mExpertBias1 + mExpertBias1Size * mBufferIndex,
+        mActType, mExpertWeight2 + mExpertWeight2Size * mBufferIndex,
+        mExpertBias2 + mExpertBias2Size * mBufferIndex, mQuantParams[mBufferIndex], mTotalTokens, mHiddenSize,
+        mInterSize, mNumExperts, mK, mWorkspace + mWorkspaceSize * mBufferIndex,
+        mFinalOutput + mFinalOutputSize * mBufferIndex,
+        mSourceToExpandedMap + mSourceToExpandedMapSize * mBufferIndex, parallelism_config,
+#ifdef USING_OSS_CUTLASS_MOE_GEMM
+        /*enable_alltoall=*/false,
+#endif
+        mUseLora, mLoraParams[mBufferIndex],
+        /*use_fp8_block_scaling=*/false, /*min_latency_mode=*/false, min_latency_params, stream);
+};
+
 case GemmToProfile::LAYER:
 {
     auto stream = streamPtr->get();
     MoeMinLatencyParams min_latency_params;
-#ifdef USING_OSS_CUTLASS_MOE_GEMM
-    mMoERunner.runMoe(...);
-#else
-    mMoERunner.runMoe(...);
-#endif
+    runMoeLayer();
     break;
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- cpp/micro_benchmarks/README.md (1 hunks)
- cpp/micro_benchmarks/gen-moe-benchmark-file.py (2 hunks)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (19 hunks)
- cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (10 hunks)
- cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/micro_benchmarks/README.md (1)
14-16: Good addition of cautionary disclaimer.

The disclaimer appropriately warns users about the experimental nature of this benchmark tool and sets proper expectations.
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkLauncher.cu (1)
632-634: Excellent documentation updates.

The disclaimer and parameter documentation are clear and comprehensive, helping users understand the experimental nature of the benchmark and how to use the new profiling features.
Also applies to: 662-662, 703-703, 720-721
cpp/micro_benchmarks/mixtureOfExpertsBackendBenchmarkFixture.h (2)
74-79: Clean enum definition for profiling modes.

The `GemmToProfile` enum properly maps to the backend profiler types and provides a clear interface for selecting profiling targets.
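For readers unfamiliar with the pattern, a selector of this kind can be sketched as below; the actual enum lives in C++ in the fixture, and its exact enumerator names, values, and mapping onto the backend profiler types may differ from this assumed illustration.

```python
# Assumed, simplified illustration of a GEMM-selection enum; the fixture's real
# GemmToProfile is defined in C++ and may use different enumerators or values.
from enum import Enum

class GemmToProfile(Enum):
    GEMM_1 = 1  # profile only the first expert GEMM
    GEMM_2 = 2  # profile only the second expert GEMM
    LAYER = 3   # run and time the full MoE layer

def parse_gemm_to_profile(name: str) -> GemmToProfile:
    # Accepts e.g. "gemm_1", "GEMM_2", "layer".
    return GemmToProfile[name.upper()]

print(parse_gemm_to_profile("gemm_1"))
```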
577-738: Efficient buffer allocation based on profiling mode.

The conditional buffer allocation based on `gemm_to_profile` is well-implemented and reduces memory usage when profiling individual GEMMs. The `padSize` helper ensures proper memory alignment.
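For intuition, an alignment-padding helper of this kind typically rounds a byte count up to the next multiple of the alignment; the sketch below is an assumed Python rendering, not the fixture's actual `padSize` implementation.

```python
# Assumed illustration of an alignment-padding helper: round a size up to the
# next multiple of `alignment`. The fixture's padSize is C++ and may use a
# different default alignment.
def pad_size(num_bytes: int, alignment: int = 128) -> int:
    assert alignment > 0
    return ((num_bytes + alignment - 1) // alignment) * alignment

assert pad_size(1000) == 1024  # rounded up to the next 128-byte boundary
assert pad_size(1024) == 1024  # already aligned, unchanged
```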
PR_Github #12117 [ run ] triggered by Bot
PR_Github #12117 [ run ] completed with state

/bot run

PR_Github #12133 [ run ] triggered by Bot
PR_Github #12133 [ run ] completed with state

/bot run

PR_Github #12154 [ run ] triggered by Bot
PR_Github #12154 [ run ] completed with state
…VIDIA#6080) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com>
…VIDIA#6080) Signed-off-by: Daniel Stokes <40156487+djns99@users.noreply.github.com> Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
Summary by CodeRabbit

- New Features
- Bug Fixes
- Documentation
- Chores