-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[None] [feat] nsys profile output kernel classifier #7020
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Grace Ho <grho@nvidia.com>
📝 WalkthroughWalkthroughAdds a new CLI tool that converts NVIDIA Nsight Systems CUDA traces into non-overlapped per-kernel GPU timing summaries, categorizes kernels via engine/model JSON mappings, and produces CSV and HTML visualizations; includes a JSON mapping file and README usage docs. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User
participant Tool as gputrc2graph.py
participant NSYS as nsys (stats)
participant FS as Filesystem
participant Plot as Plotly/HTML
User->>Tool: run --in_file <nsys>,<engine>,<model>,<total_sec> --out_dir --title [--nsys_cmd]
Tool->>FS: load engine/model JSON mappings
loop per input tuple
Tool->>FS: check for cached summed CSV
alt cached up-to-date
Tool->>FS: read summed CSV
else need generation
Tool->>NSYS: run nsys stats to produce raw CUDA trace CSV
NSYS-->>Tool: raw GPU trace CSV
Tool->>Tool: compute non-overlapping kernel intervals & aggregate
Tool->>FS: write per-file summed CSV
end
Tool->>Tool: annotate kernels with categories
Tool->>Tool: compute/add non-GPU row (total_sec - GPU_time)
end
Tool->>FS: write combined result.csv
Tool->>Plot: build histogram and pivot table HTML
Plot-->>FS: write result.html
Tool-->>User: emit result.csv and result.html
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
🧹 Nitpick comments (16)
examples/profiler/nsys_profile_tools/README.md (4)
63-67
: Minor wording: “till” → “until”; tighten descriptionNit to improve clarity and tone.
- - DELAY: how many seconds to delay nsys from collecting profiles, needed so - that profiles aren't captured till trtllm server has come up and load - generation starts. + - DELAY: how many seconds to delay nsys before collecting profiles (to avoid + capturing startup), until the trtllm server is up and load generation begins.
83-86
: Tighten phrasing and fix minor grammar in outputs sectionSmall clarity and grammar fixes for the outputs and description of the HTML view.
-- result.html: this categorizes kernel names into different categories in a - stacked bar chart. +- result.html: a stacked bar chart categorizing kernel names into groups. - result.csv: shows how the kernel names are mapped to the different categories. -The html file shows the number of elapsed seconds due to different GPU -Substages or categories, which consist of moe_gemm as the biggest -category, at .14 seconds, followed by "attn" kernels. This lets the user -prioritize the kernels to focus on for performance optimizations. +The HTML file shows elapsed seconds by category (e.g., "moe_gemm", "attn"), +helping you quickly prioritize optimization targets. -There's also an appended data table underneath the bar chart for copying out to - other post-processing tools. +There is also a data table appended below the chart for easy copy/paste into +other tools.Also applies to: 90-94, 97-101
124-127
: Grammar nit: “stack” → “stacked”; simplify sentence-The analysis process is similar to example 1 but now there will be multiple -stack bar charts that can be compared. The categories for the different +The analysis process is similar to Example 1, but now there will be multiple +stacked bar charts that can be compared. The categories for the different kernels will remain the same, so that it's easy to compare the GPU cycles for the same categories.
141-146
: Fix typo (“kernelshave”) and tighten instructions-Then, for this new model, suppose there are 4 kernels to be classified into -"gemm" and "attn", where the gemm kernelshave names with "*H*" or "*I*" in -them, and attn kernels have names with "*J*" or "*K*" in them, just add another - .json file in the same directory as gputrc2graph.py with the same format as - the other json files, like the following: +Then, for this new model, suppose there are 4 kernels to be classified into +"gemm" and "attn". If gemm kernels have names containing "*H*" or "*I*", and +attn kernels contain "*J*" or "*K*", add another .json file in the same +directory as gputrc2graph.py with the same format as the other json files:examples/profiler/nsys_profile_tools/gputrc2graph.py (11)
1-1
: Update copyright year to include 2025Per coding guidelines, prepend NVIDIA copyright header with the current year.
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
15-19
: Docstring: fix report name, punctuation, and styleThe opening docstring references
cuda_gpu_kern_trace
, but the code usescuda_gpu_trace
. Also, adopt a concise summary line with closing punctuation.-""" - This generates gpu kernel analysis output from nsys rep. Will call nsys - stats -r cuda_gpu_kern_trace, get non-overlapped gpu cycles, then generate - csv and html output for analysis -""" +""" +Generate GPU kernel analysis outputs from Nsight Systems (.nsys-rep) traces. + +This module calls `nsys stats -r cuda_gpu_trace` to export the CUDA GPU kernel +trace CSV, computes a non-overlapped elapsed time per kernel name, and then +emits a CSV summary and an HTML stacked bar chart for analysis. +"""
25-26
: Consider using built-inre
or documenting theregex
dependencyYou’re importing the third-party
regex
module asre
. Unless you rely on advanced features fromregex
, the built-inre
suffices and avoids an extra dependency. If you keepregex
, ensure it’s documented in the README (I added a note there).-import regex as re +import re
31-44
: Robust JSON loader: deep-merge engines and add basic validationIf multiple JSON files define the same engine key,
dict.update
overwrites the earlier one. Prefer deep-merge so multiple files can contribute models under the same engine. Also, validate JSON shape and log file names for traceability.def load_engine_model(): - """returns engine_model built from all json files in the current dir""" + """Return the combined engine→model mapping from all JSON files in this directory.""" import glob import json engine_model = {} json_files = glob.glob( os.path.join(os.path.dirname(__file__) or ".", "*.json")) for fname in json_files: - with open(fname, encoding="utf-8") as f: - engine_model.update(json.load(f)) + with open(fname, encoding="utf-8") as f: + data = json.load(f) + if not isinstance(data, dict): + logger.warning("Skipping %s: top-level JSON must be an object.", fname) + continue + # Deep-merge engines + for eng, models in data.items(): + if not isinstance(models, dict): + logger.warning("Skipping %s: engine %r must map to an object.", fname, eng) + continue + engine_model.setdefault(eng, {}) + engine_model[eng].update(models) + logger.info("Loaded engine/model definitions from %s", fname) return engine_model
51-56
: Guard pandas copy-on-write for older pandas versions
pd.options.mode.copy_on_write
exists in pandas ≥ 2.0. On older versions this raises AttributeError.def __init__(self): import pandas as pd # avoid importing till needed self.pd = pd - self.pd.options.mode.copy_on_write = True + try: + # pandas ≥ 2.0 + self.pd.options.mode.copy_on_write = True + except Exception: + # Best-effort; safe to ignore for older pandas + pass
120-170
: Avoid temp file in CWD; write temp HTML in output_dir and stack bars explicitly
- Writing “temp.html” in CWD can conflict across runs. Use a NamedTemporaryFile in output_dir.
- Ensure stacked bars via
barmode="stack"
for clarity.def make_html(self, df, output_dir, title): """make html graph from df""" import plotly.express as px + import tempfile if df.empty: return output_name = os.path.join(output_dir, "result") if not title: title = "Model_Engine" x = "Model_Engine" y = "Elapsed Time (sec)" color = "Category" """ generate kernel mapping table """ # Sort Model_Engine categories by last field after underscore df["Model_Engine"] = self.pd.Categorical( df["Model_Engine"], sorted(df["Model_Engine"].unique(), key=lambda x: x.split("_")[-1]), ) df[["Model_Engine", color, "Instances", "Name", y]].sort_values(by=color).to_csv(f"{output_name}.csv", index=False) graph = px.histogram( df.round(2), x=x, y=y, title=(f"{y} for {title}"), color=color, text_auto=True, ) # wrap x axis labels graph.update_xaxes(automargin=True) + graph.update_layout(barmode="stack") graph.write_html(f"{output_name}.html") """ Generate data table with columns per Model_Engine into result.html """ pivot_df = df.pivot_table( values="Elapsed Time (sec)", index="Category", columns="Model_Engine", aggfunc="sum", observed=False, ).round(2) # Add sum row at bottom pivot_df.loc["total_elapsed_sec"] = pivot_df.sum() - pivot_df.fillna("").to_html("temp.html") - with ( - open(f"{output_name}.html", "a", encoding="utf-8") as outfile, - open("temp.html", encoding="utf-8") as infile, - ): - outfile.write(infile.read()) - os.remove("temp.html") + with tempfile.NamedTemporaryFile( + mode="w+", suffix=".html", dir=output_dir, delete=False, encoding="utf-8" + ) as tmp: + pivot_df.fillna("").to_html(buf=tmp.name) + tmp.flush() + with open(f"{output_name}.html", "a", encoding="utf-8") as outfile, \ + open(tmp.name, encoding="utf-8") as infile: + outfile.write(infile.read()) + os.remove(tmp.name) - print(f"Finished generating: \n" - f" {output_name}.html for stack bar chart \n" - f" {output_name}.csv for Kernel-Category mapping") + logger.info("Finished generating: %s.html (stacked bar chart)", output_name) + logger.info("Finished generating: %s.csv (Kernel-Category mapping)", output_name)
175-184
: Precompile regex patterns once to speed up classificationOptional perf improvement: compile patterns per mapping once instead of re.search per row per pattern. This matters if the number of kernel names grows.
def anno_gpu_kernname(self, df, mapping): - """add "Category" column""" - - def anno_gpu_kernname_helper(name): - for kern_name, val in mapping.items(): - if re.search(kern_name, name): - return val - - df["Category"] = df["Name"].apply(anno_gpu_kernname_helper) + """Add a 'Category' column by matching kernel names against regex patterns.""" + compiled = [(re.compile(pat), val) for pat, val in mapping.items()] + def classify(name: str): + for pat, val in compiled: + if pat.search(name): + return val + return None + df["Category"] = df["Name"].apply(classify)
198-209
: Docstring punctuation and reuse messageMinor fix to docstring punctuation; logic looks fine.
def should_gen_file(self, new_file, base_file): - """figure out if new file should be generated from base_file""" + """Return True if new_file should be (re)generated from base_file.""" self.is_valid_file(base_file) if (os.path.exists(new_file) and (os.path.getmtime(new_file) > os.path.getmtime(base_file)) and (os.path.getsize(base_file) > 0)): logger.info("reusing %s", new_file) return False else: logger.info("generating %s", new_file) return True
298-300
: Be more tolerant of whitespace in --in_file tuplesStrip parts to avoid subtle parsing issues.
def parse_tuple(s): - return tuple(s.split(",")) + return tuple(part.strip() for part in s.split(","))
303-313
: Help text example: fix engine name typo“sglan” seems unintentional; use “trtllm” to match the provided engine JSON.
"gputrc2graph.py --in_file d1.nsys-rep,trtllm,llama,100 \n" "d2.nsys-rep,trtllm,gpt-oss,102 "
347-349
: Minor: prefersys.exit(main())
pattern for CLICurrent usage is fine; no change required. Just a note for future consistency.
examples/profiler/nsys_profile_tools/trtllm_engine_model.json (1)
1-63
: Include AllGather in NCCL patterns to match README and common kernelsThe README example calls out AllGather, but patterns for NCCL currently omit it for some engines. Suggest appending
|AllGather
to the NCCL patterns to ensure those kernels are categorized correctly.@@ - "ncclDevKernel|AllReduce": "nccl_and_custom_ar", + "ncclDevKernel|AllReduce|AllGather": "nccl_and_custom_ar", @@ - "ncclDevKernel|cross_device_reduce|AllReduce": "nccl_and_custom_ar", + "ncclDevKernel|cross_device_reduce|AllReduce|AllGather": "nccl_and_custom_ar", @@ - "ncclDevKernel|cross_device_reduce|AllReduce": "nccl_and_custom_ar", + "ncclDevKernel|cross_device_reduce|AllReduce|AllGather": "nccl_and_custom_ar",Also: great having a catch-all
".*": "misc"
to avoid None categories.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (3)
examples/profiler/nsys_profile_tools/images/csv.png
is excluded by!**/*.png
examples/profiler/nsys_profile_tools/images/html.png
is excluded by!**/*.png
examples/profiler/nsys_profile_tools/images/html_tbl.png
is excluded by!**/*.png
📒 Files selected for processing (3)
examples/profiler/nsys_profile_tools/README.md
(1 hunks)examples/profiler/nsys_profile_tools/gputrc2graph.py
(1 hunks)examples/profiler/nsys_profile_tools/trtllm_engine_model.json
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
examples/profiler/nsys_profile_tools/gputrc2graph.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
examples/profiler/nsys_profile_tools/gputrc2graph.py
🪛 LanguageTool
examples/profiler/nsys_profile_tools/README.md
[grammar] ~5-~5: There might be a mistake here.
Context: ...f GPU and non-GPU time. It is useful for profiling and analyzing nsys profile out...
(QB_NEW_EN)
[grammar] ~39-~39: There might be a mistake here.
Context: ...vidia.com/nsight-systems/get-started) is installed, and specify the path to the `...
(QB_NEW_EN)
[grammar] ~41-~41: There might be a mistake here.
Context: ...--nsys_cmd
if it is not in your PATH. - For more details on available engines an...
(QB_NEW_EN)
[grammar] ~53-~53: There might be a mistake here.
Context: ...rtllm: 1. Run the following command to collect nsys profile, for trtllm serve config. ...
(QB_NEW_EN)
[style] ~64-~64: Consider using the more formal “until”.
Context: ...d so that profiles aren't captured till trtllm server has come up and load ...
(TILL)
[grammar] ~65-~65: There might be a mistake here.
Context: ...come up and load generation starts. - DURATION: how many seconds for nsys prof...
(QB_NEW_EN)
[grammar] ~84-~84: There might be a mistake here.
Context: ...ent categories in a stacked bar chart. - result.csv: shows how the kernel names a...
(QB_NEW_EN)
[grammar] ~91-~91: There might be a mistake here.
Context: ...which consist of moe_gemm as the biggest category, at .14 seconds, followed by "a...
(QB_NEW_EN)
[grammar] ~92-~92: There might be a mistake here.
Context: ...ed by "attn" kernels. This lets the user prioritize the kernels to focus on for p...
(QB_NEW_EN)
[grammar] ~105-~105: There might be a mistake here.
Context: ...result.csv to dive into what the kernels are which compose the nccl GPU cycles. T...
(QB_NEW_EN)
[grammar] ~106-~106: There might be a mistake here.
Context: ...U cycles. The following image shows that ar_fusion all reduce kernel to be the bi...
(QB_NEW_EN)
[grammar] ~114-~114: There might be a mistake here.
Context: ...ce files, captured for different models, say llama and gpt-oss in this case, and ...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ...example 1 but now there will be multiple stack bar charts that can be compared. ...
(QB_NEW_EN)
[grammar] ~125-~125: There might be a mistake here.
Context: ...pared. The categories for the different kernels will remain the same, so that it...
(QB_NEW_EN)
[grammar] ~126-~126: There might be a mistake here.
Context: ... it's easy to compare the GPU cycles for the same categories. Once a category is...
(QB_NEW_EN)
[grammar] ~129-~129: There might be a mistake here.
Context: ...e more cycles for one configuration than another, the next step would be to use t...
(QB_NEW_EN)
[grammar] ~130-~130: There might be a mistake here.
Context: ...use the csv file to see what kernels are mapped into that category, and which ker...
(QB_NEW_EN)
[grammar] ~131-~131: There might be a mistake here.
Context: ...kernels are taking the largest amount of time which would cause a difference for ...
(QB_NEW_EN)
[grammar] ~137-~137: There might be a mistake here.
Context: ...the same format as the other json files. The script will automatically pick up al...
(QB_NEW_EN)
[grammar] ~141-~141: There might be a mistake here.
Context: ...here are 4 kernels to be classified into "gemm" and "attn", where the gemm kernel...
(QB_NEW_EN)
[grammar] ~142-~142: There might be a mistake here.
Context: ...kernelshave names with "H" or "I" in them, and attn kernels have names with "...
(QB_NEW_EN)
[grammar] ~162-~162: There might be a mistake here.
Context: ...ey: a regex used to classify the kernels - value: the category to classify the kern...
(QB_NEW_EN)
[grammar] ~165-~165: There might be a mistake here.
Context: ...engine/models, consisting of CUDA memory operations and a 'misc' for anything tha...
(QB_NEW_EN)
[grammar] ~168-~168: There might be a mistake here.
Context: ... a trace file with this new model/engine like the following: ```bash --in_file n...
(QB_NEW_EN)
[grammar] ~176-~176: There might be a mistake here.
Context: ...ing engine file, after the other models.
(QB_NEW_EN)
🪛 Ruff (0.12.2)
examples/profiler/nsys_profile_tools/gputrc2graph.py
15-18: 1 blank line required between summary line and description
(D205)
15-18: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
31-31: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
46-47: One-line docstring should fit on one line
Reformat to one line
(D200)
46-47: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
79-81: 1 blank line required between summary line and description
(D205)
79-81: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
118-118: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
173-173: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
183-183: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
191-191: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
196-196: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
208-210: 1 blank line required between summary line and description
(D205)
208-210: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
255-255: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
examples/profiler/nsys_profile_tools/README.md (1)
105-109
: Align terminology: include AllGather in NCCL example (and mapping)The text calls out AllGather, but the JSON patterns currently don’t match AllGather for all engines. I suggested JSON updates to include
AllGather
in the NCCL pattern. See the JSON review comment.examples/profiler/nsys_profile_tools/gputrc2graph.py (1)
315-345
: Expose supported engines/models in help and keep errors user-friendlyGood addition. Consider also logging the supported combos on ValueError in
gen_graph
(addressed in another comment). Otherwise, this block looks solid.
Signed-off-by: Grace Ho <grho@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (6)
examples/profiler/nsys_profile_tools/README.md (2)
36-41
: Document all runtime dependencies (include plotly and the third-party regex).The script imports pandas, plotly.express, and the third-party regex package. Only pandas/plotly are mentioned, and regex is missing.
Apply this diff to the Notes:
- - Make sure you have pandas and plotly python packages installed. + - Make sure you have the following Python packages installed: + - pandas + - plotly + - regex (third-party regex engine; imported as `regex`) + You can install them with: + ```bash + pip install pandas plotly regex + ``` - - Make sure [nsys](https://developer.nvidia.com/nsight-systems/get-started) is + - Make sure [nsys](https://developer.nvidia.com/nsight-systems/get-started) is installed, and specify the path to the `nsys` command with `--nsys_cmd` if it - is not in your PATH. + is not in your PATH.
19-21
: Fix the 0-seconds behavior description (it currently contradicts the code).When
elapsed_nonprofiled_sec
is 0, the code uses total = GPU elapsed seconds, which yields CPU(non-GPU) = 0. The text says it “may inflate non-GPU time,” which is the opposite of what happens.Apply this diff to clarify the behavior:
- - `elapsed_nonprofiled_sec`: Wall-clock runtime (in seconds) without - profiling. Specify `0` to use the elapsed GPU time calculated from the nsys-rep file (this may inflate non-GPU time if actual runtime without profiling is less). Multiple entries can be provided, separated by spaces. + - `elapsed_nonprofiled_sec`: Wall-clock runtime (in seconds) without + profiling. Specify `0` to use the GPU-elapsed time from the nsys trace as the total elapsed time. This yields CPU(non-GPU) = 0. Prefer providing the actual wall-clock time (without profiling) to compute a realistic non-GPU estimate. Multiple entries can be provided, separated by spaces.examples/profiler/nsys_profile_tools/gputrc2graph.py (4)
186-192
: Don’t depend on copying the last row to build the non-GPU row.Copying
df.iloc[-1]
is fragile (fails on empty frames; may inherit unintended values). Construct the row explicitly and pass the label fields in.Apply this diff:
- def make_nongpu_row(self, df, nongpu_sec): - """this will append non-gpu time entry at end of df""" - nongpu_row = self.pd.DataFrame([df.iloc[-1]]) - nongpu_row["Category"] = nongpu_row["Name"] = "CPU(non-GPU)" - nongpu_row["Instances"] = 1 - nongpu_row["Elapsed Time (sec)"] = nongpu_sec - return nongpu_row + def make_nongpu_row(self, model_engine, nongpu_sec): + """Create a single-row DataFrame representing non-GPU (CPU) time.""" + return self.pd.DataFrame([{ + "Model_Engine": model_engine, + "Category": "CPU(non-GPU)", + "Name": "CPU(non-GPU)", + "Instances": 1, + "Elapsed Time (sec)": float(nongpu_sec), + }])
194-198
: Avoid assert for runtime validation; raise concrete exceptions.
assert
can be stripped with-O
, turning this into a no-op. Prefer explicit checks.Apply this diff:
- def is_valid_file(self, base_file): - """asserts if base_file is non-existent or is empty""" - assert (os.path.isfile(base_file) and os.path.getsize(base_file) - > 0), f"{base_file} doesn't exist or is empty" + def is_valid_file(self, base_file): + """Raise if base_file is missing or empty.""" + if not os.path.isfile(base_file): + raise FileNotFoundError(f"{base_file} doesn't exist") + if os.path.getsize(base_file) <= 0: + raise ValueError(f"{base_file} is empty")
228-257
: Detect nsys failures and regenerate the sum when stale; avoid silent errors.
subprocess.run(cmd)
won’t raise on non-zero exit, so failures go undetected until later.- If
nsys_stats_file
exists butsum_file
is missing or older, the current code won’t regenerate the sum.Apply this diff:
- if self.should_gen_file(nsys_stats_file, file): + if self.should_gen_file(nsys_stats_file, file): cmd = [ nsys_cmd, "stats", "-r", "cuda_gpu_trace", file, "-o", f"{file_dir}/{file_name}", ] cmd_str = " ".join(cmd) logger.info("+ %s", cmd_str) # estimate time based on calibrated 240M/min file_size_mb = os.path.getsize(file) / 1e6 logger.info( "nsys stats for %.2f MB file expected to take %.2f min", file_size_mb, file_size_mb / 240, ) try: - subprocess.run(cmd) - except Exception: - logger.error("%s failed; Use --nsys_cmd to specify nsys path", - cmd_str) - exit(1) - logger.info("generating non-overalapped sum %s", sum_file) - self.gen_nonoverlapped_sum_from_gputrace(nsys_stats_file, sum_file) + subprocess.run(cmd, check=True, capture_output=True, text=True) + except subprocess.CalledProcessError as e: + logger.error("%s failed (code %s): %s", cmd_str, e.returncode, e.stderr.strip()) + raise SystemExit(1) + # (Re)generate the non-overlapped kernel summary if missing or stale + if self.should_gen_file(sum_file, nsys_stats_file): + logger.info("generating non-overlapped sum %s", sum_file) + self.gen_nonoverlapped_sum_from_gputrace(nsys_stats_file, sum_file) self.is_valid_file(sum_file) logger.info("Finished generating %s", sum_file) return sum_file
268-289
: Avoid premature rounding; update non-GPU row construction to match new helper.
- Rounding before comparison can cause
total_sec < gpu_sec
anomalies.- Use the new
make_nongpu_row(model_engine, ...)
signature.- Keep a
model_engine
variable for reuse.Apply this diff:
- sum_file = self.gen_sum_file(file, nsys_cmd) + sum_file = self.gen_sum_file(file, nsys_cmd) # read kernel summary file df = self.pd.read_csv(sum_file) # annotate kernel to their categories assert engine_model.get(engine), f"engine {engine} unknown" assert engine_model[engine].get(model), f"model {model} unknown" # remove nsys-rep from file_name for shorter x-label file_name = file_name.replace(".nsys-rep", "") - df["Model_Engine"] = f"{model}_{engine}_{file_name}_{idx}" + model_engine = f"{model}_{engine}_{file_name}_{idx}" + df["Model_Engine"] = model_engine self.anno_gpu_kernname(df, engine_model[engine][model]) # patch in non-gpu time - gpu_sec = round(df["Elapsed Time (sec)"].sum(), 1) - total_sec = round(float(total_sec), 1) - if total_sec < gpu_sec: + gpu_sec = float(df["Elapsed Time (sec)"].sum()) + total_sec_val = float(total_sec) + if total_sec_val < gpu_sec: logger.warning( - "Elapsed sec %.2f < GPU sec %.2f resetting Elapsed sec ", - total_sec, + "Elapsed sec %.2f < GPU sec %.2f resetting Elapsed sec ", + total_sec_val, gpu_sec, ) - total_sec = gpu_sec - nongpu_row = self.make_nongpu_row(df, total_sec - gpu_sec) + total_sec_val = gpu_sec + nongpu_row = self.make_nongpu_row(model_engine, total_sec_val - gpu_sec) df = self.pd.concat([df, nongpu_row], ignore_index=True)Note: Consider replacing the
assert
lines with explicitValueError
to avoid optimization stripping, but I left them unchanged here to minimize the diff.
🧹 Nitpick comments (7)
examples/profiler/nsys_profile_tools/README.md (3)
61-65
: Minor grammar and clarity nits for Step 1 bullets.Tighten wording and fix “till”.
Apply this diff:
- - DELAY: how many seconds to delay nsys from collecting profiles, needed so - that profiles aren't captured till trtllm server has come up and load - generation starts. - - DURATION: how many seconds for nsys profile to run before generating the - profile. This should be > the duration of the run. + - DELAY: how many seconds to delay nsys before collecting profiles, so + that profiles aren't captured until the trtllm server is up and load + generation has started. + - DURATION: how many seconds for nsys to collect the profile before it + stops. This should be greater than the duration of the run.
102-107
: Fix phrasing in the result.csv explanation.Improve readability and terminology (AllReduce capitalization).
Apply this diff:
-Suppose the user would like to focus on improving decreasing calls to nccl -kernels. The next step is to use the result.csv to dive into what the kernels -are which compose the nccl GPU cycles. The following image shows that -ar_fusion all reduce kernel to be the biggest contributor to GPU cycles for -nccl, followed by AllGather. +Suppose you want to reduce time spent in NCCL kernels. +Use result.csv to drill into which kernels comprise NCCL GPU time. +The following image shows the `ar_fusion` AllReduce kernel as the biggest +contributor to NCCL GPU time, followed by AllGather.
139-142
: Fix typo: “kernelshave” → “kernels have”.Apply this diff:
-Then, for this new model, suppose there are 4 kernels to be classified into -"gemm" and "attn", where the gemm kernelshave names with "*H*" or "*I*" in -them, and attn kernels have names with "*J*" or "*K*" in them, just add another +Then, for this new model, suppose there are 4 kernels to be classified into +"gemm" and "attn", where the gemm kernels have names with "*H*" or "*I*" in +them, and attn kernels have names with "*J*" or "*K*" in them. Just add anotherexamples/profiler/nsys_profile_tools/gputrc2graph.py (4)
86-105
: Avoid assert, handle empty frames, and use logger for progress.
assert not df.empty
can be stripped with -O and is unnecessary; handle empty explicitly.- Replace print-based progress with
logger.info
and reduce verbosity.Apply this diff:
- logger.info("sorting %s trace records by start time", str(df.shape)) - assert not df.empty, 'empty nsys records' + logger.info("sorting %s trace records by start time", str(df.shape)) # Sort by start time and reset index df = df.sort_values(by="Start (ns)").reset_index(drop=True) + if df.empty: + df["Elapsed Time (ns)"] = self.pd.Series(dtype="int64") + return df + # Initialize elapsed time as duration df["Elapsed Time (ns)"] = df["Duration (ns)"] # Get numpy arrays for faster operations starts = df["Start (ns)"].values ends = df["End (ns)"].values # Keep track of current interval end current_end = ends[0] - display_units = max(1, int(len(df) / 100)) + n = len(df) + display_step = max(1, n // 10) # Update current_end for overlapping intervals - for i in range(1, len(df)): - if i % display_units == 0: - print(f"processing trace: {int(i/len(df) * 100)} %", end="\r") + for i in range(1, n): + if i % display_step == 0: + logger.info("processing trace: %d %%", int(i / n * 100))
164-175
: Avoid writing temp files to CWD; use output_dir and logger instead of print.Writing “temp.html” to the current working directory can be surprising; keep all artifacts under output_dir and replace print with logging.
Apply this diff:
- pivot_df.fillna("").to_html("temp.html") - with ( - open(f"{output_name}.html", "a", encoding="utf-8") as outfile, - open("temp.html", encoding="utf-8") as infile, - ): + tmp_html = f"{output_name}.tmp.html" + pivot_df.fillna("").to_html(tmp_html) + with ( + open(f"{output_name}.html", "a", encoding="utf-8") as outfile, + open(tmp_html, encoding="utf-8") as infile, + ): outfile.write(infile.read()) - os.remove("temp.html") + os.remove(tmp_html) - print(f"Finished generating: \n" - f" {output_name}.html for stack bar chart \n" - f" {output_name}.csv for Kernel-Category mapping") + logger.info("Finished generating:\n %s.html (stacked bar chart)\n %s.csv (Kernel-Category mapping)", + output_name, output_name)
176-185
: Prefer importing regex as ‘regex’ to avoid confusion with the stdlib ‘re’.Aliasing the third-party regex engine as “re” can surprise readers. Use
import regex
and callregex.search(...)
.Apply this diff:
-import regex as re +import regexAnd here:
- if re.search(kern_name, name): + if regex.search(kern_name, name): return val
331-333
: Fix the help example: replace ‘sglan’ with a supported engine.The example uses
sglan
, which isn’t one of the engines discovered from JSON. Replace withtrtllm
to match the README and shipped JSON.Apply this diff:
- f"Example: --in_file d1.nsys-rep,sglan,llama,100 " + f"Example: --in_file d1.nsys-rep,trtllm,llama,100 "
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
examples/profiler/nsys_profile_tools/README.md
(1 hunks)examples/profiler/nsys_profile_tools/gputrc2graph.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
examples/profiler/nsys_profile_tools/gputrc2graph.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
examples/profiler/nsys_profile_tools/gputrc2graph.py
🪛 Ruff (0.12.2)
examples/profiler/nsys_profile_tools/gputrc2graph.py
15-18: 1 blank line required between summary line and description
(D205)
15-18: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
31-31: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
46-47: One-line docstring should fit on one line
Reformat to one line
(D200)
46-47: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
80-82: 1 blank line required between summary line and description
(D205)
80-82: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
119-119: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
174-174: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
184-184: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
192-192: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
197-197: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
209-211: 1 blank line required between summary line and description
(D205)
209-211: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
256-256: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
🪛 LanguageTool
examples/profiler/nsys_profile_tools/README.md
[grammar] ~3-~3: There might be a mistake here.
Context: ... Nsight Systems (nsys
) GPU trace files (.nsys-rep
) with -t cuda tracing enabl...
(QB_NEW_EN)
[grammar] ~4-~4: There might be a mistake here.
Context: ...cing enabled, and generates kernel-level summaries and visualizations of GPU and ...
(QB_NEW_EN)
[grammar] ~5-~5: There might be a mistake here.
Context: ...f GPU and non-GPU time. It is useful for profiling and analyzing nsys profile out...
(QB_NEW_EN)
[grammar] ~36-~36: There might be a mistake here.
Context: ...system PATH. ## Notes - Make sure you have pandas and plotly python packages insta...
(QB_NEW_EN)
[grammar] ~36-~36: There might be a mistake here.
Context: ...as and plotly python packages installed. - Make sure [nsys](https://developer.nvidi...
(QB_NEW_EN)
[grammar] ~37-~37: There might be a mistake here.
Context: ...vidia.com/nsight-systems/get-started) is installed, and specify the path to the `...
(QB_NEW_EN)
[grammar] ~39-~39: There might be a mistake here.
Context: ...--nsys_cmd
if it is not in your PATH. - For more details on available engines an...
(QB_NEW_EN)
[grammar] ~51-~51: There might be a mistake here.
Context: ...rtllm: 1. Run the following command to collect nsys profile, for trtllm serve config. ...
(QB_NEW_EN)
[style] ~62-~62: Consider using the more formal “until”.
Context: ...d so that profiles aren't captured till trtllm server has come up and load ...
(TILL)
[grammar] ~63-~63: There might be a mistake here.
Context: ...come up and load generation starts. - DURATION: how many seconds for nsys prof...
(QB_NEW_EN)
[grammar] ~82-~82: There might be a mistake here.
Context: ...ent categories in a stacked bar chart. - result.csv: shows how the kernel names a...
(QB_NEW_EN)
[grammar] ~89-~89: There might be a mistake here.
Context: ...which consist of moe_gemm as the biggest category, at .14 seconds, followed by "a...
(QB_NEW_EN)
[grammar] ~90-~90: There might be a mistake here.
Context: ...ed by "attn" kernels. This lets the user prioritize the kernels to focus on for p...
(QB_NEW_EN)
[grammar] ~103-~103: There might be a mistake here.
Context: ...result.csv to dive into what the kernels are which compose the nccl GPU cycles. T...
(QB_NEW_EN)
[grammar] ~104-~104: There might be a mistake here.
Context: ...U cycles. The following image shows that ar_fusion all reduce kernel to be the bi...
(QB_NEW_EN)
[grammar] ~112-~112: There might be a mistake here.
Context: ...ce files, captured for different models, say llama and gpt-oss in this case, and ...
(QB_NEW_EN)
[grammar] ~122-~122: There might be a mistake here.
Context: ...example 1 but now there will be multiple stack bar charts that can be compared. ...
(QB_NEW_EN)
[grammar] ~123-~123: There might be a mistake here.
Context: ...pared. The categories for the different kernels will remain the same, so that it...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ... it's easy to compare the GPU cycles for the same categories. Once a category is...
(QB_NEW_EN)
[grammar] ~127-~127: There might be a mistake here.
Context: ...e more cycles for one configuration than another, the next step would be to use t...
(QB_NEW_EN)
[grammar] ~128-~128: There might be a mistake here.
Context: ...use the csv file to see what kernels are mapped into that category, and which ker...
(QB_NEW_EN)
[grammar] ~129-~129: There might be a mistake here.
Context: ...kernels are taking the largest amount of time which would cause a difference for ...
(QB_NEW_EN)
[grammar] ~135-~135: There might be a mistake here.
Context: ...the same format as the other json files. The script will automatically pick up al...
(QB_NEW_EN)
[grammar] ~139-~139: There might be a mistake here.
Context: ...here are 4 kernels to be classified into "gemm" and "attn", where the gemm kernel...
(QB_NEW_EN)
[grammar] ~140-~140: There might be a mistake here.
Context: ...kernelshave names with "H" or "I" in them, and attn kernels have names with "...
(QB_NEW_EN)
[grammar] ~160-~160: There might be a mistake here.
Context: ...ey: a regex used to classify the kernels - value: the category to classify the kern...
(QB_NEW_EN)
[grammar] ~163-~163: There might be a mistake here.
Context: ...engine/models, consisting of CUDA memory operations and a 'misc' for anything tha...
(QB_NEW_EN)
[grammar] ~166-~166: There might be a mistake here.
Context: ... a trace file with this new model/engine like the following: ```bash --in_file n...
(QB_NEW_EN)
[grammar] ~174-~174: There might be a mistake here.
Context: ...ing engine file, after the other models.
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Signed-off-by: Grace Ho <grho@nvidia.com>
Signed-off-by: Grace Ho <grho@nvidia.com>
Signed-off-by: Grace Ho <grho@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (10)
tensorrt_llm/tools/profiler/nsys_profile_tools/trtllm_engine_model.json (1)
22-40
: Minor: consider adding explicit attention variants and lowercase-insensitive matches for robustness.If kernel name casing varies (e.g., softmax vs SoftMax) or additional attention variants appear (e.g., mha variants), consider adding case-insensitive matching or explicit alternatives. This can be done in code (use re.IGNORECASE) or by expanding patterns here.
Apply either of these approaches:
- Approach A (preferred): Use
re.IGNORECASE
when searching in code; no JSON change needed.- Approach B: Expand patterns with both cases (e.g., add
softmax|SoftMax
).tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (6)
81-118
: Minor: use logging instead of print for progress; tighten docstring.Progress output should use the logger. Also, format the docstring to satisfy docstyle checks (D200/D205/D415).
Apply this diff:
def sum_non_overlapping_intervals(self, df): - """ - returns new sorted df with Elapsed Time (ns) column using - vectorized operations - """ + """Return a sorted DataFrame with non-overlapped Elapsed Time (ns).""" logger.info("sorting %s trace records by start time", str(df.shape)) assert not df.empty, 'empty nsys records' @@ - display_units = max(1, int(len(df) / 100)) + display_units = max(1, int(len(df) / 100)) # Update current_end for overlapping intervals for i in range(1, len(df)): - if i % display_units == 0: - print(f"processing trace: {int(i/len(df) * 100)} %", end="\r") + if i % display_units == 0: + pct = int(i / len(df) * 100) + logger.info("processing trace: %d%%", pct)Note: If log volume is a concern, gate this behind a verbose flag or use
logger.debug
.
74-79
: Compute Instances via groupby size (clearer/robust) and tidy docstring.Setting
Instances = 1
and then aggregating with"size"
works but is indirect.Apply this diff:
- # get ready to print table with elapsed times per kernel - df["Instances"] = 1 - df_sum = df.groupby("Name", as_index=False).agg({ - "Elapsed Time (ns)": "sum", - "Duration (ns)": "sum", - "Instances": "size" - }) + # Aggregate elapsed/total time and number of kernel instances. + df_sum = df.groupby("Name", as_index=False).agg( + **{ + "Elapsed Time (ns)": ("Elapsed Time (ns)", "sum"), + "Duration (ns)": ("Duration (ns)", "sum"), + "Instances": ("Name", "size"), + } + )This avoids relying on a placeholder column.
121-171
: Categorical ordering by the last underscore token can mis-sort ≥10 entries; parse the trailing index as int.String-based sort will order "10" before "2". Parse the trailing token to int for stable ordering.
Apply this diff:
- df["Model_Engine"] = self.pd.Categorical( - df["Model_Engine"], - sorted(df["Model_Engine"].unique(), key=lambda x: x.split("_")[-1]), - ) + def _idx_key(s: str) -> int: + try: + return int(s.rsplit("_", 1)[1]) + except Exception: + return 0 + df["Model_Engine"] = self.pd.Categorical( + df["Model_Engine"], + sorted(df["Model_Engine"].unique(), key=_idx_key), + )
176-185
: Default to “misc” if no mapping matches; optionally allow case-insensitive matching.JSONs include a catch-all
".*": "misc"
, but being defensive avoids None categories if future JSONs omit it. Also consider case-insensitive matching in code to keep JSONs simpler.Apply this diff:
def anno_gpu_kernname(self, df, mapping): """add "Category" column""" def anno_gpu_kernname_helper(name): for kern_name, val in mapping.items(): - if re.search(kern_name, name): + if re.search(kern_name, name): return val + return "misc" df["Category"] = df["Name"].apply(anno_gpu_kernname_helper)If you want case-insensitive matching across the board:
- if re.search(kern_name, name): + if re.search(kern_name, name, flags=re.IGNORECASE):
186-193
: Create the non-GPU row from a clean schema to avoid leaking unrelated fields (e.g., Total Time from the copied row).Copying an arbitrary last row can carry unrelated columns with misleading values.
Apply this diff:
def make_nongpu_row(self, df, nongpu_sec): """this will append non-gpu time entry at end of df""" - nongpu_row = self.pd.DataFrame([df.iloc[-1]]) - nongpu_row["Category"] = nongpu_row["Name"] = "CPU(non-GPU)" - nongpu_row["Instances"] = 1 - nongpu_row["Elapsed Time (sec)"] = nongpu_sec + cols = list(df.columns) + base = {c: None for c in cols} + base.update({ + "Category": "CPU(non-GPU)", + "Name": "CPU(non-GPU)", + "Instances": 1, + "Elapsed Time (sec)": nongpu_sec, + }) + nongpu_row = self.pd.DataFrame([base], columns=cols) return nongpu_row
15-19
: Docstring style nits (D200/D205/D415).Several docstrings trigger docstyle warnings (missing terminal punctuation, summary/description separation). Not functionally blocking, but easy to tidy.
Would you like me to push a patch that standardizes all docstrings to Google style and resolves the listed Ruff warnings?
Also applies to: 31-33, 46-49, 81-86, 119-119, 174-174, 184-184, 192-192, 197-197, 209-211, 256-256
tensorrt_llm/tools/profiler/nsys_profile_tools/README.md (3)
51-58
: Minor grammar and clarity in the profiling command step.Tighten language and fix minor issues.
Apply this diff:
-1. Run the following command to collect nsys profile, for trtllm serve config. +1. Run the following command to collect an nsys profile for a trtllm serve configuration: @@ - python3 -m trtllm-serve meta-llama/Llama-4-Scout-17B-16E-Instruct ... + python3 -m trtllm-serve meta-llama/Llama-4-Scout-17B-16E-Instruct ... @@ - - DELAY: how many seconds to delay nsys from collecting profiles, needed so - that profiles aren't captured till trtllm server has come up and load - generation starts. + - DELAY: number of seconds to delay nsys before collecting profiles, so + collection starts after the trtllm server is up and load generation begins. - DURATION: how many seconds for nsys profile to run before generating the profile. This should be > the duration of the run.
100-109
: Minor grammar fixes in result.csv section.Apply this diff:
-Suppose the user would like to focus on improving decreasing calls to nccl +Suppose the user would like to focus on reducing calls to NCCL kernels. The next step is to use the result.csv to dive into what the kernels -are which compose the nccl GPU cycles. The following image shows that -ar_fusion all reduce kernel to be the biggest contributor to GPU cycles for -nccl, followed by AllGather. +are that comprise the NCCL GPU cycles. The following image shows +ar_fusion AllReduce to be the biggest contributor to NCCL GPU cycles, +followed by AllGather.
132-156
: Typos and phrasing in “add new classification” section.Apply this diff:
-To create a new engine DEF with model ABC, just add another json file in the +To create a new engine DEF with model ABC, add another JSON file in the same directory as gputrc2graph.py with the same format as the other json files. The script will automatically pick up all the json files in the same directory as engine/model specifications. @@ -Then, for this new model, suppose there are 4 kernels to be classified into -"gemm" and "attn", where the gemm kernelshave names with "*H*" or "*I*" in -them, and attn kernels have names with "*J*" or "*K*" in them, just add another - .json file in the same directory as gputrc2graph.py with the same format as - the other json files, like the following: +Then, for this new model, suppose there are four kernels to be classified into +"gemm" and "attn", where the gemm kernels have names with "*H*" or "*I*", +and attention kernels have names with "*J*" or "*K*". Add a JSON file in the +same directory as gputrc2graph.py with the same format as the other JSON files, +like the following: @@ -Each entry in the dictionary consists of: +Each entry in the dictionary consists of: @@ -- key: a regex used to classify the kernels -- value: the category to classify the kernels into. +- key: a regex used to classify kernels +- value: the category to classify kernels into @@ -The last 2 entries are common for all engine/models, consisting of CUDA memory -operations and a 'misc' for anything that's leftover and can't be classified. +The last two entries are common for all engines/models: CUDA memory operations +and a 'misc' category for anything that can't be classified.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (3)
tensorrt_llm/tools/profiler/nsys_profile_tools/images/csv.png
is excluded by!**/*.png
tensorrt_llm/tools/profiler/nsys_profile_tools/images/html.png
is excluded by!**/*.png
tensorrt_llm/tools/profiler/nsys_profile_tools/images/html_tbl.png
is excluded by!**/*.png
📒 Files selected for processing (3)
tensorrt_llm/tools/profiler/nsys_profile_tools/README.md
(1 hunks)tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
(1 hunks)tensorrt_llm/tools/profiler/nsys_profile_tools/trtllm_engine_model.json
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
🪛 Ruff (0.12.2)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
15-18: 1 blank line required between summary line and description
(D205)
15-18: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
31-31: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
46-47: One-line docstring should fit on one line
Reformat to one line
(D200)
46-47: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
80-82: 1 blank line required between summary line and description
(D205)
80-82: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
119-119: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
174-174: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
184-184: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
192-192: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
197-197: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
209-211: 1 blank line required between summary line and description
(D205)
209-211: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
256-256: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
🪛 LanguageTool
tensorrt_llm/tools/profiler/nsys_profile_tools/README.md
[grammar] ~3-~3: There might be a mistake here.
Context: ... Nsight Systems (nsys
) GPU trace files (.nsys-rep
) with -t cuda tracing enabl...
(QB_NEW_EN)
[grammar] ~4-~4: There might be a mistake here.
Context: ...cing enabled, and generates kernel-level summaries and visualizations of GPU and ...
(QB_NEW_EN)
[grammar] ~5-~5: There might be a mistake here.
Context: ...f GPU and non-GPU time. It is useful for profiling and analyzing nsys profile out...
(QB_NEW_EN)
[grammar] ~36-~36: There might be a mistake here.
Context: ...system PATH. ## Notes - Make sure you have pandas and plotly python packages insta...
(QB_NEW_EN)
[grammar] ~36-~36: There might be a mistake here.
Context: ...as and plotly python packages installed. - Make sure [nsys](https://developer.nvidi...
(QB_NEW_EN)
[grammar] ~37-~37: There might be a mistake here.
Context: ...vidia.com/nsight-systems/get-started) is installed, and specify the path to the `...
(QB_NEW_EN)
[grammar] ~39-~39: There might be a mistake here.
Context: ...--nsys_cmd
if it is not in your PATH. - For more details on available engines an...
(QB_NEW_EN)
[grammar] ~51-~51: There might be a mistake here.
Context: ...rtllm: 1. Run the following command to collect nsys profile, for trtllm serve config. ...
(QB_NEW_EN)
[style] ~62-~62: Consider using the more formal “until”.
Context: ...d so that profiles aren't captured till trtllm server has come up and load ...
(TILL)
[grammar] ~63-~63: There might be a mistake here.
Context: ...come up and load generation starts. - DURATION: how many seconds for nsys prof...
(QB_NEW_EN)
[grammar] ~82-~82: There might be a mistake here.
Context: ...ent categories in a stacked bar chart. - result.csv: shows how the kernel names a...
(QB_NEW_EN)
[grammar] ~89-~89: There might be a mistake here.
Context: ...which consist of moe_gemm as the biggest category, at .14 seconds, followed by "a...
(QB_NEW_EN)
[grammar] ~90-~90: There might be a mistake here.
Context: ...ed by "attn" kernels. This lets the user prioritize the kernels to focus on for p...
(QB_NEW_EN)
[grammar] ~103-~103: There might be a mistake here.
Context: ...result.csv to dive into what the kernels are which compose the nccl GPU cycles. T...
(QB_NEW_EN)
[grammar] ~104-~104: There might be a mistake here.
Context: ...U cycles. The following image shows that ar_fusion all reduce kernel to be the bi...
(QB_NEW_EN)
[grammar] ~112-~112: There might be a mistake here.
Context: ...ce files, captured for different models, say llama and gpt-oss in this case, and ...
(QB_NEW_EN)
[grammar] ~122-~122: There might be a mistake here.
Context: ...example 1 but now there will be multiple stack bar charts that can be compared. ...
(QB_NEW_EN)
[grammar] ~123-~123: There might be a mistake here.
Context: ...pared. The categories for the different kernels will remain the same, so that it...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ... it's easy to compare the GPU cycles for the same categories. Once a category is...
(QB_NEW_EN)
[grammar] ~127-~127: There might be a mistake here.
Context: ...e more cycles for one configuration than another, the next step would be to use t...
(QB_NEW_EN)
[grammar] ~128-~128: There might be a mistake here.
Context: ...use the csv file to see what kernels are mapped into that category, and which ker...
(QB_NEW_EN)
[grammar] ~129-~129: There might be a mistake here.
Context: ...kernels are taking the largest amount of time which would cause a difference for ...
(QB_NEW_EN)
[grammar] ~135-~135: There might be a mistake here.
Context: ...the same format as the other json files. The script will automatically pick up al...
(QB_NEW_EN)
[grammar] ~139-~139: There might be a mistake here.
Context: ...here are 4 kernels to be classified into "gemm" and "attn", where the gemm kernel...
(QB_NEW_EN)
[grammar] ~140-~140: There might be a mistake here.
Context: ...kernelshave names with "H" or "I" in them, and attn kernels have names with "...
(QB_NEW_EN)
[grammar] ~160-~160: There might be a mistake here.
Context: ...ey: a regex used to classify the kernels - value: the category to classify the kern...
(QB_NEW_EN)
[grammar] ~163-~163: There might be a mistake here.
Context: ...engine/models, consisting of CUDA memory operations and a 'misc' for anything tha...
(QB_NEW_EN)
[grammar] ~166-~166: There might be a mistake here.
Context: ... a trace file with this new model/engine like the following: ```bash --in_file n...
(QB_NEW_EN)
[grammar] ~174-~174: There might be a mistake here.
Context: ...ing engine file, after the other models.
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/tools/profiler/nsys_profile_tools/trtllm_engine_model.json (2)
3-21
: Kernel categorization coverage for llama looks solid; ordering ensures catch-all fallback.Patterns are reasonably comprehensive and the catch-all ".*" is correctly placed last to avoid shadowing specific patterns.
41-60
: ds and gpt-oss mappings are consistent; “.*” fallback is present.Consistent taxonomy across engines, which will make cross-model comparison straightforward in the chart and pivot table.
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (1)
259-297
: Clarify behavior for total runtime = 0 and align with README, or compute total from trace.Current behavior sets total_sec = gpu_sec if the CLI value is 0 (or less than GPU). README states “0 uses elapsed time from nsys-rep,” which is not implemented. Either:
- implement reading total elapsed from the trace (e.g., first start to last end) and use that when the input is 0, or
- update README to state that 0 means “use GPU elapsed time,” which yields non-GPU time = 0.
If you decide to implement trace-derived total when 0 is passed, this patch adds a simple approximation:
for idx, (file, engine, model, total_sec) in enumerate(in_file): @@ - gpu_sec = round(df["Elapsed Time (sec)"].sum(), 1) - total_sec = round(float(total_sec), 1) + gpu_sec = round(df["Elapsed Time (sec)"].sum(), 1) + total_sec = float(total_sec) + if total_sec == 0: + # Approximate total from kernel timestamps: last End - first Start. + # This may still underestimate true wall time if CPU-only gaps exist. + try: + raw = self.pd.read_csv( + os.path.join(file_dir, f"{file_name}_cuda_gpu_trace.csv"), + usecols=["Start (ns)", "Duration (ns)"], + ) + if not raw.empty: + last_end = (raw["Start (ns)"] + raw["Duration (ns)"]).max() + first_start = raw["Start (ns)"].min() + total_sec = max(0.0, (last_end - first_start) / 1e9) + except Exception: + logger.warning("Could not derive total_sec from trace; defaulting to gpu_sec.") + total_sec = 0.0 + total_sec = round(total_sec or gpu_sec, 1)Otherwise, adjust README wording (see separate comment).
tensorrt_llm/tools/profiler/nsys_profile_tools/README.md (1)
36-41
: List all Python dependencies (missing “pandas” min version and “regex” if kept).If you keep using
regex
, add it here. Also consider noting pandas ≥ 2.0 if you rely oncopy_on_write
.Apply this diff if
regex
is removed and pandas ≥ 2.0 is desired:-- Make sure you have pandas and plotly python packages installed. +- Make sure you have pandas (>= 2.0) and plotly python packages installed.If you keep
regex
:-- Make sure you have pandas and plotly python packages installed. +- Make sure you have pandas (>= 2.0), plotly, and the `regex` Python package installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (4)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (4)
25-25
: Replace "regex" with "re" module from standard library.The code uses the third-party
regex
package aliased asre
, creating confusion and introducing an undeclared dependency. The patterns used in this code don't require advancedregex
features.
31-43
: Fix shallow merge that overwrites engine mappings and docstring punctuation.The current implementation will overwrite entire engine mappings when multiple JSON files define the same engine. The README encourages multiple JSONs, so this needs deep merging. Also fix the missing period in the docstring.
58-63
: Handle empty GPU trace CSV to prevent downstream failures.When
cuda_gpu_trace.csv
is empty, the early return causesgen_sum_file
to fail on assertions since no summary file is created.
228-252
: Improve nsys subprocess execution robustness.The current implementation has multiple issues: missing
-f csv
flag, silent failures, and usingexit(1)
instead ofsys.exit(1)
.
🧹 Nitpick comments (4)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (4)
104-104
: Replace print with logger for consistency.Direct print statements bypass the configured logging system and may not appear in log files.
- print(f"processing trace: {int(i/len(df) * 100)} %", end="\r") + logger.debug("processing trace: %d %%", int(i/len(df) * 100))
272-273
: Improve error messages for unsupported engine/model combinations.The current assertion messages don't provide helpful guidance on available options.
- assert engine_model.get(engine), f"engine {engine} unknown" - assert engine_model[engine].get(model), f"model {model} unknown" + if not engine_model.get(engine): + available_engines = list(engine_model.keys()) + raise ValueError(f"Unknown engine '{engine}'. Available engines: {available_engines}") + if not engine_model[engine].get(model): + available_models = list(engine_model[engine].keys()) + raise ValueError(f"Unknown model '{model}' for engine '{engine}'. Available models: {available_models}")
299-300
: Add input validation to parse_tuple function.The function blindly splits on commas without validating the expected tuple structure, which can cause confusing downstream errors.
def parse_tuple(s): - return tuple(s.split(",")) + parts = s.split(",") + if len(parts) != 4: + raise ValueError(f"Expected 4 comma-separated values (file,engine,model,total_sec), got {len(parts)}: {s}") + return tuple(parts)
15-18
: Fix docstring formatting issues.Multiple docstrings are missing periods, need blank lines, or should be reformatted to single lines per Python conventions.
""" - This generates gpu kernel analysis output from nsys rep. Will call nsys - stats -r cuda_gpu_trace, get non-overlapped gpu cycles, then generate - csv and html output for analysis + This generates GPU kernel analysis output from nsys rep. + + Will call nsys stats -r cuda_gpu_trace, get non-overlapped GPU cycles, + then generate CSV and HTML output for analysis. """ - """returns engine_model built from all json files in the current dir""" + """Return engine_model built from all JSON files in the current dir.""" - """ - Parses output of nsys report, generates csv and bar chart output - """ + """Parse output of nsys report, generate CSV and bar chart output.""" - """ - returns new sorted df with Elapsed Time (ns) column using - vectorized operations - """ + """Return new sorted df with Elapsed Time (ns) column using vectorized operations.""" - """make html graph from df""" + """Make HTML graph from df.""" - """add "Category" column""" + """Add 'Category' column.""" - """this will append non-gpu time entry at end of df""" + """Append non-GPU time entry at end of df.""" - """asserts if base_file is non-existent or is empty""" + """Assert if base_file is non-existent or is empty.""" - """figure out if new file should be generated from base_file""" + """Figure out if new file should be generated from base_file.""" - """ - generates sum file from nsys trace with times per kernel and - returns the name of the sum file - """ + """Generate sum file from nsys trace with times per kernel and return the name of the sum file.""" - """generates graph and csv file from in_file into out_dir""" + """Generate graph and CSV file from in_file into out_dir."""Also applies to: 31-31, 46-47, 80-82, 119-119, 174-174, 184-184, 192-192, 197-197, 209-211, 256-256
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
🪛 Ruff (0.12.2)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
15-18: 1 blank line required between summary line and description
(D205)
15-18: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
31-31: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
46-47: One-line docstring should fit on one line
Reformat to one line
(D200)
46-47: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
80-82: 1 blank line required between summary line and description
(D205)
80-82: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
119-119: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
174-174: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
184-184: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
192-192: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
197-197: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
209-211: 1 blank line required between summary line and description
(D205)
209-211: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
256-256: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
@gracehonv FYI - the release check is failing: https://github.com/NVIDIA/TensorRT-LLM/actions/runs/17106836453/job/48517867574?pr=7020 |
Signed-off-by: Grace Ho <grho@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (6)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (6)
86-105
: Handle empty DataFrame gracefully and replace print-based progress with logging.The assertion will crash on valid empty traces. Also, prefer logging to
- logger.info("sorting %s trace records by start time", str(df.shape)) - assert not df.empty, 'empty nsys records' + logger.info("sorting %s trace records by start time", str(df.shape)) + if df.empty: + logger.warning("No GPU trace records to process.") + return df @@ - for i in range(1, len(df)): - if i % display_units == 0: - print(f"processing trace: {int(i/len(df) * 100)} %", end="\r") + for i in range(1, len(df)): + if i % display_units == 0: + logger.info("processing trace: %d %%", int(i / len(df) * 100))
120-171
: Use a NamedTemporaryFile instead of writing a fixedtemp.html
in CWD.The current approach can break with concurrent runs or if cleanup fails.
- pivot_df.fillna("").to_html("temp.html") - with ( - open(f"{output_name}.html", "a", encoding="utf-8") as outfile, - open("temp.html", encoding="utf-8") as infile, - ): - outfile.write(infile.read()) - os.remove("temp.html") + import tempfile + with tempfile.NamedTemporaryFile(mode="w+", suffix=".html", delete=False) as tmp: + pivot_df.fillna("").to_html(tmp.name) + with open(f"{output_name}.html", "a", encoding="utf-8") as outfile, \ + open(tmp.name, encoding="utf-8") as infile: + outfile.write(infile.read()) + os.unlink(tmp.name)
25-26
: Replace third-party “regex” with stdlib “re” (or document dependency clearly).Aliasing
regex
asre
is confusing and introduces an undeclared dependency; the patterns used don’t require theregex
package.-import regex as re +import reIf you truly need
regex
features later, import it asimport regex
(not aliased tore
) and add it to README/requirements.
31-43
: Deep-merge engine/model mappings across multiple JSON files.update()
overwrites earlier models.If two JSONs define the same engine with different models, later files clobber earlier ones. README encourages multiple JSONs in the same dir, so this loses data.
-def load_engine_model(): - """returns engine_model built from all json files in the current dir""" +def load_engine_model(): + """Return a merged engine_model from all JSON files in this directory.""" import glob import json engine_model = {} json_files = glob.glob( os.path.join(os.path.dirname(__file__) or ".", "*.json")) for fname in json_files: with open(fname, encoding="utf-8") as f: - engine_model.update(json.load(f)) + data = json.load(f) + for engine, models in data.items(): + if not isinstance(models, dict): + continue + if engine not in engine_model: + engine_model[engine] = {} + for model, mapping in models.items(): + if model not in engine_model[engine]: + engine_model[engine][model] = {} + # Later files can override specific patterns if duplicated. + engine_model[engine][model].update(mapping) return engine_model
58-79
: Write an empty summary CSV when the GPU trace CSV is empty to avoid downstream failures.Returning early means
gen_sum_file()
later asserts on a missingsum_file
. Always emit a CSV with the expected headers.def gen_nonoverlapped_sum_from_gputrace(self, in_file, out_file): logger.info("loading %s", in_file) df = self.pd.read_csv(in_file, usecols=["Start (ns)", "Duration (ns)", "Name"]) if df.empty: - return + self.pd.DataFrame( + columns=["Elapsed Time (sec)", "Total Time (sec)", "Instances", "Name"] + ).to_csv(out_file, index=False) + logger.warning("No GPU kernel records in %s; wrote empty summary %s", + in_file, out_file) + return
211-258
: Robustify nsys invocation and ensure the sum CSV is regenerated when needed.Current code:
- Doesn’t force CSV output for
nsys stats
.- Doesn’t check the subprocess return code or log stderr.
- Only generates the sum CSV when the stats CSV was regenerated; if stats exists but sum is missing/outdated, you skip generating the sum and then assert it exists.
- Uses
exit(1)
instead ofsys.exit(1)
.Fix all four and correct a log typo.
def gen_sum_file(self, file, nsys_cmd): @@ - if self.should_gen_file(nsys_stats_file, file): + if self.should_gen_file(nsys_stats_file, file): cmd = [ nsys_cmd, "stats", "-r", "cuda_gpu_trace", + "-f", + "csv", file, "-o", f"{file_dir}/{file_name}", ] cmd_str = " ".join(cmd) logger.info("+ %s", cmd_str) @@ - try: - subprocess.run(cmd) - except Exception: - logger.error("%s failed; Use --nsys_cmd to specify nsys path", - cmd_str) - exit(1) - logger.info("generating non-overalapped sum %s", sum_file) - self.gen_nonoverlapped_sum_from_gputrace(nsys_stats_file, sum_file) - self.is_valid_file(sum_file) + try: + res = subprocess.run(cmd, check=True, capture_output=True, text=True) + except FileNotFoundError: + logger.error("%s not found; Use --nsys_cmd to specify nsys path", nsys_cmd) + import sys + sys.exit(1) + except subprocess.CalledProcessError as e: + logger.error("nsys stats failed: %s\nstdout:\n%s\nstderr:\n%s", + cmd_str, e.stdout, e.stderr) + import sys + sys.exit(1) + if not (os.path.isfile(nsys_stats_file) and os.path.getsize(nsys_stats_file) > 0): + logger.error("Expected output %s not found or empty after running nsys.", nsys_stats_file) + import sys + sys.exit(1) + # (Re)generate the sum CSV if it is missing or older than the stats CSV. + if self.should_gen_file(sum_file, nsys_stats_file): + logger.info("generating non-overlapped sum %s", sum_file) + self.gen_nonoverlapped_sum_from_gputrace(nsys_stats_file, sum_file) + self.is_valid_file(sum_file)
🧹 Nitpick comments (8)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (8)
1-15
: Update copyright year to include 2025.Repo guideline says to prepend NVIDIA copyright header with the current year. This new file should read 2022-2025 (or just 2025 for a new file introduced this year).
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
15-19
: Tighten the module docstring and satisfy docstring style checks.Ruff flags D205/D415. Make the first line a concise summary ending with a period, followed by a blank line and details.
-""" - This generates gpu kernel analysis output from nsys rep. Will call nsys - stats -r cuda_gpu_trace, get non-overlapped gpu cycles, then generate - csv and html output for analysis -""" +""" +Generate GPU-kernel analysis outputs from an nsys .rep file. + +Calls `nsys stats -r cuda_gpu_trace` to compute non-overlapped GPU time, then emits CSV and HTML artifacts for analysis. +"""
176-185
: Ensure unmatched kernels get a category and precompile patterns for speed.
anno_gpu_kernname_helper
returnsNone
when no pattern matches, creating NaNs and dropping rows in pivoting. Also, compiling regexes once reduces per-row overhead.def anno_gpu_kernname(self, df, mapping): """add "Category" column""" - - def anno_gpu_kernname_helper(name): - for kern_name, val in mapping.items(): - if re.search(kern_name, name): - return val + compiled = [(re.compile(pat), cat) for pat, cat in mapping.items()] + + def anno_gpu_kernname_helper(name): + for rx, cat in compiled: + if rx.search(name): + return cat + return "uncategorized" df["Category"] = df["Name"].apply(anno_gpu_kernname_helper)
195-207
: Prefer explicit exceptions over asserts for file validity.
assert
can be stripped with-O
and is less informative for users. Consider raisingFileNotFoundError
/ValueError
with a clear message.- def is_valid_file(self, base_file): - """asserts if base_file is non-existent or is empty""" - assert (os.path.isfile(base_file) and os.path.getsize(base_file) - > 0), f"{base_file} doesn't exist or is empty" + def is_valid_file(self, base_file): + """Validate that base_file exists and is non-empty.""" + if not (os.path.isfile(base_file) and os.path.getsize(base_file) > 0): + raise FileNotFoundError(f"{base_file} doesn't exist or is empty.")
303-341
: Argparse help text drift and minor polish.
- The example includes an unknown engine
sglan
; likely meanttrtllm
.- Help string claims “Specify 0 to use elapsed time from nsys-rep” which the code does not implement (see prior comment). Either update the help or implement the behavior.
- "profiling used to calculate non-gpu time. Specify 0 to use " - "elapsed time from nsys-rep but that might inflate non-gpu time. " + "profiling used to calculate non-GPU time. Specify 0 to assume " + "non-GPU time is 0 (i.e., use GPU elapsed time only). " f"Available engine:[model] are: {engine_model_supported_str} " - f"Example: --in_file d1.nsys-rep,sglan,llama,100 " + f"Example: --in_file d1.nsys-rep,trtllm,llama,100 " "d2.nsys-rep,trtllm,gpt-oss,102"),Would you prefer I implement deriving wall time from
nsys stats -r summary
so that0
means “use nsys total elapsed”? If yes, I can push a patch.
172-175
: Switch to logging for final status message.Minor consistency tweak: use
logger.info
instead of- print(f"Finished generating: \n" - f" {output_name}.html for stack bar chart \n" - f" {output_name}.csv for Kernel-Category mapping") + logger.info("Finished generating:\n %s.html (stacked bar chart)\n %s.csv (Kernel-Category mapping)", + output_name, output_name)
317-322
: Defensive UX: handle missing JSONs gracefully in CLI help.If no JSON files are present,
engine_model_supported_str
becomes empty. Consider an explicit notice so users aren’t confused by “Available engine:[model] are: ” with nothing after it.- engine_model_supported_str = ", ".join( + engine_model_supported_str = ", ".join( f"{engine}:[{', '.join(models.keys())}]" for engine, models in engine_model_supported.items()) + if not engine_model_supported_str: + engine_model_supported_str = "None found (place *.json mappings next to this script)."
304-314
: Docstring/style nits across functions per Ruff hints.Several function docstrings lack terminal punctuation or the blank line after the summary (D200/D205/D415). Apply quick, local fixes as convenient; keeps release checks green.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
🪛 Ruff (0.12.2)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
15-18: 1 blank line required between summary line and description
(D205)
15-18: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
31-31: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
46-47: One-line docstring should fit on one line
Reformat to one line
(D200)
46-47: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
80-82: 1 blank line required between summary line and description
(D205)
80-82: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
119-119: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
174-174: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
184-184: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
192-192: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
197-197: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
209-211: 1 blank line required between summary line and description
(D205)
209-211: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
256-256: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
🔇 Additional comments (1)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (1)
81-119
: Non-overlap algorithm LGTM for a single “active interval” accounting.The sweep correctly attributes only “new coverage” per event; this aligns with the stated non-overlapped GPU-time goal.
@kaiyux fixed the issue! It's taking a long time waiting for blossom-ci which is strange since my script doesn't touch anything internal to trtllm. |
/bot skip --comment "the tool is not protected by CI pipeline" |
PR_Github #16243 [ skip ] triggered by Bot |
PR_Github #16243 [ skip ] completed with state |
This script processes NVIDIA Nsight Systems (nsys) GPU trace files (.nsys-rep) with -t cuda tracing enabled, and generates kernel-level summaries and visualizations of GPU and non-GPU time. It is useful for profiling and analyzing nsys profile output. Supported models currently include llama, deepseek, and gpt-oss.
Run command example:
python3 gputrc2graph.py
--in_file run1.nsys-rep,trtllm,llama,103
--title "TRTLLM-gpt-oss profile"
The command will produce 2 files for analysis:
result.html: this categorizes kernel names into different categories in a stacked bar chart.
result.csv: shows how the kernel names are mapped to the different categories.
Detailed explanation is in the accompanying README.md
Summary by CodeRabbit
New Features
Documentation