[None] [feat] nsys profile output kernel classifier #7020
Conversation
Signed-off-by: Grace Ho <grho@nvidia.com>
📝 Walkthrough

Adds a new CLI tool that converts NVIDIA Nsight Systems CUDA traces into non-overlapped per-kernel GPU timing summaries, categorizes kernels via engine/model JSON mappings, and produces CSV and HTML visualizations; includes a JSON mapping file and README usage docs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant Tool as gputrc2graph.py
    participant NSYS as nsys (stats)
    participant FS as Filesystem
    participant Plot as Plotly/HTML
    User->>Tool: run --in_file <nsys>,<engine>,<model>,<total_sec> --out_dir --title [--nsys_cmd]
    Tool->>FS: load engine/model JSON mappings
    loop per input tuple
        Tool->>FS: check for cached summed CSV
        alt cached up-to-date
            Tool->>FS: read summed CSV
        else need generation
            Tool->>NSYS: run nsys stats to produce raw CUDA trace CSV
            NSYS-->>Tool: raw GPU trace CSV
            Tool->>Tool: compute non-overlapping kernel intervals & aggregate
            Tool->>FS: write per-file summed CSV
        end
        Tool->>Tool: annotate kernels with categories
        Tool->>Tool: compute/add non-GPU row (total_sec - GPU_time)
    end
    Tool->>FS: write combined result.csv
    Tool->>Plot: build histogram and pivot table HTML
    Plot-->>FS: write result.html
    Tool-->>User: emit result.csv and result.html
```
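The "compute non-overlapping kernel intervals" step above can be sketched as a sweep over start-sorted intervals (an illustrative standalone sketch, not the PR's exact implementation):

```python
def nonoverlap_elapsed_ns(intervals):
    """Sum kernel busy time, counting overlapping GPU intervals only once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint interval: flush the previous merged span
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping interval: extend the merged span
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total

# Two overlapping kernels [0,10) and [5,20) plus a disjoint [30,40)
print(nonoverlap_elapsed_ns([(0, 10), (5, 20), (30, 40)]))  # 30
```

This is why the per-kernel "Elapsed Time" columns in the summed CSV can total less than the sum of raw kernel durations: concurrent kernels are only counted once.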
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 9
🧹 Nitpick comments (16)
examples/profiler/nsys_profile_tools/README.md (4)
63-67: Minor wording: “till” → “until”; tighten description

Nit to improve clarity and tone.

```diff
-  - DELAY: how many seconds to delay nsys from collecting profiles, needed so
-    that profiles aren't captured till trtllm server has come up and load
-    generation starts.
+  - DELAY: how many seconds to delay nsys before collecting profiles (to avoid
+    capturing startup), until the trtllm server is up and load generation begins.
```
83-86: Tighten phrasing and fix minor grammar in outputs section

Small clarity and grammar fixes for the outputs and description of the HTML view.

```diff
-- result.html: this categorizes kernel names into different categories in a
-  stacked bar chart.
+- result.html: a stacked bar chart categorizing kernel names into groups.
 - result.csv: shows how the kernel names are mapped to the different categories.
-The html file shows the number of elapsed seconds due to different GPU
-Substages or categories, which consist of moe_gemm as the biggest
-category, at .14 seconds, followed by "attn" kernels. This lets the user
-prioritize the kernels to focus on for performance optimizations.
+The HTML file shows elapsed seconds by category (e.g., "moe_gemm", "attn"),
+helping you quickly prioritize optimization targets.
-There's also an appended data table underneath the bar chart for copying out to
- other post-processing tools.
+There is also a data table appended below the chart for easy copy/paste into
+other tools.
```

Also applies to: 90-94, 97-101
124-127: Grammar nit: “stack” → “stacked”; simplify sentence

```diff
-The analysis process is similar to example 1 but now there will be multiple
-stack bar charts that can be compared. The categories for the different
+The analysis process is similar to Example 1, but now there will be multiple
+stacked bar charts that can be compared. The categories for the different
 kernels will remain the same, so that it's easy to compare the GPU cycles
 for the same categories.
```
141-146: Fix typo (“kernelshave”) and tighten instructions

```diff
-Then, for this new model, suppose there are 4 kernels to be classified into
-"gemm" and "attn", where the gemm kernelshave names with "*H*" or "*I*" in
-them, and attn kernels have names with "*J*" or "*K*" in them, just add another
- .json file in the same directory as gputrc2graph.py with the same format as
- the other json files, like the following:
+Then, for this new model, suppose there are 4 kernels to be classified into
+"gemm" and "attn". If gemm kernels have names containing "*H*" or "*I*", and
+attn kernels contain "*J*" or "*K*", add another .json file in the same
+directory as gputrc2graph.py with the same format as the other json files:
```

examples/profiler/nsys_profile_tools/gputrc2graph.py (11)
1-1: Update copyright year to include 2025

Per coding guidelines, prepend NVIDIA copyright header with the current year.

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
15-19: Docstring: fix report name, punctuation, and style

The opening docstring references `cuda_gpu_kern_trace`, but the code uses `cuda_gpu_trace`. Also, adopt a concise summary line with closing punctuation.

```diff
-"""
-    This generates gpu kernel analysis output from nsys rep. Will call nsys
-    stats -r cuda_gpu_kern_trace, get non-overlapped gpu cycles, then generate
-    csv and html output for analysis
-"""
+"""
+Generate GPU kernel analysis outputs from Nsight Systems (.nsys-rep) traces.
+
+This module calls `nsys stats -r cuda_gpu_trace` to export the CUDA GPU kernel
+trace CSV, computes a non-overlapped elapsed time per kernel name, and then
+emits a CSV summary and an HTML stacked bar chart for analysis.
+"""
```
25-26: Consider using built-in `re` or documenting the `regex` dependency

You’re importing the third-party `regex` module as `re`. Unless you rely on advanced features from `regex`, the built-in `re` suffices and avoids an extra dependency. If you keep `regex`, ensure it’s documented in the README (I added a note there).

```diff
-import regex as re
+import re
```
31-44: Robust JSON loader: deep-merge engines and add basic validation

If multiple JSON files define the same engine key, `dict.update` overwrites the earlier one. Prefer deep-merge so multiple files can contribute models under the same engine. Also, validate JSON shape and log file names for traceability.

```diff
 def load_engine_model():
-    """returns engine_model built from all json files in the current dir"""
+    """Return the combined engine→model mapping from all JSON files in this directory."""
     import glob
     import json
     engine_model = {}
     json_files = glob.glob(
         os.path.join(os.path.dirname(__file__) or ".", "*.json"))
     for fname in json_files:
-        with open(fname, encoding="utf-8") as f:
-            engine_model.update(json.load(f))
+        with open(fname, encoding="utf-8") as f:
+            data = json.load(f)
+        if not isinstance(data, dict):
+            logger.warning("Skipping %s: top-level JSON must be an object.", fname)
+            continue
+        # Deep-merge engines
+        for eng, models in data.items():
+            if not isinstance(models, dict):
+                logger.warning("Skipping %s: engine %r must map to an object.", fname, eng)
+                continue
+            engine_model.setdefault(eng, {})
+            engine_model[eng].update(models)
+        logger.info("Loaded engine/model definitions from %s", fname)
     return engine_model
```
51-56: Guard pandas copy-on-write for older pandas versions

`pd.options.mode.copy_on_write` exists in pandas ≥ 2.0. On older versions this raises AttributeError.

```diff
 def __init__(self):
     import pandas as pd  # avoid importing till needed
     self.pd = pd
-    self.pd.options.mode.copy_on_write = True
+    try:
+        # pandas ≥ 2.0
+        self.pd.options.mode.copy_on_write = True
+    except Exception:
+        # Best-effort; safe to ignore for older pandas
+        pass
```
120-170: Avoid temp file in CWD; write temp HTML in output_dir and stack bars explicitly

- Writing “temp.html” in CWD can conflict across runs. Use a NamedTemporaryFile in output_dir.
- Ensure stacked bars via `barmode="stack"` for clarity.

```diff
 def make_html(self, df, output_dir, title):
     """make html graph from df"""
     import plotly.express as px
+    import tempfile
     if df.empty:
         return
     output_name = os.path.join(output_dir, "result")
     if not title:
         title = "Model_Engine"
     x = "Model_Engine"
     y = "Elapsed Time (sec)"
     color = "Category"
     """ generate kernel mapping table """
     # Sort Model_Engine categories by last field after underscore
     df["Model_Engine"] = self.pd.Categorical(
         df["Model_Engine"],
         sorted(df["Model_Engine"].unique(), key=lambda x: x.split("_")[-1]),
     )
     df[["Model_Engine", color, "Instances", "Name",
         y]].sort_values(by=color).to_csv(f"{output_name}.csv", index=False)
     graph = px.histogram(
         df.round(2),
         x=x,
         y=y,
         title=(f"{y} for {title}"),
         color=color,
         text_auto=True,
     )
     # wrap x axis labels
     graph.update_xaxes(automargin=True)
+    graph.update_layout(barmode="stack")
     graph.write_html(f"{output_name}.html")
     """ Generate data table with columns per Model_Engine into result.html """
     pivot_df = df.pivot_table(
         values="Elapsed Time (sec)",
         index="Category",
         columns="Model_Engine",
         aggfunc="sum",
         observed=False,
     ).round(2)
     # Add sum row at bottom
     pivot_df.loc["total_elapsed_sec"] = pivot_df.sum()
-    pivot_df.fillna("").to_html("temp.html")
-    with (
-            open(f"{output_name}.html", "a", encoding="utf-8") as outfile,
-            open("temp.html", encoding="utf-8") as infile,
-    ):
-        outfile.write(infile.read())
-    os.remove("temp.html")
+    with tempfile.NamedTemporaryFile(
+            mode="w+", suffix=".html", dir=output_dir, delete=False,
+            encoding="utf-8") as tmp:
+        pivot_df.fillna("").to_html(buf=tmp.name)
+        tmp.flush()
+    with open(f"{output_name}.html", "a", encoding="utf-8") as outfile, \
+         open(tmp.name, encoding="utf-8") as infile:
+        outfile.write(infile.read())
+    os.remove(tmp.name)
-    print(f"Finished generating: \n"
-          f"  {output_name}.html for stack bar chart \n"
-          f"  {output_name}.csv for Kernel-Category mapping")
+    logger.info("Finished generating: %s.html (stacked bar chart)", output_name)
+    logger.info("Finished generating: %s.csv (Kernel-Category mapping)", output_name)
```
175-184: Precompile regex patterns once to speed up classification

Optional perf improvement: compile patterns per mapping once instead of re.search per row per pattern. This matters if the number of kernel names grows.

```diff
 def anno_gpu_kernname(self, df, mapping):
-    """add "Category" column"""
-
-    def anno_gpu_kernname_helper(name):
-        for kern_name, val in mapping.items():
-            if re.search(kern_name, name):
-                return val
-
-    df["Category"] = df["Name"].apply(anno_gpu_kernname_helper)
+    """Add a 'Category' column by matching kernel names against regex patterns."""
+    compiled = [(re.compile(pat), val) for pat, val in mapping.items()]
+
+    def classify(name: str):
+        for pat, val in compiled:
+            if pat.search(name):
+                return val
+        return None
+
+    df["Category"] = df["Name"].apply(classify)
```
198-209: Docstring punctuation and reuse message

Minor fix to docstring punctuation; logic looks fine.

```diff
 def should_gen_file(self, new_file, base_file):
-    """figure out if new file should be generated from base_file"""
+    """Return True if new_file should be (re)generated from base_file."""
     self.is_valid_file(base_file)
     if (os.path.exists(new_file)
             and (os.path.getmtime(new_file) > os.path.getmtime(base_file))
             and (os.path.getsize(base_file) > 0)):
         logger.info("reusing %s", new_file)
         return False
     else:
         logger.info("generating %s", new_file)
         return True
```
298-300: Be more tolerant of whitespace in --in_file tuples

Strip parts to avoid subtle parsing issues.

```diff
 def parse_tuple(s):
-    return tuple(s.split(","))
+    return tuple(part.strip() for part in s.split(","))
```
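A quick standalone check of the stripped-tuple behavior suggested above (sketch, not the PR's code):

```python
def parse_tuple(s):
    # Strip whitespace around each comma-separated part so
    # "--in_file a, b, c" parses the same as "--in_file a,b,c"
    return tuple(part.strip() for part in s.split(","))

print(parse_tuple("d1.nsys-rep, trtllm , llama, 100"))
# ('d1.nsys-rep', 'trtllm', 'llama', '100')
```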
303-313: Help text example: fix engine name typo

“sglan” seems unintentional; use “trtllm” to match the provided engine JSON.

```text
"gputrc2graph.py --in_file d1.nsys-rep,trtllm,llama,100 \n"
"d2.nsys-rep,trtllm,gpt-oss,102 "
```
347-349: Minor: prefer `sys.exit(main())` pattern for CLI

Current usage is fine; no change required. Just a note for future consistency.
examples/profiler/nsys_profile_tools/trtllm_engine_model.json (1)

1-63: Include AllGather in NCCL patterns to match README and common kernels

The README example calls out AllGather, but patterns for NCCL currently omit it for some engines. Suggest appending `|AllGather` to the NCCL patterns to ensure those kernels are categorized correctly.

```diff
@@
-    "ncclDevKernel|AllReduce": "nccl_and_custom_ar",
+    "ncclDevKernel|AllReduce|AllGather": "nccl_and_custom_ar",
@@
-    "ncclDevKernel|cross_device_reduce|AllReduce": "nccl_and_custom_ar",
+    "ncclDevKernel|cross_device_reduce|AllReduce|AllGather": "nccl_and_custom_ar",
@@
-    "ncclDevKernel|cross_device_reduce|AllReduce": "nccl_and_custom_ar",
+    "ncclDevKernel|cross_device_reduce|AllReduce|AllGather": "nccl_and_custom_ar",
```

Also: great having a catch-all `".*": "misc"` to avoid None categories.
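To illustrate how these JSON keys behave as alternation patterns (a minimal sketch assuming the first-match-wins `re.search` semantics described in this review; the mapping below is illustrative, not the shipped file):

```python
import re

# Hypothetical mapping in the JSON style discussed above: regex -> category
mapping = {
    "ncclDevKernel|AllReduce|AllGather": "nccl_and_custom_ar",
    "gemm": "gemm",
    ".*": "misc",  # catch-all so no kernel ends up with a None category
}

def classify(name):
    # First pattern that matches wins (dicts preserve insertion order)
    for pattern, category in mapping.items():
        if re.search(pattern, name):
            return category

print(classify("ncclDevKernel_AllGather_RING"))  # nccl_and_custom_ar
print(classify("sm90_xmma_gemm_f16"))            # gemm
print(classify("elementwise_kernel"))            # misc
```

Because the catch-all `".*"` is ordered last, it only applies when no earlier pattern matched.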
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (3)
- `examples/profiler/nsys_profile_tools/images/csv.png` is excluded by `!**/*.png`
- `examples/profiler/nsys_profile_tools/images/html.png` is excluded by `!**/*.png`
- `examples/profiler/nsys_profile_tools/images/html_tbl.png` is excluded by `!**/*.png`
📒 Files selected for processing (3)
- `examples/profiler/nsys_profile_tools/README.md` (1 hunks)
- `examples/profiler/nsys_profile_tools/gputrc2graph.py` (1 hunks)
- `examples/profiler/nsys_profile_tools/trtllm_engine_model.json` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
examples/profiler/nsys_profile_tools/gputrc2graph.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
examples/profiler/nsys_profile_tools/gputrc2graph.py
🪛 LanguageTool
examples/profiler/nsys_profile_tools/README.md
[grammar] ~5-~5: There might be a mistake here.
Context: ...f GPU and non-GPU time. It is useful for profiling and analyzing nsys profile out...
(QB_NEW_EN)
[grammar] ~39-~39: There might be a mistake here.
Context: ...vidia.com/nsight-systems/get-started) is installed, and specify the path to the `...
(QB_NEW_EN)
[grammar] ~41-~41: There might be a mistake here.
Context: ...--nsys_cmd if it is not in your PATH. - For more details on available engines an...
(QB_NEW_EN)
[grammar] ~53-~53: There might be a mistake here.
Context: ...rtllm: 1. Run the following command to collect nsys profile, for trtllm serve config. ...
(QB_NEW_EN)
[style] ~64-~64: Consider using the more formal “until”.
Context: ...d so that profiles aren't captured till trtllm server has come up and load ...
(TILL)
[grammar] ~65-~65: There might be a mistake here.
Context: ...come up and load generation starts. - DURATION: how many seconds for nsys prof...
(QB_NEW_EN)
[grammar] ~84-~84: There might be a mistake here.
Context: ...ent categories in a stacked bar chart. - result.csv: shows how the kernel names a...
(QB_NEW_EN)
[grammar] ~91-~91: There might be a mistake here.
Context: ...which consist of moe_gemm as the biggest category, at .14 seconds, followed by "a...
(QB_NEW_EN)
[grammar] ~92-~92: There might be a mistake here.
Context: ...ed by "attn" kernels. This lets the user prioritize the kernels to focus on for p...
(QB_NEW_EN)
[grammar] ~105-~105: There might be a mistake here.
Context: ...result.csv to dive into what the kernels are which compose the nccl GPU cycles. T...
(QB_NEW_EN)
[grammar] ~106-~106: There might be a mistake here.
Context: ...U cycles. The following image shows that ar_fusion all reduce kernel to be the bi...
(QB_NEW_EN)
[grammar] ~114-~114: There might be a mistake here.
Context: ...ce files, captured for different models, say llama and gpt-oss in this case, and ...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ...example 1 but now there will be multiple stack bar charts that can be compared. ...
(QB_NEW_EN)
[grammar] ~125-~125: There might be a mistake here.
Context: ...pared. The categories for the different kernels will remain the same, so that it...
(QB_NEW_EN)
[grammar] ~126-~126: There might be a mistake here.
Context: ... it's easy to compare the GPU cycles for the same categories. Once a category is...
(QB_NEW_EN)
[grammar] ~129-~129: There might be a mistake here.
Context: ...e more cycles for one configuration than another, the next step would be to use t...
(QB_NEW_EN)
[grammar] ~130-~130: There might be a mistake here.
Context: ...use the csv file to see what kernels are mapped into that category, and which ker...
(QB_NEW_EN)
[grammar] ~131-~131: There might be a mistake here.
Context: ...kernels are taking the largest amount of time which would cause a difference for ...
(QB_NEW_EN)
[grammar] ~137-~137: There might be a mistake here.
Context: ...the same format as the other json files. The script will automatically pick up al...
(QB_NEW_EN)
[grammar] ~141-~141: There might be a mistake here.
Context: ...here are 4 kernels to be classified into "gemm" and "attn", where the gemm kernel...
(QB_NEW_EN)
[grammar] ~142-~142: There might be a mistake here.
Context: ...kernelshave names with "H" or "I" in them, and attn kernels have names with "...
(QB_NEW_EN)
[grammar] ~162-~162: There might be a mistake here.
Context: ...ey: a regex used to classify the kernels - value: the category to classify the kern...
(QB_NEW_EN)
[grammar] ~165-~165: There might be a mistake here.
Context: ...engine/models, consisting of CUDA memory operations and a 'misc' for anything tha...
(QB_NEW_EN)
[grammar] ~168-~168: There might be a mistake here.
Context: ... a trace file with this new model/engine like the following: ```bash --in_file n...
(QB_NEW_EN)
[grammar] ~176-~176: There might be a mistake here.
Context: ...ing engine file, after the other models.
(QB_NEW_EN)
🪛 Ruff (0.12.2)
examples/profiler/nsys_profile_tools/gputrc2graph.py
15-18: 1 blank line required between summary line and description
(D205)
15-18: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
31-31: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
46-47: One-line docstring should fit on one line
Reformat to one line
(D200)
46-47: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
79-81: 1 blank line required between summary line and description
(D205)
79-81: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
118-118: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
173-173: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
183-183: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
191-191: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
196-196: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
208-210: 1 blank line required between summary line and description
(D205)
208-210: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
255-255: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
examples/profiler/nsys_profile_tools/README.md (1)
105-109: Align terminology: include AllGather in NCCL example (and mapping)

The text calls out AllGather, but the JSON patterns currently don’t match AllGather for all engines. I suggested JSON updates to include `AllGather` in the NCCL pattern. See the JSON review comment.

examples/profiler/nsys_profile_tools/gputrc2graph.py (1)
315-345: Expose supported engines/models in help and keep errors user-friendly

Good addition. Consider also logging the supported combos on ValueError in `gen_graph` (addressed in another comment). Otherwise, this block looks solid.
Signed-off-by: Grace Ho <grho@nvidia.com>
Actionable comments posted: 2
♻️ Duplicate comments (6)
examples/profiler/nsys_profile_tools/README.md (2)
36-41: Document all runtime dependencies (include plotly and the third-party regex).

The script imports pandas, plotly.express, and the third-party regex package. Only pandas/plotly are mentioned, and regex is missing.

Apply this diff to the Notes:

````diff
-  - Make sure you have pandas and plotly python packages installed.
+  - Make sure you have the following Python packages installed:
+    - pandas
+    - plotly
+    - regex (third-party regex engine; imported as `regex`)
+    You can install them with:
+    ```bash
+    pip install pandas plotly regex
+    ```
   - Make sure [nsys](https://developer.nvidia.com/nsight-systems/get-started) is
     installed, and specify the path to the `nsys` command with `--nsys_cmd` if it
     is not in your PATH.
````
19-21: Fix the 0-seconds behavior description (it currently contradicts the code).

When `elapsed_nonprofiled_sec` is 0, the code uses total = GPU elapsed seconds, which yields CPU(non-GPU) = 0. The text says it “may inflate non-GPU time,” which is the opposite of what happens.

Apply this diff to clarify the behavior:

```diff
-  - `elapsed_nonprofiled_sec`: Wall-clock runtime (in seconds) without
-    profiling. Specify `0` to use the elapsed GPU time calculated from the
-    nsys-rep file (this may inflate non-GPU time if actual runtime without
-    profiling is less). Multiple entries can be provided, separated by spaces.
+  - `elapsed_nonprofiled_sec`: Wall-clock runtime (in seconds) without
+    profiling. Specify `0` to use the GPU-elapsed time from the nsys trace as
+    the total elapsed time. This yields CPU(non-GPU) = 0. Prefer providing the
+    actual wall-clock time (without profiling) to compute a realistic non-GPU
+    estimate. Multiple entries can be provided, separated by spaces.
```

examples/profiler/nsys_profile_tools/gputrc2graph.py (4)
186-192: Don’t depend on copying the last row to build the non-GPU row.

Copying `df.iloc[-1]` is fragile (fails on empty frames; may inherit unintended values). Construct the row explicitly and pass the label fields in.

Apply this diff:

```diff
-    def make_nongpu_row(self, df, nongpu_sec):
-        """this will append non-gpu time entry at end of df"""
-        nongpu_row = self.pd.DataFrame([df.iloc[-1]])
-        nongpu_row["Category"] = nongpu_row["Name"] = "CPU(non-GPU)"
-        nongpu_row["Instances"] = 1
-        nongpu_row["Elapsed Time (sec)"] = nongpu_sec
-        return nongpu_row
+    def make_nongpu_row(self, model_engine, nongpu_sec):
+        """Create a single-row DataFrame representing non-GPU (CPU) time."""
+        return self.pd.DataFrame([{
+            "Model_Engine": model_engine,
+            "Category": "CPU(non-GPU)",
+            "Name": "CPU(non-GPU)",
+            "Instances": 1,
+            "Elapsed Time (sec)": float(nongpu_sec),
+        }])
```
194-198: Avoid assert for runtime validation; raise concrete exceptions.

`assert` can be stripped with `-O`, turning this into a no-op. Prefer explicit checks.

Apply this diff:

```diff
-    def is_valid_file(self, base_file):
-        """asserts if base_file is non-existent or is empty"""
-        assert (os.path.isfile(base_file) and os.path.getsize(base_file)
-                > 0), f"{base_file} doesn't exist or is empty"
+    def is_valid_file(self, base_file):
+        """Raise if base_file is missing or empty."""
+        if not os.path.isfile(base_file):
+            raise FileNotFoundError(f"{base_file} doesn't exist")
+        if os.path.getsize(base_file) <= 0:
+            raise ValueError(f"{base_file} is empty")
```
228-257: Detect nsys failures and regenerate the sum when stale; avoid silent errors.

- `subprocess.run(cmd)` won’t raise on non-zero exit, so failures go undetected until later.
- If `nsys_stats_file` exists but `sum_file` is missing or older, the current code won’t regenerate the sum.

Apply this diff:

```diff
         if self.should_gen_file(nsys_stats_file, file):
             ...
             try:
-                subprocess.run(cmd)
-            except Exception:
-                logger.error("%s failed; Use --nsys_cmd to specify nsys path",
-                             cmd_str)
-                exit(1)
-        logger.info("generating non-overalapped sum %s", sum_file)
-        self.gen_nonoverlapped_sum_from_gputrace(nsys_stats_file, sum_file)
+                subprocess.run(cmd, check=True, capture_output=True, text=True)
+            except subprocess.CalledProcessError as e:
+                logger.error("%s failed (code %s): %s", cmd_str, e.returncode,
+                             e.stderr.strip())
+                raise SystemExit(1)
+        # (Re)generate the non-overlapped kernel summary if missing or stale
+        if self.should_gen_file(sum_file, nsys_stats_file):
+            logger.info("generating non-overlapped sum %s", sum_file)
+            self.gen_nonoverlapped_sum_from_gputrace(nsys_stats_file, sum_file)
         self.is_valid_file(sum_file)
         logger.info("Finished generating %s", sum_file)
         return sum_file
```
268-289: Avoid premature rounding; update non-GPU row construction to match new helper.

- Rounding before comparison can cause `total_sec < gpu_sec` anomalies.
- Use the new `make_nongpu_row(model_engine, ...)` signature.
- Keep a `model_engine` variable for reuse.

Apply this diff:

```diff
         sum_file = self.gen_sum_file(file, nsys_cmd)
         # read kernel summary file
         df = self.pd.read_csv(sum_file)
         # annotate kernel to their categories
         assert engine_model.get(engine), f"engine {engine} unknown"
         assert engine_model[engine].get(model), f"model {model} unknown"
         # remove nsys-rep from file_name for shorter x-label
         file_name = file_name.replace(".nsys-rep", "")
-        df["Model_Engine"] = f"{model}_{engine}_{file_name}_{idx}"
+        model_engine = f"{model}_{engine}_{file_name}_{idx}"
+        df["Model_Engine"] = model_engine
         self.anno_gpu_kernname(df, engine_model[engine][model])
         # patch in non-gpu time
-        gpu_sec = round(df["Elapsed Time (sec)"].sum(), 1)
-        total_sec = round(float(total_sec), 1)
-        if total_sec < gpu_sec:
+        gpu_sec = float(df["Elapsed Time (sec)"].sum())
+        total_sec_val = float(total_sec)
+        if total_sec_val < gpu_sec:
             logger.warning(
                 "Elapsed sec %.2f < GPU sec %.2f resetting Elapsed sec ",
-                total_sec,
+                total_sec_val,
                 gpu_sec,
             )
-            total_sec = gpu_sec
-        nongpu_row = self.make_nongpu_row(df, total_sec - gpu_sec)
+            total_sec_val = gpu_sec
+        nongpu_row = self.make_nongpu_row(model_engine,
+                                          total_sec_val - gpu_sec)
         df = self.pd.concat([df, nongpu_row], ignore_index=True)
```

Note: Consider replacing the `assert` lines with explicit `ValueError` to avoid optimization stripping, but I left them unchanged here to minimize the diff.
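The non-GPU accounting above reduces to simple arithmetic; here is a standalone sketch of the clamping behavior (a hypothetical helper, not the PR's code):

```python
def nongpu_sec(total_sec: float, gpu_sec: float) -> float:
    """Estimated CPU(non-GPU) seconds; a total of 0 means 'use GPU time'."""
    if total_sec == 0 or total_sec < gpu_sec:
        # Fall back to the GPU-elapsed time as the total
        total_sec = gpu_sec
    return total_sec - gpu_sec

print(nongpu_sec(0, 12.5))    # 0.0 -- non-GPU time is zero, not inflated
print(nongpu_sec(15.0, 12.5))
```

This also shows why unrounded comparison matters: clamping happens before the subtraction, so the result can never go negative.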
🧹 Nitpick comments (7)
examples/profiler/nsys_profile_tools/README.md (3)
61-65: Minor grammar and clarity nits for Step 1 bullets.

Tighten wording and fix “till”.

Apply this diff:

```diff
-  - DELAY: how many seconds to delay nsys from collecting profiles, needed so
-    that profiles aren't captured till trtllm server has come up and load
-    generation starts.
-  - DURATION: how many seconds for nsys profile to run before generating the
-    profile. This should be > the duration of the run.
+  - DELAY: how many seconds to delay nsys before collecting profiles, so
+    that profiles aren't captured until the trtllm server is up and load
+    generation has started.
+  - DURATION: how many seconds for nsys to collect the profile before it
+    stops. This should be greater than the duration of the run.
```
102-107: Fix phrasing in the result.csv explanation.

Improve readability and terminology (AllReduce capitalization).

Apply this diff:

```diff
-Suppose the user would like to focus on improving decreasing calls to nccl
-kernels. The next step is to use the result.csv to dive into what the kernels
-are which compose the nccl GPU cycles. The following image shows that
-ar_fusion all reduce kernel to be the biggest contributor to GPU cycles for
-nccl, followed by AllGather.
+Suppose you want to reduce time spent in NCCL kernels.
+Use result.csv to drill into which kernels comprise NCCL GPU time.
+The following image shows the `ar_fusion` AllReduce kernel as the biggest
+contributor to NCCL GPU time, followed by AllGather.
```
139-142: Fix typo: “kernelshave” → “kernels have”.

Apply this diff:

```diff
-Then, for this new model, suppose there are 4 kernels to be classified into
-"gemm" and "attn", where the gemm kernelshave names with "*H*" or "*I*" in
-them, and attn kernels have names with "*J*" or "*K*" in them, just add another
+Then, for this new model, suppose there are 4 kernels to be classified into
+"gemm" and "attn", where the gemm kernels have names with "*H*" or "*I*" in
+them, and attn kernels have names with "*J*" or "*K*" in them. Just add another
```

examples/profiler/nsys_profile_tools/gputrc2graph.py (4)
86-105: Avoid assert, handle empty frames, and use logger for progress.

- `assert not df.empty` can be stripped with -O and is unnecessary; handle empty explicitly.
- Replace print-based progress with `logger.info` and reduce verbosity.

Apply this diff:

```diff
     logger.info("sorting %s trace records by start time", str(df.shape))
-    assert not df.empty, 'empty nsys records'
     # Sort by start time and reset index
     df = df.sort_values(by="Start (ns)").reset_index(drop=True)
+    if df.empty:
+        df["Elapsed Time (ns)"] = self.pd.Series(dtype="int64")
+        return df
+
     # Initialize elapsed time as duration
     df["Elapsed Time (ns)"] = df["Duration (ns)"]
     # Get numpy arrays for faster operations
     starts = df["Start (ns)"].values
     ends = df["End (ns)"].values
     # Keep track of current interval end
     current_end = ends[0]
-    display_units = max(1, int(len(df) / 100))
+    n = len(df)
+    display_step = max(1, n // 10)
     # Update current_end for overlapping intervals
-    for i in range(1, len(df)):
-        if i % display_units == 0:
-            print(f"processing trace: {int(i/len(df) * 100)} %", end="\r")
+    for i in range(1, n):
+        if i % display_step == 0:
+            logger.info("processing trace: %d %%", int(i / n * 100))
```
164-175: Avoid writing temp files to CWD; use output_dir and logger instead of print.

Writing "temp.html" to the current working directory can be surprising; keep all artifacts under output_dir and replace print with logging.

Apply this diff:

```diff
-        pivot_df.fillna("").to_html("temp.html")
-        with (
-            open(f"{output_name}.html", "a", encoding="utf-8") as outfile,
-            open("temp.html", encoding="utf-8") as infile,
-        ):
+        tmp_html = f"{output_name}.tmp.html"
+        pivot_df.fillna("").to_html(tmp_html)
+        with (
+            open(f"{output_name}.html", "a", encoding="utf-8") as outfile,
+            open(tmp_html, encoding="utf-8") as infile,
+        ):
             outfile.write(infile.read())
-        os.remove("temp.html")
+        os.remove(tmp_html)

-        print(f"Finished generating: \n"
-              f"  {output_name}.html for stack bar chart \n"
-              f"  {output_name}.csv for Kernel-Category mapping")
+        logger.info("Finished generating:\n  %s.html (stacked bar chart)\n  %s.csv (Kernel-Category mapping)",
+                    output_name, output_name)
```
176-185: Prefer importing regex as 'regex' to avoid confusion with the stdlib 're'.

Aliasing the third-party regex engine as "re" can surprise readers. Use `import regex` and call `regex.search(...)`.

Apply this diff:

```diff
-import regex as re
+import regex
```

And here:

```diff
-        if re.search(kern_name, name):
+        if regex.search(kern_name, name):
             return val
```
331-333: Fix the help example: replace 'sglan' with a supported engine.

The example uses `sglan`, which isn't one of the engines discovered from JSON. Replace with `trtllm` to match the README and shipped JSON.

Apply this diff:

```diff
-        f"Example: --in_file d1.nsys-rep,sglan,llama,100 "
+        f"Example: --in_file d1.nsys-rep,trtllm,llama,100 "
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
examples/profiler/nsys_profile_tools/README.md(1 hunks)examples/profiler/nsys_profile_tools/gputrc2graph.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
examples/profiler/nsys_profile_tools/gputrc2graph.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
examples/profiler/nsys_profile_tools/gputrc2graph.py
🪛 Ruff (0.12.2)
examples/profiler/nsys_profile_tools/gputrc2graph.py
15-18: 1 blank line required between summary line and description
(D205)
15-18: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
31-31: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
46-47: One-line docstring should fit on one line
Reformat to one line
(D200)
46-47: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
80-82: 1 blank line required between summary line and description
(D205)
80-82: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
119-119: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
174-174: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
184-184: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
192-192: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
197-197: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
209-211: 1 blank line required between summary line and description
(D205)
209-211: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
256-256: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
🪛 LanguageTool
examples/profiler/nsys_profile_tools/README.md
[grammar] ~3-~3: There might be a mistake here.
Context: ... Nsight Systems (nsys) GPU trace files (.nsys-rep) with -t cuda tracing enabl...
(QB_NEW_EN)
[grammar] ~4-~4: There might be a mistake here.
Context: ...cing enabled, and generates kernel-level summaries and visualizations of GPU and ...
(QB_NEW_EN)
[grammar] ~5-~5: There might be a mistake here.
Context: ...f GPU and non-GPU time. It is useful for profiling and analyzing nsys profile out...
(QB_NEW_EN)
[grammar] ~36-~36: There might be a mistake here.
Context: ...system PATH. ## Notes - Make sure you have pandas and plotly python packages insta...
(QB_NEW_EN)
[grammar] ~36-~36: There might be a mistake here.
Context: ...as and plotly python packages installed. - Make sure [nsys](https://developer.nvidi...
(QB_NEW_EN)
[grammar] ~37-~37: There might be a mistake here.
Context: ...vidia.com/nsight-systems/get-started) is installed, and specify the path to the `...
(QB_NEW_EN)
[grammar] ~39-~39: There might be a mistake here.
Context: ...--nsys_cmd if it is not in your PATH. - For more details on available engines an...
(QB_NEW_EN)
[grammar] ~51-~51: There might be a mistake here.
Context: ...rtllm: 1. Run the following command to collect nsys profile, for trtllm serve config. ...
(QB_NEW_EN)
[style] ~62-~62: Consider using the more formal “until”.
Context: ...d so that profiles aren't captured till trtllm server has come up and load ...
(TILL)
[grammar] ~63-~63: There might be a mistake here.
Context: ...come up and load generation starts. - DURATION: how many seconds for nsys prof...
(QB_NEW_EN)
[grammar] ~82-~82: There might be a mistake here.
Context: ...ent categories in a stacked bar chart. - result.csv: shows how the kernel names a...
(QB_NEW_EN)
[grammar] ~89-~89: There might be a mistake here.
Context: ...which consist of moe_gemm as the biggest category, at .14 seconds, followed by "a...
(QB_NEW_EN)
[grammar] ~90-~90: There might be a mistake here.
Context: ...ed by "attn" kernels. This lets the user prioritize the kernels to focus on for p...
(QB_NEW_EN)
[grammar] ~103-~103: There might be a mistake here.
Context: ...result.csv to dive into what the kernels are which compose the nccl GPU cycles. T...
(QB_NEW_EN)
[grammar] ~104-~104: There might be a mistake here.
Context: ...U cycles. The following image shows that ar_fusion all reduce kernel to be the bi...
(QB_NEW_EN)
[grammar] ~112-~112: There might be a mistake here.
Context: ...ce files, captured for different models, say llama and gpt-oss in this case, and ...
(QB_NEW_EN)
[grammar] ~122-~122: There might be a mistake here.
Context: ...example 1 but now there will be multiple stack bar charts that can be compared. ...
(QB_NEW_EN)
[grammar] ~123-~123: There might be a mistake here.
Context: ...pared. The categories for the different kernels will remain the same, so that it...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ... it's easy to compare the GPU cycles for the same categories. Once a category is...
(QB_NEW_EN)
[grammar] ~127-~127: There might be a mistake here.
Context: ...e more cycles for one configuration than another, the next step would be to use t...
(QB_NEW_EN)
[grammar] ~128-~128: There might be a mistake here.
Context: ...use the csv file to see what kernels are mapped into that category, and which ker...
(QB_NEW_EN)
[grammar] ~129-~129: There might be a mistake here.
Context: ...kernels are taking the largest amount of time which would cause a difference for ...
(QB_NEW_EN)
[grammar] ~135-~135: There might be a mistake here.
Context: ...the same format as the other json files. The script will automatically pick up al...
(QB_NEW_EN)
[grammar] ~139-~139: There might be a mistake here.
Context: ...here are 4 kernels to be classified into "gemm" and "attn", where the gemm kernel...
(QB_NEW_EN)
[grammar] ~140-~140: There might be a mistake here.
Context: ...kernelshave names with "H" or "I" in them, and attn kernels have names with "...
(QB_NEW_EN)
[grammar] ~160-~160: There might be a mistake here.
Context: ...ey: a regex used to classify the kernels - value: the category to classify the kern...
(QB_NEW_EN)
[grammar] ~163-~163: There might be a mistake here.
Context: ...engine/models, consisting of CUDA memory operations and a 'misc' for anything tha...
(QB_NEW_EN)
[grammar] ~166-~166: There might be a mistake here.
Context: ... a trace file with this new model/engine like the following: ```bash --in_file n...
(QB_NEW_EN)
[grammar] ~174-~174: There might be a mistake here.
Context: ...ing engine file, after the other models.
(QB_NEW_EN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Signed-off-by: Grace Ho <grho@nvidia.com>
Signed-off-by: Grace Ho <grho@nvidia.com>
Signed-off-by: Grace Ho <grho@nvidia.com>
Actionable comments posted: 5
🧹 Nitpick comments (10)
tensorrt_llm/tools/profiler/nsys_profile_tools/trtllm_engine_model.json (1)
22-40: Minor: consider adding explicit attention variants and case-insensitive matches for robustness.

If kernel name casing varies (e.g., softmax vs SoftMax) or additional attention variants appear (e.g., mha variants), consider adding case-insensitive matching or explicit alternatives. This can be done in code (use re.IGNORECASE) or by expanding patterns here.
Apply either of these approaches:
- Approach A (preferred): Use `re.IGNORECASE` when searching in code; no JSON change needed.
- Approach B: Expand patterns with both cases (e.g., add `softmax|SoftMax`).

tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (6)
81-118: Minor: use logging instead of print for progress; tighten docstring.

Progress output should use the logger. Also, format the docstring to satisfy docstyle checks (D200/D205/D415).

Apply this diff:

```diff
     def sum_non_overlapping_intervals(self, df):
-        """
-        returns new sorted df with Elapsed Time (ns) column using
-        vectorized operations
-        """
+        """Return a sorted DataFrame with non-overlapped Elapsed Time (ns)."""
         logger.info("sorting %s trace records by start time", str(df.shape))
         assert not df.empty, 'empty nsys records'
@@
-        display_units = max(1, int(len(df) / 100))
+        display_units = max(1, int(len(df) / 100))

         # Update current_end for overlapping intervals
         for i in range(1, len(df)):
-            if i % display_units == 0:
-                print(f"processing trace: {int(i/len(df) * 100)} %", end="\r")
+            if i % display_units == 0:
+                pct = int(i / len(df) * 100)
+                logger.info("processing trace: %d%%", pct)
```

Note: If log volume is a concern, gate this behind a verbose flag or use `logger.debug`.
74-79: Compute Instances via groupby size (clearer/robust) and tidy docstring.

Setting `Instances = 1` and then aggregating with `"size"` works but is indirect.

Apply this diff:

```diff
-        # get ready to print table with elapsed times per kernel
-        df["Instances"] = 1
-        df_sum = df.groupby("Name", as_index=False).agg({
-            "Elapsed Time (ns)": "sum",
-            "Duration (ns)": "sum",
-            "Instances": "size"
-        })
+        # Aggregate elapsed/total time and number of kernel instances.
+        df_sum = df.groupby("Name", as_index=False).agg(
+            **{
+                "Elapsed Time (ns)": ("Elapsed Time (ns)", "sum"),
+                "Duration (ns)": ("Duration (ns)", "sum"),
+                "Instances": ("Name", "size"),
+            }
+        )
```

This avoids relying on a placeholder column.
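The named-aggregation pattern suggested here can be exercised on a toy frame (column names mirror the nsys summary; the data is made up). The `**{...}` spread is needed because the output column names contain spaces:

```python
import pandas as pd

# Fabricated per-launch rows: two launches of one kernel, one of another.
df = pd.DataFrame({
    "Name": ["gemm_k", "gemm_k", "attn_k"],
    "Elapsed Time (ns)": [100, 200, 50],
    "Duration (ns)": [120, 210, 50],
})

# Named aggregation: sums plus an instance count, no placeholder column.
df_sum = df.groupby("Name", as_index=False).agg(
    **{
        "Elapsed Time (ns)": ("Elapsed Time (ns)", "sum"),
        "Duration (ns)": ("Duration (ns)", "sum"),
        "Instances": ("Name", "size"),
    }
)
```

Each `(source_column, func)` tuple is a pandas NamedAgg, so "Instances" can count rows directly from the grouping key.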
121-171: Categorical ordering by the last underscore token can mis-sort ≥10 entries; parse the trailing index as int.

String-based sort will order "10" before "2". Parse the trailing token to int for stable ordering.

Apply this diff:

```diff
-        df["Model_Engine"] = self.pd.Categorical(
-            df["Model_Engine"],
-            sorted(df["Model_Engine"].unique(), key=lambda x: x.split("_")[-1]),
-        )
+        def _idx_key(s: str) -> int:
+            try:
+                return int(s.rsplit("_", 1)[1])
+            except Exception:
+                return 0
+
+        df["Model_Engine"] = self.pd.Categorical(
+            df["Model_Engine"],
+            sorted(df["Model_Engine"].unique(), key=_idx_key),
+        )
```
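The numeric-suffix sort key is easy to verify in isolation; `idx_key` below is a hypothetical standalone version of the suggested `_idx_key`, showing why string sort misorders "10" before "2":

```python
def idx_key(s: str) -> int:
    """Sort 'name_<idx>' labels by their trailing integer; labels without a
    numeric suffix sort first (key 0)."""
    try:
        return int(s.rsplit("_", 1)[1])
    except (IndexError, ValueError):
        return 0

labels = ["run_10", "run_2", "run_1"]
string_order = sorted(labels)            # lexicographic: run_1, run_10, run_2
numeric_order = sorted(labels, key=idx_key)
```

Lexicographic order puts `run_10` before `run_2`; the integer key restores the intended ordering.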
176-185: Default to "misc" if no mapping matches; optionally allow case-insensitive matching.

JSONs include a catch-all `".*": "misc"`, but being defensive avoids None categories if future JSONs omit it. Also consider case-insensitive matching in code to keep JSONs simpler.

Apply this diff:

```diff
     def anno_gpu_kernname(self, df, mapping):
         """add "Category" column"""

         def anno_gpu_kernname_helper(name):
             for kern_name, val in mapping.items():
                 if re.search(kern_name, name):
                     return val
+            return "misc"

         df["Category"] = df["Name"].apply(anno_gpu_kernname_helper)
```

If you want case-insensitive matching across the board:

```diff
-                if re.search(kern_name, name):
+                if re.search(kern_name, name, flags=re.IGNORECASE):
```
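A standalone sketch of regex-based category mapping with a "misc" fallback and `re.IGNORECASE`; the mapping contents here are illustrative, not the shipped JSON:

```python
import re

# Illustrative mapping in the JSON style the tool uses: regex -> category.
# Dict order matters: the catch-all ".*" must come last.
MAPPING = {
    "softmax|mha": "attn",
    "gemm": "gemm",
    ".*": "misc",
}

def classify(kernel_name: str) -> str:
    """Return the category of the first pattern that matches, case-insensitively."""
    for pattern, category in MAPPING.items():
        if re.search(pattern, kernel_name, flags=re.IGNORECASE):
            return category
    return "misc"  # defensive fallback even if ".*" is missing
```

Because `re.search` with `IGNORECASE` matches "SoftMax" and "softmax" alike, the JSON patterns do not need to spell out casing variants.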
186-193: Create the non-GPU row from a clean schema to avoid leaking unrelated fields (e.g., Total Time from the copied row).

Copying an arbitrary last row can carry unrelated columns with misleading values.

Apply this diff:

```diff
     def make_nongpu_row(self, df, nongpu_sec):
         """this will append non-gpu time entry at end of df"""
-        nongpu_row = self.pd.DataFrame([df.iloc[-1]])
-        nongpu_row["Category"] = nongpu_row["Name"] = "CPU(non-GPU)"
-        nongpu_row["Instances"] = 1
-        nongpu_row["Elapsed Time (sec)"] = nongpu_sec
+        cols = list(df.columns)
+        base = {c: None for c in cols}
+        base.update({
+            "Category": "CPU(non-GPU)",
+            "Name": "CPU(non-GPU)",
+            "Instances": 1,
+            "Elapsed Time (sec)": nongpu_sec,
+        })
+        nongpu_row = self.pd.DataFrame([base], columns=cols)
         return nongpu_row
```
15-19: Docstring style nits (D200/D205/D415).

Several docstrings trigger docstyle warnings (missing terminal punctuation, summary/description separation). Not functionally blocking, but easy to tidy.
Would you like me to push a patch that standardizes all docstrings to Google style and resolves the listed Ruff warnings?
Also applies to: 31-33, 46-49, 81-86, 119-119, 174-174, 184-184, 192-192, 197-197, 209-211, 256-256
tensorrt_llm/tools/profiler/nsys_profile_tools/README.md (3)
51-58: Minor grammar and clarity in the profiling command step.

Tighten language and fix minor issues.

Apply this diff:

```diff
-1. Run the following command to collect nsys profile, for trtllm serve config.
+1. Run the following command to collect an nsys profile for a trtllm serve configuration:
@@
-   python3 -m trtllm-serve meta-llama/Llama-4-Scout-17B-16E-Instruct ...
+   python3 -m trtllm-serve meta-llama/Llama-4-Scout-17B-16E-Instruct ...
@@
-   - DELAY: how many seconds to delay nsys from collecting profiles, needed so
-     that profiles aren't captured till trtllm server has come up and load
-     generation starts.
+   - DELAY: number of seconds to delay nsys before collecting profiles, so
+     collection starts after the trtllm server is up and load generation begins.
    - DURATION: how many seconds for nsys profile to run before generating the
      profile. This should be > the duration of the run.
```
100-109: Minor grammar fixes in result.csv section.

Apply this diff:

```diff
-Suppose the user would like to focus on improving decreasing calls to nccl
+Suppose the user would like to focus on reducing calls to NCCL
 kernels. The next step is to use the result.csv to dive into what the kernels
-are which compose the nccl GPU cycles. The following image shows that
-ar_fusion all reduce kernel to be the biggest contributor to GPU cycles for
-nccl, followed by AllGather.
+are that comprise the NCCL GPU cycles. The following image shows
+ar_fusion AllReduce to be the biggest contributor to NCCL GPU cycles,
+followed by AllGather.
```
132-156: Typos and phrasing in "add new classification" section.

Apply this diff:

```diff
-To create a new engine DEF with model ABC, just add another json file in the
+To create a new engine DEF with model ABC, add another JSON file in the
 same directory as gputrc2graph.py with the same format as the other json
 files. The script will automatically pick up all the json files in the same
 directory as engine/model specifications.
@@
-Then, for this new model, suppose there are 4 kernels to be classified into
-"gemm" and "attn", where the gemm kernelshave names with "*H*" or "*I*" in
-them, and attn kernels have names with "*J*" or "*K*" in them, just add another
-.json file in the same directory as gputrc2graph.py with the same format as
-the other json files, like the following:
+Then, for this new model, suppose there are four kernels to be classified into
+"gemm" and "attn", where the gemm kernels have names with "*H*" or "*I*",
+and attention kernels have names with "*J*" or "*K*". Add a JSON file in the
+same directory as gputrc2graph.py with the same format as the other JSON files,
+like the following:
@@
-- key: a regex used to classify the kernels
-- value: the category to classify the kernels into.
+- key: a regex used to classify kernels
+- value: the category to classify kernels into
@@
-The last 2 entries are common for all engine/models, consisting of CUDA memory
-operations and a 'misc' for anything that's leftover and can't be classified.
+The last two entries are common for all engines/models: CUDA memory operations
+and a 'misc' category for anything that can't be classified.
```
⛔ Files ignored due to path filters (3)
tensorrt_llm/tools/profiler/nsys_profile_tools/images/csv.pngis excluded by!**/*.pngtensorrt_llm/tools/profiler/nsys_profile_tools/images/html.pngis excluded by!**/*.pngtensorrt_llm/tools/profiler/nsys_profile_tools/images/html_tbl.pngis excluded by!**/*.png
📒 Files selected for processing (3)
tensorrt_llm/tools/profiler/nsys_profile_tools/README.md(1 hunks)tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py(1 hunks)tensorrt_llm/tools/profiler/nsys_profile_tools/trtllm_engine_model.json(1 hunks)
🔇 Additional comments (4)
tensorrt_llm/tools/profiler/nsys_profile_tools/trtllm_engine_model.json (2)
3-21: Kernel categorization coverage for llama looks solid; ordering ensures catch-all fallback.

Patterns are reasonably comprehensive and the catch-all ".*" is correctly placed last to avoid shadowing specific patterns.
41-60: ds and gpt-oss mappings are consistent; ".*" fallback is present.

Consistent taxonomy across engines, which will make cross-model comparison straightforward in the chart and pivot table.
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (1)
259-297: Clarify behavior for total runtime = 0 and align with README, or compute total from trace.

Current behavior sets total_sec = gpu_sec if the CLI value is 0 (or less than GPU). README states "0 uses elapsed time from nsys-rep," which is not implemented. Either:

- implement reading total elapsed from the trace (e.g., first start to last end) and use that when the input is 0, or
- update README to state that 0 means "use GPU elapsed time," which yields non-GPU time = 0.

If you decide to implement trace-derived total when 0 is passed, this patch adds a simple approximation:

```diff
     for idx, (file, engine, model, total_sec) in enumerate(in_file):
@@
-        gpu_sec = round(df["Elapsed Time (sec)"].sum(), 1)
-        total_sec = round(float(total_sec), 1)
+        gpu_sec = round(df["Elapsed Time (sec)"].sum(), 1)
+        total_sec = float(total_sec)
+        if total_sec == 0:
+            # Approximate total from kernel timestamps: last End - first Start.
+            # This may still underestimate true wall time if CPU-only gaps exist.
+            try:
+                raw = self.pd.read_csv(
+                    os.path.join(file_dir, f"{file_name}_cuda_gpu_trace.csv"),
+                    usecols=["Start (ns)", "Duration (ns)"],
+                )
+                if not raw.empty:
+                    last_end = (raw["Start (ns)"] + raw["Duration (ns)"]).max()
+                    first_start = raw["Start (ns)"].min()
+                    total_sec = max(0.0, (last_end - first_start) / 1e9)
+            except Exception:
+                logger.warning("Could not derive total_sec from trace; defaulting to gpu_sec.")
+                total_sec = 0.0
+        total_sec = round(total_sec or gpu_sec, 1)
```

Otherwise, adjust README wording (see separate comment).
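The last-end-minus-first-start approximation can be checked in isolation; a minimal sketch (hypothetical helper, same trace column names) looks like:

```python
import pandas as pd

def wall_time_sec(trace: pd.DataFrame) -> float:
    """Approximate total wall time as last kernel end minus first kernel start.

    Underestimates true runtime when there are CPU-only gaps before the first
    or after the last kernel launch."""
    if trace.empty:
        return 0.0
    ends = trace["Start (ns)"] + trace["Duration (ns)"]
    return float((ends.max() - trace["Start (ns)"].min()) / 1e9)
```

For a trace whose first kernel starts at 1 s and whose last kernel ends at 4 s, this yields 3.0 s regardless of idle gaps in between, which is exactly the "total" the non-GPU-time row needs.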
tensorrt_llm/tools/profiler/nsys_profile_tools/README.md (1)
36-41: List all Python dependencies (missing "pandas" min version and "regex" if kept).

If you keep using `regex`, add it here. Also consider noting pandas ≥ 2.0 if you rely on `copy_on_write`.

Apply this diff if `regex` is removed and pandas ≥ 2.0 is desired:

```diff
-- Make sure you have pandas and plotly python packages installed.
+- Make sure you have pandas (>= 2.0) and plotly python packages installed.
```

If you keep `regex`:

```diff
-- Make sure you have pandas and plotly python packages installed.
+- Make sure you have pandas (>= 2.0), plotly, and the `regex` Python package installed.
```
Actionable comments posted: 2
♻️ Duplicate comments (4)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (4)
25-25: Replace "regex" with "re" module from standard library.

The code uses the third-party `regex` package aliased as `re`, creating confusion and introducing an undeclared dependency. The patterns used in this code don't require advanced `regex` features.
31-43: Fix shallow merge that overwrites engine mappings and docstring punctuation.

The current implementation will overwrite entire engine mappings when multiple JSON files define the same engine. The README encourages multiple JSONs, so this needs deep merging. Also fix the missing period in the docstring.
58-63: Handle empty GPU trace CSV to prevent downstream failures.

When `cuda_gpu_trace.csv` is empty, the early return causes `gen_sum_file` to fail on assertions since no summary file is created.
228-252: Improve nsys subprocess execution robustness.

The current implementation has multiple issues: missing `-f csv` flag, silent failures, and using `exit(1)` instead of `sys.exit(1)`.
🧹 Nitpick comments (4)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (4)
104-104: Replace print with logger for consistency.

Direct print statements bypass the configured logging system and may not appear in log files.

```diff
-            print(f"processing trace: {int(i/len(df) * 100)} %", end="\r")
+            logger.debug("processing trace: %d %%", int(i/len(df) * 100))
```
272-273: Improve error messages for unsupported engine/model combinations.

The current assertion messages don't provide helpful guidance on available options.

```diff
-        assert engine_model.get(engine), f"engine {engine} unknown"
-        assert engine_model[engine].get(model), f"model {model} unknown"
+        if not engine_model.get(engine):
+            available_engines = list(engine_model.keys())
+            raise ValueError(f"Unknown engine '{engine}'. Available engines: {available_engines}")
+        if not engine_model[engine].get(model):
+            available_models = list(engine_model[engine].keys())
+            raise ValueError(f"Unknown model '{model}' for engine '{engine}'. Available models: {available_models}")
```
299-300: Add input validation to parse_tuple function.
The function blindly splits on commas without validating the expected tuple structure, which can cause confusing downstream errors.
  def parse_tuple(s):
-     return tuple(s.split(","))
+     parts = s.split(",")
+     if len(parts) != 4:
+         raise ValueError(f"Expected 4 comma-separated values (file,engine,model,total_sec), got {len(parts)}: {s}")
+     return tuple(parts)
15-18: Fix docstring formatting issues.
Multiple docstrings are missing periods, need blank lines, or should be reformatted to single lines per Python conventions.
 """
-    This generates gpu kernel analysis output from nsys rep. Will call nsys
-    stats -r cuda_gpu_trace, get non-overlapped gpu cycles, then generate
-    csv and html output for analysis
+    This generates GPU kernel analysis output from nsys rep.
+
+    Will call nsys stats -r cuda_gpu_trace, get non-overlapped GPU cycles,
+    then generate CSV and HTML output for analysis.
 """

-    """returns engine_model built from all json files in the current dir"""
+    """Return engine_model built from all JSON files in the current dir."""

-    """
-    Parses output of nsys report, generates csv and bar chart output
-    """
+    """Parse output of nsys report, generate CSV and bar chart output."""

-    """
-    returns new sorted df with Elapsed Time (ns) column using
-    vectorized operations
-    """
+    """Return new sorted df with Elapsed Time (ns) column using vectorized operations."""

-    """make html graph from df"""
+    """Make HTML graph from df."""

-    """add "Category" column"""
+    """Add 'Category' column."""

-    """this will append non-gpu time entry at end of df"""
+    """Append non-GPU time entry at end of df."""

-    """asserts if base_file is non-existent or is empty"""
+    """Assert if base_file is non-existent or is empty."""

-    """figure out if new file should be generated from base_file"""
+    """Figure out if new file should be generated from base_file."""

-    """
-    generates sum file from nsys trace with times per kernel and
-    returns the name of the sum file
-    """
+    """Generate sum file from nsys trace with times per kernel and return the name of the sum file."""

-    """generates graph and csv file from in_file into out_dir"""
+    """Generate graph and CSV file from in_file into out_dir."""

Also applies to: 31-31, 46-47, 80-82, 119-119, 174-174, 184-184, 192-192, 197-197, 209-211, 256-256
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
🪛 Ruff (0.12.2)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py
15-18: 1 blank line required between summary line and description
(D205)
15-18: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
31-31: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
46-47: One-line docstring should fit on one line
Reformat to one line
(D200)
46-47: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
80-82: 1 blank line required between summary line and description
(D205)
80-82: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
119-119: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
174-174: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
184-184: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
192-192: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
197-197: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
209-211: 1 blank line required between summary line and description
(D205)
209-211: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
256-256: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
|
@gracehonv FYI - the release check is failing: https://github.com/NVIDIA/TensorRT-LLM/actions/runs/17106836453/job/48517867574?pr=7020 |
Signed-off-by: Grace Ho <grho@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (6)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (6)
86-105: Handle empty DataFrame gracefully and replace print-based progress with logging.
The assertion will crash on valid empty traces. Also, prefer logging over `print` for progress output.
- logger.info("sorting %s trace records by start time", str(df.shape))
- assert not df.empty, 'empty nsys records'
+ logger.info("sorting %s trace records by start time", str(df.shape))
+ if df.empty:
+     logger.warning("No GPU trace records to process.")
+     return df
@@
- for i in range(1, len(df)):
-     if i % display_units == 0:
-         print(f"processing trace: {int(i/len(df) * 100)} %", end="\r")
+ for i in range(1, len(df)):
+     if i % display_units == 0:
+         logger.info("processing trace: %d %%", int(i / len(df) * 100))
120-171: Use a NamedTemporaryFile instead of writing a fixed `temp.html` in CWD.
The current approach can break with concurrent runs or if cleanup fails.
- pivot_df.fillna("").to_html("temp.html")
- with (
-     open(f"{output_name}.html", "a", encoding="utf-8") as outfile,
-     open("temp.html", encoding="utf-8") as infile,
- ):
-     outfile.write(infile.read())
- os.remove("temp.html")
+ import tempfile
+ with tempfile.NamedTemporaryFile(mode="w+", suffix=".html", delete=False) as tmp:
+     pivot_df.fillna("").to_html(tmp.name)
+ with open(f"{output_name}.html", "a", encoding="utf-8") as outfile, \
+      open(tmp.name, encoding="utf-8") as infile:
+     outfile.write(infile.read())
+ os.unlink(tmp.name)
25-26: Replace third-party `regex` with stdlib `re` (or document the dependency clearly).
Aliasing `regex` as `re` is confusing and introduces an undeclared dependency; the patterns used don't require the `regex` package.
-import regex as re
+import re
If you truly need `regex` features later, import it as `import regex` (not aliased to `re`) and add it to README/requirements.
31-43: Deep-merge engine/model mappings across multiple JSON files; `update()` overwrites earlier models.
If two JSONs define the same engine with different models, later files clobber earlier ones. The README encourages multiple JSONs in the same dir, so this loses data.
-def load_engine_model():
-    """returns engine_model built from all json files in the current dir"""
+def load_engine_model():
+    """Return a merged engine_model from all JSON files in this directory."""
     import glob
     import json
     engine_model = {}
     json_files = glob.glob(
         os.path.join(os.path.dirname(__file__) or ".", "*.json"))
     for fname in json_files:
         with open(fname, encoding="utf-8") as f:
-            engine_model.update(json.load(f))
+            data = json.load(f)
+            for engine, models in data.items():
+                if not isinstance(models, dict):
+                    continue
+                if engine not in engine_model:
+                    engine_model[engine] = {}
+                for model, mapping in models.items():
+                    if model not in engine_model[engine]:
+                        engine_model[engine][model] = {}
+                    # Later files can override specific patterns if duplicated.
+                    engine_model[engine][model].update(mapping)
     return engine_model
58-79: Write an empty summary CSV when the GPU trace CSV is empty to avoid downstream failures.
Returning early means `gen_sum_file()` later asserts on a missing `sum_file`. Always emit a CSV with the expected headers.
  def gen_nonoverlapped_sum_from_gputrace(self, in_file, out_file):
      logger.info("loading %s", in_file)
      df = self.pd.read_csv(in_file,
                            usecols=["Start (ns)", "Duration (ns)", "Name"])
      if df.empty:
-         return
+         self.pd.DataFrame(
+             columns=["Elapsed Time (sec)", "Total Time (sec)", "Instances", "Name"]
+         ).to_csv(out_file, index=False)
+         logger.warning("No GPU kernel records in %s; wrote empty summary %s",
+                        in_file, out_file)
+         return
211-258: Robustify nsys invocation and ensure the sum CSV is regenerated when needed.
Current code:
- Doesn't force CSV output for `nsys stats`.
- Doesn't check the subprocess return code or log stderr.
- Only generates the sum CSV when the stats CSV was regenerated; if stats exists but sum is missing/outdated, you skip generating the sum and then assert it exists.
- Uses `exit(1)` instead of `sys.exit(1)`.
Fix all four and correct a log typo.
  def gen_sum_file(self, file, nsys_cmd):
@@
-     if self.should_gen_file(nsys_stats_file, file):
+     if self.should_gen_file(nsys_stats_file, file):
          cmd = [
              nsys_cmd,
              "stats",
              "-r",
              "cuda_gpu_trace",
+             "-f",
+             "csv",
              file,
              "-o",
              f"{file_dir}/{file_name}",
          ]
          cmd_str = " ".join(cmd)
          logger.info("+ %s", cmd_str)
@@
-         try:
-             subprocess.run(cmd)
-         except Exception:
-             logger.error("%s failed; Use --nsys_cmd to specify nsys path",
-                          cmd_str)
-             exit(1)
-     logger.info("generating non-overalapped sum %s", sum_file)
-     self.gen_nonoverlapped_sum_from_gputrace(nsys_stats_file, sum_file)
-     self.is_valid_file(sum_file)
+         try:
+             res = subprocess.run(cmd, check=True, capture_output=True, text=True)
+         except FileNotFoundError:
+             logger.error("%s not found; Use --nsys_cmd to specify nsys path", nsys_cmd)
+             import sys
+             sys.exit(1)
+         except subprocess.CalledProcessError as e:
+             logger.error("nsys stats failed: %s\nstdout:\n%s\nstderr:\n%s",
+                          cmd_str, e.stdout, e.stderr)
+             import sys
+             sys.exit(1)
+         if not (os.path.isfile(nsys_stats_file) and os.path.getsize(nsys_stats_file) > 0):
+             logger.error("Expected output %s not found or empty after running nsys.", nsys_stats_file)
+             import sys
+             sys.exit(1)
+     # (Re)generate the sum CSV if it is missing or older than the stats CSV.
+     if self.should_gen_file(sum_file, nsys_stats_file):
+         logger.info("generating non-overlapped sum %s", sum_file)
+         self.gen_nonoverlapped_sum_from_gputrace(nsys_stats_file, sum_file)
+         self.is_valid_file(sum_file)
🧹 Nitpick comments (8)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (8)
1-15: Update copyright year to include 2025.
Repo guideline says to prepend the NVIDIA copyright header with the current year. This new file should read 2022-2025 (or just 2025 for a new file introduced this year).
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
15-19: Tighten the module docstring and satisfy docstring style checks.
Ruff flags D205/D415. Make the first line a concise summary ending with a period, followed by a blank line and details.
-"""
-    This generates gpu kernel analysis output from nsys rep. Will call nsys
-    stats -r cuda_gpu_trace, get non-overlapped gpu cycles, then generate
-    csv and html output for analysis
-"""
+"""
+Generate GPU-kernel analysis outputs from an nsys .rep file.
+
+Calls `nsys stats -r cuda_gpu_trace` to compute non-overlapped GPU time, then emits CSV and HTML artifacts for analysis.
+"""
176-185: Ensure unmatched kernels get a category and precompile patterns for speed.
`anno_gpu_kernname_helper` returns `None` when no pattern matches, creating NaNs and dropping rows in pivoting. Also, compiling regexes once reduces per-row overhead.
  def anno_gpu_kernname(self, df, mapping):
      """add "Category" column"""
-
-     def anno_gpu_kernname_helper(name):
-         for kern_name, val in mapping.items():
-             if re.search(kern_name, name):
-                 return val
+     compiled = [(re.compile(pat), cat) for pat, cat in mapping.items()]
+
+     def anno_gpu_kernname_helper(name):
+         for rx, cat in compiled:
+             if rx.search(name):
+                 return cat
+         return "uncategorized"
      df["Category"] = df["Name"].apply(anno_gpu_kernname_helper)
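The precompile-plus-fallback idea can be demonstrated in isolation; `make_categorizer` and the sample patterns below are hypothetical, not the PR's code:

```python
import re

def make_categorizer(mapping, default="uncategorized"):
    """Compile each pattern once; return a name -> category function."""
    compiled = [(re.compile(pat), cat) for pat, cat in mapping.items()]

    def categorize(name):
        for rx, cat in compiled:
            if rx.search(name):
                return cat
        return default  # unmatched kernels get an explicit bucket

    return categorize

categorize = make_categorizer({r"gemm": "GEMM", r"attn": "Attention"})
labels = [categorize(n) for n in
          ["sm90_gemm_fp8", "flash_attn_fwd", "memcpy_DtoH"]]
```

Compiling once per pattern (instead of calling `re.search(pattern, name)` per row) avoids repeated pattern-cache lookups, and the explicit `default` keeps unmatched kernels visible in the pivot instead of becoming NaN.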
195-207: Prefer explicit exceptions over asserts for file validity.
`assert` can be stripped with `-O` and is less informative for users. Consider raising `FileNotFoundError`/`ValueError` with a clear message.
- def is_valid_file(self, base_file):
-     """asserts if base_file is non-existent or is empty"""
-     assert (os.path.isfile(base_file) and os.path.getsize(base_file)
-             > 0), f"{base_file} doesn't exist or is empty"
+ def is_valid_file(self, base_file):
+     """Validate that base_file exists and is non-empty."""
+     if not (os.path.isfile(base_file) and os.path.getsize(base_file) > 0):
+         raise FileNotFoundError(f"{base_file} doesn't exist or is empty.")
303-341: Argparse help text drift and minor polish.
- The example includes an unknown engine `sglan`; likely meant `trtllm`.
- The help string claims "Specify 0 to use elapsed time from nsys-rep", which the code does not implement (see prior comment). Either update the help or implement the behavior.
-     "profiling used to calculate non-gpu time. Specify 0 to use "
-     "elapsed time from nsys-rep but that might inflate non-gpu time. "
+     "profiling used to calculate non-GPU time. Specify 0 to assume "
+     "non-GPU time is 0 (i.e., use GPU elapsed time only). "
      f"Available engine:[model] are: {engine_model_supported_str} "
-     f"Example: --in_file d1.nsys-rep,sglan,llama,100 "
+     f"Example: --in_file d1.nsys-rep,trtllm,llama,100 "
      "d2.nsys-rep,trtllm,gpt-oss,102"),
Would you prefer I implement deriving wall time from `nsys stats -r summary` so that `0` means "use nsys total elapsed"? If yes, I can push a patch.
172-175: Switch to logging for final status message.
Minor consistency tweak: use `logger.info` instead of `print`.
- print(f"Finished generating: \n"
-       f"    {output_name}.html for stack bar chart \n"
-       f"    {output_name}.csv for Kernel-Category mapping")
+ logger.info("Finished generating:\n  %s.html (stacked bar chart)\n  %s.csv (Kernel-Category mapping)",
+             output_name, output_name)
317-322: Defensive UX: handle missing JSONs gracefully in CLI help.
If no JSON files are present, `engine_model_supported_str` becomes empty. Consider an explicit notice so users aren't confused by "Available engine:[model] are: " with nothing after it.
- engine_model_supported_str = ", ".join(
+ engine_model_supported_str = ", ".join(
      f"{engine}:[{', '.join(models.keys())}]"
      for engine, models in engine_model_supported.items())
+ if not engine_model_supported_str:
+     engine_model_supported_str = "None found (place *.json mappings next to this script)."
304-314: Docstring/style nits across functions per Ruff hints.
Several function docstrings lack terminal punctuation or the blank line after the summary (D200/D205/D415). Apply quick, local fixes as convenient; this keeps release checks green.
🔇 Additional comments (1)
tensorrt_llm/tools/profiler/nsys_profile_tools/gputrc2graph.py (1)
81-119: Non-overlap algorithm LGTM for a single “active interval” accounting.The sweep correctly attributes only “new coverage” per event; this aligns with the stated non-overlapped GPU-time goal.
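The "new coverage" sweep the reviewer approves of can be written as a small standalone function; this is an illustrative sketch of the accounting, not the PR's exact code:

```python
def non_overlapped_time(events):
    """events: iterable of (start_ns, duration_ns) kernel intervals.

    Returns total time covered by at least one event, counting
    overlapping spans only once: sweep events in start order and
    attribute only the coverage that extends past the current edge.
    """
    events = sorted(events, key=lambda e: e[0])
    total = 0
    covered_until = float("-inf")  # right edge of coverage so far
    for start, dur in events:
        end = start + dur
        if start > covered_until:
            total += end - start          # entirely new coverage
        elif end > covered_until:
            total += end - covered_until  # only the part past the edge
        covered_until = max(covered_until, end)
    return total

# Two kernels overlapping by 5 ns: [0, 10) and [5, 20) cover 20 ns total.
t = non_overlapped_time([(0, 10), (5, 15)])
```

Summing raw durations here would give 25 ns; the sweep reports 20 ns of wall-clock GPU activity, which is what the non-overlapped GPU-time goal requires.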
|
@kaiyux fixed the issue! It's taking a long time waiting for blossom-ci which is strange since my script doesn't touch anything internal to trtllm. |
|
/bot skip --comment "the tool is not protected by CI pipeline" |
|
PR_Github #16243 [ skip ] triggered by Bot |
|
PR_Github #16243 [ skip ] completed with state |
This script processes NVIDIA Nsight Systems (nsys) GPU trace files (.nsys-rep) with -t cuda tracing enabled, and generates kernel-level summaries and visualizations of GPU and non-GPU time. It is useful for profiling and analyzing nsys profile output. Supported models currently include llama, deepseek, and gpt-oss.
Run command example:
python3 gputrc2graph.py \
    --in_file run1.nsys-rep,trtllm,llama,103 \
    --title "TRTLLM-gpt-oss profile"
The command will produce 2 files for analysis:
result.html: a stacked bar chart of GPU time, with kernels grouped into categories.
result.csv: shows how the kernel names are mapped to the different categories.
A detailed explanation is in the accompanying README.md.