[Graph Partition] allow sharing default device context #162873
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162873
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 2bb9647 with merge base d2f6daf.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
     # is important for nested subgraph codegening.
     def write_get_raw_stream(self, device_idx: int, graph_name: str) -> str:
-        self.write_get_raw_stream_header_once()
+        self.write_get_raw_stream_header()
```
standalone_compile may need additional imports of get_raw_stream
Why are additional imports needed? Can those be added as a cache key if repeated imports are needed?
We need imports of get_raw_stream for both the subgraphs and the main graph, including in the final output code and in autotune_at_compile_time blocks.
Currently, write_get_raw_stream_header_once relies on cache_on_self to import once. When generating a subgraph, it calls SubgraphPythonWrapperCodegen.write_get_raw_stream_header_once() (cache_on_self), which calls PythonWrapperCodegen.write_get_raw_stream_header_once() (cache_on_self).
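A toy illustration of this caching chain (the `cache_on_self` below is a simplified stand-in for Inductor's decorator, and `ParentWrapper`/`SubgraphWrapper` are reduced versions of `PythonWrapperCodegen`/`SubgraphPythonWrapperCodegen`):

```python
import functools


def cache_on_self(fn):
    # Simplified stand-in: run a zero-arg method at most once per instance.
    flag = f"_{fn.__name__}_called"

    @functools.wraps(fn)
    def wrapper(self):
        if not getattr(self, flag, False):
            setattr(self, flag, True)
            fn(self)

    return wrapper


class ParentWrapper:
    def __init__(self):
        self.imports = []

    @cache_on_self
    def write_get_raw_stream_header_once(self):
        self.imports.append("import of get_raw_stream")


class SubgraphWrapper:
    def __init__(self, parent):
        self.parent_wrapper = parent

    @cache_on_self
    def write_get_raw_stream_header_once(self):
        # Delegates to the parent, which populates the parent's per-instance cache.
        self.parent_wrapper.write_get_raw_stream_header_once()


parent = ParentWrapper()
sub = SubgraphWrapper(parent)
sub.write_get_raw_stream_header_once()     # runs during subgraph codegen
parent.write_get_raw_stream_header_once()  # later main-graph call is now a no-op
print(parent.imports)  # the import was only written once, during subgraph codegen
```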
pytorch/torch/_inductor/codegen/wrapper.py, lines 3652 to 3660 in 8590c3a:

```python
    @cache_on_self
    def write_get_raw_stream_header_once(self) -> None:
        # TODO: Uncomment in future. This will be needed to support subgraph
        # codegen for cpp wrapper.
        # if config.triton.autotune_at_compile_time:
        #     self.kernel_autotune_calls.writeline(
        #         V.graph.device_ops.import_get_raw_stream_as("get_raw_stream")
        #     )
        self.parent_wrapper.write_get_raw_stream_header_once()
```
pytorch/torch/_inductor/codegen/wrapper.py, lines 1191 to 1193 in 8590c3a:

```python
    @cache_on_self
    def write_get_raw_stream_header_once(self) -> None:
        self.write_get_raw_stream_header()
```
As a result, the main graph will skip PythonWrapperCodegen.write_get_raw_stream_header_once() (cache_on_self) since it has just been called. However, this leads to errors since the main graph code also needs this import.
So I call write_get_raw_stream_header here whenever we need it, and only write to self.imports when self.imports does not contain import_get_raw_stream_str.
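A minimal sketch of that write-on-demand-with-dedup approach (illustrative only: the real import line comes from `V.graph.device_ops.import_get_raw_stream_as("get_raw_stream")`, the literal string below is just an example, and the real `write_get_raw_stream` does more than shown here):

```python
class PythonWrapperCodegenSketch:
    def __init__(self) -> None:
        self.imports: list[str] = []

    def write_get_raw_stream_header(self) -> None:
        # Write the import whenever it is needed, but only once per import buffer:
        # skip it if this wrapper's imports already contain the line.
        import_get_raw_stream_str = (
            "from torch._C import _cuda_getCurrentRawStream as get_raw_stream"
        )
        if import_get_raw_stream_str not in self.imports:
            self.imports.append(import_get_raw_stream_str)

    def write_get_raw_stream(self, device_idx: int, graph_name: str) -> str:
        self.write_get_raw_stream_header()
        return f"get_raw_stream({device_idx})"
```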
> Currently, write_get_raw_stream_header_once relies on cache_on_self to import once. When generating a subgraph, it calls SubgraphPythonWrapperCodegen.write_get_raw_stream_header_once() (cache_on_self), which calls PythonWrapperCodegen.write_get_raw_stream_header_once() (cache_on_self).
Is the import for the subgraph wrapper not added to the global scope?
No. New wrapper code is generated for each subgraph.
pytorch/torch/_inductor/scheduler.py, lines 5155 to 5173 in 8590c3a:

```python
    def _codegen_partition_wrapper(
        self,
        partition: PartitionType,
        signature: GraphPartitionSignature,
    ) -> None:
        """Codegen a partition given its inputs/outputs"""
        from .codegen.wrapper import SubgraphPythonWrapperCodegen

        parent_wrapper_code = V.graph.wrapper_code
        graph_partition_id = next(self._graph_partition_counter)

        with V.graph.set_current_wrapper_code():
            V.graph.init_wrapper_code(
                is_subgraph=True,
                subgraph_name=f"partition_{graph_partition_id}",
                parent_wrapper_code=parent_wrapper_code,
                partition_signatures=signature,
            )
            self._codegen(partition)
```
> No. New wrapper code is generated for each subgraph.

But all the wrapper code is still dumped to the same .py file, right?
Yes for the output_code, but no for config.triton.autotune_at_compile_time, which creates separate code to run, IIUC.
Nice finding!
Is it possible to improve the device context manager instead? I.e., if we are already on the mentioned device, make entering/exiting cheaper.
Also, any numbers for the overall improvement for inference?
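For reference, a minimal sketch of the kind of cheaper context manager being suggested here, using only public `torch.cuda` APIs (the name `maybe_set_device` is hypothetical, and this is not what the PR ends up doing; the PR shares one outer context instead):

```python
import contextlib

import torch


@contextlib.contextmanager
def maybe_set_device(device_idx: int):
    """Enter a device context only if we are not already on the target device."""
    prev = torch.cuda.current_device()
    if prev == device_idx:
        # Already on the right device: entering/exiting becomes (almost) free.
        yield
        return
    torch.cuda.set_device(device_idx)
    try:
        yield
    finally:
        torch.cuda.set_device(prev)
```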
> Why are additional imports needed? Can those be added as a cache key if repeated imports are needed?
Prior to this PR, we actually only codegen what is shown in pytorch/torch/_inductor/scheduler.py, lines 5274 to 5282 in 8590c3a.
The issue is the assumption made there. However, this assumption is NOT true for graph partitions: we have many graph partitions, so we want to reuse the existing device context. The major purpose of this PR is to modify the assumption a bit; see pytorch/torch/_inductor/scheduler.py, line 5250 in 8590c3a.
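Conceptually, the change to the generated call structure can be pictured like this. This is a hand-written sketch rather than real Inductor output: `partition_0`/`partition_1` and `call_before`/`call_after` are stubs, and `torch.cuda._DeviceGuard` stands in for the device-context handling in the generated wrapper.

```python
import torch


def partition_0(args):  # stub for a generated graph-partition function
    pass


def partition_1(args):  # stub for a generated graph-partition function
    pass


def call_before(args):
    # Before: every partition enters and exits its own device context.
    with torch.cuda._DeviceGuard(0):
        partition_0(args)
    # ... cudagraph-unsafe op (e.g. attention) runs between partitions ...
    with torch.cuda._DeviceGuard(0):
        partition_1(args)


def call_after(args):
    # After: partitions that all run on the default device share one context,
    # so enter/exit is paid once per wrapper call instead of once per partition.
    with torch.cuda._DeviceGuard(0):
        partition_0(args)
        # ... cudagraph-unsafe op runs inside the shared context ...
        partition_1(args)
```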
For vLLM Qwen/Qwen3-0.6B, this PR reduces latency from 0.34 seconds to 0.328 seconds, around a 3.5% speedup. Command to repro:
We can easily see the speedup from the trace.
I want to dig a bit more into the benchmarking. How many subgraphs do we generate for Qwen/Qwen3-0.6B? Let's say we save X device.enter/exit pairs each time we call the wrapper, and assume a pair of device.enter/exit costs Y us (~40 us according to your measurement). Then the total saving for a wrapper call is (X * Y) us, and (X * Y) / 3.5% should roughly match the latency of one inference call. Also, what blocks us from going one step further and calling device.enter/exit only once across multiple wrapper calls?
This is expected. The attention kernel is cudagraph-unsafe, so it is explicitly marked with
Yes, we currently only call device.enter/exit once with multiple wrapper calls. See the tlparse and output code. There are 29 partitions and 28
For Qwen3-0.6B: overall, this 10% saving is larger than the end-to-end 3.5% saving because the model also has several other components such as update_states, prepare_inputs, etc.
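A back-of-the-envelope check using the numbers in this thread (assuming 28 of the 29 partition-level enter/exit pairs are eliminated, at ~41 us per pair from the 30 us + 11 us measurement); these are rough estimates, not measured values:

```python
pairs_saved_per_call = 28     # enter/exit pairs shared away per wrapper call (assumed)
cost_per_pair_us = 30 + 11    # measured enter + exit cost, in microseconds

saving_per_call_us = pairs_saved_per_call * cost_per_pair_us
print(saving_per_call_us)     # ~1148 us, i.e. roughly 1.1 ms saved per wrapper call

# If that ~1.1 ms is the ~10% saving quoted above, the compiled region would be
# on the order of 11 ms per call; the end-to-end saving is smaller (3.5%) because
# update_states, prepare_inputs, etc. run outside the compiled region.
```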
OK, I didn't realize that there are so many partitions due to not capturing attention kernels in cudagraphs.
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki.
@pytorchbot cherry-pick --onto release/2.9 --c critical |
Cherry picking #162873. Raised by workflow job.
Resolved in #163097, no need to cherry-pick.


Entering a device context takes 30 us and exiting a device context takes 11 us. If all graph partitions and cudagraph-unsafe ops happen on the same device, we can share the device context.
## Trace
Use vLLM as an example. The first trace shows dynamo graph partition.

The second trace shows inductor graph partition prior to this PR.

Compared with the fx graph partition, the inductor graph partition shows extra overhead from entering/exiting device contexts (13+6 us -> 30+11 us) but smaller runtime overhead (13 us -> 7 us). This motivates this PR to share the default device context.
The third trace shows the Inductor graph partition after this PR. We observe that the extra overhead from entering/exiting device contexts has been eliminated, while the smaller runtime overhead is retained.

cc @mcarilli @ezyang @eellison @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @mlazos