Closed
Labels
hackathon, oncall: pt2, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
Repro:
TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=3 benchmarks/dynamo/torchbench.py --inductor --cpp-wrapper --bfloat16 --accuracy --inference --device cuda --only BERT_pytorch
In the generated output_code.py, the following code is generated multiple times. Because there is no control flow in the graph, it only needs to be generated once.
if (triton_poi_fused_gelu_7 == nullptr) {
    triton_poi_fused_gelu_7 = loadKernel("/tmp/torchinductor_binbao/jo/cjokzfaqrvkztfhvz2yqzxlmpzub2sdi57zctqbad4js7crxsm5n.cubin", "triton_poi_fused_gelu_7_0d1d");
}