[AOTI] Fix a special case compile time data type codegen for sym int variables by YUNQIUGUO · Pull Request #138106 · pytorch/pytorch · GitHub

Closed

YUNQIUGUO wants to merge 1 commit into pytorch:main from YUNQIUGUO:export-D64490039

Contributor

YUNQIUGUO commented Oct 16, 2024 (edited by pytorch-bot bot)

Summary:
This change unblocks a runtime error in CFR AOTI lowering.

TL;DR:

In this model, one Triton kernel expects a scalar input of dtype i64 but receives an i32. The cause is that `auto` deduces its type from the initializer, so when the value assigned is an i32 the variable is also i32 rather than i64. The mismatch leads to a CUDA illegal memory access (IMA).
Original problematic kernel: `triton_poi_fused_add_ge_logical_and_logical_or_lt_46_grid_100`.

This diff explicitly declares all symbolic arguments as i64 at compile time when the Triton kernel input is i64, instead of emitting `auto var_x = {arg}` in the cpp wrapper code.

Test Plan:
Verified in FLB locally:

```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCH_LOGS="output_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_SHOW_CPP_STACKTRACES=1 CUDA_LAUNCH_BLOCKING=1 ~/fbsource/buck-out/v2/gen/fbcode/98e643f8bb44fe9d/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --skip-eager --skip-flop-estimation --lower-backend="AOT_INDUCTOR" --sync-mode=0 --precision bf16 --output-precision bf16  --lower-presets="ifr_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change" --remove-unexpected-type-cast=False --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/924293663/0/gpu_lowering/input.merge"
```

Differential Revision: D64490039




cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

pytorch-bot bot commented Oct 16, 2024 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138106

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 2267787 with merge base 620039c:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100) (gh) (detected as infra flaky with no log or failing log classifier)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added ciflow/inductor module: inductor labels

Contributor

facebook-github-bot commented Oct 16, 2024

This pull request was exported from Phabricator. Differential Revision: D64490039

facebook-github-bot added the fb-exported label

YUNQIUGUO force-pushed the export-D64490039 branch from 67f74f2 to 9640411 Compare

October 16, 2024 20:15

YUNQIUGUO added the topic: not user facing label

YUNQIUGUO force-pushed the export-D64490039 branch from 9640411 to 52d4e80 Compare

October 16, 2024 21:25

YUNQIUGUO force-pushed the export-D64490039 branch from 52d4e80 to 68df8f6 Compare

October 16, 2024 22:27

YUNQIUGUO added a commit to YUNQIUGUO/pytorch that referenced this pull request


[AOTI] Fix a special case compile time data type codegen for sym int variables (pytorch#138106)

68df8f6

Summary:

This change unblocks a runtime error in CFR AOTI lowering.

TL;DR:

In this model, one Triton kernel expects a scalar input of dtype i64 but receives an i32. The cause is that `auto` deduces its type from the initializer, so when the value assigned is an i32 the variable is also i32 rather than i64. The mismatch leads to a CUDA illegal memory access (IMA).
Original problematic kernel: `triton_poi_fused_add_ge_logical_and_logical_or_lt_46_grid_100`, whose third input was declared as `auto var_402 = u0`.

This diff explicitly declares all symbolic arguments as i64 at compile time when the Triton kernel input is i64, instead of emitting `auto var_x = {arg}` in the cpp wrapper code.

Test Plan:
Verified in FLB locally:

```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCH_LOGS="output_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_SHOW_CPP_STACKTRACES=1 CUDA_LAUNCH_BLOCKING=1 ~/fbsource/buck-out/v2/gen/fbcode/98e643f8bb44fe9d/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --skip-eager --skip-flop-estimation --lower-backend="AOT_INDUCTOR" --sync-mode=0 --precision bf16 --output-precision bf16  --lower-presets="ifr_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change" --remove-unexpected-type-cast=False --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/924293663/0/gpu_lowering/input.merge"
```

Differential Revision: D64490039

YUNQIUGUO force-pushed the export-D64490039 branch from 68df8f6 to c4aa337 Compare

October 17, 2024 00:47

YUNQIUGUO force-pushed the export-D64490039 branch from c4aa337 to 98dacff Compare

October 17, 2024 05:33

YUNQIUGUO force-pushed the export-D64490039 branch from 98dacff to 4be7f82 Compare

October 17, 2024 16:32

YUNQIUGUO force-pushed the export-D64490039 branch from 4be7f82 to 12cab86 Compare

October 17, 2024 18:19

YUNQIUGUO force-pushed the export-D64490039 branch from 12cab86 to aa6d908 Compare

October 17, 2024 20:37

YUNQIUGUO force-pushed the export-D64490039 branch from aa6d908 to 0240146 Compare

October 17, 2024 23:38

YUNQIUGUO force-pushed the export-D64490039 branch from 0240146 to 2d258c8 Compare

October 17, 2024 23:45

YUNQIUGUO force-pushed the export-D64490039 branch from 2d258c8 to 2267787 Compare

October 18, 2024 18:01

ColinPeppler approved these changes

pytorch-bot bot added the ciflow/trunk label

Contributor

facebook-github-bot commented Oct 19, 2024

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot added the merging label

Collaborator

pytorchmergebot commented Oct 19, 2024

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot added the Merged label

pytorchmergebot closed this in ea412d5

pytorchmergebot removed the merging label

Labels

ciflow/inductor ciflow/trunk fb-exported Merged module: inductor topic: not user facing