Fix vs2022 caused AVX512 illegal instruction issue. by xuhancn · Pull Request #153480 · pytorch/pytorch · GitHub


@xuhancn
Collaborator

@xuhancn xuhancn commented May 13, 2025

Fixes #145702

Add `/d2implyavx512upperregs-` to disable an over-aggressive compiler optimization that caused AVX-512 registers to be used on AVX2 machines.

Reference: #145702 (comment)
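One way to check whether a build is affected is to look for AVX-512-only registers in the disassembly of code that is meant to run on AVX2 CPUs. The registers xmm16-xmm31, ymm16-ymm31 and zmm0-zmm31 can only be encoded with an EVEX prefix, so an AVX2-only CPU raises an illegal-instruction fault when it executes an instruction that uses them. The sketch below assumes the disassembly is available as text (for example from `dumpbin /DISASM` or `objdump -d`); `find_avx512_only_registers` is a hypothetical helper, not part of PyTorch or MSVC.

```python
import re
from typing import Iterable, List, Tuple

# Vector registers that only exist on AVX-512 hardware and can only be
# encoded with an EVEX prefix:
#   xmm16-xmm31 and ymm16-ymm31 (the "upper" registers), and zmm0-zmm31.
# An AVX2-only CPU faults on any instruction that uses one of them.
_AVX512_ONLY_REG = re.compile(
    r"\b(?:[xy]mm(?:1[6-9]|2[0-9]|3[01])|zmm(?:3[01]|[12][0-9]|[0-9]))\b",
    re.IGNORECASE,
)


def find_avx512_only_registers(disasm_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Hypothetical helper: return (line_number, register) pairs for every
    AVX-512-only register referenced in the given disassembly lines."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(disasm_lines, start=1):
        for match in _AVX512_ONLY_REG.finditer(line):
            hits.append((lineno, match.group(0).lower()))
    return hits


if __name__ == "__main__":
    sample = [
        "vmovups ymm0, ymmword ptr [rcx]",  # plain AVX: fine on AVX2 CPUs
        "vaddps  ymm16, ymm1, ymm2",        # ymm16 needs AVX-512 (EVEX)
        "vmovaps zmm3, zmm4",               # 512-bit registers need AVX-512
    ]
    for lineno, reg in find_avx512_only_registers(sample):
        print(f"line {lineno}: {reg}")
```

A regex scan is only a first filter: a hit inside a function that is dispatched exclusively on AVX-512 CPUs is expected, so hits need to be matched against the code paths that run on AVX2.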

Local test passed: [screenshot]

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm @iremyux @Blackhex @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168

@pytorch-bot

pytorch-bot bot commented May 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153480

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 99978ab with merge base dccd19c:

NEW FAILURE - The following job has failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@xuhancn xuhancn added topic: not user facing topic category ciflow/binaries Trigger all binary build and upload jobs on the PR module: windows Windows support for PyTorch ciflow/trunk Trigger trunk jobs on your pull request intel This tag is for PR from Intel labels May 13, 2025
@xuhancn xuhancn requested review from Blackhex, atalman and malfet May 15, 2025 17:27
@xuhancn xuhancn changed the title fix vs2022 illegal instruction. Fix vs2022 caused AVX512 illegal instruction issue. May 15, 2025
@xuhancn xuhancn marked this pull request as ready for review May 15, 2025 17:28
@xuhancn xuhancn requested a review from a team as a code owner May 15, 2025 17:28
@xuhancn xuhancn added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 15, 2025
Collaborator

@Blackhex Blackhex left a comment


LGTM. If it's desired, I'd suggest creating a GitHub issue to address the root cause (removing the AVX-512 intrinsics from code that targets AVX2) and linking it to this PR.
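The root-cause cleanup suggested here could start with a rough audit of the sources compiled for AVX2. The following is a heuristic sketch (the helper name `find_avx512_intrinsics` is hypothetical) that flags 512-bit intrinsics and types (`_mm512_*`, `__m512*`), opmask types (`__mmask8/16/32/64`), and the masked 128/256-bit forms (`_mm_mask_*`, `_mm256_maskz_*`, and so on) that belong to AVX-512VL. It deliberately skips AVX2 masked gathers such as `_mm256_mask_i32gather_epi32`, which share the `_mask_` prefix. A regex cannot see through macros or templates, so hits are candidates to review, not proof of a bug.

```python
import re
from typing import List, Tuple

# Heuristic patterns for AVX-512 intrinsics and types in C/C++ source:
#   _mm512_*                          512-bit intrinsics
#   __m512, __m512i, __m512d, ...     512-bit vector types
#   __mmask8/16/32/64                 opmask types
#   _mm_mask_*, _mm256_maskz_*, ...   masked 128/256-bit ops (AVX-512VL)
# The negative lookahead skips AVX2 masked gathers such as
# _mm256_mask_i32gather_epi32, which also use the "_mask_" prefix.
_AVX512_TOKEN = re.compile(
    r"\b(?:"
    r"_mm512_\w+"
    r"|__m512\w*"
    r"|__mmask(?:8|16|32|64)\b"
    r"|_mm(?:256)?_maskz?_(?!i(?:32|64)gather)\w+"
    r")"
)


def find_avx512_intrinsics(source: str) -> List[Tuple[int, str]]:
    """Hypothetical helper: return (line_number, token) pairs for AVX-512
    intrinsics or types found in a source file meant to target AVX2."""
    hits: List[Tuple[int, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _AVX512_TOKEN.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    sample = (
        "__m256 a = _mm256_add_ps(x, y);\n"                                # AVX
        "__m256i g = _mm256_mask_i32gather_epi32(src, base, idx, m, 4);\n"  # AVX2
        "__m256 b = _mm256_mask_add_ps(a, k, x, y);\n"                     # AVX-512VL
        "__m512 c = _mm512_setzero_ps();\n"                                # AVX-512F
    )
    for lineno, token in find_avx512_intrinsics(sample):
        print(f"line {lineno}: {token}")
```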

@xuhancn
Collaborator Author

xuhancn commented May 16, 2025

> LGTM. If it's desired, I'd suggest creating a GitHub issue to address the root cause (removing the AVX-512 intrinsics from code that targets AVX2) and linking it to this PR.

Upgrading VS2022 to the latest version fixes this issue, but I don't know how; please see #145702 (comment). @Blackhex, can you check with the Visual Studio team to find a solution?

@xuhancn
Collaborator Author

xuhancn commented May 16, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased xu_fix_vs2022_illegal_instruction onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xu_fix_vs2022_illegal_instruction && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the xu_fix_vs2022_illegal_instruction branch from a6214c5 to 56e7d38 Compare May 16, 2025 06:43
@Blackhex
Collaborator

Blackhex commented May 16, 2025

> Upgrading VS2022 to the latest version fixes this issue, but I don't know how; please see #145702 (comment). @Blackhex, can you check with the Visual Studio team to find a solution?

IMO, the https://github.com/pytorch/pytorch/blob/56e7d38b9f9ee727caf7b86aa0cca5088c94a489/.ci/pytorch/windows/internal/vc_install_helper.bat script is not used for PR builds. Those are done with the https://github.com/pytorch/pytorch/blob/56e7d38b9f9ee727caf7b86aa0cca5088c94a489/.ci/pytorch/win-test-helpers/build_pytorch.bat script, which probably assumes the build tools are already installed on the image. If that's true, the build tools should be updated on the AMI image, or an installation step should be added (https://github.com/Blackhex/pytorch-test-infra/blob/main/aws/ami/windows/scripts/Installers/Install-VS.ps1). The installation script should be used for nightly wheel builds. Can you confirm whether updating the build tools works there?

PS: I am discussing with the VS team why the issue does not occur with newer build tools. I will need a small C++ repro case for that, which will take some time to extract from the codebase, unless someone has already done that.

@xuhancn
Collaborator Author

xuhancn commented May 19, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased xu_fix_vs2022_illegal_instruction onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xu_fix_vs2022_illegal_instruction && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the xu_fix_vs2022_illegal_instruction branch from 56e7d38 to cd0344d Compare May 19, 2025 12:05
@xuhancn xuhancn added the ciflow/xpu Run XPU CI tasks label May 19, 2025
@xuhancn
Collaborator Author

xuhancn commented May 19, 2025

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@xuhancn xuhancn force-pushed the xu_fix_vs2022_illegal_instruction branch from ffa3a12 to ad4ada3 Compare May 20, 2025 05:40
@xuhancn
Collaborator Author

xuhancn commented May 20, 2025

[screenshot]

Double-checked the latest changes.

Contributor

@atalman atalman left a comment


lgtm

@xuhancn
Collaborator Author

xuhancn commented May 20, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: xpu / linux-jammy-xpu-2025.1-py3.9 / test (default, 6, 6, linux.idc.xpu)

Details for Dev Infra team Raised by workflow job

@xuhancn
Collaborator Author

xuhancn commented May 20, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: xpu / linux-jammy-xpu-2025.1-py3.9 / test (default, 6, 6, linux.idc.xpu)

Details for Dev Infra team Raised by workflow job

@xuhancn
Collaborator Author

xuhancn commented May 20, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), xpu / linux-jammy-xpu-2025.1-py3.9 / test (default, 6, 6, linux.idc.xpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@xuhancn xuhancn deleted the xu_fix_vs2022_illegal_instruction branch May 20, 2025 20:37
pytorchmergebot pushed a commit that referenced this pull request Jun 10, 2025
This reverts commit e4f2282.
I believe the fix PR #153480, which addresses the problem that triggered the revert, has landed.
Hence this is a reland.

Pull Request resolved: #155478
Approved by: https://github.com/malfet
@malfet
Contributor

malfet commented Jul 29, 2025

I couldn't find any documentation for /d2implyavx512upperregs...

@Blackhex
Collaborator

Yes, this flag is undocumented. I learned that it solves this problem from public (and internal) discussions such as https://developercommunity.visualstudio.com/t/Code-gen-bug-uses-ymm16-register-for-AVX/10564317


Labels

ciflow/binaries Trigger all binary build and upload jobs on the PR ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks intel This tag is for PR from Intel Merged module: cpu CPU specific problem (e.g., perf, algorithm) module: windows Windows support for PyTorch open source topic: not user facing topic category


Development

Successfully merging this pull request may close these issues.

PyTorch VS2022 official build Windows binary illegal instruction on AVX2(max ISA level) CPU

7 participants