fix incorrect c10::SymFloat::sqrt #141728
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141728
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f3ccc5d with merge base 9125e91.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Yikes! This is a good backport candidate.
Summary
Math path reasoning: we only ever used `sqrt` as part of the SDPA scale, `1 / sqrt(head_dim)`, so the good sqrt and the bad sqrt only diverge in the scale that ends up being applied. Overall a very important fix and a great catch. I had a heart attack reading this at first and needed to do some sanity checks as to why this never showed up before; I think the existing blast radius is somewhat well contained.
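To make the scale difference concrete, here is a small numeric sketch. It assumes, purely for illustration, that the incorrect `sqrt` behaved like a reciprocal square root (`x ** -0.5`); the exact form of the bug should be read from the diff, not from this snippet, and the `head_dim` value is arbitrary.

```python
import math

head_dim = 64  # illustrative value; SDPA uses scale = 1 / sqrt(head_dim) by default

good_sqrt = lambda x: x ** 0.5    # correct square root
bad_sqrt = lambda x: x ** -0.5    # hypothetical buggy behaviour (reciprocal sqrt)

good_scale = 1 / good_sqrt(head_dim)   # 1 / 8      = 0.125
bad_scale = 1 / bad_sqrt(head_dim)     # 1 / (1/8)  = 8.0

print(good_scale, bad_scale)  # 0.125 8.0 -- a 64x difference, with no error raised
assert math.isclose(bad_scale / good_scale, head_dim)
```

The point of the sketch is the "silent" part: the wrong scale still produces finite, plausible-looking attention outputs, so nothing raises.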
Fixes the silent correctness issue for SDPA in #141710 [ghstack-poisoned]
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #142076. Under compile, functional collectives are supposed to **not** return `AsyncCollectiveTensor`, and instead immediately issue calls to `wait_tensor()` (which we rely on the compiler to reorder as necessary). This is done with a function `_are_we_tracing()`, which tries to detect whether we are running from inside the compiler. One of the checks it performs is `is_torchdynamo_compiling()` ([here](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_functional_collectives.py#L808C8-L808C34)). Unfortunately, this will always return False, even if dynamo is indeed tracing.

The problem is that this function only returns True if dynamo **intercepts** the bytecode for `is_torchdynamo_compiling()`; however, it is called during fake-tensor propagation, which runs as part of dynamo but is not itself intercepted by dynamo. One thing we do know is the case during dynamo tracing is that a `FakeTensorMode` is active, so I tweaked the logic to assume that we are tracing if there is an active fake mode.

This could potentially have consequences for anybody running functional collectives with a fake mode directly, without compile in the loop, although hopefully it is not too unreasonable to issue wait() calls immediately if you are running with fake tensors (presumably you only care about fake-tensor propagation, in which case the wait() calls should technically be a no-op).

Pull Request resolved: #142075
Approved by: https://github.com/yifuwang, https://github.com/kwen2501
ghstack dependencies: #141725, #141728
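A minimal sketch of the detection idea described above, not the actual code in `_functional_collectives.py`. It assumes the internal helpers `torch.utils._python_dispatch._get_current_dispatch_mode_stack` and `torch._subclasses.fake_tensor.FakeTensorMode` keep their current locations, and uses the public `torch.compiler.is_compiling()` as a stand-in for the dynamo check.

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.utils._python_dispatch import _get_current_dispatch_mode_stack


def are_we_tracing_sketch() -> bool:
    # The dynamo check only reports True in frames that dynamo actually
    # intercepts; fake-tensor propagation inside dynamo is not one of them,
    # so additionally treat an active FakeTensorMode as "we are tracing".
    if torch.compiler.is_compiling():
        return True
    return any(
        isinstance(mode, FakeTensorMode)
        for mode in _get_current_dispatch_mode_stack()
    )
```

With a fake mode pushed on the dispatch-mode stack, this returns True even though dynamo never intercepts the frame, which is exactly the situation the issue describes.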
Fixes the silent correctness issue for SDPA in pytorch#141710
Pull Request resolved: pytorch#141728
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/drisspg
ghstack dependencies: pytorch#141725
Fixes the silent correctness issue for SDPA in #141710
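For reference, the documented default SDPA scale is `1 / sqrt(E)`, where `E` is the size of the query's last dimension. One way to sanity-check behaviour, independently of how that default is computed internally, is to pass an equivalent explicit `scale`; the shapes below are illustrative only.

```python
import math
import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim); head_dim = 64 so 1/sqrt(64) is exactly representable
q = torch.randn(2, 4, 8, 64)
k = torch.randn(2, 4, 8, 64)
v = torch.randn(2, 4, 8, 64)

# Explicit scale equal to the documented default, 1 / sqrt(head_dim).
out_explicit = F.scaled_dot_product_attention(q, k, v, scale=1 / math.sqrt(q.size(-1)))
out_default = F.scaled_dot_product_attention(q, k, v)
torch.testing.assert_close(out_default, out_explicit)
```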
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames