[AOTI] Use `len(serialized_weights)` when calculating `consts_size` by chunyuan-w · Pull Request #139054 · pytorch/pytorch · GitHub

Conversation

@chunyuan-w
Collaborator

@chunyuan-w chunyuan-w commented Oct 28, 2024

Stack from ghstack (oldest at bottom):

Fixes the failure of INT8 DLRM using AOTI.
The previous code calculated consts_size directly from the tensor entries in graph.constants:

```
  consts_size = sum(
      get_nbytes_of_tensor(tensor, all_cuda)
      for (name, tensor) in graph.constants.items()
      if name not in graph.folded_constants
  )
```

Meanwhile, the actual bytes to serialize (serialized_weights) are produced from graph.get_original_value_of_constant(name):

```
  serialized_weights = b"".join(
      _to_bytes(graph.get_original_value_of_constant(name), all_cuda)
      for name in graph.constants.keys()
      if name not in graph.folded_constants
  )
```

The tensor from graph.constants can differ from graph.get_original_value_of_constant(name), making consts_size inconsistent with the actual byte size of serialized_weights and resulting in the runtime error weights_offset must be aligned to 16K boundary, similar to what happened in #135205.
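To make the alignment error concrete, here is a minimal standalone sketch (a generic round-up helper, not the actual AOTI implementation): the serialized weights blob is padded so that the weights offset lands on a 16384-byte boundary, which means the padding must be computed from the true serialized size.

```python
# Generic 16K alignment sketch; illustrative, not the AOTI runtime code.
ALIGN = 16 * 1024  # the 16K boundary required for the weights offset

def align_up(n, alignment=ALIGN):
    # Round n up to the next multiple of `alignment`.
    return (n + alignment - 1) // alignment * alignment

serialized_size = 100_000            # bytes actually written
padded = align_up(serialized_size)   # 7 * 16384 = 114688
assert padded % ALIGN == 0

# If consts_size were computed from a different (e.g. transformed) tensor,
# the padding would be based on the wrong number, and the runtime check
# on the real data layout would fail with the 16K-boundary error.
```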

This PR directly gets consts_size using len(serialized_weights), which fixes the inconsistency.
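The bug and the fix can be illustrated with a small standalone sketch (the names and one-byte-per-element serialization are simplified stand-ins, not the actual AOTI code): computing the size from one source while serializing from another can drift, whereas measuring the serialized blob itself cannot.

```python
# Simplified sketch of the inconsistency; not the actual AOTI code.
# Suppose the constant stored in the graph was transformed (e.g. packed
# for INT8 quantization) while serialization uses the original value.

def to_bytes(values):
    # Stand-in for tensor serialization: one byte per element.
    return bytes(values)

original_weights = {"w": [1, 2, 3, 4]}   # what actually gets serialized
packed_constants = {"w": [1, 2]}         # transformed copy in the graph

serialized = b"".join(to_bytes(v) for v in original_weights.values())

# Buggy: size computed from the transformed tensors...
buggy_consts_size = sum(len(v) for v in packed_constants.values())
# Fixed: size measured from the bytes that are really written.
fixed_consts_size = len(serialized)

assert buggy_consts_size != fixed_consts_size  # the inconsistency
assert fixed_consts_size == 4
```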

We also added a reduce_range argument to the get_default_x86_inductor_quantization_config function, which is needed in the unit test to avoid accuracy issues on CI machines (older CPUs without VNNI).
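The effect of reduce_range can be sketched independently of PyTorch (a generic illustration of the concept, not the quantizer's implementation): with reduce_range=True the quantization range is narrowed by one bit, so intermediate products in int8 matmuls cannot overflow on pre-VNNI x86 instructions, at the cost of slightly coarser quantization.

```python
# Generic illustration of reduce_range; not the PyTorch quantizer code.
def quant_range(dtype_bits=8, signed=False, reduce_range=False):
    # reduce_range drops one bit from the usable range.
    bits = dtype_bits - 1 if reduce_range else dtype_bits
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

print(quant_range(reduce_range=False))  # full uint8 range: (0, 255)
print(quant_range(reduce_range=True))   # halved range: (0, 127)
```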

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Oct 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139054

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 99f163f with merge base a7479fa:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the release notes: quantization release notes category label Oct 29, 2024
chunyuan-w added a commit that referenced this pull request Oct 29, 2024

class AOTInductorTestABICompatibleCpu(AOTITestCase):
device = "cpu"
device_type = "cpu"
Collaborator Author


device_type is needed in skipCUDAIf
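The reason device_type matters can be sketched generically (this decorator is a simplified stand-in, not PyTorch's actual skipCUDAIf): skip decorators of this style inspect the test class's device_type attribute to decide whether the skip applies, so a class that only defines device would never match.

```python
import unittest

# Minimal sketch of the skipCUDAIf pattern: the decorator reads the test
# class's `device_type` attribute, so the class must define it.
def skip_cuda_if(condition, reason):
    def decorator(cls):
        if getattr(cls, "device_type", None) == "cuda" and condition:
            return unittest.skip(reason)(cls)
        return cls
    return decorator

@skip_cuda_if(True, "CUDA unavailable in this sketch")
class ExampleCudaTest(unittest.TestCase):
    device = "cuda"
    device_type = "cuda"   # required: the decorator reads this attribute

@skip_cuda_if(True, "CUDA unavailable in this sketch")
class ExampleCpuTest(unittest.TestCase):
    device = "cpu"
    device_type = "cpu"    # not "cuda", so this class is not skipped
```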

@unittest.skipIf(sys.platform == "darwin", "No CUDA on MacOS")
class AOTInductorTestABICompatibleCuda(AOTITestCase):
device = "cuda"
device_type = "cuda"
Collaborator Author


ditto

chunyuan-w added a commit that referenced this pull request Oct 29, 2024

class AOTInductorTestABICompatibleCpuWithStackAllocation(AOTITestCase):
device = "cpu"
device_type = "cpu"
Collaborator Author


ditto

TestCase
):
device = "cpu"
device_type = "cpu"
Collaborator Author


ditto

[ghstack-poisoned]
@chunyuan-w chunyuan-w marked this pull request as ready for review October 29, 2024 12:57
@chunyuan-w chunyuan-w changed the title [AOTI] Use get_original_value_of_constant when calculating consts_size [AOTI] Use len(serialized_weights) when calculating consts_size Oct 29, 2024
[ghstack-poisoned]
@chunyuan-w chunyuan-w requested a review from desertfire October 30, 2024 03:02
@chunyuan-w chunyuan-w added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 31, 2024
@chunyuan-w
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 6cd0b4b8a1f050585d611bf9a28a406a4ee7bb88 returned non-zero exit code 1

Auto-merging test/inductor/test_aot_inductor.py
Auto-merging torch/_inductor/codecache.py
CONFLICT (content): Merge conflict in torch/_inductor/codecache.py
Auto-merging torch/ao/quantization/quantizer/x86_inductor_quantizer.py
error: could not apply 6cd0b4b8a1f... [AOTI] Use get_original_value_of_constant when calculating consts_size
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Raised by workflow job.

chunyuan-w added a commit that referenced this pull request Oct 31, 2024
@chunyuan-w
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

[ghstack-poisoned]
@yushangdi
Contributor

yushangdi commented Nov 1, 2024

@chunyuan-w Hi, I noticed that this PR uses capture_pre_autograd_graph. Note that this API is deprecated and we plan to delete it very soon. Please use torch.export.export_for_training(...).module() instead moving forward. Thanks!

cc @leslie-fang-intel @jgong5

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Pull Request resolved: pytorch#139054
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
@github-actions github-actions bot deleted the gh/chunyuan-w/42/head branch December 2, 2024 02:13