Introduce AcceleratorAllocatorConfig as the common class by guangyey · Pull Request #149601 · pytorch/pytorch · GitHub

Conversation

@guangyey
Collaborator

@guangyey guangyey commented Mar 20, 2025

Stack from ghstack (oldest at bottom):

Motivation

This PR generalizes AllocatorConfig to be device-agnostic. It introduces the class AcceleratorAllocatorConfig to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name AllocatorConfig is now reserved for a potential future base class that could unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.

Design Rule

Overall

This class configures memory allocation for both device and host memory. A single AcceleratorAllocatorConfig instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see registerDeviceConfigParserHook).
A new class, ConfigTokenizer, is introduced to help process the environment variable's key-value pairs.
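
As a rough illustration of the hook mechanism, a backend could register a parser for keys only it understands. This is a minimal sketch: the hook type and the registration call shape are assumptions, not the exact signatures from this PR; only the name registerDeviceConfigParserHook comes from it.

```cpp
#include <string>

// Hypothetical hook: consumes keys that only this backend understands.
// The real hook signature introduced by the PR may differ.
void xpuDeviceConfigParser(const std::string& config) {
  // e.g., tokenize `config` with a ConfigTokenizer-style helper and
  // apply any XPU-specific options found in it.
}

void registerXpuParser() {
  // Registration point named by this PR; the exact call shape is
  // illustrative, so it is left commented out here:
  // AcceleratorAllocatorConfig::registerDeviceConfigParserHook(
  //     xpuDeviceConfigParser);
}
```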

Naming Convention:

  • Public API names in AcceleratorAllocatorConfig should be device-generic.
  • Members prefixed with pinned_ are specific to the host/pinned allocator.
  • Environment variable names should be generic across backends.
  • Comma-separated key-value pairs use the format key:value. Square brackets [] denote list values, for example: key1:123, key2:[val1,val2] (see the sketch below).
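
For instance, a configuration string following these conventions might look like the one below. The specific keys are illustrative examples of the format (drawn from existing allocator options), not a verified list for this PR.

```cpp
#include <string>

// Illustrative config string: plain key:value pairs separated by commas,
// with square brackets for a list value; `pinned_use_background_threads`
// follows the pinned_ naming rule for the host allocator.
const std::string kExampleConf =
    "max_split_size_mb:256,"
    "pinned_use_background_threads:True,"
    "roundup_power2_divisions:[64:8,256:4,>:1]";
```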

Environment Variables:

  • The default environment variable for configuration is PYTORCH_ALLOC_CONF.
  • For backward compatibility, PYTORCH_CUDA_ALLOC_CONF and PYTORCH_HIP_ALLOC_CONF are also supported with lower priority.
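
A minimal sketch of that priority order, assuming a simple getenv-based lookup; the PR's actual implementation may differ:

```cpp
#include <cstdlib>
#include <string>

// Generic variable first; the legacy CUDA/HIP variables are consulted
// only as lower-priority fallbacks.
std::string readAllocatorConf() {
  for (const char* name : {"PYTORCH_ALLOC_CONF",
                           "PYTORCH_CUDA_ALLOC_CONF",
                           "PYTORCH_HIP_ALLOC_CONF"}) {
    if (const char* val = std::getenv(name)) {
      return val;
    }
  }
  return "";
}
```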

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @albanD @EikanWang

Differential Revision: D79011786

@pytorch-bot

pytorch-bot bot commented Mar 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149601

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit 63983b1 with merge base 6de2413:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@guangyey guangyey changed the title Generalize AllocatorConfig to be device-agnostic [WIP] Generalize AllocatorConfig to be device-agnostic Mar 20, 2025
@guangyey guangyey marked this pull request as draft March 24, 2025 03:01
@guangyey guangyey added release notes: cpp release notes category topic: improvements topic category labels Mar 26, 2025
guangyey added 10 commits March 31, 2025 17:05
@huydhn
Contributor

huydhn commented Jul 15, 2025

@pytorchbot revert -m 'See #149601 (comment)' -c ghfirst

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jul 15, 2025
@pytorchmergebot
Collaborator

@guangyey your PR has been successfully reverted.

C10_API inline void setAllocatorSettings(const std::string& env) {
  AcceleratorAllocatorConfig::instance().parseArgs(env);
}
Collaborator


Isn't this one still going to parse at global init time?

Collaborator Author


This API will only be used inside torch._C._accelerator_setAllocatorSettings; the user will call it explicitly.
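
To make that concrete, a hedged sketch of the intended call path: parsing is triggered only by an explicit runtime call (exposed to Python as torch._C._accelerator_setAllocatorSettings), not during static initialization. The header path and namespace below are assumptions.

```cpp
#include <string>
// #include <c10/core/AllocatorConfig.h>  // assumed header location

void onUserRequest() {
  // Explicit runtime call; nothing is parsed at global init time.
  // The c10::CachingAllocator namespace is an assumption here.
  c10::CachingAllocator::setAllocatorSettings("max_split_size_mb:128");
}
```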

guangyey added 4 commits July 16, 2025 15:13
@wdvr
Contributor

wdvr commented Jul 18, 2025

@wdvr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@wdvr
Contributor

wdvr commented Jul 25, 2025

@wdvr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #156175

pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
# Motivation
Add a mechanism to raise an error if a key in the allocator config is unrecognized.

Pull Request resolved: #157908
Approved by: https://github.com/albanD
ghstack dependencies: #149601
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
…0312)

# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for backward compatibility.

Pull Request resolved: #150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
…llocatorConfig instead (#156165)

Pull Request resolved: #156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: #156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
@github-actions github-actions bot deleted the gh/guangyey/130/head branch August 30, 2025 02:08