Introduce AcceleratorAllocatorConfig as the common class by guangyey · Pull Request #149601 · pytorch/pytorch · GitHub

Conversation

@guangyey
Collaborator

@guangyey guangyey commented Mar 20, 2025

Stack from ghstack (oldest at bottom):

Motivation

This PR generalizes AllocatorConfig to be device-agnostic. It introduces the class AcceleratorAllocatorConfig to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name AllocatorConfig is now reserved for a potential future base class that could unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.

Design Rule

Overall

This class configures memory allocation for both device and host memory. A single AcceleratorAllocatorConfig instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see registerDeviceConfigParserHook).
A new class, ConfigTokenizer, is introduced to help process the environment variable's key-value pairs.
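
As a rough illustration of the hook mechanism, a backend could register a parser for keys only it understands. This is a minimal sketch: the hook type and the registration call shape are assumptions, not the exact signatures from this PR; only the name registerDeviceConfigParserHook comes from it.

```cpp
#include <string>

// Hypothetical hook: consumes keys that only this backend understands.
// The real hook signature introduced by the PR may differ.
void xpuDeviceConfigParser(const std::string& config) {
  // e.g., tokenize `config` with a ConfigTokenizer-style helper and
  // apply any XPU-specific options found in it.
}

void registerXpuParser() {
  // Registration point named by this PR; the exact call shape is
  // illustrative, so it is left commented out here:
  // AcceleratorAllocatorConfig::registerDeviceConfigParserHook(
  //     xpuDeviceConfigParser);
}
```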

Naming Convention:

  • Public API names in AcceleratorAllocatorConfig should be device-generic.
  • Members prefixed with pinned_ are specific to the host/pinned allocator.
  • Environment variable names should be generic across backends.
  • Comma-separated key-value pairs use the format key:value. Square brackets [] denote list values, for example: key1:123, key2:[val1,val2] (see the sketch below).
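
For instance, a configuration string following these conventions might look like the one below. The specific keys are illustrative examples of the format (drawn from existing allocator options), not a verified list for this PR.

```cpp
#include <string>

// Illustrative config string: plain key:value pairs separated by commas,
// with square brackets for a list value; `pinned_use_background_threads`
// follows the pinned_ naming rule for the host allocator.
const std::string kExampleConf =
    "max_split_size_mb:256,"
    "pinned_use_background_threads:True,"
    "roundup_power2_divisions:[64:8,256:4,>:1]";
```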

Environment Variables:

  • The default environment variable for configuration is PYTORCH_ALLOC_CONF.
  • For backward compatibility, PYTORCH_CUDA_ALLOC_CONF and PYTORCH_HIP_ALLOC_CONF are also supported with lower priority.
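
A minimal sketch of that priority order, assuming a simple getenv-based lookup; the PR's actual implementation may differ:

```cpp
#include <cstdlib>
#include <string>

// Generic variable first; the legacy CUDA/HIP variables are consulted
// only as lower-priority fallbacks.
std::string readAllocatorConf() {
  for (const char* name : {"PYTORCH_ALLOC_CONF",
                           "PYTORCH_CUDA_ALLOC_CONF",
                           "PYTORCH_HIP_ALLOC_CONF"}) {
    if (const char* val = std::getenv(name)) {
      return val;
    }
  }
  return "";
}
```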

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @albanD @EikanWang

Differential Revision: D79011786

@pytorch-bot

pytorch-bot bot commented Mar 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149601

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit 63983b1 with merge base 6de2413:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@guangyey guangyey changed the title Generalize AllocatorConfig to be device-agnostic [WIP] Generalize AllocatorConfig to be device-agnostic Mar 20, 2025
@guangyey guangyey marked this pull request as draft March 24, 2025 03:01
@guangyey guangyey added release notes: cpp release notes category topic: improvements topic category labels Mar 26, 2025
guangyey added 10 commits March 31, 2025 17:05
@huydhn
Contributor

huydhn commented Jul 15, 2025

@pytorchbot revert -m 'See #149601 (comment)' -c ghfirst

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jul 15, 2025
@pytorchmergebot
Collaborator

@guangyey your PR has been successfully reverted.

C10_API inline void setAllocatorSettings(const std::string& env) {
  AcceleratorAllocatorConfig::instance().parseArgs(env);
}
Collaborator


Isn't this one still going to parse at global init time?

Collaborator Author


This API will only be used inside torch._C._accelerator_setAllocatorSettings; the user will call it explicitly.
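
To make that concrete, a hedged sketch of the intended call path: parsing is triggered only by an explicit runtime call (exposed to Python as torch._C._accelerator_setAllocatorSettings), not during static initialization. The header path and namespace below are assumptions.

```cpp
#include <string>
// #include <c10/core/AllocatorConfig.h>  // assumed header location

void onUserRequest() {
  // Explicit runtime call; nothing is parsed at global init time.
  // The c10::CachingAllocator namespace is an assumption here.
  c10::CachingAllocator::setAllocatorSettings("max_split_size_mb:128");
}
```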

guangyey added 4 commits July 16, 2025 15:13
@wdvr
Contributor

wdvr commented Jul 18, 2025

@wdvr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@wdvr
Contributor

wdvr commented Jul 25, 2025

@wdvr has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #156175

pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
# Motivation
Add a mechanism to raise an error if a key in the allocator config is unrecognized.

Pull Request resolved: #157908
Approved by: https://github.com/albanD
ghstack dependencies: #149601
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
…0312)

# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for backward compatibility.

Pull Request resolved: #150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
…llocatorConfig instead (#156165)

Pull Request resolved: #156165
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312
pytorchmergebot pushed a commit that referenced this pull request Jul 30, 2025
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.

Pull Request resolved: #156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
@github-actions github-actions bot deleted the gh/guangyey/130/head branch August 30, 2025 02:08