[DTensor] Support user-supplied Generator for random ops by wconstab · Pull Request #159933 · pytorch/pytorch · GitHub

Conversation

@wconstab
Contributor

@wconstab wconstab commented Aug 6, 2025

Stack from ghstack (oldest at bottom):

If the user provides a generator kwarg to a random op (e.g.
nn.init.uniform_(..., generator=my_generator)), we can still advance
that generator's state in an SPMD-global way, so that each local tensor
gets appropriate values and the generator advances to the same state as
if it had operated on the full tensor.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @pragupta
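
For reference, a minimal usage sketch of the behavior this enables. The mesh setup and names below are illustrative, and it assumes each rank seeds the generator identically, since DTensor does not synchronize it:

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

# assumes the process group is already initialized (e.g. launched via torchrun)
device_type = "cuda" if torch.cuda.is_available() else "cpu"
mesh = init_device_mesh(device_type, (dist.get_world_size(),))

# the user must give the generator the same state on every rank; DTensor does not sync it
gen = torch.Generator(device=device_type)
gen.manual_seed(42)

dt = distribute_tensor(torch.empty(1024, 1024, device=device_type), mesh, [Shard(0)])
torch.nn.init.uniform_(dt, 0.0, 1.0, generator=gen)
# each local shard receives its slice of the "global" random tensor, and `gen`
# advances to the state it would have after sampling the full 1024x1024 tensor
```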

@pytorch-bot

pytorch-bot bot commented Aug 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159933

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 7a01a13 with merge base 908c5cc:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Aug 6, 2025

ghstack-source-id: b93f34f
Pull Request resolved: #159933
@pytorch-bot pytorch-bot bot added the ciflow/inductor and oncall: distributed labels Aug 6, 2025

-    def _distribute_region(self, spec: DTensorSpec):
+    def _distribute_region(
+        self, spec: DTensorSpec, generator: Optional[torch.Generator]
Contributor Author

Missing a =None default on the new generator parameter.
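
i.e., presumably something like:

```python
def _distribute_region(
    self, spec: DTensorSpec, generator: Optional[torch.Generator] = None
):
    ...
```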

Contributor

@XilunWu XilunWu left a comment

Overall LGTM. Let me know if you need a quick unblock.

We need to explicitly tell users to initialize the passed-in generator with the same seed on every rank, or add this functionality to our manual_seed API; either way, users still need to be aware of this.

assert maybe_user_generator is None or isinstance(
    maybe_user_generator, torch.Generator
)
# maybe_user_generator = None
Contributor

remove comment

Comment on lines 205 to 206
if g_name not in self.rng_states:
    self.rng_states[g_name] = generator.get_state()
Contributor

This is a slight behavior divergence between using the default RNG and using a user-specified RNG. Either we require users to call our manual_seed() API with the RNG passed in, or we optimistically assume users know what they're doing and are responsible for initializing the RNG with the right seed value across ranks (which is what we're doing here).

Contributor Author

I think I like the behavior this way: I don't want to introduce a collective on every op where a user supplies an RNG.

I will add documentation to the dtensor docs stating that it is the user's responsibility to ensure the passed generator has the same state on every SPMD rank.

Contributor

The original feature request comes from @akolesnikoff and myself. We noticed the issue specifically because we were wondering about, and comparing, the behaviour of passing in identically vs. differently seeded RNGs across ranks. So at least from our perspective, yes, this is intentional.

Contributor

I see.

> passing in same vs different seeded RNGs across ranks

I assume you're trying to pass the same RNG for Data Parallel and different RNGs for Model Parallel (let me know if this is not the case).

# not because we need to keep a copy of it but because it's the easiest way to make it work with the
# existing set/get APIs
g_name = str(id(generator))
if g_name not in self.rng_states:
Contributor Author

I will also add 2 more test cases and fix one bug here:

If the user changed the seed of their generator after using it with DTensor, we'd cache the first seed and use that forever. To prevent this, I'm going to pop the temporary generator state back out of self.rng_states at the end of this context, so we add it fresh again every time.

This makes me want to do a refactor: I think we should not be storing states like this. We should probably just store a ref to our own generator and then use it.
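
A rough sketch of the "add it fresh, then pop it back out" idea (the helper and names here are illustrative, not the actual DTensor internals):

```python
from contextlib import contextmanager

import torch


@contextmanager
def _track_user_generator(rng_states: dict, generator: torch.Generator):
    # hypothetical helper: stash the user's generator state under a temporary key
    # for the duration of one _distribute_region, then write the advanced state
    # back into the user's generator and drop the cache entry so a later seed
    # change is picked up next time.
    key = "user-passed-generator"
    rng_states[key] = generator.get_state()
    try:
        yield key
    finally:
        generator.set_state(rng_states.pop(key))
```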

wconstab added a commit that referenced this pull request Aug 6, 2025

ghstack-source-id: 7b6672e
Pull Request resolved: #159933
wconstab added a commit that referenced this pull request Aug 6, 2025

ghstack-source-id: d5ab5a2
Pull Request resolved: #159933
@lucasb-eyer
Contributor

Thanks, Will!

wconstab added a commit that referenced this pull request Aug 6, 2025

ghstack-source-id: d2523f2
Pull Request resolved: #159933
@wconstab wconstab added the release notes: distributed (dtensor) label Aug 6, 2025
@wconstab
Contributor Author

wconstab commented Aug 6, 2025

I ended up implementing the version where the user-passed RNG IS mutated after it is used by DTensor.
(a) because I ran into trouble with the previous implementation that tried to cache the generator by id(generator): I found that on some ranks the id of the generator in the DTensor op kwargs changes, and I gave up trying to figure out why (guessing there is a copy happening somewhere?).

(b) because I think this is the UX we want to aim for anyway.
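
Roughly, the observable contract is the following sketch, where dt is a DTensor and gen is the user-supplied generator, seeded identically on every rank:

```python
before = gen.get_state().clone()
torch.nn.init.uniform_(dt, 0.0, 1.0, generator=gen)
after = gen.get_state()

# the user's generator really was advanced, and by the same amount on every
# rank, as if uniform_ had sampled the full (unsharded) tensor
assert not torch.equal(before, after)
```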

Contributor

@XilunWu XilunWu left a comment

this API change lgtm

Comment on lines +108 to +121
# ensure that we do not cache the 'seed' of `rng` from the first time we see it in DTensor
# TODO: we have a semantics decision to make
# There is a discontinuity between how the default RNG and a user-supplied RNG behaves with DTensor:
# (a) if the user calls `torch.manual_seed` after already using the default RNG with DTensor,
# they may be surprised that it has no effect on DTensor. They must instead call this private API
# (`torch.distributed.tensor._random._rng_tracker._manual_seed`)
# (b) If we try to match the semantics of (a) with a user-supplied RNG, they may be very surprised to find that
# their RNG object never advances its state after using it with DTensor.
# torch.distributed.tensor._random._rng_tracker._manual_seed(55)
# rng.manual_seed(55)
# torch.nn.init.uniform_(t1, 0.0, 1.0)
# torch.nn.init.uniform_(t2, 0.0, 1.0, rng)
# self.assertEqual(t1.full_tensor(), t2.full_tensor())

Contributor

clean up this comment, maybe move to _random.py?

Contributor Author

I think it's OK to leave it here for now. I still need to get some agreement on whether we're changing the default RNG behavior; then I'd make another PR to do that, and I can remove this TODO and enable this part of the test.

Collaborator

@wanchaol wanchaol left a comment

lgtm

# This is a little hacky, but for any user-passed generator, we store its state under a unique key,
# not because we need to keep a copy of it but because it's the easiest way to make it work with the
# existing set/get APIs. We also ensure we remove it from rng_states after each _distribute_region.
g_name = "user-passed-generator"
Collaborator

I wonder if it could just be str(generator) as the key?

Contributor Author

I can try this. I think I'll land as-is and experiment more with this in the next PR.

I already tried using id(generator) as the key; this did not work, and I suspect we are making copies of the Python wrapper at some point in our bindings or dispatching layer, leading to the id changing. I do notice that str(generator) prints what looks like a memory address for the underlying C++ object, so it might indeed be more stable and fix my issue. Thanks for the suggestion!

@wconstab
Contributor Author

wconstab commented Aug 7, 2025

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Aug 7, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: Check Labels / Check labels, Check mergeability of ghstack PR / ghstack-mergeability-check, pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

@github-actions github-actions bot deleted the gh/wconstab/439/head branch September 7, 2025 02:13
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025


Pull Request resolved: pytorch#159933
Approved by: https://github.com/fduwjj, https://github.com/XilunWu, https://github.com/wanchaol