Fix the parity of original and exported module parameters by supercharleszhu · Pull Request #160600 · pytorch/pytorch · GitHub

Conversation

@supercharleszhu (Contributor) commented Aug 14, 2025

Problem

This PR fixes a parameter-name mismatch that occurs during torch.export in strict mode (see the "How to reproduce the issue?" section below).

When two attributes map to the same tensor, strict mode will:

  1. Build a standard param buffer table to standardize names (the bug happens [here](https://github.com/supercharleszhu/pytorch/blob/f861dc1826f7b49de37a5578d6e9ef6300498606/torch/export/_trace.py#L356): when two parameters share the same id(param), the later name overwrites the earlier one; see the sketch below)
  2. [Update](https://github.com/supercharleszhu/pytorch/blob/f861dc1826f7b49de37a5578d6e9ef6300498606/torch/export/_trace.py#L1481) the export signature with the standardized FQN, which now carries the wrong name
  3. When exported_program.module() is called, it invokes [_unlift_exported_program_lifted_states](https://github.com/supercharleszhu/pytorch/blob/f861dc1826f7b49de37a5578d6e9ef6300498606/torch/export/exported_program.py#L1297) to recover attributes from the export signature, where the parameter names were standardized.
     As a result, named_parameters() of the returned module reports the overwritten name instead of the original one.
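To make the overwrite concrete, here is a minimal sketch of the pattern (hypothetical toy code, not the actual _trace.py logic): a table keyed by id(param) keeps only the last FQN seen for an aliased parameter.

```python
from torch import nn

# Toy module that registers the SAME Linear instance under two attribute names.
lin = nn.Linear(2, 2)
model = nn.Module()
model.a = lin
model.b = lin  # alias

# Buggy pattern: keying by id(param) lets the later FQN overwrite the earlier one.
param_buffer_table = {}
for name, param in model.named_parameters(remove_duplicate=False):
    param_buffer_table[id(param)] = name

print(param_buffer_table[id(lin.weight)])  # -> 'b.weight' ('a.weight' is lost)
```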

How to reproduce the issue?

Reproduction of the issue, shared by @taotaohuang001:

torch version: 2.8.0

```python
import torch
from torch import nn

# ---- Toy model with embedding weight sharing (aliasing) ----
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding_layers = nn.ModuleDict()
        tbl = nn.Embedding(100, 8)
        self.embedding_layers["ActorId"] = tbl
        # Alias: reuse the SAME module instance for another feature
        self.embedding_layers["RootActorId"] = self.embedding_layers["ActorId"]
        self.proj = nn.Linear(16, 1)

    def forward(self, feats: dict[str, torch.Tensor]):
        e1 = self.embedding_layers["ActorId"](feats["ActorId"])
        e2 = self.embedding_layers["RootActorId"](feats["RootActorId"])
        return self.proj(torch.cat([e1, e2], dim=-1))

torch.manual_seed(0)

m = Toy().eval()

# Show pre-export parameter names (canonicalized; shared weight appears once)
print("PRE-EXPORT named_parameters:")
print([name for name, _ in m.named_parameters()])

# Sanity: the two feature names point to the same weight object
w1 = m.embedding_layers["ActorId"].weight
w2 = m.embedding_layers["RootActorId"].weight
print("PRE-EXPORT alias -> same object:", w1 is w2, "| same storage:", w1.data_ptr() == w2.data_ptr())

# Example inputs (dict structure will be captured by export)
ex_in = {
    "ActorId":     torch.randint(0, 100, (4,)),
    "RootActorId": torch.randint(0, 100, (4,)),
}

# ---- Export (in memory) and materialize the runnable module ----
ep = torch.export.export(m, (ex_in,), strict=True)
gm = ep.module()  # GraphModule with new (canonical) parameter names

print("\nPOST-EXPORT named_parameters (GraphModule):")
post_names = [name for name, _ in gm.named_parameters()]
print(post_names)

# Prove alias persists after export: run fwd/bwd and check a single grad tensor exists
out = gm(ex_in).sum()
out.backward()

# Find the embedding weight in the exported module by shape (100, 8)
emb_names = [name for name, p in gm.named_parameters() if p.shape == torch.Size([100, 8])]
print("\nEmbedding param (post-export) canonical name:", emb_names[0] if emb_names else "<not found>")

# Show that only one grad exists for the shared table
for name, p in gm.named_parameters():
    if p.grad is not None and p.shape == torch.Size([100, 8]):
        print("Grad present on shared embedding weight:", name, "| grad shape:", tuple(p.grad.shape))
        break

```

Running this, you will see that the parameter names differ before and after export:

```

PRE-EXPORT named_parameters:
['embedding_layers.ActorId.weight', 'proj.weight', 'proj.bias']
PRE-EXPORT alias -> same object: True | same storage: True

POST-EXPORT named_parameters (GraphModule):
['embedding_layers.RootActorId.weight', 'proj.weight', 'proj.bias']

Embedding param (post-export) canonical name: embedding_layers.RootActorId.weight
Grad present on shared embedding weight: embedding_layers.RootActorId.weight | grad shape: (100, 8)
```
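With the fix, the exported module keeps the original canonical names, so appending a quick check to the repro script above (reusing its m and post_names) should pass:

```python
# After the fix, pre- and post-export parameter names agree.
pre_names = [name for name, _ in m.named_parameters()]
assert set(pre_names) == set(post_names), (pre_names, post_names)
print("Parameter names preserved:", sorted(post_names))
```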

Solution

This PR fixes the issue by making sure that a later named parameter does not overwrite an existing entry in the param_buffer_table when the original model's parameter already has a mapped name, as sketched below.
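In spirit, the guarded update looks like this (a sketch under the same toy setup as above, not the exact diff):

```python
from torch import nn

lin = nn.Linear(2, 2)
model = nn.Module()
model.a = lin
model.b = lin  # alias

# Guarded update: record a name only the first time a parameter object is
# seen, so the original FQN wins for aliased parameters.
param_buffer_table = {}
for name, param in model.named_parameters(remove_duplicate=False):
    param_buffer_table.setdefault(id(param), name)

print(param_buffer_table[id(lin.weight)])  # -> 'a.weight' (original name preserved)
```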

@pytorch-bot bot commented Aug 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160600

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e1a841b with merge base 83283ce:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@cpuhrsch added the triaged label Aug 14, 2025
@angelayi (Contributor) left a comment

thanks so much for the fix! could you add a test to check for this behavior?

@facebook-github-bot commented

@angelayi has imported this pull request. If you are a Meta employee, you can view this in D80750328.

@pytorch-bot bot added and then removed the ciflow/trunk label Aug 21, 2025
@facebook-github-bot commented

@angelayi has imported this pull request. If you are a Meta employee, you can view this in D80750328.

@pytorch-bot bot added the ciflow/trunk label Aug 22, 2025
@angelayi (Contributor) left a comment

thanks so much!

@supercharleszhu (Contributor, Author) commented

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@pytorch-bot bot removed the ciflow/trunk label Aug 22, 2025
@supercharleszhu (Contributor, Author) commented

@pytorchbot merge

@pytorch-bot bot commented Aug 22, 2025

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

@facebook-github-bot commented

@angelayi has imported this pull request. If you are a Meta employee, you can view this in D80750328.

@pytorch-bot bot added the ciflow/trunk label Aug 22, 2025
@pytorch-bot bot removed the ciflow/trunk label Aug 22, 2025
@facebook-github-bot commented

@angelayi has imported this pull request. If you are a Meta employee, you can view this in D80750328.

@pytorch-bot bot added the ciflow/trunk label Aug 22, 2025
@supercharleszhu (Contributor, Author) commented

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorch-bot bot removed the ciflow/trunk label Aug 25, 2025
@facebook-github-bot commented

@angelayi has imported this pull request. If you are a Meta employee, you can view this in D80750328.

@pytorch-bot bot added the ciflow/trunk label Aug 25, 2025
@supercharleszhu (Contributor, Author) commented

@pytorchbot merge

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025

Pull Request resolved: pytorch#160600
Approved by: https://github.com/angelayi

Labels: ciflow/trunk, Merged, open source, release notes: export, triaged
