KEMBAR78
Expose functions used in custom backend in torch_python dll by wschin · Pull Request #148213 · pytorch/pytorch · GitHub
Skip to content

Conversation

wschin
Copy link
Collaborator

@wschin wschin commented Feb 28, 2025

Fixes #148208. There are solutions for exposing symbols implicitly from inline functions (i.e., inline function A calls non-inline function B in foo.h. Code includes foo.h has to see the symbol B in DLL).

Solution 1: tag the entire struct where the inline functions are defined as member functions with TORCH_PYTHON_API --- this PR does this for python_arg_parser.h. An alternative solution exists but will slow down dispatching a lot --- drop inline keyword and move implementation to .cc file.

Solution 2: tag individual functions with TORCH_PYTHON_API. This PR does this for python_tensor.h.

Related discussion about hiding torch_python symbols: #142214

@pytorch-bot
Copy link

pytorch-bot bot commented Feb 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148213

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e0370b0 with merge base 4216478 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wschin wschin force-pushed the expose-some-symbols branch from 7f63068 to b463f81 Compare February 28, 2025 21:37
@wschin
Copy link
Collaborator Author

wschin commented Feb 28, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Feb 28, 2025
@wschin wschin requested review from albanD and malfet February 28, 2025 22:18
@wschin wschin added the topic: bug fixes topic category label Mar 2, 2025
Comment on lines +30 to +34
TORCH_PYTHON_API c10::DispatchKey get_default_dispatch_key();
TORCH_PYTHON_API at::Device get_default_device();

// Gets the ScalarType for the default tensor type.
at::ScalarType get_default_scalar_type();
TORCH_PYTHON_API at::ScalarType get_default_scalar_type();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should test if this change affects torch_npu cc: @FFFrog

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it will just bring some symbols back. Those symbols used to exist but got removed ~3 months ago.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No effect, but adding TORCH_PYTHON_API for private functions is a bit weird to me :/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. So those are moved to public as discussed in #148208.

Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's wrong to say that private class method should be accessible from the extensions. (The rest of the changes are fine)

Comment on lines 306 to 308
TORCH_PYTHON_API at::Tensor tensor_slow(int i);
TORCH_PYTHON_API at::Scalar scalar_slow(int i);
TORCH_PYTHON_API at::Scalar scalar_slow(PyObject* arg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels wrong: we should not make private functions part of the API...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my PR message on the top. Those are called in (performance critical) inline functions in .h file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO two potential solutions: Add TORCH_PYTHON_API to the entire class, which will make all its methods symbols visible or make those methods public, rather than private to the class implementation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me try the 2nd solution, which looks more aligned with the goal of reducing API surace.

@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 3, 2025
inline bool isNone(int i);
inline std::optional<c10::DispatchKeySet> toDispatchKeySetOptional(int i);

private:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not part of the functions that used to be public but moved to private 3 months ago? Why are you adding them here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you now who moved them to private?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "moved to private" I mention here is the PR from 3 months ago hiding the symbols. Not moving these to be on this class "private:" tag.

Copy link
Collaborator Author

@wschin wschin Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what do you mean by public. Do you mean public C++ member function or symbols available in DLL? Our custom backend links to these functions when building with PyTorch. Those functions are accessible in PyTorch DLL ~ 1 year ago, so they were exposed in DLL before.

Q: Why are those APIs are tagged with TORCH_PYTHON_API?
A: These private functions are called inline functions in .h file. Whoever includes that .h file and indirectly calls these private functions will need these symbols when linking to torch_python DLL.

Q: Why is C++ private keyword removed?
A: It's a consensus in #148208. We can change tensor_slow back to private C++ member function, but you will see

...
private:
TORCH_PYTHON_API ... tensor_slow(...)

which is strange because private function becomes an API.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@albanD, @malfet, gentle ping. Please let me know if I missed anything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, just curious, would TORCH_PYTHON_API the entire class will work? (That would transitively apply to all methods?)

Copy link
Collaborator Author

@wschin wschin Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@malfet, I can try but that's not the conclusion made by other meta folks in #148208. I will update if it works soon.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tagging TORCH_PYTHON_API at structure level works for my case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's land it this way then

@wschin wschin force-pushed the expose-some-symbols branch from e8faba8 to e0370b0 Compare March 5, 2025 23:59
Copy link
Contributor

@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you for the updates

@malfet
Copy link
Contributor

malfet commented Mar 7, 2025

@pytorchbot merge -f "Everything looks green"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged open source topic: bug fixes topic category topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regression: Missing Symbols in PyTorch DLL (torch_python)

8 participants