[AOTInductor] remove CUDA dependency for cpp backend by chenyang78 · Pull Request #110409 · pytorch/pytorch · GitHub

Conversation

@chenyang78
Contributor

@chenyang78 chenyang78 commented Oct 2, 2023

Summary:
Previously, we linked against CUDA libs even for the pure cpp backend.
This caused issues on inference platforms that do not have GPUs.
This diff removes the CUDA dependency for the cpp backend.

Reviewed By: bertmaher, muchulee8, mikekgfb

Differential Revision: D49800712

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @kadeng @muchulee8 @aakhundov @ColinPeppler

@pytorch-bot

pytorch-bot bot commented Oct 2, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110409

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c40c326 with merge base 428cbd7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D49800712

@chenyang78 chenyang78 changed the title remove CUDA dependency for cpp backend [AOTInductor] remove CUDA dependency for cpp backend Oct 2, 2023
@chenyang78 chenyang78 added the topic: not user facing label Oct 2, 2023
# For those cases, include the lpath and libs command as we do for pytorch above.
# This approach allows us to only pay for what we use.
ipaths = cpp_extension.include_paths(cuda) + [sysconfig.get_path("include")]
if aot_mode:
Contributor


I don't think we will get to else at line 854 for our use case?

Contributor Author


We will get to the else branch. I hit this with the internal e2e test, where none of the clauses include_pytorch, vec_isa != invalid_vec_isa, cuda, and config.cpp.enable_kernel_profile is True.

Contributor


vec_isa == invalid_vec_isa sounds problematic if we are talking about optimizing cpu models?

Contributor Author


vec_isa == invalid_vec_isa sounds problematic if we are talking about optimizing cpu models?

Hmm, not sure. I think we can revisit it if we run into issues with this.

functions=["run"],
extra_ldflags=[so_path],
with_cuda=True, # TODO: change this to not is_cpu
with_cuda=False,
Contributor


This should be with_cuda=not is_cpu,

Contributor Author


Ah, good catch!

_binary_constants_bin_start + bytes_read,
data_size,
cudaMemcpyHostToDevice));
#else // !USE_CUDA
Contributor


Probably don't need the else branch here? My general feeling is that USE_CUDA and is_cpu start to make the code fragmented. Some of the code can be cleaned up a bit, but it seems like using a macro like USE_CUDA is inevitable.

Contributor Author


Yeah, I have the same concern. I think USE_CUDA is inevitable. On the other hand, is_cpu seems useful for cases where we have mixed-arch execution. I will try to minimize the use of is_cpu for now and refine it later when we have a better design for supporting multiple arches.
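For illustration, here is a minimal sketch of what the guarded constants-copy path can look like (not the exact code in this PR; internal_ptr is a placeholder name for the destination buffer): the CUDA build copies the embedded blob to device memory, while the CPU-only build falls back to a plain host-side memcpy.

#ifdef USE_CUDA
  // CUDA build: copy the embedded constants blob to device memory.
  AOTI_RUNTIME_DEVICE_CHECK(cudaMemcpy(
      internal_ptr,
      _binary_constants_bin_start + bytes_read,
      data_size,
      cudaMemcpyHostToDevice));
#else // !USE_CUDA
  // CPU-only build: the constants blob stays in host memory.
  memcpy(internal_ptr, _binary_constants_bin_start + bytes_read, data_size);
#endif // USE_CUDA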

// multi devices.
#ifndef USE_CUDA

#define AOTI_RUNTIME_DEVICE_CHECK(EXPR) \
Contributor


Do we really need this for CPU? If not, I think making it an empty define could save some uses of USE_CUDA later.

Contributor Author


Previously, I was thinking of having exactly such an empty macro to reduce the uses of USE_CUDA, but I changed my mind later. One rationale is that I want to reduce any surprise. For example, I feel that turning something like AOTI_RUNTIME_DEVICE_CHECK(foo()) into an empty macro could be a surprise in cpu mode, because function foo may have both cuda and non-cuda side effects. Instead, having a dedicated macro explicitly for each arch makes the code easier to interpret and reduces the chance of such surprises.
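As an illustrative sketch only (not the exact definitions added in this PR), per-arch macros along these lines keep the check meaningful in both builds instead of silently dropping the call in cpu mode; the sketch assumes <stdexcept> and, for the CUDA branch, the CUDA runtime headers are available:

#ifdef USE_CUDA
// CUDA build: EXPR is expected to yield a cudaError_t.
#define AOTI_RUNTIME_DEVICE_CHECK(EXPR)                               \
  do {                                                                \
    const cudaError_t code = (EXPR);                                  \
    if (code != cudaSuccess) {                                        \
      throw std::runtime_error(cudaGetErrorString(code));             \
    }                                                                 \
  } while (0)
#else // !USE_CUDA
// CPU-only build: EXPR is expected to yield something convertible to bool.
#define AOTI_RUNTIME_DEVICE_CHECK(EXPR)                               \
  do {                                                                \
    if (!(EXPR)) {                                                    \
      throw std::runtime_error("CPU runtime check failed: " #EXPR);   \
    }                                                                 \
  } while (0)
#endif // USE_CUDA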

@@ -0,0 +1,4 @@
#pragma once

#include <torch/csrc/inductor/aoti_runtime/cpu_utils.h>
Contributor


The exclusiveness of the sets of macros is not obvious from this file. It may be better to just keep this file and merge the contents of the two files into this one.
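One possible shape for such a merged dispatch header, sketched here for illustration only (the cuda_utils.h name is an assumption and may not match the actual file names in this PR), makes the mutual exclusiveness explicit:

#pragma once

#ifdef USE_CUDA
// CUDA build: pull in the device-side utilities and checks.
#include <torch/csrc/inductor/aoti_runtime/cuda_utils.h>
#else
// CPU-only build: pull in the host-side counterparts of the same macros.
#include <torch/csrc/inductor/aoti_runtime/cpu_utils.h>
#endif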

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D49800712



@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Oct 3, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

