[inductor cpp] vectorize embedding lookup #114062

jgong5 · 2023-11-19T09:42:54Z

Stack from ghstack (oldest at bottom):

-> [inductor cpp] vectorize embedding lookup #114062

An embedding lookup uses indirect indexing, and the indices it loads are invariant to the vectorized itervar. To vectorize the lookup, we keep the related indexing variables as scalars and allow vectorization when the related `index_expr`s are invariant to the vectorized itervar.

This PR adds this support by lazily broadcasting scalar values (`index_expr` and `constant`) to vectors, so that `CppVecKernel` generates vector operations only when at least one input is a vector and generates scalar ops otherwise. The CSE variable in cpp is now represented by `CppCSEVariable`, which records the itervars the variable depends on and carries a flag marking whether it is a scalar or a vector. `CppVecOverrides` is extended to propagate this state as the ops are executed.
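
To make the state propagation concrete, here is a minimal Python sketch. It is not the actual Inductor implementation: `ToyCSEVar`, `broadcast`, and `add` are hypothetical names invented for illustration. Only the idea comes from the description above: each value carries a scalar/vector flag plus the set of itervars it depends on, and a scalar is broadcast only when it meets a vector operand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCSEVar:
    """Hypothetical stand-in for a CSE variable: a C++ expression plus its state."""

    expr: str                            # C++ expression text
    is_vec: bool = False                 # True if the value is a vector
    depends_on: frozenset = frozenset()  # itervars the value depends on


def broadcast(v: ToyCSEVar) -> str:
    """Return a vector expression for v, broadcasting a scalar only on demand."""
    return v.expr if v.is_vec else f"at::vec::Vectorized<float>({v.expr})"


def add(a: ToyCSEVar, b: ToyCSEVar) -> ToyCSEVar:
    """Emit a vector add if either input is a vector, otherwise a scalar add."""
    deps = a.depends_on | b.depends_on
    if a.is_vec or b.is_vec:
        return ToyCSEVar(f"{broadcast(a)} + {broadcast(b)}", True, deps)
    return ToyCSEVar(f"{a.expr} + {b.expr}", False, deps)


index = ToyCSEVar("tmp0", False, frozenset({"x0"}))      # scalar index load
const = ToyCSEVar("64")                                  # scalar constant
row = ToyCSEVar("tmp4", True, frozenset({"x0", "x1"}))   # vector load

scalar_sum = add(index, const)  # stays scalar, no broadcast is emitted
vector_sum = add(row, const)    # the constant is broadcast at this point
```

In this toy model `scalar_sum` remains a scalar that depends only on `x0`, which mirrors how the index arithmetic in the generated kernel stays scalar while the row arithmetic becomes vector code.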

For the added UT test_embedding_vec, the generated code before this PR is:

extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for 
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = in_ptr2[static_cast<long>(x1 + (128L*x0))];
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = in_ptr1[static_cast<long>(x1 + (128L*tmp3))];
                    auto tmp6 = decltype(tmp4)(tmp4 + tmp5);
                    out_ptr0[static_cast<long>(x1 + (128L*x0))] = tmp6;
                }
            }
        }
    }
}

After this PR, we have:

extern "C" void kernel(const long* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(64)
    {
        {
            #pragma omp for 
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(128L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(128L); x1+=static_cast<long>(16L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(x0)];
                    auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x1 + (128L*x0)));
                    auto tmp1 = decltype(tmp0)(tmp0 + 64);
                    auto tmp2 = tmp0 < 0;
                    auto tmp3 = tmp2 ? tmp1 : tmp0;
                    TORCH_CHECK((0 <= tmp3) & (tmp3 < 64L), "index out of bounds: 0 <= tmp3 < 64L")
                    auto tmp4 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x1 + (128L*tmp3)));
                    auto tmp6 = tmp4 + tmp5;
                    tmp6.store(out_ptr0 + static_cast<long>(x1 + (128L*x0)));
                }
            }
        }
    }
}
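
For readers who want to check what the kernel computes, the following is a plain-Python reference sketch, written under assumptions rather than taken from the PR. The function name `embedding_add` and its parameters are hypothetical, and tensors are modelled as flat lists. The index is resolved once per row here, whereas the generated code recomputes it in each inner iteration, which is equivalent.

```python
def embedding_add(indices, table, other, num_embeddings=64, dim=128, vec=16):
    """Reference for out[x0, :] = table[indices[x0], :] + other[x0, :].

    Tensors are flat row-major lists; `vec` plays the role of the vector
    width (16 floats in the generated code above).
    """
    out = [0.0] * (len(indices) * dim)
    for x0 in range(len(indices)):
        idx = indices[x0]
        # Wrap negative indices, then bounds-check, as the kernel does.
        idx = idx + num_embeddings if idx < 0 else idx
        if not 0 <= idx < num_embeddings:
            raise IndexError(f"index out of bounds: 0 <= {idx} < {num_embeddings}")
        # The scalar index selects a row; the row is processed `vec` lanes at a time.
        for x1 in range(0, dim, vec):
            src = table[idx * dim + x1 : idx * dim + x1 + vec]
            oth = other[x0 * dim + x1 : x0 * dim + x1 + vec]
            out[x0 * dim + x1 : x0 * dim + x1 + vec] = [a + b for a, b in zip(src, oth)]
    return out


# Tiny example: 2 embeddings of width 4, processed 2 lanes at a time.
table = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
other = [100.0] * 8
result = embedding_add([1, -2], table, other, num_embeddings=2, dim=4, vec=2)
```

The point of the sketch is that the indirect index is a per-row scalar, so only the contiguous inner dimension needs vector loads and stores.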

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler

pytorch-bot · 2023-11-19T09:42:57Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114062

✅ No Failures

As of commit 654d076 with merge base c77a4a4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

lezcano

When I first saw this, I figured the way I'd implement it is:

1. Have CppCSEVars track which loops they depend on (you already do this).
2. Generalise the Vec checker so that it returns a list of which loops can be vectorised.
3. Change the vectorised code to work on a "per iterloop" basis. You have already implemented this via that wrapper trick (quite neat).

I am struggling to find the equivalent of point 2 above, that is, where you figure out whether an iterloop can be vectorised.

lezcano · 2023-11-20T09:16:26Z

torch/_inductor/codegen/cpp.py

+                    V.kernel.cse.varname_map[s.name].relevant_itervars
+                )
+
+    def is_relevant(self, itervar: sympy.Symbol):


Perhaps you could call this method `depends_on`?

Thanks. Changed.

> I am struggling to find the equivalent of point 2 above, that is, where you figure out whether an iterloop can be vectorised.

Please check the function `select_tiling_indices`, which collects the itervars that are candidates for vectorization based on their contiguity. `CppVecKernelChecker` is then given these candidates and checks whether vectorizing them would cause problems.
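
The two-stage flow described in this reply can be sketched roughly as follows. This is an illustrative toy, not the real `select_tiling_indices` or `CppVecKernelChecker`, whose actual criteria may differ. The helper names `contiguous_candidates` and `indirect_index_is_invariant` are hypothetical, and each memory access is modelled as a simple map from variable to stride.

```python
from typing import Dict, List, Set

# Each memory access is modelled as an affine index: {variable: stride}.
Index = Dict[str, int]


def contiguous_candidates(accesses: List[Index], itervars: List[str]) -> List[str]:
    """Itervars with stride 1 in some access and stride 0 or 1 in every access."""
    picked = []
    for var in itervars:
        strides = [index.get(var, 0) for index in accesses]
        if 1 in strides and all(s in (0, 1) for s in strides):
            picked.append(var)
    return picked


def indirect_index_is_invariant(deps: Set[str], itervar: str) -> bool:
    """An indirect index may stay scalar if it does not depend on the tiled itervar."""
    return itervar not in deps


# Accesses of the embedding kernel in the PR description.
accesses = [
    {"x0": 1},               # in_ptr0[x0]
    {"x1": 1, "x0": 128},    # in_ptr2[x1 + 128*x0]
    {"x1": 1, "tmp3": 128},  # in_ptr1[x1 + 128*tmp3]
    {"x1": 1, "x0": 128},    # out_ptr0[x1 + 128*x0]
]
candidates = contiguous_candidates(accesses, ["x0", "x1"])
tmp3_deps = {"x0"}  # tmp3 is computed from in_ptr0[x0] only
vectorizable = [v for v in candidates if indirect_index_is_invariant(tmp3_deps, v)]
```

Under these assumptions only `x1` survives both stages: it is contiguous wherever it appears, and the indirect index `tmp3` does not depend on it.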

jgong5 · 2023-11-21T05:08:19Z

@pytorchbot merge

huydhn · 2023-11-21T09:19:29Z

@pytorchbot revert -m 'Sorry for reverting your change, please help fix lint and reland it https://hud.pytorch.org/pytorch/pytorch/commit/2c0474c02d3ac04a429504225d7f1a6536d3b9e6' -c landrace

pytorchmergebot · 2023-11-21T09:21:15Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot · 2023-11-21T09:21:25Z

@jgong5 your PR has been successfully reverted.

This reverts commit 2c0474c. Reverted #114062 on behalf of https://github.com/huydhn due to Sorry for reverting your change, please help fix lint and reland it https://hud.pytorch.org/pytorch/pytorch/commit/2c0474c02d3ac04a429504225d7f1a6536d3b9e6 ([comment](#114062 (comment)))

jgong5 · 2023-11-22T07:50:48Z

@pytorchbot merge

pytorchmergebot · 2023-11-22T07:52:30Z

Merge failed

Reason: PR 114062 is out of sync with the corresponding revision 387412d on branch origin/gh/jgong5/32/orig that would be merged into main. This usually happens because there is a non ghstack change in the PR. Please sync them and try again (ex. make the changes on origin/gh/jgong5/32/orig and run ghstack).

jgong5 · 2023-11-22T08:02:56Z

@pytorchbot merge

pytorchmergebot · 2023-11-22T08:04:43Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Jiong Gong added 2 commits November 19, 2023 17:41

[inductor cpp] vectorize embedding lookup

f50adbf

[ghstack-poisoned]

Update on "[inductor cpp] vectorize embedding lookup"

1bb8d14

github-actions bot added module: inductor ciflow/inductor labels Nov 19, 2023

pytorchbot added the open source label Nov 19, 2023

Update on "[inductor cpp] vectorize embedding lookup"

63bb3fd

Update on "[inductor cpp] vectorize embedding lookup"

34a3219

Update on "[inductor cpp] vectorize embedding lookup"

26fde27

Update on "[inductor cpp] vectorize embedding lookup"

4b0a784

jgong5 pushed a commit that referenced this pull request Nov 19, 2023

[inductor cpp] vectorize embedding lookup

b7e1e35

ghstack-source-id: 5c3dbf2 Pull Request resolved: #114062

jgong5 requested review from desertfire, jansel and lezcano and removed request for lezcano November 19, 2023 13:55

jgong5 added the topic: not user facing topic category label Nov 20, 2023

jgong5 pushed a commit that referenced this pull request Nov 20, 2023

[inductor cpp] vectorize embedding lookup

909f3a7

ghstack-source-id: d4d3152 Pull Request resolved: #114062

lezcano reviewed Nov 20, 2023

View reviewed changes

jansel approved these changes Nov 21, 2023

View reviewed changes

jgong5 pushed a commit that referenced this pull request Nov 21, 2023

[inductor cpp] vectorize embedding lookup

387412d

ghstack-source-id: 1fb18ab Pull Request resolved: #114062

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 21, 2023

pytorchmergebot added Merged and removed merging labels Nov 21, 2023

pytorchmergebot closed this in 2c0474c Nov 21, 2023

pytorchmergebot added the Reverted label Nov 21, 2023

pytorchmergebot reopened this Nov 21, 2023

Fix lint

c6a0c43

pytorchmergebot added the merging label Nov 22, 2023

pytorchmergebot removed the merging label Nov 22, 2023

Jiong Gong added 2 commits November 22, 2023 15:59

jgong5 pushed a commit that referenced this pull request Nov 22, 2023

[inductor cpp] vectorize embedding lookup

9ec944c

ghstack-source-id: 146818e Pull Request resolved: #114062

pytorchmergebot added the merging label Nov 22, 2023

pytorchmergebot removed the merging label Nov 22, 2023

pytorchmergebot closed this in a0e3321 Nov 22, 2023

facebook-github-bot deleted the gh/jgong5/32/head branch November 25, 2023 15:28

This was referenced Nov 30, 2023

[Inductor] Vectorize Embedding Lookup in CPP #93616

Closed

[inductor][cpu]basic_gnn_gin and basic_gnn_sage AMP performance regression #114879

Closed

kit1980 removed the Reverted label Dec 21, 2023
