[SYCL][OPT] Fix reorder optimization for Q4_0 by NeoZhangJianyu · Pull Request #13003 · ggml-org/llama.cpp · GitHub


@NeoZhangJianyu
Collaborator

Idea: change the rule for when the Q4_0 tensor reorder is called. Move it from the start of graph_compute() to the execution of the OPs.

  1. Fix the mismatch between the reordered tensor and the reorder OP, which led to wrong results in some LLMs.
    Tested with pythia-1.4b-Q4_0.gguf.

  2. Enable the reorder optimization by default, since the known issues are fixed.

  3. Remove an unused global variable.

  4. Fix the bug where tensors were not reordered in the second call to graph_compute() on the same context.
    It affected the UT results: some UT cases could not test the reorder feature.

Todo:

  • Support more cases of Q4_0.
  • Consider reordering the tensor when loading from GGUF. This depends on all Q4_0 cases being supported (first item).
  • Optimize other quantized data types such as Q4_K, Q5, ..Q8 with the same framework.

@github-actions github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) Apr 18, 2025
@NeoZhangJianyu NeoZhangJianyu requested a review from airMeng April 18, 2025 06:17
@qnixsynapse
Collaborator

I think one more TODO is to remove setting tensor->extra in ggml_backend_sycl_buffer_init_tensor and follow what slaren suggested.

if (tensor->type == GGML_TYPE_Q4_0) {
    ggml_tensor_extra_gpu * extra = new ggml_tensor_extra_gpu{};
    tensor->extra = extra;
    ctx->tensor_extras.push_back(extra); //used to release it when destroy ctx.
}

@Rbiessy Rbiessy self-requested a review April 18, 2025 07:53
@Rbiessy
Collaborator

Rbiessy commented Apr 18, 2025

Thanks for the PR, we'll have a look! Please make sure to keep this PR in review until we have time to review it.

@Rbiessy
Collaborator

Rbiessy commented Apr 18, 2025

I think one more TODO is to remove setting tensor->extra in ggml_backend_sycl_buffer_init_tensor and follow what slaren suggested.

I agree however the suggested solution to follow the logic from ggml-cpu-aarch64 also sets the extra field in the init_tensor function:

static enum ggml_status ggml_backend_cpu_aarch64_buffer_init_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor) {
    tensor->extra = (void *) const_cast<ggml::cpu::tensor_traits *>(ggml_aarch64_get_optimal_repack_type(tensor));
    GGML_UNUSED(buffer);
    return GGML_STATUS_SUCCESS;
}

It's not clear to me how this could be avoided at this stage.

@NeoZhangJianyu
Collaborator Author

Yes, I'll wait for all of your reviews.

Yes, I have added slaren's suggestion to the Todo list: consider reordering the tensor when loading from GGUF. It depends on all Q4_0 cases being supported (first item).

@NeoZhangJianyu
Collaborator Author

I think one more TODO is to remove setting tensor->extra in ggml_backend_sycl_buffer_init_tensor and follow what slaren suggested.

if (tensor->type == GGML_TYPE_Q4_0) {
    ggml_tensor_extra_gpu * extra = new ggml_tensor_extra_gpu{};
    tensor->extra = extra;
    ctx->tensor_extras.push_back(extra); //used to release it when destroy ctx.
}

Yes. I think this solution depends on reorder being supported for all Q4_0 cases.

@NeoZhangJianyu
Collaborator Author

Please test the PR with your Q4_0 LLMs.
I believe no Q4_0 LLM should be blocked by wrong results anymore.

@slaren
Member

slaren commented Apr 18, 2025

I agree however the suggested solution to follow the logic from ggml-cpu-aarch64 also sets the extra field in the init_tensor function
It's not clear to me how this could be avoided at this stage.

The CPU backend uses extras for simplicity, but if there is no extra data that needs to be stored per-tensor, you can rely on the buffer type alone to determine if the tensor data is reordered.
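
A minimal sketch of that idea, using mock types rather than the real ggml structures. `mock_buffer_type`, `mock_buffer`, `mock_tensor` and `is_reordered()` are hypothetical names, and the sketch assumes that every tensor allocated from a dedicated buffer type has its data stored in the reordered layout:

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; these are NOT the ggml types.
struct mock_buffer_type {
    const char * name;
    bool         stores_reordered_q4_0;  // property of the buffer type, not of a tensor
};

struct mock_buffer {
    const mock_buffer_type * buft;
};

struct mock_tensor {
    const mock_buffer * buffer;
    void *              extra;  // stays unused (nullptr) in this scheme
};

// The layout is derived from the buffer type alone, so no per-tensor extra is needed.
inline bool is_reordered(const mock_tensor & t) {
    assert(t.buffer != nullptr && t.buffer->buft != nullptr);
    return t.buffer->buft->stores_reordered_q4_0;
}
```

With this shape, a kernel dispatcher would only need to ask `is_reordered(tensor)` and nothing has to be allocated or freed per tensor.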

Contributor

@ShanoToni ShanoToni left a comment


Apart from the performance concerns @Rbiessy already mentioned, this resolves my concerns about running reorders for all relevant cases. (link to discussion for convenience: #5277 (reply in thread))

@NeoZhangJianyu
Collaborator Author

@Rbiessy
Thank you for your tests and feedback!

  • There seem to be very little benefit to enable the reorder optimization by default in the text generation and it degrades the performance of the prompt processing phase.

A: Yes.
The reorder happens in the first mul_mat() OP, which affects PP performance.
This solution can't balance the two stages.
That is a general issue when optimizing LLMs.
We may need another method to optimize PP.

  • The fact that reorder_qw requires a temporary buffer and would now run during the execution of the model increases the memory usage which can be an issue.

A: The temporary buffer is released once the reorder finishes. Its size is the same as that of the Q4 tensor being reordered.
So it won't hold extra memory for long.
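
To illustrate why the temporary buffer is short-lived, here is a host-side sketch. It assumes a Q4_0 block made of one 16-bit scale followed by 16 bytes of packed 4-bit quants, and it assumes the reordered layout places all quants first and all scales after them; the thread does not spell out either detail. `reorder_q4_0_host` is a hypothetical name, not the backend's reorder_qw:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed Q4_0 block: one fp16 scale (kept as raw bits) + 32 quants packed in 16 bytes.
constexpr std::size_t QK       = 32;
constexpr std::size_t QS_SZ    = QK / 2;
constexpr std::size_t D_SZ     = sizeof(std::uint16_t);
constexpr std::size_t BLOCK_SZ = D_SZ + QS_SZ;  // 18 bytes per block

// Hypothetical host-side reorder: [d0 qs0][d1 qs1]... -> [qs0 qs1 ...][d0 d1 ...]
// Returns the size in bytes of the temporary copy that was needed.
inline std::size_t reorder_q4_0_host(std::uint8_t * data, std::size_t nblocks) {
    assert(data != nullptr || nblocks == 0);
    const std::size_t total = nblocks * BLOCK_SZ;
    std::vector<std::uint8_t> tmp(data, data + total);  // same size as the tensor data

    std::uint8_t * qs_out = data;                    // quants region comes first
    std::uint8_t * d_out  = data + nblocks * QS_SZ;  // scales region follows
    for (std::size_t i = 0; i < nblocks; ++i) {
        const std::uint8_t * blk = tmp.data() + i * BLOCK_SZ;
        std::memcpy(d_out  + i * D_SZ,  blk,        D_SZ);
        std::memcpy(qs_out + i * QS_SZ, blk + D_SZ, QS_SZ);
    }
    return tmp.size();
}   // tmp is freed here, so the extra memory exists only during the reorder
```

The peak footprint is therefore twice the tensor size for the duration of one reorder, and it drops back as soon as the function returns.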

@NeoZhangJianyu
Collaborator Author

  • The reorder does seem beneficial here. I would still suggest to not enable GGML_SYCL_DISABLE_OPT=0 by default until we can solve the 2 issues above.

A: TG is more important than PP in customer cases.
In the Qwen2 1.5B case above, PP is about 880 t/s and TG is about 40 t/s.
PP drops by about 8% and TG rises by about 5%.
In a pipeline, TG takes most of the time and PP is short, so the +5% in TG yields a clearly larger absolute gain than the -8% in PP costs.

For bigger LLMs (like llama2-7B) on dGPUs (Arc, BMG/PVC), this feature increases TG by 20%-70%.

A feature that improves both at the same time is, in fact, very hard to build.
On balance, I think the benefit of this feature outweighs its side effects.
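
The trade-off can be checked with a little arithmetic. The sketch below uses the rates quoted in this comment (PP about 880 t/s, TG about 40 t/s, -8% and +5%). The token counts of 512 prompt tokens and 128 generated tokens are an assumption borrowed from the pp512/tg128 benchmark names, and `total_seconds` is a hypothetical helper:

```cpp
#include <cassert>

// Hypothetical helper: wall time for one request with a prompt phase and a generation phase.
inline double total_seconds(double pp_tokens, double pp_rate, double tg_tokens, double tg_rate) {
    assert(pp_rate > 0.0 && tg_rate > 0.0);
    return pp_tokens / pp_rate + tg_tokens / tg_rate;
}

// Rates are the ones quoted in the comment; token counts are assumed for illustration.
inline double seconds_without_reorder() {
    return total_seconds(512, 880.0, 128, 40.0);
}

inline double seconds_with_reorder() {
    return total_seconds(512, 880.0 * 0.92, 128, 40.0 * 1.05);
}
```

Under these assumptions the request takes roughly 3.78 s without the reorder and roughly 3.68 s with it, because generation dominates the total. With the same rates, the balance tips the other way once the prompt is more than about 12 times longer than the generated text, which is the long-context concern raised by the reviewers.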

@NeoZhangJianyu
Collaborator Author

The first reorder PR led to wrong results for some Q4_0 LLMs,
so we disabled it by default.

This PR fixes the issue and doesn't affect the results of any Q4_0 LLM.
It increases TG performance to a greater or lesser degree.
The PP degradation is acceptable in customer cases compared with the TG benefit.

Normal users will miss this feature if they don't read the guide.
Users like OOB features.
I suggest enabling this feature by default.
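
For readers unfamiliar with the switch being discussed, the following sketch shows how an integer environment variable such as GGML_SYCL_DISABLE_OPT could be read with a default. `parse_int_or`, `env_int_or` and `reorder_opt_enabled` are hypothetical helpers, not the SYCL backend's actual code, and the sketch assumes that an unset variable or a value of 0 leaves the optimization enabled:

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: parse a decimal integer, falling back to def on null, empty or garbage input.
inline int parse_int_or(const char * text, int def) {
    if (text == nullptr || *text == '\0') return def;
    char * end = nullptr;
    const long v = std::strtol(text, &end, 10);
    return (*end == '\0') ? static_cast<int>(v) : def;
}

// Hypothetical helper: read an integer environment variable with a default.
inline int env_int_or(const char * name, int def) {
    assert(name != nullptr);
    return parse_int_or(std::getenv(name), def);
}

// Assumption: the optimization is on unless the variable is set to a non-zero value.
inline bool reorder_opt_enabled() {
    return env_int_or("GGML_SYCL_DISABLE_OPT", 0) == 0;
}
```

Making 0 the default is what "enabled by default" means in this discussion: users who never touch the variable get the optimization, and those who hit a problem can still opt out.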

Contributor

@Alcpz Alcpz left a comment


For normal user, this feature will be ignored if they don't ready the guide. User like OOB feature. I suggest enabling this feature as default.

I agree with this. The only reason we disabled the reorder by default was because it broke some user models. Even if we lose quite a bit of performance in prompt processing, from the user perspective, it feels better when using llama-cli.

However, I don't think this implementation should be final. We are most likely increasing other metrics that we are not really measuring, like the time to first token which also affects the "user" perspective, which relates to @Rbiessy 's concerns.

@NeoZhangJianyu can you explain further what the issue was with the reordered tensor and reorder OP not matching? A Q4_0 tensor being used in a different operator?

@qnixsynapse
Collaborator

The degrade of PP could be accepted in customer cases compare to the benefit of TG.

IMO, both PP and TG are required for inference. Bad PP performance will result in a bad experience, especially with long-context LLMs.

My suggestion is to disable the reorder opt by default until we find a solution to fix PP. I agree with @Rbiessy here.

@Alcpz
Contributor

Alcpz commented Apr 22, 2025

The numbers I have observed are not "bad" PP, but slightly worse than what we had. Less powerful systems are gonna notice more, but just for the first prompt that is processed and on benchmarks.
I don't expect a user to run an LLM just to work with a single prompt though, and applications (llama-cli for example) normally have a warm-up run when the application is loading (where the bottleneck is loading the model), so I don't mind the trade-off.

It would, however, have an impact from starting the application to actually starting to run things, so if @Rbiessy and @qnixsynapse disagree and notice the performance impact I wouldn't push for the merge.

@Rbiessy
Collaborator

Rbiessy commented Apr 22, 2025

Because the reorder process happen in the first mul_mat() OP, that will impact the PP performance.
And this solution can't make balance in the two stages.
It's general issue to optimize the LLM.
Maybe need another method to optimize PP.

Are we sure the performance drop in PP is due to the first call to mul_mat? There is a warmup before running the benchmark so I don't see how it could affect that. It seems to me more likely that the issue is the mul_mat implementation using the reordered Q4_0 format is not as optimized for some sizes. How about we only use the reorder Q4_0 format for dequantize_mul_mat_vec as this seems to always improve performance? Could you confirm this @NeoZhangJianyu ?

I agree the extra memory usage should be fine since this is just for the first run.

I'd suggest in this PR we either don't enable Q4_0 by default or we disable the reorder optimization for the mul_mat case.

@NeoZhangJianyu
Collaborator Author

For normal user, this feature will be ignored if they don't ready the guide. User like OOB feature. I suggest enabling this feature as default.

I agree with this. The only reason we disabled the reorder by default was because it broke some user models. Even if we lose quite a bit of performance in prompt processing, from the user perspective, it feels better when using llama-cli.

However, I don't think this implementation should be final. We are most likely increasing other metrics that we are not really measuring, like the time to first token which also affects the "user" perspective, which relates to @Rbiessy 's concerns.

@NeoZhangJianyu can you explain further what was the issue was with reordered tensor and reorder OP not matching? A Q4_0 tensor using in a different operator?

In the previous solution, the Q4_0 tensors were reordered by walking all nodes in the model. Then mul_mat_reorder (for example) was executed inside the mul_mat() function under a condition.

Because mul_mat_reorder() can't support every src0 and src1 combination, we can't reorder all Q4_0 tensors.
We must pick only the Q4_0 tensors that mul_mat_reorder() supports.

The condition for reordering a tensor should be the same as the condition for executing mul_mat_reorder() in mul_mat().
But the condition code couldn't be shared between those two steps.
If the conditions differ, a reordered tensor may end up not being handled by mul_mat_reorder(). That led to wrong mul_mat() results in some cases.

In this PR, I removed the tensor reorder from the separate function.
The tensor is now reordered right before it is handled by mul_mat_reorder().
Both execute in the same code branch.
That ensures the reordered tensor is handled by mul_mat_reorder().

Currently, mul_mat_reorder() is implemented in two legacy functions.
This solution allows more functions to be enhanced for reorder.
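
The fix described above can be sketched with mock types. Nothing here is the real backend code: `mock_tensor`, `reorder_supported`, `kernel_path` and `mul_mat_dispatch` are hypothetical names, and the flag assignment stands in for the actual data reorder. The point is only that one predicate guards both the reorder and the choice of the reorder-aware kernel:

```cpp
#include <cassert>
#include <cstdint>

// Mock types for illustration only; these are NOT the ggml or SYCL backend structures.
enum class mock_type { F32, Q4_0 };

struct mock_tensor {
    mock_type    type;
    std::int64_t ne[4];
    bool         reordered;  // tracks the current data layout
};

// Single predicate used for both decisions, so they cannot drift apart.
inline bool reorder_supported(const mock_tensor & src0, const mock_tensor & src1) {
    return src0.type == mock_type::Q4_0 && src1.ne[2] == 1 && src1.ne[3] == 1;
}

enum class kernel_path { plain, reorder };

// Hypothetical mul_mat dispatch: the reorder happens in the same branch that
// selects the reorder-aware kernel, right before the tensor is used.
inline kernel_path mul_mat_dispatch(mock_tensor & src0, const mock_tensor & src1) {
    if (reorder_supported(src0, src1)) {
        if (!src0.reordered) {
            src0.reordered = true;  // stands in for the one-time data reorder
        }
        assert(src0.reordered);     // the reorder kernel only ever sees reordered data
        return kernel_path::reorder;
    }
    return kernel_path::plain;
}
```

Because the check runs on every dispatch instead of once at the start of graph_compute(), a tensor that first appears in a later graph_compute() call on the same context is still reordered before use.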

@NeoZhangJianyu
Collaborator Author

Because the reorder process happen in the first mul_mat() OP, that will impact the PP performance.
And this solution can't make balance in the two stages.
It's general issue to optimize the LLM.
Maybe need another method to optimize PP.

Are we sure the performance drop in PP is due to the first call to mul_mat? There is a warmup before running the benchmark so I don't see how it could affect that. It seems to me more likely that the issue is the mul_mat implementation using the reordered Q4_0 format is not as optimized for some sizes. How about we only use the reorder Q4_0 format for dequantize_mul_mat_vec as this seems to always improve performance? Could you confirm this @NeoZhangJianyu ?

I agree the extra memory usage should be fine since this is just for the first run.

I'd suggest in this PR we either don't enable Q4_0 by default or we disable the reorder optimization for the mul_mat case.

dequantize_mul_mat_vec() is the performance bottleneck in common LLMs such as llama2.
Optimizing this function gives better performance because it is called more often than the other sub_mul_mat() functions for the Q4_0 type.
Because most LLMs are based on the llama family structure, this optimization works well for most of them.

With this PR, the wrong mul_mat() result for Q4_0 is fixed.
I think this feature should be enabled by default,
so that normal users can enjoy the good performance of the SYCL backend.

Otherwise, users will turn to other backends, since those enable all optimizations by default.
Users get good results out of the box with other backends.

@Alcpz
Contributor

Alcpz commented Apr 23, 2025

Thanks a lot for the explanation

@Rbiessy
Collaborator

Rbiessy commented Apr 23, 2025

dequantize_mul_mat_vec() is the bottleneck of performance in common LLM, like llama2. We optimize this function will get better performance because this function is called more times than other sub_mul_mat() for Q4_0 type. Because most LLMs are based on the structure of llama family, this optimization works well for most of them.

By this PR, the wrong result of mul_mat() for Q4_0 is fixed. I think this feature should be opened as default. So that normal user can enjoy good performance of SYCL backend.

Otherwise, user will turn to Other backend since all optimizations are enabled as default in other backend. User can get the good result directly in other backend.

@NeoZhangJianyu Yes I understand that, you have not answered my question above which is: How about we only use the reorder Q4_0 format for dequantize_mul_mat_vec as this seems to always improve performance? I am not convinced that the reorder layout should be used for matrix multiplication (i.e. when it is used inside ggml_sycl_op_mul_mat_sycl here). Can you confirm whether this is needed to optimize text generation? I suspect this is slowing down prompt processing and may not be needed for text generation.

@NeoZhangJianyu
Collaborator Author

dequantize_mul_mat_vec() is the bottleneck of performance in common LLM, like llama2. We optimize this function will get better performance because this function is called more times than other sub_mul_mat() for Q4_0 type. Because most LLMs are based on the structure of llama family, this optimization works well for most of them.
By this PR, the wrong result of mul_mat() for Q4_0 is fixed. I think this feature should be opened as default. So that normal user can enjoy good performance of SYCL backend.
Otherwise, user will turn to Other backend since all optimizations are enabled as default in other backend. User can get the good result directly in other backend.

@NeoZhangJianyu Yes I understand that, you have not answered my question above which is: How about we only use the reorder Q4_0 format for dequantize_mul_mat_vec as this seems to always improve performance? I am not convinced that the reorder layout should be used for matrix multiplication (i.e. when it is used inside ggml_sycl_op_mul_mat_sycl here). Can you confirm whether this is needed to optimize text generation? I suspect this is slowing down prompt processing and may not be needed for text generation.

The reorder method could work for other kernel functions like dequantize_mul_mat_vec() in ggml_sycl_op_mul_mat_sycl().

As the test results show, it reduces PP.
I checked the OPs of PP and TG and found no performance reduction in them.
I think the drop is due to the processing that reorders the tensors.

But that only happens once. The next PP won't be affected.
The current test method only tests 1 PP per LLM.
If a case has n PP + n TG, the first PP will be slower, but the next n-1 PP shouldn't be affected.

@Rbiessy
Collaborator

Rbiessy commented Apr 24, 2025

I looked into my suggestion myself, measuring the impact of disabling the reorder optimization for matrix-matrix multiplications only with the patch below:

diff --git i/ggml/src/ggml-sycl/ggml-sycl.cpp w/ggml/src/ggml-sycl/ggml-sycl.cpp
index 22927338b..2c7aac628 100644
--- i/ggml/src/ggml-sycl/ggml-sycl.cpp
+++ w/ggml/src/ggml-sycl/ggml-sycl.cpp
@@ -2909,7 +2909,7 @@ static void opt_for_reorder(ggml_backend_sycl_context * ctx, const ggml_tensor *
         ctx->opt_feature.reorder &&      //allow this device due to good perf, skip the devices with bad perf.
         dst->op == GGML_OP_MUL_MAT &&    //limit to some supported cases of Q4_0, to do for more cases.
         src0->type == GGML_TYPE_Q4_0 &&
-        src1->ne[2]==1 && src1->ne[3]==1) {
+        src1->ne[1] == 1 && src1->ne[2]==1 && src1->ne[3]==1) {

         ggml_tensor_extra_gpu* extra = (ggml_tensor_extra_gpu*)src0->extra;
         if (!extra) return; //only happen in CI/UT permute case.

This did not impact PP results. Also for some reason I am not able to reproduce the performance regression on PP that I mentioned in #13003 (comment) anymore. I use 100 iterations for this benchmark but it does not seem enough. For reference the results running on B580 again with the same changes that should match with the third table from #13003 (comment):

| model                          |       size |     params | backend    | ngl |    sm | mmap |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ----: | ---: | ------------: | -------------------: |
| qwen2 1.5B Q4_0                | 1013.62 MiB |     1.78 B | SYCL       |  99 |  none |    0 |         pp512 |      7822.20 ± 29.03 |
| qwen2 1.5B Q4_0                | 1013.62 MiB |     1.78 B | SYCL       |  99 |  none |    0 |         tg128 |         99.95 ± 1.91 |

Anyway I'm happy with these changes, thanks for the patch.
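
The one-line change in the patch above restricts the optimization to matrix-vector products. A small sketch of the two conditions, using a plain array in place of src1->ne. `reorder_cond_before` and `reorder_cond_after` are hypothetical names, and reading ne[1] as the number of src1 columns (tokens in the batch) is an assumption about how the shapes are laid out:

```cpp
#include <cassert>
#include <cstdint>

// Condition before the patch: any 2D src1 (matrix or vector) qualifies.
inline bool reorder_cond_before(const std::int64_t (&ne)[4]) {
    return ne[2] == 1 && ne[3] == 1;
}

// Condition after the patch: only a single-column src1 (matrix-vector) qualifies.
inline bool reorder_cond_after(const std::int64_t (&ne)[4]) {
    return ne[1] == 1 && ne[2] == 1 && ne[3] == 1;
}
```

Under that reading, a single-token generation step passes both conditions, while a 512-token prompt batch passes only the original one, so the patch keeps the reordered path out of matrix-matrix multiplications.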

@NeoZhangJianyu
Collaborator Author


@Rbiessy
The condition src1->ne[1] == 1 && src1->ne[2]==1 && src1->ne[3]==1 doesn't affect PP, but it makes the TG performance gain smaller.

What do you think of this code?

@Rbiessy
Collaborator

Rbiessy commented Apr 25, 2025

@Rbiessy src1->ne[1] == 1 && src1->ne[2]==1 && src1->ne[3]==1) don't impact PP, but make the performance increase of TG become smaller.

How you think this code?

@NeoZhangJianyu in the models I have been running I have not found cases where matrix-matrix multiplications are a bottleneck for TG, only matrix-vector multiplications. This is why I wanted us to try only enabling the reorder optimization for matrix-vector multiplication. I was worried the reorder optimization with matrix-matrix multiplications could somehow perform worse in some cases (and not just during the first iteration).

As I said this looks fine to me now so feel free to merge the PR if you think it is ready.

@NeoZhangJianyu NeoZhangJianyu merged commit 514c456 into ggml-org:master Apr 25, 2025
48 of 51 checks passed
@NeoZhangJianyu
Collaborator Author

Thank you all for your support!


7 participants