- 
                Notifications
    You must be signed in to change notification settings 
- Fork 155
Tool calls support from mainline #723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
        
          
                examples/server/server.cpp
              
                Outdated
          
        
      | svr->Post("/completion", handle_completions); // legacy | ||
| svr->Post("/completions", handle_completions); | ||
| svr->Post("/v1/completions", handle_completions); | ||
| svr->Post("/completions", handle_completions_oai); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is "/completions" being changed to call the OAI compatible method handle_completions_oai?
I just checked mainline and isn't done there, is there any reason for this change as I don't really think it makes sense.
Also on a related topic, the handle_completions, I haven't gone through all of the code in this PR or tested the PR yet but based on my skimming of it so far, the code looks like it might pull in a change which on mainline ended up changing default behavior for the non-OAI endpoint in a way that I don't really agree with because it changes things for downstream apps that result in worse performance due to different outputs
See this comment: ggml-org/llama.cpp#10783 (comment)
You can see the downstream impacts it has due to:
lmg-anon/mikupad#104 and lmg-anon/mikupad#110
If this does make that change (feel free to tell me if it doesn't) I do feel like "post_sampling_probs" should default true, this way downstream applications can opt in to the new behavior but are not impacted otherwise (and mikupad isn't the only downstream app affected, but it was the only one I know impacted users noticed and said something).
Sorry if what I said is not relevant (I wanted to comment this now, since testing may take a while and not sure if I can do it tonight)
Edit: Just saw the tests in tools/server/tests/unit/test_completion.py and if this test case is true, then it confirms what I was asking about above.
Being able to have access to the logprobs if wanted via the non OAI endpoint is nice, but I don't think introducing that functionality should change the default behavior.
I know it is complicated in that mainline made this change so there have been two separate expectations. They also changed the output stream at the same time, which made the break obvious for affected users as it changed what you received when you also asked for n_probs on that endpoint. Not only that but it put a performance cost on an API parameter (n_probs) that did not have one before as the default (it could be removed if you pass post_sampling_probs as true, but that is a parameter that did not exist before).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change it back. Didn't realize it will break things.
| Thanks for looking into this, I'm not exactly sure how to use it, but seems we need something like this to support enable/disable thinking for DeepSeek-V3.1 given how it handles that in the chat template: https://huggingface.co/deepseek-ai/DeepSeek-V3.1#chat-template So as I understand it,  One thing I did test which works on mainline but was not working here is this part: 
 It seems like      --jinja \
    --chat-template-kwargs '{"thinking": true}' \That said, I still don't know how to get mainline to haha, but that is for me to figure out a proper OAI client "prompt" or whatever data it wants. Anyway, I'd love to see a working example of how to enable thinking for DeepSeek-V3.1 with this PR. | 
| You are correct about api usage. I forgot to add --chat-template-kwargs for the command line args. Need to add it. Maybe try --reasoning-budget 0 and see if it works. This disables thinking. | 
| --chat-template-kwargs is added | 
| def test_n_probs(): | ||
| global server | ||
| server.start() | ||
| res = server.make_request("POST", "/completion", data={ | ||
| "prompt": "I believe the meaning of life is", | ||
| "n_probs": 10, | ||
| "temperature": 0.0, | ||
| "n_predict": 5, | ||
| }) | ||
| assert res.status_code == 200 | ||
| assert "completion_probabilities" in res.body | ||
| assert len(res.body["completion_probabilities"]) == 5 | ||
| for tok in res.body["completion_probabilities"]: | ||
| assert "id" in tok and tok["id"] > 0 | ||
| assert "token" in tok and type(tok["token"]) == str | ||
| assert "logprob" in tok and tok["logprob"] <= 0.0 | ||
| assert "bytes" in tok and type(tok["bytes"]) == list | ||
| assert len(tok["top_logprobs"]) == 10 | ||
| for prob in tok["top_logprobs"]: | ||
| assert "id" in prob and prob["id"] > 0 | ||
| assert "token" in prob and type(prob["token"]) == str | ||
| assert "logprob" in prob and prob["logprob"] <= 0.0 | ||
| assert "bytes" in prob and type(prob["bytes"]) == list | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the specific test I was referring to before.
Assuming this test is accurate and passes this shows that the new default behavior is now to return top_logprobs which not only is different, but also as mentioned in my other comment based on what people have said comes at a performance penalty.
I just tested the current (without this PR) default behavior for n_probs (both in and outside of a stream).
It returned probs which is different from top_probs (which is what another test case says it will now return if you pass post_sampling_probs to be true), or top_logprobs (which is what this test case shows is the new default).
I don't see a benefit to this breaking change, n_probs returning probs by default makes sense to me, now allowing it to return logprobs by sending post_sampling_probs with false is also nice and is also not a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still trying to understand your issue here. Does this logprob issue occur when you use /v1/completions which has been changed to open ai compatible completions, or /completion and /completions. I actually removed probabilities in old completion end point. I can add them back if that is what you want.
For open ai compatible completions, mainline defaults to false for post_sampling_probs (can be set by client), but it is undefined here. If I default it to true and allow it to be set by client, will it solve the performance issue with logprob?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually removed probabilities in old completion end point. I can add them back if that is what you want.
I would appreciate that.
Edit: No you didn't. I actually tested it things work. That does mean this test file is wrong, so either should be updated or removed.
For open ai compatible completions, mainline defaults to false for post_sampling_probs (can be set by client), but it is undefined here. If I default it to true and allow it to be set by client, will it solve the performance issue with logprob?
Following whatever the openAI spec is for those endpoints makes most sense (as they are being provided as OAI compatible), but I was referring to the two non OAI endpoints to not make breaking changes there which this test says changed.
Still trying to understand your issue here.
This test file is wrong, so either should be updated or removed.
| It looks like this PR is breaking usage/timings again. but I'm not sure, needs to be tested | 
| What is the status here? Are all disagreements resolved? | 
| 
 The test files included in this PR would not pass when run (in this case it is the test that is wrong as this branch handles things the old way still). It does look like there may be a regression in usage/timings in some places. I tested more and found another regression with token streaming off as with this PR the output does not contain any of the predicted tokens and also it is missing timings and usage. (I checked the KV cache with the new endpoint and was able to see it generated, just not output). | 
| The new commit should fix all of the regression. I didn't look at the code in the tests. | 
| res.oaicompat_model = slot.oaicompat_model; | ||
| res.data = json { | ||
| {"content", !slot.params.stream ? slot.generated_text : ""}, | ||
| {"generated_text", slot.generated_text}, // Always include full text for finish_reason logic | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove "generated_text" from the final response? and "oaicompat_msg_diffs"... from /v1/completions endpoint? Some people may already be using this in their applications... In addition "usage" allows you to display statistics in "llama-swap" for this endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant old /v1/completions, which is now available at /completions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usage is already added unless you need it for /completions. generated_text is added back. oaicompat_msg_diffs is added in prior tool call PR, so should be fine. Just in case, we can add a message stating breaking changes and let people change their API when needed, or fix any regression in future PRs.
| 
 I haven't tested the code you just pushed in yet, but I see the tests are still in the PR. Merging in tests that do not reflect what is being merged in is just confusing. Can you either remove (or update the tests)? | 
| They are removed. | 
| Regarding GPT-OSS response format support, when using the main branch or this PR the thinking part is not parsed (the <|channel|>analysis<|message|> part is shown as is in clients like openwebui or aider instead of as a thinking block). This does not happen in llama.cpp where the thinking block appears as expected. | 
| I'm using version: I am using the --jinja flag, the command is:. The same command on main raises the "template not yet supported" warning and falls back to ChatML producing incorrect responses. Model is the one from ggml-org (SHA256=be37a636aca0fc1aae0d32325f82f6b4d21495f06823b5fbc1898ae0303e9935).  | 
| @ikawrakow The response from gpt-oss 20b is missing <|end|> under main, which makes reasoning parsing incorrect. but with mainline, I got: Any idea? | 
| @firecoperana Does #740 fix it? | 
| 
 Yes, now it's good. I will include this in my PR. | 
| With #740, tool calls also works properly with GPT-OSS. To move thinking process in separate field like mainline, use  | 
| Seems <think> was missing when running deepseek v3.1 on open-webui in this branch ( have </think> at the end of COT but no <think> at the beginning) The model run with | 
| Can you add --reasoning-format auto and compare the result with mainline? I don't have deepseek v3.1 to test myself. | 
| 
 I'm having this problem, too. I can turn on reasoning in DS v3.1 with --jinja --chat-template-kwargs '{"thinking": true}' parameters, but the tag <think> doesn't appears in the beginning of the COT (just the </think> at the end). I'm using ubergarm/DeepSeek-V3.1-smol-IQ4_KSS model. | 
| 
 Found this. ggml-org/llama.cpp#15496 | 
| 
 OK! I've just tried to run the mainline (it was a long time that I don't use it.) but it doesn't support this ubergarm model. 👍 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are no further objections I'll merge this tomorrow morning (my time).
| 
 I don't see any regressions anymore so no objection from me. | 
| Haven't checked if included here but today this fix was merged into mainline (fixes a bug with enable_thinking kwarg logic) | 
| 
 Deepseek v3.1 tool calling just merged in mainline. Will include the above PR when porting it. | 
| PR with Deepseek v3.1 tool calling @arichiardi Thinking model disabled assistant prefill is included in the above PR. | 
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783. That PR and friends were partially cherry-picked by ikawrakow#723, but wasn't really in a working state yet. A couple of additional changes: * Include timing information in response, which was (unintentionally?) done in mainline since ggml-org/llama.cpp#10643. * Also return the actual logprobs for accepted draft tokens. This is still a TODO in mainline [1]. Note that there is a TG performance penalty to return the logprobs. Here are some numbers I got with Qwen2.5-Coder-32B-Instruct: * no draft, no logprobs: 12.81 tok/s * no draft, with logprobs: 12.02 tok/s (6.2% drop) * with draft, no logprobs: 36.59 tok/s * with draft, with logprobs: 29.08 tok/s (20.5% drop) [1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus optimization to do partial sort when sorting the logits. That mainline PR and friends were partially cherry-picked by ikawrakow#723, but wasn't really in a working state yet. A couple of additional changes: * Include timing information in response, which was (unintentionally?) done in mainline since ggml-org/llama.cpp#10643. * Also return the actual logprobs for accepted draft tokens. This is still a TODO in mainline [1]. Note that there is a TG performance penalty to return the logprobs, as we need to sort the logits. By doing partial sort, the penalty is quite small. Here are some numbers I got using the same prompt: This PR with partial sort: * no draft, no logprobs: 12.87 tok/s * no draft, with logprobs: 12.61 tok/s (2.0% drop) * with draft, no logprobs: 36.74 tok/s * with draft, with logprobs: 36.12 tok/s (1.7% drop) If cherry-pick the full sort from mainline PR: * no draft, no logprobs: 12.81 tok/s * no draft, with logprobs: 12.02 tok/s (6.2% drop) * with draft, no logprobs: 36.59 tok/s * with draft, with logprobs: 29.08 tok/s (20.5% drop) [1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus optimization to do partial sort when sorting the logits. That mainline PR and friends were partially cherry-picked by ikawrakow#723, but wasn't really in a working state yet. A couple of additional changes: * Include timing information in response, which was (unintentionally?) done in mainline since ggml-org/llama.cpp#10643. * Also return the actual logprobs for accepted draft tokens. This is still a TODO in mainline [1]. Note that there is a TG performance penalty to return the logprobs, as we need to sort the logits. By doing partial sort, the penalty is quite small. Here are some numbers I got using the same prompt: This PR with partial sort: * no draft, no logprobs: 12.87 tok/s * no draft, with logprobs: 12.61 tok/s (2.0% drop) * with draft, no logprobs: 36.74 tok/s * with draft, with logprobs: 36.12 tok/s (1.7% drop) If cherry-pick the full sort from mainline PR: * no draft, no logprobs: 12.81 tok/s * no draft, with logprobs: 12.02 tok/s (6.2% drop) * with draft, no logprobs: 36.59 tok/s * with draft, with logprobs: 29.08 tok/s (20.5% drop) [1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus optimization to do partial sort when sorting the logits. That mainline PR and friends were partially cherry-picked by ikawrakow#723, but wasn't really in a working state yet. A couple of additional changes: * Include timing information in response, which was (unintentionally?) done in mainline since ggml-org/llama.cpp#10643. * Also return the actual logprobs for accepted draft tokens. This is still a TODO in mainline [1]. Note that there is a TG performance penalty to return the logprobs, as we need to sort the logits. By doing partial sort, the penalty is quite small. Here are some numbers I got using the same prompt: This PR with partial sort: * no draft, no logprobs: 12.87 tok/s * no draft, with logprobs: 12.61 tok/s (2.0% drop) * with draft, no logprobs: 36.74 tok/s * with draft, with logprobs: 36.12 tok/s (1.7% drop) If cherry-pick the full sort from mainline PR: * no draft, no logprobs: 12.81 tok/s * no draft, with logprobs: 12.02 tok/s (6.2% drop) * with draft, no logprobs: 36.59 tok/s * with draft, with logprobs: 29.08 tok/s (20.5% drop) [1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus optimization to do partial sort when sorting the logits. That mainline PR and friends were partially cherry-picked by #723, but wasn't really in a working state yet. A couple of additional changes: * Include timing information in response, which was (unintentionally?) done in mainline since ggml-org/llama.cpp#10643. * Also return the actual logprobs for accepted draft tokens. This is still a TODO in mainline [1]. Note that there is a TG performance penalty to return the logprobs, as we need to sort the logits. By doing partial sort, the penalty is quite small. Here are some numbers I got using the same prompt: This PR with partial sort: * no draft, no logprobs: 12.87 tok/s * no draft, with logprobs: 12.61 tok/s (2.0% drop) * with draft, no logprobs: 36.74 tok/s * with draft, with logprobs: 36.12 tok/s (1.7% drop) If cherry-pick the full sort from mainline PR: * no draft, no logprobs: 12.81 tok/s * no draft, with logprobs: 12.02 tok/s (6.2% drop) * with draft, no logprobs: 36.59 tok/s * with draft, with logprobs: 29.08 tok/s (20.5% drop) [1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This is kind of a startover of the existing tool calls support, which supports wider range of the models. The reason for this PR is that each model tends to need different templates for tool calls and the implementation is very easy to have bugs. This PR uses mainline's approach to handle tools calls, which makes it easy to support new models and fix existing/future bugs. It includes mainline's tool call related PRs up to 8/22/2205.
It's not thoroughly tested but in my simple tests, results look fine. If there are bugs, confirm if it exists in mainline first.
Changes to API:
Chat completions API result should match latest mainline.
/completions and /v1/completions will use openAI type completions.
The important mainline's PRs are:
server: streaming of tool calls and thoughts when--jinjais on (#12379)server: add--reasoning-budget 0to disable thinking (incl. qwen3 w/ enable_thinking:false) ggml-org/llama.cpp#13771Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client ggml-org/llama.cpp#13196