kv-cache : simplify the interface #13660

ggerganov · 2025-05-20T14:51:26Z

First part of some KV cache interface refactoring and simplification.

Public API changes

Deprecate llama_kv_self_n_tokens
Deprecate llama_kv_self_used_cells

Internal `llama_kv_cache` changes

Remove llama_kv_cache::get_n_tokens()
Remove llama_kv_cache::get_used_cells()
Remove llama_kv_cache::get_pos_max()
Add the notion of n_seq_max to the KV cache objects. It will be needed later to improve the data structures that track per-sequence information.
Remove unused type_k and type_v members
Rename padding -> n_pad for consistency

Other changes

llama_decode() now verifies that if the input batch has pos == null, it also has seq_id == null. In that case all tokens are automatically assigned to seq_id == 0, with positions continuing from the max position currently in the cache. This fixes an edge case in which a batch with pos == null that also has tokens with seq_id != 0 would be assigned incorrect positions by the llama_batch_allocr.
Remove some KV-cache related fields (like "used cells" and "tokens count") from the server's /metrics endpoint. These are too internal and implementation-specific and should not be exposed to the public.
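
As an illustration of the llama_decode() batch check described above, here is a minimal sketch. It is not the code from this PR: batch_view is a simplified, hypothetical stand-in for llama_batch, and batch_pos_seq_consistent is a hypothetical helper name used only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for llama_batch (assumption: only the
// fields relevant to the check are modeled here).
struct batch_view {
    int32_t                 n_tokens;
    const int32_t *         pos;     // nullptr => positions are assigned automatically
    const int32_t * const * seq_id;  // nullptr => all tokens go to sequence 0
};

// Returns true when the batch satisfies the rule from the description:
// if pos is null, seq_id must be null as well.
// A batch with explicit positions may carry any sequence ids.
bool batch_pos_seq_consistent(const batch_view & b) {
    return b.pos != nullptr || b.seq_id == nullptr;
}
```

A caller would reject the batch before decoding when the helper returns false, since automatic position assignment only makes sense for seq_id == 0.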

ggml-ci

slaren · 2025-05-21T01:22:37Z

src/llama-batch.cpp

        pos.resize(batch.n_tokens);
        for (int32_t i = 0; i < batch.n_tokens; i++) {
-            pos[i] = i + p0;
+            pos[i] = p0 + i + 1;


This change is a bit confusing to me. I would read p0 as "position zero", but now this parameter seems to mean "previous max pos" instead.

I was focused on simplifying the call site by absorbing the + 1 into the function, but you are right that this makes the meaning of the parameter more confusing. Changed back to the previous version and also added an assert for p0.
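
For readers following the thread, the reverted semantics can be sketched as below. This is a self-contained illustration under stated assumptions, not the actual llama_batch_allocr code: fill_positions is a hypothetical helper, and the exact form of the assert in the PR is not shown in this conversation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper mirroring the reverted loop: p0 is the position of the
// FIRST token in the batch ("position zero" of the batch), not the previous
// max position in the cache.
std::vector<int32_t> fill_positions(int32_t n_tokens, int32_t p0) {
    assert(n_tokens >= 0);
    assert(p0 >= 0 && "p0 must be a valid starting position");

    std::vector<int32_t> pos(n_tokens);
    for (int32_t i = 0; i < n_tokens; i++) {
        pos[i] = p0 + i;  // same as the original `pos[i] = i + p0`
    }
    return pos;
}
```

With this meaning of p0, the + 1 stays at the call site: if the largest position already in the cache is pos_max (an assumed variable name), the caller passes pos_max + 1, for example fill_positions(3, pos_max + 1).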

ggml-ci

* kv-cache : simplify the interface

ggml-ci

* context : revert llama_batch_allocr position change

ggml-ci

github-actions bot added examples server labels May 20, 2025

ggerganov force-pushed the gg/kv-cache-simplify-part1 branch from 70321a1 to ef880b3 Compare May 20, 2025 16:51

kv-cache : simplify the interface

a91b15f

ggml-ci

ggerganov force-pushed the gg/kv-cache-simplify-part1 branch from ef880b3 to a91b15f Compare May 20, 2025 16:54

ggerganov marked this pull request as ready for review May 20, 2025 17:06

ggerganov requested a review from ngxson as a code owner May 20, 2025 17:06

ggerganov requested a review from slaren May 20, 2025 17:09

slaren reviewed May 21, 2025

context : revert llama_batch_allocr position change

e987482

ggml-ci

ggerganov force-pushed the gg/kv-cache-simplify-part1 branch from 0e096d4 to e987482 Compare May 21, 2025 11:55

slaren approved these changes May 21, 2025

ggerganov merged commit 797f2ac into master May 21, 2025
51 of 53 checks passed

ggerganov deleted the gg/kv-cache-simplify-part1 branch May 21, 2025 12:11

ggerganov mentioned this pull request May 21, 2025

changelog : llama-server REST API #9291

Open

infil00p pushed a commit to baseweight/llama.cpp that referenced this pull request May 22, 2025

kv-cache : simplify the interface (ggml-org#13660)

7b46cf2

* kv-cache : simplify the interface

ggml-ci

* context : revert llama_batch_allocr position change

ggml-ci

mamei16 mentioned this pull request Jul 22, 2025

Cannot detect the model loader mamei16/context-progress-bar-text-generation-webui#2

Closed

jakexcosme mentioned this pull request Oct 22, 2025

changelog : llama-server REST API COG-GTM/llama.cpp#245

Open
