server : refactor middleware and /health endpoint by ngxson · Pull Request #9056 · ggml-org/llama.cpp · GitHub

@ngxson
Collaborator

@ngxson ngxson commented Aug 16, 2024

/health endpoint

Originally, the /health endpoint was used to retrieve slot state. At the time, the /completions endpoint returned an error if no slot was available, so /health let the application wait until a slot became free.

The server can now queue (defer) a request if no slot is available. /health is used by Docker for health checking. This becomes a problem when the server is busy with a long task: /health can time out. On HF inference endpoints, this puts the container in an unhealthy state, which triggers a forced restart.

Therefore, I propose a cleaner usage:

  • GET /health is now purely used to report actual health
  • GET /slots can be used as a replacement to get slot state

As a consequence, /health?fail_on_no_slot=1 is moved to /slots?fail_on_no_slot=1 (we keep this option for compatibility).
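
A minimal client-side sketch of the split described above, in Python. The helper names (`get_status`, `server_is_healthy`, `slot_is_free`) and the base URL are hypothetical. It assumes that /health answers 200 once the server is ready and that /slots?fail_on_no_slot=1 answers with a non-200 status when no slot is free:

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def get_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still useful.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, etc.
        return None


def server_is_healthy(
    base_url: str, fetch: Callable[[str], Optional[int]] = get_status
) -> bool:
    """Health check: asks /health only, independent of slot usage."""
    return fetch(f"{base_url}/health") == 200


def slot_is_free(
    base_url: str, fetch: Callable[[str], Optional[int]] = get_status
) -> bool:
    """Slot check: asks /slots with the fail_on_no_slot option.

    Assumption: any non-200 status means that no slot is currently free.
    """
    return fetch(f"{base_url}/slots?fail_on_no_slot=1") == 200
```

A container health check would call only `server_is_healthy`, while an application that wants to avoid queueing could poll `slot_is_free` before sending a request.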

Refactor middleware

Some repeated code blocks, for example setting Access-Control-Allow-Origin, are now moved to middleware.

Middleware is now also responsible for returning an error if the server is not yet ready:

When the server starts and the model is still being loaded, accessing any endpoint results in a 503 error code.
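
The middleware idea can be illustrated with a small, framework-independent sketch in Python. This is not the code from this PR (the server itself is C++). `Response`, `with_middleware`, `is_ready`, the `"*"` origin value, and the 503 body text are all hypothetical names and values chosen for illustration:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Response:
    status: int = 200
    headers: Dict[str, str] = field(default_factory=dict)
    body: str = ""


# A handler maps a request path to a response (simplified for the sketch).
Handler = Callable[[str], Response]


def with_middleware(
    handler: Handler,
    is_ready: Callable[[], bool],
    allow_origin: str = "*",
) -> Handler:
    """Wrap a handler with the checks that were previously repeated per endpoint."""

    def wrapped(path: str) -> Response:
        if is_ready():
            resp = handler(path)
        else:
            # Model still loading: short-circuit before the endpoint runs.
            resp = Response(status=503, body="Loading model")
        # Shared header set in one place instead of in every handler.
        resp.headers["Access-Control-Allow-Origin"] = allow_origin
        return resp

    return wrapped
```

In this sketch the CORS header is also attached to the 503 response; that is a design choice of the example, not a statement about the actual implementation.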

Behavior when model loading fails

If the model fails to load (for example, the file does not exist), the server simply exits with status code 1. This resolves #7787, where a user reported that loading an invalid model causes the server to crash.
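
For a launcher script, the practical difference is that a failed load now shows up as an ordinary non-zero exit status rather than a crash. A minimal sketch, assuming a POSIX system where Python's `subprocess` reports termination by signal as a negative return code; `classify_exit` and `run_server` are hypothetical helpers, and the command line is supplied by the caller:

```python
import subprocess
from typing import Sequence


def classify_exit(returncode: int) -> str:
    """Describe how a child process ended, based on its return code."""
    if returncode == 0:
        return "clean exit"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return f"crashed (signal {-returncode})"
    return f"failed with exit status {returncode}"


def run_server(cmd: Sequence[str]) -> str:
    """Run the server command to completion and classify the outcome."""
    proc = subprocess.run(cmd)
    return classify_exit(proc.returncode)
```

With the behavior described above, a missing model file would be expected to fall into the "failed with exit status 1" case.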


ngxson and others added 2 commits August 16, 2024 11:01
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@github-actions github-actions bot added the python python script changes label Aug 16, 2024
@mcharytoniuk
Contributor

I started a discussion thread related to this issue, please take a look: #9276

@ngxson ngxson added the breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. label Sep 2, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* server : refactor middleware and /health endpoint

* move "fail_on_no_slot" to /slots

* Update examples/server/server.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix server tests

* fix CI

* update server docs

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Feb 25, 2025
Labels

breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. examples python python script changes server

Development

Successfully merging this pull request may close these issues.

Refactor: investigate cleaner exception handling for server/server.cpp

3 participants