🚀 Feature
Document: https://torchserve-docs.s3-us-west-2.amazonaws.com/docs/torchserve_architecture_v0.pdf
NOTE: The above document is a draft and subject to change
Request to build a Serving framework to host and serve trained PyTorch models.
@alexwong
Motivation
PyTorch provides an excellent and easy-to-use interface for training a model, and also provides an easy-to-use, optimized inference interface through JIT. However, there is currently a need for an optimized model serving solution that takes any model trained with the PyTorch framework into production. Multiple solutions exist for serving a PyTorch model in production, but most of them are generic model serving solutions. The PyTorch community currently faces several pain points when trying to take a PyTorch model into production.
- Building a high-performance web serving component to host PyTorch models is difficult and requires experience and domain knowledge.
- Adding custom pre-processing and post-processing for a model in service currently requires significant rework on the model server itself.
- Supporting multiple accelerators requires additional work.
- Any customization of the model server requires a deep understanding of the existing serving framework and significant rework.
We think the following are the most important requirements of a good PyTorch model serving framework.
Pitch
We want to contribute to building a serving framework that addresses the above pain points and more. We foresee that a PyTorch model serving solution will have the following capabilities:
- Performance: The server component should be highly performant, with low overhead from the serving framework itself. This implies that average throughput must be high and P90 latencies should be low. It's also important for P50 and P90 latencies to be comparatively flat, signifying that all requests are treated equally.
- Host Multiple Models: The server component should be able to host multiple models at the same time and customers should be able to load/unload a model at runtime. The model serving framework should expose an endpoint for each model, which can be reached by any customer.
- High Availability: The serving framework should be robust. Runtime errors in one model shouldn't affect other models running on the server or the runtime of the server itself. There should be mechanisms to recover from any out-of-resource errors on the system.
- Metrics and Logs: A production grade serving framework should provide insight into the runtime of the model. The serving framework should provide easy access to logs and metrics and also provide easy hooks to add new logs and metrics without needing to understand the serving framework at a deep level.
- Support both Eager mode and Scripted mode models: The serving framework should support running PyTorch models in scripted mode for optimized execution as well as in eager mode (see the sketch after this list).
- Support for multiple bindings: The serving framework should support models loaded via Python (eager/TorchScript) or C++ bindings (JIT IR traces).
- Supports HTTP and gRPC Endpoints: The serving framework should come with a full set of HTTP endpoints for managing models as well as running inference on the models. PyTorch serve would also come with an SDK to easily customize the endpoints. The serving framework would also support gRPC endpoints.
- Ease of use and access: The serving framework should be easy to set up and test on any platform (macOS, Linux, Windows). Users should be able to containerize the serving framework and launch it into production using any container orchestration mechanism. The PyTorch serve framework would also have a fully featured CLI to start and run the model server.
- Lightweight: The serving component itself should have minimal dependencies.
- Supports features such as request batching: The serving framework would have features such as request batching, to optimally run inference on accelerators.
- Support model versioning and A/B testing: The serving framework should be able to load multiple versions of the same model and run A/B tests between them. This is very useful when rolling out newer versions of a model into production, and can also be used to roll back if the new model is not as performant.
- Zero code serving: While providing the ability to customize pre-processing and post-processing of inference requests, the PyTorch serve framework should also allow customers to simply drop their trained models into the server and use them for inference. In other words, the PyTorch serving framework should come with sensible defaults for pre-processing and post-processing.
- Easy customizability: The serving framework must be easy to customize at the endpoint level. This means easily modifying and adding new management endpoints, defining custom request batching algorithms, and defining custom AuthZ and AuthN mechanisms.
- Support Accelerators: A production grade model server should be able to run on GPU hosts as well as any other custom accelerator host.
- Web UI: The PyTorch serving framework should come with a Web UI to allow interaction with a served model.
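To illustrate the eager and scripted modes referenced above, here is a minimal sketch using the public torch.jit API. The SimpleNet module and file names are made up for illustration; the point is that the same model can be exported either as eager-mode weights or as a self-contained TorchScript artifact that the serving framework (including its C++ bindings) could load.

```python
import torch
import torch.nn as nn

# A made-up toy model used only for illustration.
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return torch.softmax(self.fc(x), dim=-1)

model = SimpleNet().eval()

# Eager mode: serialize the weights; the serving backend would need the
# model class definition (Python binding) to reconstruct and run it.
torch.save(model.state_dict(), "simplenet_eager.pt")

# Scripted mode: compile to TorchScript so the artifact is self-contained
# and can also be loaded from the C++ bindings for optimized execution.
scripted = torch.jit.script(model)
scripted.save("simplenet_scripted.pt")
```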
Proposed Technical Details
The model server will follow a micro-service architecture rather than a monolithic approach. This architecture brings the benefit of decoupling the work of handling ingress requests from running the actual inference. It also allows the PyTorch serving framework to scale beyond a single host. The high-level components of the serving framework are divided into a frontend and a backend, which have different responsibilities.
Frontend responsibilities:
- Manage connections: In other words, the incoming requests and outgoing responses are managed by the frontend.
- Manage models: The frontend is responsible for the lifecycle of each model. Every hosted model gets its own unique endpoint, which can accept any data type and return any data type (see the client sketch after this list).
- Manage the backend: The frontend is responsible for providing the models to be loaded onto the backend workers and also managing the backend workers themselves.
- Manage requests: Requests coming into the server's frontend will be queued in model-specific queues for handling.
- Request distribution: The frontend will be responsible for distributing requests to backend workers.
- Metrics and logs: The frontend will be responsible for managing metrics and logs, and for capturing any custom metrics and logs that come from the backend.
- Retrieve models from anywhere: The frontend is also responsible for retrieving models from cloud or local storage.
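To make the frontend's role concrete, here is a minimal sketch of how a client might interact with it over HTTP. The endpoint paths (/models, /predictions/<model_name>), the host/port, and the model artifact URL are assumptions chosen purely for illustration; the actual API surface is still to be defined as part of this RFC.

```python
import requests

BASE = "http://localhost:8080"  # assumed default host/port, for illustration only

# Register a model with the frontend; the /models path and the artifact URL
# are hypothetical placeholders, not a finalized API.
resp = requests.post(f"{BASE}/models", params={"url": "s3://my-bucket/simplenet_scripted.pt"})
resp.raise_for_status()

# Run inference against the per-model endpoint created by the frontend.
with open("input.json", "rb") as f:
    pred = requests.post(f"{BASE}/predictions/simplenet", data=f.read())
print(pred.json())
```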
Backend responsibilities:
- Running inference: The backend tightly integrates with the PyTorch runtime and is responsible for running any required preprocessing, running the model's forward method on the incoming request, and post-processing the inference response.
- Default pre-process, inference and post-process: If no custom processing logic is provided, the backend will fall back to default preprocess, inference and postprocess logic for the model (see the handler sketch after this list).
- Publish custom metrics and logs: The backend will be able to publish custom model-level metrics and logs.
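The backend's default preprocess → inference → postprocess flow could look roughly like the sketch below. The handler class name, method names, and the JSON-based defaults are assumptions chosen for illustration; the real backend contract would be defined as part of this work, and a custom handler would override the preprocess/postprocess steps while reusing the default inference step.

```python
import json
import torch

class DefaultHandler:
    """Hypothetical backend handler with sensible defaults (illustrative only)."""

    def __init__(self, model_path: str):
        # Load a TorchScript artifact retrieved by the frontend's model store.
        self.model = torch.jit.load(model_path).eval()

    def preprocess(self, request_body: bytes) -> torch.Tensor:
        # Default: interpret the request body as a JSON list of floats.
        return torch.tensor(json.loads(request_body), dtype=torch.float32)

    def inference(self, batch: torch.Tensor) -> torch.Tensor:
        # Run the model's forward method without tracking gradients.
        with torch.no_grad():
            return self.model(batch)

    def postprocess(self, output: torch.Tensor) -> bytes:
        # Default: return raw scores as JSON.
        return json.dumps(output.tolist()).encode("utf-8")

    def handle(self, request_body: bytes) -> bytes:
        return self.postprocess(self.inference(self.preprocess(request_body)))
```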
Proposed sequence diagrams
- Monitoring the status of the server
- Checking the status of the models
- Running inference
- Loading a model
- Deleting a model
Next Steps
- We are looking for feedback/comments on this proposal. Specifically, we are looking for feedback on the list of capabilities outlined in this RFC and their priority. We also welcome feedback on our proposed design and implementation.
- Add details on the proposed architecture and on the endpoints.
- Add additional sequence diagrams.
- Target Q4 2019 for an experimental release.