Before you begin, make sure you install all necessary libraries by running:
pip install "optimum-onnx[onnxruntime]"
If you want to use the GPU version of ONNX Runtime, make sure the CUDA and cuDNN requirements are satisfied, and install the additional dependencies by running:
pip install "optimum-onnx[onnxruntime-gpu]"
To avoid conflicts between onnxruntime and onnxruntime-gpu, make sure the package onnxruntime is not installed by running pip uninstall onnxruntime prior to installing Optimum.
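As a quick sanity check for the conflict described above, the following minimal sketch uses only the standard library to report which ONNX Runtime distributions are present in the current environment. The helper names `installed_onnxruntime_packages` and `has_conflict` are hypothetical and are not part of Optimum or ONNX Runtime.

```python
from importlib import metadata


def installed_onnxruntime_packages(candidates=("onnxruntime", "onnxruntime-gpu")):
    """Return a {distribution name: version} dict for the candidates that are installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the current environment.
            pass
    return found


def has_conflict(found):
    """Return True when both the CPU and the GPU distributions are installed."""
    return "onnxruntime" in found and "onnxruntime-gpu" in found


if __name__ == "__main__":
    found = installed_onnxruntime_packages()
    print("Installed ONNX Runtime distributions:", found)
    if has_conflict(found):
        print("Both onnxruntime and onnxruntime-gpu are installed; "
              "consider running: pip uninstall onnxruntime")
```

If the script reports both distributions, uninstalling onnxruntime as described above should resolve the conflict.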
It is possible to export 🤗 Transformers, Diffusers, Timm and Sentence Transformers models to the ONNX format and perform graph optimization as well as quantization easily:
optimum-cli export onnx --model meta-llama/Llama-3.2-1B onnx_llama/
The model can also be optimized and quantized with onnxruntime.
For more information on the ONNX export, please check the documentation.
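The export can also be driven from Python instead of the CLI. The sketch below assumes that `ORTModelForCausalLM.from_pretrained` accepts `export=True` to convert a PyTorch checkpoint on the fly, as it does in Optimum's ONNX Runtime integration; check the documentation for the version you have installed. The function name `export_to_onnx` is hypothetical, and the imports are deferred so the function can be defined even where Optimum is not yet installed.

```python
def export_to_onnx(model_id: str, output_dir: str) -> None:
    """Export a causal LM checkpoint to ONNX and save it next to its tokenizer."""
    # Deferred imports: these heavy dependencies are only needed when exporting.
    from optimum.onnxruntime import ORTModelForCausalLM
    from transformers import AutoTokenizer

    # Assumption: export=True converts the PyTorch checkpoint to ONNX at load time.
    model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Write the ONNX model and tokenizer files to the same directory.
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Calling `export_to_onnx("meta-llama/Llama-3.2-1B", "onnx_llama/")` would be roughly equivalent to the CLI command above. It downloads the checkpoint, so it requires network access and access to the model repository.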
Once the model is exported to the ONNX format, we provide Python classes enabling you to run the exported ONNX model in a seamless manner using ONNX Runtime in the backend:
from transformers import AutoTokenizer, pipeline
- from transformers import AutoModelForCausalLM
+ from optimum.onnxruntime import ORTModelForCausalLM
- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B") # PyTorch checkpoint
+ model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx") # ONNX checkpoint
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
More details on how to run ONNX models with ORTModelForXXX classes here.
Check out the examples folder for more usage examples including optimization, quantization, and model-specific demonstrations.