ONNX Runtime is an accelerator for machine learning models with multi platform support and a flexible interface to integrate with hardware-specific libraries. ONNX Runtime can be used with models from PyTorch, Tensorflow/Keras, TFLite, scikit-learn, and other frameworks.
ONNX Runtime Inference powers machine learning models in key Microsoft products and services across Office, Azure, Bing, as well as dozens of community projects.
Examples use cases for ONNX Runtime Inferencing include:
- Improve inference performance for a wide variety of ML models
- Run on different hardware and operating systems
- Train in Python but deploy into a C#/C++/Java app
- Train and perform inference with models created in different frameworks
The premise is simple.
- Get a model. This can be trained from any framework that supports export/conversion to ONNX format. See the tutorials for some of the popular frameworks/libraries.
- Load and run the model with ONNX Runtime. See the basic tutorials for running models in different languages.
- (Optional) Tune performance using various runtime configurations or hardware accelerators. There are lots of options here - see How to: Tune Performance as a starting point.
Even without step 3, ONNX Runtime will often provide performance improvements compared to the original framework.
ONNX Runtime applies a number of graph optimizations on the model graph then partitions it into subgraphs based on available hardware-specific accelerators. Optimized computation kernels in core ONNX Runtime provide performance improvements and assigned subgraphs benefit from further acceleration from each Execution Provider.
Released in April 2021, ONNX Runtime Training provides a one-line addition for existing PyTorch training scripts to accelerate training times. The current support is focused on large transformer models on multi-node NVIDIA GPUs, with more to come.
Using the ORTModule class wrapper, ONNX Runtime for PyTorch runs the forward and backward passes of the training script using an optimized automatically-exported ONNX computation graph. ORT Training uses the same graph optimizations as ORT Inferencing, allowing for model training acceleration.