How to Run ONNX Models in Rust

When Python is too heavy

You spent weeks training a neural network in Python. The accuracy is solid. Now you need to ship it to production. The Python runtime is too heavy for your edge device, or your Rust backend needs to make predictions without spinning up a separate microservice. You export the model to ONNX. Now you have a .onnx file and a Rust project. How do you turn that file into predictions?

The answer is the ort crate. It wraps the ONNX Runtime C++ library and gives you a safe, ergonomic API to load models and run inference. You don't retrain the model. You don't rewrite the architecture. You load the graph, feed it tensors, and get results.

The ONNX runtime in Rust

ONNX stands for Open Neural Network Exchange. It is a standard format for representing machine learning models. Think of it like PDF for documents. You can write a document in Word, save it as PDF, and open it in any PDF viewer. Similarly, you train in PyTorch or TensorFlow, save as ONNX, and run the model in Rust, where ort plays the role of the viewer.

The ort crate provides two main concepts. The environment is the global runtime state, such as logging and optional shared thread pools. In ort 2.x it is process-wide: ort creates a default environment the first time one is needed, and you can configure it up front with ort::init(). The Session represents a loaded model. Each session holds the model graph and weights in memory, optimized for execution.

The workflow is linear. Optionally configure the environment. Build a session. Create tensors for inputs. Run the session. Extract tensors from outputs. The session does not borrow the environment, so there is no lifetime parameter to thread through your own types.

Minimal working example

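Add the dependency to your Cargo.toml. The ort crate is the safe wrapper. Avoid ort-sys unless you need raw bindings. The examples here target the 2.0 release candidates. Cargo does not select pre-release versions for a plain "2.0" requirement, and the API changed between candidates, so pin one exactly.

[dependencies]
# Exact pin: signatures differ between 2.0 release candidates.
ort = "=2.0.0-rc.9"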

This example loads a model and runs it with a dummy input. The shape [1, 3, 224, 224] is common for image classification models like ResNet. Adjust the shape to match your model.

// Written against ort 2.0.0-rc.9. Module paths and some signatures differ
// in other 2.0 release candidates, so check the docs for the version you pin.
use ort::session::{builder::GraphOptimizationLevel, Session};
use ort::value::Tensor;

fn main() -> ort::Result<()> {
    // ort 2.x keeps the environment as process-wide state and creates a
    // default one on first use. Call ort::init() before this point to customize it.

    // The session builder configures the model load.
    let session = Session::builder()?
        // Level3 enables all available graph optimizations.
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        // Parse the ONNX file and load weights into memory.
        .commit_from_file("model.onnx")?;

    // Prepare input data.
    // The buffer length must equal the product of the shape dimensions.
    let input_data: Vec<f32> = vec![0.0; 1 * 3 * 224 * 224];

    // Create a tensor from a single (shape, data) pair.
    // The shape lists dimensions as [batch, channels, height, width].
    let input_tensor = Tensor::from_array(([1usize, 3, 224, 224], input_data))?;

    // Run inference.
    // The inputs! macro packs tensors into the session's input format.
    let outputs = session.run(ort::inputs![input_tensor]?)?;

    // Extract the first output as an f32 view.
    // Indexing panics if the model has no output at position 0.
    let predictions = outputs[0].try_extract_tensor::<f32>()?;
    println!("Predictions shape: {:?}", predictions.shape());

    Ok(())
}

Configure the environment once, if at all. Create the session once. Run many times.
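
The example stops at printing the output shape. For a classifier, the usual next step is turning the raw scores into a predicted class. The following is a minimal sketch in plain Rust, assuming the output has already been copied into a flat slice of logits for one image. The helpers softmax and argmax are hypothetical names for illustration and are not part of ort.

```rust
/// Converts raw scores (logits) into probabilities that sum to 1.
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the maximum first so exp() cannot overflow.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Returns the index of the largest score, or None for an empty slice.
fn argmax(scores: &[f32]) -> Option<usize> {
    scores
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(index, _)| index)
}
```

Whether softmax is needed depends on the model: some exported graphs already end in a softmax layer, in which case argmax on the raw output is enough.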

How the session lifecycle works

The environment holds logging and other shared state for the underlying C++ runtime. In ort 2.x it is process-wide, so there is nothing to store or pass around. If you need to customize it, call ort::init() once at startup, before the first session is built.

The Session object holds the model graph and weights. Loading the model parses the ONNX file, validates the graph, applies optimizations, and loads weights into memory. This is the expensive step. Do not create a new session for every request. Create the session during startup and reuse it.
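
One way to get "load once, reuse everywhere" is a lazily initialized static. The sketch below uses std::sync::OnceLock with a stand-in Model type so that it runs without a model file. Model, load_model, shared_model, and LOAD_COUNT are hypothetical names; in a real service the struct would wrap the ort session.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for a loaded model. A real service would hold the ort session here.
struct Model {
    name: String,
}

/// Counts how many times the expensive load actually ran.
static LOAD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Pretend loader; a real one would parse the .onnx file at this point.
fn load_model(name: &str) -> Model {
    LOAD_COUNT.fetch_add(1, Ordering::SeqCst);
    Model { name: name.to_string() }
}

/// Returns the process-wide model, loading it on first use only.
fn shared_model() -> &'static Model {
    static MODEL: OnceLock<Model> = OnceLock::new();
    MODEL.get_or_init(|| load_model("model.onnx"))
}
```

Every caller of shared_model gets the same reference, and the loader runs once even under concurrent first calls. A loader that can fail does not fit get_or_init directly; loading in main and passing the result into application state is the simpler choice in that case.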

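A session can be shared across threads. In the release candidate used here, run takes &self and Session is Send and Sync, so wrapping the session in Arc<Session> is enough. If the version you pin declares run with &mut self, put the session behind a Mutex or give each worker its own session.

use ort::value::Tensor;
use std::sync::Arc;

// Store the session in an Arc for thread-safe sharing.
let session = Arc::new(session);

// Each thread gets its own handle to the same session.
let session_clone = Arc::clone(&session);
let handle = std::thread::spawn(move || -> ort::Result<()> {
    // Build the input inside the thread so nothing else has to be shared.
    let tensor = Tensor::from_array(([1usize, 3, 224, 224], vec![0.0_f32; 1 * 3 * 224 * 224]))?;
    let _outputs = session_clone.run(ort::inputs![tensor]?)?;
    Ok(())
});

// Surface both a thread panic and an inference error.
handle.join().expect("inference thread panicked")?;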

Convention aside: Think of this as a singleton session. Inference services typically load the model once and share the session. If you see code creating a session inside a request handler, flag it for review.
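
Whether Arc alone is enough depends on how run is declared in the ort version you use. The sketch below shows the fallback for a run method that needs exclusive access: the session sits behind a Mutex and each worker takes the lock for the duration of one call. MockSession and run_on_threads are hypothetical stand-ins so the pattern runs without a model file.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for a session whose `run` needs exclusive access.
struct MockSession {
    calls: usize,
}

impl MockSession {
    /// Pretend inference: doubles every input value and counts the call.
    fn run(&mut self, input: &[f32]) -> Vec<f32> {
        self.calls += 1;
        input.iter().map(|x| x * 2.0).collect()
    }
}

/// Runs one inference per worker thread against a single shared session.
fn run_on_threads(session: Arc<Mutex<MockSession>>, workers: usize) -> Vec<Vec<f32>> {
    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let session = Arc::clone(&session);
            thread::spawn(move || {
                // The lock serializes calls; it is released when the guard drops.
                let mut guard = session.lock().expect("session mutex poisoned");
                guard.run(&[i as f32])
            })
        })
        .collect();

    // Join in spawn order so results line up with worker indices.
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

The Mutex removes parallelism inside the model call, so a pool with one session per worker may serve throughput-sensitive services better at the cost of extra memory.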

Tensors: bridging Rust data and model inputs

Tensors are the data containers. A tensor holds a multidimensional array of values with a specific data type. ONNX models expect inputs as tensors. Rust code produces values as slices, arrays, or vectors. The Tensor struct bridges the gap.

The Tensor::from_array function takes a single argument: a pair of (shape, data). The shape defines the dimensions. The data is an owned, contiguous buffer such as a Vec. The total number of elements in the buffer must match the product of the shape dimensions.

// Shape [2, 3] means 2 rows, 3 columns. Total 6 elements.
let data: Vec<f32> = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
let tensor = Tensor::from_array(([2usize, 3], data))?;

If the buffer length does not match the shape, ort returns an error at runtime. The compiler cannot check this because the shape is dynamic data. Treat the shape as a contract. If the model expects [1, 3, 224, 224], give it exactly that.
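
The contract is simple arithmetic, so it can be checked before any tensor is built. A minimal sketch in plain Rust follows. The helpers element_count and check_len are hypothetical names for illustration and are not part of ort.

```rust
/// Number of elements a tensor of this shape holds (product of the dimensions).
/// An empty shape describes a scalar and yields 1.
fn element_count(shape: &[usize]) -> usize {
    shape.iter().product()
}

/// Checks that a flat buffer has exactly as many elements as the shape requires.
fn check_len(shape: &[usize], data_len: usize) -> Result<(), String> {
    let expected = element_count(shape);
    if data_len == expected {
        Ok(())
    } else {
        Err(format!("shape {shape:?} needs {expected} elements, got {data_len}"))
    }
}
```

Calling check_len(&[1, 3, 224, 224], buffer.len()) at the boundary of your own code lets you report a bad input in your own terms before the data reaches the runtime.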

Data types must match. ONNX types map to Rust types. float maps to f32. double maps to f64. int64 maps to i64. bool maps to bool. If you pass a Tensor<f32> to a model expecting int64, the run fails.

// Model expects int64 input.
let data: Vec<i64> = vec![1, 2, 3];
let tensor = Tensor::from_array(([3usize], data))?;

Convention aside: Use try_extract_tensor::<T> to read outputs. The type parameter T ensures you extract the correct data type. If the model output type does not match T, the function returns an error. This catches type mismatches early.

Realistic inference service

Real code wraps the session in a struct. The struct manages the lifecycle and provides a clean API. This example shows a struct that holds the session. Because the environment is process-wide in ort 2.x, the struct does not need to store it.

use ort::session::{builder::GraphOptimizationLevel, Session};
use ort::value::Tensor;
use std::sync::Arc;

/// Runs inference on an ONNX model.
pub struct ModelRunner {
    // Arc lets several owners share one loaded model across threads.
    session: Arc<Session>,
}

impl ModelRunner {
    /// Loads a model and creates a runner.
    pub fn new(model_path: &str) -> ort::Result<Self> {
        let session = Session::builder()?
            .with_optimization_level(GraphOptimizationLevel::Level3)?
            .commit_from_file(model_path)?;

        Ok(Self {
            session: Arc::new(session),
        })
    }

    /// Runs prediction on image data laid out as [1, 3, 224, 224].
    pub fn predict(&self, image_data: &[f32]) -> ort::Result<Vec<f32>> {
        // from_array takes owned data, so copy the slice into a Vec.
        // The length must equal 1 * 3 * 224 * 224.
        let tensor = Tensor::from_array(([1usize, 3, 224, 224], image_data.to_vec()))?;

        let outputs = self.session.run(ort::inputs![tensor]?)?;

        // Indexing panics if the model has no output at position 0.
        let tensor_view = outputs[0].try_extract_tensor::<f32>()?;

        // Copy the values out in logical order. The view borrows from
        // `outputs`, so it cannot be returned directly.
        Ok(tensor_view.iter().copied().collect())
    }
}

The struct holds only the session. The output view borrows from the outputs of the run, which is why predict copies the values into a Vec before returning.
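
The predict method assumes the caller already has f32 values in [1, 3, 224, 224] order. Decoded images usually arrive as interleaved RGB bytes instead, so a layout conversion is needed first. The sketch below shows that step in plain Rust. The helper hwc_to_chw is a hypothetical name, and scaling to 0.0..=1.0 is an assumption: many models also expect per-channel mean and standard deviation normalization, so check what your model was trained with.

```rust
/// Converts interleaved RGB bytes (height x width x 3) into planar f32 values
/// (3 x height x width) scaled to the range 0.0..=1.0.
/// Returns None when the buffer length does not match the dimensions.
fn hwc_to_chw(pixels: &[u8], height: usize, width: usize) -> Option<Vec<f32>> {
    const CHANNELS: usize = 3;
    if pixels.len() != height * width * CHANNELS {
        return None;
    }
    let mut out = vec![0.0_f32; pixels.len()];
    for y in 0..height {
        for x in 0..width {
            for c in 0..CHANNELS {
                // Source: channels are adjacent for each pixel.
                let src = (y * width + x) * CHANNELS + c;
                // Destination: each channel is a full height x width plane.
                let dst = c * height * width + y * width + x;
                out[dst] = f32::from(pixels[src]) / 255.0;
            }
        }
    }
    Some(out)
}
```

For a 224 by 224 image the result has 150528 values and can be passed to predict as a slice.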

Convention aside: Check session.inputs to verify input names. Models can have multiple inputs with specific names. The ort::inputs! macro supports named inputs. Use ort::inputs!["input_name" => tensor] when the model has named inputs. Relying on positional order is fragile.

Pitfalls and compiler errors

Shape mismatches cause runtime errors. The compiler cannot catch these. If you pass a tensor with the wrong shape, run returns an error describing the mismatch. Always validate shapes against the model definition.

Input name mismatches also cause runtime errors. If you pass a name the model does not declare, the run fails. Unnamed tensors are matched to the model's inputs by position, so a wrong order shows up as a shape or type error instead. Check session.inputs to see expected names.
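
Validating a shape against the model definition can be sketched as a small comparison. This assumes the model's declared dimensions are available as a list of i64 values in which a negative entry marks a dynamic axis, such as a variable batch size. That convention and the helper name shape_matches are assumptions for illustration; check how your ort version reports input dimensions.

```rust
/// Returns true when `actual` satisfies the declared shape `expected`.
/// Assumption: a negative entry in `expected` marks a dynamic axis
/// that accepts any size.
fn shape_matches(expected: &[i64], actual: &[usize]) -> bool {
    expected.len() == actual.len()
        && expected
            .iter()
            .zip(actual)
            .all(|(&want, &got)| want < 0 || want as u64 == got as u64)
}
```

With a declared shape of [-1, 3, 224, 224], a batch of eight images in [8, 3, 224, 224] passes, while channels-last data in [1, 224, 224, 3] is rejected before the run.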

The borrow checker catches lifetime errors, but not where you might expect. Session is an owned type that does not borrow the environment, so a loader function can build a session and return it. What does borrow is the output: the view returned by try_extract_tensor points into the outputs of that run. If you try to return the view from a function where outputs is a local variable, the compiler rejects the code with E0515 (cannot return value referencing local data). Copy the values into a Vec before returning, as predict does above.

// This compiles: the session owns everything it needs.
fn load_session(path: &str) -> ort::Result<Session> {
    let session = Session::builder()?.commit_from_file(path)?;
    Ok(session)
}

Type mismatches in tensor creation trigger compiler errors. If you try to put a String into a Tensor<f32>, you get E0308 (mismatched types). The compiler enforces type safety at the tensor level.

Return the session freely. Copy output data before the outputs go out of scope.

Choosing your ML stack

Rust has several machine learning options. Pick the tool that matches your workflow.

Use ort when you have a pre-trained ONNX model and need fast inference with minimal code. The model is already exported. You just need to run it.

Use candle when you want to define models in Rust, train them, or run inference without relying on external C++ runtimes. It is pure Rust and great for custom architectures.

Use tch-rs when you need the full PyTorch ecosystem and are comfortable with a binding to the PyTorch C++ library. It supports dynamic graphs and training.

Use burn when you want a deep learning framework written in Rust with swappable backends, including GPU backends alongside CPU.

Use ort to run a model that has already been exported. Use candle, tch-rs, or burn when the model itself is defined or trained in Rust.

Where to go next

Running ONNX models in Rust lets your program use pre-trained AI models saved in the standard ONNX format. Think of it like plugging a universal adapter into your computer to run software designed for other systems. You use it when you need to integrate machine learning predictions directly into high-performance Rust applications.