ONNX Runtime
ONNX Runtime is an open-source inference engine that runs AI models in the ONNX (Open Neural Network Exchange) format, enabling optimised, cross-platform model deployment across CPUs, GPUs, and specialised hardware.
Frequently asked questions
What is the difference between ONNX and ONNX Runtime?
ONNX (Open Neural Network Exchange) is an open format for representing machine learning models. It defines a common set of operators and a file format that allows models to be transferred between different frameworks and tools. ONNX Runtime is the engine that runs ONNX models.
Can I convert my existing models to ONNX?
Most models from major frameworks (PyTorch, TensorFlow, scikit-learn) can be exported to ONNX. Some very new or custom operators may not have ONNX equivalents yet, but coverage is comprehensive for standard architectures.
Can ONNX Runtime run large language models (LLMs)?
Yes, with caveats. ONNX Runtime supports LLM inference and offers optimisations for transformer models. However, for very large LLMs, specialised serving solutions such as vLLM or TensorRT-LLM may deliver better performance in some deployment scenarios.