Running Microsoft's Phi-3 on CPU with Rust & Candle
Source: Dev.to
Introduction
Python is currently the best tool for training machine learning models, with a rich ecosystem that includes PyTorch and Hugging Face Transformers. However, when it comes to inference in production environments or on edge devices, Python’s overhead—large Docker images, slow cold‑starts, and high memory usage—can be prohibitive.
Rust and Candle provide a way to run large language models (LLMs) with far less overhead. Candle is a minimalistic ML framework from Hugging Face that supports quantized GGUF models, allowing you to deploy state‑of‑the‑art models like Microsoft Phi‑3 on a standard CPU without a GPU.
In this guide you will learn how to:
- Remove the heavy PyTorch dependency.
- Load a quantized Phi‑3 model directly in Rust.
- Build a standalone, lightweight CLI tool for fast CPU inference.
Step 1: Setting Up the Project
cargo new rust-phi3-cpu
cd rust-phi3-cpu
Add the Candle stack and other dependencies to Cargo.toml:
[package]
name = "rust-phi3"
version = "0.1.0"
edition = "2021"
[dependencies]
anyhow = "1.0"
tokenizers = "0.19.1"
clap = { version = "4.4", features = ["derive"] }
candle-core = { git = "https://github.com/huggingface/candle.git", branch = "main" }
candle-transformers = { git = "https://github.com/huggingface/candle.git", branch = "main" }
candle-nn = { git = "https://github.com/huggingface/candle.git", branch = "main" }
candle-transformers includes built‑in support for GGUF (quantized) models, which is the key to efficient CPU inference.
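If you want to sanity-check a GGUF file before wiring up the whole pipeline, candle-core lets you read the file's tensor table directly. The sketch below sums the quantized tensor sizes the same way the upstream Candle examples do; it is illustrative only and not part of the final CLI:
use candle_core::quantized::gguf_file;

/// Print a quick summary of a GGUF file: how many tensors it holds and
/// roughly how many bytes of quantized weights they add up to.
fn inspect_gguf(path: &str) -> anyhow::Result<()> {
    let mut file = std::fs::File::open(path)?;
    let content = gguf_file::Content::read(&mut file)?;
    let mut total_bytes = 0usize;
    for (_name, info) in content.tensor_infos.iter() {
        // Size on disk = element count scaled by the quantized block layout.
        total_bytes += info.shape.elem_count() * info.ggml_dtype.type_size() / info.ggml_dtype.block_size();
    }
    println!(
        "{} tensors, ~{:.2} GB of quantized weights",
        content.tensor_infos.len(),
        total_bytes as f64 / 1e9
    );
    Ok(())
}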
Step 2: Implementation
Create src/main.rs with the following code:
use anyhow::{Error as E, Result};
use clap::Parser;
use candle_transformers::models::quantized_phi3 as model; // Phi‑3 specific module
use candle_core::{Tensor, Device};
use candle_core::quantized::gguf_file;
use tokenizers::Tokenizer;
use std::io::Write;
#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
    /// Prompt for inference
    #[arg(short, long, default_value = "Physics is fun. Explain quantum physics to a 5-year-old in simple words:")]
    prompt: String,

    /// Path to the GGUF model file
    #[arg(long, default_value = "Phi-3-mini-4k-instruct-q4.gguf")]
    model_path: String,
}
fn main() -> Result<()> {
    let args = Args::parse();
    println!("Loading model from: {}", args.model_path);

    // 1. Set up the device (CPU)
    let device = Device::Cpu;

    // 2. Load the GGUF model
    let mut file = std::fs::File::open(&args.model_path)
        .map_err(|_| E::msg(format!("Could not find model file at {}. Did you download it?", args.model_path)))?;
    let content = gguf_file::Content::read(&mut file)?;
    // Flash Attention disabled for CPU
    let mut model = model::ModelWeights::from_gguf(false, content, &mut file, &device)?;

    // 3. Load the tokenizer
    println!("Loading tokenizer...");
    let tokenizer = Tokenizer::from_file("tokenizer.json").map_err(E::msg)?;

    // 4. Encode the prompt
    let tokens = tokenizer.encode(args.prompt, true).map_err(E::msg)?;
    let prompt_tokens = tokens.get_ids();
    let mut all_tokens = prompt_tokens.to_vec();

    // 5. Run the whole prompt through the model once so the KV cache is
    //    populated, then sample the first generated token.
    println!("Generating response...\n");
    let to_generate = 100; // maximum number of new tokens
    let mut logits_processor = candle_transformers::generation::LogitsProcessor::new(299_792_458, None, None);
    print!("Response: ");
    std::io::stdout().flush()?;
    let input = Tensor::new(prompt_tokens, &device)?.unsqueeze(0)?;
    let logits = model.forward(&input, 0)?.squeeze(0)?;
    let mut next_token = logits_processor.sample(&logits)?;
    all_tokens.push(next_token);
    if let Ok(t) = tokenizer.decode(&[next_token], true) {
        print!("{}", t);
        std::io::stdout().flush()?;
    }
    // Phi-3 instruct models mark the end of an answer with <|end|>
    let eos_token = tokenizer.get_vocab(true).get("<|end|>").copied();

    // 6. Generation loop: feed one token at a time, reusing the KV cache
    for index in 0..to_generate {
        let input = Tensor::new(&[next_token], &device)?.unsqueeze(0)?;
        let logits = model.forward(&input, prompt_tokens.len() + index)?.squeeze(0)?;
        next_token = logits_processor.sample(&logits)?;
        all_tokens.push(next_token);
        if Some(next_token) == eos_token {
            break;
        }
        if let Ok(t) = tokenizer.decode(&[next_token], true) {
            print!("{}", t);
            std::io::stdout().flush()?;
        }
    }

    println!("\n\nDone!");
    Ok(())
}
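One refinement worth knowing about: Phi-3 instruct models are trained on a chat template, so wrapping the raw prompt usually gives noticeably better answers than plain text. A minimal helper, assuming the template documented on the Phi-3 model card (verify it against the GGUF variant you downloaded):
/// Wrap a raw user prompt in the Phi-3 instruct chat template.
/// The markers below are an assumption based on the Phi-3 model card;
/// double-check them for your particular GGUF file.
fn format_phi3_prompt(user_prompt: &str) -> String {
    format!("<|user|>\n{user_prompt}<|end|>\n<|assistant|>\n")
}
Calling this on args.prompt before encoding is enough to switch to instruct-style behavior.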
Step 3: Obtaining the Model Weights
Download the quantized Phi‑3 Mini model (GGUF format) from Hugging Face:
- Model file: Phi-3-mini-4k-instruct-q4.gguf (≈ 2.3 GB, the q4_k_m quantization variant)
- Tokenizer: tokenizer.json (available in the official repository)
Place both files in the project root.
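If you would rather fetch the files programmatically than click through the website, the hf-hub crate (an extra dependency, not listed in the Cargo.toml above) can download them from the Hub. The repository names below are an assumption based on where the official quantized weights and tokenizer are currently published, so verify them before relying on this sketch:
use hf_hub::api::sync::Api;

/// Download the quantized weights and the tokenizer from the Hugging Face Hub
/// and return the local cache paths of both files.
fn fetch_phi3_files() -> anyhow::Result<(std::path::PathBuf, std::path::PathBuf)> {
    let api = Api::new()?;
    let model_path = api
        .model("microsoft/Phi-3-mini-4k-instruct-gguf".to_string())
        .get("Phi-3-mini-4k-instruct-q4.gguf")?;
    let tokenizer_path = api
        .model("microsoft/Phi-3-mini-4k-instruct".to_string())
        .get("tokenizer.json")?;
    Ok((model_path, tokenizer_path))
}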
Step 4: Running the Demo
Compile in release mode for optimal performance:
cargo run --release -- \
--model-path "Phi-3-mini-4k-instruct-q4.gguf" \
--prompt "whatever you want:"
You should see tokens streaming to the console almost instantly, without the long startup delay typical of Python‑based inference.

On a typical laptop (e.g., Intel i5‑8100Y), the inference runs smoothly entirely on the CPU. The deployment artifact (binary + runtime) is dramatically smaller than a comparable Python/Docker setup.
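If you want a concrete number for your own machine rather than an impression, timing the generation loop is enough. A small sketch (a hypothetical helper, not part of the listing in Step 2):
use std::time::Instant;

/// Print generation throughput. Capture Instant::now() just before the
/// generation loop and pass in the number of newly generated tokens.
fn report_throughput(start: Instant, generated_tokens: usize) {
    let secs = start.elapsed().as_secs_f64();
    println!(
        "{generated_tokens} tokens in {secs:.2}s ({:.2} tokens/s)",
        generated_tokens as f64 / secs
    );
}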
Conclusion
Python remains the premier ecosystem for model training and experimentation. However, for production deployments—especially on edge, IoT, or serverless platforms where startup time and memory footprint matter—Rust combined with Candle offers a compelling alternative. By leveraging quantized GGUF models like Phi‑3, you can achieve fast, low‑overhead inference on standard CPUs.