Running Microsoft's Phi-3 on CPU with Rust & Candle

Published: December 3, 2025 at 05:56 AM EST
3 min read
Source: Dev.to

Introduction

Python is currently the best tool for training machine learning models, with a rich ecosystem that includes PyTorch and Hugging Face Transformers. However, when it comes to inference in production environments or on edge devices, Python’s overhead (large Docker images, slow cold starts, and high memory usage) can be prohibitive.

Rust and Candle provide a way to run large language models (LLMs) with far less overhead. Candle is a minimalistic ML framework from Hugging Face that supports quantized GGUF models, allowing you to deploy state‑of‑the‑art models like Microsoft Phi‑3 on a standard CPU without a GPU.

In this guide you will learn how to:

  • Remove the heavy PyTorch dependency.
  • Load a quantized Phi‑3 model directly in Rust.
  • Build a standalone, lightweight CLI tool for fast CPU inference.

Step 1: Setting Up the Project

cargo new rust-phi3-cpu
cd rust-phi3-cpu

Add the Candle stack and other dependencies to Cargo.toml:

[package]
name = "rust-phi3"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1.0"
tokenizers = "0.19.1"
clap = { version = "4.4", features = ["derive"] }

candle-core = { git = "https://github.com/huggingface/candle.git", branch = "main" }
candle-transformers = { git = "https://github.com/huggingface/candle.git", branch = "main" }
candle-nn = { git = "https://github.com/huggingface/candle.git", branch = "main" }

candle-transformers includes built‑in support for GGUF (quantized) models, which is the key to efficient CPU inference.
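
Because the candle crates above track the main branch, builds are not pinned; for reproducible builds you can point at a tagged Candle release instead. It is also worth tightening the release profile before benchmarking. A possible addition to Cargo.toml (the mkl/accelerate hints refer to optional Candle features and are not required for this demo):

[profile.release]
lto = true          # link-time optimization: smaller, faster binary
codegen-units = 1   # trade compile time for better optimization

# Optional: on Intel CPUs, enabling Candle's "mkl" feature (or "accelerate" on
# macOS) on the candle-* dependencies above can speed up the CPU kernels.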

Step 2: Implementation

Create src/main.rs with the following code:

use anyhow::{Error as E, Result};
use clap::Parser;
use candle_transformers::models::quantized_phi3 as model; // Phi‑3 specific module
use candle_core::{Tensor, Device};
use candle_core::quantized::gguf_file;
use tokenizers::Tokenizer;
use std::io::Write;

#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
    /// Prompt for inference
    #[arg(short, long, default_value = "Physics is fun. Explain quantum physics to a 5-year-old in simple words:")]
    prompt: String,

    /// Path to the GGUF model file
    #[arg(long, default_value = "Phi-3-mini-4k-instruct-q4.gguf")]
    model_path: String,
}

fn main() -> Result<()> {
    let args = Args::parse();

    println!("Loading model from: {}", args.model_path);

    // 1. Setup device (CPU)
    let device = Device::Cpu;

    // 2. Load the GGUF model
    let mut file = std::fs::File::open(&args.model_path)
        .map_err(|_| E::msg(format!("Could not find model file at {}. Did you download it?", args.model_path)))?;

    let content = gguf_file::Content::read(&mut file)?;
    // Flash Attention disabled for CPU
    let mut model = model::ModelWeights::from_gguf(false, content, &mut file, &device)?;

    // 3. Load tokenizer
    println!("Loading tokenizer...");
    let tokenizer = Tokenizer::from_file("tokenizer.json").map_err(E::msg)?;

    // 4. Encode prompt
    let tokens = tokenizer.encode(args.prompt, true).map_err(E::msg)?;
    let prompt_tokens = tokens.get_ids();
    let mut all_tokens = prompt_tokens.to_vec();

    // 5. Prompt processing: feed the prompt through the model one token at a
    //    time so the KV cache holds the full context before generation starts.
    println!("Generating response...\n");
    let to_generate = 100usize; // number of additional tokens to generate
    let mut logits_processor =
        candle_transformers::generation::LogitsProcessor::new(299_792_458, None, None); // greedy (argmax) sampling since temperature is None

    print!("Response: ");
    std::io::stdout().flush()?;

    let mut next_token = 0u32;
    for (pos, &token) in prompt_tokens.iter().enumerate() {
        let input = Tensor::new(&[token], &device)?.unsqueeze(0)?;
        let logits = model.forward(&input, pos)?.squeeze(0)?;
        if pos == prompt_tokens.len() - 1 {
            // Sample the first generated token from the last prompt position.
            next_token = logits_processor.sample(&logits)?;
        }
    }
    all_tokens.push(next_token);
    if let Ok(t) = tokenizer.decode(&[next_token], true) {
        print!("{}", t);
        std::io::stdout().flush()?;
    }

    // 6. Generation loop: feed back one token at a time, advancing the position index.
    for index in 0..to_generate {
        let input = Tensor::new(&[next_token], &device)?.unsqueeze(0)?;
        let logits = model.forward(&input, prompt_tokens.len() + index)?;
        let logits = logits.squeeze(0)?;

        next_token = logits_processor.sample(&logits)?;
        all_tokens.push(next_token);

        if let Ok(t) = tokenizer.decode(&[next_token], true) {
            print!("{}", t);
            std::io::stdout().flush()?;
        }
    }

    println!("\n\nDone!");
    Ok(())
}
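
Note that the instruct checkpoints of Phi-3 are trained on a chat template, so a raw prompt can produce rambling completions. A minimal sketch of a helper that wraps the prompt in that template (the function name format_prompt is my own; the tags follow the Phi-3 model card):

/// Wraps a raw user prompt in the Phi-3 instruct chat template.
fn format_prompt(user_prompt: &str) -> String {
    format!("<|user|>\n{user_prompt}<|end|>\n<|assistant|>\n")
}

You would then encode format_prompt(&args.prompt) instead of the raw prompt and, in a fuller version, stop generating as soon as the model emits its <|end|> token (which you can look up with tokenizer.token_to_id("<|end|>")).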

Step 3: Obtaining the Model Weights

Download the quantized Phi‑3 Mini model (GGUF format) from Hugging Face:

  • Model file: Phi-3-mini-4k-instruct-q4.gguf (≈ 2.3 GB, Q4_K_M quantization)
  • Tokenizer: tokenizer.json (available in the official repository)

Place both files in the project root.
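
One way to fetch both files from the command line, assuming you have the huggingface-cli tool installed (pip install -U "huggingface_hub[cli]") and the files are still hosted under these repository names:

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf \
    Phi-3-mini-4k-instruct-q4.gguf --local-dir .
huggingface-cli download microsoft/Phi-3-mini-4k-instruct \
    tokenizer.json --local-dir .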

Step 4: Running the Demo

Compile in release mode for optimal performance:

cargo run --release -- \
    --model-path "Phi-3-mini-4k-instruct-q4.gguf" \
    --prompt "whatever you want:"

You should see tokens streaming to the console almost instantly, without the long startup delay typical of Python‑based inference.

Demo output

On a typical laptop (e.g., an Intel i5‑8100Y), inference runs smoothly entirely on the CPU, and the deployment artifact (binary + runtime) is dramatically smaller than that of a comparable Python/Docker setup.
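
If you want to verify the footprint yourself, the release build produces a single binary (its name follows the package name in Cargo.toml):

cargo build --release
ls -lh target/release/rust-phi3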

Conclusion

Python remains the premier ecosystem for model training and experimentation. However, for production deployments—especially on edge, IoT, or serverless platforms where startup time and memory footprint matter—Rust combined with Candle offers a compelling alternative. By leveraging quantized GGUF models like Phi‑3, you can achieve fast, low‑overhead inference on standard CPUs.
