Running Microsoft's Phi-3 on CPU with Rust and Candle

Published: December 3, 2025, 18:56 (GMT+8)


Source: Dev.to

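Introduction

Python is currently the best tool for training machine learning models, with a rich ecosystem that includes PyTorch and Hugging Face Transformers. For inference in production or on edge devices, however, Python's overhead (large Docker images, slow cold starts, and high memory usage) can become an obstacle.

Rust and Candle offer a way to run large language models (LLMs) with far less overhead. Candle is a minimalist machine learning framework from Hugging Face that supports quantized GGUF models, so you can deploy a state-of-the-art model such as Microsoft Phi-3 on an ordinary CPU without a GPU.

In this guide, you will learn how to:

Remove the heavy PyTorch dependency.
Load a quantized Phi-3 model directly in Rust.
Build a standalone, lightweight CLI tool for fast CPU inference.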

Step 1: Set up the project

cargo new rust-phi3-cpu
cd rust-phi3-cpu

Add the Candle stack and the other dependencies to Cargo.toml:

[package]
name = "rust-phi3"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1.0"
tokenizers = "0.19.1"
clap = { version = "4.4", features = ["derive"] }

candle-core = { git = "https://github.com/huggingface/candle.git", branch = "main" }
candle-transformers = { git = "https://github.com/huggingface/candle.git", branch = "main" }
candle-nn = { git = "https://github.com/huggingface/candle.git", branch = "main" }

candle-transformers includes built-in support for GGUF (quantized) models, which is the key to efficient CPU inference.
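To give a feel for what "quantized" means here, the following is a minimal sketch of block-wise 4-bit quantization using only the standard library. It is a simplified illustration under assumed names (`Block4`, `quantize_block`, `dequantize_block` are hypothetical), not the actual GGUF q4_k_m layout, which is more elaborate. The idea it shows is the same, though: store small integer codes plus a per-block scale instead of full 32-bit floats.

```rust
/// One block of weights stored as 4-bit codes plus a single f32 scale.
/// Simplified illustration only; real GGUF block formats differ.
struct Block4 {
    scale: f32,
    codes: Vec<u8>, // two 4-bit codes packed into each byte
}

/// Quantize a slice of weights to signed 4-bit codes with one shared scale.
fn quantize_block(weights: &[f32]) -> Block4 {
    // The largest magnitude maps to code +/-7, so every weight fits in 4 bits.
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let mut codes = vec![0u8; (weights.len() + 1) / 2];
    for (i, w) in weights.iter().enumerate() {
        let q = (w / scale).round().clamp(-8.0, 7.0) as i8;
        let nibble = (q + 8) as u8; // shift into the unsigned range 0..=15
        if i % 2 == 0 {
            codes[i / 2] |= nibble; // low half of the byte
        } else {
            codes[i / 2] |= nibble << 4; // high half of the byte
        }
    }
    Block4 { scale, codes }
}

/// Recover approximate f32 weights from a quantized block.
fn dequantize_block(block: &Block4, len: usize) -> Vec<f32> {
    (0..len)
        .map(|i| {
            let byte = block.codes[i / 2];
            let nibble = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
            (nibble as i8 - 8) as f32 * block.scale
        })
        .collect()
}
```

Each weight costs half a byte plus a share of the scale, and the round trip loses at most about half a scale step per weight. That trade of a little precision for a much smaller memory footprint is what makes CPU inference practical.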

Step 2: Implementation

Put the following code in src/main.rs:

use anyhow::{Error as E, Result};
use clap::Parser;
use candle_transformers::models::quantized_phi3 as model; // Phi‑3 specific module
use candle_core::{Tensor, Device};
use candle_core::quantized::gguf_file;
use tokenizers::Tokenizer;
use std::io::Write;

#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
    /// Prompt for inference
    #[arg(short, long, default_value = "Physics is fun. Explain quantum physics to a 5-year-old in simple words:")]
    prompt: String,

    /// Path to the GGUF model file
    #[arg(long, default_value = "Phi-3-mini-4k-instruct-q4.gguf")]
    model_path: String,
}

fn main() -> Result<()> {
    let args = Args::parse();

    println!("Loading model from: {}", args.model_path);

    // 1. Setup device (CPU)
    let device = Device::Cpu;

    // 2. Load the GGUF model
    let mut file = std::fs::File::open(&args.model_path)
        .map_err(|_| E::msg(format!("Could not find model file at {}. Did you download it?", args.model_path)))?;

    let content = gguf_file::Content::read(&mut file)?;
    // Flash Attention disabled for CPU
    let mut model = model::ModelWeights::from_gguf(false, content, &mut file, &device)?;

    // 3. Load tokenizer
    println!("Loading tokenizer...");
    let tokenizer = Tokenizer::from_file("tokenizer.json").map_err(E::msg)?;

    // 4. Encode prompt
    let tokens = tokenizer.encode(args.prompt, true).map_err(E::msg)?;
    let prompt_tokens = tokens.get_ids();
    let mut all_tokens = prompt_tokens.to_vec();

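    // 5. Inference loop
    println!("Generating response...\n");
    let to_generate = 100; // max new tokens
    // Fixed seed; with no temperature or top-p the processor picks the most likely token
    let mut logits_processor = candle_transformers::generation::LogitsProcessor::new(299_792_458, None, None);

    print!("Response: ");
    std::io::stdout().flush()?;

    // Feed the whole prompt once, starting at position 0, so the model sees its
    // full context, then sample the first new token from the resulting logits.
    let input = Tensor::new(prompt_tokens, &device)?.unsqueeze(0)?;
    let logits = model.forward(&input, 0)?.squeeze(0)?;
    let mut next_token = logits_processor.sample(&logits)?;

    for _ in 0..to_generate {
        all_tokens.push(next_token);
        if let Ok(t) = tokenizer.decode(&[next_token], true) {
            print!("{}", t);
            std::io::stdout().flush()?;
        }

        // Feed only the new token; its position is the count of tokens before it.
        let input = Tensor::new(&[next_token], &device)?.unsqueeze(0)?;
        let logits = model.forward(&input, all_tokens.len() - 1)?.squeeze(0)?;
        next_token = logits_processor.sample(&logits)?;
    }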

    println!("\n\nDone!");
    Ok(())
}
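A common pattern for generation with a key-value cache is to process the whole prompt once at position 0 and then feed one new token at a time at the next free position. The sketch below isolates that bookkeeping using only the standard library. `generate_greedy` and its `forward` closure are hypothetical stand-ins for a cached model, not Candle APIs, and `argmax` is meant to mirror what a sampler configured without a temperature typically does.

```rust
/// Index of the largest logit, i.e. the greedy choice. Returns None for empty input.
fn argmax(logits: &[f32]) -> Option<u32> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
}

/// Greedy generation against any `forward(new_tokens, start_pos) -> logits` function.
/// `forward` receives only the tokens it has not seen yet, plus the position
/// of the first of them, and returns logits for the next token.
fn generate_greedy<F>(prompt: &[u32], max_new: usize, mut forward: F) -> Vec<u32>
where
    F: FnMut(&[u32], usize) -> Vec<f32>,
{
    let mut out = Vec::new();
    if prompt.is_empty() || max_new == 0 {
        return out;
    }
    // Prefill: the whole prompt, starting at position 0.
    let mut logits = forward(prompt, 0);
    let mut pos = prompt.len();
    while out.len() < max_new {
        let tok = match argmax(&logits) {
            Some(t) => t,
            None => break,
        };
        out.push(tok);
        if out.len() == max_new {
            break;
        }
        // Decode: only the newest token, at the next free position.
        logits = forward(&[tok], pos);
        pos += 1;
    }
    out
}
```

Passing the position explicitly is what lets a cached model place each new token correctly after the prompt instead of treating it as the start of a fresh sequence.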

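Step 3: Get the model weights

Download the quantized Phi-3 Mini model (GGUF format) from Hugging Face:

Model file: Phi-3-mini-4k-instruct-q4.gguf (about 2.3 GB, the q4_k_m.gguf variant)
Tokenizer: tokenizer.json (available in the official repository)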
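As a rough sanity check on the download size, the arithmetic can be sketched as parameters times bits per weight. Both inputs used here are assumptions rather than figures from this article: roughly 3.8 billion parameters for Phi-3 Mini, and an average of about 4.85 bits per weight for a q4_k_m-style file. `approx_size_gb` is a hypothetical helper.

```rust
/// Approximate file size in gigabytes (10^9 bytes) for a model with `params`
/// weights stored at an average of `bits_per_weight` bits each.
/// Ignores metadata and the tokenizer, so treat the result as a ballpark.
fn approx_size_gb(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0 / 1e9
}
```

Under those assumptions, `approx_size_gb(3.8e9, 4.85)` comes out near 2.3, in line with the file size listed above, while the same model at 16 bits per weight (`approx_size_gb(3.8e9, 16.0)`) would be about 7.6.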

Place both files in the project root directory.
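A failed or partial download (an HTML error page or a Git LFS pointer file, for instance) can sit in place of the real model. GGUF files begin with the four ASCII bytes `GGUF`, so a quick check before loading can catch this early. The helper names below are hypothetical; this is a sketch using only the standard library.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

/// True if the given bytes start with the GGUF magic.
fn has_gguf_magic(header: &[u8]) -> bool {
    header.starts_with(b"GGUF")
}

/// Open `path` and check its first four bytes.
/// Returns an error if the file is missing or shorter than four bytes.
fn check_model_file(path: &Path) -> std::io::Result<bool> {
    let mut header = [0u8; 4];
    let mut file = File::open(path)?;
    file.read_exact(&mut header)?;
    Ok(has_gguf_magic(&header))
}
```

Calling `check_model_file(Path::new("Phi-3-mini-4k-instruct-q4.gguf"))` from the project root should return `Ok(true)` for a valid download.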

Step 4: Run the demo

Compile in release mode for best performance:

cargo run --release -- \
    --model-path "Phi-3-mini-4k-instruct-q4.gguf" \
    --prompt "whatever you want:"

You should see tokens stream to the console almost immediately, without the long startup delay that is typical of Python inference.
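To put a number on how quickly tokens arrive, one option is to time the generation step with `std::time::Instant`. The helpers below are hypothetical and use only the standard library. `time_generation` takes any closure that runs generation and returns how many tokens it produced.

```rust
use std::time::{Duration, Instant};

/// Tokens per second for `tokens` produced over `elapsed`.
/// Returns 0.0 when no time has elapsed, to avoid dividing by zero.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 {
        tokens as f64 / secs
    } else {
        0.0
    }
}

/// Run `run`, which should return the number of tokens it generated,
/// and report that count together with the measured throughput.
fn time_generation<F: FnMut() -> usize>(mut run: F) -> (usize, f64) {
    let start = Instant::now();
    let tokens = run();
    (tokens, tokens_per_second(tokens, start.elapsed()))
}
```

Measuring model loading separately from generation gives a clearer picture, since loading the weights is a one-time startup cost.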


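On an ordinary laptop (for example, an Intel i5-8100Y), inference runs smoothly and entirely on the CPU. Compared with an equivalent Python/Docker setup, the deployment artifact (binary plus runtime) is much smaller.

Conclusion

Python remains the ecosystem of choice for model training and experimentation. For production deployment, however, especially on edge, IoT, or serverless platforms where startup time and memory footprint matter, Rust and Candle offer a compelling alternative. By using quantized GGUF models such as Phi-3, you can get fast, low-overhead inference on an ordinary CPU.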
