Running llama-cli on Linux

Published: January 30, 2026 at 02:52 PM EST
Source: Dev.to

Overview

If you’ve experimented with local LLMs, you’ve likely used Ollama, LM Studio, or Jan.ai. These tools are excellent for accessibility, but as a Linux user you might want more control, less background “magic,” and higher performance.

We’ll go “under the hood,” moving from the wrapper (Ollama) to the engine (llama.cpp) to extract every bit of power from your local silicon.

Hierarchy Overview

llama.cpp → Engine: A C/C++ inference engine built on the ggml tensor library. It began as a Llama implementation and now runs many model architectures; this is the code that actually performs inference.
Ollama   → Wrapper: Bundles llama.cpp with a Go‑based management layer, a model registry, and a background service.

Why Switch to the CLI?

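Transparency: No hidden daemons. When the process exits, the memory it allocated is returned to the system.
Performance: Up to roughly 10–20 % faster token generation from a natively compiled build without wrapper overhead; the gain varies by hardware, so benchmark on your own machine.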
Hardware Mastery: You can explicitly target instruction sets like AVX‑512, which generic binaries often ignore for compatibility.

The Universal Baseline (Hardware)

To run modern 7B‑ or 8B‑parameter models (e.g., Llama 3.1 or Mistral) comfortably on Linux without a dedicated GPU, aim for:

  • PC manufactured in 2020 or newer
  • RAM: 12 GB or more (8 GB works but may cause swapping)
  • CPU: Intel 11th Gen or newer, or AMD Ryzen 5000 or newer
  • OS: Any modern Linux distro (Debian/Ubuntu preferred for simplicity)
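
As a rough sketch, the RAM guideline above can be checked from the shell. The helper name ram_verdict and the 11/7 GiB cut-offs are my own assumptions: a machine sold as "12 GB" usually reports slightly less in /proc/meminfo, so the thresholds sit one GiB below the figures in the list.

```shell
# ram_verdict is a hypothetical helper, not part of any tool.
# Argument: total memory in kB, as reported by MemTotal in /proc/meminfo.
ram_verdict() {
  gib=$(( $1 / 1024 / 1024 ))   # integer GiB, rounded down
  if [ "$gib" -ge 11 ]; then
    echo "comfortable"          # roughly the 12 GB class or better
  elif [ "$gib" -ge 7 ]; then
    echo "tight"                # roughly the 8 GB class, expect some swapping
  else
    echo "too small"            # consider a smaller model
  fi
}

# Read this machine's total memory; falls back to 0 if /proc/meminfo is missing.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
ram_verdict "${mem_kb:-0}"
```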

The Instruction Set Audit

CPU inference speed depends heavily on SIMD instruction sets, chiefly AVX (Advanced Vector Extensions). To list the AVX variants your CPU supports, one flag per line, run:

lscpu | grep -o -E 'avx[a-z0-9_]*' | sort -u

Look for:

  • AVX2 – the standard baseline.
  • AVX-512 / VNNI – the gold standard. If you see avx512_vnni, your CPU has dedicated instructions for the low-precision integer dot products that quantized models rely on.
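
If you want a one-word answer instead of reading the flag list, the tiers above can be sketched as a small classifier. The function name avx_level and its output labels are hypothetical; the flag names (avx, avx2, avx512f, avx512_vnni) are the ones the Linux kernel reports on x86.

```shell
# avx_level is a hypothetical helper: it takes a space-separated CPU flag
# string and prints the highest AVX tier found in it.
avx_level() {
  flags=" $1 "                       # pad so every flag is space-delimited
  case "$flags" in
    *" avx512_vnni "*) echo "avx512-vnni" ;;
    *" avx512f "*)     echo "avx512" ;;
    *" avx2 "*)        echo "avx2" ;;
    *" avx "*)         echo "avx" ;;
    *)                 echo "none" ;;
  esac
}

# On x86 Linux the flags are on the "flags" line of /proc/cpuinfo.
avx_level "$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)"
```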

Installation: Building for Your Silicon

We compile the engine ourselves to ensure it’s optimized for your CPU flags.

# Install build essentials
sudo apt update && sudo apt install -y build-essential cmake git wget
# Clone the repository
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
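# Build with native flags (auto‑detects AVX2/AVX‑512)
# If the configure step stops on a missing CURL library (some llama.cpp versions), install libcurl4-openssl-dev or add -DLLAMA_CURL=OFF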
cmake -B build -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j "$(nproc)"
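
Before moving on, it is worth confirming that the build actually produced the binary. A minimal sketch, assuming you are still inside the llama.cpp directory; has_llama_cli is a hypothetical helper, while the build/bin/llama-cli path follows from the cmake commands above.

```shell
# has_llama_cli is a hypothetical helper: succeeds if an executable
# llama-cli exists under <dir>/build/bin (default: current directory).
has_llama_cli() {
  [ -x "${1:-.}/build/bin/llama-cli" ]
}

if has_llama_cli .; then
  # Print the build version and compiler information.
  ./build/bin/llama-cli --version
else
  echo "llama-cli not found under ./build/bin; check the cmake output for errors"
fi
```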

Downloading the Model

llama.cpp uses the GGUF format. Instead of Ollama’s pull command, download the model directly from Hugging Face. For a machine with ~12 GB RAM, the Q4_K_M (4‑bit) quantization is a good balance between output quality and memory use; the file below is roughly 5 GB.

mkdir -p models
wget -O models/llama-3.1-8b-q4.gguf \
  https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

(If your hardware is more limited, consider a smaller model such as a 4B or 1.5B variant.)
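
A failed or redirected download can leave you with an HTML error page saved under a .gguf name. GGUF files begin with the four ASCII bytes "GGUF", so a quick sanity check is possible. This is only a sketch: is_gguf is a hypothetical helper, and it checks the file header, not the integrity of the whole file.

```shell
# is_gguf is a hypothetical helper: succeeds if the file starts with
# the 4-byte GGUF magic. It does not validate the rest of the file.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

model=models/llama-3.1-8b-q4.gguf
if is_gguf "$model"; then
  echo "$model looks like a GGUF file"
else
  echo "$model is missing or not GGUF (possibly an error page or a partial download)"
fi
```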

Operation: The Power‑User Commands

The primary binary is llama-cli. Run a local session with:

./build/bin/llama-cli \
  -m models/llama-3.1-8b-q4.gguf \
  -cnv \
  --color \
  --mlock \
  -t 4

Understanding the Flags

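  • -cnv: Enables conversation mode, which applies the model’s chat template automatically.
  • --color: Colorizes the output so your input is visually distinct from the model’s responses.
  • --mlock: Locks the model in physical RAM so it cannot be swapped out, which matters on memory-constrained laptops. If locking fails, raise the locked-memory limit (ulimit -l).
  • -t 4: Runs 4 threads. Set this to your number of physical cores rather than logical (hyperthreaded) ones.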
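
To pick a value for -t, you can count physical cores rather than logical CPUs. The sketch below assumes lscpu is available; count_physical_cores is a hypothetical helper that counts unique core/socket pairs in the output of lscpu -p=CORE,SOCKET. One thread per physical core is a common starting point, not a guaranteed optimum.

```shell
# count_physical_cores is a hypothetical helper. It reads the output of
# `lscpu -p=CORE,SOCKET` on stdin: comment lines start with '#', and each
# remaining line is "core,socket" for one logical CPU. Hyperthreads share
# a core/socket pair, so the number of unique lines is the physical count.
count_physical_cores() {
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

if command -v lscpu >/dev/null 2>&1; then
  threads=$(lscpu -p=CORE,SOCKET | count_physical_cores)
  echo "suggested flag: -t $threads"
fi
```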

Verification: Is It Working?

When you launch the command, check the first few lines for a system_info line similar to:

system_info: n_threads = 4 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 |

If AVX512 = 1, the binary was built with AVX-512 support and can use the widest vector instructions your CPU offers. On a CPU without AVX-512, AVX2 = 1 is the expected result. Either way, you are now running a private, natively compiled LLM with zero telemetry and full transparency.
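
If the line is long, a small filter can list only the features that are switched on. This is a sketch that assumes the "NAME = 1 | NAME = 0 |" layout shown above; the exact fields vary between llama.cpp versions, and enabled_features is a hypothetical helper.

```shell
# enabled_features is a hypothetical helper: given a system_info line,
# print the name of every "NAME = 1" entry, one per line.
# Counters such as n_threads are skipped via the n_ prefix check.
enabled_features() {
  printf '%s\n' "$1" | tr '|' '\n' | awk '
    NF >= 3 && $(NF-1) == "=" && $NF == "1" && $(NF-2) !~ /^n_/ {
      print $(NF-2)
    }'
}

line='system_info: n_threads = 4 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 |'
enabled_features "$line"
```

For the sample line above, this prints AVX, AVX2, AVX512, and AVX512_VNNI on separate lines.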

If you encounter issues, feel free to post them for help. Good luck!
