2.78 TFLOPS on a Fanless MacBook Air? Benchmarking Apple's M4 with MLX

Published: March 19, 2026 at 01:58 AM EDT
3 min read
Source: Dev.to

A fanless M4 MacBook Air achieved a peak of 2.78 TFLOPS in a matrix‑multiplication benchmark using Apple’s MLX framework.
Matrix multiplication (GEMM) is the core operation behind modern machine‑learning models and large language models (LLMs). By measuring how quickly a device can multiply huge matrices, we gauge its raw AI‑compute capability.

Test Environment

  • Machine: M4 MacBook Air
  • Memory: 16 GB unified memory
  • Python: 3.10.11 (also works with 3.12.12)
  • Framework: MLX v0.28.0 (compatible with v0.31.1)

All computations were performed with the bfloat16 data type.

Benchmark Script

The benchmark script is available as a public Gist. To run it on any Apple‑Silicon Mac:

```shell
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install MLX
pip install mlx

# Run the benchmark (caffeinate prevents the Mac from sleeping mid-run)
caffeinate python matmul_benchmark.py
```

The script measures the execution time of a simple matrix multiplication (C = A·B) at several matrix sizes.
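The timing harness can be sketched as follows. This is a hypothetical reconstruction, not the author’s Gist: the `benchmark` and `tflops` helper names are mine, and the MLX calls in the trailing comment assume the standard `mlx.core` API (which only runs on Apple Silicon).

```python
import time

def benchmark(fn, runs=5):
    """Time fn() over `runs` consecutive runs; return (avg_seconds, all_times)."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times), times

def tflops(n, avg_seconds):
    """Real TFLOPS for an N x N matmul: ~2*N^3 operations / (time * 1e12)."""
    return 2 * n**3 / (avg_seconds * 1e12)

# On an Apple-Silicon Mac the timed workload would look like (after `pip install mlx`):
#   import mlx.core as mx
#   a = mx.random.normal((n, n), dtype=mx.bfloat16)
#   b = mx.random.normal((n, n), dtype=mx.bfloat16)
#   avg, _ = benchmark(lambda: mx.eval(a @ b))  # mx.eval forces MLX's lazy evaluation
```

Note the `mx.eval` call: MLX evaluates lazily, so without it the matmul would not actually execute inside the timed region.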

Results

Average times are based on 5 consecutive runs per size.

| Matrix Size | Avg Time (s) | Total Ops | Real TFLOPS |
|---|---|---|---|
| 10,000 × 10,000 | 0.740 | 2 × 10¹² | 2.70 |
| 20,000 × 20,000 | 5.754 | 16 × 10¹² | 2.78 |
| 30,000 × 30,000 | 21.624 | 54 × 10¹² | 2.50 |
| 40,000 × 40,000 | 75.872 | 128 × 10¹² | 1.69 |

Individual run times (seconds) for each size are included in the original table for reference.

Peak Performance

The highest measured performance was 2.78 TFLOPS at a matrix size of 20,000 × 20,000.

Thermal Behavior

For the 20,000 × 20,000 and 30,000 × 30,000 runs, execution times increased slightly across successive runs, suggesting the fanless chassis was beginning to throttle thermally.

Impact of Swap Memory

At 40,000 × 40,000 the execution time spiked and the run-to-run variance grew. With only 16 GB of unified memory, the system exhausted RAM and began swapping to the SSD, dramatically slowing the benchmark. Activity Monitor confirmed a surge in swap usage during this run.
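A back-of-envelope calculation makes the swap plausible. This sketch counts only the three bfloat16 matrices (A, B, and the result C) and ignores MLX’s internal buffers and the rest of the system’s memory use:

```python
def matrix_gib(n, bytes_per_element=2):
    """Memory footprint of an N x N matrix in GiB (bfloat16 = 2 bytes/element)."""
    return n * n * bytes_per_element / 2**30

per_matrix = matrix_gib(40_000)   # ~2.98 GiB each for A, B, and C
working_set = 3 * per_matrix      # ~8.9 GiB before any framework overhead
```

Nearly 9 GiB for the operands alone, on top of macOS, Python, and MLX’s own allocations, leaves little headroom in a 16 GB machine, so spilling to swap at this size is unsurprising.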

Calculating TFLOPS

For a square matrix multiplication C = A·B of size N × N:

Total Operations ≈ 2N³

Each output element requires N multiplications and N − 1 additions, i.e., roughly 2N operations per element. With N² output elements, the total is 2N³.

Example for N = 20,000:

Total Operations ≈ 2 × 20,000³ = 16 × 10¹² (16 trillion operations)

TFLOPS Formula

TFLOPS = Total Operations / (Execution Time (s) × 10¹²)

Applying the measured time for N = 20,000:

TFLOPS = (16 × 10¹²) / (5.754 × 10¹²) ≈ 2.78
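The same arithmetic reproduces every row of the results table. A quick check (the averaged times are taken from the table above):

```python
# Average seconds over 5 runs, keyed by matrix dimension N (from the results table).
measurements = {
    10_000: 0.740,
    20_000: 5.754,
    30_000: 21.624,
    40_000: 75.872,
}

for n, avg in measurements.items():
    ops = 2 * n**3                 # ~2N^3 operations for an N x N matmul
    tflops = ops / (avg * 1e12)    # convert ops/second to TFLOPS
    print(f"{n:,} x {n:,}: {tflops:.2f} TFLOPS")
```

The printed values (2.70, 2.78, 2.50, 1.69) match the Real TFLOPS column, confirming the table is internally consistent.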


The results demonstrate that even a fanless laptop can deliver multi‑teraflop AI performance locally, highlighting the impressive efficiency of Apple’s M4 silicon.
