2.78 TFLOPS on a Fanless MacBook Air? Benchmarking Apple's M4 with MLX

Published: March 19, 2026 at 01:58 AM EDT
3 min read
Source: Dev.to

A fanless M4 MacBook Air achieved a peak of 2.78 TFLOPS in a matrix‑multiplication benchmark using Apple’s MLX framework.
Matrix multiplication (GEMM) is the core operation behind modern machine‑learning models and large language models (LLMs). By measuring how quickly a device can multiply huge matrices, we gauge its raw AI‑compute capability.

Test Environment

  • Machine: M4 MacBook Air
  • Memory: 16 GB unified memory
  • Python: 3.10.11 (also works with 3.12.12)
  • Framework: MLX v0.28.0 (compatible with v0.31.1)

All computations were performed with the bfloat16 data type.

Benchmark Script

The benchmark script is available as a public Gist. To run it on any Apple‑Silicon Mac:

```shell
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install MLX
pip install mlx

# Run the benchmark (caffeinate prevents the Mac from sleeping mid-run)
caffeinate python matmul_benchmark.py
```

The script measures the execution time of a simple matrix multiplication (C = A·B) at several matrix sizes.
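The timing harness can be sketched as follows. This is a hypothetical reconstruction, not the author’s Gist: the `benchmark` and `tflops` helper names are mine, and the MLX calls in the trailing comment assume the standard `mlx.core` API (which only runs on Apple Silicon).

```python
import time

def benchmark(fn, runs=5):
    """Time fn() over `runs` consecutive runs; return (avg_seconds, all_times)."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times), times

def tflops(n, avg_seconds):
    """Real TFLOPS for an N x N matmul: ~2*N^3 operations / (time * 1e12)."""
    return 2 * n**3 / (avg_seconds * 1e12)

# On an Apple-Silicon Mac the timed workload would look like (after `pip install mlx`):
#   import mlx.core as mx
#   a = mx.random.normal((n, n), dtype=mx.bfloat16)
#   b = mx.random.normal((n, n), dtype=mx.bfloat16)
#   avg, _ = benchmark(lambda: mx.eval(a @ b))  # mx.eval forces MLX's lazy evaluation
```

Note the `mx.eval` call: MLX evaluates lazily, so without it the matmul would not actually execute inside the timed region.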

Results

Average times are based on 5 consecutive runs per size.

| Matrix Size | Avg Time (s) | Total Ops | Real TFLOPS |
|---|---|---|---|
| 10,000 × 10,000 | 0.740 | 2 × 10¹² | 2.70 |
| 20,000 × 20,000 | 5.754 | 16 × 10¹² | 2.78 |
| 30,000 × 30,000 | 21.624 | 54 × 10¹² | 2.50 |
| 40,000 × 40,000 | 75.872 | 128 × 10¹² | 1.69 |

Individual run times (seconds) for each size are included in the original table for reference.

Peak Performance

The highest measured performance was 2.78 TFLOPS at a matrix size of 20,000 × 20,000.

Thermal Behavior

For the 20,000 × 20,000 and 30,000 × 30,000 runs, execution times increased slightly across successive runs, suggesting the fanless chassis was beginning to throttle thermally.

Impact of Swap Memory

At 40,000 × 40,000 the execution time spiked and the run-to-run variance grew. With only 16 GB of unified memory, the system exhausted RAM and began swapping to the SSD, dramatically slowing the benchmark. Activity Monitor confirmed a surge in swap usage during this run.
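A back-of-envelope calculation makes the swap plausible. This sketch counts only the three bfloat16 matrices (A, B, and the result C) and ignores MLX’s internal buffers and the rest of the system’s memory use:

```python
def matrix_gib(n, bytes_per_element=2):
    """Memory footprint of an N x N matrix in GiB (bfloat16 = 2 bytes/element)."""
    return n * n * bytes_per_element / 2**30

per_matrix = matrix_gib(40_000)   # ~2.98 GiB each for A, B, and C
working_set = 3 * per_matrix      # ~8.9 GiB before any framework overhead
```

Nearly 9 GiB for the operands alone, on top of macOS, Python, and MLX’s own allocations, leaves little headroom in a 16 GB machine, so spilling to swap at this size is unsurprising.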

Calculating TFLOPS

For a square matrix multiplication C = A·B of size N × N:

Total Operations ≈ 2N³

Each output element requires N multiplications and N − 1 additions, i.e., roughly 2N operations per element. With N² output elements, the total is 2N³.

Example for N = 20,000:

Total Operations ≈ 2 × 20,000³ = 16 × 10¹² (16 trillion operations)

TFLOPS Formula

TFLOPS = Total Operations / (Execution Time (s) × 10¹²)

Applying the measured time for N = 20,000:

TFLOPS = (16 × 10¹²) / (5.754 × 10¹²) ≈ 2.78
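The same arithmetic reproduces every row of the results table. A quick check (the averaged times are taken from the table above):

```python
# Average seconds over 5 runs, keyed by matrix dimension N (from the results table).
measurements = {
    10_000: 0.740,
    20_000: 5.754,
    30_000: 21.624,
    40_000: 75.872,
}

for n, avg in measurements.items():
    ops = 2 * n**3                 # ~2N^3 operations for an N x N matmul
    tflops = ops / (avg * 1e12)    # convert ops/second to TFLOPS
    print(f"{n:,} x {n:,}: {tflops:.2f} TFLOPS")
```

The printed values (2.70, 2.78, 2.50, 1.69) match the Real TFLOPS column, confirming the table is internally consistent.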


The results demonstrate that even a fanless laptop can deliver multi‑teraflop AI performance locally, highlighting the impressive efficiency of Apple’s M4 silicon.
