Chess engines do weird stuff

Published: February 17, 2026 at 12:07 PM EST

Source: Hacker News

Training method

Since AlphaZero, lc0‑style chess engines have been trained with reinforcement learning (RL). Specifically, you have the engine (search + model) play itself many times and train the model to predict the outcome of the game.

It turns out this isn't necessary. The gap between a good model and a bad model is only ~200 ELO, while search is worth ~1200 ELO (see the Stockfish nodes chart), so even a bad model + search is essentially an oracle relative to a good model without search. You can therefore distill from bad model + search → good model.
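A minimal sketch of what that distillation setup looks like. Everything here is an invented stand-in (the toy `Position` type, the eval functions, the `DistillExample` struct), not lc0's actual pipeline; the point is only that the training target is the searched score, not the raw model score:

```cpp
#include <cstdint>
#include <vector>

// Toy "position": an integer key standing in for a board state.
using Position = std::uint64_t;

// Hypothetical stand-in for a weak network's static evaluation (centipawns).
int weak_model_eval(Position p) {
    return static_cast<int>(p % 200) - 100;  // deliberately noisy
}

// Hypothetical stand-in for weak model + deep search: much closer to the truth.
int weak_model_plus_search_eval(Position p) {
    return static_cast<int>(p % 20) - 10;  // far more accurate oracle
}

struct DistillExample {
    Position pos;
    int target_cp;  // training target: the *searched* score, not the raw model score
};

// Build a distillation dataset by labeling each position with its searched
// score. A stronger model trained to regress target_cp learns to approximate
// "weak model + search" without ever playing RL self-play games.
std::vector<DistillExample> build_distill_set(const std::vector<Position>& positions) {
    std::vector<DistillExample> out;
    out.reserve(positions.size());
    for (Position p : positions)
        out.push_back({p, weak_model_plus_search_eval(p)});
    return out;
}
```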

So RL was necessary only once. Once a good model with search was trained, every future engine (including their competitors!)¹ can distill from it rather than generating its own games (which is expensive). lc0 trained its premier model, BT4, with distillation, and it got worse when put back into the RL loop.

Why is distillation from search so powerful?
People often compare this to distilling from best-of-N in RL, but the analogy undersells search: a chess engine that runs the model on 50 positions is roughly equivalent to a model 30× larger, whereas best-of-50 for an LLM is generously worth a model only ~2× larger. Perhaps this explains why test-time search was undervalued in the LLM world even while RL with verifiable rewards (RL-VR) was right under their noses.

Training at runtime

A recent technique is applying the distillation trick at runtime. At runtime you evaluate early positions with your neural network, then search them and get a more accurate picture. If your network says the position is +0.15 pawns better than the search says, subtract 0.15 pawns from future evaluations. Your network then adapts live to the position it’s in!
See the implementation in the Stockfish PR #4950.
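As a toy illustration of the idea (not Stockfish's actual code; the class name, the averaging scheme, and the window size are all invented here), one could keep a running estimate of how far the network's static eval drifts from the searched eval, and subtract that bias from fresh evaluations:

```cpp
#include <cstddef>
#include <deque>
#include <numeric>

// Sketch of runtime calibration: track (static eval - search eval) on
// positions we have already searched, and correct future static evals
// by the observed average bias.
class EvalCorrector {
public:
    // Record one searched position: how far off was the raw network?
    void observe(double static_eval_pawns, double search_eval_pawns) {
        errors_.push_back(static_eval_pawns - search_eval_pawns);
        if (errors_.size() > kWindow) errors_.pop_front();
    }

    // Correct a fresh static evaluation using the observed bias.
    double corrected(double static_eval_pawns) const {
        if (errors_.empty()) return static_eval_pawns;
        double bias = std::accumulate(errors_.begin(), errors_.end(), 0.0)
                      / static_cast<double>(errors_.size());
        return static_eval_pawns - bias;
    }

private:
    static constexpr std::size_t kWindow = 32;  // only adapt to recent positions
    std::deque<double> errors_;
};
```

If the network ran +0.15 pawns hot on searched positions, `corrected` shaves 0.15 off everything it scores next, which is the "adapts live to the position it's in" effect.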

Training on winning

The fundamental training objective of distilling from search is almost, but not quite, what we actually care about: winning. It's highly correlated, but what we really care about is how well the model performs after search, once it has looked at many positions.

To address this, lc0 uses a technique called SPSA (Simultaneous Perturbation Stochastic Approximation). You randomly perturb the weights in two opposite directions, play a bunch of games, and move in the direction that wins more². This works very well and can add +50 ELO on small models³.
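One SPSA step can be sketched as follows. This is illustrative, not lc0's code: `playout` here is a smooth made-up function standing in for "play a match and return the score", so the example is deterministic enough to test; in reality each call would be thousands of games:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Made-up stand-in for match results: weights closer to 1.0 "win" more.
double playout(const std::vector<double>& w) {
    double loss = 0.0;
    for (double x : w) loss += (x - 1.0) * (x - 1.0);
    return 1.0 / (1.0 + loss);  // higher is better
}

// One SPSA update: perturb every weight by ±delta along a random sign
// vector, evaluate both perturbed versions, and step toward whichever won.
void spsa_step(std::vector<double>& w, double delta, double lr, std::mt19937& rng) {
    std::bernoulli_distribution coin(0.5);
    std::vector<int> dir(w.size());
    for (auto& d : dir) d = coin(rng) ? 1 : -1;

    std::vector<double> plus = w, minus = w;
    for (std::size_t i = 0; i < w.size(); ++i) {
        plus[i]  += delta * dir[i];
        minus[i] -= delta * dir[i];
    }
    double gain = playout(plus) - playout(minus);  // > 0: the + direction won
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] += lr * gain * dir[i];
}
```

Note there is no gradient anywhere: the only signal is which of the two randomly perturbed versions scored higher.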

“Consider for a moment how insane it is that this works at all. You’re modifying the weights in purely random directions. You have no gradient whatsoever. And yet it works quite well! +50 ELO is ~1.5× model size or ~a year’s worth of development effort!”

The main issue is that SPSA is wildly expensive. A single step requires playing thousands of games, each with dozens of moves and hundreds of position inferences per move.

As with LLMs, you train for a long time on a pseudo-objective that's close to what you want, then for a short time on a very expensive and limited objective that's closer to the true goal.

Tuning through C++

The underlying SPSA technique can be applied to any numeric parameter in a chess program. Modify the number, see if the engine wins more or loses more, and move in the winning direction. For example, you might adjust the depth back-off applied when a checkmate is found in the search:

// Original back-off
if (checkmate_found) depth_backoff = 1;

// After SPSA tuning
if (checkmate_found) depth_backoff = 1.09; // optimal value found, yields +5 ELO

You can do this for every number in the search algorithm, effectively performing gradient descent through arbitrary C++ code because you have a grading function (winning).
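A toy version of such a parameter tune, assuming a made-up `match_score` function in place of real game results (the 1.09 optimum is baked in purely so the example has something to find):

```cpp
#include <random>

// Made-up model of "does this engine version win more?": the closer
// depth_backoff is to the (here, pretend-unknown) optimum 1.09, the
// higher the simulated match score. 0.5 = an even match.
double match_score(double depth_backoff) {
    double d = depth_backoff - 1.09;
    return 0.5 - d * d;
}

// SPSA on a single scalar: nudge the parameter up and down, "play" both
// versions, and move toward the one that scored better.
double tune_parameter(double value, int iterations, double delta, double lr) {
    std::mt19937 rng(7);
    std::bernoulli_distribution coin(0.5);
    for (int i = 0; i < iterations; ++i) {
        int dir = coin(rng) ? 1 : -1;
        double gain = match_score(value + delta * dir)
                    - match_score(value - delta * dir);
        value += lr * gain * dir;  // step toward the winning direction
    }
    return value;
}
```

The "grading function" here is just match outcomes, which is what lets this act like gradient descent through code that has no gradients.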

Weird architecture

lc0 uses a standard-ish transformer architecture, which they found to be hundreds of ELO better than their older convolution-based models. The only substantial architectural change they employ is "smolgen," a system for generating attention biases. They claim smolgen incurs a ~1.2× throughput hit but provides an accuracy gain equivalent to a 2.5× larger model. Why it works so well remains unclear.

Footnotes

  1. Their primary competitor, Stockfish, uses lc0's data, as do most engines. Some engines don't, mostly because they care about originality.

  2. SPSA works by taking a particular part of the weights, randomly choosing +1 or −1 for each parameter to form a direction tensor, creating two network versions (weight += direction and weight -= direction), playing them against each other, and updating the weights in the winning direction.

  3. Up to about 15 ELO on larger models.
