When Deep Learning Meets the Devil's Wheel: RL for European Roulette (Part 1)

Published: December 12, 2025 at 09:25 PM EST
4 min read
Source: Dev.to

Part 1: The Theory, The Math, and The Architecture

Disclaimer

If you somehow manage to turn a profit with any of these techniques, I’m expecting my cut. Seriously, a beer or a commit to the repo will do. The house always wins, but at least we’re learning something cool along the way.

Why Build an RL Agent for European Roulette?

European roulette has a house edge of about 2.7 %. The casino wins in the long run, period. The challenge isn’t about beating the house; it’s about pushing the boundaries of reinforcement learning (RL) when faced with pure randomness, catastrophic noise, and a 47‑dimensional action space where most decisions lead to losses. If an agent can learn anything in such a hostile environment, imagine what it could do where exploitable patterns exist.
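
For context, that 2.7 % falls straight out of the payouts: a one‑unit straight bet has expected value (1/37)·35 + (36/37)·(−1) = −1/37 ≈ −2.7 %, and every standard bet on a European wheel works out to the same −1/37 per unit staked (e.g., red/black: (18/37)·1 + (19/37)·(−1) = −1/37).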

Action Space

European roulette has 37 pockets (0‑36). When you include outside bets, the action space expands to 47 discrete actions:

| Action | Description | Payout |
|--------|-------------|--------|
| 0–36 | Straight bet on a single number | 35:1 |
| 37–38 | Color bet (Red/Black) | 1:1 |
| 39–40 | Parity (Odd/Even) | 1:1 |
| 41–42 | High/Low (1–18 vs. 19–36) | 1:1 |
| 43–45 | Dozens (First/Second/Third twelve) | 2:1 |
| 46 | PASS – “don’t bet” (often the smartest move) | 0 |

The agent must decide not only where to bet but also implicitly how much risk to take.
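
As a rough sketch of how that action space might be encoded (my own illustration; the names and layout are not taken from the post's repo):

# Hypothetical lookup table: action index -> (bet type, target, payout multiplier)
ACTIONS = {n: ("straight", n, 35) for n in range(37)}   # 0-36: single numbers
ACTIONS[37] = ("color", "red", 1)
ACTIONS[38] = ("color", "black", 1)
ACTIONS[39] = ("parity", "odd", 1)
ACTIONS[40] = ("parity", "even", 1)
ACTIONS[41] = ("range", "low", 1)      # 1-18
ACTIONS[42] = ("range", "high", 1)     # 19-36
ACTIONS[43] = ("dozen", 1, 2)          # 1-12
ACTIONS[44] = ("dozen", 2, 2)          # 13-24
ACTIONS[45] = ("dozen", 3, 2)          # 25-36
ACTIONS[46] = ("pass", None, 0)
assert len(ACTIONS) == 47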

State Representation

  1. History Buffer – The last 20 spins (integers 0‑36). This gives sequence‑based learners something to condition on, even though statistically the history carries no information about the next spin.
  2. Gain Ratio – Current bankroll divided by the initial bankroll. This contextualizes decisions (e.g., play conservatively when ahead, go aggressive when behind).

Including the bankroll context proved essential; without it, agents made identical decisions regardless of being up 50 % or down 80 %.
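
A minimal sketch of how the observation could be assembled (function and variable names here are placeholders, not the author's code):

import numpy as np

HISTORY_LEN = 20

def build_state(history, bankroll, initial_bankroll):
    """Pack the last 20 spins and the gain ratio into one observation."""
    # Left-pad with zeros until 20 spins exist (an assumption; the original
    # implementation may handle the warm-up phase differently).
    recent = list(history[-HISTORY_LEN:])
    spins = np.array([0] * (HISTORY_LEN - len(recent)) + recent, dtype=np.int64)
    gain_ratio = np.array([bankroll / initial_bankroll], dtype=np.float32)
    return spins, gain_ratio

# Example: four spins seen so far, bankroll up 10 %
spins, gain = build_state([17, 0, 32, 5], bankroll=110.0, initial_bankroll=100.0)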

Reward Structure

| Outcome | Reward |
|---------|--------|
| Win on straight bet | +35 |
| Win on red/black, odd/even, high/low | +1 |
| Lose any bet | −1 |
| PASS action | 0 |

Episodes typically look like -1, -1, -1, +1, -1, -1, … – sparse positive rewards, dense negatives.
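
A sketch of that reward scheme as a function (the dozens case isn't listed in the table above, so the +2 there is my assumption based on the 2:1 payout):

def compute_reward(action_type, won):
    """Translate a resolved bet into the reward scheme above."""
    if action_type == "pass":
        return 0
    if not won:
        return -1
    if action_type == "straight":
        return 35
    if action_type == "dozen":
        return 2    # assumption: mirrors the 2:1 payout; not stated in the post
    return 1        # red/black, odd/even, high/low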

Network Architecture (BatchNorm‑Based DQN)

Spin‑history branch:
  Embedding (0‑36 → 64‑dim)
  → Flatten (20 × 64 = 1280)
  → BatchNorm Dense (1280 → 128)
  → BatchNorm Dense (128 → 128)
Gain branch:
  Gain ratio (1) → Dense (1 → 32)
Merged head:
  Concatenate (128 + 32 = 160)
  → Dense (160 → 64)
  → Dense (64 → 47) → Q‑values for all 47 actions
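
A PyTorch reconstruction of that layout (layer sizes follow the diagram above; activation placement and other details are my assumptions, not the author's exact code):

import torch
import torch.nn as nn

class RouletteDQN(nn.Module):
    def __init__(self, n_actions=47, history_len=20, embed_dim=64):
        super().__init__()
        self.embed = nn.Embedding(37, embed_dim)            # pockets 0-36 -> 64-dim
        self.history_net = nn.Sequential(
            nn.Linear(history_len * embed_dim, 128),        # 20 x 64 = 1280 -> 128
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 128),
            nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.gain_net = nn.Sequential(nn.Linear(1, 32), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(128 + 32, 64), nn.ReLU(),             # concat: 160 -> 64
            nn.Linear(64, n_actions),                       # Q-values for 47 actions
        )

    def forward(self, spins, gain_ratio):
        # spins: (batch, 20) int64; gain_ratio: (batch, 1) float32
        h = self.embed(spins).flatten(start_dim=1)          # (batch, 1280)
        h = self.history_net(h)
        g = self.gain_net(gain_ratio)
        return self.head(torch.cat([h, g], dim=1))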

Why BatchNorm over LSTM?
Roulette spins are independent; there’s no temporal dependency to model. LSTMs end up trying to fit noise, leading to unstable Q‑value estimates. BatchNorm normalizes activations within each mini‑batch, smoothing the gradient landscape and accelerating convergence. In practice, the BatchNorm version converged faster and produced a more stable policy.

Double DQN Update

Standard DQN tends to overestimate Q‑values because the same network selects and evaluates actions. Double DQN mitigates this bias by using the online network for action selection and the target network for evaluation:

# Double DQN target computation
target = r + γ * Q_target(s_next, argmax_a Q_online(s_next, a))
Q(s, a) ← Q(s, a) + α * (target - Q(s, a))

Overestimation is especially problematic in roulette, where most actions lose money; the agent might otherwise cling to the belief that betting is profitable, delaying the discovery that PASS is often optimal.
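
In PyTorch terms, the same target computation might look like this (a sketch assuming the RouletteDQN module above and a standard replay‑buffer batch; tensor names are mine):

import torch
import torch.nn.functional as F

def double_dqn_loss(online_net, target_net, batch, gamma=0.99):
    spins, gain, actions, rewards, next_spins, next_gain, done = batch
    with torch.no_grad():
        # Online network selects the greedy next action...
        next_actions = online_net(next_spins, next_gain).argmax(dim=1, keepdim=True)
        # ...while the target network evaluates it.
        next_q = target_net(next_spins, next_gain).gather(1, next_actions).squeeze(1)
        td_target = rewards + gamma * next_q * (1.0 - done)
    q = online_net(spins, gain).gather(1, actions.unsqueeze(1)).squeeze(1)
    return F.smooth_l1_loss(q, td_target)

This loss is then minimized by gradient descent on the online network, which plays the role of the tabular update written above.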

LSTM Predictor (Anomaly Detector)

A separate LSTM model was built solely to predict the next spin, not to drive betting decisions:

Embedding (0‑36 → 32‑dim)  
→ LSTM (64 hidden)  
→ LSTM (64 hidden)  
→ Dropout (p=0.2)  
→ Linear → Softmax (37 classes)

Training details

  • Loss: Cross‑entropy
  • Optimizer: Adam
  • Batch size: 64
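
Putting the architecture and training details together, the predictor might be defined roughly like this (a reconstruction; the post lists Dropout after the second LSTM, which is mirrored here with a separate dropout layer):

import torch
import torch.nn as nn

class SpinPredictor(nn.Module):
    """LSTM that outputs logits over the next pocket (37 classes)."""
    def __init__(self, embed_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(37, embed_dim)       # 0-36 -> 32-dim
        self.lstm1 = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.drop = nn.Dropout(0.2)
        self.out = nn.Linear(hidden, 37)

    def forward(self, spins):                          # spins: (batch, seq_len) int64
        h, _ = self.lstm1(self.embed(spins))
        h, _ = self.lstm2(h)
        return self.out(self.drop(h[:, -1]))           # softmax is folded into the loss

model = SpinPredictor()
criterion = nn.CrossEntropyLoss()                      # cross-entropy, as listed above
optimizer = torch.optim.Adam(model.parameters())       # Adam; batch size 64 in the post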

On truly random data, accuracy hovers around 2.7 % (1/37), as expected. When a tiny bias (e.g., a 0.5 % higher probability for certain sectors) is introduced, the LSTM begins to detect it after a few hundred spins. It doesn’t generate profit, but it serves as an effective anomaly detector and can be incorporated into a multi‑model ensemble.

Hardware & Performance

  • GPU: RTX 3060
  • Training speed: ~15 ms per batch of 64 sequences
  • PyTorch: model.to('cuda') enables GPU acceleration (see the snippet below); CPU‑only training would be prohibitively slow.
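
For reference, the usual device‑selection idiom with a CPU fallback, using the RouletteDQN sketch from earlier:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = RouletteDQN().to(device)   # same pattern for the SpinPredictor
# Each training batch must be moved to the same device before the forward pass.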

Meta‑Learning Idea (Preview)

Instead of a single agent learning actions directly, a meta‑agent could learn when to trust different sub‑models (e.g., the BatchNorm DQN, the LSTM predictor, bias detectors). This hierarchical approach is explored in later parts of the series.
