nareshkarthigeyan

Transformer Chess

Chasing some marginal gains and tiny improvements

Jun 21, 2026

Six hours. That’s how long I baked my fanless M3 MacBook Air, running 8 full epochs over 553,854 unique board states.

But it was completely worth it (sort of?)

In the last part, the model was playing like a total potato. It managed an accidental checkmate against a random move bot, but it got utterly obliterated by Stockfish at depth 1 and drew 18 times against literal random noise because it kept getting caught in infinite king-shuffling loops. The diagnosis was pretty clear: dimensional blindness (and lack of data, but that's for later). Flattening an 8×8 grid into a simple 1D line of 64 tokens completely destroyed the spatial geometry. The transformer had no inherent concept that columns A and B were adjacent, or of how pieces coordinate across the board. It had to guess the rules of 2D geometry completely from scratch. For Phase 2, I added two structural pillars to force spatial intelligence into the network.

Fusing 2D Grid Vectors

Instead of a generic 1D positional index, I completely refactored src/model.py to use distinct Rank (Row) and File (Column) Embeddings. By summing them together during the forward pass, the network structurally inherits a geometric coordinate grid.

rows = self.row_embedding(self.row_indices)
cols = self.col_embedding(self.col_indices)
x = x_emb + rows + cols
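To make the indexing concrete, here is a minimal pure-Python sketch of how the rank and file lookups could be derived for a flattened board. It assumes square 0 is a1 and square 63 is h8 (the python-chess ordering), which may not match src/model.py. The tiny hand-written `row_table` and `col_table` are hypothetical stand-ins for the learned embedding weights.

```python
# Sketch only: rank/file indices for a board flattened into 64 tokens.
# Assumption: square 0 = a1, square 63 = h8 (python-chess ordering).
NUM_SQUARES = 64

row_indices = [sq // 8 for sq in range(NUM_SQUARES)]  # rank index 0..7
col_indices = [sq % 8 for sq in range(NUM_SQUARES)]   # file index 0..7 (a..h)

# Hypothetical 2-dimensional "embedding tables" standing in for learned weights.
row_table = [[float(r), 0.0] for r in range(8)]
col_table = [[0.0, float(c)] for c in range(8)]


def square_name(sq):
    """Algebraic name of a square index, e.g. 0 -> 'a1'."""
    return "abcdefgh"[col_indices[sq]] + str(row_indices[sq] + 1)


def positional_vector(sq):
    """Sum of the row and column vectors, mirroring `rows + cols`."""
    row_vec = row_table[row_indices[sq]]
    col_vec = col_table[col_indices[sq]]
    return [r + c for r, c in zip(row_vec, col_vec)]


# h1 (7) and a2 (8) are neighbours in the 1D sequence but far apart on the
# board; the summed vector keeps them apart because the file index differs by 7.
for sq in (0, 1, 7, 8):
    print(square_name(sq), positional_vector(sq))
```

With a single 1D index, squares 7 and 8 look like neighbours. With separate row and column lookups, a1 and b1 share a row vector and differ by one step in the column vector, which is the adjacency the model previously had to discover on its own.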

Forcing attention exclusively onto legal outcomes

The second pillar was Training-Time Legal Move Masking. In Phase 1, the classification head calculated gradients across all 4,096 possible move slots. The model was wasting massive capacity trying to learn not to jump a pawn backward over its own rook. For Phase 2, I updated the loop to parse raw FEN strings on the fly, build a binary mask of legal moves, and apply it as a heavy penalty (−∞) directly to the logits before computing the Cross-Entropy loss.
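To see why a −∞ mask takes illegal moves out of the loss entirely, here is a toy pure-Python version of the idea, not the actual training loop. `masked_cross_entropy` and `move_index` are hypothetical helper names, and the three legal moves are hard-coded for illustration rather than generated from a FEN.

```python
import math


def move_index(from_sq, to_sq):
    """Flatten a (from, to) square pair into one of 64 * 64 = 4096 slots."""
    return from_sq * 64 + to_sq


def masked_cross_entropy(logits, legal_indices, target_index):
    """Cross-entropy where every illegal slot is forced to -inf first.

    The target move must be one of the legal moves.
    """
    legal = set(legal_indices)
    masked = [x if i in legal else float("-inf") for i, x in enumerate(logits)]
    peak = max(masked)
    # math.exp(-inf) is 0.0, so illegal slots add nothing to the normaliser.
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in masked))
    return log_norm - masked[target_index]


# An untrained model with identical logits everywhere.
logits = [0.0] * 4096
# Illustrative legal moves: e2e4, e2e3, g1f3 (a1 = 0 ordering assumed).
legal = [move_index(12, 28), move_index(12, 20), move_index(6, 21)]

loss = masked_cross_entropy(logits, legal, target_index=move_index(12, 28))
print(loss, math.log(3), math.log(4096))
```

With uniform logits the masked loss is log(3) because only three slots survive, while the unmasked loss over all 4,096 slots would be log(4096). The gap is the capacity the model no longer spends on ruling out impossible moves.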

mask = torch.full_like(logits, float('-inf'))
for move in board.legal_moves:
    mask[batch_idx, move.from_square * 64 + move.to_square] = 0.0
masked_logits = logits + mask

The Convergence Floor

Watching the training logs crawl through half a million positions was painful... Each epoch took about 45 minutes on my laptop. I forgot about it for a while, went out and touched some grass, came home, had coffee, and still had to wait for things to move. The loss dropped pretty cleanly across the epochs:

Epoch 1: Average Loss: 2.8730
Epoch 4: Average Loss: 1.9964
Epoch 7: Average Loss: 1.3874
Epoch 8: Average Loss: 1.2267
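One way to read these numbers: the exponential of a cross-entropy loss is its perplexity, roughly "how many equally likely moves the model is still choosing between". The small sketch below assumes the logged loss uses the natural log, which is what PyTorch's cross-entropy does. `effective_choices` is a hypothetical helper name.

```python
import math

# Average losses copied from the training log above.
epoch_losses = {1: 2.8730, 4: 1.9964, 7: 1.3874, 8: 1.2267}


def effective_choices(cross_entropy):
    """Perplexity: exp of a natural-log cross-entropy loss."""
    return math.exp(cross_entropy)


for epoch, loss in epoch_losses.items():
    choices = effective_choices(loss)
    print(f"Epoch {epoch}: loss {loss:.4f} -> about {choices:.1f} effective moves")
```

By this reading the model goes from hesitating between roughly 18 candidate moves in Epoch 1 to a bit over 3 by Epoch 8.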

By Epoch 8, the step loss started fluctuating erratically between 1.2 and 1.4. When an optimizer starts pacing around the same block, it means you've hit the convergence floor. Letting it bake for another two hours would achieve nothing but overfitting, forcing the network to memorize specific historical lines from Anand or Kasparov (if they're in the dataset) instead of learning general strategy.
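A plateau check on a held-out validation loss is one way to make that call less of a gut feeling. This is only a sketch. It assumes a validation loss is logged once per epoch (this post only reports training loss), `has_plateaued` and its thresholds are hypothetical, and the numbers are illustrative rather than taken from this run.

```python
def has_plateaued(losses, window=3, min_improvement=0.01):
    """True when the last `window` losses barely beat the best earlier loss."""
    if len(losses) <= window:
        return False  # not enough history to compare against
    best_before = min(losses[:-window])
    best_recent = min(losses[-window:])
    return best_before - best_recent < min_improvement


# Illustrative validation losses, NOT from the actual training run.
val_losses = [1.50, 1.30, 1.25, 1.26, 1.24, 1.27]

print(has_plateaued(val_losses, window=3, min_improvement=0.05))
```

If the validation loss stalls or climbs while the training loss keeps falling, that is the usual sign of the memorization described above.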

I killed the process (Ctrl + C) and fired up the tournament suite.

Actual Defensive Resilience

I ran the updated evaluate_elo.py script to see if the 2D geometry changes yielded actual performance points. The results are a night-and-day difference:

Pure Random Bot (Elo ~100): (5W 15D 0L) | 188 Elo

Stockfish D1 (Elo ~600): (0W 7D 13L) | 331 Elo
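For context, the Elo figures line up with the standard logistic performance-rating formula. The sketch below assumes that formula, and evaluate_elo.py may compute things differently. `performance_elo` is a hypothetical helper name.

```python
import math


def performance_elo(wins, draws, losses, opponent_elo):
    """Performance rating from a match score against one fixed-rating opponent."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        # A 0% or 100% score has no finite rating under this formula.
        raise ValueError("performance rating is unbounded for 0% or 100% scores")
    return opponent_elo + 400.0 * math.log10(score / (1.0 - score))


# Random-bot match from the results above: 5 wins, 15 draws, 0 losses.
random_bot_rating = performance_elo(5, 15, 0, opponent_elo=100)
print(int(random_bot_rating))
```

The 5W 15D 0L record is a 62.5% score, which lands at about 188 against a 100-rated opponent. A 20-game whitewash has no finite rating under this formula, so even a handful of draws is what makes a number possible at all.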

Two massive takeaways stand out here:

The engine went completely undefeated against the random bot (5 wins, 15 draws, 0 losses), compared to 2 wins and 18 draws in Phase 1. The best part was that the wins were a bit more convincing (instead of pure luck!). It even clocked two hyper-aggressive, clean tactical checkmates in just 33 and 35 moves!

Against the random bot, the model actually performs much more convincingly!

In Phase 1, Stockfish Depth 1 wiped the floor with us 20-0-0. In Phase 2, our model held Stockfish to 7 structural draws (mostly because the 150-move maximum was reached, LOL). Because the transformer finally respects grid topology and focuses its attention only on valid squares, it learned how to coordinate a stubborn, high-volume defensive shell that forced a 600-Elo engine to grind all the way out to the 150-move cap. So the model was at least good enough to hold the weakest bot to a lucky draw, which is pretty much a win in my books.

A draw against the worst stockfish model ever? But still, it's a draw!!!!! (BY MAXIMUM MOVES, BUT IT'S STILL A DRAWW)

What's next?

I don't know, actually. It's either:

a) a data problem. I need tons more data, a lot more matches: not just grandmaster and historical games, but good-quality publicly available ones too!? And wait, I'm taking ALL the moves from every match. What if I take only the winning side's moves as training data?

or,

b) a compute problem. 6 hours for a measly 500k positions is a long, long time. And my laptop is kinda heating up too, which is scary, so I might have to rent a GPU or find a free GPU platform to run my training code

or orrr,

c) a benchmarking problem. I'm not quite sure if playing against a random bot and a lobotomized Stockfish is really doing the model justice. I should look for a better way to benchmark the models I'm training.
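For option (a), the "winning side only" idea could be tried with a small filter like the one below. This is a sketch under assumptions: training samples are taken to be (FEN, UCI move) pairs and the game result a PGN-style string, which may not match the real data pipeline. `winner_only_samples` is a hypothetical name.

```python
def winner_only_samples(samples, result):
    """Keep only the positions where the eventual winner was to move.

    samples: list of (fen, move_uci) pairs from one game.
    result:  PGN result string: "1-0", "0-1", or "1/2-1/2".
    """
    if result == "1-0":
        winner = "w"
    elif result == "0-1":
        winner = "b"
    else:
        return []  # draws and unfinished games have no winning side
    # The second FEN field is the side to move: "w" or "b".
    return [(fen, move) for fen, move in samples if fen.split()[1] == winner]


# Two plies of an illustrative game: 1. e4 e5
samples = [
    ("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1", "e2e4"),
    ("rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1", "e7e5"),
]

print(winner_only_samples(samples, "1-0"))
```

The trade-off is that this roughly halves the decisive games and drops drawn games entirely, which cuts against the "need more data" half of the same option.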

But let's figure it out. As always, the source code is on GitHub.