nareshkarthigeyan

Transformer Chess

Teaching a Neural Network Chess Intuition: A pretty terrible model

Jun 20, 2026

Building an unbeatable Tic Tac Toe bot takes about four hours. You grab the Minimax algorithm, map out the game tree, score the terminals (+1 for a win, -1 for a loss), and recursive math does the rest. It's a neat, self-contained afternoon project because the state space is tiny.

Chess is a completely different kind of nightmare.

Standard engines like Stockfish treat chess like an industrial-scale search problem. They use alpha-beta pruning, iterative deepening, and massive evaluation tables to calculate millions of possible future branches before making a move. It's incredibly effective, but it’s pure, calculated brute force.

I wanted to see what happens if you strip away the search tree entirely. No look-aheads, no deep branching calculations, no evaluation functions. Just pure, instant neural intuition.

The goal was simple: throw raw board states at a transformer model and see if it could instantly predict a strong, legal move in a single forward pass. Why? Mostly because it sounded like a fun way to stress-test an architecture that was never originally meant to handle spatial geometry.

The Core Concept: Translating Boards into Tokens

Since we are doing this without an active search engine, we have to treat chess as a sequence prediction problem. This is where the Transformer comes in.

Introduced in the seminal 2017 paper Attention Is All You Need, the transformer relies on a mechanism called self-attention. Instead of processing data sequentially step-by-step, self-attention allows every single element in a sequence to look at and connect with every other element simultaneously. In natural language, it maps how words relate to each other in a sentence. On a chessboard, it maps how pieces coordinate across files and ranks.

To set this up, the game state has to be translated into something the architecture can ingest. I wrote a basic tokenizer to flatten the $8 \times 8$ grid into a 1D sequence of 64 integer tokens:

Empty squares = 0
White pieces = 1 through 6 (Pawn to King)
Black pieces = 7 through 12
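
As a rough illustration of that mapping, here is a minimal sketch that tokenizes the piece-placement field of a FEN string. It is not the tokenizer from the repo: the function name, the a1 = 0 to h8 = 63 square ordering, and the exact knight/bishop/rook/queen order between Pawn and King are assumptions.

```python
# Token IDs follow the scheme above: 0 = empty, 1-6 = white Pawn..King, 7-12 = black.
# The order of the pieces between Pawn and King is an assumption.
PIECE_TO_TOKEN = {
    "P": 1, "N": 2, "B": 3, "R": 4, "Q": 5, "K": 6,
    "p": 7, "n": 8, "b": 9, "r": 10, "q": 11, "k": 12,
}


def tokenize_fen(fen: str) -> list[int]:
    """Flatten the piece-placement field of a FEN string into 64 tokens.

    Index 0 is a1 and index 63 is h8 (an assumed ordering).
    """
    placement = fen.split()[0]
    tokens = []
    # FEN lists rank 8 first, so reverse the ranks to start from rank 1.
    for rank in reversed(placement.split("/")):
        for ch in rank:
            if ch.isdigit():
                tokens.extend([0] * int(ch))  # a digit encodes a run of empty squares
            else:
                tokens.append(PIECE_TO_TOKEN[ch])
    if len(tokens) != 64:
        raise ValueError(f"expected 64 squares, got {len(tokens)}")
    return tokens


START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
tokens = tokenize_fen(START_FEN)
print(tokens[:8])   # white back rank, a1..h1
print(tokens[56:])  # black back rank, a8..h8
```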

For the output layer, the problem becomes a massive classification task. Every unique move combination can be represented as an integer ID from $0$ to $4095$ ( $64 \text{ origin squares} \times 64 \text{ target squares}$ ). The model's job is to look at the 64-token input and output a probability distribution over those 4,096 possible move classes.
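
The move ID arithmetic can be sketched as follows. The helper names and the a1 = 0 square ordering are assumptions for illustration, not necessarily what the repo uses.

```python
def square_index(name: str) -> int:
    """Map a square name such as 'e2' to 0..63, with a1 = 0 and h8 = 63 (assumed ordering)."""
    file = ord(name[0]) - ord("a")
    rank = int(name[1]) - 1
    return rank * 8 + file


def encode_move(uci: str) -> int:
    """Encode a UCI move as origin * 64 + target."""
    # Any promotion suffix (the 'q' in 'e7e8q') is ignored by this scheme.
    return square_index(uci[0:2]) * 64 + square_index(uci[2:4])


def decode_move(move_id: int) -> tuple[int, int]:
    """Recover (origin, target) square indices from a move ID."""
    return divmod(move_id, 64)


print(encode_move("e2e4"))                         # 12 * 64 + 28
print(decode_move(encode_move("e2e4")))            # (origin, target)
print(encode_move("e7e8q") == encode_move("e7e8n"))  # promotions collide
```

One consequence of a pure origin-target encoding is that a queen promotion and a knight promotion on the same squares share one class, so the promotion piece has to be chosen by some rule outside the classifier.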

Data Collection and Pipeline

Transformers are notoriously data-hungry, and they don't have built-in inductive biases for the rules of the physical world; they have to infer relationships entirely from the statistical patterns in the data you feed them.

To give the network a fighting chance, I gathered 35 separate master-level PGN database files from PGN Mentor, including early tournament games from grandmasters like Nodirbek Abdusattorov.

This is a very small amount of data in the grand scheme of things, closer to a toy dataset: at roughly 20 games per file, 35 files come to about 700 games.

I built a consolidated parsing pipeline to stream these text files sequentially without exploding my RAM. The script iterates through the game strings move-by-move, cross-references each position against python-chess to ensure validity, and logs the current state paired with the move that was actually played.

The final compiled dataset yielded 56,335 unique board-move pairs.

For example:

Move 1 -> Expected output: Move 2
Move 1 + Move 2 -> Expected output: Move 3
Move 1, 2, 3 -> Expected output: Move 4
...and so on.
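
A minimal sketch of that streaming loop, assuming python-chess is installed. The function name is hypothetical, and storing the position as a FEN string is an assumption; the real pipeline may store tokens directly.

```python
import io

import chess
import chess.pgn


def iter_training_pairs(handle):
    """Yield (FEN before the move, move in UCI notation) for every game in a PGN stream."""
    while True:
        game = chess.pgn.read_game(handle)  # parses one game at a time; None at end of input
        if game is None:
            break
        if game.errors:  # skip games the parser flagged, e.g. illegal moves
            continue
        board = game.board()
        for move in game.mainline_moves():
            yield board.fen(), move.uci()  # position first, then the move actually played
            board.push(move)


sample = io.StringIO("1. e4 e5 2. Nf3 *")
pairs = list(iter_training_pairs(sample))
for fen, uci in pairs:
    print(uci, "<-", fen)
```

Because the function is a generator that reads one game at a time, memory use stays flat no matter how large the PGN file is.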

Model Setup & The Training Phase

The model uses a standard PyTorch TransformerEncoder structure coupled with two distinct embedding layers:

self.piece_embedding = nn.Embedding(num_embeddings=13, embedding_dim=128)
self.pos_embedding = nn.Embedding(num_embeddings=64, embedding_dim=128)

Because self-attention is naturally permutation-equivariant (on its own, it has no notion of where a token sits in the sequence), the pos_embedding is critical. It injects a learned spatial index vector for each of the 64 squares so the network can distinguish between a rook on A1 and a rook on H8. The tokens pass through four encoder layers before hitting a dense feed-forward classification head that outputs raw move logits.

encoder_layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, dim_feedforward=512, activation='gelu', batch_first=True)
self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
self.fc_out = nn.Linear(64 * 128, 4096)
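
To get a feel for where the capacity sits, here is a back-of-the-envelope parameter count. It assumes the model contains only the layers shown above, with PyTorch's default biases and no extra modules.

```python
d_model, ff, layers = 128, 512, 4
vocab, squares, moves = 13, 64, 4096

# Piece embedding plus positional embedding.
embeddings = vocab * d_model + squares * d_model

# One nn.TransformerEncoderLayer: attention in/out projections,
# two feed-forward linear layers, and two LayerNorms (weight + bias each).
attention = (3 * d_model * d_model + 3 * d_model) + (d_model * d_model + d_model)
feed_forward = (d_model * ff + ff) + (ff * d_model + d_model)
layer_norms = 2 * (2 * d_model)
encoder = layers * (attention + feed_forward + layer_norms)

# nn.Linear(64 * 128, 4096): weights plus biases.
head = squares * d_model * moves + moves

total = embeddings + encoder + head
print(f"embeddings: {embeddings:,}")
print(f"encoder:    {encoder:,}")
print(f"head:       {head:,}")
print(f"total:      {total:,} (head share: {head / total:.1%})")
```

Under these assumptions the flattened classification head holds well over 90% of the weights, which is worth keeping in mind next to a dataset of 56,335 positions.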

When I first hit run on the training loop using my MacBook Air's mps GPU backend, the terminal went pitch black. Zero logs. Total silence.

I assumed the code had hung. It turns out Apple's Metal Performance Shaders (MPS) backend needs a noticeable warmup window on the first batch, and my logging script only reported metrics at the end of a full epoch. Once I patched the loop to print out progress markers every 10 batches, the console came alive, and the cross-entropy loss began stepping down cleanly.

The Reality Check: Evaluation

Once checkpoint.pt saved, it was time to establish a hard baseline performance metric. I set up a local testing script to throw the trained weights into 20 games against each of two automated opponents: a Pure Random Move Bot and a Lobotomized Stockfish manually forced to an execution limit of depth=1 and time=0.001.

The final metrics sheet was a clean reality check:

Pure Random Bot (Elo ~100): (2W 18D 0L)

Stockfish D1 (Elo ~600): (0W 0D 20L)

The model was swept 0-20 by Stockfish Depth 1. Even with its search capped at a single ply, Stockfish's static evaluation still accounts for piece values and king safety, making it an incredibly efficient punisher against unstructured moves.

However, the games against the random bot revealed something fascinating. In Game 1, playing as White, the transformer successfully coordinated its minor pieces and delivered a completely valid, automated Checkmate. The model genuinely learned how to checkmate a king purely by analyzing text PGN patterns.

But overall, the practical play is still weak. It drew 18 times against a random bot because it kept falling into endless king-shuffling loops and threefold repetitions in the endgame. It understands basic tactical development, but it completely lacks the long-term intentionality needed to convert a winning advantage.

It's pretty safe to say my model's Elo rating is somewhere around 100-110, which is pretty terrible. Even the weakest human players sit around 250-350.
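
As a rough cross-check on that estimate, the sketch below converts a match score into an Elo gap. It assumes the standard Elo logistic model and takes the random bot's ~100 rating at face value; the function name is hypothetical.

```python
import math


def elo_gap(wins: int, draws: int, losses: int) -> float:
    """Rating difference implied by a match score under the Elo logistic model."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    if score in (0.0, 1.0):
        raise ValueError("a 0% or 100% score has no finite Elo gap")
    return 400 * math.log10(score / (1 - score))


gap = elo_gap(wins=2, draws=18, losses=0)  # result against the random bot
print(f"score 55% -> about {gap:.0f} Elo above the random bot")
```

A 55% score works out to a gap of a few dozen points, so the model sits only slightly above the random bot. With just 20 games the error bars are wide, and the 0-20 result against Stockfish gives no finite gap at all.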

Match #10: Against a random chess move generator, our stupid transformer model actually managed to win!

Match #3: A pretty handicapped Stockfish at depth 1 obliterating our model.

Where the Code Goes From Here

Seeing the engine struggle on the board is the best part of building this in public—it gives me concrete architectural bottlenecks to unpack for the final project report.

Right now, the model faces two massive constraints:

Dimensional Blindness: Flattening the board into a continuous 1D line of 64 tokens breaks 2D geometry. The network has no inherent concept that rank 1 and rank 2 are vertically adjacent; it has to infer the two-dimensional grid completely from scratch.
Unmasked Loss: The final layer calculates gradients across all 4,096 possible move slots. The model spends a massive amount of its capacity learning not to generate illegal moves (like jumping a pawn backwards) instead of optimizing for the absolute best strategic option.
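
The Dimensional Blindness point is easy to see with a little index arithmetic. This sketch assumes squares are numbered a1 = 0 through h8 = 63, and compares how far apart two squares are in the flattened sequence versus on the actual board.

```python
def rank_file(square: int) -> tuple[int, int]:
    """Split a 0..63 square index into (rank, file), assuming a1 = 0 and h8 = 63."""
    return divmod(square, 8)


def sequence_distance(a: int, b: int) -> int:
    """Distance between two squares in the flattened 1D token sequence."""
    return abs(a - b)


def board_distance(a: int, b: int) -> int:
    """Distance on the 2D board, measured in king moves (Chebyshev distance)."""
    (rank_a, file_a), (rank_b, file_b) = rank_file(a), rank_file(b)
    return max(abs(rank_a - rank_b), abs(file_a - file_b))


A1, H1, A2 = 0, 7, 8
# a1 and a2 touch on the board but sit 8 tokens apart in the sequence.
print(sequence_distance(A1, A2), board_distance(A1, A2))
# h1 and a2 are neighbours in the sequence but on opposite edges of the board.
print(sequence_distance(H1, A2), board_distance(H1, A2))
```

The same `(rank, file)` pair is what a 2D scheme would use to look up separate rank and file vectors, so squares sharing a rank or a file share part of their positional signal by construction.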

To clean this up, the next iteration of the repo will swap the 1D spatial embeddings for a combined 2D Coordinate Positional Encoding (adding separate rank and file vector components) and implement an active Legal Move Mask directly over the output logits during evaluation.
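
The masking idea can be sketched in plain Python: push every illegal move's logit to negative infinity before picking the best one. The function names are hypothetical, and the legal move IDs are passed in by hand here; in practice they would come from a move generator such as python-chess.

```python
import math


def mask_illegal(logits: list[float], legal_ids: set[int]) -> list[float]:
    """Replace the logit of every illegal move with -inf so it can never be selected."""
    return [x if i in legal_ids else -math.inf for i, x in enumerate(logits)]


def pick_move(logits: list[float], legal_ids: set[int]) -> int:
    """Return the ID of the highest-scoring legal move."""
    if not legal_ids:
        raise ValueError("no legal moves: the game is over")
    masked = mask_illegal(logits, legal_ids)
    return max(range(len(masked)), key=masked.__getitem__)


# Toy logits over the 4,096 move classes (ID = origin * 64 + target, a1 = 0).
logits = [0.0] * 4096
logits[0] = 9.0    # a1 -> a1: highest raw score, but never a legal move
logits[796] = 2.0  # e2 -> e4
logits[405] = 1.0  # g1 -> f3
legal = {796, 405}

print(pick_move(logits, legal))  # the best legal move, not the best raw logit
```

On tensors the same effect is usually achieved by filling the illegal positions with negative infinity (for example with `masked_fill`) before the softmax or argmax.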

The web dashboard is fully wired up, the baseline numbers are logged, and the core pipeline works. Now it's just a matter of scaling up the data and the parameter count.

Follow this project on GitHub!