What Happened
Researchers Quanhao Li and Wei Jiang submitted a paper to arXiv on March 31, 2026, identifying a structural conflict in training chess-playing transformer models from move sequences. The paper, “Tracking vs. Deciding: The Dual-Capability Bottleneck in Searchless Chess Transformers”, formalizes the problem as a mathematical inequality and demonstrates a 120-million-parameter model that reached a Lichess bullet rating of 2570 over 253 rated games — without using tree search at inference time.
The central finding is that sequence-based training forces a model to develop two capabilities whose ideal training data conflict.
- Sequence-based chess training requires two conflicting skills: state tracking (reconstructing board state from move history) and decision quality (selecting good moves from that reconstructed state).
- Low-rated games supply the data diversity needed for tracking; high-rated games supply the quality signal for decision learning — and these requirements cannot be fully satisfied by a single dataset strategy.
- The authors formalize the tension as P ≤ min(T, Q): overall model performance is bounded by whichever capability is weaker.
- Their final 120M-parameter model achieved 55.2% Top-1 human move prediction accuracy, exceeding Maia-2 rapid and Maia-2 blitz, with no tree search.
Why It Matters
Most prior work on human-like chess AI — including the Maia and Maia-2 model families — uses board position as input rather than move sequences. Position-based methods capture the current state directly but lose the path that led to it. Li and Jiang argue that sequence input “naturally encodes full game history, enabling history-dependent decisions that single-position models cannot exhibit.”
The bottleneck has a counterintuitive practical consequence: removing low-rated games from training data degrades overall performance, even though those games contain weaker moves. Curating only high-quality games does not produce a stronger human-like model — it weakens state tracking by reducing diversity.
Technical Details
The dual-capability bottleneck is expressed as P ≤ min(T, Q), where P is overall model performance, T is tracking capability, and Q is decision quality. As the authors state in the abstract, “low-rated games provide the diversity needed for tracking, while high-rated games provide the quality signal for decision learning.” These requirements pull in opposite directions, so a dataset tuned for one capability shortchanges the other.
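A minimal sketch of how the bound reads in practice. It assumes T and Q are both expressed on a common 0-to-1 scale, which the summary above does not specify; the scenario names and scores are hypothetical and are not figures from the paper.

```python
def performance_bound(tracking: float, decision: float) -> float:
    """Upper bound on overall performance P implied by P <= min(T, Q)."""
    return min(tracking, decision)


def limiting_capability(tracking: float, decision: float) -> str:
    """Name the capability that currently caps the bound."""
    if tracking < decision:
        return "tracking"
    if decision < tracking:
        return "decision quality"
    return "balanced"


# Hypothetical (T, Q) scores for two dataset strategies; illustrative only.
scenarios = {
    "high-rated games only": (0.60, 0.90),
    "all games, unweighted": (0.85, 0.70),
}

for name, (t, q) in scenarios.items():
    bound = performance_bound(t, q)
    print(f"{name}: P <= {bound:.2f}, limited by {limiting_capability(t, q)}")
```

In this toy setup, raising Q further in the first scenario leaves the bound unchanged, because tracking is the binding term.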
To address the bottleneck, the team first scaled the model from 28 million to 120 million parameters, which the experiments show improves state tracking. They then introduced Elo-weighted training — assigning higher sample weight to games from higher-rated players — to improve decision quality while preserving data diversity. A 2×2 factorial ablation confirmed that scaling and weighting each contribute independently, and that their combination is superadditive: the joint gain exceeds the sum of individual gains.
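The superadditivity claim can be checked with simple arithmetic on the four cells of a 2×2 design. The sketch below assumes a higher-is-better metric; the function name and the four scores are hypothetical placeholders, not results from the paper.

```python
def factorial_effects(base: float, scaled_only: float,
                      weighted_only: float, both: float) -> dict:
    """Decompose a 2x2 ablation into individual gains and an interaction term.

    Each argument is the metric for one cell of the design:
    base          = small model, no Elo weighting
    scaled_only   = large model, no Elo weighting
    weighted_only = small model, Elo weighting
    both          = large model, Elo weighting
    """
    gain_scale = scaled_only - base
    gain_weight = weighted_only - base
    gain_joint = both - base
    # A positive interaction means the joint gain exceeds the sum of parts.
    interaction = gain_joint - (gain_scale + gain_weight)
    return {
        "gain_scale": gain_scale,
        "gain_weight": gain_weight,
        "gain_joint": gain_joint,
        "interaction": interaction,
        "superadditive": interaction > 0,
    }


# Hypothetical accuracy percentages, chosen only to illustrate the check.
effects = factorial_effects(base=50.0, scaled_only=52.0,
                            weighted_only=51.5, both=55.0)
print(effects)
```

With these placeholder numbers the individual gains are 2.0 and 1.5, the joint gain is 5.0, and the interaction term is 1.5, so the combination would count as superadditive.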
The researchers found that linear Elo weighting performed best. Overly aggressive weighting lowered validation loss but measurably harmed tracking capability, which illustrates the bottleneck mechanism directly. The paper also introduces a coverage-decay formula, t* = log(N/k_crit) / log b, as a reliability horizon: an estimate of the point in a move sequence at which intra-game move degeneration becomes likely.
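One way to picture the contrast between linear and aggressive Elo weighting is to compare how much effective data each scheme leaves. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes weights proportional to Elo raised to a power (power 1 as the "linear" case), and it uses Kish's effective sample size as a stand-in for data diversity. The function names and the Elo values are hypothetical.

```python
def elo_weights(elos, power=1.0):
    """Per-game sample weights proportional to elo ** power, with mean 1.

    power=1.0 is the assumed 'linear' scheme; larger powers concentrate
    weight on the highest-rated games.
    """
    raw = [float(e) ** power for e in elos]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def weighted_mean_loss(losses, weights):
    """Weighted average of per-game losses, as used in weighted training."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


# Hypothetical average Elo ratings for five training games.
elos = [1000, 1400, 1800, 2200, 2600]

for power in (0.0, 1.0, 8.0):
    w = elo_weights(elos, power)
    print(f"power={power}: effective sample size = "
          f"{effective_sample_size(w):.2f} of {len(elos)}")
```

As the power grows, a few top-rated games dominate and the effective sample size shrinks, which is consistent with the reported pattern that aggressive weighting can erode the diversity tracking depends on.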
The final 120M-parameter model achieved 55.2% Top-1 accuracy on human move prediction — exceeding both Maia-2 rapid and Maia-2 blitz — and reached a Lichess bullet rating of 2570 across 253 rated games, all without search.
Who’s Affected
The findings are most directly relevant to researchers building chess engines designed to model human play style rather than maximize competitive strength. Systems targeting human move prediction, such as Maia-2, face the same data trade-off described here, and the P ≤ min(T, Q) framing provides a diagnostic lens for understanding where those models may be limited.
Developers working on other sequence-modeling tasks where training data quality and diversity are structurally in tension may find the dual-capability framework conceptually relevant, though Li and Jiang themselves do not extend the claim beyond chess.
What’s Next
The paper establishes that aggressive Elo weighting can hurt tracking even when it reduces validation loss, indicating that validation loss alone is an insufficient guide for this type of training. The coverage-decay formula t* = log(N/k_crit) / log b offers a quantitative reliability estimate, but the authors frame it as a theoretical horizon rather than a hard operational limit.
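As a worked illustration, the horizon can be evaluated directly from the formula. The summary above does not define its symbols, so the reading used here is an assumption: N as the number of training games, b as an effective branching factor per move, and k_crit as the minimum number of games that must still share a line for it to count as covered. Under that reading, N / b^t games remain on a given line after t moves, and t* is the depth at which that count falls to k_crit. The input values are hypothetical.

```python
import math


def reliability_horizon(n: float, k_crit: float, b: float) -> float:
    """Evaluate t* = log(N / k_crit) / log(b).

    Symbol meanings are assumed, not taken from the paper:
    n      = number of training games
    k_crit = minimum games per line needed for reliable coverage
    b      = effective branching factor per move (must exceed 1)
    """
    if n <= 0 or k_crit <= 0:
        raise ValueError("n and k_crit must be positive")
    if b <= 1:
        raise ValueError("b must be greater than 1")
    return math.log(n / k_crit) / math.log(b)


# Hypothetical inputs: larger datasets push the horizon out only
# logarithmically, while a larger branching factor pulls it in.
for n, k_crit, b in [(1e6, 100, 2.0), (1e8, 100, 2.0), (1e8, 100, 4.0)]:
    t_star = reliability_horizon(n, k_crit, b)
    print(f"N={n:.0e}, k_crit={k_crit}, b={b}: t* = {t_star:.1f} moves")
```

Because N enters through a logarithm, multiplying the dataset size by 100 in this sketch adds only log(100) / log(b) moves to the horizon.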
Li and Jiang do not claim their architecture or the bottleneck framing generalizes to other games or domains. Extending the dual-capability analysis to tasks where sequence history matters — such as long-context reasoning or multi-turn dialogue — remains open for follow-up work outside the scope of this paper.