MoonshotAI, the Beijing-based AI lab, published Attention Residuals (AttnRes) — a replacement for standard residual connections in Transformer architectures that uses learned, input-dependent attention over layer depth instead of fixed-weight accumulation. The method was released alongside a preprint PDF and an open-source codebase. Author details were not available in the repository's publicly accessible materials at the time of publication.
- AttnRes replaces fixed-weight additive residuals with a softmax-weighted aggregation of all earlier layer representations, computed via a single learned pseudo-query per layer.
- Block AttnRes reduces memory complexity from O(Ld) to O(Nd) by partitioning layers into approximately eight blocks — making the approach practical for large-scale training.
- Scaling law experiments reported in the paper showed AttnRes “consistently outperforms the baseline across all compute budgets.”
- The repository includes PyTorch-style pseudocode, and the Block variant is positioned as a drop-in replacement requiring minimal architecture changes.
What Happened
MoonshotAI released Attention Residuals (AttnRes) via a public GitHub repository that includes a preprint PDF and an arXiv submission. The method targets a structural limitation in standard residual connections: as Transformer depth increases, uniform additive accumulation of layer outputs dilutes each layer’s contribution and allows hidden-state magnitudes to grow without bound. The researchers identify this as a well-known problem specific to PreNorm configurations, the normalization strategy used in most modern large language models. AttnRes replaces fixed-unit-weight accumulation with a learned attention mechanism that gives each layer selective, content-aware access to all previous layer representations.
Why It Matters
Residual connections have been a core component of Transformer architectures since their adoption from deep residual networks, and their design — uniform additive accumulation with fixed unit weights — has remained essentially unchanged. The MoonshotAI team writes in the repository documentation that this approach results in “uniform aggregation [that] dilutes each layer’s contribution and causes hidden-state magnitudes to grow unboundedly,” framing it as an inherent liability at depth rather than an incidental implementation issue.
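The accumulation effect is easy to see in isolation. The toy simulation below is not from the paper: it simply adds independent random vectors, standing in for layer outputs, to a running hidden state under the standard unit-weight update, and the norm of that state keeps climbing with depth, roughly as the square root of the layer count.

```python
import torch

torch.manual_seed(0)

d, depth = 1024, 64
h = torch.randn(d)          # stand-in for the initial embedding
norms = []
for _ in range(depth):
    f_h = torch.randn(d)    # stand-in for one layer's output
    h = h + f_h             # standard unit-weight residual update
    norms.append(h.norm().item())

# The running hidden state's magnitude grows roughly like sqrt(number of layers).
print(f"norm after layer 1:  {norms[0]:.1f}")
print(f"norm after layer 32: {norms[31]:.1f}")
print(f"norm after layer 64: {norms[-1]:.1f}")
```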
DenseNet-style architectures explored cross-layer feature reuse in convolutional networks, but those approaches relied on fixed concatenation or summation. AttnRes introduces a trainable, input-conditioned variant of the same idea directly into the Transformer residual stream: aggregation weights are computed dynamically from each token's current representation, rather than being fixed at design time.
Technical Details
In the standard residual formulation, each layer’s output is added to the running hidden state with a fixed weight of one. AttnRes replaces this with a softmax-weighted sum: each layer computes attention weights α using a single learned pseudo-query parameter matched against keys derived from all previous layer outputs, then produces a new hidden state as the weighted combination Σα_i · h_i across all prior representations h_i. Using a single pseudo-query per layer keeps the added parameter count small relative to the full attention sublayer.
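To make the mechanism concrete, here is a minimal PyTorch-style sketch of that aggregation step. It is an interpretation of the description above, not the repository's code: the class name AttnResAggregator, the shared key projection, and the score scaling are assumptions, and the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnResAggregator(nn.Module):
    """Hypothetical sketch of depth-wise attention residuals for one layer.

    Instead of h = h + f(h), re-aggregate all earlier layer representations
    with softmax weights produced by a learned pseudo-query matched against
    keys derived from those representations.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.pseudo_query = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)
        self.key_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (num_prev_layers, batch, seq, d_model) -- every earlier
        # layer's output, including the embedding, stacked along depth
        keys = self.key_proj(history)                          # (L, B, S, d)
        scores = torch.einsum("lbsd,d->lbs", keys, self.pseudo_query)
        scores = scores / history.size(-1) ** 0.5
        alpha = F.softmax(scores, dim=0)                       # weights over depth
        return torch.einsum("lbs,lbsd->bsd", alpha, history)   # Σ_i α_i · h_i
```

In a full model, each block would append its sublayer output to the stored history and call this aggregator in place of the usual h = h + sublayer(h); because the keys are derived from the stored representations, the softmax weights still vary per token even though the pseudo-query itself is a fixed learned parameter.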
Full AttnRes carries O(Ld) memory complexity — where L is total layer count and d is hidden dimension — because all L intermediate hidden states must be stored. At large scale, this becomes a practical bottleneck. The researchers developed Block AttnRes to address it: layers are partitioned into approximately eight blocks, and attention is applied only over compressed block-level representations rather than every individual layer output, reducing memory from O(Ld) to O(Nd) where N is the number of blocks. The team states Block AttnRes “recovers most of Full AttnRes’s gains while serving as a practical drop-in replacement with marginal overhead,” though the exact quantitative gap between the two variants is not specified in the publicly available summary.
Scaling law experiments reported in the paper demonstrated that “AttnRes consistently outperforms the baseline across all compute budgets.” The repository includes PyTorch-style pseudocode for the block_attn_res function, which handles both the completed block representations and the partial running sum within the current in-progress block.
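That pseudocode is not reproduced here, but based on the description above, a Block AttnRes step might look roughly like the following sketch, with the function signature, the block-level compression, and the treatment of the in-progress block's running sum all assumed rather than confirmed.

```python
import torch
import torch.nn.functional as F


def block_attn_res(pseudo_query, key_proj, block_reprs, running_sum):
    """Hypothetical sketch of a Block AttnRes aggregation step.

    Assumed inputs (not the repository's exact signature):
      pseudo_query: (d,) learned per-layer query
      key_proj:     shared nn.Linear(d, d) key projection
      block_reprs:  (N, B, S, d) compressed representations of the N
                    completed blocks (e.g. a summary of each block's outputs)
      running_sum:  (B, S, d) partial accumulation within the current,
                    still-open block

    Memory per token is O(N*d) instead of O(L*d), since only block-level
    summaries plus one running sum are kept.
    """
    # Treat the in-progress block's running sum as one extra candidate.
    candidates = torch.cat([block_reprs, running_sum.unsqueeze(0)], dim=0)
    keys = key_proj(candidates)                              # (N+1, B, S, d)
    scores = torch.einsum("nbsd,d->nbs", keys, pseudo_query)
    scores = scores / candidates.size(-1) ** 0.5
    alpha = F.softmax(scores, dim=0)                         # weights over blocks
    return torch.einsum("nbs,nbsd->bsd", alpha, candidates)
```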
Who’s Affected
The method is directly relevant to ML engineers running Transformer pretraining at scale, particularly teams working with deep model configurations where PreNorm hidden-state growth is most acute. Because Block AttnRes is described as a drop-in replacement compatible with existing architectures, practitioners using standard PyTorch-based training stacks can integrate the component without redesigning their model topology. Researchers studying depth efficiency and representation collapse in large language models will find the scaling law comparisons in the preprint the most immediately applicable evidence.
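For a sense of what "drop-in" means in practice, the sketch below shows where the swap would sit in a conventional PreNorm block, assuming an aggregator module like the earlier sketches. The main integration cost is threading the stored history (or, for the block variant, the block summaries and running sum) through the forward pass; the block internals themselves are unchanged.

```python
import torch
import torch.nn as nn


class BlockWithAttnRes(nn.Module):
    """Illustrative only: where the residual swap would sit in an existing
    PreNorm Transformer block (attention/MLP internals elided).
    `aggregator` stands for an AttnRes-style module such as the sketches above."""

    def __init__(self, d_model: int, sublayer: nn.Module, aggregator: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer        # e.g. self-attention or MLP
        self.aggregator = aggregator    # depth-wise attention residual

    def forward(self, h: torch.Tensor, history: torch.Tensor):
        out = self.sublayer(self.norm(h))
        history = torch.cat([history, out.unsqueeze(0)], dim=0)
        # Standard update would be: h = h + out
        h = self.aggregator(history)    # AttnRes update over stored representations
        return h, history
```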
What’s Next
As of early April 2026, the repository had accumulated 2.9k GitHub stars and 138 forks, indicating active interest from the broader research community. The inclusion of both a preprint PDF and an arXiv submission suggests external review and independent replication attempts are likely already underway. AttnRes has not been demonstrated in a publicly released production model, so its behavior at frontier LLM scale — beyond the scaling law experiments reported in the paper — has not yet been independently validated by external researchers.