Overview of Attention Residuals
\nAttention Residuals (AttnRes) offer a direct substitution for the classic residual pathways that dominate modern Transformer stacks, granting each layer the ability to reference earlier hidden states through a learned attention mechanism. This design replaces the fixedâweight summation with a dynamic weighting scheme, allowing the model to emphasize the most relevant historical representations.
\nWhy Standard Residuals Fall Short
\nConventional residual links accumulate all preceding outputs with uniform unit coefficients, which leads to two observable drawbacks: the contribution of any single layer becomes diluted as depth increases, and the magnitude of hidden states can drift toward unbounded values, a wellâdocumented issue for PreNorm configurations. Replacing this static accumulation with a selective process addresses both concerns.
\nMechanics of Full AttnRes
\nThe Full AttnRes variant computes a softmax attention distribution over every prior layer output. Each layer possesses a learned pseudoâquery vector w_l â â^d the attention weight α_i^l for a previous layer i is derived from the dot product between w_l and the representation of layer i, followed by a softmax normalization. The resulting weighted sum replaces the ordinary addition, delivering contentâaware depth integration.
\nMemory Implications of Full Attention
\nWhile Full AttnRes provides the most expressive depthâwise interaction, it incurs O(L·d) memory usage, where L denotes the total number of layers and d the hidden dimension. For deep models (LâŻ>âŻ50), this requirement can exceed typical GPU capacities, prompting the need for a more memoryâfriendly alternative.
\nBlock AttnRes Architecture
\nBlock AttnRes mitigates the memory burden by partitioning the model into N contiguous blocks. Within each block, standard residual summation is retained, preserving intraâblock efficiency. Between blocks, a lightweight attention operation is performed over the aggregated block representations, reducing the overall cost to O(N·d) while still capturing longârange dependencies.
\nImplementation Sketch
\nThe following pseudocode outlines a practical Block AttnRes layer:\n
\nblocks = []\nfor b in range(N):\n # intraâblock residual accumulation\n block_rep = residual_sum(layers[b])\n blocks.append(block_rep)\n# interâblock attention\nquery = learnable_vector\nlogits = torch.einsum('bd,nd->bn', query, torch.stack(blocks))\nweights = torch.softmax(logits, dim=1)\noutput = torch.einsum('bn,bd->bd', weights, torch.stack(blocks))\n\nTwo critical components-residual_sum and the softmax weighting-ensure that the blockâlevel attention remains computationally