Read-Refreshed Eligibility Traces: Bridging BPTT and Long-Range Credit Assignment in Recurrent Transformers

Author: Richard Vermillion Date: 2025-10-25


Abstract

Windowed back-propagation through time (BPTT) enables efficient training of recurrent architectures such as recurrent transformers but limits learning to short temporal horizons. Full BPTT captures long-range dependencies, but its memory grows linearly and its attention compute quadratically with sequence length.
We introduce Read-Refreshed Eligibility Traces (RRET) — a hybrid approach that maintains compact, usage-conditioned “credit traces” for past activations and updates them only when those memories are retrieved by attention.
Inside a fixed-length window we compute exact gradients; outside it, we refresh Hebbian-style eligibility traces parameterized by the attention weights, routing sensitivities, and critic-derived postsynaptic drives.
This yields a recurrent transformer that can learn from arbitrarily long sequences with fixed memory cost, propagating credit precisely to semantically relevant past states while ignoring unused ones.
We derive the update rules, describe an efficient implementation, and propose experiments demonstrating that RRET matches full BPTT on long-range dependency benchmarks at a fraction of the memory cost.


1 Introduction

Standard recurrent transformers face a fundamental trade-off: windowed BPTT enables efficient training with bounded memory but prevents learning from gradients or rewards beyond the truncation horizon \(T\). Full BPTT captures those dependencies but becomes infeasible for long sequences due to quadratic attention cost and linear memory growth.

We propose Read-Refreshed Eligibility Traces (RRET) — a mechanism that blends exact short-range back-propagation with biologically inspired, long-range credit assignment.
RRET maintains lightweight, synapse-like traces for past computations that are refreshed only when the corresponding memories are read from the key–value (KV) cache of a recurrent transformer.

When a query at time \(t\) attends to a cached key–value pair \((k_i,v_i)\) from step \(i<t\), the trace for that pair receives a read-time refresh, weighted by the attention coefficient (the exact update rules are given in Section 2.2).

These read-time refreshes form usage-conditioned eligibility traces that accumulate credit exactly for the memories that matter.
When a delayed loss or reward signal arrives, the accumulated traces serve as low-variance, temporally extended gradient estimators.


2 Method

2.1 Architecture Overview

RRET extends a standard recurrent transformer block with four components:

  1. Windowed unroll: Exact BPTT is performed for the most recent \(T\) steps.
  2. KV cache: Each past step stores its compressed input \(\tilde x_i=P x_i\), keys \(k_i=W_k x_i\), values \(v_i=W_v x_i\), and timestamp \(\tau_i\).
  3. Read-refreshed traces: For cache entries older than the current window, per-head eligibility matrices \(e^{(k)}_{i,h}, e^{(v)}_{i,h}\in\mathbb{R}^{d_h\times r}\) are updated on every read.
  4. Critic head: A small network \( \hat r_t = f(o_t) \) predicts future returns and supplies the local Jacobian \( J_t=\partial \hat r_t/\partial o_t \), used as the postsynaptic drive \(u_t\) when true gradients are unavailable.
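The four components above imply a per-entry cache record. The following is a minimal sketch of that data structure; the field names, the `make_entry` helper, and the flat-vector shapes are illustrative assumptions, not the paper's fixed API.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CacheEntry:
    """One KV-cache slot as described in Section 2.1 (names illustrative)."""
    x_tilde: np.ndarray   # compressed input P @ x_i, shape (r,)
    k: np.ndarray         # per-head keys,   shape (n_heads, d_h)
    v: np.ndarray         # per-head values, shape (n_heads, d_h)
    tau: int              # timestamp of the originating step
    e_k: np.ndarray       # per-head key eligibility traces,   (n_heads, d_h, r)
    e_v: np.ndarray       # per-head value eligibility traces, (n_heads, d_h, r)

def make_entry(x, P, W_k, W_v, n_heads, tau):
    """Store a compressed snapshot of step `tau` plus zero-initialised traces."""
    d_h = W_k.shape[0] // n_heads
    r = P.shape[0]
    return CacheEntry(
        x_tilde=P @ x,
        k=(W_k @ x).reshape(n_heads, d_h),
        v=(W_v @ x).reshape(n_heads, d_h),
        tau=tau,
        e_k=np.zeros((n_heads, d_h, r)),
        e_v=np.zeros((n_heads, d_h, r)),
    )
```

Note that only \(\tilde x_i\), the keys, values, and the traces are retained; the full activations needed for exact BPTT exist only inside the window.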

2.2 Eligibility Updates

For query \(q_t^{(h)}\) attending to item \(i\) with weight \(\alpha_{t,i}^{(h)}\):

\[
\begin{aligned}
e^{(v)}_{i,h} &\leftarrow \lambda\,e^{(v)}_{i,h} + \eta_v\,\alpha_{t,i}^{(h)}\, u_t^{(h)}\, \tilde x_i^\top,\\[3pt]
\Delta_{t,i}^{(h)} &= \frac{1}{\sqrt{d_h}}\,u_t^{(h)\top}\bigl(v_i^{(h)}-\tilde o_t^{(h)}\bigr),\\[3pt]
e^{(k)}_{i,h} &\leftarrow \lambda\,e^{(k)}_{i,h} + \eta_k\,\alpha_{t,i}^{(h)}\,\tilde\Delta_{t,i}^{(h)}\, q_t^{(h)}\,\tilde x_i^\top,
\end{aligned}
\]

where \(\tilde\Delta\) denotes the variance-reduced, normalized routing sensitivity.
Traces decay with factor \(\lambda\in(0,1)\) and are updated only for items outside the current BPTT window.
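A single-head refresh can be sketched in a few lines of numpy. This is illustrative only: the function name and the default values of \(\lambda\), \(\eta_v\), \(\eta_k\) are assumptions, and the raw sensitivity \(\Delta\) stands in for the variance-reduced \(\tilde\Delta\), whose construction is left to Section 2.4.

```python
import numpy as np

def refresh_traces(e_v, e_k, alpha, u, q, v, o_tilde, x_tilde,
                   lam=0.95, eta_v=0.1, eta_k=0.1):
    """One read-time refresh for a single head (Section 2.2 updates).

    alpha   : scalar attention weight alpha_{t,i}^{(h)}
    u       : postsynaptic drive u_t^{(h)}, shape (d_h,)
    q       : query q_t^{(h)}, shape (d_h,)
    v       : cached value v_i^{(h)}, shape (d_h,)
    o_tilde : head output baseline o~_t^{(h)}, shape (d_h,)
    x_tilde : compressed presynaptic input, shape (r,)
    """
    d_h = u.shape[0]
    # value trace: decay plus usage-gated Hebbian outer product
    e_v = lam * e_v + eta_v * alpha * np.outer(u, x_tilde)
    # routing sensitivity Delta (un-normalised stand-in for tilde-Delta)
    delta = (u @ (v - o_tilde)) / np.sqrt(d_h)
    # key trace: gated by how much this read steered the output
    e_k = lam * e_k + eta_k * alpha * delta * np.outer(q, x_tilde)
    return e_v, e_k
```

When \(\alpha_{t,i}^{(h)}=0\) the refresh reduces to pure decay, which is exactly the usage-conditioning property: unread memories accumulate no new credit.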

When a scalar training signal \(\delta_t\) (e.g., TD error or per-step loss) arrives, parameter updates are:

\[ \Delta W_{v,k,q} = \delta_t\,\Big(\sum_{i,h} e^{(v,k,q)}_{i,h}\Big)\,P. \]

This structure matches the three-factor learning rule observed in biological synapses: presynaptic activity (\(\tilde x_i\)), postsynaptic drive (\(u_t\)), and a modulatory signal (\(\delta_t\)).
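The modulated update can be sketched for a single head's weight slice. Everything here (function name, per-head slicing, learning rate) is an illustrative assumption; the trace is mapped back to model dimension by right-multiplying with \(P\), which is the dimensionally consistent reconstruction given \(P\in\mathbb{R}^{r\times d_{\text{model}}}\) and traces in \(\mathbb{R}^{d_h\times r}\).

```python
import numpy as np

def apply_modulated_update(W_h, item_traces, P, delta_t, lr=1e-2):
    """Three-factor update for one head's weight slice.

    W_h         : (d_h, d_model) weight slice (e.g. one head of W_v)
    item_traces : iterable of (d_h, r) eligibility matrices, one per item
    P           : (r, d_model) compression matrix; right-multiplication
                  maps a compressed trace back to model dimension
    delta_t     : scalar modulatory signal (TD error or per-step loss)
    """
    summed = np.sum(list(item_traces), axis=0)   # sum over items, (d_h, r)
    return W_h - lr * delta_t * (summed @ P)     # gradient-style step
```

The three factors are visible directly in the code: presynaptic activity lives inside the traces, postsynaptic drive was baked in at refresh time, and \(\delta_t\) gates the whole update.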

2.3 Compression and Memory Efficiency

To limit storage, inputs are projected to a low-rank subspace \( \tilde x_i = P x_i \), with \(P\in\mathbb{R}^{r\times d_{\text{model}}}\).
If \(P\) is fixed (random or orthogonal), gradients flow only through \(W_{k,v,q}\); if learned, \(P\) is updated on a slower timescale.
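For the fixed case, a row-orthonormal \(P\) is a natural choice, since then \(P^\top P\) acts as a projection that approximately preserves gradients in the retained subspace. A minimal sketch (helper name is ours):

```python
import numpy as np

def random_orthogonal_P(r, d_model, seed=0):
    """Fixed compression matrix with orthonormal rows, so P @ P.T = I_r
    and P.T @ P projects onto an r-dimensional subspace of the model space."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d_model, r))
    Q, _ = np.linalg.qr(A)   # Q: (d_model, r), orthonormal columns
    return Q.T               # (r, d_model), orthonormal rows
```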
Each eligibility matrix is \(d_h\times r\); with \(r=d_\text{model}/4\) and top-k attention, trace memory remains \(O(n_\text{heads}\,r\,d_h\,k)\), independent of sequence length.
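As a back-of-envelope check of this bound (assuming fp32 storage and both a key and a value trace per retained item):

```python
def trace_memory_bytes(n_heads, d_model, k_top, bytes_per=4):
    """Trace memory per Section 2.3: O(n_heads * r * d_h * k), r = d_model/4.

    Independent of sequence length, because only the top-k attended
    out-of-window items carry live traces.
    """
    d_h = d_model // n_heads
    r = d_model // 4
    per_item = 2 * n_heads * d_h * r * bytes_per   # e_k and e_v per item
    return k_top * per_item
```

With \(d_{\text{model}}=512\), 8 heads, and top-\(k=32\), this comes to 16 MiB of trace memory however long the sequence runs.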

2.4 Variance-Reduction Techniques


3 Experimental Plan

3.1 Tasks

  1. Synthetic Long-Range Copy / Addition – 10k-step sequences to test gradient reach.
  2. Character-level Text Modeling – enwik8 with long context; compare to truncated and full BPTT.
  3. Reinforcement Learning – delayed-reward maze or algorithmic tasks (e.g., DMLab MemoryMaze).
  4. Long-horizon control – recurrent world-model predicting rewards hundreds of steps ahead.
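For task 1, a minimal batch generator might look like the following; the token conventions (blank 0, 'go' cue 1, pattern symbols from 2) are our assumptions for illustration, not a fixed benchmark specification.

```python
import numpy as np

def copy_task_batch(batch, seq_len=10_000, pattern_len=8, vocab=10, seed=0):
    """Long-range copy task: a short pattern at the start must be
    reproduced after a long blank gap, once a 'go' cue appears."""
    rng = np.random.default_rng(seed)
    total = seq_len + 2 * pattern_len + 1
    x = np.zeros((batch, total), dtype=np.int64)
    y = np.zeros_like(x)
    pats = rng.integers(2, vocab + 2, size=(batch, pattern_len))
    x[:, :pattern_len] = pats            # pattern at the very start
    x[:, pattern_len + seq_len] = 1      # 'go' cue after the gap
    y[:, -pattern_len:] = pats           # targets only at the final steps
    return x, y
```

Credit for the final predictions must travel the full `seq_len` gap, which is exactly what windowed BPTT cannot do and RRET's traces are meant to carry.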

3.2 Baselines

3.3 Metrics

3.4 Expected outcomes

RRET should:


4 Related Work

To our knowledge, no prior work combines read-conditioned eligibility updates with transformer-style KV caching and hybrid BPTT; RRET is thus a new member of this family.


5 Discussion

RRET reframes credit assignment as an event-driven process: memory tokens accumulate credit only when they influence future computation.
By updating traces on retrieval rather than on every timestep, the model aligns temporal credit with semantic relevance — an efficiency principle absent from standard BPTT.

Limitations include sensitivity to critic quality and additional hyperparameters (decay \(\lambda\), compression rank \(r\)).
Future work includes multi-layer variants, stochastic-key training, and integration with differentiable memory systems or agents operating in continuous environments.


6 Conclusion

Read-Refreshed Eligibility Traces bridge the gap between truncated and full BPTT.
They provide exact gradients where possible, approximate gradients where necessary, and biologically inspired long-range credit assignment everywhere else.
This hybrid makes recurrent transformers scalable to arbitrarily long horizons while maintaining gradient flow aligned with their learned attention patterns.


References