Computing between models with residual coupling

Community Article Published May 19, 2026

Residual Coupling (RC) connects frozen language models through small learned bridge projections that inject corrective updates into each model's residual stream at intermediate layers, during a single parallel forward pass. No base weights are modified at any point. Across four domains, this consistently outperforms Mixture-of-Experts routing using the same frozen models. In the medical domain with three coupled models, it reduces perplexity by 80.7% against MoE's 0.5% and improves factual accuracy on TruthfulQA Health by 9 percentage points over the frozen baseline.

Fine-tuning a language model on domain data modifies its weights, and those modifications are the source of both the capability gain and the forgetting. The weight updates that encode new knowledge simultaneously reactivate whatever was memorized during pretraining, and no clean separation between the two effects exists. Liu et al. (2026) showed this directly: fine-tuning frontier models on Haruki Murakami's works caused them to reproduce verbatim passages from more than 30 unrelated copyrighted authors, with single spans exceeding 460 words. LoRA reduces parameter count but still modifies the model's internal logic and inherits the same forgetting properties. RC trains only the gap between two models rather than the weights of either.

The two-step paradigm


RC separates two things that a single trained model conflates. Frozen base models handle memorization: their weights define the boundaries of what the system can represent. Bridge projections handle relational alignment: they learn continuous linear maps between what the frozen models have separately encoded.

Constraining the bridges to linear maps is a design choice rather than a limitation. A non-linear bridge could memorize arbitrary input-output mappings, which makes any gain it produces difficult to attribute. A linear bridge can only navigate relationships that already exist in the geometric structure of the frozen representation spaces. This is tractable because independently trained transformers develop structurally compatible internal geometries (Huh et al., 2024; Kornblith et al., 2019): the relative positions of concepts are approximately preserved across models trained on different data, even when their coordinate systems differ.

During training, gate scalars learn to amplify projected components that produce consistent updates on both sides of the bridge and suppress components that appear on only one side. Because each model's private noise is statistically uncorrelated with the other model's representations, model-specific confabulation is suppressed without any explicit objective for doing so.

The residual stream as a communication channel

Every transformer layer adds its output to the hidden state it received rather than replacing it. At any layer the hidden state is a running accumulation from all preceding layers. RC places bridges at designated intermediate layers and performs the same additive operation across model boundaries.

For a unilateral bridge from specialist S into generalist G at layer l:

$$\Delta h_G = \sigma(\text{gate}) \cdot W_{S \rightarrow G} \cdot h_S$$

$$h_G \leftarrow h_G + \Delta h_G$$

The gate initializes at -2.0, giving sigmoid(-2) ≈ 0.12, so bridge contributions begin small and grow only as the training signal supports them. In the bidirectional case a return bridge runs simultaneously from G to S.
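
The update rule can be sketched in a few lines of plain Python. This is an illustration only: it uses lists instead of tensors, and the function names, the 2-dimensional vectors, and the swap matrix are all made up for the example rather than taken from the implementation.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bridge_update(h_g, h_s, w_s_to_g, gate_logit):
    """Return h_G + sigmoid(gate) * (W_{S->G} @ h_S) for plain Python lists."""
    gate = sigmoid(gate_logit)
    projected = [sum(w * x for w, x in zip(row, h_s)) for row in w_s_to_g]
    return [g + gate * p for g, p in zip(h_g, projected)]

# Gate value at initialization: sigmoid(-2.0), roughly 0.12
initial_gate = sigmoid(-2.0)

# Hypothetical toy values, chosen only to make the arithmetic easy to follow
h_g = [1.0, 0.0]                  # generalist hidden state
h_s = [0.0, 1.0]                  # specialist hidden state
w = [[0.0, 1.0], [1.0, 0.0]]      # projection that swaps the two coordinates
h_g_new = bridge_update(h_g, h_s, w, gate_logit=-2.0)
print(initial_gate, h_g_new)
```

At initialization the specialist's projected state is scaled by about 0.12 before being added, so the generalist's hidden state starts close to its frozen value.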

The return bridge matters beyond training stability. Without it, the training objective minimizes combined perplexity at the expense of the generalist's individual residual stream. In the coding experiment below, unilateral coupling achieves a workable fused perplexity of 15.15 but degrades the generalist's individual output to 32.11 against its frozen baseline of 16.68. The return bridge recovers the generalist's individual output to 11.29 while the fused output reaches 5.91. The feedback loop drives both residual streams toward a shared representation rather than trading one model's performance for the other's.

Model implementation

The implementation (three_qa.py) is built around AutoModelForCausalLM and a config dict. The same code runs across model sizes, depths, and topologies without architectural modification of the base models. Bridge layers are selected proportionally to model depth, and the proportional depth alignment in ResidualCoupler handles models with different layer counts.

MultiLatentBridge handles all coupling topologies by building a projection dict keyed on directed source-target pairs:

import torch
import torch.nn as nn

class MultiLatentBridge(nn.Module):
    """
    Communication topologies between N frozen models.
    Supports Unilateral, Star-Bilateral, and Multi-Bilateral connections.
    """
    def __init__(self, dim, num_models, mode):
        super().__init__()
        self.mode = mode
        self.projections = nn.ModuleDict()

        for i in range(num_models):
            for j in range(num_models):
                if i == j: continue
                # multi_unilateral: only specialists -> generalist
                if mode == "multi_unilateral" and i != 0: continue
                # star_bilateral: generalist as hub, no specialist-specialist links
                if mode == "star_bilateral" and (i != 0 and j != 0): continue
                self.projections[f"{j}_to_{i}"] = nn.Linear(dim, dim, bias=False)

        # One gate per directed link, all initialized to -2.0
        self.gates = nn.Parameter(torch.full((num_models, num_models), -2.0))

    def forward(self, h_list):
        new_h = [h.clone() for h in h_list]
        for i in range(len(h_list)):
            delta = 0
            for j in range(len(h_list)):
                key = f"{j}_to_{i}"
                if key in self.projections:
                    gate = 1.0 if "no_gate" in self.mode else torch.sigmoid(self.gates[i, j])
                    delta += self.projections[key](h_list[j]) * gate
            new_h[i] = new_h[i] + delta
        return new_h
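
To see which directed links each topology produces, the constructor's loop can be replayed without PyTorch. The sketch below copies only the filtering logic. The helper name `directed_links` is hypothetical, and the string "multi_bilateral" is an assumption: in the class above, any mode other than the two filtered ones yields the fully connected case.

```python
def directed_links(num_models: int, mode: str) -> list[str]:
    """List the "source_to_target" keys the constructor loop would create."""
    keys = []
    for i in range(num_models):          # i is the target model
        for j in range(num_models):      # j is the source model
            if i == j:
                continue
            if mode == "multi_unilateral" and i != 0:
                continue                 # only links into the generalist (index 0)
            if mode == "star_bilateral" and i != 0 and j != 0:
                continue                 # no specialist-to-specialist links
            keys.append(f"{j}_to_{i}")
    return keys

for mode in ("multi_unilateral", "star_bilateral", "multi_bilateral"):
    print(mode, directed_links(3, mode))
```

With three models this gives two links for multi-unilateral, four for star-bilateral, and six for the fully connected case.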

ResidualCoupler wraps the frozen models and runs them layer by layer, inserting bridges at designated indices. Proportional depth alignment at each layer synchronizes specialists of different depths. Vocabulary mismatches across tokenizers are handled by clamping input indices to each model's vocabulary size before the embedding lookup, keeping every specialist's forward pass valid without shared tokenizer alignment.

class ResidualCoupler(nn.Module):
    """Main engine wrapping multiple frozen models with learnable bridges."""
    def __init__(self, model_A, specialist_list, mode):
        super().__init__()
        self.mode = mode
        self.models = nn.ModuleList([model_A] + specialist_list)
        self.vocabs = [m.config.vocab_size for m in self.models]
        self.depths = [m.config.n_layer for m in self.models]
        self.bridges = nn.ModuleDict({
            str(l): LatentMoE(C["dim"], len(self.models)) if "moe" in mode
            else MultiLatentBridge(C["dim"], len(self.models), mode)
            for l in BRIDGE_LAYERS
        })
        self.final_mix = nn.Parameter(torch.zeros(len(self.models)))

    def forward(self, ids):
        pos = torch.arange(0, ids.size(1), device=ids.device).unsqueeze(0)
        h_list = [
            m.transformer.wte(torch.clamp(ids, 0, self.vocabs[i]-1)) + m.transformer.wpe(pos)
            for i, m in enumerate(self.models)
        ]
        curr_indices = [0] * len(self.models)
        L_A = self.depths[0]  # Generalist depth as reference

        for l in range(L_A):
            h_list[0] = self.models[0].transformer.h[l](h_list[0])[0]
            curr_indices[0] += 1

            # Proportional depth alignment for heterogeneous model pairs
            for i in range(1, len(self.models)):
                target_i = int((l + 1) * self.depths[i] / L_A)
                while curr_indices[i] < target_i:
                    h_list[i] = self.models[i].transformer.h[curr_indices[i]](h_list[i])[0]
                    curr_indices[i] += 1

            if str(l) in self.bridges and "logit_ensemble" not in self.mode:
                h_list = self.bridges[str(l)](h_list)

        max_v = max(self.vocabs)
        logits_list = []
        for i, m in enumerate(self.models):
            l_out = m.lm_head(m.transformer.ln_f(h_list[i]))
            if l_out.size(-1) < max_v:
                l_out = torch.cat([
                    l_out,
                    torch.full((*l_out.shape[:-1], max_v - self.vocabs[i]),
                               -1e4, device=l_out.device, dtype=l_out.dtype)
                ], dim=-1)
            logits_list.append(l_out)

        return logits_list[0], logits_list  # Generalist output steered by specialists
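
The depth alignment and the vocabulary clamp in `forward` are easier to follow with concrete numbers. The following sketch reproduces both in plain Python. The helper names, the 6-block and 24-block specialist depths, and the 50,000-token vocabulary are hypothetical values chosen for illustration.

```python
def specialist_schedule(generalist_depth: int, specialist_depth: int) -> list[int]:
    """Number of specialist blocks executed after each generalist block."""
    steps, done = [], 0
    for l in range(generalist_depth):
        # Same target formula as in forward(): int((l + 1) * depth_i / L_A)
        target = int((l + 1) * specialist_depth / generalist_depth)
        steps.append(target - done)
        done = target
    return steps

def clamp_ids(ids: list[int], vocab_size: int) -> list[int]:
    """Clamp token ids into [0, vocab_size - 1], mirroring the torch.clamp call."""
    return [min(max(t, 0), vocab_size - 1) for t in ids]

shallow = specialist_schedule(12, 6)     # hypothetical 6-block specialist
deep = specialist_schedule(12, 24)       # hypothetical 24-block specialist
clamped = clamp_ids([5, 50256, 49999], vocab_size=50000)
print(shallow, deep, clamped)
```

A shallower specialist advances on every other generalist step, a deeper one runs two blocks per step, and both finish all their blocks by the last generalist layer. Clamping keeps the embedding lookup valid, but any out-of-range id is mapped to the last row of the smaller vocabulary.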

The config dict makes it straightforward to swap in different model pairs or specialists:

CONFIGS = {
    "medical_multi": {
        "A": "gpt2",
        "B_list": ["microsoft/DialoGPT-small", "nrslearning/finetuned-gpt2-medical-QA"],
        "dataset": "lavita/ChatDoctor-HealthCareMagic-100k",
        "dim": 768, "layers": 12,
        "map": lambda x: (
            f"Patient: {x.get('instruction', '')[:200]} "
            f"Doctor: {x.get('output', '')[:200]}"
        )
    }
}

BRIDGE_LAYERS = [2, 4, 6, 8, 10] # Layers where cross-model communication occurs
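
As a small usage sketch, the "map" entry can be applied to a single record to see the training text it produces. The function below restates the lambda under a hypothetical name, and the record is made up for illustration. Its field names simply follow the keys that the lambda reads with `.get()`.

```python
# Same formatting rule as the "map" lambda, written as a named function.
def format_example(x: dict) -> str:
    return (
        f"Patient: {x.get('instruction', '')[:200]} "
        f"Doctor: {x.get('output', '')[:200]}"
    )

# Made-up record, not taken from the dataset
record = {
    "instruction": "I have had a mild headache for two days.",
    "output": "Stay hydrated and rest. See a clinician if it persists.",
}
text = format_example(record)

# Each field is truncated to 200 characters; missing fields become empty strings
long_text = format_example({"instruction": "a" * 500, "output": "b" * 500})
empty_text = format_example({})
print(text)
print(len(long_text))
```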

Results across domains

The experiments cover four domains, each pairing a GPT-2 generalist anchor with a domain specialist. Three are reported in the table below; the fourth, coding, is treated separately in the next section. The comparison throughout is bilateral RC against MoE routing with the same two frozen models: MoE has access to identical frozen parameters and represents the strongest conventional approach to combining them without modification.

| Domain | Frozen baseline (PPL) | MoE (PPL) | Bilateral RC (PPL) | Reduction vs. frozen baseline |
|---|---|---|---|---|
| Medical | 50.05 | 64.66 | 12.01 | 76% |
| Legal | 26.48 | 21.83 | 8.30 | 69% |
| Scientific | 28.54 | 26.85 | 17.51 | 39% |

The scientific domain shows the smallest gain because the specialist's training distribution overlaps substantially with the generalist's: the representational gap is smaller and the bridge has less corrective work to do. MoE is worse than the frozen baseline in the medical domain (64.66 against 50.05) and improves on it only modestly in the legal and scientific domains. Committing each token to a single expert loses the cross-model correction that RC preserves at the hidden-state level.

The coding experiment

The fourth domain tests an extreme alignment failure. CodeGPT-small-py uses a different tokenizer from GPT-2. On general evaluation text its frozen perplexity is approximately 7 million: the vocabulary mapping has broken down and the model produces near-random token sequences. This is an extreme case that would not arise in normal deployment, but it isolates the mechanism clearly.
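
For scale, it helps to recall how perplexity relates to per-token probability. The sketch below uses the standard definition (the exponential of the mean negative log-probability of the correct tokens) and compares the 7 million figure with uniform guessing. GPT-2's vocabulary size of 50,257 is used as the reference, on the assumption that the evaluation text is tokenized with GPT-2's tokenizer.

```python
import math

def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))

GPT2_VOCAB = 50257

# A model that guesses uniformly has perplexity equal to the vocabulary size
uniform_ppl = perplexity([1.0 / GPT2_VOCAB] * 100)

# Average probability per correct token that corresponds to a perplexity of 7 million
implied_prob = 1.0 / 7_000_000
broken_ppl = perplexity([implied_prob] * 100)

print(round(uniform_ppl), round(broken_ppl), round(broken_ppl / uniform_ppl))
```

Under that assumption, a perplexity of 7 million is more than a hundred times worse than uniform guessing, which is what one would expect when token ids are misaligned and the model puts its probability mass on the wrong tokens.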

| Condition | Perplexity |
|---|---|
| Frozen GPT-2 | 16.68 |
| Logit ensemble | 596.16 |
| MoE routing | 878.40 |
| Unilateral RC | 15.15 |
| Bilateral RC | 5.91 |

Every method that combines models at the output layer fails. Logit averaging and MoE both perform far worse than GPT-2 alone. Bilateral coupling reaches 5.91 by reading the specialist's hidden states before the output projection collapses the representation into an incoherent vocabulary distribution. The bridge captures useful latent signal while it is still on the specialist's internal manifold.

Steered individual outputs

In bilateral coupling, both models improve individually, not only through the fused output. The return bridge's effect is most visible in the coding domain: unilateral coupling degrades the generalist's individual output to 32.11 against its frozen baseline of 16.68, because the training objective optimizes the fused output at the expense of the generalist's residual stream. Bilateral coupling recovers the generalist's individual output to 11.29 while the fused output reaches 5.91.

The pattern holds across other domains. In the medical domain the specialist's individual perplexity drops from 317.89 to 22.59 under bilateral coupling. In the legal domain it drops from 44.44 to 10.84. In the scientific domain both models converge to near-identical individual perplexity (18.35 and 18.24) despite having been trained on different corpora. The feedback loop drives both residual streams toward a shared representation rather than trading one model's performance against the other's.

Factual accuracy

Perplexity measures language modeling quality but not factual accuracy, and a model could become more fluent at producing false statements. In the three-model medical experiment, TruthfulQA Health (MC1) tracks whether the perplexity improvements carry over to verifiable claims.

| Topology | PPL | TruthfulQA Health | vs. baseline |
|---|---|---|---|
| Frozen baseline | 57.08 | 16.36% | - |
| MoE | 56.80 | 20.00% | +3.6 pp |
| Multi-unilateral | 11.26 | 23.64% | +7.3 pp |
| Star-bilateral | 11.07 | 21.82% | +5.5 pp |
| Multi-bilateral | 11.02 | 25.45% | +9.1 pp |

MoE reduces perplexity by 0.5% and improves factual accuracy by 3.6 percentage points. Multi-bilateral reduces perplexity by 80.7% and improves accuracy by 9.1 points. The proposed mechanism is that each model's hallucinations are statistically uncorrelated: the bridge gates learn to amplify projections that produce consistent updates across both models and suppress projections that appear on only one side, where individually memorized confabulations live.
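
The two kinds of figures quoted here are computed differently: perplexity change is a relative reduction against the baseline, while accuracy change is an absolute difference in percentage points. A short check of the reported values, using only the numbers from the table:

```python
baseline_ppl, baseline_acc = 57.08, 16.36

# (perplexity, TruthfulQA Health accuracy in percent) per topology
topologies = {
    "MoE": (56.80, 20.00),
    "Multi-unilateral": (11.26, 23.64),
    "Star-bilateral": (11.07, 21.82),
    "Multi-bilateral": (11.02, 25.45),
}

# (relative perplexity reduction in %, accuracy gain in percentage points)
summary = {
    name: (round(100 * (1 - ppl / baseline_ppl), 1), round(acc - baseline_acc, 1))
    for name, (ppl, acc) in topologies.items()
}

for name, (ppl_drop, acc_gain) in summary.items():
    print(f"{name}: -{ppl_drop}% PPL, +{acc_gain} pp accuracy")
```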

Ablation: learned structure is required

The three-model medical experiment includes two ablation conditions that test whether the gains come from learned projection structure or merely from the presence of an additive bridge at a larger parameter count. The first freezes the projection matrices at random initialization and trains only the gate values. The second removes learned gates entirely, setting all gate values to 1.0.

| Condition | PPL | TruthfulQA Health |
|---|---|---|
| Frozen baseline | 57.08 | 16.36% |
| MoE | 56.80 | 20.00% |
| Multi-bilateral (trained) | 11.02 | 25.45% |
| Multi-bilateral (no gate) | 16.42 | 30.91% |
| Multi-bilateral (random projections) | 166.82 | 20.00% |

Random projections make things dramatically worse: 166.82 against a frozen baseline of 57.08. Trained gate values alone cannot recover the loss. The gains therefore require learned projection structure; they do not come from merely adding a gated additive pathway with more parameters.

The gate ablation is more nuanced. Removing learned gates worsens perplexity (16.42 against 11.02) while improving TruthfulQA accuracy (30.91% against 25.45%). The gate's stabilizing effect on the residual stream is more important in the three-model setting than in two-model experiments, where ungated bilateral marginally outperforms gated bilateral in some domains. The TruthfulQA divergence between the two conditions is an open question: the no-gate condition may allow more specialist signal to pass through, which helps factual accuracy while introducing enough distributional noise to raise perplexity.

Topologies


  • Unilateral: specialists inject into the generalist without return flow. The baseline for measuring what bidirectionality adds.
  • Star-bilateral: generalist and each specialist exchange bidirectional updates, specialists do not bridge each other. The generalist acts as a hub, pulling specialist knowledge in and returning corrective signal to each.
  • Multi-bilateral: all model pairs exchange bidirectional updates. Bridge parameter count scales as O(n(n-1)) in the number of models n, but inference latency does not, because all stacks run in parallel and latency is bounded by the slowest model's depth.
  • MoE (baseline): a learned router selects among specialist representations at each bridge layer, with approximately 2.3K parameters versus 4.7-14.2M for the RC topologies.

Scaling and modularity

Because all base models are frozen, adding a domain specialist means training bridge projections to a new frozen module while leaving existing bridges untouched. Removing a specialist means deactivating its bridges in reverse order of addition, returning the system to its prior state without retraining. At d = 768 with four bridge layers, the bilateral bridge between two 124M-parameter models adds approximately 4.7M parameters, under 2% of the combined frozen parameter count. Bridge training runs in a few thousand steps on a consumer GPU.
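
The parameter figures can be checked by hand. The sketch below counts only the bias-free d x d projection weights, one per directed link per bridge layer, and ignores the handful of gate scalars. The helper name is hypothetical, and the 124M figure per base model is taken from the text.

```python
def bridge_params(dim: int, num_links: int, num_bridge_layers: int) -> int:
    """Weights in bias-free dim x dim projections, one per directed link per layer."""
    return dim * dim * num_links * num_bridge_layers

d, layers = 768, 4

# Two models, bilateral: 2 directed links
two_model_bilateral = bridge_params(d, num_links=2, num_bridge_layers=layers)

# Three models, every pair linked in both directions: n(n-1) = 6 directed links
three_model_multi_bilateral = bridge_params(d, num_links=3 * 2, num_bridge_layers=layers)

frozen_total = 2 * 124_000_000           # two 124M-parameter base models
share = two_model_bilateral / frozen_total

print(two_model_bilateral, three_model_multi_bilateral, f"{share:.2%}")
```

This gives about 4.7M bridge parameters for the two-model bilateral case (roughly 1.9% of the frozen total) and about 14.2M for three fully connected models.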

This modularity opens a path for certain agentic workflows. Standard pipelines pass outputs between models as text, compressing each model's continuous intermediate representations into discrete token sequences at every handoff. RC operates on hidden states throughout a single parallel forward pass, preserving the geometric structure that token sequences discard. For tasks where the relevant signal depends on where a concept sits in representation space rather than how it is verbally expressed, the difference is substantial: any method that operates on the out-of-distribution specialist's output logits fails in the coding experiment, while RC reaches 5.91 by reading hidden states directly.

Conclusion

Maturana and Varela (1980) described biological systems as operationally closed: the system does not change to accommodate its input but transforms its input according to its own internal organization. A frozen transformer works the same way. Bridge injections are assimilated through existing weights rather than altering them, and the model that processes a bridge injection is the same model that processed the token before it. Catastrophic forgetting is not mitigated here because the condition that produces it is never present.

Mountcastle (1978) proposed that the neocortex follows an analogous organization, with functionally specialized columns operating independently yet producing unified perception through their lateral interactions. The training cycle for Residual Coupling follows the same pattern. Specialist columns/models are frozen independently, and each column/model encodes domain knowledge without exposure to the others. Bridges are then trained on the frozen ensemble, learning the relational operators that coordinate what the columns have separately memorized.

"The pure present is an ungraspable advance of the past devouring the future. In truth, all sensation is already memory." Henri Bergson, Matter and Memory (1896)

That line appears in a Haruki Murakami novel, which is almost certainly how it entered the training data of the models evaluated by Liu et al. (2026). The models in those experiments retrieved the text rather than composed it. Residual Coupling does not change what any model has memorized. It trains the map between what they have memorized separately.

Paper: SSRN-Elsevier 6746521 | Code: github.com/pfekin/residual-coupling

References

  • Huh, M., Cheung, B., Wang, T., and Isola, P. (2024). The Platonic Representation Hypothesis. arXiv:2405.07987.
  • Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. (2019). Similarity of neural network representations revisited. ICML.
  • Liu, X., Mireshghallah, N., Ginsburg, J.C., and Chakrabarty, T. (2026). Alignment whack-a-mole: finetuning activates verbatim recall of copyrighted books in large language models. arXiv:2603.20957.
  • Maturana, H. and Varela, F. (1980). Autopoiesis and Cognition: The Realization of the Living. Reidel.
  • Mountcastle, V.B. (1978). An organizing principle for cerebral function. In Edelman, G.M. and Mountcastle, V.B. (Eds.), The Mindful Brain. MIT Press.
