Your model is deployed. Weights are frozen. A user sends a query about a domain your training data barely covered. The model does its best — a confident, mediocre answer.
What if the model could learn from the input before responding? Not retrieve relevant documents. Not generate a longer chain of thought. Actually update its own weights, right there in the inference path.
That is test-time training.
The Wall Between Training and Inference
The standard deep learning paradigm has two rigid phases. Phase one: train the model, update weights, iterate on your loss function. Phase two: freeze the weights, deploy, run forward passes. The model never changes again until you retrain.
This made sense when inference was a simple forward pass. But as context windows grow to 128K+ tokens and models are expected to handle distribution shifts in production, the frozen-weights assumption becomes a bottleneck.
TTT eliminates this wall. The core idea, introduced by Yu Sun et al. at UC Berkeley and later formalized at Stanford:
- The model’s hidden state is itself a small machine learning model (a linear layer or MLP)
- The update rule for this hidden state is a step of gradient descent on a self-supervised objective
- At test time, each input sequence becomes an unlabeled dataset that the inner model trains on
There are two nested loops (a code sketch follows this outline):

Outer loop (training time):
- Learn the self-supervised task
- Learn the initial parameters of the inner model
- → This is meta-learning: learning HOW to learn at test time

Inner loop (test time):
- For each input sequence, treat the context tokens as unlabeled data
- Run gradient descent on a reconstruction objective
- Produce weights W_1, W_2, ..., W_t (one per token)
- → The model becomes a different model for each input
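In code, the two loops look roughly like the sketch below. Everything here is illustrative rather than the papers' exact recipe: the inner model is a single weight matrix W0 with a reconstruction head recon_head, the inner objective is plain MSE, and the learning rate is a placeholder.

import torch

def inner_loop(W0, recon_head, tokens, lr_inner=0.01):
    """Test time: adapt a copy of the inner weights on the sequence's own tokens."""
    W = W0.clone()                            # start from the meta-learned init
    for x_t in tokens:                        # tokens: (seq_len, d_model)
        x_hat = recon_head(x_t @ W)           # self-supervised reconstruction
        loss = ((x_hat - x_t) ** 2).mean()
        # create_graph=True keeps the update differentiable for the outer loop
        grad = torch.autograd.grad(loss, W, create_graph=True)[0]
        W = W - lr_inner * grad               # one gradient step per token
    return W

def outer_loop(W0, recon_head, data_loader, optimizer, task_loss_fn):
    """Training time: learn the init W0 and the self-supervised head themselves."""
    for tokens, targets in data_loader:
        W_adapted = inner_loop(W0, recon_head, tokens)
        preds = tokens @ W_adapted            # use the adapted inner model
        loss = task_loss_fn(preds, targets)
        optimizer.zero_grad()
        loss.backward()                       # backprop THROUGH the inner updates
        optimizer.step()

The outer optimizer holds W0 and the reconstruction head's parameters (for example torch.optim.Adam([W0, *recon_head.parameters()])), so what gets meta-learned is the starting point and the self-supervised task itself.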
TTT Layers: Attention’s Replacement
The 2024 NeurIPS paper “Learning to (Learn at Test Time)” introduced TTT layers as a drop-in replacement for self-attention. Two variants:
TTT-Linear: The inner model is a simple linear layer. Fast, memory-efficient, and the default choice for most applications.
TTT-MLP: The inner model is a two-layer MLP. More expressive but requires storing intermediate activations for backpropagation through the inner loop.
The critical performance characteristic is per-token decode complexity:
- Self-attention: O(n), cost grows with context length
- Mamba (SSM): O(1), constant cost but a fixed-capacity state
- TTT-Linear: O(1), constant cost with a learnable state
- TTT-MLP: O(1), constant cost with a more expressive state
At 128K context, TTT-E2E (the end-to-end formulation from December 2024) is 2.7x faster than full attention. But the real differentiator is not speed — it is scaling behavior. Mamba stops improving after 16K tokens of context. TTT layers keep reducing perplexity as context grows, matching Transformers but with linear complexity.
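A quick way to see why the constant-size state matters for memory: the sketch below compares the decode-time state of a hypothetical Llama-7B-like transformer (full multi-head KV cache, fp16) against an assumed TTT-Linear state of one head_dim x head_dim inner weight per head per layer. The shapes and counts are illustrative assumptions, not measured numbers from any released model.

# Back-of-the-envelope decode-state sizes (assumed 7B-like config: 32 layers,
# 32 heads of dim 128, fp16). Only the scaling behavior matters here.
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    # Keys and values cached for every past token in every layer
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len

def ttt_linear_state_bytes(n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    # Assumed: one head_dim x head_dim inner weight matrix per head per layer,
    # independent of context length
    return n_layers * n_heads * head_dim * head_dim * dtype_bytes

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: KV cache {kv_cache_bytes(ctx) / 2**30:5.1f} GiB, "
          f"TTT-Linear state {ttt_linear_state_bytes() / 2**20:.0f} MiB (constant)")

With grouped-query attention the cache is smaller, but it still grows linearly with context; the TTT state does not.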
# Conceptual TTT layer forward pass (simplified)
# The hidden state IS a model, updated by gradient descent
import torch
import torch.nn as nn
import torch.nn.functional as F

class TTTLayer(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        # Inner model: a linear layer that serves as the hidden state
        self.W = nn.Linear(d_model, d_model)
        # Self-supervised reconstruction head
        self.reconstruct = nn.Linear(d_model, d_model)
        self.lr_inner = 0.01  # Inner-loop learning rate

    def forward(self, x_sequence):
        outputs = []
        for x_t in x_sequence:  # x_sequence: (seq_len, d_model)
            # 1. Self-supervised loss: reconstruct the current token
            x_hat = self.reconstruct(self.W(x_t))
            loss = F.mse_loss(x_hat, x_t.detach())
            # 2. One gradient-descent step on the inner model
            params = list(self.W.parameters())
            grads = torch.autograd.grad(loss, params)
            with torch.no_grad():
                for p, g in zip(params, grads):
                    p -= self.lr_inner * g  # Weight update at test time
            # 3. Use the updated inner model for the output
            outputs.append(self.W(x_t))
        return torch.stack(outputs)
This is simplified — the actual implementation batches the inner gradient computation and uses mini-batch TTT for efficiency. But the key insight is visible: the forward pass includes a backward pass. Your inference pipeline just became a training pipeline.
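As a flavor of that batching, here is a chunk-wise sketch (an assumed simplification, not the paper's exact formulation): gradients for a whole chunk of tokens are taken against the same inner weights, so they reduce to one batched forward pass, and the inner model takes one update per chunk instead of one per token.

# Mini-batch TTT sketch: one inner update per chunk of tokens.
# Assumes the TTTLayer fields (W, reconstruct, lr_inner) defined above.
def forward_minibatch(self, x_sequence, chunk_size=16):
    outputs = []
    for chunk in x_sequence.split(chunk_size):        # (chunk_size, d_model) slices
        # All tokens in the chunk share the same base weights, so the
        # per-token losses can be computed in a single batched pass
        x_hat = self.reconstruct(self.W(chunk))
        loss = F.mse_loss(x_hat, chunk.detach(), reduction="sum")
        params = list(self.W.parameters())
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= self.lr_inner * g                # one update per chunk
        outputs.append(self.W(chunk))                 # simplification: whole chunk
    return torch.cat(outputs)                         # uses the post-update weights

You could bind this to the sketch above with TTTLayer.forward = forward_minibatch. The trade-off is update granularity versus parallelism: larger chunks mean fewer sequential gradient steps and better GPU utilization, at the cost of a slightly staler inner model within each chunk.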
Reinforcement Learning at Test Time
TTT layers use self-supervised learning inside the model. Two recent papers take a different approach: applying RL to entire models at test time.
TTRL (Test-Time Reinforcement Learning) uses an elegant trick — run the model multiple times on the same problem, take a majority vote across outputs, and use that consensus as a reward signal. No labels needed. On Qwen-2.5-Math-7B, TTRL boosts pass@1 accuracy by 211% on AIME 2024 competition problems.
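The reward construction is simple enough to sketch. Assuming final answers have already been extracted from each rollout as strings (the real pipeline normalizes math answers more carefully), a majority vote supplies the pseudo-label and each rollout is scored against it:

from collections import Counter

def majority_vote_rewards(sampled_answers):
    """sampled_answers: final answers extracted from N rollouts of the same prompt.
    Returns the consensus answer and a 0/1 reward per rollout."""
    consensus, _ = Counter(sampled_answers).most_common(1)[0]
    rewards = [1.0 if ans == consensus else 0.0 for ans in sampled_answers]
    return consensus, rewards

# Example: three of five rollouts agree on "42", so those three get reward 1,
# and the rewards feed a standard policy-gradient update on the same model.
consensus, rewards = majority_vote_rewards(["42", "42", "17", "42", "9"])
print(consensus, rewards)   # 42 [1.0, 1.0, 0.0, 1.0, 0.0]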
TTT-Discover goes further. It casts each individual test problem as its own RL environment, and the model's policy is adapted on the fly using problem-specific reward functions. The results are striking: new state-of-the-art results across mathematics (improving the best known bound for Erdős' minimum overlap problem), GPU kernel generation (up to 2x faster than prior art), competitive programming (beating results from past AtCoder contests), and single-cell biology analysis.
TTT taxonomy — what changes at test time:
- Standard inference → Frozen weights, forward pass only (GPT-4, Llama)
- Test-time compute → Frozen weights, longer reasoning (o1, DeepSeek-R1)
- TTT layers → Hidden-state weights update via self-supervised learning
- TTRL → Full model weights update via RL with majority-vote rewards
- TTT-Discover → Full model weights update via RL per individual problem
These are not competing approaches. Test-time compute (chain-of-thought) changes how long the model thinks. Test-time training changes the model itself. You can combine them.
GPU Infrastructure Implications
If you manage inference infrastructure, TTT changes your assumptions.
Standard inference is memory-bandwidth-bound. The GPU's job is to stream weight matrices from HBM into on-chip SRAM as fast as possible for each token. You optimize for batch size, KV-cache management, and memory layout.
TTT inference adds gradient computation to every forward pass. That shifts the bottleneck toward compute. For each token, the GPU must:
- Run the forward pass through the inner model
- Compute the self-supervised loss
- Backpropagate through the inner model
- Update the inner model’s weights
- Run the final output pass
This means more FLOPs per token, but TTT-E2E compensates by using standard MLPs as inner models. No custom CUDA kernels are needed: you can shard across GPUs with existing tensor-parallelism strategies, a significant advantage over Mamba 2 and Gated DeltaNet, which require specialized kernels for efficient memory I/O.
The infrastructure trade-off:
| Metric | Standard Transformer | TTT-E2E |
|---|---|---|
| Decode complexity | O(n) per token | O(1) per token |
| FLOPs per token | Lower | Higher (gradient step) |
| Memory overhead | KV-cache grows with context | Fixed-size inner model |
| Training cost | 1x baseline | ~3.4x (meta-learning) |
| Custom kernels needed | No | No |
| Parallelism strategy | Standard | Standard |
For short contexts (< 8K tokens), standard attention is still faster. TTT’s advantage kicks in at long contexts where attention’s O(n) decode becomes the bottleneck. The crossover point depends on model size and hardware — benchmark before committing.
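One rough way to locate your own crossover is to time a single decode step at increasing context lengths for both layer types. The harness below is a sketch along those lines, using PyTorch's built-in multi-head attention and the toy TTTLayer from earlier as stand-ins for production kernels, on CPU and without a real KV-cache implementation, so only the trend is meaningful.

# Toy crossover benchmark: per-token decode latency vs. context length.
# Stand-in modules only; real serving stacks use fused kernels and paged KV caches.
import time
import torch
import torch.nn as nn

d_model, n_heads = 1024, 16
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
ttt = TTTLayer(d_model)                       # toy layer defined earlier in this post

def time_attention_decode(ctx_len, iters=10):
    q = torch.randn(1, 1, d_model)            # one new token
    kv = torch.randn(1, ctx_len, d_model)     # stand-in for a cached context
    start = time.perf_counter()
    for _ in range(iters):
        with torch.no_grad():                 # plain forward pass
            attn(q, kv, kv, need_weights=False)
    return (time.perf_counter() - start) / iters

def time_ttt_decode(iters=10):
    x = torch.randn(1, d_model)               # one new token, no cache needed
    start = time.perf_counter()
    for _ in range(iters):
        ttt(x)                                # forward pass + inner gradient step
    return (time.perf_counter() - start) / iters

for ctx in (1_024, 8_192, 65_536):
    print(f"ctx={ctx:>6}: attention {time_attention_decode(ctx) * 1e3:7.2f} ms/token, "
          f"TTT {time_ttt_decode() * 1e3:7.2f} ms/token")

On a GPU you would also want torch.cuda.synchronize() around the timers and a realistic batch size, since batching changes where the crossover lands.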
The Bigger Picture
The industry is converging on a realization: inference is not just prediction anymore. Between chain-of-thought reasoning, retrieval-augmented generation, and now test-time training, the inference pipeline is becoming a complex computation graph with feedback loops, weight updates, and adaptive behavior.
For anyone building AI infrastructure, the question is no longer “how do I serve a frozen model efficiently?” It is “how do I orchestrate a system that learns, retrieves, reasons, and adapts — all within the latency budget of a single request?”
Test-time training is one piece of that puzzle. It is not production-ready for most teams today — the meta-learning training overhead is real, and the tooling is still research-grade. But the direction is clear: the wall between training and inference is coming down, and GPU infrastructure needs to be ready for it.
References
- Sun et al., “Test-Time Training with Self-Supervision for Generalization under Distribution Shifts” (ICML 2020)
- Sun, Li et al., “Learning to (Learn at Test Time): RNNs with Expressive Hidden States” (NeurIPS 2024)
- “End-to-End Test-Time Training for Long Context” (December 2024)
- Yuksekgonul, Koceja, Li et al., “Learning to Discover at Test Time” (January 2026)
- “TTRL: Test-Time Reinforcement Learning” (NeurIPS 2025)
- Dalal et al., “One-Minute Video Generation with Test-Time Training” (CVPR 2025)