By turn 8 of a long agent loop, reasoning quality has usually already collapsed — not because the context window is full, but because nobody audited what was filling it.

The symptom is familiar: the first few turns are sharp, the agent makes good tool choices, then somewhere around turn 6 it starts ignoring constraints from the original task, repeating searches it already did, or producing confidently wrong plans. The window isn't overflowing. The model is just reasoning over noise.

This is fixable, but not with a bigger context window. It's fixable with accounting.

What's actually eating your context

The intuitive culprits are the system prompt and conversation history. Both are wrong. The system prompt is static. User and assistant messages are small. The thing that grows unbounded, per turn, is tool call results.

A single web-search result runs 2–4K tokens. A file read on a 400-line source file is 3K+. A database query that returns 50 rows can be 5K. None of these are bounded by your code unless you bound them explicitly, and most agent loops don't.

Here's a representative breakdown of an 8-turn loop where the agent searches, reads files, and plans:

Turn Message tokens Tool result tokens Running total
0 (system) 1,200 0 1,200
1 280 3,400 4,880
2 310 2,800 7,990
3 290 3,900 12,180
4 340 2,600 15,120
5 320 4,100 19,540
6 290 3,700 23,530
7 360 3,200 27,090
8 310 2,900 30,300

Messages account for ~3,700 tokens. Tool results account for ~26,600. That's 88% of the window spent on raw tool output that the model wades through on every subsequent turn.

Most frameworks never show you this. They report total tokens (if anything) and assume uniform growth. The outlier turns — the ones where a single tool call adds 4K of mostly irrelevant text — are exactly the turns that blow your reasoning budget.

Fitting in the window is not the same as reasoning well

The vendor pitch is that 1M-token windows make this a non-issue. They don't. They raise the cost ceiling and delay the reckoning.

The Chroma 2025 Context Rot report and the NoLiMa benchmark both show frontier models dropping measurably below their short-context baselines on controlled tasks once inputs cross 32K tokens — in some cases below 50%. These are benchmark numbers, not your production agent, but the direction is what matters: fitting in the window doesn't preserve reasoning quality.

The failure mode is the worst kind: silent. No error, no refusal, no obvious hallucination. Just confident, coherent output that's subtly wrong — the agent picks the second-best tool, forgets a constraint stated 20K tokens ago, or re-runs a search whose results are already in the history.

Caching makes the window cheap, not clean

The reflexive objection here: who cares how big the window gets? Re-sending the same history every turn is exactly what prompt caching is for. Anthropic and OpenAI both cache the stable prefix, you pay a fraction for the repeated tokens, and the cost problem evaporates.

It does evaporate. That's worth being precise about, because cost and reasoning get conflated constantly.

Because caching is a billing optimization, not a context one. On turn 8 the model still attends over all 26K tokens of tool output, cached or not. The cache changes what you pay to put noise in the window; it changes nothing about what the noise does to the model once it's there. The Context Rot and NoLiMa degradation curves are plotted against token count, not dollars — they land in the same place whether you paid full freight or cache rates.

So caching isn't an alternative to the accounting. It removes the one excuse you had for tolerating a bloated window — cost — and leaves the reason that actually matters fully intact. If anything it raises the stakes: cheap context is context you'll let rot without noticing.

The audit primitive: per-turn token delta

Total context size tells you when you're in trouble. Per-turn delta tells you which turn caused it. That's the actionable signal.

Every Claude and OpenAI response exposes a usage object. Most agent code checks it for errors and throws it away. Don't. The field that matters is input_tokens — the whole window the model just re-read, which is exactly what the rot curves are plotted against. Record it after each call, compute the delta from the previous call, and flag deltas that cross a threshold you picked deliberately.

One subtlety the wrapper has to get right: a fat tool result from turn N doesn't show up in usage until turn N+1's input_tokens, because the result is appended to history and re-sent on the next call. So the delta you observe names the previous turn's tool call, not the current one. Attribute it correctly, or you'll go bounding the wrong tool.

Drop-in ContextBudget wrapper
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol

from loguru import logger


class Usage(Protocol):
    input_tokens: int
    output_tokens: int


class CheckpointNeeded(Exception):
    """Raised when the projected next window would cross the budget ceiling."""


@dataclass
class ContextBudget:
    """Tracks per-turn window growth across an agent loop.

    The model re-reads the entire window on every call, so ``input_tokens`` is
    the quantity context rot is plotted against. Growth between two calls is
    attributed to the turn that produced it, not the turn that observes it.
    """

    max_total: int = 60_000
    spike_threshold: int = 4_000
    occupancy: list[int] = field(default_factory=list)
    spikes: list[int] = field(default_factory=list)

    def record(self, usage: Usage) -> None:
        """Record one call's usage and flag spikes or budget breaches.

        Args:
            usage: The ``usage`` object from a Claude or OpenAI response.

        Raises:
            CheckpointNeeded: When the projected next window exceeds ``max_total``.
        """
        current = usage.input_tokens
        prev = self.occupancy[-1] if self.occupancy else 0
        delta = current - prev
        self.occupancy.append(current)

        # Growth in this call's window was produced by the PREVIOUS turn — its
        # assistant message plus the tool result that turn triggered — so a
        # spike here names the turn before, not the one we just observed.
        if len(self.occupancy) > 1 and delta > self.spike_threshold:
            culprit = len(self.occupancy) - 2
            self.spikes.append(culprit)
            logger.warning(
                "context spike attributable to turn {turn}: +{delta} tokens",
                turn=culprit,
                delta=delta,
            )

        # The next call re-reads this window plus the output we just generated,
        # so project that forward before deciding whether to checkpoint.
        projected_next = current + usage.output_tokens
        if projected_next > self.max_total:
            raise CheckpointNeeded(
                f"projected_next={projected_next} spikes={self.spikes}",
            )

Wrap your agent's response handler with budget.record(response.usage) and you have a signal. The log line names the culprit turn directly — the one whose tool result bloated the window — not the turn that happened to observe the growth. When CheckpointNeeded fires, you branch to the recovery path below instead of letting the loop drift into degraded reasoning.

What to do when the budget trips

Three options. Two of them throw away information you can't get back.

Front-truncation — drop the oldest messages — is the silent default in LangChain's ConversationBufferWindowMemory and several other frameworks. It is the worst option. It deletes the task framing and original constraints, leaving the agent optimizing for whatever's in the last few turns. If you didn't configure your memory class explicitly, check what it does. You probably don't want it.

Naive summarization — collapsing the history into prose — preserves facts and destroys judgment. The agent ends up knowing a decision was made but not why it ruled out the alternatives. Factually accurate, strategically lost.

The third option is the same operation done with discipline: checkpoint-and-restart with a structured state object instead of prose. Call it structured summarization if you like — the point isn't to avoid summarizing, it's to summarize the things judgment depends on (decisions, rationale, ruled-out alternatives, active constraints) instead of flattening them into narrative. In longer-running loops you can pair it with the last N turns kept verbatim, so recent reasoning survives intact while older context is compressed to state. Send this exact prompt when the budget trips:

Checkpoint prompt — emit a state object, not a summary
You are about to hand off this task to a fresh context window.
Do not summarize. Emit a structured state object as JSON with:

- original_task: the original user request, verbatim
- decisions: list of {decision, rationale, alternatives_ruled_out}
- open_tasks: list of concrete next steps with their preconditions
- active_constraints: rules still in force (verbatim from earlier turns)
- known_facts: only facts the next turn cannot re-derive cheaply

Omit anything the next context can recover from tools.
Do not paraphrase constraints. Quote them.

Two design constraints make this work in practice.

First: at most one restart per task. Each pass compounds information loss. A summary of a summary is where agents go to die. If one restart isn't enough, the task should be split, not re-checkpointed.

Second: store the original task framing externally and re-inject it on every restart. Don't trust the model to carry it. The verbatim original prompt costs maybe 300 tokens; losing it costs the whole run.

Threshold calibration matters too. Tripping the checkpoint at 40% of your budget burns more tokens on the recovery call than just finishing the run would have. Set the threshold based on expected remaining turns, not current consumption. If you're typically done by turn 12 and you're at 60% on turn 4, you have a problem. At 60% on turn 10, you're fine.

Token delta is the signal; the structured checkpoint is the fix; caching is the thing that quietly removes your last excuse not to care. Add the ContextBudget wrapper to your agent loop today, run it against the last long-horizon task that gave you a bad result, and read off the turn the log blames for the spike — that's the tool call you need to bound.