LLM, Context Window, and Why the Hell It Keeps Forgetting What You Told It

You spend an hour drilling project details into the model: your ideas, your vision, how you want things done. Files, dependencies, decisions. It nods along, responds on point.
Then — BAM, message 40 and it suggests exactly what you rejected on message 5.
Gotta keep a close eye on these LLMs, don’t ya!

Congratulations, you’ve hit the context window, which is everything the model “sees” while talking to you: your messages, its responses, system instructions, tool descriptions, contents of files it read. All of it one big block of text, measured in tokens.

But everything has a limit, even the context window. Some models now support very large context windows, up to around 1M tokens in certain API configurations. Sounds like a lot, until you realize: reading one 2,000-line file is ~20k tokens. Five files…and you’ve spent 100k on context.
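The arithmetic above is easy to sanity-check with a quick sketch. The `TOKENS_PER_LINE` value is just the rough ratio implied by "2,000 lines is ~20k tokens", not a measured number, and the function names are made up for illustration. Real tokenizers vary a lot by language and code style.

```python
# Rough context-budget arithmetic. TOKENS_PER_LINE is an assumed average
# taken from the "2,000 lines is ~20k tokens" estimate, not a measured value.
TOKENS_PER_LINE = 10
CONTEXT_WINDOW = 1_000_000  # the large-window case mentioned above


def estimate_tokens(line_count: int) -> int:
    """Estimate how many tokens a file of `line_count` lines costs."""
    return line_count * TOKENS_PER_LINE


def budget_used(file_line_counts: list[int]) -> float:
    """Fraction of the context window consumed by the given files."""
    total = sum(estimate_tokens(n) for n in file_line_counts)
    return total / CONTEXT_WINDOW


five_files = [2_000] * 5
total_tokens = sum(estimate_tokens(n) for n in five_files)
print(total_tokens)             # 100000
print(budget_used(five_files))  # 0.1
```

Five files is already a tenth of even the biggest window, and that's before the system prompt, tool descriptions, and the conversation itself.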

And an LLM remembers nothing between calls. Every time you send a message, the client (Claude Code, Cursor, ChatGPT) collects the full history and sends it to the model in one shot. The model reads, generates a response, and forgets everything. What looks like “memory” is the client’s memory, not the model’s. The client is just forwarding the history.

sequenceDiagram
    participant Client as Client<br/>(Claude Code / Cursor / ChatGPT)
    participant LLM as LLM<br/>(stateless)
    Note over Client: Stores history
    rect rgb(197, 224, 203)
        Note left of Client: Call 1
        Client->>LLM: system prompt + tools + project files + msg1
        LLM-->>Client: resp1
        Note right of LLM: Forgets everything
    end
    Note over Client: History: msg1 + resp1
    rect rgb(197, 224, 203)
        Note left of Client: Call 2
        Client->>LLM: system prompt + tools + project files + msg1 + resp1 + msg2
        LLM-->>Client: resp2
        Note right of LLM: Forgets everything
    end
    Note over Client: History: msg1 + resp1 + msg2 + resp2
    rect rgb(197, 224, 203)
        Note left of Client: Call 3
        Client->>LLM: system prompt + tools + project files + msg1 + resp1 + msg2 + resp2 + msg3
        LLM-->>Client: resp3
        Note right of LLM: Forgets everything
    end
    Note over Client: Stores everything
    Note over LLM: Stores nothing
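The same flow can be sketched in a few lines of code. This is a toy, not how any real client is implemented: `fake_llm` and `Client` are hypothetical names, and the "model" just reports how many items it was handed.

```python
from dataclasses import dataclass, field


def fake_llm(full_prompt: list[str]) -> str:
    """Stand-in for a stateless model: it only sees what this call passes in."""
    return f"resp (saw {len(full_prompt)} items)"


@dataclass
class Client:
    system_prompt: str
    history: list[str] = field(default_factory=list)

    def send(self, message: str) -> str:
        # Every call rebuilds the full prompt: system prompt + history + new message.
        prompt = [self.system_prompt, *self.history, message]
        response = fake_llm(prompt)
        # Only the client remembers; the model keeps nothing between calls.
        self.history += [message, response]
        return response


client = Client(system_prompt="system prompt + tools + project files")
print(client.send("msg1"))  # resp (saw 2 items)
print(client.send("msg2"))  # resp (saw 4 items)
print(client.send("msg3"))  # resp (saw 6 items)
```

The prompt grows with every call even though each new message is tiny, because the whole history rides along each time.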

Okay, so the full history is being sent. Then why does the model still “forget”?

There are three reasons for that:

1. Attention dilution. The model spreads attention across all tokens. With 1,000 tokens, each one carries weight. With 500,000, the weight is smeared thin. An important instruction is physically there in the context, but the model just isn’t looking at it. Just like how you stop noticing details in a huge document.
2. Lost in the middle. Models “remember” the beginning and end of context best. The middle is a blind spot. If your key decision landed at token 200,000 out of a million… geez, good luck with that.
3. Garbage accumulation. Every failed attempt, every file read, every long command output stays in context. After an hour of work, something like 60% of the window can be junk from failed tries. The model sees it all at once and can’t tell useful from trash.
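To put rough numbers on the attention dilution point, here is a toy softmax calculation. It assumes one "important" token scores a fixed margin above every other token, which is a big simplification: real attention is learned, per-head, and per-layer, so treat this as intuition only. The function name and the `score_advantage` value are made up for the example.

```python
import math


def weight_of_key_token(context_length: int, score_advantage: float = 2.0) -> float:
    """Softmax weight of one token whose score beats all others by a fixed margin.

    Toy assumption: every other token scores 0 and the key token scores
    `score_advantage`. Real attention is per-head and learned.
    """
    key = math.exp(score_advantage)
    others = (context_length - 1) * math.exp(0.0)
    return key / (key + others)


for n in (1_000, 500_000):
    print(n, f"{weight_of_key_token(n):.6f}")
```

Under these assumptions the key token's share drops by a factor of roughly 500 when the context grows from 1,000 to 500,000 tokens, even though its score didn't change at all.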

When the window fills completely, one of two things happens: either the oldest messages get trimmed or compaction starts its work. In Claude Code, it’s compaction: a separate LLM call that compresses old history into a short summary. But details are lost for good: exact values, specific lines of code, reasons why certain approaches were rejected — all gone.

The model can then confidently hallucinate from the fragments that are left. And the worst part: after compaction, it doesn’t know what it doesn’t know.
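A minimal sketch of what compaction does to history, under loud assumptions: `summarize` is a hypothetical placeholder for the separate LLM call, `count_tokens` is a crude word count rather than a real tokenizer, and the keep-the-last-N policy is invented for illustration. This is not Claude Code's actual algorithm.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def summarize(messages: list[str]) -> str:
    """Hypothetical summarizer; a real client would make a separate LLM call here."""
    return f"[summary of {len(messages)} earlier messages]"


def compact(history: list[str], max_tokens: int, keep_recent: int = 2) -> list[str]:
    """Replace everything except the last `keep_recent` messages with a summary
    once the history exceeds `max_tokens`."""
    total = sum(count_tokens(m) for m in history)
    if total <= max_tokens or len(history) <= keep_recent:
        return history
    split = len(history) - keep_recent
    old, recent = history[:split], history[split:]
    return [summarize(old), *recent]


history = [
    "use timeout 30 not 60",
    "rejected approach A because of X",
    "try approach B",
    "approach B works",
]
print(compact(history, max_tokens=10))
# ['[summary of 2 earlier messages]', 'try approach B', 'approach B works']
```

Notice what fell out: the exact timeout value and the reason approach A was rejected. The summary line is all the model gets from then on.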

OK, now we know why context falls apart. So… what to do about it?

For this one I have some rules:

1. Only add a file if it’s needed for the current task, not “just in case.” Extra context doesn’t help, it dilutes attention.
2. If an instruction needs to stick, put it in the client’s main config file: CLAUDE.md, .cursorrules, and similar. This way you don’t have to keep telling the model the same thing over and over again. You can restart sessions confidently, because the client will load those instructions again at the start of the next session. These files usually sit at the very beginning of the context. If that beginning doesn’t change, the server can reuse the work it already did (as long as the cache hasn’t expired yet), which saves time and money.
3. Switching tasks? Then start a new session. Noise from the previous task drags down the quality of answers on the next one.
4. Compaction isn’t necessarily bad. If you’re working on the same task and the conversation just got long, let compaction do its thing. The gist of the task stays in the summary. What’s lost are the dead ends, failed attempts, and specific details. But you already follow rule 2, so that’s fine!
5. The whole agent architecture (subagents, MCP, lazy-loaded tools, memory files) is a set of different ways around the same problem: context is finite and it rots. (More on this in MCP and agents.)
6. If you’re seeing weird behavior in a long session, that’s a symptom of context rot. Clean the context or start a new session.

But what about the cache, someone may ask?

This one gets confused a lot: the client (Claude Code, Cursor) still sends the full context over the network every single time. What’s cached isn’t the transfer itself; it’s the computation on the server side.

When the model receives 100k tokens, it doesn’t read them like text. It runs a massive amount of matrix calculations. The server checks what it already computed last time and only recalculates what’s new. Under the hood, providers can reuse already-computed parts of the prompt through prompt caching / KV-cache-like mechanisms. So the part of your context that didn’t change may not need to be recomputed from scratch every time.

flowchart TD

A["User / Client (Claude Code, Cursor)<br/>sends full context, every request over the network"] --> B["Server receives N tokens"]

B --> C{"Is cached prefix<br/>still valid? TTL not expired?"}

C -->|"No — expired or prefix changed"| F["Recompute context<br/>(full matrix pass)"]

C -->|"Yes — same prefix, still within TTL"| D["Reuse cached K/V<br/>for the unchanged prefix"]

D --> G["Compute only new tokens<br/>added after the cached prefix"]

F --> H["Generate response"]

G --> H

classDef client fill:#dee3c6,stroke:#758879,stroke-width:1.5px,color:#2d2d30
classDef server fill:#e3d7c6,stroke:#8b7e7a,stroke-width:1.5px,color:#2d2d30
classDef decision fill:#ddc6e3,stroke:#7e7a8b,stroke-width:1.5px,color:#2d2d30
classDef cached fill:#c5e0cb,stroke:#758879,stroke-width:1.5px,color:#2d2d30
classDef compute fill:#e3c6d2,stroke:#8b7a87,stroke-width:1.5px,color:#2d2d30
classDef response fill:#c5e0cb,stroke:#758879,stroke-width:1.5px,color:#2d2d30

class A client
class B server
class C decision
class D,G cached
class F compute
class H response
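The decision in the flowchart can be modeled with a small toy. Everything here is an assumption for illustration: `PrefixCache` is a hypothetical class, tokens are plain strings, and the 300-second TTL is just an example value. Real providers cache internal K/V tensors and have their own rules about what counts as a reusable prefix.

```python
import time
from typing import Optional


class PrefixCache:
    """Toy model of server-side prompt caching with a time-to-live (TTL)."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self.tokens: list[str] = []
        self.stored_at = float("-inf")  # nothing cached yet

    def tokens_to_compute(self, prompt: list[str], now: Optional[float] = None) -> int:
        """Return how many prompt tokens need fresh computation, then update the cache."""
        now = time.monotonic() if now is None else now
        reused = 0
        if now - self.stored_at <= self.ttl_seconds:
            # Count the shared prefix; the first differing token ends the reuse.
            for cached, incoming in zip(self.tokens, prompt):
                if cached != incoming:
                    break
                reused += 1
        self.tokens, self.stored_at = list(prompt), now
        return len(prompt) - reused


cache = PrefixCache(ttl_seconds=300)
base = ["sys", "tools", "file", "msg1"]
print(cache.tokens_to_compute(base, now=0))                       # 4
print(cache.tokens_to_compute(base + ["resp1", "msg2"], now=60))  # 2
print(cache.tokens_to_compute(base + ["resp1", "msg2", "resp2", "msg3"], now=1000))  # 8
```

The second call only pays for the two new items. The third call arrives after the TTL has passed, so the whole prompt is recomputed even though its prefix is unchanged. Changing anything near the start of the prompt has the same effect.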

This matters in practice, because providers implement caching differently. It’s worth knowing how long your provider keeps the cache alive, and keeping in mind that this can change over time. If it’s 5 minutes or 1 hour, staying active inside that window (instead of going for a walk) keeps the session cheaper. However, do not trade session quality for money. Planning, rereading your plans, and formulating your thoughts and ideas clearly matter much more, and are arguably cheaper, than trying to explain to the LLM that it is implementing the wrong thing and nudging it to do better.

Sometimes you think keeping the session going will keep your tokens cheaper. But if you keep going with wrong or malfunctioning context, it can cost you more than stopping, evaluating, and starting fresh.

One piece of advice for that: save your plans, exit the session, read them carefully, formulate your thoughts, and start a new session with a clean context from the beginning.

Everything has its own benefit and trade-off — we just need to find the right one.

Viktoria Neva
