Chapter 7. Memory and state architecture

Memory is the structural mechanism by which an agent’s state outlives a single reasoning step. Without it, the agent collapses to a stateless responder; with it, the agent acquires the second of its four defining properties (Chapter 2). Memory is also the part of an agentic system where the design surface is largest and the standard advice is thinnest. This chapter develops memory architecturally, taking the three-tier model (working, episodic, semantic) as a starting point and turning it into a set of architectural commitments.

The chapter’s central position is that memory is a system, not a feature. Treating memory as “the conversation history we pass to the model” produces systems that confuse caching for state, leak across sessions, exhaust their context windows, and make incidents hard to debug. Treating memory as a system means committing to lifecycle, retrieval policy, scoping, validation, and observability, the same discipline that any persistence layer in a non-agentic system requires, with additional concerns specific to probabilistic consumers.

The three memory tiers

The book uses the now-standard tripartite model (also used by Gulli 2025, the CSIRO catalog, and most production systems): working memory, episodic memory, and semantic memory. Their architectural responsibilities are distinct and should be implemented distinctly even when a single store backs all three.

Working memory

State that is task-scoped: relevant only for the duration of the current reasoning loop. Examples: the current plan, intermediate tool results, the running list of attempted strategies, scratchpad notes.

Architectural commitments:

Episodic memory

Persistent record of past tasks, outcomes, and notable observations. The agent’s experience over time.

Architectural commitments:

Semantic memory

Structured domain knowledge: facts, rules, ontologies, schemas, learned constraints, organizational policy.

Architectural commitments:

Semantic memory needs structure, not only curation. A vector index answers similarity queries over text, but an agent reasoning about a business has to know what the business’s concepts are and how they relate: that a tenant contains workspaces, that a workspace contains users, that users emit events. That governed map of the domain is an ontology, and its physical form is a knowledge graph the agent can traverse deterministically (its construction is developed in Chapter 8). Without it, the agent infers business structure by fuzzy-matching text chunks, guessing at relationships the organization has already defined precisely. The ontology gives semantic memory a deterministic blueprint, so the agent’s understanding of the domain is governed rather than improvised; the same principle, applied to business metrics rather than entities, reappears as the semantic layer of Chapter 14.

The three tiers are not arranged by importance; they are arranged by lifecycle. Working memory exists for the duration of a single task (milliseconds to minutes of active work, longer when a task suspends to wait on an approval or a slow tool); episodic memory persists anywhere from a single session to indefinitely; semantic memory persists for the system’s operational lifetime. Architectural decisions about retrieval, scoping, and governance follow from the lifecycle.

A boundary worth drawing now: memory is state the system owns and curates, not all data it can reach. Semantic memory holds facts the agent or fleet maintains, “this user prefers Python,” “service X has a Tuesday maintenance window,” a validated policy. Searching a corporate wiki, a ticketing system, or the open web is not memory; it is a read-only tool call (Chapter 9) against a source the system does not own. The distinction is architectural: memory has a write path, a curation process, and a retention policy the system is responsible for, while an external search has none of these. Conflating the two leads architects to try to subsume all enterprise data into the memory layer. RAG is a retrieval mechanism that can serve either; what makes something memory is ownership and curation, not the fact that it was retrieved.

The memory architecture diagram

Figure 4. The memory architecture diagram

Three observations about this diagram:

  1. The gateway is the only path. The agent does not query stores directly; it requests memory through a gateway that enforces read policy (what is visible to this agent in this context), retrieval ranking (relevance, recency, authority), scoping (identity, tenant), and redaction (sensitive content removed or summarized).

  2. Memory writes go through governance (Chapter 6). A memory write is an action with policy implications. Writes are validated against schemas, gated by policy (e.g., do not memorize PII), and where appropriate, audited.

  3. Reads can be retrieval calls. Retrieval-augmented generation (RAG) is one realization of the gateway’s read path. The architectural commitments are the same whether the retrieval is over episodic memory, semantic memory, or external corpora.
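The three observations can be made concrete with a minimal sketch of a gateway. Everything named here is hypothetical and chosen for illustration: the MemoryGateway and MemoryEntry classes, the SSN-like pattern standing in for a write policy, the email pattern standing in for redaction, and the additive ranking formula. A production gateway would delegate each of these to a real policy engine and retriever.

```python
import re
from dataclasses import dataclass, field

SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # assumed write-policy rule
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")   # assumed redaction rule


@dataclass
class MemoryEntry:
    tenant: str
    text: str
    authority: float   # 0..1, assumed to be assigned during curation
    age_days: float


@dataclass
class MemoryGateway:
    """Single read/write path; agents never touch the store directly."""
    _store: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def write(self, entry: MemoryEntry) -> bool:
        # Schema check, then policy check; every attempt is audited.
        valid = bool(entry.text.strip()) and 0.0 <= entry.authority <= 1.0
        allowed = valid and SSN_LIKE.search(entry.text) is None
        self.audit_log.append(("write", entry.tenant, allowed))
        if allowed:
            self._store.append(entry)
        return allowed

    def read(self, tenant: str, query: str, limit: int = 3) -> list:
        terms = set(query.lower().split())
        # Scoping: only the caller's tenant is visible.
        visible = [e for e in self._store if e.tenant == tenant]

        def score(e: MemoryEntry) -> float:
            relevance = len(terms & set(e.text.lower().split()))
            recency = 1.0 / (1.0 + e.age_days)
            return relevance + recency + e.authority

        ranked = sorted(visible, key=score, reverse=True)[:limit]
        # Redaction happens on the way out, after ranking.
        return [EMAIL.sub("[redacted]", e.text) for e in ranked]


gw = MemoryGateway()
gw.write(MemoryEntry("acme", "deploy window is Tuesday, contact ops@acme.example", 0.9, 2))
gw.write(MemoryEntry("globex", "deploy window is Friday", 0.9, 1))
gw.write(MemoryEntry("acme", "customer SSN 123-45-6789", 0.5, 0))  # rejected by policy
results = gw.read("acme", "deploy window")
```

The design point is that scoping, ranking, and redaction sit in one place the agent cannot bypass, so a change in policy is a change to the gateway rather than to every prompt.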

Memory patterns

The architecturally significant memory patterns are cataloged below, with cross-references to canonical sources for full treatment.

Working-memory scratchpad

The within-loop record of the agent’s reasoning state. In modern reasoning models much of this is internal to the model call. The architectural commitment is observability: the trace must capture the agent’s working memory transitions so that runs can be replayed and debugged.

Read more. Anthropic Building Effective Agents (on plan persistence); Gulli (working memory chapter).

Episodic-memory accumulation

Persistent record of tasks and outcomes. The architectural questions are what is recorded (raw transcripts or summaries), when it is recorded (continuously or at episode boundaries), how it is retrieved (by similarity, by recency, by tag), and for how long it is retained.

The most common production failure is recording too much without curation: the episodic store accumulates noise, retrieval surfaces irrelevant material, and the agent’s behavior degrades over time as the store grows. The architectural answer is summarization at the episode boundary: each episode is summarized by a fixed, repeatable process (often a model call against a strict prompt and schema, which constrains the output even though the model itself is probabilistic), and the summary, not the raw transcript, becomes the retrievable artifact. The raw trace is kept for audit but not surfaced to the agent.
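A minimal sketch of episode-boundary summarization follows. The schema fields (task, outcome, lessons), the EpisodicStore class, and the summarize callable are assumptions made for illustration; in practice summarize would wrap a model call, which is why its output is validated before it becomes retrievable.

```python
from dataclasses import dataclass, field

# Assumed summary schema: field name -> required type.
REQUIRED_FIELDS = {"task": str, "outcome": str, "lessons": list}


@dataclass
class EpisodicStore:
    retrievable: list = field(default_factory=list)  # summaries the agent may see
    audit: list = field(default_factory=list)        # raw traces, never surfaced


def validate_summary(summary: dict) -> dict:
    """Reject any summary that does not match the schema exactly."""
    if set(summary) != set(REQUIRED_FIELDS):
        raise ValueError(f"unexpected fields: {sorted(summary)}")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(summary[name], expected):
            raise ValueError(f"{name} must be {expected.__name__}")
    return summary


def close_episode(store, episode_id, transcript, summarize):
    # Keep the raw trace first, so it survives a failed summarization.
    store.audit.append((episode_id, list(transcript)))
    summary = validate_summary(summarize(transcript))
    store.retrievable.append({"episode_id": episode_id, **summary})
    return summary


def fake_summarizer(transcript):
    # Stand-in for a model call against a strict prompt.
    return {"task": transcript[0], "outcome": transcript[-1], "lessons": []}


store = EpisodicStore()
close_episode(store, "ep-1",
              ["rotate API key", "tool call failed", "retried, succeeded"],
              fake_summarizer)
```

Writing the raw trace before validating is a deliberate choice in this sketch: a malformed summary is rejected, but the audit record is not lost with it.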

Read more. Letta (formerly MemGPT); Gulli (memory management chapter).

Semantic-memory curation

Domain knowledge stored explicitly. Often realized as a vector index, a knowledge graph, a database, or a combination. The architectural commitments concern curation (how content enters the store, by what process, with what review) and retrieval correctness (how the store ensures that the agent receives accurate, up-to-date information).

The architectural pitfall is the “build a vector index and call it semantic memory” failure. A vector index without curation is a search engine over whatever was ingested, with all the failure modes of that ingestion. Real semantic memory is curated; the curation process is documented; the freshness of entries is tracked; superseded entries are explicitly retired. The ingestion pipeline that performs that curation, along with redaction, identity tagging, lineage, and cache invalidation on the write path, is the subject of Chapter 8.
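Freshness tracking and explicit retirement can be sketched in a few lines. The Fact and CuratedStore names, the reviewed_on field, and the 90-day review window are illustrative assumptions, not a prescribed design; the point is only that a lookup returns the live entry together with a staleness signal, and that superseded entries are marked rather than silently overwritten.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    source: str                          # where the fact came from (lineage)
    reviewed_on: date                    # last time a curator confirmed it
    superseded_by: Optional[str] = None  # id of the replacing fact, if retired


class CuratedStore:
    def __init__(self, max_age: timedelta):
        self.max_age = max_age
        self.facts = {}  # fact_id -> Fact; entries are retired, never deleted

    def publish(self, fact: Fact) -> str:
        fact_id = f"f{len(self.facts)}"
        # Explicitly retire any live fact with the same key.
        for old in self.facts.values():
            if old.key == fact.key and old.superseded_by is None:
                old.superseded_by = fact_id
        self.facts[fact_id] = fact
        return fact_id

    def lookup(self, key: str, today: date):
        """Return (fact, is_stale) for the live entry, or None if absent."""
        for fact in self.facts.values():
            if fact.key == key and fact.superseded_by is None:
                return fact, (today - fact.reviewed_on) > self.max_age
        return None


curated = CuratedStore(max_age=timedelta(days=90))
old_id = curated.publish(Fact("service-x.maintenance", "Tuesday", "ops-runbook", date(2025, 1, 10)))
new_id = curated.publish(Fact("service-x.maintenance", "Thursday", "ops-runbook", date(2025, 6, 1)))
live, stale = curated.lookup("service-x.maintenance", today=date(2025, 7, 1))
```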

Read more. Gulli (Retrieval-Augmented Generation chapter); CSIRO (Hierarchical RAG).

Shared memory bus

Cross-agent shared state with versioning and access control. Used in multi-agent systems where coordination requires shared observation (a research team accumulating findings; a build pipeline sharing state across agents). The architectural commitments are:

Read more. Blackboard architectures (classical AI literature); Gulli (Inter-Agent Communication chapter).

Memory compaction

Summarization and pruning to keep memory bounded. The standard mechanism: when memory grows past a threshold, the system summarizes older entries (collapsing many turns into one) and discards the originals (or moves them to cold storage for audit). The architectural concerns are loss (what details are dropped), bias (the summarizer is itself a probabilistic component), and reproducibility (running an agent against compacted memory yields different behavior than against original memory).

By 2025–2026 the mature form of compaction is structured extraction rather than flat summarization. Instead of collapsing an episode into a paragraph of prose, the compaction process extracts typed facts (for example, user -> owns -> project-A or service-X -> maintenance-window -> Tuesday) and updates a knowledge graph (the approach taken by systems such as Mem0 and GraphRAG). Structured extraction is more reproducible than free-text summary, supports precise retrieval and targeted deletion, and degrades more gracefully: a dropped edge is a discrete, auditable loss rather than a silently omitted clause. Flat summaries remain useful for narrative context; graph extraction is the architecture of choice wherever the compacted memory must be queried or governed precisely.
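The difference between a flat summary and typed facts is easiest to see in code. The sketch below assumes the extraction step (typically a model call) has already produced triples; Triple, FactGraph, and the single_valued flag are hypothetical names, and this is not the API of Mem0 or GraphRAG. It shows only the update side: replacing an edge records exactly what was dropped.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class FactGraph:
    def __init__(self):
        self.edges = set()
        self.dropped = []  # every replaced edge is a discrete, auditable loss

    def upsert(self, t: Triple, single_valued: bool = True) -> None:
        if single_valued:
            # A single-valued predicate keeps one object per (subject, predicate).
            stale = {e for e in self.edges
                     if e.subject == t.subject and e.predicate == t.predicate and e != t}
            self.edges -= stale
            self.dropped.extend(sorted(stale))
        self.edges.add(t)

    def query(self, subject=None, predicate=None) -> list:
        return sorted(e for e in self.edges
                      if subject in (None, e.subject)
                      and predicate in (None, e.predicate))


g = FactGraph()
g.upsert(Triple("user", "owns", "project-A"), single_valued=False)
g.upsert(Triple("user", "owns", "project-B"), single_valued=False)
g.upsert(Triple("service-X", "maintenance-window", "Tuesday"))
g.upsert(Triple("service-X", "maintenance-window", "Thursday"))  # replaces Tuesday
window = g.query(subject="service-X")
```

Because each fact is an addressable edge, retrieval is an exact lookup rather than a similarity search, and the dropped list makes compaction loss inspectable.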

Read more. Anthropic on context engineering and conversation compaction; Gulli on Resource Optimization.

Persistent agent identity

The collection of preferences, policies, and stable knowledge that defines what the agent is, distinct from what it knows. Identity persists across sessions and survives memory compaction. It is small, curated, and rarely modified.

The architectural commitments are: identity changes are controlled (a special class of writes, often with approval); identity is loaded reliably at the start of every session; identity changes are logged.
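These three commitments can be expressed as a small guard around the identity record. The AgentIdentity class, its field names, and the approved_by parameter are assumptions for illustration; a real system would tie approval to its governance layer rather than to a string argument.

```python
from dataclasses import dataclass, field


@dataclass
class AgentIdentity:
    values: dict
    version: int = 1
    change_log: list = field(default_factory=list)

    def propose_change(self, key, new_value, approved_by=None):
        # Controlled write: no approver, no change. The attempt is still logged.
        if approved_by is None:
            self.change_log.append(("rejected", key, None, self.version))
            raise PermissionError("identity changes require an approver")
        self.values[key] = new_value
        self.version += 1
        self.change_log.append(("applied", key, approved_by, self.version))

    def load_for_session(self) -> dict:
        # Reliable load: fail loudly instead of starting without an identity.
        if not self.values:
            raise RuntimeError("identity missing; refusing to start session")
        return dict(self.values)  # copy, so a session cannot mutate identity


ident = AgentIdentity({"tone": "concise", "escalation": "ask before deleting data"})
try:
    ident.propose_change("tone", "verbose")
except PermissionError:
    pass  # unapproved change is refused and logged
ident.propose_change("tone", "formal", approved_by="ops-lead")
session_identity = ident.load_for_session()
```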

Read more. Letta on identity; CSIRO catalog on persistent persona.

Retrieval-augmented cognition

The on-demand injection of relevant memory into the reasoning loop. The pattern that subsumes most realizations of the read path. The architectural concerns are retrieval relevance (the retrieval system is itself a probabilistic component with its own failure modes), context budget (retrieved content competes for context-window space), and grounding (the agent should be able to attribute claims to retrieved content).
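Context budget and grounding can both be handled at the point where retrieved items are packed into the prompt. The sketch below is one simple greedy policy under stated assumptions: Retrieved, pack_context, and the one-token-per-word estimate are hypothetical, and a real system would use the model's tokenizer and its own relevance scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retrieved:
    source_id: str
    text: str
    score: float  # relevance as reported by the retriever (itself fallible)


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def pack_context(candidates, budget_tokens, min_score=0.0):
    """Greedy packing by score; returns the context block and the cited ids."""
    chosen, used = [], 0
    for item in sorted(candidates, key=lambda c: c.score, reverse=True):
        if item.score < min_score:
            break  # everything after this is below the relevance floor
        cost = estimate_tokens(item.text)
        if used + cost > budget_tokens:
            continue  # does not fit; a smaller item later might
        chosen.append(item)
        used += cost
    # Each line carries its source id so claims can be attributed.
    block = "\n".join(f"[{c.source_id}] {c.text}" for c in chosen)
    return block, [c.source_id for c in chosen]


candidates = [
    Retrieved("a", "service X maintenance window is Tuesday", 0.9),
    Retrieved("b", "full transcript of last week's incident review with every tool call", 0.8),
    Retrieved("c", "user prefers Python", 0.5),
    Retrieved("d", "office plants were watered", 0.1),
]
block, cited = pack_context(candidates, budget_tokens=10, min_score=0.3)
```

Prefixing each item with its source id is what makes grounding checkable: an answer can be required to cite ids that appear in the returned list.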

Read more. Anthropic Building Effective Agents; Gulli (RAG chapter); the body of academic literature on RAG.

Memory and context-window pressure

The context window remains a binding constraint despite expansion to millions of tokens in 2026-class models. Three architectural facts:

  1. Cost is no longer the primary constraint. Provider-side prompt caching, standard across the major APIs since late 2024, makes resending a large static block (semantic memory, tool documentation, a stable system prompt) much cheaper on subsequent turns, as long as the cached prefix stays warm. Cache entries carry a write premium and expire on a time-to-live, so a turn after expiry re-pays the write. Long context still costs more than short, especially for the dynamic, per-turn portion that cannot be cached, but cost is no longer the main reason to keep context lean.

  2. Attention is the primary constraint. Models attend non-uniformly across long contexts; relevant content buried in a long window, particularly in its middle, is processed less effectively than the same content in a short one. Even a multi-million-token window that is cheap to fill degrades the model’s reasoning about the immediate task when most of it is irrelevant. The architecture’s job is to protect the model’s attention by giving it a short, relevant context, not merely to protect the bill. This is a signal-to-noise problem before it is a cost problem.

  3. Skills (Chapter 10) and progressive disclosure are responses to context pressure. Loading a skill on demand, when the agent’s task matches the skill’s description, gives the agent the full context it needs without paying for context it does not. The Skills layer is, in part, a memory-pattern realization.
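The caching economics in the first point can be worked through with a small cost model. The multipliers and time-to-live below (a 1.25x write premium, a 0.10x read price, a 300-second TTL that is refreshed on every use) are illustrative assumptions, not any provider's published pricing, and costs are expressed in units of the base per-token input price.

```python
def session_cost(turn_times, static_tokens, dynamic_tokens,
                 write_mult=1.25, read_mult=0.10, ttl_s=300.0):
    """Input cost of a session, in units of the base per-token price."""
    total = 0.0
    last_touch = None
    for t in turn_times:
        warm = last_touch is not None and (t - last_touch) <= ttl_s
        # Static prefix: cheap read if the cache is warm, otherwise re-pay the write.
        total += static_tokens * (read_mult if warm else write_mult)
        # Dynamic per-turn portion is never cached.
        total += dynamic_tokens
        last_touch = t  # assumption: each use refreshes the TTL
    return total


turns = [0, 60, 120, 1000]  # seconds; the last turn arrives after expiry
cached = session_cost(turns, static_tokens=10_000, dynamic_tokens=500)
uncached = len(turns) * (10_000 + 500)
```

Under these assumptions the four-turn session costs 29,000 token-units with caching against 42,000 without: two warm turns at 1,500 each, and two cold turns at 13,000 each. The idle gap before the last turn is what re-incurs the write premium, which is why the saving depends on turn cadence as much as on context size.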

The architectural commitment is to prefer rich tool-mediated retrieval over large static context: tools that fetch on demand, summaries promoted to retrieval entries, indexes maintained with curation discipline. The context window holds the agent’s working memory and the minimal retrieved material it needs now, not the agent’s entire history.

Memory governance

Memory is governed (Chapter 6) at three points:

The architectural pitfall is to treat memory governance as a feature of the data layer. It is part of the agentic system’s contract with its users and regulators, and it lives in the gateway alongside retrieval and scoping.

Right-to-be-forgotten deserves its own emphasis, because persistent memory generated by a non-deterministic model is uniquely hazardous under GDPR and CCPA. A model may infer and store a sensitive attribute about a user (a health condition, say) that the user never explicitly provided and that no one deliberately chose to retain. When that user invokes their right to erasure, “delete the conversation” is not enough: the inference may have propagated into episodic summaries and semantic-memory entries across tiers. The architecture must provide a deterministic purge API that, given an identity or an entity, removes every derived record of it from working, episodic, and semantic memory and their backups, and writes an audit record of the deletion. Graph-structured memory makes this tractable: erasing an entity means deleting its node and edges rather than re-summarizing every episode that might mention it. A system that cannot perform targeted deletion cannot meet its compliance obligations, whatever its privacy policy claims.
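A deterministic purge across tiers might look like the following sketch. It rests on two assumptions that the code makes explicit: every record was tagged with opaque entity ids at write time, and the MemoryTiers structure and purge_entity function are hypothetical names. Backups are out of scope here, and dropping a whole episodic summary that mentions the entity is a simplification.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class MemoryTiers:
    working: dict = field(default_factory=dict)   # task_id -> {"entities": set, ...}
    episodic: list = field(default_factory=list)  # dicts with an "entities" set
    edges: set = field(default_factory=set)       # (subject, predicate, object)
    audit: list = field(default_factory=list)


def purge_entity(mem: MemoryTiers, entity_id: str, requested_by: str) -> dict:
    # Working memory: drop any in-flight task state that references the entity.
    doomed_tasks = [t for t, state in mem.working.items()
                    if entity_id in state["entities"]]
    for t in doomed_tasks:
        del mem.working[t]

    # Episodic memory: drop every summary tagged with the entity.
    kept = [e for e in mem.episodic if entity_id not in e["entities"]]
    episodic_removed = len(mem.episodic) - len(kept)
    mem.episodic[:] = kept

    # Semantic memory: deleting the node means deleting all incident edges.
    doomed_edges = {e for e in mem.edges if entity_id in (e[0], e[2])}
    mem.edges -= doomed_edges

    # The audit record holds counts and the opaque id, not the erased content.
    record = {
        "entity_id": entity_id,
        "requested_by": requested_by,
        "at": datetime.now(timezone.utc).isoformat(),
        "working_removed": len(doomed_tasks),
        "episodic_removed": episodic_removed,
        "edges_removed": len(doomed_edges),
    }
    mem.audit.append(record)
    return record


mem = MemoryTiers()
mem.working["task-1"] = {"entities": {"user:42"}, "plan": "draft reply"}
mem.episodic.append({"entities": {"user:42", "project:A"}, "summary": "helped user with setup"})
mem.episodic.append({"entities": {"project:A"}, "summary": "project build fixed"})
mem.edges |= {("user:42", "has-condition", "asthma"),
              ("user:42", "owns", "project:A"),
              ("project:A", "uses", "service-X")}
record = purge_entity(mem, "user:42", requested_by="privacy-team")
```

The inferred edge (has-condition) is removed by the same traversal as the explicit one, which is the property that flat summaries cannot offer.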

Memory and identity

Multi-tenant agentic systems must enforce identity at the memory layer. The failure mode is severe: an agent serving one customer surfaces another customer’s data because the memory layer did not scope on identity. The architectural commitments:

This is the memory analog of the lethal-trifecta defense (Chapter 6): the architecture, not the prompt, enforces the scope.
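One way to make the architecture rather than the prompt enforce scope is to hand the agent a memory handle that is already bound to an authenticated tenant, so the tenant id is never a parameter the model can supply. The TenantMemory and ScopedMemory classes below are a hypothetical sketch of that idea; how the tenant id is authenticated is assumed to happen upstream.

```python
class TenantMemory:
    """Backend store; only the session layer calls for_tenant()."""

    def __init__(self):
        self._rows = []  # (tenant_id, key, value)

    def for_tenant(self, tenant_id: str) -> "ScopedMemory":
        if not tenant_id:
            raise ValueError("memory access requires an authenticated tenant id")
        return ScopedMemory(self, tenant_id)


class ScopedMemory:
    """The only memory interface the agent sees; scope is fixed at creation."""

    def __init__(self, backend: TenantMemory, tenant_id: str):
        self._backend = backend
        self._tenant_id = tenant_id

    def put(self, key: str, value: str) -> None:
        # Every write is stamped with the bound tenant, never a caller-supplied one.
        self._backend._rows.append((self._tenant_id, key, value))

    def get(self, key: str) -> list:
        # Every read filters on the bound tenant before matching the key.
        return [v for t, k, v in self._backend._rows
                if t == self._tenant_id and k == key]


backend = TenantMemory()
acme = backend.for_tenant("acme")
globex = backend.for_tenant("globex")
acme.put("billing-contact", "finance@acme.example")
globex.put("billing-contact", "ap@globex.example")

# Isolation is a deterministic assertion, independent of anything in the prompt.
assert acme.get("billing-contact") == ["finance@acme.example"]
assert globex.get("billing-contact") == ["ap@globex.example"]
```

The closing assertions are the kind of deterministic isolation check that belongs in a test suite: they hold or fail regardless of model behavior.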

Anti-patterns

Conversation-history-as-memory. Treating the model’s prior turns as the agent’s memory. Works for a single session, fails as the system grows: cross-session continuity, retrieval, scoping, and governance are all absent.

Vector-index-as-semantic-memory. Ingesting documents into a vector store without curation, freshness tracking, or authority. Produces retrieval that surfaces stale, incorrect, or irrelevant material with confidence.

Unbounded episodic accumulation. Recording every interaction without summarization. Memory grows; retrieval relevance degrades; the agent gets slower and worse.

Memory writes without governance. Memory updated directly by the agent without policy validation. PII gets memorized; incorrect facts get cached and surfaced confidently; behavior drifts.

Identity-less memory. Memory entries without identity scoping. The failure mode is multi-tenant data exposure, which is among the most severe incidents the system can suffer.

Testing implications

Memory testing has three classes:

Chapter 12 develops the testing framework; memory tests are among the most valuable because the failure modes are severe and the assertions are deterministic.

Summary

Memory is a tiered architecture (working, episodic, semantic), scoped and governed at the gateway. The Skills layer (Chapter 10) extends it with runtime-loaded semantic memory the agent consults on demand.