Chapter 16: System architectures

Part IV synthesizes what Parts I–III developed. This chapter applies the patterns and architectural commitments of the earlier chapters to concrete system shapes. The point is not to produce reference architectures to be copied verbatim; it is to show what the discipline looks like in practice: how the bounding layer, governance layer, memory architecture, control structure, and Skills layer compose into systems that hold up under production conditions.

Six vignettes follow. Each describes the system at the architectural level: the agent’s role, the bounded surface, the governance structure, the memory model, the patterns selected, and the failure modes the architecture defends against. Framework-specific implementation is intentionally absent; the vignettes are about structure, not code.

The vignettes are:

  1. Autonomous research agent: a single agent with retrieval, reflection, and a strict output validator.

  2. AI coding system: an orchestrator with bounded specialist workers, sandboxed execution, and human-approval gates.

  3. Agentic customer support platform: a router-orchestrator with risk-based escalation and durable conversation memory.

  4. Autonomous operations controller: an event-driven agent with policy-gated actions on production infrastructure.

  5. Enterprise multi-agent knowledge platform: shared semantic memory, per-agent skills, and identity-scoped retrieval.

  6. Hybrid human–AI workflow system: long-running tasks with multiple approval checkpoints and replayable traces.

Vignette 1: Autonomous research agent

Purpose. Investigate an open-ended question across multiple sources; produce a structured report with citations.

Patterns selected. Single agent with tools. ReAct (Chapter 4) absorbed into a reasoning model. Reflection (Chapter 4) as an evaluator-optimizer pass before output. Retrieval-augmented cognition (Chapter 7).

Bounding (Chapter 5). Iteration limit on outer task (e.g., 30 reasoning steps); per-tool-call cost ceiling (the agent may run many cheap searches but few expensive ones); time budget (e.g., 5 minutes per question); action surface limited to search, fetch, read; no write actions; data scope per-session by default (no cross-session memory read unless the optional per-user episodic memory below is enabled).
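The bounding axes above can be made concrete with a small, framework-neutral sketch. The class name `Bounds`, the exception `BoundExceeded`, the cost units, and the specific ceilings are all assumptions for illustration; only the 30-step limit, the 5-minute budget, and the search/fetch/read surface come from the text. The ceiling is modeled here as a cumulative spend per tool, which is one possible reading of "many cheap searches but few expensive ones".

```python
import time
from dataclasses import dataclass, field


class BoundExceeded(Exception):
    """Raised when any bounding axis would be exceeded by the next call."""


@dataclass
class Bounds:
    max_steps: int = 30                 # iteration limit (from the text)
    max_seconds: float = 300.0          # 5-minute time budget (from the text)
    allowed_tools: frozenset = frozenset({"search", "fetch", "read"})
    # Hypothetical cumulative cost ceilings per tool, in arbitrary units.
    tool_cost_ceiling: dict = field(
        default_factory=lambda: {"search": 50.0, "fetch": 20.0, "read": 20.0}
    )
    steps: int = 0
    spent: dict = field(default_factory=dict)
    started: float = field(default_factory=time.monotonic)

    def charge(self, tool: str, cost: float) -> None:
        """Check every axis before a tool call runs; record it if allowed."""
        if tool not in self.allowed_tools:
            raise BoundExceeded(f"tool not in action surface: {tool}")
        if self.steps + 1 > self.max_steps:
            raise BoundExceeded("iteration limit reached")
        if time.monotonic() - self.started > self.max_seconds:
            raise BoundExceeded("time budget exhausted")
        total = self.spent.get(tool, 0.0) + cost
        if total > self.tool_cost_ceiling.get(tool, 0.0):
            raise BoundExceeded(f"cost ceiling reached for {tool}")
        self.steps += 1
        self.spent[tool] = total
```

The check runs outside the model: the agent loop calls `charge` before every tool invocation and stops when it raises, so the bounds hold regardless of what the model decides to do.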

Governance (Chapter 6). Schema validators on every search and fetch (URL allow-lists, format checks). Output validator on the final report (structured citations, fact-claim format, refusal of unsupported claims). No approval gates needed (no irreversible actions). Trace captured fully (Chapter 12).

Memory (Chapter 7). Working memory per session: the running set of sources read, notes taken, and intermediate conclusions. Episodic memory: per-user, optional, off by default; if enabled, summary of prior research scoped to the user. Semantic memory: a curated index of authoritative sources (organizational policy, prior reports) with explicit freshness tracking.

Skills (Chapter 10). Domain-specific research skills loaded on demand (legal-research skill, market-research skill, technical-research skill). Each skill carries source allowlists and citation-style requirements. Skill admission verifies that the skill’s declared tools match the agent’s authorized action surface.
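Skill admission as described here reduces to a set comparison. The following is a minimal sketch under stated assumptions: the `Skill` record, its field names, and the `admit_skill` function are hypothetical, and a real admission step would likely verify more than the tool list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    declared_tools: frozenset      # tools the skill says it needs
    source_allowlist: frozenset    # sources the skill may cite


def admit_skill(skill: Skill, authorized_surface: frozenset) -> None:
    """Reject a skill that declares any tool outside the agent's surface."""
    extra = skill.declared_tools - authorized_surface
    if extra:
        raise PermissionError(
            f"skill {skill.name!r} declares unauthorized tools: {sorted(extra)}"
        )


# Usage: the research agent's surface from this vignette.
SURFACE = frozenset({"search", "fetch", "read"})
legal = Skill("legal-research", frozenset({"search", "read"}), frozenset())
admit_skill(legal, SURFACE)  # passes silently
```

Failing closed at load time means a skill can narrow what the agent does but never widen it.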

Defended failure modes (Chapter 11). Tool injection (sanitized retrieval, refusal of attacker-controlled content treated as instruction); cost explosion (per-tool ceiling); premature termination (output validator checks completeness); unsupported claims (citation requirement in validator).

Compositional payoff. Most of the system’s reliability is in the validator and the bounded action surface. The agent is the search and synthesis engine; the architecture is the editor and fact-checker.

Vignette 2: AI coding system

Chapter 17 develops this vignette into a complete worked example, Concord, with concrete bounding YAML, gateway and pipeline pseudocode, policy tables, skill manifests, trace excerpts, and a failure-mode defense matrix. Readers wanting depth on the coding-system shape should consult Chapter 17 alongside the architectural sketch below.

Purpose. Take a coding task (feature request, bug report, refactor), produce a change to the codebase, run tests, propose a commit for review.

Patterns selected. Orchestrator–worker (Chapter 9): the orchestrator decomposes the task; workers specialize (planner, editor, tester, reviewer). Evaluator-optimizer (Chapter 4) between editor and tester. Saga / rollback (Chapter 9) for partial-failure recovery in multi-file changes.

Bounding (Chapter 5). Per-task iteration cap (orchestrator-level); per-worker iteration cap (within each worker’s loop); cost ceiling (a coding task has a defined economic value; set the ceiling accordingly). Action surface: sandboxed file write, sandboxed test execution, sandboxed search; no network, no process spawn; no writes outside the working directory; no commits or pushes without explicit human approval. Data scope is the user’s, not the agent’s: the agent reads the repository through the requesting user’s own authorization (an OIDC/OAuth token propagated into every tool call, never a god-mode service account), so it cannot read files the user could not read themselves.

Governance (Chapter 6). Schema validators on every tool call. Policy gates on file edits (no edits to security-sensitive files without approval; no edits to vendored dependencies; no edits violating the codebase’s lint or formatting rules). Approval gate before commit: human reviews the diff, the test results, and the agent’s reasoning trace. Rollback: if any worker fails, the sandbox is discarded; no partial state escapes.
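A policy gate on file edits can be sketched as a pure function from a path to a decision. The path patterns below are hypothetical placeholders, not conventions from any particular codebase, and the lint and formatting check is omitted; the sketch also denies paths that escape the working directory, which the text treats as a bounding rule.

```python
from enum import Enum
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical patterns; a real project would load these from policy.
SENSITIVE = ("auth/*", "*.pem", ".github/workflows/*")
VENDORED = ("vendor/*", "third_party/*")


def gate_edit(path: str) -> Decision:
    """Decide whether an edit to a repo-relative path may proceed."""
    p = PurePosixPath(path)
    # No writes outside the working directory.
    if p.is_absolute() or ".." in p.parts:
        return Decision.DENY
    rel = p.as_posix()
    # No edits to vendored dependencies.
    if any(fnmatchcase(rel, pat) for pat in VENDORED):
        return Decision.DENY
    # Security-sensitive files require human approval.
    if any(fnmatchcase(rel, pat) for pat in SENSITIVE):
        return Decision.NEEDS_APPROVAL
    return Decision.ALLOW
```

Because the gate is deterministic and side-effect free, it can be unit-tested independently of the model and evaluated on every tool call.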

Memory (Chapter 7). Working memory per task: plan, files touched, attempted strategies, test results. Episodic memory: previous tasks completed in the same codebase, surfaced as context (with curation: only successful patterns are surfaced, not failed attempts). Semantic memory: codebase-specific conventions and style guide, ingested deliberately.

Skills (Chapter 10). Project-specific skill carries the codebase’s conventions (testing approach, build commands, structural conventions, prohibited patterns). Loaded for every task in this project. Additional skills load for specific change classes (e.g., a database-migration skill that requires extra approvals).

Defended failure modes (Chapter 11). Tool misuse on critical files (policy gates); state corruption (sandboxed execution, no escape); cascading tool failure (idempotency-aware tool wrappers); irreversible operations (commit and push require approval); confused deputy (the agent reading or summarizing a file the requesting user has no rights to), defended by inheriting the user’s own authorization for every read rather than using a privileged service account.

Compositional payoff. The architecture treats coding as a task that requires bounded iteration in a sandbox with a strong human review at the end: the model can be wrong many times inside the sandbox without consequence, and only the reviewed change reaches the repository.

Vignette 3: Agentic customer support platform

Purpose. Handle inbound customer queries across channels; resolve where possible; escalate when not.

Patterns selected. Router (Chapter 9) at intake. Single-agent-per-conversation downstream. Risk-based escalation (Chapter 6) for sensitive actions (refunds, account changes). Handoff (Chapter 9) to a human agent on escalation. Evaluator-optimizer (Chapter 4) on outbound messages.

Bounding (Chapter 5). Per-conversation iteration cap (typically modest; most conversations are short). Cost budget per conversation matched to support economics. Time per response (interactive SLO). Action surface narrow: read account, read order history, send message, propose refund (refunds are proposed; granting them requires risk-based approval). Reversibility envelope: messages can be sent; refunds, cancellations, and account modifications route to approval if above threshold. Data scope is the rep’s, enforced by impersonation rather than by prompt: the agent calls the platform’s existing APIs with the support rep’s own downscoped token (the backend-for-agent pattern, Chapter 14), so it can reach only the accounts and tickets in that rep’s queue, and a prompt-injected attempt to read another customer’s records draws the same denial a human rep would.

Governance (Chapter 6). Output validator on every outbound message (no PII of other customers, no commitments beyond the agent’s authority, no policy violations). Policy gates on refund amounts, account changes, communication tone for distressed customers. Risk-based escalation: low-risk queries handled autonomously; high-risk queries routed to human agents with the conversation history and the agent’s reasoning trace.
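Risk-based escalation can be illustrated with a small routing function. This is a sketch under assumptions: the action kinds, the `AUTO_LIMITS` table, and the 50.0 refund limit are invented for illustration, and the sketch simplifies by sending every cancellation and account change to a human rather than scoring them.

```python
from dataclasses import dataclass

# Hypothetical autonomy limits per action kind; real values come from policy.
AUTO_LIMITS = {"send_message": float("inf"), "refund": 50.0}


@dataclass(frozen=True)
class ProposedAction:
    kind: str            # e.g. "send_message", "refund", "account_change"
    amount: float = 0.0  # monetary exposure, if any


def route(action: ProposedAction) -> str:
    """Return "auto" for low-risk actions and "human" for everything else."""
    limit = AUTO_LIMITS.get(action.kind)
    if limit is None:
        return "human"  # unknown or unlisted kinds fail closed
    return "auto" if action.amount <= limit else "human"
```

Messages routed to "auto" would still pass through the output validator; the router decides who acts, not whether the content is acceptable.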

Memory (Chapter 7). Working memory per conversation. Episodic memory per customer (prior conversations, prior issues, prior resolutions, surfaced as summarized, retrieval-mediated context). Semantic memory: product knowledge base, policy documents, FAQ. Strict identity scoping (Chapter 7): every retrieval is per-customer; cross-customer leakage defended by default.
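Identity-scoped retrieval means the customer identifier is a mandatory filter applied by the memory layer, not a hint in the prompt. The sketch below uses an in-memory list and substring matching as stand-ins for a real store and retriever; `MemoryRecord` and `ScopedMemory` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    customer_id: str
    text: str


class ScopedMemory:
    """Episodic memory whose every read is filtered by customer identity."""

    def __init__(self, records):
        self._records = list(records)

    def retrieve(self, customer_id: str, query: str) -> list:
        if not customer_id:
            # There is no unscoped read path at all.
            raise ValueError("customer_id is required for every retrieval")
        q = query.lower()
        return [
            r.text
            for r in self._records
            if r.customer_id == customer_id and q in r.text.lower()
        ]
```

A scoping test of the kind the failure-mode list mentions would assert that a query matching another customer's records returns nothing for the current customer.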

Skills (Chapter 10). Channel-specific skills (email tone vs. chat tone). Topic-specific skills (billing, technical, account). Compliance skills for regulated topics. Loaded on demand based on conversation context.

Defended failure modes (Chapter 11). Cross-customer leakage (identity-scoped memory, scoping tests in CI); unauthorized actions (policy gates and approval routing); approval fatigue (risk-based escalation routes only high-stakes actions); tone failures (output validator with content rules); long-running conversation drift (memory compaction).

Compositional payoff. The router lets the bulk of routine conversations resolve autonomously at support economics, while the risk score routes the few actions that touch money or accounts (refunds, cancellations, modifications) to a human.

Vignette 4: Autonomous operations controller

Purpose. React to alerts, logs, or events from production systems; investigate; take corrective action where authorized; escalate otherwise.

Patterns selected. Event-driven agent (Chapter 9). ReAct-style reasoning (Chapter 4) over runbook content. Consensus (Chapter 9) for high-stakes actions (two-key approval). Saga / rollback (Chapter 9) for compensating actions on infrastructure.

Bounding (Chapter 5). Aggressive bounds: this is the most dangerous class of agent. Iteration cap. Cost cap. Time cap. Action surface tightly limited to a small set of safe operations (restart a stateless service, scale a deployment, rotate a credential). High-risk operations (DB migrations, security-policy changes, production deploys) are not in the agent’s action surface; they require human action with the agent in an advisory role. As an agent-as-principal with no user to impersonate, it carries a dedicated workload identity scoped to that surface (Chapter 14), not a god-mode service account.

The sharpest line in this vignette is between read and write. The agent has broad, effectively unbounded access to diagnostic and read-only tools (logs, metrics, traces, configuration inspection), so it can investigate freely; every mutating tool (restart, scale, rollback, credential rotation) sits behind a human-in-the-loop approval gate (Chapter 6). This is the shape that production incident-response assistants, cluster-diagnostics agents, and vendor incident copilots converge on: unlimited looking, gated touching.
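The read/write split can be sketched as a dispatcher that classifies each tool before it runs. The tool names and the `approver` callback are assumptions for illustration; in a real system the approval would arrive asynchronously from a human, not from an in-process function.

```python
from typing import Callable, Optional

# Hypothetical tool names grouped by whether they mutate production state.
READ_ONLY = frozenset({"logs", "metrics", "traces", "inspect_config"})
MUTATING = frozenset({"restart", "scale", "rollback", "rotate_credential"})


def dispatch(
    tool: str,
    args: dict,
    approver: Optional[Callable[[str, dict], bool]] = None,
) -> str:
    """Run read-only tools freely; hold mutating tools for human approval."""
    if tool in READ_ONLY:
        return "executed"
    if tool in MUTATING:
        if approver is not None and approver(tool, args):
            return "executed"
        return "pending_approval"
    # Anything unclassified is outside the action surface.
    return "denied"
```

The classification lives in the dispatcher rather than in the prompt, so a mistaken or manipulated model can request a mutation but cannot perform one.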

Governance (Chapter 6). Policy gates on every action (blast-radius checks, change-window checks, dependent-system checks). Approval gates for anything beyond a small allow-list. Two-key approval for the most consequential operations. Rollback paths tested in chaos exercises.

Memory (Chapter 7). Working memory per incident. Episodic memory of prior incidents (curated post-incident; only resolved cases with verified diagnoses are promoted to retrieval). Semantic memory: runbooks, architecture diagrams, dependency maps, on-call rosters. All authoritative-source.

Skills (Chapter 10). Runbook skills per service or per incident class. Skills carry the service’s escalation paths, dependencies, and known-failure modes. Skill admission verifies that the agent has access to the necessary tools (which it usually does not; most skills will surface human-required steps).

Defended failure modes (Chapter 11). Irreversible operations (not in the action surface); cascading actions (saga with explicit blast-radius limits per step); incorrect diagnosis (the agent proposes; the human acts, for anything beyond the small allow-list); cross-service contamination (per-service scoping).

Compositional payoff. The architecture commits the agent to investigation: it shrinks human time-to-diagnosis dramatically while keeping every mutating action (restart, scale, rollback, credential rotation) in human hands.

Vignette 5: Enterprise multi-agent knowledge platform

Purpose. A team of agents operating across an organization’s knowledge bases, answering questions, drafting documents, summarizing material, mediating between teams. Multi-agent because different sub-organizations have distinct data scopes and procedures.

Patterns selected. Orchestrator (Chapter 9) at the user interface; specialist agents per domain (legal, HR, engineering, finance) with their own scoped memory. Shared memory bus (Chapter 7) for the organization’s common knowledge (corporate strategy, organizational structure). Inter-agent handoff (Chapter 9) for cross-domain questions.

Bounding (Chapter 5). Per-agent action surface restricted to the agent’s domain. Cross-domain queries require explicit handoff and re-authorization. Strict per-agent data scoping.

Governance (Chapter 6). Centralized governance service (Chapter 6) used by all agents. Policy gates encode the organization’s data-sharing rules (legal documents not surfaced to engineering; HR-sensitive data not surfaced outside HR). Inter-agent messages pass through governance (Chapter 9): a legal agent cannot inject prompts into a finance agent through a handoff. Identity propagation across agents is explicit: just as JWTs flow between microservices, the original user’s identity context, a SAML or OIDC token, is attached to the payload of every inter-agent message. When the orchestrator asks the HR agent a question, the HR agent evaluates that original user’s token against the HR data store, not the orchestrator’s identity, so no agent can act with more authority than the human who initiated the request.
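The identity-propagation rule can be sketched with plain data structures. Everything here is an assumption for illustration: `UserContext` stands in for the claims of an already-verified SAML or OIDC token (signature and expiry checks are omitted), and the group name `hr` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    """Stand-in for the verified claims of the original user's token."""
    subject: str
    groups: frozenset


@dataclass(frozen=True)
class AgentMessage:
    sender_agent: str   # which agent forwarded the request
    question: str
    user: UserContext   # the original user's identity travels in the payload


HR_READERS = frozenset({"hr"})  # hypothetical group allowed to read HR data


def hr_agent_handle(msg: AgentMessage) -> str:
    """Authorize against the original user, never against the sender agent."""
    if not (msg.user.groups & HR_READERS):
        return "denied"
    return "answered"
```

Note that `sender_agent` plays no part in the decision: a request relayed by the orchestrator carries exactly the authority of the human who started it.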

Memory (Chapter 7). Per-domain semantic memory (legal index, HR index, engineering index, finance index). Shared semantic memory for organization-wide artifacts. Episodic memory scoped to user and domain. Identity scoping enforced at the gateway.

Skills (Chapter 10). Domain-specific skills per agent. Cross-domain skills (e.g., a “regulatory-response” skill that requires coordination between legal and finance) trigger explicit handoff workflows.

Defended failure modes (Chapter 11). Cross-domain data leakage (centralized governance + identity scoping); inter-agent injection (governance on inter-agent channels); multi-tenant exposure (identity-scoped memory).

Compositional payoff. This is one of the few vignettes where multi-agent is genuinely justified: different agents have different data scopes by design, and the architecture enforces the separation rigorously. The system’s correctness depends on per-domain isolation; multi-agent is the structural answer.

Vignette 6: Hybrid human–AI workflow system

Purpose. Long-running workflows (drug discovery research, financial-model construction, regulatory submissions) where the agent does the bulk work and humans approve milestones.

Patterns selected. Plan–Execute (Chapter 4) with the plan as an explicit, human-approved deliverable. Saga (Chapter 9) for multi-step execution with rollback. Human-in-the-loop (Chapter 6) at milestone gates. Evaluator-optimizer (Chapter 4) inside each step before submission to milestone review.

Bounding (Chapter 5). Generous overall cost and time bounds (these tasks are valuable); strict per-step iteration bounds. Action surface depends on the domain; typically scoped to the workflow’s tools.

Governance (Chapter 6). The plan approval gate is the load-bearing governance step: the plan, with its tools, data accesses, and milestone breakdown, is approved before execution begins. Per-milestone approvals. Compensating actions defined for each milestone. Full audit trail (Chapter 12); regulatory submissions require it.

Memory (Chapter 7). Working memory persists across the long-running workflow (durable, not in-process). Episodic memory captures the workflow’s prior runs and outcomes. Semantic memory: domain-specific (drug interactions, financial regulations, regulatory frameworks), curated.

Because these workflows wait days for milestone approvals, the operational requirement is state hydration, or suspend-and-resume: the agent’s process terminates entirely during the wait, and when the approval webhook fires, the orchestrator spins up fresh compute, hydrates the agent’s context from the durable working-memory store, and resumes the trace where it left off. The agent is, in effect, an ordinary asynchronous distributed workflow that happens to call a model.
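Suspend-and-resume can be sketched in a few functions. The in-memory `DurableStore`, the `gate:` step prefix, and the function names are assumptions for illustration; a real system would use a database or workflow engine, and the approval would arrive as a webhook rather than a direct call.

```python
import json


class DurableStore:
    """In-memory stand-in for a durable working-memory store."""

    def __init__(self):
        self._rows = {}

    def save(self, workflow_id: str, state: dict) -> None:
        # Serializing forces the state to survive without live objects.
        self._rows[workflow_id] = json.dumps(state)

    def load(self, workflow_id: str) -> dict:
        return json.loads(self._rows[workflow_id])


def _advance(store: DurableStore, workflow_id: str, state: dict) -> str:
    """Run steps until an approval gate or the end, then persist and exit."""
    steps = state["steps"]
    while state["cursor"] < len(steps):
        step = steps[state["cursor"]]
        if step.startswith("gate:"):
            state["status"] = "awaiting_approval"
            store.save(workflow_id, state)
            return state["status"]  # the process can now terminate
        state["trace"].append(step)  # placeholder for doing the real work
        state["cursor"] += 1
    state["status"] = "done"
    store.save(workflow_id, state)
    return state["status"]


def start(store: DurableStore, workflow_id: str, steps: list) -> str:
    state = {"steps": steps, "cursor": 0, "trace": [], "status": "running"}
    return _advance(store, workflow_id, state)


def on_approval(store: DurableStore, workflow_id: str) -> str:
    """Called on fresh compute when the approval arrives."""
    state = store.load(workflow_id)  # hydrate context from the store
    if state["status"] != "awaiting_approval":
        raise RuntimeError("workflow is not waiting on an approval")
    state["cursor"] += 1  # move past the approved gate
    return _advance(store, workflow_id, state)
```

Nothing in `on_approval` depends on the process that called `start`; the only shared thing is the store, which is the property the paragraph above describes.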

Skills (Chapter 10). Workflow-specific skills loaded for the run. Compliance skills for regulated steps. Format and style skills for the final deliverable.

Defended failure modes (Chapter 11). Plan drift (replan triggers re-approval); state corruption (each milestone is a saga with rollback); audit gap (full trace required by regulator); drift across long runs (replay-tested against prior milestone outputs).

Compositional payoff. Long-running tasks are well-served by the discipline of explicit plans, milestone approvals, and strong audit: the plan is approved before any work happens, and each milestone is a saga with a tested compensating action.

Reading across vignettes

Several commitments recur across all six vignettes: bounded autonomy is always present, with axes tuned to the task; governance is always structural, with the specific gates varying by domain; memory is always tiered, with scoping at the gateway; trace discipline is always full on flagged sessions; and skills are used where they fit naturally, never as escape hatches around the architecture.

What is absent is as telling as what is present. Peer multi-agent coordination appears only in Vignette 5, and there only because the domains are genuinely separate. Five of six vignettes are single-agent or orchestrator-worker. Debate, swarm, auction, and Tree-of-Thought do not appear at the system level; a specialized worker might use Tree-of-Thought internally for a narrow search task, but it is not a macro-orchestration shape. This is the empirical face of the caution developed in Chapter 3 and Chapter 9: the production shapes that hold up are the simpler ones, governed and bounded carefully, and the flashier coordination patterns have narrow real applicability.

Vignettes are not templates

These vignettes are deliberately architectural, not implementation-ready. A reader implementing a system in the same domain should expect to adapt the specifics to their own setting.

The architecture is the constant; the specifics are the variable. Treating these vignettes as templates and copying them verbatim would be the same mistake as treating GoF’s structural patterns as runnable code.

Summary

This chapter sketched six system architectures, each composed from the patterns and disciplines developed in earlier chapters. The structural commitments are recognizable across all six: bounded autonomy, structural governance, tiered identity-scoped memory, trace discipline, and skills where they fit. The variation is in domain specifics. Chapter 17 turns from these sketches to a single system worked end-to-end, the ultimate vignette, before Chapter 18 takes up the operational concerns that turn any of these architectures into running systems.