Agentic Systems Design Interview — Concept Glossary

Tailored to a role building agentic ML-development infrastructure: agents that work with code, data, experiments, and evaluations. Organized to your framework: FR → NFR → Core Entities → Data Flow → High-Level Design → Deep Dive. Name + one-line definition each.

1. Functional Requirements (what the agent system does)

Task scope — what work the agent performs: coding, data generation, eval, triage, experiment orchestration, debugging.
Autonomy level — fully autonomous vs human-approved vs human-in-the-loop at specific checkpoints.
Single-agent vs multi-agent — one agent with tools vs orchestrated specialists (planner, coder, reviewer, evaluator).
Interaction surface — chat, CLI, IDE plugin, CI hook, API, cron-triggered background worker.
Tool/action space — the concrete set of operations the agent may take (run code, edit files, launch training jobs, query data).
Environment — where actions execute: sandbox, repo checkout, cluster, staging vs prod.
Success criteria / definition of done — how the agent knows a task is complete (tests pass, eval score, human sign-off).
Feedback loops — how outcomes (failures, evals, human corrections) flow back to improve prompts, data, or models.

2. Non-Functional Requirements (the numbers/constraints)

Latency — interactive (seconds) vs background (minutes–hours); decides model size, parallelism, streaming.
Cost per task — tokens × price × attempts; the dominant budget line; drives model routing and caching.
Token budget — context window ceiling per call; forces compression/memory strategy.
Throughput / concurrency — simultaneous agent sessions; drives rate-limit management and queueing.
Reliability / task success rate — % of tasks completed correctly; the core SLO of an agent system.
Determinism vs reproducibility — LLMs are stochastic; reproducibility comes from logging seeds, prompts, model versions, and tool outputs.
Safety envelope — blast radius limits: what the agent must never do (prod writes, secrets, spend limits).
Auditability — every action traceable to an agent, prompt, and approving human.
Model/provider availability — fallback across providers/models when one degrades.
Rate limits & quotas — provider TPM/RPM caps; require client-side throttling, queues, backoff.
Data privacy — what code/data may be sent to which models (external API vs self-hosted).
Graceful degradation — reduced autonomy or smaller models under budget/outage pressure rather than hard failure.

3. Core Entities (the objects in an agent platform)

Agent — a policy (model + system prompt + tools + memory) pursuing a goal over multiple steps.
Task / run / episode — one end-to-end attempt at a goal, with its full trace.
Trajectory / trace — the ordered sequence of thoughts, tool calls, observations, and outputs in a run.
Tool — a typed function the agent can invoke, with schema, docs, permissions, and side-effect classification.
Message / turn — one unit of the conversation loop (user, assistant, tool result).
Prompt template / system prompt — versioned instructions defining agent behavior; treated as code.
Memory — state persisted across turns or sessions (scratchpad, episodic, semantic, long-term stores).
Artifact — outputs an agent produces: diffs/PRs, datasets, eval reports, experiment configs, checkpoints.
Environment / sandbox — the isolated workspace where agent actions execute.
Eval / benchmark / test case — a task with a grading function used to score agent or model behavior.
Policy / permission set — what a given agent identity is allowed to do (tools, resources, spend).
Session / checkpoint — resumable saved state of a long-running agent.

4. Data Flow (the agent loop & pipelines)

Agent loop (ReAct pattern) — reason → act (tool call) → observe → repeat until done; the core control flow.
Planning / task decomposition — breaking a goal into subtasks up front (plan-then-execute) or incrementally.
Plan-and-execute vs reactive — explicit upfront plan vs step-by-step decisions; predictability vs adaptability.
Tool calling / function calling — model emits structured (JSON-schema) calls; runtime executes and returns results.
Structured output / constrained decoding — forcing model output into a schema so downstream code can parse it.
Tool result truncation/summarization — compressing large tool outputs before they enter context.
Context assembly — deciding, per call, what goes into the window: instructions, memory, retrieved docs, recent turns.
Context compression — summarizing or pruning history to stay under the token budget while preserving task state.
Retrieval-augmented generation (RAG) — fetching relevant documents/code/embeddings into context instead of storing everything.
Embeddings / vector store — dense representations enabling semantic search over code, docs, and past traces.
Chunking — splitting documents/code into retrieval-sized pieces; chunk boundaries drive retrieval quality.
Reranking — a second-stage model ordering retrieved candidates by relevance.
Streaming — token-by-token output for responsiveness and early cancellation.
Handoff / delegation — one agent passing a subtask and scoped context to another agent.
Human-in-the-loop gate — a pause point where a human approves, edits, or rejects before the agent proceeds.
Event/message bus between agents — async communication substrate for multi-agent coordination.
Trace logging pipeline — capturing every prompt, completion, tool call, and result into storage for eval/debugging/training.

5. High-Level Design — Architecture Patterns

Orchestrator–worker (supervisor) pattern — a coordinator agent routes subtasks to specialist agents/tools.
Pipeline / workflow (fixed DAG) vs agentic (dynamic) — predetermined step graph vs model-decided control flow; use the least autonomy that solves the task.
Router — a cheap classifier/model directing requests to the right agent, tool, or model tier.
Model routing / cascading — trying small/cheap models first, escalating to larger ones on failure or low confidence.
Evaluator–optimizer (generator–critic) loop — one model produces, another critiques, iterate until pass.
Reflection / self-critique — the agent reviewing its own output/trace to fix errors before finishing.
Debate / ensemble / self-consistency — multiple samples or agents; majority vote or judge picks the best.
Best-of-N with verifier — sample N candidates, score with a verifier (tests, judge), keep the winner.
Blackboard / shared state — agents coordinating via a shared workspace rather than direct messages.
Agent runtime / executor — the service running the loop: model calls, tool dispatch, retries, timeouts, state.
Tool registry — central catalog of tools with schemas, versions, owners, and permission requirements.
MCP (Model Context Protocol) / tool protocol — standardized interface for exposing tools/resources to agents.
Gateway / LLM proxy — single choke point for all model calls: auth, routing, caching, rate limiting, logging, cost attribution.
Prompt caching / KV-cache reuse — reusing computed prefixes (stable system prompts) to cut latency and cost.
Semantic caching — returning cached responses for semantically-equivalent requests.
Session state store — externalized agent state (DB/object store) so runs survive process restarts.
Queue-based task execution — agent tasks as queued jobs with priorities, retries, and idempotency keys.
Checkpoint & resume — persisting loop state so long tasks recover from crashes without restarting.

6. Agent Capabilities Layer

Code agent — agent operating on a repo: read, search, edit, run tests, open PRs.
Repo mapping / code indexing — building searchable structure (symbols, ASTs, embeddings) so agents navigate large codebases.
Sandboxed code execution — running agent-written code in isolated containers/VMs with resource and network limits.
Terminal/shell tool — giving agents command execution; the highest-power, highest-risk tool.
Computer/browser use — agents driving GUIs or web pages for tasks without APIs.
Sub-agent spawning — an agent creating scoped child agents for parallel subtasks (e.g., parallel file edits).
Parallel tool calls — issuing independent tool calls concurrently to cut wall-clock time.
Skill / procedure library — reusable, versioned instructions or scripts agents load on demand.
Scratchpad reasoning / chain-of-thought — intermediate reasoning tokens improving multi-step reliability.

7. Memory & Context (called out in the JD)

Working memory (context window) — everything the model sees this call; the scarcest resource.
Episodic memory — records of past runs/interactions retrievable later.
Semantic memory — distilled facts/preferences/knowledge extracted from experience.
Procedural memory — learned how-to knowledge: refined prompts, skills, playbooks.
Summarization-based compression — rolling summaries replacing old turns as conversations grow.
Hierarchical summarization — summaries of summaries for very long horizons.
Sliding window + pinned context — keep recent turns verbatim, pin critical instructions, compress the middle.
Memory write policy — what gets persisted, when, by whom (agent-decided vs rule-based), and how conflicts resolve.
Memory retrieval — scoring stored memories (recency, relevance, importance) for inclusion in context.
Context poisoning — bad/stale/injected content in memory or retrieval corrupting future behavior.
Context rot / lost-in-the-middle — degraded model attention on content buried mid-window; argues for aggressive curation.
Knowledge cutoff vs live retrieval — model’s frozen training knowledge vs fetched current state (docs, code HEAD).

8. Evaluation & Self-Improvement (core of this JD)

Eval harness / platform — infrastructure to run task suites against agents/models and score outputs at scale.
Golden dataset / benchmark suite — curated tasks with reference answers or checkable outcomes.
Programmatic graders — deterministic checks: unit tests pass, output schema valid, exact/numeric match.
LLM-as-judge — a model grading outputs against a rubric; scalable but needs calibration against humans.
Rubric-based evaluation — decomposing quality into scored dimensions (correctness, style, safety).
Pairwise comparison / Elo — ranking systems by head-to-head judgments instead of absolute scores.
Human evaluation / annotation pipeline — structured human labeling with sampling, instructions, and agreement tracking.
Inter-annotator agreement — consistency between human raters; ceiling for judge accuracy.
Judge calibration / meta-evaluation — measuring the judge against human labels before trusting it.
Position/verbosity/self-preference bias — systematic LLM-judge biases requiring randomization and controls.
Outcome vs process evaluation — grading only the final artifact vs grading the trajectory (tool choices, efficiency).
Trajectory evaluation — scoring the steps: unnecessary calls, loops, recovery from errors.
pass@k / success rate — probability at least one of k samples succeeds; the standard agent metric.
Regression testing for prompts/agents — re-running the eval suite on every prompt/model/tool change; CI for AI behavior.
A/B testing / shadow mode — comparing agent variants on live traffic, or running new agents observation-only.
Offline vs online evaluation — benchmark suites pre-deploy vs live metrics and feedback post-deploy.
Eval contamination / overfitting to the benchmark — the suite leaking into prompts/training, inflating scores; hold-out sets.
Failure taxonomy / error analysis — clustering failed traces into named categories to prioritize fixes.
Failure mining / triage agents — agents that scan logs/traces to surface and cluster failures automatically.
Synthetic data generation — models generating training/eval data; needs diversity controls and quality filtering.
Data curation / filtering loop — scoring and selecting generated or logged data before it enters training sets.
Distillation from traces — fine-tuning smaller models on successful agent trajectories.
RLHF / RLAIF / DPO (awareness level) — preference-based post-training methods that eval pipelines feed.
Self-improving loop — deploy → log traces → mine failures → generate/curate data → retrain or re-prompt → re-eval → redeploy.
Flywheel data governance — versioning, provenance, and consent tracking for data harvested from usage.

9. ML Pipeline & Experimentation Infrastructure (the substrate agents operate on)

Experiment tracking — logging configs, metrics, artifacts per run (W&B/MLflow-style) so agents and humans compare results.
Training orchestration — scheduling distributed jobs on GPU clusters; queues, priorities, preemption.
Hyperparameter search / sweep — automated exploration of configs; a natural agent-driven workflow.
Reproducibility — pinned code SHA + data version + config + seed + environment = re-runnable experiment.
Data versioning — immutable, addressable dataset versions so experiments and evals are comparable.
Model registry — versioned catalog of trained models with lineage, eval scores, and deployment stage.
Feature/dataset lineage — tracing which data produced which model produced which decision.
Checkpointing (training) — periodic model-state saves for fault tolerance and warm restarts.
Multimodal pipelines — handling text/image/audio/video with per-modality preprocessing, storage, and eval.
GPU utilization & scheduling — the scarce resource; batching, packing, and priority policies.
Inference serving — batching, quantization, autoscaling, and latency/throughput trade-offs for model endpoints.
CI/CD for ML — automated tests + evals gating model, prompt, and pipeline changes.
Packaging & environments — reproducible Python environments (lockfiles, containers) across research and prod.
Monorepo / modular codebase health — abstraction boundaries, interfaces, and testing that keep a fast-moving ML codebase evolvable.

10. Safety, Identity & Governance (JD calls out AuthN/AuthZ/IAM)

Agent identity — each agent has its own principal/service account, distinct from any human user.
Authentication (AuthN) — proving who the agent is (workload identity, short-lived tokens, mTLS).
Authorization (AuthZ) — what the identity may do; enforced at the tool/resource layer, not by the prompt.
Least privilege — agents get the minimum permissions for the task, scoped and time-boxed.
Scoped / short-lived credentials — per-task tokens that expire; no long-lived secrets in agent context.
On-behalf-of (delegation) semantics — agent actions carrying both the agent’s identity and the initiating user’s.
Secrets management — credentials injected at execution time by the runtime, never placed in the prompt.
Prompt injection — malicious instructions embedded in data (web pages, files, tool results) hijacking the agent.
Instruction/data boundary — treating all tool/retrieved content as untrusted data, never as commands.
Jailbreaking — adversarial inputs bypassing model safety behavior.
Excessive agency — agent granted more tools/permissions/autonomy than the task requires.
Sandboxing / isolation — containers/VMs/network policies limiting what executed code can touch.
Egress controls — allowlisting network destinations to prevent data exfiltration by compromised agents.
Side-effect classification — labeling tools read-only vs reversible-write vs irreversible; gating accordingly.
Approval workflows — human sign-off required for high-risk action classes (prod deploys, spend, deletion).
Action budget / spend limits — hard caps on tokens, dollars, API calls, or steps per run.
Kill switch / circuit breaker — immediate halt of an agent or fleet when anomalous behavior is detected.
Guardrails (input/output filtering) — classifiers or rules screening prompts and outputs for policy violations.
Audit trail — immutable log of every agent action with identity, inputs, and results.
Rollback / reversibility — preferring undoable actions (branch + PR, not direct push; soft delete).
Content provenance / watermarking (awareness) — marking agent-generated artifacts as machine-produced.

11. Reliability & Failure Modes (deep-dive ammunition)

Non-determinism handling — retries produce different outputs; verify with checks, don’t assume replays match.
Hallucination — confident fabrication (APIs, file paths, results); mitigated by grounding, retrieval, and verification.
Hallucinated tool calls — calling nonexistent tools or invalid arguments; schema validation + graceful error feedback.
Infinite loops / repeated actions — agent stuck retrying the same failing step; step limits + loop detection.
Max-step / timeout limits — hard ceilings on iterations and wall-clock per run.
Error feedback loop — returning tool errors to the model so it can self-correct, vs failing the run.
Retry with idempotency — safe re-execution requires idempotency keys on side-effectful tools.
Compensating actions (saga pattern) — undo steps for multi-step workflows that fail midway.
Partial failure in multi-agent runs — one sub-agent fails; supervisor must retry, reassign, or degrade.
Deadlock / livelock between agents — agents waiting on or endlessly negotiating with each other.
Error propagation / compounding — small per-step error rates compound over long horizons (0.99^50 ≈ 0.6).
Context overflow failure — task state exceeding the window mid-run; needs proactive compression triggers.
Model version drift — provider silently updates a model, shifting behavior; pin versions + regression evals.
Prompt regression — a prompt edit fixing one case and breaking others; caught only by eval suites.
Cascading rate-limit failures — parallel agents exhausting shared quotas; central gateway + admission control.
Cost runaway — loops or fan-out exploding token spend; per-run and per-tenant budgets with hard stops.
Observability for agents — distributed-tracing-style spans over LLM calls and tool calls (OpenTelemetry conventions).
Trace debugging / replay — re-running a recorded trajectory step-by-step to localize the failure.
Online monitoring metrics — success rate, steps per task, tokens per task, tool-error rate, human-override rate, cost per task.
Drift detection — task mix or model behavior shifting over time, degrading success rates.
What breaks at 10x — for agents: quota exhaustion, queue backlog, eval throughput, trace-storage volume, judge cost.

Quick trade-off checklist (name these out loud)

Workflow (fixed DAG) vs agent (dynamic) → predictability/debuggability vs flexibility; use the least autonomy that works
Single agent vs multi-agent → simplicity vs specialization/parallelism (and coordination + compounding-error cost)
Bigger model vs cascade/routing → quality vs cost & latency
More context vs retrieval/compression → fidelity vs cost, latency, and lost-in-the-middle
Autonomy vs human-in-the-loop gates → speed vs safety/trust; gate on side-effect severity
LLM-as-judge vs human eval → scale & speed vs ground-truth reliability; calibrate judge against humans
Outcome vs process evaluation → cheap signal vs diagnosable signal
Best-of-N / reflection vs single pass → success rate vs multiplied token cost
Synthetic data vs human data → volume & cost vs diversity and quality risk (model collapse)
Guardrails in prompt vs enforcement in runtime → flexible but bypassable vs hard but rigid; security lives in the runtime
Pin model versions vs auto-upgrade → stability vs missing improvements; upgrade behind regression evals