Tailored to a role building agentic ML-development infrastructure: agents that work with code, data, experiments, and evaluations. Organized to your framework: FR → NFR → Core Entities → Data Flow → High-Level Design → Deep Dive. Name + one-line definition each.


1. Functional Requirements (what the agent system does)

  • Task scope — what work the agent performs: coding, data generation, eval, triage, experiment orchestration, debugging.
  • Autonomy level — fully autonomous vs human-approved vs human-in-the-loop at specific checkpoints.
  • Single-agent vs multi-agent — one agent with tools vs orchestrated specialists (planner, coder, reviewer, evaluator).
  • Interaction surface — chat, CLI, IDE plugin, CI hook, API, cron-triggered background worker.
  • Tool/action space — the concrete set of operations the agent may take (run code, edit files, launch training jobs, query data).
  • Environment — where actions execute: sandbox, repo checkout, cluster, staging vs prod.
  • Success criteria / definition of done — how the agent knows a task is complete (tests pass, eval score, human sign-off).
  • Feedback loops — how outcomes (failures, evals, human corrections) flow back to improve prompts, data, or models.

2. Non-Functional Requirements (the numbers/constraints)

  • Latency — interactive (seconds) vs background (minutes–hours); decides model size, parallelism, streaming.
  • Cost per task — tokens × price × attempts; the dominant budget line; drives model routing and caching.
  • Token budget — context window ceiling per call; forces compression/memory strategy.
  • Throughput / concurrency — simultaneous agent sessions; drives rate-limit management and queueing.
  • Reliability / task success rate — % of tasks completed correctly; the core SLO of an agent system.
  • Determinism vs reproducibility — LLMs are stochastic; reproducibility comes from logging seeds, prompts, model versions, and tool outputs.
  • Safety envelope — blast radius limits: what the agent must never do (prod writes, secrets, spend limits).
  • Auditability — every action traceable to an agent, prompt, and approving human.
  • Model/provider availability — fallback across providers/models when one degrades.
  • Rate limits & quotas — provider TPM/RPM caps; require client-side throttling, queues, backoff.
  • Data privacy — what code/data may be sent to which models (external API vs self-hosted).
  • Graceful degradation — reduced autonomy or smaller models under budget/outage pressure rather than hard failure.

3. Core Entities (the objects in an agent platform)

  • Agent — a policy (model + system prompt + tools + memory) pursuing a goal over multiple steps.
  • Task / run / episode — one end-to-end attempt at a goal, with its full trace.
  • Trajectory / trace — the ordered sequence of thoughts, tool calls, observations, and outputs in a run.
  • Tool — a typed function the agent can invoke, with schema, docs, permissions, and side-effect classification.
  • Message / turn — one unit of the conversation loop (user, assistant, tool result).
  • Prompt template / system prompt — versioned instructions defining agent behavior; treated as code.
  • Memory — state persisted across turns or sessions (scratchpad, episodic, semantic, long-term stores).
  • Artifact — outputs an agent produces: diffs/PRs, datasets, eval reports, experiment configs, checkpoints.
  • Environment / sandbox — the isolated workspace where agent actions execute.
  • Eval / benchmark / test case — a task with a grading function used to score agent or model behavior.
  • Policy / permission set — what a given agent identity is allowed to do (tools, resources, spend).
  • Session / checkpoint — resumable saved state of a long-running agent.

4. Data Flow (the agent loop & pipelines)

  • Agent loop (ReAct pattern) — reason → act (tool call) → observe → repeat until done; the core control flow.
  • Planning / task decomposition — breaking a goal into subtasks up front (plan-then-execute) or incrementally.
  • Plan-and-execute vs reactive — explicit upfront plan vs step-by-step decisions; predictability vs adaptability.
  • Tool calling / function calling — model emits structured (JSON-schema) calls; runtime executes and returns results.
  • Structured output / constrained decoding — forcing model output into a schema so downstream code can parse it.
  • Tool result truncation/summarization — compressing large tool outputs before they enter context.
  • Context assembly — deciding, per call, what goes into the window: instructions, memory, retrieved docs, recent turns.
  • Context compression — summarizing or pruning history to stay under the token budget while preserving task state.
  • Retrieval-augmented generation (RAG) — fetching relevant documents/code/embeddings into context instead of storing everything.
  • Embeddings / vector store — dense representations enabling semantic search over code, docs, and past traces.
  • Chunking — splitting documents/code into retrieval-sized pieces; chunk boundaries drive retrieval quality.
  • Reranking — a second-stage model ordering retrieved candidates by relevance.
  • Streaming — token-by-token output for responsiveness and early cancellation.
  • Handoff / delegation — one agent passing a subtask and scoped context to another agent.
  • Human-in-the-loop gate — a pause point where a human approves, edits, or rejects before the agent proceeds.
  • Event/message bus between agents — async communication substrate for multi-agent coordination.
  • Trace logging pipeline — capturing every prompt, completion, tool call, and result into storage for eval/debugging/training.

5. High-Level Design — Architecture Patterns

  • Orchestrator–worker (supervisor) pattern — a coordinator agent routes subtasks to specialist agents/tools.
  • Pipeline / workflow (fixed DAG) vs agentic (dynamic) — predetermined step graph vs model-decided control flow; use the least autonomy that solves the task.
  • Router — a cheap classifier/model directing requests to the right agent, tool, or model tier.
  • Model routing / cascading — trying small/cheap models first, escalating to larger ones on failure or low confidence.
  • Evaluator–optimizer (generator–critic) loop — one model produces, another critiques, iterate until pass.
  • Reflection / self-critique — the agent reviewing its own output/trace to fix errors before finishing.
  • Debate / ensemble / self-consistency — multiple samples or agents; majority vote or judge picks the best.
  • Best-of-N with verifier — sample N candidates, score with a verifier (tests, judge), keep the winner.
  • Blackboard / shared state — agents coordinating via a shared workspace rather than direct messages.
  • Agent runtime / executor — the service running the loop: model calls, tool dispatch, retries, timeouts, state.
  • Tool registry — central catalog of tools with schemas, versions, owners, and permission requirements.
  • MCP (Model Context Protocol) / tool protocol — standardized interface for exposing tools/resources to agents.
  • Gateway / LLM proxy — single choke point for all model calls: auth, routing, caching, rate limiting, logging, cost attribution.
  • Prompt caching / KV-cache reuse — reusing computed prefixes (stable system prompts) to cut latency and cost.
  • Semantic caching — returning cached responses for semantically-equivalent requests.
  • Session state store — externalized agent state (DB/object store) so runs survive process restarts.
  • Queue-based task execution — agent tasks as queued jobs with priorities, retries, and idempotency keys.
  • Checkpoint & resume — persisting loop state so long tasks recover from crashes without restarting.

6. Agent Capabilities Layer

  • Code agent — agent operating on a repo: read, search, edit, run tests, open PRs.
  • Repo mapping / code indexing — building searchable structure (symbols, ASTs, embeddings) so agents navigate large codebases.
  • Sandboxed code execution — running agent-written code in isolated containers/VMs with resource and network limits.
  • Terminal/shell tool — giving agents command execution; the highest-power, highest-risk tool.
  • Computer/browser use — agents driving GUIs or web pages for tasks without APIs.
  • Sub-agent spawning — an agent creating scoped child agents for parallel subtasks (e.g., parallel file edits).
  • Parallel tool calls — issuing independent tool calls concurrently to cut wall-clock time.
  • Skill / procedure library — reusable, versioned instructions or scripts agents load on demand.
  • Scratchpad reasoning / chain-of-thought — intermediate reasoning tokens improving multi-step reliability.

7. Memory & Context (called out in the JD)

  • Working memory (context window) — everything the model sees this call; the scarcest resource.
  • Episodic memory — records of past runs/interactions retrievable later.
  • Semantic memory — distilled facts/preferences/knowledge extracted from experience.
  • Procedural memory — learned how-to knowledge: refined prompts, skills, playbooks.
  • Summarization-based compression — rolling summaries replacing old turns as conversations grow.
  • Hierarchical summarization — summaries of summaries for very long horizons.
  • Sliding window + pinned context — keep recent turns verbatim, pin critical instructions, compress the middle.
  • Memory write policy — what gets persisted, when, by whom (agent-decided vs rule-based), and how conflicts resolve.
  • Memory retrieval — scoring stored memories (recency, relevance, importance) for inclusion in context.
  • Context poisoning — bad/stale/injected content in memory or retrieval corrupting future behavior.
  • Context rot / lost-in-the-middle — degraded model attention on content buried mid-window; argues for aggressive curation.
  • Knowledge cutoff vs live retrieval — model’s frozen training knowledge vs fetched current state (docs, code HEAD).

8. Evaluation & Self-Improvement (core of this JD)

  • Eval harness / platform — infrastructure to run task suites against agents/models and score outputs at scale.
  • Golden dataset / benchmark suite — curated tasks with reference answers or checkable outcomes.
  • Programmatic graders — deterministic checks: unit tests pass, output schema valid, exact/numeric match.
  • LLM-as-judge — a model grading outputs against a rubric; scalable but needs calibration against humans.
  • Rubric-based evaluation — decomposing quality into scored dimensions (correctness, style, safety).
  • Pairwise comparison / Elo — ranking systems by head-to-head judgments instead of absolute scores.
  • Human evaluation / annotation pipeline — structured human labeling with sampling, instructions, and agreement tracking.
  • Inter-annotator agreement — consistency between human raters; ceiling for judge accuracy.
  • Judge calibration / meta-evaluation — measuring the judge against human labels before trusting it.
  • Position/verbosity/self-preference bias — systematic LLM-judge biases requiring randomization and controls.
  • Outcome vs process evaluation — grading only the final artifact vs grading the trajectory (tool choices, efficiency).
  • Trajectory evaluation — scoring the steps: unnecessary calls, loops, recovery from errors.
  • pass@k / success rate — probability at least one of k samples succeeds; the standard agent metric.
  • Regression testing for prompts/agents — re-running the eval suite on every prompt/model/tool change; CI for AI behavior.
  • A/B testing / shadow mode — comparing agent variants on live traffic, or running new agents observation-only.
  • Offline vs online evaluation — benchmark suites pre-deploy vs live metrics and feedback post-deploy.
  • Eval contamination / overfitting to the benchmark — the suite leaking into prompts/training, inflating scores; hold-out sets.
  • Failure taxonomy / error analysis — clustering failed traces into named categories to prioritize fixes.
  • Failure mining / triage agents — agents that scan logs/traces to surface and cluster failures automatically.
  • Synthetic data generation — models generating training/eval data; needs diversity controls and quality filtering.
  • Data curation / filtering loop — scoring and selecting generated or logged data before it enters training sets.
  • Distillation from traces — fine-tuning smaller models on successful agent trajectories.
  • RLHF / RLAIF / DPO (awareness level) — preference-based post-training methods that eval pipelines feed.
  • Self-improving loop — deploy → log traces → mine failures → generate/curate data → retrain or re-prompt → re-eval → redeploy.
  • Flywheel data governance — versioning, provenance, and consent tracking for data harvested from usage.

9. ML Pipeline & Experimentation Infrastructure (the substrate agents operate on)

  • Experiment tracking — logging configs, metrics, artifacts per run (W&B/MLflow-style) so agents and humans compare results.
  • Training orchestration — scheduling distributed jobs on GPU clusters; queues, priorities, preemption.
  • Hyperparameter search / sweep — automated exploration of configs; a natural agent-driven workflow.
  • Reproducibility — pinned code SHA + data version + config + seed + environment = re-runnable experiment.
  • Data versioning — immutable, addressable dataset versions so experiments and evals are comparable.
  • Model registry — versioned catalog of trained models with lineage, eval scores, and deployment stage.
  • Feature/dataset lineage — tracing which data produced which model produced which decision.
  • Checkpointing (training) — periodic model-state saves for fault tolerance and warm restarts.
  • Multimodal pipelines — handling text/image/audio/video with per-modality preprocessing, storage, and eval.
  • GPU utilization & scheduling — the scarce resource; batching, packing, and priority policies.
  • Inference serving — batching, quantization, autoscaling, and latency/throughput trade-offs for model endpoints.
  • CI/CD for ML — automated tests + evals gating model, prompt, and pipeline changes.
  • Packaging & environments — reproducible Python environments (lockfiles, containers) across research and prod.
  • Monorepo / modular codebase health — abstraction boundaries, interfaces, and testing that keep a fast-moving ML codebase evolvable.

10. Safety, Identity & Governance (JD calls out AuthN/AuthZ/IAM)

  • Agent identity — each agent has its own principal/service account, distinct from any human user.
  • Authentication (AuthN) — proving who the agent is (workload identity, short-lived tokens, mTLS).
  • Authorization (AuthZ) — what the identity may do; enforced at the tool/resource layer, not by the prompt.
  • Least privilege — agents get the minimum permissions for the task, scoped and time-boxed.
  • Scoped / short-lived credentials — per-task tokens that expire; no long-lived secrets in agent context.
  • On-behalf-of (delegation) semantics — agent actions carrying both the agent’s identity and the initiating user’s.
  • Secrets management — credentials injected at execution time by the runtime, never placed in the prompt.
  • Prompt injection — malicious instructions embedded in data (web pages, files, tool results) hijacking the agent.
  • Instruction/data boundary — treating all tool/retrieved content as untrusted data, never as commands.
  • Jailbreaking — adversarial inputs bypassing model safety behavior.
  • Excessive agency — agent granted more tools/permissions/autonomy than the task requires.
  • Sandboxing / isolation — containers/VMs/network policies limiting what executed code can touch.
  • Egress controls — allowlisting network destinations to prevent data exfiltration by compromised agents.
  • Side-effect classification — labeling tools read-only vs reversible-write vs irreversible; gating accordingly.
  • Approval workflows — human sign-off required for high-risk action classes (prod deploys, spend, deletion).
  • Action budget / spend limits — hard caps on tokens, dollars, API calls, or steps per run.
  • Kill switch / circuit breaker — immediate halt of an agent or fleet when anomalous behavior is detected.
  • Guardrails (input/output filtering) — classifiers or rules screening prompts and outputs for policy violations.
  • Audit trail — immutable log of every agent action with identity, inputs, and results.
  • Rollback / reversibility — preferring undoable actions (branch + PR, not direct push; soft delete).
  • Content provenance / watermarking (awareness) — marking agent-generated artifacts as machine-produced.

11. Reliability & Failure Modes (deep-dive ammunition)

  • Non-determinism handling — retries produce different outputs; verify with checks, don’t assume replays match.
  • Hallucination — confident fabrication (APIs, file paths, results); mitigated by grounding, retrieval, and verification.
  • Hallucinated tool calls — calling nonexistent tools or invalid arguments; schema validation + graceful error feedback.
  • Infinite loops / repeated actions — agent stuck retrying the same failing step; step limits + loop detection.
  • Max-step / timeout limits — hard ceilings on iterations and wall-clock per run.
  • Error feedback loop — returning tool errors to the model so it can self-correct, vs failing the run.
  • Retry with idempotency — safe re-execution requires idempotency keys on side-effectful tools.
  • Compensating actions (saga pattern) — undo steps for multi-step workflows that fail midway.
  • Partial failure in multi-agent runs — one sub-agent fails; supervisor must retry, reassign, or degrade.
  • Deadlock / livelock between agents — agents waiting on or endlessly negotiating with each other.
  • Error propagation / compounding — small per-step error rates compound over long horizons (0.99^50 ≈ 0.6).
  • Context overflow failure — task state exceeding the window mid-run; needs proactive compression triggers.
  • Model version drift — provider silently updates a model, shifting behavior; pin versions + regression evals.
  • Prompt regression — a prompt edit fixing one case and breaking others; caught only by eval suites.
  • Cascading rate-limit failures — parallel agents exhausting shared quotas; central gateway + admission control.
  • Cost runaway — loops or fan-out exploding token spend; per-run and per-tenant budgets with hard stops.
  • Observability for agents — distributed-tracing-style spans over LLM calls and tool calls (OpenTelemetry conventions).
  • Trace debugging / replay — re-running a recorded trajectory step-by-step to localize the failure.
  • Online monitoring metrics — success rate, steps per task, tokens per task, tool-error rate, human-override rate, cost per task.
  • Drift detection — task mix or model behavior shifting over time, degrading success rates.
  • What breaks at 10x — for agents: quota exhaustion, queue backlog, eval throughput, trace-storage volume, judge cost.

Quick trade-off checklist (name these out loud)

  • Workflow (fixed DAG) vs agent (dynamic) → predictability/debuggability vs flexibility; use the least autonomy that works
  • Single agent vs multi-agent → simplicity vs specialization/parallelism (and coordination + compounding-error cost)
  • Bigger model vs cascade/routing → quality vs cost & latency
  • More context vs retrieval/compression → fidelity vs cost, latency, and lost-in-the-middle
  • Autonomy vs human-in-the-loop gates → speed vs safety/trust; gate on side-effect severity
  • LLM-as-judge vs human eval → scale & speed vs ground-truth reliability; calibrate judge against humans
  • Outcome vs process evaluation → cheap signal vs diagnosable signal
  • Best-of-N / reflection vs single pass → success rate vs multiplied token cost
  • Synthetic data vs human data → volume & cost vs diversity and quality risk (model collapse)
  • Guardrails in prompt vs enforcement in runtime → flexible but bypassable vs hard but rigid; security lives in the runtime
  • Pin model versions vs auto-upgrade → stability vs missing improvements; upgrade behind regression evals