The Data Operations Cycle

What Actually Feeds a Production Model

Picture a fraud-detection model at a payments company. Someone taps their card at a coffee shop, and somewhere in the few hundred milliseconds before the terminal beeps, a model has to decide whether that transaction looks like the cardholder or looks like someone who stole their number in a data breach three states away. It’s tempting to think the interesting engineering problem here is the model — the architecture, the training run, the accuracy on a held-out set. It isn’t. The interesting problem, and the one that actually determines whether the model works in production, is everything that happens to the data before and after that model ever sees it.

That’s the subject of this chapter: not a single pipeline, but a cycle. Data gets collected, checked, labeled, enriched, merged into something trustworthy, used to train a model, used to serve predictions, watched for signs of decay, and then — this is the part that turns it into a cycle rather than a one-way trip — fed back into the beginning, because the predictions the model makes in production become tomorrow’s training data. Every stage in that loop has its own failure modes, its own tools, and its own tradeoffs, and the thread connecting all of them is a single question: can training and serving agree on what the data means?

The Two Lanes

Data arrives at different speeds, and the fraud model needs both ends of that spectrum. On one end, there’s the historical record: a year of past transactions, chargebacks, merchant categories, and account histories, sitting in object storage waiting to be turned into training data. On the other end, there’s the transaction happening right now, which has to be scored before the customer’s card is done being read.

The historical side runs through what’s usually called the batch lane. Data lands in a raw zone — S3 dumps, database snapshots, exports from other systems — gets picked up by an orchestrator like Airflow or Dagster, processed at scale with something like Spark or Ray, and lands in a lakehouse format such as Iceberg or Delta, which gives you schema evolution and the ability to time-travel back to what a table looked like on any given day. The batch lane’s whole reason for existing is that it can afford to be slow and thorough. It owns the durable historical record, and it needs to be able to answer a specific question months from now: what did the model actually know at the moment it made a given prediction?

The live side runs through the streaming lane. Events — a card swipe, a login, a password reset — hit a message bus like Kafka or Kinesis, get processed by something like Flink, and update a low-latency store like Redis or DynamoDB that a model can query in single-digit milliseconds. But here’s the detail that trips people up: that low-latency store is serving state, not a source of truth. If it only lived in Redis, you’d have no way to replay what happened, audit a decision after the fact, or rebuild the state if it got flushed. So every event that updates the online store has to also land, asynchronously, in the same lakehouse the batch lane writes to. The streaming lane is fast, but it isn’t allowed to be the only place a piece of data exists.
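The dual-write rule is easy to sketch with in-memory stand-ins. In the example below a dict plays the low-latency store and a list plays the append-only lakehouse table; the names `handle_event` and `rebuild_online_store` and the `card_id` field are hypothetical, and the lake write happens inline here only for simplicity, where a real system would do it asynchronously.

```python
from collections import defaultdict

online_store = defaultdict(int)   # stand-in for Redis/DynamoDB: serving state only
lake_log = []                     # stand-in for the append-only lakehouse table


def apply(state, event):
    """Update per-card transaction counts from one event."""
    state[event["card_id"]] += 1


def handle_event(event):
    # Record the event durably, then update serving state.
    lake_log.append(event)
    apply(online_store, event)


def rebuild_online_store(log):
    """Replay the durable log to reconstruct serving state from scratch."""
    state = defaultdict(int)
    for event in log:
        apply(state, event)
    return state


for e in [{"card_id": "c1"}, {"card_id": "c2"}, {"card_id": "c1"}]:
    handle_event(e)

online_store.clear()                            # simulate a cache flush
online_store = rebuild_online_store(lake_log)   # the lake rebuilds the cache
```

The direction of the rebuild is the point: the log can always regenerate the cache, and the cache can never regenerate the log.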

Getting streaming right depends on a small set of ideas that are easy to state and easy to get subtly wrong. Every event needs a key: which entity does this update? Every event has both a processing time (when the system happened to see it) and an event time (when it actually happened), and only the second one is the ML contract; a fraud signal that arrives five minutes late still describes something that happened five minutes ago, not something happening now. A watermark is the system’s running estimate of how far event time has progressed, and it decides when a window is allowed to close. Idempotency determines whether replaying the same event twice silently double-counts something. Backpressure is how the pipeline slows its producers down when the downstream store can’t keep up with the firehose, and without it events pile up or get dropped. And state TTL decides when an online feature quietly expires. None of this is exotic, but a fraud model that gets even one of these wrong (say, closing a window before a late-arriving chargeback event lands) will confidently produce a wrong number and never tell you it did.
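A small sketch makes three of these ideas concrete: event-time windows, a watermark, and idempotent replay. It assumes 60-second tumbling windows, a watermark that trails the largest event time seen by 30 seconds, and a unique `event_id` per event; those numbers and the `WindowedCounter` class are illustrative choices, not how any particular stream processor implements them.

```python
from collections import defaultdict

WINDOW = 60            # tumbling window size in seconds (assumed)
ALLOWED_LATENESS = 30  # how far the watermark trails the max event time seen


class WindowedCounter:
    def __init__(self):
        self.open_windows = defaultdict(int)   # (key, window_start) -> count
        self.closed = {}                       # finalized window results
        self.seen_ids = set()                  # makes replay idempotent
        self.watermark = float("-inf")
        self.dropped_late = 0

    def process(self, event_id, key, event_time):
        if event_id in self.seen_ids:          # replayed duplicate: ignore
            return
        self.seen_ids.add(event_id)
        window_start = event_time - event_time % WINDOW
        if window_start + WINDOW <= self.watermark:
            self.dropped_late += 1             # its window has already closed
            return
        self.open_windows[(key, window_start)] += 1
        # The watermark advances with event time, never with wall-clock time.
        self.watermark = max(self.watermark, event_time - ALLOWED_LATENESS)
        ready = [k for k in self.open_windows if k[1] + WINDOW <= self.watermark]
        for k in ready:
            self.closed[k] = self.open_windows.pop(k)
```

An event that arrives out of order but ahead of the watermark is still counted in its correct window; one that arrives after its window closed is counted as dropped rather than silently discarded, which is what gives the late-event policy something to monitor.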

The rule of thumb worth carrying forward: batch gives you history, streaming gives you freshness, and you build the streaming lane only for the features where freshness actually changes a decision. Most data doesn’t need to move at streaming speed. The fraud model’s account-age feature can wait a day. Its “has this device been seen in the last ten minutes” feature cannot.

Quality Gates, Before Anyone Trusts Anything

The worst place to discover bad data is inside the model’s behavior, which is why both lanes need their own validation step, sized to how much time they have. Batch can afford to be heavy: full schema checks, null-rate and range checks, checks for corrupt files, deduplication against near-identical records, and distribution comparisons against what the model was trained on. Streaming has to be light — a schema check, a range check, a rolling null-rate counter, a freshness check — because anything heavier would blow the latency budget the whole lane exists to protect.
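The streaming-side gate can be sketched in a few lines. The schema, the limits, and the `merchant` field below are all assumptions made for illustration; the point is only that each of the four checks named above is cheap enough to run per event.

```python
from collections import deque

REQUIRED = {"card_id": str, "amount": float, "event_time": float}  # assumed schema
MAX_AMOUNT = 50_000.0       # assumed range limit
MAX_STALENESS = 300.0       # assumed freshness limit, in seconds
NULL_RATE_LIMIT = 0.2       # assumed threshold over the rolling window

recent_nulls = deque(maxlen=100)  # rolling record of "merchant missing?" flags


def validate(event, now):
    """Return the names of failed checks; an empty list means the event passes."""
    failures = []
    for field, typ in REQUIRED.items():
        if not isinstance(event.get(field), typ):
            failures.append(f"schema:{field}")
    if failures:
        return failures                      # later checks need valid fields
    if not 0 < event["amount"] <= MAX_AMOUNT:
        failures.append("range:amount")
    if now - event["event_time"] > MAX_STALENESS:
        failures.append("freshness")
    recent_nulls.append(event.get("merchant") is None)
    if sum(recent_nulls) / len(recent_nulls) > NULL_RATE_LIMIT:
        failures.append("null_rate:merchant")
    return failures
```

Returning the failed check names, rather than a bare boolean, keeps the gate debuggable: a quarantined event carries the reason it was held back.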

The principle underneath both is the same: nothing is allowed to cross into the space where training and serving both draw from it until it has passed some kind of check. It’s a boring principle, and it’s also the single most common thing that’s missing when a fraud model starts making a strange decision that nobody can explain — usually because a stream of events skipped validation entirely and quietly poisoned the online store.

Labels Come From People, and People Drift

Most raw data has no ground truth attached to it. A transaction doesn’t arrive labeled “fraud” or “not fraud” — a fraud investigator has to look at a flagged case and decide. That decision becomes the label the next model version trains on, and the pipeline that produces it looks less like a data pipeline and more like a small operations team: an unlabeled pool gets sampled (randomly, or by some prioritization scheme), turned into a task with instructions, handed to a human reviewer, checked against a gold set or a second reviewer’s opinion, consolidated when reviewers disagree, and finally versioned into a labeled dataset.

The part of this that quietly breaks systems is guideline drift. Six months into a fraud program, someone updates the investigator guidelines — maybe a new pattern of fraud gets its own category — and if the old and new labels get merged as though they mean the same thing, the training set now contains two different definitions of “fraud” wearing the same label. The fix isn’t complicated, just easy to skip: version the guideline itself, not just the label, so you can always answer which definition of fraud produced any given data point.
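Here is one way the versioned-guideline idea might look in code. Each label carries the guideline version it was made under, and consolidation runs per (item, guideline version) so that old and new definitions are never voted together. The `Label` record and the simple majority rule are assumptions for the sketch; real programs often weight reviewers or use a gold set instead.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    item_id: str
    reviewer: str
    value: str
    guideline_version: str   # which definition of "fraud" the reviewer applied


def consolidate(labels):
    """Majority vote per (item, guideline version); ties stay unresolved."""
    votes = defaultdict(Counter)
    for lab in labels:
        votes[(lab.item_id, lab.guideline_version)][lab.value] += 1
    resolved = {}
    for key, counter in votes.items():
        ranked = counter.most_common(2)
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            resolved[key] = ranked[0][0]
        else:
            resolved[key] = None   # disagreement: escalate to adjudication
    return resolved
```

Because the guideline version is part of the key, a label made under version 1 can never outvote one made under version 2; the two simply coexist as separate facts.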

There’s a second failure mode worth naming here, because it’s become more common as teams try to speed up labeling with a model that pre-labels and lets humans just confirm or correct: if the humans start rubber-stamping the model’s suggestions instead of actually reviewing them, the model’s own mistakes quietly become next quarter’s ground truth. Auto-labeling can be a real efficiency win, but only if someone is actually checking, not just clicking through.

Deciding What’s Worth a Human’s Time

Labeling everything is expensive, and most of what you’d label wouldn’t teach the model much: a transaction that looks exactly like a thousand other obviously legitimate transactions doesn’t need a human’s attention. Active learning is the practice of using the current model to decide which unlabeled examples are actually worth that attention: the ones it’s least confident about, the ones where an ensemble of models disagrees with itself, or the ones whose embeddings sit furthest from anything the model has seen in training. A burst of low-confidence predictions from the fraud model in production is exactly the kind of signal that should route straight into next week’s labeling queue. That routing is what makes active learning less a standalone technique than the mechanism that closes the loop between production and the labeling pipeline described above.

The tradeoffs are worth knowing by name. Pure uncertainty sampling can end up chasing noisy or genuinely mislabeled outliers rather than useful edge cases. Query-by-committee — labeling wherever an ensemble disagrees most — gives a stronger signal but costs more, since it needs multiple models running. Diversity sampling covers gaps in the embedding space without necessarily targeting the current model’s actual weaknesses. And error-driven sampling, which targets known production failure modes directly, only works once monitoring has actually told you what those failure modes are.
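The first two strategies reduce to a few lines each. The sketch below scores uncertainty with Shannon entropy over a model’s predicted class probabilities and scores committee disagreement with vote entropy over an ensemble’s hard votes; both are standard formulations, but the function names and the input shapes are assumptions made for the example.

```python
import math


def entropy(probs):
    """Shannon entropy of one class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_sample(predictions, k):
    """Pick the k items whose predicted distribution has the highest entropy.

    predictions: dict mapping item id -> list of class probabilities.
    """
    ranked = sorted(predictions, key=lambda item: entropy(predictions[item]), reverse=True)
    return ranked[:k]


def committee_disagreement(votes):
    """Vote entropy: how evenly an ensemble's hard votes split across classes."""
    total = len(votes)
    return entropy([votes.count(c) / total for c in set(votes)])
```

Notice what the uncertainty score cannot see: a mislabeled or noisy outlier looks just as “uncertain” as a genuinely informative edge case, which is the tradeoff the paragraph above describes.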

When Real Data Isn’t Enough

Two techniques get lumped together under “synthetic data” that are worth pulling apart, because they solve different problems. The first is enrichment: using a model to add derived information to data that already exists — an embedding, an entity tag, a toxicity score — without inventing any new ground truth. The second is genuinely synthetic generation: creating new examples that didn’t exist before, usually because real data can’t cheaply cover a gap. A fraud team might do this for a brand-new attack pattern that’s shown up in the wild only a handful of times — too rare to train on directly, but well enough understood that a team can generate synthetic examples of it to give the model something to learn from before the real thing happens at scale.

Synthetic data earns its keep, but it comes with obligations that are easy to underweight. Every synthetic record needs to carry its own provenance — which generator produced it, with what parameters, what fraction of a given training set is synthetic versus real — and it should never quietly end up in an evaluation set, because evaluating a model against its own generated data tells you nothing about how it performs against reality. There’s also a specific, well-documented failure mode called model collapse: if you keep training successive model generations on the previous generation’s synthetic output, quality and diversity degrade a little more each round, the same way a photocopy of a photocopy degrades. Synthetic data needs the same versioning discipline as labeled data — arguably more, since it’s easier to lose track of where it came from.
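Provenance is easiest to enforce when it lives on the record itself. In this sketch a hypothetical `Record` type carries an optional generator name and parameters; a record with no generator is treated as real. The field names are illustrative, and a production schema would likely carry more detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    record_id: str
    features: tuple
    generator: Optional[str] = None          # None means the record is real
    generator_params: Optional[str] = None   # e.g. seed or prompt version

    @property
    def is_synthetic(self):
        return self.generator is not None


def synthetic_fraction(dataset):
    """Share of a dataset that came from a generator rather than reality."""
    return sum(r.is_synthetic for r in dataset) / len(dataset)


def build_eval_set(candidates):
    """Evaluation data is real-only; synthetic rows are filtered out here."""
    return [r for r in candidates if not r.is_synthetic]
```

Putting the filter inside the function that builds the evaluation set means the rule is enforced by construction rather than by everyone remembering it.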

Where Trust Actually Lives

Eventually the batch features and the streaming features have to stop being two separate things and become one thing a model can actually consume. This merged, governed layer is what the industry usually calls a feature store. It’s worth being precise about the name: “the merged view” describes what the system does, while “feature store” names the system itself, and the feature store belongs in any lineage diagram as its own component, not as a synonym for “the data” flowing through it. Downstream consumers of a feature store shouldn’t ever need to know whether a given number came out of Spark or Flink. What they need to be able to ask is: what entity is this keyed by, what timestamp is it valid for, what version of the definition produced it, and can I use it for training, for serving, or both?

Five things make a feature store trustworthy rather than just convenient. The first is treating event time, not processing time, as the actual join contract — a fraud prediction made at 10:05 should only ever see features that were knowable at 10:05, which batch enforces with as-of joins (matching each event to the most recent feature value at or before its timestamp) and streaming enforces with windowed state and watermarks. The second is refusing to let feature definitions duplicate across systems — training-serving skew almost always starts with two teams writing two slightly different implementations of “the same” feature, one filtering out bot traffic and the other not, until the numbers quietly diverge; the fix is one shared definition (the entity, the timestamp, the window, the filters, the null policy, the owner) that both engines are required to implement identically. The third is remembering that the online store is a cache of certified state, not a historical record — every value written there should carry enough metadata to debug it later, and if Redis gets flushed, the lakehouse needs to be able to rebuild it, never the other way around. The fourth is putting quality gates before the merge rather than after, so nothing reaches the feature store without having already passed a check appropriate to its lane. And the fifth is versioning the view itself, not just the underlying data — a version needs to capture the feature definitions, the transformation code, the quality checks, the annotation schema, the synthetic-data generator version, the enrichment model version, the backfill window, the online materialization policy, and the feature store’s own registry version, because if any single one of those changes, the model is looking at a different world even though the raw files on disk look identical.
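The as-of join from the first point is worth seeing in miniature. The sketch below uses plain Python and a binary search; the input shapes (a list of events and a per-entity, time-sorted feature history) are assumptions for the example, and at scale this would be done by the batch engine rather than in a loop.

```python
from bisect import bisect_right


def as_of_join(events, feature_history):
    """Attach to each event the latest feature value at or before its event time.

    events: list of (entity, event_time)
    feature_history: dict entity -> list of (valid_from, value), sorted by valid_from
    """
    joined = []
    for entity, event_time in events:
        history = feature_history.get(entity, [])
        times = [t for t, _ in history]
        idx = bisect_right(times, event_time)   # number of entries with t <= event_time
        value = history[idx - 1][1] if idx else None
        joined.append((entity, event_time, value))
    return joined
```

With a history of `(100, 1), (200, 5), (300, 9)`, an event at time 250 sees the value 5, never the 9 that only became knowable at 300. An event earlier than the first entry gets `None` rather than a value borrowed from its own future.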

Making the Past Reconstructible

Underneath all of this sits a storage and governance layer whose entire job is making sure the past can be reconstructed. Storage is the easy part — a lakehouse for offline truth, a key-value store for online serving. Versioning is harder, because it has to answer a whole family of questions on demand: which exact snapshot of data trained model version twelve, which feature definition produced a specific online value, which annotation guideline was active when a given label was made, which synthetic generator produced a given synthetic record, and which downstream datasets would be affected if a particular source got deleted.
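One lightweight way to make those questions answerable is to fingerprint everything that defines a view and store that fingerprint next to each model and dataset. The sketch below hashes a manifest of component versions; every key and value in `manifest` is a made-up placeholder, and a real registry would record far more than a short hash.

```python
import hashlib
import json


def view_version(components):
    """Deterministic fingerprint of everything that defines the merged view."""
    canonical = json.dumps(components, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical component versions, for illustration only.
manifest = {
    "feature_definitions": "defs@41",
    "transformation_code": "git:3f2a9c1",
    "quality_checks": "checks@7",
    "annotation_guideline": "v2",
    "synthetic_generator": "gen@1.3",
    "enrichment_model": "embedder@2024-01",
    "backfill_window_days": 365,
}
v1 = view_version(manifest)
v2 = view_version({**manifest, "annotation_guideline": "v3"})  # one change, new version
```

Sorting the keys before hashing makes the fingerprint independent of dictionary order, so two manifests with the same content always produce the same version.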

That last question is where governance stops being an abstract concern and becomes a legal one. Access control decides who can read fields that contain personal information. Privacy techniques — anonymization, differential privacy, consent tracking — decide what’s collected and how it’s transformed before anyone downstream touches it. Retention policy decides how long any of this is kept in the first place. And the deletion path decides whether, when a user asks for their data to be removed, that request can actually be honored across the lake, the online store, the feature store’s registry, and any model that happened to be trained on it. This is the single most common gap in systems that otherwise look well-built: the collection and training side gets all the attention, and then a real deletion request arrives and there’s no way to trace — let alone remove — one person’s data from every table, cache, and training set it ever touched. The feature store’s registry, if it’s been treated as a proper lineage node all along, is usually the fastest way to answer that question, because it’s the one place almost everything passes through on its way to a model.
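If lineage is stored as a graph, the “what would a deletion touch” question becomes a traversal. The node names and edges below are invented for the example; the sketch only shows the shape of the query, a breadth-first walk from a source to everything derived from it.

```python
from collections import deque

# Hypothetical lineage edges: each node maps to the artifacts derived from it.
lineage = {
    "raw.transactions": ["lake.transactions_clean"],
    "lake.transactions_clean": ["features.txn_count_7d", "labels.fraud_v2"],
    "features.txn_count_7d": ["online.txn_count_7d", "snapshot.2024_06"],
    "labels.fraud_v2": ["snapshot.2024_06"],
    "snapshot.2024_06": ["model.fraud_v12"],
}


def downstream(source):
    """Return every artifact reachable from `source` in the lineage graph."""
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

The result for a raw source includes the online store entry and the trained model, which is exactly the part of a deletion request that tends to get missed.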

Training on the Past, Honestly

Training data preparation is where all of the discipline above either pays off or gets exposed. A model trained on point-in-time-correct data only ever sees, for each historical prediction, the features that would have actually existed at that moment — not the current value of those features looked up in hindsight. A naive train/test split can leak the future into the past just as easily: split randomly by row instead of by entity and time window, and the same customer can show up in both the training set and the test set with overlapping history, which inflates offline metrics in a way that quietly falls apart the moment the model meets real, sequential time in production. The practical output of this stage is a dataset snapshot — an immutable pull from the feature store, tagged with enough version metadata that the exact training run could be reproduced six months later. If you can’t do that, what you have isn’t a training pipeline, it’s a one-time script that happened to work once.
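A split that respects both boundaries can be sketched as follows. It assigns each customer to a side with a stable hash, trains only on rows before a cutoff, and tests only on held-out customers at or after it. The field names, the 20 percent holdout, and the choice to drop rows that fit neither rule are assumptions for the example.

```python
import hashlib


def is_held_out(customer_id, holdout_pct=20):
    """Stable hash bucket, so a customer lands on the same side in every run."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 < holdout_pct


def entity_time_split(rows, cutoff_time, holdout_pct=20):
    """Train on non-held-out customers before the cutoff; test on held-out
    customers at or after it. Rows matching neither rule are dropped."""
    train, test = [], []
    for row in rows:
        held_out = is_held_out(row["customer_id"], holdout_pct)
        if not held_out and row["event_time"] < cutoff_time:
            train.append(row)
        elif held_out and row["event_time"] >= cutoff_time:
            test.append(row)
    return train, test
```

Dropping rows costs data, but it guarantees that no customer and no point in time appears on both sides, which is the property a random row-level split cannot offer.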

Two Ways to Answer a Question

Serving mirrors the two ingestion lanes it grew out of. Batch inference scores a large dataset periodically (a nightly run that re-evaluates every open account for fraud risk, for example) and reads from the offline view. Online inference scores one request at a time, inside whatever slice of the few-hundred-millisecond authorization window the model is allotted, reading from the online view. The shared-definition rule from the feature store section matters most right here: if serving computes “transactions in the last seven days” even slightly differently than training did, the model’s real-world behavior diverges from what it actually learned, and that divergence is usually invisible until someone notices a drop in a metric and has to go digging for why.
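The simplest defense is for both paths to call one definition. In the sketch below, `txn_count_7d` owns the window boundaries and the bot filter, and the batch and online entry points are thin wrappers around it. The function names, the `is_bot` field, and the half-open window convention are assumptions; in practice the two engines are often different systems, and the shared piece is a declarative spec rather than a Python function.

```python
SEVEN_DAYS = 7 * 24 * 3600  # seconds


def txn_count_7d(transactions, as_of):
    """Single definition: non-bot transactions in the window (as_of - 7d, as_of]."""
    return sum(
        1 for t in transactions
        if as_of - SEVEN_DAYS < t["event_time"] <= as_of and not t.get("is_bot", False)
    )


def batch_features(history_by_card, as_of):
    """Offline path: compute the feature for every card at once."""
    return {card: txn_count_7d(txns, as_of) for card, txns in history_by_card.items()}


def online_feature(recent_txns, now):
    """Online path: same definition, one card, request time as the as-of."""
    return txn_count_7d(recent_txns, now)
```

Whether the boundary is inclusive, and whether bots count, are exactly the details that two independent implementations tend to decide differently.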

Watching for the Ground to Shift

By the time a model’s accuracy visibly drops, the data underneath it has usually already been wrong for a while, which is why monitoring has to watch more than accuracy. Data drift asks whether the incoming feature distribution is shifting away from what the model trained on. Label drift asks whether the ground-truth rate itself is changing; fraud rates do genuinely rise and fall over time, independent of anything the model is doing wrong. Prediction drift asks whether the model’s output distribution is shifting even when the inputs look stable, which can be an early warning sign before either of the other two shows up clearly. Freshness SLAs ask whether a given feature is arriving within the latency window the model actually needs. And training-serving skew gets checked directly, by periodically comparing the online and offline computed values for the same feature, same entity, same moment in time, rather than assuming the shared definition held up in practice just because it was written down correctly.
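One common way to put a number on data drift is the Population Stability Index, which compares the binned distribution of a feature at training time against what production is seeing now. PSI is a swapped-in example here, not something the text above prescribes, and the bin edges and the small floor for empty bins are assumptions of this sketch.

```python
import math


def psi(expected, actual, edges):
    """Population Stability Index between two samples over shared bin edges.

    edges: ascending cut points; values fall into len(edges) + 1 bins.
    """
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

A frequently quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the threshold that matters is the one calibrated against this model’s own history.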

Closing the Loop

This is the stage that turns everything above from a pipeline into an actual cycle. The predictions a fraud model makes, together with what eventually turns out to be true — a chargeback filed, a customer confirming a transaction was theirs — become a new data source that flows straight back into collection. Low-confidence or disagreement cases surfaced in production feed the active learning queue. A drift alert can trigger a fresh round of synthetic data generation to patch a coverage gap, or a new labeling batch aimed at whatever the drift revealed. And a deletion event has to propagate backward through every stage that ever touched the deleted data. None of this works unless every stage can be traced forward and backward to every other stage — which is really just the lineage and versioning work from the governance section, showing up again as the connective tissue that makes the whole cycle actually a cycle rather than a diagram of good intentions.

A few things sit alongside every stage rather than belonging to just one of them. Self-service tooling should only ever expose the feature store’s trusted view, never the raw internals of either lane. Agentic automation — a system that watches pipelines, diagnoses failures, and applies guarded fixes — is worth building once failure patterns are well understood, and a liability if it’s built before that. Multimodal data needs object storage and manifests rather than table rows, which is worth its own extended look. And cost sits quietly behind every decision above it — cluster sizing for batch, partition count for streaming, storage tiering for the lakehouse — trading against latency at every turn.

When the Data Isn’t Just Text

Everything above holds for any modality — you still collect, validate, label, enrich, merge, store, train, serve, monitor, and feed back — but what each of those words technically means shifts almost completely once the data stops being rows and columns. It’s easiest to see by switching examples entirely, from a fraud model to a content moderation system that has to handle text, images, audio, and video arriving together on the same platform, often as part of the same post.

Text is the modality all of the tooling above was originally built for, and it’s the cheapest to store and check: cleaning means stripping HTML and boilerplate and fixing encoding, deduplication runs on exact hashes or on MinHash signatures compared through locality-sensitive hashing to catch near-duplicates, and quality scoring leans on things like perplexity and grammatical-quality classifiers.
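The MinHash idea fits in a short sketch. Each document becomes a set of word shingles, each of 64 seeded hash functions contributes the minimum hash over that set, and the fraction of matching slots between two signatures estimates their Jaccard similarity. The shingle size, the signature length, and the use of `blake2b` as the hash family are choices made for this example, and the locality-sensitive hashing step that buckets signatures at scale is left out.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; very short texts fall back to a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed, item):
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(items, num_perm=64):
    """Signature: for each seeded hash function, the minimum hash over the set."""
    return [min(_hash(seed, s) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two copies of a page that differ by one word share most of their shingles and therefore most of their signature slots, which is what lets near-duplicates surface without comparing every pair of full documents.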

Images break almost every one of those assumptions. Cleaning becomes about format and resolution normalization and catching corrupt files rather than encoding fixes. Exact hashing only catches byte-identical copies, because a resized or re-compressed copy of the same image produces a completely different hash; to catch those, images get compared with perceptual hashing or by embedding similarity, typically CLIP-style vectors compared through approximate nearest-neighbor search. Quality scoring shifts to aesthetic score, blur, exposure, and watermark detection, since a stock-photo watermark baked into a few million training images is a real and well-documented way for a corpus to go quietly wrong. Safety filtering for images adds an obligation text mostly doesn’t have: screening against known CSAM hash databases, which is standard practice and is tied to legal reporting obligations in many jurisdictions, on top of the usual NSFW and violence classifiers. And where text gets tokenized into subwords, an image gets patchified the way a vision transformer expects, or mapped to discrete tokens through a learned codebook like the one in a VQ-VAE. When synthetic generation shows up for images, it’s usually recaptioning: running a vision-language model over a corpus whose scraped alt-text is mostly SEO noise, to produce captions actually worth training on.
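Perceptual hashing is simpler than it sounds. The sketch below implements the basic “average hash” variant on a tiny grayscale grid: each bit records whether a pixel is brighter than the grid’s mean, and two images are compared by Hamming distance. It assumes the image has already been converted to grayscale and downscaled to a small grid, which a real pipeline would do with an image library; production systems also tend to use more robust variants.

```python
def average_hash(pixels):
    """Average hash of a small grayscale grid (list of rows of 0-255 ints).

    Each bit is 1 if the pixel is brighter than the grid mean, so a uniform
    brightness shift leaves the bits unchanged.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Number of differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))
```

A brightened copy of an image lands at distance zero from the original even though every byte changed, while an unrelated image lands far away. That is the property a byte-level hash cannot provide.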

Audio shifts the same set of concerns again. Cleaning means resampling to a consistent rate, normalizing loudness, and trimming silence. Deduplication runs on acoustic fingerprinting, the same basic idea behind song-recognition apps. Quality scoring looks at signal-to-noise ratio and clipping instead of perplexity. And audio introduces a concern with real legal weight that text simply doesn’t have: a person’s voice carries consent and likeness rights that their written words don’t, which matters a great deal once voice cloning is a realistic risk. Audio gets “tokenized” by converting it into a mel-spectrogram or into discrete tokens through a neural codec, a meaningfully heavier preprocessing step than anything text requires.
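The audio cleaning steps have simple first-order versions. The sketch below works on a list of float samples in the range -1 to 1 and uses RMS level as a stand-in for loudness; real pipelines usually normalize to a perceptual loudness standard instead, so treat the functions and thresholds here as illustrative assumptions.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_rms(samples, target=0.1):
    """Scale samples so their RMS level equals `target`."""
    level = rms(samples)
    if level == 0:
        return list(samples)            # pure silence: nothing to scale
    gain = target / level
    return [s * gain for s in samples]


def clipping_fraction(samples, threshold=0.999):
    """Share of samples at or beyond full scale, a cheap clipping indicator."""
    return sum(abs(s) >= threshold for s in samples) / len(samples)


def trim_silence(samples, floor=0.01):
    """Drop leading and trailing samples quieter than `floor`."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= floor]
    return samples[loud[0]:loud[-1] + 1] if loud else []
```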

Video is the most expensive lane of all, and the reason is mostly about redundancy: adjacent frames are nearly identical, so cleaning a video corpus means running shot and scene detection to extract representative keyframes rather than processing every single frame. Deduplication runs on embeddings of those sampled frames or short clips, since neither byte hashing nor single-frame comparison catches a re-encoded or re-cropped copy. Quality scoring adds temporal concerns on top of per-frame visual quality — camera shake, how often scenes cut, whether the audio track actually lines up with the video. And annotation for video is the most expensive per item of any modality, since temporal action labels, object tracking across frames, and speaker diarization all take far more of a human reviewer’s time than labeling one image or one span of text.
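Keyframe extraction can be approximated by keeping a frame only when it differs enough from the last frame kept. The sketch below does that with a mean absolute pixel difference over flattened grayscale frames; the threshold is arbitrary, and real shot detectors typically use histograms or learned models rather than raw pixel differences.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average per-pixel absolute difference between two flat grayscale frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_keyframes(frames, threshold=30.0):
    """Keep frame 0, then each frame that differs enough from the last kept one."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[kept[-1]], frames[i]) > threshold:
            kept.append(i)
    return kept
```

Comparing against the last kept frame, rather than the immediately preceding one, stops a slow pan from slipping through as a long run of individually tiny changes.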

The moment a content moderation system has to handle pairs — an image with a caption, a video with a transcript — a new problem shows up that none of the single-modality pipelines needed on their own: whether the two modalities actually agree with each other, not just whether each one individually passes its own checks. A perfectly clean image paired with an unrelated caption is worse than either one being flawed on its own, because the model learns a false association between them. That needs its own filtering stage — commonly a CLIP-style similarity score between an image and its caption, or a transcript’s word-error-rate as a proxy for how well it actually matches the audio — and it needs its own provenance, since a pair can fail for either half independently and lineage has to be able to say which.
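The pair filter itself is small once the embeddings exist. The sketch below assumes each pair already carries an image embedding and a text embedding from the same CLIP-style model (computing them is out of scope here), and it records a reason for every rejection so lineage can say why a pair was dropped. The 0.3 threshold and the field names are placeholders, not recommended values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(pairs, min_similarity=0.3):
    """Split pairs into kept ids and rejected records that carry a reason."""
    kept, rejected = [], []
    for pair in pairs:
        score = cosine(pair["image_emb"], pair["text_emb"])
        if score >= min_similarity:
            kept.append(pair["id"])
        else:
            rejected.append({"id": pair["id"], "reason": "pair_mismatch", "score": score})
    return kept, rejected
```

The `reason` field is what keeps this distinct from the single-modality checks: an image that passed every image filter can still be rejected here, and the record says so.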

The pattern running underneath all four modalities is that the further you get from plain text, the more “cleaning” and “quality” stop being something a team writes by hand and start being something delegated to another model entirely — a CLIP score, a captioning model, a codec. That’s a real shift in what the engineering work actually is: instead of writing filters, the job becomes validating and versioning the auxiliary models doing the filtering, because a stale CLIP checkpoint or a captioning model that’s quietly degraded doesn’t throw an error, it just lets slightly worse data through, and that’s a far harder thing to notice than a broken regular expression. It’s also why the versioning list from the feature store section needs one more line item once a pipeline goes multimodal: the version of every auxiliary model used for cleaning, scoring, or captioning belongs in the data version, exactly the way a synthetic-data generator’s version does.

The Shape of a Minimal System

If you were building either of these systems — the fraud model or the content moderation system — from nothing, the honest advice is to add complexity only when a real requirement forces it, in roughly this order: start batch-only, with quality gates from day one, since those are cheap to build early and expensive to retrofit. Add human labeling once you actually need supervised targets you don’t already have. Add a streaming lane only for the specific features where freshness changes a decision, not by default. Add active learning once labeling cost has become genuinely painful. Add synthetic data only for a real, identified gap — class imbalance, a rare edge case, a coverage hole — not because it sounds impressive. Build production monitoring and the feedback loop back into collection at launch, not later; this is the one item on the list that isn’t really optional, because a model shipped without it is a model nobody will notice going stale. Add agentic auto-remediation only once incident patterns are well understood. And add self-service tooling once more than one team actually needs the same trusted data.

A minimal end-to-end version of the architecture looks like this. Batch sources land in a raw zone and get processed by something like Spark into a lakehouse. Stream sources flow through Kafka and Flink into both a low-latency store and the same lakehouse, which keeps them replayable. Both paths converge on the feature store: training and batch inference read its offline view, and online inference reads its online view. Predictions flow into monitoring, and monitoring’s findings flow back into collection, closing the loop. The shape is the same whether the system is the fraud model or the content moderation system.

Where It Actually Breaks

Almost every real failure in a system like this traces back to the same root cause: some stage quietly stepped outside the shared contract the rest of the cycle depends on. Streaming without a replay path means the online store has the latest values but the raw events never made it to the lake, so nothing can be reconstructed later. Batch without any freshness path means training data stays clean while online decisions go stale, because events sit waiting for tomorrow’s job. Two independent implementations of “the same” feature drift apart and produce training-serving skew that nobody notices until a metric moves. A quality gate that only exists in the batch lane lets bad real-time data straight into the online store. A missing late-event policy causes windows to close too early or too late, producing silent counting errors that look like normal noise. Treating the online store as historical truth makes debugging effectively impossible, because low-latency state has no memory of its own past.

Annotation guidelines quietly changing mid-project without being versioned merges two different definitions of the same label. A model’s own pre-labeled output getting rubber-stamped by reviewers turns the model’s mistakes into next quarter’s ground truth. Synthetic data mixed in without provenance tracking means nobody can separate it from real data later, and it sometimes ends up polluting an evaluation set by accident. Training repeatedly on a model’s own synthetic output degrades quality a little more with each generation. A train/test split that ignores entity and time boundaries leaks the future into the past and inflates offline metrics that won’t hold up in production.

Governance without a deletion path means a compliance request arrives with no way to trace, let alone remove, a person’s data from everywhere it ended up. Monitoring without a feedback path means drift gets detected and nothing happens about it.

And in multimodal systems specifically, byte-level deduplication misses a resized image or re-encoded clip entirely, an unversioned auxiliary model can silently degrade corpus quality with no record of why, and clean-but-mismatched pairs teach a model false associations that no single-modality check would ever catch.

The Short Version

If you had to explain this whole cycle in one breath, it would go something like this: data moves through two latency lanes, batch for anything that can wait and streaming for anything whose value decays fast, each validated at a level of rigor its own latency budget allows; it gets labeled or synthetically filled in wherever real data alone can’t cover the need; both lanes merge into a single governed feature store keyed by event time and shared definitions, versioned all the way down to the models used to clean and score it; training draws point-in-time-correct snapshots from that store, and serving draws from the exact same definitions whether the request is a nightly batch job or a payment terminal waiting a fraction of a second for an answer; and production is watched closely enough that drift, low-confidence predictions, and eventual ground truth all flow back into collection, so the system keeps re-teaching itself rather than slowly drifting away from the world it was built for. The specific tools (Spark or Ray, Flink or Kafka Streams, Redis or DynamoDB, Iceberg or Delta) are implementation detail. The design is the separation of latency lanes, plus the discipline of a shared contract holding all of it together, and that’s the part worth understanding well enough to explain without the diagram in front of you.
