Eval Feedback Loops for Multimodal Dataset Versions

The previous post made a dataset version trainable. It covered the jump from immutable manifests to WebDataset shards and training loaders: the physical layer that lets a model consume data efficiently.

This post asks the next question:

Should this dataset version be promoted at all?

Up to this point, the pipeline can ingest data, store raw assets, compute embeddings, run quality and dedup stages, build a catalog, create dataset versions, materialize shards, and benchmark loaders.

That is already a lot of machinery.

But a pipeline that only produces data is incomplete. A production-shaped data pipeline needs a feedback loop that can say whether a data change made the system better.

data change -> new dataset version -> evaluation -> decision

Without that loop, every data decision is anecdotal.

I can say:

I added more COCO images.

But the better question is:

Did adding those images improve retrieval quality, coverage, diversity,
or downstream training behavior?

That is Component 10: the eval and feedback loop.

For context, this post continues the multimodal lakehouse series.

Component 10: Eval and Feedback Loop

In model-centric ML, evaluation usually compares model versions:

model_v1 vs model_v2

In data-centric ML, the dataset version also needs to be evaluated:

dataset_v1 vs dataset_v2

That distinction matters for this project because the main artifact is not a trained model. The main artifact is a versioned multimodal data pipeline.

For example:

v001:
  5K text documents
  1K images
  simple quality gates

v002:
  5K text documents
  1K images
  ANN dedup
  better caption filtering
  more balanced image clusters

The eval loop asks:

Did v002 improve retrieval quality compared to v001?
Did it reduce duplicates?
Did it improve diversity?
Did it hurt any important query category?

That is what makes data quality measurable instead of subjective.

The point is not to say “I made a bigger dataset.” The point is to prove whether the data change improved measured behavior.

Freeze the Test, Change the Data

A good eval loop needs one rule:

keep the eval set fixed while dataset versions change

If I change both the dataset and the eval questions at the same time, I cannot tell what caused the result.

The basic loop should look like this:

frozen eval probes
        |
        v
run against dataset v001
        |
        v
record metrics

same frozen eval probes
        |
        v
run against dataset v002
        |
        v
compare metrics

The pattern is:

hold eval constant
change dataset
measure difference

That is the scientific part of the data pipeline.

It is also the only way to make version promotion meaningful. A version should not be promoted because it built successfully. It should be promoted because it passed a stable set of checks and did not regress important behavior.
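One way to enforce the "frozen" part is to fingerprint the probe set and store that fingerprint in every report. The sketch below is an assumption about how that could work, not something the project has today: `probe_set_fingerprint`, `same_eval_set`, and the `probe_set_sha256` report field are hypothetical names.

```python
import hashlib
import json


def probe_set_fingerprint(probes: list[dict]) -> str:
    """Return a stable SHA-256 hex digest for a list of probe dicts."""
    # Sort probes by id and serialize with sorted keys so the digest does
    # not depend on file order or dict key order.
    canonical = json.dumps(
        sorted(probes, key=lambda p: p["id"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def same_eval_set(report_a: dict, report_b: dict) -> bool:
    """Two reports are only comparable if they used the same probe set.

    Assumes each report carries a hypothetical 'probe_set_sha256' field.
    """
    return report_a["probe_set_sha256"] == report_b["probe_set_sha256"]
```

If the fingerprints differ, the comparison between two versions should be refused rather than reported, because the eval changed along with the data.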

What an Eval Probe Looks Like

This project is a multimodal retrieval system, so the eval set should contain query probes.

A probe is a small test case:

id: image_dog_outside_001
query: "a dog running outside"
target_modality: image
expected_terms:
  - dog
  - outside
  - running
expected_source: COCO

A text probe might look like:

id: text_reinforcement_learning_001
query: "what is reinforcement learning?"
target_modality: text
expected_terms:
  - reward
  - policy
  - agent
  - learning

For each probe, the system runs search and checks whether the top results match the expected intent.

At first, these probes can be simple and hand-written. They do not need to be perfect. They need to be stable enough to catch obvious regressions.
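A minimal sketch of a probe and a hand-written relevance check could look like this. The `Probe` class mirrors the fields shown above; the `is_relevant` heuristic and its `min_terms` threshold are my assumptions for illustration, and a real check would likely look at captions, metadata, or labels rather than raw substring matches.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Probe:
    id: str
    query: str
    target_modality: str
    expected_terms: tuple[str, ...] = ()
    expected_source: Optional[str] = None


def is_relevant(probe: Probe, result_text: str, min_terms: int = 2) -> bool:
    """Heuristic: relevant if enough expected terms appear in the result text.

    result_text is whatever text represents the result (caption, transcript,
    or document body). Matching is case-insensitive substring matching.
    """
    if not probe.expected_terms:
        return False  # nothing to match against, so do not claim relevance
    text = result_text.lower()
    hits = sum(1 for term in probe.expected_terms if term.lower() in text)
    # Never require more hits than there are expected terms.
    return hits >= min(min_terms, len(probe.expected_terms))
```

For the dog probe above, a caption like "A brown dog running on grass" matches two of the three expected terms and counts as relevant under this heuristic, while "A cat sleeping on a couch" does not.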

Example probe categories:

Eval task	Example query	Expected output
Text to Text	“What is reinforcement learning?”	Educational text about RL
Text to Image	“a dog running outside”	COCO image or caption about a dog outside
Text to Video	“people cooking food”	Video clip, caption, or keyframe about cooking
Text to Audio	“a person reading a sentence”	Speech clip, transcript, or metadata if represented

The audio case has an important caveat.

Text, image, and some video retrieval can use CLIP-like shared spaces if the embeddings are actually aligned. But dimensional compatibility is not semantic compatibility. Projecting Whisper vectors to 384 dimensions does not automatically make them comparable to MiniLM text vectors unless that projection is trained for the task.

For audio retrieval, a safer first step is transcript-based search or an embedding model explicitly trained to align audio and text.

That is the same lesson from the catalog schema bug: vectors from different models are not interchangeable just because they fit in a column.

Retrieval Metrics

Search returns ranked results:

query -> top 5 or top 10 results

So the eval loop needs ranking metrics.

Precision@K

Precision@K asks:

Of the top K results, how many were relevant?

For example:

query = "dog running outside"
top 5 results = 4 relevant images

Precision@5 = 4 / 5 = 0.8

Precision@K is useful when I care about result cleanliness. If the top five results are noisy, the system feels bad even if a good result appears somewhere later.

For early versions of this project, Precision@5 is probably the simplest useful metric.
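As a sketch, Precision@K over a ranked list of relevance judgments is only a few lines. The function name and the choice to divide by K even when fewer than K results come back are assumptions on my side; the second choice penalizes short result lists, which seems right for a search demo.

```python
def precision_at_k(relevance: list[bool], k: int) -> float:
    """Fraction of the top-k results judged relevant.

    relevance[i] is True if the result at rank i + 1 is relevant.
    Divides by k, so a result list shorter than k is penalized.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k
```

With four relevant results in the top five, `precision_at_k([True, True, False, True, True], 5)` gives the 0.8 from the example above.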

Recall@K

Recall@K asks:

Of all known relevant items, how many did the system retrieve in the top K?

Recall is useful when I have known target IDs or a labeled relevance set.

At demo scale, recall is harder because I may not know all relevant items in the corpus. A query like “dog outside” could have many valid images. Unless those are labeled, recall is only approximate.

So I would treat Recall@K as a later metric, once the eval set has stronger labels.
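Once a probe carries known target IDs, Recall@K can be sketched like this. The function name and the ID-based inputs are assumptions; the result is only as trustworthy as the labeled relevance set behind it.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the known relevant items that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        raise ValueError("recall is undefined without known relevant items")
    found = set(retrieved_ids[:k]) & relevant_ids
    return len(found) / len(relevant_ids)
```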

Mean Reciprocal Rank

Mean Reciprocal Rank, or MRR, asks:

How early did the first good result appear?

If the first relevant result appears at rank 1:

score = 1.0

If it appears at rank 5:

score = 1 / 5 = 0.2

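Each probe gets one such reciprocal-rank score, and MRR is the mean of those scores across all probes. A probe with no relevant result scores 0. MRR is useful for search demos because users care a lot about whether the first useful result appears near the top.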

If v002 has the same Precision@5 as v001 but a better MRR, the first good answer moved earlier. That is a real improvement.
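A minimal sketch of the computation, assuming each probe yields a ranked list of boolean relevance judgments (the function names are mine):

```python
def reciprocal_rank(relevance: list[bool]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_rel in enumerate(relevance, start=1):
        if is_rel:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(per_probe_relevance: list[list[bool]]) -> float:
    """Average the reciprocal rank over all probes."""
    if not per_probe_relevance:
        return 0.0
    total = sum(reciprocal_rank(rel) for rel in per_probe_relevance)
    return total / len(per_probe_relevance)
```

A first hit at rank 5 scores 0.2, matching the example above.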

nDCG@K

nDCG is useful when relevance is graded instead of binary.

Instead of marking each result as relevant or irrelevant, I can score it:

3 = excellent
2 = good
1 = weakly related
0 = irrelevant

nDCG rewards systems that put highly relevant results earlier in the ranking.

This is closer to real search evaluation because not all relevant results are equally good.
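Here is a hedged sketch of nDCG@K, assuming integer grades where a larger number means more relevant (for example 0 to 3). It uses the linear-gain form of DCG; some implementations use an exponential gain instead. The `judged_grades` parameter is my own addition: if I have judgments for more items than the ones retrieved, the ideal ranking should be built from all of them.

```python
import math
from typing import Optional


def dcg_at_k(grades: list[int], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(
        grade / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )


def ndcg_at_k(
    ranked_grades: list[int],
    k: int,
    judged_grades: Optional[list[int]] = None,
) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking.

    If judged_grades is None, the ideal ranking is built only from the
    retrieved results, which overstates nDCG when good items were missed.
    """
    pool = judged_grades if judged_grades is not None else ranked_grades
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_grades, k) / ideal
```

A ranking that already lists results from best to worst scores 1.0, and the same results in reverse order score lower.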

For this project, the staged path is:

Precision@K first
MRR next
nDCG@K after adding graded judgments

That keeps the first eval loop practical while leaving room for more realistic ranking evaluation later.

Data Quality Metrics Belong in the Same Report

Retrieval metrics measure system behavior. They do not fully describe the dataset.

Because this project is a data pipeline, each version report should also include corpus-level metrics:

Metric	What it tells me
Total items	Dataset size
Modality counts	Text/image/video/audio balance
Duplicate rate	Dedup effectiveness
Quality pass rate	Filter strictness
Largest cluster percentage	Concentration risk
Normalized entropy	Diversity
Missing asset count	Broken references
License distribution	Compliance and commercial usability

This is where the previous diversity work fits.

For example:

v001:
  clip_entropy = 0.71
  largest_cluster_pct = 28.4%

v002:
  clip_entropy = 0.83
  largest_cluster_pct = 14.1%

That says v002 has a more balanced visual corpus.
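For reference, both numbers can be derived from per-item cluster assignments. This is a sketch under one assumption worth flagging: entropy is normalized by the log of the number of non-empty clusters that were observed. If the earlier diversity stage normalizes by a fixed cluster count instead, the values will differ. The function name is hypothetical.

```python
import math
from collections import Counter


def cluster_balance(cluster_labels: list[int]) -> dict[str, float]:
    """Compute normalized entropy and largest-cluster share from labels."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cluster_labels must not be empty")
    probs = [count / total for count in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    n_clusters = len(counts)
    # With a single cluster there is no diversity to measure.
    normalized = entropy / math.log(n_clusters) if n_clusters > 1 else 0.0
    return {
        "normalized_entropy": normalized,
        "largest_cluster_pct": 100.0 * max(probs),
    }
```

A perfectly even split across clusters gives a normalized entropy of 1.0, and a corpus dominated by one cluster pushes it toward 0.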

But that is not the whole answer. I still need retrieval eval to ask:

Did the better balance improve search quality?

The complete feedback loop has two halves:

corpus quality metrics + retrieval metrics

One measures the dataset itself. The other measures system behavior.

What This Project Has Today

The current project has the foundation for this component, but not the full production eval loop yet.

The implemented foundation includes:

immutable dataset version manifests
item IDs pinned per version
model versions recorded in manifests
modality counts recorded
LanceDB catalog as the query source
search endpoints for text, image, and combined search
WebDataset materialization
loader benchmark stage
lightweight version-comparison scaffolding

That means the hard structural pieces exist:

versioned data + searchable catalog + metrics outputs

The missing production layer is the frozen eval suite itself:

a committed eval probe file
known expected results or relevance labels
automatic search over each dataset version
Precision@K, MRR, and nDCG reports
comparison reports between versions
a generated datasheet for each promoted dataset version

The honest way to describe the current state is:

I implemented the versioning and catalog foundation needed for an eval loop,
plus lightweight comparison scaffolding. The next production step is a frozen
query-probe set with retrieval metrics per dataset version.

That distinction matters. It avoids pretending the project has a complete evaluation harness while still showing that the architecture is ready for one.

A Four-Level Eval Roadmap

I would build this component in four levels.

Level 1: Smoke-Test Evals

Start with a small hand-written probe file.

Example categories:

text:
  reinforcement learning
  biology population growth

image:
  dog outside
  woman cutting cake
  people riding horses

video:
  cooking
  sports
  driving

audio:
  speech sample
  reading sentence

The goal is simple:

Does search return obviously relevant results?

This is useful for demos and sanity checks. It catches broken embedding columns, empty search results, modality-routing bugs, and obvious quality regressions.

Level 2: Versioned Eval Reports

For every dataset version, run the same frozen probes and write a report:

run frozen probes
collect top K results
compute metrics
write eval_report.json

The report should include:

{
  "dataset_version": "multimodal-demo-v002",
  "probe_count": 48,
  "precision_at_5": 0.71,
  "mrr": 0.82,
  "empty_result_rate": 0.04,
  "duplicate_result_rate": 0.07,
  "by_modality": {
    "text": {"precision_at_5": 0.78},
    "image": {"precision_at_5": 0.74},
    "video": {"precision_at_5": 0.61},
    "audio": {"precision_at_5": 0.42}
  }
}
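A report like this could be assembled from per-probe outcomes roughly as follows. This is a sketch, not the project's implementation: `build_report` and the per-probe input shape are assumptions, and `duplicate_result_rate` is left out because it needs item-level dedup information.

```python
from collections import defaultdict


def build_report(dataset_version: str, probe_results: list[dict], k: int = 5) -> dict:
    """Aggregate per-probe outcomes into one report dict.

    Each item in probe_results is assumed to look like
    {"modality": "image", "relevance": [True, False, ...]} in rank order.
    """
    if not probe_results:
        raise ValueError("probe_results must not be empty")

    def precision(rel: list[bool]) -> float:
        return sum(rel[:k]) / k

    def reciprocal_rank(rel: list[bool]) -> float:
        return next((1.0 / rank for rank, ok in enumerate(rel, start=1) if ok), 0.0)

    by_modality: dict[str, list[float]] = defaultdict(list)
    for result in probe_results:
        by_modality[result["modality"]].append(precision(result["relevance"]))

    n = len(probe_results)
    metric = f"precision_at_{k}"
    return {
        "dataset_version": dataset_version,
        "probe_count": n,
        metric: sum(precision(r["relevance"]) for r in probe_results) / n,
        "mrr": sum(reciprocal_rank(r["relevance"]) for r in probe_results) / n,
        # Here "empty" means the search returned zero results for the probe.
        "empty_result_rate": sum(1 for r in probe_results if not r["relevance"]) / n,
        "by_modality": {
            m: {metric: sum(scores) / len(scores)} for m, scores in by_modality.items()
        },
    }
```

The resulting dict can be written with `json.dump` to `eval_report.json` next to the version manifest.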

Now version comparison becomes concrete:

v001 Precision@5 = 0.62
v002 Precision@5 = 0.71

That is much better than “v002 has more rows.”
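The comparison itself can be a small function over two report dicts. The sketch below assumes the report shape shown above; `compare_reports` and the `tolerance` parameter are hypothetical, and the tolerance default is a placeholder rather than a recommended value.

```python
def compare_reports(
    baseline: dict,
    candidate: dict,
    metric: str = "precision_at_5",
    tolerance: float = 0.02,
) -> dict:
    """Report the overall metric delta and any per-modality regressions.

    A modality counts as regressed when the candidate score drops by more
    than `tolerance` relative to the baseline.
    """
    regressions = {}
    candidate_modalities = candidate.get("by_modality", {})
    for modality, base_scores in baseline.get("by_modality", {}).items():
        cand_scores = candidate_modalities.get(modality)
        if cand_scores is None:
            continue  # modality missing from candidate; handle separately
        delta = cand_scores[metric] - base_scores[metric]
        if delta < -tolerance:
            regressions[modality] = round(delta, 4)
    return {
        "metric": metric,
        "overall_delta": round(candidate[metric] - baseline[metric], 4),
        "regressed_modalities": regressions,
    }
```

The per-modality check matters because an overall gain from 0.62 to 0.71 can still hide a drop in one query category.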

Level 3: Graded Relevance

Once the basic loop works, add graded relevance:

0 = irrelevant
1 = weak
2 = good
3 = excellent

Then compute nDCG@K.

This is closer to industry search evaluation because it captures quality differences among relevant results. A perfect match should count more than a merely acceptable match.

It also lets me inspect category-specific regressions:

image food queries improved
image sports queries regressed
video cooking queries still empty
audio retrieval is unreliable without transcripts

That is the level where evaluation starts pointing directly at data work.

Level 4: Feedback Into Data Decisions

The final step is using the report to decide what to change.

An eval report should not only summarize scores. It should suggest data actions:

dog image queries improved
cooking video queries are empty
audio search needs transcript alignment
largest CLIP cluster is still too large
caption mismatch rate is high
duplicate image results appear in top K

Those findings feed back into:

connector choice
quality thresholds
dedup thresholds
sampling and balancing strategy
model choice
embedding-space design
version promotion rules

That closes the loop.

data decision -> dataset version -> eval report -> next data decision

The system starts learning from its own outputs.

Promotion Gates

The eval loop should produce a decision.

A dataset version should not be promoted just because the pipeline finished.

It should be promoted if:

it passes quality checks
it improves or preserves retrieval metrics
it does not regress important query categories
it has acceptable diversity and duplicate-rate metrics
it has acceptable license and provenance coverage
it has no missing assets referenced by the manifest

Conceptually:

catalog
  |
  v
dataset version manifest
  |
  v
frozen eval probes
  |
  v
metrics report
  |
  v
decision: promote, reject, or revise

This is the production mindset: a dataset version is an artifact with a release gate.
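As a sketch, the gate can be a pure function from named check results to one of the three decisions. The split into blocking and non-blocking checks is my assumption about how "reject" and "revise" might be told apart, and the check names in the usage example are hypothetical.

```python
def promotion_decision(checks: dict[str, bool], blocking: set[str]) -> str:
    """Map named check results to 'promote', 'reject', or 'revise'.

    checks maps a check name to True (passed) or False (failed).
    Any failed check listed in `blocking` rejects the version outright;
    other failures send it back for revision.
    """
    failed = {name for name, passed in checks.items() if not passed}
    if not failed:
        return "promote"
    if failed & blocking:
        return "reject"
    return "revise"


# Hypothetical usage: missing assets block promotion, a diversity miss does not.
decision = promotion_decision(
    checks={
        "no_missing_assets": True,
        "retrieval_not_regressed": True,
        "diversity_acceptable": False,
    },
    blocking={"no_missing_assets", "retrieval_not_regressed"},
)
```

Keeping the decision in one function makes it easy to log the inputs next to the outcome in the version's datasheet.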

The Takeaway

The eval loop connects data-engineering decisions to measurable system behavior.

Without it, I can build new dataset versions forever and only guess whether they are better. With it, I can compare versions against a frozen set of probes, track retrieval metrics, inspect corpus health, and decide whether a version deserves promotion.

The short version:

A dataset version is not better because it is bigger.
It is better if it improves measured behavior under a frozen eval.

That is the point of Component 10.

The earlier components made the data durable, queryable, versioned, streamable, and trainable. The eval feedback loop makes it accountable.

Running Glossary

Eval Probe: A fixed query test case used to evaluate search or retrieval behavior across dataset versions.

Frozen Eval Set: A stable collection of eval probes that stays constant while dataset versions change.

Precision@K: The fraction of the top K retrieved results that are relevant.

Recall@K: The fraction of known relevant items retrieved within the top K results.

Mean Reciprocal Rank (MRR): A ranking metric based on how early the first relevant result appears.

nDCG@K: A ranked retrieval metric that uses graded relevance and rewards highly relevant results appearing early.

Empty Result Rate: The fraction of probes for which search returns no results.

Duplicate Result Rate: The fraction of top results that are duplicates or near-duplicates, useful for evaluating dedup quality.

Promotion Gate: The quality and evaluation threshold a dataset version must pass before being used downstream.
