From Search Demo to Data Infrastructure

The previous post covered the eval feedback loop: the point where a dataset version stops being “new” and has to prove whether it is actually better.

This final post covers the remaining two components of the multimodal lakehouse pipeline:

Component 11: Precompute vs On-the-Fly
Component 12: Provenance and Observability

These two pieces answer the last two questions I need before I can trust a dataset version downstream:

What work should be baked into the dataset, and what should happen live?
Can I trace every sample, failure, cost, and decision that produced it?

Component 11: Precompute vs On-the-Fly

By this point in the pipeline, the dataset is already versioned, filtered, deduplicated, searchable, materialized, and benchmarked:

source connectors
-> CAS
-> Ray preprocessing
-> quality/dedup
-> embedding cache
-> LanceDB catalog
-> dataset version manifest
-> WebDataset shards
-> training loader
-> eval feedback loop

The next design question is subtle:

Which transformations belong in the dataset,
and which transformations belong in the training loop?

That is the precompute vs on-the-fly boundary.

It sounds like an implementation detail. It is not. This boundary affects cost, reproducibility, training speed, model generalization, and debugging.

Why the Boundary Exists

Not every transformation has the same purpose.

Some transformations are expensive, deterministic, and reused across many experiments:

decode video
extract keyframes
compute CLIP embeddings
compute Whisper features
resize assets to a canonical format
quality-score records
deduplicate embeddings

These are good candidates for precompute.

Other transformations are intentionally random:

random crop
color jitter
audio noise
text masking
mixup
cutmix
random frame sampling

These are better done on-the-fly during training.

The rule I like is:

Precompute what makes the dataset reproducible.
Randomize what makes training robust.

Or, stated another way:

If a transform defines the dataset, precompute it.
If a transform defines the training experience, keep it live.

The Bad Extreme: Precompute Everything

One naive approach is to bake every transformation into the dataset:

image -> resized image -> cropped image -> normalized tensor -> saved forever

That makes training fast because the loader has very little work left.

But it can make the dataset rigid.

If I precompute one crop of an image, every training run sees that same crop. If I crop on-the-fly, the model can see slightly different views across epochs.

The same idea applies across modalities:

precomputing one image crop removes spatial variation
precomputing one audio perturbation removes acoustic variation
precomputing one text mask removes token-level variation
precomputing one video frame sample removes temporal variation

So precomputing everything maximizes throughput, but it can freeze away useful training randomness.
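
To make the crop example concrete, here is a minimal sketch using only the Python standard library. The function name `random_crop_box` and the 256/224 sizes are illustrative assumptions, not code from this project. It contrasts a crop box chosen once and baked into the dataset with a box drawn fresh by the loader each epoch.

```python
import random

def random_crop_box(width, height, crop, rng):
    """Pick a crop x crop window inside a width x height image."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return (left, top, left + crop, top + crop)

# Precomputed: the box is drawn once and stored, so every epoch reuses it.
baked_box = random_crop_box(256, 256, 224, random.Random(0))
precomputed_views = [baked_box for _epoch in range(3)]

# On-the-fly: the loader draws a new box each epoch from a seeded RNG,
# so the run stays reproducible while the views can differ.
rng = random.Random(0)
live_views = [random_crop_box(256, 256, 224, rng) for _epoch in range(3)]
```

Seeding the loader RNG is what keeps the live path reproducible: the randomness belongs to the training config rather than to the stored pixels.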

The Other Bad Extreme: Compute Everything Live

The opposite mistake is moving all preprocessing into the training loader.

For example:

decode video
extract frames
run CLIP
run Whisper
quality score
deduplicate
augment
batch

That keeps the dataset flexible, but it makes the training loop responsible for expensive data engineering work.

The result is predictable: the GPU waits.

This is the same lesson from the sharding and loader post. A GPU can be expensive, available, and underused because the data path cannot feed it fast enough.

Computing everything live keeps the dataset flexible, but it risks turning the loader into the bottleneck.

The Hybrid Boundary

The right architecture is hybrid.

In this project, the boundary looks like this:

Transformation	Precompute?	Reason
Content hash	Yes	Needed for dedup, caching, and reproducibility
Media extraction	Yes	Expensive and deterministic
Text/image/video/audio embeddings	Yes	Expensive GPU work reused across versions
Quality status	Yes	Needed for catalog filtering and versioning
ANN dedup decisions	Yes	Dataset-level decision, not per-epoch work
Dataset version manifest	Yes	Must be immutable and reproducible
WebDataset shards	Yes	Training layout should be materialized
Random crop	No	Useful stochastic augmentation
Noise injection	No	Should vary across epochs
Random masking	No	Useful for training diversity
Lightweight normalization	Either	Depends on framework and model

The important line is between dataset definition and training variation.

Precomputed artifacts should be part of the data contract. On-the-fly transforms should be part of the training config.
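
One way to express the table as a rule is sketched below. `TransformSpec` and `placement` are hypothetical names, and the three flags are my simplification of the table's reasoning rather than something the project implements.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformSpec:
    name: str
    stochastic: bool        # output is meant to vary across epochs
    expensive: bool         # heavy decode or model inference
    defines_dataset: bool   # result is part of the data contract

def placement(spec: TransformSpec) -> str:
    """Return 'live', 'precompute', or 'either' for a transform."""
    if spec.stochastic:
        return "live"            # training variation stays in the loader
    if spec.defines_dataset or spec.expensive:
        return "precompute"      # dataset definition is materialized once
    return "either"              # cheap and deterministic: a judgment call

plan = {
    spec.name: placement(spec)
    for spec in [
        TransformSpec("content_hash", False, False, True),
        TransformSpec("clip_embedding", False, True, True),
        TransformSpec("random_crop", True, False, False),
        TransformSpec("lightweight_normalization", False, False, False),
    ]
}
```

The point of writing it down is that the placement becomes reviewable: a transform that changes from `live` to `precompute` is a change to the data contract, not a loader tweak.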

How This Changes by Modality

The boundary is especially important in a multimodal pipeline because each modality has a different cost profile.

For text:

token counts can be precomputed
text embeddings can be precomputed
quality scores can be precomputed
random masking should stay live for language-model training

For images:

decode validity checks can be precomputed
CLIP embeddings can be precomputed
canonical resizing can be precomputed if useful
random crops and color jitter should usually stay live

For video:

clip extraction should usually be precomputed
keyframes should usually be precomputed
full video decoding during training is expensive
random temporal sampling may stay live if the objective benefits from it

For audio:

silence detection can be precomputed
transcripts or Whisper features can be precomputed
noise augmentation can happen live
transcript alignment should be versioned if used for retrieval

This is why multimodal data pipelines need explicit transform planning. A video transform can be orders of magnitude more expensive than a text transform.

What This Project Has Today

The project already implements the foundation of this hybrid approach:

CAS stores raw assets immutably.
Ray preprocessing computes embeddings before training.
Quality and dedup decisions happen before cataloging.
LanceDB stores vectors and metadata together.
Dataset versions pin item IDs and model choices.
WebDataset shards materialize a version into a training layout.
The loader benchmark checks whether that layout can be streamed efficiently.
The precompute_assets stage exists as the place where deterministic training assets can be prepared.

The architecture separates:

offline data preparation
from
online training-time consumption

That separation is the main win.

The training loader should not be responsible for rediscovering quality, recomputing embeddings, rerunning dedup, or rebuilding training assets. It should stream, decode, lightly augment, batch, and prefetch.
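
As a rough illustration of that division of labor, the sketch below chains decode, light augmentation, batching, and prefetching with plain Python generators. It is a toy, not the project's loader: `loader`, `batched`, and `prefetch` are hypothetical helpers, and a real loader would normally rely on its framework's worker pool instead of a single thread.

```python
import queue
import random
import threading
from typing import Callable, Iterable, Iterator, List

def batched(items: Iterable, batch_size: int) -> Iterator[List]:
    """Group items into lists of batch_size; the last batch may be shorter."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def prefetch(items: Iterable, depth: int = 2) -> Iterator:
    """Fill a bounded queue from a background thread so the consumer rarely waits."""
    q: queue.Queue = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker():
        try:
            for item in items:
                q.put(item)
        finally:
            q.put(done)  # always signal the end so the consumer cannot hang

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

def loader(samples: Iterable, decode: Callable, augment: Callable,
           batch_size: int, seed: int) -> Iterator[List]:
    """Stream -> decode -> lightly augment -> batch -> prefetch."""
    rng = random.Random(seed)  # the seed belongs in the training config
    decoded = (decode(s) for s in samples)
    augmented = (augment(x, rng) for x in decoded)
    return prefetch(batched(augmented, batch_size))
```

Nothing in this path scores quality, computes embeddings, or deduplicates; those results arrive already materialized in the shards.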

The Next Upgrade: A Transform Registry

A stronger version of this component would add a formal transform registry.

Each transform would be described with metadata:

name
version
input columns
output columns
deterministic: true/false
precompute: true/false
model dependency
config hash

Then a dataset manifest could record:

transforms:
  resize_v001:
    deterministic: true
    precomputed: true
    config_hash: a31f...

  random_crop_v001:
    deterministic: false
    applied_in_loader: true
    config_hash: 91bd...

This matters because transformations are part of reproducibility.

If two model runs use the same dataset rows but different preprocessing, they are not really training on the same dataset.

A production-grade manifest should pin:

dataset version
model versions
embedding versions
quality thresholds
dedup thresholds
transform configs
loader randomness seed

The dataset version is not just a list of rows. It is a list of rows plus the transformation contract used to produce training samples.
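
A registry entry along these lines could be sketched as follows. The class and function names are hypothetical, and hashing the canonical JSON of the config with SHA-256 is one reasonable choice, not necessarily what a production system would use.

```python
import hashlib
import json
from dataclasses import dataclass, field

@dataclass
class TransformEntry:
    name: str
    version: str
    deterministic: bool
    precompute: bool
    config: dict = field(default_factory=dict)

    def config_hash(self) -> str:
        """Hash the config in a canonical form so key order does not matter."""
        payload = json.dumps(self.config, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

def manifest_transforms(entries):
    """Build the 'transforms' section of a dataset manifest."""
    return {
        f"{e.name}_{e.version}": {
            "deterministic": e.deterministic,
            "precomputed": e.precompute,
            "config_hash": e.config_hash(),
        }
        for e in entries
    }

registry = [
    TransformEntry("resize", "v001", True, True, {"width": 224, "height": 224}),
    TransformEntry("random_crop", "v001", False, False, {"size": 196}),
]
transforms_section = manifest_transforms(registry)
```

With this in place, two runs can be compared by their transform hashes: identical rows with different hashes are different training datasets.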

Failure Modes

The precompute boundary has predictable failure modes.

Precomputing stochastic transforms reduces training diversity. Doing expensive deterministic transforms live slows training and starves the GPU. Failing to version transforms makes old training runs hard to reproduce.

Mixing embeddings from different model versions breaks search and dedup. Silently changing loader transforms makes model regressions hard to debug.

The boundary is not just about speed. It is about being able to explain what the model actually saw.
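
The mixed-embedding failure is cheap to guard against. The check below is a sketch that assumes catalog rows are dictionaries carrying `embedding_model` and `embedding_version` keys; the function name is hypothetical.

```python
from collections import Counter

def check_single_embedding_space(rows):
    """Raise if rows mix vectors from more than one (model, version) pair."""
    seen = Counter((r["embedding_model"], r["embedding_version"]) for r in rows)
    if len(seen) > 1:
        raise ValueError(f"mixed embedding spaces: {dict(seen)}")
    # Return the single pair, or None when there are no rows at all.
    return next(iter(seen), None)
```

Running a check like this before building an index or a dedup pass turns a silent quality problem into an explicit error.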

Component 12: Provenance and Observability

The final component answers a different question:

Can I explain where every training sample came from,
what happened to it,
and why it ended up in this dataset version?

That is provenance.

It also asks:

Can I see whether the pipeline is healthy, slow, expensive, failing, or drifting?

That is observability.

Together, they turn a demo pipeline into something closer to real infrastructure.

Why Provenance Matters

In small projects, it is tempting to treat data as anonymous rows.

In production, every row has a history.

For a single item, I should be able to answer:

Where did it come from?
Which connector ingested it?
What was its original license?
What is its content hash?
Was it filtered?
Was it deduplicated?
Which model embedded it?
Which dataset versions include it?
Which shards contain it?
Was it used in a training run?

That chain is the item’s lineage.

Without lineage, debugging becomes guesswork.

If a model behaves badly, I need to know whether the issue came from:

bad source data
wrong connector parsing
corrupted media
bad caption alignment
duplicate records
unsafe content
wrong embedding model
broken sharding
loader skew
license-restricted data

Provenance gives me a way to trace backward from model behavior to data decisions.

The simplest rule:

If I cannot trace a sample, I cannot trust the dataset.

Licensing Is Provenance

Licensing is not just paperwork. It changes what a dataset can be used for.

The source datasets in this project have different usage expectations:

FineWeb-Edu
COCO
FineVideo
LibriSpeech
MSR-VTT-style datasets

Some data may be open for broad use. Some may be research-only. Some may require attribution. Some may not be appropriate for commercial training.

Each catalog row should carry license metadata so the system can answer:

show all research-only items
exclude non-commercial sources
build a version using only permissive data
remove all items from source X

That last case is important.

A takedown request should become a new manifest excluding a source, not a full re-ingestion project.

That is the power of provenance plus versioning.
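
A minimal sketch of that idea, assuming a manifest is a dictionary with `version` and `item_ids`, and the catalog maps each item ID to a row with `source` and `license`. The function name, source names, and license labels are placeholders, not the project's schema.

```python
def derive_manifest(parent, catalog, new_version,
                    exclude_sources=(), allowed_licenses=None):
    """Build a child manifest by filtering the parent's items on catalog metadata."""
    excluded = set(exclude_sources)
    kept = []
    for item_id in parent["item_ids"]:
        row = catalog[item_id]
        if row["source"] in excluded:
            continue  # takedown: drop everything from this source
        if allowed_licenses is not None and row["license"] not in allowed_licenses:
            continue  # license filter: keep only the allowed labels
        kept.append(item_id)
    return {
        "version": new_version,
        "parent": parent["version"],   # keeps the audit trail
        "item_ids": kept,
        "excluded_sources": sorted(excluded),
    }

catalog = {
    "a": {"source": "source_a", "license": "permissive"},
    "b": {"source": "source_x", "license": "permissive"},
    "c": {"source": "source_a", "license": "research-only"},
}
v001 = {"version": "v001", "item_ids": ["a", "b", "c"]}
v002 = derive_manifest(v001, catalog, "v002", exclude_sources=["source_x"])
```

The parent manifest is never mutated, so v001 remains reproducible for any run that already used it.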

What to Track Per Item

At minimum, each item should track:

Field	Why it matters
`id`	Stable item identity
`source`	Original dataset or provider
`modality`	Text/image/video/audio
`content_hash`	CAS identity and dedup key
`content_path`	Where the raw asset lives
`license`	Usage rights
`connector_version`	How it was ingested
`quality_status`	Whether it passed filters
`quality_reason`	Why it failed or passed
`embedding_model`	Which model produced vectors
`embedding_version`	Prevents mixing incompatible vectors
`created_at`	Audit/debug timestamp
`dataset_versions`	Which versions include this item

The current catalog already captures many of these ideas: source, modality, content hash, license, quality status, vectors, and metadata.

A production version would make the lineage fields more explicit and queryable.

Observability: What to Measure

Provenance tells me what happened to a row.

Observability tells me what is happening to the system.

For this pipeline, useful observability metrics fall into a few groups.

Ingestion Metrics

rows scanned
rows accepted
rows skipped
skip reasons
download failures
decode failures
bytes written to CAS
duplicate hash count

These answer:

Are my connectors healthy?

Preprocessing Metrics

batches processed
rows/sec
GPU utilization
model load time
inference time
batch size
OOM count
retry count

These answer:

Are my Ray actors and GPUs doing useful work?

Quality and Dedup Metrics

quality pass rate
quality fail reasons
near duplicates removed
largest cluster percentage
normalized entropy
safety filter counts

These answer:

Is the dataset getting cleaner or collapsing?
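
The post does not pin down formulas for the two diversity metrics, so the sketch below is an assumption: largest cluster percentage as the share of items in the biggest cluster, and normalized entropy as the Shannon entropy of the cluster-size distribution divided by the log of the number of clusters.

```python
import math
from collections import Counter

def cluster_stats(cluster_ids):
    """Return (largest_cluster_pct, normalized_entropy) for a list of cluster labels."""
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no items to score")
    largest_pct = 100.0 * max(counts.values()) / total
    if len(counts) < 2:
        return largest_pct, 0.0  # a single cluster has no diversity
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return largest_pct, entropy / math.log(len(counts))
```

Under this definition a value near 1.0 means items are spread evenly across clusters, and a value drifting toward 0 while the largest cluster grows is a sign the dataset is collapsing.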

Catalog and Versioning Metrics

catalog row count
missing embeddings
missing assets
rows per modality
license distribution
version size
version diff from previous

These answer:

Is the catalog trustworthy?
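
The version diff metric can be as simple as a set comparison over item IDs. This is a sketch with a hypothetical function name, assuming manifests expose their item IDs as a list.

```python
def version_diff(previous_ids, current_ids):
    """Summarize which item IDs were added, removed, or kept between versions."""
    prev, curr = set(previous_ids), set(current_ids)
    return {
        "added": sorted(curr - prev),
        "removed": sorted(prev - curr),
        "kept": len(prev & curr),
    }
```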

Sharding and Loader Metrics

number of shards
average shard size
shard size skew
items/sec
MB/sec
batch latency
worker imbalance
GPU idle time

These answer:

Can training consume this dataset efficiently?

Cost Metrics

Modal GPU seconds
CPU seconds
memory GB-seconds
storage used
egress/download cost
cost per 1K records
cost per modality

These answer:

What does it cost to build this dataset version?

Cost matters because data quality work can become expensive quietly. A useful pipeline should not only say that v002 is better. It should also say what v002 cost to produce.
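
The arithmetic behind cost per 1K records is simple enough to keep next to the run report. In this sketch the function names and the per-second rates are placeholders; real rates come from the provider's pricing, which I am not quoting here.

```python
def run_cost(gpu_seconds, cpu_seconds, gpu_rate_per_s, cpu_rate_per_s):
    """Total compute cost for one run, given per-second rates."""
    return gpu_seconds * gpu_rate_per_s + cpu_seconds * cpu_rate_per_s

def cost_per_1k(total_cost, record_count):
    """Normalize a run's cost to 1,000 records."""
    if record_count <= 0:
        raise ValueError("record_count must be positive")
    return 1000.0 * total_cost / record_count

# Illustrative numbers only: one GPU-hour and two CPU-hours over 50,000 records.
total = run_cost(3600, 7200, gpu_rate_per_s=0.001, cpu_rate_per_s=0.0001)
per_1k = cost_per_1k(total, 50_000)
```

Tracking the normalized figure per version makes it possible to say whether v002's quality gain was bought with a proportionate increase in cost.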

Failure Observability

Good observability is not only about success metrics. It should preserve failure reasons.

The battle scars from this project were all boundary failures:

empty video manifest
audio decoder object shape changed
CLIP return shape changed
LanceDB rename failed on Modal Volume
text content treated as file path

A robust pipeline should report structured failure events:

stage
modality
item_id
error type
error message
retryable or not
skip reason

This turns failures into searchable data.

Instead of reading logs manually, I should be able to ask:

How many LibriSpeech rows failed because audio decoding changed?
How many video rows had no usable bytes?
How many images failed because files were corrupt?

That is the difference between logs and observability.

Logs tell me something happened. Observability lets me measure the pattern.
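
A sketch of what a structured failure event might look like, using the fields listed above. The class and function names are hypothetical, and the stage and error-type strings in the example are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass(frozen=True)
class FailureEvent:
    stage: str
    modality: str
    item_id: str
    error_type: str
    error_message: str
    retryable: bool
    skip_reason: Optional[str] = None

def failure_counts(events: Iterable[FailureEvent]) -> Counter:
    """Count failures by (stage, modality, error_type)."""
    return Counter((e.stage, e.modality, e.error_type) for e in events)

events = [
    FailureEvent("preprocess", "audio", "item-1", "DecodeError", "unexpected decoder shape", False, "decode_failed"),
    FailureEvent("preprocess", "audio", "item-2", "DecodeError", "unexpected decoder shape", False, "decode_failed"),
    FailureEvent("ingest", "video", "item-3", "EmptyAsset", "no usable bytes", False, "empty_bytes"),
]
counts = failure_counts(events)
```

Once failures are rows instead of log lines, questions like the ones above become a group-by rather than a grep.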

Lineage Events

A stronger production version would emit structured lineage events for every stage:

stage_name
input_artifacts
output_artifacts
row_count
duration
cost
code_version
config_hash
status

Then a single item could be traced end to end:

FineVideo row
-> CAS hash
-> keyframes
-> CLIP embedding
-> quality pass
-> catalog row
-> dataset version
-> shard
-> loader benchmark

A production-grade version could use OpenLineage-style events, Prometheus metrics, Grafana dashboards, structured JSON logs, validation reports, dataset datasheets, model/data cards, and cost reports per run.

The specific tool is less important than the contract:

every artifact should have lineage
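
To show how such events support end-to-end tracing, here is a small sketch. It is not the OpenLineage schema: `LineageEvent`, `to_json_line`, and `trace_back` are hypothetical names, and the artifact identifiers are placeholders.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

@dataclass(frozen=True)
class LineageEvent:
    stage_name: str
    input_artifacts: List[str]
    output_artifacts: List[str]
    row_count: int
    duration_s: float
    cost_usd: float
    code_version: str
    config_hash: str
    status: str

def to_json_line(event: LineageEvent) -> str:
    """Serialize one event as a single JSON log line."""
    return json.dumps(asdict(event), sort_keys=True)

def trace_back(events, artifact):
    """Walk from an artifact to its origins; return stage names, earliest first."""
    produced_by = {out: e for e in events for out in e.output_artifacts}
    stages, seen, frontier = [], set(), [artifact]
    while frontier:
        current = frontier.pop()
        event = produced_by.get(current)
        if event is None or current in seen:
            continue  # reached a raw source, or already visited
        seen.add(current)
        stages.append(event.stage_name)
        frontier.extend(event.input_artifacts)
    return list(reversed(stages))

events = [
    LineageEvent("cas_ingest", ["source:row-1"], ["cas:abc"], 1, 0.1, 0.0, "code-v1", "cfg-1", "ok"),
    LineageEvent("embed", ["cas:abc"], ["vec:abc"], 1, 0.5, 0.0, "code-v1", "cfg-2", "ok"),
    LineageEvent("shard", ["vec:abc"], ["shard:0"], 1, 0.2, 0.0, "code-v1", "cfg-3", "ok"),
]
path = trace_back(events, "shard:0")
```

For a linear chain the result reads like the trace above; with branching inputs the order is only approximate, which is acceptable for a sketch but would need a proper graph traversal in production.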

Provenance Makes Versioning Auditable

Provenance and versioning are tightly linked.

A dataset version should not only say:

these item IDs are included

It should also say:

where those items came from
which licenses they carry
which transforms were applied
which models embedded them
which filters accepted them
which metrics describe them

That turns a manifest from a list into an audit record.

A stronger manifest might include:

version name
created time
source counts
license counts
quality thresholds
embedding model versions
transform config hashes
diversity metrics
eval metrics
shard paths

Then the dataset version becomes reproducible and explainable.

What This Project Has Today

This project implements several provenance and observability foundations:

source connectors preserve source identity
CAS gives every raw asset a content hash
the catalog stores source, modality, content hash, license, quality status, metadata, and vectors
dataset manifests pin item IDs and model choices
sharding materializes a version into a physical training layout
loader benchmarks record throughput
Modal runs record runtime and cost observations
architecture notes document failures and tradeoffs
battle scars act as human-readable operational history

That is more than a typical search demo needs.

The honest limitation is that observability is mostly file-, log-, and report-based today. It is not a full dashboard.

That is fine for the scale of this project. The production extension would export structured events and metrics to a dashboard.

Closing Thoughts: From Search Demo to Data Infrastructure

I started this project thinking I was building a multimodal search demo.

The first version was simple:

text/images -> embeddings -> LanceDB -> search endpoint

But the deeper I went, the more I realized that embedding search is only one small part of the real problem.

Production multimodal data infrastructure is not just about generating vectors. It is about building a system where data can be ingested, traced, filtered, deduplicated, versioned, searched, materialized, benchmarked, evaluated, and trusted.

That is why the project grew into a 12-component pipeline:

source connectors
-> content-addressed storage
-> Ray preprocessing
-> quality/dedup
-> embedding cache
-> metadata catalog
-> dataset versioning
-> WebDataset sharding
-> training loader
-> eval feedback loop
-> precompute/on-the-fly boundary
-> provenance and observability

Each component taught me a different lesson.

The source connectors taught me that public datasets are not stable APIs. The CAS taught me that reproducibility starts with immutable content hashes. Ray taught me that distributed inference is mostly about keeping models warm and GPUs fed. The catalog taught me that trust boundaries matter. Versioning taught me that a dataset should be an artifact, not a folder.

Sharding and loaders taught me that training performance depends on physical layout. The eval loop taught me that a dataset version is not better because it is bigger. It is better only if it improves measured behavior. Provenance taught me that if I cannot trace a sample, I cannot trust the dataset.

This is still a demo-scale system. It does not process full web-scale datasets. It does not have production auth, dashboards, autoscaling policies, or enterprise governance. Some pieces are intentionally simple.

But the architecture follows the same pattern I saw in large-scale multimodal data systems:

separate raw storage from metadata
separate query formats from training formats
precompute expensive deterministic work
preserve lineage
close the loop with evaluation

The biggest lesson was that ML data engineering is not just preprocessing.

It is infrastructure for making data reusable, measurable, and trustworthy.

If I had to summarize the project in one sentence:

I built a demo-scale multimodal lakehouse pipeline to understand how raw data
becomes a trusted, versioned, training-ready dataset.

And if I had to summarize what I learned:

The hard part is not embedding the data.
The hard part is knowing which data I embedded,
why it was kept,
how it changed,
whether it improved the system,
and whether I can reproduce it later.

That is the difference between a notebook demo and data infrastructure.

Running Glossary

Precompute: Running deterministic, reusable transformations ahead of training and storing the result.

On-the-Fly Transform: A transformation applied live during training, often because randomness improves generalization.

Transform Registry: A versioned record of transformations, configs, inputs, outputs, and whether each transform is deterministic.

Transformation Contract: The set of preprocessing and augmentation rules that define how raw rows become training samples.

Provenance: The lineage of a data item, including where it came from, how it was transformed, and which versions include it.

Observability: Metrics, logs, traces, and reports that reveal pipeline health, failures, throughput, data quality, and cost.

Lineage Event: A structured record emitted by a pipeline stage describing inputs, outputs, row counts, duration, config, code version, and status.

Dataset Datasheet: Documentation that describes a dataset’s sources, composition, intended use, limitations, licensing, metrics, and known risks.