Explaining the Multimodal Lakehouse in Interviews

I have written six implementation deep dives about my serverless multimodal data lakehouse. Those posts cover how it works: Ray actors and the catalog trust boundary, content-addressed storage and versioning, training-ready shards and loaders, the eval feedback loop, and precompute and provenance.

This post is different. It is about how I talk about the project, which turned out to be a separate skill from building it.

A few mock interviews taught me something uncomfortable: I could build the thing and still explain it badly. I would dump twelve component names in one breath, watch the interviewer’s eyes glaze, and realize I had said nothing. So I rebuilt the explanation the same way I rebuilt the pipeline: a structure, a set of decisions I could defend, and an honest account of where it breaks. This is that explanation.

The mistake that taught me the rest

Here is how I used to answer “tell me about a project.”

“It’s a 12-stage pipeline: content-addressed ingestion, Ray preprocessing into embeddings, quality and dedup gates, a unified LanceDB catalog, immutable manifests, and WebDataset shards.”

Every word is correct. The sentence is useless. It is a parts list, and naming a part is not explaining it. An interviewer who has built real datasets hears that and cannot tell whether I did the work or read a tutorial.

The fix was a rule I now apply to every sentence: say the purpose, not the part. The test is whether the sentence contains the word “so” or “because,” out loud or implied. “I used LanceDB” is a fact. “I keep vectors and metadata together so I can filter and search in one place” is an answer. Compare:

“Every asset is stored by a hash of its bytes so duplicates collapse automatically, and nothing enters the catalog until it has passed quality checks, because that catalog is the one thing everything downstream trusts.”

Same system. Now every clause earns its place. Once I started stripping any sentence that could not pass the so/because test, the whole pitch got shorter and louder.

Open shallow, then stop

The second thing I got wrong was not stopping. I would give a good sixty-second summary, feel the silence, get nervous, and keep talking, straight back into the twelve-component list, burying the one thing I had just said well.

The silence after a good pitch is not my problem to solve. It is the interviewer deciding where to zoom in. So I learned to hold the project as four nested layers and always open at the shallowest one:

One line (10 seconds): a serverless, demo-scale multimodal lakehouse that takes raw Hugging Face data all the way to versioned, training-ready shards.
Sixty seconds: what it does, why I built it, what it proves.
Three-minute walk: five phases, narrated as a story.
Any single component, three whys deep: on request.

It works like Google Maps. You do not hand someone the street view of a city they have never seen. You start at the country and let them pinch-to-zoom into the neighborhood they care about. The pitch ends with an implicit question, “happy to go deeper anywhere,” and then I close my mouth and wait.

The five-phase story

Twelve components is a list. Five phases is a story. I group them so I can narrate an arc instead of reciting boxes.

Phase 1, ingest and store. Four connectors pull from Hugging Face: FineWeb-Edu text, COCO images, FineVideo, LibriSpeech audio. Every raw asset lands in a content-addressed store, keyed by the SHA-256 hash of its own bytes. That one decision gives me exact deduplication and integrity checking for free. Identical files collapse to the same key, and the key is the checksum, so corruption is detectable.
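As a minimal sketch of that idea, not the project's actual code: a content-addressed store needs only a hash function and a path convention. The two-character fan-out directory and the put/get names are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def put(store: Path, data: bytes) -> str:
    """Store bytes under their SHA-256 digest and return the digest."""
    key = hashlib.sha256(data).hexdigest()
    # Hypothetical layout: two-character fan-out directory, then the full digest.
    path = store / key[:2] / key
    if not path.exists():  # identical bytes map to the same path, so this is the dedup
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return key


def get(store: Path, key: str) -> bytes:
    """Read an asset back and verify it still hashes to its own key."""
    data = (store / key[:2] / key).read_bytes()
    if hashlib.sha256(data).hexdigest() != key:
        raise ValueError(f"corrupt asset: {key}")
    return data


store = Path(tempfile.mkdtemp())
first = put(store, b"same caption")
second = put(store, b"same caption")    # duplicate collapses to the same key
other = put(store, b"different caption")
stored_files = [p for p in store.rglob("*") if p.is_file()]
```

Three writes leave two files on disk, and get() refuses to return bytes that no longer match their key. That is the whole "for free" claim: dedup and integrity both fall out of the naming scheme.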

Phase 2, process and gate. Ray Data map_batches runs the heavy embedding work on GPUs using stateful actors: the model loads once per worker, and thousands of batches flow through the warm model instead of reloading two gigabytes of weights per batch. Then a tiered gate runs exact-hash dedup, rule checks, and a FAISS near-duplicate pass, so only clean rows survive.
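The "load once per worker" pattern can be sketched without Ray at all, because it is just a callable class: expensive setup in __init__, cheap work in __call__. The class below is a stand-in with a fake model and a made-up embedding width; the commented Ray call shows roughly how such a class is handed to map_batches, with illustrative argument values.

```python
class EmbedBatch:
    """Callable class: setup once in __init__, per-batch work in __call__."""

    loads = 0  # class-level counter so the demo can show how often weights load

    def __init__(self):
        # Expensive setup happens once per worker. A real stage would load
        # model weights onto the GPU here; this stand-in just records the load.
        EmbedBatch.loads += 1
        self.dim = 4  # hypothetical tiny embedding width for the demo

    def __call__(self, batch: dict) -> dict:
        # Stand-in "embedding": a deterministic vector derived from text length.
        batch["embedding"] = [[float(len(t))] * self.dim for t in batch["text"]]
        return batch


# With Ray installed, the same class would be passed to Ray Data roughly as
#   ds.map_batches(EmbedBatch, batch_size=64, num_gpus=1, concurrency=2)
# (values are illustrative). Here a single worker is simulated by hand.
worker = EmbedBatch()
batches = [{"text": ["a", "bb"]}, {"text": ["ccc"]}, {"text": ["dddd", "e"]}]
results = [worker(b) for b in batches]
```

Three batches pass through and the load counter stays at one. A plain function would pay the setup cost on every call, which is the difference the warm-model sentence above is describing.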

Phase 3, catalog and version. Only the filtered rows enter one unified LanceDB catalog, and that catalog is the trust boundary, the single table every downstream consumer reads. A dataset version is then just an immutable JSON manifest of pointers into the content store, so I can branch a dataset without copying a single byte.
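A rough sketch of what "a version is a manifest of pointers" can look like. The field names and the convention of deriving the version id from the canonical JSON are assumptions for illustration, and the asset keys are placeholders rather than real digests.

```python
import hashlib
import json


def make_manifest(name, asset_keys, parent=None):
    """Build a dataset version: a list of content-store keys plus lineage."""
    body = {"name": name, "parent": parent, "assets": sorted(asset_keys)}
    # Hypothetical convention: the version id is the hash of the canonical JSON,
    # so any change to the contents produces a new id.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    version_id = hashlib.sha256(canonical.encode()).hexdigest()[:12]
    return {"id": version_id, **body}


def branch(manifest, name, drop=(), add=()):
    """Derive a new version by editing pointers; no asset bytes are touched."""
    keys = (set(manifest["assets"]) - set(drop)) | set(add)
    return make_manifest(name, keys, parent=manifest["id"])


v1 = make_manifest("v1", ["sha-aaa", "sha-bbb", "sha-ccc"])
v2 = branch(v1, "v1-no-bbb", drop=["sha-bbb"])
```

Branching writes one small JSON document. The parent version is untouched, and both versions keep pointing at the same stored bytes.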

Phase 4, materialize and serve. Manifests get packed into WebDataset tar shards so a training loader can stream them as sequential reads instead of opening millions of tiny files. The loader does prefetch, deterministic shuffle, and mid-epoch resume.
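To make the sequential-read point concrete, here is a small sketch using only the standard library's tarfile module rather than the WebDataset package itself. It follows the WebDataset naming convention, where files that share a basename form one sample; the sample keys and payloads are made up.

```python
import io
import tarfile


def write_shard(samples):
    """Pack samples into one in-memory tar; files sharing a basename form a sample."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_shard(shard_bytes):
    """Stream the tar front to back, regrouping consecutive files by basename."""
    current_key, current = None, {}
    # "r|" opens the archive as a forward-only stream: no seeking, one pass.
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r|") as tar:
        for member in tar:
            key, ext = member.name.split(".", 1)
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current_key, current


samples = [
    ("000000", {"jpg": b"fake-image-bytes-0", "txt": b"a dog"}),
    ("000001", {"jpg": b"fake-image-bytes-1", "txt": b"a cat"}),
]
shard = write_shard(samples)
restored = list(read_shard(shard))
```

The reader never seeks and never opens a second file, which is what lets a loader treat a shard as one long sequential read.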

Phase 5, prove and govern. Any version can be pinned for evals and for generating a datasheet, and every item carries its license and source chain for provenance.

Five phases, each with a so or a because. That is the whole architecture, and it takes about three minutes to say.

Defending every box

The fun starts when the interviewer pushes on a choice. This is where building it paid off. I had reasons, and reasons survive follow-ups in a way that memorized facts do not. A few of the ones I get asked most:

Why Ray and not just a Python loop? At ten thousand records, a single process could do this, and I say so out loud. Pretending otherwise is a bluff a real engineer catches instantly. Ray is there for the production shape: partitioned reads, map_batches, and stateful actors that keep the model warm. The same code scales by raising actor and GPU counts. The pattern is the point, not the present scale.

Why Ray and not Spark? Different tools for different shapes. Spark is built for partitioned, SQL-style transforms and shuffles over huge tabular data. Ray Data is built for heterogeneous pipelines, CPU decode then GPU inference, where you want long-lived actors holding model weights on the GPU. My bottleneck is per-batch model inference, so Ray fit. If the heavy lifting were joins and aggregations over Iceberg tables, Spark would own it. At real scale you often run both: Spark for tabular curation, Ray for the embedding stage.

Why two vector columns instead of one embedding column? This was a mistake I had to fix, and it makes a good answer. Text and audio embeddings from MiniLM are 384-dimensional; image and video embeddings from CLIP are 512-dimensional. They live in different vector spaces with different distance scales. A single column would mix incompatible geometry, so search would return nonsense. The schema has to make the difference visible with text_vector and clip_vector, and results come back per family, because you cannot paper over an embedding-space mismatch with a shared column.
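A toy version of that rule, in plain Python rather than LanceDB: each family gets its own column with a fixed width, and search refuses a query whose dimensionality does not match the column. The column names and dimensions come from the answer above; the rows, ids, and the search helper are illustrative assumptions.

```python
import math

# Column names and widths follow the text; everything else is illustrative.
VECTOR_COLUMNS = {"text_vector": 384, "clip_vector": 512}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(rows, column, query, k=3):
    """Rank rows within one embedding family; never compare across families."""
    expected = VECTOR_COLUMNS[column]
    if len(query) != expected:
        raise ValueError(f"{column} expects {expected} dims, got {len(query)}")
    candidates = [r for r in rows if r.get(column) is not None]
    ranked = sorted(candidates, key=lambda r: cosine(r[column], query), reverse=True)
    return [r["id"] for r in ranked[:k]]


rows = [
    {"id": "text-1", "text_vector": [1.0] + [0.0] * 383, "clip_vector": None},
    {"id": "text-2", "text_vector": [0.0, 1.0] + [0.0] * 382, "clip_vector": None},
    {"id": "image-1", "text_vector": None, "clip_vector": [1.0] + [0.0] * 511},
]
text_hits = search(rows, "text_vector", [1.0] + [0.0] * 383)
image_hits = search(rows, "clip_vector", [1.0] + [0.0] * 511)
```

A text query only ever sees text rows, and a 512-dimensional query against text_vector fails loudly instead of returning a plausible-looking ranking.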

Why store everything by a hash? Content addressing buys exact dedup and integrity with no extra machinery, plus immutable references that dataset versions can point at. Its limit is that it only catches byte-identical duplicates, which is exactly why there is a separate FAISS near-dedup tier for the semantic ones.
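The division of labor between the tiers can be sketched in a few lines. This stand-in uses a brute-force cosine comparison where the real pipeline uses FAISS, and the empty-caption rule, the 0.95 threshold, and the row fields are all assumptions for the demo.

```python
import hashlib
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tiered_gate(rows, near_dup_threshold=0.95):
    """Apply the cheap tiers first so the expensive tier sees fewer rows."""
    seen_hashes, kept = set(), []
    for row in rows:
        digest = hashlib.sha256(row["bytes"]).hexdigest()
        if digest in seen_hashes:           # tier 1: byte-identical duplicate
            continue
        seen_hashes.add(digest)
        if not row["caption"].strip():      # tier 2: rule check (hypothetical rule)
            continue
        # Tier 3: near-duplicate check against rows already kept.
        # Brute force here; an index such as FAISS replaces this at scale.
        if any(cosine(row["embedding"], k["embedding"]) >= near_dup_threshold for k in kept):
            continue
        kept.append(row)
    return [r["id"] for r in kept]


rows = [
    {"id": "a", "bytes": b"img-a", "caption": "a dog", "embedding": [1.0, 0.0]},
    {"id": "a-copy", "bytes": b"img-a", "caption": "a dog", "embedding": [1.0, 0.0]},
    {"id": "b", "bytes": b"img-b", "caption": "   ", "embedding": [0.5, 0.5]},
    {"id": "a-resized", "bytes": b"img-a-resized", "caption": "a dog", "embedding": [0.99, 0.05]},
    {"id": "d", "bytes": b"img-d", "caption": "a cat", "embedding": [0.0, 1.0]},
]
survivors = tiered_gate(rows)
```

The resized copy has different bytes, so the hash tier lets it through and only the embedding tier catches it. That row is the whole argument for having both tiers.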

The shape of every one of these answers is the same: what it is, why I chose it, and the trade-off I will admit when pushed. Admitting the trade-off is what makes the rest believable.

The one idea worth going deep on

If I get exactly one moment to go deep, I spend it on the trust boundary, because it is the idea that turns “I built a pipeline” into “I understand why pipelines rot.”

The catalog is the one table everything downstream reads: search, versioning, sharding, the training loader, evals. That makes it a trust boundary: anything that crosses it is implicitly treated as valid by everything after it. My first implementation let the catalog read straight from the raw manifests/ directory, bypassing the quality gates. So a corrupt image or an empty caption could silently become a valid training sample. Nothing crashed. The pipeline looked green while quietly poisoning every consumer.

The fix was one line of discipline with a deep principle behind it: the catalog ingests only from filtered/*_filtered.parquet. Every stage must consume the previous stage’s approved output, never reach back to raw for convenience. Certify the catalog and you have certified the whole system. The moment one stage skips the boundary, you can no longer reason about what is in your dataset.
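One way to turn that discipline into something that fails loudly is a guard in front of the catalog build. This is a sketch under assumptions: the function name is hypothetical and the example file names are made up; only the filtered/*_filtered.parquet pattern comes from the text.

```python
from pathlib import PurePosixPath


def approved_inputs(paths):
    """Return the paths if every one is gate-approved; raise otherwise."""
    approved = []
    for raw in paths:
        path = PurePosixPath(raw)
        # Pattern taken from the text: filtered/*_filtered.parquet
        if path.parent.name == "filtered" and path.name.endswith("_filtered.parquet"):
            approved.append(str(path))
        else:
            raise ValueError(f"refusing unapproved catalog input: {raw}")
    return approved


# Hypothetical file names for illustration.
ok = approved_inputs(["filtered/images_filtered.parquet", "filtered/text_filtered.parquet"])
```

Passing a path under manifests/ raises instead of silently ingesting, which turns "nothing crashed" into the crash you actually want.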

I wrote about this in more detail in the Ray actors and trust boundary post, but in an interview it compresses to about forty seconds, and it is the highest-signal forty seconds I have.

Demo-scale, said first and on purpose

The project is about ten thousand records and roughly fifteen cents of GPU. Someone will point that out. The worst thing I can do is get defensive or inflate it.

So I lead with the limitation and turn it into the argument. At a billion records the hard part is still not volume. Spark and Ray handle volume. The hard part is correct semantics: making sure that every stage consumes approved data, that versions are reproducible, and that embedding spaces do not get mixed. Those problems are identical at ten thousand and ten billion, and they are the ones I solved. The scale I skipped is the easy, expensive part. The scale-invariant part is the part I wanted to own.

Honesty plus a correct architecture beats a bluff, and it makes me sound like someone with judgment rather than someone with a demo.

The bugs are the best part

The bugs make the strongest interview material, because they are where the architecture met reality. A few I keep ready:

CLIP’s return shape changed under me. My code expected get_image_features() to return a plain tensor; one version of Transformers returned a wrapped output object, and the stage crashed in Modal. I added a compatibility adapter that normalizes every possible shape back to a tensor. The lesson: model wrappers are conveniences, not contracts, so you normalize deliberately at boundaries.
LanceDB could not rename files on a Modal Volume. LanceDB commits with atomic renames, which Modal Volume semantics rejected. I separated write semantics from persistence: build the catalog on local /tmp, let Lance finish its atomic commit there, then copy the finished directory into the Volume. Persistent cloud volumes are not clean POSIX build directories.
Resume by artifact, not by wishful thinking. When a long run got cancelled, the tempting shortcut was to rerun completed modalities with limit=0. But my connectors write manifests, so a zero-limit run would overwrite a good manifest with an empty one. Every downstream stage would then faithfully process nothing, and the pipeline would look successful while silently deleting a finished modality. So resume became artifact-aware: skip committed outputs, rebuild partial ones, and require an explicit flag to overwrite. Cloud pipelines need restart semantics as much as they need model code.
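The resume rule from the last story fits in one small decision function. This is a sketch, not the project's code: the _COMMITTED marker file, the directory layout, and the returned action names are assumptions for illustration.

```python
import tempfile
from pathlib import Path

MARKER = "_COMMITTED"  # hypothetical marker a stage writes as its very last step


def plan(output_dir: Path, overwrite: bool = False) -> str:
    """Decide what to do with one modality's output directory."""
    if not output_dir.exists() or not any(output_dir.iterdir()):
        return "run"          # nothing there yet
    if (output_dir / MARKER).exists():
        # Committed output is protected unless the caller asks explicitly.
        return "overwrite" if overwrite else "skip"
    return "rebuild"          # files but no marker: a partial run


root = Path(tempfile.mkdtemp())

done = root / "text"
done.mkdir()
(done / "manifest.json").write_text("{}")
(done / MARKER).touch()

partial = root / "video"
partial.mkdir()
(partial / "manifest.json").write_text("{}")

missing = root / "audio"

decisions = {d.name: plan(d) for d in (done, partial, missing)}
```

The finished modality is skipped, the interrupted one is rebuilt, and the untouched one runs. Overwriting finished work takes an explicit overwrite=True, so no default path can replace a good manifest with an empty one.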

Each of these is a small story with a transferable lesson, which is exactly the shape an interviewer wants. “Tell me about a hard bug” is not a trap; it is an invitation to show how I think when something breaks.

What I actually practiced

The structure only helps if it survives adrenaline. The things that moved the needle were mechanical:

Recording the sixty-second open on my phone and listening back. If it sounded rushed or list-y, I had found my work.
Cold tell-backs from memory, a few times a day, forcing a so or because into every sentence.
A handful of memorized bridge sentences for when my mind goes blank: “let me ground this in the actual flow,” or “I haven’t hit that exact case, here is how I would reason about it.” Honesty plus visible reasoning beats a confident wrong answer.
Drawing the five phases on a whiteboard with the trust boundary as a literal vertical line labeled “nothing crosses unless it passed the gate.” That single annotation is the highest-signal thing I can put on a board, because it shows I think in invariants, not boxes.

The takeaway

Building the lakehouse taught me distributed batch inference, content addressing, and why a catalog is a trust boundary. Learning to explain it taught me something narrower and just as useful: a system you cannot describe clearly is a system you do not fully understand. Every time I failed to explain a choice, I found a decision I had made on autopilot. Fixing the explanation fixed the gap.

So the pitch is short now. Open shallow. Say the purpose, not the part. Lead with the limitation. Stop talking and let them zoom. And when they push, ride the question down to a mechanism, because the mechanisms are the only part that scales.