A structured reference for how NVIDIA built the 25-trillion-token pretraining corpus behind Nemotron 3 Super — a case study in large-scale data engineering for LLM pretraining, distinct from (but related to) operational ML data pipelines like feature stores.
1. The Big Picture
Core fact to remember: Nemotron 3 Super was pretrained on ~25 trillion tokens, and the report itself points to data preparation — not model architecture — as the largest engineering effort behind it.
Two-phase training curriculum:
| Phase | Token count | Goal |
|---|---|---|
| Phase 1 | 20T tokens | Diversity, broad coverage, generalization |
| Phase 2 | 5T tokens | Higher-quality sources, benchmark accuracy |
🧠 Remember this line: The pipeline is a lakehouse/ETL system optimized for model training instead of analytics. The “tables” are documents, the “metrics” are quality/token/domain distributions, and the “serving layer” is GPU training input.
2. Full Stage List (Study in Order)
1. Collection
2. Extraction & cleaning
3. Language detection
4. Quality scoring
5. Deduplication
6. Toxicity filtering
7. PII removal
8. Document classification
9. Data balancing
10. Synthetic data generation / rephrasing
11. Curriculum / mixture design
12. Tokenization
13. Sequence packing
14. Sharding
15. Post-training data ops (SFT/RL/preference data)
3. Stage-by-Stage Reference
3.1 Collection
What it does: Assembles the corpus from many source types rather than one dataset: Common Crawl, curated high-quality websites, books, academic papers, source code, math datasets, Wikipedia, StackOverflow/GitHub, multilingual data, PDF-derived text, and synthetic SFT-style data.
Key idea to remember: Source diversity is a deliberate design choice, not an artifact of crawl availability.
3.2 Extraction & Cleaning
What it does: Converts raw formats (HTML, PDF, Markdown, Word, XML) into plain text and strips non-content.
Tools named in the report: jusText for HTML-to-text extraction (removes nav bars, ads, cookie banners, scripts); FastText for fast language identification.
Sub-steps:
- HTML/script/CSS/ad removal
- Boilerplate removal (repeated phrases like “Accept Cookies,” “Privacy Policy”)
- Unicode/UTF-8 normalization, invalid-character removal
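The sub-steps above can be sketched with the standard library alone. This is a minimal illustration, not the report's implementation: `clean_text` and `BOILERPLATE_PHRASES` are hypothetical names, the phrase list is a toy, and HTML extraction (the jusText step) is assumed to have already run.

```python
import re
import unicodedata

# Hypothetical phrase list; a real pipeline would curate or learn these.
BOILERPLATE_PHRASES = ("accept cookies", "privacy policy")


def clean_text(raw: str) -> str:
    """Normalize Unicode, drop invalid characters, remove boilerplate lines."""
    # NFC makes visually identical strings byte-identical (helps later hashing).
    text = unicodedata.normalize("NFC", raw)
    # Drop control characters (category "Cc") except newline and tab,
    # plus U+FFFD, the replacement character left by failed decoding.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or (unicodedata.category(ch) != "Cc" and ch != "\ufffd")
    )
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.lower() in BOILERPLATE_PHRASES:
            continue  # whole-line boilerplate match only
        kept.append(stripped)
    # Collapse runs of blank lines left behind by removed lines.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

Matching boilerplate only as whole lines is a deliberately conservative choice here: it avoids deleting a sentence that merely mentions a privacy policy.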
3.3 Language Detection
What it does: Filters the corpus down to supported languages using FastText-based classification.
Pattern: detect_language(doc) → keep if supported, else drop — this runs early and cheaply, before expensive quality/dedup work.
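The keep-or-drop pattern can be sketched as a generator that takes the detector as a parameter. Everything named here is an assumption for illustration: the supported-language set, the confidence threshold, and `toy_detector`, which stands in for a real FastText model so the example runs without a model file.

```python
from typing import Callable, Iterable, Iterator, Tuple

SUPPORTED_LANGUAGES = {"en", "de", "fr"}  # hypothetical set
MIN_CONFIDENCE = 0.5                      # hypothetical threshold

# A detector maps a document to (language_code, confidence).
Detector = Callable[[str], Tuple[str, float]]


def filter_by_language(docs: Iterable[str], detect_language: Detector) -> Iterator[str]:
    """Yield only documents in a supported language with enough confidence."""
    for doc in docs:
        lang, confidence = detect_language(doc)
        if lang in SUPPORTED_LANGUAGES and confidence >= MIN_CONFIDENCE:
            yield doc


def toy_detector(doc: str) -> Tuple[str, float]:
    """Stand-in for a real classifier: looks for a few English stopwords."""
    words = doc.lower().split()
    hits = sum(w in {"the", "and", "is", "of"} for w in words)
    return ("en", 0.9) if hits else ("unknown", 0.0)
```

Because the filter is a generator, documents stream through one at a time, which suits a cheap early stage that runs over the whole raw corpus.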
3.4 Quality Scoring
What it does: Instead of a binary keep/discard, documents are scored on features like perplexity, grammatical quality, repetition, spam signals, and information density, then routed into quality buckets:
high_quality | medium_quality | low_quality_but_recoverable | discard
Key idea to remember: The “recoverable” bucket is what feeds Stage 3.10 (synthetic rephrasing) — this pipeline treats quality filtering as a sorting problem, not just a subtraction problem.
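A minimal sketch of bucket routing follows. The three signals and every threshold are assumptions chosen for illustration; the report's actual scoring models and cut-offs are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class QualitySignals:
    fluency: float      # 0..1, e.g. derived from perplexity (higher is more fluent)
    information: float  # 0..1, information density
    spam: float         # 0..1, spam probability


def assign_bucket(s: QualitySignals) -> str:
    """Route a document to one of four buckets (thresholds are hypothetical)."""
    if s.spam >= 0.8:
        return "discard"
    if s.fluency >= 0.7 and s.information >= 0.7:
        return "high_quality"
    if s.fluency >= 0.4 and s.information >= 0.4:
        return "medium_quality"
    # Poorly written but information-dense: a candidate for rephrasing.
    if s.information >= 0.6:
        return "low_quality_but_recoverable"
    return "discard"
```

The recoverable branch is the point of the design: a document with weak fluency but high information density is sorted for later rewriting instead of being dropped.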
3.5 Deduplication
Two layers:
| Type | Method | Catches |
|---|---|---|
| Exact | SHA/MD5-style hashing | Byte-identical duplicates |
| Fuzzy / near-duplicate | MinHash signatures + Locality-Sensitive Hashing (LSH) | ~90–98% similar documents |
Why it matters: the same article can appear on hundreds of sites; without dedup, a model doesn’t see 600 data points, it memorizes one data point 600 times — which hurts generalization.
Metadata tracked: document_id, content_hash, minhash_signature, dedup_cluster_id, canonical_document_flag
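Both layers can be illustrated in pure Python. This is a sketch under stated assumptions: word 3-gram shingles, 64 salted BLAKE2b hash functions, and the function names are all illustrative choices, and the LSH banding step that makes candidate lookup scale is omitted (signatures are compared pairwise here).

```python
import hashlib


def content_hash(text: str) -> str:
    """Exact-duplicate key: identical bytes give identical digests."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set[str]:
    """Overlapping k-word windows; short texts fall back to one shingle."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """Keep the minimum hash per salted hash function over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions estimates shingle-set Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two documents that differ by one word get different `content_hash` values but mostly matching signatures, which is why the fuzzy layer is needed in addition to the exact one.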
3.6 Toxicity & Safety Filtering
What it does: Removes hate speech, pornography, malware, phishing pages, spam, and low-quality AI-generated content, using trained classifiers rather than keyword lists.
3.7 PII Removal
What it does: Detects and strips phone numbers, emails, SSNs, addresses, API keys, and secrets — for legal compliance and to reduce model memorization risk.
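A regex-based sketch shows the detect-and-strip shape of this stage. The patterns below are simplified assumptions (US-style phone numbers and SSNs, a loose email pattern) and would miss many real cases; production systems typically combine patterns with trained detectors. `PII_PATTERNS` and `redact_pii` are hypothetical names.

```python
import re

# Simplified, illustrative patterns; not exhaustive.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
    ),
}


def redact_pii(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with a typed placeholder and count replacements."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts
```

Returning the per-type counts alongside the text gives the pipeline a cheap metric to log per document or per source.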
3.8 Document Classification
What it does: Buckets documents into categories (books, code, math, science, conversation, news, forums, legal, medical) before sampling, so the mixture design step has something to weight.
3.9 Data Balancing
What it does: Corrects for the internet’s natural skew — a large share of raw crawled data is navigation boilerplate or SEO spam even after cleaning — by rebalancing the mixture toward higher-value categories (books, code, science) rather than training on whatever is most abundant.
3.10 Synthetic Data Generation / Rephrasing
Two distinct uses — keep them separate when studying this:
(a) Recovery rephrasing: documents flagged “useful facts, poor writing” get rewritten by an LLM into clean text, re-scored, and added back to the curated corpus.
low/medium-quality doc → quality model flags as recoverable → LLM rephrase → re-score → add to corpus
(b) Capability-targeted generation: entirely new synthetic data generated (and validated) for specific target skills — algorithmic reasoning, formal logic, economics, multiple-choice QA — released separately as a specialized synthetic dataset collection.
target capability → generate problem → generate solution/explanation → validate format/answer → filter duplicates → add to bucket
🧠 Distinction to remember: (a) rescues data that already exists in weak form; (b) generates data for capabilities that may not exist in the wild at all.
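Flow (b) can be sketched as a generate, validate, deduplicate loop. This is a toy under loud assumptions: a seeded arithmetic generator stands in for an LLM, validation simply recomputes the answer, and all function names are hypothetical.

```python
import random


def generate_problem(rng: random.Random) -> dict:
    """Stand-in for an LLM call: emits a small addition problem."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return {"question": f"What is {a} + {b}?", "answer": str(a + b), "operands": (a, b)}


def is_valid(sample: dict) -> bool:
    """Validate format (digits only) and correctness (recompute the sum)."""
    a, b = sample["operands"]
    return sample["answer"].isdigit() and int(sample["answer"]) == a + b


def build_bucket(n_attempts: int, seed: int = 0) -> list[dict]:
    """Generate, validate, and deduplicate samples for one capability bucket."""
    rng = random.Random(seed)
    seen, bucket = set(), []
    for _ in range(n_attempts):
        sample = generate_problem(rng)
        if not is_valid(sample):
            continue  # drop malformed or wrong samples
        if sample["question"] in seen:
            continue  # drop exact duplicates
        seen.add(sample["question"])
        bucket.append(sample)
    return bucket
```

The structural point is that validation and duplicate filtering sit between generation and the bucket, so a bad generator degrades yield instead of corpus quality.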
3.11 Curriculum / Mixture Design
What it does: Assigns sampling weights to every source based on domain, quality score, and training phase. Higher-quality datasets get proportionally more weight; similar-quality sources get similar weight.
Metadata schema: source_name, domain, quality_score, language, token_count, phase_1_weight, phase_2_weight, sampling_probability
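A short sketch shows how per-phase weights could become sampling probabilities, and how many times a source would be repeated for a given token budget. The field names follow the schema above; the normalization rule, the epoch calculation, and the function names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Source:
    source_name: str
    domain: str
    quality_score: float
    token_count: int      # tokens available in this source
    phase_1_weight: float
    phase_2_weight: float


def sampling_probabilities(sources: list[Source], phase: int) -> dict[str, float]:
    """Normalize the chosen phase's weights so they sum to 1."""
    if phase not in (1, 2):
        raise ValueError("phase must be 1 or 2")
    weights = {
        s.source_name: (s.phase_1_weight if phase == 1 else s.phase_2_weight)
        for s in sources
    }
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}


def effective_epochs(source: Source, probability: float, phase_tokens: int) -> float:
    """How many passes over a source the phase budget implies."""
    return probability * phase_tokens / source.token_count
```

An `effective_epochs` value above 1 means a small, upweighted source is being repeated, which is worth tracking when Phase 2 concentrates on higher-quality data.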
3.12 Tokenization
What it does: Converts cleaned, curated text into token IDs. Training never consumes raw text — only integers.
3.13 Sequence Packing
What it does: Instead of training on ragged, variable-length documents (where padding each one to a common length wastes GPU compute), documents are packed into dense fixed-length blocks (e.g., 8192 or 16384 tokens).
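One simple packing strategy is concatenate-and-chunk, sketched below. It is an assumption that this matches the report's method: other packers avoid splitting documents across blocks. `pack_sequences` and the `eos_id` separator convention are illustrative.

```python
def pack_sequences(docs: list[list[int]], block_size: int, eos_id: int) -> list[list[int]]:
    """Concatenate tokenized docs with an end-of-document token, then chunk.

    Documents may span block boundaries; the trailing partial block is dropped.
    """
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # marks where one document ends
    return [
        stream[i:i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]


# Three short documents packed into blocks of 4 tokens (0 is the separator).
blocks = pack_sequences([[1, 2, 3], [4, 5], [6]], block_size=4, eos_id=0)
# blocks == [[1, 2, 3, 0], [4, 5, 0, 6]]
```

Padding each of the three documents to length 4 would use 12 slots for 6 real tokens; the packed form uses 8 slots with no padding at all.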
3.14 Sharding
What it does: Splits the packed dataset into many binary shard files (e.g., tens of thousands) so each GPU node can read its own shard independently — this is what makes training parallelizable at this scale.
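The idea can be sketched as two deterministic functions: one that splits packed blocks into shards, and one that tells each worker which shards to read. The contiguous-range split and round-robin assignment are assumptions for illustration, as are the function names.

```python
def shard_plan(num_blocks: int, num_shards: int) -> list[range]:
    """Split block indices into contiguous, near-equal shards."""
    base, extra = divmod(num_blocks, num_shards)
    plan, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more
        plan.append(range(start, start + size))
        start += size
    return plan


def shards_for_rank(num_shards: int, rank: int, world_size: int) -> list[int]:
    """Round-robin assignment: each rank reads a disjoint set of shards."""
    return list(range(rank, num_shards, world_size))
```

Because both functions depend only on their arguments, every worker can compute the same plan independently, with no coordination at read time.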
3.15 Post-Training Data Operations
What it does: After pretraining, a second wave of data prep supports SFT, RL, preference data, reward modeling, tool use, and agentic workflows — much of it synthetic, since target behaviors (instruction-following, tool use, step-by-step reasoning) are easier to generate and verify than to source from raw text at scale.
4. Tooling Split (Important Distinction)
NVIDIA’s own documentation draws a clean boundary between two tools:
| Tool | Owns | Examples |
|---|---|---|
| NeMo Curator | Large-scale curation | Deduplication, quality filtering, language ID, PII removal |
| nemotron.data_prep | “Last mile” transformation | Tokenization, packing, chat templating, loss-mask creation |
nemotron.data_prep runs on Ray for distributed processing, supports cloud-native I/O (Hugging Face, S3, GCS, local paths), and provides deterministic shard plans, resumable pipelines, and checksum verification.
🧠 Why this split matters: curation decisions (what stays in the corpus) and training-format decisions (how it’s packaged for the GPU) are different concerns with different failure modes — mixing them makes both harder to debug.
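Checksum verification of shard files can be sketched with `hashlib`. This is not the `nemotron.data_prep` API: the `manifest.json` file name, the `*.bin` glob, and the function names are all hypothetical, chosen only to show the idea.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(shard_dir: Path) -> Path:
    """Record one checksum per shard file (hypothetical manifest layout)."""
    entries = {p.name: sha256_file(p) for p in sorted(shard_dir.glob("*.bin"))}
    manifest = shard_dir / "manifest.json"
    manifest.write_text(json.dumps(entries, indent=2, sort_keys=True))
    return manifest


def verify_manifest(shard_dir: Path) -> list[str]:
    """Return the names of shards that are missing or whose checksum changed."""
    entries = json.loads((shard_dir / "manifest.json").read_text())
    return [
        name for name, expected in entries.items()
        if not (shard_dir / name).exists()
        or sha256_file(shard_dir / name) != expected
    ]
```

A manifest like this also supports resumability: a restarted job can skip any shard whose checksum already matches and rebuild only the ones returned by `verify_manifest`.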
5. End-to-End Pipeline Diagram
Raw Internet Data
|
v
Data Collection (web, books, code, math, academic, synthetic)
|
v
HTML/PDF/Text Extraction (jusText)
|
v
Cleaning & Normalization
|
v
Language Detection (FastText)
|
v
Quality Scoring (perplexity + ensemble classifiers)
|
v
Synthetic Rephrasing (recover "useful but poorly written" docs)
|
v
Deduplication (exact hash + MinHash/LSH)
|
v
Toxicity & PII Filtering
|
v
Document Classification & Balancing
|
v
Curriculum & Mixture Weighting (Phase 1 diversity / Phase 2 quality)
|
v
Tokenization
|
v
Sequence Packing
|
v
Sharding
|
v
25 Trillion Training Tokens
|
v
Distributed GPU Training
6. Mapped to General Data Engineering Terms
| ETL Concept | Nemotron Pipeline Equivalent |
|---|---|
| Ingestion | Crawl dumps, HF datasets, PDFs, code repos |
| Parsing | HTML/PDF/Markdown/code extraction |
| Standardization | UTF-8, Unicode normalization, schema cleanup |
| Filtering | Language ID, spam, safety, PII |
| Deduplication | Exact hash + fuzzy MinHash/LSH |
| Scoring | Quality classifiers, perplexity, domain score |
| Enrichment | Domain labels, quality buckets, metadata |
| Synthetic generation | Rephrasing, QA generation, reasoning data |
| Sampling | Weighted blend by source/domain/quality/phase |
| Tokenization | Text → token IDs |
| Packing | Fixed-length sequence construction |
| Sharding | Distributed training files |
| Validation | Checksums, manifests, token counts |
| Observability | Data lineage, quality dashboards, run metadata |
7. How This Differs from an Operational ML Data Pipeline
Worth contrasting explicitly if you’ve studied feature-store-style architectures (batch/streaming/merged views):
| Dimension | Operational ML pipeline (e.g., feature store) | LLM pretraining pipeline (Nemotron) |
|---|---|---|
| Freshness | Often matters (streaming lane exists) | Largely irrelevant — corpus is built once per training run |
| Unit of data | Rows/events keyed by entity + timestamp | Documents/tokens, no entity key |
| “Serving” | Online store for real-time inference | GPU training job reading shards |
| Correctness contract | Point-in-time correctness (event time) | Quality-band correctness (is this text worth training on) |
| Synthetic data role | Fills coverage/class-imbalance gaps | Also rescues weak documents via rephrasing, and targets capability gaps directly |
| Versioning unit | Feature definition + merged view | Corpus snapshot + mixture weights + tokenizer version |
🧠 Key insight: the stages look similar (collect, clean, validate, enrich, version) but the contract each stage has to satisfy is different, because the consumer is a training run instead of a live model endpoint.
8. Glossary
| Term | Definition |
|---|---|
| jusText | A tool for extracting the main body text from HTML pages, stripping boilerplate |
| FastText | A lightweight library used here for high-throughput language identification |
| Perplexity | A measure of how well a language model predicts a piece of text — used here as a quality signal |
| MinHash | A technique for cheaply estimating similarity between large sets (documents), used for fuzzy dedup |
| Locality-Sensitive Hashing (LSH) | A method for finding near-duplicate items efficiently at scale using MinHash-style signatures |
| Quality bucket | A tier (high/medium/low-recoverable/discard) assigned to a document instead of a binary keep/drop decision |
| Recovery rephrasing | Using an LLM to rewrite a low-quality-but-useful document into clean training text |
| Capability-targeted synthetic data | Synthetic examples generated specifically to teach a skill (e.g., algorithmic reasoning) rather than to patch coverage |
| Curriculum (training) | A deliberate ordering/weighting of data across training phases, rather than one uniform shuffle |
| Sequence packing | Combining variable-length documents into fixed-length token blocks to maximize GPU utilization |
| Sharding | Splitting a dataset into many independent files so distributed workers can read in parallel |
| bin/idx format | A binary training-data format (paired with an index file) used by Megatron-style training frameworks |
9. Self-Check Questions
- Why does the report call data preparation the biggest engineering effort — bigger than architecture design?
- What are the two phases of the training curriculum, and what does each optimize for?
- Name the two layers of deduplication and the technique each one uses.
- What’s the difference between “recovery rephrasing” and “capability-targeted synthetic data generation”?
- Why does quality scoring produce buckets instead of a single keep/discard decision?
- What’s the division of responsibility between NeMo Curator and nemotron.data_prep?
- Why does sequence packing matter for training efficiency?
- In the general-ETL mapping table, what plays the role of “enrichment”?
- Contrast this pipeline’s correctness contract with a feature store’s point-in-time correctness — what does each one guarantee?
One-line takeaway
The 25-trillion-token corpus behind Nemotron 3 Super wasn’t found — it was built: collected from a dozen source types, cleaned, scored into quality bands, selectively rewritten rather than merely filtered, deduplicated at two levels, rebalanced against the internet’s natural skew, and packed into a two-phase curriculum designed to teach breadth first and precision second.