A structured reference for how NVIDIA built the 25-trillion-token pretraining corpus behind Nemotron 3 Super — a case study in large-scale data engineering for LLM pretraining, distinct from (but related to) operational ML data pipelines like feature stores.


1. The Big Picture

Core fact to remember: Nemotron 3 Super was pretrained on ~25 trillion tokens, and the report itself points to data preparation — not model architecture — as the largest engineering effort behind it.

Two-phase training curriculum:

Phase   | Token count | Goal
Phase 1 | 20T tokens  | Diversity, broad coverage, generalization
Phase 2 | 5T tokens   | Higher-quality sources, benchmark accuracy

🧠 Remember this line: The pipeline is a lakehouse/ETL system optimized for model training instead of analytics. The “tables” are documents, the “metrics” are quality/token/domain distributions, and the “serving layer” is GPU training input.


2. Full Stage List (Study in Order)

1.  Collection            8.  Document classification
2.  Extraction & cleaning 9.  Data balancing
3.  Language detection    10. Synthetic data generation / rephrasing
4.  Quality scoring       11. Curriculum / mixture design
5.  Deduplication         12. Tokenization
6.  Toxicity filtering    13. Sequence packing
7.  PII removal           14. Sharding
                          15. Post-training data ops (SFT/RL/preference data)

3. Stage-by-Stage Reference

3.1 Collection

What it does: Assembles the corpus from many source types rather than one dataset: Common Crawl, curated high-quality websites, books, academic papers, source code, math datasets, Wikipedia, StackOverflow/GitHub, multilingual data, PDF-derived text, and synthetic SFT-style data.

Key idea to remember: Source diversity is a deliberate design choice, not an artifact of crawl availability.

3.2 Extraction & Cleaning

What it does: Converts raw formats (HTML, PDF, Markdown, Word, XML) into plain text and strips non-content.

Tools named in the report: jusText for HTML-to-text extraction (removes nav bars, ads, cookie banners, scripts); FastText for fast language identification.

Sub-steps:

  • HTML/script/CSS/ad removal
  • Boilerplate removal (repeated phrases like “Accept Cookies,” “Privacy Policy”)
  • Unicode/UTF-8 normalization, invalid-character removal
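The normalization and boilerplate sub-steps above can be sketched with the Python standard library. This is a minimal illustration, not the report's implementation: the phrase list, the choice of NFC normalization, and the function name clean_text are all assumptions made for the example.

```python
import re
import unicodedata

# Hypothetical boilerplate phrases; a real pipeline would curate or learn these.
BOILERPLATE_PHRASES = ("Accept Cookies", "Privacy Policy")

# ASCII control characters other than tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(raw: str) -> str:
    """Normalize Unicode, drop invalid characters, and remove boilerplate lines."""
    text = unicodedata.normalize("NFC", raw)   # compose accents into single code points
    text = text.replace("\ufffd", "")          # U+FFFD marks bytes that failed to decode
    text = _CONTROL_CHARS.sub("", text)
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if any(stripped.lower() == phrase.lower() for phrase in BOILERPLATE_PHRASES):
            continue  # skip lines that are nothing but a known boilerplate phrase
        kept.append(stripped)
    return "\n".join(kept)
```

Matching whole lines against a phrase list is deliberately conservative: a sentence that merely mentions a privacy policy is kept.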

3.3 Language Detection

What it does: Filters the corpus down to supported languages using FastText-based classification.

Pattern: detect_language(doc) → keep if supported, else drop. This runs early and cheaply, before expensive quality/dedup work.
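The keep-or-drop pattern can be sketched as a generator that takes the detector as a parameter. The supported-language set, the confidence threshold, and toy_detector are hypothetical stand-ins for this example; a real pipeline would plug in a trained language-identification model in place of the toy function.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical set of supported languages; not taken from the report.
SUPPORTED_LANGUAGES = {"en", "de", "es"}


def filter_by_language(
    docs: Iterable[str],
    detect_language: Callable[[str], tuple[str, float]],
    min_confidence: float = 0.5,  # assumed threshold, for illustration only
) -> Iterator[str]:
    """Yield only documents whose detected language is supported."""
    for doc in docs:
        lang, confidence = detect_language(doc)
        if lang in SUPPORTED_LANGUAGES and confidence >= min_confidence:
            yield doc


def toy_detector(doc: str) -> tuple[str, float]:
    """Stand-in detector for the demo; returns (language code, confidence)."""
    if " the " in f" {doc.lower()} ":
        return "en", 0.9
    return "unknown", 0.0
```

Because the filter is a generator, documents stream through one at a time, which fits a stage meant to run cheaply ahead of heavier work.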

3.4 Quality Scoring

What it does: Instead of a binary keep/discard, documents are scored on features like perplexity, grammatical quality, repetition, spam signals, and information density, then routed into quality buckets:

high_quality | medium_quality | low_quality_but_recoverable | discard

Key idea to remember: The “recoverable” bucket is what feeds synthetic rephrasing (Section 3.10). This pipeline treats quality filtering as a sorting problem, not just a subtraction problem.
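Routing into the four buckets can be sketched as a small rule-based function. The three signals and every threshold below are invented for illustration; the source describes the buckets but gives no cutoffs, and a production system would more likely use trained classifiers.

```python
from dataclasses import dataclass


@dataclass
class QualitySignals:
    fluency: float       # 0..1, e.g. derived from perplexity (higher is better)
    info_density: float  # 0..1, share of the text that carries real content
    spam_score: float    # 0..1, higher means more spam-like


def assign_bucket(s: QualitySignals) -> str:
    """Map quality signals to a bucket; all thresholds are hypothetical."""
    if s.spam_score >= 0.8:
        return "discard"
    if s.fluency >= 0.8 and s.info_density >= 0.6:
        return "high_quality"
    if s.fluency >= 0.5 and s.info_density >= 0.4:
        return "medium_quality"
    if s.info_density >= 0.6:
        # Useful facts but poor writing: a candidate for rephrasing.
        return "low_quality_but_recoverable"
    return "discard"
```

The recoverable branch is what separates this from a binary filter: a document with high information density but poor fluency is routed onward rather than dropped.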

3.5 Deduplication

Two layers:

Type                   | Method                                                | Catches
Exact                  | SHA/MD5-style hashing                                 | Byte-identical duplicates
Fuzzy / near-duplicate | MinHash signatures + Locality-Sensitive Hashing (LSH) | ~90–98% similar documents

Why it matters: the same article can appear on hundreds of sites. Without dedup, a model that encounters 600 copies does not see 600 data points; it sees one data point 600 times, which encourages memorization and hurts generalization.

Metadata tracked: document_id, content_hash, minhash_signature, dedup_cluster_id, canonical_document_flag
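The two layers can be illustrated in pure Python: a SHA-256 digest for exact matches, and a small MinHash signature over word shingles for near-duplicates. This is a sketch under assumptions (3-word shingles, 64 hash functions, SHA-1 with a seed prefix as the hash family), not the pipeline's actual code. It also omits the LSH step, which would group signatures into bands so that candidate pairs are found without comparing every pair of documents.

```python
import hashlib


def content_hash(text: str) -> str:
    """Exact-duplicate key: identical bytes always produce the same digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set[str]:
    """Overlapping k-word windows; k=3 is an illustrative choice."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed: int, item: str) -> int:
    """One member of a hash family, selected by prefixing the seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents that differ in a single word get different content hashes but mostly matching signatures, which is why the fuzzy layer is needed on top of the exact one.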

3.6 Toxicity & Safety Filtering

What it does: Removes hate speech, pornography, malware, phishing pages, spam, and low-quality AI-generated content, using trained classifiers rather than keyword lists.

3.7 PII Removal

What it does: Detects and strips phone numbers, emails, SSNs, addresses, API keys, and secrets — for legal compliance and to reduce model memorization risk.
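As a rough illustration of pattern-based detection, the sketch below redacts emails, US-style Social Security numbers, and US-style phone numbers with regular expressions. The patterns and placeholder labels are assumptions chosen for the example; they are far from exhaustive, and the source does not say which detection method the pipeline uses.

```python
import re

# Illustrative patterns only. Order matters: SSNs are redacted before phone
# numbers so the looser phone pattern never sees them.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"(?<!\d)(?:\+1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)")),
]


def redact_pii(text: str) -> str:
    """Replace each detected span with a bracketed placeholder label."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text
```

Replacing spans with labels rather than deleting them keeps the surrounding sentence readable, which is one possible design choice among several.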

3.8 Document Classification

What it does: Buckets documents into categories (books, code, math, science, conversation, news, forums, legal, medical) before sampling, so the mixture design step has something to weight.

3.9 Data Balancing

What it does: Corrects for the internet’s natural skew — a large share of raw crawled data is navigation boilerplate or SEO spam even after cleaning — by rebalancing the mixture toward higher-value categories (books, code, science) rather than training on whatever is most abundant.

3.10 Synthetic Data Generation / Rephrasing

Two distinct uses — keep them separate when studying this:

(a) Recovery rephrasing: documents flagged “useful facts, poor writing” get rewritten by an LLM into clean text, re-scored, and added back to the curated corpus.

low/medium-quality doc → quality model flags as recoverable → LLM rephrase → re-score → add to corpus

(b) Capability-targeted generation: entirely new synthetic data generated (and validated) for specific target skills — algorithmic reasoning, formal logic, economics, multiple-choice QA — released separately as a specialized synthetic dataset collection.

target capability → generate problem → generate solution/explanation → validate format/answer → filter duplicates → add to bucket

🧠 Distinction to remember: (a) rescues data that already exists in weak form; (b) generates data for capabilities that may not exist in the wild at all.
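The recovery flow in (a) can be sketched as a function that takes the rephraser and the scorer as parameters. Both toy callables below are hypothetical stand-ins: in the described pipeline the rephraser is an LLM and the scorer is the quality model, neither of which is reproduced here.

```python
from typing import Callable, Iterable


def recover_documents(
    docs: Iterable[str],
    rephrase: Callable[[str], str],
    score: Callable[[str], float],
    threshold: float,
) -> list[str]:
    """Rephrase each recoverable doc, re-score it, and keep it only if it passes."""
    recovered = []
    for doc in docs:
        rewritten = rephrase(doc)
        if score(rewritten) >= threshold:  # the re-score gate
            recovered.append(rewritten)
    return recovered


def toy_rephrase(doc: str) -> str:
    """Stand-in for an LLM rewrite: collapse whitespace, capitalize the first letter."""
    text = " ".join(doc.split())
    return text[:1].upper() + text[1:]


def toy_score(doc: str) -> float:
    """Stand-in quality score: 1.0 if the text ends with a period, else 0.0."""
    return 1.0 if doc.endswith(".") else 0.0
```

The re-score step matters because a rewrite is not guaranteed to succeed; only rewritten text that clears the threshold rejoins the corpus.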

3.11 Curriculum / Mixture Design

What it does: Assigns sampling weights to every source based on domain, quality score, and training phase. Higher-quality datasets get proportionally more weight; similar-quality sources get similar weight.

Metadata schema: source_name, domain, quality_score, language, token_count, phase_1_weight, phase_2_weight, sampling_probability
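One way to read the schema is that sampling_probability is the phase weight normalized across sources. The sketch below assumes exactly that; the source names and weight values are invented for the example and do not come from the report.

```python
import random
from dataclasses import dataclass


@dataclass
class Source:
    source_name: str
    phase_1_weight: float
    phase_2_weight: float


def sampling_probabilities(sources: list[Source], phase: int) -> dict[str, float]:
    """Normalize the chosen phase's weights so they sum to 1."""
    if phase not in (1, 2):
        raise ValueError("phase must be 1 or 2")
    weights = {
        s.source_name: s.phase_1_weight if phase == 1 else s.phase_2_weight
        for s in sources
    }
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}


def sample_sources(probabilities: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw n source names according to the mixture (seeded for reproducibility)."""
    rng = random.Random(seed)
    names = list(probabilities)
    return rng.choices(names, weights=[probabilities[x] for x in names], k=n)
```

With hypothetical weights of 6/2/2 in Phase 1 and 2/2/4 in Phase 2 for web, books, and code, the same three sources yield a web-heavy mixture first and a code-heavy mixture later, without changing the underlying data.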

3.12 Tokenization

What it does: Converts cleaned, curated text into token IDs. Training never consumes raw text — only integers.

3.13 Sequence Packing

What it does: Instead of training on ragged, variable-length documents, which wastes GPU compute on padding tokens, documents are packed into dense fixed-length blocks (e.g., 8192 or 16384 tokens).
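A simple form of packing concatenates tokenized documents into one stream, separated by an end-of-sequence token, and cuts the stream into fixed-length blocks. The sketch below assumes that scheme; the eos_id and pad_id values are placeholders, and it ignores refinements such as masking attention across document boundaries.

```python
def pack_sequences(
    docs: list[list[int]],
    block_size: int,
    eos_id: int,
    pad_id: int,
) -> list[list[int]]:
    """Concatenate documents with EOS separators and cut into fixed-length blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    stream: list[int] = []
    for tokens in docs:
        stream.extend(tokens)
        stream.append(eos_id)  # mark the document boundary
    blocks = [stream[i:i + block_size] for i in range(0, len(stream), block_size)]
    # Only the final block can be short; pad it to the full length.
    if blocks and len(blocks[-1]) < block_size:
        blocks[-1] = blocks[-1] + [pad_id] * (block_size - len(blocks[-1]))
    return blocks
```

With this scheme padding appears at most once, at the very end of the stream, instead of once per document.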

3.14 Sharding

What it does: Splits the packed dataset into many binary shard files (e.g., tens of thousands) so each GPU node can read its own shard independently — this is what makes training parallelizable at this scale.
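The idea can be sketched as two small functions: one that splits packed blocks into contiguous shards, and one that assigns shards to workers so no two workers read the same file. Both the contiguous split and the round-robin assignment are assumptions for illustration; the source does not describe the actual scheme.

```python
def shard_plan(num_items: int, num_shards: int) -> list[range]:
    """Split item indices into contiguous shards whose sizes differ by at most one."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(num_items, num_shards)
    plan = []
    start = 0
    for shard_id in range(num_shards):
        size = base + (1 if shard_id < extra else 0)  # first `extra` shards get one more
        plan.append(range(start, start + size))
        start += size
    return plan


def shards_for_worker(rank: int, world_size: int, num_shards: int) -> list[int]:
    """Round-robin assignment: worker `rank` reads every world_size-th shard."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(rank, num_shards, world_size))
```

Both functions depend only on their inputs, so every worker can compute the same plan locally without coordination.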

3.15 Post-Training Data Operations

What it does: After pretraining, a second wave of data prep supports SFT, RL, preference data, reward modeling, tool use, and agentic workflows — much of it synthetic, since target behaviors (instruction-following, tool use, step-by-step reasoning) are easier to generate and verify than to source from raw text at scale.


4. Tooling Split (Important Distinction)

NVIDIA’s own documentation draws a clean boundary between two tools:

Tool               | Owns                       | Examples
NeMo Curator       | Large-scale curation       | Deduplication, quality filtering, language ID, PII removal
nemotron.data_prep | “Last mile” transformation | Tokenization, packing, chat templating, loss-mask creation

nemotron.data_prep runs on Ray for distributed processing, supports cloud-native I/O (Hugging Face, S3, GCS, local paths), and provides deterministic shard plans, resumable pipelines, and checksum verification.
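To make checksum verification concrete, the sketch below builds a manifest of SHA-256 digests for a set of shard files and later reports any file that is missing or has changed. It uses only the Python standard library and is not the nemotron.data_prep API; the function names and the manifest layout are assumptions for this example.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large shards never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(shard_paths: list[Path]) -> dict[str, str]:
    """Map each shard path to its digest, in sorted order for determinism."""
    return {str(p): file_sha256(p) for p in sorted(shard_paths)}


def verify_manifest(manifest: dict[str, str]) -> list[str]:
    """Return the paths whose file is missing or whose checksum differs."""
    failed = []
    for path, expected in manifest.items():
        p = Path(path)
        if not p.is_file() or file_sha256(p) != expected:
            failed.append(path)
    return failed
```

A resumable pipeline could consult such a manifest to skip shards that already verify and rebuild only the ones that fail.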

🧠 Why this split matters: curation decisions (what stays in the corpus) and training-format decisions (how it’s packaged for the GPU) are different concerns with different failure modes — mixing them makes both harder to debug.


5. End-to-End Pipeline Diagram

Raw Internet Data
        |
        v
Data Collection (web, books, code, math, academic, synthetic)
        |
        v
HTML/PDF/Text Extraction  (jusText)
        |
        v
Cleaning & Normalization
        |
        v
Language Detection  (FastText)
        |
        v
Quality Scoring  (perplexity + ensemble classifiers)
        |
        v
Synthetic Rephrasing  (recover "useful but poorly written" docs)
        |
        v
Deduplication  (exact hash + MinHash/LSH)
        |
        v
Toxicity & PII Filtering
        |
        v
Document Classification & Balancing
        |
        v
Curriculum & Mixture Weighting  (Phase 1 diversity / Phase 2 quality)
        |
        v
Tokenization
        |
        v
Sequence Packing
        |
        v
Sharding
        |
        v
25 Trillion Training Tokens
        |
        v
Distributed GPU Training

6. Mapped to General Data Engineering Terms

ETL Concept          | Nemotron Pipeline Equivalent
Ingestion            | Crawl dumps, HF datasets, PDFs, code repos
Parsing              | HTML/PDF/Markdown/code extraction
Standardization      | UTF-8, Unicode normalization, schema cleanup
Filtering            | Language ID, spam, safety, PII
Deduplication        | Exact hash + fuzzy MinHash/LSH
Scoring              | Quality classifiers, perplexity, domain score
Enrichment           | Domain labels, quality buckets, metadata
Synthetic generation | Rephrasing, QA generation, reasoning data
Sampling             | Weighted blend by source/domain/quality/phase
Tokenization         | Text → token IDs
Packing              | Fixed-length sequence construction
Sharding             | Distributed training files
Validation           | Checksums, manifests, token counts
Observability        | Data lineage, quality dashboards, run metadata

7. How This Differs from an Operational ML Data Pipeline

Worth contrasting explicitly if you’ve studied feature-store-style architectures (batch/streaming/merged views):

Dimension            | Operational ML pipeline (e.g., feature store) | LLM pretraining pipeline (Nemotron)
Freshness            | Often matters (streaming lane exists)         | Largely irrelevant; corpus is built once per training run
Unit of data         | Rows/events keyed by entity + timestamp       | Documents/tokens, no entity key
“Serving”            | Online store for real-time inference          | GPU training job reading shards
Correctness contract | Point-in-time correctness (event time)        | Quality-band correctness (is this text worth training on)
Synthetic data role  | Fills coverage/class-imbalance gaps           | Also rescues weak documents via rephrasing, and targets capability gaps directly
Versioning unit      | Feature definition + merged view              | Corpus snapshot + mixture weights + tokenizer version

🧠 Key insight: the stages look similar (collect, clean, validate, enrich, version) but the contract each stage has to satisfy is different, because the consumer is a training run instead of a live model endpoint.


8. Glossary

Term                               | Definition
jusText                            | A tool for extracting the main body text from HTML pages, stripping boilerplate
FastText                           | A lightweight library used here for high-throughput language identification
Perplexity                         | A measure of how well a language model predicts a piece of text; used here as a quality signal
MinHash                            | A technique for cheaply estimating similarity between large sets (documents), used for fuzzy dedup
Locality-Sensitive Hashing (LSH)   | A method for finding near-duplicate items efficiently at scale using MinHash-style signatures
Quality bucket                     | A tier (high/medium/low-recoverable/discard) assigned to a document instead of a binary keep/drop decision
Recovery rephrasing                | Using an LLM to rewrite a low-quality-but-useful document into clean training text
Capability-targeted synthetic data | Synthetic examples generated specifically to teach a skill (e.g., algorithmic reasoning) rather than to patch coverage
Curriculum (training)              | A deliberate ordering/weighting of data across training phases, rather than one uniform shuffle
Sequence packing                   | Combining variable-length documents into fixed-length token blocks to maximize GPU utilization
Sharding                           | Splitting a dataset into many independent files so distributed workers can read in parallel
bin/idx format                     | A binary training-data format (paired with an index file) used by Megatron-style training frameworks

9. Self-Check Questions

  1. Why does the report call data preparation the biggest engineering effort — bigger than architecture design?
  2. What are the two phases of the training curriculum, and what does each optimize for?
  3. Name the two layers of deduplication and the technique each one uses.
  4. What’s the difference between “recovery rephrasing” and “capability-targeted synthetic data generation”?
  5. Why does quality scoring produce buckets instead of a single keep/discard decision?
  6. What’s the division of responsibility between NeMo Curator and nemotron.data_prep?
  7. Why does sequence packing matter for training efficiency?
  8. In the general-ETL mapping table, what plays the role of “enrichment”?
  9. Contrast this pipeline’s correctness contract with a feature store’s point-in-time correctness — what does each one guarantee?

One-line takeaway

The 25-trillion-token corpus behind Nemotron 3 Super wasn’t found — it was built: collected from a dozen source types, cleaned, scored into quality bands, selectively rewritten rather than merely filtered, deduplicated at two levels, rebalanced against the internet’s natural skew, and packed into a two-phase curriculum designed to teach breadth first and precision second.