# 02 — Data
NemoSlides trains on 705 chat-JSONL rows (with a further 30-row test split), each carrying a <think> reasoning trace plus Slidev markdown in the assistant turn, published as trillionlabs/slides-sft-v0. Seeds are generated with NeMo Data Designer; per-seed authoring is done by Codex CLI with a substance-gated validator pipeline.
## Dataset spec
| Property | Value |
|---|---|
| Published at | trillionlabs/slides-sft-v0 |
| Splits | 705 train · 30 test |
| Format | Chat JSONL (messages array with system / user / assistant) |
| Reasoning | Assistant content contains <think>{trace}</think>\n\n{deck}; reasoning_content also exposed as a separate field on the assistant turn |
| System prompt | nemoslides.pipeline.slidev_reference.TASK_INSTRUCTIONS + Slidev cheatsheet (identical at synthesis, training, and inference) |
| Image references | image-query: "<natural language>" placeholders; URLs resolved at render time |
| Themes | default, seriph, apple-basic |
| License | Research use only |
Each row:

```json
{"messages": [
  {"role": "system", "content": "<Slidev expert system prompt + knowledge pack>"},
  {"role": "user", "content": "Create a Slidev deck: <topic · audience · tone · constraints>"},
  {"role": "assistant", "content": "<think>\n<reasoning>\n</think>\n\n---\ntheme: seriph\nlayout: cover\n---\n# ..."}
]}
```
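Consumers can split the assistant turn back into its two parts. A minimal sketch, assuming the <think>{trace}</think>\n\n{deck} layout above (split_assistant is a hypothetical helper, not part of the pipeline):

```python
import json
import re

# Trace is everything inside <think>...</think>; the deck follows a blank line.
THINK_RE = re.compile(r"^<think>\n(?P<trace>.*?)\n</think>\n\n(?P<deck>.*)$", re.DOTALL)

def split_assistant(content: str) -> tuple[str, str]:
    """Split an assistant turn into (reasoning trace, Slidev deck)."""
    m = THINK_RE.match(content)
    if m is None:
        raise ValueError("assistant turn missing <think> block")
    return m.group("trace"), m.group("deck")

# A toy row in the dataset's chat-JSONL shape.
row = json.loads(
    '{"messages": [{"role": "assistant", "content": '
    '"<think>\\nPlan the deck.\\n</think>\\n\\n---\\ntheme: seriph\\n---\\n# Title"}]}'
)
trace, deck = split_assistant(row["messages"][-1]["content"])
```

The same split is what training frameworks that consume the separate reasoning_content field get for free.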
## Synthesis pipeline
```mermaid
flowchart LR
DD["NeMo Data<br/>Designer"] --> S["data/seeds.json<br/>(theme, domain, audience,<br/>style, features, outline)"]
S --> I["codex_pipeline init<br/>→ work_YYYYMMDD/<br/> seed_NNNN/"]
I --> WF["per-seed workspace<br/>PROMPT.md · think.md · deck.md<br/>(stubs)"]
WF --> CX["Codex CLI<br/>full-auto · parallel · tmux"]
CX --> F["per-seed artifacts<br/>(final)"]
F --> V["codex_pipeline pack<br/>+ validators"]
V --> JS["chat JSONL"]
JS --> HF["push_hf_dataset<br/>→ trillionlabs/slides-sft-v0"]
```
## Seed generation — NeMo Data Designer
Seeds are (theme, domain, audience, style, features, outline?) tuples sampled from a categorical spine. Implementation in nemoslides.pipeline.seeds_dd:
- Categorical spine — theme ∈ {default, seriph, apple-basic}, domain spans pitch decks / tech talks / product launches / internal reviews / conferences / educational content, audience and tone enumerate ~10 variants each, feature set covers the Slidev capability surface (layouts, shiki, Mermaid, KaTeX, v-click, transitions, presenter notes).
- LLM spine — each seed's free-text fields (title, abstract, narrative beats) are generated by z-ai/glm-5.1 via OpenRouter.
- Output — seeds are persisted to data/seeds.json (canonical set) and data/seeds.d/batch_NNNN.json (per-batch archives).
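The categorical spine amounts to sampling from a product of small enumerations. A hedged sketch of one seed record — the theme, domain, and feature values are from the section above, but the exact field names and sampling logic of nemoslides.pipeline.seeds_dd are assumptions:

```python
import random

THEMES = ["default", "seriph", "apple-basic"]
DOMAINS = ["pitch deck", "tech talk", "product launch",
           "internal review", "conference", "educational content"]
FEATURES = ["layouts", "shiki", "mermaid", "katex",
            "v-click", "transitions", "presenter-notes"]

def sample_seed(rng: random.Random) -> dict:
    """Sample one categorical seed; free-text fields are left for the LLM spine."""
    return {
        "theme": rng.choice(THEMES),
        "domain": rng.choice(DOMAINS),
        "features": sorted(rng.sample(FEATURES, k=rng.randint(2, 4))),
        "title": None,    # filled by z-ai/glm-5.1 via OpenRouter
        "outline": None,  # optional, also LLM-generated
    }

seed = sample_seed(random.Random(0))
```

The split matters: categorical fields give controllable coverage, while the LLM spine supplies surface diversity the enumeration cannot.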
## Per-seed authoring — Codex CLI
nemoslides.cli.codex_pipeline init materializes one folder per seed under work_YYYYMMDD/seed_NNNN/, each containing:
- seed.json — the structured seed record
- INSTRUCTIONS.md — compiled contract: workflow, quality rubric, Slidev cheatsheet (see codex_templates/)
- HERO_EXAMPLE.md — a gold-example deck from assets/reference/gold_examples/
- Stub PROMPT.md, think.md, deck.md with Codex: marker headers
scripts/run_codex_batch.sh invokes Codex CLI in parallel (tmux-driven, 6-way default concurrency). Per seed, Codex:
- Writes the user-side prompt (PROMPT.md) — a realistic request with topic, audience, and any style constraints. This is the string that will be passed to the finetuned model at inference.
- Writes the reasoning trace (think.md) — structured prose under fixed headings (Reading the user prompt, Theme fit, Narrative arc, Key slide mapping, Image & feature choices, Self-review), 350–900 words. The self-review section is contractually mandatory before deck.md is written.
- Writes the final deck (deck.md) — renderable Slidev markdown that executes the plan in think.md.
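The fixed-heading and word-budget contract on think.md can be checked mechanically. A sketch under those two rules — the real pipeline's enforcement may differ, and both function names are hypothetical:

```python
import re

REQUIRED_HEADINGS = [
    "Reading the user prompt", "Theme fit", "Narrative arc",
    "Key slide mapping", "Image & feature choices", "Self-review",
]

def check_trace(text: str) -> list[str]:
    """Return the required headings missing from a think.md draft."""
    return [h for h in REQUIRED_HEADINGS if h not in text]

def word_count_ok(text: str) -> bool:
    """Enforce the 350-900 word budget on the reasoning trace."""
    n = len(re.findall(r"\S+", text))
    return 350 <= n <= 900

draft = "## Reading the user prompt\n...\n## Self-review\n..."
missing = check_trace(draft)  # every heading except the two present above
```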
The reasoning trace is not a demo prop. It is the primary training signal: the finetuned model inherits a habit of reading the prompt, planning theme and narrative, mapping content to layouts, and self-reviewing before producing slides. Per-seed Codex authoring (vs. single-shot LLM synthesis) is chosen specifically because one-shot generation over-compresses both artifacts — the reasoning goes sketchy, the deck drops its tail slides — whereas per-seed authoring gives file access, iteration, and a mandatory self-review gate at the cost of synthesis wall-clock.
## Validation — substance + syntactic gates
nemoslides.cli.codex_pipeline pack scans each seed folder and drops non-conformant rows before publishing. Gates:
| Gate | Condition |
|---|---|
| Stub detection | Files beginning with <!--\nCodex: markers are treated as unwritten. |
| Minimum bytes | PROMPT ≥ 40 bytes, think ≥ 400, deck ≥ 300 (enforced pre-packing). |
| Theme whitelist | theme: must be one of default, seriph, apple-basic. |
| Layout whitelist | Every layout: must be in the Slidev built-in set. |
| Frontmatter hygiene | No blank lines between --- and layout:; no double frontmatter; no trailing --- past the final slide. |
| Prompt sanitation | Rejects PROMPTs containing internal-pipeline terminology (think.md, chain-of-thought, SFT, training data). |
| Image-URL ban | image: lines with raw https:// URLs are rejected — all image references must use the image-query: placeholder. |
| Mermaid component guard | Rejects decks using the stale <Mermaid chart={...}> Vue-component form that Slidev no longer supports cleanly. |
Folders failing any gate are preserved in the workspace but excluded from packing. At the most recent pack, 651 of the Codex-authored folders passed all gates. The final published dataset ships 705 train + 30 test rows after deduplication and a quality pass.
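Several of the gates reduce to small predicates. A sketch that follows the table's conditions — function names are hypothetical, and the real codex_pipeline pack logic may differ:

```python
import re

ALLOWED_THEMES = {"default", "seriph", "apple-basic"}
BANNED_PROMPT_TERMS = ("think.md", "chain-of-thought", "SFT", "training data")
MIN_PROMPT_BYTES = 40

def is_stub(text: str) -> bool:
    """Files still carrying the Codex: marker header were never authored."""
    return text.startswith("<!--\nCodex:")

def theme_ok(deck: str) -> bool:
    """Theme whitelist: frontmatter theme must be one of the three."""
    m = re.search(r"^theme:\s*(\S+)", deck, re.MULTILINE)
    return m is not None and m.group(1) in ALLOWED_THEMES

def prompt_ok(prompt: str) -> bool:
    """Minimum bytes plus the internal-terminology sanitation gate."""
    return (len(prompt.encode()) >= MIN_PROMPT_BYTES
            and not any(t in prompt for t in BANNED_PROMPT_TERMS))

def image_lines_ok(deck: str) -> bool:
    """Image-URL ban: every image must go through image-query:."""
    return not re.search(r"^image:\s*https://", deck, re.MULTILINE)
```

Keeping each gate a pure predicate over file contents is what lets failing folders be excluded without being deleted.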
## Slidev feature coverage
The dataset spans the full Slidev capability surface rather than a whitelisted subset. A narrower surface would give tighter loss curves, but the visible capability gap from the base model lives in the advanced features (shiki line-highlighting, Mermaid, KaTeX, v-click, non-default themes); a subset-trained model would improve default-layout decks without moving the VisCraft dimension on the eval. Coverage derives from two mechanisms.
Diverse seeds. The categorical spine forces breadth: a tech-talk seed requires code blocks and Mermaid; a pitch-deck seed requires fact layouts and v-click reveals; a product-launch seed requires image-right with strong visuals. Feature requirements are encoded in the features field of each seed and surfaced in INSTRUCTIONS.md so that Codex's authoring decisions stay aligned with the categorical label.
Injected knowledge pack. nemoslides.pipeline.slidev_reference compiles a ~45KB / ~11K-token reference from the vendored Slidev docs at assets/reference/slidev_docs/ (sparse-cloned from slidevjs/slidev). The pack covers syntax, named layouts, animations (v-click, v-motion), Mermaid, KaTeX, shiki line-highlighting, icons, components, and a curated themes catalog.
The same pack is prepended to the training-time system prompt. At inference the finetuned model receives the same prompt and expects the same idioms it was trained against — training and inference stay in distributional lockstep.
## Image-query placeholders
The model never emits an image URL. Instead it emits a natural-language image-query: "<natural language>" placeholder, resolved at render time.
nemoslides.pipeline.image_resolver runs before assets/renderer/render.sh invokes Slidev. For each image-query: line, it calls nemoslides.pipeline.tools.image_search.unsplash_search(query) and rewrites the line as image: <resolved URL>. On API failure or missing key, it falls back to a curated data/image_bank.json of ~40 Unsplash IDs tagged by theme.
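The rewrite step can be sketched as a line filter. Here unsplash_search is stood in for by a plain callable, and the URLs are placeholders; the real resolver lives in nemoslides.pipeline.image_resolver:

```python
import re

# An image-query: line, possibly indented inside frontmatter.
PLACEHOLDER = re.compile(r'^(?P<indent>\s*)image-query:\s*"(?P<query>[^"]+)"\s*$')

def resolve_images(deck: str, search, fallback_url: str) -> str:
    """Rewrite each image-query: line to an image: line with a resolved URL."""
    out = []
    for line in deck.splitlines():
        m = PLACEHOLDER.match(line)
        if m is None:
            out.append(line)
            continue
        try:
            url = search(m.group("query"))  # e.g. unsplash_search(query)
        except Exception:
            url = fallback_url             # curated image-bank fallback
        out.append(f'{m.group("indent")}image: {url}')
    return "\n".join(out)

deck = '---\nlayout: image-right\nimage-query: "sunrise over a wind farm"\n---'
resolved = resolve_images(deck, lambda q: "https://images.example/abc",
                          "https://images.example/fallback")
```

Because the rewrite is purely line-local, swapping the search backend is a one-argument change.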
The placeholder decouples the model from the asset backend in two ways:
- No hallucination pathway. The model has nothing to hallucinate — there is no URL in its output to begin with. The worst-case emission is a poorly phrased query, not a 404.
- Backend swap without retraining. Pexels, an internal CDN, or a static bundled bank all plug into the same rewriter. The trained checkpoint is unchanged.