# Reproduce
The full experimental protocol: environment setup, baseline evaluation, corpus synthesis, training, finetuned evaluation. Steps 1 and 6 share byte-identical evaluation configuration — the Δ they produce is the project's headline claim.
## 0 · Environment

Python dependencies are managed by `uv`; Node is required for the Slidev renderer.
```shell
# Python + project (hatchling-built, installed editable)
uv sync                    # runtime
uv sync --group dev        # + pytest / ruff
uv sync --only-group docs  # + mkdocs (docs-site only)

# API keys
cp .env.example .env
# fill: OPENROUTER_API_KEY  (required — base + judge + reference models)
#       UNSPLASH_ACCESS_KEY (required — image resolver)
#       GEMINI_API_KEY      (optional — direct Google API for judge)
#       OPENAI_API_KEY      (optional — legacy qualitative_check only)

# Slidev renderer (Node)
cd assets/renderer && npm install && cd ../..
```
Verify the setup: `uv run pytest` should pass, and `./assets/renderer/render.sh assets/reference/gold_examples/hero_tech_talk.md /tmp/render-test` should produce a directory of slide PNGs.
## 1 · Baseline SlidevBench
The baseline is the reference point every SFT claim is compared against. It runs before any training begins.
```shell
# One reference model at a time; resumable per-seed via score.json gates.
uv run python -m nemoslides.eval.run --model nemotron-nano --concurrency 15

# All four reference points.
for m in nemotron-nano nemotron-super glm-5.1 gpt-5.4; do
  uv run python -m nemoslides.eval.run --model $m
done

# Aggregate + plot.
uv run python -m nemoslides.eval.compare
uv run python -m nemoslides.eval.plot
```
Outputs land at `results/eval/runs/<model>/<seed>/` (`deck.md`, `reasoning.md`, `gen.md`, `slides/*.png`, `score.json`), aggregates at `results/eval/comparison.json` and `comparison_table.md`, and plots at `results/eval/plots/`.
**Resumability.** A seed folder with a valid `score.json` is skipped. Delete that file to re-judge an existing render; delete the whole seed folder to regenerate, re-render, and re-judge.
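The resume gate reduces to a small predicate over the layout above (`results/eval/runs/<model>/<seed>/score.json`). A minimal sketch — `seed_is_done` is a hypothetical name, not the project's actual function:

```python
import json
from pathlib import Path

def seed_is_done(seed_dir: Path) -> bool:
    """True when the seed folder holds a parseable score.json.

    Hypothetical sketch of the per-seed resume gate: a valid score.json
    means generate + render + judge all finished, so the seed is skipped.
    """
    score = seed_dir / "score.json"
    if not score.is_file():
        return False
    try:
        json.loads(score.read_text())
    except (json.JSONDecodeError, OSError):
        return False
    return True
```

Under this gate, deleting `score.json` re-judges an existing render on the next run, while deleting the whole seed folder regenerates everything.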
## 2 · Synthesize the training corpus
```shell
# Generate seeds with NeMo Data Designer.
uv run python -m nemoslides.pipeline.seeds_dd --out data/seeds.json

# Materialize per-seed Codex workspaces.
WORK=work-$(date +%Y%m%d)
uv run python -m nemoslides.cli.codex_pipeline init \
    --seeds data/seeds.json --out "$WORK"

# Run Codex in parallel across all seeds (tmux-driven, 6-way default).
./scripts/run_codex_batch.sh "$WORK"

# Monitor progress.
./scripts/watch_codex_status.sh "$WORK"
uv run python -m nemoslides.cli.codex_pipeline status --work "$WORK"

# Validate + pack completed folders.
uv run python -m nemoslides.cli.codex_pipeline pack \
    --work "$WORK" --out data/raw/codex
```
Validators (prompt substance, think substance, deck syntax, image-URL ban, frontmatter hygiene) run during pack and drop non-conformant folders. Scan the output for the kept/dropped counts.
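One of the pack-time validators, the image-URL ban, can be sketched as a regex check. The pattern and function name below are illustrative, not the project's actual implementation:

```python
import re

# Decks must reference images through the resolver, never by raw http(s)
# URL. Illustrative pattern: flags markdown images and <img> sources that
# point at a remote URL.
_RAW_IMAGE_URL = re.compile(
    r"(!\[[^\]]*\]\(\s*https?://|<img[^>]+src=[\"']https?://)",
    re.IGNORECASE,
)

def violates_image_url_ban(deck_markdown: str) -> bool:
    """True when the deck embeds an image by raw http(s) URL."""
    return bool(_RAW_IMAGE_URL.search(deck_markdown))
```

A folder that trips any validator is dropped from the pack rather than repaired.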
## 3 · Publish to Hugging Face Hub
```shell
# Pack kept folders into chat-JSONL and push.
uv run python -m nemoslides.cli.push_hf_dataset --work "$WORK" --push
```
Target: `trillionlabs/slides-sft-v0`. The dataset ships `messages[0..2]` with `reasoning_content` exposed as a separate field on the assistant turn.
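A single record might look like the following. The three-turn shape and the `reasoning_content` field come from the description above; the content strings are invented placeholders:

```python
import json

record = {
    "messages": [
        {"role": "system", "content": "You write Slidev presentations."},
        {"role": "user", "content": "A 10-slide tech talk on KV-cache reuse."},
        {
            "role": "assistant",
            "content": "---\ntheme: default\n---\n\n# KV-Cache Reuse\n",
            # Reasoning kept out of `content`, as a sibling field:
            "reasoning_content": "Outline first, then one idea per slide...",
        },
    ]
}

jsonl_line = json.dumps(record)  # one record per line in the pushed file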
## 4 · SFT with NeMo-RL
Requires a 2n8g layout (2 nodes × 8 GPUs) per the published LoRA+FSDP2 recipe.
The launch script invokes NeMo-RL's `examples/run_sft.py` with two overrides against the published recipe:

- `policy.model_name = nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16`
- the data path, pointed at the packed chat-JSONL from step 3
Everything else — LoRA rank, sequence length, FSDP2 sharding, AdamW, LR schedule — stays as NVIDIA publishes. See 03 · Training for detail.
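Concretely, the launch looks roughly like this. The recipe path and the data-override key are illustrative placeholders; only the two overrides named above are from this protocol:

```shell
# Illustrative sketch — substitute the published LoRA+FSDP2 recipe file
# and NeMo-RL's actual data-path key.
uv run python examples/run_sft.py \
    --config <published-lora-fsdp2-recipe>.yaml \
    policy.model_name=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    data.train_data_path=data/raw/codex/train.jsonl
```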
## 5 · Serve the finetuned checkpoint
Serve the base Nemotron checkpoint with vLLM and attach the LoRA adapter. The endpoint is registered in `nemoslides.pipeline.clients` under the model alias passed to `--model` in step 6 below. If the server runs on a remote node, tunnel the port to your workstation.
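A minimal sketch of the serving step, assuming a LoRA adapter directory at `checkpoints/slides-sft-lora` (hypothetical path) and the alias `nano-local` used in step 6:

```shell
# On the GPU node: serve the base checkpoint with the LoRA adapter attached.
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --enable-lora \
    --lora-modules nano-local=checkpoints/slides-sft-lora \
    --port 8000

# On your workstation: forward the port if the node is remote.
ssh -N -L 8000:localhost:8000 <user>@<gpu-node>
```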
## 6 · Finetuned SlidevBench — identical protocol
Same 30 prompts, same judge, same rubric, same render pipeline, same aggregation formula as step 1.
```shell
uv run python -m nemoslides.eval.run --model nano-local --concurrency 15
uv run python -m nemoslides.eval.compare
uv run python -m nemoslides.eval.plot
```
The Δ between the step-1 and step-6 Overall scores is the project's headline claim. Any deviation from protocol identity invalidates the comparison — do not change the judge, rubric, render pipeline, or aggregation weights between the two runs.
## 7 · Human blindtest (optional second fold)
```shell
# Build a balanced pair queue from the render artifacts of steps 1 and 6.
uv run python -m nemoslides.blindtest.build_pairs

# Start the voting UI.
uv run python -m nemoslides.blindtest.app
# → http://localhost:5000
```
Votes persist to `results/blindtest/votes.db`. Results feed back into 05 · Results.
## Demo
Optional — the prompt-to-deck web UI.
Requires `DEMO_OPENAI_API_KEY` in `.env` (or falls back to `OPENAI_API_KEY` with model `gpt-5.4` for showcase). Set `DEMO_OPENAI_BASE_URL` to point at the NemoSlides vLLM endpoint for end-to-end local inference.