The following is the plan for this hackathon.
Kaggle NVIDIA Nemotron Reasoning Challenge — Plan
Goal
Produce a LoRA adapter (rank ≤ 32) on Nemotron-3-Nano-30B that maximizes accuracy on Alice's Wonderland reasoning puzzles.
Setup
- Base model: Nemotron-3-Nano-30B
- Hardware: 1× H100 80GB
- Method: QLoRA (4-bit base + LoRA rank 32 on all linear layers). If the H100's VRAM allows, we can also try plain LoRA (unquantized base) instead of QLoRA.
- Framework: TRL v1.0 (supports SFT, DAPO-style GRPO out of the box)
- Output format: Model must place final answer inside \boxed{...}
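Whether plain LoRA fits on the H100 comes down to the size of the frozen base weights. The back-of-the-envelope estimate below is a rough sketch only: it assumes about 30B total parameters (read off the model name) and ignores quantization overhead, adapter weights, optimizer state, activations, and the KV cache needed for rollouts.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the frozen base weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 30e9  # assumption: ~30B total parameters, taken from the model name

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # plain LoRA keeps the base in bf16
four_bit_gb = weight_memory_gb(N_PARAMS, 4)  # QLoRA stores the base in 4-bit

print(f"bf16 base weights:  ~{bf16_gb:.0f} GB")
print(f"4-bit base weights: ~{four_bit_gb:.0f} GB")
print(f"headroom on 80 GB with a bf16 base: ~{80 - bf16_gb:.0f} GB")
```

Under these assumptions a bf16 base leaves roughly 20 GB for everything else, which is likely tight at 4096–8192 token sequences, so QLoRA is the safer default and plain LoRA is worth trying only if profiling shows room.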
Task Categories (6 types in train.csv)
- Bit manipulation (8-bit binary transforms)
- Text encryption (ciphers)
- Numeral system conversion (Roman numerals, etc.)
- Unit conversion (secret conversion factor)
- Modified gravitational constant (physics)
- Equation transformation rules
Phase 1 — Data Foundation (cornerstone of the whole pipeline)
Step 1a: Analyze train.csv
- Count rows per category
- Record difficulty patterns and answer format quirks per category
- Hold out train.csv entirely as validation set (closest proxy to hidden test distribution)
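The per-category count can be sketched with the standard library alone. The column name `category` is hypothetical: if train.csv does not label categories explicitly, they would first have to be inferred from the prompt text.

```python
import csv
from collections import Counter

def category_counts(csv_file, category_col: str = "category") -> Counter:
    """Count rows per category from an open CSV file object.

    `category_col` is a hypothetical column name, not confirmed for train.csv.
    """
    return Counter(row[category_col] for row in csv.DictReader(csv_file))

def category_shares(counts: Counter) -> dict[str, float]:
    """Convert raw counts to fractions that sum to 1."""
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Usage (path is an assumption):
# with open("train.csv", newline="", encoding="utf-8") as f:
#     shares = category_shares(category_counts(f))
```

The resulting shares feed directly into the synthetic-data mix in Step 1b.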
Step 1b: Build 6 Python puzzle generators
- One generator per category, with controllable difficulty knobs
- Each generator must output (prompt, ground_truth_answer) pairs
- Generate ~100K synthetic prompts total
- Match train.csv's category distribution
- Apply floor of ~5K per category (ensure coverage even for rare categories)
- Mix easy/medium/hard within each category; this mix should also follow the difficulty distribution observed in train.csv
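As an illustration of the generator contract and the per-category floor, here is a minimal sketch for one category (Roman numerals). The prompt wording, the difficulty ranges, and the function names are all placeholders, not the competition's actual format.

```python
import random

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
         (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n: int) -> str:
    """Standard Roman numeral encoding for 1 <= n <= 3999."""
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)

# Hypothetical difficulty knob: harder levels draw from a larger range.
DIFFICULTY_MAX = {"easy": 50, "medium": 500, "hard": 3999}

def roman_puzzle(difficulty: str, rng: random.Random) -> tuple[str, str]:
    """Return one (prompt, ground_truth_answer) pair."""
    n = rng.randint(1, DIFFICULTY_MAX[difficulty])
    prompt = f"Convert {n} to Roman numerals. Put the final answer in \\boxed{{}}."
    return prompt, to_roman(n)

def allocate(shares: dict[str, float], total: int, floor: int) -> dict[str, int]:
    """Per-category prompt counts: proportional to shares, never below floor."""
    return {cat: max(floor, round(total * share)) for cat, share in shares.items()}
```

Note that applying the floor can push the grand total slightly above the target (for example, a category with a 2% share is raised from 2K to 5K), which is acceptable for a ~100K budget.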
Step 1c: Distill reasoning traces via API (only paid step in the whole pipeline)
- Use SOTA model (Claude / GPT-4 / etc.) to generate reasoning traces for each prompt
- Format: reasoning steps + \boxed{answer}
- Budget estimate: ~$200–500
Step 1d: Filter by correctness
- Extract \boxed{} from each generated trace
- Keep only rows where extracted answer matches ground truth (exact string or numerical tolerance)
- Expected yield: 70–90% → **80K clean SFT rows**
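The correctness filter can be sketched as follows. It takes the last \boxed{...} in a trace (handling nested braces) and compares it with the ground truth, first as an exact string and then numerically. The relative tolerance of 1e-4 is an assumption; the competition's actual tolerance should be substituted if it is documented.

```python
import math
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close brace found
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer

def is_correct(predicted: Optional[str], truth: str, rel_tol: float = 1e-4) -> bool:
    """Exact string match first, then numeric comparison within rel_tol (assumed)."""
    if predicted is None:
        return False
    if predicted.strip() == truth.strip():
        return True
    try:
        return math.isclose(float(predicted), float(truth), rel_tol=rel_tol)
    except ValueError:  # at least one side is not a number
        return False
```

The same check, returning 1.0 or 0.0 instead of a boolean, can serve as the binary reward during RL.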
Step 1e: Train/val split
- Validation: use train.csv (9,500 rows) — NOT in training data
- Training: ~80K filtered synthetic (prompt, trace, answer) rows
Phase 2 — N × (SFT + DAPO) with checkpoint tracking
SFT
For SFT, use the standard setup with cross-entropy loss.
Reward function (used by DAPO every round):
Extract \boxed{} from rollout → 1 if matches ground truth (exact or numerical tolerance), 0 otherwise.
DAPO config (important — differs from vanilla GRPO):
- Decoupled clipping (Clip-Higher): epsilon_low < epsilon_high
- No KL penalty (β=0)
- Token-level policy gradient loss
- Dynamic sampling (filter degenerate batches)
- Group size: 8–16 (or 2–4 if compute-tight)
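The DAPO ingredients listed above can be sketched in plain Python for a single batch. This is an illustration of the math, not TRL's implementation; the clipping values 0.2 and 0.28 are the ones reported in the DAPO paper and should be treated as a starting point, and the use of the population standard deviation is a choice made here.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-normalized advantage per rollout: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def has_signal(rewards: list[float]) -> bool:
    """Dynamic sampling: a group with identical rewards has zero advantage."""
    return len(set(rewards)) > 1

def clipped_term(ratio: float, advantage: float,
                 eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Clip-Higher surrogate for one token (asymmetric clipping range)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def token_level_loss(samples: list[tuple[list[float], float]],
                     eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Average over all tokens in the batch, not per sample; no KL term.

    Each sample is (per-token probability ratios, advantage of that rollout).
    """
    total_tokens = sum(len(ratios) for ratios, _ in samples)
    total = sum(clipped_term(r, adv, eps_low, eps_high)
                for ratios, adv in samples for r in ratios)
    return -total / total_tokens
```

Because the loss divides by the total token count, long rollouts contribute proportionally more gradient than they would under per-sample averaging.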
Round 1
Round 1 SFT
- Train on Phase 1 dataset (~80K rows)
- Evaluate on train.csv validation → save score + LoRA checkpoint
Round 1 DAPO
- Prompt pool: ~30K–50K prompts
- ~40% overlap with SFT prompts (tests generalization); open question: what difficulty distribution should these prompts follow?
- ~60% fresh synthetic prompts; open question: what difficulty distribution should these prompts follow?
- Only prompts + ground truth needed — rollouts generated live
- Evaluate on train.csv → save score + LoRA checkpoint
Round 2+
Self-distillation (replaces API distillation from Phase 1):
- Use previous round's DAPO model to generate traces locally
- No more API cost — everything runs on H100
Round N SFT
- Prompt pool for self-distillation: ~150K prompts
- ~70K: reuse previous round's SFT prompts (regenerate with stronger model); open question: what difficulty distribution should these prompts follow?
- ~80K: fresh synthetic prompts for diversity; open question: what difficulty distribution should these prompts follow?
- Generate 4–8 candidate traces per prompt (temperature ~0.7)
- Filter: keep traces whose \boxed{} matches ground truth
- Expected yield: 85–95% (model is much stronger now); open question: how many problems should we target to get here?
- Keep 1–2 best traces per prompt (shortest correct, or diverse)
- Mix in ~25K rows from previous round's SFT data (prevents drift / mode collapse); open question: what difficulty distribution should these rows follow?
- Final training set: ~175K rows
- Evaluate on train.csv → save score + LoRA checkpoint
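The "keep 1–2 best traces per prompt" step can be sketched as below, using the shortest-correct rule. The function name and the input shape (trace text paired with a correctness flag) are assumptions made for illustration; selecting for diversity instead would need a different criterion.

```python
def select_traces(candidates: list[tuple[str, bool]], keep: int = 2) -> list[str]:
    """From (trace, is_correct) pairs for one prompt, keep the shortest correct traces.

    Exact duplicate traces are collapsed before ranking.
    """
    correct = {trace for trace, ok in candidates if ok}
    # Sort by length, then alphabetically so the result is deterministic.
    return sorted(correct, key=lambda t: (len(t), t))[:keep]
```

A prompt with no correct candidate returns an empty list and simply drops out of the round's training set.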
Round N DAPO
- Prompt pool: ~30K–50K, skewed harder; open question: what difficulty distribution should these prompts follow?
- Goldilocks filter: keep prompts where current SFT model succeeds 20–80% of the time
- Drop prompts the model aces (no signal)
- Drop prompts the model always fails (no signal)
- DAPO's Dynamic Sampling helps here but manual curation is better
- As rounds progress, difficulty floor rises
- Evaluate → save score + LoRA checkpoint
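The Goldilocks filter reduces to estimating a per-prompt success rate from a handful of rollouts and keeping the middle band. A minimal sketch, assuming rollout outcomes have already been scored as booleans and keyed by a prompt id (both names are hypothetical):

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of rollouts that produced the correct boxed answer."""
    return sum(outcomes) / len(outcomes)

def goldilocks_filter(rollouts: dict[str, list[bool]],
                      low: float = 0.2, high: float = 0.8) -> list[str]:
    """Keep prompt ids whose success rate lies in [low, high]."""
    return [pid for pid, outcomes in rollouts.items()
            if low <= success_rate(outcomes) <= high]
```

With only 8 rollouts per prompt the rate estimate is coarse (steps of 0.125), so prompts near the band edges may flip between rounds.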
Stopping Criterion
Stop iterating when validation score plateaus for 2 consecutive rounds.
Realistic target: 2–3 full rounds within the 2-month window.
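One way to make "plateaus for 2 consecutive rounds" concrete is sketched below. The improvement threshold `min_delta` is an assumption; it should be set above the run-to-run noise of the validation score.

```python
def should_stop(round_scores: list[float], patience: int = 2,
                min_delta: float = 0.005) -> bool:
    """True if none of the last `patience` rounds beat the earlier best by min_delta."""
    if len(round_scores) <= patience:
        return False  # not enough history to judge a plateau
    best_before = max(round_scores[:-patience])
    return all(score < best_before + min_delta
               for score in round_scores[-patience:])
```

For example, `should_stop([0.60, 0.70, 0.701, 0.699])` returns True, since neither of the last two rounds improves on 0.70 by at least 0.005.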
Submission Strategy
- Save LoRA checkpoint + validation score after every SFT and DAPO stage
- Final submission: best-scoring checkpoint across all rounds (not necessarily the latest — models can regress)
- Package as submission.zip containing adapter_config.json + adapter weights
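Checkpoint selection and packaging can be sketched with the standard library. Two assumptions are made here: the weights file follows PEFT's default `adapter_model.*` naming, and the zip should hold the files at its top level. The competition's required layout should be checked before relying on this.

```python
import zipfile
from pathlib import Path

def best_checkpoint(scores: dict[str, float]) -> str:
    """Name of the checkpoint with the highest validation score."""
    return max(scores, key=scores.get)

def package_adapter(adapter_dir: str, out_path: str = "submission.zip") -> list[str]:
    """Zip adapter_config.json and adapter_model.* from adapter_dir (flat layout)."""
    files = sorted(
        p for p in Path(adapter_dir).iterdir()
        if p.name == "adapter_config.json" or p.name.startswith("adapter_model")
    )
    if not any(p.name == "adapter_config.json" for p in files):
        raise FileNotFoundError("adapter_config.json not found in " + adapter_dir)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            zf.write(p, arcname=p.name)  # arcname drops the directory prefix
    return [p.name for p in files]
```

Other files in the checkpoint directory, such as optimizer state, are left out of the archive.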
What Gets Generated Where (Cost Summary)
| Stage | Prompts | Final Rows | Source | Cost |
|---|---|---|---|---|
| Phase 1 SFT | 100K | ~80K | API distillation + filter | $200–500 (API) |
| Phase 1 DAPO | 30K–50K | N/A (live rollouts) | Python generators | GPU only |
| Phase 2+ SFT | 150K | ~175K | Self-distillation + filter + previous-round mix | GPU only |
| Phase 2+ DAPO | 30K–50K | N/A | Python generators (Goldilocks-filtered) | GPU only |
Key insight: API is used exactly once (Phase 1 SFT). Everything after that uses Python scripts for prompts and the previous round's model for traces.
Hyperparameters Quick Reference
| Setting | Value |
|---|---|
| Base model | Nemotron-3-Nano-30B (4-bit via bitsandbytes) |
| LoRA rank | 32 (competition max) |
| LoRA alpha | 64 |
| LoRA target modules | All linear layers (q, k, v, o, gate, up, down) |
| Compute precision | bf16 |
| Optimizer | AdamW 8-bit |
| Learning rate (SFT) | 1e-4 to 2e-4 |
| Learning rate (DAPO) | 1e-6 |
| Gradient checkpointing | ON |
| Sequence length | 4096–8192 |
| Effective batch size | 16–64 (via gradient accumulation) |
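For reference, the adapter and quantization settings above might translate into PEFT and Transformers config objects roughly as follows. This is a sketch under assumptions: NF4 is assumed as the 4-bit format (the plan only says "4-bit"), and the `*_proj` module names are the common Hugging Face convention, which may not match this model. The real names should be read from `model.named_modules()` after loading.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit base weights with bf16 compute (quant type is an assumption).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Rank 32 / alpha 64 from the table; module names are assumed, verify per model.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```

If the model's linear layers use different names, PEFT also accepts `target_modules="all-linear"` as a shortcut.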
Risk Mitigation Checklist