The following is the plan for this hackathon.

Kaggle NVIDIA Nemotron Reasoning Challenge — Plan

Goal

Produce a LoRA adapter (rank ≤ 32) on Nemotron-3-Nano-30B that maximizes accuracy on Alice's Wonderland reasoning puzzles.

Setup

Base model: Nemotron-3-Nano-30B
Hardware: 1× H100 80GB
Method: QLoRA (4-bit base + LoRA rank 32 on all linear layers). If the H100's VRAM allows, try plain LoRA (unquantized base) instead of QLoRA.
Framework: TRL v1.0 (supports SFT, DAPO-style GRPO out of the box)
Output format: Model must place final answer inside \boxed{...}
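
To judge whether plain LoRA could fit, here is a rough weight-only estimate. It is a sketch under stated assumptions: exactly 30e9 total parameters, and it ignores activations, KV cache, LoRA weights, optimizer state and quantization overhead, all of which need extra room.

```python
# Rough weight-only memory estimate. PARAMS is an assumption ("30B" taken literally).
PARAMS = 30e9      # assumed total parameter count of the base model
H100_GB = 80       # from the Setup list above

def weight_gb(params: float, bits_per_param: float) -> float:
    """Memory for the frozen base weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

bf16_gb = weight_gb(PARAMS, 16)  # plain LoRA: base kept in bf16
q4_gb = weight_gb(PARAMS, 4)     # QLoRA: base quantized to 4 bits

for name, gb in [("bf16 base (LoRA)", bf16_gb), ("4-bit base (QLoRA)", q4_gb)]:
    print(f"{name}: {gb:.0f} GB weights, {H100_GB - gb:.0f} GB left for everything else")
```

Under these assumptions the bf16 base alone takes about 60 GB, so plain LoRA leaves roughly 20 GB for activations at sequence length 4096–8192. That is tight, which is why QLoRA is the default here.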

Task Categories (6 types in train.csv)

Bit manipulation (8-bit binary transforms)
Text encryption (ciphers)
Numeral system conversion (Roman numerals, etc.)
Unit conversion (secret conversion factor)
Modified gravitational constant (physics)
Equation transformation rules

Phase 1 — Data Foundation (cornerstone of the whole pipeline)

Step 1a: Analyze train.csv

Count rows per category
Record difficulty patterns and answer format quirks per category
Hold out train.csv entirely as validation set (closest proxy to hidden test distribution)

Step 1b: Build 6 Python puzzle generators

One generator per category, with controllable difficulty knobs
Each generator must output (prompt, ground_truth_answer) pairs
Generate ~100K synthetic prompts total
- Match train.csv's category distribution
- Apply floor of ~5K per category (ensure coverage even for rare categories)
- Mix easy/medium/hard within each category; this should also follow the difficulty distribution observed in train.csv
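
A minimal sketch of what one generator and the per-category allocation could look like. Everything here is illustrative: the prompt wording, the difficulty ranges and the function names are assumptions, not the real train.csv template. The Roman-numeral case stands in for the numeral-conversion category.

```python
import random

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

# Hypothetical difficulty knob: larger ranges give longer numerals.
DIFFICULTY_RANGES = {"easy": (1, 39), "medium": (40, 399), "hard": (400, 3999)}

def to_roman(n: int) -> str:
    """Standard subtractive Roman numeral for 1 <= n <= 3999."""
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)

def generate_numeral_puzzle(rng: random.Random, difficulty: str) -> tuple:
    """Return one (prompt, ground_truth_answer) pair. Prompt wording is made up."""
    lo, hi = DIFFICULTY_RANGES[difficulty]
    n = rng.randint(lo, hi)
    prompt = f"Convert the number {n} to Roman numerals. Put the final answer in \\boxed{{}}."
    return prompt, to_roman(n)

def allocate(counts: dict, total: int = 100_000, floor: int = 5_000) -> dict:
    """Split `total` proportionally to train.csv counts, then lift rare categories to `floor`."""
    n = sum(counts.values())
    return {cat: max(floor, round(total * c / n)) for cat, c in counts.items()}

rng = random.Random(0)  # fixed seed so the synthetic set is reproducible
prompt, answer = generate_numeral_puzzle(rng, "hard")
```

Note that applying the floor after the proportional split can push the sum slightly above ~100K; that seems acceptable for a target that is approximate anyway.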

Step 1c: Distill reasoning traces via API (only paid step in the whole pipeline)

Use a SOTA model (Claude / GPT-4 / etc.) to generate a reasoning trace for each prompt
Format: reasoning steps + \boxed{answer}
Budget estimate: ~$200–500

Step 1d: Filter by correctness

Extract \boxed{} from each generated trace
Keep only rows where extracted answer matches ground truth (exact string or numerical tolerance)
Expected yield: 70–90% of the ~100K prompts → **~80K clean SFT rows**
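
The extract-and-compare step could be sketched as below. The same check can double as the 0/1 reward used later. The 1% relative tolerance and the function names are assumptions; the tolerance should be replaced by whatever the competition metric actually uses.

```python
import math
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the LAST \\boxed{...} in text, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i].strip()
    return None  # unbalanced braces: treat as no answer

def is_match(pred: Optional[str], truth: str, rel_tol: float = 1e-2) -> bool:
    """Exact string match first, then numeric comparison within rel_tol (assumed value)."""
    if pred is None:
        return False
    if pred.strip() == truth.strip():
        return True
    try:
        return math.isclose(float(pred), float(truth), rel_tol=rel_tol)
    except ValueError:
        return False  # at least one side is not a number

def filter_rows(rows: list) -> list:
    """Keep rows (dicts with 'trace' and 'answer' keys) whose boxed answer is correct."""
    return [r for r in rows if is_match(extract_boxed(r["trace"]), r["answer"])]
```

Taking the last `\boxed{}` rather than the first is a deliberate choice: traces sometimes box intermediate results before the final one.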

Step 1e: Train/val split

Validation: use train.csv (9,500 rows) — NOT in training data
Training: ~80K filtered synthetic (prompt, trace, answer) rows

Phase 2 — N × (SFT + DAPO) with checkpoint tracking

SFT: standard supervised fine-tuning with cross-entropy loss.

Reward function (used by DAPO every round): Extract \boxed{} from rollout → 1 if matches ground truth (exact or numerical tolerance), 0 otherwise.

DAPO config (important — differs from vanilla GRPO):

Decoupled clipping (Clip-Higher): epsilon_low < epsilon_high
No KL penalty (β=0)
Token-level policy gradient loss
Dynamic sampling (filter degenerate batches)
Group size: 8–16 (or 2–4 if compute-tight)
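
To make the config items above concrete, here is a framework-free sketch of the pieces in plain Python. It is not TRL's implementation: the clip bounds 0.2 / 0.28 are the DAPO paper defaults rather than values fixed by this plan, and using the population standard deviation is an assumption.

```python
import statistics

# DAPO paper defaults (assumption for this plan): epsilon_low < epsilon_high.
EPS_LOW, EPS_HIGH = 0.2, 0.28

def is_degenerate(rewards: list) -> bool:
    """Dynamic sampling: a group whose rollouts all got the same reward carries no signal."""
    return len(set(rewards)) <= 1

def group_advantages(rewards: list) -> list:
    """Group-normalized advantage (r - mean) / std for one prompt's rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # degenerate group
    return [(r - mean) / std for r in rewards]

def clipped_term(ratio: float, advantage: float,
                 eps_low: float = EPS_LOW, eps_high: float = EPS_HIGH) -> float:
    """Per-token surrogate with decoupled clipping (Clip-Higher); no KL term (beta = 0)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def token_level_objective(per_sample_terms: list) -> float:
    """Average over ALL tokens in the group instead of averaging per-sample means."""
    flat = [t for sample in per_sample_terms for t in sample]
    return sum(flat) / len(flat)
```

With 0/1 rewards, `is_degenerate` is true exactly when the model solves all or none of a prompt's rollouts, which is the case dynamic sampling filters out. The token-level average means a long rollout contributes in proportion to its length rather than counting the same as a short one.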

Round 1

Round 1 SFT

Train on Phase 1 dataset (~80K rows)
Evaluate on train.csv validation → save score + LoRA checkpoint

Round 1 DAPO

Prompt pool: ~30K–50K prompts
- ~40% overlap with SFT prompts (tests generalization). Open question: what difficulty distribution should these prompts follow?
- ~60% fresh synthetic prompts. Open question: what difficulty distribution should these prompts follow?
Only prompts + ground truth needed — rollouts generated live
Evaluate on train.csv → save score + LoRA checkpoint

Round 2+

Self-distillation (replaces API distillation from Phase 1):

Use previous round's DAPO model to generate traces locally
No more API cost — everything runs on H100

Round N SFT

Prompt pool for self-distillation: ~150K prompts
- ~70K: reuse previous round's SFT prompts (regenerate with stronger model). Open question: what difficulty distribution should these prompts follow?
- ~80K: fresh synthetic prompts for diversity. Open question: what difficulty distribution should these prompts follow?
Generate 4–8 candidate traces per prompt (temperature ~0.7)
Filter: keep traces whose \boxed{} matches ground truth
Expected yield: 85–95% (model is much stronger now). Open question: how large a problem set should we target to get here?
Keep 1–2 best traces per prompt (shortest correct, or diverse)
Mix in ~25K rows from previous round's SFT data (prevents drift / mode collapse). Open question: what difficulty distribution should these rows follow?
Final training set: ~175K rows
Evaluate on train.csv → save score + LoRA checkpoint
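
The trace selection and replay mix for a self-distillation round could look roughly like this. It is a sketch: it assumes correctness has already been checked per trace, measures "shortest" in characters rather than tokens, and the function names are made up.

```python
import random

def select_traces(candidates: list, keep: int = 2) -> list:
    """candidates: (trace, is_correct) pairs for ONE prompt.
    Returns up to `keep` distinct correct traces, shortest first."""
    correct = {trace for trace, ok in candidates if ok}
    return sorted(correct, key=lambda t: (len(t), t))[:keep]

def mix_with_replay(new_rows: list, previous_rows: list,
                    replay: int = 25_000, seed: int = 0) -> list:
    """Add a random sample of the previous round's SFT rows to the new rows, then shuffle."""
    rng = random.Random(seed)
    replayed = rng.sample(previous_rows, min(replay, len(previous_rows)))
    mixed = list(new_rows) + replayed
    rng.shuffle(mixed)
    return mixed

picked = select_traces([("long correct trace", True), ("short", True),
                        ("wrong", False), ("short", True)])
```

Preferring the shortest correct trace biases the model toward concise reasoning. If diversity matters more, the sort key is the place to change.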

Round N DAPO

Prompt pool: ~30K–50K, skewed harder. Open question: what exact difficulty distribution should these prompts follow?
Goldilocks filter: keep prompts where current SFT model succeeds 20–80% of the time
- Drop prompts the model aces (no signal)
- Drop prompts the model always fails (no signal)
- DAPO's Dynamic Sampling also drops zero-signal groups at training time, but pre-filtering avoids spending rollouts on them in the first place
As rounds progress, difficulty floor rises
Evaluate → save score + LoRA checkpoint
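
A minimal sketch of the Goldilocks filter, assuming each prompt has already been rolled out a few times with the current SFT model and scored 0/1. Inclusive bounds and the data layout are assumptions.

```python
def goldilocks(pass_counts: dict, lo: float = 0.2, hi: float = 0.8) -> list:
    """pass_counts maps prompt id -> (num_correct, num_rollouts).
    Keep prompts whose empirical success rate lies in [lo, hi]."""
    kept = []
    for prompt_id, (correct, total) in pass_counts.items():
        rate = correct / total
        if lo <= rate <= hi:
            kept.append(prompt_id)
    return kept

stats = {"p1": (0, 8), "p2": (8, 8), "p3": (4, 8), "p4": (1, 8), "p5": (6, 8)}
kept = goldilocks(stats)
```

With 8 rollouts per prompt the success rate can only take values k/8, so the 20–80% band effectively keeps prompts solved 2 to 6 times out of 8. Fewer rollouts make the estimate noisier and the band coarser.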

Stopping Criterion

Stop iterating when validation score plateaus for 2 consecutive rounds.
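
One way to make "plateaus for 2 consecutive rounds" precise, as a sketch: stop when neither of the last two rounds beats the best earlier score. The `min_delta` knob and the choice of one score per round (for example the post-DAPO score) are assumptions.

```python
def should_stop(scores: list, patience: int = 2, min_delta: float = 0.0) -> bool:
    """scores: one validation score per completed round, oldest first.
    True if none of the last `patience` rounds beat the best earlier score by > min_delta."""
    if len(scores) <= patience:
        return False  # not enough history yet
    best_before = max(scores[:-patience])
    return all(s <= best_before + min_delta for s in scores[-patience:])
```

For example, `should_stop([0.60, 0.70, 0.70, 0.69])` is true because rounds 3 and 4 did not improve on 0.70, while `should_stop([0.60, 0.70, 0.72])` is false.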

Realistic target: 2–3 full rounds within the 2-month window.

Submission Strategy

Save LoRA checkpoint + validation score after every SFT and DAPO stage
Final submission: best-scoring checkpoint across all rounds (not necessarily the latest — models can regress)
Package as submission.zip containing adapter_config.json + adapter weights
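
Picking the best checkpoint and packaging it could be sketched as follows. Placing the files at the root of the archive is an assumption about what the competition expects, and the stage names are illustrative.

```python
import zipfile
from pathlib import Path

def best_checkpoint(log: list) -> str:
    """log: (checkpoint_name, validation_score) pairs; returns the top-scoring name."""
    return max(log, key=lambda item: item[1])[0]

def package_adapter(adapter_dir: str, out_zip: str = "submission.zip") -> list:
    """Zip every file in adapter_dir at the archive root; returns the archived file names."""
    files = sorted(p for p in Path(adapter_dir).iterdir() if p.is_file())
    names = [p.name for p in files]
    if "adapter_config.json" not in names:
        raise FileNotFoundError(f"adapter_config.json missing from {adapter_dir}")
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            zf.write(p, arcname=p.name)  # flat layout, no parent folders
    return names

# Illustrative log: the latest stage is not the best one.
log = [("round1_sft", 0.61), ("round1_dapo", 0.68), ("round2_sft", 0.66)]
winner = best_checkpoint(log)
```

The explicit check for adapter_config.json catches the easy mistake of zipping a training output folder that contains only optimizer state or intermediate files.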

What Gets Generated Where (Cost Summary)

Stage	Prompts	Final Rows	Source	Cost
Round 1 SFT	100K	~80K	API distillation + filter	$200–500 (API)
Round 1 DAPO	30K–50K	N/A (live rollouts)	Python generators	GPU only
Round 2+ SFT	150K	~175K	Self-distillation + filter + previous-round mix	GPU only
Round 2+ DAPO	30K–50K	N/A (live rollouts)	Python generators (Goldilocks-filtered)	GPU only

Key insight: the API is used exactly once, to build the Round 1 SFT data in Phase 1. Everything after that uses Python scripts for prompts and the previous round's model for traces.

Hyperparameters Quick Reference

Setting	Value
Base model	Nemotron-3-Nano-30B (4-bit via bitsandbytes)
LoRA rank	32 (competition max)
LoRA alpha	64
LoRA target modules	All linear layers (q, k, v, o, gate, up, down)
Compute precision	bf16
Optimizer	AdamW 8-bit
Learning rate (SFT)	1e-4 to 2e-4
Learning rate (DAPO)	1e-6
Gradient checkpointing	ON
Sequence length	4096–8192
Effective batch size	16–64 (via gradient accumulation)

Risk Mitigation Checklist

All generator outputs verified against ground truth before training
train.csv held out as validation — NEVER used for training
\boxed{} format consistently enforced in all training traces
Per-category validation accuracy tracked (catch weak categories early)
Checkpoint saved after every stage
Best-checkpoint tracker maintained across all rounds
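
Per-category tracking can be as simple as the sketch below, assuming each validation row has been tagged with one of the six categories and scored 0/1. The category labels shown are placeholders.

```python
from collections import defaultdict

def per_category_accuracy(results: list) -> dict:
    """results: (category, is_correct) pairs from one validation run."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        hits[category] += int(ok)
    return {category: hits[category] / totals[category] for category in totals}

acc = per_category_accuracy([("cipher", True), ("cipher", False),
                             ("units", True), ("units", True)])
weakest = min(acc, key=acc.get)  # category to prioritize in the next prompt pool
```

Comparing these numbers across checkpoints shows whether a drop in the overall score comes from one category regressing rather than a general decline.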