The following is the plan for this hackathon.

Kaggle NVIDIA Nemotron Reasoning Challenge — Plan

Goal

Produce a LoRA adapter (rank ≤ 32) on Nemotron-3-Nano-30B that maximizes accuracy on Alice's Wonderland reasoning puzzles.

Setup

Task Categories (6 types in train.csv)

  1. Bit manipulation (8-bit binary transforms)
  2. Text encryption (ciphers)
  3. Numeral system conversion (Roman numerals, etc.)
  4. Unit conversion (secret conversion factor)
  5. Modified gravitational constant (physics)
  6. Equation transformation rules

Phase 1 — Data Foundation (cornerstone of the whole pipeline)

Step 1a: Analyze train.csv

Step 1b: Build 6 Python puzzle generators
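As an illustration of what one of the six generators could look like, here is a minimal sketch for the numeral-conversion category. The prompt wording, the dictionary keys, and the function names `to_roman` and `generate_roman_puzzle` are assumptions made for this example. The real generator should mirror the format found in train.csv.

```python
import random

# Value/symbol pairs in descending order, including subtractive forms.
_ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n: int) -> str:
    """Convert an integer in [1, 3999] to a standard Roman numeral."""
    if not 1 <= n <= 3999:
        raise ValueError("n must be between 1 and 3999")
    parts = []
    for value, symbol in _ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def generate_roman_puzzle(rng: random.Random) -> dict:
    """Return one synthetic puzzle with a programmatically known answer.

    The prompt text and keys are hypothetical placeholders.
    """
    n = rng.randint(1, 3999)
    prompt = f"Convert {n} to Roman numerals. Put the final answer in \\boxed{{}}."
    return {"category": "numeral_conversion", "prompt": prompt, "answer": to_roman(n)}


if __name__ == "__main__":
    print(generate_roman_puzzle(random.Random(0)))
```

Passing in a seeded `random.Random` keeps generation reproducible, which makes it easier to regenerate the same prompt set across rounds.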

Step 1c: Distill reasoning traces via API (only paid step in the whole pipeline)

Step 1d: Filter by correctness

Step 1e: Train/val split


Phase 2 — N × (SFT + DAPO) with checkpoint tracking

SFT: Run standard supervised fine-tuning with cross-entropy loss.

Reward function (used by DAPO every round): Extract the \boxed{} answer from the rollout → 1 if it matches the ground truth (exactly or within a numerical tolerance), 0 otherwise.
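The reward described above can be sketched as follows. This is a minimal version under stated assumptions: it takes the last \boxed{} in the rollout, the default relative tolerance of 1e-2 is a placeholder rather than a value from the plan, and the `allow_numeric` switch is an addition for this example.

```python
import math
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Nested braces (e.g. \\boxed{\\frac{1}{2}}) are handled by depth counting.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces: treat as no answer


def reward(rollout: str, truth: str, rel_tol: float = 1e-2,
           allow_numeric: bool = True) -> float:
    """1.0 if the boxed answer matches truth exactly or numerically, else 0.0."""
    pred = extract_boxed(rollout)
    if pred is None:
        return 0.0
    if pred == truth.strip():
        return 1.0
    if not allow_numeric:
        return 0.0
    try:
        close = math.isclose(float(pred), float(truth), rel_tol=rel_tol)
    except ValueError:
        return 0.0  # at least one side is not a number
    return 1.0 if close else 0.0
```

One design caveat: answers that only look numeric, such as 8-bit binary strings, should probably be scored with `allow_numeric=False`. Otherwise "10110011" and "10110010" parse as nearly equal floats and would pass the tolerance check.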

DAPO config (important — differs from vanilla GRPO):


Round 1

Round 1 SFT

Round 1 DAPO


Round 2+

Self-distillation (replaces API distillation from Phase 1):

Round N SFT

Round N DAPO


Stopping Criterion

Stop iterating when validation score plateaus for 2 consecutive rounds.
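One way to make the plateau check concrete is sketched below. It assumes "plateau" means that none of the last two rounds improved on the best earlier validation score. The `min_delta` margin and the function name `should_stop` are hypothetical additions.

```python
from typing import Sequence


def should_stop(val_scores: Sequence[float], patience: int = 2,
                min_delta: float = 0.0) -> bool:
    """Return True when the last `patience` rounds all failed to beat
    the best earlier validation score by more than `min_delta`."""
    if len(val_scores) <= patience:
        return False  # not enough history to judge a plateau
    best_before = max(val_scores[:-patience])
    recent = val_scores[-patience:]
    return all(score <= best_before + min_delta for score in recent)


if __name__ == "__main__":
    print(should_stop([0.60, 0.68, 0.71]))         # False
    print(should_stop([0.60, 0.68, 0.68, 0.679]))  # True
```

Because the check compares against the best earlier score, the checkpoint tracked for that best round is the natural one to keep when iteration stops.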

Realistic target: 2–3 full rounds within the 2-month window.


Submission Strategy


What Gets Generated Where (Cost Summary)

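| Stage         | Prompts | Final Rows          | Source                                          | Cost           |
|---------------|---------|---------------------|-------------------------------------------------|----------------|
| Phase 1 SFT   | 100K    | ~80K                | API distillation + filter                       | $200–500 (API) |
| Phase 1 DAPO  | 30K–50K | N/A (live rollouts) | Python generators                               | GPU only       |
| Phase 2+ SFT  | 150K    | ~175K               | Self-distillation + filter + previous-round mix | GPU only       |
| Phase 2+ DAPO | 30K–50K | N/A                 | Python generators (Goldilocks-filtered)         | GPU only       |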

Key insight: The API is used exactly once (Phase 1 SFT). Everything after that uses Python scripts for prompts and the previous round's model for traces.
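The table labels the later DAPO prompts "Goldilocks-filtered" without defining the term. One plausible reading, assumed here, is to keep only prompts that the current model solves sometimes but not always. If every rollout for a prompt gets the same reward, the group-relative advantage is zero and the prompt contributes no gradient. The function name and thresholds below are hypothetical.

```python
from typing import Dict, List


def goldilocks_filter(rollout_rewards: Dict[str, List[int]],
                      low: float = 0.0, high: float = 1.0) -> List[str]:
    """Keep prompt ids whose pass rate lies strictly between low and high.

    rollout_rewards maps a prompt id to the 0/1 rewards of its sampled rollouts.
    """
    kept = []
    for prompt_id, rewards in rollout_rewards.items():
        if not rewards:
            continue  # no rollouts sampled for this prompt
        rate = sum(rewards) / len(rewards)
        if low < rate < high:
            kept.append(prompt_id)
    return kept


if __name__ == "__main__":
    sample = {
        "too_hard": [0, 0, 0, 0],
        "just_right": [1, 0, 1, 0],
        "too_easy": [1, 1, 1, 1],
    }
    print(goldilocks_filter(sample))  # ['just_right']
```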

Hyperparameters Quick Reference

| Setting                | Value                                           |
|------------------------|-------------------------------------------------|
| Base model             | Nemotron-3-Nano-30B (4-bit via bitsandbytes)    |
| LoRA rank              | 32 (competition max)                            |
| LoRA alpha             | 64                                              |
| LoRA target modules    | All linear layers (q, k, v, o, gate, up, down)  |
| Compute precision      | bf16                                            |
| Optimizer              | AdamW 8-bit                                     |
| Learning rate (SFT)    | 1e-4 to 2e-4                                    |
| Learning rate (DAPO)   | 1e-6                                            |
| Gradient checkpointing | ON                                              |
| Sequence length        | 4096–8192                                       |
| Effective batch size   | 16–64 (via gradient accumulation)               |
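Two of the settings above involve simple arithmetic that is easy to get wrong when wiring up a trainer. The sketch below shows it. The per-device batch size and device count are illustrative assumptions, and the helper names are hypothetical.

```python
def grad_accum_steps(effective_batch: int, per_device_batch: int,
                     num_devices: int = 1) -> int:
    """Accumulation steps so that
    per_device_batch * num_devices * steps == effective_batch."""
    per_step = per_device_batch * num_devices
    if per_step <= 0 or effective_batch % per_step != 0:
        raise ValueError("effective_batch must be a positive multiple of "
                         "per_device_batch * num_devices")
    return effective_batch // per_step


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update (alpha / r)."""
    return alpha / rank


if __name__ == "__main__":
    # Assumed example: micro-batch of 2 on a single GPU, effective batch of 32.
    print(grad_accum_steps(32, per_device_batch=2))  # 16
    print(lora_scale(64, 32))                        # 2.0
```

With rank 32 and alpha 64, the adapter update is scaled by 2. If the rank is ever lowered while alpha stays fixed, the effective update scale rises, so the learning rate may need to be revisited.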

Risk Mitigation Checklist