The starting point wasn't "I want to fine-tune an LLM." It was a specific,
reproducible failure: every model I tested would get one field wrong on a
particular class of headline. A Fed rate cut announcement on a market asking
"Will the Fed cut rates?" would return direction: -0.85. The model was reasoning about interest rate
levels. The market question was asking about YES probability. Those are
different things, and no amount of prompting a non-thinking model fixed it
reliably.
The pipeline takes a news headline and a prediction market question and outputs a structured JSON assessment across six fields: relevance, direction, magnitude, source credibility, information type, and novelty. The goal was to run this on-device at low latency with no per-call API cost. Getting there required designing the schema, fixing the direction semantics in prompt space, constructing a 74-example training set with teacher model traces, and debugging a quantization merge bug that was silently discarding everything LoRA had learned.
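To make the schema concrete, here is a minimal validator for the six-field output. The field names come from the pipeline description; the exact ranges (relevance and magnitude in [0, 1], direction in [−1, 1]) and the choice to treat credibility, information type, and novelty as categorical strings are my assumptions for the sketch.

```python
import json

# Numeric fields with assumed ranges; categorical fields assumed to be strings.
NUMERIC_FIELDS = {"relevance": (0.0, 1.0), "direction": (-1.0, 1.0), "magnitude": (0.0, 1.0)}
CATEGORICAL_FIELDS = {"source_credibility", "information_type", "novelty"}

def validate_assessment(raw: str) -> dict:
    """Parse model output and enforce the six-field schema."""
    data = json.loads(raw)
    for field, (lo, hi) in NUMERIC_FIELDS.items():
        value = data[field]  # KeyError if the model dropped a field
        if not (lo <= float(value) <= hi):
            raise ValueError(f"{field}={value} outside [{lo}, {hi}]")
    for field in CATEGORICAL_FIELDS:
        if not isinstance(data[field], str):
            raise ValueError(f"{field} must be a string")
    return data

sample = ('{"relevance": 0.9, "direction": 0.8, "magnitude": 0.35, '
          '"source_credibility": "high", "information_type": "official_announcement", '
          '"novelty": "new"}')
```

Enforcing the schema at the boundary matters because a fine-tuned small model that emits valid-looking JSON can still drop a field or drift out of range.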
The direction problem
The direction field is the hardest part of the schema. Its definition sounds obvious (does this headline make the market question resolving YES more or less likely?) but every non-thinking model I tested interpreted it as: does the underlying metric in the headline go up or down? Those two questions produce opposite signs on a large class of economically important headlines.
The fix was a rewrite of the field description: "does this make the market question resolving YES more likely (+1) or less likely (−1)? Do not consider the direction of the underlying metric." One sentence. It resolved the sign flip entirely in prompt space, but only for models with enough reasoning capacity to act on the distinction. 4B non-thinking models got direction right after the rewrite but magnitude wrong: scores calibrated 2–4× too high, noise filtering broken. The sign was fixed; the confidence was still hallucinated.
I built three reference validation examples covering the full signal spectrum (high signal, pure noise, ambiguous weak signal) and used them as a fixed benchmark to evaluate every model tested. Magnitude anchoring helped too: adding "0.1 is a moderate update, 0.4+ is near-resolving evidence" to the schema description brought magnitude closer to calibrated values, but never consistently so across model sizes.
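The benchmark check is mechanical enough to sketch. The three case names mirror the signal spectrum above, but the expected values and the 0.15 tolerance are hypothetical stand-ins; the real reference examples aren't reproduced in this write-up.

```python
# Three reference cases spanning the signal spectrum. Expected values are
# hypothetical stand-ins, not the real reference set.
REFERENCE_SET = [
    {"name": "high_signal", "expected": {"direction": 1.0, "magnitude": 0.45}},
    {"name": "pure_noise", "expected": {"direction": 0.0, "magnitude": 0.02}},
    {"name": "ambiguous_weak", "expected": {"direction": 1.0, "magnitude": 0.10}},
]

def score_model(outputs: dict, tolerance: float = 0.15) -> dict:
    """Compare one model's outputs against the fixed reference expectations.

    Sign agreement catches the direction flip; magnitude within `tolerance`
    catches the 2-4x over-confidence seen in 4B non-thinking models.
    """
    report = {}
    for case in REFERENCE_SET:
        got, exp = outputs[case["name"]], case["expected"]
        report[case["name"]] = {
            "sign_ok": (got["direction"] >= 0) == (exp["direction"] >= 0),
            "magnitude_ok": abs(got["magnitude"] - exp["magnitude"]) <= tolerance,
        }
    return report
```

Running every candidate model through the same three cases is what made the comparison in the next section fast: one fixed harness, six configurations.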
Model selection and the thinking model insight
I tested six configurations against the same three reference examples. The results converged fast.
- 4B non-thinking models: direction fixed by prompt, magnitude consistently 2–4× too high, noise filtering broken across the board.
- Qwen3-0.6B with native thinking: outperformed every 4B non-thinking model on direction and magnitude calibration. The on-device target.
- Claude Haiku: near-perfect on all six fields out of the box. The benchmark ceiling.
- Gemma 4B thinking: strong baseline, comparable to Haiku on the noise test. A viable alternative base for fine-tuning.
- BitNet 2B: researched as a future path. Ternary weights (−1, 0, +1) eliminate floating-point multiplication, with reported 6× speedup and 12× energy reduction vs standard models at inference time.
The key finding: thinking models punch above their weight class. A 0.6B thinking model beats a 4B non-thinking model because it can explicitly traverse its own knowledge graph before committing to a single output token. Non-thinking models compress their full reasoning into one forward pass. Thinking mode spreads it across tokens; each intermediate token can condition the next. The 0.6B model doesn't have more knowledge; it has more reasoning steps to apply the knowledge it has.
Dataset construction and the spurious correlation
Building 74 training examples sounds mechanical. It wasn't.
I built the first 20 with Claude as a teacher model, feeding headline, question, and my expected output, using the model to generate high-quality thinking traces that showed the reasoning process explicitly. Each example was validated manually against expected outputs. The thinking traces were the critical piece: I was training on reasoning patterns, not just output patterns. I also designed malformed input examples with consistent error schema to cover graceful pipeline failure handling, and made sure all three systematic failure modes had targeted coverage: noise filtering, magnitude calibration, information type classification.
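A single training record, in sketch form. The chat-message layout and the `<think>...</think>` wrapper assume Qwen3's native thinking format; the error-schema fields for malformed inputs are likewise my illustration, not the project's exact schema.

```python
import json

def make_example(headline: str, question: str, thinking: str, assessment: dict) -> dict:
    """One supervised record: the user turn carries the inputs, the assistant
    turn carries the teacher's thinking trace followed by the JSON answer.
    The <think>...</think> wrapper assumes Qwen3's native thinking format."""
    return {
        "messages": [
            {"role": "user",
             "content": f"Headline: {headline}\nMarket question: {question}"},
            {"role": "assistant",
             "content": f"<think>{thinking}</think>\n{json.dumps(assessment)}"},
        ]
    }

# Malformed inputs get a consistent error schema instead of a six-field
# assessment, so the pipeline fails gracefully. Exact fields are assumed.
ERROR_EXAMPLE = make_example(
    headline="",
    question="Will the Fed cut rates?",
    thinking="The headline is empty, so no assessment is possible.",
    assessment={"error": "malformed_input", "reason": "empty headline"},
)
```

Training on the trace plus the answer, rather than the answer alone, is what made the reasoning pattern (YES probability, not metric direction) transferable to the 0.6B student.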
Then I sourced 40 more examples and combined the sets. That's when I found the problem: the original 20 examples had thinking traces of 600–1,100 characters. The new 40 had traces of 260–530 characters. The model was learning trace length as a proxy for relevance: longer reasoning → higher relevance score. A spurious correlation baked into the data distribution. It doesn't show up in aggregate metrics. It shows up when you look at failure modes per example and notice that noise items with short traces are scoring 0.3 relevance and high-signal items with long traces are scoring 0.9, not because the model has learned the distinction, but because it has learned the length difference.
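A check like the following would have surfaced the confound before training. The 0.5 relevance split and the 1.5× ratio threshold are arbitrary choices for the sketch.

```python
def length_relevance_confound(examples: list, ratio_threshold: float = 1.5) -> bool:
    """Flag when high-relevance examples systematically carry longer thinking
    traces than low-relevance ones. `examples` holds dicts with "trace" (str)
    and "relevance" (float); 0.5 splits high from low, and the 1.5x ratio
    threshold is arbitrary."""
    high = [len(e["trace"]) for e in examples if e["relevance"] >= 0.5]
    low = [len(e["trace"]) for e in examples if e["relevance"] < 0.5]
    if not high or not low:
        return False  # can't compare with an empty bucket
    return (sum(high) / len(high)) / (sum(low) / len(low)) >= ratio_threshold

# In the actual dataset, the original 20 examples ran 600-1,100 characters
# and the new 40 ran 260-530: a ratio well past any reasonable threshold.
```

It's two group means, not a model; which is the point. The confound was detectable from the raw data distribution long before any aggregate training metric would have hinted at it.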
Fine-tuning, and where it broke
Framework: Unsloth + LoRA on Windows, RTX 4070 Laptop (8 GB VRAM).
Base model: unsloth/Qwen3-0.6B-unsloth-bnb-4bit. LoRA config:
r=16, alpha=32, targeting all projection layers, 10M of 606M parameters
trained, 1.67% of the model. Training time: roughly 38 seconds for 15
steps.
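The 10M / 1.67% figures check out against the published Qwen3-0.6B shapes. The dimensions below are taken from that config (the one assumption here); everything else is the standard LoRA parameter count, r·(d_in + d_out) per adapted matrix.

```python
# Qwen3-0.6B shapes (hidden 1024, 16 query / 8 KV heads of dim 128,
# MLP width 3072, 28 layers) -- taken from the published config.
HIDDEN, HEAD_DIM, N_Q, N_KV, MLP, LAYERS = 1024, 128, 16, 8, 3072, 28
R = 16  # LoRA rank

# (in_features, out_features) for every projection layer LoRA targets.
projections = [
    (HIDDEN, N_Q * HEAD_DIM),   # q_proj
    (HIDDEN, N_KV * HEAD_DIM),  # k_proj
    (HIDDEN, N_KV * HEAD_DIM),  # v_proj
    (N_Q * HEAD_DIM, HIDDEN),   # o_proj
    (HIDDEN, MLP),              # gate_proj
    (HIDDEN, MLP),              # up_proj
    (MLP, HIDDEN),              # down_proj
]

# Each adapted matrix adds two low-rank factors: A (in x r) and B (r x out).
per_layer = sum(R * (d_in + d_out) for d_in, d_out in projections)
total = per_layer * LAYERS
print(total, f"{100 * total / 606e6:.2f}%")  # ~10.1M trainable, ~1.67%
```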
The dependency setup on Windows was its own debugging session. Python 3.13
was incompatible with PyTorch at the time; dropping to 3.11 fixed it. CUDA
versions conflicted between
unsloth, torch, triton, and xformers. Resolved by installing torch 2.6.0+cu124 first, then plain unsloth, then pinning
triton-windows. Flash Attention 2 is unavailable on Windows; fell back to
xformers throughout.
The first training run used lr=1e-4 over 5 epochs. Too
aggressive. The model catastrophically forgot its base reasoning circuits
and lost thinking trace coherence entirely. The fix was lr=5e-5
with shuffle=True, seed=42. Catastrophic forgetting in LoRA
fine-tuning is a real failure mode, not a theoretical one: an overly
aggressive learning rate doesn't just overfit, it overwrites the reasoning
pathways the base model developed during pretraining.
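The shuffle-with-fixed-seed part of the fix buys reproducibility, which is worth spelling out: with the same seed, every run trains on an identical example order, so a regression can be attributed to the change under test rather than to ordering luck. A minimal sketch:

```python
import random

def shuffled_order(n_examples: int, seed: int = 42) -> list:
    """Deterministic shuffle: shuffle=True with a fixed seed gives the same
    example order on every run (the setting used in the fixed training run)."""
    order = list(range(n_examples))
    random.Random(seed).shuffle(order)
    return order

# The 74-example set trains on the same sequence every run.
```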
The subtlest failure was the quantization merge bug. The original merge loaded the base model in 4-bit quantization, then merged the float16 LoRA adapter into it. The precision mismatch matters: a 4-bit weight has 16 representable values per element; float16 has roughly 65,000. Small corrections (precisely what LoRA produces) get rounded to zero. Large corrections survive but may be destructive. The merged model retained the base model's behavior everywhere it was already wrong, and accepted large weight changes in the right direction only by accident.
Fix: load the base model in float16 before merging. Both matrices share the same precision, and small LoRA corrections survive intact.
LoRA trained the model to distinguish YES probability from underlying metric direction. A quantization type mismatch during the merge threw most of it away.
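The mechanism is easy to demonstrate with a toy quantizer. The uniform 16-level grid over [−1, 1] below is a simplification; real 4-bit schemes like NF4 use non-uniform levels, but the rounding argument is identical.

```python
# Toy uniform 4-bit quantizer over [-1, 1]: 16 representable levels.
STEP = 2 / 15

def quantize_4bit(x: float) -> float:
    return round(x / STEP) * STEP

base_weight = 0.50
lora_delta = 0.02   # a small, LoRA-sized correction

# Wrong order: base loaded in 4-bit, correction merged into the quantized
# grid -- the small delta rounds back to the same level and vanishes.
q_base = quantize_4bit(base_weight)
assert quantize_4bit(q_base + lora_delta) == q_base

# A large correction does move the weight -- which is why only big (and
# possibly destructive) changes survived the buggy merge.
assert quantize_4bit(q_base + 0.10) != q_base

# Right order: merge at float16-like precision first; the correction survives.
assert base_weight + lora_delta != base_weight
```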
What I'm taking from this
The model works. Calibration needs improvement, specifically normalizing
thinking trace lengths across all 74 examples to 700–1,000 characters before
retraining, and increasing noise examples to 30–40% of the dataset. The
production path from here is GGUF conversion via llama.cpp, a Modelfile for
Ollama, and a local API at localhost:11434, replacing per-call
API costs with a one-time training investment.
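Once the model is behind Ollama, a pipeline call becomes a local HTTP POST. The sketch below builds the request for Ollama's /api/generate endpoint; the model tag `headline-assessor` is hypothetical, and the actual network call is left commented out since it needs the server running.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(headline: str, question: str,
                  model: str = "headline-assessor") -> urllib.request.Request:
    """Build the POST request for Ollama's /api/generate endpoint.
    `model` is a hypothetical tag for the fine-tuned GGUF; format="json"
    asks Ollama to constrain the response to valid JSON."""
    payload = {
        "model": model,
        "prompt": f"Headline: {headline}\nMarket question: {question}",
        "format": "json",
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# With the server running:
# response = json.loads(urllib.request.urlopen(build_request(h, q)).read())
```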
The lasting lesson isn't about LLMs. The spurious correlation between trace length and relevance scores was entirely self-inflicted and nearly invisible in aggregate metrics. The quantization merge bug was silent. The model ran, produced valid JSON, and looked fine from the outside. Both required tracing through the full pipeline systematically rather than reading output summaries. That discipline (assuming the failure is invisible and looking anyway) is the actual skill this project built.