LLM Evaluation Harness
Three rigorous LLM experiments with properly computed confidence intervals and published writeups — demonstrating the methodology most AI engineer portfolios lack.
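The exact interval method behind "properly computed confidence intervals" isn't shown in this summary; below is a minimal sketch of one standard approach, a percentile bootstrap over per-item scores (all names and sample data are illustrative):

```python
# Minimal sketch: percentile-bootstrap CI for the mean of per-item scores.
# Illustrative only; the harness's actual CI method may differ.
import numpy as np

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean score with a (1 - alpha) percentile-bootstrap interval."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    resampled_means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Example: 600 binary correctness scores from one eval cell.
acc, (lo, hi) = bootstrap_ci(np.random.binomial(1, 0.7, size=600))
```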
Experiments
Exp 001: Quantization Cliff
- Qwen2.5-7B-Instruct evaluated at four llama.cpp quantization levels (Q3_K_M through Q8_0) on math, factual QA, and structured extraction (n=600 per cell)
- “Q4 is fine” holds: Q4_K_M is statistically indistinguishable from Q8_0 on all three tasks. The only significant cliff is TriviaQA at Q3_K_M (−3.2pp vs Q8_0, p=0.011)
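The quant-vs-quant gaps above call for a paired significance test over the same n=600 items; the exact test used isn't reproduced here. A hedged sketch assuming per-item correctness vectors and a sign-flip permutation test (names are illustrative):

```python
# Sketch: paired sign-flip permutation test on the accuracy gap between two
# quants scored on the same items. The experiment's actual test may differ.
import numpy as np

def paired_permutation_test(correct_a, correct_b, n_perm=100_000, seed=0):
    """Two-sided p-value for mean(correct_a) - mean(correct_b)."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(correct_a, float) - np.asarray(correct_b, float)
    observed = diff.mean()
    # Under H0 (no gap), each per-item difference is equally likely to flip sign.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, len(diff)))
    null_gaps = (signs * diff).mean(axis=1)
    p_value = (np.abs(null_gaps) >= abs(observed)).mean()
    return observed, p_value

# e.g. gap, p = paired_permutation_test(q8_correct, q3_correct)
```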
Exp 002: LoRA vs API
- A LoRA-fine-tuned Qwen2.5-1.5B beats zero-shot gpt-5.4-mini by +0.214 F1 (p=0.0001)
- Break-even against per-call API pricing falls at roughly 19M inferences at 2026 prices (arithmetic sketched below)
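The break-even figure comes from weighing a one-time fine-tuning cost plus cheap self-hosted inference against per-call API pricing. A back-of-envelope sketch; every dollar figure below is a hypothetical placeholder, not a price used in Exp 002:

```python
# Back-of-envelope break-even sketch. All dollar figures are hypothetical
# placeholders, not the 2026 prices used in Exp 002.
finetune_cost = 40.0       # one-time LoRA fine-tuning cost (hypothetical)
selfhost_per_call = 2e-6   # amortized self-hosted cost per inference (hypothetical)
api_per_call = 4e-6        # API cost per inference (hypothetical)

# Break-even n solves: finetune_cost + selfhost_per_call * n == api_per_call * n
breakeven_calls = finetune_cost / (api_per_call - selfhost_per_call)
print(f"break-even at ~{breakeven_calls / 1e6:.0f}M inferences")
```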
Exp 003: Layer Probes
- Token embeddings (layer 0) already capture 95% of the best layer's probe accuracy (70.0% vs 73.1%)
- Refusal is primarily lexical; transformer layers add little discriminative signal
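Layer probing of this kind usually means fitting a linear classifier on cached hidden states from each layer and comparing held-out accuracy. Exp 003's exact probe setup isn't detailed in this summary, so the sketch below assumes per-layer activations and a logistic-regression probe (shapes and names are illustrative):

```python
# Sketch: one logistic-regression probe per layer on cached hidden states of
# shape [n_examples, n_layers, hidden_dim]. Exp 003's probe may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy_by_layer(hidden_states, labels, seed=0):
    accuracies = []
    for layer in range(hidden_states.shape[1]):
        X = hidden_states[:, layer, :]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.2, random_state=seed, stratify=labels)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        accuracies.append(probe.score(X_te, y_te))  # held-out probe accuracy
    return accuracies  # index 0 = layer 0 (token embeddings)
```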