LLM Evaluation Harness

Three rigorous LLM experiments with properly computed confidence intervals and published writeups — demonstrating the methodology most AI engineer portfolios lack.
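The confidence intervals throughout are computed over per-example scores. As a minimal illustration (a percentile bootstrap over 0/1 correctness scores; the writeups' exact estimator may differ), the core computation fits in a few lines:

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# e.g. 600 binary correctness scores, as in the experiments below
scores = [1] * 420 + [0] * 180  # 70% accuracy
print(bootstrap_ci(scores))
```

Resampling whole examples (rather than assuming normality) keeps the interval honest for small n and skewed score distributions.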

View the live site

Experiments

Exp 001: Quantization Cliff

  • Qwen2.5-7B-Instruct evaluated across four llama.cpp quantization levels (Q3_K_M through Q8_0) on math, factual QA, and structured extraction (n=600 per cell)
  • “Q4 is fine” holds: Q4_K_M is statistically indistinguishable from Q8_0 on all three tasks. The only significant cliff is TriviaQA at Q3_K_M (−3.2pp vs Q8_0, p=0.011)
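The per-quant comparisons above are pairwise tests against Q8_0. As a rough illustration only (the published writeups may use a paired or bootstrap test, which is tighter on the same examples), an unpaired two-proportion z-test at n=600 per cell, with hypothetical correct-answer counts:

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Two-sided z-test for a difference in accuracies (pooled SE)."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))     # two-sided
    return p1 - p2, p_value

# Hypothetical counts at n=600 per cell, NOT the experiment's data
diff, p = two_proportion_z(540, 600, 559, 600)    # diff ≈ -3.2pp
print(f"{diff * 100:+.1f}pp, p={p:.3f}")
```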

Exp 002: LoRA vs API

  • LoRA fine-tuned Qwen2.5-1.5B beats gpt-5.4-mini zero-shot by +0.214 F1 (p=0.0001)
  • Cost break-even against the API comes at ~19M inferences at 2026 pricing
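The break-even figure follows from simple unit economics: one-time fine-tuning cost divided by the per-inference saving of self-hosting over the API. A sketch with placeholder prices (the actual training cost and 2026 per-call rates live in the writeup, not here):

```python
def break_even_inferences(train_cost, selfhost_per_1k, api_per_1k):
    """Inference count where fine-tune + self-hosting overtakes the API.

    Solves: train_cost + n * selfhost = n * api
         -> n = train_cost / (api - selfhost)
    """
    per_inf_saving = (api_per_1k - selfhost_per_1k) / 1000
    return train_cost / per_inf_saving

# Illustrative placeholder prices, NOT the experiment's 2026 numbers
n = break_even_inferences(train_cost=500.0,
                          selfhost_per_1k=0.02,
                          api_per_1k=0.05)
print(f"{n / 1e6:.1f}M inferences")  # → 16.7M inferences
```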

Exp 003: Layer Probes

  • Token embeddings (layer 0) already capture ~96% of the best probe's accuracy (70.0% vs 73.1%)
  • Refusal is primarily lexical; transformer layers add little discriminative signal
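The probes in Exp 003 are linear classifiers trained on per-layer activations. A self-contained sketch on synthetic data (the array shapes, class shift, and optimizer below are stand-ins, not the experiment's actual activations or training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one layer's activations: refusal vs. non-refusal
# examples separated along a single direction (hypothetical data)
n, d = 400, 64
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n)
X[y == 1, 0] += 1.5          # class-conditional shift = the "lexical" signal

# Linear probe: logistic regression fit by plain gradient descent
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted P(refusal)
    g = p - y                            # gradient of the logistic loss
    w -= 0.1 * (X.T @ g) / n
    b -= 0.1 * g.mean()

acc = (((X @ w + b) > 0) == y).mean()
print(f"probe accuracy: {acc:.3f}")
```

Running the same probe at every layer and comparing accuracies is what lets Exp 003 conclude that later layers add little over the layer-0 embeddings.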