Stop checking outputs by vibes. Build a runnable eval — golden dataset, deterministic scorer, LLM judge — and read the result like an engineer.