Course · 5 chapters

LLM Evaluation

Build runnable LLM evals you can trust: golden datasets, deterministic scorers, calibrated LLM judges, Inspect AI suites, and CI gates. 5 chapters, advanced, for engineers.

Paid · Advanced · 5 chapters · 100 min · English + 6 languages · Certificate on completion

What you'll be able to do

  • Build a runnable LLM eval from scratch
  • Design judge rubrics that resist bias
  • Calibrate LLM judges against humans
  • Run eval suites with Inspect AI
  • Gate bad merges with CI evals
  • Set thresholds that survive flaky judges

What's inside

  1. LLM Evaluation: Start Here

    A 12-minute orientation to the LLM Evaluation skill path: the gateway chapter, then the three layers (judges, suites, gates) that turn eval-by-vibes into a discipline that ships.

    12 min
  2. Eval Fundamentals: Your First LLM Eval in 30 Minutes

    No more vibe checks. Build a runnable eval (golden dataset, deterministic scorer, LLM judge) and read the result like an engineer.

    22 min
  3. LLM-as-Judge: Rubrics, Bias, and Reliability

    Design judges that survive CALM biases, calibrate them against humans, and earn them a place in your CI gate.

    22 min
  4. Inspect AI: Production-Grade Eval Suites at Scale

    Build, run, and visualize frontier-grade eval suites with UK AISI's open-source framework.

    22 min
  5. Eval Gating in CI: Blocking Bad Merges

    Wire per-PR evals into GitHub Actions, choose thresholds that survive flakiness, and decide when a gate belongs on main.

    22 min

Frequently asked questions

What will I learn in this LLM evaluation course?
You build evaluation across three layers: a first runnable eval with a golden dataset and scorer, reliable LLM-as-judge rubrics calibrated against human ratings, and eval suites wired into CI as a merge gate. The path uses UK AISI's open-source Inspect AI framework and GitHub Actions.
Who is this course for?
It is for engineers building production AI features who need to test LLM outputs rigorously instead of checking them by vibes. The level is advanced, with a focus on software engineering and AI reliability.
Do I need to code to take this course?
Yes. This is a hands-on engineering path that involves writing eval scripts, configuring the Inspect AI framework, and setting up GitHub Actions workflows, so comfort with code and CI is expected.
How long is the course and is there a certificate?
The path has 5 chapters totaling about 100 minutes, starting with a 12-minute orientation. On completion you earn an AI Academy by Anthropos certificate.
Is this course free?
No, this is a paid skill path included with an AI Academy by Anthropos subscription.

Earn a certificate

Complete all chapters to receive your certificate of completion.