Prof. Dr. Kay Rottmann

Service · LLM Evaluation

LLM Evaluation for Production Systems

How do you know your LLM-based system actually works? We build evaluation frameworks that make your AI systems measurable - from prompts to complex multi-agent systems.

Consulting and content: Prof. Dr. Kay Rottmann

Professor of Applied AI · HdM Stuttgart · ex-Meta, Bosch, Amazon

Why LLM evaluation is the most underestimated part of any AI project

A model that looks good in a demo isn't a model that works in production. Once you deploy LLMs seriously, you need an answer to two questions: how do we know it's good? And, more importantly: how do we know it's better? This is where most AI projects fail. Not at the model. Not at the prompt. They fail because nobody systematically measures whether the system does what it should. LLM evaluation is the discipline that answers these questions - and it's just as important as the system itself.

What good LLM evaluation includes

Robust LLM evaluation has at least three layers:

- Offline evaluation: test sets with known correct answers, automatically scorable (accuracy, F1, BLEU, ROUGE - depending on the task), plus semantic evaluation with LLM-as-a-judge where deterministic metrics fall short.
- Online evaluation: A/B tests in production that measure real user signals - conversion, satisfaction, efficiency.
- Monitoring & drift detection: continuous observation to catch unexpected changes (model updates, data drift, prompt regressions).

Only together do they give you a reliable picture.
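
To make the offline layer concrete, here is a minimal sketch of such a harness: exact match where the task allows it, an LLM-as-a-judge call where it doesn't. The test set, judge prompt, and model name are illustrative placeholders, and we assume the OpenAI Python SDK as the client - swap in your own stack and data:

```python
# Minimal offline eval sketch. TEST_SET, the judge prompt, and the
# model name are placeholders - in practice the test set comes from
# your real production data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

TEST_SET = [
    {"input": "Cancel my subscription", "expected": "cancellation"},
    {"input": "Where is my package?", "expected": "shipping_status"},
]

def judge(question: str, expected: str, actual: str) -> bool:
    """LLM-as-a-judge: is `actual` semantically equivalent to `expected`?"""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": (
            f"Question: {question}\nExpected: {expected}\nActual: {actual}\n"
            "Do expected and actual match semantically? Answer YES or NO."
        )}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def evaluate(system_under_test) -> float:
    """Accuracy over the test set: exact match first, judge as fallback."""
    passed = 0
    for case in TEST_SET:
        actual = system_under_test(case["input"])
        if actual == case["expected"] or judge(case["input"], case["expected"], actual):
            passed += 1
    return passed / len(TEST_SET)
```

Run on every prompt or model change, this one number already tells you whether a change helped or hurt - that is the core of the offline layer.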

Evaluation for AI agents - where it gets really hard

For simple LLM tasks (classification, summarization), standard metrics are enough. For agents that plan multiple steps, call tools, and self-correct, they're not. You have to measure:

- Does the agent achieve its goal at all?
- Does it follow policies (e.g. "always confirm before deleting")?
- Does it work efficiently (not 50 tool calls when 5 would do)?
- Does it behave consistently across comparable requests?

Hard-won lessons from real production systems flow directly into our consulting.
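
As a sketch of what such checks can look like in code, the functions below score an agent's tool-call trace for policy compliance and efficiency. The trace format and tool names (confirm_with_user, delete_record) are hypothetical - substitute whatever your agent framework actually logs:

```python
# Hedged sketch of trace-level agent checks. The trace format (a list
# of tool-call dicts) and the tool names are hypothetical.

def check_policy(trace: list[dict]) -> bool:
    """Policy: every destructive call must be preceded by a confirmation."""
    confirmed = False
    for step in trace:
        if step["tool"] == "confirm_with_user":
            confirmed = True
        elif step["tool"] == "delete_record":
            if not confirmed:
                return False  # deleted without confirming: violation
            confirmed = False  # one confirmation covers one deletion
    return True

def check_efficiency(trace: list[dict], budget: int = 10) -> bool:
    """Efficiency: flag runs that burn 50 tool calls where 5 would do."""
    return len(trace) <= budget

def score_run(goal_achieved: bool, trace: list[dict]) -> dict:
    """Combine the criteria above into one per-run scorecard."""
    return {
        "goal": goal_achieved,                 # did the agent get there at all?
        "policy": check_policy(trace),         # did it follow the rules?
        "efficient": check_efficiency(trace),  # without excessive tool calls?
    }
```

Consistency, the fourth criterion, then means running comparable requests through the same scorecard and comparing the results.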

Frameworks and tools we typically use

Depending on the use case:

- DeepEval for structured LLM tests with semantic metrics.
- Langfuse or Phoenix (Arize) for observability and trace analysis in production.
- τ²-bench for agent simulation and policy-conformant evaluation.
- Promptfoo for lightweight A/B testing of prompts.
- A custom test set built from your real data - without it, everything else is worthless.

We help you pick the right framework for your situation without locking you into specific vendors.
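
For a flavor of the structured-test route, here is roughly what a DeepEval test can look like (the exact API may differ between versions; my_app is a stand-in for your system):

```python
# Hedged sketch of a DeepEval-style test - check the current DeepEval
# docs for the exact API. `my_app` is a placeholder for your system.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def my_app(question: str) -> str:
    """Placeholder for the LLM system under test."""
    return "Click 'Forgot password' on the login page."

def test_password_reset_answer():
    question = "How do I reset my password?"
    case = LLMTestCase(input=question, actual_output=my_app(question))
    # Judge-based metric: fails the test if relevancy scores below 0.7
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])
```

Because these run as ordinary pytest tests, they slot straight into CI - the same integration we build in the engagements below.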

How we help with LLM evaluation

Three typical engagements:

- Evaluation audit (1 to 2 weeks): we review your existing AI system and tell you what your current eval strategy can and can't catch. Output: concrete recommendations with effort estimates.
- Eval framework build (3 to 6 weeks): we build an automated evaluation framework for your specific use case, including test set construction, metric definition, and CI integration.
- Eval as part of an implementation: every AI implementation we deliver includes an eval framework from day one - not as an afterthought.

What makes this LLM evaluation consulting different

You get a team that has spent years on the production side of LLM evaluation. We built production AI systems at Amazon (Alexa), Bosch, and Meta - and learned the hard way, every time, what happens when eval comes too late. That practitioner perspective is rare in the DACH market.

Frequently asked questions

What's the difference between offline and online evaluation?
Offline evaluation runs on a fixed test set before deployment - automated and fast. Online evaluation runs in production with real users (e.g. via A/B tests) and measures real outcomes - slower, but closer to the truth. You need both.
We just use the ChatGPT API. Do we still need evaluation?
Yes, especially then. If you don't train your own model, your only quality control is evaluation. Otherwise you'll find out from a customer call that the latest OpenAI model update broke something in your prompt.
How many test cases do we need at minimum?
Rule of thumb: at least 50 for a first useful signal, 200+ for statistically reliable statements, 1000+ for edge-case coverage. More important than raw count is quality - real data from your use case with good labels.
Can you build evaluation for an existing system retroactively?
Yes - that's actually one of the most common engagements. Many teams have a production AI system without structured evaluation and only notice after months that quality is slowly degrading. An eval framework can be added retroactively.
How much does building an eval framework cost?
An audit (1 to 2 weeks) is typically in the low four-figure euro range. A full framework build (3 to 6 weeks) is in the low to mid five-figure range, depending on complexity and test set size.
Are you tied to specific tools or frameworks?
No. We pick per use case from DeepEval, Langfuse, Phoenix, τ²-bench, Promptfoo, or custom builds. We have no affiliate deals - tool choice depends on your stack and requirements.

Let's talk.

Drop me a short note about what you're working on - I'll get back to you within a few days.

Request an eval audit