Prof. Dr. Kay Rottmann

Service · LLM Evaluation

LLM Evaluation for Production Systems

How do you know your LLM-based system actually works? We build evaluation frameworks that make your AI systems measurable - from prompts to complex multi-agent systems.

Consulting and content: Prof. Dr. Kay Rottmann

Professor of Applied AI · HdM Stuttgart · ex-Meta, Bosch, Amazon

Why LLM evaluation is the most underestimated part of any AI project

A model that looks good in a demo isn't a model that works in production. Once you deploy LLMs seriously, you need an answer to two questions: how do we know it's good? And, more importantly: how do we know it's better? This is where most AI projects fail. Not at the model. Not at the prompt. They fail because nobody systematically measures whether the system does what it should. LLM evaluation is the discipline that answers these questions - and it's just as important as the system itself.

What good LLM evaluation includes

Robust LLM evaluation has at least three layers:

- Offline evaluation: test sets with known correct answers, automatically scorable (accuracy, F1, BLEU, ROUGE - depending on the task), plus semantic evaluation with LLM-as-a-judge where deterministic metrics fall short.
- Online evaluation: A/B tests in production that measure real user signals - conversion, satisfaction, efficiency.
- Monitoring & drift detection: continuous observation to catch unexpected changes (model updates, data drift, prompt regressions).

Only together do they give you a reliable picture.
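
To make the offline layer concrete, here is a minimal sketch of such a harness: exact match where the task allows it, an LLM-as-a-judge call where it doesn't. The test set, judge prompt, and model name are illustrative placeholders, and we assume the OpenAI Python SDK as the client - swap in your own stack and data:

```python
# Minimal offline eval sketch. TEST_SET, the judge prompt, and the
# model name are placeholders - in practice the test set comes from
# your real production data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

TEST_SET = [
    {"input": "Cancel my subscription", "expected": "cancellation"},
    {"input": "Where is my package?", "expected": "shipping_status"},
]

def judge(question: str, expected: str, actual: str) -> bool:
    """LLM-as-a-judge: is `actual` semantically equivalent to `expected`?"""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": (
            f"Question: {question}\nExpected: {expected}\nActual: {actual}\n"
            "Do expected and actual match semantically? Answer YES or NO."
        )}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def evaluate(system_under_test) -> float:
    """Accuracy over the test set: exact match first, judge as fallback."""
    passed = 0
    for case in TEST_SET:
        actual = system_under_test(case["input"])
        if actual == case["expected"] or judge(case["input"], case["expected"], actual):
            passed += 1
    return passed / len(TEST_SET)
```

Run on every prompt or model change, this one number already tells you whether a change helped or hurt - that is the core of the offline layer.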

Evaluation for AI agents - where it gets really hard

For simple LLM tasks (classification, summarization), standard metrics are enough. For agents that plan multiple steps, call tools, and self-correct, they're not. You have to measure:

- Does the agent achieve its goal at all?
- Does it follow policies (e.g. "always confirm before deleting")?
- Does it work efficiently (not 50 tool calls when 5 would do)?
- Does it behave consistently across comparable requests?

Hard-won lessons from real production systems flow directly into our consulting.
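
As a sketch of what such checks can look like in code, the functions below score an agent's tool-call trace for policy compliance and efficiency. The trace format and tool names (confirm_with_user, delete_record) are hypothetical - substitute whatever your agent framework actually logs:

```python
# Hedged sketch of trace-level agent checks. The trace format (a list
# of tool-call dicts) and the tool names are hypothetical.

def check_policy(trace: list[dict]) -> bool:
    """Policy: every destructive call must be preceded by a confirmation."""
    confirmed = False
    for step in trace:
        if step["tool"] == "confirm_with_user":
            confirmed = True
        elif step["tool"] == "delete_record":
            if not confirmed:
                return False  # deleted without confirming: violation
            confirmed = False  # one confirmation covers one deletion
    return True

def check_efficiency(trace: list[dict], budget: int = 10) -> bool:
    """Efficiency: flag runs that burn 50 tool calls where 5 would do."""
    return len(trace) <= budget

def score_run(goal_achieved: bool, trace: list[dict]) -> dict:
    """Combine the criteria above into one per-run scorecard."""
    return {
        "goal": goal_achieved,                 # did the agent get there at all?
        "policy": check_policy(trace),         # did it follow the rules?
        "efficient": check_efficiency(trace),  # without excessive tool calls?
    }
```

Consistency, the fourth criterion, then means running comparable requests through the same scorecard and comparing the results.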

Frameworks and tools we typically use

Depending on the use case:

- DeepEval for structured LLM tests with semantic metrics.
- Langfuse or Phoenix (Arize) for observability and trace analysis in production.
- τ²-bench for agent simulation and policy-conformant evaluation.
- Promptfoo for lightweight A/B testing of prompts.
- A custom test set built from your real data - without it, everything else is worthless.

We help you pick the right framework for your situation without locking you into specific vendors.
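
For a flavor of the structured-test route, here is roughly what a DeepEval test can look like (the exact API may differ between versions; my_app is a stand-in for your system):

```python
# Hedged sketch of a DeepEval-style test - check the current DeepEval
# docs for the exact API. `my_app` is a placeholder for your system.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def my_app(question: str) -> str:
    """Placeholder for the LLM system under test."""
    return "Click 'Forgot password' on the login page."

def test_password_reset_answer():
    question = "How do I reset my password?"
    case = LLMTestCase(input=question, actual_output=my_app(question))
    # Judge-based metric: fails the test if relevancy scores below 0.7
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])
```

Because these run as ordinary pytest tests, they slot straight into CI - the same integration we build in the engagements below.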

How we help with LLM evaluation

Three typical engagements:

- Evaluation audit (1 to 2 weeks): we review your existing AI system and tell you what your current eval strategy can and can't catch. Output: concrete recommendations with effort estimates.
- Eval framework build (3 to 6 weeks): we build an automated evaluation framework for your specific use case, including test set construction, metric definition, and CI integration.
- Eval as part of an implementation: every AI implementation we deliver includes an eval framework from day one - not as an afterthought.

What makes this LLM evaluation consulting different

You get a team that has spent years on the production side of LLM evaluation. We built production AI systems at Amazon (Alexa), Bosch, and Meta - and learned the hard way, every time, what happens when eval comes too late. That practitioner perspective is rare in the DACH market.

Frequently asked questions

What's the difference between offline and online evaluation?
Offline evaluation runs on a fixed test set before deployment - automated and fast. Online evaluation runs in production with real users (e.g. via A/B tests) and measures real outcomes - slower, but closer to the truth. You need both.
We just use the ChatGPT API. Do we still need evaluation?
Yes, especially then. If you don't train your own model, your only quality control is evaluation. Otherwise you'll find out from a customer call that the latest OpenAI model update broke something in your prompt.
How many test cases do we need at minimum?
Rule of thumb: at least 50 for a first useful signal, 200+ for statistically reliable statements, 1000+ for edge-case coverage. More important than raw count is quality - real data from your use case with good labels.
Can you build evaluation for an existing system retroactively?
Yes - that's actually one of the most common engagements. Many teams have a production AI system without structured evaluation and only notice after months that quality is slowly degrading. An eval framework can be added retroactively.
How much does building an eval framework cost?
An audit (1 to 2 weeks) is typically in the low four-figure euro range. A full framework build (3 to 6 weeks) is in the low to mid five-figure range, depending on complexity and test set size.
Are you tied to specific tools or frameworks?
No. We pick per use case from DeepEval, Langfuse, Phoenix, τ²-bench, Promptfoo, or custom builds. We have no affiliate deals - tool choice depends on your stack and requirements.

Let's talk.

Drop me a short note about what you're working on - I'll get back to you within a few days.

Request an eval audit