AI EVAL
ai-eval.ts
Golden datasets, scoring, regression detection.
WHAT THIS PATTERN TEACHES
How to build evaluation suites for AI features: golden datasets (input/expected-output pairs), pluggable scoring functions, and regression detection across prompt versions. Eval results are stored as history so runs can be compared over time.
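"Pluggable scoring" means the pass/fail rule is a function you swap in, not hard-coded equality. A minimal sketch of what such scorers can look like (the `Scorer` type and both function names are illustrative, not part of this pattern's actual API):

```typescript
// Illustrative type: a scorer maps (expected, actual) to a number in [0, 1].
type Scorer = (expected: string, actual: string) => number;

// Strict equality: 1 for an exact match, 0 otherwise.
const exactMatch: Scorer = (expected, actual) =>
  expected === actual ? 1 : 0;

// Case-insensitive containment: more forgiving for free-form model output.
const contains: Scorer = (expected, actual) =>
  actual.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;
```

Exact match suits classification-style outputs; containment or semantic-similarity scorers suit generative outputs where wording varies.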
WHEN TO USE THIS
Before shipping any AI-powered feature. Run evals on every prompt change to catch regressions before production.
AT A GLANCE
const suite = new EvalSuite('classifier-v2', [
  { input: 'refund please', expected: 'refund' },
  { input: 'where is my order', expected: 'status' },
]);
await suite.run(classifyFn);
FRAMEWORK IMPLEMENTATIONS
TypeScript
interface EvalCase {
  input: string;
  expected: string;
}

class EvalSuite {
  constructor(
    private name: string,
    private cases: EvalCase[],
  ) {}
  async run(fn: (input: string) => Promise<string>) {
    const results = await Promise.all(
      this.cases.map(async (c) => {
        // Call the function under test once per case and reuse the result;
        // calling it twice would double cost and could yield inconsistent
        // pass/fail verdicts with nondeterministic models.
        const actual = await fn(c.input);
        return { ...c, actual, pass: actual === c.expected };
      })
    );
    const score = results.filter(r => r.pass).length / results.length;
    return { name: this.name, score, results };
  }
}
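With per-run scores in hand, regression detection reduces to comparing a new score against the last stored one. A hedged sketch, using a simple in-memory history keyed by suite name (a real setup would persist runs to a database; `recordRun` and `tolerance` are illustrative names, not from this codebase):

```typescript
// Illustrative history store: suite name -> scores of past runs, in order.
const history = new Map<string, number[]>();

// Record a run's score and report whether it regressed: true when the new
// score falls more than `tolerance` below the previous run's score.
function recordRun(suite: string, score: number, tolerance = 0.01): boolean {
  const past = history.get(suite) ?? [];
  const previous = past[past.length - 1];
  past.push(score);
  history.set(suite, past);
  return previous !== undefined && score < previous - tolerance;
}
```

Wiring this into CI (fail the build when `recordRun` returns true) is what catches prompt regressions before they reach production.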