How We Built a Quality Scoring System for a 24-Agent AI Team

The first question anyone asks when I describe running a 24-agent AI system is usually: “How do you know they are doing a good job?”

It is the right question. And the honest answer, when we built the first version, was: we mostly did not. We could tell whether tasks completed. We could see whether outputs arrived. What we could not do — systematically, repeatedly, across agents — was assess the quality of the work independent of whether the task technically finished.

This is not a small gap. A completed task with a poor output is worse than a failed task with a clear error, because failed tasks surface immediately and prompt investigation. Poor-quality completions can accumulate for weeks before the pattern is visible.


Fixing this required building an evaluation architecture. What follows is what we built, why we built it the way we did, and what we would do differently if we were starting again.


Why “did it work” is not an evaluation system

The temptation in any automated system is to define success as completion. The task ran. No errors were returned. The output file was created. Tick.

This produces systems that are very good at completing tasks and not reliably good at producing useful outputs. I have seen this in enterprise software — including during my years at RiverSoft, SMARTS, and Voyence, where root cause analysis tools could reliably complete their analysis algorithms whilst occasionally producing diagnostics that an experienced network engineer would immediately recognise as implausible. The algorithm ran. The output was wrong. The system reported success.

For AI agents, the problem is more acute. The outputs are language-based and contextual. There is no simple pass/fail criterion. “Did the content writing agent produce a draft?” is a binary question. “Did the content writing agent produce a draft that matches the brief, avoids banned patterns, maintains the correct voice, and contains no fabricated claims?” is an evaluation question. The two questions produce different information.

We needed an evaluation system, not a completion tracker.


The three-dimension scoring model

After several iterations, we settled on three scoring dimensions for every agent output. Each is scored on a 1-10 scale by the receiving agent or by the human reviewer:

Accuracy. Factual correctness. No hallucinations. Numbers, names, dates, and claims are correct or explicitly flagged as uncertain. For a content writing agent, this means the facts used in a draft can be traced back to the data signals provided. For a data collection agent, it means the collected data matches the source.

Completeness. All requirements were met. Nothing was skipped. The output addresses every element of the brief, every step of the workflow, every output specified by the task. A score of 10 means the receiving agent or reviewer can use the output without chasing down missing pieces.

Usability. The output can be used without rework by the receiving agent or person. This is the dimension that catches technically correct outputs that are nonetheless poorly structured, ambiguously framed, or formatted in a way that creates unnecessary work downstream. A score of 10 means the output is ready for the next stage without modification.

The three dimensions are independent. An output can be accurate but incomplete — all the facts are right, but the brief asked for five sections and three were produced. It can be complete but score low on usability — every section is present, but the structure makes it difficult for the downstream agent to extract what it needs. The three-dimension model catches what a binary pass/fail does not.
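The model is small enough to sketch as a record type. This is an illustrative sketch, not our production code; in particular, aggregating by the weakest dimension (rather than the average) is my reading of the "8.0 or above across all three dimensions" promotion rule described below, not a rule stated in the scoring model itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QualityScore:
    """One evaluation of a single agent output, on the three dimensions."""
    accuracy: int      # 1-10: factually correct, no fabricated claims
    completeness: int  # 1-10: every element of the brief was addressed
    usability: int     # 1-10: usable downstream without rework

    def __post_init__(self):
        for name, value in (("accuracy", self.accuracy),
                            ("completeness", self.completeness),
                            ("usability", self.usability)):
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    def minimum(self) -> int:
        # The weakest dimension, not the average, drives decisions:
        # an accurate but unusable output still means rework downstream.
        return min(self.accuracy, self.completeness, self.usability)
```

Keeping the dimensions as separate fields, rather than collapsing them into one number at logging time, is what makes the later pattern analysis possible: "this agent's accuracy is fine but its usability is slipping" is only visible if the raw dimensions survive.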


The exemplar library

Scoring alone produces data. The training loop requires that the data close back onto the agents being scored.

When a run scores 8.0 or above across all three dimensions, it can be promoted to the exemplar library. An exemplar is a reference output — a real delivered piece of work that met the quality bar, preserved in the library as a calibration target for future work of the same type.

The exemplar library serves a specific function: before producing output, an agent reads the exemplar for its task type. Not as a template to copy from, but as a quality calibration. What does a well-executed content brief look like? What does a well-structured data collection output look like? The agent compares its planned approach against the exemplar and adjusts before delivering.

This sounds straightforward. The discipline required to maintain it is less straightforward. Exemplars must be task-type-specific, not generic. A content writing exemplar for a city guide is not useful calibration for a board governance post. We maintain exemplar libraries indexed by agent and by task type, and we only promote to exemplar when the task type and quality dimensions are both clearly logged.

The anti-exemplar library is, if anything, more important. When a run fails — an agent produces a bad output, a known failure pattern recurs, a boundary is crossed that should not have been — we document the failure pattern and add it to the anti-exemplar library. Future runs start by reading not just exemplars, but anti-exemplars for the task type. The agent knows not just what good looks like, but what its specific known failure modes look like.
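The promotion and retrieval rules above can be sketched as follows. The 8-across-all-dimensions threshold and the indexing by agent and task type are from the text; the class and method names are hypothetical, and the real library stores rather more context per entry than a single string.

```python
from collections import defaultdict

PROMOTION_THRESHOLD = 8  # promote only when every dimension meets the bar

class ExemplarLibrary:
    """Exemplars and anti-exemplars, indexed by (agent, task_type)."""

    def __init__(self):
        self.exemplars = defaultdict(list)
        self.anti_exemplars = defaultdict(list)

    def maybe_promote(self, agent, task_type, output, scores):
        # scores: {"accuracy": int, "completeness": int, "usability": int}.
        # Promotion requires every dimension at or above the bar, and the
        # task type to be logged -- a generic exemplar is useless calibration.
        if all(v >= PROMOTION_THRESHOLD for v in scores.values()):
            self.exemplars[(agent, task_type)].append(output)
            return True
        return False

    def record_failure(self, agent, task_type, failure_pattern):
        # Documented failure patterns become anti-exemplars for the task type.
        self.anti_exemplars[(agent, task_type)].append(failure_pattern)

    def calibration_context(self, agent, task_type):
        # What an agent reads before producing output: what good looks like,
        # and what its own known failure modes look like.
        return {"exemplars": self.exemplars[(agent, task_type)],
                "anti_exemplars": self.anti_exemplars[(agent, task_type)]}
```

Note that the key is the pair, not the agent alone — a content writing exemplar for a city guide never calibrates a board governance post, even for the same agent.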


The training loop in practice

The full loop runs like this.

An agent produces output. The receiving agent — or, for significant deliverables, the human reviewer — scores it on accuracy, completeness, and usability. The score is logged against the run, the agent, and the task type.

Over time, the scoring data shows two things: which agents perform consistently at or above threshold, and which task types are producing systematic quality gaps. A single low-scoring run may be a one-off. Three low-scoring runs of the same task type from the same agent is a pattern — and patterns get investigated, documented as failure patterns, and result in specification updates.

The specification update is the closing of the loop. The agent’s specification is revised to address the identified failure mode. The anti-exemplar library is updated. Future runs operate against the updated specification.
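The pattern-detection half of the loop is mechanical enough to sketch. The three-runs threshold is from the text; the cutoff for what counts as a "low-scoring" run is my assumption, since the article does not define it. Everything downstream of the returned flag — investigation, the specification update, the anti-exemplar entry — stays with a human.

```python
from collections import defaultdict

LOW_CUTOFF = 7         # assumption: a run is "low" if any dimension falls below this
PATTERN_THRESHOLD = 3  # from the text: three low runs of one task type is a pattern

class ScoreLog:
    """Logs scored runs and flags when one-offs become a pattern."""

    def __init__(self):
        self._low_runs = defaultdict(list)  # (agent, task_type) -> [run_id, ...]

    def log(self, run_id, agent, task_type, scores):
        """Log one scored run; return (agent, task_type) when it tips into a pattern."""
        if min(scores.values()) < LOW_CUTOFF:
            runs = self._low_runs[(agent, task_type)]
            runs.append(run_id)
            if len(runs) == PATTERN_THRESHOLD:
                # A pattern: gets investigated, documented as a failure pattern,
                # and turned into a specification update -- by a human, not here.
                return (agent, task_type)
        return None
```

A single low score passes through silently, which is the point: the system distinguishes a one-off from a recurring gap without anyone eyeballing every run.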

The loop is not automatic. Someone has to review the scoring data, identify the patterns, make the specification changes, and confirm that subsequent runs have improved. In our system, that human is me, with data surfaced by our QA agents. An automated system that closed this loop without human review would produce a system that optimises for score rather than quality — which is a different problem, but not a smaller one.


What we would do differently

Three things, in order of importance.

First: we would have defined the scoring rubric more precisely at the start. “Accuracy 7” is ambiguous until you have calibrated it against a substantial sample of outputs. We spent several months calibrating what each score level meant for each dimension, which could have been accelerated with more upfront work on the rubric definition.

Second: we would have built the anti-exemplar library before the exemplar library, not after it. Knowing what failure looks like is more immediately useful than knowing what success looks like, because failures are common and successes need to be earned. The failure patterns are visible from the first few runs. The exemplars take longer to accumulate.

Third: we would have separated scoring the output from scoring the process. We score the deliverable. We do not yet score the process the agent used to produce it — whether it followed its workflow, read its memory file, ran required validation checks before delivering. These are agent behaviour signals that are independent of output quality but correlated with it. Outputs produced by agents that consistently skip process steps are lower quality on average. Measuring process adherence separately would give earlier warning signals.
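Process-adherence scoring could be as simple as a checklist ratio over the behaviour signals named above. This is a sketch of something we have not built; the signal names are taken from the text, but the flat-dictionary run record and the unweighted ratio are assumptions.

```python
# Hypothetical process signals, scored separately from the output itself.
PROCESS_STEPS = ("followed_workflow", "read_memory_file", "ran_validation_checks")

def process_adherence(run_record):
    """Fraction of required process steps the agent actually performed.

    run_record is assumed to be a dict of boolean flags; a missing
    flag counts as the step having been skipped.
    """
    done = sum(1 for step in PROCESS_STEPS if run_record.get(step, False))
    return done / len(PROCESS_STEPS)
```

Because the measure is independent of the output scores, a falling adherence ratio can flag an agent before its deliverables degrade — which is exactly the earlier warning signal the output-only model lacks.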


Why boards care about this

I write for practitioners and for board-level decision-makers. The evaluation architecture question is not just an engineering concern.

If your organisation is deploying AI agents in production — and the category of “AI agents” is expanding rapidly to include autonomous decision support, contract review, customer communications, and financial analysis — the board’s governance question is: how do you know they are doing good work?

“We monitor error rates” is a completion metric, not a quality metric. “We have an evaluation framework with defined dimensions, scored outputs, and a documented improvement loop” is a governance answer. The distinction matters when the AI system produces a consequential output that was technically correct but substantively wrong.

The evaluation architecture is not ornamental. It is the mechanism that lets a board ask “how do we know this is working” and receive a specific, auditable answer.


The Board AI Governance Framework includes a section on AI quality evaluation — the questions boards should ask about evaluation architecture before approving AI deployments, and the minimum standards for a credible quality assurance process.

For organisations building or reviewing AI agent systems, contact Steven directly to discuss independent evaluation of your evaluation architecture.

Steven Vaile

Board technology advisor and QSECDEF co-founder. Writes on AI governance, quantum security, and commercial strategy for boards and deep tech founders.