Most AI system security discussions focus on the model. Is the model safe? Is it aligned? Is it being manipulated by adversarial prompts? These are real questions with real research behind them. They are not, however, the primary security question for anyone running a production AI system today.
The primary security question is architectural: what can this system do that you did not intend it to do, and how do you prevent it?
I am not a security researcher. I am someone who has built and runs a 24-agent AI system in commercial production and has had to make deliberate decisions about every dimension of its trust architecture. What follows is what I built, why, and what I think it means for organisations deploying AI agents.
The approval gate principle
The most important security decision in the WAT system is one that looks like an inefficiency: every irreversible action requires explicit human approval before it executes.
This includes: deploying changes to a live website, making changes to shared system files (specifications, workflows, tools), deleting any production file, running batch operations with cost estimates above a defined threshold, and pushing code to production without prior review.
None of these actions are delegated to agents as autonomous decisions. The agents can recommend, prepare, and stage. The execution requires a human yes.
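The fail-closed shape of this gate can be sketched in a few lines. This is an illustrative sketch, not the WAT system's actual implementation; the action labels and function names are hypothetical.

```python
# Hypothetical approval gate: agents may stage any action, but actions
# classified as irreversible fail closed without a named human approver.
IRREVERSIBLE = {"deploy", "delete", "modify_shared_file", "push_production"}

def execute(action, payload, approved_by=None):
    """Run an action. Irreversible actions require a named human approver."""
    if action in IRREVERSIBLE and approved_by is None:
        raise PermissionError(
            f"'{action}' is staged but not approved; human sign-off required"
        )
    # Reversible actions, or approved irreversible ones, proceed.
    return f"{action} executed"
```

The important property is the default: absence of approval blocks execution, rather than presence of a veto stopping it.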
This looks like it should slow the system down significantly. It does, marginally. What it buys is something more valuable than speed: it limits the blast radius of a specification failure, a hallucination, or a boundary violation to the recoverable zone. An agent that stages a deployment incorrectly has not caused an incident. An agent that deploys autonomously has.
The approval gate design is a direct product of my causal analysis background. The question is not “what is this agent likely to do?” The question is “if this agent acts incorrectly on an irreversible action, what is the governance failure that will have enabled it?” The answer — that the system allowed autonomous execution of irreversible actions — is not an acceptable governance failure. So the gates exist.
Scope limits as security controls
Every agent in the system has a defined scope. Scope is not just a capability description — it is an access control.
An agent that is responsible for content writing has no access to the production database. It writes to files and to staging structures. The database access belongs to a different agent with a different scope and different accountability. The content writing agent cannot make a database error, because it cannot reach the database.
This is a technical implementation of the principle of least privilege — grant each component of a system only the access it needs to do its job, and no more. The principle is foundational in information security. It is frequently ignored in AI agent design because it is so easy to give an agent a very broad capability configuration, the equivalent of giving a new employee access to every system in the organisation on day one rather than only the ones they need.
When I run a boundary check on any agent — and the WAT system has a tool for this, boundary_check.py, that audits agent scope against system definitions — the failure mode I am looking for is over-privileged access. An agent with more access than its task requires is an unnecessary attack surface. The fix is scope reduction, not enhanced monitoring.
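The core of that over-privilege check is a simple set difference between what an agent has been granted and what its task definition requires. The sketch below is illustrative only — it is not the actual `boundary_check.py`, and the agent names and capability labels are invented for the example.

```python
# Illustrative over-privilege audit: flag any agent whose granted access
# exceeds what its task definition requires. Not the real boundary_check.py.
def over_privileged(agents):
    findings = []
    for name, spec in agents.items():
        excess = set(spec["granted"]) - set(spec["required"])
        if excess:
            findings.append((name, sorted(excess)))
    return findings

agents = {
    "content_writer": {"required": ["files", "staging"],
                       "granted": ["files", "staging", "database"]},
    "db_agent":       {"required": ["database"], "granted": ["database"]},
}
# over_privileged(agents) → [("content_writer", ["database"])]
```

The finding points directly at the fix: remove "database" from the content writer's grants, rather than adding monitoring around its database use.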
Credential handling
Credentials are always in environment variables. Never in code. Never in agent specifications. Never in workflow files.
This is a non-negotiable rule in the WAT system, and it is stated explicitly in the operating manual. The reason is not that I distrust the agents — it is that code files, specification files, and workflow files are readable by any agent with access to the repository. A credential hardcoded into any of those files is a credential that can be read by any agent, logged in any output, and potentially surfaced in any context where the file is shared.
Environment variables are the correct mechanism because they are not part of the codebase. They exist in the runtime environment and are injected at execution time. Agents can use them without ever seeing or storing the credential value.
This seems obvious. And yet a significant proportion of production AI system security incidents I am aware of involve credentials in code — model API keys, database passwords, webhook tokens — placed there because the developer was moving quickly and it seemed simpler at the time. It is simpler at the time. It is significantly less simple after the incident.
Agent boundary checks and the audit mechanism
Trust in the system is not just a design decision. It is an ongoing operational discipline.
The WAT system includes automated boundary audit tooling that checks agent scope definitions for overlaps, gaps, and changes over time. When a new agent is added or an existing specification is modified, the boundary audit runs to verify that the change does not create an unintended overlap with another agent’s scope or an unintended gap in coverage.
This is a governance mechanism, not a security mechanism in the narrow sense. Its purpose is to prevent two agents from both believing they have authority over the same action — which produces either conflicting outputs (if both act) or a coverage gap (if each defers to the other). Either outcome undermines the accountability principle that makes the team structure function.
The audit also tracks specification changes over time. If an agent’s scope is quietly expanded by a specification update, the boundary audit surfaces the expansion for review. Scope creep in agent definitions is a real phenomenon. An agent specification that starts narrow and accumulates additional responsibilities over multiple updates will eventually have a scope that no longer matches its position in the system’s accountability structure. The audit makes scope drift visible before it becomes a governance problem.
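Overlap and gap detection reduces to bookkeeping over who claims which actions. A minimal sketch, assuming scopes and the required coverage are declared as sets of action names (the declarations and names are hypothetical, not the WAT audit tool itself):

```python
# Illustrative overlap/gap audit over declared agent scopes.
def audit(scopes, required_actions):
    claimed = {}
    for agent, actions in scopes.items():
        for action in actions:
            claimed.setdefault(action, []).append(agent)
    # Overlap: two agents both believe they own the same action.
    overlaps = {a: owners for a, owners in claimed.items() if len(owners) > 1}
    # Gap: a required action that no agent's scope covers.
    gaps = required_actions - set(claimed)
    return overlaps, gaps
```

Running the same audit against successive versions of a specification is what makes scope drift visible: an action that appears in an agent's claimed set today but not last month is an expansion that should have been reviewed.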
Trust between agents
One aspect of multi-agent trust architecture that is rarely discussed: agents in a team trust each other’s outputs exactly as far as their interfaces define, and no further.
In the WAT content pipeline, the data collection agent provides data to the content writing agent. The content writing agent trusts that the data is what the data collection agent’s output format specifies it to be. It does not verify the data independently. Verifying the data is not in its scope — it is in the QA agent’s scope.
This trust structure is by design. Each agent trusts what the upstream agent’s defined interface guarantees. The QA agent’s role is precisely to verify the things that the production agents are not verifying because verification is outside their individual scope.
The point of failure would be if the content writing agent decided, as an act of helpfulness, to verify the data independently rather than trusting the interface. This sounds like it would improve quality. What it actually does is duplicate work, introduce inconsistency (two agents may evaluate the same data differently), and undermine the accountability structure. If the content writing agent verifies data and finds a problem, what does it do? It is not in its scope to fix data. It is now in a governance gap.
Trust is scoped. The agent trusts what it is supposed to trust. Verification happens at the agent whose job it is to verify.
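The division can be sketched as two distinct checks: the downstream agent validates only that the upstream output matches the declared interface shape, while content verification lives with the QA agent. The contract fields and checks below are invented for illustration.

```python
# Scoped trust sketch: the writer checks shape against the declared
# interface; verifying the content itself is the QA agent's job.
INTERFACE = {"source": str, "rows": list}   # hypothetical upstream contract

def writer_accepts(payload):
    """Writer-side check: the interface shape only, never the content."""
    return all(isinstance(payload.get(k), t) for k, t in INTERFACE.items())

def qa_verify(payload):
    """QA-side check: the content itself, e.g. no empty rows."""
    return writer_accepts(payload) and all(payload["rows"])
```

A payload with the right shape but bad content passes the writer's check and fails QA's, which is exactly the accountability split the pipeline defines.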
What this architecture is not
I want to be clear about the limitations of this approach.
It is not adversarial security hardening. The WAT system is not designed to defend against an actively adversarial attack on the agents themselves — prompt injection at scale, model manipulation, coordinated jailbreaking. That is a different threat model, requires different mitigations, and is a problem primarily relevant to organisations where AI systems are exposed to untrusted inputs from external parties.
It is a governance security architecture: a set of structural decisions that prevent unintended or unauthorised actions by the system’s own components under normal operating conditions. The threat model is human specification error, agent scope ambiguity, and cascade failures in a multi-agent pipeline — not external attack.
For most organisations deploying AI agents today, the governance security architecture is the more pressing problem. External adversarial attacks require a sophisticated attacker with specific motivation. Specification errors and scope ambiguity are built into every AI system deployed without deliberate architectural attention to trust.
For boards
When a board approves an AI deployment, the trust architecture question is: what can this system do without human authorisation, and how do we know?
The answer should identify: which actions require approval, what credentials the system holds and how they are managed, what scope limits are in place and how they are enforced, and what audit mechanism verifies that those controls are functioning over time.
If the answer is a description of what the system is designed to do, rather than a demonstration of what it is prevented from doing autonomously, the governance review is not complete.
The Board AI Governance Framework includes a trust architecture review framework — the questions boards should ask about approval gates, access controls, and audit mechanisms before approving AI agent deployments.
For organisations reviewing the security and governance architecture of existing AI systems, contact Steven directly.