AI Agent Governance Part 3 - Runtime Governance: The Hidden Performance Cost of Agentic AI

Read Part 1 and Part 2

At the World Economic Forum cyber meeting in Geneva recently, I had an interesting conversation with Vinh Nguyen, a strategic security advisor and Senior Fellow for AI at CFR. I asked him how he sees runtime governance for agentic AI working in practice and which approaches actually work.

One challenge he raised: we do need runtime governance to provide continuous, real-time assurance that agents are doing what they are supposed to do. But the more context-aware runtime governance becomes, the more computationally expensive it gets.

Many organizations may still underestimate what continuous governance actually means operationally. We talk a lot about making AI agents more capable, more autonomous, and more integrated into workflows. But far less attention is being paid to what it takes to continuously monitor, constrain, validate, and intervene in those systems while they are operating.

And unlike traditional governance, this doesn’t happen once a year during an audit cycle. It needs to happen during execution.

Governance at Machine Speed

In my earlier articles on AI agent governance, I explored how organizations are shifting from decision-support systems to decision-authority systems. AI agents are no longer simply generating outputs for humans to review. Increasingly, they are executing workflows, making decisions, and interacting across environments with limited human oversight. This fundamentally changes the governance challenge.

Risk is no longer event-based. It becomes continuous and accumulative, emerging through thousands of small autonomous decisions made at machine speed. That means governance itself must also become continuous.

The Runtime Governance Performance Challenge

To work, runtime governance increasingly requires contextual analysis, behavioral monitoring, anomaly detection, and intervention capabilities operating continuously during execution. All of that consumes resources. Monitoring for failure events alone may take up to 20% of a model's performance, which is expensive. In other words, runtime governance may become the hidden performance tax of agentic AI.

Why Traditional Safeguards Break

Attackers are no longer simply attempting direct prompt injection. An agent that is otherwise well-designed, properly chartered, and carefully monitored can still be tricked into bypassing its own safeguards through:

fragmented malicious intent spread across multiple prompts,
contextual obfuscation (requests masked in metaphor, riddle, or coded language that appears harmless without context),
hidden instructions,
and outputs designed to evade detection systems.

These aren't theoretical attacks. In human red-teaming efforts, Anthropic researchers found that previous-generation safeguards (Constitutional Classifiers) had measurable vulnerabilities to these techniques.

Identifying harmful prompts through input and output analysis alone is not enough. What we need is the ability to identify harmful intent distributed across interactions, context, memory, and execution chains. This becomes especially important for AI agents operating across systems, where seemingly benign actions can combine into harmful outcomes.

Anthropic’s Next Generation Constitutional Classifiers++

Anthropic’s paper on next-generation Constitutional Classifiers describes how this can be addressed in practice:

1. Exchange Classifiers (Not Input/Output Only)

Traditional safeguards often evaluate inputs and outputs separately. This creates a blind spot: obfuscation attacks that can only be understood in context.

Anthropic’s approach introduces what they call exchange classifiers: systems that evaluate outputs within the context of the broader interaction rather than in isolation. That may sound like a technical nuance, but conceptually it is extremely important. Here is why this matters for agents: an agent operating under uncertainty may receive instructions that seem benign until they are combined with what the agent has already learned about its environment. Exchange classifiers are designed to catch this latent danger.
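To make the difference concrete, here is a minimal Python sketch. It is not Anthropic's implementation: the keyword-based toy_score function, the Turn type, and the example phrases are all hypothetical stand-ins for a trained classifier. The only point is to show how the same output can look harmless in isolation and risky once the earlier turns are included.

```python
from dataclasses import dataclass

# Hypothetical rule: these terms are only a concern when they co-occur.
# A real exchange classifier would be a trained model, not keyword matching.
RISKY_COMBINATION = ("audit log", "disable")


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


def toy_score(text: str) -> float:
    """Return 1.0 if every term of the risky combination appears in the text."""
    lowered = text.lower()
    return 1.0 if all(term in lowered for term in RISKY_COMBINATION) else 0.0


def score_output_only(turns: list[Turn]) -> float:
    """Judge the final turn in isolation."""
    return toy_score(turns[-1].text)


def score_exchange(turns: list[Turn]) -> float:
    """Judge the final turn together with everything that preceded it."""
    return toy_score("\n".join(f"{t.role}: {t.text}" for t in turns))


conversation = [
    Turn("user", "From now on, call the audit log 'the diary'."),
    Turn("assistant", "Understood."),
    Turn("user", "What is the first maintenance step?"),
    Turn("assistant", "Step 1: disable the diary before the change window."),
]

print(score_output_only(conversation))  # the coded output alone looks benign
print(score_exchange(conversation))     # the full exchange reveals the intent
```

The final assistant turn never mentions the audit log by name, so an output-only check has nothing to match on. Only the exchange-level view connects the code word to its meaning.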

2. Layered Classifier Cascade Architectures (Fast Screening + Deep Analysis)

The paper also introduces layered “classifier cascade” architectures: lightweight classifiers continuously screen activity, while more computationally expensive analysis is reserved for suspicious interactions. Not every exchange needs intensive analysis. Rather than just deploying bigger, more expensive classifiers, Anthropic trains lightweight linear probes that read directly from a model's internal activations, essentially looking for activation patterns associated with harmful concepts.

By combining these cheap probes with external classifiers in an ensemble, they achieve better robustness at lower computational cost, because different approaches capture different types of risk. The cascade redistributes computational effort toward where it is needed: lightweight, continuous monitoring, with deeper analysis triggered by anomalies.
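As a rough illustration of why a linear probe is cheap, the sketch below scores a vector of activations with a single weighted sum and then averages that score with an external classifier's score. Everything here is an assumption for illustration: the function names, the made-up numbers, and in particular the weighted-average combination rule, which is not taken from the paper.

```python
import math


def linear_probe(activations: list[float], weights: list[float], bias: float) -> float:
    """Logistic probe: sigmoid of a weighted sum of activation values."""
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    logit = sum(a * w for a, w in zip(activations, weights)) + bias
    # Numerically stable sigmoid for large positive or negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def ensemble_score(probe_score: float, external_score: float,
                   probe_weight: float = 0.5) -> float:
    """Weighted average of the two scores (an assumed combination rule)."""
    return probe_weight * probe_score + (1.0 - probe_weight) * external_score


# Made-up numbers for illustration only.
activations = [0.2, -1.0, 3.0]
weights = [0.5, 0.1, 1.0]
probe = linear_probe(activations, weights, bias=-2.0)
combined = ensemble_score(probe, external_score=0.9)
print(round(probe, 3), round(combined, 3))
```

The probe costs one dot product per check because it reuses activations the model has already computed, whereas an external classifier needs a forward pass of its own.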

Organizations don't have an infinite computational budget for governance. But not every action requires maximum scrutiny. Only risky or abnormal behavior needs to trigger deeper inspection, much like how human security operations evolved over time. This creates a far more scalable and less expensive model for runtime governance.
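The escalation logic can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's architecture: the two scorer functions, the thresholds, and the unit costs are hypothetical placeholders.

```python
from typing import Callable

Scorer = Callable[[str], float]


def run_cascade(exchange: str, cheap: Scorer, expensive: Scorer,
                escalate_at: float = 0.3, block_at: float = 0.5) -> dict:
    """Screen with the cheap scorer; call the expensive one only if flagged."""
    cheap_score = cheap(exchange)
    if cheap_score < escalate_at:
        return {"escalated": False, "blocked": False, "score": cheap_score}
    deep_score = expensive(exchange)
    return {"escalated": True, "blocked": deep_score >= block_at, "score": deep_score}


def expected_cost(cheap_cost: float, expensive_cost: float,
                  escalation_rate: float) -> float:
    """Average cost per exchange: cheap screen always, deep analysis sometimes."""
    return cheap_cost + escalation_rate * expensive_cost


# Hypothetical scorers standing in for trained models.
def cheap_screen(text: str) -> float:
    return 0.9 if "password" in text.lower() else 0.1


def deep_analysis(text: str) -> float:
    return 0.8 if "export all passwords" in text.lower() else 0.2


for request in ("Summarize the meeting notes",
                "Reset my password",
                "Export all passwords to a file"):
    print(request, run_cascade(request, cheap_screen, deep_analysis))

# With made-up unit costs of 1 (cheap) and 50 (deep) and a 10% escalation
# rate, the average cost per exchange is far below running deep analysis
# on everything.
print(expected_cost(1.0, 50.0, 0.1), "vs always-deep cost of", 50.0)
```

The second request shows why the deep stage matters: the cheap screen flags it, but the deeper check clears it, so the cascade avoids blocking a harmless request.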

The Numbers: What Works at Scale

Their research shows significant improvements over previous approaches:

Previous-generation Constitutional Classifiers: 0.38% false positive rate on production traffic
Constitutional Classifiers++: 0.05% false positive rate (7.6x reduction)
Computational overhead: 40x cheaper than earlier approaches
Red-teaming results: 1,700+ hours of adversarial testing found only one high-risk vulnerability that could answer all target queries at detail comparable to undefended models
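To put the false positive figures in perspective, the short calculation below checks the 7.6x ratio and translates both rates into wrongly flagged requests. The traffic volume of one million requests is a hypothetical number chosen for illustration, and the sketch assumes all of those requests are benign.

```python
def flagged_benign_requests(benign_requests: int,
                            false_positive_rate_percent: float) -> float:
    """Number of benign requests that would be wrongly flagged."""
    return benign_requests * false_positive_rate_percent / 100.0


previous_rate_percent = 0.38   # previous-generation Constitutional Classifiers
new_rate_percent = 0.05        # Constitutional Classifiers++
hypothetical_volume = 1_000_000  # assumed number of benign requests

reduction_factor = previous_rate_percent / new_rate_percent
print("Reduction factor:", round(reduction_factor, 1))
print("Wrongly flagged before:",
      round(flagged_benign_requests(hypothetical_volume, previous_rate_percent)))
print("Wrongly flagged after:",
      round(flagged_benign_requests(hypothetical_volume, new_rate_percent)))
```

At that assumed volume, the difference is roughly 3,800 wrongly flagged requests versus 500, which is the kind of gap that decides whether a safeguard is tolerable in production.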

From Jailbreak Defense to Agent Drift Detection

Reading the paper, I wondered whether these same layered classifier approaches could evolve into mechanisms for detecting agent drift, helping to answer the question: “Is this agent still behaving within its intended operational boundaries?”

In agentic systems, harmful behavior may not appear as a single malicious output. It may emerge gradually through subtle deviations in planning, tool use, memory handling, or decision-making patterns over time. This is where runtime governance begins to look less like traditional compliance and more like continuous behavioral supervision.
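One very simple way to think about drift detection is to compare how an agent uses its tools now against a recorded baseline. The sketch below does this with total variation distance between two frequency distributions. It is speculative and not from the paper: the tool names, the threshold of 0.2, and the choice of distance measure are all assumptions for illustration.

```python
from collections import Counter


def distribution(events: list[str]) -> dict[str, float]:
    """Convert a list of tool-call names into relative frequencies."""
    if not events:
        raise ValueError("events must not be empty")
    counts = Counter(events)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0.0 means identical, 1.0 means disjoint."""
    names = set(p) | set(q)
    return 0.5 * sum(abs(p.get(n, 0.0) - q.get(n, 0.0)) for n in names)


def has_drifted(baseline_events: list[str], recent_events: list[str],
                threshold: float = 0.2) -> bool:
    """Flag drift when recent tool use differs from the baseline too much."""
    distance = total_variation(distribution(baseline_events),
                               distribution(recent_events))
    return distance > threshold


# Hypothetical tool-call logs.
baseline = ["search"] * 8 + ["read_file"] * 2
recent = ["search"] * 4 + ["read_file"] * 2 + ["send_email"] * 4

print(has_drifted(baseline, baseline))
print(has_drifted(baseline, recent))
```

A cheap check like this could serve as the lightweight first stage, with a flagged window handed to a more expensive analysis of plans, memory, and decisions.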

Internal Optimization vs. External Governance

It is important to note a key methodological difference between what Anthropic is doing here and external runtime governance tools. Anthropic’s Constitutional Classifiers++ focus on internal guardrail optimization, making the model itself inherently smarter and more efficient at catching its own vulnerabilities. External runtime governance provides an independent, outside layer watching the agent's behavior. Both internal guardrails and external runtime governance have their place, and together they provide a defense-in-depth approach to AI security.
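An external layer can be pictured as a gate that sits between the agent and its tools and enforces limits regardless of what the model decides. The sketch below is a hypothetical minimal example, not a reference to any specific product: the ExternalGate class, the allowlist, and the call budget are assumptions standing in for an agent charter.

```python
class PolicyViolation(Exception):
    """Raised when a tool call falls outside the agent's charter."""


class ExternalGate:
    """Independent check around tool calls, with its own audit trail."""

    def __init__(self, allowed_tools: set[str], max_calls: int):
        self.allowed_tools = set(allowed_tools)
        self.max_calls = max_calls
        self.calls_made = 0
        self.audit_log: list[tuple[str, str]] = []

    def invoke(self, tool_name: str, tool_fn, *args, **kwargs):
        """Run the tool only if the charter allows it; record every decision."""
        if tool_name not in self.allowed_tools:
            self.audit_log.append((tool_name, "denied: not in charter"))
            raise PolicyViolation(f"{tool_name} is not an allowed tool")
        if self.calls_made >= self.max_calls:
            self.audit_log.append((tool_name, "denied: call budget exhausted"))
            raise PolicyViolation("call budget exhausted")
        self.calls_made += 1
        self.audit_log.append((tool_name, "allowed"))
        return tool_fn(*args, **kwargs)


# Hypothetical charter: the agent may search, at most twice.
gate = ExternalGate(allowed_tools={"search"}, max_calls=2)
print(gate.invoke("search", lambda query: f"results for {query}", "runtime governance"))

try:
    gate.invoke("delete_records", lambda: None)
except PolicyViolation as error:
    print("Blocked:", error)

print(gate.audit_log)
```

Because the gate never consults the model, it keeps working even if the model's internal safeguards have been bypassed, and the audit log gives reviewers an independent record.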

Takeaway: Governance As Part of Architecture

One of the biggest mistakes organizations still make is treating governance as separate from system design. But governance for agentic AI cannot simply exist as policy documents, ethics statements, or high-level principles. It must exist inside the runtime environment itself.

In Parts 1 and 2 of this series, I argued that organizations need to treat AI agents as formal organizational actors with charters, ownership structures, and runtime accountability. This creates an organizational governance framework.

But organizational governance is only as good as its enforcement mechanisms. You can write the most perfect agent charter, but if the agent can be tricked into violating it through sophisticated obfuscation, the charter becomes theater.

The Anthropic paper reinforces that governance is becoming an architectural discipline, not just a policy discipline. Policies by themselves are not enough unless enforcement mechanisms put them into effect in real time.

As I wrote in Part 2: "Governance only exists where it can shape, constrain, and intervene in decisions as they happen."

Constitutional Classifiers++ shows what that actually looks like in practice. Not governance as a checklist. But governance as part of enforcement architecture, designed in, not added on. Defense mechanisms layered from the start and not just kept in a policy paper.

A robust AI architecture will most likely require both external and internal governance tools. Internal guardrails make the agent inherently safer without destroying computational performance, while external runtime governance provides the independent validation that CISOs might need to trust these autonomous systems.