Revolutionizing AI Safety: The Execution Authorization Boundary

A common misconception in the AI safety discussion is that the primary concerns lie in the prompt layer. However, when dealing with autonomous agents, the real risks emerge at the execution layer.

Most current conversations focus on prompt alignment, jailbreaks, output filtering, and sandboxing. While these aspects are important, they do not address the actual dangers that arise when agents interact with real-world systems.

The primary goal is to prevent today’s tool-using agents from causing unintentional harm, such as depleting API budgets, spawning runaway loops, or over-provisioning infrastructure. An agent does not need to be malicious to cause problems; it only needs permission to act.

To mitigate these risks, an execution authorization boundary can be implemented. This concept involves a deterministic policy that evaluates proposed actions against the current state, allowing or denying execution accordingly.
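A minimal sketch of such a deterministic check might look like the following. The `Action`, `State`, and `authorize` names are illustrative assumptions, not the API of any particular system; the point is only that the same action evaluated against the same state always yields the same verdict.

```python
from dataclasses import dataclass

# Hypothetical types for illustration; not any project's actual API.
@dataclass
class Action:
    tool: str
    cost: float              # estimated cost of this tool call
    destructive: bool = False

@dataclass
class State:
    spent_today: float = 0.0
    daily_budget: float = 10.0

def authorize(action: Action, state: State) -> bool:
    """Deterministic: identical (action, state) pairs always get the same verdict."""
    if state.spent_today + action.cost > state.daily_budget:
        return False         # would exceed the daily budget
    if action.destructive:
        return False         # destructive actions are denied by default
    return True

state = State(spent_today=9.5)
print(authorize(Action("search", cost=0.2), state))  # True
print(authorize(Action("deploy", cost=1.0), state))  # False: over budget
```

Because the check is a pure function of the proposed action and the current state, its verdicts can be replayed and audited after the fact.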

The authorization boundary operates as follows: the agent runtime proposes an action, which is then evaluated by the authorization check. If the action is allowed, the system generates a cryptographically verifiable authorization artifact. If denied, the action is never executed.
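One simple way to make an authorization artifact cryptographically verifiable is an HMAC over the canonical form of the approved action, so that any executor holding the key can confirm the artifact was issued by the policy and has not been altered. This is a sketch under that assumption; the helper names are invented for illustration.

```python
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # in practice, a key protected by the policy service

def issue_artifact(action: dict) -> dict:
    """Sign the canonical JSON form of an allowed action."""
    payload = json.dumps(action, sort_keys=True).encode()
    tag = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return {"action": action, "sig": tag}

def verify_artifact(artifact: dict) -> bool:
    """Executors verify the signature before performing the action."""
    payload = json.dumps(artifact["action"], sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, artifact["sig"])

art = issue_artifact({"tool": "http_get", "url": "https://example.com"})
print(verify_artifact(art))                 # True
art["action"]["url"] = "https://evil.example"  # tampering breaks the signature
print(verify_artifact(art))                 # False
```

Canonicalizing with `sort_keys=True` matters: the signer and the verifier must serialize the action identically, or valid artifacts would fail verification.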

Example rules for this policy might include daily tool budgets, limits on concurrent tool calls, requirements for explicit confirmation of destructive actions, and rejection of replayed actions.
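Those four rule families could be combined into a single stateful check, as in the sketch below. The class and field names are assumptions chosen for readability, not a real policy language.

```python
class Policy:
    """Illustrative rule set: budget, concurrency, confirmation, replay."""

    def __init__(self, daily_budget: int = 100, max_concurrent: int = 4):
        self.daily_budget = daily_budget
        self.max_concurrent = max_concurrent
        self.calls_today = 0
        self.in_flight = 0
        self.seen_nonces: set[str] = set()

    def check(self, action: dict) -> tuple[bool, str]:
        if action["nonce"] in self.seen_nonces:
            return False, "replayed action"
        if self.calls_today >= self.daily_budget:
            return False, "daily tool budget exhausted"
        if self.in_flight >= self.max_concurrent:
            return False, "too many concurrent tool calls"
        if action.get("destructive") and not action.get("confirmed"):
            return False, "destructive action requires explicit confirmation"
        self.seen_nonces.add(action["nonce"])
        self.calls_today += 1
        return True, "ok"

p = Policy()
print(p.check({"tool": "rm_bucket", "nonce": "n1", "destructive": True}))
# (False, 'destructive action requires explicit confirmation')
print(p.check({"tool": "list_files", "nonce": "n2"}))  # (True, 'ok')
print(p.check({"tool": "list_files", "nonce": "n2"}))  # (False, 'replayed action')
```

Each rule is an ordinary conditional over explicit counters, which is what makes the policy deterministic and easy to audit.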

An open-source project called OxDeAI has been experimenting with this model, incorporating a deterministic policy engine, cryptographic authorization artifacts, and tamper-evident audit chains.
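A tamper-evident audit chain is typically a hash chain: each log entry commits to the hash of the previous one, so editing or deleting any entry invalidates everything after it. The following is a minimal sketch of that general technique, not OxDeAI's actual implementation.

```python
import hashlib
import json

def append_entry(chain: list, event: dict) -> None:
    """Each entry commits to the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    chain.append({"prev": prev, "event": event,
                  "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify_chain(chain: list) -> bool:
    """Recompute every link; any edit anywhere breaks verification."""
    prev = "0" * 64
    for entry in chain:
        body = json.dumps({"prev": prev, "event": entry["event"]}, sort_keys=True)
        if entry["prev"] != prev or \
           entry["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True

log: list = []
append_entry(log, {"action": "http_get", "verdict": "allow"})
append_entry(log, {"action": "delete_db", "verdict": "deny"})
print(verify_chain(log))             # True
log[0]["event"]["verdict"] = "allow_all"  # tampering is detected
print(verify_chain(log))             # False
```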

The project includes runtime adapters for various AI agents and demonstrates the authorization boundary through a simple scenario: an agent proposes three actions; two are executed, and the third is blocked before any side effects occur.
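The essential property of that scenario is that the side effect of a denied action never runs. A toy sketch of the gating loop, with invented names and a stand-in `allow` predicate:

```python
from typing import Callable

def run_gated(actions: list[str], allow: Callable[[str], bool]) -> tuple[list, list]:
    """Execute only actions the policy allows; denied actions never run."""
    executed, blocked = [], []
    for a in actions:
        if allow(a):
            executed.append(a)   # the real side effect would happen here
        else:
            blocked.append(a)    # denied before any side effect occurs
    return executed, blocked

actions = ["read_file", "search_web", "drop_table"]
executed, blocked = run_gated(actions, allow=lambda a: a != "drop_table")
print(executed)  # ['read_file', 'search_web']
print(blocked)   # ['drop_table']
```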

The OxDeAI repository is available for those interested in exploring this concept further.

The question remains: how are others who build agent systems addressing execution safety? Are policy engines, capability models, sandboxing, or other approaches being used, or is the risk simply being accepted for now?

Photo by RDNE Stock project on Pexels