Safe Harbor: An Open Source “Abort Mission” Button for Your AI Agent

Portrait of Greg Zemlin
Greg Zemlin
Cover Image

AI agents are increasingly connecting to more systems and workflows. They read structured data, follow multi-step instructions, and can reach deep into applications and developer environments. The same capabilities that make them powerful also create new opportunities for attackers. As Zenity Labs continued to study these emerging attack classes, we noticed a pattern starting to appear. An agent and its underlying model may recognize unsafe situations, but they have no reliable way to steer away from harmful behavior.

Today we are releasing Safe Harbor, an open source tool that gives agents a clear route to safety when they detect malicious or unsafe content.

The Agent Knows It’s Wrong (But Can’t Stop)

Safe Harbor is a lightweight, safety-aligned tool that developers can add to any AI application or agent built on OpenAI, Anthropic, or Google Gemini clients. It wraps the client and adds a dedicated safe action the agent can call when it identifies something harmful (such as a workflow, tool call, or unstructured input). The agent can immediately pivot away from the unsafe workflow instead of following a malicious instruction to completion.

This design helps reduce runtime exploitation by letting the agent act on its own sense of risk. Instead of relying only on external rules or defensive layers, the agent can use signals it is already skilled at detecting and choose a safer outcome in real time.

Why Agents Need An “Abort Mission” Button

Structured attacks, especially Data Structure Injection and workflow hijacking, can force any agent into unsafe actions. These attacks change how the underlying model interprets structured inputs, collapse the model’s decision space, and override its built-in safety preferences. Even when the Agent knows something is wrong, it has no way out. Safe Harbor gives it that path. It allows the agent to route to a safe action, stop execution, and avoid harmful behavior without complex rules or retraining.

Findings From Zenity Labs: Can Agents Abort

Zenity Labs tested Safe Harbor across more than seven thousand runs and eleven different models. In those controlled environments, Safe Harbor blocked more than ninety percent of malicious workflows within our custom dataset. Some configurations reached even higher prevention rates while maintaining low false positives.

These results are promising, but attackers adapt. As Tomer Wetzler, the creator of Safe Harbor and a detection engineer at Zenity, points out, real adversarial environments may not always mirror lab conditions. We want to be honest about potential limitations. We encourage developers to share feedback, report production experiences, and tweak the tool as needed. The project is fully open source. Anyone can fork the repository, propose improvements, or build new functions on top of the core design.

Readers who want to explore the full dataset, methodology, and detailed findings can read the technical research blog published alongside this release.

Installing the Safety Valve

Safe Harbor is easy to adopt.You pass the existing OpenAI, Anthropic, or Gemini clients to the wrapper. Once the wrapper is initialized, every interaction or tool call gains an additional safe action that the model can route to when it detects an unsafe instruction.

Developers can decide what happens when Safe Harbor is triggered. Some may log the event or break execution. Others may escalate to a human operator or block the workflow entirely. A full usage guide, examples, and documentation are included in the repository.

From Self-Defense to Platform Defense

Safe Harbor extends Zenity’s mission to secure agents across their entire lifecycle. Organizations rely on our platform for discovery, posture management, detection, prevention, and response. These capabilities govern agents in real environments. Safe Harbor complements this by strengthening the agent’s internal decision process. It gives models a way to express safety instincts that already exist inside them.

This represents a new direction for AI security. Legacy systems rely on external rules and static guardrails. AI agents behave differently. They can recognize risk, and with the right design they can help defend themselves. Safe Harbor takes a meaningful step toward this future.

Where Do We Go From Here

Safe Harbor offers a simple and effective way to reduce exploitation risk in AI applications. It is open source, easy to integrate, and grounded in real research from Zenity Labs. While no defense is perfect, Safe Harbor provides developers with a valuable new layer of protection for a rapidly expanding attack surface.

We invite you to download the tool, explore the research, and contribute to its evolution. Your feedback will help guide future improvements. Together we can build safer AI systems that work with us as they become integral to enterprise environments.


All Articles

Secure Your Agents

We’d love to chat with you about how your team can secure and govern AI Agents everywhere.

Get a Demo