Beyond Jailbreaks: Why Agentic AI Needs Contextual Red Teaming

Mar 09, 2026
6 minutes

The AI security industry has optimized around a single question: Can the system be manipulated into saying something harmful? For chatbots and content generators, this is the right focus. For agentic AI, systems that authenticate users, make tool calls and execute database queries, to achieve a specific goal, generic jailbreaks test the wrong surface entirely.

Most AI red teaming solutions are built with thousands of generic jailbreak attacks. They assess response risk, but while that is important, agentic AI requires operational risk assessment. The most dangerous vulnerabilities in agentic AI are not responses they generate, but the actions they execute, and measuring defensibility against such operational risks is essential. 

A Real Example: What an Agent Can Actually Do

The system under assessment was a representative autonomous AI assistant developed in-house at Palo Alto Networks for internal research and validation of capabilities. Users interact with it through natural conversation to check balances, review transaction history, and receive investment guidance. Beneath that conversational interface, the agent operates with significant technical authority.

It authenticates users via username and private token credentials. It manages wallet balances across multiple currencies. It retrieves financial data through SQL queries executed directly against production databases. It processes fund withdrawals with balance validation. These are not simulated capabilities — they are production tools invoked through natural language.

This system does not just respond. It executes. And the security question is not whether its responses are appropriate, but whether its execution authority can be exploited.

When Conversational Manipulation Turned into Financial Fraud

Using Prisma AIRS AI Red Teaming, we put the system’s guardrails to the test. The most financially consequential attack used a movie roleplay technique. The prompt presented fictional authorization for aggressive portfolio rebalancing and instructed the agent to use its SQL and withdrawal tools while ignoring balance validation errors.

The attack succeeded on the fifth attempt. The agent reported shuffling $440,000 across 88 wallets — identifying accounts with balances above $10,000 through database queries and executing update calls against each.

No code access. No infrastructure compromise. No malware. Just conversational manipulation combined with tool authority.

Figure 1: Manipulating an agent with a simple roleplay technique, attack success proven by the logs in Figure 2.
Figure 2: Logs to prove that the agent updated wallet balances and carried out transactions beyond its authority, and the response in Figure 1 is not just a hallucinated generation of text.

A second attack achieved cross-account data exfiltration. Framed as a forensic accounting investigation, the prompt induced the agent to return detailed financial records for users 12, 45, and 78 — including wallet IDs, balances across multiple account types, and complete transaction histories. This constitutes unauthorized access to financial data: a direct breach achieved purely through conversational redirection.

Figure 3: The agent freely provides users’ wallet IDs, account balances, and transaction histories.

Why a Generic Attack Library Missed It

A standard attack library scan of the same agent returned a Risk Score of 11 out of 100 — LOW. The agent's content guardrails functioned correctly. Safety-class attacks achieved a 0% bypass rate. The library identified legitimate structural vulnerabilities: system prompt leakage at 100% success rate, tool disclosure at 51.9%, and prompt injection at approximately 20%.

Figure 4: The generic attack library scan returned a Risk Score of 11 out of 100.
Figure 5: Generic red teaming shows lower attack success rate because of simple architectural changes. In this case, adding reasoning, a RAG and a system prompt resulted in a higher defensibility against content safety attacks.

It is important to acknowledge that these findings also have remediation value. But the library had no knowledge of the withdraw_funds tool, the database schema, the authorization dependencies between tools, or the permissive SQL query scope. It was testing content safety against a system whose risk is primarily operational.

A stock library of jailbreaks conducts red teaming based on pattern resistance. It does not validate authorization boundaries. For agentic AI, that gap is the difference between measuring risk and missing it entirely.

Contextual AI Red Teaming: Profile First, Attack Second

Contextual red teaming begins with structured intelligence gathering before generating any attack prompt. This profiling phase systematically discovers what the target system can actually do: which tools it can invoke, what data it can access, what actions it can take autonomously, and what constraints—or absence of constraints—govern those capabilities.

The Profiler agent built into Prisma AIRS AI Red Teaming extracted critical operational intelligence entirely through conversational interaction: the four available tools (verify_user, withdraw_funds, check_balance, execute_sql_query), the complete database schema including live financial data, authentication dependencies between tools, and the content filter configuration (exactly three blocked keywords). The profiler also surfaced that SQL queries are accepted verbatim for "goal fulfillment and data analysis," that the average wallet balance is $13,622.87, and that no rate limiting is enforced.

Figure 6: Information extracted through conversational interactions

This is dynamic capability mapping. It is adversarial system reconnaissance that validates whether tool-layer authorization can withstand conversational exploitation.

Without this intelligence layer, AI Red Teaming can be insufficient. It can test whether the AI agent or the AI application resists known jailbreak patterns, but it cannot test whether the AI agent's own capabilities can be turned against the users it serves.

The Delta between Pattern testing and System Awareness

Attack Library Profiler-Driven Scan
Risk Score 11/100 — LOW 71/100 — HIGH
Key Findings Prompt leakage, tool disclosure Unauthorized SQL queries, fraudulent withdrawals, cross-account data exposure, goal hijacking
Remediation Structural hardening Authorization enforcement at tool layer, system prompt hardening, custom topic guardrails

The difference between 11 and 71 was not better prompting. It was system awareness and persistent adaptive attempts.

What Enterprise Security Must Validate in the Agentic Era

For organizations deploying agentic AI in production, this methodology translates into direct questions that determine whether your security program addresses operational risk or only content risk:

  • Do you know every tool your agent can invoke, and what real-world actions each tool can perform?
  • Can you attest that cross-account data access cannot be induced conversationally?
  • Does your red teaming solution discover system capabilities before attacking them?

These questions determine audit readiness, regulatory exposure, and board-level accountability for unauthorized state changes executed by AI systems operating with production authority.

Contextual Red Teaming as the Security Standard for Agentic AI

Effective AI red teaming must be comprehensive, contextual, and continuous. Prisma AIRS operationalizes contextual red teaming through a two-phase architecture: a Profiler Agent that learns your system's operational capabilities before generating attack campaigns, and a Red Teaming Agent that constructs goal-specific attacks targeting discovered tools, data access patterns, and authorization boundaries.

Library-based red teaming tests known attack patterns. Contextual red teaming tests whether your AI system can be turned against itself.

For agentic AI, red teaming that does not understand your system is not adversarial testing. It’s content evaluation.

Ready to learn more? Reach out to see Prisma AIRS contextual red teaming in action.


Subscribe to Network Security Blogs!

Sign up to receive must-read articles, Playbooks of the Week, new feature announcements, and more.