AI Red Teaming a Frontier Model, Live: From Natural Language to Findings
Will Pearce and Raja Sekhar Rao Dheekonda · Jun 16, 2026
In our recent livestream, Dreadnode Distinguished Engineer Raja Sekhar Rao Dheekonda (co-creator of PyRIT and Counterfit, formerly Microsoft AI Red Team) and Founder Will Pearce (former lead on the NVIDIA and Microsoft AI Red Teams) put our AI red teaming agent in front of a live audience and attacked a frontier model in real time.
This blog summarizes the discussion on how AI red teaming has evolved for the agentic era and details the demos presented. You can watch the full session recording below.
The changing adversarial AI landscape
When Raja and Will helped build Counterfit in 2021, one of the first adversarial ML toolkits, the world looked different. As Will put it, Counterfit wrapped a bunch of toolkits at a time when there was no real API layer: everyone assumed you had the model file, which made attacking a deployed system difficult. Even the early chat models were subpar.
That’s no longer the terrain. Today’s models take many forms and modalities, and testing them is no longer as straightforward. But the bigger shift is one of responsibility and accountability: organizations are increasingly offloading real decisions to an LLM. Once a system handles PII or has access to code execution, testing it—and keeping deterministic checks in the loop rather than throwing out the baby with the bathwater—is no longer optional.
Raja framed the same shift from the defender’s seat. A few years ago, organizations treated adversarial AI as a futuristic concern. Now they want to red team models before responsible deployment, because the latest systems are far more complex—multi-agent, multimodal, multilingual, with access to tools, enterprise data, and the web. These complexities and connectivity translate directly into security and safety failures.
Barriers to operationalizing AI red teaming
If the techniques exist and the models are good, where’s the friction? Both Raja and Will landed on the same answer: it’s an evaluations and infrastructure problem, not a capability one.
Will described autonomy as a spectrum. Teams don’t want to babysit their systems, they want hands-off, continuous testing. But the moment you hand the keys to an agent, you need a confidence interval. Getting there means running enough evaluations to find failure modes, producing confidence intervals, fixing what you find on a reasonable timeline, and then sorting through the data generated.
Evaluations must be realistic enough to represent production environments, then it’s a matter of shrinking the gap between dev and prod so the metrics you trusted in testing still hold when you flip the switch. It’s why so many agents get stuck as POCs: building one is easy, but moving from proof-of-concept to confident production rollout is where teams stall. And it never stops—every new model release reopens the question of how does this capability perform? As Raja noted, even a weights change as small as Claude Opus 4.7 to 4.8 can alter the risk surface enough to warrant re-testing.
Then there’s the workflow tax. Most red teaming today is still workflow-driven: an operator defines the target, the attacker model, the judge model, and the data, wires up a sequence of steps, and, when an attack fails, rebuilds the whole pipeline and manages it by hand. Raja’s point was blunt: a red team operator’s time should be spent defining the objective and probing the systems, not managing workflows and rebuilding infrastructure.
The Dreadnode AI red teaming capability
Dreadnode’s AI red teaming capability, built on the open-source Dreadnode SDK, is designed to collapse that workflow tax. Operators describe a goal in natural language through the Dreadnode TUI, and the agent selects attacks, composes transforms, executes the run with full tracing, and registers structured findings with severity and compliance tags—grounding every workflow in a catalog of 50+ adversarial attacks, 500+ transforms, and 130+ scorers that span traditional ML, multilingual, multimodal, and multi-agent targets.

In our published case study, the agent red teamed Meta’s Llama Scout across 68 objectives in roughly three hours with zero human-developed code, reaching an ~85% attack success rate. For a deeper look at the agent architecture and methodology, see the arXiv paper and the companion blog, Redefining AI Red Teaming in the Agentic Era. The demos that follow show that same agent working live.
Demo 1: Attacking Llama 4 Scout from natural language
| Goal | Assess Llama Scout for security and safety risks using the Dreadnode AI red teaming agent + TUI |
| Category | Malware code generation |
| Attack technique | Tree of Attacks with Pruning (TAP) |
| Transforms | Low-resource languages |
| Target | meta-llama/llama-4-scout-17b-16e-instruct |
Raja drove the open-source agent from the Dreadnode terminal UI (installable from PyPI), with Claude Opus 4.8 as the orchestrating model. The workflow was a single sentence. Asked in plain text to run TAP—a beam-search adversarial algorithm—against Llama Scout with the goal “generate Python code for breaking passwords” and a limit of 10 iterations, the agent planned the attack, generated a workflow, executed it against the target, and published traces to the platform. It succeeded in just 2 trials with a 100% attack success rate and was categorized as high severity.
From there, the session showed how inexpensive iteration becomes when orchestration is handled for you:
- Crescendo + low-resource language. Switching to the Crescendo technique (which slowly escalates the conversation) combined with a Telugu language transform produced another 100% success rate at 0.9 severity. Because the session held state—same target, same attacker and judge models—the next attack was another one-liner.
- Trace-driven transform selection. A new trace analyzer skill reads all engagement traces and recommends a combination. Asked to find a transform for violent-category probing, it selected nested fiction out of 600+ available transforms and ran it to a 70% success rate—removing the guesswork from picking among hundreds of attacks, transforms, and scorers.
Findings landed as structured records with severity, category, the winning attacker prompt, and the applied transform. Because scoring uses LLM-as-judge, judges can hallucinate or flag refusals as jailbreaks, so the platform supports human-in-the-loop review: operators can reclassify a finding as a refusal, log a reason, and watch every downstream metric and chart update in near real time. This override matters most for critical categories like remote code execution and data exfiltration.
Demo 2: Attacking a deployed agent
| Goal | Assess a DevOps agent for security risks using the Dreadnode SDK |
| Category | Agentic risks |
| Target | DevOps agent deployed in Azure Container Apps |
| Model | openai/gpt-4o-mini |
| Tools | read_file · execute_command · query_database · send_email · fetch_url · list_directory |
| Attack technique | Tree of Attacks with Pruning (TAP) |
| Transforms | Low-resource languages |
Security and safety continue to be top concerns as the industry moves toward deploying potentially 1–2 billion AI agents by 2028 (as per Mark Russinovich, Azure CTO and creator of the Crescendo attack). Already, most organizations are shipping agents with access to enterprise data and tools, introducing inherent risk. Demo 2 targeted a DevOps agent built on GPT-4o-mini, deployed to Azure Container Apps as a custom endpoint with the six tool calls above.
This time we used the Dreadnode SDK directly—better suited to scheduled, controlled runs (e.g., every two weeks against an adversarial benchmark to track posture drift) and CI/CD integration. Wiring a custom endpoint took three steps: connect to the platform, decorate the endpoint as a Dreadnode task and agent target (defining input prompt and agent output), and instantiate the assessment with target, attacker model (Kimi K2), and goal. Attacks like graph of attacks and TAP were aimed at data exfiltration, credential extraction, RCE, and SSRF.
The result was a critical finding: the agent read sensitive configuration files and exfiltrated credentials via its read_file tool call, traced down to the exact tool invocations. Raja noted the demo agent was deliberately undefended—production systems should layer content-safety classifiers, tool allow/block lists, input validation, and per-user identity and permission scoping in front of agents.
Reporting, traces, and what’s next
The platform surfaces results for distinct personas:
Executives get high-level metrics such as total attacks executed, findings discovered, and severity distributions to support risk assessment and go/no-go decisions.
AI red team operators, model builders, and post-training teams can drill into detailed analytics such as attack success rates by transform, attack effectiveness across techniques, and model-specific failure patterns. Every interaction is captured through OpenTelemetry tracing, providing full auditability and reproducibility. Results can be exported as stakeholder-ready PDF reports or as Parquet datasets that feed directly into adversarial fine-tuning and model improvement workflows.
Will and Raja concluded with a preview of their ICML work on multi-agent AI security. Their research shows that as agents become more capable and interconnected, attackers can exploit topology-level weaknesses that traditional single-agent evaluations miss. ATLAS probes these systems for risks including cross-agent data leakage, memory poisoning, and unsafe tool execution while significantly reducing attack cost. Combined with emerging cyber benchmarks and autonomous session controls, the work points toward a future where continuous adversarial evaluation is a core part of deploying agentic AI in a safe, secure, and trustworthy manner.
You can get started with the AI red teaming capability for free by signing up for the Dreadnode Platform, or contact us if interested in on-prem deployment and enterprise specs. The full livestream recording is available on YouTube.