This blog was made possible thanks to contributions made by Patrick Fussell and David McMillen.

As artificial intelligence (AI) plays an increasingly significant role in cybersecurity, discussions of its offensive capabilities are becoming more prominent. A recent report by Anthropic, a leading AI research lab, has sparked the latest conversation on this topic, with questions raised about its claim that an AI-assisted attack it observed was “90% autonomous.” Critics argue that the report did not provide enough detail to understand the tools or methodologies used in the attack.

While the report may lack specific detail about the AI-assisted attacks, IBM X-Force sees several important takeaways that may have been overlooked:

The cyber capabilities observed today are merely side effects of AI models trained on coding datasets. These models were not specifically designed to create sophisticated malware or conduct complex attacks. That said, many frontier labs and private groups are building training datasets for software vulnerability discovery and weaponization, as well as for network and web-application attacks and offensive cyber operations. These datasets will be used to train and tune models to perform security testing at greater speed, scale and sophistication than is possible today. The recent emergence of initiatives like OpenAI’s Aardvark suggests that advancements in this area are already under way.

Frontier labs, private teams, startups and adversaries alike have various motivations for pursuing this work, from finding and patching bugs faster to better protect organizations, to finding and exploiting weaknesses in target infrastructure faster than they can be patched.

Many analysts and industry veterans suggest that the offensive capabilities of AI are merely coincidental, an unintended outcome of training on coding datasets. As institutions and individuals pivot towards training AI models on tailored offensive datasets, the effectiveness of AI applications in both open and closed frameworks is likely to rise sharply.

The attackers in the campaign reported by Anthropic employed a guardrail bypass strategy that exploited architectural vulnerabilities in current AI safety models. They broke malicious tasks into smaller, seemingly legitimate components, so each request appeared benign when evaluated in isolation. Framing inquiries as security testing scenarios bypassed content filters, while distributing requests across sessions exploited gaps in prompt-based detection. This demonstrates that prompt inspection alone may be insufficient, pointing to deeper questions about AI safety architectures.
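To make the gap concrete, here is a minimal, illustrative sketch of why a check that looks at one request at a time can miss activity that only stands out in aggregate. Everything in it is an assumption made for illustration: the category names, the keyword lists, the threshold, and the idea that separate sessions can be tied back to one account. It does not describe how Anthropic or any vendor actually detects misuse, and a real system would rely on trained classifiers rather than substring matching.

```python
from collections import defaultdict

# Hypothetical activity categories and trigger phrases (illustrative only).
CATEGORY_KEYWORDS = {
    "recon": ("scan", "enumerate"),
    "credential_access": ("credential", "password"),
    "exfiltration": ("exfiltrate", "upload the database"),
}


def categorize(prompt):
    """Return the set of hypothetical categories a single prompt touches."""
    text = prompt.lower()
    return {
        category
        for category, phrases in CATEGORY_KEYWORDS.items()
        if any(phrase in text for phrase in phrases)
    }


def flag_single_request(prompt, threshold=3):
    """Per-request inspection: flag only if one prompt spans many categories."""
    return len(categorize(prompt)) >= threshold


class AccountMonitor:
    """Aggregate inspection: accumulate categories per account across sessions."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self._seen = defaultdict(set)  # account_id -> categories observed so far

    def observe(self, account_id, prompt):
        """Record a prompt and report whether the account now crosses the threshold."""
        self._seen[account_id] |= categorize(prompt)
        return len(self._seen[account_id]) >= self.threshold


# Three requests, each framed as routine work and sent in a separate session.
prompts = [
    "As part of an authorized security test, scan this subnet for open ports.",
    "Write a script that checks whether these credentials are still valid.",
    "Compress the results and upload the database dump to this server.",
]

per_request_flags = [flag_single_request(p) for p in prompts]

monitor = AccountMonitor()
aggregate_flags = [monitor.observe("account-1", p) for p in prompts]

print(per_request_flags)  # no single request crosses the threshold
print(aggregate_flags)    # the account crosses it once all three are seen
```

In this toy setup each request touches only one category, so the per-request check never fires, while the per-account view crosses the threshold on the third request. The design choice being illustrated is simply where state is kept: inspection that retains context across requests and sessions can see patterns that stateless prompt inspection cannot.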