Can We Trust AI to Secure Our Code?
The article discusses whether AI tools can be trusted to secure the growing amount of AI-generated code.
Why Evaluating AI Security Tools Matters More Than Ever!
AI now writes over 25% of code at Google. At some Y Combinator startups, that number hits 95%. As machines become our primary code authors, a critical question emerges: If AI is writing the software, can AI also secure it? The answer isn’t as simple as we’d like. And it starts with understanding how we evaluate the tools we’re increasingly relying on.
More Code Than Ever Is Written With AI
AI has officially moved from being a developer’s assistant to a full-fledged contributor, sometimes even taking over as the main actor. At Google, over 25% of new code is now written by AI. Among recent Y Combinator startups, a quarter say 95%+ of their code is AI-generated.
This explosion in AI-generated software means that the systems we depend on – financial platforms, healthcare applications, infrastructure – are increasingly built by machines. It’s an incredible leap forward for productivity and speed, but it also introduces a brand new risk: AI-created vulnerabilities. Now, the question we asked ourselves is: As AI builds more software, who’s making sure that software is secure?
More Teams Turn to AI Security Tools
A new wave of AI security tools has arrived, including offerings from Microsoft, Anthropic, and Snyk AI. These tools promise to detect vulnerabilities, review pull requests, and, as in the case of Lovable, even propose fixes automatically. The appeal is obvious: AI writes the code, so let AI review it as well. But that raises a critical question: How do we know whether these tools actually work?
When AI chatbots make mistakes, the stakes are relatively low. When AI security tools fail, the results are data breaches, compromised systems, and loss of customer trust. Before we hand this responsibility over to AI security tools, we should evaluate whether they are truly up to the task.
How Do We Know if These AI Security Tools Are Any Good?
Evaluating AI security tools isn’t as simple as checking accuracy on a test dataset. Security requires context, reasoning, and the ability to detect the unexpected.
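To see why raw accuracy on a test dataset can mislead, consider a minimal sketch. The dataset below is hypothetical (1,000 functions, 20 of them vulnerable), and the two "scanners" are invented purely for illustration; none of these numbers come from a real tool or benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Outcome counts for a binary 'vulnerable / not vulnerable' verdict."""
    tp: int  # vulnerable and flagged
    fp: int  # safe but flagged
    fn: int  # vulnerable but missed
    tn: int  # safe and not flagged

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        vulnerable = self.tp + self.fn
        return self.tp / vulnerable if vulnerable else 0.0


def score(labels, predictions) -> Confusion:
    """Compare ground-truth labels with a scanner's verdicts (True = vulnerable)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    return Confusion(
        tp=sum(l and p for l, p in pairs),
        fp=sum((not l) and p for l, p in pairs),
        fn=sum(l and (not p) for l, p in pairs),
        tn=sum((not l) and (not p) for l, p in pairs),
    )


# Hypothetical dataset: 20 vulnerable functions followed by 980 safe ones.
labels = [True] * 20 + [False] * 980

# Scanner A (hypothetical): never flags anything.
silent_scanner = score(labels, [False] * 1000)

# Scanner B (hypothetical): finds 15 of the 20 bugs but raises 60 false alarms.
noisy_scanner = score(labels, [True] * 15 + [False] * 5 + [True] * 60 + [False] * 920)

for name, result in [("silent", silent_scanner), ("noisy", noisy_scanner)]:
    print(f"{name}: accuracy={result.accuracy:.3f} "
          f"precision={result.precision:.3f} recall={result.recall:.3f}")
```

In this toy setup the scanner that finds nothing scores the higher accuracy, simply because vulnerable code is rare. That is one reason the evaluation has to look at what was missed and what was falsely flagged, not only at a single headline number.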
A good AI security tool should go beyond scanning individual files and be able to analyze entire codebases with a deep understanding of configurations, dependencies, and real-world usage. It must detect novel vulnerabilities that don’t match known patterns and provide actionable, context-aware remediation guidance, even for complex architectural issues. If an AI system cannot perform these tasks reliably, it is not ready for production regardless of how convincing its marketing may be.
How Do We Evaluate AI Security Tools Today?
Historically, researchers relied on small labeled sets of vulnerable code snippets like BigVul, CrossVul, or DiverseVul. These were useful when AI models were much smaller and could only handle individual functions. In contrast, today’s LLMs can reason across entire repositories and understand system-level behavior, so benchmarks need to evolve too. Here are some recent developments:
Capture the Flag: AI on Offense
Recent work has tested AI using Capture the Flag (CTF) challenges. These are security puzzles that simulate penetration testing. Benchmarks like NYU CTF Bench and CyBench put AI models to the test against real tasks like reverse engineering and exploit detection. And the results are encouraging. Claude Sonnet 4.5 reaches an unguided solve rate of 55% on CyBench. This is a vast improvement over the top score of 17.5% when the benchmark was released. But it’s fair to say that AI is still far from replacing human security experts.
Real-World Benchmarks: AI for Real
The newer generation of evaluations, like BountyBench and CVE-Bench, use real-world vulnerabilities: bugs that were actually found and fixed in production systems. And the results are not all that rosy. The highest success rate for detecting vulnerabilities without guidance in BountyBench is 12.5%. The success rates for exploiting and patching vulnerabilities were much higher: 67.5% and 90%, respectively. This shows that detecting vulnerabilities is the most challenging of the cybersecurity tasks despite being perhaps the most important. If we can’t detect a vulnerability, we don’t know what to fix in the first place.
These benchmarks are more representative of the challenges developers face, but even these come with trade-offs: limited scope, potential overlap with AI training data, and high manual effort to build.
The Gaps in Today’s Evaluation Methods
These CTF and real-world benchmarks are critical for validating the progress AI security tools are making, but even these advanced benchmarks have serious limitations:
- Limited coverage: Only dozens of systems tested across a few languages
- Static datasets: Hard to update as new vulnerabilities emerge
- Potential memorization: AI may “recall” known vulnerabilities instead of detecting them
- Expensive creation: Each new case requires expert validation
So as of today, it’s hard to argue for full confidence in AI security tools, and that will remain true until benchmarks evolve to address these limitations.
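One common way to probe the memorization concern is to split benchmark cases by whether the vulnerability was disclosed before or after a model's training cutoff, then compare detection rates. The sketch below assumes hypothetical case IDs, dates, results, and cutoff; it illustrates the idea and is not how any of the benchmarks named above are scored.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BenchmarkCase:
    case_id: str      # hypothetical identifier
    disclosed: date   # public disclosure date of the vulnerability
    detected: bool    # whether the tool under test found it


def split_by_cutoff(
    cases: List[BenchmarkCase], cutoff: date
) -> Tuple[List[BenchmarkCase], List[BenchmarkCase]]:
    """Separate cases the model could have seen in training from newer ones."""
    seen = [c for c in cases if c.disclosed <= cutoff]
    unseen = [c for c in cases if c.disclosed > cutoff]
    return seen, unseen


def detection_rate(cases: List[BenchmarkCase]) -> Optional[float]:
    """Fraction of cases detected, or None when the group is empty."""
    if not cases:
        return None
    return sum(c.detected for c in cases) / len(cases)


# Hypothetical results for a model with an assumed training cutoff of 2024-06-01.
ASSUMED_CUTOFF = date(2024, 6, 1)
cases = [
    BenchmarkCase("case-a", date(2023, 2, 10), True),
    BenchmarkCase("case-b", date(2023, 9, 1), True),
    BenchmarkCase("case-c", date(2024, 1, 15), True),
    BenchmarkCase("case-d", date(2024, 5, 20), False),
    BenchmarkCase("case-e", date(2024, 7, 4), False),
    BenchmarkCase("case-f", date(2024, 9, 12), True),
    BenchmarkCase("case-g", date(2024, 11, 30), False),
    BenchmarkCase("case-h", date(2025, 1, 8), False),
]

seen, unseen = split_by_cutoff(cases, ASSUMED_CUTOFF)
print("disclosed before cutoff:", detection_rate(seen))
print("disclosed after cutoff: ", detection_rate(unseen))
```

A large gap between the two groups is consistent with memorization, but it is not proof: newer cases may simply be harder, and a real comparison needs far more than eight cases.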
The Future of AI Security Evaluation
At Trent AI, we see AI-assisted benchmarking as essential to the future of evaluating AI security tools. If AI can write software, it can also help us build better benchmarks. Here’s how:
- AI-generated test systems: AI can construct realistic software environments with embedded vulnerabilities modeled after real exploits.
- Intelligent refactoring: AI can rewrite real systems to maintain complexity while preventing memorization bias.
- Vulnerability injection: AI can introduce diverse, realistic vulnerabilities inspired by bug bounty data.
These approaches can make benchmarks dynamic, scalable, and constantly evolving. Manual reviews would still be needed to verify the AI benchmarks, but these approaches significantly reduce the manual burden, allowing benchmarks to evolve alongside AI security tools.
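As a rough illustration of the refactoring idea, the sketch below uses a deliberately simple, deterministic stand-in: it renames the identifiers in a snippet with Python's standard `ast` module, so the code behaves the same but no longer matches its original text. This is not an AI-based rewrite and not a description of any particular tool's method. The sample function and the `v0`, `v1`, ... naming scheme are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Apply a fixed old-name -> new-name mapping to defs, args, and variables."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # also rename inside arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


def collect_bound_names(tree):
    """Names the snippet itself defines: functions, parameters, assigned variables.

    Builtins and imported modules are never assigned here, so they are left alone.
    """
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            names.append(node.name)
        elif isinstance(node, ast.arg):
            names.append(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.append(node.id)
    return list(dict.fromkeys(names))  # de-duplicate, keep first-seen order


def rename_identifiers(source):
    """Return (rewritten_source, mapping). Behavior is preserved for simple code only."""
    tree = ast.parse(source)
    mapping = {name: f"v{i}" for i, name in enumerate(collect_bound_names(tree))}
    rewritten = RenameIdentifiers(mapping).visit(tree)
    return ast.unparse(rewritten), mapping


# Hypothetical vulnerable snippet: SQL built by string concatenation.
ORIGINAL = '''
def build_query(user_id):
    query = "SELECT * FROM users WHERE id = " + user_id
    return query
'''

renamed, mapping = rename_identifiers(ORIGINAL)
print(renamed)
```

The injected flaw (string-concatenated SQL) survives the rewrite while the surface text changes. A realistic pipeline would need much more: this sketch ignores classes, keyword-argument call sites, `global` statements, and collisions with names that already look like `v0`, and renaming alone is unlikely to defeat memorization of a well-known bug.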
So Now What? What Developers and Teams Should Know
AI is reshaping how software is written and how it’s secured. But before we rely on AI alone for application security, we need to recognize two key truths:
First, AI security tools are only as good as how we evaluate them.
Second, today’s evaluations are limited, but the next wave is coming.
Over the next 12–24 months, expect to see new benchmarks and evaluation frameworks from the community that better reflect real-world complexity and production-scale threats. Until then, teams should treat AI security results with healthy skepticism, validate the findings of AI security tools with expert review, and follow emerging evaluation research closely.
If you’re shipping AI-generated code today and want something concrete to do before the next deploy, start with a threat modeling pass.
💬 What’s your experience?
Have you tried deploying AI for software security in your organization? What challenges have you faced in testing or trusting those tools?