Continuous code analysis at enterprise scale: How IBM Bob found what traditional tools miss


The software industry operates under a paradox: development teams ship commits at accelerating velocity, dozens per day, hundreds per week. Security audits, meanwhile, operate on calendar schedules: weekly reviews, monthly scans and quarterly compliance reviews.

When AI-generated code enters the picture, the paradox sharpens: development velocity increases further, but so does the density of potential vulnerabilities. The mathematics of cost can be brutal: research consistently shows that remediating security issues discovered post-deployment costs much more than addressing them before release.

Yet traditional code analysis tools, comprehensive as they are, cannot keep pace with commit velocity. A static application security testing (SAST) tool might generate 500+ findings, many of which are low confidence. Teams triage reports that arrive days or weeks after the code has merged.

By then, the vulnerable code sits in production. The context is lost and the cost of fixing has multiplied. The security team reviews findings based on pattern matching, not on understanding what the code is intended to do.

This lag is where I saw an opportunity: Can AI code analysis, applied with structured methodology, become continuous rather than exceptional? Can it bridge the gap between traditional SAST tools and the architectural understanding that only humans bring?

The experiment: Structured prompting at scale

Rather than asking Bob to find bugs, I developed a multi-prompt analytical framework. The insight that emerged during the first iteration proved critical: Bob’s analysis quality was directly proportional to prompt clarity and specificity.

The repositories were substantial:

  • Backend: ~15 core Python files, 40+ gen AI-powered API endpoints, comprehensive database schemas with 20+ tables, 30+ operational scripts, 29 Pydantic models
  • Front end: 829 JavaScript files, 50+ page routes, 43 MobX State Tree stores, 300+ API integrations across 9 backend services, 50+ modal components

For the backend, I used a 4-prompt sequence that built the context iteratively:

1. Architecture overview: A prompt requesting high-level structural understanding: directory organization, entry points, technology stack and component relationships. This established Bob’s baseline understanding.

2. Detailed domain documentation: A comprehensive prompt requesting documentation of specific subsystems: Gen AI endpoints, database schemas, operational scripts and compliance frameworks. This expanded Bob’s contextual understanding and forced it to reason about how components interact.

3. Database and migration analysis: A specialized prompt analyzing database design through the lens of modernization goals, identifying normalization opportunities, type system issues and architectural anti-patterns in concrete, actionable terms.

4. Technical debt analysis: A structured prompt requesting systematic categorization of code quality issues: database access patterns, authentication inconsistencies, code duplication and specific remediation effort estimates. This forced analysis into a taxonomy rather than free-form observation.

For the front end, I used a single comprehensive prompt that integrated architectural analysis, API integration patterns, technical debt identification and modernization planning.
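
To make the sequencing concrete, here is a minimal sketch of how such a prompt chain could be orchestrated. The prompt wording is abbreviated and the run_analysis() helper is hypothetical: it stands in for whatever repository-scoped analysis interface your assistant exposes.

def run_analysis(prompt: str, prior_context: list[str]) -> str:
    # Placeholder for the assistant's repository-analysis interface; swap in a real call.
    return f"[analysis placeholder for prompt: {prompt[:40]}...]"

BACKEND_PROMPTS = [
    # 1. Architecture overview: establish baseline structural understanding
    "Describe the directory organization, entry points, technology stack and component relationships.",
    # 2. Detailed domain documentation: force reasoning about how components interact
    "Document the gen AI endpoints, database schemas, operational scripts and compliance frameworks.",
    # 3. Database and migration analysis through a modernization lens
    "Identify normalization opportunities, type system issues and architectural anti-patterns.",
    # 4. Technical debt analysis as a taxonomy, not free-form observation
    "Categorize database access patterns, authentication inconsistencies and code duplication; estimate remediation effort.",
]

context: list[str] = []
for prompt in BACKEND_PROMPTS:
    # Each response is carried forward so later prompts build on earlier context.
    context.append(run_analysis(prompt, prior_context=context))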

The cost was around 10 bobcoins out of a monthly allocation of 50: just over 20% of the basic user allocation for a comprehensive analysis of two large repositories. This cost structure meant that continuous analysis was economically feasible, not a quarterly luxury.

The findings: Systemic issues across four dimensions

Bob’s analysis identified over a dozen major types of findings, each evaluated across four impact dimensions: security, performance, stability/quality and development velocity. The breadth was significant: not security findings alone, but systemic architectural problems constraining the organization across multiple vectors.

Chart: Bob’s analysis results, mapped across the four impact dimensions.

The SQL injection was the most striking finding, but it was one part of a larger story. The analysis revealed that the Concert® architecture was accumulating technical debt across every dimension simultaneously: security gaps, performance bottlenecks, stability issues and development friction were all present.

Why context matters: The SQL injection case study

The SQL injection vulnerability resided in api/src/common/lib/auth.py in the get_userkey() function, where username values extracted from Base64-decoded API keys were embedded directly into SQL queries through f-strings, without parameterization:

username = decode_api_key(api_key)

cur.execute(f"SELECT salt, hash, role FROM user_key WHERE username='{username}'")

Here’s what each tool saw:

  • SonarQube: Missed it entirely (a false negative)
  • Semgrep: Flagged it as "Medium" severity, pattern matching without context
  • Bob: Flagged it as "Critical" with full context analysis

Why the difference? Bob understood where this code ran. It wasn’t just an SQL injection; it was an SQL injection in the authentication layer that gates all API requests. Because the username comes from user-supplied API keys, an attacker doesn’t need to be authenticated to exploit it. They craft a malicious API key, the injection executes before authentication completes, and they gain database access before the system knows who they are.
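
To make the mechanics concrete, consider what the interpolated query becomes when the decoded username carries a SQL payload rather than an account name. This is an illustrative reconstruction based on the snippet above, not the payload used during validation:

# Illustrative only: what the f-string produces for an attacker-chosen username.
username = "' OR '1'='1"  # decoded from a crafted, Base64-encoded API key
query = f"SELECT salt, hash, role FROM user_key WHERE username='{username}'"
print(query)
# SELECT salt, hash, role FROM user_key WHERE username='' OR '1'='1'
# The WHERE clause is now always true, so the lookup succeeds without a valid account.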

The implications extended across every dimension:

  • Security: Complete authentication bypass, full database access, lateral movement potential
  • Compliance: Automatic SOC 2 and ISO 27001 audit failure
  • Business impact: 1–2 weeks of incident response, reputational damage
  • Remediation: Simple fix, 4–8 hours of work, immediate risk elimination

When I walked the security team through Bob’s reasoning, skepticism turned into swift action: the finding was validated and a fix was deployed within 24 hours.
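
For reference, the remediation is the standard parameterized-query pattern. A minimal sketch, assuming a DB-API driver such as psycopg2 that accepts %s placeholders; the exact change shipped in auth.py may differ in detail:

# Parameterized version: the driver passes username as data, never as SQL text.
cur.execute(
    "SELECT salt, hash, role FROM user_key WHERE username = %s",
    (username,),
)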

What makes structured AI analysis different

This experiment revealed why AI code analysis, when done with methodological rigor, bridges a gap that traditional tools cannot:

Traditional SAST tools analyze patterns: “Here’s an SQL injection vulnerability at line 253.” They generate hundreds of findings with varying confidence levels and force human teams to triage alerts arriving days after the code has merged.

Structured AI analysis reasons about context: “Here’s an SQL injection vulnerability. Here’s where it runs in your application flow. Here’s why it’s critical. Here’s what the business impact looks like. Here’s how to fix it.” It provides architectural understanding rather than just pattern matches.

The key enabling factor is clear analytical frameworks. Vague prompts (“Find security issues”) produced generic observations. Precisely structured prompts requesting specific analytical dimensions produced precise, actionable output.

Limitations and lessons learned

Bob’s analysis required human expertise at multiple points. Security team validation was essential. Bob provided analysis, but humans provided judgment. The SQL injection had to be tested and confirmed. Architectural context from development leadership was necessary to distinguish between genuine risks and acceptable patterns within specific constraints.

Bob also hallucinated details. Time estimates for implementing fixes were wildly inaccurate. Any use of Bob’s analysis in project planning required significant human review. But the core findings, the structural issues, the architectural anti-patterns and the vulnerability assessment held up to expert scrutiny.

The most important lesson: prompt engineering quality directly predicts analysis quality. This insight suggests that effective AI code analysis requires investment in understanding which analytical frameworks produce useful output and how to structure requests to align with those frameworks. It’s not about feeding code to Bob and getting answers; it’s about designing analytical prompts that force clear thinking.

Continuous governance, not periodic audits

The most powerful aspect of this approach may be what comes next. Bob can analyze incremental changes rather than rescanning entire repositories. A follow-up prompt can focus on new commits:

Compare the current branch against commit [hash]. Analyze the Git diff and identify: new security vulnerabilities introduced, technical debt items resolved, changes that affect previously identified issues. Focus only on changed files and their immediate dependencies.

This shifts code analysis from periodic audits to continuous governance. Security reviews happen at the pace of development, not on calendar schedules. Teams can integrate this shift into merge request validation, providing immediate feedback to developers before code integrates into main branches.
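
As a sketch of what that integration might look like in a merge request pipeline: collect the diff against the last audited commit and wrap it in the follow-up prompt above. Only the Git invocation below is standard; the submission step is a placeholder for whatever interface the assistant exposes.

import subprocess

def build_incremental_prompt(base_commit: str) -> str:
    # Collect the diff since the last audited commit.
    diff = subprocess.run(
        ["git", "diff", base_commit, "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return (
        f"Compare the current branch against commit {base_commit}. "
        "Analyze the Git diff and identify: new security vulnerabilities introduced, "
        "technical debt items resolved, changes that affect previously identified issues. "
        "Focus only on changed files and their immediate dependencies.\n\n"
        f"Git diff:\n{diff}"
    )

# In CI, the prompt would then be submitted for analysis, for example:
# report = submit_for_analysis(build_incremental_prompt(LAST_AUDITED_COMMIT))  # placeholder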

Estimated cost: 5–8% of the monthly token budget per week. Over a month, that is roughly equivalent to the cost of one initial comprehensive audit, while allowing continuous analysis across the entire codebase.

Implications for development organizations

This experiment demonstrated that AI code analysis has value that extends beyond the traditional SAST tool category. The findings were valuable not because they identified unknown vulnerability patterns, but because they provided comprehensive architectural visibility: what was working, what was degrading performance, what was constraining development velocity and what required immediate remediation.

For a product manager without deep security training, Bob provided visibility into architectural issues that typically require senior architects spending days in code review. For a security team, Bob provided a systematic way to identify architectural patterns that create risk surface, not just individual bugs.

The broader implication: software risk now operates at commit velocity. Our assessment tools need to operate at the same pace. Traditional audits that occur on weekly, monthly or quarterly cycles will always be reactive. Continuous analysis tools that provide immediate feedback at merge time, with a cost structure that makes regular use practical, change the economics and timing of risk management.

This approach is not a replacement for security specialists, architects or traditional tooling. It’s a force multiplier for teams that need visibility into large, complex codebases but lack the resources to conduct continuous manual review. When combined with structured analytical frameworks and human judgment, it demonstrates one model for how continuous code governance might work at enterprise scale.

IBM Bob is currently in limited availability. Learn more about IBM’s AI-assisted development tools.

All findings and vulnerability details have been remediated and validated by our development and security teams before publication. The analysis methodologies described in this post are applicable to any large codebase and any AI coding assistant with repository analysis capabilities.

Author

Chris Buxton

Technical Product Manager

IBM Concert