Using AI to Automate Vulnerability Discovery in Source Code

SAST is proven. AI is making it smarter.

Static Application Security Testing (SAST) has been the industry standard for decades. Tools like Checkmarx, Coverity, Fortify, and FindBugs catch real vulnerabilities at scale. Pattern-matching and AST-walking work. They’ve prevented countless exploits.

But they have known blind spots. Traditional SAST tools are good at catching syntactic patterns—buffer overflows, obvious SQL injection, hardcoded secrets. They’re weaker on semantic vulnerabilities. A function looks dangerous in isolation, but it’s safe because of validation three layers up. Or the validation gets bypassed in one edge case that only manifests under specific conditions.
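A minimal, hypothetical Python sketch of that blind spot (the function names and the table allow-list are invented for illustration): the query builder looks injectable on its own, one caller makes it safe, and a second caller quietly skips the check.

```python
import re

# Hypothetical allow-list: lowercase letters and underscores only
ALLOWED_TABLE = re.compile(r"[a-z_]+")


def build_query(table: str) -> str:
    # Viewed in isolation this is textbook SQL injection:
    # a pattern-based scanner will flag the string interpolation.
    return f"SELECT * FROM {table} WHERE active = 1"


def handle_request(params: dict) -> str:
    # Validation lives in the caller, so the sink is safe on this path.
    table = params.get("table", "")
    if not ALLOWED_TABLE.fullmatch(table):
        raise ValueError("invalid table name")
    return build_query(table)


def handle_export(params: dict) -> str:
    # Edge case: a second entry point that forgot the check.
    return build_query(params.get("table", ""))
```

A pattern matcher flags `build_query` either way. Telling `handle_request` apart from `handle_export` requires following the data across call sites, which is the kind of reasoning the rest of this article is about.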

This is where AI augments SAST. Machine learning models trained on millions of vulnerable and patched code samples can understand context better. LLMs can reason about code flow, trace data transformation, and explain why something is vulnerable—not just flag it. They reduce false positives and catch edge cases traditional tools miss.

The game isn’t SAST vs. AI. It’s SAST + AI. Let’s see what actually works.

The AI Vulnerability Detection Landscape (2025)

LLM-based approaches (Claude, GPT-4, open-source LLMs): These treat code analysis as a reasoning task. Feed the model a function, ask it “find vulnerabilities,” and it outputs findings with explanations.

Pros: Understands context, catches logic-based flaws, explains reasoning. Cons: Slower, more expensive, can hallucinate findings, needs careful prompting.

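Pattern- and ML-based scanners (GitHub code scanning, Semgrep, Snyk Code): Semgrep and GitHub’s CodeQL engine are primarily rule- and query-driven, while Snyk Code adds models trained on real-world code. All of them recognize known vulnerability patterns rather than reasoning open-endedly about intent.

Pros: Fast, accurate on known vulnerability classes, integrates into CI/CD easily. Cons: Limited to patterns in their rules or training data, less contextual reasoning.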

Hybrid approaches: Combine a fast scanner for speed (filter noise) + LLM for depth (understand complex flows).

This is where the future is heading. Preliminary results suggest hybrid systems can catch roughly 30-40% more vulnerabilities than SAST alone at acceptable false-positive rates, though the size of the gain depends heavily on the codebase and the benchmark.

Practical Tools & How to Use Them

1. Semgrep with LLM Analysis

Semgrep is a static analysis engine that’s gotten scary good at finding vulnerabilities. Now, some teams are piping Semgrep findings into Claude or GPT-4 for validation and deeper analysis.

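# .semgrep.yml
rules:
  - id: hardcoded-password
    # Every entry under `patterns` must hold for a match (logical AND)
    patterns:
      - pattern: password = "..."          # "..." matches any string literal
      - pattern-not: password = "changeme" # ignore the known placeholder value
    message: Hardcoded password detected
    languages: [python, javascript]
    severity: HIGH  # older Semgrep releases expect INFO, WARNING, or ERROR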

Once Semgrep flags findings, you can batch them to an LLM for context-aware analysis:

import anthropic

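def analyze_with_claude(code_snippet: str, semgrep_finding: dict) -> str:
    """Ask Claude to triage one finding; expects 'message' and 'severity' keys."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment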
    prompt = f"""
You are a senior security engineer. A static analyzer found this potential vulnerability:

**Finding:** {semgrep_finding['message']}
**Severity:** {semgrep_finding['severity']}

**Code:**

{code_snippet}

Analyze this finding:
1. Is it a true positive or false positive?
2. What is the actual exploit path?
3. What's the recommended fix?
4. Are there related vulnerabilities nearby?
"""
    
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

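# Usage
# Illustrative finding; raw Semgrep JSON nests these fields under results[].extra
finding = {"message": "Hardcoded password detected", "severity": "HIGH"}
with open("function.py", encoding="utf-8") as source_file:
    vulnerable_code = source_file.read()
analysis = analyze_with_claude(vulnerable_code, finding)
print(analysis)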
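The function above expects a flat dict with `message` and `severity` keys, while `semgrep --json` nests those fields under `results[].extra`. A small sketch of the glue code, assuming that report shape (the helper name `flatten_semgrep_results` and the sample report are hypothetical):

```python
def flatten_semgrep_results(report: dict) -> list[dict]:
    """Turn a Semgrep JSON report into flat dicts with 'message' and 'severity'."""
    findings = []
    for result in report.get("results", []):
        extra = result.get("extra", {})
        findings.append({
            "rule_id": result.get("check_id", "unknown"),
            "path": result.get("path", ""),
            "line": result.get("start", {}).get("line"),
            "message": extra.get("message", ""),
            "severity": extra.get("severity", "UNKNOWN"),
        })
    return findings


# Hand-written sample mimicking the shape of `semgrep --json` output
sample_report = {
    "results": [
        {
            "check_id": "hardcoded-password",
            "path": "app/settings.py",
            "start": {"line": 12},
            "end": {"line": 12},
            "extra": {"message": "Hardcoded password detected", "severity": "ERROR"},
        }
    ]
}

findings = flatten_semgrep_results(sample_report)
print(findings[0]["message"], findings[0]["severity"])
```

Each flattened dict can then be passed as `semgrep_finding` together with the source text around `path` and `line`.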

2. GitHub Advanced Security + AI Context

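GitHub Advanced Security (an add-on for GitHub Enterprise) runs code scanning on CodeQL, a query-based semantic analysis engine, rather than on ML models alone. The real power: alerts sit alongside Dependabot, which tracks which dependencies introduced a known vulnerability.

Where AI comes in: Copilot Autofix uses an LLM to suggest fixes for code scanning alerts, and you can tailor detection to your codebase with custom CodeQL queries. Check GitHub’s current documentation before planning around fine-tuned custom ML models.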

# Enable CodeQL default setup through the REST API (replace OWNER/REPO)
gh api --method PATCH repos/OWNER/REPO/code-scanning/default-setup -f state=configured

Then pull results and correlate with AI analysis:

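import os
import requests

def get_gh_findings(owner, repo, token):
    """Fetch open code scanning alerts from GitHub (first page only)."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    url = f"https://api.github.com/repos/{owner}/{repo}/code-scanning/alerts"
    params = {"state": "open", "per_page": 100}

    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()  # surface auth or permission errors early
    return response.json()

# Read the token from the environment instead of hardcoding it
alerts = get_gh_findings("myorg", "myrepo", os.environ["GITHUB_TOKEN"])
# Next: keep high-severity alerts, then ask an LLM to weed out false positives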
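The filtering step mentioned in that last comment could look roughly like this. It is a sketch that assumes each alert carries `state`, `rule.security_severity_level`, and `most_recent_instance.location.path`, as in the code scanning alerts API response; the helper name and the sample alerts are made up.

```python
def select_for_llm_review(alerts, levels=("critical", "high")):
    """Keep open alerts whose security severity is in `levels`."""
    selected = []
    for alert in alerts:
        rule = alert.get("rule") or {}
        if alert.get("state") != "open":
            continue
        if rule.get("security_severity_level") not in levels:
            continue
        location = (alert.get("most_recent_instance") or {}).get("location") or {}
        selected.append({
            "number": alert.get("number"),
            "rule_id": rule.get("id"),
            "description": rule.get("description"),
            "path": location.get("path"),
        })
    return selected


# Hand-written sample shaped like the API response
sample_alerts = [
    {"number": 1, "state": "open",
     "rule": {"id": "py/sql-injection", "description": "SQL injection",
              "security_severity_level": "high"},
     "most_recent_instance": {"location": {"path": "app/db.py"}}},
    {"number": 2, "state": "open",
     "rule": {"id": "py/unused-import", "description": "Unused import",
              "security_severity_level": None}},
    {"number": 3, "state": "dismissed",
     "rule": {"id": "py/path-injection", "description": "Path injection",
              "security_severity_level": "critical"}},
]

queue = select_for_llm_review(sample_alerts)
print([item["number"] for item in queue])
```

Filtering before the LLM call keeps cost down, since only the short queue is sent for context-aware review.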

3. Snyk Code (ML-powered)

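Snyk Code’s engine, built on its DeepCode acquisition, is trained on open-source code and real-world fixes. Claims such as catching CWEs “20x better” than rule-based systems are hard to verify independently, so benchmark it against your own code before relying on the number.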

Integration example:

snyk code test --json > findings.json
# Then pipe to your validation/prioritization system
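A minimal sketch of such a prioritization step, assuming `findings.json` follows the SARIF layout (`runs[].results[]` with `ruleId`, `level`, `message.text`, and `locations`). Verify this against the output of your Snyk CLI version; the helper name and the sample document are hypothetical.

```python
# SARIF result levels, most urgent first
LEVEL_ORDER = {"error": 0, "warning": 1, "note": 2, "none": 3}


def sarif_to_findings(sarif: dict) -> list[dict]:
    """Flatten SARIF results and sort them by level."""
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append({
                "rule_id": result.get("ruleId", "unknown"),
                # SARIF treats a missing level as "warning"
                "level": result.get("level", "warning"),
                "message": result.get("message", {}).get("text", ""),
                "path": physical.get("artifactLocation", {}).get("uri", ""),
                "line": physical.get("region", {}).get("startLine"),
            })
    return sorted(findings, key=lambda f: LEVEL_ORDER.get(f["level"], 99))


# Hand-written sample in SARIF shape
sample_sarif = {
    "runs": [{
        "results": [
            {"ruleId": "python/HardcodedSecret", "level": "note",
             "message": {"text": "Hardcoded secret"},
             "locations": [{"physicalLocation": {
                 "artifactLocation": {"uri": "app/config.py"},
                 "region": {"startLine": 4}}}]},
            {"ruleId": "python/Sqli", "level": "error",
             "message": {"text": "SQL injection"},
             "locations": [{"physicalLocation": {
                 "artifactLocation": {"uri": "app/db.py"},
                 "region": {"startLine": 27}}}]},
        ]
    }]
}

prioritized = sarif_to_findings(sample_sarif)
print([f["rule_id"] for f in prioritized])
```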

What’s new with Snyk (2025): Snyk’s LLM-backed autofix (introduced as DeepCode AI Fix) suggests fixes and explains root causes. You get not just “SQL injection found” but “Here’s the parameterized query you should use instead, and here’s why it’s safe.” Don’t confuse it with Snyk Advisor, which is the package-health lookup for open-source dependencies.

4. Building Your Own AI Scanner with Open Models

If you want to own the whole stack, you can fine-tune open-source models on your organization’s code patterns.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Use a code-trained model like CodeT5 or similar
model_name = "Salesforce/codet5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

def find_vulnerabilities(code: str) -> str:
    """Run the seq2seq model over a snippet and return its decoded output.

    The base checkpoint is not trained to detect vulnerabilities, so the
    output only becomes meaningful after fine-tuning on labeled data.
    """
    inputs = tokenizer(code, return_tensors="pt", max_length=512, truncation=True)

    with torch.no_grad():  # no gradients needed for inference
        outputs = model.generate(**inputs, max_length=256)
    predictions = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return predictions

For serious deployment, you’d fine-tune on labeled datasets (vulnerable vs. safe code), which requires significant effort but gives you a model tuned to your risk profile.

The Hard Truth: Limitations & Pitfalls

False positives explode fast. If your AI system flags 50 issues and 40 are false positives, your team stops trusting it. The research suggests LLMs have 15-25% false positive rates on real codebases. ML-trained models are better (5-10%), but still need tuning.

Context window limits: LLMs can’t see your entire codebase at once. A vulnerability in function A might depend on unsafe behavior in function B, three layers away. A 128k-token context window (as in GPT-4 Turbo) helps, but isn’t infinite.
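One common workaround is to split a file along function boundaries and pack the pieces into budget-sized chunks. The sketch below uses Python’s standard `ast` module and a character budget as a crude stand-in for tokens; the function name, the budget, and the sample source are illustrative assumptions.

```python
import ast


def function_chunks(source: str, max_chars: int = 4000) -> list[str]:
    """Group top-level functions and classes into chunks of at most max_chars.

    Module-level statements and decorators are skipped in this sketch, and a
    single definition longer than max_chars still becomes its own chunk.
    """
    tree = ast.parse(source)
    pieces = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

    chunks, current = [], ""
    for piece in pieces:
        # Start a new chunk when adding this piece would exceed the budget
        if current and len(current) + len(piece) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks


SAMPLE = "def a():\n    return 1\n\ndef b():\n    return 2\n\ndef c():\n    return 3\n"
print(len(function_chunks(SAMPLE, max_chars=50)))
```

Chunking by definition keeps each function intact, but it does not solve the cross-function problem: callers and callees can still land in different chunks, so the call-graph context has to be supplied separately.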

Training data bias: If your AI was trained on specific languages or frameworks, it’s blind to vulnerabilities in others. A model trained on Node.js might suck at Rust.

Hallucinations in explanations: An LLM might confidently explain why something is vulnerable even when parts of the explanation are wrong. Always verify.
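Part of that verification can be automated. One option, sketched here under the assumption that you prompt the model to return a line number and an exact quote for each claim (the `line` and `quote` keys are a made-up convention, not an API), is to check that the cited evidence really exists in the code:

```python
def evidence_checks_out(code: str, claim: dict) -> bool:
    """Return True only if the quoted text appears on the cited line."""
    lines = code.splitlines()
    line_no = claim.get("line")
    quote = (claim.get("quote") or "").strip()
    if not isinstance(line_no, int) or not quote:
        return False
    if not 1 <= line_no <= len(lines):
        return False  # the model cited a line that does not exist
    return quote in lines[line_no - 1]


snippet = 'import os\n\ndef run(cmd):\n    os.system("ping " + cmd)\n'

grounded = {"line": 4, "quote": 'os.system("ping " + cmd)'}
invented = {"line": 9, "quote": "subprocess.call(cmd, shell=True)"}

print(evidence_checks_out(snippet, grounded), evidence_checks_out(snippet, invented))
```

A passing check only shows that the model pointed at real code. Whether the reasoning about that code is correct still needs a human or a test.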

The speed/depth tradeoff: LLMs are slower and noisier but reason about context. ML and rule-based scanners are fast and precise on known patterns but miss edge cases. Pick the mix based on your risk tolerance.

Recommended Workflow (2025)

Here’s what actually works in production:

1. Fast layer (Semgrep + Snyk): Catch 80% of known vulnerabilities instantly in CI/CD.
2. LLM validation layer: Run high-severity findings through Claude/GPT for context-aware confirmation (costs money, so filter first).
3. Manual review queue: Hand off validated findings to engineers with explanations + suggested fixes.
4. Feedback loop: Track which findings were actually exploitable, fine-tune your thresholds, retrain custom models.
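The feedback loop can start very simply. As a rough sketch, assume your triage tool can export `(rule_id, was_real)` pairs (a hypothetical log format); the helper below lists rules whose confirmed rate falls under a threshold once enough samples exist:

```python
from collections import Counter


def noisy_rules(triage_log, min_precision=0.5, min_samples=3):
    """Return rule IDs whose share of confirmed findings is below min_precision."""
    seen, confirmed = Counter(), Counter()
    for rule_id, was_real in triage_log:
        seen[rule_id] += 1
        if was_real:
            confirmed[rule_id] += 1
    return sorted(
        rule for rule, total in seen.items()
        if total >= min_samples and confirmed[rule] / total < min_precision
    )


# Made-up triage history: (rule_id, engineer confirmed it was exploitable)
log = [
    ("hardcoded-password", True), ("hardcoded-password", True),
    ("hardcoded-password", False),
    ("weak-hash", False), ("weak-hash", False), ("weak-hash", True),
    ("weak-hash", False),
    ("open-redirect", False),
]

print(noisy_rules(log))
```

Rules that show up here are candidates for tighter patterns, a lower severity, or mandatory LLM validation before they reach an engineer.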

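# Example CI/CD flow
name: Security Scan
on: [pull_request]

jobs:
  scan:
    runs-on: ubuntu-latest
    env:
      # Secrets assumed to be configured in the repository settings
      SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4

      - name: Install scanners
        run: |
          python -m pip install semgrep anthropic
          npm install -g snyk

      # Fast layer: Semgrep + Snyk
      - name: Semgrep scan
        run: semgrep scan --config auto --json --output=semgrep.json .

      - name: Snyk scan
        # snyk exits non-zero when it finds issues; defer the pass/fail decision
        run: snyk code test --json > snyk.json || true

      # Slow layer: LLM validation (only high/critical)
      # validate_findings.py is your own script; it must write validated.json
      - name: LLM validation
        run: python validate_findings.py

      # Fail if findings exceed threshold
      - name: Check results
        run: |
          CRITICAL=$(jq '[.[] | select(.severity=="CRITICAL")] | length' validated.json)
          if [ "$CRITICAL" -gt 0 ]; then exit 1; fi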
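The workflow leaves `validate_findings.py` undefined. One possible shape for it, as a sketch only: read `semgrep.json`, keep the blocking severities, ask a validator for a verdict, and write `validated.json` as a JSON array with a `severity` field so the `jq` check works. The validator interface and the stub below are assumptions; a real version would call an LLM where the stub sits.

```python
import json
from typing import Callable

# Severities worth paying an LLM to review (legacy and newer Semgrep labels)
BLOCKING = {"ERROR", "HIGH", "CRITICAL"}


def load_candidates(path: str) -> list[dict]:
    """Read a Semgrep JSON report and keep only blocking-severity findings."""
    with open(path, encoding="utf-8") as handle:
        report = json.load(handle)
    candidates = []
    for result in report.get("results", []):
        extra = result.get("extra", {})
        severity = str(extra.get("severity", "")).upper()
        if severity in BLOCKING:
            candidates.append({
                "rule_id": result.get("check_id", "unknown"),
                "path": result.get("path", ""),
                "line": result.get("start", {}).get("line"),
                "message": extra.get("message", ""),
                "severity": severity,
            })
    return candidates


def validate_file(in_path: str, out_path: str,
                  validator: Callable[[dict], dict]) -> list[dict]:
    """Write confirmed findings to out_path as a JSON array and return them."""
    validated = []
    for finding in load_candidates(in_path):
        verdict = validator(finding)  # expected keys: 'confirmed', 'severity'
        if verdict.get("confirmed"):
            validated.append(
                {**finding, "severity": verdict.get("severity", finding["severity"])}
            )
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(validated, handle, indent=2)
    return validated


def stub_validator(finding: dict) -> dict:
    # Placeholder for a real LLM call: confirms everything and maps
    # Semgrep's ERROR label onto CRITICAL.
    severity = "CRITICAL" if finding["severity"] in {"ERROR", "CRITICAL"} else "HIGH"
    return {"confirmed": True, "severity": severity}
```

In CI the script would end with something like `validate_file("semgrep.json", "validated.json", stub_validator)`, with the stub swapped for a function that prompts the model and parses its verdict.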

What’s Actually New in 2025

Multimodal reasoning: Tools now combine code analysis + dependency graphs + runtime behavior to understand vulnerability chains.
Automated exploitation attempts: Some vendors are using AI to automatically attempt exploits against findings to confirm they’re real (vs. false positives).
LLM-driven remediation: Not just finding vulns, but auto-generating and testing patches. Still experimental, but promising.
Supply chain + AI: Tools now use AI to identify vulnerable dependencies, not just code you wrote. Third-party code is a major source of real-world exploits.

Bottom Line

SAST works. SAST + AI works better. The improvements are real: fewer false positives, better context, faster remediation, clearer explanations. But AI isn’t a replacement for foundational static analysis—it’s a layer on top.

Start with battle-tested SAST (Semgrep, Snyk). Add ML-trained scanning to catch semantic patterns. Layer in LLMs (Claude/GPT) for high-severity findings to reduce false positives and explain exploitability. Close the loop with human judgment and feedback.

The win isn’t more findings. It’s better findings, faster validation, and smarter prioritization. Use traditional SAST as your foundation. Let AI make it smarter.

Go build it.