
Using LLMs to Find Security Bugs: A Practitioner’s Playbook

TL;DR

LLMs won’t replace AppSec.
They will dramatically compress the search space.

If you use them right:

  • Run multi-model analysis (Opus + GPT + Gemini)
  • Structure prompts around attack surfaces, not “find bugs”
  • Require PoCs or tests for validation
  • Trust only cross-model consensus or reproducible exploits

If you don’t do this, you’ll drown in false positives.


Security research has always been asymmetric.
Attackers need one bug; defenders need zero.
Historically, scale worked against defenders.

LLMs start to rebalance that—not by magically finding zero-days, but by acting as a fast, always-on analyst that can:

  • Read entire subsystems in seconds
  • Connect logic across files
  • Generate realistic attack paths

Used correctly, they don’t replace expertise—they let you spend it where it matters.
Used incorrectly, they produce confident nonsense.
This is a practitioner’s workflow that actually works.

Why LLMs Are Useful

Let’s be blunt.

They’re very good at:

  • Cross-file reasoning (auth flows, data paths)
  • Recognizing known vulnerability patterns
  • Generating attack scenarios you didn’t think of
  • Turning vague suspicions into concrete hypotheses

They’re bad at:

  • Exhaustive coverage (though they are improving fast)
  • Subtle timing bugs (race conditions, TOCTOU)
  • Deep protocol-level vulnerabilities
  • Knowing when they’re wrong

The key shift:

LLMs can help you find these "old" bug classes at scale.
They generate good guesses in volume.

Your job is to filter, validate, and exploit.

Rule #1:
If two models independently flag the same issue, pay attention.
If one model does, assume it’s wrong until proven otherwise.

The Real Architecture

Most people get this wrong. They treat LLMs like scanners.
Don’t.
Use this instead:

Static tools → Context builder → Multi-model reasoning → Validation

Deterministic layer

  • Semgrep / CodeQL
  • Dependency scanning (OSV/Snyk)
  • Secret detection

Context builder (critical, often skipped)
Feed models:

  • Changed files (not entire repo blindly)
  • Call graph (who calls what)
  • Auth boundaries
  • Data flow (input → transformation → sink)

Multi-model analysis

  • Gemini → wide context
  • Opus → deep reasoning
  • GPT → structured judgment

Validation layer (non-negotiable)

  • Generate PoCs
  • Run tests / fuzzing
  • Score findings

If you skip validation, the system collapses.
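
The four layers wire together as a straight pipeline. The stage functions here are hypothetical placeholders standing in for Semgrep/CodeQL, your context builder, the multi-model fan-out, and the PoC runner; only the shape is the point.

```python
from typing import Callable

def run_pipeline(repo_path: str,
                 static_scan: Callable[[str], list],
                 build_context: Callable[[str, list], dict],
                 query_models: Callable[[dict], list],
                 validate: Callable[[list], list]) -> list:
    """Deterministic tools first, then context, then models, then validation.
    Each stage narrows what the next one has to look at."""
    static_findings = static_scan(repo_path)           # Semgrep / CodeQL / secrets
    context = build_context(repo_path, static_findings)
    hypotheses = query_models(context)                 # multi-model fan-out
    return validate(hypotheses)                        # keep only what reproduces
```

Notice that `validate` is the last, mandatory stage; dropping it is exactly the "collapse" described above.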

The Four-Phase Workflow

Phase 1 — Recon & Attack Surface Mapping

Before looking for bugs, map where they can exist.

| Category | Best Model(s) | Technique |
|---|---|---|
| Injection (SQL/NoSQL/LLM) | Gemini + Claude | Prompt for taint analysis |
| AuthN/AuthZ flaws | All three | Role-play as attacker |
| Cryptography / Secrets | Gemini + Claude | Multimodal + static rules |
| Business Logic | GPT + Gemini | Chain-of-thought |
| Supply-chain / Deps | All | Cross-reference with osv.dev |
| API / Rate-limit / SSRF | GPT | Payload generation |
| Smart contracts (if .eth) | Claude | Slither + manual audit combo |

You can use something like this prompt:

[SYSTEM] You are a world-class security researcher who has found 50+ CVEs and multiple bug-bounty $100k+ payouts.
[CONTEXT] <entire file or relevant files>
[ TASK ] Perform a deep security audit for <specific category, e.g., "IDOR, broken access control, race conditions">.
1. List every possible attack vector.
2. For each vector, give:
- Likelihood (1-5)
- Impact (1-5)
- Exact vulnerable code snippet with line numbers
- Proof-of-concept payload or curl command
- Suggested fix (with secure code example)
3. Rank by risk score (Likelihood × Impact)
Output ONLY in markdown table + code blocks.

Currently the best model: Gemini 3.1 Pro
(or the latest available, since this post will age quickly)

It handles massive context (1M tokens)—entire repos, specs, or docs.

What to extract:

  • Entry points (HTTP, CLI, background jobs)
  • Auth boundaries
  • Trust zones
  • Privileged operations

Then escalate to Claude Opus 4.7 (or the current latest as this post will age quickly) for deeper reasoning:

  • STRIDE analysis
  • Multi-step threat chains

Use GPT-5.4 (or… you know…) for:

  • Dependency + CVE triage
  • Protocol-level sanity checks

High-value output:
A structured map of:

entry point → trust level → reachable sensitive operations

That map drives everything else.
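
The map itself can be as simple as structured records. Field names and the example routes below are illustrative, not from any real system:

```python
from dataclasses import dataclass, field

@dataclass
class EntryPoint:
    """One row of the attack-surface map: where input enters,
    who can reach it, and what it can ultimately touch."""
    route: str
    trust_level: str                  # e.g. "anonymous", "user", "admin"
    sensitive_ops: list[str] = field(default_factory=list)

surface = [
    EntryPoint("/api/export", "user", ["read_all_invoices"]),
    EntryPoint("/webhooks/pay", "anonymous", ["credit_account"]),
]

# Anonymous-reachable privileged operations are the first thing to audit.
hot = [e for e in surface if e.trust_level == "anonymous" and e.sensitive_ops]
```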

Phase 2 — Automated Code Review (Highest ROI)

This is where most value comes from.
But “review this code” is useless.

You need specialized passes.

Pass 1: Attack surface extraction

Map all entry points, auth checks, and trust boundaries.
Return structured output only.

Pass 2: Taint analysis (Opus)

Trace user input → transformations → sinks.
Output: source → sink → vuln → severity

Pass 3: Auth & access control (GPT)

Find IDOR, missing checks, role escalation paths.
Focus on inconsistencies across endpoints.

Pass 4: Injection paths

Trace input into SQL, shell, templates, deserialization.
Flag only realistic exploit paths.

Pass 5: Business logic abuse (Opus)

Assume a valid user.
Find ways to break workflows, not systems.

That last one is where LLMs outperform traditional tools.
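
The five passes translate naturally into a table of focused prompt templates you run one at a time. The wording here is a sketch of the pass descriptions above, not battle-tested prompts:

```python
REVIEW_PASSES = {
    "attack_surface": ("Map all entry points, auth checks, and trust boundaries "
                       "in:\n{code}\nReturn structured output only."),
    "taint": ("Trace user input through transformations to sinks in:\n{code}\n"
              "Output: source -> sink -> vuln -> severity."),
    "access_control": ("Find IDOR, missing checks, and role-escalation paths "
                       "in:\n{code}\nFocus on inconsistencies across endpoints."),
    "injection": ("Trace input into SQL, shell, templates, and deserialization "
                  "in:\n{code}\nFlag only realistic exploit paths."),
    "logic_abuse": ("Assume a valid, authenticated user. Find ways to abuse the "
                    "workflow in:\n{code}"),
}

def render_pass(name: str, code: str) -> str:
    """One focused prompt per pass beats a single 'review this code'."""
    return REVIEW_PASSES[name].format(code=code)
```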

Phase 3 — Exploit Research & PoC Generation

This is where things get interesting.
Once you have a possible bug:

Use GPT for payload generation

  • WAF bypass variants
  • Encoding tricks
  • Edge cases

Use Opus for attack chains

  • Multi-step abuse scenarios
  • State manipulation
  • Privilege escalation flows

Generate PoCs (critical step)

Generate a minimal reproducible exploit or test case.

Then actually run it:

  • API tests
  • Integration tests
  • Fuzzing harnesses

Outcome:

  • Works → real vulnerability
  • Doesn’t → discard

This step alone removes ~80% of the noise.
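
Running generated PoCs can be as simple as a subprocess harness that treats exit status as the verdict. This sketch assumes each PoC is a standalone Python snippet that exits 0 when the exploit reproduces; a hang counts as a failure:

```python
import subprocess
import sys

def validate_poc(script: str, timeout: int = 30) -> bool:
    """Run a generated PoC as a standalone script in a subprocess.
    Exit code 0 means the exploit reproduced; anything else (or a
    timeout) means the finding gets discarded."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

In a real setup you would run this in a sandboxed container, never on the host, since the script is model-generated code.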

Phase 4 — Reporting & Remediation

LLMs are extremely useful here—if you keep them honest.

CVSS scoring (Opus)

Structured, consistent severity

Patch generation

Ask one model to fix it
Ask another model to break the fix

This “adversarial review” catches a surprising number of bad patches.
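
The fix/break loop looks like this, with `propose_fix` and `attack_fix` as hypothetical wrappers around two different model APIs:

```python
def adversarial_patch_review(vuln_report: str,
                             propose_fix,       # model A: write a patch
                             attack_fix,        # model B: try to break it
                             max_rounds: int = 3):
    """Ask one model to patch, another to break the patch.
    Return the patch once the attacker finds no bypass, or None
    if the loop never converges (escalate that to a human)."""
    patch = propose_fix(vuln_report, previous_bypass=None)
    for _ in range(max_rounds):
        bypass = attack_fix(vuln_report, patch)
        if bypass is None:
            return patch                       # survived adversarial review
        patch = propose_fix(vuln_report, previous_bypass=bypass)
    return None                                # never converged
```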

Reporting (GPT)

Turn raw findings into:

  • Repro steps
  • Impact narrative
  • Fix recommendations

Multi-Model Strategy

Each model has a role:

  • Gemini 3.1 Pro
    Wide context, architecture awareness
  • Claude Opus 4.7
    Deep reasoning, best for logic + data flow
  • GPT-5.4
    Structured output, protocols, consistency

Simple rule:

  • 2 models agree → high signal
  • 1 model → treat as hypothesis

Scoring System (Prevents Noise Collapse)

If you don’t rank findings, this becomes useless fast.
Example:

| Signal | Score |
|---|---|
| Multi-model agreement | +3 |
| Static tool match | +3 |
| PoC generated | +4 |
| PoC works | +10 |
| Unrealistic assumptions | -5 |

Only escalate:

  • ≥7 → must fix
  • 4–6 → review
  • <4 → ignore
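
The scoring table and escalation thresholds above fit in one small function; signal names are just keys I chose for the example:

```python
SIGNAL_WEIGHTS = {
    "multi_model_agreement": 3,
    "static_tool_match": 3,
    "poc_generated": 4,
    "poc_works": 10,
    "unrealistic_assumptions": -5,
}

def triage(signals: set[str]) -> str:
    """Sum the weights of a finding's signals and bucket it:
    >=7 must fix, 4-6 review, <4 ignore."""
    score = sum(SIGNAL_WEIGHTS[s] for s in signals)
    if score >= 7:
        return "must fix"
    if score >= 4:
        return "review"
    return "ignore"
```

A working PoC alone (+10) clears the "must fix" bar, which is the whole point: reproduction outranks opinion.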

Where This Works Best

High ROI targets:

  • Authentication / RBAC
  • Multi-tenant isolation
  • Payments / credits
  • File uploads
  • Webhooks
  • Internal APIs exposed externally

That’s where logic bugs live—and where LLMs shine.

What Not to Do

  • Don’t run a single model
  • Don’t scan the whole repo blindly
  • Don’t trust “no issues found”
  • Don’t optimize for volume

Optimize for real, exploitable findings.

Where This Is Going

The next step is obvious: agentic security systems.

LLMs that:

  • Run scanners
  • Launch fuzzers
  • Generate hypotheses
  • Validate them automatically

We’re not fully there yet—but close. Think of something like openClaw running a few agents that do these tasks 24/7.
The teams that build structured workflows now will have a massive advantage when that layer matures.

Good luck and be safe 👊🏽

