Why Codex Security Doesn’t Include a SAST Report

Published: March 15, 2026 at 08:00 PM EDT

Source: OpenAI Blog

Static Application Security Testing (SAST) vs. Codex Security

For decades, static application security testing (SAST) has been one of the most effective ways security teams scale code review.

When we built Codex Security, we made a deliberate design choice: we didn’t start by importing a static analysis report and asking the agent to triage it. Instead, we designed the system to start with the repository itself—its architecture, trust boundaries, and intended behavior—and to validate what it finds before it asks a human to spend time on it.

Why the shift?

The hardest vulnerabilities usually aren’t simple data‑flow problems. They happen when code appears to enforce a security check, but that check doesn’t actually guarantee the property the system relies on. In other words, the challenge isn’t just tracking how data moves through a program—it’s determining whether the defenses in the code really work.

The classic SAST model

SAST is often framed as a clean pipeline:

1. Identify a source of untrusted input.
2. Track data through the program.
3. Flag cases where that data reaches a sensitive sink without sanitization.

It’s an elegant model, and it covers a lot of real bugs.
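
To make the pipeline concrete, here is a minimal sketch of source-to-sink taint tracking over a toy, straight-line program. Everything in it is an assumption made for illustration: the `Call` record, the function names in `SOURCES`, `SANITIZERS`, and `SINKS`, and the single-pass propagation rule. Real analyzers work on far richer program representations.

```python
from dataclasses import dataclass

# Hypothetical labels a toy analyzer might be configured with.
SOURCES = {"read_request_param"}
SANITIZERS = {"sanitize_html"}
SINKS = {"render"}


@dataclass(frozen=True)
class Call:
    target: str   # variable the result is assigned to ("" for a bare call)
    func: str     # name of the function being called
    args: tuple   # names of the variables passed as arguments


def find_tainted_sinks(program):
    """Return (statement index, sink name) for each sink fed by tainted data."""
    tainted = set()
    findings = []
    for index, stmt in enumerate(program):
        if stmt.func in SINKS:
            if any(arg in tainted for arg in stmt.args):
                findings.append((index, stmt.func))
        elif stmt.func in SOURCES:
            tainted.add(stmt.target)          # step 1: untrusted input enters
        elif stmt.func in SANITIZERS:
            tainted.discard(stmt.target)      # sanitizer output is assumed clean
        elif any(arg in tainted for arg in stmt.args):
            tainted.add(stmt.target)          # step 2: taint propagates
        else:
            tainted.discard(stmt.target)
    return findings


unsanitized = [
    Call("q", "read_request_param", ()),
    Call("msg", "format_greeting", ("q",)),
    Call("", "render", ("msg",)),
]
sanitized = [
    Call("q", "read_request_param", ()),
    Call("clean", "sanitize_html", ("q",)),
    Call("", "render", ("clean",)),
]

print(find_tainted_sinks(unsanitized))  # [(2, 'render')]
print(find_tainted_sinks(sanitized))    # []
```

Note that the sketch treats any call to a listed sanitizer as fully cleaning its output. That yes/no label is exactly the kind of approximation a tool has to make when it cannot evaluate what the sanitizer does.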

In practice, however, SAST must make approximations to stay tractable at scale—especially in codebases with indirection, dynamic dispatch, callbacks, reflection, and framework‑heavy control flow. Those approximations aren’t a knock on SAST; they’re the reality of trying to reason about code without executing it.

Note: This is not the sole reason Codex Security doesn’t start with a SAST report.

The deeper issue: after the source‑to‑sink trace

Even when static analysis correctly traces input across multiple functions and layers, it still has to answer the question that actually determines whether a vulnerability exists: did the defense really work?

Example pattern:

# Pseudocode
clean = sanitize_html(user_input)   # sanitizer runs
render(clean)                       # sanitized output rendered; is it safe in this context?

A static analyzer can see that the sanitizer ran, but it usually can’t determine whether that sanitizer is sufficient for the specific rendering context, template engine, encoding behavior, and downstream transformations involved.

Key distinction:

- “The code calls a sanitizer.”
- “The system is safe.”

The latter requires reasoning about the effectiveness of the check, not just its presence.
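
A small sketch of that gap, using Python's standard `html.escape` as the sanitizer. The `render_link` helper and the link text are hypothetical. The escaping is correct for HTML text and attribute syntax, yet it does nothing about the URL scheme, so the same call is sufficient for one input and insufficient for another.

```python
import html


def render_link(url: str) -> str:
    """Build an anchor tag after escaping the URL for an HTML attribute."""
    safe_url = html.escape(url, quote=True)  # the sanitizer runs on every input
    return f'<a href="{safe_url}">profile</a>'


# Markup injection is neutralized: quotes and angle brackets are escaped.
print(render_link('"><script>alert(1)</script>'))

# A javascript: URL contains no characters that html.escape touches,
# so it passes through unchanged into the href attribute.
print(render_link("javascript:alert(1)"))
```

In both calls "the code calls a sanitizer." Only the rendering context (an `href` attribute that a browser will interpret as a URL) tells you that the second output is still dangerous.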

Real‑world illustration

A web application receives a JSON payload, extracts a redirect_url, validates it against an allow‑list regex, URL‑decodes it, and passes the result to a redirect handler.

A classic source‑to‑sink report would be:

untrusted input → regex check → URL decode → redirect

The real question is whether the check still constrains the value after the transformations that follow.

If the regex runs before decoding, does it actually constrain the decoded URL the way the redirect handler interprets it?
Answering that requires reasoning about the entire transformation chain: the regex, decoding/normalization, URL parsing edge cases, and redirect logic.

Many practical vulnerabilities look like this: order‑of‑operations mistakes, partial normalization, parsing ambiguities, and mismatches between validation and interpretation. The data flow is visible; the weakness is in how constraints propagate, or fail to propagate, through the transformation chain.
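
The order-of-operations problem can be sketched in a few lines of Python. This is an illustration under assumptions, not the code of any real application: the `RELATIVE_PATH` allow-list, the function names, and the `evil.example` payload are all made up. The regex is meant to allow only site-relative paths and explicitly rejects a leading `//`.

```python
import re
from urllib.parse import unquote, urlsplit

# Hypothetical allow-list: a path starting with "/" but not with "//".
RELATIVE_PATH = re.compile(r"/(?!/)[A-Za-z0-9/%._-]*")


def is_off_site(target: str) -> bool:
    """True if a URL parser sees a host in the target, e.g. '//evil.example'."""
    return urlsplit(target).netloc != ""


def validate_then_decode(raw: str) -> str:
    """The order from the example: regex check first, URL decode second."""
    if not RELATIVE_PATH.fullmatch(raw):
        raise ValueError("redirect target rejected")
    return unquote(raw)


def decode_then_validate(raw: str) -> str:
    """Same check, applied to the value the redirect handler will see."""
    decoded = unquote(raw)
    if not RELATIVE_PATH.fullmatch(decoded):
        raise ValueError("redirect target rejected")
    return decoded


payload = "/%2Fevil.example/login"

target = validate_then_decode(payload)
print(target, is_off_site(target))  # //evil.example/login True

try:
    decode_then_validate(payload)
except ValueError as error:
    print("blocked:", error)
```

Both functions contain the same regex and the same decode call, so a report that lists "regex check" and "URL decode" on the path looks identical for each. Only reasoning about what the check constrains after decoding separates them. Even the second ordering holds only if nothing downstream decodes the value again.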

Concrete case:
CVE‑2024‑29041 – Express was affected by an open‑redirect issue where malformed URLs could bypass common allow‑list implementations because of how redirect targets were encoded and then interpreted. The data‑flow was straightforward; the harder question—and the one that determined whether the bug existed—was whether the validation still held after the transformation chain.

How Codex Security tackles the problem

Codex Security is built around a simple goal: reduce triage by surfacing issues with stronger evidence. In the product, that means using repo‑specific context (including a threat model) and validating high‑signal issues in an isolated environment before surfacing them.

When Codex Security encounters a boundary that looks like “validation” or “sanitization,” it doesn’t treat that as a checkbox. It tries to understand what the code is attempting to guarantee—and then it tries to falsify that guarantee.
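
As a rough picture of what "trying to falsify a guarantee" can look like, the sketch below isolates a tiny validation slice and throws generated inputs at it. It is not Codex Security's implementation. The slice (`resolve_upload_path`), the upload root, the token alphabet, and the invariant are all assumptions chosen for illustration: the code intends to guarantee that every accepted name resolves inside `BASE`.

```python
import posixpath
import random
from urllib.parse import unquote

BASE = "/srv/uploads"  # hypothetical upload root


def resolve_upload_path(name: str) -> str:
    """Slice under test: reject traversal, then decode and resolve the path."""
    if ".." in name or name.startswith("/"):
        raise ValueError("rejected")
    decoded = unquote(name)  # decoding happens after the check
    return posixpath.normpath(posixpath.join(BASE, decoded))


def stays_under_base(path: str) -> bool:
    """The guarantee the check is supposed to provide."""
    return path == BASE or path.startswith(BASE + "/")


def fuzz(trials: int = 2000, seed: int = 0):
    """Generate inputs from a small token alphabet and collect violations."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    tokens = ["a", "b", ".", "/", "%2e", "%2E", "%2f"]
    counterexamples = []
    for _ in range(trials):
        length = rng.randint(1, 5)
        candidate = "".join(rng.choice(tokens) for _ in range(length))
        try:
            resolved = resolve_upload_path(candidate)
        except ValueError:
            continue  # rejected inputs cannot violate the guarantee
        if not stays_under_base(resolved):
            counterexamples.append((candidate, resolved))
    return counterexamples


found = fuzz()
print(len(found), "accepted inputs escaped", BASE)
print(resolve_upload_path("%2e%2e/secret"))  # /srv/secret
```

Each counterexample is an input the check accepted together with the out-of-bounds path it produced, which is the kind of evidence that moves a finding from "suspicious" to "demonstrated."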

Typical workflow

1. Contextual code reading
- Read the relevant code path with full repository context, the way a security researcher would.
- Look for mismatches between intent and implementation (including comments, though the model doesn’t blindly trust them).
2. Isolate the smallest testable slice
- Extract a tiny code slice around the transformation pipeline.
- Write micro‑fuzzers for that slice.
3. Reason about constraints across transformations
- Treat checks as part of a chain, not as independent gates.
- When appropriate, formalize the problem as a satisfiability question (e.g., using a Python environment with z3‑solver). This is especially useful for integer overflows or architecture‑specific bugs.
4. Execute hypotheses in a sandbox
- Run the isolated slice in a sandboxed validation environment.
- Distinguish “this could be a problem” from “this is a problem.”
- A full end‑to‑end PoC compiled in debug mode provides the strongest proof.
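
To illustrate the satisfiability step, here is a minimal sketch that assumes the `z3-solver` package is installed. The scenario is hypothetical: a C-style guard `header_len + body_len <= limit` on 32-bit unsigned integers, intended to guarantee that `body_len` itself never exceeds the limit. The variable names and the limit are invented. The solver is asked whether the guard can pass while the guarantee is violated.

```python
def find_length_wraparound(limit: int = 1024):
    """Return (header_len, body_len) that pass the guard yet break its intent,
    or None if the solver finds no such pair."""
    # Imported here so the function can be defined without z3-solver present.
    from z3 import BitVec, Solver, UGT, ULE, sat

    # 32-bit bit-vectors wrap on overflow, like unsigned ints in C.
    header_len = BitVec("header_len", 32)
    body_len = BitVec("body_len", 32)

    solver = Solver()
    # The guard as written: unsigned (header_len + body_len) <= limit.
    solver.add(ULE(header_len + body_len, limit))
    # Negation of the intended guarantee: body_len is larger than the limit.
    solver.add(UGT(body_len, limit))

    if solver.check() != sat:
        return None
    model = solver.model()
    return (
        model.eval(header_len, model_completion=True).as_long(),
        model.eval(body_len, model_completion=True).as_long(),
    )


if __name__ == "__main__":
    print(find_length_wraparound())
```

A satisfying assignment is a concrete pair of lengths whose sum wraps around past 2**32, so the guard passes while `body_len` is far above the limit. Those values can then be fed to the isolated slice in the sandbox to confirm the behavior.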

Bottom line

SAST tells you where data flows.
Codex Security asks whether the constraints that should stop that data actually hold after all transformations.

By combining repository‑wide context, targeted isolation, formal reasoning, and sandboxed execution, Codex Security aims to surface higher‑confidence findings and reduce the manual triage burden.

The Key Shift

Instead of stopping at “a check exists,” the system should push toward “the invariant holds (or it doesn’t), and here’s the evidence.”
The model then chooses the best tool for that job.

A Reasonable Reaction

Why not do both?
Start with a SAST report, then use the agent to reason deeper.

Why Starting from a SAST Report Creates Predictable Failure Modes

1. Premature Narrowing

A findings list is a map of where a tool already looked.
Using it as the starting point can bias the system toward:
- Spending disproportionate effort in the same regions.
- Re‑using the same abstractions.
- Missing classes of issues that don’t fit the tool’s worldview.

2. Implicit, Hard‑to‑Unwind Judgments

Many SAST findings encode assumptions about sanitization, validation, or trust boundaries.
If those assumptions are wrong—or incomplete—feeding them into the reasoning loop shifts the agent from “investigate” to “confirm or dismiss,” which is not the intended behavior.

3. Evaluation Difficulty

When the pipeline starts with SAST output, it becomes hard to separate:
- What the agent discovered through its own analysis.
- What it inherited from another tool.
This separation is crucial for accurately measuring the system’s capabilities and for continuous improvement.

Our Approach: Codex Security

We built Codex Security to begin where security research begins: from the code and the system’s intent. Validation is used only to raise the confidence bar before we interrupt a human.

When SAST Tools Shine

- Enforcing secure coding standards.
- Catching straightforward source‑to‑sink issues.
- Detecting known patterns at scale with predictable trade‑offs.

They can be a strong part of defense‑in‑depth.

Scope of This Post

This post focuses on why an agent designed to reason about behavior and validate findings should not start its work anchored to a static findings list.

Beyond Source‑to‑Sink Thinking

Not every vulnerability is a data‑flow problem.
Many real failures are state and invariant problems:
- Workflow bypasses.
- Authorization gaps.
- “The system is in the wrong state” bugs.

For these bugs, a tainted value does not reach a single “dangerous sink.”
The risk lies in what the program assumes will always be true.
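
A toy workflow bypass shows the difference. The `Order` class, the handlers, and the invariant below are hypothetical. No untrusted string travels anywhere in this code, so there is no source-to-sink path to report; the bug is that one handler relies on an ordering that nothing enforces.

```python
from dataclasses import dataclass


@dataclass
class Order:
    paid: bool = False
    shipped: bool = False


def pay(order: Order) -> None:
    order.paid = True


def ship(order: Order) -> None:
    # Assumes callers only get here after pay(); nothing checks that.
    order.shipped = True


def ship_checked(order: Order) -> None:
    # Enforces the assumption instead of relying on call order.
    if not order.paid:
        raise PermissionError("order has not been paid")
    order.shipped = True


def invariant_holds(order: Order) -> bool:
    """The property the system relies on: shipped implies paid."""
    return order.paid or not order.shipped


bypassed = Order()
ship(bypassed)                    # skip the payment step entirely
print(invariant_holds(bypassed))  # False

guarded = Order()
try:
    ship_checked(guarded)
except PermissionError as error:
    print("blocked:", error)
print(invariant_holds(guarded))   # True
```

Finding this kind of issue means stating the invariant first and then looking for a reachable sequence of calls that breaks it.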

Looking Ahead

We expect the security‑tooling ecosystem to keep improving: static analysis, fuzzing, runtime guards, and agentic workflows will all have roles.

What We Want Codex Security to Excel At

Turning “this looks suspicious” into “this is real, here’s how it fails, and here’s a fix that matches system intent.”

This is the part that costs the most for security teams, and it’s where Codex Security aims to deliver the greatest value.