How I Compiled 647 Semgrep Rules to Native Rust
Source: Dev.to
I love Semgrep. It has thousands of community‑contributed security rules that catch real vulnerabilities. But every time I ran it on a large codebase, I’d wait… and wait.
The problem? Semgrep loads and interprets its YAML rules at runtime, with a Python CLI driving the matching engine. For a 500 K‑line monorepo, that meant 4+ minutes per scan.
So I asked myself: what if I compiled those rules to native code instead?
The Idea
Semgrep rules are just pattern matching. A rule like this:
rules:
  - id: sql-injection
    pattern: execute($QUERY)
    message: "Possible SQL injection"
says “find any call to execute() with one argument.” That’s not fundamentally different from what Tree‑sitter does with its query language.
What if I translated Semgrep patterns into Tree‑sitter queries at build time, embedded them in the binary, and matched against ASTs directly?
The Hard Part: Metavariables
Semgrep uses $VARIABLES to capture arbitrary code:
eval($USER_INPUT)
This matches eval(x), eval(foo.bar), eval(getInput()) — anything.
Tree‑sitter queries don’t have metavariables; they have captures:
(call_expression
  function: (identifier) @func
  arguments: (arguments (_) @arg))
The @func and @arg are captures — they grab whatever matches that position.
So I built a translator. It parses Semgrep patterns, identifies metavariables, and generates Tree‑sitter queries with captures in the right places.
// Simplified version of the pattern compiler
fn compile_pattern(semgrep: &str) -> TreeSitterQuery {
    let ast = parse_semgrep_pattern(semgrep);
    let mut query = String::new();
    for node in ast.walk() {
        match node {
            Metavar(name) => {
                // $X becomes (_) @x
                query.push_str(&format!("(_) @{}", name.to_lowercase()));
            }
            Literal(text) => {
                query.push_str(&format!("\"{}\"", text));
            }
            // ... more cases
        }
    }
    TreeSitterQuery::new(&query)
}
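To make the metavariable-to-capture step concrete, here is a minimal, standard-library-only sketch that handles one narrow shape: a flat call such as eval($USER_INPUT). It is not the compiler from the repo. The function name call_pattern_to_query is hypothetical, the node names (call_expression, identifier, arguments) are taken from the query shown earlier and depend on the grammar, and the `.` anchors and `#eq?` predicate are an assumption about how one might pin the argument count and the callee name.

```rust
/// Translate a flat call pattern such as `eval($USER_INPUT)` into a
/// Tree-sitter query string.
///
/// Returns None for anything this sketch does not handle: zero-argument
/// calls, nested calls, `...`, or literal arguments.
fn call_pattern_to_query(pattern: &str) -> Option<String> {
    let pattern = pattern.trim();
    let open = pattern.find('(')?;
    let callee = pattern[..open].trim();
    // Text between the first '(' and the final ')'.
    let inner = pattern.strip_suffix(')')?.get(open + 1..)?;

    let is_ident =
        |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    if !is_ident(callee) {
        return None;
    }

    // Each `$NAME` becomes `(_) @name`. The `.` anchors (an assumption of
    // this sketch) require the captures to be adjacent and to cover the
    // whole argument list, so `f($A)` does not match a two-argument call.
    let mut args = String::from(".");
    for raw in inner.split(',') {
        let name = raw.trim().strip_prefix('$')?;
        if !is_ident(name) {
            return None;
        }
        args.push_str(&format!(" (_) @{} .", name.to_lowercase()));
    }

    Some(format!(
        "(call_expression function: (identifier) @func arguments: (arguments {args}) (#eq? @func \"{callee}\"))"
    ))
}
```

Calling call_pattern_to_query("eval($USER_INPUT)") yields a query with a single @user_input capture, while a pattern containing `...` returns None, which is the signal to use a different matching strategy.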
The Ellipsis Problem
Semgrep’s ... operator matches “zero or more of anything”:
func($ARG, ...)
This matches func(a), func(a, b), func(a, b, c, d, e).
Tree‑sitter queries can’t express this directly. For such patterns I fall back to walking the AST manually and checking if the structure matches. It isn’t as fast as native queries, but it’s still faster than interpreting the rule at runtime.
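The fallback can be pictured with a small sketch of ellipsis matching over an argument list. This is an illustration under simplifying assumptions, not the walker from the repo: arguments are represented by their source text rather than by Tree-sitter nodes, the ArgPattern type and args_match function are hypothetical names, and repeated metavariables are not checked for equal bindings.

```rust
/// One element of a Semgrep-style argument pattern.
#[derive(Debug, Clone, PartialEq)]
enum ArgPattern {
    /// `$NAME`: matches exactly one argument, whatever it is.
    Metavar(String),
    /// `...`: matches zero or more arguments.
    Ellipsis,
    /// Anything else: must equal the argument's source text.
    Literal(String),
}

/// Returns true if `args` (the source text of each call argument, in
/// order) satisfies `pattern`.
fn args_match(pattern: &[ArgPattern], args: &[&str]) -> bool {
    match pattern.split_first() {
        // Pattern exhausted: succeed only if no arguments are left over.
        None => args.is_empty(),
        // Let the ellipsis absorb 0, 1, 2, ... arguments and retry the rest.
        Some((ArgPattern::Ellipsis, rest)) => {
            (0..=args.len()).any(|skip| args_match(rest, &args[skip..]))
        }
        // A metavariable consumes exactly one argument.
        Some((ArgPattern::Metavar(_), rest)) => {
            !args.is_empty() && args_match(rest, &args[1..])
        }
        // A literal must equal the next argument's text.
        Some((ArgPattern::Literal(text), rest)) => {
            args.first() == Some(&text.as_str()) && args_match(rest, &args[1..])
        }
    }
}
```

With the pattern [Metavar("ARG"), Ellipsis], this accepts one or more arguments and rejects an empty list, mirroring func($ARG, ...). The backtracking over `skip` is what a fixed sequence of captures cannot do on its own.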
Build‑Time Compilation
The magic happens in build.rs. At compile time:
- Parse all 647 Semgrep YAML files
- Translate each pattern to a Tree‑sitter query (or AST walker)
- Serialize everything to a binary blob
Then embed it with include_bytes!():
// In the compiled binary
static RULES: &[u8] = include_bytes!("compiled_rules.bin");

// At runtime – instant loading
fn load_rules() -> RuleSet {
    bincode::deserialize(RULES).unwrap()
}
No file I/O, no YAML parsing, no pattern compilation at runtime. The rules are just there.
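To show what "serialize to a blob, then decode at startup" involves, here is a standard-library-only sketch of the round trip. It is a stand-in for illustration, not the format the project uses (the post relies on bincode). The CompiledRule and Matcher types, the tag bytes, and the length-prefixed layout are all assumptions. In a real build, the output of encode_rules would be written to a file by build.rs and handed to decode_rules via include_bytes!.

```rust
/// How a compiled rule is matched at scan time (hypothetical types).
#[derive(Debug, Clone, PartialEq)]
enum Matcher {
    /// A Tree-sitter query string produced at build time.
    Query(String),
    /// Fallback: the raw pattern, handled by a manual AST walker.
    Walker(String),
}

#[derive(Debug, Clone, PartialEq)]
struct CompiledRule {
    id: String,
    matcher: Matcher,
}

/// Append a string as a little-endian u32 length followed by its bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Read a little-endian u32 and advance the cursor.
fn take_u32(buf: &mut &[u8]) -> Option<u32> {
    let data: &[u8] = *buf;
    if data.len() < 4 {
        return None;
    }
    *buf = &data[4..];
    Some(u32::from_le_bytes([data[0], data[1], data[2], data[3]]))
}

/// Read a length-prefixed UTF-8 string and advance the cursor.
fn take_str(buf: &mut &[u8]) -> Option<String> {
    let len = take_u32(buf)? as usize;
    let data: &[u8] = *buf;
    if data.len() < len {
        return None;
    }
    *buf = &data[len..];
    String::from_utf8(data[..len].to_vec()).ok()
}

/// Build-time side: turn the rule set into one byte blob.
fn encode_rules(rules: &[CompiledRule]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(rules.len() as u32).to_le_bytes());
    for rule in rules {
        put_str(&mut out, &rule.id);
        let (tag, body) = match &rule.matcher {
            Matcher::Query(q) => (0u8, q),
            Matcher::Walker(p) => (1u8, p),
        };
        out.push(tag);
        put_str(&mut out, body);
    }
    out
}

/// Runtime side: decode the embedded blob; None if it is malformed.
fn decode_rules(mut buf: &[u8]) -> Option<Vec<CompiledRule>> {
    let count = take_u32(&mut buf)?;
    let mut rules = Vec::new();
    for _ in 0..count {
        let id = take_str(&mut buf)?;
        let (&tag, rest) = buf.split_first()?;
        buf = rest;
        let body = take_str(&mut buf)?;
        let matcher = match tag {
            0 => Matcher::Query(body),
            1 => Matcher::Walker(body),
            _ => return None,
        };
        rules.push(CompiledRule { id, matcher });
    }
    Some(rules)
}
```

Returning Option from the decoder instead of unwrapping is a design choice of this sketch. Because the blob is produced by the same build, a decode failure points to a version mismatch rather than to bad user input.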
Results
On a 500 K LOC monorepo:
| Tool | Time |
|---|---|
| Semgrep | 4 m 12 s |
| RMA | 23 s |
About 11× faster (252 s vs. 23 s). The difference grows as codebases get larger.
What’s Still Rough
- False positives on generated code (working on better heuristics)
- Some Semgrep features aren’t supported yet (taint mode is partial)
- Error messages could be clearer
Try It
cargo install rma-cli
rma scan .
Or with the interactive TUI:
rma scan . --interactive
It’s MIT licensed.
I’d love feedback, especially if you try it on your own projects. What rules are missing? Too many false positives? Let me know.
If you’re interested in the pattern compiler implementation, check out crates/rules/build.rs in the repo.