From 0 to 11 Bugs Fixed: How GoAWK Battle-Tested My 3000x Faster Regex Engine
Source: Dev.to
The Best Kind of Feedback
A week ago, I published “Go’s Regexp is Slow. So I Built My Own”. The response was incredible – but the most valuable feedback came from Ben Hoyt, creator of GoAWK.
He didn’t just read the article. He tried to actually use coregex.
“I’ve started integrating coregex into GoAWK… I’m finding a few issues.”
That message led to one of the most productive weeks of debugging I’ve ever had.
11 Bugs in 7 Days
Ben’s GoAWK test suite is ruthless – 1000+ regex patterns covering edge cases I never imagined. Here’s what he found:
| Day | Pattern or case | Area | Symptom |
|---|---|---|---|
| 1 | [^,]* | Negated char class | Crash |
| 1 | [oO]+d | Case‑insensitive | Wrong match |
| 2 | ^foo | Start anchor | Matched everywhere |
| 2 | \bword\b | Word boundary | Find returned empty |
| 3 | ^ in FindAll | Anchor in loop | Matched at every position |
| 3 | Error format | – | Different from stdlib |
| 4 | \w+@... | Capture groups | DFA returned false |
| 4 | (?s:.) | Inline flags | Ignored |
| 5 | a$ | End anchor | First call wrong |
| 6 | (#\\\n#!) | Longest() | – |
Each bug taught me something. Some were embarrassing oversights; others revealed fundamental gaps in my understanding.
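For reference, here is a minimal sketch of what Go's standard library returns for several of the patterns in the table. It uses only the stdlib regexp package, not coregex; the helper names firstMatch and matches and the sample inputs are my own illustrations, not cases from GoAWK's suite.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch returns the leftmost match of pattern in input using the
// standard library, or "" when there is no match.
func firstMatch(pattern, input string) string {
	return regexp.MustCompile(pattern).FindString(input)
}

// matches reports whether pattern matches anywhere in input.
func matches(pattern, input string) bool {
	return regexp.MustCompile(pattern).MatchString(input)
}

func main() {
	fmt.Printf("%q\n", firstMatch(`[^,]*`, "a,b"))            // "a"
	fmt.Printf("%q\n", firstMatch(`[oO]+d`, "fOod"))          // "Ood"
	fmt.Println(matches(`^foo`, "a foo"))                     // false
	fmt.Printf("%q\n", firstMatch(`\bword\b`, "a word here")) // "word"
	fmt.Println(matches(`(?s:.)`, "\n"), matches(`.`, "\n"))  // true false
	fmt.Printf("%q\n", firstMatch(`a$`, "banana"))            // "a"
}
```

Any engine that claims stdlib compatibility has to reproduce these answers exactly, which is the bar GoAWK's suite was holding coregex to.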
The Worst Bug: ^ Anchor
The start anchor (^) was my nemesis. It seemed simple – match only at position 0. But in a multi‑engine architecture, “simple” gets complicated fast.
- Version 1 – Naively checked pos == 0. Worked for IsMatch, broke for FindAllIndex.
- Version 2 – Added FindAt(haystack, offset) methods. Now FindAllIndex could tell the engine “this is position 5 in the original string.”
- Version 3 – Discovered DFA’s epsilonClosure didn’t respect anchors. Implemented proper LookSet following Rust’s regex‑automata.
Three attempts over two days. Ben kept testing. I kept fixing.
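The FindAll failure mode can be reproduced with the standard library alone. The sketch below is not coregex's code: naiveFindAll is a hypothetical loop that re-slices the haystack on each iteration, so the engine loses track of the original offset and ^ matches at every position. stdFindAll is the stdlib call, which keeps that context.

```go
package main

import (
	"fmt"
	"regexp"
)

// naiveFindAll re-slices the haystack on every iteration, so the engine
// believes each slice starts at position 0 and `^` matches again.
func naiveFindAll(pattern, s string) [][]int {
	re := regexp.MustCompile(pattern)
	var out [][]int
	pos := 0
	for pos <= len(s) {
		loc := re.FindStringIndex(s[pos:])
		if loc == nil {
			break
		}
		out = append(out, []int{pos + loc[0], pos + loc[1]})
		if loc[1] == 0 {
			pos++ // avoid looping forever on an empty match
		} else {
			pos += loc[1]
		}
	}
	return out
}

// stdFindAll uses the standard library, which knows the true offset of
// each search and therefore only lets `^` match at position 0.
func stdFindAll(pattern, s string) [][]int {
	return regexp.MustCompile(pattern).FindAllStringIndex(s, -1)
}

func main() {
	fmt.Println(naiveFindAll(`^a`, "aaa")) // [[0 1] [1 2] [2 3]]
	fmt.Println(stdFindAll(`^a`, "aaa"))   // [[0 1]]
}
```

An offset-aware entry point such as FindAt(haystack, offset) avoids the problem because the engine sees the whole haystack and can tell that position 5 is not the start of the text.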
The Sneakiest Bug: Longest()
This one was humbling. The Longest() method had existed since v0.8.2. The documentation claimed it worked, and the tests passed – but it was a no‑op stub.
// What I wrote (v0.8.2)
func (r *Regex) Longest() {
	// TODO: implement leftmost-longest semantics
}
// What Ben expected
re := coregex.MustCompile(`(a|ab)`)
re.Longest()
// On input "ab", the match should be "ab" (longest), not "a" (first alternative)
AWK uses POSIX semantics (leftmost‑longest). Go’s stdlib uses Perl semantics (leftmost‑first) by default, but Longest() switches modes. My engine only supported Perl semantics.
The fix required understanding a fundamental distinction:
Leftmost‑First (Perl): (a|ab) on "ab" → "a" (first alternative wins)
Leftmost‑Longest (POSIX): (a|ab) on "ab" → "ab" (longer match wins)
Implementing this in the PikeVM took ~100 lines. No performance regression in the default mode.
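The two semantics are easy to observe with Go's standard library, whose regexp.Regexp has the same Longest() switch. This is a small sketch using stdlib regexp only; findWith is a helper name I made up for the illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// findWith returns the first match of pattern in input. When longest is
// true it switches the compiled regex to leftmost-longest (POSIX) mode.
func findWith(pattern, input string, longest bool) string {
	re := regexp.MustCompile(pattern)
	if longest {
		re.Longest()
	}
	return re.FindString(input)
}

func main() {
	fmt.Println(findWith(`(a|ab)`, "ab", false)) // a  (leftmost-first)
	fmt.Println(findWith(`(a|ab)`, "ab", true))  // ab (leftmost-longest)
}
```

Both modes agree on where the match starts; they differ only in which of the candidate matches at that position wins.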
The Fix Velocity
| Version | Date | Fixes |
|---|---|---|
| v0.8.3 | Dec 4 | Negated classes, case‑insensitive |
| v0.8.4 | Dec 4 | ^ anchor (professional fix) |
| v0.8.5 | Dec 5 | Word boundaries \b \B |
| v0.8.6 | Dec 7 | ^ in FindAll/ReplaceAll |
| v0.8.7 | Dec 7 | Error message format |
| v0.8.8 | Dec 7 | DFA + capture groups |
| v0.8.9 | Dec 7 | Linter compatibility |
| v0.8.10 | Dec 7 | Inline flags (?s:…) |
| v0.8.11 | Dec 8 | End anchor first‑call bug |
| v0.8.12 | Dec 8 | Longest() implementation |
10 releases in 5 days, each one making coregex more stdlib‑compatible.
Performance: Still Fast
The real question: did all these fixes kill performance?
Pattern: .*connection.*
Input: 250 KB log file
stdlib: 12.6 ms
coregex: 4 µs
Speedup: 3,154× (unchanged from v0.8.0)
The architectural decisions paid off: SIMD pre‑filtering, lazy DFA, and strategy selection handle the fast path. The bug fixes live in edge‑case handling – code that rarely runs.
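To give a feel for why pre‑filtering keeps the fast path fast, here is a deliberately simplified sketch. It is not coregex's implementation (no SIMD, no lazy DFA): it uses stdlib regexp and strings.Contains to show the idea of rejecting inputs with a cheap literal scan before the regex engine runs. The prefiltered type and newPrefiltered constructor are hypothetical names, and the literal is passed by hand rather than extracted from the pattern.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// prefiltered pairs a compiled regex with a literal that every match
// must contain.
type prefiltered struct {
	re      *regexp.Regexp
	literal string
}

// newPrefiltered compiles pattern and records the required literal.
// The caller is responsible for choosing a literal that is truly required.
func newPrefiltered(pattern, literal string) *prefiltered {
	return &prefiltered{re: regexp.MustCompile(pattern), literal: literal}
}

// Match rejects inputs lacking the literal with a cheap substring scan
// and only runs the regex engine on inputs that survive.
func (p *prefiltered) Match(input string) bool {
	if !strings.Contains(input, p.literal) {
		return false
	}
	return p.re.MatchString(input)
}

func main() {
	p := newPrefiltered(`.*connection.*`, "connection")
	fmt.Println(p.Match("INFO connection reset by peer")) // true
	fmt.Println(p.Match("INFO nothing to see here"))      // false
}
```

The edge‑case fixes sit behind that first check, which is consistent with the benchmark staying flat.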
Full Stdlib Compatibility
After v0.8.12, GoAWK’s test suite passes completely:
$ cd goawk
$ go test ./...
ok github.com/benhoyt/goawk 4.832s
Drop‑in replacement confirmed.
// Before
import "regexp"
// After (aliasing the import keeps existing regexp.* call sites unchanged)
import regexp "github.com/coregx/coregex"
// That's it. Same API. 5‑3000× faster.
What I Learned
- Real‑world testing > Unit tests – My coverage was 88%; GoAWK found 11 bugs. Users catch what you don’t imagine.
- Multi‑engine architecture = Multi‑engine bugs – Each strategy (DFA, NFA, ReverseAnchored, OnePass) has its own edge cases. Integration tests between engines became critical.
- “Works on my machine” is worthless – Ben exercised regex in ways my benchmarks never did.
- Fast feedback loops matter – Issues → fix → release → test, sometimes twice a day. Ben’s detailed reports made this possible.
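The integration tests between engines mentioned above amount to differential testing: run the same pattern and input through two implementations and report any disagreement. Below is a minimal sketch of that idea, not coregex's actual test harness. Since it only uses the standard library, the second engine is simulated by stdlib regexp in Longest() mode; finder, testCase, and disagreements are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

// finder returns the [start, end) span of the first match, or nil.
type finder func(pattern, input string) []int

// leftmostFirst is the reference: the standard library in default mode.
func leftmostFirst(pattern, input string) []int {
	return regexp.MustCompile(pattern).FindStringIndex(input)
}

// leftmostLongest stands in for a second engine under test (an
// assumption for this sketch): stdlib regexp in POSIX mode.
func leftmostLongest(pattern, input string) []int {
	re := regexp.MustCompile(pattern)
	re.Longest()
	return re.FindStringIndex(input)
}

type testCase struct{ pattern, input string }

// sameSpan reports whether two match spans are identical.
func sameSpan(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// disagreements returns the cases where the two finders differ.
func disagreements(ref, cand finder, cases []testCase) []testCase {
	var out []testCase
	for _, c := range cases {
		if !sameSpan(ref(c.pattern, c.input), cand(c.pattern, c.input)) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	cases := []testCase{
		{`^foo`, "foo bar"},
		{`(a|ab)`, "ab"},
		{`[^,]*`, "a,b"},
	}
	for _, c := range disagreements(leftmostFirst, leftmostLongest, cases) {
		fmt.Printf("mismatch: %s on %q\n", c.pattern, c.input)
	}
}
```

With these three cases only (a|ab) on "ab" is reported, because that is the one input where the two matching modes pick different spans.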
The Collaboration
I want to publicly thank Ben Hoyt. He could have said “this library has bugs, I’ll use stdlib.” Instead, he filed detailed issues, provided test cases, and kept testing each release. This is open‑source at its best.
Try It Yourself
go get github.com/coregx/coregex@v0.8.12
package main

import (
	"fmt"

	"github.com/coregx/coregex"
)

func main() {
	re := coregex.MustCompile(`\w+@[\w.]+`)
	fmt.Println(re.FindString("email: test@example.com"))
	// Output: test@example.com
}
Found a bug? Open an issue. I’ll fix it.
What’s Next
- v0.9.0: ARM NEON SIMD (waiting for Go 1.26)
- v1.0.0: API stability guarantee, security audit
Your feedback: the fastest path to production‑ready.
Links
- GitHub: coregx/coregex
- GoAWK PR #264 – The integration that found everything
- Original article
From 0 to 11 bugs fixed. From “interesting project” to “production‑ready.” Thanks to one developer who actually tried to use it.