From 0 to 11 Bugs Fixed: How GoAWK Battle-Tested My 3000x Faster Regex Engine

Published: (December 8, 2025 at 04:49 AM EST)
Source: Dev.to

The Best Kind of Feedback

A week ago, I published “Go’s Regexp is Slow. So I Built My Own”. The response was incredible – but the most valuable feedback came from Ben Hoyt, creator of GoAWK.

He didn’t just read the article. He tried to actually use coregex.

“I’ve started integrating coregex into GoAWK… I’m finding a few issues.”

That message led to one of the most productive weeks of debugging I’ve ever had.

11 Bugs in 7 Days

Ben’s GoAWK test suite is ruthless – 1000+ regex patterns covering edge cases I never imagined. Here’s what he found:

| Day | Bug | Pattern | Symptom |
|-----|-----|---------|---------|
| 1 | [^,]* | Negated char class | Crash |
| 1 | [oO]+d | Case‑insensitive | Wrong match |
| 2 | ^foo | Start anchor | Matched everywhere |
| 2 | \bword\b | Word boundary | Find returned empty |
| 3 | ^ in FindAll | Anchor in loop | Matched at every position |
| 3 | Error format | | Different from stdlib |
| 4 | \w+@... | Capture groups | DFA returned false |
| 4 | (?s:.) | Inline flags | Ignored |
| 5 | a$ | End anchor | First call wrong |
| 6 | (#\\\n#!) | Longest() | |

Each bug taught me something. Some were embarrassing oversights; others revealed fundamental gaps in my understanding.

The Worst Bug: ^ Anchor

The start anchor (^) was my nemesis. It seemed simple – match only at position 0. But in a multi‑engine architecture, “simple” gets complicated fast.

  • Version 1 – Naively checked pos == 0. Worked for IsMatch, broke for FindAllIndex.
  • Version 2 – Added FindAt(haystack, offset) methods. Now FindAllIndex could tell the engine “this is position 5 in the original string.”
  • Version 3 – Discovered DFA’s epsilonClosure didn’t respect anchors. Implemented proper LookSet following Rust’s regex‑automata.

Three attempts over two days. Ben kept testing. I kept fixing.
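
The Version 1 failure and the Version 2 idea can be sketched with a toy matcher for `^` followed by a literal. This is not coregex's code; naiveFindAll, matchAt, and offsetFindAll are hypothetical names, and the only point is that slicing the haystack throws away the absolute offset that `^` needs.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveFindAll mimics the Version 1 mistake: every search restarts on a
// sliced haystack, so the matcher believes each slice begins at position 0.
func naiveFindAll(haystack, literal string) [][2]int {
	var out [][2]int
	for pos := 0; pos+len(literal) <= len(haystack); {
		rest := haystack[pos:] // slicing discards the absolute offset
		if strings.HasPrefix(rest, literal) { // "^" wrongly succeeds here
			out = append(out, [2]int{pos, pos + len(literal)})
			pos += len(literal)
			continue
		}
		pos++
	}
	return out
}

// matchAt is a hypothetical FindAt-style primitive: it receives the full
// haystack plus the absolute offset, so "^" can be checked against 0.
func matchAt(haystack, literal string, at int) bool {
	return at == 0 && strings.HasPrefix(haystack[at:], literal)
}

// offsetFindAll runs the same loop but passes the absolute offset along.
func offsetFindAll(haystack, literal string) [][2]int {
	var out [][2]int
	for pos := 0; pos+len(literal) <= len(haystack); {
		if matchAt(haystack, literal, pos) {
			out = append(out, [2]int{pos, pos + len(literal)})
			pos += len(literal)
			continue
		}
		pos++
	}
	return out
}

func main() {
	fmt.Println(naiveFindAll("foo foo", "foo"))  // [[0 3] [4 7]]
	fmt.Println(offsetFindAll("foo foo", "foo")) // [[0 3]]
}
```

The second result is what the stdlib returns for `^foo` on "foo foo": one match, at the start.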

The Sneakiest Bug: Longest()

This one was humbling. The Longest() method had existed since v0.8.2. The documentation claimed it worked and the tests passed, but it was a no‑op stub.

// What I wrote (v0.8.2)
func (r *Regex) Longest() {
    // TODO: implement leftmost-longest semantics
}

// What Ben expected
re := coregex.MustCompile(`(a|ab)`)
re.Longest()
// On input "ab" this should match "ab" (longest), not "a" (first alternative)

AWK uses POSIX semantics (leftmost‑longest). Go’s stdlib uses Perl semantics (leftmost‑first) by default, but Longest() switches modes. My engine only supported Perl semantics.

The fix required understanding a fundamental distinction:

Leftmost‑First (Perl):   (a|ab) on "ab" → "a" (first alternative wins)
Leftmost‑Longest (POSIX): (a|ab) on "ab" → "ab" (longer match wins)

Implementing this in the PikeVM took ~100 lines. No performance regression in the default mode.
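
The two modes are easy to observe with Go's standard regexp package, which is the behavior a drop‑in replacement has to reproduce. The sketch below uses only the stdlib; firstAndLongest is a hypothetical helper name, and nothing here shows how coregex implements the switch internally.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstAndLongest returns the match for pattern on input under the default
// leftmost-first semantics, then again after switching to leftmost-longest.
func firstAndLongest(pattern, input string) (first, longest string) {
	re := regexp.MustCompile(pattern)
	first = re.FindString(input)
	re.Longest() // stdlib: future searches prefer the leftmost-longest match
	longest = re.FindString(input)
	return first, longest
}

func main() {
	first, longest := firstAndLongest(`(a|ab)`, "ab")
	fmt.Printf("leftmost-first: %q, leftmost-longest: %q\n", first, longest)
	// leftmost-first: "a", leftmost-longest: "ab"
}
```

Note that Longest() mutates the compiled regexp, so it should be called once right after compiling, before the value is shared between goroutines.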

The Fix Velocity

| Version | Date | Fixes |
|---------|------|-------|
| v0.8.3 | Dec 4 | Negated classes, case‑insensitive |
| v0.8.4 | Dec 4 | ^ anchor (professional fix) |
| v0.8.5 | Dec 5 | Word boundaries \b \B |
| v0.8.6 | Dec 7 | ^ in FindAll/ReplaceAll |
| v0.8.7 | Dec 7 | Error message format |
| v0.8.8 | Dec 7 | DFA + capture groups |
| v0.8.9 | Dec 7 | Linter compatibility |
| v0.8.10 | Dec 7 | Inline flags (?s:…) |
| v0.8.11 | Dec 8 | End anchor first‑call bug |
| v0.8.12 | Dec 8 | Longest() implementation |

Ten releases in five days (v0.8.3 through v0.8.12), each one making coregex more stdlib‑compatible.

Performance: Still Fast

The real question: did all these fixes kill performance?

Pattern: .*connection.*
Input: 250 KB log file

stdlib:   12.6 ms
coregex:   4 µs

Speedup: 3,154× (unchanged from v0.8.0)

The architectural decisions paid off: SIMD pre‑filtering, lazy DFA, and strategy selection handle the fast path. The bug fixes live in edge‑case handling – code that rarely runs.
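
For anyone who wants to run a measurement of this shape themselves, here is a minimal timing sketch against the stdlib. The log contents are synthetic and made up for illustration (not the file used for the numbers above), and buildLog and timeMatch are hypothetical helpers. Swapping the regexp import for another engine with the same API would give the other side of the comparison.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"time"
)

// buildLog returns a synthetic log of at least size bytes with a single
// "connection" line at the very end (made-up data for illustration).
func buildLog(size int) []byte {
	var buf bytes.Buffer
	for buf.Len() < size {
		buf.WriteString("2025-12-08 INFO request handled in 3ms\n")
	}
	buf.WriteString("2025-12-08 WARN connection reset by peer\n")
	return buf.Bytes()
}

// timeMatch compiles pattern once, runs it rounds times over data, and
// reports whether it matched plus the average wall-clock time per call.
func timeMatch(pattern string, data []byte, rounds int) (bool, time.Duration) {
	re := regexp.MustCompile(pattern)
	matched := false
	start := time.Now()
	for i := 0; i < rounds; i++ {
		matched = re.Match(data)
	}
	return matched, time.Since(start) / time.Duration(rounds)
}

func main() {
	data := buildLog(250 * 1024)
	matched, avg := timeMatch(`.*connection.*`, data, 10)
	fmt.Printf("matched=%v avg=%v over %d bytes\n", matched, avg, len(data))
}
```

A hand‑rolled loop like this is only a rough check. For numbers worth publishing, a testing.B benchmark run with go test -bench is the more reliable tool.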

Full Stdlib Compatibility

After v0.8.12, GoAWK’s test suite passes completely:

$ cd goawk
$ go test ./...
ok      github.com/benhoyt/goawk    4.832s

Drop‑in replacement confirmed.

// Before
import "regexp"

// After
import "github.com/coregx/coregex"

// That's it. Same API. 5‑3000× faster.

What I Learned

  1. Real‑world testing > Unit tests – My coverage was 88%; GoAWK found 11 bugs. Users catch what you don’t imagine.
  2. Multi‑engine architecture = Multi‑engine bugs – Each strategy (DFA, NFA, ReverseAnchored, OnePass) has its own edge cases. Integration tests between engines became critical.
  3. “Works on my machine” is worthless – Ben exercised regex in ways my benchmarks never did.
  4. Fast feedback loops matter – Issues → fix → release → test, sometimes twice a day. Ben’s detailed reports made this possible.
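
The integration testing mentioned in point 2 can be approximated with a differential test: run the same pattern and input through a reference engine and a candidate, then report every disagreement. The sketch below is an assumption about how such a harness might look, not the one used for coregex. The finder interface, compileFunc, diff, and the deliberately broken literalEngine are all hypothetical; the stdlib serves as the reference.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// finder is the slice of the regexp API under comparison.
type finder interface {
	FindAllStringIndex(s string, n int) [][]int
}

// compileFunc builds a finder for a pattern; each engine supplies one.
type compileFunc func(pattern string) (finder, error)

func stdlibCompile(p string) (finder, error) {
	re, err := regexp.Compile(p)
	if err != nil {
		return nil, err
	}
	return re, nil
}

// literalEngine is a deliberately incomplete stand-in candidate: it treats
// the pattern as a literal after dropping a leading "^" and ignores n, so
// it reproduces the "anchor matched everywhere" class of bug.
type literalEngine struct{ lit string }

func (e literalEngine) FindAllStringIndex(s string, n int) [][]int {
	var out [][]int
	if e.lit == "" {
		return out
	}
	for pos := 0; pos <= len(s); {
		i := strings.Index(s[pos:], e.lit)
		if i < 0 {
			break
		}
		out = append(out, []int{pos + i, pos + i + len(e.lit)})
		pos += i + len(e.lit)
	}
	return out
}

func literalCompile(p string) (finder, error) {
	return literalEngine{lit: strings.TrimPrefix(p, "^")}, nil
}

// diff runs every {pattern, input} pair through both engines and returns a
// description of each disagreement.
func diff(ref, cand compileFunc, cases [][2]string) []string {
	var mismatches []string
	for _, c := range cases {
		r, errR := ref(c[0])
		k, errK := cand(c[0])
		if errR != nil || errK != nil {
			if (errR == nil) != (errK == nil) {
				mismatches = append(mismatches, fmt.Sprintf("%q: compile disagreement", c[0]))
			}
			continue
		}
		want, got := r.FindAllStringIndex(c[1], -1), k.FindAllStringIndex(c[1], -1)
		if fmt.Sprint(want) != fmt.Sprint(got) {
			mismatches = append(mismatches,
				fmt.Sprintf("%q on %q: want %v, got %v", c[0], c[1], want, got))
		}
	}
	return mismatches
}

func main() {
	cases := [][2]string{{"foo", "foo foo"}, {"^foo", "foo foo"}}
	for _, m := range diff(stdlibCompile, literalCompile, cases) {
		fmt.Println(m) // "^foo" on "foo foo": want [[0 3]], got [[0 3] [4 7]]
	}
}
```

Comparing against an existing engine means the expected values never have to be written by hand, which is what makes a corpus of 1000+ patterns practical to check.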

The Collaboration

I want to publicly thank Ben Hoyt. He could have said “this library has bugs, I’ll use stdlib.” Instead, he filed detailed issues, provided test cases, and kept testing each release. This is open‑source at its best.

Try It Yourself

go get github.com/coregx/coregex@v0.8.12

package main

import (
    "fmt"
    "github.com/coregx/coregex"
)

func main() {
    re := coregex.MustCompile(`\w+@[\w.]+`)
    fmt.Println(re.FindString("email: test@example.com"))
    // Output: test@example.com
}

Found a bug? Open an issue. I’ll fix it.

What’s Next

  • v0.9.0: ARM NEON SIMD (waiting for Go 1.26)
  • v1.0.0: API stability guarantee, security audit

Your feedback: the fastest path to production‑ready.

From 0 to 11 bugs fixed. From “interesting project” to “production‑ready.” Thanks to one developer who actually tried to use it.
