GxPDF v0.1.0: 100% Table Extraction Accuracy in Pure Go
Source: Dev.to

The Problem with PDF Libraries
Every Go developer who has worked with PDFs knows the pain:
| Library | Issue |
|---|---|
| UniPDF | Powerful, but starts at $299/month |
| pdfcpu | Great for manipulation, no table extraction |
| gofpdf | Creation‑only, abandoned since 2019 |
I needed to extract tables from bank statements – 740 transactions across multiple pages. Commercial libraries worked, but the cost was prohibitive for an open‑source project.
Solution: I built GxPDF.
What is GxPDF?
GxPDF is a pure Go PDF library that handles both reading and creation.
- No CGO
- No external dependencies
- MIT licensed
# Install CLI
go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0
# Or use as a library
go get github.com/coregx/gxpdf@v0.1.0
The Key Innovation – 4‑Pass Hybrid Detection
Table extraction is hard. PDFs don’t contain “tables”; they contain positioned text elements scattered across coordinates. Most algorithms fail on:
- Multi‑line cells (descriptions that wrap)
- Missing borders (modern designs)
- Merged cells
- Headers vs. data discrimination
GxPDF uses a 4‑Pass Hybrid Detection algorithm:
| Pass | Description |
|---|---|
| Pass 1 | Gap Detection (adaptive threshold) |
| Pass 2 | Overlap Detection (Tabula‑inspired) |
| Pass 3 | Alignment Detection (geometric clustering) |
| Pass 4 | Multi‑line Cell Merger (amount‑based discrimination) |
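To make Pass 1 in the table above more concrete, here is a minimal sketch of gap detection with an adaptive threshold on a single line of text. It is not GxPDF's implementation: the `TextElement` type, the `detectColumnGaps` function, and the "median gap times a factor" rule are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// TextElement is a hypothetical positioned text fragment
// (not GxPDF's actual type).
type TextElement struct {
	Text  string
	X     float64 // left edge
	Width float64
}

// detectColumnGaps returns the X positions of likely column boundaries
// on one line. A gap counts as a boundary when it is wider than
// factor times the median gap, so the threshold adapts to the spacing
// of the line instead of being a fixed number of points.
func detectColumnGaps(line []TextElement, factor float64) []float64 {
	if len(line) < 2 {
		return nil
	}
	elems := append([]TextElement(nil), line...)
	sort.Slice(elems, func(i, j int) bool { return elems[i].X < elems[j].X })

	// Horizontal distance between each element and its left neighbour.
	gaps := make([]float64, 0, len(elems)-1)
	for i := 1; i < len(elems); i++ {
		gaps = append(gaps, elems[i].X-(elems[i-1].X+elems[i-1].Width))
	}

	sorted := append([]float64(nil), gaps...)
	sort.Float64s(sorted)
	threshold := sorted[len(sorted)/2] * factor

	var boundaries []float64
	for i, g := range gaps {
		if g > threshold {
			// Place the boundary in the middle of the gap.
			boundaries = append(boundaries, elems[i].X+elems[i].Width+g/2)
		}
	}
	return boundaries
}

func main() {
	line := []TextElement{
		{"12.01.2024", 50, 50},
		{"Card", 120, 20},
		{"payment", 143, 35},
		{"at", 181, 10},
		{"Store", 194, 25},
		{"-1,234.56", 300, 45},
	}
	fmt.Println(detectColumnGaps(line, 3))
}
```

Ordinary word spacing dominates the median, so only the two wide gaps (after the date and before the amount) are reported as column boundaries. A real detector would also have to combine evidence from many lines, which this sketch does not attempt.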
Pass 4 insight: transaction rows contain monetary amounts; continuation rows do not.
// Works on ALL banks without configuration
isTransactionRow := hasAmount(row) // Has amount = new transaction
isContinuation := !hasAmount(row) // No amount = continuation of previous
On the statements tested so far, this single discriminator held across different PDF generators, layouts, and bank formats.
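The amount-based rule can be sketched in a few lines. The version below is an illustration under stated assumptions, not GxPDF's code: rows are taken to be `[]string` cells, and `amountPattern`, `hasAmount`, and `mergeContinuations` are hypothetical names. The regular expression only covers common formats such as `1,234.56`, `-500.00`, or `1 000,00`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// amountPattern matches a cell that holds only a monetary amount with
// two decimal places. Dates such as "12.01.2024" do not match.
// (Hypothetical pattern; real statements may need more formats.)
var amountPattern = regexp.MustCompile(`^[+-]?\d{1,3}(?:[ ,]?\d{3})*[.,]\d{2}$`)

// hasAmount reports whether any cell in the row is a monetary amount.
func hasAmount(row []string) bool {
	for _, cell := range row {
		if amountPattern.MatchString(strings.TrimSpace(cell)) {
			return true
		}
	}
	return false
}

// mergeContinuations folds rows without an amount into the previous
// row, column by column, so wrapped descriptions become one cell.
func mergeContinuations(rows [][]string) [][]string {
	var merged [][]string
	for _, row := range rows {
		if hasAmount(row) || len(merged) == 0 {
			merged = append(merged, append([]string(nil), row...))
			continue
		}
		prev := merged[len(merged)-1]
		for i, cell := range row {
			cell = strings.TrimSpace(cell)
			if cell == "" || i >= len(prev) {
				continue
			}
			prev[i] = strings.TrimSpace(prev[i] + " " + cell)
		}
	}
	return merged
}

func main() {
	rows := [][]string{
		{"12.01.2024", "Card payment", "-1,234.56"},
		{"", "Store 42, Moscow", ""},
		{"13.01.2024", "Salary", "50,000.00"},
	}
	for _, r := range mergeContinuations(rows) {
		fmt.Printf("%q\n", r)
	}
}
```

Note that a header row without an amount is kept as its own row only when it comes first; separating headers from data is a different problem that this sketch leaves out.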
Results – 100 % Accuracy
Tested on real bank statements:
| Bank | Transactions | Accuracy |
|---|---|---|
| Sberbank | 242 | 100 % |
| Alfa‑Bank | 281 | 100 % |
| VTB | 217 | 100 % |
| Total | 740 | 100 % |
Every transaction was extracted correctly, and every multi‑line description was preserved.
Code Examples
Extract Tables from a PDF
package main

import (
    "fmt"
    "log"

    "github.com/coregx/gxpdf"
)

func main() {
    // Open PDF
    doc, err := gxpdf.Open("bank_statement.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    // Extract all tables
    tables := doc.ExtractTables()
    for _, t := range tables {
        fmt.Printf("Table: %d rows x %d cols\n",
            t.RowCount(), t.ColumnCount())

        // Access rows
        for _, row := range t.Rows() {
            fmt.Println(row)
        }
    }
}
Export to CSV / JSON
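// Export to CSV (second return value ignored for brevity;
// variable names avoid shadowing the encoding/csv and encoding/json packages)
csvText, _ := table.ToCSV()
fmt.Println(csvText)

// Export to JSON
jsonText, _ := table.ToJSON()
fmt.Println(jsonText)

// Write to file
file, err := os.Create("output.csv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()
table.ExportCSV(file)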
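If rows need post-processing before export, the standard library can write them directly. The sketch below assumes the extracted rows are available as `[][]string`, which the examples above do not confirm; `rowsToCSV` is a hypothetical helper and uses only `encoding/csv`.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"log"
)

// rowsToCSV encodes rows as CSV text. Fields that contain a comma,
// such as "-1,234.56", are quoted automatically by the writer.
func rowsToCSV(rows [][]string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	// WriteAll writes every record and then flushes the writer.
	if err := w.WriteAll(rows); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	rows := [][]string{
		{"Date", "Description", "Amount"},
		{"12.01.2024", "Card payment Store 42, Moscow", "-1,234.56"},
	}
	out, err := rowsToCSV(rows)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(out)
}
```

The automatic quoting matters for bank data, where amounts with thousands separators would otherwise split into two columns.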
Create PDFs
package main

import (
    "log"

    "github.com/coregx/gxpdf/creator"
)

func main() {
    c := creator.New()
    c.SetTitle("Invoice")
    c.SetAuthor("GxPDF")

    page, _ := c.NewPage()

    // Add text with Standard 14 fonts
    page.AddText("Invoice #12345", 100, 750, creator.HelveticaBold, 24)
    page.AddText("Amount: $1,234.56", 100, 700, creator.Helvetica, 14)

    // Draw graphics
    opts := &creator.RectOptions{
        StrokeColor: &creator.Black,
        FillColor:   &creator.LightGray,
        StrokeWidth: 1.0,
    }
    page.DrawRect(100, 600, 400, 50, opts)

    // Save
    if err := c.WriteToFile("invoice.pdf"); err != nil {
        log.Fatal(err)
    }
}
CLI Tool
GxPDF includes a CLI for quick operations:
# Extract tables
gxpdf tables invoice.pdf
gxpdf tables bank.pdf --format csv > transactions.csv
gxpdf tables report.pdf --format json
# Get PDF info
gxpdf info document.pdf
# Extract text
gxpdf text document.pdf
# Merge PDFs
gxpdf merge part1.pdf part2.pdf -o combined.pdf
# Split PDF
gxpdf split document.pdf --pages 1-5 -o first_five.pdf
Feature Matrix
| Feature | Status |
|---|---|
| Table Extraction | 100 % on the 740‑transaction test set |
| Text Extraction | Supported |
| Image Extraction | Supported |
| PDF Creation | Supported |
| Standard 14 Fonts | All 14 |
| Embedded Fonts | TTF/OTF |
| Graphics | Lines, Rectangles, Circles, Bezier |
| Encryption | RC4 + AES‑128/256 |
| Export Formats | CSV, JSON, Excel |
Architecture
internal/
├── document/ # Document model
├── encoding/ # FlateDecode, DCTDecode
├── extractor/ # Text, image, graphics
├── fonts/ # Standard 14 + embedding
├── models/ # Data structures
├── parser/ # PDF parsing
├── reader/ # PDF reader
├── security/ # RC4/AES encryption
├── tabledetect/ # 4‑Pass Hybrid algorithm
└── writer/ # PDF generation
Clean separation. No CGO. Pure Go from top to bottom.
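As an example of why pure Go is practical here: FlateDecode, the most common PDF stream filter, is zlib/deflate compression, which the standard library already provides. The sketch below is not taken from GxPDF's `encoding/` package; `flateDecode` and `flateEncode` are hypothetical helpers, and predictor parameters (`/DecodeParms`) are not handled.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"log"
)

// flateDecode inflates the raw bytes of a stream whose filter is
// FlateDecode. It does not apply PNG/TIFF predictors.
func flateDecode(data []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

// flateEncode is the inverse operation, used here to build sample input.
func flateEncode(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := zlib.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	content := []byte("BT /F1 24 Tf 100 750 Td (Invoice #12345) Tj ET")
	packed, err := flateEncode(content)
	if err != nil {
		log.Fatal(err)
	}
	unpacked, err := flateDecode(packed)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", unpacked)
}
```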
Performance
Table extraction on a 15‑page bank statement:
| Metric | Value |
|---|---|
| Time | ~200 ms |
| Memory | ~15 MB peak |
| Allocations | Minimal (see benchmarks in the repo) |
Benchmarks
Hot paths are optimized with sync.Pool to reduce allocations.
PDF creation benchmarks:
BenchmarkNewPage-8 50000 28.4 µs/op
BenchmarkAddText-8 100000 11.2 µs/op
BenchmarkWriteToFile-8 5000 312.5 µs/op
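For readers unfamiliar with `sync.Pool`, the general pattern looks roughly like this. It is a generic sketch of buffer pooling, not GxPDF's code: `bufPool` and `textOp` are hypothetical, and the content-stream fragment is simplified (no font selection, no escaping of parentheses in the text).

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so that a hot path does not
// allocate a fresh bytes.Buffer on every call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// textOp formats a simplified text-showing fragment using a pooled buffer.
func textOp(text string, x, y float64) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from a previous use
	defer bufPool.Put(buf)

	fmt.Fprintf(buf, "BT %.2f %.2f Td (%s) Tj ET", x, y, text)
	// String copies the bytes, so the result stays valid after Put.
	return buf.String()
}

func main() {
	fmt.Println(textOp("Invoice #12345", 100, 750))
	fmt.Println(textOp("Amount: $1,234.56", 100, 700))
}
```

The two details that matter are resetting the buffer after `Get` and never keeping a reference to its internal bytes after `Put`.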
What’s Next
The v0.1.0 release covers the core functionality. Planned for future releases:
- Form Filling – Fill existing PDF forms
- Digital Signatures – Sign PDFs cryptographically
- SVG Import – Vector graphics support
- PDF Rendering – Convert pages to images
We Need Your PDFs
This is v0.1.0 — our first public release. We’ve tested on bank statements, invoices, and reports, but PDFs are infinitely diverse.
We need testers with real documents:
- Corporate reports with complex tables
- Invoices from different countries and formats
- Scanned documents with OCR layers
- Multi‑language PDFs (CJK, Arabic, Hebrew)
- Legacy PDFs from old generators
- Edge cases that break other libraries
If GxPDF fails on your document, that’s valuable data. Open an issue, attach the PDF (or a sanitized version), and we’ll fix it.
Our goal is enterprise‑grade quality. Not “good enough for hobby projects” — we want GxPDF to handle production workloads at scale. The 740/740 accuracy on bank statements is our baseline, not our ceiling.
Try It
go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0
gxpdf version
Repository:
Documentation and examples are in the repo. Issues and PRs are welcome.
GxPDF is MIT licensed. Built for the Go community that needed a real PDF library without commercial restrictions.
