GxPDF v0.1.0: 100% Table Extraction Accuracy in Pure Go

Published: January 6, 2026 at 10:03 PM EST
Source: Dev.to

Andrey Kolkov

The Problem with PDF Libraries

Every Go developer who has worked with PDFs knows the pain:

| Library | Issue |
| --- | --- |
| UniPDF | Powerful, but starts at $299/month |
| pdfcpu | Great for manipulation, no table extraction |
| gofpdf | Creation‑only, abandoned since 2019 |

I needed to extract tables from bank statements – 740 transactions across multiple pages. Commercial libraries worked, but the cost was prohibitive for an open‑source project.

Solution: I built GxPDF.

What is GxPDF?

GxPDF is a pure Go PDF library that handles both reading and creation.

  • No CGO
  • No external dependencies
  • MIT licensed

# Install CLI
go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0

# Or use as a library
go get github.com/coregx/gxpdf@v0.1.0

The Key Innovation – 4‑Pass Hybrid Detection

Table extraction is hard. PDFs don’t contain “tables”; they contain positioned text elements scattered across coordinates. Most algorithms fail on:

  • Multi‑line cells (descriptions that wrap)
  • Missing borders (modern designs)
  • Merged cells
  • Headers vs. data discrimination

GxPDF uses a 4‑Pass Hybrid Detection algorithm:

| Pass | Description |
| --- | --- |
| Pass 1 | Gap Detection (adaptive threshold) |
| Pass 2 | Overlap Detection (Tabula‑inspired) |
| Pass 3 | Alignment Detection (geometric clustering) |
| Pass 4 | Multi‑line Cell Merger (amount‑based discrimination) |
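To make the idea behind Pass 1 concrete, here is a minimal sketch of gap-based column detection on a single row. It is not GxPDF's implementation: the `Span` type, the `columnBoundaries` function, and the threshold rule (a gap is a column separator when it exceeds a factor times the row's median gap) are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical horizontal extent of one text element on a row.
type Span struct {
	X0, X1 float64 // left and right edges, in PDF points
}

// columnBoundaries returns x positions that split a row into columns.
// A gap counts as a column separator when it is wider than factor times
// the (lower) median gap on the row; that is the "adaptive" part here.
func columnBoundaries(spans []Span, factor float64) []float64 {
	if len(spans) < 2 {
		return nil
	}
	// Work on a copy sorted left to right.
	sorted := append([]Span(nil), spans...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].X0 < sorted[j].X0 })

	// Horizontal gap between each element and its right neighbour.
	gaps := make([]float64, 0, len(sorted)-1)
	for i := 1; i < len(sorted); i++ {
		g := sorted[i].X0 - sorted[i-1].X1
		if g < 0 {
			g = 0 // overlapping elements have no gap
		}
		gaps = append(gaps, g)
	}

	// Lower median of the gaps approximates the normal word spacing.
	ordered := append([]float64(nil), gaps...)
	sort.Float64s(ordered)
	median := ordered[(len(ordered)-1)/2]

	// Place a boundary in the middle of every unusually wide gap.
	var cuts []float64
	for i, g := range gaps {
		if g > factor*median {
			cuts = append(cuts, sorted[i].X1+g/2)
		}
	}
	return cuts
}

func main() {
	// Made-up row: a date, a three-word description, and an amount.
	row := []Span{{50, 80}, {120, 160}, {163, 173}, {176, 200}, {300, 340}}
	fmt.Println(columnBoundaries(row, 3)) // prints [100 250]
}
```

A single-row median breaks down when every cell holds one word, so a practical detector would probably pool gaps across many rows before choosing the threshold.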

Pass 4 insight: transaction rows contain monetary amounts; continuation rows do not.

// No per-bank configuration: the presence of an amount drives the decision
isTransactionRow := hasAmount(row)    // has amount: new transaction
isContinuation   := !isTransactionRow // no amount: continuation of the previous row

This discriminator needs no per‑bank configuration: in the tests below it held across three banks with different PDF generators and layouts.
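The amount-based rule can be sketched end to end as follows. This is an illustration under stated assumptions, not GxPDF's code: the `amountPattern` regular expression, this `hasAmount` implementation, and the `mergeContinuations` helper are hypothetical, and the pattern only covers amounts with exactly two decimals.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// amountPattern is an assumed format for monetary amounts: optional sign,
// digits with optional space or comma thousands separators, then a dot or
// comma followed by exactly two decimals. Full dates such as 01.02.2026
// do not match; a bare "01.02" would, so real data may need a stricter rule.
var amountPattern = regexp.MustCompile(`^[+-]?\d{1,3}(?:[ ,]?\d{3})*[.,]\d{2}$`)

// hasAmount reports whether any cell in the row looks like an amount.
func hasAmount(row []string) bool {
	for _, cell := range row {
		if amountPattern.MatchString(strings.TrimSpace(cell)) {
			return true
		}
	}
	return false
}

// mergeContinuations folds every row without an amount into the row
// before it, joining non-empty cells column by column with a space.
func mergeContinuations(rows [][]string) [][]string {
	var out [][]string
	for _, row := range rows {
		if hasAmount(row) || len(out) == 0 {
			out = append(out, append([]string(nil), row...))
			continue
		}
		prev := out[len(out)-1]
		for i, cell := range row {
			cell = strings.TrimSpace(cell)
			if cell == "" || i >= len(prev) {
				continue
			}
			prev[i] = strings.TrimSpace(prev[i] + " " + cell)
		}
	}
	return out
}

func main() {
	// Made-up rows: the second one is a wrapped description.
	rows := [][]string{
		{"01.02.2026", "Card payment", "-150.00"},
		{"", "Coffee shop, Main St", ""},
		{"02.02.2026", "Salary", "52 000,00"},
	}
	for _, r := range mergeContinuations(rows) {
		fmt.Printf("%q\n", r)
	}
}
```

Running it yields two rows, with the wrapped line folded into the first description as "Card payment Coffee shop, Main St".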

Results – 100 % Accuracy

Tested on real bank statements:

| Bank | Transactions | Accuracy |
| --- | --- | --- |
| Sberbank | 242 | 100 % |
| Alfa‑Bank | 281 | 100 % |
| VTB | 217 | 100 % |
| Total | 740 | 100 % |

Every transaction was extracted correctly, and every multi‑line description was preserved.

Code Examples

Extract Tables from a PDF

package main

import (
    "fmt"
    "log"

    "github.com/coregx/gxpdf"
)

func main() {
    // Open PDF
    doc, err := gxpdf.Open("bank_statement.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    // Extract all tables
    tables := doc.ExtractTables()

    for _, t := range tables {
        fmt.Printf("Table: %d rows x %d cols\n",
            t.RowCount(), t.ColumnCount())

        // Access rows
        for _, row := range t.Rows() {
            fmt.Println(row)
        }
    }
}

Export to CSV / JSON

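// Export to CSV (table is one entry from doc.ExtractTables())
csvText, err := table.ToCSV()
if err != nil {
    log.Fatal(err)
}
fmt.Println(csvText)

// Export to JSON
jsonText, err := table.ToJSON()
if err != nil {
    log.Fatal(err)
}
fmt.Println(jsonText)

// Write to file
file, err := os.Create("output.csv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()
table.ExportCSV(file)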

Create PDFs

package main

import (
    "log"

    "github.com/coregx/gxpdf/creator"
)

func main() {
    c := creator.New()
    c.SetTitle("Invoice")
    c.SetAuthor("GxPDF")

    page, err := c.NewPage()
    if err != nil {
        log.Fatal(err)
    }

    // Add text with Standard 14 fonts
    page.AddText("Invoice #12345", 100, 750, creator.HelveticaBold, 24)
    page.AddText("Amount: $1,234.56", 100, 700, creator.Helvetica, 14)

    // Draw graphics
    opts := &creator.RectOptions{
        StrokeColor: &creator.Black,
        FillColor:   &creator.LightGray,
        StrokeWidth: 1.0,
    }
    page.DrawRect(100, 600, 400, 50, opts)

    // Save
    if err := c.WriteToFile("invoice.pdf"); err != nil {
        log.Fatal(err)
    }
}

CLI Tool

GxPDF includes a CLI for quick operations:

# Extract tables
gxpdf tables invoice.pdf
gxpdf tables bank.pdf --format csv > transactions.csv
gxpdf tables report.pdf --format json

# Get PDF info
gxpdf info document.pdf

# Extract text
gxpdf text document.pdf

# Merge PDFs
gxpdf merge part1.pdf part2.pdf -o combined.pdf

# Split PDF
gxpdf split document.pdf --pages 1-5 -o first_five.pdf
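Once the CLI has written CSV, the output can be consumed with Go's standard library alone. The sketch below assumes a three-column layout (date, description, amount) purely for illustration; the actual columns depend on the source PDF, and `parseTableCSV` is a hypothetical helper name.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// parseTableCSV parses CSV text into rows of cells.
func parseTableCSV(data string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.FieldsPerRecord = -1 // tolerate rows with differing cell counts
	return r.ReadAll()
}

func main() {
	// Made-up sample; in practice, load the redirected file with os.ReadFile.
	sample := "Date,Description,Amount\n" +
		"01.02.2026,\"Coffee shop, Main St\",-150.00\n"

	rows, err := parseTableCSV(sample)
	if err != nil {
		log.Fatal(err)
	}
	// Skip the header row and print each record.
	for _, row := range rows[1:] {
		fmt.Printf("%d cells: %q\n", len(row), row)
	}
}
```

Setting `FieldsPerRecord` to -1 is a defensive choice: it keeps the reader from rejecting a file if an extracted table has rows with different cell counts.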

Feature Matrix

| Feature | Status |
| --- | --- |
| Table Extraction | 100 % on the 740 tested bank‑statement transactions |
| Text Extraction | Supported |
| Image Extraction | Supported |
| PDF Creation | Supported |
| Standard 14 Fonts | All 14 |
| Embedded Fonts | TTF/OTF |
| Graphics | Lines, Rectangles, Circles, Bezier |
| Encryption | RC4 + AES‑128/256 |
| Export Formats | CSV, JSON, Excel |

Architecture

internal/
├── document/       # Document model
├── encoding/       # FlateDecode, DCTDecode
├── extractor/      # Text, image, graphics
├── fonts/          # Standard 14 + embedding
├── models/         # Data structures
├── parser/         # PDF parsing
├── reader/         # PDF reader
├── security/       # RC4/AES encryption
├── tabledetect/    # 4‑Pass Hybrid algorithm
└── writer/         # PDF generation

Clean separation. No CGO. Pure Go from top to bottom.
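One reason a pure Go design is practical is that common stream filters map onto the standard library. FlateDecode streams use the zlib format, so a decoder can be sketched with `compress/zlib`. The `inflate` and `deflate` helpers below are hypothetical and are not taken from GxPDF's `encoding/` package; the sketch also ignores predictor parameters, which real PDF streams may require.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"log"
)

// inflate decompresses zlib-wrapped data, as used by FlateDecode streams.
func inflate(data []byte) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// deflate compresses data into the zlib format (used here for the demo).
func deflate(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := zlib.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	// Round trip a small content-stream fragment.
	packed, err := deflate([]byte("BT /F1 12 Tf (Hello) Tj ET"))
	if err != nil {
		log.Fatal(err)
	}
	plain, err := inflate(packed)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s\n", plain)
}
```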

Performance

Table extraction on a 15‑page bank statement:

| Metric | Value |
| --- | --- |
| Time | ~200 ms |
| Memory | ~15 MB peak |
| Allocations | Minimal (see benchmarks in the repo) |

Benchmarks

The library is optimized with sync.Pool to reduce allocations.

PDF creation benchmarks:

BenchmarkNewPage-8        50000    28.4 µs/op
BenchmarkAddText-8       100000    11.2 µs/op
BenchmarkWriteToFile-8    5000   312.5 µs/op
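For readers unfamiliar with the sync.Pool optimization mentioned above, the general pattern looks like this. It is a generic sketch, not code from GxPDF: `bufPool` and `renderTextOp` are hypothetical names, and the text operator it emits is simplified (no font selection, no escaping of parentheses).

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so that hot paths do not allocate
// a fresh bytes.Buffer on every call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// renderTextOp formats a simplified PDF text operation using a pooled
// buffer and returns the result as a string.
func renderTextOp(text string, x, y float64) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may hold old content
	defer bufPool.Put(buf) // return the buffer for reuse

	fmt.Fprintf(buf, "BT %.2f %.2f Td (%s) Tj ET", x, y, text)
	return buf.String() // copies the bytes, so reuse of buf is safe
}

func main() {
	fmt.Println(renderTextOp("Invoice #12345", 100, 750))
}
```

The important details are resetting the buffer after `Get` and copying the result out before `Put`, since the pool may hand the same buffer to another goroutine afterwards.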

What’s Next

The v0.1.0 release covers the core functionality. Planned for future releases:

  • Form Filling – Fill existing PDF forms
  • Digital Signatures – Sign PDFs cryptographically
  • SVG Import – Vector graphics support
  • PDF Rendering – Convert pages to images

We Need Your PDFs

This is v0.1.0, our first public release. We’ve tested on bank statements, invoices, and reports, but PDFs vary enormously.

We need testers with real documents:

  • Corporate reports with complex tables
  • Invoices from different countries and formats
  • Scanned documents with OCR layers
  • Multi‑language PDFs (CJK, Arabic, Hebrew)
  • Legacy PDFs from old generators
  • Edge cases that break other libraries

If GxPDF fails on your document, that’s valuable data. Open an issue, attach the PDF (or a sanitized version), and we’ll fix it.

Our goal is enterprise‑grade quality: not “good enough for hobby projects”, but a library that handles production workloads at scale. The 740/740 accuracy on bank statements is our baseline, not our ceiling.

Try It

go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0
gxpdf version

Repository:

Documentation and examples are in the repo. Issues and PRs are welcome.

GxPDF is MIT licensed. Built for the Go community that needed a real PDF library without commercial restrictions.
