GxPDF v0.1.0: 100% Table Extraction Accuracy in Pure Go

Published: January 6, 2026 at 10:03 PM EST

Source: Dev.to

The Problem with PDF Libraries

Every Go developer who has worked with PDFs knows the pain:

Library	Issue
UniPDF	Powerful, but starts at $299/month
pdfcpu	Great for manipulation, no table extraction
gofpdf	Creation‑only, abandoned since 2019

I needed to extract tables from bank statements – 740 transactions across multiple pages. Commercial libraries worked, but the cost was prohibitive for an open‑source project.

Solution: I built GxPDF.

What is GxPDF?

GxPDF is a pure Go PDF library that handles both reading and creation.

No CGO
No external dependencies
MIT licensed

# Install CLI
go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0

# Or use as a library
go get github.com/coregx/gxpdf@v0.1.0

The Key Innovation – 4‑Pass Hybrid Detection

Table extraction is hard. PDFs don’t contain “tables”; they contain positioned text elements scattered across coordinates. Most algorithms fail on:

Multi‑line cells (descriptions that wrap)
Missing borders (modern designs)
Merged cells
Headers vs. data discrimination
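To make the "positioned text elements" point concrete, here is a minimal sketch of the first thing any extractor has to do: turn loose coordinates back into rows. The `TextElement` type, the `groupIntoRows` helper, and the 2-point tolerance are assumptions made up for this illustration; they are not GxPDF's internal types or values.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// TextElement is a hypothetical stand-in for a positioned text run;
// it is not GxPDF's internal type.
type TextElement struct {
	Text string
	X, Y float64 // lower-left corner in PDF points (Y grows upward)
}

// groupIntoRows clusters elements whose Y lies within tol points of the
// first element of the current row, then orders each row left to right.
func groupIntoRows(elems []TextElement, tol float64) [][]TextElement {
	sorted := append([]TextElement(nil), elems...)
	// Top of the page first: larger Y comes earlier.
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var rows [][]TextElement
	for _, e := range sorted {
		last := len(rows) - 1
		if last >= 0 && math.Abs(rows[last][0].Y-e.Y) <= tol {
			rows[last] = append(rows[last], e)
			continue
		}
		rows = append(rows, []TextElement{e})
	}
	for _, r := range rows {
		sort.Slice(r, func(i, j int) bool { return r[i].X < r[j].X })
	}
	return rows
}

func main() {
	// Elements arrive in content-stream order, not reading order.
	elems := []TextElement{
		{"1,250.00", 400, 700.2},
		{"Date", 50, 720},
		{"Amount", 400, 720},
		{"05.01.2026", 50, 700},
	}
	for _, row := range groupIntoRows(elems, 2.0) {
		for _, e := range row {
			fmt.Printf("%s | ", e.Text)
		}
		fmt.Println()
	}
}
```

Even this toy version shows why the cases listed above are hard: a wrapped description produces an extra "row" with no date or amount, and nothing in the coordinates says which cells belong together.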

GxPDF uses a 4‑Pass Hybrid Detection algorithm:

Pass	Description
Pass 1	Gap Detection (adaptive threshold)
Pass 2	Overlap Detection (Tabula‑inspired)
Pass 3	Alignment Detection (geometric clustering)
Pass 4	Multi‑line Cell Merger (amount‑based discrimination)
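As a rough illustration of what gap detection with an adaptive threshold (Pass 1) can look like, the sketch below finds column boundaries on a single line of text. The `Span` type, the `columnBreaks` function, and the "multiple of the median gap" rule are assumptions chosen for this example; the actual thresholding used in GxPDF may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical horizontal extent of one text run on a line.
type Span struct {
	Left, Right float64
}

// columnBreaks returns the indexes i (in left-to-right order) where a
// column boundary falls between span i and span i+1. The threshold adapts
// to the line: it is factor times the lower median gap. This rule is
// illustrative only.
func columnBreaks(spans []Span, factor float64) []int {
	if len(spans) < 2 {
		return nil
	}
	sorted := append([]Span(nil), spans...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Left < sorted[j].Left })

	// Horizontal whitespace between neighbouring runs.
	gaps := make([]float64, len(sorted)-1)
	for i := range gaps {
		gaps[i] = sorted[i+1].Left - sorted[i].Right
	}

	ordered := append([]float64(nil), gaps...)
	sort.Float64s(ordered)
	median := ordered[(len(ordered)-1)/2]
	threshold := factor * median

	var breaks []int
	for i, g := range gaps {
		if g > threshold {
			breaks = append(breaks, i)
		}
	}
	return breaks
}

func main() {
	// "05.01.2026" | "Card" "payment" "Grocery" | "1,250.00"
	line := []Span{{50, 100}, {150, 172}, {175, 215}, {218, 255}, {400, 445}}
	fmt.Println(columnBreaks(line, 3)) // prints [0 3]
}
```

A rule this simple breaks down when every cell holds a single word, because the median gap is then itself a column gap. That is one reason a single pass is not enough and later passes are needed.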

Pass 4 insight: transaction rows contain monetary amounts; continuation rows do not.

// No per-bank configuration: the presence of an amount is the only signal
isTransactionRow := hasAmount(row)   // Has amount = new transaction
isContinuation   := !hasAmount(row)   // No amount = continuation of previous

This single discriminator needs no per‑bank tuning: it held across the different PDF generators, layouts, and bank formats in the test set reported under Results.
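The snippet above leaves `hasAmount` undefined. A minimal sketch of how such a check and the resulting merge could be written follows. The regular expression, the `mergeContinuations` helper, and the sample rows are assumptions for illustration, not the code from the `tabledetect` package.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// amountRe is an illustrative pattern for amounts such as "1,250.00",
// "-300,50" or "1 250.00". It is an assumption, not GxPDF's actual rule.
var amountRe = regexp.MustCompile(`^[+-]?\d{1,3}([ ,]?\d{3})*[.,]\d{2}$`)

// hasAmount reports whether any cell in the row looks like a monetary amount.
func hasAmount(row []string) bool {
	for _, cell := range row {
		if amountRe.MatchString(strings.TrimSpace(cell)) {
			return true
		}
	}
	return false
}

// mergeContinuations folds rows without an amount into the preceding
// row, joining cell text column by column.
func mergeContinuations(rows [][]string) [][]string {
	var out [][]string
	for _, row := range rows {
		if hasAmount(row) || len(out) == 0 {
			out = append(out, append([]string(nil), row...))
			continue
		}
		prev := out[len(out)-1]
		for i, cell := range row {
			if i < len(prev) && strings.TrimSpace(cell) != "" {
				prev[i] = strings.TrimSpace(prev[i] + " " + cell)
			}
		}
	}
	return out
}

func main() {
	rows := [][]string{
		{"05.01.2026", "Card payment", "1,250.00"},
		{"", "Grocery store, Moscow", ""},
		{"06.01.2026", "Transfer", "300.00"},
	}
	for _, r := range mergeContinuations(rows) {
		fmt.Println(strings.Join(r, " | "))
	}
}
```

Anchoring the pattern to the whole cell keeps dates like `05.01.2026` from matching, but a bare `05.01` would still be read as an amount, so a production check would likely also consider which column the cell sits in.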

Results – 100% Accuracy

Tested on real bank statements:

Bank	Transactions	Accuracy
Sberbank	242	100%
Alfa‑Bank	281	100%
VTB	217	100%
Total	740	100%

Every transaction was extracted correctly, and every multi‑line description was preserved.

Code Examples

Extract Tables from a PDF

package main

import (
    "fmt"
    "log"

    "github.com/coregx/gxpdf"
)

func main() {
    // Open PDF
    doc, err := gxpdf.Open("bank_statement.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    // Extract all tables
    tables := doc.ExtractTables()

    for _, t := range tables {
        fmt.Printf("Table: %d rows x %d cols\n",
            t.RowCount(), t.ColumnCount())

        // Access rows
        for _, row := range t.Rows() {
            fmt.Println(row)
        }
    }
}

Export to CSV / JSON

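// Export to CSV (table is one element of doc.ExtractTables())
csvText, err := table.ToCSV()
if err != nil {
    log.Fatal(err)
}
fmt.Println(csvText)

// Export to JSON
jsonText, err := table.ToJSON()
if err != nil {
    log.Fatal(err)
}
fmt.Println(jsonText)

// Write to file
file, err := os.Create("output.csv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()
table.ExportCSV(file)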

Create PDFs

package main

import (
    "log"

    "github.com/coregx/gxpdf/creator"
)

func main() {
    c := creator.New()
    c.SetTitle("Invoice")
    c.SetAuthor("GxPDF")

    page, err := c.NewPage()
    if err != nil {
        log.Fatal(err)
    }

    // Add text with Standard 14 fonts
    page.AddText("Invoice #12345", 100, 750, creator.HelveticaBold, 24)
    page.AddText("Amount: $1,234.56", 100, 700, creator.Helvetica, 14)

    // Draw graphics
    opts := &creator.RectOptions{
        StrokeColor: &creator.Black,
        FillColor:   &creator.LightGray,
        StrokeWidth: 1.0,
    }
    page.DrawRect(100, 600, 400, 50, opts)

    // Save
    if err := c.WriteToFile("invoice.pdf"); err != nil {
        log.Fatal(err)
    }
}

CLI Tool

GxPDF includes a CLI for quick operations:

# Extract tables
gxpdf tables invoice.pdf
gxpdf tables bank.pdf --format csv > transactions.csv
gxpdf tables report.pdf --format json

# Get PDF info
gxpdf info document.pdf

# Extract text
gxpdf text document.pdf

# Merge PDFs
gxpdf merge part1.pdf part2.pdf -o combined.pdf

# Split PDF
gxpdf split document.pdf --pages 1-5 -o first_five.pdf

Feature Matrix

Feature	Status
Table Extraction	100% accuracy
Text Extraction	Supported
Image Extraction	Supported
PDF Creation	Supported
Standard 14 Fonts	All 14
Embedded Fonts	TTF/OTF
Graphics	Lines, Rectangles, Circles, Bezier
Encryption	RC4 + AES‑128/256
Export Formats	CSV, JSON, Excel

Architecture

internal/
├── document/       # Document model
├── encoding/       # FlateDecode, DCTDecode
├── extractor/      # Text, image, graphics
├── fonts/          # Standard 14 + embedding
├── models/         # Data structures
├── parser/         # PDF parsing
├── reader/         # PDF reader
├── security/       # RC4/AES encryption
├── tabledetect/    # 4‑Pass Hybrid algorithm
└── writer/         # PDF generation

Clean separation. No CGO. Pure Go from top to bottom.

Performance

Table extraction on a 15‑page bank statement:

Metric	Value
Time	~200 ms
Memory	~15 MB peak
Allocations	Minimal (see benchmarks in the repo)

Benchmarks

Allocations are reduced with sync.Pool.
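For readers who have not used `sync.Pool`, the sketch below shows the general pattern of reusing buffers on a hot path. The `bufPool` variable and the `renderText` function are hypothetical, and the operator string is a simplified content-stream fragment (no font selection or string escaping). It illustrates the technique, not GxPDF's writer code.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so that a hot path does not
// allocate a fresh bytes.Buffer on every call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// renderText builds a simplified text-showing operator sequence using a
// pooled buffer. The name and the operator layout are illustrative.
func renderText(text string, x, y float64) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a pooled buffer may still hold old data
	defer bufPool.Put(buf) // return it for the next caller

	fmt.Fprintf(buf, "BT %.2f %.2f Td (%s) Tj ET", x, y, text)
	return buf.String() // String copies, so the result outlives the buffer
}

func main() {
	fmt.Println(renderText("Invoice #12345", 100, 750))
}
```

The two details that matter are calling `Reset` after `Get` and never keeping a reference to the buffer's bytes after `Put`.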

PDF creation benchmarks:

BenchmarkNewPage-8        50000    28.4 µs/op
BenchmarkAddText-8       100000    11.2 µs/op
BenchmarkWriteToFile-8    5000   312.5 µs/op

What’s Next

The v0.1.0 release covers the core functionality. Planned for future releases:

Form Filling – Fill existing PDF forms
Digital Signatures – Sign PDFs cryptographically
SVG Import – Vector graphics support
PDF Rendering – Convert pages to images

We Need Your PDFs

This is v0.1.0, our first public release. We’ve tested on bank statements, invoices, and reports, but PDFs are infinitely diverse.

We need testers with real documents:

Corporate reports with complex tables
Invoices from different countries and formats
Scanned documents with OCR layers
Multi‑language PDFs (CJK, Arabic, Hebrew)
Legacy PDFs from old generators
Edge cases that break other libraries

If GxPDF fails on your document, that’s valuable data. Open an issue, attach the PDF (or a sanitized version), and we’ll fix it.

Our goal is enterprise‑grade quality: not “good enough for hobby projects”, but a library that handles production workloads at scale. The 740/740 accuracy on bank statements is our baseline, not our ceiling.

Try It

go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0
gxpdf version

Repository:

Documentation and examples are in the repo. Issues and PRs are welcome.

GxPDF is MIT licensed. Built for the Go community that needed a real PDF library without commercial restrictions.
