GxPDF v0.1.0: 100% Table Extraction Accuracy in Pure Go

Published: January 7, 2026, 12:03 PM GMT+9

7 min read

Source: Dev.to

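The Problem with PDF Libraries

Every Go developer who has worked with PDFs knows the pain:

Library	Problem
UniPDF	Powerful, but starts at $299/month
pdfcpu	Great for manipulation, but no table extraction
gofpdf	Generation-only, unmaintained since 2019

I needed to extract tables from bank statements: 740 transactions spread across multiple pages. Commercial libraries worked, but their cost was hard to justify for an open-source project.

The solution: I built GxPDF.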

What Is GxPDF?

GxPDF is a pure Go PDF library that supports both reading and creating PDFs.

No CGO
No external dependencies
MIT license

# Install CLI
go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0

# Or use as a library
go get github.com/coregx/gxpdf@v0.1.0

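The Core Innovation: 4-Pass Hybrid Detection

Table extraction is hard. A PDF does not contain "tables"; it contains text elements placed at scattered coordinates. Most algorithms fail on:

Multi-line cells (descriptions with line breaks)
Missing borders (modern designs)
Merged cells
Distinguishing headers from data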
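To make the "text at coordinates" problem concrete, here is a minimal sketch of the first step any extractor has to take: grouping positioned fragments into visual rows. The `TextElement` type, the `groupIntoRows` helper, and the 2-point tolerance are all hypothetical names and values chosen for illustration; they are not GxPDF's API or its actual algorithm.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// TextElement is a hypothetical stand-in for a positioned text fragment.
// It is not GxPDF's actual type.
type TextElement struct {
	X, Y float64 // position on the page, in points
	Text string
}

// groupIntoRows puts an element into the current row when its Y coordinate is
// within tol of the row's first element, then orders each row left to right.
func groupIntoRows(elems []TextElement, tol float64) [][]TextElement {
	sorted := append([]TextElement(nil), elems...)
	// PDF Y grows upward, so descending Y reads top to bottom.
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var rows [][]TextElement
	for _, e := range sorted {
		last := len(rows) - 1
		if last >= 0 && math.Abs(rows[last][0].Y-e.Y) <= tol {
			rows[last] = append(rows[last], e)
			continue
		}
		rows = append(rows, []TextElement{e})
	}
	for _, r := range rows {
		sort.SliceStable(r, func(i, j int) bool { return r[i].X < r[j].X })
	}
	return rows
}

func main() {
	elems := []TextElement{
		{X: 300, Y: 700, Text: "-120.00"},
		{X: 50, Y: 700.4, Text: "01.12.2025"},
		{X: 120, Y: 700, Text: "Card payment"},
		{X: 120, Y: 688, Text: "GROCERY STORE 42"},
	}
	for _, row := range groupIntoRows(elems, 2.0) {
		for _, e := range row {
			fmt.Printf("%q ", e.Text)
		}
		fmt.Println()
	}
}
```

Note that the wrapped description line ends up as a row of its own, which is exactly the multi-line cell problem listed above.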

GxPDF uses a 4-Pass Hybrid Detection algorithm:

Pass	Description
Pass 1	Gap Detection (adaptive threshold)
Pass 2	Overlap Detection (Tabula‑inspired)
Pass 3	Alignment Detection (geometric clustering)
Pass 4	Multi‑line Cell Merger (amount‑based discrimination)
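As a rough illustration of what a gap-based pass with an adaptive threshold could look like, the sketch below splits one row into cells wherever the horizontal gap between neighbouring fragments is large relative to the row's average character width. The `Span` type, the `splitColumns` helper, and the choice of "factor times average character width" as the threshold are assumptions made for this example; the article does not describe how GxPDF computes its threshold.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical horizontal extent of one text fragment on a row.
type Span struct {
	X0, X1 float64 // left and right edges, in points
	Text   string
}

// splitColumns starts a new cell whenever the gap to the previous fragment
// exceeds factor times the row's average character width.
func splitColumns(row []Span, factor float64) []string {
	if len(row) == 0 {
		return nil
	}
	spans := append([]Span(nil), row...)
	sort.Slice(spans, func(i, j int) bool { return spans[i].X0 < spans[j].X0 })

	// Adaptive part: the threshold scales with the text's own size.
	var width float64
	var chars int
	for _, s := range spans {
		width += s.X1 - s.X0
		chars += len([]rune(s.Text))
	}
	if chars == 0 {
		return nil
	}
	threshold := factor * width / float64(chars)

	cells := []string{spans[0].Text}
	for i := 1; i < len(spans); i++ {
		gap := spans[i].X0 - spans[i-1].X1
		if gap > threshold {
			cells = append(cells, spans[i].Text) // wide gap: column boundary
		} else {
			cells[len(cells)-1] += " " + spans[i].Text // narrow gap: same cell
		}
	}
	return cells
}

func main() {
	row := []Span{
		{X0: 50, X1: 100, Text: "01.12.2025"},
		{X0: 130, X1: 160, Text: "Card"},
		{X0: 163, X1: 205, Text: "payment"},
		{X0: 300, X1: 335, Text: "-120.00"},
	}
	fmt.Printf("%q\n", splitColumns(row, 2.0))
}
```

Deriving the threshold from the text itself, rather than hard-coding a point value, is what lets the same code cope with different font sizes.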

Pass 4 insight: transaction rows contain an amount; continuation rows do not.

// Works on ALL banks without configuration
isTransactionRow := hasAmount(row)   // Has amount = new transaction
isContinuation   := !hasAmount(row)   // No amount = continuation of previous

This universal discriminator works across different PDF generators, layouts, and bank formats.
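The amount-based rule can be sketched as a small, self-contained merger. Everything here is illustrative: `hasAmount` is reimplemented with an assumed amount pattern (optional sign, optional space or comma thousands separators, two decimals), and `mergeContinuations` is a hypothetical helper, not GxPDF's internal code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// amountRe is an assumed amount format: optional sign, digits with optional
// space or comma thousands separators, and exactly two decimals.
var amountRe = regexp.MustCompile(`^[+-]?\d{1,3}(?:[ ,]?\d{3})*[.,]\d{2}$`)

// hasAmount reports whether any cell in the row is a standalone amount.
func hasAmount(row []string) bool {
	for _, cell := range row {
		if amountRe.MatchString(strings.TrimSpace(cell)) {
			return true
		}
	}
	return false
}

// mergeContinuations folds each row without an amount into the previous
// transaction, cell by cell.
func mergeContinuations(rows [][]string) [][]string {
	var out [][]string
	for _, row := range rows {
		if hasAmount(row) || len(out) == 0 {
			out = append(out, append([]string(nil), row...))
			continue
		}
		prev := out[len(out)-1]
		for i, cell := range row {
			if i >= len(prev) || strings.TrimSpace(cell) == "" {
				continue
			}
			prev[i] = strings.TrimSpace(prev[i] + " " + cell)
		}
	}
	return out
}

func main() {
	rows := [][]string{
		{"01.12.2025", "Card payment", "-120.00"},
		{"", "GROCERY STORE 42", ""},
		{"02.12.2025", "Salary", "+3 500.00"},
	}
	for _, r := range mergeContinuations(rows) {
		fmt.Printf("%q\n", r)
	}
}
```

The pattern is anchored to the whole cell, so a date such as 01.12.2025 or a number buried inside a description does not count as an amount.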

Results: 100% Accuracy

Tested on real bank statements:

Bank	Transactions	Accuracy
Sberbank	242	100 %
Alfa‑Bank	281	100 %
VTB	217	100 %
Total	740	100 %

Every transaction was extracted correctly, and every multi-line description was preserved intact.

Code Examples

Extracting Tables from a PDF

package main

import (
    "fmt"
    "log"

    "github.com/coregx/gxpdf"
)

func main() {
    // Open PDF
    doc, err := gxpdf.Open("bank_statement.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    // Extract all tables
    tables := doc.ExtractTables()

    for _, t := range tables {
        fmt.Printf("Table: %d rows x %d cols\n",
            t.RowCount(), t.ColumnCount())

        // Access rows
        for _, row := range t.Rows() {
            fmt.Println(row)
        }
    }
}

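Exporting to CSV / JSON

// table is one element of the slice returned by doc.ExtractTables()

// Export to CSV
csvData, err := table.ToCSV()
if err != nil {
    log.Fatal(err)
}
fmt.Println(csvData)

// Export to JSON
jsonData, err := table.ToJSON()
if err != nil {
    log.Fatal(err)
}
fmt.Println(jsonData)

// Write to file
file, err := os.Create("output.csv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()
table.ExportCSV(file)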
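If the rows are already available as plain strings, the same kind of export can be done with the standard library alone. The sketch below is independent of GxPDF: `rowsToCSV` is a hypothetical helper built on `encoding/csv`, shown only to illustrate why a proper CSV writer matters for descriptions that contain commas.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"log"
)

// rowsToCSV serializes rows with the standard library's CSV writer, which
// quotes cells containing commas, quotes, or newlines.
func rowsToCSV(rows [][]string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.WriteAll(rows); err != nil { // WriteAll also flushes
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := rowsToCSV([][]string{
		{"Date", "Description", "Amount"},
		{"01.12.2025", "Card payment, GROCERY STORE 42", "-120.00"},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(out)
}
```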

Creating PDFs

package main

import (
    "log"

    "github.com/coregx/gxpdf/creator"
)

func main() {
    c := creator.New()
    c.SetTitle("Invoice")
    c.SetAuthor("GxPDF")

    page, err := c.NewPage()
    if err != nil {
        log.Fatal(err)
    }

    // Add text with Standard 14 fonts
    page.AddText("Invoice #12345", 100, 750, creator.HelveticaBold, 24)
    page.AddText("Amount: $1,234.56", 100, 700, creator.Helvetica, 14)

    // Draw graphics
    opts := &creator.RectOptions{
        StrokeColor: &creator.Black,
        FillColor:   &creator.LightGray,
        StrokeWidth: 1.0,
    }
    page.DrawRect(100, 600, 400, 50, opts)

    // Save
    if err := c.WriteToFile("invoice.pdf"); err != nil {
        log.Fatal(err)
    }
}

CLI Tool

GxPDF includes a CLI for quick tasks:

# Extract tables
gxpdf tables invoice.pdf
gxpdf tables bank.pdf --format csv > transactions.csv
gxpdf tables report.pdf --format json

# Get PDF info
gxpdf info document.pdf

# Extract text
gxpdf text document.pdf

# Merge PDFs
gxpdf merge part1.pdf part2.pdf -o combined.pdf

# Split PDF
gxpdf split document.pdf --pages 1-5 -o first_five.pdf

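Feature Matrix

Feature	Status
Table extraction	100% on the tested bank statements (740/740)
Text extraction	Supported
Image extraction	Supported
PDF creation	Supported
Standard 14 fonts	All 14
Embedded fonts	TTF/OTF
Graphics	Lines, rectangles, circles, Bézier curves
Encryption	RC4 + AES-128/256
Export formats	CSV, JSON, Excel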

Architecture

internal/
├── document/       # Document model
├── encoding/       # FlateDecode, DCTDecode
├── extractor/      # Text, image, graphics
├── fonts/          # Standard 14 + embedding
├── models/         # Data structures
├── parser/         # PDF parsing
├── reader/         # PDF reader
├── security/       # RC4/AES encryption
├── tabledetect/    # 4‑Pass Hybrid algorithm
└── writer/         # PDF generation

Clean separation. No CGO. Pure Go from top to bottom.

Performance

Metric	Value
Time	~200 ms
Memory	~15 MB peak
Allocations	Minimal (see the benchmarks in the repository)

Benchmarks

Optimized with sync.Pool.
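For readers unfamiliar with the pattern, here is a minimal sketch of how `sync.Pool` is typically used to reuse buffers and cut allocations. The `bufPool` variable and the `renderText` helper are hypothetical, and the operator string it builds is deliberately simplified (no font selection, no escaping); none of this is taken from GxPDF's source.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so hot paths do not allocate a new
// bytes.Buffer on every call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// renderText formats a simplified PDF text-showing sequence using a pooled
// buffer. The returned string is a copy, so the buffer can be reused safely.
func renderText(text string, x, y float64) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // pooled buffers may hold data from a previous use
	defer bufPool.Put(buf) // return the buffer once the string has been copied out

	fmt.Fprintf(buf, "BT %.2f %.2f Td (%s) Tj ET", x, y, text)
	return buf.String()
}

func main() {
	fmt.Println(renderText("Invoice #12345", 100, 750))
	fmt.Println(renderText("Amount: $1,234.56", 100, 700))
}
```

The pool pays off when many short-lived buffers are created per page; for a handful of calls the difference is negligible.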

PDF creation benchmarks:

BenchmarkNewPage-8        50000    28.4 µs/op
BenchmarkAddText-8       100000    11.2 µs/op
BenchmarkWriteToFile-8    5000   312.5 µs/op

What's Next

Form Filling: fill in existing PDF forms
Digital Signatures: cryptographically sign PDFs
SVG Import: vector graphics support
PDF Rendering: convert pages to images

We Need Your PDFs

This is v0.1.0 — our first public release. We’ve tested on bank statements, invoices, and reports, but PDFs are infinitely diverse.

We need testers with real documents:

Corporate reports with complex tables
Invoices from different countries and formats
Scanned documents with OCR layers
Multi‑language PDFs (CJK, Arabic, Hebrew)
Legacy PDFs from old generators
Edge cases that break other libraries

If GxPDF fails on your document, that’s valuable data. Open an issue, attach the PDF (or a sanitized version), and we’ll fix it.

Our goal is enterprise‑grade quality. Not “good enough for hobby projects” — we want GxPDF to handle production workloads at scale. The 740/740 accuracy on bank statements is our baseline, not our ceiling.

Try It

go install github.com/coregx/gxpdf/cmd/gxpdf@v0.1.0
gxpdf version

Repository:

Documentation and examples are in the repository. Issues and PRs are welcome.

GxPDF is MIT-licensed. It was built for the Go community, which needed a real PDF library without commercial restrictions.
