A Minimal Go Toolkit for Cleaning, Validating, and Querying CSV/TSV/Excel/Parquet Files
In many software systems, not all data lives inside a database.
Sometimes it’s stored in structured files such as CSV, TSV, or spreadsheets, and in practice these files are often not clean. Data may be entered manually, values can be missing, and rows may contain inconsistencies that break downstream processing. When the file is large, trying to “fix it in Excel” can lead to crashes or make it difficult to identify invalid records.
I repeatedly faced these issues in real projects, so I built three small Go libraries that focus on preprocessing, validation, and lightweight analysis of structured file data. All three libraries work with the standard io.Reader interface, allowing you to use familiar Go primitives without custom data structures.
- fileprep – preprocessing + field‑level validation using struct tags
- fileframe – a tiny, immutable DataFrame for filtering and inspection
- filesql – run SQL directly on CSV/TSV/LTSV/Excel/Parquet via an embedded SQLite database
These libraries are independent, but share the same io.Reader‑based API.
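Because everything flows through io.Reader, the data source is interchangeable: a string in tests, a file on disk, or any other stream. Here is a standard-library-only illustration of that point (no calls to the three libraries, just the interface they all consume):

package main

import (
    "fmt"
    "io"
    "os"
    "strings"
)

func main() {
    // In tests, data can come from an in-memory string...
    var r io.Reader = strings.NewReader("name,age\nJohn,30\n")

    // ...and in production, from a file, with no other code changes.
    if f, err := os.Open("data.csv"); err == nil {
        defer f.Close()
        r = f
    }

    // Stand-in for handing r to fileprep, fileframe, or filesql.
    n, err := io.Copy(io.Discard, r)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    fmt.Printf("consumed %d bytes\n", n)
}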
fileprep
Features
- Preprocessing: trim, replace, Unicode normalization, type coercion, etc.
- Clear error reporting: identifies which row and which column caused the failure
- Supports composite validation (cross‑column rules)
- Works with CSV, TSV, LTSV, Parquet, Excel — anything fed through io.Reader
Example
package main

import (
    "fmt"
    "strings"

    "github.com/nao1215/fileprep" // import path assumed; adjust to the actual module
)

// User declares per-field preprocessing (prep) and validation (validate) rules.
type User struct {
    Name  string `prep:"trim" validate:"required"`
    Email string `prep:"trim,lowercase"`
    Age   string
}

func main() {
    csvData := `name,email,age
John Doe ,JOHN@EXAMPLE.COM,30
Jane Smith,jane@example.com,25
`
    processor := fileprep.NewProcessor(fileprep.FileTypeCSV)

    var users []User
    reader, result, err := processor.Process(strings.NewReader(csvData), &users)
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }

    fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)
    for _, user := range users {
        fmt.Printf("Name: %q, Email: %q\n", user.Name, user.Email)
    }

    // `reader` can be passed directly to filesql
    _ = reader
}
Output
Processed 2 rows, 2 valid
Name: "John Doe", Email: "john@example.com"
Name: "Jane Smith", Email: "jane@example.com"
fileframe
Features
- Immutable DataFrame
- Filtering, mapping, grouping
- Ideal for “one‑time transformations” on small/medium CSV/TSV datasets
Example
package main

import (
    "fmt"
    "strings"

    "github.com/nao1215/fileframe" // import path assumed; adjust to the actual module
)

func main() {
    // Sample sales data
    csvData := `product,amount,category
Apple,100,Fruit
Banana,150,Fruit
Carrot,80,Vegetable
Orange,120,Fruit
Broccoli,90,Vegetable`

    df, err := fileframe.NewDataFrame(strings.NewReader(csvData), fileframe.CSV)
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    fmt.Printf("Total rows: %d\n", df.Len())
    fmt.Printf("Columns: %v\n", df.Columns())

    // Filter: keep rows whose amount exceeds 100
    filtered := df.Filter(func(row map[string]any) bool {
        amount, ok := row["amount"].(int64)
        return ok && amount > 100
    })
    fmt.Printf("Rows with amount > 100: %d\n", filtered.Len())

    // GroupBy + Sum: total amount per category
    groupedDf, err := df.GroupBy("category")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    grouped, err := groupedDf.Sum("amount")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    for _, row := range grouped.ToRecords() {
        fmt.Printf(" %s: %.0f\n", row["category"], row["sum_amount"])
    }
}
Output
Total rows: 5
Columns: [product amount category]
Rows with amount > 100: 2
Fruit: 370
Vegetable: 170
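For a sense of what GroupBy and Sum save you, here is the same aggregation written against the standard library alone; everything below is plain encoding/csv, no fileframe involved:

package main

import (
    "encoding/csv"
    "fmt"
    "strconv"
    "strings"
)

func main() {
    csvData := `product,amount,category
Apple,100,Fruit
Banana,150,Fruit
Carrot,80,Vegetable
Orange,120,Fruit
Broccoli,90,Vegetable`

    records, err := csv.NewReader(strings.NewReader(csvData)).ReadAll()
    if err != nil {
        fmt.Println("Error:", err)
        return
    }

    // Group by category (column 2) and sum amount (column 1) by hand.
    sums := map[string]float64{}
    for _, rec := range records[1:] { // records[0] is the header row
        amount, err := strconv.ParseFloat(rec[1], 64)
        if err != nil {
            continue // skip rows with a non-numeric amount
        }
        sums[rec[2]] += amount
    }
    for category, sum := range sums {
        fmt.Printf(" %s: %.0f\n", category, sum)
    }
}

Map iteration order is random in Go, so the two category lines may print in either order.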
filesql
filesql doesn't implement its own SQL engine over files. Internally it loads your data into a temporary SQLite database, giving you full SQL capabilities without the need to manage a database yourself.
Example
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/nao1215/filesql" // import path assumed; adjust to the actual module
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // data.csv is loaded into a temporary SQLite database;
    // the table is named after the file, hence "data" below.
    db, err := filesql.OpenContext(ctx, "data.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Assumes data.csv has exactly the columns name and age, in that order.
    rows, err := db.QueryContext(ctx, "SELECT * FROM data WHERE age > 25")
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var name string
        var age int
        if err := rows.Scan(&name, &age); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("Name: %s, Age: %d\n", name, age)
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}
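filesql is also meant for multi-file work. Assuming OpenContext is variadic over paths and keeps naming each table after its file (the single-file example above suggests that naming rule), a join across two CSVs would look roughly like this. The file names and columns here are hypothetical, and the fragment continues from the setup in the example above:

// Hypothetical sketch: users.csv, orders.csv, and their columns are made up
// for illustration; the variadic OpenContext call is an assumption.
db, err := filesql.OpenContext(ctx, "users.csv", "orders.csv")
if err != nil {
    log.Fatal(err)
}
defer db.Close()

rows, err := db.QueryContext(ctx, `
    SELECT u.name, COUNT(*) AS order_count
    FROM orders AS o
    JOIN users AS u ON u.id = o.user_id
    GROUP BY u.name`)
if err != nil {
    log.Fatal(err)
}
defer rows.Close()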
If you’re a Go developer who doesn’t need the full Apache or Python ETL ecosystems, these three lightweight libraries may fit your workflow well. The user base is still small, so there may be undiscovered bugs—feedback and issue reports are very welcome.