Getting Started with Dusty: a tiny DSL for ETL and research data cleaning
Source: Dev.to

Over the past few weeks, I've been seriously considering building my own programming language. Not a big general-purpose language, not a Python replacement, and definitely not a project with grand ambitions. I just want to build something small, useful, and focused.
That's where Dusty comes in.
Dusty is a lightweight DSL (domain-specific language) built specifically for ETL tasks and research data cleaning. That's it. No huge ecosystem, no package manager, no frameworks. The goal is simple: turn messy CSV/JSON cleanup work into short, readable scripts.
I started from my own pain points. Whenever I work with research data or hackathon datasets, I end up writing the same patterns over and over:
- Load a CSV
- Filter rows
- Fix missing values
- Rename some fields
- Join with another file
- Export the cleaned result
Python works, but the scripts get ugly fast. Pandas is powerful, but overkill for small tasks. SQL is great for structured tables, but awkward for irregular CSVs. Most ETL tools are built for enterprises, not for students or indie developers.
So Dusty targets the middle ground: simple data transformations without the overhead.
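For reference, this is roughly what those six steps look like in plain Python with just the `csv` module. The file names and column names here are illustrative, not from a real dataset:

```python
import csv

def clean(users_path, cities_path, out_path):
    """The six steps above in plain Python (file and column names are illustrative)."""
    with open(users_path, newline="") as f:            # load a CSV
        rows = list(csv.DictReader(f))
    rows = [r for r in rows if r.get("age")]           # filter rows (drop empty age)
    for r in rows:
        r.setdefault("country", "unknown")             # fix a missing value
        r["full_name"] = r.pop("name")                 # rename a field
    with open(cities_path, newline="") as f:           # join with another file
        cities = {c["city_id"]: c for c in csv.DictReader(f)}
    for r in rows:
        r.update(cities.get(r["city_id"], {}))
    fieldnames = sorted({k for r in rows for k in r})  # union of keys across rows
    with open(out_path, "w", newline="") as f:         # export the cleaned result
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

It works, but every project grows its own slightly different copy of this function. That repetition is exactly what Dusty is trying to compress.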
What Dusty will look like (early prototype idea)
A Dusty script looks like this:
source users = csv("users.csv")
transform adults = users
| filter(r -> int(r.age) >= 18)
| map(r -> { id: r.id, name: r.name })
save adults to csv("clean_adults.csv")
Readable.
No imports.
No boilerplate.
Just the data flow.
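To make the semantics concrete: one way to read that script is as a pipeline over rows (dicts). Here is a rough Python equivalent of what the interpreter would do for it. This is illustrative only; Dusty's actual semantics are still being designed:

```python
import csv

def run_pipeline(in_path="users.csv", out_path="clean_adults.csv"):
    """Rough Python equivalent of the Dusty script above (illustrative)."""
    with open(in_path, newline="") as f:               # source users = csv("users.csv")
        users = list(csv.DictReader(f))
    adults = [
        {"id": r["id"], "name": r["name"]}             # | map(r -> { id: r.id, name: r.name })
        for r in users
        if int(r["age"]) >= 18                         # | filter(r -> int(r.age) >= 18)
    ]
    with open(out_path, "w", newline="") as f:         # save adults to csv("clean_adults.csv")
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(adults)
    return adults
```

Five lines of Dusty versus roughly fifteen of Python is the whole pitch.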
Essential ETL operations
Dusty will support the following core operations:
- source
- filter
- map
- select / rename
- join
- aggregate
- save
That’s enough to clean real datasets used in labs, projects, and university research.
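Internally, my current working model is that each of these operations reduces to a plain function over lists of dict rows. A sketch (this is how I'm thinking about it, not a committed design):

```python
# Each Dusty operation as a plain function over lists of dict rows (sketch).

def filter_rows(rows, pred):
    return [r for r in rows if pred(r)]

def map_rows(rows, fn):
    return [fn(r) for r in rows]

def select(rows, *fields, **renames):
    # select / rename: keep `fields` as-is, copy renames as new_name=old_name.
    return [
        {**{f: r[f] for f in fields},
         **{new: r[old] for new, old in renames.items()}}
        for r in rows
    ]

def join(left, right, key):
    # Inner join on a shared key; right-side columns win on collision.
    index = {r[key]: r for r in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

def aggregate(rows, key, field, fn=sum):
    # Group by `key` and fold `field` with `fn` (sum by default).
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(float(r[field]))
    return [{key: k, field: fn(v)} for k, v in groups.items()]
```

If the core really is this small, the parser's job is mostly wiring these together in the order the script names them.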
How I’m building it
This is my first language project, so I’m keeping things practical:
- The Dusty interpreter is written in Python (the implementation language has nothing to do with Dusty's syntax).
- Dusty code will live in .dusty files.
- Users run it with a simple CLI:
dusty run main.dsty
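The CLI wrapper itself can stay tiny. A hypothetical entry point using only argparse, where `run_script` stands in for the not-yet-written interpreter:

```python
import argparse
import pathlib
import sys

def run_script(source: str) -> None:
    # Placeholder: the real interpreter will parse and execute `source`.
    raise NotImplementedError("interpreter not written yet")

def main(argv=None):
    parser = argparse.ArgumentParser(prog="dusty")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="run a Dusty script")
    run.add_argument("script", type=pathlib.Path)
    args = parser.parse_args(argv)
    if args.command == "run":
        run_script(args.script.read_text())

if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping the CLI this boring means all of the interesting work lives in one place: the interpreter.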
Roadmap for v0.1
My plan is to finish Dusty v0.1 with:
- a working parser
- CSV support
- filter/map operations
- save functionality
- a couple of example pipelines
- basic documentation
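For the "working parser" item, I'm starting from the simplest possible tokenizer. A sketch of the direction, using a single regex with named groups; the token names and the operator set are provisional:

```python
import re

# Provisional tokenizer sketch for Dusty v0.1 (token names are placeholders).
TOKEN_SPEC = [
    ("ARROW",  r"->"),          # lambda arrow: r -> ...
    ("PIPE",   r"\|"),          # pipeline stage separator
    ("STRING", r'"[^"]*"'),     # "users.csv"
    ("NUMBER", r"\d+"),
    ("OP",     r"[(),{}:.=<>]+"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("SKIP",   r"\s+"),         # whitespace, dropped
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

A real parser will need error reporting and probably keywords, but a flat token stream like this is enough to start experimenting with the grammar.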
I’m not adding a package manager, modules, or big features yet. Dusty v0.1 should be small enough that anyone can understand the whole project in one sitting.
Why I’m writing this publicly
I’ve noticed something: when you build in silence, you get lost. When you build in public, even quietly, you naturally stay accountable. So this weekly blog is just a way to share the progress, mistakes, and insights along the journey of creating a tiny DSL from scratch.
No big promises.
No hype.
Just consistent work.