Automating machine learning with AI agents
Source: Dev.to
Overview
When solving competitions on Kaggle, you quickly notice a pattern:
- Baseline – upload the data, run CatBoost or LightGBM, and get a baseline metric (≈ ½ hour).
- Top solutions – require dozens of preprocessing options, hundreds of feature combinations, and thousands of hyper‑parameter sets.
Existing AutoML systems don’t help much:
| System | How it works | Limitation |
|---|---|---|
| AutoGluon | Trains several models and builds a multi‑level ensemble. | Each run starts from scratch. |
| TPOT | Generates a pipeline via a genetic algorithm. | Doesn’t learn from previous runs. |
| Typical AutoML | Tries a fixed set of algorithms, picks the best according to the metric. | No reasoning, no adaptation, no experience accumulation. |
The main problem is lack of reasoning. These systems don’t analyze why a particular approach succeeded or failed, nor do they adapt to the specifics of a new task. Every new dataset is treated as if it were the first one.
Humans work differently. A data scientist who sees unbalanced classes immediately thinks about stratification and threshold selection; if they have tackled a similar problem before, they reuse what worked. When the first attempt fails, they analyze the cause and try a different approach.
A Human‑like AutoML Architecture
With large language models (LLMs) it became possible to build a system that reasons more like a human. An LLM can:
- Analyze data.
- Reason about method selection.
- Learn from examples.
One model alone, however, can still miss obvious mistakes or get stuck on a wrong approach. We therefore need an architecture that allows the system to check itself and accumulate experience.
Actor‑Critic Inspiration
In reinforcement learning, Actor‑Critic methods use two agents:
- Actor – takes actions.
- Critic – evaluates those actions.
Applying this idea to AutoML:
| Role | Responsibilities |
|---|---|
| Actor | Receives the data and a toolbox of specialized services (MCP servers). Explores the dataset, decides which steps are needed, and generates a solution (report + artifacts). |
| Critic | Receives only the Actor’s report (no tools). Checks whether everything was done correctly. If problems are found, it returns feedback so the Actor can iterate. |
| Memory | After each iteration, the experience (reports, feedback, outcomes) is stored and later retrieved for similar tasks. |
The loop is: Actor → Critic → Feedback → Actor (repeat).
Tooling: MCP (Model Context Protocol)
LLMs can reason, but they need tools to manipulate data. I grouped the tools into four categories:
- Data preview – quick inspection of the raw file.
- Statistical analysis – descriptive statistics, missing‑value diagnostics, etc.
- Processing – encoding, imputation, scaling, etc.
- Model training – fitting models, generating predictions, ensembling.
Example: Data‑preview tool output
{
  "shape": [150, 5],
  "columns": ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"],
  "dtypes": {
    "sepal_length": "Float64",
    "species": "String"
  },
  "sample": [
    {"sepal_length": 5.1, "species": "setosa"},
    ...
  ]
}
The Actor can see the dimensions, column types, and a few rows—enough to decide the next steps.
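A preview like this can be produced with just a few lines of polars. The sketch below is only an illustration of what such a tool might do internally (the function name and exact output keys are assumptions, not the actual server code):

import polars as pl

def preview_csv(path: str, n_rows: int = 3) -> dict:
    # Read the file and return its shape, schema, and a few sample rows
    df = pl.read_csv(path)
    return {
        "shape": list(df.shape),
        "columns": df.columns,
        "dtypes": {col: str(dtype) for col, dtype in zip(df.columns, df.dtypes)},
        "sample": df.head(n_rows).to_dicts(),
    }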
Consistent preprocessing
A crucial requirement is identical transformations for train and test. For example, if a categorical feature is encoded as {"red": 0, "blue": 1} in the training set, the same mapping must be applied to the test set. Mappings are saved as JSON files:
import json
from pathlib import Path

mapping_path = Path(output_dir) / f"{column}_mapping.json"
with open(mapping_path, "w") as f:
    json.dump(mapping, f)
This is especially important for categorical classification: the model outputs numeric codes, which must be converted back to the original class labels.
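The reverse step can be sketched as follows, assuming a mapping file saved as above (the function and argument names are illustrative, not the library's actual API):

import json
from pathlib import Path

def decode_predictions(predictions, mapping_path):
    # Invert {"red": 0, "blue": 1} into {0: "red", 1: "blue"} and map codes back to labels
    with open(mapping_path) as f:
        mapping = json.load(f)
    inverse = {code: label for label, code in mapping.items()}
    return [inverse[int(code)] for code in predictions]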
Training‑tool contract
Each training tool returns three items:
- Path to the saved model.
- Path to the predictions file.
- Metrics computed on the training data.
Paths are generated with a timestamp and a UUID, allowing the Actor to run many algorithms in parallel without naming conflicts.
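For illustration, the contract and the unique-path scheme can be sketched roughly like this (key names and the path format are assumptions; the real tools may differ):

import uuid
from datetime import datetime
from pathlib import Path

def unique_path(workspace: str, prefix: str, suffix: str) -> str:
    # e.g. workspace/model_20250101_120000_5f3a9c1e.pkl
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return str(Path(workspace) / f"{prefix}_{stamp}_{uuid.uuid4().hex[:8]}{suffix}")

def train_tool_result(workspace: str, train_metrics: dict) -> dict:
    # The three-item contract: model path, predictions path, training metrics
    return {
        "model_path": unique_path(workspace, "model", ".pkl"),
        "predictions_path": unique_path(workspace, "predictions", ".csv"),
        "metrics": train_metrics,
    }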
Scaling the Toolbox
When the number of tools exceeds ten, managing, supporting, and scaling them becomes cumbersome. The FastMCP framework (an implementation of the Model Context Protocol) solves this by:
- Packaging each tool as an independent server.
- Exposing a simple RPC‑style API that the Actor can call on demand.
I created five MCP servers:
| Server | Purpose |
|---|---|
| file_operations | File I/O utilities. |
| data_preview | Quick CSV preview. |
| data_analysis | Statistical summaries. |
| data_processing | Transformations (encoding, imputation, scaling). |
| machine_learning | Model training, prediction, ensembling. |
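A minimal sketch of what one of these servers could look like with FastMCP (the tool name and body are illustrative, not the actual server code):

from pathlib import Path
from fastmcp import FastMCP

mcp = FastMCP("file_operations")

@mcp.tool()
def list_workspace_files(directory: str) -> list[str]:
    """List the files in the agent's workspace directory."""
    return sorted(str(p) for p in Path(directory).glob("*") if p.is_file())

if __name__ == "__main__":
    mcp.run()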
Structured Reporting & Multi‑Judge Critic
The Actor produces a structured report with four sections:
- Data analysis
- Preprocessing
- Model training
- Results
The Critic does not use any tools; it only reads the report. Instead of a single monolithic judge, I employ four specialized LLM judges, each focusing on one section.
judges = [
    LLMJudge(rubric="Evaluate data_analysis: Is exploration thorough?"),
    LLMJudge(rubric="Evaluate preprocessing: Are steps appropriate?"),
    LLMJudge(rubric="Evaluate model_training: Is selection justified?"),
    LLMJudge(rubric="Evaluate results: Are metrics calculated correctly?"),
]
Each judge returns:
- A score between 0 and 1.
- A justification explaining the rating.
The overall Critic score is the average of the four judges. If the average falls below a predefined threshold, the Critic sends detailed feedback to the Actor, which then revises its solution and iterates.
End‑to‑End Flow (Pseudo‑code)
memory = ExperienceMemory()

def auto_ml_pipeline(data_path):
    # 1. Load prior experience (if any)
    context = memory.retrieve_similar(data_path)
    # 2. Actor generates an initial solution
    report, artifacts = Actor.run(data_path, context)
    while True:
        # 3. Critic evaluates the report
        scores, feedback = Critic.evaluate(report)
        # 4. If the average score is high enough → finish
        if sum(scores) / len(scores) >= 0.85:
            break
        # 5. Otherwise, give feedback to the Actor and iterate
        report, artifacts = Actor.improve(report, feedback)
    # 6. Store the successful experience for future tasks
    memory.store(data_path, report, artifacts, scores)
    return artifacts["predictions_path"]
Take‑aways
- Reasoning + tools → an LLM alone can’t manipulate data; a toolbox (MCP servers) fills that gap.
- Actor‑Critic loop catches mistakes early and drives iterative improvement.
- Specialized judges provide focused, granular feedback rather than a single monolithic evaluation.
- Experience memory enables the system to accumulate knowledge across tasks, moving AutoML closer to human‑like expertise.
Iterative Decision‑Making with Actor & Critic
The Critic’s averaged score for the Actor’s solution is compared to an acceptance threshold (usually 0.75).
- If the score is higher, the solution is accepted.
- Otherwise, the Critic gathers feedback from all judges’ comments and passes it back to the Actor for the next iteration.
This multi‑judge approach is more stable than relying on a single judge.
A single LLM can be overly strict or miss obvious errors, while four specialized judges smooth out subjectivity.
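The decision step can be sketched roughly as follows (the data structures and threshold handling are assumptions based on the description above, not the exact implementation):

def critic_decision(judge_results, threshold=0.75):
    # judge_results: list of (score, justification) tuples, one per judge
    scores = [score for score, _ in judge_results]
    average = sum(scores) / len(scores)
    if average >= threshold:
        return "accept", None
    # Collect every judge's comments into a single feedback message for the Actor
    feedback = "\n".join(
        f"- (score {score:.2f}) {justification}"
        for score, justification in judge_results
    )
    return "iterate", feedback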
File‑System Isolation
When an agent works with files it must not have access to the entire file system. Isolation is achieved by creating a dedicated directory for each session:
~/.scald/actor/
│
├─ data/ # copies of the source data
├─ output/ # intermediate files
└─ workspace/ # models and predictions
- The source CSV files are copied into data/.
- All tools operate only within these directories, preventing accidental overwrites of important files or reading of unrelated data.
After the run finishes, all artifacts are copied to a session directory with a timestamp, and the workspace is cleared. You can later inspect this directory to see exactly what the agent did:
- Which models were trained (load them from the .pkl files)
- What metrics were obtained
- Which steps were performed
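Setting up and archiving such a workspace is straightforward. The sketch below follows the directory layout above; the session-archiving details are assumptions, not the library's exact behavior:

import shutil
from datetime import datetime
from pathlib import Path

WORKDIR = Path.home() / ".scald" / "actor"

def prepare_workspace(train_csv: str, test_csv: str) -> None:
    # Create the isolated directories and copy the source data into them
    for sub in ("data", "output", "workspace"):
        (WORKDIR / sub).mkdir(parents=True, exist_ok=True)
    shutil.copy(train_csv, WORKDIR / "data")
    shutil.copy(test_csv, WORKDIR / "data")

def archive_session(sessions_root: str = "sessions") -> Path:
    # Copy all artifacts to a timestamped session directory, then clear the workspace
    session_dir = Path(sessions_root) / datetime.now().strftime("%Y%m%d_%H%M%S")
    shutil.copytree(WORKDIR, session_dir)
    shutil.rmtree(WORKDIR)
    return session_dir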
Experience Storage & Retrieval
After each iteration the system saves the experience:
self.mm.save(
    actor_solution=actor_solution,
    critic_evaluation=critic_evaluation,
    task_type=task_type,
    iteration=iteration,
)
Search for Similar Past Solutions
# Retrieve the most relevant memories
actor_memory, critic_memory = self.mm.retrieve(
    actor_report=actor_solution.report,
    task_type=task_type,
    top_k=5,
)
- The Actor report and Critic evaluation are stored in a ChromaDB vector database.
- When a new task arrives, the system performs a semantic search (using the Jina embedding model) to find similar past solutions.
- Those solutions are supplied to the agent as context.
Even unsuccessful attempts are valuable.
If the Critic once said, “you forgot to handle the missing values,” that feedback can guide future tasks. Semantic search will surface such cases as well.
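A minimal sketch of such a memory, using ChromaDB's default embedding function instead of Jina to keep the example self-contained (collection name, metadata fields, and function names are assumptions):

import chromadb

client = chromadb.PersistentClient(path=".scald_memory")
collection = client.get_or_create_collection("actor_reports")

def save_experience(report_id: str, report_text: str, task_type: str, score: float) -> None:
    # Store the Actor report with metadata so failures and successes are both searchable
    collection.add(
        ids=[report_id],
        documents=[report_text],
        metadatas=[{"task_type": task_type, "score": score}],
    )

def retrieve_similar(report_text: str, task_type: str, top_k: int = 5):
    # Semantic search over past reports, restricted to the same task type
    return collection.query(
        query_texts=[report_text],
        n_results=top_k,
        where={"task_type": task_type},
    )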
Full Iterative Cycle
When all components are ready, the cycle runs until:
- The maximum number of iterations is reached, or
- The Critic makes a final decision.
At each iteration:
- Actor solves the problem, incorporating any feedback.
- Critic evaluates the solution.
- The experience is stored in memory.
- Relevant context is extracted for the next attempt.
Observing Actor Learning
- Iteration 1: Simple preprocessing + one model.
- Critic feedback: “You did not check class balance,” “Missing feature engineering.”
- Iteration 2: Adds the missing steps, tries several models, builds an ensemble.
A concrete failure case: the Actor encoded the target column, trained the model, but forgot to decode predictions. The output was numeric IDs instead of class labels. The fix was to add explicit instructions in the system prompt:
If you encode the target column, you MUST DECODE predictions before returning.
Use decode_categorical_label with the mapping path from the encoding step.
When the Actor experiments with multiple models, files can overwrite each other. Prompting the LLM to generate unique filenames proved unreliable. The robust solution is to handle naming at the tool level, appending a timestamp and UUID to each file, as in the path-generation sketch shown earlier.
Experimental Results
The system was evaluated on several OpenML datasets.
| Dataset | Scald (F1) | Baseline (RF) | Baseline (AutoGluon) | Baseline (FLAML) |
|---|---|---|---|---|
| christine | 0.743 | 0.713 (‑4 %) | – | – |
| cnae‑9 | 0.980 | – | – | 0.945 (‑3.5 %) |
| Australian | 0.836 | – | 0.860 | – |
| blood‑transfusion | 0.756 | 0.712 | 0.734 | 0.767 (‑1.5 %) |
- Cost per run ranged from $0.14 to $3.43, depending on task complexity and iteration count.
- Running time varied from 1 minute to 30 minutes.
The value of the system is not merely the raw metric but the intelligent automation it enables.
By modularising the workflow (MCP), we can plug in specialized agents for any task, preserving a single iterative improvement loop and accumulating experience over time.
Limitations
- Works best for tabular data with gradient‑boosting algorithms.
- Not suited out‑of‑the‑box for deep learning or time‑series tasks (additional tools required).
- Overall quality heavily depends on the size and capability of the underlying LLM.
Getting Started
Installation
pip install scald
Using the CLI
scald --train data/train.csv \
--test data/test.csv \
--target price \
--task-type regression
Using the Python API
import asyncio

from scald import Scald

async def main():
    scald = Scald(max_iterations=5)
    predictions = await scald.run(
        train="data/train.csv",  # .csv or pandas DataFrame
        test="data/test.csv",
        target="target_column",
        task_type="classification",
    )
    return predictions

predictions = asyncio.run(main())
Note: You need an API key from a provider compatible with OpenAI (e.g., OpenRouter).
Additional Information
- You also need a Jina API key for the embeddings used by the memory system (the service provides a generous number of free tokens upon registration).
- All code is packaged in a library and available on GitHub.