[Paper] FAMOSE: A ReAct Approach to Automated Feature Discovery

Published: February 19, 2026 at 01:53 PM EST
4 min read
Source: arXiv

Overview

Feature engineering is often the hidden “secret sauce” that makes or breaks a machine‑learning model, especially for tabular data. The new FAMOSE framework shows how a ReAct‑style AI agent can automatically discover, evaluate, and select high‑impact features, delivering state‑of‑the‑art results on both classification and regression problems without heavy domain expertise.

Key Contributions

  • First ReAct‑based agent for automated feature engineering – integrates reasoning, acting, and tool use (e.g., feature generators, selectors, evaluators) inside a single LLM‑driven loop.
  • Unified pipeline for regression and classification – same architecture works across task types, handling datasets of any size.
  • Empirical gains on large‑scale tabular benchmarks – +0.23 % ROC‑AUC on classification tasks with >10 K rows and –2.0 % RMSE on regression tasks, matching or surpassing existing AutoFE systems.
  • Robustness to noisy or erroneous feature proposals – the agent’s iterative “think‑act‑reflect” cycle prunes bad features early, reducing error propagation.
  • Open‑source implementation & reproducible experiments – code and benchmark scripts released under an MIT license.

Methodology

FAMOSE treats feature engineering as an interactive problem‑solving task for a large language model (LLM). The workflow follows the ReAct (Reason+Act) paradigm:

  1. Reasoning step – the LLM receives the raw dataset description, current feature set, and performance metrics, then generates a textual plan (e.g., “Create a log‑scaled version of column X”).
  2. Action step – the plan is translated into concrete tool calls:
    • Feature generators (arithmetic combinations, binning, embeddings, statistical transforms).
    • Feature selectors (mutual information, SHAP, L1 regularization).
    • Evaluators (quick cross‑validation to estimate ROC‑AUC or RMSE).
  3. Reflection step – the LLM reads back the evaluation results, updates its internal “memory” (the prompt context), and decides whether to keep, modify, or discard the new feature.
  4. Iterate – the loop repeats until a stopping criterion (budget, convergence, or performance plateau) is met.
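The action‑step tools can be sketched with a toy generator and a cheap cross‑validation probe. This is a minimal NumPy illustration, not the paper's implementation; the function names, the candidate transforms, and the OLS probe model are all assumptions:

```python
import numpy as np

def generate_candidates(X, names):
    """Hypothetical feature generators: log-scaled columns and pairwise products."""
    feats, labels = [], []
    for j, name in enumerate(names):
        col = X[:, j]
        if np.all(col > 0):                 # log only valid on positive columns
            feats.append(np.log(col))
            labels.append(f"log({name})")
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # interaction terms
            feats.append(X[:, i] * X[:, j])
            labels.append(f"{names[i]}*{names[j]}")
    return np.column_stack(feats), labels

def quick_rmse(X, y, folds=3):
    """Cheap k-fold CV with ordinary least squares as the probe model."""
    idx = np.arange(len(y))
    np.random.default_rng(0).shuffle(idx)
    scores = []
    for part in np.array_split(idx, folds):
        train = np.setdiff1d(idx, part)
        A = np.column_stack([X[train], np.ones(len(train))])
        w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.column_stack([X[part], np.ones(len(part))])
        scores.append(np.sqrt(np.mean((B @ w - y[part]) ** 2)))
    return float(np.mean(scores))
```

A candidate feature would then be kept only if appending its column lowers `quick_rmse` relative to the current feature set.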

Because the LLM’s context window retains the entire history of proposals and outcomes, it effectively builds a few‑shot prompt that teaches itself which transformations are useful for the given data distribution.
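Put together, the think‑act‑reflect cycle reduces to a small control loop. The sketch below is a hedged reconstruction of the workflow described above, with `propose` standing in for the LLM call; all names and the plateau stopping rule are assumptions, not the paper's API:

```python
def react_feature_loop(dataset, propose, apply_action, evaluate,
                       budget=20, patience=3):
    """Minimal ReAct-style loop: reason (propose), act (apply), reflect (log)."""
    history = []                    # the prompt "memory" of (plan, score) pairs
    best = evaluate(dataset)
    stale = 0
    for _ in range(budget):
        plan = propose(history)                  # Reason: draft a textual plan
        if plan is None:
            break
        candidate = apply_action(dataset, plan)  # Act: run the tool call
        score = evaluate(candidate)              # quick CV estimate
        history.append((plan, score))            # Reflect: record the outcome
        if score > best:                         # keep the feature
            dataset, best, stale = candidate, score, 0
        else:                                    # discard; count toward plateau
            stale += 1
            if stale >= patience:
                break
    return dataset, best, history
```

Because `history` is fed back into each `propose` call, the loop accumulates exactly the few‑shot context described above.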

Results & Findings

| Task | Dataset size | Metric improvement vs. best baseline |
| --- | --- | --- |
| Classification (≥10 K rows) | 12 K – 150 K | +0.23 % ROC‑AUC (average) |
| Classification (small) | <10 K | Comparable to top AutoFE tools |
| Regression | 5 K – 80 K | −2.0 % RMSE (average) |
| Robustness test (noisy features) | | Error rate 15 % lower than competing agents |

Key observations:

  • The agent’s memory of past successes guides it toward more “creative” transformations (e.g., interaction terms that a human might overlook).
  • Performance gains are most pronounced on larger datasets where the search space is huge and manual engineering becomes impractical.
  • The iterative evaluation loop prevents overfitting to spurious features, yielding more stable models across cross‑validation folds.

Practical Implications

  • Accelerated prototyping – Data scientists can drop a CSV into FAMOSE and obtain a ready‑to‑train feature matrix in minutes, freeing time for model selection and business logic.
  • Lowered expertise barrier – Teams without deep domain knowledge can still achieve competitive performance on tabular problems (e.g., churn prediction, credit scoring).
  • Integration with MLOps pipelines – Because FAMOSE’s tool calls are modular (Python functions, Spark jobs, etc.), it can be wrapped as a step in CI/CD for model training, automatically updating features when source data drifts.
  • Cost‑effective scaling – The agent’s ability to prune ineffective features early reduces the compute budget for downstream model training, especially valuable in cloud‑pay‑as‑you‑go environments.
  • Potential for domain‑specific extensions – Plug‑in custom generators (e.g., time‑series lag features, NLP embeddings) to tailor the agent to specialized industries.
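As a concrete example of the last point, a domain plug‑in can be as small as a single generator function. The sketch below adds time‑series lag features under assumed conventions (rows as dicts, already sorted by time); it is illustrative and not part of FAMOSE's released API:

```python
def lag_features(rows, column, lags=(1, 2, 3)):
    """Hypothetical plug-in generator: lagged copies of a time-ordered column.

    `rows` is a list of dicts sorted by time; lags that reach before the
    start of the series are filled with None. The input is not mutated.
    """
    values = [r[column] for r in rows]
    out = []
    for i, row in enumerate(rows):
        enriched = dict(row)                     # copy, keep the input intact
        for k in lags:
            enriched[f"{column}_lag{k}"] = values[i - k] if i >= k else None
        out.append(enriched)
    return out
```

Registering such a function alongside the built‑in generators would let the agent propose plans like "add a 7‑day lag of sales" for retail data.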

Limitations & Future Work

  • LLM dependency – Quality of generated features hinges on the underlying LLM; smaller or open‑source models may yield weaker proposals.
  • Prompt length constraints – Very long feature histories can exceed context windows, requiring summarization heuristics that might lose nuance.
  • Interpretability – While the agent records its reasoning, the generated transformations can still be opaque to non‑technical stakeholders.
  • Scalability to ultra‑high‑dimensional data – Experiments above 200 K columns were not covered; future work will explore hierarchical search and distributed tool execution.
  • Broader evaluation – Extending benchmarks to time‑series, multi‑modal, and streaming data scenarios is on the roadmap.

FAMOSE demonstrates that AI agents equipped with reasoning‑action loops can tackle the creative, iterative nature of feature engineering, opening the door for more autonomous data‑science pipelines in the near future.

Authors

  • Keith Burghardt
  • Jienan Liu
  • Sadman Sakib
  • Yuning Hao
  • Bo Li

Paper Information

  • arXiv ID: 2602.17641v1
  • Categories: cs.LG, cs.AI
  • Published: February 19, 2026