My Agent System Looks Powerful but Is Just Industrial Trash
This weekend note is a bit late because Phase One of my Deep Data Analyst project failed for now. That means I can’t continue the promised Data Analyst Agent tutorial.
What Happened?
I actually built a single‑agent data‑analysis assistant based on the ReAct pattern.
The assistant could:
- Take a user’s analysis request.
- Form a reasonable hypothesis.
- Run EDA and modeling on the uploaded dataset.
- Deliver professional business insights and actionable suggestions.
- Create charts to back up its points.
If you’re curious about how it looked, here’s a screenshot:
After all, this was just a single‑agent app—not that hard to build. If you remember, I explained how I used a ReAct agent to solve the Advent of Code challenges. Here’s that tutorial:
How I Crushed Advent of Code And Solved Hard Problems Using Autogen Jupyter Executor and Qwen3
If you tweak that agent’s prompt a bit, you can get the same kind of data‑analysis ability I’m talking about.
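For illustration only, here is the kind of prompt tweak I mean. This is a minimal sketch, not the actual prompt from that tutorial, and every instruction in it is just an assumption about what a data-analysis variant could look like.

```python
# Hypothetical system prompt: the Advent-of-Code solver repointed at data analysis.
# None of this wording comes from the original tutorial.
DATA_ANALYST_SYSTEM_PROMPT = """
You are a senior data analyst working inside a stateful Jupyter notebook.
For every request:
1. Restate the business question and form an explicit hypothesis.
2. Explore the uploaded dataset first (schema, missing values, distributions).
3. Test the hypothesis with appropriate EDA or modeling code.
4. Summarize the findings as business insights with actionable suggestions.
5. Produce at least one chart that supports your argument.
Always run code before stating a conclusion.
"""
```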
Why Do I Call It a Failure?
Because my agent, like most built by AI hobbyists, is great at impressing your boss with a beautiful prototype. The moment real users rely on it, though, it breaks down and turns into industrial trash.
Why Do I Say That?
My agent has two serious problems.
Very Poor Robustness
This is the top feedback I got after giving it to analyst users.
If you try it once, it looks amazing. It uses methods and technical skills beyond those of a typical analyst and gives you a very professional argument. You’d think replacing humans with AI was the smartest move you’ve ever made.
But data analysis is about testing cause and effect over time. You must run the same analysis daily or weekly to see if the assistant’s advice actually works.
Even with the same question, the agent changes its hypotheses and analysis methods each run, producing different advice each time. That’s what I mean by poor stability and consistency.
Imagine you ask it to use an RFM model to segment your users and give marketing suggestions.
- Before a campaign it uses features A, B, C and makes five levels for each.
- After the campaign it suddenly adds a derived metric D and now segments on A, B, C, D.
You couldn’t even run an A/B test properly.
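To make the contrast concrete, here is a minimal sketch of what a pinned RFM segmentation looks like when the feature set and the number of levels are fixed up front. The column names, the five levels, and the one-row-per-customer input are assumptions for illustration, not something my agent actually produced.

```python
import pandas as pd

# Hypothetical, frozen RFM spec: the same three features and the same five levels
# on every run, so pre- and post-campaign segments stay comparable in an A/B test.
RFM_FEATURES = ["recency_days", "frequency", "monetary"]
N_LEVELS = 5

def rfm_segment(customers: pd.DataFrame) -> pd.DataFrame:
    """Score each customer 1-5 per feature; expects one row per customer."""
    scores = pd.DataFrame(index=customers.index)
    # Lower recency is better, so the labels are reversed for that feature.
    scores["R"] = pd.qcut(customers["recency_days"].rank(method="first"),
                          N_LEVELS, labels=list(range(N_LEVELS, 0, -1)))
    scores["F"] = pd.qcut(customers["frequency"].rank(method="first"),
                          N_LEVELS, labels=list(range(1, N_LEVELS + 1)))
    scores["M"] = pd.qcut(customers["monetary"].rank(method="first"),
                          N_LEVELS, labels=list(range(1, N_LEVELS + 1)))
    scores["segment"] = (scores["R"].astype(str) + scores["F"].astype(str)
                         + scores["M"].astype(str))
    return scores
```

Whatever form the spec takes, this is exactly the kind of definition that has to stay frozen between the “before” and “after” runs; my ReAct agent was free to rewrite it on every call.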
It Suffers from Context‑Position Bias
If you’ve read my earlier posts, you know my Data Analyst agent runs code through a stateful Jupyter‑Kernel‑based interpreter.
Exclusive Reveal: Code Sandbox Tech Behind Manus and Claude Agent Skills
This lets the agent act like a human analyst: first making a hypothesis, running code in a Jupyter notebook to test it, then forming a new hypothesis based on the results—iterating over and over.
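If you have not read that post, here is a minimal sketch of the underlying idea using `jupyter_client` directly. It is not the sandbox implementation from the article above, just an illustration of how a long-lived kernel keeps state between the agent’s code steps.

```python
from queue import Empty

from jupyter_client.manager import KernelManager

# One long-lived kernel, so variables persist across the agent's code executions.
km = KernelManager(kernel_name="python3")
km.start_kernel()
kc = km.client()
kc.start_channels()

def run_cell(code: str, timeout: float = 30.0) -> str:
    """Execute code in the shared kernel and collect its text output."""
    msg_id = kc.execute(code)
    outputs: list[str] = []
    while True:
        try:
            msg = kc.get_iopub_msg(timeout=timeout)
        except Empty:
            break
        if msg["parent_header"].get("msg_id") != msg_id:
            continue  # ignore traffic from other executions
        msg_type, content = msg["msg_type"], msg["content"]
        if msg_type == "stream":
            outputs.append(content["text"])
        elif msg_type in ("execute_result", "display_data"):
            outputs.append(content["data"].get("text/plain", ""))
        elif msg_type == "error":
            outputs.append("\n".join(content["traceback"]))
        elif msg_type == "status" and content["execution_state"] == "idle":
            break  # the kernel is done with this cell
    return "".join(outputs)

# State carries over between calls, just like cells in a notebook.
run_cell("import pandas as pd; df = pd.DataFrame({'x': [1, 2, 3]})")
print(run_cell("print(df['x'].sum())"))  # prints 6
```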
But here’s the problem. In a past post I mentioned that LLMs have position bias when dealing with long conversation histories:
Fixing the Agent Handoff Problem in LlamaIndex’s AgentWorkflow System
In short, LLMs don’t treat every message in the context fairly. They tend to over-weight content near the beginning and end of the context and lose track of what sits in the middle, rather than weighting messages by importance or recency the way we might expect.
As we keep making and testing hypotheses, the history grows. Every message matters:
- The first message shows the data structure.
- A later one disproves a hypothesis, so we shouldn’t test it again.
All of them matter. The LLM, however, starts focusing on the wrong messages and ignores the corrections, so it repeats mistakes it has already been told about. That either wastes tokens and time or pushes the analysis off track onto another topic; neither outcome is acceptable.
So that is where Phase One of my data‑analysis agent ends for now.
Any Ways to Fix It?
Build a Multi‑Agent System with Atomic Skills
For robustness, you’d probably think of using context engineering to lock in the plan and the metric definitions before the analysis starts.
When an analysis works well, we should save the plan and prior assumptions in long‑term memory.
Both require giving the agent new skills.
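As a rough illustration of both skills, the plan and the metric definitions can become an explicit, serializable artifact that is written once and reloaded on every later run. The field names below are assumptions, not a finished design.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path

@dataclass
class AnalysisPlan:
    """A frozen plan the agent must reuse instead of re-deriving it each run."""
    question: str
    hypotheses: list[str]
    metric_definitions: dict[str, str]   # e.g. {"recency_days": "days since last order"}
    segmentation_features: list[str]
    steps: list[str] = field(default_factory=list)

def save_plan(plan: AnalysisPlan, store: Path) -> None:
    """Persist a plan that worked well (the long-term-memory part)."""
    store.write_text(json.dumps(asdict(plan), indent=2))

def load_plan(store: Path) -> AnalysisPlan | None:
    """Reload the locked plan; return None on the very first run."""
    if not store.exists():
        return None
    return AnalysisPlan(**json.loads(store.read_text()))
```

On a rerun, the agent would load this plan and treat it as ground truth, only drafting a new one when the user’s question actually changes.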
But remember, my agent is based on ReAct, which means its prompt is already huge: over a thousand lines by now.
Multi‑Agent Data Analyst Design

Adding anything risks breaking this fragile system and disrupting prompt‑following.
A single agent won’t cut it. We need to split the system into multiple agents with atomic skills and orchestrate them.
Agents Overview
| Agent | Role |
|---|---|
| Issue Clarification Agent | Asks the user questions to clarify the problem, confirm metrics, and define scope. |
| Retrieval Agent | Pulls metric definitions, calculation methods, and analysis techniques from a knowledge base. |
| Planner Agent | Proposes hypotheses, selects an analysis approach, and creates a full plan to keep downstream agents on track. |
| Analyst Agent | Breaks the plan into executable steps, runs Python code, and tests the hypotheses. |
| Storyteller Agent | Transforms technical results into engaging business stories and actionable advice. |
| Validator Agent | Ensures the entire process is correct, reliable, and business‑compliant. |
| Orchestrator Agent | Manages all agents, assigns tasks, and routes messages between them. |

Choose the Right Agent Framework
We need a framework that supports message passing and context state saving (a rough sketch of this pattern follows the list):
- When a new task arrives or an agent finishes, a message should be sent to the orchestrator.
- The orchestrator should also dispatch tasks via messages.
- Intermediate results must be stored in a shared context rather than being sent back to the LLM each time (to avoid position bias).
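Here is the framework-agnostic sketch promised above, assuming a trivial in-process message bus and a plain dictionary as the shared context; a real framework adds persistence, retries, and observability on top. All class and method names here are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Message:
    sender: str
    recipient: str
    task: str
    payload_key: str | None = None   # where the inputs live in the shared context

@dataclass
class SharedContext:
    """Intermediate results live here, not in the LLM's message history."""
    store: dict[str, Any] = field(default_factory=dict)

class Orchestrator:
    """Routes messages between agents and hands each one the shared context."""
    def __init__(self, agents: dict[str, Any], context: SharedContext):
        self.agents = agents
        self.context = context
        self.inbox: list[Message] = []

    def dispatch(self, msg: Message) -> None:
        agent = self.agents[msg.recipient]
        # The agent reads only what it needs from the shared context and writes
        # its result back under a new key, keeping every prompt short and focused.
        result_key = agent.handle(msg, self.context)
        self.inbox.append(Message(sender=msg.recipient, recipient="orchestrator",
                                  task="done", payload_key=result_key))

class PlannerStub:
    """Stand-in for the Planner Agent: writes a plan into the shared context."""
    def handle(self, msg: Message, ctx: SharedContext) -> str:
        ctx.store["plan"] = {"hypotheses": ["H1"], "steps": ["EDA", "modeling"]}
        return "plan"

ctx = SharedContext()
orchestrator = Orchestrator({"planner": PlannerStub()}, ctx)
orchestrator.dispatch(Message(sender="user", recipient="planner", task="plan the analysis"))
print(ctx.store["plan"])   # downstream agents read this key instead of re-reading history
```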
Candidates
| Framework | Pros | Cons |
|---|---|---|
| LangGraph | Clean workflow definition. | Still built on LangChain, which I dislike. |
| Autogen | Good for research‑heavy tasks; offers a Selector Group Chat for orchestration. | Poor message‑history control, black‑box orchestration, half‑baked GraphFlow, and development has stalled. |
| Microsoft Agent Framework (MAF) | Easy to use, modern features (MCP, A2A, AG‑UI), robust workflow engine, context state management, OpenTelemetry observability, switch‑case & multi‑selection orchestration modes. | Newer, so community resources are still growing. |
Bottom line: I’ll skip LangGraph and Autogen and move forward with Microsoft Agent Framework (MAF).
What About Microsoft Agent Framework (MAF)?
I like MAF because it:
- Incorporates the best ideas from earlier frameworks while avoiding their pitfalls.
- Works well with Qwen‑3 and DeepSeek models (see my guide on structured output compatibility).
- Offers a powerful Workflow feature: multiple node types, built‑in context state, observability, and flexible orchestration modes.
“MAF feels ambitious. With new abilities like MCP, A2A, AG‑UI, and strong Microsoft backing, it should have a better long‑term future than Autogen.”
Reference:
Make Microsoft Agent Framework’s Structured Output Work With Qwen and DeepSeek Models
My Next Steps
- Read MAF’s user guide and source code to get comfortable with its API.
- Start building the agent system using MAF, integrating Qwen‑3 and DeepSeek as the underlying LLMs.
- Adapt the Deep Data Analyst architecture to the new framework (some refactoring will be required).
- Explore Workflow patterns in MAF to see how they map to common AI‑agent design patterns.
The advantage of a multi‑agent system is incremental progress: I can add skills step‑by‑step and share updates as soon as they’re ready, rather than waiting for the whole project to finish. 😂
Join the Conversation
What are you interested in? Leave a comment below.
Subscribe to my newsletter Mr.Q’s Weekend Notes for the latest agent‑research straight to your inbox.
Share this blog with friends—maybe it can help more people.
Enjoyed this read? Subscribe now for more cutting‑edge data‑science tips! Your feedback and questions are welcome—let’s discuss in the comments!
This article was originally published on Data Leads Future.