Meet Bumblebee: Agentic AI Flagging Risky Merchants in Under 90 Seconds
Source: Dev.to
If you’ve ever worked at a payments company, you know the drill. Risk agents manually review thousands of merchant websites every month, checking for red flags: sketchy privacy policies, misaligned pricing, questionable social media presence, suspicious domain registration patterns.
At Razorpay, our risk operations team was conducting 10,000 to 12,000 manual website reviews monthly, each taking roughly four minutes of human attention. That’s 700 to 800 human hours consumed every month, and the quality was inconsistent because different agents would interpret the same signals differently.
The traditional approach to fraud detection involves throwing bodies at the problem or building rigid rule engines that break the moment fraudsters adapt their tactics. We needed something better, something that could scale with our transaction volume while actually getting smarter over time.
That’s why we built what we’re calling Agentic Risk, a multi‑agent AI system that automates merchant website evaluation from end to end while maintaining the nuanced judgment that used to require human expertise.
The journey from our initial n8n prototype through an AI agent to our current multi‑agent architecture reveals fundamental truths about building reliable AI systems at scale.
The Business Problem: When Manual Review Can’t Keep Up
Let me paint the picture of what risk operations looked like before automation. When a new merchant signs up for Razorpay or when our fraud detection system flags an existing merchant, a case lands in our Risk Case Manager system. A human agent picks up that case and begins the investigation dance.
This process takes four minutes when everything goes smoothly, but that’s rarely the case. Websites are structured differently, policy pages are hidden in weird places, domain information services have different interfaces, and social media handles aren’t always obvious. The worst part isn’t the time; it’s the inconsistency. One agent might flag a merchant for having a generic privacy policy while another agent considers the same policy acceptable.
We were also paying thousands of dollars monthly for a third‑party explicit content screening service, and it was generating about 50 alerts per month with less than 10% precision. Moreover, this service only caught one specific type of risk while ignoring dozens of other fraud indicators we cared about.
The fundamental issue was that we had excellent observability tools, structured data systems, and experienced risk analysts, but the connective tissue between all these components was human labor. Scaling meant hiring more agents, which meant more inconsistency, higher cost, and no improvement in detection speed or accuracy.
Phase 1: The n8n Prototype - When Visual Orchestration Hits Its Limits
We started with n8n, a visual workflow automation platform, to quickly prototype and validate our hypothesis. Within weeks, we had a working proof‑of‑concept integrating webhook ingestion, merchant metadata fetching, website content review via multimodal AI, domain lookups, GST enrichment, fraud metrics, and LLM‑based risk analysis.
The prototype validated that automation was feasible and helped us identify the complete set of data points needed. However, n8n quickly revealed fundamental limitations:
- Branch explosion – handling edge cases created unmaintainable 40‑node workflows with duplicated logic.
- Observability gaps – debugging failed nodes was painful with coarse logs.
- Platform instability – non‑deterministic behavior in HTTP and merge operations.
The n8n prototype taught us that production‑grade risk automation would require a code‑first approach with proper observability and the ability to use Python libraries directly.
Phase 2: Python + ReAct Agent - Better Control, New Bottlenecks
We rebuilt as a Python web application with an API frontend and task workers. This immediately solved several Phase 1 problems: native Python libraries, structured logging with trace IDs, proper exception handling with retry logic, and complex NLP preprocessing capabilities.
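To make the "structured logging with trace IDs" and "retry logic" concrete, here is a minimal sketch of the kind of plumbing that rebuild implies. The function and field names are illustrative assumptions, not Razorpay's actual internals:

```python
# Sketch: retry wrapper that emits structured, trace-tagged log lines.
# Names (with_retries, trace_id) are hypothetical, for illustration only.
import logging
import time

logging.basicConfig(format="%(levelname)s %(message)s")
log = logging.getLogger("risk")

def with_retries(fn, trace_id: str, attempts: int = 3, backoff_s: float = 0.0):
    """Call fn(); on failure, log a trace-tagged warning and retry with backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            log.warning("trace=%s attempt=%d failed: %s", trace_id, attempt, exc)
            if attempt == attempts:
                raise  # exhausted retries: surface the original exception
            time.sleep(backoff_s * attempt)
```

Tagging every log line with the case's trace ID is what makes "debugging failed nodes" tractable compared with the coarse logs of the n8n phase.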
The core was a single ReAct‑style agent that iteratively reasoned about which tools to call, executed them, and incorporated results until producing a structured risk assessment. Phase 2 brought full observability, easy tool addition, and dynamic behavior that replaced brittle conditional logic.
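The ReAct pattern described above can be sketched in a few lines. This is a simplified illustration, not our production agent; the tool names and the LLM reply shape are assumptions made for the example:

```python
# Minimal ReAct-style loop: the LLM picks a tool, we execute it, feed the
# observation back, and repeat until a final structured assessment emerges.
import json

def scrape_website(url: str) -> str:
    """Hypothetical tool: return page text for a merchant URL."""
    return f"<page content for {url}>"

def whois_lookup(domain: str) -> str:
    """Hypothetical tool: return domain registration details."""
    return f"<whois record for {domain}>"

TOOLS = {"scrape_website": scrape_website, "whois_lookup": whois_lookup}

def react_loop(llm, case: dict, max_steps: int = 8) -> dict:
    messages = [{"role": "user", "content": f"Assess merchant risk: {json.dumps(case)}"}]
    for _ in range(max_steps):
        # Assumed reply shape: {"action": ..., "input": ...} or {"final": ...}
        reply = llm(messages)
        if "final" in reply:
            return reply["final"]  # structured risk assessment
        observation = TOOLS[reply["action"]](reply["input"])
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return {"verdict": "inconclusive", "reason": "step budget exhausted"}
```

Note how every observation is appended to `messages`: that accumulation is exactly where the token bloat described below comes from.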
However, new bottlenecks emerged:
- Token bloat – the agent accumulated 50 KB+ of HTML content, domain data, and fraud metrics in its context window, regularly hitting token limits.
- Sequential execution – tool invocations happened one after another even when they had no dependencies, scaling linearly with tool count.
- Temperature conflation – a single temperature setting was suboptimal for both exploration (tool selection) and exploitation (final scoring).
Phase 2 proved agentic orchestration was right, but a single‑agent architecture couldn’t scale to thousands of concurrent evaluations.
Phase 3: Multi‑Agent Architecture - When Specialization Wins
The breakthrough came when we stopped treating fraud detection as a single AI task and started building a multi‑agent collaboration system. Rather than one agent doing everything, we split responsibilities across specialized agents optimized for specific roles: Planner, Fetchers, and Analyzer.
The Planner Agent receives the merchant case, examines available tools, checks system health and API quotas, and generates an execution plan. This isn’t a rigid script; it’s a structured specification of what information to gather, with priorities, timeouts, token budgets, and expected schemas. The Planner enforces business rules deterministically (e.g., skip GST validation for non‑Indian merchants, deprioritize social media checks for B2B merchants). This reduces unnecessary API calls and focuses resources on high‑signal checks.
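A plan of that shape might look like the sketch below. The field names, tools, and rules are illustrative assumptions; the point is that priorities, timeouts, and token budgets are explicit data, and the business rules run as plain deterministic code rather than LLM judgment:

```python
# Sketch of a Planner-style execution plan with deterministic rule filtering.
from dataclasses import dataclass, field

@dataclass
class FetchTask:
    tool: str
    priority: int           # lower = run first
    timeout_s: float        # per-fetcher deadline
    token_budget: int       # max tokens the fetcher may return
    expected_schema: dict = field(default_factory=dict)

def build_plan(merchant: dict) -> list:
    plan = [
        FetchTask("scrape_website", priority=1, timeout_s=15, token_budget=2000),
        FetchTask("whois_lookup", priority=1, timeout_s=5, token_budget=300),
        FetchTask("gst_validation", priority=2, timeout_s=10, token_budget=200),
        FetchTask("social_media_metrics", priority=3, timeout_s=10, token_budget=500),
    ]
    # Deterministic business rules: no LLM involved in these decisions.
    if merchant.get("country") != "IN":
        plan = [t for t in plan if t.tool != "gst_validation"]
    if merchant.get("segment") == "B2B":
        for t in plan:
            if t.tool == "social_media_metrics":
                t.priority = 9  # deprioritize, don't drop entirely
    return sorted(plan, key=lambda t: t.priority)
```

Keeping the plan as structured data also makes it cheap to log, diff, and replay when a case needs debugging.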
Data Fetcher Agents execute in parallel, each owning one data source or tool—website scraping, WHOIS lookups, fraud database queries, social media metrics, pricing comparisons, policy verification. Crucially, fetchers perform local data pruning before returning results, keeping token usage low and allowing the downstream Analyzer to focus on synthesis rather than raw data handling.
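The fan-out-and-prune pattern can be sketched with `asyncio`. The fetcher bodies here are stand-ins, but they show the two key moves: independent sources run concurrently, and each fetcher returns a compact summary instead of raw payloads:

```python
# Sketch: parallel fetchers with local pruning (fetcher bodies are stand-ins).
import asyncio

async def fetch_website(url: str) -> dict:
    raw_html = "<html>... 50 KB of markup ...</html>"  # stand-in for a real scrape
    # Local pruning: keep only the signals the Analyzer needs, drop the HTML.
    return {"source": "website", "has_privacy_policy": True, "chars_kept": 120}

async def fetch_whois(domain: str) -> dict:
    return {"source": "whois", "domain_age_days": 42, "registrar": "example-registrar"}

async def run_fetchers(merchant: dict) -> list:
    # Independent fetchers run concurrently instead of one after another.
    results = await asyncio.gather(
        fetch_website(merchant["url"]),
        fetch_whois(merchant["domain"]),
        return_exceptions=True,  # one failed source shouldn't sink the whole case
    )
    return [r for r in results if isinstance(r, dict)]
```

Because each fetcher prunes locally, the Analyzer's context window holds a handful of small summaries rather than the 50 KB+ blobs that hit token limits in Phase 2.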


