[Paper] MegaChat: A Synthetic Persian Q&A Dataset for High-Quality Sales Chatbot Evaluation
Source: arXiv - 2511.23397v1
Overview
The paper introduces MegaChat, the first fully synthetic Persian question‑and‑answer (Q&A) dataset tailored for evaluating sales chatbots on Telegram, a platform widely used by Iranian SMEs. By automating data creation with a multi‑agent system, the authors demonstrate a cost‑effective way to generate realistic conversational data for a low‑resource language, opening the door to smarter, locally adapted e‑commerce bots.
Key Contributions
- MegaChat dataset: ≈ 500 K persona‑aware Persian Q&A pairs generated entirely synthetically.
- Agentic pipeline: A novel multi‑agent architecture (question generator, validator, refiner) that harvests real‑world shopping channel content and produces high‑quality conversational data without human labeling.
- RAG baselines: Three classic retrieval‑augmented generation (RAG) models (BM25‑RAG, DPR‑RAG, and ColBERT‑RAG) implemented for comparison.
- Enhanced agentic RAG: Multi‑query retrieval, neural reranking, and persona‑aligned response synthesis that outperforms traditional RAG on 4/5 evaluated channels.
- Comprehensive evaluation: Use of GPT‑5.1 to score responses across six quality dimensions (relevance, fluency, factuality, persona consistency, engagement, and commercial suitability).
- Open‑source release: Dataset and code are publicly available on GitHub, encouraging reproducibility and community extensions.
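The multi‑query retrieval and reranking steps named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy token‑overlap scorer stands in for a real retriever, reciprocal rank fusion (RRF) is an assumed way of merging per‑query rankings (the summary does not say how results are combined), and `rerank` is a placeholder where a neural cross‑encoder would sit. All function names and the English sample corpus are hypothetical.

```python
from collections import defaultdict

def tokenize(text):
    return text.lower().split()

def overlap_score(query, doc):
    """Toy lexical score: fraction of query tokens that appear in the doc."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q) if q else 0.0

def retrieve(query, corpus, k=3):
    """Top-k doc ids for a single query, skipping docs with zero overlap."""
    scored = [(overlap_score(query, text), doc_id) for doc_id, text in corpus.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

def multi_query_retrieve(queries, corpus, k=3, rrf_k=60):
    """Merge per-query rankings with reciprocal rank fusion (assumed choice)."""
    fused = defaultdict(float)
    for query in queries:
        for rank, doc_id in enumerate(retrieve(query, corpus, k), start=1):
            fused[doc_id] += 1.0 / (rrf_k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))

def rerank(question, candidates, corpus, top_n=1):
    """Placeholder reranker: rescore candidates against the original question.
    A neural cross-encoder would replace overlap_score here."""
    ranked = sorted(candidates, key=lambda d: (-overlap_score(question, corpus[d]), d))
    return ranked[:top_n]

# Hypothetical channel content and query rewrites, for illustration only.
corpus = {
    "p1": "red cotton dress price 450 size M L",
    "p2": "wireless headphones battery 20 hours price 1200",
    "p3": "shipping to Tehran takes 2 days",
}
question = "what is the price of the red dress"
queries = [question, "red dress price", "dress cost size"]

candidates = multi_query_retrieve(queries, corpus)
best = rerank(question, candidates, corpus)
```

Issuing several rewrites of the buyer's question widens recall, while the reranker restores precision by judging every candidate against the original wording before a persona‑aligned answer is generated.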
Methodology
- Data Harvesting – The system crawls active Telegram shopping channels, extracting product listings, FAQs, and user comments.
- Persona Modeling – For each channel, a lightweight persona profile (e.g., “friendly boutique seller”, “tech‑gear specialist”) is inferred from channel metadata and language style.
- Multi‑Agent Generation
  - Question Agent: Uses a Persian‑fine‑tuned language model to formulate plausible buyer questions based on product attributes and persona cues.
  - Validation Agent: Checks each question for relevance, grammaticality, and alignment with the persona, discarding low‑quality items.
  - Refinement Agent: Rewrites or expands questions to increase diversity and realism.
- Answer Synthesis – An answer agent retrieves relevant product info (multi‑query retrieval) and employs a reranker to pick the most appropriate snippet before generating a persona‑consistent response.
- Evaluation – GPT‑5.1 rates each Q&A pair on six dimensions; scores are aggregated to compare the agentic pipeline against three baseline RAG models (BM25‑RAG, DPR‑RAG, and ColBERT‑RAG).
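The generate, validate, refine flow described in the steps above might be wired together roughly like this. It is a sketch under stated assumptions: the three agents are passed in as plain callables, and the `stub_*` functions are hypothetical stand‑ins for the LLM‑backed agents, whose prompts and models are not given in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Candidate:
    product: str
    persona: str
    question: str

def run_pipeline(
    products: Iterable[str],
    persona: str,
    generate: Callable[[str, str], List[str]],
    validate: Callable[[Candidate], bool],
    refine: Callable[[Candidate], str],
) -> List[Candidate]:
    """Generate questions per product, drop those that fail validation,
    and refine the survivors."""
    kept = []
    for product in products:
        for question in generate(product, persona):
            candidate = Candidate(product, persona, question)
            if not validate(candidate):
                continue  # low-quality items are discarded, not repaired
            kept.append(Candidate(product, persona, refine(candidate)))
    return kept

# Hypothetical stubs; real agents would call a language model.
def stub_generate(product, persona):
    return [f"How much is the {product}?", ""]  # second item is deliberately bad

def stub_validate(candidate):
    return bool(candidate.question.strip()) and candidate.product in candidate.question

def stub_refine(candidate):
    return candidate.question.replace("How much is", "What is the price of")

result = run_pipeline(
    ["kettle", "scarf"], "friendly boutique seller",
    stub_generate, stub_validate, stub_refine,
)
```

Keeping the agents as injectable callables makes it easy to swap a stub for a model call, or to log how many candidates each validation pass rejects.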
Results & Findings
| Model | Avg. Quality Score (out of 10) | Channels where it leads |
|---|---|---|
| Agentic RAG (MegaChat pipeline) | 8.2 | 4/5 (fashion, electronics, home‑goods, cosmetics) |
| BM25‑RAG | 6.7 | – |
| DPR‑RAG | 7.0 | – |
| ColBERT‑RAG | 7.1 | – |
- Relevance & Persona Consistency: The agentic system scored 0.9 points higher on average than the best baseline, thanks to persona‑aware generation and reranking.
- Scalability: Generating the full dataset took ~12 hours on a single GPU node, compared to weeks of manual annotation for a dataset of comparable size.
- Cost Efficiency: Estimated annotation cost savings exceed US $150 K for a dataset of this scale.
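To make the scoring arithmetic concrete, the sketch below averages six per‑dimension ratings into one quality score and computes the agentic system's margin over the strongest baseline from the table. Equal weighting of the dimensions is an assumption (the summary only says scores are "aggregated"), the identifiers are hypothetical, and the per‑dimension ratings in `example_ratings` are invented for illustration. Only the four average scores come from the table.

```python
from statistics import mean

DIMENSIONS = (
    "relevance", "fluency", "factuality",
    "persona_consistency", "engagement", "commercial_suitability",
)

def aggregate(scores):
    """Unweighted mean over the six dimensions (equal weights are assumed)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return mean(scores[d] for d in DIMENSIONS)

def margin_over_best_baseline(system_score, baseline_scores):
    """Gap between the system and the highest-scoring baseline, rounded to 2 places."""
    return round(system_score - max(baseline_scores.values()), 2)

# Average quality scores as reported in the results table.
BASELINES = {"BM25-RAG": 6.7, "DPR-RAG": 7.0, "ColBERT-RAG": 7.1}
AGENTIC_RAG = 8.2

margin = margin_over_best_baseline(AGENTIC_RAG, BASELINES)

# Invented ratings for a single Q&A pair, illustration only.
example_ratings = {
    "relevance": 9, "fluency": 8, "factuality": 8,
    "persona_consistency": 9, "engagement": 7, "commercial_suitability": 7,
}
example_score = aggregate(example_ratings)
```

On the table's overall averages the gap to ColBERT‑RAG works out to 1.1 points; the 0.9‑point figure quoted above applies specifically to relevance and persona consistency.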
Practical Implications
- Rapid Bot Prototyping – SMEs can bootstrap a Persian sales chatbot by fine‑tuning on MegaChat, cutting development cycles from months to days.
- Domain Adaptability – The agentic pipeline can be re‑targeted to other verticals (e.g., travel, finance) by swapping the source Telegram channels, making it a reusable data‑generation engine.
- Low‑Resource Language Boost – Demonstrates that high‑quality conversational data need not rely on expensive human labeling, encouraging more AI products in Persian and similar languages.
- Integration with Existing Platforms – The dataset aligns with Telegram’s Bot API, allowing developers to plug in a pre‑trained model and immediately benefit from persona‑aware responses.
- Benchmark for Future Research – Provides a standardized Persian sales‑chat benchmark, facilitating fair comparison of retrieval‑augmented and generative models.
Limitations & Future Work
- Synthetic Bias – Because the data are generated from existing channel content, any bias or misinformation present in those sources may propagate into the dataset.
- Persona Granularity – Current personas are coarse‑grained; finer distinctions (e.g., regional dialects, brand voice) remain unexplored.
- Evaluation Scope – Reliance on GPT‑5.1 for scoring, while practical, may not fully capture human user satisfaction; a user study is planned.
- Extension to Multi‑Turn Dialogues – MegaChat focuses on single‑turn Q&A; future work will expand to multi‑turn conversational flows and dynamic context handling.
MegaChat marks a significant step toward democratizing conversational AI for Persian e‑commerce, offering developers a ready‑to‑use dataset and a blueprint for synthetic data generation in other low‑resource domains.
Authors
- Mahdi Rahmani
- AmirHossein Saffari
- Reyhane Rahmani
Paper Information
- arXiv ID: 2511.23397v1
- Categories: cs.CL, cs.AI, cs.MA
- Published: November 28, 2025