[Paper] MegaChat: A Synthetic Persian Q&A Dataset for High-Quality Sales Chatbot Evaluation

Published: November 28, 2025 at 12:44 PM EST

Source: arXiv - 2511.23397v1

Overview

The paper introduces MegaChat, the first fully synthetic Persian question‑and‑answer (Q&A) dataset tailored for evaluating sales chatbots on Telegram, a platform widely used by Iranian SMEs. By automating data creation with a multi‑agent system, the authors demonstrate a cost‑effective way to generate realistic conversational data for a low‑resource language, which could enable smarter, locally adapted e‑commerce bots.

Key Contributions

  • MegaChat dataset: ≈ 500 K persona‑aware Persian Q&A pairs generated entirely synthetically.
  • Agentic pipeline: A novel multi‑agent architecture (question generator, validator, refiner) that harvests real‑world shopping channel content and produces high‑quality conversational data without human labeling.
  • RAG baselines: Implementation of three classic retrieval‑augmented generation (RAG) models (BM25‑RAG, DPR‑RAG, and ColBERT‑RAG) for comparison.
  • Enhanced agentic RAG: Multi‑query retrieval, neural reranking, and persona‑aligned response synthesis that outperforms traditional RAG on 4/5 evaluated channels.
  • Comprehensive evaluation: Use of GPT‑5.1 to score responses across six quality dimensions (relevance, fluency, factuality, persona consistency, engagement, and commercial suitability).
  • Open‑source release: Dataset and code are publicly available on GitHub, encouraging reproducibility and community extensions.
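
The enhanced agentic RAG contribution combines multi‑query retrieval with reranking. The sketch below is a minimal illustration of that idea and not the authors' implementation. It assumes a toy token‑overlap scorer in place of a real retriever, reciprocal rank fusion to merge the per‑query rankings, and a second overlap pass standing in for the neural reranker. The query variants are hand‑written here (the paper's system presumably generates them), and the listings are invented English examples used for readability.

```python
from collections import defaultdict


def tokenize(text):
    return set(text.lower().split())


def overlap_score(query, doc):
    """Toy lexical score: fraction of query tokens that appear in the document."""
    q = tokenize(query)
    return len(q & tokenize(doc)) / len(q) if q else 0.0


def retrieve(query, docs, k=2):
    """Return the indices of the top-k documents for one query variant."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: overlap_score(query, docs[i]),
                    reverse=True)
    return ranked[:k]


def multi_query_retrieve(variants, docs, k=2, rrf_k=60):
    """Merge the per-variant rankings with reciprocal rank fusion (an assumption)."""
    fused = defaultdict(float)
    for variant in variants:
        for rank, idx in enumerate(retrieve(variant, docs, k), start=1):
            fused[idx] += 1.0 / (rrf_k + rank)
    return sorted(fused, key=fused.get, reverse=True)


def rerank(question, candidates, docs):
    """Stand-in for a neural reranker: re-score candidates against the original question."""
    return sorted(candidates,
                  key=lambda i: overlap_score(question, docs[i]),
                  reverse=True)


# Invented product snippets, for illustration only.
DOCS = [
    "leather wallet price and free shipping details",
    "wireless earbuds battery life 24 hours",
    "leather wallet available colors black brown",
]

question = "what colors does the leather wallet come in"
variants = [question, "leather wallet colors", "wallet available colors"]

candidates = multi_query_retrieve(variants, DOCS)
best = rerank(question, candidates, DOCS)[0]
print(DOCS[best])
```

In a real system the selected snippet would then be passed to a generator together with the channel persona to produce the final answer.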

Methodology

  1. Data Harvesting – The system crawls active Telegram shopping channels, extracting product listings, FAQs, and user comments.
  2. Persona Modeling – For each channel, a lightweight persona profile (e.g., “friendly boutique seller”, “tech‑gear specialist”) is inferred from channel metadata and language style.
  3. Multi‑Agent Generation
    • Question Agent: Uses a Persian‑fine‑tuned language model to formulate plausible buyer questions based on product attributes and persona cues.
    • Validation Agent: Checks each question for relevance, grammaticality, and alignment with the persona, discarding low‑quality items.
    • Refinement Agent: Rewrites or expands questions to increase diversity and realism.
  4. Answer Synthesis – An answer agent retrieves relevant product info (multi‑query retrieval) and employs a reranker to pick the most appropriate snippet before generating a persona‑consistent response.
  5. Evaluation – GPT‑5.1 rates each Q&A pair on six dimensions; scores are aggregated to compare the agentic pipeline against three baseline RAG models (BM25‑RAG, DPR‑RAG, and ColBERT‑RAG).
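
The generate, validate, refine loop in step 3 can be sketched as follows. This is a minimal sketch under stated assumptions: all three agent functions are hypothetical rule‑based stand‑ins for what the summary describes as language‑model agents, and the `Persona` fields and the product record are illustrative names that do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    name: str
    tone: str


def question_agent(product, persona):
    """Hypothetical generator: one draft question per product attribute.

    An empty draft is appended on purpose so the validator has something to reject.
    """
    drafts = [f"What is the {attr} of the {product['name']}?"
              for attr in product["attributes"]]
    return drafts + [""]


def validation_agent(question, product):
    """Hypothetical validator: keep non-empty questions that mention the product."""
    return bool(question.strip()) and product["name"] in question


def refinement_agent(question, persona):
    """Hypothetical refiner: add a conversational opener that matches the channel tone."""
    opener = "Hi, quick question." if persona.tone == "friendly" else "Hello."
    return f"{opener} {question}"


def generate_questions(product, persona):
    """Run the three stages in order: generate, filter, then rewrite."""
    drafts = question_agent(product, persona)
    kept = [q for q in drafts if validation_agent(q, product)]
    return [refinement_agent(q, persona) for q in kept]


persona = Persona(name="friendly boutique seller", tone="friendly")
product = {"name": "linen scarf", "attributes": ["price", "material"]}

questions = generate_questions(product, persona)
for q in questions:
    print(q)
```

Keeping validation as a separate stage means rejected drafts never reach the refiner, which mirrors the "discarding low‑quality items" behavior described for the Validation Agent.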

Results & Findings

| Model | Avg. Quality Score (out of 10) | Channels where it leads |
| Agentic RAG (MegaChat pipeline) | 8.2 | 4/5 (fashion, electronics, home‑goods, cosmetics) |
| BM25‑RAG | 6.7 | |
| DPR‑RAG | 7.0 | |
| ColBERT‑RAG | 7.1 | |

  • Relevance & Persona Consistency: The agentic system scored 0.9 points higher on average than the best baseline, thanks to persona‑aware generation and reranking.
  • Scalability: Generating the full dataset took ~12 hours on a single GPU node, compared to weeks of manual annotation for a comparable size.
  • Cost Efficiency: Estimated annotation cost savings exceed US$150K for a dataset of this scale.
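
To make the scoring arithmetic concrete, the sketch below shows one plausible way to aggregate six per‑dimension judge scores into a single average, and then computes the gap between the agentic system and the best baseline from the averages reported above. The aggregation scheme (an unweighted mean over dimensions, then over answers) is an assumption, since the summary does not state how scores are combined, and the two sample judgments are made up for illustration.

```python
from statistics import mean

DIMENSIONS = (
    "relevance", "fluency", "factuality",
    "persona_consistency", "engagement", "commercial_suitability",
)


def aggregate(judgments):
    """Unweighted mean over the six dimensions, then over all judged answers (assumed scheme)."""
    return mean(mean(j[d] for d in DIMENSIONS) for j in judgments)


# Made-up judgments for two answers, only to exercise the function.
sample = [
    {d: 8.0 for d in DIMENSIONS},
    {d: 6.0 for d in DIMENSIONS},
]
sample_average = aggregate(sample)

# Average quality scores as reported in the results table.
REPORTED = {
    "Agentic RAG": 8.2,
    "BM25-RAG": 6.7,
    "DPR-RAG": 7.0,
    "ColBERT-RAG": 7.1,
}

baselines = [m for m in REPORTED if m != "Agentic RAG"]
best_baseline = max(baselines, key=REPORTED.get)
gap = round(REPORTED["Agentic RAG"] - REPORTED[best_baseline], 1)

print(best_baseline, gap)
```

The gap computed here is on the overall average score. The 0.9‑point figure quoted in the bullets appears under the relevance and persona consistency heading, so it presumably refers to those dimensions only.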

Practical Implications

  • Rapid Bot Prototyping – SMEs can bootstrap a Persian sales chatbot by fine‑tuning on MegaChat, cutting development cycles from months to days.
  • Domain Adaptability – The agentic pipeline can be re‑targeted to other verticals (e.g., travel, finance) by swapping the source Telegram channels, making it a reusable data‑generation engine.
  • Low‑Resource Language Boost – Demonstrates that high‑quality conversational data need not rely on expensive human labeling, encouraging more AI products in Persian and similar languages.
  • Integration with Existing Platforms – The dataset aligns with Telegram’s Bot API, allowing developers to plug in a pre‑trained model and immediately benefit from persona‑aware responses.
  • Benchmark for Future Research – Provides a standardized Persian sales‑chat benchmark, facilitating fair comparison of retrieval‑augmented and generative models.
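
As a rough illustration of the Telegram integration point, the sketch below turns an incoming Bot API update into a `sendMessage` request. It only builds the URL and JSON body and does not send anything over the network. The `answer_question` function is a hypothetical placeholder for whatever model a developer plugs in, and the token and chat values are dummies.

```python
import json

API_BASE = "https://api.telegram.org"


def answer_question(text):
    """Hypothetical placeholder for a sales model; a real bot would call the model here."""
    return f"Thanks for your message: {text}"


def build_reply(update, token):
    """Map a Bot API update to a sendMessage request (URL plus JSON body)."""
    message = update["message"]
    url = f"{API_BASE}/bot{token}/sendMessage"
    body = {
        "chat_id": message["chat"]["id"],
        "text": answer_question(message["text"]),
    }
    # ensure_ascii=False keeps Persian text readable in the serialized body.
    return url, json.dumps(body, ensure_ascii=False)


# Dummy update; the Persian text means "How much is this bag?"
update = {
    "update_id": 1,
    "message": {
        "message_id": 7,
        "chat": {"id": 12345, "type": "private"},
        "text": "قیمت این کیف چنده؟",
    },
}

url, body = build_reply(update, token="<BOT_TOKEN>")
print(url)
print(body)
```

Separating request construction from sending keeps the handler easy to test offline before wiring it to a webhook or a polling loop.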

Limitations & Future Work

  • Synthetic Bias – Because the data are generated from existing channel content, any bias or misinformation present in those sources may propagate into the dataset.
  • Persona Granularity – Current personas are coarse‑grained; finer distinctions (e.g., regional dialects, brand voice) remain unexplored.
  • Evaluation Scope – Reliance on GPT‑5.1 for scoring, while practical, may not fully capture human user satisfaction; a user study is planned.
  • Extension to Multi‑Turn Dialogues – MegaChat focuses on single‑turn Q&A; future work will expand to multi‑turn conversational flows and dynamic context handling.

MegaChat marks a significant step toward democratizing conversational AI for Persian e‑commerce, offering developers a ready‑to‑use dataset and a blueprint for synthetic data generation in other low‑resource domains.

Authors

  • Mahdi Rahmani
  • AmirHossein Saffari
  • Reyhane Rahmani

Paper Information

  • arXiv ID: 2511.23397v1
  • Categories: cs.CL, cs.AI, cs.MA
  • Published: November 28, 2025