[Paper] MegaChat: A Synthetic Persian Q&A Dataset for High-Quality Sales Chatbot Evaluation

Published: November 28, 2025 at 12:44 PM EST

Source: arXiv - 2511.23397v1

Overview

The paper introduces MegaChat, the first fully synthetic Persian question‑and‑answer (Q&A) dataset tailored for evaluating sales chatbots on Telegram, a platform widely used by Iranian SMEs. By automating data creation with a multi‑agent system, the authors demonstrate a cost‑effective way to generate realistic conversational data for a low‑resource language, which could support e‑commerce bots adapted to local markets.

Key Contributions

MegaChat dataset: ≈ 500 K persona‑aware Persian Q&A pairs generated entirely synthetically.
Agentic pipeline: A novel multi‑agent architecture (question generator, validator, refiner) that harvests real‑world shopping channel content and produces high‑quality conversational data without human labeling.
RAG baselines: Three classic retrieval‑augmented generation (RAG) models (BM25‑RAG, DPR‑RAG, and ColBERT‑RAG) implemented for comparison.
Enhanced agentic RAG: Multi‑query retrieval, neural reranking, and persona‑aligned response synthesis that outperforms traditional RAG on 4/5 evaluated channels.
Comprehensive evaluation: Use of GPT‑5.1 to score responses across six quality dimensions (relevance, fluency, factuality, persona consistency, engagement, and commercial suitability).
Open‑source release: Dataset and code are publicly available on GitHub, encouraging reproducibility and community extensions.
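
As a rough illustration of the multi‑query retrieval and reranking idea listed above, the following minimal sketch retrieves candidates for several rewordings of a question, merges them, and then re‑scores the merged set against the original question. It is not the authors' implementation: the token‑overlap score is a toy stand‑in for both the retriever and the neural reranker, and every name here (overlap_score, multi_query_retrieve, rerank, answer_context) is hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # \w is Unicode-aware in Python 3, so Persian text is tokenized as well.
    return re.findall(r"\w+", text.lower())


def overlap_score(query, doc):
    """Toy lexical score: size of the multiset token overlap (assumption)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values())


def multi_query_retrieve(queries, docs, k=2):
    """Take the top-k docs for each query variant and merge them without duplicates."""
    merged = []
    for q in queries:
        ranked = sorted(range(len(docs)),
                        key=lambda i: overlap_score(q, docs[i]),
                        reverse=True)
        for i in ranked[:k]:
            # Skip documents that share no token with this query variant.
            if overlap_score(q, docs[i]) > 0 and i not in merged:
                merged.append(i)
    return merged


def rerank(question, candidate_ids, docs):
    """Stand-in for a neural reranker: re-score candidates against the original question."""
    return sorted(candidate_ids,
                  key=lambda i: overlap_score(question, docs[i]),
                  reverse=True)


def answer_context(question, variants, docs, k=2):
    """Return the best snippet for answer generation, or None if nothing matches."""
    candidates = multi_query_retrieve([question] + variants, docs, k)
    ranked = rerank(question, candidates, docs)
    return docs[ranked[0]] if ranked else None
```

In a real system the query variants would come from a language model and the reranker would be a learned cross‑encoder; the sketch only shows how the stages fit together.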

Methodology

Data Harvesting – The system crawls active Telegram shopping channels, extracting product listings, FAQs, and user comments.
Persona Modeling – For each channel, a lightweight persona profile (e.g., “friendly boutique seller”, “tech‑gear specialist”) is inferred from channel metadata and language style.
Multi‑Agent Generation
- Question Agent: Uses a Persian‑fine‑tuned language model to formulate plausible buyer questions based on product attributes and persona cues.
- Validation Agent: Checks each question for relevance, grammaticality, and alignment with the persona, discarding low‑quality items.
- Refinement Agent: Rewrites or expands questions to increase diversity and realism.
Answer Synthesis – An answer agent retrieves relevant product info (multi‑query retrieval) and employs a reranker to pick the most appropriate snippet before generating a persona‑consistent response.
Evaluation – GPT‑5.1 rates each Q&A pair on six dimensions; scores are aggregated to compare the agentic pipeline against three baseline RAG models (BM25‑RAG, DPR‑RAG, and ColBERT‑RAG).
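
The generate, validate, refine loop described in the Multi‑Agent Generation step can be sketched as below. This is a minimal sketch under strong assumptions: each agent is a plain function with fixed rules, whereas the paper describes model‑driven agents. The templates, the validation rule, and the names Candidate, question_agent, validation_agent, refinement_agent, and run_pipeline are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    product: str
    persona: str


def question_agent(product, persona):
    """Hypothetical stand-in for a language-model call: fills fixed templates."""
    templates = ["What is the price of {p}?", "Is {p} in stock?", "{p}"]
    return [Candidate(t.format(p=product), product, persona) for t in templates]


def validation_agent(cand):
    """Toy check: the question must mention the product and end with '?'."""
    q = cand.question.strip()
    return cand.product.lower() in q.lower() and q.endswith("?")


def refinement_agent(cand):
    """Toy rewrite: make the question more conversational with a greeting."""
    q = cand.question
    return Candidate("Hi, " + q[0].lower() + q[1:], cand.product, cand.persona)


def run_pipeline(products, persona):
    """Generate candidates, discard the invalid ones, refine the rest."""
    accepted = []
    for product in products:
        for cand in question_agent(product, persona):
            if validation_agent(cand):
                accepted.append(refinement_agent(cand))
    return accepted
```

Keeping validation as a separate stage means low‑quality candidates are dropped before any refinement effort is spent on them, which matches the order of the agents in the description above.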

Results & Findings

Model	Avg. Quality Score (out of 10)	Channels where it leads
Agentic RAG (MegaChat pipeline)	8.2	4/5 (fashion, electronics, home‑goods, cosmetics)
BM25‑RAG	6.7	–
DPR‑RAG	7.0	–
ColBERT‑RAG	7.1	–

Relevance & Persona Consistency: The agentic system scored 0.9 points higher on average than the best baseline, thanks to persona‑aware generation and reranking.
Scalability: Generating the full dataset took ~12 hours on a single GPU node, compared to weeks of manual annotation for a dataset of comparable size.
Cost Efficiency: Estimated annotation cost savings exceed US $150 K for a dataset of this scale.
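
The summary does not say how the six per‑dimension ratings are combined into a single quality score or how the leading system per channel is chosen. The sketch below assumes the simplest option, an unweighted mean over the six dimensions, averaged again over all rated pairs. The function names and the input layout are hypothetical.

```python
from statistics import mean

# The six dimensions named in the evaluation description.
DIMENSIONS = [
    "relevance", "fluency", "factuality",
    "persona_consistency", "engagement", "commercial_suitability",
]


def pair_score(scores):
    """Unweighted mean over the six dimensions for one Q&A pair (assumption)."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return mean(scores[d] for d in DIMENSIONS)


def system_average(rated_pairs):
    """Average quality score of one system over all of its rated pairs."""
    return mean(pair_score(s) for s in rated_pairs)


def channel_leaders(results):
    """Map each channel to the system with the highest average score.

    `results` has the shape {channel: {system_name: [score_dict, ...]}}.
    """
    return {
        channel: max(systems, key=lambda name: system_average(systems[name]))
        for channel, systems in results.items()
    }
```

A weighted mean, for example one that emphasizes factuality, would slot into pair_score without changing the rest.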

Practical Implications

Rapid Bot Prototyping – SMEs can bootstrap a Persian sales chatbot by fine‑tuning on MegaChat, cutting development cycles from months to days.
Domain Adaptability – The agentic pipeline can be re‑targeted to other verticals (e.g., travel, finance) by swapping the source Telegram channels, making it a reusable data‑generation engine.
Low‑Resource Language Boost – Demonstrates that high‑quality conversational data need not rely on expensive human labeling, encouraging more AI products in Persian and similar languages.
Integration with Existing Platforms – The dataset aligns with Telegram’s Bot API, allowing developers to plug in a pre‑trained model and immediately benefit from persona‑aware responses.
Benchmark for Future Research – Provides a standardized Persian sales‑chat benchmark, facilitating fair comparison of retrieval‑augmented and generative models.
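
To make the Telegram integration point concrete, here is a minimal sketch of turning an incoming Bot API update into a sendMessage request. The endpoint and the chat_id and text fields are part of Telegram's Bot API. The generate_reply function is a hypothetical placeholder for a model fine‑tuned on MegaChat, and the request is only built, not sent.

```python
import json
from urllib import request

API_ROOT = "https://api.telegram.org"


def generate_reply(question, persona):
    """Hypothetical placeholder for a persona-aware model trained on MegaChat."""
    return f"[{persona}] Thanks for asking about: {question}"


def build_send_message(token, update, persona):
    """Build (but do not send) a sendMessage request answering a text update."""
    message = update["message"]
    payload = {
        "chat_id": message["chat"]["id"],
        "text": generate_reply(message["text"], persona),
    }
    return request.Request(
        f"{API_ROOT}/bot{token}/sendMessage",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would deliver the reply; a production bot would also need error handling and a real bot token.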

Limitations & Future Work

Synthetic Bias – Because the data are generated from existing channel content, any bias or misinformation present in those sources may propagate into the dataset.
Persona Granularity – Current personas are coarse‑grained; finer distinctions (e.g., regional dialects, brand voice) remain unexplored.
Evaluation Scope – Reliance on GPT‑5.1 for scoring, while practical, may not fully capture human user satisfaction; a user study is planned.
Extension to Multi‑Turn Dialogues – MegaChat focuses on single‑turn Q&A; future work will expand to multi‑turn conversational flows and dynamic context handling.

MegaChat marks a significant step toward democratizing conversational AI for Persian e‑commerce, offering developers a ready‑to‑use dataset and a blueprint for synthetic data generation in other low‑resource domains.

Authors

Mahdi Rahmani
AmirHossein Saffari
Reyhane Rahmani

Paper Information

arXiv ID: 2511.23397v1
Categories: cs.CL, cs.AI, cs.MA
Published: November 28, 2025