[Paper] ShareChat: A Dataset of Chatbot Conversations in the Wild

Published: December 19, 2025 at 12:47 PM EST
3 min read
Source: arXiv - 2512.17843v1

Overview

The ShareChat paper introduces a massive, cross‑platform dataset of real‑world chatbot conversations collected from five leading LLM chat services (ChatGPT, Claude, Gemini, Perplexity, and Grok). By preserving each platform’s native UI cues—such as reasoning traces, citation links, and code snippets—the dataset gives researchers and engineers a far richer view of how users actually interact with LLM‑powered assistants.

Key Contributions

  • Largest public multi‑platform LLM chat corpus: 142,808 conversations (≈ 660 k turns) spanning five major chat services.
  • Native interface affordances retained: reasoning steps, source URLs, code blocks, and other UI‑specific artifacts are kept intact.
  • Broad linguistic coverage: conversations in 101 languages, reflecting global usage.
  • Extended context windows & depth: many dialogs exceed the typical 2–4 k token limits of prior datasets, enabling research on long‑term memory and multi‑turn reasoning.
  • Three demonstrative analyses:
    1. Conversation completeness as a proxy for intent satisfaction
    2. Citation behavior of LLMs
    3. Temporal shifts in usage patterns from Apr 2023 to Oct 2025

Methodology

  1. Data collection – Publicly shared conversation links (e.g., links posted on forums, social media threads, and community archives) were scraped with platform‑specific crawlers. The authors filtered for genuine user‑assistant exchanges and removed duplicates.
  2. Normalization & annotation – Each turn was parsed into a structured JSON record preserving:
    • platform (ChatGPT, Claude, etc.)
    • turn_id, speaker (user/assistant)
    • content (raw markdown/text)
    • metadata (timestamp, language, UI elements like “thought” blocks, citation links, code fences)
  3. Quality control – A combination of automated heuristics (spam detection, language identification) and manual spot‑checks ensured that the dataset reflects authentic, high‑quality interactions.
  4. Analysis pipelines – The authors built lightweight scripts to compute conversation completeness (the share of dialogs that end in closure rather than further user follow‑up), extract citation URLs, and aggregate usage statistics over time; a minimal sketch of such a pipeline follows this list.
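
To make the per‑turn record and the completeness metric concrete, here is a minimal sketch in Python. The field names mirror the schema listed above, but the JSON Lines layout, the closing‑phrase heuristic, and the file name are illustrative assumptions rather than the authors' released tooling.

```python
import json
from collections import defaultdict

# Hypothetical closing phrases; the paper's actual completeness heuristic may differ.
CLOSING_PHRASES = ("thanks", "thank you", "that solves it")

def load_conversations(path):
    """Yield conversations from a JSON Lines file (one conversation per line).

    Each conversation is assumed to be a list of turn records shaped like
    the schema above, e.g.:
    {"platform": "ChatGPT", "turn_id": 3, "speaker": "user",
     "content": "...", "metadata": {"timestamp": "...", "language": "en",
                                    "citations": [], "code_fences": 0}}
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

def is_complete(conversation):
    """Treat a dialog as 'complete' if its final user turn contains a
    closing phrase rather than a follow-up question."""
    user_turns = [t for t in conversation if t["speaker"] == "user"]
    if not user_turns:
        return False
    last_content = user_turns[-1]["content"].lower()
    return any(phrase in last_content for phrase in CLOSING_PHRASES)

def completeness_by_platform(path):
    """Aggregate the share of complete conversations for each platform."""
    totals = defaultdict(lambda: [0, 0])  # platform -> [complete, total]
    for conv in load_conversations(path):
        totals[conv[0]["platform"]][0] += is_complete(conv)
        totals[conv[0]["platform"]][1] += 1
    return {p: done / total for p, (done, total) in totals.items()}

if __name__ == "__main__":
    # "sharechat.jsonl" is a placeholder file name, not the official release format.
    print(completeness_by_platform("sharechat.jsonl"))
```

The same record layout supports swapping in other heuristics, for example treating a trailing question mark in the last user turn as a signal of unmet intent.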

Results & Findings

  • Conversation completeness: ~68 % of dialogs end with a user‑expressed “thanks” or “that solves it,” indicating a high satisfaction rate; the remaining 32 % show follow‑up questions, suggesting unmet intent or ambiguous responses.
  • Citation behavior: Claude and Gemini include source links in ~45 % of factual answers, whereas ChatGPT and Perplexity cite less frequently (~20 %). Grok rarely provides citations (<5 %). One way such per‑platform rates can be computed is sketched after this list.
  • Temporal trends: From 2023‑2024 to 2025, code‑generation turns grew from 12 % to 27 % of total turns, reflecting a surge in developer‑centric usage. Multilingual conversations also rose sharply, with Hindi, Spanish, and Arabic each crossing the 5 % threshold in 2025.
  • Context length: Average conversation length reached 4.6 k tokens, with the longest exceeding 30 k tokens—far beyond the limits of most existing benchmark datasets.
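
As one example of the aggregation behind statistics like these, the sketch below estimates a per‑platform citation rate from records shaped like the hypothetical layout in the Methodology section; the regex fallback and field names are assumptions. Note that the paper reports rates over factual answers, which would require an additional classification step not shown here.

```python
import json
import re
from collections import defaultdict

URL_PATTERN = re.compile(r"https?://\S+")

def citation_rate_by_platform(path):
    """Fraction of assistant turns carrying at least one citation URL,
    grouped by platform. Assumes the hypothetical JSON Lines layout from
    the earlier sketch."""
    counts = defaultdict(lambda: [0, 0])  # platform -> [cited, assistant turns]
    with open(path, encoding="utf-8") as f:
        for line in f:
            for turn in json.loads(line):
                if turn["speaker"] != "assistant":
                    continue
                counts[turn["platform"]][1] += 1
                cited = bool(turn.get("metadata", {}).get("citations")) or \
                    bool(URL_PATTERN.search(turn["content"]))
                counts[turn["platform"]][0] += cited
    return {p: cited / total for p, (cited, total) in counts.items()}

if __name__ == "__main__":
    # Placeholder file name, as in the earlier sketch.
    print(citation_rate_by_platform("sharechat.jsonl"))
```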

Practical Implications

  • Prompt‑engineering research: The long context windows enable testing of memory‑management strategies, retrieval‑augmented generation, and chain‑of‑thought prompting at scale.
  • Tooling for developers: IDE plugins or code‑assistant products can be trained on the rich code‑artifact portion to improve language‑specific suggestions and error‑handling patterns.
  • Compliance & citation auditing: The citation metadata offers a ground‑truth benchmark for building systems that must attribute sources (e.g., legal, medical, academic assistants).
  • Multilingual product rollout: With 101 languages represented, product teams can evaluate localization gaps and prioritize language support based on real usage signals.
  • User‑experience design: Understanding which UI affordances (e.g., “thought” bubbles, inline citations) correlate with higher conversation completeness can guide the next generation of chat interfaces.

Limitations & Future Work

  • Public‑URL bias: The dataset only captures conversations that users chose to share publicly, potentially over‑representing “interesting” or “successful” interactions and under‑representing routine or failed attempts.
  • Platform coverage: While five major services are included, emerging or niche chatbots (e.g., domain‑specific assistants) are absent, limiting generalizability to the broader ecosystem.
  • Temporal cutoff: Data stops at Oct 2025; rapid model updates after that point may shift citation or code‑generation behaviors.
  • Future directions suggested by the authors:
    1. Augment the corpus with opt‑in private logs to reduce sharing bias
    2. Expand to newer platforms and multimodal (image/video) interactions
    3. Develop benchmark tasks (e.g., citation verification, long‑context reasoning) that directly leverage ShareChat’s unique affordances

Authors

  • Yueru Yan
  • Tuc Nguyen
  • Bo Su
  • Melissa Lieffers
  • Thai Le

Paper Information

  • arXiv ID: 2512.17843v1
  • Categories: cs.CL, cs.AI, cs.HC
  • Published: December 19, 2025