[Paper] Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay

Published: May 27, 2026 at 01:42 PM EDT
4 min read
Source: arXiv

Overview

A new benchmark called MalayPrag tests whether today’s large language models (LLMs) can understand and generate discourse particles: short words, comparable to English “well” or “kind of”, that carry subtle emotional and interpersonal cues. Focusing on colloquial Malay, the study finds that even state‑of‑the‑art models struggle to map these particles to their intended pragmatic functions, which points to a blind spot in multilingual LLM development.

Key Contributions

  • MalayPrag benchmark: a curated dataset for evaluating LLM handling of discourse particles in informal Malay.
  • Five‑attribute framework: a linguistically grounded taxonomy (e.g., Attitude, Politeness, Emphasis, Uncertainty, Interactional stance) that captures the pragmatic roles of particles.
  • Comprehensive evaluation: ten off‑the‑shelf LLMs (including GPT‑4, LLaMA, and other open‑source models) are tested on three prediction tasks (function classification, particle generation, and context‑aware usage).
  • Empirical insight: structured attribute prompts dramatically improve model performance, demonstrating the value of explicit pragmatic scaffolding.
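
To make the five‑attribute framework concrete, here is a minimal sketch of how one annotated instance might be represented. The attribute names come from the examples listed above; the class names, field names, and the sample sentence are assumptions made for illustration and are not taken from the paper or the MalayPrag data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Attribute(Enum):
    # Names follow the examples in this summary; the paper's exact
    # labels and definitions may differ.
    ATTITUDE = "attitude"
    POLITENESS = "politeness"
    EMPHASIS = "emphasis"
    UNCERTAINTY = "uncertainty"
    INTERACTIONAL_STANCE = "interactional stance"


@dataclass(frozen=True)
class AnnotatedParticle:
    """Hypothetical record for one annotated particle occurrence."""
    sentence: str                      # full utterance containing the particle
    particle: str                      # the discourse particle being annotated
    attributes: FrozenSet[Attribute]   # one or more pragmatic attributes

    def __post_init__(self):
        # Whitespace tokenization is a simplification for this sketch.
        if self.particle not in self.sentence.split():
            raise ValueError("particle must occur as a token in the sentence")
        if not self.attributes:
            raise ValueError("at least one attribute is required")


# Invented example; the label is illustrative, not a dataset annotation.
example = AnnotatedParticle(
    sentence="Jom makan lah",
    particle="lah",
    attributes=frozenset({Attribute.EMPHASIS}),
)
```

Storing the attributes as a set reflects the "attribute(s)" wording of the classification task, where a particle may carry more than one function.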

Methodology

  1. Data collection – Native Malay speakers annotated a large corpus of social‑media posts and chat logs, marking each discourse particle and its pragmatic function according to the five‑attribute schema.
  2. Task design
    • Function Classification: Given a sentence and a highlighted particle, the model predicts the correct attribute(s).
    • Particle Generation: Given a context and a target attribute, the model must produce an appropriate particle.
    • Context‑aware Usage: The model selects the most suitable particle from a shortlist for a supplied dialogue turn.
  3. Prompt engineering – Experiments compare a vanilla prompt (plain question) against a structured prompt that explicitly lists the five attributes and provides examples.
  4. Evaluation – Accuracy, F1, and human‑rated naturalness are reported for each model and task.
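
The difference between the two prompt conditions in step 3 can be sketched as follows. This is an illustration only: the prompt wording, the function names, and the sample sentences are assumptions, since the summary does not reproduce the paper's actual prompts.

```python
# Attribute names as listed in this summary; the paper's wording may differ.
ATTRIBUTES = [
    "Attitude",
    "Politeness",
    "Emphasis",
    "Uncertainty",
    "Interactional stance",
]


def vanilla_prompt(sentence: str, particle: str) -> str:
    """Plain question with no taxonomy and no examples."""
    return (
        f'Sentence: "{sentence}"\n'
        f'What pragmatic function does the particle "{particle}" serve?'
    )


def structured_prompt(sentence: str, particle: str, examples) -> str:
    """List the five attributes and a few labeled examples before the query."""
    lines = ["Label the particle with one or more of these attributes:"]
    lines += [f"- {name}" for name in ATTRIBUTES]
    lines += ["", "Examples:"]
    for ex_sentence, ex_particle, ex_labels in examples:
        labels = ", ".join(ex_labels)
        lines.append(
            f'Sentence: "{ex_sentence}" | Particle: "{ex_particle}" '
            f"| Attributes: {labels}"
        )
    lines += ["", f'Sentence: "{sentence}" | Particle: "{particle}" | Attributes:']
    return "\n".join(lines)


# Hypothetical few-shot example; not taken from MalayPrag.
few_shot = [("Jom makan lah", "lah", ["Emphasis"])]
prompt = structured_prompt("Dia dah sampai kot", "kot", few_shot)
```

The structured variant ends with an open "Attributes:" slot, so the model's completion can be parsed as a comma‑separated label list.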

Results & Findings

| Model | Function‑Classification Acc. (vanilla) | Function‑Classification Acc. (structured) |
| --- | --- | --- |
| GPT‑4 | 58% | 73% |
| LLaMA‑2‑13B | 42% | 61% |
| Open‑source 7B | 35% | 54% |

  • All models show a large gap between English‑centric benchmarks and Malay particle handling.
  • Providing the five‑attribute scaffold improves performance by 15–20 percentage points on average, confirming that LLMs benefit from explicit pragmatic cues.
  • Human judges rate sentences generated with structured prompts as significantly more natural than those from vanilla prompts (average Likert score 4.2/5 vs. 3.1/5).
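
As a quick check on the reported improvement, the per‑model gains can be computed directly from the function‑classification figures above. Only the three models shown in this summary are included, so the mean here is not the paper's average over all ten models.

```python
# (vanilla, structured) function-classification accuracy in percent,
# copied from the results reported in this summary.
results = {
    "GPT-4": (58, 73),
    "LLaMA-2-13B": (42, 61),
    "Open-source 7B": (35, 54),
}

# Gain in percentage points for each model.
gains = {model: structured - vanilla
         for model, (vanilla, structured) in results.items()}

mean_gain = sum(gains.values()) / len(gains)

for model, gain in gains.items():
    print(f"{model}: +{gain} points")
print(f"Mean gain: {mean_gain:.1f} points")
```

The gains are 15, 19, and 19 points, for a mean of about 17.7, which falls within the stated 15–20 point range.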

Practical Implications

  • Chatbot localization – Deploying conversational agents in Southeast Asia will require fine‑tuning or prompt‑level scaffolding to handle particles that convey politeness, hesitation, or camaraderie.
  • Sentiment & intent analysis – Discourse particles often flip the tone of a message; ignoring them can lead to misclassification in moderation tools or market‑research pipelines.
  • Prompt design guidelines – The five‑attribute framework can be reused for other low‑resource languages, offering a recipe for developers to inject pragmatic knowledge without massive retraining.
  • LLM evaluation pipelines – Adding a pragmatic‑particle suite like MalayPrag to existing benchmark suites (e.g., MMLU, HELM) gives a more holistic view of a model’s “human‑like” communication abilities.
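
For teams wiring a particle suite into an existing evaluation harness, scoring the function‑classification task might look like the sketch below. It assumes multi‑label predictions represented as sets of attribute names and uses exact‑match accuracy and micro‑averaged F1; the summary does not state which accuracy or F1 variant the paper reports, so treat both choices as assumptions.

```python
from typing import List, Set


def exact_match_accuracy(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of items whose predicted label set equals the gold set."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over all (item, label) decisions."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # gold but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented labels for illustration only.
gold = [{"Emphasis"}, {"Politeness", "Uncertainty"}]
pred = [{"Emphasis"}, {"Politeness"}]

acc = exact_match_accuracy(gold, pred)  # 0.5: only the first item matches exactly
f1 = micro_f1(gold, pred)               # 0.8: tp=2, fp=0, fn=1
```

Exact match is strict for multi‑label items, so reporting it alongside F1 shows whether a model is partly right (one attribute missed) or wholly wrong.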

Limitations & Future Work

  • Scope limited to Malay – While the attribute taxonomy is linguistically motivated, its coverage for other Austronesian or tonal languages remains untested.
  • Dataset size – MalayPrag contains ~8 k annotated instances; larger, more diverse corpora could expose additional edge cases.
  • Model adaptation – The study only evaluates zero‑shot prompting; fine‑tuning on particle‑rich data may yield further gains and is an open research direction.
  • Human evaluation depth – Current human ratings focus on naturalness; future work should assess downstream task impact (e.g., dialogue success rates).

Bottom line: Discourse particles are a subtle yet vital piece of human‑like dialogue. This work shows that even the most powerful LLMs need explicit pragmatic scaffolding to master them, an insight that developers building multilingual conversational systems should keep in mind.

Authors

  • Mariah Al Giptiah Binte Yusoff
  • Jakin Tan
  • Bocheng Chen
  • Guangliang Liu
  • Xi Chen

Paper Information

  • arXiv ID: 2605.28782v1
  • Categories: cs.CL
  • Published: May 27, 2026