[Paper] Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay

Published: May 27, 2026 at 01:42 PM EDT

Source: arXiv - 2605.28782v1

Overview

MalayPrag is a new benchmark that tests whether today’s large language models (LLMs) can understand and generate discourse particles: short words, comparable in function to English “well” or “kind of”, that carry subtle emotional and interpersonal cues. Focusing on colloquial Malay, the study finds that even state‑of‑the‑art models struggle to map these particles to their intended pragmatic functions, which points to a blind spot in multilingual LLM development.

Key Contributions

MalayPrag benchmark: a curated dataset for evaluating LLM handling of discourse particles in informal Malay.
Five‑attribute framework: a linguistically grounded taxonomy (e.g., Attitude, Politeness, Emphasis, Uncertainty, Interactional stance) that captures the pragmatic roles of particles.
Comprehensive evaluation: ten off‑the‑shelf LLMs (including GPT‑4, LLaMA, and open‑source models) are tested on three prediction tasks (function classification, particle generation, and context‑aware usage).
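Empirical insight: structured attribute prompts improve model performance substantially (15 to 19 percentage points in function‑classification accuracy for the models tabulated under Results & Findings), demonstrating the value of explicit pragmatic scaffolding.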
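
To make the taxonomy concrete, one annotated particle occurrence could be represented roughly as follows. This is a minimal sketch: the attribute names are the ones listed above, but the record layout, the field names, and the sample sentence with its label are hypothetical and are not taken from the MalayPrag release.

```python
from dataclasses import dataclass
from enum import Enum


class Attribute(Enum):
    # Names as given in this summary; the paper's exact labels may differ.
    ATTITUDE = "Attitude"
    POLITENESS = "Politeness"
    EMPHASIS = "Emphasis"
    UNCERTAINTY = "Uncertainty"
    INTERACTIONAL_STANCE = "Interactional stance"


@dataclass(frozen=True)
class ParticleInstance:
    """Hypothetical record for one annotated particle occurrence."""
    sentence: str
    particle: str
    attributes: frozenset  # one or more Attribute values

    def __post_init__(self):
        # Basic sanity checks on the annotation.
        if self.particle not in self.sentence.split():
            raise ValueError("particle must appear as a token in the sentence")
        if not self.attributes:
            raise ValueError("at least one attribute is required")


# Illustrative only: this sentence and its label are assumptions, not dataset content.
example = ParticleInstance(
    sentence="Jom lah pergi makan",
    particle="lah",
    attributes=frozenset({Attribute.EMPHASIS}),
)
```

Storing the attributes as a set rather than a single value reflects the fact that a particle may be annotated with more than one function.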

Methodology

Data collection – Native Malay speakers annotated a large corpus of social‑media posts and chat logs, marking each discourse particle and its pragmatic function according to the five‑attribute schema.
Task design –
- Function Classification: Given a sentence and a highlighted particle, the model predicts the correct attribute(s).
- Particle Generation: Given a context and a target attribute, the model must produce an appropriate particle.
- Context‑aware Usage: The model selects the most suitable particle from a shortlist for a supplied dialogue turn.
Prompt engineering – Experiments compare a vanilla prompt (plain question) against a structured prompt that explicitly lists the five attributes and provides examples.
Evaluation – Accuracy, F1, and human‑rated naturalness are reported for each model and task.
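
The contrast between the two prompt styles, and the scoring of the function-classification task, can be sketched as below. Everything here is an assumption made for illustration: the prompt wording, the function names, the single-label simplification (the task itself allows several attributes per particle), and the choice of macro-averaged F1, since this summary does not say which F1 variant the paper reports.

```python
ATTRIBUTES = ["Attitude", "Politeness", "Emphasis", "Uncertainty", "Interactional stance"]


def vanilla_prompt(sentence, particle):
    """Plain question with no pragmatic scaffolding (hypothetical wording)."""
    return (
        f'Sentence: "{sentence}"\n'
        f'What is the function of the particle "{particle}" in this sentence?'
    )


def structured_prompt(sentence, particle, examples=()):
    """List the candidate attributes and optional worked examples (hypothetical wording).

    `examples` is an iterable of (sentence, particle, label) triples.
    """
    lines = ["Choose the pragmatic attribute(s) of the highlighted particle.", "Attributes:"]
    lines += [f"- {name}" for name in ATTRIBUTES]
    for ex_sentence, ex_particle, ex_label in examples:
        lines.append(f'Example: "{ex_sentence}" | particle "{ex_particle}" -> {ex_label}')
    lines.append(f'Sentence: "{sentence}" | particle "{particle}" ->')
    return "\n".join(lines)


def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1; a label with no support and no predictions scores 0."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for the example.
gold = ["Emphasis", "Politeness", "Emphasis", "Uncertainty"]
pred = ["Emphasis", "Emphasis", "Emphasis", "Uncertainty"]
print(accuracy(gold, pred))  # 0.75
print(macro_f1(gold, pred, ["Emphasis", "Politeness", "Uncertainty"]))
```

On the toy labels, macro F1 comes out lower than accuracy because the missed Politeness item counts as a full per-label failure, which is why reporting both metrics is informative when some attributes are rare.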

Results & Findings

Model	Function‑Classification Acc. (vanilla)	Function‑Classification Acc. (structured)
GPT‑4	58%	73%
LLaMA‑2‑13B	42%	61%
Open‑source 7B	35%	54%

All models perform markedly worse on Malay particle handling than on English‑centric benchmarks.
Providing the five‑attribute scaffold improves performance by 15–20 percentage points on average (in the table above, function‑classification accuracy rises by 15, 19, and 19 points), confirming that LLMs benefit from explicit pragmatic cues.
Human judges rate sentences generated with structured prompts as significantly more natural than those generated with vanilla prompts (average Likert score 4.2/5 vs. 3.1/5).
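
As a quick check of the arithmetic, the per-model gains can be recomputed from the three rows of the table above. The dictionary below just transcribes those rows; the variable names are illustrative, and the figures cover only the three tabulated models, not all ten evaluated in the paper.

```python
# (vanilla accuracy %, structured accuracy %) as reported in the table
results = {
    "GPT-4": (58, 73),
    "LLaMA-2-13B": (42, 61),
    "Open-source 7B": (35, 54),
}

# Gain in percentage points when moving from the vanilla to the structured prompt.
gains = {model: structured - vanilla for model, (vanilla, structured) in results.items()}
mean_gain = sum(gains.values()) / len(gains)

print(gains)                # {'GPT-4': 15, 'LLaMA-2-13B': 19, 'Open-source 7B': 19}
print(round(mean_gain, 1))  # 17.7
```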

Practical Implications

Chatbot localization – Deploying conversational agents in Southeast Asia will require fine‑tuning or prompt‑level scaffolding to handle particles that convey politeness, hesitation, or camaraderie.
Sentiment & intent analysis – Discourse particles often flip the tone of a message; ignoring them can lead to misclassification in moderation tools or market‑research pipelines.
Prompt design guidelines – The five‑attribute framework can be reused for other low‑resource languages, offering a recipe for developers to inject pragmatic knowledge without massive retraining.
LLM evaluation pipelines – Adding a pragmatic‑particle suite like MalayPrag to existing benchmark suites (e.g., MMLU, HELM) gives a more holistic view of a model’s “human‑like” communication abilities.

Limitations & Future Work

Scope limited to Malay – While the attribute taxonomy is linguistically motivated, its coverage for other Austronesian or tonal languages remains untested.
Dataset size – MalayPrag contains about 8,000 annotated instances; larger, more diverse corpora could expose additional edge cases.
Model adaptation – The study only evaluates zero‑shot prompting; fine‑tuning on particle‑rich data may yield further gains and is an open research direction.
Human evaluation depth – Current human ratings focus on naturalness; future work should assess downstream task impact (e.g., dialogue success rates).

Bottom line: Discourse particles are a subtle yet vital piece of human‑like dialogue. This work shows that even the most powerful LLMs need explicit pragmatic scaffolding to master them, an insight that developers building multilingual conversational systems should keep in mind.

Authors

Mariah Al Giptiah Binte Yusoff
Jakin Tan
Bocheng Chen
Guangliang Liu
Xi Chen

Paper Information

arXiv ID: 2605.28782v1
Categories: cs.CL
Published: May 27, 2026
