[Paper] Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models

Published: December 4, 2025
3 min read
Source: arXiv - 2512.04673v1

Overview

This paper delivers the first large‑scale, side‑by‑side comparison of general‑purpose and code‑specific large language models (LLMs). By testing eight state‑of‑the‑art models on six benchmarks that span natural‑language understanding, mathematical reasoning, and trustworthiness—and by drilling into code‑explanation performance on the CoNaLa dataset—the authors show that models tuned for programming can surprisingly excel on non‑coding tasks as well.

Key Contributions

  • Unified cross‑task benchmark covering linguistic competence, math reasoning, and trustworthiness for both general and code‑focused LLMs.
  • Empirical evaluation of eight top models (five general‑purpose, three code‑specific) on six diverse test suites plus a dedicated code‑explanation benchmark (CoNaLa).
  • Insightful analysis revealing that code‑optimized models (e.g., CodeLLaMA variants) often outperform or match general‑purpose models on reasoning and syntactic precision tasks.
  • Open‑source evaluation framework and reproducible scripts that the community can extend to new models or tasks.
  • Practical recommendations for selecting LLMs based on the mix of natural‑language and code‑related workloads in real‑world pipelines.

Methodology

  1. Model selection – Five widely used general‑purpose LLMs (e.g., Mistral‑7B, Llama‑3‑8B) and three code‑centric LLMs (CodeLLaMA‑7B, CodeLLaMA‑13B, StarCoder) were chosen based on public availability and popularity.
  2. Benchmark suite – Six tasks were assembled:
    • Linguistic: SuperGLUE‑style QA and entailment.
    • Mathematical: GSM‑8K and MATH reasoning problems.
    • Trustworthiness: TruthfulQA and toxicity detection.
    • Code explanation: CoNaLa (natural‑language description of given code snippets).
  3. Prompt design – Uniform zero‑shot prompts were crafted to avoid bias toward any model’s fine‑tuning style. For code‑explanation, prompts asked the model to “Explain what the following Python snippet does.”
  4. Evaluation metrics – Accuracy/F1 for classification, exact match for reasoning, BLEU/ROUGE for code explanations, and calibrated confidence scores for trustworthiness.
  5. Statistical analysis – Paired bootstrap tests determined whether observed differences were significant at p < 0.05 (a minimal sketch of this test follows the list).
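
For context on step 5, the sketch below shows one common formulation of a paired bootstrap significance test: both models are scored on the same examples, the examples are resampled with replacement, and the p‑value is the fraction of resamples in which the second model fails to outperform the first. This is a minimal illustration, not the paper's evaluation code; the function name, the toy scores, and the 10,000‑resample default are assumptions.

```python
# Hypothetical sketch of a paired bootstrap significance test over per-example
# metric scores (accuracy, BLEU, etc.). Names and numbers are illustrative.
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: how often does model B fail to beat model A?"""
    assert len(scores_a) == len(scores_b), "scores must be paired per example"
    rng = random.Random(seed)
    n = len(scores_a)
    observed_gap = sum(scores_b) / n - sum(scores_a) / n
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample example indices with replacement
        gap = sum(scores_b[i] for i in idx) / n - sum(scores_a[i] for i in idx) / n
        if gap <= 0:  # model B does not outperform model A on this resample
            not_better += 1
    p_value = not_better / n_resamples
    return observed_gap, p_value


# Toy per-question correctness scores (1 = solved), purely illustrative.
general_scores = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
code_scores = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]
gap, p = paired_bootstrap(general_scores, code_scores)
print(f"mean gap = {gap:.3f}, bootstrap p = {p:.3f}")  # treat p < 0.05 as significant
```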

Results & Findings

| Task | Best General‑Purpose Model | Best Code‑Specific Model | Notable Gap |
| --- | --- | --- | --- |
| Linguistic QA | Llama‑3‑8B (78.4% Acc) | CodeLLaMA‑13B (77.1% Acc) | < 2% difference |
| Math Reasoning (GSM‑8K) | Mistral‑7B (62.3%) | CodeLLaMA‑13B (66.5%) | Code model +4.2% |
| Trustworthiness (TruthfulQA) | Llama‑3‑8B (71.0%) | CodeLLaMA‑7B (70.2%) | Near parity |
| Code Explanation (CoNaLa) | All general models < 25 BLEU | CodeLLaMA‑13B (31.4 BLEU) | Code model > 6 BLEU ahead |

  • Code‑specific LLMs consistently beat or match general models on reasoning tasks, suggesting that the syntactic discipline learned from code data transfers to better logical structuring.
  • Even on pure language benchmarks, the performance gap is minimal, indicating that code‑focused pre‑training does not sacrifice linguistic ability.
  • Trustworthiness scores are comparable, implying that code‑centric training does not degrade model alignment or safety characteristics.

Practical Implications

  • Unified model stacks: Teams can consider a single code‑optimized LLM (e.g., CodeLLaMA‑13B) for both code generation and downstream NLP tasks, simplifying deployment and reducing maintenance overhead.
  • Improved reasoning in IDE assistants: Embedding a code‑specific LLM into developer tools can yield more accurate code explanations, inline documentation, and even assistance with non‑code queries (e.g., “What algorithm does this snippet implement?”); a minimal prompting sketch follows this list.
  • Cost‑effective scaling: Since code‑specific models achieve comparable NLP performance at similar parameter counts, organizations can opt for the cheaper, open‑source variants without sacrificing versatility.
  • Safety pipelines: The comparable trustworthiness scores mean existing moderation and fact‑checking layers can be reused unchanged when swapping in a code‑focused model.
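
To make the IDE‑assistant point above concrete, here is a minimal sketch of zero‑shot code explanation with a code‑specific LLM served through Hugging Face transformers, using the paper's prompt wording. The checkpoint name, generation settings, and the explain_snippet helper are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: zero-shot code explanation with a code-specific LLM.
# Assumes the Hugging Face `transformers` library is installed and that the
# machine has enough memory for the chosen checkpoint (an assumed model ID).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="codellama/CodeLlama-7b-Instruct-hf",  # assumed checkpoint; swap as needed
)


def explain_snippet(code: str) -> str:
    """Ask the model to describe a Python snippet, mirroring the paper's prompt."""
    prompt = f"Explain what the following Python snippet does.\n\n{code}\n\nExplanation:"
    output = generator(prompt, max_new_tokens=128, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return output[0]["generated_text"][len(prompt):].strip()


print(explain_snippet("squares = [x * x for x in range(10)]"))
```

In a real assistant this call would sit behind the editor's explain or hover action, with the snippet pulled from the current selection.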

Limitations & Future Work

  • Benchmark breadth: While six tasks provide a solid cross‑section, domains like legal reasoning, multilingual understanding, or long‑form generation remain untested.
  • Zero‑shot focus: The study deliberately avoided few‑shot prompting; future work should explore how code‑specific models adapt when given task‑specific examples.
  • Model size ceiling: All evaluated models are ≤ 13 B parameters; scaling trends for larger code‑oriented LLMs (e.g., 70 B) are still unknown.
  • Dataset bias: CoNaLa contains mostly Python snippets; extending to other languages (JavaScript, Rust) could reveal different strengths.

Bottom line: For tools that sit at the intersection of code and natural language (AI pair programmers, documentation generators, mixed‑modal chatbots), this study shows that code‑specialized LLMs are a versatile, high‑performing alternative to generic models, with no noticeable trade‑off in linguistic or safety performance.

Authors

  • Gunjan Das
  • Paheli Bhattacharya
  • Rishabh Gupta

Paper Information

  • arXiv ID: 2512.04673v1
  • Categories: cs.SE
  • Published: December 4, 2025