[Paper] C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
Source: arXiv - 2512.21332v1
Overview
The C2LLM technical report introduces a new family of code‑embedding models—C2LLM‑0.5B and C2LLM‑7B—that dramatically improve how developers retrieve and reason about code snippets. By marrying a large‑language‑model (LLM) backbone with a novel adaptive pooling layer, the authors break the long‑standing “EOS bottleneck” that has limited the quality of code embeddings derived from causal LLMs.
Key Contributions
- Contrastive Code LLMs (C2LLM): Two scalable models (0.5 B and 7 B parameters) built on the Qwen‑2.5‑Coder architecture, specifically tuned for code representation learning.
- Pooling by Multi‑head Attention (PMA): A lightweight attention‑based pooling module that aggregates information from all tokens, preserving the rich causal context learned during pre‑training.
- Dimension‑flexible embeddings: PMA can output embeddings of arbitrary size, offering a drop‑in replacement for Matryoshka Representation Learning (MRL) techniques.
- Large‑scale contrastive training: Trained on ~3 M publicly available code‑related pairs, leveraging contrastive loss to align semantically similar snippets.
- State‑of‑the‑art performance: Sets new records on the MTEB‑Code benchmark, with C2LLM‑7B topping the leaderboard among models of comparable size.
Methodology
- Backbone Selection – The authors start from Qwen‑2.5‑Coder, a causal decoder‑only LLM pre‑trained on massive code corpora. This backbone already captures strong token‑level semantics.
- Adaptive Pooling (PMA) – Instead of using the final EOS token's hidden state as the sequence summary (the classic approach for causal LLMs), they insert a Pooling by Multi‑head Attention layer on top of the token embeddings (a minimal sketch follows this list).
  - A small set of learnable query vectors cross‑attends over the token hidden states, which serve as keys and values.
  - The attention heads weigh the entire sequence, producing a weighted sum that reflects the most informative parts of the code.
  - The resulting pooled vector can be projected to any desired dimensionality (e.g., 256‑dim, 768‑dim).
- Contrastive Learning – Training data consist of positive pairs (e.g., a function and its docstring), with negatives drawn from the other examples in the same batch. A contrastive loss pulls embeddings of positive pairs together while pushing unrelated pairs apart (also sketched below).
- Scalable Training – Both model sizes are trained on a distributed GPU cluster, using mixed precision and gradient checkpointing to keep memory footprints manageable (see the training‑step sketch below).
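This summary contains no reference code, so the following is a minimal, hypothetical PyTorch sketch of the PMA idea: a learnable "seed" query cross‑attends over the backbone's token hidden states, and a linear projection sets the output dimensionality. The class name `PMAPooler` and all hyperparameter values are illustrative, not taken from the paper.

```python
# Hypothetical sketch of Pooling by Multi-head Attention (PMA).
# Assumption: learnable "seed" queries cross-attend over the token hidden
# states produced by the causal LLM backbone; all names are illustrative.
import torch
import torch.nn as nn


class PMAPooler(nn.Module):
    def __init__(self, hidden_dim: int, num_heads: int = 8,
                 num_seeds: int = 1, out_dim: int = 768):
        super().__init__()
        # Learnable query seeds; each seed yields one pooled vector.
        self.seeds = nn.Parameter(torch.randn(num_seeds, hidden_dim) * 0.02)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                          batch_first=True)
        # The projection lets the embedding take any target dimensionality.
        self.proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, hidden_states, padding_mask=None):
        # hidden_states: (batch, seq_len, hidden_dim) from the LLM backbone.
        batch = hidden_states.size(0)
        queries = self.seeds.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.attn(queries, hidden_states, hidden_states,
                              key_padding_mask=padding_mask)
        # Average the seed outputs (a no-op when num_seeds == 1) and project.
        return self.proj(pooled.mean(dim=1))


# Dummy tensors standing in for backbone outputs:
states = torch.randn(2, 128, 1024)               # (batch, seq_len, hidden_dim)
embeddings = PMAPooler(hidden_dim=1024)(states)  # torch.Size([2, 768])
```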
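The in‑batch contrastive objective described above is most commonly an InfoNCE‑style loss over cosine similarities; the sketch below assumes that formulation (the temperature value and function name are illustrative, not reported in the paper).

```python
# Generic in-batch contrastive (InfoNCE-style) loss sketch.
# Assumption: row i of `query_emb` (e.g., a docstring embedding) is the
# positive for row i of `code_emb`; the other rows act as negatives.
import torch
import torch.nn.functional as F


def info_nce_loss(query_emb, code_emb, temperature: float = 0.05):
    query_emb = F.normalize(query_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    # (batch, batch) cosine-similarity matrix; the diagonal holds positives.
    logits = query_emb @ code_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


# Random embeddings standing in for PMA outputs:
loss = info_nce_loss(torch.randn(16, 768), torch.randn(16, 768))
```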
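Mixed precision and gradient checkpointing are standard PyTorch features; the fragment below is a generic sketch of how a training step might combine them. The backbone, pooler, and loss are placeholders, not the authors' training code.

```python
# Generic mixed-precision training step (placeholders, not the paper's code).
import torch

scaler = torch.cuda.amp.GradScaler()
# For a Hugging Face-style backbone, activation checkpointing is enabled with
# backbone.gradient_checkpointing_enable() before training starts.


def train_step(backbone, pooler, loss_fn, optimizer, batch):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        # Assumes a Hugging Face-style output exposing .last_hidden_state.
        q_states = backbone(**batch["query_inputs"]).last_hidden_state
        c_states = backbone(**batch["code_inputs"]).last_hidden_state
        loss = loss_fn(pooler(q_states), pooler(c_states))
    scaler.scale(loss).backward()   # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```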
Results & Findings
| Model | Params | Avg. MTEB‑Code Score | Rank (size‑class) |
|---|---|---|---|
| C2LLM‑0.5B | 0.5 B | 71.3 | 1st among ≤1 B models |
| C2LLM‑7B | 7 B | 78.9 | 1st overall (≤7 B) |
- Breaking the EOS bottleneck: PMA improves retrieval accuracy by 4–6 pts over EOS‑only pooling across all benchmark tasks (code search, clone detection, semantic similarity).
- Dimension flexibility: Experiments show negligible performance loss when projecting embeddings down to 256 D, enabling faster index look‑ups without sacrificing quality.
- Training efficiency: Despite the contrastive setup, convergence is reached in ~2 M steps, roughly half the steps required by prior code‑embedding baselines.
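On the dimension‑flexibility point, the practical contrast with Matryoshka‑style (MRL) embeddings is where the smaller vector comes from: MRL truncates a full‑size embedding and renormalizes it, whereas PMA can emit the target size directly through its output projection. A tiny illustration with made‑up vectors (`PMAPooler` refers to the hypothetical sketch in the Methodology section):

```python
# Illustrative only: two routes to a 256-D code embedding.
import torch
import torch.nn.functional as F

full = F.normalize(torch.randn(1, 768), dim=-1)  # stand-in 768-D embedding

# MRL-style: keep the first 256 dimensions, then renormalize.
mrl_256 = F.normalize(full[:, :256], dim=-1)

# PMA-style: configure the pooler to project straight to 256 dimensions
# (see the hypothetical PMAPooler sketch above).
# pooler = PMAPooler(hidden_dim=1024, out_dim=256)
```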
Practical Implications
- Better code search engines – Integrating C2LLM embeddings can boost the relevance of search results in IDE plugins, internal codebases, or open‑source platforms like GitHub.
- Improved duplicate detection – Companies can more reliably spot copy‑pasted or near‑duplicate functions, aiding refactoring and license compliance.
- Language‑agnostic tooling – Because the model is trained on code from many programming languages, it works out of the box for Python, JavaScript, Java, Go, and many others, reducing the need for language‑specific pipelines.
- Lightweight deployment – The 0.5 B variant runs comfortably on a single high‑end GPU or even on modern CPUs with quantization, making it feasible for on‑premise security‑sensitive environments.
- Plug‑and‑play embeddings – The flexible output dimension means developers can directly replace existing embedding services (e.g., OpenAI embeddings, CodeBERT) without redesigning downstream indexing structures.
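As a concrete but hypothetical illustration of the plug‑and‑play point: swapping embedding providers mostly means replacing the function that produces vectors, while the surrounding index logic stays unchanged. In the sketch below, `embed_code` is a placeholder for a real C2LLM (or any other) embedding call and returns random unit vectors so the example runs end to end.

```python
# Toy brute-force code search over unit-normalized embeddings.
import numpy as np


def embed_code(texts, dim=256):
    # Placeholder for a real embedding call (e.g., a served C2LLM model).
    vecs = np.random.default_rng(0).normal(size=(len(texts), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)


corpus = ["def add(a, b): return a + b",
          "def read_json(path): ...",
          "class LRUCache: ..."]
index = embed_code(corpus)                      # (num_snippets, dim)

query_vec = embed_code(["sum two numbers"])[0]  # (dim,)
scores = index @ query_vec                      # cosine similarity (unit vectors)
print(corpus[int(np.argmax(scores))])
```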
Limitations & Future Work
- Training data scope – The 3 M pair dataset, while large, still under‑represents niche languages and domain‑specific APIs; performance may degrade on highly specialized code.
- Causal‑only backbone – Although PMA mitigates the EOS issue, the underlying decoder‑only architecture may still miss bidirectional context that encoder‑only models capture.
- Evaluation breadth – Benchmarks focus on retrieval‑style tasks; downstream impacts on tasks like automated code generation or bug fixing remain unexplored.
- Future directions suggested by the authors include expanding the contrastive corpus, experimenting with encoder‑decoder hybrids, and fine‑tuning the PMA layer for task‑specific objectives (e.g., security‑oriented code similarity).
Bottom line: C2LLM demonstrates that a modest architectural tweak—adaptive cross‑attention pooling—can unlock the full potential of causal code LLMs for embedding‑centric workflows. For developers building search, recommendation, or analysis tools, adopting C2LLM could translate into noticeably sharper, faster, and more flexible code‑understanding capabilities.
Authors
- Jin Qin
- Zihan Liao
- Ziyin Zhang
- Hang Yu
- Peng Di
- Rui Wang
Paper Information
- arXiv ID: 2512.21332v1
- Categories: cs.CL, cs.AI
- Published: December 24, 2025