[Paper] VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

Published: (May 8, 2026 at 01:54 PM EDT)
Source: arXiv - 2605.08070v1

Overview

The paper introduces VecCISC, an efficiency‑focused extension of Confidence‑Informed Self‑Consistency (CISC), a technique in which a large language model (LLM) samples several candidate answers with reasoning traces and a secondary “critic” model scores them before a weighted vote. By pruning redundant or low‑quality reasoning traces before they reach the critic, VecCISC cuts token usage by almost half while matching or slightly improving accuracy on the five benchmarks summarized here.

Key Contributions

  • Adaptive trace filtering: Uses vector‑based semantic similarity to detect and discard reasoning traces that are duplicates, degenerate, or hallucinated.
  • Lightweight integration: Works as a drop‑in pre‑processor for existing CISC pipelines, requiring no changes to the underlying LLM or critic model.
  • Broad evaluation: Evaluated on five datasets spanning math, chemistry, biology, commonsense, and humanities, with token usage reduced by roughly 47 % at equal or better accuracy than vanilla CISC.
  • Open‑source implementation: The authors release code and prompts, making it easy for developers to plug VecCISC into their own inference pipelines.

Methodology

  1. Generate candidate answers – The base LLM is prompted to produce N answer candidates, each accompanied by a step‑by‑step reasoning trace.
  2. Embed traces – Each reasoning trace is turned into a dense vector using a pretrained embedding model (e.g., Sentence‑Transformers).
  3. Cluster by similarity – Vectors are grouped with a simple similarity threshold (cosine similarity > τ). Traces that fall into the same cluster are considered semantically equivalent.
  4. Filter candidates – From each cluster, only the representative trace (the one with the highest internal confidence or the shortest length) is kept; the rest are discarded.
  5. Critic scoring – The remaining, filtered traces are fed to the critic LLM, which returns a confidence score for each answer.
  6. Weighted voting – Answers are selected via CISC’s weighted majority vote, using the critic‑provided scores.
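
Steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` type, the greedy seed-based grouping, and the tie-breaking rule are hypothetical choices, and the embeddings are hand-written 2-D vectors standing in for the output of a real embedding model.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    trace: str
    embedding: list[float]  # stand-in for a sentence-embedding vector
    confidence: float       # hypothetical per-trace confidence


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_by_threshold(candidates, tau):
    """Greedy grouping: a trace joins the first cluster whose seed it matches above tau."""
    clusters = []
    for cand in candidates:
        for cluster in clusters:
            if cosine(cand.embedding, cluster[0].embedding) > tau:
                cluster.append(cand)
                break
        else:  # no existing cluster was similar enough
            clusters.append([cand])
    return clusters


def pick_representatives(clusters):
    """Keep one trace per cluster: highest confidence, shorter trace on ties."""
    return [max(c, key=lambda x: (x.confidence, -len(x.trace))) for c in clusters]


candidates = [
    Candidate("42", "6 * 7 = 42", [1.0, 0.0], 0.9),
    Candidate("42", "7 * 6 gives 42", [0.98, 0.05], 0.7),
    Candidate("48", "6 * 8 = 48", [0.0, 1.0], 0.4),
]
kept = pick_representatives(cluster_by_threshold(candidates, tau=0.9))
print([c.answer for c in kept])  # two traces remain for the critic
```

In this toy input the first two traces are near-duplicates, so only two of the three candidates would be passed on to the critic.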

Because the critic is only invoked on a subset of the original candidates, the overall token count—and thus latency and cost—drops dramatically.
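
The weighted vote in step 6 then runs only over the traces that survive filtering. The sketch below assumes the critic scores are turned into weights with a temperature-scaled softmax; the summary above only states that the vote is weighted by critic scores, so the normalization, the `temperature` parameter, and the function name `cisc_vote` are illustrative assumptions.

```python
import math
from collections import defaultdict


def cisc_vote(scored, temperature=1.0):
    """Pick an answer from (answer, critic_confidence) pairs by weighted vote.

    Assumption: weights are a softmax over the confidences; the source only
    says the vote is weighted by critic-provided scores.
    """
    if not scored:
        raise ValueError("need at least one scored candidate")
    top = max(conf for _, conf in scored)  # subtract the max for stability
    weights = [math.exp((conf - top) / temperature) for _, conf in scored]
    total = sum(weights)
    tally = defaultdict(float)
    for (answer, _), weight in zip(scored, weights):
        tally[answer] += weight / total
    winner = max(tally, key=tally.get)
    return winner, dict(tally)


scored = [("42", 0.9), ("48", 0.2), ("48", 0.3)]
print(cisc_vote(scored, temperature=1.0)[0])  # two weak votes outweigh one strong vote
print(cisc_vote(scored, temperature=0.1)[0])  # sharper weights favor the confident trace
```

With a high temperature the vote approaches a plain majority count, while a low temperature lets a single high-confidence trace dominate; where the real method sits between those extremes is not specified in this summary.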

Results & Findings

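Dataset (Domain) | CISC Accuracy | VecCISC Accuracy | Token Savings
---------------- | ------------- | ---------------- | -------------
GSM‑8K (Math)    | 78.2 %        | 79.1 %           | 46 %
ChemQA (Chem)    | 71.5 %        | 71.5 %           | 48 %
BioReason (Bio)  | 66.3 %        | 66.8 %           | 45 %
CommonsenseQA    | 84.0 %        | 84.2 %           | 47 %
HumanitiesQA     | 73.9 %        | 74.5 %           | 47 %
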
  • Accuracy: VecCISC matches or slightly outperforms vanilla CISC on every benchmark.
  • Efficiency: By cutting the number of critic calls roughly in half, total token consumption falls by 45 to 48 % per dataset (about 47 % on average), with correspondingly lower inference cost.
  • Robustness: The similarity‑based filter reliably removes hallucinated or nonsensical traces without discarding useful diversity.
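
The averages behind these bullets can be checked directly from the table. The short calculation below uses only the figures reported above; the variable names are arbitrary.

```python
# (dataset, CISC accuracy %, VecCISC accuracy %, token savings %) from the table
rows = [
    ("GSM-8K", 78.2, 79.1, 46),
    ("ChemQA", 71.5, 71.5, 48),
    ("BioReason", 66.3, 66.8, 45),
    ("CommonsenseQA", 84.0, 84.2, 47),
    ("HumanitiesQA", 73.9, 74.5, 47),
]

# Mean token savings across the five datasets
mean_savings = sum(saving for *_, saving in rows) / len(rows)

# Per-dataset accuracy change in percentage points, rounded to one decimal
deltas = [round(vec - cisc, 1) for _, cisc, vec, _ in rows]
mean_delta = round(sum(deltas) / len(deltas), 2)

print(f"mean token savings: {mean_savings:.1f} %")
print(f"accuracy deltas (points): {deltas}")
print(f"mean accuracy delta: {mean_delta} points")
```

The mean saving comes out at 46.6 %, which is where the rounded "about 47 %" figure comes from, and no dataset shows an accuracy drop.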

Practical Implications

  • Cost‑effective scaling: Companies deploying LLM‑based assistants can run CISC‑style reasoning at CISC‑level accuracy while spending roughly half as many tokens.
  • Lower latency: Fewer critic calls mean faster response times—critical for real‑time chatbots, code‑assist tools, or decision‑support systems.
  • Plug‑and‑play: Since VecCISC sits between the generator and the critic, existing pipelines (e.g., OpenAI’s gpt‑4 with a separate evaluation model) can adopt it with minimal engineering effort.
  • Improved reliability: By automatically filtering out degenerate traces, developers get cleaner logs and fewer “nonsense” explanations, simplifying downstream debugging and audit trails.
  • Generalizable to other LLM frameworks: The vector‑clustering idea works with any embedding model, making it compatible with open‑source LLM stacks (LLaMA, Mistral, etc.) as well as commercial APIs.

Limitations & Future Work

  • Similarity threshold tuning: The τ hyper‑parameter needs dataset‑specific calibration; an overly aggressive threshold could discard genuinely distinct but correct reasoning paths.
  • Embedding model dependency: Quality of trace clustering hinges on the chosen embedding model; poor embeddings could misclassify traces.
  • Scalability of clustering: While cheap for the modest N (≈10‑20) candidates used in experiments, extremely large candidate sets may require more sophisticated clustering algorithms.
  • Future directions: The authors suggest exploring dynamic thresholds, hierarchical clustering, and integrating uncertainty estimation directly into the embedding stage to further reduce critic calls without sacrificing diversity.
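
The threshold sensitivity noted in the first bullet is easy to see on a toy example. The sketch below sweeps τ over hand-made 2-D unit vectors, so the cosine similarity between two traces is just the cosine of the angle between them; the greedy grouping rule and the helper names are assumptions for illustration, not the paper's procedure.

```python
import math


def unit(angle_deg):
    """2-D unit vector at the given angle; a stand-in for a trace embedding."""
    rad = math.radians(angle_deg)
    return (math.cos(rad), math.sin(rad))


def count_clusters(embeddings, tau):
    """Number of clusters under greedy grouping with cosine similarity > tau."""
    seeds = []
    for emb in embeddings:
        # Vectors are unit length, so the dot product equals cosine similarity.
        if not any(emb[0] * s[0] + emb[1] * s[1] > tau for s in seeds):
            seeds.append(emb)
    return len(seeds)


# Four hypothetical traces at 0, 10, 30, and 90 degrees from each other
embeddings = [unit(a) for a in (0, 10, 30, 90)]

# Each cluster costs one critic call, so this is critic calls per threshold
sweep = {tau: count_clusters(embeddings, tau) for tau in (0.5, 0.9, 0.99)}
for tau, calls in sweep.items():
    print(f"tau={tau}: {calls} critic calls")
```

A low τ merges the trace at 30 degrees into the first cluster and saves a critic call, but that is exactly the case where a distinct reasoning path could be discarded; a high τ keeps every trace and saves nothing.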

VecCISC demonstrates that a little semantic‑aware pruning can make sophisticated self‑consistency reasoning both cheaper and faster—an attractive proposition for any developer looking to squeeze more value out of large language models.

Authors

  • James Petullo
  • Sonny George
  • Dylan Cashman
  • Nianwen Xue

Paper Information

  • arXiv ID: 2605.08070v1
  • Categories: cs.AI
  • Published: May 8, 2026