GitHub Copilot CLI combines model families for a second opinion

Published: April 6, 2026 at 05:53 PM EDT
6 min read
Source: GitHub Blog

Introducing Rubber Duck (Experimental)

When you ask a coding agent to build a data pipeline, it may not always choose the optimal structure.
What if the agent could get a second opinion before executing its plan?

What is Rubber Duck?

Rubber Duck is an experimental feature in GitHub Copilot CLI that:

  • Leverages a second model from a different AI family to act as an independent reviewer.
  • Assesses the agent’s plans and work at the moments where feedback matters most.
  • Provides a fresh perspective to catch different kinds of errors.

Why It Matters

Our evaluations show that combining Claude Sonnet with Rubber Duck:

  • Closes 74.7 % of the performance gap between Sonnet and Opus.
  • Delivers better results on difficult multi‑file and long‑running tasks.

How to Try It

  1. Install or update GitHub Copilot CLI.
  2. Run the /experimental slash command in a Copilot CLI session to enable Rubber Duck alongside other experimental features.

Note: Rubber Duck is currently in experimental mode and may evolve based on user feedback.

Learn More

Give Rubber Duck a try and see how a second set of eyes can improve your coding workflow!

The Problem: Confident Mistakes Can Compound

Today’s coding agents follow a clear loop:

  1. Assess the task
  2. Draft a plan
  3. Implement
  4. Test
  5. Iterate (if necessary)

This flow is powerful, but it has blind spots. Any decision an agent makes early on—especially during the planning stage—becomes the foundation for everything that follows. Assumptions and inefficiencies turn into hidden dependencies, and by the time they’re noticed, you may have to fix more than just the initial mistake.

Why Self‑Reflection Is Not Enough

Self‑reflection, in which the agent reviews its own output before moving forward, is a proven technique. However, a model reviewing its own work is still bounded by its own training biases: the same training data, the same techniques, and consequently, the same blind spots.

Rubber Duck adds a second perspective

Rubber Duck is a focused review agent powered by a model from a complementary family to your primary Copilot session. When you select a Claude model from the model picker as your orchestrator, Rubber Duck will be GPT‑5.4. As we experiment with Rubber Duck, we are exploring other model families for both the orchestrator and the reviewer.

The job of Rubber Duck is to:

  • check the agent’s work, and
  • surface a short, focused list of high‑value concerns—details the primary agent may have missed, assumptions worth questioning, and edge cases to consider.
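
As an illustration of what a "short, focused list of high‑value concerns" could look like as data, here is a minimal Python sketch. The `Concern` type, its fields, the three category labels, the severity scale, and the `top_concerns` helper are all hypothetical; the post does not describe Rubber Duck's actual output format.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # Hypothetical labels mirroring the three kinds of concern named above.
    MISSED_DETAIL = "missed detail"
    QUESTIONABLE_ASSUMPTION = "questionable assumption"
    EDGE_CASE = "edge case"


@dataclass(frozen=True)
class Concern:
    category: Category
    summary: str
    severity: int  # hypothetical scale: higher means more important


def top_concerns(concerns: list[Concern], limit: int = 3) -> list[Concern]:
    """Keep only the highest-severity concerns so the list stays short."""
    ranked = sorted(concerns, key=lambda c: c.severity, reverse=True)
    return ranked[:limit]


review = [
    Concern(Category.EDGE_CASE, "Empty input is not handled", 2),
    Concern(Category.QUESTIONABLE_ASSUMPTION, "Plan assumes the cache key is always written", 5),
    Concern(Category.MISSED_DETAIL, "Variable name is slightly unclear", 1),
    Concern(Category.MISSED_DETAIL, "Two other files read the removed key", 4),
]

for concern in top_concerns(review, limit=2):
    print(f"[{concern.category.value}] {concern.summary}")
```

Capping the list is the design point here: a reviewer that reports everything it notices would bury the concerns that matter.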

When does the cross‑family review help?

We evaluated Rubber Duck on SWE‑Bench Pro, a benchmark of large, difficult, real‑world coding problems drawn from open‑source repositories.

Key findings

  • Orchestrator: Claude Sonnet 4.6
  • Reviewer: Rubber Duck (GPT‑5.4)
  • Result: Achieved a resolution rate approaching Claude Opus 4.6 running alone, closing 74.7 % of the performance gap between Sonnet and Opus.
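
One common way to read a "gap closed" figure is as the share of the distance between a baseline and a stronger reference that the combined system recovers. The post does not spell out its formula or the underlying resolution rates, so the sketch below assumes this definition, and the three rates in it are hypothetical placeholders chosen only to make the arithmetic visible.

```python
def gap_closed(baseline: float, combined: float, reference: float) -> float:
    """Fraction of the baseline-to-reference gap recovered by the combined system.

    Assumed definition: (combined - baseline) / (reference - baseline).
    """
    if reference == baseline:
        raise ValueError("reference and baseline must differ")
    return (combined - baseline) / (reference - baseline)


# Hypothetical resolution rates in percent, NOT figures from the evaluation.
sonnet_alone = 50.0
opus_alone = 60.0
sonnet_with_reviewer = 57.47

share = gap_closed(sonnet_alone, sonnet_with_reviewer, opus_alone)
print(f"{share:.1%} of the gap closed")
```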

Rubber Duck helps most with difficult problems (those spanning 3 or more files and taking 70 or more steps). On those problems:

  • Sonnet + Rubber Duck scores 3.8 % higher than the Sonnet baseline.
  • On the hardest problems (identified across three trials) the gain rises to 4.8 %.

Example findings

  • Architectural catch (OpenLibrary/async scheduler) – Rubber Duck noticed the proposed scheduler would start and immediately exit, running zero jobs. Even after fixing that, one scheduled task contained an infinite loop.
  • One‑liner bug, big impact (OpenLibrary/Solr) – It detected a loop that silently overwrote the same dict key on every iteration, causing three of four Solr facet categories to be dropped from every search query without raising an error.
  • Cross‑file conflict (NodeBB/email confirmation) – Rubber Duck identified three files that all read from a Redis key which the new code stopped writing to, silently breaking the confirmation UI and cleanup paths on deploy.
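
The second finding describes a class of bug that is easy to reproduce in a few lines. The sketch below is not the OpenLibrary code: the function names, the use of `facet.field` as a dict key, and the four facet names are hypothetical stand‑ins. It shows how reusing one dict key inside a loop keeps only the last value, and how collecting values into a list avoids that.

```python
FACETS = ["author", "subject", "language", "publisher"]  # hypothetical names


def build_params_buggy(facets: list[str]) -> dict[str, str]:
    params: dict[str, str] = {}
    for facet in facets:
        # Bug: the same key is assigned on every iteration, so each value
        # replaces the previous one and no error is raised.
        params["facet.field"] = facet
    return params


def build_params_fixed(facets: list[str]) -> dict[str, list[str]]:
    params: dict[str, list[str]] = {}
    for facet in facets:
        # Fix: accumulate every value under the key instead of overwriting.
        params.setdefault("facet.field", []).append(facet)
    return params


print(build_params_buggy(FACETS))
print(build_params_fixed(FACETS))
```

With four inputs, the buggy version keeps only the last one, which matches the "three of four dropped" symptom described above. Nothing fails at runtime, which is why this kind of mistake tends to survive self‑review.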

When does Rubber Duck activate?

GitHub Copilot can call Rubber Duck automatically, both proactively and reactively. You can also trigger it yourself at any time to have Copilot critique and revise its work.

Proactive activation (automatic)

For complex work, Copilot seeks a critique at checkpoints where feedback yields the highest return:

  1. After drafting a plan – Early detection of sub‑optimal decisions prevents downstream errors.
  2. After a complex implementation – A second set of eyes helps catch edge cases in intricate code.
  3. After writing tests, before executing them – Gaps in test coverage or flawed assertions can be fixed before “everything passes” reinforces a false sense of correctness.
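
The three checkpoints can be pictured as a small decision function. This is a sketch under stated assumptions, not Copilot's implementation: the `Checkpoint` enum, its member names, and the `task_is_complex` flag are hypothetical, and how Copilot actually judges complexity is not described in the post.

```python
from enum import Enum, auto


class Checkpoint(Enum):
    PLAN_DRAFTED = auto()
    COMPLEX_IMPLEMENTATION_DONE = auto()
    TESTS_WRITTEN_NOT_YET_RUN = auto()
    OTHER = auto()  # any point in the loop that is not a review checkpoint


# The three moments listed above where a critique is requested proactively.
REVIEW_CHECKPOINTS = {
    Checkpoint.PLAN_DRAFTED,
    Checkpoint.COMPLEX_IMPLEMENTATION_DONE,
    Checkpoint.TESTS_WRITTEN_NOT_YET_RUN,
}


def should_request_critique(checkpoint: Checkpoint, task_is_complex: bool) -> bool:
    """Return True only at the listed checkpoints, and only for complex work."""
    return task_is_complex and checkpoint in REVIEW_CHECKPOINTS


print(should_request_critique(Checkpoint.PLAN_DRAFTED, task_is_complex=True))
print(should_request_critique(Checkpoint.PLAN_DRAFTED, task_is_complex=False))
print(should_request_critique(Checkpoint.OTHER, task_is_complex=True))
```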

Reactive activation (automatic)

If the agent gets stuck in a loop or cannot make progress, it can request a Rubber Duck critique to break the logjam.
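
How Copilot decides that it is stuck is not described in the post. As one possible heuristic, the sketch below flags a loop when the same action repeats several times in a row. The `is_stuck` name, the representation of actions as strings, and the repeat threshold are assumptions made for illustration.

```python
def is_stuck(actions: list[str], repeat_threshold: int = 3) -> bool:
    """Return True if the last `repeat_threshold` actions are identical."""
    # A threshold below 2 cannot express "repetition", so treat it as not stuck.
    if repeat_threshold < 2 or len(actions) < repeat_threshold:
        return False
    recent = actions[-repeat_threshold:]
    return all(action == recent[0] for action in recent)


history = ["edit parser.py", "run tests", "edit parser.py", "run tests"]
print(is_stuck(history))  # alternating actions: not flagged

history += ["run tests", "run tests"]
print(is_stuck(history))  # three identical actions in a row: flagged
```

A real agent would likely need a richer signal, such as repeated failing test output or no change in the diff, but the shape is the same: detect lack of progress, then ask for an outside view.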

User‑initiated activation

You may request a critique at any point. Copilot will query Rubber Duck, reason over the feedback, and show you what changed and why.

Design note: The agent invokes Rubber Duck sparingly, targeting moments where the signal‑to‑noise ratio is highest, without getting in the way. Technically, Rubber Duck is invoked through Copilot’s existing task tool—the same infrastructure used for other sub‑agents.


Availability

Rubber Duck is currently enabled for all Claude family models (Opus, Sonnet, and Haiku) used as orchestrators in the model picker.

We are already exploring other model families for the reviewer, as well as setups that use GPT‑5.4 as the orchestrator.
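
The current pairing rule can be sketched as a small lookup. The function name, the family detection by name prefix, and the model identifier strings are hypothetical; only the Claude‑to‑GPT‑5.4 pairing and the requirement to have GPT‑5.4 access come from this post.

```python
from typing import Optional

# Pairing stated in the post: Claude orchestrators are reviewed by GPT-5.4.
# The string identifiers themselves are hypothetical.
REVIEWER_BY_FAMILY = {"claude": "gpt-5.4"}


def pick_reviewer(orchestrator: str, accessible_models: set[str]) -> Optional[str]:
    """Return the reviewer model for an orchestrator, or None if unavailable."""
    # Assumes identifiers look like "<family>-<variant>", e.g. "claude-sonnet-4.6".
    family = orchestrator.lower().split("-", 1)[0]
    reviewer = REVIEWER_BY_FAMILY.get(family)
    if reviewer is None or reviewer not in accessible_models:
        return None
    return reviewer


print(pick_reviewer("claude-sonnet-4.6", {"gpt-5.4"}))  # supported pairing
print(pick_reviewer("claude-opus-4.6", set()))          # no GPT-5.4 access
print(pick_reviewer("gpt-5.4", {"gpt-5.4"}))            # family not yet paired
```

Keeping the pairing in a table rather than in branching logic would make it straightforward to add other families as they become supported.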

Getting Started

Rubber Duck is available today in experimental mode.

To start using it:

  1. Install the GitHub Copilot CLI.
  2. Run the /experimental slash command.

Rubber Duck becomes available when you select any Claude model from the model picker and have access to GPT‑5.4 enabled. You’ll see critiques surface in two ways:

  • Automatically – when Copilot decides a checkpoint warrants a second opinion (after planning, after complex implementations, or after writing tests).
  • On demand – just ask Copilot to critique its work; it will invoke Rubber Duck, incorporate the feedback, and show you exactly what changed.

Where Rubber Duck Helps Most

  • Complex refactors and architectural changes
  • High‑stakes tasks where a miss is costly
  • Ensuring comprehensive test coverage
  • Any time you want a second opinion on a plan before committing

Rubber Duck in the GitHub Copilot CLI is now available in experimental mode.
Share your feedback in the discussion.


Authors

Nick McKenna – Applied Researcher III
Bartek Perz – Principal Applied Researcher
