GitHub Copilot CLI combines model families for a second opinion

Published: April 6, 2026 at 05:53 PM EDT

6 min read

Source: GitHub Blog

Introducing Rubber Duck (Experimental)

When you ask a coding agent to build a data pipeline, it may not always choose the optimal structure.
What if the agent could get a second opinion before executing its plan?

What is Rubber Duck?

Rubber Duck is an experimental feature in GitHub Copilot CLI that:

Leverages a second model from a different AI family to act as an independent reviewer.
Assesses the agent’s plans and work at the moments where feedback matters most.
Provides a fresh perspective to catch different kinds of errors.

Why It Matters

Our evaluations show that combining Claude Sonnet with Rubber Duck:

Closes 74.7 % of the performance gap between Sonnet and Opus.
Delivers better results on difficult multi‑file and long‑running tasks.

How to Try It

Install or update GitHub Copilot CLI.
Run the /experimental slash command inside a Copilot CLI session to enable Rubber Duck alongside other experimental features.

/experimental

Note: Rubber Duck is currently in experimental mode and may evolve based on user feedback.

Learn More

GitHub Copilot CLI – Experimental Features

Give Rubber Duck a try and see how a second set of eyes can improve your coding workflow!

The Problem: Confident Mistakes Can Compound

Today’s coding agents follow a clear loop:

Assess the task
Draft a plan
Implement
Test
Iterate (if necessary)

This flow is powerful, but it has blind spots. Any decision an agent makes early on—especially during the planning stage—becomes the foundation for everything that follows. Assumptions and inefficiencies turn into hidden dependencies, and by the time they’re noticed, you may have to fix more than just the initial mistake.

Why Self‑Reflection Helps

Using self‑reflection and having the agent review its own output before moving forward is a proven technique. However, a model reviewing its own work is still bounded by its own training biases: the same training data, the same techniques, and consequently, the same blind spots.

Rubber Duck adds a second perspective

Rubber Duck is a focused review agent powered by a model from a complementary family to your primary Copilot session. When you select a Claude model from the model picker as your orchestrator, Rubber Duck will be GPT‑5.4. As we experiment with Rubber Duck, we are exploring other model families for both the orchestrator and the reviewer.
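
The pairing rule described above can be sketched as a small lookup. This is only an illustration, not Copilot's implementation: the function names, the family-detection heuristic, and the model identifier strings are assumptions made for the example. Only the Claude-to-GPT‑5.4 pairing comes from the post.

```python
from typing import Optional

# Hypothetical mapping from orchestrator family to reviewer model.
# Only the Claude -> GPT-5.4 pairing is stated in the post; the
# identifier strings themselves are illustrative.
REVIEWER_BY_FAMILY = {"claude": "gpt-5.4"}


def model_family(model_name: str) -> str:
    """Guess the family from the first token of a model name (assumed naming scheme)."""
    tokens = model_name.strip().lower().replace("-", " ").split()
    return tokens[0] if tokens else ""


def pick_reviewer(orchestrator: str) -> Optional[str]:
    """Return a reviewer from a different family, or None if no pairing is defined."""
    return REVIEWER_BY_FAMILY.get(model_family(orchestrator))


if __name__ == "__main__":
    print(pick_reviewer("Claude Sonnet 4.6"))  # a Claude orchestrator gets the GPT reviewer
    print(pick_reviewer("GPT-5.4"))            # no pairing defined in this sketch
```

Returning None for unpaired families mirrors the post's statement that other orchestrator and reviewer combinations are still being explored.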

The job of Rubber Duck is to:

check the agent’s work, and
surface a short, focused list of high‑value concerns—details the primary agent may have missed, assumptions worth questioning, and edge cases to consider.

When does the cross‑family review help?

We evaluated Rubber Duck on SWE‑Bench Pro, a benchmark of large, difficult, real‑world coding problems drawn from open‑source repositories.

Key findings

Orchestrator	Reviewer	Result
Claude Sonnet 4.6	Rubber Duck (GPT‑5.4)	Achieved a resolution rate approaching that of Claude Opus 4.6 running alone, closing 74.7 % of the performance gap between Sonnet and Opus.
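
A "gap closed" figure is usually read as the share of the Sonnet-to-Opus difference that the combined setup recovers. The sketch below assumes that definition, which the post does not spell out. The input numbers are made up to show the arithmetic and are not the evaluation's actual resolution rates.

```python
def gap_closed(baseline: float, combined: float, ceiling: float) -> float:
    """Fraction of the baseline-to-ceiling gap recovered by the combined setup."""
    gap = ceiling - baseline
    if gap <= 0:
        raise ValueError("ceiling must exceed baseline")
    return (combined - baseline) / gap


if __name__ == "__main__":
    # Illustrative values only, not results from the post.
    share = gap_closed(baseline=40.0, combined=47.5, ceiling=50.0)
    print(f"{share:.1%}")
```

Under this reading, a value of 1.0 would mean the combined setup matches the stronger model, and 0.0 would mean no improvement over the baseline.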

Rubber Duck helps most with difficult problems (≥ 3 files, ≥ 70 steps). On those problems:

Sonnet + Rubber Duck scores 3.8 % higher than the Sonnet baseline.
On the hardest problems (identified across three trials) the gain rises to 4.8 %.

Example findings

Architectural catch (OpenLibrary/async scheduler) – Rubber Duck noticed the proposed scheduler would start and immediately exit, running zero jobs. Even after fixing that, one scheduled task contained an infinite loop.
One‑liner bug, big impact (OpenLibrary/Solr) – It detected a loop that silently overwrote the same dict key on every iteration, causing three of four Solr facet categories to be dropped from every search query without raising an error.
Cross‑file conflict (NodeBB/email confirmation) – Rubber Duck identified three files that all read from a Redis key which the new code stopped writing to, silently breaking the confirmation UI and cleanup paths on deploy.
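
The dict-overwrite finding describes a class of bug that is easy to reproduce in isolation. The snippet below is a generic illustration, not the OpenLibrary code: the facet names and both functions are hypothetical, and only the shape of the mistake is taken from the description above.

```python
FACETS = ["author", "subject", "language", "publisher"]  # hypothetical facet names


def build_params_buggy(facets):
    """Each iteration assigns to the same key, so only the last facet survives."""
    params = {}
    for facet in facets:
        params["facet.field"] = facet  # silently replaces the previous value
    return params


def build_params_fixed(facets):
    """Accumulate every facet under the key instead of replacing it."""
    params = {"facet.field": []}
    for facet in facets:
        params["facet.field"].append(facet)
    return params


if __name__ == "__main__":
    print(build_params_buggy(FACETS))
    print(build_params_fixed(FACETS))
```

With four facets, the buggy version keeps one and drops three, and nothing raises. That is why this kind of error tends to pass tests that only check that a query runs.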

When does Rubber Duck activate?

GitHub Copilot can call Rubber Duck automatically, both proactively and reactively. You can also trigger it yourself at any time to have Copilot critique and revise its work.

Proactive activation (automatic)

For complex work, Copilot seeks a critique at checkpoints where feedback yields the highest return:

After drafting a plan – Early detection of sub‑optimal decisions prevents downstream errors.
After a complex implementation – A second set of eyes helps catch edge cases in intricate code.
After writing tests, before executing them – Gaps in test coverage or flawed assertions can be fixed before “everything passes” reinforces a false sense of correctness.
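
One way to picture these checkpoints is as a small policy that decides whether a given moment in the agent loop warrants a review. The sketch below is an illustration under stated assumptions: the event names, the `complex_task` flag, and the function are all hypothetical, and the post does not describe how Copilot actually makes this decision.

```python
from enum import Enum, auto


class Event(Enum):
    PLAN_DRAFTED = auto()
    IMPLEMENTATION_DONE = auto()
    TESTS_WRITTEN = auto()    # tests exist but have not been run yet
    TESTS_EXECUTED = auto()


# The three checkpoints named in the post.
REVIEW_CHECKPOINTS = {
    Event.PLAN_DRAFTED,
    Event.IMPLEMENTATION_DONE,
    Event.TESTS_WRITTEN,
}


def should_request_review(event: Event, complex_task: bool) -> bool:
    """Ask for a critique only on complex work, and only at a listed checkpoint."""
    return complex_task and event in REVIEW_CHECKPOINTS


if __name__ == "__main__":
    for event in Event:
        print(event.name, should_request_review(event, complex_task=True))
```

Placing the test checkpoint before execution reflects the point made above: a review is most useful before a passing run makes flawed assertions look correct.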

Reactive activation (automatic)

If the agent gets stuck in a loop or cannot make progress, it can request a Rubber Duck critique to break the logjam.
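
A reactive trigger needs some notion of "stuck". The post does not say how Copilot detects this, so the following is purely a hypothetical heuristic: treat the agent as stuck when its most recent actions are all identical. The function name, the action strings, and the window size are assumptions for illustration.

```python
from typing import Sequence


def is_stuck(actions: Sequence[str], window: int = 3) -> bool:
    """True when the last `window` actions are identical (a naive loop detector)."""
    if window < 2 or len(actions) < window:
        return False
    recent = actions[-window:]
    return all(action == recent[0] for action in recent)


if __name__ == "__main__":
    history = ["edit a.py", "run tests", "run tests", "run tests"]
    if is_stuck(history):
        print("request a critique")
```

A real agent would likely use richer signals, such as repeated failing test output or a lack of file changes. This version only shows where such a check would sit.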

User‑initiated activation

You may request a critique at any point. Copilot will query Rubber Duck, reason over the feedback, and show you what changed and why.

Design note: The agent invokes Rubber Duck sparingly, targeting moments where the signal‑to‑noise ratio is highest, without getting in the way. Technically, Rubber Duck is invoked through Copilot’s existing task tool—the same infrastructure used for other sub‑agents.

Availability

Rubber Duck is currently enabled for all Claude family models (Opus, Sonnet, and Haiku) used as orchestrators in the model picker.

We are already exploring other model families for the reviewer, as well as configurations that use GPT‑5.4 as the orchestrator.

Getting Started

Rubber Duck is available today in experimental mode.

To start using it:

Install the GitHub Copilot CLI.
Run the /experimental slash command.

Rubber Duck will appear when you select any Claude model from the model picker and have access to GPT‑5.4 enabled. You’ll see critiques surface in two ways:

Automatically – when Copilot decides a checkpoint warrants a second opinion (after planning, after complex implementations, or after writing tests).
On demand – just ask Copilot to critique its work; it will invoke Rubber Duck, incorporate the feedback, and show you exactly what changed.

Where Rubber Duck Helps Most

Complex refactors and architectural changes
High‑stakes tasks where a miss is costly
Ensuring comprehensive test coverage
Any time you want a second opinion on a plan before committing

Rubber Duck in the GitHub Copilot CLI is now available in experimental mode.
Share your feedback in the discussion.

Authors


Nick McKenna – Applied Researcher III
Bartek Perz – Principal Applied Researcher


