
The MCP skill

your AI is missing.

Structured debate between multiple AI models that argue the tradeoffs, catch blind spots, and show their reasoning — so every decision gets the full picture.

A mechanism backed by peer-reviewed research from UCL, Anthropic, MIT, and Google DeepMind


Set up in seconds

Terminal
claude mcp add --transport http roundtable https://mcp.roundtable.now/mcp

Add the server first — authenticate via your API key when prompted.


The Problem

Your AI is a yes-man

single model
you › ALTER TABLE users ADD COLUMN last_login TIMESTAMP DEFAULT NOW() — safe for prod?
ai     Yes, adding a column with a default is a metadata-only operation. Ship it.
One model. One opinion. You find out Monday it was wrong.
council of three
you › ALTER TABLE users ADD COLUMN last_login TIMESTAMP DEFAULT NOW() — safe for prod?
GPT-4.1 · DevOps Lead: Depends on your Postgres version. Pre-11, adding a column with any default rewrites the entire table and holds an exclusive lock for the duration. On 11+, NOW() is stable, not volatile, so the add itself is metadata-only. The catch: every existing row gets backfilled with the migration timestamp, which is a lie in a last_login column.
Gemini · Database Expert: Add the column as nullable with no default, backfill in batches if you need values, then set the default separately so it applies only to new rows. Truly volatile defaults like clock_timestamp() do still force a rewrite, even on modern Postgres.
Grok · Ops Realist: Both right. Also, it's Friday afternoon. Schema migrations before the weekend are how you get paged at 2am.
Council Verdict
Block deploy. Pre-11 this rewrites the table under lock; on 11+ it silently stamps every existing user with a bogus last_login. Safe path: add as nullable, backfill, then set the default.
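The verdict's safe path can be sketched directly in SQL. This is an illustrative sequence, not a drop-in migration: the backfill source (shown here as a hypothetical sessions table) depends on your schema.

```sql
-- Step 1: add the column with no default. This is metadata-only
-- on any supported Postgres version: no rewrite, no long lock.
ALTER TABLE users ADD COLUMN last_login TIMESTAMP;

-- Step 2 (optional): backfill in batches to keep each transaction short.
-- The value source is schema-specific; "sessions" here is hypothetical.
-- UPDATE users u SET last_login = s.last_seen
--   FROM sessions s WHERE s.user_id = u.id AND u.id BETWEEN 1 AND 10000;

-- Step 3: set the default afterwards. SET DEFAULT is metadata-only
-- and applies to newly inserted rows, never to existing ones.
ALTER TABLE users ALTER COLUMN last_login SET DEFAULT NOW();
```

Run each step in its own transaction so no single statement holds locks for the whole migration.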
Research

Backed by peer-reviewed science

Multi-model debate isn't a hypothesis. It's the mechanism behind the most accurate AI reasoning ever measured.

Accuracy improvement
+28 percentage points

Non-expert judges improved from 48% → 76% accuracy when evaluating debated answers vs single-model responses

Khan et al. · UCL + Anthropic · ICML 2024 Best Paper
Math reasoning boost
+14.8 percentage points

Multi-agent debate improved math reasoning from 67% → 81.8%. Models correct each other through sequential challenge rounds

Du et al. · MIT + DeepMind · ICML 2024
Open-source beats GPT-4
65.1% on AlpacaEval 2.0

Mixture-of-Agents: open-source models collaborating scored 65.1% vs GPT-4 Omni's 57.5% — proving collective reasoning beats individual capability

Wang et al. · Together AI + Stanford · ICLR 2025
Universal advantage
Debate wins on every task

Weak LLM judges supervising strong LLMs via debate outperformed direct questioning on every task tested — scalable oversight works

Kenton et al. · Google DeepMind · NeurIPS 2024

“Two sets of findings released in 2024 offer the first empirical evidence that debate between two LLMs helps a judge recognize the truth.”

Quanta Magazine, March 2025
Why We Built This

AI changed everything. Except how we decide.

Every team uses AI now. But they use it the same way — ask one model, trust the answer, ship it. For boilerplate, that works. For architecture calls, security reviews, and infrastructure changes, it's a coin flip with production on the line.

Worse — models are trained to agree with you. Anthropic's own research (ICLR 2024) showed that LLMs systematically tell users what they want to hear, even when the user is wrong. They call it sycophancy. We call it the core failure mode of single-model AI: a system optimized to sound right, not to be right.

And 66% of the time, the answer is almost right — close enough to ship, wrong enough to break. That's the danger zone. Not the obvious hallucinations. The confident, plausible, subtly wrong answers that pass code review because they sound like something a senior engineer would say.

The fix isn't a better model. It's structured disagreement. When AI is forced to challenge AI — reading, questioning, and stress-testing each other's reasoning — errors surface that no single model catches. This is peer-reviewed science presented at ICML, NeurIPS, and ICLR. Not a hypothesis.

Who this is for

Anyone making high-stakes decisions with AI — engineers, product leads, marketers, designers, founders. If the answer matters and one model isn't enough, you want a council arguing the tradeoffs before you commit.

What we're building

AI peer review for critical changes. Not a chat UI. Not a copilot. A council that argues the tradeoffs before you ship — with a full reasoning trail for every decision.

We used Roundtable to make this decision. The positioning, the target market, the copy on this page — all debated by a council of models before we committed. We build with what we ship.

Inside the Product

Presets or build your own

Start with a curated council of models and roles — or pick exactly which models debate and what perspective each one takes.

roundtable.now/chat

Critical Code Review

Anthropic · Builder
OpenAI · Critic
Google · Critic
xAI · Performance Engineer

Architecture migration, code quality, security, and performance analysis.

Strategy Debate

Anthropic · Strategist
OpenAI · Critic
DeepSeek · Analyst

Build vs buy, tech stack decisions, and resource allocation trade-offs.

Creative Brainstorm

Anthropic · Ideator
OpenAI · Builder
Google · Ideator
xAI · Builder

Divergent ideation, concept exploration, and creative direction with competing perspectives.

Deep Analysis

Anthropic · Strategist
OpenAI · Strategist
Google · Builder

Complex problem decomposition, systems thinking, and multi-angle reasoning.

UX Research Panel

Anthropic · UX Researcher
OpenAI · Product Designer
Google · Accessibility Lead

User research synthesis, journey mapping, and experience gap identification.

Startup Pitch Review

Anthropic · VC Partner
OpenAI · Founder Coach
xAI · Analyst
DeepSeek · Financial Modeler

Pitch deck teardown, market sizing, competitive positioning, and investor readiness.

Security Threat Review

Anthropic · Security Architect
OpenAI · Penetration Tester
Google · Compliance Officer

Threat modeling, vulnerability assessment, and incident response planning.

Content & Copy Review

Anthropic · Editor
OpenAI · Copywriter
xAI · Strategist

Copy review, tone analysis, audience targeting, and messaging consistency.

Trust & Security

Built for high-stakes decisions

Roundtable is designed for confidential, critical work — code reviews, architecture calls, security audits.

Full Traceability

Every tool call logged with model attribution and reasoning chain. When the council says 'refactor,' you can trace which model proposed it, which challenged it, and why the verdict stands.

Your Code Stays Local

MCP runs in your IDE. Code context never leaves your machine. API calls are excluded from model training by every provider we route through.

Human-in-the-Loop

AI deliberates. You decide. Every verdict includes the reasoning so you can override with confidence. The council argues the tradeoffs — you make the call.

Compliance-Ready

Every council produces a decision record — which models participated, what positions they took, how the verdict was reached. The EU AI Act (August 2026) requires exactly this kind of AI decision documentation for high-risk systems.
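As a sketch of what such a record could contain, serialized as JSON. Field names are illustrative assumptions, not Roundtable's actual schema:

```json
{
  "decision_id": "example-schema-migration",
  "question": "Is ALTER TABLE users ADD COLUMN last_login ... safe for prod?",
  "participants": [
    { "model": "gpt-4.1", "role": "DevOps Lead", "position": "block" },
    { "model": "gemini", "role": "Database Expert", "position": "block" },
    { "model": "grok", "role": "Ops Realist", "position": "block" }
  ],
  "verdict": "Block deploy",
  "reasoning_chain": [
    "DevOps Lead flagged version-dependent rewrite and lock behavior",
    "Database Expert proposed nullable-then-backfill alternative",
    "Ops Realist raised deploy-timing risk"
  ]
}
```

A flat record like this answers the questions an auditor asks: who participated, what each argued, and how the verdict was reached.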

Deep Dive

Read the research

The peer-reviewed papers behind multi-model deliberation — from UCL, Anthropic, Google DeepMind, MIT, and leading AI labs.

FAQ

Frequently asked questions

30 Seconds to Your First Verdict

Pick your MCP client, add the server, and start your first council debate.

Terminal
claude mcp add --transport http roundtable https://mcp.roundtable.now/mcp

Add the server first — authenticate via your API key when prompted.

Get Your API Key