Score: 27.2 | Matched keywords: agent, llm, multi-agent
Categories: physics.soc-ph, cs.AI, cs.MA
Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence.
We derive a two-parameter scaling law $R(N) = N_{\mathrm{eff}}/N = 1/(1 + c(N-1)N^{-\alpha})$, where the regime exponent $\alpha$ classifies any configuration into one of three asymptotic regimes: hard-ceiling at $1/c$ ($\alpha = 0$), sublinear at $N^{\alpha}/c$ ($0 < \alpha < 1$), or linear ($\alpha \ge 1$). A mean-field theorem predicts that the peer count $k$ and the number of rounds during agent debate enter the dynamics only through their product. The law applies at two levels: answer diversity and correctness redundancy.
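The three regimes follow from the large-$N$ behaviour of the penalty term in the denominator. The short derivation below is a sketch that assumes $c > 0$ and writes the regime exponent as $\alpha$ (a notational assumption made here for illustration):

```latex
% Assumed notation: \alpha is the regime exponent, c > 0 the redundancy coefficient.
\begin{align*}
N_{\mathrm{eff}}(N) &= N\,R(N) = \frac{N}{1 + c\,(N-1)\,N^{-\alpha}} \\[4pt]
\alpha = 0 &: \quad N_{\mathrm{eff}} = \frac{N}{1 + c\,(N-1)} \to \frac{1}{c}
  \quad (N \to \infty) \\
0 < \alpha < 1 &: \quad c\,(N-1)\,N^{-\alpha} \sim c\,N^{1-\alpha} \to \infty
  \;\Rightarrow\; N_{\mathrm{eff}} \sim \frac{N^{\alpha}}{c} \\
\alpha = 1 &: \quad c\,(N-1)\,N^{-1} \to c
  \;\Rightarrow\; N_{\mathrm{eff}} \sim \frac{N}{1 + c} \\
\alpha > 1 &: \quad c\,(N-1)\,N^{-\alpha} \to 0
  \;\Rightarrow\; N_{\mathrm{eff}} \sim N
\end{align*}
```

In every case $R(1) = 1$, so a single agent always counts as one unit of independent evidence.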
Across 44 (model $\times$ task $\times$ condition) cells spanning peer debate, self-correction, a random-noise placebo, self-consistency, three open-weight families (Qwen, Llama, Ministral) at scales from 7B to 32B with a frontier API check (Gemini), thinking models, heterogeneous teams, and sparse communication, the functional form fits every condition at $R^2 > 0.99$; only $(c, \alpha)$ shifts. On free-form math, dense peer influence collapses the answer-level regime from sublinear into hard-ceiling; correctness-level fits remain hard-ceiling throughout.
Three findings have practical implications. (i) Thirty dense debating agents produce no more answer diversity than one on MMLU-Hard. (ii) A noise placebo tracks self-correction on free-form math and at 4$\times$ scale, so within homogeneous teams the gain commonly attributed to "debate" comes from re-evaluation, not peer content. (iii) A single $N \le 5$ pilot predicts the $N = 30$ structural ceiling, and within the configurations tested only architectural diversity (heterogeneous teams) lowers $c$ and escapes the hard-ceiling regime; communication-mode interventions do not.
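To illustrate how a small pilot could be extrapolated with the two-parameter law, the sketch below fits the coefficient $c$ and the regime exponent (called `alpha` here) by ordinary least squares in log space and then evaluates the effective agent count at $N = 30$. The function names, the log-linear fitting procedure, and the pilot values are assumptions made for illustration; the pilot is synthetic, generated from the law itself, and is not data from the paper, whose actual fitting method is not stated here.

```python
import math


def redundancy_ratio(n, c, alpha):
    """R(N) = N_eff / N = 1 / (1 + c (N - 1) N^(-alpha))."""
    return 1.0 / (1.0 + c * (n - 1) * n ** (-alpha))


def effective_agents(n, c, alpha):
    """N_eff = N * R(N)."""
    return n * redundancy_ratio(n, c, alpha)


def fit_law(ns, rs):
    """Fit (c, alpha) by least squares in log space (hypothetical helper).

    Rearranging the law gives (1/R - 1) / (N - 1) = c * N^(-alpha), so
    log y = log c - alpha * log N is linear in log N.
    Requires every N >= 2 and every R strictly between 0 and 1.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log((1.0 / r - 1.0) / (n - 1)) for n, r in zip(ns, rs)]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


# Synthetic, noise-free pilot at N = 2..5 (assumed values, not measurements).
TRUE_C, TRUE_ALPHA = 0.25, 0.0
pilot_ns = [2, 3, 4, 5]
pilot_rs = [redundancy_ratio(n, TRUE_C, TRUE_ALPHA) for n in pilot_ns]

c_hat, alpha_hat = fit_law(pilot_ns, pilot_rs)
predicted_n_eff_30 = effective_agents(30, c_hat, alpha_hat)

print(f"c = {c_hat:.3f}, alpha = {alpha_hat:.3f}")
print(f"predicted N_eff at N = 30: {predicted_n_eff_30:.2f}")
print(f"asymptotic ceiling 1/c (hard-ceiling regime only): {1.0 / c_hat:.2f}")
```

Because the synthetic pilot is noise-free, the fit recovers the generating parameters essentially exactly. With real measurements the log transform would amplify noise when $R$ is close to 1, so a nonlinear fit on the original scale might be preferable.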