ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

Abstract Snapshot

Compressed abstract

Main idea

The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance.

Method signal

We propose LLM-as-a-Developer, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the developer constant and varying only the framework, generation effort becomes a quantitative proxy for API usability and the resulting agents provide a controlled measure of framework effectiveness.
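
The validate-and-feedback loop described above can be sketched roughly as follows. This is a minimal illustration, not the ADK Arena implementation: the names `generate_until_valid`, `GenerationResult`, `write_code`, `validate`, and the `max_attempts` budget are all hypothetical, and the toy writer and validator stand in for the LLM coding agent and the real test suite.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class GenerationResult:
    """Outcome of one generation run (hypothetical structure)."""
    code: str
    passed: bool
    attempts: int
    feedback_log: List[str] = field(default_factory=list)


def generate_until_valid(
    write_code: Callable[[str, List[str]], str],
    validate: Callable[[str], Tuple[bool, str]],
    docs: str,
    max_attempts: int = 5,  # assumed budget; the abstract does not state one
) -> GenerationResult:
    """Write code from docs, validate it, and feed failures back until tests pass."""
    feedback: List[str] = []
    code = ""
    for attempt in range(1, max_attempts + 1):
        # The developer sees the documentation plus all earlier failure messages.
        code = write_code(docs, feedback)
        ok, message = validate(code)
        if ok:
            return GenerationResult(code, True, attempt, feedback)
        feedback.append(message)
    return GenerationResult(code, False, max_attempts, feedback)


# Toy stand-ins: the "developer" improves with each piece of feedback,
# and the "validator" accepts only the third draft.
def toy_writer(docs: str, feedback: List[str]) -> str:
    return f"draft-{len(feedback)}"


def toy_validator(code: str) -> Tuple[bool, str]:
    if code == "draft-2":
        return True, ""
    return False, f"{code} failed validation"


result = generate_until_valid(toy_writer, toy_validator, docs="(framework docs)")
print(result.passed, result.attempts)
```

With these toy stand-ins the loop succeeds on the third attempt. In this framing, the number of attempts (or the tokens spent across them) is the kind of generation-effort signal that the method treats as a proxy for API usability.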

Contribution signal

We implement this in ADK Arena, a fully automated pipeline with per-framework Docker isolation, a three-level validation pipeline, and benchmark adapters for SWE-bench, τ²-bench, Terminal-Bench, and MCP-Atlas. Evaluating all 51 popular Python ADK frameworks (204 agent–benchmark pairs), we find that: (1) generation succeeds for 57% of runs, and its cost varies 5.6× across frameworks ($0.6 to $3.4 per agent), a quantitative proxy for API complexity, though cost alone does not predict success; (2) no single framework dominates: the best single-benchmark ADK agents resolve up to 80% of tasks and can even beat general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32%; (3) across information-source ablations, genuine framework usage stays within a narrow 28–40% band (highest with raw source access and still 33% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable rather than any one being a hard bottleneck.
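
As a quick sanity check on the headline figures, the pair count and the cost spread can be recomputed from the numbers quoted above. The variable names are illustrative only, and the inputs are the rounded values from the abstract, so the ratio computed here lands near the reported 5.6× without matching it exactly; the reported figure presumably comes from unrounded costs.

```python
# Figures quoted in the abstract (rounded as reported there).
frameworks = 51
benchmarks = ["SWE-bench", "tau2-bench", "Terminal-Bench", "MCP-Atlas"]

# One generated agent per framework per benchmark.
pairs = frameworks * len(benchmarks)

# Per-agent generation cost range in US dollars.
min_cost, max_cost = 0.6, 3.4
cost_ratio = max_cost / min_cost

print(f"agent-benchmark pairs: {pairs}")
print(f"cost ratio from rounded endpoints: {cost_ratio:.2f}x")
```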

Original Abstract

The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose LLM-as-a-Developer, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the developer constant and varying only the framework, generation effort becomes a quantitative proxy for API usability and the resulting agents provide a controlled measure of framework effectiveness. We implement this in ADK Arena, a fully automated pipeline with per-framework Docker isolation, a three-level validation pipeline, and benchmark adapters for SWE-bench, τ²-bench, Terminal-Bench, and MCP-Atlas. Evaluating all 51 popular Python ADK frameworks (204 agent–benchmark pairs), we find that: (1) generation succeeds for 57% of runs, and its cost varies 5.6× across frameworks ($0.6 to $3.4 per agent), a quantitative proxy for API complexity, though cost alone does not predict success; (2) no single framework dominates: the best single-benchmark ADK agents resolve up to 80% of tasks and can even beat general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32%; (3) across information-source ablations, genuine framework usage stays within a narrow 28–40% band (highest with raw source access and still 33% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable rather than any one being a hard bottleneck.
