← ListarXivPDF

#3 Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

Score: 34.7 | Matched keywords: agent, ai, ai agent, llm, prompt, reasoning, token

Categories: cs.SE, cs.AI

Abstract Snapshot

Compressed abstract

When an AI agent calls an API and hits a validation error, it needs more than what went wrong -- it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning.

Main idea

When an AI agent calls an API and hits a validation error, it needs more than what went wrong -- it needs what to do next.

Method signal

A self-reflective API returns, on validation failure, a machine-readable recovery_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot (N{=}30 per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by +36.7--40.0 pp over plain-English diagnoses on Anthropic models (Fisher's exact p 0.0022), at 1.8--2.2 better per-success token efficiency.

Contribution signal

The lift is not significant on gpt-4 o-mini (p{=}0.435); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks.

Original Abstract

When an AI agent calls an API and hits a validation error, it needs more than what went wrong -- it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot (N{=}30 per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by +36.7--40.0 pp over plain-English diagnoses on Anthropic models (Fisher's exact p 0.0022), at 1.8--2.2 better per-success token efficiency. The lift is not significant on gpt-4 o-mini (p{=}0.435); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We shipaudit_prompt_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.