#8 SkillReducer: Optimizing LLM Agent Skills for Token Efficiency

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles the token efficiency of LLM agent skills. Skills represent an emerging class of software artifacts: they are authored, versioned, shared via marketplaces, and maintained by developers, yet they lack the mature optimization ecosystem that traditional source code enjoys. A typical skill consists of two primary functional components: a brief description used by the agent runtime to route user queries, and a main body of instructions that is injected into the context window upon invocation. The design rationale behind skills is to save tokens: rather than repeating instructions in every conversation, a skill encapsulates reusable knowledge that the agent loads on demand.

The core proposal is SKILLREDUCER, a skill debloating framework that systematically removes unnecessary content while preserving functional quality. Its design is directly driven by the bipartite structure of skills, resulting in a two-stage optimization pipeline. Evaluated on 600 skills and the SkillsBench benchmark, SKILLREDUCER achieves 48% description compression and 39% body compression while improving functional quality by 2.8%, revealing a less-is-more effect where removing non-essential content reduces distraction in the context window. These benefits transfer across five models from four families with a mean retention of 0.965, and generalize to an independent agent framework.

The empirical case is built around a comprehensive evaluation (Section V of the paper) that demonstrates substantial token reduction with preserved or improved functional quality. SKILLREDUCER achieves a 48% mean compression rate for descriptions and a 39% mean reduction in body tokens. Counter-intuitively, the compressed skills improve functional quality by 2.8% over the originals, suggesting a less-is-more effect where removing non-essential content reduces distraction in the context window, particularly for longer and more verbose skills. In a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal. The paper is stamped [cs.SE] 31 Mar 2026.

The central reported finding is that, in a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal. The evaluation covers 600 skills alongside the external SkillsBench benchmark.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: In a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal.
Most important supporting result: On 600 skills and the external SkillsBench benchmark, SKILLREDUCER achieves 48% description compression and 39% body compression while improving functional quality by 2.8%.

Problem definition

Large language model (LLM) based coding agents, such as Claude Code [1], Cursor [2], and Windsurf, have become essential tools for software development.
A typical skill consists of two primary functional components: a brief description used by the agent runtime to route user queries, and a main body of instructions that is injected into the context window upon invocation.
The design rationale behind skills is to save tokens: rather than repeating instructions in every conversation, a skill encapsulates reusable knowledge that the agent loads on demand.
Skills represent an emerging class of software artifacts: they are authored, versioned, shared via marketplaces, and maintained by developers, yet they lack the mature optimization ecosystem that traditional source code enjoys.
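The two-component structure described above can be sketched as a small data model. This is an illustrative sketch, not the paper's implementation: the `Skill` class, the `count_tokens` helper (which approximates tokens by whitespace-separated words), and the `context_cost` function are all hypothetical names. It also assumes that every description is visible to the runtime for routing, while a body costs tokens only when its skill is invoked.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    """Hypothetical model of a skill: a routing description plus an instruction body."""
    name: str
    description: str  # read by the agent runtime to route user queries
    body: str         # injected into the context window only on invocation


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(text.split())


def context_cost(skills: list[Skill], invoked: set[str]) -> int:
    """Tokens spent: every description, plus the bodies of invoked skills only."""
    cost = sum(count_tokens(s.description) for s in skills)
    cost += sum(count_tokens(s.body) for s in skills if s.name in invoked)
    return cost
```

Under these assumptions, description length is paid for every installed skill, while body length is paid per invocation, which is one way to see why the two components might be optimized separately.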

Core idea & method

To address the token inefficiency of skills, the authors present SKILLREDUCER, a skill debloating framework that systematically removes unnecessary content while preserving functional quality.
The design of SKILLREDUCER is directly driven by the bipartite structure of skills, resulting in a two-stage optimization pipeline.
Stage 2 restructures skill bodies through taxonomy-driven classification and progressive disclosure, separating actionable core rules from supplementary content loaded on demand, validated by faithfulness checks and a self-correcting feedback loop.
Evaluated on 600 skills and the SkillsBench benchmark, SKILLREDUCER achieves 48% description compression and 39% body compression while improving functional quality by 2.8%, revealing a less-is-more effect where removing non-essential content reduces distraction in the context window.
These benefits transfer across five models from four families with a mean retention of 0.965, and generalize to an independent agent framework.
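The Stage 2 idea of separating core rules from on-demand supplementary content can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the digest does not give the paper's taxonomy or classifier, so `CORE_MARKERS` and `is_core` use a simple keyword heuristic, and `is_faithful` is a minimal stand-in for the faithfulness checks. The self-correcting feedback loop is not modeled.

```python
from dataclasses import dataclass, field

# Hypothetical markers of actionable rules; the paper's actual taxonomy is not given here.
CORE_MARKERS = ("must", "never", "always", "do not", "run ")


def is_core(paragraph: str) -> bool:
    """Keyword heuristic standing in for the paper's classifier (an assumption)."""
    lowered = paragraph.lower()
    return any(marker in lowered for marker in CORE_MARKERS)


@dataclass
class RestructuredBody:
    core: list[str] = field(default_factory=list)           # loaded on invocation
    supplementary: list[str] = field(default_factory=list)  # loaded on demand


def restructure(body: str) -> RestructuredBody:
    """Split a body into paragraphs and route each one to core or supplementary."""
    result = RestructuredBody()
    for paragraph in (p.strip() for p in body.split("\n\n")):
        if not paragraph:
            continue
        target = result.core if is_core(paragraph) else result.supplementary
        target.append(paragraph)
    return result


def is_faithful(body: str, result: RestructuredBody) -> bool:
    """Toy faithfulness check: no paragraph was lost or invented."""
    original = sorted(p.strip() for p in body.split("\n\n") if p.strip())
    return sorted(result.core + result.supplementary) == original
```

In this sketch only `core` would be injected when the skill is invoked, and `supplementary` would be fetched only if the agent asks for it.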

Actual findings

In a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal.
On 600 skills and the external SkillsBench benchmark, SKILLREDUCER achieves a 48% mean compression rate for descriptions and a 39% mean reduction in body tokens while improving functional quality by 2.8%.
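To make "equivalent token budgets" concrete, the two simplest baselines named above, truncation and random removal, can be sketched as follows. This is a hedged illustration: the function names are hypothetical, tokens are approximated by whitespace-separated words, and the paper's baselines may operate at a different granularity (for example sentences rather than tokens).

```python
import random


def truncate_to_budget(text: str, budget: int) -> str:
    """Truncation baseline: keep only the first `budget` tokens."""
    return " ".join(text.split()[:budget])


def random_removal_to_budget(text: str, budget: int, seed: int = 0) -> str:
    """Random-removal baseline: keep `budget` randomly chosen tokens in original order."""
    tokens = text.split()
    if budget >= len(tokens):
        return " ".join(tokens)
    rng = random.Random(seed)  # seeded so the baseline is reproducible
    keep = sorted(rng.sample(range(len(tokens)), budget))
    return " ".join(tokens[i] for i in keep)
```

Fixing the budget to the token count of the optimized skill makes the comparison about which tokens are kept rather than how many.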

How the conclusion was reached

Step 1 (proposed approach): The authors present SKILLREDUCER, a skill debloating framework that systematically removes unnecessary content while preserving functional quality.
Step 2 (evaluation setup or comparison basis): A comprehensive evaluation (Section V of the paper) on 600 skills and the external SkillsBench benchmark measures token reduction and functional quality.
Step 3 (main reported evidence): In a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal.
Step 4 (additional supporting or qualifying result): The compressed skills improve functional quality by 2.8% over the originals, and the benefits transfer across five models from four families with a mean retention of 0.965.

Experimental setup & results

The evaluation covers 600 skills alongside the external SkillsBench benchmark (Section V of the paper).
SKILLREDUCER achieves significant token savings, with a 48% mean compression rate for descriptions and a 39% mean reduction in body tokens.
The optimization preserves functional quality, maintaining an 86.0% pass rate on task-based evaluations and a 100% pass rate on SkillsBench.
Counter-intuitively, the compressed skills improve functional quality by 2.8% over the originals, suggesting a less-is-more effect where removing non-essential content reduces distraction in the context window, particularly for longer and more verbose skills.
In a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal.
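The reported percentages can be read with the following small sketch of the underlying arithmetic. It assumes that "compression rate" means the fraction of tokens removed and that the mean is taken per skill. The digest does not confirm either definition, so the function names and formulas here are illustrative only.

```python
def compression_rate(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens removed: 1 - compressed / original (assumed definition)."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1.0 - compressed_tokens / original_tokens


def mean_compression_rate(pairs: list[tuple[int, int]]) -> float:
    """Per-skill mean of compression rates over (original, compressed) token counts."""
    return sum(compression_rate(o, c) for o, c in pairs) / len(pairs)


def pass_rate(outcomes: list[bool]) -> float:
    """Share of evaluation tasks that passed."""
    return sum(outcomes) / len(outcomes)
```

Under this definition, a description cut from 100 to 52 tokens has a compression rate of 0.48. A per-skill mean weights short and long skills equally, so it can differ from the pooled share of tokens saved across the whole corpus.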

Limitations & risks

Detailed Summary (KO)

Read-like-fullpaper digest

This paper addresses skills, which represent an emerging class of software artifacts: they are authored, versioned, shared via marketplaces, and maintained by developers, yet they lack the mature optimization ecosystem that traditional source code enjoys. A typical skill consists of two primary functional components: a brief description used by the agent runtime to route user queries, and a main body of instructions that is injected into the context window upon invocation. The design rationale behind skills is to save tokens: rather than repeating instructions in every conversation, a skill encapsulates reusable knowledge that the agent loads on demand. The core proposal is SKILLREDUCER, a skill debloating framework that systematically removes unnecessary content while preserving functional quality. Its design is directly driven by the bipartite structure of skills, resulting in a two-stage optimization pipeline. Evaluated on 600 skills and the SkillsBench benchmark, SKILLREDUCER achieves 48% description compression and 39% body compression while improving functional quality by 2.8%, revealing a less-is-more effect where removing non-essential content reduces distraction in the context window. These benefits transfer across five models from four families with a mean retention of 0.965, and generalize to an independent agent framework. The empirical case is built around a comprehensive evaluation (Section V of the paper) that demonstrates substantial token reduction with preserved or improved functional quality. The improvement is most pronounced for longer and more verbose skills. In a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal, which is the central reported finding. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

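Final takeaway

Main takeaway: In a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal.
Most important supporting result: The framework is evaluated on 600 skills alongside the external SkillsBench benchmark, yielding several key findings.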

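Problem definition

Large language model (LLM) based coding agents, such as Claude Code [1], Cursor [2], and Windsurf, have become essential tools for software development.
A typical skill consists of two primary functional components: a brief description used by the agent runtime to route user queries, and a main body of instructions that is injected into the context window upon invocation.
The design rationale behind skills is to save tokens: rather than repeating instructions in every conversation, a skill encapsulates reusable knowledge that the agent loads on demand.
Skills represent an emerging class of software artifacts: they are authored, versioned, shared via marketplaces, and maintained by developers, yet they lack the mature optimization ecosystem that traditional source code enjoys.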

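Core idea & method

To address the token inefficiency of skills, the authors present SKILLREDUCER, a skill debloating framework that systematically removes unnecessary content while preserving functional quality.
The design of SKILLREDUCER is directly driven by the bipartite structure of skills, resulting in a two-stage optimization pipeline.
Stage 2 restructures skill bodies through taxonomy-driven classification and progressive disclosure, separating actionable core rules from supplementary content loaded on demand, validated by faithfulness checks and a self-correcting feedback loop.
Evaluated on 600 skills and the SkillsBench benchmark, SKILLREDUCER achieves 48% description compression and 39% body compression while improving functional quality by 2.8%, revealing a less-is-more effect where removing non-essential content reduces distraction in the context window.
These benefits transfer across five models from four families with a mean retention of 0.965, and generalize to an independent agent framework.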

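Actual findings

In a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal.
The framework is evaluated on 600 skills alongside the external SkillsBench benchmark, yielding several key findings.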

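How the conclusion was reached

Step 1 (proposed approach): The authors present SKILLREDUCER, a skill debloating framework that systematically removes unnecessary content while preserving functional quality.
Step 2 (evaluation setup or comparison basis): A comprehensive evaluation (Section V of the paper) demonstrates substantial token reduction with preserved or improved functional quality, highlighting a less-is-more effect in LLM context management.
Step 3 (main reported evidence): In a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal.
Step 4 (additional supporting or qualifying result): The framework is evaluated on 600 skills alongside the external SkillsBench benchmark, yielding several key findings.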

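Experimental setup & results

The framework is evaluated on 600 skills alongside the external SkillsBench benchmark (Section V of the paper).
SKILLREDUCER achieves significant token savings, with a 48% mean compression rate for descriptions and a 39% mean reduction in body tokens.
The optimization preserves functional quality, maintaining an 86.0% pass rate on task-based evaluations and a 100% pass rate on SkillsBench.
Counter-intuitively, the compressed skills improve functional quality by 2.8% over the originals, suggesting a less-is-more effect where removing non-essential content reduces distraction in the context window, particularly for longer and more verbose skills.
In a controlled baseline comparison at equivalent token budgets, SKILLREDUCER significantly outperforms LLMLingua [5], direct LLM compression, truncation, and random removal.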