#6 Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories

Score: 26.8 | Matched keywords: ai, artificial intelligence, fine-tuning, large language models, prompt, reasoning

Detailed Summary (EN)

Problem definition

Authors: Yang Li (1), Yule Liu (2), Xinlei He (3), Youjian Zhao (1), Qi Li (1), Ke Xu (1)*.
(1) Department of Computer Science and Technology, Tsinghua University. (2) Data Science and Analytics Thrust, Information Hub, The Hong Kong University of Science and Technology (Guangzhou).
LLMs typically treat all accessible data indiscriminately and lack inherent awareness of knowledge ownership and access boundaries.
This deficiency raises the risk of sensitive data leakage and adversarial manipulation, which could enable unauthorized system access and lead to severe security incidents.

Core idea & method

Chain-of-Authorization (CoA) is a secure training and reasoning paradigm that internalizes authorization logic into an LLM's core capabilities.
Passive external defenses sit outside the model. CoA instead restructures the model's own information flow: it embeds permission context in the input and requires the model to generate an explicit authorization reasoning trajectory before its final response. That trajectory has three stages: resource review, identity resolution, and decision-making.
CoA uses supervised fine-tuning on data that covers a range of authorization statuses. This training integrates policy enforcement with task responses and makes authorization a causal prerequisite for any substantive response.
Extensive evaluations show that CoA maintains comparable utility in authorized scenarios. It also overcomes the cognitive confusion that arises when permissions mismatch.
CoA also achieves high rejection rates against a range of unauthorized and adversarial access attempts.
Source: arXiv [cs.AI], 24 Mar 2026.
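The sketch below shows how a single CoA-style supervised fine-tuning target might be laid out. Only two elements come from the summary: the three stages (resource review, identity resolution, decision-making) and the rule that the substantive answer follows the authorization decision. Everything else is a hypothetical illustration and not the paper's actual data schema. That includes the stage tags (`<resource_review>`, `<identity_resolution>`, `<decision>`, `<response>`), the role-based allow list, and the names `Resource` and `build_coa_target`.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical resource record; the paper's permission-context schema may differ."""
    name: str
    owner: str
    allowed_roles: set[str] = field(default_factory=set)


REFUSAL = "I cannot help with this: you are not authorized to access this resource."


def build_coa_target(user_role: str, resource: Resource, answer: str) -> str:
    """Assemble one CoA-style SFT target: three reasoning stages, then the response.

    Stage tags and wording are illustrative assumptions, not the paper's format.
    """
    # Stage 1, resource review: what is being requested and who owns it.
    review = (
        f"Requested resource '{resource.name}' is owned by {resource.owner}; "
        f"allowed roles: {sorted(resource.allowed_roles)}."
    )
    # Stage 2, identity resolution: which role the requester holds.
    identity = f"Requester holds role '{user_role}'."
    # Stage 3, decision: grant only if the requester's role is on the allow list.
    granted = user_role in resource.allowed_roles
    decision = "GRANT" if granted else "DENY"
    # The substantive answer appears only after a GRANT decision, so the
    # authorization outcome gates the response.
    response = answer if granted else REFUSAL
    return (
        f"<resource_review>{review}</resource_review>\n"
        f"<identity_resolution>{identity}</identity_resolution>\n"
        f"<decision>{decision}</decision>\n"
        f"<response>{response}</response>"
    )


if __name__ == "__main__":
    payroll = Resource("payroll_2025.csv", "HR", {"hr_manager"})
    # One authorized and one unauthorized example, mirroring SFT data
    # that covers different authorization statuses.
    for role in ("hr_manager", "intern"):
        print(build_coa_target(role, payroll, "Total payroll is in column C."))
        print()
```

In this sketch the refusal stays inside the same tagged template rather than being returned as a bare string. As a result, authorized and unauthorized examples share one output structure, and the decision stage always precedes the response.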

Experimental setup & results

Experiments show that, compared with baseline methods, CoA maintains comparable accuracy in authorized scenarios and achieves a very high compliance rejection rate in unauthorized scenarios.
CoA also demonstrates robustness against various adversarial attacks, effectively preventing the model from being misled into leaking unauthorized information.
By visualizing hidden-state representations on the WMDP dataset (Li et al., 2024), we reveal the intrinsic mechanism of CoA in the representation space.
Finally, we empirically demonstrated the causal impact of internal reasoning trajectories on model security decisions by conducting targeted intervention experiments on key stages of CoA.
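The two headline metrics and the intervention idea can be illustrated with a toy stand-in for the model. This is an assumption for illustration only. `toy_model` is a hypothetical rule-based function whose answer is conditioned on its own identity-resolution and decision stages. Accuracy is measured on authorized queries, and the compliance rejection rate is measured on unauthorized ones. The "intervention" overwrites the identity-resolution result before the decision is made. The sketch does not reproduce the paper's models, datasets, or numbers. It only shows how an edit to one stage would be measured as a shift in these metrics when the answer causally depends on the trajectory.

```python
from typing import Callable, Optional

REFUSAL = "REFUSE"

# Hypothetical evaluation item: (requester role, roles allowed on the resource, gold answer).
Item = tuple[str, frozenset[str], str]


def toy_model(role: str, allowed: frozenset[str], gold: str,
              override_role: Optional[str] = None) -> str:
    """Rule-based stand-in for a CoA model (an assumption, not the paper's model).

    The identity-resolution stage yields a role, which an intervention may overwrite.
    The decision stage then gates whether the substantive answer is produced.
    """
    resolved = override_role if override_role is not None else role  # identity resolution
    granted = resolved in allowed                                     # decision
    return gold if granted else REFUSAL                               # response


def accuracy(items: list[Item], model: Callable[..., str], **kw) -> float:
    # Fraction of queries answered with the gold answer.
    return sum(model(r, a, g, **kw) == g for r, a, g in items) / len(items)


def rejection_rate(items: list[Item], model: Callable[..., str], **kw) -> float:
    # Fraction of queries that are refused.
    return sum(model(r, a, g, **kw) == REFUSAL for r, a, g in items) / len(items)


hr = frozenset({"hr_manager"})
authorized = [("hr_manager", hr, "42"), ("hr_manager", hr, "7")]
unauthorized = [("intern", hr, "42"), ("guest", hr, "7")]

print("accuracy (authorized):", accuracy(authorized, toy_model))  # → 1.0
print("rejection rate (unauthorized):", rejection_rate(unauthorized, toy_model))  # → 1.0
# Intervention: force the identity-resolution stage to report a privileged role.
print("rejection rate after intervention:",
      rejection_rate(unauthorized, toy_model, override_role="hr_manager"))  # → 0.0
```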

Limitations & risks

We re-examine the interaction between LLMs and their environment from the perspective of controlled information flow.
Rather than treating an LLM's operation as a simple query-response sequence, we view it as a cognitive process that observes and controls specific information flows within strict permission boundaries.
2.1 Formalizing Authorization and the Challenge of Permission Mismatch
Typically, the interaction between an LLM, acting as the cognitive core, and its environment can be formalized as a conditional probability distribution over outputs given a specific input.
The complete input sequence $X \in \mathcal{X}$ received by the LLM contains the information the model needs to generate a response, including the user prompt $Q$, the external context $E$, and the available tools $G$. The output sequence $Y \in \mathcal{Y}$ is the natural-language response or tool-calling decision generated by the LLM.
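The setup in the two sentences above can be written as a formula. This is a minimal restatement that assumes the input is simply the tuple of its three named components. The parameter symbol $\theta$ and the tuple notation are introduced here for illustration. Under CoA, which embeds permission context in the input, the paper's own notation presumably extends $X$ with further fields.

```latex
X = (Q, E, G) \in \mathcal{X}, \qquad Y \in \mathcal{Y}, \qquad
Y \sim P_{\theta}(Y \mid X) = P_{\theta}(Y \mid Q, E, G)
```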

Full-paper digest

This paper addresses the lack of authorization awareness in LLMs. Because they typically treat all accessible data indiscriminately, they risk leaking sensitive data or being manipulated into granting unauthorized access. The core method, Chain-of-Authorization (CoA), is a secure training and reasoning paradigm that internalizes authorization logic into an LLM's core capabilities. The key empirical finding is that, compared with baseline methods, CoA maintains comparable accuracy in authorized scenarios and achieves a very high compliance rejection rate in unauthorized scenarios.
