#3 Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

Score: 17.0 | Matched keywords: alignment, large language models, llm, prompt

Detailed Summary (EN)

Problem definition

Large Language Models (LLMs) have achieved remarkable success in natural language processing and are increasingly deployed through Web platforms, powering search engines [43], chat interfaces [28], social media [31], and other online applications.
These Web-mediated deployments enable unprecedented research opportunities, such as large-scale behavioral studies and real-time data collection [4].
At the same time, they expose critical security and privacy risks, as the generation of harmful or unsafe content can compromise user safety and erode the integrity of the Web ecosystem [39].
Existing safeguards mitigate some risks but remain far from foolproof, making systematic evaluation of LLM vulnerabilities a pressing need for model developers, Web platform operators, and the broader AI safety community [24, 33, 42].

Core idea & method

While recent studies have shown that leveraging long-tail distributions can facilitate LLM jailbreaks, existing approaches largely rely on handcrafted rules, limiting the systematic [...]

Experimental setup & results

In this work, we present EvoJail, an automated framework for discovering long-tail distribution attacks via multi-objective evolutionary search.
EvoJail formulates long-tail attack prompt generation as a multi-objective optimization problem that jointly maximizes attack effectiveness and minimizes output perplexity, and introduces a semantic-algorithmic solution representation to capture both high-level semantic intent and low-level structural transformations of encryption-decryption logic.
Building upon this representation, EvoJail integrates LLM-assisted operators into a multi-objective evolutionary framework, enabling adaptive and semantically informed mutation and crossover for efficiently exploring a highly structured and open-ended search space.
Extensive experiments demonstrate that EvoJail consistently discovers diverse and effective long-tail jailbreak strategies, achieving performance competitive with existing methods at both the individual and ensemble levels.
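The two-objective formulation above (maximize attack effectiveness, minimize output perplexity) can be illustrated with a generic Pareto-dominance filter. This is a minimal sketch, not EvoJail's selection procedure, which this digest does not describe: the `Candidate` type, the score scales, and the sample values are all hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str             # hypothetical label for a candidate solution
    effectiveness: float  # objective 1: higher is better (assumed 0..1 scale)
    perplexity: float     # objective 2: lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.effectiveness >= b.effectiveness and a.perplexity <= b.perplexity
    strictly_better = a.effectiveness > b.effectiveness or a.perplexity < b.perplexity
    return no_worse and strictly_better


def pareto_front(population: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in population
            if not any(dominates(other, c) for other in population)]


# Made-up scores, for illustration only.
population = [
    Candidate("a", 0.9, 40.0),
    Candidate("b", 0.7, 15.0),
    Candidate("c", 0.6, 30.0),  # dominated by "b"
    Candidate("d", 0.9, 55.0),  # dominated by "a"
]
front = pareto_front(population)
print([c.name for c in front])
```

Candidates "a" and "b" survive because neither is better on both objectives. A multi-objective search keeps such a set of trade-offs rather than a single best solution.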
∗Corresponding author. [cs.CR] 20 Mar 2026

Limitations & risks

motivates the need for a representation that can model what the transformation intends to achieve and how it is algorithmically realized.
    def encryption(sentence):
        words = sentence.split()
        if not words:
            return {'encrypted': [], 'seed': 0}
        # Hybrid seed based on input characteristics
        seed = (len(sentence) + len(words)) % 2 + 2
        groups = [words[i:i+seed] for i in range(0, len(words), seed)]
        # Reverse groups based on seed-dependent pattern
        for i in range(len(groups)):
            if i % (seed + 1) == 0:
                groups[i] = groups[i][::-1]
        # Flatten the groups back into a single list of words
        encrypted = [word for group in groups for word in group]
        return {'encrypted': encrypted, 'seed': seed}

    def decryption(encrypted_data):
        encrypted = encrypted_data['encrypted']
        seed = encrypted_data['seed']
        if not encrypted:
            return ''
        # Re-group using the same seed
        groups = [encrypted[i:i+seed] for i in range(0, len(encrypted), seed)]
        # Reverse every other group to restore original order
        # (note: this odd-index rule differs from the i % (seed + 1) == 0 rule
        # in encryption(), so as written the round trip is not exact)
        for i in range(len(groups)):
            if i % 2 == 1:
                groups[i] = groups[i][::-1]
        decrypted = [word for group in groups for word in group]
        return ' '.join(decrypted)

Figure 2: An example of an encryption-decryption scheme.
3.1.1 End-to-End Attack Pipeline
The long-tail distribution attack is modeled as an end-to-end transformation pipeline, in which malicious intent is concealed through algorithmic obfuscation and later reconstructed within the model's internal reasoning process.
An encryption function E and a corresponding decryption function D constitute the core components of the attack.
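If D is meant to invert E, the pair should satisfy D(E(x)) = x. The sketch below checks that property on a few sample sentences. It is a toy under stated assumptions, not the paper's scheme: `encrypt`, `decrypt`, `round_trips`, and the fixed group size are all hypothetical, and the transform simply reverses every word group.

```python
from typing import Callable, Iterable


def encrypt(sentence: str, group_size: int = 2) -> dict:
    """Toy E: split the words into fixed-size groups and reverse each group."""
    words = sentence.split()
    groups = [words[i:i + group_size] for i in range(0, len(words), group_size)]
    return {"encrypted": [w for g in groups for w in g[::-1]], "seed": group_size}


def decrypt(data: dict) -> str:
    """Toy D: re-group with the stored size and reverse each group again."""
    words, size = data["encrypted"], data["seed"]
    groups = [words[i:i + size] for i in range(0, len(words), size)]
    return " ".join(w for g in groups for w in g[::-1])


def round_trips(E: Callable[[str], dict], D: Callable[[dict], str],
                samples: Iterable[str]) -> bool:
    """True if D(E(s)) reproduces every sample, up to whitespace normalization."""
    return all(D(E(s)) == " ".join(s.split()) for s in samples)


samples = ["the quick brown fox jumps", "one", "", "a b c d"]
print(round_trips(encrypt, decrypt, samples))
```

Reversing a group twice restores it, so this toy pair passes the check. A pair whose two reversal rules differ would fail it on most multi-word inputs.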

Read-like-fullpaper digest

This paper addresses the security risks of Large Language Models (LLMs), which have achieved remarkable success in natural language processing and are increasingly deployed through Web platforms, powering search engines [43], chat interfaces [28], social media [31], and other online applications. Its starting point is that, while recent studies have shown that leveraging long-tail distributions can facilitate jailbreaks, existing approaches largely rely on handcrafted rules, limiting the systematic [...] The main contribution is EvoJail, an automated framework for discovering long-tail distribution attacks via multi-objective evolutionary search.

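Detailed Summary (KO, translated to English)

Problem definition

Large Language Models (LLMs) have achieved remarkable success in natural language processing and are increasingly deployed through Web platforms, powering search engines [43], chat interfaces [28], social media [31], and other online applications.
These Web-based deployments enable unprecedented research opportunities, such as large-scale behavioral studies and real-time data collection [4].
At the same time, they expose critical security and privacy risks, because the generation of harmful or unsafe content can compromise user safety and erode the integrity of the Web ecosystem [39].
Existing safeguards mitigate some risks but are far from foolproof, so systematic evaluation of LLM vulnerabilities is a pressing need for model developers, Web platform operators, and the broader AI safety community [24, 33, 42].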

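Core idea & method

While recent studies have shown that leveraging long-tail distributions can facilitate LLM jailbreaks, existing approaches largely rely on handcrafted rules, limiting the systematic [...]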

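Experimental setup & results

In this work, we present EvoJail, an automated framework for discovering long-tail distribution attacks via multi-objective evolutionary search.
EvoJail formulates long-tail attack prompt generation as a multi-objective optimization problem that jointly maximizes attack effectiveness and minimizes output perplexity, and introduces a semantic-algorithmic solution representation to capture both the high-level semantic intent and the low-level structural transformations of encryption-decryption logic.
Building upon this representation, EvoJail integrates LLM-assisted operators into a multi-objective evolutionary framework, enabling adaptive and semantically informed mutation and crossover for efficiently exploring a highly structured and open-ended search space.
Extensive experiments demonstrate that EvoJail consistently discovers diverse and effective long-tail jailbreak strategies, achieving performance competitive with existing methods at both the individual and ensemble levels.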

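Limitations & risks

motivates the need for a representation that can model what the transformation intends to achieve and how it is algorithmically realized.
    def encryption(sentence):
        words = sentence.split()
        if not words:
            return {'encrypted': [], 'seed': 0}
        # Hybrid seed based on input characteristics
        seed = (len(sentence) + len(words)) % 2 + 2
        groups = [words[i:i+seed] for i in range(0, len(words), seed)]
        # Reverse groups based on seed-dependent pattern
        for i in range(len(groups)):
            if i % (seed + 1) == 0:
                groups[i] = groups[i][::-1]
        # Flatten the groups back into a single list of words
        encrypted = [word for group in groups for word in group]
        return {'encrypted': encrypted, 'seed': seed}

    def decryption(encrypted_data):
        encrypted = encrypted_data['encrypted']
        seed = encrypted_data['seed']
        if not encrypted:
            return ''
        # Re-group using the same seed
        groups = [encrypted[i:i+seed] for i in range(0, len(encrypted), seed)]
        # Reverse every other group to restore original order
        # (note: this odd-index rule differs from the i % (seed + 1) == 0 rule
        # in encryption(), so as written the round trip is not exact)
        for i in range(len(groups)):
            if i % 2 == 1:
                groups[i] = groups[i][::-1]
        decrypted = [word for group in groups for word in group]
        return ' '.join(decrypted)

Figure 2: An example of an encryption-decryption scheme.
3.1.1 End-to-End Attack Pipeline
The long-tail distribution attack is modeled as an end-to-end transformation pipeline, in which malicious intent is concealed through algorithmic obfuscation and later reconstructed within the model's internal reasoning process.
An encryption function E and a corresponding decryption function D constitute the core components of the attack.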

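Read-like-fullpaper digest

This paper addresses the security risks of Large Language Models (LLMs), which have achieved remarkable success in natural language processing and are increasingly deployed through Web platforms, powering search engines [43], chat interfaces [28], social media [31], and other online applications. Its starting point is that, while recent studies have shown that leveraging long-tail distributions can facilitate jailbreaks, existing approaches largely rely on handcrafted rules, limiting the systematic [...] The main contribution is EvoJail, an automated framework for discovering long-tail distribution attacks via multi-objective evolutionary search.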