#6 Adversarial Prompt Injection Attack on Multimodal Large Language Models

Score: 17.2 | Matched keywords: large language models, multimodal, prompt

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles a limitation of existing targeted adversarial attacks on multimodal large language models (MLLMs): they predominantly formulate the attack objective as reproducing the semantic description of another natural image, which inherently limits the expressivity of feasible malicious prompts. Accordingly, the authors explore a complementary attack paradigm, termed adv… The paper is an arXiv preprint ([cs.CV], 31 Mar 2026) with authors affiliated with the Rapid-Rich Object Search Lab, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.

The core proposal is an imperceptible visual perturbation that is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. The visual target is instantiated as a text-rendered image and progressively refined during optimization so that it faithfully represents the desired malicious prompts and improves transferability.

The empirical case is built around extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs, compared against existing methods. Beyond the GPT models, Gemini-2.5 also exhibits a pronounced vulnerability to the attack, with 79% and 81% success rates under the soft and hard criteria, respectively. One reported table covers performance under the hard criterion (target text) on the VQA task against different closed-source MLLMs. Where all attacks show limited effectiveness, the proposed method still achieves the highest ASR and AvgSim.

The central reported finding is that the proposed method substantially outperforms other methods, especially on the GPT-family models. Results are reported as attack success rate (ASR) and average similarity score (AvgSim).

The paper also makes it clear that a key challenge is constructing a targeted image that is semantically consistent with the targeted text, so that the joint feature provides a coherent cross-modal supervision signal for updating the adversarial perturbation on the source image. It reports a case in which performance is limited, potentially due to cross-modal representation mismatch between image and text features, and notes that with only a subtle text trigger the message may be too inconspicuous for the model to reliably perceive. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: The proposed method substantially outperforms other methods, especially on the GPT-family models.
Most important supporting result: Beyond the GPT models, Gemini-2.5 is also vulnerable, with 79% and 81% success rates under the soft and hard criteria, respectively; results are reported as attack success rate (ASR) and average similarity score (AvgSim).
Important caution: A key challenge is constructing a targeted image that is semantically consistent with the targeted text, so that the joint feature provides a coherent cross-modal supervision signal for updating the adversarial perturbation on the source image.

Problem definition

Existing targeted adversarial attacks on MLLMs predominantly formulate the attack objective as reproducing the semantic description of another natural image, which inherently limits the expressivity of feasible malicious prompts.
Accordingly, the authors explore a complementary attack paradigm, termed adv…
Illustration of adversarial prompt injection attacks, where the adversary manipulates the behavior of MLLMs through imperceptible visual prompt injection.

Core idea & method

An imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels.
The visual target is instantiated as a text-rendered image and progressively refined during optimization to faithfully represent the desired malicious prompts and improve transferability.
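
As a rough illustration of the alignment idea in this section, the sketch below runs a bounded, signed-gradient update on a toy problem. It is not the authors' implementation. The linear "encoder" `W`, the two target feature vectors, the averaging used to form the joint target, the squared-distance loss, and the values of `eps`, `step`, and `iters` are all assumptions made for illustration. The coarse- and fine-grained levels and the progressive refinement of the visual target are not modeled.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sign(g):
    """Return -1, 0, or 1 depending on the sign of g."""
    return (g > 0) - (g < 0)


def align_features(W, source, target_feat, eps=0.1, step=0.02, iters=50):
    """Signed-gradient descent on ||W(source + delta) - target_feat||^2,
    keeping every |delta_j| <= eps (a stand-in for the imperceptibility budget)."""
    delta = [0.0] * len(source)
    for _ in range(iters):
        x = [s + d for s, d in zip(source, delta)]
        resid = [f - t for f, t in zip(matvec(W, x), target_feat)]
        # Gradient of the squared distance w.r.t. delta is 2 * W^T resid.
        grad = [2 * sum(W[i][j] * resid[i] for i in range(len(W)))
                for j in range(len(source))]
        # Step against the gradient sign, then clip back into the eps-box.
        delta = [max(-eps, min(eps, d - step * sign(g)))
                 for d, g in zip(delta, grad)]
    return delta


# Toy stand-ins (all hypothetical): a linear "encoder" and two target features.
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
source = [0.2, 0.4, 0.6]
visual_target_feat = [1.05, -0.05]
textual_target_feat = [0.95, 0.05]
# Assumption: the joint target is a plain average of the two modalities.
target_feat = [(v + t) / 2 for v, t in zip(visual_target_feat, textual_target_feat)]

delta = align_features(W, source, target_feat)
before = sq_dist(matvec(W, source), target_feat)
after = sq_dist(matvec(W, [s + d for s, d in zip(source, delta)]), target_feat)
print(f"distance before: {before:.4f}, after: {after:.4f}")
```

The clipping step is what keeps the perturbation small: the feature distance shrinks only as far as the `eps` budget allows, which is why the toy run reduces the distance without reaching zero.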

Actual findings

The proposed method substantially outperforms other methods, especially on the GPT-family models.
Results are reported as attack success rate (ASR) and average similarity score (AvgSim).

How the conclusion was reached

Step 1 - Proposed approach: An imperceptible visual perturbation is iteratively optimized to align the features of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels; the visual target is a text-rendered image that is progressively refined during optimization.
Step 2 - Evaluation setup or comparison basis: Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs, compared against existing methods.
Step 3 - Main reported evidence: The proposed method substantially outperforms other methods, especially on the GPT-family models.
Step 4 - Additional supporting or qualifying result: Results are reported as attack success rate (ASR) and average similarity score (AvgSim); Gemini-2.5 reaches 79% and 81% success rates under the soft and hard criteria, respectively.
Step 5 - Claim boundary / limitation: A key challenge is constructing a targeted image that is semantically consistent with the targeted text, so that the joint feature provides a coherent cross-modal supervision signal for updating the adversarial perturbation on the source image.

Experimental setup & results

Experiments cover two multimodal understanding tasks across multiple closed-source MLLMs.
The proposed method substantially outperforms other methods, especially on the GPT-family models.
Beyond the GPT models, Gemini-2.5 also exhibits a pronounced vulnerability to the attack, with 79% and 81% success rates under the soft and hard criteria, respectively.
One reported table covers performance under the hard criterion (target text) on the VQA task against different closed-source MLLMs.
Where all attacks show limited effectiveness, the proposed method still achieves the highest ASR and AvgSim.
Results are reported as attack success rate (ASR) and average similarity score (AvgSim); an attack is considered successful if its similarity score exceeds 0.3.
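
To make the two reported metrics concrete, here is a minimal sketch of how ASR and AvgSim could be computed from per-sample similarity scores with the 0.3 threshold. The digest does not say how the similarity score itself is obtained, whether AvgSim averages over all samples, or which criterion (soft or hard) the threshold belongs to. The function name `summarize_attack`, the example scores, and the choice to average over every sample are therefore assumptions.

```python
from statistics import mean


def summarize_attack(similarities, threshold=0.3):
    """Return (ASR, AvgSim) for a list of per-sample similarity scores.

    A sample counts as a success only if its score strictly exceeds
    the threshold, matching the wording "exceeds 0.3".
    """
    if not similarities:
        raise ValueError("need at least one similarity score")
    successes = sum(1 for s in similarities if s > threshold)
    asr = successes / len(similarities)
    # Assumption: AvgSim is the mean over all samples, not only successes.
    avg_sim = mean(similarities)
    return asr, avg_sim


# Hypothetical scores for four attacked images.
scores = [0.10, 0.35, 0.30, 0.80]
asr, avg_sim = summarize_attack(scores)
print(f"ASR = {asr:.2%}, AvgSim = {avg_sim:.4f}")
```

Note that a score of exactly 0.3 does not count as a success under the strict comparison used here.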

Limitations & risks

A key challenge is constructing a targeted image that is semantically consistent with the targeted text, so that the joint feature provides a coherent cross-modal supervision signal for updating the adversarial perturbation on the source image.
To address this challenge, the authors propose a dynamic targeted image scheme that initializes a base image and iteratively refines it throughout the attack to improve effectiveness.
The paper also reports a case in which performance is limited, potentially due to cross-modal representation mismatch between image and text features.
With only a subtle text trigger, the message may be too inconspicuous for the model to reliably perceive.

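Detailed Summary (translated from Korean)

Read-like-fullpaper digest

This paper tackles a limitation of existing targeted adversarial attacks on MLLMs: they predominantly formulate the attack objective as reproducing the semantic description of another natural image, which inherently limits the expressivity of feasible malicious prompts. Accordingly, the authors explore a complementary attack paradigm, termed adv…

The core proposal is an imperceptible visual perturbation that is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. The visual target is instantiated as a text-rendered image and progressively refined during optimization so that it faithfully represents the desired malicious prompts and improves transferability.

The empirical case is built around extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs, compared against existing methods. Beyond the GPT models, Gemini-2.5 also exhibits a pronounced vulnerability to the attack, with 79% and 81% success rates under the soft and hard criteria, respectively. One reported table covers performance under the hard criterion (target text) on the VQA task against different closed-source MLLMs. Where all attacks show limited effectiveness, the proposed method still achieves the highest ASR and AvgSim.

The central reported finding is that the proposed method substantially outperforms other methods, especially on the GPT-family models. Results are reported as attack success rate (ASR) and average similarity score (AvgSim).

The paper also makes it clear that a key challenge is constructing a targeted image that is semantically consistent with the targeted text, so that the joint feature provides a coherent cross-modal supervision signal for updating the adversarial perturbation on the source image. It reports a case in which performance is limited due to representation mismatch between image and text features, and notes that with only a subtle text trigger the message may be too inconspicuous for the model to reliably perceive. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should be read in light of the evaluation setup and stated limitations.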

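Final takeaway

Main takeaway: The proposed method substantially outperforms other methods, especially on the GPT-family models.
Most important supporting result: The attack success rate (ASR) and the average similarity score (AvgSim) are reported.
Important caution: A key challenge is constructing a targeted image that is semantically consistent with the targeted text, so that the joint feature provides a coherent cross-modal supervision signal for updating the adversarial perturbation on the source image.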

Problem definition

Existing targeted adversarial attacks on MLLMs predominantly formulate the attack objective as reproducing the semantic description of another natural image, which inherently limits the expressivity of feasible malicious prompts.
Accordingly, the authors explore a complementary attack paradigm, termed adv…
Illustration of adversarial prompt injection attacks, in which the adversary manipulates the behavior of MLLMs through imperceptible visual prompt injection.

Core idea & method

An imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels.
The visual target is instantiated as a text-rendered image and progressively refined during optimization to faithfully represent the desired malicious prompts and improve transferability.

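Actual findings

The proposed method substantially outperforms other methods, especially on the GPT-family models.
Results are reported as attack success rate (ASR) and average similarity score (AvgSim).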

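How the conclusion was reached

Step 1 - Proposed approach: An imperceptible visual perturbation is iteratively optimized to align the features of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels; the visual target is a text-rendered image that is progressively refined during optimization.
Step 2 - Evaluation setup or comparison basis: Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs, compared against existing methods.
Step 3 - Main reported evidence: The proposed method substantially outperforms other methods, especially on the GPT-family models.
Step 4 - Additional supporting or qualifying result: The attack success rate (ASR) and the average similarity score (AvgSim) are reported.
Step 5 - Claim boundary / limitation: A key challenge is constructing a targeted image that is semantically consistent with the targeted text, so that the joint feature provides a coherent cross-modal supervision signal for updating the adversarial perturbation on the source image.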

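Experimental setup & results

Beyond the GPT models, Gemini-2.5 also exhibits a pronounced vulnerability to the attack, with 79% and 81% success rates under the soft and hard criteria, respectively.
One reported table covers performance under the hard criterion (target text) on the VQA task against different closed-source MLLMs.
Where all attacks show limited effectiveness, the proposed method still achieves the highest ASR and AvgSim.
The proposed method substantially outperforms other methods, especially on the GPT-family models.
Results are reported as attack success rate (ASR) and average similarity score (AvgSim); an attack is considered successful if its similarity score exceeds 0.3.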

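Limitations & risks

A key challenge is constructing a targeted image that is semantically consistent with the targeted text, so that the joint feature provides a coherent cross-modal supervision signal for updating the adversarial perturbation on the source image.
To address this challenge, the authors propose a dynamic targeted image scheme that initializes a base image and iteratively refines it throughout the attack to improve effectiveness.
The paper also reports a case in which performance is limited due to representation mismatch between image and text features.
With only a subtle text trigger, the message may be too inconspicuous for the model to reliably perceive.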