#3 Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference

Score: 23.8 | Matched keywords: large language model, large language models, llm, multimodal, token

Detailed Summary (EN)

Read-like-fullpaper digest

This paper starts from the observation that Large Language Models (LLMs) have revolutionized natural language processing by enabling tasks such as text generation, summarization, and reasoning at scale, becoming the backbone of applications from conversational agents to code assistants. Multimodal LLMs (MLLMs) process diverse modalities alongside text, unlocking capabilities such as image reasoning, video summarization, and audio captioning, while preserving the interactive nature of traditional LLMs. In addition, emerging “any-to-any” models such as Next-GPT [50] go beyond text outputs, enabling responses in multiple modalities (e.g., generating images or videos), further expanding the scope of multimodal AI.

Existing LLM serving systems, optimized for text-only workloads, fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation. These heterogeneous workloads also introduce additional inference stages, such as vision preprocessing and encoding, that inflate latency and memory demand. The authors' key insight is that multimodal requests differ by orders of magnitude in resource demands, which they capture through a simple abstraction: videos behave like rocks, images like pebbles, and text like sand. On that basis they design RPS-Serve, a modality-aware scheduler that lets sand flow quickly through pebbles and rocks, ensuring interactive responsiveness while avoiding starvation.

The empirical case is built around the size gap between modalities. Text prompts, while highly variable in length [7, 36, 41, 51, 55], remain lightweight compared to visual inputs. Optimizations like chunked prefill [2, 3, 43] reduce head-of-line blocking for long text prompts, but fail under images or videos, whose size is orders of magnitude larger.

The central reported finding is that, in an evaluation across state-of-the-art MLLMs, RPS-Serve reduces time-to-first-token (TTFT) by 54% on average overall, and by 78.5% for latency-critical requests, compared to current systems.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: across state-of-the-art MLLMs, RPS-Serve reduces time-to-first-token (TTFT) by 54% on average overall, and by 78.5% for latency-critical requests, compared to current systems.
Most important supporting result: optimizations like chunked prefill [2, 3, 43] reduce head-of-line blocking for long text prompts, but fail under images or videos, whose size is orders of magnitude larger.

Problem definition

Large Language Models (LLMs) have revolutionized natural language processing by enabling tasks such as text generation, summarization, and reasoning at scale, becoming the backbone of applications from conversational agents to code assistants.
In addition, emerging “any-to-any” models such as Next-GPT [50] go beyond text outputs, enabling responses in multiple modalities (e.g., generating images or videos), further expanding the scope of multimodal AI.
MLLMs process diverse modalities alongside text, unlocking capabilities such as image reasoning, video summarization, and audio captioning, while preserving the interactive nature of traditional LLMs.
However, the authors' motivational analysis shows that under multimodal workloads, first-come, first-served (FCFS) scheduling fails: large image and video requests monopolize GPU resources during prefill, causing severe head-of-line blocking.
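
The head-of-line blocking described above can be illustrated with a toy simulation. This is a minimal sketch, not the paper's experiment: the `Request` class, the `fcfs_ttft` function, and all arrival times and prefill costs are hypothetical, and TTFT is approximated as the time from arrival until the request's prefill finishes on a single GPU that runs one prefill at a time.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: float  # arrival time in seconds (hypothetical)
    prefill: float  # prefill cost in seconds (hypothetical)


def fcfs_ttft(requests):
    """Return {name: TTFT} when prefills run one at a time in arrival order."""
    clock = 0.0
    ttft = {}
    for req in sorted(requests, key=lambda r: r.arrival):
        start = max(clock, req.arrival)  # wait until the GPU is free
        clock = start + req.prefill      # prefill runs to completion
        ttft[req.name] = clock - req.arrival
    return ttft


# One large video request arrives just before two small text requests.
requests = [
    Request("video", arrival=0.0, prefill=8.0),
    Request("text-1", arrival=0.1, prefill=0.05),
    Request("text-2", arrival=0.2, prefill=0.05),
]
result = fcfs_ttft(requests)
for name, seconds in result.items():
    print(f"{name}: TTFT = {seconds:.2f}s")
```

With these made-up numbers, both text requests wait almost eight seconds behind the video even though each needs only 50 ms of prefill, which is the pattern the digest attributes to FCFS.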

Core idea & method

We design RPS-Serve, a modality-aware scheduler that lets sand flow quickly through pebbles and rocks, ensuring interactive responsiveness while avoiding starvation.
Existing LLM serving systems, optimized for text-only workloads, fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation.
Our key insight is that multimodal requests differ by orders of magnitude in resource demands, which we capture through a simple abstraction: videos behave like rocks, images like pebbles, and text like sand.
These heterogeneous workloads introduce additional inference stages, such as vision preprocessing and encoding, that inflate latency and memory demand.
RPS-Serve classifies requests, prioritizes them dynamically, and applies aging to avoid starvation.
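
The three steps named above (classify, prioritize, age) can be sketched as a small priority scheduler. This is an illustrative sketch only, not RPS-Serve's actual implementation: the `AgingScheduler` class, the base priorities, the linear aging rule, and the `AGING_RATE` value are all assumptions made for the example; the source does not specify how priorities or aging are computed.

```python
from dataclasses import dataclass

# Hypothetical base priorities: a lower value is served first.
BASE_PRIORITY = {"sand": 0.0, "pebble": 10.0, "rock": 100.0}
AGING_RATE = 5.0  # hypothetical priority credit earned per second of waiting


def classify(modality: str) -> str:
    """Map a modality to the rock/pebble/sand abstraction from the paper."""
    return {"text": "sand", "image": "pebble", "video": "rock"}[modality]


@dataclass
class Request:
    name: str
    modality: str
    arrival: float  # arrival time in seconds


class AgingScheduler:
    def __init__(self):
        self.waiting = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def effective_priority(self, req: Request, now: float) -> float:
        # Aging: the longer a request waits, the lower (better) its value.
        waited = now - req.arrival
        return BASE_PRIORITY[classify(req.modality)] - AGING_RATE * waited

    def next_request(self, now: float):
        """Pop the waiting request with the best effective priority."""
        if not self.waiting:
            return None
        best = min(
            self.waiting,
            key=lambda r: (self.effective_priority(r, now), r.arrival),
        )
        self.waiting.remove(best)
        return best


sched = AgingScheduler()
sched.submit(Request("video-1", "video", arrival=0.0))
sched.submit(Request("text-1", "text", arrival=1.0))
first = sched.next_request(now=1.0)    # sand overtakes a rock that waited 1 s
sched.submit(Request("text-2", "text", arrival=29.9))
second = sched.next_request(now=30.0)  # the aged rock now beats fresh sand
print(first.name, second.name)
```

In this toy run the text request overtakes the freshly queued video, but after the video has waited 30 seconds its aged priority beats a newly arrived text request, so it is not starved.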

Actual findings

An evaluation across state-of-the-art MLLMs shows that RPS-Serve reduces time-to-first-token (TTFT) by 54% on average overall, and by 78.5% for latency-critical requests, compared to current systems.
Optimizations like chunked prefill [2, 3, 43] reduce head-of-line blocking for long text prompts, but fail under images or videos, whose size is orders of magnitude larger.
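
For readers who want to reproduce a figure of this kind on their own traces, a percentage TTFT reduction can be computed as below. This is a sketch under assumptions: the digest does not say how the paper averages its results, so `ttft_reduction` simply compares mean TTFT between two runs, and the sample values are invented for illustration rather than taken from the paper.

```python
def ttft_reduction(baseline_ttft, new_ttft):
    """Percent reduction in mean TTFT of a new system relative to a baseline."""
    base_mean = sum(baseline_ttft) / len(baseline_ttft)
    new_mean = sum(new_ttft) / len(new_ttft)
    return 100.0 * (base_mean - new_mean) / base_mean


# Hypothetical per-request TTFT measurements in seconds.
baseline = [2.0, 4.0, 6.0]
candidate = [1.0, 2.0, 3.0]
reduction = ttft_reduction(baseline, candidate)
print(f"TTFT reduction: {reduction:.1f}%")
```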

How the conclusion was reached

Step 1 (Proposed approach): We design RPS-Serve, a modality-aware scheduler that lets sand flow quickly through pebbles and rocks, ensuring interactive responsiveness while avoiding starvation.
Step 2 (Evaluation setup or comparison basis): Text prompts, while highly variable in length [7, 36, 41, 51, 55], remain lightweight compared to visual inputs.
Step 3 (Main reported evidence): An evaluation across state-of-the-art MLLMs shows that RPS-Serve reduces time-to-first-token (TTFT) by 54% on average overall, and by 78.5% for latency-critical requests, compared to current systems.
Step 4 (Additional supporting or qualifying result): Optimizations like chunked prefill [2, 3, 43] reduce head-of-line blocking for long text prompts, but fail under images or videos, whose size is orders of magnitude larger.

Experimental setup & results

An evaluation across state-of-the-art MLLMs shows that RPS-Serve reduces time-to-first-token (TTFT) by 54% on average overall, and by 78.5% for latency-critical requests, compared to current systems.
Optimizations like chunked prefill [2, 3, 43] reduce head-of-line blocking for long text prompts, but fail under images or videos, whose size is orders of magnitude larger.
Text prompts, while highly variable in length [7, 36, 41, 51, 55], remain lightweight compared to visual inputs.

Limitations & risks
