#8 LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Score: 13.4 | Matched keywords: alignment, benchmark, diffusion, large language models, multimodal

Detailed Summary (EN)

Problem definition

In recent years, diffusion models [11, 12, 33] have driven remarkable progress, establishing new performance standards in text-to-video generation [13, 3, 28, 51], particularly through the adoption of Diffusion Transformer (DiT) architectures [31].
These advances have laid a solid groundwork for customized video generation [27, 48, 52, 5, 15, 8, 24], where high-degree-of-freedom personalization unlocks transformative applications ranging from virtual theatrical production to e-commerce—enabling fine-grained control over both backgrounds and foregrounds, including multiple interacting subjects.
Yet, realizing open-set personalized multi-subject video generation under such flexible and complex conditions remains profoundly challenging.
The task requires not only the precise integration of diverse and interrelated conditioning signals but also the preservation of temporal coherence and identity fidelity across all subjects.

Core idea & method

On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies.
These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark.
On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject–attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters.
Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.
Date: March 23, 2026
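
To make the attention idea above concrete, here is a minimal sketch of one way that "intra-group cohesion" and "inter-group separation" could be expressed: an additive bias on the attention logits, built from per-token subject labels. This is an illustration under stated assumptions, not the LumosX implementation. The function names (`relational_bias`, `biased_attention`), the bias values, and the token layout are all hypothetical, and the position-aware embeddings mentioned in the summary are omitted.

```python
import numpy as np

def relational_bias(group_ids, intra=1.0, inter=-1.0):
    """Additive attention bias from per-token group labels (hypothetical helper).

    group_ids[i] is the subject group of token i, or -1 for tokens that
    belong to no subject (e.g. background); those receive zero bias.
    """
    g = np.asarray(group_ids)
    grouped = (g[:, None] >= 0) & (g[None, :] >= 0)  # both tokens belong to a subject
    same = g[:, None] == g[None, :]
    bias = np.zeros((g.size, g.size))
    bias[grouped & same] = intra    # pull tokens of one subject together
    bias[grouped & ~same] = inter   # push different subjects apart
    return bias

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Assumed token layout: subject 0 = (face, hat), subject 1 = (face, coat),
# plus one background token that belongs to no subject.
group_ids = [0, 0, 1, 1, -1]
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(5, 8))
bias = relational_bias(group_ids)
out, weights = biased_attention(q, k, v, bias)
```

Relative to unbiased attention, each subject token in this sketch puts more weight on tokens from its own group and less on tokens from other groups, while background tokens are unaffected. The additive form was chosen for simplicity; a hard mask or learned per-group embeddings would be alternative designs.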

Experimental setup & results

Authors: Jiazheng Xing* (1,4,2), Fei Du* (2,3), Hangjie Yuan* (2,3,1), Pengwei Liu (1,2), Hongbin Xu (4), Hai Ci (4), Ruigang Niu (2,3), Weihua Chen† (2,3), Fan Wang (2), Yong Liu† (1)
Affiliations: (1) Zhejiang University; (2) DAMO Academy, Alibaba Group; (3) Hupan Lab; (4) National University of Singapore
* Equal contribution; † Corresponding authors.
Precise face–attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency.
Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources.
We therefore propose LumosX, a framework that advances both data and model design.

Limitations & risks

Under fine-grained multi-subject inputs, explicit constraints must be imposed at both the data and model levels.

Read-like-fullpaper digest

This paper addresses open-set personalized multi-subject video generation, where precise face–attribute alignment across subjects remains challenging because existing methods lack explicit mechanisms to ensure intra-group consistency. The core method, LumosX, works on two fronts. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. On the modeling side, Relational Self-Attention and Relational Cross-Attention combine position-aware embeddings with refined attention dynamics to encode explicit subject–attribute dependencies. As the key empirical finding, the authors report that LumosX achieves state-of-the-art performance on their benchmark in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.

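Detailed Summary (KO, translated to English)

Problem definition

In recent years, diffusion models [11, 12, 33] have made remarkable progress, setting new performance standards in text-to-video generation [13, 3, 28, 51], particularly through the adoption of Diffusion Transformer (DiT) architectures [31].
These advances have laid a solid foundation for customized video generation [27, 48, 52, 5, 15, 8, 24], where high-degree-of-freedom personalization unlocks applications ranging from virtual theatrical production to e-commerce and enables fine-grained control over both backgrounds and foregrounds, including multiple interacting subjects.
However, realizing open-set personalized multi-subject video generation under such flexible and complex conditions remains difficult.
The task requires not only the precise integration of diverse, interrelated conditioning signals but also the preservation of temporal coherence and identity fidelity across all subjects.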

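Core idea & method

On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies.
These extracted relational priors impose a finer-grained structure that strengthens the expressive control of personalized video generation and enables the construction of a comprehensive benchmark.
On the modeling side, Relational Self-Attention and Relational Cross-Attention combine position-aware embeddings with refined attention dynamics to encode explicit subject–attribute dependencies, reinforcing intra-group cohesion and increasing the separation between distinct subject clusters.
Comprehensive evaluations on the benchmark show that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.
Date: March 23, 2026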

Experimental setup & results

Authors: Jiazheng Xing* (1,4,2), Fei Du* (2,3), Hangjie Yuan* (2,3,1), Pengwei Liu (1,2), Hongbin Xu (4), Hai Ci (4), Ruigang Niu (2,3), Weihua Chen† (2,3), Fan Wang (2), Yong Liu† (1)
Affiliations: (1) Zhejiang University; (2) DAMO Academy, Alibaba Group; (3) Hupan Lab; (4) National University of Singapore
* Equal contribution; † Corresponding authors.
Precise face–attribute alignment across subjects remains a difficult challenge, because existing methods lack explicit mechanisms to ensure intra-group consistency.
Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources.
We therefore propose LumosX, a framework that advances both data and model design.

Limitations & risks

Under fine-grained multi-subject inputs, explicit constraints must be imposed at both the data and model levels.

Read-like-fullpaper digest

This paper addresses open-set personalized multi-subject video generation, building on recent progress of diffusion models [11, 12, 33] and Diffusion Transformer (DiT) architectures [31] in text-to-video generation [13, 3, 28, 51]. The core method is as follows: on the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies; on the modeling side, Relational Self-Attention and Relational Cross-Attention encode explicit subject–attribute dependencies. As the key empirical finding, the authors report state-of-the-art performance on their benchmark for fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.