#8 VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Score: 11.6 | Matched keywords: alignment, foundation models, large language models

Detailed Summary (EN)

Problem definition

While Large Language Models have revolutionized Neural Machine Translation, their performance in low-resource regimes remains hampered by suboptimal tokenization, training imbalances, and reinforcement learning instabilities.
State-of-the-art models such as GPT-4, DeepSeek-R1, and Qwen-max (OpenAI et al., 2024; Guo et al., 2025; Qwen et al., 2025) frequently exhibit pronounced sequence fragmentation and substantial variance in subword segmentation for morphologically rich writing systems, such as Khmer and Thai, predominantly as a consequence of vocabularies and training corpora that are disproportionately optimized for high-resource languages.
(Footnote 1: Qiyuan Tech.)
Specialized MT architectures (Cheng et al., 2025; Zheng et al., 2025; Dou et al., 2025) address these gaps via curated data, yet often struggle with the instruction flexibility required for production environments.
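
The fragmentation described above can be made concrete with a toy measurement. The sketch below is not the paper's tokenizer or metric. It assumes a hypothetical greedy longest-match tokenizer with UTF-8 byte fallback, and it defines "fertility" here as tokens per character. The names `tokenize`, `fertility`, and the tiny vocabulary are illustrative assumptions.

```python
def tokenize(text: str, vocab: set) -> list:
    """Greedy longest-match segmentation with UTF-8 byte fallback (toy model)."""
    max_len = max((len(v) for v in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary entry covers this character:
            # emit one token per UTF-8 byte, as byte-fallback tokenizers do.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def fertility(text: str, vocab: set) -> float:
    """Tokens per character; higher values mean heavier fragmentation."""
    if not text:
        return 0.0
    return len(tokenize(text, vocab)) / len(text)


# Hypothetical English-centric vocabulary (assumption for illustration only).
english_vocab = {"trans", "lation"}
print(fertility("translation", english_vocab))  # 2 tokens over 11 characters
print(fertility("ภาษา", english_vocab))          # every Thai character falls back to bytes
```

Under these assumptions the Thai word costs three tokens per character because each character occupies three UTF-8 bytes, while the English word is covered by two vocabulary entries. Expanding the vocabulary with target-language pieces is what lowers this ratio.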

Core idea & method

VEPO ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training.
Central to the approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold.
By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse.
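
This summary does not give VEPO's formulas, so the following is only a minimal sketch of what entropy-tempered advantages combined with asymmetric clipping could look like. It assumes a PPO-style clipped surrogate with separate lower and upper clip widths, and it assumes the advantage is scaled linearly by normalized policy entropy. The functions `tempered_advantage` and `clipped_surrogate`, and the parameters `alpha`, `eps_low`, and `eps_high` with their default values, are hypothetical and not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def tempered_advantage(advantage, probs, alpha=0.5):
    """Scale the advantage by normalized entropy (assumed linear form).

    High-entropy (uncertain) tokens receive a larger update weight,
    low-entropy tokens keep roughly the raw advantage.
    """
    max_entropy = math.log(len(probs))
    h_norm = entropy(probs) / max_entropy if max_entropy > 0 else 0.0
    return advantage * (1.0 + alpha * h_norm)


def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style objective with different lower and upper clip bounds.

    A wider upper bound lets low-probability tokens grow more before
    clipping, which is one common way to keep exploration alive.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Uniform distribution over four tokens: maximal entropy, largest boost.
adv = tempered_advantage(1.0, [0.25, 0.25, 0.25, 0.25])
print(clipped_surrogate(1.5, adv))
```

The asymmetry matters only when the probability ratio leaves the interval `[1 - eps_low, 1 + eps_high]`. With a positive advantage the objective is capped at the upper bound, and with a negative advantage it is capped at the lower bound.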
Empirical evaluations across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
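
For readers unfamiliar with chrF, one of the metrics named above, here is a simplified sketch of the character n-gram F-score. It assumes the usual defaults of n-gram orders 1 to 6 and beta = 2, strips spaces before counting, and averages precision and recall over the orders that both strings can support. It is not a drop-in replacement for a reference implementation, whose whitespace and smoothing details may differ.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0.0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return 100.0 * (1 + beta**2) * p * r / (beta**2 * p + r)


print(chrf("the cat sat on the mat", "the cat sat on a mat"))
```

Because it works on characters rather than words, this kind of metric needs no word segmentation, which is convenient for scripts such as Thai and Khmer that do not separate words with spaces.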

Experimental setup & results

[...] protocols (Choshen et al., 2020), prompting recent extensions into multilingual preference optimization (Dang et al., 2024).
Despite their versatility, general-purpose LLMs often exhibit three systemic failure modes in translation: (1) Fidelity Gaps, where the absence of task-specific constraints leads to semantic hallucinations or overt translation errors; (2) Verbosity Bias, characterized by redundant supplementary explanations or conversational filler that detracts from concise output (see Figure 1); and (3) Generation Overrun, a phenomenon where the model continues generating irrelevant text or "hallucinatory continuations" after the target translation is completed.
VEPO diverges from existing multilingual paradigms by addressing these issues across three dimensions.
First, unlike specialized architectures such as Qwen-MT (Qwen, 2025) that are often confined to rigid templates, VEPO maintains expansive instruction-following capabilities while ensuring bilingual accuracy.
Second, VEPO mitigates redundancy and over-generation through the synergy of RLVR-integrated structural constraints and length-invariant reinforcement learning normalization, providing robust control over sequence termination.

Limitations & risks

In this paper, we introduced Variable Entropy Policy Optimization, a comprehensive framework for adapting foundation models to low-resource linguistic environments.
Our approach systematically addresses the primary bottlenecks in multilingual modeling through targeted tokenizer expansion, balanced continued pre-training, and entropy-aware reinforcement learning.
By integrating Reinforcement Learning with Verifiable Rewards (RLVR), we enforce deterministic structural constraints directly within the optimization loop, effectively mitigating common failure modes such as sequence inflation and markup corruption.
Empirical results across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, demonstrate that VEPO achieves state-of-the-art translation performance while preserving robust general-purpose capabilities.
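
To illustrate what a deterministic, verifiable structural reward might check, the sketch below scores a translation as 1.0 only when it passes three rule-based tests: no sequence inflation, no markup corruption, and no multi-line overrun for a single-line source. The function `structural_reward`, the character-based length ratio, and the `max_ratio` threshold are all hypothetical choices for illustration. The summary does not specify the paper's actual reward rules.

```python
import re

# Matches simple opening and closing tags such as <b> or </b>.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")


def structural_reward(source, translation, max_ratio=2.0):
    """Return 1.0 only if every deterministic check passes, else 0.0."""
    if not translation.strip():
        return 0.0
    # Sequence inflation: output much longer than the source (crude,
    # character-based; a real system would calibrate per language pair).
    if len(translation) > max_ratio * len(source):
        return 0.0
    # Markup corruption: the same tags must appear in the same order.
    if TAG_PATTERN.findall(source) != TAG_PATTERN.findall(translation):
        return 0.0
    # Generation overrun: a single-line source should stay single-line.
    if "\n" not in source and "\n" in translation.strip():
        return 0.0
    return 1.0


print(structural_reward("<b>Hello</b> world", "<b>Bonjour</b> le monde"))
print(structural_reward("<b>Hello</b> world", "Bonjour le monde"))
```

Because each check is a pure function of the source and the output, the reward can be recomputed and audited without a learned reward model, which is the property that makes such constraints "verifiable".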

Read-like-fullpaper digest

This paper addresses the problem that, while Large Language Models have revolutionized Neural Machine Translation, their performance in low-resource regimes remains hampered by suboptimal tokenization, training imbalances, and reinforcement learning instabilities. The core method, VEPO, ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Key empirical findings: across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, VEPO yields substantial improvements in both tokenization efficiency and translation quality.

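Detailed Summary (translated from KO)

Problem definition

Although large language models have revolutionized neural machine translation, their performance in low-resource settings is still held back by suboptimal tokenization, training imbalances, and reinforcement learning instabilities.
State-of-the-art models such as GPT-4, DeepSeek-R1, and Qwen-max (OpenAI et al., 2024; Guo et al., 2025; Qwen et al., 2025) often show pronounced sequence fragmentation and considerable variance in subword segmentation for morphologically rich writing systems such as Khmer and Thai, mainly as a result of vocabularies and training corpora that are disproportionately optimized for high-resource languages.
Specialized MT architectures (Cheng et al., 2025; Zheng et al., 2025; Dou et al., 2025) close these gaps with curated data, but often struggle with the instruction flexibility required in production environments.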

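Core idea/method

VEPO ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training.
At the core of the approach is a variable entropy mechanism that lets the model dynamically calibrate the balance between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold.
By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse.
Empirical evaluations across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, show that VEPO delivers substantial improvements in both tokenization efficiency and translation quality, narrowing the performance gap for underrepresented languages.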

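Experimental setup/results

[...] protocols (Choshen et al., 2020), prompting recent extensions into multilingual preference optimization (Dang et al., 2024).
Despite their versatility, general-purpose LLMs frequently exhibit three systemic failure modes in translation: (1) Fidelity Gaps, where the absence of task-specific constraints leads to semantic hallucinations or overt translation errors; (2) Verbosity Bias, characterized by redundant supplementary explanations or conversational filler that detracts from concise output (see Figure 1); and (3) Generation Overrun, where the model keeps generating irrelevant text or "hallucinatory continuations" after the target translation is complete.
VEPO differs from existing multilingual paradigms by addressing these problems along three dimensions.
First, unlike specialized architectures such as Qwen-MT (Qwen, 2025), which are often confined to rigid templates, VEPO retains broad instruction-following capabilities while ensuring bilingual accuracy.
Second, VEPO mitigates redundancy and over-generation through the synergy of RLVR-integrated structural constraints and length-invariant reinforcement learning normalization, providing robust control over sequence termination.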

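Limitations/risks

This paper introduced Variable Entropy Policy Optimization, a comprehensive framework for adapting foundation models to low-resource language environments.
The approach systematically addresses the main bottlenecks in multilingual modeling through targeted tokenizer expansion, balanced continued pre-training, and entropy-aware reinforcement learning.
By integrating Reinforcement Learning with Verifiable Rewards (RLVR), it enforces deterministic structural constraints directly within the optimization loop, effectively mitigating common failure modes such as sequence inflation and markup corruption.
Empirical results across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, show that VEPO achieves state-of-the-art translation performance while preserving robust general-purpose capabilities.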

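Read-like-fullpaper digest

This paper addresses the problem that, although large language models have revolutionized neural machine translation, their performance in low-resource settings is still held back by suboptimal tokenization, training imbalances, and reinforcement learning instabilities. The core method, VEPO, ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Key empirical findings: across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, VEPO delivers substantial improvements in both tokenization efficiency and translation quality.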