#2 AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese

Score: 16.4 | Matched keywords: large language model, large language models, llm

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles the underrepresentation of European Portuguese in global LLMs. Many European languages, including European Portuguese, are underrepresented in these models, which limits their ability to capture the full breadth of Europe's linguistic and cultural diversity. The authors' experiments show that the model is on par with other similarly sized open models on most machine-translated benchmarks and superior on the European Portuguese benchmarks. They read this as a strong indication that LLMs can capture specific traits of underrepresented language varieties, even when the number of examples is orders of magnitude smaller than for the dominant language variety (Brazilian Portuguese) and the dominant language (English).

Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, and machine-translated benchmarks likely miss the variant's linguistic and cultural nuances. The paper introduces AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.

The empirical case is built around a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. The motivation for the new datasets is that machine-translated benchmarks likely miss the variant's linguistic and cultural nuances.

The central reported finding is that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

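Main takeaway: AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations. To evaluate pt-PT more faithfully, the authors also release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias.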

Problem definition

Many European languages, including European Portuguese, are underrepresented in global LLMs, limiting these technologies' ability to capture the full breadth of Europe's linguistic and cultural diversity.
This paper introduces AMALIA, an LLM designed to address this imbalance by prioritizing European Portuguese and its cultural context during pretraining and post-training.
The experiments reveal that the model is on par with other similarly sized open models on most machine-translated benchmarks and is superior on the European Portuguese benchmarks.
The authors take this as a strong indication that LLMs can capture specific traits of underrepresented language varieties, even when the number of examples is orders of magnitude smaller than for the dominant language variety (Brazilian Portuguese) and the dominant language (English).

Core idea & method

Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, and machine-translated benchmarks likely miss the variant's linguistic and cultural nuances.
The paper introduces AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages.
To evaluate pt-PT more faithfully, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias.
Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.

Actual findings

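AMALIA is on par with other similarly sized open models on most machine-translated benchmarks and is superior on the European Portuguese benchmarks.
Alongside the model, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias.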

How the conclusion was reached

Step 1 - Proposed approach: The paper introduces AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages.
Step 2 - Evaluation setup or comparison basis: Because machine-translated benchmarks likely miss the variant's linguistic and cultural nuances, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. AMALIA is compared against other similarly sized open models.
Step 3 - Main reported evidence: AMALIA is on par with those models on most machine-translated benchmarks and superior on the European Portuguese benchmarks.

Experimental setup & results

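Setup: to evaluate pt-PT more faithfully, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. The new datasets are motivated by machine-translated benchmarks likely missing the variant's linguistic and cultural nuances.
Results: AMALIA is on par with other similarly sized open models on most machine-translated benchmarks and superior on the European Portuguese benchmarks.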

Limitations & risks

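Detailed Summary (KO)

Read-like-fullpaper digest

This paper tackles the underrepresentation of European Portuguese in global LLMs. Many European languages, including European Portuguese, are underrepresented in these models, which limits their ability to capture the full breadth of Europe's linguistic and cultural diversity. The authors' experiments show that the model is on par with other similarly sized open models on most machine-translated benchmarks and superior on the European Portuguese benchmarks. They read this as a strong indication that LLMs can capture specific traits of underrepresented language varieties, even when the number of examples is far smaller than for the dominant language variety (Brazilian Portuguese) and the dominant language (English).

Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, and machine-translated benchmarks likely miss the variant's linguistic and cultural nuances. The paper introduces AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.

The empirical case is built around that benchmark suite, motivated by the concern that machine-translated benchmarks likely miss the variant's linguistic and cultural nuances.

The central reported finding is that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.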

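Final takeaway

Main takeaway: To evaluate pt-PT more faithfully, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias.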

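Problem definition

Many European languages, including European Portuguese, are underrepresented in global LLMs, limiting these technologies' ability to capture the full breadth of Europe's linguistic and cultural diversity.
This paper introduces AMALIA, an LLM designed to address this imbalance by prioritizing European Portuguese and its cultural context during pretraining and post-training.
The experiments show that the model is on par with other similarly sized open models on most machine-translated benchmarks and is superior on the European Portuguese benchmarks.
The authors take this as a strong indication that LLMs can capture specific traits of underrepresented language varieties, even when the number of examples is far smaller than for the dominant language variety (Brazilian Portuguese) and the dominant language (English).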

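Core idea & method

The paper introduces AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages.
Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, and machine-translated benchmarks likely miss the variant's linguistic and cultural nuances.
Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.
To evaluate pt-PT more faithfully, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias.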

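Actual findings

To evaluate pt-PT more faithfully, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias.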

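How the conclusion was reached

Step 1 - Proposed approach: The paper introduces AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages.
Step 2 - Evaluation setup or comparison basis: Machine-translated benchmarks likely miss the variant's linguistic and cultural nuances.
Step 3 - Main reported evidence: To evaluate pt-PT more faithfully, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias.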

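Experimental setup & results

To evaluate pt-PT more faithfully, the authors release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias.
Machine-translated benchmarks likely miss the variant's linguistic and cultural nuances.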