Portugal Invests €5.5 Million! The Power and Challenges of the European Portuguese-Focused LLM 'AMÁLIA'

#European Portuguese #AMÁLIA #Open Source LLM

Note: This article contains affiliate advertising.

📰 News Overview

National-Level Investment: The Portuguese government has announced a €5.5 million investment to develop the LLM “AMÁLIA,” treating European Portuguese as a “first-class citizen.”
Collaboration Among Universities: Leading Portuguese universities and research institutions, including NOVA, IST, IT, and FCT, are collaborating on this project, building on the previous initiative “EuroLLM.”
Performance Beyond SOTA: On the project’s own benchmark, “ALBA,” the model is reported to score higher than recent models such as Qwen 3-8B.

💡 Key Points

Data Strategy: Pre-training uses data from “Arquivo.pt.” In the SFT (Supervised Fine-Tuning) phase, the share of Portuguese data, including synthetic data, was raised to about 17-18%.
Establishment of Unique Benchmarks: Four new benchmarks have been introduced to measure grammar, syntax, and general knowledge, and to check that the model is not biased towards Brazilian Portuguese.
Concerns Over Open Source: Although the project is described as “fully open source,” only the code repository is currently public; the model weights and training data have not been released.

🦈 Shark’s Eye (Curator’s Perspective)

The hefty €5.5 million public funding to protect the nation’s linguistic culture is a thrilling move towards “digital sovereignty!” What stands out is the commitment not just to speak Portuguese, but to benchmark the “distinction from Brazilian Portuguese.” This is a crucial approach for developing AI rooted in specific regional cultures!

On the flip side, there’s a nagging concern about “data scarcity.” Of the 107B tokens used in pre-training, only about 5.5% (5.8B tokens) are clearly European Portuguese. Can the language truly be called a “first-class citizen” at such a low ratio? And if the model did outscore Qwen 3-8B, was that thanks to fine-tuning rather than data volume? This is up for debate! Moreover, until the weights are published, calling it “truly open source” is a stretch. Given the public funding, transparency should be the greatest return on investment!
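
For a quick sanity check on that ratio, here is a minimal Python sketch that uses only the two token counts quoted above. The variable names are illustrative and do not come from the AMÁLIA project.

```python
# Token counts quoted in this article, in billions of tokens.
total_pretraining_tokens_b = 107.0
european_portuguese_tokens_b = 5.8

# Share of pre-training data that is clearly European Portuguese.
share = european_portuguese_tokens_b / total_pretraining_tokens_b

print(f"European Portuguese share: {share:.2%}")
```

The quoted counts work out to roughly 5.4%, close to the 5.5% cited, so the small gap is most likely rounding in the original figures.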

🚀 What’s Next?

If the model weights are officially released, local companies and developers in Portugal could use them as a baseline for their own fine-tuning. What will set AMÁLIA apart from general-purpose models is not just linguistic capability, but how deeply it covers “Portugal’s history, laws, and unique knowledge.”

💬 A Word from Haru-Same

A national project that breaks through language barriers! If data transparency is achieved, it could become a beacon of hope for other smaller language regions! Waiting eagerly for what’s next!

📚 Glossary

EuroLLM: A precursor project designed to support multiple languages across Europe. It serves as the foundation for AMÁLIA.
SFT (Supervised Fine-Tuning): A training stage in which a pre-trained model is further trained on curated, high-quality examples so that it follows instructions.
RoPE Scaling: A technique that extends the text length (context) a Transformer model can handle by rescaling its rotary position embeddings (RoPE), the mechanism that encodes where each token sits in the sequence.
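
To make the RoPE Scaling entry more concrete, here is a minimal sketch of one common variant, linear position interpolation, in plain Python. The source does not say which scaling method AMÁLIA uses, so the function names, the `base` value of 10000, and the `scale` parameter are all assumptions for illustration.

```python
import math

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Return the RoPE rotation angle for each pair of dimensions.

    scale > 1 applies linear position interpolation: positions are divided
    by `scale`, so a longer sequence maps into the position range seen
    during training. (Illustrative helper, not a library API.)
    """
    scaled_position = position / scale
    return [scaled_position / base ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs (x, y) of `vec` by position-dependent angles."""
    rotated = []
    for i, theta in enumerate(rope_angles(position, len(vec), base, scale)):
        x, y = vec[2 * i], vec[2 * i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated

# With scale=2, position 8192 gets the same angles as position 4096 unscaled.
same_angles = rope_angles(8192, 8, scale=2.0) == rope_angles(4096, 8)
print(same_angles)
```

In this sketch, doubling `scale` lets a model trained on 4096 positions address 8192 positions with angles it has already seen, at the cost of packing neighbouring tokens closer together in angle space.
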
Source: AMÁLIA and the future of European Portuguese LLMs