[AI Minor News]

Portugal Invests €5.5 Million! The Power and Challenges of the European Portuguese-Focused LLM 'AMÁLIA'


※ This article contains affiliate advertising.

📰 News Overview

  • National-Level Investment: The Portuguese government has announced a €5.5 million investment to develop the LLM “AMÁLIA,” treating European Portuguese as a “first-class citizen.”
  • Collaboration Among Universities: Leading Portuguese universities and research institutions, including NOVA, IST, IT, and FCT, are collaborating on this project, building on the previous initiative “EuroLLM.”
  • Performance Beyond SOTA: On “ALBA,” the project’s own benchmark, the model reportedly scores higher than recent models such as Qwen3-8B.

💡 Key Points

  • Data Strategy: Pre-training utilizes data from “Arquivo.pt.” In the SFT (Supervised Fine-Tuning) phase, Portuguese data, including synthetic data, has been boosted to about 17-18%.
  • Establishment of Unique Benchmarks: Four new benchmarks have been introduced to measure grammar, syntax, and general knowledge, and to check that the model is not biased towards Brazilian Portuguese.
  • Concerns Over Open Source: While the project claims to be “fully open source,” only the repository is currently public; the model weights and training data have not been released.
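
The article does not say how the benchmark checks for Brazilian Portuguese bias, so the following is only a toy illustration of the general idea, not ALBA's method. It counts a few well-known vocabulary differences between the two variants in a piece of model output. The word list `VARIANT_PAIRS` and the function `variant_counts` are hypothetical names chosen for this sketch, and a real evaluation would need far more than a handful of word pairs.

```python
import re
import unicodedata

# Hypothetical, hand-picked pairs: (European Portuguese, Brazilian Portuguese).
# These are common everyday examples, not taken from the ALBA benchmark.
VARIANT_PAIRS = [
    ("autocarro", "ônibus"),    # bus
    ("comboio", "trem"),        # train
    ("telemóvel", "celular"),   # mobile phone
]


def variant_counts(text: str) -> dict[str, int]:
    """Count how many marker words from each variant appear in the text."""
    # Normalize to NFC so accented letters are single code points,
    # then lowercase and split into words.
    normalized = unicodedata.normalize("NFC", text).lower()
    words = re.findall(r"\w+", normalized)
    pt_pt = sum(words.count(european) for european, _ in VARIANT_PAIRS)
    pt_br = sum(words.count(brazilian) for _, brazilian in VARIANT_PAIRS)
    return {"pt_pt": pt_pt, "pt_br": pt_br}


european_sample = "Apanhei o autocarro e depois o comboio."
brazilian_sample = "Peguei o ônibus e liguei do celular."

print(variant_counts(european_sample))
print(variant_counts(brazilian_sample))
```

A model aimed at European Portuguese would be expected to produce far more markers of the first kind than the second. Word counting like this misses grammar and spelling differences, which is presumably why a dedicated benchmark is needed.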

🦈 Shark’s Eye (Curator’s Perspective)

The hefty €5.5 million in public funding to protect the nation’s linguistic culture is a thrilling move towards “digital sovereignty”! What stands out is the commitment not just to handle Portuguese, but to benchmark how well the model distinguishes European from Brazilian Portuguese. This is a crucial approach for developing AI rooted in specific regional cultures!

On the flip side, there’s a nagging concern about “data scarcity.” Of the 107B tokens used in pre-training, only 5.5% (5.8B tokens) are clearly European Portuguese. Can the language truly be called a “first-class citizen” with such a low ratio? Or did the model outpace Qwen3-8B thanks to fine-tuning rather than data volume? This is up for debate! Moreover, with the weights unpublished, calling it “fully open source” is a stretch. Given the public funding, transparency should be the greatest return on investment!
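
As a quick sanity check on the ratio, here is a minimal sketch that recomputes the share from the two token counts quoted in this article. The figures are rounded as published, and the names `share_percent` and `pt_share` are just illustrative.

```python
# Figures quoted in the article, in billions of tokens (rounded as published).
TOTAL_PRETRAIN_TOKENS_B = 107.0
EUROPEAN_PT_TOKENS_B = 5.8


def share_percent(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


pt_share = share_percent(EUROPEAN_PT_TOKENS_B, TOTAL_PRETRAIN_TOKENS_B)
print(f"European Portuguese share: {pt_share:.1f}%")  # prints 5.4%
```

The rounded inputs give about 5.4%, slightly under the quoted 5.5%. The gap is presumably down to rounding in the published token counts. Either way, roughly one token in eighteen is clearly European Portuguese.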

🚀 What’s Next?

If the model weights are officially released, local companies and developers in Portugal could use them as a baseline for their own fine-tuning. The model’s future will hinge not just on linguistic capability but on how deeply it captures “Portugal’s history, laws, and unique knowledge,” which will be key to distinguishing it from general-purpose models.

💬 A Word from Haru-Same

A national project that breaks through language barriers! If data transparency is achieved, it could become a beacon of hope for other smaller language regions! Waiting eagerly for what’s next!

📚 Glossary

  • EuroLLM: A precursor project designed to support multiple languages across Europe. It serves as the foundation for AMÁLIA.

  • SFT (Supervised Fine-Tuning): A training stage in which a pre-trained model is further trained on curated instruction-and-response examples so that it learns to follow instructions.

  • RoPE Scaling: A technique that extends the context length a Transformer model can handle by rescaling its rotary position embeddings (RoPE), which encode each token’s position.

Source: AMÁLIA and the future of European Portuguese LLMs

[Disclaimer]
This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.
🦈