Norway’s National Library Takes on “Sovereign AI”: Training on Its Own Culture with 2PB of Huawei High-Speed Storage
📰 News Overview
- Norwegian-Specific LLM Development: The National Library is spearheading the creation of a “Sovereign AI” to accurately reflect the Norwegian language, history, and culture that commercial LLMs fail to cover.
- Adoption of 2PB Huawei Storage: The library has deployed 2PB of low-latency all-flash storage, “Huawei OceanStor Dorado,” for the AI training data pipeline.
- Utilizing a Massive 60PB Archive: A total of 60PB of data, including books, newspapers, and broadcast content digitized since 2005 and kept under a 3-2-1 backup setup, will serve as the training source.
💡 Key Points
- Focus on “Data Pipeline” Over Computational Resources: The bottleneck is not computational power but data quality, cleaning, and the throughput from the archive to the training system.
- Hybrid Training Environment: Data preprocessing runs on in-house Nvidia DGX H200 systems, while final training is executed on the national supercomputer “Sigma2 Olivia” (equipped with 448 GPUs).
- Clearing Copyright Issues: Agreements with newspapers have made LLM training using “copyrighted content” possible, which private enterprises would find challenging.
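To see why throughput from the archive matters so much, here is a rough back-of-the-envelope sketch. The 2PB and 60PB figures come from the article; the sustained throughput values are purely hypothetical assumptions for illustration, not measured numbers from the National Library, and the function name `transfer_days` is made up for this example.

```python
# Back-of-the-envelope: how long does it take to stream a dataset
# at a given sustained throughput? Decimal units: 1 PB = 1,000,000 GB.

SECONDS_PER_DAY = 86_400
GB_PER_PB = 1_000_000


def transfer_days(petabytes: float, gb_per_second: float) -> float:
    """Return the days needed to move `petabytes` at `gb_per_second`."""
    if gb_per_second <= 0:
        raise ValueError("throughput must be positive")
    seconds = petabytes * GB_PER_PB / gb_per_second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Throughput values below are assumptions, not figures from the article.
    for label, size_pb in (("flash tier", 2), ("full archive", 60)):
        for rate in (1, 10, 50):
            days = transfer_days(size_pb, rate)
            print(f"{label}: {size_pb} PB at {rate} GB/s -> {days:.1f} days")
```

Even under generous assumptions, a single pass over the full archive is measured in days or weeks, which is one way to read the article's point that the pipeline, not the GPUs, sets the pace.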
🦈 Shark’s Perspective (Curator’s View)
The brilliance of this project lies in its execution: not just developing an LLM, but working out how to funnel a petabyte-scale archive into AI training! Staging data from the massive archive tier (high-latency, low-cost) onto high-speed flash (low-latency, high-throughput), and building the cleaning and normalization steps in-house, is refreshingly practical. While AI development often focuses on stacking up compute, Norway is tackling the real infrastructure challenges as a “guardian of data.” That Huawei storage is being used for national infrastructure in Europe underscores how seriously the technology was selected!
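As a rough illustration of what a “cleaning and normalization” step can look like, here is a minimal sketch using only the Python standard library. It is not the National Library's actual pipeline; the function names `clean_text` and `unique_documents` and the specific steps (Unicode NFC normalization, whitespace collapsing, exact-duplicate removal by hash) are assumptions chosen for the example.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    # NFC composes sequences such as 'a' + combining ring into 'å',
    # so visually identical Norwegian text compares equal.
    text = unicodedata.normalize("NFC", raw)
    return _WHITESPACE.sub(" ", text).strip()


def unique_documents(docs: Iterable[str]) -> Iterator[str]:
    """Yield cleaned documents, skipping empties and exact duplicates."""
    seen = set()
    for raw in docs:
        cleaned = clean_text(raw)
        if not cleaned:
            continue
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield cleaned


if __name__ == "__main__":
    sample = ["Blå bær", "Bla\u030a  bær\n", "   ", "Fjord"]
    for doc in unique_documents(sample):
        print(doc)
```

Hashing the cleaned text rather than the raw text means two scans of the same page that differ only in spacing or character encoding are treated as one document. A real pipeline would likely add near-duplicate detection and OCR error handling on top of this.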
🚀 What’s Next?
The construction of “Sovereign AI” to protect national culture will accelerate in countries outside the English-speaking world. The challenges Norway is facing, such as the “lack of evaluation tools” and “governance (who controls access),” are set to become standard hurdles for all non-English-speaking nations.
💬 Haru-Same’s Takeaway
AI needs not just “builders,” but also “guardians of culture.” That resonates deeply! Just like sharks, we must protect the ocean of knowledge! 🦈🔥
📚 Terminology Explained
- Sovereign AI: AI that reflects a nation’s language, culture, and values, managed independently without relying on foreign platforms.
- Data Pipeline: An automated and efficient framework for the collection, cleaning, processing, and storage of data.
- All-Flash Storage: High-speed storage devices employing SSDs (flash memory) across all media, providing significantly lower latency compared to mixed HDD setups.
Source: Norway’s 2 petabytes of Huawei flash storage and LLM training