Distilling Gemini 3.1 into a 26M Model! Meet the Ultra-Lightweight AI Tool Executor “Needle”
📰 News Overview
- Distilled Gemini 3.1 into a 26 million parameter model: The “Needle,” a model based on the remarkably compact “Simple Attention Network (SAN),” has been revealed.
- Unmatched Processing Speed: Operating on the Cactus platform, it boasts an astonishing inference speed of 1200 tokens/sec for decoding and 6000 tokens/sec for prefill.
- Local Fine-Tuning Capability: It runs on Mac and PCs, equipped with a web UI called “needle playground” that allows easy learning of custom tool-calling data.
💡 Key Points
- High Superiority for Specific Tasks: Outperforming larger models like FunctionGemma-270m, Qwen-0.6B, and Granite-350m in single-shot function calling efficiency.
- Thoroughly Optimized Architecture: Eliminating FFN (Feed-Forward Network) and adopting ZCRMSNorm and GQA+RoPE, it minimizes resource consumption through innovations like shared embeddings.
- Support for Next-Gen Edge Devices: Designed to function not only on smartphones but also on ultra-small devices like watches and smart glasses, serving as a foundation for “personal AI.”
🦈 Shark’s Perspective (Curator’s View)
Recreating the intelligence of Gemini 3.1 (tool execution) in a minuscule 26M size is a remarkable technological breakthrough! The brilliance of “Needle” lies not just in its lightweight nature but also in its prefill speed of 6000 tokens per second on Cactus. The implementation of the “Simple Attention Network,” which strips away the bulk of FFN, brilliantly resolves the “weight” versus “accuracy trade-off” that existing small models faced. Plus, the ease of fine-tuning on a Mac is nothing short of divine! We’re entering an era where users can cultivate their very own AI agents tailored to their habits on the edge!
🚀 What’s Next?
Wearable devices like smart glasses will soon be equipped with specialized ultra-compact models like “Needle,” allowing for zero-latency control of home appliances and information retrieval without relying on the cloud. The shift of AI from “beyond the cloud” to “right at your fingertips” is accelerating rapidly!
💬 Shark’s Takeaway
The era of giant models dominating the scene is over! It’s time for the age of “small but sharp” to shine! 🦈🔥
📚 Glossary
-
Distillation: A technique that transfers knowledge from a large, high-performance model (teacher model) to a smaller model (student model), enabling lightweight versions to maintain accuracy.
-
Tool Calling: The functionality allowing AI to invoke external functions or APIs to perform specific actions like fetching weather updates, calculations, or device control.
-
Prefill: The phase where the AI reads the input text at once before generating the initial response. The faster this speed, the shorter the wait time before the response begins.
-
Source: Needle: We Distilled Gemini Tool Calling into a 26M Model