Taking Over AI with Inaudible “Stealth Audio”! The Astonishing AudioHijack
📰 News Summary
- A new attack method called “AudioHijack” has been unveiled. It uses audio signals that are imperceptible to humans to force generative AI (speech language models) to execute unauthorized commands.
- The attack success rate is high, averaging between 79% and 96%, and the attack was confirmed to work even on commercial-grade models built on Microsoft and Mistral technologies.
- Attackers can inject malicious signals during user interactions with AI through background music, videos, or Zoom calls, enabling data theft or unauthorized external access.
💡 Key Points
- Context-Oblivious Attacks: The embedded inaudible signals take precedence as commands, regardless of what instructions the user is giving to the AI.
- Exploitation of Generative AI’s Action Capabilities: Modern generative AIs, which can not only recognize voice but also perform actions like web searches, file downloads, and email sending, have become prime targets.
- Turning Tokenization on Its Head: The attack uses an optimization algorithm that exploits gaps in how audio is converted into numerical representations (tokens), forcing specific tokens to be selected.
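To make the tokenization point more tangible, here is a toy sketch in Python. It is not the AudioHijack algorithm; it only illustrates the general idea that when audio frames are mapped to discrete tokens, a small change to a frame sitting near a decision boundary can flip which token is chosen. The codebook, the two-sample frames, and the `nudge_toward` helper are all hypothetical and exist only for this illustration.

```python
import math

# Hypothetical 3-entry codebook over 2-sample frames. Real audio tokenizers
# use learned encoders with far longer frames and far larger codebooks.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]


def tokenize(waveform, frame_len=2):
    """Map each frame to the index of its nearest codebook vector."""
    tokens = []
    for i in range(0, len(waveform) - frame_len + 1, frame_len):
        frame = waveform[i:i + frame_len]
        dists = [math.dist(frame, c) for c in CODEBOOK]
        tokens.append(dists.index(min(dists)))
    return tokens


def nudge_toward(waveform, frame_idx, target_token,
                 step=0.01, max_steps=1000, frame_len=2):
    """Move one frame toward the target codebook vector in small steps,
    stopping as soon as the tokenizer emits the target token."""
    out = list(waveform)
    start = frame_idx * frame_len
    target = CODEBOOK[target_token]
    for _ in range(max_steps):
        if tokenize(out, frame_len)[frame_idx] == target_token:
            break
        for j in range(frame_len):
            diff = target[j] - out[start + j]
            # Clamp each update so a sample never overshoots the target.
            out[start + j] += max(-step, min(step, diff))
    return out


clean = [0.45, 0.05, 0.0, 0.9]
adversarial = nudge_toward(clean, frame_idx=0, target_token=1)
largest_change = max(abs(a - b) for a, b in zip(clean, adversarial))

print("clean tokens:      ", tokenize(clean))
print("adversarial tokens:", tokenize(adversarial))
print("largest change:    ", round(largest_change, 3))
```

In this toy setup the first token flips from 0 to 1 while no sample moves by more than 0.1, and the second frame is left untouched. A real attack would have to optimize against a full model and keep the change inaudible, which this sketch does not attempt.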
🦈 Shark’s Eye (Curator’s Perspective)
Finally, the “invisible attack” has reached the depths of generative AI! While previous attacks merely induced misrecognition, the terrifying aspect of “AudioHijack” is that it compels the AI to take concrete “actions.” In 2026, when it’s commonplace for AI to send emails or browse the web through external tools, this vulnerability is particularly critical. What is absolutely mind-blowing is how concrete the implementation is: a generic signal that takes just 30 minutes to train can interrupt any conversation! We need to be more aware that “invisible commands” could be lurking behind the convenience of AI.
🚀 What’s Next?
Defense mechanisms like “noise filtering” at the input stage of voice AI and “source verification of commands” will become essential technologies. Additionally, because attacks developed on open models were shown to transfer to commercial models, AI developers urgently need to fortify their architectures at a foundational level.
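As a rough illustration of what “source verification of commands” could look like, the Python sketch below gates sensitive tool calls on where the instruction came from. Everything here is an assumption made for illustration: the tool names, the `origin` labels, and the `authorize` function are hypothetical, and the hard part in practice (reliably deciding whether an instruction came from the user or from audio content) is simply taken as given.

```python
from dataclasses import dataclass

# Hypothetical set of tools that can leak data or reach external systems.
SENSITIVE_TOOLS = {"send_email", "download_file", "web_search"}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    # Hypothetical provenance label attached upstream, e.g.
    # "user_confirmed" (explicitly approved by the user) or
    # "audio_input" (derived from audio the model was listening to).
    origin: str


def authorize(call: ToolCall) -> bool:
    """Allow sensitive tools only when the user explicitly confirmed them."""
    if call.tool not in SENSITIVE_TOOLS:
        return True
    return call.origin == "user_confirmed"


requests = [
    ToolCall("send_email", "audio_input"),      # e.g. triggered by background music
    ToolCall("send_email", "user_confirmed"),   # user approved the action
    ToolCall("set_volume", "audio_input"),      # not in the sensitive set
]

for call in requests:
    verdict = "allow" if authorize(call) else "block"
    print(f"{call.tool:<12} from {call.origin:<15} -> {verdict}")
```

The design choice is to deny by default for anything that can move data out, so an instruction hidden in audio cannot trigger an email or download without a separate confirmation step.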
💬 A Word from Haru-Same
Shark! Imagine thinking you’re just playing some music while the AI is secretly sending important files behind your back! Security needs to keep pace with the rapid evolution of AI!
📚 Glossary
- LALM (Large Audio-Language Models): Large-scale AI models capable of understanding both audio and text, as well as performing analysis, generation, and even operation of external tools.
- AudioHijack: The method named in this research, which uses slightly modified audio waveforms that are inaudible to humans to deliberately manipulate AI behavior.
- Tokens: The smallest units used when AI processes audio or text. This system breaks audio into short segments and assigns numerical values to manage them.
- Sources: Voice AI Systems Are Vulnerable to Hidden Audio Attacks