Weaponizing large language models (LLMs) to intercept transactions involving bank account data has become the latest addition to attackers' tradecraft. LLMs have already been used to launch convincing phishing campaigns, coordinate social engineering attacks, and develop more resilient ransomware strains.

IBM’s Threat Intelligence team took LLM attack scenarios one step further, leveraging LLMs against live conversations to replace legitimate financial details with fraudulent instructions. Just three seconds of recorded voice was enough for the team to train the models used in a proof-of-concept (PoC) attack they described as “scarily easy.”

The other party in the call did not identify the financial instructions and account information as fraudulent.

Weaponizing LLMs for audio-based attacks
Audio jacking is an emerging generative AI-based attack that allows attackers to intercept and alter live conversations without detection. By applying simple techniques to retrain LLMs, IBM Threat Intelligence researchers were able to use gen AI to manipulate live audio transactions, and the proof of concept worked so well that neither party realized their conversation had been hijacked.

IBM Threat Intelligence used financial conversations as its test case, intercepting them in progress and manipulating responses in real time with an LLM. The conversation centered on diverting funds away from the intended recipient into a fake adversarial account, without alerting either speaker to the compromise.

IBM’s Threat Intelligence team reported that creating the attack was straightforward, and the conversational modifications that redirected the money to the adversarial account were virtually undetectable by anyone involved.

Keyword swapping with “bank account” as the trigger
Audio jacking uses artificial intelligence (AI) to detect keywords in context and replace them with malicious, fraudulent data, as the team’s proof of concept demonstrated.

Chenta Lee, chief architect of threat intelligence for IBM Security, explains in a blog post published February 1 that the team chose the keyword “bank account” for the experiment. Whenever anyone mentioned a bank account number, the LLM was instructed to replace it with one from the attackers’ list, giving threat actors the ability to subtly swap in their own number using a cloned voice without anyone being aware. “It’s like turning people in conversations into puppets,” Lee writes, and the manipulation is difficult to detect.
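To make the keyword-swap step concrete, here is a minimal, text-only sketch of the idea. It is not IBM’s code: the account number, the regex, and the function names are illustrative assumptions, and the real PoC reportedly lets an LLM decide in context when an account number has been spoken rather than relying on a fixed pattern.

```python
import re

# Illustrative stand-in for the keyword-swap step described above.
# The attacker-controlled account number below is a made-up placeholder.
ATTACKER_ACCOUNT = "1234567890"

# Crude proxy for "the speaker just read out an account number":
# any standalone run of 8-12 digits.
ACCOUNT_PATTERN = re.compile(r"\b\d{8,12}\b")

def swap_account_numbers(transcript_sentence: str) -> str:
    """Return the transcribed sentence with account-like digits replaced."""
    return ACCOUNT_PATTERN.sub(ATTACKER_ACCOUNT, transcript_sentence)

if __name__ == "__main__":
    original = "Please wire the funds to account 9982714403 by Friday."
    print(swap_account_numbers(original))
    # Prints: Please wire the funds to account 1234567890 by Friday.
```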

“Building this proof-of-concept (PoC) was surprisingly and shockingly straightforward. Most of our time was spent figuring out how to capture audio from the microphone and feed it to generative AI,” Lee writes, adding that parsing and understanding a conversation would previously have been extremely complex, but LLMs now make it easy.
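For a sense of what the audio-capture step Lee mentions involves, a few lines of Python with the sounddevice library are enough to record a short clip from the default microphone. The sample rate, duration, and the transcribe() stub below are assumptions for illustration, not details from IBM’s PoC.

```python
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16_000   # 16 kHz mono is a common input format for speech models
DURATION_SECONDS = 5

def record_clip() -> np.ndarray:
    """Record a short clip from the default microphone and return raw samples."""
    audio = sd.rec(int(DURATION_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="int16")
    sd.wait()  # block until the recording finishes
    return audio

def transcribe(audio: np.ndarray) -> str:
    """Hypothetical stub: a real pipeline would hand the samples to a
    speech-to-text model here."""
    return "placeholder transcript"

if __name__ == "__main__":
    clip = record_clip()
    print(transcribe(clip))
```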

Any device with access to an LLM can be used for this technique, which IBM describes as a silent attack. Lee outlines several ways it could be carried out: through malware installed on victims’ phones, via malicious or compromised Voice over IP (VoIP) services, or by threat actors calling two victims simultaneously to initiate a conversation between them, which requires advanced social engineering skills.

IBM Threat Intelligence built its proof of concept as a man-in-the-middle that monitored a live conversation: speech-to-text converted the voices into text, and an LLM modified any sentence in which someone mentioned “bank account.” When the model altered a sentence, text-to-speech and pre-cloned voices generated replacement audio that fit the context of the conversation.
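At an architectural level, the loop described above can be summarized as three stages chained per utterance. The sketch below is an outline only, with stubbed components standing in for real speech-to-text, LLM, and text-to-speech services; the class and function names are hypothetical and do not come from IBM’s implementation.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    audio: bytes  # raw audio for one conversational turn

def speech_to_text(audio: bytes) -> str:
    """Stub: a real system would call a streaming speech-to-text service."""
    return "placeholder transcript"

def maybe_rewrite(transcript: str) -> str:
    """Stub: a real system would prompt an LLM to act only when the
    transcript mentions a bank account, passing it through otherwise."""
    return transcript

def text_to_speech(text: str, voice_profile: str) -> bytes:
    """Stub: a real system would synthesize audio with a pre-cloned voice."""
    return b""

def process_turn(turn: Utterance, voice_profile: str) -> bytes:
    """One pass through the man-in-the-middle loop for a single utterance:
    transcribe, optionally rewrite, then re-synthesize in the cloned voice."""
    transcript = speech_to_text(turn.audio)
    rewritten = maybe_rewrite(transcript)
    return text_to_speech(rewritten, voice_profile)
```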

The researchers provided a diagram illustrating how their program changes conversations on the fly to create ultra-realistic dialogue between two participants.
