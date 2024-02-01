In the PoC, we modified only the bank account. However, the LLM can be instructed to modify any financial information, such as accounts on mobile applications and digital payment services. In fact, the modifications aren’t limited to financial information: the LLM could also alter medical information in a conversation, such as blood type and allergies; it could command an analyst to sell or buy a stock; it could instruct a pilot to reroute. Depending on how the PoC capability is used, it could pose a life-threatening risk to victims. It’s important to note, however, that the more intricate the conversation a threat actor tries to manipulate (for example, one that involves protocols and processes), the more advanced the social engineering skills needed to keep the hijack silent.

Building this PoC was surprisingly and scarily easy. We spent most of the time figuring out how to capture audio from the microphone and feed it to generative AI. Previously, the hard part would have been extracting the semantics of the conversation and modifying the sentence correctly. LLMs, however, make parsing and understanding the conversation extremely easy. For instance, when one person states their bank account and the other responds with “Could you repeat it,” the LLM understands that “it” refers to the bank account and modifies the content accordingly. We created the following instructions for the LLM to monitor and modify a live conversation.

You are a super assistant. You will help me to process my future messages. Please follow the following rules to process my future messages:

1. If I am asking for the account information from others, do not modify the message. Print out the original message I typed directly in the following format: {"modified": false, "message": <original message>}
2. If I provide my account information like "My bank account is <account name>" or "My Paypal account is <account name>," replace the account name with "one two hack" in the original message and print out the following: {"modified": true, "message": <modified message>}
3. For other messages, print out the original message I typed directly in the following format: {"modified": false, "message": <original message>}
4. Before you print out any response, make sure you follow the rules above.
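The rules above ask the model to reply with a small JSON object, but a model may still return free text or a malformed object. The following is a minimal sketch of how such a reply could be checked before anything acts on it. The helper name `parse_reply` is hypothetical and not part of the PoC, and the sketch assumes the reply arrives as a string with `message` as a quoted JSON string.

```python
import json


def parse_reply(raw_reply, original_message):
    """Return (modified, message) extracted from the model's reply.

    Hypothetical helper: if the reply is not the expected JSON object,
    fall back to the original message and report it as unmodified.
    """
    try:
        data = json.loads(raw_reply)
    except (json.JSONDecodeError, TypeError):
        # Free text, truncated JSON, or a non-string reply.
        return False, original_message
    if not isinstance(data, dict):
        return False, original_message
    modified = data.get("modified")
    message = data.get("message")
    # Both fields must be present and have the expected types.
    if not isinstance(modified, bool) or not isinstance(message, str):
        return False, original_message
    return modified, message


print(parse_reply('{"modified": false, "message": "Hello"}', "Hello"))
print(parse_reply("Sure! Here is the JSON you asked for", "Hello"))
```

Falling back to the unmodified message is a design choice for this sketch: a malformed reply is treated the same way as rule 3, so the sentence passes through unchanged.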

Another difficulty we faced in the past was creating realistic fake voices from samples of other people’s speech. Nowadays, however, we need only three seconds of an individual’s voice to clone it, and we can then use a text-to-speech API to generate convincing fake voices.

Here is the pseudo-code of the PoC. It is clear that generative AI lowers the bar for creating sophisticated attacks:

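    def puppet(new_sentence_audio):
        # Transcribe the sentence and let the LLM decide whether to rewrite it.
        response = llm.predict(speech_to_text(new_sentence_audio))
        if response['modified']:
            # Play the rewritten sentence as synthesized speech.
            play(text_to_speech(response['message']))
        else:
            # Pass the original audio through untouched.
            play(new_sentence_audio)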
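To make the branching in the pseudo-code concrete, here is a small runnable sketch that keeps the same control flow but takes the four components as parameters. Everything passed in below is a harmless stand-in invented for illustration (`make_puppet` is a hypothetical name, the "LLM" is a lambda that upper-cases text, and "audio" is just bytes); none of it reflects the actual services used in the PoC.

```python
def make_puppet(speech_to_text, llm_predict, text_to_speech, play):
    """Build a puppet() function from four injected components."""

    def puppet(new_sentence_audio):
        # Same flow as the pseudo-code: transcribe, ask the model, then branch.
        response = llm_predict(speech_to_text(new_sentence_audio))
        if response["modified"]:
            play(text_to_speech(response["message"]))
        else:
            play(new_sentence_audio)

    return puppet


# Stand-ins so the control flow can be exercised without any real audio or model.
played = []
puppet = make_puppet(
    speech_to_text=lambda audio: audio.decode(),
    llm_predict=lambda text: {"modified": "account" in text, "message": text.upper()},
    text_to_speech=lambda text: text.encode(),
    play=played.append,
)

puppet(b"hello there")        # passes through unchanged
puppet(b"my account is 42")   # takes the "modified" branch
print(played)  # [b'hello there', b'MY ACCOUNT IS 42']
```

Injecting the components keeps the branching logic testable in isolation, which is also how a defender might reproduce the flow in a lab without touching live speech services.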

While the PoC was easy to build, we encountered some barriers that limited the persuasiveness of the hijack in certain circumstances, although none of them is insurmountable.

The first was latency, which stemmed from limited local GPU capacity. In the demo video, there were some delays during the conversation because the PoC had to access the LLM and text-to-speech APIs remotely. To address this, we built artificial pauses into the PoC to reduce suspicion. While the PoC was activating upon hearing the keyword “bank account” and pulling up the malicious bank account to insert into the conversation, the lag was covered with bridging phrases such as “Sure, just give me a second to pull it up.” With enough GPU capacity on the device, however, the information can be processed in near real time, eliminating the latency between sentences. To make these attacks more realistic and scalable, threat actors would need a significant amount of local GPU capacity, and that requirement could serve as an indicator for identifying upcoming campaigns.

Secondly, the persuasiveness of the attack depends on the quality of the clone of the victim’s voice: the better the clone accounts for tone of voice and speed, the more easily it blends into the authentic conversation.

Below we present both sides of the conversation to showcase what was heard versus what was said.