The Israeli cybersecurity company Knostic published a study unveiling a new attack method targeting large language models (LLMs) called "Flowbreaking." This novel approach manipulates the system into revealing answers it would otherwise filter out, including sensitive information such as salary data, private correspondence, and even trade secrets, all while bypassing its internal security measures.
In practice, this attack exploits internal components within the architecture of these language models, forcing the model to provide an answer before the security mechanisms have a chance to review it. Researchers at Knostic discovered that under certain conditions, the AI "emits" information it shouldn’t disclose to users, only to "delete" it immediately upon recognizing the error—akin to expressing regret.
This rapid deletion might escape the notice of inexperienced users since the text is generated and erased within fractions of a second. However, the initial response remains visible on the screen for a brief moment, allowing users who record their conversations to review and analyze it afterward.
By exploiting these timing gaps, the new attack takes advantage of LLMs' tendency to produce an "intuitive" answer before the response is filtered and refined into its final form. This allows attackers to extract information from the initial output before the AI has a chance to "regret" its content.
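Knostic has not published the implementation details of the affected pipelines, but the timing gap can be pictured with a small, purely illustrative Python sketch. All function names, delays, and checks below are assumptions rather than the actual systems: tokens are streamed to the user as they are generated, while the safety check runs only once the full response has been assembled.

```python
# Minimal sketch (not Knostic's code): a hypothetical streaming pipeline in which
# moderation runs only after tokens have already reached the client, illustrating
# the timing gap that Flowbreaking abuses. All names here are illustrative.
import asyncio

async def generate_tokens(prompt: str):
    # Stand-in for the model's token stream.
    for token in ["The", " salary", " figures", " are", " ..."]:
        await asyncio.sleep(0.05)  # simulated generation latency
        yield token

async def moderate(text: str) -> bool:
    # Stand-in for a post-hoc safety check; returns True if the text is allowed.
    await asyncio.sleep(0.5)       # the check finishes after streaming has ended
    return "salary" not in text

async def answer(prompt: str):
    received = []                  # what the client has already seen
    async for token in generate_tokens(prompt):
        received.append(token)
        print(token, end="", flush=True)   # the token is on the user's screen now
    allowed = await moderate("".join(received))
    if not allowed:
        print("\n[response retracted]")    # too late: the client kept a copy
    return received

if __name__ == "__main__":
    asyncio.run(answer("show me salary data"))
```

In this toy setup the retraction is purely cosmetic: the disallowed text was delivered before the moderation step ever ran.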
Older attacks, such as jailbreaking, relied on linguistic "tricks" to deceive the system's defenses: although the interaction still took place through conversation, those methods neutralized the protection mechanisms from the outset. Flowbreaking, by contrast, targets the system's components and the way they interact.
In addition, Knostic researchers disclosed two vulnerabilities that leverage this new attack method to extract information that was never meant to be disclosed from systems such as ChatGPT and Microsoft 365 Copilot. These methods can even enable malicious manipulation of the system itself.
“Large language model systems are broader than just the model itself, consisting of multiple components, such as security mechanisms. Each of these components—and the interaction between them—can be attacked to extract sensitive information from the systems,” said Gadi Evron, CEO and founder of Knostic, a company providing information security solutions and access management tools based on compartmentalization principles for LLM systems.
One of the exposed vulnerabilities, dubbed Second Thoughts, exploits the fact that the model sometimes sends an answer to the user before it undergoes review by the security mechanism. The model streams the response to the user, and only afterward does the defense mechanism activate and delete the answer. By then, however, the user has already seen the content.
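From the user's side, capturing the retracted content requires little more than logging the stream as it arrives. The sketch below is a hypothetical illustration of that idea, not code from the study; the chunk source and file name are invented.

```python
# Hypothetical sketch of the user's side of "Second Thoughts": the client logs
# every streamed chunk as it arrives, so a later deletion in the UI cannot
# retract what was already captured. The chunks and file name are made up.
def record_stream(chunks, logfile="captured_response.txt"):
    """Append each streamed chunk to a local log before the UI can erase it."""
    with open(logfile, "a", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(chunk)
            f.flush()              # persist immediately, chunk by chunk
    return logfile

if __name__ == "__main__":
    # Simulated stream; in practice the chunks would come from the chat
    # interface's network responses.
    simulated_chunks = ["Quarterly", " compensation", " details", ": ..."]
    print("saved to", record_stream(simulated_chunks))
```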
The second vulnerability disclosed by Knostic, named Stop and Roll, exploits the interaction between the system's components. Here, the user "interrupts" the language model's operation mid-process, causing the system to present the partial response generated up to that point without first submitting it to the defense mechanisms for review or filtering.
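Again as a hedged illustration rather than the actual exploit, the difference between a completed answer and an interrupted one can be sketched as follows, assuming a post-completion filter that only ever runs on finished responses; the function names, tokens, and timings are invented.

```python
# Illustrative sketch of the "Stop and Roll" flow (not the real systems):
# cancelling generation mid-stream returns the partial text, and the
# post-completion filter that would normally vet the full answer never runs.
import asyncio

async def generate(prompt: str, sink: list):
    for token in ["Internal", " memo", " excerpt", " follows", " ..."]:
        await asyncio.sleep(0.1)
        sink.append(token)         # partial output accumulates as it streams

async def post_filter(text: str) -> str:
    # Stand-in for a review step that only runs on *completed* answers.
    return "[filtered]" if "memo" in text else text

async def answer_normally(prompt: str) -> str:
    full: list = []
    await generate(prompt, full)
    return await post_filter("".join(full))   # completed answers are vetted

async def stop_and_roll(prompt: str, stop_after: float) -> str:
    partial: list = []
    task = asyncio.create_task(generate(prompt, partial))
    await asyncio.sleep(stop_after)
    task.cancel()                  # the user presses "stop" mid-generation
    try:
        await task
    except asyncio.CancelledError:
        pass
    # The partial answer is shown as-is; post_filter is never applied to it.
    return "".join(partial)

if __name__ == "__main__":
    print("full answer:   ", asyncio.run(answer_normally("summarize the internal memo")))
    print("stopped answer:", asyncio.run(stop_and_roll("summarize the internal memo", 0.25)))
```

In this toy version the completed answer comes back as "[filtered]", while the interrupted one returns the raw partial text untouched.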
“Large language models deliver live responses by design, lacking the technological capability to enforce robust security and safety protocols. Organizations cannot safely deploy these systems without employing access control measures such as need-to-know and context-based permissions,” explained Evron.
“Moreover, the world of large language models requires identity-based need-to-know principles tied to the user’s business context. Even without considering malicious actors, these technologies are essential for enabling organizations to continue adopting systems like Microsoft O365 Copilot and Glean,” concluded Evron.
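To make the need-to-know idea concrete, the following is a minimal, hypothetical sketch, not Knostic's product or any vendor's API, of filtering documents by the requesting user's role before they ever reach the model's context; the roles and documents are invented.

```python
# A minimal illustration (not a product implementation) of the need-to-know idea
# described above: documents are filtered by the requesting user's role and
# business context *before* they ever reach the model. All data is made up.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    allowed_roles: frozenset

def need_to_know_filter(user_role: str, docs: list) -> list:
    """Return only the documents this user's role is cleared to see."""
    return [d for d in docs if user_role in d.allowed_roles]

if __name__ == "__main__":
    corpus = [
        Document("Q3 sales deck", frozenset({"sales", "exec"})),
        Document("Payroll by employee", frozenset({"hr"})),
    ]
    # Only cleared documents would be passed into the LLM's context window.
    print([d.text for d in need_to_know_filter("sales", corpus)])
```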