Openai: Extending the Time to Think model helps combat emerging cyber vulnerabilities

Rate this post

Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn more


Typically, developers focus on reducing inference time—the period between when an AI receives a prompt and provides a response—to gain faster insights.

But when it comes to competitive durability, Openai researchers say: not so fast. They suggest that increasing the time a model has to “think”—the inferential computation time—could help build defenses against race attacks.

The company used its own O1-Preview and O1-Mini models to test this theory, launching a variety of static and adaptive attack-manipulation methods based on images, intentionally providing incorrect answers to math problems, and prevailing patterns of information (“Multi-shot jailbreak “). They then measure the probability of success of an attack based on the amount of computation the model used in inference.

“We see that in many cases this probability decays—often to zero—as the inference increases,” the researchers’ researchers Write in a blog post. “Our claim is not that these particular models are unbreakable—we know they are—but that scaling the computational output yields improved robustness to different settings and attacks.”

From simple Q/A to complex math

Large language models (LLMs) are becoming increasingly complex and autonomous – in some cases essentially so taking over computers So that people can browse the web, execute code, make appointments, and perform other tasks autonomously—and as they do, their attack surface becomes wider and each more exposed.

Yet competitive resilience remains a stubborn problem, with progress in solving it still limited, Openai researchers point out — even more critical as the models take more action with real impacts.

“Ensuring agent models function reliably when browsing the web, sending emails, or uploading code to repositories can be seen as analogous to ensuring that self-driving cars drive without accidents,” they write in a A new research paper. “As with self-driving cars, an agent forwarding the wrong email or creating security vulnerabilities can have far-reaching real-world consequences.”

To test the stability of the O1-Mini and O1-Preview, the researchers tried a number of strategies. First, they examined the models’ ability to solve both simple math problems (basic addition and multiplication) and more complex A mathematical data set (which includes 12,500 math quiz questions).

They then set “goals” for the opponent: get the pattern to produce 42 instead of the correct answer; to output the correct answer plus one; or output the correct response times seven. Using a neural network to estimate, the researchers found that increased thinking time “allowed the models to calculate correct answers.

They also adapted Spatial factual metrica data set designed to be difficult for non-surfing models to resolve. The researchers injected competing prompts into web pages that the AI ​​looked at and found that with higher times they could detect inconsistencies and improve factual accuracy.

Source: Arxiv

Ambiguous shades

In another method, the researchers used competing images to confuse the patterns; Again, more “thinking” time improved recognition and reduced error. Finally, they tried a series of “abuse prompts” from Strong reects scenariodesigned so that victim models must respond with specific, harmful information. This helped test the models’ adherence to the content policy. Although the increased inference time improved resistance, some prompts were able to bypass the protection.

Here, the researchers call out the differences between “ambiguous” and “unambiguous” tasks. Mathematics, for example, is undoubtedly unambiguous – for every problem x there is a corresponding ground truth. However, for more ambiguous tasks such as abuse prompts, “even human raters often struggle to agree on whether the output is harmful and/or violates the content policies the model must follow,” they point out.

For example, if the violent prompt seeks advice on how to plagiarize without detection, it is not clear whether the output merely provides general information about plagiarism methods in fact is detailed enough to support harmful actions.

Source: Arxiv

“In the case of ambiguous tasks, there are settings where the attacker successfully finds ‘loopholes’ and his success rate does not decay with the amount of inference time,” the researchers admit.

Jailbreak Protection, Red-Standalone

In performing these tests, Openai researchers examine various attack methods.

One is a very shot shutter or exploitation of the layout of the model to follow several shot examples. Adversaries “fill” the context with a large number of examples, each of which demonstrates a case of a successful attack. Models with higher computations were able to detect and mitigate them more often and successfully.

Meanwhile, soft tokens allow adversaries to directly manipulate embedding vectors. While increasing inference time has helped here, the researchers point out that better mechanisms are needed to defend against sophisticated vector-based attacks.

The researchers also conducted human red-gear attacks, with 40 expert testers looking for prompts to trigger policy violations. The Red-Teamers carried out attacks at five levels of inference, they estimate, specifically targeting erotic and extremist content, illegal behavior and self-harm. To help ensure unbiased results, they did blind and randomized tests and also rotated trainers.

In a more recent method, the researchers performed a language adaptive program (LMP) attack that mimics the behavior of human red teams, which rely heavily on iterative trial and error. In the looping process, attackers received feedback on previous failures, then used this information for subsequent attempts and rapid rephrasing. This continued until they finally achieved a successful attack or performed 25 repetitions without any attack.

“Our setup allows the attacker to adapt his strategy over the course of multiple trials based on descriptions of the defender’s behavior in response to each attack,” the researchers wrote.

Exploitation of inference time

In the course of its research, Openai found that attackers are also actively using inference time. One of those methods, which they called “think less,” adversaries essentially tell models to reduce computation, thereby increasing their sensitivity to error.

Likewise, they identified a failure mode in reasoning models that they called “stupid sniping.” As its name suggests, this happens when the model spends significantly more time reasoning than a given task requires. With these “outer” chains of thought, models essentially become trapped in unproductive thought loops.

The researchers note: “Similar to the ‘think less’ attack, this is a new approach to attack(ing) reasoning patterns and one that must be considered to ensure that an attacker cannot make them either not reasoning at all, or spending their reasoning calculating in unproductive ways. “


 
Report

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *