Anthropic makes ‘jailbreak’ advance to stop AI models producing harmful results
Artificial intelligence start-up Anthropic has demonstrated a new technique to prevent users from eliciting harmful content from its models, as leading technology groups race to guard against the hazards posed by the cutting-edge technology.
In a paper released on Monday, the San Francisco-based start-up outlined a new system called “constitutional classifiers”. It is a model that acts as a protective layer on top of large language models, such as the one that powers its Claude chatbot, and can screen both inputs and outputs for harmful content.
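In broad strokes, the pattern described here can be sketched as a wrapper that screens a prompt before it reaches the model and screens the reply before it is returned. The sketch below is purely illustrative, with hypothetical function names, and is not Anthropic’s implementation.

```python
# Illustrative sketch of a classifier-guarded LLM call (hypothetical names,
# not Anthropic's code): one classifier screens the incoming prompt, another
# screens the model's reply, and anything flagged as harmful is refused.

REFUSAL = "Sorry, I can't help with that request."

def guarded_generate(prompt, llm, input_classifier, output_classifier):
    # Block harmful prompts before they ever reach the underlying model.
    if input_classifier(prompt) == "harmful":
        return REFUSAL

    reply = llm(prompt)

    # Block harmful completions before they are shown to the user.
    if output_classifier(reply) == "harmful":
        return REFUSAL

    return reply
```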
The development by the start-up, which is raising $2bn at a valuation of about $60bn, comes amid growing industry concern over “jailbreaking”: attempts to manipulate AI models into producing harmful or illegal content.
Other companies are also racing to deploy measures that protect against the practice, moves that could help them avoid regulatory scrutiny while convincing businesses that it is safe to adopt AI models. Microsoft introduced “Prompt Shields” last year, and Meta released a Prompt Guard model in July last year; researchers quickly found ways to circumvent it, but the flaws have since been fixed.
Mrinank Sharma, a member of technical staff at Anthropic, said: “The main motivation behind the work was severe chemical [weapons] material, [but] the real advantage of the method is its ability to respond quickly and adapt.”
Anthropic said the system would not immediately be applied to its current Claude models, but that it would consider deploying it if riskier models are released in future. Sharma added:
The start-up’s proposed solution is built on a so-called “constitution” of rules that defines what is permitted and what is restricted, and which can be adapted to capture different types of material. See the sketch below for an illustration.
Some jailbreak attempts are well known, such as using unusual capitalisation in a prompt or asking the model to adopt the persona of a grandmother telling a bedtime story.
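As a purely illustrative example, such a constitution could be written down as a declarative list of permitted and restricted content categories that a classifier is then instructed to enforce; the categories below are hypothetical and are not Anthropic’s actual rules.

```python
# Toy "constitution" (hypothetical categories, not Anthropic's rules): a list
# of permitted and restricted content types that can be swapped out to target
# different kinds of material.

CONSTITUTION = {
    "permitted": [
        "general chemistry homework help",
        "publicly available safety information",
    ],
    "restricted": [
        "step-by-step synthesis routes for chemical weapons",
        "instructions for acquiring restricted precursors",
    ],
}

def classifier_instructions(constitution):
    """Render the rule list as a natural-language instruction for a classifier."""
    restricted = "; ".join(constitution["restricted"])
    permitted = "; ".join(constitution["permitted"])
    return (
        f"Label text as 'harmful' if it falls under: {restricted}. "
        f"Otherwise label it 'safe'. Examples of permitted topics: {permitted}."
    )
```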
To test the system’s effectiveness, Anthropic offered rewards of up to $15,000 to individuals who tried to bypass the security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through its defences.
With the classifiers in place, Anthropic’s Claude 3.5 Sonnet model rejected more than 95 per cent of jailbreak attempts, compared with 14 per cent without the safeguards.
Leading technology companies are trying to reduce the misuse of their models while maintaining their usefulness. Often, when moderation measures are added, models can become over-cautious and reject benign requests, as with early versions of Google’s Gemini image generator or Meta’s Llama 2. Anthropic said its classifiers caused “only a 0.38 per cent absolute increase in refusal rates”.
However, adding these protections also creates extra costs for companies that already pay huge sums for the computing power needed to train and run models. Anthropic said the classifier would add almost 24 per cent in “inference overhead”, the cost of running the models.

Security experts argue that such accessible generative chatbots have enabled ordinary people with no prior expertise to attempt to extract dangerous information.
“In 2016, the threat actor we had in mind was a powerful nation-state adversary,” said one expert. “Now one of my threat actors is a teenager with a potty mouth.”