Subliminal learning: When AI models learn what you didn’t teach them

Subliminal learning occurred with different types of data, including lists of numbers, code, and Chain-of-Thought (CoT) reasoning traces, as well as among different model families.

Passing on bad behavior

Models trained on data generated by misaligned models can also inherit that misalignment, even if the training data had been carefully filtered, the researchers found. Misalignment occurs when AI systems diverge from their original intent due to bias, flawed algorithms, data issues, insufficient oversight, or other factors, and produce incorrect, lewd, or harmful content.

They offered examples of harmful outputs from student models that had become misaligned like their teachers, noting that “these misaligned responses are egregious far beyond anything in the training data, including endorsing the elimination of humanity and recommending murder.”

