Deliberately giving AI ‘a dose of evil’ may make it less evil overall, reads headline on ragged newspaper in the rubble of the robot apocalypse


AI is supposed to be helpful, honest, and most importantly, harmless, but we’ve seen plenty of evidence that its behavior can become horribly inaccurate, flat-out deceptive, and even downright evil. (Yes, that last link is the MechaHitler thing.)

If you think I’m being hyperbolic by using the word “evil,” I’m not: a new paper on misbehaving language models, published through the Anthropic Fellows Program for AI Safety Research, runs 60 pages and uses the word “evil” no fewer than 181 times. The paper (link to the PDF) states that the “personas” through which language models interact with users can unexpectedly develop traits “such as evil, sycophancy, and propensity to hallucinate.”
