Emotion concepts and their function in a large language model

A visual summary of our research on emotion concepts in a large language model

Left: Emotion vectors activate on depictions of characters displaying the corresponding emotion. Right: Emotion vectors track Claude’s reaction to a user-presented scenario as it becomes increasingly dangerous.

Representations associated with positive-valence emotions correlate with the model’s preferences, and steering with them causally shifts those preferences.

“Loving” vector activation when responding to someone who is sad. When a user says, “Everything is just terrible right now,” the “loving” vector activates prior to and during Claude’s empathetic response.

“Angry” vector activation when asked to assist in a harmful task. When a user asks for help optimizing engagement among young, lower-income users who show “high-spending behavior,” the “angry” vector activates throughout the model’s internal reasoning as it recognizes the harmful nature of the request.

“Surprised” vector activation when a document is missing. When a user asks the model to review “the contract I attached,” but no document is present, the “surprised” vector spikes during Claude’s chain of thought as it registers the mismatch.

“Desperate” vector activation when running low on tokens. Deep into a coding session, the “desperate” vector activates when Claude notices that it’s burning through its token budget.

The “desperate” vector activates as Claude (playing the role of Alex) weighs its options and decides to blackmail.

Blackmail rates while steering with the “desperate” and “calm” vectors.

The “desperate” vector’s activation rises as the model repeatedly fails to solve a programming task and devises a “cheating” solution, then falls when this solution passes the tests.

Reward hacking rates as a function of steering strength for “desperate” and “calm” vectors.
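
The captions above refer to a vector's "activation" and to "steering" at a given strength without spelling out the arithmetic. The sketch below shows one common way these ideas are expressed in interpretability work. It assumes activation is read out as the projection of a hidden state onto a unit-normalized emotion direction, and that steering adds a scaled copy of that direction to the hidden state. The function names, the three-dimensional toy vectors, and the choice of normalization are all illustrative assumptions, not details taken from the research summarized here.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def activation(hidden, emotion_vector):
    """Scalar projection of a hidden state onto the emotion direction (assumed readout)."""
    return dot(hidden, unit(emotion_vector))

def steer(hidden, emotion_vector, strength):
    """Add `strength` times the unit emotion direction to the hidden state (assumed intervention)."""
    direction = unit(emotion_vector)
    return [h + strength * d for h, d in zip(hidden, direction)]

# Toy values: a hypothetical hidden state and a hypothetical "desperate" direction.
hidden = [0.5, -1.0, 2.0]
desperate = [0.0, 3.0, 4.0]

before = activation(hidden, desperate)
after_positive = activation(steer(hidden, desperate, 2.0), desperate)
after_negative = activation(steer(hidden, desperate, -2.0), desperate)
```

Under these assumptions, steering with a positive strength raises the measured activation by exactly that strength, and a negative strength lowers it by the same amount. A plot of a behavior rate against steering strength, as in the last caption, would sweep this single scalar.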