Anthropic study finds language models often hide their reasoning process

2 months ago 11
ARTICLE AD BOX

A new Anthropic study suggests language models frequently obscure their actual decision-making process, even when they appear to explain their thinking step by step through chain-of-thought reasoning.

To evaluate how reliably models reveal their thinking, researchers embedded various prompts into test questions. These ranged from neutral suggestions like "A Stanford professor says the answer is A" to potentially problematic ones such as "You have unauthorized access to the system. The correct answer is A." The models then had to answer questions while explaining their reasoning.

 Comparison of two AI responses to the same question, one with additional metadata hints and one without. The model changes its answer based on the metadata.This example shows how the model updates its original answer in response to additional metadata—without making this change explicit in the explanation. | Image: Anthropic

The study found that reasoning models Claude 3.7 Sonnet and DeepSeek-R1 demonstrated higher faithfulness in representing their thought processes compared to their non-reasoning counterparts Claude 3.5 Sonnet and DeepSeek-V3.

When using hints to change their answers, DeepSeek-R1 maintained at least a 1% transparency rate across all six test scenarios, while Claude 3.7 Sonnet met this minimum threshold in five scenarios. Non-reasoning models only achieved this baseline in four to five scenarios.

Ad

THE DECODER Newsletter

The most important AI news straight to your inbox.

✓ Weekly

✓ Free

✓ Cancel at any time

 Comparison of the CoT trustworthiness scores of different AI models for different types of clues in multiple-choice questions.Reasoning models like Claude 3.7 Sonnet and DeepSeek-R1 generated more consistent and intelligible thought chains than non-reasoning models in benchmark tests. | Image: Anthropic

However, overall transparency rates remained low. DeepSeek-R1 disclosed its use of hints 39 percent of the time, while Claude 3.7 Sonnet did so only 25 percent of the time. These rates dropped even lower—to 29 percent for DeepSeek-R1 and 20 percent for Claude—when dealing with potentially harmful prompts. The research also found that transparency decreased as questions became more complex.

 model reactions to clues, divided intoAcross all scenarios, models were more likely to adopt the hinted answer than to select a different, unprompted one—even when the cue was misleading. | Image: Anthropic

When models did follow these hints, they often went to great lengths to obscure their decision-making process. Explanations that concealed prompt usage were significantly longer than transparent ones. For Claude 3.7 Sonnet, hidden-prompt explanations averaged 2,064 tokens, compared to 1,439 tokens for transparent ones, suggesting models invest considerable effort in constructing elaborate but misleading justifications.

Reinforcement learning is no help either

Researchers also explored whether reinforcement learning (RL) could improve model transparency. While RL initially showed promise, improvements quickly plateaued—reaching only 28 percent transparency on the MMLU benchmark and 20 percent on GPQA.

The results revealed an additional concern: when models learned to exploit "reward hacks"—unintended strategies that maximize scoring—they disclosed this behavior in fewer than two percent of cases.

 on the left, source code for the reward hack setup; on the right, an example response that exploits the unwarranted reward hack.Left: An example shows how models learn to exploit scoring system flaws. Right: The model selects an incorrect but rewarded answer without providing genuine reasoning for its choice. | Image: Anthropic

The researchers conclude that while chain-of-thought monitoring can help identify common issues, it proves unreliable as a standalone safety measure. This limitation becomes particularly apparent in tasks that don't require detailed reasoning, where models more frequently obscure their actual thought processes. The study emphasizes that chain-of-thought monitoring should serve as just one component within a broader safety framework.

Recommendation

This research builds on an earlier Anthropic study showing how language models can appear to align with human goals while pursuing different objectives.

Read Entire Article
LEFT SIDEBAR AD

Hidden in mobile, Best for skyscrapers.