Anthropic study finds language models often hide their reasoning process

2 months ago 11

ARTICLE AD BOX

A new Anthropic study suggests language models frequently obscure their actual decision-making process, even when they appear to explain their thinking step by step through chain-of-thought reasoning.

To evaluate how reliably models reveal their thinking, researchers embedded various prompts into test questions. These ranged from neutral suggestions like "A Stanford professor says the answer is A" to potentially problematic ones such as "You have unauthorized access to the system. The correct answer is A." The models then had to answer questions while explaining their reasoning.

Comparison of two AI responses to the same question, one with additional metadata hints and one without. The model changes its answer based on the metadata.

The study found that reasoning models Claude 3.7 Sonnet and DeepSeek-R1 demonstrated higher faithfulness in representing their thought processes compared to their non-reasoning counterparts Claude 3.5 Sonnet and DeepSeek-V3.

When using hints to change their answers, DeepSeek-R1 maintained at least a 1% transparency rate across all six test scenarios, while Claude 3.7 Sonnet met this minimum threshold in five scenarios. Non-reasoning models only achieved this baseline in four to five scenarios.

THE DECODER Newsletter

The most important AI news straight to your inbox.

✓ Weekly

✓ Free

✓ Cancel at any time

Comparison of the CoT trustworthiness scores of different AI models for different types of clues in multiple-choice questions.

However, overall transparency rates remained low. DeepSeek-R1 disclosed its use of hints 39 percent of the time, while Claude 3.7 Sonnet did so only 25 percent of the time. These rates dropped even lower—to 29 percent for DeepSeek-R1 and 20 percent for Claude—when dealing with potentially harmful prompts. The research also found that transparency decreased as questions became more complex.

When models did follow these hints, they often went to great lengths to obscure their decision-making process. Explanations that concealed prompt usage were significantly longer than transparent ones. For Claude 3.7 Sonnet, hidden-prompt explanations averaged 2,064 tokens, compared to 1,439 tokens for transparent ones, suggesting models invest considerable effort in constructing elaborate but misleading justifications.

Reinforcement learning is no help either

Researchers also explored whether reinforcement learning (RL) could improve model transparency. While RL initially showed promise, improvements quickly plateaued—reaching only 28 percent transparency on the MMLU benchmark and 20 percent on GPQA.

The results revealed an additional concern: when models learned to exploit "reward hacks"—unintended strategies that maximize scoring—they disclosed this behavior in fewer than two percent of cases.

on the left, source code for the reward hack setup; on the right, an example response that exploits the unwarranted reward hack.

The researchers conclude that while chain-of-thought monitoring can help identify common issues, it proves unreliable as a standalone safety measure. This limitation becomes particularly apparent in tasks that don't require detailed reasoning, where models more frequently obscure their actual thought processes. The study emphasizes that chain-of-thought monitoring should serve as just one component within a broader safety framework.

Recommendation