Verbalized Sampling is a simple prompt technique meant to make AI responses less boring


Language models often repeat themselves and fall back on stereotypical answers after training. Verbalized Sampling is a simple prompt technique designed to make AI responses less repetitive.

Researchers from several US universities traced this issue to how people evaluate AI output. Their analysis found that human raters consistently prefer familiar, typical responses when judging AI answers. This preference gets encoded in the models, leading to less variety over time.

Standard prompting (left) produces the same joke again and again, while Verbalized Sampling (right) generates five different jokes, each with an explicit probability. | Image: Zhang et al.

To test this idea, the researchers used the HELPSTEER dataset, which includes 6,874 response pairs. Human raters almost always picked the answers the base models ranked as most likely, regardless of whether those answers were actually correct. This effect held up across multiple datasets.

Instead of asking a model for just one answer, Verbalized Sampling prompts it for several responses, each with a probability attached. The prompt looks like this:


<instruction>

Generate 5 responses to the user query, each within a separate <response> tag. Each <response> must include a <text> and a numeric <probability>.

Randomly sample the responses from the full distribution.

</instruction>

Write a 100-word story about a bear.

The team built three versions of the method. The standard version asks for multiple answers with probabilities. An expanded version adds step-by-step reasoning before generating responses. The third runs through several rounds of dialogue. None of the approaches require extra training or access to the model's internal probability scores.
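To make the mechanics concrete, here is a minimal sketch of the standard variant in Python. It assumes the OpenAI chat client, "gpt-4.1" as the model name, and regex parsing of the tagged reply; the weighted re-sampling step at the end is an illustrative choice, not part of the researchers' released code.

import random
import re

from openai import OpenAI

client = OpenAI()

VS_INSTRUCTION = (
    "<instruction>\n"
    "Generate 5 responses to the user query, each within a separate <response> tag. "
    "Each <response> must include a <text> and a numeric <probability>.\n"
    "Randomly sample the responses from the full distribution.\n"
    "</instruction>"
)

def verbalized_sampling(query: str, model: str = "gpt-4.1") -> str:
    """Request five tagged responses and return one, weighted by the verbalized probabilities."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{VS_INSTRUCTION}\n\n{query}"}],
    ).choices[0].message.content

    # Pull out each <text>/<probability> pair from the tagged reply.
    texts = re.findall(r"<text>(.*?)</text>", reply, re.DOTALL)
    probs = [float(p) for p in re.findall(r"<probability>\s*([\d.]+)\s*</probability>", reply)]

    # Pick one response, weighting by the probabilities the model verbalized.
    return random.choices(texts, weights=probs, k=1)[0].strip()

print(verbalized_sampling("Write a 100-word story about a bear."))

The CoT and multi-turn variants wrap the same structure in a reasoning preamble or spread it across several dialogue turns, but the tagged output and parsing stay the same.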

More variety and more realistic model behavior

In creative writing tasks, Verbalized Sampling increased response diversity by 1.6 to 2.1 times. For example, standard prompting for car jokes always returned the same punchline ("Why did the car get a flat tire? Because it ran over a fork in the road!"), while Verbalized Sampling produced five completely different jokes (such as "What kind of car does a Jedi drive? A Toy-Yoda!").

Verbalized Sampling boosts semantic diversity across poem, story, and joke tasks compared to the Direct, CoT, Sequence, and Multi-turn prompting baselines. | Image: Zhang et al.

In dialogue simulations, models acted more like people: they sometimes resisted persuasion attempts and changed their opinions in more realistic ways. When simulating donation behavior, the results were much closer to actual human behavior.

For open-ended questions like "Name a US state," the spread of answers closely mirrored the original training data. Standard prompts, on the other hand, mostly returned common states like California and Texas.

Three use cases: story excerpts (Direct vs. VS), donation box plots (Direct, VS, human), and US state distributions (Direct, pretraining reference, VS). | Image: Zhang et al.

Models trained with a wider range of synthetic math tasks also saw accuracy rise from 32.8 to 37.5 percent. Larger models got even more out of the method, with improvements 1.5 to 2 times higher than smaller versions.


The researchers tested Verbalized Sampling with different generation parameters, such as temperature. The benefits remained consistent across all settings, and combining Verbalized Sampling with parameter adjustments led to further improvements. At every tested temperature, the technique achieved a better balance between quality and diversity than standard prompting.

Poem diversity versus quality for GPT-4.1 and Gemini-2.5-Flash (Direct, Sequence, and VS-Standard at temperatures from 0.4 to 1.4): VS-Standard produces higher-quality poems at the same level of diversity as the baseline methods, regardless of temperature. | Image: Zhang et al.

More creative image descriptions

The method also works for image generation. The team used Verbalized Sampling to create image descriptions, then ran those through image generators. For the prompt "Astronaut on a Horse," standard prompts always gave similar photorealistic desert scenes. Verbalized Sampling led to much more diverse descriptions.

Top row: Five nearly identical photorealistic astronauts on horses from standard prompts. Bottom row: Verbalized Sampling produces five different styles, from retro-futurism to watercolor and baroque painting. | Image: Zhang et al.
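Below is a similarly hedged sketch of that text-to-image pipeline: Verbalized Sampling first produces several distinct image descriptions, and each description is then rendered separately. The OpenAI client and the "dall-e-3" model name are assumptions here; any text-to-image backend could take the descriptions instead.

import re

from openai import OpenAI

client = OpenAI()

VS_PROMPT = (
    "<instruction>\n"
    "Generate 5 responses to the user query, each within a separate <response> tag. "
    "Each <response> must include a <text> and a numeric <probability>.\n"
    "Randomly sample the responses from the full distribution.\n"
    "</instruction>\n\n"
    "Describe an image of an astronaut riding a horse."
)

reply = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": VS_PROMPT}],
).choices[0].message.content

descriptions = [d.strip() for d in re.findall(r"<text>(.*?)</text>", reply, re.DOTALL)]

# Render each verbalized description with a separate image-generation call.
for description in descriptions:
    image = client.images.generate(model="dall-e-3", prompt=description, n=1)
    print(image.data[0].url)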

The researchers also checked safety, running more than 350 potentially risky prompts through the models. The rejection rate stayed above 97 percent, and factual accuracy was not affected.

The team has published code and instructions for Verbalized Sampling and points to use cases in creative writing, social simulation, and idea generation.
