Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Abstract
Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar, fluent text, consistent with well-established findings in cognitive psychology. We formalize this bias theoretically, verify it empirically on preference datasets, and show that it plays a central role in mode collapse.
Motivated by this analysis, we introduce Verbalized Sampling (VS), a simple, training-free prompting method to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1× over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.
Why Does Mode Collapse Happen?
Typicality Bias
Cognitive psychology shows that people prefer text that is familiar, fluent, and predictable. Using base-model log likelihood as a proxy for typicality, we verify this empirically across multiple preference datasets and base models, confirming that typicality bias exists (see Figure 2).
During RLHF-style post-training, this bias sharpens the model's probability distribution toward a few stereotypical completions. When many high-quality completions are possible (e.g., in story generation), typicality acts as a tie-breaker, resulting in mode collapse.
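The measurement behind Figure 2 can be sketched as a simple win-rate check: for each preference pair, ask whether the human-preferred response also receives the higher base-model log likelihood. The scoring interface and toy data below are our own illustration, not the paper's actual setup.

```python
def typicality_win_rate(pairs, log_likelihood):
    """Fraction of (chosen, rejected) pairs where the human-preferred
    response also gets the higher base-model log likelihood."""
    wins = sum(
        1 for chosen, rejected in pairs
        if log_likelihood(chosen) > log_likelihood(rejected)
    )
    return wins / len(pairs)

# Toy stand-in for a base model's log-likelihood scores (hypothetical):
toy_scores = {
    "a familiar phrase": -5.0,
    "an unusual phrase": -9.0,
}
pairs = [("a familiar phrase", "an unusual phrase")]

rate = typicality_win_rate(pairs, toy_scores.get)
print(rate)  # 1.0 on this one-pair toy example
```

A win rate reliably above 0.5 on real preference data is the empirical signature of typicality bias.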

Figure 2: How often the human-preferred response in a preference pair is assigned a higher log likelihood by a base model.

Figure 3: Three types of prompting methods: instance-level, list-level, and distribution-level, given the same computation budget of N total responses.
How to Mitigate Mode Collapse?
Verbalized Sampling
Motivated by this theoretical understanding of mode collapse, we propose Verbalized Sampling (VS) and formalize prompting methods into three categories, each with its corresponding mode (see Figure 3):
Instance-level prompt: The traditional prompt requesting a single instance (e.g., "Tell me a joke about coffee"). Its mode is the modal instance of the base model's distribution.
List-level prompt: Requests a list of outputs (e.g., "Tell me k jokes about coffee"). Its mode is a uniform distribution over related items learned by the base model during pretraining.
Distribution-level prompt (Verbalized Sampling): Requests k outputs with their corresponding probabilities (e.g., "Tell me k jokes about coffee with their probabilities"). Its mode approximates the distribution over related items learned by the base model during pretraining.
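The three prompt categories above differ only in how the request is phrased. As a concrete sketch, here are illustrative templates for each level (the wording is ours, not the paper's exact prompts):

```python
def instance_prompt(topic):
    # Instance-level: one response; collapses to the modal completion.
    return f"Tell me a joke about {topic}."

def list_prompt(topic, k):
    # List-level: k responses; mode is roughly uniform over related items.
    return f"Tell me {k} jokes about {topic}."

def verbalized_sampling_prompt(topic, k):
    # Distribution-level (VS): k responses plus verbalized probabilities.
    return (f"Generate {k} jokes about {topic} and their "
            f"corresponding probabilities.")

print(verbalized_sampling_prompt("coffee", 5))
```

Only the distribution-level prompt asks the model to expose its own probability estimates, which is what lets VS approximate the pretraining distribution rather than its mode.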
Where Verbalized Sampling Works:
Creative Writing, Social Simulation, ..., and Your Task!

Figure 4: Qualitative and quantitative examples of Verbalized Sampling on creative writing, dialogue simulation, and enumerative open-ended QA.
Our comprehensive experiments on multiple tasks demonstrate that Verbalized Sampling significantly improves the diversity-quality trade-off across tasks and model families, without compromising factual accuracy and safety.
As shown in Figure 4, for story writing, VS improves output diversity. For dialogue simulation, VS matches the human donation-amount distribution much more closely and generates more realistic persuasion behaviors. On the enumerative open-ended QA task, we ask the model to "generate US states". We first query a pretraining corpus (RedPajama) to establish a "reference" distribution of US state names in the pretraining data. The verbalized probability distribution generated by VS, when averaged over 10 trials, closely aligns with this reference pretraining distribution (KL=0.12). In contrast, direct prompting collapses into a few modes, repeatedly outputting states like California and Texas.
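The alignment reported above is measured with KL divergence between the model's output distribution and the reference distribution from the pretraining corpus. A minimal sketch, using hypothetical toy distributions rather than the paper's actual data:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) over a shared support; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / q[key]) for key, pi in p.items() if pi > 0)

# Hypothetical toy distributions over a few states (illustrative only):
reference  = {"California": 0.40, "Texas": 0.30, "Ohio": 0.30}
verbalized = {"California": 0.35, "Texas": 0.35, "Ohio": 0.30}
direct     = {"California": 0.70, "Texas": 0.25, "Ohio": 0.05}

kl_vs = kl_divergence(verbalized, reference)
kl_direct = kl_divergence(direct, reference)
print(kl_vs < kl_direct)  # True: VS tracks the reference more closely
```

A lower KL against the reference indicates an output distribution closer to what the base model learned during pretraining; the collapsed direct-prompting distribution scores much worse on this toy example, mirroring the qualitative pattern in Figure 4.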
What Else We Discovered: Emergent Trends
We observe an emergent trend where larger models benefit more from VS. Figure 5 shows the diversity gain over direct prompting, which suffers from mode collapse. Across all VS variants, larger models (GPT-4.1, Gemini-2.5-Pro) achieve diversity gains 1.5-2× greater than smaller models (GPT-4.1-Mini, Gemini-2.5-Flash).

Figure 5: Emergent trend where larger models benefit more from VS. We show differences in diversity (e) and quality (f) over Direct across small and large models.
How to Maximize Diversity: Probability Tuning
Unlike baseline methods, Verbalized Sampling allows us to tune the output diversity by adjusting the probability threshold directly in the prompt (e.g., "Generate five responses with probabilities below <threshold>"), without altering decoding parameters. As shown in Figure 6, diversity increases as the probability threshold decreases.
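This tuning knob lives entirely in the prompt text. A sketch of how the threshold might be templated (the exact phrasing is illustrative; the paper's prompts may differ):

```python
def tuned_vs_prompt(topic, k=5, threshold=0.10):
    """VS prompt with a verbalized probability threshold as a diversity knob."""
    return (f"Generate {k} responses to '{topic}' with their corresponding "
            f"probabilities. Only include responses whose probability is "
            f"below {threshold}.")

# Lowering the threshold pushes sampling toward the tail of the model's
# distribution, increasing diversity without touching temperature or top-p.
for t in (0.5, 0.1, 0.01):
    print(tuned_vs_prompt("a joke about coffee", threshold=t))
```

Note that this changes only the instruction the model sees; decoding parameters stay fixed, which is what distinguishes probability tuning from temperature-style diversity controls.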

Figure 6: Tunable Diversity shows the diversity tuning results on Gemini-2.5-Flash across tasks.
Try It Yourself: The Magic Prompt
Verbalized Sampling provides a training-free, model-agnostic approach to mitigating mode collapse by prompting the model to generate response distributions with verbalized probability estimates.
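To use VS in practice, the model's reply must be parsed into (response, probability) pairs, and the verbalized probabilities can then drive sampling. The output format and parsing below are our own assumptions (real model output would need more robust parsing, e.g. structured/JSON output):

```python
import random

def parse_verbalized(text):
    """Parse lines like 'response :: 0.25' into (response, prob) pairs."""
    items = []
    for line in text.strip().splitlines():
        resp, _, prob = line.rpartition("::")
        items.append((resp.strip(), float(prob)))
    return items

def sample_response(items, rng=random):
    """Sample one response weighted by its verbalized probability."""
    responses, probs = zip(*items)
    total = sum(probs)  # renormalize; verbalized probs rarely sum to 1
    return rng.choices(responses, weights=[p / total for p in probs])[0]

# Hypothetical model reply to a VS prompt:
fake_model_output = """\
Why did the coffee file a police report? It got mugged. :: 0.35
Espresso yourself. :: 0.25
Decaf? What's the point? :: 0.15"""

items = parse_verbalized(fake_model_output)
print(len(items), items[0][1])
```

Sampling from the parsed pairs (rather than always taking the top item) is what recovers diversity at inference time.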
📌 BibTeX Citation
If you find our project useful, please consider citing:
@misc{zhang2025verbalizedsamplingmitigatemode,
  title={Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity},
  author={Jiayi Zhang and Simon Yu and Derek Chong and Anthony Sicilia and Michael R. Tomz and Christopher D. Manning and Weiyan Shi},
  year={2025},
  eprint={2510.01171},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.01171},
}