Better LLM Reasoning via Dual-Play
TL;DR
We design a dual-play RL framework, PasoDoble, that concurrently trains a Proposer LLM and a Solver LLM to enhance LLMs' reasoning abilities.
Abstract
Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet they still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to learn iteratively from themselves, thereby reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, dual-play training has seen limited adoption for LLMs, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the quality and diversity of its questions. To avoid reward hacking, the Proposer is rewarded only for producing valid questions that push the Solver's limit, the Solver is rewarded for solving them correctly, and both are updated jointly. To further enhance training stability, we introduce an optional offline paradigm that decouples Proposer and Solver updates, alternately updating each for several steps while holding the other fixed. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble improves the reasoning performance of LLMs.
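To illustrate the optional offline paradigm mentioned above, here is a minimal Python sketch of the alternating update schedule. It is an illustration under assumptions, not the released implementation: the update and sampling routines are passed in as hypothetical callables, and the switch interval K is an assumed hyperparameter.

def train_offline(update_proposer_step, update_solver_step, sample_knowledge,
                  num_rounds, K=8):
    """Alternate role updates: K Proposer steps, then K Solver steps, repeatedly.

    update_proposer_step(knowledge) -> one RL update of the Proposer (Solver frozen)
    update_solver_step(knowledge)   -> one RL update of the Solver (Proposer frozen)
    sample_knowledge()              -> one knowledge piece from the knowledge base
    """
    for _ in range(num_rounds):
        # Phase 1: update only the Proposer; the frozen Solver merely provides
        # solving rates used for the Proposer's difficulty reward.
        for _ in range(K):
            update_proposer_step(sample_knowledge())
        # Phase 2: update only the Solver on questions from the now-frozen Proposer.
        for _ in range(K):
            update_solver_step(sample_knowledge())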
PasoDoble
PasoDoble Training Framework
At the core of our framework are the Proposer LLM and the Solver LLM. In each training step, we sample a knowledge piece from the knowledge base and feed it to the Proposer, which generates N questions and their corresponding answers. The Solver then generates M responses to each question. The Proposer is rewarded for generating questions the Solver finds difficult (i.e., 1 - solving_rate), while the Solver is rewarded for solving them (i.e., solving_rate).
An example training step in the online setting.
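To make the step above concrete, the following is a minimal Python sketch of one online training step. The generation, grading, validity-check, and joint-update routines are supplied as callables whose names and signatures are illustrative assumptions, not the paper's released code; likewise, giving a zero reward to invalid questions is an assumption of this sketch.

def online_step(propose, solve, grade, is_valid, update_jointly,
                knowledge, N=4, M=8):
    """Run one PasoDoble step on a single knowledge piece.

    propose(knowledge, n)      -> list of n (question, answer) pairs
    solve(question, m)         -> list of m Solver responses
    grade(response, answer)    -> True if the response matches the answer
    is_valid(question, answer) -> True if the proposed question is well-formed
    update_jointly(...)        -> applies the RL update to Proposer and Solver
    """
    qa_pairs = propose(knowledge, N)          # Proposer: N questions with answers
    proposer_rewards, solver_rewards = [], []

    for question, answer in qa_pairs:
        responses = solve(question, M)        # Solver: M attempts per question
        solving_rate = sum(grade(r, answer) for r in responses) / M

        # Solver is rewarded for solving; Proposer is rewarded for difficulty,
        # but only when the question is valid (zero reward otherwise, assumed here).
        solver_rewards.append(solving_rate)
        proposer_rewards.append((1.0 - solving_rate) if is_valid(question, answer) else 0.0)

    # Both models are updated jointly from their respective rewards.
    update_jointly(qa_pairs, proposer_rewards, solver_rewards)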
Results
After applying PasoDoble, we observe performance gains across models and benchmarks, with larger models generally seeing larger boosts. Except for Qwen2.5-0.5B, whose performance drops slightly, PasoDoble improves average accuracy by 2% (Qwen3-0.6B), 5% (Qwen2.5-1.5B), 12% (Qwen3-1.7B), 8% (Qwen2.5-3B), and 15% (Qwen3-4B).
BibTeX
@article{zhang2025pasodoble,
  title={Better LLM Reasoning via Dual-Play},
  author={Zhengxin Zhang and Chengyu Huang and Aochong Oliver Li and Claire Cardie},
  eprint={2511.11881},
  archivePrefix={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2511.11881}
}