OpenThoughts3 - A new SOTA Reasoning Data Recipe

OpenThinker3-7B is the SOTA open-data reasoning model at its scale. Our model achieves 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, representing improvements of 15.3, 17.2, and 20.5 percentage points compared to DeepSeek-R1-Distill-Qwen-7B.

All of our datasets and models are available on Hugging Face. We trained OpenThinker3-7B using only supervised fine-tuning, without any reinforcement learning. The key to our model’s performance is our new dataset, OpenThoughts3-1.2M. This dataset comprises 1.2 million questions across math, code, and science domains, with reasoning traces annotated from QwQ-32B. OpenThoughts3-1.2M is the result of over 1,000+ rigorous experiments on each stage in the reasoning dataset construction pipeline.

Model	HMMT	AIME25	LCB 06/24-01/25	CodeForces	GPQA-D	JEEBench
OpenThinker3-7B	42.7	53.3	51.7	32.2	53.7	72.4
DS-R1-Distill-7B	25.0	38.0	34.5	21.1	33.2	50.4
OpenR1-Distill-7B	26.7	48.0	50.9	32.9	52.9	70.7
Nemotron-Nano	33.3	50.7	44.3	30.9	52.9	64.3
AceReason	32.7	47.3	43.8	32.5	50.2	55.3
Qwen2.5-7B-Instruct	2.0	8.0	16.3	10.2	24.6	33.9

In the above table, we bold values within two standard errors of the highest-scoring model.

Our detailed report and experimental artifacts are fully public for the community to build upon together:

OpenThoughts3 Data Recipe

We extensively studied the effect of the following steps in our data generation pipeline:

Question sourcing
Question mixing
Question filtering
Generating multiple answers per question
Answer filtering
Teacher model selection

We build our pipeline iteratively, ablating the initial steps first and keeping the best choices fixed throughout the remaining ablations. Our ablations contain over 1,000 controlled experiments across three data domains: math, code, and science.

Our state-of-the-art performance is driven by the insights uncovered during the experimental pipeline. These insights include:

Sampling multiple answers per question from a teacher model is an effective technique to increase the size of a data source by at least 16x.
Models with better performance are not necessarily better teachers. QwQ-32B is a stronger teacher than DeepSeek-R1, although it scores lower on target reasoning benchmarks such as JEEBench.
We experimented with numerous verification and answer filtering methods, and none gave significant performance improvements.
Selecting questions from a small number (top 1 or 2) of high-quality sources leads to better downstream performance compared to optimizing for diversity (i.e., top 8 or 16 sources).
Filtering questions by LLM labeled difficulty or LLM response length yields better results than filters typical to pre-training data curation that use embeddings or fastText

Our final dataset pipeline is presented below:

Scaling Up

Scaling reasoning datasets is a powerful lever for increasing downstream performance, as we demonstrated scaling from Bespoke-Stratos-17k to OpenThoughts-114k and then again to OpenThoughts2-1M. With the stronger OpenThoughts3 data recipe, we once again take advantage of scale, scaling up to 1.2M samples consisting of 850,000 math, 250,000 code, and 100,000 science instruction and answer pairs.

We outperform the previous state-of-the-art open reasoning datasets (AM and Nemotron Nano) and clearly demonstrate that more-is-more for reasoning.

Conclusion

Our OpenThinker3-7B model and associated OpenThoughts3-1.2M are the SOTA open-data model and SFT dataset for reasoning models, respectively. Our data pipeline is built on the insights gained through rigorous and comprehensive experimentation across the entire protocol. We release all artifacts from our research and hope to fuel the research community further. We plan to further scale this pipeline up in the future and release a 32B equivalent to OpenThinker3-7B.

Citation

@misc{guha2025openthoughtsdatarecipesreasoning,
      title={OpenThoughts: Data Recipes for Reasoning Models}, 
      author={Etash Guha and Ryan Marten and Sedrick Keh and Negin Raoof and Georgios Smyrnis and Hritik Bansal and Marianna Nezhurina and Jean Mercat and Trung Vu and Zayne Sprague and Ashima Suvarna and Benjamin Feuer and Liangyu Chen and Zaid Khan and Eric Frankel and Sachin Grover and Caroline Choi and Niklas Muennighoff and Shiye Su and Wanjia Zhao and John Yang and Shreyas Pimpalgaonkar and Kartik Sharma and Charlie Cheng-Jie Ji and Yichuan Deng and Sarah Pratt and Vivek Ramanujan and Jon Saad-Falcon and Jeffrey Li and Achal Dave and Alon Albalak and Kushal Arora and Blake Wulfe and Chinmay Hegde and Greg Durrett and Sewoong Oh and Mohit Bansal and Saadia Gabriel and Aditya Grover and Kai-Wei Chang and Vaishaal Shankar and Aaron Gokaslan and Mike A. Merrill and Tatsunori Hashimoto and Yejin Choi and Jenia Jitsev and Reinhard Heckel and Maheswaran Sathiamoorthy and Alexandros G. Dimakis and Ludwig Schmidt},
      year={2025},
      eprint={2506.04178},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.04178}, 
}

OpenThoughts3 - A new SOTA Reasoning Data Recipe

OpenThoughts3 Data Recipe

Scaling Up

Conclusion

Citation

Subscribe for updates