If you can't measure it, you can't improve it. At Open Thoughts, we are on a mission to build the best open reasoning datasets (and therefore, the best open reasoning models). We are sharing everything publicly along the way, including the tools we use to get there. Today we are releasing reasoning benchmarks as part of our model evaluation tool, Evalchemy.
Model evaluations are the key feedback signal in our experimental loop: measuring the effectiveness of a particular data curation strategy tells us what works and what doesn't. These evaluations need to be reliable, repeatable, easy to use, and fast. That is why we built Evalchemy.
Evalchemy is a unified interface for evaluating post-trained LLMs. Built on top of EleutherAI's popular lm-evaluation-harness, it adds further benchmarks and support for evaluating more API-based models.
As part of the Open Thoughts project, Evalchemy now includes the common reasoning benchmarks AIME24, AMC23, MATH500, LiveCodeBench, and GPQA-Diamond. The coding evaluations HumanEvalPlus, MBPPPlus, BigCodeBench, MultiPL-E, and CRUXEval have also joined the expanding list of available benchmarks.
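Because Evalchemy builds on lm-evaluation-harness, a run can be launched through the harness's familiar `simple_evaluate` interface. The sketch below is illustrative rather than Evalchemy's documented entry point: the model string and the task names are assumptions based on the benchmark list above, and the exact invocation should be checked against the Evalchemy README.

```python
# Minimal sketch of a harness-style evaluation run (illustrative only).
# Assumes Evalchemy's added tasks are registered with the lm-evaluation-harness
# task registry under the names used in this post (e.g. "AIME24", "MATH500").
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # local Hugging Face model; API-based models are also supported
    model_args="pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    tasks=["AIME24", "MATH500"],  # task names assumed from the list above
    batch_size=2,
)

# Per-task metrics live under results["results"]; the metric keys depend on each task.
for task, metrics in results["results"].items():
    print(task, metrics)
```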
Method | Model Name | AIME24 | MATH500 | GPQA-Diamond |
---|---|---|---|---|
Evalchemy Eval | DeepSeek-R1-Distill-Qwen-7B | 60.0 | 88.2 | 46.9 |
R1 Report | DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 49.1 |
Evalchemy Eval | gpt-4o-2024-08-06 | 8.7 | 75.8 | 46.5 |
OpenAI Report | gpt-4o | 9.3 | 60.3 | 50.6 |
Evalchemy Eval | o1-mini | 64.0 | 85.6 | 60.0 |
OpenAI Report | o1-mini | - | 90.0 | 60.0 |
Evalchemy Eval | DeepSeek-R1 | 86.7 | 91.6 | 71.2 |
R1 Report | DeepSeek-R1 | 79.8 | 97.3 | 71.5 |
In the table above, we compare our Evalchemy results on these reasoning benchmarks for popular models with the publicly reported numbers.
Note: AIME24 has a small sample size, which leads to high variance in evaluation accuracy. To mitigate this, we updated the code to report the average score over five evaluation runs with different seeds. No system prompt is used, the maximum generation length is 32,768 tokens, and the temperature is 0.7.
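To illustrate why averaging helps, the short simulation below (not Evalchemy's evaluation code) models one AIME24 run as 30 yes/no outcomes at an assumed 60% solve rate and compares the spread of single-run scores with the spread of five-run means.

```python
# Illustration of why averaging over several seeded runs reduces variance on a
# small benchmark like AIME24 (30 problems). Everything here is simulated; it is
# not Evalchemy's evaluation code.
import random
from statistics import mean, stdev

N_PROBLEMS = 30        # AIME24 has 30 problems, so one run moves in ~3.3-point steps
TRUE_ACCURACY = 0.60   # assumed underlying solve rate, for illustration only

def single_run(seed: int) -> float:
    """Simulate one evaluation pass: each problem is solved with prob. TRUE_ACCURACY."""
    rng = random.Random(seed)
    solved = sum(rng.random() < TRUE_ACCURACY for _ in range(N_PROBLEMS))
    return 100.0 * solved / N_PROBLEMS

single_scores = [single_run(seed) for seed in range(10_000, 10_100)]
averaged_scores = [mean(single_run(5 * i + j) for j in range(5)) for i in range(100)]

# The spread of five-run averages is roughly sqrt(5) times smaller than single runs.
print(f"single-run stdev: {stdev(single_scores):.2f}")
print(f"5-run-mean stdev: {stdev(averaged_scores):.2f}")
```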
We are continuously improving Evalchemy. If there is a benchmark you would like to see added, please raise an issue on GitHub, or better yet, open a pull request; we encourage contributions from the community.
Citation
@misc{guha2025openthoughtsdatarecipesreasoning,
title={OpenThoughts: Data Recipes for Reasoning Models},
author={Etash Guha and Ryan Marten and Sedrick Keh and Negin Raoof and Georgios Smyrnis and Hritik Bansal and Marianna Nezhurina and Jean Mercat and Trung Vu and Zayne Sprague and Ashima Suvarna and Benjamin Feuer and Liangyu Chen and Zaid Khan and Eric Frankel and Sachin Grover and Caroline Choi and Niklas Muennighoff and Shiye Su and Wanjia Zhao and John Yang and Shreyas Pimpalgaonkar and Kartik Sharma and Charlie Cheng-Jie Ji and Yichuan Deng and Sarah Pratt and Vivek Ramanujan and Jon Saad-Falcon and Jeffrey Li and Achal Dave and Alon Albalak and Kushal Arora and Blake Wulfe and Chinmay Hegde and Greg Durrett and Sewoong Oh and Mohit Bansal and Saadia Gabriel and Aditya Grover and Kai-Wei Chang and Vaishaal Shankar and Aaron Gokaslan and Mike A. Merrill and Tatsunori Hashimoto and Yejin Choi and Jenia Jitsev and Reinhard Heckel and Maheswaran Sathiamoorthy and Alexandros G. Dimakis and Ludwig Schmidt},
year={2025},
eprint={2506.04178},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.04178},
}
@software{Evalchemy,
author = {Guha, Etash and Raoof, Negin and Mercat, Jean and Frankel, Eric and Keh, Sedrick and Grover, Sachin and Smyrnis, George and Vu, Trung and Marten, Ryan and Saad-Falcon, Jon and Choi, Caroline and Arora, Kushal and Merrill, Mike and Deng, Yichuan and Suvarna, Ashima and Bansal, Hritik and Nezhurina, Marianna and Choi, Yejin and Heckel, Reinhard and Oh, Sewoong and Hashimoto, Tatsunori and Jitsev, Jenia and Shankar, Vaishaal and Schmidt, Ludwig and Dimakis, Alex and Sathiamoorthy, Mahesh},
month = nov,
title = {{Evalchemy}},
year = {2024}
}