DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science

Published in ICLR 2026

Fan Shu†, Yite Wang†, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, and Feng Yan.

DARE-bench is a verifiable benchmark for machine-learning modeling and data-science instruction following, comprising 6,300 Kaggle-derived tasks for both training and evaluation.

Read on arXiv

Code repository