From HumanEval to SWE-bench
Dimensions

When we write code, we usually consider the following contexts:

- In-file references
- In-repo references that cross multiple files
- Code execution results, which guide us in updating the code iteratively
- GitHub Issues / PRs / Discussions, which express requirements and other information

In the figure, I list some benchmarks specifically designed to test different aspects of a code LM's performance. Context scope is the biggest differentiator among them. There are many more benchmarks that distinguish themselves from these by taking other concerns into account, e....
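To make the first two scopes concrete, here is a minimal hypothetical sketch (the file and function names are my own, not taken from any benchmark): completing `similarity` correctly needs in-file context (the `_normalize` helper defined in the same file) as well as repo-level context (`edit_distance` lives in another file).

```python
# Hypothetical two-file repo; names are illustrative only.

# --- repo/metrics.py ---
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

# --- repo/score.py ---
from metrics import edit_distance            # in-repo reference: crosses a file boundary

def _normalize(text: str) -> str:            # helper defined in this same file
    return " ".join(text.split()).lower()

def similarity(pred: str, ref: str) -> float:
    pred, ref = _normalize(pred), _normalize(ref)   # in-file reference
    dist = edit_distance(pred, ref)                 # resolves only with repo-level context
    return 1.0 - dist / max(len(ref), 1)
```

A function-level benchmark like HumanEval only exercises the single-function setting, while a repo-level benchmark has to surface the cross-file dependency for the model to succeed.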