University of Michigan School of Information
Jacobs: We need more effective methods to measure AI's true capabilities

Monday, 05/19/2025
By Noor Hindi

Abigail Jacobs, an assistant professor at the University of Michigan School of Information, emphasizes the need for improved evaluation methods in AI that prioritize validity.
AI developers are often focused on making their chatbots appear more intelligent than they actually are. Take SWE-Bench, a benchmark that assesses how well AI tools can fix bugs in Python programs. Some developers train their models specifically to score well on the benchmark rather than to handle real-world coding challenges, so a strong score can overstate what the system actually delivers.
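To make the idea concrete, here is a minimal sketch of how a SWE-Bench-style evaluation scores a model: apply the model's proposed patch to a buggy repository, run the project's test suite, and count the task as solved only if the tests pass. This is an illustration of the general approach, not the actual SWE-Bench harness; the function name, arguments, and test command are all hypothetical.

```python
# Illustrative sketch of a benchmark-style patch evaluation.
# Not the real SWE-Bench harness; names here are made up for clarity.
import shutil
import subprocess
import tempfile
from pathlib import Path


def evaluate_patch(repo_dir: str, patch_text: str, test_command: list[str]) -> bool:
    """Apply a model-generated patch to a copy of the repo and run its tests.

    Returns True if the test suite passes after the patch, False otherwise.
    """
    workdir = Path(tempfile.mkdtemp())
    try:
        # Work on a throwaway copy so the original checkout stays clean.
        target = workdir / "repo"
        shutil.copytree(repo_dir, target)

        # Write out the model's patch and apply it with git; a patch
        # that does not apply cleanly counts as a failure.
        patch_file = workdir / "model.patch"
        patch_file.write_text(patch_text)
        applied = subprocess.run(
            ["git", "apply", str(patch_file)],
            cwd=target,
            capture_output=True,
        )
        if applied.returncode != 0:
            return False

        # Run the project's tests; a zero exit code means the bug is
        # considered fixed for scoring purposes.
        result = subprocess.run(test_command, cwd=target, capture_output=True)
        return result.returncode == 0
    finally:
        shutil.rmtree(workdir, ignore_errors=True)


# Hypothetical usage for a single benchmark task:
# solved = evaluate_patch("/path/to/buggy-repo", model_patch, ["pytest", "-q"])
```

The validity concern Jacobs raises is that passing these hidden tests is the whole score: a model trained to the benchmark can clear this bar without generalizing to the everyday coding problems the benchmark is meant to stand in for.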
Jacobs, an expert in AI and algorithmic bias, argues that the AI industry must take the validity of evaluations more seriously to ensure systems are genuinely delivering on their promises.
“Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does,” says Jacobs. “I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim.”
In a recent paper, Jacobs questions current approaches to measuring validity and suggests that developers reconsider how they evaluate the effectiveness of AI systems.
RELATED
Read “How to build a better AI benchmark” in the MIT Technology Review.
Learn more about UMSI assistant professor Abigail Jacobs by visiting her UMSI faculty profile.