
Jacobs: We need more effective methods to measure AI's true capabilities

UMSI assistant professor Abigail Jacobs was quoted in the MIT Technology Review article “How to build a better AI benchmark.”

Monday, 05/19/2025

By Noor Hindi

Abigail Jacobs, an assistant professor at the University of Michigan School of Information, emphasizes the need for improved evaluation methods in AI that prioritize validity.

Some AI developers are focused on making their chatbots appear more intelligent than they actually are. Take SWE-Bench, a benchmark that assesses how well AI tools can fix bugs in Python programs: some developers train their models specifically to excel on the benchmark itself rather than to handle real-world coding challenges.
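To make the gaming concern concrete, here is a minimal, hypothetical sketch of a SWE-Bench-style check (not the actual SWE-Bench harness): a model proposes a patch, and the patch is accepted if the previously failing tests now pass. A model tuned to the public test cases can pass such a check without the general coding ability the benchmark is meant to measure, which is the validity gap Jacobs describes.

```python
# Hypothetical sketch of a SWE-Bench-style check (not the actual harness):
# a model's patch is accepted if the repository's failing tests now pass.

def buggy_clamp(x, lo, hi):
    """Original buggy code: the bounds are applied in the wrong order."""
    return max(hi, min(lo, x))

def patched_clamp(x, lo, hi):
    """Fix proposed by a model; the harness judges it only by the tests."""
    return max(lo, min(hi, x))

def passes_benchmark(fix):
    """Public test cases standing in for a benchmark's evaluation suite."""
    cases = [((5, 0, 10), 5), ((-3, 0, 10), 0), ((42, 0, 10), 10)]
    return all(fix(*args) == expected for args, expected in cases)

print(passes_benchmark(buggy_clamp))    # False: the original code fails
print(passes_benchmark(patched_clamp))  # True: the patch passes, but a model
                                        # tuned to these exact cases could also
                                        # pass without real coding ability.
```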

Jacobs, an expert in AI and algorithmic bias, argues that the AI industry must take the validity of evaluations more seriously to ensure systems are genuinely delivering on their promises. 

“Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does,” says Jacobs. “I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim.”

In a recent paper, Jacobs questions current practices for measuring validity and suggests that developers reconsider how they evaluate the effectiveness of AI systems.

RELATED

Read “How to build a better AI benchmark” in the MIT Technology Review. 

Learn more about UMSI assistant professor Abigail Jacobs by visiting her UMSI faculty profile.