University of Michigan School of Information
Study: Can combining unusual datasets lead to greater scientific discovery?

Thursday, 10/03/2024
By Noor Hindi
University of Michigan School of Information PhD candidate Yulin Yu and associate professor Daniel Romero have published work examining a new way of looking at, and utilizing, datasets to drive scientific advancement and foster high-impact, innovative scientific development.
Their paper, “Does the Use of Unusual Combinations of Datasets Contribute to Greater Scientific Impact?” reveals that combining datasets, especially datasets that are not typically combined, can lead to greater impact, measured by citations and by mentions of the findings in the news and on social media. With a wealth of publicly available datasets, Yu and Romero’s paper proposes a novel way of using them to spur scientific development.
“Data is a critical component of innovation,” Yu says. “And the conversation has centered around the importance of keeping data open-source. But the ultimate goal of the open-access movement is to drive important innovation and one question we as scientists haven’t fully answered is: how can we more efficiently use or reuse these data to drive innovation?”
Yu and Romero’s paper tests the recombination theory, “which suggests that innovative combinations of existing knowledge, including the use of unusual combinations of datasets, can lead to high-impact discoveries.” Their research, which examined atypical data combinations in more than 30,000 publications and over 5,000 datasets, found that combining different datasets led to higher citation rates and more media attention.
“Innovation often comes from recombination of entities, but we don’t know what recombination in what condition is most fruitful,” Yu says. “By testing this idea on data use, we find a strategy that allows the information we have to bring in new knowledge, which is inspiring to me.”
Data has become pivotal to almost every sector of society, from discovering important scientific patterns to building AI models. Yu and Romero’s research has the potential to change how scientists, policymakers and data curators use and manage data for research, encouraging them to “explore new research avenues by combining infrequently paired datasets.”
“It used to be the case where the hardest part was finding and getting any data,” Romero says. “Now the problem has become more about filtering and selection. There are so many datasets out there, and the challenge for researchers is how to use them, which ones to use and how to combine them.”
Yu, a fourth-year PhD candidate at UMSI, researches how to leverage big data, AI and network science to investigate the drivers of innovation across a range of contexts. Romero is her PhD advisor.
Read “Does the Use of Unusual Combinations of Datasets Contribute to Greater Scientific Impact?” in the Proceedings of the National Academy of Sciences.