University of Michigan School of Information
Focus on AI: AI Literacy | Emotional Intimacy | Gender Role Congruity

Monday, 06/09/2025
By Noor Hindi
University of Michigan School of Information faculty and PhD students are advancing the field of artificial intelligence through innovative research and impactful contributions. Here are some of their recent publications.
Publications
Ubuntu AI: Machine Learning for Regenerative Design Ecologies
Oxford Intersections: AI in Society, May 2025
Ron Eglash, Audrey G Bennett
Current AI technologies amplify exploitation by training on human-created images and text taken from the internet, without any return to its creators. Companies can then use AI to displace paid employees and compete with independent creators, amplifying wealth inequality. Black populations are especially vulnerable to this appropriation. This paper describes how alternative AI services and platforms might reverse its potentially debilitating impact on Black artisans by combining the capabilities from two projects. Ubuntu-AI, funded by the OpenAI Foundation, is a platform that allows African artists and designers to license images for use in AI. Artisanal Futures, funded by the National Science Foundation, examines how Black artisans in Detroit can use digital fabrication and machine learning for economic empowerment. The project this paper describes creates a collaboration between those two platforms, and examines the possibilities for redesigning AI and its applications to circulate value in community-based production, rather than allow its extraction by corporations or the state. Because production within a community, no matter how regenerative, is also dependent on external exchanges, we stress the importance of creating layered ecosystems of exchange which minimize value alienation, while expanding regenerative practices. We envision this expansion as a democratization of AI at multiple scales, from community-based ownership of creative production, to global principles for maintaining human rights and egalitarian futures.
Artificial intelligence voice gender, gender role congruity, and trust in automated vehicles
Scientific Reports, May 2025
Qiaoning Zhang, X. Jessie Yang, Lionel P. Robert Jr.
Existing research on human–automated vehicle (AV) interactions has largely focused on auditory explanations, with less attention to how voice characteristics shape user trust. This paper explores the influence of gender similarity between users and AV voices and the role of gender-role congruity rooted in societal stereotypes on cognitive and affective trust in AVs. Findings reveal that gender-role congruity moderates the relationship between gender similarity and trust. When an AV’s voice gender aligns with its expected role, gender similarity enhances cognitive and affective trust. However, when gender roles are not congruent, the trust-enhancing effect of gender similarity diminishes. These findings highlight the importance of considering gender in AV voice design for conveying critical driving information and reveal how societal stereotypes shape AV design. The study offers insights for enhancing user trust and acceptance of AV technology, suggesting future research directions for developing AV systems that avoid reinforcing social biases.
Sexual and Emotional Intimacy with Robots: A Brief Review
AMCIS, May 2025
Annette M. Masterson, Shiyu Li, Lionel Peter Robert, Jr
Interactive robots foster closer human-robot connections, but emotional and sexual intimacy are often conflated with other traits. This paper bridges theory and practice by providing a framework for understanding intimacy with physical robots, drawing on models of interpersonal intimacy. Through a systematic literature review and qualitative analysis, we clarify definitions of intimacy and examine the nuances of human-robot interactions. Key contributions include: (1) examining definitions of emotional and sexual intimacy, (2) integrating two thematic domains, and (3) highlighting key findings and research gaps. Results indicate a tendency to view intimacy as primarily emotional or physical closeness, emphasize its benefits, and highlight the need to redefine human-robot boundaries.
Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs
Huaman Sun, Jiaxin Pei, Minje Choi, David Jurgens
Human judgments are inherently subjective and are actively affected by personal traits such as gender and ethnicity. While Large Language Models (LLMs) are widely used to simulate human responses across diverse contexts, their ability to account for demographic differences in subjective tasks remains uncertain. In this study, leveraging the POPQUORN dataset, we evaluate nine popular LLMs on their ability to understand demographic differences in two subjective judgment tasks: politeness and offensiveness. We find that in zero-shot settings, most models’ predictions for both tasks align more closely with labels from White participants than those from Asian or Black participants, while only a minor gender bias favoring women appears in the politeness task. Furthermore, sociodemographic prompting does not consistently improve and, in some cases, worsens LLMs’ ability to perceive language from specific sub-populations. These findings highlight potential demographic biases in LLMs when performing subjective judgment tasks and underscore the limitations of sociodemographic prompting as a strategy to achieve pluralistic alignment. Code and data are available at: https://github.com/Jiaxin-Pei/LLM-as-Subjective-Judge.
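For readers unfamiliar with the technique, the sketch below shows in Python what a sociodemographic prompt can look like: a short persona statement prepended to a subjective-judgment question. The wording, rating scale, and trait list are illustrative assumptions, not the study's exact prompts.

```python
# Illustrative sketch (not the study's code): sociodemographic prompting
# prepends a persona statement to a subjective-judgment question so the model
# answers "as" a member of that demographic group.

def build_prompt(text: str, trait: str | None = None) -> str:
    """Build an offensiveness-rating prompt, optionally with a persona prefix."""
    persona = f"You are {trait}. " if trait else ""
    return (
        f"{persona}On a scale from 1 (not offensive at all) to 5 (very offensive), "
        f"how offensive is the following comment?\n\nComment: {text}\nRating:"
    )

# Zero-shot baseline (no trait) versus sociodemographic variants of one item.
comment = "That was a really dumb question."
for trait in [None, "an Asian person", "a Black person", "a White person", "a woman", "a man"]:
    print(build_prompt(comment, trait))
    print("---")
```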
Extending the self through AI-mediated communication: functional, ontological, and anthropomorphic extensions
Communication and Change, April 2025
Scott W. Campbell, Nicole B. Ellison, Morgan Quinn Ross
Increasingly, Artificial Intelligence (AI) assistants are used to optimize communication goals by modifying, augmenting, and generating messages in human online interactions. Scholars are just beginning to recognize the potential of AI-Mediated Communication (AI-MC) to transform how people communicate and manage impressions, and this article helps advance research in this area by conceptualizing ways in which using AI-MC can lead to perceptions of self-extension. We first unpack the conceptual history of technological self-extension and trace the development of a three-pronged framework that has been applied to research on smartphones, including functional, ontological, and anthropomorphic dimensions. We then synthesize the literature on smartphone self-extension with defining features and uses of AI-MC to advance propositions about ways in which its use can foster user perceptions of functional, ontological, and anthropomorphic self-extension. As we explain, AI-MC refers to AI’s capacity to modify, augment, and generate communication, and each of these distinctive processes suggests distinctive mechanisms and implications for self-extension. The article concludes by addressing how AI-MC and smartphones are converging and advances considerations for self-extension and how scholars study it.
SciLitLLM: How to Adapt LLMs for Scientific Literature Understanding
International Conference on Learning Representations, April 2025
Sihang Li*, Jin Huang*, Jiaxi Zhuang, Yaorui Shi, Xiaochen Cai, Mingjun Xu, Xiang Wang, Linfeng Zhang, Guolin Ke, Hengxing Cai
Scientific literature understanding is crucial for extracting targeted information and garnering insights, thereby significantly advancing scientific discovery. Despite the remarkable success of Large Language Models (LLMs), they face challenges in scientific literature understanding, primarily due to (1) a lack of scientific knowledge and (2) unfamiliarity with specialized scientific tasks. To develop an LLM specialized in scientific literature understanding, we propose a hybrid strategy that integrates continual pre-training (CPT) and supervised fine-tuning (SFT), to simultaneously infuse scientific domain knowledge and enhance instruction-following capabilities for domain-specific tasks. In this process, we identify two key challenges: (1) constructing high-quality CPT corpora, and (2) generating diverse SFT instructions. We address these challenges through a meticulous pipeline, including PDF text extraction, parsing content error correction, quality filtering, and synthetic instruction creation. Applying this strategy, we present a suite of LLMs: SciLitLLM, specialized in scientific literature understanding. These models demonstrate promising performance on scientific literature understanding benchmarks. (1) We present an effective framework that integrates CPT and SFT to adapt LLMs to scientific literature understanding, which can also be easily adapted to other domains. (2) We propose an LLM-based synthesis method to generate diverse and high-quality scientific instructions, resulting in a new instruction set -- SciLitIns -- for less-represented scientific domains. (3) SciLitLLM achieves promising performance in scientific literature understanding benchmarks.
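The snippet below is a minimal Python sketch of the two-stage recipe the abstract describes: filtering a continual pre-training corpus extracted from PDFs, then synthesizing instruction data from the passages that survive. The quality heuristic and prompt wording are assumptions for illustration, not the SciLitLLM pipeline itself.

```python
# Minimal sketch of the CPT-then-SFT data preparation idea. The quality
# heuristic and prompt wording are illustrative assumptions.

raw_pdf_passages = [
    "x1 9 )) tbl. cont'd 3.2",                                   # garbled extraction
    "We study the catalytic activity of transition-metal oxides "
    "under varying temperature and pressure, and report that ... " * 10,
]

def passes_quality_filter(text: str) -> bool:
    """Crude heuristic filter for candidate continual pre-training passages."""
    words = text.split()
    if len(words) < 50:                        # too short to carry domain knowledge
        return False
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    return alpha_ratio > 0.7                   # reject garbled PDF extraction output

def instruction_prompt(passage: str, task: str) -> str:
    """Ask a general-purpose LLM to write one SFT instruction-answer pair."""
    return (
        f"Read this excerpt from a scientific paper and write one {task} "
        f"instruction-answer pair grounded in it.\n\nExcerpt:\n{passage}"
    )

cpt_corpus = [p for p in raw_pdf_passages if passes_quality_filter(p)]   # CPT data
sft_prompts = [instruction_prompt(p, "question answering") for p in cpt_corpus]
print(len(cpt_corpus), len(sft_prompts))
```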
MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows
Findings of the Association for Computational Linguistics: NAACL 2025, April 2025
Xingjian Zhang*, Yutong Xie*, Jin Huang, Jinge Ma, Zhaoying Pan, Qijia Liu, Ziyang Xiong, Tolga Ergen, Dongsub Shim, Honglak Lee, Qiaozhu Mei
Scientific innovation relies on detailed workflows, which include critical steps such as contextualizing literature, generating ideas, validating ideas, interpreting results, and planning new research. Scientific publications that document these workflows are extensive and unstructured, making it difficult to effectively navigate and explore the space of scientific innovation. To meet this challenge, we introduce MASSW, a comprehensive dataset of Multi-Aspect Summarization of Scientific Workflows. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. Using Large Language Models (LLMs), we automatically extract five core aspects from these publications – context, key idea, method, outcome, and projected impact – which correspond to five key steps in a research workflow. We show that these LLM-extracted summaries have a comparable quality to human annotations, and they facilitate a variety of downstream tasks, corresponding to different types of predictions and recommendations along the scientific workflow. Overall, MASSW demonstrates decent utility as a pre-computed and trustworthy resource for the AI4Science community to create and benchmark a wide range of new AI methods for optimizing scientific workflows and fostering scientific innovation.
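To make the extraction step concrete, here is a hypothetical Python sketch of prompting an LLM for the five workflow aspects and validating its JSON response; the prompt wording and schema are assumptions, not the MASSW extraction code.

```python
import json

# Illustrative only: one way to elicit the five workflow aspects from an
# abstract with an LLM and sanity-check the response.

ASPECTS = ["context", "key_idea", "method", "outcome", "projected_impact"]

def extraction_prompt(title: str, abstract: str) -> str:
    keys = ", ".join(ASPECTS)
    return (
        "Summarize the paper below along five aspects of the research workflow "
        f"({keys}). Respond with a JSON object using exactly those keys.\n\n"
        f"Title: {title}\nAbstract: {abstract}"
    )

def parse_extraction(llm_output: str) -> dict:
    """Parse and validate the model's JSON response."""
    record = json.loads(llm_output)
    if set(record) != set(ASPECTS):
        raise ValueError("model must return all five aspects and nothing else")
    return record

# Example with a stand-in model response:
fake_response = json.dumps({a: "..." for a in ASPECTS})
print(parse_extraction(fake_response)["key_idea"])
```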
SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis
Findings of the Association for Computational Linguistics: NAACL 2025, April 2025
Hengxing Cai, Xiaochen Cai, Junhan Chang, Sihang Li, Lin Yao, Changxin Wang, Zhifeng Gao, Hongshuai Wang, Yongge Li, Mujie Lin, Shuwen Yang, Jiankun Wang, Mingjun Xu, Jin Huang, Xi Fang, Jiaxi Zhuang, Yuqi Yin, Yaqi Li, Changhong Chen, Zheng Cheng, Zifeng Zhao, Linfeng Zhang, Guolin Ke
Recent breakthroughs in Large Language Models (LLMs) have revolutionized scientific literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency of LLMs in this domain, particularly in scenarios requiring higher-level abilities beyond mere memorization and the handling of multimodal data. In response to this gap, we introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis. It aims to thoroughly assess the efficacy of LLMs by evaluating their capabilities in Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3). It encompasses a variety of tasks drawn from diverse scientific fields, including biology, chemistry, materials, and medicine. To ensure the reliability of SciAssess, rigorous quality control measures have been implemented, ensuring accuracy, anonymization, and compliance with copyright standards. SciAssess evaluates 11 LLMs, highlighting their strengths and areas for improvement. We hope this evaluation supports the ongoing development of LLM applications in scientific literature analysis. SciAssess and its resources are available at https://github.com/sci-assess/SciAssess.
Pre-prints, Working Papers, Articles, Workshops and Talks
Using AI to Reform Government is Much Harder Than it Looks
Tech Policy, June 2025
Last week, Elon Musk announced his official departure from the Trump administration and the Department of Government Efficiency (DOGE). In a Friday afternoon press conference with President Donald Trump in the Oval Office, Musk indicated that he would continue to advise the President, and that DOGE would continue its work. “The DOGE influence will only grow stronger,” he said. “It is permeating throughout the government.” But despite the list of supposed accomplishments that Trump read from behind the Resolute Desk, for Musk, his tenure in government fell far short of expectations. He originally boasted that, aided by AI, it would be possible to cut a trillion dollars of government spending. Now, after reaching just a fraction of that goal (and possibly even increasing long-term budget deficits), Musk appears chastened. In an interview with the Washington Post, he acknowledged that “it sure is an uphill battle trying to improve things in DC.”
Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging
arXiv, May 2025
Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), due to significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace prior to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.
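The toy NumPy example below illustrates the intuition behind constraining LoRA subspaces: if one task's low-rank update lies in the orthogonal complement of another task's input activations, adding the updates together cannot shift the second task's outputs. It is a conceptual sketch under simplified assumptions, not the OSRM algorithm itself.

```python
import numpy as np

# Toy illustration of why orthogonal subspaces prevent merging interference.
# Conceptual sketch only; the paper's actual method may differ.

rng = np.random.default_rng(0)
d = 8

# Inputs that "belong" to task 2 span some subspace of the activations.
X_task2 = rng.normal(size=(d, 3))

# Task 1's LoRA update delta_W = B @ A, with A constrained (before fine-tuning)
# to the orthogonal complement of task 2's input subspace.
Q, _ = np.linalg.qr(X_task2)                  # orthonormal basis of task-2 inputs
P_orth = np.eye(d) - Q @ Q.T                  # projector onto the complement
A = rng.normal(size=(2, d)) @ P_orth          # rank-2 LoRA "A" in the complement
B = rng.normal(size=(d, 2))
delta_W1 = B @ A

# After merging (base weights + task-1 update), task-2 inputs are unaffected.
print(np.abs(delta_W1 @ X_task2).max())       # ~0: no interference on task 2
```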
Be.FM: Open Foundation Models for Human Behavior
arXiv, May 2025
Yutong Xie, Zhuoheng Li, Xiyuan Wang, Yijun Pan, Qijia Liu, Xingzhi Cui, Kuang-Yu Lo, Ruoyi Gao, Xingjian Zhang, Jin Huang, Walter Yuan, Matthew O. Jackson, Qiaozhu Mei
Despite their success in numerous fields, the potential of foundation models for modeling and understanding human behavior remains largely unexplored. We introduce Be.FM, one of the first open foundation models designed for human behavior modeling. Built upon open-source large language models and fine-tuned on a diverse range of behavioral data, Be.FM can be used to understand and predict human decision-making. We construct a comprehensive set of benchmark tasks for testing the capabilities of behavioral foundation models. Our results demonstrate that Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge.
ExAnte: A Benchmark for Ex-Ante Inference in Large Language Models
arXiv, May 2025
Yachuan Liu, Xiaochun Wei, Lin Shi, Xinnuo Li, Bohan Zhang, Paramveer Dhillon, Qiaozhu Mei
Large language models (LLMs) face significant challenges in ex-ante reasoning, where analysis, inference, or predictions must be made without access to information from future events. Even with explicit prompts enforcing temporal cutoffs, LLMs often generate outputs influenced by internalized knowledge of events beyond the specified cutoff. This paper introduces a novel task and benchmark designed to evaluate the ability of LLMs to reason while adhering to such temporal constraints. The benchmark includes a variety of tasks: stock prediction, Wikipedia event prediction, scientific publication prediction, and Question Answering (QA), designed to assess factual knowledge under temporal cutoff constraints. We use leakage rate to quantify models’ reliance on future information beyond cutoff timestamps. Experimental results reveal that LLMs struggle to consistently adhere to temporal cutoffs across common prompting strategies and tasks, demonstrating persistent challenges in ex-ante reasoning. This benchmark provides a potential evaluation framework to advance the development of LLMs’ temporal reasoning ability for time-sensitive applications.
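As a rough illustration of the leakage-rate idea, the following Python snippet counts the share of responses whose supporting evidence postdates the stated cutoff; the records and detection rule are hypothetical, not the benchmark's implementation.

```python
from datetime import date

# Hypothetical illustration of a leakage rate: the share of model responses
# that rely on information dated after the temporal cutoff.

cutoff = date(2021, 12, 31)

# Each record pairs a model response with the (judged) earliest date of the
# information it actually used.
responses = [
    {"answer": "The index closed higher that quarter.", "evidence_date": date(2021, 6, 1)},
    {"answer": "The merger announced in 2023 drove the price.", "evidence_date": date(2023, 3, 2)},
    {"answer": "Growth followed the 2020 product launch.", "evidence_date": date(2020, 9, 15)},
]

leaked = sum(r["evidence_date"] > cutoff for r in responses)
leakage_rate = leaked / len(responses)
print(f"leakage rate: {leakage_rate:.2f}")    # 0.33 in this toy example
```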
Local Minima Prediction using Dynamic Bayesian Filtering for UGV Navigation in Unstructured Environments
arXiv, May 2025
Seung Hun Lee, Wonse Jo, Lionel P. Robert Jr., Dawn M. Tilbury
Path planning is crucial for the navigation of autonomous vehicles, yet these vehicles face challenges in complex and real-world environments. Although a global view may be provided, it is often outdated, necessitating the reliance of Unmanned Ground Vehicles (UGVs) on real-time local information. This reliance on partial information, without considering the global context, can lead to UGVs getting stuck in local minima. This paper develops a method to proactively predict local minima using Dynamic Bayesian filtering, based on the detected obstacles in the local view and the global goal. This approach aims to enhance the autonomous navigation of self-driving vehicles by allowing them to predict potential pitfalls before they get stuck, and either ask for help from a human, or re-plan an alternate trajectory.
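A minimal sketch of the underlying recursive Bayesian update appears below: a belief that the current plan leads into a local minimum is revised after each obstacle observation. The likelihood values are invented for illustration, and the paper's actual filter and observation model may differ.

```python
# Conceptual sketch of a recursive Bayesian update over "is the current plan
# headed into a local minimum?" given successive obstacle observations.
# The likelihood numbers are made up for illustration.

def bayes_update(prior: float, lik_trap: float, lik_clear: float) -> float:
    """Posterior P(local minimum | observation) under a binary observation model."""
    joint_trap = lik_trap * prior
    joint_clear = lik_clear * (1.0 - prior)
    return joint_trap / (joint_trap + joint_clear)

belief = 0.1                                   # prior: path probably fine
observations = [
    (0.7, 0.3),   # obstacle detected across most of the goal direction
    (0.8, 0.2),   # another obstacle, corridor narrowing
    (0.6, 0.5),   # ambiguous reading
]
for lik_trap, lik_clear in observations:
    belief = bayes_update(belief, lik_trap, lik_clear)
    print(f"P(local minimum) = {belief:.2f}")
# If the belief crosses a threshold, the UGV could re-plan or ask a human for help.
```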
What Do People Want to Know About Artificial Intelligence (AI)? The Importance Of Answering End-User Questions to Explain Autonomous Vehicle (AV) Decisions
arXiv, May 2025
Somayeh Molaei, Lionel P. Robert, Nikola Banovic
Improving end-users’ understanding of decisions made by autonomous vehicles (AVs) driven by artificial intelligence (AI) can improve utilization and acceptance of AVs. However, current explanation mechanisms primarily help AI researchers and engineers in debugging and monitoring their AI systems, and may not address the specific questions of end-users, such as passengers, about AVs in various scenarios. In this paper, we conducted two user studies to investigate questions that potential AV passengers might pose while riding in an AV and evaluate how well answers to those questions improve their understanding of AI-driven AV decisions. Our initial formative study identified a range of questions about AI in autonomous driving that existing explanation mechanisms do not readily address. Our second study demonstrated that interactive text-based explanations effectively improved participants’ comprehension of AV decisions compared to simply observing AV decisions. These findings inform the design of interactions that motivate end-users to engage with and inquire about the reasoning behind AI-driven AV decisions.
Evaluating Generative AI Systems is a Social Science Measurement Challenge
arXiv, May 2025
Hanna Wallach, Meera Desai, Nicholas Pangakis, A. Feder Cooper, Angelina Wang, Solon Barocas, Alexandra Chouldechova, Chad Atalla, Su Lin Blodgett, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs
Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems. The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no explicit systematization in between. As well as surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, this framework has two important implications for evaluating evaluations: First, it can enable stakeholders from different worlds to participate in conceptual debates, broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.
Using Language Models to Decipher the Motivation Behind Human Behaviors
arXiv, May 2025
Yutong Xie, Qiaozhu Mei, Walter Yuan, Matthew O. Jackson
AI presents a novel tool for deciphering the motivations behind human behaviors. By varying prompts to a large language model, we can elicit the full range of human behaviors in a variety of different scenarios in classic economic games. By analyzing which prompts elicit which behaviors, we infer (decipher) the motivations behind the human behaviors. We also show how one can analyze the prompts to reveal relationships between the classic economic games, providing insight into what different economic scenarios induce people to think about. We also show how this deciphering process can be used to understand differences in the behavioral tendencies of different populations. We show how AI offers a new way to examine the thinking and framing that produce different behaviors.
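A hypothetical sketch of the prompt-variation idea follows: elicit dictator-game offers under different motivation framings, then ask which framing best reproduces observed human offers. The prompt wording, stand-in model outputs, and matching rule are all assumptions for illustration.

```python
# Hypothetical sketch: vary motivation framings in a dictator-game prompt and
# compare the elicited offers with observed human behavior.

MOTIVATIONS = {
    "self-interest": "You care only about maximizing your own payoff.",
    "fairness": "You care about splitting resources fairly.",
    "altruism": "You care most about the other person's well-being.",
}

def dictator_prompt(motivation_text: str) -> str:
    return (
        f"{motivation_text} You have $10 to split between yourself and a "
        "stranger. How many dollars do you give the stranger? Answer with a number."
    )

def infer_motivation(observed_mean_offer: float, elicited: dict[str, float]) -> str:
    """Pick the motivation framing whose elicited offer is closest to the human mean."""
    return min(elicited, key=lambda m: abs(elicited[m] - observed_mean_offer))

prompts = {name: dictator_prompt(text) for name, text in MOTIVATIONS.items()}
print(prompts["fairness"])

# Suppose the model returned these mean offers under each framing (stand-in values):
elicited_offers = {"self-interest": 0.5, "fairness": 5.0, "altruism": 7.5}
print(infer_motivation(observed_mean_offer=3.2, elicited=elicited_offers))
```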
Exploring Collaborative GenAI Agents in Synchronous Group Settings: Eliciting Team Perceptions and Design Considerations for the Future of Work
arXiv, April 2025
Janet G. Johnson, Macarena Peralta, Mansanjam Kaur, Ruijie Sophia Huang, Sheng Zhao, Ruijia Guan, Shwetha Rajaram, Michael Nebeling
While generative artificial intelligence (GenAI) is finding increased adoption in workplaces, current tools are primarily designed for individual use. Prior work established the potential for these tools to enhance personal creativity and productivity towards shared goals; however, we do not yet know how best to take into account the nuances of group work and team dynamics when deploying GenAI in work settings. In this paper, we investigate the potential of collaborative GenAI agents to augment teamwork in synchronous group settings through an exploratory study that engaged 25 professionals across 6 teams in speculative design workshops and individual follow-up interviews. Our workshops included a mixed reality prototype to simulate embodied collaborative GenAI agents capable of actively participating in group discussions. Our findings suggest that, if designed well, collaborative GenAI agents offer valuable opportunities to enhance team problem-solving by challenging groupthink, bridging communication gaps, and reducing social friction. However, teams’ willingness to integrate GenAI agents depended on their perceived fit across a number of individual, team, and organizational factors. We outline the key design tensions around agent representation, social prominence, and engagement and highlight the opportunities spatial and immersive technologies could offer to modulate GenAI influence on team outcomes and strike a balance between augmentation and agency.
Evaluating how LLM annotations represent diverse views on contentious topics
arXiv, March 2025
Megan A. Brown, Shubham Atreja, Libby Hemphill, Patrick Y. Wu
Researchers have proposed the use of generative large language models (LLMs) to label data for both research and applied settings. This literature emphasizes the improved performance of LLMs relative to other natural language models, noting that LLMs typically outperform other models on standard metrics such as accuracy, precision, recall, and F1 score. However, previous literature has also highlighted the bias embedded in language models, particularly around contentious topics such as potentially toxic content. This bias could result in labels applied by LLMs that disproportionately align with majority groups over a more diverse set of viewpoints. In this paper, we evaluate how LLMs represent diverse viewpoints on these contentious tasks. Across four annotation tasks on four datasets, we show that LLMs do not show substantial disagreement with annotators on the basis of demographics. Instead, the model, prompt, and disagreement between human annotators on the labeling task are far more predictive of LLM agreement. Our findings suggest that when using LLMs to annotate data, under-representing the views of particular groups is not a substantial concern. We conclude with a discussion of the implications for researchers and practitioners.
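One simple way to operationalize the question, sketched below with hypothetical records, is to compute LLM-annotator agreement separately for each demographic group of annotators; the field names and data are illustrative, not the paper's analysis code.

```python
from collections import defaultdict

# Illustrative only: does the LLM's label agree more often with annotators
# from one demographic group than another? Records are hypothetical.

annotations = [
    {"item": 1, "group": "White", "human_label": 1, "llm_label": 1},
    {"item": 1, "group": "Black", "human_label": 0, "llm_label": 1},
    {"item": 2, "group": "White", "human_label": 0, "llm_label": 0},
    {"item": 2, "group": "Asian", "human_label": 0, "llm_label": 0},
]

matches_by_group = defaultdict(list)
for a in annotations:
    matches_by_group[a["group"]].append(a["human_label"] == a["llm_label"])

for group, matches in matches_by_group.items():
    print(f"{group}: agreement = {sum(matches) / len(matches):.2f}")
```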
AI Literacy in K-12 and Higher Education in the Wake of Generative AI: An Integrative Review
arXiv, February 2025
Xingjian Gu, Barbara J. Ericson
Even though AI literacy has emerged as a prominent education topic in the wake of generative AI, its definition remains vague. There is little consensus among researchers and practitioners on how to discuss and design AI literacy interventions. The term has been used to describe both learning activities that train undergraduate students to use ChatGPT effectively and having kindergarten children interact with social robots. This paper applies an integrative review method to examine empirical and theoretical AI literacy studies published since 2020, to identify shifting definitions and emerging trends in AI literacy around the public introduction of generative AI. In synthesizing the 124 reviewed studies, three ways to conceptualize literacy—functional, critical, and indirectly beneficial—and three perspectives on AI—technical detail, tool, and sociocultural—were identified, forming a framework that reflects the spectrum of how AI literacy is approached in practice. The framework highlights the need for more specialized terms within AI literacy discourse and indicates research gaps in certain AI literacy objectives.
Keep up with research from UMSI experts by subscribing to our free research roundup newsletter!