University of Michigan School of Information
Privacy, Accessibility and Bias: UMSI Research Roundup

Wednesday, 11/01/2023
University of Michigan School of Information faculty and PhD students are creating and sharing knowledge that helps build a better world. Here are some of their recent publications.
Bridging the Gap: Towards Advancing Privacy and Accessibility
ASSETS '23: Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, October 2023
Rahaf Alharbi, Robin N. Brewer, Gesu India, Lotus Zhang, Leah Findlater, Yixin Zou, Abigale Stangl
The privacy dimensions of accessibility technologies are often understudied and overlooked. Very little prior research has investigated the privacy concerns of disabled people, and much less has studied the barriers of privacy-preserving techniques. In order to address this gap and bridge between two separate communities (accessibility and privacy), our one-day workshop explores how researchers might design and build technologies that are both accessible and privacy-preserving.
Psychometric Evaluation of the Modes of Health Information Acquisition, Sharing, and Use Questionnaire: Prospective Cross-Sectional Observational Study
Journal of Medical Internet Research, September 2023
Lenette M Jones, Ronald J Piscotty Jr, Stephen Sullivan, Beatriz Manzor Mitrzyk, Robert J Ploutz-Snyder, Bidisha Ghosh, Tiffany Veinot
Background: Health information is a critical resource for individuals with health concerns and conditions, such as hypertension. Enhancing health information behaviors may help individuals to better manage chronic illness. The Modes of Health Information Acquisition, Sharing, and Use (MHIASU) is a 23-item questionnaire that measures how individuals with health risks or chronic illness acquire, share, and use health information. Yet this measure has not been psychometrically evaluated in a large national sample.
Objective: The objective of this study was to evaluate the psychometric properties of the self-administered MHIASU in a large, diverse cohort of individuals living with a chronic illness.
Methods: Sharing Information, a prospective, observational study, was launched in August 2018 and used social media campaigns to advertise to Black women. Individuals who were interested in participating clicked on the advertisements and were redirected to a Qualtrics eligibility screener. To meet eligibility criteria individuals had to self-identify as a Black woman, be diagnosed with hypertension by a health care provider, and live in the United States. A total of 320 Black women with hypertension successfully completed the eligibility screener and then completed a web-based version of the MHIASU questionnaire. We conducted a psychometric evaluation of the MHIASU using exploratory factor analysis. The evaluation included item review, construct validity, and reliability.
Results: Construct validity was established using exploratory factor analysis with principal axis factoring. The analysis was constricted to the expected domains. Interitem correlations were examined for possible item extraction. There were no improvements in factor structure with the removal of items with high interitem correlation (n=3), so all items of the MHIASU were retained. As anticipated, the instrument was found to have 3 subscales: acquisition, sharing, and use. Reliability was high for all 3 subscales, as evidenced by Cronbach α scores of .81 (acquisition), .81 (sharing), and .93 (use). Factor 3 (use of health information) explained the maximum variance (74%).
Conclusions: Construct validity and reliability of the web-based, self-administered MHIASU were demonstrated in a large national cohort of Black women with hypertension. Although this sample was highly educated and may have had higher digital literacy compared to other samples not recruited via social media, the population captured (Black women living with hypertension) is often underrepresented in research and is particularly vulnerable to this chronic condition. Future studies can use the MHIASU to examine health information behavior in other diverse populations managing health concerns and conditions.
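For readers unfamiliar with the reliability statistic reported above, the following is a minimal sketch of how Cronbach's α is commonly computed from item-level responses. The function name and the toy data are illustrative assumptions and are not taken from the study, which does not describe its software.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondent rows (one score per item).

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item (column) across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Toy responses invented for illustration: 4 respondents, 3 items.
toy = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
alpha = cronbach_alpha(toy)
print(round(alpha, 3))
```

Because the three toy items move in lockstep, α comes out at its maximum of 1; real subscales, such as those reported above, fall below that.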
Developing a Semantically Based Query Recommendation for an Electronic Medical Record Search Engine: Query Log Analysis and Design Implications
JMIR Formative Research, September 2023
Danny T Y Wu, David Hanauer, Paul Murdock, V G Vinod Vydiswaran, Qiaozhu Mei, Kai Zheng
Background: An effective and scalable information retrieval (IR) system plays a crucial role in enabling clinicians and researchers to harness the valuable information present in electronic health records. In a previous study, we developed a prototype medical IR system, which incorporated a semantically based query recommendation (SBQR) feature. The system was evaluated empirically and demonstrated high perceived performance by end users. To delve deeper into the factors contributing to this perceived performance, we conducted a follow-up study using query log analysis.
Objective: One of the primary challenges faced in IR is that users often have limited knowledge regarding their specific information needs. Consequently, an IR system, particularly its user interface, needs to be thoughtfully designed to assist users through the iterative process of refining their queries as they encounter relevant documents during their search. To address these challenges, we incorporated “query recommendation” into our Electronic Medical Record Search Engine (EMERSE), drawing inspiration from the success of similar features in modern IR systems for general purposes.
Methods: The query log data analyzed in this study were collected during our previous experimental study, where we developed EMERSE with the SBQR feature. We implemented a logging mechanism to capture user query behaviors and the output of the IR system (retrieved documents). In this analysis, we compared the initial query entered by users with the query formulated with the assistance of the SBQR. By examining the results of this comparison, we could examine whether the use of SBQR helped in constructing improved queries that differed from the original ones.
Results: Our findings revealed that the first query entered without SBQR and the final query with SBQR assistance were highly similar (Jaccard similarity coefficient=0.77). This suggests that the perceived positive performance of the system was primarily attributed to the automatic query expansion facilitated by the SBQR rather than users manually manipulating their queries. In addition, through entropy analysis, we observed that search results converged in scenarios of moderate difficulty, and the degree of convergence correlated strongly with the perceived system performance.
Conclusions: The study demonstrated the potential contribution of the SBQR in shaping participants' positive perceptions of system performance, contingent upon the difficulty of the search scenario. Medical IR systems should therefore consider incorporating an SBQR as a user-controlled option or a semiautomated feature. Future work entails redesigning the experiment in a more controlled manner and conducting multisite studies to demonstrate the effectiveness of EMERSE with SBQR for patient cohort identification. By further exploring and validating these findings, we can enhance the usability and functionality of medical IR systems in real-world settings.
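As a rough illustration of the similarity measure reported in the results, the sketch below computes a Jaccard coefficient between two queries. It assumes queries are compared as sets of lowercase, whitespace-separated terms; the abstract does not state how the study tokenized queries, and the example queries are invented.

```python
def jaccard(query_a, query_b):
    """Jaccard similarity between the term sets of two queries."""
    a = set(query_a.lower().split())
    b = set(query_b.lower().split())
    if not a and not b:
        return 1.0  # two empty queries are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical first and final queries (not from the study's logs).
first_query = "chest pain shortness of breath"
final_query = "chest pain shortness of breath dyspnea"
similarity = jaccard(first_query, final_query)
print(round(similarity, 2))
```

A value near 1 means the final query kept most of the original terms, which is the pattern the authors interpret as users relying on automatic expansion rather than rewriting their queries by hand.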
A randomized trial of a mobile health intervention to augment cardiac rehabilitation
npj Digital Medicine, September 2023
Jessica R. Golbus, Kashvi Gupta, Rachel Stevens, V. Swetha E. Jeganathan, Evan Luff, Jieru Shi, Walter Dempsey, Thomas Boyden, Bhramar Mukherjee, Sarah Kohnstamm, Vlad Taralunga, Vik Kheterpal, Susan Murphy, Predrag Klasnja, Sachin Kheterpal, Brahmajee K. Nallamothu
Mobile health (mHealth) interventions may enhance positive health behaviors, but randomized trials evaluating their efficacy are uncommon. Our goal was to determine if an mHealth intervention augmented and extended benefits of center-based cardiac rehabilitation (CR) for physical activity levels at 6 months. We conducted a randomized clinical trial with low and moderate risk patients with a compatible smartphone enrolled in CR at two health systems. All participants received a compatible smartwatch and usual CR care. Intervention participants received an mHealth intervention that included a just-in-time-adaptive intervention (JITAI) as text messages. The primary outcome was change in remote 6-minute walk distance at 6 months stratified by device type. Here we report the results for 220 participants enrolled in the study (mean [SD]: age 59.6 [10.6] years; 67 [30.5%] women). For our primary outcome at 6 months, there is no significant difference in the change in 6-minute walk distance across smartwatch types (Intervention versus control: +31.1 meters Apple Watch, −7.4 meters Fitbit; p = 0.28). Secondary outcomes show no difference in mean step counts between the first and final weeks of the study, but a change in 6-minute walk distance at 3 months for Fitbit users. Amongst patients enrolled in center-based CR, an mHealth intervention did not improve 6-month outcomes but suggested differences at 3 months in some users.
Community-Engaged Participatory Methods to Address LGBTQ+ Young People’s Health Information Needs With a Resource Website: Participatory Design and Development Study
JMIR Formative Research, September 2023
Daniel Delmonaco, Shannon Li, Christian Paneda, Elliot Popoff, Luna Hughson, Laura Jadwin-Cakmak, Jack Alferio, Christian Stephenson, Angelique Henry, Kiandra Powdhar, Isabella Gierlinger, Gary W Harper, Oliver L Haimson
Background: Lesbian, gay, bisexual, transgender, queer, and questioning (LGBTQ+) young people (aged 15 to 25 years) face unique health challenges and often lack resources to adequately address their health information needs related to gender and sexuality. Beyond information access issues, LGBTQ+ young people may need information resources to be designed and organized differently compared with their cisgender and heterosexual peers and, because of identity exploration, may have different information needs related to gender and sexuality than older people.
Objective: The objective of our study was to work with a community partner to develop an inclusive and comprehensive new website to address LGBTQ+ young people’s health information needs. To design this resource website using a community-engaged approach, our objective required working with and incorporating content and design recommendations from young LGBTQ+ participants.
Methods: We conducted interviews (n=17) and participatory design sessions (n=11; total individual participants: n=25) with LGBTQ+ young people to understand their health information needs and elicit design recommendations for the new website. We involved our community partner in all aspects of the research and design process.
Results: We present participants’ desired resources, health topics, and technical website features that can facilitate information seeking for LGBTQ+ young people exploring their sexuality and gender and looking for health resources. We describe how filters can allow people to find information related to intersecting marginalized identities and how dark mode can be a privacy measure to avoid unwanted identity disclosure. We reflect on our design process and situate the website development in previous critical reflections on participatory research with marginalized communities. We suggest recommendations for future LGBTQ+ health websites based on our research and design experiences and final website design, which can enable LGBTQ+ young people to access information, find the right information, and navigate identity disclosure concerns. These design recommendations include filters, a reduced number of links, conscientious choice of graphics, dark mode, and resources tailored to intersecting identities.
Conclusions: Meaningful collaboration with community partners throughout the design process is vital for developing technological resources that meet community needs. We argue for community partner leadership rather than just involvement in community-based research endeavors at the intersection of human-computer interaction and health.
May I Ask a Follow-up Question? Understanding the Benefits of Conversations in Neural Network Explainability
arXiv, September 2023
Tong Zhang, X. Jessie Yang, Boyang Li
Research in explainable AI (XAI) aims to provide insights into the decision-making process of opaque AI models. To date, most XAI methods offer one-off and static explanations, which cannot cater to the diverse backgrounds and understanding levels of users. With this paper, we investigate if free-form conversations can enhance users’ comprehension of static explanations, improve acceptance and trust in the explanation methods, and facilitate human-AI collaboration. Participants are presented with static explanations, followed by a conversation with a human expert regarding the explanations. We measure the effect of the conversation on participants’ ability to choose, from three machine learning models, the most accurate one based on explanations and their self-reported comprehension, acceptance, and trust. Empirical results show that conversations significantly improve comprehension, acceptance, trust, and collaboration. Our findings highlight the importance of customized model explanations in the format of free-form conversations and provide insights for the future design of conversational explanations.
Fairness Hub Technical Briefs: Overview of Bias Mitigation Strategies
EdArXiv, September 2023
Jinsook Lee, Chris Brooks, Renzhe Yu, Rene Kizilcec
There are a number of strategies that can be used to mitigate model bias. Which strategy is most appropriate and effective will depend on the particular use case, as this depends on the source of the bias. We offer an overview of potential strategies here with a brief description of each to help guide the selection of bias mitigation strategies. The Fairness Hub’s goal is to provide teams with resources for bias analysis and mitigation without the need for data sharing. We encourage the teams to evaluate the effectiveness of individual and combined mitigation strategies with the AUC Gap measure of model bias for various models in their application and for various subgroups. Many of the strategies described are implemented in the AI Fairness 360 and the Fairlearn toolkits, and our Hub can answer further questions about the implementation and evaluation of these strategies.
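To make the evaluation step concrete, here is a minimal sketch of an AUC Gap calculation. It assumes the measure is the difference between the highest and lowest subgroup AUC; that definition, the function names, and the toy data are assumptions for illustration, and the Fairness Hub brief should be consulted for the exact formulation.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("each group needs both positive and negative labels")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_gap(labels, scores, groups):
    """Return (gap, per-group AUCs); gap = max AUC - min AUC (assumed)."""
    by_group = {}
    for y, s, g in zip(labels, scores, groups):
        ys, ss = by_group.setdefault(g, ([], []))
        ys.append(y)
        ss.append(s)
    aucs = {g: auc(ys, ss) for g, (ys, ss) in by_group.items()}
    return max(aucs.values()) - min(aucs.values()), aucs


# Toy predictions for two hypothetical subgroups, "A" and "B".
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.3, 0.6, 0.7, 0.8, 0.2]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap, per_group = auc_gap(labels, scores, groups)
print(per_group, gap)
```

Running the same calculation before and after applying a mitigation strategy gives a simple way to check whether the strategy narrowed the gap for a given model.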
Selection homophily and peer influence for adolescents’ smoking and vaping norms and outcomes in high and middle-income settings
Humanities and Social Sciences Communications, September 2023
Jennifer M. Murray, Sharon C. Sánchez-Franco, Olga L. Sarmiento, Erik O. Kimbrough, Christopher Tate, Shannon C. Montgomery, Rajnish Kumar, Laura Dunne, Abhijit Ramalingam, Erin L. Krupka, Felipe Montes, Huiyu Zhou, Laurence Moore, Linda Bauld, Blanca Llorente, Frank Kee, Ruth F. Hunter
The MECHANISMS study investigates how social norms for adolescent smoking and vaping are transmitted through school friendship networks, and is the first study to use behavioral economics methodology to assess smoking-related social norms. Here, we investigate the effects of selection homophily (the tendency to form friendships with similar peers) and peer influence (a social process whereby an individual’s behavior or attitudes are affected by peers acting as reference points for the individual) on experimentally measured smoking and vaping norms, and other smoking outcomes, in adolescents from high and middle-income settings. Full school year groups in six secondary schools in Northern Ireland (United Kingdom) and six secondary schools in Bogotá (Colombia) participated (n = 1344/1444, participation = 93.1%, target age 12–13 years). Over one semester, pupils received one previously tested school-based smoking prevention program (ASSIST or Dead Cool). Outcomes included experimentally measured smoking/vaping norms, self-report and objectively measured smoking behavior, and self-report smoking norms, intentions, susceptibility, attitudes, and psycho-social antecedents. We investigated selection homophily and peer influence using regressions and SIENA modeling. Regression results demonstrate lagged and contemporaneous selection homophily (odds ratios [ORs] = 0.87–1.26, p ≤ 0.01), and peer influence effects for various outcomes from average responses of friends, school classes, or school year groups (standardized coefficients [βs] = 0.07–0.55, ORs = 1.14–1.31, p ≤ 0.01). SIENA models showed that comparable proportions of smoking/vaping-based similarity between friends were due to selection homophily (32.8%) and peer influence (39.2%). A higher percentage of similarity between friends was due to selection homophily and/or peer influence for ASSIST schools compared to Dead Cool. Selection homophily was also more important in Bogotá, whilst peer influence was stronger in Northern Ireland. These findings support using social norms strategies in adolescent smoking prevention interventions. Future research should consider selection homophily and social influence jointly, and examine whether these findings translate to other high and low-middle-income settings with varying cultures and norms.
Advancing Understanding of Just-in-Time States for Supporting Physical Activity (Project JustWalk JITAI): Protocol for a System ID Study of Just-in-Time Adaptive Interventions
JMIR Research Protocols, September 2023
Junghwan Park, Meelim Kim, Mohamed El Mistiri, Rachael Kha, Sarasij Banerjee, Lisa Gotzian, Guillaume Chevance, Daniel E Rivera, Predrag Klasnja, Eric Hekler
Background: Just-in-time adaptive interventions (JITAIs) are designed to provide support when individuals are receptive and can respond beneficially to the prompt. The notion of a just-in-time (JIT) state is critical for JITAIs. To date, JIT states have been formulated either in a largely data-driven way or based on theory alone. There is a need for an approach that enables rigorous theory testing and optimization of the JIT state concept.
Objective: The purpose of this system ID experiment was to investigate JIT states empirically and enable the empirical optimization of a JITAI intended to increase physical activity (steps/d).
Methods: We recruited physically inactive English-speaking adults aged ≥25 years who owned smartphones. Participants wore a Fitbit Versa 3 and used the study app for 270 days. The JustWalk JITAI project uses system ID methods to study JIT states. Specifically, provision of support systematically varied across different theoretically plausible operationalizations of JIT states to enable a more rigorous and systematic study of the concept. We experimentally varied 2 intervention components: notifications delivered up to 4 times per day designed to increase a person’s steps within the next 3 hours and suggested daily step goals. Notifications to walk were experimentally provided across varied operationalizations of JIT states accounting for need (ie, whether daily step goals were previously met or not), opportunity (ie, whether the next 3 h were a time window during which a person had previously walked), and receptivity (ie, a person previously walked after receiving notifications). Suggested daily step goals varied systematically within a range related to a person’s baseline level of steps per day (eg, 4000) until they met clinically meaningful targets (eg, averaging 8000 steps/d as the lower threshold across a cycle). A series of system ID estimation approaches will be used to analyze the data and obtain control-oriented dynamical models to study JIT states. The estimated models from all approaches will be contrasted, with the ultimate goal of guiding rigorous, replicable, empirical formulation and study of JIT states to inform a future JITAI.
Results: As is common in system ID, we conducted a series of simulation studies to formulate the experiment. The results of our simulation studies illustrated the plausibility of this approach for generating informative and unique data for studying JIT states. The study began enrolling participants in June 2022, with a final enrollment of 48 participants. Data collection concluded in April 2023. Upon completion of the analyses, the results of this study are expected to be submitted for publication in the fourth quarter of 2023.
Conclusions: This study will be the first empirical investigation of JIT states that uses system ID methods to inform the optimization of a scalable JITAI for physical activity.
How We Define Harm Impacts Data Annotations: Explaining How Annotators Distinguish Hateful, Offensive, and Toxic Comments
arXiv, September 2023
Angela Schöpke-Gonzalez, Siqi Wu, Sagar Kumar, Paul J. Resnick, Libby Hemphill
Computational social science research has made advances in machine learning and natural language processing that support content moderators in detecting harmful content. These advances often rely on training datasets annotated by crowdworkers for harmful content. In designing instructions for annotation tasks to generate training data for these algorithms, researchers often treat the harm concepts that we train algorithms to detect - 'hateful', 'offensive', 'toxic', 'racist', 'sexist', etc. - as interchangeable. In this work, we studied whether the way that researchers define 'harm' affects annotation outcomes. Using Venn diagrams, information gain comparisons, and content analyses, we reveal that annotators do not use the concepts 'hateful', 'offensive', and 'toxic' interchangeably. We identify that features of harm definitions and annotators' individual characteristics explain much of how annotators use these terms differently. Our results offer empirical evidence discouraging the common practice of using harm concepts interchangeably in content moderation research. Instead, researchers should make specific choices about which harm concepts to analyze based on their research goals. Recognizing that researchers are often resource constrained, we also encourage researchers to provide information to bound their findings when their concepts of interest differ from concepts that off-the-shelf harmful content detection algorithms identify. Finally, we encourage algorithm providers to ensure their instruments can adapt to contextually-specific content detection goals (e.g., soliciting instrument users' feedback).
Automatic Prompt Rewriting for Personalized Text Generation
arXiv, September 2023
Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, Michael Bendersky
Facilitated by large language models (LLMs), personalized text generation has become a rapidly growing research direction. Most existing studies focus on designing specialized models for a particular domain, or they require fine-tuning the LLMs to generate personalized text. We consider a typical scenario in which the large language model, which generates personalized output, is frozen and can only be accessed through APIs. Under this constraint, all one can do is to improve the input text (i.e., text prompts) sent to the LLM, a procedure that is usually done manually. In this paper, we propose a novel method to automatically revise prompts for personalized text generation. The proposed method takes the initial prompts generated by a state-of-the-art, multistage framework for personalized generation and rewrites a few critical components that summarize and synthesize the personal context. The prompt rewriter employs a training paradigm that chains together supervised learning (SL) and reinforcement learning (RL), where SL reduces the search space of RL and RL facilitates end-to-end training of the rewriter. Using datasets from three representative domains, we demonstrate that the rewritten prompts outperform both the original prompts and the prompts optimized via supervised learning or reinforcement learning alone. In-depth analysis of the rewritten prompts shows that they are not only human readable, but also able to guide manual revision of prompts when there is limited resource to employ reinforcement learning to train the prompt rewriter, or when it is costly to deploy an automatic prompt rewriter for inference.
Using social norms to explain giving behavior
Experimental Economics, October 2023
Catherine C. Eckel, Hanna G. Hoover, Erin L. Krupka, Nishita Sinha, Rick K. Wilson
Transfers of resources in dictator games vary significantly by the characteristics of recipients. We focus on social norms and demonstrate that variation in the recipient changes both giving and injunctive norms and may offer an explanation for differences in giving. We elicit generosity using dictator games, and social norms using incentivized coordination games, with two different recipient types: an anonymous student and a charitable organization. A within-subjects design ensures that other factors are held constant. Our results show that differences in giving behavior are closely related to differences in social norms of giving across contexts. Controlling for individual differences in beliefs about the norm, subjects do not weight compliance with the norms in the student recipient or charity recipient dictator game differently. These results suggest that the impact of context on giving co-occurs with an impact on social norms.
Tackling the Lack of a Practical Guide in Disability-Centered Research
ASSETS '23: Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, October 2023
Emma J. McDonnell, Kelly Avery Mack, Kathrin Gerling, Katta Spiel, Cynthia L. Bennett, Robin N. Brewer, Rua Mae Williams, Garreth W. Tigwell
Accessibility research strives to develop technology that is useful for disabled people, but the research processes that we engage in do not always center disabled people in a way that allows us to shape artifacts so that they benefit disabled communities. In this workshop, we want to address core questions that are relevant in this context: How can research questions be defined in a way that shares power between research teams and technology users? How should research processes be designed to be broadly accessible for disabled people? And what are equitable ways of summarizing and sharing research findings in a way that allows disabled communities to critically appraise findings with us? Through discussion among all attendees, we want to develop a practical guide in disability-centered research that will be made available and further developed as a community resource when engaging in accessibility research.
Adoption of Recurrent Innovations: A Large-Scale Case Study on Mobile App Updates
ACM Transactions on the Web, October 2023
Fuqi Lin, Xuan Lu, Wei Ai, Huoran Li, Yun Ma, Yulian Yang, Hongfei Deng, Qingxiang Wang, Qiaozhu Mei, Xuanzhe Liu
Modern technology innovations feature a successive and even recurrent procedure. Intervals between old and new generations of technology are shrinking, and the Internet and Web services have facilitated the fast adoption of an innovation even before the convergence of its predecessor. While the adoption and diffusion of innovations have been studied for decades, most theories and analyses focus on single and one-time innovations. Meanwhile, limited work has investigated successive innovations while lacking user-level analysis, possibly due to the unavailability of fine-grained adoption behavior data. In this study, we present the first large-scale analysis of the adoption of recurrent innovations in the context of mobile app updates, investigating how millions of users consume various versions of thousands of apps on their mobile devices. Our analysis reveals novel patterns of crowd and individual adoption behaviors, which suggest the need for new categories of adopters to be added on top of the Rogers model of innovation diffusion. We show that standard machine learning models are able to pick up various sources of signals to predict whether or not a user in these different categories will adopt a new version of an app and how soon they will adopt it.
Measuring and Predicting Drivers’ Takeover Readiness and Supporting Takeover Transitions in Automated Driving
Foundation for Traffic Safety Emerging Technologies Technical Support, September 2023
Doowon Han, Jundi Liu, Feng Zhou, Dawn Tilbury, Lionel Robert, Lisa Molnar, X Jessie Yang
As vehicle automation progresses, the driver’s role will transform from an operator to a system supervisor. Level 3 automated vehicles (AVs) possess the ability to perceive their surroundings and interpret road conditions while performing driving tasks such as accelerating, braking, steering, and navigating. The advanced capability allows the driver to engage in non-driving related tasks (NDRTs). However, if an AV encounters a system limit, such as vision system failure or path planning issues, the driver must quickly regain control of the vehicle. This transition from automated control to manual control presents a crucial challenge to the human driver, as they become increasingly out of the loop (OOTL) (Zhou et al., 2020; Petersen et al., 2019; Molnar et al., 2017). To help inform and address these issues, this report is organized into two parts, presenting two studies aimed at facilitating takeover transitions when using Level 3 automation. Part 1 examines driver takeover readiness; that is, driver behavior and physiological indices and other factors that are predictive of successful takeover performance. Knowledge of such measures can inform the development and tuning of driver state monitoring (DSM) systems. Part 2 examines a driver support system, a gaze guidance system, that helps orient the driver’s attention to areas of potential risk during a control takeover. This study leverages data from an existing naturalistic driving study as well as theoretical models of driver visual attention allocation.
HiCL: Hierarchical Contrastive Learning of Unsupervised Sentence Embeddings
Findings of the Association for Computational Linguistics: EMNLP 2023, October 2023
Zhuofeng Wu, Chaowei Xiao, VG Vinod Vydiswaran
In this paper, we propose a hierarchical contrastive learning framework, HiCL, which considers local segment-level and global sequence-level relationships to improve training efficiency and effectiveness. Traditional methods typically encode a sequence in its entirety for contrast with others, often neglecting local representation learning, leading to challenges in generalizing to shorter texts. Conversely, HiCL improves its effectiveness by dividing the sequence into several segments and employing both local and global contrastive learning to model segment-level and sequence-level relationships. Further, considering the quadratic time complexity of transformers over input tokens, HiCL boosts training efficiency by first encoding short segments and then aggregating them to obtain the sequence representation. Extensive experiments show that HiCL enhances the prior top-performing SNCSE model across seven extensively evaluated STS tasks, with an average increase of +0.2% observed on BERT-large and +0.44% on RoBERTa-large.
Examining the Potential of ChatGPT on Biomedical Information Retrieval: Fact-Checking Drug-Disease Associations
Annals of Biomedical Engineering, October 2023
Zhenxiang Gao, Lingyao Li, Siyuan Ma, Qinyong Wang, Libby Hemphill, Rong Xu
Large language models (LLMs) such as ChatGPT have recently attracted significant attention due to their impressive performance on many real-world tasks. These models have also demonstrated potential in facilitating various biomedical tasks. However, little is known of their potential in biomedical information retrieval, especially identifying drug-disease associations. This study aims to explore the potential of ChatGPT, a popular LLM, in discerning drug-disease associations. We collected 2694 true drug-disease associations and 5662 false drug-disease pairs. Our approach involved creating various prompts to instruct ChatGPT in identifying these associations. Under varying prompt designs, ChatGPT identified drug-disease associations with an accuracy of 74.6–83.5% and 96.2–97.6% for the true and false pairs, respectively. This study shows that ChatGPT has potential in identifying drug-disease associations and may serve as a helpful tool in searching pharmacy-related information. However, the accuracy of its insights warrants comprehensive examination before its implementation in medical practice.
Automated Evaluation of Personalized Text Generation using Large Language Models
arXiv, October 2023
Yaqing Wang, Jiepu Jiang, Mingyang Zhang, Cheng Li, Yi Liang, Qiaozhu Mei, Michael Bendersky
Personalized text generation presents a specialized mechanism for delivering content that is specific to a user's personal context. While the research progress in this area has been rapid, evaluation still presents a challenge. Traditional automated metrics such as BLEU and ROUGE primarily measure lexical similarity to human-written references, and are not able to distinguish personalization from other subtle semantic aspects, thus falling short of capturing the nuances of personalized generated content quality. On the other hand, human judgments are costly to obtain, especially in the realm of personalized evaluation. Inspired by these challenges, we explore the use of large language models (LLMs) for evaluating personalized text generation, and examine their ability to understand nuanced user context. We present AuPEL, a novel evaluation method that distills three major semantic aspects of the generated text: personalization, quality and relevance, and automatically measures these aspects. To validate the effectiveness of AuPEL, we design carefully controlled experiments and compare the accuracy of the evaluation judgments made by LLMs versus that of judgements made by human annotators, and conduct rigorous analyses of the consistency and sensitivity of the proposed metric. We find that, compared to existing evaluation metrics, AuPEL not only distinguishes and ranks models based on their personalization abilities more accurately, but also presents commendable consistency and efficiency for this task. Our work suggests that using LLMs as the evaluators of personalized text generation is superior to traditional text similarity metrics, even though interesting new challenges still remain.