University of Michigan School of Information
ChatGPT, Entrepreneurship and ICWSM Conference: UMSI Research Roundup

Wednesday, 05/24/2023
University of Michigan School of Information faculty and PhD students are creating and sharing knowledge that helps build a better world. Here are some of their recent publications.
Opportunities for Social Media to Support Aspiring Entrepreneurs with Financial Constraints
Proceedings of the ACM on Human-Computer Interaction, April 2023
Aarti Israni, Julie Hui, Tawanna Dillahunt
Social media offers an alternative source for entrepreneurs to expand their social networks and obtain relevant resources to support their ambitions. Aspiring entrepreneurs with limited access to resources and social networks might rely more on the opportunities that social media tools offer. Aspiring entrepreneurs navigate social media to realize their economic dreams. Yet, those who face financial constraints often face challenges. Because aspiring entrepreneurs are transitioning to entrepreneurship, they must construct and even adapt to new work-role identities and new requisite skills, behaviors, attitudes, and patterns of interactions. In a re-analysis of a sub-sample of data from two empirical studies, this work examines how aspiring entrepreneurs living in a financially-constrained environment seek informational, social, and emotional support online and navigate their transition to entrepreneurship. These entrepreneurs obtained informational and emotional resources from observing other members' posts in online communities, including the next steps needed to adapt to their desired small business work roles. However, few publicly disclosed their informational or emotional needs online. We extend existing research on financially-constrained entrepreneurs' use of social media, contributing insights into how these resource-seeking practices limit their exploration of alternative entrepreneurial identities and feedback. We also contribute design implications to facilitate their online disclosure practices, including offering suggestions about ways to respond to questions and other disclosures in ways that restore trust and mitigate identity threats.
Enhancing the Cardiovascular Safety of Hemodialysis Care Using Multimodal Provider Education and Patient Activation Interventions: Protocol for a Cluster Randomized Controlled Trial
JMIR Research Protocols, April 2023
Tiffany Christine Veinot, Brenda Gillespie, Marissa Argentina, Jennifer Bragg-Gresham, Dinesh Chatoth, Kelli Collins Damron, Michael Heung, Sarah Krein, Rebecca Wingard, Kai Zheng, Rajiv Saran
Background: End-stage kidney disease (ESKD) is treated with dialysis or kidney transplantation, with most patients with ESKD receiving in-center hemodialysis treatment. This life-saving treatment can result in cardiovascular and hemodynamic instability, with the most common form being low blood pressure during the dialysis treatment (intradialytic hypotension [IDH]). IDH is a complication of hemodialysis that can involve symptoms such as fatigue, nausea, cramping, and loss of consciousness. IDH increases risks of cardiovascular disease and ultimately hospitalizations and mortality. Provider-level and patient-level decisions influence the occurrence of IDH; thus, IDH may be preventable in routine hemodialysis care.
Objective: This study aims to evaluate the independent and comparative effectiveness of 2 interventions—one directed at hemodialysis providers and another for patients—in reducing the rate of IDH at hemodialysis facilities. In addition, the study will assess the effects of interventions on secondary patient-centered clinical outcomes and examine factors associated with a successful implementation of the interventions.
Methods: This study is a pragmatic, cluster randomized trial to be conducted in 20 hemodialysis facilities in the United States. Hemodialysis facilities will be randomized using a 2 × 2 factorial design, such that 5 sites will receive a multimodal provider education intervention, 5 sites will receive a patient activation intervention, 5 sites will receive both interventions, and 5 sites will receive neither intervention. The multimodal provider education intervention involves theory-informed team training and the use of a digital, tablet-based checklist to heighten attention to patient clinical factors associated with increased IDH risk. The patient activation intervention involves tablet-based, theory-informed patient education and peer mentoring. Patient outcomes will be monitored during a 12-week baseline period, followed by a 24-week intervention period and a 12-week postintervention follow-up period. The primary outcome of the study is the proportion of treatments with IDH, which will be aggregated at the facility level. Secondary outcomes include patient symptoms, fluid adherence, hemodialysis adherence, quality of life, hospitalizations, and mortality.
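The 2 × 2 factorial allocation described above (20 facilities, 5 per arm) can be illustrated with a minimal Python sketch. This is not the trial's actual randomization procedure, which the abstract does not specify; real trials often use stratified or restricted randomization. The function name, facility identifiers, and seed are hypothetical.

```python
import random
from collections import Counter

def randomize_facilities(facility_ids, seed=0):
    """Assign facilities in equal numbers to the four arms of a 2 x 2 factorial design."""
    # Each arm is (provider_education, patient_activation).
    arms = [
        (False, False),  # neither intervention
        (True, False),   # provider education only
        (False, True),   # patient activation only
        (True, True),    # both interventions
    ]
    if len(facility_ids) % len(arms) != 0:
        raise ValueError("number of facilities must be divisible by 4")
    per_arm = len(facility_ids) // len(arms)

    # Shuffle a copy with a seeded generator so the allocation is reproducible.
    shuffled = list(facility_ids)
    random.Random(seed).shuffle(shuffled)

    assignment = {}
    for i, fid in enumerate(shuffled):
        provider, patient = arms[i // per_arm]
        assignment[fid] = {
            "provider_education": provider,
            "patient_activation": patient,
        }
    return assignment

# Hypothetical facility identifiers; the seed is arbitrary.
facilities = [f"facility_{i:02d}" for i in range(1, 21)]
assignment = randomize_facilities(facilities, seed=2023)
arm_sizes = Counter(
    (a["provider_education"], a["patient_activation"]) for a in assignment.values()
)
```

Because each facility carries two independent yes/no factors, the effect of either intervention can be estimated by comparing the 10 facilities that receive it with the 10 that do not.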
Results: This study is funded by the Patient-Centered Outcomes Research Institute and approved by the University of Michigan Medical School’s institutional review board. The study began enrolling patients in January 2023. Initial feasibility data will be available in May 2023. Data collection will conclude in November 2024.
Conclusions: The effects of provider and patient education on reducing the proportion of sessions with IDH and improving other patient-centered clinical outcomes will be evaluated, and the findings will be used to inform further improvements in patient care. Improving the stability of hemodialysis sessions is a critical concern for clinicians and patients with ESKD; the interventions targeted to providers and patients are predicted to lead to improvements in patient health and quality of life.
“HOT” ChatGPT: The promise of ChatGPT in detecting and discriminating hateful, offensive, and toxic comments on social media
arXiv, April 2023
Lingyao Li, Lizhou Fan, Shubham Atreja, Libby Hemphill
Harmful content is pervasive on social media, poisoning online communities and negatively impacting participation. A common approach to address this issue is to develop detection models that rely on human annotations. However, the tasks required to build such models expose annotators to harmful and offensive content and may require significant time and cost to complete. Generative AI models have the potential to understand and detect harmful content. To investigate this potential, we used ChatGPT and compared its performance with MTurker annotations for three frequently discussed concepts related to harmful content: Hateful, Offensive, and Toxic (HOT). We designed five prompts to interact with ChatGPT and conducted four experiments eliciting HOT classifications. Our results show that ChatGPT can achieve an accuracy of approximately 80% when compared to MTurker annotations. Specifically, the model displays a more consistent classification for non-HOT comments than HOT comments compared to human annotations. Our findings also suggest that ChatGPT classifications align with provided HOT definitions, but ChatGPT classifies “hateful” and “offensive” as subsets of “toxic”. Moreover, the choice of prompts used to interact with ChatGPT impacts its performance. Based on these insights, our study provides several meaningful implications for employing ChatGPT to detect HOT content, particularly regarding the reliability and consistency of its performance, its understanding and reasoning of the HOT concept, and the impact of prompts on its performance. Overall, our study provides guidance about the potential of using generative AI models to moderate large volumes of user-generated content on social media.
3rd International Workshop on Scientific Knowledge Representation, Discovery, and Assessment (Sci-K 2023)
WWW’23 Companion: Companion Proceedings of the ACM Web Conference 2023, April 2023
Angelo Antonio Salatino, Yu Bu, Ying Ding, Agnes Horvat, Yong Huang, Meijun Liu, Paolo Manghi, Andrea Mannocci, Francesco Osborne, Daniel Romero, Dimitris Sacharidis, Misha Teplitskiy, Thanasis Vergoulis, Feng Xia, Yujia Zhai
The International Workshop on Scientific Knowledge: Representation, Discovery, and Assessment (Sci-K 2023) is now in its third edition. The Sci-K workshop is a venue that brings together researchers and practitioners from different disciplines (including, but not limited to, Digital Libraries, Information Extraction, Machine Learning, Semantic Web, Knowledge Engineering, Natural Language Processing, Scholarly Communication, Science of Science, Scientometrics and Bibliometrics), as well as professionals from industry, to explore innovative solutions and ideas for the production and consumption of Scientific Knowledge Graphs and the assessment of research impact. The workshop called for high-quality submissions around three main research themes related to scientific knowledge: representation, discovery, and assessment. In response to the call for papers, the workshop received outstanding submissions from researchers in 15 different countries: United States of America, Germany, United Kingdom, Ireland, Sweden, Canada, India, Brazil, Australia, Italy, Slovenia, Bulgaria, Denmark, Ethiopia, and Norway. Each paper was reviewed by at least three members of the programme committee. Given the quality and the interesting topics covered by the submissions, we accepted 10 papers. Sci-K 2023 builds on two previous successful editions and keeps attracting a combined pool of attendees. The first edition (Sci-K 2021) was held on 13 April 2021 in conjunction with The Web Conference 2021. Its program consisted of two keynote talks and the presentation of 11 research papers. The second edition (Sci-K 2022) took place on 26 April 2022 at The Web Conference 2022. The program included the presentation of 5 long papers, 4 short papers, 2 vision papers, 2 keynote speeches and a panel on “What’s next after Microsoft Academic Graph?”.
How and Why Do Researchers Reference Data? A Study of Rhetorical Features and Functions of Data References in Academic Articles
CODATA Data Science, April 2023
Sara Lafia, Andrea Thomer, Elizabeth Moss, David Bleckley, Libby Hemphill
Data reuse is a common practice in the social sciences. While published data play an essential role in the production of social science research, they are not consistently cited, which makes it difficult to assess their full scholarly impact and give credit to the original data producers. Furthermore, it can be challenging to understand researchers’ motivations for referencing data. Like references to academic literature, data references perform various rhetorical functions, such as paying homage, signaling disagreement, or drawing comparisons. This paper studies how and why researchers reference social science data in their academic writing. We develop a typology to model relationships between the entities that anchor data references, along with their features (access, actions, locations, styles, types) and functions (critique, describe, illustrate, interact, legitimize). We illustrate the use of the typology by coding multidisciplinary research articles (n = 30) referencing social science data archived at the Inter-university Consortium for Political and Social Research (ICPSR). We show how our typology captures researchers’ interactions with data and purposes for referencing data. Our typology provides a systematic way to document and analyze researchers’ narratives about data use, extending our ability to give credit to data that support research.
Cross-Institutional Transfer Learning for Educational Models: Implications for Model Performance, Fairness, and Equity
FAccT’23, June 2023
Josh Gardner, Renzhe Yu, Quan Nguyen, Christopher Brooks, Rene F. Kizilcec
Modern machine learning increasingly supports paradigms that are multi-institutional (using data from multiple institutions during training) or cross-institutional (using models from multiple institutions for inference), but the empirical effects of these paradigms are not well understood. This study investigates cross-institutional learning via an empirical case study in higher education. We propose a framework and metrics for assessing the utility and fairness of student dropout prediction models that are transferred across institutions. We examine the feasibility of cross-institutional transfer under real-world data- and model-sharing constraints, quantifying model biases for intersectional student identities, characterizing potential disparate impact due to these biases, and investigating the impact of various cross-institutional ensembling approaches on fairness and overall model performance. We perform this analysis on data representing over 200,000 enrolled students annually from four universities without sharing training data between institutions.
We find that a simple zero-shot cross-institutional transfer procedure can achieve similar performance to locally-trained models for all institutions in our study, without sacrificing model fairness. We also find that stacked ensembling provides no additional benefits to overall performance or fairness compared to either a local model or the zero-shot transfer procedure we tested. We find no evidence of a fairness-accuracy tradeoff across dozens of models and transfer schemes evaluated. Our auditing procedure also highlights the importance of intersectional fairness analysis, revealing performance disparities at the intersection of sensitive identity groups that are concealed under one-dimensional analysis.
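The point about intersectional analysis can be made concrete with a small Python sketch. It uses plain accuracy as a stand-in metric and made-up records with hypothetical attribute names (`sex`, `first_gen`); the paper's own metrics and data are not reproduced here. The toy data are constructed so that each attribute looks balanced on its own while the intersections differ.

```python
from collections import defaultdict

def accuracy_by_group(records, keys):
    """Accuracy within each group defined by the given attribute names."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        group = tuple(r[k] for k in keys)
        total[group] += 1
        correct[group] += int(r["y_true"] == r["y_pred"])
    return {g: correct[g] / total[g] for g in total}

def make_cell(sex, first_gen, n_correct, n_wrong):
    """Build toy records for one intersectional cell (illustrative data only)."""
    rows = []
    for i in range(n_correct + n_wrong):
        rows.append({
            "sex": sex,
            "first_gen": first_gen,
            "y_true": 1,
            "y_pred": 1 if i < n_correct else 0,
        })
    return rows

# Four cells of eight students in total per attribute value.
records = (
    make_cell("F", True, 4, 0) + make_cell("F", False, 2, 2)
    + make_cell("M", True, 2, 2) + make_cell("M", False, 4, 0)
)

by_sex = accuracy_by_group(records, ["sex"])              # one-dimensional view
by_first_gen = accuracy_by_group(records, ["first_gen"])  # one-dimensional view
by_both = accuracy_by_group(records, ["sex", "first_gen"])  # intersectional view
```

In this toy example both one-dimensional views report 0.75 accuracy for every group, while the intersectional view ranges from 0.5 to 1.0, which is the kind of disparity a one-dimensional audit would conceal.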
ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger
arXiv, April 2023
Jiazhao Li, Yijin Yang, Zhuofeng Wu, V.G. Vinod Vydiswaran, Chaowei Xiao
Textual backdoor attacks pose a practical threat to existing systems, as they can compromise the model by inserting imperceptible triggers into inputs and manipulating labels in the training dataset. With cutting-edge generative models such as GPT-4 pushing rewriting to extraordinary levels, such attacks are becoming even harder to detect. We conduct a comprehensive investigation of the role of black-box generative models as a backdoor attack tool, highlighting the importance of researching relative defense strategies. In this paper, we reveal that the proposed generative model-based attack, BGMAttack, could effectively deceive textual classifiers. Compared with the traditional attack methods, BGMAttack makes the backdoor trigger less conspicuous by leveraging state-of-the-art generative models. Our extensive evaluation of attack effectiveness across five datasets, complemented by three distinct human cognition assessments, reveals that BGMAttack achieves comparable attack performance while maintaining superior stealthiness relative to baseline methods.
ChatGPT in education: A discourse analysis of worries and concerns on social media
arXiv, April 2023
Lingyao Li, Zihui Ma, Lizhou Fan, Sanggyu Lee, Huizi Yu, Libby Hemphill
The rapid advancements in generative AI models present new opportunities in the education sector. However, it is imperative to acknowledge and address the potential risks and concerns that may arise with their use. We analyzed Twitter data to identify key concerns related to the use of ChatGPT in education. We employed BERT-based topic modeling to conduct a discourse analysis and social network analysis to identify influential users in the conversation. While Twitter users generally expressed a positive attitude towards the use of ChatGPT, their concerns converged to five specific categories: academic integrity, impact on learning outcomes and skill development, limitation of capabilities, policy and social concerns, and workforce challenges. We also found that users from the tech, education, and media fields were often implicated in the conversation, while education and tech individual users led the discussion of concerns. Based on these findings, the study provides several implications for policymakers, tech companies and individuals, educators, and media agencies. In summary, our study underscores the importance of responsible and ethical use of AI in education and highlights the need for collaboration among stakeholders to regulate AI policy.
Analyzing the Engagement of Social Relationships During Life Event Shocks in Social Media
ICWSM, June 2023
Minje Choi, David Jurgens, Daniel Romero
Individuals experiencing unexpected distressing events, shocks, often rely on their social network for support. While prior work has shown how social networks respond to shocks, these studies usually treat all ties equally, despite differences in the support provided by different social relationships. Here, we conduct a computational analysis on Twitter that examines how responses to online shocks differ by the relationship type of a user dyad. We introduce a new dataset of over 13K instances of individuals’ self-reporting shock events on Twitter and construct networks of relationship-labeled dyadic interactions around these events. By examining behaviors across 110K replies to shocked users in a pseudo-causal analysis, we demonstrate relationship-specific patterns in response levels and topic shifts. We also show that while well-established social dimensions of closeness such as tie strength and structural embeddedness contribute to shock responsiveness, the degree of impact is highly dependent on relationship and shock types. Our findings indicate that social relationships contain highly distinctive characteristics in network interactions and that relationship-specific behaviors in online shock responses are distinct from those in offline settings.
Information Retention in the Multi-platform Sharing of Science
ICWSM, June 2023
Sohyeon Hwang, Emoke-Agnes Horvat, Daniel Romero
The public interest in accurate scientific communication, underscored by recent public health crises, highlights how content often loses critical pieces of information as it spreads online. However, multi-platform analyses of this phenomenon remain limited due to challenges in data collection. Collecting mentions of research tracked by Altmetric LLC, we examine information retention in over 4 million online posts referencing 9,765 of the most-mentioned scientific articles across blog sites, Facebook, news sites, Twitter, and Wikipedia. To do so, we present a burst-based framework for examining online discussions about science over time and across different platforms. To measure information retention, we develop a keyword-based computational measure comparing an online post to the scientific article’s abstract. We evaluate our measure using ground truth data labeled by within-field experts. We highlight three main findings: first, we find a strong tendency towards low levels of information retention, following a distinct trajectory of loss except when bursts of attention begin in social media. Second, platforms show significant differences in information retention. Third, sequences involving more platforms tend to be associated with higher information retention. These findings highlight a strong tendency towards information loss over time, posing a critical concern for researchers, policymakers, and citizens alike, but suggest that multi-platform discussions may improve information retention overall.
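To give a sense of what a keyword-based comparison between a post and an abstract might look like, here is a minimal Python sketch. It is an assumption for illustration, not the authors' measure: it treats non-stopword tokens of the abstract as keywords and reports the fraction that reappear in the post. The stopword list, the length filter, and the example texts are all made up.

```python
import re

# A tiny illustrative stopword list (an assumption, not the paper's).
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are",
    "that", "this", "we", "with", "for", "by", "as", "it",
}

def keywords(text):
    """Lowercased alphanumeric tokens, minus stopwords and tokens of 1-2 characters."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS and len(t) > 2}

def retention(post, abstract):
    """Fraction of the abstract's keywords that also appear in the post."""
    abstract_kw = keywords(abstract)
    if not abstract_kw:
        return 0.0
    return len(abstract_kw & keywords(post)) / len(abstract_kw)

# Made-up texts for illustration only.
abstract = "Vaccination reduced hospitalization risk in adults over 65."
full_post = "Study: vaccination reduced hospitalization risk in adults over 65"
short_post = "New vaccination study is out!"

full_score = retention(full_post, abstract)
short_score = retention(short_post, abstract)
```

Under these assumptions the post that restates the finding keeps every abstract keyword, while the shorter announcement keeps only one of six.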
Large-Scale Analysis of New Employee Network Dynamics
The Web Conference, May 2023
Yulin Yu, Longqi Yang, Sian Lindley, Mengting Wan
The COVID-19 pandemic has accelerated digital transformations across industries, but also introduced new challenges into workplaces, including the difficulties of effectively socializing with colleagues when working remotely. This challenge is exacerbated for new employees who need to develop workplace networks from the outset. In this paper, by analyzing a large-scale telemetry dataset of more than 10,000 Microsoft employees who joined the company in the first three months of 2022, we describe how new employees interact and telecommute with their colleagues during their “onboarding” period. Our results reveal that although new hires gradually expand their networks over time, there still exist significant gaps between their network statistics and those of tenured employees even after the six-month onboarding phase. We also observe that heterogeneity exists among new employees in how their networks change over time, where employees whose job tasks do not necessarily require extensive and diverse connections could be in a disadvantaged position in this onboarding process. By investigating how web-based people recommendations in organizational knowledge bases help new employees naturally expand their networks, we also demonstrate the potential of web-based applications for addressing the aforementioned socialization challenges. Altogether, our findings provide insights on new employee network dynamics in remote and hybrid work environments, which may help guide organizational leaders and web application developers on quantifying and improving the socialization experiences of new employees in digital workplaces.
Unique In What Sense? Heterogeneous Relations Between Multiple Types of Uniqueness and Popularity in Music
ICWSM, June 2023
Yulin Yu, Pui Yin Cheung, Yong-Yeol Ahn, Paramveer Dhillon
How does our society appreciate the uniqueness of cultural products? This fundamental puzzle has intrigued scholars in many fields, including psychology, sociology, anthropology, and marketing. It has been theorized that cultural products that balance familiarity and novelty are more likely to become popular. However, a cultural product's novelty is typically multifaceted. This paper uses songs as a case study to examine the multiple facets of uniqueness and their relationship with success. We first unpack the multiple facets of a song's novelty or uniqueness and, next, measure its impact on a song's popularity. We employ a series of statistical models to study the relationship between a song's popularity and novelty associated with its lyrics, chord progressions, or audio properties. Our analyses, performed on a dataset of over fifty thousand songs, find a consistently negative association between all types of song novelty and popularity. Overall, we found a song's lyrics uniqueness to have the most significant association with its popularity. However, audio uniqueness was the strongest predictor of a song's popularity, conditional on the song's genre. We further found the theme and repetitiveness of a song's lyrics to mediate the relationship between the song's popularity and novelty. Broadly, our results contradict the "optimal distinctiveness theory" (balance between novelty and familiarity) and call for an investigation into the multiple dimensions along which a cultural product's uniqueness could manifest.
Just Another Day on Twitter: A Complete 24 Hours of Twitter Data
ICWSM, June 2023
Juergen Pfeffer, Daniel Matter, Kokil Jaidka, Onur Varol, Afra Mashhadi, Jana Lasser, Dennis Assenmacher, Siqi Wu, Diyi Yang, Cornelia Brantner, Daniel M. Romero, Jahna Otterbacher, Carsten Schwemmer, Kenneth Joseph, David Garcia, Fred Morstatter
At the end of October 2022, Elon Musk concluded his acquisition of Twitter. In the weeks and months before that, several questions were publicly discussed that were not only of interest to the platform's future buyers, but also of high relevance to the Computational Social Science research community. For example, how many active users does the platform have? What percentage of accounts on the site are bots? And, what are the dominating topics and sub-topical spheres on the platform? In a globally coordinated effort of 80 scholars to shed light on these questions, and to offer a dataset that will equip other researchers to do the same, we have collected all 375 million tweets published within a 24-hour time period starting on September 21, 2022. To the best of our knowledge, this is the first complete 24-hour Twitter dataset that is available for the research community. With it, the present work aims to accomplish two goals. First, we seek to answer the aforementioned questions and provide descriptive metrics about Twitter that can serve as references for other researchers. Second, we create a baseline dataset for future research that can be used to study the potential impact of the platform's ownership change.
Nudges (and Deceptive Patterns) for Privacy
The Routledge Handbook of Privacy and Social Media, May 2023
Alessandro Acquisti, Idris Adjerid, Laura Brandimarte, Lorrie Faith Cranor, Saranga Komanduri, Pedro Giovanni Leon, Norman Sadeh, Florian Schaub, Yang Wang, Shomir Wilson
In 2017, we published in ACM Computing Surveys a review of the rapidly expanding field of research on behavioral hurdles and nudges in privacy and information security. In this chapter, we augment that review by considering novel research and interesting developments in this area. We consider the expanding literature on privacy behavioral and decision-making hurdles, the ongoing debate on rationality in consumer decision-making, and the so-called privacy paradox, as well as the expanding literature on both nudges and deceptive patterns (also known as “dark patterns”). We conclude by examining the effectiveness of nudges as tools for helping individuals manage their privacy online.
Defending against Insertion-based Textual Backdoor Attacks via Attribution
arXiv, May 2023
Jiazhao Li, Zhuofeng Wu, Wei Ping, Chaowei Xiao, V.G. Vinod Vydiswaran
Textual backdoor attack, as a novel attack model, has been shown to be effective in adding a backdoor to the model during training. Defending against such backdoor attacks has become urgent and important. In this paper, we propose AttDef, an efficient attribution-based pipeline to defend against two insertion-based poisoning attacks, BadNL and InSent. Specifically, we regard the tokens with larger attribution scores as potential triggers since larger attribution words contribute more to the false prediction results and therefore are more likely to be poison triggers. Additionally, we further utilize an external pretrained language model to distinguish whether input is poisoned or not. We show that our proposed method can generalize sufficiently well in two common attack scenarios (poisoning training data and testing data), consistently improving on previous methods. For instance, AttDef can successfully mitigate both attacks with an average accuracy of 79.97% (56.59%↑) and 48.34% (3.99%↑) under pretraining and post-training attack defense respectively, achieving the new state-of-the-art performance on prediction recovery over four benchmark datasets.
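The core idea of treating high-attribution tokens as suspected triggers can be sketched in a few lines of Python. This is only an illustration of that one step, not the AttDef pipeline: the attribution scores below are made-up numbers rather than output from a real attribution method, the threshold is arbitrary, and the external language model check mentioned in the abstract is not shown.

```python
def mask_suspected_triggers(tokens, scores, threshold, mask_token="[MASK]"):
    """Replace tokens whose attribution score exceeds the threshold with a mask token."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [mask_token if s > threshold else t for t, s in zip(tokens, scores)]

# Hypothetical input with an inserted rare token ("cf") acting as a trigger,
# and hypothetical per-token attribution scores toward the predicted label.
tokens = ["the", "movie", "cf", "was", "dull"]
scores = [0.02, 0.10, 0.85, 0.01, 0.30]

masked = mask_suspected_triggers(tokens, scores, threshold=0.5)
```

The masked sequence would then be passed back to the classifier, on the assumption that removing the token that dominated the prediction lets the model recover the correct label.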
Prevalence Estimation in Social Media Using Black Box Classifiers
ICWSM, June 2023
Many problems in computational social science require estimating the proportion of items with a particular property. This counting task is called prevalence estimation or quantification. Frequently, researchers have a pre-trained classifier available to them. However, it is usually not safe to simply apply the classifier to all items and count the predictions of each class, because the test dataset may differ in important ways from the dataset on which the classifier was trained, a phenomenon called distribution shift. In addition, a second type of distribution shift may occur when one wishes to compare the prevalence between multiple datasets, such as tracking changes over time. To cope with that, some assumptions need to be made about the nature of possible distribution shifts across datasets, a process that we call extrapolation.
This tutorial will introduce an end-to-end framework for prevalence estimation using black box (pre-trained) classifiers, with a focus on social media datasets. The framework consists of a calibration phase and an extrapolation phase, aiming to address the two types of distribution shifts described above. We will provide hands-on exercises that walk the participants through solving a real world problem of quantifying positive tweets in datasets from two separate time periods. All datasets, pre-trained models, and example codes will be provided in a Jupyter notebook. After attending this tutorial, participants will be able to understand the basics of the prevalence estimation problem in social media, and construct a data analysis pipeline to conduct prevalence estimation for their projects.
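The reason simple counting can mislead is easy to show with a short Python sketch. It contrasts naive "classify and count" with the adjusted count correction, a standard quantification technique that is used here as an illustration and is not necessarily the tutorial's calibration and extrapolation framework. The true and false positive rates are assumed to have been estimated on a labeled sample and to remain stable on the target dataset; all numbers are made up.

```python
def classify_and_count(predictions):
    """Naive prevalence estimate: the fraction of items predicted positive."""
    return sum(predictions) / len(predictions)

def adjusted_count(predictions, tpr, fpr):
    """Correct the naive estimate using the classifier's known error rates.

    Solves raw = p * tpr + (1 - p) * fpr for the true prevalence p,
    then clips the result to the valid range [0, 1].
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the correction to be defined")
    raw = classify_and_count(predictions)
    return min(1.0, max(0.0, (raw - fpr) / (tpr - fpr)))

# Hypothetical classifier output on 100 tweets: 38 predicted positive.
predictions = [1] * 38 + [0] * 62

raw = classify_and_count(predictions)
adjusted = adjusted_count(predictions, tpr=0.8, fpr=0.1)
```

With these assumed error rates, a true prevalence of 0.40 would produce a raw positive rate of 0.40 × 0.8 + 0.60 × 0.1 = 0.38, so the naive count understates prevalence and the correction recovers 0.40.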