University of Michigan School of Information
Media Storms | TikTok | Algorithms: UMSI Research Roundup
Monday, 01/29/2024
University of Michigan School of Information faculty and PhD students are creating and sharing knowledge that helps build a better world. Here are some of their recent publications.
Publications
News for (Me and) You: Exploring the Reporting Practices of Citizen Journalists on TikTok
Journalism Studies, December 2023
The social media platform TikTok is increasingly becoming an important space for sharing and finding news and information, especially for younger people. Most scholarly research examining news sharing on TikTok has focused on practices by professional journalists and news organizations; however, these are only a small percentage of the actors who make up the news information sharing ecosystem on the platform. In particular, citizen journalists play a large role in creating and disseminating news on TikTok. Thus, I interviewed 14 TikTok citizen journalists to understand their news reporting practices. Findings suggest TikTok citizen journalists are guided by the platform logics and concerns around misinformation in the content they post on the platform. This study contributes to the literature on the role of citizen journalists and social media in shaping news disseminated online.
A Closer Look at Civic Honesty in Collectivist Cultures
Proceedings of the National Academy of Sciences of the United States of America, November 2023
David Tannenbaum, Michel Andre Marechal, Alain Cohn
Yang and colleagues (hereafter YAC) conducted a replication and extension of our lost wallet study in China (1, 2). They argue that in collectivist cultures, civic honesty manifests as holding onto a wallet for safekeeping, without contacting the owner (“safekeeping”). By contrast, in more individualistic cultures, civic honesty manifests as actively contacting the owner to return a wallet (“emailing”). Thus, using email contact rates may distort civic honesty measurement in collectivist countries.
We agree with YAC that, especially for cross-cultural research, use of a single outcome measure may limit generalizability and examining additional measures is of value (3). However, upon closer examination, many of YAC’s findings are spurious and other conclusions are contradicted by their data.
A key finding in YAC is that city-level collectivism predicts safekeeping but not emailing, which would suggest civic honesty expresses itself differently across cultures. This result, however, is entirely due to an error in their regression specifications; once corrected, the relationship between collectivism and safekeeping disappears. YAC’s regressions include both city fixed effects (i.e., where the study was performed) and city-level rates of collectivism (i.e., degree of collectivism in a city). Including both variables leads to “double dipping” on the same information and perfect multicollinearity. Table 1 illustrates the problem: arbitrary changes to the model—which should have zero effect on the collectivism coefficient—alter the coefficient from significantly positive to significantly negative to no longer estimable. Correcting the problem requires removing city fixed effects from the model; when this is done, the relationship between collectivism and wallet safekeeping is no longer statistically significant. To be thorough, we tried 4,400 other model combinations and not once was collectivism a significant predictor.
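The fixed-effects problem described above can be illustrated with a small synthetic example (hypothetical toy data, not the study's): a city-level variable is constant within each city, so city dummy variables already span it, and including both makes the design matrix rank-deficient.

```python
import numpy as np

# Hypothetical toy data: 3 cities, 4 wallet drops per city.
# "Collectivism" is measured at the city level, so it is constant
# within each city.
cities = np.repeat([0, 1, 2], 4)
collectivism = np.array([0.2, 0.5, 0.9])[cities]

# Design matrix: intercept + city fixed effects + city-level collectivism.
dummies = np.eye(3)[cities]                      # one dummy column per city
X = np.column_stack([np.ones(12), dummies, collectivism])

# The city dummies already span any city-level regressor, so the
# collectivism column adds no new information: perfect multicollinearity.
print(X.shape[1])                  # 5 columns...
print(np.linalg.matrix_rank(X))    # ...but only rank 3
```

Dropping the fixed-effect dummy columns restores full column rank, which corresponds to the correction the authors describe.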
Embracing Virtual Reality Technology with Black Adolescents to Redress Police Encounters
Journal of Youth Development, November 2023
Danielle M. Olson, Tyler Musgrave, Divya Gumudavelly, Chardee Galan, Sarita Schoenebeck
Police brutality—including the incidents that mobilized collective outrage and action across the world during the summer of 2020—has negatively impacted the psychological health of Black youth for generations. Police harassment is a persistent form of racial discrimination that Black youth frequently navigate (Brunson, 2007), and particularly as a consequence of vicarious trauma via social media, it has been associated with depressive and anxious sequelae (e.g., Tynes et al., 2019). While Black youth are faced with policing experiences in vivo and in vitro, many studies of the psychological impact of policing are contained within traditional retrospective surveys, which limits our understanding of youth’s in-the-moment perceptions and desired concurrent actions. To expand the efforts to assess and redress youth’s experiences with police encounters, this manuscript details the development of an afterschool program that supports adolescents in the creation of a series of video game and virtual reality (VR) narratives. A participatory design method was utilized to co-create the perception of policing experiences with Detroit high school students enrolled in a computer science course, allowing them to actualize their experiences “on screen” and work towards redressing these experiences through co-construction and virtual activism.
Predictive Dispatch of Volunteer First Responders: Algorithm Development and Validation
JMIR mHealth and uHealth, November 2023
Michael Khalemsky, Anna Khalemsky, Stephen Lankenau, Janna Ataiants, Alexis Roth, Gabriela Marcu, David G Schwartz
Background: Smartphone-based emergency response apps are increasingly being used to identify and dispatch volunteer first responders (VFRs) to medical emergencies to provide faster first aid, which is associated with better prognoses. Volunteers’ availability and willingness to respond are uncertain, leading in recent studies to response rates of 17% to 47%. Dispatch algorithms that select volunteers based on their estimated time of arrival (ETA) without considering the likelihood of response may be suboptimal due to a large percentage of alerts wasted on VFRs with shorter ETA but a low likelihood of response, resulting in delays until a volunteer who will actually respond can be dispatched.
Objective: This study aims to improve the decision-making process of human emergency medical services dispatchers and autonomous dispatch algorithms by presenting a novel approach for predicting whether a VFR will respond to or ignore a given alert.
Methods: We developed and compared 4 analytical models to predict VFRs’ response behaviors based on emergency event characteristics, volunteers’ demographic data and previous experience, and condition-specific parameters. We tested these 4 models using 4 different algorithms applied to actual demographic and response data from a 12-month study of 112 VFRs who received 993 alerts to respond to 188 opioid overdose emergencies. Model 4 used an additional dynamically updated synthetic dichotomous variable, frequent responder, which reflects the responder’s previous behavior.
Results: The highest accuracy (260/329, 79.1%) of prediction that a VFR will ignore an alert was achieved by 2 models that used events data, VFRs’ demographic data, and their previous response experience, with slightly better overall accuracy (248/329, 75.4%) for model 4, which used the frequent responder indicator. Another model that used events data and VFRs’ previous experience but did not use demographic data provided a high-accuracy prediction (277/329, 84.2%) of ignored alerts but a low-accuracy prediction (153/329, 46.5%) of responded alerts. The accuracy of the model that used events data only was unacceptably low. The J48 decision tree algorithm provided the best accuracy.
Conclusions: VFR dispatch has evolved in recent decades, thanks to technological advances and a better understanding of VFR management. The dispatch of substitute responders is a common approach in VFR systems. Predicting the response behavior of candidate responders in advance of dispatch can allow any VFR system to choose the best possible response candidates based not only on ETA but also on the probability of actual response. The integration of the probability to respond into the dispatch algorithm constitutes a new generation of individual dispatch, making this one of the first studies to harness the power of predictive analytics for VFR dispatch. Our findings can help VFR network administrators in their continual efforts to improve the response times of their networks and to save lives.
When it Rains, it Pours: Modeling Media Storms and the News Ecosystem
Findings of the Association for Computational Linguistics: EMNLP, December 2023
Benjamin Litterer, David Jurgens, Dallas Card
Most events in the world receive at most brief coverage by the news media. Occasionally, however, an event will trigger a media storm, with voluminous and widespread coverage lasting for weeks instead of days. In this work, we develop and apply a pairwise article similarity model, allowing us to identify story clusters in corpora covering local and national online news, and thereby create a comprehensive corpus of media storms over a nearly two-year period. Using this corpus, we investigate media storms at a new level of granularity, allowing us to validate claims about storm evolution and topical distribution, and provide empirical support for previously hypothesized patterns of influence of storms on media coverage and intermedia agenda setting.
Interim Report for Ubuntu-AI: A Bottom-up Approach to More Democratic and Equitable Training and Outcomes for Machine Learning
Democratic Inputs for AI, September 2023
Michael Nayebare, Ron Eglash, Ussen Kimanuka
Artificial Intelligence (AI) can be a threat to creative arts and design, taking data and images without permission or compensation. But with AI becoming a global portal for human knowledge access, anyone resisting inclusion in its data inputs will become invisible to its outputs. This is the AI double bind, in which the threat of exclusion forces us to give up any claims of ownership to our creative endeavors. To address such problems, this project develops an experimental platform designed to return value to those who create it, using a case study on African arts and design. If successful, it will allow African creatives to work with AI instead of against it, creating new opportunities for funding, gaining wider dissemination of their work, and creating a database for machine learning that results in more inclusive knowledge of African arts and design for AI outputs.
Understanding voice-based information uncertainty: A case study of health information seeking with voice assistants
Journal of the Association for Information Science and Technology, December 2023
Evaluating information quality online is increasingly important for healthy decision-making. People assess information quality using visual interfaces (e.g., computers, smartphones) with visual cues like aesthetics. Yet, voice interfaces lack critical visual cues for evaluating information because there is often no visual display. Without ways to assess voice-based information quality, people may overly trust or misinterpret information, which can be challenging in high-risk or sensitive contexts. This paper investigates voice information uncertainty in one high-risk context—health information seeking. We recruited 30 adults (ages 18–84) in the United States to participate in scenario-based interviews about health topics. Our findings provide evidence of information uncertainty expectations with voice assistants, voice search preferences, and the audio cues they use to assess information quality. We contribute a nuanced discussion of how to inform more critical information ecosystems with voice technologies and propose ways to design audio cues to help people more quickly assess content quality.
Shockvertising, Malware, and a Lack of Accountability: Exploring Consumer Risks of Virtual Reality Advertisements and Marketing Experiences
IEEE Security and Privacy, December 2023
Abraham Mhaidli, Shwetha Rajaram, Selin Fidan, Gina Herakovic, Florian Schaub
Companies increasingly use virtual reality (VR) for advertising. This raises the question: what risks does VR advertising pose for consumers? We analyze VR marketing experiences to identify risks and discuss opportunities to address those and future risks in VR advertising.
Examining Voice Community Use
ACM Transactions on Computer-Human Interaction, October 2023
Robin Brewer, Sam Ankenbauer, Manahil Hashmi, Pooja Upadhyay
Visual online communities can present accessibility challenges to older adults or people with vision and motor disabilities. Motivated by this challenge, accessibility and HCI researchers have called for voice-based communities to support aging and disability. This paper extends prior work on voice community design and short-term use by providing empirical data on how people interact with voice communities over time and intentional instances of non-use. We conducted a one-year study with 43 blind and low vision older adults, of whom 21 used a voice-based community. We use vignettes to unpack five different voice community member roles - the obligatory poster, routine poster, cross-platform lurker, busy socialite, and visual expertise seeker - and discuss community interactions over time. Findings show how participation varied based on engagement in other communities and ways that participants sought interaction. We discuss (1) how to design voice communities for member roles and (2) the implications of synchronous and asynchronous voice community interaction in voice-only communities.
PM2.5 forecasting under distribution shift: A graph learning approach
AI Open, November 2023
Yachuan Liu, Jiaqi Ma, Paramveer Dhillon, Qiaozhu Mei
We present a new benchmark task for graph-based machine learning, aiming to predict future air quality (PM2.5 concentration) observed by a geographically distributed network of environmental sensors. While prior work has successfully applied Graph Neural Networks (GNNs) on a wide family of spatio-temporal prediction tasks, the new benchmark task introduced here brings a technical challenge that has been less studied in the context of graph-based spatio-temporal learning: distribution shift across a long period of time. An important goal of this paper is to understand the behavior of spatio-temporal GNNs under distribution shift. We conduct a comprehensive comparative study of both graph-based and non-graph-based machine learning models under two data split methods, one that induces distribution shift and one that does not. Our empirical results suggest that GNN models tend to suffer more from distribution shift compared to non-graph-based models, which calls for special attention when deploying spatio-temporal GNNs in practice.
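The two split methods contrasted in the abstract can be sketched as follows (a minimal illustration on a synthetic drifting series, not the benchmark's actual data): a random split mixes all time periods into both sets, while a temporal split holds out the most recent period, exposing the model to distribution shift.

```python
import numpy as np

def random_split(n, frac=0.8, seed=0):
    """Random split: train and test are drawn from the same time
    periods, so their distributions match."""
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(frac * n)
    return idx[:cut], idx[cut:]

def temporal_split(n, frac=0.8):
    """Temporal split: the test set follows the training set in time,
    so any drift in the series appears as distribution shift."""
    cut = int(frac * n)
    return np.arange(cut), np.arange(cut, n)

# Synthetic PM2.5-like series with an upward drift over time.
series = np.linspace(10.0, 60.0, 1000)

tr, te = random_split(len(series))
print(abs(series[tr].mean() - series[te].mean()))   # small: no shift

tr, te = temporal_split(len(series))
print(abs(series[tr].mean() - series[te].mean()))   # large: shift
```

Under the temporal split the train and test means diverge sharply, which is the evaluation condition the paper argues deserves special attention for spatio-temporal GNNs.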
From illness management to quality of life: rethinking consumer health informatics opportunities for progressive, potentially fatal illnesses
Journal of the American Medical Informatics Association, December 2023
Marcy G Antonio, Tiffany Veinot
Objectives: Investigate how people with chronic obstructive pulmonary disease (COPD)—an example of a progressive, potentially fatal illness—are using digital technologies (DTs) to address illness experiences, outcomes and social connectedness.
Materials and Methods: A transformative mixed methods study was conducted in Canada with people with COPD (n = 77) or with a progressive lung condition (n = 6). Stage-1 interviews (n = 7) informed the stage-2 survey. Survey responses (n = 80) facilitated the identification of participants for stage-3 interviews (n = 13). The interviews were thematically analyzed. Descriptive statistics were calculated for the survey. The integrative mixed method analysis involved mixing between and across the stages.
Results: Most COPD participants (87.0%) used DTs. However, few participants frequently used DTs to self-manage COPD. People used DTs to seek online information about COPD symptoms and treatments, but lacked tailored information about illness progression. Few expressed interest in using DTs for self-monitoring and tracking. The regular use of DTs for intergenerational connections may facilitate leaving a legacy and passing on traditions and memories. Use of DTs for leisure activities provided opportunities for connecting socially and for respite, reminiscing, distraction and spontaneity.
Discussion and Conclusion: We advocate reconceptualizing consumer health technologies to prioritize quality of life for people with a progressive, potentially fatal illness. “Quality of life informatics” should focus on reducing stigma regarding illness and disability and taboo towards death, improving access to palliative care resources and encouraging experiences to support social, emotional and mental health. For DTs to support people with fatal, progressive illnesses, we must expand informatics strategies to quality of life.
What is Community?: Informing the Design of a Community Building Platform for Low-Income Black and Latino Residents
Proceedings of the 57th Hawaii International Conference on System Sciences, January 2024
Cynthia McLeod, Amy Gonzales, Jesse King, Julie Hui, Aarti Israni
Online communities can offer under-resourced populations an avenue for upward social mobility by capitalizing on community connections and the pooling of resources. UpTogether, a non-profit organization, attempted to access this potential by providing its members with a novel social media platform to interact with like-minded others. Yet, despite members' interest in building greater connections within the community, few people utilized the platform to engage with their groups. By examining 25 participant interviews, we explore participants’ conceptualizations of community and their experience on the platform. With this, we identify their expectations of community and pose recommendations for future initiatives aimed at building community–online and offline.
Anti-bias and Pro-transformation: how to merge critique and transformative visions for Artificial Intelligence
Online Proceedings of the Coding, Computational Modeling, and Equity in Mathematics Education Symposium, April 2023
As our keynote speaker Gideon Christian points out, the dangers of bias in AI and other data-intensive information sciences have been well documented (Angwin et al., 2022). They include risk prediction equations used by criminal justice officials to inform their decisions about bail, sentencing and early release; bank loans, medical decisions, and many other aspects of our lives. But an exclusive focus on “bias” is not enough; we need to be anti-bias and, simultaneously, to create transformative change.
What is the difference? If we focus exclusively on eliminating bias, we imply that if only the bias would vanish, we would have a just and equitable system. But that is not at all the case. For example, our current banking algorithms have resulted in higher loan rates for Black home buyers, because of bias in the ways they calculate risk. But that bias does not address the problem that homes and loans are extremely expensive to begin with. For the working class, even in the absence of bias, the dangers of defaulting on loans are significant. They have been a widespread destructive force in working class communities, no matter what color. A system designed to make the rich even richer, at the expense of the working poor, does not need bias to enact forms of oppression. An exclusive focus on eliminating bias can thus become a distraction from the more important project of transformation.
Autonomy Acceptance Model (AAM): The Role of Autonomy and Risk in Security Robot Acceptance
HRI ‘24, March 2024
Xin Ye, Wonse Jo, Arsha Ali, Samia Cornelius Bhatti, Connor Esterwood, Hana Andargie Kassie, Lionel Peter Robert
The rapid deployment of security robots across our society calls for further examination of their acceptance. This study explored human acceptance of security robots by theoretically extending the technology acceptance model to include the impact of autonomy and risk. To accomplish this, an online experiment involving 236 participants was conducted. Participants were randomly assigned to watch a video introducing a security robot operating at an autonomy level of low, moderate, or high, and presenting either a low or high risk to humans. This resulted in a 3 (autonomy) × 2 (risk) between-subjects design. The findings suggest that increased perceived usefulness, perceived ease of use, and trust enhance acceptance, while higher robot autonomy tends to decrease acceptance. Additionally, the physical risk associated with security robots moderates the relationship between autonomy and acceptance. Based on these results, this paper offers recommendations for future research on security robots.
Gender-Affirming Surgeons’ Attitudes toward Social Media Communication with Patients
The Bulletin of Applied Transgender Studies, December 2023
Jules L. Madzia, Tee Chuanromanee, Gaines Blasdel, Aloe DeGuia, Mary Byrnes, Nabeel A. Shakir, Megan Lane, Oliver L. Haimson
Online spaces are increasingly important for transgender people who are considering gender-affirming surgeries to find information, ask questions, and communicate with each other. While many surgical resources are community-generated, the onus of providing medical information about surgery should be on the surgical team. We sought to understand the potential for an online space for surgeon and community engagement. We assessed gender-affirming surgeon perspectives on online communication and communities by conducting a survey (N = 55) to understand current social media use and gauge surgeons’ opinions related to participating in online spaces. We found that gender-affirming surgeons were not generally in support of a new online platform for patient-surgeon communication, with 67% responding that a new platform was not needed. Participants identified potential negative implications including risks to patients (e.g., misinformation, liability, and platform use in emergency situations) and risks to surgeons (e.g., the additional burden that the platform would place on their already-limited time, changes to surgeon culture, and safety concerns related to online harassment). Potential positive implications include opportunities to improve patient education and enhance patient care. Our results establish empirical understanding of social media use patterns among gender-affirming surgeons and may inform the design of resources to enable trans patients to receive the information and care that they require when considering and undergoing gender-affirming surgery.
Automated-detection of risky alcohol use prior to surgery using natural language processing
Alcohol, Clinical and Experimental Research, January 2024
VG Vinod Vydiswaran, Asher Strayhorn, Katherine Weber, Haley Stevens, Jessica Mellinger, G Scott Winder, Anne C Fernandez
Background: Preoperative risky alcohol use is one of the most common surgical risk factors. Accurate and early identification of risky alcohol use could enhance surgical safety. Artificial Intelligence-based approaches, such as natural language processing (NLP), provide an innovative method to identify alcohol-related risks from patients' electronic health records (EHR) before surgery.
Methods: Clinical notes (n = 53,629) from pre-operative patients in a tertiary care facility were analyzed for evidence of risky alcohol use and alcohol use disorder. One hundred of these records were reviewed by experts and labeled for comparison. A rule-based NLP model was built, and we assessed the clinical notes for the entire population. Additionally, we assessed each record for the presence or absence of alcohol-related International Classification of Diseases (ICD) diagnosis codes as an additional comparator.
Results: NLP correctly identified 87% of the human-labeled patients classified with risky alcohol use. In contrast, diagnosis codes alone correctly identified only 29% of these patients. In terms of specificity, NLP correctly identified 84% of the non-risky cohort, while diagnosis codes correctly identified 90% of this cohort. In the analysis of the full dataset, the NLP-based approach identified three times more patients with risky alcohol use than ICD codes.
Conclusions: NLP, an artificial intelligence-based approach, efficiently and accurately identifies alcohol-related risk in patients' EHRs. This approach could supplement other alcohol screening tools to identify patients in need of intervention, treatment, and/or postoperative withdrawal prophylaxis. Alcohol-related ICD diagnosis had limited utility relative to NLP, which extracts richer information within clinical notes to classify patients.
Computational reparations as generative justice: Decolonial transitions to unalienated circular value flow
Big Data and Society, January 2024
Ron Eglash, Kwame P Robinson, Audrey Bennett, Lionel Robert, Mathew Garvin
The Latin roots of the word reparations are “re” (again) plus “parere” which means “to give birth to, bring into being, produce”. Together they mean “to make generative once again”. In this sense, the extraction processes that cause labor injustice, ecological devastation, and social degradation cannot be repaired by simply transferring money. Reparations need to take on the full sense of “restorative”: the transition to a decolonial system that can support value generators in the control of their own systems of production, protect the value they create from extraction, and circulate value in unalienated forms that benefit the human and non-human communities that produced that value. With funding from the National Science Foundation, we have developed a research framework for this process that starts with “artisanal labor”: employee-owned business and worker collectives that have people doing what they love, despite low incomes. Focusing primarily on Detroit’s Black-owned urban farms, artisanal textile businesses, Black hair salons, worker collectives, and other community-based production, with additional connections to Indigenous and other communities, we have introduced digital fabrication technologies, sensors, artificial intelligence, server-side apps and other computational support for a transition to unalienated circular value flow. We report on our investigations of the challenges at multiple scales. At each level, we show how computational supports can act as restorative mechanisms for lost circular value flows, and thus address both past and ongoing disenfranchisement.
Pre-prints, Working Papers, and Reports
Calibrate-Extrapolate: Rethinking Prevalence Estimation with Black Box Classifiers
arXiv, January 2024
In computational social science, researchers often use a pretrained, black box classifier to estimate the frequency of each class in unlabeled datasets. A variety of prevalence estimation techniques have been developed in the literature, each yielding an unbiased estimate if a certain stability assumption holds. This work introduces a framework to rethink the prevalence estimation process as calibrating the classifier outputs against ground truth labels to obtain the joint distribution of a base dataset and then extrapolating to the joint distribution of a target dataset. We call this framework “Calibrate-Extrapolate”. Visualizing the joint distribution makes the stability assumption needed for a prevalence estimation technique clear and easy to understand. In the calibration phase, the techniques assume only a stable calibration curve between a calibration dataset and the full base dataset. This allows for the classifier outputs to be used for purposive sampling, thus improving the efficiency of calibration. In the extrapolation phase, some techniques assume a stable calibration curve while some assume stable class-conditional densities. We discuss the stability assumptions from a causal perspective. By specifying base and target joint distributions, we can generate simulated datasets, as a way to build intuitions about the impacts of assumption violations. This also leads to a better understanding of how the classifier's predictive power affects the accuracy of prevalence estimates: the greater the predictive power, the lower the sensitivity to violations of stability assumptions in the extrapolation phase. We illustrate the framework with an application that estimates the prevalence of toxic news comments over time on Reddit, Twitter, and YouTube, using Jigsaw’s Perspective API as a black box classifier.
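The calibrate-then-extrapolate idea can be sketched minimally as follows (hypothetical function names and synthetic scores; this is not the paper's actual estimator or API): fit a calibration curve P(y = 1 | score) on a labeled base sample, then, assuming that curve is stable, average it over the target's classifier scores to estimate target prevalence.

```python
import numpy as np

def calibration_curve(scores, labels, bin_edges):
    """Estimate P(y = 1 | score bin) from a labeled calibration sample."""
    idx = np.digitize(scores, bin_edges)
    return np.array([labels[idx == b].mean() if (idx == b).any() else 0.0
                     for b in range(len(bin_edges) + 1)])

def extrapolate_prevalence(target_scores, curve, bin_edges):
    """Assuming a stable calibration curve, average P(y = 1 | score)
    over the target scores to estimate the target prevalence."""
    return curve[np.digitize(target_scores, bin_edges)].mean()

# Synthetic black-box scores: positives tend to score high, negatives low.
rng = np.random.default_rng(1)

def draw(n, prevalence):
    y = rng.random(n) < prevalence
    s = np.where(y, rng.beta(5, 2, n), rng.beta(2, 5, n))
    return s, y

base_scores, base_labels = draw(2000, 0.3)
target_scores, _ = draw(2000, 0.3)   # same joint distribution as base

edges = np.linspace(0.1, 0.9, 9)
curve = calibration_curve(base_scores, base_labels, edges)
print(extrapolate_prevalence(target_scores, curve, edges))  # near 0.3
```

When the target's prevalence shifts, the calibration curve itself changes, which is why, as the abstract notes, other techniques instead assume stable class-conditional score densities.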
Evaluating the Impact of Personalized Value Alignment in Human-Robot Interaction: Insights into Trust and Team Performance Outcomes
arXiv, November 2023
Shreyas Bhat, Joseph B. Lyons, Cong Shi, X. Jessie Yang
This paper examines the effect of real-time, personalized alignment of a robot’s reward function to the human’s values on trust and team performance. We present and compare three distinct robot interaction strategies: a non-learner strategy where the robot presumes the human’s reward function mirrors its own; a nonadaptive-learner strategy in which the robot learns the human’s reward function for trust estimation and human behavior modeling, but still optimizes its own reward function; and an adaptive-learner strategy in which the robot learns the human’s reward function and adopts it as its own. Two human-subject experiments with a total of N = 54 participants were conducted. In both experiments, the human-robot team searches for potential threats in a town. The team sequentially goes through search sites to look for threats. We model the interaction between the human and the robot as a trust-aware Markov Decision Process (trust-aware MDP) and use Bayesian Inverse Reinforcement Learning (IRL) to estimate the reward weights of the human as they interact with the robot. In Experiment 1, we start our learning algorithm with an informed prior of the human’s values/goals. In Experiment 2, we start the learning algorithm with an uninformed prior. Results indicate that when starting with a good informed prior, personalized value alignment does not seem to benefit trust or team performance. On the other hand, when an informed prior is unavailable, alignment to the human’s values leads to high trust and higher perceived performance while maintaining the same objective team performance.
NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes
arXiv, January 2024
Lizhou Fan, Wenyue Hua, Lingyao Li, Haoyang Ling, Yongfeng Zhang
Complex reasoning ability is one of the most important features of current Large Language Models (LLMs), which has also been leveraged to play an integral role in complex decision-making tasks. Therefore, the investigation into the reasoning capabilities of LLMs is critical: numerous benchmarks have been established to assess the reasoning abilities of LLMs. However, current benchmarks are inadequate in offering a rigorous evaluation of the full extent of reasoning abilities that LLMs are capable of achieving. They are also prone to the risk of overfitting, as these benchmarks, being publicly accessible and static, allow models to potentially tailor their responses to specific benchmark metrics, thereby inflating their performance. Addressing these limitations, our research introduces a new benchmark, named NPHardEval. This benchmark is designed to evaluate the reasoning abilities of LLMs across a broad spectrum of 900 algorithmic questions, extending up to the NP-Hard complexity class. These questions are meticulously chosen to represent a wide range of complexity classes below the NP-hard complexity class, offering a rigorous measure of the reasoning ability of LLMs. Through this study, we shed light on the current state of reasoning in LLMs, providing an objective and rigorous perspective through the comparison of LLMs’ performance across complexity classes. Our findings contribute significantly to understanding the current capabilities of LLMs in reasoning tasks and lay the groundwork for future advancements in enhancing the reasoning abilities of these models. Moreover, this benchmark is designed with a dynamic update mechanism, where the datapoints are refreshed on a monthly basis. Such regular updates play a crucial role in mitigating the risk of LLMs overfitting to the benchmark, promoting a more accurate and reliable assessment of their reasoning capabilities. The benchmark dataset and code of NPHardEval are available at https://github.com/casmlab/NPHardEval.