
408: Anticipating AI failures in healthcare with Dr. Karandeep Singh

Information Changes Everything: The Podcast. Karandeep Singh, associate professor, UMSI and Michigan Medical School, and associate CMIO for artificial intelligence, Michigan Medicine. News and research from the world of information science.


Information Changes Everything

News and research from the world of information science
Presented by the University of Michigan School of Information (UMSI)

Episode

408

Released

July 9, 2024

Recorded

2023

Guests

Dr. Karandeep Singh is an associate professor and the Joan and Irwin Jacobs Chancellor’s Endowed Chair in Digital Health Innovation at the University of California, San Diego. He is also associate chief medical information officer for inpatient care and chief health AI officer at UC San Diego Health.

When this talk was recorded, Singh was an associate professor at the University of Michigan School of Information and Michigan Medical School. He was also associate chief medical information officer for AI at Michigan Medicine.

Summary 

In this episode of “Information Changes Everything,” Dr. Karandeep Singh, a leader in machine learning within health systems, discusses the common challenges and failures in AI model implementation in health care. Using real-world examples, Dr. Singh addresses issues such as reproducibility, transportability and intervention efficacy. He offers insights on anticipating these challenges, estimating their impact, and designing better interventions to improve patient care.

Resources and links mentioned

Reach out to us at [email protected]

Timestamps

Intro (0:00)

Information news from UMSI (1:24) 

Hear excerpts from Dr. Karandeep Singh’s 2023 talk “Why Health AI Implementations Fail” at UMSI (2:53)

Next time: Lise Vesterlund exposes the invisible labor holding women back (23:10)

Outro (24:00)

Subscribe

Subscribe to “Information Changes Everything” on your favorite podcast platform for more intriguing discussions and expert insights.

About us

The “Information Changes Everything” podcast is a service of the University of Michigan School of Information, leaders and best in research and education for applied data science, information analysis, user experience, data analytics, digital curation, libraries, health informatics and the full field of information science. Visit us at si.umich.edu.

Questions or comments

If you have questions, comments, or topics you'd like us to cover, please reach out to us at [email protected].

Dr. Karandeep Singh (00:00):

At the end of the day, unless you're expanding your workforce, you're just deciding how you have people spend their time, which effectively, I mean, put in a negative light, that's rationing care. But we already have people on waiting lists to get in to see clinicians. So that's already rationing care if you think about it that way.

Kate Atkins, host (00:20):

That was Karandeep Singh during his talk at UMSI’s 2023 Data Science and Computational Social Science Seminar Series. And this is Information Changes Everything, where we put the spotlight on news and research from the world of information science. You're going to hear from experts, students, researchers, and other people making a real difference. As always, we're presented by the University of Michigan School of Information, UMSI for short. Learn more about us at umsi.umich.edu. I'm your host, Kate Atkins. Today we'll hear more from Karandeep Singh as he shares real-world examples to illustrate common implementation roadblocks that can cause AI models to fail. He'll demonstrate how we can anticipate these issues and better design interventions to make a positive impact on patient care. Singh is an associate professor at UMSI and the associate chief medical information officer for artificial intelligence at Michigan Medicine. Before we jump in, a few other people and projects that you should know about. First, the University of Michigan is proactively integrating artificial intelligence into its curriculum by developing its own AI tools to enhance student learning. UMSI associate professor David Jurgens says the goal is to help students use AI in a productive and collaborative way while maintaining critical thinking skills.

(01:52):

Next, Nextdoor is the largest hyper-local social media network, used predominantly in less dense, wealthier, older and more educated neighborhoods. A team of University of Michigan and New York University researchers revealed that posts seeking and offering services are the most frequent, but posts about suspicious activities get the highest engagement, raising concerns about racialized community surveillance. Finally, wearable technology has been around for a while, but some researchers at North Carolina State are taking it one step further. They're using machine learning to create touch-based sensors embroidered into fabric. The sensor powers itself from the friction generated by its multiple layers and has been used to play video games and control electronic devices. For more on all of these stories, check out si.umich.edu or click the link in our show notes. Now back to Karandeep Singh.

Dr. Karandeep Singh (02:55):

I wish we didn't have a talk that I could give on this, but unfortunately I do. And so we'll talk through some of the things that we've kind of learned over the last several years of trying to use AI in our health system at Michigan Medicine, and also just from colleagues' experience, around why AI implementations fail. Just to talk about what I mean by AI implementation in health, because I think everyone kind of has their own take on what this means: what I mean simply is the classic human-in-the-loop model of AI, where you start off with an AI model that produces a recommendation, that then gets reviewed by a person, often a clinician, and that gets implemented into a clinical workflow to produce some change. Because to produce positive change, you have to produce change. So if that last part isn't true, then nothing is actually changing, nothing is happening and there can be no positive change.

(03:43):

In reality, it's not this straightforward, because most AI models that we use are actually predicting risk. They're not predicting a recommendation. And so on top of that risk prediction we’re layering some kind of decision that considers the trade-offs of intervening versus not intervening. So it converts that risk into a recommendation to do something or not do something. This is not the only way that AI gets implemented in healthcare. Another way is this: you have an AI model that produces a recommendation, this then gets implemented into a clinical workflow and changes something about the workflow itself, and that in turn reorganizes or streamlines a person's work. So just to give you an example of this one, we have a rapid response team that responds to emergencies happening in the hospital. And what they used to do is they used to call out to each of the inpatient units to say, is there anyone that you have that we should kind of just check in on?

(04:39):

Assuming that there's no emergencies happening at that very minute, eventually they would talk to all the different units, and the charge nurse for a unit might say, yeah, we have this one person, you might want to come check in on them. They're okay right now, but they might kind of go downhill today. And that's kind of how they organized their work. What we have implemented over the past two years, exploring a couple of different models, is this: now what they do is there's a risk model that every 30 minutes scores every patient who's admitted to the hospital on their risk of going to the ICU or having a condition that would eventually lead them to go to the ICU, and then we sort that list from highest to lowest. They print that list out at seven in the morning and they basically check in on the top 20 sickest patients.
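What Singh describes here is, at its core, a top-k ranking over model scores: score every admitted patient, sort from highest to lowest risk, and round on the top of the list. A minimal sketch of that prioritization step in Python follows; the patient records and scoring function are toy stand-ins for illustration, not Michigan Medicine's actual model.

```python
# Minimal sketch of the ranked rounding list described above: score every
# admitted patient, sort from highest to lowest risk, and keep the top k.
# The records and score_fn below are toy stand-ins for illustration only.

def build_rounding_list(admitted_patients, score_fn, top_k=20):
    """Return the top_k highest-risk patients along with their scores."""
    scored = [(score_fn(p), p) for p in admitted_patients]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sickest first
    return scored[:top_k]

# Toy usage: three admitted patients and a scoring function that just
# reads a precomputed risk field.
patients = [
    {"id": "A", "risk": 0.72},
    {"id": "B", "risk": 0.10},
    {"id": "C", "risk": 0.55},
]
top = build_rounding_list(patients, score_fn=lambda p: p["risk"], top_k=2)
print([p["id"] for _, p in top])  # ['A', 'C']
```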

(05:21):

But that's kind of a way in which it's streamlined their workflow. They still have full agency over who to see and who not to see, but if you were to ask, well, what are you doing differently? They're not making all those phone calls in the morning. Now they're kind of starting off with a list that has already considered a set of characteristics about all the hospitalized patients, to try to prioritize the patients that would ultimately be the patients they would get called on anyway later in the day if they got sicker. So recognizing that there's a lot of different workflows and a lot of different ways that AI gets implemented, it's hard to say what an ideal implementation is, but these are some of the principles I would say are there in an ideal implementation. And it's important to talk about this before we get into why things fail.

(06:04):

So I think an ideal implementation for AI is one that produces recommendations that are informative and that you can actually act on. They are things that you have the resources to act on, or it's the right timing where you actually can act on it, and so on. Another characteristic, I think, is that it allocates clinician and staff time more effectively. So if you look at what we're using AI for, for the most part we're by and large not using it to make diagnoses. It might be predicting a diagnosis and we might wrap a workflow around that. But by and large, we're not automating the key parts of medical care at this moment. What we are doing is we're deciding how we allocate people's time. And so we try to allocate time in a way that will lead to better patient outcomes, and allocate time in a way that is more effective from a clinician and staff level.

(06:56):

And if the AI is not doing that, then it's kind of not all that useful. If it's telling people to go see the patients they already were going to see and do things that they already were going to do, there's not really much gain that occurs from that. And the last one, I think, is that we ideally want to allocate more resources toward the people in our communities who need them and may benefit from them. So if we're talking about why things fail, things generally fail either because of issues from the AI artifacts themselves, like the models that we're using and something about the recommendations that they're producing, or because of the interventions or the changes in clinical workflow that we're putting in place. Basically, these interventions reorganize our work. They help us reallocate care. And so if there are issues that we haven't considered about the social context or the clinical context when we do that, that's another common way that things can fail.

(07:46):

So here are some common reasons why I think models fail: lack of transparency, lack of reproducibility, lack of transportability, something called net benefit, where the model is actually worse than no model, which we often don't even measure or look for, and predicting non-modifiable risk. And I'll give some very painfully learned examples of this stuff from our own work. So transparency, you would think, would be a requirement to implement something in clinical care. But what's interesting is this kind of paradox that's happened in health, where if you look at the literature and you do a lit review and you look at what the models in the literature are, the things that are in the literature are basically a completely disjoint set from what's actually implemented. So there are researchers hard at work building all kinds of things with a lot of thought, and they get to a point where they have an artifact.

(08:35):

That artifact, if you're lucky, might be available for you to check out online. It might even be in a format that you could actually plug and play into your electronic health record. But now we're already getting into maybe half of a percent of things that are out there, or maybe even lower than that. For the most part, things kind of end at the paper, and you can't find any artifact, anything from there that you could possibly implement. On the other extreme is this: companies, and specifically electronic health record vendors, are really well positioned to develop things and make them easy to use. So with a couple of clicks, you can turn on a sepsis model to work in your health system, because the electronic health record vendor built it, and they have model cards that they make available to their customers. But those model cards are proprietary.

(09:22):

Those aren't things that you can share publicly. So what happens when you have a situation like that? What ends up happening is things like this. So Epic has a no-show model that predicts the chance that you're not going to show up to a doctor's appointment. Now, you can think of a lot of good reasons why you might want to do this. You want to find out, oh, maybe who doesn't have transportation, and we can arrange for transportation to make it more likely they'll show up. You can think of lots of nefarious reasons to do this, like double-booking people, and the people who end up getting double-booked are the most vulnerable people who can't show up to an appointment anyway for various obvious reasons. But the point is, they made this model and they released it, made it available, and about 50 or 60 of their customers kind of adopted it.

(10:09):

When we were looking at this model, there were two versions of it. So they said, well, version two requires this additional technology, which you don't have yet, so you should consider version one. And at that time we were making a decision around whether to get that technology that would give us the ability to run version two, which was a kind of cloud-based thing. So for version one, we looked at the set of predictors, and included among the predictors were race, ethnicity, maybe not race, ethnicity, BMI and religion. We said, religion? What is religion doing in this model predicting no-shows? It's not like people have appointments on the weekends. Is it something to do with Fridays, like some people having Fridays as protected days? But there was no explanation for why religion was there or what the role of religion was in that model. This was something where we ended up saying, well, we're not going to use that version

(10:55):

because we don't get it. Or, if we were going to use it, we wanted to figure out, is there a way we can lie to the model and just lie about the religion? Just give it a constant so that it doesn't consider that in its decision making. And the folks at the EHR vendor said, huh, that's interesting. We hadn't thought anyone would ever want to lie to the model to get out a decision that you could trust. Now, version two of their model, which was cloud-enabled, had taken those variables out, probably because of internal pushback that had happened among customers out of the public view. And again, all the information sheets that we get say proprietary, confidential, proprietary, confidential, do not share publicly. But colleagues at UCSF led by Sarah Murray and Bob Wachter published a blog post that basically said, what the heck is religion doing in this model that a bunch of health organizations are using right now to predict no-shows?

(11:44):

And so I think that's an example of the sort of thing that is very commonly happening now because of a lack of transparency, especially around vendor-based models, because there's no mandate to be transparent and health systems are by and large not demanding that transparency. Reproducibility is another challenge. So this is one example that I think was published a couple of years ago now by DeepMind. What DeepMind did was they made this deal with the VA: you give us all your data on a bunch of veterans, and we will train a model that can predict acute kidney injury, basically who's going to have acute kidney injury in the next 48 hours. And if you can predict who's going to develop acute kidney injury in the next 48 hours, there may be things you can do to try to prevent it.

(12:28):

And so that was kind of the goal here. So they published it, and they said, despite state-of-the-art retrospective performance of our model compared to existing literature, future work should now prospectively evaluate and independently validate the proposed model to establish its clinical utility and effect on patient outcomes. So that sounds great. So where is the model? The model is nowhere to be found. It's not on GitHub. Where is the dataset? The dataset is nowhere to be found, not on GitHub. And so in essence, to actually do what they're saying you should do to use this in clinical care, you would have to get access to the same VA data, write your whole modeling pipeline so you can clean data the way that they cleaned it for the purpose of their study and structure the time series component, build models on top of that, then evaluate it, and then use it in your system.

(13:16):

And so you can see how much of a difficult task that would be. So that's what we did. So we actually ended up having a contract with the VA around the time that this paper came out and our contract was to build an acute kidney injury model for the VA. So we said, this is great timing, probably whoever gave us the money to do that wasn't fully aware that they were already doing this. But in any case, we said, this is a great time to stop what we're doing and try to replicate the DeepMind paper. So we wrote a software package that could clean data for us and kind of transform data in a relatively similar way to what DeepMind had done. We couldn't use a deep neural net because we were confined to the VA computing environment where we didn't have a GPU, but we were able to somewhat replicate aspects of one of their gradient boosted decision trees that they had used, which also had state-of-the-art performance.

(14:07):

So it wasn't like only the deep neural net had state-of-the-art performance; even their gradient boosted decision tree had beat all prior reported performance measures in that space. There were some variables we had to leave out. We left out billing codes. And so on reproducibility: basically, when we did all the steps to try to replicate what they had done, we weren't quite able to get their level of performance. Our average performance was the solid line and theirs is the dotted line. But that's probably because we didn't include billing codes, and had we included billing codes, we probably would've closed that gap, because some of the most important predictors of acute kidney injury were billing codes. The one I love is basically, I don't remember the exact billing code, but what the billing code was saying was, you're coming in to get your kidneys taken out.

(14:53):

So it was like kidney cancer or something, but it was something where you could easily say, okay, why is that a predictor? Oh, it's probably because they're coming in for an ablation for kidney cancer or they're getting their kidney taken out, and a great way to develop acute kidney injury is to have your kidney actually removed. So we couldn't quite reproduce it, but that's reproducibility in a nutshell. Transportability is the other side. There's not one VA hospital; there are 110 VA hospitals across the country. So we said, what if we use this model and evaluate it individually at each VA? And you can see that at some VAs it looks really good, and at some VAs it looks terrible when you look at the area under the curve. This is partly related to sample size, but it's not entirely related to sample size.

(15:32):

But just to say that this is another challenge: it's probably not actionable. You're not going to not do what you have to do to treat their kidney cancer. In other cases, what it will mean is that the set of actions you might consider are a different set of actions, or the time course is different. And we'll talk about sepsis, where I think one of the things that they included in the model could still be useful, but for a totally different set of actions. So another example of this, where all three of these issues came out (transparency, reproducibility, transportability), was this evaluation that we did of a proprietary sepsis model that was implemented at hundreds of hospitals. And we also happened to be evaluating it at the University of Michigan. So this is a situation where we had all the scores being produced by the model. We had some information about the model, but even our own analytical staff weren't clear whether they were allowed to look behind the veil to see what goes into the model.

(16:28):

Because everywhere on here it said, this is a proprietary model, this is our IP, you can't look. Later, we figured out that we were allowed to look, and when we looked, we found even crazier stuff. But before we could look, we said, well, we have the predictions, we have the patients, we have the outcomes. We can at least get a sense of how good this sepsis model is at predicting sepsis. This is a widely used model. So what did we find? First of all, we looked at their white paper, which was the majority of the information that we had. This is a model that was trained at three different health systems between 2013 and 2015. It included things like vital signs, medications, lab values, comorbidities and demographics, all things that seemed relatively standard, no major red flags there. And when they looked at this model's performance, in the specific way that they measured it and the way that they defined their outcome, they got an area under the curve of between 0.76 and 0.83.

(17:25):

And there's no area under the curve that's inherently good or bad, because depending on what you're trying to predict, you can expect a really high one or a really low one. But that was just kind of our benchmark to see how useful this model appeared, and that's one way that we can look at that, although admittedly not the only way. So we ended up trying to redo this analysis, and part of the reason we redid it is because when we were piloting this on a handful of units, we were finding that qualitatively it wasn't as useful as that performance would suggest. On paper the numbers looked good; qualitatively it didn't feel that good. Probably just as importantly, it wasn't moving any of our quality measures. And so we asked our sepsis committee for the health system, can you give us the timestamps that you are using to decide when someone has sepsis?

(18:12):

That's a really tricky thing to define. There are lots of different ways to do it. But we said, if you're using that to make clinical decisions about our quality of care, can you give us those timestamps? So they said, here, have those timestamps. So we said, we'll look to see how accurately the model can predict that sepsis before it happens, which is what the original model was claiming to do. When we looked at it, we got a much lower area under the curve. We got an area under the curve in the point sixes if we restricted it to scores that happen before you develop sepsis, much lower than what Epic was reporting. But if you looked at scores in the three hours after, it was way better. We said, huh, that's interesting. Something weird is going on, but we don't know exactly what. And at the time when we published this, we really didn't know exactly what was going on.
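One rough way to operationalize the comparison Singh describes, computing discrimination only from scores generated before sepsis onset rather than also counting scores from the hours afterward, is sketched below. This is a simplified illustration with hypothetical column names, not the exact methodology of the published Michigan study.

```python
# Simplified sketch of a "pre-onset only" AUC evaluation (hypothetical
# column names; not the published study's exact methodology).
# Expected columns: patient_id, score_time, score, sepsis_onset_time
# (sepsis_onset_time is NaT for patients who never developed sepsis).
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_pre_onset(scores: pd.DataFrame) -> float:
    # Keep only scores generated before onset, plus all scores from
    # patients who never developed sepsis.
    pre = scores[
        scores["sepsis_onset_time"].isna()
        | (scores["score_time"] < scores["sepsis_onset_time"])
    ]
    # Collapse to one row per patient: max pre-onset score, ever-septic label.
    per_patient = pre.groupby("patient_id").agg(
        max_score=("score", "max"),
        septic=("sepsis_onset_time", lambda s: s.notna().any()),
    )
    return roc_auc_score(per_patient["septic"], per_patient["max_score"])
```

Relaxing the time filter to also admit scores from the hours after onset is what produces the much higher number Singh mentions next.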

(18:56):

So is this a reproducibility issue? It is, because on our data, when they ran their analysis, they told us our AUC was as high as 0.88. So they said, no, no, no, not only is the AUC good, it's better than what we report in our white paper. You're like one of the top-performing centers by our calculation. But we said, we looked at almost the same set of data and it looks a lot worse based on what our sepsis committee is using to actually define sepsis. So something weird is going on here. Is this a transportability issue? So when our paper came out, there were a lot of unhappy people at this company, and they put out a public statement and private statements to their customers to say that the Michigan analysis is flawed because it was done at a single center; it's not reflective of any other health center.

(19:43):

So we then went and did a follow-up study at Barnes-Jewish, BJC HealthCare, which is nine different hospitals. So this was not just a reproducibility issue, it was also a transportability issue. There are clear centers where this does worse and clear centers where this does better, but even at the best centers, it only performs at the very lower end of what was being reported by the vendor in their white paper. But it was also a transparency issue, because after our paper came out, there was an investigative journalist who went and looked into what the heck was going on and found out through channels that one of the big predictors in the Epic model, not just one predictor but several predictors in the Epic model, were actually whether you had received broad-spectrum IV antibiotics, which is essentially how we treat sepsis. And so what's happening here is this, right?

(20:30):

The patient comes in, they look sick, a clinician says, I think you have sepsis, and starts you on antibiotics. The model says, aha, this person has sepsis, and alerts you to say, hey, this person has sepsis. And you say, I know, that's why I started the antibiotics. And that's not entirely what was happening, but that was definitely a part of what was happening. And when I asked them this question, couldn't that happen? They said, yeah, that could happen. And in fact, that probably was responsible for that huge gap between the performance right up to sepsis and the performance in the three hours after sepsis. So all three of those things can add up and really lead to all kinds of time wasted down the line, because of something you didn't know that would've totally changed what you would've done. Had we known this was in the model, we probably would've approached it very differently, or we might've even just decided not to look at it and not even bothered with this paper, because we really were looking at it because of how well it supposedly was doing, when we weren't finding that kind of qualitative benefit when we were trying to use it in our own health system.

(21:31):

So, a couple of other things. I know I spent a lot of time on that, so we'll breeze through some of the other ones. A model can be worse than no model. When you look at a sensitivity, specificity, positive predictive value, negative predictive value, you can show any combination of those to a clinician and they'll pick one that they think is good enough to implement. But when you actually ask them upfront, what trade-off are you willing to accept between false positives and false negatives, you can almost always find, probably at least half the time, that the model's performance that's achieved isn't good enough based on the actual trade-offs. The trade-offs of a false positive and a false negative aren't the same. This is something that we've now incorporated into our governance. We don't necessarily use this equation, but we ask people upfront, what is the positive predictive value that you would need to use this?

(22:15):

And then you often find that you can never achieve that positive predictive value, because the event rates are too low. And no matter how good your model, it can almost never get you to a level of certainty that's above that required PPV threshold that they've set upfront. So this is something that we've also discovered: if the model's not good enough, why use it? It's not always true that it will be better than doing nothing. But yeah, at the end of the day, unless you're expanding your workforce, you're just deciding how you have people spend their time, which effectively, I mean, put in a negative light, that's rationing care. But we already have people on waiting lists to get in to see clinicians. So that's already rationing care if you think about it that way.
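Singh doesn't walk through the arithmetic here, but the reason a required PPV can be out of reach at low event rates follows from Bayes' rule: PPV = (sensitivity x prevalence) / (sensitivity x prevalence + (1 - specificity) x (1 - prevalence)). A small illustrative calculation, with made-up numbers, shows how quickly a low prevalence caps the achievable PPV:

```python
# Illustrative only: why a required PPV can be unreachable when event rates
# are low. The numbers below are made up for the example.

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A seemingly strong model (sensitivity 0.80, specificity 0.90) at a 2% event rate:
print(round(ppv(0.80, 0.90, 0.02), 2))  # 0.14, far below a 0.5 PPV ask
# Even with 99% specificity, the same model barely clears 0.6:
print(round(ppv(0.80, 0.99, 0.02), 2))  # 0.62
```

In governance terms, if a clinical team says upfront that it needs a PPV of 0.5 to act on an alert, a 2% event rate makes that target effectively unreachable for any realistic model, which is the point Singh is making.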

Kate Atkins, host (23:00):

You can watch the full talk by clicking the link in our show notes. To learn more about upcoming events like this, visit us at umsi.info/events, and tune in next time to hear from economist and professor Lise Vesterlund during a 2022 talk at UMSI’s Social, Behavioral and Experimental Economics Seminar Series. Vesterlund discusses why women in the workforce are disproportionately asked and expected to do unpromotable work, which leaves them over-committed and under-utilized.

Lise Vesterlund (23:33):

So awareness will help us see when this happens. It will help us see, when we're in a meeting and we start asking for volunteers, to say, wait a minute, this isn't how it should be done. It will help us see when we once again volunteer a female who's overqualified for a job to take notes or set up the next meeting. It will make everybody else say, nah, this doesn't feel right. So the awareness is the first step.

Kate Atkins, host (23:57):

That's in our next episode. Before we go, your inbox will thank you for signing up for our email newsletter. It's a monthly digest featuring tiny, delicious news tidbits about information science and library topics. Check our show notes to sign up today. The University of Michigan School of Information creates and shares knowledge so that people like you will use information with technology to build a better world. Don't forget to subscribe to “Information Changes Everything” on your favorite podcast platform, and if you've got questions, comments, or episode ideas, send us an email at [email protected]. From all of us at the University of Michigan School of Information, thanks for listening.

Information Changes Everything: The Podcast

News and research from the world of information science, presented by the University of Michigan School of Information.