Health Care AI, Intended To Save Money, Turns Out To Require a Lot of Expensive Humans

Lambert here: Then lower the standards. Problem solved.

By Darius Tahir, Correspondent, who is based in Washington, D.C., and reports on health technology with an eye toward how it helps (or doesn’t) underserved populations; how it can be used (or not) to help government’s public health efforts; and whether or not it’s as innovative as it’s cracked up to be. Originally published at KFF Health News.

Preparing cancer patients for difficult decisions is an oncologist’s job. They don’t always remember to do it, however. At the University of Pennsylvania Health System, doctors are nudged to talk about a patient’s treatment and end-of-life preferences by an artificially intelligent algorithm that predicts the chances of death.

But it’s far from being a set-it-and-forget-it tool. A routine tech checkup revealed the algorithm decayed during the covid-19 pandemic, getting 7 percentage points worse at predicting who would die, according to a 2022 study.

There were likely real-life impacts. Ravi Parikh, an Emory University oncologist who was the study’s lead author, told KFF Health News the tool failed hundreds of times to prompt doctors to initiate that important discussion — possibly heading off unnecessary chemotherapy — with patients who needed it.

He believes several algorithms designed to enhance medical care weakened during the pandemic, not just the one at Penn Medicine. “Many institutions are not routinely monitoring the performance” of their products, Parikh said.
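
What “routinely monitoring the performance” means in practice can be quite mundane. As a minimal, hypothetical sketch (not Penn Medicine’s actual pipeline), a hospital data team might recompute a model’s discrimination on each month’s predictions and outcomes and flag any slide against the validation baseline; the column names and alert threshold below are illustrative assumptions:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def monthly_performance_alerts(scored: pd.DataFrame, baseline_auc: float,
                               max_drop: float = 0.05) -> list:
    """scored needs columns: 'date', 'risk_score' (the model's output) and
    'died_within_180d' (the observed outcome). Returns the months in which
    AUC fell more than max_drop below the validation-time baseline."""
    alerts = []
    for month, grp in scored.groupby(pd.Grouper(key="date", freq="M")):
        if grp["died_within_180d"].nunique() < 2:
            continue  # AUC is undefined unless both outcomes occur
        auc = roc_auc_score(grp["died_within_180d"], grp["risk_score"])
        if baseline_auc - auc > max_drop:
            alerts.append((month, round(auc, 3)))  # decay past tolerance
    return alerts
```

A 7-percentage-point slide of the kind Parikh’s team found would trip a check like this long before a one-off manual audit caught it.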

Algorithm glitches are one facet of a dilemma that computer scientists and doctors have long acknowledged but that is starting to puzzle hospital executives and researchers: Artificial intelligence systems require consistent monitoring and staffing to put in place and to keep them working well.

In essence: You need people, and more machines, to make sure the new tools don’t mess up.

“Everybody thinks that AI will help us with our access and capacity and improve care and so on,” said Nigam Shah, chief data scientist at Stanford Health Care. “All of that is nice and good, but if it increases the cost of care by 20%, is that viable?”

Government officials worry hospitals lack the resources to put these technologies through their paces. “I have looked far and wide,” FDA Commissioner Robert Califf said at a recent agency panel on AI. “I do not believe there’s a single health system, in the United States, that’s capable of validating an AI algorithm that’s put into place in a clinical care system.”

AI is already widespread in health care. Algorithms are used to predict patients’ risk of death or deterioration, to suggest diagnoses or triage patients, to record and summarize visits to save doctors work, and to approve insurance claims.

If tech evangelists are right, the technology will become ubiquitous — and profitable. The investment firm Bessemer Venture Partners has identified some 20 health-focused AI startups each on track to make $10 million in annual revenue. The FDA has approved nearly a thousand artificially intelligent products.

Evaluating whether these products work is challenging. Evaluating whether they continue to work — or have developed the software equivalent of a blown gasket or leaky engine — is even trickier.

Take a recent study at Yale Medicine evaluating six “early warning systems,” which alert clinicians when patients are likely to deteriorate rapidly. A supercomputer ran the data for several days, said Dana Edelson, a doctor at the University of Chicago and co-founder of a company that provided one algorithm for the study. The process was fruitful, showing huge differences in performance among the six products.

It’s not easy for hospitals and providers to select the best algorithms for their needs. The average doctor doesn’t have a supercomputer sitting around, and there is no Consumer Reports for AI.

“We have no standards,” said Jesse Ehrenfeld, immediate past president of the American Medical Association. “There is nothing I can point you to today that is a standard around how you evaluate, monitor, look at the performance of a model of an algorithm, AI-enabled or not, when it’s deployed.”

Perhaps the most common AI product in doctors’ offices is called ambient documentation, a tech-enabled assistant that listens to and summarizes patient visits. Last year, investors at Rock Health tracked $353 million flowing into these documentation companies. But, Ehrenfeld said, “There is no standard right now for comparing the output of these tools.”

And that’s a problem when even small errors can be devastating. A team at Stanford University tried using large language models — the technology underlying popular AI tools like ChatGPT — to summarize patients’ medical history. They compared the results with what a physician would write.

“Even in the best case, the models had a 35% error rate,” said Stanford’s Shah. In medicine, “when you’re writing a summary and you forget one word, like ‘fever’ — I mean, that’s a problem, right?”

Sometimes the reasons algorithms fail are fairly logical. For example, changes to underlying data can erode their effectiveness, like when hospitals switch lab providers.

Sometimes, however, the pitfalls yawn open for no apparent reason.

Sandy Aronson, a tech executive at Mass General Brigham’s personalized medicine program in Boston, said that when his team tested one application meant to help genetic counselors locate relevant literature about DNA variants, the product suffered “nondeterminism” — that is, when asked the same question multiple times in a short period, it gave different results.
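
A crude but useful repeatability test makes this failure visible: ask the identical question several times and compare the answers. The sketch below is purely illustrative; `query_model` is a hypothetical stand-in for whatever literature-lookup call is under test:

```python
def distinct_answers(query_model, prompt: str, trials: int = 5) -> set:
    """Ask the same question `trials` times in quick succession.
    A deterministic tool returns a set of size 1; anything larger
    reproduces the behavior Aronson describes."""
    return {query_model(prompt).strip() for _ in range(trials)}
```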

Aronson is excited about the potential for large language models to summarize knowledge for overburdened genetic counselors, but “the technology needs to improve.”

If metrics and standards are sparse and errors can crop up for strange reasons, what are institutions to do? Invest lots of resources. At Stanford, Shah said, it took eight to 10 months and 115 man-hours just to audit two models for fairness and reliability.
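
Part of what makes such audits slow is that fairness has to be checked slice by slice. One small, hypothetical fragment of that work (the column names are assumptions for illustration) is comparing a model’s discrimination across patient subgroups:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_aucs(df: pd.DataFrame, group_col: str) -> pd.Series:
    """df needs columns: group_col, 'outcome', 'risk_score'. Wide gaps
    between subgroup AUCs are one of many red flags an audit looks for."""
    return df.groupby(group_col).apply(
        lambda g: roc_auc_score(g["outcome"], g["risk_score"])
        if g["outcome"].nunique() == 2 else float("nan")
    )
```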

Experts interviewed by KFF Health News floated the idea of artificial intelligence monitoring artificial intelligence, with some (human) data whiz monitoring both. All acknowledged that would require organizations to spend even more money — a tough ask given the realities of hospital budgets and the limited supply of AI tech specialists.

“It’s great to have a vision where we’re melting icebergs in order to have a model monitoring their model,” Shah said. “But is that really what I wanted? How many more people are we going to need?”

22 comments

  1. VTDigger

    I can confirm: my employer in healthcare was testing out several AI visit transcription tools and the risk of…’hallucination’ I think they call it…was too large. It would simply move the work around; they would have to hire people to edit the AI transcriptions.

    There’s no Truth serum for these AI golems yet.

      1. Adam

        There’s already a website called Mechanical Turk that is likely the most popular freelancing tech website (one that basically pays pennies).

    1. Jeff H

      Having worked with medical transcriptionists, I can attest to their magical ability to discern the meaning of a physician’s dictation.

    2. IM Doc

      I have now been using it for 3-4 months.
      It is horrendous.

      Two big issues with the dictation part –
      1) It just makes stuff up out of thin air – it is the most amazing thing I have ever seen.
      2) There are lots of personal issues that get brought up that I would never dream of having in a chart. These are mostly ignored; however, they can be injected into these charts in the least obvious manner. I have to read over every word, thus, this is actually much more time-consuming for me. I have colleagues who do not spend one second proofreading and what appears in these charts is truly horrifying.

      I long for the days that I had a medical transcription specialist with a brain.

      Even more scary are all the diagnostic suggestions and plans inserted at the end that you never ordered much less dreamed of.

      As I have said repeatedly – retirement cannot come soon enough.

      1. Felix_47

        As a fellow Oslerian I feel your pain. Wouldn’t it be nice to just jot down a brief note that reminds your brain of what the primary issue is, take a history, examine the patient, and prescribe or treat as indicated, without anything between you and the patient? The way it used to be. Enough US doctors need to get upset enough to push for a system like the British NHS but better funded. The problem is that many doctors are greedy and unwilling to give up the ability to bill and overbill in the private practice setting. So now we are practicing insurance documentation in order to bill enough to make a living. Half or more of our surgeries are unnecessary or do harm, and much of the medication and therapy we prescribe is not effective or causes harm. We need a system that has no production bonuses, no pressure to see people, no patient satisfaction parameters, and no financial incentives. Then let those who went to school for it make treatment decisions in view of what is best for the patient as they see it, knowing the patient and the situation. Put them on a salary like a fireman paramedic and leave them alone. Fee for service is not morally supportable in healthcare if you believe in the Hippocratic oath.

    3. Acacia

      “Hallucination” … a.k.a. making shit up… even the word choice is an attempt on the part of tech bros to anthropomorphize a piece of technology that cannot reason or think.

  2. The Rev Kev

    Reading this article, it seems that AI here is really at the beta stage when released, but whereas beta software can be improved with updates, this healthcare AI deteriorates all by itself, and it would only be a matter of time until it becomes useless. But as the people pushing this hunk of junk are making hundreds of millions of dollars selling it to gullible ‘professionals’, I do not think that anything will change anytime soon. Maybe those companies should have spent the resources to build a self-diagnostic tool to go with those healthcare AIs, but as that would reveal how bad it gets over time, they were never going to do that.

  3. ChrisFromGA

    I am more and more convinced that “AI” is just a marketing term.

    And it’s a big part of “the bezzle.”

    1. Skk

      Yes it is. I was in the field that went thru nomenclature changes from statistical inference to predictive analytics to machine learning to data mining to data science.
      If AI is mostly large language models then it’s really a development of text analytics and gawd knows how large the error rate was when I was doing it for summarization.

      As regards the current post, once a model was in use, we continually monitored for ‘model drift’ over time. What one watched for was whether the data one was making predictions on still had (if it ever did, since it had never existed before) the same probability distribution as the data one ‘learnt on’ and validated the model on. And the answer was a ‘model refresh’, meaning use ever newer data to re-learn and revalidate on. I remember a payday lender modeller who boasted that they refreshed their models DAILY. Which gave rise to serious epistemological questions about probability.

      Thank gawd I’ve retired and no longer have to work with such mechanistic bozos in my very field, let alone C guys.
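
      For what it’s worth, the distribution check described above fits in a few lines; here is a sketch using a per-feature two-sample Kolmogorov-Smirnov test between training data and live data (the arrays and the significance cutoff are illustrative assumptions):

      ```python
      import numpy as np
      from scipy.stats import ks_2samp

      def drifted_features(train: np.ndarray, live: np.ndarray,
                           alpha: float = 0.01) -> list:
          """train/live: arrays of shape (n_samples, n_features). Returns the
          indices of features whose live distribution no longer matches the
          training data -- the cue that a 'model refresh' is due."""
          return [j for j in range(train.shape[1])
                  if ks_2samp(train[:, j], live[:, j]).pvalue < alpha]
      ```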

    2. Lefty Godot

      This article seems to be conflating “AI” with “algorithms”. It’s not clear to me which they really mean.

    1. redleg

      Since there are a limited number of vendors using the same dataset for training, some person is going to make bank (pun intended) when all of the mighty AI-encumbered institutions fail simultaneously.

  4. THEWILLMAN

    Software has always been sold using “happy path” demos and the actual implementation is always a bit more complicated. And healthcare institutions are notorious for being a decade or two behind what you’d see at first mover technology companies.

    In this case they’re being sold bleeding-edge technology that maybe a few hundred companies in the world can do reliably at scale. Hallucinations and inaccuracies and a ton of financial and labor cost (and, given who their customers are, human cost) to clean things up are exactly what should be expected.

  5. Skip Intro

    Unlike the corpus of English text, there are very few vast troves of medical data on which to train machine learning algorithms, so these algorithms can only be effective at the margins of healthcare, or produce results from insufficient information, which will be mostly ‘hallucination’, the term of art for any AI results you don’t like.

  6. Jeremy Grimm

    I saw this statement near the top of this post:
    “Many institutions are not routinely monitoring the performance” of their products, Parikh said.
    How well do many institutions routinely monitor the performance of their staff and treatments? And where is this performance information reported?
    I found the statements below toward the end of the post.
    “Government officials worry hospitals lack the resources to put these technologies through their paces.”
    “All acknowledged that would require organizations to spend even more money — a tough ask given the realities of hospital budgets…”
    Am I to believe hospitals are cash-strapped? If they are, where is all the money going? Dividends, fees, bonuses for management, or management salaries?
    Even the title of this post leaves me wondering: “Health Care AI, Intended To Save Money, Turns Out To Require a Lot of Expensive Humans.” Save money for whom? What about providing better care, better outcomes, better practice of Medicine? If saving money is the intention behind applying AI to medical care — how much AI would it take to spot the thick covering of parasites sucking money and effectiveness out of what the u.s. calls medical care?

  7. frligf

    British @LabourParty just announced massive tax money for AI, so they clearly haven’t heard of the problems with AI that have been muttered about for a decade.
    Put simply, every time AI makes a calculation, any error is carried along and compounded, which means that over billions of iterations it builds up a wide margin of error, just like the problem with trying to use microchips to calculate flight plans to the moon or Mars, which is why it was done by humans.

