All Medications Are Insignificant In The Eyes Of God And Traditional Effect Size Criteria
SSRI antidepressants like Prozac were first developed in the 1980s and 1990s. The first round of studies, sponsored by pharmaceutical companies, showed they worked great! But later, skeptics found substantial bias in these early trials; several later analyses that corrected for it all found effect sizes (compared to placebo) of only about 0.30.

Is an effect size of 0.30 good or bad? The usual answer is “bad”. The UK’s National Institute for Clinical Excellence used to say that treatments were only “clinically relevant” if they had an effect size of 0.50 or more. The US FDA apparently has a rule of thumb that any effect size below 0.50 is “small”. Others are even stricter. Leucht et al investigated when doctors subjectively feel like their patients have gotten better, and found that even an effect size of 0.50 corresponds to doctors saying they see little or no change.

Based on this research, Irving Kirsch, author of some of the earliest credible antidepressant effect size estimates, argues that “[the] thresholds suggested by NICE were not empirically based and are presumably too small”, and says that “minimal improvement” should be defined as an effect size of 0.875 or more. No antidepressant consistently attains this. He wrote:
…sparking a decade of news articles like Antidepressants Don’t Work - Official Study and Why Antidepressants Are No Better Than Placebos. Since then everyone has gotten into a lot of fights about this, with inconclusive results.

Recently a Danish team affiliated with the pharma company Lundbeck discovered an entirely new way to get into fights about this. I found their paper, Determining maximal achievable effect sizes of antidepressant therapies in placebo-controlled trials, more enlightening than most other writing on this issue. They ask: what if the skeptics’ preferred effect size number is impossible to reach?

Consider the typical antidepressant study. You’re probably measuring how depressed people are with a test called the HAM-D: on one version, the scale ranges from 0 to 54, anything above 7 counts as depressed, and anything above 24 as severely depressed. Most of your patients probably start out in the high teens to low twenties. You give half of them an antidepressant and half a placebo for six weeks. By the end of the six weeks, maybe a third of your subjects have dropped out due to side effects or general distractedness. On average, the people left in the placebo group will have a HAM-D score of around 15, and the people left in the experimental group will have some other score depending on how good your medication is.

The Danes simulate several different hypothetical medications. The one I find most interesting is a medication that completely cures some fraction of the people who take it. They simulate “completely cures” by giving those patients a distribution of HAM-D scores similar to that of healthy non-depressed people. Here’s what they find: the panels from A to F are antidepressants that cure 0%, 20%, 40%, 60%, 80%, and 100% of patients respectively, and we’re looking at the ES - effect size - for each. Only D, E, and F pass NICE’s 0.50 threshold. And only F passes Kirsch’s higher 0.875 threshold.
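This kind of simulation is easy to play with yourself. Here’s a minimal Python sketch of the same idea; the score distributions, the 30% dropout rate, and the pooled-SD effect size formula are my own rough guesses at reasonable parameters, not the paper’s exact setup:

```python
import math
import random
import statistics

random.seed(0)

# All numbers below are illustrative assumptions, not the paper's parameters.
HEALTHY_MEAN, HEALTHY_SD = 5.0, 4.0    # end-of-trial HAM-D if "completely cured"
PLACEBO_MEAN, PLACEBO_SD = 15.0, 7.0   # end-of-trial HAM-D on placebo response alone
DROPOUT = 0.30                         # quit the drug but stay in the ITT analysis

def endpoint(cured):
    mean, sd = (HEALTHY_MEAN, HEALTHY_SD) if cured else (PLACEBO_MEAN, PLACEBO_SD)
    return max(0.0, random.gauss(mean, sd))   # HAM-D scores can't go below zero

def simulate_arm(cure_fraction, n):
    scores = []
    for _ in range(n):
        on_drug = random.random() >= DROPOUT           # dropouts get no drug benefit
        cured = on_drug and random.random() < cure_fraction
        scores.append(endpoint(cured))
    return scores

def cohens_d(cure_fraction, n=200_000):
    placebo = simulate_arm(0.0, n)
    drug = simulate_arm(cure_fraction, n)
    pooled_sd = math.sqrt(
        (statistics.pvariance(placebo) + statistics.pvariance(drug)) / 2
    )
    return (statistics.mean(placebo) - statistics.mean(drug)) / pooled_sd

for frac in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"cures {frac:.0%} of patients -> effect size {cohens_d(frac):.2f}")
```

With these made-up numbers the pattern comes out qualitatively like the paper’s: a drug that completely cures 40% of patients lands well under the 0.50 threshold, and only the 100%-cure drug clears 0.875.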
So a drug that completely cured 40% of the people who took it would be “clinically insignificant” for NICE. And even a drug that completely cured 80% of the people who took it would be clinically insignificant for Kirsch! Clearly this use of “clinically insignificant” doesn’t match our intuitive standard of “meh, doesn’t matter”.

We can make this even worse. Suppose that instead of completely curing patients, the drug “only” makes their depression improve a bit - specifically, half again as much as the placebo effect. In the same simulation, only E and F meet NICE’s criterion, and nothing meets Kirsch’s! A drug that significantly improves 60% of patients is clinically insignificant for NICE, and even a drug that significantly improves 100% of patients is clinically insignificant for Kirsch!

What’s gone wrong here? The authors point to three problems.

First, most people in depression trials respond very well to the placebo effect. The minimum score on a depression test is zero, and even healthy non-depressed people rarely score exactly zero. So if most of the placebo group is doing pretty well, there’s not a lot of room for the drug to make people do better than placebo.

Second, this improvement in the placebo group is inconsistent: many recover completely, but others don’t recover at all. That means there’s a large standard deviation in the placebo group. Effect size is measured as a fraction of the standard deviation, so a very high standard deviation artificially lowers the effect size.

Third, many patients (often about 30%) leave the study partway through because of side effects or other reasons. These people stop taking the medication, but intention-to-treat analysis leaves them in the final statistics. Since a third of the experimental group isn’t even taking the medication, this artificially lowers the medication’s apparent effect size.
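The second problem is purely mechanical, and a two-line example shows it. The 3-point HAM-D difference here is just an illustrative number: the same raw drug-placebo gap earns half the effect size when the response is twice as scattered.

```python
def cohens_d(mean_difference, pooled_sd):
    # Effect size is the group difference expressed in standard deviations
    return mean_difference / pooled_sd

# The same 3-point HAM-D advantage over placebo:
print(cohens_d(3, 5))   # consistent placebo response: d = 0.6, "clinically relevant"
print(cohens_d(3, 10))  # wildly inconsistent response: d = 0.3, "insignificant"
```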
I’m not sure about this, but I think NICE and Kirsch were basing their criteria on observations of single patients. That is, in one person, it takes a difference of 0.50 or 0.875 to notice much of a change. But studies face different barriers than single-person observations and aren’t directly comparable.

NICE has since walked back its claim that only effect sizes higher than 0.50 are clinically relevant (although this is part of a broader trend of health institutes not saying things like this, so I don’t want to make too big a deal of it). As far as I know, Kirsch hasn’t.

Still, I think a broader look at medication effect sizes suggests that the Danish team’s effect size laxness is overall right, and the earlier effect size strictness was wrong. Here’s a chart by a team including Leucht, who did some of the original HAM-D research: some of our favorite medications, including statins, anticholinergics, and bisphosphonates, don’t reach the 0.50 level. And many more, including triptans, benzodiazepines (!), and Ritalin (!!), don’t reach 0.875.

This doesn’t even include some of my favorites. Zolpidem (“Ambien”) has an effect size of around 0.39 for getting you to sleep faster. Ibuprofen (“Advil”, “Motrin”) has effect sizes from about 0.20 (for surgical pain) to 0.42 (for arthritis). All of these are around the 0.30 effect size of antidepressants. There’s no anti-ibuprofen lobby trying to rile people up about NSAIDs, so nobody has pointed out that this is “clinically insignificant”. But by traditional standards, it is!

Statisticians have tried to put effect sizes in context by saying some effect sizes are “small” or “big” or “relevant” or “minuscule”. I think this is a valiant effort. But it makes things worse as often as it makes them better. Some effect sizes are smaller than we think; others are larger.
Consider a claim that the difference between treatment and control groups was “only as big, in terms of effect size, as the average height difference between men and women - just a couple of inches” (I think I saw someone say this once, but I’ve lost the reference thoroughly enough that I’m presenting it as a hypothetical). That drug would be more than four times stronger than Ambien! The difference between study effect sizes, population effect sizes, and individual effect sizes only confuses things further.

I would downweight all claims that “this drug has a meaningless effect size” relative to your other sources of evidence, like your clinical experience.