All Medications Are Insignificant In The Eyes Of God And Traditional Effect Size Criteria
SSRI antidepressants like Prozac were first developed in the 1980s and 1990s. The first round of studies, sponsored by pharmaceutical companies, showed they worked great! But later, skeptics found substantial bias in these early trials, and several later analyses that corrected for it all found effect sizes (compared to placebo) of only about 0.30.

Is an effect size of 0.30 good or bad? The usual answer is "bad". The UK's National Institute for Clinical Excellence used to say that treatments were only "clinically relevant" if they had an effect size of 0.50 or more. The US FDA apparently has a rule of thumb that any effect size below 0.50 is "small". Others are even stricter. Leucht et al investigated when doctors subjectively feel that their patients have gotten better, and found that even an effect size of 0.50 corresponds to doctors saying they see little or no change. Based on this research, Irving Kirsch, author of some of the earliest credible antidepressant effect size estimates, argues that "[the] thresholds suggested by NICE were not empirically based and are presumably too small", and says that "minimal improvement" should be defined as an effect size of 0.875 or more. No antidepressant consistently attains this.
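A quick note on units, since everything below turns on them: these numbers are standardized mean differences (Cohen's d, or close variants like Hedges' g) - the gap between the placebo and drug group means, measured in units of the pooled standard deviation. On a scale like the HAM-D, where lower scores are better:

```latex
d = \frac{\bar{x}_{\text{placebo}} - \bar{x}_{\text{drug}}}{s_{\text{pooled}}},
\qquad
s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
```

So an effect size of 0.30 means the average drug patient ends up about a third of a standard deviation better off than the average placebo patient. Keep an eye on that denominator; it does a lot of work below.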
Kirsch wrote:

…sparking a decade of news articles like "Antidepressants Don't Work - Official Study" and "Why Antidepressants Are No Better Than Placebos". Since then everyone has gotten into a lot of fights about this, with inconclusive results.

Recently a Danish team affiliated with the pharma company Lundbeck discovered an entirely new way to get into fights about this. I found their paper, Determining maximal achievable effect sizes of antidepressant therapies in placebo-controlled trials, more enlightening than most other writing on this issue. They ask: what if the skeptics' preferred effect size number is impossible to reach?

Consider the typical antidepressant study. You're probably measuring how depressed people are using a test called the HAM-D - on one version, the scale ranges from 0 to 54, anything above 7 counts as depressed, and anything above 24 as severely depressed. Most of your patients probably start out in the high teens to low twenties. You give half of them antidepressant and half placebo for six weeks. By the end of the six weeks, maybe a third of your subjects have dropped out due to side effects or general distractedness. On average, the people left in the placebo group will have a HAM-D score of around 15, and the people left in the experimental group will have some other score depending on how good your medication is.

The Danes simulate several different hypothetical medications. The one I find most interesting is a medication that completely cures some fraction of the people who take it. They simulate "completely cures" by giving those patients a distribution of HAM-D scores similar to that of healthy non-depressed people. Here's what they find:

The panels from A to F are antidepressants that cure 0%, 20%, 40%, 60%, 80%, and 100% of patients respectively, and we're looking at the ES - effect size - for each. Only D, E, and F pass NICE's 0.50 threshold, and only F passes Kirsch's higher 0.875 threshold. So a drug that completely cured 40% of the people who took it would be "clinically insignificant" for NICE. And even a drug that completely cured 80% of the people who took it would be clinically insignificant for Kirsch! Clearly this use of "clinically insignificant" doesn't match our intuitive standard of "meh, doesn't matter".

We can make this even worse. Suppose that instead of completely curing patients, the drug "only" makes their depression improve a bit - specifically, half again as much as the placebo effect. Here's the same simulation: now only E and F meet NICE's criterion, and nobody meets Kirsch's! A drug that significantly improves 60% of patients is clinically insignificant for NICE, and even a drug that significantly improves 100% of patients is clinically insignificant for Kirsch!

What's gone wrong here? The authors point to three problems.

First, most people in depression trials respond very well to the placebo effect. The minimum score on a depression test is zero, and even healthy non-depressed people rarely score exactly zero. So if most of the placebo group is doing pretty well, there's not a lot of room for the drug to make people do better than placebo.

Second, this improvement in the placebo group is inconsistent: a lot of patients recover completely, but others don't recover at all. That means there's a large standard deviation in the placebo group. Effect size is measured in units of the standard deviation, so a very high standard deviation artificially lowers the effect size.

Third, many patients (often about 30%) leave the study partway through because of side effects or other reasons. These people stop taking the medication, but intention-to-treat analysis leaves them in the final statistics. Since a third of the experimental group isn't even taking the medication, this artificially lowers the medication's apparent effect size.
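To make the three mechanisms concrete, here is a minimal toy simulation in the spirit of the Danish paper - my own sketch, not their code. Every distributional choice (endpoint HAM-D of roughly 15 ± 7 in the placebo arm, roughly 3 ± 3 for "cured" patients, 30% dropout analyzed as intention-to-treat) is an illustrative assumption, not a number taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)


def hamd(mean, sd, n):
    """Draw endpoint HAM-D scores, clipped to the 0-54 scale (the floor)."""
    return np.clip(rng.normal(mean, sd, n), 0, 54)


def effect_size(placebo, drug):
    """Cohen's d, oriented so positive favors the drug (lower HAM-D)."""
    n1, n2 = len(placebo), len(drug)
    pooled_sd = np.sqrt(((n1 - 1) * placebo.var(ddof=1) +
                         (n2 - 1) * drug.var(ddof=1)) / (n1 + n2 - 2))
    return (placebo.mean() - drug.mean()) / pooled_sd


n = 100_000  # patients per arm; huge, to suppress sampling noise

# Placebo arm: strong but inconsistent placebo response, so endpoint
# scores center around 15 with a wide spread (assumed SD of 7).
placebo = hamd(15, 7, n)

for cure_rate in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    # Drug arm: a fraction of patients are "completely cured" and get
    # scores resembling healthy controls (assumed mean 3, SD 3);
    # everyone else looks just like the placebo arm.
    cured = rng.random(n) < cure_rate
    drug = np.where(cured, hamd(3, 3, n), hamd(15, 7, n))

    # Intention-to-treat: ~30% drop out, stop taking the drug, and end
    # up looking like placebo patients -- but stay in the analysis.
    dropped = rng.random(n) < 0.30
    drug_itt = np.where(dropped, hamd(15, 7, n), drug)

    print(f"cure rate {cure_rate:>4.0%}:  "
          f"d = {effect_size(placebo, drug):.2f} (completers), "
          f"{effect_size(placebo, drug_itt):.2f} (intention-to-treat)")
```

With these made-up parameters, the intention-to-treat column lands in the same qualitative territory as the paper's panels: dropout dilution plus the wide placebo-arm spread drag even dramatically effective hypothetical drugs down toward the thresholds, and only near-total cure rates approach 0.875.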
I'm not sure about this, but I think NICE and Kirsch were basing their criteria on observations of single patients. That is, in one person, it takes a difference of 0.50 or 0.875 to notice much of a change. But studies face different barriers than single-person observations and aren't directly comparable.

NICE has since walked back its claim that only effect sizes higher than 0.50 are clinically relevant (although this is part of a broader trend for health institutes not to say things like this, so I don't want to make too big a deal of it). As far as I know, Kirsch hasn't. Still, I think a broader look at medication effect sizes suggests that the Danish team's effect size laxness is overall right, and the earlier effect size strictness was wrong. Here's a chart by a team including Leucht, who did some of the original HAM-D research:

Some of our favorite medications, including statins, anticholinergics, and bisphosphonates, don't reach the 0.50 level. And many more, including triptans, benzodiazepines (!), and Ritalin (!!), don't reach 0.875. This doesn't even include some of my favorites. Zolpidem ("Ambien") has an effect size of around 0.39 for getting you to sleep faster. Ibuprofen ("Advil", "Motrin") has effect sizes from about 0.20 (for surgical pain) to 0.42 (for arthritis). All of these are around the 0.30 effect size of antidepressants. There's no anti-ibuprofen lobby trying to rile people up about NSAIDs, so nobody has pointed out that this is "clinically insignificant". But by traditional standards, it is!

Statisticians have tried to put effect sizes in context by calling some effect sizes "small" or "big" or "relevant" or "minuscule". I think this is a valiant effort, but it makes things worse as often as it makes them better. Some effect sizes are smaller than we think; others are larger. Consider a claim that the difference between treatment and control groups was "only as big, in terms of effect size, as the average height difference between men and women - just a couple of inches" (I think I saw someone say this once, but I've lost the reference thoroughly enough that I'm presenting it as a hypothetical). That drug would be more than four times stronger than Ambien!
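For what it's worth, the arithmetic on that hypothetical checks out. Using commonly cited figures - a gap of roughly 13 cm (about 5 inches) between male and female mean heights and a within-sex standard deviation of about 7 cm, both my assumptions rather than anything from the lost reference:

```latex
d_{\text{height}} \approx \frac{13\ \text{cm}}{7\ \text{cm}} \approx 1.9,
\qquad
\frac{d_{\text{height}}}{d_{\text{zolpidem}}} \approx \frac{1.9}{0.39} \approx 4.8
```

An effect size near 2 is enormous by drug standards, which is exactly the point: the verbal comparison makes it sound tiny.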
The difference between study effect sizes, population effect sizes, and individual effect sizes only confuses things further. I would downweight all claims that "this drug has a meaningless effect size" compared to your other sources of evidence, like your clinical experience.