Need Help With Upcoming Episode on Mask Junk Science

Is there a discussion of the power of the experiment/effect size? I didn't see it.

There is no detailed power analysis in the preprint that I can see. For reviewers such an analysis would have to be provided, but it isn't uncommon for it to be left out of the published paper or relegated to an appendix. However, in a case like this, where they powered their study for a novel, derived metric, if I were a reviewer they would have to really get into the weeds to justify their analysis and study power. I see that they did preregister their study, so there may be a more in-depth power analysis there that I haven't encountered. I may go look and see if anything there explains this.

The sample size is huge, so a significant p-value is likely to be found even when the difference in outcomes between groups is negligible. That's just the way the math works: the level of significance by itself does not tell you the effect size. Unlike significance tests, effect size is independent of sample size, so the advantage (to pro-maskers) when speaking only to statistical significance could vanish when turning to effect size. P-values are confounded with sample size; sometimes a statistically significant result means only that a huge sample was used, which is the case in this study.
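
To make that concrete, here is a quick Python sketch (the proportions are made up for illustration, not the study's numbers) of a plain two-proportion z-test showing how the same negligible absolute difference goes from nowhere near significant to "highly significant" purely by scaling up n:

# Toy illustration: the same negligible difference in proportions becomes
# "statistically significant" once the sample size is large enough.
# The proportions below are made up; they are not the study's numbers.
from math import sqrt, erfc

def two_prop_p_value(p1, p2, n_per_arm):
    """Two-sided p-value for a simple two-proportion z-test."""
    pooled = (p1 + p2) / 2
    se = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = abs(p1 - p2) / se
    return erfc(z / sqrt(2))  # two-sided normal tail probability

p_control, p_treated = 0.0076, 0.0074   # a 0.02 percentage-point difference
for n in (10_000, 100_000, 10_000_000):
    print(f"n per arm = {n:>10,}: p = {two_prop_p_value(p_control, p_treated, n):.2g}")
# With 10,000 per arm the difference is invisible (p around 0.87); with
# 10,000,000 per arm the identical difference gives p < 0.0001, even though
# the effect itself is negligible either way.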

The sample-size issue above is what the legendary Paul Meehl referred to as the "crud factor." Usually this is more of an issue with exploratory and observational studies; in experimental designs it is now handled by doing a power analysis and pre-specifying a clinically relevant effect size. In this case, it looks like they powered their study to almost perfectly catch the effect size observed (since their confidence bound just excluded zero). Whether a 9% reduction in "symptomatic seroprevalence", which bears an unknown relationship to the actual infection rate, is clinically significant is a judgement call.
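
For reference, the standard back-of-the-envelope sample-size calculation for comparing two proportions looks roughly like this. This is a sketch only; the baseline rate and target reduction plugged in below are illustrative, not taken from the paper or its pre-registration:

# Rough sample-size sketch for detecting a pre-specified relative reduction
# between two proportions (standard normal-approximation formula).
# The inputs below are illustrative placeholders, not the paper's numbers.
from math import sqrt, ceil

def n_per_arm(baseline, relative_reduction, z_alpha=1.96, z_beta=0.84):
    """Approximate subjects per arm for two-sided alpha = 0.05 and ~80% power."""
    p1 = baseline
    p2 = baseline * (1 - relative_reduction)
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return ceil(numerator / (p1 - p2) ** 2)

# e.g. a 0.76% baseline and a 9% relative reduction (hypothetical inputs):
print(n_per_arm(0.0076, 0.09))   # on the order of a couple hundred thousand per arm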

I will note from the author's Twitter statements that he chose to emphasize the effect size directly and hedged when talking about the statistical significance, saying things like (to paraphrase) "it isn't really a big difference between p = 0.06 and p = 0.04". This is true in a way, and apropos of Alex's question above, the author likely knows that the difference between his result being reported as significant or not may come down to only a few cases in raw numbers. How one should judge things like that scientifically is a tough question, and I tend to be much harsher on it than many of my colleagues were (hence why I left academia).

Of course, I don't know whether the authors accounted for things like test sensitivity/specificity, or how/if they accounted for the symptomatic filtering when estimating actual infections (it doesn't look like it). And he (the lead author) has eagerly made bold, facile, and unfounded extrapolations from the point estimate ("if a 30% increase in masks reduces 9%, 60% would reduce 18%!") without doing the kind of analysis that claim would require, even if the point estimate were valid.
 
what percentages are you using?
Applying the 0.76% and the 0.69% to the corresponding denominators in Figure 1 gives you 1116 and 1106 (rounding to the nearest whole number, because the number of cases was obviously a whole number and the given percentages were themselves longer figures that had been rounded off).
 
jh1517,
Agree with all you wrote above.

As an aside, when I first began in actuarial analysis of insurance data and programs, I used to eagerly use "significant difference" a lot in presentations to leadership. One day, one of those execs looked at me hard and told me he didn't want to hear that term anymore. He said something like, "But does it make a meaningful difference to our bottom line? Do we need to invest in it?" Point taken.
 
In my field, we'd constantly find researchers trying to report p-values for computer simulation studies (running a dynamic system with the same parameters with two different inputs 1000 times for each input and reporting it like it was a randomized experiment with 1000 subjects) :eek:. I'm all for abandoning the null-hypothesis significance testing paradigm altogether.
 
great... got it now... I repeated the same rounding mistake.

where did you find 0.7603% and 0.6899%
I got those by recalculating the percentages using 1116 and 1106 over the denominators (i.e. the same calculation the authors used), but rounding to four decimal places instead of two. I just wanted to double-check that the reduction worked out to 9.3% instead of 9.2%.
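
Spelled out, the rounding check looks like this (using only the percentages quoted in this thread, so no denominators are needed):

# Relative reduction computed from the two-decimal percentages vs. the
# four-decimal percentages quoted above (0.76%/0.69% vs. 0.7603%/0.6899%).
rounded = (0.76 - 0.69) / 0.76          # ~0.0921 -> prints as 9.2%
precise = (0.7603 - 0.6899) / 0.7603    # ~0.0926 -> prints as 9.3%
print(f"from two-decimal percentages:  {rounded:.1%}")
print(f"from four-decimal percentages: {precise:.1%}")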
 
Finally found it.

The power analysis is in the paper that was published looking at just changing mask use. But the power analysis is based on symptomatic seropositivity, and is on page 51 here.

https://cowles.yale.edu/sites/default/files/files/pub/d22/d2284.pdf


This is a bit confusing. In their power analysis they seem to be reporting a reduction in respiratory disease overall, not even "symptomatic seroprevalence", and they do not even mention test sensitivity/specificity. Then they refer to the randomized 7500 experiment with no mention of a specific power analysis for it, and yet they have never reported that straightforward result?...
 
I got those by recalculating the percentages using 1116 and 1106 over the denominators (i.e. the same calculation the authors used), but rounding to four decimal places instead of two. I just wanted to double-check that the reduction worked out to 9.3% instead of 9.2%.

thx. are 1,116 and 1,106 published anywhere or just yr calc?
 
This is a bit confusing. In their power analysis they seem to be reporting a reduction in respiratory disease overall, not even "symptomatic seroprevalence", and they do not even mention test sensitivity/specificity. Then they refer to the randomized 7500 experiment with no mention of a specific power analysis for it, and yet they have never reported that straightforward result?...
You beat me to it. They will have a lot of work to do in peer review
 
thx. are 1,116 and 1,106 published anywhere or just yr calc?

doesn't this seem like a really small number to you? I mean, we're talking about 10 tests. what if they were off by a few because of differences in the phone interviews, or the bar code on the blood, or something else? what if the real diff in the total number of positive tests in the intervention group vs. the control group were really 7 or 5?

If you're this yale scientist wouldn't you be a little bit concerned... a little less confident?
 
doesn't this seem like a really small number to you? I mean, we're talking about 10 tests. what if they were off by a few because of differences in the phone interviews, or the bar code on the blood, or something else? what if the real diff in the total number of positive tests in the intervention group vs. the control group were really 7 or 5?

If you're this yale scientist wouldn't you be a little bit concerned... a little less confident?
Or more or less than 10 due to random variance/luck of the draw in selecting those to be tested.

You are exactly correct, Alex, and no one here is disagreeing with you. Stating "See! Masks work. The study proves it" is, indeed, junk science. I think you have your answer and your points for the interview.
 
This is a bit confusing. In their power analysis they seem to be reporting a reduction in respiratory disease overall, not even "symptomatic seroprevalence", and they do not even mention test sensitivity/specificity. Then they refer to the randomized 7500 experiment with no mention of a specific power analysis for it, and yet they have never reported that straightforward result?...

With the caveat that this is above my pay grade...I think their focus on respiratory symptoms was reasonable, because that's why the results are significantly different. There was a difference in the proportion of people with respiratory symptoms between the control and intervention groups, not a difference in the proportion of people with symptoms who were COVID positive. Except you still need to take into account the drop-off in numbers when you slice out that proportion, which they didn't do.

And reading on...this is the first time they mention that they only drew blood from 7500 randomly selected subjects for pre and post serology. It was supposed to be 25,000 from their pre-registration. I don't like that.
 
thx. are 1,116 and 1,106 published anywhere or just yr calc?
I don't see anywhere that they state the specific numbers. But it doesn't really matter, because giving the percentage instead, with the denominator, is just a slightly indirect way of stating those specific numbers. It's not uncommon to find it that way. Sometimes they give you all the information you need to calculate the number, but don't also give you the number. The important part is that you can get the number.
 
doesn't this seem like a really small number to you? I mean, we're talking about 10 tests. what if they were off by a few because of differences in the phone interviews, or the bar code on the blood, or something else? what if the real diff in the total number of positive tests in the intervention group vs. the control group were really 7 or 5?

If you're this yale scientist wouldn't you be a little bit concerned... a little less confident?

Well, that's the whole point of the significance testing. What's the probability you'd see a difference at least this large just by chance, if there were really no difference between the groups? In this case, the answer was 4.3%. It's not just the raw counts: it takes into account how much the numbers varied (the variance), and works out how likely or unlikely one case more or less would be.
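
As a rough illustration of how sensitive that is to a handful of cases, here is a toy calculation. The denominators are backed out of the quoted counts and percentages (count divided by percentage), so they are only approximate, and the study's actual p-value came from a regression model rather than this simple unadjusted test:

# Toy sensitivity check: how much does a plain (unadjusted) two-proportion
# z-test move if a handful of positives shift between arms?  Denominators are
# approximations backed out of the quoted counts/percentages; the study's real
# analysis was a regression, so this is purely illustrative.
from math import sqrt, erfc

def two_prop_p(cases_a, n_a, cases_b, n_b):
    pooled = (cases_a + cases_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(cases_a / n_a - cases_b / n_b) / se
    return erfc(z / sqrt(2))

n_control = round(1116 / 0.007603)   # ~146,784
n_treated = round(1106 / 0.006899)   # ~160,313

for shift in (0, 5, 10):             # move cases from the control arm to the treated arm
    p = two_prop_p(1116 - shift, n_control, 1106 + shift, n_treated)
    print(f"shift {shift:2d} cases: p = {p:.3f}")
# Roughly p = 0.02, 0.04, 0.06 with these toy inputs -- a swing of about ten
# cases walks this unadjusted test back and forth across the 0.05 line.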
 
With the caveat that this is above my pay grade...I think their focus on respiratory symptoms was reasonable, because that's why the results are significantly different. There was a difference in the proportion of people with respiratory symptoms between the control and intervention groups, not a difference in the proportion of people with symptoms who were COVID positive. Except you still need to take into account the drop-off in numbers when you slice out that proportion, which they didn't do.

Yes, but they seem to be specifying a power analysis for respiratory symptoms and then presenting results for "symptomatic seropositive". Even granting that that is the right thing to look at, and ignoring the fact that it doesn't necessarily have a direct relationship to covid, how do they line those up themselves? You can't do a power analysis for one outcome (respiratory symptoms) and then present results for a subset of that outcome (respiratory symptoms + seropositive).

And reading on...this is the first time they mention that they only drew blood from 7500 randomly selected subjects for pre and post serology. It was supposed to be 25,000 from their pre-registration. I don't like that.

That is concerning too. I hope you may be more convinced that at least "sloppy" may be an appropriate adjective for what they have presented so far.
 
Yes, but they seem to be specifying a power analysis for respiratory symptoms and then presenting results for "symptomatic seropositive". Even granting that that is the right thing to look at, and ignoring the fact that it doesn't necessarily have a direct relationship to covid, how do they line those up themselves? You can't do a power analysis for one outcome (respiratory symptoms) and then present results for a subset of that outcome (respiratory symptoms + seropositive).

Because sample size calculations don't use outcomes, they use effect sizes. And the effect in this case shows up in the difference in respiratory symptoms. You can measure the effect in various ways (the outcome they chose was "symptomatic seropositivity", but they could have chosen something else, like respiratory symptoms), but it only matters that you are measuring the same effect, not that you are measuring the effect in the same way.

However, as I mentioned previously, when you choose to count COVID positivity (so your counts are only 22% of the counts for respiratory symptoms), I have a suspicion that it reduces your 'effective n' and they didn't take that into account.
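
A rough sketch of that "effective n" worry: with the same number of people per arm and the same relative reduction, an outcome that occurs only ~22% as often carries much less statistical information. The baseline symptom rate and arm size below are hypothetical placeholders, not the study's figures:

# With a fixed arm size and a fixed relative reduction, compare the z-statistic
# for a common outcome vs. one that occurs only 22% as often (the fraction
# mentioned above).  Baseline rate and arm size are hypothetical placeholders.
from math import sqrt

def z_stat(baseline, relative_reduction, n_per_arm):
    p1 = baseline
    p2 = baseline * (1 - relative_reduction)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    return (p1 - p2) / se

n = 150_000                                # hypothetical arm size
for base in (0.035, 0.035 * 0.22):         # e.g. symptoms vs. symptomatic + seropositive
    print(f"baseline rate {base:.4f}: z = {z_stat(base, 0.09, n):.2f}")
# The same 9% relative reduction produces a much smaller z (roughly 2.2 vs 4.8
# with these toy inputs), i.e. the rarer outcome behaves like a much smaller study.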

That is concerning too. I hope you may be more convinced that at least "sloppy" may be an appropriate adjective for what they have presented so far.

I've agreed there has been sloppiness in reporting. It was sloppiness in methodology I wasn't convinced about.
 
You can measure the effect in various ways (the outcome they chose was "symptomatic seropositivity", but they could have chosen something else, like respiratory symptoms), but it only matters that you are measuring the same effect, not that you are measuring the effect in the same way.

I think we have pretty much reached agreement on the paper, but I would say I don't know that this is correct.

Consider if I were to measure a coin for bias by performing a sequence of flips.

If I measure it by method A: predetermining that I will flip it 10 times and count the number of heads, and I get 6 heads and 4 tails (say the sequence HTHHTTHTHH), I will have a certain p-value for that experiment (calculated from a binomial distribution).

But if I measure it by method B: flipping the coin indefinitely until I get 6 heads, and I get the exact same sequence (HTHHTTHTHH), I will have a different p-value (calculated from a negative-binomial distribution).

This stuff can get very weird. In both cases I'm measuring the same thing (coin bias), arguably in the same way (flipping the coin and counting heads and tails), and in both cases I observe the same actual data (HTHHTTHTHH), but the resulting inference would be different.
 
I think we have pretty much reached agreement on the paper, but I would say I don't know that this is correct.
I don't know that it is correct, either. I'm defending it because in theory it was reasonable. But I think it would have been better not to depend upon theory, but on something more direct.

However, it should be pointed out that all this is moot, since the result was statistically significant (power analyses are for when they aren't).

And from that sample size analysis in the paper, it was obvious that they had chosen a sample size they thought was feasible, and then tested to see what effect sizes they might find under those conditions, rather than the other way around.
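
That "other way around" is easy to sketch: fix a feasible n per arm and ask what relative reduction you could plausibly detect, rather than fixing a clinically relevant reduction and solving for n. The baseline rate and the arm sizes below are placeholders for illustration, not figures from the paper:

# Minimal detectable relative reduction for a fixed arm size, using the usual
# normal approximation (two-sided alpha = 0.05, ~80% power) and treating the
# variance as roughly baseline*(1-baseline) in both arms.  Inputs are
# illustrative placeholders, not the paper's numbers.
from math import sqrt

def min_detectable_reduction(baseline, n_per_arm, z_alpha=1.96, z_beta=0.84):
    se = sqrt(2 * baseline * (1 - baseline) / n_per_arm)
    return (z_alpha + z_beta) * se / baseline

for n in (50_000, 150_000, 300_000):
    print(f"n per arm = {n:>7,}: detectable relative reduction ~ "
          f"{min_detectable_reduction(0.0076, n):.1%}")
# With an outcome rate below 1%, even very large arms only resolve relative
# reductions somewhere in the ~8-20% range with these inputs.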
 