Is there a discussion of the power of the experiment/effect size? I didn't see it.
There is no detailed power analysis in the preprint that I can see--reviewers would have required one, but it isn't uncommon for that material to be left out of the published paper or relegated to an appendix. However, in a case like this, where they powered the study for a novel, derived metric, if I were a reviewer they would have had to really get into the weeds to justify their analysis and the study's power. I see that they did preregister the study, so there may be a more in-depth power analysis there that I haven't encountered. I may look and see if anything there explains this.
The sample size is huge, so a significant p-value is likely to be found even when the difference in outcomes between groups is negligible. That's just how the math works: the level of significance by itself tells you nothing about effect size. Unlike significance tests, effect size is independent of sample size, so an advantage (to pro-maskers) that holds up in terms of statistical significance could vanish when you turn to effect size. P-values are confounded by their dependence on sample size; sometimes a statistically significant result means only that a huge sample was used, which is the case in this study.
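To make that concrete, here's a rough sketch in Python with made-up numbers (hypothetical per-arm counts and rates, not the study's data) of how a negligible absolute difference between two proportions comes out "significant" once the sample is big enough, while the effect size stays tiny:

```python
# Illustrative only: hypothetical per-arm counts, not the study's data.
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize

n = 150_000                              # hypothetical participants per arm
p_control, p_treat = 0.0080, 0.0072      # hypothetical outcome rates (0.08 percentage-point gap)

counts = [int(p_treat * n), int(p_control * n)]   # positives in each arm
z, p = proportions_ztest(counts, [n, n])          # two-sample z-test for proportions
h = proportion_effectsize(p_treat, p_control)     # Cohen's h, which does not depend on n

print(f"p-value   = {p:.4f}")    # below 0.05 at this sample size
print(f"Cohen's h = {h:.4f}")    # still a tiny effect, significant or not
```

The p-value is driven down by n alone; Cohen's h stays small no matter how many participants you add, which is the distinction being made above.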
This is the thing the legendary Paul Meehl called the "crud factor". Usually it is more of an issue with exploratory and observational studies; in experimental designs it is now handled by doing a power analysis and pre-specifying a clinically relevant effect size. In this case, it looks like they powered their study to almost perfectly catch the effect size observed (since their confidence bound just excluded zero). Whether a 9% reduction in "symptomatic seroprevalence", which bears an unknown relationship to the actual infection rate, is clinically significant or not is a judgement call.
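For reference, a minimal sketch of the kind of pre-specified power calculation being described, using statsmodels; the baseline rate and target relative reduction below are assumptions for illustration, not the study's actual parameters:

```python
# Sketch of a pre-specified power calculation; baseline rate and target
# reduction are assumptions for illustration, not the study's numbers.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.0080                    # assumed control-arm rate
relative_reduction = 0.09            # the ~9% relative reduction discussed above
treated = baseline * (1 - relative_reduction)

h = abs(proportion_effectsize(treated, baseline))   # Cohen's h for the two proportions
n_per_arm = NormalIndPower().solve_power(effect_size=h, alpha=0.05,
                                         power=0.80, ratio=1.0,
                                         alternative='two-sided')
print(f"participants needed per arm for 80% power: {n_per_arm:,.0f}")
```

With a rare outcome and a modest relative reduction like these assumed numbers, the required per-arm n runs into six figures, which is why trials on outcomes this rare end up enormous and why the significance threshold can hinge on a handful of raw cases.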
I will note from the author's Twitter statements that he chose to emphasize the effect size directly and to hedge when talking about statistical significance, saying things like (to paraphrase) "there isn't really a big difference between p = 0.06 and p = 0.04". This is kind of true in a way, and apropos of Alex's question above, the author likely knows that the difference between his reported significance and non-significance may come down to only a few cases in raw numbers. How one should judge things like that scientifically is a tough question, one I tend to be much harsher on than many of my colleagues were (hence why I left academia).
Of course, I don't know whether the authors accounted for things like test sensitivity/specificity, or how (or whether) they accounted for the symptomatic filtering when estimating actual infections (it doesn't look like it). And he (the lead author) has eagerly made bold, facile, and unfounded extrapolations from the point estimate ("if a 30% increase in masking reduces infections by 9%, a 60% increase would reduce them by 18%!") without doing the kind of analysis that extrapolation requires, even if the point estimate were valid.
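For what it's worth, the textbook way to adjust an observed positive rate for imperfect test sensitivity and specificity is the Rogan-Gladen correction. A sketch with made-up assay parameters and observed rates, not anything taken from the paper:

```python
# Rogan-Gladen correction for imperfect test sensitivity/specificity.
# All numbers are illustrative assumptions, not values from the study.
def rogan_gladen(apparent_prev, sensitivity, specificity):
    """Estimate true prevalence from the apparent (test-positive) rate."""
    est = (apparent_prev + specificity - 1.0) / (sensitivity + specificity - 1.0)
    return min(max(est, 0.0), 1.0)   # clip to a valid proportion

sens, spec = 0.85, 0.995                  # hypothetical assay characteristics
control_obs, treat_obs = 0.0080, 0.0072   # hypothetical observed positive rates

print(f"adjusted control:   {rogan_gladen(control_obs, sens, spec):.4%}")
print(f"adjusted treatment: {rogan_gladen(treat_obs, sens, spec):.4%}")
```

Even a small false-positive rate shifts both the absolute and relative differences between arms after adjustment, which is one reason glossing over test characteristics matters when turning "symptomatic seroprevalence" into claims about infections.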