Ganzfeld Experiments: Suggestions please.

Interesting work. And it's supported by data from a real experiment. In the Bierman paper I wrote about on page one, he breaks down the PRL results by target set and works out the mean chance expectation for each one, adjusting for this kind of bias. IIRC, he found the chance expectation ranged from 16% to something over 30%. This is for very few trials, though; sometimes even fewer than ten.

So I think the next natural question is: how many trials does the average Ganzfeld experiment have?

If memory serves, it's about 40 trials.
 
Hello everyone,

My attempt to get less involved is going downhill. I'll post the rest of what I sent MasterWu for completeness, and for assessment by the wider group.

------------------------------------------------------

It's simple, really; power is (roughly) a function of the size of a study (i.e. number of trials) and its effect size (ES). The higher either of these values is, the higher the power will be.

All I'm saying is that, (1) since there is a subset of ganzfeld studies involving participants selected for psi-enhancing qualities, and (2) since that subset has been shown to achieve statistically higher hit rates throughout the ganzfeld database (but especially in the Storm et al. 2010 meta-analysis of the 30 most recent studies), the power of those studies will be higher than the power of studies with unselected participants, and averaging across ALL ganzfeld studies while ignoring this will be very uninformative.

For example, if we consider ALL ganzfeld studies in Storm et al. (2010),

30 studies
Hit rate = 33.3%
Average sample size = 55
Average power = 40%

Now only unselected participants,

16 studies
Hit rate = 27.3%
Average sample size = 56
Average power = 11%

Now only selected participants,

14 studies
Hit rate = 40.1%
Average sample size = 53
Average power = 79%
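
For anyone who wants to check figures like these, here is a minimal sketch in Python (using scipy). I'm assuming a one-tailed exact binomial test at alpha = .05 against a chance rate of .25, which is my guess at how such power figures are usually computed; a different choice of test will move the numbers slightly.

from scipy.stats import binom

def binomial_power(n, p_alt, p_null=0.25, alpha=0.05):
    # Smallest hit count whose tail probability under chance is <= alpha
    k_crit = int(binom.isf(alpha, n, p_null)) + 1
    # Power = probability of reaching that count if the true hit rate is p_alt
    return binom.sf(k_crit - 1, n, p_alt)

# Hit rates and average sample sizes quoted above (Storm et al. 2010 groupings)
for label, n, hr in [("all studies", 55, 0.333),
                     ("unselected ", 56, 0.273),
                     ("selected   ", 53, 0.401)]:
    print(label, round(binomial_power(n, hr), 2))

Running this gives values in the neighbourhood of the three power figures above; small discrepancies would just reflect a different choice of test or rounding.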

Going back to the first power calculation, if we were to take it at face value, believing that it accurately represented the studies in Storm et al. (2010), we would be forced to conclude that those studies had a "typical power" of 40%. But we also found that blocking our studies into just two groups identified by one defining characteristic (i.e. participant selection) yielded power values for each group drastically different from one another: 11% vs. ~80%. Thus our mean power value of 40% was, effectively, a lie, as often happens with averages. It's the statistical equivalent of concluding, from average garment length, that all participants at a college party were wearing shorts when in fact half were in their underwear and half were wearing pants; it tells us nothing about what actually occurred at the party.

My power calculations above were meant to show that, actually, many ganzfeld studies are highly powered. Power is a finicky thing, though; even the estimates I presented are not ideal. We cover some of this in our paper. There are differences between types of selected participants, so although blocking them as I did is much more accurate, it still does not tell the full story.

I suspect both Kennedy and I would agree that quoting the average power for a group of studies simply means that the power of individual studies will fall above or below that average. This seems to be a no-brainer; I'm not sure what your point is otherwise.

----------------------------------------------

We haven't covered the judging bias objection [in our paper] because we don't think it holds water and, so far as I know, neither do any of the major skeptics. If randomization is performing as theorized, the long-run probability of selecting a correct target is always .25, in all cases, regardless of content, position, or judging patterns. Studies which use random number generators or pseudo-random generators battery-tested for randomness, where proper functioning of those machines has been ascertained prior to the experiment, will conform to this prediction.

Now, it is true that, in the short run, for individual studies, randomization will be imperfect. Sometimes targets will appear noticeably more often in certain positions than in others, opening up the potential for position effects; or, alternatively, pictures with a certain content will be selected as targets more often, perhaps allowing for content effects. But if true randomness is in operation, this is irrelevant; the probability that any content preference or position preference would happen to coincide with a spurious inflation of a specific piece of content or a particular position is equal to the significance level obtained in the study. After all, that significance level only tells you the probability of finding a hit rate like the one you found, or higher, by chance. Saying that your bias happened to coincide with the bias of the random number generator in your experiments is only saying that you got lucky; your result is explicable by chance variation. So if your particular p-value was, say, .05, then you would know that you would only be this lucky (or luckier) 5% of the time.
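To make this concrete, here is a minimal simulation sketch in Python. The heavily position-biased judge below is a hypothetical illustration of my own, not a model of any actual study; the point is only that, with truly random targets, the long-run hit rate stays at .25 no matter how lopsided the judge's preferences are.

import numpy as np

rng = np.random.default_rng(0)
n_trials = 1_000_000

# Truly random targets: each of the four positions is the target with probability .25
targets = rng.integers(0, 4, size=n_trials)

# A hypothetical judge who strongly prefers one position (55% vs 15% for the others)
choices = rng.choice(4, size=n_trials, p=[0.15, 0.15, 0.55, 0.15])

print((targets == choices).mean())  # hovers around 0.25 despite the judging bias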

I see. The universe sorta keeps track, so when the randomizer has spit out an unexpectedly excessive number of targets in the 3rd position, or has picked a target so often that you would only get this lucky 1 time in 80,000 tries, subsequent randomizations will show a relative dearth of that position and target, so it will even out overall. That's why, when the roulette wheel has come up with 3, 4, or 5 reds in a row, it's safe to bet on black, because it's due to come up in order to even out the reds and blacks in the long run.

What's remarkable about the ganzfeld database is that people are a lot luckier, a lot more often. For all studies following the autoganzfeld, for example, 25% are at least that lucky, and the probability that this or a better result would occur by chance can be calculated to be less than 1 part in 5,200,000.

Simply put, I think that if any simulation shows that random sequences can correlate with non-random sequences, something must be wrong with the simulation.

Ignoring the inflated alpha errors associated with small study effects, I was curious about the experiments mentioned by Master Wu in his first post which seem to contradict the idea that the error corresponds to a description of chance. Have you read this book? I just ordered it so I can look at this in more detail.

http://www.amazon.com/The-Challenge-Chance-Experiment-Unexpected/dp/0394485114

Linda
 
Linda, can you summarize anything interesting you find in the book, once you get it? I would really appreciate that. I also came across the book, but I don't live in America, and there isn't a Kindle version.
 
Johaan, I wanted to know something concerning the issue with Kennedy and the 40% hit rate of selected participants (meditators, artists, and all that). I recall an old thread in the other forum (the mind-body forum) where you, Ersby, and Maarat (or something like that) analyzed some studies that weren't included in Storm et al (2010), apparently because he didn't know about them. In that thread, you guys showed that the non-included studies seem to show the inverse situation: selected participants doing worse than unselected participants, and that this trend was pervasive back into the 70s.

Now, you then put these studies into the Storm et al analysis, and although that didn't eliminate its statistical significance, it did reduce the selected participants' 40% to something like 37%, a 3-point drop. How much does this affect the power of the studies, if the alleged true hit rate for selected participants is 37% and not 40%?
 
I also wanted to know: why doesn't the Ganzfeld show a funnel plot as expected from a real effect? I hope Ersby, or you, can explain this to me, especially Ersby, since he devised the non-funnel graph. You said Radin didn't specify the effect size he used, but is there a way of figuring this out?
 
On the other hand, Bem's second way of modelling the content bias and testing for its presence found that it rendered the previously significant difference only marginally significant (two-tailed). And then Bierman's third way of modelling the content bias found that it rendered the significant difference in scoring on dynamic vs. static targets non-significant.

Bem's second way of modeling the content bias was a two sample comparison test, which has less power than a one sample test. They answer different questions about the data. Note that the first method Bem used produced a p-value that was virtually indistinguishable from the original one, which is why it rounded to the same decimal place. Bierman's finding could very easily be explained by coincidence. Randomness is randomness.

Johann's response is making me a bit uneasy. My criticism, as well as Kennedy's, has been that parapsychologists have focused on debunking debunkers instead of focusing on performing well-designed, well-powered experiments. Too much emphasis has been put on Bem coming up with tests (of unknown validity and reliability for detecting the effects of the bias in the first place) which attempt to rule out an effect from one kind of bias, as a way to claim that it doesn't matter that the ganzfeld tests aren't well performed. Kennedy's and my point has always been that the research shows pretty clearly that this doesn't work, and that the better strategy is to move forward with performing well-designed and well-powered experiments instead. Especially since there seems to be general agreement that, regardless of whether or not "it accounts for all of the effects in the ganzfeld studies", these are just some of the ways in which the hit rates can be biased (another example would be autoganzfeld session 302). Why Johann's response makes me uneasy is that Johann previously told me that he and Maaneli had co-authored a paper in which they recommended that the way to move forward was to perform experiments in ways which reduce bias. So it now seems like a step backwards for him to criticize Kennedy and me for making that same suggestion, by bringing up Bem's attempts to rationalize away attempts to address bias.

Linda, you state repeatedly that the ganzfeld studies aren't well designed. You are entitled to that opinion. However, part of the basis for this conclusion has been Kennedy's report, and specifically his suggestion that the power of ganzfeld studies is low. You specifically asked me in another thread why I thought parapsychologists had not already taken care to use their best subjects to rectify this. Well, my response is that many of them have.

As I have implied before, we are under-informing ourselves when we focus only on the summary measure of a meta-analysis. The power calculations done by Max have taught me something that in retrospect should have been obvious; that is, that the statement "parapsychologists do (or have been doing) x" is usually a misleading one. IMO, the attempt to generalize over the whole field is a weakness that infects arguments from both advocates and counter-advocates of parapsychology, on a consistent basis. If we shift our focus away from the false homogeneity that has often been implied for the field, to its true diversity, we can abjure some of our misconceptions and obtain a more realistic picture of the situation. Nothing aids improvement like an accurate understanding of where we are now.

For example, if we look at the success of experiments specifically designed to test artists and musicians because of their previously reported higher hit rates, we find a string of breathtaking successes with very high power (please note: these studies were aggregated in Max's power paper; I owe him a great debt of gratitude for pointing them out to me):

Bem & Honorton 104/105b (1988): N = 20; HR = 50%; z = 2.20; p = .013 - Original Study

Morris et al. (1993): N = 32; HR = 41%; z = 1.78; p = .037
McDonough et al. (1994): N = 20; HR = 30%; z = 1.02; p = .382
Morris et al. (1995): N = 97; HR = 33%; z = 1.67; p = .047
Dalton (1997): N = 128; HR = 47%; z = 5.20; p = 7.072*10^-08
Parker & Westerlund study 4 (1998): N = 30; HR = 47%; z = 2.40; p = .008
Morris, Summers & Yim (2003): N = 40; HR = 38%; z = 1.64; p = .054

Let's do the meta-analysis.

For all studies: N = 367; X = 152; HR = 41.4%; z = 7.26; p = .00000000000436

For all studies minus the first study: N = 347; X = 142; HR = 40.92%; z = 6.85; p = 0.0000000000591

Note the remarkable characteristic of these studies: to my knowledge, there is nothing post hoc about them! Every single one mentions its intention to use the results of previous studies with high-scoring creatives to enhance their own results (by using creatives); a meta-analysis on this subset is therefore not only wholly justified, but probably more informative as to the reality of psi than a meta-analysis which takes an all-inclusive approach. There is also the benefit that little to no selection bias is likely to exist for them, since studies specifically using creatives are well-known, and since there is very little ambiguity in the single criterion "selection for creativity". Furthermore, all confirmatory studies reached independently significant results except for one, for a 5/6 or 83.33% proportion of positive results. Putting aside the fact that a binomial test with a 5% alpha on the study count here is wildly conservative, the probability of 5/6 significant studies or more is p = 0.0000018.
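
For those who want to check these numbers, here is a short Python sketch of the two calculations above: a z score on the pooled hit counts, and the binomial probability of 5 or more of the 6 confirmatory studies reaching significance if each had only a 5% chance of doing so. The exact p for the pooled count may differ a little from the figure quoted, depending on whether a normal approximation or an exact binomial is used.

from math import sqrt
from scipy.stats import binom

# Pooled counts quoted above for the creative-subject studies
hits, trials, p0 = 152, 367, 0.25
z = (hits - trials * p0) / sqrt(trials * p0 * (1 - p0))
p_exact = binom.sf(hits - 1, trials, p0)   # exact binomial, one-tailed
print(round(z, 2), p_exact)

# Probability that 5 or more of 6 studies reach p < .05 when each has a 5% chance
print(binom.sf(4, 6, 0.05))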

In spite of my argument against the file-drawer, I think it may still be worthwhile to apply an internal consistency check; that is, to ascertain whether selection bias is a viable hypothesis. We can do this with the Ioannidis & Trikalinos (2007) excessive significance test (applied by Francis, 2012, to Bem's studies), which uses the pooled ES to predict how many studies should reach significance. Although the test is impaired in the presence of significant heterogeneity, these studies are not significantly heterogeneous. So, we find that 4.46 studies should have reached significance, and 5 did. Looks pretty good to me, but not, as Francis would say, "too good to be true". Considering that the test is overly generous towards the file-drawer when the true power is greater than .5 (which it inevitably is for these studies), these results would be very difficult to explain by selective reporting, whether of studies or of individual trials.
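
The excessive significance check above amounts to summing each study's power at the pooled effect size and comparing that sum with the number of studies that actually reached significance. A sketch, again assuming an exact binomial test and using the pooled hit rate as the alternative (the expected count will shift a little depending on which effect-size estimate is plugged in):

from scipy.stats import binom

def binomial_power(n, p_alt, p_null=0.25, alpha=0.05):
    k_crit = int(binom.isf(alpha, n, p_null)) + 1
    return binom.sf(k_crit - 1, n, p_alt)

# Sample sizes of the six confirmatory creative-subject studies listed above
sample_sizes = [32, 20, 97, 128, 30, 40]
pooled_hr = 0.414  # pooled hit rate from the meta-analysis above

expected = sum(binomial_power(n, pooled_hr) for n in sample_sizes)
print(round(expected, 2))  # expected count of significant studies, to compare with the observed count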

For a collection of just six replication studies to demonstrate a cumulative deviation of greater than six sigma (which I'll remind people again is the threshold for the discovery of the Higgs) is nothing short of flabbergasting. Yet how well-known is this fact? I've never seen it mentioned in the parapsychology literature, or the skeptical literature, with the exception of Max's still unpublished power paper, and even then without this meta-analysis. My intention is to publish a small analysis of these studies eventually, in which I do a systematic statistical and methodological survey of each study and identify success characteristics.

Note: The removal of Dalton (1997) leaves a binomial p = 0.0000309 and a z = 4.24

So let's recapitulate. These six studies cover a period from 1993 to 2003. According to Storm et al, they would constitute about 10% of the 60-study post-PRL database. That's 10% of studies that have gone specifically towards the confirmation of a hypothesis, with astounding success. But it's only a fraction of the picture, because when we also consider selected subjects in general, where the finding is slightly weaker (owing to the marginally higher ambiguity in the "selected" criterion), we still find 14 studies in the database of Storm et al (2010)**, which is about 47% (14/30) of the studies conducted from 1997 to 2008. That 47% of studies has much greater power than the rest of the Storm et al database, even when we remove the studies with creatives.

But let's do that anyway. We remove Dalton (1997), Parker & Westerlund (1998), and Morris, Summers & Yim (2003) from Storm et al (2010). That still leaves a hit rate of 40.18%, with an average sample size of 50 and a power of 77%.

In sum:

Kennedy says in his paper the following:

By the usual methodological standards recommended for experimental research, there have been no well-designed ganzfeld experiments. Based on available data, Rosenthal (1986), Utts (1991), and Dalton (1997b) described 33% as the expected hit rate for a typical ganzfeld experiment where 25% is expected by chance. With this hit rate, a sample size of 201 is needed to have a .8 probability of obtaining a .05 result one-tailed. No existing ganzfeld experiments were preplanned with that sample size. The median sample size in recent studies was 40 trials, which has a power of .22.

Kennedy also says that "cases with a smaller number of studies and/or possible methodological problems*** sometimes have replication rates outside of this range", which is perhaps an indication that he at least remembered some of what Max mentioned in their email exchange. But the image conferred on ganzfeld research by Kennedy's paragraph is, IMO, pretty convincingly false. Why? Because 37% of ganzfeld experiments in the most recent meta-analysis by Storm et al (2010) have a power (based on their mean and median sample sizes, which both happen to be 50) in the vicinity of 77%. An unknown proportion of these have been designed to be confirmatory (in other words, I don't know, but the data are out there for anyone to find out). A further 10% have a power of around 69%, possessing both a larger effect size and the virtue of all being confirmatory (but with a smaller median sample size of n = 36).
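
Kennedy's own figures in the quoted paragraph can be checked with the same kind of exact-binomial power calculation sketched earlier (one-tailed, alpha = .05, chance rate .25); this is an assumption about his method, but the numbers come out close to his.

from scipy.stats import binom

def binomial_power(n, p_alt, p_null=0.25, alpha=0.05):
    k_crit = int(binom.isf(alpha, n, p_null)) + 1
    return binom.sf(k_crit - 1, n, p_alt)

# 40 trials at a 33% hit rate: should land near the .22 Kennedy quotes
print(round(binomial_power(40, 0.33), 2))

# 201 trials at a 33% hit rate: should land near his .8 target
# (the discreteness of the exact test can nudge this slightly above or below .8)
print(round(binomial_power(201, 0.33), 2))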
 
Continuation of post:

Can this situation improve? Absolutely. We propose how that may be done in our paper, taking note of many of Kennedy's suggestions (e.g. prospective designs based on power analysis—good ones, mind you—multiple experimenter protocols, higher effect sizes, large sample sizes, etc) as well as Wiseman's suggestions. There's already a study in the Koestler parapsychology unit registry that uses Max's power predictions to prospectively plan its sample size. We do not, however, want to propose changes which will drain the resources of parapsychologists on what we feel are unevidenced sources of bias. That wouldn't endear us to many people, and it wouldn't make progress.

I'm somewhat disappointed myself that you insist on upholding claims about randomization, the evidence for which is (at best) very weak. Moreover, those claims are a priori unlikely; when people use RNGs or PRNGs, they expect randomness. This is the gold standard in medicine, psychology, and other fields. Notice how exactly none of the criticisms of Bem's precognition studies from the wider scientific community have centered around the randomness of his sequences; they place their focus, rather, on more impactful flaws such as questionable research practices and file-drawer issues. Concerns about randomness in parapsychology have been there since the days of Rhine; they come from an older skepticism very much concerned with sensory cues (and ignorant of many statistical flaws), where parapsychology has undoubtedly prevailed; what remains to be overcome now are problems of selective inclusion, variability in design, power, etc. Let's face it, flaws in the randomization of functioning RNGs are very unlikely to do anything but raise or lower a score by less than a percentage point over several studies. Just think about RNG studies; they use the same sources of randomization, and yet have an effect size that is orders of magnitude smaller than free-response ESP studies; last I looked their meta-analyses have measured shifts of fractions of a percent. That's because RNGs work. And even if their entire effect is derived from randomization flaws, that's still nothing compared to ESP studies. This isn't rationalization, IMO, it's logic!

In closing, I maintain that it is essential that the strengths of the present database be acknowledged, before there can be movement. This is where I believe Kennedy et al err; in presenting only criticisms of the research, they have alienated many parapsychologists. The impulse to do great things is stifled when recognition for good things is withheld; meeting on common ground often means conceding some ground. We concede some ground in our paper to both sides of the debate, because we want something from both sides. It's not 50/50 because the ganzfeld studies have rebutted every specific skeptical proposition we examined—but only after checking them ourselves. Most weren't ruled out by default. A database more apt to gain the attention of the mainstream is one which possesses a priori refutations to all the skeptical criticisms we looked at, where no after-the-fact analysis is necessary. That's all I'll say for now.

-------------

** Note that this is not the post-PRL database but the post-Milton & Wiseman database; I used the smaller post-MW because the stats for those studies are readily available to me.

*** For artistic subjects, I would note that we only have quality ratings for Dalton (1997), Parker & Westerlund study 4 (1998), and Morris, Summers & Yim (2003), since they were in the time frame for the Storm et al meta-analysis. Of these, Dalton, with the most significant result, achieved a perfect rating of 1. Parker & Westerlund, with the second most significant result, achieved a rating of .96. And Morris, Summers & Yim, with the least significant result, achieved a quality rating of .85.
 
I also wanted to know: why doesn't the Ganzfeld show a funnel plot as expected from a real effect? I hope Ersby, or you, can explain this to me, especially Ersby, since he devised the non-funnel graph. You said Radin didn't specify the effect size he used, but is there a way of figuring this out?

In the presence of significant between-studies heterogeneity, the funnel plot will incorrectly conclude bias where none exists.

"One limitation of the trim and fill method [based on the funnel plot] is it assumes that the sampling error is the key source of variation in a set of studies. This may not be the case in many studies, which are often quite heterogeneous due to both methodological and substantive differences among primary studies. A simulations study (Terrin et al., 2005) confirmed that when trim and fill is applied to heterogeneous data sets, it can adjust for publication bias when none actually exists."

http://thescipub.com/html/10.3844/ajassp.2012.1512.1517
 
Johaan, I wanted to know something concerning the issue with Kennedy and the 40% hit rate of selected participants (meditators, artists, and all that). I recall an old thread in the other forum (the mind-body forum) where you, Ersby, and Maarat (or something like that) analyzed some studies that weren't included in Storm et al (2010), apparently because he didn't know about them. In that thread, you guys showed that the non-included studies seem to show the inverse situation: selected participants doing worse than unselected participants, and that this trend was pervasive back into the 70s.

Now, you then put these studies into the Storm et al analysis, and although that didn't eliminate its statistical significance, it did reduce the selected participants' 40% to something like 37%, a 3-point drop. How much does this affect the power of the studies, if the alleged true hit rate for selected participants is 37% and not 40%?

Ersby found about four studies that hadn't been included in Storm et al, which used selected participants. I say "about" because IMO Wiseman's public demonstration with a sample size of ten ordinarily wouldn't be considered a real ganzfeld study, nor the Bierman series with 7 trials, and the Howard & Delgado (2005) study was actually just one excluded experiment, but with a sample size of 50. I don't actually think Howard & Delgado used selected subjects according to the definition of Storm et al; they just retested some participants that had scored a hit in their last session (at least 25% of which were just lucky). Ersby said they fulfill the condition of "prior psi testing", but I noted that this is not how Honorton had used that definition (Storm et al didn't explain what it meant). In any case, I'm willing to bet that none of the 14 studies in the Storm et al database used that criterion as their sole selectedness trait, largely because I don't imagine there's anything very special about people who happen to score one ganzfeld hit.

So, if we don't accept any of my objections, there were a total of 127 trials and 22 hits excluded, for a hit rate of 17.32%. Adding this to the selected hit rate of Storm et al yields about a 37% HR. Alternatively, if my objections are considered, we obtain 58 trials and 12 hits with a 21% HR. This only reduces the Storm et al hit rate to 39%. The single study added in this case was Parker & Sjoden (2008), which it should be noted presented all of the ganzfeld images to its subjects subliminally during the ganzfeld state; a strange idea very deviant from the regular ganzfeld protocol.
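
The arithmetic here is just pooling hit counts. A sketch, where the Storm et al selected-subject totals are my own reconstruction from the figures quoted earlier in the thread (14 studies averaging 53 trials at a 40.1% hit rate), so treat them as approximate:

# Approximate totals for the selected-subject studies in Storm et al. (2010),
# reconstructed from 14 studies x ~53 trials at ~40.1% (so treat as rough)
storm_trials = 14 * 53
storm_hits = round(storm_trials * 0.401)

def pooled_hit_rate(extra_trials, extra_hits):
    return (storm_hits + extra_hits) / (storm_trials + extra_trials)

print(round(pooled_hit_rate(127, 22), 3))  # all excluded sessions counted: about .37
print(round(pooled_hit_rate(58, 12), 3))   # only Parker & Sjoden (2008) counted: about .39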

Some readers may object that I'm using ad hoc criteria to confirm my hypothesis, and that what these excluded sessions show is very simply that the selected subject hit rate isn't the ripe bananas I made it out to be. This is a legitimate criticism. But IMO it uses a definition of the file-drawer that is rather basic; that is, it takes the position that in order to come up with a reasonable estimate for the hit rate of selected subject studies we must have the hits and misses of every selected ganzfeld session ever conducted. Note that there isn't a single experiment or series in the 30 study ganzfeld database by Storm et al, selected or unselected, that uses a sample size of 10 or below, like the Wiseman and Bierman series. The only one (unselected) that even comes close is Roe & Flint (2007), with 14 trials, but they were also the only study to use an 8-choice design, which boosts their power to compensate.

It's a constructive exercise to examine a couple of questions we could ask about the ganzfeld database, to see for which ones these studies would really be "in the file-drawer".

(1) The basic "out of all selected subject trials ever conducted, from 1997 to 2008, what is the true hit rate?"

Excluded studies: Parker & Sjoden (2008), Wiseman (2000), and Welzman & Bierman (1997) series IV **

(2) "What hit rate am I likely to obtain with a reasonably sized study using all selected subjects, if it could be either exploratory or confirmatory?"

Excluded studies: Parker & Sjoden (2008)

(3) "If I prospectively plan a reasonably sized confirmatory study with selected participants, what hit rate am I likely to obtain?"

Excluded studies: None of those that Ersby found, but probably several in the Storm et al meta-analysis would have to be removed. IMO, the hit rate would be higher, but this remains to be ascertained by a good review.

** I really insist that Howard & Delgado used unselected subjects.
 
So, if we don't accept any of my objections, there were a total of 127 trials and 22 hits excluded, for a hit rate of 17.32%. Adding this to the selected hit rate of Storm et al yields about a 37% HR. Alternatively, if my objections are considered, we obtain 58 trials and 12 hits with a 21% HR. This only reduces the Storm et al hit rate to 39%. The single study added in this case was Parker & Sjoden (2008), which it should be noted presented all of the ganzfeld images to its subjects subliminally during the ganzfeld state; a strange idea very deviant from the regular ganzfeld protocol.

Hmm... I think there is a lot to discuss here; however, I have to ask: with a 37% hit rate for selected participants, has there ever been a study done with selected persons that reaches the sample size needed for adequate power? I might ask the same question for the reduced 39% hit rate.
 
Ignoring the inflated alpha errors associated with small study effects,

The correlation between sqrt(n) and z here is positive and significant at p = .002, and an inspection of the residuals plot shows that a linear fit is appropriate; ergo, the larger the number of trials in the post-PRL database, the more positive the results.
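
For anyone who wants to rerun this check on their own extraction of the post-PRL studies, the calculation is just a Pearson correlation between sqrt(n) and z, plus a look at the residuals of a straight-line fit. The arrays below are placeholders of my own, not the actual database:

import numpy as np
from scipy.stats import pearsonr, linregress

# Placeholder values; substitute the per-study trial counts and z scores
# from whichever post-PRL extraction you are using
n_trials = np.array([40, 50, 60, 97, 128, 30, 36])
z_scores = np.array([0.5, 1.1, 0.9, 1.7, 5.2, 2.4, 1.6])

r, p = pearsonr(np.sqrt(n_trials), z_scores)
print(r, p)

fit = linregress(np.sqrt(n_trials), z_scores)
residuals = z_scores - (fit.intercept + fit.slope * np.sqrt(n_trials))
print(residuals)  # eyeball these for any obvious non-linearity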

I was curious about the experiments mentioned by Master Wu in his first post which seem to contradict the idea that the error corresponds to a description of chance. Have you read this book? I just ordered it so I can look at this in more detail.

http://www.amazon.com/The-Challenge-Chance-Experiment-Unexpected/dp/0394485114

Linda

Let us know what you find.
 
Ignoring the inflated alpha errors associated with small study effects, I was curious about the experiments mentioned by Master Wu in his first post which seem to contradict the idea that the error corresponds to a description of chance. Have you read this book? I just ordered it so I can look at this in more detail.

http://www.amazon.com/The-Challenge-Chance-Experiment-Unexpected/dp/0394485114

Linda

I didn't realise that had been mentioned in the opening post. I've read that book (my edition had the title "The Challenge of Chance: Experiments and Speculations"). It's pretty interesting but also a bit tough to get through: effectively, the authors are trying to place a narrative on chance events.

The mass telepathy experiment involved multiple senders and multiple receivers, all in the same room. The receivers sat in cubicles set up in an auditorium while the senders viewed the targets (drawings, slides or symbols) on a stage. The judging was not done blind: the receivers' drawings were compared directly to the target and a decision was made as to whether each was a hit or not.

They found 35 hits out of 2112 responses, a 1.6% hit rate. Pretty low, but the authors note that these were remarkably good fits, and they give some examples in the book. There was another interesting occurrence: that of coinciding answers. The authors noted that occasionally there would be a cluster of people writing/drawing similar responses.

They followed up the first experiments, which used 20 receivers, with a smaller-scale replication using 9 receivers.

They then compared their results to a control experiment. In this case they took 20 responses from different days and collated them to make one set of responses to a randomly selected target, and then judged this in the same way as before.

The results for the control were the same as for the two genuine experiments, and the number of coinciding answers was also almost exactly the same.

After this, the book is then mostly about the authors trying to find more patterns in their otherwise identical experiments, in an attempt to differentiate between the genuine and the control data. This leads to them working with strings of random numbers, which is what the quote in the opening post on this thread refers to.
 
Bem's second way of modeling the content bias was a two sample comparison test, which has less power than a one sample test.

It's a paired test which has higher power than an unpaired two-sample test.
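
The power difference is easy to see in a toy simulation (the effect size and correlation below are arbitrary illustration values of mine, not estimates from the ganzfeld data): when the two measurements on each subject are positively correlated, a paired t-test rejects more often than an unpaired two-sample t-test on the same data.

import numpy as np
from scipy.stats import ttest_rel, ttest_ind

rng = np.random.default_rng(1)
reps, n, effect, rho = 2000, 30, 0.5, 0.6  # arbitrary illustration values

paired_rejections = unpaired_rejections = 0
for _ in range(reps):
    # A shared per-subject component induces correlation rho between the two conditions
    shared = rng.normal(size=n)
    a = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(size=n)
    b = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(size=n) + effect
    paired_rejections += ttest_rel(a, b).pvalue < 0.05
    unpaired_rejections += ttest_ind(a, b).pvalue < 0.05

print("paired power:  ", paired_rejections / reps)
print("unpaired power:", unpaired_rejections / reps)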

They answer different questions about the data.

Agreed, which was sorta my point.

Bierman's finding could very easily be explained by coincidence. Randomness is randomness.

That was the point. When trying to figure out ways to increase the effect size, it doesn't help to suggest "use dynamic targets" if this was due to a coincidence which one wouldn't expect to reappear in subsequent studies.

Linda, you state repeatedly that the ganzfeld studies aren't well designed.

I've done more than that. I've gone through the studies and designs in detail, using the information and recommendations from GRADE, in order to assess the risk of bias for each element, and then reviewed the recommendations for how to address each risk. This is a way to get away from making this merely about personal opinion, and instead to use standards which have already undergone testing for reliability and validity.

However, part of the basis for this conclusion has been Kennedy's report, and specifically his suggestion that the power of ganzfeld studies is low.

Not in my case. My conclusions are based on looking at the studies, not based on Kennedy's report. I've only recently seen Kennedy's report, and I find it interesting that he raises many of the same issues and makes the same recommendations as I have made over the years. But that's not a surprise since he and I are both physicians and are likely coming at this from the perspective of evidence-based practices (as well as being more intimately familiar with all the ways in which the system can be gamed, since there can be powerful incentives to do so in the healthcare market).

As I have implied before, we are under-informing ourselves when we focus only on the summary measure of a meta-analysis. The power calculations done by Max have taught me something that in retrospect should have been obvious; that is, that the statement "parapsychologists do (or have been doing) x" is usually a misleading one. IMO, the attempt to generalize over the whole field is a weakness that infects arguments from both advocates and counter-advocates of parapsychology, on a consistent basis. If we shift our focus away from the false homogeneity that has often been implied for the field, to its true diversity, we can abjure some of our misconceptions and obtain a more realistic picture of the situation. Nothing aids improvement like an accurate understanding of where we are now.

Agreed.

For example, if we look at the success of experiments specifically designed to test artists and musicians because of their previously reported higher hit rates, we find a string of breathtaking successes with very high power (please note: these studies were aggregated in Max's power paper; I owe him a great debt of gratitude for pointing them out to me):

Bem & Honorton 104/105b (1988): N = 20; HR = 50%; z = 2.20; p = .013 - Original Study

Morris et al. (1993): N = 32; HR = 41%; z = 1.78; p = .037
McDonough et al. (1994): N = 20; HR = 30%; z = 1.02; p = .382
Morris et al. (1995): N = 97; HR = 33%; z = 1.67; p = .047
Dalton (1997): N = 128; HR = 47%; z = 5.20; p = 7.072*10^-08
Parker & Westerlund study 4 (1998): N = 30; HR = 47%; z = 2.40; p = .008
Morris, Summers & Yim (2003): N = 40; HR = 38%; z = 1.64; p = .054

Let's do the meta-analysis.

For all studies: N = 367; X = 152; HR = 41.4%; z = 7.26; p = .00000000000436

For all studies minus the first study: N = 347; X = 142; HR = 40.92%; z = 6.85; p = 0.0000000000591

Note the remarkable characteristic of these studies: to my knowledge, there is nothing post hoc about them! Every single one mentions its intention to use the results of previous studies with high-scoring creatives to enhance their own results (by using creatives); a meta-analysis on this subset is therefore not only wholly justified, but probably more informative as to the reality of psi than a meta-analysis which takes an all-inclusive approach. There is also the benefit that little to no selection bias is likely to exist for them, since studies specifically using creatives are well-known, and since there is very little ambiguity in the single criterion "selection for creativity". Furthermore, all confirmatory studies reached independently significant results except for one, for a 5/6 or 83.33% proportion of positive results. Putting aside the fact that a binomial test with a 5% alpha on the study count here is wildly conservative, the probability of 5/6 significant studies or more is p = 0.0000018.

In spite of my argument against the file-drawer, I think it may still be worthwhile to apply an internal consistency check; that is, to ascertain whether selection bias is a viable hypothesis. We can do this with the Ioannidis & Trikalinos (2007) excessive significance test (applied by Francis, 2012, to Bem's studies), which uses the pooled ES to predict how many studies should reach significance. Although the test is impaired in the presence of significant heterogeneity, these studies are not significantly heterogeneous. So, we find that 4.46 studies should have reached significance, and 5 did. Looks pretty good to me, but not, as Francis would say, "too good to be true". Considering that the test is overly generous towards the file-drawer when the true power is greater than .5 (which it inevitably is for these studies), these results would be very difficult to explain by selective reporting, whether of studies or of individual trials.

For a collection of just six replication studies to demonstrate a cumulative deviation of greater than six sigma (which I'll remind people again is the threshold for the discovery of the Higgs) is nothing short of flabbergasting. Yet how well-known is this fact? I've never seen it mentioned in the parapsychology literature, or the skeptical literature, with the exception of Max's still unpublished power paper, and even then without this meta-analysis.

Maybe you were unfamiliar with this research previously, but creativity and artistic populations have been discussed in the parapsychology literature for years. Heck, Nicola Holt's paper from 2007 combines the same papers as you do above (except that one of her six papers is different from one of yours) and comes up with the same result (http://www.academia.edu/695143/Are_..._and_psi_with_an_experience-sampling_protocol). That artistic populations and creative subjects have higher hit rates was frequently brought up on the JREF forum when I participated there years ago.

Linda
 
Continuation of post:

Can this situation improve? Absolutely. We propose how that may be done in our paper, taking note of many of Kennedy's suggestions (e.g. prospective designs based on power analysis—good ones, mind you—multiple experimenter protocols, higher effect sizes, large sample sizes, etc) as well as Wiseman's suggestions. There's already a study in the Koestler parapsychology unit registry that uses Max's power predictions to prospectively plan its sample size. We do not, however, want to propose changes which will drain the resources of parapsychologists on what we feel are unevidenced sources of bias. That wouldn't endear us to many people, and it wouldn't make progress.

I mostly agree. I would probably differ from you a bit in what gets called a "good" power analysis. And I would prefer to avoid even the sources of bias that you "feel are unevidenced". Especially because identifying the effects of bias depends upon comparisons between the results in the presence and in the absence of that bias. Post hoc rebuttals of criticisms have performed poorly when it comes to determining whether or not a source of bias has contributed to an effect. So your emphasis on avoiding the former in favour of the latter is what makes me uneasy.

I'm somewhat disappointed myself that you insist on upholding claims about randomization, the evidence for which is (at best) very weak.

I'm wondering if you understand the claim, because your explanation of why you think it's unreasonable is not relevant to the claim. The expected hit rate in the absence of psi is the conjunction of the randomly selected targets and the subject/experimenter/judges' response biases. As far as I know, nobody disputes this, and the presence of response biases is well established (a couple of which have been identified). Note that this does not apply to most medical studies (if any). Fortuitous and non-fortuitous combinations will occur. I also don't think anybody disputes this. None of this depends upon RNGs or PRNGs behaving badly.

In theory, these combinations can be described by random selection. In practice? Don't know. However, this is a uniquely fertile field for any process which acts to select fortuitous combinations. The argument is that we can depend upon there being zero selection (excluding DAT). My claim is that this is unconvincing in the absence of valid and reliable tests of the idea, especially in the presence of human nature (but also, maybe not, considering the testing in "The Challenge of Chance"). I should also add that it's unconvincing in the face of multiple examples of actual fortuitous selection acting on the ganzfeld studies, as well as on other parapsychology research programs (for example, you chose to report a p-value of 0.03 as the result of Wackermann's study, when the p-value for the analysis which Wackermann stated was more appropriate was 0.064).

(I also happen to have a side claim that fortuitous combinations are over-represented in the absence of selection, but that idea awaits further development, and I believe, at the moment, that its influence could only be trivial, at best.)

Moreover, those claims are a priori unlikely; when people use RNGs or PRNGs, they expect randomness. This is the gold standard in medicine, psychology, and other fields. Notice how exactly none of the criticisms of Bem's precognition studies from the wider scientific community have centered around the randomness of his sequences; they place their focus, rather, on more impactful flaws such as questionable research practices and file-drawer issues. Concerns about randomness in parapsychology have been there since the days of Rhine; they come from an older skepticism very much concerned with sensory cues (and ignorant of many statistical flaws), where parapsychology has undoubtedly prevailed; what remains to be overcome now are problems of selective inclusion, variability in design, power, etc. Let's face it, flaws in the randomization of functioning RNGs are very unlikely to do anything but raise or lower a score by less than a percentage point over several studies. Just think about RNG studies; they use the same sources of randomization, and yet have an effect size that is orders of magnitude smaller than free-response ESP studies; last I looked their meta-analyses have measured shifts of fractions of a percent. That's because RNGs work. And even if their entire effect is derived from randomization flaws, that's still nothing compared to ESP studies. This isn't rationalization, IMO, it's logic!

Yeah, the problem is that logic describes an idealized situation, not the messy world of experimentation.

In closing, I maintain that it is essential that the strengths of the present database be acknowledged, before there can be movement. This is where I believe Kennedy et al err; in presenting only criticisms of the research, they have alienated many parapsychologists. The impulse to do great things is stifled when recognition for good things is withheld; meeting on common ground often means conceding some ground.

That's probably a good point. This was a frequent complaint from residents and students at medical journal clubs (where a paper is chosen to be analyzed in detail). By the time a paper has been picked apart, it's easy to lose sight of whether and where there is any merit to the results.

That was why I was very careful to identify all the areas in which the ganzfeld was at low risk of bias when I went through each step of the GRADE process, in my previous evaluation.

We concede some ground in our paper to both sides of the debate, because we want something from both sides. It's not 50/50 because the ganzfeld studies have rebutted every specific skeptical proposition we examined—but only after checking them ourselves. Most weren't ruled out by default. A database more apt to gain the attention of the mainstream is one which possesses a priori refutations to all the skeptical criticisms we looked at, where no after-the-fact analysis is necessary. That's all I'll say for now.

Well, I can't tell from that that you've conceded any ground. You give the impression that you think "rebutting skeptical propositions" has an effect on the real goal - performing experiments at low risk of bias - when it doesn't. What do you think you've conceded?

Linda
 
Also, Master Wu, I owe you an apology for implying that your ganzbot would be useless (by agreeing with Xissy). My pessimism was unjustified, as you did better than other simulations I've seen, by running experiments with smaller numbers of trials, and reporting the results of those individual experiments (instead of combining them at that step).

Linda
 
Also, Master Wu, I owe you an apology for implying that your ganzbot would be useless (by agreeing with Xissy). My pessimism was unjustified, as you did better than other simulations I've seen, by running experiments with smaller numbers of trials, and reporting the results of those individual experiments (instead of combining them at that step).

Linda

No problem, Linda. Is there anything you want me to try with the Ganzbot? I'm planning on doing a simulation of 10 sessions, 40 trials each (which apparently is the average ganzfeld). The program has been modified according to Johaan's concern, and now it gives the percentages without rounding them (it gives up to two digits). Is it possible for you to do an analysis, or give an opinion on the data, same as Johaan?
 
Okay, so here are the results (now without the rounding) of 10 sessions, each consisting of 40 trials. No bias.

1.- 22.50% ( Hits: 9/40)
2.- 27.50% ( Hits: 11/40)
3.- 32.50% ( Hits: 13/40)
4.- 15.00% ( Hits: 6/40)
5.- 30.00% ( Hits: 12/40)
6.- 22.50% ( Hits: 9/40)
7.- 30.00% ( Hits: 12/40)
8.- 32.50% ( Hits: 13/40)
9.- 20.00% ( Hits: 8/40)
10.- 30.00% ( Hits: 12/40)

I do have a question concerning this. For 40 trials, is the chance expectation 25%, or is it less, or more? Is there a way of knowing the chance expectation based on these results, or should we always use the theoretical value (which applies to thousands or hundreds of trials)?
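
As a quick way to see what chance alone allows for 40 trials (standard binomial arithmetic, nothing specific to the Ganzbot): the per-trial chance expectation stays at 25% no matter how few trials there are, but with only 40 trials the observed hit rate can wander quite far from 25% by chance.

from math import sqrt
from scipy.stats import binom

n, p = 40, 0.25
sd_hits = sqrt(n * p * (1 - p))
print(sd_hits / n)  # standard deviation of the hit rate: roughly 7 percentage points

# Range of hit counts you would see about 95% of the time by chance alone
print(binom.ppf(0.025, n, p), binom.ppf(0.975, n, p))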

To test for randomness, I made 5 sessions, each of 10000 trials:

1.- 25.16% ( Hits: 2516/10000)
2.- 24.90% ( Hits: 2490/10000)
3.- 24.70% ( Hits: 2470/10000)
4.- 25.02% ( Hits: 2502/10000)
5.- 25.00% ( Hits: 2500/10000)

And just for the fun of it, 30 sessions, each consisting of 40 trials. Bias is 50% A, and 50% B.

1.- 30.00% ( Hits: 12/40)
2.- 15.00% ( Hits: 6/40)
3.- 25.00% ( Hits: 10/40)
4.- 20.00% ( Hits: 8/40)
5.- 27.50% ( Hits: 11/40)
6.- 35.00% ( Hits: 14/40)
7.- 27.50% ( Hits: 11/40)
8.- 20.00% ( Hits: 8/40)
9.- 27.50% ( Hits: 11/40)
10.- 22.50% ( Hits: 9/40)
11.- 25.00% ( Hits: 10/40)
12.- 25.00% ( Hits: 10/40)
13.- 30.00% ( Hits: 12/40)
14.- 17.50% ( Hits: 7/40)
15.- 35.00% ( Hits: 14/40)
16.- 20.00% ( Hits: 8/40)
17.- 25.00% ( Hits: 10/40)
18.- 35.00% ( Hits: 14/40)
19.- 27.50% ( Hits: 11/40)
20.- 37.50% ( Hits: 15/40)
 