Jesus, if we don’t let the acronyms fly around here. I volunteered to look through the article that’s going around on IVF with PGS outcomes from a statistician’s point of view for Bea. My qualifications for doing so are that a) I am a competent statistician and b) I’m not doing IVF with PGS, so I don’t have any personal bias. I am not a medical doctor, but I am a researcher and I have a fairly good grasp of what makes a paper valid or not. If you want technical details about the actual research that was conducted, the protocols and what-not, I’m not the person to ask.
There are several basic things that I look for in any scientific paper, and I’m going to step you through one by one. I’m sure other scientifically-minded folks might have their own interpretations, but this is how I read the article.
- Hypothesis being tested
Hypothesis testing is a basic form of statistical inference. That is, you take test results from a small sample of individuals and try to generate conclusions about how the entire population behaves based on the sample results. In this study, the research team was interested in determining whether PGS would increase the ongoing pregnancy rate (defined as a “viable intrauterine pregnancy after 12 weeks gestation”) for women with advanced maternal age over the course of three IVF cycles. The baseline rate for the clinical population was 40%, and the research team decided that the study results needed to yield a 55% rate to be clinically significant.
In other words, the research team was trying to decide whether it was helpful to use PGS as a standard protocol in all IVF cycles. The study needed to show that including PGS in an IVF cycle yielded a 55% pregnancy rate over a course of three IVF cycles in order to be considered “successful”.
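To get a feel for what a 40% versus 55% target asks of a study, here is a minimal sketch of the textbook sample-size formula for comparing two proportions. The 5% two-sided significance level and the 80% and 90% power figures are my assumptions for illustration, not values taken from the paper, and the function name is made up.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_baseline, p_target, alpha=0.05, power=0.80):
    """Approximate number of women needed in EACH group.

    Normal-approximation formula for a two-sided comparison of two
    independent proportions. alpha and power are assumed values,
    not numbers reported in the paper.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # value tied to the desired power
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


n_80 = sample_size_per_group(0.40, 0.55)              # 80% power (assumed)
n_90 = sample_size_per_group(0.40, 0.55, power=0.90)  # 90% power (assumed)
print(f"Per-group sample size at 80% power: {n_80}")
print(f"Per-group sample size at 90% power: {n_90}")
```

The point is only that the target rates, the significance level, and the desired power together dictate how many women a trial needs. Change any one of them and the required sample size moves.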
- Methodology of the research
Methodology describes the steps taken to produce the research results. The standard rule is that your methodology has to be described in enough detail that another researcher can understand and reproduce your testing method. The NEJM is a peer-reviewed research journal. That means that when a paper is submitted for possible publication, the journal editors pass out the article to several other researchers in the field to see what they think. Those experts review the methodology, and they must approve that the method is sound before a paper will be published. With that in mind, I’m going to accept that their methodology was proper, since I am not an expert in this subject.
However, from a patient’s point of view, there are a few things that stand out to me. First, and Bea and Aurelia went into this in detail, the patient population was defined simply as those with “advanced maternal age”, i.e., over 35. The paper, though not the article, also notes that other conditions were in play: poor semen quality, unexplained infertility, tubal issues, anovulation, endometriosis, cervical issues, and ovarian failure*. There was no control for any of those factors. That means that there might have been different results if the test population had focused on only semen quality, or only tubal issues. Since no controls were in place to account for the variation in baseline diagnoses, no statements can be made about the effectiveness of using PGS for any specific condition.
- Results
Results are, well, results. Did it work? Did it not work? In this case, the results were measured in ongoing pregnancy rates at 12 weeks gestation, biochemical pregnancies (positive beta), and clinical pregnancies (visible gestational sac at 7 weeks). Miscarriage rates and live birth rates were also counted. The only issue here is that some of these results don’t line up well. For example, a single biochemical pregnancy might actually result in two live births. It was also noted that the study experienced 8 “spontaneous” pregnancies (how the hell you have a spontaneous pregnancy in an IVF cycle, I don’t know) and that any cryopreserved embryos were transferred before a fresh IVF cycle was begun. From the paper, there is no indication of how those numbers were included or excluded from the results. Because of these kinds of issues, the numbers reported don’t add up directly.
- Conclusions
The conclusions in any paper are where the researchers lay out how they think that the results do or do not support their hypotheses. So here is where you have to be very careful: conclusions can be influenced by researcher bias. Conclusions can also be manipulated by the statistical tests performed on the results. There’s an old saying: figures lie and liars figure. And it’s true. You can make statistical results say anything you want depending on the test that you use.
Now, the researchers’ overall conclusion was twofold: first, that PGS did not increase the ongoing-pregnancy and live birth rates. This is obviously true when you look at the numbers reported in their results. The second assertion is that PGS “significantly reduced” the ongoing-pregnancy and live birth rates. And this is what I think is causing a lot of people to misinterpret their conclusion.
Let’s sidetrack into the definition of “statistical significance” here. For a statistician, the word “significant” has a special meaning. Instead of the common definition, meaning that it is something important or something to pay attention to, statistically significant refers to the Type I error probability in the hypothesis test. Now I’m going to go very slowly:
o For this experiment, the null hypothesis was that PGS doesn’t improve ongoing pregnancy rates**
o Type I error is a “false positive”. In other words, if we have a Type I error, we decide that PGS improves pg rates when it really doesn’t.
o The researchers chose to test against a 5% level of significance. This means that there is only a 5% chance that the researchers will decide that PGS improves pg rates when it really doesn’t.
o A 5% significance level has no special meaning. It is a common convention, and it is simply what the research team chose to test against. A decision that is statistically significant at 5% may or may not be significant at 3% or 1% or 0.5%.
o The researchers did not provide any sensitivity testing results on their statistical output. We also do not have enough data about the true underlying population characteristics. Therefore, we cannot draw any conclusions about the true power*** of their experiment.
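If the bullets above feel abstract, a small simulation can make them concrete. This is a rough sketch under my own assumptions: both groups share the same true 40% rate (so any "significant" result is a false positive), each group has 200 women, and the test is a pooled two-sided z-test for two proportions. The 70-versus-50 counts at the end are invented for illustration and have nothing to do with the paper's data.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # every outcome identical: no evidence of a difference
        return 1.0
    z = (success_a / n_a - success_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(true_rate, n_per_group, alpha, n_trials, seed=0):
    """Share of simulated no-difference trials that still test as significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_trials):
        # Both groups are drawn from the SAME true rate, so the null is true.
        a = sum(rng.random() < true_rate for _ in range(n_per_group))
        b = sum(rng.random() < true_rate for _ in range(n_per_group))
        if two_proportion_p_value(a, n_per_group, b, n_per_group) < alpha:
            false_positives += 1
    return false_positives / n_trials


fp_rate = false_positive_rate(true_rate=0.40, n_per_group=200, alpha=0.05, n_trials=2000)
print(f"Share of no-difference trials flagged as significant: {fp_rate:.3f}")

# Invented counts: 70 of 200 versus 50 of 200.
p_value = two_proportion_p_value(70, 200, 50, 200)
for alpha in (0.05, 0.01):
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"p = {p_value:.4f} is {verdict} at the {alpha:.0%} level")
```

The first number should land near the chosen significance level, which is all that level promises. The second part shows the same made-up result passing at one threshold and failing at a stricter one.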
Still with me? Okay, since we cannot draw any conclusions about the power of their experiment, it is very dangerous to throw out the conclusion that PGS significantly reduces pg rates. There is also no information on how the statistical analysis controlled for variation due to underlying personal factors. The double-blind trial groups were selected randomly, controlling for maternal age, IVF vs. ICSI, and center location (two medical centers in different locations participated). In theory, it’s possible that a higher proportion of women in the PGS group also had higher FSH levels and therefore poorer-quality eggs. The numbers themselves do show that the PGS group had a higher raw number of women with unexplained IF, tubal IF, anovulation, and ovarian failure.
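To show how easily chance alone can leave one group with more of some unmeasured factor, here is a quick sketch. Every number in it is hypothetical: 400 women, 120 of whom carry some baseline factor such as high FSH, split at random into two groups of 200 with no stratification at all. That is a simplification of what the study did, since the real trial balanced on a few factors.

```python
import random


def imbalance_share(n_per_group, n_with_factor, min_count_gap, n_trials, seed=0):
    """Share of random splits where the two groups differ by at least
    `min_count_gap` women carrying the (hypothetical) baseline factor."""
    rng = random.Random(seed)
    total = 2 * n_per_group
    # 1 marks a woman with the factor, 0 a woman without it.
    women = [1] * n_with_factor + [0] * (total - n_with_factor)
    imbalanced = 0
    for _ in range(n_trials):
        rng.shuffle(women)                   # simple randomization
        count_a = sum(women[:n_per_group])   # women with the factor in group A
        count_b = n_with_factor - count_a    # the rest land in group B
        if abs(count_a - count_b) >= min_count_gap:
            imbalanced += 1
    return imbalanced / n_trials


# A gap of 10 women out of 200 per group is 5 percentage points.
share = imbalance_share(n_per_group=200, n_with_factor=120, min_count_gap=10, n_trials=5000)
print(f"Share of randomizations with a gap of 5+ points: {share:.2f}")
```

Under these invented numbers, a gap of that size turns up fairly often, which is why it matters whether the analysis adjusted for baseline differences.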
The other thing that is troublesome about this second statement is that the results were statistically significant only over the total sample size. That is, the researchers could only make the statement that PGS reduces pg rates when they looked at the overall numbers. On a cycle-by-cycle basis, “the ongoing-pregnancy rates and the rates of biochemical and clinical pregnancy in the two groups were not significantly different.” Yep, you heard me. The researchers buried that little sentence in the results section, but it’s there. The PGS rates weren’t significantly better, but they didn’t seem to hurt anything either.
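Here is a sketch of how that can happen: a modest gap that is not significant within any one cycle can become significant once the cycles are pooled, simply because the pooled sample is larger. The counts below are invented, not the paper's, and the pooled z-test treats every cycle as independent, which glosses over the fact that the same women can appear in more than one cycle.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # every outcome identical: no evidence of a difference
        return 1.0
    z = (success_a / n_a - success_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts per cycle:
# (control pregnancies, control cycles, PGS pregnancies, PGS cycles)
cycles = [
    (45, 150, 33, 150),
    (40, 140, 30, 140),
    (36, 130, 26, 130),
]

per_cycle_p = [two_proportion_p_value(*cycle) for cycle in cycles]
for number, p in enumerate(per_cycle_p, start=1):
    print(f"Cycle {number}: p = {p:.3f}")

# Add the columns together to get the overall totals across cycles.
totals = [sum(column) for column in zip(*cycles)]
pooled_p = two_proportion_p_value(*totals)
print(f"All cycles pooled: p = {pooled_p:.3f}")
```

With these made-up counts, no single cycle clears the 5% bar but the pooled comparison does. Neither answer is "wrong"; they are answers to different questions, and the paper's headline rests on the pooled one.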
My take on their study is that they did some things right, and they did some things wrong. I feel like they proved that PGS doesn’t increase pg rates in the general population of advanced maternal age IVF candidates and so should not be added to the standard of practice “just because”. On the other hand, they have NOT conclusively proved that there is anything about PGS that is detrimental to the IVF success process on a single-cycle basis, and so that needs further research before making the kinds of assertions that they did in their conclusion. If it were me, and my personal RE told me that it would help in my particular circumstances, I would definitely go ahead with PGS with no qualms. The study is simply too broad to use as a decision-breaking piece of research. There are still too many holes in their data that need to be filled.
One thing that is important to remember when you see articles like this is that this represents basic research into a subject. It’s one of the first times that an experiment like this has been tried, and it will most certainly spawn controversy in the medical establishment for the very reasons we’re all picking it apart. But what it does is lay a foundation for other researchers to come back and start testing various parts of the overall experiment to resolve the inconsistencies and variation that were identified. It may not help us, since these tests take years (and sometimes decades) to complete, but the overall body of knowledge will eventually be generated.
*I do want to point out that I think it was highly unethical for the study to use donor eggs from women with advanced maternal age to treat the women who were facing ovarian failure. I hope to god that those women knew the risk they were taking by using eggs from donors that also might have egg quality issues.
**Trust me here. If you want me to explain why this is our null hypothesis and the related concept of power, please email me.
***Again, power has a different definition for statisticians.