How we evaluate studies about fertility and reproductive health

Science is the backbone of everything we do at Modern Fertility, whether it’s interpreting what someone’s hormones might mean for their health and fertility, or moving fertility research forward by conducting rigorous studies and publishing them in peer-reviewed medical journals.

Here’s the thing. Science, and the process of pushing the boundaries of our knowledge, isn’t always linear. For every one study claiming to find a relationship between two variables, there might be three others claiming there’s no relationship. And the sheer number of published articles has been growing 8%-9% per year over the past several decades, with about two papers published every minute of every day. With so much information out there, it’s hard to evaluate all of the relevant information and come to a conclusion. Not all published papers are good, and not all good published papers come to the same conclusion. (Read on to learn what “good” means in our book.)

In this post, we’ll talk about how the Modern Fertility clinical team evaluates fertility science. We’ll highlight what we look for when evaluating which studies we cite, how we determine the strengths and weaknesses of a study, how we reconcile conflicting info from different studies, and the (often very long) scientific processes that turn a hypothesis into a widely accepted fertility fact.

Main takeaways

  • The best studies are those that include a lot of people, have good control groups, are representative of the population, accurately measure what they say they’re measuring, statistically control for variables that may obscure the main relationship they’re interested in, and use solid study design.
  • Not all “good” studies come to the same conclusions, and that’s totally normal. Differences in study design, populations, and the specific measures used can all lead to different conclusions.
  • The current scientific consensus on any given topic doesn’t usually come from any one study, but rather from a bunch of studies coming to the same conclusion over time.
  • Just because a medicine, pill, or device is commercially available doesn’t mean it’s backed by solid science. And just because there may be a study purportedly backing something up, it doesn’t mean that it’s a good one. Modern Fertility has your back in helping you evaluate what's out there.

What makes a study "good"?

There are countless things that make a study good (and bad, for that matter). For the sake of brevity, we’ll outline some of the questions we ask ourselves when evaluating the quality of a study so you can start doing the same.

1. How many people were in the study?: Studies are typically designed to use results from small groups to make conclusions about larger groups. The more people a study includes, the more confident we can be that the findings in the study are reflective of the population we are trying to draw conclusions about. If we’re interested in knowing if a new in-vitro fertilization (IVF) protocol increases pregnancy rates, if a new medicine decreases symptoms of endometriosis, or if prenatal supplements improve pregnancy outcomes, we’re more confident in a study that looks at these things in a large sample.

What is “large?” This is where it gets tricky — there is no scientifically agreed-upon number of what constitutes a “large” sample. Epidemiologists who study population health might be used to 10,000-100,000 people in their studies; reproductive endocrinologists might be used to several hundreds or thousands. Generally, studies looking for really dramatic differences will need fewer people than studies looking for really small differences. When our team is evaluating the sample size of a paper in practice, we look at the magnitude of the effect reported and the frequency of the condition in the larger population to determine whether the study is sufficiently large.
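To make the link between effect size and sample size concrete, here is a minimal sketch using the textbook normal-approximation formula for comparing two proportions. The function name, the pregnancy rates, and the conventional settings (5% significance, 80% power) are assumptions for illustration, not figures from any study we cite.

```python
import math
from statistics import NormalDist


def n_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate people needed per group to detect p_control vs p_treated.

    Uses the standard normal-approximation formula for a two-sided
    comparison of two independent proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    effect = p_treated - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical pregnancy rates: a dramatic difference vs. a subtle one.
big_effect = n_per_group(0.20, 0.40)    # 20% vs 40%
small_effect = n_per_group(0.20, 0.22)  # 20% vs 22%
print(big_effect, small_effect)
```

Under these assumptions, the dramatic difference needs fewer than 100 people per group, while the subtle one needs several thousand.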

2. Was there a control group? How good was it?: Studies (especially in the reproductive endocrinology world) often want to evaluate the effect of a new treatment, device, or intervention. But just knowing the effectiveness of whatever we’re interested in often isn’t helpful unless we have something to compare it to. For example, if the maker of a certain supplement says that 85% of people who start using it will get pregnant within one year of trying to conceive (TTC), it’s helpful to know that 85% of people not using the supplement will also get pregnant within one year of TTC. Having a control group provides us with an anchor for comparison.

What makes a control group “good”? A good control group is one that matches the “experimental” group in as many ways as possible, making us more confident that any differences between them are due to the variable we are investigating, and not due to any other differences between the two groups. For example, if we want to compare the effects of two different ovulation-inducing meds, we should be sure that the two groups are comparable on variables that can affect reproductive function — like age, smoking, cycle regularity, and polycystic ovary syndrome (PCOS). If one group was significantly older on average than the other (due to random chance or due to a poorly designed study), we wouldn’t be confident that any differences in ovulation were uniquely caused by the two different meds.
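One common way to check whether two groups are comparable on a variable like age is the standardized mean difference: the gap between the group averages, expressed in standard deviations. The sketch below is illustrative only. The ages are made up, and the 0.1 cutoff some researchers use for "good balance" is a rule of thumb, not a hard rule.

```python
from statistics import mean, stdev


def standardized_mean_difference(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((stdev(group_a) ** 2 + stdev(group_b) ** 2) / 2) ** 0.5
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Made-up ages (years) for hypothetical treatment groups.
ages_med_a = [27, 29, 30, 31, 32, 33, 34, 35]
ages_med_b = [28, 29, 30, 31, 32, 33, 34, 36]   # similar to group A
ages_older = [33, 35, 36, 37, 38, 39, 40, 41]   # clearly older than group A

smd_balanced = standardized_mean_difference(ages_med_a, ages_med_b)
smd_unbalanced = standardized_mean_difference(ages_med_a, ages_older)
print(round(smd_balanced, 2), round(smd_unbalanced, 2))
```

A value near zero means the groups look alike on that variable. A large value (like the second comparison here) is a warning sign that age, and not the treatment, could explain any difference in outcomes.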

3. Is the study population representative of the general population?: There are so many social, demographic, and personal factors that can influence reproductive health and fertility. Findings based on a group that is homogeneous in any of these factors might not replicate in a different group, meaning we can’t fully trust that what we see in that study will be seen in the real world.

For example, studies of the effect of eating disorders on fertility that recruit people at eating disorder clinics may yield different results than studies that recruit from the general population — with the former often pointing to big effects of eating disorders that aren’t replicated at the population level. Why might that be the case? Because people with eating disorders seeking clinical help are more likely to have the most severe cases. What we’ve seen is that severe eating disorders have a big impact on fertility, but we still haven’t learned what effect mild or moderate eating disorders have on fertility.

Are studies good at being representative in the US? Historically, no. Medical research has an unfortunate but well-earned reputation of not being representative with respect to sex assigned at birth (e.g., from 1977 to 1993, FDA guidance excluded people who could become pregnant from early-phase clinical trials), race, and socioeconomic status. This is particularly important to think about in the context of the many reproductive endocrinology studies that recruit from fertility clinics, because we know that access to fertility care isn’t equal across racial and socioeconomic lines.

4. How accurate are the measures used in the study?: There are often many different ways to measure the same thing. The more accurate these measures are, the more confident we can be in a study’s results.

Dr. Emily Oster, an economics professor at Brown University (and author of our fave parenting books), drives home this point about measurement accuracy in a recent edition of her newsletter. A study came out in September of 2020 linking alcohol use during pregnancy with child outcomes. One of Dr. Oster’s main qualms? The research team asked people to report on their alcohol consumption before they found out they were pregnant, and after they found out they were pregnant — over nine years after they’d given birth. Given that people can’t accurately self-report what they’ve eaten over the last day or week, chances are they aren’t that good at accurately reporting how many drinks they consumed per week in the few weeks between their last period and first positive pregnancy test a decade ago.

5. Does the study control for confounding variables?: If we’re interested in the relationship between two variables, it’s important to statistically remove the effects of any third variable (we call these “confounders”) that is related to both of them and could create or hide an apparent relationship.

For example, Dr. Oster wrote a piece in FiveThirtyEight about problems assessing the effect of paternal age on fertility-related outcomes. We know that maternal age is associated with decreases in egg quality that translate into negative pregnancy outcomes (e.g., risk of miscarriage) or developmental differences (e.g., risk of chromosomal abnormalities), but what about paternal age? Some early studies of paternal age and sperm quality tried to link paternal age and pregnancy outcomes, but were limited by the fact that people often partner up with people very similar in age to them. Without taking into account the age of the birthing parent, how could you be sure that any outcomes were driven by the sperm-giving parent? This is why careful consideration of such variables is important.
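The parental age problem can be sketched with a small simulation. Everything here is an assumption for illustration: we invent a world in which an outcome depends only on the birthing parent’s age, partners tend to be close in age, and the numbers are arbitrary. The code then compares the raw correlation between paternal age and the outcome with the partial correlation after accounting for maternal age.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(0)  # fixed seed so the simulation is repeatable
# Simulated world: the outcome depends ONLY on the birthing parent's age,
# and the other parent's age is the birthing parent's age plus some noise.
maternal_age = [rng.gauss(32, 5) for _ in range(5000)]
paternal_age = [m + rng.gauss(2, 3) for m in maternal_age]
risk_score = [0.1 * m + rng.gauss(0, 1) for m in maternal_age]

crude = pearson(paternal_age, risk_score)
adjusted = partial_correlation(paternal_age, risk_score, maternal_age)
print(round(crude, 2), round(adjusted, 2))
```

In this made-up world, paternal age looks clearly related to the outcome until maternal age is taken into account, at which point the relationship all but disappears. Real analyses typically use regression models, not a single partial correlation, but the logic is the same.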

6. What’s the real-world significance of the results?: There’s a difference between results from a study being statistically significant, and clinically or real-world significant.

Here’s one recent example from a study published on the link between tanning bed use and endometriosis risk. The researchers found that people who used tanning beds 6+ times a year in high school had a statistically significantly higher risk (approximately 24% higher, which sounds like a lot!) of endometriosis relative to people who never used tanning beds. A closer look at the data shows that the rate of endometriosis (in cases per person-years) was 0.38% in people who never used tanning beds; in people who used tanning beds 6+ times per year, it was 0.49%.

Put in real-world terms, if you observed 1,000 people who don’t use tanning beds for one year, you’d expect to see four cases of endometriosis; if you observed 1,000 people who use tanning beds 6+ times per year for one year, you’d expect to see five cases of endometriosis. So, while the difference in endometriosis rates between these groups was statistically significant, whether it has important real-world significance may be up for debate.
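The arithmetic behind that comparison can be sketched in a few lines. The two rates are the ones quoted above, and the variable names are ours. Note that the relative increase computed from these rounded rates will not exactly match the reported figure of approximately 24%, presumably because the published estimate comes from a statistical model that adjusts for other variables (that is our assumption).

```python
# Rates quoted above, in cases per person-year.
rate_never = 0.38 / 100       # never used tanning beds
rate_frequent = 0.49 / 100    # used tanning beds 6+ times per year

people = 1000
cases_never = rate_never * people        # expected cases in one year
cases_frequent = rate_frequent * people

relative_increase = rate_frequent / rate_never - 1
extra_cases_per_1000 = cases_frequent - cases_never

print(f"{cases_never:.1f} vs {cases_frequent:.1f} cases per {people} people per year")
print(f"relative increase: {relative_increase:.0%}")
print(f"absolute increase: {extra_cases_per_1000:.1f} extra cases per {people}")
```

The same data yield a relative increase that sounds large and an absolute increase of about one extra case per 1,000 people per year, which is why we look at both.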

7. What was the study design?: There are different types of study design, and the “best” design depends on the question you are trying to answer. In the world of fertility and reproductive health, we’re often interested in changes as a function of treatment, or changes over time.

Let’s say we’re looking at whether or not the number of oocytes (immature egg cells) obtained during egg retrieval changes with age. We could either conduct a cross-sectional study or a longitudinal study:

  • In a cross-sectional study, we could find the answer by sampling people of different ages, comparing the number of oocytes retrieved across those ages, and concluding that differences across age groups suggest that the number changes with age.
  • In a longitudinal study, we could find the answer by sampling the same person over time, and seeing how individual trajectories of oocytes change. This repeated sampling over time is a more direct way of investigating changes over time, and comes with some statistical advantages, too. Holding the number of participants equal, the statistical ability to detect the effect researchers are looking at is higher when you have multiple data points per person as compared to when you have a single data point per person.

Though few would disagree that longitudinal studies are better than cross-sectional ones for studying change over time, they take longer to complete and are logistically more challenging, which may be why we don’t see more of them.
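The statistical advantage of repeated measures can be illustrated with two textbook formulas for the standard error of a difference. This is a simplified sketch, assuming just two time points, equal spread at both, and made-up numbers. It compares two separate groups of n people (cross-sectional) with n people each measured twice (longitudinal), so the number of measurements is the same in both designs.

```python
import math


def se_cross_sectional(sd, n):
    """Standard error of a difference between two independent groups of n."""
    return sd * math.sqrt(2 / n)


def se_longitudinal(sd, n, within_person_corr):
    """Standard error of the average within-person change for n people
    measured twice, given the correlation between their two measurements."""
    return sd * math.sqrt(2 * (1 - within_person_corr) / n)


# Hypothetical values: spread of the measure, people per group, and how
# strongly a person's two measurements track each other.
sd, n, corr = 5.0, 100, 0.7
print(round(se_cross_sectional(sd, n), 3), round(se_longitudinal(sd, n, corr), 3))
```

The more a person’s measurements resemble each other over time, the smaller the longitudinal standard error becomes, and a smaller standard error means a better chance of detecting a real change.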

8. Was it published in a reputable scientific journal?: The main way scientists disseminate their research findings is by submitting them for publication in peer-reviewed, scientific journals. Journals will seek independent reviewers (i.e., other scientists in the field) to give feedback on the submission, and will only publish the paper if it's deemed methodologically sound. Journals get "rankings" and, typically, higher-ranked journals have more rigorous standards.

But just because something was purportedly peer-reviewed and published doesn’t automatically make it great. There's an ever-growing list of predatory journals, which are journals that sound scientific, but are really just money-making scams for people behind the scenes.

Buyer beware, too: even the best journals sometimes publish bad science. It’s tempting to give findings from “good” journals an automatic pass, but it’s important to review them just as critically.

Discrepancies across studies

Sometimes differences in findings across studies can be explained easily. Differences in how things are measured, who’s included in the sample, how many people were sampled, what other variables were considered, general study design, and how statistics were run are the most frequent culprits for discrepancies.

Other times, it’s not as easy to reconcile different findings across studies. It’s completely feasible for two studies, conducted very similarly, to reach different conclusions. And this might boil down to the fact that people are just complicated.

In studies of cell cultures or lab animals, every aspect of the setup is controlled by researchers, and such studies can often be successfully replicated across lab groups and across time. In human studies, not so much. The actions and exposures of people can’t be completely controlled by researchers in any study, and people’s actions and exposures change across space and time. “Part of the reason we see differences across studies is because there is so much heterogeneity in people, and that heterogeneity is just hard to quantify and hard to account for,” says Dr. Nataki Douglas, MD/PhD and head of Modern Fertility’s medical advisory board.

Examples of studies we think rock

Most studies and papers out there don’t check off all the boxes mentioned above, and that’s alright — doing flawless science is impossible, and doing really great science is hard. When we come across examples of really great science that change the game, we get excited. Here are two examples:

1. Letrozole as a first-line treatment for people with PCOS. For a strong study that changed the fertility treatment game for people with PCOS, look no further than this 2014 study in the New England Journal of Medicine. Though earlier studies had suggested that letrozole might be a better first-line treatment for people with PCOS, it’s this study that really sealed the deal. Today, most clinics will turn to letrozole first for ovulation induction in people with PCOS. Here are some things that make this study great.

  • A sufficiently large sample size? Researchers studied over 600 people with PCOS, recruited from multiple clinics. This sample size is larger than what is typically seen in single-clinic studies, and a larger sample size = more confidence in the results.
  • A good control group? Half the participants were allocated to a "control" group and received clomiphene citrate (aka Clomid). The people who got Clomid and the people who got letrozole were comparable in terms of all background characteristics examined, like age, BMI, hormone levels, antral follicle count, number of months they’d been trying to conceive, and more. That people in the two groups were so similar means that any differences in outcomes can likely be attributed to differences in which treatment they got.
  • Representative of the population? While people who seek out treatment for infertility are not representative of all people who may have infertility, the fact that people were recruited from multiple clinics means that these results aren’t driven by any potential idiosyncrasies of a single clinic, and that we can reasonably expect similar results to be found across clinics.
  • Accurate measures? This study used very solid measures of its main variables of interest. People’s diagnosis of PCOS was confirmed using the Rotterdam criteria, meaning we can be very sure they had PCOS, and ovulation was confirmed using progesterone levels.
  • Control for confounding variables? The authors of the study did some exploring to see whether confounding variables previously linked with fertility, like BMI, could explain their pattern of results. This, along with the fact that the Clomid and letrozole groups were well-matched on all baseline variables examined, suggests that the group differences they found were truly because of differences in treatment.
  • Real-world significance? The real-world significance of these results is clear. Put into plain English, as compared to people using Clomid, people using letrozole had a 44% higher chance of a live birth. If 100 people were using Clomid and 100 were using letrozole, we’d expect 9 more live births in the letrozole group as compared to the Clomid group.
  • Solid study design? The study design here (a double-blind randomized controlled trial — more on that in a bit!) was great for the research question at hand.
  • Reputable journal? The journal this paper was published in, the New England Journal of Medicine, is the most heavily cited and arguably one of the most respected medical journals in circulation today.
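The relative and absolute figures in the real-world significance point above can be tied together with a little arithmetic. The sketch below uses only the two rounded numbers quoted there (44% higher, 9 more per 100). The live birth rates it backs out are rough values implied by that rounding, not figures taken from the paper, and the variable names are ours.

```python
import math

relative_increase = 0.44      # "44% higher chance of a live birth"
extra_births_per_100 = 9      # "9 more live births" per 100 people

absolute_difference = extra_births_per_100 / 100
# If letrozole_rate = clomid_rate * 1.44 and the gap between them is 0.09,
# then clomid_rate * 0.44 = 0.09.
implied_clomid_rate = absolute_difference / relative_increase
implied_letrozole_rate = implied_clomid_rate * (1 + relative_increase)
# Number needed to treat: how many people would need to use letrozole
# instead of Clomid for one additional live birth (rounded up).
number_needed_to_treat = math.ceil(1 / absolute_difference)

print(f"implied live birth rates: {implied_clomid_rate:.0%} vs {implied_letrozole_rate:.0%}")
print(f"number needed to treat: {number_needed_to_treat}")
```

Unlike the tanning bed example, here both the relative and the absolute differences are large enough to matter in practice.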

2. How AMH changes with age. Of the countless studies assessing changes in AMH across age, one stands out as particularly strong. Here are some reasons why this 2016 study is one of our favorites.

  • A sufficiently large sample size? Dutch researchers measured AMH as part of a population-based cohort study, meaning everyone in the population was eligible to participate. Population-based studies tend to be big, and this one in particular analyzed data from over 3,300 people with ovaries.
  • A good control group? Control groups are crucial when we’re interested in comparing two things, like the effects of two different treatments. They aren’t necessary when investigating changes over time.
  • Representative of the population? Unlike studies based on patients at a fertility clinic, studies based on nationally representative samples give us results that are applicable to the population at large.
  • Accurate measures? The researchers used well-validated assays to measure AMH, along with well-validated collection and storage protocols.
  • Control for confounding variables? Models evaluating changes in AMH levels over time did control for some confounders, like smoking. This makes it likely that declines in AMH over time are due to people’s changing biology, and not just changing behavior.
  • Real-world significance? The real-world significance of this study is clear: AMH changes over time in a major way. It doesn’t necessarily decrease steadily over time, and the rate of decrease looks different for different people. These findings have implications for how we interpret AMH levels clinically.
  • Solid study design? This study was longitudinal in its design, meaning it followed the same people over time. Specifically, researchers gathered data from people five times over a 20-year period. To our knowledge, no other published study has analyzed that many repeat AMH measures from that many people over that long a time scale. This design enables us to really investigate how AMH trajectories change over long periods of time in individuals, and how baseline characteristics may affect the trajectories of change over time.
  • Reputable journal? This article was published in BMC Medicine, which is a trustworthy and well-cited medical journal.

Keeping score? When it comes to quality of research, both of these studies check out:

Criterion | Letrozole as a first-line treatment for people with PCOS | How AMH changes with age
Large sample size
Good control group
Representative of the population ✅/❌
Accurate measures
Control for confounding variables
Real-world significance
Solid study design
Reputable journal

From observations to medical “facts”: The science hierarchy of evidence

It’s very rare that one single study provides the definitive answer to a question. More often, scientists look for the accumulation of different kinds of studies that arrive at similar findings before being confident in the answer.

The “evidence hierarchy” in science starts with least convincing and ends with most convincing.

Figure: The "evidence hierarchy" in science.

At the bottom: Hypotheses, case series/reports, case-control studies, and cohort studies

At the very bottom of the pyramid are hypotheses and opinions, not necessarily backed by any hard data (you’ve gotta start somewhere!). Case series/reports, case-control studies, and cohort studies are all association studies of different sizes, where we’re looking to see if variable A is statistically related to variable B.

In the middle: Randomized controlled trials

Association studies are fine to help search for relationships between two variables. But just because two things are associated doesn’t mean one causes the other, and that’s where randomized controlled trials come in. In a randomized controlled trial, researchers randomly assign participants to receive either the experimental drug or a comparison (a placebo or an existing treatment), so that the groups differ only by chance in everything except the treatment. That makes it easier to determine whether any differences between the two groups were caused by the experimental drug. Randomized controlled trials in which neither the researchers nor the participants know who is in which treatment group (called "double-blind") are considered the gold standard in medical research.

At the top: Systematic reviews and meta-analyses

At the top of the pyramid are systematic reviews and meta-analyses, and these are analyses or summaries of all the available data on a topic from scientifically sound studies. These sources look at data from the best sources, and combine a large number of observations from different scientific teams and from studies using different methodological approaches. Because they’re based on the largest and broadest set of observations, systematic reviews and meta-analyses are given a lot of weight when we’re trying to decide if a hypothesis is likely true. There aren’t systematic reviews or meta-analyses for everything because it takes time for enough data to be collected for a systematic review or meta-analysis to be worth it.
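To give a feel for how a meta-analysis combines studies, here is a minimal sketch of the simplest pooling approach, a fixed-effect inverse-variance weighted average, in which more precise studies count for more. The three study results are invented for illustration, and real meta-analyses often use more elaborate models (for example, random-effects models that allow for differences between studies).

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted average of study estimates (fixed-effect)."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se


# Hypothetical effect estimates (e.g., log risk ratios) from three studies,
# each with its own standard error.
estimates = [0.30, 0.10, 0.22]
standard_errors = [0.20, 0.15, 0.10]

pooled, pooled_se = pool_fixed_effect(estimates, standard_errors)
print(round(pooled, 3), round(pooled_se, 3))
```

The pooled estimate lands between the individual results, and its standard error is smaller than that of any single study, which is one reason these summaries carry so much weight.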

There are scientific journals that only publish systematic reviews and meta-analyses, like Human Reproduction Update and the Cochrane Database of Systematic Reviews. These are often the first places researchers go when they want to learn where the science currently stands on a specific topic. Bonus points: All Cochrane reviews also include a “plain language summary” where the main findings are stripped of scientific jargon and clearly communicated.

If we know what great science is, how does less-than-great science happen?

Most of what you’ll learn from doctors and reproductive endocrinologists is firmly backed by the scientific community (because of things like systematic reviews and meta-analyses). But there are different ways that lower quality science makes its way into our collective conversations about fertility.

Here’s one example: Countless supplements claim they're "proven" to treat infertility or boost your odds of conception, but their “proof” is often customer reviews (opinions are at the bottom of our “hierarchy of evidence”). Sometimes fertility clinics, in an effort to give their patients the best chances of getting pregnant, will offer recommendations or treatment add-ons that haven’t yet been systematically evaluated to be effective. Since they haven’t been proven harmful either, why not try? This is why some clinics may suggest things like Mucinex to people who’ve had a couple of unsuccessful rounds of treatment — not necessarily because they're proven, but because there’s little harm in doing so. (Mucinex-D, however, has pseudoephedrine in it, which isn't good to take long term.)

Despite there being little physiological harm in trying things that aren't backed by rigorous science, there are potential drawbacks. Supplements and add-ons can be pricey, meaning there's a financial burden for something not yet proven to yield positive results. Then there’s the emotional impact — people use such interventions because they think they’ll work and might feel let down if they don’t. Though no interventions are 100% foolproof, there are procedures supported by decades of data (think IVF or ovulation-inducing meds).

Bottom line

The study of fertility and reproductive health is constantly growing and evolving, and that’s part of what makes it so important and exciting to be a part of. Not all studies are created equal, and it’s rare that one study (no matter how great it is!) ever provides the definitive answer to a question.

In a world where not all published studies should be trusted, and companies and clinics may offer things not backed by rigorous studies and data, Modern Fertility has your back. Along with our medical advisory board members, we're constantly keeping tabs on the literature to bring you scientifically accurate information that has been vetted with critical and careful eyes.

Through centuries of meticulous research, we’re continuously inching our way closer to a more complete picture of the scientific wonder that is human reproduction and fertility.

Talia Shirazi, PhD

Talia is a clinical product scientist at Modern Fertility. She's passionate about reproductive health + behavioral neuroendocrinology. Talia received her PhD in biological anthropology at Penn State.