Comparison in Scientific Research: Uncovering statistically significant relationships

by Anthony Carpi, Ph.D., Anne E. Egger, Ph.D.

Did you know that when Europeans first saw chimpanzees, they thought the animals were hairy, adult humans with stunted growth? A study of chimpanzees paved the way for comparison to be recognized as an important research method. Later, Charles Darwin and others used this comparative research method in the development of the theory of evolution.

Comparison is used to determine and quantify relationships between two or more variables by observing different groups that either by choice or circumstance are exposed to different treatments.

Comparison includes both retrospective studies, which look at events that have already occurred, and prospective studies, which examine variables from the present forward.

Comparative research is similar to experimentation in that it involves comparing a treatment group to a control, but it differs in that the treatment is observed rather than deliberately imposed, either because imposing it would raise ethical concerns or because it is not possible, as in a retrospective study.

Anyone who has stared at a chimpanzee in a zoo (Figure 1) has probably wondered about the animal's similarity to humans. Chimps make facial expressions that resemble humans, use their hands in much the same way we do, are adept at using different objects as tools, and even laugh when they are tickled. It may not be surprising to learn then that when the first captured chimpanzees were brought to Europe in the 17th century, people were confused, labeling the animals "pygmies" and speculating that they were stunted versions of "full-grown" humans. A London physician named Edward Tyson obtained a "pygmie" that had died of an infection shortly after arriving in London, and began a systematic study of the animal that cataloged the differences between chimpanzees and humans, thus helping to establish comparative research as a scientific method.

Figure 1: A chimpanzee

  • A brief history of comparative methods

In 1698, Tyson, a member of the Royal Society of London, began a detailed dissection of the "pygmie" he had obtained and published his findings in the 1699 work: Orang-Outang, sive Homo Sylvestris: or, the Anatomy of a Pygmie Compared with that of a Monkey, an Ape, and a Man. The title of the work further reflects the misconception that existed at the time – Tyson did not use the term Orang-Outang in its modern sense to refer to the orangutan; he used it in its literal translation from the Malay language as "man of the woods," as that is how the chimps were viewed.

Tyson took great care in his dissection. He precisely measured and compared a number of anatomical variables such as brain size of the "pygmie," ape, and human. He recorded his measurements of the "pygmie," even down to the direction in which the animal's hair grew: "The tendency of the Hair of all of the Body was downwards; but only from the Wrists to the Elbows 'twas upwards" (Russell, 1967). Aided by William Cowper, Tyson made drawings of various anatomical structures, taking great care to accurately depict the dimensions of these structures so that they could be compared to those in humans (Figure 2). His systematic comparative study of the dimensions of anatomical structures in the chimp, ape, and human led him to state:

in the Organization of abundance of its Parts, it more approached to the Structure of the same in Men: But where it differs from a Man, there it resembles plainly the Common Ape, more than any other Animal. (Russell, 1967)

Tyson's comparative studies proved exceptionally accurate and his research was used by others, including Thomas Henry Huxley in Evidence as to Man's Place in Nature (1863) and Charles Darwin in The Descent of Man (1871).

Figure 2: Edward Tyson's drawing of the external appearance of a "pygmie" (left) and the animal's skeleton (right), from the second edition of The Anatomy of a Pygmie Compared with that of a Monkey, an Ape, and a Man (London: printed for T. Osborne, 1751).

Tyson's methodical and scientific approach to anatomical dissection contributed to the development of evolutionary theory and helped establish the field of comparative anatomy. Further, Tyson's work helps to highlight the importance of comparison as a scientific research method.

  • Comparison as a scientific research method

Comparative research represents one approach in the spectrum of scientific research methods and in some ways is a hybrid of other methods, drawing on aspects of both experimental science (see our Experimentation in Science module) and descriptive research (see our Description in Science module). Similar to experimentation, comparison seeks to decipher the relationship between two or more variables by documenting observed differences and similarities between two or more subjects or groups. In contrast to experimentation, the comparative researcher does not subject one of those groups to a treatment, but rather observes a group that either by choice or circumstance has been subject to a treatment. Thus comparison involves observation in a more "natural" setting, not subject to experimental confines, and in this way evokes similarities with description.

Importantly, the simple comparison of two variables or objects is not comparative research. Tyson's work would not have been considered scientific research if he had simply noted that "pygmies" looked like humans without measuring bone lengths and hair growth patterns. Instead, comparative research involves the systematic cataloging of the nature and/or behavior of two or more variables, and the quantification of the relationship between them.

Figure 3: Skeleton of the juvenile chimpanzee dissected by Edward Tyson, currently displayed at the Natural History Museum, London.

While the choice of which research method to use is a personal decision based in part on the training of the researchers conducting the study, there are a number of scenarios in which comparative research would likely be the primary choice.

  • The first scenario is one in which the scientist is not trying to measure a response to change, but rather to understand the similarities and differences between two subjects. For example, Tyson was not observing a change in his "pygmie" in response to an experimental treatment. Instead, his research was a comparison of the unknown "pygmie" to humans and apes in order to determine the relationship between them.
  • A second scenario in which comparative studies are common is when the physical scale or timeline of a question may prevent experimentation. For example, in the field of paleoclimatology, researchers have compared cores taken from sediments deposited millions of years ago in the world's oceans to see if the sedimentary composition is similar across all oceans or differs according to geographic location. Because the sediments in these cores were deposited millions of years ago, it would be impossible to obtain these results through the experimental method. Research designed to look at past events such as sediment cores deposited millions of years ago is referred to as retrospective research.
  • A third common comparative scenario is when the ethical implications of an experimental treatment preclude an experimental design. Researchers who study the toxicity of environmental pollutants or the spread of disease in humans are precluded from purposefully exposing a group of individuals to the toxin or disease for ethical reasons. In these situations, researchers would set up a comparative study by identifying individuals who have been accidentally exposed to the pollutant or disease and comparing their symptoms to those of a control group of people who were not exposed. Research designed to look at events from the present into the future, such as a study looking at the development of symptoms in individuals exposed to a pollutant, is referred to as prospective research.

Comparative science was significantly strengthened in the late 19th and early 20th centuries with the introduction of modern statistical methods, which were used to quantify the association between variables (see our Statistics in Science module). Today, statistical methods are critical for quantifying the nature of the relationships examined in many comparative studies. The outcome of comparative research is often presented in one of the following ways: as a probability, as a statement of statistical significance, or as a declaration of risk. For example, in 2007 Kristensen and Bjerkedal showed that there is a statistically significant relationship (at the 95% confidence level) between birth order and IQ by comparing test scores of first-born children to those of their younger siblings (Kristensen & Bjerkedal, 2007). And numerous studies have contributed to the determination that the risk of developing lung cancer is 30 times greater in smokers than in nonsmokers (NCI, 1997).
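A "risk X times greater" statement like the one above is a relative risk: the incidence of disease among the exposed divided by the incidence among the unexposed. The sketch below computes a relative risk and an approximate 95% confidence interval from a hypothetical 2x2 cohort table; the counts are invented for illustration and are not drawn from the NCI studies.

```python
import math

# Hypothetical cohort counts (illustrative only, not the real study data):
# exposure = smoker / nonsmoker, outcome = developed lung cancer
smoker_cases, smoker_total = 300, 10_000
nonsmoker_cases, nonsmoker_total = 10, 10_000

risk_smokers = smoker_cases / smoker_total          # incidence among the exposed
risk_nonsmokers = nonsmoker_cases / nonsmoker_total  # incidence among the unexposed
relative_risk = risk_smokers / risk_nonsmokers       # the "X times greater" figure

# Approximate 95% confidence interval on the log scale (standard Wald method)
se_log_rr = math.sqrt(
    1 / smoker_cases - 1 / smoker_total
    + 1 / nonsmoker_cases - 1 / nonsmoker_total
)
low = math.exp(math.log(relative_risk) - 1.96 * se_log_rr)
high = math.exp(math.log(relative_risk) + 1.96 * se_log_rr)

print(f"relative risk = {relative_risk:.1f} (95% CI {low:.1f}-{high:.1f})")
```

With counts like these the entire interval sits well above 1, which is what a declaration of elevated risk is summarizing.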

  • Comparison in practice: The case of cigarettes

In 1919, Dr. George Dock, chairman of the Department of Medicine at Barnes Hospital in St. Louis, asked all of the third- and fourth-year medical students at the teaching hospital to observe an autopsy of a man with a disease so rare, he claimed, that most of the students would likely never see another case of it in their careers. With the medical students gathered around, the physicians conducting the autopsy observed that the patient's lungs were speckled with large dark masses of cells that had caused extensive damage to the lung tissue and had forced the airways to close and collapse. Dr. Alton Ochsner, one of the students who observed the autopsy, would write years later that "I did not see another case until 1936, seventeen years later, when in a period of six months, I saw nine patients with cancer of the lung. – All the afflicted patients were men who smoked heavily and had smoked since World War I" (Meyer, 1992).

Figure 4: Image from a stereoptic card showing a woman smoking a cigarette circa 1900

The American physician Dr. Isaac Adler was, in fact, the first scientist to propose a link between cigarette smoking and lung cancer in 1912, based on his observation that lung cancer patients often reported that they were smokers. Adler's observations, however, were anecdotal and provided no scientific evidence of a relationship. The German epidemiologist Franz Müller is credited with the first case-control study of smoking and lung cancer in the 1930s. Müller sent a survey to the relatives of individuals who had died of cancer and asked them about the smoking habits of the deceased. Based on the responses he received, Müller reported a higher incidence of lung cancer among heavy smokers compared to light smokers. However, the study had a number of problems: first, it relied on the memory of relatives of deceased individuals rather than first-hand observations, and second, the association was never tested statistically. Soon after this, the tobacco industry began to sponsor research with the biased goal of repudiating negative health claims against cigarettes (see our Scientific Institutions and Societies module for more information on sponsored research).

Beginning in the 1950s, several well-controlled comparative studies were initiated. In 1950, Ernest Wynder and Evarts Graham published a retrospective study comparing the smoking habits of 605 hospital patients with lung cancer to 780 hospital patients with other diseases (Wynder & Graham, 1950). Their study showed that 1.3% of the lung cancer patients were nonsmokers while 14.6% of the patients with other diseases were nonsmokers. In addition, 51.2% of the lung cancer patients were "excessive" smokers while only 19.1% of the other patients were. Both differences proved to be statistically significant. The statisticians who analyzed the data concluded:

when the nonsmokers and the total of the high smoking classes of patients with lung cancer are compared with patients who have other diseases, we can reject the null hypothesis that smoking has no effect on the induction of cancer of the lungs.

Wynder and Graham also suggested that there might be a lag of ten years or more between the period of smoking in an individual and the onset of clinical symptoms of cancer. This would present a major challenge to researchers since any study that investigated the relationship between smoking and lung cancer in a prospective fashion would have to last many years.
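The significance of Wynder and Graham's comparison can be illustrated with a Pearson chi-square test of independence on a 2x2 table. The sketch below reconstructs approximate cell counts from the percentages they published, so the statistic is illustrative rather than exact.

```python
# Cell counts reconstructed from Wynder & Graham's reported percentages:
# 1.3% of 605 lung cancer patients were nonsmokers; 14.6% of 780 controls were.
lung_cancer = {"nonsmoker": round(0.013 * 605), "smoker": 605 - round(0.013 * 605)}
controls = {"nonsmoker": round(0.146 * 780), "smoker": 780 - round(0.146 * 780)}

rows = [lung_cancer, controls]
col_totals = {c: sum(r[c] for r in rows) for c in ("nonsmoker", "smoker")}
grand_total = sum(col_totals.values())

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected
chi_sq = 0.0
for r in rows:
    row_total = sum(r.values())
    for c, observed in r.items():
        expected = row_total * col_totals[c] / grand_total
        chi_sq += (observed - expected) ** 2 / expected

# Critical value for 1 degree of freedom at the 95% level is 3.841
print(f"chi-square = {chi_sq:.1f}; significant: {chi_sq > 3.841}")
```

A statistic this far above the 3.841 critical value corresponds to a vanishingly small p-value, consistent with the statisticians' rejection of the null hypothesis.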

Richard Doll and Austin Hill published a similar comparative study in 1950 in which they showed that there was a statistically higher incidence of smoking among lung cancer patients compared to patients with other diseases (Doll & Hill, 1950). In their discussion, Doll and Hill raise an interesting point regarding comparative research methods by saying,

This is not necessarily to state that smoking causes carcinoma of the lung. The association would occur if carcinoma of the lung caused people to smoke or if both attributes were end-effects of a common cause.

They go on to assert that because the habit of smoking was seen to develop before the onset of lung cancer, the argument that lung cancer leads to smoking can be rejected. They therefore conclude, "that smoking is a factor, and an important factor, in the production of carcinoma of the lung."

Despite this substantial evidence, both the tobacco industry and unbiased scientists raised objections, claiming that the retrospective research on smoking was "limited, inconclusive, and controversial." The industry stated that the studies published did not demonstrate cause and effect, but rather a spurious association between two variables. Dr. Wilhelm Hueper of the National Cancer Institute, a scientist with a long history of research into occupational causes of cancers, argued that the emphasis on cigarettes as the only cause of lung cancer would compromise research support for other causes of lung cancer. Ronald Fisher, a renowned statistician, also was opposed to the conclusions of Doll and others, purportedly because they promoted a "puritanical" view of smoking.

The tobacco industry mounted an extensive campaign of misinformation, sponsoring and then citing research that showed that smoking did not cause "cardiac pain" as a distraction from the studies that were being published regarding cigarettes and lung cancer. The industry also highlighted studies that showed that individuals who quit smoking suffered from mild depression, and they pointed to the fact that even some doctors themselves smoked cigarettes as evidence that cigarettes were not harmful (Figure 5).

Figure 5: Cigarette advertisement circa 1946.

While the scientific research began to impact health officials and some legislators, the industry's ad campaign was effective. The US Federal Trade Commission banned tobacco companies from making health claims about their products in 1955. However, more significant regulation was averted. An editorial that appeared in the New York Times in 1963 summed up the national sentiment when it stated that the tobacco industry made a "valid point," and the public should refrain from making a decision regarding cigarettes until further reports were issued by the US Surgeon General.

In 1951, Doll and Hill enrolled 40,000 British physicians in a prospective comparative study to examine the association between smoking and the development of lung cancer. In contrast to the retrospective studies that followed patients with lung cancer back in time, the prospective study was designed to follow the group forward in time. In 1952, Drs. E. Cuyler Hammond and Daniel Horn enrolled 187,783 white males in the United States in a similar prospective study. And in 1959, the American Cancer Society (ACS) began the first of two large-scale prospective studies of the association between smoking and the development of lung cancer. The first ACS study, named Cancer Prevention Study I, enrolled more than 1 million individuals and tracked their health, smoking and other lifestyle habits, development of diseases, cause of death, and life expectancy for almost 13 years (Garfinkel, 1985).

All of the studies demonstrated that smokers are at a higher risk of developing and dying from lung cancer than nonsmokers. The ACS study further showed that smokers have elevated rates of other pulmonary diseases, coronary artery disease, stroke, and cardiovascular problems. The two ACS Cancer Prevention Studies would eventually show that 52% of deaths among smokers enrolled in the studies were attributed to cigarettes.

In the second half of the 20th century, evidence from other scientific research methods would contribute multiple lines of evidence to the conclusion that cigarette smoke is a major cause of lung cancer:

  • Descriptive studies of the pathology of the lungs of deceased smokers would demonstrate that smoking causes significant physiological damage to the lungs.
  • Experiments that exposed mice, rats, and other laboratory animals to cigarette smoke showed that it caused cancer in these animals (see our Experimentation in Science module for more information).
  • Physiological models would help demonstrate the mechanism by which cigarette smoke causes cancer.

As evidence linking cigarette smoke to lung cancer and other diseases accumulated, the public, the legal community, and regulators slowly responded. In 1957, the US Surgeon General first acknowledged an association between smoking and lung cancer when a report was issued stating, "It is clear that there is an increasing and consistent body of evidence that excessive cigarette smoking is one of the causative factors in lung cancer." In 1965, over objections by the tobacco industry and the American Medical Association, which had just accepted a $10 million grant from the tobacco companies, the US Congress passed the Federal Cigarette Labeling and Advertising Act, which required that cigarette packs carry the warning: "Caution: Cigarette Smoking May Be Hazardous to Your Health." In 1967, the US Surgeon General issued a second report stating that cigarette smoking is the principal cause of lung cancer in the United States. While the tobacco companies found legal means to protect themselves for decades following this, in 1996, Brown and Williamson Tobacco Company was ordered to pay $750,000 in a tobacco liability lawsuit; it became the first liability award paid to an individual by a tobacco company.

  • Comparison across disciplines

Comparative studies are used in a host of scientific disciplines, from anthropology to archaeology, comparative biology, epidemiology, psychology, and even forensic science. DNA fingerprinting, a technique used to incriminate or exonerate a suspect using biological evidence, is based on comparative science. In DNA fingerprinting, segments of DNA are isolated from a suspect and from biological evidence such as blood, semen, or other tissue left at a crime scene. Up to 20 different segments of DNA are compared between that of the suspect and the DNA found at the crime scene. If all of the segments match, the investigator can calculate the statistical probability that the DNA came from the suspect as opposed to someone else. Thus DNA matches are described in terms of a "1 in 1 million" or "1 in 1 billion" chance of error.
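Those "1 in 1 million" style figures follow from multiplying per-segment match probabilities, under the assumption that the segments are inherited independently. The sketch below uses hypothetical per-segment values; real frequencies vary by locus and by population.

```python
import math

# Hypothetical probability that a random, unrelated person matches the
# evidence at each individual DNA segment (illustrative values only)
per_segment_match = [0.1, 0.08, 0.12, 0.05, 0.11, 0.07, 0.09, 0.06]

# Under the independence assumption, the chance that a random person
# matches at ALL segments is the product of the per-segment probabilities.
random_match_prob = math.prod(per_segment_match)

print(f"random match probability = 1 in {1 / random_match_prob:,.0f}")
```

Comparing more segments multiplies in more factors, so a full 20-segment match yields a far smaller random-match probability than this 8-segment toy example.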

Comparative methods are also commonly used in studies involving humans due to the ethical limits of experimental treatment. For example, in 2007, Petter Kristensen and Tor Bjerkedal published a study in which they compared the IQ of over 250,000 male Norwegians in the military (Kristensen & Bjerkedal, 2007). The researchers found a significant relationship between birth order and IQ: the average IQ of first-born male children was approximately three points higher than the average IQ of the second-born male in the same family. The researchers further showed that this relationship reflected social rather than biological factors, as second-born males who grew up in families in which the first-born child died had average IQs similar to other first-born children. One might imagine a scenario in which this type of study could be carried out experimentally, for example, purposefully removing first-born male children from certain families, but the ethics of such an experiment preclude it from ever being conducted.
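A birth-order comparison like this one ultimately compares two group means. The sketch below runs a two-sample z-test on hypothetical summary statistics chosen only to mirror the roughly three-point difference; they are not the study's actual numbers.

```python
import math

# Hypothetical summary statistics (invented for illustration):
n1, mean1, sd1 = 120_000, 103.2, 15.0   # first-born males
n2, mean2, sd2 = 100_000, 100.3, 15.0   # second-born males

# Two-sample z-test on the difference in means
# (the very large samples justify z rather than t)
diff = mean1 - mean2
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
z = diff / se

print(f"difference = {diff:.1f} IQ points, z = {z:.1f}")
```

With samples in the hundreds of thousands, the standard error of the difference is tiny, so even a small mean difference produces an enormous z statistic; that is why such studies can detect a three-point effect reliably.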

  • Limitations of comparative methods

One of the primary limitations of comparative methods is the control of other variables that might influence a study. For example, as pointed out by Doll and Hill in 1950, the association between smoking and cancer deaths could have meant that: a) smoking caused lung cancer, b) lung cancer caused individuals to take up smoking, or c) a third unknown variable caused lung cancer AND caused individuals to smoke (Doll & Hill, 1950). As a result, comparative researchers often go to great lengths to choose two different study groups that are similar in almost all respects except for the treatment in question. In fact, many comparative studies in humans are carried out on identical twins for this exact reason. For example, in the field of tobacco research, dozens of comparative twin studies have been used to examine everything from the health effects of cigarette smoke to the genetic basis of addiction.
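The "third unknown variable" possibility can be made concrete with a toy simulation: a hidden common cause produces a strong association between two variables that never influence each other. The probabilities below are invented purely for illustration.

```python
import random

random.seed(42)

# Hidden common cause Z drives both A and B; A and B never affect each other.
samples = []
for _ in range(10_000):
    z = random.random()                     # the unmeasured third variable
    a = z > 0.5 and random.random() < 0.8   # e.g. "takes up smoking"
    b = z > 0.5 and random.random() < 0.8   # e.g. "develops disease"
    samples.append((a, b))

# Compare the rate of B among subjects with and without A: a strong
# association appears even though neither variable causes the other.
rate_b_given_a = sum(b for a, b in samples if a) / sum(a for a, b in samples)
rate_b_given_not_a = sum(b for a, b in samples if not a) / sum(not a for a, b in samples)

print(f"P(B|A) = {rate_b_given_a:.2f}, P(B|not A) = {rate_b_given_not_a:.2f}")
```

An observed association of this kind is exactly what careful group matching, and twin studies in particular, are designed to rule out.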

  • Comparison in modern practice

Figure 6: The "Keeling curve," a long-term record of atmospheric CO2 concentration measured at the Mauna Loa Observatory (Keeling et al.). Although the annual oscillations represent natural, seasonal variations, the long-term increase means that concentrations are higher than they have been in 400,000 years. Graphic courtesy of NASA's Earth Observatory.

Despite the lessons learned during the debate that ensued over the possible effects of cigarette smoke, misconceptions still surround comparative science. For example, in the late 1950s, Charles Keeling, an oceanographer at the Scripps Institution of Oceanography, began to publish data he had gathered from a long-term descriptive study of atmospheric carbon dioxide (CO2) levels at the Mauna Loa observatory in Hawaii (Keeling, 1958). Keeling observed that atmospheric CO2 levels were increasing at a rapid rate (Figure 6). He and other researchers began to suspect that rising CO2 levels were associated with increasing global mean temperatures, and several comparative studies have since correlated rising CO2 levels with rising global temperature (Keeling, 1970). Together with research from modeling studies (see our Modeling in Scientific Research module), this research has provided evidence for an association between global climate change and the burning of fossil fuels (which emits CO2).

Yet in a move reminiscent of the fight launched by the tobacco companies, the oil and fossil fuel industry launched a major public relations campaign against climate change research. As late as 1989, scientists funded by the oil industry were producing reports that called the research on climate change "noisy junk science" (Roberts, 1989). As with the tobacco issue, challenges to early comparative studies tried to paint the method as less reliable than experimental methods. But the challenges actually strengthened the science by prompting more researchers to launch investigations, thus providing multiple lines of evidence supporting an association between atmospheric CO2 concentrations and climate change. Ultimately, the culmination of multiple lines of scientific evidence prompted the Intergovernmental Panel on Climate Change, organized by the United Nations, to issue a report stating that "Warming of the climate system is unequivocal" and "Carbon dioxide is the most important anthropogenic greenhouse gas" (IPCC, 2007).

Comparative studies are a critical part of the spectrum of research methods currently used in science. They allow scientists to apply a treatment-control design in settings that preclude experimentation, and they can provide invaluable information about the relationships between variables. The intense scrutiny that comparison has undergone in the public arena due to cases involving cigarettes and climate change has actually strengthened the method by clarifying its role in science and emphasizing the reliability of data obtained from these studies.


Comparative Research, Higher Education

by Anna Kosmützky

Reference work entry in The International Encyclopedia of Higher Education Systems and Institutions (Springer, Dordrecht, 2020), pp. 217–224. First online: 01 January 2020.

Synonyms: cross-cultural research; cross-national research; intercultural research; international comparative research

Within higher education, the catch-all term "comparative research" typically denotes inter- or cross-national or inter- or cross-cultural comparative research. For an overall terminology, this type of research is defined here as empirical research that collects data and/or carries out observations across national, geographical, and cultural boundaries in at least two such entities, and systematically relates those entities in a comparative analysis.

Core Characteristics

Typically, studies that compare research objects in two or more social entities are seen as the “truly” or “genuinely” comparative studies. Comparative research can, however, also be comparative over time as well as cross-sectional. In fact, many studies are comparative without being internationally comparative in nature. They compare, for example, organizations within one higher education...



Kosmützky, A. (2020). Comparative Research, Higher Education. In: Teixeira, P.N., Shin, J.C. (eds) The International Encyclopedia of Higher Education Systems and Institutions. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-8905-9_175

© 2020 Springer Nature B.V.
National Academies Press: OpenBook

On Evaluating Curricular Effectiveness: Judging the Quality of K-12 Mathematics Evaluations (2004)

Chapter 5: Comparative Studies

It is deceptively simple to imagine that a curriculum’s effectiveness could be easily determined by a single well-designed study. Such a study would randomly assign students to two treatment groups, one using the experimental materials and the other using a widely established comparative program. The students would be taught the entire curriculum, and a test administered at the end of instruction would provide unequivocal results that would permit one to identify the more effective treatment.

The truth is that conducting definitive comparative studies is not simple, and many factors make such an approach difficult. Student placement and curricular choice are decisions that involve multiple groups of decision makers, accrue over time, and are subject to day-to-day conditions of instability, including student mobility, parent preference, teacher assignment, administrator and school board decisions, and the impact of standardized testing. This complex set of institutional policies, school contexts, and individual personalities makes comparative studies, even quasi-experimental approaches, challenging, and thus demands an honest and feasible assessment of what can be expected of evaluation studies (Usiskin, 1997; Kilpatrick, 2002; Schoenfeld, 2002; Shafer, in press).

Comparative evaluation study is an evolving methodology, and our purpose in conducting this review was to evaluate and learn from the efforts undertaken so far and advise on future efforts. We stipulated the use of comparative studies as follows:

A comparative study was defined as a study in which two (or more) curricular treatments were investigated over a substantial period of time (at least one semester, and more typically an entire school year) and a comparison of various curricular outcomes was examined using statistical tests. A statistical test was required to ensure the robustness of the results relative to the study’s design.
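The statistical-test requirement above can be made concrete with a minimal, hypothetical sketch. The data below are invented for illustration, and Welch's two-sample t statistic is just one common choice; the reviewed studies used a range of tests, not necessarily this one.

```python
from statistics import mean, stdev
import math

# Invented end-of-year scores for students under two curricular treatments
# (real studies spanned at least a semester, typically a full school year).
curriculum_a = [72, 68, 75, 80, 66, 74, 71, 78, 69, 73]
curriculum_b = [70, 65, 69, 74, 62, 68, 66, 71, 64, 67]

def welch_t(x, y):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    var_x, var_y = stdev(x) ** 2, stdev(y) ** 2
    standard_error = math.sqrt(var_x / len(x) + var_y / len(y))
    return (mean(x) - mean(y)) / standard_error

t = welch_t(curriculum_a, curriculum_b)
print(round(t, 2))  # 2.8
```

A value this large relative to the degrees of freedom would lead a study to report a statistically significant difference favoring curriculum A; the point of the stipulation is that such a test, not a bare difference in means, establishes robustness relative to the design.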

We read and reviewed a set of 95 comparative studies. In this report we describe that database, analyze its results, and draw conclusions about the quality of the evaluation database both as a whole and separated into evaluations supported by the National Science Foundation and commercially generated evaluations. In addition to describing and analyzing this database, we also provide advice to those who might wish to fund or conduct future comparative evaluations of mathematics curricular effectiveness. We have concluded that the process of conducting such evaluations is in its adolescence and could benefit from careful synthesis and advice in order to increase its rigor, feasibility, and credibility. In addition, we took an interdisciplinary approach to the task, noting that various committee members brought different expertise and priorities to the consideration of what constitutes the most essential qualities of rigorous and valid experimental or quasi-experimental design in evaluation. This interdisciplinary approach has led to some interesting observations and innovations in our methodology of evaluation study review.

This chapter is organized as follows:

Study counts disaggregated by program and program type.

Seven critical decision points and identification of at least minimally methodologically adequate studies.

Definition and illustration of each decision point.

A summary of results by student achievement in relation to program types (NSF-supported, University of Chicago School Mathematics Project (UCSMP), and commercially generated) in relation to their reported outcome measures.

A list of alternative hypotheses on effectiveness.

Filters based on the critical decision points.

An analysis of results by subpopulations.

An analysis of results by content strand.

An analysis of interactions among content, equity, and grade levels.

Discussion and summary statements.

In this report, we describe our methodology for review and synthesis so that others might scrutinize our approach and offer criticism on the basis of our methodology and its connection to the results stated and conclusions drawn. In the spirit of scientific, fair, and open investigation, we welcome others to undertake similar or contrasting approaches and compare and discuss the results. Our work was limited by the short timeline set by the funding agencies resulting from the urgency of the task. Although we made multiple efforts to collect comparative studies, we apologize to any curriculum evaluators whose comparative studies were unintentionally omitted from our database.

Of these 95 comparative studies, 65 were studies of NSF-supported curricula, 27 were studies of commercially generated materials, and 3 included two curricula each from one of these two categories. To avoid the problem of double coding, two of these three studies, White et al. (1995) and Zahrt (2001), were coded as studies of NSF-supported curricula because more of the classes studied used the NSF-supported curriculum. (These two studies were not used in later analyses because they did not meet the requirements for at least minimally methodologically adequate studies, as described below.) The third, Peters (1992), compared two commercially generated curricula and was coded in that category under the primary program of focus. Therefore, of the 95 comparative studies, 67 were coded as studies of NSF-supported curricula and 28 as studies of commercially generated materials.
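The recoding arithmetic in the paragraph above can be checked in a few lines (counts taken directly from the text; variable names are ours, not the committee's coding scheme):

```python
# Study counts as reported in the text.
nsf_only, commercial_only, dual_category = 65, 27, 3
total = nsf_only + commercial_only + dual_category  # 95 studies overall

# Two of the three dual-category studies (White et al., 1995; Zahrt, 2001)
# were recoded as NSF-supported; the third (Peters, 1992) as commercial.
coded_nsf = nsf_only + 2
coded_commercial = commercial_only + 1

assert total == 95
assert coded_nsf + coded_commercial == total
print(coded_nsf, coded_commercial)  # 67 28
```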

The 11 evaluation studies of the UCSMP secondary program that we reviewed, not including White et al. and Zahrt as previously mentioned, benefit from the maturity of the program, while demonstrating an orientation to both establishing effectiveness and improving a product line. For these reasons, at times we will present the summary of UCSMP’s data separately.

The Saxon materials also present a somewhat different profile from the other commercially generated materials because many of the evaluations of these materials were conducted in the 1980s and the materials were originally developed with a rather atypical program theory. Saxon (1981) designed the algebra materials to combine distributed practice with incremental development. We selected the Saxon materials as a middle grades commercially generated program, and limited our review to middle school studies from 1989 onward, when the first National Council of Teachers of Mathematics (NCTM) Standards (NCTM, 1989) were released. This limited concerns that the materials or the conditions of educational practice had been altered during the intervening time period. The Saxon materials explicitly do not draw from the NCTM Standards, nor did they receive support from the NSF; thus they truly represent a commercial venture. As a result, we categorized the Saxon studies within the group of studies of commercial materials.

FIGURE 5-1 The distribution of comparative studies across programs. Programs are coded by grade band: black bars = elementary, white bars = middle grades, and gray bars = secondary. In this figure, there are six studies that involved two programs and one study that involved three programs.

NOTE: Five programs (MathScape, MMAP, MMOW/ARISE, Addison-Wesley, and Harcourt) are not shown above since no comparative studies were reviewed.

At times in this report, we describe characteristics of the database by particular curricular program evaluations, in which case all 19 programs are listed separately. At other times, when we seek to inform ourselves on policy-related issues of funding and evaluating curricular materials, we use the NSF-supported, commercially generated, and UCSMP distinctions. We remind the reader of the artificial aspects of this distinction because, at the present time, 18 of the 19 curricula are published commercially. The distinction among the three categories is drawn in order to track the question of historical inception and policy implications. Figure 5-1 shows the distribution of comparative studies across the 14 programs.

The first result the committee wishes to report is the uneven distribution of studies across the curricular programs. There were 67 coded studies of the NSF curricula, 11 studies of UCSMP, and 17 studies of the commercial publishers. The 14 evaluation studies conducted on the Saxon materials compose the bulk of these 17 non-UCSMP, non-NSF-supported curricular evaluation studies. As these results suggest, we know more about the evaluations of the NSF-supported curricula and UCSMP than about the evaluations of the commercial programs. We suggest that three factors account for this uneven distribution of studies. First, evaluations have been funded by the NSF both as part of the original call and as follow-up to the work, in the case of three supplemental awards to two of the curricular programs. Second, most NSF-supported programs and UCSMP were developed at university sites where there is access to the resources of graduate students and research staff. Finally, there was some reported reluctance on the part of commercial companies to release studies that could affect perceptions of competitive advantage. As Figure 5-1 shows, there were quite a few comparative studies of Everyday Mathematics (EM), Connected Mathematics Project (CMP), Contemporary Mathematics in Context (Core-Plus Mathematics Project [CPMP]), Interactive Mathematics Program (IMP), UCSMP, and Saxon.

In the programs with many studies, we note that a significant number of studies were generated by a core set of authors. In some cases, the evaluation reports follow a relatively uniform structure applied to single schools, generating multiple studies or following cohorts over years. Others use a standardized evaluation approach to evaluate sequential courses. Any reports duplicating exactly the same sample, outcome measures, or forms of analysis were eliminated. For example, one study of Mathematics Trailblazers (Carter et al., 2002) reanalyzed the data from the larger ARC Implementation Center study (Sconiers et al., 2002), so it was not included separately. Synthesis studies referencing a variety of evaluation reports are summarized in Chapter 6 , but relevant individual studies that were referenced in them were sought out and included in this comparative review.

Other less formal comparative studies are conducted regularly at the school or district level, but such studies were not included in this review unless we could obtain formal reports of their results, and the studies met the criteria outlined for inclusion in our database. In our conclusions, we address the issue of how to collect such data more systematically at the district or state level in order to subject the data to the standards of scholarly peer review and make it more systematically and fairly a part of the national database on curricular effectiveness.

A standard for evaluation of any social program holds that an impact assessment is warranted only if two conditions are met: (1) the curricular program is clearly specified, and (2) the intervention is well implemented. Absent this assurance, one must have a means of ensuring or measuring treatment integrity in order to make causal inferences. Rossi et al. (1999, p. 238) warned that:

two prerequisites [must exist] for assessing the impact of an intervention. First, the program’s objectives must be sufficiently well articulated to make it possible to specify credible measures of the expected outcomes, or the evaluator must be able to establish such a set of measurable outcomes. Second, the intervention should be sufficiently well implemented that there is no question that its critical elements have been delivered to appropriate targets. It would be a waste of time, effort, and resources to attempt to estimate the impact of a program that lacks measurable outcomes or that has not been properly implemented. An important implication of this last consideration is that interventions should be evaluated for impact only when they have been in place long enough to have ironed out implementation problems.

These same conditions apply to the evaluation of mathematics curricula. The comparative studies in this report varied in the quality of documentation of these two conditions; however, all addressed them to some degree. By reviewing the studies, we initially identified one general design template consisting of seven critical decision points, and determined that it could be used to develop a framework for conducting our meta-analysis. The seven critical decision points we identified were:

Choice of type of design: experimental or quasi-experimental;

For those studies that do not use random assignment: what methods of establishing comparability of groups were built into the design—this includes student characteristics, teacher characteristics, and the extent to which professional development was involved as part of the definition of a curriculum;

Definition of the appropriate unit of analysis (students, classes, teachers, schools, or districts);

Inclusion of an examination of implementation components;

Definition of the outcome measures and disaggregated results by program;

The choice of statistical tests, including statistical significance levels and effect size; and

Recognition of limitations to generalizability resulting from design choices.

These are critical decisions that affect the quality of an evaluation. We further identified a subset of these evaluation studies that met a set of minimum conditions that we termed at least minimally methodologically adequate studies. Such studies are those with the greatest likelihood of shedding light on the effectiveness of these programs. To be classified as at least minimally methodologically adequate, and therefore to be considered for further analysis, each evaluation study was required to:

Include quantifiably measurable outcomes such as test scores, responses to specified cognitive tasks of mathematical reasoning, performance evaluations, grades, and subsequent course taking; and

Provide adequate information to judge the comparability of samples. In addition, a study must have included at least one of the following additional design elements:

A report of implementation fidelity or professional development activity;

Results disaggregated by content strands or by performance by student subgroups; and/or

Multiple outcome measures or precise theoretical analysis of a measured construct, such as number sense, proof, or proportional reasoning.
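As a sketch of how this rubric acts as a filter, the logic reduces to two required conditions plus at least one of three additional design elements. The field names below are ours for illustration, not the committee's actual coding scheme:

```python
def at_least_minimally_adequate(study: dict) -> bool:
    """Both required conditions must hold, plus at least one of the
    three additional design elements named in the rubric."""
    required = (study["measurable_outcomes"]
                and study["sample_comparability_info"])
    additional = (study["implementation_fidelity_report"]
                  or study["disaggregated_results"]
                  or study["multiple_or_theorized_measures"])
    return required and additional

# A hypothetical coded study: comparable samples and test-score outcomes,
# with results disaggregated by content strand.
example = {
    "measurable_outcomes": True,
    "sample_comparability_info": True,
    "implementation_fidelity_report": False,
    "disaggregated_results": True,
    "multiple_or_theorized_measures": False,
}
print(at_least_minimally_adequate(example))  # True
```

Note that a study with measurable outcomes and comparable samples but none of the three additional elements would still be excluded, which is exactly how 19 of the eliminated studies fell out of the pool.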

Using this rubric, the committee identified a subset of 63 comparative studies to classify as at least minimally methodologically adequate and to analyze in depth to inform the conduct of future evaluations. There are those who would argue that any threat to the validity of a study discredits the findings, thus claiming that until we know everything, we know nothing. Others would claim that from the myriad of studies, examining patterns of effects and patterns of variation, one can learn a great deal, perhaps tentatively, about programs and their possible effects. More importantly, we can learn about methodologies and how to concentrate and focus to increase the likelihood of learning more quickly. As Lipsey (1997, p. 22) wrote:

In the long run, our most useful and informative contribution to program managers and policy makers and even to the evaluation profession itself may be the consolidation of our piecemeal knowledge into broader pictures of the program and policy spaces at issue, rather than individual studies of particular programs.

We do not wish to imply that we devalue studies of student affect or conceptions of mathematics, but we decided that unless these indicators were connected to direct indicators of student learning, we would eliminate them from further study. As a result of this sorting, we eliminated 19 studies of NSF-supported curricula and 13 studies of commercially generated curricula. Of these, 4 were eliminated for their sole focus on affect or conceptions, 3 were eliminated for their comparative focus on outcomes other than achievement, such as teacher-related variables, and 19 were eliminated for their failure to meet the minimum additional characteristics specified in the criteria above. In addition, six others were excluded from the studies of commercial materials because they were not conducted within the grade-level band specified by the committee for the selection of that program. From this point onward, all references can be assumed to refer to at least minimally methodologically adequate studies, unless a study is referenced for illustration, in which case we label it with “EX” to indicate that it is excluded from the summary analyses. Studies labeled “EX” are occasionally referenced because they can provide useful information on certain aspects of curricular evaluation, but not on overall effectiveness.

The at least minimally methodologically adequate studies reported on a variety of grade levels. Figure 5-2 shows the different grade levels of the studies. At times, the choice of grade levels was dictated by the years in which high-stakes tests were given. Most of the studies reported on multiple grade levels, as shown in Figure 5-2 .

FIGURE 5-2 Single-grade studies by grade and multigrade studies by grade band.

Using the seven critical design elements of at least minimally methodologically adequate studies as a design template, we describe the overall database and discuss the array of choices on critical decision points with examples. Following that, we report the results of the at least minimally methodologically adequate studies by program type. To do so, the results of each study were coded as either statistically significant or not. Those studies that contained statistically significant results were assigned a percentage of outcomes that are positive (in favor of the treatment curriculum), based on the number of statistically significant comparisons reported relative to the total number of comparisons reported, and a percentage of outcomes that are negative (in favor of the comparison curriculum). The remaining comparisons were coded as the percentage of outcomes that are nonsignificant. Then, using the seven critical decision points as filters, we identified and examined more closely the sets of studies that exhibited the strongest designs and would therefore be most likely to increase our confidence in the validity of the evaluation. In the last section, we consider alternative hypotheses that could explain the results.
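This outcome-coding scheme reduces each study to three percentages. A minimal sketch, with the comparison counts invented for illustration:

```python
def code_outcomes(n_positive, n_negative, n_total):
    """Reduce a study's reported comparisons to the three percentages
    used in this chapter: positive (favoring the treatment curriculum),
    negative (favoring the comparison curriculum), and nonsignificant."""
    positive = n_positive / n_total
    negative = n_negative / n_total
    nonsignificant = 1.0 - positive - negative
    return positive, negative, nonsignificant

# A hypothetical study reporting 12 comparisons: 6 significantly favored
# the treatment curriculum and 2 significantly favored the comparison.
pos, neg, nonsig = code_outcomes(6, 2, 12)
print(f"{pos:.0%} positive, {neg:.0%} negative, {nonsig:.0%} nonsignificant")
```

The three percentages always sum to one, so a study's profile can be read as a single partition of its reported comparisons.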

The committee emphasizes that we did not directly evaluate the materials. We present no analysis of results aggregated across studies by naming individual curricular programs because we did not consider the magnitude or rigor of the database for individual programs substantial enough to do so. Nevertheless, there are studies that provide compelling data concerning the effectiveness of a program in a particular context. Furthermore, while we do report on individual studies and their results to highlight issues of approach and methodology, we do not summarize the results of individual programs, in order to remain within our primary charge, which was to evaluate the evaluations.

DESCRIPTION OF COMPARATIVE STUDIES DATABASE ON CRITICAL DECISION POINTS

An experimental or quasi-experimental design.

We separated the studies into experimental and quasi-experimental, and found that 100 percent of the studies were quasi-experimental (Campbell and Stanley, 1966; Cook and Campbell, 1979; Rossi et al., 1999). 1 Within the quasi-experimental studies, we identified three subcategories of comparative study. In the first case, we identified a study as cross-curricular comparative if it compared the results of curriculum A with curriculum B. A few studies in this category also compared two samples within the same curriculum, specifying different conditions such as high and low implementation quality.

A second category of quasi-experimental study involved comparisons that could shed light on effectiveness through time-series studies. These studies compared the performance of a sample of students in a curriculum under investigation across time, such as in a longitudinal study of the same students over time. A third category of comparative study involved a comparison to some form of externally normed results, such as populations taking state, national, or international tests, or prior research assessments from a published study or studies. We categorized these studies, divided them into NSF, UCSMP, and commercial, and labeled them by the categories above (Figure 5-3).

FIGURE 5-3 The number of comparative studies in each category.

In nearly all studies in the comparative group, the titles of experimental curricula were explicitly identified. The only exception was the ARC Implementation Center study (Sconiers et al., 2002), in which three NSF-supported elementary curricula were examined but their effects were pooled in the results. In contrast, in the majority of cases, the comparison curriculum is referred to simply as “traditional.” In only 22 cases were comparisons made between two identified curricula. Many others surveyed the array of curricula at comparison schools and reported on the most frequently used, but did not identify a single curriculum. This design strategy is used often because other factors were used in selecting comparison groups, and the additional requirement of a single identified curriculum at these sites would often make matching difficult. Studies were categorized into specified (including a single or multiple identified curricula) and nonspecified curricula. In the 63 studies, the treatment group was compared to an NSF-supported curriculum (1), an unnamed traditional curriculum (41), a named traditional curriculum (19), or one of the six commercial curricula (2). To our knowledge, any systematic impact of such a decision on results has not been studied, but we express concern that when a specified curriculum is compared to an unspecified collection of possibly many informal curricula, the comparison may favor the coherency and consistency of the single specified curriculum; we consider this possibility subsequently under alternative hypotheses. We believe that a quality study should at least report the array of curricula that comprise the comparison group and include a measure of the frequency of use of each, but a well-defined alternative is more desirable.

If a study was both longitudinal and comparative, then it was coded as comparative. When studies only examined performances of a group over time, such as in some longitudinal studies, they were coded as quasi-experimental normed. In longitudinal studies, the problems created by student mobility were evident. In one study, Carroll (2001), a five-year longitudinal study of Everyday Mathematics, the sample began with 500 students, 24 classrooms, and 11 schools. By 2nd grade, the longitudinal sample was 343 students. By 3rd grade, the number of classes increased to 29 while the number of original students decreased to 236. At the completion of the study, approximately 170 of the original students were still in the sample. This high rate of attrition suggests that mobility is a major challenge in curricular evaluation, and that the effects of curricular change on mobile students need to be studied as a potential threat to the validity of the comparison. Mobility is also a challenge for curriculum implementation because students coming into a program do not experience its cumulative, developmental effect.
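The attrition figures reported for the Carroll (2001) study translate into a simple retention curve (sample sizes taken from the text; the wave labels are ours):

```python
# Longitudinal sample sizes reported for the Carroll (2001) EM study.
sample_by_wave = {"start": 500, "2nd grade": 343,
                  "3rd grade": 236, "study end": 170}

start = sample_by_wave["start"]
for wave, n in sample_by_wave.items():
    # Fraction of the original cohort still present at each wave.
    print(f"{wave}: n={n} ({n / start:.0%} of original sample)")
```

Roughly two-thirds of the original cohort is gone by the end of the study, which is the scale of attrition that makes an uncorrected end-of-study comparison hard to interpret.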

Longitudinal studies also have unique challenges associated with outcome measures; a study by Romberg et al. (in press) (EX) discussed one approach to this problem. In this study, an external assessment system and a problem-solving assessment system were used. In the External Assessment System, items from the National Assessment of Educational Progress (NAEP) and the Third International Mathematics and Science Study (TIMSS) were balanced across four strands (number, geometry, algebra, probability and statistics), and 20 items of moderate difficulty, called anchor items, were repeated on each grade-specific assessment (p. 8). Because the analyses of the results were still under way, the evaluators could not provide us with final results of this study, so it is coded as EX.

However, such longitudinal studies can provide substantial evidence of the effects of a curricular program because they may be more sensitive to an accumulation of modest effects and/or can reveal whether rates of learning change over time within curricular change.

TABLE 5-1 Scores in Percentage Correct by Everyday Mathematics Students and Various Comparison Groups Over a Five-Year Longitudinal Study

The longitudinal study by Carroll (2001) showed that the effects of curricula may often accrue over time, but measurements of achievement present challenges to drawing such conclusions as the content and grade level change. A variety of measures were used over time to demonstrate growth in relation to comparison groups. The author chose a set of measures used previously in studies involving two Asian samples and an American sample to provide a contrast to the students in EM over time. For 3rd and 4th grades, where the data from the comparison group were not available, the authors selected items from the NAEP to bridge the gap. Table 5-1 summarizes the scores of the different comparative groups over five years. Scores are reported as the mean percentage correct for a series of tests on number computation, number concepts and applications, geometry, measurement, and data analysis.

It is difficult to compare performances of different groups on different tests over time against a single longitudinal group from EM, and it is not possible to determine whether the students’ performance is increasing or whether the changes in the tests at each grade level are producing the results. Thus the results from longitudinal studies lacking a control group, or lacking sophisticated methodological analysis, may be suspect and should be interpreted with caution.

In the Hirsch and Schoen (2002) study, based on a sample of 1,457 students, scores on the Ability to Do Quantitative Thinking (ITED-Q) subtest of the Iowa Tests of Educational Development showed that students in Core-Plus exhibited increasing performance over national norms over the three-year time period. The authors describe the content of the ITED-Q test and point out that “although very little symbolic algebra is required, the ITED-Q is quite demanding for the full range of high school students” (p. 3). They further point out that “[t]his 3-year pattern is consistent, on average, in rural, urban, and suburban schools, for males and females, for various minority groups, and for students for whom English was not their first language” (p. 4). In this case, one sees that studies over time are important, as results over shorter periods may mask cumulative effects of consistent and coherent treatments; such studies could also show increases that do not persist over longer trajectories. One approach to longitudinal studies was used by Webb and Dowling in their studies of the Interactive Mathematics Program (Webb and Dowling, 1995a, 1995b, 1995c). These researchers conducted transcript analyses as a means to examine student persistence and success in subsequent course taking.

The third category of quasi-experimental comparative studies measured student outcomes on a particular curricular program and simply compared them to performance on national or international tests. When these tests were of good quality and drew on a genuinely representative sample of a relevant population, as with NAEP reports or TIMSS results, they often provided a reasonable indicator of the effects of the program when combined with a careful description of the sample. Sometimes the national or state tests used were norm-referenced tests producing national percentiles or grade-level equivalents. These normed studies were considered of weaker quality in establishing effectiveness, but were still considered valid as examples of comparing samples to populations.

For Studies That Do Not Use Random Assignment: What Methods of Establishing Comparability Across Groups Were Built into the Design

The most fundamental question in an evaluation study is whether the treatment has had an effect on the chosen criterion variable. In our context, the treatment is the curriculum materials (and, in some cases, related professional development), and the outcome of interest is academic learning. To establish whether there is a treatment effect, one must logically rule out as many other explanations as possible for the differences in the outcome variable. There is a long tradition on how this is best done, and the principle from a design point of view is to assure that there are no differences between the treatment conditions (in these evaluations, often just the new curriculum materials versus a control group), either at the outset of the study or during its conduct.

To ensure the first condition, the ideal procedure is the random assignment of the appropriate units to the treatment conditions. The second condition requires that the treatment is administered reliably over the length of the study, and is assured through careful observation and control of the situation. Without randomization, there are a host of possible confounding variables that could differ among the treatment conditions and that are themselves related to the outcome variables. Put another way, the treatment effect is a parameter that the study is set up to estimate. Statistically, an unbiased estimate is desired: one whose expected value over repeated samplings is equal to the true value of the parameter. Without randomization at the onset of a study, there is no way to assure this property of unbiasedness. The variables that differ across treatment conditions and are related to the outcomes are confounding variables, which bias the estimation process.
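The bias described here can be made concrete with a small simulation using entirely invented numbers: when prior achievement both drives group membership and influences the outcome, the naive group-difference estimate no longer recovers the true treatment effect.

```python
import random

random.seed(1)
TRUE_EFFECT = 5.0  # points the hypothetical treatment actually adds

def estimate_effect(randomized: bool, n: int = 4000) -> float:
    """Naive estimate: mean(treated outcomes) - mean(control outcomes)."""
    treated_scores, control_scores = [], []
    for _ in range(n):
        prior = random.gauss(50, 10)         # confounder: prior achievement
        if randomized:
            treated = random.random() < 0.5  # coin-flip assignment
        else:
            treated = prior > 50             # stronger students get treatment
        score = prior + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 5)
        (treated_scores if treated else control_scores).append(score)
    return (sum(treated_scores) / len(treated_scores)
            - sum(control_scores) / len(control_scores))

print(round(estimate_effect(randomized=True), 1))   # close to the true 5.0
print(round(estimate_effect(randomized=False), 1))  # badly inflated
```

Under random assignment the estimate converges on the true effect; under achievement-driven selection the prior-achievement gap between groups is absorbed into the estimate, which is precisely the bias the committee's comparability filters are meant to catch.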

Only one study we reviewed, Peters (1992), used randomization in the assignment of students to treatments, but because the study was limited to one teacher teaching two sections and included substantial qualitative methods, we coded it as quasi-experimental. Others report partially assigning teachers randomly to treatment conditions (Thompson et al., 2001; Thompson et al., 2003). Two primary reasons seem to account for the lack of pure experimental designs. First, to justify the conduct and expense of a randomized field trial, the program must be described adequately and there must be relative assurance that its implementation has occurred over the duration of the experiment (Peterson et al., 1999). Additionally, one must be sure that the outcome measures are appropriate for the range of performances in the groups and valid relative to the curricula under investigation. Seldom can such conditions be assured for all students and teachers over the duration of a year or more.

A second reason is that random assignment of classrooms to curricular treatment groups typically is not permitted or encouraged under normal school conditions. As one evaluator wrote, “Building or district administrators typically identified teachers who would be in the study and in only a few cases was random assignment of teachers to UCSMP Algebra or comparison classes possible. School scheduling and teacher preference were more important factors to administrators and at the risk of losing potential sites, we did not insist on randomization” (Mathison et al., 1989, p. 11).

The Joint Committee on Standards for Educational Evaluation (1994, p. 165) recognized the likelihood of limitations on randomization, writing:

The groups being compared are seldom formed by random assignment. Rather, they tend to be natural groupings that are likely to differ in various ways. Analytical methods may be used to adjust for these initial differences, but these methods are based upon a number of assumptions. As it is often difficult to check such assumptions, it is advisable, when time and resources permit, to use several different methods of analysis to determine whether a replicable pattern of results is obtained.

Does the dearth of pure experimentation render the results of the studies reviewed worthless? Bias is not an “either-or” proposition but a quantity of varying degrees. Through careful measurement of the most salient potential confounding variables, precise theoretical description of constructs, and appropriate methods of statistical analysis, it is possible to reduce the amount of bias in the estimated treatment effect. Identifying the most likely confounding variables, measuring them, and making subsequent adjustments can greatly reduce bias and help estimate an effect that is more reflective of the true value. A theoretically fully specified model is an alternative to randomization: by including all relevant variables, it allows unbiased estimation of the parameter. The only problem is knowing when the model is fully specified.

We recognized that we can never have enough knowledge to assure a fully specified model, especially in the complex and unstable conditions of schools. However, a key issue in determining the degree of confidence we have in these evaluations is to examine how they have identified, measured, or controlled for such confounding variables. In the next sections, we report on the methods of the evaluators in identifying and adjusting for such potential confounding variables.

One method of eliminating confounding variables is to examine the extent to which the samples investigated are equated, either by sample selection or by methods of statistical adjustment. For individual students, there is a large literature suggesting the importance of social class to achievement. In addition, the prior achievement of students must be considered. In the comparative studies, investigators first identified districts, schools, or classes that could provide sufficient duration of use of curricular materials (typically two years or more), availability of target classes, and adequate levels of use of program materials. Establishing comparability was a secondary concern.

These two major factors were generally used in establishing the comparability of the sample:

Student population characteristics, such as demographic characteristics of students in terms of race/ethnicity, economic levels, or location type (urban, suburban, or rural).

Performance-level characteristics such as performance on prior tests, pretest performance, percentage passing standardized tests, or related measures (e.g., problem solving, reading).

In general, four methods of comparing groups were used in the studies we examined, and they permit different degrees of confidence in their results. In the first type, a matching class, school, or district was identified.

Studies were coded as this type if specified characteristics were used to select the schools systematically. In some of these studies, the methodology was relatively complex: correlates of performance on the outcome measures were found empirically and matches were created on that basis (Schneider, 2000; Riordan and Noyce, 2001; Sconiers et al., 2002). For example, in the Sconiers et al. study, where the total sample of more than 100,000 students was drawn from five states and three elementary curricula were reviewed (Everyday Mathematics, Math Trailblazers [MT], and Investigations [IN]), a highly systematic method was developed. After defining eligibility as a “reform school,” evaluators conducted separate regression analyses for the five states at each tested grade level to identify the strongest predictors of average school mathematics score. They reported, “reading score and low-income variables … consistently accounted for the greatest percentage of total variance. These variables were given the greatest weight in the matching process. Other variables—such as percent white, school mobility rate, and percent with limited English proficiency (LEP)—accounted for little of the total variance but were typically significant. These variables were given less weight in the matching process” (Sconiers et al., 2002, p. 10). To further provide a fair and complete comparison, adjustments were made based on regression analysis of the scores to minimize bias prior to calculating the difference in scores and reporting effect sizes. In their results the evaluators report, “The combined state-grade effect sizes for math and total are virtually identical and correspond to a percentile change of about 4 percent favoring the reform students” (p. 12).
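The weighted matching described above can be sketched as a nearest-neighbor search over school covariates. The weights and school profiles below are invented for illustration; they mimic, rather than reproduce, the regression-derived weights in Sconiers et al.

```python
# hypothetical covariate weights: reading and low-income dominate, as the
# evaluators' regressions suggested, with less weight on minor predictors
WEIGHTS = {"reading": 0.50, "low_income": 0.35, "mobility": 0.15}

def weighted_distance(a, b):
    """Weighted absolute distance between two schools' covariate profiles."""
    return sum(w * abs(a[k] - b[k]) for k, w in WEIGHTS.items())

def best_match(reform_school, candidates):
    """Choose the comparison school minimizing the weighted distance."""
    return min(candidates, key=lambda c: weighted_distance(reform_school, c))

reform = {"name": "R1", "reading": 0.60, "low_income": 0.40, "mobility": 0.10}
pool = [
    {"name": "C1", "reading": 0.58, "low_income": 0.42, "mobility": 0.12},
    {"name": "C2", "reading": 0.80, "low_income": 0.10, "mobility": 0.05},
]
match = best_match(reform, pool)  # C1: far closer on the heavily weighted covariates
```

Giving the strongest achievement predictors the largest weights means a candidate school that diverges on reading or poverty is penalized far more than one that diverges on a minor covariate.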

A second type of matching procedure was used in the UCSMP evaluations. For example, in an evaluation centered on geometry learning, evaluators advertised in NCTM and UCSMP publications and set conditions for participation from schools using their program in terms of length of use and grade level. After selecting schools with heterogeneous grouping and no tracking, the researchers used a matched-pair design in which they selected classes from the same school on the basis of mathematics ability. They used a pretest to determine this, and because the pretest consisted of two parts, they adjusted their significance level using the Bonferroni method.2 Pairs were discarded if the differences in means and variance were significant for all students or for those students completing all measures, or if class sizes became too variable. In the algebra study, the matching produced 20 pairs; in the comparison study relevant to this review, which compared three experimental conditions—first edition, second edition, and comparison classes—the matching procedure identified 8 pairs. When possible, teachers were assigned randomly to treatment conditions. Most results are presented for the eight identified pairs along with an accumulated set of means. The outcomes of this particular study are described below in a discussion of outcome measures (Thompson et al., 2003).
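The Bonferroni step can be sketched in a few lines: with a two-part pretest there are two comparisons per pair, so each is tested at half the family significance level. The function names below are ours, not the evaluators'.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-comparison significance level under a Bonferroni correction."""
    return family_alpha / n_tests

def pair_differs(p_values, family_alpha=0.05):
    """True if any comparison is significant at the adjusted level, i.e.,
    the matched pair is not comparable and would be discarded."""
    cutoff = bonferroni_alpha(family_alpha, len(p_values))
    return any(p < cutoff for p in p_values)
```

For a two-part pretest, each part is tested at 0.05 / 2 = 0.025, so a pair whose two comparisons yield p-values of 0.03 and 0.04 would be retained.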

A third method was to measure factors such as prior performance or socioeconomic status (SES) through pretesting, and then to use analysis of covariance or multiple regression in the subsequent analysis to factor in the variance associated with these factors. These studies were coded as “control.” A number of studies of the Saxon curricula used this method. For example, Rentschler (1995) conducted a study of Saxon 76 compared to Silver Burdett with 7th graders in West Virginia. He reported that the groups differed significantly: the control classes had 65 percent of students on free and reduced-price lunch programs compared to 55 percent in the experimental condition. He used scores on the California Test of Basic Skills mathematics computation and mathematics concepts and applications as his pretest scores and found significant differences in favor of the experimental group. His posttest scores showed the Saxon experimental group outperforming the control group on both computation and concepts and applications. Using analysis of covariance, the computation difference in favor of the experimental group was statistically significant; however, the difference in concepts and applications was adjusted to show no significant difference at the p < .05 level.
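The logic of covariance adjustment can be sketched with invented data: regress the posttest on the pretest across all students, then compare mean residuals by group. This is a simplified residual-based stand-in for ANCOVA, not Rentschler's actual analysis.

```python
import statistics

def linreg(x, y):
    """Ordinary least-squares slope and intercept."""
    mx, my = statistics.mean(x), statistics.mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

def adjusted_difference(pre, post, group):
    """Mean residual difference (treatment minus control) after regressing
    posttest on pretest, i.e., a covariate-adjusted treatment effect."""
    slope, intercept = linreg(pre, post)
    resid = [p - (intercept + slope * q) for q, p in zip(pre, post)]
    treat = [r for r, g in zip(resid, group) if g == 1]
    ctrl = [r for r, g in zip(resid, group) if g == 0]
    return statistics.mean(treat) - statistics.mean(ctrl)

# invented scores: every student gains exactly 5 points over the pretest
pre = [60, 70, 80, 40, 50, 60]
post = [65, 75, 85, 45, 55, 65]
group = [1, 1, 1, 0, 0, 0]

raw = statistics.mean(post[:3]) - statistics.mean(post[3:])  # 20.0, inflated by the pretest gap
adjusted = adjusted_difference(pre, post, group)             # 0.0 after adjustment
```

Here the raw posttest difference of 20 points vanishes once the pretest is controlled, exactly the kind of reversal Rentschler observed for concepts and applications.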

A fourth method was noted in studies that used less rigorous methods of sample selection and comparison of prior achievement or similar demographics. These studies were coded as “compare.” Typically, there was no explicit procedure to decide whether the comparison was good enough. In some of the studies, it appeared that the comparison was not used as a means of selection, but rather as a more informal device to convince the reader of the plausibility of the equivalence of the groups. Clearly, the studies that used a more precise method of selection were more likely to produce results in which one can have greater confidence.

Definition of Unit of Analysis

A major decision in forming an evaluation design is the unit of analysis. The unit of selection or randomization used to assign elements to treatment and control groups is closely linked to the unit of analysis. As noted in the National Research Council (NRC) report (1992, p. 21):

If one carries out the assignment of treatments at the level of schools, then that is the level that can be justified for causal analysis. To analyze the results at the student level is to introduce a new, nonrandomized level into the study, and it raises the same issues as does the nonrandomized observational study…. The implications … are twofold. First, it is advisable to use randomization at the level at which units are most naturally manipulated. Second, when the unit of observation is at a “lower” level of aggregation than the unit of randomization, then for many purposes the data need to be aggregated in some appropriate fashion to provide a measure that can be analyzed at the level of assignment. Such aggregation may be as simple as a summary statistic or as complex as a context-specific model for association among lower-level observations.

In many studies, inadequate attention was paid to the fact that the unit of selection would later become the unit of analysis. For most curriculum evaluations, the unit of analysis needs to be at least the classroom, if not the school or even the district. The units must be independently responding units, because instruction is a group process: students are not independent, and the classroom—even when teachers work together on instruction within a school—is not entirely independent, so the school may be the appropriate unit. Care needed to be taken to ensure that an adequate number of units would be available to provide sufficient statistical power to detect important differences.

A curriculum is experienced by students in a group, and this implies that individual student responses and what they learn are correlated. As a result, the appropriate unit of assignment and analysis must at least be defined at the classroom or teacher level. Other researchers (Bryk et al., 1993) suggest that the unit might be better selected at an even higher level of aggregation. The school itself provides a culture in which the curriculum is enacted as it is influenced by the policies and assignments of the principal, by the professional interactions and governance exhibited by the teachers as a group, and by the community in which the school resides. This would imply that the school might be the appropriate unit of analysis. Even further, to the extent that such decisions about curriculum are made at the district level and supported through resources and professional development at that level, the appropriate unit could arguably be the district. On a more practical level, we found that arguments can be made for a variety of decisions on the selection of units, and what is most essential is to make a clear argument for one’s choice, to use the same unit in the analysis as in the sample selection process, and to recognize the potential limits to generalization that result from one’s decisions.
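In practice, honoring the unit of assignment means collapsing student records to summary statistics at that level before analysis, the "simple summary statistic" the NRC quote mentions. A minimal sketch, with invented records:

```python
import statistics

# invented student-level records: (school, classroom, score)
records = [
    ("A", 1, 72), ("A", 1, 68), ("A", 2, 80), ("A", 2, 84),
    ("B", 3, 60), ("B", 3, 64), ("B", 4, 70), ("B", 4, 66),
]

def aggregate(records, level):
    """Collapse student scores to means at the unit of assignment
    ('classroom' or 'school') before any between-group analysis."""
    groups = {}
    for school, classroom, score in records:
        key = school if level == "school" else (school, classroom)
        groups.setdefault(key, []).append(score)
    return {k: statistics.mean(v) for k, v in groups.items()}

by_classroom = aggregate(records, "classroom")  # 4 units
by_school = aggregate(records, "school")        # 2 units
```

Note the trade-off the text describes: aggregating to the school level respects the dependence among students but leaves far fewer units, which is why statistical power must be checked at the unit of analysis, not at the student count.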

We would argue that in all cases the report of how sites were selected must be explicit in the evaluation report. For example, one set of evaluation studies (UCSMP) selected sites by advertisements in a journal distributed by the program and in NCTM journals (Thompson et al., 2001; Thompson et al., 2003). The samples in their studies tended to be affluent, suburban, and predominantly white populations. Other conditions of inclusion, such as frequency of use, also might have influenced this outcome, but it is important that over a set of studies on effectiveness, all populations of students be adequately sampled. When a study is not randomized, adjustments for these confounding variables should be included. In our analysis of equity, we report on concerns about the representativeness of the overall samples and their impact on the generalizability of the results.

Implementation Components

The complexity of doing research on curricular materials introduces a number of possible confounding variables. Because of the documented complexity of curricular implementation, most comparative study evaluators attempt to monitor implementation in some fashion. A valuable outcome of a well-conducted evaluation is to determine not only whether the experimental curriculum could ideally have a positive impact on learning, but whether it can survive or thrive in the highly variable conditions of schooling across sites. It is essential to know what the treatment was, whether it occurred, and, if so, with what intensity, fidelity, duration, and quality. In our model in Chapter 3, these factors were referred to as “implementation components.” Measuring implementation can be costly for large-scale comparative studies; however, many researchers have shown that variation in implementation is a key factor in determining effectiveness. In coding the comparative studies, we identified three types of components that help document the character of the treatment: implementation fidelity, professional development treatments, and attention to teacher effects.

Implementation Fidelity

Implementation fidelity is a measure of the basic extent of use of the curricular materials. It does not address issues of instructional quality. In some studies, implementation fidelity is synonymous with “opportunity to learn.” In examining implementation fidelity, a variety of data were reported, including, most frequently, the extent of coverage of the curricular material, the consistency of the instructional approach to content in relation to the program’s theory, reports of pedagogical techniques, and the length of use of the curricula at the sample sites. Other less frequently used approaches documented the calendar of curricular coverage, requested teacher feedback by textbook chapter, conducted student surveys, and gauged homework policies, use of technology, and other particular program elements. Interviews with teachers and students, classroom surveys, and observations were the most frequently used data-gathering techniques. Classroom observations were conducted infrequently in these studies, except in cases when comparative studies were combined with case studies, typically with small numbers of schools and classes where observations were conducted for long or frequent time periods. In our analysis, we coded only the presence or absence of one or more of these methods.

If the extent of implementation was used in interpreting the results, then we classified the study as having adjusted for implementation differences. Across all 63 at least minimally methodologically adequate studies, 44 percent reported some type of implementation fidelity measure, 3 percent reported it and adjusted for it in interpreting their outcome measures, and 53 percent recorded no information on this issue. Differences by study type (NSF, UCSMP, and commercially generated) showed variation on this issue: 46 percent of NSF studies reported or adjusted for implementation, as did 75 percent of UCSMP studies, but only 11 percent of the other studies of commercial materials. Of the commercial, non-UCSMP studies included, only one reported on implementation. Possibly the evaluators of the NSF and UCSMP Secondary programs recognized more clearly that their programs demanded significant changes in practice that could affect their outcomes and could pose challenges to the teachers assigned to them.

A study by Abrams (1989) (EX)3 on the use of Saxon algebra by ninth graders showed that concerns for implementation fidelity extend to all curricula, even those like Saxon whose methods may seem more likely to be consistent with common practice. Abrams wrote, “It was not the intent of this study to determine the effectiveness of the Saxon text when used as Saxon suggests, but rather to determine the effect of the text as it is being used in the classroom situations. However, one aspect of the research was to identify how the text is being taught, and how closely teachers adhere to its content and the recommended presentation” (p. 7). Her findings showed that for the 9 teachers and 300 students, treatment effects favoring the traditional group (using Dolciani’s Algebra I textbook, Houghton Mifflin, 1980) were found on the algebra test, the algebra knowledge/skills subtest, and the problem-solving test for this population of teachers (fixed effect). No differences were found between the groups on an algebra understanding/applications subtest, overall attitude toward mathematics, mathematical self-confidence, anxiety about mathematics, or enjoyment of mathematics. She suggests that the lack of differences might be due to the ways in which teachers supplement materials, change test conditions, emphasize and deemphasize topics, use their own tests, vary the proportion of time spent on development and practice, use calculators and group work, and basically adapt the materials to their own interpretation and method. Many of these practices conflict directly with the recommendations of the authors of the materials.

A study by Briars and Resnick (2000) (EX) in Pittsburgh schools directly confronted issues relevant to professional development and implementation. Evaluators contrasted the performance of students of teachers with high and low implementation quality, and showed the results on two contrasting outcome measures, Iowa Test of Basic Skills (ITBS) and Balanced Assessment. Strong implementers were defined as those who used all of the EM components and provided student-centered instruction by giving students opportunities to explore mathematical ideas, solve problems, and explain their reasoning. Weak implementers were either not using EM or using it so little that the overall instruction in the classrooms was “hardly distinguishable from traditional mathematics instruction” (p. 8). Assignment was based on observations of student behavior in classes, the presence or absence of manipulatives, teacher questionnaires about the programs, and students’ knowledge of classroom routines associated with the program.

From the identification of strong- and weak-implementing teachers, strong- and weak-implementation schools were identified as those with strong- or weak-implementing teachers in 3rd and 4th grades over two consecutive years. The performance of students with two years of EM experience in these settings composed the comparative samples. Three pairs of strong- and weak-implementation schools were identified with similar demographics in terms of free and reduced-price lunch (range 76 to 93 percent), students living with only one parent (range 57 to 82 percent), mobility (range 8 to 16 percent), and ethnicity (range 43 to 98 percent African American). These students’ 1st-grade ITBS scores indicated similarity in prior performance levels. Finally, evaluators predicted that if the effects were due to the curricular implementation and accompanying professional development, the effects on scores should be seen in 1998, after full implementation. Figure 5-4 shows that on the 1998 New Standards exams, placement in strong- or weak-implementation schools strongly affected students’ scores. Over three years, performance in the district on skills, concepts, and problem solving rose, confirming the evaluators’ predictions.
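The error bars in Figure 5-4 denote 99 percent confidence intervals. For a percentage of students meeting a standard, such an interval can be sketched with a normal approximation (the sample figures below are invented, not Briars and Resnick's data):

```python
def proportion_ci_99(p_hat, n):
    """Normal-approximation (Wald) 99 percent confidence interval for a
    proportion; 2.576 is the z critical value for 99 percent coverage."""
    half_width = 2.576 * (p_hat * (1 - p_hat) / n) ** 0.5
    return (p_hat - half_width, p_hat + half_width)

# hypothetical: 60 percent of 250 tested students met the standard
low, high = proportion_ci_99(0.60, 250)  # roughly (0.52, 0.68)
```

When two such intervals do not overlap, as for the strong- and weak-implementation schools in 1998, the difference is unlikely to be a sampling artifact.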

FIGURE 5-4 Districtwide grade 4 New Standards Mathematics Reference Examination (NSMRE) performance for 1996, 1997, and 1998 by level of Everyday Mathematics implementation: percentage of students who met or exceeded the standard. Error bars denote the 99 percent confidence interval for each data point.

SOURCE: Re-created from Briars and Resnick (2000, pp. 19-20).

An article by McCaffrey et al. (2001) examining the interactions among instructional practices, curriculum, and student achievement illustrates that treatments of the terms traditional and reform teaching often draw distinctions that are inadequately linked to measurement tools. In this study, researchers conducted an exploratory factor analysis that led them to create two scales of instructional practice: Reform Practices and Traditional Practices. The reform scale measured the frequency, by means of teacher report, of teacher and student behaviors associated with reform instruction and assessment practices, such as using small-group work, explaining reasoning, representing and using data, writing reflections, or performing tasks in groups. The traditional scale focused on explanations to whole classes, the use of worksheets, practice, and short-answer assessments. Among teachers of integrated curricula, the correlation between the two scales was –0.32; among teachers of traditional curricula, it was 0.27. This shows that it is overly simplistic to treat reform and traditional practices as oppositional. The relationship among a variety of instructional practices is more complex, as they interact with curriculum and various student populations.

Professional Development

Professional development and teacher effects were separated in our analysis from implementation fidelity. We recognized that professional development could be viewed by the readers of this report in two ways. As indicated in our model, professional development can be considered a program element or component or it can be viewed as part of the implementation process. When viewed as a program element, professional development resources are considered mandatory along with program materials. In relation to evaluation, proponents of considering professional development as a mandatory program element argue that curricular innovations, which involve the introduction of new topics, new types of assessment, or new ways of teaching, must make provision for adequate training, just as with the introduction of any new technology.

For others, including professional development among the program elements without a concomitant inclusion of equal amounts of professional development for the comparative treatment introduces a priori disproportionate treatments and biases the results. We hoped for an array of evaluation studies that might shed empirical light on this dispute, and hence separated professional development from treatment fidelity, coding whether or not studies reported on the amount of professional development provided for the treatment and/or comparison groups. A study was coded as positive if it either reported on the professional development provided to the experimental group or reported the data for both treatments. Across all 63 at least minimally methodologically adequate studies, 27 percent reported some type of professional development measure, 1.5 percent reported it and adjusted for it in interpreting their outcome measures, and 71.5 percent recorded no information on the issue.

A study by Collins (2002) (EX)4 illustrates the critical and controversial role of professional development in evaluation. Collins studied the use of Connected Math over three years in three middle schools under threat of being classified as low performing in the Massachusetts accountability system. A comparison was made between one school (School A) that engaged substantively in the professional development opportunities accompanying the program and two that did not (Schools B and C). In School A, totals between 100 and 136 hours of professional development were recorded for all seven teachers in grades 6 through 8. In School B, 66 hours were reported for two teachers, and in School C, 150 hours were reported for eight teachers over three years. Results showed significant differences in subsequent student performance: the school with higher participation in professional development (School A) became a districtwide top performer, while the other two schools remained at risk for low performance. No controls for teacher effects were possible, but the results do suggest the centrality of professional development for successful implementation, or possibly that the results were due to professional development rather than the curriculum materials. That these two interpretations cannot be separated is a problem when professional development is provided to one group and not the other: the effect could be due to the textbook, to the professional development, or to an interaction between the two. Research designs should be adjusted to consider these issues when different conditions of professional development are provided.

Teacher Effects

These studies make it obvious that teacher effects are potential confounding factors. Many evaluation studies devoted inadequate attention to the variable of teacher quality. A few studies (Goodrow, 1998; Riordan and Noyce, 2001; Thompson et al., 2001; Thompson et al., 2003) reported on teacher characteristics such as certification, length of service, experience with the curricula, or degrees completed. Studies that matched classrooms and reported matched rather than aggregated results sought ways to acknowledge the large variation in teacher performance and its impact on student outcomes. We coded any effort to report on possible teacher effects as one indicator of quality. Across all 63 at least minimally methodologically adequate studies, 16 percent reported some type of teacher effect measure, 3 percent reported it and adjusted for it in interpreting their outcome measures, and 81 percent recorded no information on this issue.

One can see that the potential confounding factors of teacher effects, whether in terms of the provision of professional development or the measurement of teacher effects, are not adequately considered in most evaluation designs. Some studies mention the problem and offer a subjective judgment of its nature, but this is descriptive at most. Hardly any of the studies do anything analytical, and because these are such important potential confounding variables, this presents a serious challenge to the efficacy of these studies. Figure 5-5 shows how attention to these factors varies across program categories among NSF-supported, UCSMP, and studies of commercial materials. In general, evaluations of NSF-supported studies were the most likely to measure these variables; UCSMP had the most standardized use of methods to do so across studies; and commercial material evaluators seldom reported on issues of implementation fidelity.

FIGURE 5-5 Treatment of implementation components by program type.

NOTE: PD = professional development.

Identification of a Set of Outcome Measures and Forms of Disaggregation

Using the selected student outcomes identified in the program theory, one must conduct an impact assessment, that is, the design and measurement of student outcomes. In addition to selecting which outcomes should be measured within one’s program theory, one must determine how those outcomes are measured, when the measures are collected, and what purpose they serve from the perspective of the participants. In the case of curricular evaluation, there are significant issues involved in how these measures are reported. To provide insight into the level of curricular validity, many evaluators prefer to report results by topic, content strand, or item cluster. Such reports often provide the level of specificity needed to inform curriculum designers, especially when efforts are made to document patterns of errors, the distribution of results across multiple choices, or analyses of student methods. In these cases, whole-test scores, reporting only average performance, may mask essential differences in impact among curricula at the level of content topics.

On the other hand, many large-scale assessments depend on methods of test equating that rely on whole-test scores, which makes comparative interpretations of different test administrations by content strand of questionable reliability. Furthermore, there are questions such as whether to present gain scores or effect sizes, how to link pretests and posttests, and how to determine the relative curricular sensitivity of various outcome measures.
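Gain scores and effect sizes are simple to compute once defined; what is contested is which to report and how to link the tests. A sketch with invented scores, using Cohen's d with a pooled standard deviation as the effect size:

```python
import statistics

def gain_scores(pre, post):
    """Per-student gains linking pretest to posttest (same student order)."""
    return [b - a for a, b in zip(pre, post)]

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled standard deviation."""
    nt, nc = len(treatment), len(control)
    vt = statistics.variance(treatment)  # sample variance, n - 1 denominator
    vc = statistics.variance(control)
    pooled_sd = (((nt - 1) * vt + (nc - 1) * vc) / (nt + nc - 2)) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

# hypothetical matched pretests and posttests for two small groups
treat_gains = gain_scores([40, 45, 50], [52, 56, 60])  # gains of 12, 11, 10
ctrl_gains = gain_scores([40, 45, 50], [47, 50, 53])   # gains of 7, 5, 3
d = cohens_d(treat_gains, ctrl_gains)
```

Computing the effect size on gains rather than raw posttests is one way to link pretests and posttests, though, as the text notes, the choice itself affects a measure's curricular sensitivity.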

The findings of comparative studies are reported in terms of the outcome measure(s) collected. To describe the nature of the database with regard to outcome measures and to facilitate our analyses of the studies, we classified each of the included studies on four outcome measure dimensions:

Total score reported;

Disaggregation of content strands, subtest, performance level, SES, or gender;

Outcome measure that was specific to curriculum; and

Use of multiple outcome measures.

Most studies reported a total score, but we did find studies that reported only subtest scores or only scores on an item-by-item basis. For example, in the Ben-Chaim et al. (1998) evaluation study of Connected Math, the authors were interested in students’ proportional reasoning proficiency as a result of use of this curriculum. They asked students from eight seventh-grade classes of CMP and six seventh-grade classes from the control group to solve a variety of tasks categorized as rate and density problems. The authors provide precise descriptions of the cognitive challenges in the items; however, they do not explain whether the problems written up were representative of performance on a larger set of items. A special rating form was developed to code responses in three major categories (correct answer, incorrect answer, and no response), with subcategories indicating the quality of the work that accompanied the response. No reports on reliability of coding were given. Performance on standardized tests indicated that control students’ scores were slightly higher than CMP students’ at the beginning of the year and lower at the end. Twenty-five percent of the experimental group members were interviewed about their approaches to the problems. The CMP students outperformed the control students overall (53 percent versus 28 percent) in providing correct answers with supporting work, and 27 percent of the control group gave an incorrect answer or showed incorrect thinking, compared to 13 percent of the CMP group. An item-level analysis permitted the researchers to evaluate the actual strategies used by the students. They reported, for example, that 82 percent of CMP students used a “strategy focused on package price, unit price, or a combination of the two; those effective strategies were used by only 56 of 91 control students (62 percent)” (p. 264).

The use of item or content strand-level comparative reports had the advantage that they permitted the evaluators to assess student learning strategies specific to a curriculum’s program theory. For example, at times, evaluators wanted to gauge the effectiveness of using problems different from those on typical standardized tests. In this case, problems were drawn from familiar circumstances but carefully designed to create significant cognitive challenges, in order to assess how well the informal strategies approach in CMP works in comparison to traditional instruction. The disadvantages of such an approach include the use of only a small number of items and concerns about reliability in scoring. These studies seem to represent a method of creating hybrid research models that build on the detailed analyses possible in case studies while still reporting on samples that provide comparative data. This approach possibly reflects the concerns of some mathematicians and mathematics educators that the effectiveness of materials needs to be evaluated relative to very specific, research-based issues in learning and that these are often inadequately measured by multiple-choice tests. However, a decision not to report total scores led to a trade-off in the reliability and representativeness of the reported data, which must be addressed to increase the objectivity of the reports.

Second, we coded whether outcome data were disaggregated in some way. Disaggregation involved reporting data on dimensions such as content strand, subtest, test item, ethnic group, performance level, SES, and gender. We found disaggregated results particularly helpful in understanding the findings of studies that found main effects, and also in examining patterns across studies. We report the results of the studies’ disaggregation by content strand in our reports of effects. We report the results of the studies’ disaggregation by subgroup in our discussions of generalizability.

Third, we coded whether a study used an outcome measure that the evaluator reported as being sensitive to a particular treatment—this is a subcategory of what was defined in our framework as “curricular validity of measures.” In such studies, the rationale was that readily available measures such as state-mandated tests, norm-referenced standardized tests, and college entrance examinations do not measure some of the aims of the program under study. A frequently cited instance of this was that “off the shelf” instruments do not measure well students’ ability to apply their mathematical knowledge to problems embedded in complex settings. Thus, some studies constructed a collection of tasks that assessed this ability and collected data on it (Ben-Chaim et al., 1998; Huntley et al., 2000).

Finally, we recorded whether a study used multiple outcome measures. Some studies used a variety of achievement measures, while other studies reported on achievement accompanied by measures such as subsequent course taking or various types of affective measures. For example, Carroll (2001, p. 47) reported results on a norm-referenced standardized achievement test as well as a collection of tasks developed in other studies.

A study by Huntley et al. (2000) illustrates how a variety of these techniques were combined in their outcome measures. They developed three assessments: the first emphasized contextualized problem solving based on items from the American Mathematical Association of Two-Year Colleges and others; the second assessed context-free symbolic manipulation; and the third required collaborative problem solving. To link these measures to the overall evaluation, they articulated an explicit model of cognition based on how one links an applied situation to mathematical activity through processes of formulation and interpretation. Their assessment strategy permitted them to investigate algebraic reasoning as an ability to use algebraic ideas and techniques to (1) mathematize quantitative problem situations, (2) use algebraic principles and procedures to solve equations, and (3) interpret results of reasoning and calculations.

In presenting their data comparing performance on Core-Plus and the traditional curriculum, they presented both main effects and comparisons on subscales. Their design of outcome measures permitted them to examine differences in performance with and without context and to conclude with statements such as “This result illustrates that CPMP students perform better than control students when setting up models and solving algebraic problems presented in meaningful contexts while having access to calculators, but CPMP students do not perform as well on formal symbol-manipulation tasks without access to context cues or calculators” (p. 349). The authors go on to present data on the relationship between knowing how to plan or interpret solutions and knowing how to carry them out. The correlations between these variables were weak but significantly different (0.26 for control groups and 0.35 for Core-Plus). The advantage of using multiple measures carefully tied to program theory is that they can permit one to test fine content distinctions that are likely to be the level of adjustments necessary to fine-tune and improve curricular programs.

Another interesting approach to the use of outcome measures is found in the UCSMP studies. In many of these studies, evaluators collected information from teachers’ reports and chapter reviews as to whether topics for items on the posttests were taught, calling this an “opportunity to learn” measure. The authors reported results from three types of analyses: (1) total test scores, (2) fair test scores (scores reported by program but only on items on topics taught), and (3) conservative test scores (scores on common items taught in both). Table 5-2 reports on the variations across the multiple-choice test scores for the Geometry study (Thompson et al., 2003) on a standardized test, High School Subject Tests-Geometry Form B, and the UCSMP-constructed Geometry test, and for the Advanced Algebra study on the UCSMP-constructed Advanced Algebra test (Thompson et al., 2001). The table shows the mean scores for UCSMP classes and comparison classes. In each cell, mean percentage correct is reported first by whole test, then by fair test, and then by conservative test.

TABLE 5-2 Mean Percentage Correct on the Subject Tests
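The three scoring schemes can be expressed compactly. The sketch below (standard-library Python; all item data and function names are hypothetical, not drawn from the UCSMP studies) computes whole, fair, and conservative scores for one student:

```python
# Sketch of the UCSMP-style scoring scheme described above (hypothetical data).
# "whole" = percent correct on all items; "fair" = only items whose topics the
# student's own program taught; "conservative" = only items taught in both programs.

def percent_correct(responses, item_ids):
    """Mean percent correct over the given subset of items."""
    if not item_ids:
        return None  # e.g., too few common items for a conservative score
    return 100.0 * sum(responses[i] for i in item_ids) / len(item_ids)

def score_student(responses, taught_by_own, taught_by_both):
    all_items = list(responses)
    return {
        "whole": percent_correct(responses, all_items),
        "fair": percent_correct(responses, [i for i in all_items if i in taught_by_own]),
        "conservative": percent_correct(responses, [i for i in all_items if i in taught_by_both]),
    }

# Hypothetical 6-item test: 1 = correct, 0 = incorrect
responses = {1: 1, 2: 0, 3: 1, 4: 1, 5: 0, 6: 1}
taught_by_own = {1, 2, 3, 4, 5}   # topics covered in the student's program
taught_by_both = {1, 3, 5}        # topics covered in both programs

print(score_student(responses, taught_by_own, taught_by_both))
# whole ≈ 66.7, fair = 60.0, conservative ≈ 66.7
```

Note how the three scores can diverge for the same response pattern, which is exactly the sensitivity the fair/conservative distinction is meant to expose.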

The authors explicitly compare the items from the standard Geometry test with the items from the UCSMP test and indicate overlap and difference. They constructed their own test because, in their view, the standard test was not adequately balanced among skills, properties, and real-world uses. The UCSMP test included items on transformations, representations, and applications that were lacking in the national test. Only five items were taught by all teachers; hence in the case of the UCSMP geometry test, there is no report on a conservative test. In the Advanced Algebra evaluation, only a UCSMP-constructed test was viewed as appropriate to cover the treatment of the prior material and alignment to the goals of the new course. These data sets demonstrate the challenge of selecting appropriate outcome measures, the sensitivity of the results to those decisions, and the importance of full disclosure of decision-making processes in order to permit readers to assess the implications of the choices. The methodology utilized sought to ensure that the material in the course was covered adequately by treatment teachers while finding ways to make comparisons that reflected content coverage.

Only one study reported on its outcomes using embedded assessment items employed over the course of the year. In a study of Saxon and UCSMP, Peters (1992) (EX) studied the use of these materials with two classrooms taught by the same teacher. In this small study, he randomly assigned students to treatment groups and then measured their performance on four unit tests composed of items common to both curricula and their progress on the Orleans-Hanna Algebraic Prognosis Test.

Peters’ study showed no significant difference in placement scores between Saxon and UCSMP on the posttest, but did show differences on the embedded assessment. Figure 5-6 (Peters, 1992, p. 75) shows an interesting display of the differences on a “continuum” that shows both the direction and magnitude of the differences and provides a level of concept specificity missing in many reports. This figure and a display ( Figure 5-7 ) in a study by Senk (1991, p. 18) of students’ mean scores on Curriculum A versus Curriculum B with a 10 percent range of differences marked represent two excellent means to communicate the kinds of detailed content outcome information that promises to be informative to curriculum writers, publishers, and school decision makers. In Figure 5-7 , 16 items listed by number were taken from the Second International Mathematics Study. The Functions, Statistics, and Trigonometry sample averaged 41 percent correct on these items whereas the U.S. precalculus sample averaged 38 percent. As shown in the figure, differences of 10 percent or less fall inside the banded area and greater than 10 percent fall outside, producing a display that makes it easy for readers and designers to identify the relative curricular strengths and weaknesses of topics.

While we value detailed outcome measure information, we also recognize the importance of examining curricular impact on students’ standardized test performance. Many developers, but not all, are explicit in rejecting standardized tests as adequate measures of the outcomes of their programs, claiming that these tests focus on skills and manipulations, that they are overly reliant on multiple-choice questions, and that they are often poorly aligned to new content emphases such as probability and statistics, transformations, use of contextual problems and functions, and process skills, such as problem solving, representation, or use of calculators. However, national and state tests are being revised to include more content on these topics and to draw on more advanced reasoning. Furthermore, these high-stakes tests are of major importance in school systems, determining graduation, passing standards, school ratings, and so forth. For this reason, if a curricular program demonstrated positive impact on such measures, we referred to that in Chapter 3 as establishing “curricular alignment with systemic factors.” Adequate performance on these measures is of paramount importance to the survival of reform (to large groups of parents and school administrators). These examples demonstrate how careful attention to outcome measures is an essential element of valid evaluation.

FIGURE 5-6 Continuum of criterion score averages for studied programs.

SOURCE: Peters (1992, p. 75).

In Table 5-3 , we document the number of studies using a variety of types of outcome measures that we used to code the data, and also report on the types of tests used across the studies.

FIGURE 5-7 Achievement (percentage correct) on Second International Mathematics Study (SIMS) items by U.S. precalculus students and functions, statistics, and trigonometry (FST) students.

SOURCE: Re-created from Senk (1991, p. 18).

TABLE 5-3 Number of Studies Using a Variety of Outcome Measures by Program Type

A Choice of Statistical Tests, Including Statistical Significance and Effect Size

In our first review of the studies, we coded what methods of statistical evaluation were used by different evaluators. Most common were t-tests; less frequently one found Analysis of Variance (ANOVA), Analysis of Covariance (ANCOVA), and chi-square tests. In a few cases, results were reported using multiple regression or hierarchical linear modeling. Some used multiple tests; hence the total exceeds 63 (Figure 5-8).

FIGURE 5-8 Statistical tests most frequently used.

One of the difficult aspects of conducting curriculum evaluations concerns choosing the appropriate unit, both the unit to be randomly assigned in an experimental study and the unit to be used in statistical analysis in either an experimental or quasi-experimental study.

For our purposes, we decided that unless the study concerned an intact student population, such as the freshmen at a single university, where a student comparison was the correct unit, the unit for statistical tests should be at least at the classroom level. Judgments were made for each study as to whether the appropriate unit was utilized. This question is an important one because statistical significance is related to sample size, and as a result, studies that inappropriately use the student as the unit of analysis could conclude that significant differences exist when they do not. For example, if achievement differences between two curricula are tested in 16 classrooms with 400 students, it will always be easier to show significant differences using scores from those 400 students than using 16 classroom means.
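The arithmetic behind this concern can be illustrated with a small, self-contained simulation (standard-library Python; every score below is invented for illustration and comes from no study reviewed here). The same data yield a t statistic above the conventional critical value when the 200 students are treated as independent observations, but not when the 8 classrooms are the unit:

```python
# Illustration of the unit-of-analysis problem: student-level vs. class-level t-tests
# on identical (hypothetical) data.
from statistics import mean, variance
from math import sqrt

def pooled_t(x, y):
    """Two-sample pooled-variance t statistic."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))

# Four classes of 25 students per curriculum; each student's score is the class
# mean plus a symmetric within-class offset.
offsets = [0.5 * (i - 12) for i in range(25)]           # within-class spread, mean 0
means_a = [72, 75, 78, 71]                               # hypothetical class means, curriculum A
means_b = [70, 73, 74, 69]                               # hypothetical class means, curriculum B
students_a = [m + d for m in means_a for d in offsets]   # 100 student scores
students_b = [m + d for m in means_b for d in offsets]

t_students = pooled_t(students_a, students_b)   # df = 198, critical t ≈ 1.97
t_classes = pooled_t(means_a, means_b)          # df = 6,   critical t ≈ 2.45

print(round(t_students, 2), round(t_classes, 2))
# The student-level t exceeds its critical value; the class-level t does not.
```

Because students within a class share their teacher and classroom, the 100 scores per group are not 100 independent observations, which is why the class-mean analysis is the more defensible one.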

Fifty-seven studies used students as the unit of analysis in at least one test of significance. Three of these were coded as correct because they involved whole populations. In all, 10 studies were coded as using the correct unit of analysis; hence, 7 studies used teachers, classes, or schools. For some studies where multiple tests were conducted, a judgment was made as to whether the primary conclusions drawn treated the unit of analysis adequately. For example, Huntley et al. (2000) compared the performance of CPMP students with students in a traditional course on a measure of ability to formulate and use algebraic models to answer various questions about relationships among variables. The analysis used students as the unit of analysis and showed a significant difference, as shown in Table 5-4.

TABLE 5-4 Performance on Applied Algebra Problems with Use of Calculators, Part 1

TABLE 5-5 Reanalysis of Algebra Performance Data

To examine the robustness of this result, we reanalyzed the data using an independent sample t-test and a matched-pairs t-test with class means as the unit of analysis in both tests (Table 5-5). As can be seen from the analyses, in neither statistical test was the difference between groups significant (p < .05), thus emphasizing the importance of using the correct unit in analyzing the data.

Reanalysis of student-level data using class means will not always result in a change in finding. Furthermore, using class means as the unit of analysis does not suggest that significant differences will not be found. For example, a study by Thompson et al. (2001) compared the performance of UCSMP students with the performance of students in a more traditional program across several measures of achievement. They found significant differences between UCSMP students and the non-UCSMP students on several measures. Table 5-6 shows results of an analysis of a multiple-choice algebraic posttest using class means as the unit of analysis. Significant differences were found in five of eight separate classroom comparisons, as shown in the table. They also found a significant difference using a matched-pairs t-test on class means.

TABLE 5-6 Mean Percentage Correct on Entire Multiple-Choice Posttest: Second Edition and Non-UCSMP

The lesson to be learned from these reanalyses is that the choice of unit of analysis and the way the data are aggregated can impact study findings in important ways including the extent to which these findings can be generalized. Thus it is imperative that evaluators pay close attention to such considerations as the unit of analysis and the way data are aggregated in the design, implementation, and analysis of their studies.

Second, effect size has become a relatively common and standard way of gauging the practical significance of the findings. Statistical significance only indicates whether the main-level differences between two curricula are large enough to not be due to chance, assuming they come from the same population. When statistical differences are found, the question remains as to whether such differences are large enough to consider. Because any innovation has its costs, the question becomes one of cost-effectiveness: Are the differences in student achievement large enough to warrant the costs of change? Quantifying the practical effect once statistical significance is established is one way to address this issue. There is a statistical literature for doing this, and for the purposes of this review, the committee simply noted whether these studies have estimated such an effect. However, the committee further noted that in conducting meta-analyses across these studies, effect size was likely to be of little value. These studies used an enormous variety of outcome measures, and even using effect size as a means to standardize units across studies is not sensible when the measures in each study address such a variety of topics, forms of reasoning, content levels, and assessment strategies.
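One common index from that statistical literature is Cohen's d, the difference in means expressed in pooled standard deviation units. A minimal sketch (standard-library Python; the class means are invented for illustration):

```python
# Cohen's d: mean difference divided by the pooled standard deviation.
from statistics import mean, variance
from math import sqrt

def cohens_d(x, y):
    nx, ny = len(x), len(y)
    sp = sqrt(((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2))
    return (mean(x) - mean(y)) / sp

treatment = [74, 78, 71, 77, 73, 76]   # hypothetical class means, new curriculum
control = [71, 74, 70, 73, 72, 74]     # hypothetical class means, comparison
d = cohens_d(treatment, control)
print(round(d, 2))   # d > 1 here: a large effect by conventional benchmarks
```

Because d is expressed in standard deviation units rather than raw score points, it is comparable across tests only when the underlying constructs are comparable, which is precisely the committee's reservation about pooling effect sizes across these studies.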

We note that very few studies drew upon the advances in methodologies employed in modeling, which include causal modeling, hierarchical linear modeling (Bryk and Raudenbush, 1992; Bryk et al., 1993), and selection bias modeling (Heckman and Hotz, 1989). Although developing detailed specifications for these approaches is beyond the scope of this review, we wish to emphasize that these methodological advances should be considered within future evaluation designs.

Results and Limitations to Generalizability Resulting from Design Constraints

One also must consider what generalizations can be drawn from the results (Campbell and Stanley, 1966; Caporaso and Roos, 1973; Boruch, 1997). Generalization is a matter of external validity in that it determines to what populations the study results are likely to apply. In designing an evaluation study, one must carefully consider, in the selection of units of analysis, how various characteristics of those units will affect the generalizability of the study. It is common for evaluators to conflate issues of representativeness for the purpose of generalizability (external validity) and comparativeness (the selection of or adjustment for comparison groups [internal validity]). Not all studies must be representative of the population served by mathematics curricula to be internally valid. But, to be generalizable beyond restricted communities, representativeness must be obtained by the random selection of the basic units. Clearly specifying such limitations to generalizability is critical. Furthermore, on the basis of equity considerations, one must be sure that if overall effectiveness is claimed, the studies have been conducted and analyzed with reference to all relevant subgroups.

Thus, depending on the design of a study, its results may be limited in generalizability to other populations and circumstances. We identified four typical kinds of limitations on the generalizability of studies and coded them to determine, on the whole, how generalizable the results across studies might be.

First, there were studies whose designs were limited by the ability or performance level of the students in the samples. It was not unusual to find that when new curricula were implemented at the secondary level, schools kept in place systems of tracking that assigned the top students to traditional college-bound curriculum sequences. As a result, studies either used comparative groups that were matched demographically but less skilled than the population as a whole in relation to prior learning, or compared samples of less well-prepared students to samples of students with stronger preparations. Alternatively, some studies reported on the effects of curricular reform on gifted and talented students or on college-attending students. In these cases, the study results would also limit the generalizability of the results to similar populations. Reports using limited samples of students’ ability and prior performance levels were coded as a limitation to the generalizability of the study.

For example, Wasman (2000) conducted a study of one school (six teachers) and examined the students’ development of algebraic reasoning after one (n=100) and two years (n=73) in CMP. In this school, the top 25 percent of the students are counseled to take a more traditional algebra course, so her experimental sample, which was 61 percent white, 35 percent African American, 3 percent Asian, and 1 percent Hispanic, consisted of the lower 75 percent of the students. She reported on the student performance on the Iowa Algebraic Aptitude Test (IAAT) (1992), in the subcategories of interpreting information, translating symbols, finding relationships, and using symbols. Results for Forms 1 and 2 of the test, for the experimental and norm group, are shown in Table 5-7 for 8th graders.

TABLE 5-7 Comparing Iowa Algebraic Aptitude Test (IAAT) Mean Scores of the Connected Mathematics Project Forms 1 and 2 to the Normative Group (8th Graders)

In our coding of outcomes, this study was coded as showing no significant differences, although arguably its results demonstrate a positive set of outcomes, as the treatment group was weaker than the control group. Had the researcher used a prior achievement measure and a different statistical technique, significance might have been demonstrated, although potential teacher effects confound interpretations of results.

A second limitation to generalizability was when comparative studies resided entirely at curriculum pilot site locations, where such sites were developed as a means to conduct formative evaluations of the materials with close contact and advice from teachers. Typically, pilot sites have unusual levels of teacher support, whether it is in the form of daily technical support in the use of materials or technology or increased quantities of professional development. These sites are often selected for study because they have established cooperative agreements with the program developers and other sources of data, such as classroom observations, are already available. We coded whether the study was conducted at a pilot site to signal potential limitations in generalizability of the findings.

Third, studies were also coded as being of limited generalizability if they failed to disaggregate their data by socioeconomic class, race, gender, or some other potentially significant sources of restriction on the claims. We recorded the categories in which disaggregation occurred and compiled their frequency across the studies. Because of the need to open the pipeline to advanced study in mathematics by members of underrepresented groups, we were particularly concerned about gauging the extent to which evaluators factored such variables into their analysis of results and not just in terms of the selection of the sample.

Of the 46 included studies of NSF-supported curricula, 19 disaggregated their data by student subgroup. Nine of 17 studies of commercial materials disaggregated their data. Figure 5-9 shows the number of studies that disaggregated outcomes by race or ethnicity, SES, gender, LEP, special education status, or prior achievement. Studies using multiple categories of disaggregation were counted multiple times by program category.

The last category of restricted generalization occurred in studies of limited sample size. Although such studies may have provided more in-depth observations of implementation and reports on professional development factors, the smaller numbers of classrooms and students in the study would limit the extent of generalization that could be drawn from it. Figure 5-10 shows the distribution of sizes of the samples in terms of numbers of students by study type.

FIGURE 5-9 Disaggregation of subpopulations.

FIGURE 5-10 Proportion of studies by sample size and program.

Summary of Results by Student Achievement Among Program Types

We present the results of the studies as a means to further investigate their methodological implications. To this end, for each study, we counted across outcome measures the number of findings that were positive, negative, or indeterminate (no significant difference) and then calculated the proportion of each. We represented the calculation of each study as a triplet (a, b, c) where a indicates the proportion of the results that were positive and statistically significantly stronger than the comparison program, b indicates the proportion that were negative and statistically significantly weaker than the comparison program, and c indicates the proportion that showed no significant difference between the treatment and the comparative group. For studies with a single outcome measure, without disaggregation by content strand, the triplet is always composed of two zeros and a single one. For studies with multiple measures or disaggregation by content strand, the triplet is typically a set of three decimal values that sum to one. For example, a study with one outcome measure in favor of the experimental treatment would be coded (1, 0, 0), while one with multiple measures and mixed results more strongly in favor of the comparative curriculum might be listed as (.20, .50, .30). This triplet would mean that for 20 percent of the comparisons examined, the evaluators reported statistically significant positive results, for 50 percent of the comparisons the results were statistically significant in favor of the comparison group, and for 30 percent of the comparisons no significant difference was found. Overall, the mean score on these distributions was (.54, .07, .40), indicating that across all the studies, 54 percent of the comparisons favored the treatment, 7 percent favored the comparison group, and 40 percent showed no significant difference. Table 5-8 shows the comparison by curricular program types. We present the results by individual program types, because each program type relies on a similar program theory and hence could lead to patterns of results that would be lost in combining the data.
If the studies of commercial materials are all grouped together to include UCSMP, their pattern of results is (.38, .11, .51). Again we emphasize that due to our call for increased methodological rigor and the use of multiple methods, this result is not sufficient to establish the curricular effectiveness of these programs as a whole with adequate certainty.
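The triplet coding described above is simple to mechanize. A sketch in standard-library Python, with hypothetical outcome lists (the second list is constructed to reproduce the (.20, .50, .30) worked example in the text):

```python
# Compute (a, b, c) triplets from lists of per-comparison outcomes and the
# unweighted mean triplet across studies.

def triplet(outcomes):
    """outcomes: '+' favors treatment, '-' favors comparison, '0' no difference."""
    n = len(outcomes)
    return (outcomes.count('+') / n, outcomes.count('-') / n, outcomes.count('0') / n)

def mean_triplet(triplets):
    """Unweighted mean across studies, rounded to two decimals."""
    k = len(triplets)
    return tuple(round(sum(t[i] for t in triplets) / k, 2) for i in range(3))

study_1 = triplet(['+'])   # single outcome measure favoring treatment: (1, 0, 0)
study_2 = triplet(['+', '-', '-', '0', '0', '-', '0', '+', '-', '-'])  # 10 comparisons

print(study_2)                          # (0.2, 0.5, 0.3)
print(mean_triplet([study_1, study_2]))
```

Note that, as the text cautions, this mean weights every study equally regardless of sample size or the credibility of its outcome measures.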

TABLE 5-8 Comparison by Curricular Program Types

We caution readers that these results are summaries of the results presented across a set of evaluations that meet only the standard of at least minimally methodologically adequate. Calculations of statistical significance of each program’s results were reported by the evaluators; we have made no adjustments for weaknesses in the evaluations such as inappropriate use of units of analysis in calculating statistical significance. Evaluations that consistently used the correct unit of analysis, such as UCSMP, could have fewer reports of significant results as a consequence. Furthermore, these results are not weighted by study size. Within any study, the results pay no attention to comparative effect size or to the established credibility of an outcome measure. Similarly, these results do not take into account differences in the populations sampled, an important consideration in generalizing the results. For example, UCSMP studies used volunteer samples that responded to advertisements in their newsletters, resulting in samples with disproportionately Caucasian subjects from wealthier schools compared with national samples. As a result, we would suggest that these results are useful only as baseline data for future evaluation efforts. Our purpose in calculating these results is to permit us to create filters from the critical decision points and test how the results change as one applies more rigorous standards.

Given that none of the studies adequately addressed all of the critical criteria, we do not offer these results as definitive, only suggestive—a hypothesis for further study. In effect, given the limitations of time and support, and the urgency of providing advice related to policy, we offer this filtering approach as an informal meta-analytic technique sufficient to permit us to address our primary task, namely, evaluating the quality of the evaluation studies.

This approach reflects the committee’s view that to deeply understand and improve methodology, it is necessary to scrutinize the results and to determine what inferences they provide about the conduct of future evaluations. Analogous to debates on consequential validity in testing, we argue that to strengthen methodology, one must consider what current methodologies are able (or not able) to produce across an entire series of studies. The remainder of the chapter is focused on considering in detail what claims are made by these studies, and how robust those claims are when subjected to challenge by alternative hypothesis, filtering by tests of increasing rigor, and examining results and patterns across the studies.

Alternative Hypotheses on Effectiveness

In the spirit of scientific rigor, the committee sought to consider rival hypotheses that could explain the data. Given the weaknesses in the designs generally, often these alternative hypotheses cannot be dismissed. However, we believed that only after examining the configuration of results and alternative hypotheses can the next generation of evaluations be better informed and better designed. We began by generating alternative hypotheses to explain the positive directionality of the results in favor of experimental groups. Alternative hypotheses included the following:

The teachers in the experimental groups tended to be self-selecting early adopters, and thus able to achieve effects not likely in regular populations.

Changes in student outcomes reflect the effects of professional development instruction, or level of classroom support (in pilot sites), and thus inflate the predictions of effectiveness of curricular programs.

A Hawthorne effect (Franke and Kaul, 1978) occurs when treatments are compared to everyday practices and motivational factors associated with participating in an experiment influence participants’ performance.

The consistent difference is due to the coherence and consistency of a single curricular program when compared to multiple programs.

The significance level is only achieved by the use of the wrong unit of analysis to test for significance.

Supplemental materials or new teaching techniques produce the results and not the experimental curricula.

Significant results reflect inadequate outcome measures that focus on a restricted set of activities.

The results are due to evaluator bias because too few evaluators are independent of the program developers.

At the same time, one could argue that the results actually underestimate the performance of these materials and are conservative measures; these alternative hypotheses also deserve consideration:

Many standardized tests are not sensitive to these curricular approaches, and by eliminating studies focusing on affect, we eliminated a key indicator of the appeal of these curricula to students.

Poor implementation or increased demands on teachers’ knowledge dampen the effects.

Often in the experimental treatment, top-performing students are missing as they are advised to take traditional sequences, rendering the samples unequal.

Materials are not well aligned with universities and colleges because tests for placement and success in early courses focus extensively on algebraic manipulation.

Program implementation has been undercut by negative publicity and the fears of parents concerning change.

There are also a number of possible hypotheses that may be affecting the results in either direction, and we list a few of these:

Examining the role of the teacher in curricular decision making is an important element in effective implementation, and the mandates of evaluation design make this impossible (as are the positives and negatives of single- versus dual-track curricula, as in Lundin, 2001).

Local tests that are sensitive to the curricular effects typically are not mandatory and hence may lead to unpredictable performance by students.

Different types and extent of professional development may affect outcomes differentially.

Persistence or attrition may affect the mean scores and are often not considered in the comparative analyses.

One could also generate reasons why the curricular programs produced results showing no significance when one program or the other is actually more effective. These could include high degrees of variability in the results, samples that used the correct unit of analysis but did not obtain consistent participation across enough cases, implementation that did not show enough fidelity to the measures, or outcome measures insensitive to the results. Again, subsequent designs should be better informed by these findings to improve the likelihood that they will produce less ambiguous results; replication of studies could also give more confidence in the findings.

It is beyond the scope of this report to consider each of these alternative hypotheses separately and to seek confirmation or refutation of them. However, in the next section, we describe a set of analyses carried out by the committee that permits us to examine and consider the impact of various critical evaluation design decisions on the patterns of outcomes across sets of studies. A number of analyses shed some light on various alternative hypotheses and may inform the conduct of future evaluations.

Filtering Studies by Critical Decision Points to Increase Rigor

In examining the comparative studies, we identified seven critical decision points that we believed would directly affect the rigor and efficacy of the study design. These decision points were used to create a set of 16 filters, listed as the following questions:

Was there a report on comparability relative to SES?

Was there a report on comparability of samples relative to prior knowledge?

Was there a report on treatment fidelity?

Was professional development reported on?

Was the comparative curriculum specified?

Was there any attempt to report on teacher effects?

Was a total test score reported?

Was total test score(s) disaggregated by content strand?

Did the outcome measures match the curriculum?

Were multiple tests used?

Was the appropriate unit of analysis used in their statistical tests?

Did they estimate effect size for the study?

Was the generalizability of their findings limited by use of a restricted range of ability levels?

Was the generalizability of their findings limited by use of pilot sites for their study?

Was the generalizability of their findings limited by not disaggregating their results by subgroup?

Was the generalizability of their findings limited by use of small sample size?

The studies were coded to indicate if they reported having addressed these considerations. In some cases, the decision points were coded dichotomously as present or absent in the studies; in other cases, the decision points were coded trichotomously, as present, absent, or statistically adjusted for in the results. For example, a study may or may not report on the comparability of the samples in terms of race, ethnicity, or socioeconomic status. If a report on SES was given, the study was coded as “present” on this decision; if a report was missing, it was coded as “absent”; and if SES status or ethnicity was used in the analysis to actually adjust outcomes, it was coded as “adjusted for.” For each coding, the table that follows reports the number of studies that met that condition, then reports the mean percentage of statistically significant results and of results showing no significant difference for that set of studies. A significance test is run to see if the application of the filter produces changes in the probability that are significantly different.5

In the cases in which studies are coded into three distinct categories—present, absent, and adjusted for—a second set of filters is applied. First, the studies coded as present or adjusted for are combined and compared to those coded as absent; this is what we refer to as a weak test of the rigor of the study. Second, the studies coded as present or absent are combined and compared to those coded as adjusted for. This is what we refer to as a strong test. For dichotomous codings, there can be as few as three comparisons, and for trichotomous codings, there can be nine comparisons with accompanying tests of significance. Trichotomous codes were used for adjustments for SES and prior knowledge, examining treatment fidelity, professional development, teacher effects, and reports on effect sizes. All others were dichotomous.
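As an illustration, the weak and strong tests described above can be sketched in code. The study codings and outcome counts below are invented for illustration; they are not the committee’s data.

```python
# Hypothetical sketch of the weak/strong filter comparisons described above.
# Each study carries an SES coding and tallies of comparisons that favored
# the treatment ("pos"), favored the control ("neg"), or showed no
# significant difference ("nsd"). All values are invented.

studies = [
    {"ses": "present",  "outcomes": {"pos": 5, "neg": 1, "nsd": 4}},
    {"ses": "absent",   "outcomes": {"pos": 7, "neg": 1, "nsd": 3}},
    {"ses": "adjusted", "outcomes": {"pos": 8, "neg": 0, "nsd": 2}},
    {"ses": "present",  "outcomes": {"pos": 4, "neg": 2, "nsd": 5}},
]

def pooled_probs(subset):
    """Pool outcome counts over a subset of studies and return the
    (positive, negative, no-significant-difference) proportions."""
    totals = {"pos": 0, "neg": 0, "nsd": 0}
    for s in subset:
        for k in totals:
            totals[k] += s["outcomes"][k]
    n = sum(totals.values())
    return tuple(round(totals[k] / n, 2) for k in ("pos", "neg", "nsd"))

# Weak test: "present" or "adjusted" combined, compared to "absent".
weak_in  = [s for s in studies if s["ses"] in ("present", "adjusted")]
weak_out = [s for s in studies if s["ses"] == "absent"]

# Strong test: "adjusted" alone, compared to "present" or "absent" combined.
strong_in  = [s for s in studies if s["ses"] == "adjusted"]
strong_out = [s for s in studies if s["ses"] in ("present", "absent")]

print(pooled_probs(weak_in), pooled_probs(weak_out))
print(pooled_probs(strong_in), pooled_probs(strong_out))
```

Each probability triple can then be compared against the triple for the whole set of studies, mirroring the filtering procedure in the text.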

NSF Studies and the Filters

For example, there were 11 studies of NSF-supported curricula that simply reported on the issue of SES in creating equivalent samples for comparison, and for this subset the mean probabilities of positive results, negative results, and results showing no significant difference were (.47, .10, .43). If no report of SES was supplied (n=21), those probabilities become (.57, .07, .37), indicating an increase in positive results and a decrease in results showing no significant difference. When an adjustment is made in outcomes based on differences in SES (n=14), the probabilities change to (.72, .00, .28), showing a higher likelihood of positive outcomes. The probabilities that result from filtering should always be compared back to the overall results of (.59, .06, .35) (see Table 5-8 ) so as to permit one to judge the effects of more rigorous methodological constraints. This suggests that a simple report on SES without adjustment is least likely to produce positive outcomes; no report produces the next most positive outcomes; and studies that adjusted for SES tend to have the highest proportion of comparisons producing positive results.

The second method of applying the filter (the weak test for rigor) for the treatment of the adjustment of SES groups compares the probabilities when a report is either given or adjusted for compared to when no report is offered. The combined percentage of a positive outcome of a study in which SES is reported or adjusted for is (.61, .05, .34), while the percentage for no report remains as reported previously at (.57, .07, .37). A final filter compares the probabilities of the studies in which SES is adjusted for with those that either report it only or do not report it at all. Here we compare the percentage of (.72, .00, .28) to (.53, .08, .37) in what we call a strong test. In each case we compared the probability produced by the whole group to those of the filtered studies and conducted a test of the differences to determine if they were significant. These differences were not significant. These findings indicate that to date, with this set of studies, there is no statistically significant difference in results when one reports or adjusts for changes in SES. It appears that by adjusting for SES, one sees increases in the positive results, and this result deserves a closer examination for its implications should it prove to hold up over larger sets of studies.
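The kind of test described here—checking whether a filter significantly changes the probability of a positive outcome—can be sketched as a two-proportion z-test. The counts below are hypothetical; this excerpt does not specify which test statistic the committee actually used.

```python
# Hedged sketch of a two-proportion comparison for testing whether a filter
# changes the probability of a positive outcome. Counts are hypothetical.
import math

def two_prop_z(pos1, n1, pos2, n2):
    """Two-sided z-test for a difference between two proportions."""
    p1, p2 = pos1 / n1, pos2 / n2
    p = (pos1 + pos2) / (n1 + n2)                    # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# E.g., 72 of 100 filtered comparisons positive vs. 118 of 200 overall.
z, p = two_prop_z(72, 100, 118, 200)
print(round(z, 2), round(p, 3))
```

With these invented counts the difference is significant at p < .05; with the small study counts in the actual filters, differences of similar size could easily fail to reach significance, as the text reports.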

We ran tests that report the impact of the filters on the number of studies, the percentage of studies, and the effects described as probabilities for each of the three study categories, NSF-supported and commercially generated with UCSMP included. We claim that when a pattern of probabilities of results does not change after filtering, one can have more confidence in that pattern. When the pattern of results changes, there is a need for an explanatory hypothesis, and that hypothesis can shed light on experimental design. We propose that this “filtering process” constitutes a test of the robustness of the outcome measures subjected to increasing degrees of rigor by using filtering.

Results of Filtering on Evaluations of NSF-Supported Curricula

For the NSF-supported curricular programs, 5 out of 15 filters produced a probability that differed significantly at the p < .1 level. The five filters were treatment fidelity, specification of the control group, choice of the appropriate statistical unit, generalizability by ability, and generalizability based on disaggregation by subgroup. For each filter, there were from three to nine comparisons, as we examined how the probabilities of outcomes changed as tests became more stringent, across the categories of positive results, negative results, and results with no significant differences. Out of a total of 72 possible tests, only 11 produced a probability that differed significantly at the p < .1 level. With 85 percent of the comparisons showing no significant difference after filtering, we suggest the results of the studies were relatively robust in relation to these tests. At the same time, when rigor is increased for the five filters just listed, the results become generally more ambiguous and signal the need for further research with more careful designs.

Studies of Commercial Materials and the Filters

To ensure enough studies to conduct the analysis (n=17), our filtering analysis of the commercially generated studies included UCSMP (n=8). In this case, there were six filters that produced a probability that differed significantly at the p < .1 level. These were treatment fidelity, disaggregation by content, use of multiple tests, use of effect size, generalizability by ability, and generalizability by sample size. In this case, because there were no studies in some possible categories, there were a total of 57 comparisons, and 9 displayed significant differences in the probabilities after filtering at the p < .1 level. With 84 percent of the comparisons showing no significant difference after filtering, we suggest the results of the studies were relatively robust in relation to these tests. Table 5-9 shows the cases in which significant differences were recorded.

Impact of Treatment Fidelity on Probabilities

A few of these differences are worthy of comment. In the cases of both the NSF-supported and commercially generated curricula evaluation studies, studies that reported treatment fidelity differed significantly from those that did not. In the case of the studies of NSF-supported curricula, it appeared that a report or adjustment on treatment fidelity led to proportions with less positive effects and more results showing no significant differences. We hypothesize that this is partly because larger studies often do not examine actual classroom practices, but can obtain significance more easily due to large sample sizes.

In the studies of commercial materials, the presence or absence of measures of treatment fidelity worked differently. Studies reporting on or adjusting for treatment fidelity tended to have significantly higher probabilities in favor of the experimental treatment, less positive effects in fewer of the comparative treatments, and more likelihood of results with no significant differences. We hypothesize, and confirm with a separate analysis, that this is because UCSMP evaluations frequently reported on treatment fidelity in their designs while studies of Saxon typically did not, and the change represents the preponderance of these different curricular treatments in the studies of commercially generated materials.

Impact of Identification of Curricular Program on Probabilities

The significant differences reported under specificity of curricular comparison also merit discussion for studies of NSF-supported curricula. When the comparison group is not specified, a higher percentage of mean scores in favor of the experimental curricula is reported. In the studies of commercial materials, a failure to name specific curricular comparisons also produced a higher percentage of positive outcomes for the treatment, but the difference was not statistically significant. This suggests the possibility that when a specified curriculum is compared to an unspecified curriculum, reports of impact may be inflated. This finding may suggest that in studies of effectiveness, specifying comparative treatments would provide more rigorous tests of experimental approaches.

When studies of commercial materials disaggregate their results by content strand or use multiple measures, their reports of positive outcomes increase, their negative outcomes decrease, and, in one case, the results show no significant differences. A significant difference was recorded in only one comparison within each of these filters.

TABLE 5-9 Cases of Significant Differences

Impact of Units of Analysis on Probabilities 6

For the evaluations of the NSF-supported materials, a significant difference was reported on the outcomes for the studies that used the correct unit of analysis compared to those that did not. The probabilities for those with the correct unit were (.30, .40, .30), compared to (.63, .01, .36) for those that used the incorrect unit. These results suggest that our prediction that using the correct unit of analysis would decrease the percentage of positive outcomes is likely to be correct. They also suggest that the most serious threat to the apparent conclusions of these studies comes from selecting an incorrect unit of analysis. Correcting the unit of analysis causes a decrease in favorable results, making the results more ambiguous, but never reverses the direction of the effect. This is a concern that merits major attention in the conduct of further studies.
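The unit-of-analysis issue can be made concrete with a minimal sketch: when a treatment is assigned to classrooms rather than to individual students, scores should first be collapsed to classroom means, which shrinks the effective sample size and makes significance harder to achieve. The data below are invented.

```python
# Illustrative sketch (hypothetical data) of choosing the classroom, rather
# than the student, as the unit of analysis.
from statistics import mean

# Hypothetical student scores keyed by classroom.
scores = {
    "class_A": [61, 64, 70, 58],
    "class_B": [75, 72, 80],
    "class_C": [66, 69, 71, 73],
}

# Wrong unit: pooling all students inflates the effective sample size.
student_level = [s for cls in scores.values() for s in cls]

# Correct unit: one mean per classroom, since the treatment was assigned
# to classrooms, not to individual students.
classroom_means = [mean(cls) for cls in scores.values()]

print(len(student_level), len(classroom_means))
print([round(m, 2) for m in classroom_means])
```

Any subsequent significance test would compare the classroom means across conditions, with n equal to the number of classrooms.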

For the commercially generated studies, most of the ones coded with the correct unit of analysis were UCSMP studies. Because of the small number of studies involved, we could not break out from the overall filtering of studies of commercial materials, but report this issue to assist readers in interpreting the relative patterns of results.

Impact of Generalizability on Probabilities

Both types of studies yielded significant differences for some of the comparisons coded as restrictions to generalizability. Investigating these is important in order to understand the effects of these curricular programs on different subpopulations of students. In the case of the studies of commercially generated materials, significantly different results occurred in the categories of ability and sample size. In the studies of NSF-supported materials, the significant differences occurred in ability and disaggregation by subgroups.

In relation to generalizability, the studies of NSF-supported curricula reported significantly more positive results in favor of the treatment when they included all students. Because studies coded as “limited by ability” were restricted either by focusing only on higher achieving students or on lower achieving students, we sorted these two groups. For higher performing students (n=3), the probabilities of effects were (.11, .67, .22). For lower performing students (n=2), the probabilities were (.39, .025, .59). The first two comparisons are significantly different at p < .05. These findings are based on a total of only five studies, but they suggest that these programs may be serving the weaker ability students more effectively than the stronger ability students, serving both less well than they serve whole heterogeneous groups. For the studies of commercial materials, there were only three studies that were restricted to limited populations. The results for those three studies were (.23, .41, .32), and for all students (n=14) were (.42, .53, .09). These studies were significantly different at p = .004. All three studies included UCSMP; one also included Saxon and was limited in serving primarily high-performing students. This means both categories of programs are showing weaker results when used with high-ability students.

Finally, the studies on NSF-supported materials were disaggregated by subgroups for 28 studies. A complete analysis of this set follows, but the studies that did not report results disaggregated by subgroup generated probabilities of results of (.48, .09, .43), whereas those that did disaggregate their results reported (.76, 0, .24). These gains in positive effects came from significant losses in reporting no significant differences. Studies of commercial materials also reported a small decrease in likelihood of negative effects for the comparison program when disaggregation by subgroup is reported, offset by increases in positive results and results with no significant differences, although these comparisons were not significantly different. A further analysis of this topic follows.

Overall, these results suggest that increased rigor seems to lead in general to less strong outcomes, but never reports of completely contrary results. These results also suggest that in recommending design considerations to evaluators, there should be careful attention to having evaluators include measures of treatment fidelity, considering the impact on all students as well as one particular subgroup; using the correct unit of analysis; and using multiple tests that are also disaggregated by content strand.

Further Analyses

We conducted four further analyses: (1) an analysis of the outcome probabilities by test type; (2) content strands analysis; (3) equity analysis; and (4) an analysis of the interactions of content and equity by grade band. Careful attention to the issues of content strand, equity, and interaction is essential for the advancement of curricular evaluation. Content strand analysis provides the detail that is often lost by reporting overall scores; equity analysis can provide essential information on what subgroups are adequately served by the innovations, and analysis by content and grade level can shed light on the controversies that evolve over time.

Analysis by Test Type

Different studies used varied combinations of outcome measures. Because of the importance of outcome measures on test results, we chose to examine whether the probabilities for the studies changed significantly across different types of outcome measures (national test, local test). The most frequent test usages across all studies were a combination of national and local tests (n=18 studies), a local test alone (n=16), and national tests alone (n=17). Other combinations of tests were used by three or fewer studies. The percentages of various outcomes by test type in comparison to all studies are described in Table 5-10 .

These data ( Table 5-11 ) indicate that national tests tend to produce less positive results, with the difference shifting into results showing no significant differences; this pattern suggests that national tests demonstrate less curricular sensitivity and specificity.

TABLE 5-10 Percentage of Outcomes by Test Type

TABLE 5-11 Percentage of Outcomes by Test Type and Program Type

TABLE 5-12 Number of Studies That Disaggregated by Content Strand

Content Strand

Curricular effectiveness is not an all-or-nothing proposition. A curriculum may be effective in some topics and less effective in others. For this reason, it is useful for evaluators to include an analysis of curricular strands and to report on the performance of students on those strands. To examine this issue, we conducted an analysis of the studies that reported their results by content strand. Thirty-eight studies did this; the breakdown is shown in Table 5-12 by type of curricular program and grade band.

To examine the evaluations of these content strands, we began by listing all of the content strands reported across studies as well as the frequency of report by the number of studies at each grade band. These results are shown in Figure 5-11 , which is broken down by content strand, grade level, and program type.

Although there are numerous content strands, some of them were reported on infrequently. To allow the analysis to focus on the key results from these studies, we separated out the most frequently reported strands, which we call the “major content strands.” We defined these as strands that were examined in at least 10 percent of the studies. The major content strands are marked with an asterisk in Figure 5-11 . When we conduct analyses across curricular program types or grade levels, we use these major strands to facilitate comparisons.

A second phase of our analysis was to examine the performance of students by content strand in the treatment group in comparison to the control groups. Our analysis was conducted across the major content strands at the level of NSF-supported versus commercially generated materials, initially by all studies and then by grade band. It appeared that such analysis permitted some patterns to emerge that might prove helpful to future evaluators in considering the overall effectiveness of each approach. To do this, we coded the number of times any particular strand was measured across all studies that disaggregated by content strand. Then, we coded the proportion of times that this strand was reported as favoring the experimental treatment, favoring the comparative curricula, or showing no significant difference. These data are presented across the major content strands for the NSF-supported curricula ( Figure 5-12 ) and the commercially generated curricula ( Figure 5-13 ) (except in the case of the elementary curricula, where no data were available) in the form of percentages, with the frequencies listed in the bars.

FIGURE 5-11 Study counts for all content strands.

The presentation of results by strand must be accompanied by the same restrictions as stated previously. These results are based on studies identified as at least minimally methodologically adequate. The quality of the outcome measures in measuring the content strands has not been examined. Results are coded in relation to the comparison group in each study and are indicated as statistically in favor of the program, in favor of the comparative program, or showing no significant differences. The results are combined across studies with no weighting by study size. They should be viewed as a means of identifying topics for potential future study. It is entirely possible that refinements in methodology may affect future patterns of results, so the results are to be viewed as tentative and suggestive.


FIGURE 5-12 Major content strand result: All NSF (n=27).

According to these tentative results, future evaluations should examine whether the NSF-supported programs produce sufficient competency among students in the areas of algebraic manipulation and computation. In computation, approximately 40 percent of the results were in favor of the treatment group, no significant differences were reported in approximately 50 percent of the results, and results in favor of the comparison were revealed 10 percent of the time. Interpreting that final proportion of no significant difference is essential. Some would argue that because computation has not been emphasized, findings of no significant differences are acceptable. Others would suggest that such findings indicate weakness, because the development of the materials and accompanying professional development yielded no significant difference in key areas.


FIGURE 5-13 Major content strand result: All commercial (n=8).

Figure 5-13 , which shows findings from studies of commercially generated curricula, indicates that mixed results are commonly reported. Thus, in evaluations of commercial materials, the lack of significant differences in computations/operations, word problems, and probability and statistics suggests that careful attention should be given to measuring these outcomes in future evaluations.

Overall, the grade band results for the NSF-supported programs—while consistent with the aggregated results—provide more detail. At the elementary level, evaluations of NSF-supported curricula (n=12) report better performance in mathematics concepts, geometry, and reasoning and problem solving, and some weaknesses in computation. No content strand analysis for commercially generated materials was possible. Evaluations (n=6) at middle grades of NSF-supported curricula showed strength in measurement, geometry, and probability and statistics and some weaknesses in computation. In the studies of commercial materials, evaluations (n=4) reported favorable results in reasoning and problem solving and some unfavorable results in algebraic procedures, contextual problems, and mathematics concepts. Finally, at the high school level, the evaluations (n=9) by content strand for the NSF-supported curricula showed strong favorable results in algebra concepts, reasoning/problem solving, word problems, probability and statistics, and measurement. Results in favor of the control were reported in 25 percent of the algebra procedures and 33 percent of computation measures.

For the studies of commercial materials (n=4), only the geometry results favor the control group 25 percent of the time, with 50 percent having favorable results. Algebra concepts, reasoning, and probability and statistics also produced favorable results.

Equity Analysis of Comparative Studies

When the goal of providing a standards-based curriculum to all students was proposed, most people could recognize its merits: the replacement of dull, repetitive, largely dead-end courses with courses that would lead all students to be able, if desired and earned, to pursue careers in mathematics-reliant fields. It was clear that the NSF-supported projects, a stated goal of which was to provide standards-based courses to all students, called for curricula that would address the problem of too few students persisting in the study of mathematics. For example, as stated in the NSF Request for Proposals (RFP):

Rather than prematurely tracking students by curricular objectives, secondary school mathematics should provide for all students a common core of mainstream mathematics differentiated instructionally by level of abstraction and formalism, depth of treatment and pace (National Science Foundation, 1991, p. 1). In the elementary level solicitation, a similar statement on courses for all students was made (National Science Foundation, 1988, pp. 4-5).

Some, but not enough, attention has been paid to the education of students who fall below the average of the class. On the other hand, because the above average students sometimes do not receive a demanding education, it may be incorrectly assumed they are easy to teach (National Science Foundation, 1989, p. 2).

Likewise, with increasing numbers of students in urban schools, and increased demographic diversity, the challenges of equity are equally significant for commercial publishers, who feel increasing pressures to demonstrate the effectiveness of their products in various contexts.

The problem was clearly identified: poorer performance by certain subgroups of students (minorities—non-Asian, LEP students, sometimes females) and a resulting lack of representation of such groups in mathematics-reliant fields. In addition, a secondary problem was acknowledged: Highly talented American students were not being provided adequate challenge and stimulation in comparison with their international counterparts. We relied on the concept of equity in examining the evaluation. Equity was contrasted to equality, where one assumed all students should be treated exactly the same (Secada et al., 1995). Equity was defined as providing opportunities and eliminating barriers so that the membership in a subgroup does not subject one to undue and systematically diminished possibility of success in pursuing mathematical study. Appropriate treatment therefore varies according to the needs of and obstacles facing any subgroup.

Applying the principles of equity to evaluate the progress of curricular programs is a conceptually thorny challenge. What is challenging is how to evaluate curricular programs on their progress toward equity in meeting the needs of a diverse student body. Consider how the following questions provide one with a variety of perspectives on the effectiveness of curricular reform regarding equity:

Does one expect all students to improve performance, thus raising the bar, but possibly not to decrease the gap between traditionally well-served and under-served students?

Does one focus on reducing the gap and devote less attention to overall gains, thus closing the gap but possibly not raising the bar?

Or, does one seek evidence that progress is made on both challenges—seeking progress for all students and arguably faster progress for those most at risk?

Evaluating each of the first two questions independently seems relatively straightforward. When one opts for a combination of the two, the potential for tension between them becomes more evident. For example, how can one differentiate the case in which the gap is closed because talented students are being underchallenged from the case in which the gap is closed because the low-performing students improved their progress at an increased rate? Many believe that nearly all mathematics curricula in this country are insufficiently challenging and rigorous. Therefore, achieving modest gains across all ability levels with evidence of accelerated progress by at-risk students may still be criticized for failure to stimulate the top-performing student group adequately. Evaluating curricula with regard to this aspect therefore requires judgment and careful methodological attention.

Depending on one’s view of equity, different implications for the collection of data follow. These considerations made examination of the quality of the evaluations as they treated questions of equity challenging for the committee members. Hence we spell out our assumptions as precisely as possible:

Evaluation studies should include representative samples of student demographics, which may require particular attention to the inclusion of underrepresented minority students from lower socioeconomic groups, females, and special needs populations (LEP, learning disabled, gifted and talented students) in the samples. This may require one to solicit participation by particular schools or districts, rather than to follow the patterns of commercial implementation, which may lead to an unrepresentative sample in aggregate.

Analysis of results should always consider the impact of the program on the entire spectrum of the sample to determine whether the overall gains are distributed fairly among differing student groups, and not achieved as improvements in the mean(s) of an identifiable subpopulation(s) alone.

Analysis should examine whether any group of students is systematically less well served by curricular implementation, causing losses or weakening the rate of gains. For example, this could occur if one neglected the continued development of programs for gifted and talented students in mathematics in order to implement programs focused on improving access for underserved youth, or if one improved programs solely for one group of language learners, ignoring the needs of others, or if one’s study systematically failed to report high attrition affecting rates of participation, success, or failure.

Analyses should examine whether gaps in scores between significantly disadvantaged or underperforming subgroups and advantaged subgroups are decreasing, both in terms of preventing gaps from developing in the first place and in terms of accelerating improvement for underserved youth relative to their advantaged peers at the upper grades.

In reviewing the outcomes of the studies, the committee reports first on what kinds of attention to these issues were apparent in the database, and second on what kinds of results were produced. Some of the studies used multiple methods to provide readers with information on these issues. In our report on the evaluations, we both provide descriptive information on the approaches used and summarize the results of those studies. Developing more effective methods to monitor the achievement of these objectives may need to go beyond what is reported in this study.

Among the 63 at least minimally methodologically adequate studies, 26 reported on the effects of their programs on subgroups of students. The other 37 reported on the effects of the curricular intervention on means of whole groups and their standard deviations, but did not report on their data in terms of the impact on subpopulations. Of those 26 evaluations, 19 studies were on NSF-supported programs and 7 were on commercially generated materials. Table 5-13 reports the most common subgroups used in the analyses and the number of studies that reported on that variable. Because many studies used multiple categories for disaggregation (ethnicity, SES, and gender), the number of reports is more than double the number of studies. For this reason, we report the study results in terms of the “frequency of reports on a particular subgroup” and distinguish this from what we refer to as “study counts.” The advantage of this approach is that it permits reporting on studies that investigated multiple ways to disaggregate their data. The disadvantage is that studies undertaking multiple disaggregations become overrepresented in the data set as a result. A similar distinction and approach were used in our treatment of disaggregation by content strands.

TABLE 5-13 Most Common Subgroups Used in the Analyses and the Number of Studies That Reported on That Variable
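The distinction between “study counts” and “frequency of reports” can be sketched as follows. The study records below are invented purely for illustration; they are not the committee's actual data:

```python
from collections import Counter

# Hypothetical records: each study lists the subgroup variables it
# disaggregated by (illustrative data only).
studies = {
    "study_a": ["gender", "ethnicity", "SES"],
    "study_b": ["gender"],
    "study_c": ["ethnicity", "SES"],
}

# "Study count": number of studies reporting any disaggregation at all.
study_count = sum(1 for vars_ in studies.values() if vars_)

# "Frequency of reports": each (study, subgroup) pair counts once, so a
# study with three disaggregations contributes three reports and is
# therefore weighted more heavily in the report-level tallies.
report_freq = Counter(v for vars_ in studies.values() for v in vars_)
total_reports = sum(report_freq.values())
```

Here three studies yield six reports, which is exactly the overrepresentation the text describes.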

It is apparent from these data that the evaluators of NSF-supported curricula documented more equity-based outcomes, as they reported 43 of the 56 comparisons. However, the same percentage of the NSF-supported evaluations disaggregated their results by subgroup as did commercially generated evaluations (41 percent in both cases). This is an area where evaluations of curricula could benefit greatly from standardization of expectation and methodology. Given the importance of the topic of equity, it should be standard practice to include such analyses in evaluation studies.

In summarizing these 26 studies, the first consideration was whether representative samples of students were evaluated. As we have learned from medical studies, if conclusions on effectiveness are drawn without careful attention to the representativeness of the sample relative to the whole population, the generalizations drawn from the results can be seriously flawed. In Chapter 2 we reported that across the studies, approximately 81 percent of the comparative studies and 73 percent of the case studies reported data on school location (urban, suburban, rural, or state/region), with suburban students constituting the largest percentage in both study types. The proportions of students studied indicated a tendency to undersample urban and rural populations and to oversample suburban schools. Because urban and rural areas have high concentrations of minority and lower-SES students, this raises concerns about the representativeness of the work.

A second consideration was to see whether the achievement effects of curricular interventions were achieved evenly among the various subgroups. Studies answered this question in different ways. Most commonly, evaluators reported on the performance of various subgroups in the treatment conditions as compared to those same subgroups in the comparative condition. They reported outcome scores or gains from pretest to posttest. We refer to these as “between” comparisons.

Other studies reported on the differences among subgroups within an experimental treatment, describing how well one group does in comparison with another group. Again, these reports were done in relation either to outcome measures or to gains from pretest to posttest. Often these reports contained a time element, reporting on how the internal achievement patterns changed over time as a curricular program was used. We refer to these as “within” comparisons.
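The “between” and “within” comparisons can be sketched concretely. The pre/post scores below are invented for illustration only:

```python
from statistics import mean

# Invented pre/post scores by (condition, subgroup).
scores = {
    ("treatment", "group_x"): {"pre": [50, 55], "post": [70, 72]},
    ("treatment", "group_y"): {"pre": [48, 52], "post": [60, 64]},
    ("control",   "group_x"): {"pre": [51, 54], "post": [60, 63]},
    ("control",   "group_y"): {"pre": [49, 51], "post": [56, 58]},
}

def gain(cond, group):
    """Mean pretest-to-posttest gain for one subgroup in one condition."""
    s = scores[(cond, group)]
    return mean(s["post"]) - mean(s["pre"])

# "Between" comparison: the same subgroup compared across the
# treatment and comparative conditions.
between_x = gain("treatment", "group_x") - gain("control", "group_x")

# "Within" comparison: different subgroups compared inside the
# experimental treatment itself.
within_gap = gain("treatment", "group_x") - gain("treatment", "group_y")
```

Reporting both quantities lets a reader see not only whether a curriculum raised scores relative to the comparison, but also how any internal gaps among subgroups behaved under the treatment.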

Some studies reported both between and within comparisons. Others did not report findings by comparing mean scores or gains, but rather created regression equations that predicted the outcomes and examined whether demographic characteristics were related to performance. Six studies (all on NSF-supported curricula) used this approach with variables related to subpopulations. Twelve studies used analysis of covariance (ANCOVA) or multivariate analysis of variance (MANOVA) to study disaggregation by subgroup, and two reported on comparative effect sizes. Of the studies using statistical tests other than t-tests or chi-squares, two were evaluations of commercially generated materials and the rest were of NSF-supported materials.
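A minimal sketch of the regression approach, using a single invented binary demographic indicator. With one binary predictor, the least-squares slope is simply the difference in subgroup means, so a nonzero slope flags a demographic characteristic related to performance:

```python
from statistics import mean

# Invented data: outcome scores with a binary demographic indicator
# (1 = subgroup member, 0 = not a member).
demo  = [0, 0, 0, 1, 1, 1]
score = [70, 72, 74, 64, 66, 68]

# Ordinary least-squares fit of score = a + b * demo.
xbar, ybar = mean(demo), mean(score)
b = sum((x - xbar) * (y - ybar) for x, y in zip(demo, score)) / \
    sum((x - xbar) ** 2 for x in demo)
a = ybar - b * xbar
# Because demo is binary, b equals mean(score | demo=1) - mean(score | demo=0),
# i.e., the subgroup performance gap in outcome-score units.
```

The studies in question typically included several such predictors at once (e.g., treatment, ethnicity, SES); this single-predictor version only shows the core idea.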

Of the studies that reported on gender (n=19), the NSF-supported ones (n=13) reported five cases in which the females outperformed their counterparts in the controls and one case in which the female-male gap decreased within the experimental treatments across grades. In most cases, the studies present a mixed picture with some bright spots, with the majority showing no significant difference. One study reported significant improvements for African-American females.

In relation to race, 15 of 16 reports on African Americans showed positive effects in favor of the treatment group for NSF-supported curricula. Two studies reported decreases in the gaps between African Americans and whites or Asians. One of the two evaluations of African Americans' performance reported for the commercially generated materials showed significant positive results, as mentioned previously.

For Hispanic students, 12 of 15 reports of the NSF-supported materials were significantly positive, with the other 3 showing no significant difference. One study reported a decrease in the gaps in favor of the experimental group. No evaluations of commercially generated materials were reported on Hispanic populations. Other reports on ethnic groups occurred too seldom to generalize.

Students from lower socioeconomic groups fared well, according to reported evaluations of NSF-supported materials (n=8), in that experimental groups outperformed control groups in all but one case. The one study of commercially generated materials that included SES as a variable reported no significant difference. For students with limited English proficiency, of the two evaluations of NSF-supported materials, one reported significantly more positive results for the experimental treatment. Likewise, one study of commercially generated materials yielded a positive result at the elementary level.

We also examined the data for ability differences and found reports by quartiles for a few evaluation studies. In these cases, the evaluations showed results across quartiles in favor of the NSF-supported materials. In one case using the same program, the lower quartiles showed the most improvement, and in the other, the gains were in the middle and upper groups for the Iowa Test of Basic Skills and evenly distributed for the informal assessment.

Summary Statements

After reviewing these studies, the committee observed that differences by gender, race, SES, and performance level should be examined as a regular part of any review of effectiveness. We recommend that all comparative studies report on both “between” and “within” comparisons so that the audience of an evaluation can simply and easily consider the level of improvement, its distribution across subgroups, and the impact of curricular implementation on any gaps in performance. Each of the major categories—gender, race/ethnicity, SES, and achievement level—contributes a significant and contrasting view of curricular impact. Furthermore, more sophisticated accounts would begin to permit finer distinctions to emerge across studies, such as the effect of a program on young African-American women or on first-generation Asian students.

In addition, the committee encourages further study and deliberation on the use of more complex approaches to the examination of equity issues. This is particularly important because of the overlaps among these categories: poverty, for example, can act as its own variable but may also be highly correlated with prior performance. Hence, the use of one variable can mask differences that should be more directly attributable to another. The committee recommends that a group of measurement and equity specialists confer on the most effective designs to advance these questions.

Finally, it is imperative that evaluation studies systematically include demographically representative student populations and distinguish evaluations that follow the commercial patterns of use from those that seek to establish effectiveness with a diverse student population. Along these lines, it is also important that studies report impact data on all substantial ethnic groups, including whites. Many studies, perhaps because whites were the majority population, failed to report on this group in their analyses. As we saw in one study, in which Asian students were from poor homes and first generation, any subgroup can be an at-risk population in some settings, and gains in means cannot be assumed to translate into gains for all subgroups, or even for the majority subgroup. More complete and thorough descriptions of the characteristics of the subgroups being served at any location, with careful attention to interactions, are needed in evaluations.

Interactions Among Content and Equity, by Grade Band

By examining disaggregation by content strand by grade levels, along with disaggregation by diverse subpopulations, the committee began to discover grade band patterns of performance that should be useful in the conduct of future evaluations. Examining each of these issues in isolation can mask some of the overall effects of curricular use. Two examples of such analysis are provided. The first examines all the evaluations of NSF-supported curricula at the elementary level. The second examines the set of evaluations of NSF-supported curricula at the high school level; a parallel analysis cannot be carried out on evaluations of commercially generated programs because they lack disaggregation by student subgroup.

Example One

At the elementary level, the findings of the review of evaluations of data on the effectiveness of NSF-supported curricula report consistent patterns of benefits to students. Across the studies, it appears that positive results are enhanced when accompanied by adequate professional development and the use of pedagogical methods consistent with those indicated by the curricula. The benefits are most consistently evidenced in the broadening topics of geometry, measurement, probability, and statistics, and in applied problem solving and reasoning. It is important to consider whether the outcome measures in these areas demonstrate a depth of understanding. In early understanding of fractions and algebra, there is some evidence of improvement. Weaknesses are sometimes reported in the areas of computational skills, especially in the routinization of multiplication and division. These assertions are tentative, owing to possible flaws in the designs, but quite consistent across studies, and future evaluations should seek to replicate, modify, or discredit these results.

The way to most efficiently and effectively link informal reasoning and formal algorithms and procedures is an open question. Further research is needed to determine how to most effectively link the gains and flexibility associated with student-generated reasoning to the automaticity and generalizability often associated with mastery of standard algorithms.

The data from these evaluations at the elementary level generally present credible evidence of increased success in engaging minority students and students in poverty, based on reported gains that are modestly higher for these students than for the comparative groups. Less well documented in the studies is the extent to which the curricula counteract the tendency for gaps in performance by gender and minority group membership to emerge and persist as students move up the grades. However, the evaluations do indicate that these curricula can help, and almost never do harm. Finally, on the question of adequate challenge for advanced and talented students, the data are equivocal. More attention to this issue is needed.

Example Two

The data at the high school level produced the most conflicting results, and in conducting future evaluations, evaluators will need to examine this level more closely. We identify the high school as the crucible for curricular change for three reasons: (1) the transition to postsecondary education puts considerable pressure on these curricula; (2) the criteria outlined in the NSF RFP specify significant changes from traditional practice; and (3) high school freshmen arrive from a myriad of middle school curricular experiences. For the NSF-supported curricula, the RFP required that the programs provide a core curriculum “drawn from statistics/probability, algebra/functions, geometry/trigonometry, and discrete mathematics” (NSF, 1991, p. 2) and use “a full range of tools, including graphing calculators and computers” (NSF, 1991, p. 2). The NSF RFP also specified the inclusion of “situations from the natural and social sciences and from other parts of the school curriculum as contexts for developing and using mathematics” (NSF, 1991, p. 1). It was during the fourth year that “course options should focus on special mathematical needs of individual students, accommodating not only the curricular demands of the college-bound but also specialized applications supportive of the workplace aspirations of employment-bound students” (NSF, 1991, p. 2). Because this set of requirements comprises a significant departure from conventional practice, the implementation of the high school curricula should be studied in particular detail.

We report on a Systemic Initiative for Montana Mathematics and Science (SIMMS) study by Souhrada (2001) and Brown et al. (1990), in which students were permitted to select traditional, reform, and mixed tracks. It became apparent that the students were quite aware of the choices they faced, as illustrated in the following quote:

The advantage of the traditional courses is that you learn—just math. It’s not applied. You get a lot of math. You may not know where to use it, but you learn a lot…. An advantage in SIMMS is that the kids in SIMMS tell me that they really understand the math. They understand where it comes from and where it is used.

This quote succinctly captures the tensions reported as experienced by students. It suggests that student perceptions are an important source of evidence in conducting evaluations. As we examined these curricular evaluations across the grades, we paid particular attention to the specificity of the outcome measures in relation to curricular objectives. Overall, a review of these studies would lead one to draw the following tentative summary conclusions:

There is some evidence of discontinuity in the articulation between high school and college, resulting from the organization and emphasis of the new curricula. This discontinuity can emerge in scores on college admission tests, placement tests, and first semester grades where nonreform students have shown some advantage on typical college achievement measures.

The most significant areas of disadvantage seem to be in students’ facility with algebraic manipulation, and with formalization, mathematical structure, and proof when isolated from context and denied technological supports. There is some evidence of weakness in computation and numeration, perhaps due to reliance on calculators and varied policies regarding their use at colleges (Kahan, 1999; Huntley et al., 2000).

There is also consistent evidence that the new curricula present strengths in areas of solving applied problems, the use of technology, new areas of content development such as probability and statistics, and functions-based reasoning in the use of graphs, using data in tables, and producing equations to describe situations (Huntley et al., 2000; Hirsch and Schoen, 2002).

Early performance on standard outcome measures at the high school level has shown equivalent or better performance by reform students (Austin et al., 1997; Merlino and Wolff, 2001). However, the common standardized outcome measures (Preliminary Scholastic Assessment Test [PSAT] scores or national tests) are too imprecise to permit more specific comparisons between the NSF-supported and comparison approaches, while program-generated measures lack evidence of external validity and objectivity. There is an urgent need for a set of measures that would provide detailed information on specific concepts and conceptual development over time; such measures may need to be used as embedded as well as summative assessment tools to yield sufficiently precise data on curricular effectiveness.

The data also report some progress in strengthening the performance of underrepresented groups in mathematics relative to their counterparts in the comparative programs (Schoen et al., 1998; Hirsch and Schoen, 2002).

This reported pattern of results should be viewed as very tentative, as there are only a few studies in each of these areas, and most do not adequately control for competing factors, such as the nature of the course received in college. Difficulties in the transition may also be the result of a lack of alignment of measures, especially as placement exams often emphasize algebraic proficiencies. These results are presented only for the purpose of stimulating further evaluation efforts. They further emphasize the need to be certain that such designs examine the level of mathematical reasoning of students, particularly in relation to their knowledge and understanding of the role of proofs and definitions and their facility with algebraic manipulation, as well as carefully document the competencies taught in the curricular materials. In our framework, gauging the ease of transition to college study is an issue of examining curricular alignment with systemic factors, and it needs to be considered along with those tests that demonstrate the curricular validity of measures. Furthermore, the results raising concerns about college success need replication before secure conclusions are drawn.

Also, it is important that subsequent evaluations examine curricular effects on students' interest in mathematics and willingness to persist in its study. Walker (1999) reported that there may be some systematic differences in these behaviors among different curricula and that interest and persistence may help students across a variety of subgroups to survive entry-level hurdles, especially if technical facility with symbol manipulation can be improved. In the context of declines in advanced study in mathematics by American students (Hawkins, 2003), evaluations of curricular impact on students' interest, beliefs, persistence, and success are needed.

The committee takes the position that ultimately the question of the impact of different curricula on performance at the collegiate level should be resolved by whether students are adequately prepared to pursue careers in mathematical sciences, broadly defined, and to reason quantitatively about societal and technological issues. It would be a mistake to focus evaluation efforts solely or primarily on performance on entry-level courses, which can clearly function as filters and may overly emphasize procedural competence, but do not necessarily represent what concepts and skills lead to excellence and success in the field.

These tentative patterns of findings indicate that at the high school level, it is necessary to conduct individual evaluations that examine the transition to college carefully in order to gauge the level of success in preparing students for college entry and the successful negotiation of majors. Equally, it is imperative to examine the impact of high school curricula on other possible student trajectories, such as obtaining high school diplomas, moving into worlds of work or through transitional programs leading to technical training, two-year colleges, and so on.

These two analyses of programs by grade-level band, content strand, and equity represent a methodological innovation that could strengthen the empirical database on curricula significantly and provide the level of detail really needed by curriculum designers to improve their programs. In addition, it appears that one could characterize the NSF programs (and not the commercial programs as a group) as representing a particular approach to curriculum, as discussed in Chapter 3 . It is an approach that integrates content strands; relies heavily on the use of situations, applications, and modeling; encourages the use of technology; and has a significant dose of mathematical inquiry. One could ask the question of whether this approach as a whole is “effective.” It is beyond the charge and scope of this report, but is a worthy target of investigation if one uses proper care in design, execution, and analysis. Likewise other approaches to curricular change should be investigated at the aggregate level, using careful and rigorous design.

The committee believes that a diversity of curricular approaches is a strength in an educational system that maintains local and state control of curricular decision making. While “scientifically established as effective” should be an increasingly important consideration in curricular choice, local cultural differences, needs, values, and goals will also properly influence curricular choice. A diverse set of effective curricula would be ideal. Finally, the committee emphasizes once again the importance of basing the studies on measures with established curricular validity and avoiding corruption of indicators as a result of inappropriate amounts of teaching to the test, so as to be certain that the outcomes are the product of genuine student learning.

CONCLUSIONS FROM THE COMPARATIVE STUDIES

In summary, the committee reviewed a total of 95 comparative studies. There were more evaluations of NSF-supported programs than of commercial ones, and the commercial evaluations were primarily of Saxon or UCSMP materials. Of the 19 curricular programs reviewed, 23 percent of the NSF-supported materials and 33 percent of the commercially generated materials selected had no comparative reviews. This finding is particularly disturbing in light of the legislative mandate in No Child Left Behind (U.S. Department of Education, 2001) for scientifically based curricular programs and materials to be used in the schools. It suggests that more explicit protocols for the conduct of evaluations of programs that include comparative studies need to be required and utilized.

Sixty-nine percent of NSF-supported and 61 percent of commercially generated program evaluations met basic conditions to be classified as at least minimally methodologically adequate studies for the evaluation of effectiveness. These studies were ones that met the criteria of including measures of student outcomes on mathematical achievement, reporting a method of establishing comparability among samples and reporting on implementation elements, disaggregating by content strand, or using precise, theoretical analyses of the construct or multiple measures.

Most of these studies had both strengths and weaknesses in their quasi-experimental designs. The committee reviewed the studies and found that evaluators had developed a number of features that merit inclusion in future work. At the same time, many had internal threats to validity that suggest a need for clearer guidelines for the conduct of comparative evaluations.

Many of the strengths and innovations came from the evaluators’ understanding of the program theories behind the curricula, their knowledge of the complexity of practice, and their commitment to measuring valid and significant mathematical ideas. Many of the weaknesses came from inadequate attention to experimental design, insufficient evidence of the independence of evaluators in some studies, and instability and lack of cooperation in interfacing with the conditions of everyday practice.

The committee identified 10 elements of comparative studies needed to establish a basis for determining the effectiveness of a curriculum. We recognize that not all studies will be able to implement all elements successfully, and that experimental design variations will be based largely on study size and location. The list of elements begins with the seven elements corresponding to the seven critical decisions and adds three additional elements that emerged as a result of our review:

A better balance needs to be achieved between experimental and quasi-experimental studies. The virtual absence of large-scale experimental studies does not provide a way to determine whether the use of quasi-experimental approaches is being systematically biased in unseen ways.

If a quasi-experimental design is selected, it is necessary to establish comparability. When quasi-experimentation is used, it “pertains to studies in which the model to describe effects of secondary variables is not known but assumed” (NRC, 1992, p. 18). This will lead to weaker and potentially suspect causal claims, which should be acknowledged in the evaluation report, but may be necessary for reasons of feasibility (Joint Committee on Standards for Educational Evaluation, 1994). In general, studies to date have assumed that prior achievement measures, ethnicity, gender, and SES are acceptable variables on which to match samples or to make statistical adjustments. But there are often other variables in need of such control in these evaluations, including opportunity to learn, teacher effectiveness, and implementation (see #4 below).
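One common way to establish comparability in a quasi-experimental design is to match treatment units to comparison units on covariates such as prior achievement. The following is a minimal greedy nearest-neighbor sketch with invented pretest scores; real studies would typically match on several covariates at once or use statistical adjustment instead:

```python
# Invented (id, pretest score) pairs for treatment and comparison pools.
treated  = [("t1", 78), ("t2", 65), ("t3", 90)]
controls = [("c1", 64), ("c2", 77), ("c3", 91), ("c4", 50)]

matches = {}
available = dict(controls)
for tid, t_score in treated:
    # Greedy one-to-one match on pretest score, without replacement:
    # pick the remaining control whose pretest is closest.
    cid = min(available, key=lambda c: abs(available[c] - t_score))
    matches[tid] = cid
    del available[cid]
# Each treated unit now has a comparison unit with a similar pretest,
# so posttest differences are less confounded by prior achievement.
```

Matching on observed covariates cannot rule out bias from unobserved ones (opportunity to learn, teacher effectiveness), which is why the text calls the resulting causal claims weaker.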

The selection of a unit of analysis is of critical importance to the design. To the extent possible, it is useful to randomly assign the unit for the different curricula. The number of units of analysis necessary for the study to establish statistical significance depends not on the number of students, but on this unit of analysis. Classrooms and schools appear to be the most likely units of analysis. In addition, increasingly sophisticated means of conducting studies are needed that recognize that the level of the educational system at which experimentation occurs affects research designs.
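The unit-of-analysis point can be made concrete with a small sketch. When classrooms are assigned to curricula, the simplest correct approach analyzes classroom means rather than pooled student scores; the data below are invented:

```python
from statistics import mean

# Invented student scores nested within classrooms. The classroom, not
# the student, is the unit assigned to a curriculum here.
classrooms = {
    "room_a": [70, 75, 80],
    "room_b": [60, 62, 64],
    "room_c": [88, 90, 92],
}

# Aggregate to the assigned unit: one mean per classroom.
classroom_means = {room: mean(scores) for room, scores in classrooms.items()}

# Statistical tests would then be run on these 3 classroom means, not on
# the 9 student scores, so the effective sample size reflects the number
# of assigned units.
n_units = len(classroom_means)
```

Treating the 9 students as independent observations would overstate the sample size and understate the standard errors; hierarchical (multilevel) models are the more sophisticated alternative to simple aggregation.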

It is essential to examine the implementation components through a set of variables that include the extent to which the materials are implemented, teaching methods, the use of supplemental materials, professional development resources, teacher background variables, and teacher effects. Gathering these data to gauge the level of implementation fidelity is essential for evaluators. Studies could also include nested designs to support analysis of variation by implementation components.

Outcome data should include a variety of measures of the highest quality. These measures should vary by question type (open ended, multiple choice), by type of test (international, national, local), and by relation of testing to everyday practice (formative, summative, high stakes), and should ensure curricular validity of measures and assess curricular alignment with systemic factors. The use of comparisons among total tests, fair tests, and conservative tests, as done in the evaluations of UCSMP, permits one to gain insight into teacher effects and to contrast test results by items included. Tests should also include content strands to aid disaggregation, at the level of major content strands (see Figure 5-11) and content-specific items relevant to the experimental curricula.

Statistical analysis should be conducted on the appropriate unit of analysis and should include more sophisticated methods such as ANOVA, ANCOVA, MANCOVA, linear regression, and multiple regression analysis, as appropriate.

Reports should include clear statements of the limitations to generalization of the study. These should include indications of limitations in populations sampled, sample size, unique population inclusions or exclusions, and levels of use or attrition. Data should also be disaggregated by gender, race/ethnicity, SES, and performance levels to permit readers to see comparative gains across subgroups both between and within studies.

It is useful to report effect sizes. It is also useful to present item-level data across treatment programs and to show when the performances of the two groups are within the 10 percent confidence interval of each other. These two extremes document how crucial it is for curriculum developers to garner both precise and generalizable information to inform their revisions.
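One standard effect size is Cohen's d, the difference in group means divided by the pooled standard deviation, which makes gains comparable across studies using different tests. A minimal sketch with invented group scores:

```python
from statistics import mean, stdev
from math import sqrt

# Invented outcome scores for a treatment and a comparison group.
treat = [72, 75, 78, 80, 85]
ctrl  = [68, 70, 73, 74, 75]

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                  / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

d = cohens_d(treat, ctrl)  # positive d favors the treatment group
```

Because d is unit-free, it supports the kind of cross-study comparison the committee calls for, though it should be reported alongside, not instead of, the raw means and item-level data.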

Careful attention should also be given to the selection of samples of populations for participation. These samples should be representative of the populations to whom one wants to generalize the results. Studies should be clear if they are generalizing to groups who have already selected the materials (prior users) or to populations who might be interested in using the materials (demographically representative).

The control group should use an identified comparative curriculum or curricula to avoid comparisons to unstructured instruction.

In addition to these prototypical decisions to be made in the conduct of comparative studies, the committee suggests that it would be ideal for future studies to consider some of the overall effects of these curricula and to test more directly and rigorously some of the findings and alternative hypotheses. Toward this end, the committee reported the tentative findings of these studies by program type. Although these results are subject to revision, based on the potential weaknesses in design of many of the studies summarized, the form of analysis demonstrated in this chapter provides clear guidance about the kinds of knowledge claims and the level of detail that we need to be able to judge effectiveness. Until we are able to achieve an array of comparative studies that provide valid and reliable information on these issues, we will be vulnerable to decision making based excessively on opinion, limited experience, and preconceptions.

This book reviews the evaluation research literature that has accumulated around 19 K-12 mathematics curricula and breaks new ground in framing an ambitious and rigorous approach to curriculum evaluation that has relevance beyond mathematics. The committee that produced this book consisted of mathematicians, mathematics educators, and methodologists who began with the following charge:

  • Evaluate the quality of the evaluations of the thirteen National Science Foundation (NSF)-supported and six commercially generated mathematics curriculum materials;
  • Determine whether the available data are sufficient for evaluating the efficacy of these materials, and, if not;
  • Develop recommendations about the design of a project that could result in the generation of more reliable and valid data for evaluating such materials.

The committee collected, reviewed, and classified almost 700 studies, solicited expert testimony during two workshops, developed an evaluation framework, established dimensions/criteria for three methodologies (content analyses, comparative studies, and case studies), drew conclusions on the corpus of studies, and made recommendations for future research.


Scientific Method

5. Comparison in Scientific Research

Anyone who has stared at a chimpanzee in a zoo (Figure 1) has probably wondered about the animal’s similarity to humans. Chimps make facial expressions that resemble humans, use their hands in much the same way we do, are adept at using different objects as tools, and even laugh when they are tickled. It may not be surprising to learn then that when the first captured chimpanzees were brought to Europe in the 17th century, people were confused, labeling the animals “pygmies” and speculating that they were stunted versions of “full-grown” humans. A London physician named Edward Tyson obtained a “pygmie” that had died of an infection shortly after arriving in London, and began a systematic study of the animal that cataloged the differences between chimpanzees and humans, thus helping to establish comparative research as a scientific method.

Figure 1: A chimpanzee.

A brief history of comparative methods

In 1698, Tyson, a member of the Royal Society of London, began a detailed dissection of the “pygmie” he had obtained and published his findings in the 1699 work: Orang-Outang, sive Homo Sylvestris: or, the Anatomy of a Pygmie Compared with that of a Monkey, an Ape, and a Man . The title of the work further reflects the misconception that existed at the time – Tyson did not use the term Orang-Outang in its modern sense to refer to the orangutan; he used it in its literal translation from the Malay language as “man of the woods,” as that is how the chimps were viewed.

Tyson took great care in his dissection. He precisely measured and compared a number of anatomical variables such as brain size of the “pygmie,” ape, and human. He recorded his measurements of the “pygmie,” even down to the direction in which the animal’s hair grew: “The tendency of the Hair of all of the Body was downwards; but only from the Wrists to the Elbows ’twas upwards” (Russell, 1967). Aided by William Cowper, Tyson made drawings of various anatomical structures, taking great care to accurately depict the dimensions of these structures so that they could be compared to those in humans (Figure 2). His systematic comparative study of the dimensions of anatomical structures in the chimp, ape, and human led him to state:

in the Organization of abundance of its Parts, it more approached to the Structure of the same in Men: But where it differs from a Man, there it resembles plainly the Common Ape, more than any other Animal. (Russell, 1967)

Tyson’s comparative studies proved exceptionally accurate and his research was used by others, including Thomas Henry Huxley in Evidence as to Man’s Place in Nature (1863) and Charles Darwin in The Descent of Man (1871).

Figure 2: Tyson’s drawings of the “pygmie” and its skeleton.

Tyson’s methodical and scientific approach to anatomical dissection contributed to the development of evolutionary theory and helped establish the field of comparative anatomy. Further, Tyson’s work helps to highlight the importance of comparison as a scientific research method.

Comparison as a scientific research method

Comparative research represents one approach in the spectrum of scientific research methods and in some ways is a hybrid of other methods, drawing on aspects of both experimental science (see our Experimentation in Science module) and descriptive research (see our Description in Science module). Similar to experimentation, comparison seeks to decipher the relationship between two or more variables by documenting observed differences and similarities between two or more subjects or groups. In contrast to experimentation, the comparative researcher does not subject one of those groups to a treatment, but rather observes a group that either by choice or circumstance has been subject to a treatment. Thus comparison involves observation in a more “natural” setting, not subject to experimental confines, and in this way evokes similarities with description.

Importantly, the simple comparison of two variables or objects is not comparative research. Tyson’s work would not have been considered scientific research if he had simply noted that “pygmies” looked like humans without measuring bone lengths and hair growth patterns. Instead, comparative research involves the systematic cataloging of the nature and/or behavior of two or more variables, and the quantification of the relationship between them.

Figure 3: Skeleton of a chimpanzee.

While the choice of which research method to use is a personal decision based in part on the training of the researchers conducting the study, there are a number of scenarios in which comparative research would likely be the primary choice.

The first scenario is one in which the scientist is not trying to measure a response to change, but rather to understand the similarities and differences between two subjects. For example, Tyson was not observing a change in his “pygmie” in response to an experimental treatment. Instead, his research was a comparison of the unknown “pygmie” to humans and apes in order to determine the relationship between them.

A second scenario in which comparative studies are common is when the physical scale or timeline of a question prevents experimentation. For example, in the field of paleoclimatology, researchers have compared cores taken from sediments deposited millions of years ago in the world’s oceans to see if the sedimentary composition is similar across all oceans or differs according to geographic location. Because the sediments in these cores were deposited millions of years ago, it would be impossible to obtain these results through the experimental method. Research designed to look at past events, such as sediment cores deposited millions of years ago, is referred to as retrospective research.

A third common comparative scenario is when the ethical implications of an experimental treatment preclude an experimental design. Researchers who study the toxicity of environmental pollutants or the spread of disease in humans are precluded from purposefully exposing a group of individuals to the toxin or disease for ethical reasons. In these situations, researchers would set up a comparative study by identifying individuals who have been accidentally exposed to the pollutant or disease and comparing their symptoms to those of a control group of people who were not exposed. Research designed to look at events from the present into the future, such as a study following the development of symptoms in individuals exposed to a pollutant, is referred to as prospective research.

Comparative science was significantly strengthened in the late 19th and early 20th centuries with the introduction of modern statistical methods, which were used to quantify the association between variables (see our Statistics in Science module). Today, statistical methods are critical for quantifying the nature of relationships examined in many comparative studies. The outcome of comparative research is often presented in one of the following ways: as a probability, as a statement of statistical significance, or as a declaration of risk. For example, in 2007 Kristensen and Bjerkedal showed that there is a statistically significant relationship (at the 95% confidence level) between birth order and IQ by comparing test scores of first-born children to those of their younger siblings (Kristensen & Bjerkedal, 2007). And numerous studies have contributed to the determination that the risk of developing lung cancer is 30 times greater in smokers than in nonsmokers (NCI, 1997).
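A "declaration of risk" like the one above is simply a ratio of incidence rates between the two groups being compared. A minimal sketch of the arithmetic, using made-up illustrative counts rather than the actual study data:

```python
def relative_risk(exposed_cases, exposed_total, control_cases, control_total):
    """Ratio of the incidence of an outcome in an exposed group
    to its incidence in an unexposed (control) group."""
    risk_exposed = exposed_cases / exposed_total
    risk_control = control_cases / control_total
    return risk_exposed / risk_control

# Illustrative numbers only: if 300 of 10,000 smokers and 10 of 10,000
# nonsmokers develop lung cancer, smokers' risk is 30 times greater.
rr = relative_risk(300, 10_000, 10, 10_000)
print(round(rr))  # 30
```

A relative risk of 1 would mean the treatment makes no difference; values far from 1, combined with a significance test, are what allow comparative studies to make quantitative claims.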

Comprehension Checkpoint

Scientists may opt for comparative research where it would be unethical to conduct an experiment.

Comparison in practice: The case of cigarettes

In 1919, Dr. George Dock, chairman of the Department of Medicine at Barnes Hospital in St. Louis, asked all of the third- and fourth-year medical students at the teaching hospital to observe an autopsy of a man with a disease so rare, he claimed, that most of the students would likely never see another case of it in their careers. With the medical students gathered around, the physicians conducting the autopsy observed that the patient’s lungs were speckled with large dark masses of cells that had caused extensive damage to the lung tissue and had forced the airways to close and collapse. Dr. Alton Ochsner, one of the students who observed the autopsy, would write years later that “I did not see another case until 1936, seventeen years later, when in a period of six months, I saw nine patients with cancer of the lung. – All the afflicted patients were men who smoked heavily and had smoked since World War I” (Meyer, 1992).


The American physician Dr. Isaac Adler was, in fact, the first scientist to propose a link between cigarette smoking and lung cancer in 1912, based on his observation that lung cancer patients often reported that they were smokers. Adler’s observations, however, were anecdotal, and provided no scientific evidence toward demonstrating a relationship. The German epidemiologist Franz Müller is credited with the first case-control study of smoking and lung cancer in the 1930s. Müller sent a survey to the relatives of individuals who had died of cancer, and asked them about the smoking habits of the deceased. Based on the responses he received, Müller reported a higher incidence of lung cancer among heavy smokers compared to light smokers. However, the study had a number of problems. First, it relied on the memory of relatives of deceased individuals rather than first-hand observations, and second, no statistical association was made. Soon after this, the tobacco industry began to sponsor research with the biased goal of repudiating negative health claims against cigarettes (see our Scientific Institutions and Societies module for more information on sponsored research).

Beginning in the 1950s, several well-controlled comparative studies were initiated. In 1950, Ernest Wynder and Evarts Graham published a retrospective study comparing the smoking habits of 605 hospital patients with lung cancer to 780 hospital patients with other diseases (Wynder & Graham, 1950). Their study showed that 1.3% of lung cancer patients were nonsmokers while 14.6% of patients with other diseases were nonsmokers. In addition, 51.2% of lung cancer patients were “excessive” smokers while only 19.1% of other patients were excessive smokers. Both of these comparisons proved to be statistically significant differences. The statisticians who analyzed the data concluded:

when the nonsmokers and the total of the high smoking classes of patients with lung cancer are compared with patients who have other diseases, we can reject the null hypothesis that smoking has no effect on the induction of cancer of the lungs.

Wynder and Graham also suggested that there might be a lag of ten years or more between the period of smoking in an individual and the onset of clinical symptoms of cancer. This would present a major challenge to researchers since any study that investigated the relationship between smoking and lung cancer in a prospective fashion would have to last many years.
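The significance claim in the Wynder and Graham comparison can be illustrated with a standard two-proportion z-test. The sketch below uses approximate nonsmoker counts back-calculated from the reported percentages, not the paper's own analysis:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between two independent proportions,
    using the pooled proportion under the null hypothesis of no difference."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Approximate counts implied by the reported percentages:
# ~8 nonsmokers of 605 lung cancer patients (1.3%)
# vs. ~114 nonsmokers of 780 patients with other diseases (14.6%).
z = two_proportion_z(8, 605, 114, 780)
print(abs(z) > 1.96)  # True: significant at the 95% confidence level
```

With a |z| far above the 1.96 cutoff, a difference this large in groups this size is extremely unlikely to arise by chance, which is what justified rejecting the null hypothesis.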

Richard Doll and Austin Hill published a similar comparative study in 1950 in which they showed that there was a statistically higher incidence of smoking among lung cancer patients compared to patients with other diseases (Doll & Hill, 1950). In their discussion, Doll and Hill raise an interesting point regarding comparative research methods by saying,

This is not necessarily to state that smoking causes carcinoma of the lung. The association would occur if carcinoma of the lung caused people to smoke or if both attributes were end-effects of a common cause.

They go on to assert that because the habit of smoking was seen to develop before the onset of lung cancer, the argument that lung cancer leads to smoking can be rejected. They therefore conclude, “that smoking is a factor, and an important factor, in the production of carcinoma of the lung.”

Despite this substantial evidence, both the tobacco industry and unbiased scientists raised objections, claiming that the retrospective research on smoking was “limited, inconclusive, and controversial.” The industry stated that the studies published did not demonstrate cause and effect, but rather a spurious association between two variables. Dr. Wilhelm Hueper of the National Cancer Institute, a scientist with a long history of research into occupational causes of cancers, argued that the emphasis on cigarettes as the only cause of lung cancer would compromise research support for other causes of lung cancer. Ronald Fisher, a renowned statistician, also was opposed to the conclusions of Doll and others, purportedly because they promoted a “puritanical” view of smoking.

The tobacco industry mounted an extensive campaign of misinformation, sponsoring and then citing research that showed that smoking did not cause “cardiac pain” as a distraction from the studies that were being published regarding cigarettes and lung cancer. The industry also highlighted studies that showed that individuals who quit smoking suffered from mild depression, and they pointed to the fact that even some doctors themselves smoked cigarettes as evidence that cigarettes were not harmful (Figure 5).

Figure 5: A cigarette advertisement.

While the scientific research began to impact health officials and some legislators, the industry’s ad campaign was effective. The US Federal Trade Commission banned tobacco companies from making health claims about their products in 1955. However, more significant regulation was averted. An editorial that appeared in the New York Times in 1963 summed up the national sentiment when it stated that the tobacco industry made a “valid point,” and the public should refrain from making a decision regarding cigarettes until further reports were issued by the US Surgeon General.

In 1951, Doll and Hill enrolled 40,000 British physicians in a prospective comparative study to examine the association between smoking and the development of lung cancer. In contrast to the retrospective studies that followed patients with lung cancer back in time, the prospective study was designed to follow the group forward in time. In 1952, Drs. E. Cuyler Hammond and Daniel Horn enrolled 187,783 white males in the United States in a similar prospective study. And in 1959, the American Cancer Society (ACS) began the first of two large-scale prospective studies of the association between smoking and the development of lung cancer. The first ACS study, named Cancer Prevention Study I, enrolled more than 1 million individuals and tracked their health, smoking and other lifestyle habits, development of diseases, cause of death, and life expectancy for almost 13 years (Garfinkel, 1985).

All of the studies demonstrated that smokers are at a higher risk of developing and dying from lung cancer than nonsmokers. The ACS study further showed that smokers have elevated rates of other pulmonary diseases, coronary artery disease, stroke, and cardiovascular problems. The two ACS Cancer Prevention Studies would eventually show that 52% of deaths among smokers enrolled in the studies were attributed to cigarettes.

In the second half of the 20th century, evidence from other scientific research methods would contribute multiple lines of evidence to the conclusion that cigarette smoke is a major cause of lung cancer:

  • Descriptive studies of the pathology of the lungs of deceased smokers would demonstrate that smoking causes significant physiological damage to the lungs.
  • Experiments that exposed mice, rats, and other laboratory animals to cigarette smoke showed that it caused cancer in these animals (see our Experimentation in Science module for more information).
  • Physiological models would help demonstrate the mechanism by which cigarette smoke causes cancer.

As evidence linking cigarette smoke to lung cancer and other diseases accumulated, the public, the legal community, and regulators slowly responded. In 1957, the US Surgeon General first acknowledged an association between smoking and lung cancer when a report was issued stating, “It is clear that there is an increasing and consistent body of evidence that excessive cigarette smoking is one of the causative factors in lung cancer.” In 1965, over objections by the tobacco industry and the American Medical Association, which had just accepted a $10 million grant from the tobacco companies, the US Congress passed the Federal Cigarette Labeling and Advertising Act, which required that cigarette packs carry the warning: “Caution: Cigarette Smoking May Be Hazardous to Your Health.” In 1967, the US Surgeon General issued a second report stating that cigarette smoking is the principal cause of lung cancer in the United States. While the tobacco companies found legal means to protect themselves for decades following this, in 1996, Brown and Williamson Tobacco Company was ordered to pay $750,000 in a tobacco liability lawsuit; it became the first liability award paid to an individual by a tobacco company.

_________ research looks at past events, while _________ research looks at events from the present into the future.

  • Prospective, retrospective
  • Retrospective, prospective

Comparison across disciplines

Comparative studies are used in a host of scientific disciplines, from anthropology to archaeology, comparative biology, epidemiology, psychology, and even forensic science. DNA fingerprinting, a technique used to incriminate or exonerate a suspect using biological evidence, is based on comparative science. In DNA fingerprinting, segments of DNA are isolated from a suspect and from biological evidence such as blood, semen, or other tissue left at a crime scene. Up to 20 different segments of DNA are compared between that of the suspect and the DNA found at the crime scene. If all of the segments match, the investigator can calculate the statistical probability that the DNA came from the suspect as opposed to someone else. Thus DNA matches are described in terms of a “1 in 1 million” or “1 in 1 billion” chance of error.
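The probabilities cited for DNA matches come from multiplying per-segment match frequencies, under the assumption that the segments are inherited independently. A minimal sketch with hypothetical frequencies (the real per-locus values depend on the population database used):

```python
# Hypothetical per-locus match frequencies: the chance that a random,
# unrelated person would match the evidence at each DNA segment tested.
locus_frequencies = [0.1, 0.05, 0.2, 0.08, 0.125]

# Under the independence assumption, the frequencies multiply.
random_match_probability = 1.0
for f in locus_frequencies:
    random_match_probability *= f

# About 1e-05: roughly a "1 in 100,000" chance of a coincidental match.
# Testing more loci multiplies in more small factors, pushing the
# probability toward "1 in 1 billion" and beyond.
print(random_match_probability)
```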

Comparative methods are also commonly used in studies involving humans due to the ethical limits of experimental treatment. For example, in 2007, Petter Kristensen and Tor Bjerkedal published a study in which they compared the IQ of over 250,000 male Norwegians in the military (Kristensen & Bjerkedal, 2007). The researchers found a significant relationship between birth order and IQ, where the average IQ of first-born male children was approximately three points higher than the average IQ of the second-born male in the same family. The researchers further showed that this relationship was correlated with social rather than biological factors, as second-born males who grew up in families in which the first-born child died had average IQs similar to other first-born children. One might imagine a scenario in which this type of study could be carried out experimentally, for example, purposefully removing first-born male children from certain families, but the ethics of such an experiment preclude it from ever being conducted.

Limitations of comparative methods

One of the primary limitations of comparative methods is the control of other variables that might influence a study. For example, as pointed out by Doll and Hill in 1950, the association between smoking and cancer deaths could have meant that: a) smoking caused lung cancer, b) lung cancer caused individuals to take up smoking, or c) a third unknown variable caused lung cancer AND caused individuals to smoke (Doll & Hill, 1950). As a result, comparative researchers often go to great lengths to choose two different study groups that are similar in almost all respects except for the treatment in question. In fact, many comparative studies in humans are carried out on identical twins for this exact reason. For example, in the field of tobacco research, dozens of comparative twin studies have been used to examine everything from the health effects of cigarette smoke to the genetic basis of addiction.

What is the benefit of comparing identical twins in research studies?

  • It is easier to do research if the subjects know each other well.
  • It helps control variables that might influence a study.

Comparison in modern practice

Figure 6: The Keeling curve of atmospheric CO2 measured at Mauna Loa.

Despite the lessons learned during the debate that ensued over the possible effects of cigarette smoke, misconceptions still surround comparative science. For example, in the late 1950s, Charles Keeling, an oceanographer at the Scripps Institution of Oceanography, began to publish data he had gathered from a long-term descriptive study of atmospheric carbon dioxide (CO2) levels at the Mauna Loa observatory in Hawaii (Keeling, 1958). Keeling observed that atmospheric CO2 levels were increasing at a rapid rate (Figure 6). He and other researchers began to suspect that rising CO2 levels were associated with increasing global mean temperatures, and several comparative studies have since correlated rising CO2 levels with rising global temperature (Keeling, 1970). Together with research from modeling studies (see our Modeling in Scientific Research module), this research has provided evidence for an association between global climate change and the burning of fossil fuels (which emits CO2).

Yet in a move reminiscent of the fight launched by the tobacco companies, the oil and fossil fuel industry launched a major public relations campaign against climate change research. As late as 1989, scientists funded by the oil industry were producing reports that called the research on climate change “noisy junk science” (Roberts, 1989). As with the tobacco issue, challenges to early comparative studies tried to paint the method as less reliable than experimental methods. But the challenges actually strengthened the science by prompting more researchers to launch investigations, thus providing multiple lines of evidence supporting an association between atmospheric CO2 concentrations and climate change. As a result, the culmination of multiple lines of scientific evidence prompted the Intergovernmental Panel on Climate Change, organized by the United Nations, to issue a report stating that “warming of the climate system is unequivocal” and “carbon dioxide is the most important anthropogenic greenhouse gas” (IPCC, 2007).

Comparative studies are a critical part of the spectrum of research methods currently used in science. They allow scientists to apply a treatment-control design in settings that preclude experimentation, and they can provide invaluable information about the relationships between variables. The intense scrutiny that comparison has undergone in the public arena due to cases involving cigarettes and climate change has actually strengthened the method by clarifying its role in science and emphasizing the reliability of data obtained from these studies.

Comparing and contrasting is a critical research tool for making sense of the world. Through scenarios in which scientists would likely choose to do comparative research, this module explores the differences and similarities between comparison and experimentation. Studies of the link between cigarette smoking and health illustrate how comparison along with other research methods provided solid evidence that cigarette smoke is a major cause of lung cancer.

Key Concepts

  • Comparison is used to determine and quantify relationships between two or more variables by observing different groups that either by choice or circumstance are exposed to different treatments.
  • Comparison includes both retrospective studies that look at events that have already occurred, and prospective studies that examine variables from the present forward.
  • Comparative research is similar to experimentation in that it involves comparing a treatment group to a control, but it differs in that the treatment is observed rather than consciously imposed, either because of ethical concerns or because imposition is not possible, as in a retrospective study.


Comparing and Contrasting in an Essay | Tips & Examples

Published on August 6, 2020 by Jack Caulfield . Revised on July 23, 2023.

Comparing and contrasting is an important skill in academic writing. It involves taking two or more subjects and analyzing the differences and similarities between them.


Table of contents

  • When should I compare and contrast?
  • Making effective comparisons
  • Comparing and contrasting as a brainstorming tool
  • Structuring your comparisons
  • Other interesting articles
  • Frequently asked questions about comparing and contrasting

When should I compare and contrast?

Many assignments will invite you to make comparisons quite explicitly, as in these prompts.

  • Compare the treatment of the theme of beauty in the poetry of William Wordsworth and John Keats.
  • Compare and contrast in-class and distance learning. What are the advantages and disadvantages of each approach?

Some other prompts may not directly ask you to compare and contrast, but present you with a topic where comparing and contrasting could be a good approach.

One way to approach this essay might be to contrast the situation before the Great Depression with the situation during it, to highlight how large a difference it made.

Comparing and contrasting is also used in all kinds of academic contexts where it’s not explicitly prompted. For example, a literature review involves comparing and contrasting different studies on your topic, and an argumentative essay may involve weighing up the pros and cons of different arguments.


Making effective comparisons

As the name suggests, comparing and contrasting is about identifying both similarities and differences. You might focus on contrasting quite different subjects or comparing subjects with a lot in common—but there must be some grounds for comparison in the first place.

For example, you might contrast French society before and after the French Revolution; you’d likely find many differences, but there would be a valid basis for comparison. However, if you contrasted pre-revolutionary France with Han-dynasty China, your reader might wonder why you chose to compare these two societies.

This is why it’s important to clarify the point of your comparisons by writing a focused thesis statement . Every element of an essay should serve your central argument in some way. Consider what you’re trying to accomplish with any comparisons you make, and be sure to make this clear to the reader.

Comparing and contrasting as a brainstorming tool

Comparing and contrasting can be a useful tool to help organize your thoughts before you begin writing any type of academic text. You might use it to compare different theories and approaches you’ve encountered in your preliminary research, for example.

Let’s say your research involves the competing psychological approaches of behaviorism and cognitive psychology. You might make a table to summarize the key differences between them.

Or say you’re writing about the major global conflicts of the twentieth century. You might visualize the key similarities and differences in a Venn diagram.

A Venn diagram showing the similarities and differences between World War I, World War II, and the Cold War.

These visualizations wouldn’t make it into your actual writing, so they don’t have to be very formal in terms of phrasing or presentation. The point of comparing and contrasting at this stage is to help you organize and shape your ideas to aid you in structuring your arguments.

Structuring your comparisons

When comparing and contrasting in an essay, there are two main ways to structure your comparisons: the alternating method and the block method.

The alternating method

In the alternating method, you structure your text according to what aspect you’re comparing. You cover both your subjects side by side in terms of a specific point of comparison. Your text is structured like this:

  • Point of comparison A: Subject 1, Subject 2
  • Point of comparison B: Subject 1, Subject 2

The example paragraph below shows how this approach works.

One challenge teachers face is identifying and assisting students who are struggling without disrupting the rest of the class. In a traditional classroom environment, the teacher can easily identify when a student is struggling based on their demeanor in class or simply by regularly checking on students during exercises. They can then offer assistance quietly during the exercise or discuss it further after class. Meanwhile, in a Zoom-based class, the lack of physical presence makes it more difficult to pay attention to individual students’ responses and notice frustrations, and there is less flexibility to speak with students privately to offer assistance. In this case, therefore, the traditional classroom environment holds the advantage, although it appears likely that aiding students in a virtual classroom environment will become easier as the technology, and teachers’ familiarity with it, improves.

The block method

In the block method, you cover each of the overall subjects you’re comparing in a block. You say everything you have to say about your first subject, then discuss your second subject, making comparisons and contrasts back to the things you’ve already said about the first. Your text is structured like this:

  • Subject 1: Point of comparison A, Point of comparison B
  • Subject 2: Point of comparison A, Point of comparison B

The most commonly cited advantage of distance learning is the flexibility and accessibility it offers. Rather than being required to travel to a specific location every week (and to live near enough to feasibly do so), students can participate from anywhere with an internet connection. This allows not only for a wider geographical spread of students but for the possibility of studying while travelling. However, distance learning presents its own accessibility challenges; not all students have a stable internet connection and a computer or other device with which to participate in online classes, and less technologically literate students and teachers may struggle with the technical aspects of class participation. Furthermore, discomfort and distractions can hinder an individual student’s ability to engage with the class from home, creating divergent learning experiences for different students. Distance learning, then, seems to improve accessibility in some ways while representing a step backwards in others.

Note that these two methods can be combined; these two example paragraphs could both be part of the same essay, but it’s wise to use an essay outline to plan out which approach you’re taking in each paragraph.

Some essay prompts include the keywords “compare” and/or “contrast.” In these cases, an essay structured around comparing and contrasting is the appropriate response.

Comparing and contrasting is also a useful approach in all kinds of academic writing: You might compare different studies in a literature review, weigh up different arguments in an argumentative essay, or consider different theoretical approaches in a theoretical framework.

Your subjects might be very different or quite similar, but it’s important that there be meaningful grounds for comparison. You can probably describe many differences between a cat and a bicycle, but there isn’t really any connection between them to justify the comparison.

You’ll have to write a thesis statement explaining the central point you want to make in your essay, so be sure to know in advance what connects your subjects and makes them worth comparing.

Comparisons in essays are generally structured in one of two ways:

  • The alternating method, where you compare your subjects side by side according to one specific aspect at a time.
  • The block method, where you cover each subject separately in its entirety.

It’s also possible to combine both methods, for example by writing a full paragraph on each of your topics and then a final paragraph contrasting the two according to a specific metric.

Cite this Scribbr article


Caulfield, J. (2023, July 23). Comparing and Contrasting in an Essay | Tips & Examples. Scribbr. Retrieved February 19, 2024, from https://www.scribbr.com/academic-essay/compare-and-contrast/

Jack Caulfield

  • Open access
  • Published: 21 February 2024

A cross-sectional and population-based study from primary care on post-COVID-19 conditions in non-hospitalized patients

  • Dominik J. Ose (ORCID: orcid.org/0000-0002-5079-2152),
  • Elena Gardner,
  • Morgan Millar (ORCID: orcid.org/0000-0001-5532-6970),
  • Andrew Curtin,
  • Jiqiang Wu,
  • Mingyuan Zhang (ORCID: orcid.org/0000-0002-9837-4187),
  • Camie Schaefer,
  • Jing Wang,
  • Jennifer Leiser (ORCID: orcid.org/0000-0001-6152-6143),
  • Kirsten Stoesser (ORCID: orcid.org/0000-0001-9108-4438) &
  • Bernadette Kiraly (ORCID: orcid.org/0000-0003-2243-6553)

Communications Medicine volume 4, Article number: 24 (2024)


Background

Current research on post-COVID-19 conditions (PCC) has focused on hospitalized COVID-19 patients and often lacks a comparison group. This study assessed the prevalence of PCC in non-hospitalized COVID-19 primary care patients compared to primary care patients not diagnosed with COVID-19.

Methods

This cross-sectional, population-based study ( n  = 2539) analyzed and compared the prevalence of PCC in patients with a positive COVID-19 test ( n  = 1410) and patients with a negative COVID-19 test ( n  = 1129), none of whom were ever hospitalized for COVID-19-related conditions. Participants were identified using electronic health records and completed an electronic questionnaire, available in English and Spanish, including 54 potential post-COVID-19 symptoms. Logistic regression was conducted to assess the association of PCC with COVID-19.

Results

Post-COVID-19 conditions are prevalent in both groups, and significantly more prevalent in patients with COVID-19. Significant differences exist for the twenty most commonly reported conditions, with the exception of anxiety. Common conditions are fatigue (59.5% (COVID-19 positive) vs. 41.3% (COVID-19 negative); OR 2.15 [1.79–2.60]), difficulty sleeping (52.1% (positive) vs. 41.9% (negative); OR 1.42 [1.18–1.71]), and concentration problems (50.6% (positive) vs. 28.5% (negative); OR 2.64 [2.17–3.22]). Similar disparities in prevalence are also observed when the two groups (positive vs. negative) are stratified by age, sex, time since testing, and race/ethnicity.

Conclusions

PCC is highly prevalent in non-hospitalized COVID-19 patients in primary care. However, it is important to note that PCC strongly overlaps with common health symptoms seen in primary care, including fatigue, difficulty sleeping, and headaches, which makes the diagnosis of PCC in primary care even more challenging.

Plain Language Summary

Research on post-COVID-19 conditions (PCC), also known as Long COVID, has often involved hospitalized COVID-19 patients. However, many patients with COVID-19 were never hospitalized, so how commonly the condition affects individuals attending primary care services is not well characterized. Here, we assessed non-hospitalized primary care patients with and without COVID-19. Our results demonstrate that PCC is highly common among primary care patients with COVID-19 and often presents as fatigue, difficulty sleeping, and concentration problems. Because these symptoms overlap with other non-COVID-related conditions, it is challenging to accurately diagnose PCC. This calls for improved diagnostics and management of PCC in primary care, which is often patients' first point of contact with the healthcare system.

Introduction

Research about long-COVID, often referred to as post-COVID-19 conditions, is emerging 1 . Post-COVID-19 conditions (PCC) are characterized as signs and symptoms that develop during or after a COVID-19 infection, are consistently present for more than 12 weeks, and are not attributable to alternative diagnoses 2 , 3 . PCC can affect multiple body systems and cause various symptoms, including fatigue, shortness of breath, smell/taste disorders, muscle weakness, anxiety, and memory problems 4 , 5 , 6 , 7 , 8 . A meta-analysis indicated that 80% of patients with a COVID-19 infection developed one or more long-term symptoms 9 .

Many previous studies have focused on PCC in hospitalized patients and suggested that PCC is more common among, or even specific to, hospitalized COVID-19 patients 10 , 11 , 12 , 13 , 14 , 15 . Meanwhile, research in non-hospitalized patients is evolving 7 , 16 , 17 , 18 ; a recent study suggests that hospitalized and non-hospitalized patients have similar rates of PCC 19 . Moreover, both hospitalized and non-hospitalized patients with PCC often report poor or decreased quality of life regarding mobility, pain and discomfort, and the ability to return to normal levels of work or social activity 20 , 21 , 22 . Still, studies focusing exclusively on non-hospitalized or primary care patients are rare.

During the pandemic, primary care has been instrumental in identifying, managing, and monitoring patients with COVID-19 and has been critical for the implementation and mass delivery of vaccination 23 , 24 . Primary care, often the first point of contact with the health system, is also likely to play an important role in addressing challenges associated with PCC 25 . Several PCC-related symptoms (e.g., fatigue, muscle weakness, depression) are commonly reported and treated in primary care settings, independent of COVID-19 26 .

However, evidence about PCC in primary care remains scarce. In particular, population-based studies with control groups that quantify the burden of PCC in primary care are missing. To address this situation, this study aims to analyze the prevalence of PCC in non-hospitalized COVID-19 patients in primary care, and to compare the prevalence of PCC symptoms between patients with and without COVID-19.

The results of our study show that PCC symptoms, such as fatigue, shortness of breath, and difficulty sleeping, are prevalent among non-hospitalized primary care patients, independent of COVID-19. However, the symptom burden is much higher among COVID-19 patients. This evidence highlights the major challenge faced by primary care providers: how to distinguish PCC from the background of symptoms commonly addressed in primary care. Overall, our findings support claims that PCC is ideally managed in the primary care setting, especially given the holistic, longitudinal, and multidisciplinary nature of primary care. In particular, comprehensive training on care pathways, guidelines, and referral criteria is necessary to support a primary care-led response to PCC.

Design and population

This cross-sectional, population-based study was conducted at the University of Utah Health (U of U Health) system in Salt Lake City, Utah, United States. U of U Health is Utah’s only academic healthcare system and provides primary care through 12 health centers in the greater Salt Lake City area. These clinics serve a combined total of about 120,000 patients annually. All participants provided informed consent. The University of Utah Institutional Review Board (IRB #139714) determined the study to be exempt.

Inclusion/exclusion criteria

Participant selection criteria included: age 18+ years, at least one prior visit (in-person or virtual) with a U of U Health primary care center between January 2020 and March 2021, email address on file, preferred language English or Spanish, and a positive or negative COVID-19 test result (PCR) between March 1st, 2020, and August 31st, 2021, documented in their electronic health record (EHR). Patients were excluded if they had a COVID-19 test before March 1st, 2020, or if they were hospitalized or sought emergency department care related to COVID-19.

Questionnaire

The questionnaire was developed utilizing input from (1) a literature review to identify common post-COVID-19 symptoms and (2) primary care physician observations during the pandemic. The questionnaire development was iterative, with multiple drafts revised for clarity and content validity based on feedback from colleagues, clinicians, and other researchers with expertise in questionnaire methods. A pilot test of the questionnaire was conducted with faculty members of the Department of Family and Preventative Medicine at U of U Health to clarify and refine its contents and usability. The questionnaire was composed in English and translated into Spanish by a certified interpreter and native speaker. It consisted of 54 PCC-related symptoms grouped into seven categories understandable to the public (Supplementary Table  1 ). Patients were asked to select symptoms they had experienced in the prior week and to rate the severity of those symptoms on a 3-point scale (mild, moderate, severe). Participants were not offered compensation. Data collection for both the English and Spanish versions of the questionnaire ran from 08/31/2021 to 11/15/2021.

Participants

All clinics utilize a shared EHR system. Data from EHRs were stored in the University’s Enterprise Data Warehouse (EDW). We identified 126,440 primary care patients in the EDW for possible inclusion in the study. Of those, 124,606 were not hospitalized for COVID-19. We excluded patients from the non-hospitalized cohort with a preferred language other than English or Spanish ( n  = 4084). The remaining patients were split into patients who preferred English ( n  = 114,588) and patients who preferred Spanish ( n  = 5934). After excluding patients in both language groups with no COVID-19 test in the EHR, patients were further subdivided into English-preferred patients with a negative COVID-19 test ( n  = 46,065), English-preferred patients with a positive COVID-19 test ( n  = 7356), Spanish-preferred patients with a negative COVID-19 test ( n  = 1700), and Spanish-preferred patients with a positive COVID-19 test ( n  = 905). Participants had been tested for COVID-19 because they were experiencing symptoms at the time of testing. From the English-preferred cohort, 7239 patients with a positive test and 12,429 patients with a negative test were randomly selected to receive an invitation to complete the questionnaire (Supplementary Fig.  1 ). All patients in the Spanish-preferred cohort with test results received an invitation to complete the questionnaire. In total, 22,248 questionnaires were successfully delivered to patients by email sent through REDCap. 19,321 patients did not respond to the questionnaire, and 2927 responses were submitted (Fig.  1 ). Duplicate responses ( n  = 3) and unfinished questionnaires ( n  = 385) were excluded from the final analysis. Finally, 2539 participants (1410 COVID-19 positive, 1129 COVID-19 negative) with verified, complete survey responses remained for the analysis of common post-COVID-19 symptoms.

figure 1

Flowchart of selecting patients with and without COVID-19 for analysis. Out of 2927 patients who responded, 2539 were included in the analysis; of those, 1410 had a positive COVID-19 test and 1129 had a negative COVID-19 test. Out of the 388 excluded patients, 282 were excluded for not answering any symptom-related question, 79 for a previous hospitalization, and 27 for a missing or invalid COVID-19 test date.

Common post-COVID-19 symptoms were reported by both COVID-19-positive and -negative patients in their questionnaires, classified into 7 categories: (1) general symptoms (fatigue/tiredness, muscle & body aches, joint pains, shortness of breath, cough); (2) brain & nervous system (headaches, concentration problems, memory problems, general weakness, dizziness, balance problems); (3) mental well-being (difficulty sleeping, anxiety, depression); (4) ears, nose, and throat (congested nose, ringing in ears); (5) heart or circulation (irregular heartbeats, leg pain when walking); (6) eyes or vision (dry eyes); (7) stomach or digestion (heartburn).

The COVID-19 test result (positive, negative) was documented in the EDW. In addition, participants could self-report a positive COVID-19 test result obtained outside the University of Utah Health system; such self-reports were included when defining the COVID-19 test result.

Information on the following secondary demographic and clinical characteristics was obtained from the EDW. Sex was defined as a biological variable to describe biological differences and influences when comparing males and females. Thus, sex was coded binary as male or female as opposed to gender, which could exist on a spectrum and is more often used when describing social or psychological differences between men, women, and other genders 27 . Other covariates included age (18–34 years, 35–49 years, 50 years and above), race (American Indian/Alaskan Native, Asian, Black/African American, Native Hawaiian/Other Pacific Islander, White, Other, Unknown), ethnicity (Hispanic/Latino or non-Hispanic/Latino), BMI (<18.50: underweight, 18.50–24.99: normal weight, 25.00–39.99: overweight, 40.00+: obese), smoking status (never smoked, quit smoking, currently smoking), COVID-19 vaccine status (none, any dose), time between testing and questionnaire receipt (3–9 months, 10–12 months, more than 12 months), and Charlson Comorbidity Index (Scored 0–15).
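For illustration, the categorical coding described above can be sketched as simple binning functions. This is a hypothetical sketch, not the authors' code: the cut-points mirror the text, and the function names are invented for this example.

```python
# Hypothetical sketch of the covariate coding described above.
# Cut-points follow the text; note that the BMI bands are as stated
# there (25.00-39.99 "overweight", 40.00+ "obese"), not the WHO convention.

def age_group(age: int) -> str:
    """Bin age into the study's three categories (adults 18+)."""
    if age < 18:
        raise ValueError("study included adults 18+ only")
    if age <= 34:
        return "18-34"
    if age <= 49:
        return "35-49"
    return "50+"

def bmi_group(bmi: float) -> str:
    """Bin BMI using the cut-points listed in the text."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal weight"
    if bmi < 40.0:
        return "overweight"
    return "obese"
```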

Given the sample of 2539 participants (1410 COVID-19 positive, 1129 COVID-19 negative) who reported common post-COVID-19 symptoms, power analysis was conducted to calculate power for the hypothesis test for a significant difference in the prevalence of PCC between COVID-19 positive and negative patients. Using a two-sided Chi-square test at a significance level of .05, we would have 93–99% power to detect a small effect size of 0.15–0.2.
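The reported power can be approximated with a standard normal-approximation calculation for a two-sided, two-sample comparison. The sketch below is not the authors' actual calculation (their software and exact assumptions are not given); it uses only the Python standard library and the study's group sizes.

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_sample_power(effect_size: float, n1: int, n2: int) -> float:
    """Approximate power of a two-sided, two-sample z test (a normal
    approximation to the chi-square test) for a Cohen-style effect size
    at alpha = .05."""
    z_crit = 1.959964  # two-sided critical value at alpha = .05
    delta = effect_size * sqrt(n1 * n2 / (n1 + n2))
    return normal_cdf(delta - z_crit) + normal_cdf(-delta - z_crit)

power_small = two_sample_power(0.15, 1410, 1129)   # roughly 0.96
power_larger = two_sample_power(0.20, 1410, 1129)  # roughly 0.999
```

With the study's group sizes, effect sizes of 0.15–0.2 give power in the mid- to high-90s percent range, consistent with the 93–99% reported.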

Statistical analysis

Patients were categorized as having positive or negative COVID-19 tests. For the final analysis, symptom scores (0–3) were dichotomized into symptom not reported (0) and symptom reported (1–3). Descriptive statistics, including frequencies and percentages, were calculated for all categorical variables (demographic and clinical characteristics, PCC-related symptoms, COVID-19 test result, Charlson comorbidity index (CCI), and body mass index). The mean and standard deviation were calculated for continuous variables (age and CCI). The chi-square test was conducted to assess the associations of the COVID-19 test result with other variables (Table  1 ).
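As an illustration of the chi-square test described above, the sketch below computes the Pearson statistic for a single 2x2 table using only the standard library. The counts are approximate, reconstructed from the reported fatigue prevalences (59.5% of 1410 COVID-19-positive and 41.3% of 1129 COVID-19-negative patients), not taken from the study data.

```python
def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for observed, row, col in ((a, row1, col1), (b, row1, col2),
                               (c, row2, col1), (d, row2, col2)):
        expected = row * col / n
        stat += (observed - expected) ** 2 / expected
    return stat

# Fatigue reported / not reported, by COVID-19 test result (approximate counts).
stat = chi2_2x2(839, 571, 466, 663)
significant = stat > 3.841  # chi-square critical value for df = 1, alpha = .05
```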

Logistic regression was conducted to assess the association of PCC with COVID-19, after adjustment for covariates (age, sex, race, ethnicity, time between COVID-19 test and questionnaire receipt, smoking, COVID-19 vaccine status, and Charlson Comorbidity Index) (Table  2 ). Cross-validation was used to prevent overfitting, improve model performance, and address the potential limitation of self-selection bias. Model selection criteria and a comprehensive directed acyclic graph (DAG) of confounding variables were also considered in creating the logistic regression model. These analyses were then stratified by several clinicodemographic moderators, including age (Table  3 ), sex (Table  4 ), race (Table  5 ), ethnicity (Table  6 ), and time since COVID-19 test (Table  7 ) to assess the moderating effects of these variables on the association of PCC and COVID-19. The Tukey multiple comparison test was used to compare the prevalence between the two groups within each level of these moderators. The reported results included unadjusted and adjusted p values, odds ratios (OR), and associated 95% confidence intervals. The significance level was set to .05. All statistical tests were 2-sided. Missing values were removed using listwise deletion. All of the above analyses were conducted in RStudio.
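The odds ratios reported in the paper come from the adjusted logistic regression. As a simplified illustration of where such numbers come from, the sketch below computes the unadjusted odds ratio and Woolf 95% confidence interval from a 2x2 table. The counts are approximate reconstructions from the reported fatigue prevalences, and this unadjusted estimate lands close to the adjusted OR of 2.15 [1.79–2.60] reported for fatigue.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a: int, b: int, c: int, d: int):
    """Unadjusted odds ratio and Woolf 95% CI for the 2x2 table
    [[a, b], [c, d]]: exposed with/without outcome on the first row,
    unexposed with/without outcome on the second."""
    z = 1.959964  # two-sided normal critical value at alpha = .05
    or_ = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(or_) - z * se_log_or)
    upper = exp(log(or_) + z * se_log_or)
    return or_, lower, upper

# Approximate fatigue counts: 59.5% of 1410 positive, 41.3% of 1129 negative.
or_, lower, upper = odds_ratio_ci(839, 571, 466, 663)  # OR roughly 2.1
```

The adjusted model additionally conditions on the covariates listed above, so its estimates differ somewhat from this raw 2x2 calculation.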

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Clinical and demographic characteristics

A total of 22,248 patients received the questionnaire, 2927 patients responded (13.2% response rate), and 2539 patients reported symptoms (Supplementary Fig.  1 ); 1410 with positive COVID-19 tests (55.5%) and 1129 with negative COVID-19 tests (44.5%). The mean age was 44.4 years, 63.3% of participants were female, 20.6% were Hispanic/Latino ( n  = 523), and 18.9% were non-white/non-Caucasian (Table  1 ). In contrast, non-responders were younger (43.0 years; P  < 0.001), less often female (59.1%; P  < 0.001), and more often non-white (37.7%; P  < 0.001) (Supplementary Table  2 ). Compared to COVID-19-negative patients, COVID-19-positive patients were younger (42.6 years vs. 46.7 years; P  < 0.001) and more likely to be male (37% vs. 35.7%; P  = 0.514), non-Hispanic/Latino (80.5% vs. 71.8%; P  < 0.001), and unvaccinated (11.8% vs. 5.1%; P  < 0.001) (Table  1 ).

Common post COVID-19 conditions

There were 20 common symptoms reported by both COVID-19-positive and -negative patients. Among them, the following symptoms had relatively higher prevalence than the others: fatigue/tiredness (51.4%), anxiety (48.3%), difficulty sleeping (47.6%), headaches (44.1%), and concentration problems (40.8%). Balance problems (18.9%) and leg pain when walking (18.8%) had the lowest prevalence. All symptoms were more common among patients with positive COVID-19 tests. For example, the prevalence of common symptoms and odds ratios (OR) comparing COVID-19-positive and -negative patients were, respectively, concentration problems (50.6% vs. 28.5%; 2.64 [2.17–3.22]), memory problems (39.4% vs. 19.2%; 2.65 [2.15–3.28]), fatigue/tiredness (59.5% vs. 41.3%; 2.15 [1.79–2.60]), headaches (49.8% vs. 37%; 1.60 [1.32–1.94]), difficulty sleeping (52.1% vs. 41.9%; 1.42 [1.18–1.71]), and anxiety (51.1% vs. 44.7%; 1.18 [0.98–1.42]) (Table  2 , Fig.  2 ).

figure 2

Frequently reported Post-COVID-19 symptoms comparing patients with and without COVID-19 ( N  = 2539). Figure 2 shows the top 20 most frequently experienced symptoms across the severity scale (mild, moderate, and severe). Bars in blue represent patients with a positive COVID-19 test and bars in red represent patients with a negative COVID-19 test. Adjusted odds ratios with [95% Confidence Interval] shown at the end of the bars suggest relationships between COVID-19 test result and reported symptom (see Table  2 for significance p  < 0.05).

These odds ratios are also presented graphically in Supplementary Fig.  2 , in which differences between the COVID-19-positive and COVID-19-negative groups are shown more clearly. For example, group differences for shortness of breath, concentration problems, and memory problems were similar and larger than group differences for other symptoms. Group differences for mental well-being symptoms (difficulty sleeping, anxiety, depression), dry eyes, and heartburn (reflux) were similar and smaller than group differences for other symptoms. Group differences for muscle & body aches, joint pains, general weakness, dizziness, balance problems, irregular heartbeats, and leg pain when walking were also similar.

Figure  3 (Supplementary Table  3 ) shows descending absolute differences in the prevalence of the top 20 most frequently experienced symptoms between COVID-19-positive and COVID-19-negative patients. Differences for concentration problems, memory problems, fatigue, and shortness of breath were larger than those for other symptoms, and COVID-19-positive patients were more than twice as likely to experience these symptoms as COVID-19-negative patients. Anxiety was the only symptom not significantly associated with a positive COVID-19 test result.

figure 3

Comparing Post-COVID-19 symptoms with the largest differences between patients with and without COVID-19 ( N  = 2539). Figure 3 shows the top 20 statistically significant largest differences in experienced symptoms across the severity scale (mild, moderate, and severe), in descending order by highest to lowest odds ratios. Bars in blue represent patients with a positive COVID-19 test and bars in red represent patients with a negative COVID-19 test. Adjusted odds ratios with [95% Confidence Interval] shown at the end of the bars suggest relationships between COVID-19 test results and reported symptoms.

Post COVID-19 conditions by age

Table  3 (Supplementary Data  1 ) compares the prevalence of the twenty most common symptoms across three age groups: 18–34 ( n  = 810), 35–49 ( n  = 797), and ≥50 ( n  = 932). In all three age groups, the most common symptoms reported by COVID-19-positive patients included fatigue (59.7% vs. 63.3% vs. 54.4%), concentration problems (51.8% vs. 56.1% vs. 43.1%), and difficulty sleeping (54.5% vs. 54.8% vs. 45.8%). COVID-19-positive patients aged 35–49 generally had significantly higher prevalence and ORs of these common symptoms than their counterparts in the other two age groups.

Post-COVID-19 conditions by sex

Regarding sex, the most common symptoms reported by both female ( n  = 1615) and male ( n  = 924) COVID-19-positive patients included fatigue/tiredness (65.5% vs. 48.7%), headaches (57.7% vs. 36%), concentration problems (56.2% vs. 40.4%), memory problems (43.1% vs. 32.5%), difficulty sleeping (55.5% vs. 45.4%), and anxiety (57.7% vs. 39.3%). In both groups, COVID-19-positive patients had a higher prevalence of each symptom than their COVID-19-negative counterparts. Notably, female COVID-19-positive patients had a higher prevalence of each symptom than their male counterparts. In addition, the OR of each symptom was higher for females than for males, except for concentration problems (2.43 [1.94–3.05] vs. 3.08 [2.20–4.37]), difficulty sleeping (1.26 [1.01–1.57] vs. 1.62 [1.20–2.20]), and dry eyes (1.31 [1.03–1.66] vs. 1.65 [1.13–2.41]). Moreover, eight symptoms (fatigue/tiredness, joint pains, shortness of breath, concentration problems, memory problems, general weakness, irregular heartbeats, leg pain when walking) were more than twice as prevalent in female COVID-19-positive patients as in their COVID-19-negative counterparts. In contrast, four conditions (fatigue/tiredness, shortness of breath, concentration problems, memory problems) were more than twice as prevalent in male COVID-19-positive patients as in their COVID-19-negative counterparts (Table  4 ).

Post COVID-19 conditions by race

In both groups (White and non-White), COVID-19-positive patients had a higher prevalence of each symptom. White COVID-19-positive patients had lower ORs of these common conditions than their non-White counterparts: fatigue (1.96 [1.61–2.38] vs. 3.12 [2.05–4.80]), headaches (1.49 [1.21–1.82] vs. 1.97 [1.29–3.05]), concentration problems (2.49 [2.03–3.07] vs. 3.13 [2.02–4.90]), difficulty sleeping (1.29 [1.06–1.57] vs. 1.74 [1.15–2.66]), and anxiety (1.11 [0.90–1.36] vs. 1.43 [0.92–2.21]). Moreover, among non-White patients, the ORs were greater than two for fourteen conditions (fatigue/tiredness, muscle & body aches, joint pains, shortness of breath, cough, concentration problems, memory problems, general weakness, dizziness, balance problems, irregular heartbeats, leg pain when walking, dry eyes, heartburn) as compared to three conditions (shortness of breath, concentration problems, memory problems) among White patients (Table  5 ).

Post COVID-19 conditions by ethnicity

In each group (Hispanic/Latino and non-Hispanic/Latino), COVID-19-positive patients had a higher prevalence of each symptom. Hispanic COVID-19-positive patients had higher ORs of these common conditions than their non-Hispanic counterparts: fatigue/tiredness (2.28 [1.53–3.40] vs. 2.12 [1.73–2.60]), headaches (1.80 [1.20–2.72] vs. 1.48 [1.20–1.83]), concentration problems (2.78 [1.84–4.23] vs. 2.54 [2.05–3.16]), and difficulty sleeping (1.61 [1.08–2.41] vs. 1.32 [1.08–1.62]). Moreover, in the Hispanic group the ORs were greater than two for twelve conditions (fatigue/tiredness, muscle & body aches, joint pains, shortness of breath, concentration problems, memory problems, general weakness, dizziness, balance problems, irregular heartbeats, leg pain when walking, heartburn) as compared to six (fatigue/tiredness, shortness of breath, concentration problems, memory problems, general weakness, leg pain when walking) in the non-Hispanic group (Table  6 ).

Post-COVID-19 conditions by time since COVID-19 test

Patients were split into three groups based on when they were tested for COVID-19: 3–9 months ( n  = 765), 10–12 months ( n  = 1073), and more than 12 months ( n  = 701). In each group, COVID-19-positive patients had a higher prevalence of each symptom. The group at 3–9 months had the highest ORs of these common conditions: fatigue/tiredness (2.29 [1.64–3.21]), headaches (1.74 [1.24–2.44]), concentration problems (2.70 [1.92–3.83]), difficulty sleeping (1.67 [1.20–2.32]), and anxiety (1.55 [1.09–2.20]) (Table  7 ; Supplementary Data  2 ).

Supplementary Fig.  3 shows the absolute differences in prevalence of the top 10 most frequently experienced symptoms across the severity scale (mild, moderate, and severe) between COVID-positive and COVID-negative patients at three time points, along with the adjusted odds ratios and 95% Confidence Intervals, shown at the end of the bars, comparing the odds of each reported symptom between the two groups. Symptoms are presented in descending order of prevalence among COVID-positive patients. Absolute prevalence differences were similar for fatigue (20.3%), concentration problems (20.1%), and memory problems (20.7%), and larger than the absolute prevalence differences for other symptoms. Compared to COVID-negative patients, COVID-positive patients had a higher prevalence and odds of each symptom at each time point. In addition, the prevalence of each symptom (except for concentration problems) decreased over time. Depression and congested nose were not significantly associated with COVID test results past 9 months.

Discussion

This study aimed to analyze the prevalence of PCC in non-hospitalized COVID-19 patients in primary care and to compare the prevalence of PCC symptoms between patients with and without COVID-19. Our analysis revealed three major findings. First, post-COVID-19 conditions are highly prevalent in this primary care population, independent of COVID-19. In particular, conditions affecting the brain and nervous system (e.g., concentration problems, headaches), mental health (e.g., anxiety, difficulty sleeping), and general well-being (e.g., fatigue/tiredness, shortness of breath) are common. Second, the burden of PCC is much higher among patients with COVID-19 than among patients without COVID-19. Twenty common post-COVID-19 symptoms were prevalent in both COVID-19-positive and -negative patients, and all except anxiety were significantly more prevalent in patients with COVID-19. Third, relative to their COVID-19-negative counterparts, PCC was most prevalent in COVID-19-positive patients who were 35–49 years old, 3–9 months from their testing date, female, or from racial/ethnic minority groups.

In more detail, the prevalence of PCC varied with respect to age, sex, time since COVID-19 test, race, and ethnicity. Concerning age, patients aged 35–49 years had a higher burden of PCC compared to younger and older patients. In general, the evidence on such disparities is inconclusive: some studies in hospitalized patients reported older age as a risk factor for developing PCC-related symptoms, while others point out that patients of all ages suffer long-lasting problems 28 , 29 , 30 , 31 .

With respect to sex, female COVID-19-positive patients had a higher prevalence of each symptom than their male counterparts. Eight symptoms were more than twice as prevalent in female COVID-19-positive patients compared to their COVID-19-negative counterparts. By contrast, four symptoms were more than twice as prevalent in male COVID-19-positive patients compared to their COVID-19-negative counterparts. Our findings are in line with previous research in which a higher proportion of female patients reported PCC-related symptoms than male patients 32 , 33 . However, those previous studies were much smaller, conducted outside the United States, and not exclusively focused on non-hospitalized patients in primary care. Future research should consider the role of immune response, hormonal factors, and social or environmental factors in how PCC manifests between sexes.

With respect to race and ethnicity, COVID-19-positive patients had a higher prevalence of each symptom than COVID-19-negative patients among both White and non-White patients, and among both Hispanic/Latino and non-Hispanic/Latino patients. Twelve symptoms were more than twice as prevalent in Hispanic/Latino COVID-19-positive patients than in their COVID-19-negative counterparts; in contrast, six symptoms were more than twice as prevalent in non-Hispanic/Latino COVID-19-positive patients compared with their COVID-19-negative counterparts. Similar findings were observed in non-White and White patients, with more symptoms prevalent in the former group. Throughout the pandemic, disparities among non-White and Hispanic/Latino patients regarding exposure to SARS-CoV-2 and access to healthcare and social services were exacerbated 34,35,36. Equitable access to quality primary healthcare services is critical; minority racial and ethnic groups generally have less access to care and were more likely to be exposed to COVID-19 (especially at the beginning of the pandemic) than White patients 37,38.

Concerning time since COVID-19 testing, COVID-19-positive patients 3-9 months out from their testing date had a higher prevalence of symptoms than patients with negative COVID-19 test results. Nine symptoms were more than twice as prevalent in the positive 3-9 month group, compared with seven in the positive 10-12 month group and five in the positive 12+ month group. Other studies of PCC have reported a reduction in the severity of some symptoms over time, while noting that neurological symptoms tend to persist 4,39. In the present study, concentration problems and memory problems remained more than twice as likely to impact COVID-19-positive patients from 3 months after infection onwards.
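The stratified comparisons above reduce to a simple prevalence ratio: a symptom is "more than twice as prevalent" when the ratio of its prevalence in COVID-19-positive versus COVID-19-negative patients exceeds 2. A minimal sketch of that arithmetic follows; the function names and counts are illustrative assumptions, not study data:

```python
def prevalence(n_with_symptom: int, n_total: int) -> float:
    """Proportion of patients in a group who report a given symptom."""
    return n_with_symptom / n_total

def prevalence_ratio(pos_with: int, pos_total: int,
                     neg_with: int, neg_total: int) -> float:
    """Symptom prevalence in COVID-19-positive patients divided by the
    prevalence in COVID-19-negative patients; a value above 2 corresponds
    to 'more than twice as prevalent'."""
    return prevalence(pos_with, pos_total) / prevalence(neg_with, neg_total)

# Hypothetical counts for illustration only (not figures from this study):
# 300 of 600 positive patients and 100 of 500 negative patients report fatigue.
ratio = prevalence_ratio(300, 600, 100, 500)
print(round(ratio, 2))  # prints 2.5
```

A significance test (e.g., a chi-squared test on the underlying 2x2 counts) would then be needed before calling such a difference statistically significant, as the study does.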

More generally, most research on PCC has been conducted among hospitalized patients; research on PCC among non-hospitalized patients in the United States primary care setting is only now emerging. As also shown in studies with hospitalized COVID-19 patients 28,40,41,42,43,44,45, the most prevalent symptoms among this non-hospitalized, primary care population were fatigue/tiredness, difficulty sleeping, and anxiety, followed closely by headaches and concentration problems. These symptoms were prevalent in at least half of COVID-19-positive patients. Existing studies with non-hospitalized patients were largely based on recruitment from social media groups 4,16,21, had smaller sample sizes 7,19,22,46,47,48, or lacked a COVID-19-negative comparison group 19,48,49. Some research calls for the management of PCC in the primary care setting or describes the potential burden on healthcare systems 5,7, yet studies quantifying the burden of PCC symptoms in primary care have not been widely conducted. One study from a UK-based primary care database retrospectively analyzed PCC symptoms in a population-based cohort and shares some findings with the present study: PCC in non-hospitalized cohorts consists of heterogeneous symptoms across the body, including fatigue 18. Here, through our cross-sectional use of electronic health records and questionnaires, we show that many of the symptoms of long COVID are prevalent in primary care, independent of COVID-19 50.

However, our study demonstrates that the great challenge facing primary care providers is to differentiate PCC from the acute sequelae of COVID-19, previous comorbidities, preexisting conditions, and complications from prolonged illness, hospitalization, or isolation 51,52. Evidence incorporating the perspective of primary care physicians on how best to manage PCC has also been mounting and emphasizes the need for communication and trust between patient and provider to support management 53.

Based on the current evidence, disease management of PCC requires holistic, longitudinal follow-up, multidisciplinary rehabilitation services (e.g., family medicine, pulmonary, infectious disease, neurology), and the empowerment of affected patient groups 54. Emotional support, ongoing monitoring, symptomatic treatment, and attention to comorbidities are cornerstones of this approach 55. Primary care, and more specifically family medicine, with its person-focused, comprehensive, and coordinated care, is theoretically very well prepared to address those requirements 56,57,58,59,60. Primary care clinicians know their patients, their lives, and their families, and are in an optimal position to coordinate and personalize treatment as well as the support needed. However, a comprehensive training program, including care pathways, guidance, and criteria for which patients should be referred, is necessary to support a primary care-led post-COVID-19 response 61.

To the best of our knowledge, this is one of only a few studies analyzing the prevalence of PCC in non-hospitalized COVID-19 primary care patients compared with primary care patients not diagnosed with COVID-19. Additionally, we included Spanish-language questionnaires to encourage inclusivity for a wider variety of patients. Previous studies on PCC that include non-hospitalized patients experiencing symptoms for more than a year are uncommon in the literature, and methodological challenges were common in prior work, including small study populations and exclusion of patients with negative test results.

One limitation of our study is potential self-selection bias, as patients with symptoms described as part of PCC are more likely to participate, evidenced by high rates of symptoms among patients with negative COVID-19 test results. Higher rates of symptoms in the 3-9 months since COVID-19 test group could also be attributed to self-selection bias, since patients with perceived persistent symptoms may have been more inclined to complete the questionnaire. Additionally, clinical data that would shed more light on the pathological mechanisms of PCC (e.g., chest X-rays, chest computed tomography, inflammatory markers) were not included in this analysis. Finally, our questionnaire did not include questions about taste or smell disturbances, which are frequently cited in the literature as common PCC symptoms 62,63, nor questions regarding the impact of symptoms on daily activities. Future research should include controls for symptoms that may have developed before the COVID-19 pandemic, include more clinical data, and develop methods for including older adults in outpatient studies.

This study demonstrates that PCC is highly prevalent in non-hospitalized COVID-19 patients in primary care. However, it is important to note that PCC strongly overlaps with common health conditions seen in primary care, including fatigue, difficulty sleeping, and headaches. This makes the diagnosis of PCC in primary care even more challenging. There is an urgent need to strengthen the diagnosis and treatment of PCC in primary care.

Competing interests

The authors declare no competing interests.

Data availability

Based on the IRB protocol for data monitoring for this study, sharing data in an open-access repository does not meet the encryption standards for data sharing outside the University of Utah. Data will be made available to interested parties through an individual data transfer agreement upon request to the corresponding author. After the request, we will coordinate with the Technology Licensing Office to develop a data transfer agreement and ensure the transfer meets IRB protocol. The source data underlying Figs. 2 and 3 can be found in Table 3 and Supplementary Table 3, respectively.

Adeloye, D. et al. The long-term sequelae of COVID-19: an international consensus on research priorities for patients with pre-existing and new-onset airways disease. Lancet Respir. Med. 9 , 1467–1478 (2021).


Shah, W., Hillman, T., Playford, E. D. & Hishmeh, L. Managing the long term effects of covid-19: summary of NICE, SIGN, and RCGP rapid guideline. BMJ 372 , n136 (2021).


US Centers for Disease Control and Prevention. Long COVID or post-COVID conditions. COVID-19 https://www.cdc.gov/coronavirus/2019-ncov/long-term-effects/index.html (2023).

Davis, H. E. et al. Characterizing long COVID in an international cohort: 7 months of symptoms and their impact. EClinicalMedicine 38 , 101019 (2021).


Sudre, C. H. et al. Attributes and predictors of long COVID. Nat. Med. 27 , 626–631 (2021).

Huang, Y. et al. COVID Symptoms, symptom clusters, and predictors for becoming a long-hauler: looking for clarity in the haze of the pandemic. MedRxiv https://doi.org/10.1101/2021.03.03.21252086 (2021).

Desgranges F. et al. Post‑COVID‑19 syndrome in outpatients: a cohort study. J. Gen. Intern. Med . 37 , 1943–1952 (2022).

Nalbandian, A. et al. Post-acute COVID-19 syndrome. Nat. Med. 27 , 601–615 (2021).

Lopez-Leon, S. et al. More than 50 long-term effects of COVID-19: a systematic review and meta-analysis. Sci. Rep. 11 , 16144 (2021).


Pérez-González, A., Araújo-Ameijeiras, A., Fernández-Villar, A., Crespo, M. & Poveda, E. Long COVID in hospitalized and non-hospitalized patients in a large cohort in Northwest Spain, a prospective cohort study. Sci. Rep. 12 , 3369 (2022).


Townsend, L. et al. Persistent poor health after COVID-19 is not associated with respiratory complications or initial disease severity. Ann. Am. Thorac. Soc. 18 , 997–1003 (2021).

Becker, J. H. et al. Assessment of Cognitive Function in Patients After COVID-19 Infection. JAMA Netw. Open 4 , e2130645 (2021).

Carfì, A., Bernabei, R. & Landi, F. for the Gemelli against COVID-19 Post-Acute Care Study Group. Persistent symptoms in patients after acute COVID-19. JAMA. 324 , 603 (2020).

Hastie, C. E. et al. Outcomes among confirmed cases and a matched comparison group in the Long-COVID in Scotland study. Nat. Commun. 13 , 5663 (2022).

Ford N. D. Long COVID and significant activity limitation among adults, by age—United States, June 1–13, 2022, to June 7–19, 2023. MMWR Morb. Mortal Wkly. Rep . 72 , 866–70 (2023).

Goërtz, Y. M. J. et al. Persistent symptoms 3 months after a SARS-CoV-2 infection: the post-COVID-19 syndrome? ERJ Open Res. 6 , 00542–02020 (2020).

Vance, H. et al. Addressing post-COVID symptoms: a guide for primary care physicians. J. Am. Board Fam. Med. 34 , 1229–1242 (2021).

Subramanian, A. et al. Symptoms and risk factors for long COVID in non-hospitalized adults. Nat. Med. 28 , 1706–1714 (2022).

Fernández-de-las-Peñas, C. et al. Post–COVID-19 symptoms 2 years after SARS-CoV-2 infection among hospitalized vs nonhospitalized patients. JAMA Netw. Open 5 , e2242106 (2022).

Malik, P. et al. Post-acute COVID-19 syndrome (PCS) and health-related quality of life (HRQoL)-A systematic review and meta-analysis. J. Med. Virol. 94 , 253–262 (2022).


Vaes, A. W. et al. Care dependency in non-hospitalized patients with COVID-19. J. Clin. Med. 9 , 2946 (2020).

Dennis, A. et al. Multiorgan impairment in low-risk individuals with post-COVID-19 syndrome: a prospective, community-based study. BMJ OPEN . 11 , e048391 (2021).

Huston, P. et al. COVID-19 and primary care in six countries. BJGP Open 4 , bjgpopen20X101128 (2020).

Mughal, F., Khunti, K. & Mallen, C. The impact of COVID-19 on primary care: Insights from the National Health Service (NHS) and future recommendations. J. Fam. Med. Prim. Care 10 , 4345 (2021).


Starfield, B. Is primary care essential? Lancet Lond. Engl. 344 , 1129–1133 (1994).


Finley, C. R. et al. What are the most common conditions in primary care? Can. Fam. Physician 64 , 832–840 (2018).


Sodhi, A., Pisani, M., Glassberg, M. K., Bourjeily, G. & D’Ambrosio, C. Sex and gender in lung disease and sleep disorders: a state-of-the-art review. Chest 162 , 647–658 (2022).

Schmidt, K. et al. Management of COVID-19 ICU-survivors in primary care—a narrative review. BMC Fam Pract . 22 (2021).

Yan, Z., Yang, M. & Lai, C. L. Long COVID-19 syndrome: a comprehensive review of its effect on various organ systems and recommendation on rehabilitation plans. Biomedicines 9 , 966 (2021).

Petrella, C. et al. Serum NGF and BDNF in long-COVID-19 adolescents: a pilot study. Diagnostics 12 , 1162 (2022).

Cabrera Martimbianco, A. L., Pacheco, R. L., Bagattini, A. M. & Riera, R. Frequency, signs and symptoms, and criteria adopted for long COVID-19: a systematic review. Int. J. Clin. Pract . 75 , e14357 (2021).

Anjana, N. K. N. et al. Manifestations and risk factors of post COVID syndrome among COVID-19 patients presented with minimal symptoms—a study from Kerala, India. J. Fam. Med. Prim Care 10 , 4023–4029 (2021).

Bai, F. et al. Female gender is associated with long COVID syndrome: a prospective cohort study. Clin. Microbiol. Infect . 28 , 611.e9–611.e16 (2022).

Wiley, Z. et al. Age, comorbid conditions, and racial disparities in COVID-19 outcomes. J. Racial Ethn Health Disparities 9 , 117–123 (2022).

van Dorn, A., Cooney, R. E. & Sabin, M. L. COVID-19 exacerbating inequalities in the US. Lancet Lond. Engl. 395, 1243–1244 (2020).

Kaiser Family Foundation. COVID-19 cases and deaths by race/ethnicity: current data and changes over time. KFF (2022). https://www.kff.org/coronavirus-covid-19/issue-brief/covid-19-cases-and-deaths-by-race-ethnicity-current-data-and-changes-over-time/ .

Wielen, L. M. V. et al. Not near enough: racial and ethnic disparities in access to nearby behavioral health care and primary care. J. Health Care Poor Underserved 26 , 1032–1047 (2015).

Dickinson, K. L. et al. Structural Racism and the COVID-19 Experience in the United States. Health Secur. 19 , S–14 (2021).

Jason, L. A. et al. COVID-19 symptoms over time: comparing long-haulers to ME/CFS. Fatigue Biomed. Health Behav. 9 , 59–68 (2021).

Huang, L. et al. 1-year outcomes in hospital survivors with COVID-19: a longitudinal cohort study. Lancet 398 , 747–758 (2021).

Nasserie, T., Hittle, M. & Goodman, S. N. Assessment of the frequency and variety of persistent symptoms among patients with COVID-19: a systematic review. JAMA Netw. Open 4 , e2111417 (2021).

Bungenberg, J. et al. Long COVID-19: objectifying most self-reported neurological symptoms. Ann. Clin. Transl. Neurol. 9 , 141–154 (2022).

Uygur, O. F. & Uygur, H. Association of post-COVID-19 fatigue with mental health problems and sociodemographic risk factors. Fatigue Biomed. Health Behav. 9 , 196–208 (2021).

Renaud-Charest, O. et al. Onset and frequency of depression in post-COVID-19 syndrome: a systematic review. J. Psychiatr. Res. 144 , 129–137 (2021).

Visco, V. et al. Post-COVID-19 syndrome: involvement and interactions between respiratory, cardiovascular and nervous systems. J. Clin. Med. 11 , 524 (2022).

Oliveira, A. M. et al. Long COVID symptoms in non-hospitalised patients: a retrospective study. Acta Médica Port 36 , 618–630 (2023).

Krysa, J. A. et al. Accessing care services for long COVID sufferers in Alberta, Canada: a random, cross-sectional survey study. Int. J. Environ. Res. Public Health 20 , 6457 (2023).

Rach, S. et al. Mild COVID-19 infection associated with post-COVID-19 condition after 3 months—a questionnaire survey. Ann. Med. 55 , 2226907 (2023).

Kirchberger, I. et al. Post-COVID-19 Syndrome in Non-Hospitalized Individuals: Healthcare Situation 2 Years after SARS-CoV-2 Infection. Viruses 15 , 1326 (2023).

Greenhalgh, T., Sivan, M., Delaney, B., Evans, R. & Milne, R. Long covid—an update for primary care. BMJ . 378 , e072117. https://doi.org/10.1136/bmj-2022-072117 (2022).

Herman, E., Shih, E. & Cheng, A. Long COVID: rapid evidence review. Am. Fam. Physician 106 , 523–532 (2022).


Sanyaolu, A. et al. Post-acute sequelae in COVID-19 survivors: an overview. Sn Compr. Clin. Med. 4 , 91 (2022).

Rotar Pavlic, D. et al. Long COVID as a never-ending puzzle: the experience of primary care physicians. A qualitative interview study. BJGP Open . 7 , BJGPO.2023.0074 (2023).

Sisó-Almirall, A. et al. Long Covid-19: proposed primary care clinical guidelines for diagnosis and disease management. Int. J. Environ. Res. Public Health 18 , 4350 (2021).

Greenhalgh, T. & Knight, M. Long COVID: a primer for family physicians. Am. Fam. Physician 102 , 716–717 (2020).

Greenhalgh, T., Knight, M., A’Court, C., Buxton, M. & Husain, L. Management of post-acute covid-19 in primary care. BMJ . 370 , m3026 (2020).

Parkin, A. et al. A multidisciplinary NHS COVID-19 service to manage post-COVID-19 syndrome in the community. J Prim Care Community Health 12 , 21501327211010994 (2021).

Ladds, E. et al. Persistent symptoms after Covid-19: qualitative study of 114 “long Covid” patients and draft quality principles for services. Bmc Health Serv Res 20 , 1144 (2020).

Groff, D. et al. Short-term and Long-term Rates of Postacute Sequelae of SARS-CoV-2 Infection: a Systematic Review. JAMA Netw Open 4 , e2128568 (2021).

Mayo, N. L., Ellenbogen, R. L., Mendoza, M. D. & Russel, H. A. The family physician’s role in long COVID management. J. Fam. Pract . 71 , 426–31 (2022).

Berger, Z., Altiery DE Jesus, V., Assoumou, S. A. & Greenhalgh, T. Long COVID and health inequities: the role of primary care. Milbank Q 99 , 519–541 (2021).

Aiyegbusi, O. L. et al. Symptoms, complications and management of long COVID: a review. J. R. Soc. Med. 114 , 428–442 (2021).

Watson, D. L. B. et al. Altered smell and taste: Anosmia, parosmia and the impact of long Covid-19. PLoS ONE 16 , e0256998 (2021).


Acknowledgements

The research reported in this publication was supported in part by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR002538. This study was also supported by the Health Studies Fund, provided by the Department of Family and Preventive Medicine at the University of Utah. We would like to thank the participants of this study for taking the time to complete our questionnaire and share their experiences with our team. Additional thanks to external collaborators who offered their expertise on PCC clinical presentations, document translation, and questionnaire development. Finally, we would like to thank the colleagues who proofread this manuscript prior to submission.

Author information

These authors contributed equally: Dominik J. Ose, Elena Gardner.

Authors and Affiliations

University of Utah Health, School of Medicine, Department of Family and Preventive Medicine, Salt Lake City, UT, USA

Dominik J. Ose, Elena Gardner, Andrew Curtin, Jiqiang Wu, Camie Schaefer, Jing Wang, Jennifer Leiser, Kirsten Stoesser & Bernadette Kiraly

Westsächsische Hochschule - Zwickau, Department of Health and Healthcare Services, Zwickau, Germany

Dominik J. Ose

University of Utah Health, School of Medicine, Department of Internal Medicine, Salt Lake City, UT, USA

Morgan Millar

University of Utah Health, Data Science Services, Salt Lake City, UT, USA

Mingyuan Zhang


Contributions

B.K., K.S., J.L., and D.O. created the design and developed the first draft of the questionnaire. M.M. and C.S. supported the questionnaire development and delivery methodologically (e.g., design) and administratively (e.g., REDCap implementation). M.Z. provided data from the enterprise data warehouse (EDW). A.C., J. Wu, and J. Wang carried out the data analyses and supported B.K., K.S., J.L., E.G., and D.O. with data interpretation. D.O. and E.G. wrote the first draft of the manuscript with support from A.C. and J. Wang in the methods section. All authors have read the final manuscript and contributed extensively to this work.

Corresponding author

Correspondence to Bernadette Kiraly .

Ethics declarations

Peer review information

Communications Medicine thanks Sridhar Chilimuri, Adekunle Sanyaolu, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information, Supplementary Data 1, Supplementary Data 2, Peer Review File, Description of Additional Supplementary Data File, Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Ose, D.J., Gardner, E., Millar, M. et al. A cross-sectional and population-based study from primary care on post-COVID-19 conditions in non-hospitalized patients. Commun Med 4 , 24 (2024). https://doi.org/10.1038/s43856-024-00440-y


Received : 29 March 2023

Accepted : 22 January 2024

Published : 21 February 2024





Comparative effectiveness research for the clinician researcher: a framework for making a methodological design choice

Cylie M. Williams

1 Peninsula Health, Community Health, PO Box 52, Frankston, Melbourne, Victoria 3199 Australia

2 Monash University, School of Physiotherapy, Melbourne, Australia

3 Monash Health, Allied Health Research Unit, Melbourne, Australia

Elizabeth H. Skinner

4 Western Health, Allied Health, Melbourne, Australia

Alicia M. James

Jill L. Cook, Steven M. McPhail

5 Queensland University of Technology, School of Public Health and Social Work, Brisbane, Australia

Terry P. Haines

Comparative effectiveness research compares two active forms of treatment, or usual care with usual care plus an additional intervention element. Such studies are commonly conducted following a placebo or no-active-treatment trial. Research designs with a placebo or non-active treatment arm can be challenging for the clinician researcher when conducted within the healthcare environment with patients attending for treatment.

A framework for conducting comparative effectiveness research is needed, particularly for interventions for which there are no strong regulatory requirements that must be met prior to their introduction into usual care. We argue for a broader use of comparative effectiveness research to achieve translatable real-world clinical research. Such design choices also affect the speed with which evidence-based clinical practice is taken up within the healthcare setting.

This framework includes questions to guide the clinician researcher toward the most appropriate trial design to measure treatment effect. These questions include consideration of current treatment provision during usual care, known treatment effectiveness, side effects of treatments, economic impact, and the setting in which the research is being undertaken.

Comparative effectiveness research compares two active forms of treatment, or usual care with usual care plus an additional intervention element. Comparative effectiveness research differs from study designs that have an inactive control, such as a ‘no-intervention’ or placebo group. In pharmaceutical research, trial designs in which placebo drugs are tested against the trial medication are often labeled ‘Phase III’ trials. Phase III trials aim to produce high-quality evidence of intervention efficacy and are important to identify potential side effects and benefits. Health outcome research with this study design involves the placebo being non-treatment or a ‘sham’ treatment option [1].

Traditionally, comparative effectiveness research is conducted following completion of a Phase III placebo control trial [2–4]. Comparative effectiveness research might not determine whether one treatment has clinical benefit, because the comparator treatment might be harmful, irrelevant, or ineffective, unless the comparator has already demonstrated superiority to a placebo [2]. Moreover, comparing an active treatment with an inactive control is likely to produce larger effect sizes than a comparison of two active treatments [5], requiring smaller sample sizes and lower costs to establish or refute the effectiveness of a treatment. Historically, then, treatments have only become candidates for comparative effectiveness research to establish superiority after demonstrating efficacy against an inactive control.

Frequently, the provision of health interventions precedes development of the evidence base directly supporting their use [6]. Some service-provision contexts are highly regulated, and high standards of evidence are required before an intervention can be provided (such as pharmacological interventions and device use). However, this is not universally the case for all services that may be provided in healthcare interventions. Despite this, there may be an expectation from the individual patient and the public that individuals who present to a health service will receive some form of care deemed appropriate by treating clinicians, even in the absence of research-based evidence supporting this. This expectation may be amplified in publicly subsidized health services (as is largely the case in Canada, the UK, Australia, and many other developed nations) [7–9]. If a treatment is already widely employed by health professionals and is accepted by patients as a component of usual care, then it is important to consider the ethics and practicality of attempting a placebo or no-intervention control trial in this context. Here, comparative effectiveness research could provide valuable insights into treatment effectiveness, disease pathophysiology, and economic efficiency in service delivery, with greater research feasibility than the traditional paradigm just described. Further, some authors have argued that studies with inactive control groups are used when comparative effectiveness research designs would be more appropriate [10]. We propose and justify a framework that argues for the broader use of comparative effectiveness research to achieve more feasible and translatable real-world clinical research.

This debate is important for the research community, particularly those engaged in planning and executing research in clinical practice settings involving non-pharmacological, non-device interventions. The ethical, preferential, and pragmatic implications of active versus inactive comparator selection in clinical trials influence not only the range of theoretical conclusions that can be drawn from a study, but also the lived experiences of patients and their treating clinical teams. Comparator selection also has important implications for policy and practice when considering potential translation into clinical settings. It is these implications that shape the clinical researcher’s methodological design choice and its justification.

The decision-making framework takes the form of a decision tree (Fig.  1 ) to determine when a comparative effectiveness study can be justified and is particularly relevant to the provision of services that do not have a tight regulatory framework governing when an intervention can be used as part of usual care. This framework is headed by Level 1 questions (demarcated by a question within an oval), which feed into decision nodes (demarcated by rectangles), which end in decision points (demarcated by diamonds). Each question is discussed with clinical examples to illustrate relevant points.

Fig. 1 Comparative effectiveness research decision-making framework. Treatment A represents any treatment for a particular condition, which may or may not be a component of usual care to manage that condition. Treatment B represents our treatment of interest. Where the response is unknown, the user should choose the NO response.

Treatment A is any treatment for a particular condition that may or may not be a component of usual care to manage that condition. Treatment B is our treatment of interest. The framework results in three possible recommendations: that either (i) a study design comparing Treatment B with no active intervention could be used, or (ii) a study design comparing Treatment A, Treatment B and no active intervention should be used, or (iii) a comparative effectiveness study (Treatment A versus Treatment B) should be used.
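The framework's three exits can be sketched as a small decision function. This is our simplification for illustration only: the two boolean inputs and the returned labels are assumptions standing in for the full set of Level 1 and Level 2 questions, not the authors' published tool.

```python
def choose_design(usual_care_exists: bool, treatment_a_evidence: bool) -> str:
    """Pick a trial design per the decision tree; unknown answers are
    passed as False, matching the framework's 'choose NO' rule."""
    if not usual_care_exists:
        # Exit 1: no usual care treatment exists, so an inactive control
        # is practical and ethical.
        return "Treatment B vs. no active treatment"
    if not treatment_a_evidence:
        # Usual care exists but lacks placebo-controlled evidence:
        # include both active arms and an inactive control.
        return "Treatment A vs. Treatment B vs. no active treatment"
    # Usual care is established and evidence-based: compare actives head to head.
    return "Treatment A vs. Treatment B (comparative effectiveness)"

print(choose_design(True, True))
```

The ordering of the checks mirrors the tree: the usual-care question is answered before any question about Treatment A's evidence base becomes relevant.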

Level 1 questions

Is the condition of interest being managed by any treatment as part of usual care, either locally or internationally?

Researchers first need to identify what treatments are being offered as usual care to their target patient population in order to decide whether to perform comparative effectiveness research (Treatment A versus B) or to use a design comparing Treatment B with an inactive control. Usual care has been shown to vary across healthcare settings for many interventions [11, 12]; thus, researchers should understand that usual care in their context might not be usual care universally. Consequently, researchers must consider what comprises usual care both in their local context and more broadly.

If there is no usual care treatment, then it is practical to undertake a design comparing Treatment B with no active treatment (Fig. 1, Exit 1). If there is strong evidence of the effectiveness, safety, and cost-effectiveness of a Treatment A that is not a component of usual care locally, this treatment should be considered for inclusion in the study. This situation can result from delayed translation of research evidence into practice, with an estimated 17 years to implement only 14% of research in evidence-based care [13]. In this circumstance, although it may be more feasible to use a Treatment B versus no active treatment design, the value of this research will be very limited compared with comparative effectiveness research of Treatment A versus B. If the condition is currently being treated as part of usual care, then the researcher should consider the alternate Level 1 question for progression to Level 2.

As an example, prevention of falls is a safety priority within all healthcare sectors and most healthcare services have mitigation strategies in place. Evaluation of the effectiveness of different fall-prevention strategies within the hospital setting would most commonly require a comparative design [ 14 ]. A non-active treatment in this instance would mean withdrawal of a service that might be perceived as essential, a governmental health priority, and already integrated in the healthcare system.

Is there evidence of Treatment A’s effectiveness compared with no active intervention beyond usual care?

If there is evidence of Treatment A’s effectiveness compared with a placebo or no active treatment, then we progress to Question 3. If the evidence for Treatment A is limited, a design comparing Treatment B with no active treatment can be considered. Alternatively, by comparing Treatment A with Treatment B, researchers would generate research evidence relevant both to their local healthcare setting (is Treatment B superior to usual care or Treatment A?) and to other healthcare settings that use Treatment A as their usual care. The Treatment B versus no active treatment design may be particularly useful when the research targets only the local population and extrapolation of the findings is less relevant.

For example, the success of chronic disease management programs (Treatment A) run in different Aboriginal communities was highly influenced by unique characteristics, local cultures, and traditions [15]. Transplanting Treatment A to an urban or non-Indigenous setting lacking those unique characteristics may therefore render it ineffectual. Including Treatment A as a comparator may also be particularly useful where the condition of interest has an uncertain etiology and the competing treatments under consideration address different pathophysiological pathways. However, if Treatment A has limited use beyond the research location and there are no compelling reasons to extrapolate findings more broadly, then a Treatment B versus no active control design may be suitable.

The key points clinical researchers should consider are:

  • The commonality of the treatment within usual care
  • The success of established treatments in localized or unique population groups only
  • Established effectiveness of treatments compared with placebo or no active treatment

Level 2 questions

Do the benefits of Treatment A exceed its side effects when compared with no active intervention beyond usual care?

Where Treatment A is known to be effective yet produces side effects, the severity, risk of occurrence, and duration of those side effects should be considered before it is used as a comparator for Treatment B. If the risk or potential severity of Treatment A’s side effects is unacceptably high or uncertain, and no other potential comparative treatments are available, a study design comparing Treatment B with no active intervention should be used (Fig. 1, Exit 2). Whether Treatment A should remain a component of usual care should also be considered. If the side effects of Treatment A are considered acceptable, comparative effectiveness research may still be warranted.

The clinician researcher may also be challenged when the risks of Treatment A and Treatment B are unknown, or when one is marginally riskier than the other [16]. When the risks of the two treatments cannot be compared, the situation should be treated as uncertain within this framework, and a design of Treatment A versus Treatment B, Treatment B versus no intervention, or a three-arm trial of Treatment A, Treatment B, and no intervention is potentially justified (Fig. 1, Exit 3).

A good example of risk comparison is the use of exercise programs. Walking has many health benefits, particularly for older adults, and has demonstrated benefits in reducing falls [17]. Exercise programs that include walking training have been shown to prevent falls, but brisk walking programs for people at high risk of falls can increase the number of falls experienced [18]. In such cases, a pragmatic comparative effectiveness design that accounts for these risks could demonstrate the effect better than a placebo-based (no active treatment) trial.

The key points clinical researchers should consider are:

  • The risk of treatment side effects (including death) in the design
  • Whether acceptable levels of risk are present for all treatments

Level 3 question

Does Treatment A have sufficient overall net benefit, when all costs and consequences or benefits are considered, to deem it superior to a ‘no active intervention beyond usual care’ condition?

Simply being effective and free of unacceptable side effects is insufficient to warrant Treatment A being the standard for comparison. If the cost of providing Treatment A is so high that its benefits are insignificant compared with its costs, or Treatment A has been shown not to be cost-effective, or its cost-effectiveness does not meet acceptable thresholds, then Treatment A is not a realistic comparator. Some have advocated a cost-effectiveness (cost-utility) threshold of $50,000 per quality-adjusted life year gained, though there is some disagreement about this figure and different societies may have different capacities to afford it [19]. Based on these considerations, one should further contemplate whether Treatment A should remain a component of usual care. If no other potential comparative treatments are available, a study design comparing Treatment B with no active intervention is recommended (Fig. 1, Exit 4).
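As a concrete illustration of this threshold logic, the sketch below computes an incremental cost-effectiveness ratio (ICER) for a hypothetical Treatment A against no active intervention and compares it with the commonly cited $50,000-per-QALY threshold. All cost and QALY figures here are invented for illustration; they are not drawn from any study.

```python
def icer(cost_tx, cost_control, qaly_tx, qaly_control):
    """Incremental cost per quality-adjusted life year (QALY) gained."""
    return (cost_tx - cost_control) / (qaly_tx - qaly_control)

# Commonly cited (and debated) willingness-to-pay threshold, $ per QALY
THRESHOLD = 50_000

# Hypothetical arms: Treatment A vs. no active intervention
ratio = icer(cost_tx=12_000, cost_control=2_000, qaly_tx=1.3, qaly_control=1.0)
print(round(ratio, 2))        # 33333.33
print(ratio <= THRESHOLD)     # under this threshold, A would remain a comparator
```

A treatment whose ICER exceeds the chosen threshold would, under this reasoning, drop out as a realistic comparator (Fig. 1, Exit 4).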

If Treatment A does have demonstrated efficacy, safety, and cost-effectiveness compared with no active treatment, it is unethical to pursue a study design comparing Treatment B with no active intervention, because consenting patients would be asked to forego a safe and effective treatment that they would otherwise have received. Such a design is also likely to be unfeasible, as recruitment rates could be very poor. However, Treatment A may be reasonable to include as a comparator if it is usually purchased by the potential participant and is made available through the trial.

The methodological design of a diabetic foot wound study illustrates the importance of health economics [20]. This study compared the outcomes of Treatment A (non-surgical sharps debridement) with Treatment B (low-frequency ultrasonic debridement). Empirical evidence supports the need for wound care, and non-intervention would place the patient at risk of further wound deterioration, potentially resulting in limb loss or death [21]. High consumable expenses and increased short-term time demands must also be weighed against low expenses and decreased longer-term time demands. The value of information should also be considered, with the existing levels of evidence weighed against the opportunity cost of using research funds for another purpose, in the context of the probability that Treatment A is cost-effective [22].

  • Economic evaluation and effect on treatment
  • Understanding the health economics of treatment based on effectiveness will guide clinical practice
  • Not all treatment costs are known but establishing these can guide evidence-based practice or research design

Level 4 question

Is the patient (potential participant) presenting to a health service or to a university- or research-administered clinic?

If Treatment A is not a component of usual care, one of three alternatives is being considered by the researcher: (i) conducting a comparative effectiveness study of Treatment B in addition to usual care versus usual care alone, (ii) introducing Treatment A to usual care for the purpose of the trial and then comparing it with Treatment B in addition to usual care, (iii) conducting a trial of Treatment B versus no active control. If the researcher is considering option (i), usual care should itself be considered to be Treatment A, and the researcher should return to Question 2 in our framework.

There is a recent focus on the importance of health research conducted by clinicians within health service settings as distinct from health research conducted by university-based academics within university settings [ 23 , 24 ]. People who present to health services expect to receive treatment for their complaint, unlike a person responding to a research trial advertisement, where it is clearly stated that participants might not receive active treatment. It is in these circumstances that option (ii) is most appropriate.

Using research designs (option iii) comparing Treatment B with no active control within a health service setting poses challenges to clinical staff caring for patients, as they need to consider the ethics of enrolling patients into a study who might not receive an active treatment (Fig.  1 , Exit 4). This is not to imply that the use of a non-active control is unethical. Where there is no evidence of effectiveness, this should be considered within the study design and in relation to the other framework questions about the risk and use of the treatment within usual care. Clinicians will need to establish the effectiveness, safety, and cost-effectiveness of the treatments and their impact on other health services, weighed against their concern for the patient’s well-being and the possibility that no treatment will be provided [ 25 ]. This is referred to as clinical equipoise.

Patients have a right to access publicly available health interventions, regardless of the presence of a trial. In such circumstances, comparing Treatment B with no active control is inappropriate, because usual care would be withheld. However, if there is insufficient evidence that usual care is effective, sufficient evidence that adverse events are likely, the treatment is prohibitively difficult to implement within clinical practice, or the cost of the intervention is significant, a sham or placebo-based trial may be implemented.

Comparative effectiveness research evaluating different treatment options for heel pain within a community health service [26] highlighted the importance of the research setting. Children with heel pain who attended the health service for treatment were recruited for this study. Children and parents were asked on enrollment whether they would participate if there were a potential assignment to a ‘no-intervention’ group. Of the 124 participants, only 7% (n = 9) agreed that they would participate if placed into a group with no treatment [26].

  • The research setting can impact the design of research
  • Clinical equipoise challenges clinicians during recruitment into research in the healthcare setting
  • Patients enter a healthcare service for treatment; entering a clinical trial is not their motive for presenting

This framework describes and examines a decision structure for comparator selection in comparative effectiveness research based on current interventions, risk, and setting. While scientific rigor is critical, researchers in clinical contexts have additional considerations related to existing practice, patient safety, and outcomes. It is proposed that when trials are conducted in healthcare settings, a comparative effectiveness research design should be preferred over a placebo-based trial design, provided that the evidence for treatment options, risk, and setting has been carefully considered.
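Read as an algorithm, the framework's decision structure can be sketched in code. This is only an illustrative paraphrase under our own assumptions — the function name, argument names, and return strings are our shorthand for the framework's questions and Fig. 1 exits — not a validated decision tool.

```python
def select_design(usual_care_exists, a_effective, a_risk_acceptable,
                  a_cost_effective, presenting_to_health_service):
    """Suggest a study design for evaluating Treatment B (framework sketch).

    a_risk_acceptable may be True, False, or None (risk unknown/uncertain).
    """
    # Level 1: no usual care -> Treatment B vs no active treatment (Exit 1)
    if not usual_care_exists:
        return "Treatment B vs no active treatment (Exit 1)"
    # Level 1, second question: limited evidence of A's effectiveness
    if not a_effective:
        return "Treatment B vs no active treatment"
    # Level 2: risk of A unknown -> any of the three designs may be justified
    if a_risk_acceptable is None:
        return "A vs B, B vs none, or three-arm trial (Exit 3)"
    if not a_risk_acceptable:
        return "Treatment B vs no active intervention (Exit 2)"
    # Level 3: insufficient net benefit once costs are considered
    if not a_cost_effective:
        return "Treatment B vs no active intervention (Exit 4)"
    # Level 4: in a health service setting, a comparative design is preferred
    if presenting_to_health_service:
        return "comparative effectiveness: Treatment A vs Treatment B"
    return "Treatment B vs no active control (research clinic setting)"

print(select_design(True, True, True, True, True))
```

Walking hypothetical scenarios through such a function is one way to check that a proposed trial design is consistent with all four levels of questions before committing to a protocol.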

Authors’ contributions

CMW and TPH drafted the framework and manuscript. All authors critically reviewed and revised the framework and manuscript and approved the final version of the manuscript.

Competing interests

The authors declare that they have no competing interests.

Contributor Information

Cylie M. Williams, Phone: +61 3 9784 8100, Email: ua.vog.civ.nchp@smailliweilyc .

Elizabeth H. Skinner, Email: [email protected] .

Alicia M. James, Email: ua.vog.civ.nchp@semajaicila .

Jill L. Cook, Email: [email protected] .

Steven M. McPhail, Email: [email protected] .

Terry P. Haines, Email: [email protected] .


Functional outcomes of different surgical treatments for common peroneal nerve injuries: a retrospective comparative study

  • Zhen Pang,
  • Shuai Zhu,
  • Yun-Dong Shen,
  • Yan-Qun Qiu,
  • Yu-Qi Liu,
  • Wen-Dong Xu &
  • Hua-Wei Yin

BMC Surgery volume 24, Article number: 64 (2024)


This study aims to assess the recovery patterns and factors influencing outcomes in patients with common peroneal nerve (CPN) injury.

This retrospective study included 45 patients with CPN injuries treated between 2009 and 2019 in Jing’an District Central Hospital. The surgical interventions were categorized into three groups: neurolysis (group A; n = 34), nerve repair (group B; n = 5) and tendon transfer (group C; n = 6). Preoperative and postoperative sensorimotor functions were evaluated using the British Medical Research Council grading system. The outcome measures included the numeric rating scale, walking ability, numbness, and satisfaction. Receiver operating characteristic (ROC) curve analysis was utilized to determine the optimal time interval between injury and surgery for predicting postoperative foot dorsiflexion function, toe dorsiflexion function, and sensory function.

Surgical interventions led to improvements in foot dorsiflexion strength in all patient groups, enabling most to regain independent walking ability. Group A (neurolysis) had significant sensory function restoration (P < 0.001), and three patients in Group B (nerve repair) had sensory improvements. ROC analysis revealed that the optimal time interval for achieving M3 foot dorsiflexion recovery was 9.5 months, with an area under the curve (AUC) of 0.871 (95% CI = 0.661–1.000, P = 0.040). For M4 foot dorsiflexion recovery, the optimal cut-off was 5.5 months, with an AUC of 0.785 (95% CI = 0.575–0.995, P = 0.020). When using M3 toe dorsiflexion recovery or S4 sensory function recovery as the gold standard, the optimal cut-off remained 5.5 months, with AUCs of 0.768 (95% CI = 0.582–0.953, P = 0.025) and 0.853 (95% CI = 0.693–1.000, P = 0.001), respectively.

Conclusions

Our study highlights the importance of early surgical intervention in CPN injury recovery, with optimal outcomes achieved when surgery is performed within 5.5 to 9.5 months post-injury. These findings provide guidance for clinicians in tailoring treatment plans to the specific characteristics and requirements of CPN injury patients.


Lower extremity nerve injuries represent 20% of all peripheral nerve injuries, and among them the common peroneal nerve (CPN) is the most frequently damaged nerve in the lower limb owing to its superficial location [1, 2]. CPN injury often results in “drop foot”, with patients typically exhibiting a characteristic steppage gait and suffering from weak ankle dorsiflexion [3]. The loss of great toe extension and dorsal foot sensation is also common [4]. The primary goal of surgical intervention is to enhance motor function, particularly foot dorsiflexion, while also alleviating sensory disturbances and associated symptoms.

The choice of treatment for CPN injuries is heavily influenced by their underlying causes, which encompass various factors such as trauma, idiopathic entrapment, and iatrogenic injuries [ 5 ]. Traumatic etiologies include injuries such as lacerations, knee dislocations and fractures [ 6 ]. Idiopathic entrapment syndrome is the main cause of common peroneal palsies [ 7 ]. For instance, nerve lacerations necessitate immediate nerve repair, while neurolysis is suitable for addressing nerve entrapment. CPNs are frequently compressed by tendons, tumors or ganglion cysts, necessitating their resection during neurolysis procedures [ 8 , 9 ]. Conventional treatment options include conservative management, physical therapy, neurolysis, nerve repair (comprising direct sutures and nerve grafting), and tendon transfer [ 10 ].

Considering that some common peroneal palsies may exhibit spontaneous recovery, non-operative management is usually preferred in cases lacking well-defined injuries [ 4 ]. Successful non-operative approaches include activity restriction and the utilization of ankle-foot orthoses. However, when functional improvement remains slow or absent despite 3–6 months of conservative therapy, surgical interventions become imperative [ 11 ]. Physical therapy techniques, such as electrical stimulation, have been found effective in promoting nerve repair and improving patient function [ 12 ].

Two primary surgical strategies are employed in the treatment of CPN injuries: (1) restoration of CPN function and (2) tendon transfer to reestablish foot muscle function and balance [ 13 , 14 ]. Nerve exploration and neurolysis typically suffice for most entrapment or compression injuries, with 75% of patients demonstrating a positive nerve action potential during surgical exploration, achieving complete recovery [ 15 ]. In cases of sharp lacerations, direct nerve suturing within a few days is often the preferred choice. However, when the peroneal nerve exhibits defects or there is high anastomotic tension, autogenous nerve grafts are preferred, with the sural nerve serving as the most common donor. The success of nerve grafts is closely linked to graft length [ 16 ], as grafts shorter than 6 cm yield favorable outcomes in 64% of patients, while those exceeding 12 cm are associated with favorable outcomes in only 11% of patients [ 2 ]. In recent years, nerve transfer has emerged as a novel approach for CPN injury treatment. Transferring the soleus muscular branch of the tibial nerve to the deep fibular nerve has shown promise in CPN injury repair and the restoration of ankle dorsiflexion [ 17 , 18 ]. Additionally, the double transfer of tibial nerve branches to the flexor digitorum longus and lateral head of the gastrocnemius to the deep peroneal nerve has proven beneficial in restoring motor function for certain patients [ 19 ]. These innovative approaches have opened new avenues in nerve repair therapy.

Neurolysis, being less invasive, facilitates rapid post-surgical recovery and effectively enhances the function of patients with intact CPN continuity. Nonetheless, its precise indications remain somewhat ambiguous. Neurolysis is less efficacious for patients with an unbroken CPN but experiencing complete motor function loss [ 20 ]. Nerve repair is appropriate for individuals with a complete CPN rupture, as it can restore both sensory and motor functions. However, when the graft length becomes excessive, nerve repair outcomes tend to be suboptimal.

Tendon transfer, particularly posterior tibial tendon transfer, is an effective method for reinstating foot dorsiflexion. The primary objective across all treatments remains the correction of foot drop, which can be achieved through tendon transfer [ 21 ]. Initially, tendon transfer was considered a corrective surgery for patients whose nerve function failed to improve after repair. However, a novel surgical approach has recently emerged, wherein tendon transfer is combined with nerve repair in a one-stage protocol, aimed at rebalancing muscle forces for enhanced reinnervation [ 22 ]. Ferraresi et al. have demonstrated that one-stage nerve repair and tendon transfer can yield superior functional recovery compared to nerve repair alone [ 23 ]. Similarly, Ho et al. reported that simultaneous tendon transfer and nerve repair may offer improved function compared to tendon transfer as a sole intervention [ 4 ].

Tendon transfer can effectively restore foot dorsiflexion but cannot fully restore muscle strength or range of motion and may result in flatfoot or hindfoot valgus [2, 20]. In addition, tendon transfer is less effective in restoring toe extension and dorsal foot sensory function.

While previous research has established the safety and efficacy of the three surgical treatments, there is a lack of studies investigating the postoperative recovery characteristics of each surgical approach, leading to a lack of evidence-based guidance for the decision-making process of physicians in selecting the most suitable surgery based on individual patient conditions and needs. Consequently, some patients may require a second surgery due to unsatisfactory results, particularly those who have previously undergone neurolysis. Thus, it is important to determine whether neurolysis can indeed yield the desired functional recovery.

Considering the limited number of systematic studies analyzing the factors that influence the prognosis of neurolysis, we designed this study to address these gaps by conducting a comprehensive retrospective analysis of patients post-treatment and investigating the factors that impact surgical outcomes. Our objective is to assess the therapeutic effects of different surgical interventions for CPN injuries and identify the key factors that influence the outcome of neurolysis to provide valuable guidance for clinical decision-making.

Study design and patient population

This descriptive, retrospective study included 45 patients with CPN injuries treated between 2009 and 2019 in Jing’an District Central Hospital. Patients were considered eligible if they met the following criteria: (1) confirmed CPN injury through examination, classified as partial or complete based on EMG grades, and attributable to diverse causes such as trauma, nerve entrapment, or idiopathic origins; (2) demonstrated weak foot dorsiflexion, graded as M0 to M4 on the Medical Research Council (MRC) scale for muscle strength, and/or sensory deficits on the dorsum of the foot; (3) underwent surgical treatments at Jing’an District Central Hospital, with comprehensive preoperative evaluation data available, and; (4) adhered to the follow-up recommendations. The study exclusion criteria were presence of severe organ dysfunction or severe ankle contracture/deformity on the affected side, inability to communicate normally due to severe neuropsychiatric disorders, and unwillingness of patients or their family members to participate in follow-up. The study was conducted in accordance with the ethical standards of the Declaration of Helsinki (as revised in 2013) and was approved by the Ethics Committee of Jing’an District Central Hospital (No. 202303), which waived the requirement for individual consent due to the retrospective nature of the present study.

Treatment selection

The patients were classified into the following groups based on the treatments they underwent rather than the type of injury: Group A underwent neurolysis, Group B underwent nerve repair, and Group C underwent tendon transfer. The type of treatment was based on the surgeons’ discretion. The potential criteria for each procedure were:

  • Neurolysis: a closed injury or spontaneous compression in which preoperative electromyography (EMG) shows that stimulation proximal to the injury elicits compound muscle action potentials at the target muscle, or in which, although no signal is elicited on preoperative EMG, signals are re-recorded after intraoperative exploration of the nerve to release adhesions and relieve compression, and the texture of the injured nerve is good with continuity still present
  • Nerve repair: a direct cut injury in which the nerve is known to be ruptured preoperatively, or a non-cut injury in which intraoperative exploration reveals a neuroma-like structure at the site of the nerve injury and EMG shows no evidence of nerve fiber regeneration
  • Tendon transfer: an injury sustained more than a year earlier with established muscle atrophy (e.g., of the tibialis anterior), or previous neurolysis or nerve repair surgery that has not been effective

Surgical technique

Patients received either lumbar anesthesia or general anesthesia while in the prone position, and the surgery followed standard sterile procedures with the application of a lower extremity tourniquet. The tourniquet was set at a pressure of 55 kPa and was in use for less than 60 min.

For neurolysis, a surgical oblique incision, typically 6–8 cm in length, was made extending from the fibular head to the popliteal space. The nerve, often entering the peroneal muscle layer near the head of the fibula, was tracked to locate the compression site. Frequently, the nerve is entrapped by the peroneus longus and brevis tendons or scar tissue. The tissue causing the nerve entrapment was excised, and in certain cases, partial dissection of the CPN’s epineurium was necessary (Fig.  1 ).

figure 1

The release of the nerve distally. The white arrow indicates the common peroneal nerve after release. This is a lateral incision at the fibular head on the right leg, with the popliteal fossa on the right and the calf on the left

In the nerve graft procedure, an S-shaped incision of approximately 12 cm in length was made beneath the fibular head. The CPN was explored and released from the fibular head to the sites where the superficial and deep peroneal nerves bifurcate. Both nerves were separately exposed to confirm their continuity, and any ruptured or necrotic sections of the nerve were revealed. These damaged portions of the nerve were dissected, and the stumps on either side were trimmed to expose the healthy nerve papilla. The length of the nerve defect was then measured. When the defect gap was less than 1 cm, a direct suture was performed as the preferred approach. In cases with larger gaps, nerve grafting was required. The sural nerve, typically obtained through a surgical incision in the lateral calf, was used as the donor for nerve grafts, with the cut length of the sural nerve determined by the length of the CPN defect. Three- or four-strand sural nerves were employed in parallel to bridge the peroneal nerve (Fig.  2 ).

figure 2

Suturing of the sural nerves between the common peroneal nerve stumps. The white arrow indicates the distal end of the common peroneal nerve, the gray arrow indicates the proximal end of the common peroneal nerve, and the white asterisk indicates the transplanted nerve

In tendon transfer, two longitudinal incisions were made, one on the medial foot and the other on the medial calf. The posterior tibialis tendon was exposed and cut at its tendon insertion sites (Fig.  3 ). In some cases, the peroneal brevis tendon and flexor digitorum longus were also utilized for transfer. Subsequently, two longitudinal incisions were created, one over the dorsal surface of the foot and the other on the dorsal calf. An electric drill was used to perforate the diaphysis of the third cuneiform bone, extending through to the sole of the foot. The posterior tibialis tendon was guided through the tibiofibular interosseous membrane to reach the anterolateral foot incision. A puncture needle was used to facilitate the passage of the tendon through the cuneiform hole to the plantar surface, with the foot held at 80° of dorsiflexion. The site of tendon fixation was then sutured and reinforced (Fig.  4 ).

figure 3

The posterior tibialis tendon was delivered from the wound on the medial calf. The gray arrow indicates the posterior tibialis tendon. The incision outlined by the white dotted line is used to find and resect the insertion of the posterior tibialis tendon. See Fig.  4 for the specific surgical operation

figure 4

Fixing the tendon by sutures. The posterior tibialis tendon indicated by the white arrow is sutured to the third cuneiform bone. This is a front-and-rear view with the toe in front and the heel behind

Assessments

The following data were retrieved and assessed: demographic information, medical history, preoperative evaluations, and postoperative outcomes. In both preoperative and follow-up physical examinations, motor strength and sensory function were assessed using the British Medical Research Council scale. For rating comparison, we utilized the following convention: the standard S3+ sensory rating was designated as S4, and the standard S4 sensory rating was denoted as S5. Motor strength assessments were based on tibialis anterior-based foot dorsiflexion and soleus-based foot plantarflexion, as well as toe dorsiflexion and plantarflexion. Sensory function scoring focused on the dorsal foot and lateral lower leg. Preoperative physical examinations also included Tinel’s sign evaluation at the fibular head. All patients underwent preoperative EMG to confirm the CPN injuries.

In the outcome evaluation, pain was assessed using the numeric rating scale (NRS) from 0 (no pain) to 10 (worst pain imaginable). The functional recovery level was assessed by questioning the patients on their activity level and participation in sports. The activity level included ambulatory walking, independent walking, and running. All of the patients were asked to rate their “overall satisfaction with the outcome of the operation” on a scale of extremely satisfied, satisfied, satisfied with reservation, and dissatisfied.

We considered a patient to have achieved good function if their motor grade was M3 or higher, while an M2 motor grade indicated fair function. Poor function was assigned to scores in the M0–1 range. Sensory function was classified according to the same criteria [22].
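This three-way grading is simple enough to state as code. The sketch below (the function name is ours) maps a Medical Research Council grade to the study's functional category, applied identically to motor and sensory grades as described above.

```python
def classify_function(grade: int) -> str:
    """Map an MRC grade (0-5) to the study's functional category."""
    if grade >= 3:
        return "good"   # M3 or higher
    if grade == 2:
        return "fair"   # M2
    return "poor"       # M0-M1

print([classify_function(g) for g in range(6)])
# ['poor', 'poor', 'fair', 'good', 'good', 'good']
```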

Follow-up was conducted through telephone, online communication software, or outpatient visits, with a minimum 1-year post-surgery duration and included assessing surgical efficacy (i.e., postoperative sensorimotor function, daily activities), detecting postoperative adverse events (i.e., pain, numbness), and evaluating satisfaction to the surgical treatments.

Statistical analysis

Statistical analysis was performed using the SPSS V22 statistics tool (IBM Corporation, Armonk, New York, USA). Due to the nature of our data, non-parametric methods were predominantly used. The Spearman correlation test assessed relationships between ordinal variables, such as the time interval from injury to surgery and postoperative functional outcomes. Graphs were generated using GraphPad Prism 7 software (Dotmatics, Boston, Massachusetts, USA). Data are presented as means ± standard deviations.

Receiver operating characteristic (ROC) curve analysis was performed to determine the threshold time interval from injury to surgery for predicting postoperative foot dorsiflexion function, toe dorsiflexion function, and sensory function. To validate the predictive ability of the time elapsed from injury to surgery, the area under the curve (AUC) was computed, and the optimal cut-off points were identified based on the highest Youden Index.
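The cut-off selection described here can be illustrated with a small, dependency-free sketch on synthetic data. The function name and all data values below are invented for illustration (the study itself used SPSS); the logic — scanning candidate thresholds and keeping the one with the highest Youden index (sensitivity + specificity − 1) — is the standard technique named above.

```python
def youden_cutoff(months, recovered):
    """Return (best_cutoff, best_J) over candidate time-to-surgery thresholds.

    A patient is 'predicted to recover' if operated on at or before the
    cut-off; `recovered` holds the observed outcomes (True/False).
    """
    best_cut, best_j = None, -1.0
    for cut in sorted(set(months)):
        tp = sum(1 for m, r in zip(months, recovered) if m <= cut and r)
        fn = sum(1 for m, r in zip(months, recovered) if m > cut and r)
        tn = sum(1 for m, r in zip(months, recovered) if m > cut and not r)
        fp = sum(1 for m, r in zip(months, recovered) if m <= cut and not r)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j

# Synthetic example: earlier surgery tends to precede recovery
months    = [2, 3, 4, 5, 6, 8, 10, 12]
recovered = [True, True, True, True, False, False, False, True]
print(youden_cutoff(months, recovered))  # (5, 0.8)
```

A full ROC analysis would additionally report the AUC and a p-value, as the study does for each outcome.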

Quantitative data were analyzed using the Student’s t-test for normally distributed variables and the Mann-Whitney U test for non-normally distributed variables. Subgroup analyses within Group A were performed using one-way analysis of variance (ANOVA) to identify factors associated with the outcome of neurolysis. Post-hoc tests incorporating Tukey correction were conducted to determine the significant differences among various subgroup means.
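For intuition, the Mann-Whitney U statistic used for the non-normally distributed variables reduces to counting pairs across the two groups. The sketch below uses invented values and our own naming, and computes the U statistic only; the study's actual analyses, including p-values, were run in SPSS.

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U for sample xs vs ys: count pairs with x > y (ties 0.5)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)

# Hypothetical pain (NRS) scores for two independent groups
group_a = [2, 3, 5, 4]
group_c = [6, 7, 5, 8]
print(mann_whitney_u(group_c, group_a))  # 15.5
```

Note that the two U statistics are complementary: U(xs, ys) + U(ys, xs) equals the number of pairs, len(xs) * len(ys).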

All comparisons were two-tailed, and statistical significance was determined based on P  < 0.05.

General clinical data of the patient

The study cohort included 35 males and 10 females, aged between 2 and 67 years, with a mean age of 31.16 years. Of them, 34 underwent neurolysis, 5 received nerve grafts, and 6 underwent tendon transfer. On average, patients underwent neurolysis 6.1 ± 5.4 months (range, 0.5–24 months) after disease onset, nerve grafting 2.2 ± 0.4 months after onset (four cases at 2 months and one at 3 months), and tendon transfer 38.2 ± 23.3 months after onset (range, 10–72 months; three cases under 36 months and three over 36 months). The patient groups were designated as groups A, B and C, corresponding to those who underwent neurolysis, nerve grafting, and tendon transfer, respectively.

Regarding the nature of the injuries, ten patients experienced peroneal nerve injuries due to cut trauma (five of whom underwent neurolysis, as electromyography and intraoperative exploration confirmed intact CPN continuity). Eight injuries resulted from traffic accidents, six from falls, six from knee dislocations, two from crush injuries, one from an unspecified trauma, eight from idiopathic nerve compression (including local compression and strenuous exercise), two from iatrogenic causes, one from poisoning, and one had an unknown cause (Table 1).

Recovery of motor function

Before surgery, most patients had poor or fair foot dorsiflexion. However, nine patients in group A presented with good foot dorsiflexion function before the operation; in these cases, surgery was indicated primarily to alleviate pain, further enhance function, improve toe dorsiflexion, and relieve severe numbness.

The mean follow-up duration was 5.28 years. Compared to their preoperative levels, patients who underwent neurolysis (P < 0.001), nerve repair (P = 0.032) and tendon transfer (P = 0.015) all demonstrated improvements in foot dorsiflexion muscle strength after surgery. Specifically, in group A, 31 patients (91%) who received neurolysis achieved good foot dorsiflexion function, 22 of whom (71%) had presented with poor or fair dorsiflexion preoperatively (Table 2). In group B, three patients (60%) who underwent nerve repair attained active dorsiflexion with a strength of M3. In group C, five patients (83%) who underwent tendon transfer achieved active dorsiflexion, with strengths ranging from M3 to M5 (Table 2). Notably, one patient in group B with fair recovery underwent secondary nerve repair, and one patient in group C with poor recovery exhibited irreversible muscle atrophy. Toe dorsiflexion function, which is governed by the deep peroneal nerve branch of the CPN [24], was also monitored. In group A, 26 patients (76%) who underwent neurolysis achieved good toe dorsiflexion; in group B, three patients (60%) achieved good or fair toe dorsiflexion, with one patient in this group improving from fair to good function postoperatively; and in group C, one patient (17%) achieved good toe dorsiflexion (Table 3). Therefore, compared to nerve repair, tendon transfer proved more effective in restoring foot dorsiflexion but less effective in restoring toe dorsiflexion.

In group A, patient satisfaction levels were as follows: 15 patients (44%) were extremely satisfied, nine patients (26%) were satisfied, eight patients (24%) were satisfied with reservation, and two patients (6%) were dissatisfied. In group B, one patient (20%) was extremely satisfied, two patients (40%) were satisfied, and two patients (40%) were dissatisfied. In group C, one patient (17%) was extremely satisfied, three patients (50%) were satisfied, and two patients (33%) were dissatisfied (Table 2). Patients in groups B and C who had fair or poor dorsiflexion outcomes expressed dissatisfaction. Additionally, one patient in group C, a 4-year-old, expressed dissatisfaction as the patient had hoped for more substantial improvements in foot dorsiflexion and toe dorsiflexion.

Recovery of sensory function

Neurolysis effectively restored sensory function in the dorsal foot and lateral lower leg for patients with CPN injuries (P < 0.001). However, patients who underwent nerve repair (P = 0.310) or tendon transfer (P = 0.699) did not show significant improvements in sensory function after surgery compared to their preoperative status. In group A, 30 patients (88%) experienced substantial sensory function recovery in the dorsal foot and lateral calf (Table 4). In group B, sensory function improved from poor to fair in two patients and from fair to good in one, while the remaining patients exhibited no change. In group C, one patient’s sensory function in the dorsal foot and lateral calf improved from fair to good. Comparatively, nerve repair appeared more effective than tendon transfer in restoring sensory function.

Among the patients, numbness in the dorsal foot and lateral calf was reported by 17 patients (50%) in group A. In contrast, four patients (80%) in group B and only one patient (17%) in group C reported numbness. In terms of the highest level of achieved activity, 23 patients (68%) in group A were able to run, and 11 patients (32%) could walk unaided. In group B, two patients (40%) could run, while three patients (60%) were limited to walking. In group C, three patients could run, and two patients could walk barefoot after tendon transfer. Pain was infrequently reported among these patients, with only three patients (9%) in group A describing slight pain. Among the five patients in group B, one reported severe pain with an NRS rating of 7. In group C, two patients (33%) experienced pain, with ratings of 3 or 5. All patients underwent Tinel’s sign testing at the lateral aspect of the fibular head and neck. Most of the 34 patients in the neurolysis group, two of the five patients in the nerve repair group, and four of the six patients in the tendon transfer group exhibited positive results. However, no significant relationship was observed between the presence of Tinel’s sign and surgical outcomes.

Factors affecting the prognosis of neurolysis

Three neurolysis patients later received tendon transfer to improve motor function. For individuals with suboptimal neurolysis outcomes, alternative surgeries were considered to enhance functional recovery. We then explored factors influencing neurolysis outcomes: foot dorsiflexion recovery showed no significant correlation with age (ρ = 0.052, P = 0.77), a weak correlation with preoperative EMG results (ρ = 0.353, P = 0.04), and a significant negative correlation with the time from onset to surgery (ρ = −0.481, P = 0.004). Patients with preoperative EMG findings suggesting partial CPN injury tended to achieve better neurolysis outcomes than those with complete CPN injury, and shorter intervals between onset and surgery were associated with improved neurolysis results.
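For illustration, this kind of Spearman correlation can be reproduced with `scipy.stats.spearmanr`; the delay and grade values below are invented placeholders, not patient data.

```python
import numpy as np
from scipy import stats

# Hypothetical ordinal pairs: months from onset to neurolysis vs. an
# MRC-style recovery grade (longer delay, lower grade in this toy example).
months_to_surgery = np.array([1, 2, 3, 5, 6, 8, 10, 14, 18, 24])
recovery_grade = np.array([5, 5, 4, 5, 4, 3, 3, 2, 2, 1])

rho, p = stats.spearmanr(months_to_surgery, recovery_grade)
# A negative rho, as in the study, means longer delays go with worse grades.
```

Because Spearman's test works on ranks, it handles the ordinal recovery grades and their ties without assuming normality.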

When assessing motor function recovery, we focused on the tibialis anterior muscle responsible for foot dorsiflexion, a crucial aspect impacting patients’ daily lives. A correlation was observed between higher postoperative foot dorsiflexion muscle strength and shorter time intervals (Fig. 5A). However, due to variations in preoperative muscle strengths among patients, the reliability of this finding is limited. Subsequently, we narrowed our analysis to 21 patients with poor preoperative dorsiflexion muscle strength (Table 2) to investigate changes in muscle strength after neurolysis. Consistent with previous results, shorter time intervals were associated with greater functional improvements (Fig. 5B-C). However, statistical significance in Fig. 5B is limited due to the small number of patients with muscle strength changes from 0 to 2 (n = 3). Combining patients with 0–2 grade changes and those with 3 grade changes into one group, we found that a short time from symptom onset to surgery can lead to substantial foot dorsiflexion functional recovery (muscle strength increased by 4–5). However, a longer interval did not necessarily imply a lack of functional recovery, as muscle strength can still improve by 0–3.

Figure 5. The tibialis anterior muscle force among the three groups. A. Relationship between the time from symptom onset to neurolysis and post-surgery muscle strength (M3 vs. M5). Patients with M5 muscle strength after surgery (3.37 ± 2.43) had a significantly shorter period from symptom onset to surgical treatment than those with M3 muscle strength (7.25 ± 3.07); P = 0.011. Data from one patient, whose muscle strength had recovered to M5, were excluded from the analysis because of an unusually long interval (24 months) between symptom onset and neurolysis; with this patient included, the interval for M5 patients was 4.66 ± 5.52 and P = 0.0549. B-C. Duration from symptom onset to neurolysis in relation to changes in tibialis anterior muscle force. B. Patients with 4-grade (4.16 ± 2.79) or 5-grade (3.5 ± 1.89) muscle strength improvements exhibited shorter time intervals than those with 0–2-grade improvements (13.33 ± 7.72); P = 0.03 (vs. 4-grade improvements), P = 0.02 (vs. 5-grade improvements). C. Patients with 4–5-grade improvements (3.83 ± 2.41) had shorter intervals than those with 0–3-grade improvements (8.78 ± 5.98); P = 0.03. *, P < 0.05; error bars represent the standard deviation (SD). n = 33 for panel A, n = 21 for panels B-C. TA, tibialis anterior

Investigation of toe dorsiflexion function recovery showed a significant association between better recovery and shorter time intervals (Fig. 6A). Analysis of patients who lacked toe dorsiflexion before surgery (n = 22; Table 3) showed that a shorter time interval between symptom onset and surgery correlated with improved toe dorsiflexion (Fig. 6B). It was found that once a specific time threshold was exceeded, patients lost the opportunity to restore toe dorsiflexion function.

Figure 6. The toe dorsiflexion muscle strength among the three groups. A. Relationship between the time from symptom onset to neurolysis and post-surgery toe dorsiflexion muscle strength (M0-M2 vs. M3-M5). Patients with M3 to M5 muscle strength (5.14 ± 3.06) had significantly shorter time intervals between symptom onset and surgical treatment than patients with M0 to M2 muscle strength (9.38 ± 5.74); P = 0.04. B. Patients with 3- to 5-grade muscle strength improvements had shorter time intervals (5.4 ± 5.5) than those without improvements (11 ± 6); P = 0.0074. *, P < 0.05; **, P < 0.01; error bars represent the standard deviation (SD). n = 34 for panel A, n = 21 for panel B

Next, we investigated the recovery of sensory function in the dorsal foot and lateral lower leg following neurolysis. Similar to motor function recovery, we observed that better sensory function recovery was associated with shorter time intervals (Fig. 7A). Assessment of 28 patients with poor or fair sensory function prior to treatment (Table 4) revealed a trend toward sensory function improvement among those with shorter durations between disease onset and neurolysis (Fig. 7B). However, due to limited sample sizes in each group, statistical significance was not established. When combining patients with no improvement and those with only one level of improvement, we found that individuals with shorter time intervals achieved significant sensory function recovery, whereas those with longer intervals experienced limited improvements (Fig. 7C).

Figure 7. The sensory grade analysis among the three groups. A. Relationship between the time from symptom onset to neurolysis and sensory grade (Grade 3 vs. Grade 5). Patients with Grade 5 sensory function (3.38 ± 2.37) had a significantly shorter period from symptom onset to surgical treatment than those with Grade 3 (8.92 ± 2.43); P = 0.002. B-C. Time from symptom onset to treatment for patients with different changes in sensory grade. C. Patients with 3- to 4-grade sensory function improvements exhibited shorter intervals (4.41 ± 5.26) than those with 0–1-grade improvements (11.43 ± 5.34); P = 0.02. *, P < 0.05; **, P < 0.01; error bars represent the standard deviation (SD). n = 34 for panel A, n = 28 for panels B-C

To assess the predictive value of time intervals for neurolysis outcomes, ROC analysis was conducted. Using M3 foot dorsiflexion recovery as the reference standard, the optimal time interval cut-off was 9.5 months, with an AUC of 0.871 (95% CI = 0.661–1.000, P = 0.04; Fig. 8A), indicating that patients undergoing neurolysis within 9.5 months of injury had a good chance of achieving foot dorsiflexion at or above M3. When considering M4 foot dorsiflexion recovery as the reference standard, the optimal cut-off interval was 5.5 months, with an AUC of 0.785 (95% CI = 0.575–0.995, P = 0.02; Fig. 8D). Therefore, for patients aiming for foot dorsiflexion at or above M4, early neurolysis within 5.5 months after injury is advisable. Similarly, when using M3 toe dorsiflexion recovery or S4 sensory function recovery as the reference standards, the optimal cut-off remained 5.5 months, with AUCs of 0.768 (95% CI = 0.582–0.953, P = 0.025; Fig. 8B) and 0.853 (95% CI = 0.693–1.000, P = 0.001; Fig. 8C), respectively. In summary, the best chances of recovering foot dorsiflexion, toe dorsiflexion, and sensory function are associated with neurolysis within 5.5 months after injury; neurolysis performed between 5.5 and 9.5 months post-injury still allows partial foot dorsiflexion recovery.

Figure 8. ROC curves for time to surgery. A. ROC analysis using M3 foot dorsiflexion recovery as the gold standard; best cut-off value, 9.5 months (AUC = 0.871, 95% CI = 0.661–1.000, P = 0.04). B. ROC analysis using M3 toe dorsiflexion recovery as the gold standard; best cut-off value, 5.5 months (AUC = 0.768, 95% CI = 0.582–0.953, P = 0.025). C. ROC analysis using S4 sensory function recovery as the gold standard; best cut-off value, 5.5 months (AUC = 0.853, 95% CI = 0.693–1.000, P = 0.001). D. ROC analysis using M4 foot dorsiflexion recovery as the gold standard; best cut-off value, 5.5 months (AUC = 0.785, 95% CI = 0.575–0.995, P = 0.02). n = 25 for panels A and D; n = 33 for panel B; n = 30 for panel C. ROC, receiver operating characteristic; AUC, area under the ROC curve

Discussion

The high incidence of CPN injuries presents a significant challenge in selecting the most appropriate treatment. In our study, we explored various treatment options, including conservative treatment, physical therapy, neurolysis, direct suture or nerve graft, and tendon transfer. Each treatment method was selected based on the etiology and severity of the patient’s injury, acknowledging that the right treatment approach can vary significantly depending on these factors. Patients with CPN transection or traction injuries can be considered for tendon transfer, while those with CPN rupture may benefit from nerve graft or tendon transfer, and those with CPN compression are often considered for neurolysis [25]. However, in cases of cut trauma, intraoperative exploration has sometimes revealed CPN continuity; when these patients undergo timely surgery, simple neurolysis can lead to significant functional recovery. In a previous study, we did not observe a clear relationship between the causes of injury and postoperative outcomes, which might be attributed to the high incidence of trauma as the primary cause of injury and the varying degrees of injury severity among patients.

The CPN innervates muscles responsible for both foot and toe dorsiflexion. In this study, we aimed to provide a comprehensive assessment of CPN function by including both. While foot dorsiflexion is crucial for gait, toe dorsiflexion, governed by the deep peroneal nerve branch of the CPN, also contributes to a balanced and functional gait, particularly during the swing phase. Our findings revealed that a significant proportion of patients achieved good toe dorsiflexion recovery postoperatively: 76% of patients undergoing neurolysis and 60% in the nerve repair group achieved good or fair toe dorsiflexion. This indicates the potential for functional recovery of toe dorsiflexion, which we believe is an important aspect of overall CPN function. Recovery of toe dorsiflexion showed a significant association with shorter intervals between symptom onset and surgery, highlighting the time-sensitive nature of this aspect of recovery.

Conservative treatment can be effective for some CPN injuries, as spontaneous recovery is possible in certain cases. However, Maalla et al. found that if symptoms do not begin to improve within the first month, early surgery within the first few months is advisable, as further delay can result in incomplete spontaneous recovery [7]. Some patients with subtle symptoms and no significant findings on EMG may also benefit from surgery [26]. While physical therapy, including electrical stimulation, is a safe clinical approach [27] that can accelerate axon regeneration beyond the site of injury after surgery [28], it should be viewed as a complementary method that requires coordination with surgical intervention. Neurolysis of the CPN generally leads to faster recovery than rehabilitation therapy alone [7]. However, not all patients are willing to undergo surgery, and some may not be suitable candidates for neurolysis; furthermore, there are no well-defined criteria for recommending or avoiding neurolysis. Nerve repair has become increasingly effective with advancements in microsurgical techniques, although it tends to yield suboptimal results in patients not treated within 12 months of injury or those requiring grafts longer than 12 cm [2]. Tendon transfer is a common alternative for patients with limited nerve function, although some patients may be hesitant to undergo it, especially when ankle-foot orthoses such as shoe dorsiflexion splint inserts can adequately support their daily activities [29].

One significant finding from our study is the independent predictive value of the time elapsed between symptom onset and neurolysis on patient outcomes, which can aid surgeons in making informed decisions regarding surgical interventions. Patients who underwent neurolysis within 5.5 months of their injury achieved substantial recovery in foot/toe dorsiflexion function and sensation. However, those who had surgery between 5.5 and 9.5 months post-injury only experienced partial foot dorsiflexion improvement, and neurolysis was less likely to restore effective function in individuals injured for over 9.5 months. In such cases, alternative options like tendon transfer or nerve repair may be more appropriate. Prior studies have also noted a correlation between the timing of surgery and postoperative recovery [30, 31, 32]. Nonetheless, our study offers a more comprehensive and systematic exploration of how CPN neurolysis influences postoperative sensorimotor function recovery, with potential clinical implications. Given the potential for spontaneous recovery in some patients, we cannot definitively attribute the functional improvements observed within shorter time intervals solely to neurolysis. Nevertheless, we can conclude that patients with more favorable motor and sensory functional recoveries tend to have shorter time intervals. For patients with longer intervals, additional treatment modalities may be necessary to facilitate substantial functional recovery. Taken together, our findings provide important insights for clinical decision-making and emphasize the importance of timely surgical intervention.

Neurolysis can achieve favorable outcomes in 80% of patients [2], with reduced functional recovery observed as surgery is delayed [33]. Timely medical attention is crucial, but treatment delays can occur due to patient referral issues [33]. The CPN has a poorer blood supply than the tibial nerve [34, 35], which can lead to irreversible CPN damage with long-term compression, rendering traditional neurolysis less effective.

Some patients may choose observation over neurolysis; Rose et al. advocated a 6-month observation period in peroneal nerve palsy [36]. In our series, all patients except one were treated after at least 1 month of observation. However, we found that observation alone did not yield satisfactory results, and despite potential drawbacks, the benefits of surgery outweigh the disadvantages. Currently, a 6–8 cm incision is made at the fibular head, although Ducic et al. recommended a minimally invasive 3 cm approach to reduce surgical trauma [37].

Our results indicated that tendon transfer generally led to better foot dorsiflexion recovery compared to nerve grafting, while nerve grafting was more effective for toe dorsiflexion and sensory function recovery. Giuffre et al. reported a 30% functional recovery rate (n = 10) in patients undergoing nerve repair, which is suboptimal [38]. Due to the limited number of nerve repair cases in our study, we refrain from drawing a definitive conclusion about the efficacy of nerve grafting. Yeap et al. reported that 83% of patients (n = 12) who underwent posterior tibial tendon transfer achieved excellent or good outcomes [39], consistent with our findings. Overall, these findings highlight the need for a tailored approach to treating CPN injuries, considering the specific functional deficits and patient needs.

Nerve repair, performed as nerve grafting in our series, yielded favorable motor recovery in 60% of patients and sensory recovery in 40%. In our series, four patients had grafts shorter than 6 cm, three of whom achieved good motor function recovery, consistent with the findings of Kim et al. [33]. Notably, graft length, rather than the number of cables, significantly influenced outcomes [31, 40]. The fragility of the nutrient arteries of the CPN is a critical consideration; Lundborg et al. observed complete nerve ischemia with 15% elongation of a nerve [41]. To enhance nerve graft success, it is advisable to minimize graft length, reduce intraoperative nerve stretching, and ensure tension-free anastomosis. However, nerve grafting is inevitably associated with complications, including numbness.

The decision to use the peroneus brevis tendon in transfers was influenced by its potential to enhance motor function, particularly in cases where nerve repair alone might not suffice. Our findings revealed that 83% of patients who underwent tendon transfer achieved favorable foot dorsiflexion recovery. While toe dorsiflexion was not fully restored by tendon transfer, one patient exhibited improved toe dorsiflexion postoperatively. This improvement may be attributed to the balancing effect of tendon transfer on foot extension and plantar flexion forces, thereby promoting CPN regeneration. The mean time interval to surgery was 3 years, consistent with the understanding that nerve function may take up to 2 years to recover after nerve repair [42]. However, patients with prolonged foot dorsiflexion dysfunction may develop rigid equinus contracture, potentially leading to permanent deficits in plantarflexion [4]. Early tendon transfer is already widely accepted for ulnar and radial nerve injuries [43, 44], suggesting that it may be a suitable option for patients who do not benefit from neurolysis.

Tendon transfer primarily enhances motor function, while nerve repair offers both motor recovery and sensory improvement; combining these complementary procedures can therefore facilitate patient recovery. Millesi’s theory suggests that reinnervation may be impeded by the force imbalance between active plantar flexor muscles and passively stretched, denervated foot extensors [45]. Tendon transfer can rectify this imbalance, and simultaneous tendon transfer and nerve grafting may enhance rehabilitation according to this theory [40]. Considering the limitations of each surgical method, our study supports combining tendon transfer and nerve repair: rebalancing muscle group strength through tendon transfer, alongside nerve repair, can promote more comprehensive patient recovery.

Our research has several limitations. First, this was a retrospective study at a single medical center, and the number of patients was limited. Second, data regarding postoperative rehabilitation were not reported in most patients’ records and could not be analyzed. Third, several patients had incomplete physical examination results, and because we did not perform simultaneous tendon transfer with neurolysis or nerve grafting, we could not evaluate combined procedures. Lastly, we acknowledge that focusing solely on toe dorsiflexion may not fully capture functional gait outcomes. Future studies with larger sample sizes, more comprehensive data sets and more detailed analyses (e.g., ankle dorsiflexion and its direct impact on gait, alongside toe dorsiflexion) are needed to provide a more holistic understanding of CPN injury recovery and to validate these results.

Our retrospective study of CPN injury therapy demonstrates the effectiveness of surgical treatment in improving clinical outcomes. While some patients may experience spontaneous recovery, our findings suggest that early surgical intervention leads to better outcomes, especially when conservative treatment does not yield significant improvement. The choice of treatment should therefore be guided not only by the nature of the CPN injury but also by the timing of surgical intervention, a crucial factor for motor and sensory recovery after neurolysis, as evidenced by the optimal results achieved when neurolysis was performed within 5.5 months of injury. Neurolysis alone can partially restore foot dorsiflexion between 5.5 and 9.5 months after injury, but combining it with other procedures yielded the best therapeutic results. For patients injured for more than 9.5 months, neurolysis alone may not be advisable; in such cases, nerve repair and tendon transfer could be more appropriate, as nerve repair was found to enhance recovery of toe dorsiflexion and of sensation in the dorsal foot and lateral lower leg, although patients undergoing nerve repair often experience numbness and occasional pain. Tendon transfer was suitable for patients aiming to improve foot dorsiflexion function to some extent. Taken together, these results can help clinicians select treatment plans tailored to the characteristics and needs of patients with CPN injuries.

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

Horteur C, Forli A, Corcella D, Pailhe R, Lateur G, Saragaglia D. Short- and long-term results of common peroneal nerve injuries treated by neurolysis, direct suture or nerve graft. Eur J Orthop Surg Traumatol. 2019;29(4):893–8.

George SC, Boyce DE. An evidence-based structured review to assess the results of common peroneal nerve repair. Plast Reconstr Surg. 2014;134(2):302e–11e.

Daniels SP, Ross AB, Sneag DB, Gardon SN, Li G, Hanna AS, Tuite MJ. Can MR neurography of the common peroneal nerve predict a residual motor deficit in patients with foot drop? Skeletal Radiol. 2023;52(4):751–61.

Ho B, Khan Z, Switaj PJ, Ochenjele G, Fuchs D, Dahl W, Cederna P, Kung TA, Kadakia AR. Treatment of peroneal nerve injuries with simultaneous tendon transfer and nerve exploration. J Orthop Surg Res. 2014;9:67.

Rasulic L, Savic A, Vitosevic F, Samardzic M, Zivkovic B, Micovic M, Bascarevic V, Puzovic V, Joksimovic B, Novakovic N, et al. Iatrogenic peripheral nerve injuries-Surgical treatment and outcome: 10 years’ experience. World Neurosurg. 2017;103:841–51. e6.

Peskun CJ, Chahal J, Steinfeld ZY, Whelan DB. Risk factors for peroneal nerve injury and recovery in knee dislocation. Clin Orthop Relat Res. 2012;470(3):774–8.

Maalla R, Youssef M, Ben Lassoued N, Sebai MA, Essadam H. Peroneal nerve entrapment at the fibular head: outcomes of neurolysis. Orthop Traumatol Surg Res. 2013;99(6):719–22.

Yazid Bajuri M, Tan BC, Das S, Hassan S, Subanesh S. Compression neuropathy of the common peroneal nerve secondary to a ganglion cyst. Clin Ter. 2011;162(6):549–52.

Harvie P, Torres-Grau J, Beaver RJ. Common peroneal nerve palsy associated with pseudotumour after total knee arthroplasty. Knee. 2012;19(2):148–50.

Dwivedi N, Paulson AE, Johnson JE, Dy CJ. Surgical treatment of foot drop: patient evaluation and peripheral nerve treatment options. Orthop Clin North Am. 2022;53(2):223–34.

Souter J, Swong K, McCoyd M, Balasubramanian N, Nielsen M, Prabhu VC. Surgical results of common peroneal nerve neuroplasty at lateral fibular neck. World Neurosurg. 2018;112:e465–e72.

Senger JLB, Verge VMK, Macandili HSJ, Olson JL, Chan KM, Webber CA. Electrical stimulation as a conditioning strategy for promoting and accelerating peripheral nerve regeneration. Exp Neurol. 2018;302:75–84.

Seidel JA, Koenig R, Antoniadis G, Richter HP, Kretschmer T. Surgical treatment of traumatic peroneal nerve lesions. Neurosurgery. 2008;62(3):664–73. discussion – 73.

Vigasio A, Marcoccio I, Patelli A, Mattiuzzo V, Prestini G. New tendon transfer for correction of drop-foot in common peroneal nerve palsy. Clin Orthop Relat Res. 2008;466(6):1454–66.

Kim DH, Kline DG. Management and results of peroneal nerve lesions. Neurosurgery. 1996;39(2):312–9. discussion 9–20.

Roganovic Z. Missile-caused complete lesions of the peroneal nerve and peroneal division of the sciatic nerve: results of 157 repairs. Neurosurgery. 2005;57(6):1201–12. discussion – 12.

Chen H, Meng D, Yin G, Hou C, Lin H. Translocation of the soleus muscular branch of the tibial nerve to repair high common peroneal nerve injury. Acta Neurochir (Wien). 2019;161(2):271–7.

Bao B, Wei H, Zhu H, Zheng X. Transfer of soleus muscular branch of tibial nerve to deep fibular nerve to repair foot drop after common peroneal nerve injury: a retrospective study. Front Neurol. 2022;13:745746.

El-Taher M, Sallam A, Saleh M, Metwally A. Foot reanimation using double nerve transfer to deep peroneal nerve: a novel technique for treatment of neurologic foot drop. Foot Ankle Int. 2021;42(8):1011–21.

Woodmass JM, Romatowski NP, Esposito JG, Mohtadi NG, Longino PD. A systematic review of peroneal nerve palsy and recovery following traumatic knee dislocation. Knee Surg Sports Traumatol Arthrosc. 2015;23(10):2992–3002.

Emamhadi M, Bakhshayesh B, Andalib S. Surgical outcome of foot drop caused by common peroneal nerve injuries; is the glass half full or half empty? Acta Neurochir (Wien). 2016;158(6):1133–8.

Garozzo D, Ferraresi S, Buffatti P. Surgical treatment of common peroneal nerve injuries: indications and results. A series of 62 cases. J Neurosurg Sci. 2004;48(3):105–12. discussion 12.

Ferraresi S, Garozzo D, Buffatti P. Common peroneal nerve injuries: results with one-stage nerve repair and tendon transfer. Neurosurg Rev. 2003;26(3):175–9.

Lezak B, Massel DH, Varacallo M. Peroneal nerve injury. In: StatPearls. Treasure Island (FL): StatPearls Publishing; 2023.

Klifto KM, Azoury SC, Gurno CF, Card EB, Levin LS, Kovach SJ. Treatment approach to isolated common peroneal nerve palsy by mechanism of injury: systematic review and meta-analysis of individual participants’ data. J Plast Reconstr Aesthet Surg. 2022;75(2):683–702.

Peters BR, Pripotnev S, Chi D, Mackinnon SE. Complete foot drop with normal electrodiagnostic studies: Sunderland zero ischemic conduction block of the common peroneal nerve. Ann Plast Surg. 2022;88(4):425–8.

Chan KM, Curran MW, Gordon T. The use of brief post-surgical low frequency electrical stimulation to enhance nerve regeneration in clinical practice. J Physiol. 2016;594(13):3553–9.

Gordon T, English AW. Strategies to promote peripheral nerve regeneration: electrical stimulation and/or exercise. Eur J Neurosci. 2016;43(3):336–50.

Kim DH, Midha R, Murovic JA, Spinner RJ. Kline and Hudson’s nerve injuries. 2nd ed. Philadelphia: Saunders; 2007. pp. 401–13.

Google Scholar  

Vasavada K, Shankar DS, Bi AS, Moran J, Petrera M, Kahan J, Alaia EF, Medvecky MJ, Alaia MJ. Predictors using machine learning of complete peroneal nerve Palsy Recovery after Multiligamentous knee Injury: a Multicenter Retrospective Cohort Study. Orthop J Sports Med. 2022;10(9):23259671221121410.

PubMed   PubMed Central   Google Scholar  

Mackay MJ, Ayres JM, Harmon IP, Tarakemeh A, Brubacher J, Vopat BG. Traumatic peroneal nerve injuries: a systematic review. JBJS Rev 2022;10(1).

Hughes BA, Stallard J, Chakrabarty A, Anand S, Bourke G. Determining the real site of peroneal nerve injury with knee dislocation: Earlierier is easier. J Plast Reconstr Aesthet Surg. 2021;74(10):2776–820.

Kim DH, Murovic JA, Tiel RL, Kline DG. Management and outcomes in 318 operative common peroneal nerve lesions at the Louisiana State University Health Sciences Center. Neurosurgery. 2004;54(6):1421–8. discussion 8–9.

Murovic JA. Lower-extremity peripheral nerve injuries: a Louisiana State University Health Sciences Center literature review with comparison of the operative outcomes of 806 Louisiana State University Health Sciences Center sciatic, common peroneal, and tibial nerve lesions. Neurosurgery. 2009;65(4 Suppl):A18–23.

Dy CJ, Inclan PM, Matava MJ, Mackinnon SE, Johnson JE. Current concepts review: common peroneal nerve Palsy after knee dislocations. Foot Ankle Int. 2021;42(5):658–68.

Rose HA, Hood RW, Otis JC, Ranawat CS, Insall JN. Peroneal-nerve palsy following total knee arthroplasty. A review of the hospital for special surgery experience. J Bone Joint Surg Am. 1982;64(3):347–51.

Ducic I, Felder JM. 3rd. Minimally invasive peripheral nerve surgery: peroneal nerve neurolysis. Microsurgery. 2012;32(1):26–30.

Giuffre JL, Bishop AT, Spinner RJ, Levy BA, Shin AY. Partial tibial nerve transfer to the tibialis anterior motor branch to treat peroneal nerve injury after knee trauma. Clin Orthop Relat Res. 2012;470(3):779–90.

Yeap JS, Birch R, Singh D. Long-term results of tibialis posterior tendon transfer for drop-foot. Int Orthop. 2001;25(2):114–8.

Wood MB. Peroneal nerve repair. Surgical results. Clin Orthop Relat Res. 1991;267:206–10.

Article   Google Scholar  

Lundborg G, Rydevik B. Effects of stretching the tibial nerve of the rabbit. A preliminary study of the intraneural circulation and the barrier function of the perineurium. J Bone Joint Surg Br. 1973;55(2):390–401.

Sedel L, Nizard RS. Nerve grafting for traction injuries of the common peroneal nerve. A report of 17 cases. J Bone Joint Surg Br. 1993;75(5):772–4.

De Abreu LB. Early restoration of pinch grip after ulnar nerve repair and tendon transfer. J Hand Surg Br. 1989;14(3):309–14.

Omer GE. Timing of tendon transfers in peripheral nerve injury. Hand Clin. 1988;4(2):317–22.

Irgit KS, Cush G. Tendon transfers for peroneal nerve injuries in the multiple ligament injured knee. J Knee Surg. 2012;25(4):327–33.

Download references

Acknowledgements

Not applicable.

Funding

This work was supported by the National Natural Science Foundation of China (Nos. 81801941, 82021002, 81830063); the National Science and Technology Innovation 2030 Major Program (No. 2021ZD0204200); the Shanghai Technology Innovation Plan (No. 21Y11902900); the Fujian Province Science and Technology Innovation Joint Fund Programme (No. 2021Y9129); the Shanghai Municipal Clinical Medical Center Project (No. 2022ZZ01007); and the Program of Shanghai Municipal Commission of Health and Family Planning (No. 20164Y0018).

Author information

Zhen Pang and Shuai Zhu contributed equally to this article as co-first authors.

Authors and Affiliations

Department of Hand Surgery, Huashan Hospital, Fudan University, Shanghai, China

Zhen Pang, Shuai Zhu, Yun-Dong Shen, Wen-Dong Xu & Hua-Wei Yin

Department of Hand and Upper Extremity Surgery, Jing’an District Central Hospital, Shanghai, China

Yun-Dong Shen, Yan-Qun Qiu, Wen-Dong Xu & Hua-Wei Yin

Department of Orthopedics and Hand Surgery, the First Affiliated Hospital of Fujian Medical University, Fujian, China

Yun-Dong Shen, Wen-Dong Xu & Hua-Wei Yin

Institute of Neuroscience, CAS Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China

Yu-Qi Liu, Wen-Dong Xu & Hua-Wei Yin

State Key Laboratory of Medical Neurobiology and MOE Frontiers Center for Brain Science, Institutes of Brain Science, Fudan University, Shanghai, China

Wen-Dong Xu & Hua-Wei Yin

Priority Among Priorities of Shanghai Municipal Clinical Medicine Center, Shanghai, China

Wen-Dong Xu

The National Clinical Research Center for Aging and Medicine, Fudan University, Shanghai, China


Contributions

(I) Conception and design: HW Yin, WD Xu, Z Pang; (II) Administrative support: YD Shen, YQ Qiu; (III) Provision of study materials or patients: YD Shen, YQ Qiu; (IV) Collection and assembly of data: Z Pang, S Zhu, YQ Liu; (V) Data analysis and interpretation: Z Pang, S Zhu; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Corresponding author

Correspondence to Hua-Wei Yin.

Ethics declarations

Ethics approval and consent to participate

The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was approved by the Research Ethics Committee of Jing’an District Central Hospital (No. 202303) and was therefore performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments. The Research Ethics Committee of Jing’an District Central Hospital waived the requirement for informed consent for this study.

Consent for publication

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Pang, Z., Zhu, S., Shen, YD. et al. Functional outcomes of different surgical treatments for common peroneal nerve injuries: a retrospective comparative study. BMC Surg 24 , 64 (2024). https://doi.org/10.1186/s12893-024-02354-x


Received: 31 May 2023

Accepted: 09 February 2024

Published: 17 February 2024



Keywords

  • Common peroneal nerve injury (CPN injury)
  • Nerve repair
  • Tendon transfer

BMC Surgery

ISSN: 1471-2482


  • Open access
  • Published: 16 February 2024

Comparing Bayesian hierarchical meta-regression methods and evaluating the influence of priors for evaluations of surrogate endpoints on heterogeneous collections of clinical trials

  • Willem Collier 1 ,
  • Benjamin Haaland 2 , 3 ,
  • Lesley A. Inker 4 ,
  • Hiddo J.L. Heerspink 5 &
  • Tom Greene 2  

BMC Medical Research Methodology volume 24, Article number: 39 (2024)


Surrogate endpoints, such as those of interest in chronic kidney disease (CKD), are often evaluated using Bayesian meta-regression. Trials used for the analysis can evaluate a variety of interventions for different sub-classifications of disease, which can introduce two additional goals in the analysis. The first is to infer the quality of the surrogate within specific trial subgroups defined by disease or intervention classes. The second is to generate more targeted subgroup-specific predictions of treatment effects on the clinical endpoint.

Using real data from a collection of CKD trials and a simulation study, we contrasted surrogate endpoint evaluations under different hierarchical Bayesian approaches. Each approach we considered induces different assumptions regarding the relatedness (exchangeability) of trials within and between subgroups. These include partial-pooling approaches, which allow subgroup-specific meta-regressions and, yet, facilitate data adaptive information sharing across subgroups to potentially improve inferential precision. Because partial-pooling models come with additional parameters relative to a standard approach assuming one meta-regression for the entire set of studies, we performed analyses to understand the impact of the parameterization and priors with the overall goals of comparing precision in estimates of subgroup-specific meta-regression parameters and predictive performance.

In the analyses considered, partial-pooling approaches to surrogate endpoint evaluation improved accuracy of estimation of subgroup-specific meta-regression parameters relative to fitting separate models within subgroups. A random rather than fixed effects approach led to reduced bias in estimation of meta-regression parameters and in prediction in subgroups where the surrogate was strong. Finally, we found that subgroup-specific meta-regression posteriors were robust to use of constrained priors under the partial-pooling approach, and that use of constrained priors could facilitate more precise prediction for clinical effects in trials of a subgroup not available for the initial surrogacy evaluation.

Partial-pooling modeling strategies should be considered for surrogate endpoint evaluation on collections of heterogeneous studies. Fitting these models comes with additional complexity related to choosing priors. Constrained priors should be considered when using partial-pooling models when the goal is to predict the treatment effect on the clinical endpoint.


There is broad interest in the use of validated surrogate endpoints to expedite clinical trials in areas of slowly progressing disease, such as chronic kidney disease (CKD) [ 1 , 2 , 3 , 4 , 5 ]. A surrogate endpoint is typically a measure of disease progression captured earlier than an established clinical endpoint and should have the property that the treatment effect on the surrogate accurately predicts the treatment effect on the clinical endpoint [ 6 , 7 , 8 ]. This predictive potential is commonly established in a meta-regression analysis of previously conducted trials, where the meta-regression quantifies the strength of the association between treatment effects on the clinical and surrogate endpoints [ 3 , 4 , 5 , 6 , 7 , 8 ]. Accurate estimation of the meta-regression parameters requires variability in the treatment effects on the surrogate and clinical endpoints across trials used for analysis. To achieve this, the collection of trials can contain heterogeneity in terms of interventions and sub-classifications of disease [ 3 , 4 ]. There is often interest among entities such as regulatory agencies regarding the performance of the surrogate in pre-specified, clinically or biologically motivated, and mutually exclusive subgroups defined by intervention or disease classes [ 1 ]. These interests introduce two specific goals the analytical approach must facilitate: The first is accurate estimation of subgroup-specific meta-regression parameters. The second is accurate prediction of treatment effects on the clinical endpoint, either for subgroups used in model fitting or for those not available for model fitting (e.g., for a novel intervention).

One meta-regression methodology involves a Bayesian hierarchical model, which can be used to account for estimation error of the treatment effects on both endpoints as well as the correlation of the sampling errors (a frequently used weighted generalized linear regression approach accounts only for sampling error of the effect estimate on one of the two endpoints) [ 6 , 8 , 9 ]. Under the hierarchical Bayesian approach, it is common to assume that all trials used in the analysis are fully exchangeable despite underlying differences in interventions or diseases across trials [ 4 , 5 , 6 , 8 ]. In effect, this is accomplished by fitting a model with a single meta-regression, relating treatment effects on the clinical endpoint to those on the surrogate endpoint, to all trials available for the analysis, which we refer to as the “full-pooling” approach. Alternatively, distinct meta-regressions can be fit within subgroups in what we will refer to as the “no-pooling” approach [ 4 , 7 ]. There are often too few trials and insufficient variability in treatment effects within subgroups to estimate the meta-regression parameters with satisfactory precision under a strict no-pooling strategy. An additional limitation is that both the full-pooling and no-pooling strategies constrain model-based prediction of the treatment effect on the clinical endpoint in a future trial. This is especially the case when there is interest in prediction for a trial of a “new subgroup”, one that was not available for the initial surrogacy evaluation. After all, in the ideal scenario a surrogate can be used for a trial evaluating a novel intervention or when applying an approved indication to a new patient population. Use of a full-pooling model requires the assumption that any future trial is fully exchangeable with the previous trials. Use of a no-pooling approach requires the future trial to be of a subgroup used for the surrogacy evaluation (an “existing subgroup”).

Bayesian hierarchical meta-regression lends itself naturally to a “partial-pooling” compromise between these earlier approaches, where a between-subgroup distribution is assumed for some or all subgroup-specific model parameters [ 7 ]. The partial-pooling approach relaxes the assumption of full exchangeability of all trials used for the analysis, can improve precision of inference on subgroup-specific parameters due to data-adaptive information sharing across subgroups, and provides a framework for model-based prediction of an effect on a clinical endpoint for a trial of either an existing or a new subgroup. However, critical decisions needed to fit models of this class are without empirical guidance in the literature. For example, fixed and random effects approaches are used interchangeably when employing full-pooling models, and the implications of these two approaches are not well understood under a partial-pooling model [ 8 ]. To our knowledge, there is also not yet work evaluating the impact of the choice of priors under partial-pooling strategies, even though the role of certain prior distributions is likely to be amplified in the likely scenario in which the number of subgroups is small.

In this paper, we provide results from a series of analyses intended to help guide practical decision making for surrogate endpoint evaluations on collections of heterogeneous studies. We explore the extent to which partial-pooling approaches improve precision in key posteriors of interest in surrogate evaluation and the extent to which bias occurs, contrast fixed and random effects variants of the models described, and examine the impact of priors. In the Methods section, we describe the modeling approaches evaluated, the priors, and how these methods can be used for prediction. In the Results section, we provide results of a limited simulation study and of an applied analysis of CKD trials. We then conclude with the Discussion section.

Modeling approaches to the trial-level analysis of a surrogate

For the trial-level evaluation of a surrogate endpoint, a two stage approach to the analysis is often used [ 6 , 7 , 8 ]. In the first stage, treatment effects on both the clinical and surrogate endpoint as well as standard errors and a within-study correlation between the error of the estimated effects are calculated for each trial. These trial-level measures are used as the data input in the meta-regression evaluation (the second stage). A two-level hierarchical model for the meta-regression can be used to account for within-study estimation error for both treatment effects [ 4 , 5 , 6 , 7 , 8 ].

Under the two-stage approach, one key distinction between commonly used second-stage models involves whether true treatment effects on the surrogate endpoint are viewed as fixed or random [ 6 , 8 ]. Under the fixed effects approach, the true treatment effects on the surrogate endpoint are fixed and the true effects on the clinical endpoint are regressed on the true effects on the surrogate assuming Gaussian residuals. Under the random effects approach, the true treatment effects on both the surrogate and the clinical endpoints are assumed to follow a bivariate normal distribution [ 4 , 5 , 8 ]. The within-study joint distribution can be reasonably approximated with a bivariate normal distribution due to asymptotic normality, but the bivariate normality assumption for the between-study model is made for modeling convenience. Bujkiewicz et al. contrast the predictive performance of a surrogate under fixed and random effects approaches when using the full-pooling approach, but do not summarize differences in estimates of key parameters such as the meta-regression slope [ 8 ]. Papanikos et al. evaluate and contrast different fixed effects approaches in subgroup analyses of a surrogate, but do not compare fixed and random effects approaches [ 7 ]. We hypothesized that the fixed and random effects approaches could produce differing results because there may be more or less shrinkage in the true effects on the surrogate across trials (the “x-axis” variable in the regression) depending on the method used.

We next introduce the full pooling random and fixed effects models, which are applicable when the clinical trials being analyzed can be regarded as exchangeable. Let there be N total clinical trials, each of which compares an active treatment to a control. For trials \(j = 1, \dots , N\) , \((\widehat{\theta }_{1j}, \widehat{\theta }_{2j})'\) jointly represents the suitably scaled within study estimates of treatment effects on the clinical and surrogate endpoints for trial j . The pair \((\theta _{1j}, \theta _{2j})'\) represents the latent joint true treatment effects on the clinical and surrogate endpoints in study j . We let \(\Sigma _j\) denote a within study variance-covariance matrix for study j ( \(\Sigma _{j1,1} = SE(\widehat{\theta }_{1j})^2\) is the squared standard error of the estimated clinical effect, \(\Sigma _{j2,2} = SE(\widehat{\theta }_{2j})^2\) the squared standard error of the estimated surrogate effect, \(\widehat{r}_j\) is the estimated within trial correlation for study j , implying \(\Sigma _{j1,2} = \Sigma _{j2,1} = \widehat{r}_j SE(\widehat{\theta }_{1j}) SE(\widehat{\theta }_{2j})\) ). When the standard errors and within study correlation are available, it is customary to consider all entries of \(\Sigma _j\) fixed and known [ 6 , 7 , 8 , 10 , 11 ]. For the random effects model, \(\mu _s\) represents a population average true treatment effect on the surrogate, and \(\sigma _s^2\) the between trial variance in true effects on the surrogate. We parameterize the model such that \(\alpha\) denotes the meta-regression intercept, \(\beta\) the slope, and \(\sigma _e\) the residual standard deviation. The following represents the full-pooling random effects model (FP-RE).
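In the notation just defined, and consistent with the definitions above, the FP-RE model takes the standard two-level form:

```latex
\begin{align*}
\begin{pmatrix} \widehat{\theta}_{1j} \\ \widehat{\theta}_{2j} \end{pmatrix}
  &\sim N_2\!\left( \begin{pmatrix} \theta_{1j} \\ \theta_{2j} \end{pmatrix}, \; \Sigma_j \right), \\
\theta_{2j} &\sim N(\mu_s, \, \sigma_s^2), \\
\theta_{1j} \mid \theta_{2j} &\sim N(\alpha + \beta \, \theta_{2j}, \, \sigma_e^2).
\end{align*}
```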

To fit a full-pooling fixed effects model (FP-FE), rather than assuming a Gaussian distribution for \(\theta _{2j}\) whose parameters are estimated as above, an independent prior is assigned directly to each \(\theta _{2j}\) .
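Concretely, with the diffuse normal prior used elsewhere in the paper for trial-specific surrogate effects, the FP-FE specification replaces the between-trial Gaussian model with, e.g.,

```latex
\theta_{2j} \sim N(0, 10^2) \quad \text{independently for } j = 1, \dots, N,
```

while the meta-regression \(\theta_{1j} \mid \theta_{2j} \sim N(\alpha + \beta \, \theta_{2j}, \sigma_e^2)\) and the within-study model are unchanged.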

Next, suppose that the N trials are to be divided into I total subgroups because exchangeability is plausible for the trials within each subgroup but not necessarily between trials in different subgroups. In our experience, regulatory agencies have expressed concern about heterogeneity in surrogate quality across pre-specified subgroups present in the data being used to evaluate CKD-relevant surrogate endpoints. The models discussed throughout the remainder of this paper are thus intended for similar scenarios where: the I subgroups which motivate concern over the full exchangeability of trials (i.e., there might be a different association between treatment effects on the clinical and surrogate endpoints depending on the subgroup a trial pertains to) are presented to the statistical analyst independent of any statistical criteria, subgroup assignment for the trials available for model fitting is not ambiguous (e.g., the inclusion and exclusion criteria of a trial would easily determine the subgroup assignment if disease-based subgroups are of interest), and there cannot be misclassification of trials into the wrong subgroups. When such an analytical scenario is presented, we might first consider fitting separate models within each subgroup. For \(i = 1,\dots , I\) , the following represents what we refer to as a no-pooling random effects (NP-RE) model for the \(j^{\text {th}}\) trial within the \(i^{\text {th}}\) subgroup.
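Mirroring the FP-RE structure, but with all parameters indexed by subgroup, the NP-RE model can be written as:

```latex
\begin{align*}
\begin{pmatrix} \widehat{\theta}_{1ji} \\ \widehat{\theta}_{2ji} \end{pmatrix}
  &\sim N_2\!\left( \begin{pmatrix} \theta_{1ji} \\ \theta_{2ji} \end{pmatrix}, \; \Sigma_{ji} \right), \\
\theta_{2ji} &\sim N(\mu_{si}, \, \sigma_{si}^2), \\
\theta_{1ji} \mid \theta_{2ji} &\sim N(\alpha_i + \beta_i \, \theta_{2ji}, \, \sigma_{ei}^2).
\end{align*}
```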

We note that one could fit a no-pooling fixed-effects model by placing a prior directly on each \(\theta _{2ji}\) , rather than assuming the Gaussian distribution as above.

For the partial pooling approach, we can incorporate between-subgroup distributions as an intermediate layer in the Bayesian analysis to induce information sharing across subgroups [ 7 , 12 ]. The terms controlling heterogeneity between subgroups are informed by the data. For example, if the data suggests a lack of between-subgroup heterogeneity for any given term, fitting this model should result in substantial information sharing and similar subgroup-specific parameter estimates. The partial pooling model may generate some amount of bias, but could counter-balance this bias with increased precision due to information sharing [ 12 ]. Among other reasons, because between-subgroup variation drives the data-adaptive information sharing, between-subgroup variance terms were of primary interest in our investigation of the influence of priors.

A partial-pooling random effects (PP-RE) model is displayed below. Several additional parameters are necessary to define this model. We let \(\mu _s\) and \(\sigma _s^2\) represent the between-subgroup average and variance of true treatment effects on the surrogate; \(\alpha\) and \(\sigma _{\alpha }^2\) and \(\beta\) and \(\sigma _{\beta }^2\) represent the between-subgroup average and variance of the meta-regression intercept and slope, respectively; \(\tau _s\) and \(\tau _e\) denote the between-subgroup means of the log-transformed true surrogate effects standard deviation and meta-regression residual standard deviation, respectively; \(\gamma _s^2\) and \(\gamma _e^2\) denote the between-subgroup variances of the log-transformed within-subgroup true surrogate treatment effects standard deviation and meta-regression residual standard deviation, respectively.
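The within-subgroup layer of the PP-RE model matches the no-pooling model; the subgroup-specific parameters are then tied together by between-subgroup distributions. Matching the parameter definitions above, one way to write this between-subgroup layer is:

```latex
\begin{align*}
\alpha_i \sim N(\alpha, \, \sigma_{\alpha}^2), \qquad
\beta_i \sim N(\beta, \, \sigma_{\beta}^2), \qquad
\mu_{si} \sim N(\mu_s, \, \sigma_s^2), \\
\log \sigma_{si} \sim N(\tau_s, \, \gamma_s^2), \qquad
\log \sigma_{ei} \sim N(\tau_e, \, \gamma_e^2),
\end{align*}
```

with priors on \(\alpha, \beta, \mu_s, \tau_s, \tau_e, \sigma_{\alpha}, \sigma_{\beta}, \sigma_s, \gamma_s, \gamma_e\) completing the specification.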

If fitting a partial-pooling fixed effects (PP-FE) model, a prior can be placed directly on each \(\theta _{2ji}\) , rather than assuming the hierarchical Gaussian distribution displayed above. We display an example of a PP-FE model here to contrast it with the PP-RE model more clearly. In this example, we place a N(0,10 \(^2\) ) prior on each trial’s true treatment effect on the surrogate.
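That is, in this example the surrogate-effect layer becomes

```latex
\theta_{2ji} \sim N(0, 10^2) \quad \text{independently for all } j \text{ and } i,
```

while \(\alpha_i\), \(\beta_i\), and \(\log \sigma_{ei}\) retain their between-subgroup Gaussian distributions.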

To our knowledge, there has been just one other paper to evaluate partial-pooling strategies for the trial-level analysis of a surrogate. As discussed in the introduction, Papanikos et al. evaluated different fixed effects partial-pooling approaches [ 7 ]. An additional difference between the PP-FE model displayed above and those considered by Papanikos et al. is that no between-subgroup distribution was assumed for \(\sigma _{ei}\) in their models. One advantage of allowing a between-subgroup distribution for \(\sigma _{ei}\) is that it enables estimating posteriors for parameters defining between-subgroup distributions for all meta-regression parameters (intercept, slope, and residual variance). This subsequently facilitates prediction for a trial of a new subgroup, as is discussed in the Generating posterior predictive distributions section.

Analysis set 1: simulation study

We generated trial-level summary data (estimated treatment effects, standard errors, and the within-study correlations) based on four broad simulation setups, where within each we introduced two variants depending on the distribution used to simulate true treatment effects on the surrogate. The setups considered were motivated by applied data used to evaluate GFR slope. We consider three subgroups of trials, as in previous evaluations of GFR slope, and to reflect the likely scenarios where the available data limits the number of subgroups, stressing the potential for benefit from data-adaptive partial pooling [ 4 ]. We simulated 15 medium-to-large trials per subgroup (standard errors on either endpoint reflect trials with roughly 300-2000 patients). Within-study correlations were drawn equally at random from the range of values present in our application data. Without loss of generality, we modeled a negative trial-level association. As discussed in the section titled Analysis set 2: application analysis of CKD trials, there is a negative association between treatment effects on the clinical endpoint and treatment effects on GFR slope. We also varied the sizes of subgroups and the degree of between-study variability in true effects on the surrogate. Broadly, we consider one setup (S1) where there is homogeneity in the quality of the surrogate across subgroups, another setup (S2) where the surrogate is weak in two subgroups and strong in another, another setup (S3) where the surrogate is weak in one subgroup and strong in the other two, and a final setup (S4) where surrogate quality is different in all three subgroups. The strength of the surrogate was defined by the true meta-regression \(R^2\) . Earlier work has proposed that \(R^2 \in (0,0.49)\) , \(R^2 \in (0.5,0.72)\) , and \(R^2 \in (0.73,1)\) suggest a weak, moderate, and strong surrogate, respectively [ 13 ].
For our purposes, we simulated data from true parameter values to obtain \(R^2 = 0.35,0.65,0.95\) to define the surrogate as weak, moderate or strong within subgroups, respectively.
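A common definition of the trial-level \(R^2\) under the random effects model is \(R^2 = \beta^2\sigma_s^2 / (\beta^2\sigma_s^2 + \sigma_e^2)\); the paper does not state the formula it used, but under this assumption the residual standard deviation needed to hit a target \(R^2\) can be solved for directly. A minimal sketch, with purely illustrative parameter values:

```python
import math

def residual_sd_for_r2(beta, sigma_s, r2):
    """Residual SD sigma_e yielding a target trial-level R^2, assuming
    R^2 = beta^2 * sigma_s^2 / (beta^2 * sigma_s^2 + sigma_e^2)."""
    return abs(beta) * sigma_s * math.sqrt(1.0 / r2 - 1.0)

# Illustrative values (not the paper's): slope -0.6, between-trial SD 0.4,
# targeting a "strong" surrogate with R^2 = 0.95.
sigma_e_strong = residual_sd_for_r2(-0.6, 0.4, 0.95)
```

Weaker surrogates (lower target \(R^2\)) correspond to larger residual standard deviations for the same slope and between-trial spread.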

Consider the data generating model below for the first variant (V1) of the four simulation setups. To simulate estimated clinical and surrogate effects for trial j ( \(j = 1,\dots , 15\) ) in subgroup i ( \(i = 1,2,3\) ) when true surrogate effects are Gaussian, we first drew true surrogate effects from (9), then drew conditional true clinical effects from (10), and finally drew a pair of estimated effects using (11). The standard errors and within-study correlations forming the matrices \(\Sigma _{ji}\) were drawn according to the rules described above using uniform distributions to reflect variation in trial sizes.
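The V1 data-generating steps described here can be sketched as follows; the specific parameter values and the uniform ranges for standard errors and correlations are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_subgroup(n_trials, mu_s, sigma_s, alpha, beta, sigma_e,
                      se_range=(0.05, 0.25), rho_range=(0.3, 0.8)):
    """Simulate trial-level summaries for one subgroup (V1: Gaussian true effects).

    Steps mirror the text: draw true surrogate effects, then conditional true
    clinical effects, then estimated effects from the within-study model.
    """
    theta2 = rng.normal(mu_s, sigma_s, n_trials)         # true surrogate effects
    theta1 = rng.normal(alpha + beta * theta2, sigma_e)  # true clinical effects
    se1 = rng.uniform(*se_range, n_trials)               # SEs reflect trial sizes
    se2 = rng.uniform(*se_range, n_trials)
    rho = rng.uniform(*rho_range, n_trials)              # within-study correlations
    est = np.empty((n_trials, 2))
    for j in range(n_trials):
        cov = rho[j] * se1[j] * se2[j]
        Sigma_j = np.array([[se1[j] ** 2, cov], [cov, se2[j] ** 2]])
        est[j] = rng.multivariate_normal([theta1[j], theta2[j]], Sigma_j)
    return est, se1, se2, rho

# 15 trials in one subgroup, with a negative trial-level association as in the paper
est, se1, se2, rho = simulate_subgroup(15, mu_s=-0.5, sigma_s=0.4,
                                       alpha=0.0, beta=-0.6, sigma_e=0.055)
```

Repeating this per subgroup, with subgroup-specific parameters chosen per setup (S1-S4), yields one simulated dataset.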

We also sought to contrast results under the different models when true treatment effects on the surrogate were distinctly non-Gaussian (V2). We used the following data generating model, where true effects on the surrogate for each trial were drawn from a bimodal distribution (12).

To summarize results, we provide simulation average posterior medians, \(2.5^{\text {th}}\) and \(97.5^{\text {th}}\) percentiles for models fit across 100 simulated datasets per setup. We also summarize posterior predictive distributions (PPDs - described further below).

Analysis set 2: application analysis of CKD trials

We compare analyses using the models discussed above on a set of 66 CKD studies. Data from these studies was collected by the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI), an international research consortium [ 3 , 4 ]. Evaluations of GFR slope on this collection of studies have been described extensively [ 3 , 4 ]. For the purposes of this paper, we focus on the GFR “chronic slope” as the surrogate [ 4 ]. Time-to-doubling of serum creatinine or kidney failure is used as the clinical endpoint, which is accepted by regulatory agencies and is widely used as the primary endpoint in pivotal phase 3 clinical trials of CKD [ 3 ]. Treatment effects on the clinical endpoint were expressed as log transformed hazard ratios (HRs), estimated using proportional hazards regression. A shared parameter mixed effects model was used to jointly model longitudinal GFR trajectories and the time of termination of GFR follow-up due to kidney failure or death for each randomized patient. Treatment effects on the chronic GFR slope are expressed as the mean difference in the treatment arm slope minus the control arm slope, expressed in ml/min/1.73 m \(^2\) per-year. Further detail on the methods used to estimate effects on GFR slope-based endpoints are described elsewhere in the literature [ 4 , 14 ]. Finally, we obtained robust sandwich estimates of the within-study correlations using a joint model as in previous work by CKD-EPI [ 4 ].

Heterogeneity across the CKD-EPI trials can be attributed to many study level factors. We consider four disease-defined subgroups (CKD with unspecified cause (CKD-UC), diabetes (DM), glomerular diseases (GN), and cardiovascular diseases (CVD)) and 16 intervention-defined subgroups (listed in the Additional file 1 : Section 1). For the application analyses, we focus on fitting the FP-RE and PP-RE models, and use different sets of priors under the PP-RE model (we also contrast results under the PP-RE and PP-FE models where subgroups are defined by disease to complement certain simulation analyses). To capture the scenario where there is interest in prediction for a future trial of a new subgroup, we first fit models by leaving out CVD studies, and we generated PPDs for those studies left-out. For intervention-defined subgroups, we fit the model for trials of 7 subgroups for which there were at least 3 studies, and we then generated PPDs for studies of the remaining left-out, smaller subgroups. We also summarize PPDs obtained for studies of the subgroups used for model fitting under these two subgroup schema.

For the purposes of the simulation study, we utilized diffuse priors, which is a common practice in surrogate endpoint evaluations [ 4 , 6 , 7 , 8 ]. For the full-pooling and no-pooling models, we used the \(N(0,10^2)\) prior for the intercept ( \(\alpha\) or \(\alpha _i\) ) and slope ( \(\beta\) or \(\beta _i\) ), and for the mean true treatment effect on the surrogate ( \(\mu _s\) , \(\mu _{si}\) under random effects models) or for trial-specific true effects on the surrogate when fitting the fixed effects models ( \(\theta _{2ji}\) ). As in previous work in CKD, we used inverse-gamma priors on variance terms ( \(\text {IG(a,b)}\) for shape \(\text {a}\) and scale \(\text {b}\) ) [ 4 , 5 ]. For the full-pooling and no-pooling models, we used \(\sigma _{ei}^2,\sigma _{e}^2 \sim \text {IG}(0.001,0.001)\) . Where appropriate (random effects models), we also used \(\sigma _s^2,\sigma _{si}^2 \sim\) \(\text {IG}(0.001,0.001)\) . The \(\text {IG}(0.001,0.001)\) prior is considered an approximation to the Jeffreys prior. For partial-pooling models, we let \(\tau _e^2 \sim \text {IG}(0.0025,0.001)\) and \(\gamma _e \sim \text {half-normal}(0,3^{2})\) , and for the random effects variants \(\tau _s^2 \sim \text {IG}(0.0025,0.001)\) and \(\gamma _s \sim \text {half-normal}(0,3^2)\) . This combination translates to priors for within-subgroup standard deviations in the partial-pooling models matching those of the no-pooling models to the extent that the 25 \(^{\text {th}}\) , 50 \(^{\text {th}}\) , and 75 \(^{\text {th}}\) prior percentiles differed by less than 0.05. For \(\sigma _{\alpha }\) , \(\sigma _{\beta }\) , \(\sigma _{s}\) , we used \(\text {half-normal}(0,2^2)\) . These specific half-normal priors should be considered highly diffuse for all of our analyses.

For our application analyses, we considered three variations on priors when employing the PP-RE model. We considered different priors for partial-pooling models because we hypothesized that not only narrow priors, but also highly diffuse priors, could unduly influence certain results of our analyses. This is because the number of studies available for meta-analysis is often limited, and categorizing studies by constructs such as disease subtype or treatment comparison class may yield only a small number of subgroups. When there are just a few subgroups, the data provide very little information on subgroup-to-subgroup variation, and posteriors for between-subgroup variance terms may be updated only minimally from their priors. As such, if priors are so diffuse that they represent a range of variability beyond practical reality, so too may the posteriors. As described below, this matters because between-subgroup variance parameters are utilized in generating posterior predictive distributions for a trial of a new subgroup. Narrowing certain priors to a practical degree can thus be seen as a middle ground between overly narrow and overly diffuse priors. While we narrowed all priors for the constrained “sets” considered, we focused on the priors for between-subgroup standard deviations of the meta-regression parameters. We first used the fully diffuse priors displayed above.
We then employed an iterative procedure, where we narrowed priors (emphasizing between-subgroup standard deviation parameters such as \(\sigma _{\alpha },\sigma _{\beta },\gamma _e\) ) until a set was found that produced no more than 0.05 difference in the posterior median, 2.5 \(^{\text {th}}\) , and 97.5 \(^{\text {th}}\) percentiles for the within-subgroup meta-regression posteriors, no matter how much narrower the posteriors on between-subgroup parameters became (referred to as “Constrained Priors Set 1”, which was ultimately the same for either subgroup classification). Finally, we chose what we will refer to as “domain-constrained” priors (“Constrained Priors Set 2”). It is reasonable to choose a prior that constrains between-subgroup variability to a range that is plausible based on subject matter expertise (e.g., through a prior elicitation process). For example, in our case the intercept is the expected true log-HR on the clinical endpoint when the true effect on the surrogate is the null effect. When there is a null effect on the surrogate, we may suspect a low probability of an expected HR on the clinical endpoint that is very strong in either direction (e.g., below 0.5 or above 2.0), and this logic can be used to assign a moderate to low probability for subgroup-specific intercepts to go beyond these values. Domain-constrained priors were the narrowest among those considered for our analyses, and further detail on choosing these priors is provided in Section 2 of Additional file 1 .

We also wish to emphasize that there is an important distinction between narrowing priors for the terms that define variability in the treatment effects on the surrogate across studies and narrowing priors for the meta-regression parameters. The degree of variability of treatment effects on the surrogate influences the extent to which the data allow the quality of the surrogate to be inferred. Priors for the distribution(s) of true treatment effects on the surrogate should therefore be left sufficiently diffuse so as not to restrict variation in effects across studies; in our analyses, these were narrowed only in the sense that the diffuse priors typically used are excessively wide relative to the range of treatment effects that is plausible. The priors of primary interest are again those governing the degree of variability between subgroups in the meta-regression terms (e.g., \(\sigma _{\beta }\) ).

Generating posterior predictive distributions

There are a number of strategies that can be used to generate PPDs for the treatment effect on the clinical endpoint based on the treatment effect on the surrogate. In our simulation study, we compare summaries of PPDs for the true treatment effect on the clinical endpoint, which take into account only uncertainty in the estimated meta-regression parameters. This is possible in a simulation analysis because we actually know the true effect on the surrogate [ 7 ]. For each study left out of model fitting, let the true effect on the surrogate for that study be denoted \(\theta _{2}^N\) . Then, the PPD for the true effect on the clinical endpoint is generated by taking \(m=1,\dots ,M\) draws (one for each of the M posterior draws obtained in model fitting) from \(N(\alpha ^{*m} + \beta ^{*m}\theta _{2}^N,\sigma _e^{*m2})\) , where \(\alpha ^{*m}, \beta ^{*m}, \sigma _e^{*m}\) represent draws from the posteriors of either the full-pooling, no-pooling, or partial-pooling models. For our purposes, subgroup-specific parameters were used when trials were simulated from the same subgroup if using no-pooling or partial-pooling.
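The PPD construction just described is straightforward to sketch. In the snippet below the posterior draws are simulated placeholders (in a real analysis they would come from the fitted meta-regression model, and all numeric values here are hypothetical); only the final Gaussian draw per iteration is the mechanism from the text:

```python
import random

random.seed(0)

M = 4000            # number of posterior draws
theta2_new = -1.0   # known true effect on the surrogate for the left-out trial

draws = []
for _ in range(M):
    alpha = random.gauss(0.02, 0.01)          # intercept draw (placeholder)
    beta = random.gauss(-0.30, 0.05)          # slope draw (placeholder)
    sigma_e = abs(random.gauss(0.05, 0.01))   # residual SD draw (placeholder)
    # One PPD draw per posterior draw: theta1 ~ N(alpha + beta*theta2, sigma_e^2)
    draws.append(random.gauss(alpha + beta * theta2_new, sigma_e))

draws.sort()
lo, med, hi = draws[int(0.025 * M)], draws[M // 2], draws[int(0.975 * M)]
print(round(med, 2), round(hi - lo, 2))
```

The PPD median and 95% interval summarized here are the quantities reported for left-out studies in the tables.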

In application analyses, it is only possible to obtain the PPD for the estimated effect on the clinical endpoint, which involves a procedure that takes into account not only uncertainty in the meta-regression posteriors, but also uncertainty due to sampling error in the treatment effect estimates. Section 3 of Additional file 1 provides further detail on the procedures used for prediction in our application analyses. We provide an overview here. For one part of our application analyses, we generated PPDs for trials of existing subgroups. Under full-pooling models, we directly used the single set of estimated meta-regression posteriors to map the effect on the surrogate to a predicted effect on the clinical endpoint. Under no-pooling and partial-pooling models, we used the appropriate subgroup-specific meta-regression posteriors estimated directly in model fitting (e.g., to make a prediction for a trial of subgroup \(i \in \{1,\dots ,I\}\) we directly use a draw from the posterior for \(\beta _i\) obtained through model fitting). In our second prediction exercise we generated PPDs for trials of a new subgroup. Only the full-pooling and partial-pooling models were used, because no-pooling models do not estimate parameters that allow the surrogate to be applied in a new subgroup. Again, under full-pooling models we used the single set of estimated meta-regression posteriors, which induces the assumption that the new study is fully exchangeable with those used for model fitting even though it pertains to a new subgroup. 
Under partial-pooling models we used draws from population subgroup distributions (e.g., we draw a new \(\beta _{\text {new}}\) from \(N(\beta ,\sigma _{\beta }^2)\) ) to map the effect on the surrogate to the predicted clinical effect (this process requires \(\sigma _{\beta }\) , which again may be influenced by the choice of priors in practical scenarios where the number of subgroups is small; this is what motivated our interest in the careful choice of priors). In this way, for all prediction exercises we used subgroup-specific meta-regression posteriors for prediction; under the partial-pooling approach, these were random draws from the population distribution when applying the surrogate to a new setting. When we extrapolate the trial-level association to a new subgroup, drawing from the population distribution for each meta-regression parameter induces an additional degree of uncertainty into the prediction. This can be seen as a reasonable compromise between applying the fitted full-pooling model, which ignores that the new study represents a new scenario, and not applying the surrogate at all (i.e., the no-pooling approach). As discussed when introducing the PP-RE approach, the reason we assume a between-subgroup distribution for \(\sigma _e\) is to facilitate drawing the subgroup-specific residual standard deviations needed in prediction for a trial of a new subgroup.
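The contrast between the two prediction modes above can be sketched as follows. All numeric values are hypothetical posterior draws, and the residual draw is omitted for brevity; the point is simply that drawing subgroup-specific parameters from the population distributions widens the PPD for a trial of a new subgroup:

```python
import random
import statistics

random.seed(1)

M = 4000
theta2 = -1.0  # treatment effect on the surrogate in the future trial

pred_existing, pred_new = [], []
for _ in range(M):
    # Hypothetical posterior draws of population-level parameters and
    # between-subgroup standard deviations (placeholder values).
    alpha, sd_alpha = random.gauss(0.0, 0.02), abs(random.gauss(0.05, 0.01))
    beta, sd_beta = random.gauss(-0.3, 0.03), abs(random.gauss(0.08, 0.02))

    # Prediction using a single set of meta-regression draws directly
    # (as when the trial's subgroup was part of model fitting).
    pred_existing.append(alpha + beta * theta2)

    # New subgroup: first draw subgroup-specific parameters from the
    # population distributions, e.g. beta_new ~ N(beta, sd_beta^2).
    alpha_new = random.gauss(alpha, sd_alpha)
    beta_new = random.gauss(beta, sd_beta)
    pred_new.append(alpha_new + beta_new * theta2)

# The extra between-subgroup draw inflates predictive uncertainty.
print(statistics.stdev(pred_existing) < statistics.stdev(pred_new))  # True
```

Because `sd_alpha` and `sd_beta` come from posteriors of between-subgroup standard deviations, their priors directly control how much wider the new-subgroup PPD becomes, which is the sensitivity explored in the application analyses.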

For simulation and applied analyses, we used the University of Utah Center for High Performance Computing Linux cluster. On the cluster, we used R version 4.0.3 for data preparation and for generating model summaries. The MCMC sampling algorithms for model fitting were implemented using RStan version 2.21.12 [ 15 ]. We utilized the Gelman-Rubin statistic to assess adequate convergence of chains, and the effective sample size to evaluate whether there were sufficient MCMC draws to utilize certain posterior summaries such as tail percentiles (as well as additional visual summaries such as rank plots) [ 16 , 17 ]. We used 10,000-20,000 MCMC iterations and 3 independent chains across all analyses. Finally, for the application analyses, the SAS NLMixed procedure was used to estimate treatment effects on the clinical and surrogate endpoints, standard errors, and within-study correlations within each study [ 18 ]. Example RStan code (PP-RE model) and R code (for simulating data) are provided in Section 4 of Additional file 1 .
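The convergence check mentioned above can be illustrated with a small, self-contained implementation of the split-chain Gelman-Rubin statistic (a simplified version of the diagnostic; production analyses would use the versions built into Stan/RStan or packages such as `posterior`/`ArviZ`):

```python
import random
import statistics

def split_rhat(chains):
    """Split-R-hat (Gelman-Rubin) diagnostic for a list of equal-length chains."""
    half = len(chains[0]) // 2
    # Split each chain in half, doubling the number of chains; this also
    # flags within-chain trends that plain R-hat can miss.
    splits = [c[:half] for c in chains] + [c[half:2 * half] for c in chains]
    means = [statistics.fmean(s) for s in splits]
    W = statistics.fmean(statistics.variance(s) for s in splits)  # within-chain
    B = half * statistics.variance(means)                         # between-chain
    var_hat = (half - 1) / half * W + B / half  # pooled variance estimate
    return (var_hat / W) ** 0.5

random.seed(2)
# Three well-mixed "chains" of iid draws: R-hat should be very near 1.0.
chains = [[random.gauss(0, 1) for _ in range(10000)] for _ in range(3)]
print(round(split_rhat(chains), 3))
```

Values materially above 1 (common cutoffs are 1.1 or, more stringently, 1.01) indicate that the chains have not mixed and more iterations or reparameterization are needed.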

Simulation study results

Contrasting different random effects approaches under Gaussian surrogate effects

Table  1 provides summaries of posterior distributions obtained from fitting models on simulation setups 1-4 (V1 and V2). When there was no heterogeneity in the true meta-regression parameters across subgroups (Setup 1), the PP-RE model resulted in limited additional uncertainty in posteriors relative to the FP-RE model, and also resulted in negligible additional bias via the posterior medians. Across Setups 2-4, where the strength of the association between effects on the clinical and surrogate endpoints varied across subgroups, use of the FP-RE model naturally obscured such heterogeneity for any given meta-regression parameter summarized. The NP-RE and PP-RE models produced subgroup-specific meta-regression posteriors that captured heterogeneity in the quality of the surrogate, but in every case the PP-RE model produced more precise posteriors than the NP-RE model. Benefits were especially evident when focusing on posteriors for the meta-regression slope. While the PP-RE model typically resulted in a small degree of bias, between-subgroup heterogeneity was potentially more evident due to improved precision. Precision gains under the PP-RE over the NP-RE model were also observed in the sensitivity analyses considered (Tables 2 and 3 of Additional file 1 ), including where there was heterogeneity in subgroup sizes. There was a larger degree of pooling away from the true parameter values of smaller subgroups under partial-pooling, but the PP-RE model still allowed for heterogeneity in posterior medians and 95% credible intervals to aid in understanding variations in surrogate quality across subgroups. One potential drawback of all approaches considered was that \(R^2\) posterior medians appeared biased in every scenario evaluated, reflecting the challenge of accurately estimating \(R^2\) with limited data. The average posterior median \(R^2\) under partial-pooling was more biased than under no-pooling in certain scenarios, such as where the surrogate was weak, possibly due to information sharing. The challenges associated with estimating \(R^2\) emphasize why it is important to report not only \(R^2\) point estimates but also credible intervals. The credible intervals under the PP-RE approach remained wide in subgroups where the surrogate was weak. Differences in model performance were also evident in evaluations of model-based prediction of treatment effects on the clinical endpoint (Table  2 ). Coverage of true clinical effects by 95% posterior prediction intervals was lower when using the FP-RE model, even where meta-regression parameters were truly the same across subgroups. The NP-RE model resulted in the highest coverage because of excessively wide prediction intervals, whereas prediction under the PP-RE model resulted in improved precision with adequate coverage.
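Coverage summaries like those above can be reproduced mechanically. The sketch below uses an idealized setup (not the paper's simulation design) in which each PPD is correctly calibrated, so the empirical coverage of 95% posterior prediction intervals should land near the nominal level:

```python
import random

random.seed(3)

# Each simulated trial has a true clinical effect; its PPD is centered at a
# noisy prediction of that truth, with spread matching the prediction error.
n_trials, M, s = 1000, 1000, 0.5
hits = 0
for _ in range(n_trials):
    truth = random.gauss(0.0, 1.0)             # true effect on clinical endpoint
    center = truth + random.gauss(0.0, s)      # prediction error
    ppd = sorted(random.gauss(center, s) for _ in range(M))
    lo, hi = ppd[int(0.025 * M)], ppd[int(0.975 * M)]
    hits += lo <= truth <= hi                  # does the 95% interval cover?

print(hits / n_trials)  # close to 0.95
```

Under-dispersed PPDs (analogous to the FP-RE model under heterogeneity) would push this proportion below 0.95, while over-dispersed PPDs (analogous to NP-RE) would push it toward 1 at the cost of uselessly wide intervals.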

Contrasting fixed vs. random effects partial-pooling models under non-Gaussian surrogate effects

Where the true treatment effects on the surrogate were non-Gaussian, the PP-FE model resulted in downward bias in the meta-regression intercept posteriors (e.g., via the posterior median), whereas the PP-RE model either resulted in no bias or a lesser degree of bias. The PP-FE model also resulted in downward bias in the meta-regression slope posteriors (regression dilution bias) in subgroups where the surrogate was simulated to be moderate-to-strong. We hypothesize that this downward bias was due to the absence of shrinkage of the true treatment effects on the surrogate (the “x-axis” variable in the meta-regression) towards one another. Because no common distribution is assumed for true effects on the surrogate across studies, the true effects are likely to be more dispersed than under the random effects model, where the Gaussian distributional assumption can result in pooling of true treatment effects on the surrogate across studies. Although the random effects model resulted in a small degree of upward bias in the meta-regression slope in subgroups where the surrogate was weak, the \(R^2\) posteriors were wider and their medians lower than under the fixed effects model. This means that the less optimistic \(R^2\) posteriors mitigated the risk of concluding a stronger surrogate than was true in reality. The implications of these biases in the meta-regression posteriors are also evident in the summaries of prediction in Table  2 . Coverage of the true treatment effect on the clinical endpoint by 95% posterior predictive intervals was poorer under the PP-FE model than under the PP-RE model, to the largest extent in subgroups where the surrogate was strongest, which is likely where prediction is of greatest interest.

Application analysis results

The primary goal of the application analysis was to compare meta-regression posteriors and PPDs obtained after fitting the PP-RE model with different priors. However, we also note that Fig. 7 in Additional file 1 indicates differences in the meta-regression slope estimates under the PP-RE and PP-FE models in the analysis where models were fit to disease-defined subgroups. The discrepancy in the posterior median between the two models grew larger for subgroups with a stronger meta-regression slope under the PP-RE model (under the PP-RE model, medians were -0.25, -0.30, and -0.35, whereas under the PP-FE model these were -0.27, -0.29, and -0.29).

Table  3 summarizes meta-regression slope posteriors from the application analyses (3 disease-defined subgroups, with 59 studies used for model fitting, in one analysis, and 7 intervention-defined subgroups, with 51 studies used for model fitting, in the other). Additional file 1 : Tables 5 and 6 contain posterior summaries for the full set of meta-regression parameters from these analyses. When there were three disease-defined subgroups, using increasingly narrow priors resulted in narrower posteriors not only for between-subgroup standard deviation parameters but also for the between-subgroup mean parameters (even when priors for between-subgroup means were left the same). However, priors could be narrowed considerably before the within-subgroup posteriors narrowed. In most cases, even the narrowest priors used did not meaningfully change the inference on subgroup-specific posteriors. When there were 7 subgroups, narrower priors again resulted in equivalent or narrower posteriors for between-subgroup means and standard deviations, but to a lesser extent than in the analysis with fewer subgroups. Similarly, the use of narrower priors resulted in little, if any, change in the within-subgroup posteriors under the options considered for intervention-defined subgroups.

Figures  1 and 2 illustrate the implications of the choice of priors on prediction for trials of a new subgroup or an existing subgroup. For concision, a subset of trials is displayed in the figures; the remaining results are displayed in Additional file 1 : Tables 7-12. First, consider the trials of novel subgroups. For every study, the PP-RE model resulted in wider PPDs than the FP-RE model. When there were fewer subgroups, predictive distributions for left-out studies were excessively and unrealistically wide when using completely diffuse priors under the PP-RE model. The use of constrained priors, especially those motivated by domain-specific reasoning (P3), resulted in the narrowest PPDs among those obtained, but these were still wider than those under the FP-RE model with diffuse priors. Increasingly constrained priors resulted in more realistic uncertainty in HRs relative to the use of diffuse priors. When predicting for a trial of a novel intervention class (Fig.  2 ), where more subgroups were available for model fitting, PPDs were narrower under the PP-RE approach (contrast the PPDs in Fig.  1 with those in Fig. 2 ). This could be because of improved inferential precision for parameters associated with between-subgroup variability when more subgroups are present. These results indicate the PP-RE model may be more suitable for prediction, inducing an appropriate degree of added uncertainty when predicting a clinical effect in a trial meaningfully different from those used to evaluate the surrogate. However, these results also suggest that PPDs can be excessively wide due to overly diffuse and unrealistic priors rather than the true quality of the surrogate or its applicability to a new setting. Next, when trials were of a subgroup available for model fitting, the summaries of PPDs under the PP-RE model were more robust to the choice of priors than prediction for studies of a new subgroup (even for subgroups with few trials). In our setting, predictive distributions were also similar in width under the PP-RE relative to the FP-RE model (evidenced by the 2.5 \(^{\text {th}}\) and 97.5 \(^{\text {th}}\) percentiles). By allowing subgroup-specific meta-regression parameters, the PP-RE model may thus increase accuracy and precision in the prediction of clinical effects for future trials of existing subgroups over use of the FP-RE model.

Figure 1

Posterior predictive median and 95% interval are summarized. FP-RE: Full-pooling random effects. PP-RE: Partial-pooling random effects. P1: Diffuse priors used in fitting the PP-RE model. P2: Constrained priors set 1 in fitting the PP-RE model. P3: Constrained priors set 2 (narrowest) in fitting the PP-RE model. Studies listed are described further in Additional file 1 . The “ESG” (existing subgroup) studies were used for model fitting. The “NSG” (new subgroup) studies were left out of model fitting

Figure 2

Discussion

Trial-level surrogate endpoint evaluations are often performed on collections of heterogeneous clinical trials. Standard methodology that yields estimates of a single set of meta-regression parameters may not be appropriate when trials meaningfully differ across pre-specified subgroups, and may also provide unrealistic precision in prediction of clinical effects in new studies that differ from those used to evaluate the surrogate. In this paper, we explored a class of models we refer to as “partial-pooling” models, in which subgroup-specific meta-regressions are assumed, yet between-subgroup distributions facilitate data-adaptive information sharing across subgroups. Partial-pooling models provide a framework both for prediction of treatment effects on the clinical endpoint in a trial that meaningfully differs from those used for the surrogate evaluation itself (i.e., is of a new subgroup) and for prediction in future studies of an existing subgroup. There are various challenges in the implementation of a partial-pooling approach, such as the choice of priors and of the distribution for the true treatment effects on the surrogate. We conducted analyses to help guide such decision making.

Under the scenarios considered (e.g., unless a given subgroup contains a large number of large trials, exceeding 30), our analyses indicated that fitting separate models for surrogate endpoint evaluation within subgroups (no-pooling) can result in excessive uncertainty in posteriors. We found that partial-pooling methods can be a practical solution with noteworthy benefits (in our analyses, we saw improved precision in posteriors with limited bias due to information sharing). If interest is in inference for subgroup-specific meta-regression posteriors, our results showed key differences in interpretation when using fixed versus random effects under the partial-pooling approach. In our analyses, the partial-pooling fixed effects variant produced downward bias in the meta-regression slope in subgroups of trials where the surrogate was strong, which translated to more biased prediction. The partial-pooling random effects approach did not produce such biases in subgroups where the surrogate was strong. We also did not see noteworthy biases under the partial-pooling random effects approach when the Gaussian distributional assumption on the true treatment effects on the surrogate was clearly violated.

A key theme of our results is that the posterior distributions of the meta-regression parameters within each subgroup under the partial-pooling random effects model were robust to a degree of narrowing of priors on between-subgroup parameters. Similarly, inferences which apply the meta-regressions fit under the partial-pooling model to estimate the posterior predictive distribution for the treatment effect on the clinical endpoint in a new trial were robust to the prior distributions when the new trial belonged to one of the subgroups included when fitting the meta-regression. Conversely, inferences for a new trial which did not belong to one of the subgroups of the prior trials could be highly dependent on the prior distributions, especially the priors on the between-subgroup standard deviations of the meta-regression parameters. Notably, when highly diffuse priors were used, the posterior predictive distributions for the new trial exhibited very high dispersion, indicating poor ability to extend the relationship between the treatment effects on the surrogate and clinical endpoints from the previous trials to the new trial. The extent to which the choice of priors influenced the dispersion of posterior predictive distributions for a trial of a new subgroup was greater when fewer subgroups were used in model fitting (e.g., 3 as opposed to 7 subgroups, as in our analyses). This suggests that when fitting partial-pooling models, not only overly constrained but also overly diffuse priors can unduly influence certain predictive analyses, and it is thus important to consider a strategy to identify more practical priors.

These quantitative findings are consistent with the general concept that the relationship between treatment effects on the surrogate and clinical endpoints observed in previously conducted trials can be reasonably applied to a new trial if at least one of the following three conditions hold: 1) there is strong evidence for a high-quality surrogate with a lack of heterogeneity in performance across a large number of subgroups representing an exhaustive array of intervention types and disease sub-classifications; 2) the new trial can be viewed as a member of the same subgroups used to evaluate the surrogate; 3) subject matter knowledge is sufficiently strong to support informative prior distributions, which mitigate heterogeneity in the meta-regression parameters between subgroups. This third condition appears related to the stress regulatory agencies place on the strength of evidence for a strong biological relationship between the surrogate and clinical endpoints. If the new trial is evaluating a novel treatment or disease subtype which is fundamentally distinct from any of the previous subgroups of trials, and subject matter knowledge cannot rule out heterogeneity in the meta-regression parameters between subgroups, application of the relationship between the surrogate and clinical endpoints observed in the prior trials to the new trial is tenuous. Of course, priors which drive the applicability of the meta-regression for prediction to a trial of a new subgroup can be tuned with multiple considerations in mind. In one regard, even without strong subject matter knowledge, basic logic can be used to narrow priors to some degree (such as for the meta-regression intercept, a log hazard ratio in our case, which is a commonly used metric and need not be expected to vary excessively). On the other hand, priors could be further constrained if there is strong subject matter knowledge indicating to do so, ideally from multiple stakeholders. 
The key point is that completely diffuse priors are likely to be highly impractical when employing partial-pooling models for surrogate evaluation, and the applicability of the surrogate should not depend on the excessive uncertainty imposed by such priors as opposed to priors that are realistic according to sound subject matter reasoning.

A noteworthy implication of our findings is that use of a partial-pooling model on a diverse collection of studies may be more useful than highly targeted surrogate evaluations on small subsets of studies. For example, there have been many evaluations of surrogates such as tumor response or progression free survival for highly specific tumor types in cancer [ 19 , 20 , 21 , 22 ]. However, there may be insufficient data in such settings to truly infer the quality of the surrogate. Partial-pooling models (with appropriately defined priors) fit to data sets with more tumor types, for example, may yield more useful information than fitting separate models within the small subgroups.

There are potential limitations to our analyses and findings. The use of Bayesian methods for surrogate evaluation is computationally demanding, and we thus considered a limited number of scenarios in our application and simulation analyses. There may also be many additional distributions that could provide further benefit over the Gaussian or fixed effects approaches we considered. For example, Bujkiewicz et al. showed potential benefits of using a t-distribution for certain terms [ 8 ]. Other strategies to refine priors may also be appropriate in other disease settings. Our analyses and discussion are embedded within a context where we initiate the analysis by assuming (through our priors) that there may be some heterogeneity in the meta-regression across subgroups, but that priors on terms related to between-subgroup heterogeneity can be narrowed to some degree to ensure the inference is not unduly influenced by unrealistically wide priors. An alternative approach would be to use priors which, to some degree, induce the assumption at the start of the analysis that there is no between-subgroup heterogeneity in the quality of the surrogate, forcing the data to provide strong evidence of heterogeneity for the meta-regression posteriors to differ at all across subgroups. For example, spike-and-slab priors could be considered in future work, if the use of such priors aligns with the analytical goals of a given surrogate evaluation.

It is also important to note that there are many approaches to trial-level surrogate endpoint evaluation. For example, Buyse et al. have proposed joint models that can be fit in a single-stage analysis to simultaneously estimate within and between-study surrogacy metrics [ 23 ]. While joint modeling strategies have a number of advantages, their uptake appears less common than two-stage approaches in practice [ 9 ]. Other authors have also used network meta-regression strategies for surrogate endpoint evaluations on collections of heterogeneous studies [ 24 ]. Finally, within the context of evaluating whether there is heterogeneity in trial-level associations, alternative model structures may be useful depending on the ultimate scientific question. For example, one might consider a single linear regression with interaction terms. One potential drawback to such an approach is that with increasing trial-level factors (e.g., subgroups), such models become increasingly complex, potentially over-parameterized, and may pose challenges for non-statisticians to interpret. On the other hand, an advantage of the partial-pooling approaches discussed is that these maintain the linear regression structure within subgroups, which is again an approach that is already familiar to many investigators.

Conclusions

The methods discussed in this paper are applicable to the two-stage approach often used to establish the trial-level validity of a surrogate endpoint. Because establishing trial-level surrogacy requires a collection of clinical trials, analysts are often confronted with limited data. One strategy to overcome such data limitations is to incorporate a broad collection of studies spanning various disease and therapy sub-categories. However, analyses of such data in, for example, chronic kidney disease have encouraged regulatory agencies to question whether surrogate performance varies across pre-specified and clinically motivated subgroups of trials defined by disease or intervention classes. Analyses that require sub-dividing the available trials into subgroups will only exacerbate the issues associated with model fitting on small amounts of data. We performed analyses showing that partial-pooling modeling approaches may improve the potential to infer the quality of the surrogate within subgroups of trials even on limited datasets. However, our analyses also showed that even the diffuse priors used for partial-pooling analyses can strongly influence the perceived quality of the surrogate as well as the ability to predict the treatment effect on the clinical endpoint. We discussed strategies that can be used to constrain the priors used for the analysis to obtain more realistic estimates of key parameters for surrogate endpoint evaluation. Ultimately, analyses of a surrogate endpoint could result in appropriately expanding the feasibility of trials in an entire disease area, or could lead to the use of an endpoint that is not ultimately useful for patients. Partial-pooling models should be considered for surrogate endpoint evaluation on heterogeneous collections of trials, but the choice of a given model and of the priors used to implement it should be handled rigorously.

Availability of data and materials

Data restrictions apply to the data used for the application analyses presented, to which we were given access under license for this manuscript. These data are not publicly available due to privacy or ethical restrictions. The programs used to generate the data for the simulation study are provided in the supplemental materials.

Abbreviations

  • CKD: Chronic kidney disease

  • GFR: Glomerular filtration rate

  • RE: Random effects

  • FE: Fixed effects

  • FP: Full-pooling

  • PP: Partial-pooling

  • PPD: Posterior predictive distribution

  • DM: Diabetes mellitus

  • GN: Glomerular disease

  • CVD: Cardiovascular disease

  • IG: Inverse-gamma

References

1. Thompson A, Smith K, Lawrence J. Change in estimated GFR and albuminuria as end points in clinical trials: a viewpoint from the FDA. Am J Kidney Dis. 2020;75(1):4–5.
2. Food and Drug Administration US. Guidance for industry: expedited programs for serious conditions - drugs and biologics. 2014. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/expedited-programs-serious-conditions-drugs-and-biologics. Accessed 1 Jan 2022.
3. Levey AS, Gansevoort RT, Coresh J, Inker LA, Heerspink HL, Grams M, et al. Change in albuminuria and GFR as end points for clinical trials in early stages of CKD: a scientific workshop sponsored by the National Kidney Foundation in collaboration with the US Food and Drug Administration and European Medicines Agency. Am J Kidney Dis. 2020;75(1):84–104.
4. Inker LA, Heerspink HJL, Tighiouart H, Levey AS, Coresh J, Gansevoort RT, et al. GFR slope as a surrogate end point for kidney disease progression in clinical trials: a meta-analysis of treatment effects of randomized controlled trials. J Am Soc Nephrol. 2019;30(9):1735–45.
5. Heerspink HJL, Greene T, Tighiouart H, Gansevoort RT, Coresh J, Simon AL, et al. Change in albuminuria as a surrogate endpoint for progression of kidney disease: a meta-analysis of treatment effects in randomised clinical trials. Lancet Diabetes Endocrinol. 2019;7(2):128–39.
6. Daniels MJ, Hughes MD. Meta-analysis for the evaluation of potential surrogate markers. Stat Med. 1997;16(17):1965–82.
7. Papanikos T, Thompson JR, Abrams KR, Stadler N, O C, Taylor R, et al. Bayesian hierarchical meta-analytic methods for modeling surrogate relationships that vary across treatment classes using aggregate data. Stat Med. 2020;39(8):1103–24.
8. Bujkiewicz S, Thompson JR, Spata E, Abrams KR. Uncertainty in the Bayesian meta-analysis of normally distributed surrogate endpoints. Stat Methods Med Res. 2017;26(5):2287–318.
9. Belin L, Tan A, De Rycke Y, Dechartres A. Progression-free survival as a surrogate for overall survival in oncology: a methodological systematic review. Br J Cancer. 2020;122(11):1707–14.
10. Riley RD, Abrams KR, Sutton AJ, Lambert PC, Thompson JR. Bivariate random-effects meta-analysis and the estimation of between-study correlation. BMC Med Res Methodol. 2007;7:3.
11. Riley RD. Multivariate meta-analysis: the effect of ignoring within-study correlation. J R Stat Soc Series A Stat Soc. 2009;172(4):789–811.
12. Jones HE, Ohlssen DI, Neuenschwander B, Racine A, Branson M. Bayesian models for subgroup analysis in clinical trials. Clin Trials. 2011;8(2):129–43.
13. Prasad V, Kim C, Burotto M, Vandross A. The strength of association between surrogate end points and survival in oncology: a systematic review of trial-level meta-analyses. JAMA Intern Med. 2015;175(8):1389–98. https://doi.org/10.1001/jamainternmed.2015.2829.
14. Vonesh E, Tighiouart H, Ying J, Heerspink HJL, Lewis J, Staplin N, et al. Mixed-effects models for slope-based endpoints in clinical trials of chronic kidney disease. Stat Med. 2019;38(22):4218–39.
15. RStan Development Team. RStan: the R interface to Stan. 2020. https://cran.r-project.org/web/packages/rstan/rstan.pdf. Accessed 1 Dec 2022.
16. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. New York: Chapman and Hall; 1995.
17. Vehtari A, Gelman A, Simpson D, Carpenter B, Burkner PC. Rank-normalization, folding, and localization: an improved Rhat for assessing convergence of MCMC (with discussion). Bayesian Anal. 2021;16(2):667–718. https://doi.org/10.1214/20-BA1221.
18. The SAS Institute. The NLMIXED procedure. 2015. https://support.sas.com/documentation/onlinedoc/stat/141/nlmixed.pdf. Accessed 1 Dec 2022.
19. Kataoka K, Nakamura K, Mizusawa J, Kato K, Eba J, Katayama H, et al. Surrogacy of progression-free survival (PFS) for overall survival (OS) in esophageal cancer trials with preoperative therapy: literature-based meta-analysis. Eur J Surg Oncol. 2017;43(10):1956–61.
20. Chen YP, Sun Y, Chen L, Mao YP, Tang LL, Li WF, et al. Surrogate endpoints for overall survival in combined chemotherapy and radiotherapy trials in nasopharyngeal carcinoma: meta-analysis of randomised controlled trials. Radiother Oncol. 2015;116(2):157–66.
21. Gharzai LA, Jiang R, Wallington D, Jones G, Birer S, Jairath N, et al. Intermediate clinical endpoints for surrogacy in localised prostate cancer: an aggregate meta-analysis. Lancet Oncol. 2021;22(3):402–10.
22. Michiels S, Pugliano L, Marguet S, Grun D, Barinoff J, Cameron D, et al. Progression-free survival as surrogate end point for overall survival in clinical trials of HER2-targeted agents in HER2-positive metastatic breast cancer. Ann Oncol. 2016;27(6):1029–34.
23. Buyse M, Molenberghs G, Paoletti X, Oba K, Alonso A, Elst WV, et al. Statistical evaluation of surrogate endpoints with examples from cancer clinical trials. Biom J. 2016;58(1):104–32.
24. Bujkiewicz S, Jackson D, Thompson JR, Turner RM, Stadler N, Abrams KR, et al. Bivariate network meta-analysis for surrogate endpoint evaluation. Stat Med. 2019;38(18):3322–41.

Acknowledgements

The support and resources from the Center for High Performance Computing at the University of Utah are gratefully acknowledged. We thank all investigators, study teams, and participants of the studies included in the "Analysis set 2: application analysis of CKD trials" and "Application analysis results" sections. Specific details of the studies used in our analyses are given in previous work by CKD-EPI [4, 5].

We also thank the following CKD-EPI investigators/collaborators representing their respective studies (study acronyms/abbreviations are listed in Table 13 of Additional file 1): AASK: Tom Greene; ABCD: Robert W. Schrier, Raymond O. Estacio; ADVANCE: Mark Woodward, John Chalmers, Min Jun; AIPRI (Maschio): Giuseppe Maschio, Francesco Locatelli; ALTITUDE: Hans-Henrik Parving, Hiddo JL Heerspink; Bari (Schena): Francesco Paolo Schena, Manno Carlo; Bologna (Zucchelli): Pietro Zucchelli, Tazeen H Jafar; Boston (Brenner): Barry M. Brenner; canPREVENT: Brendan Barrett; Copenhagen (Kamper): Anne-Lise Kamper, Svend Strandgaard; CSG (Lewis 1992, 1993): Julia B. Lewis, Edmund Lewis; EMPA-REG OUTCOME: Christoph Wanner, Maximilian von Eynatten; Fukuoka (Katafuchi): Ritsuko Katafuchi; Groningen (van Essen): Paul E. de Jong, GG van Essen, Dick de Zeeuw; Guangzhou (Hou): Fan Fan Hou, Di Xie; HALT-PKD: Arlene Chapman, Vicente Torres, Alan Yu, Godela Brosnahan; HKVIN: Philip KT Li, Kai-Ming Chow, Cheuk-Chun Szeto, Chi-Bon Leung; IDNT: Edmund Lewis, Lawrence G. Hunsicker, Julia B. Lewis; Lecco (Pozzi): Lucia Del Vecchio, Simeone Andrulli, Claudio Pozzi, Donatella Casartelli; Leuven (Maes): Bart Maes; Madrid (Goicoechea): Marian Goicoechea, Eduardo Verde, Ursula Verdalles, David Arroyo; Madrid (Praga): Fernando Caravaca-Fontán, Hernando Trujillo, Teresa Cavero, Angel Sevillano; MASTERPLAN: Jack FM Wetzels, Jan van den Brand, Peter J Blankestijn, Arjan van Zuilen; MDRD Study: Gerald Beck, Tom Greene, John Kusek, Garabed Eknoyan; Milan (Ponticelli): Claudio Ponticelli, Giuseppe Montagnino, Patrizia Passerini, Gabriella Moroni; ORIENT: Fumiaki Kobayashi, Hirofumi Makino, Sadayoshi Ito, Juliana CN Chan; Hong Kong Lupus Nephritis (Chan): Tak Mao Chan; REIN: Giuseppe Remuzzi, Piero Ruggenenti, Aneliya Parvanova, Norberto Perico; RENAAL: Dick De Zeeuw, Hiddo JL Heerspink, Barry M. Brenner, William Keane; ROAD: Fan Fan Hou, Di Xie; Rochester (Donadio): James Donadio, Fernando C. Fervenza; SHARP: Colin Baigent, Martin Landray, William Herrington, Natalie Staplin; STOP-IgAN: Jürgen Floege, Thomas Rauen, Claudia Seikrit, Stefanie Wied; Strasbourg (Hannedouche): Thierry P. Hannedouche; SUN-MACRO: Julia B. Lewis, Jamie Dwyer, Edmund Lewis; Texas (Toto): Robert D. Toto; Victoria (Ihle): Gavin J. Becker, Benno U. Ihle, Priscilla S. Kincaid-Smith.

The study was funded by the National Kidney Foundation (NKF). NKF has received consortium support from the following companies: AstraZeneca, Bayer, Cerium, Chinook, Boehringer Ingelheim, CSL Behring, Novartis and Travere. This work also received support from the Utah Study Design and Biostatistics Center, with funding in part from the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR002538.

Author information

Authors and affiliations

Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA

Willem Collier

Department of Population Health Sciences, University of Utah School of Medicine, Salt Lake City, UT, USA

Benjamin Haaland & Tom Greene

Pentara Corporation, Millcreek, UT, USA

Benjamin Haaland

Division of Nephrology, Tufts University Medical Center, Boston, MA, USA

Lesley A. Inker

Department of Clinical Pharmacy and Pharmacology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands

Hiddo J.L. Heerspink

Contributions

Willem Collier was the primary author for all sections of the manuscript, worked on the design and implementation of all analyses, wrote the programs used for analyses and results reporting, and generated summaries. Tom Greene contributed to writing and editing in all sections throughout the manuscript and helped in the design of all analyses. Benjamin Haaland contributed to writing and editing in all sections throughout the manuscript and helped in the design of all analyses. Lesley Inker contributed to writing and editing of the introduction, application analysis, and discussion sections, and helped to design the application analyses. Hiddo Heerspink contributed to writing and editing of the introduction, application analysis, and discussion sections, and helped to design the application analyses.

Corresponding author

Correspondence to Willem Collier .

Ethics declarations

Ethics approval and consent to participate

The analyses presented in this study were deemed exempt from review by the Tufts Medical Center Institutional Review Board. The research presented in this paper complies with all relevant ethical regulations (Declaration of Helsinki). Only aggregated data from previously conducted clinical trials are presented. The protocol and consent documents of the individual trials used were reviewed and approved by each trial’s participating centers’ institutional review board, and informed consent was provided by all participants of the studies for which results were aggregated for our analyses.

Consent for publication

Not applicable.

Competing interests

Willem Collier received funding from the National Kidney Foundation for his graduate studies while working on aspects of the submitted work.

Benjamin Haaland is a full time employee of Pentara Corporation and consults for the National Kidney Foundation.

Hiddo JL Heerspink received grant support from the National Kidney Foundation to his institute and is a consultant for AbbVie, AstraZeneca, Bayer, Boehringer Ingelheim, Chinook, CSL Behring, Dimerix, Eli Lilly, Gilead, GoldFinch, Janssen, Merck, Novo Nordisk and Travere Pharmaceuticals.

Lesley A Inker reports funding from the National Institutes of Health, National Kidney Foundation, Omeros, Chinook, and Reata Pharmaceuticals for research and contracts to Tufts Medical Center; consulting agreements to Tufts Medical Center with Tricida; and consulting agreements with Dimerix.

Tom Greene reports grant support from the National Kidney Foundation, Janssen Pharmaceuticals, Durect Corporation, and Pfizer, and statistical consulting for AstraZeneca, CSL, and Boehringer Ingelheim.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Collier, W., Haaland, B., Inker, L. et al. Comparing Bayesian hierarchical meta-regression methods and evaluating the influence of priors for evaluations of surrogate endpoints on heterogeneous collections of clinical trials. BMC Med Res Methodol 24, 39 (2024). https://doi.org/10.1186/s12874-024-02170-0

Received: 13 September 2023

Accepted: 04 February 2024

Published: 16 February 2024

DOI: https://doi.org/10.1186/s12874-024-02170-0

Keywords

  • Surrogate endpoint
  • Meta-regression
  • Bayesian hierarchical modeling

BMC Medical Research Methodology

ISSN: 1471-2288

Department of Languages, Literature and Communication

Comparative literature

What does comparative literature mean in a rapidly changing world?

The Comparative Literature section at Utrecht University is an open community of internationally recognised scholars who work with literature in several languages and media. We consider what comparison means in a rapidly changing world, using multi-temporal, multilingual, intermedial, and critical frameworks. 

As a group, we carry out our research and teaching with a broad understanding of the literary medium and value innovative approaches, such as:

  • literature as a (trans)cultural medium
  • cultural memory 
  • creativity- and comparative media studies
  • post/colonial studies 
  • digital-, environmental-, and medical humanities 
  • reader- and publishing studies, the sociology of literature 

We offer a broad programme cutting through different historical periods, from the classical ages (in Europe, Eurasia, China, and India) to the present. Within this comprehensive setting, our staff members bring specific expertise in the fields of Anglophone and European, Chinese and Japanese, African, South Asian, and Eurasian literature. Their research ranges from books to bodies, trauma to translation, decoloniality to diaspora, (critical) ecology to e-cultures, and animals to activism.

More information 

Reach out to us if you are looking for a dissertation supervisor, or seek our expertise or collaboration. Visit the personal websites of our staff members by clicking on their names listed below, or visit the site of our research institute.

See also our programme websites:

  • BA Literary Studies
  • MA Literature Today
  • RMA Comparative Literary Studies

Staff members

dr. L. (Lida) Amiri

  • Diaspora and Transnational Studies, Translational Research, Twentieth Century, Twenty-first Century

dr. P.A.L. Bijl

Tashina Blom MA

  • Political Activism and Rhetoric, Literary Theory, History of Feminism, Cultural Theory, Cultural Memory

dr. Michela Borzaga

  • Critical Theory, Postcolonial Studies, Literature and Memory, Modernism, Cultural Memory, Gender and Literature

dr. Frank Brandsma

  • Medieval French, Medieval Culture, Medieval Literature, Middle Dutch, Historical Fiction

prof. dr. Kiene Brillenburg Wurth

  • Literature and (New) Media, Intermediality, Creative Humanities, Creativity, Philosophy of Art, Literary Theory, Music, Aesthetics

dr. Saskia Bultman

  • History of Science, History of Medicine, Gender and Sexuality, History of Identities, Nineteenth and Twentieth Century, Women's History, Biographies and Life Stories, Literature

dr. Olivia da Costa Fialho

  • Empirical Research, Literary Theory, Didactics of Literature, Phenomenology, Digital Humanities, Reading, Bridging Science and Humanities Research, Contemporary Literature, Early Modern Literature and Culture

dr. Kári Driscoll

  • Animal studies, Posthumanism, German Literature, Literary Translation, Literary Theory, Modernism, Literary (Post)Modernism, Comparative Literature Studies, Literary Criticism, Environmental Humanities, Contemporary Literature

A. (Arnab) Dutta MA

dr. Sophie van den Elzen

  • Antislavery and Abolitionism, Cultural History, History of Feminism, Literature and Memory, Literary Theory, Nineteenth and Twentieth Century, Political Activism and Rhetoric, Magazine Studies, Transnationalism

dr. Femi Eromosele

dr. Leila Essa

  • Comparative Literature Studies, Anglophone Indian Literature, German Literature and Culture, Contemporary Literature, Postcolonial Studies, Literature and Space, Literature and Memory, Literature and (New) Media, Social Justice, Postmigrant Authorship

dr. Susanne Ferwerda

  • Comparative Literature Studies, Climate Change, Gender and Literature, Feminist Literary Criticism, New Materialism, Postcolonial Studies, Queer Theory, Literary Theory, Australia, Modern Dutch Literature, Contemporary Literature, Migration

T.W. (Tom) Hedley MA

  • Literary (Post)Modernism, Attitude towards Science and Mathematics, Literature and Space, German Literature and Culture, Irish Literature, Narratology, Austrian Literature, Mathematical Thinking, Life Writing, Literature and Memory

Tessel Janse

dr. Flore Janssen

  • Gender Studies, Political Activism, Archives, Literature in the 19th / 20th Century

dr. Birgit Kaiser

  • Literary Theory, Feminist Literary Criticism, Postcolonial Studies, Decoloniality, New Materialism, Aesthetics, Literature in the 19th / 20th Century, Contemporary Literature, French Culture and Literature, German Literature

dr. Susanne Knittel

  • Disability Studies, German Literature and Culture, Italian Literature, Italian Culture, Literary Theory, Modernism, Perpetrator Studies, Posthumanism

Jason Mariotis

dr. Müge Özoglu

  • Literary Theory, Comparative Literature Studies, Gender and Sexuality, Queer Theory, Feminist Literary Criticism, Contemporary Literature

prof. dr. Ann Rigney

  • Collective Memory, Historical Fiction, Historical Theory, Intermediality, Literary Theory, Nationalism and Transnationalism, Reception Studies, Narratology, Walter Scott, Social Movements

dr. Jeroen Salman

  • History of the Book, Children's Literature, Media Studies, Popular Culture, Relation between Literature and Science, Digital Humanities, Early Modern Cultural History

dr. Carolina Sanchez-Jaegher

  • Ethics, Feminist Epistemology, Philosophy of Human Rights, Latin America

dr. Merve Tabur

  • Environmental Humanities, Climate Change, Middle Eastern Studies, Speculative fiction, Critical Theory, Feminist Theories, Postcolonial Studies

dr. Clara Vlessing

  • Literary Theory, Collective Memory, Social Movements, Feminist Literary Criticism, Biographies and Life Stories, Creative Writing

The Visual Memory of Protest

Special section the minnesota review: Mobilizing Creativity

Poetry as protest: Iranian poet spoke out for women’s self-determination 100 years ago

Ann Rigney formally admitted to the Royal Irish Academy

NWO Veni grants for economic historian Selin Dilli, literary scholars Leila Essa and Mia You, and Islamic studies scholar Mehrdad Alipour

Utrecht University Heidelberglaan 8 3584 CS Utrecht The Netherlands Tel. +31 (0)30 253 35 50

COMMENTS

  1. Comparative research

    Comparative research, simply put, is the act of comparing two or more things with a view to discovering something about one or all of the things being compared. This technique often utilizes multiple disciplines in one study. When it comes to method, the majority agreement is that there is no methodology peculiar to comparative research. [1]

  2. Chapter 10 Methods for Comparative Studies

    Chapter 10 of this book provides an overview of the methods for comparative studies in pharmacology, including the design, analysis, and interpretation of experiments and clinical trials. It also discusses the advantages and limitations of different types of comparisons, such as placebo, active, and dose comparisons. This chapter is a useful resource for researchers and students who want to ...

  3. (PDF) A Short Introduction to Comparative Research

    Comparative research is more of a perspective or orientation than a separate research technique (Ragin & Rubinson, 2009). A comparative study is a kind of method that analyzes phenomena and then ...

  4. Comparative Research Methods

    Comparative research in communication and media studies is conventionally understood as the contrast among different macro-level units, such as world regions, countries, sub-national regions, social milieus, language areas and cultural thickenings, at one point or more points in time.

  5. 15

    What makes a study comparative is not the particular techniques employed but the theoretical orientation and the sources of data. All the tools of the social scientist, including historical analysis, fieldwork, surveys, and aggregate data analysis, can be used to achieve the goals of comparative research. So, there is plenty of room for the ...

  6. Comparative Studies

    Comparative method is a process of analysing differences and/or similarities between two or more objects and/or subjects. Comparative studies are based on research techniques and strategies for drawing inferences about causation and/or association of factors that are similar or different between two or more subjects/objects.

  7. Design, Analysis and Interpretation of Method-Comparison Studies

    This paper reviews the methodology of a method-comparison study to assist the clinician with the conduct and evaluation of such studies. Temperature data from one subject are used to illustrate the procedures. ... This work was partially supported by grant R15-NR04488 from the National Institute of Nursing Research, National Institutes of Health.

  8. Between context and comparability: Exploring new solutions for a

    Qualitative methods are frequently used in comparative studies within the field of higher education research (Kosmützky, 2016). Nevertheless, not many methodological reflections that can help higher education researchers to cope with the methodological issues of qualitative comparative studies are available within the knowledge base of the field.

  9. Comparison in Qualitative Research

    However, with either logic of comparison, three dangers merit attention: decontextualization, commensurability, and ethnocentrism. One promising research heuristic that attends to different logics of comparison while avoiding these dangers is the comparative case study (CCS) approach. CCS entails three axes of comparison.

  10. Comparative Case Studies: Methodological Discussion

    In the past, comparativists have oftentimes regarded case study research as an alternative to comparative studies proper. At the risk of oversimplification: methodological choices in comparative and international education (CIE) research, from the 1960s onwards, have fallen primarily on either single country (small n) contextualized comparison, or on cross-national (usually large n, variable ...

  11. Comparison in Scientific Research

    A study of chimpanzees paved the way for comparison to be recognized as an important research method. Later, Charles Darwin and others used this comparative research method in the development of the theory of evolution. identifying similarities and differences. a procedure, a process, a systematic way of doing something.

  12. Comparative Research, Higher Education

    Within and along with the field of higher education research, comparative research has developed continuously from the 1960s onward and has played an important role for its evolution (Teichler 1996).During the 1970s, following the first wave of university expansion in different countries, higher education research became an institutionalized field of studies in various countries around the world.

  13. Methods in Comparative Effectiveness Research

    Methods in Comparative Effectiveness Research is a comprehensive review of the concepts, methods, and applications of comparative effectiveness research (CER) in health care. It covers the history, principles, challenges, and controversies of CER, as well as the analytical techniques and data sources for conducting systematic reviews and meta-analyses. It also provides examples of CER studies ...

  14. Comparative Research Methods

    Research goals. Comparative communication research is a combination of substance (specific objects of investigation studied in different macro-level contexts) and method (identification of differences and similarities following established rules and using equivalent concepts).

  15. PDF The Comparative approach: theory and method

    ideas about what the comparative approach is in terms of a scientific undertaking. In addition, we shall argue in Section 2.2 that one can distinguish in comparative politics a 'core subject' that enables us to study the relationship between 'politics and society' ...

  16. 5 Comparative Studies

    A third category of comparative study involved a comparison to some form of externally normed results, such as populations taking state, national, or international tests or prior research assessment from a published study or studies.

  17. The state and 'field' of comparative higher education

    Continued efforts to develop comparative studies in higher education are essential to contribute to thoughtful and rigorous research at all levels - local, national, and global. A mature field of comparative higher education could therefore help shape and strengthen future higher education research.

  18. 5. Comparison in Scientific Research

    Anyone who has stared at a chimpanzee in a zoo (Figure 1) has probably wondered about the animal's similarity to humans. Chimps make facial expressions that resemble humans, use their hands in much the same way we do, are adept at using different objects as tools, and even laugh when they are tickled.

  19. Comparing and Contrasting in an Essay

    For example, a literature review involves comparing and contrasting different studies on your topic, and an argumentative essay may involve weighing up the pros and cons of different arguments. ... Let's say your research involves the competing psychological approaches of behaviorism and cognitive ...

  20. A cross-sectional and population-based study from primary care ...

    Current research on post-COVID-19 conditions (PCC) has focused on hospitalized COVID-19 patients, and often lacks a comparison group. This study assessed the prevalence of PCC in non-hospitalized ...

  21. Statistical Methods for Comparative Studies

    researcher. Throughout the book we develop for the applied research worker a basic understanding of the problems and techniques and avoid highly mathematical presentations in the main body of the text. Overview of the Book: The first five chapters discuss the main conceptual issues in the design and analysis of comparative studies.

  22. Comparative effectiveness research for the clinician researcher: a

    Health outcome research with this study design involves the placebo being non-treatment or a 'sham' treatment option. Traditionally, comparative effectiveness research is conducted following completion of a Phase III placebo control trial [2-4]. It is possible that comparative effectiveness research might not determine whether one ...

  23. Blinded Versus Unblinded Review: A Field Study Comparing the Equity of

    In this paper, the authors evaluate the equity of blinded and unblinded review in a field experiment during peer-review of submissions for an academic conference. The authors administer their field experiment in the context of reviewing the 530 submissions for the Society for Judgment and Decision Making's 39th Annual Conference.

  24. Functional outcomes of different surgical treatments for common

    This study aims to assess the recovery patterns and factors influencing outcomes in patients with common peroneal nerve (CPN) injury. This retrospective study included 45 patients with CPN injuries treated between 2009 and 2019 in Jing'an District Central Hospital. The surgical interventions were categorized into three groups: neurolysis (group A; n = 34 patients), nerve repair (group B; n ...

  25. Full article: Predictors of Compassion Fatigue and Compassion

    Research design. This is a cross-sectional study that used survey methodology for data collection. The study is of a comparative nature, and comparisons have been made among different categories of respondents. The analytical methodology followed is predominantly correlational, and the study uses a descriptive design.

  26. Comparing Bayesian hierarchical meta-regression methods and evaluating

    We compare analyses using the models discussed above on a set of 66 CKD studies. Data from these studies was collected by the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI), an international research consortium [3, 4]. Evaluations of GFR slope on this collection of studies have been described extensively [3, 4].

  27. Comparative Literature

    The Comparative Literature section at Utrecht University is an open community of internationally recognised scholars who work with literature in several languages and media. We consider what comparison means in a rapidly changing world, using multi-temporal, multilingual, intermedial, and critical frameworks. We offer a broad programme cutting ...