Theory and Implementation of Linear Regression



Open Access

Peer-reviewed

Research Article

Review of guidance papers on regression modeling in statistical series of medical journals

Roles Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

* E-mail: christine[email protected] (CW); [email protected] (GR)

Affiliations Institute of Biometry and Clinical Epidemiology, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Charité—Universitätsmedizin Berlin, Berlin, Germany, Center for Medical Statistics, Informatics and Intelligent Systems, Section for Clinical Biometrics, Medical University of Vienna, Vienna, Austria

Roles Data curation, Formal analysis, Investigation, Methodology, Writing – review & editing

Affiliations Institute of Biometry and Clinical Epidemiology, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Charité—Universitätsmedizin Berlin, Berlin, Germany, School of Business and Economics, Emmy Noether Group in Statistics and Data Science, Humboldt-Universität zu Berlin, Berlin, Germany

Roles Data curation, Formal analysis, Investigation, Writing – review & editing

Affiliation Institute of Biometry and Clinical Epidemiology, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Charité—Universitätsmedizin Berlin, Berlin, Germany

Roles Validation, Writing – review & editing

Affiliation School of Business and Economics, Emmy Noether Group in Statistics and Data Science, Humboldt-Universität zu Berlin, Berlin, Germany

Roles Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing

Affiliation Faculty of Medicine and Medical Center, Institute of Medical Biometry and Statistics, University of Freiburg, Freiburg, Germany

Affiliation Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands

Roles Conceptualization, Funding acquisition, Investigation, Supervision, Validation, Writing – review & editing

Affiliation Center for Medical Statistics, Informatics and Intelligent Systems, Section for Clinical Biometrics, Medical University of Vienna, Vienna, Austria

Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

¶ Membership of the topic group 2 of the STRATOS initiative is listed in the Acknowledgments.

  • Christine Wallisch, 
  • Paul Bach, 
  • Lorena Hafermann, 
  • Nadja Klein, 
  • Willi Sauerbrei, 
  • Ewout W. Steyerberg, 
  • Georg Heinze, 
  • Geraldine Rauch, 
  • on behalf of topic group 2 of the STRATOS initiative

  • Published: January 24, 2022
  • https://doi.org/10.1371/journal.pone.0262918

Although regression models play a central role in the analysis of medical research projects, there still exist many misconceptions on various aspects of modeling, leading to faulty analyses. Indeed, the rapidly developing statistical methodology and its recent advances in regression modeling do not seem to be adequately reflected in many medical publications. This problem of knowledge transfer from statistical research to application was identified by some medical journals, which have published series of statistical tutorials and (shorter) papers mainly addressing medical researchers. The aim of this review was to assess the current level of knowledge with regard to regression modeling contained in such statistical papers. We searched for target series by a request to international statistical experts. We identified 23 series including 57 topic-relevant articles. Within each article, two independent raters analyzed the content by investigating 44 predefined aspects of regression modeling. We assessed to what extent the aspects were explained and whether examples, software advice, and recommendations for or against specific methods were given. Most series (21/23) included at least one article on multivariable regression. Logistic regression was the most frequently described regression type (19/23), followed by linear regression (18/23), Cox regression and survival models (12/23) and Poisson regression (3/23). Most general aspects of regression modeling, e.g. model assumptions, reporting and interpretation of regression results, were covered. We did not find many misconceptions or misleading recommendations, but we identified relevant gaps, in particular with respect to addressing nonlinear effects of continuous predictors, model specification and variable selection. Specific recommendations on software were rarely given. Statistical guidance should be developed for nonlinear effects, model specification and variable selection to better support medical researchers who perform or interpret regression analyses.

Citation: Wallisch C, Bach P, Hafermann L, Klein N, Sauerbrei W, Steyerberg EW, et al. (2022) Review of guidance papers on regression modeling in statistical series of medical journals. PLoS ONE 17(1): e0262918. https://doi.org/10.1371/journal.pone.0262918

Editor: Tim Mathes, Witten/Herdecke University, GERMANY

Received: June 28, 2021; Accepted: January 8, 2022; Published: January 24, 2022

Copyright: © 2022 Wallisch et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data was collected within the review and is available as supporting information S6.

Funding: CW: I-4739-B, Austrian Science Fund, https://www.fwf.ac.at/en/; LH: RA 2347/8-1, German Research Foundation, https://www.dfg.de/en/; WS: SA 580/10-1, German Research Foundation, https://www.dfg.de/en/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Knowledge transfer from the rapidly growing body of methodological research in statistics to application in medical research does not always work as it should [ 1 ]. Possible reasons for this problem are the lack of guidance and that not all statistical analyses are conducted by statistical experts but often by medical researchers who may or may not have a solid statistical background. Applied researchers cannot be aware of all statistical pitfalls and the most recent developments in statistical methodology. Keeping up is already challenging for a professional biostatistical researcher, who is often restricted to an area of main interest. Moreover, articles on statistical methodology are often written in a rather technical style making knowledge transfer even more difficult. Therefore, there is a need for statistical guidance documents and tutorials written in more informal language, explaining difficult concepts intuitively and with illustrative educative examples. The international STRengthening Analytical Thinking for Observational Studies (STRATOS) initiative ( http://stratos-initiative.org ) aims to provide accessible and accurate guidance documents for relevant topics in the design and analysis of observational studies [ 1 ]. Guidance is intended for applied statisticians and other medical researchers with varying levels of statistical education, experience and interest. Some medical journals are aware of this situation and regularly publish isolated statistical tutorials and shorter articles or even whole series of articles with the intention to provide some methodological guidance to their readership. Such articles and series can have a high visibility among medical researchers. Although some of the articles are short notes or rather introductory texts, we will use the phrase ‘statistical tutorial’ for all articles in our review.

Regression modeling plays a central role in the analysis of many medical studies, in particular of observational studies. More specifically, regression model building involves aspects such as selecting a model type that matches the type of outcome variable, selecting explanatory variables to include in the model, choosing an adequate coding of the variables, deciding how flexibly the association of continuous variables with the outcome should be modeled, planning and performing model diagnostics, model validation and model revision, reporting the model, and describing how well differences in the outcome can be explained by differences in the covariates. Some of the choices made during model building will strongly depend on the aim of modeling. Shmueli (2010) [2] distinguished three conceptual modeling approaches: descriptive, predictive and explanatory modeling. In practice these aims are often not well clarified, leading to confusion about which specific approach is useful for the modeling problem at hand. This confusion, and an ever-growing body of literature on regression modeling, may explain why a common state of the art is still difficult to define [3]. However, not all studies require an analysis with the most advanced techniques, and there is a need for guidance for researchers without a strong background in statistical methodology, who might be "medical students or residents, or epidemiologists who completed only a few basic courses in applied statistics" according to the definition of level-1 researchers by the STRATOS initiative [1].

If suitable guidance for level-1 researchers were available in peer-reviewed journals, many misconceptions about regression model building could be avoided [4-6]. These researchers need to be informed about methods that are easily implemented, and they need to know the strengths and weaknesses of common approaches [3]. Suitable guidance should also point to possible pitfalls, elaborate on dos and don'ts in regression analyses, and provide software recommendations and understandable code for different methods and aspects. In this review, we focused on low-dimensional regression models, where the sample size exceeds the number of candidate predictors. Moreover, we did not specifically address the field of causal inference, which goes beyond classical regression modeling.

So far, it is unclear which aspects of regression modeling are already well covered by related tutorials and where gaps remain. Furthermore, suitable tutorial papers may have been published but remain unknown to (nearly all) clinicians and are therefore widely ignored in their analyses.

The objective of this review was to provide an evidence-based assessment of the extent to which regression modeling has been covered by series of statistical tutorials published in medical journals. Specifically, we sought to define a catalogue of important aspects of regression modeling, to identify series of statistical tutorials in medical journals, and to evaluate which aspects were treated in the identified articles and at which level of sophistication. We put a particular focus on the choice of the regression model type, on variable selection, and, for continuous variables, on the functional form. Furthermore, this paper provides an overview that helps to inform a broad audience of medical researchers about the availability of suitable papers written in English.

The remainder of this review is organized as follows: In the next section, the review protocol is described. Subsequently, we summarize the results of the review by means of descriptive measures. Finally, we discuss implications of our results suggesting potential topics for future tutorials or entire series.

Material and methods

The protocol of this review, describing the detailed design, was published by Bach et al. (2020) [7]. Here, we summarize its main characteristics.

Eligibility criteria

First, we identified series of statistical tutorials and papers published in medical journals with a target audience mainly or exclusively consisting of medical researchers or practitioners. Second, we searched for topic-relevant articles on regression modeling within these series. Journals with a purely theoretical, methodological or statistical focus were not considered. We included medical journals only if they were available in English, since this implies high international impact and broad visibility. Moreover, the series had to comprise at least five articles, including at least one topic-relevant article. We focused on statistical series only, since we believed that entire series have higher impact and visibility than isolated articles.

Sources of information & search strategy

After conducting a pilot study for a systematic search for series of statistical tutorials, we had to adapt our search strategy, since sensitive keywords to identify statistical series could not be found. Therefore, we consulted more than 20 members of the STRATOS initiative via email in spring 2018 for suggestions of statistical series addressing medical researchers. We also asked them to forward this request to colleagues, which resembles snowball sampling [8, 9]. This call was repeated at two international STRATOS meetings in summer 2018 and in 2019. The search was closed on June 30, 2019. Our approach also included elements of respondent-driven sampling [10], by offering collaboration and co-authorship in case of a relevant contribution to the review. In addition, we included several series that were proposed by a reviewer during the peer-review process of this manuscript and that were published by the end of June 2019, to be consistent with the original request.

Data management & selection process

The list of all resulting statistical series suggested is available as S1 File .

Two independent raters selected relevant statistical series from the pool of candidate series by applying the inclusion criteria outlined above.

An article within a series was considered topic-relevant if the title included one of the following keywords: regression, linear, logistic, Cox, survival, Poisson, multivariable, multivariate, or if the title suggested that the main topic of the article was statistical regression modeling. Both raters decided on the topic-relevance of an article independently and resolved discrepancies by discussion. To facilitate the selection of relevant statistical series, we designed a report form called the inclusion form (S2 File).

Data collection process

After the identification of relevant series and topic-relevant articles, a content analysis was performed on all topic-relevant articles using an article content form (S3 File). The article content form was filled in for every identified topic-relevant article by the two raters independently, and again discrepancies were resolved by discussion. The results of the completed article content forms were copied into a database for further quantitative analysis.

In total, 44 aspects of regression modeling were examined in the article content form (S3 File), related to four areas: type of regression model, general aspects of regression modeling, functional form of continuous predictors, and selection of variables. The 44 aspects cover topics of different complexity: some can be considered basic, others are more advanced. This is also noted in S3 File for orientation. We mainly focused on predictive and descriptive models and did not consider particular aspects attributed to etiological models.

For each aspect, we evaluated whether it was mentioned at all, and if yes, the extent of explanation (short = one sentence only / medium = more than one sentence to one paragraph / long = more than one paragraph) [ 7 ]. We recorded whether examples and software commands were provided, and if recommendations or warnings were given with respect to each aspect. A box for comments provided space to note recommendations, warnings and other issues. In the article content form, it was also possible to add further aspects to each area. A manual for raters was created to support an objective evaluation of the aspects ( S4 File ).

Summary measures & synthesis of results

This review was designed as an explorative study and uses descriptive statistics to summarize results. We calculated absolute and relative frequencies to analyze the 44 statistical aspects. We used stacked bar charts to describe the ordinal variable extent of explanation for each aspect. To structure the analysis, we grouped the aspects into the aforementioned areas: type of regression model, general aspects of regression modeling, determination of functional form for continuous predictors, and selection of variables.

We conducted the above analyses both article-wise and series-wise. In the article-wise analysis, each article was considered individually. For the series-wise analysis, the results from all articles in a series were pooled and each series was considered the unit of observation. This means that if an aspect was explained in at least one article, it counted for the entire series.

Risk of bias

The risk of bias from missing a series was addressed extensively in the protocol of this study [7, 11, 12]. Moreover, bias could result from the inclusion criterion requiring at least five articles per series, which may have led to a less representative set of series; we set this criterion to identify highly visible series. Bias could also result from the specific choice of the aspects of regression modeling to be screened. We tried to minimize this bias by allowing free-text entries that could later be combined into additional aspects.

This review has been written according to the PRISMA reporting guideline [13, 14]; see S1 Checklist. This review does not include patients or human participants. The data collected within the review are available in S1 Data.

Selection of series and articles

The initial query revealed 47 series of statistical tutorials (Fig 1 and S1 File). Of these 47 series, two were not published in a medical journal and five did not target an audience with low statistical knowledge; these series were excluded. Five series were excluded because they were not written in English, and ten because they did not comprise at least five articles. Further, we excluded three series because they did not contain any topic-relevant article. The list of series, with the reason for each exclusion, is found in S1 File. Finally, we included 23 series with 57 topic-relevant articles.

Fig 1. https://doi.org/10.1371/journal.pone.0262918.g001

Characteristics of the series

Each series contained between one and nine topic-relevant articles (two on average, Table 1). The variability of the average number of article pages per series illustrates that the extent of the articles differed widely (1 to 10.3 pages). Whereas the series Statistics Notes in the BMJ typically used a single page to discuss a topic, hence pointing only to the most relevant issues, there were longer papers of up to 16 pages [15, 16]. The series in the BMJ also spans the longest time period (1994-2018). Besides the series in the BMJ, only the Archives of Disease in Childhood and the Nutrition series started publishing papers in the last century. Fig 2 shows that most series were published during a short period, perhaps paralleling terms of office of an Editor.

Fig 2. https://doi.org/10.1371/journal.pone.0262918.g002

Table 1. We considered 44 aspects; see S3 File.

https://doi.org/10.1371/journal.pone.0262918.t001

The most informative series with respect to our pre-specified list of aspects was published in Revista Española de Cardiologia, which mentioned 35 aspects in two articles on regression modeling (Table 1). Similarly, Circulation and Archives of Disease in Childhood covered 31 and 30 aspects in three articles each. The number of articles and the years of publication varied across the series (Fig 2). Some series comprised only five articles, whereas the Statistics Notes of the BMJ published 68 short articles; this series was very successful, with some articles cited about 2,000 times. Almost all series covered multivariable regression in at least one article. The range of regression types varied across series. Most statistical series were published with the intention to improve the knowledge of their readership about how to apply appropriate methodology in data analyses and how to critically appraise published research [17-19].

Characteristics of articles

The top three articles covering the highest number of aspects (27 to 34 out of 44) on six to seven pages were published in Revista Española de Cardiologia, Deutsches Ärzteblatt International, and the European Journal of Cardio-Thoracic Surgery [20-22]. The article by Nuñez et al. [22] in Revista Española de Cardiologia covered the most popular regression types (linear, logistic and Cox regression) and not only explained general aspects but also gave insights into non-linear modeling and variable selection. Schneider et al. [20], in Deutsches Ärzteblatt International, covered all regression types that we considered in our review. The top-ranked article in the European Journal of Cardio-Thoracic Surgery [21] particularly focused on the development and validation of prediction models.

Explanation of aspects in the series

Almost all statistical series included at least one article that mentioned or explained multivariable regression (Table 1). Logistic regression was the most frequently described regression type, covered in 19 of 23 series (83%), followed by linear regression (78%). Cox regression/survival models (including proportional hazards regression) were mentioned in twelve series (52%) and were less extensively described than linear and logistic regression. Poisson regression was covered by three series (13%). Each of the considered general aspects of regression modeling was mentioned in at least four series (17%) (Fig 3), except for random effect models, which were treated in only one series (4%). Interpretation of regression coefficients, model assumptions, and different purposes of regression models were covered in 19 series (83%). The aspect different purposes of regression models comprised at least one statement in an article concerning purposes of regression models, identifiable by keywords like prediction, description, explanation, etiology, or confounding. More than one sentence was used for the explanation of different purposes in 15 series (65%). In 18 series (78%), reporting of regression results and regression diagnostics were described, in most series extensively. Aspects like treatment of binary covariates, missing values, measurement error, and adjusted coefficient of determination were rather infrequently mentioned, found in four to seven series each (17-30%).

Fig 3. Extent of explanation of general aspects of regression modeling in statistical series: one sentence only (light grey), more than one sentence to one paragraph (grey), and more than one paragraph (black).

https://doi.org/10.1371/journal.pone.0262918.g003

At least one aspect of functional forms of continuous predictors was mentioned in 17 series (74%), but details were hardly ever given (Fig 4). The possibility of a non-linear relation and non-linear transformations were raised in 16 (70%) and eleven series (48%), respectively. Dichotomization of continuous covariates was mentioned in eight series (35%) and extensively discussed in two (9%). More advanced techniques such as splines or fractional polynomials were mentioned in some series, but detailed information on splines was not provided. Generalized additive models were never mentioned.

Fig 4. Extent of explanation of aspects of functional forms of continuous predictors in statistical series: one sentence only (light grey), more than one sentence to one paragraph (grey), and more than one paragraph (black).

https://doi.org/10.1371/journal.pone.0262918.g004

Selection of variables was mentioned in 15 series (65%) and described extensively in ten series (43%) (Fig 5). However, specific variable selection methods were rarely described in detail. Backward elimination, selection based on background knowledge, forward selection, and stepwise selection were the most frequently described selection methods, appearing in seven to eleven series (30-48%). Univariate screening, which is still popular in medical research, was described in only three series (13%), in up to one paragraph. Other aspects of variable selection were hardly ever mentioned. Selection based on AIC/BIC, relating to best subset selection or stepwise selection based on these information criteria, and the choice of the significance level were found in only two series (9%). Relative frequencies of aspects mentioned in articles are detailed in Figs 1-3 in S5 File.

Fig 5. Extent of explanation of aspects of selection of variables in statistical series: one sentence only (light grey), more than one sentence to one paragraph (grey), and more than one paragraph (black).

https://doi.org/10.1371/journal.pone.0262918.g005

We found general recommendations for software in nine articles from nine different series. Authors mentioned R, Nanostat, the GLIM package, SAS and SPSS [75-78]. SAS and R were each recommended in three articles. In only one article did the authors refer to a specific R package. Detailed code examples were provided in only two articles [16, 58]. In the article by Curran-Everett [58], the R script file was provided as an appendix, and in the article by Obuchowski [16], code chunks were included throughout the text, directly showing how to derive the reported results. In all, software recommendations were rare and mostly not detailed.

Recommendations and warnings in the series

Recommendations and warnings were given on many aspects of our list. All statements are listed in S5 File: Table 1, and some frequent statements across articles are summarized below.

Statements on general aspects

We found numerous recommendations and warnings on general aspects, as described in the following. Concerning data preparation, some authors recommended imputing missing values in multivariable models, e.g. by multiple imputation [20-22, 31]. Steyerberg et al. [31] and Grant et al. [21] discouraged the use of a complete case analysis to handle missing values. As an aspect of model development, the number of observations/events per variable was a disputed topic in several articles [79-81]. In seven articles, we found explicit recommendations for the number of observations (in linear models) or the events per variable (in logistic and Cox/survival models), varying between at least ten and 20 observations/events per variable [16, 20, 22, 25, 31, 33, 55]. Several recommendations and warnings were given on model assumptions and model diagnostics. Many series authors recommended checking assumptions graphically [24, 27, 44, 58, 72] and warned that models may be inappropriate if the assumptions are not met [20, 24, 31, 33, 52, 55, 56, 62]. In the context of the Cox proportional hazards model, authors especially mentioned the proportional hazards assumption [24, 44, 49, 56, 62]. Concerning reporting of results, some authors warned against confusing odds ratios with relative risks or hazard ratios [25, 44, 59]. Several warnings could also be found on reporting the performance of a model. Most authors did not recommend reporting the coefficient of determination R² [20, 27, 51, 61] and indicated that a pitfall of R² is that its value increases with the number of covariates in the model [15]. Schneider et al. [20] and Richardson et al. [61] recommended using the adjusted coefficient of determination instead. We also found many recommendations and statements about model validation for prediction models. Authors of the evaluated articles recommended cross-validation or bootstrap validation instead of split-sample validation if internal validation is performed [21, 22, 31, 70, 72]. It was also suggested that internal validation is not sufficient for a model to be used in clinical practice and that an external validation should be performed as well [21]. In several articles, authors warned against applying the Hosmer-Lemeshow test because of potential pitfalls [31, 60, 61]. For reporting regression results, the guideline for Transparent Reporting of multivariable prediction models for Individual Prognosis Or Diagnosis (TRIPOD) was mentioned in two articles [21, 71, 82].
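To make the events-per-variable heuristic above concrete, here is a minimal Python sketch; the function name and default threshold are our own illustration, and the 10 to 20 range simply mirrors the recommendations cited in the reviewed articles, which are rules of thumb rather than formal sample-size calculations.

# Minimal sketch of the events-per-variable (EPV) heuristic (illustrative only).
# The 10-20 range mirrors the recommendations cited above; it is a rough rule
# of thumb, not a substitute for formal sample-size considerations.
def max_candidate_predictors(n_events: int, epv: float = 10.0) -> int:
    """Upper bound on the number of candidate predictors for a logistic or
    Cox model, given the number of events and a chosen EPV ratio."""
    return int(n_events // epv)

# Example: 120 events under the conservative EPV = 20 rule allows ~6 predictors.
print(max_candidate_predictors(120, epv=20))  # -> 6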

Statements on functional form of continuous predictors

Dichotomization of continuous predictors is an aspect of functional forms of continuous predictors that was frequently discussed. Many authors argued against categorization of continuous variables because it may lead to loss of power, increased risk of false positive results, underestimation of variation, and concealment of non-linearities [21, 26, 31, 69]. However, other authors advised categorizing continuous variables if the relation to the outcome is non-linear [24, 25, 59].

Statements on variable selection

We also found recommendations in favor of or against specific variable selection methods. Four articles explicitly recommended taking advantage of background knowledge to select variables [15, 20, 48, 59]. One article advised against univariate screening [19]. Comparing stepwise selection methods, Grant et al. [21] preferred backward elimination over forward selection. Authors warned about consequences of stepwise methods such as unstable selection and overfitting [21, 31]. It was also pointed out that selected models must be interpreted with the greatest caution and that implications should be checked on new data [28, 53].
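To illustrate the mechanism of the stepwise methods discussed above (not to endorse them), the following sketch implements a simple AIC-based backward elimination with statsmodels on synthetic data; all variable names and data are invented, and, as the cited articles warn, such data-driven selection can be unstable and prone to overfitting.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): only x1 and x2 truly influence y.
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({f"x{i}": rng.normal(size=n) for i in range(1, 6)})
df["y"] = 2.0 + 1.5 * df["x1"] - 1.0 * df["x2"] + rng.normal(size=n)

def backward_eliminate(data, response, candidates):
    """Greedy backward elimination: repeatedly drop the variable whose removal
    improves (lowers) AIC the most, until no removal improves AIC."""
    def aic(terms):
        return smf.ols(f"{response} ~ " + " + ".join(terms), data).fit().aic
    selected = list(candidates)
    best = aic(selected)
    while len(selected) > 1:
        trials = {v: aic([w for w in selected if w != v]) for v in selected}
        drop, trial_best = min(trials.items(), key=lambda kv: kv[1])
        if trial_best >= best:
            break                      # no single removal improves AIC
        selected.remove(drop)
        best = trial_best
    return selected

print(backward_eliminate(df, "y", [f"x{i}" for i in range(1, 6)]))
# Typically recovers ['x1', 'x2'] here, but results vary with the data.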

Methodological gaps in the series

This descriptive analysis of contents gives rise to some observations on important gaps and possibly misleading recommendations. First, we found that one general type of regression model, Poisson regression, was not treated in most series. This omission is probably due to the fact that Poisson regression is less frequently applied in medical research, where most outcomes are binary or time-to-event, making logistic and Cox regression more common. Second, several series introduced the possibility of non-linear relations of continuous covariates with the outcome, but only a few statements on how to deal with non-linearities by specifying flexible functional forms in multiple regression were available. Third, we did not find detailed information on the advantages and disadvantages of data-driven variable selection methods in any of the series. Finally, tutorials on statistical software and specific code examples were hardly found in the reviewed series.

Misleading recommendations in the series

A quality assessment of recommendations would have been controversial, and we did not attempt one. Nevertheless, we mention two issues here that we consider severely misleading. Although univariate screening as a method for variable selection was never recommended in any of the series, one article showed an example in which this procedure was applied to pre-filter the explanatory variables based on their associations with the outcome variable [47]. It has long been known that univariate screening should be avoided because it can wrongly reject important variables [83]. In another article it was suggested that a model can be considered robust if the results of backward elimination and forward selection agree [20]. Such agreement does not support robustness of stepwise methods: relying on agreement is a poor strategy [84, 85].

Series and articles recommended to read

Depending on the aim of the planned study, as well as the focus and knowledge level of the reader, different series and articles might be recommended. The series in Circulation comprised three papers about multiple linear and logistic regression [24-26], which provide basics and describe many essential aspects of univariable and multivariable regression modeling. For more advanced researchers, we recommend the article by Nuñez et al. in Revista Española de Cardiologia [22], which gives a quick overview of aspects and existing methods, including functional forms and variable selection. The Nature Methods series published short articles focusing on a few specific aspects of regression modeling [34-42]. This series might be of interest to readers who would like to spend more time learning about regression modeling. For those especially interested in prediction models, we recommend a concise publication in the European Heart Journal [31], which provides details on model development and validation for predictive purposes. For the same topic we can also recommend the paper by Grant et al. [21]. We consider all series and articles recommended in this paragraph suitable reading for medical researchers, but this does not imply that we agree with all explanations, statements and aspects discussed.

Summary and consequences for future work

This review summarizes the knowledge about regression modeling that is transferred through statistical tutorials published in medical journals. A total of 23 series with 57 topic-relevant articles were identified and evaluated for coverage of 44 aspects of regression modeling. We found that almost all aspects of regression modeling were at least mentioned in one of the series. Several aspects of regression modeling, in particular most general aspects, were covered. However, detailed descriptions and explanations of non-linear relations and of variable selection in multivariable models were lacking. Only a few papers provided suitable methods and software guidance for analysts with a relatively weak statistical background and limited practical experience, as recommended by the STRATOS initiative [1]. However, we acknowledge that there is currently no agreement on state-of-the-art methodology [3].

Nevertheless, readers of statistical tutorials should not only be informed about the possibility of non-linear relations of continuous predictors with the outcome; they should also be given a brief overview of which methods are generally available and may be suitable. This could be achieved by tutorials that introduce readers to methods like fractional polynomials or splines, explaining similarities and differences between these approaches, e.g., by comparative, commented analyses of realistic data sets. Such documents could also show how alternative analyses (considering/ignoring potential non-linearities) may lead to conflicting results and explain the reasons for such discrepancies.
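As a sketch of what such a comparative tutorial might show, the following hypothetical Python example contrasts a straight-line fit with a spline-based fit on synthetic data; it uses patsy's bs() basis through statsmodels and is our own illustration, not an analysis taken from any of the reviewed series.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, clearly non-linear exposure-outcome relationship (invented data).
rng = np.random.default_rng(11)
n = 250
df = pd.DataFrame({"x": rng.uniform(0, 10, n)})
df["y"] = np.sin(df["x"]) + 0.1 * df["x"] + rng.normal(scale=0.3, size=n)

linear = smf.ols("y ~ x", data=df).fit()             # ignores the curvature
spline = smf.ols("y ~ bs(x, df=4)", data=df).fit()   # cubic B-spline basis (patsy)

# A large AIC gap signals that the linear functional form is inadequate here;
# a tutorial would add plots of both fitted curves over the scatter of the data.
print(round(linear.aic, 1), round(spline.aic, 1))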

Detailed tutorials on variable selection could aim at describing the mechanism of different variable selection methods, which can easily be applied with standard statistical software, and should state in what situations variable selection methods are needed and could be used. For example, if sufficient background knowledge is available, prefiltering or even the selection of variables should be based on this information rather than using data-driven methods on the entire data set. Such tutorials should provide comparisons and interpretation of the results of various variable selection methods and suggest adequate methods for different data settings.

Generally, the articles also lacked details on software to perform statistical analyses and usually did not provide code chunks, descriptions of specific functions, an appendix with commented code, or references to software packages. Future work should also focus on filling this gap by recommending software and providing well-commented and documented code for different statistical methods in a format that is accessible to non-experts. We recommend that the software, packages and functions used to apply certain methods be reported in every statistical tutorial article. The respective code to derive the analysis results could be provided in an appendix or directly in the manuscript text, if not too lengthy. Any code provided in an appendix should be well structured and thoroughly commented, referring to the particular method and describing all defined parameter settings. This will encourage medical researchers to increase the reproducibility of their research by also publishing their statistical code, e.g., in electronic appendices to their publications. For example, worked examples with openly accessible data sets and commented code allowing fully reproducible results have a high potential to guide researchers in their own statistical tasks. In contrast, we discourage the use of point-and-click software programs, which sometimes output far more analysis results than requested; users may pick inadequate methods or inadvertently report wrong results, which could undermine their research.

Generally, our review may stimulate the development of targeted gap-filling guidance and tutorial papers in the field of regression modeling, which should support medical researchers in several ways: 1) by explaining how to interpret published results correctly, 2) by guiding them in critically appraising the methodology used in a published article, 3) by enabling them to plan and perform basic statistical analyses and to report results properly, and 4) by helping them to identify situations in which the advice of a statistical expert is required. In S3 File (CRF article screening) we indicated which aspects should usually be addressed by an expert and which aspects are considered basic.

Strengths and limitations

To our knowledge, this is the first review of series of statistical tutorials in the medical field with a focus on regression modeling. Our review followed a pre-specified and published protocol to which many experienced researchers in the field of applied regression modeling contributed. One aspect of this contribution was the collection of series of statistical tutorials that could not be identified by common keyword searches.

We standardized the selection process by designing an inclusion checklist for series of statistical tutorials and by providing a manual for the content form with which we extracted the actual information from the articles and series. Another strength is that the data collection process was performed objectively, since each article was analyzed by two out of three independent raters. Discrepancies were discussed among all three to reach a consensus. This procedure prevented single opinions from being transferred to the output of this review. This review is informative for the many clinical colleagues who are interested in statistical issues in regression modeling and search for suitable literature.

This review also has limitations. An automated, systematic search was not possible because series could not be identified by common keywords, neither at the series title level nor at the article title level. Thus, not all available series may have been found. To enrich our initial query, we also searched on certain journals' webpages and asked our expert panel from the STRATOS initiative to complement our list with other series they were aware of. We also included series that were suggested by one reviewer during the peer-review procedure of this manuscript. This selection strategy may introduce a bias towards higher-quality journals, since series in less prestigious journals might not be known to the experts. However, the higher-quality journals can be considered the primary source of information for researchers seeking advice on statistical methodology.

We considered only series with at least five articles. This boundary is, of course, to a certain extent arbitrary. It was motivated by the fact that we intended to perform analyses at the series level, which is only reasonable if a series covers an adequate number of articles. We also assumed that larger series are more visible and better known to researchers.

We also might have missed or excluded some important aspects of regression modeling in our catalogue. The catalogue of aspects was developed and discussed by several experienced researchers of the STRATOS initiative working in the field of regression modeling. After submission of the protocol paper, some more aspects were added at the request of its reviewers [7]. However, further important aspects such as meta-regression, diagnostic models, causal inference, reproducibility, and open data and software code were not addressed. We encourage researchers to conduct similar reviews in these related fields.

A third limitation is that we only searched for series, whereas other educational papers on regression modeling may have been published as single articles. However, we believe that the average visibility of an entire series, and thereby its educational impact, is much higher than that of isolated articles. This does not negate that there could be excellent isolated articles with a high impact for training medical researchers. While working on the final version of this paper we became aware of the series Big-data Clinical Trial Column in the Annals of Translational Medicine. Until 1 January 2019 it had published 36 papers, and the series would have been eligible for our review. Obviously, we might have overlooked further series, but it is unlikely that this has a large effect on the results of our review.

Moreover, there are many introductory textbooks, educational workshops and online video tutorials, some of excellent quality, which were not considered here. A detailed review of such sources was clearly out of our scope.

Despite many series of statistical tutorials being available to guide medical researchers on various aspects of regression modeling, several methodological gaps still persist, specifically on addressing nonlinear effects, model specification and variable selection. Furthermore, papers are published in a large number of different journals and are therefore likely unknown to many medical researchers. This review fills the latter gap, but many more steps are needed to improve the quality and interpretation of medical research. More detailed statistical guidance and tutorials with a low technical level on regression modeling and other topics are needed to better support medical researchers who perform or interpret regression analyses.

Supporting information

S1 Checklist. PRISMA reporting guideline.

https://doi.org/10.1371/journal.pone.0262918.s001

S1 File. List of candidate series for potential inclusion in the review.

https://doi.org/10.1371/journal.pone.0262918.s002

S2 File. Case report form–series inclusion.

https://doi.org/10.1371/journal.pone.0262918.s003

S3 File. Case report form–article screening.

https://doi.org/10.1371/journal.pone.0262918.s004

S4 File. Manual for the article screening sheet.

https://doi.org/10.1371/journal.pone.0262918.s005

S5 File. Supplementary figures and tables.

https://doi.org/10.1371/journal.pone.0262918.s006

S1 Data. Collected data.

https://doi.org/10.1371/journal.pone.0262918.s007

Acknowledgments

When this article was written, topic group 2 of STRATOS consisted of the following members: Georg Heinze (Co-chair, [email protected] ), Medical University of Vienna, Austria; Willi Sauerbrei (co-chair, [email protected] ), University of Freiburg, Germany; Aris Perperoglou (co-chair, [email protected] ), AstraZeneca, London, Great Britain; Michal Abrahamowicz, Royal Victoria Hospital, Montreal, Canada; Heiko Becher, Medical University Center Hamburg, Eppendorf, Hamburg, Germany; Harald Binder, University of Freiburg, Germany; Daniela Dunkler, Medical University of Vienna, Austria; Rolf Groenwold, Leiden University, Leiden, Netherlands; Frank Harrell, Vanderbilt University School of Medicine, Nashville TN, USA; Nadja Klein, Humboldt Universität, Berlin, Germany; Geraldine Rauch, Charité–Universitätsmedizin Berlin, Germany; Patrick Royston, University College London, Great Britain; Matthias Schmid, University of Bonn, Germany.

We thank Edith Motschall (Freiburg) for her important support in the pilot study where we tried to define keywords for identifying statistical series within medical journals. We thank several members of the STRATOS initiative for proposing a high number of candidate series and we thank Frank Konietschke for English language editing in our protocol.

  • 75. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2021.
  • 77. SAS Institute Inc. The SAS system for Windows. Release 9.4. Cary, NC: SAS Institute Inc.; 2021.
  • 78. IBM Corporation. IBM SPSS Statistics for Windows, Version 27.0. Armonk, NY: IBM Corporation; 2020.

Hippokratia. 2010; 14(Suppl 1).

Introduction to Multivariate Regression Analysis

Statistics are used in medicine for data description and inference. Inferential statistics are used to answer questions about the data, to test hypotheses (formulating the null and alternative hypotheses), to generate a measure of effect (typically a ratio of rates or risks), to describe associations (correlations), to model relationships (regression) within the data, and in many other functions. Usually, point estimates are the measures of association or of the magnitude of effects. Confounding, measurement error, selection bias and random error make it unlikely that point estimates equal the true values. In the estimation process, random error is unavoidable. One way to account for it is to compute p-values for a range of possible parameter values (including the null). The range of values for which the p-value exceeds a specified alpha level (typically 0.05) is called the confidence interval. An interval estimation procedure will, in 95% of repetitions (identical studies in all respects except for random error), produce limits that contain the true parameter. It has been argued that the question whether the pair of limits produced by a given study contains the true parameter cannot be answered by the ordinary (frequentist) theory of confidence intervals [1]. Frequentist approaches derive estimates by using probabilities of data (either p-values or likelihoods) as measures of compatibility between data and hypotheses, or as measures of the relative support that data provide for hypotheses. Another approach, the Bayesian one, uses the data to update existing (prior) estimates. Proper use of either approach requires careful interpretation of statistics [1, 2].

The goal of any data analysis is to extract accurate estimates from raw information. One of the most important and common questions is whether there is a statistical relationship between a response variable (Y) and explanatory variables (Xi). An option to answer this question is to employ regression analysis to model the relationship. There are various types of regression analysis. The type of regression model depends on the distribution of Y: if it is continuous and approximately normal, we use a linear regression model; if dichotomous, logistic regression; if Poisson or multinomial, log-linear analysis; if time-to-event data in the presence of censored cases (survival-type), Cox regression. By modeling we try to predict the outcome (Y) based on values of a set of predictor variables (Xi). These methods allow us to assess the impact of multiple variables (covariates and factors) in the same model [3, 4].
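To sketch the mapping from outcome type to regression type just described, the following Python example fits a linear, a logistic, and a Poisson model with statsmodels on synthetic data; the setup is purely illustrative, and a Cox model for survival-type outcomes, which would need an extra package such as lifelines, is omitted.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data with one predictor and three kinds of outcome.
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
df = pd.DataFrame({
    "x": x,
    "y_cont": 1.0 + 0.5 * x + rng.normal(size=n),     # continuous, ~normal
    "y_bin": rng.binomial(1, 1 / (1 + np.exp(-x))),   # dichotomous
    "y_cnt": rng.poisson(np.exp(0.2 + 0.3 * x)),      # counts
})

linear = smf.ols("y_cont ~ x", data=df).fit()       # linear regression
logistic = smf.logit("y_bin ~ x", data=df).fit()    # logistic regression
poisson = smf.poisson("y_cnt ~ x", data=df).fit()   # Poisson (log-linear) model

for name, fit in [("linear", linear), ("logistic", logistic), ("poisson", poisson)]:
    print(name, dict(fit.params.round(2)))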

In this article we focus on linear regression. Linear regression estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable, which should be quantitative. Logistic regression is similar to linear regression but is suited to models where the dependent variable is dichotomous; its coefficients can be used to estimate odds ratios for each of the independent variables in the model.

Linear equation

In most statistical packages, a curve estimation procedure produces regression statistics and related plots for many different models (linear, logarithmic, inverse, quadratic, cubic, power, S-curve, logistic, exponential, etc.). It is essential to plot the data in order to determine which model to use for each dependent variable. If the variables appear to be related linearly, a simple linear regression model can be used; if the variables are not linearly related, a data transformation might help. If the transformation does not help, a more complicated model may be needed. It is therefore strongly advised to screen the data graphically early on, e.g. by a scatterplot; if the plot resembles a mathematical function you recognize, fit the data to that type of model. For example, if the data resemble an exponential function, an exponential model is to be used. Alternatively, if it is not obvious which model fits best, an option is to try several models and select among them [4-6].

The most appropriate model could be a straight line, a higher-degree polynomial, logarithmic or exponential. Strategies to find an appropriate model include the forward method, in which we start by assuming a very simple model, i.e. a straight line (Y = a + bX or Y = b₀ + b₁X). We then find the best estimate of the assumed model. If this model does not fit the data satisfactorily, we assume a more complicated model, e.g. a 2nd-degree polynomial (Y = a + bX + cX²), and so on. In the backward method, we assume a complicated model, e.g. a high-degree polynomial, fit the model, and try to simplify it. We might also use a model suggested by theory or experience. Often a straight-line relationship fits the data satisfactorily, and this is the case of simple linear regression. The simplest case of linear regression analysis is that with one predictor variable [6, 7].
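A small sketch of the forward strategy just described, on invented data: fit a straight line first, then check whether adding a 2nd-degree term improves the fit.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data generated from a genuinely curved relationship.
rng = np.random.default_rng(42)
x = np.linspace(0, 4, 120)
y = 1.0 + 0.8 * x + 0.5 * x**2 + rng.normal(scale=0.8, size=x.size)
df = pd.DataFrame({"x": x, "y": y})

line = smf.ols("y ~ x", data=df).fit()               # step 1: straight line
quad = smf.ols("y ~ x + I(x ** 2)", data=df).fit()   # step 2: add quadratic term

# If the quadratic term is clearly significant, the straight line was too simple.
print(round(line.rsquared, 3), round(quad.rsquared, 3))
print(quad.pvalues["I(x ** 2)"])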

Linear regression equation

The purpose of regression is to predict Y on the basis of X, or to describe how Y depends on X (regression line or curve):

Y = f(X₁, X₂, …, Xₖ) + ε

The Xᵢ (X₁, X₂, …, Xₖ) are defined as "predictor", "explanatory" or "independent" variables, while Y is defined as the "dependent", "response" or "outcome" variable.

Assuming a linear relation in the population, the mean of Y for a given X equals α + βX, i.e. the "population regression line".

If Y = a + bX is the estimated line, then Ŷᵢ = a + bXᵢ is called the fitted (or predicted) value, and Yᵢ − Ŷᵢ is called the residual.

The estimated regression line is determined in such a way that Σ(residuals)² is minimal, i.e. the standard deviation of the residuals is minimized (the residuals are on average zero). This is called the "least squares" method. In the equation

Ŷ = a + bX

b is the slope (the average increase of outcome per unit increase of predictor)

a is the intercept (often has no direct practical meaning)

In more detail, the least-squares estimates b and a of the regression line can be written as

b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²,  a = Ȳ − b·X̄
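A direct numpy transcription of the least-squares formulas above, on a small invented data set:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # invented predictor values
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])   # invented outcome values

# b = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2),  a = Ybar - b * Xbar
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

Y_hat = a + b * X          # fitted (predicted) values
residuals = Y - Y_hat      # residuals; they sum to (approximately) zero
print(round(a, 3), round(b, 3))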

Further inference about the regression line can be made by estimating a confidence interval (the 95% CI for the slope b). The calculation is based on the standard error of b:

se(b) = σres / √(Σ(Xᵢ − X̄)²)

so the 95% CI for β is b ± t0.975 × se(b) [t-distribution with df = n − 2],

and the test for H₀: β = 0 is t = b / se(b) [p-value derived from the t-distribution with df = n − 2].

If the p-value lies above 0.05, the null hypothesis is not rejected, which means that a straight-line model in X does not help to predict Y. It may be that a straight line with slope 0 holds, or that there is a curved relation with zero linear component. On the other hand, if the null hypothesis is rejected, either the straight-line model holds or, in a curved relationship, the straight-line model helps but is not the best model. Of course, there is the possibility of a type II or type I error in the first and second case, respectively. The standard deviation of the residuals (σres) is estimated by

σres = √( Σ(Yᵢ − Ŷᵢ)² / (n − 2) )

The standard deviation of the residuals (σres) characterizes the variability around the regression line: the smaller σres, the better the fit. It is associated with a number of degrees of freedom, the number to divide by in order to obtain an unbiased estimate of the variance; in this case df = n − 2, because two parameters, α and β, are estimated [7].
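Continuing the same toy data, the residual standard deviation, the standard error of the slope, the 95% CI and the t-test follow directly from the formulas above; scipy supplies the t-distribution quantiles.

import numpy as np
from scipy import stats

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = X.size

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()
res = Y - (a + b * X)

s_res = np.sqrt(np.sum(res ** 2) / (n - 2))          # residual SD, df = n - 2
se_b = s_res / np.sqrt(np.sum((X - X.mean()) ** 2))  # standard error of b

t975 = stats.t.ppf(0.975, df=n - 2)
ci = (b - t975 * se_b, b + t975 * se_b)              # 95% CI for the slope
t_stat = b / se_b                                    # test of H0: beta = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(ci, t_stat, p_value)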

Multiple linear regression analysis

As an example, in a sample of 50 individuals we measured: Y = toluene personal exposure concentration (a widespread aromatic hydrocarbon); X1 = hours spent outdoors; X2 = wind speed (m/sec); X3 = toluene home levels. Y is the continuous response ("dependent") variable, while X1, X2, …, Xp are the predictor ("independent") variables [7]. Usually the questions of interest are how to predict Y on the basis of the X's, and what the "independent" influence of wind speed is, i.e. corrected for home levels and other related variables. These questions can, in principle, be answered by multiple linear regression analysis.

In the multiple linear regression model, Y has a normal distribution with mean

E(Y) = β₀ + β₁X₁ + … + βₚXₚ

and standard deviation σ. The model parameters β₀, β₁, …, βₚ and σ must be estimated from the data:

β₀ = intercept

β₁, …, βₚ = regression coefficients

σ = σres = residual standard deviation

Interpretation of regression coefficients

In the equation Y = β₀ + β₁X₁ + … + βₚXₚ,

βᵢ equals the mean increase in Y per unit increase in Xᵢ, while the other X's are kept fixed. In other words, βᵢ is the influence of Xᵢ corrected (adjusted) for the other X's. The estimation method follows the least-squares criterion.

If b₀, b₁, …, bₚ are the estimates of β₀, β₁, …, βₚ, then the "fitted" value of Y is

Ŷ = b₀ + b₁X₁ + … + bₚXₚ

In our example, the statistical package gives the following estimates of the regression coefficients (bᵢ) and their standard errors (se) for toluene personal exposure levels.

[Table of estimated coefficients and standard errors for the intercept, hours spent outdoors, wind speed, and toluene home levels; apart from the coefficient 0.582 for hours spent outdoors quoted below, the values are not recoverable from the source.]

Then the regression equation for toluene personal exposure levels would be:

[Equation image not recovered: the fitted equation for toluene personal exposure, including the coefficient 0.582 for hours spent outdoors.]

The estimated coefficient for time spent outdoors (0.582) means that the estimated mean increase in toluene personal levels is 0.582 µg/m³ if time spent outdoors increases by 1 hour, while home levels and wind speed remain constant. More precisely, one could say that individuals differing by one hour in the time they spend outdoors, but having the same values on the other predictors, will have a mean difference in toluene exposure levels equal to 0.582 µg/m³ [8].

Be aware that this interpretation does not imply any causal relation.
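A minimal R sketch of such a model, with simulated stand-ins for the toluene data (all variable names, units, and effect sizes below are invented for illustration):

```r
# Multiple linear regression on simulated "toluene-like" data
set.seed(2)
n <- 50
time_outdoors <- runif(n, 0, 8)    # hours spent outdoors
wind_speed    <- runif(n, 0, 5)    # m/sec
toluene_home  <- rnorm(n, 20, 5)   # toluene home levels
toluene_pers  <- 2 + 0.6 * time_outdoors - 1.5 * wind_speed +
                 0.8 * toluene_home + rnorm(n, sd = 3)

fit <- lm(toluene_pers ~ time_outdoors + wind_speed + toluene_home)
coef(fit)            # b0, b1, ..., bp: each bi is adjusted for the other X's
summary(fit)$sigma   # residual standard deviation, df = n - (p + 1)
```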

Confidence interval (CI) and test for regression coefficients

The 95% CI for βi is given by bi ± t0.975 × se(bi) with df = n − 1 − p (df: degrees of freedom).

In our example this means that the 95% CI for the coefficient of time spent outdoors is −0.19 to 0.49.

The corresponding test for an individual coefficient is t = bi / se(bi), with the p-value derived from the t-distribution with df = n − 1 − p.

If, for example, we test H0: βhumidity = 0 and find P = 0.40, which is not significant, we may assume that the association between toluene personal exposure and humidity can be explained by the correlation between humidity and wind speed [8].

In order to estimate the standard deviation of the residuals (Y − Yfit), i.e. the estimated standard deviation of Y for a given set of predictor values, we have to estimate σ:

s_{res} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - y_{fit,i})^2}{n - (p + 1)}}

The number of degrees of freedom is df = n − (p + 1), since p + 1 parameters are estimated.

The ANOVA table gives the total variability in Y, which can be partitioned into a part due to regression and a part due to residual variation:

\sum (y_i - \bar{y})^2 = \sum (y_{fit,i} - \bar{y})^2 + \sum (y_i - y_{fit,i})^2 \quad (SS_{tot} = SS_{reg} + SS_{res})

with degrees of freedom (n − 1) = p + (n − p − 1).

In statistical packages the ANOVA table in which the partition is given usually has the following format [6]:

Source      SS      df          MS                       F
Regression  SSreg   p           MSreg = SSreg / p        MSreg / MSres
Residual    SSres   n − p − 1   MSres = SSres / (n−p−1)
Total       SStot   n − 1

SS: "sums of squares"; df: Degrees of freedom; MS: "mean squares" (SS/dfs); F: F statistics (see below)

As a measure of the strength of the linear relation one can use R, the multiple correlation coefficient between Y and the predictors (X1, …, Xp), i.e. the correlation between Y and Yfit. R squared is the proportion of total variation explained by the regression (R² = SSreg / SStot).
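Continuing the simulated toluene sketch above, the sums-of-squares partition and R² can be inspected in R as follows (note that R's anova() reports sequential SS per predictor rather than a single "Regression" row):

```r
# ANOVA partition and R^2, continuing the simulated fit from above
anova(fit)                 # sequential SS per predictor plus the residual SS
ss_tot <- sum((toluene_pers - mean(toluene_pers))^2)
ss_res <- sum(residuals(fit)^2)
ss_reg <- ss_tot - ss_res
ss_reg / ss_tot            # R^2 = SSreg / SStot = summary(fit)$r.squared
```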

Test on overall or reduced model

Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \text{residual}

In our example: Tpers = β0 + β1·(time outdoors) + β2·(Thome) + β3·(wind speed) + residual

The null hypothesis (H0) is that there is no regression overall, i.e. β1 = β2 = … = βp = 0.

The test is based on the proportion of the SS explained by the regression relative to the residual SS. The test statistic F = MSreg / MSres has an F-distribution with df1 = p and df2 = n − p − 1. In our example F = 5.49 (P < 0.01).

If we now want to test the hypothesis H0: β1 = β2 = β5 = 0 (k = 3):

In general, k of the p regression coefficients are set to zero under H0. The model that is valid if H0 is true is called the "reduced model". The idea is to compare the explained variability of the model at hand with that of the reduced model.

The test statistic (F):

F = \frac{(SS_{reg} - SS_{reg,reduced}) / k}{MS_{res}}

follows an F-distribution with df1 = k and df2 = n − p − 1.

If one or two variables are left out and we compute SSreg for the reduced model (the statistical package does this), and the P value of the resulting F statistic lies between 0.05 and 0.10, then there is some evidence, although not strong, that these variables together, independently of the others, contribute to the prediction of the outcome.
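A hedged sketch of this comparison in R, continuing the simulated example: the partial F-test is obtained by passing the reduced and the full model to anova():

```r
# Partial F-test: are k coefficients jointly zero?
reduced <- lm(toluene_pers ~ toluene_home)   # drops k = 2 predictors
full    <- lm(toluene_pers ~ time_outdoors + wind_speed + toluene_home)
anova(reduced, full)   # F = [(SSreg_full - SSreg_reduced) / k] / MSres_full
```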

Assumptions

If a linear model is used, the following assumptions should be met. For each value of the independent variable, the distribution of the dependent variable must be normal. The variance of the distribution of the dependent variable should be constant for all values of the independent variable. The relationship between the dependent variable and the independent variables should be linear, and all observations should be independent. So the assumptions are: independence, linearity, normality, and homoscedasticity. In other words, the residuals of a good model should be normally and randomly distributed, i.e. the unknown σ does not depend on X ("homoscedasticity") [2, 4, 6, 9].

Checking for violations of model assumptions

To check the model assumptions we use residual analysis. There are several kinds of residuals; the most commonly used are the standardized residuals (ZRESID) and the studentized residuals (SRESID) [6]. If the model is correct, the residuals should have a normal distribution with mean zero and constant standard deviation (i.e. not depending on X). In order to check this we can plot the residuals against X; if the variation alters with increasing X, then there is a violation of homoscedasticity. We can also use the Durbin-Watson test for serial correlation of the residuals, and casewise diagnostics for the cases meeting a selection criterion (outliers above n standard deviations). The residuals should be (zero-mean) independent, normally distributed, with constant standard deviation (homogeneity of variances) [4, 6].

To discover deviations from linearity and homogeneity of variances we can plot the residuals against each predictor or against the predicted values. Alternatively, we can assess the linearity of a predictor variable by using the partial plot: the partial plot for a predictor Xi is a plot of the residuals of Y regressed on the other X's against the residuals of Xi regressed on the other X's. This plot should be linear. To check the normality of the residuals we can use a histogram (with a normal curve) or a normal probability plot [6, 7].

The goodness-of-fit of the model is assessed by studying the behavior of the residuals and looking for "special" observations: outliers, observations with high leverage, and influential points. Outliers are observations with an unusually large residual; high leverage points have an unusual x-pattern, i.e. they are outliers in predictor space; influential points are individuals with a high influence on the estimate or standard error of one or more β's. An observation can be all three at once. It is recommended to inspect individuals with a large residual to find outliers; to use leverage (distance) measures to identify cases with unusual combinations of values for the independent variables that may have a large impact on the regression model; and, for influential points, to use influence statistics, i.e. the change in the regression coefficients (DfBeta(s)) and predicted values (DfFit) that results from the exclusion of a particular case. An overall measure of influence on all β's jointly is Cook's distance (COOK); analogously, an overall measure for the standard errors is COVRATIO [6].
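The diagnostics named in the last two paragraphs correspond to standard functions in R; a sketch, continuing the simulated fit above (the Durbin-Watson test assumes the optional car package is installed):

```r
# Residual and influence diagnostics for the fitted model
plot(fitted(fit), rstandard(fit))               # constant spread? curvature?
qqnorm(rstandard(fit)); qqline(rstandard(fit))  # normality of the residuals
head(rstudent(fit))                   # studentized residuals: flag outliers
head(hatvalues(fit))                  # leverage: unusual x-patterns
head(cooks.distance(fit))             # overall influence on all beta's jointly
head(dfbetas(fit)); head(dffits(fit)) # influence on coefficients / fitted values
head(covratio(fit))                   # influence on the standard errors
car::durbinWatsonTest(fit)            # serial correlation of the residuals
```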

Deviations from model assumptions

Some rules of thumb can help to correct deviations from the model assumptions. In case of curvilinearity in one or more plots, we can add quadratic term(s). In case of non-homogeneity of the residual standard deviation, we can try a transformation: log Y if sres is proportional to predicted Y; the square root of Y if the Y distribution is Poisson-like; 1/Y if sres² is proportional to predicted Y; Y² if sres² decreases with Y. If linearity and homogeneity hold, then non-normality does not matter if the sample size is big enough (n ≥ 50-100). If linearity holds but homogeneity does not, the estimates of the β's are correct, but not their standard errors; these can be corrected by computing "robust" standard errors (the sandwich or Huber estimate) [4, 6, 9].
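A brief sketch of the "robust" (sandwich/Huber) standard errors mentioned above, assuming the sandwich and lmtest packages are installed and continuing the simulated fit:

```r
library(sandwich)
library(lmtest)
# Coefficients are unchanged; only the standard errors are replaced
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```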

Selection methods for Linear Regression modeling

There are various selection methods for linear regression modeling that specify how the independent variables are entered into the analysis. By using different methods, a variety of regression models can be constructed from the same set of variables. Forward variable selection enters the variables one at a time, based on entry criteria. Backward variable elimination enters all of the variables in a single step and then removes them one at a time, based on removal criteria. Stepwise variable entry and removal examines the variables at each step for entry or removal. All variables must pass the tolerance criterion to be entered into the equation, regardless of the entry method specified, and a variable is not entered if it would cause the tolerance of another variable already in the model to drop below the tolerance criterion [6]. During model fitting, the variables entered into and removed from the model are displayed, together with various goodness-of-fit statistics such as R², the change in R², the standard error of the estimate, and an analysis-of-variance table.
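As a hedged illustration, base R's step() performs forward and backward searches; note that it selects by the AIC criterion rather than by the p-value-based entry/removal rules described above:

```r
# Backward elimination from the full simulated model
full <- lm(toluene_pers ~ time_outdoors + wind_speed + toluene_home)
step(full, direction = "backward")

# Forward selection starting from the intercept-only model
step(lm(toluene_pers ~ 1), scope = formula(full), direction = "forward")
```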

Relative issues

Binary logistic regression models can be fitted using either the logistic regression procedure or the multinomial logistic regression procedure. An important theoretical distinction is that the logistic regression procedure produces all statistics and tests using the data at the level of individual cases, while the multinomial logistic regression procedure internally aggregates cases to form subpopulations with identical covariate patterns for the predictors and produces its statistics and tests based on these subpopulations. If all predictors are categorical, or any continuous predictors take on only a limited number of values, the multinomial procedure is preferable. As previously mentioned, use the scatterplot procedure to screen the data for multicollinearity. As with other forms of regression, multicollinearity among the predictors leads to unstable coefficient estimates and inflated standard errors. If all of your predictor variables are categorical, you can also use the loglinear procedure.

In order to explore the correlation between variables, the Pearson or Spearman correlation for a pair of variables r(Xi, Xj) is commonly used. For each pair of variables (Xi, Xj), Pearson's correlation coefficient r can be computed. Pearson's r(Xi, Xj) is a measure of the linear association between two (ideally normally distributed) variables; it is related to the regression slope by r = b·Sx/Sy, and r² is the proportion of the total variation of the one variable explained by the other, identical with regression. Each correlation coefficient measures the association between two variables without taking other variables into account, but there are several useful correlation concepts involving more variables. The partial correlation coefficient between Xi and Xj, adjusted for other X's, e.g. r(X1; X2 / X3), can be viewed as an adjustment of the simple correlation taking into account the effect of a control variable: r(X; Y / Z) is the correlation between X and Y controlled for Z. The multiple correlation coefficient between one variable and several others, e.g. r(Y; X1, X2, …, Xk), is a measure of association between one variable and a set of other variables. It is defined as the simple Pearson correlation coefficient r(Y; Yfit) between Y and its fitted value in the regression model Y = β0 + β1X1 + … + βkXk + residual. The square of r(Y; X1, …, Xk) is interpreted as the proportion of variability in Y that can be explained by X1, …, Xk. The null hypothesis H0: ρ(Y; X1, …, Xk) = 0 is tested with the F-test for overall regression, as in the multiple regression model (see above) [6, 7]. Finally, the multiple-partial correlation coefficient between one X and several other X's, adjusted for some further X's, e.g. r(X1; X2, X3, X4 / X5, X6), equals the relative increase in the percentage of explained variability in Y obtained by adding X1, …, Xk to a model already containing Z1, …, Zp as predictors [6, 7].

Other interesting applications of multiple linear regression analysis include the comparison of two group means. Suppose, for example, we wish to answer the question whether mean HEIGHT differs between men and women. We use the simple linear regression model:

HEIGHT = \beta_0 + \beta_1 Z + \text{residual}

where Z is a dummy variable for sex (e.g. Z = 1 for men, Z = 0 for women).

Testing β1 = 0 is equivalent to testing HEIGHTmen = HEIGHTwomen by means of Student's t-test.

The linear regression model assumes a normal distribution of HEIGHT in both groups, with equal σ. This is exactly the model of the two-sample t-test. In the case of a comparison of several group means, we wish to answer the question whether mean HEIGHT differs between different SES classes.

SES: 1 (low); 2 (middle) and 3 (high) (socioeconomic status)

We can use the following linear regression model:

HEIGHT = \beta_0 + \beta_1 Z_1 + \beta_2 Z_2 + \text{residual}

where Z1 = 1 for the low class and Z2 = 1 for the middle class (high class = reference).

Then β1 and β2 are interpreted as:

β1 = difference in mean HEIGHT between low and high class

β2 = difference in mean HEIGHT between middle and high class

Testing β1 = β2 = 0 is equivalent to the one-way analysis of variance (ANOVA) F-test. The statistical model in both cases is in fact the same [4, 6, 7, 9].
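A hedged sketch of this equivalence in R, with simulated heights (group means invented): both the two-sample t-test and the one-way ANOVA F-test can be reproduced with lm():

```r
set.seed(3)
height <- c(rnorm(25, 178, 7), rnorm(25, 165, 7))
sex    <- factor(rep(c("men", "women"), each = 25))

summary(lm(height ~ sex))               # test of the sex coefficient ...
t.test(height ~ sex, var.equal = TRUE)  # ... identical t statistic and p-value

ses <- factor(sample(c("low", "middle", "high"), 50, replace = TRUE))
anova(lm(height ~ ses))                 # identical to the one-way ANOVA F-test
```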

Analysis of covariance (ANCOVA)

If we wish to compare a continuous variable Y (e.g. HEIGHT) between groups (e.g. men and women) corrected (adjusted or controlled) for one or more covariables X (confounders, e.g. X = age or weight), then the question is formulated as: are the means of HEIGHT of men and women different, if men and women of equal weight are compared? Be aware that this question differs from asking whether there is a difference between the means of HEIGHT for men and women, and the answers can be quite different! The corrected difference between men and women can be opposite in sign, larger, or smaller than the crude difference. In order to estimate the corrected difference, the following multiple regression model is used:

Y = \beta_0 + \beta_1 Z + \beta_2 X + \text{residual}

where Y is the response variable (e.g. HEIGHT); Z the grouping variable (e.g. Z = 0 for men and Z = 1 for women); and X the covariable (confounder, e.g. weight).

So, for men (Z = 0) the regression line is y = β0 + β2X, and for women (Z = 1) it is y = (β0 + β1) + β2X.

This model assumes that the regression lines are parallel. Therefore β1 is the vertical distance between them, and can be interpreted as the difference between the mean response Y of the groups, corrected for X. If the regression lines are not parallel, then the difference in mean Y depends on the value of X. This is called "interaction" or "effect modification".

A more complicated model, in which interaction is admitted, is:

Y = \beta_0 + \beta_1 Z + \beta_2 X + \beta_3 (Z \times X) + \text{residual}

regression line for men: y = β0 + β2X

regression line for women: y = (β0 + β1) + (β2 + β3)X

The hypothesis of absence of "effect modification" is tested by H0: β3 = 0.

As an example, suppose we want to know the difference in HEIGHT between men and women in a population sample, corrected for body weight.

We check the model with interaction:

HEIGHT = \beta_0 + \beta_1 SEX + \beta_2 WEIGHT + \beta_3 (SEX \times WEIGHT) + \text{residual}

By testing β3 = 0, a p-value much larger than 0.05 was obtained. We therefore assume that there is no interaction, i.e. the regression lines are parallel. Analysis of covariance for ≥ 3 groups can be used in the same way, e.g. to ask whether mean HEIGHT differs between people with different levels of education (primary, medium, high), corrected for body weight. In a model where the three lines need not be parallel, we first check for interaction (effect modification) [7]. If the hypothesis that the coefficients of the interaction terms equal 0 is not rejected, it is reasonable to assume a model without interaction. Testing H0: β1 = β2 = 0, i.e. no differences between education levels when corrected for weight, is then performed on the fitted model; note that the individual P-values for Z1 and Z2 depend on the choice of the reference group. The purposes of ANCOVA are to correct for confounding and to increase the precision of an estimated difference.

In a general ANCOVA model such as:

Y = \beta_0 + \beta_1 Z_1 + \cdots + \beta_{k-1} Z_{k-1} + \gamma_1 X_1 + \cdots + \gamma_p X_p + \text{residual}

where Y is the response variable, the k groups are coded by dummy variables Z1, Z2, …, Zk−1, and X1, …, Xp are confounders,

there is a straightforward extension to an arbitrary number of groups and covariables.
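A minimal R sketch of the ANCOVA models above, with simulated heights and weights (all effect sizes invented):

```r
set.seed(4)
weight <- runif(60, 55, 95)
sex    <- factor(rep(c("men", "women"), each = 30))
height <- 150 + 0.35 * weight - 10 * (sex == "women") + rnorm(60, sd = 4)

parallel_fit    <- lm(height ~ sex + weight)   # assumes parallel lines
interaction_fit <- lm(height ~ sex * weight)   # allows effect modification
anova(parallel_fit, interaction_fit)           # tests H0: beta3 = 0 (no interaction)
```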

Coding categorical predictors in regression

One always has to know which coding of categorical factors was used, in order to be able to interpret the parameter estimates. In "reference cell" coding, one of the categories plays the role of the reference category ("reference cell"), while the other categories are indicated by dummy variables; the β's corresponding to the dummies are interpreted as the difference of the corresponding category from the reference category. In "difference with overall mean" coding, in the model of the previous example [Y = β0 + β1Z1 + β2Z2 + residual], β0 is interpreted as the overall mean of the three levels of education, while β1 and β2 are interpreted as the deviations of the means of the primary and medium levels from the overall mean, respectively. The deviation of the mean of the high level from the overall mean is given by (−β1 − β2). In "cell means" coding, the model has no intercept [Y = β1Z1 + β2Z2 + β3Z3 + residual], and β1 is the mean of the primary, β2 of the middle, and β3 of the high level of education [6, 7, 9].
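The three coding schemes can be reproduced in R roughly as follows (a sketch with simulated data; R's default is reference cell coding, with the first factor level as the reference):

```r
set.seed(5)
educ <- factor(sample(c("primary", "medium", "high"), 30, replace = TRUE),
               levels = c("primary", "medium", "high"))
y <- 165 + 3 * (educ == "medium") + 6 * (educ == "high") + rnorm(30, sd = 2)

lm(y ~ educ)                                      # reference cell (primary = reference)
lm(y ~ educ, contrasts = list(educ = contr.sum))  # difference with overall mean
lm(y ~ educ - 1)                                  # cell means: one mean per level
```

With contr.sum, the intercept equals the (unweighted) mean of the level means, matching the "difference with overall mean" interpretation above.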

Conclusions

It is apparent to anyone who reads the medical literature today that some knowledge of biostatistics and epidemiology is a necessity. The goal of any data analysis is to extract accurate estimates from the raw information. Before any testing or estimation, careful data editing is essential to check for errors, followed by data summarization. One of the most important and common questions is whether there is a statistical relationship between a response variable (Y) and explanatory variables (Xi). One option to answer this question is to employ regression analysis. There are various types of regression analysis, and all these methods allow us to assess the impact of multiple variables on the response variable.



The clinician’s guide to interpreting a regression analysis

  • Sofia Bzovsky 1 ,
  • Mark R. Phillips   ORCID: orcid.org/0000-0003-0923-261X 2 ,
  • Robyn H. Guymer   ORCID: orcid.org/0000-0002-9441-4356 3 , 4 ,
  • Charles C. Wykoff 5 , 6 ,
  • Lehana Thabane   ORCID: orcid.org/0000-0003-0355-9734 2 , 7 ,
  • Mohit Bhandari   ORCID: orcid.org/0000-0001-9608-4808 1 , 2 &
  • Varun Chaudhary   ORCID: orcid.org/0000-0002-9988-4146 1 , 2

on behalf of the R.E.T.I.N.A. study group

Eye, volume 36, pages 1715–1717 (2022)


Introduction

When researchers conduct clinical studies to investigate factors associated with, or treatments for, diseases and conditions in order to improve patient care and clinical practice, statistical evaluation of the data is often necessary. Regression analysis is an important statistical method that is commonly used to determine the relationship between several factors and disease outcomes, or to identify relevant prognostic factors for diseases [ 1 ].

This editorial will acquaint readers with the basic principles of and an approach to interpreting results from two types of regression analyses widely used in ophthalmology: linear, and logistic regression.

Linear regression analysis

Linear regression is used to quantify a linear relationship or association between a continuous response/outcome variable or dependent variable with at least one independent or explanatory variable by fitting a linear equation to observed data [ 1 ]. The variable that the equation solves for, which is the outcome or response of interest, is called the dependent variable [ 1 ]. The variable that is used to explain the value of the dependent variable is called the predictor, explanatory, or independent variable [ 1 ].

In a linear regression model, the dependent variable must be continuous (e.g. intraocular pressure or visual acuity), whereas, the independent variable may be either continuous (e.g. age), binary (e.g. sex), categorical (e.g. age-related macular degeneration stage or diabetic retinopathy severity scale score), or a combination of these [ 1 ].

When investigating the effect or association of a single independent variable on a continuous dependent variable, this type of analysis is called a simple linear regression [ 2 ]. In many circumstances, though, a single independent variable may not be enough to adequately explain the dependent variable. Often it is necessary to control for confounders, and in these situations one can perform a multivariable linear regression to study the effect or association of multiple independent variables on the dependent variable [ 1 , 2 ]. When incorporating numerous independent variables, the regression model estimates the effect or contribution of each independent variable while holding the values of all other independent variables constant [ 3 ].

When interpreting the results of a linear regression, there are a few key outputs for each independent variable included in the model:

Estimated regression coefficient—The estimated regression coefficient indicates the direction and strength of the relationship or association between the independent and dependent variables [ 4 ]. Specifically, the regression coefficient describes the change in the dependent variable for each one-unit change in the independent variable, if continuous [ 4 ]. For instance, if examining the relationship between a continuous predictor variable and intra-ocular pressure (dependent variable), a regression coefficient of 2 means that for every one-unit increase in the predictor, there is a two-unit increase in intra-ocular pressure. If the independent variable is binary or categorical, then the one-unit change represents switching from one category to the reference category [ 4 ]. For instance, if examining the relationship between a binary predictor variable, such as sex, where ‘female’ is set as the reference category, and intra-ocular pressure (dependent variable), a regression coefficient of 2 means that, on average, males have an intra-ocular pressure that is 2 mm Hg higher than females.

Confidence Interval (CI)—The CI, typically set at 95%, is a measure of the precision of the coefficient estimate of the independent variable [ 4 ]. A large CI indicates a low level of precision, whereas a small CI indicates a higher precision [ 5 ].

P value—The p value for the regression coefficient indicates whether the relationship between the independent and dependent variables is statistically significant [ 6 ].

Logistic regression analysis

As with linear regression, logistic regression is used to estimate the association between one or more independent variables with a dependent variable [ 7 ]. However, the distinguishing feature in logistic regression is that the dependent variable (outcome) must be binary (or dichotomous), meaning that the variable can only take two different values or levels, such as ‘1 versus 0’ or ‘yes versus no’ [ 2 , 7 ]. The effect size of predictor variables on the dependent variable is best explained using an odds ratio (OR) [ 2 ]. ORs are used to compare the relative odds of the occurrence of the outcome of interest, given exposure to the variable of interest [ 5 ]. An OR equal to 1 means that the odds of the event in one group are the same as the odds of the event in another group; there is no difference [ 8 ]. An OR > 1 implies that one group has a higher odds of having the event compared with the reference group, whereas an OR < 1 means that one group has a lower odds of having an event compared with the reference group [ 8 ]. When interpreting the results of a logistic regression, the key outputs include the OR, CI, and p-value for each independent variable included in the model.
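As a hedged illustration of how these quantities are obtained in practice, the R sketch below fits a binary logistic regression to simulated data (variable names and effect sizes are invented, not taken from any study) and reads off the ORs, CIs, and p-values:

```r
set.seed(6)
age     <- rnorm(200, 65, 10)
outcome <- rbinom(200, 1, plogis(-4 + 0.05 * age))   # simulated binary outcome
m <- glm(outcome ~ age, family = binomial)

exp(coef(m))      # odds ratios: OR = 1 no difference; > 1 higher; < 1 lower odds
exp(confint(m))   # 95% CIs for the ORs
summary(m)        # p-values for the coefficients
```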

Clinical example

Sen et al. investigated the association between several factors (independent variables) and visual acuity outcomes (dependent variable) in patients receiving anti-vascular endothelial growth factor therapy for macular oedema (DMO) by means of both linear and logistic regression [ 9 ]. Multivariable linear regression demonstrated that age (estimate −0.33, 95% CI −0.48 to −0.19, p < 0.001) was significantly associated with best-corrected visual acuity (BCVA) at 100 weeks at the alpha = 0.05 significance level [ 9 ]. The regression coefficient of −0.33 means that BCVA at 100 weeks decreases by 0.33 with each additional year of age.

Multivariable logistic regression also demonstrated that age and ellipsoid zone status were significantly associated with achieving a BCVA letter score >70 letters at 100 weeks at the alpha = 0.05 significance level. Patients ≥75 years of age had decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those <50 years of age, since the OR is less than 1 (OR 0.96, 95% CI 0.94 to 0.98, p = 0.001) [ 9 ]. Similarly, patients between the ages of 50–74 years also had decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those <50 years of age, since the OR is less than 1 (OR 0.15, 95% CI 0.04 to 0.48, p = 0.001) [ 9 ]. As well, those with a non-intact ellipsoid zone had decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those with an intact ellipsoid zone (OR 0.20, 95% CI 0.07 to 0.56; p = 0.002). On the other hand, patients with an ungradable/questionable ellipsoid zone had increased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those with an intact ellipsoid zone, since the OR is greater than 1 (OR 2.26, 95% CI 1.14 to 4.48; p = 0.02) [ 9 ].

The narrower the CI, the more precise the estimate is; and the smaller the p value (relative to alpha = 0.05), the greater the evidence against the null hypothesis of no effect or association.

Simply put, linear and logistic regression are useful tools for appreciating the relationship between predictor/explanatory and outcome variables for continuous and dichotomous outcomes, respectively, that can be applied in clinical practice, such as to gain an understanding of risk factors associated with a disease of interest.

Schneider A, Hommel G, Blettner M. Linear regression analysis. Dtsch Ärztebl Int. 2010;107:776–82.


Bender R. Introduction to the use of regression models in epidemiology. In: Verma M, editor. Cancer epidemiology. Methods in molecular biology. Humana Press; 2009:179–95.

Schober P, Vetter TR. Confounding in observational research. Anesth Analg. 2020;130:635.


Schober P, Vetter TR. Linear regression in medical research. Anesth Analg. 2021;132:108–9.

Szumilas M. Explaining odds ratios. J Can Acad Child Adolesc Psychiatry. 2010;19:227–9.

Thiese MS, Ronna B, Ott U. P value interpretations and considerations. J Thorac Dis. 2016;8:E928–31.

Schober P, Vetter TR. Logistic regression in medical research. Anesth Analg. 2021;132:365–6.

Zabor EC, Reddy CA, Tendulkar RD, Patil S. Logistic regression in clinical studies. Int J Radiat Oncol Biol Phys. 2022;112:271–7.

Sen P, Gurudas S, Ramu J, Patrao N, Chandra S, Rasheed R, et al. Predictors of visual acuity outcomes after anti-vascular endothelial growth factor treatment for macular edema secondary to central retinal vein occlusion. Ophthalmol Retin. 2021;5:1115–24.


R.E.T.I.N.A. study group

Varun Chaudhary 1,2 , Mohit Bhandari 1,2 , Charles C. Wykoff 5,6 , Sobha Sivaprasad 8 , Lehana Thabane 2,7 , Peter Kaiser 9 , David Sarraf 10 , Sophie J. Bakri 11 , Sunir J. Garg 12 , Rishi P. Singh 13,14 , Frank G. Holz 15 , Tien Y. Wong 16,17 , and Robyn H. Guymer 3,4

Author information

Authors and affiliations.

Department of Surgery, McMaster University, Hamilton, ON, Canada

Sofia Bzovsky, Mohit Bhandari & Varun Chaudhary

Department of Health Research Methods, Evidence & Impact, McMaster University, Hamilton, ON, Canada

Mark R. Phillips, Lehana Thabane, Mohit Bhandari & Varun Chaudhary

Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, VIC, Australia

Robyn H. Guymer

Department of Surgery, (Ophthalmology), The University of Melbourne, Melbourne, VIC, Australia

Retina Consultants of Texas (Retina Consultants of America), Houston, TX, USA

Charles C. Wykoff

Blanton Eye Institute, Houston Methodist Hospital, Houston, TX, USA

Biostatistics Unit, St. Joseph’s Healthcare Hamilton, Hamilton, ON, Canada

Lehana Thabane

NIHR Moorfields Biomedical Research Centre, Moorfields Eye Hospital, London, UK

Sobha Sivaprasad

Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA

Peter Kaiser

Retinal Disorders and Ophthalmic Genetics, Stein Eye Institute, University of California, Los Angeles, CA, USA

David Sarraf

Department of Ophthalmology, Mayo Clinic, Rochester, MN, USA

Sophie J. Bakri

The Retina Service at Wills Eye Hospital, Philadelphia, PA, USA

Sunir J. Garg

Center for Ophthalmic Bioinformatics, Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA

Rishi P. Singh

Cleveland Clinic Lerner College of Medicine, Cleveland, OH, USA

Department of Ophthalmology, University of Bonn, Bonn, Germany

Frank G. Holz

Singapore Eye Research Institute, Singapore, Singapore

Tien Y. Wong

Singapore National Eye Centre, Duke-NUS Medical School, Singapore, Singapore


Contributions

SB was responsible for writing, critical review and feedback on manuscript. MRP was responsible for conception of idea, critical review and feedback on manuscript. RHG was responsible for critical review and feedback on manuscript. CCW was responsible for critical review and feedback on manuscript. LT was responsible for critical review and feedback on manuscript. MB was responsible for conception of idea, critical review and feedback on manuscript. VC was responsible for conception of idea, critical review and feedback on manuscript.

Corresponding author

Correspondence to Varun Chaudhary .

Ethics declarations

Competing interests.

SB: Nothing to disclose. MRP: Nothing to disclose. RHG: Advisory boards: Bayer, Novartis, Apellis, Roche, Genentech Inc.—unrelated to this study. CCW: Consultant: Acuela, Adverum Biotechnologies, Inc, Aerpio, Alimera Sciences, Allegro Ophthalmics, LLC, Allergan, Apellis Pharmaceuticals, Bayer AG, Chengdu Kanghong Pharmaceuticals Group Co, Ltd, Clearside Biomedical, DORC (Dutch Ophthalmic Research Center), EyePoint Pharmaceuticals, Gentech/Roche, GyroscopeTx, IVERIC bio, Kodiak Sciences Inc, Novartis AG, ONL Therapeutics, Oxurion NV, PolyPhotonix, Recens Medical, Regeron Pharmaceuticals, Inc, REGENXBIO Inc, Santen Pharmaceutical Co, Ltd, and Takeda Pharmaceutical Company Limited; Research funds: Adverum Biotechnologies, Inc, Aerie Pharmaceuticals, Inc, Aerpio, Alimera Sciences, Allergan, Apellis Pharmaceuticals, Chengdu Kanghong Pharmaceutical Group Co, Ltd, Clearside Biomedical, Gemini Therapeutics, Genentech/Roche, Graybug Vision, Inc, GyroscopeTx, Ionis Pharmaceuticals, IVERIC bio, Kodiak Sciences Inc, Neurotech LLC, Novartis AG, Opthea, Outlook Therapeutics, Inc, Recens Medical, Regeneron Pharmaceuticals, Inc, REGENXBIO Inc, Samsung Pharm Co, Ltd, Santen Pharmaceutical Co, Ltd, and Xbrane Biopharma AB—unrelated to this study. LT: Nothing to disclose. MB: Research funds: Pendopharm, Bioventus, Acumed—unrelated to this study. VC: Advisory Board Member: Alcon, Roche, Bayer, Novartis; Grants: Bayer, Novartis—unrelated to this study.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article.

Bzovsky, S., Phillips, M.R., Guymer, R.H. et al. The clinician’s guide to interpreting a regression analysis. Eye 36 , 1715–1717 (2022). https://doi.org/10.1038/s41433-022-01949-z


Received: 08 January 2022

Revised: 17 January 2022

Accepted: 18 January 2022

Published: 31 January 2022

Issue Date: September 2022

DOI: https://doi.org/10.1038/s41433-022-01949-z


Review of guidance papers on regression modeling in statistical series of medical journals

Affiliations.

  • 1 Institute of Biometry and Clinical Epidemiology, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Charité-Universitätsmedizin Berlin, Berlin, Germany.
  • 2 Center for Medical Statistics, Informatics and Intelligent Systems, Section for Clinical Biometrics, Medical University of Vienna, Vienna, Austria.
  • 3 School of Business and Economics, Emmy Noether Group in Statistics and Data Science, Humboldt-Universität zu Berlin, Berlin, Germany.
  • 4 Faculty of Medicine and Medical Center, Institute of Medical Biometry and Statistics, University of Freiburg, Freiburg, Germany.
  • 5 Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands.
  • PMID: 35073384
  • PMCID: PMC8786189
  • DOI: 10.1371/journal.pone.0262918

Although regression models play a central role in the analysis of medical research projects, there still exist many misconceptions on various aspects of modeling leading to faulty analyses. Indeed, the rapidly developing statistical methodology and its recent advances in regression modeling do not seem to be adequately reflected in many medical publications. This problem of knowledge transfer from statistical research to application was identified by some medical journals, which have published series of statistical tutorials and (shorter) papers mainly addressing medical researchers. The aim of this review was to assess the current level of knowledge with regard to regression modeling contained in such statistical papers. We searched for target series by a request to international statistical experts. We identified 23 series including 57 topic-relevant articles. Within each article, two independent raters analyzed the content by investigating 44 predefined aspects on regression modeling. We assessed to what extent the aspects were explained and if examples, software advices, and recommendations for or against specific methods were given. Most series (21/23) included at least one article on multivariable regression. Logistic regression was the most frequently described regression type (19/23), followed by linear regression (18/23), Cox regression and survival models (12/23) and Poisson regression (3/23). Most general aspects on regression modeling, e.g. model assumptions, reporting and interpretation of regression results, were covered. We did not find many misconceptions or misleading recommendations, but we identified relevant gaps, in particular with respect to addressing nonlinear effects of continuous predictors, model specification and variable selection. Specific recommendations on software were rarely given. Statistical guidance should be developed for nonlinear effects, model specification and variable selection to better support medical researchers who perform or interpret regression analyses.

Publication types

  • Research Support, Non-U.S. Gov't
  • Medical Writing*
  • Models, Statistical*
  • Periodicals as Topic
  • Regression Analysis*



Simple Linear Regression | An Easy Introduction & Examples

Published on February 19, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Simple linear regression is used to estimate the relationship between two quantitative variables . You can use simple linear regression when you want to know:

  • How strong the relationship is between two variables (e.g., the relationship between rainfall and soil erosion).
  • The value of the dependent variable at a certain value of the independent variable (e.g., the amount of soil erosion at a certain level of rainfall).

Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line, while logistic and nonlinear regression models use a curved line. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

If you have more than one independent variable, use multiple linear regression instead.

Table of contents

  • Assumptions of simple linear regression
  • How to perform a simple linear regression
  • Interpreting the results
  • Presenting the results
  • Can you predict values outside the range of your data?
  • Other interesting articles
  • Frequently asked questions about simple linear regression

Simple linear regression is a parametric test , meaning that it makes certain assumptions about the data. These assumptions are:

  • Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.
  • Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among observations.
  • Normality : The data follows a normal distribution .

Linear regression makes one additional assumption:

  • The relationship between the independent and dependent variable is linear : the line of best fit through the data points is a straight line (rather than a curve or some sort of grouping factor).

If your data do not meet the assumptions of homoscedasticity or normality, you may be able to use a nonparametric test instead, such as the Spearman rank test.

If your data violate the assumption of independence of observations (e.g., if observations are repeated over time), you may be able to perform a linear mixed-effects model that accounts for the additional structure in the data.


Simple linear regression formula

The formula for a simple linear regression is:

y = \beta_0 + \beta_1 X + \epsilon

  • y is the predicted value of the dependent variable for any given value of the independent variable (x).
  • β0 is the intercept: the predicted value of y when x is 0.
  • β1 is the regression coefficient: how much we expect y to change as x increases.
  • x is the independent variable (the variable we expect is influencing y).
  • ε is the error of the estimate: how much variation there is in our estimate of the regression coefficient.

Linear regression finds the line of best fit through your data by searching for the value of the regression coefficient (β1) that minimizes the total error (ε) of the model.

While you can perform a linear regression by hand , this is a tedious process, so most people use statistical programs to help them quickly analyze the data.

Simple linear regression in R

R is a free, powerful, and widely-used statistical program. Download the dataset to try it yourself using our income and happiness example.

Dataset for simple linear regression (.csv)

Load the income.data dataset into your R environment, and then run the following command to generate a linear model describing the relationship between income and happiness:
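The command itself appears to have been lost in extraction; based on the surrounding description, it is along these lines (the name of the model object is illustrative):

```r
# Fit a linear model of happiness as a function of income
income.happiness.lm <- lm(happiness ~ income, data = income.data)
```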

This code takes the data you have collected (data = income.data) and calculates the effect that the independent variable income has on the dependent variable happiness, using the linear model function lm().

To learn more, follow our full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function in R:
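The call referenced here was likewise dropped during extraction; it is simply:

```r
summary(income.happiness.lm)
```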

This function takes the most important parameters from the linear model and puts them into a table, which looks like this:

Simple linear regression summary output in R

This output table first repeats the formula that was used to generate the results (‘Call’), then summarizes the model residuals (‘Residuals’), which give an idea of how well the model fits the real data.

Next is the ‘Coefficients’ table. The first row gives the estimates of the y-intercept, and the second row gives the regression coefficient of the model.

Row 1 of the table is labeled (Intercept). This is the y-intercept of the regression equation, with a value of 0.20. You can plug this into your regression equation if you want to predict happiness values across the range of income that you have observed: happiness = 0.20 + 0.71 × income.

The next row in the ‘Coefficients’ table is income. This is the row that describes the estimated effect of income on reported happiness:

The Estimate column is the estimated effect, also called the regression coefficient. The number in the table (0.713) tells us that for every one-unit increase in income (where one unit of income = 10,000), there is a corresponding 0.71-unit increase in reported happiness (where happiness is measured on a scale of 1 to 10).

The Std. Error column displays the standard error of the estimate. This number shows how much variation there is in our estimate of the relationship between income and happiness.

The t value  column displays the test statistic . Unless you specify otherwise, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that our results occurred by chance.

The Pr(>| t |)  column shows the p value . This number tells us how likely we are to see the estimated effect of income on happiness if the null hypothesis of no effect were true.

Because the p value is so low ( p < 0.001),  we can reject the null hypothesis and conclude that income has a statistically significant effect on happiness.

The last three lines of the model summary are statistics about the model as a whole. The most important thing to notice here is the p value of the model. Here it is significant ( p < 0.001), which means that this model is a good fit for the observed data.

When reporting your results, include the estimated effect (i.e. the regression coefficient), standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what your regression coefficient means:

It can also be helpful to include a graph with your results. For a simple linear regression, you can simply plot the observations on the x and y axis and then include the regression line and regression function:

Simple linear regression graph


No! We often say that regression models can be used to predict the value of the dependent variable at certain values of the independent variable. However, this is only true for the range of values where we have actually measured the response.

We can use our income and happiness regression analysis as an example. Between 15,000 and 75,000, we found a regression coefficient of 0.73 ± 0.0193. But what if we did a second survey of people making between 75,000 and 150,000?

Extrapolating data in R

The regression coefficient for the relationship between income and happiness is now 0.21, or a 0.21-unit increase in reported happiness for every 10,000 increase in income. While the relationship is still statistically significant (p < 0.001), the slope is much smaller than before.

Extrapolating data in R graph

What if we hadn’t measured this group, and instead extrapolated the line from the 15–75k incomes to the 75–150k incomes?

You can see that if we simply extrapolated from the 15–75k income data, we would overestimate the happiness of people in the 75–150k income range.

Curved data line

If we instead fit a curve to the data, it seems to fit the actual pattern much better.

It looks as though happiness actually levels off at higher incomes, so we can’t use the same regression line we calculated from our lower-income data to predict happiness at higher levels of income.

Even when you see a strong pattern in your data, you can’t know for certain whether that pattern continues beyond the range of values you have actually measured. Therefore, it’s important to avoid extrapolating beyond what the data actually tell you.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Chi square test of independence
  • Statistical power
  • Descriptive statistics
  • Degrees of freedom
  • Pearson correlation
  • Null hypothesis

Methodology

  • Double-blind study
  • Case-control study
  • Research ethics
  • Data collection
  • Hypothesis testing
  • Structured interviews

Research bias

  • Hawthorne effect
  • Unconscious bias
  • Recall bias
  • Halo effect
  • Self-serving bias
  • Information bias

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative.

For example, the relationship between temperature and the expansion of mercury in a thermometer can be modeled using a straight line: as temperature increases, the mercury expands. This linear relationship is so certain that we can use mercury thermometers to measure temperature.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.

Cite this Scribbr article


Bevans, R. (2023, June 22). Simple Linear Regression | An Easy Introduction & Examples. Scribbr. Retrieved September 26, 2023, from https://www.scribbr.com/statistics/simple-linear-regression/


Original Research Article

Performance Evaluation of Regression Models for the Prediction of the COVID-19 Reproduction Rate


  • 1 School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
  • 2 Electrical and Computer Engineering Department, Effat University, Jeddah, Saudi Arabia
  • 3 School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
  • 4 Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan
  • 5 School of Social Sciences and Languages, Vellore Institute of Technology, Vellore, India

This paper aims to evaluate the performance of multiple non-linear regression techniques, such as support-vector regression (SVR), k-nearest neighbor (KNN), Random Forest Regressor, Gradient Boosting, and XGBOOST for COVID-19 reproduction rate prediction and to study the impact of feature selection algorithms and hyperparameter tuning on prediction. Sixteen features (for example, Total_cases_per_million and Total_deaths_per_million) related to significant factors, such as testing, death, positivity rate, active cases, stringency index, and population density are considered for the COVID-19 reproduction rate prediction. These 16 features are ranked using Random Forest, Gradient Boosting, and XGBOOST feature selection algorithms. Seven features are selected from the 16 features according to the ranks assigned by most of the above mentioned feature-selection algorithms. Predictions by historical statistical models are based solely on the predicted feature and the assumption that future instances resemble past occurrences. However, techniques, such as Random Forest, XGBOOST, Gradient Boosting, KNN, and SVR considered the influence of other significant features for predicting the result. The performance of reproduction rate prediction is measured by mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), R-Squared, relative absolute error (RAE), and root relative squared error (RRSE) metrics. The performances of algorithms with and without feature selection are similar, but a remarkable difference is seen with hyperparameter tuning. The results suggest that the reproduction rate is highly dependent on many features, and the prediction should not be based solely upon past values. In the case without hyperparameter tuning, the minimum value of RAE is 0.117315935 with feature selection and 0.0968989 without feature selection, respectively. The KNN attains a low MAE value of 0.0008 and performs well without feature selection and with hyperparameter tuning. The results show that predictions performed using all features and hyperparameter tuning is more accurate than predictions performed using selected features.

Introduction

The world has witnessed several deadly diseases at different times. In the year 2020, the world suffered a serious pandemic that took away many lives ( 1 ). The coronavirus disease (COVID-19) started as an epidemic and evolved into a pandemic. The disease was first discovered in late December 2019 in Wuhan, China ( 2 ). The virus responsible for causing the disease is the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which is highly contagious and causes severe respiratory issues. The virus rapidly spread across the world and affected 223 countries, infected more than 9.3 × 10⁷ people, and took over 2 × 10⁶ human lives, according to the WHO report in January 2021 ( 3 ). As a result, scientists and epidemiologists worldwide are investigating the virus to reduce its impact on human lives.

The coronavirus is named after the Latin word "corona" (crown), since the spikes on the surface of the virus resemble a crown. The virus was believed to be exclusively an animal virus until 2002. SARS-CoV is mostly found in bats and is transmitted to other animals, such as cats. The first human coronavirus infection was reported in 2003 in Guangdong province in the south of China ( 4 ).

Knowledge of the immune system of our body is required to understand the mechanisms of COVID-19 or any other viral infection. Viruses are microorganisms that make our body cells their hosts for replication and multiplication. The immune system is activated by the entry of the virus, which it identifies as an alien body to be destroyed. After attacking and killing the virus, the immune system "remembers" it and launches the same protective measures when the virus enters again. Viruses are capable of fast evolution: they evolve new shapes or mechanisms to survive in a changing environment.

Viral infections often affect people with weak immune systems. The elderly, children, and people with medical conditions are prone to the attack of novel viruses. The virus can be deadly and threatening to the senior population, especially the elderly with chronic medical conditions.

The SARS-CoV-2 is transmitted via respiratory droplets expelled by sneezing, coughing, or talking. The virus can also be contracted by touching a contaminated surface. One significant property of the SARS-CoV-2 is its capacity to survive on various surfaces for up to 9 days at room temperature, which facilitates its rapid transmission ( 5 ). Acute Respiratory Disease Syndrome is caused predominantly by this virus and often leads to multiple organ dysfunctions, resulting in physiological deterioration and even death of the infected persons ( 6 ).

This study is intended to predict the rate of reproduction of the deadly SARS-CoV-2. The reproduction rate (R o ) is an important parameter to predict the spread of a disease in a pandemic situation. The R o value indicates the transmissibility of a virus through the average number of new infections caused by an infectious person in a naïve population. The value of R o <1 indicates that the infection would die out. On the other hand, if the value is >1, the spread of the disease would increase. For example, a reproduction rate of 18 indicates that a single infected individual can potentially infect 18 more individuals. The reproduction rate is needed to determine whether the disease is under control or turning into an epidemic.

There are many standard methods for predicting the reproduction rate. XGBoost is an optimized Gradient Boosting algorithm with tree pruning, parallel processing, missing-value handling, efficient use of hardware resources, and regularization to avoid overfitting and bias. XGBoost has faster computational times ( 7 ) in all types of environments and improves on the basic Gradient Boosting algorithm; training models with its iterative boosting approach removes the errors of the preceding boosting trees in the following iterations ( 8 ). Support vector regression (SVR) is based on the Vapnik–Chervonenkis (VC) theory and is used when the output is a continuous numerical variable. Support vectors are the data points closest to the hyperplanes, and the hyperplanes represent the decision boundaries. The radial basis function is a commonly used kernel function; the purpose of a kernel function is to transform the data into a higher-dimensional space. SVR and a convolutional neural network (CNN) have been used to detect groundwater locations, and comparative results showed that the SVR outperforms the CNN ( 9 ). In k-nearest neighbor (KNN), the outcome variable is predicted by averaging the observations found in the same neighborhood: the algorithm assigns a weight w to the k nearest neighbors and a weight of 0 to all others.

The performance of the machine learning algorithms depends on the hyperparameter values. The values for the hyperparameters can be assigned in three ways:

1) Based on default values given in the software packages.

2) Manually configured by the user.

3) Assigned by algorithms, such as simple grid search, random search, Bayesian optimization, the ANOVA approach, and bio-inspired optimization algorithms (a sketch of a simple grid search follows this list).
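As a hedged sketch of option 3, the snippet below runs a simple cross-validated grid search with the caret package (assumed installed); the feature matrix and target here are simulated stand-ins, not the study's data:

```r
library(caret)
set.seed(7)
X <- data.frame(matrix(rnorm(200 * 7), ncol = 7))        # 7 illustrative features
y <- 0.5 * X[[1]] - 0.3 * X[[2]] + rnorm(200, sd = 0.2)  # stand-in for the R rate

grid <- expand.grid(k = c(3, 5, 7, 9))            # candidate numbers of neighbours
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
fit  <- train(x = X, y = y, method = "knn",
              tuneGrid = grid, trControl = ctrl, metric = "MAE")
fit$bestTune                                      # hyperparameter chosen by the search
```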

The process of identifying the most relevant features is referred to as "feature selection"; a brief sketch follows the advantages listed below. The three main advantages of feature selection are:

• Simplifying the interpretation of the model.

• Reducing the variance of the model to avoid overfitting.

• Reducing the computational cost (and time) for model training.
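A sketch of one such ranking, using permutation importance from the randomForest package (assumed installed); the 16 features and the target are again simulated stand-ins:

```r
library(randomForest)
set.seed(8)
X <- data.frame(matrix(rnorm(200 * 16), ncol = 16))      # 16 illustrative features
y <- 0.5 * X[[1]] - 0.3 * X[[2]] + rnorm(200, sd = 0.2)

rf  <- randomForest(x = X, y = y, importance = TRUE)
imp <- importance(rf, type = 1)             # %IncMSE: permutation importance
head(sort(imp[, 1], decreasing = TRUE), 7)  # keep the seven top-ranked features
```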

Artificial intelligence (AI) has been successful in many fields and facilitates our daily life in various ways ( 10 – 17 ). The reproduction rate prediction is crucial in successfully establishing public healthcare in the battle against COVID-19. The prediction of the reproduction rate is performed by using not just the past values but also by using the closely related factors. This work also investigates the impact of feature selection and hyperparameter tuning on the performance of non-linear machine learning techniques.

Research Gap and Contribution

The reproduction rate is related to many factors, such as the average number of contacts a person has, the number of days a person remains infectious after exposure to the disease, the number of active cases, the value of the stringency index, testing capacity, test positivity, and so on. As a result, the reproduction rate, its time curve, and its future values cannot be satisfactorily estimated by probability distribution functions alone (18).

Time series prediction models, such as the autoregressive integrated moving average (ARIMA), Grey Model, and Markov Chain models, do not consider multiple factors in reproduction rate prediction. Autoregressive models assume that future values resemble past ones. Mechanistic models based on the susceptible, exposed, infected, and recovered (SEIR) framework, or modified versions of it, use time series data on currently confirmed cases, removed cases (including recovered and deceased cases), and time-varying transmission rates. However, some factors are still not included, and the factors are not weighted (19).

There thus arises a need to study the various factors acting on the reproduction rate and to prioritize them. Hence, the importance of various features (for example, Total_cases_per_million and Total_deaths_per_million) under factors like active cases, stringency index, testing capacity, and positivity is identified using feature selection algorithms. Multiple regression uses several explanatory variables to predict a single response variable. Non-linear machine learning techniques, namely Random Forest, XGBOOST, Gradient Boosting, KNN, and SVR, are used for reproduction rate prediction. The performance of these approaches in predicting the COVID-19 reproduction rate was measured using evaluation metrics such as the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), R-Squared, relative absolute error (RAE), and root relative squared error (RRSE). The influence of feature selection and hyperparameter tuning on their performance is also studied.

Structure of the Paper

Section Introduction of the paper introduces the reproduction rate, feature selection, machine learning techniques and hyperparameters, and the motivation of this study. Section Related Works discusses the related works, as well as the identified research gap and the contributions. Section Materials and Methods describes the methods used in this work, including feature selection, hyperparameter tuning, and prediction and performance measurement. Section Results and Discussion discusses the experimental results. Finally, section Conclusion and Future Work provides the conclusion and future work.

Related Works

Zivkovic et al. ( 20 ) proposed a hybridized machine learning model: an adaptive neuro-fuzzy inference system (ANFIS) combined with enhanced beetle antennae search (BAS) swarm intelligence metaheuristics. The results showed that the system is a good prediction model for time series forecasting. Defects in the BAS algorithm were rectified in the Cauchy exploration strategy BAS (CESBAS) using the Cauchy mutation and three additional control parameters. Selecting optimal values for the ANFIS parameters is an NP-hard optimization problem, so the ANFIS parameter values were determined using the CESBAS metaheuristic. Performance metrics such as RMSE, MAE, MAPE, RMRE, and R-Squared were used to evaluate the outcomes on influenza datasets.

The research goal of Mojjada et al. ( 21 ) was to forecast the number of new COVID-19 cases, mortalities, and recoveries using various machine learning regression models, such as the least absolute shrinkage and selection operator (LASSO), support vector machines (SVM), and exponential smoothing (ES) models. While the linear regression and LASSO models were more effective in estimating and verifying the death rate, the ES model provided the overall best results.

Farooq and Bazaz ( 22 ) used an artificial neural network (ANN) to forecast COVID-19 with an online incremental learning technique built on an adaptive and non-intrusive analytical model. Because the COVID-19 data was updated every day, online incremental learning was the best option for forecasting, since the model does not need to be retrained or rebuilt from scratch.

Milind et al. ( 23 ) identified many factors behind the spread of the coronavirus, such as the relationship between the weather and the spread of COVID-19, the growth rate, and mitigation. Support vector regression (SVR) was used to predict the transmission rate, the end of the epidemic, and the spread of the coronavirus across regions, and to analyze growth rates and types of mitigation across countries. The Pearson coefficient was used to measure the correlation between coronavirus spread and weather factors, such as the wind speed, temperature, and humidity of Milan in Italy and New York City in the United States. SVR is a non-parametric technique since it depends only on the kernel function, implying that there is no need to transform the explanatory variables when constructing a non-linear model. The study also compared the performances of SVR, linear regression, and polynomial regression.

Chicco and Jurman ( 24 ) predicted the survival of patients who had heart failure based on the ejection fraction and serum creatinine level. A database of 299 patients collected in 2015 was used. Feature selection was performed, and the factors were ranked. The ejection fraction and serum creatinine levels were found to be highly relevant among the 13 selected features. As a result, the prediction model was built and executed based on these two factors.

Mortazavi et al. ( 25 ) investigated the capability of machine learning techniques when applied to high-dimensional and non-linear relationships. They predicted the readmission of patients hospitalized for heart failure. The prediction was performed with various machine learning techniques, such as Random Forest, Gradient Boosting, and Random Forest combined hierarchically with support vector machines (SVMs), logistic regression (LR), or Poisson regression. The obtained results were tested against traditional LR methods. The models were evaluated using the receiver operating characteristic (ROC) curve (C statistic), the positive predictive value (PPV), sensitivity, specificity, and F-score. The ROC was found to be a good measure of model discrimination.

Balli ( 26 ) analyzed COVID-19 data from Germany, the United States, and other parts of the world. The SVM, linear regression, multilayer perceptron, and Random Forest methods were used to model the COVID-19 data, and their performances were compared using the RMSE, absolute percentage error (APE), and mean absolute percentage error (MAPE). Among the tested methods, the SVM outperformed all others in modeling the COVID-19 data and was successfully used to diagnose the behavior of cumulative COVID-19 data over time.

A system to handle data with non-linear relationships and non-normal distributions was proposed by Kuo and Fu ( 27 ). A total of 52 input variables relating to confirmed cases, environment variables, country-dependent variables, community mobility variables, and time series variables were used in the study ( 27 ). Population mobility was found to increase the number of infections over weekends. This work serves as a basis for researchers analyzing geographical characteristics and seasonality, as well as models such as long short-term memory (LSTM), ARIMA, and convolutional neural networks (CNNs).

The COVID Patient Detection System (CPDS) of Shaban et al. ( 28 ) was designed around a Hybrid Feature Selection Method (HFSM) consisting of two stages: a fast selection stage (FS²) and an accurate selection stage (AS²). The FS² used several filter methods, and the filtered features served as the initial population of a genetic algorithm, which was used as a wrapper method. An enhanced K-nearest neighbor (EKNN) classifier was used to solve the trapping problem. The most significant features were selected from the chest CT images of patients, and the HFSM allowed the EKNN classifier to make rapid predictions with high accuracy. The proposed feature selection algorithm was compared with four recent feature selection techniques, and the proposed CPDS achieved an accuracy of 96%.

Sujatha et al. ( 29 ) utilized linear regression, multi-layer perceptron (MLP), and vector autoregression (VAR) models to forecast the spread of COVID-19 using the COVID-19 Kaggle data. The correlations between the features of the dataset are crucial for finding the dependencies. The VAR model is well suited to multivariate time series analysis: it is an m-equation, m-variable model in which each variable is explained by its own current and past values. The MLP methods provided better predictions than the linear regression and VAR models.

Yang et al. ( 30 ) predicted the number of new confirmed cases using SEIR and AI methods. The authors used the probability of transmission, incubation rate, and the probability of recovery or death as factors in the predictions. New case predictions made by the AI method are more accurate than the SEIR predictions.

The Gradient Boosting Feature Selection (GBFS) algorithm learns the ensemble of regression trees to identify the non-linear relationship between features with ease. The classification error rates for the GBFS and Random Forest methods are the lowest, whereas the L1-regularized logistic regression (L1-LR) and Hilbert–Schmidt independence criterion (HSIC) Lasso methods have higher error rates ( 31 ).

The XGBOOST algorithm was applied to calculate the business risk by Wang ( 32 ). Several feature selection methods were used to find the redundant features. Two hyper-parameter optimization approaches were applied: random search (RS) and Bayesian tree-structured Parzen Estimator (TPE). The XGBOOST with hyper-parameter optimization performed well for business risk modeling.

Chintalapudi et al. ( 33 ) used the predicted reproduction rate to forecast the daily and cumulative COVID-19 cases for the next 30 days in Marche, Italy. The probability-based prediction was performed with the maximum likelihood function; in the implementation, a simple linear regression was fitted to the exponential growth of infected incidences over time. This study showed that the outbreak size and daily incidence depend primarily on the daily reproduction number.

Locatelli et al. ( 34 ) estimated the COVID-19 reproduction rate of Western Europe as an average over 15 countries. For the reproduction rate estimation, the authors used the generation interval, defined as the time needed for an infected person to infect another person. The works by Zhang et al. ( 35 ), Srinivasu et al. ( 36 ), Panigrahi et al. ( 37 , 38 ), Tamang ( 39 ), Chowdhary et al. ( 40 ), and Gaur et al. ( 41 ) demonstrated the efficacy of machine learning algorithms in various fields.

Materials and Methods

The spread of COVID-19 depends on many factors. New factors influencing the spread of the disease are still being discovered, and the identification of the predominant ones is crucial. The prediction of COVID-19 spread is highly related to the reproduction rate. Data science can be applied to track the crucial features for the prediction from any number of candidate features. Traditional statistical approaches, such as the chi-square test and the Pearson correlation coefficient, quantify the importance of features in relation to one another. Feature selection reduces overfitting and underfitting problems, computational cost, and time. Reproduction rate prediction is important since the rate reflects the status of the COVID-19 outbreak. Feature ranking is performed using Random Forest regression, Gradient Boosting, and XGBoost. Seven factors are considered in this study: the total number of cases, the number of new cases, the total number of deaths, the total number of cases per million, the total number of deaths per million, the total number of tests conducted per thousand, and the positive rate. The proposed system architecture is represented in Figure 1.


Figure 1 . Proposed system architecture.

Reproduction Rate

Newly emerging diseases can be detrimental to humans and other animals, whether they are caused by a new pathogen or a modified form of an existing one ( 19 ). In this work, simple compartmental disease models and matrix methods are used to calculate the reproduction rate, R0.
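The paper does not spell out the matrix computation; in the standard next-generation matrix formulation, R0 is obtained as a spectral radius:

$$R_{0} = \rho\left(F V^{-1}\right),$$

where $F$ collects the rates of new infections, $V$ the transition rates between compartments, and $\rho(\cdot)$ denotes the spectral radius. For the basic SIR model this reduces to $R_{0} = \beta/\gamma$, the transmission rate divided by the recovery rate.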

Feature Selection

Embedded feature selection methods that take the regression process into account, namely Random Forest, Gradient Boosting, and XGBoost, are used in this work. In the Random Forest approach, hundreds of decision trees are constructed by drawing random observations and random subsets of features; training identifies the features that most reduce impurity. Gradient Boosting and XGBoost combine weak learners through boosting: Gradient Boosting strives to minimize the error between the predicted and actual values, while XGBoost is an extreme, regularized form of Gradient Boosting ( 42 ). XGBoost is fast thanks to L1 and L2 regularization and parallel computing, and it delivers high performance since it works with the second partial derivatives of the loss function. The main strength of the Random Forest algorithm lies in its ability to prevent overfitting and increase accuracy, while the advantage of Gradient Boosting is its ability to tune many hyperparameters and loss functions.
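A minimal sketch of this embedded ranking, assuming a pandas DataFrame `df` whose columns follow the paper's field names and a `Reproduction_rate` target column (both assumptions); the scores come from each fitted model's `feature_importances_` attribute:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

features = ["Total_cases", "New_cases", "Total_deaths", "Total_cases_per_million",
            "Total_deaths_per_million", "Total_tests", "Total_tests_per_thousand",
            "Positive_rate"]
X, y = df[features], df["Reproduction_rate"]  # df and column names as assumed above

scores = {}
for name, model in [("RandomForest", RandomForestRegressor(n_estimators=100, random_state=42)),
                    ("GradientBoosting", GradientBoostingRegressor(random_state=42)),
                    ("XGBoost", XGBRegressor(n_estimators=100))]:
    model.fit(X, y)
    scores[name] = pd.Series(model.feature_importances_, index=features)

# Rank the features by each model's importance score, highest first.
print(pd.DataFrame(scores).sort_values("RandomForest", ascending=False))
```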

Parameter Settings

The experiments use multiple non-linear regression tree algorithms, implemented in Python. For the experiments without hyperparameter tuning, the default values in the scikit-learn library are used. For Random Forest regression, the parameter values are initialized as follows: n_estimators = 100, n_jobs = -1, oob_score = True, bootstrap = True, and random_state = 42. For XGBoost, the XGBRegressor method is used to fit the data, and n_estimators is set to 100. The Gradient Boosting feature importance is calculated by setting n_estimators to 500, max_depth to 4, min_samples_split to 5, learning_rate to 0.01, and loss to "ls". For the KNN algorithm, the lowest error rate is achieved when the K value equals 7. For the SVR, the radial basis function kernel is used with degree = 3 and gamma = scale. For the experiments with hyperparameter tuning, grid search and randomized search approaches are used. A grid search exhaustively tests all possible combinations of the specified hyperparameter values for an estimator, whereas a randomized search samples the combinations randomly; both are effective ways of tuning the parameters to increase the generalizability of the model. The GridSearchCV method of sklearn tunes the hyperparameters of the SVR, KNN, XGBoost, and Gradient Boosting approaches, while the RandomizedSearchCV function is used for the hyperparameter tuning of the Random Forest regressor.
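Expressed in code, the stated defaults and tuning setup look roughly as follows; the paper does not give the search grids, so the grids below are illustrative placeholders:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Default Random Forest configuration stated in the paper.
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True,
                           bootstrap=True, random_state=42)

# Grid search for SVR and KNN (grids are illustrative, not from the paper).
svr_search = GridSearchCV(SVR(kernel="rbf", degree=3, gamma="scale"),
                          param_grid={"C": [0.1, 1, 10], "epsilon": [0.01, 0.1]},
                          scoring="neg_mean_absolute_error", cv=5)
knn_search = GridSearchCV(KNeighborsRegressor(),
                          param_grid={"n_neighbors": [3, 5, 7, 9, 11]},
                          scoring="neg_mean_absolute_error", cv=5)

# Randomized search for the Random Forest, as described in the paper.
rf_search = RandomizedSearchCV(rf,
                               param_distributions={"n_estimators": [100, 300, 500],
                                                    "max_depth": [None, 5, 10, 20]},
                               n_iter=10, cv=5, random_state=42)
```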

The dataset was taken from the repository https://github.com/owid/covid-19-data/tree/master/public/data ( 43 ). A total of 16 fields were used for the study of the reproduction rate: Total_cases, New_cases, Total_deaths, New_deaths, Total_cases_per_million, New_cases_per_million, Total_deaths_per_million, New_deaths_per_million, New_tests, Total_tests, Total_tests_per_thousand, New_tests_per_thousand, Positive_rate, Tests_per_case, Stringency_index, and Population_density. Records from April 1, 2020, to November 30, 2020, are used as training data (244 daily records), and records from December 1, 2020, to March 10, 2021, are used as testing data (100 daily records).
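One possible loading and date-based split, assuming the CSV export of that repository (the actual OWID column names are lowercase, e.g. total_cases and reproduction_rate, so the paper's capitalized field names would be mapped onto them; the location filter is a placeholder, as the paper does not name one):

```python
import pandas as pd

URL = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv"
df = pd.read_csv(URL, parse_dates=["date"])

# One row per day for a single location, as implied by the daily record counts.
df = df[df["location"] == "India"]  # "India" is a hypothetical placeholder

train = df[(df["date"] >= "2020-04-01") & (df["date"] <= "2020-11-30")]  # 244 daily records
test = df[(df["date"] >= "2020-12-01") & (df["date"] <= "2021-03-10")]   # 100 daily records
```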

Performance Metrics

Numerous machine learning (ML)-based predictive modeling techniques are used in the COVID-19 predictions. Therefore, there is a need to measure the performance of each model and its prediction accuracy. The metrics used to assess the effectiveness of the model in predicting the outcome are very important since they influence the conclusion. The performance metrics to identify the error rate between the predicted and observed values are as follows:

• Root mean square error (RMSE)

• Mean absolute error (MAE)

• Determination coefficient ( R 2 )

• Relative absolute error (RAE)

• Root relative squared error (RRSE)

Mean Absolute Error

The mean absolute error measures the average absolute difference between the predicted output and the actual output. Because all deviations carry equal weight, the MAE alone cannot indicate whether a model is under-predicting or over-predicting.

Equation 1 provides the formula to calculate the MAE, where SWL_FOR,i represents the forecast output, SWL_OBS,i represents the actual output, N represents the total number of data points, and i indexes a single data entry.
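In that notation, Equation 1 takes the standard MAE form:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\mathit{SWL}_{\mathrm{FOR},i} - \mathit{SWL}_{\mathrm{OBS},i}\right| \qquad (1)$$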

Root Mean Squared Error

The RMSE is the square root of the average squared deviation between the forecast and the actual output, as given in Equation 2. It is used when the error is highly non-linear. The RMSE indicates the average magnitude of the errors in the predicted data and is a good measure of prediction accuracy.
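Written out in the same notation, Equation 2 is the standard RMSE:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\mathit{SWL}_{\mathrm{FOR},i} - \mathit{SWL}_{\mathrm{OBS},i}\right)^{2}} \qquad (2)$$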

Determination Coefficient

The R² metric shows the percentage of the variation in y that is explained by the x-variables, where x and y denote the data. It quantifies how well future outcomes are likely to be predicted by the model, as given in Equation 3.
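In its standard form, Equation 3 compares the residual sum of squares with the total sum of squares of the actual values:

$$R^{2} = 1 - \frac{\sum_{i=1}^{N}\left(\mathit{SWL}_{\mathrm{OBS},i} - \mathit{SWL}_{\mathrm{FOR},i}\right)^{2}}{\sum_{i=1}^{N}\left(\mathit{SWL}_{\mathrm{OBS},i} - \overline{\mathit{SWL}}_{\mathrm{OBS}}\right)^{2}} \qquad (3)$$

where $\overline{\mathit{SWL}}_{\mathrm{OBS}}$ is the mean of the actual values.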

Relative Absolute Error

The Relative Absolute Error (RAE) metric gives the ratio of the model's residual error to the forecast error of a naïve model. Equation 4 returns a value less than 1 when the proposed model performs better than the naïve model. In Equation 4, "P" stands for the predicted value and "A" for the actual value.
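With the naïve model taken as the mean of the actual values, Equation 4 has the standard form:

$$\mathrm{RAE} = \frac{\sum_{i=1}^{N}\left|P_{i} - A_{i}\right|}{\sum_{i=1}^{N}\left|\bar{A} - A_{i}\right|} \qquad (4)$$

where $\bar{A}$ denotes the mean of the actual values.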

Root Relative Squared Error

The Root Relative Squared Error (RRSE) is the square root of the relative squared error (RSE). The RSE metric compares the actual forecast error to the forecast error of a naïve model, and it can be used for models whose errors are measured in different units. As given in Equations 5 and 6, the total squared error is divided by the total squared error of a simple predictor, where the simple predictor is just the average of the actual values. The predicted output is "P" and "T" is the target value; the index "i" denotes the model and "j" the record. The RRSE, the square root of the relative squared error, expresses the error in the dimensions of the quantity being predicted, as given in Equation 7, where RSE_i is the relative squared error for model "i".
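In that notation, Equations 5–7 take the standard form:

$$\mathrm{RSE}_{i} = \frac{\sum_{j=1}^{N}\left(P_{ij} - T_{j}\right)^{2}}{\sum_{j=1}^{N}\left(T_{j} - \bar{T}\right)^{2}}, \qquad \bar{T} = \frac{1}{N}\sum_{j=1}^{N} T_{j} \qquad (5,\,6)$$

$$\mathrm{RRSE}_{i} = \sqrt{\mathrm{RSE}_{i}} \qquad (7)$$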

Results and Discussion

All experiments were performed using Python's scikit-learn library in a Jupyter notebook. The feature importance scores obtained by Random Forest regression, XGBoost, and Gradient Boosting are given in Table 1 and plotted in Figure 2.


Table 1 . Feature Importance Scores obtained using Random Forest Regression, XGBoost, and Gradient Boosting.


Figure 2 . Comparison graph of feature score given by feature selection algorithms.

Out of the 16 features, the top features affecting the reproduction rate are identified from the obtained feature importance scores. These are Total_cases, New_cases, Total_deaths, Total_cases_per_million, Total_deaths_per_million, Total_tests, Total_tests_per_thousand, and Positive_rate.

Experiments were conducted for the reproduction rate prediction using the Random Forest, XGBoost, Gradient Boosting, support vector regression (SVR), and k-nearest neighbor (KNN) regression methods. In addition, the experiments were intended to investigate the impacts of feature selection and hyperparameter tuning, so four experiments were conducted with and without feature selection or hyperparameter tuning. Experiment 1 used the five regression techniques without feature selection and without parameter tuning; Experiment 2 was conducted with feature selection and without parameter tuning; Experiment 3 without feature selection and with hyperparameter tuning; and Experiment 4 with feature selection and with parameter tuning. The reproduction rate prediction was measured using the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), R-Squared, relative absolute error (RAE), and root relative squared error (RRSE). The best, second-best, and worst results for the particular metrics and experiments are discussed in detail below. The best results obtained for each metric are given in bold in Tables 2, 3, 5, and 6.
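These six metrics can be computed directly from the predictions; RAE and RRSE have no scikit-learn helpers, so a small sketch like the following (with y_true and y_pred as NumPy arrays) covers all of them:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    # RAE and RRSE compare the model's error with a naive mean predictor.
    rae = np.sum(np.abs(y_pred - y_true)) / np.sum(np.abs(y_true.mean() - y_true))
    rrse = np.sqrt(np.sum((y_pred - y_true) ** 2)
                   / np.sum((y_true.mean() - y_true) ** 2))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "R2": r2, "RAE": rae, "RRSE": rrse}
```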


Table 2 . Prediction without Feature Selection and without hyperparameter tuning.


Table 3 . Prediction with Feature Selection and without hyperparameter tuning.

The first experiment used all of the features for reproduction rate prediction, and each regression technique used its default hyperparameter values. The resulting performance metric values are given in Table 2. The Gradient Boosting method performs well, with the lowest MSE, RMSE, RAE, and RRSE values and the second-best score for MAE. Random Forest is the next best algorithm, with the best R-Squared value and the second-best scores for MSE, RMSE, and RAE. XGBoost shows average performance overall, although it obtained the lowest MAE of 0.0189412, followed by Gradient Boosting with the second-best MAE of 0.02226608. The SVR has the worst scores on all of the performance metrics, with the highest MAE of 0.0712651. The minimum MSE, RMSE, RAE, and RRSE values of 0.00135535, 0.036815107, 0.1173159354, and 0.1472868, respectively, are achieved by Gradient Boosting, and Random Forest achieves the maximum R-Squared value of 0.97923. The obtained metric values are plotted in Figure 3.


Figure 3 . Performance comparison graph of regression techniques without feature selection and without hyperparameter tuning.

The selected features were used in the second experiment, and the prediction was performed without hyperparameter tuning. With Random Forest regression, the MAE, MSE, and RMSE values fall by about 0.01, and there is a marginal reduction in these values for XGBoost and Gradient Boosting. The lowest MAE value of 0.018391, the lowest RAE value of 0.096898, and the best R-Squared value of 0.9796126 are achieved by XGBoost. Random Forest gives the lowest MSE of 0.00139, RMSE of 0.037380, and RRSE of 0.159998. Of all the algorithms, the SVR produces the highest error rates. The results are given in Table 3 and plotted in Figure 4. Overall, in the experiment with feature selection and without hyperparameter tuning, Random Forest achieved the top performance with three best and two second-best scores, XGBoost took the best scores for MAE, R-Squared, and RAE and the second-best score for RRSE, and SVR remained the worst performer for reproduction rate prediction.


Figure 4 . Performance comparison graph of regression techniques with feature selection algorithm and without hyperparameter tuning.

Experiment 3 was conducted without feature selection and with hyperparameter tuning; the tuned hyperparameter values are listed in Table 4. The results improve after hyperparameter tuning with grid search or random search, and they are analyzed based on the best, second-best, and worst scores. KNN tops all of the other algorithms when the experiment is performed without feature selection and with hyperparameter tuning, and Random Forest is the next best algorithm, with the second-best scores for MSE, RMSE, R-Squared, and RRSE. The values are given in Table 5 and plotted in Figure 5.


Table 4 . Best tuned values of the hyperparameters for the different regression techniques.


Table 5 . Prediction without Feature Selection and with hyperparameter tuning.


Figure 5 . Performance comparison graph of regression techniques without feature selection and with hyperparameter tuning.

Experiment 4 was conducted with feature selection and with parameter tuning. In this experiment, the Random Forest approach has the best scores for MSE, RMSE, R-Squared, and RRSE, while XGBOOST has the best scores for MAE and RAE. Gradient Boosting has the second-best scores for MSE, RMSE, R-Squared, and RRSE, and KNN has two second-best scores. Again, the SVR has the worst scores on all of the performance metrics. The values are given in Table 6 and plotted in Figure 6.


Table 6 . Prediction with Feature Selection and with hyperparameter tuning.


Figure 6 . Performance comparison graph of regression techniques with feature selection algorithm and with hyperparameter tuning.

The computation times for the different types of prediction are listed in Table 7. The Random Forest algorithm uses the randomized search technique for hyperparameter tuning, which requires more time; all of the other algorithms use the grid search technique. The KNN and SVR are able to perform hyperparameter tuning rapidly, while XGBOOST and Gradient Boosting regression have moderate running times of around 100 s.


Table 7 . Running time of algorithms with hyperparameter tuning and prediction.

The predicted and actual reproduction rates for Random Forest, KNN, SVR, XGBoost, and Gradient Boosting are plotted in Figures 7–11, respectively. The graphs show that the predicted values are very close to the actual values.


Figure 7 . Graph describing the predicted value vs. actual value of Random Forest.


Figure 8 . Graph describing the predicted value vs. actual value of KNN.


Figure 9 . Graph describing the predicted value vs. actual value of SVR.


Figure 10 . Graph describing the predicted value vs. actual value of XGBoost.


Figure 11 . Graph describing the predicted value vs. actual value of Gradient Boosting.

The major contributions of this paper are the study of features affecting the COVID-19 reproduction rate, as well as the investigation into the effects of feature selection and hyperparameter tuning on the prediction accuracy. Furthermore, prediction accuracy comparisons of the state-of-the-art regression techniques for COVID-19 reproduction rate have also been performed.

The selected features suggest that the total numbers of deaths and tests also influence the reproduction rate. Instead of depending only on the past values of the predictor variable, as in Milind et al. ( 23 ), our work identifies the crucial features affecting the predicted variable, and different regression techniques are then used to determine the final reproduction rate. The effectiveness of feature selection in prediction has also been demonstrated. Random Forest achieved the best performance in the accuracy comparison of the state-of-the-art techniques, in line with Chicco and Jurman ( 24 ). Across the results of the four experiments, the overall best values of MAE, MSE, RMSE, RAE, RRSE, and R-Squared were all obtained by the KNN approach; KNN therefore obtained the best performance on average, followed by Random Forest and XGBOOST.

Conclusion and Future Work

Predicting the reproduction rate is crucial, especially when a country has to take preventive measures to protect its citizens from a pandemic. Autoregressive models rely only on previous values to forecast future ones. Non-linear machine learning regression algorithms have consistently produced the best prediction results in various applications, including the stock exchange, banking, and weather forecasting. Among the many factors involved in the spread of COVID-19, the prominent ones are identified in this work using Random Forest, Gradient Boosting, and XGBOOST. Random Forest returned its highest importance score of 0.10196 for Total_cases_per_million; for XGBOOST, the maximum score of 0.92185 was obtained for Total_cases; and for Gradient Boosting, the top value of 0.1183 was obtained for Total_deaths_per_million. Out of the 16 features selected for investigation, Total_cases, New_cases, Total_deaths, Total_cases_per_million, Total_deaths_per_million, Total_tests, Total_tests_per_thousand, and Positive_rate were found to be prominent in reproduction rate prediction. Furthermore, this work investigated reproduction rate prediction with non-linear machine learning regression techniques. The experiments were performed using Random Forest, Gradient Boosting, XGBOOST, KNN, and SVR, with and without feature selection and hyperparameter tuning. The results showed a decrease in the prediction error rate with hyperparameter tuning and with all of the features. Overall, the KNN algorithm obtained the best performance on average, while Random Forest obtained the best performance with hyperparameter tuning and selected features. Individual regression techniques were applied in this study; an ensemble of regression techniques could be applied to obtain better performance. The regression algorithms obtained improved results with hyperparameter tuning via the grid search or randomized search methods. There is no remarkable difference in the prediction accuracy of the algorithms with and without feature selection, so further work is needed to find the optimal features related to the reproduction rate.

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.

Author Contributions

C-YC and KSr: conceptualization. C-YC: resources, project administration, and funding acquisition. JK and KSu: methodology and software. KSr, SM, and C-YC: validation. KSr, SM, and SC: writing—review and editing. JK: writing—original draft preparation. All authors contributed to the article and approved the submitted version.

Funding

This research was partially funded by the Intelligent Recognition Industry Service Research Center from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan. Grant number: N/A and the APC were funded by the aforementioned Project.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

1. Available online at: https://www.history.com/topics/middle-ages/pandemics-timeline (accessed May 20, 2021).

2. WHO (2020). Available online at: https://www.who.int/dg/speeches/detail/whodirector-general-s-opening-remarks-at-the-mission-briefing-oncovid-19-$-$12-march-2020 (accessed May 20, 2021).

3. Available online at: https://www.who.int/emergencies/diseases/novel-coronavirus-2019 (accessed May 21, 2021).

4. Wadhwa P, Aishwarya Tripathi A, Singh P, Diwakar M, Kumar N. Predicting the time period of extension of lockdown due to increase in rate of COVID - 19 cases in India using machine learning. Mater Today. (2020) 37:2617–22. doi: 10.1016/j.matpr.2020.08.509


5. Van Doremalen N, Bushmaker T, Morris DH, Holbrook MG, Gamble A, Williamson BN, et al. Aerosol and surface stability of SARSCoV- 2 as compared with SARS-CoV-1. N Engl J Med. (2020) 382:1564–7. doi: 10.1056/NEJMc2004973

6. Gibson PG, Qin L, Puah SH. COVID-19 acute respiratory distress syndrome (ARDS): clinical features and differences from typical pre-COVID-19 ARDS. Med. J. Australia. (2020) 2:54–6. doi: 10.5694/mja2.50674

7. Bhattacharya SRK, Maddikunta PKR, Kaluri R, Singh S, Gadekallu TR, Alazab M, et al. A novel PCA-firefly based xgboost classification model for intrusion detection in networks using GPU. Electronics. (2020) 9:219. doi: 10.3390/electronics9020219


8. Luckner M, Topolski B, Mazurek M. Application of XGBoost algorithm in fingerprinting localisation task. In: 16th IFIP International Conference on Computer Information Systems and Industrial Management . Bialystok: Springer (2017). p. 661–71. doi: 10.1007/978-3-319-59105-6_57

9. van den Driessche P. Reproduction numbers of infectious disease models. Infect Dis Model. (2017) 2:288–303. doi: 10.1016/j.idm.2017.06.002

10. Iwendi C, Bashir AK, Peshkar A, Sujatha R, Chatterjee JM, Pasupuleti S, et al. COVID-19 patient health prediction using boosted random forest algorithm. Front Public Health . (2020) 8:357. doi: 10.3389/fpubh.2020.00357

11. Bhattacharya S, Maddikunta PKR, Pham QV, Gadekallu TR, Krishnan SSR, Chowdhary CL, et al. Deep learning and medical image processing for coronavirus (COVID-19) pandemic: a survey. Sustain Cities Soc . (2021) 65:102589. doi: 10.1016/j.scs.2020.102589

12. Iwendi C, Maddikunta PKR, Gadekallu TR, Lakshmanna K, Bashir AK, Piran MJ. A metaheuristic optimization approach for energy efficiency in the IoT networks. Softw Pract Exp . (2020). doi: 10.1002/spe.2797. [Epub ahead of print].

13. Dhanamjayulu C, Nizhal UN, Maddikunta PK, Gadekallu TR, Iwendi C, Wei C, et al. Identification of malnutrition and prediction of BMI from facial images using real-time image processing and machine learning. IET Image Processing . (2021). doi: 10.1049/ipr2.12222. [Epub ahead of print].

14. Srinivasan K, Garg L, Chen B, Alaboudi AA, Jhanjhi NZ, Chang CT, et al. Expert system for stable power generation prediction in microbial fuel cell. Intellig Automat Soft Comput . (2021) 30:17–30. doi: 10.32604/iasc.2021.018380

15. Srinivasan K, Garg L, Datta D, Alaboudi AA, Jhanjhi NZ, Chang CT, et al. Performance comparison of deep cnn models for detecting driver's distraction. Comput Mater Continua . (2021) 68:4109–24. doi: 10.32604/cmc.2021.016736

16. Srinivasan K, Mahendran N, Vincent DR, Chang C-Y, Syed-Abdul S. Realizing an integrated multistage support vector machine model for augmented recognition of unipolar depression. Electronics . (2020) 9:647. doi: 10.3390/electronics9040647

17. Sundararajan K, Garg L, Srinivasan K, Bashir AK, Kaliappan J, Ganapathy GP, et al. A contemporary review on drought modeling using machine learning approaches. CMES Comput Model Eng Sci . (2021) 128:447–87. doi: 10.32604/cmes.2021.015528

18. Khosravi A, Chaman R, Rohani-Rasaf M, Zare F, Mehravaran S, Emamian MH. The basic reproduction number and prediction of the epidemic size of the novel coronavirus (COVID-19) in Shahroud, Iran. Epidemiol Infect . (2020) 148:e115, 1–7. doi: 10.1017/S0950268820001247

19. Wangping J, Ke H, Yang S, Wenzhe C, Shengshu W, Shanshan Y, et al. Extended SIR prediction of the epidemics trend of COVID-19 in Italy and compared with Hunan, China. Front Med . (2020) 7:169. doi: 10.3389/fmed.2020.00169

20. Zivkovic M, Bacanin N, Venkatachalam K, Nayyar A, Djordjevic A, Strumberger I, et al. COVID-19 cases prediction by using hybrid machine learning and beetle antennae search approach. Sustain Cities Soc. (2021) 66:102669. doi: 10.1016/j.scs.2020.102669

21. Mojjada RK, Yadav A, Prabhu AV, Natarajan Y. Machine learning models for covid-19 future forecasting. Mater. Today Proc. (2020). doi: 10.1016/j.matpr.2020.10.962. [Epub ahead of print].

22. Farooq J, Bazaz A. A deep learning algorithm for modeling and forecasting of COVID-19 in five worst affected states of India. Alexandria Eng J. (2020) 60:587–96. doi: 10.1016/j.aej.2020.09.037

23. Milind Y, Murukessan P, Srinivas M. Analysis on novel coronavirus (COVID-19) using machine learning methods. Chaos Solitons Fractals . (2020) 139:110050. doi: 10.1016/j.chaos.2020.110050

24. Chicco D, Jurman G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak. (2020) 20:16. doi: 10.1186/s12911-020-1023-5

25. Mortazavi BJ, Downing NS, Bucholz EM, Dharmarajan K, Manhapra A, Li S-X, et al. Analysis of machine learning techniques for heart failure readmissions. Circ Cardiovasc Qual Outcomes. (2016) 9:629–40. doi: 10.1161/CIRCOUTCOMES.116.003039

26. Balli S. Data analysis of Covid-19 pandemic and short-term cumulative case forecasting using machine learning time series methods. Chaos Solitons Fractals. (2021) 142:110512. doi: 10.1016/j.chaos.2020.110512

27. Kuo CP, Fu JS. Evaluating the impact of mobility on COVID-19 pandemic with machine learning hybrid predictions. Sci Total Environ . (2021) 758:144151. doi: 10.1016/j.scitotenv.2020.144151

28. Shaban WM, Rabie AH, Saleh AI, Abo-Elsoud MA. A new COVID-19 Patients Detection Strategy (CPDS) based on hybrid feature selection and enhanced KNN classifier. Knowl Based Syst . (2020) 25:106270. doi: 10.1016/j.knosys.2020.106270

29. Sujatha R, Chatterjee JM, Hassanien AE. A machine learning forecasting model for COVID-19 pandemic in India. Stoch Environ Res Risk Assess . (2020) 34:959–72. doi: 10.1007/s00477-020-01827-8

30. Yang Z, Zeng Z, Wang K, Wong S-S, Liang W, Zanin M, et al. Modified SEIR and AI prediction of the epidemics trend of COVID-19 in China under public health interventions. J Thorac Dis. (2020) 12:165–74. doi: 10.21037/jtd.2020.02.64

31. Xu Z, Huang G, Weinberger KQ, Zheng AX. Gradient boosted feature selection. In: Proceedings of the 20 th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . New York, NY: ACM (2014). p. 522–31.


32. Wang Y. A Xgboost risk model via feature selection and Bayesian hyper-parameter optimization. arXiv:1901.08433 (2019).

33. Chintalapudi N, Battineni G, Sagaro GG, Amenta F. COVID-19 outbreak reproduction number estimations forecasting in Marche, Italy. Int J Infect Dis . (2020) 96:327–33. doi: 10.1016/j.ijid.2020.05.029

34. Locatelli I, Trächsel B, Rousson V. Estimating the basic reproduction number for COVID-19 in Western Europe. PLoS ONE. (2021) 16:e0248731. doi: 10.1371/journal.pone.0248731

35. Zhang Z, Trevino V, Hoseini S, Shahabuddin B, Smaranda B, Manivanna Zhang P, et al. Variable selection in logistic regression model with genetic algorithm. Ann Transl Med . (2018) 6:45. doi: 10.21037/atm.2018.01.15

36. Srinivasu PN, SivaSai JG, Ijaz MF, Bhoi AK, Kim W, Kang JJ. Classification of skin disease using deep learning neural networks with MobileNet V2 LSTM. Sensors . (2021) 21:2852. doi: 10.3390/s21082852

37. Panigrahi R, Borah S, Bhoi A, Ijaz M, Pramanik M, Kumar Y, et al. Consolidated decision tree-based intrusion detection system for binary and multiclass imbalanced datasets. Mathematics . (2021) 9:751. doi: 10.3390/math9070751

38. Panigrahi R, Borah S, Bhoi A, Ijaz M, Pramanik M, Jhaveri R, et al. Performance assessment of supervised classifiers for designing intrusion detection systems: a comprehensive review and recommendations for future research. Mathematics . (2021) 9:690. doi: 10.3390/math9060690

39. Tamang J. Dynamical properties of ion-acoustic waves in space plasma and its application to image encryption. IEEE Access . (2021) 9:18762–82. doi: 10.1109/ACCESS.2021.3054250

40. Chowdhary CL, Patel PV, Kathrotia KJ, Attique M, Perumal K, Ijaz MF. Analytical study of hybrid techniques for image encryption and decryption. Sensors . (2020) 20:5162. doi: 10.3390/s20185162

41. Gaur L, Singh G, Solanki A, Jhanjhi NZ, Bhatia U, Sharma S, et al. Disposition of youth in predicting sustainable development goals using the neuro-fuzzy and random forest algorithms. Hum Cent Comput Inf Sci . (2021) 11:24. doi: 10.22967/HCIS.2021.11.024

42. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining . San Francisco, CA. (2016) p. 13–7. doi: 10.1145/2939672.2939785

43. Available online at: https://github.com/owid/covid-19-data/tree/master/public/data (accessed May 2, 2021).

Keywords: COVID-19, feature selection, machine learning, prediction error, reproduction rate prediction, regression

Citation: Kaliappan J, Srinivasan K, Mian Qaisar S, Sundararajan K, Chang C-Y and C S (2021) Performance Evaluation of Regression Models for the Prediction of the COVID-19 Reproduction Rate. Front. Public Health 9:729795. doi: 10.3389/fpubh.2021.729795

Received: 23 June 2021; Accepted: 16 August 2021; Published: 14 September 2021.


Copyright © 2021 Kaliappan, Srinivasan, Mian Qaisar, Sundararajan, Chang and C. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chuan-Yu Chang, chuanyu@yuntech.edu.tw

This article is part of the Research Topic: Big Data Analytics for Smart Healthcare applications.


Regression analysis of student academic performance using deep learning

  • Sadiq Hussain,
  • Silvia Gaftandzhieva,
  • Md. Maniruzzaman,
  • Rositsa Doneva &
  • Zahraa Fadhil Muhsin

Education and Information Technologies, volume 26, pages 783–798 (2021)


Educational data mining helps educational institutions to perform effectively and efficiently by exploiting the data related to all their stakeholders. It can help identify at-risk students, develop recommendation systems, and alert students at different levels, and it benefits students, educators, and authorities as a whole. Deep learning has gained momentum in various domains, especially image processing with large datasets. We devise a regression model for analyzing the academic performance of students using deep learning, applying both deep learning regression and linear regression to the dataset. For such models trained on smaller datasets, tackling overfitting is critical, so the parameters are tuned to deal with this issue. The deep learning model records a mean absolute error (MAE) of 1.61 and a loss of 4.7 with k = 3, while the linear regression model yields a loss of 6.7 and an MAE of 1.97. The deep learning model thus outperforms the linear regression model. The model may be successfully extended to other programmes to mine and predict the performance of the learners.
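The abstract does not specify the network; a minimal sketch of deep-learning regression with k-fold validation (k = 3) and an MAE metric, assuming a Keras-style dense network and a hypothetical feature matrix X and grade vector y, might look like this:

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

def build_model(n_features):
    # Small dense network for regression; layer sizes are illustrative.
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),  # single continuous output: the predicted grade
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

X = np.random.rand(300, 10)   # hypothetical student features
y = np.random.rand(300) * 10  # hypothetical grades

maes = []
for train_idx, val_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    model = build_model(X.shape[1])
    model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
    _, mae = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    maes.append(mae)

print("mean MAE over 3 folds:", np.mean(maes))
```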



Availability of data and material

Available on request.

Author information

Authors and Affiliations

Dibrugarh University, Dibrugarh, Assam, India

Sadiq Hussain

University of Plovdiv “PaisiiHilendarski”, Plovdiv, Bulgaria

Silvia Gaftandzhieva & Rositsa Doneva

Statistics Discipline, Khulna University, Khulna, Bangladesh

Md. Maniruzzaman

Science College, Baghdad University, Baghdad, Iraq

Zahraa Fadhil Muhsin


Corresponding author

Correspondence to Sadiq Hussain .

Ethics declarations

Conflicts of interest/competing interests

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article.

Hussain, S., Gaftandzhieva, S., Maniruzzaman, M. et al. Regression analysis of student academic performance using deep learning. Educ Inf Technol 26 , 783–798 (2021). https://doi.org/10.1007/s10639-020-10241-0

Download citation

Received : 03 April 2020

Accepted : 29 May 2020

Published : 27 July 2020

Issue Date : January 2021

DOI : https://doi.org/10.1007/s10639-020-10241-0


Keywords: Data mining, Educational data mining, Mean absolute error, Deep learning


The complete guide to regression analysis

What is regression analysis and why is it useful? While most of us have heard the term, understanding regression analysis in detail may be something you need to brush up on. Here's what you need to know about this popular method of analysis.

When you rely on data to drive and guide business decisions, as well as predict market trends, just gathering and analyzing what you find isn’t enough — you need to ensure it’s relevant and valuable.

The challenge, however, is that so many variables can influence business data: market conditions, economic disruption, even the weather! As such, it’s essential you know which variables are affecting your data and forecasts, and what data you can discard.

And one of the most effective ways to determine data value and monitor trends (and the relationships between them) is to use regression analysis, a set of statistical methods used for the estimation of relationships between independent and dependent variables.

In this guide, we’ll cover the fundamentals of regression analysis, from what it is and how it works to its benefits and practical applications.


What is regression analysis?

Regression analysis is a statistical method. It’s used for analyzing different factors that might influence an objective – such as the success of a product launch, business growth, a new marketing campaign – and determining which factors are important and which ones can be ignored.

Regression analysis can also help leaders understand how different variables impact each other and what the outcomes are. For example, when forecasting financial performance, regression analysis can help leaders determine how changes in the business can influence revenue or expenses in the future.

Running an analysis of this kind, you might find that there’s a high correlation between the number of marketers employed by the company, the leads generated, and the opportunities closed.

This seems to suggest that a high number of marketers and a high number of leads generated influences sales success. But do you need both factors to close those sales? By analyzing the effects of these variables on your outcome,  you might learn that when leads increase but the number of marketers employed stays constant, there is no impact on the number of opportunities closed, but if the number of marketers increases, leads and closed opportunities both rise.

Regression analysis can help you tease out these complex relationships so you can determine which areas you need to focus on in order to get your desired results, and avoid wasting time with those that have little or no impact. In this example, that might mean hiring more marketers rather than trying to increase leads generated.

How does regression analysis work?

Regression analysis starts with variables that are categorized into two types: dependent and independent variables. The variables you select depend on the outcomes you’re analyzing.

Understanding variables:

1. Dependent variable

This is the main variable that you want to analyze and predict. For example, operational (O) data such as your quarterly or annual sales, or experience (X) data such as your net promoter score (NPS) or customer satisfaction score (CSAT).

These variables are also called response variables, outcome variables, or left-hand-side variables (because they appear on the left-hand side of a regression equation).

There are three easy ways to identify them:

  • Is the variable measured as an outcome of the study?
  • Does the variable depend on another in the study?
  • Do you measure the variable only after other variables are altered?

2. Independent variable

Independent variables are the factors that could affect your dependent variables. For example, a price rise in the second quarter could make an impact on your sales figures.

You can identify independent variables with the following list of questions:

  • Is the variable manipulated, controlled, or used as a subject grouping method by the researcher?
  • Does this variable come before the other variable in time?
  • Are you trying to understand whether or how this variable affects another?

Independent variables are often referred to differently in regression depending on the purpose of the analysis. You might hear them called:

Explanatory variables

Explanatory variables are those which explain an event or an outcome in your study. For example, explaining why your sales dropped or increased.

Predictor variables

Predictor variables are used to predict the value of the dependent variable. For example, predicting how much sales will increase when new product features are rolled out.

Experimental variables

These are variables that can be manipulated or changed directly by researchers to assess the impact. For example, assessing how different product pricing ($10 vs $15 vs $20) will impact the likelihood to purchase.

Subject variables (also called fixed effects)

Subject variables can’t be changed directly, but vary across the sample. For example, age, gender, or income of consumers.

Unlike experimental variables, you can’t randomly assign or change subject variables, but you can design your regression analysis to determine the different outcomes of groups of participants with the same characteristics. For example, ‘how do price rises impact sales based on income?’

Carrying out regression analysis

So regression is about the relationships between dependent and independent variables. But how exactly do you do it?

Assuming you have already collected your data, the first thing you need to do is plot your results on a graph. Doing this makes interpreting regression analysis results much easier as you can clearly see the correlations between dependent and independent variables.

Let’s say you want to carry out a regression analysis to understand the relationship between the number of ads placed and revenue generated.

On the Y-axis, you place the revenue generated. On the X-axis, the number of digital ads. By plotting the information on the graph, and drawing a line (called the regression line) through the middle of the data, you can see the relationship between the number of digital ads placed and revenue generated.

This regression line is the line that provides the best description of the relationship between your independent variables and your dependent variable. In this example, we’ve used a simple linear regression model.

Statistical analysis software can draw this line for you and precisely calculate its equation. The software then provides a formula for the slope of the line, adding further context to the relationship between your dependent and independent variables.
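
To make this concrete, here is a minimal sketch of that step in Python, using made-up ad and revenue numbers; numpy’s polyfit plays the role of the statistical software, returning the slope and intercept of the regression line:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical records: digital ads placed and revenue generated
ads = np.array([10, 15, 20, 25, 30, 35, 40])
revenue = np.array([210, 300, 395, 470, 590, 640, 760])

# Fit a first-degree polynomial: np.polyfit returns [slope, intercept]
slope, intercept = np.polyfit(ads, revenue, deg=1)
print(f"revenue ~= {intercept:.1f} + {slope:.1f} * ads")

# Plot the points and the fitted regression line
plt.scatter(ads, revenue, label="observed data")
plt.plot(ads, intercept + slope * ads, color="red", label="regression line")
plt.xlabel("Number of digital ads placed")
plt.ylabel("Revenue generated")
plt.legend()
plt.show()
```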

Simple linear regression analysis

A simple linear model uses a single straight line to determine the relationship between a single independent variable and a dependent variable.

This regression model is mostly used when you want to determine the relationship between two variables (like price increases and sales) or the value of the dependent variable at certain points of the independent variable (for example the sales levels at a certain price rise).

While linear regression is useful, it does require you to make some assumptions.

For example, it requires you to assume that:

  • the data was collected using a statistically valid sampling method and is representative of the target population
  • the observed relationship between the variables can’t be explained by a ‘hidden’ third variable – in other words, there are no spurious correlations
  • the relationship between the independent and dependent variables is linear – meaning that the best fit through the data points is a straight line and not a curved one
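
A quick way to sanity-check the last assumption on this list is to plot the residuals, the gaps between the observed values and the fitted line. Below is a short sketch reusing the made-up ad and revenue numbers from the previous snippet; if the points scatter randomly around zero, a straight line is a reasonable fit:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same hypothetical ads/revenue data as in the previous snippet
ads = np.array([10, 15, 20, 25, 30, 35, 40])
revenue = np.array([210, 300, 395, 470, 590, 640, 760])

slope, intercept = np.polyfit(ads, revenue, deg=1)
fitted = intercept + slope * ads
residuals = revenue - fitted

# If the linearity assumption holds, residuals scatter randomly around
# zero; a visible curve or funnel shape suggests a straight line is a
# poor fit for the data.
plt.scatter(fitted, residuals)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("Fitted revenue")
plt.ylabel("Residual")
plt.show()
```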

Multiple regression analysis

As the name suggests, multiple regression analysis is a type of regression that uses multiple variables. It uses multiple independent variables to predict the outcome of a single dependent variable. Of the various kinds of multiple regression, multiple linear regression is one of the best-known.

Multiple linear regression is a close relative of the simple linear regression model in that it looks at the impact of several independent variables on one dependent variable. However, like simple linear regression, multiple regression analysis also requires you to make some basic assumptions.

For example, you will be assuming that:

  • there is a linear relationship between the dependent and independent variables (it creates a straight line and not a curve through the data points)
  • the independent variables aren’t highly correlated with one another (in other words, there is no strong multicollinearity)

An example of multiple linear regression would be an analysis of how marketing spend, revenue growth, and general market sentiment affect the share price of a company.

With multiple linear regression models you can estimate how these variables will influence the share price, and to what extent.
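
As an illustration, here is a minimal sketch of that share-price example in Python with scikit-learn, using invented quarterly figures; the correlation matrix at the end is a rough check on the second assumption above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical quarterly data: each row is
# [marketing spend ($k), revenue growth (%), market sentiment index]
X = np.array([
    [120, 2.1,  0.3],
    [150, 2.8,  0.5],
    [ 90, 1.5, -0.2],
    [200, 3.9,  0.8],
    [170, 3.1,  0.4],
    [110, 1.9,  0.0],
])
share_price = np.array([34.2, 37.8, 31.5, 43.0, 39.6, 33.1])

model = LinearRegression().fit(X, share_price)
print("coefficients:", model.coef_)   # one estimated effect per predictor
print("intercept:", model.intercept_)

# Check that the predictors aren't highly correlated with each other:
# off-diagonal values near +/-1 would be a warning sign.
print(np.corrcoef(X, rowvar=False))
```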

Multivariate linear regression

Multivariate linear regression involves more than one dependent variable as well as multiple independent variables, making it more complicated than linear or multiple linear regressions. However, this also makes it much more powerful and capable of making predictions about complex real-world situations.

For example, if an organization wants to estimate how the COVID-19 pandemic has affected employees in its different markets, it can use multivariate linear regression, with employee outcomes in the different geographical regions as the dependent variables and the different facets of the pandemic as the independent variables (such as mental health self-rating scores, proportion of employees working at home, lockdown durations and employee sick days).

Through multivariate linear regression, you can look at relationships between variables in a holistic way and quantify the relationships between them. As you can clearly visualize those relationships, you can make adjustments to dependent and independent variables to see which conditions influence them. Overall, multivariate linear regression provides a more realistic picture than looking at a single variable.

However, because multivariate techniques are complex, they involve high-level mathematics that require a statistical program to analyze the data.
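
As a sketch of the mechanics, scikit-learn’s LinearRegression accepts a matrix of dependent variables and fits one set of coefficients per output column; the data below is invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical pandemic-related predictors, one row per survey wave:
# [% working from home, lockdown days, sick days per 100 employees]
X = np.array([
    [20,  0, 4],
    [55, 30, 6],
    [80, 60, 9],
    [65, 45, 7],
    [35, 10, 5],
])

# Two dependent variables at once: mean employee wellbeing score in
# two different markets (one column per market)
Y = np.array([
    [7.2, 6.9],
    [6.1, 5.8],
    [5.0, 4.9],
    [5.6, 5.4],
    [6.8, 6.5],
])

# scikit-learn fits one set of coefficients per output column
model = LinearRegression().fit(X, Y)
print(model.coef_.shape)             # (2 outputs, 3 predictors)
print(model.predict([[50, 20, 5]]))  # predictions for both markets
```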

Logistic regression

Logistic regression models the probability of a binary outcome based on independent variables.

So, what is a binary outcome? It’s when there are only two possible scenarios: either the event happens (1) or it doesn’t (0) – yes/no outcomes, pass/fail outcomes, and so on. In other words, the outcome can be described as falling into one of two categories.

Logistic regression makes predictions based on independent variables that are assumed or known to have an influence on the outcome. For example, the probability of a sports team winning their game might be affected by independent variables like weather, day of the week, whether they are playing at home or away and how they fared in previous matches.
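
A minimal sketch of that sports example in Python with scikit-learn, using invented match records, looks like this; predict_proba returns the modeled probability of each outcome:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical match records: [home game (1/0), days of rest, wins in last 5]
X = np.array([
    [1, 3, 4], [0, 2, 1], [1, 5, 3], [0, 1, 2],
    [1, 4, 5], [0, 3, 0], [1, 2, 2], [0, 6, 4],
])
won = np.array([1, 0, 1, 0, 1, 0, 0, 1])  # 1 = win, 0 = loss

clf = LogisticRegression().fit(X, won)

# Modeled probability of winning the next home game after 3 days of
# rest with 3 wins in the last 5 matches
print(clf.predict_proba([[1, 3, 3]])[0, 1])
```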

What are some common mistakes with regression analysis?

Across the globe, businesses are increasingly relying on quality data and insights to drive decision-making — but to make accurate decisions, it’s important that the data collected and statistical methods used to analyze it are reliable and accurate.

Using the wrong data or the wrong assumptions can result in poor decision-making, lead to missed opportunities to improve efficiency and savings, and — ultimately — damage your business long term.

  • Assumptions

When running regression analysis, be it a simple linear or multiple regression, it’s really important to check that the assumptions your chosen method requires have been met. If your data points don’t conform to a straight line of best fit, for example, you need to apply additional statistical transformations to accommodate the non-linear data. For example, income data is typically heavily right-skewed; in that case you might take the natural log of income as your variable and back-transform the model’s output to the original scale afterwards.
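
As a small illustration of that income example, here is a sketch of the log transformation in Python (the numbers and the fitted value are invented):

```python
import numpy as np

# Hypothetical right-skewed income data (in $)
income = np.array([28_000, 35_000, 41_000, 52_000, 75_000, 310_000])

log_income = np.log(income)   # model on the log scale instead

# ... fit the regression using log_income as the variable ...

# A fitted value on the log scale must be back-transformed to dollars
predicted_log = 11.0          # example fitted value, not a real estimate
print(np.exp(predicted_log))  # ~= 59874, i.e. about $60k
```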

  • Correlation vs. causation

It’s a well-worn phrase that bears repeating – correlation does not equal causation. While variables that are linked by causality will always show correlation, the reverse is not always true. Moreover, there is no statistic that can determine causality (although the design of your study overall can).

If you observe a correlation in your results, such as in the first example we gave in this article where there was a correlation between leads and sales, you can’t assume that one thing has influenced the other. Instead, you should use it as a starting point for investigating the relationship between the variables in more depth.

  • Choosing the wrong variables to analyze

Before you use any kind of statistical method, it’s important to understand the subject you’re researching in detail. Doing so means you’re making informed choices of variables and you’re not overlooking something important that might have a significant bearing on your dependent variable.

  • Model building

The variables you include in your analysis are just as important as the variables you choose to exclude. That’s because the strength of each independent variable is influenced by the other variables in the model. Other techniques, such as Key Drivers Analysis, are able to account for these variable interdependencies.

Benefits of using regression analysis

There are several benefits to using regression analysis to judge how changing variables will affect your business and to ensure you focus on the right things when forecasting.

Here are just a few of those benefits:

Make accurate predictions

Regression analysis is commonly used when forecasting and forward planning for a business. For example, when predicting sales for the year ahead, a number of different variables will come into play to determine the eventual result.

Regression analysis can help you determine which of these variables are likely to have the biggest impact based on previous events and help you make more accurate forecasts and predictions.

Identify inefficiencies

Using a regression equation, a business can identify areas for improvement when it comes to efficiency, either in terms of people, processes, or equipment.

For example, regression analysis can help a car manufacturer determine order numbers based on external factors like the economy or environment.

The manufacturer can then use that equation to determine how many members of staff and how much equipment it needs to meet those orders.

Drive better decisions

Improving processes or business outcomes is always on the minds of owners and business leaders, but without actionable data, they’re simply relying on instinct, and this doesn’t always work out.

This is particularly true when it comes to issues of price. For example, to what extent will raising the price (and to what level) affect next quarter’s sales?

There’s no way to know this without data analysis. Regression analysis can help provide insights into the correlation between price rises and sales based on historical data.

How do businesses use regression? A real-life example

Marketing and advertising spending are common topics for regression analysis. Companies use regression when trying to assess the value of ad spend and marketing spend on revenue.

A typical example is using a regression equation to assess the correlation between ad costs and conversions of new customers. In this instance,

  • our dependent variable (the factor we’re trying to assess the outcomes of) will be our conversions
  • the independent variable (the factor we’ll change to assess how it changes the outcome) will be the daily ad spend
  • the regression equation will try to determine whether an increase in ad spend has a direct correlation with the number of conversions we have

The analysis is relatively straightforward — using historical data from an ad account, we can use daily data to judge ad spend vs conversions and how changes to the spend alter the conversions.

By assessing this data over time, we can make predictions not only on whether increasing ad spend will lead to increased conversions but also what level of spending will lead to what increase in conversions. This can help to optimize campaign spend and ensure marketing delivers good ROI.

This is an example of a simple linear model. If we wanted to carry out a more complex regression, we could also factor in other independent variables such as seasonality, GDP, and the current reach of our chosen advertising networks.

By increasing the number of independent variables, we can get a better understanding of whether ad spend is resulting in an increase in conversions, whether it’s exerting an influence in combination with another set of variables, or if we’re dealing with a correlation with no causal impact – which might be useful for predictions anyway, but isn’t a lever we can use to increase sales.

Using the estimated effect of each independent variable, we can more accurately predict how spend will change the conversion rate of advertising.
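
Here is a minimal sketch of that ad-spend example in Python with scikit-learn, using invented daily figures; the fitted coefficient is the estimated number of extra conversions per extra dollar of daily spend:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical daily export from an ad account
spend = np.array([50, 75, 100, 125, 150, 175, 200]).reshape(-1, 1)
conversions = np.array([12, 17, 24, 28, 35, 38, 45])

model = LinearRegression().fit(spend, conversions)

# Extra conversions per extra dollar of daily spend
print("conversions per $:", model.coef_[0])

# Predicted conversions at a planned $250/day budget (an extrapolation
# beyond the observed range, so treat it with caution)
print("predicted at $250/day:", model.predict([[250]])[0])
```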

Regression analysis tools

Regression analysis is an important tool when it comes to better decision-making and improved business outcomes. To get the best out of it, you need to invest in the right kind of statistical analysis software.

The best option is likely to be one that sits at the intersection of powerful statistical analysis and intuitive ease of use, as this will empower everyone from beginners to expert analysts to uncover meaning from data, identify hidden trends and produce predictive models without statistical training being required.

To help prevent costly errors, choose a tool that automatically runs the right statistical tests and visualizations and then translates the results into simple language that anyone can put into action.

With software that’s both powerful and user-friendly, you can isolate key experience drivers, understand what influences the business, apply the most appropriate regression methods, identify data issues, and much more.

With Qualtrics’ Stats iQ™, you don’t have to worry about the regression equation because our statistical software will run the appropriate equation for you automatically based on the variable type you want to monitor. You can also use several equations, including linear regression and logistic regression, to gain deeper insights into business outcomes and make more accurate, data-driven decisions.

Regression model: Definition, Types, and examples


What is a regression model?

A regression model determines a relationship between an independent variable and a dependent variable by providing a function. Formulating a regression analysis helps you predict the effects of the independent variable on the dependent one.

Example: age and height can be described using a linear regression model. Since a person’s height increases as age increases (at least until adulthood), they have an approximately linear relationship.

Regression models are commonly used as statistical proof of claims regarding everyday facts. 

In this article, we will take a deeper look at the regression model and its types.

What are the different types of regression models?

There are three different types of regression models covered here: linear, multiple, and stepwise.

Let’s look at them in detail:

Linear regression model

A linear regression model is used to depict a relationship between variables in which the dependent variable increases or decreases in step with the independent variable.

In the graphical representation, a straight line is plotted between the variables. Even if the points do not fall exactly on a straight line (which is almost always the case with real data), we can still see a pattern and make sense of it.

For example, as the age of a person increases, the level of glucose in their body increases as well.

Multiple regression model

A multiple regression model is used when there is more than one independent variable affecting a dependent variable. While predicting the outcome variable, it is important to measure how each of the independent variables varies and how changes in each will affect the output or target variable.

For example, the chances of a student failing a test can depend on various input variables such as study effort, family issues, and health issues.

What is stepwise regression modeling?

Unlike the two regression model types above, stepwise regression modeling is more of a technique used when several input variables affect one output variable. The analyst starts with the input variable most directly correlated with the output variable and builds a model from it. The remaining variables come into the picture later, when the analyst decides to refine the model.

The analyst may add the remaining inputs one after the other, based on their significance and the extent to which they affect the target variable.

For example, suppose vegetable prices have increased in a certain area. The reason behind this could be anything from natural calamities to transport and supply chain problems. An analyst putting this on a graph would pick the most obvious driver first, say heavy rainfall in the agricultural regions.

Once that initial model is built, the remaining input variables can be added based on their occurrence and significance, as in the sketch below.
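
The sketch below illustrates the idea in Python with scikit-learn on simulated data: forward selection adds, at each step, the input that most improves R². (A real stepwise procedure would also apply a stopping rule, such as significance tests or AIC, rather than adding every variable.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simulated inputs that might affect vegetable prices
n = 200
rainfall = rng.normal(size=n)
transport = rng.normal(size=n)
supply = rng.normal(size=n)
price = 3.0 * rainfall + 1.0 * transport + 0.2 * supply + rng.normal(size=n)

X = np.column_stack([rainfall, transport, supply])
names = ["rainfall", "transport", "supply"]

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    # At each step, add the input that most improves R^2
    scores = [
        LinearRegression()
        .fit(X[:, selected + [j]], price)
        .score(X[:, selected + [j]], price)
        for j in remaining
    ]
    best = remaining[int(np.argmax(scores))]
    selected.append(best)
    remaining.remove(best)
    print(f"added {names[best]}, R^2 = {max(scores):.3f}")
```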

