Fundamentally, regression analysis examines the relationship between one or more independent variables (predictors) and a dependent variable (outcome). The dependent variable in biostatistics typically refers to a biological measurement or health outcome that we intend to understand or forecast, whereas the independent variables could be risk factors, treatments, or demographic traits.
Depending on the type of data and the research question, several regression methods can be applied:
Linear regression, the simplest form of regression, assumes a linear relationship between the variables. It’s used when the dependent variable is continuous (e.g., blood pressure, weight, cholesterol levels).
Formula:
Y = β₀ + β₁X₁ + β₂X₂ + ... + ε
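Where Y is the dependent variable, β₀ is the intercept, β₁, β₂, ... are the regression coefficients for the predictors X₁, X₂, ..., and ε is the error term.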
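As a rough illustration of the linear formula, the sketch below estimates β₀, β₁ and β₂ by ordinary least squares with NumPy. The data, the variable names (`age`, `bmi`, `blood_pressure`) and the generating coefficients are hypothetical, chosen only so the recovered values can be checked by eye; no error term is added.

```python
import numpy as np

# Hypothetical data: blood pressure generated exactly from
# Y = 90 + 0.5 * age + 1.2 * bmi (no noise), so the fit should
# recover these coefficients.
age = np.array([30.0, 40.0, 50.0, 60.0, 70.0])
bmi = np.array([22.0, 27.0, 24.0, 30.0, 26.0])
blood_pressure = 90.0 + 0.5 * age + 1.2 * bmi

# Design matrix: a column of ones for the intercept (β₀),
# then one column per predictor (X₁ = age, X₂ = bmi).
X = np.column_stack([np.ones_like(age), age, bmi])

# Ordinary least squares: choose the coefficients that minimise
# the sum of squared residuals.
coefficients, *_ = np.linalg.lstsq(X, blood_pressure, rcond=None)
intercept, beta_age, beta_bmi = coefficients

# Residuals are the part of Y the fitted line does not explain (the ε term).
residuals = blood_pressure - X @ coefficients
```

With real measurements the residuals would not vanish, and a statistics package would normally be used to obtain standard errors and confidence intervals alongside the point estimates.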
When dealing with binary outcomes (like disease presence/absence, Yes/No), logistic regression becomes essential. Rather than predicting the value directly, it models the probability of an outcome using the logit function:
log(p/(1-p)) = β₀ + β₁X₁ + ... + βₙXₙ
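Here p is the probability of the outcome. The model's output is a probability, which is then converted to a yes/no prediction, typically by comparing it against a threshold such as 0.5.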
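The step from log-odds to a probability, and then to a yes/no call, can be sketched in a few lines of Python. The coefficients below are made up for illustration and are not fitted to any data; the 0.5 threshold is likewise an assumption, and other cut-offs may suit a given clinical setting better.

```python
import math

def predicted_probability(intercept, coefficients, predictors):
    """Invert the logit: p = 1 / (1 + exp(-(β₀ + β₁X₁ + ... + βₙXₙ)))."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(probability, threshold=0.5):
    """Convert a probability to a yes/no prediction at the given threshold."""
    return probability >= threshold

# Hypothetical coefficients for illustration only (not fitted to real data):
# effect per year of age, and effect of smoking (1 = smoker, 0 = non-smoker).
intercept = -4.0
coefficients = [0.05, 1.0]

p_low = predicted_probability(intercept, coefficients, [40, 0])   # log-odds = -2.0
p_high = predicted_probability(intercept, coefficients, [70, 1])  # log-odds = 0.5

print(round(p_low, 3), classify(p_low))
print(round(p_high, 3), classify(p_high))
```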
Applications include:
Multivariate regression is applied when several dependent variables are predicted at once.
Count data, like the number of hospital visits or the number of mutations in a gene, are typically modelled with Poisson regression.
For time-to-event data (survival analysis), Cox regression models the hazard function:
h(t) = h₀(t) × exp(β₁X₁ + ... + βₙXₙ)
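One consequence of this form is that the baseline hazard h₀(t) cancels when two subjects are compared, leaving a hazard ratio that depends only on the coefficients and the difference in covariates. The sketch below shows that arithmetic; the coefficients and covariate values are hypothetical and not taken from any study.

```python
import math

def hazard_ratio(coefficients, covariates_a, covariates_b):
    """Hazard of subject A relative to subject B under the Cox model.

    h_A(t) / h_B(t) = exp(sum(β_i * (x_Ai - x_Bi))); the baseline hazard
    h₀(t) cancels, so the ratio does not depend on time.
    """
    linear_difference = sum(
        b * (xa - xb) for b, xa, xb in zip(coefficients, covariates_a, covariates_b)
    )
    return math.exp(linear_difference)

# Hypothetical coefficients: treatment (1 = treated, 0 = control) and age in years.
coefficients = [-0.7, 0.03]

# Same age, treated versus control: ratio below 1 means a lower hazard.
treated_vs_control = hazard_ratio(coefficients, [1, 60], [0, 60])

# Same treatment arm, ten years older: ratio above 1 means a higher hazard.
older_vs_younger = hazard_ratio(coefficients, [0, 70], [0, 60])
```

In practice the coefficients would be estimated from censored follow-up data with a survival-analysis package; only the interpretation step is shown here.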
This is crucial for:
Regression isn't just about fitting lines to data. In biostatistics, it plays a crucial role in:
When a straight line cannot describe the relationship between the independent and dependent variables, non-linear regression is used. Whereas linear regression fits data to a line, non-linear regression fits data to a curve.
These curves could be:
In non-linear regression:
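Unlike linear regression, non-linear regression generally has no closed-form solution, so the parameters are estimated iteratively, starting from an initial guess and refining it step by step. That’s why non-linear models need good initial estimates to converge on the best-fit solution.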
In pharmacology, increasing a drug's dosage doesn't always increase its effect linearly; beyond a certain point, the effect plateaus. This relationship is often modelled using a sigmoidal (logistic) curve, such as the four-parameter logistic (4PL) or five-parameter logistic (5PL) models.
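To make the plateau concrete, here is a minimal sketch of a 4PL curve in Python. It assumes one common parameterisation (lower asymptote, upper asymptote, midpoint dose and slope); the parameter names and values are hypothetical and chosen only for illustration.

```python
def four_parameter_logistic(dose, bottom, top, ec50, hill_slope):
    """4PL curve: the response rises from `bottom` towards `top`,
    reaching the midpoint at dose == ec50. Requires dose > 0."""
    return bottom + (top - bottom) / (1.0 + (ec50 / dose) ** hill_slope)

# Hypothetical parameters for illustration only.
params = dict(bottom=0.0, top=100.0, ec50=10.0, hill_slope=1.5)

# Doses spaced by factors of ten around the midpoint.
doses = [0.1, 1.0, 10.0, 100.0, 1000.0]
responses = [four_parameter_logistic(d, **params) for d in doses]

for dose, response in zip(doses, responses):
    print(f"dose={dose:>7}: response={response:.2f}")
```

Each tenfold increase in dose adds less response once the dose is well past the midpoint, which is the plateau described above.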
The Michaelis-Menten equation, which models enzyme-substrate kinetics, is a classic example of a curve fitted by non-linear regression:
V = (Vmax × [S]) / (Km + [S])
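Where V is the reaction rate, Vmax is the maximum rate, [S] is the substrate concentration, and Km is the substrate concentration at which V equals half of Vmax.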
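A fit of this equation might look like the sketch below, which uses `scipy.optimize.curve_fit` and supplies starting values through its `p0` argument. The data are synthetic and noise-free, generated from assumed values Vmax = 2.0 and Km = 0.5, and the rule used to pick the starting values is one simple heuristic rather than a standard procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(substrate, vmax, km):
    """Reaction rate V = (Vmax * [S]) / (Km + [S])."""
    return vmax * substrate / (km + substrate)

# Hypothetical, noise-free data generated from Vmax = 2.0 and Km = 0.5,
# so the fit should recover those values.
substrate = np.array([0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0])
rate = michaelis_menten(substrate, 2.0, 0.5)

# Initial estimates: the largest observed rate for Vmax, and the substrate
# concentration whose rate is closest to half of that for Km.
vmax_start = rate.max()
km_start = substrate[np.argmin(np.abs(rate - vmax_start / 2.0))]

# Iterative least-squares fit starting from the initial estimates.
(vmax_fit, km_fit), covariance = curve_fit(
    michaelis_menten, substrate, rate, p0=[vmax_start, km_start]
)

print(f"Vmax = {vmax_fit:.3f}, Km = {km_fit:.3f}")
```

With noisy experimental data the estimates would carry uncertainty, which can be gauged from the diagonal of the returned covariance matrix.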