Understanding how variation in a dataset is explained or left unexplained by a regression model is a critical concept in statistical modeling. One of the fundamental ways to evaluate this is by partitioning the sum of squares. This concept lies at the heart of Analysis of Variance (ANOVA) and is the basis for F-tests in regression.

In any regression model, the total variability in the dependent variable (Y) is made up of two parts:
- Explained variation: The part of the variation in Y explained by the independent variable(s).
- Unexplained variation: The part that remains due to random error or factors not included in the model.
This decomposition helps us understand how well our regression model is performing. The method used to break down this variability is called the partitioning of the sum of squares.
Components of Total Variation
a. SST – Total Sum of Squares
The Total Sum of Squares (SST) measures the total variation in the observed values of the dependent variable (Y) from their mean (Ȳ).
Formula:
SST = Σ(Yi – Ȳ)²
This is the baseline measure of total variation, and it exists regardless of the regression model.
b. SSR – Regression Sum of Squares
The Regression Sum of Squares (SSR) measures the variation explained by the regression line. It is the sum of squared differences between the predicted values (Ŷ) and the mean of Y.
Formula:
SSR = Σ(Ŷi – Ȳ)²
It shows how much of the total variation in Y is explained by X (the independent variable).
c. SSE – Error Sum of Squares
The Error Sum of Squares (SSE) measures the variation left unexplained by the regression model—i.e., the deviation of the actual Y values from their predicted values.
Formula:
SSE = Σ(Yi – Ŷi)²
d. Fundamental Relationship
A key identity forms the basis of ANOVA in regression analysis; it holds exactly when the line is fitted by ordinary least squares with an intercept. It tells us that:
Total Variation = Explained Variation + Unexplained Variation
In symbols:
SST = SSR + SSE
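The decomposition can be sketched in a few lines of Python. The helper name `partition_ss` is illustrative; it fits the least-squares line with the textbook formulas and returns the three sums, so the identity can be checked numerically:

```python
def partition_ss(x, y):
    """Fit Y = b0 + b1*X by ordinary least squares, then partition SST."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least-squares slope b1 = Sxy / Sxx, intercept b0 = Ȳ - b1*X̄.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained
    return sst, ssr, sse

# The 5-observation data used in the worked example below the ANOVA table.
sst, ssr, sse = partition_ss([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
print(sst, ssr, sse)  # SST, SSR, SSE; SST equals SSR + SSE up to rounding
```

Because `y_hat` comes from the least-squares fit, `sst` always equals `ssr + sse` to within floating-point error.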
ANOVA Table for Simple Linear Regression
The Analysis of Variance (ANOVA) table organizes these components into a format that helps in hypothesis testing using the F-test.
Source of Variation | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F-ratio |
---|---|---|---|---|
Regression | SSR | 1 | SSR / 1 | MSR / MSE |
Error | SSE | n – 2 | SSE / (n – 2) | — |
Total | SST | n – 1 | — | — |
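The table's arithmetic can be sketched directly. The helper name `anova_table` and the illustrative inputs (SSR = 6.4, SSE = 2.4, n = 5) are assumptions for the sketch:

```python
def anova_table(ssr, sse, n):
    """Mean squares and F-ratio for simple linear regression."""
    df_reg, df_err = 1, n - 2      # degrees of freedom: regression, error
    msr = ssr / df_reg             # mean square for regression
    mse = sse / df_err             # mean square error
    f_ratio = msr / mse            # F-statistic with (1, n - 2) df
    return msr, mse, f_ratio

msr, mse, f_ratio = anova_table(ssr=6.4, sse=2.4, n=5)
print(msr, mse, f_ratio)  # MSR, MSE, F (≈ 6.4, 0.8, 8.0)
```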
Example:
Let’s assume we have data for 5 observations, with predicted values taken from the least-squares line Ŷ = 1.8 + 0.8X fitted to this data:
X | Y | Ŷ (Predicted Y) |
---|---|---|
1 | 2 | 2.6 |
2 | 4 | 3.4 |
3 | 5 | 4.2 |
4 | 4 | 5.0 |
5 | 6 | 5.8 |
Step-by-step:
- Compute Ȳ = (2 + 4 + 5 + 4 + 6)/5 = 4.2
- SST = Σ(Y – Ȳ)² = (2–4.2)² + (4–4.2)² + (5–4.2)² + (4–4.2)² + (6–4.2)² = 8.8
- SSR = Σ(Ŷ – Ȳ)² = (2.6–4.2)² + (3.4–4.2)² + (4.2–4.2)² + (5.0–4.2)² + (5.8–4.2)² = 6.4
- SSE = Σ(Y – Ŷ)² = (2–2.6)² + (4–3.4)² + (5–4.2)² + (4–5.0)² + (6–5.8)² = 2.4
Now check the identity:
SSR + SSE = 6.4 + 2.4 = 8.8 = SST ✓
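This arithmetic can be checked in a few lines, using the least-squares line Ŷ = 1.8 + 0.8X fitted to the X and Y data above:

```python
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 6]
Y_hat = [1.8 + 0.8 * x for x in X]   # least-squares predictions
Y_bar = sum(Y) / len(Y)              # 4.2

SST = sum((y - Y_bar) ** 2 for y in Y)
SSR = sum((yh - Y_bar) ** 2 for yh in Y_hat)
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))

print(SST, SSR, SSE)  # ≈ 8.8, 6.4, 2.4, with SST = SSR + SSE
```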
Interpretation of Results:
If SSR is large relative to SST, most of the variation is explained by the regression model; the ratio SSR/SST is the coefficient of determination, R². A high SSR together with a low SSE indicates a good model fit.
Using the F-ratio from the ANOVA table, we can perform a hypothesis test to check if the regression coefficient is significantly different from zero.
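As a sketch of that test, using the worked example's values: the 5% critical value for an F-distribution with (1, 3) degrees of freedom is about 10.13 (taken from standard F-tables), so this small sample's F-ratio of 8.0 is not significant at the 5% level:

```python
# Mean squares and F-ratio for the worked example (n = 5).
n, SSR, SSE = 5, 6.4, 2.4
MSR = SSR / 1            # df_regression = 1
MSE = SSE / (n - 2)      # df_error = n - 2 = 3
F = MSR / MSE            # ≈ 8.0

# Tabulated 5% critical value of F(1, 3), from standard F-tables.
F_CRIT_5PCT = 10.13
significant = F > F_CRIT_5PCT
print(F, significant)    # F ≈ 8.0 < 10.13 -> fail to reject H0: slope = 0
```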
Application Cases
- Academic Performance: Explaining student marks based on hours studied (SSR) versus variation due to other unknown factors (SSE).
- Business Forecasting: Evaluating how well a sales prediction model (using advertising spend) explains actual sales variability.
- Quality Control: Checking how much variation in product quality is due to process factors versus random causes.
Merits and Limitations
Merits:
- Provides clarity on model performance through SSR/SSE ratio.
- Helps in hypothesis testing using ANOVA.
- Improves understanding of residuals and model accuracy.
Limitations:
- Valid only when the linear regression assumptions (linearity, independent errors, constant variance) are met.
- Outliers can inflate or distort SST, SSR, and SSE.
- Can be misleading without proper regression diagnostics.
Summary
Partitioning the sum of squares is a cornerstone in regression analysis. It allows us to divide the total variation in the dependent variable into explained and unexplained parts, using SST = SSR + SSE. This decomposition is used to create the ANOVA table and calculate the F-ratio to test the model’s significance. By mastering this concept, students can critically assess any regression output and understand the effectiveness of prediction models.