Understanding how variation in a dataset is explained or left unexplained by a regression model is a critical concept in statistical modeling. One of the fundamental ways to evaluate this is by partitioning the sum of squares. This concept lies at the heart of Analysis of Variance (ANOVA) and is the basis for F-tests in regression.

In any regression model, the total variability in the dependent variable (Y) is made up of two parts:
- Explained variation: The part of the variation in Y explained by the independent variable(s).
- Unexplained variation: The part that remains due to random error or factors not included in the model.
This decomposition helps us understand how well our regression model is performing. The method used to break down this variability is called the partitioning of the sum of squares.
Components of Total Variation
a. SST – Total Sum of Squares
The Total Sum of Squares (SST) measures the total variation in the observed values of the dependent variable (Y) from their mean (Ȳ).
Formula:
SST = Σ(Yi – Ȳ)²
This is the baseline measure of total variation, and it exists regardless of the regression model.
b. SSR – Regression Sum of Squares
The Regression Sum of Squares (SSR) measures the variation explained by the regression line. It is the sum of squared differences between the predicted values (Ŷ) and the mean of Y.
Formula:
SSR = Σ(Ŷi – Ȳ)²
It shows how much of the total variation in Y is explained by X (the independent variable).
c. SSE – Error Sum of Squares
The Error Sum of Squares (SSE) measures the variation left unexplained by the regression model—i.e., the deviation of the actual Y values from their predicted values.
Formula:
SSE = Σ(Yi – Ŷi)²
d. Fundamental Relationship
A key identity forms the basis of ANOVA in regression analysis; it holds exactly when the line is fitted by ordinary least squares with an intercept. It tells us that:
Total Variation = Explained Variation + Unexplained Variation
In symbols:
SST = SSR + SSE
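The decomposition can be sketched in a few lines of Python. The helper name `partition_ss` is illustrative; it fits the least-squares line with the textbook formulas and returns the three sums, so the identity can be checked numerically:

```python
def partition_ss(x, y):
    """Fit Y = b0 + b1*X by ordinary least squares, then partition SST."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least-squares slope b1 = Sxy / Sxx, intercept b0 = Ȳ - b1*X̄.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained
    return sst, ssr, sse

# The 5-observation data used in the worked example below the ANOVA table.
sst, ssr, sse = partition_ss([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
print(sst, ssr, sse)  # SST, SSR, SSE; SST equals SSR + SSE up to rounding
```

Because `y_hat` comes from the least-squares fit, `sst` always equals `ssr + sse` to within floating-point error.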
ANOVA Table for Simple Linear Regression
The Analysis of Variance (ANOVA) table organizes these components into a format that helps in hypothesis testing using the F-test.
Source of Variation | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F-ratio |
---|---|---|---|---|
Regression | SSR | 1 | SSR / 1 | MSR / MSE |
Error | SSE | n – 2 | SSE / (n – 2) | — |
Total | SST | n – 1 | — | — |
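The table's arithmetic can be sketched directly. The helper name `anova_table` and the illustrative inputs (SSR = 6.4, SSE = 2.4, n = 5) are assumptions for the sketch:

```python
def anova_table(ssr, sse, n):
    """Mean squares and F-ratio for simple linear regression."""
    df_reg, df_err = 1, n - 2      # degrees of freedom: regression, error
    msr = ssr / df_reg             # mean square for regression
    mse = sse / df_err             # mean square error
    f_ratio = msr / mse            # F-statistic with (1, n - 2) df
    return msr, mse, f_ratio

msr, mse, f_ratio = anova_table(ssr=6.4, sse=2.4, n=5)
print(msr, mse, f_ratio)  # MSR, MSE, F (≈ 6.4, 0.8, 8.0)
```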
Example:
Let’s assume we have data for 5 observations, with predicted values taken from the least-squares line Ŷ = 1.8 + 0.8X fitted to this data:
X | Y | Ŷ (Predicted Y) |
---|---|---|
1 | 2 | 2.6 |
2 | 4 | 3.4 |
3 | 5 | 4.2 |
4 | 4 | 5.0 |
5 | 6 | 5.8 |
Step-by-step:
- Compute Ȳ = (2 + 4 + 5 + 4 + 6)/5 = 4.2
- SST = Σ(Y – Ȳ)² = (2–4.2)² + (4–4.2)² + (5–4.2)² + (4–4.2)² + (6–4.2)² = 8.8
- SSR = Σ(Ŷ – Ȳ)² = (2.6–4.2)² + (3.4–4.2)² + (4.2–4.2)² + (5.0–4.2)² + (5.8–4.2)² = 6.4
- SSE = Σ(Y – Ŷ)² = (2–2.6)² + (4–3.4)² + (5–4.2)² + (4–5.0)² + (6–5.8)² = 2.4
Now check the identity:
SSR + SSE = 6.4 + 2.4 = 8.8 = SST ✓
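This arithmetic can be checked in a few lines, using the least-squares line Ŷ = 1.8 + 0.8X fitted to the X and Y data above:

```python
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 6]
Y_hat = [1.8 + 0.8 * x for x in X]   # least-squares predictions
Y_bar = sum(Y) / len(Y)              # 4.2

SST = sum((y - Y_bar) ** 2 for y in Y)
SSR = sum((yh - Y_bar) ** 2 for yh in Y_hat)
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))

print(SST, SSR, SSE)  # ≈ 8.8, 6.4, 2.4, with SST = SSR + SSE
```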
Interpretation of Results:
If SSR is large relative to SST, most of the variation is explained by the regression model; the ratio SSR/SST is the coefficient of determination, R². A high SSR together with a low SSE indicates a good model fit.
Using the F-ratio from the ANOVA table, we can perform a hypothesis test to check if the regression coefficient is significantly different from zero.
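As a sketch of that test, using the worked example's values: the 5% critical value for an F-distribution with (1, 3) degrees of freedom is about 10.13 (taken from standard F-tables), so this small sample's F-ratio of 8.0 is not significant at the 5% level:

```python
# Mean squares and F-ratio for the worked example (n = 5).
n, SSR, SSE = 5, 6.4, 2.4
MSR = SSR / 1            # df_regression = 1
MSE = SSE / (n - 2)      # df_error = n - 2 = 3
F = MSR / MSE            # ≈ 8.0

# Tabulated 5% critical value of F(1, 3), from standard F-tables.
F_CRIT_5PCT = 10.13
significant = F > F_CRIT_5PCT
print(F, significant)    # F ≈ 8.0 < 10.13 -> fail to reject H0: slope = 0
```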
Application Cases
- Academic Performance: Explaining student marks based on hours studied (SSR) versus variation due to other unknown factors (SSE).
- Business Forecasting: Evaluating how well a sales prediction model (using advertising spend) explains actual sales variability.
- Quality Control: Checking how much variation in product quality is due to process factors versus random causes.
Merits and Limitations
Merits:
- Provides clarity on model performance through SSR/SSE ratio.
- Helps in hypothesis testing using ANOVA.
- Improves understanding of residuals and model accuracy.
Limitations:
- Valid only when the linear regression assumptions (linearity, independent errors, constant variance) are met.
- Outliers can inflate or distort SST, SSR, and SSE.
- Can be misleading without proper regression diagnostics.
Summary
Partitioning the sum of squares is a cornerstone in regression analysis. It allows us to divide the total variation in the dependent variable into explained and unexplained parts, using SST = SSR + SSE. This decomposition is used to create the ANOVA table and calculate the F-ratio to test the model’s significance. By mastering this concept, students can critically assess any regression output and understand the effectiveness of prediction models.