Simple Linear Regression

Creating the Regression Line

Calculating b1 & b0, creating the line
and testing its significance with a t-test.

DEFINITIONS:
b1 - This is the SLOPE of the regression line. It is the amount that the Y (dependent) variable will change for each 1-unit change in the X variable.
b0 - This is the intercept of the regression line with the y-axis. In other words, it is the value of Y when X = 0.
Y-hat = b0 + b1(x) - This is the sample regression line. You must calculate b0 & b1 to create this line. Y-hat stands for the predicted value of Y, and it is obtained by plugging an individual value of x into the equation.

EXAMPLE:
A firm wants to see whether sales are explained by the number of overtime hours that its salespeople work. Using a spreadsheet containing 25 months of sales & overtime figures, the following calculations are made: SSx = 85, SSy = 997, SSxy = 2,765, X-bar = 13, Y-bar = 67,987, and s(b1) = 21.87. Create the regression line.

(1) Find b1 - One method of calculating b1 is b1 = SSxy/SSx = 2765/85 = 32.53. This is the slope of the line: for every 1-unit increase in X, the predicted Y increases by 32.53. It is a positive number, so it is a direct relationship: as X goes up, so does Y. If instead b1 = -32.53, we would know the relationship between X & Y is an inverse relationship: as X goes up, Y goes down.
(2) Find b0 - Again the formula is on pg. 420 and is b0 = Y-bar - b1(X-bar) = 67,987 - 32.53(13) = 67,987 - 422.89 = 67,564.11, or about 67,564. This is the intercept of the line with the Y-axis, and can be interpreted as the value of Y if zero hours of overtime (x = 0) are worked.
(3) Create Line - Y-hat = b0 + b1(x), or Y-hat = 67,564 + 32.53(x). This line quantifies the relationship between X & Y.
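The three steps above can be sketched in a few lines of Python. This is only an illustration using the summary statistics from the example; the function names fit_line and predict are made up for this sketch and do not come from the textbook.

```python
# Summary statistics copied from the example above.
ss_xy = 2765.0   # SSxy
ss_x = 85.0      # SSx
x_bar = 13.0     # mean hours of overtime
y_bar = 67987.0  # mean sales


def fit_line(ss_xy, ss_x, x_bar, y_bar):
    """Return (b0, b1) for the sample regression line Y-hat = b0 + b1(x)."""
    b1 = ss_xy / ss_x          # step 1: slope
    b0 = y_bar - b1 * x_bar    # step 2: intercept
    return b0, b1


def predict(b0, b1, x):
    """Step 3: plug a value of x into the line to get Y-hat."""
    return b0 + b1 * x


b0, b1 = fit_line(ss_xy, ss_x, x_bar, y_bar)
print(f"b1 = {b1:.2f}")
print(f"b0 = {b0:.2f}")
print(f"Y-hat at x = 13: {predict(b0, b1, 13):.2f}")
```

Because the code keeps b1 unrounded, its b0 differs from the hand calculation by about a cent. Plugging in x = X-bar returns Y-bar, since a least-squares line always passes through the point of means.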
But is this relationship "significant"?
Since the line is based on a sample and we wish to generalize to a population, it must be tested to see if it is "significant." That is, would the relationship we found actually exist in the population, or is the result due to sampling error (our sample did not represent the true population)? The specific test we use is a t-test of whether B1, the slope of the regression line in the population, is different from 0. If B1 is zero, then a graph of the relationship between X & Y would be a horizontal (flat) line. If the line is flat, then no matter what value the X variable takes on, the predicted value of Y will not change. This means there is no linear relationship between the two variables, and the regression line we calculated is useless for explaining or predicting the dependent variable.

TESTING B1: We use our standard five-step hypothesis testing procedure.
Hypotheses: H0: B1 = 0, H1: B1 not = 0
Critical value: a t-value based on n-2 degrees of freedom. Also divide alpha by 2 because it is a 2-tailed test. In this case n = 25 (25 months of data used), thus n-2 = 23. With alpha = .05 we have alpha/2 = .025 and then t = 2.069 (from t-table inside front cover of book).
Calculated Value: The formula is on page 442 and is simply t = b1/s(b1) = 32.53/21.87 = 1.49. (s(b1) is the standard error of b1 and is given in the problem.)
Compare: t-calc < t-crit, thus we fail to reject H0.
Conclusion: We cannot conclude that B1 differs from 0. The sample does not provide evidence of a linear relationship between sales and overtime worked, so the sample regression line we calculated is not useful for explaining or predicting sales dollars from overtime worked.
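As a rough sketch, the calculated value and the comparison step can be checked in Python. The critical value 2.069 is typed in from the t-table rather than computed, and the function name slope_t_test is invented for this illustration.

```python
def slope_t_test(b1, se_b1, t_crit):
    """Two-tailed t-test of H0: B1 = 0.

    Returns the calculated t and True if H0 is rejected.
    """
    t_calc = b1 / se_b1
    reject_h0 = abs(t_calc) > t_crit
    return t_calc, reject_h0


# Values from the example: b1 = 32.53, s(b1) = 21.87,
# t-crit = 2.069 for 23 df with alpha/2 = .025 (taken from the table).
t_calc, reject_h0 = slope_t_test(32.53, 21.87, 2.069)
print(f"t-calc = {t_calc:.2f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

The absolute value is used so that the same comparison works for a negative slope.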

Correlation

Correlation is a measure of the degree of linear association between two variables. The value of a correlation can range from -1, through 0, to +1. A correlation of 0 means there is no LINEAR association between the two variables. A value of -1 or +1 means there is a perfect linear association between the two variables, the difference being that -1 indicates a perfect inverse relationship and +1 a perfect direct relationship. The sample notation for a correlation is "r" while the population correlation coefficient is represented by the Greek letter "Rho" (which looks like a small "p").
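To make the -1 to +1 range concrete, here is a small Python sketch that computes r from paired data. It assumes the usual sums-of-squares formula r = SSxy / sqrt(SSx * SSy), which this page does not state, and the data values are invented purely for illustration.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation r = SSxy / sqrt(SSx * SSy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_x * ss_y)


# Hypothetical data, not taken from any example on this page.
x = [1, 2, 3, 4, 5]
r_direct = sample_correlation(x, [2, 4, 6, 8, 10])    # perfect direct
r_inverse = sample_correlation(x, [10, 8, 6, 4, 2])   # perfect inverse
r_partial = sample_correlation(x, [2, 4, 5, 4, 5])    # imperfect

print(f"{r_direct:.2f}  {r_inverse:.2f}  {r_partial:.2f}")
```

The first two data sets lie exactly on a straight line, so r comes out at +1 and -1. The third does not, so r falls strictly between 0 and 1.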

We often want to find out if a calculated sample correlation would be "significant." Again this would mean we would test to see if Rho = 0 or not. If Rho=0 then there would be no linear relationship between the two variables in the population.

AN EXAMPLE:
Based on a sample of 42 days, the correlation between sales and number of sunny hours in the day is calculated for the Sunglass Hut store in Meridian Mall. The r = .56. Is this a "significant" correlation?

This is a basic hypothesis test:
Hypotheses: H0: Rho = 0, H1: Rho not = 0.
Critical Value: The t-test for the significance of Rho has n-2 degrees of freedom, and alpha will need to be divided by 2. Thus n-2 = 40 and alpha/2 = .05/2 = .025, and from the table we find t = 2.021.
Calculated Value: The formula on page 438 is t = r / sqrt[(1 - r-sqrd)/(n-2)]. In this case that equals .56 / sqrt[(1 - .56-sqrd)/40] = .56 / sqrt(.6864/40) = .56/.131 = 4.27.
Compare: The t-calc is larger than the t-crit, thus we REJECT H0.
Conclusion: Rho does not equal zero and thus there is evidence of a linear association between the two variables in the population.
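The calculated value for this test can be reproduced with a short Python sketch. The critical value 2.021 is entered by hand from the t-table, and the function name correlation_t is a made-up label for this illustration.

```python
import math


def correlation_t(r, n):
    """t statistic for H0: Rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


# Values from the example: r = .56 from n = 42 days.
t_calc = correlation_t(0.56, 42)
t_crit = 2.021  # from the t-table: 40 df, alpha/2 = .025

print(f"t-calc = {t_calc:.3f}")
print("Reject H0" if abs(t_calc) > t_crit else "Fail to reject H0")
```

The code carries more decimal places than the hand calculation, so its t-calc may differ from 4.27 in the last digit. The decision is the same either way.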

The F-test in Regression

EXAMPLE
Using the information given, construct the ANOVA table and determine whether there is a regression relationship between years of car ownership (Y) and salary (X). n= 47, SSR = 458 and SSE = 1281.

ANOVA Table: The ANOVA table is on page 451, and is basically the same as a one-way ANOVA table. The first thing we need is the df. By definition the df for the regression = 1, the df for the error = n-2 or 45, and the total df = n-1 or 46. Next we need the MS calculations. MSR = SSR/(df for the regression) = SSR/1 = SSR or 458. MSE = SSE/(n-2) = 1281/45 = 28.47. Finally, the F-calc = MSR/MSE or 458/28.47 = 16.09.
Hypotheses: H0: There is no regression relationship, i.e., B1 = 0. H1: There is a regression relationship, i.e., B1 is not = 0.
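Critical Value: F(num. df, den. df) = F(1, 45) at alpha = .05. Tables that do not list 45 denominator df give 4.08, the entry for F(1, 40); the exact value for F(1, 45) is about 4.06.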
Calculated Value: from above ANOVA table = 16.09
Compare: F-calc is larger than F-crit, thus we REJECT H0.
Conclusion: There is a regression (linear) relationship between years of car ownership and salary.
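The ANOVA table for this example can be laid out with a short Python sketch. The function name regression_anova and the printed layout are choices made for this illustration, and the critical value 4.08 is entered by hand from the F-table.

```python
def regression_anova(ssr, sse, n):
    """Build the ANOVA quantities for simple linear regression."""
    df_reg = 1          # one X variable
    df_err = n - 2
    msr = ssr / df_reg
    mse = sse / df_err
    return {
        "ssr": ssr, "sse": sse, "sst": ssr + sse,
        "df_reg": df_reg, "df_err": df_err, "df_total": n - 1,
        "msr": msr, "mse": mse, "f": msr / mse,
    }


# Values from the example: n = 47, SSR = 458, SSE = 1281.
table = regression_anova(458.0, 1281.0, 47)
f_crit = 4.08  # from the F-table at alpha = .05

print(f"{'Source':<12}{'df':>4}{'SS':>10}{'MS':>10}{'F':>8}")
print(f"{'Regression':<12}{table['df_reg']:>4}{table['ssr']:>10.2f}"
      f"{table['msr']:>10.2f}{table['f']:>8.2f}")
print(f"{'Error':<12}{table['df_err']:>4}{table['sse']:>10.2f}"
      f"{table['mse']:>10.2f}")
print(f"{'Total':<12}{table['df_total']:>4}{table['sst']:>10.2f}")
print("Reject H0" if table["f"] > f_crit else "Fail to reject H0")
```

The Total row uses SST = SSR + SSE, the same identity used for r-sqrd.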

The Coefficient of Determination - r-sqrd

A note on the F-test from the previous section: it tests the significance of the regression coefficient. Since we only have one coefficient in simple linear regression, this test is analogous to the t-test of b1. However, when we proceed to multiple regression, the F-test will be a test of ALL of the regression coefficients jointly being 0. (Note: b0 is the intercept rather than a slope coefficient, and we generally do not test its significance, although we could do so with a t-test just as we did for b1.)
r-sqrd is the proportion of the variance in Y that is explained by the regression, r-sqrd = SSR/SST. It is always a number between 0 and 1. The closer it is to 1.0, the better the X-Y relationship predicts or explains the variance in Y. Unfortunately there are no set values that allow you to say that a given r-sqrd is "good" or "bad." Such a determination is subjective and depends on the research you are conducting. If nobody has ever explained more than 15% of the variance in some Y variable before, and you design a study that explains 25% of the variance, then this might be considered a good r-sqrd, even though the actual number, 25%, is not very high.

EXAMPLE:
What is the r-sqrd if SSR = 345 and SSE = 123?
r-sqrd = SSR/SST. We don't have SST, but we know that SSR + SSE = SST, thus SST = 345 + 123 = 468, and r-sqrd = 345/468 = .737. This means that the regression relationship between X & Y explains 73.7% of the variance in the Y variable. Under most circumstances this would be a high amount, but again we would have to know more about our research variables.
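The same arithmetic, as a minimal Python sketch (the function name r_squared is just a label chosen for this illustration):

```python
def r_squared(ssr, sse):
    """Coefficient of determination: SSR / SST, where SST = SSR + SSE."""
    sst = ssr + sse
    return ssr / sst


# Values from the example: SSR = 345, SSE = 123.
r2 = r_squared(345.0, 123.0)
print(f"r-sqrd = {r2:.3f}")
print(f"Explains {r2:.1%} of the variance in Y")
```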

revised: 8-11-09