If you’re working in the corporate world, you’ve probably heard the term “regression analysis”. Most Excel analysts kind of know what regression means, but for many it remains a vague concept.


So today, I’ll try to explain regression in my ham-handed way.


There are actually several types of regression analysis: Simple Linear Regression, Multivariate Regression, Polynomial Regression, etc. When someone in the business world asks for a regression analysis, they’re most commonly talking about a Simple Linear Regression. A Simple Linear Regression allows you to determine the functional dependency between two sets of numbers. For example, there may be a functional dependency between Ice Cream Sales and outdoor Temperature; you could statistically prove that as the temperature outdoors goes up, ice cream sales increase.


In a Simple Linear Regression, you have a Dependent Y variable and an Independent X variable. In our example, Ice Cream Sales will be the Dependent variable. Temperature will be the Independent variable. As the Independent variable (Temperature) moves up or down, we expect the Dependent variable (Ice Cream Sales) to move along with it. In the rest of this post, we’ll create a Simple Linear Regression to test whether or not our Independent variable has a significant effect on our Dependent variable. If there *is* a significant effect, we can assume that the Independent variable (Temperature) can be reliably used to make *predictions* on the Dependent variable (Ice Cream Sales).


What most people don’t realize when running a regression is that there are actually two models being run: the actual regression analysis and a separate hidden regression analysis. The actual regression analysis is the model you yourself pass to the regression with your own Dependent and Independent variables. The hidden regression analysis is a separate model (run in the background) which evaluates your Dependent variable against its own Mean (or Average). To fully understand regression, it’s important to first understand the hidden analysis being run as a separate regression model. Let’s start there.


**The Hidden Regression Analysis**

Imagine that all we have is a set of data showing Ice Cream Sales. No other data; just the total historical revenue for ice cream. If we wanted to predict what the next revenue number would be, what would be the best prediction we could make? If you said the average of all the numbers, you’re right. In the absence of any other information, the best prediction we can possibly make for any one variable is the **Mean** of that variable (Mean is another word for Average). In the case of the numbers seen below, the best prediction we can make is that the next sales number will be the Mean of $160.


Now, because we’re rational people, we would instinctively know that our prediction would have a degree of error built into it. Just how big is the degree of error? Well, this is basically what is being calculated in this hidden regression analysis.

To understand how, let’s pretend we plot our sales data on a chart. We also add a line to represent the prediction of Mean for each data point. This line is what we call our “Best Fit Line”. It represents the best prediction we can come up with given the data we have.


Each data point will sit a certain amount above or below our “Best Fit Line”. In other words, each data point has a variance to the Mean. Any variance greater or less than 0 is considered to be an error in prediction. That is to say, if we predict the Mean but the actual data comes in at some other number, we call that difference an error.


In the hidden regression analysis, these errors are squared. For example, -40 would become 1,600 (-40*-40), 70 would become 4,900 (70*70), etc.


If we were to add up all the squared errors, we would get a Sum of Total Errors Squared (SST). The bigger the Sum of Total Errors Squared, the bigger the error factor is for the Mean predictor. To oversimplify it, the Sum of Total Errors Squared represents a kind of worst-case scenario where you only have the Mean of the Dependent variable to use as a predictor. This number provides a benchmark to compare against when Excel runs our own Dependent and Independent variables.
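The whole hidden analysis can be sketched in a few lines of Python. The sales figures below are hypothetical (the article’s full dataset isn’t reproduced in the text); they’re chosen so they average to the $160 Mean and include the -40 and 70 deviations mentioned above, but the resulting SST won’t match the article’s charts.

```python
# Hypothetical ice cream sales figures (illustrative only)
sales = [120, 230, 160, 130]

mean = sum(sales) / len(sales)        # 160.0 -- the "Best Fit" prediction
errors = [s - mean for s in sales]    # e.g. 120 - 160 = -40, 230 - 160 = 70
squared = [e ** 2 for e in errors]    # -40 -> 1600.0, 70 -> 4900.0
sst = sum(squared)                    # Sum of Total Errors Squared
print(mean, sst)                      # 160.0 7400.0
```

Every data point contributes its squared distance from the Mean, and the total is the benchmark error for the Mean-only predictor.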


**The Actual Regression Analysis**

The actual regression analysis starts with a simple table of two variables. In this case, we have a Dependent Y variable (Ice Cream Sales) and an Independent X variable (Temperature).


We can use these two variables to create a Scatterplot chart, and then add a Linear Trendline (right-click any data point and select Add Trendline). Although it’s called a Trendline, it’s not really a trend per se. What Excel actually does is evaluate your Dependent and Independent variables and plot a set of predictions that results in the least amount of variance to your actual data points. This new set of predictions is represented in the Trendline as a “Best Fit” line.


Again, a “Best Fit” line plots a set of predictions against a set of actual data. The variances between the actual data and the predictions make up a set of squared errors.


Excel takes all the squared errors and calculates a Sum of Errors Squared. Again, the bigger the Sum of Errors Squared, the bigger the error factor is for the Independent predictor.
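Here is a minimal sketch of what Excel’s Linear Trendline is doing under the hood: an ordinary least-squares fit, followed by the Sum of Errors Squared against the fitted line. The temperature/sales pairs are hypothetical, so the numbers won’t match the article’s charts.

```python
# Ordinary least squares -- the same "Best Fit" line a Linear Trendline plots
temps = [60, 70, 80, 90]      # Independent X (Temperature), hypothetical
sales = [110, 145, 180, 205]  # Dependent Y (Ice Cream Sales), hypothetical

n = len(temps)
x_bar, y_bar = sum(temps) / n, sum(sales) / n

# Slope and intercept that minimize the squared errors
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, sales))
         / sum((x - x_bar) ** 2 for x in temps))
intercept = y_bar - slope * x_bar

# Predictions along the "Best Fit" line, and the squared errors against them
predictions = [intercept + slope * x for x in temps]
sse = sum((y - p) ** 2 for y, p in zip(sales, predictions))
print(slope, intercept, sse)  # 3.2 -80.0 30.0
```

Just as with the hidden analysis, each actual data point is compared to its prediction, the difference is squared, and the squares are summed.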


**Putting it all Together**

Excel does its calculations behind the scenes in an instant.

On our end, we end up with a simple chart that shows our Dependent Variable on the Y axis and our Independent Variable on the X axis. So now what?


Well, it’s time to interpret the results. In order to do so, we’ll need to look at something called the R-squared value.

We can right-click on the Trendline and select Format Trendline. In the Trendline Options dialog box, we choose the option to display R-squared value on the chart.


Excel associates a new label with the Trendline showing us a strange number called the R-squared.


To understand what your linear regression is telling you, you’ve got to understand what that R-squared value means.

Remember that the hidden regression analysis Excel performed used just the Mean of Ice Cream Sales as the basis for the “Best Fit” prediction. This produced a Sum of Total Errors Squared of 13,800. Excel then ran our actual regression analysis using the Temperature variable as the basis for the “Best Fit” prediction. That produced a Sum of Errors Squared of 4,113.


Take a look at this representation of the difference between the two analyses. When considering a new Independent variable in your data, your goal is to reduce the Errors Squared as much as possible. This visualization suggests that by using Temperature as the Independent variable, we reduced the squared errors significantly and improved our ability to predict Ice Cream Sales.


We can quantify just how significant the improvement is by calculating how much error the regression removed: 13,800 − 4,113 = 9,687. This number (often called the explained, or regression, sum of squares) tells us how many of the original errors were “explained” by the regression (thus removed from the error pool).


The R-squared value (also known as the Coefficient of Determination) divides those explained errors by the total errors to quantify how well our data fits the regression line.
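Plugging in the two sums from our example makes the calculation concrete:

```python
# R-squared: the share of the Mean-only error that the regression explained
sst = 13_800  # Sum of Total Errors Squared (hidden Mean-only model)
sse = 4_113   # Sum of Errors Squared (Temperature model)

r_squared = (sst - sse) / sst  # equivalently: 1 - sse / sst
print(round(r_squared, 2))     # 0.7
```

The 9,687 of explained error divided by the 13,800 of total error gives an R-squared of about .70.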


In our example, the R-squared for our regression is .70, which means that Temperature provides a good fit and is thus a reliable predictor for Ice Cream Sales.

The R-squared value will always be a number from 0 to 1. The closer the R-squared value is to 1, the better the functional dependency between the two variables. If your R-squared is close to 0, it means your chosen Independent variable did not reduce the Errors Squared by much at all, making it no better than using the simple Mean of the Dependent variable. That being said, a low R-squared is not always bad. Read here for more on that.


**Bonus**

Although it’s always best to visually see your data in a graph when performing a Linear Regression, Excel offers a nifty R-squared function called **RSQ()**.


The RSQ function lets you quickly determine the significance of the functional dependency between two variables before you go through the trouble of graphing the data. Simply pass the RSQ() function your Dependent Y range and your Independent X range. You’ll instantly get an idea of how well the two variables fit!
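Under the hood, RSQ() returns the square of the Pearson correlation between the two ranges. Here is a quick equivalent in Python, using the same hypothetical temperature/sales pairs as before:

```python
# What Excel's RSQ() computes: the squared Pearson correlation of Y vs. X
temps = [60, 70, 80, 90]      # Independent X range (hypothetical)
sales = [110, 145, 180, 205]  # Dependent Y range (hypothetical)

n = len(temps)
x_bar, y_bar = sum(temps) / n, sum(sales) / n

cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, sales))
r = cov / (sum((x - x_bar) ** 2 for x in temps) ** 0.5
           * sum((y - y_bar) ** 2 for y in sales) ** 0.5)
print(round(r ** 2, 3))  # 0.994
```

An R-squared this close to 1 says these two made-up variables fit almost perfectly; real data will usually land lower.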

Congrats, very well explained!

Mike – Fantastic Explanation….Very easy to understand

Shouldn’t the RSQ formula at the bottom of the article be =RSQ(C15:C22,B15:B22)? This would bring the formula in line with the displayed table.

Very educational article.

Nice. I particularly like the way you show that adding a variable (temperature) improves the fit. It would be interesting to take it to the next level (multivariate) and show how adding another variable (price?) could improve the fit even further while adding other variables (height of ice cream vendor) wouldn’t improve the fit.

Excellent. That was – flat out – the best explanation I’ve ever heard. Thanks!

Agree with above comments. This is an excellent article!

Very well done to put all these facts into a logical and easy to read/understand format! Your visuals are just as good!

Cheers

Excellent Summary!

One mistake I noticed. In the “Putting it all Together”: you mistakenly call the Y the Independent and the X the dependent variable.

Also in the Bonus, while you comment “it’s always best to visually see your data”, you still seem to suggest one could use RSQ() without plotting the data. I do not believe this is ever a good idea. RSQ() is for a linear response, there can be many correlations that are not linear that would be missed if it was screened without looking at the plot.

I think a plot should always be done before any statistics are done to check the validity of the statistical tools that will be used.

Steve: Thanks. I corrected the error.

“I think a plot should always be done before any statistics are done to check the validity of the statistical tools that will be used”

I agree with you wholeheartedly.

You should talk about the Standardized Beta Coefficients for Multiple Linear Regressions, because these help to visually represent the degree of change in the Rise/Run (y=f(x)).

That is, if you can predict that X predicts so much change in Y, wouldn’t it also be nice to know the slope of that change and whether it’s significant?

I know what you’re thinking: “But I’ve got the equation of the line!”. Yes you do, but what you can’t tell from the linear equation is which X factor(s) contribute significantly to changes in Y in a multiple regression. That can only be done by looking at the p-values and Std. Betas for each X.

Wow, this is basic Forecasting 101. Extremely informative. Clear precise and to the point. Thanks for the lesson, appreciate it!

Well done Mike, you got a future in this! Thanks for the insights.

Very well said! One critique: in your second example you have an error in your formula

44(?)+472+660+1558+…….. Either the chart number is wrong or your formula number is.

However, thank you for your time to put this together.

Excellent, I understood it on the first read! For years I have been doing this graphically, with considerable success, in order to write the resulting formula into estimating programs [times, costs, duty etc] based on historical actual data. I think I may be able to improve things with this to work with!

Agree with many others here in saying that your article is by far the best and easiest to understand description I have ever heard.

Really appreciate your efforts.

Excellent lesson.

regards.

Hi, thanks for the simple explanation. One major question I have is:

How did you get the individual numbers for the Sum of Errors Squared? How would I go about getting 49, 471, 660, 1558, 836, 355, 49, and 140?

I understand getting to the sum of total errors squared but am stuck on SSE.

thx.

hi

Under the actual regression analysis, how did you arrive at the figures like 49, 660, etc. in the graph, please?

I have the same question as Dave’s.

Great explanation, thanks for posting!