Skewness is essentially how data is clustered together within a dataset. Every dataset can have some gradient of three skew states: No Skew, Positive Skew and Negative Skew.
- In some datasets, the values tend to cluster around the Mean (the Average). These datasets are said to have a Normal Distribution and no skew.
- In some datasets, the values tend to cluster below the Mean, giving you a distribution with a "tail" that tapers toward higher values. These datasets are said to have a Positive Skew.
- In some datasets, the values tend to cluster above the Mean, giving you a distribution with a "tail" that tapers toward Lower values. These datasets are said to have a Negative Skew.
Understanding the Skew of your dataset can help make sense of the statistics that come out of it. For example, if you knew that your data included extreme values that skew to the negative, it could help guide you in deciding whether you should anchor your analysis to the Average, Median, Inter-Quartile range, or some other key statistical value.
Today, I'd like to give you several quick methods for determining the skewness of data.
Method 1: Graph your Data
Visualizing your data can quickly show you how your data is skewed. Two excellent ways to see the skew of your data are to visualize your data with a Histogram chart or a Box Plot.
Histogram charts are essentially visual representations of the frequency distribution of your data. With a histogram chart, you can get a good sense of how your data is distributed. My good friend and charting guru Jon Peltier has several excellent tutorials on building histograms.
Box Plots are another useful way to visualize the distribution of data. These charts plot the Mean of your dataset then creates a "box" around the Mean using the Quartiles. This method essentially gives you a visual representation of the location of key statistical points. Using this visualization, you can quickly see where your data bunches up and where there are long tails. Again, Jon Peltier has an excellent post on how to create Box Plots in Excel.
Method 2: Calculate using the Mean Median and Mode
If you don't want to take the time to create a chart, you can use some of the simple statistical points to get a general sense of how the data skews.
Specifically, you can review the location relationship between the Mean, Median and Mode of your dataset. If you don't know:
- Mean is the average of your dataset
- Median is the central data value of your dataset
- Mode is the value that has the highest frequency (occurs most often)
The idea is to use some simple comparison measures.
If the Mode, Median, and Mean are around the name number, your dataset has No Skew.
If the Mode is less than the Median and the Median is less than the Mean, your dataset has a Positive Skew.
If the Mode is greater than the Median and the Median is greater than the Mean, your dataset has a Negative Skew.
Method 3: Calculate using the Quartiles
Another easy way to mathematically get a sense of the skew is to compare the Quartiles in your dataset.
A quartile is essentially a division of data into four defined intervals based upon the values of the data and how they compare to the entire set. These intervals each contain 25% of the entire population. Values falling below 25% of all the data analyzed is said to be in the 1st quartile. Values falling between 25.1% and 50% are in the 2nd quartile. Values falling between 51% and 75% belong to the in the 3rd quartile. And the remaining values are allocated in the 4th quartile. Feel free to read my previous post on Quartiles to get a detailed primer.
In a dataset that has No Skew, Quartile 3 minus Quartile 2 will be equal to Quartile 2 minus Quartile 1.
In a dataset that has Positive Skew, Quartile 3 minus Quartile 2 will be greater than Quartile 2 minus Quartile 1.
In a dataset that has Negative Skew, Quartile 3 minus Quartile 2 will be less than Quartile 2 minus Quartile 1.
Method 4: Use the standard Skew Calculation
You can also use the standard statistical calculation for Skew:
3*(MEAN – MEDIAN)/STANDARD DEVIATION
This calculation is used in statistics to get a quick glimpse at the Skewness of a dataset.
Just for reference:
- Normally Distributed datasets have a Skew result of 0 (no skew).
- Datasets with a Positive Skew result in a number greater than 0.
- Datasets with a Negative Skew result in a number less than 0.
The closer the resulting Skew calculation is to 0, the more normal the distribution is.
Take the example below. You can see in cell Z21, 1.11 is the calculated Skew for the dataset in column W.
Since the Skew is greater than 0, this tells us that we have a Positively Skewed dataset.
The intensity of the Skew is determined by how far it is from 0. In this case, 1.11 tells us that our dataset has a fairly strong skew to the positive.
Method 5: Use Excel's SKEW function.
Some of you statistics nerds will point out that in the previous example, we calculated the Skew on only 11 data points. The standard statistical calculation seems to exaggerate the result of Skew when you're dealing with small sample sizes. A more robust measure of Skewness would take into account the number of observations. This is where Excel's SKEW function comes in.
Excel's SKEW function will give you a different answer from the statistical calculation shown in Method 4.
This is because Excel' Skew calculation accounts for the number of observations in the dataset.
Excel's SKEW function tells us that the dataset is indeed positively skewed (the result is greater than zero). But you'll notice that Excel tempers the intensity of the skew, showing a lower number than the standard statistical calculation (.08 vs 1.11).
Method 6: Use Excel's Descriptive Statistics Feature
You can use Excel's Descriptive Statistics feature to output key statistical data points from your dataset.
SKEW is one of the calculations that output.
If you're interested, here is a clean tutorial on Excel's Descriptive Statistics feature.
So which one of these methods are best? Well it depends on what you need to do and how much time you have. I would say that it's always best to visualize your dataset before working on it. I typically build a histogram of any dataset I have just to see the general makeup of the data. If you don't want to go through that rigmarole, and just want a general sense of the Skewness of your data, Excel's SKEW function is probably the easiest method.
I guess that's it then. I don't know about you, but all that math talk has made me hungry. I'm off to get to get a breakfast sandwich.