This is the second of three posts in Descriptive Statistics. Click here to see the full list of statistics posts.


We left off last time with central tendency, spread, and trying to beat Quan in a breath-holding contest. In this post we will continue the topic of descriptive statistics by covering shape.


Platy-kur-what?

When we describe the shape of a dataset, we are almost always picturing it as a histogram. Fortunately, the hist command is quite capable and flexible, which makes this work much easier. Recall my breath-holding data from last time, which we examined with a scatter / dot-plot and a boxplot. This time we can look at its shape in histogram form (you can download the .mat file here).

load RobPracticeHolds.mat;
h1 = figure('Position',[100 100 600 400],'Color','w');
subplot(4,1,1:2); hist(breathholds);
  title('Practice Holds'); showx = get(gca,'Xlim');
subplot(4,1,3); boxplot(breathholds,'orientation','horizontal','widths',.5);
  set(gca,'Yticklabel',[],'Xlim',showx); xlabel(''); ylabel('');
subplot(4,1,4); scatter(breathholds,ones(size(breathholds)));
  set(gca,'Yticklabel',[],'Xlim',showx); xlabel('seconds');

Stat Pic 1

The hist function (top graph) does a few things: first it fits 10 “bins” equispaced within the range of the data (default is 10 bins, but using the nbins input arg changes that); then it counts the number of points that land inside of each of those bins; and finally it draws a vertical bar graph of that count, using bar widths to fit those bins.
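The three steps just described can be sketched by hand. This is only an illustration: the data vector is made up (it is not the contents of RobPracticeHolds.mat), the variable names are my own, and hist's exact handling of points that fall on a bin edge may differ slightly from the simple rule used here.

```matlab
% Hypothetical breath-hold times in seconds (made-up values, not the
% author's RobPracticeHolds.mat data).
x = [41 45 47 50 52 52 55 58 60 61 63 65 70 72 95];
nbins = 10;                               % same as hist's default bin count

% Step 1: equally spaced bins spanning the range of the data.
lo = min(x);
binWidth = (max(x) - lo) / nbins;
centers = lo + binWidth * ((1:nbins) - 0.5);

% Step 2: count the points that land in each bin. The maximum value is
% clamped into the last bin so it is not dropped.
idx = min(floor((x - lo) / binWidth) + 1, nbins);
counts = accumarray(idx(:), 1, [nbins 1])';

% Step 3: draw the counts as vertical bars that fill the bin widths.
figure('Color','w');
bar(centers, counts, 1);
xlabel('seconds'); ylabel('count');
```

Every point lands in exactly one bin, so sum(counts) equals numel(x). Changing nbins here plays the same role as the nbins input to hist.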

The data takes on a “toothier” look using hist, where the main features are the 2 peak bins with 8 points each, and an empty bin to the right, sitting between the outlier and the rest of the data. In general, I would “eyeball” this shape to be slightly asymmetric, with the longer tail going out to the right (called skewed to the right). It also looks like a somewhat flat / wide shape since the data is not tightly grouped (called platykurtosis).

These visual observations of shape are nice, but they sound too much like art critique to be useful. The next questions you might have are: what do they mean, and is there a way to quantify them so we don’t have to rely on the subjectivity of the “eyeball” method?

Quantifying Shape

When looking at histograms, the 2 basic ways to quantify their shape come down to answering 2 questions about them:

  • Is the data asymmetric, and if so - which way does it tilt?
  • How is the data spread out - wide and flat or skinny and tall?

Skewness answers the first question about symmetry. For a given dataset, it compares the tails on the left and right sides of the mean against each other. Perfectly symmetrical tails (such as the Normal Distribution has) return a skewness value of zero. Datasets with a heavier left tail return negative numbers (skewed to the left), while heavier right tails return positive numbers (skewed to the right). The greater the magnitude of skewness, the less balanced the tails.

myskew = skewness(breathholds);
disp (['My practice holds have a skewness of ',num2str(myskew)]);
if myskew+.1<0
    disp('So they are skewed to the left.');
elseif myskew-.1>0
    disp('So they are skewed to the right.');
else
    disp('So they are approximately symmetrically distributed.');
end
My practice holds have a skewness of 0.72107
So they are skewed to the right.
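To see where a number like this comes from, skewness can be computed by hand as the third central moment divided by the cube of the standard deviation. The sketch below assumes the 1/n (uncorrected) moments, which to my knowledge is what skewness uses by default. The data vector and variable names are made up for illustration.

```matlab
% Hypothetical breath-hold times in seconds (made-up values, not the
% author's RobPracticeHolds.mat data).
x = [41 45 47 50 52 52 55 58 60 61 63 65 70 72 95];

d  = x - mean(x);          % deviations from the mean
m2 = mean(d.^2);           % second central moment (1/n variance)
m3 = mean(d.^3);           % third central moment: sign shows the heavier tail
handSkew = m3 / m2^1.5;    % standardize by sigma^3

fprintf('Skewness by hand: %.4f\n', handSkew);

% Optional cross-check, only if the Statistics Toolbox is installed.
if exist('skewness','file')
    fprintf('skewness(x):      %.4f\n', skewness(x));
end
```

Cubing the deviations keeps their sign, so a few large positive deviations (a long right tail) outweigh many small negative ones and push the result above zero.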

Kurtosis answers the second question by comparing how much of the spread sits in the tails, relative to a Normal Distribution. This function sets the kurtosis of a perfect Normal Distribution equal to 3 (some other software packages subtract 3 so that a perfect Normal equals zero). If the given distribution has heavier tails than Normal, then kurtosis returns values greater than 3 (indicating that more outliers could be present); this is called leptokurtic. For a distribution with lighter tails than Normal, which tends to look flatter and more evenly spread, the kurtosis is less than 3 (indicating outliers are less likely); this is called platykurtic. Because kurtosis is dominated by the tails, a single far-out point can push the value above 3 even when the bulk of the histogram looks flat.

mykurt = kurtosis(breathholds);
disp (['My practice holds have a kurtosis of ',num2str(mykurt)]);
if mykurt+.1<3
    disp('So they have lighter tails than Normal.');
elseif mykurt-.1>3
    disp('So they have heavier tails than Normal.');
else
    disp('So their tails are about the same as Normal.');
end
My practice holds have a kurtosis of 3.1865
So they have heavier tails than Normal.
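The same by-hand approach works for kurtosis: it is the fourth central moment divided by the squared variance. The sketch below again assumes 1/n moments (which I believe matches the default of kurtosis) and uses a made-up data vector. It also shows the "subtract 3" convention mentioned above.

```matlab
% Hypothetical breath-hold times in seconds (made-up values, not the
% author's RobPracticeHolds.mat data).
x = [41 45 47 50 52 52 55 58 60 61 63 65 70 72 95];

d  = x - mean(x);            % deviations from the mean
m2 = mean(d.^2);             % second central moment (1/n variance)
m4 = mean(d.^4);             % fourth central moment: dominated by far-out points
handKurt   = m4 / m2^2;      % convention where a perfect Normal gives 3
excessKurt = handKurt - 3;   % convention where a perfect Normal gives 0

fprintf('Kurtosis by hand: %.4f (excess: %.4f)\n', handKurt, excessKurt);

% Optional cross-check, only if the Statistics Toolbox is installed.
if exist('kurtosis','file')
    fprintf('kurtosis(x):      %.4f\n', kurtosis(x));
end
```

When comparing numbers across tools, it is worth checking which of the two conventions is being reported before reading anything into the value.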

A Real-Life Use of Shape

Like a lot of things in math, you might be thinking “OK, that’s nice. But when would I use this?” Shape parameters are used most often when checking whether certain inferential statistics are appropriate for your data, particularly statistical tests that depend on the shape being approximately Normal or that focus on what the tails are doing.

For example, an industry standard test to determine if a group of values meets a specification based on a small sample (i.e. only testing a few out of the group) is found in ANSI Z1.9. This test procedure makes assumptions about the skewness and kurtosis (symmetry and flatness) of the dataset being close to that of the Normal Distribution. If you have a one-sided specification, let’s say it is higher-is-better, then data that is skewed to the right would be more likely to fail this test unnecessarily. In fact, you would be penalized for better performance (higher values)!

h2 = figure('Position',[100 100 600 400],'Color','w');
histfit(breathholds); title('Practice Holds Compared to ANSI Z1.9');
legend('My data','ANSI Z1.9 thinks');
annotation('textarrow',[.25 .32],[.6 .13],...
    'string',[{'Overestimate of'};{'the left tail'}]);
annotation('line',[.35 .35],[.11 .92],'color','r','linewidth',2);

Stat pic 2

So, let’s assume that I cannot accept a breath hold time of less than 40 seconds. If we were to run the ANSI Z1.9 test on my practice breath-holding data, the test would assume the variation is symmetric (like the Normal distribution), so the spread created by the higher times gets mirrored onto the low side. It therefore concludes that the left tail (low values) extends below our lower limit of 40 by an unacceptable amount (we fail the test).

So, when testing to see if I can hold my breath for an acceptable amount of time, I should choose something other than ANSI Z1.9 since my data is skewed. Continuing to use this type of test would be overly conservative and reject some good results.
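The effect can be illustrated with a rough calculation. To be clear, this is not the ANSI Z1.9 procedure itself, only a sketch of the Normal-tail estimate behind the idea: fit a Normal curve using the sample mean and standard deviation, then ask how much of that curve falls below the limit. The data vector is made up, and only the limit of 40 seconds comes from the example above.

```matlab
% Hypothetical right-skewed breath-hold times in seconds (made-up values,
% not the author's RobPracticeHolds.mat data). None are below 40.
x = [41 45 47 50 52 52 55 58 60 61 63 65 70 72 95];
lowerLimit = 40;                      % lower specification limit from the text

mu    = mean(x);
sigma = std(x);                       % inflated by the long right tail

% Fraction of a fitted Normal curve below the limit, using the identity
% Phi(z) = 0.5*erfc(-z/sqrt(2)) so no toolbox function is needed.
z = (lowerLimit - mu) / sigma;
pNormal = 0.5 * erfc(-z / sqrt(2));

% Fraction of the actual observations below the limit.
pObserved = mean(x < lowerLimit);

fprintf('Normal model predicts %.1f%% below %d s\n', 100*pNormal, lowerLimit);
fprintf('Observed in the sample: %.1f%%\n', 100*pObserved);
```

With data like this, the fitted Normal curve puts a noticeable share of its left tail below the limit even though no observed value is that low, which is the overestimate the figure points at.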

Wrapping up

That’s it for our introduction to descriptive statistics. In the next post we’ll tidy up with a few of my favorite visualizations for descriptive stats. Then, from there on, we’re going to be talking about inferential statistics, how they’re used, and what visualizations go along with them. As usual, questions and comments are welcome below.