Measures of Dispersion
Often we want to describe not only the values of our data but
also how the data are spread out. For data that follow a normal distribution, the
classic approach is to compute the standard deviation
(SD) of the data. With this value, roughly 68% of the data fall within 1 SD of the
mean and roughly 95% fall within 2 SD of the mean. The larger the SD, the wider the
"bell-shaped" curve; the smaller the SD, the narrower the curve.
Another way of looking at this concept is to consider that if we were making repeated
measurements of an unknown value, and our values were scattered in a random normal
distribution around that value, then about 68% of the data points would likely fall
within 1 SD of the true value. Of course, this is only the most probable outcome; in
any real set of randomly distributed data we could be unlucky and have very atypical
points.
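As a minimal sketch of this rule of thumb, the following Python snippet (assuming NumPy is available; the "true" value of 100 and the scatter of 15 are illustrative, not drawn from any real data set) simulates normally distributed measurements and counts the fraction falling within 1 and 2 SD of the mean.

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Simulate 10,000 measurements scattered normally around a "true" value
    # of 100 with a spread of 15 (illustrative numbers, not real data).
    data = rng.normal(loc=100, scale=15, size=10_000)

    mean = data.mean()
    sd = data.std(ddof=1)  # sample standard deviation

    within_1sd = np.mean(np.abs(data - mean) <= 1 * sd)
    within_2sd = np.mean(np.abs(data - mean) <= 2 * sd)

    print(f"mean = {mean:.1f}, SD = {sd:.1f}")
    print(f"fraction within 1 SD: {within_1sd:.2%}")  # roughly 68%
    print(f"fraction within 2 SD: {within_2sd:.2%}")  # roughly 95%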
For data that do not follow a normal distribution, it can be difficult
to describe the dispersion of the data set in a standard fashion. Often, just the
range of the data, from lowest to highest, is given.
Occasionally, the data may be so scattered, with very distant extremes, that the
range is less useful than reporting the 25th to 75th percentiles of the data (the
interquartile range).
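A short illustration, again in Python with NumPy and with made-up numbers, of reporting the range and the 25th to 75th percentiles for a skewed data set:

    import numpy as np

    # Skewed, invented data (e.g., lengths of hospital stay in days) whose
    # range is dominated by a few extreme values.
    stays = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 9, 14, 28, 60])

    q25, q75 = np.percentile(stays, [25, 75])

    print(f"range: {stays.min()} to {stays.max()} days")
    print(f"25th to 75th percentiles: {q25:.0f} to {q75:.0f} days")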
Caution
Because it is so easy to have a computer package calculate the SD, we must be
particularly cautious about when we use it. Frequently, data do not fit neatly into
a normal distribution with a symmetric bell shape and unbounded tails in both
directions. As an example, consider a population with widely dispersed ages but with
many children. It would be easy to get a mean age of 10 years with an SD of 15.
Clearly, no values can lie at the lower end of the implied spread, since an age of
-5 is impossible. In such situations, the SD has some merit in describing the spread
of the data but is, by strict mathematics, being misapplied.
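The sketch below, using an invented mixture of many children and a smaller group of widely dispersed adult ages, shows how the mean minus 1 SD can land at an impossible negative age even though every actual value is non-negative.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Invented age distribution: mostly children, plus a smaller group of
    # widely dispersed adult ages (not real census data).
    children = rng.uniform(0, 12, size=850)
    adults = rng.uniform(20, 80, size=150)
    ages = np.concatenate([children, adults])

    mean = ages.mean()
    sd = ages.std(ddof=1)

    print(f"mean age = {mean:.1f}, SD = {sd:.1f}")
    print(f"mean - 1 SD = {mean - sd:.1f}")   # typically negative: an impossible age
    print(f"actual minimum age = {ages.min():.1f}")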
Standard Error of the Mean
Just as random numerical data can often be described by the SD, the means computed
from multiple grouped determinations of the data set can be described by a very
similar computation applied not to the measured data but to the computed means. This
computed quantity is termed the standard error of the mean (SEM); for a single data
set it is estimated as the SD divided by the square root of the number of data
points. As more data are gathered, with each data point being a measurement of a
"true" but unknown value, the means of the data sets will very probably get closer
and closer to that true value. Hence, as more and more data are gathered, the SEM
will get smaller and smaller. This concept is intuitively very reasonable: as we make
more and more measurements of a value, we should get a mean closer and closer to the
"true" value, even if the data points themselves remain scattered.
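One way to see this concretely is to simulate many data sets of the same size, compute each data set's mean, and compare the spread of those means with the SD-over-square-root-of-n formula. The Python sketch below does this with NumPy; the true value, scatter, sample size, and number of data sets are all illustrative.

    import numpy as np

    rng = np.random.default_rng(seed=2)

    # Draw many data sets of the same size from a normal distribution around
    # a "true" value, compute each data set's mean, and compare the SD of
    # those means with the usual single-data-set formula SD / sqrt(n).
    true_value, scatter, n = 50.0, 5.0, 25
    n_datasets = 2_000

    means = np.array([
        rng.normal(loc=true_value, scale=scatter, size=n).mean()
        for _ in range(n_datasets)
    ])

    print(f"SD of the {n_datasets} computed means: {means.std(ddof=1):.3f}")
    print(f"SEM formula, SD / sqrt(n):            {scatter / np.sqrt(n):.3f}")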
The difference between SD and SEM is important. The SD is used to describe the data;
the SEM is used for computations about the certainty of the mean of the data. Because
our computer packages can provide either value just as easily, and because the SEM is
the smaller of the two, one is tempted to describe the data by the SEM. Although this
is not dishonest if clearly labeled, it can certainly be misleading. If you have a
very large number of data points, the SEM will be very small, but widely dispersed
data will have a large SD no matter how many points are measured.
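A brief Python illustration of this last point, with an invented scatter of 20 around a true value of 100: as the number of points grows, the SD stays roughly constant while the SEM keeps shrinking.

    import numpy as np

    rng = np.random.default_rng(seed=3)

    # Widely dispersed measurements: the SD stays roughly constant as n
    # grows, while the SEM (SD / sqrt(n)) keeps shrinking.
    for n in (10, 100, 1_000, 10_000):
        data = rng.normal(loc=100, scale=20, size=n)
        sd = data.std(ddof=1)
        print(f"n = {n:>6}: SD = {sd:5.1f}, SEM = {sd / np.sqrt(n):6.3f}")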