
Measures of Dispersion

Often we want to describe not only the values of our data but also how the data are spread out. For data that follow a normal distribution, the classic approach is to compute the standard deviation (SD) of the data. With this value, roughly 68% of the data fall within 1 SD of the mean and roughly 95% fall within 2 SD of the mean value. The larger the SD, the wider the "bell-shaped" curve, and the smaller the SD, the narrower the curve.
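The 68% and 95% figures can be checked numerically. The following sketch uses NumPy on simulated normal data; the library choice, seed, and the mean of 50 and SD of 10 are illustrative assumptions, not values from the text.

```python
import numpy as np

# Simulate 100,000 normally distributed measurements (mean 50, SD 10 are
# arbitrary illustrative choices).
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=50, scale=10, size=100_000)

mean = data.mean()
sd = data.std(ddof=1)  # sample standard deviation

# Fraction of points falling within 1 SD and 2 SD of the mean.
within_1sd = np.mean(np.abs(data - mean) <= 1 * sd)
within_2sd = np.mean(np.abs(data - mean) <= 2 * sd)

print(f"mean = {mean:.2f}, SD = {sd:.2f}")
print(f"within 1 SD: {within_1sd:.1%}")  # close to 68%
print(f"within 2 SD: {within_2sd:.1%}")  # close to 95%
```

With this many simulated points, the observed fractions land very close to the theoretical 68% and 95%.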

Another way of looking at this concept is to consider that if we were making measurements of an unknown value and our values were scattered in a random normal distribution around that value, it is likely that about 68% of the data points would be within 1 SD of the true value. Of course, this result is only the most probable outcome, and in any real set of randomly distributed data we could be unlucky and have very atypical points.

For data that do not follow a normal distribution, it can be difficult to describe the dispersion of the data set in a standard fashion. Often, just the range of the data, from lowest to highest, is given. Occasionally, the data may be so scattered, with very distant extremes, that the range is less useful than reporting the 25th to 75th percentiles of the data (the interquartile range).
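A small sketch shows why the 25th to 75th percentiles can be more informative than the range for skewed data; the length-of-stay values below are hypothetical, invented for illustration.

```python
import numpy as np

# Hypothetical skewed data set (e.g. hospital length of stay, in days),
# with one extreme value.
stays = np.array([1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 9, 14, 60])

low, high = stays.min(), stays.max()
q25, q75 = np.percentile(stays, [25, 75])

print(f"range: {low} to {high}")                  # the outlier (60) dominates
print(f"25th to 75th percentile: {q25} to {q75}")  # unaffected by the extreme
```

The single extreme value stretches the range to 1–60, while the 25th to 75th percentile span still describes where the bulk of the data lie.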

Caution

Because it is so easy to use a computer package to calculate the SD, we must be particularly cautious about using it. Frequently, data do not fit nicely into the normal distribution, with its symmetric bell shape and unbounded tails extending in both directions. As an example, consider a population with widely dispersed ages but with many children. It would be easy to obtain a mean age of 10 years with an SD of 15. Clearly, no values with ages of -5 lie at the lower end of the distribution, even though the mean minus 1 SD implies them. In such situations, the SD has some merit in describing the spread of the data but is, by strict mathematics, being misapplied.
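The caution above can be demonstrated with a hypothetical population; the ages below are invented so that the mean minus 1 SD falls below zero, an impossible age.

```python
import numpy as np

# Hypothetical population: many children plus a few elderly residents,
# giving a strongly skewed, non-normal age distribution.
ages = np.array([1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8, 9, 70, 75, 80])

mean, sd = ages.mean(), ages.std(ddof=1)
print(f"mean = {mean:.1f}, SD = {sd:.1f}")
print(f"mean - 1 SD = {mean - sd:.1f}")  # negative: an impossible age
```

The SD is computed without complaint by any software package, but interpreting "mean minus 1 SD" here would imply negative ages that no one in the population has.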

Standard Error of the Mean

Just as random numerical data can often be described by the SD, the means computed from multiple grouped determinations of the data set can be described by making a computation very similar to the SD, applied not to the measured data but to the computed means. This computed quantity is termed the standard error of the mean (SEM); numerically, SEM = SD/√n, where n is the number of data points. As more data sets are gathered, with each data point being a measurement of a "true" but unknown value, the means of the data sets will very probably get closer and closer to that true value. Hence, as more and more data are gathered, the SEM will get smaller and smaller. This concept is intuitively very reasonable: as we make more and more measurements of a value, we should obtain a mean value closer and closer to the "true" value, even if the data points themselves remain scattered.
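The shrinking of the SEM with sample size can be seen directly using the standard formula SEM = SD/√n; the population parameters and sample sizes below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
sems = []

# Draw progressively larger samples from the same population. The SD of
# the data stays roughly constant, but the SEM shrinks as n grows.
for n in (10, 100, 1000, 10_000):
    sample = rng.normal(loc=100, scale=20, size=n)
    sd = sample.std(ddof=1)   # describes the scatter of the data
    sem = sd / np.sqrt(n)     # describes the certainty of the mean
    sems.append(sem)
    print(f"n = {n:5d}: SD = {sd:5.1f}, SEM = {sem:5.2f}")
```

Each tenfold increase in n cuts the SEM by roughly √10, while the SD hovers near the population value of 20 throughout.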

The difference between SD and SEM is important: SD is used to describe the data, whereas SEM is used for computations about the certainty of the mean of the data. Because our computer packages can just as easily provide either value, and the SEM is the smaller of the two, one is tempted to describe the data by the SEM. Although this is not dishonest if clearly labeled, it can certainly be misleading. If you have a very large number of data points, the SEM will be very small, but widely dispersed data will have a large SD no matter how many points are measured.
