
NORMAL DISTRIBUTION

Perhaps the most powerful concept in mathematical statistics is the normal distribution, often called the "bell-shaped curve." The idea of a most likely central value, with other values scattered symmetrically to either side, is a classic statistical concept with deep mathematical roots. It fits nicely with our intuitive idea of a "true" value with measurements distributed randomly around it, such that more values lie close to the center than far from it.

Many typical data sets follow a normal distribution around a center point. This distribution, sometimes also called a Gaussian distribution, is derived from the assumption that there is a central, "true" value and that deviations from the center are random, becoming less likely the farther the values fall from the center.

Strictly, the normal distribution applies to very large amounts of continuous interval data. When, as is usual, our data set does not meet this criterion, approximations are needed. Fortunately, most computerized statistical programs handle this problem effectively.

In general, the mathematical description of the placement of the values is termed the distribution. There are many distributions other than the normal distribution, most of them less common. They yield different curves and have specific mathematical properties. Not all data sets are normally distributed, even if they look like a bell-shaped curve. If the curve is too wide (i.e., many distant outlying points), too narrow, or uneven, the mathematics derived for normal distributions will not work well.

When data follow a normal distribution or, more generally, any specific mathematical distribution, the methods of parametric statistics can be used. The term parametric refers to the ability to describe the distribution with a specific set of values. Frequently, this is not possible, and nonparametric statistical methods are necessary. These methods make fewer mathematical assumptions about the distribution of the data, but they are more difficult and can be less powerful in finding statistical distinctions. Conversely, if parametric statistical methods are applied to data that are not normally distributed, false or misleading results can be calculated. This type of error is common and can be prevented by using appropriate tests or by checking the data for normality before applying tests that assume it.

In practical use, we often treat our data as though they follow a normal distribution. However, wise investigators will use the tests in their computer statistics packages to test for normality and recognize that if the data do not follow a normal distribution, other methods will be needed.
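Statistical packages offer formal normality tests (e.g., Shapiro-Wilk); as a minimal, self-contained sketch of the underlying idea, the sample skewness below should be near 0 for normally distributed data and clearly nonzero for lopsided data. The function names and example data here are illustrative, not part of any particular package.

```python
# Sketch of a simple normality check: sample skewness is near 0 for
# normal data and clearly positive for right-skewed data.
import math
import random

def skewness(data):
    """Sample skewness: third standardized moment of the data."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    return sum(((x - m) / s) ** 3 for x in data) / n

random.seed(1)
normal_sample = [random.gauss(0, 1) for _ in range(5000)]
skewed_sample = [random.expovariate(1.0) for _ in range(5000)]

print(round(skewness(normal_sample), 2))  # near 0 for normal data
print(round(skewness(skewed_sample), 2))  # clearly positive for skewed data
```

A formal test in a statistics package should be preferred in practice; this sketch only shows why a skewed sample fails the assumption.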

Measures of Central Tendency

In describing data, the first task is to give some indication of the approximate value, range, or size of what is being described. How this is done depends on the type of data involved (Table 23-2). For categorical or binary data, all that may be possible is to give the counts in each group. For categorical data with many groups, the most populous groups may be named and listed in order. For ordinal data, summary description may be difficult; often, first place or last place will be the most interesting.

The median is the center or middle data point if the data can be ordered from smallest to largest. This value would apply both to interval data and to ordinal types of categorical data. For interval data, many mathematical techniques can be applied. The simplest is the mean, or the simple average value of the numerical data, with each data point being counted equally. In a weighted mean, the individual points may be added in unequally, with some getting more credit (i.e., "weight") than others.
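The mean, median, and weighted mean described above can be illustrated with Python's standard library; the data and weights below are hypothetical examples, not from the text.

```python
# Mean, median, and weighted mean of a small hypothetical data set.
import statistics

data = [2, 4, 4, 5, 10]

simple_mean = statistics.mean(data)   # every point counted equally
med = statistics.median(data)         # middle value of the ordered data

def weighted_mean(values, weights):
    """Weighted mean: each value contributes in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

weights = [1, 1, 1, 1, 4]  # hypothetical: last point gets 4x the "credit"
print(simple_mean, med, weighted_mean(data, weights))
```

Note how the weighted mean is pulled toward the heavily weighted point relative to the simple mean.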

For interval data that can be analyzed mathematically, the data may be fitted to a curve, which means that a mathematical formula is computed that closely fits the measured data points, often by using a computer and sophisticated calculations. The relationship can be as simple as a straight line, or a complicated mathematical formula with exponentials, polynomials, or other functions included as needed. The various computed values in these formulas would be the parameters of the curves. The data would thus be described by the parameters of the formula approximated to the observed data. For the simple case of a straight-line approximation to data, the parameters would be the slope and intercept of the line. For complex equations fitted to data, the parameters would be the various computed numbers that make up the equations.
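For the straight-line case, the slope and intercept can be computed in closed form by ordinary least squares. This is a minimal sketch with toy data, assuming an exactly linear relationship for clarity:

```python
# Fitting a straight line by ordinary least squares: the slope and
# intercept are the "parameters" that describe the data.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data that lie exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With real, noisy data the fitted parameters would only approximate the underlying relationship, which is the usual situation the text describes.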

The mode is the most common value in a set of data points. This concept can be misleading if the data are truly continuous because every data point would then probably be different (if only infinitesimally so) from every other point. Hence, to describe the mode of continuous data, it makes the most sense to group the values into narrow intervals. In this sense, the interval with the most points within it is the modal interval. The mode may also be a very misleading description of the central tendency of a data set because there may be no reason for the most common value to be anywhere near the middle.

TABLE 23-2 -- Measures of central tendency
Mean ± standard deviation
Mode
Median
Percentiles
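The modal-interval idea can be sketched as follows: continuous measurements are grouped into intervals of fixed width, and the interval holding the most points is reported. The interval width and measurements here are hypothetical.

```python
# Sketch of a "modal interval": bin continuous data and report the
# most populated bin.
from collections import Counter

def modal_interval(data, width):
    """Return the (low, high) bounds of the most populated interval."""
    bins = Counter(int(x // width) for x in data)
    k, _ = bins.most_common(1)[0]
    return k * width, (k + 1) * width

measurements = [1.2, 1.7, 2.1, 2.3, 2.4, 2.9, 3.8, 5.0]
print(modal_interval(measurements, width=1.0))
```

The choice of interval width matters: bins that are too narrow scatter the points, while bins that are too wide blur the distribution.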

Numerical data can be described by categorizing the values into percentiles or similar groups. For example, 10% of the data points lie at or below the 10th percentile. The 50th percentile corresponds to the median of the data set, and the 99th percentile is the value at or above 99% of the data points. Similarly, quartiles, quintiles, or other groups can be computed. With any method used to describe data, the choice of description may subtly bias how we think about the result. In the simple data set [2, 2, 3, 7, 14], the mode = 2, the median = 3, and the mean = 5.6. Which, if any, of these representations is most accurate? The answer depends on how we want to use the data because no one representation is perfect.
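The summaries of the example data set above can be reproduced with Python's standard library; which summary is "right" depends, as the text says, on the question being asked.

```python
# The example data set from the text, summarized three ways, plus
# quartile cut points via statistics.quantiles.
import statistics

data = [2, 2, 3, 7, 14]
print(statistics.mode(data))    # 2
print(statistics.median(data))  # 3
print(statistics.mean(data))    # 5.6

# quantiles(n=4) returns the 3 cut points dividing the data into quartiles.
print(statistics.quantiles(data, n=4))
```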
