REGRESSION ANALYSIS
Data often fit a mathematical formula, which may be a straight
line (i.e., a linear relationship) or almost any other continuous function. In
regression analysis, we use the power of a computer program to determine the
formula that best fits the data. To do this, we must tell the program what type
of curve to use, for example, a straight line, a quadratic equation, or an
exponential curve. The program then provides the parameters within the selected
equation that best fit the data points.
The art of regression is complex. The first problem is to pick
an appropriate mathematical relationship for the data; graphing the data is
obviously very useful here. There are also some mathematical subtleties in how
best to fit the data to the curve, but by far the most common method is a least-squares
approach, which minimizes the sum of the squared distances of the data points from
the proposed curve. Although this methodology is about two centuries old, it clearly
benefits from modern computerized calculation, without which it would be practically
unusable.
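As a minimal sketch of the least-squares idea, the following fits a straight line to a small set of invented data points (the numbers here are purely illustrative); numpy's polyfit carries out the same minimization the text describes:

```python
import numpy as np

# Hypothetical data points (x, y); any small data set would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares: choose slope m and intercept b to minimize
# sum((y - (m*x + b))**2).  np.polyfit with deg=1 solves this
# minimization for a straight line.
m, b = np.polyfit(x, y, deg=1)

# Residual sum of squares: the quantity least squares minimizes.
rss = np.sum((y - (m * x + b)) ** 2)
```

For these points the fitted slope comes out very close to 2, reflecting the roughly doubling trend in the invented data.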
The computer program for regression analysis will provide the
parameters of the fitted equation, such as the slope and y
intercept for a straight line, and also some information about how well these parameters
fit the data. The user must be cautious in applying the fitted equation if the
agreement is poor, because almost any jumble of data can be fitted to some equation,
albeit badly. It is fair to stress that the human eye has real power
to judge trends, so if a graph of the data does not appear to follow the fitted curve,
the wise will be very skeptical of the result.
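One common numerical measure of this agreement is R-squared, the fraction of the variance in the outcome explained by the fitted equation. The sketch below (with invented data) contrasts a tight linear trend with pure noise: a line can be "fitted" to either, but only the first fit means anything:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)

def r_squared(x, y):
    """Fraction of the variance in y explained by a fitted line."""
    m, b = np.polyfit(x, y, deg=1)
    residuals = y - (m * x + b)
    return 1.0 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# Tight linear trend: R^2 comes out close to 1.
y_good = 3.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

# Pure noise: a line can still be fitted, but R^2 is near 0.
y_noise = rng.normal(0, 5.0, x.size)
```

Graphing both data sets alongside their fitted lines makes the same point visually, which is exactly the eyeball check the text recommends.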
Univariate versus Multivariate Regression
In fitting data in a regression analysis, the first step is to
choose the variables that will be used as independent variables. In univariate
analysis, only one independent variable is used: the data are plotted and the
computations performed using that single variable to describe the outcome. As an
example, the weights of a group of subjects may be compared with their heights.
However, in the real world, many possible variables can almost always be used to
determine an outcome. In multivariate analysis,
more than one variable is used to describe the observed results. In the weight example,
the subjects' weights might be analyzed by using heights, ages, and gender. Note
that in this multivariate example, the outcome is a continuous interval variable,
but is determined by a collection of interval variables (age, height) and a categorical
variable (gender). The precise mathematical techniques used for multivariate
analysis depend on the nature of the variables involved; in general, methods that
accommodate a mixed group of variable types make the most sense with real-world data.
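A multivariate fit of this kind can be sketched with ordinary least squares, where each independent variable contributes one column of a design matrix and the categorical variable enters as a 0/1 indicator. The subjects below are invented for illustration:

```python
import numpy as np

# Hypothetical subjects: height (cm), age (yr), gender coded 0/1.
height = np.array([160., 175., 150., 180., 165., 170.])
age    = np.array([30.,  45.,  25.,  50.,  35.,  40.])
gender = np.array([0.,   1.,   0.,   1.,   0.,   1.])
weight = np.array([55.,  80.,  48.,  88.,  60.,  75.])

# Design matrix: one column per independent variable, plus a
# column of ones for the intercept.  The categorical variable
# (gender) enters as a 0/1 indicator column.
X = np.column_stack([height, age, gender, np.ones_like(height)])

# Multivariate least squares: one coefficient per column.
coef, *_ = np.linalg.lstsq(X, weight, rcond=None)
fitted = X @ coef
```

Real statistical packages add standard errors and significance tests for each coefficient; the point here is only that the mixed interval and categorical variables fit naturally into one design matrix.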
On first approaching analysis of a data set, a simple univariate
analysis may help in understanding the relationships involved. However, the data
analyst must be cautioned that subtle relationships can be missed with only univariate
analysis if the relevant variable is not included. As an example, heart rate may
correlate with the dosage of pain medicine in an injured patient, but both are
likely driven by a common cause: the degree of pain.
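This confounding pattern is easy to demonstrate with invented data: if both dose and heart rate are generated from an underlying pain variable, with no direct link between them, they still correlate strongly with each other:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical confounder: degree of pain drives both variables.
pain = rng.uniform(0, 10, n)
dose = 2.0 * pain + rng.normal(0, 1.0, n)          # medicine tracks pain
heart_rate = 70.0 + 3.0 * pain + rng.normal(0, 5.0, n)

# Dose and heart rate correlate strongly even though neither
# causes the other; both follow the shared pain variable.
r = np.corrcoef(dose, heart_rate)[0, 1]
```

A univariate analysis of dose versus heart rate would report this strong correlation; only including pain in the analysis reveals it as the actual driver.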
With multivariate regression analysis, many potential pitfalls
can readily be encountered with computerized calculations. One critical category
of problems derives from the relationships that the various variables may have with
one another, which is often described by indicating how independent
these variables are in relation to one another. When two variables are independent,
the values of one do not predict the values of the other. In the
example of weight versus age and height, we would expect age and height to be independent
in adults. Mathematically, it would be determined that age and height are uncorrelated.
However, in children, age and height are correlated, so weight could be expressed
as a function of either age or height. In this last case, a multivariate analysis
that focused on independent variables might simply report that weight is a function
of height and discard the relationship between weight and age. Hence, when many
variables are involved, caution must be applied
so that experimentally important relationships are not missed because they do not
add mathematically to the results. Sophisticated statistical computer
packages will look for these sorts of problems. Identification of significant correlation
among the different variables would suggest that some information may be lost if
only independent variables are reported. If a simpler statistical program is used,
it would be wise to perform multivariate analyses with some of the variables left
out to see whether other possible relevant relationships appear.
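One simple version of this check is to compute the pairwise correlation of the candidate predictors before fitting. The sketch below uses invented data for children, where age strongly determines height:

```python
import numpy as np

# Hypothetical children: age strongly determines height, so the
# two candidate "independent" variables are highly correlated.
age = np.array([4., 5., 6., 7., 8., 9., 10., 11.])
height = 85.0 + 6.0 * age + np.array([1., -2., 0., 2., -1., 1., 0., -1.])

# Pairwise correlation of the candidate predictors.  A value near
# +/-1 warns that the variables carry overlapping information, and
# one may displace the other in a multivariate fit.
r = np.corrcoef(age, height)[0, 1]
```

A correlation this close to 1 is exactly the warning sign described above: whichever variable the fitting procedure retains, the other's relationship to the outcome should still be examined separately.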
Another type of problem that can be misleading with multivariable
relationships occurs when the computer program identifies a variable that correlates
only weakly with the data but does so with high reliability. This
situation would be expressed in linear analysis as a result with a small correlation
coefficient but a very good "P value" (more about
"P" later). The fact that the correlation is not
probably a statistical fluke does not offer much to the experimental analysis if
the correlation is poor—the data are not well explained by the relationship.
Here, the confusion comes from the fact that although the correlation is weak and
does not explain much, the calculation shows very high confidence in the presence
of this weak correlation.
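A large sample makes this situation easy to reproduce with invented data: a genuinely weak relationship yields a small correlation coefficient, yet the P value is tiny because the sample size leaves little doubt that the weak correlation is real. The sketch below uses the standard t statistic for a correlation coefficient with a normal approximation for the two-sided P value, which is adequate at this sample size:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# A genuinely weak relationship: y depends only slightly on x.
x = rng.normal(size=n)
y = 0.08 * x + rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]

# t statistic for the null hypothesis of zero correlation; with
# n this large, a normal approximation to the two-sided P value
# is adequate.
t = r * math.sqrt((n - 2) / (1 - r**2))
p = math.erfc(abs(t) / math.sqrt(2))
```

Here r squared stays well under a few percent, so the relationship explains almost none of the variation in y, yet the P value is far below conventional significance thresholds: high confidence in a weak correlation.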