REGRESSION ANALYSIS
Data often fit a mathematical formula, which may be a straight
line (i.e., a linear relationship) or almost any other continuous function. In
regression analysis, we use the power of a computer program to determine the
formula that best fits the data. To do this, we must tell the program what type
of curve to use, for example, a straight line, a quadratic equation, or an
exponential curve. The program then provides the parameters within the selected
equation that best fit the data points.
The art of regression is complex. The first problem is to pick
an appropriate mathematical relationship for the data; graphing the data is
obviously very useful here. There are also some mathematical subtleties in how
best to fit the data to the curve, but by far the most common method is a least-squares
approach, which minimizes the sum of the squared distances of the data points from
the proposed curve. Although this methodology is about two centuries old, it clearly
benefits from modern computerized calculation, without which it would be practically
unusable.
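As a minimal sketch of the least-squares idea, the following fits a straight line to a small set of invented data points (the numbers here are purely illustrative); numpy's polyfit carries out the same minimization the text describes:

```python
import numpy as np

# Hypothetical data points (x, y); any small data set would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares: choose slope m and intercept b to minimize
# sum((y - (m*x + b))**2).  np.polyfit with deg=1 solves this
# minimization for a straight line.
m, b = np.polyfit(x, y, deg=1)

# Residual sum of squares: the quantity least squares minimizes.
rss = np.sum((y - (m * x + b)) ** 2)
```

For these points the fitted slope comes out very close to 2, reflecting the roughly doubling trend in the invented data.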
The computer program for regression analysis will provide the
parameters of the fitted equation, such as the slope and y
intercept for a straight line, and also some information about how well these parameters
fit the data. The user must be cautious in applying the fitted equation if the
agreement is poor, because almost any jumble of data can be fitted to some equation,
albeit badly. It is fair to stress that the human eye has real power
to judge trends, so if a graph of the data does not appear to follow the fitted curve,
the wise will be very skeptical of the result.
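One common numerical measure of this agreement is R-squared, the fraction of the variance in the outcome explained by the fitted equation. The sketch below (with invented data) contrasts a tight linear trend with pure noise: a line can be "fitted" to either, but only the first fit means anything:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)

def r_squared(x, y):
    """Fraction of the variance in y explained by a fitted line."""
    m, b = np.polyfit(x, y, deg=1)
    residuals = y - (m * x + b)
    return 1.0 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# Tight linear trend: R^2 comes out close to 1.
y_good = 3.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

# Pure noise: a line can still be fitted, but R^2 is near 0.
y_noise = rng.normal(0, 5.0, x.size)
```

Graphing both data sets alongside their fitted lines makes the same point visually, which is exactly the eyeball check the text recommends.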
Univariate versus Multivariate Regression
In fitting data in a regression analysis, the first step is to
choose the variables that will be used as independent variables. In univariate
analysis, only one independent variable is used: the data are plotted and the
computations performed using that single variable to describe the outcome. As an
example, the weights of a group of subjects may be compared with their heights.
However, in the real world, many possible variables can almost always be used to
determine an outcome. In multivariate analysis,
more than one variable is used to describe the observed results. In the weight example,
the subjects' weights might be analyzed by using heights, ages, and gender. Note
that in this multivariate example, the outcome is a continuous interval variable,
but is determined by a collection of interval variables (age, height) and a categorical
variable (gender). The precise mathematical techniques used for multivariate
analysis depend on the nature of the variables involved; in general, methods that
accommodate a mixed group of variable types make the most sense with real-world data.
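A multivariate fit of this kind can be sketched with ordinary least squares, where each independent variable contributes one column of a design matrix and the categorical variable enters as a 0/1 indicator. The subjects below are invented for illustration:

```python
import numpy as np

# Hypothetical subjects: height (cm), age (yr), gender coded 0/1.
height = np.array([160., 175., 150., 180., 165., 170.])
age    = np.array([30.,  45.,  25.,  50.,  35.,  40.])
gender = np.array([0.,   1.,   0.,   1.,   0.,   1.])
weight = np.array([55.,  80.,  48.,  88.,  60.,  75.])

# Design matrix: one column per independent variable, plus a
# column of ones for the intercept.  The categorical variable
# (gender) enters as a 0/1 indicator column.
X = np.column_stack([height, age, gender, np.ones_like(height)])

# Multivariate least squares: one coefficient per column.
coef, *_ = np.linalg.lstsq(X, weight, rcond=None)
fitted = X @ coef
```

Real statistical packages add standard errors and significance tests for each coefficient; the point here is only that the mixed interval and categorical variables fit naturally into one design matrix.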
On first approaching analysis of a data set, a simple univariate
analysis may help in understanding the relationships involved. However, the data
analyst must be cautioned that subtle relationships can be missed with only univariate
analysis if the relevant variable is not included. As an example, heart rate may
correlate with the dosage of pain medicine in an injured patient, but both are
likely driven by a common cause: the degree of pain.
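This confounding pattern is easy to demonstrate with invented data: if both dose and heart rate are generated from an underlying pain variable, with no direct link between them, they still correlate strongly with each other:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical confounder: degree of pain drives both variables.
pain = rng.uniform(0, 10, n)
dose = 2.0 * pain + rng.normal(0, 1.0, n)          # medicine tracks pain
heart_rate = 70.0 + 3.0 * pain + rng.normal(0, 5.0, n)

# Dose and heart rate correlate strongly even though neither
# causes the other; both follow the shared pain variable.
r = np.corrcoef(dose, heart_rate)[0, 1]
```

A univariate analysis of dose versus heart rate would report this strong correlation; only including pain in the analysis reveals it as the actual driver.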
With multivariate regression analysis, many potential pitfalls
can readily be encountered with computerized calculations. One critical category
of problems derives from the relationships that the various variables may have with
one another, which is often described by indicating how independent
these variables are in relation to one another. When two variables are independent,
the values of one do not predict the values of the other. In the
example of weight versus age and height, we would expect age and height to be independent
in adults. Mathematically, it would be determined that age and height are uncorrelated.
However, in children, age and height are correlated, so weight could be expressed
as a function of either age or height. In this last case, a multivariate analysis
that focused on independent variables might simply report that weight is a function
of height and discard the relationship between weight and age. Hence, when many
variables are involved, caution must be applied
so that experimentally important relationships are not missed because they do not
add mathematically to the results. Sophisticated statistical computer
packages will look for these sorts of problems. Identification of significant correlation
among the different variables would suggest that some information may be lost if
only independent variables are reported. If a simpler statistical program is used,
it would be wise to perform multivariate analyses with some of the variables left
out to see whether other possible relevant relationships appear.
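One simple version of this check is to compute the pairwise correlation of the candidate predictors before fitting. The sketch below uses invented data for children, where age strongly determines height:

```python
import numpy as np

# Hypothetical children: age strongly determines height, so the
# two candidate "independent" variables are highly correlated.
age = np.array([4., 5., 6., 7., 8., 9., 10., 11.])
height = 85.0 + 6.0 * age + np.array([1., -2., 0., 2., -1., 1., 0., -1.])

# Pairwise correlation of the candidate predictors.  A value near
# +/-1 warns that the variables carry overlapping information, and
# one may displace the other in a multivariate fit.
r = np.corrcoef(age, height)[0, 1]
```

A correlation this close to 1 is exactly the warning sign described above: whichever variable the fitting procedure retains, the other's relationship to the outcome should still be examined separately.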
Another type of problem that can be misleading with multivariable
relationships occurs when the computer program identifies a variable that correlates
only weakly with the data but does so with high reliability. This
situation would be expressed in linear analysis as a result with a small correlation
coefficient but a very good "P value" (more about
"P" later). The fact that the correlation is not
probably a statistical fluke does not offer much to the experimental analysis if
the correlation is poor—the data are not well explained by the relationship.
Here, the confusion comes from the fact that although the correlation is weak and
does not explain much, the calculation shows very high confidence in the presence
of this weak correlation.
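A large sample makes this situation easy to reproduce with invented data: a genuinely weak relationship yields a small correlation coefficient, yet the P value is tiny because the sample size leaves little doubt that the weak correlation is real. The sketch below uses the standard t statistic for a correlation coefficient with a normal approximation for the two-sided P value, which is adequate at this sample size:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# A genuinely weak relationship: y depends only slightly on x.
x = rng.normal(size=n)
y = 0.08 * x + rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]

# t statistic for the null hypothesis of zero correlation; with
# n this large, a normal approximation to the two-sided P value
# is adequate.
t = r * math.sqrt((n - 2) / (1 - r**2))
p = math.erfc(abs(t) / math.sqrt(2))
```

Here r squared stays well under a few percent, so the relationship explains almost none of the variation in y, yet the P value is far below conventional significance thresholds: high confidence in a weak correlation.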