Computation is rarely done in this manner; it is shown here only as an example of applying the definitional formula, which by itself provides little insight into the meaning of the correlation coefficient. Computation using this formula is demonstrated below on some example data:
X | Y | zX | zY | zXzY |
12 | 33 | -1.07 | -0.61 | 0.65 |
15 | 31 | -0.70 | -1.38 | 0.97 |
19 | 35 | -0.20 | 0.15 | -0.03 |
25 | 37 | 0.55 | 0.92 | 0.51 |
32 | 37 | 1.42 | 0.92 | 1.31 |
SUM = | 3.40 |
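The table above can be reproduced with a short script. This is a minimal sketch, assuming the definitional formula referred to here is r = SUM(zX * zY) / (N - 1) and that the z-scores use the sample standard deviation (dividing by N - 1); both assumptions are consistent with the tabled values. The function name z_scores is my own and not part of the original text.

```python
from math import sqrt

# Example data from the table above.
x = [12, 15, 19, 25, 32]
y = [33, 31, 35, 37, 37]

def z_scores(values):
    """Standardize values using the sample standard deviation (N - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

zx = z_scores(x)
zy = z_scores(y)
products = [a * b for a, b in zip(zx, zy)]

total = sum(products)        # the SUM of the zXzY column
r = total / (len(x) - 1)     # assumed formula: r = SUM(zX * zY) / (N - 1)

for xi, yi, a, b, p in zip(x, y, zx, zy, products):
    print(f"{xi} | {yi} | {a:.2f} | {b:.2f} | {p:.2f} |")
print(f"SUM = {total:.3f}, r = {r:.3f}")
```

The unrounded sum is about 3.405, so r comes out at roughly .85. The last digit of the sum can differ from the tabled 3.40 depending on when rounding is done.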
The Correlation Matrix
A convenient way of summarizing a large number of correlation coefficients is to put them in a single table, called a correlation matrix. A correlation matrix is a table of all possible correlation coefficients between a set of variables. For example, suppose a questionnaire of the following form (Reed, 1983) produced a data matrix as follows.
AGE - What is your age? _____
KNOW - Number of correct answers out of 10 possible to a geography quiz which consisted of correctly locating 10 states on a state map of the United States.
VISIT - How many states have you visited? _____
COMAIR - Have you ever flown on a commercial airliner? _____
SEX - 1 = Male, 2 = Female
Since there are five questions on the example questionnaire, there are 5 * 5 = 25 possible correlation coefficients to be computed. Each computed correlation is then placed in a table with variables as both rows and columns, at the intersection of the row and column variable names. For example, one could calculate the correlation between AGE and KNOWLEDGE (the KNOW item), AGE and STATEVIS (the VISIT item), AGE and COMAIR, AGE and SEX, KNOWLEDGE and STATEVIS, etc., and place them in a table of the following form.
One would not need to calculate all possible correlation coefficients, however, because the correlation of any variable with itself is necessarily 1.00. Thus the diagonals of the matrix need not be computed. In addition, the correlation coefficient is non-directional. That is, it doesn't make any difference whether the correlation is computed between AGE and KNOWLEDGE with AGE as X and KNOWLEDGE as Y or KNOWLEDGE as X and AGE as Y. For this reason the correlation matrix is symmetrical around the diagonal. In the example case then, rather than 25 correlation coefficients to compute, only 10 need be found: 25 (total) - 5 (diagonals) - 10 (redundant because of symmetry) = 10 (unique correlation coefficients).
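The counting argument above can be checked by listing the unique pairs directly. This is a small sketch using the variable names from the questionnaire; the general count k(k - 1)/2 for k variables is a standard result rather than something stated in the text.

```python
from itertools import combinations

# Variable names from the example questionnaire.
variables = ["AGE", "KNOW", "VISIT", "COMAIR", "SEX"]
k = len(variables)

total_cells = k * k                               # 5 * 5 = 25 cells in the full matrix
diagonal = k                                      # each variable with itself, always 1.00
unique_pairs = list(combinations(variables, 2))   # one triangle of the matrix
redundant = len(unique_pairs)                     # mirror images across the diagonal

# 25 - 5 - 10 = 10, which also equals k(k - 1)/2.
assert total_cells - diagonal - redundant == len(unique_pairs) == k * (k - 1) // 2

for a, b in unique_pairs:
    print(f"r({a}, {b})")
print(f"{len(unique_pairs)} unique correlations for {k} variables")
```

With five variables the script lists ten pairs, matching the count given above.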
To calculate a correlation matrix using SPSS select CORRELATIONS and BIVARIATE as follows:
Select the variables that are to be included in the correlation matrix as follows. In this case all variables will be included, and optional means and standard deviations will be output.
The results of the preceding are as follows:
Interpretation of the data analysis might proceed as follows. The table of means and standard deviations indicates that the average Psychology 121 student who filled out this questionnaire was about 19 years old, could identify slightly more than six states out of ten, and had visited a little over 18 of the 50 states. The majority (67%) have flown on a commercial airplane and there were fewer females (43%) than males.
The analysis of the correlation matrix indicates that few of the observed relationships were very strong. The strongest relationship was between the number of states visited and whether or not the student had flown on a commercial airplane (r=.42) which indicates that if a student had flown he/she was more likely to have visited more states. This is because of the positive sign on the correlation coefficient and the coding of the commercial airplane question (0=NO, 1=YES). The positive correlation means that as X increases, so does Y: thus, students who responded that they had flown on a commercial airplane visited more states on the average than those who hadn't.
Age was positively correlated with number of states visited (r=.22) and flying on a commercial airplane (r=.19), with older students more likely both to have visited more states and to have flown, although the relationship was not very strong. The greater the number of states visited, the more states the student was likely to correctly identify on the map, although again the relationship was weak (r=.28). Note that one of the students who said he had visited 48 of the 50 states could identify only 5 of 10 on the map.
Finally, sex of the participant was slightly correlated with both age (r=.17), indicating that females were slightly older than males, and number of states visited (r=-.16), indicating that females visited fewer states than males. These conclusions are possible because of the sign of the correlation coefficient and the way the sex variable was coded: 1=male, 2=female. When the correlation with sex is positive, females will have more of whatever is being measured on Y. The opposite is the case when the correlation is negative.
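The dependence of the sign on the coding can be seen by reversing the codes. The sketch below uses a handful of made-up scores, invented purely for illustration and not taken from the survey; the function name pearson_r is my own.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation from deviation sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for illustration only (not the survey data).
sex = [1, 1, 1, 1, 2, 2, 2, 2]              # 1 = male, 2 = female
visits = [20, 25, 18, 22, 15, 19, 12, 17]   # states visited

r_original = pearson_r(sex, visits)

# Reverse the coding: now 1 = female, 2 = male.
sex_recoded = [3 - s for s in sex]
r_recoded = pearson_r(sex_recoded, visits)

print(f"1=male, 2=female: r = {r_original:.2f}")
print(f"1=female, 2=male: r = {r_recoded:.2f}")
```

In these made-up scores the females average fewer states, so the correlation is negative under the original coding. Reversing the codes changes only the sign, not the size, of the coefficient.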
CAUTIONS ABOUT INTERPRETING CORRELATION COEFFICIENTS
Appropriate Data Type
Correct interpretation of a correlation coefficient requires the assumption that both variables, X and Y, meet the interval property requirements of their respective measurement systems. Calculators and computers will produce a correlation coefficient regardless of whether or not the numbers are "meaningful" in a measurement sense.
As discussed in the chapter on Measurement, the interval property is rarely, if ever, fully satisfied in real applications. There is some difference of opinion among statisticians about when it is appropriate to assume the interval property is met. My personal opinion is that as long as a larger number means that the object has more of something or another, then application of the correlation coefficient is useful, although the greater the potential deviation from the interval property, the more cautiously the result must be interpreted. When the data are clearly nominal categorical with more than two levels (1=Protestant, 2=Catholic, 3=Jewish, 4=Other), application of the correlation coefficient is clearly inappropriate.
An exception to the preceding rule occurs when the nominal categorical scale is dichotomous, or has two levels (1=Male, 2=Female). Correlation coefficients computed with data of this type on the X variable, the Y variable, or both may be safely interpreted because the interval property is assumed to be met for these variables. Correlation coefficients computed using data of this type are sometimes given special names, but since these names seem to add little to the understanding of the meaning of the correlation coefficient, they will not be presented.
Effect of Outliers
An outlier is a score that falls outside the range of the rest of the scores on the scatterplot. For example, if age is a variable and the sample is a statistics class, an outlier would be a retired individual. Depending upon where the outlier falls, the correlation coefficient may be increased or decreased.
An outlier which falls near where the regression line would normally fall will generally increase the size of the correlation coefficient, as seen below.
r = .457
An outlier that falls some distance away from the original regression line would decrease the size of the correlation coefficient, as seen below:
r = .336
The effect of the outliers on the above examples is somewhat muted because the sample size is fairly large (N=100). The smaller the sample size, the greater the effect of the outlier. With a large enough sample, a single outlier will have little or no effect on the size of the correlation coefficient.
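Both effects, and the role of sample size, can be tried out numerically. The sketch below reuses the five X, Y pairs from the computational example earlier in this chapter and adds one extra point; the two outlier points are hypothetical values chosen for illustration, and the "large" sample is simply the same five pairs repeated 20 times.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation from deviation sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# The five pairs from the computational example earlier in the chapter.
x = [12, 15, 19, 25, 32]
y = [33, 31, 35, 37, 37]
r_base = pearson_r(x, y)

# Hypothetical outliers, invented for illustration.
near_line = (60, 45)   # roughly in line with the trend of the other points
off_line = (60, 20)    # far from the trend

r_near = pearson_r(x + [near_line[0]], y + [near_line[1]])
r_off = pearson_r(x + [off_line[0]], y + [off_line[1]])

# The same off-trend outlier added to a larger sample (the five pairs
# repeated 20 times, N = 100), which leaves the starting r unchanged.
big_x, big_y = x * 20, y * 20
r_off_big = pearson_r(big_x + [off_line[0]], big_y + [off_line[1]])

print(f"no outlier:                  r = {r_base:.2f}")
print(f"outlier near the line:       r = {r_near:.2f}")
print(f"outlier off the line, N=6:   r = {r_off:.2f}")
print(f"outlier off the line, N=101: r = {r_off_big:.2f}")
```

In this sketch the off-trend point reverses the sign of the correlation in the six-pair sample, while the same point moves the coefficient much less when it is one case among 101.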
When a researcher encounters an outlier, a decision must be made whether to include it in the data set. It may be that the respondent was deliberately malingering, giving wrong answers, or simply did not understand the question on the questionnaire. On the other hand, it may be that the outlier is real and simply different. The decision whether to include or not include an outlier remains with the researcher; he or she must justify deleting any data to the reader of a technical report, however. It is suggested that the correlation coefficient be computed and reported both with and without the outlier if there is any doubt about whether or not it is real data. In any case, the best way of spotting an outlier is by drawing the scatterplot.
CORRELATION AND CAUSATION
No discussion of correlation would be complete without a discussion of causation. It is possible for two variables to be related (correlated), but not have one variable cause another.
For example, suppose there exists a high correlation between the number of popsicles sold and the number of drowning deaths. Does that mean that one should not eat popsicles before one swims? Not necessarily. Both of the above variables are related to a common variable, the heat of the day. The hotter the temperature, the more popsicles sold and also the more people swimming, thus the more drowning deaths. This is an example of correlation without causation.
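The popsicle example can be imitated with simulated data. The sketch below is entirely made up: temperature takes three hypothetical levels and drives both variables, popsicle sales have no effect at all on drownings, and drownings are treated as a continuous number for simplicity. All of the numbers and names are assumptions chosen for illustration.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation from deviation sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable

# Made-up model: temperature drives both variables independently.
temps, popsicles, drownings = [], [], []
for level in (70, 85, 100):       # three hypothetical temperature levels
    for _ in range(500):          # 500 simulated days at each level
        temps.append(level)
        popsicles.append(2.0 * level + random.gauss(0, 10))
        drownings.append(0.1 * level + random.gauss(0, 1))

# Correlation over all days, ignoring temperature.
r_all = pearson_r(popsicles, drownings)
print(f"all days: r = {r_all:.2f}")

# Correlation among days that share the same temperature.
r_within = {}
for level in (70, 85, 100):
    idx = [i for i, t in enumerate(temps) if t == level]
    r_within[level] = pearson_r([popsicles[i] for i in idx],
                                [drownings[i] for i in idx])
    print(f"temperature {level}: r = {r_within[level]:.2f}")
```

Across all days the two variables are strongly correlated, but among days with the same temperature the correlation is close to zero, which is what one would expect if the heat of the day is the common variable.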
Much of the early evidence that cigarette smoking causes cancer was correlational. It may be that people who smoke are more nervous and nervous people are more susceptible to cancer. It may also be that smoking does indeed cause cancer. The cigarette companies made the former argument, while some doctors made the latter. In this case I believe the relationship is causal and therefore do not smoke.
Sociologists are very much concerned with the question of correlation and causation because much of their data is correlational. Sociologists have developed a branch of correlational analysis, called path analysis, precisely to determine causation from correlations (Blalock, 1971). Before a correlation may imply causation, certain requirements must be met. These requirements include: (1) the causal variable must temporally precede the variable it causes, and (2) certain relationships between the causal variable and other variables must be met.
If a high correlation were found between the age of the teacher and the students' grades, it would not necessarily mean that older teachers are more experienced, teach better, and give higher grades. Neither would it necessarily imply that older teachers are soft touches, don't care, and give higher grades. Some other explanation might also account for the results. The correlation means that older teachers tend to give higher grades and younger teachers tend to give lower grades. It does not explain why this is the case.
SUMMARY AND CONCLUSION
A simple correlation may be interpreted in a number of different ways: as a measure of linear relationship, as the slope of the regression line of z-scores, and as the correlation coefficient squared as the proportion of variance accounted for by knowing one of the variables. All the above interpretations are correct and in a certain sense mean the same thing.
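Two of these interpretations can be checked numerically on the five X, Y pairs from the computational example earlier in this chapter. This is a minimal sketch, assuming z-scores based on the sample standard deviation and r = SUM(zX * zY) / (N - 1); the function and variable names are my own.

```python
from math import sqrt

# The five pairs from the computational example earlier in the chapter.
x = [12, 15, 19, 25, 32]
y = [33, 31, 35, 37, 37]

def z_scores(values):
    """Standardize values using the sample standard deviation (N - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

zx, zy = z_scores(x), z_scores(y)
n = len(x)
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Least-squares slope for predicting zY from zX (both have a mean of zero).
slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

# Proportion of the variance in zY accounted for by the predictions slope * zX.
ss_total = sum(b * b for b in zy)
ss_residual = sum((b - slope * a) ** 2 for a, b in zip(zx, zy))
proportion = 1 - ss_residual / ss_total

print(f"r = {r:.3f}, slope of z-score regression = {slope:.3f}")
print(f"r squared = {r ** 2:.3f}, proportion of variance = {proportion:.3f}")
```

The slope of the z-score regression line equals r, and the proportion of variance accounted for equals r squared, apart from floating-point rounding.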
A number of qualities which might affect the size of the correlation coefficient were identified. They included missing parts of the distribution, outliers, and common variables. Finally, the relationship between correlation and causation was discussed.