How to Calculate the Linear Correlation Coefficient

Regression analysis allows you to evaluate how one variable depends on another and how widely the values of the dependent variable scatter around the straight line that defines the relationship. These estimates and the corresponding confidence intervals make it possible to predict the value of the dependent variable and to determine the accuracy of this prediction.

The results of regression analysis can only be presented in a fairly complex numerical or graphical form. However, we are often interested not in predicting the value of one variable from the value of another, but simply in characterizing the tightness (strength) of the relationship between them, expressed as a single number.

This characteristic is called the correlation coefficient; it is usually denoted by the letter r. The correlation coefficient can take values from -1 to +1. The sign of the correlation coefficient shows the direction of the relationship (direct or inverse), and its absolute value shows the tightness of the relationship. A coefficient equal to -1 defines as rigid a relationship as a coefficient equal to 1. In the absence of a relationship, the correlation coefficient is zero.

Fig. 8.10 shows examples of dependencies and the corresponding values of r. We will consider two correlation coefficients.

The Pearson correlation coefficient is intended to describe a linear relationship between quantitative features; like regression analysis, it requires a normal distribution. When people simply speak of a "correlation coefficient", they almost always mean the Pearson correlation coefficient, and that is exactly what we will do.

Spearman's rank correlation coefficient can be used when the relationship is non-linear, and not only for quantitative but also for ordinal features. It is a non-parametric method and does not require any particular type of distribution.

We have already spoken about quantitative, qualitative and ordinal features in Chap. 5. Quantitative features are ordinary numerical data, such as height, weight or temperature. The values of a quantitative feature can be compared with one another: one can say which of them is greater, by how much and by how many times. For example, if one Martian weighs 15 g and another 10 g, then the first is one and a half times heavier than the second, and heavier by 5 g. The values of an ordinal feature can be ordered, but one cannot say by how much or by how many times one exceeds another. In medicine, ordinal features are quite common. For example, the results of a vaginal Pap test are evaluated on the following scale: 1) normal, 2) mild dysplasia, 3) moderate dysplasia, 4) severe dysplasia, 5) cancer in situ. Both quantitative and ordinal features can be arranged in order; on this common property is based a large group of non-parametric criteria, which include the Spearman rank correlation coefficient. We will get acquainted with other non-parametric criteria in Chap. 10.

Pearson correlation coefficient

And yet, why can't regression analysis be used to describe the tightness of a relationship? The residual standard deviation could be used as a measure of tightness. However, if you swap the dependent and independent variables, the residual standard deviation, like the other indicators of regression analysis, will be different.

Let's look at Fig. 8.11. Two regression lines were constructed from the sample of 10 Martians already known to us. In one case weight is the dependent variable, in the other it is the independent variable. The regression lines differ markedly.




Figure 8.11. If you swap x and y, the regression equation will be different, but the correlation coefficient will remain the same.
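The caption's claim is easy to check numerically. Below is a minimal sketch, assuming the ten Martian height/weight pairs tabulated in Table 8.3 and ordinary least-squares fits from numpy:

```python
import numpy as np

heights = np.array([31, 32, 33, 34, 35, 35, 40, 41, 42, 46])
weights = np.array([7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0])

# Two different regression lines: weight on height, and height on weight
slope_wh = np.polyfit(heights, weights, 1)[0]  # ≈ 0.44
slope_hw = np.polyfit(weights, heights, 1)[0]  # ≈ 1.93
print(slope_wh, slope_hw)  # the two lines are clearly not the same

# ...but the correlation coefficient is symmetric in x and y
print(np.corrcoef(heights, weights)[0, 1])  # ≈ 0.925
print(np.corrcoef(weights, heights)[0, 1])  # identical value
```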

It turns out that the relationship of height with weight is one thing, and that of weight with height another. The asymmetry of regression analysis is what prevents it from being used directly to characterize the strength of a relationship. The correlation coefficient, although its idea stems from regression analysis, is free from this shortcoming. Here is the formula:

r = Σ(X − X̄)(Y − Ȳ) / √[Σ(X − X̄)² · Σ(Y − Ȳ)²],

where X̄ and Ȳ are the mean values of the variables X and Y. The expression for r is "symmetrical": swapping X and Y, we obtain the same value. The correlation coefficient takes values from -1 to +1. The closer the relationship, the greater the absolute value of the correlation coefficient. The sign shows the direction of the relationship: for r > 0 we speak of a direct correlation (as one variable increases, the other also increases); for r < 0, of an inverse correlation (as one variable increases, the other decreases). Let's take the example with the 10 Martians, which we have already considered from the point of view of regression analysis, and calculate the correlation coefficient. The initial data and intermediate results of the calculations are given in Table 8.3. The sample size is n = 10; the mean height and weight are

X̄ = ΣX/n = 369/10 = 36.9 and Ȳ = ΣY/n = 103.8/10 = 10.38.

We find Σ(X − X̄)(Y − Ȳ) = 99.9, Σ(X − X̄)² = 224.8, Σ(Y − Ȳ)² = 51.9.

Let's substitute the obtained values into the formula for the correlation coefficient:

r = 99.9 / √(224.8 × 51.9) ≈ 0.925.

The value of r is close to 1, which indicates a close relationship between height and weight. To get a better idea of which correlation coefficients should be considered large and which insignificant, take a look at Table 8.4: it shows the correlation coefficients for the examples that we analyzed earlier.

Table 8.3. Calculation of the correlation coefficient
X     Y      X−X̄    Y−Ȳ    (X−X̄)(Y−Ȳ)   (X−X̄)²   (Y−Ȳ)²
31    7.8    -5.9   -2.6   15.3         34.8     6.8
32    8.3    -4.9   -2.1   10.3         24.0     4.4
33    7.6    -3.9   -2.8   10.9         15.2     7.8
34    9.1    -2.9   -1.3   3.8          8.4      1.7
35    9.6    -1.9   -0.8   1.5          3.6      0.6
35    9.8    -1.9   -0.6   1.1          3.6      0.4
40    11.8   3.1    1.4    4.3          9.6      2.0
41    12.1   4.1    1.7    7.0          16.8     2.9
42    14.7   5.1    4.3    22.0         26.0     18.5
46    13.0   9.1    2.6    23.7         82.8     6.8
369   103.8  0.0    0.2    99.9         224.8    51.9
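As a cross-check of the hand calculation, the same coefficient can be computed directly from the data in Table 8.3. A minimal sketch in plain Python, assuming nothing beyond the ten height/weight pairs above:

```python
import math

# Heights (X) and weights (Y) of the 10 Martians from Table 8.3
heights = [31, 32, 33, 34, 35, 35, 40, 41, 42, 46]
weights = [7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0]

n = len(heights)
mean_x = sum(heights) / n   # 36.9
mean_y = sum(weights) / n   # 10.38

# Numerator: sum of products of deviations from the means
s_xy = sum((x - mean_x) * (y - mean_y)
           for x, y in zip(heights, weights))
# Denominator: square root of the product of the sums of squares
s_xx = sum((x - mean_x) ** 2 for x in heights)
s_yy = sum((y - mean_y) ** 2 for y in weights)

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 3))  # ≈ 0.925, matching the hand calculation
```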



Relationship between regression and correlation

All the examples of correlation coefficients (Table 8.4) were originally used by us to build regression lines. There is indeed a close relationship between the correlation coefficient and the parameters of regression analysis, which we will now demonstrate. The different ways of presenting the correlation coefficient that we obtain along the way will allow us to better understand the meaning of this indicator.

Recall that the regression equation is constructed in such a way as to minimize the sum of squared deviations from the regression line.


We denote this minimum sum of squares by S_res (this value is called the residual sum of squares). The sum of the squared deviations of the values of the dependent variable Y from its mean Ȳ will be denoted by S_tot. Then:

r² = 1 − S_res/S_tot.

The value of r² is called the coefficient of determination; it is simply the square of the correlation coefficient. The coefficient of determination shows the strength of the relationship, but not its direction.

From the above formula it can be seen that if the values of the dependent variable lie exactly on the regression line, then S_res = 0 and thus r = +1 or r = -1; that is, there is a strict linear relationship between the dependent and independent variables, and any value of the independent variable predicts the value of the dependent variable exactly. On the contrary, if the variables are not related at all, then S_res = S_tot and r = 0.

It can also be seen that the coefficient of determination equals the share of the total variation S_tot that is caused or, as they say, explained by the linear regression.

The residual sum of squares S_res is related to the residual variance s²y|x by S_res = (n − 2)·s²y|x, and the total sum of squares S_tot to the variance s²y by S_tot = (n − 1)·s²y. In this case

r² = 1 − [(n − 2)/(n − 1)] · (s²y|x / s²y).

This formula makes it possible to judge the dependence of the correlation coefficient on the share of the residual variance in the total variance, s²y|x / s²y: the smaller this share, the greater (in absolute value) the correlation coefficient, and vice versa.
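This identity can be verified numerically on the same Martian data. The sketch below assumes an ordinary least-squares fit (numpy's polyfit) for the regression line:

```python
import numpy as np

heights = np.array([31, 32, 33, 34, 35, 35, 40, 41, 42, 46])
weights = np.array([7.8, 8.3, 7.6, 9.1, 9.6, 9.8, 11.8, 12.1, 14.7, 13.0])
n = len(heights)

# Least-squares regression of weight on height
slope, intercept = np.polyfit(heights, weights, 1)
predicted = intercept + slope * heights

ss_res = np.sum((weights - predicted) ** 2)         # residual sum of squares
ss_tot = np.sum((weights - np.mean(weights)) ** 2)  # total sum of squares

r2_from_sums = 1 - ss_res / ss_tot

# The same value via the variances: s²y|x = SSres/(n-2), s²y = SStot/(n-1)
s2_res = ss_res / (n - 2)
s2_y = ss_tot / (n - 1)
r2_from_vars = 1 - (n - 2) / (n - 1) * (s2_res / s2_y)

print(round(r2_from_sums, 4), round(r2_from_vars, 4))  # both ≈ 0.8556
```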

We have seen that the correlation coefficient reflects the tightness of the linear relationship between variables. However, when it comes to predicting the value of one variable from the value of another, one should not rely too heavily on the correlation coefficient. For example, the data in Fig. 8.7 correspond to a very high correlation coefficient (r = 0.92), yet the width of the confidence region shows that the prediction uncertainty is quite significant. Therefore, even with a large correlation coefficient, be sure to calculate the confidence region.


Finally, we give the relationship between the correlation coefficient and the slope b of the regression line:

r = b · (sX / sY),

where b is the slope of the regression line, and sX and sY are the standard deviations of the variables.

If we set aside the case sX = 0, then the correlation coefficient is zero if and only if b = 0. We will now use this fact to assess the statistical significance of the correlation.

Statistical Significance of Correlation

Since b = 0 implies r = 0, the hypothesis of no correlation is equivalent to the hypothesis of zero slope of the regression line. Therefore, to assess the statistical significance of the correlation, we can use the formula already familiar to us for testing whether b differs from zero:

t = b / s_b,

where s_b is the standard error of the slope.

Here the number of degrees of freedom is ν = n − 2. However, if the correlation coefficient has already been calculated, it is more convenient to use the formula:

t = r · √[(n − 2)/(1 − r²)].

The number of degrees of freedom here is also ν = n − 2.
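Both quantities are easy to compute once r and n are known. A small helper, sketched under the assumption that scipy is available for the t distribution:

```python
import math
from scipy import stats

def correlation_significance(r, n):
    """Return the t statistic and two-sided p-value for H0: r = 0,
    with nu = n - 2 degrees of freedom."""
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p

# The Martian example: r ≈ 0.925, n = 10
t, p = correlation_significance(0.925, 10)
print(round(t, 2), p)  # t ≈ 6.9, p well below 0.001
```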

Despite the outward dissimilarity of the two formulas for t, they are identical. Indeed, from

r² = 1 − [(n − 2)/(n − 1)] · (s²y|x / s²y)

one can express s²y|x in terms of r², and substituting this value of s²y|x into the formula for the standard error of the slope turns the first formula for t into the second.

Animal fat and breast cancer

Experiments on laboratory animals have shown that a high content of animal fat in the diet increases the risk of breast cancer. Is this dependence observed in humans? K. Carroll collected data on the consumption of animal fats and mortality from breast cancer in 39 countries. The result is shown in Fig. 8.12A. The correlation coefficient between animal fat consumption and breast cancer mortality was found to be 0.90. Let us assess the statistical significance of this correlation.

t = 0.90 · √[(39 − 2)/(1 − 0.90²)] ≈ 12.6.

The critical value of t for ν = 39 − 2 = 37 degrees of freedom at the 0.001 significance level is 3.574, which is less than the value we obtained. Thus, at a significance level of 0.001, it can be asserted that there is a correlation between animal fat consumption and breast cancer mortality.

Now let's check whether mortality is associated with the consumption of vegetable fats. The corresponding data are shown in Fig. 8.12B. The correlation coefficient is 0.15. Then

t = 0.15 · √[(39 − 2)/(1 − 0.15²)] ≈ 0.92.

Even at a significance level of 0.10, the calculated value of t is less than the critical value. The correlation is not statistically significant.

The correlation coefficient is the degree of association between two variables. Calculating it gives an idea of whether there is a relationship between two data sets. Unlike regression, correlation does not allow values to be predicted. However, calculating the coefficient is an important step in preliminary statistical analysis. For example, we found that the correlation coefficient between the level of foreign direct investment and GDP growth is high. This suggests that in order to ensure prosperity, it is necessary to create a favorable climate specifically for foreign entrepreneurs. Not such an obvious conclusion at first glance!

Correlation and causality

Perhaps no other area of statistics has become so firmly established in our lives. The correlation coefficient is used in all areas of social knowledge. Its main danger is that its high values are often exploited in order to convince people and make them believe certain conclusions. In fact, however, a strong correlation does not at all indicate a causal relationship between the quantities.

Correlation coefficient: Pearson and Spearman formula

There are several main indicators that characterize the relationship between two variables. Historically the first is Pearson's linear correlation coefficient, which is taught in school. It was developed by K. Pearson and G. Yule based on the work of F. Galton. This coefficient measures the linear relationship between quantitative variables. It always lies between -1 and 1: a negative value indicates an inversely proportional relationship; a zero coefficient means there is no relationship between the variables; a positive value means there is a directly proportional dependence between the studied quantities. Spearman's rank correlation coefficient makes it possible to simplify calculations by building a hierarchy of the variables' values.

Relationships between variables

Correlation helps answer two questions. First, is the relationship between the variables positive or negative? Second, how strong is the dependence? Correlation analysis is a powerful tool for obtaining this important information. It is easy to see that household incomes and expenses rise and fall proportionally; such a relationship is considered positive. On the contrary, when the price of a product rises, the demand for it falls; such a relationship is called negative. The values of the correlation coefficient lie between -1 and 1. Zero means there is no relationship between the studied values. The closer the indicator is to the extreme values, the stronger the relationship (negative or positive). A coefficient between -0.1 and 0.1 indicates an absence of dependence. It must be understood that such a value indicates only the absence of a linear relationship.

Application features

The use of both indicators is subject to certain assumptions. First, the presence of a strong relationship does not establish that one quantity determines the other: there may well be a third quantity that determines each of them. Second, a high Pearson correlation coefficient does not indicate a causal relationship between the studied variables. Third, the coefficient captures an exclusively linear relationship. Correlation can be used to evaluate meaningful quantitative data (for example, atmospheric pressure or air temperature), but not categories such as gender or favorite color.

Multiple correlation coefficient

Pearson and Spearman investigated the relationship between two variables. But what if there are three or even more of them? This is where the multiple correlation coefficient comes in. For example, the gross national product is affected not only by foreign direct investment but also by the monetary and fiscal policies of the state, as well as by the level of exports. The growth rate and volume of GDP are the result of the interaction of a number of factors. However, it should be understood that the multiple correlation model rests on a number of simplifications and assumptions. First, multicollinearity between the quantities is excluded. Second, the relationship between the dependent variable and the variables that affect it is assumed to be linear.

Areas of use of correlation and regression analysis

This method of finding the relationship between quantities is widely used in statistics. It is most often resorted to in three main cases:

  1. To test for causal relationships between the values of two variables. In this case, the researcher hopes to find a linear dependence and derive a formula that describes the relationship between the quantities. Their units of measurement may differ.
  2. To check for a relationship between values. In this case, no one determines which variable is dependent. It may turn out that the value of both quantities determines some other factor.
  3. To derive an equation. In this case, you can simply substitute numbers into it and find out the values ​​of the unknown variable.

A man in search of a causal relationship

Consciousness is arranged in such a way that we absolutely need to explain the events happening around us. A person is always looking for a connection between the picture of the world he lives in and the information he receives. The brain often creates order out of chaos and can easily see a causal relationship where there is none. Scientists have to specifically learn to overcome this tendency. The ability to evaluate relationships between data objectively is essential in an academic career.

Media bias

Consider how the presence of a correlation can be misinterpreted. A group of British schoolchildren with bad behavior were asked whether their parents smoked, and the study was then published in a newspaper. The result showed a strong correlation between parents' smoking and their children's delinquency. The professor who conducted the study even suggested putting a warning about this on cigarette packs. However, there are a number of problems with this conclusion. First, the correlation does not indicate which of the quantities is independent; it is therefore quite possible to assume that the parents' pernicious habit is caused by the children's disobedience. Second, it is impossible to say with certainty that both problems did not arise from some third factor, for example low family income. The emotional side of the professor's initial conclusions should also be noted: he was an ardent opponent of smoking, so it is not surprising that he interpreted the results of his study in this way.

Conclusions

Misinterpreting correlation as a causal relationship between two variables can lead to embarrassing research errors. The problem is that this tendency lies at the very core of human consciousness, and many marketing tricks are based on it. Understanding the difference between causation and correlation allows you to analyze information rationally, both in everyday life and in a professional career.

The correlation coefficient is a value that can vary from +1 to -1. In the case of a complete positive correlation this coefficient equals plus 1 (meaning that as the value of one variable increases, the value of the other increases as well); in the case of a complete negative correlation, minus 1 (indicating an inverse relationship: as the values of one variable increase, the values of the other decrease).

Example 1:

A graph of the dependence between shyness and depression. As you can see, the points (subjects) are not located randomly but line up around a single line, and, looking at this line, we can say that the more pronounced a person's shyness, the more depressive he is; that is, these phenomena are interconnected.

Example 2: a graph for shyness and sociability. We see that as shyness increases, sociability decreases. Their correlation coefficient is -0.43. Thus, a correlation coefficient between 0 and 1 indicates a directly proportional relationship (the more ... the more ...), and a coefficient between -1 and 0 an inversely proportional one (the more ... the less ...).

If the correlation coefficient is 0, there is no linear relationship between the two variables.

Correlation is a relationship in which the impact of individual factors appears only as a trend (on average) under mass observation of actual data. Examples of correlation dependence are the dependence between the size of a bank's assets and its profit, or between the growth of labor productivity and employees' length of service.

Two systems of classification of correlations according to their strength are used: general and particular.

The general classification of correlations: 1) strong, or close, with a correlation coefficient r > 0.70; 2) medium, at 0.50 < r < 0.70; 3) moderate, at 0.30 < r < 0.50; 4) weak, at 0.20 < r < 0.30; 5) very weak, at r < 0.20. The particular classification is based instead on statistical significance: what matters there is a correlation that has reached a high level of significance, and not just a high value of the correlation itself.

The following table lists the names of the correlation coefficients for different types of scales.

Scale | Dichotomous scale (1/0) | Rank (ordinal) scale | Interval and absolute scale
Dichotomous scale (1/0) | Pearson's association coefficient, Pearson's four-cell contingency coefficient | Rank-biserial correlation | Biserial correlation
Rank (ordinal) scale | Rank-biserial correlation | Spearman's or Kendall's rank correlation coefficient | The interval values are converted into ranks and a rank coefficient is used
Interval and absolute scale | Biserial correlation | The interval values are converted into ranks and a rank coefficient is used | Pearson correlation coefficient (linear correlation coefficient)

At r=0 there is no linear correlation. In this case, the group means of the variables coincide with their general means, and the regression lines are parallel to the coordinate axes.

The equality r = 0 speaks only of the absence of a linear correlation dependence (uncorrelated variables), but not of the absence of correlation in general, much less of the absence of statistical dependence.

Sometimes the conclusion that there is no correlation is more important than the presence of a strong correlation. A zero correlation of two variables may indicate that there is no influence of one variable on the other, provided that we trust the results of the measurements.

In SPSS: 11.3.2 Correlation coefficients

Until now we have established only the mere fact of the existence of a statistical relationship between two features. Next we will try to find out what conclusions can be drawn about the strength or weakness of this dependence, as well as about its form and direction. Criteria for quantifying the relationship between variables are called correlation coefficients, or measures of association. Two variables are positively correlated if there is a direct, unidirectional relationship between them: small values of one variable correspond to small values of the other, and large values to large values. Two variables are negatively correlated if there is an inverse, multidirectional relationship between them: small values of one variable correspond to large values of the other, and vice versa. The values of correlation coefficients always lie in the range from -1 to +1.

Spearman's coefficient is used as the correlation coefficient for variables belonging to an ordinal scale, while Pearson's correlation coefficient (product-moment correlation) is used for variables belonging to an interval scale. Note that any dichotomous variable, that is, a variable belonging to a nominal scale and having two categories, can be treated as ordinal.

First, we will check whether there is a correlation between the sex and psyche variables from the studium.sav file, bearing in mind that the dichotomous variable sex can be considered ordinal. Do the following:

Select from the command menu: Analyze → Descriptive Statistics → Crosstabs... (contingency tables).

· Move the variable sex to a list of rows and the variable psyche to a list of columns.

· Click the Statistics... button. In the Crosstabs: Statistics dialog, check the Correlations box. Confirm your choice with the Continue button.

· In the Crosstabs dialog, suppress the display of the tables themselves by checking the Suppress tables checkbox. Click the OK button.

The Spearman and Pearson correlation coefficients will be calculated, and their significance will be tested.


Task number 10 Correlation analysis

The concept of correlation

Correlation, or the correlation coefficient, is a statistical indicator of the probabilistic relationship between two variables measured on quantitative scales. In contrast to a functional connection, in which each value of one variable corresponds to a strictly defined value of another variable, a probabilistic connection is characterized by the fact that each value of one variable corresponds to a set of values of the other variable. An example of a probabilistic relationship is the relationship between people's height and weight: clearly, people of different weights can have the same height, and vice versa.

Correlation is a value between -1 and +1 and is denoted by the letter r. If the value is closer to 1, this indicates a strong connection; if closer to 0, a weak one. A correlation value of less than 0.2 is considered weak; more than 0.5, high. If the correlation coefficient is negative, there is an inverse relationship: the higher the value of one variable, the lower the value of the other.

Depending on the values taken by the coefficient r, different types of correlation can be distinguished:

A strict positive correlation is defined by the value r = 1. The term "strict" means that the value of one variable is uniquely determined by the values of the other variable, and the term "positive" means that as the values of one variable increase, the values of the other variable increase as well.

A strict correlation is a mathematical abstraction and almost never occurs in real research.

A positive correlation corresponds to the values 0 < r < 1.

The absence of correlation is defined by the value r = 0. A correlation coefficient of zero indicates that the values of the variables are not related to each other in any way.

The absence of correlation, H₀: r_xy = 0, is what the null hypothesis states in correlation analysis.

A negative correlation corresponds to the values -1 < r < 0.

A strict negative correlation is defined by the value r = -1. Like a strict positive correlation, it is an abstraction and does not find expression in practical research.

Table 1

Types of correlation and their definitions

The method of calculating the correlation coefficient depends on the type of scale on which the values ​​of the variable are measured.

The Pearson correlation coefficient r is the basic one and can be used for variables measured on interval and partially ordered scales whose distribution of values corresponds to the normal distribution (product-moment correlation). The Pearson correlation coefficient also gives fairly accurate results in cases of non-normal distributions.

For distributions that are not normal, it is preferable to use the Spearman and Kendall rank correlation coefficients. They are called rank coefficients because the program pre-ranks the correlated variables.

The SPSS program calculates the Spearman correlation r as follows: first the variables are converted to ranks, and then the Pearson formula is applied to the ranks.
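The same recipe is easy to reproduce outside SPSS. A sketch assuming scipy's rankdata (which assigns average ranks to ties):

```python
import numpy as np
from scipy.stats import rankdata

def spearman_r(x, y):
    """Spearman's rho: rank both variables (ties get average ranks),
    then apply the Pearson formula to the ranks."""
    rx = rankdata(x)
    ry = rankdata(y)
    return np.corrcoef(rx, ry)[0, 1]

# Toy ordinal data: two items rated on 5-point scales (made-up values)
a = [1, 2, 2, 3, 4, 5, 5]
b = [1, 1, 2, 4, 3, 5, 4]
print(round(spearman_r(a, b), 3))
```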

The correlation proposed by M. Kendall is based on the idea that the direction of the relationship can be judged by comparing subjects in pairs: if for a pair of subjects the change in X coincides in direction with the change in Y, this indicates a positive relationship; if it does not coincide, a negative one. This coefficient is used mainly by psychologists working with small samples. Since sociologists work with large data arrays, it is difficult for them to sort through all the pairs of subjects in a sample and identify the difference between the relative frequencies of concordances and inversions. The most commonly used coefficient is Pearson's.

Since the Pearson correlation coefficient r is the basic one and can be used (with some error depending on the type of scale and the degree of non-normality of the distribution) for all variables measured on quantitative scales, we will consider examples of its use and compare the results obtained with the results of measurements using other correlation coefficients.

The formula for calculating the Pearson coefficient r:

r_xy = Σ(Xᵢ − X̄)·(Yᵢ − Ȳ) / [(N − 1)·σ_x·σ_y],

where Xᵢ, Yᵢ are the values of the two variables;

X̄, Ȳ are the mean values of the two variables;

σ_x, σ_y are the standard deviations;

N is the number of observations.
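This formula transcribes directly into Python; the sketch below assumes sample standard deviations (the N − 1 convention), matching the denominator above. The data are the hand-strength measurements from Table 50 further down, for which the text reports r ≈ +0.9:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: sum of deviation products over (N - 1) * sigma_x * sigma_y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    numerator = np.sum((x - x.mean()) * (y - y.mean()))
    return numerator / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

x = [14.0, 14.2, 14.9, 15.4, 16.0, 17.2, 18.1]
y = [12.1, 13.8, 14.2, 13.0, 14.6, 15.9, 17.4]
print(round(pearson_r(x, y), 2))  # ≈ 0.9
```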

Pair correlations

For example, we would like to find out how the answers about various types of traditional values in students' ideas about an ideal place of work are related to one another (variables a9.1, a9.3, a9.5, a9.7), and then do the same for liberal values (a9.2, a9.4, a9.6, a9.8). These variables are measured on 5-point ordered scales.

We use the procedure: "Analysis" → "Correlations" → "Paired". By default, the Pearson coefficient is selected in the dialog box; we use the Pearson coefficient.

The tested variables are transferred to the selection window: a9.1, a9.3, a9.5, a9.7

By pressing OK, we get the calculation:

Correlations (SPSS output). The output is a correlation matrix for the four variables:

a9.1.t. How important is it to have enough time for family and personal life?

a9.3.t. How important is it to not be afraid of losing your job?

a9.5.t. How important is it to have such a boss who will consult with you when making this or that decision?

a9.7.t. How important is it to work in a well-coordinated team, to feel like a part of it?

For each pair of variables the Pearson correlation and its two-sided significance are reported.

** Correlation is significant at the 0.01 level (2-sided).

Table of quantitative values ​​of the constructed correlation matrix

Partial correlations:

First, let's build a pairwise correlation between these two variables:

Correlations (SPSS output): the pairwise Pearson correlation between c8 (Feel close to those who live near you, neighbors) and c12 (Feel close to their family), with the two-sided significance reported for the pair.

**. The correlation is significant at the 0.01 level (2-sided).

Then we use the procedure for constructing a partial correlation: "Analysis" → "Correlations" → "Partial".

Suppose that the value "It is important to independently determine and change the order of your work" will, in relation to the indicated variables, prove to be the decisive factor under whose influence the previously identified relationship disappears or turns out to be insignificant.

Correlations (SPSS output, partial correlation). For c8 (Feel close to those who live near you, neighbors) and c12 (Feel close to their family), the table reports the partial correlation and its two-sided significance, with the control variable excluded; c16 (Feel close to people who have the same wealth as you) is also listed among the variables.

As can be seen from the table, under the influence of the control variable the relationship decreased slightly, from 0.120 to 0.102. It nevertheless remains high enough to reject the null hypothesis with a near-zero probability of error.
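The first-order partial correlation that SPSS reports can also be computed by hand from the three pairwise coefficients. A sketch of the standard formula; the two correlations with the control variable below are made-up placeholders, not values from this study:

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """Partial correlation of x and y, controlling for z."""
    return ((r_xy - r_xz * r_yz)
            / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2)))

# r_xy = 0.120 is the pairwise c8-c12 correlation from the text;
# r_xz and r_yz (correlations with the control) are hypothetical
print(round(partial_r(0.120, 0.20, 0.15), 3))
```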

Correlation coefficient

The most accurate way to determine the tightness and nature of a correlation is to find the correlation coefficient. The correlation coefficient is a number determined by the formula:

r_xy = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²],   (32)
where r_xy is the correlation coefficient;

xᵢ are the values of the first feature;

yᵢ are the values of the second feature;

x̄ is the arithmetic mean of the values of the first feature;

ȳ is the arithmetic mean of the values of the second feature.

To use formula (32), we construct a table that provides the necessary sequence in preparing the numbers for finding the numerator and denominator of the correlation coefficient.

As can be seen from formula (32), the sequence of actions is as follows: we find the arithmetic means of both features, x̄ and ȳ; we find the differences between the values of each feature and its mean, (xᵢ − x̄) and (yᵢ − ȳ); then we find their products (xᵢ − x̄)(yᵢ − ȳ), the sum of which gives the numerator of the correlation coefficient. To find its denominator, we square the differences (xᵢ − x̄) and (yᵢ − ȳ), find their sums, and extract the square root of their product.

So, for Example 31, finding the correlation coefficient in accordance with formula (32) can be represented as follows (Table 50).

The resulting number of the correlation coefficient makes it possible to establish the presence, closeness and nature of the relationship.

1. If the correlation coefficient is zero, there is no relationship between the features.

2. If the correlation coefficient is equal to one, the relationship between the features is so great that it turns into a functional one.

3. The absolute value of the correlation coefficient does not go beyond the interval from zero to one: 0 ≤ |r_xy| ≤ 1.

This makes it possible to focus on the tightness of the connection: the closer the coefficient is to zero, the weaker the connection, and the closer to unity, the closer the connection.

4. The sign of the correlation coefficient "plus" means direct correlation, the sign "minus" means the opposite.

Table 50

xᵢ      yᵢ      xᵢ−x̄    yᵢ−ȳ    (xᵢ−x̄)(yᵢ−ȳ)   (xᵢ−x̄)²   (yᵢ−ȳ)²
14.00   12.10   -1.70   -2.30   +3.91          2.89      5.29
14.20   13.80   -1.50   -0.60   +0.90          2.25      0.36
14.90   14.20   -0.80   -0.20   +0.16          0.64      0.04
15.40   13.00   -0.30   -1.40   +0.42          0.09      1.96
16.00   14.60   +0.30   +0.20   +0.06          0.09      0.04
17.20   15.90   +1.50   +1.50   +2.25          2.25      2.25
18.10   17.40   +2.40   +2.00   +4.80          5.76      4.00
109.80  101.00                  12.50          13.97     13.94


Thus, the correlation coefficient calculated in Example 31, r_xy = +0.9, allows us to draw the following conclusions: there is a correlation between the muscle strength of the right and left hands of the studied schoolchildren (the coefficient r_xy = +0.9 is non-zero); the relationship is very close (r_xy = +0.9 is close to one); the correlation is direct (r_xy = +0.9 is positive), i.e. as the muscle strength of one hand increases, the strength of the other hand increases.

When calculating the correlation coefficient and using its properties, it should be borne in mind that the conclusions are correct when the features are normally distributed and when the relationship is considered over a large number of values of both features.

In Example 31, only 7 values of each feature were analyzed, which, of course, is not enough for such studies. We remind the reader once again that the examples in this book in general, and in this chapter in particular, serve to illustrate the methods rather than to give a detailed presentation of any scientific experiments. For this reason a small number of feature values is considered and the measurements are rounded: all this is done so as not to obscure the idea of the method with cumbersome calculations.

Particular attention should be paid to the essence of the relationship under consideration. The correlation coefficient cannot lead to correct conclusions if the analysis of the relationship between the features is carried out formally. Let's go back to Example 31. Both features considered were values of the muscle strength of the right and left hands. But imagine that by the feature xᵢ in Example 31 (14.0; 14.2; 14.9 ... 18.1) we mean the length of randomly caught fish in centimeters, and by the feature yᵢ (12.1; 13.8; 14.2 ... 17.4) the weight of instruments in a laboratory in kilograms. Having formally used the apparatus of calculation to find the correlation coefficient, and having obtained r_xy = +0.9 in this case as well, we would have to conclude that there is a close, direct relationship between the length of the fish and the weight of the instruments. The absurdity of such a conclusion is obvious.

To avoid a formal approach to using the correlation coefficient, one should use some other method (mathematical, logical, experimental, or theoretical) to identify the possibility of a correlation between the features, that is, to detect their organic unity. Only then can one begin to use correlation analysis and establish the magnitude and nature of the relationship.

Mathematical statistics also has the concept of multiple correlation: a relationship between three or more features. In these cases, a multiple correlation coefficient is used, composed of the pairwise correlation coefficients described above.

For example, the multiple correlation coefficient of three features xᵢ, yᵢ, zᵢ is:

R_x.yz = √[(r²_xy + r²_xz − 2·r_xy·r_xz·r_yz) / (1 − r²_yz)],

where R_x.yz is the multiple correlation coefficient expressing how the feature xᵢ depends on the features yᵢ and zᵢ;

r_xy is the correlation coefficient between the features xᵢ and yᵢ;

r_xz is the correlation coefficient between the features xᵢ and zᵢ;

r_yz is the correlation coefficient between the features yᵢ and zᵢ.
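Assuming the standard three-variable formula given above, the coefficient transcribes directly:

```python
import math

def multiple_r(r_xy, r_xz, r_yz):
    """Multiple correlation of x with y and z, from pairwise coefficients."""
    return math.sqrt((r_xy ** 2 + r_xz ** 2 - 2 * r_xy * r_xz * r_yz)
                     / (1 - r_yz ** 2))

# Illustrative (hypothetical) pairwise correlations
print(round(multiple_r(r_xy=0.6, r_xz=0.5, r_yz=0.3), 3))  # ≈ 0.687
```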


Correlation analysis

Correlation is a statistical relationship between two or more random variables (or variables that can be considered as such with some acceptable degree of accuracy), in which changes in one or more of these quantities are accompanied by a systematic change in the other quantity or quantities. The correlation coefficient serves as a mathematical measure of the correlation of two random variables.

Correlation can be positive or negative (it is also possible that there is no statistical relationship at all, for example, for independent random variables). A negative correlation is one in which an increase in one variable is associated with a decrease in the other, the correlation coefficient being negative. A positive correlation is one in which an increase in one variable is associated with an increase in the other, the correlation coefficient being positive.

Autocorrelation is a statistical relationship between random variables from the same series, but taken with a shift, for example, for a random process, with a shift in time.

The method of processing statistical data, which consists in studying the coefficients (correlations) between variables, is called correlation analysis.

Correlation coefficient

The correlation coefficient, or pair correlation coefficient, in probability theory and statistics is an indicator of the nature of the joint change of two random variables. The correlation coefficient is denoted by the Latin letter R and can take values between -1 and +1. If its absolute value is closer to 1, this indicates a strong connection (with a correlation coefficient equal to one, one speaks of a functional connection); if closer to 0, a weak one.

Pearson correlation coefficient

For metric quantities, the Pearson correlation coefficient is used, the exact formula for which was introduced by Francis Galton.

Let X and Y be two random variables defined on the same probability space. Then their correlation coefficient is given by the formula

ρ(X, Y) = cov(X, Y) / √(D[X]·D[Y]),

where cov denotes the covariance and D the variance, or, equivalently,

ρ(X, Y) = (M[XY] − M[X]·M[Y]) / √[(M[X²] − M[X]²)·(M[Y²] − M[Y]²)],

where the symbol M denotes the mathematical expectation.

To graphically represent such a relationship, you can use a rectangular coordinate system with axes that correspond to both variables. Each pair of values ​​is marked with a specific symbol. Such a plot is called a "scatterplot".

The method of calculating the correlation coefficient depends on the type of scale to which the variables belong. To measure variables on interval and quantitative scales, the Pearson correlation coefficient (product-moment correlation) should be used. If at least one of the two variables has an ordinal scale or is not normally distributed, Spearman's rank correlation or Kendall's τ (tau) must be used. When one of the two variables is dichotomous, a point-biserial correlation is used, and if both variables are dichotomous, a four-field correlation. Calculating the correlation coefficient between two non-dichotomous variables makes sense only if the relationship between them is linear (unidirectional).

Kendall correlation coefficient

It is used to measure the mutual disorder of two rankings.

Spearman's correlation coefficient

Properties of the correlation coefficient

  • Cauchy-Bunyakovsky inequality: if the covariance is taken as the scalar product of two random variables, then the norm of a random variable X equals ‖X‖ = √D[X], and a consequence of the Cauchy-Bunyakovsky inequality is −1 ≤ ρ(X, Y) ≤ 1.
  • ρ(X, Y) = ±1 if and only if X and Y are related by a linear dependence Y = kX + b; moreover, the signs of ρ and k coincide.

Correlation analysis

Correlation analysis is a method of processing statistical data that consists in studying the coefficients (correlations) between variables. The correlation coefficients between one pair or multiple pairs of features are compared in order to establish statistical relationships between them.

The goal of correlation analysis is to provide some information about one variable with the help of another variable. In cases where it is possible to achieve this goal, we say that the variables correlate. In the most general form, accepting the hypothesis of the presence of a correlation means that a change in the value of variable A will occur simultaneously with a proportional change in the value of B: if both variables increase, the correlation is positive; if one variable increases while the other decreases, the correlation is negative.

Correlation reflects only the linear dependence of quantities, but it does not reflect their functional connectedness. For example, if we calculate the correlation coefficient between the values A = sin(x) and B = cos(x), it will be close to zero, i.e. there is no linear dependence between the quantities. Meanwhile, A and B are obviously related functionally, by the law sin²(x) + cos²(x) = 1.
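The sin/cos claim is easy to verify numerically, sampling one full period:

```python
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 1000)
a = np.sin(x)
b = np.cos(x)

# Pearson correlation between two functionally related variables
print(round(np.corrcoef(a, b)[0, 1], 3))  # ≈ 0: no linear dependence

# Yet the functional relationship holds exactly at every point
print(np.allclose(a ** 2 + b ** 2, 1.0))  # True
```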

Limitations of correlation analysis



Plots of distributions of pairs (x, y) with the corresponding correlation coefficients for each of them. Note that the correlation coefficient reflects a linear relationship (top row), does not capture the curvature of a relationship (middle row), and is not at all suitable for describing complex, non-linear relationships (bottom row).
  1. Application is possible only if there are enough cases to study: for a particular type of correlation coefficient, from 25 to 100 pairs of observations.
  2. The second limitation follows from the hypothesis of correlation analysis, which assumes a linear dependence of the variables. In many cases, when it is reliably known that a relationship exists, correlation analysis may give no result simply because the relationship is non-linear (expressed, for example, as a parabola).
  3. By itself, the fact of correlation gives no grounds to assert which of the variables precedes or causes the changes, or that the variables are causally related to each other at all; the correlation may, for example, be due to the action of a third factor.

Application area

This method of processing statistical data is very popular in economics and the social sciences (in particular, in psychology and sociology), although the scope of correlation coefficients is broad: quality control of industrial products, metallurgy, agricultural chemistry, hydrobiology, biometrics, and others.

The popularity of the method is due to two things: the correlation coefficients are relatively easy to calculate, and their application requires no special mathematical training. Combined with its ease of interpretation, this ease of application has led to the coefficient's widespread use in statistical data analysis.

Spurious correlation

The often tempting simplicity of a correlation study encourages the researcher to draw false intuitive conclusions about the presence of a causal relationship between pairs of features, whereas correlation coefficients establish only statistical relationships.

In the modern quantitative methodology of the social sciences there has, in fact, been an abandonment of attempts to establish causal relationships between observed variables by empirical methods. Therefore, when researchers in the social sciences talk about establishing relationships between the variables they study, either a general theoretical assumption or a statistical dependence is implied.

see also

  • Autocorrelation function
  • Cross-correlation function
  • Covariance
  • Coefficient of determination
  • Regression analysis


Correlation is the degree of connection between 2 or more independent phenomena.

Correlation can be positive or negative.

A positive (direct) correlation occurs when two variables change simultaneously in the same direction (both increase or both decrease). For example, the relationship between the number of users coming to a site from search results and the load on the server: the more users, the greater the load.

A correlation is negative (inverse) if a change in one quantity is accompanied by the opposite change in the other. For example, with an increase in the tax burden on companies, their profits decrease: the more taxes, the less money for development.

The effectiveness of correlation as a statistical tool lies in the ability to express the relationship between two variables using the correlation coefficient.

The correlation coefficient (CC) lies in the range of numbers from -1 to 1.

When the CC value equals 1, it should be understood that with each change in the first variable there is an equivalent change in the second variable in the same direction.


If the CC value is -1, then with each change there is an equivalent change in the second variable in the opposite direction.


The closer the correlation is to -1 or 1, the stronger the relationship between the variables. At a value of zero (or close to zero), there is no significant relationship between the two variables, or it is minimal.

This method of processing statistical information is popular in the economic, technical, social and other sciences owing to the simplicity of calculating the CC, the ease of interpreting the results, and the absence of any need for advanced mathematics.

Correlation reflects only the relationship between variables and says nothing about causation: a positive or negative correlation between two variables does not necessarily mean that a change in one variable causes a change in the other.

For example, there is a positive correlation between an increase in the salaries of sales managers and the quality of their work with clients (improving the quality of service, handling objections, knowing the product's advantages over competitors), given appropriate staff motivation. However, an increased volume of sales, and consequently higher managers' salaries, does not at all mean that the managers have improved the quality of their work with clients. It is quite possible that large orders happened to arrive and be shipped, or that the marketing department increased the advertising budget, or that something else happened.

Perhaps there is some third variable that accounts for the presence or absence of the correlation.

The correlation coefficient is not calculated when:

  • the relationship between the two variables is not linear, for example quadratic;
  • the data contain more than one observation for each case;
  • there are anomalous observations (outliers, "renegades"); this case is illustrated in the sketch below;
  • the data contain distinct subgroups of observations.
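To illustrate the outlier caveat with made-up numbers: a single extreme observation can inflate an otherwise negligible Pearson correlation almost to 1.

```python
import numpy as np

x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 4, 1, 5, 2]
print(round(np.corrcoef(x, y)[0, 1], 2))  # ≈ 0.13: essentially no relationship

# Add one anomalous observation to both variables
x_out = x + [50]
y_out = y + [50]
print(round(np.corrcoef(x_out, y_out)[0, 1], 2))  # ≈ 0.99: the outlier dominates
```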

COURSE WORK

Topic: Correlation analysis

Introduction

1. Correlation analysis

1.1 The concept of correlation

1.2 General classification of correlations

1.3 Correlation fields and the purpose of their construction

1.4 Stages of correlation analysis

1.5 Correlation coefficients

1.6 Normalized Bravais-Pearson correlation coefficient

1.7 Spearman's rank correlation coefficient

1.8 Basic properties of correlation coefficients

1.9 Checking the significance of correlation coefficients

1.10 Critical values ​​of the pair correlation coefficient

2. Planning a multivariate experiment

2.1 Condition of the problem

2.2 Determination of the center of the plan (main level) and the level of variation of factors

2.3 Building a planning matrix

2.4 Checking the homogeneity of the dispersion and the equal accuracy of measurements in different series

2.5 Coefficients of the regression equation

2.6 Reproducibility dispersion

2.7 Checking the significance of the coefficients of the regression equation

2.8 Checking the adequacy of the regression equation

Conclusion

Bibliography

INTRODUCTION

Experiment planning is a mathematical and statistical discipline that studies methods for the rational organization of experimental research, from the optimal choice of the factors under study and the determination of the actual plan of the experiment in accordance with its purpose, to methods for analyzing the results. The foundation of experiment planning was laid by the work of the English statistician R. Fisher (1935), who emphasized that rational experiment planning gives no less significant a gain in the accuracy of estimates than optimal processing of the measurement results. The modern theory of experiment planning took shape in the 1960s. Its methods are closely related to the theory of function approximation and mathematical programming. Optimal plans have been constructed, and their properties investigated, for a wide class of models.

Experiment planning is the choice of an experiment plan that meets specified requirements: a set of actions aimed at developing an experimentation strategy (from obtaining a priori information to obtaining a workable mathematical model or determining optimal conditions). It is purposeful control of an experiment, implemented under conditions of incomplete knowledge of the mechanism of the phenomenon under study.

In the process of measurement, subsequent data processing, and formalization of the results in the form of a mathematical model, errors arise and part of the information contained in the original data is lost. The use of experiment planning methods makes it possible to determine the error of the mathematical model and to judge its adequacy. If the accuracy of the model is insufficient, experiment planning methods make it possible to upgrade the mathematical model with additional experiments without losing the previous information and at minimal cost.

The purpose of experiment planning is to find conditions and rules for conducting experiments under which it is possible to obtain reliable and trustworthy information about the object with the least expenditure of labor, and to present this information in a compact and convenient form with a quantitative assessment of its accuracy.

The main planning methods used at different stages of research include:

Planning a screening experiment, the main meaning of which is the selection of a group of significant factors from the totality of factors that are subject to further detailed study;

Designing an experiment for analysis of variance, i.e. drawing up plans for objects with qualitative factors;

Planning a regression experiment that allows you to obtain regression models (polynomial and others);

Planning an extreme experiment, in which the main task is the experimental optimization of the object of study;

Planning in the study of dynamic processes, etc.

The purpose of studying the discipline is to prepare students for production and technical activities in their specialty using the methods of planning theory and modern information technologies.

The objectives of the discipline are to study modern methods of planning, organizing and optimizing scientific and industrial experiments, of conducting experiments, and of processing the results obtained.

1. CORRELATION ANALYSIS

1.1 The concept of correlation

A researcher is often interested in how two or more variables are related to one another in one or more of the samples under study. For example, can height affect a person's weight, or can pressure affect product quality?

This kind of relationship between variables is called a correlation. A correlation is a consistent change in two features, reflecting the fact that the variability of one feature is in line with the variability of the other.

It is known, for example, that on average there is a positive relationship between people's height and their weight: the greater the height, the greater the weight. However, there are exceptions to this rule, when short people are overweight and, conversely, tall asthenics are light in weight. The reason for such exceptions is that every biological, physiological or psychological feature is determined by the influence of many factors: environmental, genetic, social, ecological, and so on.

Correlations are probabilistic changes that can be studied only on representative samples using the methods of mathematical statistics. Both terms, correlation and correlation dependence, are often used interchangeably. Dependence implies influence; connection means any coordinated changes, which can be explained by hundreds of causes. Correlations cannot be regarded as evidence of a causal relationship; they indicate only that changes in one feature are, as a rule, accompanied by certain changes in another.

Correlation dependence means that changes in the values of one feature change the probability of occurrence of different values of another feature.

The task of correlation analysis reduces to establishing the direction (positive or negative) and the form (linear, non-linear) of the relationship between varying features, measuring its tightness and, finally, checking the significance level of the obtained correlation coefficients.

Correlations differ in form, direction and degree (strength).

The form of a correlation can be rectilinear or curvilinear. The relationship between the number of training sessions on a simulator and the number of correctly solved problems in a control session, for example, may be rectilinear. An example of a curvilinear relationship is that between the level of motivation and the effectiveness of performing a task (Figure 1): as motivation increases, the effectiveness of the task first increases, then the optimal level of motivation is reached, corresponding to the maximum effectiveness of the task; a further increase in motivation is accompanied by a decrease in effectiveness.

Figure 1 - The relationship between the effectiveness of problem solving and the strength of the motivational tendency

In direction, a correlation can be positive ("direct") or negative ("inverse"). With a positive rectilinear correlation, higher values of one feature correspond to higher values of the other, and lower values of one feature to lower values of the other (Figure 2). With a negative correlation, the relationships are reversed (Figure 3). With a positive correlation, the correlation coefficient has a positive sign; with a negative correlation, a negative sign.

Figure 2 - Direct correlation

Figure 3 - Inverse correlation


Figure 4 - No correlation

The degree, strength or tightness of the correlation is determined by the value of the correlation coefficient. The strength of the connection does not depend on its direction and is determined by the absolute value of the correlation coefficient.

1.2 General classification of correlations

Depending on the correlation coefficient, the following correlations are distinguished:

Strong, or close, with a correlation coefficient r > 0.70;

Medium (at 0.50 < r < 0.69);

Moderate (at 0.30 < r < 0.49);

Weak (at 0.20 < r < 0.29);

Very weak (at r < 0.19).
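For labeling computed coefficients, this classification wraps naturally into a small helper; the boundary conventions simply follow the list above:

```python
def correlation_strength(r):
    """Label |r| according to the general classification above."""
    a = abs(r)
    if a > 0.70:
        return "strong (close)"
    if a >= 0.50:
        return "medium"
    if a >= 0.30:
        return "moderate"
    if a >= 0.20:
        return "weak"
    return "very weak"

print(correlation_strength(-0.43))  # "moderate", as in the shyness example
```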

1.3 Correlation fields and the purpose of their construction

Correlation is studied on the basis of experimental data, which are the measured values (xᵢ, yᵢ) of two features. If there are few experimental data, the two-dimensional empirical distribution is represented as a double series of the values xᵢ and yᵢ. In this case, the correlation between the features can be described in different ways; the correspondence between an argument and a function can be given by a table, a formula, a graph, and so on.

Correlation analysis, like other statistical methods, is based on the use of probabilistic models describing the behavior of the studied features in a certain general population, from which the experimental values xᵢ and yᵢ are obtained. When the correlation between quantitative features whose values can be accurately measured in units of metric scales (meters, seconds, kilograms, etc.) is investigated, the model of a two-dimensional normally distributed general population is very often adopted. Such a model displays the relationship between the variables xᵢ and yᵢ graphically, as the locus of points in a rectangular coordinate system. This graphical dependence is also called a scatterplot, or correlation field.
This model of a two-dimensional normal distribution (correlation field) makes it possible to give a visual graphical interpretation of the correlation coefficient, because the joint distribution depends on five parameters: μx and μy, the mean values (mathematical expectations); σx and σy, the standard deviations of the random variables X and Y; and ρ, the correlation coefficient, which is a measure of the relationship between the random variables X and Y.
If ρ = 0, then the values xᵢ, yᵢ obtained from a two-dimensional normal population lie on a graph in x, y coordinates within an area bounded by a circle (Figure 5, a). In this case there is no correlation between the random variables X and Y, and they are called uncorrelated. For a two-dimensional normal distribution, uncorrelatedness simultaneously means independence of the random variables X and Y.
