Multivariate Statistical Studies. Multivariate Statistical Analysis: Essence and Types

Analysis of variance.

The purpose of analysis of variance (ANOVA) is to test the statistical significance of differences between means (for groups or variables). The test works by partitioning the total sum of squares into components, that is, by splitting the overall variance into parts, one of which is due to random error (within-group variability) and the other of which is associated with differences between the mean values. The latter variance component is then used to assess the statistical significance of the difference between the means. If the difference is significant, the null hypothesis is rejected and the alternative hypothesis, that the means differ, is accepted.

Partitioning the sum of squares. For a sample of size N, the sample variance is computed as the sum of squared deviations from the sample mean divided by N - 1 (the sample size minus one). Thus, for a fixed sample size N, the variance is a function of the sum of squared deviations. ANOVA rests on partitioning this variance into parts or components: the sample is divided into groups, and the mean and the sum of squared deviations are computed within each group. Computing the same quantities for the sample as a whole yields a larger value of variance, and the excess is explained by the discrepancy between the group means. ANOVA thus isolates the within-group variability, which cannot be separated out when the whole sample is studied as a single group.

Significance testing in ANOVA is based on comparing the variance component due to between-group scatter with the component due to within-group scatter (called the mean squared error). If the null hypothesis is true (the means in the two populations are equal), we can expect only a relatively small difference between the sample means, due purely to random variability. Under the null hypothesis, the within-group variance will practically coincide with the overall variance computed without regard to group membership. The resulting variances are compared with the F-test, which checks whether the ratio of the between-group variance to the within-group variance is significantly greater than 1.
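As a sketch of the partition described above, the following computes the between-group and within-group sums of squares for three groups and forms the F ratio, checking it against SciPy's `scipy.stats.f_oneway` (the data are invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of measurements (assumed for illustration)
groups = [np.array([5.1, 4.9, 5.6, 5.2]),
          np.array([6.0, 6.3, 5.8, 6.1]),
          np.array([7.2, 6.9, 7.5, 7.1])]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Partition the total sum of squares into between-group and within-group parts
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_data) - len(groups)

# F is the ratio of between-group to within-group mean squares
F = (ss_between / df_between) / (ss_within / df_within)

# The same statistic computed by SciPy
F_scipy, p = stats.f_oneway(*groups)
```

If the group means truly differ, F comes out well above 1 and the p-value is small.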

Advantages: (1) for small samples, ANOVA is considerably more efficient and therefore more informative; (2) ANOVA can detect interaction effects between factors and therefore allows testing more complex hypotheses.

The principal component method is a linear dimension-reduction technique: it finds pairwise orthogonal directions of maximal variation in the source data and then projects the data onto the lower-dimensional space spanned by the components with the greatest variation.

The principal component method is related to factor analysis, whose core idea is that two correlated variables can be combined into a single factor. If the two-variable example is extended to a larger number of variables, the calculations become more involved, but the basic principle of representing two or more dependent variables by a single factor remains in force.

When reducing the number of variables, the decision on when to stop the extraction procedure depends mainly on what one is willing to regard as small, "random" variability. With repeated iterations, successive factors account for less and less variance.
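A minimal sketch of the principal component method on synthetic two-variable data (the variables and coefficients below are assumptions for illustration): the covariance matrix is eigendecomposed, the directions are ordered by variance, and the data are projected onto the leading component.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated variables (synthetic data, for illustration only)
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=200)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Eigenvectors of the covariance matrix give orthogonal directions of maximal variance
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first principal component (dimension reduction 2 -> 1)
scores = centered @ eigvecs[:, :1]

# Fraction of total variance carried by the first component
explained = eigvals[0] / eigvals.sum()
```

Because the second variable is almost a linear function of the first, the leading component absorbs nearly all of the variance, which is exactly the situation in which one factor can replace two variables.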

Centroid method for determining factors.

The centroid method is used in cluster analysis. In the unweighted centroid method, the distance between two clusters is defined as the distance between their centers of gravity.

The weighted centroid method (median method) is identical to the unweighted one, except that the calculations use weights to account for differences in cluster sizes (i.e., the number of objects in each). Therefore, when there are (or are suspected to be) substantial differences in cluster sizes, this method is preferable to the previous one.
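Both centroid variants are available in SciPy's hierarchical clustering routines, where the unweighted centroid method is `method="centroid"` and the weighted (median) method is `method="median"`. A sketch on synthetic clusters of deliberately unequal size:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two synthetic clusters of unequal size (assumed data for illustration)
pts = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                 rng.normal(5.0, 0.3, size=(5, 2))])

# Unweighted centroid linkage: inter-cluster distance = distance between centroids
Z_centroid = linkage(pts, method="centroid")
# Weighted centroid (median) linkage: preferable when cluster sizes differ strongly
Z_median = linkage(pts, method="median")

# Cut the hierarchy into two flat clusters
labels = fcluster(Z_centroid, t=2, criterion="maxclust")
```

With well-separated groups both variants recover the same two clusters; the methods diverge mainly when clusters of very different sizes merge.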

Cluster analysis.

The term cluster analysis actually covers a set of different classification algorithms. A general question asked by researchers in many fields is how to organize observed data into meaningful structures, that is, how to identify clusters of similar objects. Cluster analysis is thus not so much an ordinary statistical method as a collection of different algorithms for distributing objects into clusters. One point of view holds that, in contrast to many other statistical procedures, cluster analysis methods are used mostly when there are no a priori hypotheses about the classes, while the study is still at the descriptive stage. It should be understood that cluster analysis yields the "most plausibly meaningful solution".

Tree clustering algorithm. The purpose of this algorithm is to join objects into successively larger clusters using some measure of similarity or distance between objects. A typical result of such clustering is a hierarchical tree, represented as a diagram. The diagram begins with each object in its own class (on the left side of the chart). Now imagine that, in very small steps, you gradually relax your criterion for which objects are distinct and which are not. In other words, you lower the threshold for deciding to join two or more objects into one cluster. As a result, you link together more and more objects and aggregate (merge) more and more clusters consisting of increasingly dissimilar elements. Finally, at the last step, all objects are joined together. In these diagrams, the horizontal axis represents the joining distance (in vertical tree diagrams, the vertical axis represents the joining distance). So for each node in the graph (where a new cluster is formed) you can read off the distance at which the corresponding elements were linked into a new single cluster. When the data have a clear "structure" in terms of clusters of mutually similar objects, this structure is likely to be reflected in the hierarchical tree as distinct branches. As the result of a successful analysis by the joining method, one can detect these clusters (branches) and interpret them.
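The gradual relaxation of the joining threshold can be seen directly in the linkage matrix: each row records one merge and the distance at which it happened. A sketch with two well-separated synthetic clusters (data invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
# Two compact synthetic clusters
pts = np.vstack([rng.normal(0.0, 0.2, size=(10, 2)),
                 rng.normal(3.0, 0.2, size=(10, 2))])

# Each row of Z records one merge: the two clusters joined, the joining
# distance, and the size of the newly formed cluster
Z = linkage(pdist(pts), method="complete")

join_distances = Z[:, 2]
```

As the threshold is relaxed, merges happen at non-decreasing distances (complete linkage is monotone), and the final merge joins all objects into a single cluster; the large jump in distance before the last merge is what marks the two-branch structure on a dendrogram.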

Discriminant analysis is used to decide which variables distinguish (discriminate between) two or more naturally occurring populations (groups). The most common use of discriminant analysis is to include many variables in a study in order to determine which of them separate the populations best. In other words, you want to build a "model" that best predicts to which population a given observation belongs. In what follows, the term "in the model" denotes the variables used in predicting population membership; variables not used are said to be "outside the model".

In stepwise discriminant function analysis, the discrimination model is built step by step. More precisely, at every step all variables are reviewed and the one that contributes most to the separation of the populations is found. That variable is included in the model at this step, and the procedure moves on to the next step.

You can also move in the opposite direction: in that case all variables are first included in the model, and then at each step the variables that contribute little to the predictions are eliminated. Then, as the result of a successful analysis, only the "important" variables are retained in the model, that is, the variables whose contribution to discrimination is greater than that of the rest.

This stepwise procedure is "guided by" an F-to-enter value and an F-to-remove value. The F statistic for a variable indicates its statistical significance in discriminating between the populations, that is, it is a measure of the variable's contribution to predicting population membership.
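The first forward step can be sketched as follows: each candidate variable is scored by its one-way F statistic (its F-to-enter), and the variable with the largest value is admitted if it exceeds a threshold. This is a simplification (later steps would use partial F values given the variables already in the model), and the data, threshold, and helper name are assumptions for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
# Three variables; only the first actually separates the groups (synthetic)
g1 = rng.normal([0.0, 0.0, 0.0], 1.0, size=(40, 3))
g2 = rng.normal([2.0, 0.0, 0.0], 1.0, size=(40, 3))

def f_to_enter(v1, v2):
    """One-way F for a single variable: its contribution to discrimination."""
    F, _ = stats.f_oneway(v1, v2)
    return F

F_THRESHOLD = 4.0   # conventional cut-off, an assumption here

# Forward step: score every variable outside the model, pick the largest F
fs = [f_to_enter(g1[:, j], g2[:, j]) for j in range(3)]
best = int(np.argmax(fs))
```

The variable with a genuine group difference wins the step with an F far above the threshold, while the noise variables remain outside the model.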

For two groups, discriminant analysis can also be viewed as a multiple regression procedure. If you code the two groups as 1 and 2 and use that coding as the dependent variable in a multiple regression, you obtain results analogous to those produced by discriminant analysis. In general, in the case of two populations, you fit a linear equation of the following type:

Group = a + b1*x1 + b2*x2 + ... + bm*xm

where a is a constant and b1, ..., bm are regression coefficients. The interpretation of the results for a two-population problem closely follows the logic of multiple regression: the variables with the largest regression coefficients contribute most to discrimination.

If there are more than two groups, more than one discriminant function can be estimated, similarly to the above. For example, with three populations you can estimate (1) a function for discriminating between population 1 and populations 2 and 3 taken together, and (2) another function for discriminating between population 2 and population 3. For instance, you could have one function discriminating between high-school graduates who go to college and those who do not (but want to get a job or go to vocational school), and a second function discriminating between graduates who want to get a job and those who want to go to vocational school. The coefficients b in these discriminant functions are interpreted in the same way as before.
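The claimed equivalence for two groups can be checked numerically: regressing the 1/2 group code on the predictors yields a coefficient vector proportional to Fisher's discriminant direction. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two synthetic groups, two predictor variables (illustrative data)
g1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
g2 = rng.normal([2.0, 1.0], 1.0, size=(50, 2))
X = np.vstack([g1, g2])
y = np.array([1.0] * 50 + [2.0] * 50)        # group codes 1 and 2

# Multiple regression: Group = a + b1*x1 + b2*x2
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b = coef[1:]

# Fisher's discriminant direction: pooled within-group scatter applied
# to the difference of group means
Sw = (np.cov(g1, rowvar=False) * (len(g1) - 1)
      + np.cov(g2, rowvar=False) * (len(g2) - 1))
w = np.linalg.solve(Sw, g2.mean(axis=0) - g1.mean(axis=0))

# Cosine between the two coefficient vectors: they point the same way
cos = b @ w / (np.linalg.norm(b) * np.linalg.norm(w))
```

The two vectors are exactly proportional, so the regression coefficients carry the same discriminative information as the discriminant function weights.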

Canonical correlation.

Canonical analysis is designed to analyze dependencies between lists of variables. More precisely, it allows you to investigate the relationship between two sets of variables. When the canonical roots are computed, the eigenvalues of the correlation matrix are calculated. These values equal the proportion of variance accounted for by the correlation between the corresponding canonical variates. The proportion is computed relative to the variance of the canonical variates, i.e., of the weighted sums over the two sets of variables; thus the eigenvalues do not show the absolute amount of variance explained in the respective canonical variates.

Taking the square root of the obtained eigenvalues gives a set of numbers that can be interpreted as correlation coefficients. Since they refer to the canonical variates, they are called canonical correlations. Like the eigenvalues, the correlations between the canonical variates extracted at successive steps decrease. However, canonical variates beyond the first can also be significantly correlated, and these correlations often admit a quite meaningful interpretation.

The significance test for canonical correlations is relatively straightforward. First, the canonical correlations are evaluated one after another in decreasing order. Only the roots that prove statistically significant are retained for subsequent analysis. In fact the calculations proceed slightly differently: the program first evaluates the significance of the whole set of roots, then the significance of the set remaining after removal of the first root, then of the set remaining after removal of the second root, and so on.

Studies have shown that this test detects large canonical correlations even with small sample sizes (for example, N = 50). Weak canonical correlations (for example, R = .3) require large samples (N > 200) to be detected in 50% of cases. Note that small canonical correlations usually have no practical value, since they correspond to little real variability in the source data.

Canonical weights. After the number of significant canonical roots is determined, the question arises of how to interpret each (significant) root. Recall that each root actually represents two weighted sums, one for each set of variables. One way to interpret the "meaning" of each canonical root is to examine the weights associated with each set of variables. These weights are also called canonical weights.

In the analysis one usually relies on the rule that the larger the weight (i.e., its absolute value), the greater the contribution of the corresponding variable to the value of the canonical variate.

If you are familiar with multiple regression, you may apply to the canonical weights the interpretation used for beta weights in the multiple regression equation. Canonical weights are, in a sense, analogous to the partial correlations of the variables corresponding to the canonical root. Thus, examining the canonical weights makes it possible to understand the "meaning" of each canonical root, i.e., to see how the specific variables in each set affect the weighted sum (the canonical variate).
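A sketch of the computation behind canonical correlations, on synthetic data sharing one latent source: each set is whitened by its own covariance, and the singular values of the whitened cross-covariance are the canonical correlations. This is one standard formulation, stated here as an illustration; the variable names and data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
# Two sets of variables sharing one latent source z (synthetic data)
z = rng.normal(size=n)
X = np.column_stack([z + 0.5 * rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([z + 0.5 * rng.normal(size=n), rng.normal(size=n)])

Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)

Sxx = Xc.T @ Xc / n
Syy = Yc.T @ Yc / n
Sxy = Xc.T @ Yc / n

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# Whiten each set, then take the SVD of the cross-covariance:
# the singular values are the canonical correlations, in decreasing order
M = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
canon_corrs = np.linalg.svd(M, compute_uv=False)
```

The first canonical correlation picks up the shared latent source; the second, linking only the pure-noise variables, is near zero, matching the decreasing pattern of roots described above.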

Parametric and non-parametric methods for evaluating results.

Parametric methods are based on the sampling distribution of particular statistics. Briefly: if you know the distribution of the observed variable, you can predict how the statistic in question will "behave" in repeated samples of equal size, that is, how it will be distributed.

In practice, the use of parametric methods is limited by the size of the sample available for analysis and by problems with accurately measuring the attributes of the observed objects.

Hence the need for procedures that can handle "low-quality" data from small samples, with variables about whose distribution little or nothing is known. Nonparametric methods are designed precisely for those situations, which arise quite often in practice, when the researcher knows nothing about the parameters of the population under study (hence the name of the methods: nonparametric). In more technical terms, nonparametric methods do not rely on the estimation of parameters (such as the mean or standard deviation) in describing the sampling distribution of the statistic of interest. Therefore these methods are also sometimes called parameter-free or distribution-free.

Essentially, for each parametric test there is at least one nonparametric analogue. These tests fall into one of the following groups:

tests of differences between groups (independent samples);

tests of differences between groups (dependent samples);

tests of relationships between variables.

Differences between independent groups. Usually, when there are two samples (for example, men and women) to be compared with respect to the mean of some variable of interest, the t-test for independent samples is used. Nonparametric alternatives to this test are the Wald-Wolfowitz runs test, the Mann-Whitney U test, and the Kolmogorov-Smirnov two-sample test. With several groups, analysis of variance can be used; its nonparametric analogues are the Kruskal-Wallis rank analysis of variance and the median test.
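These pairings can be illustrated with SciPy, using synthetic samples with shifted locations (the data are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Two independent samples whose locations genuinely differ (synthetic)
a = rng.normal(0.0, 1.0, size=40)
b = rng.normal(1.5, 1.0, size=40)

# Parametric test and its nonparametric alternative
t_stat, t_p = stats.ttest_ind(a, b)
u_stat, u_p = stats.mannwhitneyu(a, b, alternative="two-sided")

# Three or more groups: Kruskal-Wallis as the nonparametric analogue of ANOVA
c = rng.normal(3.0, 1.0, size=40)
h_stat, h_p = stats.kruskal(a, b, c)
```

With a real location shift, both the parametric and the rank-based tests reject the null hypothesis; the rank tests simply do so without assuming normality.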

Differences between dependent groups. To compare two variables measured on the same sample (for example, students' mathematics performance at the beginning and at the end of the semester), the t-test for dependent samples is commonly used. Alternative nonparametric tests are the sign test and the Wilcoxon matched-pairs test. If the variables in question are categorical by nature or are categorized (i.e., represented as frequencies in certain categories), McNemar's chi-square test is appropriate. If more than two variables from the same sample are considered, repeated-measures analysis of variance (ANOVA) is normally used. Its nonparametric alternatives are Friedman's rank analysis of variance and Cochran's Q test (the latter is used, for example, when the variable is measured on a nominal scale). Cochran's Q test is also used to evaluate changes in frequencies (proportions).

Relationships between variables. To estimate the dependence (association) between two variables, one usually computes a correlation coefficient. Nonparametric analogues of the standard Pearson correlation coefficient are Spearman's R, Kendall's tau, and the gamma coefficient. If the two variables in question are categorical by nature, suitable nonparametric tests of association are chi-square, the phi coefficient, and Fisher's exact test. In addition, for dependence among several variables, the so-called Kendall coefficient of concordance is available. This test is often used to assess the agreement of independent experts (judges), in particular the scores assigned to the same subjects.
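For a monotone but nonlinear relationship, the rank-based coefficients capture the dependence more fully than Pearson's r. A sketch on synthetic data (the functional form is an assumption for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(size=100)
# Monotone but strongly nonlinear relationship, plus a little noise
y = np.exp(x) + 0.05 * rng.normal(size=100)

r, _ = stats.pearsonr(x, y)        # measures linear association only
rho, _ = stats.spearmanr(x, y)     # rank correlation (Spearman's R)
tau, _ = stats.kendalltau(x, y)    # rank correlation (Kendall's tau)
```

Pearson's r is attenuated by the curvature, while Spearman's rho is close to 1 because the ranks of x and y almost coincide.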

If the data are not normally distributed and the measurements at best carry rank information, then computing the usual descriptive statistics (for example, the mean and standard deviation) is not very informative. For example, it is well known in psychometrics that the perceived intensity of stimuli (for example, perceived brightness) is a logarithmic function of the actual intensity (brightness measured in objective units, such as lux). In this example, the usual estimate of the mean (the sum of the values divided by the number of stimuli) does not give a correct idea of the average actual stimulus intensity. (In this example one should rather compute the geometric mean.) Nonparametric statistics offers a varied set of measures of location (mean, median, mode, etc.) and scatter (variance, harmonic mean, quartiles, etc.), allowing a more complete picture of the data.
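The brightness example can be made concrete: for stimuli spanning several orders of magnitude, the arithmetic mean is dominated by the largest value, while the geometric mean gives the "typical" stimulus. The intensity values below are invented for illustration.

```python
import numpy as np

# Hypothetical stimulus intensities in objective units; perceived intensity
# is roughly logarithmic in these values
intensities = np.array([10.0, 100.0, 1000.0])

arithmetic_mean = intensities.mean()                  # pulled toward 1000
geometric_mean = np.exp(np.log(intensities).mean())   # the "middle" stimulus
```

Here the arithmetic mean is 370, although two of the three stimuli lie far below it; the geometric mean is 100, which matches the middle of the logarithmic (perceptual) scale.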

The introduction of personal computers into the management of the national economy entails a transition from traditional methods of enterprise analysis to more advanced economic management models that can reveal its underlying processes.

The wide use of the methods of mathematical statistics in economic research makes it possible to deepen economic analysis and to improve the quality of information in planning, in forecasting production indicators, and in analyzing their effectiveness.

The complexity and diversity of the relationships among economic indicators make the attributes multidimensional and therefore call for a sophisticated mathematical apparatus: the methods of multivariate statistical analysis.

The concept of "multidimensional statistical analysis" implies a number of methods designed to explore the combination of interrelated signs. We are talking about dismemberment (splitting) of the combination under consideration, which is represented by multidimensional signs on a relatively small amount.

The transition from a large number of attributes to a smaller one pursues the goal of reducing their dimensionality while increasing their informational capacity. This goal is achieved by identifying information that is repeated across interrelated attributes and by establishing the possibility of aggregating (combining, summing) some of them. The latter amounts to transforming the original model into a model with a smaller number of factor attributes.

Multivariate statistical analysis makes it possible to identify objectively existing, but not clearly expressed, patterns that manifest themselves in particular socio-economic phenomena. One encounters this in solving a number of practical problems in economics. In particular, it arises when the values of several quantitative characteristics (features) of the object under study must be recorded simultaneously, and each characteristic is prone to uncontrolled variation (across objects), despite the homogeneity of the observed objects.

For example, when studying enterprises that are homogeneous (in natural economic conditions and in type of specialization) with respect to a number of performance indicators, we find that in passing from one object to another almost every selected characteristic takes a different value, that is, it exhibits what might be called an uncontrolled (random) scatter. As a rule, such "random" variation of the attributes follows certain (natural) tendencies, both in terms of the fairly definite magnitudes around which the variation occurs and in terms of the degree of the variation and its interdependence.

The above leads to the definition of a multidimensional random variable as a set of quantitative attributes, the value of each of which is subject to uncontrolled scatter over repetitions of a given statistical observation, experience, experiment, and so on.

It was said earlier that multivariate analysis combines a number of methods; let us name them: factor analysis, the principal component method, cluster analysis, pattern recognition, discriminant analysis, etc. The first three of these methods are considered in the following sections.

Like other mathematical-statistical methods, multivariate analysis is effective in application provided that the initial information is of high quality and that the observation data are ample and are processed by computer.

Basic concepts of the factor analysis method and the essence of the problems it solves

In analyzing (and likewise in studying) social and economic phenomena, one often encounters cases where, among the multitude (multi-parameter nature) of observed attributes, it is necessary to exclude a proportion of the parameters or to replace them with a smaller number of certain functions, without damaging the integrity (completeness) of the information. The solution of such a problem makes sense within a certain model and is governed by its structure. An example of such a model, well suited to many real situations, is the factor analysis model, whose methods make it possible to concentrate the attributes (the information about them) by "condensing" a large number of them into a smaller, informationally more capacious set. The resulting "condensate" of information must then be represented by the most significant and determining quantitative characteristics.

The concept of "factor analysis" does not need to be mixed with the widespread concept of analysis of causal relations, when the influence of various factors (their combinations, combinations) is being studied to the productive basis.

The essence of the factor analysis method is to exclude from the description the multitude of characteristics of the object under study and to replace them with a smaller number of informationally more capacious variables, which are called factors and which reflect the most significant properties of the phenomena. Such variables are certain functions of the source attributes.

According to Ya. Okun 9, the analysis makes it possible to obtain a first approximate characterization of the patterns underlying the phenomenon and to formulate first, general conclusions about the directions in which further research should proceed. He further indicates the main assumption of factor analysis: that a phenomenon, despite its heterogeneity and variability, can be described by a small number of functional units, parameters or factors. These are called by various names: influences, causes, parameters, functional units, abilities, basic or independent indicators. The use of one term or another is determined by

9 Okun Ya. Factor Analysis. Translated from Polish. Moscow: Statistika, 1974, p. 16.

the context of the factor and by knowledge of the essence of the phenomenon under study.

The stages of factor analysis are successive comparisons of various sets of factors and of grouping options, with their inclusion and exclusion, and with evaluation of the reliability of the differences between groups.

V.M. Zhukovskaya and I.B. Muchnik 10, speaking of the essence of the problems of factor analysis, argue that it does not require an a priori division of variables into dependent and independent, since all variables are treated as equal.

The task of factor analysis reduces to determining the concept, number and nature of the most significant and relatively independent functional characteristics of the phenomenon, its meters or basic parameters, that is, the factors. According to the authors, an important distinctive feature of factor analysis is that it allows the simultaneous study of a large number of interrelated variables without the assumption that "all other conditions are unchanged", which is so necessary when using a number of other methods of analysis. This is the great advantage of factor analysis as a valuable tool for studying a phenomenon in all its complex diversity and interweaving of connections.

The analysis relies mainly on observing the natural variation of the variables.

1. When factor analysis is used, the set of variables studied from the point of view of the links between them is not chosen arbitrarily: the method makes it possible to identify the main factors that have a significant impact in the given area.

2. The analysis does not require preliminary hypotheses; on the contrary, it can itself serve as a source of hypotheses, as well as act as a criterion for hypotheses based on data obtained by other methods.

3. The analysis does not require a priori guesses as to which variables are independent and which are dependent; it does not hypertrophy causal relations and settles the question of their extent in the course of further research.

The list of specific problems solved with the help of factor analysis methods is as follows (according to V.M. Zhukovskaya). Let us name the main ones in the field of socio-economic research:

10 Zhukovskaya V.M., Muchnik I.B. Factor Analysis in Socio-Economic Research. Moscow: Statistika, 1976, p. 4.

1. Determination of the main aspects of the differences between observed objects (minimization of the description).

2. Formulation of hypotheses about the nature of the differences between objects.

3. Detection of the structure of the relationships between attributes.

4. Testing hypotheses about the relationship and interchangeability of attributes.

5. Comparison of the structures of attribute sets.

6. Partitioning of the observed objects by typical features.

The above indicates the great possibilities of factor analysis in the study of social phenomena, where, as a rule, it is impossible to control (experimentally) the influence of individual factors.

The results of factor analysis can also be used quite effectively in multiple regression models.

Given a pre-built correlation-regression model of the phenomenon under study in the form of a set of correlated attributes, factor analysis can be used to transform such a set of attributes into a significantly smaller number by aggregation. It should be noted that such a transformation in no way degrades the completeness of the information about the phenomenon in question. The aggregated attributes so created are uncorrelated and represent linear combinations of the primary attributes. From the formal mathematical side, the problem posed in this case may have infinitely many solutions. But it must be remembered that when studying socio-economic phenomena, the resulting aggregated attributes must have an economically justified interpretation. In other words, whenever the mathematical apparatus is applied, knowledge of the economic essence of the phenomena under study comes first.
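A sketch of this aggregation step on synthetic data: four correlated attributes driven by two latent factors are replaced by two uncorrelated component scores (here obtained via principal components, one common way to build such aggregates), which then enter a multiple regression. The data-generating model is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
# Four correlated attributes driven by two latent factors (synthetic)
f1, f2 = rng.normal(size=(2, n))
X = np.column_stack([f1 + 0.3 * rng.normal(size=n),
                     f1 + 0.3 * rng.normal(size=n),
                     f2 + 0.3 * rng.normal(size=n),
                     f2 + 0.3 * rng.normal(size=n)])
target = 2 * f1 - f2 + 0.2 * rng.normal(size=n)

# Aggregate the four correlated attributes into two uncorrelated scores
Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
top2 = np.argsort(vals)[::-1][:2]
scores = Xc @ vecs[:, top2]

# The aggregated attributes are (empirically) uncorrelated
corr = np.corrcoef(scores, rowvar=False)

# Multiple regression on the two aggregated attributes instead of four
A = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
pred = A @ coef
r2 = 1 - ((target - pred) ** 2).sum() / ((target - target.mean()) ** 2).sum()
```

The regression on two aggregated attributes explains nearly as much variance as the full set of four would, which is the sense in which the transformation loses almost no information.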

Thus, the above allows us to conclude that factor analysis is a specific research method carried out with the arsenal of the methods of mathematical statistics.

Factor analysis first found practical application in psychology. The possibility of reducing a large number of psychological tests to a small number of factors made it possible to explain the abilities of human intelligence.

In the study of socio-economic phenomena, where there are difficulties in isolating the influence of individual variables, factor analysis can be used successfully. Its techniques allow, after certain calculations, insignificant attributes to be "filtered out" and the research to be continued in the direction of deepening it.

The effectiveness of this method is evident in the study of such questions (problems) as, in economics: the specialization and concentration of production, the intensity of economic management, the budgets of workers' families, the construction of various generalizing indicators, etc.

From the author's preface
Chapter 1. Introduction
1.1. The multivariate normal distribution as a model
1.2. A general overview of multivariate methods
Literature
Chapter 2. The multivariate normal distribution
2.1. Introduction
2.2. Concepts associated with multivariate distributions
2.3. The multivariate normal distribution
2.4. The distribution of a linear combination of normally distributed quantities; independence of quantities; marginal distributions
2.5. Conditional distributions and the multiple correlation coefficient
2.6. The characteristic function; moments
Literature
Tasks
Chapter 3. Estimation of the mean vector and the covariance matrix
3.1. Introduction
3.2. Maximum likelihood estimates of the mean vector and the covariance matrix
3.3. The distribution of the sample mean vector; inference about the mean when the covariance matrix is known
Literature
Tasks
Chapter 4. The distribution and use of sample correlation coefficients
4.1. Introduction
4.2. The correlation coefficient of a bivariate sample
4.3. Partial correlation coefficients
4.4. The multiple correlation coefficient
Literature
Tasks
Chapter 5. The generalized T2 statistic
5.1. Introduction
5.2. The generalized T2 statistic and its distribution
5.3. Applications of the T2 statistic
5.4. The distribution of the T2 statistic under competing hypotheses; the power function
5.5. Some optimal properties of the T2 criterion
5.6. The multivariate Behrens-Fisher problem
Literature
Tasks
Chapter 6. Classification of observations
6.1. Classification problem
6.2. Principles of proper classification
6.3. Methods of classifying observations in the case of two populations with known probability distributions
6.4. Classification of observations in the case of two populations with known multivariate normal distributions
6.5. Classification of observations in the case of two multivariate normal populations whose parameters are estimated from samples
6.6. Classification of observations in the case of several populations
6.7. Classification of observations in the case of several multivariate normal populations
6.8. An example of classification in the case of several multivariate normal populations
Literature
Tasks
Chapter 7. The distribution of the sample covariance matrix and the sample generalized variance
7.1. Introduction
7.2. The Wishart distribution
7.3. Some properties of the Wishart distribution
7.4. Cochran's theorem
7.5. The generalized variance
7.6. The distribution of the set of correlation coefficients in the case of a diagonal population covariance matrix
Literature
Tasks
Chapter 8. Testing general linear hypotheses. Analysis of variance
8.1. Introduction
8.2. Estimation of the parameters of multivariate linear regression
8.3. Likelihood ratio criteria for testing linear hypotheses about regression coefficients
8.4. The moments of the likelihood ratio when the null hypothesis is true
8.5. Some distributions of the quantity U
8.6. An asymptotic expansion of the distribution of the likelihood ratio
8.7. Testing hypotheses about matrices of regression coefficients; confidence regions
8.8. Testing the hypothesis of equality of the means of normal distributions with a common covariance matrix
8.9. Generalized analysis of variance
8.10. Other criteria for testing the linear hypothesis
8.11. The canonical form
Literature
Tasks
Chapter 9. Testing the hypothesis of independence of sets of random variables
9.1. Introduction
9.2. The likelihood ratio as a criterion for testing the hypothesis of independence of sets of random variables
9.3. The moments of the likelihood ratio when the null hypothesis is true
9.4. Some distributions of the likelihood ratio
9.5. An asymptotic expansion of the distribution of H (the likelihood ratio)
9.6. Example
9.7. The case of two sets of random variables
Literature
Tasks
Chapter 10. Testing hypotheses of equality of covariance matrices and of simultaneous equality of mean vectors and covariance matrices
10.1. Introduction
10.2. Criteria for testing hypotheses of equality of several covariance matrices
10.3. Criteria for testing the hypothesis of equivalence of several normal populations
10.4. The moments of the likelihood ratio
10.5. Asymptotic expansions of the distribution functions of V1 and V
10.6. The case of two populations
10.7. Testing the hypothesis that a covariance matrix is proportional to a given matrix. The sphericity criterion
10.8. Testing the hypothesis that a covariance matrix is equal to a given matrix
10.9. Testing the hypothesis that the mean vector and the covariance matrix are respectively equal to a given vector and a given matrix
Literature
Tasks
Chapter 11. Principal Components
11.1. Introduction
11.2. Definition of the principal components of a population
11.3. Maximum likelihood estimates of the principal components and their variances
11.4. Computing the maximum likelihood estimates of the principal components
11.5. Example
Literature
Tasks
Chapter 12. Canonical Correlations and Canonical Variables
12.1. Introduction
12.2. Canonical correlations and canonical variables of a population
12.3. Estimation of canonical correlations and canonical variables
12.4. Method of computation
12.5. Example
Literature
Tasks
Chapter 13. The distribution of certain characteristic roots and vectors that do not depend on parameters
13.1. Introduction
13.2. The case of two Wishart matrices
13.3. The case of one nondegenerate Wishart matrix
13.4. Canonical correlations
Literature
Tasks
Chapter 14. A review of some other work in multidimensional analysis
14.1. Introduction
14.2. Testing hypotheses about rank and estimating linear restrictions on regression coefficients. Canonical correlations and canonical variables
14.3. The noncentral Wishart distribution
14.4. The distribution of certain characteristic roots and vectors that depend on parameters
14.5. The asymptotic distribution of certain characteristic roots and vectors
14.6. Principal components
14.7. Factor analysis
14.8. Stochastic equations
14.9. Time series analysis
Literature
Appendix. Matrix Theory
1. Definition of matrices. Operations on matrices
2. Characteristic roots and vectors
3. Partitioning vectors and matrices into blocks
4. Some miscellaneous results
5. The Doolittle method and the method of pivotal condensation for solving systems of linear equations
Literature
Subject index

The basic concepts and methods of statistical analysis of multidimensional results of technical experiments are presented. <...> Theoretical information is given on the properties of multidimensional Gaussian distributions. <...> The result of an experiment considered in the manual is a random vector distributed according to the normal law. <...> The notation ξ ~ N(μ, D) means that the p-dimensional vector ξ = (ξ1, ..., ξp)' has a normal distribution with mean vector μ = Eξ and covariance matrix D = E(ξ - μ)(ξ - μ)', where E denotes mathematical expectation. <...> The matrix D in (1.2) is symmetric and positive definite, so the representation D = CΛC' holds, where C is an orthogonal matrix composed of the eigenvectors of D, and Λ is a diagonal matrix with the eigenvalues λi > 0 of D on the main diagonal. The joint density of the components of a linear transformation η = B'ξ, where B is a square matrix, is determined by the usual rules (see the Appendix). <...> The main task of primary statistical processing of a sample ξ1, ..., ξN from N(μ, D) is the estimation of the mean vector μ and the covariance matrix, taking into account the rules for differentiating functionals with respect to vector or matrix arguments (see the Appendix). <...> The maximum likelihood estimates of the correlation coefficients ρij = σij/√(σii σjj) have the form rij = σ̂ij/√(σ̂ii σ̂jj). <...>
Evaluation of the relationship between the components of a normal vector. A detailed analysis of these ties <...>


UDC 519.2 BBK 22.172 K27 Reviewer V.Yu. Chuev. Kartashov G.D., Timonin V.I., Budovskaya L.M. K27 Multidimensional Statistical Analysis: Guidelines for a term paper. - M.: Publishing House of N.E. Bauman MSTU, 2007. - 48 p.: il. The basic concepts and methods of statistical analysis of multidimensional results of technical experiments are presented. Theoretical information is given on the properties of multidimensional Gaussian distributions. For senior students of the Faculty of Fundamental Sciences. Il. 2. Bibliogr.: 5 titles. UDC 519.2 BBK 22.172 © N.E. Bauman MSTU, 2007.


Table of Contents: Introduction; 1. Multidimensional normal distribution; 2. Statistical inferences about the mean vector; 3. Discriminant analysis; 4. The principal component method; 5. Canonical correlations; 6. Multidimensional regression analysis; 7. Factor analysis; Appendix; List of references.

Introduction

Chapter 1. Multiple Regression Analysis

Chapter 2. Cluster Analysis

Chapter 3. Factor Analysis

Chapter 4. Discriminant Analysis

Bibliography

Introduction

Source information in socio-economic research most often comes as a set of objects, each of which is characterized by a number of features (indicators). Since the number of such objects and features can reach tens and hundreds, and visual analysis of these data is ineffective, there arises the task of compressing and concentrating the source data and of revealing their structure and interrelations by constructing generalized characteristics of the set of features and the set of objects. Such tasks can be solved by the methods of multidimensional statistical analysis.

Multidimensional statistical analysis is the branch of statistics devoted to mathematical methods for identifying the nature and structure of the relationships between the components of the multidimensional feature under study and for obtaining scientific and practical conclusions.

Its focus is on mathematical methods for constructing optimal plans for the collection, systematization and processing of data, aimed at revealing the nature and structure of those relationships and at drawing scientific and practical conclusions.

The initial array of data for multidimensional analysis is usually the results of measuring the components of a multidimensional feature for each object of the studied population, i.e. a sequence of multidimensional observations. The multidimensional feature is most often interpreted as a multidimensional random variable, and the sequence of observations as a sample from the general population. In this case, the choice of method for processing the source statistical data is made on the basis of certain assumptions about the nature of the distribution law of the multidimensional feature under study.

1. Multidimensional statistical analysis of multidimensional distributions and their main characteristics covers situations where the processed observations are of a probabilistic nature, i.e. are interpreted as a sample from the corresponding general population. The main tasks of this subsection include: statistical estimation of the studied multidimensional distributions and their main parameters; study of the properties of the statistical estimates used; and study of the probability distributions of a number of statistics with which statistical criteria for testing various hypotheses about the probabilistic nature of the analyzed multidimensional data are constructed.

2. Multidimensional statistical analysis of the nature and structure of the interrelations between the components of the studied multidimensional feature combines the concepts and results inherent in such methods and models as regression analysis, analysis of variance, analysis of covariance, factor analysis, etc. The methods belonging to this group include both algorithms based on the assumption of the probabilistic nature of the data and methods that do not fit into the framework of any probabilistic model (the latter are more often referred to as methods of data analysis).

3. Multidimensional statistical analysis of the geometric structure of the studied set of multidimensional observations combines concepts and results peculiar to such models and methods as discriminant analysis, cluster analysis, and multidimensional scaling. Central to these models is the concept of distance, or a measure of proximity, between the analyzed elements as points of some space. Both objects (as points specified in the feature space) and features (as points specified in the object space) can be analyzed in this way.

The applied value of multidimensional statistical analysis consists mainly in solving the following three tasks:

· The task of statistical research of dependencies between the indicators under consideration;

· The task of classifying elements (objects or signs);

· The problem of reducing the dimension of the characteristic space under consideration and the selection of the most informative features.

Multiple regression analysis is designed to construct a model that allows estimates of the dependent variable to be obtained from the values of the independent variables.

Logistic regression solves the classification problem. It is a type of multiple regression whose purpose is to analyze the relationship between several independent variables and a dependent variable.

Factor analysis is designed to determine a relatively small number of hidden (latent) factors whose variability explains the variability of all the observed indicators; it is aimed at reducing the dimensionality of the problem under consideration.

Cluster and discriminant analysis are designed to divide sets of objects into classes, each of which should contain objects that are homogeneous or close in a certain sense. In cluster analysis it is not known in advance how many groups of objects there will be or what their sizes will be; discriminant analysis assigns objects to already existing classes.

Chapter 1. Multiple Regression Analysis

Task: Study of the housing market in Orel (the Soviet and Northern districts).

The table shows the prices of apartments in Orel and the various factors that determine them:

· total area;

· Kitchen area;

· living space;

· House type;

· Number of rooms. (Fig.1)

Fig. 1 Source Data

In the column "district" used notation:

3 - Soviet (elite, refers to central districts);

4 - Northern.

In the column "Type of House":

1 - brick;

0 - panel.

Requires:

1. Analyze the relationship of all factors with the indicator "price" and with each other. Select the factors most suitable for constructing a regression model;

2. Describe a dummy variable that assigns an apartment to the central or peripheral districts of the city;

3. Build a linear regression model for all factors, including the dummy variable. Explain the economic meaning of the parameters of the equation. Evaluate the quality of the model and the statistical significance of the equation and its parameters;

4. Rank the factors (except for the dummy variable) by the degree of influence on the indicator "price";

5. Build a linear regression model for the most influential factors, keeping the dummy variable in the equation. Evaluate the quality and statistical significance of the equation and its parameters;

6. Justify the feasibility or inappropriateness of including the dummy variable in the equations of items 3 and 5;

7. Construct interval estimates of the parameters of the equation with 95% probability;

8. Determine how much an apartment with a total area of 74.5 m² will cost in the elite and in the peripheral district.

Performance:

1. After analyzing the relationship of all factors with the indicator "price" and with each other, the factors most suitable for constructing a regression model were selected using the forward-selection ("Forward") method:

A) the total area;

C) Number of rooms.

Variables Entered/Removed (a)

Dependent variable: price

2. The variable x4 "district" is a dummy variable, since it takes 2 values: 3 - belonging to the central district ("Soviet"), 4 - to the peripheral district ("Northern").

3. We construct a linear regression model for all factors (including the dummy variable x4).

The resulting model:

Assessment of the quality of the model.

Standard error = 126.477

Durbin-Watson coefficient = 2.136

Checking the significance of the regression equation

The value of Fisher's F-criterion = 41.687

4. We construct a linear regression model with all factors (except for the dummy variable x4).

According to the degree of influence on the indicator "Price" distributed:

The most significant factor is the total area (F = 40.806)

The second most significant factor is the number of rooms (F = 29.313)

5. Variables Entered/Removed

Dependent variable: price

6. We construct a linear regression model for the most influential factors together with the dummy variable, which in our case is itself one of the influential factors.

The resulting model:

Y = 348.349 + 35.788 x1 - 217.075 x4 + 305.687 x7

Assessment of the quality of the model.

Determination coefficient R2 = 0.807

It shows the share of the variation of the resulting feature explained by the influence of the factors studied. Thus, about 81% of the variation of the dependent variable is accounted for in the model by the influence of the included factors.

Multiple correlation coefficient R \u003d 0.898

It shows the closeness of the relationship between the dependent variable y and all the factors included in the model.

Standard error = 126.477

Durbin-Watson coefficient = 2.136

Checking the significance of the regression equation

The value of Fisher's F-criterion = 41.687

The regression equation should be recognized as adequate, and the model considered significant.
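The quality measures quoted here (the determination coefficient, Fisher's F-criterion and the Durbin-Watson statistic) can be recomputed outside SPSS. The sketch below fits an ordinary least squares model on a small hypothetical data set (invented numbers standing in for total area x1, district code x4 and number of rooms x7; not the survey data of this chapter):

```python
import numpy as np

# Hypothetical mini-sample: total area, district code (3/4), number of rooms.
X = np.array([
    [38.0, 3, 1], [45.0, 4, 2], [60.0, 3, 2], [74.5, 4, 3],
    [52.0, 3, 2], [33.0, 4, 1], [80.0, 3, 3], [48.0, 4, 2],
])
y = np.array([1900.0, 1500.0, 2400.0, 2100.0, 2200.0, 1300.0, 2900.0, 1600.0])

A = np.column_stack([np.ones(len(y)), X])        # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # OLS parameter estimates

resid = y - A @ beta
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot                       # determination coefficient R^2

n, k = len(y), X.shape[1]
f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))   # overall F-criterion of the equation

dw = np.sum(np.diff(resid) ** 2) / ss_res        # Durbin-Watson statistic

print(round(r2, 3), round(f_stat, 3), round(dw, 3))
```

By construction R² lies in [0, 1] and the Durbin-Watson statistic in [0, 4]; values of the latter near 2, like the 2.136 above, indicate no marked autocorrelation of the residuals.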

The most significant factor is the number of rooms (F = 41.687)

The second most significant factor is the total area (F = 40.806)

The third most significant factor is area (F = 32.288)

7. The dummy variable x4 is a significant factor, so it is advisable to include it in the equation.

Interval estimates of the parameters of the equation show the forecasting results according to the regression model.

With a probability of 95%, the sales volume in the forecast month will be from 540.765 to 1080.147 million rubles.

8. Determination of the cost of an apartment in an elite area

For 1 room: y = 348.349 + 35.788 * 74.5 - 217.075 * 3 + 305.687 * 1

For 2 rooms: y = 348.349 + 35.788 * 74.5 - 217.075 * 3 + 305.687 * 2

For 3 rooms: y = 348.349 + 35.788 * 74.5 - 217.075 * 3 + 305.687 * 3

In the peripheral district:

For 1 room: y = 348.349 + 35.788 * 74.5 - 217.075 * 4 + 305.687 * 1

For 2 rooms: y = 348.349 + 35.788 * 74.5 - 217.075 * 4 + 305.687 * 2

For 3 rooms: y = 348.349 + 35.788 * 74.5 - 217.075 * 4 + 305.687 * 3
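These six prices follow from a single substitution into the fitted equation Y = 348.349 + 35.788 x1 - 217.075 x4 + 305.687 x7; a minimal check of the arithmetic (the coefficients and district coding are exactly those given in the text):

```python
# Fitted model from the text: x1 is total area, x4 the district code
# (3 = Soviet, 4 = Northern), x7 the number of rooms.
def price(x1, x4, x7):
    return 348.349 + 35.788 * x1 - 217.075 * x4 + 305.687 * x7

for district in (3, 4):
    for rooms in (1, 2, 3):
        print(district, rooms, round(price(74.5, district, rooms), 3))
```

For example, a one-room 74.5 m² apartment in the elite district gives price(74.5, 3, 1) = 2669.017, and moving to the peripheral district lowers every estimate by 217.075.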

Chapter 2. Cluster Analysis

Task: Study of the structure of cash flow and savings of the population.

The table presents the structure of cash income and savings of the population by regions of the Central Federal District of the Russian Federation in 2003, for the following indicators:

· Powder - purchase of goods and payment for services;

· Opiv - mandatory payments and contributions;

· Mon - acquisition of real estate;

· PPA - increase in financial assets;

· DR - increase (decrease) in money held by the population.

Fig. 8 Source data

Requires:

1) determine the optimal number of clusters to split regions into homogeneous groups in all grouping features simultaneously;

2) classify the regions by the hierarchical method with the between-groups linkage algorithm and display the results as a dendrogram;

3) to analyze the main priorities of cash flow and savings in the clusters obtained;

Performance:

1) determine the optimal number of clusters to split regions into homogeneous groups in all grouping features simultaneously;

To determine the optimal number of clusters, hierarchical cluster analysis is used, and we turn to the "Coefficients" column of the "Agglomeration Schedule" table.

These coefficients are the distances between two clusters, determined on the basis of the selected distance measure (here the Euclidean distance). At the stage where the distance between two merged clusters increases abruptly, the process of combining into new clusters must be stopped.

As a result, the optimal number of clusters is taken to be the difference between the number of observations (17) and the number of the step (14) after which the coefficient increases abruptly. Thus, the optimal number of clusters is 3 (Fig. 9).
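This rule (optimal number of clusters = number of observations minus the number of the step after which the agglomeration coefficient jumps) can be sketched as follows. Synthetic data stand in for the 17 regions and 5 indicators, and SciPy's average linkage is used as an analogue of SPSS's between-groups linkage (an assumption made for illustration, not the text's own computation):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# 17 synthetic "regions" with 5 indicators, forming 3 well-separated groups.
X = np.vstack([
    rng.normal(0.0, 0.3, size=(6, 5)),
    rng.normal(3.0, 0.3, size=(6, 5)),
    rng.normal(6.0, 0.3, size=(5, 5)),
])

Z = linkage(X, method="average", metric="euclidean")
merge_dist = Z[:, 2]                              # analogue of the "Coefficients" column
step = int(np.argmax(np.diff(merge_dist))) + 1    # step after which the coefficient jumps
n_clusters = len(X) - step                        # same arithmetic as 17 - 14 = 3

labels = fcluster(Z, t=n_clusters, criterion="maxclust")
print(n_clusters, sorted(set(labels)))
```

With three well-separated groups the largest jump occurs once only merges between groups remain, so the rule recovers three clusters.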


Fig. 9 "Agglomeration Schedule" table

2) classify the regions by the hierarchical method with the between-groups linkage algorithm and display the results as a dendrogram;

Now, using the optimal number of clusters, we classify the regions by the hierarchical method, and in the output we turn to the "Cluster Membership" table (Fig. 10).

Fig. 10 "Cluster Membership" table

Fig. 10 clearly shows that cluster 3 contains two regions (Kaluga, Moscow) and the city of Moscow; cluster 2 contains nine regions (Bryansk, Voronezh, Ivanovo, Lipetsk, Orel, Ryazan, Smolensk, Tambov, Tver); cluster 1 contains the Belgorod, Vladimir, Kostroma, Kursk, Tula and Yaroslavl regions.

Fig. 11 Dendrogram

3) analyze the main priorities of cash income and savings in the clusters obtained;

To analyze the obtained clusters we need the "Compare Means" procedure. The corresponding table is displayed in the output window (Fig. 12).

Fig. 12 Mean values of the variables

In the table of mean values we can trace which items are given the greatest priority in the distribution of cash income and savings of the population.

First of all, it is worth noting that in all regions the highest priority is given to buying goods and paying for services. The parameter takes its greatest value in cluster 3.

Second place is occupied by the increase in financial assets, with the greatest value in cluster 1.

The smallest coefficient in clusters 1 and 2 is for "acquisition of real estate", while in cluster 3 a noticeable decrease in money held by the population was revealed.

In general, buying goods and paying for services, together with a modest acquisition of real estate, is of special importance for the population.

4) Compare the obtained classification with the results of using the within-groups linkage algorithm.

With the within-groups linkage algorithm the picture remains practically unchanged, except for the Tambov region, which moved from cluster 2 to cluster 1 (Fig. 13).

Fig. 13 Analysis with the within-groups linkage algorithm

No changes occurred in the table of mean values.

Chapter 3. Factor Analysis

Task: Analysis of light industry enterprises.

There are data of surveys of 20 light industry enterprises (Fig. 14) on the following characteristic features:

· X1 - level of capital productivity;

· X2 - labor intensity of a unit of output;

· X3 - share of purchased materials in total costs;

· X4 - equipment shift coefficient;

· X5 - bonuses and remuneration per employee;

· X6 - share of losses from defects;

· X7 - average annual value of fixed production assets;

· X8 - average annual wage fund;

· X9 - level of product sales;

· X10 - fixed-asset index (ratio of fixed assets and other non-current assets to own funds);

· X11 - working capital turnover;

· X12 - non-production costs.

Fig.14 Initial data

Requires:

1. Conduct a factor analysis of the following variables: 1, 3, 5-7, 9, 11, 12; identify and interpret the factor features;

2. Specify the most prosperous and promising enterprises.

Performance:

1. Conduct a factor analysis of the following variables: 1, 3, 5-7, 9, 11, 12; identify and interpret the factor features.

Factor analysis is a set of methods that, on the basis of the actually existing relationships between objects (features), make it possible to identify latent (implicit) generalizing characteristics of their organizational structure.

In the Factor Analysis dialog box, select our variables, specify the necessary parameters.

Fig. 15 Total variance explained

The "Total Variance Explained" table shows that 3 factors were extracted, explaining 74.8% of the variation of the variables, so the constructed model is quite good.
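A "total variance explained" table of this kind comes from the eigenvalues of the correlation matrix of the standardized variables. The sketch below reproduces the computation on synthetic data generated from three latent factors (the sizes and the resulting share are illustrative, not the 74.8% of the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 20, 8, 3                          # 20 observations, 8 indicators, 3 latent factors
F = rng.normal(size=(n, m))                 # latent factor scores
L = rng.normal(size=(m, p))                 # factor loadings
X = F @ L + 0.3 * rng.normal(size=(n, p))   # observed indicators = factors + noise

R = np.corrcoef(X, rowvar=False)            # correlation matrix of the indicators
eigval = np.linalg.eigvalsh(R)[::-1]        # eigenvalues, largest first

explained = eigval / eigval.sum()           # share of total variance per component
cum3 = explained[:3].sum()                  # share explained by the first 3 factors
print(round(cum3, 3))
```

Components with eigenvalues above 1 are usually retained (Kaiser's rule); here the first three dominate because the data were built from three factors.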

Now we interpret the factors using the "Rotated Component Matrix" (Fig. 16).

Fig. 16 Rotated component matrix

Factor 1 is most closely associated with the level of product sales and is inversely related to non-production costs.

Factor 2 is most closely associated with the share of purchased materials in total costs and the share of losses from defects, and is inversely related to bonuses and remuneration per employee.

Factor 3 is most closely associated with the level of capital productivity and working capital turnover, and is inversely related to the average annual value of fixed production assets.

2. Specify the most prosperous and promising enterprises.

To identify the most prosperous enterprises, we sort the data by the 3 factor features in descending order (Fig. 17).

The most prosperous enterprises should be considered 13, 4 and 5, since across all 3 factors their indicators occupy the highest and most stable positions.

Chapter 4. Discriminant Analysis

Assessment of the creditworthiness of legal entities in a commercial bank

As significant indicators characterizing the financial condition of the borrowing organizations, six indicators were chosen by the bank (Table 4.1.1):

QR (x1) - quick liquidity ratio;

CR (x2) - current liquidity ratio;

EQ/TA (x3) - financial independence ratio;

TD/EQ (x4) - ratio of total liabilities to equity;

ROS (x5) - return on sales;

FAT (x6) - fixed asset turnover.

Table 4.1.1. Initial data


Requires:

Based on a discriminant analysis using the SPSS package, determine to which of the four categories each of the three borrowers (legal entities) wishing to obtain a loan in a commercial bank belongs:

§ Group 1 - with excellent financial indicators;

§ Group 2 - with good financial indicators;

§ Group 3 - with bad financial indicators;

§ Group 4 - with very bad financial indicators.

Based on the results of the calculation, construct the discriminant functions and assess their significance by the Wilks coefficient (λ). Build a perception map and diagrams of the mutual location of the observations in the space of the three functions. Interpret the results of the analysis.

Progress:

To determine to which of the four categories the three borrowers wishing to obtain a loan in a commercial bank belong, we carry out a discriminant analysis, which makes it possible to determine to which of the previously identified sets (training samples) new clients should be assigned.

As the dependent variable we select the group to which a borrower may belong depending on its financial indicators. Following the task data, each group is assigned the corresponding rating 1, 2, 3 or 4.

The unstandardized canonical coefficients of the discriminant functions shown in Fig. 4.1.1 are used to construct the equations of the discriminant functions d1(x), d2(x) and d3(x):

3.) d3(x) =

Fig. 4.1.1. Canonical discriminant function coefficients

Fig. 4.1.2. Wilks' lambda

However, since the significance of the Wilks coefficient (Fig. 4.1.2) for the second and third functions is greater than 0.001, they are of little use for discrimination.

The data of the "Classification Results" table (Fig. 4.1.3) show that the classification is correct for 100% of the observations; high accuracy is achieved in all four groups (100%).

Fig. 4.1.3. Classification results

Information on the actual and predicted groups for each borrower is shown in the "Casewise Statistics" table (Fig. 4.1.4).

As a result of the discriminant analysis, the new bank borrowers were assigned to the training subset M1: the first, second and third borrowers (serial numbers 41, 42, 43) are classified into subset M1 with corresponding probabilities of 100%.
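The classification step can be sketched without SPSS. Under a common covariance matrix, assigning a borrower to the group with the nearest centroid in the Mahalanobis metric is equivalent to linear discriminant classification with equal priors; the data below are hypothetical stand-ins for the six financial indicators, not the bank's survey:

```python
import numpy as np

rng = np.random.default_rng(2)
# Four hypothetical borrower groups (1 = excellent ... 4 = very bad),
# ten training cases each, in the space of six indicators.
means = np.array([[3.0] * 6, [2.0] * 6, [1.0] * 6, [0.0] * 6])
X = np.vstack([m + 0.2 * rng.normal(size=(10, 6)) for m in means])
y = np.repeat([1, 2, 3, 4], 10)

# Pooled within-group covariance matrix, as in linear discriminant analysis.
W = sum(np.cov(X[y == g], rowvar=False) for g in (1, 2, 3, 4)) / 4
W_inv = np.linalg.inv(W)

def classify(x):
    """Assign x to the group with the smallest Mahalanobis distance to its centroid."""
    d = [(x - m) @ W_inv @ (x - m) for m in means]
    return int(np.argmin(d)) + 1

new_borrower = np.array([2.9, 3.1, 3.0, 2.8, 3.2, 3.0])  # resembles group 1
print(classify(new_borrower))
```

On such well-separated training data the rule reproduces the situation described in the text: the training cases and the new borrower fall into their own groups with near-100% confidence.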

The table lists, for each case, the observation number, the actual group, the highest (most likely) group and the predicted group; the three new borrowers appear as ungrouped cases.

Fig. 4.1.4. Casewise statistics

The coordinates of the centroids by group are given in the "Functions at Group Centroids" table (Fig. 4.1.5). They are used to plot the centroids on the perception map (Fig. 4.1.6).

Fig. 4.1.5. Functions at group centroids

Fig. 4.1.6. Perception map for the two discriminant functions d1(x) and d2(x) (* - group centroid)

The field of the "territorial map" is divided by the discriminant functions into four areas: the left part contains mainly observations of the fourth group of borrowers with very poor financial indicators, the right part the first group with excellent financial indicators, and the middle and lower parts the third and second groups of borrowers with poor and good financial indicators, respectively.

Fig. 4.1.7. Scattering diagram for all groups

Fig. 4.1.7 shows the combined plot of the distribution of all groups of borrowers together with their centroids; it can be used for a comparative visual analysis of the mutual location of the bank's borrowers by financial indicators. The right part of the plot contains borrowers with high indicators, the left part those with low indicators, and the middle part those with average financial indicators. Since, according to the calculation results, the second discriminant function d2(x) turned out to be insignificant, the differences in the centroid coordinates along this axis are insignificant.

Estimation of the creditworthiness of individuals in a commercial bank

The commercial bank's credit department conducted a sample survey of 30 of its clients (individuals). On the basis of preliminary data analysis, the borrowers were assessed by six indicators (Table 4.2.1):

X1 - whether the borrower has taken loans from commercial banks before;

X2 - average monthly income of the borrower's family, thousand rubles;

X3 - term (period) of repayment of the loan, years;

X4 - the size of the loan issued, thousand rubles;

X5 - the composition of the borrower's family, people;

X6 - the age of the borrower, years.

At the same time, three groups of borrowers were identified by the probability of loan repayment:

§ Group 1 - with a low probability of loan repayment;

§ Group 2 - with the average probability of repayment of the loan;

§ Group 3 - with a high probability of repayment of the loan.

Requires:

Based on a discriminant analysis using the SPSS package, classify the three bank clients (by the probability of loan repayment), i.e. assess the belonging of each of them to one of the three groups. Based on the calculation results, construct the significant discriminant functions and assess their significance by the Wilks coefficient (λ). In the space of the two discriminant functions, construct diagrams of the mutual location of the observations for each group and the combined diagram. Assess the location of each borrower on these diagrams. Interpret the results of the analysis.

Table 4.2.1. Initial data

Progress:

To carry out the discriminant analysis, as the dependent variable we choose the probability of timely repayment of the loan by the client. Since it can be low, medium or high, each category is assigned the corresponding rating 1, 2 or 3.

The unstandardized canonical coefficients of the discriminant functions shown in Fig. 4.2.1 are used to construct the equations of the discriminant functions d1(x) and d2(x):

2.) d2(x) =

Fig. 4.2.1. Canonical discriminant function coefficients

Fig. 4.2.2. Wilks' lambda

Since the significance of the Wilks coefficient (Fig. 4.2.2) for the second function is greater than 0.001, it is of little use for discrimination.

The data of the "Classification Results" table (Fig. 4.2.3) show that the classification is correct for 93.3% of the observations; high accuracy is achieved in the first and second groups (100% and 91.7%), and less accurate results are obtained in the third group (88.9%).

Fig. 4.2.3. Classification results

Information on the actual and predicted groups for each client is shown in the "Casewise Statistics" table (Fig. 4.2.4).

As a result of the discriminant analysis, the new bank clients were assigned to the training subset M3: the first, second and third clients (serial numbers 31, 32, 33) are classified into subset M3 with corresponding probabilities of 99%, 99% and 100%.

The table lists, for each case, the observation number, the actual group, the highest (most likely) group and the predicted group; the three new clients appear as ungrouped cases.

Fig. 4.2.4. Casewise statistics

Fig. 4.2.5. Functions at group centroids

The coordinates of the centroids by group are given in the "Functions at Group Centroids" table (Fig. 4.2.5). They are used to plot the centroids on the perception map (Fig. 4.2.6).

The field of the "territorial map" is divided by the discriminant functions into three areas: the left part contains mainly observations of the first group of clients with a low probability of loan repayment, the right part the third group with a high probability, and the middle part the second group of clients with an average probability of loan repayment.

Fig. 4.2.7 (a-c) shows the location of the clients of each of the three groups on the plane of the two discriminant functions d1(x) and d2(x). These plots make it possible to analyze in detail the probability of loan repayment within each group, to judge the nature of the distribution of clients and to assess the degree of their remoteness from the corresponding centroid.

Fig. 4.2.6. Perception map for the two discriminant functions d1(x) and d2(x) (* - group centroid)

Fig. 4.2.7 (d) shows, in the same coordinate system, the combined plot of the distribution of all client groups together with their centroids; it can be used for a comparative visual analysis of the mutual location of the groups of the bank's clients with different loan repayment probabilities. The left part of the plot contains borrowers with a high probability of loan repayment, the right part those with a low probability, and the middle part those with an average probability. Since, according to the calculation results, the second discriminant function d2(x) turned out to be insignificant, the differences in the centroid coordinates along this axis are insignificant.

Fig. 4.2.7. Location of observations on the plane of the two discriminant functions for the groups with low (a), medium (b) and high (c) probability of loan repayment and for all groups (d)

Bibliography

1. Multidimensional Statistical Analysis in Economic Problems: Computer Modeling in SPSS, 2009.

2. Orlov A.I. Applied Statistics. M.: Ekzamen Publishing House, 2004.

3. Fisher R.A. Statistical Methods for Research Workers, 1954.

4. Kalinina V.N., Soloviev V.I. Introduction to Multidimensional Statistical Analysis: Textbook, 2003.

5. Achim Bühl, Peter Zöfel. SPSS: The Art of Information Processing. DiaSoft Publishing House, 2005.

6. http://ru.wikipedia.org/wiki.
