Statistical methods. Probability and statistics: basic facts. Probabilistic and statistical research methods

In many cases in mining science it is necessary to study not only deterministic but also random processes. All geomechanical processes take place in continuously changing conditions, in which particular events may or may not occur. In such cases it becomes necessary to analyze random relationships.

Despite their random nature, such events obey certain regularities, which are treated in probability theory, the science that studies the theoretical distributions of random variables and their characteristics. Another science, mathematical statistics, deals with methods for processing and analyzing random empirical events. These two related sciences constitute a unified mathematical theory of mass random processes, which is widely used in scientific research.

Elements of probability theory and mathematical statistics. A population is understood as a set of homogeneous events involving a random variable X, which constitutes the primary statistical material. The population can be general (a large sample N), containing the most varied variants of the mass phenomenon, or sampled (a small sample N1), which is only a part of the general population.

The probability P(X) of an event X is the ratio of the number of cases N(X) that lead to the occurrence of the event X to the total number of possible cases N:

P(X) = N(X)/N.

In mathematical statistics, the analogue of probability is the frequency of an event, defined as the ratio of the number of cases in which the event took place to the total number of events:

W(X) = n(X)/n.

With an unlimited increase in the number of events, the frequency W(X) tends to the probability P(X).
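To make this convergence concrete, here is a minimal Python sketch (the fair die and the trial counts are our illustrative choices, not from the text):

```python
import random

random.seed(1)

# Event X: a fair die shows a six; P(X) = N(X)/N = 1/6.
for n in (100, 10_000, 1_000_000):
    hits = sum(1 for _ in range(n) if random.randint(1, 6) == 6)
    print(f"n = {n:>9}: W(X) = {hits / n:.4f}   (P(X) = {1/6:.4f})")
```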



Suppose some statistical data are presented in the form of a distribution series (histogram), as in Fig. 4.11. Then the frequency characterizes the probability of the random variable appearing in the i-th interval, and the smooth curve drawn through the histogram is called the distribution function.

The probability of a random variable is a quantitative estimate of the possibility of its occurrence. A certain event has P = 1, an impossible event P = 0. Hence, for a random event 0 ≤ P(X) ≤ 1, and the sum of the probabilities of all its possible values equals 1.

In research it is not enough to have the distribution curve; its characteristics must also be known:

a) the arithmetic mean: x̄ = Σ xi / n; (4.53)

b) the range: R = x_max − x_min, which can be used to roughly estimate the variation of events, where x_max and x_min are the extreme values of the measured quantity;

c) the mathematical expectation: m(x) = Σ xi pi. (4.54)

For continuous random variables, the expectation is written in the form

m(x) = ∫_{−∞}^{+∞} x f(x) dx, (4.55)

i.e. it is equal to the average of the actually observed values of x, and the abscissa corresponding to the expectation is called the center of the distribution;

d) the variance: D(x) = Σ (xi − m(x))² pi, (4.56)

which characterizes the scattering of a random variable in relation to the mathematical expectation. The variance of a random variable is also called the second-order central moment.

For a continuous random variable, the variance is

D(x) = ∫_{−∞}^{+∞} (x − m(x))² f(x) dx; (4.57)

e) the standard deviation, or standard: σ(x) = √D(x); (4.58)

f) the coefficient of variation (relative scattering):

k = σ/m, (4.59)

which characterizes the intensity of scattering in different populations and is used to compare them.
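A short sketch of the characteristics (4.53)-(4.59) computed for a small made-up sample (the numbers are illustrative, not from the text):

```python
import math

x = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9]

n = len(x)
mean = sum(x) / n                                  # arithmetic mean (4.53)
rng = max(x) - min(x)                              # range: R = x_max - x_min
var = sum((xi - mean) ** 2 for xi in x) / n        # variance (4.56)
sigma = math.sqrt(var)                             # standard deviation (4.58)
k = sigma / mean                                   # coefficient of variation (4.59)

print(mean, rng, var, sigma, k)
```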

The area under the distribution curve equals one, which means that the curve covers all values of the random variable. However, many such curves with unit area can be constructed, i.e. they can have different scattering. The measure of scattering is the variance or the standard deviation (Fig. 4.12).


Above we examined the main characteristics of the theoretical distribution curve, which are analyzed by probability theory. In statistics one works with empirical distributions, and the main task of statistics is the selection of a theoretical curve matching the existing empirical distribution law.

Suppose that, as a result of n measurements of a random variable, a variation series x1, x2, x3, …, xn is obtained. The processing of such series reduces to the following operations:

- group the values xi into intervals and establish the absolute and relative frequencies for each interval;

- use these frequencies to construct a stepped histogram (Fig. 4.11);

- calculate the characteristics of the empirical distribution curve: the arithmetic mean x̄ = Σ xi / n; the variance D = Σ (xi − x̄)² / n; the standard deviation s = √D.

The values x̄, D and s of the empirical distribution correspond to the values m(x), D(x) and σ(x) of the theoretical distribution.



Consider the main theoretical distribution curves. Most often, research relies on the normal distribution law (Fig. 4.13), whose equation has the form:

f(x) = (1/(σ√(2π))) exp(−(x − m)² / (2σ²)). (4.60)

If the coordinate axis is aligned with the point m, i.e. m(x) = 0 is taken, and σ = 1 is accepted, the normal distribution law is described by a simpler equation:

f(x) = (1/√(2π)) exp(−x²/2). (4.61)

To estimate the scattering, the value σ is usually used. The smaller σ is, the smaller the scattering, i.e. the observations differ little from one another. As σ increases, the scattering grows and the probability of errors rises, while the maximum of the curve (its ordinate), equal to 1/(σ√(2π)), decreases. Therefore a quantity inversely proportional to σ, h = 1/(σ√2), is called the measure of accuracy. The root-mean-square deviations x = m ± σ correspond to the inflection points of the distribution curve (the shaded area in Fig. 4.12).
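A sketch checking two of the facts just stated for the normal density (4.60): its maximum equals 1/(σ√(2π)), and its inflection points lie at x = m ± σ (the values of m and σ are arbitrary here):

```python
import math

m, sigma = 0.0, 2.0

def f(x):
    # normal density (4.60)
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(f(m), 1 / (sigma * math.sqrt(2 * math.pi)))   # maximum ordinate of the curve

# the numerical second derivative changes sign at x = m + sigma (inflection point)
h = 1e-4
for x in (m + sigma - 0.1, m + sigma + 0.1):
    d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2
    print(x, "convex" if d2 > 0 else "concave")
```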

When analyzing many random discrete processes, the Poisson distribution is used (for short-term events occurring per unit of time). The probability of occurrence of a number of rare events x = 1, 2, … in a given period of time is expressed by Poisson's law (see Fig. 4.14):

P(x) = (λt)^x e^(−λt) / x! = a^x e^(−a) / x!, (4.62)

where x is the number of events in the given period of time t;

λ is the density, i.e. the average number of events per unit time;

a = λt is the average number of events in the time t.

For Poisson's law, the variance is equal to the mathematical expectation of the number of occurrences of events in the time t, i.e. D = m(x) = λt.
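A sketch of Poisson's law (4.62), confirming numerically that the expectation and the variance both equal a = λt (the values of λ and t are illustrative):

```python
import math

lam, t = 2.0, 1.5          # density of events and time period (arbitrary choices)
a = lam * t                # average number of events in the time t

p = [a ** x * math.exp(-a) / math.factorial(x) for x in range(60)]

m = sum(x * px for x, px in enumerate(p))               # mathematical expectation
D = sum((x - m) ** 2 * px for x, px in enumerate(p))    # variance

print(a, m, D)   # all three (almost) coincide: a = m = D
```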

To study the quantitative characteristics of some processes (machine failure time, etc.), the exponential distribution law is used (Fig. 4.15), whose distribution density is expressed by the dependence

f(x) = λ e^(−λx), (4.63)

where λ is the intensity, i.e. the average number of events per unit time.

In the exponential distribution, the intensity λ is the reciprocal of the mathematical expectation: λ = 1/m(x). In addition, the relation σ(x) = m(x) holds: the standard deviation equals the expectation.

In various fields of research, the Weibull distribution law is widely used (Fig. 4.16):

f(x) = nμ x^(n−1) e^(−μx^n), (4.64)

where n and μ are the parameters of the law, and x is the argument, most often time.

When investigating processes associated with a gradual decrease of parameters (a decrease in the strength of rocks over time, etc.), the gamma distribution law is applied (Fig. 4.17):

f(x) = (λ^a x^(a−1) / Γ(a)) e^(−λx), (4.65)

where λ and a are parameters. If a = 1, the gamma distribution turns into the exponential law.
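The exponential (4.63), Weibull (4.64) and gamma (4.65) densities can be inspected, for instance, with scipy.stats (the parameter values below are arbitrary; note that scipy's `weibull_min` uses a shape-scale parametrization rather than the n, μ form of (4.64)):

```python
import numpy as np
from scipy import stats

x = np.linspace(0.1, 5, 5)

lam = 1.2                                        # intensity of the exponential law
print(stats.expon.pdf(x, scale=1 / lam))         # f(x) = lam * exp(-lam * x)

n_shape = 1.8                                    # Weibull shape parameter
print(stats.weibull_min.pdf(x, n_shape))         # Weibull density, unit scale

a = 2.5                                          # gamma shape parameter
print(stats.gamma.pdf(x, a, scale=1 / lam))      # gamma density
print(np.allclose(stats.gamma.pdf(x, 1, scale=1 / lam),
                  stats.expon.pdf(x, scale=1 / lam)))   # True: a = 1 gives the exponential law
```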

In addition to the above laws, other types of distributions are also used: Pearson, Rayleigh, beta distribution, etc.

Analysis of variance. In research the question often arises: to what extent does this or that random factor affect the process under study? Methods for establishing the main factors and their influence on the process under study are considered in a special section of probability theory and mathematical statistics: analysis of variance. One-way and multiway (multivariate) analyses are distinguished. Analysis of variance is based on the use of the normal distribution law and on the hypothesis that the centers of the normal distributions of the random variables are equal. Therefore, all measurements can be viewed as a sample from the same normal population.

Reliability theory. The methods of probability theory and mathematical statistics are often applied in reliability theory, which is widely used in various branches of science and technology. Reliability is understood as the property of an object to perform specified functions (to maintain established performance indicators) for the required period of time. In reliability theory, failures are treated as random events. For the quantitative description of failures, mathematical models are used: the distribution functions of time intervals (the normal and exponential distributions, the Weibull and gamma distributions). The task is to find the probabilities of various indicators.

The Monte Carlo method. To study complex processes of a probabilistic nature, the Monte Carlo method is used; it addresses the problem of finding the best solution among the set of options under consideration.

The Monte Carlo method is also called the method of statistical modeling. It is a numerical method based on the use of random numbers that simulate probabilistic processes. The mathematical basis of the method is the law of large numbers, which is formulated as follows: with a large number of statistical tests, the probability that the arithmetic mean of a random variable deviates from its mathematical expectation by less than any preassigned amount tends to 1:

P(|x̄ − m(x)| < ε) → 1 as n → ∞, (4.66)

where ε is any small positive number.
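A minimal statistical-modeling sketch of (4.66): the running arithmetic mean of a simulated random variable approaches its mathematical expectation (the uniform variable and the sample sizes are our arbitrary choices):

```python
import random

random.seed(2)

m = 0.5                      # expectation of a uniform variable on [0, 1)
total = 0.0
for n in range(1, 100_001):
    total += random.random()
    if n in (10, 100, 10_000, 100_000):
        print(f"n = {n:>6}: |mean - m| = {abs(total / n - m):.5f}")
```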

The sequence of solving problems by the Monte Carlo method:

- collection, processing and analysis of statistical observations;

- selection of the main and discarding of secondary factors and drawing up a mathematical model;

- drawing up algorithms and solving problems on a computer.

To solve problems by the Monte Carlo method, one must have a statistical series and know its distribution law, mean value, mathematical expectation and standard deviation. The solution is effective only with the use of a computer.

3. The essence of probabilistic and statistical methods

How are the approaches, ideas and results of probability theory and mathematical statistics used in the processing of data (the results of observations, measurements, tests, analyses, experiments) in order to make practically important decisions?

The basis is a probabilistic model of the real phenomenon or process, i.e. a mathematical model in which objective relationships are expressed in terms of probability theory. Probabilities are used primarily to describe the uncertainties that must be taken into account when making decisions. This refers both to undesirable possibilities (risks) and to attractive ones ("lucky chance"). Sometimes randomness is deliberately introduced into a situation, for example when drawing lots, randomly selecting units for control, or holding lotteries and consumer surveys.

Probability theory allows one to calculate, from some probabilities, other probabilities that interest the researcher. For example, based on the probability of heads coming up, one can calculate the probability that in 10 coin tosses at least 3 heads will come up. Such a calculation is based on a probabilistic model in which the coin tosses are described by a scheme of independent trials; in addition, heads and tails are equally likely, so the probability of each of these events is ½. A more complex model considers checking the quality of a unit of output instead of a coin toss. The corresponding probabilistic model is based on the assumption that the quality control of various units of production is described by a scheme of independent trials. In contrast to the coin-tossing model, a new parameter must be introduced: the probability p that a unit is defective. The model will be fully described if it is assumed that all units have the same probability of being defective. If the latter assumption is incorrect, the number of model parameters increases; for example, one can assume that each unit has its own probability of being defective.
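The coin-toss calculation mentioned above, done explicitly: in the scheme of independent trials with p = ½, the probability of at least 3 heads in 10 tosses is a sum of binomial terms:

```python
from math import comb

n, p = 10, 0.5
prob = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(3, n + 1))
print(prob)   # ~0.9453: at least 3 heads in 10 tosses of a symmetric coin
```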

Let us discuss the quality control model with a common defectiveness probability p for all units of production. In order to "get down to numbers" when analyzing the model, p must be replaced by a specific value. To do this, one must go beyond the probabilistic model and turn to the data obtained during quality control. Mathematical statistics solves the inverse problem with respect to probability theory: its purpose is to draw conclusions about the probabilities underlying the probabilistic model from the results of observations (measurements, analyses, tests, experiments). For example, based on the frequency of occurrence of defective products during inspection, conclusions can be drawn about the probability of defectiveness (see the discussion above involving Bernoulli's theorem). On the basis of Chebyshev's inequality, conclusions can be drawn about the correspondence of the frequency of occurrence of defective products to the hypothesis that the probability of defectiveness takes a certain value.

Thus, the application of mathematical statistics is based on a probabilistic model of a phenomenon or process. Two parallel series of concepts are used: those related to theory (the probabilistic model) and those related to practice (the sample of observation results). For example, the theoretical probability corresponds to the frequency found from the sample; the mathematical expectation (theoretical series) corresponds to the sample arithmetic mean (practical series). Typically, sample characteristics are estimates of the theoretical ones. At the same time, the quantities of the theoretical series "are in the heads of researchers", belong to the world of ideas (in the sense of the ancient Greek philosopher Plato), and are inaccessible to direct measurement. Researchers have only sample data, with the help of which they try to establish the properties of the theoretical probabilistic model that interest them.

Why is a probabilistic model needed? Only with its help can the properties established from the analysis of a particular sample be transferred to other samples, as well as to the entire so-called general population. The term "general population" is used when referring to a large but finite collection of units under study: for example, the totality of all residents of Russia, or the totality of all consumers of instant coffee in Moscow. The purpose of marketing or opinion polls is to transfer statements obtained from a sample of hundreds or thousands of people to populations of several million. In quality control, the batch of products acts as the general population.

In order to transfer conclusions from a sample to a larger population, one or another assumption about the relationship of the sample characteristics with the characteristics of this larger population is necessary. These assumptions are based on an appropriate probabilistic model.

Of course, it is possible to process sample data without using a particular probabilistic model. For example, you can calculate the sample arithmetic mean, calculate the frequency of the fulfillment of certain conditions, etc. However, the calculation results will relate only to a specific sample; the transfer of the conclusions obtained with their help to any other population is incorrect. This activity is sometimes referred to as “data mining”. Compared to probabilistic-statistical methods, data analysis has limited cognitive value.

So, the use of probabilistic models for estimating parameters and testing hypotheses with the help of sample characteristics is the essence of probabilistic-statistical decision-making methods.

We emphasize that the logic of using sample characteristics for making decisions on the basis of theoretical models presupposes the simultaneous use of two parallel series of concepts, one corresponding to the probabilistic models and the other to the sample data. Unfortunately, in a number of published sources, usually outdated or written in a cookbook spirit, no distinction is made between sample and theoretical characteristics, which leads readers to bewilderment and to errors in the practical use of statistical methods.


Probabilistic and statistical methods for modeling economic systems


Introduction


The problem of identifying the distribution law of an observed random variable (structural-parametric identification) is understood, as a rule, as the problem of choosing a parametric model of the probability distribution law that best matches the results of experimental observations. Random errors of measuring instruments do not obey the normal law all that often; more precisely, they are not so often well described by the normal law model. Measuring devices and systems are based on different physical principles, different measurement methods and different conversions of the measuring signals. Measurement errors, as quantities, are the result of the influence of many factors, random and non-random in nature, acting constantly or sporadically. Therefore it is clear that only when certain prerequisites (theoretical and technical) are met are measurement errors sufficiently well described by the normal law model.

Generally speaking, it should be understood that the true distribution law (if it exists at all) describing the errors of a particular measuring system remains unknown, despite all our attempts to identify it. Based on measurement data and theoretical considerations, we can only choose a probabilistic model that, in a certain sense, best approximates this true law. If the constructed model is adequate, that is, the applied criteria give no grounds for its rejection, then on the basis of this model one can calculate all the probabilistic characteristics of the random component of the error of the measuring instrument that are of interest to us; these will differ from the true values only because of the non-excluded systematic (unobserved or unrecorded) component of the measurement error, whose smallness characterizes the correctness of the measurements. The set of possible probability distribution laws that can be used to describe observed random variables is unlimited, and it makes no sense to set as the goal of the identification problem finding the true distribution law of the observed quantity. We can only solve the problem of choosing the best model from some set, for example from the set of parametric laws and mixtures of distributions that are used in applications and can be found in the literature.

The classical approach to structural-parametric identification of the distribution law. By the classical approach we mean an algorithm for choosing the distribution law that is based entirely on the apparatus of mathematical statistics.


1. Elementary concepts of random events, quantities and functions


We have already seen that for many experiments there is no difference in the calculation of the probabilities of events, while the elementary outcomes of these experiments are very different. But it is the probabilities of events that should interest us, not the structure of the space of elementary outcomes. Therefore, in all such "similar" experiments it is time to use, for example, numbers instead of the very different elementary outcomes. In other words, each elementary outcome can be associated with some real number, so that we work only with numbers.

Let a probability space (Ω, F, P) be given.

Definition 26. A function ξ: Ω → R is called a random variable if for any Borel set B the set ξ⁻¹(B) is an event, i.e. belongs to the σ-algebra F.

The set ξ⁻¹(B) = {ω : ξ(ω) ∈ B}, consisting of those elementary outcomes ω for which ξ(ω) belongs to B, is called the full preimage of the set B.

Remark 9. In general, let a function f act from a set X into a set Y, and let σ-algebras F and G of subsets of X and Y, respectively, be given. The function f is called measurable if for any set B ∈ G its full preimage f⁻¹(B) belongs to F.

Remark 10. A reader who does not want to bother with the abstractions associated with σ-algebras of events and with measurability can safely assume that any set of elementary outcomes is an event, and hence that a random variable is an arbitrary function from Ω to R. In practice this entails no trouble, so everything further in this paragraph can be skipped.

Now, having got rid of incurious readers, let us try to understand why a random variable needs measurability.

If a random variable ξ is given, we may need to compute probabilities of the form P(ξ = a), P(ξ ∈ [a, b]), P(ξ > c) (and, in general, very different probabilities of falling into Borel sets on the line). This is possible only if the sets under the probability sign are events, for probability is a function defined only on the σ-algebra of events. The requirement of measurability is equivalent to the statement that for any Borel set B the probability P(ξ ∈ B) is defined.

One could require something else in Definition 26: for example, that hitting any interval be an event, {ω : ξ(ω) ∈ (a, b)} ∈ F, or hitting any half-line: {ω : ξ(ω) < x} ∈ F.

Let us verify, for example, that Definitions 26 and 27 are equivalent.

Definition 27. A function ξ: Ω → R is called a random variable if for any real x the set {ω : ξ(ω) < x} belongs to the σ-algebra F.

Proof of the equivalence of Definitions 26 and 27.

If ξ is a random variable in the sense of Definition 26, then it is also a random variable in the sense of Definition 27, since any half-line (−∞, x) is a Borel set.

Let us prove that the converse is also true. Suppose that for any x the set {ω : ξ(ω) < x} is an event. We must prove that the same is true for any Borel set.

Collect in a class A all those subsets B of the real line whose full preimages ξ⁻¹(B) are events. The class A already contains all half-lines (−∞, x). Let us now show that A is a σ-algebra. By definition, B belongs to A if and only if ξ⁻¹(B) belongs to F.

1. Let us make sure that R ∈ A. But ξ⁻¹(R) = Ω ∈ F, and therefore R ∈ A.

2. Let us make sure that the complement R \ B belongs to A for any B ∈ A. Let B ∈ A. Then ξ⁻¹(R \ B) = Ω \ ξ⁻¹(B) ∈ F, because F is a σ-algebra.

3. Let us make sure that the union ∪ Bj belongs to A for any B1, B2, … ∈ A. Let Bj ∈ A for all j. But ξ⁻¹(∪ Bj) = ∪ ξ⁻¹(Bj) ∈ F, since F is a σ-algebra.

We have proved that A is a σ-algebra and that it contains all half-lines of the real line. But the Borel σ-algebra B(R) is the smallest σ-algebra containing all intervals (equivalently, all half-lines). Hence A contains it: A ⊇ B(R).

Let's give examples of measurable and non-measurable functions.

Example 25. We toss a die. Let Ω = {1, 2, 3, 4, 5, 6}, and let two functions from Ω to R be given, say ξ(ω) = ω and η(ω) = ω². While no σ-algebra F has been specified, one cannot speak of measurability: a function measurable with respect to one σ-algebra may fail to be measurable with respect to another.

1. If F is the set of all subsets of Ω, then ξ and η are random variables, since any set of elementary outcomes belongs to F, including {ω : ξ(ω) = x} or {ω : η(ω) = x}. The correspondence between the values of the random variables ξ and η and the probabilities of taking these values can be written in the form of "probability tables" or, in short, "distribution tables".

Here each of the six values is taken with probability 1/6.


2. Let the σ-algebra of events F consist of four sets:

F = {Ω, ∅, {2, 4, 6}, {1, 3, 5}},

i.e. an event is, besides the certain and the impossible event, "an even number of points" or "an odd number of points". Let us make sure that with such a relatively poor σ-algebra neither ξ nor η is a random variable, since they are non-measurable. Take, say, x = 1: we see that {ω : ξ(ω) = 1} = {1} ∉ F and {ω : η(ω) = 1} = {1} ∉ F.
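Case 2 of Example 25 can be checked mechanically: over a finite Ω, a function is measurable with respect to a σ-algebra F exactly when the preimage of each of its values belongs to F (the functions ξ(ω) = ω and η(ω) = ω² are taken as in the reconstruction above):

```python
Omega = {1, 2, 3, 4, 5, 6}
F = [set(), Omega, {2, 4, 6}, {1, 3, 5}]    # the "poor" sigma-algebra

def measurable(func):
    # for finite Omega it suffices to check preimages of single values
    values = {func(w) for w in Omega}
    return all({w for w in Omega if func(w) == v} in F for v in values)

xi = lambda w: w
eta = lambda w: w * w
print(measurable(xi), measurable(eta))   # False False: neither is a random variable here
```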


2. Numerical characteristics of random variables


Expected value. The mathematical expectation of a discrete random variable X, which takes a finite number of values xi with probabilities pi, is the sum:

M(X) = Σ xi pi. (6a)

The mathematical expectation of a continuous random variable X is the integral of the product of its values x and the probability distribution density f(x):

M(X) = ∫_{−∞}^{+∞} x f(x) dx. (6b)


The improper integral (6b) is assumed to be absolutely convergent (otherwise the mathematical expectation M(X) is said not to exist). The mathematical expectation characterizes the average value of the random variable X. Its dimension coincides with the dimension of the random variable. Properties of the mathematical expectation:

M(C) = C, M(CX) = C·M(X), M(X + Y) = M(X) + M(Y), and M(XY) = M(X)·M(Y) for independent X and Y. (7)
Variance. The variance of a random variable X is the number:

D(X) = M[(X − M(X))²]. (8)

Variance is a characteristic of the scattering of the values of the random variable X relative to its mean value M(X). The dimension of the variance equals the dimension of the random variable squared. Based on the definition of variance (8) and of the mathematical expectation, (6a) for a discrete random variable and (6b) for a continuous one, we obtain similar expressions for the variance:

D(X) = Σ (xi − m)² pi, D(X) = ∫_{−∞}^{+∞} (x − m)² f(x) dx. (9)

Here m = M(X).

Properties of the variance:

D(C) = 0, D(CX) = C²·D(X), and D(X + Y) = D(X) + D(Y) for independent X and Y. (10)


Mean square (standard) deviation:

σ(X) = √D(X). (11)


Since the dimension of the standard deviation is the same as that of the random variable, it is used as a measure of scattering more often than the variance.

Distribution moments. The concepts of mathematical expectation and variance are special cases of a more general concept for the numerical characteristics of random variables: the moments of the distribution. The moments of the distribution of a random variable are introduced as the mathematical expectations of some simple functions of the random variable. Thus, the moment of order k relative to the point x0 is the mathematical expectation M[(X − x0)^k]. Moments relative to the origin x = 0 are called initial moments and are denoted:

νk = M(X^k). (12)


The initial moment of the first order is the center of the distribution of the random variable under consideration:

ν1 = M(X) = m. (13)


The moments relative to the center of the distribution x = m are called central moments and are denoted:

μk = M[(X − m)^k]. (14)


It follows from (7) that the central moment of the first order is always zero:

μ1 = M(X − m) = m − m = 0. (15)


The central moments do not depend on the origin chosen for the values of the random variable, since when they are shifted by a constant value C the center of the distribution shifts by the same value C, while the deviation from the center does not change:


X - m = (X - C) - (m - C).


It is now clear that the variance is the central moment of the second order:

μ2 = M[(X − m)²] = D(X). (16)


Asymmetry. The central moment of the third order:

μ3 = M[(X − m)³] (17)


serves to assess the asymmetry (skewness) of the distribution. If the distribution is symmetric about the point x = m, the central moment of the third order is zero (as are all central moments of odd order). Therefore, if the central moment of the third order is nonzero, the distribution cannot be symmetric. The magnitude of the asymmetry is estimated using the dimensionless coefficient of asymmetry:

A = μ3 / σ³. (18)


The sign of the asymmetry coefficient (18) indicates right-sided or left-sided asymmetry (Fig. 1).

Fig. 1. Types of distribution asymmetry


Kurtosis. The central moment of the fourth order:

μ4 = M[(X − m)⁴] (19)


serves to assess the so-called kurtosis, which determines the degree of steepness (peakedness) of the distribution curve near the center of the distribution in relation to the normal distribution curve. Since for the normal distribution μ4/σ⁴ = 3, the kurtosis is taken to be:

E = μ4/σ⁴ − 3. (20)


Fig. 2 shows examples of distribution curves with different kurtosis values. For the normal distribution E = 0. Curves that are more peaked than the normal one have positive kurtosis, more flat-topped ones negative.


Fig. 2. Distribution curves with different degrees of steepness (kurtosis)


Higher-order moments are usually not used in engineering applications of mathematical statistics.
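A sketch estimating the coefficients (18) and (20) from a sample, replacing the theoretical moments with their empirical counterparts (the exponential sample is an arbitrary test case; its theoretical skewness is 2 and its kurtosis 6):

```python
import random

random.seed(3)
xs = [random.expovariate(1.0) for _ in range(200_000)]

n = len(xs)
m = sum(xs) / n
mu2 = sum((x - m) ** 2 for x in xs) / n    # central moment of order 2 (variance)
mu3 = sum((x - m) ** 3 for x in xs) / n    # central moment of order 3
mu4 = sum((x - m) ** 4 for x in xs) / n    # central moment of order 4

sigma = mu2 ** 0.5
A = mu3 / sigma ** 3                        # asymmetry coefficient (18)
E = mu4 / sigma ** 4 - 3                    # kurtosis (20)
print(A, E)    # roughly 2 and 6 for the exponential law
```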

Mode. The mode of a discrete random variable is its most probable value. The mode of a continuous random variable is the value at which the probability density is maximal (Fig. 2). If the distribution curve has one maximum, the distribution is called unimodal; if it has more than one maximum, the distribution is called polymodal. There are also distributions whose curves have a minimum rather than a maximum; they are called antimodal. In the general case, the mode and the mathematical expectation of a random variable do not coincide. In the particular case of a modal (i.e. possessing a mode) symmetric distribution, and provided that a mathematical expectation exists, the latter coincides with the mode and the center of symmetry of the distribution.

Median. The median of a random variable X is the value Me for which P(X < Me) = P(X > Me), i.e. it is equally probable that the random variable X turns out to be less than or greater than Me. Geometrically, the median is the abscissa of the point at which the area under the distribution curve is divided in half. In the case of a symmetric modal distribution, the median, the mode and the mathematical expectation coincide.


3. Statistical estimation of the laws of distribution of random variables


The general population is the totality of all objects to be studied, or of the possible results of all observations made under identical conditions on one object.

A sample population, or sample, is a set of objects or results of observations of an object, selected at random from the general population.

The sample size is the number of objects or observations in the sample.

The specific values of the sample are called the observed values of the random variable X. The observed values are recorded in a protocol, which is a table. The completed protocol is the primary form of recording of the collected material. To obtain trustworthy conclusions, the sample must be sufficiently representative in size. A large sample is an unordered set of numbers, so for analysis the sample is brought to a visual, ordered form. To do this, the largest and smallest values of the random variable are found in the protocol. The sample, sorted in ascending order, is shown in Table 1.

Table 1. Protocol

-8.66  -5.49  -4.11  -3.48  -2.90  -2.32  -1.82  -1.09  -0.44  0.64
-8.31  -4.71  -3.92  -3.41  -2.85  -2.31  -1.82  -1.01  -0.43  0.71
-8.23  -4.68  -3.85  -3.33  -2.83  -2.29  -1.80  -0.99  -0.43  0.73
-7.67  -4.60  -3.85  -3.25  -2.77  -2.27  -1.77  -0.95  -0.31  0.99
-6.64  -4.43  -3.81  -3.08  -2.72  -2.25  -1.73  -0.89  -0.30  1.03
-6.60  -4.38  -3.80  -3.07  -2.67  -2.19  -1.38  -0.70   0.04  1.05
-6.22  -4.38  -3.77  -3.01  -2.60  -2.15  -1.32  -0.56   0.08  1.13
-5.87  -4.25  -3.73  -3.01  -2.49  -2.09  -1.30  -0.51   0.15  1.76
-5.74  -4.18  -3.59  -2.99  -2.37  -2.01  -1.28  -0.49   0.26  2.95
-5.68  -4.14  -3.49  -2.98  -2.33  -1.91  -1.24  -0.48   0.53  4.42

(reading down the columns gives the sample in ascending order)

The sample range is the difference between the largest and smallest values of the random variable X:

R = x_max − x_min = 4.42 − (−8.66) = 13.08.

The sample range is divided into k intervals (bins). The number of bins is set, depending on the size of the sample range, from 8 to 25; in this term paper we take k = 10.

Then the length of each interval is:

h = R/k = 13.08/10 = 1.308.

Let us count the number of observed values in each interval of the protocol and denote these counts m1, m2, …, m10.

We call mi the hit frequency of the random variable in the i-th interval. If an observed value of the random variable coincides with the end of an interval, this value is assigned to one of the intervals by convention.

After determining the frequencies mi, we determine the relative frequency of the random variable, i.e. the ratio of the frequency mi to the total number of observed values n:

pi* = mi/n.

The completeness condition for the relative frequencies is Σ pi* = 1.

Let us find the middle of each interval: xi* = (xi + xi+1)/2.

Let us compose Table 2.

The table of the boundary values of the bins and the corresponding relative frequencies pi*, where i = 1, 2, 3, …, k, is called a statistical series. A graphical representation of a statistical series is called a histogram. It is constructed as follows: the intervals are laid off along the abscissa axis, and on each such interval, as on a base, a rectangle is constructed whose area is equal to the corresponding relative frequency;

hi = pi*/h is then the height of the rectangle.


Table 2

Interval no. | Left border | Right border | Interval | Midpoint xi* | Frequency mi | Relative frequency pi* | Rectangle height hi
1 | -8.66 | -7.352 | (-8.66; -7.352) | -8.006 | 4 | 0.04 | 0.0306
2 | -7.352 | -6.044 | (-7.352; -6.044) | -6.698 | 3 | 0.03 | 0.0229
3 | -6.044 | -4.736 | (-6.044; -4.736) | -5.390 | 4 | 0.04 | 0.0306
4 | -4.736 | -3.428 | (-4.736; -3.428) | -4.082 | 20 | 0.20 | 0.1529
5 | -3.428 | -2.12 | (-3.428; -2.12) | -2.774 | 26 | 0.26 | 0.1988
6 | -2.12 | -0.812 | (-2.12; -0.812) | -1.466 | 18 | 0.18 | 0.1376
7 | -0.812 | 0.496 | (-0.812; 0.496) | -0.158 | 14 | 0.14 | 0.1070
8 | 0.496 | 1.804 | (0.496; 1.804) | 1.150 | 9 | 0.09 | 0.0688
9 | 1.804 | 3.112 | (1.804; 3.112) | 2.458 | 1 | 0.01 | 0.0076
10 | 3.112 | 4.42 | (3.112; 4.42) | 3.766 | 1 | 0.01 | 0.0076
Sum | | | | | 100 | 1 |
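As a cross-check of Table 2, a short sketch (assuming NumPy is available) that rebuilds the statistical series from the protocol data:

```python
import numpy as np

# The 100 observations from Table 1, read down the columns.
data = np.array([
    -8.66, -8.31, -8.23, -7.67, -6.64, -6.60, -6.22, -5.87, -5.74, -5.68,
    -5.49, -4.71, -4.68, -4.60, -4.43, -4.38, -4.38, -4.25, -4.18, -4.14,
    -4.11, -3.92, -3.85, -3.85, -3.81, -3.80, -3.77, -3.73, -3.59, -3.49,
    -3.48, -3.41, -3.33, -3.25, -3.08, -3.07, -3.01, -3.01, -2.99, -2.98,
    -2.90, -2.85, -2.83, -2.77, -2.72, -2.67, -2.60, -2.49, -2.37, -2.33,
    -2.32, -2.31, -2.29, -2.27, -2.25, -2.19, -2.15, -2.09, -2.01, -1.91,
    -1.82, -1.82, -1.80, -1.77, -1.73, -1.38, -1.32, -1.30, -1.28, -1.24,
    -1.09, -1.01, -0.99, -0.95, -0.89, -0.70, -0.56, -0.51, -0.49, -0.48,
    -0.44, -0.43, -0.43, -0.31, -0.30,  0.04,  0.08,  0.15,  0.26,  0.53,
     0.64,  0.71,  0.73,  0.99,  1.03,  1.05,  1.13,  1.76,  2.95,  4.42,
])

k = 10
R = data.max() - data.min()               # sample range: 13.08
h = R / k                                 # bin width: 1.308
edges = data.min() + h * np.arange(k + 1)

m, _ = np.histogram(data, bins=edges)     # absolute frequencies m_i
p = m / len(data)                         # relative frequencies p_i*
heights = p / h                           # histogram rectangle heights

for i in range(k):
    print(f"({edges[i]:7.3f}; {edges[i+1]:7.3f})  mid={0.5*(edges[i]+edges[i+1]):7.3f}  "
          f"m={m[i]:3d}  p*={p[i]:.2f}  height={heights[i]:.4f}")
```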

Figure 3. Histogram of the statistical series


The statistical distribution function is the relative frequency with which the random variable does not exceed a given value x:

F*(x) = P*(X < x).

For a discrete random variable X, the statistical distribution function is found by the formula:

F*(x) = Σ pi* over all xi* < x,

where xi* are the midpoints of the intervals and pi* are the corresponding relative frequencies, i = 1, 2, …, k.

Let us write the statistical distribution function in expanded form:

F*(x) = 0 for x ≤ −8.006; 0.04 for −8.006 < x ≤ −6.698; 0.07 for −6.698 < x ≤ −5.390; 0.11 for −5.390 < x ≤ −4.082; 0.31 for −4.082 < x ≤ −2.774; 0.57 for −2.774 < x ≤ −1.466; 0.75 for −1.466 < x ≤ −0.158; 0.89 for −0.158 < x ≤ 1.150; 0.98 for 1.150 < x ≤ 2.458; 0.99 for 2.458 < x ≤ 3.766; 1 for x > 3.766.

The graph of the statistical distribution function is a step line whose jump points are the midpoints of the intervals and whose jumps are equal to the corresponding relative frequencies.


Figure 3. Graph of the statistical distribution function


Calculation of the numerical characteristics of the statistical series:

the statistical mathematical expectation m* = Σ xi* pi*;

the statistical variance D* = Σ (xi* − m*)² pi*;

the statistical standard deviation σ* = √D*.

The statistical expectation, or statistical average, is the arithmetic mean of the observed values of the random variable X.

The statistical variance is the arithmetic mean of the quantity (xi − m*)², i.e. D* = Σ (xi − m*)² / n.

With a large sample size, calculations by these formulas become cumbersome. To simplify them, the statistical series with its interval boundaries and relative frequencies pi*, i = 1, 2, 3, …, k, is used: the midpoints xi* of the intervals are found, and then all sample values falling within an interval are replaced by the single value xi*, so that mi such values occur in each interval. Then

m* = Σ xi* pi*,

where xi* is the mid value of the corresponding interval and pi* is the relative frequency of that interval.

Table 4. Numerical characteristics

i | xi* | pi* | xi*·pi* | (xi* − m*)² | (xi* − m*)²·pi*
1 | -8.006 | 0.04 | -0.3202 | 31.4869 | 1.2595
2 | -6.698 | 0.03 | -0.2009 | 18.5185 | 0.5556
3 | -5.390 | 0.04 | -0.2156 | 8.9719 | 0.3589
4 | -4.082 | 0.20 | -0.8164 | 2.8470 | 0.5694
5 | -2.774 | 0.26 | -0.7212 | 0.1438 | 0.0374
6 | -1.466 | 0.18 | -0.2639 | 0.8624 | 0.1552
7 | -0.158 | 0.14 | -0.0221 | 5.0027 | 0.7004
8 | 1.150 | 0.09 | 0.1035 | 12.5647 | 1.1308
9 | 2.458 | 0.01 | 0.0246 | 23.5485 | 0.2355
10 | 3.766 | 0.01 | 0.0377 | 37.9539 | 0.3795

Statistical expectation m* = −2.3947; statistical variance D* = 5.3822; statistical standard deviation σ* = 2.3200.
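A cross-check of Table 4 (the midpoints and relative frequencies are copied from Table 2):

```python
mid = [-8.006, -6.698, -5.390, -4.082, -2.774, -1.466, -0.158, 1.150, 2.458, 3.766]
p   = [0.04, 0.03, 0.04, 0.20, 0.26, 0.18, 0.14, 0.09, 0.01, 0.01]

m_star = sum(x * pi for x, pi in zip(mid, p))                   # -2.3947
D_star = sum((x - m_star) ** 2 * pi for x, pi in zip(mid, p))   # 5.3822
print(m_star, D_star, D_star ** 0.5)                            # sigma* = 2.3200
```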

The statistical expectation m* determines the position of the center of grouping of the observed values of the random variable.

The variance D* and the standard deviation σ* characterize the scattering of the observed values of the random variable around m*.

In any statistical distribution, there are inevitably elements of chance. However, with a very large number of observations, these accidents are smoothed out, and random phenomena reveal their inherent regularity.

When processing statistical material, one has to decide how to choose a theoretical curve for a given statistical series. This theoretical distribution curve should express the essential features of the statistical distribution; this task is called the problem of smoothing, or fitting, the statistical series.

Sometimes the general form of the distribution of a random variable X follows from the very nature of this random variable.

Let the random variable X be the result of measuring some physical quantity with a device.

X = exact value of the physical quantity + device error.

The random error of a device during measurement has a cumulative nature and is distributed according to the normal law. Consequently, the random variable X has the same distribution, i.e. a normal distribution with probability density:

f(x) = (1/(σ√(2π))) exp(−(x − m)² / (2σ²)),

where m is the expectation, σ is the standard deviation, and t = (x − m)/σ denotes the normalized argument.


The parameters m and σ are determined so that the numerical characteristics of the theoretical distribution are equal to the corresponding numerical characteristics of the statistical distribution. For the normal distribution we take m = m* = −2.3947 and σ = σ* = 2.3200; the normal density then takes the form f(x) = φ(t)/σ*, where φ(t) = (1/√(2π)) e^(−t²/2) is the tabulated function and t = (x − m*)/σ*.

Table 5. Equalization curve

i | Midpoint xi* | ti = (xi* − m*)/σ* | φ(ti) | Normal curve f(xi*) = φ(ti)/σ*
1 | -8.0060 | -2.4187 | 0.0214 | 0.0092
2 | -6.6980 | -1.8549 | 0.0714 | 0.0308
3 | -5.3900 | -1.2911 | 0.1734 | 0.0747
4 | -4.0820 | -0.7273 | 0.3062 | 0.1320
5 | -2.7740 | -0.1635 | 0.3936 | 0.1697
m* | -2.3947 | 0 | 0.3989 | 0.1720
6 | -1.4660 | 0.4003 | 0.3682 | 0.1587
7 | -0.1580 | 0.9641 | 0.2507 | 0.1080
8 | 1.1500 | 1.5279 | 0.1242 | 0.0535
9 | 2.4580 | 2.0917 | 0.0448 | 0.0193
10 | 3.7660 | 2.6555 | 0.0117 | 0.0051

We plot the theoretical normal curve by points on the same chart as the histogram of the statistical series (Figure 6).


Figure 6. Histogram and the equalizing normal curve


Equalization of the statistical distribution function

We align the statistical distribution function with the distribution function of the normal law:

F(x) = 0.5 + Φ(t), t = (x − m*)/σ*,

where Φ(t) = (1/√(2π)) ∫₀ᵗ e^(−z²/2) dz is the Laplace function.


Table 7. Distribution function

i | Midpoint xi* | Laplace function Φ(ti) | Distribution function F(xi*)
1 | -8.0060 | -0.4922 | 0.0078
2 | -6.6980 | -0.4682 | 0.0318
3 | -5.3900 | -0.4017 | 0.0983
4 | -4.0820 | -0.2665 | 0.2335
5 | -2.7740 | -0.0649 | 0.4351
m* | -2.3947 | 0 | 0.5000
6 | -1.4660 | 0.1555 | 0.6555
7 | -0.1580 | 0.3325 | 0.8325
8 | 1.1500 | 0.4367 | 0.9367
9 | 2.4580 | 0.4818 | 0.9818
10 | 3.7660 | 0.4960 | 0.9960

We plot the graph of the theoretical distribution function by points, together with the graph of the statistical distribution function (Figure 6).


Figure 6. Statistical and theoretical distribution functions


Let a random variable X be studied, with mathematical expectation m and variance D; both parameters are unknown.

Let x1, x2, x3, …, xn be the sample obtained as the result of n independent observations of the random variable X. To emphasize the random nature of the quantities x1, x2, x3, …, xn, we rewrite them as X1, X2, X3, …, Xn, where Xi is the value of the random variable X in the i-th experiment.

It is required, on the basis of these experimental data, to estimate the mathematical expectation and the variance of the random variable. Such estimates are called point estimates. As estimates of m and D one can take the statistical mathematical expectation m* and the statistical variance D*, where

m* = (1/n) Σ Xi, D* = (1/n) Σ (Xi − m*)².

Before the experiment, the sample X1, X2, X3, …, Xn is a set of independent random variables, each of which has the same mathematical expectation, variance and probability distribution as the random variable X itself. Thus:

M(Xi) = m, D(Xi) = D, where i = 1, 2, 3, …, n.


Based on this, let us find the mathematical expectation and the variance of the random variable m* (using the properties of the mathematical expectation):

M(m*) = M((1/n) Σ Xi) = (1/n) Σ M(Xi) = m;

D(m*) = D((1/n) Σ Xi) = (1/n²) Σ D(Xi) = D/n.

Thus, the mathematical expectation of the statistical mean m* is equal to the exact value of the mathematical expectation m of the measured quantity, and the variance of the statistical mean is n times smaller than the variance of the individual measurement results:


D(m*) = D/n → 0 as n → ∞.


This means that for a large sample size n the statistical mean m* is an almost non-random quantity: it deviates only slightly from the exact value m of the mathematical expectation. This statement is Chebyshev's law of large numbers.
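A simulation sketch of the two statements M(m*) = m and D(m*) = D/n (the normal variable with m = 5, σ = 2 and the sizes are our illustrative choices):

```python
import random
import statistics

random.seed(4)
m, sigma, n, trials = 5.0, 2.0, 100, 2_000

# compute the statistical mean of n observations, many times over
means = [statistics.fmean(random.gauss(m, sigma) for _ in range(n))
         for _ in range(trials)]

print(statistics.fmean(means))       # ~ m = 5.0
print(statistics.variance(means))    # ~ D/n = 4/100 = 0.04
```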

Point estimates of the unknown mathematical expectation and variance are of great importance at the initial stage of processing statistical data. Their disadvantage is that it is not known with what accuracy they give the estimated parameter.

Suppose that for a given sample X1, X2, X3, …, Xn accurate statistical estimates m* and D* are obtained; then the numerical characteristics of the random variable X are approximately m ≈ m*, D ≈ D*. For a sample of small size the question of the accuracy of the estimates is essential, since the deviations between m and m*, and between D and D*, may not be small enough. In addition, when solving practical problems it is required not only to find approximate values of m and D, but also to evaluate their accuracy and reliability. Let m* be a point estimate for m. Obviously, the more accurately m* determines m, the smaller the modulus of the difference |m − m*|. Let |m − m*| < ε, where ε > 0; then the smaller ε, the more accurate the estimate of m. Thus ε > 0 characterizes the accuracy of the parameter estimation. However, statistical methods do not allow one to assert categorically that the estimate m* of the true value m satisfies |m − m*| < ε; one can only speak of the probability β with which this inequality holds:

β = P(|m − m*| < ε).

Thus β is the confidence level, or reliability, of the estimate; the value of β is selected in advance depending on the problem being solved. The reliability β is customarily chosen as 0.9, 0.95, 0.99 or 0.999; events with such probabilities are practically certain. For a given confidence level, one can find the number ε > 0 from β = P(|m − m*| < ε).

Then we obtain the interval (m* − ε; m* + ε), which covers the true value of the mathematical expectation m with probability β; the length of this interval is 2ε. This interval is called the confidence interval, and this way of estimating the unknown parameter m is called interval estimation.



Let a sample X1, X2, X3, …, Xn be given, and let m* and D* be found for this sample. It is required to find the confidence interval for the mathematical expectation m with confidence level β. The quantity m* is a random quantity with mathematical expectation M(m*) = m and variance D(m*) = D/n.

The random variable m* has a cumulative nature and, with a large sample size, is distributed according to a law close to the normal one. Then the probability that m* falls in the interval (m − ε, m + ε) is equal to:

β = P(|m* − m| < ε) = 2Φ(ε√n/σ), (3)

where Φ is the Laplace function and σ = √D.

From formula (3) and the tables of the Laplace function, we find the number ε > 0 and write down the confidence interval for the exact value m of the random variable X with reliability β.

In this term paper the value σ is replaced by σ*, and formula (3) then takes the form:

β = 2Φ(ε√n/σ*).

Let us find the confidence interval in which the mathematical expectation lies, at β = 0.99, n = 100, m* = −2.3947, σ* = 2.32.

From 2Φ(ε√n/σ*) = 0.99 we have Φ(ε√n/σ*) = 0.495; according to the Laplace tables we find ε√n/σ* = 2.58.

From here ε = 2.58·σ*/√n = 2.58·2.32/10 = 0.5986.

The confidence interval in which the exact value of the mathematical expectation lies with 99% probability is (m* − ε; m* + ε) = (−2.9933; −1.7961).
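The same ε can be cross-checked with scipy instead of printed Laplace tables; 2Φ(t) = 0.99 corresponds to the 0.995 quantile of the standard normal (the small discrepancy comes from the tables rounding that quantile to 2.58):

```python
from math import sqrt
from scipy.stats import norm

beta, n, m_star, s_star = 0.99, 100, -2.3947, 2.32

t = norm.ppf((1 + beta) / 2)    # ~2.5758 (the tables give 2.58)
eps = t * s_star / sqrt(n)      # ~0.5976, vs 0.5986 from the table value
print(eps, (m_star - eps, m_star + eps))
```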


Conclusion


Solving problems of structural-parametric identification with the limited sample sizes that metrologists, as a rule, possess exacerbates the problem. In this case the correctness of applying statistical methods of analysis is all the more important: one should use estimates with the best statistical properties and criteria with the greatest power.

When solving identification problems, it is preferable to rely on the classical approach. It is recommended to consider a wider set of distribution laws, including models in the form of mixtures of laws. In this case, for any empirical distribution we can always build an adequate, statistically much better substantiated mathematical model.

One should focus on the use and development of software systems that ensure the solution of problems of structural and parametric identification of distribution laws for any form of recorded observations (measurements), including modern methods of statistical analysis, and on the wide but correct use of computer modeling methods in research.

How are probability theory and mathematical statistics used? These disciplines are the basis of probabilistic-statistical decision-making methods. To use their mathematical apparatus, decision-making problems must be expressed in terms of probabilistic-statistical models. The application of a specific probabilistic-statistical decision-making method consists of three stages:

  • transition from economic, managerial, technological reality to an abstract mathematical-statistical scheme, i.e. building a probabilistic model of the control system, technological process, or decision-making procedure, in particular based on the results of statistical control, etc.;
  • carrying out calculations and drawing conclusions by purely mathematical means within the framework of the probabilistic model;
  • interpretation of the mathematical-statistical conclusions in relation to the real situation and making an appropriate decision (for example, on the conformity or non-conformity of product quality with established requirements, or on the need to adjust the technological process), in particular conclusions on the proportion of defective units in a batch or on the specific form of the distribution laws of the monitored parameters of the technological process, etc.

Mathematical statistics uses the concepts, methods and results of probability theory. Let us consider the main issues of building probabilistic decision-making models in economic, managerial, technological and other situations. The active and correct use of normative-technical and instructional-methodological documents on probabilistic-statistical decision-making methods requires prior knowledge: one needs to know under what conditions a particular document should be applied, what initial information is necessary for its selection and application, what decisions should be made based on the results of data processing, and so on.

Examples of the application of probability theory and mathematical statistics. Let us consider several examples where probabilistic-statistical models are a good tool for solving managerial, production, economic and national-economic problems. Thus, for example, in A.N. Tolstoy's novel The Road to Calvary (vol. 1) it is said: "the workshop gives twenty-three percent rejects, and you stick to this figure," Strukov said to Ivan Ilyich.

The question arises how to understand these words in a conversation between factory managers, since one unit of production cannot be 23% defective: it is either good or defective. Probably, Strukov meant that a large batch contains approximately 23% defective units. Then the question arises, what does "approximately" mean? Suppose 30 out of 100 tested units of production turn out to be defective, or 300 out of 1,000, or 30,000 out of 100,000, etc.; should Strukov be accused of lying?

Or another example. The coin to be used as a lot must be "symmetric": when tossed, on average in half of the cases heads should come up, and in half of the cases tails. But what does "on average" mean? If you carry out many series of 10 tosses each, you will often encounter series in which the coin comes up heads exactly 4 times. For a symmetric coin this will occur in 20.5% of such series. And if there are 40,000 heads per 100,000 tosses, can the coin be considered symmetric? The decision-making procedure here is based on probability theory and mathematical statistics.
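The 20.5% figure is the binomial probability of exactly 4 heads in a series of 10 tosses of a symmetric coin:

```python
from math import comb

print(comb(10, 4) / 2 ** 10)   # 210/1024 ~ 0.2051, i.e. about 20.5% of series
```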

The example in question may not seem serious enough, but it is not so. The drawing of lots is widely used in the organization of industrial technical-economic experiments. For example, when processing the results of measuring the quality indicator (friction moment) of bearings depending on various technological factors (the influence of a conservation environment, methods of preparing bearings before measurement, the effect of bearing load during measurement, etc.), suppose it is necessary to compare the quality of bearings depending on the results of their storage in different conservation oils, i.e. in oils of two different compositions. When planning such an experiment, the question arises which bearings should be placed in the oil of one composition and which in the other, in such a way as to avoid subjectivity and to ensure the objectivity of the decision.

The answer to this question can be obtained by drawing lots. A similar example can be given with quality control of any product. To decide whether an inspected batch of products meets the established requirements, a sample is taken from it, and on the basis of the sample results a conclusion is made about the entire batch. In this case it is very important to avoid subjectivity in forming the sample, i.e. each unit of production in the controlled lot must have the same probability of being selected for the sample. Under production conditions, the selection of units of production for the sample is usually carried out not by lot but by special tables of random numbers or with the help of computer random-number generators.

Similar problems of ensuring objectivity of comparison arise when comparing different schemes of organizing production or remuneration, during tenders and competitions, when selecting candidates for vacant positions, and so on. Everywhere a draw or a similar procedure is needed. Let us explain with the example of identifying the strongest and the second-strongest team when organizing a tournament by the Olympic system (the loser is eliminated). Suppose the stronger team always defeats the weaker one. It is clear that the strongest team will definitely become the champion. The second-strongest team will reach the final if and only if it has no games with the future champion before the final; if such a game is planned, the second-strongest team will not make it to the final. The planner of the tournament can either "knock out" the second-strongest team ahead of schedule, pairing it with the leader in the first meeting, or secure it second place by ensuring meetings with weaker teams up to the final. To avoid subjectivity, a draw is carried out. For an 8-team tournament, the probability that the two strongest teams will meet in the final is 4/7; accordingly, with probability 3/7 the second-strongest team will leave the tournament ahead of schedule.
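The 4/7 figure can be reproduced by simulating a random draw of an 8-team Olympic-system bracket in which the stronger team always wins (teams are labeled by strength, 0 being the strongest):

```python
import random

random.seed(5)

def two_best_meet_in_final(k=8):
    teams = list(range(k))          # 0 = strongest, 1 = second strongest
    random.shuffle(teams)           # the draw
    while len(teams) > 2:
        # adjacent pairs play; the stronger (smaller label) team advances
        teams = [min(pair) for pair in zip(teams[::2], teams[1::2])]
    return set(teams) == {0, 1}

trials = 200_000
freq = sum(two_best_meet_in_final() for _ in range(trials)) / trials
print(freq, 4 / 7)   # ~0.5714
```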

Any measurement of units of production (with a caliper, micrometer, ammeter, etc.) involves errors. To find out whether there are systematic errors, it is necessary to make repeated measurements of a unit of production whose characteristics are known (for example, a standard sample). It should be remembered that in addition to systematic error there is also random error.

Therefore the question arises how to find out from the measurement results whether there is a systematic error. If we note only whether the error obtained in the next measurement is positive or negative, the problem can be reduced to the previous one. Indeed, let us compare a measurement with a coin toss: a positive error with heads, a negative one with tails (a zero error, given a sufficient number of scale divisions, practically never occurs). Then checking for the absence of a systematic error is equivalent to checking the symmetry of the coin.

The purpose of this reasoning is to reduce the problem of checking for the absence of a systematic error to the problem of checking the symmetry of a coin. This reasoning leads to the so-called "sign test" in mathematical statistics.

In the statistical regulation of technological processes, on the basis of the methods of mathematical statistics, rules and plans for statistical process control are developed, aimed at the timely detection of disruptions in technological processes, taking measures to adjust them, and preventing the release of products that do not meet the established requirements. These measures are aimed at reducing production costs and losses from the supply of substandard units. In statistical acceptance control, based on the methods of mathematical statistics, quality control plans are developed by analyzing samples from batches of products. The difficulty lies in the ability to build correct probabilistic-statistical decision-making models, on the basis of which the questions posed above can be answered. In mathematical statistics, probabilistic models and methods for testing hypotheses have been developed for this purpose, in particular hypotheses that the proportion of defective units of production equals a certain number, for example 0.23 (remember the words of Strukov from A.N. Tolstoy's novel).

Estimation tasks... In a number of managerial, industrial, economic, and national-economic situations, problems of a different type arise: problems of estimating the characteristics and parameters of probability distributions.

Let us consider an example. Suppose a batch of N light bulbs is submitted for inspection, and a sample of n bulbs is selected from it at random. A number of natural questions arise. How can the average service life of the bulbs be determined from the test results for the sample, and with what accuracy can this characteristic be estimated? How does the accuracy change if a larger sample is taken? For what number of hours can it be guaranteed that at least 90% of the bulbs will last at least that long?
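
The first two questions can be illustrated with a short sketch: it estimates the mean lifetime from a hypothetical sample and attaches an approximate 95% confidence interval, whose width shrinks as 1/sqrt(n). The data and the normal-approximation quantile 1.96 are assumptions of the example.

    import math
    import statistics

    # Hypothetical lifetimes (in hours) of n tested bulbs; illustrative data only.
    lifetimes = [980, 1120, 1050, 870, 1300, 990, 1150, 1010, 920, 1080]
    n = len(lifetimes)
    mean = statistics.mean(lifetimes)
    s = statistics.stdev(lifetimes)        # sample standard deviation
    half_width = 1.96 * s / math.sqrt(n)   # normal approximation; for small n
                                           # a Student t quantile is more accurate
    print(f"mean = {mean:.0f} h, approx. 95% CI: "
          f"{mean - half_width:.0f} to {mean + half_width:.0f} h")
    # Quadrupling the sample size halves the margin: accuracy grows as 1/sqrt(n).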

Suppose that when testing a sample of n light bulbs, X of them turned out to be defective. Then the following questions arise: what limits can be given for the number of defective bulbs in the whole batch, for the level of defectiveness, and so on?
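
One standard way to answer such questions is a confidence interval for a binomial proportion. The sketch below uses the Wilson interval, one of several common constructions, with hypothetical numbers.

    import math

    def wilson_interval(x, n, z=1.96):
        """Wilson approximate 95% confidence interval for a binomial
        proportion: x defectives observed in a sample of n items."""
        p = x / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return centre - half, centre + half

    # illustrative numbers: 3 defective bulbs in a sample of 50
    lo, hi = wilson_interval(3, 50)
    print(f"defect level between {lo:.1%} and {hi:.1%} (approx. 95% confidence)")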

Or, in the statistical analysis of the accuracy and stability of technological processes, such quality indicators as the mean value of the monitored parameter and the degree of its scatter are assessed. According to probability theory, it is advisable to take the mathematical expectation as the mean value of a random variable, and the variance, the standard deviation, or the coefficient of variation as measures of scatter. This raises the question: how should these statistical characteristics be estimated from sample data, and with what accuracy can this be done? Many similar examples could be given. Here it was important to show how probability theory and mathematical statistics can be used in production management when making decisions in the field of statistical product-quality management.
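
A minimal sketch of how the sample analogues of these characteristics are computed from data; the sample values are invented for the example.

    import statistics

    # Hypothetical sample of a monitored parameter.
    x = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
    mean = statistics.mean(x)        # estimate of the mathematical expectation
    var = statistics.variance(x)     # unbiased sample variance
    std = statistics.stdev(x)        # standard deviation (the "standard")
    cv = std / mean                  # coefficient of variation
    print(f"mean={mean:.3f}, var={var:.4f}, std={std:.3f}, CV={cv:.2%}")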

What is "mathematical statistics"? Mathematical statistics is understood as "a section of mathematics devoted to mathematical methods for collecting, organizing, processing and interpreting statistical data, as well as using them for scientific or practical conclusions. The rules and procedures of mathematical statistics are based on the theory of probability, which makes it possible to assess the accuracy and reliability of conclusions obtained in each problem based on the available statistical material "[[2.2], p. 326]. In this case, statistical data is called information about the number of objects in some more or less extensive set that have certain characteristics.

According to the type of problems being solved, mathematical statistics is usually divided into three sections: data description, estimation and hypothesis testing.

By the type of processed statistical data, mathematical statistics is divided into four areas:

  • one-dimensional statistics (statistics of random variables), in which the result of an observation is described by a real number;
  • multivariate statistical analysis, in which the result of observing an object is described by several numbers (a vector);
  • statistics of random processes and time series, in which the result of an observation is a function;
  • statistics of objects of non-numerical nature, in which the result of an observation is non-numerical, for example a set (a geometric figure), an ordering, or the outcome of a measurement on a qualitative attribute.

Historically, certain areas of the statistics of objects of non-numerical nature (in particular, problems of estimating the proportion of defective items and testing hypotheses about it) and one-dimensional statistics were the first to appear. Their mathematical apparatus is simpler, so the basic ideas of mathematical statistics are usually demonstrated using them.

Only those methods of data processing, i.e. of mathematical statistics, that are based on probabilistic models of the corresponding real phenomena and processes are evidence-based. These are models of consumer behavior, of the occurrence of risks, of the functioning of technological equipment, of obtaining experimental results, of the course of a disease, and so on. A probabilistic model of a real phenomenon should be considered constructed if the quantities under consideration and the relations between them are expressed in terms of probability theory. The correspondence of the probabilistic model to reality, i.e. its adequacy, is substantiated, in particular, with the help of statistical methods for testing hypotheses.

Non-probabilistic methods of data processing are exploratory; they can be used only for preliminary data analysis, since they do not make it possible to assess the accuracy and reliability of the conclusions drawn from limited statistical material.

Probabilistic-statistical methods are applicable wherever it is possible to construct and substantiate a probabilistic model of a phenomenon or process. Their use is mandatory when conclusions drawn from sample data are transferred to the whole population (for example, from a sample to a whole batch of products).

In specific applications, both widely applicable and specialized probabilistic-statistical methods are used. For example, in the section of production management devoted to statistical methods of product-quality management, applied mathematical statistics (including design of experiments) is used. Its methods are employed for the statistical analysis of the accuracy and stability of technological processes and for statistical quality assessment. Specialized methods include the methods of statistical acceptance control of product quality, statistical regulation of technological processes, and assessment and control of reliability, among others.

Applied probabilistic-statistical disciplines such as reliability theory and queuing theory are widely used. The content of the first is clear from its name; the second studies systems such as a telephone exchange, which receives calls at random moments: the demands of subscribers dialing numbers on their telephones. The durations of servicing these demands, i.e. the durations of the conversations, are also modeled as random variables. A great contribution to the development of these disciplines was made by Corresponding Member of the USSR Academy of Sciences A.Ya. Khinchin (1894-1959), Academician of the Academy of Sciences of the Ukrainian SSR B.V. Gnedenko (1912-1995), and other domestic scientists.
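
As a toy illustration of a queuing model, the following sketch simulates a single-line exchange with a Poisson stream of calls and exponential conversation lengths, where a call that finds the line busy is lost. The rates are assumed for the example; the simulated loss fraction can be compared with the one-line Erlang loss formula rho/(1 + rho).

    import random

    # Toy one-line telephone exchange: calls arrive in a Poisson stream
    # (exponential gaps), conversation lengths are exponential as well;
    # a call arriving while the line is busy is lost.  Illustrative rates.
    random.seed(1)
    arrival_rate, service_rate = 1.0, 1.5   # calls begun / completed per minute
    t, line_free_at = 0.0, 0.0
    arrived, lost = 0, 0
    while t < 100_000:                      # simulate 100,000 minutes
        t += random.expovariate(arrival_rate)   # moment of the next call
        arrived += 1
        if t < line_free_at:
            lost += 1                       # line busy: the call is lost
        else:
            line_free_at = t + random.expovariate(service_rate)
    print(lost / arrived)   # close to rho/(1 + rho) = 0.4 for rho = 2/3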

Briefly about the history of mathematical statistics... Mathematical statistics as a science begins with the work of the famous German mathematician Carl Friedrich Gauss (1777-1855), who, relying on probability theory, investigated and substantiated the method of least squares, created by him in 1795 and applied to the processing of astronomical data (to refine the orbit of the minor planet Ceres). One of the most popular probability distributions, the normal distribution, is often named after him, and in the theory of random processes the main objects of study are Gaussian processes.

At the end of the 19th and the beginning of the 20th century, a major contribution to mathematical statistics was made by English researchers, primarily K. Pearson (1857-1936) and R.A. Fisher (1890-1962). In particular, Pearson developed the chi-square test for testing statistical hypotheses, and Fisher developed analysis of variance, the theory of design of experiments, and the maximum-likelihood method of parameter estimation.

In the 1930s, the Pole Jerzy Neyman (1894-1977) and the Englishman E. Pearson developed the general theory of testing statistical hypotheses, and the Soviet mathematicians Academician A.N. Kolmogorov (1903-1987) and Corresponding Member of the USSR Academy of Sciences N.V. Smirnov (1900-1966) laid the foundations of nonparametric statistics. In the 1940s, the Romanian-born A. Wald (1902-1950) built the theory of sequential statistical analysis.

Mathematical statistics continues to develop rapidly at the present time. Thus, over the past 40 years four fundamentally new areas of research can be distinguished [[2.16]]:

  • development and implementation of mathematical methods for planning experiments;
  • development of statistics of objects of non-numerical nature as an independent direction in applied mathematical statistics;
  • development of statistical methods that are stable in relation to small deviations from the used probabilistic model;
  • widespread development of work on the creation of computer software packages designed for statistical analysis of data.

Probabilistic-statistical methods and optimization... The idea of optimization permeates modern applied mathematical statistics and other statistical methods, namely the methods of design of experiments, statistical acceptance control, statistical regulation of technological processes, and so on. On the other hand, optimization formulations in decision theory, for example in the applied theory of optimizing product quality and the requirements of standards, provide for the wide use of probabilistic-statistical methods, primarily applied mathematical statistics.

In production management, in particular when optimizing product quality and the requirements of standards, it is especially important to apply statistical methods at the initial stage of the product life cycle, i.e. at the stage of research preparation for experimental design development (formulation of future requirements for the product, preliminary design, technical specifications for experimental design development). This is due to the limited information available at the initial stage of the product life cycle and to the need to predict the technical capabilities and the economic situation for the future. Statistical methods should be applied at all stages of solving an optimization problem: when scaling variables, developing mathematical models of the functioning of products and systems, conducting technical and economic experiments, and so on.

All areas of statistics are used in optimization problems, including the optimization of product quality and of the requirements of standards: the statistics of random variables, multivariate statistical analysis, the statistics of random processes and time series, and the statistics of objects of non-numerical nature. It is advisable to choose the statistical method for analyzing specific data in accordance with the recommendations [
