Variation series and its characteristics

The distinct values observed in a sample are called variants and are denoted $x_1, x_2, \dots$. The first step is to rank the variants, i.e. to arrange them in ascending or descending order. Each variant has a weight, a number characterizing its contribution to the whole population. Frequencies or relative frequencies serve as weights.

The frequency $n_i$ of a variant $x_i$ is the number of times that variant occurs in the sample.

The relative frequency $w_i$ of a variant $x_i$ is the ratio of its frequency to the sum of the frequencies of all variants. It shows what fraction of the sample takes the given value.

A sequence of variants with their corresponding weights (frequencies or relative frequencies), written in ascending (or descending) order, is called a variation series.

Variation series are either discrete or interval.

In a discrete variation series the feature takes point values; in an interval series the feature values are given as intervals. A variation series can show the distribution of frequencies or of relative frequencies, depending on which of the two is attached to each variant.

A discrete variation series of the frequency distribution has the form:

x_i    x_1   x_2   ...   x_m
n_i    n_1   n_2   ...   n_m

The relative frequencies are found by the formula $w_i = n_i / n$, $i = 1, 2, \dots, m$, so that

$w_1 + w_2 + \dots + w_m = 1$.

Example 4.1. For a given set of numbers

4, 6, 6, 3, 4, 9, 6, 4, 6, 6

construct the discrete variation series of the frequency distribution and of the relative frequency distribution.

Solution. The size of the population is $n = 10$. The discrete distribution series has the form

x_i    3     4     6     9
n_i    1     3     5     1
w_i    0.1   0.3   0.5   0.1
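The counting in Example 4.1 can be reproduced with a short Python sketch. The variable names are illustrative and are not part of the original text.

```python
from collections import Counter

sample = [4, 6, 6, 3, 4, 9, 6, 4, 6, 6]  # data from Example 4.1
n = len(sample)

# Frequencies n_i of each distinct variant x_i, ranked in ascending order.
freq = dict(sorted(Counter(sample).items()))

# Relative frequencies w_i = n_i / n.
rel_freq = {x: n_i / n for x, n_i in freq.items()}

for x in freq:
    print(f"x = {x}: n_i = {freq[x]}, w_i = {rel_freq[x]:.1f}")
```

The relative frequencies sum to one, as the definition requires.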

Interval series are written in a similar way.

An interval variation series of the frequency distribution is written as:

Interval    a_1 – a_2   a_2 – a_3   ...   a_m – a_{m+1}
n_i         n_1         n_2         ...   n_m

The sum of all frequencies equals the total number of observations, i.e. the size of the population: $n = n_1 + n_2 + \dots + n_m$.

An interval variation series of the relative frequency distribution has the form:

Interval    a_1 – a_2   a_2 – a_3   ...   a_m – a_{m+1}
w_i         w_1         w_2         ...   w_m

The relative frequencies are found by the formula $w_i = n_i / n$, $i = 1, 2, \dots, m$.

The sum of all relative frequencies equals one: $w_1 + w_2 + \dots + w_m = 1$.

Interval series are the ones most often used in practice. If the sample contains a great many values that differ from one another by arbitrarily small amounts, a discrete series for such data is cumbersome and inconvenient for further study. In this case the data are grouped: the interval containing all values of the feature is divided into several partial intervals and, once the frequency of each interval has been counted, an interval series is obtained. Let us describe in more detail the scheme for constructing an interval series, assuming that all partial intervals have the same length.

2.2 Building an interval series

To build an interval series, you need:

Determine the number of intervals;

Determine the length of the intervals;

Determine the location of the intervals on the axis.

The number of intervals $k$ can be determined by Sturges' formula:

$k = 1 + 3.322 \lg n$,

where $n$ is the size of the whole population.

For example, with 100 values of the feature (variants), the formula gives $k = 1 + 3.322 \lg 100 \approx 7.6$, so about 8 intervals of equal length are recommended for the interval series.
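A minimal sketch of Sturges' rule follows. The formula itself is lost in this copy of the text, so the commonly used form $k = 1 + 3.322 \lg n$ (equivalently $1 + \log_2 n$) is assumed here, and the function name is hypothetical.

```python
import math


def sturges_intervals(n: int) -> float:
    """Unrounded number of intervals by Sturges' rule: k = 1 + 3.322 * lg(n).

    The constant 3.322 approximates 1 / lg(2); this form is an assumption,
    since the formula is missing from the source text.
    """
    return 1 + 3.322 * math.log10(n)


k = sturges_intervals(100)
print(round(k, 3))   # unrounded value for n = 100
print(math.ceil(k))  # rounding up is one possible convention
```

How the result is rounded is a matter of convention; rounding up is shown only as one option.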

In practice, however, the number of intervals is very often chosen by the researcher, bearing in mind that it should be neither so large that the series becomes cumbersome nor so small that properties of the distribution are lost.

The interval length $h$ is determined by the formula

$h = \frac{x_{\max} - x_{\min}}{k}$,

where $x_{\max}$ and $x_{\min}$ are the largest and the smallest variant values.

The quantity $x_{\max} - x_{\min}$ is called the range of the series.

The intervals themselves can be constructed in different ways. One of the simplest is as follows. The beginning of the first interval is taken as $a_1 = x_{\min} - h/2$. The remaining boundaries are then found by the formula $a_{i+1} = a_i + h$. Obviously, the end of the last interval $a_{m+1}$ must satisfy the condition $a_{m+1} \ge x_{\max}$.

After all the interval boundaries have been found, the frequencies (or relative frequencies) of the intervals are determined. To do this, go through all the variants and count how many of them fall into each interval. Let us consider the complete construction of an interval series using an example.

Example 4.2. For the following statistics, written in ascending order, construct an interval series with the number of intervals equal to 5:

11, 12, 12, 14, 14, 15, 21, 21, 22, 23, 25, 38, 38, 39, 42, 42, 44, 45, 50, 50, 55, 56, 58, 60, 62, 63, 65, 68, 68, 68, 70, 75, 78, 78, 78, 78, 80, 80, 86, 88, 90, 91, 91, 91, 91, 91, 93, 93, 95, 96.

Solution. There are $n = 50$ variant values in total.

The number of intervals is specified in the problem statement: $k = 5$.

The length of the intervals is $h = \frac{96 - 11}{5} = 17$.

Let us determine the boundaries of the intervals:

$a_1 = 11 - 8.5 = 2.5$; $a_2 = 2.5 + 17 = 19.5$; $a_3 = 19.5 + 17 = 36.5$;

$a_4 = 36.5 + 17 = 53.5$; $a_5 = 53.5 + 17 = 70.5$; $a_6 = 70.5 + 17 = 87.5$;

$a_7 = 87.5 + 17 = 104.5$.

To determine the frequency of an interval, we count the number of variants that fall into it. For example, the variants 11, 12, 12, 14, 14, 15 fall into the first interval from 2.5 to 19.5. There are 6 of them, so the frequency of the first interval is $n_1 = 6$ and its relative frequency is $w_1 = 6/50 = 0.12$. The second interval from 19.5 to 36.5 contains the variants 21, 21, 22, 23, 25, of which there are 5. Therefore the frequency of the second interval is $n_2 = 5$ and its relative frequency is $w_2 = 5/50 = 0.1$. Finding the frequencies and relative frequencies of all the intervals in the same way, we obtain the following interval series.

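The interval series of the frequency distribution is as follows:

Interval        n_i
2.5 – 19.5      6
19.5 – 36.5     5
36.5 – 53.5     9
53.5 – 70.5     11
70.5 – 87.5     8
87.5 – 104.5    11

The sum of the frequencies is 6 + 5 + 9 + 11 + 8 + 11 = 50.

The interval series of the relative frequency distribution is as follows:

Interval        w_i
2.5 – 19.5      0.12
19.5 – 36.5     0.1
36.5 – 53.5     0.18
53.5 – 70.5     0.22
70.5 – 87.5     0.16
87.5 – 104.5    0.22

The sum of the relative frequencies is 0.12 + 0.1 + 0.18 + 0.22 + 0.16 + 0.22 = 1. ■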
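The construction in Example 4.2 can be sketched in Python as follows. The left-closed convention `lo <= x < hi` is an assumption made for illustration; in this data set no value coincides with a boundary, so the convention does not affect the counts.

```python
data = [11, 12, 12, 14, 14, 15, 21, 21, 22, 23, 25, 38, 38, 39, 42, 42, 44,
        45, 50, 50, 55, 56, 58, 60, 62, 63, 65, 68, 68, 68, 70, 75, 78, 78,
        78, 78, 80, 80, 86, 88, 90, 91, 91, 91, 91, 91, 93, 93, 95, 96]
k = 5                             # number of intervals given in the problem
n = len(data)
h = (max(data) - min(data)) / k   # interval length

# Boundaries: start half an interval below x_min, then step by h
# until the last boundary reaches or passes x_max.
bounds = [min(data) - h / 2]
while bounds[-1] < max(data):
    bounds.append(bounds[-1] + h)

# Frequencies n_i and relative frequencies w_i of each partial interval.
freqs = [sum(lo <= x < hi for x in data) for lo, hi in zip(bounds, bounds[1:])]
rel_freqs = [f / n for f in freqs]

for (lo, hi), f, w in zip(zip(bounds, bounds[1:]), freqs, rel_freqs):
    print(f"{lo:6.1f} - {hi:6.1f}: n_i = {f:2d}, w_i = {w:.2f}")
```

Because the first boundary lies half an interval below the smallest value, this scheme yields six intervals here rather than five.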

When constructing interval series, other rules can also be applied, depending on the specific conditions of the problem:

1. An interval variation series may consist of partial intervals of different lengths. Unequal interval lengths make it possible to bring out the properties of a statistical population in which the feature is distributed unevenly. For example, if the interval boundaries represent the number of inhabitants of cities, it is advisable to use intervals of unequal length: for small cities even a small difference in the number of inhabitants matters, while for large cities a difference of tens or hundreds of inhabitants is insignificant. Interval series with unequal partial intervals are studied mainly in the general theory of statistics and are beyond the scope of this manual.

2. In mathematical statistics, interval series are sometimes considered in which the left boundary of the first interval is taken to be −∞ and the right boundary of the last interval +∞. This is done to bring the statistical distribution closer to the theoretical one.

3. When constructing an interval series, the value of some variant may coincide exactly with an interval boundary. The recommended approach is as follows. If there is only one such coincidence, the variant, with its frequency, is assigned to the interval lying closer to the middle of the interval series; if there are several, then either all of them are assigned to the intervals to their right or all to the intervals to their left.

4. Once the number of intervals and their length are determined, the intervals can be positioned in another way. Find the arithmetic mean $\bar{x}$ of all the variant values and construct the first interval so that this sample mean lies at its centre, i.e. the interval from $\bar{x} - 0.5h$ to $\bar{x} + 0.5h$. Then build the remaining intervals to the left and to the right by adding the interval length, until $x_{\min}$ and $x_{\max}$ fall into the first and last intervals respectively.

5. When the number of intervals is large, it is convenient to write the interval series vertically, i.e. to record the intervals in the first column rather than in the first row, and the frequencies (or relative frequencies) in the second column.

Sample data can be regarded as values of some random variable $X$. A random variable has its own distribution law. It is known from probability theory that the distribution law of a discrete random variable can be given as a distribution series, and that of a continuous one by a probability density function. There is, however, a universal form of the distribution law that applies to both discrete and continuous random variables: the distribution function $F(x) = P(X < x)$. For sample data one can define an analogue of the distribution function, the empirical distribution function.
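As an illustration, the empirical distribution function can be sketched as the share of sample values strictly less than $x$, mirroring $F(x) = P(X < x)$. The strict inequality and the function name are assumptions made here; the data are those of Example 4.1.

```python
def empirical_cdf(sample, x):
    """Share of sample values strictly less than x, mirroring F(x) = P(X < x)."""
    return sum(v < x for v in sample) / len(sample)


sample = [4, 6, 6, 3, 4, 9, 6, 4, 6, 6]  # data from Example 4.1

for x in (3, 4, 6, 9, 10):
    print(f"F*({x}) = {empirical_cdf(sample, x):.1f}")
```

The function is a non-decreasing step function that equals 0 up to the smallest variant and 1 beyond the largest.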

A statistical distribution series is an ordered distribution of the units of a population into groups according to a certain varying feature.
Depending on the feature underlying the formation of a distribution series, attributive and variation distribution series are distinguished.

The presence of a common feature is the basis for forming a statistical population, which consists of the results of describing or measuring the common features of the objects under study.

The subject of study in statistics is changing (varying) features, or statistical features.

Types of statistical features.

Distribution series built on a qualitative feature are called attributive. An attributive feature is one that has a name (for example, profession: seamstress, teacher, etc.).
Distribution series are usually presented as tables. Table 2.8 shows an attributive distribution series.
Table 2.8 - Distribution of types of legal assistance provided by lawyers to citizens of one of the regions of the Russian Federation.

A variation series consists of the values of a feature (or intervals of values) and their frequencies.
Distribution series built on a quantitative feature are called variation series. Any variation series consists of two elements: variants and frequencies.
Variants are the individual values that the feature takes in the variation series.
Frequencies are the counts of individual variants or of each group of the variation series, i.e. numbers showing how often particular variants occur in the distribution series. The sum of all frequencies gives the size of the whole population, its volume.
Relative frequencies are frequencies expressed as fractions of one or as percentages of the total. Accordingly, the sum of the relative frequencies is 1 or 100%. A variation series makes it possible to estimate the form of the distribution law from actual data.

Depending on the nature of the variation of the feature, discrete and interval variation series are distinguished.
An example of a discrete variation series is given in Table 2.9.
Table 2.9 - Distribution of families by the number of occupied rooms in individual apartments in 1989 in the Russian Federation.

The first column of the table contains the variants of the discrete variation series, the second column the frequencies, and the third the relative frequencies.

Variation series

A certain quantitative feature is studied in the general population. A sample of size $n$ is drawn from it at random, i.e. the number of elements in the sample is $n$. The first stage of statistical processing is ranking the sample, i.e. arranging the numbers $x_1, x_2, \dots, x_n$ in ascending order. Each observed value $x_i$ is called a variant. The frequency $m_i$ is the number of observations of the value $x_i$ in the sample. The relative frequency $w_i$ is the ratio of the frequency $m_i$ to the sample size $n$: $w_i = m_i / n$.
When studying a variation series, the concepts of accumulated frequency and accumulated relative frequency are also used. Let $x$ be some number. Then the number of variants whose values are less than $x$ is called the accumulated frequency: $m^{acc} = \sum_{x_i < x} m_i$. The ratio of the accumulated frequency to the sample size $n$ is called the accumulated relative frequency $w^{acc} = m^{acc} / n$.
A feature is said to vary discretely if its individual values (variants) differ from one another by some finite amount (usually an integer). The variation series of such a feature is called a discrete variation series.

Table 1. General form of the discrete variation series of frequencies

Feature values x_i    x_1   x_2   ...   x_n
Frequencies m_i       m_1   m_2   ...   m_n

A feature is called continuously varying if its values ​​differ from each other by an arbitrarily small amount, i.e. the attribute can take any values ​​in a certain interval. A continuous variation series for such a feature is called interval.

Table 2. General view of the interval variation series of frequencies

Table 3. Graphical representations of the variation series (for each type of series, discrete or interval: the polygon or histogram and the empirical distribution function)

Looking through the observation results, one determines how many variant values fall into each interval. Each interval is assumed to include one of its endpoints: either the left one in all cases (more common) or the right one in all cases, and the frequencies or relative frequencies show the number of variants lying within the indicated boundaries. The intervals from $a_i$ to $a_{i+1}$ are called partial intervals. To simplify later calculations, an interval variation series can be replaced by a conventionally discrete one. In that case the midpoint of the $i$-th interval is taken as the variant $x_i$, and the corresponding interval frequency $m_i$ as the frequency of that variant.
For graphical representation of variation series, polygon, histogram, cumulative curve and empirical distribution function are most often used.

Table 2.3 (grouping of the population of Russia by average per capita income in April 1994) presents an interval variation series.
Distribution series are conveniently analysed with the help of a graph, which makes it possible to judge the shape of the distribution. The polygon and the histogram give a clear picture of how the frequencies of a variation series change.
The polygon is used to display discrete variation series.
Let us depict graphically, for example, the distribution of the housing stock by type of apartment (Table 2.10).
Table 2.10 - Distribution of the housing stock of an urban area by type of apartment (illustrative figures).


Fig. Housing stock allocation polygon


Not only frequencies but also relative frequencies of the variation series can be plotted on the ordinate axis.
The histogram is used to display an interval variation series. When constructing a histogram, the intervals are plotted on the abscissa axis and the frequencies are represented by rectangles built on the corresponding intervals. With equal intervals, the heights of the bars must be proportional to the frequencies. A histogram is a graph in which the series is shown as bars adjacent to one another.
Let us depict graphically the interval distribution series given in Table 2.11.
Table 2.11 - Distribution of families by living space per person (illustrative figures).
No.     Groups of families by living space per person    Number of families    Accumulated number of families
1       3 – 5                                            10                    10
2       5 – 7                                            20                    30
3       7 – 9                                            40                    70
4       9 – 11                                           30                    100
5       11 – 13                                          15                    115
TOTAL                                                    115                   ----


Fig. 2.2. Histogram of the distribution of families by the size of living space per person


Using the data of the accumulated series (Table 2.11), we construct the cumulative curve (cumulate) of the distribution.


Fig. 2.3. Cumulative distribution of families by living space per person


Representing a variation series as a cumulate is especially effective for series whose frequencies are expressed as fractions or percentages of the sum of the frequencies.
If the axes are swapped when plotting the cumulate, the result is an ogive. Fig. 2.4 shows the ogive built from the data in Table 2.11.
A histogram can be converted into a distribution polygon by finding the midpoints of the upper sides of the rectangles and connecting them with straight lines. The resulting polygon is shown in Fig. 2.2 by a dotted line.
When a histogram is constructed for a variation series with unequal intervals, the ordinate axis shows not the frequencies but the density of the distribution in the corresponding intervals.
The distribution density is the frequency per unit of interval width, i.e. how many units of each group fall on one unit of the interval. An example of calculating the distribution density is given in Table 2.12.
Table 2.12 - Distribution of enterprises by the number of employees (illustrative figures)
No.     Groups of enterprises by number of employees, persons    Number of enterprises    Interval width, persons    Distribution density
        A                                                        1                        2                          3 = 1/2
1       Up to 20                                                 15                       20                         0.75
2       20 – 80                                                  27                       60                         0.45
3       80 – 150                                                 35                       70                         0.5
4       150 – 300                                                60                       150                        0.4
5       300 – 500                                                10                       200                        0.05
TOTAL                                                            147                      ----                       ----
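The density column of Table 2.12 is the frequency divided by the interval width. A small sketch, with illustrative variable names, recomputes it from the first two numeric columns:

```python
# (group label, number of enterprises, interval width in persons) from Table 2.12
groups = [
    ("up to 20", 15, 20),
    ("20-80", 27, 60),
    ("80-150", 35, 70),
    ("150-300", 60, 150),
    ("300-500", 10, 200),
]

# Distribution density = frequency per unit of interval width.
density = [count / width for _, count, width in groups]

for (label, count, width), d in zip(groups, density):
    print(f"{label:>8}: {count} / {width} = {d:.2f}")
```

For the second group the division 27 / 60 gives 0.45.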

The cumulative curve can also be used for the graphical representation of variation series. The cumulate (curve of sums) displays the series of accumulated frequencies. Accumulated frequencies are obtained by successively summing the frequencies over the groups and show how many units of the population have a feature value no greater than the value considered.


Fig. 2.4. Ogive of the distribution of families by living space per person

When constructing the cumulate of an interval variation series, the variants are plotted on the abscissa axis and the accumulated frequencies on the ordinate axis.
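The accumulated frequencies behind the cumulate can be obtained by a running sum. The sketch below uses the data of Table 2.11; pairing each accumulated frequency with the upper boundary of its interval is an assumption about how the points are placed on the abscissa.

```python
from itertools import accumulate

upper_bounds = [5, 7, 9, 11, 13]   # upper boundaries of the intervals in Table 2.11
families = [10, 20, 40, 30, 15]    # number of families in each interval

# Accumulated frequencies: successive sums of the group frequencies.
accumulated = list(accumulate(families))
total = accumulated[-1]

# Accumulated relative frequencies, as fractions of the total.
accumulated_share = [a / total for a in accumulated]

# Points of the cumulate: (interval boundary, accumulated frequency).
# Swapping the coordinates of each point gives the ogive.
cumulate_points = list(zip(upper_bounds, accumulated))
ogive_points = [(y, x) for x, y in cumulate_points]

print(accumulated)
print(cumulate_points)
```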

Continuous variation series

A continuous variation series is a series built on a quantitative statistical feature. Example: the average duration of illness of convicts (days per person) in the autumn-winter period of the current year was:
7.0 6.0 5.9 9.4 6.5 7.3 7.6 9.3 5.8 7.2
7.1 8.3 7.5 6.8 7.1 9.2 6.1 8.5 7.4 7.8
10.2 9.4 8.8 8.3 7.9 9.2 8.9 9.0 8.7 8.5

RUSSIAN PRESIDENTIAL ACADEMY OF NATIONAL ECONOMY AND PUBLIC ADMINISTRATION

ORYOL BRANCH

Department of Mathematics and Mathematical Methods in Management

Independent work

Mathematics

on the topic "Variation series and its characteristics"

for full-time students of the Faculty of Economics and Management

field of study "Personnel Management"


Purpose of work: Mastering the concepts of mathematical statistics and methods of primary data processing.

An example of solving typical tasks.

Task 1.

The following data were obtained from a survey:

1 2 3 2 2 4 3 3 5 1 0 2 4 3 2 2 3 3 1 3 2 4 2 4 3 3 3 2 0 6

3 3 1 1 2 3 1 4 3 1 7 4 3 4 2 3 2 3 3 1 4 3 1 4 5 3 4 2 4 5

3 6 4 1 3 2 4 1 3 1 0 0 4 6 4 7 4 1 3 5

It is required to:

1) Compile the variation series (the statistical distribution of the sample), having first written out the ranked discrete series of variants.

2) Construct the frequency polygon and the cumulative curve.

3) Compile the distribution series of relative frequencies.

4) Find the main numerical characteristics of the variation series (using the simplified formulas): a) the arithmetic mean $\bar{x}$, b) the median Me and the mode Mo, c) the variance $s^2$, d) the standard deviation $s$, e) the coefficient of variation $V$.

5) Explain the meaning of the results obtained.

Solution.

1) To compile the ranked discrete series of variants, we sort the survey data and arrange them in ascending order:

0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

5 5 5 5 6 6 6 7 7.

Let us compile the variation series by writing the observed values (variants) in the first row of the table and the corresponding frequencies in the second (Table 1).

Table 1.

x_i    0    1    2    3    4    5    6    7
n_i    4    13   14   24   16   4    3    2

2) The frequency polygon is a broken line connecting the points $(x_i; n_i)$, $i = 1, 2, \dots, m$, where $m$ is the number of distinct values of the feature $X$.

Let us draw the frequency polygon of the variation series (Fig. 1).

Fig. 1. Frequency polygon

The cumulative curve (cumulate) for a discrete variation series is a broken line connecting the points $(x_i; n_i^{acc})$, $i = 1, 2, \dots, m$.

We find the accumulated frequencies $n_i^{acc}$ (the accumulated frequency shows how many variants were observed with a feature value less than $x$). The values found are entered in the third row of Table 1.



Let us construct the cumulate (Fig. 2).

Fig. 2. Cumulate

3) We find the relative frequencies $w_i = n_i / n$, $i = 1, 2, \dots, m$, where $m$ is the number of distinct values of the feature $X$; all of them are calculated to the same accuracy.

Let us write the distribution series of relative frequencies in the form of Table 2.

Table 2.

4) Let us find the main numerical characteristics of the variation series.

a) We find the arithmetic mean using the simplified formula

$\bar{x} = \bar{u} k + c$, $\bar{u} = \frac{1}{n} \sum_{i=1}^{m} u_i n_i$,

where $u_i = \frac{x_i - c}{k}$ are the conditional variants.

We take $c = 3$ (one of the middle observed values) and $k = 1$ (the difference between two neighbouring variants) and draw up a calculation table (Table 3).

Table 3.

x_i    n_i    u_i    u_i n_i    u_i^2 n_i
0      4      -3     -12        36
1      13     -2     -26        52
2      14     -1     -14        14
3      24     0      0          0
4      16     1      16         16
5      4      2      8          16
6      3      3      9          27
7      2      4      8          32
Sum    80            -11        193

Then the arithmetic mean is $\bar{x} = \frac{-11}{80} \cdot 1 + 3 \approx 2.86$.

b) The median Me of a variation series is the value of the feature that falls in the middle of the ranked series of observations. This discrete variation series contains an even number of terms ($n = 80$), so the median equals half the sum of the two middle variants. The 40th and 41st terms of the ranked series both equal 3, hence Me = 3.

The mode Mo of a variation series is the variant with the highest frequency. For this series the highest frequency $n_{\max} = 24$ corresponds to the variant $x = 3$, so Mo = 3.

c) The variance $s^2$, which measures the dispersion of the values of the indicator $X$ around its mean, is found using the simplified formula

$s^2 = \left( \overline{u^2} - \bar{u}^2 \right) k^2$, $\overline{u^2} = \frac{1}{n} \sum_{i=1}^{m} u_i^2 n_i$,

where $u_i$ are the conditional variants.

The intermediate calculations are also entered in Table 3.

Then the variance is $s^2 = \left( \frac{193}{80} - \left( \frac{-11}{80} \right)^2 \right) \cdot 1^2 \approx 2.39$.

d) The standard deviation $s$ is found by the formula

$s = \sqrt{s^2} \approx 1.55$.

e) The coefficient of variation $V$: $V = \frac{s}{\bar{x}} \cdot 100\%$ ($\bar{x} \neq 0$).

The coefficient of variation is a dimensionless quantity; it is therefore suitable for comparing the dispersion of variation series whose variants have different dimensions.

The coefficient of variation is

$V = \frac{1.55}{2.86} \cdot 100\% \approx 54\%$.

5) The meaning of the results is as follows. The value $\bar{x}$ characterizes the average value of the feature $X$ within the sample considered; here the average value is 2.86. The standard deviation $s$ describes the absolute spread of the values of the indicator $X$ and in this case is $s \approx 1.55$. The coefficient of variation $V$ characterizes the relative variability of the indicator $X$, i.e. the relative spread around its average value, and in this case is about 54%.

Answer: $\bar{x} \approx 2.86$; $s^2 \approx 2.39$; $s \approx 1.55$; $V \approx 54\%$.
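The calculations of Task 1 can be checked with a short sketch. It assumes the simplified formulas have the usual "false zero" form, $\bar{x} = \bar{u} k + c$ and $s^2 = (\overline{u^2} - \bar{u}^2) k^2$ with divisor $n$; the variable names are illustrative.

```python
from collections import Counter
from math import sqrt

raw = """
1 2 3 2 2 4 3 3 5 1 0 2 4 3 2 2 3 3 1 3 2 4 2 4 3 3 3 2 0 6
3 3 1 1 2 3 1 4 3 1 7 4 3 4 2 3 2 3 3 1 4 3 1 4 5 3 4 2 4 5
3 6 4 1 3 2 4 1 3 1 0 0 4 6 4 7 4 1 3 5
"""
data = sorted(int(v) for v in raw.split())   # ranked series of the survey data
n = len(data)
freq = dict(sorted(Counter(data).items()))   # variation series x_i -> n_i

c, k = 3, 1                                  # false zero and step, as in the text
u_mean = sum((x - c) / k * m for x, m in freq.items()) / n
u2_mean = sum(((x - c) / k) ** 2 * m for x, m in freq.items()) / n

mean = u_mean * k + c                        # simplified formula for the mean
variance = (u2_mean - u_mean ** 2) * k ** 2  # simplified formula, divisor n
std = sqrt(variance)
cv = std / mean * 100                        # coefficient of variation, %

mode = max(freq, key=freq.get)
median = (data[n // 2 - 1] + data[n // 2]) / 2   # n is even

print(freq)
print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"variance={variance:.2f} std={std:.2f} V={cv:.0f}%")
```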

Task 2.

The following data are available on the equity capital of the 40 largest banks in Central Russia:

12.0 49.4 22.4 39.3 90.5 15.2 75.0 73.0 62.3 25.2
70.4 50.3 72.0 71.6 43.7 68.3 28.3 44.9 86.6 61.0
41.0 70.9 27.3 22.9 88.6 42.5 41.9 55.0 56.9 68.1
120.8 52.4 42.0 119.3 49.6 110.6 54.5 99.3 111.5 26.1

It is required to:

1) Construct an interval variation series.

2) Calculate the sample mean and the sample variance.

3) Find the standard deviation and the coefficient of variation.

4) Construct a histogram of the frequency distribution.

Solution.

1) Let us choose the number of intervals arbitrarily, for example 8. Then the width of the interval is

$h = \frac{x_{\max} - x_{\min}}{8} = \frac{120.8 - 12.0}{8} = 13.6 \approx 15$.

Let us draw up a calculation table:

Interval, x_k – x_{k+1}    Frequency, n_i    Midpoint x_i    Conditional variant, u_i    u_i n_i    u_i^2 n_i    (u_i + 1)^2 n_i
10 – 25                    4                 17.5            -3                          -12        36           16
25 – 40                    5                 32.5            -2                          -10        20           5
40 – 55                    11                47.5            -1                          -11        11           0
55 – 70                    6                 62.5            0                           0          0            6
70 – 85                    6                 77.5            1                           6          6            24
85 – 100                   4                 92.5            2                           8          16           36
100 – 115                  2                 107.5           3                           6          18           32
115 – 130                  2                 122.5           4                           8          32           50
Sum                        40                                                            -5         139          169

The value $c = 62.5$ is chosen as the false zero (this variant lies approximately in the middle of the variation series).

The conditional variants are determined by the formula $u_i = \frac{x_i - c}{h}$.
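The calculation table of Task 2 can be rebuilt with the sketch below. It assumes left-closed intervals (so 55.0 is counted in 55 – 70, which agrees with the $u_i n_i$ column of the table), conditional variants $u_i = (x_i - c)/h$, and the false-zero formulas $\bar{x} = c + h\bar{u}$ and $s^2 = h^2 (\overline{u^2} - \bar{u}^2)$. The source breaks off before the final values, so the printed results are this sketch's own output.

```python
capital = [12.0, 49.4, 22.4, 39.3, 90.5, 15.2, 75.0, 73.0, 62.3, 25.2,
           70.4, 50.3, 72.0, 71.6, 43.7, 68.3, 28.3, 44.9, 86.6, 61.0,
           41.0, 70.9, 27.3, 22.9, 88.6, 42.5, 41.9, 55.0, 56.9, 68.1,
           120.8, 52.4, 42.0, 119.3, 49.6, 110.6, 54.5, 99.3, 111.5, 26.1]

h, start, k = 15, 10, 8                       # width, first boundary, number of intervals
edges = [start + h * i for i in range(k + 1)]
pairs = list(zip(edges, edges[1:]))

freqs = [sum(lo <= x < hi for x in capital) for lo, hi in pairs]  # n_i
mids = [(lo + hi) / 2 for lo, hi in pairs]                        # midpoints x_i
c = 62.5                                                          # false zero
u = [(m - c) / h for m in mids]                                   # conditional variants

n = sum(freqs)
sum_un = sum(ui * ni for ui, ni in zip(u, freqs))
sum_u2n = sum(ui ** 2 * ni for ui, ni in zip(u, freqs))
# Control column: sum of (u+1)^2 n must equal sum u^2 n + 2 * sum u n + n.
sum_check = sum((ui + 1) ** 2 * ni for ui, ni in zip(u, freqs))

mean = c + h * sum_un / n
variance = h ** 2 * (sum_u2n / n - (sum_un / n) ** 2)

print(freqs, sum_un, sum_u2n, sum_check)
print(f"mean={mean:.3f} variance={variance:.2f} std={variance ** 0.5:.2f}")
```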

An example of solving a test on mathematical statistics

Problem 1

Initial data : students of a certain group of 30 people passed the exam in the course "Informatics". The grades received by the students form the following series of numbers:

I. Let us compile the variation series (table columns: $x$, $m_x$, $w_x$, $m_x^{acc}$, $w_x^{acc}$; the last row holds the totals).

II. Graphical presentation of statistical information.

III. Numerical characteristics of the sample.

1. Arithmetic mean

2. Geometric mean

3. Mode

4. Median

222222333333333 | 3 34444444445555

5. Sample variance

6. Sample standard deviation

7. Coefficient of variation

8. Skewness

9. Skewness coefficient

10. Kurtosis

11. Kurtosis coefficient

Problem 2

Initial data : students of a certain group wrote their final test. The group consists of 30 people. The points scored by the students form the following series of numbers

Solution

I. Since the feature takes many different values, we construct an interval variation series for it. To do this, we first set the interval width $h$, using Sturges' formula.

Let us draw up the scale of intervals. The upper boundary of the first interval is taken to be the value determined by the formula:

The upper boundaries of the subsequent intervals are determined by the following recurrence formula:

then

We stop building the scale of intervals once the upper boundary of the next interval becomes greater than or equal to the maximum value of the sample.

II. Graphical display of interval variation series

III. Numerical characteristics of the sample

To determine the numerical characteristics of the sample, we will compose an auxiliary table

Amount:

1. Arithmetic mean

2. Geometric mean

3. Mode

4. Median

10 11 12 12 13 13 13 13 14 14 14 14 15 15 15 |15 15 15 16 16 16 16 16 17 17 18 19 19 20 20

5. Sample variance

6. Sample standard deviation

7. Coefficient of variation

8. Skewness

9. Skewness coefficient

10. Kurtosis

11. Kurtosis coefficient
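Several of the characteristics listed for Problem 2 can be sketched directly from the ranked sample shown under "Median". The formulas used in the original solution are not preserved, so the definitions below are assumptions: central moments with divisor $n$, skewness coefficient $\mu_3 / \sigma^3$, and kurtosis coefficient $\mu_4 / \sigma^4 - 3$. The values are computed from the raw sample rather than from the interval series, so they may differ from a grouped-data solution.

```python
from math import sqrt

# Ranked sample from Problem 2 (30 test scores).
scores = [10, 11, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 16, 16, 16, 16, 16, 17, 17, 18, 19, 19, 20, 20]
n = len(scores)
mean = sum(scores) / n


def central_moment(order: int) -> float:
    """Central moment of the given order, with divisor n (an assumption)."""
    return sum((x - mean) ** order for x in scores) / n


variance = central_moment(2)
std = sqrt(variance)
median = (scores[n // 2 - 1] + scores[n // 2]) / 2    # n is even
cv = std / mean * 100                                 # coefficient of variation, %
skewness_coeff = central_moment(3) / std ** 3
kurtosis_coeff = central_moment(4) / std ** 4 - 3     # relative to the normal law

print(f"mean={mean:.2f} median={median} variance={variance:.2f} V={cv:.1f}%")
print(f"skewness={skewness_coeff:.3f} kurtosis={kurtosis_coeff:.3f}")
```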

Problem 3

Condition: the scale division of an ammeter is 0.1 A. Readings are rounded to the nearest whole division. Find the probability that the reading error exceeds 0.02 A.

Solution.

The rounding error can be regarded as a random variable $X$ that is uniformly distributed on the interval between two adjacent whole divisions. The density of the uniform distribution is

$f(x) = \frac{1}{b - a}$,

where $b - a$ is the length of the interval containing the possible values of $X$; outside this interval $f(x) = 0$.
In this problem the length of the interval containing the possible values of $X$ is 0.1, therefore $f(x) = \frac{1}{0.1} = 10$.

The reading error exceeds 0.02 if it lies in the interval (0.02; 0.08). Then

$P(0.02 < X < 0.08) = \int_{0.02}^{0.08} 10 \, dx = 0.6$.

Answer: $P = 0.6$
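The answer to Problem 3 can be checked numerically. For a uniform distribution the probability of a sub-interval is its length divided by the total length; the helper function below is hypothetical and written only for this check.

```python
def uniform_prob(a: float, b: float, lo: float, hi: float) -> float:
    """P(lo < X < hi) for X uniform on (a, b): overlap length times density 1/(b - a)."""
    overlap = max(0.0, min(b, hi) - max(a, lo))
    return overlap / (b - a)


# Rounding error uniform on (0, 0.1); it exceeds 0.02 when it lies in (0.02, 0.08).
p = uniform_prob(0.0, 0.1, 0.02, 0.08)
print(round(p, 3))
```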

Problem 4

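Initial data: the mathematical expectation and the standard deviation of a normally distributed feature $X$ are 10 and 2 respectively. Find the probability that, as a result of a trial, $X$ takes a value in the interval (12, 14).

Solution.

We use the formula

$P(\alpha < X < \beta) = \Phi\left(\frac{\beta - a}{\sigma}\right) - \Phi\left(\frac{\alpha - a}{\sigma}\right)$,

where $a$ is the mathematical expectation, $\sigma$ is the standard deviation and $\Phi$ is the Laplace function. Substituting $\alpha = 12$, $\beta = 14$, $a = 10$, $\sigma = 2$ gives $P(12 < X < 14) = \Phi(2) - \Phi(1) \approx 0.4772 - 0.3413 = 0.1359$.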
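The probability in Problem 4 can also be evaluated with Python's standard library. The sketch uses `statistics.NormalDist`, whose `cdf` is the cumulative distribution function rather than the Laplace function; the difference of two `cdf` values gives the same interval probability.

```python
from statistics import NormalDist

# X is normal with mathematical expectation 10 and standard deviation 2.
dist = NormalDist(mu=10, sigma=2)

# P(12 < X < 14) as a difference of cumulative distribution function values.
p = dist.cdf(14) - dist.cdf(12)
print(round(p, 4))
```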


  • A special place in statistical analysis belongs to the definition of the average level of the studied attribute or phenomenon. The average level of a feature is measured by average values.

    The average value characterizes the general quantitative level of the trait under study and is a group property of the statistical population. It neutralizes, weakens random deviations of individual observations in one direction or another and brings to the fore the main, typical property of the trait under study.

    Average values ​​are widely used:

    1. To assess the state of health of the population: characteristics of physical development (height, weight, chest circumference, etc.), identifying the prevalence and duration of various diseases, analysing demographic indicators (natural movement of the population, average life expectancy, reproduction of the population, average population size, etc.).

    2. To study the activities of medical institutions, medical personnel and assess the quality of their work, plan and determine the needs of the population in various types of medical care (the average number of calls or visits per inhabitant per year, the average length of stay of the patient in the hospital, the average duration of the examination patient, average provision of doctors, beds, etc.).

    3. To characterize the sanitary and epidemiological state (average dust content in the workshop, average area per person, average consumption of proteins, fats and carbohydrates, etc.).

    4. To determine the medical and physiological indicators in health and disease, when processing laboratory data, to establish the reliability of the results of a sample study in social and hygienic, clinical, experimental studies.

    Average values are calculated on the basis of variation series. A variation series is a qualitatively homogeneous statistical population whose individual units characterize the quantitative differences of the feature or phenomenon under study.

    Quantitative variation can be of two types: discontinuous (discrete) and continuous.

    A discontinuous (discrete) feature is expressed only as a whole number and cannot take intermediate values (for example, the number of visits, the population of a district, the number of children in a family, the severity of a disease in points, etc.).

    A continuous feature can take any value within certain limits, including fractional values, and is expressed only approximately (for example, weight, where kilograms suffice for adults and grams are used for newborns; height; blood pressure; time spent seeing a patient, etc.).



    The numerical value of each individual trait or phenomenon included in the variation series is called a variant and is denoted by the letter V. Other designations occur in the mathematical literature, for example x or y.

    A variation series in which each variant is listed once is called simple. Such series are used in most statistical tasks when the data are processed by computer.

    As the number of observations grows, variant values usually begin to repeat. In this case a grouped variation series is compiled, in which the number of repetitions of each variant is indicated (the frequency, denoted by the letter p).

    A ranked variation series consists of variants arranged in ascending or descending order. Both simple and grouped series can be ranked.
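    A minimal sketch of how a grouped, ranked variation series could be built from raw observations. The function name `grouped_ranked_series` and the sample data are illustrative assumptions, not part of the original text.

```python
from collections import Counter


def grouped_ranked_series(values):
    """Return (variant, frequency) pairs sorted by variant in ascending order."""
    counts = Counter(values)       # frequency p of every distinct variant V
    return sorted(counts.items())  # ranking: ascending order of the variants


# Illustrative raw observations (an assumption for this example)
observations = [4, 6, 6, 3, 4, 9, 6, 4, 6, 6]
series = grouped_ranked_series(observations)

for variant, frequency in series:
    print(f"V = {variant}, p = {frequency}")
```

    The frequencies always add up to the number of observations, which is a quick check that no value was lost during grouping.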

    Interval variation series are compiled to simplify subsequent calculations performed without a computer when the number of observation units is very large (more than 1000).

    A continuous variation series includes variants that can take any value within a range.

    If the values of the trait (the variants) in the variation series are given as separate specific numbers, the series is called discrete.

    The general characteristics of the trait values reflected in a variation series are the average values. The most commonly used are the arithmetic mean (M), the mode (Mo) and the median (Me). Each of these characteristics has its own meaning. They cannot substitute for one another, and only taken together do they describe the features of the variation series fully and concisely.

    The mode (Mo) is the value of the most frequently occurring variant.

    The median (Me) is the value of the variant that divides the ranked variation series in half (half of the variants lie on each side of the median). In the rare case of a symmetrical variation series, the mode and the median are equal to each other and coincide with the arithmetic mean.
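    As a small illustration, the mode and the median can be obtained with Python's standard `statistics` module. The data below are invented for the example.

```python
import statistics

# Illustrative simple variation series (an assumption for this example)
values = [7, 3, 9, 3, 2, 5, 3, 8, 4]

ranked = sorted(values)            # ranked series: 2 3 3 3 4 5 7 8 9
mo = statistics.mode(ranked)       # most frequent variant
me = statistics.median(ranked)     # middle variant of the ranked series

print("Mo =", mo)
print("Me =", me)
```

    For this asymmetric series the mode (3) and the median (4) differ, which is the usual situation; they coincide only for a symmetrical series.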

    The most typical characteristic of the variant values is the arithmetic mean (M). In the mathematical literature it is denoted $\bar{x}$.

    The arithmetic mean (M, $\bar{x}$) is a general quantitative characteristic of a certain trait of the studied phenomena that make up a qualitatively homogeneous statistical population. A distinction is made between the simple and the weighted arithmetic mean. The simple arithmetic mean is calculated for a simple variation series by summing all the variants and dividing this sum by the total number of variants in the series. The calculation follows the formula:

    $$M = \frac{\sum V}{n}$$

    where: M - simple arithmetic mean;

    Σ V - the sum of the variants;

    n - the number of observations.

    For a grouped variation series, the weighted arithmetic mean is determined. The formula for its calculation:

    $$M = \frac{\sum V p}{n}$$

    where: M - weighted arithmetic mean;

    Σ Vp - the sum of the products of the variants and their frequencies;

    n - the number of observations (the sum of the frequencies).
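    The two means described above can be sketched as follows. The function names and the data are assumptions made for illustration; the grouped series is the same set of observations with repeated variants collapsed into frequencies.

```python
def simple_mean(variants):
    """Simple arithmetic mean: sum of the variants divided by their number."""
    return sum(variants) / len(variants)


def weighted_mean(variants, frequencies):
    """Weighted arithmetic mean: sum of V * p divided by n = sum of p."""
    n = sum(frequencies)
    return sum(v * p for v, p in zip(variants, frequencies)) / n


# Illustrative data: the same observations in simple and grouped form
raw = [4, 6, 6, 3, 4, 9, 6, 4, 6, 6]
V = [3, 4, 6, 9]   # distinct variants
p = [1, 3, 5, 1]   # their frequencies

print(simple_mean(raw))      # 5.4
print(weighted_mean(V, p))   # 5.4
```

    Both forms give the same result, since grouping only changes how the data are written down, not the data themselves.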

    For a large number of observations in the case of manual calculations, the method of moments can be used.

    The arithmetic mean has the following properties:

    · The sum of the deviations of the variants from the mean (Σ d) is equal to zero (see Table 15);

    · When all variants are multiplied (divided) by the same factor (divisor), the arithmetic mean is multiplied (divided) by the same factor (divisor);

    · If the same number is added to (subtracted from) all variants, the arithmetic mean increases (decreases) by the same number.
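    The three properties listed above can be checked numerically. This is only a sketch on invented data; the factor `k` and the shift `c` are arbitrary assumptions.

```python
import math


def mean(values):
    return sum(values) / len(values)


V = [2, 4, 6, 8, 10]   # illustrative variants
M = mean(V)
k, c = 3, 5            # arbitrary factor and shift

# Property 1: the deviations d = V - M sum to zero
sum_of_deviations = sum(v - M for v in V)

# Property 2: multiplying every variant by k multiplies the mean by k
scaled_mean = mean([v * k for v in V])

# Property 3: adding c to every variant adds c to the mean
shifted_mean = mean([v + c for v in V])

print(sum_of_deviations, scaled_mean, shifted_mean)
print(math.isclose(scaled_mean, M * k), math.isclose(shifted_mean, M + c))
```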

    Arithmetic means taken by themselves, without regard to the variability of the series from which they were calculated, may not fully reflect the properties of the variation series, especially when they have to be compared with other averages. Averages that are close in value can be obtained from series with different degrees of scatter. The closer the individual variants are to one another in their quantitative characteristics, the smaller the scatter (variability) of the series and the more typical its average.

    The main parameters that allow the variability of a trait to be assessed are:

    · The range (limits);

    · The amplitude;

    · The standard deviation;

    · The coefficient of variation.

    The variability of a trait can be judged approximately from the range and the amplitude of the variation series. The range indicates the maximum (V max) and minimum (V min) variants in the series. The amplitude (A m) is the difference between these variants: A m = V max - V min.
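    For illustration, the extreme variants and the amplitude of a series can be found in a couple of lines. The data are invented for the example.

```python
# Illustrative variation series (an assumption for this example)
V = [3, 4, 4, 4, 6, 6, 6, 6, 6, 9]

v_min, v_max = min(V), max(V)   # the extreme variants of the series
amplitude = v_max - v_min       # A m = V max - V min

print(f"V min = {v_min}, V max = {v_max}, A m = {amplitude}")
```

    Because these measures depend only on the two extreme variants, they say nothing about how the remaining values are scattered, which is why the text calls the judgment approximate.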

    The main, generally accepted measure of the variability of a variation series is the variance (D). In practice, however, a more convenient parameter derived from the variance is used most often: the standard deviation (σ). It takes into account the deviation (d) of each variant of the variation series from its arithmetic mean (d = V - M).

    Since the deviations of the variants from the mean can be positive or negative, they sum to zero (Σ d = 0). To avoid this, the deviations (d) are squared and then averaged. Thus, the variance of a variation series is the mean square of the deviations of the variants from the arithmetic mean and is calculated by the formula:

    $$D = \frac{\sum d^2}{n}$$

    It is the most important characteristic of variability and is used to calculate many statistical criteria.

    Since the variance is expressed in squared units of the deviation, its value cannot be compared directly with the arithmetic mean. For this purpose the standard deviation is used, denoted by the Greek letter sigma (σ). It characterizes the average deviation of all variants of the variation series from the arithmetic mean in the same units as the mean itself, so the two can be used together.

    The standard deviation is determined by the formula:

    $$\sigma = \sqrt{\frac{\sum d^2}{n}}$$

    This formula is applied when the number of observations (n) is greater than 30. For smaller n the standard deviation computed this way is biased, and the bias is corrected by using n - 1 in place of n. A more accurate result is therefore obtained with the formula:

    $$s = \sqrt{\frac{\sum d^2}{n - 1}}$$

    The sample standard deviation (s) is an estimate of the standard deviation of a random variable X about its mathematical expectation, based on an unbiased estimate of its variance.

    For n > 30 the standard deviation (σ) and the sample standard deviation (s) are practically the same (σ ≈ s). For this reason most practical manuals treat the two as interchangeable. In Excel, the sample standard deviation (s) is calculated with the function =STDEV(range); σ can be obtained with =STDEVP(range) or with a formula written by hand.
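    The difference between σ (division by n) and s (division by n - 1) can be sketched in Python as below. The function names and the data are assumptions for illustration; the results are cross-checked against `statistics.pstdev` and `statistics.stdev` from the standard library.

```python
import math
import statistics


def variance(values):
    """D: mean square of the deviations d = V - M."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def sigma(values):
    """Standard deviation with n in the denominator."""
    return math.sqrt(variance(values))


def sample_sd(values):
    """Sample standard deviation s with n - 1 in the denominator."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


# Illustrative small sample (n = 8), so the two estimates differ noticeably
V = [2, 4, 4, 4, 5, 5, 7, 9]

print("D     =", variance(V))    # 4.0
print("sigma =", sigma(V))       # 2.0
print("s     =", sample_sd(V))   # slightly larger than sigma
```

    With only eight observations s exceeds σ by about 7%; as n grows the ratio sqrt(n / (n - 1)) approaches 1, which is why the two are treated as interchangeable for large samples.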

    The standard deviation shows how much the values of a trait can differ from the mean. Suppose two cities have the same average daytime temperature in summer, one located on the coast and the other inland. Daytime temperatures are known to vary less in coastal cities than in cities in the interior of a continent, so the standard deviation of daytime temperatures will be smaller for the coastal city. In practice this means that the temperature on any particular day in the inland city will differ more from the average than in the coastal city. In addition, the standard deviation makes it possible to estimate the possible deviations of the temperature from the average with a required level of probability.

    According to probability theory, in phenomena that obey the normal distribution law there is a strict relationship between the arithmetic mean, the standard deviation and the variants (the three-sigma rule). For example, 68.3% of the values of the varying trait lie within M ± 1σ, 95.5% within M ± 2σ and 99.7% within M ± 3σ.
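    The percentages of the three-sigma rule follow from the normal distribution itself: the share of values within M ± kσ equals erf(k / √2). A short check, with a hypothetical helper name `share_within`:

```python
import math


def share_within(k):
    """Probability that a normally distributed value lies within M ± k·sigma."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"M ± {k}σ: {100 * share_within(k):.2f}%")
# prints 68.27%, 95.45% and 99.73%
```

    Rounded to one decimal place, these are the 68.3%, 95.5% and 99.7% quoted in the text.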

    The value of the standard deviation makes it possible to judge the homogeneity of the variation series and of the studied group. A small standard deviation indicates a sufficiently high homogeneity of the phenomenon under study, and the arithmetic mean can then be regarded as quite characteristic of the variation series. However, too small a sigma raises the suspicion that the observations were selected artificially. With a very large sigma the arithmetic mean characterizes the variation series less well, which indicates considerable variability of the studied trait or heterogeneity of the studied group. Standard deviations can be compared only for traits of the same dimension. Indeed, if the variability of body weight is compared between newborns and adults, the sigma for adults will always be larger.

    The variability of traits of different dimensions can be compared using the coefficient of variation. It expresses the variability as a percentage of the mean, which makes different traits comparable. In the medical literature the coefficient of variation is denoted by the letter C, and in the mathematical literature by v. It is calculated by the formula:

    $$C = \frac{\sigma}{M} \cdot 100\%$$

    A coefficient of variation below 10% indicates small scatter of the variants around the arithmetic mean, from 10 to 20% medium scatter, and above 20% strong scatter.
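    A minimal sketch of the coefficient of variation and of the 10% / 20% thresholds given above. The function names and the numeric inputs are illustrative assumptions.

```python
def coefficient_of_variation(sigma, mean):
    """Standard deviation expressed as a percentage of the arithmetic mean."""
    return 100 * sigma / mean


def describe_scatter(cv):
    """Classify scatter using the thresholds quoted in the text."""
    if cv < 10:
        return "small"
    if cv <= 20:
        return "medium"
    return "strong"


# Hypothetical example: body weight with M = 70 kg, sigma = 5.6 kg
cv = coefficient_of_variation(5.6, 70)
print(f"C = {cv:.1f}% ({describe_scatter(cv)} scatter)")
```

    Because C is dimensionless, the same two functions could be applied to height in centimetres or pressure in mmHg and the results compared directly.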

    The arithmetic mean, as a rule, is calculated based on the data of the sample population. With repeated studies, under the influence of random phenomena, the arithmetic mean may change. This is due to the fact that, as a rule, only a part of the possible observation units, that is, the sample population, is studied. Information about all possible units representing the phenomenon under study can be obtained by studying the entire general population, which is not always possible. At the same time, in order to generalize the experimental data, the value of the average in the general population is of interest. Therefore, to formulate a general conclusion about the phenomenon under study, the results obtained on the basis of the sample population must be transferred to the general population by statistical methods.

    To determine how closely the sample study matches the general population, it is necessary to estimate the error that inevitably arises in sample observation. This error is called the "error of representativeness" or the "mean error of the arithmetic mean". In essence, it is the difference between the averages obtained by sample observation and the corresponding values that would have been obtained in a complete study of the same object, i.e. of the general population. Since the sample mean is a random variable, such an estimate is made with a level of probability acceptable to the researcher. In medical research it is at least 95%.

    The error of representativeness should not be confused with errors of registration or errors of attention (clerical errors, miscalculations, typos, etc.), which should be minimized by adequate methods and tools used in the experiment.

    The magnitude of the representativeness error depends both on the sample size and on the variability of the trait. The larger the number of observations, the closer the sample is to the general population and the smaller the error. The more volatile the attribute, the greater the magnitude of the statistical error.

    In practice, the following formula is used to determine the error of representativeness in variation series:

    $$m = \frac{\sigma}{\sqrt{n}}$$

    where: m - error of representativeness;

    σ - standard deviation;

    n - the number of observations in the sample.

    It can be seen from the formula that the size of the mean error is directly proportional to the standard deviation, i.e., the variability of the trait under study, and inversely proportional to the square root of the number of observations.
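    The dependence of the mean error on σ and on the square root of n can be illustrated as follows. The function name `mean_error` and the numbers are assumptions for the example.

```python
import math


def mean_error(sigma, n):
    """Error of representativeness: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)


sigma = 2.0  # hypothetical standard deviation
for n in (25, 100, 400):
    print(f"n = {n:3d}: m = {mean_error(sigma, n):.2f}")
```

    Quadrupling the number of observations only halves the error, so large gains in precision require disproportionately larger samples.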

    When statistical analysis is based on relative values, constructing a variation series is not required. In this case the mean error of a relative indicator can be determined with a simplified formula:

    $$m = \sqrt{\frac{P \cdot q}{n}}$$

    where: P - the value of the relative indicator, expressed as a percentage, per mille, etc.;

    q - the complement of P, expressed as (1 - P), (100 - P), (1000 - P), etc., depending on the base for which the indicator is calculated;

    n - the number of observations in the sample.

    However, this formula for the representativeness error of relative values can be applied only when the value of the indicator is less than its base. For some intensive indicators this condition is not met, and the indicator may exceed 100% or 1000‰. In such a situation a variation series is constructed and the representativeness error is calculated with the formula for mean values, based on the standard deviation.
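    A sketch of the simplified formula for relative indicators, including the restriction that the indicator must not exceed its base. The function name, the `base` parameter and the example figures are assumptions for illustration.

```python
import math


def relative_indicator_error(P, n, base=100):
    """Mean error of a relative indicator: sqrt(P * q / n), with q = base - P.

    `base` is 1 for proportions, 100 for percentages, 1000 for per mille.
    """
    if not 0 <= P <= base:
        # The simplified formula does not apply; a variation series and the
        # sigma-based formula for mean values would be needed instead.
        raise ValueError("indicator must lie between 0 and its base")
    q = base - P
    return math.sqrt(P * q / n)


# Hypothetical example: 20% of 400 examined patients have the condition
print(relative_indicator_error(20, 400))   # 2.0 (percentage points)
```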

    The arithmetic mean in the general population is predicted by stating two values, a minimum and a maximum. These extreme values of the possible deviations, within which the sought mean of the general population may lie, are called "confidence limits".

    Probability theory shows that, for a normally distributed trait, the deviations of the mean will not exceed three times the error of representativeness (M ± 3m) with a probability of 99.7%, twice the mean error (M ± 2m) with a probability of 95.5%, and one mean error (M ± 1m) with a probability of 68.3% (fig. 9).

    Fig. 9. The probability density of the normal distribution.

    Note that the above statement is valid only for a feature that obeys the normal Gaussian distribution.
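    The confidence limits M ± t·m can be sketched as below, where t = 1, 2 or 3 corresponds to the probabilities 68.3%, 95.5% and 99.7% quoted above. The function name and the example values are assumptions, and the result is meaningful only for a normally distributed trait.

```python
def confidence_limits(M, m, t=2):
    """Lower and upper confidence limits of the mean: M - t*m and M + t*m."""
    return M - t * m, M + t * m


# Hypothetical example: sample mean 120.0 with mean error 1.5
lower, upper = confidence_limits(120.0, 1.5, t=2)
print(f"With a probability of 95.5% the population mean lies in [{lower}, {upper}]")
```

    Choosing t = 2 matches the minimum 95% probability level that the text names as customary in medical research.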

    Most experimental research, including in the field of medicine, is associated with measurements, the results of which can take almost any value in a given interval, therefore, as a rule, they are described by a model of continuous random variables. Therefore, most statistical methods consider continuous distributions. One of such distributions, which has a fundamental role in mathematical statistics, is normal, or Gaussian, distribution.

    There are a number of reasons for this.

    1. First of all, many experimental observations can be described successfully with the normal distribution. It should be noted at once that no distribution of empirical data is exactly normal, since a normally distributed random variable ranges from -∞ to +∞, which never occurs in practice. Nevertheless, the normal distribution is very often a good approximation.

    Whenever weight, height or other physiological parameters of the human body are measured, the results are influenced by a very large number of random factors (natural causes and measurement errors), and the effect of each factor is, as a rule, insignificant. Experience shows that the results in such cases are approximately normally distributed.

    2. Many distributions associated with a random sample approach the normal distribution as the sample size increases.

    3. The normal distribution is well suited as an approximate description of other continuous distributions (for example, asymmetric).

    4. The normal distribution has a number of favorable mathematical properties, which largely ensured its widespread use in statistics.

    At the same time, many experimental distributions in medical data cannot be described by the normal distribution model. For such data, statistics has developed methods that are commonly called nonparametric.

    The statistical method used to process the data of a particular experiment should be chosen according to whether the data follow the normal distribution law. The hypothesis that a trait obeys the normal distribution law is tested with a frequency distribution histogram (graph) and with a number of statistical criteria, among them:

    · The asymmetry (skewness) criterion (b);

    · The kurtosis criterion (g);

    · The Shapiro-Wilk test (W).

    The analysis of the nature of the data distribution (also called the check for normality) is carried out for each parameter. To judge with confidence whether the distribution of a parameter corresponds to the normal law, a sufficiently large number of observation units (at least 30 values) is required.

    For a normal distribution, the skewness and kurtosis criteria are equal to 0. If b > 0, the distribution is skewed to the right (positive asymmetry); if b < 0, the distribution curve is skewed to the left (negative asymmetry). The kurtosis criterion checks the shape of the peak of the distribution curve: for the normal law g = 0, for g > 0 the curve has a sharper peak, and for g < 0 the peak is flatter than that of the normal distribution.
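    The text does not give formulas for b and g. One common choice, assumed here, is the moment-based estimate: b is the third central moment divided by σ³, and g is the fourth central moment divided by σ⁴ minus 3 (excess kurtosis). Statistical packages may apply small-sample corrections, so their values can differ slightly from this sketch.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based asymmetry b and excess kurtosis g (both 0 for a normal law)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance D
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    b = m3 / m2 ** 1.5
    g = m4 / m2 ** 2 - 3
    return b, g


# Illustrative data sets (assumptions for this example)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness_and_excess_kurtosis(symmetric))      # b = 0, g < 0 (flat)
print(skewness_and_excess_kurtosis(right_skewed))   # b > 0 (long right tail)
```

    Samples this small serve only to show the signs of b and g; as the text notes, at least 30 observations are needed before drawing conclusions about normality.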

    To check for normality with the Shapiro-Wilk test, the critical value of the criterion is found in statistical tables (Appendix 1) for the required significance level and the number of observation units (degrees of freedom). The hypothesis of normality is rejected at small values of the criterion, as a rule at W < 0.8.
