Fundamentals of mathematical statistics. Basic concepts and classification of mathematical statistics

Mathematical statistics is a branch of mathematics that studies approximate methods for collecting and analyzing data from experimental results to identify existing patterns, i.e. finding laws of distribution of random variables and their numerical characteristics.

In mathematical statistics, it is customary to distinguish two main areas of research:

1. Estimation of parameters of the general population.

2. Testing statistical hypotheses (some a priori assumptions).

The basic concepts of mathematical statistics are: population, sample, theoretical distribution function.

The general population is the set of all conceivable statistical data from observations of a random variable:

$X_G = (x_1, x_2, x_3, \ldots, x_N) = \{x_i;\ i = 1, \ldots, N\}$

The observed random variable X is called the sample feature or factor. The general population is the statistical analogue of a random variable; its size N is usually large, so part of the data is selected from it, called the sample population or simply the sample:

$X_B = (x_1, x_2, x_3, \ldots, x_n) = \{x_i;\ i = 1, \ldots, n\}$

$X_B \subset X_G, \quad n \le N$

A sample is a set of randomly selected observations (objects) from the general population for direct study. The number of objects in the sample is called the sample size and is denoted by n. Typically the sample is 5%-10% of the population.

Using a sample to establish the patterns that an observed random variable obeys makes it possible to avoid exhaustive (mass) observation, which is often resource-intensive or simply impossible.

For example, a population is a set of individuals. Studying an entire population is time-consuming and expensive, so data is collected from a sample of individuals who are considered representative of that population, allowing inferences to be made about that population.

However, the sample must satisfy the condition of representativeness, i.e. give a reasonable picture of the population. How is a representative sample formed? Ideally, one aims for a randomized sample: a list of all individuals in the population is compiled, and individuals are selected from it at random. Sometimes the cost of compiling such a list is prohibitive, and a convenience sample is taken instead, for example a single clinic or hospital in which all patients with a given disease are studied.

Each sample element is called a variant. The number of times a variant is repeated in the sample is called its frequency of occurrence $n_i$. The quantity $\omega_i = n_i / n$ is called the relative frequency of the variant, i.e. the ratio of the absolute frequency of the variant to the sample size. A sequence of variants written in ascending order is called a variation series.

Let's consider three forms of variation series: ranked, discrete and interval.

A ranked series is a list of the individual units of the population in ascending order of the characteristic being studied.

A discrete variation series is a table consisting of columns or rows: a specific value of the characteristic $x_i$ and the absolute frequency $n_i$ (or relative frequency $\omega_i$) with which the i-th value of the characteristic x occurs.

An example of a variation series is the table

Write the distribution of relative frequencies.

Solution: Let's find the relative frequencies. To do this, divide the frequencies by the sample size:

The distribution of relative frequencies has the form:

ω i	0.15	0.5	0.35

Control: 0.15 + 0.5 + 0.35 = 1.

A discrete series can be represented graphically. In a rectangular Cartesian coordinate system, the points with coordinates $(x_i, n_i)$ or $(x_i, \omega_i)$ are marked and connected by straight line segments. The resulting broken line is called a frequency polygon.

Construct a discrete variation series (DVR) and draw a polygon for the distribution of 45 applicants according to the number of points they received in the admission exams:

39 41 40 42 41 40 42 44 40 43 42 41 43 39 42 41 42 39 41 37 43 41 38 43 42 41 40 41 38 44 40 39 41 40 42 40 41 42 40 43 38 39 41 41 42.

Solution: To construct a variation series, we place the different values of the characteristic x (variants) in ascending order and write the frequency of each value beneath it.
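The tally described in the solution can be sketched in Python. This is a minimal illustration rather than the original worked table: the names `scores`, `counts` and `series` are introduced here for the example, and `collections.Counter` is just one convenient way to count repetitions.

```python
from collections import Counter

# Exam scores of the 45 applicants, copied from the problem statement.
scores = [
    39, 41, 40, 42, 41, 40, 42, 44, 40, 43, 42, 41, 43, 39, 42,
    41, 42, 39, 41, 37, 43, 41, 38, 43, 42, 41, 40, 41, 38, 44,
    40, 39, 41, 40, 42, 40, 41, 42, 40, 43, 38, 39, 41, 41, 42,
]

n = len(scores)
counts = Counter(scores)

# Discrete variation series: (variant, absolute frequency, relative frequency),
# with the variants listed in ascending order.
series = [(x, counts[x], counts[x] / n) for x in sorted(counts)]

for x, freq, rel in series:
    print(f"x = {x}   n_i = {freq:2d}   w_i = {rel:.3f}")
```

The points (x, n_i) from `series` are the vertices of the frequency polygon.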

Let's construct a polygon for this distribution:

Fig. 13.1. Frequency polygon

An interval variation series is used for a large number of observations. To construct such a series, choose the number of intervals of the characteristic and set the interval width. The more groups there are, the narrower each interval. The number of groups in a variation series can be found using the Sturges formula $k = 1 + 3.322 \lg n$ (k is the number of groups, n is the sample size), and the width of the interval is

$h = (x_{max} - x_{min}) / k$

where $x_{max}$ is the maximum and $x_{min}$ the minimum value of the variant, and their difference $R = x_{max} - x_{min}$ is called the range of variation.

A sample of 100 people from the population of all medical university students is being studied.

Solution: Let's calculate the number of groups: $k = 1 + 3.322 \lg 100 = 1 + 3.322 \cdot 2 \approx 7.64$. Thus, to compile an interval series, it is better to divide this sample into 7 or 8 groups. The set of groups into which the observation results are divided, together with the frequency of observation results in each group, is called the statistical totality.
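The calculation above can be reproduced with a short Python sketch. The function names `sturges_groups` and `interval_width` are hypothetical helpers introduced for this illustration; the formula used is the common form of Sturges' rule with the decimal logarithm.

```python
import math

def sturges_groups(n):
    """Number of groups by Sturges' formula: k = 1 + 3.322 * lg(n)."""
    return 1 + 3.322 * math.log10(n)

def interval_width(values, k):
    """Interval width h = (x_max - x_min) / k for a chosen number of groups k."""
    return (max(values) - min(values)) / k

k = sturges_groups(100)
print(round(k, 2))  # the result is then rounded to a whole number of groups
```

Because the formula rarely gives a whole number, the choice between the two neighboring integers (here 7 or 8) is left to the researcher.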

To visually represent the statistical distribution, use a histogram.

Frequency histogram is a stepped figure consisting of adjacent rectangles built on one straight line, the bases of which are identical and equal to the width of the interval, and the height is equal to either the frequency of falling into the interval or the relative frequency ω i.

Observations of the number of particles entering the Geiger counter within a minute gave the following results:

21 30 39 31 42 34 36 30 28 30 33 24 31 40 31 33 31 27 31 45 31 34 27 30 48 30 28 30 33 46 43 30 33 28 31 27 31 36 51 34 31 36 34 37 28 30 39 31 42 37.

Based on these data, construct an interval variation series with equal intervals (I interval 20-24; II interval 24-28, etc.) and draw a histogram.

Solution: n = 50
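The grouping step can be sketched in Python as follows. The text does not say which interval a boundary value such as 24 or 28 belongs to, so this sketch assumes half-open intervals [a, b), where a boundary value falls into the interval it starts; the helper `interval_series` is a name introduced here for illustration.

```python
# Particle counts per minute, copied from the problem statement.
particles = [
    21, 30, 39, 31, 42, 34, 36, 30, 28, 30,
    33, 24, 31, 40, 31, 33, 31, 27, 31, 45,
    31, 34, 27, 30, 48, 30, 28, 30, 33, 46,
    43, 30, 33, 28, 31, 27, 31, 36, 51, 34,
    31, 36, 34, 37, 28, 30, 39, 31, 42, 37,
]

def interval_series(values, lower, width, k):
    """Count values in k equal intervals [a, b) starting at `lower`.

    Assumption: a value equal to a boundary is counted in the interval
    that begins at that boundary.
    """
    freqs = [0] * k
    for v in values:
        freqs[(v - lower) // width] += 1
    return [(lower + i * width, lower + (i + 1) * width, f)
            for i, f in enumerate(freqs)]

# Intervals 20-24, 24-28, ..., 48-52 cover the whole range of the data.
table = interval_series(particles, lower=20, width=4, k=8)
for a, b, f in table:
    print(f"{a}-{b}: n_i = {f}, w_i = {f / len(particles):.2f}")
```

With the opposite boundary convention, (a, b], the frequencies of neighboring intervals would shift for the values 24, 28, 36, 40 and 48.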

The histogram of this distribution looks like:

Fig. 13.2. Distribution histogram

Task options

№ 13.1. The voltage in the electrical network was measured every hour. The following values (in volts) were obtained:

227 219 215 230 232 223 220 222 218 219 222 221 227 226 226 209 211 215 218 220 216 220 220 221 225 224 212 217 219 220.

Construct a statistical distribution and draw a polygon.

№ 13.2. Observations of blood sugar in 50 people gave the following results:

3.94 3.84 3.86 4.06 3.67 3.97 3.76 3.61 3.96 4.04

3.82 3.94 3.98 3.57 3.87 4.07 3.99 3.69 3.76 3.71

3.81 3.71 4.16 3.76 4.00 3.46 4.08 3.88 4.01 3.93

3.92 3.89 4.02 4.17 3.72 4.09 3.78 4.02 3.73 3.52

3.91 3.62 4.18 4.26 4.03 4.14 3.72 4.33 3.82 4.03

Based on these data, construct an interval variation series with equal intervals (I - 3.45-3.55; II - 3.55-3.65, etc.) and represent it graphically as a histogram.

№ 13.3. Construct a polygon of frequency distributions of erythrocyte sedimentation rate (ESR) for 100 people.

Methods of mathematical statistics are used, as a rule, at every stage of analyzing research materials, both to choose a strategy for solving problems from specific sample data and to evaluate the results obtained. In this study they were used to process the collected material. Mathematical processing makes it possible to identify and evaluate the quantitative parameters of objective information and to analyze and present them in various ratios and dependencies. It allows one to measure the variation of values in collected materials that contain quantitative information about a set of cases, some of which confirm the proposed connections while others do not; to assess the reliability of quantitative differences between selected sets of cases; and to obtain other mathematical characteristics needed to interpret the facts correctly. The reliability of the differences obtained in the study was assessed with Student's t-test.

The following values were calculated.

1. Arithmetic mean of the sample.

Characterizes the average value of the population under consideration. Let $x_1, x_2, \ldots, x_n$ denote the measurement results. Then:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

where $\sum$ denotes the sum of all values $x_i$ as the index i runs from 1 to n.

2. Mean square deviation (standard deviation), characterizing the dispersion (scatter) of the population under consideration about the arithmetic mean.

$\sigma = (x_{max} - x_{min}) / k$

where $\sigma$ is the standard deviation;

$x_{max}$ is the maximum value in the table;

$x_{min}$ is the minimum value in the table;

k is a coefficient.

3. Standard error of the arithmetic mean or representativeness error (m). The standard error of the arithmetic mean characterizes the degree of deviation of the sample arithmetic mean from the arithmetic mean of the population.

The standard error of the arithmetic mean is calculated using the formula:

$m = \sigma / \sqrt{n}$

where $\sigma$ is the standard deviation of the measurement results,

n is the sample size. The smaller m is, the more stable the results.

4. Student's t-test.

$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{m_1^2 + m_2^2}}$

(in the numerator is the difference between the average values of the two groups; in the denominator is the square root of the sum of the squares of the standard errors of these averages).
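The verbal description of the t statistic translates directly into code. The sketch below is illustrative only: `standard_error` assumes the common form m = sigma / sqrt(n), the function names are hypothetical, and the numbers in the usage line are made up rather than taken from the study.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean, assuming m = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def t_statistic(mean_1, m_1, mean_2, m_2):
    """Difference of two group means divided by the square root of the
    sum of the squared standard errors of those means."""
    return (mean_1 - mean_2) / math.sqrt(m_1 ** 2 + m_2 ** 2)

# Made-up group means and standard errors, for illustration only.
t_emp = t_statistic(5.0, 0.3, 4.0, 0.4)
print(round(t_emp, 2))
```

The computed value is then compared with the critical value from Student's distribution table for the chosen significance level and degrees of freedom.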

The results of the study were processed in Microsoft Excel.

Organization of the study

We carried out the study according to generally accepted rules, in 3 stages.

At the first stage, the obtained material on the research problem under consideration was collected and analyzed. The subject of scientific research was formed. The analysis of the literature at this stage made it possible to specify the purpose and objectives of the study. Initial testing of 30m running technique was carried out.

At the third stage, the material obtained as a result of scientific research was systematized and all available information on the research problem was summarized.

The experimental study was carried out at the State Educational Institution “Lyakhovichi Secondary School”; in total, the sample consisted of 20 sixth-grade students (11-12 years old).

Chapter 3. Analysis of research results

As a result of the pedagogical experiment, we identified the initial level of 30 m running technique for students in the control and experimental groups (Appendices 1-2). Statistical processing of the results obtained allowed us to obtain the following data (Table 6).

Table 6. Initial level of running quality

As can be seen from Table 6, the average scores of the participants in the control and experimental groups do not differ statistically: the average score was 3.6 points in the experimental group and 3.7 points in the control group. Student's t-test gave $t_{emp} = 0.3$ against $t_{crit} = 2.1$ (P > 0.05), so the small difference between the groups at initial testing is random in nature and does not reflect training. Although the running quality indicators of the control group were slightly higher than those of the experimental group, no statistically significant differences were found, which indicates that the two groups were comparable in 30 m running technique at the start of the experiment.

During the experiment, indicators characterizing the efficiency of running technique improved in both groups. However, this improvement was of a different nature in different groups of experiment participants. As a result of the training, a natural small increase in indicators in the control group was revealed (3.8 points). As can be seen from Appendix 2, a large increase in indicators was revealed in the experimental group. The students studied according to the program we proposed, which significantly improved their performance.

Table 7. Changes in running quality among subjects in the experimental group

During the experiment, we found that the increased loads in the experimental group produced greater improvements in the development of speed than in the control group.

In adolescence, it is advisable to develop speed primarily through physical education tools aimed at increasing the frequency of movements. At the age of 12-15 years, speed abilities increase mainly as a result of speed-strength and power exercises, which we used in physical education lessons and in the extracurricular basketball and athletics sections.

Classes in the experimental group followed a strict staged progression of complexity and motor experience, and errors were corrected in a timely manner. As the analysis of the data showed, the experimental teaching method produced a significant change in the quality of running technique ($t_{emp} = 2.4$). Analysis of the results obtained in the experimental group, and their comparison with the data obtained in the control group under generally accepted teaching methods, gives grounds to assert that the proposed methodology increases the effectiveness of training.

Thus, at the stage of improving 30 m running technique at school, we identified the dynamics of the test indicators in the experimental and control groups. After the experiment, the quality of the technique in the experimental group increased to 4.9 points (t = 3.3; P < 0.05). By the end of the experiment, the quality of running technique in the experimental group was higher than in the control group.

The data obtained as a result of the experiment is characterized by variability, which can be caused by a random error: the error of the measuring device, the heterogeneity of the samples, etc. After collecting a large amount of homogeneous data, the experimenter needs to process it to extract the most accurate information possible about the quantity under consideration. To process large amounts of measurement data, observations, etc., which can be obtained during an experiment, it is convenient to use methods of mathematical statistics.

Mathematical statistics is inextricably linked with probability theory, but there is a significant difference between the two. Probability theory works with already known distributions of random variables, from which the probabilities of events, the mathematical expectation, etc. are calculated. The task of mathematical statistics is to obtain the most reliable information about the distribution of a random variable from experimental data.

Typical areas of mathematical statistics:

sampling theory;
estimation theory;
testing statistical hypotheses;
regression analysis;
analysis of variance.

Methods of mathematical statistics

Methods for assessing and testing hypotheses are based on probabilistic and hyper-random models of data origin.

Mathematical statistics evaluates parameters and functions of them that represent important characteristics of distributions (median, expected value, standard deviation, quantiles, etc.), densities and distribution functions, etc. Point and interval estimates are used.

Modern mathematical statistics includes a large section, statistical sequential analysis, in which the array of observations is formed one observation at a time.

Mathematical statistics also contains a general theory of hypothesis testing and a large number of methods for testing specific hypotheses (for example, about the symmetry of a distribution, about the values of parameters and characteristics, about the agreement of the empirical distribution function with a given distribution function, about homogeneity, i.e. the coincidence of characteristics or distribution functions in two samples, etc.).

The theory of sample surveys, which deals with constructing adequate estimation and hypothesis-testing methods and with the properties of different sampling schemes, is a branch of mathematical statistics of great importance. Methods of mathematical statistics directly use the following basic concepts.

Sample

Definition 1

A sample is the data obtained in the course of an experiment.

For example, the measured flight ranges of bullets fired from the same gun or from a group of similar guns.

Empirical distribution function

Note 1

The distribution function makes it possible to express all the most important characteristics of a random variable.

Mathematical statistics distinguishes the theoretical distribution function (not known in advance) from the empirical distribution function.

The empirical function is determined from experimental (empirical) data, i.e. from the sample.
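As an illustration, an empirical distribution function can be built from a sample in a few lines of Python. This sketch assumes the convention F(x) = (number of observations not exceeding x) / n; some textbooks use a strict inequality instead. The name `empirical_cdf` and the sample values are introduced here for the example only.

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return a function F with F(x) = share of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right gives the number of ordered values that are <= x.
        return bisect_right(ordered, x) / n

    return F

# Made-up sample, for illustration only.
F = empirical_cdf([3, 1, 2, 2])
print(F(0), F(2), F(3))
```

The resulting function is a step function that rises by 1/n at each observed value (by more where values repeat).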

Histogram

Histograms are used for a visual, though rather approximate, representation of an unknown distribution.

A histogram is a graphical representation of the data distribution.

To obtain a high-quality histogram, adhere to the following rules:

The number of intervals must be significantly smaller than the sample size.
Each interval must contain a sufficient number of sample elements.

If the sample is very large, the range of sample values is often divided into equal parts.

Sample mean and sample variance

Using these concepts, you can obtain an estimate of the necessary numerical characteristics of an unknown distribution without resorting to constructing a distribution function, histogram, etc.

Odessa National Medical University Department of Biophysics, Informatics and Medical Equipment Guidelines for 1st year students on the topic “Fundamentals of Mathematical Statistics” Odessa 2009

1. Topic: “Fundamentals of mathematical statistics.”

2. Relevance of the topic.

Mathematical statistics is a branch of mathematics that studies methods of collecting, systematizing and processing the results of observations of mass random events in order to clarify and practically apply existing patterns. Methods of mathematical statistics have found wide application in clinical medicine and healthcare. They are used, in particular, in the development of mathematical methods of medical diagnostics, in the theory of epidemics, in planning and processing the results of a medical experiment, in the organization of healthcare. Statistical concepts are used, consciously or unconsciously, in decision making in such matters as clinical diagnosis, predicting the course of disease in an individual patient, predicting the likely outcome of programs in a given population, and selecting the appropriate program in a particular setting. Familiarity with the ideas and methods of mathematical statistics is an essential element of the professional education of every health care worker.

3. Lesson goals. The general goal of the lesson is to teach students to consciously use mathematical statistics when solving biomedical problems. Specific lesson goals:

to acquaint students with the basic ideas, concepts and methods of mathematical statistics, paying attention mainly to issues related to processing the results of observations of mass random events in order to clarify and practically apply existing patterns;
to teach students to consciously apply the basic concepts of mathematical statistics when solving simple problems that arise in the professional activity of a doctor.

The student must know (level 2):

definition of class frequency (absolute and relative)
definition of the general population and the sample, sample size
point and interval estimation
confidence interval and confidence level
definition of mode, median and sample mean
definition of range, interquartile range, quartile deviation
definition of mean absolute deviation
definition of sample covariance and variance
definition of sample standard deviation and coefficient of variation
definition of sample regression coefficients
empirical linear regression equations
definition of the sample correlation coefficient.

The student must master the basic skills of calculating (level 3):

mode, median and sample mean
range, interquartile range, quartile deviation
mean absolute deviation
sample covariance and variance
sample standard deviation and coefficient of variation
confidence interval for expectation and variance
sample regression coefficients
sample correlation coefficient.

4. Ways to achieve the goals of the lesson. To achieve the goals of the lesson, you need the following background knowledge:

Definition of distribution, distribution series and distribution polygon of a discrete random variable
Definition of functional dependence between random variables
Definition of correlation between random variables

You also need to be able to calculate the probabilities of incompatible and compatible events using the appropriate rules.

5. A task for students to test their initial level of knowledge. Control questions

Definition of a random event, its relative frequency and probability.
Theorem for adding the probabilities of incompatible events
Theorem for adding the probabilities of joint events
Theorem for multiplying the probabilities of independent events
Theorem for multiplying the probabilities of dependent events
Total probability theorem
Bayes' theorem
Definition of random variables: discrete and continuous
Definition of distribution, distribution series and distribution polygon of a discrete random variable
Definition of the distribution function
Definition of measures of the position of the distribution center
Definition of measures of variability of random variable values
Definition of the distribution density and the distribution curve of a continuous random variable
Definition of functional dependence between random variables
Definition of correlation between random variables
Definition of regression, the regression equation and regression lines
Definition of covariance and correlation coefficient
Definition of linear regression equation.

6. Information for strengthening initial knowledge and skills can be found in the manuals:

Zhumatiy P.G. Lecture “Probability Theory”. Odessa, 2009.
Zhumatiy P.G. “Fundamentals of probability theory.” Odessa, 2009.
Zhumatiy P.G., Senitska Y.R. Elements of probability theory. Guidelines for medical institute students. Odessa, 1981.
Chaly O.V., Agapov B.T., Tsekhmister Y.V. Medical and biological physics. Kyiv, 2004.

7. Content of the educational material on this topic, highlighting the key issues.

Mathematical statistics is a branch of mathematics that studies methods of collecting, systematizing, processing, depicting, analyzing and interpreting observational results in order to identify existing patterns.

The use of statistics in health care is necessary at both the community and individual patient levels. Medicine deals with individuals who differ from each other in many characteristics, and the values by which a person can be considered healthy vary from one individual to another. No two patients or groups of patients are exactly alike, so decisions that affect individual patients or populations must be made based on experience gained from other patients or populations with similar biological characteristics. It is necessary to realize that, given the existing discrepancies, these decisions cannot be absolutely accurate: they are always associated with some uncertainty. This is precisely the probabilistic nature of medicine.

Some examples of the application of statistical methods in medicine:

interpretation of variation (variability of the characteristics of an organism when deciding what value of one or another characteristic will be ideal, normal, average, etc., makes it necessary to use appropriate statistical methods).

diagnosis of diseases in individual patients and assessment of the health status of a population group.

predicting the outcome of a disease in individual patients or the possible outcome of a control program for a particular disease in any population group.

selecting an appropriate intervention for a patient or population group.

planning and conducting medical research, analyzing and publishing results, reading and critically evaluating them.

health care planning and management.

Useful health information is usually hidden in masses of raw data. It is necessary to concentrate the information contained in them and present the data so that the structure of variation is clearly visible, and then select specific methods of analysis.

Data presentation provides an introduction to the following concepts and terms:

variation series (ordered arrangement) - a simple arrangement of individual observations of a quantity.

class is one of the intervals into which the entire range of values of a random variable is divided.

class limits - the values that bound a class, for example 2.5 and 3.0, the lower and upper limits of the class 2.5 - 3.0.

(absolute) class frequency - the number of observations in a class.

relative class frequency - the absolute frequency of a class, expressed as a fraction of the total number of observations.

cumulative (accumulated) frequency of a class - the number of observations that is equal to the sum of the frequencies of all previous classes and this class.

bar chart - a graphical representation of data frequencies for nominal classes using columns whose heights are directly proportional to the class frequencies.

pie chart - a graphical representation of data frequencies for nominal classes using sectors of a circle, the areas of which are directly proportional to the class frequencies.

histogram - a graphical representation of the frequency distribution of quantitative data with areas of rectangles directly proportional to the class frequencies.

frequency polygon - a graph of the frequency distribution of quantitative data; the point corresponding to the class frequency is located above the middle of the interval, each two adjacent points are connected by a straight line segment.

ogive (cumulative curve) - a graph of the distribution of cumulative relative frequencies.
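The relationship between absolute, relative and cumulative class frequencies can be sketched in Python. The class frequencies below are hypothetical values chosen only for illustration, and `frequency_table` is a helper name introduced here.

```python
from itertools import accumulate

def frequency_table(class_freqs):
    """Return relative, cumulative and cumulative relative frequencies
    for a list of absolute class frequencies."""
    n = sum(class_freqs)
    relative = [f / n for f in class_freqs]
    cumulative = list(accumulate(class_freqs))
    cumulative_relative = [c / n for c in cumulative]
    return relative, cumulative, cumulative_relative

# Hypothetical absolute frequencies of five classes (n = 20).
rel, cum, cum_rel = frequency_table([2, 5, 8, 4, 1])
print(rel)      # heights for a relative-frequency polygon
print(cum)      # cumulative (accumulated) frequencies
print(cum_rel)  # ordinates of the ogive
```

The last cumulative frequency always equals the total number of observations, so the last cumulative relative frequency equals 1.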

All medical data has inherent variability, so analysis of measurement results is based on the study of information about what values the random variable under study took.

The set of all possible values of a random variable is called the general population.

The part of the general population registered as a result of tests is called a sample.

The number of observations included in the sample is called the sample size (usually denoted n).

The task of the sampling method is to use the sample obtained to make a correct estimate of the random variable being studied. Therefore, the main requirement for a sample is that it reflect all the features of the general population as fully as possible. A sample that satisfies this requirement is called representative. The representativeness of the sample determines the quality of the estimate, that is, how closely the estimate corresponds to the parameter it characterizes.

When estimating the parameters of a population from a sample (parametric estimation), the following concepts are used:

point estimation - an estimate of a population parameter in the form of a single value that it can take with the highest probability.

interval estimation - estimation of a population parameter in the form of an interval of values that has a given probability of covering its true value.

Interval estimation uses the following concepts:

confidence interval - an interval of values that has a given probability of covering the true value of the population parameter during interval estimation.

confidence level (confidence probability) - the probability with which the confidence interval covers the true value of the population parameter.

confidence limits - the lower and upper limits of the confidence interval.

Conclusions obtained by methods of mathematical statistics are always based on a limited, selective number of observations, so it is natural that the results for another sample may be different. This circumstance determines the probabilistic nature of the conclusions of mathematical statistics and, as a consequence, the widespread use of probability theory in the practice of statistical research.

A typical statistical research path is:

having estimated quantities or the relationships between them from observational data, the researcher assumes that the phenomenon being studied can be described by one or another stochastic model

using statistical methods, this assumption can be confirmed or rejected; upon confirmation, the goal is achieved - a model has been found that describes the patterns under study; otherwise, work continues, putting forward and testing a new hypothesis.

Definition of sample statistical estimates:

mode - the value that occurs most often in the sample

median - central (middle) value of the variation series

range R - the difference between the largest and smallest values in a series of observations

percentiles - values in a variation series that divide the distribution into 100 equal parts (thus, the median is the fiftieth percentile)

first quartile - 25th percentile

third quartile - 75th percentile

interquartile range - the difference between the third and first quartiles (covers the central 50% of observations)

quartile deviation - half of the interquartile range

sample mean - arithmetic mean of all sample values (sample estimate of mathematical expectation)

mean absolute deviation - the sum of the absolute deviations from a chosen reference point (i.e. ignoring sign), divided by the sample size

the mean absolute deviation from the sample mean is calculated using the formula

sample variance (X) - (sample variance estimate) is given by

sample covariance -- (sample estimate of covariance K ( X,Y )) equals

the sample regression coefficient of Y on X (sample estimate of the regression coefficient of Y on X) is equal to

the empirical linear regression equation of Y on X has the form

the sample regression coefficient of X on Y (sample estimate of the regression coefficient of X on Y) is equal to

the empirical linear regression equation of X on Y has the form

sample standard deviation s(X) - (sample estimate of standard deviation) equals the square root of the sample variance

sample correlation coefficient - (sample estimate of the correlation coefficient) equals

sample coefficient of variation  - (sample estimate of coefficient of variation CV) is equal to

.
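In standard notation, the estimates defined above can be written as follows. This is a sketch under stated assumptions: the variance and covariance are given with the unbiased n - 1 denominator (some texts divide by n instead), and the symbols $\tilde{K}(X,Y)$, $b_{Y/X}$, $b_{X/Y}$ and $r$ are notation chosen here for illustration. The regression and correlation coefficients do not depend on the choice of denominator, because it cancels in the ratios.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i
  && \text{sample mean} \\
\mathrm{MAD} &= \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
  && \text{mean absolute deviation from the sample mean} \\
s^2(X) &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
  && \text{sample variance (assumed } n-1 \text{ form)} \\
\tilde{K}(X,Y) &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  && \text{sample covariance (assumed } n-1 \text{ form)} \\
b_{Y/X} &= \frac{\tilde{K}(X,Y)}{s^2(X)}, \qquad
  y - \bar{y} = b_{Y/X}\,(x - \bar{x})
  && \text{regression of } Y \text{ on } X \\
b_{X/Y} &= \frac{\tilde{K}(X,Y)}{s^2(Y)}, \qquad
  x - \bar{x} = b_{X/Y}\,(y - \bar{y})
  && \text{regression of } X \text{ on } Y \\
s(X) &= \sqrt{s^2(X)}
  && \text{sample standard deviation} \\
r &= \frac{\tilde{K}(X,Y)}{s(X)\,s(Y)}
  && \text{sample correlation coefficient} \\
CV &= \frac{s(X)}{\bar{x}}
  && \text{sample coefficient of variation}
\end{align*}
```

The coefficient of variation is often reported as a percentage, i.e. multiplied by 100%.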

8. Task for independent preparation of students.

8.1 Task for independent study of material on the topic.

8.1.1 Practical calculation of sample estimates

Practical calculation of sample point estimates

Example 1.

The duration of the disease (in days) in 20 cases of pneumonia was:

10, 11, 6, 16, 7, 13, 15, 8, 9, 10, 11, 13, 7, 8, 13, 15, 16, 13, 14, 15

Determine the mode, median, range, interquartile range, sample mean, mean absolute deviation from the sample mean, sample variance, sample coefficient of variation.

Solution.

The variation series for the sample has the form

6, 7, 7, 8, 8, 9, 10, 10, 11, 11, 13, 13, 13, 13, 14, 15, 15, 15, 16, 16

Mode

The most common value in the sample is 13, so the mode of the sample is 13.

Median

When a variation series contains an even number of observations, the median is equal to the average of the two central terms of the series, in this case 11 and 13, so the median is 12.

Range

The minimum value in the sample is 6 and the maximum is 16, so R = 16 - 6 = 10.

Interquartile range, quartile deviation

In the variation series, a quarter of all data are less than or equal to 8, so the first quartile is 8, and 75% of all data are less than or equal to 14, so the third quartile is 14. The interquartile range is therefore 14 - 8 = 6, and the quartile deviation is 3.

Sample mean

The arithmetic mean of all sample values is equal to

.

Mean absolute deviation from sample mean

.

Sample variance

Sample standard deviation

.

Sample coefficient of variation

.
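The point estimates of Example 1 can be checked with a short Python script. It is a sketch under two assumptions that the text does not settle: the sample variance uses the n - 1 denominator (as `statistics.variance` does), and a quartile is taken as the smallest value such that at least the given share of the data does not exceed it, which matches the reasoning used in the example. The helper `quartile` is a name introduced here.

```python
import math
import statistics

# Duration of the disease (days) in 20 cases of pneumonia, from Example 1.
days = [10, 11, 6, 16, 7, 13, 15, 8, 9, 10,
        11, 13, 7, 8, 13, 15, 16, 13, 14, 15]
n = len(days)
ordered = sorted(days)  # the variation series

mode = statistics.mode(days)
median = statistics.median(days)
value_range = max(days) - min(days)

def quartile(ordered_values, fraction):
    """Smallest value with at least `fraction` of the data <= it."""
    position = math.ceil(fraction * len(ordered_values))
    return ordered_values[position - 1]

q1 = quartile(ordered, 0.25)
q3 = quartile(ordered, 0.75)
iqr = q3 - q1
quartile_deviation = iqr / 2

mean = statistics.mean(days)
mad = sum(abs(x - mean) for x in days) / n
variance = statistics.variance(days)   # assumption: n - 1 denominator
std_dev = math.sqrt(variance)
cv = std_dev / mean

print(mode, median, value_range, q1, q3, iqr, quartile_deviation)
print(mean, round(mad, 2), round(variance, 2), round(std_dev, 2), round(cv, 3))
```

For these data the sample mean is 11.5 days and the mean absolute deviation from it is 2.8 days; the variance, standard deviation and coefficient of variation depend on the denominator convention chosen.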

In the following example, we will consider the simplest means of studying the stochastic dependence between two random variables.

Example 2.

When examining a group of patients, data were obtained on height H (cm) and circulating blood volume V (l):

Find empirical linear regression equations.

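Solution.

First, calculate the sample means:

$\bar{H} = \dfrac{1}{n}\sum_{i=1}^{n} H_i, \qquad \bar{V} = \dfrac{1}{n}\sum_{i=1}^{n} V_i$.

Second, calculate:

the sample variance of H

$s^2(H) = \dfrac{1}{n-1}\sum_{i=1}^{n} (H_i - \bar{H})^2$

the sample variance of V

$s^2(V) = \dfrac{1}{n-1}\sum_{i=1}^{n} (V_i - \bar{V})^2$

the sample covariance

$K(H,V) = \dfrac{1}{n-1}\sum_{i=1}^{n} (H_i - \bar{H})(V_i - \bar{V})$

Third, calculate the sample regression coefficients:

the sample regression coefficient of V on H

$b_{V/H} = \dfrac{K(H,V)}{s^2(H)}$

the sample regression coefficient of H on V

$b_{H/V} = \dfrac{K(H,V)}{s^2(V)}$.

Fourth, write down the required equations:

the empirical linear regression equation of V on H has the form

$V - \bar{V} = b_{V/H}\,(H - \bar{H})$

the empirical linear regression equation of H on V has the form

$H - \bar{H} = b_{H/V}\,(V - \bar{V})$.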
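The four steps of Example 2 can be sketched in Python as follows. The function names are illustrative, the n - 1 divisor for the covariance is an assumption chosen to match the sample variance, and the height and volume values are hypothetical numbers used only to make the script runnable; they are not the patient data of the example.

```python
import statistics


def regression_coefficients(x, y):
    """Return (b_yx, b_xy): sample regression coefficients of y on x and of x on y."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    # Sample covariance with divisor n - 1 (assumed, to match the sample variance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    b_yx = cov / statistics.variance(x)
    b_xy = cov / statistics.variance(y)
    return b_yx, b_xy


def regression_line(x, y):
    """Intercept and slope of the empirical regression of y on x."""
    b_yx, _ = regression_coefficients(x, y)
    intercept = statistics.mean(y) - b_yx * statistics.mean(x)
    return intercept, b_yx


# Hypothetical illustrative values, NOT the data of Example 2.
height_cm = [160, 165, 170, 175, 180]
volume_l = [4.2, 4.6, 4.7, 5.1, 5.4]

b_vh, b_hv = regression_coefficients(height_cm, volume_l)
a_v, slope_v = regression_line(height_cm, volume_l)
a_h, slope_h = regression_line(volume_l, height_cm)

print(f"V = {a_v:.3f} + {slope_v:.3f} * H")
print(f"H = {a_h:.3f} + {slope_h:.3f} * V")
```

The slope-intercept form printed here is the same line as the centred form (deviation of V from its mean equals the coefficient times the deviation of H from its mean), just rearranged.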

Example 3.

Using the conditions and results of Example 2, calculate the correlation coefficient and test, at a 95% confidence level, whether a correlation exists between human height and circulating blood volume.

Solution.

The correlation coefficient is related to the regression coefficients by a practically useful formula:

$\rho = \pm\sqrt{\beta_{Y/X}\,\beta_{X/Y}}$,

where the sign coincides with the common sign of the two regression coefficients.

For the sample estimate of the correlation coefficient, this formula has the form

$r = \pm\sqrt{b_{V/H}\,b_{H/V}}$.

Using the values of the sample regression coefficients $b_{V/H}$ and $b_{H/V}$ from Example 2, we obtain the sample correlation coefficient r.

The significance of the correlation between random variables (assuming a normal distribution for each of them) is tested as follows:

calculate the value of T

$T = \dfrac{|r|\sqrt{n-2}}{\sqrt{1-r^2}}$

find the Student coefficient t in the Student distribution table for the chosen confidence level

the existence of a correlation between the random variables is confirmed when the inequality

$T > t$

holds.

Since 3.5 > 2.26, the existence of a correlation between the patient's height and the circulating blood volume can be considered established at the 95% confidence level.
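The check in Example 3 can be sketched in Python. The sketch assumes the commonly used test statistic T = |r|·sqrt(n - 2)/sqrt(1 - r²); the original formula is not preserved in the text, so this choice is an assumption. The critical value is passed in as a number read from a Student table, and the regression coefficients and sample size below are hypothetical inputs.

```python
import math


def correlation_from_regression(b_yx, b_xy):
    """r = ±sqrt(b_yx * b_xy); the sign is the common sign of the two coefficients."""
    if b_yx * b_xy < 0:
        raise ValueError("regression coefficients must have the same sign")
    r = math.sqrt(b_yx * b_xy)
    return r if b_yx >= 0 else -r


def t_statistic(r, n):
    """Assumed test statistic: |r| * sqrt(n - 2) / sqrt(1 - r^2)."""
    return abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def correlation_is_significant(r, n, t_critical):
    """Compare the statistic with a critical value taken from a Student table."""
    return t_statistic(r, n) > t_critical


# Hypothetical inputs: the coefficients and n are illustrative only;
# 2.26 is the tabulated coefficient quoted in Example 3.
r = correlation_from_regression(0.058, 10.5)
print(round(r, 3), round(t_statistic(r, 11), 2),
      correlation_is_significant(r, n=11, t_critical=2.26))
```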

Interval estimates for mathematical expectation and variance

If the random variable has a normal distribution, interval estimates for the mathematical expectation and variance are calculated in the following sequence:

1. find the sample mean $\bar{x}$;

2. calculate the sample variance $s^2$ and the sample standard deviation s;

3. in the Student distribution table, using the confidence level $\gamma$ and the sample size n, find the Student coefficient $t_{\gamma,n}$;

4. write the confidence interval for the mathematical expectation in the form

$\bar{x} - t_{\gamma,n}\dfrac{s}{\sqrt{n}} < M(X) < \bar{x} + t_{\gamma,n}\dfrac{s}{\sqrt{n}}$;

5. in the $\chi^2$ distribution table, using the confidence level $\gamma$ and the sample size n, find the coefficients

$\chi^2_1$ and $\chi^2_2$ ($\chi^2_1 < \chi^2_2$);

6. write the confidence interval for the variance in the form

$\dfrac{(n-1)s^2}{\chi^2_2} < D(X) < \dfrac{(n-1)s^2}{\chi^2_1}$.

The width of the confidence interval, the confidence level and the sample size n depend on each other. Indeed, the ratio

$\dfrac{t_{\gamma,n}}{\sqrt{n}}$

decreases with increasing n, so for a fixed width of the confidence interval, $\gamma$ increases as n increases. At a fixed confidence level, the width of the confidence interval decreases as the sample size increases. When planning medical research, this relationship is used to determine the minimum sample size that provides the confidence interval width and confidence level required by the problem being solved.
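As a rough illustration of this planning step, the sketch below estimates the minimum sample size for a desired half-width of the interval for the mean. It is a simplification under a stated assumption: the Student coefficient is replaced by the normal quantile, which is reasonable only for moderately large n and otherwise gives a somewhat optimistic first estimate. The function name and the planning inputs are hypothetical.

```python
import math
from statistics import NormalDist


def min_sample_size(s, half_width, confidence=0.95):
    """Smallest n with z * s / sqrt(n) <= half_width (normal approximation)."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided quantile
    return math.ceil((z * s / half_width) ** 2)


# Hypothetical planning inputs: standard deviation 3.27 days from a pilot
# sample, desired half-width of the interval 1 day.
print(min_sample_size(3.27, 1.0))
print(min_sample_size(3.27, 0.5))
```

Halving the desired half-width roughly quadruples the required sample size, which reflects the square-root dependence on n.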

Example 5.

Using the conditions and results of Example 1, find the interval estimates of the mathematical expectation and variance for the 95% confidence level.

Solution.

In Example 1, the point estimates of the mathematical expectation (sample mean = 12), variance (sample variance = 10.7) and standard deviation (sample standard deviation = 3.27) were determined. The sample size is n = 20.

From the Student distribution table we find the value of the coefficient

$t = 2.09$ (for $\gamma$ = 95%, n = 20).

Next, we calculate the half-width d of the confidence interval

$d = t\,\dfrac{s}{\sqrt{n}} = 2.09 \cdot \dfrac{3.27}{\sqrt{20}} \approx 1.5$

and write down the interval estimate of the mathematical expectation

10.5 < M(X) < 13.5 at $\gamma$ = 95%.

From the Pearson chi-square distribution table we find the coefficients

$\chi^2_1 = 8.91, \qquad \chi^2_2 = 32.85$,

calculate the lower and upper confidence bounds

$\dfrac{(n-1)s^2}{\chi^2_2} = \dfrac{19 \cdot 10.7}{32.85} \approx 6.2, \qquad \dfrac{(n-1)s^2}{\chi^2_1} = \dfrac{19 \cdot 10.7}{8.91} \approx 23$

and write the interval estimate for the variance in the form

6.2 < D(X) < 23 at $\gamma$ = 95%.
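The interval estimates of Example 5 can be reproduced with a few lines of Python. The sketch takes the point estimates quoted in the example (mean 12, variance 10.7, n = 20) and treats the table coefficients as given constants: 2.09 for the Student distribution and 8.91 and 32.85 for the chi-square distribution with 19 degrees of freedom. These are standard table values supplied here as assumptions, since the coefficients themselves are not preserved in the text; the function names are illustrative.

```python
import math


def mean_interval(mean, s, n, t_coeff):
    """Confidence interval for the expectation: mean ± t * s / sqrt(n)."""
    d = t_coeff * s / math.sqrt(n)
    return mean - d, mean + d


def variance_interval(variance, n, chi2_low, chi2_high):
    """Confidence interval for the variance from chi-square table coefficients."""
    return (n - 1) * variance / chi2_high, (n - 1) * variance / chi2_low


n = 20
mean, variance = 12, 10.7          # point estimates quoted in Example 5
s = math.sqrt(variance)

# Assumed table values for 95% confidence and 19 degrees of freedom.
T_COEFF = 2.09
CHI2_LOW, CHI2_HIGH = 8.91, 32.85

lo_m, hi_m = mean_interval(mean, s, n, T_COEFF)
lo_v, hi_v = variance_interval(variance, n, CHI2_LOW, CHI2_HIGH)

print("expectation:", round(lo_m, 1), round(hi_m, 1))
print("variance:", round(lo_v, 1), round(hi_v, 1))
```

Note that the interval for the variance is not symmetric about the point estimate 10.7, because the chi-square distribution is skewed.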

8.1.2. Problems to solve independently

For independent solution, problems 5.4 C 1 – 8 are offered (P.G. Zhumatiy. “Mathematical processing of medical and biological data. Problems and examples.” Odessa, 2009, pp. 24-25)

8.1.3. Review questions

Class frequency (absolute and relative).
Population and sample, sample size.
Point and interval estimation.
Confidence interval and confidence level.
Mode, median and sample mean.
Range, interquartile range, quartile deviation.
Average absolute deviation.
Sample covariance and variance.
Sample standard deviation and coefficient of variation.
Sample regression coefficients.
Empirical regression equations.
Calculation of the correlation coefficient and reliability of the correlation.
Construction of interval estimates of normally distributed random variables.

8.2 Basic literature

Zhumatiy P.G. “Mathematical processing of medical and biological data. Tasks and examples.” Odessa, 2009.
Zhumatiy P.G. Lecture “Mathematical statistics”. Odessa, 2009.
Zhumatiy P.G. “Fundamentals of mathematical statistics.” Odessa, 2009.
Zhumatiy P.G., Senitska Y.R. Elements of probability theory. Guidelines for medical institute students. Odessa, 1981.
Chaly O.V., Agapov B.T., Tsekhmister Y.V. Medical and biological physics. Kyiv, 2004.

8.3 Further reading

Remizov O.M. Medical and biological physics. M., “Higher School”, 1999.
Remizov O.M., Isakova N.Kh., Maksina O.G. Collection of problems from medical and biological physics. M., “Higher School”, 1987.

Methodological instructions compiled by Assoc. P. G. Zhumatiy.

RANDOM VARIABLES AND THE LAWS OF THEIR DISTRIBUTION.

A random variable is a quantity whose value depends on a combination of random circumstances. Random variables are divided into discrete and continuous.

A discrete variable is one that takes a countable set of values. (Examples: the number of patients at a doctor's appointment, the number of letters on a page, the number of molecules in a given volume.)

A continuous variable is one that can take any value within a certain interval. (Examples: air temperature, body weight, human height, etc.)

The distribution law of a random variable is the set of possible values of this variable together with the probabilities (or frequencies of occurrence) corresponding to these values.

EXAMPLE:

x	x 1	x 2	x 3	x 4	...	x n
p	p 1	p 2	p 3	p 4	...	p n

x	x 1	x 2	x 3	x 4	...	x n
m	m 1	m 2	m 3	m 4	...	m n

NUMERICAL CHARACTERISTICS OF RANDOM VARIABLES.

In many cases, along with the distribution of a random variable or instead of it, information about these quantities can be provided by numerical parameters called numerical characteristics of a random variable . The most common of them:

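1. Mathematical expectation (mean value) of a random variable is the sum of the products of all its possible values and the probabilities of these values:

$M(X) = \sum_{i=1}^{n} x_i p_i$

2. Variance (dispersion) of a random variable:

$D(X) = \sum_{i=1}^{n} (x_i - M(X))^2\, p_i$

3. Standard deviation:

$\sigma = \sqrt{D(X)}$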
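The three numerical characteristics above can be computed directly from a distribution table. The following is a minimal sketch; the function names and the four-point distribution law are hypothetical and serve only as an illustration.

```python
import math


def expectation(values, probs):
    """Sum of the products of the possible values and their probabilities."""
    return sum(x * p for x, p in zip(values, probs))


def variance(values, probs):
    """Expected squared deviation from the mathematical expectation."""
    m = expectation(values, probs)
    return sum((x - m) ** 2 * p for x, p in zip(values, probs))


# Hypothetical distribution law of a discrete random variable.
values = [0, 1, 2, 3]
probs = [0.1, 0.4, 0.3, 0.2]
if not math.isclose(sum(probs), 1.0):
    raise ValueError("probabilities must sum to 1")

m = expectation(values, probs)
d = variance(values, probs)
sigma = math.sqrt(d)

print(round(m, 4), round(d, 4), round(sigma, 4))
```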

“THREE SIGMA” rule - if a random variable is distributed according to the normal law, then the absolute deviation of its value from the mean value practically never exceeds three standard deviations: $P(|X - M(X)| < 3\sigma) \approx 0.9973$

GAUSS LAW – NORMAL DISTRIBUTION LAW

Quantities distributed according to the normal law (Gauss's law) are encountered often. Its main feature: it is the limiting law that other distribution laws approach.

A random variable is distributed according to the normal law if its probability density has the form:

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - M(X))^2}{2\sigma^2}}$

M(X) - mathematical expectation of the random variable;

$\sigma$ - standard deviation.

The probability density (distribution density function) shows how the probability dP that the random variable falls within an interval dx changes depending on the value of the variable itself:

$f(x) = \dfrac{dP}{dx}$
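The normal density and the probability of staying within a few standard deviations of the mean can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the mean and standard deviation are hypothetical values, and `normal_pdf` and `prob_within` are illustrative helper names.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean, sigma):
    """Density of the normal law with expectation `mean` and standard deviation `sigma`."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def prob_within(k, mean, sigma):
    """Probability that the deviation from the mean is less than k standard deviations."""
    dist = NormalDist(mean, sigma)
    return dist.cdf(mean + k * sigma) - dist.cdf(mean - k * sigma)


mean, sigma = 170.0, 6.0   # hypothetical parameters, e.g. height in cm

# The hand-written density agrees with the library implementation.
print(math.isclose(normal_pdf(175.0, mean, sigma), NormalDist(mean, sigma).pdf(175.0)))

for k in (1, 2, 3):
    print(k, round(prob_within(k, mean, sigma), 4))
```

The result for k = 3 does not depend on the chosen mean and standard deviation, which is why the rule can be stated in general form.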

BASIC CONCEPTS OF MATHEMATICAL STATISTICS

Mathematical statistics - a branch of applied mathematics directly adjacent to probability theory. The main difference between mathematical statistics and probability theory is that mathematical statistics does not deal with operations on distribution laws and numerical characteristics of random variables, but with approximate methods for finding these laws and numerical characteristics from the results of experiments.

The basic concepts of mathematical statistics are:

1. general population;

2. sample;

3. variation series;

4. mode;

5. median;

6. percentile;

7. frequency polygon;

8. histogram.

General population - a large statistical population from which part of the objects is selected for research

(Example: the entire population of a region, the university students of a given city, etc.)

Sample (sample population) - a set of objects selected from the general population.

Variation series - a statistical distribution consisting of variants (values of a random variable) and their corresponding frequencies.

Example:

X,kg

x- value of a random variable (weight of girls aged 10 years);

m- frequency of occurrence.

Mode - the value of the random variable that corresponds to the highest frequency of occurrence. (In the example above, the mode is 24 kg; it occurs more often than the other values: m = 20.)

Median– the value of a random variable that divides the distribution in half: half of the values are located to the right of the median, half (no more) - to the left.

Example:

1, 1, 1, 1, 1, 1, 2, 2, 2, [3], 3, 4, 4, 5, 5, 5, 5, 6, 6, [7], 7, 7, 7, 7, 7, 8, 8, 8, 8, [8], 8, 9, 9, 9, 10, 10, 10, 10, 10, 10

In the example we observe 40 values of a random variable. All values are arranged in ascending order, taking into account the frequency of their occurrence. The values in brackets mark the 10th, 20th and 30th positions. To the right of the bracketed value 7 lie 20 (half) of the 40 values. Therefore, 7 is the median.

To characterize the scatter, we find the values below which 25% and 75% of the measurement results lie. These values are called the 25th and 75th percentiles. If the median divides the distribution in half, the 25th and 75th percentiles cut off a quarter of it from each end. (The median itself, by the way, can be considered the 50th percentile.) As can be seen from the example, the 25th and 75th percentiles are equal to 3 and 8, respectively.
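The median and percentiles of the 40-value example can be verified with a short Python sketch. It assumes the nearest-rank convention, in which the p-th percentile is the value at position ceil(p·n/100) of the ordered series; this matches the positions used in the example, but other conventions exist and may give slightly different results. The function name `percentile` is illustrative.

```python
import math

# The 40 ordered values of the example, written as value * count.
data = ([1] * 6 + [2] * 3 + [3] * 2 + [4] * 2 + [5] * 4 +
        [6] * 2 + [7] * 6 + [8] * 6 + [9] * 3 + [10] * 6)


def percentile(sorted_values, p):
    """Nearest-rank percentile: the value at 1-based position ceil(p / 100 * n)."""
    k = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[k - 1]


series = sorted(data)
p25 = percentile(series, 25)
p50 = percentile(series, 50)
p75 = percentile(series, 75)

print(len(series), p25, p50, p75)
```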

Both discrete (point) and continuous (interval) statistical distributions are used.

For clarity, statistical distributions are depicted graphically as a frequency polygon or a histogram.

Frequency polygon - a broken line whose segments connect the points with coordinates $(x_1, m_1), (x_2, m_2), \dots$ or, for the relative frequency polygon, the points with coordinates $(x_1, p^*_1), (x_2, p^*_2), \dots$ (Fig. 1).

Fig. 1. Frequency polygon (vertical axis: m or $m_i/n$). Fig. 2. Frequency histogram (vertical axis: f(x)).

Frequency histogram - a set of adjacent rectangles built on one straight line (Fig. 2); the bases of the rectangles are all equal to dx, and the heights are equal to the ratio of the frequency to dx, or of the relative frequency p* to dx (probability density).

Example:

x, kg	2,7	2,8	2,9	3,0	3,1	3,2	3,3	3,4	3,5	3,6	3,7	3,8	3,9	4,0	4,1	4,2	4,3	4,4
m

Frequency polygon

The ratio of relative frequency to interval width is called the probability density: $f(x) = \dfrac{m_i}{n\,dx} = \dfrac{p^*_i}{dx}$

An example of constructing a histogram .

Let's use the data from the previous example.

1. Calculation of the number of class intervals

Where n - number of observations. In our case n = 100 . Hence:

2. Calculation of interval width dx :

3. Drawing up an interval series:

x, kg	2.7-2.9	2.9-3.1	3.1-3.3	3.3-3.5	3.5-3.7	3.7-3.9	3.9-4.1	4.1-4.3	4.3-4.5
m	6	15	25	17	11	12	8	5	1
f(x)	0.3	0.75	1.25	0.85	0.55	0.6	0.4	0.25	0.05
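The heights of the histogram rectangles follow from the class frequencies by dividing each relative frequency by the interval width. The sketch below illustrates this for the interval series above. The class counts are an assumption: they are back-calculated from the f(x) row with n = 100 and dx = 0.2 (m = f(x)·n·dx), since only the densities are available.

```python
# Class counts, assumed: back-calculated from the f(x) row as m = f(x) * n * dx.
counts = [6, 15, 25, 17, 11, 12, 8, 5, 1]
dx = 0.2                                    # interval width, kg
edges = [round(2.7 + dx * i, 1) for i in range(len(counts) + 1)]

n = sum(counts)                             # number of observations

# Height of each rectangle: relative frequency divided by the interval width.
density = [round(m / (n * dx), 2) for m in counts]

# The total area of the histogram equals 1.
area = sum(f * dx for f in density)

for left, right, m, f in zip(edges, edges[1:], counts, density):
    print(f"{left}-{right}: m = {m}, f(x) = {f}")
print("n =", n, "area =", round(area, 6))
```

Because the heights are densities rather than raw counts, the areas of the rectangles are equal to the relative class frequencies.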

Histogram


1. Topic: “Fundamentals of mathematical statistics.”

2. Relevance of the topic.

Mathematical statistics is a branch of mathematics that studies methods of collecting, systematizing, processing, depicting, analyzing and interpreting observational results in order to identify existing patterns.

Some examples of the application of statistical methods in medicine:

interpretation of variation (variability of the characteristics of an organism when deciding what value of one or another characteristic will be ideal, normal, average, etc., makes it necessary to use appropriate statistical methods).

diagnosis of diseases in individual patients and assessment of the health status of a population group.

predicting the outcome of a disease in individual patients or the possible result of a control program for a particular disease in a population group.

selecting an appropriate influence on a patient or population group.

planning and conducting medical research, analyzing and publishing results, reading and critically evaluating them.

health care planning and management.

Useful health information is usually hidden in masses of raw data. It is necessary to concentrate the information contained in them and present the data so that the structure of variation is clearly visible, and then select specific methods of analysis.

Data presentation provides an introduction to the following concepts and terms:

variation series (ordered arrangement) - a simple arrangement of individual observations of a quantity.

class - one of the intervals into which the entire range of values of a random variable is divided.

extreme points of the class - the values that limit the class, for example 2.5 and 3.0, the lower and upper limits of the class 2.5 - 3.0.

(absolute) class frequency - the number of observations in a class.

relative class frequency - the absolute frequency of a class, expressed as a fraction of the total number of observations.

cumulative (accumulated) frequency of a class - the number of observations equal to the sum of the frequencies of all previous classes and this class.

bar chart - a graphical representation of data frequencies for nominal classes using columns whose heights are directly proportional to the class frequencies.

pie chart - a graphical representation of data frequencies for nominal classes using sectors of a circle, the areas of which are directly proportional to the class frequencies.

histogram - a graphical representation of the frequency distribution of quantitative data with areas of rectangles directly proportional to the class frequencies.

frequency polygon - a graph of the frequency distribution of quantitative data; the point corresponding to the class frequency is located above the middle of the interval, each two adjacent points are connected by a straight line segment.

ogive (cumulative curve) - a graph of the distribution of cumulative relative frequencies.
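The absolute, relative and cumulative class frequencies defined above can be tabulated with a few lines of Python. This is a minimal sketch; the function name `frequency_table` and the class counts are hypothetical.

```python
from itertools import accumulate


def frequency_table(class_counts):
    """Return one (absolute, relative, cumulative) tuple per class."""
    n = sum(class_counts)
    cumulative = accumulate(class_counts)      # running sums of the counts
    return [(m, m / n, c) for m, c in zip(class_counts, cumulative)]


# Hypothetical absolute frequencies of five consecutive classes.
counts = [2, 5, 8, 4, 1]
table = frequency_table(counts)

for absolute, relative, cumulative in table:
    print(absolute, relative, cumulative)
```

Dividing the cumulative column by the total number of observations gives the cumulative relative frequencies that are plotted in an ogive.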

All medical data have inherent variability, so the analysis of measurement results is based on studying which values the random variable under study took.

The set of all possible values of a random variable is called the general population.

The part of the general population registered as a result of tests is called a sample.

The number of observations included in the sample is called the sample size (usually denoted n).

When estimating the parameters of a population from a sample (parametric estimation), the following concepts are used:

point estimation - an estimate of a population parameter in the form of a single value that it can take with the highest probability.

interval estimation - an estimate of a population parameter in the form of an interval of values that has a given probability of covering its true value.

In interval estimation, the following concepts are used:

confidence interval - an interval of values that has a given probability of covering the true value of the population parameter.

confidence level (confidence probability) - the probability with which the confidence interval covers the true value of the population parameter.

confidence limits - the lower and upper limits of the confidence interval.

A typical statistical research path is:

Having estimated the quantities or relationships between them based on observational data, they make the assumption that the phenomenon being studied can be described by one or another stochastic model

using statistical methods, this assumption can be confirmed or rejected; upon confirmation, the goal is achieved - a model has been found that describes the patterns under study; otherwise, work continues, putting forward and testing a new hypothesis.

Definition of sample statistical estimates:

mode - the value that occurs most often in the sample

median - the central (middle) value of the variation series

range R - the difference between the largest and smallest values in a series of observations

percentiles - the values in a variation series that divide the distribution into 100 equal parts (thus, the median is the fiftieth percentile)

first quartile - 25th percentile

third quartile - 75th percentile

interquartile range - the difference between the third and first quartiles (covers the central 50% of observations)

quartile deviation - half of the interquartile range

sample mean - the arithmetic mean of all sample values (sample estimate of the mathematical expectation)

average absolute deviation - the sum of the absolute deviations (taken without regard to sign) from a chosen reference point, divided by the sample size

the average absolute deviation from the sample mean is calculated using the formula

$d = \dfrac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$

sample variance $s^2(X)$ (sample estimate of the variance) is given by

$s^2(X) = \dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

sample covariance (sample estimate of the covariance K(X,Y)) equals

$K(X,Y) = \dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

the sample regression coefficient of Y on X (sample estimate of the regression coefficient of Y on X) is equal to

$b_{y/x} = \dfrac{K(X,Y)}{s^2(X)}$

the empirical linear regression equation of Y on X has the form

$y - \bar{y} = b_{y/x}\,(x - \bar{x})$

the sample regression coefficient of X on Y (sample estimate of the regression coefficient of X on Y) is equal to

$b_{x/y} = \dfrac{K(X,Y)}{s^2(Y)}$

the empirical linear regression equation of X on Y has the form

$x - \bar{x} = b_{x/y}\,(y - \bar{y})$

sample standard deviation s(X) (sample estimate of the standard deviation) equals the square root of the sample variance

$s(X) = \sqrt{s^2(X)}$

sample correlation coefficient r (sample estimate of the correlation coefficient) equals

$r = \dfrac{K(X,Y)}{s(X)\,s(Y)}$

sample coefficient of variation (sample estimate of the coefficient of variation CV) is equal to

$CV = \dfrac{s(X)}{\bar{x}} \cdot 100\%$
