
SPSS Analysis Methods: Missing Value Analysis

Missing values can cause serious problems. If cases with missing values differ systematically from cases without them, the results will be misleading. Missing data may also reduce the precision of the calculated statistics, because the calculation uses less information than originally planned.

Another problem is that the assumptions behind many statistical procedures are based on complete cases, and missing values can complicate the theory required.

The discussion below covers four aspects:

Practical application

Theoretical thinking

Building a model

Analysis of results

I. Practical application

In studies of income, traffic accidents, and similar topics, some questions go unanswered because respondents refuse to answer or because information is lost during the investigation.

For example, in a population survey, 15% of respondents did not report their income, and the response rate of high-income people was lower than that of middle-income people; or, in reports of serious traffic accidents, key items such as seat belt use and alcohol concentration were not recorded in many cases. These unrecorded values are missing values. There are three kinds of missingness:

(1) Missing completely at random (MCAR): missingness is independent of the values of the variables. For example, suppose we are studying the relationship between age and income. If whether a value is missing has nothing to do with the age or income values, the missingness is MCAR. To assess whether MCAR is a plausible assumption, we can compare the distributions of the observed data for respondents and non-respondents. Univariate t tests or Little's MCAR multivariate test give a more formal assessment. If MCAR holds, listwise deletion (complete case analysis) can be used without worrying about estimation bias, although some efficiency may be lost. If MCAR does not hold, approximate methods such as listwise deletion and mean replacement may not be a good choice.

(2) Missing at random (MAR): missingness depends only on variables that are recorded in the data set. Continuing the example above, suppose age is always observed and income is sometimes missing. If whether income is missing depends only on age, the missingness is MAR.

(3) Missing not at random (MNAR): this is the case researchers least want to see. Missingness is related not only to the values of other variables but also to the missing value itself. If whether income is missing depends on the income value, the missingness is neither MCAR nor MAR.
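To make the three mechanisms concrete, here is a minimal simulation sketch. The population, the linear age-income relationship, and the drop probabilities are all invented for illustration and do not come from any real survey; the names `people` and `observed_mean` are hypothetical.

```python
import random

random.seed(0)

# Hypothetical population: income (in thousands) rises with age.
people = []
for _ in range(20000):
    age = random.uniform(20, 70)
    income = 20 + 1.2 * age + random.gauss(0, 10)
    people.append((age, income))


def observed_mean(drop_probability):
    """Mean income over the cases that survive a given missingness rule."""
    kept = [inc for age, inc in people
            if random.random() >= drop_probability(age, inc)]
    return sum(kept) / len(kept)


true_mean = sum(inc for _, inc in people) / len(people)

# MCAR: every income value has the same chance of being missing.
mcar_mean = observed_mean(lambda age, inc: 0.3)
# MAR: missingness depends only on the always-observed variable (age).
mar_mean = observed_mean(lambda age, inc: 0.7 if age > 50 else 0.1)
# MNAR: missingness depends on the income value itself.
mnar_mean = observed_mean(lambda age, inc: 0.7 if inc > 80 else 0.1)

print(f"true {true_mean:.2f}  MCAR {mcar_mean:.2f}  "
      f"MAR {mar_mean:.2f}  MNAR {mnar_mean:.2f}")
```

In this sketch the complete-case mean stays close to the true mean under MCAR but drifts downward under MAR and MNAR. Under MAR the drift can in principle be corrected by using age; under MNAR the observed data alone are not enough.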

II. Theoretical thinking

SPSS mainly handles missing values that are MCAR or MAR.

In practice MCAR is hard to satisfy, whereas MAR is a weaker assumption. Before the survey, we should therefore consider which important variables may suffer non-ignorable nonresponse, and try to include covariates in the survey so that these variables can be used to estimate the missing values.

Depending on the missing-value situation, SPSS offers three ways of handling missing values:

(1) Delete the missing values. This method suits data with few missing values. It requires no special steps and is usually set in the Options sub-dialog box of the relevant analysis dialog box.

(2) Replace the missing values. The "Replace Missing Values" command in the "Transform" menu treats all records as a series and fills in the missing values with a chosen statistic, such as the series mean.

(3) Use the Missing Value Analysis procedure, a module that SPSS provides specifically for analyzing missing values.
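As a rough illustration of methods (1) and (2), the sketch below compares deletion with series-mean replacement on a small invented income column. It is plain Python rather than SPSS, and the data values are hypothetical.

```python
import statistics

# Hypothetical income column; None marks a missing value.
income = [31.0, None, 45.0, 52.0, None, 38.0, 120.0, 41.0]

# (1) Deletion: analyze only the cases that have a value.
observed = [v for v in income if v is not None]

# (2) Replacement: fill each gap with the mean of the observed series.
series_mean = statistics.mean(observed)
filled = [series_mean if v is None else v for v in income]

print(len(observed), round(statistics.mean(observed), 2),
      round(statistics.stdev(observed), 2))
print(len(filled), round(statistics.mean(filled), 2),
      round(statistics.stdev(filled), 2))
```

Series-mean replacement keeps the mean unchanged but shrinks the standard deviation, because every filled value sits exactly at the center of the distribution. This is one reason to treat it as an approximate method.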

The Missing Value Analysis procedure has three main functions:

(1) Describing the pattern of missing values. The diagnostic reports show where the missing values are located and how frequent they are, and help judge whether the values are missing at random.

(2) Estimating the means, standard deviations, covariances, and correlations of data with missing values, using the listwise, pairwise, regression, or EM (expectation-maximization) method. The pairwise method can also display counts of pairwise complete cases.

(3) Filling in (imputing) missing values with estimates from the regression or EM method, so as to improve the reliability of statistical results.

Missing data can be categorical or quantitative (scale or continuous), but SPSS can estimate statistics and impute missing data only for quantitative variables. For each variable, missing values that are not coded as system-missing must be defined as user-missing.

Fisher's discriminant method reduces a multidimensional problem to a one-dimensional one by projection. By establishing a linear discriminant function, it calculates the coordinates of each observation on each canonical variable dimension and uses the distance between the sample and each class center as the basis for classification.
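The EM estimation of means and covariances listed under function (2) can be sketched for the simplest case: two variables, one fully observed (age) and one partly missing (income), under a bivariate normal model. This is an illustrative sketch, not SPSS's implementation; the function `em_bivariate`, the simulated data, and the missingness rule are all assumptions made for the example.

```python
import random


def em_bivariate(x, y, iterations=300):
    """EM estimates for a bivariate normal model.

    x is fully observed; y uses None for missing entries.
    Returns (mu_x, mu_y, s_xx, s_xy, s_yy) using divisor n.
    """
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n
    # Starting values for y come from the observed y values only.
    y_obs = [yi for yi in y if yi is not None]
    mu_y = sum(y_obs) / len(y_obs)
    s_yy = sum((yi - mu_y) ** 2 for yi in y_obs) / len(y_obs)
    s_xy = 0.0
    for _ in range(iterations):
        slope = s_xy / s_xx
        resid_var = s_yy - slope * s_xy
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                # E-step: expected y and y^2 given x and current estimates.
                ey = mu_y + slope * (xi - mu_x)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = yi, yi * yi
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M-step: update the moments from the completed sums.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return mu_x, mu_y, s_xx, s_xy, s_yy


random.seed(1)
age = [random.uniform(20, 70) for _ in range(2000)]
income_full = [20 + 1.2 * a + random.gauss(0, 10) for a in age]
# Hypothetical MAR rule: older respondents skip the income question more often.
income = [None if random.random() < (0.8 if a > 50 else 0.1) else v
          for a, v in zip(age, income_full)]

true_mean = sum(income_full) / len(income_full)
reported = [v for v in income if v is not None]
complete_case_mean = sum(reported) / len(reported)
mu_age, em_mean, var_age, cov_age_income, var_income = em_bivariate(age, income)
print(f"true {true_mean:.2f}  complete-case {complete_case_mean:.2f}  "
      f"EM {em_mean:.2f}")
```

Because income is missing more often for older people in this simulation, the complete-case mean understates income, while the EM estimate uses the fully observed ages to pull the mean back toward the true value.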

III. Building a model

Missing value analysis case:

Problem: Some demographic values in the data file have been replaced by missing values. Suppose the data file concerns the measures a telecommunications company takes to reduce churn in its customer base. Each case corresponds to a separate customer and records various demographic and service usage information. The steps below use this data file to analyze its missing values and so illustrate the SPSS Missing Value Analysis procedure.

1. Data entry

2. Operation steps

Step 1: Start SPSS, open the data file, and choose "Analyze | Missing Value Analysis".

Step 2: Move four variables (marital status, education level, retirement, and gender) into the "Categorical Variables" list box. Move six variables into the "Quantitative Variables" list box: months with service [tenure], age [age], years at current address [address], household income in thousands [income], years with current employer [employ], and number of people in household [reside].

Step 3: Click the Patterns button in the Missing Value Analysis dialog box to open the Missing Value Analysis: Patterns dialog box. In the Display group, select the check box for tabulated cases grouped by missing value patterns. Select four variables from the "Missing Patterns for" list box and move them into the "Additional Information for" list box. Leave the other settings at their defaults, then click Continue to return to the Missing Value Analysis dialog box.

Step 4: Click the Descriptives button to open the Missing Value Analysis: Descriptives dialog box. Select the Univariate statistics check box and, under indicator variable statistics, the check boxes for t tests with groups formed by indicator variables and for crosstabulations of categorical and indicator variables. Leave the other settings at their defaults.

Step 5: Select the EM check box and leave the other settings at the system defaults. Click OK and wait for the output.

IV. Analysis of results

1. Univariate statistics table. This table reports, for each analyzed variable, the number of non-missing values, the mean, and the standard deviation, together with the number and percentage of missing values and the number of extreme values. This information gives a first impression of the general characteristics of the data. Taking the employ row as an example, there are 904 valid values, with a mean of 11 and a standard deviation of 10.113. There are 96 missing values, accounting for 9.6% of the total.
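The quantities in the univariate table are simple to compute by hand; for instance, 96 missing out of 904 + 96 = 1000 cases gives 9.6%. A minimal sketch on an invented column follows; the function name `univariate_summary` and the data are hypothetical.

```python
import statistics


def univariate_summary(values):
    """Return count, mean, standard deviation, and missing count/percent."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    return {
        "n": len(present),
        "mean": statistics.mean(present),
        "std_dev": statistics.stdev(present),
        "missing": missing,
        "missing_pct": 100.0 * missing / len(values),
    }


# Hypothetical years-with-employer column; None marks a missing value.
employ = [5, 12, None, 0, 23, 9, None, 15, 3, 11]
summary = univariate_summary(employ)
print(summary)
```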

2. Changes in the mean and standard deviation after EM estimation. The two tables show how the mean and standard deviation of the overall data change once missing values are estimated with the EM method: "All Values" gives the statistics of the original data, and "EM" gives the statistics of the overall data after applying the EM method.

3. Separate variance t test table. The separate variance t tests help identify variables whose missing-value pattern is related to other quantitative variables; that is, the t statistics are used to check whether values are missing completely at random. The table shows that older people often do not report their income level: when the income value is missing, the average age is 49.73, and when the income value is present, the average age is 40.01. The t statistics in the income column show that missingness of income is clearly associated with the other quantitative variables, which indicates that income is not missing completely at random.
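The idea behind this table can be sketched as a separate-variance (Welch) t statistic that compares age between cases with and without a reported income. The records below are invented, so the result does not reproduce the 49.73 and 40.01 means from the output; `separate_variance_t` is a hypothetical helper name.

```python
import math
import statistics


def separate_variance_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = statistics.variance(group_a) / len(group_a)
    vb = statistics.variance(group_b) / len(group_b)
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1)
                           + vb ** 2 / (len(group_b) - 1))
    return t, df


# Hypothetical records: (age, income); income is None when not reported.
records = [(52, None), (61, None), (47, None), (55, None), (38, 40.0),
           (29, 31.0), (44, 58.0), (35, 36.0), (41, 49.0), (33, 30.0)]
age_income_present = [age for age, inc in records if inc is not None]
age_income_missing = [age for age, inc in records if inc is None]

t_stat, df = separate_variance_t(age_income_present, age_income_missing)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

A t statistic that is large in absolute value suggests that age differs between the two groups, which is evidence against MCAR for income.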

4. Cross table of categorical and quantitative variables. Taking marital status as an example, the table shows, for each marital status, the number and percentage of non-missing and missing values of the other quantitative variables. It also identifies the system-missing values and shows how each variable is distributed across the different marital statuses.

5. Tabulated patterns output. This table (the missing value patterns table) gives detailed information on how the missing values are distributed; an X marks a variable that is missing in that pattern. Of the 950 cases displayed, 475 have complete values on all 9 variables, 109 are missing income, and 16 are missing both address and income. The other rows are read in the same way.
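Tabulating patterns amounts to marking each case's missing variables and counting identical markings. A small sketch with invented cases follows; the variable subset and values are hypothetical.

```python
from collections import Counter

# Hypothetical cases; None marks a missing value.
variables = ["age", "address", "income"]
cases = [
    {"age": 44, "address": 9, "income": 64.0},
    {"age": 33, "address": 7, "income": None},
    {"age": 52, "address": None, "income": None},
    {"age": 39, "address": 12, "income": 78.0},
    {"age": 61, "address": 20, "income": None},
]


def pattern(case):
    """Mark each variable with X if it is missing and '.' if it is present."""
    return "".join("X" if case[name] is None else "." for name in variables)


pattern_counts = Counter(pattern(case) for case in cases)
for marks, count in pattern_counts.most_common():
    print(marks, count)
```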

6. EM estimated statistics. Three tables give the statistics estimated by the EM algorithm: the EM means, covariances, and correlations. From the EM means, the mean of the age variable is 41.91; from the EM covariances, the covariance between age and tenure is 135.326; from the EM correlations, the correlation coefficient between age and tenure is 0.496. In addition, Little's MCAR test, reported at the bottom of each of the three tables, has a chi-square significance value well below 0.05. We therefore reject the hypothesis that the values are missing completely at random (MCAR), which confirms the conclusion drawn from the separate variance t test table in item 3.

Reference case data:

[1] Chen, Liu Rong. SPSS Statistical Analysis from Entry to Mastery (4th ed.). Tsinghua University Press.

The original text comes from/s/csmioa _ vu8hjopvw16onfg.