Missing Data and Imputation

Authors

Javier Estrada

Michael Underwood

Elizabeth Subject-Scott

Published

May 1, 2023

Website

Slides

Introduction

Missing Data

Missing data occurs when there are missing values in a dataset. There are many reasons why this happens. It can be intentional or unintentional, and oftentimes there is no way to know for sure what causes the missing data. There are three categories, otherwise known as missingness mechanisms, which measure the severity of the missingness (Mainzer et al. 2023):

  • Missing completely at random (MCAR) is when the distribution of missing data is completely independent of any other variables. There is no relationship between the pattern of missingness and the observed data. This would be the best-case scenario.

  • Missing at random (MAR) is when the distribution of missing data is dependent of the observed values. The pattern of missingness is related to the observed data and not the missing data.

  • Missing not at random (MNAR) is when missing data is neither MCAR or MAR. The distribution of missing data is dependent on other variables. The pattern of missingness is related to the observed data, as well as the missing data. This would be the worst-case scenario and is nonignorable.

Figure 1: Graphical Representation of Missingness Mechanisms (Schafer and Graham 2002)

(X are the completely observed variables. Y are the partly missing variables. Z is the component of the cause of missingness unrelated to X and Y. R is the missingness.)

Looking for patterns in the missing data can help us to determine which category they belong. These mechanisms are important in determining how to handle the missing data. MCAR would be the best-case scenario but seldom occurs. MAR and MNAR are more common. Most methods of dealing with missing data assume a MAR missingness. If data are found to be MNAR, worst-case scenario, then specialized expertise and additional steps are needed to model the missingness before handling the missing data. The problem with ignoring any missing values is that it does not give a true representation of the dataset and can lead to bias when analyzing. This reduces the statistical power of the analysis (van_Ginkel et al. 2020).

According to Dong and Peng (2013), to enhance the quality of research, the following should be followed with regards to missing data:

  • Determine under what conditions the missing data occur.
  • Use principled methods to handle the missing data.
  • Include and report the treatment of the missing data.

Methods to Deal with Missing Data

There are three types of methods to deal with missing data, the likelihood and Bayesian method, weighting methods, or imputation methods (Cao et al. 2021).

  • Likelihood Bayesian method is when information from a previous predictive distribution is combined with evidence obtained in a sample to predict a value. It requires technical coding and advanced statistical knowledge.

  • The weighting method is a traditional approach when weights from available data are used to adjust for non-response in a survey. Inefficiency occurs when there are extreme weights or a need for many weights.

  • The imputation method is when an estimate from the original dataset is used to estimate the missing value. There are two types of imputation: single and multiple.

Deleting Missing Data

Another way to handle missing data is to simply delete it. Deleting missing values can lead to the loss of important information regarding your dataset, and is therefore not recommended. The two types of deletion methods are listwise and pairwise.

Listwise deletion is when the entire observation is removed from the dataset. In certain cases when the amount of missing data is small and the type is MCAR, listwise deletion can be used. There usually won’t be bias but potentially important information may be lost.

T-tests and chi-square tests can be used to assess pairs of predictor variables to determine whether the groups’ means differ significantly. According to van_Ginkel et al. (2020), if significant, the null hypothesis is rejected, therefore, indicating that the missing values are not randomly distributed throughout the data. This implies that the missing data is either MAR or MNAR. Conversely, if non-significant, this implies that the data cannot be MAR. This does not eliminate the possibility that it is not MNAR–other information about the population is needed to determine this.

Whenever missing data is categorized as MAR or MNAR, listwise deletion would be wasteful, and the analysis biased. Alternate methods of dealing with the missing data is recommended: either pairwise deletion or imputation.

Pairwise deletion is when only the missing variable of an observation is removed. It allows more data to be analyzed than listwise deletion but limits the ability to make inferences of the total sample. For this reason, it is recommended to use imputation to properly deal with missing data.

Preferred Method to Handle Missing Data

Imputation is the preferred method to handle missing data. It consists of replacing missing data with an estimate obtained from the original, available data. After imputation, there will be a full dataset to analyze. According to Wulff and Jeppesen (2017), 3-5 imputations are sufficient, and 10 imputations are more than enough. More recent research has indicated that to improve statistical power, the number of imputations created should be at least equal to the percent of missing data (5% equals 5 imputations, 10% equals 10 imputations, 20% equals 20 imputations, etc.) (Pedersen et al. 2017).

Single, or univariate, imputation is when only one estimate is used to replace the missing data. Methods of single imputation include using the mean, the last observation carried forward, and random imputation. The following is a brief explanation of each:

  • Using the mean to replace a missing value is a straight-forward process. The mean of the dataset is calculated, including the missing value. The mean is then multiplied by the number of observations in the study. Next, the known values are subtracted from the product, and this gives an estimate that can be used for any missing values. The problem with this method is that it can produce misleading results such as showing the variance and the confidence interval smaller than they actually are.

  • Last Observation Carried Forward (LOCF) is a technique of replacing a missing value in longitudinal studies with a previously observed value, the most recent value is carried forward (Streiner 2008). The problem with this method is that it assumes that the previous observed value is perpetual when in reality that most likely is not the case.

  • Random imputation is a method of randomly drawing an observation and using that observation for any of the missing values. The problem with this method is that it introduces additional variability.

These single imputation methods are inaccurate. They often result in underestimation of standard errors or too small p-values (Dong and Peng 2013), which can cause bias in the analysis. Therefore, multiple imputation is the superior method because it handles missing data better and provides less biased results.

Multiple, or multivariate, imputation is when various estimates are used to replace the missing data by creating multiple datasets from versions of the original dataset. According to Dong and Peng (2013), it can be done by using a “regression model, or a sequence of regression models, such as linear, logistic and Poison,” where a set of m plausible values are generated for each unobserved data point, resulting in M complete data sets. The new values are randomly drawn from predictive distributions either through joint modeling (JM, which is not used much anymore) or fully conditional specification (FCS) (Wongkamthong and Akande 2023). It is then analyzed and the results are combined to obtain a single value for the missing data.

The purpose of multiple imputation is to create a pool of imputed data for analysis, but if the pooled results are lacking, then multiple imputation should not be done (Mainzer et al. 2023). Another reason not to use multiple imputation is if there are very few missing values; there may be no benefit in using it. Also worth noting is some statistical analyses software already have built-in features to deal with missing data.

Multiple imputation by chained methods, otherwise known as MICE, is the most common and preferred method of multiple imputation (Wulff and Jeppesen 2017). It provides a more reliable way to analyze data with missing values. For this reason, this paper will focus on the methodology and application of the MICE process.

Figure 2: Flowchart of the MICE-process based on procedures proposed by Rubin (Wulff and Jeppesen 2017)

Code
# load package
library(DiagrammeR)

# flowchart of mice process
DiagrammeR::grViz("digraph {

# initiate graph
graph [layout = dot, rankdir = LR, label = 'The MICE-Process\n\n',labelloc = t, fontcolor = DarkSlateBlue, fontsize = 45]

# global node settings
node [shape = rectangle, style = filled, fillcolor = AliceBlue, fontcolor = DarkSlateBlue, fontsize = 35]
bgcolor = none

# label nodes
incomplete [label =  'Incomplete data set']
imputed1 [label = 'Imputed \n data set 1']
estimates1 [label = 'Estimates from \n analysis 1']
rubin [label = 'Rubins Rules', shape = diamond]
combined [label = 'Combined results']
imputed2 [label = 'Imputed \n data set 2']
estimates2 [label = 'Estimates from \n analysis 2']
imputedm [label = 'Imputed \n data set m']
estimatesm [label = 'Estimates from \n analysis m']

# edge definitions with the node IDs
incomplete -> imputed1 [arrowhead = vee, color = DarkSlateBlue]
imputed1 -> estimates1 [arrowhead = vee, color = DarkSlateBlue]
estimates1 -> rubin [arrowhead = vee, color = DarkSlateBlue]
incomplete -> imputed2 [arrowhead = vee, color = DarkSlateBlue]
imputed2 -> estimates2 [arrowhead = vee, color = DarkSlateBlue]
estimates2-> rubin [arrowhead = vee, color = DarkSlateBlue]
incomplete -> imputedm [arrowhead = vee, color = DarkSlateBlue]
imputedm -> estimatesm [arrowhead = vee, color = DarkSlateBlue]
estimatesm -> rubin [arrowhead = vee, color = DarkSlateBlue]
rubin -> combined [arrowhead = vee, color = DarkSlateBlue]
}")

Rubin’s Rules: Donald Rubin was a Harvard professor in the seventies. He felt that the then current method of dealing with missing data (single imputation) was flawed. It was impossible to calculate missing values with absolute certainty because the method did not consider the estimates in error in the predictions of the missing data. For this reason, he recommended creating multiple datasets based on the Bayesian method and doing the following steps:

  • Impute missing data by averaging the estimates across imputed datasets.
  • Run regression on each of the imputed datasets.
  • Find the standard errors and variance of the imputed estimates. Combine them using an adjustment term.

He felt like this would provide more accurate results than using just one imputed dataset. His theory is referred to as Rubin’s Rules.

Other Methods of Imputation

There are other methods of imputation worth noting and are briefly described below.

Regression Imputation is based on a linear regression model. Missing data is randomly drawn from a conditional distribution when variables are continuous and from a logistic regression model when they are categorical (van_Ginkel et al. 2020).

Predictive Mean Matching (PMM) is also based on a linear regression model. The approach is the same as regression imputation when it comes to categorical missing values but different for continuous variables. Instead of random draws from a conditional distribution, missing values are based on predicted values of the outcome variable (van_Ginkel et al. 2020).

Hot Deck (HD) imputation is when a missing value is replaced by a similar observed value, known as “the donor.” It can be either random or deterministic, which is based on a metric or value (Thongsri and Samart 2022). It does not rely on model fitting.

Stochastic Regression (SR) Imputation is an extension of regression imputation. The process is the same but a residual term from the normal distribution of the regression of the predictor outcome is added to the imputed value (Thongsri and Samart 2022). This maintains the variability of the data.

Random Forest (RF) Imputation is based on machine learning algorithms. Missing values are first replaced with the mean or mode of that particular variable and then the dataset is split into a training set and a prediction set (Thongsri and Samart 2022). The missing values are replaced with estimates from these sets. This type of imputation can be used on continuous or categorical variables with complex interactions.

Methodology

Multiple Imputation by Chained Equations (MICE)

According to Rubin’s Rules, in multiple imputation m imputed values are created for each of the missing data and result in M complete datasets. For each of the M datasets, an estimate of \(\theta\) is acquired. Let \({\hat{\theta}}_{m}\) and \({\hat{\phi}}_{m}\) be an estimator of the variance of \({\hat{\theta}}_{m}\) based on the Mth complete dataset. The following formulas are used in the multiple imputation process (Arnab 2017) and (Heymans and Eekhout 2019).

The combined estimator of \(\theta\), the pooled mean difference, is given by the following equation:

\[{\overline{\theta}} = \displaystyle \frac{1}{M}\sum_{m = 1}^{M} {\hat{\theta}}_{m}\] This is the sum of the m estimators from each of the M imputed datasets which are pooled together and averaged.

The proposed variance estimator of \({\hat{\theta}}_{M}\), the total variance, is given by the following equation:

\[{\hat{\Phi}} = {\overline{\phi}}_{M} + (1+\displaystyle \frac{1}{M})B_{M}\]

This formula is made up of two parts: the average within imputation variance and the between imputation variance (with a correction factor of \((1+\displaystyle \frac{1}{M})\)).

The imputation variance within, which is the normal standard error, is calculated as follows:

\[{\overline{\phi}} = \displaystyle \frac{1}{M}\sum_{m = 1}^{M}{SE}^2_m\]

This is the average of the sum of the standard errors squared of the imputed datasets. It will be large in small datasets and small in large datasets.

The between imputation variance, which is how much extra variation there is due to the imputation process, is calculated as follows:

\[B_{M} = \displaystyle \frac{1}{M-1}\sum_{m = 1}^{M}({\hat{\theta}}_{m}-{\overline{\theta}})^{2}\]

This is the sum of the sample variance for each imputed data set multiplied by 1 over the imputed datasets. It will be large when there is a high number of missing data and small when there is a small number of missing data (Heymans and Eekhout 2019). When the between variance is greater than the within variance, then greater efficiency occurs, and more accurate estimates can be achieved by increasing M. Conversely, when the within variance is greater than the between variance, then little is gained by increasing M.

To find the total standard error, take the square root of the total variance:

\[SE_{pooled} = {\sqrt{\hat{\Phi}}}\]

For significance testing, we use the Wald Test:

\[Wald_{pooled} = \frac{(\overline{\theta} - {\theta_{0}})^2}{{\hat{\Phi}}}\]

The Wald test follows a t-distribution with the degrees of freedom as follows:

\[df = (M-1)(1+\frac{1}{M+1} * \frac{\overline{\theta}}{B_{M}})^2\]

It results in a 2-sided t-test with the t-value as follows:

\[t_{df, 1-\alpha/2}\]

To test for significance using F-distribution, we square the t-value:

\[ F_{1,df} = t^{2}_{df, 1-\alpha/2}\]

Assumptions

All multiple imputation methods have the following assumptions:

  • Observed data follow a normal multivariate distribution. If not normal, a transformation is first needed before imputation can be performed.

  • Missing data are classified as MAR, which is when the distribution of missing data is dependent of the observed values and not the unobserved values.

  • “The parameters \({\theta}\) of the data model and the parameters \({\phi}\) of the model for the missing values are distinct. That is, knowing the values of \({\theta}\) does not provide any information about \({\phi}\)(“MI Procedure” 2019).

These assumptions are essential, otherwise we wouldn’t be able to predict a plausible replacement value for the missing data.

Steps:

The chained equation process has the following steps (Azur et al. 2011):

Step 1: Using simple imputation, replace the missing data with this value, referred to as the “place holder.”

Step 2: The “place holder” values for one variable are set back to missing.

Step 3: The observed values from this variable (dependent variable) are regressed on the other variables (independent variables) in the model, using the same assumptions when performing linear, logistic, or Poison regression.

Step 4: The missing values are replaced with predictions “m” from this newly created model.

Step 5: Repeat Steps 2-4 for each variable that have missing values until all missing values have been replaced.

Step 6: Repeat Steps 2-4, updating imputations each cycle for as many “M” cycles/imputations that are required.

Analysis and Results

MICE in R

Using the MICE (Multivariate Imputation by Chained Equations) package in R, a statistical programming software, we will be demonstrating how to impute missing data. We will run through the MICE process and provide results of the imputations. We will then check the accuracy of the imputations by comparing these results to the results of the original, complete data set.

Data and Visualizations

Load Data and Packages
Code
# load libraries
library(vtable)
library(gtsummary)
library(gt)
library(ggplot2)
library(VIM, warn.conflicts=FALSE)
library(dplyr, warn.conflicts=FALSE)
library(mice, warn.conflicts=FALSE)
library(tidyverse)
library(modelsummary)

# load data
original = read.csv("kc_house_data.csv")
Description of Dataset

kc_house_data.csv

Details of Dataset

This dataset contains the sale prices of 21,613 houses in King County, Seattle between May 2014 to May 2015. The original dataset contained 21 columns with various selling attributes. For the purpose of this project, we have condensed the variables to the following four: sales price, number of bedrooms, number of bathrooms, and square footage of living space.

Definition of Data in Dataset
Variable Type Description
price numeric Sale price of homes sold between May 2014 and May 2015 in King County, Seattle; range from $75,000 to $7,700,000.
bedrooms integer Number of bedrooms in homes sold between May 2014 and May 2015 in King County, Seattle; range from 0 to 33.
bathrooms numeric Number of bathrooms in homes sold between May 2014 and May 2015 in King County, Seattle; range from 0 to 8.
sqft_living integer Number of square feet of living space in homes sold between May 2014 and May 2015 in King County, Seattle; range from 290 sf to 13,540 sf.
Summary of Dataset:
Code
# summary of data
sumtable(original, col.align = c('left', 'center'))
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
price 21613 540182 367362 75000 321950 645000 7700000
bedrooms 21613 3.4 0.93 0 3 4 33
bathrooms 21613 2.1 0.77 0 1.8 2.5 8
sqft_living 21613 2080 918 290 1427 2550 13540
Plotting of Data
Code
# plot the data
ggplot(original, aes(x=sqft_living, y=price)) +
  geom_point(alpha=0.5, color='darkblue') +
  labs(x="sqft_living", y="price")

Evaluate Dataset

First, we evaluate the dataset for missing values by using the is.na() function:

Code
# check for missing data
sapply(original, function(x) sum(is.na(x)))
      price    bedrooms   bathrooms sqft_living 
          0           0           0           0 

There are no missing values; this is the original, complete dataset. Normally, this is where you would find out how much missing data your dataset contains and you would move on to the next step in imputation. Since our dataset does not contain any missing values, we will need to add an additional step to artificially create some missing data. We do this by first duplicating the dataset. We want to preserve the original dataset so that we can later check the accuracy of the imputation process.

Code
# duplicate dataset
house <- original

Next, we randomly replace the desired amount of values with NA to mimic missing data. We will assign 200 NA values to the following variables: bedrooms, bathrooms, and sqft_living and 100 NA values to the price variable.

Code
# replace some values with NA to create missing data
set.seed(10)
house[sample(1:nrow(house), 200), "sqft_living"] <- NA
house[sample(1:nrow(house), 200), "bedrooms"] <- NA
house[sample(1:nrow(house), 200), "bathrooms"] <- NA
house[sample(1:nrow(house), 100), "price"] <- NA

Using the is.na() function, we can check to confirm we have created missing data:

Code
# check for missing data
sapply(house, function(x) sum(is.na(x)))
      price    bedrooms   bathrooms sqft_living 
        100         200         200         200 

Our dataset now has 700 NA/missing values. This equates to about 3% of the data (700/21,613).

Checking for Patterns

Next, we need to check the missingness by looking for patterns in the dataset by using the md.pattern() function. Of course our missing data was randomly created, so we know we have MCAR.

Code
# check the missingness patterns
md.pattern(house, rotate.names = TRUE)

      price bedrooms bathrooms sqft_living    
20920     1        1         1           1   0
195       1        1         1           0   1
196       1        1         0           1   1
2         1        1         0           0   2
197       1        0         1           1   1
2         1        0         1           0   2
1         1        0         0           1   2
98        0        1         1           1   1
1         0        1         1           0   2
1         0        1         0           1   2
        100      200       200         200 700

Blue are the observed/complete values, and red are the missing values. The numbers on the left indicate how many complete values there are and the numbers on the right show how many variables contain missing data. So for example, the first line shows there are 20,920 complete observations across all four variables. The next line indicates there are 195 complete observations in three of the variables with data missing in the sqft_living variable, so forth and so on all the way down the list. The total number of missing data for each variable is listed across the bottom. This visual gives a breakdown of where the missing data occurs.

Another way to look for patterns in the missing data is to use the aggr() function:

Code
# histogram, patterns and count for missing data
aggr_plot <- aggr(house, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))


 Variables sorted by number of missings: 
    Variable       Count
    bedrooms 0.009253690
   bathrooms 0.009253690
 sqft_living 0.009253690
       price 0.004626845

It provides a histogram of the missing values and shows each variable and the percent of missing data for each (as a decimal):

  • bedrooms: 0.9%

  • bathrooms: 0.9%

  • sqft_living: 0.9%

  • price: 0.5%

Another visualization to check for the missingness is a margin plot:

Code
# margin plot for missingness
marginplot(house[c(1,4)])

The red box plot on the left shows the distribution of sqft_living with price missing. The blue box plot shows the distribution of the remaining data. The same is true for the price box plots on the bottom. We expect the red and blue boxplots to be very similar if our assumptions of MCAR data are true. In this case, they are.

Imputation Process

Since we are missing about 3% of our data, we need to perform at least 3 imputations. This can be done by using the mice() function. Since 5 is the default, we will use that; if you need to use something different, you must set m equal to the number of imputations that you desire. The set.seed will be given the value 1337 (any number can be used here) to ensure that the same random values are generated each time.

Code
# impute the data 5 times (default)
imp = mice(data = house, set.seed = 1337)

 iter imp variable
  1   1  price  bedrooms  bathrooms  sqft_living
  1   2  price  bedrooms  bathrooms  sqft_living
  1   3  price  bedrooms  bathrooms  sqft_living
  1   4  price  bedrooms  bathrooms  sqft_living
  1   5  price  bedrooms  bathrooms  sqft_living
  2   1  price  bedrooms  bathrooms  sqft_living
  2   2  price  bedrooms  bathrooms  sqft_living
  2   3  price  bedrooms  bathrooms  sqft_living
  2   4  price  bedrooms  bathrooms  sqft_living
  2   5  price  bedrooms  bathrooms  sqft_living
  3   1  price  bedrooms  bathrooms  sqft_living
  3   2  price  bedrooms  bathrooms  sqft_living
  3   3  price  bedrooms  bathrooms  sqft_living
  3   4  price  bedrooms  bathrooms  sqft_living
  3   5  price  bedrooms  bathrooms  sqft_living
  4   1  price  bedrooms  bathrooms  sqft_living
  4   2  price  bedrooms  bathrooms  sqft_living
  4   3  price  bedrooms  bathrooms  sqft_living
  4   4  price  bedrooms  bathrooms  sqft_living
  4   5  price  bedrooms  bathrooms  sqft_living
  5   1  price  bedrooms  bathrooms  sqft_living
  5   2  price  bedrooms  bathrooms  sqft_living
  5   3  price  bedrooms  bathrooms  sqft_living
  5   4  price  bedrooms  bathrooms  sqft_living
  5   5  price  bedrooms  bathrooms  sqft_living

We need to create the imputed dataset using the complete() function. We will not display the results since there are 21,613 rows.

Code
# create imputed dataset
imputed <- complete(imp)

We now need to check the imputed dataset to make sure there is no missing data. Again, we use the is.na() function:

Code
# check imputed dataset for missing values
sapply(imputed, function(x) sum(is.na(x)))
      price    bedrooms   bathrooms sqft_living 
          0           0           0           0 

We can check the quality of the imputations by running a strip plot, which is a single axis scatter plot. It will show the distribution of the imputed values (red) over the observed values (blue). We want the imputations to be values that could have been observed had the data not been missing. These look to be acceptable.

Code
# strip plots to check the quality of the imputations
par(mfrow=c(2,2))
stripplot(imp, price, pch = 19, cex=1, xlab = "Imputation number")

Code
stripplot(imp, bedrooms, pch = 19, cex=1, xlab = "Imputation number")

Code
stripplot(imp, bathrooms, pch = 19, cex=1, xlab = "Imputation number")

Code
stripplot(imp, sqft_living, pch = 19, cex=1, xlab = "Imputation number")

Fitting and Pooling Imputed Datasets

Next, we will run regression on our imputed datasets, so we will have 5 regression results. We do this with the with() function. For each regression, we get different results.

Code
# fit complete-data model
fit <- with(imp, lm(price~bedrooms+bathrooms+sqft_living))

Next, we will pool the results to arrive at estimates that will properly account for the missing data. It will give us the estimate, standard error, test statistic, degrees of freedom, and the p-value for each variable.

Code
# pool and summarize the results
summary(pool(fit))
         term    estimate   std.error  statistic        df       p.value
1 (Intercept)  75043.7775 6952.563287  10.793685 14431.278  4.677205e-27
2    bedrooms -57930.4966 2357.322483 -24.574702  7683.745 1.936143e-128
3   bathrooms   7624.2182 3531.505441   2.158914 11516.540  3.087739e-02
4 sqft_living    309.7992    3.114475  99.470779  8431.335  0.000000e+00

We can now analyze our dataset as if there were no missing data!

Checking for Accuracy

We will now compare the results from our pooled, imputed dataset to our original, complete dataset (before we removed values to create missing data) to check for accuracy of the imputation method by chained equations:

Code
# fit original, complete dataset
og_lm = lm(price~bedrooms+bathrooms+sqft_living, data = original)

# compare imputed dataset to the original dataset
msummary(list("Imputed" = pool(fit), "Original" = og_lm), title = "Comparison of Results", statistic = "p.value", estimate = "estimate", gof_omit = 'IC|Log|Adj|F|RMSE')
Comparison of Results
 Imputed Original
(Intercept) 75043.777 74662.099
(<0.001) (<0.001)
bedrooms −57930.497 −57906.631
(<0.001) (<0.001)
bathrooms 7624.218 7928.708
(0.031) (0.024)
sqft_living 309.799 309.605
(<0.001) (<0.001)
Num.Obs. 21613 21613
Num.Imp. 5
R2 0.507 0.507

The estimates from the multiple imputation are very close to the estimates for the original dataset, with the exception of bathrooms which is off just slightly. The p-values are all statistically significant in the original dataset, as well as the imputed dataset, with very minimal differences in their values. These results indicate a high accuracy of the imputation process. We can conclude that multiple imputation by chained equation is a reliable source to impute missing data.

Conclusion

In conclusion, missing data can occur in research for a variety of reasons. It is never a good idea to ignore it. Doing this will lead “to biased estimates of parameters, loss of information, decreased statistical power, and weak reliability of findings” (Dong and Peng 2013). When missing data is discovered, it is important to first identify it and look for missing data patterns. The best course of action is to impute the missing data by using multiple imputation. Next, define the variables in the dataset that are related to the missing values that will be used for imputation. Create the necessary number of complete data sets with imputed values. Run regression on these models and combine them using Rubin’s Rules. Finally, analyze the dataset. Performing these steps will minimize the adverse effects on the analysis caused by missing data (Pampka, Hutcheson, and Williams 2016).

References

Arnab, R. 2017. Survey Sampling Theory and Applications. Academic Press. https://www.sciencedirect.com/topics/mathematics/imputation-method.
Azur, M. J., E. A. Stuart, C. Frangakis, and P. J. Leaf. 2011. “Multiple Imputation by Chained Equations: What Is It and How Does It Work?” Int J Methods Psychiatr Res. 20 (1): 40–49. https://onlinelibrary.wiley.com/doi/epdf/10.1002/mpr.329.
Cao, Y., H. Allore, B. V. Wyk, and Gutman R. 2021. “Review and Evaluation of Imputation Methods for Multivariate Longitudinal Data with Mixed-Type Incomplete Variables.” Statistics in Medicine 41 (30): 5844–76. https://doi-org.ezproxy.lib.uwf.edu/10.1002/sim.9592.
Dong, Y., and C. J. Peng. 2013. “Principled Missing Data Methods for Researchers.” SpringerPlus 2 (222). https://doi.org/10.1186/2193-1801-2-222.
Heymans, M. W., and I. Eekhout. 2019. Applied Missing Data Analysis with SPSS and (r) Studio. Heymans & Eekhout. https://bookdown.org/mwheymans/bookmi/.
Mainzer, R., M. Moreno-Betancur, C. Nguyen, J. Simpson, J. Carlin, and K. Lee. 2023. “Handling of Missing Data with Multiple Imputation in Observational Studies That Address Causal Questions: Protocol for a Scoping Review.” BMJ Open 13: 1–6. http://dx.doi.org/10.1136/bmjopen-2022-065576.
“MI Procedure.” 2019. https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_mi_details03.htm.
Pampka, M., G. Hutcheson, and J. Williams. 2016. “Handling Missing Data: Analysis of a Challenging Data Set Using Multiple Imputation.” International Journal of Research & Method in Education 39 (1): 19–37. https://doi.org/10.1080/1743727X.2014.979146.
Pedersen, A. B., E. M. Mikkelsen, D. Cronin-Fenton, N. R. Kristensen, T. M. Pham, L. Pedersen, and I. Petersen. 2017. “Missing Data and Multiple Imputation in Clinical Epidemiological Research.” Clinical Epidemiology 9: 157–66. https://www.tandfonline.com/doi/full/10.2147/CLEP.S129785.
Schafer, J. L., and J. W. Graham. 2002. “Missing Data: Our View of the State of the Art.” Psychological Methods 7 (2): 147–77. https://psycnet.apa.org/doi/10.1037/1082-989X.7.2.147.
Streiner, D. L. 2008. “Missing Data and the Trouble with LOCF.” EBMH 11 (1): 1–5. http://dx.doi.org/10.1136/ebmh.11.1.3-a.
Thongsri, T., and K. Samart. 2022. “Composite Imputation Method for the Multiple Linear Regression with Missing at Random Data.” International Journal of Mathematics and Mathematics and Computer Science 17 (1): 51–62. http://ijmcs.future-in-tech.net/17.1/R-Samart.pdf.
van_Ginkel, J. R., M. Linting, R. C. Rippe, and A. van der Voort. 2020. “Rebutting Existing Misconceptions about Multiple Imputation as a Method for Handling Missing Data.” Journal of Personality Assessment 102 (3): 2812–31. https://doi.org/10.1080/00223891.2018.1530680.
Wongkamthong, C., and O. Akande. 2023. “A Comparative Study of Imputation Methods for Multivariate Ordinal Data.” Journal of Survey Statistics and Methodology 11 (1): 189–212. https://doi.org/10.1093/jssam/smab028.
Wulff, J. N., and L. E. Jeppesen. 2017. “Multiple Imputation by Chained Equations in Praxis: Guidelines and Review.” Electronics Journal of Business Research Methods 15 (1): 41–56. https://vbn.aau.dk/ws/files/257318283/ejbrm_volume15_issue1_article450.pdf.