Missing Data and Imputation

Michael Underwood

Elizabeth Subject-Scott

Javier Estrada

What is Missing Data?

  • Missing data occurs when there are missing values in a dataset
    • Can be intentional or unintentional
  • Missing data is classified into 3 different categories:
    • Missing Completely At Random (MCAR)
    • Missing At Random (MAR)
    • Missing Not At Random (MNAR)

X are the completely observed variables.

Y are the partly missing variables.

Z is the component of the cause of missingness unrelated to X and Y.

R is the missingness.

Methods to Handle Missing Data

  • Likelihood Bayesian Method
    • Predicts missing values based on a previous predictive distribution.
  • Weighting Method
    • Uses weights from available data to adjust for missing values.
  • Imputation Method
    • Uses estimates from original data to determine missing values

Deleting Missing Data

  • When type is MCAR and the amount of missing data is small, deletion can be used.

  • 2 Types

    • Listwise deletion occurs when the entire observation is removed.

    • Pairwise deletion occurs when the variable of an observation is removed.

  • Deleting missing data can lead to the loss of important information regarding your dataset and is not recommended.

Preferred Method:

  • Imputation

    • 2 Types

      • Single Imputation

        • Only one estimate is used to replace the missing data.
      • Multiple Imputation

        • Various estimates are used to replace the missing data by creating mulitple versions of the original dataset.

Single or Univariate Imputation

  • Methods include:

    • Using the mean to replace a missing value.
      • The problem with this method is that it reduces the variance which leads to a smaller confidence interval.
    • Last Observation Carried Forward (LOCF) replaces a missing value with a previously observed value (the most recent value is carried forward).
      • The problem with this method is that it assumes that the previous observed value is perpetual, when in reality that may not be the case.

Multiple Imputation

  • A set of m plausible values are generated for each unobserved data point, resulting in M complete data sets.
  • The new values are randomly drawn from predictive distributions either through joint modeling (JM) or fully conditional specification (FCS).
  • It is then analyzed and the results are combined, or pooled together, to obtain a single value for the missing data.
  • Multiple imputation by chained methods (MICE) is the most common and preferred method of multiple imputation.

Other Methods of Imputation

  • Regression Imputation is based on a linear regression model. Missing data is randomly drawn from a conditional distribution when variables are continuous and from a logistic regression model when they are categorical.

  • Predictive Mean Matching is also based on a linear regression model. The approach is the same as regression imputation except instead of random draws from a conditional distribution, missing values are based on predicted values of the outcome variable.

  • Hot Deck (HD) imputation is when a missing value is replaced by an observed response of a similar unit, also known as the donor. It can be either random or deterministic, which is based on a metric or value. It does not rely on model fitting.

  • Stochastic Regression (SR) Imputation is an extension of regression imputation. The process is the same but a residual term from the normal distribution of the regression of the predictor outcome is added to the imputed value. This maintains the variability of the data.

  • Random Forest (RF) Imputation is based on machine learning algorithms. Missing values are first replaced with the mean or mode of that particular variable and then the dataset is split into a training set and a prediction set. The missing values are then replaced with predictions from these sets. This type of imputation can be used on continuous or categorical variables with complex interactions.

Methodology of Multiple Imputation

  • Point of multiple imputation
    • Not to create artificial values for the missing data
    • Intended to be a tool in analyzing data with missing values
  • Developed by Donald Rubin, Harvard Professor in the 70s
    • Previous method was not correct
      • It is impossible to calculate missing data with certainty.
      • It did not take into account the estimates in error in the predictions.
      • He recommended creating multiple datasets, based on the Bayesian method.
      • It gives more precise estimates than using just one imputed dataset.
    • According to Rubin’s Rule, in multiple imputation missing values are replaced by m imputed values which result in M complete datasets.
      • For each of the M datasets, an estimate of \(\theta\) is acquired.

Pooling Estimates

  • To calculate the pooled mean differences of \(\theta\):

\[{\overline{\theta}}_{M} = \displaystyle \frac{1}{M}\sum_{m = 1}^{M} {\hat{\theta}}_{m}\]

  • This is the average of m estimates of M imputed datasets.

Pooling Standard Errors

  • To calculate the proposed variance estimator of \({\hat{\theta}}_{m}\):

\[{\hat{\Phi}}_{M} = {\overline{\phi}}_{M} + (1+\displaystyle \frac{1}{M})B_{M}\]

  • This is the total variance and is made up of two parts: the average within imputation variance and the between imputation variance (with a correction factor).

  • The average within imputation variance (normal standard error) is:

\[{\overline{\phi}}_{M} = \displaystyle \frac{1}{M}\sum_{m=1}^{M}{SE}^2\]

(It will be large in small samples and small in large samples.)

  • The between imputation variance (how much extra variation there is due to the imputation process) is:

\[B_{M} = \displaystyle \frac{1}{M-1}\sum_{m = 1}^{M}({\hat{\theta}}_{m}-{\overline{\theta}}_{m})^{2}\]

  • When the between variance is greater than the within variance, then greater efficiency occurs and more accurate estimates can be achieved by increasing M.

  • When the within variance is greater than the between variance, then little is gained by increasing M.

  • To calculate the total standard error:

\[SE_{pooled} = {\sqrt{\hat{\Phi}}_{M}}\]

Significance Testing

\[Wald_{pooled} = \frac{(\overline{\theta} - {\theta_{0}})^2}{{\hat{\Phi}}_{M}}\]

  • The Wald test follows a t-distribution with the degrees of freedom as follows:

\[df = (M-1)(1+\frac{1}{M+1} * \frac{\overline{\theta}_{M}}{B_{M}})^2\]

  • The 2 sided t-value is as follows:

\[t_{df, 1-\alpha/2}\]

  • To test for significance using F-distribution:

\[ F_{1,df} = t^{2}_{df, 1-\alpha/2}\]

Assumptions for Multiple Imputation

  • Observed data follow a normal distribution. If not normal, a transformation is first needed.

  • Missing data are classified as MAR, which is the probability that a missing value depends only on observed values and not unobserved values.

  • The parameters \({\theta}\) of the data model and the parameters \({\phi}\) of the model for the missing values are distinct. That is, knowing the values of \({\theta}\) does not provide any information about \({\phi}\).

Steps in Multiple Imputation:

  • Step 1: Impute missing data

  • Step 2: Run regression models on all imputation sets

  • Step 3: Pool regression results into one regression result

MICE in R

  • MICE = Multivariate Imputation by Chained Equations.

  • mice() method in R.

  • The method creates multiple imputations (replacement values) for multivariate missing data.

  • The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data.

  • This can be done using the imputation regressions of the mice() method such as:

  • predictive mean matching (numeric data)

  • logistic regression imputation (binary data with two factorial levels)

  • polytomous regression imputation for unordered categorical data (factor > 2 levels)

  • proportional odds model for (ordered, > 2 levels).

Dataset

  • King County, Seattle Home Sale Prices between 2014 and 2015

    • Contains the sale prices of 21,613 houses

    • The original dataset contained 21 columns with various selling attributes.

      • For the purpose of this project, we have condensed the variables to the following 4:

        • sales price of home
        • number of bedrooms
        • number of bathrooms
        • square footage of living space
# load library
library(mice, warn.conflicts=FALSE)
# load data
original = read.csv("kc_house_data.csv")
# first row
head(original, 1)
   price bedrooms bathrooms sqft_living
1 221900        3         1        1180

  • First, we evaluate the dataset for missing values by using the is.na() function:
# check for missing data
sapply(original, function(x) sum(is.na(x)))
      price    bedrooms   bathrooms sqft_living 
          0           0           0           0 
  • There is no missing data; this is the original, complete dataset.
  • We will artificially create some missing values.
  • We do this by first duplicating the dataset. We want to preserve the original so that we can later check the accuracy of the imputation process.
# duplicate dataset
house <- original

  • Next, we randomly replace the desired amount of values with NA to mimic missing data.

  • We will assign 200 NA values to the following variables: - bedrooms, - bathrooms, - sqft_living

  • And 100 NA values to the price variable.

# replace some values with NA to create missing data
set.seed(10)
house[sample(1:nrow(house), 200), "sqft_living"] <- NA
house[sample(1:nrow(house), 200), "bedrooms"] <- NA
house[sample(1:nrow(house), 200), "bathrooms"] <- NA
house[sample(1:nrow(house), 100), "price"] <- NA
  • Using the is.na() function, we can check to confirm we have created missing data:
# check for missing data
sapply(house, function(x) sum(is.na(x)))
      price    bedrooms   bathrooms sqft_living 
        100         200         200         200 
  • Our dataset now has 700 NA/missing values. This equates to about 3% of the data (700/21,613).

Visualizations

# check the missingness patterns
md.pattern(house, rotate.names = TRUE)

      price bedrooms bathrooms sqft_living    
20920     1        1         1           1   0
195       1        1         1           0   1
196       1        1         0           1   1
2         1        1         0           0   2
197       1        0         1           1   1
2         1        0         1           0   2
1         1        0         0           1   2
98        0        1         1           1   1
1         0        1         1           0   2
1         0        1         0           1   2
        100      200       200         200 700
library(VIM)
aggr_plot <- aggr(house, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))


 Variables sorted by number of missings: 
    Variable       Count
    bedrooms 0.009253690
   bathrooms 0.009253690
 sqft_living 0.009253690
       price 0.004626845
  • The red box plot on the left shows the distribution of sqft_living with price missing. The blue box plot shows the distribution of the remaining data points. The same is true for the price box plots on the bottom. We expect the red and blue boxplots to be very similar if our assumptions of MCAR data are true. In this case, they are.
marginplot(house[c(1,4)])

Imputation Process

  • Since we are missing about 3% of our data, we need to perform at least 3 imputations. This will be done using the mice() function:

  • Since 5 is the default, we will use that (the m parameter can be used to adjust the the # of imputations)

  • The set.seed will be given the value 1337 (any number can be used here) to retrieve the same results each time the multiple imputation is performed.

# impute the data 5 times (default)
imp = mice(data = house, set.seed = 1337, defaultMethod = c("pmm", "logreg", "polyreg",  "norm", "polr"))

 iter imp variable
  1   1  price  bedrooms  bathrooms  sqft_living
  1   2  price  bedrooms  bathrooms  sqft_living
  1   3  price  bedrooms  bathrooms  sqft_living
  1   4  price  bedrooms  bathrooms  sqft_living
  1   5  price  bedrooms  bathrooms  sqft_living
  2   1  price  bedrooms  bathrooms  sqft_living
  2   2  price  bedrooms  bathrooms  sqft_living
  2   3  price  bedrooms  bathrooms  sqft_living
  2   4  price  bedrooms  bathrooms  sqft_living
  2   5  price  bedrooms  bathrooms  sqft_living
  3   1  price  bedrooms  bathrooms  sqft_living
  3   2  price  bedrooms  bathrooms  sqft_living
  3   3  price  bedrooms  bathrooms  sqft_living
  3   4  price  bedrooms  bathrooms  sqft_living
  3   5  price  bedrooms  bathrooms  sqft_living
  4   1  price  bedrooms  bathrooms  sqft_living
  4   2  price  bedrooms  bathrooms  sqft_living
  4   3  price  bedrooms  bathrooms  sqft_living
  4   4  price  bedrooms  bathrooms  sqft_living
  4   5  price  bedrooms  bathrooms  sqft_living
  5   1  price  bedrooms  bathrooms  sqft_living
  5   2  price  bedrooms  bathrooms  sqft_living
  5   3  price  bedrooms  bathrooms  sqft_living
  5   4  price  bedrooms  bathrooms  sqft_living
  5   5  price  bedrooms  bathrooms  sqft_living
imputed <- complete(imp) # We need to create the imputed dataset using the complete() function

Imputation Process - Verifying Missing Data

  • We now need to check the imputed dataset to make sure there is no missing data. Again, we use the is.na() function:
sapply(imputed, function(x) sum(is.na(x)))
      price    bedrooms   bathrooms sqft_living 
          0           0           0           0 

Post-Imputation Visualizations:

  • We can check the quality of the imputations by running a strip plot, which is a single axis scatter plot.

  • It will show the distribution of the imputed values over the observed values.

  • Blue values are observed value and red values are imputed values. Ideally, we want the imputations to be values that could have been observed had the data not been missing.

# strip plots to check the quality of the imputations
par(mfrow=c(2,2))
stripplot(imp, price, pch = 19, cex=1, xlab = "Imputation number")

stripplot(imp, bedrooms, pch = 19, cex=1, xlab = "Imputation number")

stripplot(imp, bathrooms, pch = 19, cex=1, xlab = "Imputation number")

stripplot(imp, sqft_living, pch = 19, cex=1, xlab = "Imputation number")

Fitting and Pooling Imputed Datasets

  • Next, we will run regression on our imputed datasets, so we will have 5 regression results. We do this with the with() function. For each regression, we get different results.
# fit complete-data model
fit <- with(imp, lm(price~bedrooms+bathrooms+sqft_living))
  • Next, we will pool the results from our regressions to arrive at estimates that will properly account for the missing data.

  • It will give us the estimate, standard error, test statistic, degrees of freedom, and the p-value for each variable.

# pool and summarize the results
summary(pool(fit))
         term    estimate   std.error  statistic        df       p.value
1 (Intercept)  75043.7775 6952.563287  10.793685 14431.278  4.677205e-27
2    bedrooms -57930.4966 2357.322483 -24.574702  7683.745 1.936143e-128
3   bathrooms   7624.2182 3531.505441   2.158914 11516.540  3.087739e-02
4 sqft_living    309.7992    3.114475  99.470779  8431.335  0.000000e+00

Checking for Accuracy

  • We will now compare the results from our pooled, imputed dataset to our original, complete dataset (before we removed values to create missing data) to check for accuracy of the imputation method by chained equations:
# fit original, complete dataset
og_lm = lm(price~bedrooms+bathrooms+sqft_living, data = original)
# compare imputed dataset to the original dataset
summary(list("Imputed" = pool(fit), "Original" = og_lm), title = "Comparison", statistic = "p.value", estimate = "estimate", gof_omit = 'IC|Log|Adj|F|RMSE')
         Length Class Mode
Imputed   4     mipo  list
Original 12     lm    list

 

Results

  • The estimates from the multiple imputation are very close to the estimates for the original dataset. The p-values are similar and the differences are minimal with exception of bathrooms which is off just slightly.

  • These results indicate a high accuracy of the imputation process. We can conclude that multiple imputation by chained equation is a reliable source to impute missing data.

Conclusion

  • Missing data can occur in research for a variety of reasons.
  • It is never a good idea to ignore it. Doing this will lead to biased estimates of parameters, loss of information, decreased statistical power, and weak reliability of findings.
  • The best course of action is to impute the missing data by using multiple imputation.
  • Performing multiple imputation will minimize the adverse effects caused by missing data on the analysis.