Refs:

Data Preparation

Let’s load the Social Indicators Survey dataset for this tutorial (cf. dataset’s codebook):

sis <- read.csv("siswave3v4impute3.csv", sep=";")

# from Gelman&Hill's book:
na.fix <- function (a) {
  ifelse(a<0 | a==999999, NA, a)
}

sis$earnings <- (na.fix(sis$rearn) + na.fix(sis$tearn))/1000

# select a column subset of the entire dataset
sis_sm <- as.data.frame(with(sis, cbind(sex, race, educ_r, r_age, earnings, police)))
sis_sm[91:95,]
##    sex race educ_r r_age earnings police
## 91   1    3      3    31       NA      0
## 92   2    1      2    37      135      1
## 93   0    1      1     3       NA      2
## 94   1    1      3    42        3      1
## 95   1    3      1    24        0     NA

# clean dataset
sis_sm <- sis_sm[c(-93,-757),]  # remove lines 93 & 756 (typos?)

# give proper types to predictors
sis_sm$sex    <- as.factor(sis_sm$sex)
sis_sm$race   <- as.factor(sis_sm$race)
sis_sm$educ_r <- as.ordered(sis_sm$educ_r) # ordered factor
sis_sm$police <- as.factor(sis_sm$police)
sapply(sis_sm, class)
## $sex
## [1] "factor"
## 
## $race
## [1] "factor"
## 
## $educ_r
## [1] "ordered" "factor" 
## 
## $r_age
## [1] "numeric"
## 
## $earnings
## [1] "numeric"
## 
## $police
## [1] "factor"

summary(sis_sm)
##    sex        race      educ_r        r_age          earnings        police   
##  1   :543   1   :478   1   :229   Min.   :18.00   Min.   :   0.00   0   :490  
##  2   :956   2   :401   2   :398   1st Qu.:30.00   1st Qu.:   5.00   1   :941  
##  NA's:  2   3   :453   3   :376   Median :39.00   Median :  28.00   NA's: 70  
##             4   :130   4   :483   Mean   :41.33   Mean   :  52.56             
##             NA's: 39   NA's: 15   3rd Qu.:50.00   3rd Qu.:  65.00             
##                                   Max.   :97.00   Max.   :3250.00             
##                                   NA's   :2       NA's   :252

Notice the existence of several NA values, ie, missing data.

Introduction

There are several types of missing data (this is Gelman & Hill classification):

For instance, our dataset does not seem to be MCAR:

library(dplyr)
library(magrittr)

sis_dp <- tbl_df(sis_sm)

check.income.NA <- function(for.race) {

  df <- sis_dp %>% dplyr::filter(race==for.race)
  
  n    <- df %>% nrow()
  n.NA <- df %>% dplyr::filter(is.na(earnings)) %>% nrow()
  
  list(total=n, total.NA=n.NA, perc=n.NA/n)
}

check.income.NA(1) # whites
## $total
## [1] 478
## 
## $total.NA
## [1] 92
## 
## $perc
## [1] 0.1924686
check.income.NA(2) # blacks
## $total
## [1] 401
## 
## $total.NA
## [1] 47
## 
## $perc
## [1] 0.117207

The missing value percentage for whites and blacks is quite different. This is evidence that this dataset is not MCAR.

A good methodology is to include as many predictors as possible so that the MAR hypothesis is reasonable for our model.

Using package mice

We’ll use the mice package to deal with missing data.

library(mice)

Data Exploration

Function md.pattern presents a report about how missing data is spread over the dataset:

md.pattern(sis_sm)
##      sex r_age educ_r race police earnings    
## 1169   1     1      1    1      1        1   0
##   21   1     1      1    0      1        1   1
##    6   1     1      0    1      1        1   1
##  219   1     1      1    1      1        0   1
##   50   1     1      1    1      0        1   1
##   12   1     1      1    0      1        0   2
##    3   1     1      0    1      1        0   2
##    1   1     1      1    0      0        1   2
##    2   1     1      0    1      0        1   2
##   12   1     1      1    1      0        0   2
##    1   1     1      0    0      1        0   3
##    2   1     1      1    0      0        0   3
##    1   1     1      0    1      0        0   3
##    2   0     0      0    0      0        0   6
##        2     2     15   39     70      252 380

We see that 1170 rows are complete (have zero NA), 21 miss the race info, and so on…

In this dataset, we could simply remove the samples with more missing values. For eg, the last six samples, with three or more NAs, could be removed without a significant loss of information. There are 30 rows with two missing values. It would be a decision for the data analyst to keep or remove them.

We can also visualize this information:

library(VIM)

aggr(sis_sm, col=c('blue','red'), numbers=TRUE, sortVars=TRUE,
             labels=names(sis_sm), cex.axis=.8, gap=3, 
             ylab=c("Histogram of missing data", "Red is Missing Data"))

## 
##  Variables sorted by number of missings: 
##  Variable       Count
##  earnings 0.167888075
##    police 0.046635576
##      race 0.025982678
##    educ_r 0.009993338
##       sex 0.001332445
##     r_age 0.001332445

Another function is md.pairs which computes the number of observations for all pairs of variables and for all pairs (observed/missing,observed/missing):

md.pairs(sis_sm)
## $rr
##           sex race educ_r r_age earnings police
## sex      1499 1462   1486  1499     1249   1431
## race     1462 1462   1450  1462     1227   1397
## educ_r   1486 1450   1486  1486     1241   1421
## r_age    1499 1462   1486  1499     1249   1431
## earnings 1249 1227   1241  1249     1249   1196
## police   1431 1397   1421  1431     1196   1431
## 
## $rm
##          sex race educ_r r_age earnings police
## sex        0   37     13     0      250     68
## race       0    0     12     0      235     65
## educ_r     0   36      0     0      245     65
## r_age      0   37     13     0      250     68
## earnings   0   22      8     0        0     53
## police     0   34     10     0      235      0
## 
## $mr
##          sex race educ_r r_age earnings police
## sex        0    0      0     0        0      0
## race      37    0     36    37       22     34
## educ_r    13   12      0    13        8     10
## r_age      0    0      0     0        0      0
## earnings 250  235    245   250        0    235
## police    68   65     65    68       53      0
## 
## $mm
##          sex race educ_r r_age earnings police
## sex        2    2      2     2        2      2
## race       2   39      3     2       17      5
## educ_r     2    3     15     2        7      5
## r_age      2    2      2     2        2      2
## earnings   2   17      7     2      252     17
## police     2    5      5     2       17     70

For eg, there are 251 samples where the sex value is present but the earnings are not (cf. matrix $rm).

Removing Data

The simplest method is to remove the samples with missing data:

sis_sm1a <- cc(sis_sm)
head(sis_sm1a)
##   sex race educ_r r_age earnings police
## 1   1    1      4    29     84.0      0
## 3   1    3      2    36     27.5      0
## 4   1    1      4    71     85.0      1
## 5   1    4      4    30    135.0      1
## 7   2    1      4    32     92.0      1
## 8   2    2      2    67      0.0      1

This however can be too much if many samples have columns with NA values.

Also, if the removed samples are in any way different from the complete samples, then the analysis over the complete cases will bias the analysis.

A partial solution is to remove only a minority of samples that have too much missing information. In out eg, say we decide to remove all the cases with three or more NA values:

n_NAs <- apply(sis_sm, 1, function(x) sum(is.na(x))) # number of NAs per row
sis_sm2 <- sis_sm[n_NAs<3,]                          # keep those with 0,1,2 NAs
md.pattern(sis_sm2)
##      sex r_age educ_r race police earnings    
## 1169   1     1      1    1      1        1   0
##   21   1     1      1    0      1        1   1
##    6   1     1      0    1      1        1   1
##  219   1     1      1    1      1        0   1
##   50   1     1      1    1      0        1   1
##   12   1     1      1    0      1        0   2
##    3   1     1      0    1      1        0   2
##    1   1     1      1    0      0        1   2
##    2   1     1      0    1      0        1   2
##   12   1     1      1    1      0        0   2
##        0     0     11   34     65      246 356

If we only partially remove missing data, we still have to deal with the remaining NA values.

Another method is to use different subsets of the original dataset to answer different inference questions. In this case, we would remove less incomplete samples since we would be using less columns per inference. This means that these various inferences might not be consistent with each other, since each used different data.

Data Imputation

Data imputation is the replacement of NA values for some estimated value based on an imputation algorithm.

Categorical variables are tricky. For eg, we cannot replace them by the mean, since that would probably add a value not within the domain of that variable. A better alternative would be to replace by the mode, but even that is not advisable.

A better method is to fit a linear regression or logistic regression using the available data, and then, for each missing row, use the respective prediction as the imputed value. A variant is to add some reasonable random noise (also estimated from the available data) into the imputation.

Function mice executes a given imputation method over the dataset. It has several methods available:

methods(mice)
##  [1] mice.impute.2l.norm      mice.impute.2l.pan       mice.impute.2lonly.mean  mice.impute.2lonly.norm  mice.impute.2lonly.pmm   mice.impute.cart        
##  [7] mice.impute.fastpmm      mice.impute.lda          mice.impute.logreg       mice.impute.logreg.boot  mice.impute.mean         mice.impute.norm        
## [13] mice.impute.norm.boot    mice.impute.norm.nob     mice.impute.norm.predict mice.impute.passive      mice.impute.pmm          mice.impute.polr        
## [19] mice.impute.polyreg      mice.impute.quadratic    mice.impute.rf           mice.impute.ri           mice.impute.sample       mice.mids               
## [25] mice.theme              
## see '?methods' for accessing help and source code

Here’s information about some of these methods:

  • pmm is predictive mean matching (for numeric)
  • mean is unconditional mean imputation (for numeric)
  • norm is Bayesian Linear Regression (for numeric)
  • norm.nob is (classical) Linear Regression (for numeric)
  • logreg is logistic regression (for 2 factors)
  • polyreg is multinomial logit (for 2+ factors)
  • polr is ordered logit (for 2+ ordered factor)

We can use specific methods for each column (if a predictor is complete, just place "" in the meth argument list):

sapply(sis_sm2, class)  # check again the types of each predictor
## $sex
## [1] "factor"
## 
## $race
## [1] "factor"
## 
## $educ_r
## [1] "ordered" "factor" 
## 
## $r_age
## [1] "numeric"
## 
## $earnings
## [1] "numeric"
## 
## $police
## [1] "factor"
ni <- 7     # the number of imputations
imp.data <- mice(sis_sm2, m=ni, maxit=50, 
                 meth=c('logreg', 'polyreg', 'polr', 'pmm', 'pmm', 'logreg'), 
                 seed=99, diag=FALSE, print=FALSE)
summary(imp.data)
## Multiply imputed data set
## Call:
## mice(data = sis_sm2, m = ni, method = c("logreg", "polyreg", 
##     "polr", "pmm", "pmm", "logreg"), maxit = 50, diagnostics = FALSE, 
##     printFlag = FALSE, seed = 99)
## Number of multiple imputations:  7
## Missing cells per column:
##      sex     race   educ_r    r_age earnings   police 
##        0       34       11        0      246       65 
## Imputation methods:
##       sex      race    educ_r     r_age  earnings    police 
##  "logreg" "polyreg"    "polr"     "pmm"     "pmm"  "logreg" 
## VisitSequence:
##     race   educ_r earnings   police 
##        2        3        5        6 
## PredictorMatrix:
##          sex race educ_r r_age earnings police
## sex        0    0      0     0        0      0
## race       1    0      1     1        1      1
## educ_r     1    1      0     1        1      1
## r_age      0    0      0     0        0      0
## earnings   1    1      1     1        0      1
## police     1    1      1     1        1      0
## Random generator seed value:  99

Mice give us information about which set of predictors were used to impute a certain variable. This information is kept in the predictor matrix:

[The predictor matrix] rows correspond to target variables (i.e. variables to be imputed), in the sequence as they appear in data. A value of ‘1’ means that the column variable is used as a predictor for the target variable (in the rows). The diagonal of predictorMatrix must be zero [ref: mice help]

This predictor matrix can be supplied by the user in argument pred, which allows her to control which predictors are used for each variable.

We can check the values assigned to each sample column at each data imputation for diagnostic checking, ie, to see if the imputations are plausible:

head(imp.data$imp$police, 10) # for this predictor, should be 0s and 1s only
##     1 2 3 4 5 6 7
## 31  0 1 1 1 1 1 0
## 38  1 1 0 1 0 0 0
## 52  1 1 1 1 1 1 1
## 54  1 1 1 1 0 0 1
## 58  1 1 1 1 1 1 1
## 95  0 0 1 0 0 1 1
## 115 1 1 1 1 1 1 1
## 136 0 0 0 1 0 0 0
## 159 0 1 1 0 1 1 1
## 167 1 1 1 1 1 1 1

Visualizing Imputation Results

Package mice provides several plotting tools.

Function stripplot shows the distributions for each predictor:

library(lattice)
stripplot(imp.data, pch = 20, cex = 1)

We can visualize if the density plots of the original data and the inputed data are close:

densityplot(imp.data, xlim=c(-1e2,1e3))  # blue is the original data, red are the imputations

In this case, we only have one continuous variable (income). Despite the plot, there are no negative incomes, this is only an effect of fitting a density to the actual datapoints.

And this is a scatterplot of earnings against other predictors:

xyplot(imp.data, earnings ~ race+educ_r+police+r_age, pch=20, cex=1)

Checking Convergence

One way to check if the imputation algorithm ran ok, is to verify how the mean and variance of each imputation evolved over iterations (let’s call them imputation streams). A sign of convergence is to see each stream mixing with the other, with no sign of a trend. If the streams look flat, something wrong happened and the user should try other methods, increase the number of interations (fortunately, the number of iterations needed are much lower than for MCMC methods) or make changes to the predictor matrix.

plot(imp.data, c("race", "earnings", "police", "educ_r"))