Refs:

Data Preparation

Let’s load the Social Indicators Survey dataset for this tutorial (cf. dataset’s codebook):

sis <- read.csv("siswave3v4impute3.csv", sep=";")

# from Gelman&Hill's book:
na.fix <- function (a) {
  ifelse(a<0 | a==999999, NA, a)
}

sis$earnings <- (na.fix(sis$rearn) + na.fix(sis$tearn))/1000

# select a column subset of the entire dataset
sis_sm <- as.data.frame(with(sis, cbind(sex, race, educ_r, r_age, earnings, police)))
sis_sm[91:95,]
##    sex race educ_r r_age earnings police
## 91   1    3      3    31       NA      0
## 92   2    1      2    37      135      1
## 93   0    1      1     3       NA      2
## 94   1    1      3    42        3      1
## 95   1    3      1    24        0     NA

# clean dataset
sis_sm <- sis_sm[c(-93,-757),]  # remove lines 93 & 756 (typos?)

# give proper types to predictors
sis_sm$sex    <- as.factor(sis_sm$sex)
sis_sm$race   <- as.factor(sis_sm$race)
sis_sm$educ_r <- as.ordered(sis_sm$educ_r) # ordered factor
sis_sm$police <- as.factor(sis_sm$police)
sapply(sis_sm, class)
## $sex
## [1] "factor"
## 
## $race
## [1] "factor"
## 
## $educ_r
## [1] "ordered" "factor" 
## 
## $r_age
## [1] "numeric"
## 
## $earnings
## [1] "numeric"
## 
## $police
## [1] "factor"

summary(sis_sm)
##    sex        race      educ_r        r_age          earnings        police   
##  1   :543   1   :478   1   :229   Min.   :18.00   Min.   :   0.00   0   :490  
##  2   :956   2   :401   2   :398   1st Qu.:30.00   1st Qu.:   5.00   1   :941  
##  NA's:  2   3   :453   3   :376   Median :39.00   Median :  28.00   NA's: 70  
##             4   :130   4   :483   Mean   :41.33   Mean   :  52.56             
##             NA's: 39   NA's: 15   3rd Qu.:50.00   3rd Qu.:  65.00             
##                                   Max.   :97.00   Max.   :3250.00             
##                                   NA's   :2       NA's   :252

Notice the existence of several NA values, ie, missing data.

Introduction

There are several types of missing data (this is Gelman & Hill classification):

For instance, our dataset does not seem to be MCAR:

library(dplyr)
library(magrittr)

sis_dp <- tbl_df(sis_sm)

check.income.NA <- function(for.race) {

  df <- sis_dp %>% dplyr::filter(race==for.race)
  
  n    <- df %>% nrow()
  n.NA <- df %>% dplyr::filter(is.na(earnings)) %>% nrow()
  
  list(total=n, total.NA=n.NA, perc=n.NA/n)
}

check.income.NA(1) # whites
## $total
## [1] 478
## 
## $total.NA
## [1] 92
## 
## $perc
## [1] 0.1924686
check.income.NA(2) # blacks
## $total
## [1] 401
## 
## $total.NA
## [1] 47
## 
## $perc
## [1] 0.117207

The missing value percentage for whites and blacks is quite different. This is evidence that this dataset is not MCAR.

A good methodology is to include as many predictors as possible so that the MAR hypothesis is reasonable for our model.

Using package mice

We’ll use the mice package to deal with missing data.

library(mice)

Data Exploration

Function md.pattern presents a report about how missing data is spread over the dataset:

md.pattern(sis_sm)
##      sex r_age educ_r race police earnings    
## 1169   1     1      1    1      1        1   0
##   21   1     1      1    0      1        1   1
##    6   1     1      0    1      1        1   1
##  219   1     1      1    1      1        0   1
##   50   1     1      1    1      0        1   1
##   12   1     1      1    0      1        0   2
##    3   1     1      0    1      1        0   2
##    1   1     1      1    0      0        1   2
##    2   1     1      0    1      0        1   2
##   12   1     1      1    1      0        0   2
##    1   1     1      0    0      1        0   3
##    2   1     1      1    0      0        0   3
##    1   1     1      0    1      0        0   3
##    2   0     0      0    0      0        0   6
##        2     2     15   39     70      252 380

We see that 1170 rows are complete (have zero NA), 21 miss the race info, and so on…

In this dataset, we could simply remove the samples with more missing values. For eg, the last six samples, with three or more NAs, could be removed without a significant loss of information. There are 30 rows with two missing values. It would be a decision for the data analyst to keep or remove them.

We can also visualize this information:

library(VIM)

aggr(sis_sm, col=c('blue','red'), numbers=TRUE, sortVars=TRUE,
             labels=names(sis_sm), cex.axis=.8, gap=3, 
             ylab=c("Histogram of missing data", "Red is Missing Data"))

## 
##  Variables sorted by number of missings: 
##  Variable       Count
##  earnings 0.167888075
##    police 0.046635576
##      race 0.025982678
##    educ_r 0.009993338
##       sex 0.001332445
##     r_age 0.001332445

Another function is md.pairs which computes the number of observations for all pairs of variables and for all pairs (observed/missing,observed/missing):

md.pairs(sis_sm)
## $rr
##           sex race educ_r r_age earnings police
## sex      1499 1462   1486  1499     1249   1431
## race     1462 1462   1450  1462     1227   1397
## educ_r   1486 1450   1486  1486     1241   1421
## r_age    1499 1462   1486  1499     1249   1431
## earnings 1249 1227   1241  1249     1249   1196
## police   1431 1397   1421  1431     1196   1431
## 
## $rm
##          sex race educ_r r_age earnings police
## sex        0   37     13     0      250     68
## race       0    0     12     0      235     65
## educ_r     0   36      0     0      245     65
## r_age      0   37     13     0      250     68
## earnings   0   22      8     0        0     53
## police     0   34     10     0      235      0
## 
## $mr
##          sex race educ_r r_age earnings police
## sex        0    0      0     0        0      0
## race      37    0     36    37       22     34
## educ_r    13   12      0    13        8     10
## r_age      0    0      0     0        0      0
## earnings 250  235    245   250        0    235
## police    68   65     65    68       53      0
## 
## $mm
##          sex race educ_r r_age earnings police
## sex        2    2      2     2        2      2
## race       2   39      3     2       17      5
## educ_r     2    3     15     2        7      5
## r_age      2    2      2     2        2      2
## earnings   2   17      7     2      252     17
## police     2    5      5     2       17     70

For eg, there are 251 samples where the sex value is present but the earnings are not (cf. matrix $rm).

Removing Data

The simplest method is to remove the samples with missing data:

sis_sm1a <- cc(sis_sm)
head(sis_sm1a)
##   sex race educ_r r_age earnings police
## 1   1    1      4    29     84.0      0
## 3   1    3      2    36     27.5      0
## 4   1    1      4    71     85.0      1
## 5   1    4      4    30    135.0      1
## 7   2    1      4    32     92.0      1
## 8   2    2      2    67      0.0      1

This however can be too much if many samples have columns with NA values.

Also, if the removed samples are in any way different from the complete samples, then the analysis over the complete cases will bias the analysis.

A partial solution is to remove only a minority of samples that have too much missing information. In out eg, say we decide to remove all the cases with three or more NA values:

n_NAs <- apply(sis_sm, 1, function(x) sum(is.na(x))) # number of NAs per row
sis_sm2 <- sis_sm[n_NAs<3,]                          # keep those with 0,1,2 NAs
md.pattern(sis_sm2)
##      sex r_age educ_r race police earnings    
## 1169   1     1      1    1      1        1   0
##   21   1     1      1    0      1        1   1
##    6   1     1      0    1      1        1   1
##  219   1     1      1    1      1        0   1
##   50   1     1      1    1      0        1   1
##   12   1     1      1    0      1        0   2
##    3   1     1      0    1      1        0   2
##    1   1     1      1    0      0        1   2
##    2   1     1      0    1      0        1   2
##   12   1     1      1    1      0        0   2
##        0     0     11   34     65      246 356

If we only partially remove missing data, we still have to deal with the remaining NA values.

Another method is to use different subsets of the original dataset to answer different inference questions. In this case, we would remove less incomplete samples since we would be using less columns per inference. This means that these various inferences might not be consistent with each other, since each used different data.

Data Imputation

Data imputation is the replacement of NA values for some estimated value based on an imputation algorithm.

Categorical variables are tricky. For eg, we cannot replace them by the mean, since that would probably add a value not within the domain of that variable. A better alternative would be to replace by the mode, but even that is not advisable.

A better method is to fit a linear regression or logistic regression using the available data, and then, for each missing row, use the respective prediction as the imputed value. A variant is to add some reasonable random noise (also estimated from the available data) into the imputation.

Function mice executes a given imputation method over the dataset. It has several methods available:

methods(mice)
##  [1] mice.impute.2l.norm      mice.impute.2l.pan       mice.impute.2lonly.mean  mice.impute.2lonly.norm  mice.impute.2lonly.pmm   mice.impute.cart        
##  [7] mice.impute.fastpmm      mice.impute.lda          mice.impute.logreg       mice.impute.logreg.boot  mice.impute.mean         mice.impute.norm        
## [13] mice.impute.norm.boot    mice.impute.norm.nob     mice.impute.norm.predict mice.impute.passive      mice.impute.pmm          mice.impute.polr        
## [19] mice.impute.polyreg      mice.impute.quadratic    mice.impute.rf           mice.impute.ri           mice.impute.sample       mice.mids               
## [25] mice.theme              
## see '?methods' for accessing help and source code

Here’s information about some of these methods:

  • pmm is predictive mean matching (for numeric)
  • mean is unconditional mean imputation (for numeric)
  • norm is Bayesian Linear Regression (for numeric)
  • norm.nob is (classical) Linear Regression (for numeric)
  • logreg is logistic regression (for 2 factors)
  • polyreg is multinomial logit (for 2+ factors)
  • polr is ordered logit (for 2+ ordered factor)

We can use specific methods for each column (if a predictor is complete, just place "" in the meth argument list):

sapply(sis_sm2, class)  # check again the types of each predictor
## $sex
## [1] "factor"
## 
## $race
## [1] "factor"
## 
## $educ_r
## [1] "ordered" "factor" 
## 
## $r_age
## [1] "numeric"
## 
## $earnings
## [1] "numeric"
## 
## $police
## [1] "factor"
ni <- 7     # the number of imputations
imp.data <- mice(sis_sm2, m=ni, maxit=50, 
                 meth=c('logreg', 'polyreg', 'polr', 'pmm', 'pmm', 'logreg'), 
                 seed=99, diag=FALSE, print=FALSE)
summary(imp.data)
## Multiply imputed data set
## Call:
## mice(data = sis_sm2, m = ni, method = c("logreg", "polyreg", 
##     "polr", "pmm", "pmm", "logreg"), maxit = 50, diagnostics = FALSE, 
##     printFlag = FALSE, seed = 99)
## Number of multiple imputations:  7
## Missing cells per column:
##      sex     race   educ_r    r_age earnings   police 
##        0       34       11        0      246       65 
## Imputation methods:
##       sex      race    educ_r     r_age  earnings    police 
##  "logreg" "polyreg"    "polr"     "pmm"     "pmm"  "logreg" 
## VisitSequence:
##     race   educ_r earnings   police 
##        2        3        5        6 
## PredictorMatrix:
##          sex race educ_r r_age earnings police
## sex        0    0      0     0        0      0
## race       1    0      1     1        1      1
## educ_r     1    1      0     1        1      1
## r_age      0    0      0     0        0      0
## earnings   1    1      1     1        0      1
## police     1    1      1     1        1      0
## Random generator seed value:  99

Mice give us information about which set of predictors were used to impute a certain variable. This information is kept in the predictor matrix:

[The predictor matrix] rows correspond to target variables (i.e. variables to be imputed), in the sequence as they appear in data. A value of ‘1’ means that the column variable is used as a predictor for the target variable (in the rows). The diagonal of predictorMatrix must be zero [ref: mice help]

This predictor matrix can be supplied by the user in argument pred, which allows her to control which predictors are used for each variable.

We can check the values assigned to each sample column at each data imputation for diagnostic checking, ie, to see if the imputations are plausible:

head(imp.data$imp$police, 10) # for this predictor, should be 0s and 1s only
##     1 2 3 4 5 6 7
## 31  0 1 1 1 1 1 0
## 38  1 1 0 1 0 0 0
## 52  1 1 1 1 1 1 1
## 54  1 1 1 1 0 0 1
## 58  1 1 1 1 1 1 1
## 95  0 0 1 0 0 1 1
## 115 1 1 1 1 1 1 1
## 136 0 0 0 1 0 0 0
## 159 0 1 1 0 1 1 1
## 167 1 1 1 1 1 1 1

Visualizing Imputation Results

Package mice provides several plotting tools.

Function stripplot shows the distributions for each predictor:

library(lattice)
stripplot(imp.data, pch = 20, cex = 1)

We can visualize if the density plots of the original data and the inputed data are close:

densityplot(imp.data, xlim=c(-1e2,1e3))  # blue is the original data, red are the imputations

In this case, we only have one continuous variable (income). Despite the plot, there are no negative incomes, this is only an effect of fitting a density to the actual datapoints.

And this is a scatterplot of earnings against other predictors:

xyplot(imp.data, earnings ~ race+educ_r+police+r_age, pch=20, cex=1)

Checking Convergence

One way to check if the imputation algorithm ran ok, is to verify how the mean and variance of each imputation evolved over iterations (let’s call them imputation streams). A sign of convergence is to see each stream mixing with the other, with no sign of a trend. If the streams look flat, something wrong happened and the user should try other methods, increase the number of interations (fortunately, the number of iterations needed are much lower than for MCMC methods) or make changes to the predictor matrix.

plot(imp.data, c("race", "earnings", "police", "educ_r"))

Creating a Complete Dataset

Function complete uses the inputation information to create a complete dataset:

head(sis_sm2)
##   sex race educ_r    r_age earnings police
## 1   1    1      4 29.00000     84.0      0
## 2   2    3      4 40.00000       NA      1
## 3   1    3      2 36.00000     27.5      0
## 4   1    1      4 71.00000     85.0      1
## 5   1    4      4 30.00000    135.0      1
## 6   2    2      4 41.32423       NA      0
sis_sm2c <- complete(imp.data, action=1)
head(sis_sm2c)
##   sex race educ_r    r_age earnings police
## 1   1    1      4 29.00000     84.0      0
## 2   2    3      4 40.00000     32.5      1
## 3   1    3      2 36.00000     27.5      0
## 4   1    1      4 71.00000     85.0      1
## 5   1    4      4 30.00000    135.0      1
## 6   2    2      4 41.32423     26.0      0

In function complete, the argument action='long' creates a larger dataset using all the imputation’ values (check help), which can be then split into differen inputed datasets:

n <- nrow(sis_sm2) # this dataset has 1495 rows

sis_sm2cl <- complete(imp.data, action='long')
nrow(sis_sm2cl)    # 1495 * 7 inputations
## [1] 10465
# split dataframes
dfs <- split(sis_sm2cl[,c(-1,-2)], f= sis_sm2cl$.imp)
head(dfs[[1]])   # first imputation
##   sex race educ_r    r_age earnings police
## 1   1    1      4 29.00000     84.0      0
## 2   2    3      4 40.00000     32.5      1
## 3   1    3      2 36.00000     27.5      0
## 4   1    1      4 71.00000     85.0      1
## 5   1    4      4 30.00000    135.0      1
## 6   2    2      4 41.32423     26.0      0
head(dfs[[ni]])  # last  imputation
##      sex race educ_r    r_age earnings police
## 8971   1    1      4 29.00000     84.0      0
## 8972   2    3      4 40.00000     70.0      1
## 8973   1    3      2 36.00000     27.5      0
## 8974   1    1      4 71.00000     85.0      1
## 8975   1    4      4 30.00000    135.0      1
## 8976   2    2      4 41.32423      0.0      0

Pooling Results

We can apply all the imputations in an inference and then pool the results. Let’s assume we would like to check if earnings is influenced by any other predictor:

# performing ni linear regressions
fits <- with(imp.data, lm(earnings ~ sex+race+police+educ_r))
summary(fits$analyses[[ni]]) # just from one regression
## 
## Call:
## lm(formula = earnings ~ sex + race + police + educ_r)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -102.50  -34.66  -11.27   12.83 3150.41 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.755     11.662   1.608   0.1080    
## sex2          -2.911      6.135  -0.474   0.6352    
## race2          3.006      8.036   0.374   0.7084    
## race3        -14.345      8.067  -1.778   0.0755 .  
## race4        -16.101     11.078  -1.453   0.1463    
## police2       13.316      6.341   2.100   0.0359 *  
## educ_r2        8.291      9.518   0.871   0.3838    
## educ_r3       23.016      9.641   2.387   0.0171 *  
## educ_r4       67.425      9.743   6.920 6.67e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 112.6 on 1486 degrees of freedom
## Multiple R-squared:  0.07068,    Adjusted R-squared:  0.06568 
## F-statistic: 14.13 on 8 and 1486 DF,  p-value: < 2.2e-16
summary(pool(fits))          # pooled results
##                    est        se          t         df     Pr(>|t|)       lo 95      hi 95 nmis        fmi     lambda
## (Intercept)  20.889623 13.115835  1.5927025  134.44978 1.135753e-01  -5.0504219 46.8296674   NA 0.21060091 0.19894494
## sex2         -1.214141  6.261670 -0.1939005 1214.90625 8.462862e-01 -13.4990274 11.0707457   NA 0.02951440 0.02791808
## race2        -2.998659 10.141810 -0.2956729   43.55632 7.688849e-01 -23.4440166 17.4466990   NA 0.38989176 0.36250583
## race3       -19.261405  9.554186 -2.0160174   72.37198 4.751091e-02 -38.3056540 -0.2171554   NA 0.29719480 0.27803748
## race4       -19.967243 13.266031 -1.5051407   66.91614 1.369938e-01 -46.4469630  6.5124762   NA 0.31009511 0.28977873
## police2      13.189392  6.472954  2.0376157 1240.65052 4.180019e-02   0.4902471 25.8885374   NA 0.02772790 0.02616180
## educ_r2       9.826511  9.942971  0.9882871  656.63073 3.233761e-01  -9.6973422 29.3503636   NA 0.07205852 0.06923645
## educ_r3      24.219873  9.817695  2.4669613 1351.48474 1.374976e-02   4.9602971 43.4794498   NA 0.01940862 0.01795856
## educ_r4      69.312043 10.037361  6.9054053 1042.69290 8.687717e-12  49.6163150 89.0077701   NA 0.04115345 0.03931603

The p-values (the column Pr(>|t|)) indicate that the two higher education levels are strongly significant to predict income (the higher education, the higher the income). Having police in the neighborhood (police=1) is weakly significant. Also, hispanic subjects (race=3) also seem to have weakly negative significance.

Using package simputation

simputation is a new (September 2016) package that standardizes the interface to access different imputation methods.

library(magrittr)
library(simputation)

In the author’s vignette, he uses the iris dataset with some NAs:

dat <- iris
dat[1:3,1] <- dat[3:7,2] <- dat[8:10,5] <- NA  # make some NAs
head(dat,10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1            NA         3.5          1.4         0.2  setosa
## 2            NA         3.0          1.4         0.2  setosa
## 3            NA          NA          1.3         0.2  setosa
## 4           4.6          NA          1.5         0.2  setosa
## 5           5.0          NA          1.4         0.2  setosa
## 6           5.4          NA          1.7         0.4  setosa
## 7           4.6          NA          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2    <NA>
## 9           4.4         2.9          1.4         0.2    <NA>
## 10          4.9         3.1          1.5         0.1    <NA>

The next code imputes Sepal.Width and Sepal.Length by regression using the values of Petal.Width and Species:

dat %>% 
  impute_lm(Sepal.Width + Sepal.Length ~ Petal.Width + Species) -> imputed
head(imputed,10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1      4.979844    3.500000          1.4         0.2  setosa
## 2      4.979844    3.000000          1.4         0.2  setosa
## 3      4.979844    3.409547          1.3         0.2  setosa
## 4      4.600000    3.409547          1.5         0.2  setosa
## 5      5.000000    3.409547          1.4         0.2  setosa
## 6      5.400000    3.561835          1.7         0.4  setosa
## 7      4.600000    3.485691          1.4         0.3  setosa
## 8      5.000000    3.400000          1.5         0.2    <NA>
## 9      4.400000    2.900000          1.4         0.2    <NA>
## 10     4.900000    3.100000          1.5         0.1    <NA>

Next, let’s impute the missing Species values wusing a decision tree model:

imputed %>% 
  impute_cart(Species ~ .) -> imputed
head(imputed,10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1      4.979844    3.500000          1.4         0.2  setosa
## 2      4.979844    3.000000          1.4         0.2  setosa
## 3      4.979844    3.409547          1.3         0.2  setosa
## 4      4.600000    3.409547          1.5         0.2  setosa
## 5      5.000000    3.409547          1.4         0.2  setosa
## 6      5.400000    3.561835          1.7         0.4  setosa
## 7      4.600000    3.485691          1.4         0.3  setosa
## 8      5.000000    3.400000          1.5         0.2  setosa
## 9      4.400000    2.900000          1.4         0.2  setosa
## 10     4.900000    3.100000          1.5         0.1  setosa

The package can impute for each class separately:

# complete 'Species' using the sequential hot desk (shd) method
dat %>% impute_shd(Species ~ Petal.Length) -> dat2
# then impute Sepal.Length by regressing on Sepal.Width, for each Species
dat2 %>% impute_lm(Sepal.Length ~ Sepal.Width | Species) %>% head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     5.067813         3.5          1.4         0.2  setosa
## 2     4.725677         3.0          1.4         0.2  setosa
## 3           NA          NA          1.3         0.2  setosa
## 4     4.600000          NA          1.5         0.2  setosa
## 5     5.000000          NA          1.4         0.2  setosa
## 6     5.400000          NA          1.7         0.4  setosa

Notice that the value at the 3rd row could not be found, since Sepal.Width was also missing.

This by group imputation can also be executed using dplyr::group_by:

dat2 %>% dplyr::group_by(Species) %>% 
    impute_lm(Sepal.Length ~ Sepal.Width) %>% 
    head()
## Source: local data frame [6 x 5]
## Groups: Species [1]
## 
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
## 1     5.067813         3.5          1.4         0.2  setosa
## 2     4.725677         3.0          1.4         0.2  setosa
## 3           NA          NA          1.3         0.2  setosa
## 4     4.600000          NA          1.5         0.2  setosa
## 5     5.000000          NA          1.4         0.2  setosa
## 6     5.400000          NA          1.7         0.4  setosa