Let’s load the Social Indicators Survey dataset for this tutorial (cf. dataset’s codebook):

```
sis <- read.csv("siswave3v4impute3.csv", sep=";")
# from Gelman&Hill's book:
na.fix <- function (a) {
  ifelse(a<0 | a==999999, NA, a)
}
sis$earnings <- (na.fix(sis$rearn) + na.fix(sis$tearn))/1000
# select a column subset of the entire dataset
sis_sm <- as.data.frame(with(sis, cbind(sex, race, educ_r, r_age, earnings, police)))
sis_sm[91:95,]
## sex race educ_r r_age earnings police
## 91 1 3 3 31 NA 0
## 92 2 1 2 37 135 1
## 93 0 1 1 3 NA 2
## 94 1 1 3 42 3 1
## 95 1 3 1 24 0 NA
# clean dataset
sis_sm <- sis_sm[c(-93,-757),] # remove rows 93 & 757 (typos?)
# give proper types to predictors
sis_sm$sex <- as.factor(sis_sm$sex)
sis_sm$race <- as.factor(sis_sm$race)
sis_sm$educ_r <- as.ordered(sis_sm$educ_r) # ordered factor
sis_sm$police <- as.factor(sis_sm$police)
sapply(sis_sm, class)
## $sex
## [1] "factor"
##
## $race
## [1] "factor"
##
## $educ_r
## [1] "ordered" "factor"
##
## $r_age
## [1] "numeric"
##
## $earnings
## [1] "numeric"
##
## $police
## [1] "factor"
summary(sis_sm)
## sex race educ_r r_age earnings police
## 1 :543 1 :478 1 :229 Min. :18.00 Min. : 0.00 0 :490
## 2 :956 2 :401 2 :398 1st Qu.:30.00 1st Qu.: 5.00 1 :941
## NA's: 2 3 :453 3 :376 Median :39.00 Median : 28.00 NA's: 70
## 4 :130 4 :483 Mean :41.33 Mean : 52.56
## NA's: 39 NA's: 15 3rd Qu.:50.00 3rd Qu.: 65.00
## Max. :97.00 Max. :3250.00
## NA's :2 NA's :252
```

Notice the existence of several `NA` values, i.e., missing data.
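A quick overview of the missingness can be obtained with base R. Here is a sketch on a small toy data frame (the same calls apply directly to `sis_sm`):

```r
# toy data frame; on the real data, replace df with sis_sm
df <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))
colSums(is.na(df))        # number of NAs per column: x=1, y=2
mean(!complete.cases(df)) # fraction of rows with at least one NA: 2/3
```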

There are several types of missing data (following Gelman & Hill's classification):

- MCAR - Missing Completely at Random. In this case, the probability of having an `NA` is the same for all rows and columns. This is the best-case scenario, because removing those samples from the inference introduces no bias. However, this is also the least probable case for our dataset.

For instance, our dataset does not seem to be MCAR:

```
library(dplyr)
library(magrittr)
sis_dp <- tbl_df(sis_sm) # in recent dplyr versions, use as_tibble() instead
check.income.NA <- function(for.race) {
  df <- sis_dp %>% dplyr::filter(race==for.race)
  n <- df %>% nrow()
  n.NA <- df %>% dplyr::filter(is.na(earnings)) %>% nrow()
  list(total=n, total.NA=n.NA, perc=n.NA/n)
}
check.income.NA(1) # whites
## $total
## [1] 478
##
## $total.NA
## [1] 92
##
## $perc
## [1] 0.1924686
check.income.NA(2) # blacks
## $total
## [1] 401
##
## $total.NA
## [1] 47
##
## $perc
## [1] 0.117207
```

The missing value percentage for whites and blacks is quite different. This is evidence that this dataset is not MCAR.
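This comparison can be made more formal with a chi-squared test of independence between a missingness indicator and an observed predictor. Here is a sketch on simulated data where missingness does depend on group membership (on the real data, `group` would be `sis_sm$race` and `miss` would be `is.na(sis_sm$earnings)`):

```r
set.seed(1)
group <- sample(c("A", "B"), 1000, replace = TRUE)
# group A is missing twice as often as group B, i.e., not MCAR
miss  <- rbinom(1000, 1, ifelse(group == "A", 0.20, 0.10)) == 1
chisq.test(table(group, miss))  # a small p-value is evidence against MCAR
```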

- MAR - Missing at Random. Here the hypothesis is that the probability of a certain value being `NA` depends only on the available information. So, if `income` is missing, its missingness probability is modelled by the other observed data, like sex, race, etc.

- Missingness that depends on unobserved predictors. In this case, there is a lack of information in the dataset. E.g., say education is important to explain income, but education information was not collected. Since the dataset is then unlikely to be representative with respect to education (there was no control for it), there will be extra bias in our inferences.

- Censoring. Here the value of the variable itself influences the probability of it being missing. Perhaps people with large incomes prefer not to state them in the survey. This can be mitigated by adding more data: in this example, people with more education or higher-paying jobs tend to have higher incomes, so we could model income with this extra information.

A good methodology is to include as many predictors as possible so that the MAR hypothesis is reasonable for our model.

We'll use the `mice` package to deal with missing data.

`library(mice)`

Function `md.pattern` presents a report about how missing data is spread over the dataset:

`md.pattern(sis_sm)`

```
## sex r_age educ_r race police earnings
## 1169 1 1 1 1 1 1 0
## 21 1 1 1 0 1 1 1
## 6 1 1 0 1 1 1 1
## 219 1 1 1 1 1 0 1
## 50 1 1 1 1 0 1 1
## 12 1 1 1 0 1 0 2
## 3 1 1 0 1 1 0 2
## 1 1 1 1 0 0 1 2
## 2 1 1 0 1 0 1 2
## 12 1 1 1 1 0 0 2
## 1 1 1 0 0 1 0 3
## 2 1 1 1 0 0 0 3
## 1 1 1 0 1 0 0 3
## 2 0 0 0 0 0 0 6
## 2 2 15 39 70 252 380
```

We see that 1169 rows are complete (have zero `NA`s), 21 miss only the `race` info, and so on…

In this dataset, we could simply remove the samples with more missing values. E.g., the last six samples, with three or more `NA`s, could be removed without a significant loss of information. There are 30 rows with two missing values; it would be the analyst's decision to keep or remove them.

We can also visualize this information:

```
library(VIM)
aggr(sis_sm, col=c('blue','red'), numbers=TRUE, sortVars=TRUE,
labels=names(sis_sm), cex.axis=.8, gap=3,
ylab=c("Histogram of missing data", "Red is Missing Data"))
```

```
##
## Variables sorted by number of missings:
## Variable Count
## earnings 0.167888075
## police 0.046635576
## race 0.025982678
## educ_r 0.009993338
## sex 0.001332445
## r_age 0.001332445
```

Another function is `md.pairs`, which computes the number of observations for every pair of variables and every pair of (obse**r**ved/**m**issing, obse**r**ved/**m**issing) states:

`md.pairs(sis_sm)`

```
## $rr
## sex race educ_r r_age earnings police
## sex 1499 1462 1486 1499 1249 1431
## race 1462 1462 1450 1462 1227 1397
## educ_r 1486 1450 1486 1486 1241 1421
## r_age 1499 1462 1486 1499 1249 1431
## earnings 1249 1227 1241 1249 1249 1196
## police 1431 1397 1421 1431 1196 1431
##
## $rm
## sex race educ_r r_age earnings police
## sex 0 37 13 0 250 68
## race 0 0 12 0 235 65
## educ_r 0 36 0 0 245 65
## r_age 0 37 13 0 250 68
## earnings 0 22 8 0 0 53
## police 0 34 10 0 235 0
##
## $mr
## sex race educ_r r_age earnings police
## sex 0 0 0 0 0 0
## race 37 0 36 37 22 34
## educ_r 13 12 0 13 8 10
## r_age 0 0 0 0 0 0
## earnings 250 235 245 250 0 235
## police 68 65 65 68 53 0
##
## $mm
## sex race educ_r r_age earnings police
## sex 2 2 2 2 2 2
## race 2 39 3 2 17 5
## educ_r 2 3 15 2 7 5
## r_age 2 2 2 2 2 2
## earnings 2 17 7 2 252 17
## police 2 5 5 2 17 70
```

E.g., there are 250 samples where the sex value is present but the earnings are not (cf. matrix `$rm`).
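Each entry of these matrices can be verified directly with logical indexing. A toy example of the `$rm` logic (observed in the first variable, missing in the second):

```r
# toy frame; on the real data: sum(!is.na(sis_sm$sex) & is.na(sis_sm$earnings))
df <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, 3, 4))
sum(!is.na(df$a) & is.na(df$b))  # rows where a is observed but b is missing: 1
```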

The simplest method is to remove all samples with missing data (complete-case analysis), which `mice` provides via `cc`:

```
sis_sm1a <- cc(sis_sm) # cc() from mice keeps only the complete cases
head(sis_sm1a)
```

```
## sex race educ_r r_age earnings police
## 1 1 1 4 29 84.0 0
## 3 1 3 2 36 27.5 0
## 4 1 1 4 71 85.0 1
## 5 1 4 4 30 135.0 1
## 7 2 1 4 32 92.0 1
## 8 2 2 2 67 0.0 1
```

This, however, can remove too much if many samples have some column with `NA` values.

Also, if the removed samples differ in any systematic way from the complete samples, then restricting the analysis to the complete cases will bias it.

A partial solution is to remove only the minority of samples that have too much missing information. In our example, say we decide to remove all cases with three or more `NA` values:

```
n_NAs <- apply(sis_sm, 1, function(x) sum(is.na(x))) # number of NAs per row
sis_sm2 <- sis_sm[n_NAs<3,] # keep those with 0,1,2 NAs
md.pattern(sis_sm2)
```

```
## sex r_age educ_r race police earnings
## 1169 1 1 1 1 1 1 0
## 21 1 1 1 0 1 1 1
## 6 1 1 0 1 1 1 1
## 219 1 1 1 1 1 0 1
## 50 1 1 1 1 0 1 1
## 12 1 1 1 0 1 0 2
## 3 1 1 0 1 1 0 2
## 1 1 1 1 0 0 1 2
## 2 1 1 0 1 0 1 2
## 12 1 1 1 1 0 0 2
## 0 0 11 34 65 246 356
```

If we only partially remove missing data, we still have to deal with the remaining `NA` values.

Another method is to use different subsets of the original dataset to answer different inference questions. In this case, we would remove fewer samples, since each inference uses fewer columns. The drawback is that these various inferences might not be consistent with each other, since each used different data.
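For instance, a pairwise-complete analysis keeps every row where just the two variables of interest are observed. A sketch on a toy data frame, using base R's `use = "pairwise.complete.obs"` option of `cor`:

```r
df <- data.frame(x = c(1, 2, NA, 4, 5),
                 y = c(2, NA, 3, 8, 10),
                 z = c(NA, 1, 1, 0, NA))
sum(complete.cases(df))                         # only 1 fully complete row
cor(df$x, df$y, use = "pairwise.complete.obs")  # but 3 rows usable for this pair
```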

Data imputation is the replacement of `NA` values with estimated values produced by an imputation algorithm.

Categorical variables are tricky. E.g., we cannot replace a missing value with the mean, since that would likely produce a value outside the variable's domain. A better alternative is the mode, but even that is not advisable.

A better method is to fit a linear regression or logistic regression using the available data, and then, for each missing row, use the respective prediction as the imputed value. A variant is to add some reasonable random noise (also estimated from the available data) into the imputation.
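A minimal sketch of this idea for a numeric variable, on simulated data (the variable names are illustrative, not from `mice`):

```r
set.seed(42)
x <- runif(200, 0, 10)
y <- 3 + 2 * x + rnorm(200)
y[sample(200, 40)] <- NA                 # knock out 40 values
fit  <- lm(y ~ x)                        # lm() drops the NA rows by default
miss <- is.na(y)
pred <- predict(fit, newdata = data.frame(x = x[miss]))
# deterministic version would be: y[miss] <- pred
# adding noise from the estimated residual distribution preserves variability:
y[miss] <- pred + rnorm(sum(miss), 0, summary(fit)$sigma)
anyNA(y)  # FALSE: all values are now imputed
```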

Function `mice` executes a given imputation method over the dataset. It has several methods available:

`methods(mice)`

```
## [1] mice.impute.2l.norm mice.impute.2l.pan mice.impute.2lonly.mean mice.impute.2lonly.norm mice.impute.2lonly.pmm mice.impute.cart
## [7] mice.impute.fastpmm mice.impute.lda mice.impute.logreg mice.impute.logreg.boot mice.impute.mean mice.impute.norm
## [13] mice.impute.norm.boot mice.impute.norm.nob mice.impute.norm.predict mice.impute.passive mice.impute.pmm mice.impute.polr
## [19] mice.impute.polyreg mice.impute.quadratic mice.impute.rf mice.impute.ri mice.impute.sample mice.mids
## [25] mice.theme
## see '?methods' for accessing help and source code
```

Here’s information about some of these methods:

- `pmm` - predictive mean matching (for numeric data)
- `mean` - unconditional mean imputation (for numeric data)
- `norm` - Bayesian linear regression (for numeric data)
- `norm.nob` - (classical) linear regression (for numeric data)
- `logreg` - logistic regression (for factors with 2 levels)
- `polyreg` - multinomial logit (for factors with 2+ levels)
- `polr` - ordered logit (for ordered factors with 2+ levels)

We can use a specific method for each column (if a predictor is complete, just place `""` in the `meth` argument list):

```
sapply(sis_sm2, class) # check again the types of each predictor
## $sex
## [1] "factor"
##
## $race
## [1] "factor"
##
## $educ_r
## [1] "ordered" "factor"
##
## $r_age
## [1] "numeric"
##
## $earnings
## [1] "numeric"
##
## $police
## [1] "factor"
ni <- 7 # the number of imputations
imp.data <- mice(sis_sm2, m=ni, maxit=50,
                 meth=c('logreg', 'polyreg', 'polr', 'pmm', 'pmm', 'logreg'),
                 seed=99, diag=FALSE, print=FALSE)
summary(imp.data)
## Multiply imputed data set
## Call:
## mice(data = sis_sm2, m = ni, method = c("logreg", "polyreg",
## "polr", "pmm", "pmm", "logreg"), maxit = 50, diagnostics = FALSE,
## printFlag = FALSE, seed = 99)
## Number of multiple imputations: 7
## Missing cells per column:
## sex race educ_r r_age earnings police
## 0 34 11 0 246 65
## Imputation methods:
## sex race educ_r r_age earnings police
## "logreg" "polyreg" "polr" "pmm" "pmm" "logreg"
## VisitSequence:
## race educ_r earnings police
## 2 3 5 6
## PredictorMatrix:
## sex race educ_r r_age earnings police
## sex 0 0 0 0 0 0
## race 1 0 1 1 1 1
## educ_r 1 1 0 1 1 1
## r_age 0 0 0 0 0 0
## earnings 1 1 1 1 0 1
## police 1 1 1 1 1 0
## Random generator seed value: 99
```

`mice` also tells us which set of predictors was used to impute each variable. This information is kept in the predictor matrix:

> [The predictor matrix] rows correspond to target variables (i.e. variables to be imputed), in the sequence as they appear in data. A value of '1' means that the column variable is used as a predictor for the target variable (in the rows). The diagonal of predictorMatrix must be zero. [ref: `mice` help]

This predictor matrix can be supplied by the user in the `pred` argument, which allows control over which predictors are used for each variable.

We can check the values assigned to each sample column at each data imputation for diagnostic checking, i.e., to see if the imputations are plausible:

`head(imp.data$imp$police, 10) # for this predictor, should be 0s and 1s only`

```
## 1 2 3 4 5 6 7
## 31 0 1 1 1 1 1 0
## 38 1 1 0 1 0 0 0
## 52 1 1 1 1 1 1 1
## 54 1 1 1 1 0 0 1
## 58 1 1 1 1 1 1 1
## 95 0 0 1 0 0 1 1
## 115 1 1 1 1 1 1 1
## 136 0 0 0 1 0 0 0
## 159 0 1 1 0 1 1 1
## 167 1 1 1 1 1 1 1
```
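Once the imputations look plausible, `complete()` extracts a filled-in dataset; its second argument selects which of the `m` imputations to return. A self-contained sketch on the `nhanes` data shipped with `mice` (for our case, use `imp.data` instead of `imp`):

```r
library(mice)
imp <- mice(nhanes, m = 3, seed = 1, print = FALSE)
filled <- complete(imp, 1)  # the first of the 3 imputed datasets
anyNA(filled)               # FALSE: all NAs have been imputed
```

For actual inference, `mice`'s `with()` fits a model on each of the `m` datasets and `pool()` combines the results, which properly reflects the imputation uncertainty.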

Package `mice` provides several plotting tools. Function `stripplot` shows the distributions of each predictor:

```
library(lattice)
stripplot(imp.data, pch = 20, cex = 1)
```

We can visualize whether the density plots of the original data and the imputed data are close:

`densityplot(imp.data, xlim=c(-1e2,1e3)) # blue is the original data, red are the imputations`

In this case, we only have one continuous variable (earnings). Although the plot shows some density over negative values, there are no negative earnings in the data; this is only an artifact of fitting a smooth density to the actual data points.

And this is a scatterplot of earnings against other predictors:

`xyplot(imp.data, earnings ~ race+educ_r+police+r_age, pch=20, cex=1)`

One way to check whether the imputation algorithm ran properly is to see how the mean and variance of each imputation evolved over the iterations (let's call them imputation streams). A sign of convergence is each stream mixing freely with the others, with no sign of a trend. If the streams show trends or fail to mix, something went wrong, and the user should try other methods, increase the number of iterations (fortunately, far fewer iterations are needed than for MCMC methods), or change the predictor matrix.

`plot(imp.data, c("race", "earnings", "police", "educ_r"))`