Introduction
Coding up an algorithm is a great way to make sure you really understand how the details work. In this post I’m going to implement multiple imputation and the MICE algorithm (Van Buuren, 2007), albeit in a much simplified form: only considering missing data in numeric variables, only using Normal-distribution Bayesian linear regression to generate the imputed values, and without worrying about making the code robust enough for production use.
If the standard version of this method is called MICE, think of this as a smaller, cuter, maybe slightly endangered variation. Perhaps a fat-tailed dunnart.
Bayesian linear regression
We will be imputing each individual variable with missing data using Bayesian linear regression with an uninformative Jeffreys prior. For this choice of prior, the posterior distribution of the parameters has an easily computable analytic form. This approach is explained in detail in Gelman et al. (2014), chapter 14.
The regression equation is given by \(E(Y|X) = X \beta\), where \(Y\) is a column vector of observations, \(X\) is the design matrix, and \(\beta\) is a column vector of regression coefficients. We assume independent normally distributed errors with equal standard deviation: \(Y|X \sim \mathrm{MVN}(X \beta, \sigma^2 I)\). In other words, the usual classical linear regression.
Obtaining parameter estimates from Jeffreys priors
The simplest choice of prior (in terms of easily obtaining a posterior distribution) is the Jeffreys prior, which is an improper uninformative prior: uniform over \((\beta, \log \sigma)\). In this case, the posterior distribution of \(\beta\) is multivariate normal: \(\beta \sim \mathrm{MVN}(\hat\beta, V)\). Here \(\hat\beta = (X^T X)^{-1} X^T Y\) is the frequentist maximum likelihood estimate (ordinary least squares regression) and \(V = \sigma^2 (X^T X)^{-1}\) is the usual linear regression variance-covariance matrix.
The posterior distribution of \(\sigma^2\) is \(\mathrm{Inverse-}\chi^2(n-k, s^2)\), where \(n\) is the number of observations, \(k\) is the number of regression coefficients, and \(s^2\) is the standard frequentist estimate of the residual variance (Gelman et al., 2014, p. 355). This scaled inverse \(\chi^2\) distribution was a new one to me.¹ We say a random variable \(U \sim \mathrm{Inverse-}\chi^2(\nu, \mu)\) if \(W \sim \chi^2(\nu)\) and \(U = \nu\mu/W\) (Gelman et al., 2014, p. 581).
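To make that definition concrete, here is a small simulation sketch (my own illustration, not code from the rest of this post; the names nu, s2 and draws and their values are arbitrary):

# Illustration only: draw from Inverse-chi^2(nu, s2) by rescaling chi-squared
# draws. nu and s2 are arbitrary values chosen for the example.
nu <- 10
s2 <- 4
draws <- nu * s2 / rchisq(1e5, df = nu)
# The theoretical mean is nu * s2 / (nu - 2) = 5; the sample mean should be close.
mean(draws)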
We could just use the built-in R lm function, but it’s been a long time since I last implemented linear regression from scratch, so I thought I’d give it a go here. If you’re not interested in seeing this, skip ahead to the section about drawing from the posterior distribution.
# estimate_bayes_lm_jeffreys(Y, X): obtain Bayesian linear regression parameter
# estimates from design matrix X and observations Y, using a Jeffreys prior.
# Should produce the same output as lm().
estimate_bayes_lm_jeffreys <- function(Y, X) {
  stopifnot(is.matrix(X) & is.numeric(X) & !any(is.na(X)))
  stopifnot(is.vector(Y) & is.numeric(Y) & !any(is.na(Y)))
  stopifnot(nrow(X) == length(Y))

  # crossprod(X) computes X^T X
  # solve(X, Y) computes X^-1 Y
  # Some fiddling is needed since we want our vector outputs to be R vectors,
  # not R matrices
  xtx <- crossprod(X)
  beta <- as.vector(solve(xtx, t(X) %*% Y))
  df <- nrow(X) - ncol(X)
  s2 <- as.vector(crossprod(Y - X %*% beta) / df)
  V <- s2 * solve(xtx)

  # Attach variable names to regression coefficients
  names(beta) <- colnames(X)

  res <- list(X = X, Y = Y, beta = beta, V = V, df = df, s2 = s2)
  class(res) <- "bayeslm"
  res
}
This function expects a vector of observations and a design matrix containing the predictors, but that’s not very convenient in practice. To make this easier to use, let’s write a function that implements a formula interface closer to the standard R lm function. This takes advantage of two base R functions: model.frame, which takes a model formula and prepares a data frame with the variables mentioned in it, with the outcome variable first, omitting missing values, and providing the option to only include a subset of rows; and model.matrix, which creates a design matrix from a formula and a data frame, adding an intercept and creating dummy variables for categorical variables if needed.
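As a quick illustration of what those two helpers do (a made-up toy data frame, not part of the imputation code):

# Toy example of model.frame() and model.matrix(); the data frame is made up.
toy <- data.frame(
  y   = c(1.2, 3.4, NA, 5.6),
  x   = c(10, 20, 30, 40),
  grp = factor(c("a", "b", "a", "b"))
)
mf <- model.frame(y ~ x + grp, data = toy)  # drops the row with missing y
mf[[1]]                                     # the outcome variable comes first
model.matrix(y ~ x + grp, mf)               # intercept added, grp dummy-coded

With those two building blocks, the formula-interface wrapper is short: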
bayes_lm <- function(
  formula, data = NULL, subset = NULL,
  na.action = getOption("na.action"),
  estimator = estimate_bayes_lm_jeffreys
) {
  if (is.character(na.action)) {
    na.action <- get(na.action)
  }
  mf_args <- list(formula = formula, data = data,
                  subset = subset, na.action = na.action)
  model_frame <- do.call(stats::model.frame, mf_args)
  model_matrix <- model.matrix(formula, model_frame)
  estimator(model_frame[[1]], model_matrix)
}
To make the objects that we’ve created behave a bit more like standard R lm objects, we can implement some S3 methods for our new type of object:
coef.bayeslm <- function(m) m$beta
vcov.bayeslm <- function(m) m$V
sigma.bayeslm <- function(m) sqrt(m$s2)
df.residual.bayeslm <- function(m) m$df
resid.bayeslm <- function(m) as.vector(m$Y - m$X %*% m$beta)
print.bayeslm <- function(m, digits = 3) {
  cat("Coefficients:\n")
  print(format(m$beta, digits = digits), quote = FALSE)
}
summary.bayeslm <- function(m)
  data.frame(
    term = names(m$beta),
    estimate = m$beta,
    std.error = sqrt(diag(m$V)),
    row.names = seq_along(m$beta)
  )
Let’s make sure this works as expected, using the penguins data from the palmerpenguins package:
library(palmerpenguins)
data(penguins)
Fit two models, one using lm and one using bayes_lm, to predict bill length from flipper length:
penguins_lm <- lm(
  bill_length_mm ~ flipper_length_mm,
  data = penguins
)
penguins_blm <- bayes_lm(
  bill_length_mm ~ flipper_length_mm,
  data = penguins
)
Check that the regression coefficients and standard errors are the same for both models:
summary(penguins_lm)
Call:
lm(formula = bill_length_mm ~ flipper_length_mm, data = penguins)
Residuals:
Min 1Q Median 3Q Max
-8.5792 -2.6715 -0.5721 2.0148 19.1518
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.26487 3.20016 -2.27 0.0238 *
flipper_length_mm 0.25477 0.01589 16.03 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.126 on 340 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.4306, Adjusted R-squared: 0.4289
F-statistic: 257.1 on 1 and 340 DF, p-value: < 2.2e-16
summary(penguins_blm)
term estimate std.error
1 (Intercept) -7.2648678 3.20015684
2 flipper_length_mm 0.2547682 0.01588914
Drawing from the posterior predictive distribution
To sample from the posterior predictive distribution, we use a two-stage process. First, draw from the posterior distribution of the parameters (regression coefficients and residual variance). As previously discussed, the regression coefficients have a multivariate normal distribution and the residual variance has an \(\mathrm{Inverse-}\chi^2\) distribution. Second, draw posterior predictions conditional on those parameter draws, using \(Y|X,\beta,\sigma^2 \sim \mathrm{N}(X \beta, \sigma^2)\).
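Putting the distributions from the previous sections together, a single posterior predictive draw for a design matrix \(X\) uses

\[ \sigma^2 \sim \mathrm{Inverse-}\chi^2(n-k, s^2), \qquad \beta \sim \mathrm{MVN}(\hat\beta, V), \qquad \tilde{Y} \mid X, \beta, \sigma^2 \sim \mathrm{MVN}(X \beta, \sigma^2 I). \]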
The function below does the first stage, drawing from the posterior distribution of the model parameters. It doesn’t require the model to be fit using the code from above; this should work for any lm model.
draw_bayes_lm_params <- function(m, ndraw = 1) {
  stopifnot(inherits(m, c("lm", "bayeslm")))
  stopifnot(is.numeric(ndraw) & length(ndraw) == 1 & ndraw >= 1)

  # Draw from the posterior distribution of parameters
  draw_beta <- mvtnorm::rmvnorm(ndraw, coef(m), vcov(m))
  df <- df.residual(m)
  draw_sigma <- sigma(m) * sqrt(df / rchisq(ndraw, df))
  list(beta = draw_beta, sigma = draw_sigma)
}
To get some idea of whether this is working, let’s plot the posterior distribution of our parameters. They look like what we might expect: normally distributed about the parameter estimates shown above. In fact, the distribution of sigma looks far closer to normal than I expected, given that it’s actually a scaled inverse \(\chi^2\) distribution!
set.seed(12345)
penguins_blm %>%
  draw_bayes_lm_params(1000) %>%
  map(as_tibble) %>%
  bind_cols() %>%
  rename(sigma = value) %>%
  pivot_longer(everything(), names_to = "var", values_to = "value") %>%
  ggplot(aes(x = value)) +
  stat_slabinterval(normalize = "panels") +
  scale_y_continuous(breaks = NULL) +
  facet_wrap(vars(var), scales = "free") +
  labs(x = "value", y = "density") +
  panel_border()
The posterior regression coefficients for flipper length and intercept are negatively correlated, as we might expect:
set.seed(54321)
penguins_blm %>%
  draw_bayes_lm_params(100) %>%
  pluck("beta") %>%
  as_tibble() %>%
  ggplot(aes(x = flipper_length_mm, y = `(Intercept)`)) +
  geom_point(pch = 1)
Now that we’re more confident this works, we can write another function to draw from the posterior predictive distribution of a bunch of new observations, given their X values. This works by drawing a set of \(\beta\) parameters, calculating Y values from those, and adding some normally-distributed random noise based on \(\sigma^2\). This results in the model-predicted distribution of new observations.
draw_bayes_lm_ppred <- function(m, X = m$X, ndraw = 1) {
  stopifnot(inherits(m, c("lm", "bayeslm")))
  stopifnot(is.matrix(X) & is.numeric(X) & !any(is.na(X)))
  stopifnot(ncol(X) == length(coef(m)))

  params <- draw_bayes_lm_params(m, ndraw = ndraw)
  X %*% t(params$beta) + matrix(
    rnorm(nrow(X) * ndraw, 0, rep(params$sigma, each = nrow(X))),
    nrow = nrow(X)
  )
}
To demonstrate this in action, the plot below shows, for a subset of the penguins data, the regression line (grey), the observed data (solid blue circles), and 10 draws from the posterior predictive distribution (hollow red circles):
set.seed(42)
penguins_subset <- penguins %>%
  select(bill_length_mm, flipper_length_mm) %>%
  drop_na() %>%
  sample_n(30)

penguins_subset_ppred <- draw_bayes_lm_ppred(
  penguins_blm,
  model.matrix(bill_length_mm ~ flipper_length_mm, data = penguins_subset),
  ndraw = 10
)

penguins_subset_ppred_dat <- penguins_subset_ppred %>%
  as_tibble(.name_repair = "unique") %>%
  bind_cols(flipper_length_mm = penguins_subset$flipper_length_mm) %>%
  pivot_longer(-flipper_length_mm,
               names_to = "rep",
               values_to = "bill_length_mm")

ggplot(penguins_subset, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_abline(intercept = coef(penguins_blm)[1],
              slope = coef(penguins_blm)[2],
              colour = "black",
              linewidth = 1.5,
              alpha = 0.5) +
  geom_point(colour = "dodgerblue4", size = 4, alpha = 0.5, stroke = NA) +
  geom_point(colour = "firebrick4", size = 1, pch = 1,
             data = penguins_subset_ppred_dat, alpha = 0.5)
If we only had missing values in one variable, we would be at the point of producing imputed datasets now. But real-life missing data problems tend to have missing data in multiple variables, so we will need to use an iterative method to generate imputations for all variables.
Creating imputations by chained equations
Now that we have a function to draw from the posterior of a Bayesian linear regression, we can apply it iteratively to each incomplete variable to produce multiple imputed datasets. The MICE algorithm works something like this:
1. Create an initial dataset with complete values for all variables. (In this simple implementation, I will replace all missing values with the mean of the observed values.)
2. For each incomplete variable, take the most recent iteration and fit an appropriate linear regression to the subset of the observations for which that variable has been observed. Draw from the posterior predictive distribution to fill in new values for the observations where that variable was not observed.
3. Repeat step 2 a fixed number of times (iter in our example implementation).
4. Repeat the entire procedure (steps 1 to 3) m times to produce m imputed datasets.
This section of the blog will reimplement a simple version of this algorithm, which I have named dunnart (a smaller, cuter MICE mouse). It only supports imputation of numerical variables, and only using linear regression. Rather than the comprehensive inputs mice uses to control the imputation algorithm, the main input will be a list of regression formulas, one for each variable to impute. The overall loop structure is sketched below; the sections that follow build up each of the pieces.
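As a rough sketch only, the four steps above can be written with plain for-loops. The dunnart_sketch() wrapper here is hypothetical and purely illustrative; it leans on the helper functions defined in the rest of this section (dunnart_initial_imputation_for_df() and dunnart_impute_var()), whereas the real implementation below uses purrr instead of explicit loops and adds error checking and progress reporting:

# Sketch of the dunnart algorithm with plain loops; not the actual
# implementation used later in this post. The helpers it calls are
# defined further down.
dunnart_sketch <- function(data, formula_list, m = 5, iter = 10) {
  imputations <- vector("list", m)
  for (j in seq_len(m)) {            # step 4: produce m imputed datasets
    # step 1: initial mean imputation of every incomplete variable
    dat_cur <- dunnart_initial_imputation_for_df(data, formula_list)
    for (i in seq_len(iter)) {       # step 3: iterate a fixed number of times
      for (form in formula_list) {   # step 2: impute each variable in turn
        dat_cur <- dunnart_impute_var(data, dat_cur, form, verbose = FALSE)
      }
    }
    imputations[[j]] <- dat_cur
  }
  imputations
}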
The first iteration of imputed data
For the initial iteration of imputation, we will replace all missing values with the mean of the non-missing values. The function below does some error checking (ensuring that it is being called on a numeric variable with at least one missing value and at least one observed value) and then uses replace_na() from the tidyr package to do the actual work.
dunnart_initial_imputation_for_var <- function(x) {
  stopifnot(is.vector(x) & is.numeric(x))
  stopifnot(!all(is.na(x)))
  stopifnot(any(is.na(x)))

  # replace missing values with the mean of the non-missing values
  replace_na(x, mean(x, na.rm = TRUE))
}
The function below applies dunnart_initial_imputation_for_var to several variables in a data frame, selected based on them appearing as outcome variables in a list of regression formulas.
dunnart_initial_imputation_for_df <- function(dat, formula_list) {
  stopifnot(is.data.frame(dat))
  stopifnot(is.list(formula_list))

  # find the outcome variable in each formula in formula_list
  vars <- map_chr(formula_list, \(f) all.vars(f)[1])

  # do the initial imputation for each of those variables
  mutate(dat, across(all_of(vars), dunnart_initial_imputation_for_var))
}
Creating subsequent iterations
The function below, dunnart_impute_var, is really the heart of the multiple imputation algorithm. It takes the original data frame with incomplete data dat_orig, the current working iteration of the imputed data dat_cur, and a regression formula formula to specify which predictors are used for the imputation. There’s an additional flag verbose which controls how much diagnostic information is printed to the console (the default emulates the output of mice as it runs).
The steps are as follows:
1. Determine which variable we’re imputing, based on the outcome variable of the regression formula.
2. Determine which observations we need to impute (and the inverse: which observations we should fit the regression model to), based on the original incomplete data frame.
3. Fit a linear regression using the provided model formula to the observations of the current imputation-in-progress data frame chosen above.
4. Draw from the posterior predictive distribution for the observations that were not used to fit that model, and use the draws to replace the corresponding values of the variable we’re imputing.
dunnart_impute_var <- function(dat_orig, dat_cur, formula, verbose = TRUE) {
  stopifnot(is.data.frame(dat_orig) & is.data.frame(dat_cur) &
              inherits(formula, "formula"))

  # find the variable we're imputing
  out_var <- all.vars(formula)[1]
  if (verbose >= 2) {
    cat("  imputing using", deparse1(formula), "\n")
  } else if (verbose >= 1) {
    cat(" ", out_var)
  }

  # which rows have missing data in the original data frame?
  miss_rows <- is.na(dat_orig[[out_var]])

  # fit a regression model to the rows that didn't have missing values in
  # the original data
  mod <- lm(formula, data = dat_cur[!miss_rows, ])

  # draw predictions for the rows with missing values
  mf <- model.frame(formula, data = dat_cur[miss_rows, ])
  mat <- model.matrix(formula, data = mf)
  dat_cur[miss_rows, out_var] <- draw_bayes_lm_ppred(mod, mat)

  # return the updated data frame
  dat_cur
}
To get one complete iteration of the MICE algorithm for a single imputed dataset, we need to apply the above function to each variable we need to impute. This is done using the reduce() function from the purrr package, which repeatedly applies a function to its own output (the current imputed dataset, dat_cur) and a list of new values (the regression model formulas for the variables to impute, formula_list).
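If reduce() is unfamiliar, here is a tiny standalone illustration of the folding pattern (toy values, nothing to do with the imputation itself):

library(purrr)

# reduce() combines elements one at a time, carrying an accumulator along:
# starting from .init, each element is folded into the running result.
reduce(c("a", "b", "c"), \(acc, x) paste(acc, x), .init = "start")
#> [1] "start a b c"

In dunnart_impute_iterate() below, the accumulator is the working imputed data frame and the elements being folded in are the imputation formulas.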
dunnart_impute_iterate <- function(
  dat_orig, dat_cur, formula_list,
  verbose = TRUE, j = NA, m = NA
) {
  stopifnot(is.data.frame(dat_orig) & is.data.frame(dat_cur) &
              is.list(formula_list) & all(!is.na(dat_cur)))

  if (verbose >= 2) {
    cat("  imputation", j, "of", m, "\n")
  } else if (verbose >= 1) {
    cat(" [", j, "/", m, "]")
  }

  dat_out <- reduce(
    formula_list,
    \(dat, form) dunnart_impute_var(dat_orig, dat, form, verbose),
    .init = dat_cur
  )

  if (verbose == 1) {
    cat("\n")
  }

  dat_out
}
Creating multiple imputed datasets
The functions we have above all operate on a single imputed dataset, but we want to produce multiple imputed datasets. The dunnart_multiple_impute_iterate function applies the dunnart_impute_iterate function to a list of imputed datasets using the map2() function from the purrr package. The code is slightly messier than ideal because we pass through some additional information (the imputation index and the total number of imputations) to allow progress reporting in verbose mode.
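Similarly, a tiny toy illustration of map2() (made-up values), mirroring how the imputation index travels alongside each imputed dataset:

library(purrr)

# map2() iterates over two vectors/lists of equal length in parallel;
# here the second input is just an index.
letters3 <- c("x", "y", "z")
map2_chr(letters3, seq_along(letters3), \(dat, j) paste0("dataset ", j, ": ", dat))
#> [1] "dataset 1: x" "dataset 2: y" "dataset 3: z"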
dunnart_multiple_impute_iterate <- function(
  data, imp_out, formula_list,
  verbose = TRUE, i = NA, iter = NA
) {
  if (verbose >= 1) {
    cat("imputation iteration", i, "of", iter, "\n")
  }

  # apply one imputation step to each imputed dataset
  map2(
    imp_out,
    seq_along(imp_out),
    \(dat_cur, j) dunnart_impute_iterate(
      data, dat_cur, formula_list, verbose = verbose,
      j = j, m = length(imp_out)
    )
  )
}
Finally, putting it all together: dunnart_impute() is our equivalent of the mice() function, taking an incomplete dataset, a list of model formulas, and the number of imputed datasets and iterations to run, and returning the multiple imputed output. Most of the code here is error checking. The two statements that actually do the work are the ones beginning imp_out <- map(...), to create the initial imputations, and all_imps <- accumulate(...), to create the subsequent iterations by repeatedly calling dunnart_multiple_impute_iterate(). Finally we create an object of class dunnart to hold the results.
dunnart_impute <- function(
  data, formula_list,
  m = 5, iter = 10, verbose = TRUE
) {
  stopifnot(is.data.frame(data))
  stopifnot(is.list(formula_list) &
              all(map_lgl(formula_list, \(x) inherits(x, "formula"))))
  stopifnot(is.numeric(m) & length(m) == 1 & !is.na(m) & m >= 1)
  stopifnot(is.numeric(iter) & length(iter) == 1 & !is.na(iter) & iter >= 0)
  stopifnot((is.logical(verbose) | is.numeric(verbose)) &
              length(verbose) == 1 & !is.na(verbose))

  # set up initial imputations (mean imputation)
  if (verbose >= 1) {
    cat("generating", m, "initial imputations\n")
  }
  imp_out <- map(
    seq_len(m),
    \(x) dunnart_initial_imputation_for_df(data, formula_list)
  )

  # check for any NA values after initial imputation, signifying that some
  # variables with missing data do not have imputation formulas provided
  if (any(map_lgl(imp_out, \(dat) any(is.na(dat))))) {
    stop("NAs found after initial imputation step - fix imputation formulas")
  }

  # repeat the imputation iteration step 'iter' times
  all_imps <- accumulate(
    seq_len(iter),
    \(imp, i) dunnart_multiple_impute_iterate(
      data, imp, formula_list,
      verbose = verbose, i = i, iter = iter
    ),
    .init = imp_out
  )

  # return results
  res <- list(
    m = m,
    iter = iter,
    formula_list = formula_list,
    orig_data = data,
    iterations = all_imps,
    imputations = all_imps[[iter + 1]]
  )
  class(res) <- "dunnart"
  res
}
Functions for extracting imputed data
Define a print method for dunnart objects so that we see a nice summary instead of all of the imputed data spewed to the console:
print.dunnart <- function(obj) {
  cat("'dunnart' multiple imputation object:\n",
      " -", obj$m, "imputed datasets\n",
      " -", obj$iter, "iterations\n",
      " -", nrow(obj$orig_data), "observations of",
      ncol(obj$orig_data), "variables\n")
}
The dunnart_complete_long function does something similar to the complete function in mice, returning a long-form data frame with all imputed datasets, containing a .imp variable identifying the imputation and a .id variable identifying the row within the imputation:
dunnart_complete_long <- function(obj, iteration = obj$iter) {
  stopifnot(inherits(obj, "dunnart"))
  stopifnot(is.numeric(iteration) & length(iteration) == 1)
  stopifnot(iteration >= 0 & iteration <= obj$iter)
  bind_rows(
    map(obj$iterations[[iteration + 1]],
        \(dat) mutate(dat, .id = row_number(), .before = 1L,
                      .by = all_of(NULL))),
    .id = ".imp"
  ) %>%
    mutate(.imp = as.numeric(.imp)) %>%
    remove_rownames()
}
The dunnart_all_iters_long() function does something similar, but includes all intermediate iterations of the imputation process. This will be used later to create trace plots to assess imputation convergence.
dunnart_all_iters_long <- function(obj) {
  stopifnot(inherits(obj, "dunnart"))
  iters <- 0:obj$iter
  bind_rows(
    map(iters, \(i) dunnart_complete_long(obj, i)) %>%
      set_names(iters),
    .id = ".iter"
  ) %>%
    mutate(.iter = as.numeric(.iter))
}
Taking it for a test ride
For this example we’ll use the nhanes data from the mice package. This dataset contains four variables: age, bmi, hyp, and chl; of which age is completely observed but all of the others contain missing values. The variable hyp is an indicator variable for a binary factor, but we will treat it as if it’s numeric for our imputation purposes.
data(nhanes, package = "mice")
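Before imputing, a quick sanity check of the missingness pattern (not part of the original workflow, just a check that the data look as described above):

# Count the missing values in each variable; age should have none, while
# bmi, hyp and chl should each contain some NAs.
colSums(is.na(nhanes))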
The code below calls the dunnart_impute() function to obtain 5 imputed datasets, using 10 iterations. Each variable is imputed using all other variables. Unlike in mice, we need to specify this by providing a full regression formula for each variable to be imputed.
set.seed(314159)
imp <- dunnart_impute(
  nhanes,
  list(
    bmi ~ age + hyp + chl,
    hyp ~ age + bmi + chl,
    chl ~ age + bmi + hyp
  ),
  m = 5,
  iter = 10
)
generating 5 initial imputations
imputation iteration 1 of 10
[ 1 / 5 ] bmi hyp chl
[ 2 / 5 ] bmi hyp chl
[ 3 / 5 ] bmi hyp chl
[ 4 / 5 ] bmi hyp chl
[ 5 / 5 ] bmi hyp chl
imputation iteration 2 of 10
[ 1 / 5 ] bmi hyp chl
[ 2 / 5 ] bmi hyp chl
[ 3 / 5 ] bmi hyp chl
[ 4 / 5 ] bmi hyp chl
[ 5 / 5 ] bmi hyp chl
imputation iteration 3 of 10
[ 1 / 5 ] bmi hyp chl
[ 2 / 5 ] bmi hyp chl
[ 3 / 5 ] bmi hyp chl
[ 4 / 5 ] bmi hyp chl
[ 5 / 5 ] bmi hyp chl
imputation iteration 4 of 10
[ 1 / 5 ] bmi hyp chl
[ 2 / 5 ] bmi hyp chl
[ 3 / 5 ] bmi hyp chl
[ 4 / 5 ] bmi hyp chl
[ 5 / 5 ] bmi hyp chl
imputation iteration 5 of 10
[ 1 / 5 ] bmi hyp chl
[ 2 / 5 ] bmi hyp chl
[ 3 / 5 ] bmi hyp chl
[ 4 / 5 ] bmi hyp chl
[ 5 / 5 ] bmi hyp chl
imputation iteration 6 of 10
[ 1 / 5 ] bmi hyp chl
[ 2 / 5 ] bmi hyp chl
[ 3 / 5 ] bmi hyp chl
[ 4 / 5 ] bmi hyp chl
[ 5 / 5 ] bmi hyp chl
imputation iteration 7 of 10
[ 1 / 5 ] bmi hyp chl
[ 2 / 5 ] bmi hyp chl
[ 3 / 5 ] bmi hyp chl
[ 4 / 5 ] bmi hyp chl
[ 5 / 5 ] bmi hyp chl
imputation iteration 8 of 10
[ 1 / 5 ] bmi hyp chl
[ 2 / 5 ] bmi hyp chl
[ 3 / 5 ] bmi hyp chl
[ 4 / 5 ] bmi hyp chl
[ 5 / 5 ] bmi hyp chl
imputation iteration 9 of 10
[ 1 / 5 ] bmi hyp chl
[ 2 / 5 ] bmi hyp chl
[ 3 / 5 ] bmi hyp chl
[ 4 / 5 ] bmi hyp chl
[ 5 / 5 ] bmi hyp chl
imputation iteration 10 of 10
[ 1 / 5 ] bmi hyp chl
[ 2 / 5 ] bmi hyp chl
[ 3 / 5 ] bmi hyp chl
[ 4 / 5 ] bmi hyp chl
[ 5 / 5 ] bmi hyp chl
Diagnostics
Trace plots
The plot below shows the mean and standard deviation of the imputed values of each variable at each iteration of the imputation process. Each coloured line shows a different imputed dataset. The ‘zeroth’ iteration is the initial mean imputation step where all imputed datasets are the same. The subsequent iterations show that the imputations very quickly start ‘mixing’ well, resembling a low-resolution fuzzy caterpillar.
miss_rows <- nhanes %>%
  mutate(.id = row_number()) %>%
  pivot_longer(c(bmi, hyp, chl), names_to = "var", values_to = "miss") %>%
  mutate(miss = is.na(miss))

trace_data <- dunnart_all_iters_long(imp) %>%
  pivot_longer(c(bmi, hyp, chl), names_to = "var", values_to = "value") %>%
  left_join(miss_rows, by = c(".id", "var")) %>%
  filter(miss) %>%
  summarise(
    mean = mean(value),
    sd = sd(value),
    .by = c(.iter, .imp, var)
  ) %>%
  mutate(across(c(.iter, .imp), as.factor))

plot_grid(
  trace_data %>%
    ggplot(aes(x = .iter, y = mean, colour = .imp, group = .imp)) +
    geom_line() +
    facet_wrap(~var, scales = "free_y") +
    panel_border() +
    theme(legend.position = "off"),
  trace_data %>%
    ggplot(aes(x = .iter, y = sd, colour = .imp, group = .imp)) +
    geom_line() +
    facet_wrap(~var, scales = "free_y") +
    panel_border() +
    theme(legend.position = "off"),
  nrow = 2
)
Distribution of imputed data
In the plot below, the blue circles show the original (completely observed) data and the red circles show imputed values. We can see that the imputation for ‘hyp’ isn’t ideal, with the original data only taking the values 1 or 2, but imputed values being anywhere on the real line.
dunnart_complete_long(imp) %>%
  pivot_longer(c(bmi, hyp, chl), names_to = "var", values_to = "value") %>%
  left_join(miss_rows, by = c(".id", "var")) %>%
  ggplot(aes(x = value, y = .imp, colour = miss)) +
  geom_point(pch = 1) +
  scale_colour_manual(values = c("dodgerblue3", "firebrick3")) +
  facet_grid(~var, scales = "free_x") +
  panel_border() +
  theme(legend.position = "bottom")
Running an analysis on the imputed data
Running the same analysis repeatedly on the imputed data is a fairly straightforward task. The function dunnart_analyse() below does this. The actual work is in the line res <- map(obj$imputations, fn), which calls a provided function on each imputation and returns the results as a list.
dunnart_analyse <- function(obj, fn) {
  stopifnot(inherits(obj, "dunnart"))
  stopifnot(is.function(fn))
  res <- map(obj$imputations, fn)
  class(res) <- "dunnart_analysis"
  res
}
As an example, let’s fit a regression model to our imputed data to predict chl from all other variables.
example_model <- dunnart_analyse(
  imp,
  \(dat) lm(chl ~ age + bmi + hyp, data = dat)
)
Pooling imputations using Rubin’s rules
Once repeated analyses on each imputed dataset have been done, the results need to be combined into a single output. This is done using Rubin’s rules, as described in e.g. Rubin (1996). The pooled point estimate is the mean of the estimates from each dataset, and the variance of the pooled estimate is the mean of the variances from each dataset plus \((m+1)/m\) times the between-imputation variance. There are some additional rules to derive the appropriate degrees of freedom for a t distribution, but in this case we will keep things simple and just use a large-sample Normal approximation.
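In symbols: if \(\hat{Q}_i\) is the estimate from imputed dataset \(i = 1, \dots, m\) and \(U_i\) is its squared standard error, the pooled estimate and its total variance are

\[ \bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i, \qquad B = \frac{1}{m-1} \sum_{i=1}^{m} (\hat{Q}_i - \bar{Q})^2, \qquad T = \frac{1}{m} \sum_{i=1}^{m} U_i + \frac{m+1}{m} B, \]

and \(\sqrt{T}\) is used as the pooled standard error in the code below.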
The function below implements this, taking advantage of the broom::tidy() function to obtain a data frame of estimates and standard errors for each model parameter. In principle a different function could be supplied (e.g. from the emmeans or marginaleffects packages) to obtain a data frame of estimates and standard errors of any estimand of interest from a model.
dunnart_pool <- function(obj, tidy_fn = broom::tidy, conf_level = 0.95) {
  stopifnot(inherits(obj, "dunnart_analysis"))
  stopifnot(is.function(tidy_fn))

  # apply the tidier function to each imputation dataset. this should return
  # a data frame for each analysis output.
  tidy_analysis <- bind_rows(
    map(obj, tidy_fn),
    .id = ".imp"
  )

  # number of imputed datasets (needed for Rubin's rules)
  m <- length(obj)

  # multiplier for confidence interval, e.g. 1.96 for 95% CI
  ci_mult <- qnorm(0.5 + 0.5*conf_level)

  # apply Rubin's rules and compute CIs and p-values
  tidy_analysis %>%
    summarise(
      std.error = sqrt(mean(std.error^2) + (m+1)/m * sd(estimate)^2),
      estimate = mean(estimate),
      .by = term
    ) %>%
    relocate(std.error, .after = estimate) %>%
    mutate(
      statistic = estimate / std.error,
      p.value = 2*pnorm(-abs(statistic)),
      conf.low = estimate - ci_mult*std.error,
      conf.high = estimate + ci_mult*std.error
    )
}
We can apply this to the regression model we fit earlier to get a table of pooled model parameters:
dunnart_pool(example_model)
# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -70.0 66.4 -1.05 0.291 -200. 60.1
2 age 47.7 10.9 4.37 0.0000122 26.3 69.1
3 bmi 7.12 2.32 3.07 0.00213 2.58 11.7
4 hyp -9.92 19.8 -0.502 0.616 -48.7 28.8
This is very similar to the results from the complete case analysis, which is reassuring:
broom::tidy(lm(chl ~ age + bmi + hyp, data = nhanes), conf.int = TRUE)
# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -81.0 61.8 -1.31 0.222 -221. 58.8
2 age 55.2 14.3 3.86 0.00383 22.9 87.5
3 bmi 7.07 2.05 3.44 0.00736 2.42 11.7
4 hyp -6.22 23.2 -0.268 0.794 -58.7 46.2
References

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian Data Analysis (3rd ed.). CRC Press.

Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.

Van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), 219–242.
Footnotes
¹ Well, I probably met it and forgot about it long ago in undergrad Bayes class.