| Title: | What the Package Does (One Line, Title Case) |
|---|---|
| Description: | What the package does (one paragraph). |
| Authors: | First Last [aut, cre] (ORCID: YOUR-ORCID-ID) |
| Maintainer: | First Last <[email protected]> |
| License: | `use_mit_license()`, `use_gpl3_license()` or friends to pick a license |
| Version: | 0.1.0 |
| Built: | 2026-05-10 08:05:04 UTC |
| Source: | https://github.com/robertkuchen/booami |
Minimal, dependency-free predictor for models fitted by
cv_boost_raw, cv_boost_imputed, or a
pooled impu_boost fit. Supports Gaussian (identity)
and logistic (logit) models, returning either the linear predictor
or, for logistic, predicted probabilities.
booami_predict(
  object,
  X_new,
  family = NULL,
  type = c("response", "link"),
  center_means = NULL
)
object |
A fit returned by |
X_new |
New data (matrix or data.frame) with the same |
family |
Model family; one of |
type |
Prediction type; one of |
center_means |
Optional numeric vector of length |
This function is deterministic and involves no random number generation.
Coefficients are extracted from either $final_model (intercept first,
then coefficients) or from $INT+$BETA (pooled impu_boost).
If X_new has column names and the model has named coefficients, columns
are aligned by name; otherwise they are used in order.
If your training pipeline centered covariates (e.g., center = "auto"),
providing the same center_means here yields numerically consistent
predictions. If not supplied but object$center_means exists, it will
be used automatically. If both are supplied, the explicit center_means
argument takes precedence.
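The steps described above (name alignment, optional centering, linear predictor, inverse link) can be sketched in base R. This is only an illustration under stated assumptions: `sketch_predict` is a hypothetical helper, not part of booami, and it assumes the coefficient vector stores the intercept first, as described for `$final_model`.

```r
# Hypothetical helper, NOT the booami implementation: it mirrors the
# documented logic (intercept first, align by name, optional centering).
sketch_predict <- function(coefs, X_new,
                           family = c("gaussian", "logistic"),
                           type = c("response", "link"),
                           center_means = NULL) {
  family <- match.arg(family)
  type <- match.arg(type)
  X <- data.matrix(X_new)
  beta <- coefs[-1]  # coefficients follow the intercept

  # Align columns by name when both sides are named; otherwise use order.
  if (!is.null(colnames(X)) && !is.null(names(beta))) {
    X <- X[, names(beta), drop = FALSE]
  }
  # Apply the training-time centering, if means are supplied.
  if (!is.null(center_means)) {
    X <- sweep(X, 2, center_means, "-")
  }

  eta <- drop(coefs[[1]] + X %*% beta)  # linear predictor
  out <- if (family == "logistic" && type == "response") plogis(eta) else eta
  names(out) <- rownames(X)             # propagate row names, if any
  out
}

coefs <- c("(Intercept)" = 1, X1 = 2, X2 = -1)
# Columns deliberately given in a different order than the coefficients.
X_new <- data.frame(X2 = c(0, 1), X1 = c(1, 1), row.names = c("a", "b"))

p_link <- sketch_predict(coefs, X_new, family = "gaussian", type = "link")
p_prob <- sketch_predict(coefs, X_new, family = "logistic", type = "response")
```

Because the columns are matched by name, swapping `X1` and `X2` in `X_new` does not change the result.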
A numeric vector of predictions (length nrow(X_new)). If
X_new has row names, they are propagated to the returned vector.
cv_boost_raw, cv_boost_imputed, impu_boost
# 1) Fit on data WITH missing values
set.seed(123)
sim_tr <- simulate_booami_data(
  n = 120, p = 12, p_inf = 3,
  type = "gaussian",
  miss = "MAR", miss_prop = 0.20
)
X_tr <- sim_tr$data[, 1:12]
y_tr <- sim_tr$data$y

fit <- cv_boost_raw(
  X_tr, y_tr,
  k = 2, mstop = 50, seed = 123,
  impute_args = list(m = 2, maxit = 1, printFlag = FALSE, seed = 1),
  quickpred_args = list(method = "spearman", mincor = 0.30, minpuc = 0.60),
  show_progress = FALSE
)

# 2) Predict on a separate data set WITHOUT missing values (same p)
sim_new <- simulate_booami_data(
  n = 5, p = 12, p_inf = 3,
  type = "gaussian",
  miss = "MCAR", miss_prop = 0  # <- complete data with existing API
)
X_new <- sim_new$data[, 1:12, drop = FALSE]

preds <- booami_predict(fit, X_new = X_new, family = "gaussian", type = "response")
round(preds, 3)
A simulated dataset with predictors X1...X25 and a continuous
outcome y, with missing values generated under a MAR mechanism. The
object is a data.frame and carries attributes describing the
data-generating process (true coefficients, informative indices, etc.).
A data frame with 300 rows and 26 variables:
X1, ..., X25: numeric
y: numeric outcome
Generated by simulate_booami_data with typical settings (see
?simulate_booami_data). The following attributes are attached to
booami_sim:
"true_beta": numeric length-25 vector of true coefficients
(non-zeros in positions 1-5).
"informative": integer vector 1:5.
"type": "gaussian".
"corr_structure": "all_ar1"; "rho": 0.3.
"intercept": 1; "noise_sd": 1 (Gaussian; NA otherwise).
"mar_scale": TRUE; "keep_mar_drivers": TRUE.
simulate_booami_data,
impu_boost, cv_boost_raw, cv_boost_imputed
## \donttest{
utils::data(booami_sim)
dim(booami_sim)
mean(colSums(is.na(booami_sim)) > 0)  # fraction of columns with any NAs
head(attr(booami_sim, "true_beta"))
attr(booami_sim, "informative")
## }
Performs k-fold cross-validation for impu_boost to determine
the optimal value of mstop before fitting the final model on the
full dataset. This function should only be used when data have already
been imputed. In most cases, it is preferable to provide unimputed data
and use cv_boost_raw instead.
cv_boost_imputed(
  X_train_list,
  y_train_list,
  X_val_list,
  y_val_list,
  X_full,
  y_full,
  ny = 0.1,
  mstop = 250,
  type = c("gaussian", "logistic"),
  MIBoost = TRUE,
  pool = TRUE,
  pool_threshold = 0,
  show_progress = TRUE,
  center = c("auto", "off", "force")
)
X_train_list |
A list of length |
y_train_list |
A list of length |
X_val_list |
A list of length |
y_val_list |
A list of length |
X_full |
A list of length |
y_full |
A list of length |
ny |
Learning rate. Defaults to 0.1. |
mstop |
Maximum number of boosting iterations to evaluate during
cross-validation. The selected |
type |
Type of loss function. One of:
|
MIBoost |
Logical. If |
pool |
Logical. If |
pool_threshold |
Only used when |
show_progress |
Logical; print fold-level progress and summary timings.
Default TRUE. |
center |
One of |
To avoid data leakage, each CV fold should first be split into training and validation subsets, after which imputation is performed. For the final model, all data should be imputed independently.
The recommended workflow is illustrated in the examples.
Centering affects only X; y is left unchanged. For
type = "logistic", responses are treated as numeric 0/1
via the logistic link. Validation loss is averaged over
imputations and then over folds.
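The two-stage averaging of the validation loss can be written out as a small base-R sketch. The loss values below are made up purely for illustration, and the layout (one `mstop` x `M` matrix per fold) is an assumption about how such losses might be stored, not a description of the package internals.

```r
k <- 3      # folds
M <- 2      # imputations per fold
mstop <- 5  # boosting iterations evaluated

# Hypothetical losses: loss[[f]] is an mstop x M matrix
# (rows = iterations, columns = imputations) for fold f.
loss <- lapply(seq_len(k), function(f) {
  outer((seq_len(mstop) - 3)^2, seq_len(M), "+") + f
})

# Step 1: average over imputations within each fold -> mstop x k matrix.
fold_curves <- sapply(loss, rowMeans)

# Step 2: average the per-fold curves over folds -> length-mstop vector.
CV_error <- rowMeans(fold_curves)

# The selected stopping iteration is the index of the minimum.
best_mstop <- which.min(CV_error)
```

With these toy numbers the error curve is minimized at iteration 3, which is what `best_mstop` reports.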
A list with:
CV_error: numeric vector of length mstop with the mean
cross-validated loss across folds (and imputations).
best_mstop: integer index of the minimizing entry in
CV_error.
final_model: numeric vector of length 1 + p containing
the intercept followed by coefficients of the final pooled
model fitted at best_mstop on X_full/y_full.
Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807 https://arxiv.org/abs/2507.21807.
set.seed(123)
utils::data(booami_sim)
k <- 2; M <- 2
n <- nrow(booami_sim); p <- ncol(booami_sim) - 1
folds <- sample(rep(seq_len(k), length.out = n))

X_train_list <- vector("list", k)
y_train_list <- vector("list", k)
X_val_list <- vector("list", k)
y_val_list <- vector("list", k)

for (cv in seq_len(k)) {
  tr <- folds != cv
  va <- !tr
  dat_tr <- booami_sim[tr, , drop = FALSE]
  dat_va <- booami_sim[va, , drop = FALSE]
  pm_tr <- mice::quickpred(dat_tr, method = "spearman",
                           mincor = 0.30, minpuc = 0.60)
  imp_tr <- mice::mice(dat_tr, m = M, predictorMatrix = pm_tr,
                       maxit = 1, printFlag = FALSE)
  imp_va <- mice::mice.mids(imp_tr, newdata = dat_va, maxit = 1,
                            printFlag = FALSE)
  X_train_list[[cv]] <- vector("list", M)
  y_train_list[[cv]] <- vector("list", M)
  X_val_list[[cv]] <- vector("list", M)
  y_val_list[[cv]] <- vector("list", M)
  for (m in seq_len(M)) {
    tr_m <- mice::complete(imp_tr, m)
    va_m <- mice::complete(imp_va, m)
    X_train_list[[cv]][[m]] <- data.matrix(tr_m[, 1:p, drop = FALSE])
    y_train_list[[cv]][[m]] <- tr_m$y
    X_val_list[[cv]][[m]] <- data.matrix(va_m[, 1:p, drop = FALSE])
    y_val_list[[cv]][[m]] <- va_m$y
  }
}

pm_full <- mice::quickpred(booami_sim, method = "spearman",
                           mincor = 0.30, minpuc = 0.60)
imp_full <- mice::mice(booami_sim, m = M, predictorMatrix = pm_full,
                       maxit = 1, printFlag = FALSE)
X_full <- lapply(seq_len(M), function(m)
  data.matrix(mice::complete(imp_full, m)[, 1:p, drop = FALSE]))
y_full <- lapply(seq_len(M), function(m) mice::complete(imp_full, m)$y)

res <- cv_boost_imputed(
  X_train_list, y_train_list,
  X_val_list, y_val_list,
  X_full, y_full,
  ny = 0.1, mstop = 50, type = "gaussian",
  MIBoost = TRUE, pool = TRUE, center = "auto",
  show_progress = FALSE
)

## Not run:
set.seed(2025)
utils::data(booami_sim)
k <- 5; M <- 10
n <- nrow(booami_sim); p <- ncol(booami_sim) - 1
folds <- sample(rep(seq_len(k), length.out = n))

X_train_list <- vector("list", k)
y_train_list <- vector("list", k)
X_val_list <- vector("list", k)
y_val_list <- vector("list", k)

for (cv in seq_len(k)) {
  tr <- folds != cv; va <- !tr
  dat_tr <- booami_sim[tr, , drop = FALSE]
  dat_va <- booami_sim[va, , drop = FALSE]
  pm_tr <- mice::quickpred(dat_tr, method = "spearman",
                           mincor = 0.20, minpuc = 0.40)
  imp_tr <- mice::mice(dat_tr, m = M, predictorMatrix = pm_tr,
                       maxit = 5, printFlag = TRUE)
  imp_va <- mice::mice.mids(imp_tr, newdata = dat_va, maxit = 1,
                            printFlag = FALSE)
  X_train_list[[cv]] <- vector("list", M)
  y_train_list[[cv]] <- vector("list", M)
  X_val_list[[cv]] <- vector("list", M)
  y_val_list[[cv]] <- vector("list", M)
  for (m in seq_len(M)) {
    tr_m <- mice::complete(imp_tr, m); va_m <- mice::complete(imp_va, m)
    X_train_list[[cv]][[m]] <- data.matrix(tr_m[, 1:p, drop = FALSE])
    y_train_list[[cv]][[m]] <- tr_m$y
    X_val_list[[cv]][[m]] <- data.matrix(va_m[, 1:p, drop = FALSE])
    y_val_list[[cv]][[m]] <- va_m$y
  }
}

pm_full <- mice::quickpred(booami_sim, method = "spearman",
                           mincor = 0.20, minpuc = 0.40)
imp_full <- mice::mice(booami_sim, m = M, predictorMatrix = pm_full,
                       maxit = 5, printFlag = TRUE)
X_full <- lapply(seq_len(M), function(m)
  data.matrix(mice::complete(imp_full, m)[, 1:p, drop = FALSE]))
y_full <- lapply(seq_len(M), function(m) mice::complete(imp_full, m)$y)

res_heavy <- cv_boost_imputed(
  X_train_list, y_train_list,
  X_val_list, y_val_list,
  X_full, y_full,
  ny = 0.1, mstop = 250, type = "gaussian",
  MIBoost = TRUE, pool = TRUE, center = "auto",
  show_progress = TRUE
)
str(res_heavy)
## End(Not run)
Performs k-fold cross-validation for impu_boost on data with
missing values. Within each fold, multiple imputation, centering, model
fitting, and validation are performed in a leakage-avoiding manner to select
the optimal number of boosting iterations (mstop). The final model is
then fitted on multiple imputations of the full dataset at the selected
stopping iteration.
cv_boost_raw(
  X,
  y,
  k = 5,
  ny = 0.1,
  mstop = 250,
  type = c("gaussian", "logistic"),
  MIBoost = TRUE,
  pool = TRUE,
  pool_threshold = 0,
  impute_args = list(m = 10, maxit = 5, printFlag = FALSE),
  impute_method = NULL,
  use_quickpred = TRUE,
  quickpred_args = list(mincor = 0.1, minpuc = 0.5, method = NULL,
                        include = NULL, exclude = NULL),
  seed = 123,
  show_progress = TRUE,
  return_full_imputations = FALSE,
  center = "auto"
)
X |
A data.frame or matrix of predictors of size |
y |
A vector of length |
k |
Number of cross-validation folds. Default is 5. |
ny |
Learning rate. Defaults to 0.1. |
mstop |
Maximum number of boosting iterations to evaluate during
cross-validation. The selected |
type |
Type of loss function. One of:
|
MIBoost |
Logical. If |
pool |
Logical. If |
pool_threshold |
Only used when |
impute_args |
A named list of arguments forwarded to |
impute_method |
Optional named character vector passed to
|
use_quickpred |
Logical. If |
quickpred_args |
A named list of arguments forwarded to
|
seed |
Base random seed for fold assignment. If |
show_progress |
Logical. If |
return_full_imputations |
Logical. If |
center |
One of |
Within each CV fold, the data are first split into a training subset and a
validation subset. The training subset is multiply imputed M times
using mice, producing M imputed training datasets (M is the number of
imputations, set via impute_args$m). Covariates
in each training dataset are centered. The corresponding validation subset
is then imputed M times using the imputation models learned from the
training imputations, ensuring consistency between training and validation.
These validation datasets are centered using the variable means from their
associated training datasets.
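The centering rule in this paragraph, where validation data are shifted by the training means rather than their own means, can be illustrated with a short base-R sketch. The matrices are simulated stand-ins for one imputed training/validation pair; none of the object names come from the package.

```r
set.seed(1)
cols <- c("X1", "X2")
# Stand-ins for one imputed training set and its validation counterpart.
X_tr <- matrix(rnorm(20, mean = 5), nrow = 10, dimnames = list(NULL, cols))
X_va <- matrix(rnorm(8, mean = 5), nrow = 4, dimnames = list(NULL, cols))

# Means are learned on the training data only.
mu_tr <- colMeans(X_tr)

# Both sets are shifted by the SAME training means, so no information
# from the validation rows leaks into the preprocessing step.
X_tr_c <- sweep(X_tr, 2, mu_tr, "-")
X_va_c <- sweep(X_va, 2, mu_tr, "-")
```

After this step the training columns have mean zero by construction, while the validation columns generally do not, which is the intended behavior.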
impu_boost is run on the imputed training datasets for up to
mstop boosting iterations. At each iteration, prediction errors are
computed on the corresponding validation datasets and averaged across
imputations. This yields an aggregated error curve per fold, which is then
averaged across folds. The optimal stopping iteration is chosen as the
mstop value minimizing the mean CV error.
Finally, the full dataset is multiply imputed M times and centered
independently within each imputed dataset. impu_boost is
applied to these M datasets for the selected number of boosting iterations to
obtain the final model.
Imputation control. All key mice settings can be passed via
impute_args (a named list forwarded to mice::mice()) and/or
impute_method (a named character vector of per-variable methods).
Internally, the function builds a full default method vector from the actual
data given to mice(), then merges any user-supplied entries
by name. The names in impute_method must exactly match the
column names in data.frame(y = y, X) (i.e., the data passed
to mice()). Partial vectors are allowed; variables not listed fall
back to defaults; unknown names are ignored with a warning. The function sets
and may override data, method (after merging overrides),
predictorMatrix, and ignore (to enforce train-only learning).
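The merge-by-name behavior described for impute_method can be sketched as follows. `merge_methods` is a hypothetical helper written for this illustration, and the default method vector is hard-coded here; in practice such defaults would come from mice (for example via `mice::make.method()`).

```r
# Hypothetical helper, NOT the package's internal code: overlays a partial,
# named method vector onto a full default vector.
merge_methods <- function(default_method, user_method) {
  unknown <- setdiff(names(user_method), names(default_method))
  if (length(unknown) > 0) {
    warning("Ignoring unknown variables: ", paste(unknown, collapse = ", "))
  }
  known <- intersect(names(user_method), names(default_method))
  default_method[known] <- user_method[known]
  default_method
}

# Assumed defaults for columns of data.frame(y = y, X); "" means "do not impute".
default_method <- c(y = "pmm", X1 = "pmm", X2 = "pmm", X3 = "")

# Partial override: X1 is known, Z9 does not exist in the data.
user_method <- c(X1 = "norm", Z9 = "pmm")

merged <- suppressWarnings(merge_methods(default_method, user_method))
```

Variables not named in `user_method` keep their defaults, and the unknown name `Z9` triggers a warning and is dropped.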
Predictor matrices can be built with mice::quickpred() (see
use_quickpred, quickpred_args) or with
mice::make.predictorMatrix().
A list with:
CV_error: numeric vector (length mstop) of mean CV loss.
best_mstop: integer index minimizing CV_error.
final_model: numeric vector of length 1 + p with the
intercept and pooled coefficients of the final fit on full-data
imputations at best_mstop.
full_imputations: (optional) when return_full_imputations=TRUE,
a list list(X = <list length m>, y = <list length m>) containing
the full-data imputations used for the final model.
folds: integer vector of length n giving the CV fold id
for each observation (1..k).
Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807 https://arxiv.org/abs/2507.21807.
impu_boost, cv_boost_imputed, mice
utils::data(booami_sim)
X <- booami_sim[, 1:25]
y <- booami_sim[, 26]

res <- cv_boost_raw(
  X = X, y = y,
  k = 2, seed = 123,
  impute_args = list(m = 2, maxit = 1, printFlag = FALSE, seed = 1),
  quickpred_args = list(mincor = 0.30, minpuc = 0.60),
  mstop = 50,
  show_progress = FALSE
)

# Partial custom imputation method override
meth <- c(y = "pmm", X1 = "pmm")
res2 <- cv_boost_raw(
  X = X, y = y,
  k = 2, seed = 123,
  impute_args = list(m = 2, maxit = 1, printFlag = FALSE, seed = 456),
  quickpred_args = list(mincor = 0.30, minpuc = 0.60),
  mstop = 50,
  impute_method = meth,
  show_progress = FALSE
)
Applies component-wise gradient boosting to multiply imputed datasets. Depending on the settings, either a separate model is reported for each imputed dataset, or the M models are pooled to yield a single final model. For pooling, one can choose the novel MIBoost algorithm, which enforces a uniform variable-selection scheme across all imputations, or the more conventional ad-hoc approaches of estimate-averaging and selection-frequency thresholding.
impu_boost(
  X_list,
  y_list,
  X_list_val = NULL,
  y_list_val = NULL,
  ny = 0.1,
  mstop = 250,
  type = c("gaussian", "logistic"),
  MIBoost = TRUE,
  pool = TRUE,
  pool_threshold = 0,
  center = c("auto", "force", "off")
)
X_list |
List of length M; each element is an |
y_list |
List of length M; each element is a length- |
X_list_val |
Optional validation list (same structure as |
y_list_val |
Optional validation list (same structure as |
ny |
Learning rate. Defaults to 0.1. |
mstop |
Number of boosting iterations (default 250). |
type |
Type of loss function. One of:
|
MIBoost |
Logical. If |
pool |
Logical. If |
pool_threshold |
Only used when |
center |
One of |
This function supports MIBoost, which enforces uniform variable selection across multiply imputed datasets. For full methodology, see Kuchen (2025).
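For readers unfamiliar with the building block, the following base-R sketch shows generic component-wise gradient boosting with squared-error loss on a single complete dataset. It is not MIBoost and not the package's code: the cross-imputation selection rule is described in Kuchen (2025), and `componentwise_l2_boost` is a hypothetical name used only here.

```r
# Generic component-wise L2 boosting on ONE dataset (illustration only).
componentwise_l2_boost <- function(X, y, ny = 0.1, mstop = 100) {
  X <- scale(X, center = TRUE, scale = FALSE)  # center covariates
  beta <- numeric(ncol(X))
  intercept <- mean(y)
  resid <- y - intercept
  ss <- colSums(X^2)
  for (t in seq_len(mstop)) {
    # Univariate least-squares fit of the residuals on each covariate.
    b <- drop(crossprod(X, resid)) / ss
    # Residual sum of squares each candidate would leave behind.
    rss <- sum(resid^2) - b^2 * ss
    j <- which.min(rss)                  # best-fitting single covariate
    beta[j] <- beta[j] + ny * b[j]       # shrunken update (learning rate ny)
    resid <- resid - ny * b[j] * X[, j]
  }
  list(INT = intercept, BETA = beta)
}

set.seed(1)
n <- 200
X <- matrix(rnorm(n * 5), n, 5)
y <- 1 + 2 * X[, 1] - 1.5 * X[, 3] + rnorm(n, sd = 0.1)

fit_sketch <- componentwise_l2_boost(X, y, ny = 0.1, mstop = 300)
```

Only one coefficient is updated per iteration, which is what gives boosting its built-in variable selection when stopped early. The intercept returned here refers to the centered covariates.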
A list with elements:
INT: intercept(s). A scalar if pool = TRUE, otherwise
a length-M vector.
BETA: coefficient estimates. A length-p vector if
pool = TRUE, otherwise an M × p matrix.
CV_error: vector of validation errors (if validation data
were provided), otherwise NULL.
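When pool = FALSE returns an M × p coefficient matrix, the ad-hoc pooling approaches mentioned in the description (estimate-averaging and selection-frequency thresholding) could look roughly like the sketch below. The numbers are invented, and the use of `>=` against a threshold of 0.5 is an assumption made for illustration; the exact rule behind pool_threshold is not spelled out in this text.

```r
# Made-up per-imputation results: M = 3 imputations, p = 3 covariates.
BETA <- rbind(
  c(1.0, 0.0, 0.4),
  c(1.2, 0.3, 0.0),
  c(0.8, 0.0, 0.0)
)
INT <- c(0.9, 1.1, 1.0)

# Selection frequency: share of imputations in which a covariate is non-zero.
sel_freq <- colMeans(BETA != 0)

# Estimate-averaging: plain mean over imputations.
beta_avg <- colMeans(BETA)
int_pooled <- mean(INT)

# Thresholding (assumed rule): keep averaged estimates only for covariates
# selected in at least a given share of the imputations.
threshold <- 0.5  # hypothetical value
beta_pooled <- ifelse(sel_freq >= threshold, beta_avg, 0)
```

Here only the first covariate is selected in a majority of imputations, so the other two averaged coefficients are set to zero.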
Kuchen, R. (2025). MIBoost: A Gradient Boosting Algorithm for Variable Selection After Multiple Imputation. arXiv:2507.21807. doi:10.48550/arXiv.2507.21807 https://arxiv.org/abs/2507.21807.
simulate_booami_data, cv_boost_raw, cv_boost_imputed
set.seed(123)
utils::data(booami_sim)
M <- 2
n <- nrow(booami_sim)
x_cols <- grepl("^X\\d+$", names(booami_sim))

tr_idx <- sample(seq_len(n), floor(0.8 * n))
dat_tr <- booami_sim[tr_idx, , drop = FALSE]
dat_va <- booami_sim[-tr_idx, , drop = FALSE]

pm_tr <- mice::quickpred(dat_tr, method = "spearman",
                         mincor = 0.30, minpuc = 0.60)
imp_tr <- mice::mice(dat_tr, m = M, predictorMatrix = pm_tr,
                     maxit = 1, printFlag = FALSE)
imp_va <- mice::mice.mids(imp_tr, newdata = dat_va, maxit = 1,
                          printFlag = FALSE)

X_list <- vector("list", M)
y_list <- vector("list", M)
X_list_val <- vector("list", M)
y_list_val <- vector("list", M)
for (m in seq_len(M)) {
  tr_m <- mice::complete(imp_tr, m)
  va_m <- mice::complete(imp_va, m)
  X_list[[m]] <- data.matrix(tr_m[, x_cols, drop = FALSE])
  y_list[[m]] <- tr_m$y
  X_list_val[[m]] <- data.matrix(va_m[, x_cols, drop = FALSE])
  y_list_val[[m]] <- va_m$y
}

fit <- impu_boost(
  X_list, y_list,
  X_list_val = X_list_val, y_list_val = y_list_val,
  ny = 0.1, mstop = 50, type = "gaussian",
  MIBoost = TRUE, pool = TRUE, center = "auto"
)
which.min(fit$CV_error)
head(fit$BETA)
fit$INT

## Not run:
# Heavier demo (more imputations and iterations; for local runs)
set.seed(2025)
utils::data(booami_sim)
M <- 10
n <- nrow(booami_sim)
x_cols <- grepl("^X\\d+$", names(booami_sim))

tr_idx <- sample(seq_len(n), floor(0.8 * n))
dat_tr <- booami_sim[tr_idx, , drop = FALSE]
dat_va <- booami_sim[-tr_idx, , drop = FALSE]

pm_tr <- mice::quickpred(dat_tr, method = "spearman",
                         mincor = 0.20, minpuc = 0.40)
imp_tr <- mice::mice(dat_tr, m = M, predictorMatrix = pm_tr,
                     maxit = 5, printFlag = TRUE)
imp_va <- mice::mice.mids(imp_tr, newdata = dat_va, maxit = 1,
                          printFlag = FALSE)

X_list <- vector("list", M)
y_list <- vector("list", M)
X_list_val <- vector("list", M)
y_list_val <- vector("list", M)
for (m in seq_len(M)) {
  tr_m <- mice::complete(imp_tr, m)
  va_m <- mice::complete(imp_va, m)
  X_list[[m]] <- data.matrix(tr_m[, x_cols, drop = FALSE])
  y_list[[m]] <- tr_m$y
  X_list_val[[m]] <- data.matrix(va_m[, x_cols, drop = FALSE])
  y_list_val[[m]] <- va_m$y
}

fit_heavy <- impu_boost(
  X_list, y_list,
  X_list_val = X_list_val, y_list_val = y_list_val,
  ny = 0.1, mstop = 250, type = "gaussian",
  MIBoost = TRUE, pool = TRUE, center = "auto"
)
str(fit_heavy)
## End(Not run)
Predict responses (link or response scale) from fitted booami models.
## S3 method for class 'booami_cv'
predict(object, newdata, type = c("link", "response"), ...)

## S3 method for class 'booami_pooled'
predict(object, newdata, type = c("link", "response"), ...)

## S3 method for class 'booami_multi'
predict(object, newdata, type = c("link", "response"), ...)
object |
A fitted booami object. One of:
|
newdata |
A data.frame or matrix of predictors (same columns/order as training). |
type |
Either |
... |
Passed to |
A numeric vector of predictions.
Generates a dataset with p predictors, of which the first p_inf
are informative. Predictors are drawn from a multivariate normal with a chosen
correlation structure, and the outcome can be continuous (type = "gaussian")
or binary (type = "logistic"). Missing values are introduced via MAR or MCAR.
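To make the available corr_structure options concrete, the sketch below builds the corresponding correlation matrices in base R and draws correlated predictors from one of them. The helper names (`ar1_corr`, `cs_corr`) are hypothetical, and the construction is an assumption based on the standard definitions of AR(1) and compound symmetry, not a copy of the simulator's code.

```r
# AR(1): correlation rho^|i - j| between predictors i and j.
ar1_corr <- function(p, rho) rho^abs(outer(seq_len(p), seq_len(p), "-"))

# Compound symmetry: the same correlation rho between every distinct pair.
cs_corr <- function(p, rho) {
  S <- matrix(rho, p, p)
  diag(S) <- 1
  S
}

p <- 6; p_inf <- 2; rho <- 0.3; rho_noise <- 0.1
noise <- (p_inf + 1):p

S_all_ar1 <- ar1_corr(p, rho)                       # "all_ar1"

S_informative_cs <- diag(p)                         # "informative_cs"
S_informative_cs[1:p_inf, 1:p_inf] <- cs_corr(p_inf, rho)

S_blockdiag <- matrix(0, p, p)                      # "blockdiag"
S_blockdiag[1:p_inf, 1:p_inf] <- ar1_corr(p_inf, rho)
S_blockdiag[noise, noise] <- ar1_corr(p - p_inf, rho_noise)

S_none <- diag(p)                                   # "none"

# Draw n multivariate-normal rows with correlation S_all_ar1.
set.seed(1)
n <- 50
Z <- matrix(rnorm(n * p), n, p) %*% chol(S_all_ar1)
```

Multiplying independent standard normals by the Cholesky factor is one common way to obtain the desired correlation without extra packages.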
simulate_booami_data(
  n = 300,
  p = 25,
  p_inf = 5,
  rho = 0.3,
  type = c("gaussian", "logistic"),
  beta_range = c(1, 2),
  intercept = 1,
  corr_structure = c("all_ar1", "informative_cs", "blockdiag", "none"),
  rho_noise = NULL,
  noise_sd = 1,
  miss = c("MAR", "MCAR"),
  miss_prop = 0.25,
  mar_drivers = c(1, 2, 3),
  gamma_vec = NULL,
  calibrate_mar = FALSE,
  mar_scale = TRUE,
  keep_observed = integer(0),
  jitter_sd = 0.25,
  keep_mar_drivers = TRUE
)
n |
Number of observations (default 300). |
p |
Total number of predictors (default 25). |
p_inf |
Number of informative predictors (default 5). |
rho |
Correlation parameter (interpretation depends on |
type |
Either |
beta_range |
Length-2 numeric; coefficients for the first |
intercept |
Intercept added to the linear predictor (default 1). |
corr_structure |
One of |
rho_noise |
Optional correlation for the noise block when |
noise_sd |
Std. dev. of Gaussian noise added to |
miss |
Missingness mechanism: |
miss_prop |
Target marginal missingness proportion (default 0.25). |
mar_drivers |
Indices of predictors that drive MAR (default c(1, 2, 3)). |
gamma_vec |
Coefficients for MAR drivers; length must equal the number of MAR drivers actually used
(i.e., |
calibrate_mar |
If |
mar_scale |
If |
keep_observed |
Indices of predictors kept fully observed (values outside |
jitter_sd |
Standard deviation of the per-row jitter added to the MAR logit to induce heterogeneity
(default |
keep_mar_drivers |
Logical; if |
Correlation structures:
"all_ar1": AR(1) correlation with parameter rho across all predictors.
"informative_cs": compound symmetry (exchangeable) within the first p_inf
predictors with parameter rho; others independent.
"blockdiag": block-diagonal AR(1): the informative block (size p_inf) has AR(1) with rho;
the noise block (size p - p_inf) has AR(1) with rho_noise (defaults to rho).
"none": independent predictors.
Missingness:
"MAR": for each row, a logit missingness score is computed from the
selected MAR drivers (see mar_drivers, gamma_vec, mar_scale).
If calibrate_mar = TRUE, the intercept is calibrated so that the overall
missingness matches the target proportion miss_prop; otherwise it is set to
qlogis(miss_prop). Per-row jitter drawn from N(0, jitter_sd) adds heterogeneity.
The resulting probability is used to mask predictors, except those in
keep_observed and, if keep_mar_drivers = TRUE, the drivers themselves.
For type = "gaussian" only, y is also subject to the same missingness mechanism.
"MCAR": each predictor (except those in keep_observed) is masked independently with probability miss_prop.
For type = "gaussian" only, y is also masked MCAR with probability miss_prop.
Note: In the simulation, missingness probabilities are computed using the
fully observed latent covariates before masking. From an analyst’s perspective after
masking, allowing the MAR drivers themselves to be missing makes missingness depend on
unobserved values—i.e., effectively non-ignorable (MNAR). Setting
keep_mar_drivers = TRUE keeps those drivers observed and yields a MAR mechanism.
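The MAR mechanism described above can be mimicked in a few lines of base R. This is a simplified sketch under several assumptions: the driver coefficients are invented, the intercept is the uncalibrated `qlogis(miss_prop)`, and each maskable column is masked with an independent draw against the same per-row probability. The actual simulator may differ in these details.

```r
set.seed(1)
n <- 500; p <- 6
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("X", seq_len(p))))

miss_prop <- 0.25
mar_drivers <- c(1, 2)
gamma_vec <- c(0.5, -0.5)  # hypothetical driver coefficients
jitter_sd <- 0.25

# Logit missingness score from the fully observed (latent) drivers,
# standardized as with mar_scale = TRUE, plus per-row jitter.
score <- qlogis(miss_prop) +
  drop(scale(X[, mar_drivers]) %*% gamma_vec) +
  rnorm(n, sd = jitter_sd)
prob <- plogis(score)

# keep_mar_drivers = TRUE: the drivers themselves stay fully observed,
# so missingness depends only on observed values (MAR).
maskable <- setdiff(seq_len(p), mar_drivers)
X_mis <- X
for (j in maskable) {
  X_mis[runif(n) < prob, j] <- NA
}

colMeans(is.na(X_mis))
```

Masking the driver columns as well would make missingness depend on values the analyst cannot see, which is the MNAR situation the note warns about.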
A list with elements:
data: data.frame with columns X1..Xp and y, containing NAs per the missingness mechanism.
beta: numeric length-p vector of true coefficients (non-zeros in the first p_inf positions).
informative: integer vector 1:p_inf.
type: character, outcome type ("gaussian" or "logistic").
intercept: numeric intercept used.
The data element additionally carries attributes:
"true_beta", "informative",
"type", "corr_structure", "rho", "rho_noise" (if set),
"intercept", "noise_sd" (Gaussian; NA otherwise), "mar_scale",
and "keep_mar_drivers".
booami_sim
set.seed(123)
sim <- simulate_booami_data(
  n = 300, p = 25, p_inf = 5, rho = 0.3,
  type = "gaussian", beta_range = c(1, 2), intercept = 1,
  corr_structure = "all_ar1", rho_noise = NULL, noise_sd = 1,
  miss = "MAR", miss_prop = 0.25, mar_drivers = c(1, 2, 3),
  gamma_vec = NULL, calibrate_mar = FALSE, mar_scale = TRUE,
  keep_observed = integer(0), jitter_sd = 0.25, keep_mar_drivers = TRUE
)
booami_sim <- sim$data
booami_sim, cv_boost_raw,
cv_boost_imputed, impu_boost
set.seed(42)
sim <- simulate_booami_data(
  n = 200, p = 15, p_inf = 4, rho = 0.25,
  type = "gaussian", miss = "MAR", miss_prop = 0.20
)
d <- sim$data
dim(d)
mean(colSums(is.na(d)) > 0)  # fraction of columns with any NAs
head(attr(d, "true_beta"))
attr(d, "informative")

# Example with block-diagonal correlation and protected MAR drivers
sim2 <- simulate_booami_data(
  n = 150, p = 12, p_inf = 3, rho = 0.40,
  rho_noise = 0.10, corr_structure = "blockdiag",
  miss = "MAR", miss_prop = 0.30,
  mar_drivers = c(1, 2), keep_mar_drivers = TRUE
)
colSums(is.na(sim2$data))[1:4]

# Binary outcome example
sim3 <- simulate_booami_data(
  n = 100, p = 10, p_inf = 2, rho = 0.2,
  type = "logistic", miss = "MCAR", miss_prop = 0.15
)
table(sim3$data$y, useNA = "ifany")

utils::data(booami_sim)
dim(booami_sim)
head(attr(booami_sim, "true_beta"))
attr(booami_sim, "informative")