General suite of bagging functions for several models.

bagger(x, ...)

# S3 method for default
bagger(x, ...)

# S3 method for data.frame
bagger(
  x,
  y,
  base_model = "CART",
  times = 11L,
  control = control_bag(),
  cost = NULL,
  ...
)

# S3 method for matrix
bagger(
  x,
  y,
  base_model = "CART",
  times = 11L,
  control = control_bag(),
  cost = NULL,
  ...
)

# S3 method for formula
bagger(
  formula,
  data,
  base_model = "CART",
  times = 11L,
  control = control_bag(),
  cost = NULL,
  ...
)

# S3 method for recipe
bagger(
  x,
  data,
  base_model = "CART",
  times = 11L,
  control = control_bag(),
  cost = NULL,
  ...
)

Arguments

x

A data frame, matrix, or recipe (depending on the method being used).

...

Optional arguments to pass to the base model function.

y

A numeric or factor vector of outcomes. Categorical outcomes (i.e., classes) should be represented as factors, not integers.

base_model

A single character value for the model being bagged. Possible values are "CART", "MARS", and "C5.0" (classification only).

times

A single integer greater than 1 for the maximum number of bootstrap samples/ensemble members (some model fits might fail).

control

A list of options generated by control_bag().

cost

A non-negative scalar (for two-class problems) or a cost matrix.

formula

An object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. Note that this package does not support multivariate outcomes and that, if some predictors are factors, dummy variables will not be created unless the underlying model function creates them.

data

A data frame containing the variables used in the formula or recipe.

Details

bagger() fits separate models to bootstrap samples. The prediction function for each model object is encoded in an R expression and the original model object is discarded. When making predictions, each prediction formula is evaluated on the new data and aggregated using the mean.
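The fit-on-bootstrap-samples, average-the-predictions idea described above can be sketched in base R. This is only a conceptual illustration, not the package's internal code: it uses lm() on the built-in mtcars data as a stand-in base model, keeps the full model objects rather than encoding the prediction function as an expression, and the object names (fits, member_preds, bagged_pred) are made up for the example.

```r
# Conceptual sketch of bagging in base R (not bagger()'s internal code).
set.seed(7687)
times <- 5

# Fit one model per bootstrap sample (rows drawn with replacement).
fits <- lapply(seq_len(times), function(i) {
  rows <- sample(nrow(mtcars), replace = TRUE)
  lm(mpg ~ wt + hp, data = mtcars[rows, ])
})

# Predict with every ensemble member on the new data.
# The result has one row per observation and one column per member.
new_data <- mtcars[1:3, ]
member_preds <- sapply(fits, predict, newdata = new_data)

# Aggregate the member predictions using the mean.
bagged_pred <- rowMeans(member_preds)
bagged_pred
```

Each column of member_preds is one member's prediction; taking the row means gives the bagged prediction for each new observation.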

Variable importance scores are calculated using the implementation in each base model's package. When requested (via control_bag(var_imp = TRUE)), the results are in a tibble with column names term (the predictor), value (the importance score), std.error (the standard error of the score across ensemble members), and used (the number of ensemble members whose prediction equation included the variable).

The models can be fit in parallel using the future package. To enable parallelism, use the future::plan() function to declare how the computations should be distributed. Note that this will almost certainly multiply the memory required to fit the models.
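As a rough sketch of the parallel setup, the snippet below declares a multisession plan before calling bagger() and restores the previous plan afterwards. It assumes that bagger() is provided by the baguette package, that the future package is installed, and that two local workers are available; the mtcars data and the name cart_bag_par are used only for illustration.

```r
library(baguette)  # assumed to be the package that provides bagger()
library(future)

# Declare how computations are distributed; plan() returns the previous
# plan invisibly, so it can be restored later.
old_plan <- plan(multisession, workers = 2)

# `times` is low to keep the example fast.
set.seed(7687)
cart_bag_par <- bagger(mpg ~ ., data = mtcars, base_model = "CART", times = 5)

# Restore the previous plan (sequential by default).
plan(old_plan)

cart_bag_par
```

Each worker is a separate R session holding its own copy of the data, which is where the extra memory use comes from.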

Examples

library(recipes)
library(dplyr)

data(biomass, package = "modeldata")

biomass_tr <-
  biomass %>%
  dplyr::filter(dataset == "Training") %>%
  dplyr::select(-dataset, -sample)

biomass_te <-
  biomass %>%
  dplyr::filter(dataset == "Testing") %>%
  dplyr::select(-dataset, -sample)

# ------------------------------------------------------------------------------

ctrl <- control_bag(var_imp = TRUE)

# ------------------------------------------------------------------------------

# `times` is low to make the examples run faster

set.seed(7687)
mars_bag <- bagger(x = biomass_tr[, -6], y = biomass_tr$HHV,
                   base_model = "MARS", times = 5, control = ctrl)
mars_bag
#> Bagged MARS (regression with 5 members)
#>
#> Variable importance scores include:
#>
#> # A tibble: 5 × 4
#>   term     value std.error  used
#>   <chr>    <dbl>     <dbl> <int>
#> 1 carbon   100       0         5
#> 2 hydrogen  18.1     2.66      5
#> 3 oxygen    11.6     2.53      4
#> 4 sulfur     1.56    0         1
#> 5 nitrogen   1.09    0.757     2
#>
var_imp(mars_bag)
#> # A tibble: 5 × 4
#>   term     value std.error  used
#>   <chr>    <dbl>     <dbl> <int>
#> 1 carbon   100       0         5
#> 2 hydrogen  18.1     2.66      5
#> 3 oxygen    11.6     2.53      4
#> 4 sulfur     1.56    0         1
#> 5 nitrogen   1.09    0.757     2

set.seed(7687)
cart_bag <- bagger(x = biomass_tr[, -6], y = biomass_tr$HHV,
                   base_model = "CART", times = 5, control = ctrl)
cart_bag
#> Bagged CART (regression with 5 members)
#>
#> Variable importance scores include:
#>
#> # A tibble: 5 × 4
#>   term     value std.error  used
#>   <chr>    <dbl>     <dbl> <int>
#> 1 carbon   5716.     226.      5
#> 2 oxygen   3190.     110.      5
#> 3 hydrogen 2297.     283.      5
#> 4 sulfur    464.      63.2     5
#> 5 nitrogen  268.      12.3     5
#>

# ------------------------------------------------------------------------------
# Other interfaces

# Recipes can be used
biomass_rec <-
  recipe(HHV ~ ., data = biomass_tr) %>%
  step_pca(all_predictors())

set.seed(7687)
cart_pca_bag <- bagger(biomass_rec, data = biomass_tr, base_model = "CART",
                       times = 5, control = ctrl)
cart_pca_bag
#> Bagged CART (regression with 5 members)
#>
#> Variable importance scores include:
#>
#> # A tibble: 5 × 4
#>   term  value std.error  used
#>   <chr> <dbl>     <dbl> <int>
#> 1 PC2   4500.     245.      5
#> 2 PC1   3559.     156.      5
#> 3 PC3   1107.     210.      5
#> 4 PC5    648.     137.      5
#> 5 PC4    468.      71.6     5
#>

# Using formulas
mars_bag <- bagger(HHV ~ ., data = biomass_tr, base_model = "MARS", times = 5,
                   control = ctrl)
mars_bag
#> Bagged MARS (regression with 5 members)
#>
#> Variable importance scores include:
#>
#> # A tibble: 5 × 4
#>   term     value std.error  used
#>   <chr>    <dbl>     <dbl> <int>
#> 1 carbon   100        0        5
#> 2 oxygen    22.4      3.09     5
#> 3 hydrogen  17.2      1.45     5
#> 4 sulfur     2.22     2.10     2
#> 5 nitrogen   1.39     0        1
#>
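
The examples above create a test set (biomass_te) but never use it. The following sketch shows one way the bagged model might be applied to it. It assumes that bagger() comes from the baguette package and that its predict() method takes a new_data argument and, for regression, returns a tibble with a .pred column; the names cart_fit and test_pred are illustrative only.

```r
library(baguette)  # assumed to be the package that provides bagger()
library(dplyr)

data(biomass, package = "modeldata")

biomass_tr <- biomass %>%
  dplyr::filter(dataset == "Training") %>%
  dplyr::select(-dataset, -sample)

biomass_te <- biomass %>%
  dplyr::filter(dataset == "Testing") %>%
  dplyr::select(-dataset, -sample)

# `times` is low to keep the example fast.
set.seed(7687)
cart_fit <- bagger(HHV ~ ., data = biomass_tr, base_model = "CART", times = 5)

# Each member's prediction is evaluated on the new data and averaged.
# Assumption: the result has one row per test-set row and a `.pred` column.
test_pred <- predict(cart_fit, new_data = biomass_te)
head(test_pred)
```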