Generate Bootstrap Sample with Added Noise

Creates a bootstrap sample from a dataset and adds controlled noise to both continuous and categorical variables. Use it to generate synthetic datasets that keep the general structure of the original data while introducing a tunable amount of variation.

Usage

generate_bootstrap_with_noise(
  data,
  n = NULL,
  continuous_vars = NULL,
  cat_vars = NULL,
  id_var = "pid",
  seed = 123,
  noise_level = 0.1
)

Arguments

data: A data frame containing the original dataset to bootstrap from.
n: Integer. Number of observations in the output dataset. If NULL (default), uses the same number of rows as the input data.
continuous_vars: Character vector of column names to treat as continuous variables. If NULL (default), automatically detects numeric columns.
cat_vars: Character vector of column names to treat as categorical variables. If NULL (default), automatically detects factors, logical columns, and numeric columns with 10 or fewer unique values.
id_var: Character string specifying the name of the ID variable column. This column will be reset to sequential values (1:n) in the output. Default is "pid".
seed: Integer. Random seed for reproducibility. Default is 123.
noise_level: Numeric between 0 and 1. Controls the amount of noise added. For continuous variables, it is multiplied by each variable's standard deviation to give the standard deviation of the added noise. For categorical variables, it is divided by 2 to give the probability that a value changes. Default is 0.1, which means noise with 10% of the variable's standard deviation and a change probability of 0.05.
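
The automatic detection described for continuous_vars and cat_vars can be sketched as follows. This is a minimal illustration, not the function's actual code: the helper name detect_var_types and its max_levels argument are hypothetical, and the sketch assumes that a numeric column classified as categorical is excluded from the continuous set and that the ID column is skipped, neither of which is stated above.

```r
# Hypothetical helper: guess which columns are categorical vs. continuous.
# Assumption: low-cardinality numeric columns count as categorical only.
detect_var_types <- function(data, id_var = "pid", max_levels = 10) {
  cols <- setdiff(names(data), id_var)  # assumption: the ID column is skipped

  is_cat <- vapply(cols, function(col) {
    x <- data[[col]]
    is.factor(x) || is.logical(x) ||
      (is.numeric(x) && length(unique(x[!is.na(x)])) <= max_levels)
  }, logical(1))

  is_num <- vapply(cols, function(col) is.numeric(data[[col]]), logical(1))

  list(
    cat_vars = cols[is_cat],
    continuous_vars = cols[is_num & !is_cat]
  )
}

df <- data.frame(
  pid = 1:20,
  age = seq(20, 58, by = 2) + 0.5,        # 20 distinct values
  sex = factor(rep(c("f", "m"), 10)),
  smoker = rep(c(TRUE, FALSE), each = 10),
  grade = rep(1:4, 5)                     # numeric, but only 4 distinct values
)
types <- detect_var_types(df)
```

Passing continuous_vars and cat_vars explicitly avoids relying on any such heuristic.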

Value

A data frame with the same structure as the input data, containing bootstrap sampled observations with added noise.

Details

The function performs the following operations:

Bootstrap Sampling

Samples n observations with replacement from the original dataset.
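
As a rough sketch of this step, resampling rows with replacement and resetting the ID column might look like the code below. The helper name bootstrap_rows is hypothetical, and the real function may differ in details such as when the seed is set.

```r
# Hypothetical helper: draw n rows with replacement and reset the ID column.
bootstrap_rows <- function(data, n = NULL, id_var = "pid", seed = 123) {
  if (is.null(n)) n <- nrow(data)      # default: same size as the input
  set.seed(seed)                       # reproducible draws
  idx <- sample.int(nrow(data), size = n, replace = TRUE)
  out <- data[idx, , drop = FALSE]
  rownames(out) <- NULL
  if (id_var %in% names(out)) {
    out[[id_var]] <- seq_len(n)        # sequential IDs 1:n
  }
  out
}

original <- data.frame(pid = 101:105, x = c(2.5, 3.1, 4.8, 5.0, 6.2))
boot <- bootstrap_rows(original, n = 8)
```

Because the seed is fixed, calling the helper twice with the same arguments returns the same sample.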

Continuous Variable Noise

Adds Gaussian noise with standard deviation = original SD × noise_level
Constrains values to remain within original variable bounds
Preserves integer type for variables that appear to be integers
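
The three steps above can be sketched for a single column as follows. The helper name add_continuous_noise is hypothetical. The sketch assumes that "original variable bounds" means the observed minimum and maximum, and that a variable "appears to be an integer" when all of its non-missing values are whole numbers.

```r
# Hypothetical helper: add clamped Gaussian noise to one numeric vector.
add_continuous_noise <- function(x, noise_level = 0.1) {
  # Assumption: whole-valued columns are treated as integer-like.
  looks_integer <- all(x == round(x), na.rm = TRUE)
  noise_sd <- stats::sd(x, na.rm = TRUE) * noise_level
  bounds <- range(x, na.rm = TRUE)

  noisy <- x + stats::rnorm(length(x), mean = 0, sd = noise_sd)
  noisy <- pmin(pmax(noisy, bounds[1]), bounds[2])  # clamp to observed range

  if (is.integer(x)) {
    noisy <- as.integer(round(noisy))  # keep integer storage type
  } else if (looks_integer) {
    noisy <- round(noisy)              # keep whole-number values
  }
  noisy                                # NA inputs remain NA
}

set.seed(1)
height <- c(10.2, 11.5, 9.8, NA, 12.1, 10.9)
noisy_height <- add_continuous_noise(height, noise_level = 0.1)

visits <- c(1L, 3L, 5L, 7L)
noisy_visits <- add_continuous_noise(visits, noise_level = 0.5)
```

Clamping after adding noise piles a little probability mass onto the minimum and maximum, which is the price of never leaving the original range.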

Categorical Variable Perturbation

Changes values with probability = noise_level / 2
Binary variables: flips to opposite value
Multi-level unordered: randomly selects from other levels
Ordered factors (and variables named "grade"): weights selection toward adjacent levels
Preserves factor levels and ordering from original data
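
For the binary and unordered cases, the perturbation can be sketched as below. The helper name perturb_categorical is hypothetical, and the uniform choice among the other levels is one plausible reading of "randomly selects from other levels". With two levels, the only other level is the opposite value, so the same code also covers the binary flip.

```r
# Hypothetical helper: change each value with probability noise_level / 2.
perturb_categorical <- function(x, noise_level = 0.1) {
  p_change <- noise_level / 2
  lv <- if (is.factor(x)) levels(x) else sort(unique(x[!is.na(x)]))
  if (length(lv) < 2) return(x)        # nothing to switch to

  codes <- match(x, lv)                # position of each value in lv
  change <- !is.na(codes) & stats::runif(length(x)) < p_change

  for (i in which(change)) {
    others <- setdiff(seq_along(lv), codes[i])
    # sample.int avoids the sample() surprise when only one level remains
    codes[i] <- others[sample.int(length(others), 1)]
  }

  if (is.factor(x)) {
    factor(lv[codes], levels = lv, ordered = is.ordered(x))
  } else {
    lv[codes]                          # numeric and logical types are kept
  }
}

set.seed(1)
sex <- factor(c("f", "m", "m", "f", "m", NA))
noisy_sex <- perturb_categorical(sex, noise_level = 0.1)
```

Working on level positions rather than on the values themselves lets one code path serve factors, logicals, and numeric codes.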

Note

The function assumes that categorical variables with numeric encoding should keep their numeric type; only variables that are factors in the input are returned as factors.
Missing values (NA) are handled appropriately in calculations but are not imputed.
For ordered factors or variables named "grade", the perturbation favors transitions to adjacent levels over distant levels.
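
One way to favor adjacent levels is to weight each candidate level by the inverse of its distance from the current level. The documentation does not state the weighting actually used, so the scheme and the helper names adjacent_weights and perturb_ordered below are assumptions made for illustration only.

```r
# Hypothetical weighting: probability proportional to 1 / distance in levels.
adjacent_weights <- function(current, n_levels) {
  d <- abs(seq_len(n_levels) - current)
  w <- ifelse(d == 0, 0, 1 / d)        # never re-select the current level
  w / sum(w)
}

# Hypothetical helper: perturb an ordered factor toward nearby levels.
perturb_ordered <- function(x, noise_level = 0.1) {
  stopifnot(is.ordered(x))
  k <- nlevels(x)
  if (k < 2) return(x)

  codes <- as.integer(x)
  change <- !is.na(codes) & stats::runif(length(codes)) < noise_level / 2

  for (i in which(change)) {
    codes[i] <- sample.int(k, size = 1, prob = adjacent_weights(codes[i], k))
  }
  factor(levels(x)[codes], levels = levels(x), ordered = TRUE)
}

# From level 2 of 4, the neighbors (1 and 3) are twice as likely as level 4.
w <- adjacent_weights(2, 4)

set.seed(1)
severity <- factor(c("low", "mid", "high", "mid"),
                   levels = c("low", "mid", "high"), ordered = TRUE)
noisy_severity <- perturb_ordered(severity, noise_level = 0.1)
```

A steeper decay, such as 1 / distance^2, would concentrate changes even more strongly on neighboring levels.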