Generate Bootstrap Sample with Added Noise
Source: R/synthetic_data_perturbation.R
Creates a bootstrap sample from a dataset and adds controlled noise to both continuous and categorical variables. This function is useful for generating synthetic datasets that keep the general structure of the original data while introducing controlled variation.
Usage
generate_bootstrap_with_noise(
  data,
  n = NULL,
  continuous_vars = NULL,
  cat_vars = NULL,
  id_var = "pid",
  seed = 123,
  noise_level = 0.1
)
Arguments
- data
A data frame containing the original dataset to bootstrap from.
- n
Integer. Number of observations in the output dataset. If NULL (default), uses the same number of rows as the input data.
- continuous_vars
Character vector of column names to treat as continuous variables. If NULL (default), automatically detects numeric columns.
- cat_vars
Character vector of column names to treat as categorical variables. If NULL (default), automatically detects factors, logical columns, and numeric columns with 10 or fewer unique values.
- id_var
Character string specifying the name of the ID variable column. This column will be reset to sequential values (1:n) in the output. Default is "pid".
- seed
Integer. Random seed for reproducibility. Default is 123.
- noise_level
Numeric between 0 and 1. Controls the amount of noise added. For continuous variables, this is multiplied by the standard deviation to determine noise magnitude. For categorical variables, this is divided by 2 to determine the probability of value changes. Default is 0.1.
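The noise_level scaling described above can be illustrated with a minimal sketch. The helper names add_continuous_noise() and perturb_categorical() are hypothetical and not part of the package. The use of Gaussian noise and of uniform replacement from the observed values are also assumptions, since the documentation only states how the noise magnitude and the change probability are derived.

```r
# Hypothetical helpers, not the package's implementation.
# Assumption: continuous noise is Gaussian with SD = noise_level * sd(x).
add_continuous_noise <- function(x, noise_level = 0.1) {
  noise_sd <- noise_level * stats::sd(x, na.rm = TRUE)
  # NA + noise is still NA, so missing values are left missing
  x + stats::rnorm(length(x), mean = 0, sd = noise_sd)
}

# Assumption: each non-missing value is redrawn with probability
# noise_level / 2 from the observed values. The redraw may return the
# same value, so the realised change rate can be below noise_level / 2.
perturb_categorical <- function(x, noise_level = 0.1) {
  flip_prob <- noise_level / 2
  observed <- unique(x[!is.na(x)])
  flip <- !is.na(x) & stats::runif(length(x)) < flip_prob
  if (length(observed) > 1 && any(flip)) {
    picks <- sample.int(length(observed), sum(flip), replace = TRUE)
    x[flip] <- observed[picks]
  }
  x
}
```

With the default noise_level = 0.1, a continuous column with a standard deviation of 10 would receive noise with a standard deviation of 1 under these assumptions, and each categorical value would be redrawn with probability 0.05.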
Value
A data frame with the same structure as the input data, containing bootstrap sampled observations with added noise.
Details
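As described in the arguments above, the function draws a bootstrap sample of n rows, adds noise to the continuous variables in proportion to their standard deviation, changes categorical values with probability noise_level / 2, and resets the id_var column to 1:n.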
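The resampling and ID-reset steps implied by the description and by the n, id_var, and seed arguments can be sketched as follows. The helper bootstrap_rows() is hypothetical. It assumes rows are drawn with replacement, as in a standard bootstrap, and that the seed is set once before sampling.

```r
# Hypothetical helper, not the package's implementation.
bootstrap_rows <- function(data, n = NULL, id_var = "pid", seed = 123) {
  if (is.null(n)) n <- nrow(data)       # default: same size as the input
  set.seed(seed)                        # reproducible draw
  idx <- sample.int(nrow(data), size = n, replace = TRUE)
  out <- data[idx, , drop = FALSE]
  rownames(out) <- NULL                 # drop the resampled row names
  if (id_var %in% names(out)) {
    out[[id_var]] <- seq_len(n)         # reset IDs to 1:n
  }
  out
}
```

Calling bootstrap_rows() twice with the same seed returns identical data frames, which is the behaviour the seed argument is documented to provide.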
Note
- The function assumes that categorical variables with numeric encoding should keep their numeric type unless they are factors in the input.
- Missing values (NA) are handled appropriately in calculations but are not imputed.
- For ordered factors or variables named "grade", the perturbation favors transitions to adjacent levels over distant levels.
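One way to favor adjacent levels when perturbing an ordered factor is sketched below. The helper perturb_ordered() and its adjacent_weight parameter are hypothetical. The documentation does not state the actual weighting, so the 0.8 versus 0.2 split is only an illustrative assumption.

```r
# Hypothetical helper, not the package's implementation.
# Assumption: each neighbouring level gets weight adjacent_weight and each
# more distant level gets weight 1 - adjacent_weight.
perturb_ordered <- function(x, noise_level = 0.1, adjacent_weight = 0.8) {
  stopifnot(is.ordered(x))
  k <- nlevels(x)
  codes <- as.integer(x)
  # Non-missing positions selected for change with probability noise_level / 2
  flip <- which(!is.na(codes) & stats::runif(length(codes)) < noise_level / 2)
  for (i in flip) {
    dist <- abs(seq_len(k) - codes[i])
    w <- ifelse(dist == 0, 0,
                ifelse(dist == 1, adjacent_weight, 1 - adjacent_weight))
    if (sum(w) > 0) codes[i] <- sample.int(k, 1, prob = w)
  }
  factor(levels(x)[codes], levels = levels(x), ordered = TRUE)
}
```

Under these assumptions, a value at grade 2 on a three-level scale can move to grade 1 or grade 3, while a value at grade 1 moves to grade 2 more often than to grade 3.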
Examples
if (FALSE) { # \dontrun{
# Load example dataset
data(gbsg, package = "survival")

# Basic usage with automatic variable detection
synthetic_data <- generate_bootstrap_with_noise(
  data = gbsg,
  seed = 123
)

# Specify variables explicitly
synthetic_data <- generate_bootstrap_with_noise(
  data = gbsg,
  n = 1000,
  continuous_vars = c("age", "size", "nodes", "pgr", "er", "rfstime"),
  cat_vars = c("meno", "grade", "hormon", "status"),
  id_var = "pid",
  seed = 456,
  noise_level = 0.15
)

# Create multiple synthetic datasets, one per seed
synthetic_list <- lapply(1:10, function(i) {
  generate_bootstrap_with_noise(data = gbsg, seed = i)
})
} # }