Generate Synthetic Survival Data using AFT Model with Flexible Subgroups
Source: R/generate_aft_dgm_main.R
generate_aft_dgm_flex.Rd
Creates a data generating mechanism (DGM) for survival data using an Accelerated Failure Time (AFT) model with Weibull distribution. Supports flexible subgroup definitions and treatment-subgroup interactions.
Usage
generate_aft_dgm_flex(
data,
continuous_vars,
factor_vars,
continuous_vars_cens = NULL,
factor_vars_cens = NULL,
set_beta_spec = list(set_var = NULL, beta_var = NULL),
outcome_var,
event_var,
treatment_var = NULL,
subgroup_vars = NULL,
subgroup_cuts = NULL,
draw_treatment = FALSE,
model = "alt",
k_treat = 1,
k_inter = 1,
n_super = 5000,
select_censoring = TRUE,
cens_type = "weibull",
cens_params = list(),
seed = 8316951,
verbose = TRUE,
standardize = FALSE,
spline_spec = NULL
)
Arguments
- data
A data.frame containing the input dataset to base the simulation on
- continuous_vars
Character vector of continuous variable names to be included as covariates (standardized only when standardize = TRUE)
- factor_vars
Character vector of factor/categorical variable names to be converted to dummy variables (largest value as reference)
- continuous_vars_cens
Character vector of continuous variable names to be used for censoring model. If NULL, uses same as continuous_vars. Default NULL
- factor_vars_cens
Character vector of factor variable names to be used for censoring model. If NULL, uses same as factor_vars. Default NULL
- set_beta_spec
List with elements 'set_var' and 'beta_var' for manually setting specific beta coefficients. Default list(set_var = NULL, beta_var = NULL)
- outcome_var
Character string specifying the name of the outcome/time variable
- event_var
Character string specifying the name of the event/status variable (1 = event, 0 = censored)
- treatment_var
Character string specifying the name of the treatment variable. If NULL, treatment will be randomly simulated with 50/50 allocation
- subgroup_vars
Character vector of variable names defining the subgroup. Default is NULL (no subgroups)
- subgroup_cuts
Named list of cutpoint specifications for subgroup variables. See Details section for flexible specification options
- draw_treatment
Logical indicating whether to redraw treatment assignment in simulation. Default is FALSE (use original assignments)
- model
Character string: "alt" for alternative model with subgroup effects, "null" for null model without subgroup effects. Default is "alt"
- k_treat
Numeric treatment effect modifier. Values >1 increase treatment effect, <1 decrease it. Default is 1 (no modification)
- k_inter
Numeric interaction effect modifier for treatment-subgroup interaction. Default is 1 (no modification)
- n_super
Integer specifying size of super population to generate. Default is 5000
- select_censoring
Logical. If TRUE (default), fits the censoring distribution to the observed censoring times in data using survreg with AIC-based selection among Weibull and log-normal models (with and without covariates). If FALSE, no model is fitted; the censoring distribution is specified entirely by cens_params. Default TRUE.
- cens_type
Character string specifying censoring distribution type: "weibull" or "uniform". Controls which parametric family is considered when select_censoring = TRUE, and determines the required structure of cens_params when select_censoring = FALSE. Default "weibull".
- cens_params
Named list of censoring distribution parameters. Interpretation depends on select_censoring and cens_type:
select_censoring = TRUE: Ignored; all parameters are estimated from data.
select_censoring = FALSE, cens_type = "uniform": Must supply min and max. If either is absent, defaults to 0.5 * min(y) and 1.5 * max(y) with a message.
select_censoring = FALSE, cens_type = "weibull": Must supply mu (log-scale location) and tau (scale). Optionally supply type ("weibull" or "lognormal"); defaults to "weibull". Censoring is treated as intercept-only (no covariate or treatment dependence): lin_pred_cens_0 = lin_pred_cens_1 = mu.
Default list().
- seed
Integer random seed for reproducibility. Default is 8316951
- verbose
Logical indicating whether to print diagnostic information during execution. Default is TRUE
- standardize
Logical indicating whether to standardize continuous variables. Default is FALSE
- spline_spec
List specifying spline configuration for treatment effect. Must include 'var' (variable name), 'knot', 'zeta', and 'log_hrs' (vector of length 3). Default NULL (no spline)
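The AIC-based censoring selection described under select_censoring can be sketched as follows. This is a minimal illustration and not the package's internal code: the toy data, the covariate x, and the candidate names are assumptions made for the example. Only survival::survreg, survival::Surv, and stats::AIC are real calls.

```r
library(survival)

set.seed(1)
n <- 500

# Hypothetical toy data: one covariate, Weibull event and censoring times
x          <- rnorm(n)
event_time <- rweibull(n, shape = 1.5, scale = 1000)
cens_time  <- rweibull(n, shape = 2.0, scale = 1500)
toy <- data.frame(
  y     = pmin(event_time, cens_time),
  event = as.numeric(event_time <= cens_time),
  x     = x
)

# For the censoring model, censoring is the "event", so flip the indicator
toy$cens <- 1 - toy$event

# Candidate censoring models: Weibull and log-normal, with and without covariates
candidates <- list(
  weibull_null   = survreg(Surv(y, cens) ~ 1, data = toy, dist = "weibull"),
  weibull_cov    = survreg(Surv(y, cens) ~ x, data = toy, dist = "weibull"),
  lognormal_null = survreg(Surv(y, cens) ~ 1, data = toy, dist = "lognormal"),
  lognormal_cov  = survreg(Surv(y, cens) ~ x, data = toy, dist = "lognormal")
)

# Pick the candidate with the smallest AIC
aic      <- sapply(candidates, AIC)
best     <- names(which.min(aic))
best_fit <- candidates[[best]]

# Intercept (log-scale location) and scale of the selected model
cens_mu  <- unname(coef(best_fit)[1])
cens_tau <- best_fit$scale
```

When select_censoring = FALSE, no such fit is run; values playing the role of cens_mu and cens_tau would instead be passed directly through cens_params as mu and tau.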
Value
A named list of class aft_dgm_flex containing:
- data
Simulated trial data frame with outcome, event, and treatment columns.
- model_params
Model parameters used for data generation (coefficients, dispersion, spline info if applicable).
- subgroup_info
Subgroup definition and membership indicators, if a heterogeneous treatment effect was specified.
- censoring_info
Censoring model parameters and observed censoring rate.
- call_args
Arguments used in the call, for reproducibility.
Details
Subgroup Cutpoint Specifications
The subgroup_cuts parameter accepts multiple flexible specifications:
Fixed Value
subgroup_cuts = list(er = 20) # er <= 20
Model Structure
The AFT model with Weibull distribution is specified as:
$$\log(T) = \mu + \gamma' X + \sigma \epsilon$$
where:
\(T\) is the survival time
\(\mu\) is the intercept
\(\gamma\) contains the covariate effects
\(X\) includes treatment, covariates, and treatment x subgroup interaction
\(\sigma\) is the scale parameter
\(\epsilon\) follows an extreme value distribution
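The model structure above can be simulated directly in base R. The sketch below is illustrative only: all parameter values, the covariate er, and the cut er <= 20 are assumptions chosen for the example, not package defaults. It also uses the standard Weibull AFT identity that a coefficient \(\gamma\) on the log-time scale corresponds to a log hazard ratio of \(-\gamma / \sigma\); the printed b0, gamma, and tau values in the Examples output appear consistent with this relation up to rounding.

```r
set.seed(42)
n <- 20000

# Illustrative parameter values (assumptions, not package defaults)
mu          <- 7.5     # intercept
sigma       <- 0.7     # scale parameter
gamma_treat <- 0.3     # treatment effect on the log-time scale
gamma_er    <- 0.002   # effect of a continuous covariate
gamma_inter <- -0.6    # treatment x subgroup interaction

treat    <- rbinom(n, 1, 0.5)        # 50/50 treatment allocation
er       <- rexp(n, rate = 1 / 50)   # hypothetical continuous covariate
subgroup <- as.numeric(er <= 20)     # fixed-value cut: er <= 20

# epsilon: standard (minimum) extreme value, i.e. log of a unit exponential
eps <- log(rexp(n))

# log(T) = mu + gamma' X + sigma * epsilon
lin_pred <- mu + gamma_treat * treat + gamma_er * er +
  gamma_inter * treat * subgroup
log_t     <- lin_pred + sigma * eps
surv_time <- exp(log_t)

# Weibull AFT: log hazard ratio for treatment is -gamma / sigma
log_hr_outside <- -gamma_treat / sigma                  # subgroup == 0
log_hr_inside  <- -(gamma_treat + gamma_inter) / sigma  # subgroup == 1
round(exp(c(outside = log_hr_outside, inside = log_hr_inside)), 3)
```

With these assumed values, treatment lengthens survival outside the subgroup (hazard ratio below 1) and shortens it inside the subgroup (hazard ratio above 1), which is the kind of treatment-subgroup interaction the "alt" model is meant to represent.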
References
Leon, L.F., et al. (2024). Statistics in Medicine.
Kalbfleisch, J.D. and Prentice, R.L. (2002). The Statistical Analysis of Failure Time Data (2nd ed.). Wiley.
Examples
# \donttest{
df <- survival::gbsg
dgm <- generate_aft_dgm_flex(
data = df,
outcome_var = "rfstime",
event_var = "status",
treatment_var = "hormon",
continuous_vars = c("age", "size", "nodes", "pgr", "er"),
factor_vars = "meno",
model = "null",
verbose = FALSE
)
str(dgm)
#> List of 8
#> $ df_super :'data.frame': 5000 obs. of 33 variables:
#> ..$ id : int [1:5000] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ y : int [1:5000] 191 2172 195 286 600 1730 1264 970 624 1193 ...
#> ..$ treat : int [1:5000] 0 1 0 1 0 0 0 0 0 0 ...
#> ..$ event : num [1:5000] 1 0 0 1 1 0 0 0 1 1 ...
#> ..$ z_age : int [1:5000] 45 59 51 34 53 49 50 66 59 51 ...
#> ..$ z_size : int [1:5000] 10 8 30 30 75 21 40 28 27 35 ...
#> ..$ z_nodes : int [1:5000] 1 2 1 12 19 5 1 2 20 1 ...
#> ..$ z_pgr : int [1:5000] 14 181 119 0 375 80 80 488 9 6 ...
#> ..$ z_er : int [1:5000] 3 0 44 5 107 152 21 298 2 1 ...
#> ..$ z_meno : num [1:5000] 0 1 0 0 0 1 1 1 1 1 ...
#> ..$ pid : int [1:5000] 1102 821 761 359 884 1469 1393 1131 987 94 ...
#> ..$ age : int [1:5000] 45 59 51 34 53 49 50 66 59 51 ...
#> ..$ meno : int [1:5000] 0 1 0 0 0 1 1 1 1 1 ...
#> ..$ size : int [1:5000] 10 8 30 30 75 21 40 28 27 35 ...
#> ..$ grade : int [1:5000] 2 2 2 3 3 2 2 2 3 3 ...
#> ..$ nodes : int [1:5000] 1 2 1 12 19 5 1 2 20 1 ...
#> ..$ pgr : int [1:5000] 14 181 119 0 375 80 80 488 9 6 ...
#> ..$ er : int [1:5000] 3 0 44 5 107 152 21 298 2 1 ...
#> ..$ zcens_age : int [1:5000] 45 59 51 34 53 49 50 66 59 51 ...
#> ..$ zcens_size : int [1:5000] 10 8 30 30 75 21 40 28 27 35 ...
#> ..$ zcens_nodes : int [1:5000] 1 2 1 12 19 5 1 2 20 1 ...
#> ..$ zcens_pgr : int [1:5000] 14 181 119 0 375 80 80 488 9 6 ...
#> ..$ zcens_er : int [1:5000] 3 0 44 5 107 152 21 298 2 1 ...
#> ..$ zcens_meno : num [1:5000] 0 1 0 0 0 1 1 1 1 1 ...
#> ..$ flag_harm : num [1:5000] 0 0 0 0 0 0 0 0 0 0 ...
#> ..$ lin_pred_1 : num [1:5000] 0.547 0.743 0.664 -0.118 0.182 ...
#> ..$ lin_pred_0 : num [1:5000] 0.263 0.459 0.38 -0.402 -0.102 ...
#> ..$ lin_pred_obs : num [1:5000] 0.263 0.743 0.38 -0.118 -0.102 ...
#> ..$ theta_0 : num [1:5000] -0.362 -0.633 -0.524 0.554 0.141 ...
#> ..$ theta_1 : num [1:5000] -0.755 -1.025 -0.916 0.162 -0.251 ...
#> ..$ loghr_po : num [1:5000] -0.392 -0.392 -0.392 -0.392 -0.392 ...
#> ..$ lin_pred_cens_0: num [1:5000] 0 0 0 0 0 0 0 0 0 0 ...
#> ..$ lin_pred_cens_1: num [1:5000] 0 0 0 0 0 0 0 0 0 0 ...
#> $ model_params :List of 6
#> ..$ mu : Named num 7.52
#> .. ..- attr(*, "names")= chr "(Intercept)"
#> ..$ tau : num 0.725
#> ..$ gamma : Named num [1:7] 0.28429 0.0075 -0.00626 -0.03902 0.00194 ...
#> .. ..- attr(*, "names")= chr [1:7] "treat" "z_age" "z_size" "z_nodes" ...
#> ..$ b0 : Named num [1:7] -0.39223 -0.01035 0.00864 0.05384 -0.00268 ...
#> .. ..- attr(*, "names")= chr [1:7] "treat" "z_age" "z_size" "z_nodes" ...
#> ..$ censoring :List of 4
#> .. ..$ mu : Named num 7.45
#> .. .. ..- attr(*, "names")= chr "(Intercept)"
#> .. ..$ tau : num 0.425
#> .. ..$ gamma: Named num(0)
#> .. .. ..- attr(*, "names")= chr(0)
#> .. ..$ type : chr "weibull"
#> ..$ spline_info: NULL
#> $ subgroup_info:List of 5
#> ..$ vars : NULL
#> ..$ cuts : NULL
#> ..$ definitions: list()
#> ..$ size : num 0
#> ..$ proportion : num 0
#> $ hazard_ratios:List of 7
#> ..$ overall : Named num 0.75
#> .. ..- attr(*, "names")= chr "treat"
#> ..$ AHR : num 0.676
#> ..$ AHR_harm : num NaN
#> ..$ AHR_no_harm: num 0.676
#> ..$ CDE : num 0.676
#> ..$ CDE_harm : num NA
#> ..$ CDE_no_harm: num 0.676
#> $ analysis_vars:List of 6
#> ..$ continuous: chr [1:5] "age" "size" "nodes" "pgr" ...
#> ..$ factor : chr "meno"
#> ..$ covariates: chr [1:6] "z_age" "z_size" "z_nodes" "z_pgr" ...
#> ..$ treatment : chr "treat"
#> ..$ outcome : chr "y_sim"
#> ..$ event : chr "event_sim"
#> $ model_type : chr "null"
#> $ n_super : num 5000
#> $ seed : num 8316951
#> - attr(*, "class")= chr [1:2] "aft_dgm_flex" "list"
# }