Extreme Subgroup Effects Under a Uniform Treatment Benefit

Overview

A central concern in exploratory subgroup analysis is the potential for extreme results driven entirely by chance, not by genuine treatment effect heterogeneity. When many subgroups are examined in a single dataset — as is routine in regulatory submissions and trial reports — small subgroups in particular can produce hazard ratio estimates that look dramatic simply because they are noisy. The smaller the subgroup, the wider the confidence interval, and the higher the probability that the upper bound strays far from the truth.

This vignette evaluates that risk directly. We generate thousands of synthetic trials from a data generating mechanism (DGM) with a perfectly uniform treatment benefit — the true hazard ratio is identical for every patient — and ask: how often do standard Cox model analyses produce extreme subgroup results?

The vignette is organised as a complete, self-contained workflow:

Building the DGM — fitting a Weibull AFT model to real data, constructing a super-population, and calibrating the censoring process so that simulated trials reproduce the censoring pattern of the source data. This step establishes the null scenario: a DGM where the true HR is 0.70 for every patient, with no subgroup differential effects.
Running simulated trials — drawing 1000 synthetic trial replicates from the null DGM and fitting a stratified Cox model in each pre-defined subgroup.
Characterising the distribution of extreme results — for each subgroup, summarising how variable the HR point estimate is across simulations (the empirical 99% CI, or ECI) and how large the upper confidence bound UB(HR) can become by chance alone. Random benchmark subgroups (random15, random20, random40, random60) with no clinical meaning serve as calibration anchors, making explicit that clinical subgroups of the same size are statistically indistinguishable from meaningless random groupings.

All examples use the GBSG breast cancer dataset (survival::gbsg).

Example of Subgroups Under Study

For the study under question, the subgroups and features (size, events, etc) are of course specific to that study. While we are using the GBSG breast cancer study as the case-study, the analyses can be applied or modified based on any study.

An example of pre-defined subgroups, ranging from clinically motivated subpopulations to deliberately artificial ones:

Subgroup	Basis	Approximate N
All Patients (ITT)	Full trial	~700
No liver metastases	Clinical	~400
BRAF V600E mutant	Molecular	~300
Age ≤ 65	Demographic	~400
Spain, Germany, UK	Country/region	~60–150
Asia Pacific region	Region	~100
ECOG 0 / ECOG 1	Performance status	~300 each
random15, random20	Randomly sampled	~15, ~20
random40, random60	Randomly sampled	~40, ~60

The random* subgroups are the key benchmark. They are constructed by randomly sampling approximately 20% of subjects and taking the first 15, 20, 40, and 60 patients:

set.seed(8316951)
sg_ran <- rbinom(nrow(df_analysis), size = 1, prob = 0.2)
temp   <- cumsum(sg_ran)

sg_ran15            <- sg_ran
sg_ran15[temp > 15] <- 0   # first 15 selected subjects
df_analysis$random15 <- sg_ran15

sg_ran40            <- sg_ran
sg_ran40[temp > 40] <- 0
df_analysis$random40 <- sg_ran40

These subgroups have no clinical meaning whatsoever — their true HR equals the ITT HR of 0.70. Any extreme result they produce is purely a function of their small size and sampling variability.

Six Analysis Variants

For each simulated trial and each subgroup, six Cox model specifications are fitted, covering the range of approaches one might encounter in practice:

Label	Model specification
`sR`	Stratified by original randomisation factors
`none`	Unstratified
`sX`	Stratified by ECOG status
`sBM*`	Stratified by biomarker above/below optimal cutpoint
`sX+sR`	Stratified by ECOG and randomisation factors
`X+O+sR`	Covariate-adjusted (ECOG, age, sex) + stratified

Summary Statistics Across 5 000 Simulations

For each subgroup and analysis, two quantities are accumulated across all simulated trials:

HR point estimate distribution. Reported as median and the 1st–99th percentile interval (the “empirical 99% confidence interval” or ECI). Under truth, the median should be close to 0.70 for every subgroup. The spread of the ECI reflects how variable the estimate is — wider for smaller subgroups.

Upper confidence bound UB(HR). The one-sided 97.5% upper bound of the Cox 95% confidence interval. Two key thresholds are tracked: - UB(HR) < 1.0: the upper bound excludes no treatment effect (nominally “significant” benefit). - HR < 0.80: the point estimate alone looks like a meaningful benefit (at the ITT threshold level).

Under the null of a uniform HR = 0.70 with no subgroup effects, UB(HR) < 1.0 should occur approximately 95% of the time for the full ITT — it simply reflects power to detect the overall benefit. For subgroups, the rate falls with sample size, but the distribution of UB(HR) among those trials where UB(HR) ≥ 1 reveals how extreme a false non-significance result can look.

We note that continuous biomarkers can also be considered. This is more involved and not necessary for uniform effects, but valuable for trials with biomarkers when there is interest in exploring the dynamics of subgroups in the presence of actual biomarker effects (correlated with other patients’ characteristics).

Setting: Uniform Treatment Effect

The DGM is built from a real trial dataset using generate_aft_dgm_flex() with a single-knot spline biomarker model. The key feature is that the true log hazard ratio is constant across all biomarker values:

$\psi^0(z) = \log(\text{HR}_{\text{true}}) \quad \text{for all } z.$

In the examples below the true overall HR is 0.70 (a moderate benefit). Every subgroup — regardless of how it is defined — shares this same underlying HR. There is no subgroup effect; any apparent finding is a false positive.

The spline parameterisation is specified via log.hrs = log(c(hr0, hr_k, hr_z)), where hr0, hr_k, hr_z are the true HRs at biomarker values 0, the knot k, and an upper anchor zeta, respectively. Under the uniform scenario all three are set to the same value:

log.hrs <- log(c(0.70, 0.70, 0.70))  # uniform HR = 0.70 everywhere
dgm     <- get_dgm_stratified(
  df        = df_analysis,
  knot      = 40,
  kappa     = 50,
  hrs       = c(0.70, 0.70, 0.70),
  strata_tte = "strataNew"
)

Installation

The development version of forestsearch can be installed directly from GitHub using either pak (recommended) or remotes:

# Using pak (recommended)
# install.packages("pak")
pak::pak("larry-leon/forestsearch")

# Using remotes
# install.packages("remotes")
remotes::install_github("larry-leon/forestsearch")

The package depends on weightedsurv, which is installed automatically from GitHub via the Remotes field in DESCRIPTION — no separate step is needed.

Once installed, load the package in the usual way:

library(forestsearch)

Documentation

Full documentation, including function reference and vignettes, is available at the pkgdown site:

https://larry-leon.github.io/forestsearch/

The site includes, along with this vignette:

Getting Started — end-to-end walkthrough with the GBSG breast cancer trial (vignette("forestsearch"))
Methodology — statistical framework, algorithm details, and simulation results (vignette("methodology"))
Treatment Effect Definitions — marginal HR, average hazard ratio (AHR), and controlled direct effect (CDE) estimands
Simulation Studies — operating characteristics and published benchmarks
Reference — all exported functions organised across 16 topic areas

Vignettes can also be accessed locally after installation:

vignette("forestsearch")   # Getting Started
vignette("methodology")    # Statistical methodology

Setup

library(forestsearch)
library(survival)
library(ggplot2)
library(dplyr)
library(patchwork)     # assembles gg_forest() panels
library(here)          # here::here() for source() path in forest-load-ggforest

theme_set(theme_minimal(base_size = 12))

# Load GBSG data and convert time to months (DGM time scale)
data(gbsg, package = "survival")

gbsg$time_months <- gbsg$rfstime / 30.4375
gbsg$treat        <- gbsg$hormon

cat("N =", nrow(gbsg), "  Events:", sum(gbsg$status),
    sprintf("(%.0f%%)", 100 * mean(gbsg$status)), "\n")

## N = 686   Events: 299 (44%)

cat("Median follow-up (months):",
    round(median(gbsg$time_months), 1), "\n")

## Median follow-up (months): 35.6

cat("Median event time (months):",
    round(median(gbsg$time_months[gbsg$status == 1]), 1), "\n")

## Median event time (months): 21.2

Part I — Building the Null Data Generating Mechanism

To evaluate extreme subgroup effects by chance, we need a DGM where the treatment has a genuine, uniform benefit — so that any subgroup finding is definitionally a false positive. This section constructs that DGM from the GBSG data using generate_aft_dgm_flex() with model = "null" (uniform treatment effect throughout; no embedded harm subgroup).

Why time-scale consistency matters first

Before fitting anything, confirm that all time parameters are expressed in the same units. The single most common simulation failure is building the DGM on days (e.g. rfstime) and then passing analysis_time in months. The result is a follow-up window far shorter than the DGM event-time distribution, producing universal administrative censoring (event_sim = 0 for every subject).

After building the DGM, this quick check catches the problem:

# exp(mu) is the implied median event time on the DGM time scale.
# It should be close to the observed median event time in months.
cat("Observed median event time (months):",
    round(median(gbsg$time_months[gbsg$status == 1]), 1), "\n")

## Observed median event time (months): 21.2

cat("(After building dgm, exp(dgm$model_params$mu) should be similar.)\n")

## (After building dgm, exp(dgm$model_params$mu) should be similar.)

Fitting `generate_aft_dgm_flex()`

generate_aft_dgm_flex() fits a Weibull AFT model to the source data and generates a large super population from which all future trials will be drawn. For the extreme-subgroup evaluation we use model = "null", which sets the treatment effect to be identical for every patient — there is no subgroup, so any apparent finding is a false positive by construction.

The censoring model is estimated from the data by AIC-based selection across Weibull and log-normal models, with and without covariate adjustment. The next section shows how to verify and calibrate this so that simulated trials reproduce the censoring pattern of the source data.

Note that this is an example of a DGM (truth) where there is a subgroup effect (ER-low & pre-menopausal). Simulations under (the null of) uniform treatment benefit are provided below in Part IV

set.seed(42)

dgm <- generate_aft_dgm_flex(
  data            = gbsg,
  continuous_vars = c("age", "size", "nodes", "pgr", "er"),
  factor_vars     = c("meno", "grade"),
  outcome_var     = "time_months",
  event_var       = "status",
  treatment_var   = "hormon",
  subgroup_vars   = c("er", "meno"),
  subgroup_cuts   = list(er = 20, meno = 0),   # ER-low & pre-menopausal
  model           = "alt",
  k_inter         = 1.5,    # moderate interaction
  n_super         = 5000,
  seed            = 42,
  verbose         = FALSE
)

## 
##  ====================================================================== 
##  CENSORING MODEL SELECTION:
##  MULTIPLE SURVIVAL REGRESSION MODEL COMPARISONS
## ====================================================================== 
## 
## CONVERGENCE SUMMARY ( 4 / 4  converged):
## -------------------------------------------------- 
## 
## Weibull (weibull):
##   Status:  ✓ Converged 
##   Iterations:  9 
##   Log-likelihood:  -1838.404 
##   Scale parameter:  0.418 
## 
## LogNormal (lognormal):
##   Status:  ✓ Converged 
##   Iterations:  4 
##   Log-likelihood:  -2000.309 
##   Scale parameter:  0.8707 
## 
## Weibull0 (weibull):
##   Status:  ✓ Converged 
##   Iterations:  7 
##   Log-likelihood:  -1843.717 
##   Scale parameter:  0.4246 
## 
## LogNormal0 (lognormal):
##   Status:  ✓ Converged 
##   Iterations:  6 
##   Log-likelihood:  -2004.326 
##   Scale parameter:  0.8768 
## 
## MODEL COMPARISON TABLE:
## -------------------------------------------------- 
##       Model     AIC delta_AIC AIC_weight     BIC delta_BIC
##    Weibull0 3691.43      0.00      0.976 3700.50      0.00
##     Weibull 3698.81      7.37      0.024 3748.65     48.15
##  LogNormal0 4012.65    321.22      0.000 4021.71    321.22
##   LogNormal 4022.62    331.18      0.000 4072.46    371.96
## 
## MODEL RANKINGS:
## -------------------------------------------------- 
## By AIC:  Weibull0 > Weibull > LogNormal0 > LogNormal 
## By BIC:  Weibull0 > Weibull > LogNormal0 > LogNormal 
## 
## EVIDENCE ASSESSMENT:
## -------------------------------------------------- 
##   Strength:  Strong evidence 
##   Details:  Top model has 97.6% of AIC weight 
## 
## FINAL RECOMMENDATION:
## -------------------------------------------------- 
##   Clear winner: Weibull0 (best by both AIC and BIC)
##   Evidence: Strong evidence - Top model has 97.6% of AIC weight 
## ======================================================================

# Key DGM outputs
cat("=== DGM Summary ===\n")

## === DGM Summary ===

cat("Time scale          : months\n")

## Time scale          : months

cat("Implied median T    : exp(mu) =",
    round(exp(dgm$model_params$mu), 1), "months\n")

## Implied median T    : exp(mu) = 57.8 months

cat("Subgroup proportion :", round(dgm$subgroup_info$proportion, 3), "\n")

## Subgroup proportion : 0.188

cat("True HR (harm)      :",
    round(dgm$hazard_ratios$harm_subgroup, 3), "\n")

## True HR (harm)      : 1.206

cat("True HR (no harm)   :",
    round(dgm$hazard_ratios$no_harm_subgroup, 3), "\n")

## True HR (no harm)   : 0.729

cat("True HR (overall)   :",
    round(dgm$hazard_ratios$overall, 3), "\n")

## True HR (overall)   : 0.781

cat("Censoring type      :",
    dgm$model_params$censoring$type, "\n")

## Censoring type      : weibull

Part II — Censoring Diagnostics and Calibration

For the simulation results to be credible, the censoring pattern in each synthetic trial must match what was observed in the source data. If the simulated censoring rate is much higher or lower than the real one, the event counts in small subgroups will be systematically wrong, biasing the HR and UB(HR) distributions we are trying to characterise.

Even with a well-fitted censoring model, the interaction between the DGM censoring distribution and administrative censoring (the analysis_time cutoff) can shift the effective censoring rate substantially away from what was observed. This section covers two functions that diagnose and correct for this shift before the main simulation loop is run.

Diagnosing censoring consistency

check_censoring_dgm() compares three censoring summaries between the DGM reference data (dgm$df_super) and a simulated dataset: overall censoring rate, censoring-time quantiles among censored subjects, and the KM-based median censoring time (estimated by reversing the event indicator so that censored observations become the “event” of interest).

# Generate a pilot simulated trial (no calibration yet)
sim_pilot <- simulate_from_dgm(
  dgm           = dgm,
  n             = 1500,
  analysis_time = 84,      # months — 7-year follow-up
  max_entry     = 24,      # months — 2-year recruitment window
  cens_adjust   = 0,
  seed          = 1
)

cat("Pilot trial:  N =", nrow(sim_pilot),
    "  Event rate:", round(mean(sim_pilot$event_sim), 3), "\n")

## Pilot trial:  N = 1500   Event rate: 0.419

diag0 <- check_censoring_dgm(
  sim_data   = sim_pilot,
  dgm        = dgm,
  rate_tol   = 0.10,
  median_tol = 0.25,
  verbose    = TRUE
)

## 
## =========================================================
##   Censoring Diagnostic: DGM Reference vs Simulated
## =========================================================
##   DGM censoring type  : weibull
##   DGM mu_cens         : 4.0312  [exp = 56.33 time units]
##   DGM tau_cens        : 0.4246
##   Reference n (super) : 5000
##   Simulated n         : 1500
## 
## --- 1. Censoring Rates ---
##    Group Ref_rate Sim_rate    Diff
##  Overall    57.5%    58.1% +0.5 pp
##    Arm 0    54.9%    51.5% -3.5 pp
##    Arm 1    62.4%    64.7% +2.3 pp
## 
## --- 2. Censoring-Time Quantiles (censored subjects only) ---
##     Ref: y[event == 0]  |  Sim: c_time[event_sim == 0]
##  Quantile   Ref   Sim Ratio
##       25% 28.19 27.56 0.978
##       50% 47.05 40.91 0.870
##       75% 60.48 54.85 0.907
##       90% 70.57 65.55 0.929
## 
## --- 3. KM Median Censoring Time (reversed event indicator) ---
##    Group Ref_median Sim_median Ratio
##  Overall   53.51951   46.21436 0.864
##    Arm 0   50.16838   48.23731 0.962
##    Arm 1   56.57495   45.51736 0.805
## 
## --- 4. Implied Time-Scale Check ---
##   exp(params$mu)          = 57.83  [median outcome time, DGM scale]
##   exp(params$censoring$mu)= 56.33  [median censoring time, DGM scale]
##   If these values are implausibly large relative to your
##   analysis_time, the DGM was likely built on a different time
##   scale (e.g. days vs months).
## 
## [OK] No substantial discrepancies detected.
## =========================================================

Reading the output. Flags warn when the absolute censoring-rate discrepancy exceeds rate_tol (default 10 percentage points) or when the KM median censoring times differ by more than median_tol (default 25%). Common causes:

Time-scale mismatch — check exp(dgm$model_params$mu) against analysis_time.
Short analysis_time — administrative censoring dominates the censoring process, inflating the censoring rate and compressing censoring times.
Large cens_adjust — the log-scale shift has moved the DGM censoring distribution far from the original data.

Calibrating `cens_adjust`

calibrate_cens_adjust() uses uniroot to find the value of cens_adjust — a log-scale shift applied to the DGM censoring model — that equates a chosen summary statistic between the reference and simulated data. Two targets are available: "rate" (overall censoring rate) and "km_median" (KM-based median censoring time).

How the objective function works. At each candidate cens_adjust:

simulate_from_dgm() generates n_eval synthetic subjects.
check_censoring_dgm() extracts the target metric.
The objective returns sim_metric - ref_metric.

uniroot finds the zero crossing. The fixed seed inside the objective ensures the same candidate is always evaluated identically, preventing the root-finder from chasing Monte Carlo noise.

cal <- calibrate_cens_adjust(
  dgm           = dgm,
  target        = "rate",
  n             = 1000,
  analysis_time = 84,
  max_entry     = 24,
  seed          = 42,
  interval      = c(-3, 3),
  n_eval        = 2000,
  verbose       = TRUE
)

## 
## -------------------------------------------------------
##   calibrate_cens_adjust()
## -------------------------------------------------------
##   Target metric    : rate
##   Reference value  : 0.5752
##   Search interval  : [-3.00, 3.00]
##   n_eval per call  : 2000
##   Tolerance        : 1.0e-04
## -------------------------------------------------------
##   f(interval[1] = -3.00) = +0.4093
##   f(interval[2] = 3.00) = -0.1867
##   Searching...
## 
## =========================================================
##   Censoring Diagnostic: DGM Reference vs Simulated
## =========================================================
##   DGM censoring type  : weibull
##   DGM mu_cens         : 4.0312  [exp = 56.33 time units]
##   DGM tau_cens        : 0.4246
##   Reference n (super) : 5000
##   Simulated n         : 1000
## 
## --- 1. Censoring Rates ---
##    Group Ref_rate Sim_rate    Diff
##  Overall    57.5%    55.2% -2.3 pp
##    Arm 0    54.9%    50.4% -4.5 pp
##    Arm 1    62.4%    60.0% -2.4 pp
## 
## --- 2. Censoring-Time Quantiles (censored subjects only) ---
##     Ref: y[event == 0]  |  Sim: c_time[event_sim == 0]
##  Quantile   Ref   Sim Ratio
##       25% 28.19 28.24 1.002
##       50% 47.05 43.15 0.917
##       75% 60.48 58.44 0.966
##       90% 70.57 68.49 0.971
## 
## --- 3. KM Median Censoring Time (reversed event indicator) ---
##    Group Ref_median Sim_median Ratio
##  Overall   53.51951   50.41151 0.942
##    Arm 0   50.16838   50.80768 1.013
##    Arm 1   56.57495   50.33982 0.890
## 
## --- 4. Implied Time-Scale Check ---
##   exp(params$mu)          = 57.83  [median outcome time, DGM scale]
##   exp(params$censoring$mu)= 56.33  [median censoring time, DGM scale]
##   If these values are implausibly large relative to your
##   analysis_time, the DGM was likely built on a different time
##   scale (e.g. days vs months).
## 
## [OK] No substantial discrepancies detected.
## =========================================================
## 
## 
##   === Calibration Result ===
##   cens_adjust      : 0.034371
##   Reference rate      : 0.5752
##   Simulated rate      : 0.5520
##   Residual         : 0.0232
##   uniroot iters    : 13
## -------------------------------------------------------

cat("\nCalibrated cens_adjust :", round(cal$cens_adjust, 4), "\n")

## 
## Calibrated cens_adjust : 0.0344

cat("Reference censoring rate:", round(cal$ref_value,    3),  "\n")

## Reference censoring rate: 0.575

cat("Achieved  censoring rate:", round(cal$sim_value,    3),  "\n")

## Achieved  censoring rate: 0.552

cat("Residual                :", round(cal$residual,     4),  "\n")

## Residual                : 0.0232

Monotonicity note. The objective is monotone in cens_adjust for both targets: larger values increase censoring times (lower rate, higher KM median), smaller values do the opposite. If uniroot fails, the target lies outside the interval; widen it and retry.

Verifying the calibrated simulation

sim_cal <- simulate_from_dgm(
  dgm           = dgm,
  n             = 1000,
  analysis_time = 84,
  max_entry     = 24,
  cens_adjust   = cal$cens_adjust,
  seed          = 123
)

diag_cal <- check_censoring_dgm(
  sim_data   = sim_cal,
  dgm        = dgm,
  verbose    = TRUE,
  rate_tol   = 0.05,
  median_tol = 0.15
)

## 
## =========================================================
##   Censoring Diagnostic: DGM Reference vs Simulated
## =========================================================
##   DGM censoring type  : weibull
##   DGM mu_cens         : 4.0312  [exp = 56.33 time units]
##   DGM tau_cens        : 0.4246
##   Reference n (super) : 5000
##   Simulated n         : 1000
## 
## --- 1. Censoring Rates ---
##    Group Ref_rate Sim_rate    Diff
##  Overall    57.5%    57.5% -0.0 pp
##    Arm 0    54.9%    52.6% -2.3 pp
##    Arm 1    62.4%    62.4% +0.0 pp
## 
## --- 2. Censoring-Time Quantiles (censored subjects only) ---
##     Ref: y[event == 0]  |  Sim: c_time[event_sim == 0]
##  Quantile   Ref   Sim Ratio
##       25% 28.19 29.36 1.042
##       50% 47.05 43.83 0.932
##       75% 60.48 59.77 0.988
##       90% 70.57 67.15 0.952
## 
## --- 3. KM Median Censoring Time (reversed event indicator) ---
##    Group Ref_median Sim_median Ratio
##  Overall   53.51951   50.66493 0.947
##    Arm 0   50.16838   52.70067 1.050
##    Arm 1   56.57495   48.96581 0.866
## 
## --- 4. Implied Time-Scale Check ---
##   exp(params$mu)          = 57.83  [median outcome time, DGM scale]
##   exp(params$censoring$mu)= 56.33  [median censoring time, DGM scale]
##   If these values are implausibly large relative to your
##   analysis_time, the DGM was likely built on a different time
##   scale (e.g. days vs months).
## 
## [OK] No substantial discrepancies detected.
## =========================================================

Part III — Generating Synthetic Trials

With the null DGM built, diagnosed, and calibrated, each simulated trial is a single call to simulate_from_dgm(). The function samples from the super population with replacement, assigns treatment, generates latent survival times from the AFT model, applies the calibrated censoring model, and truncates at analysis_time.

The single-trial example below illustrates the output structure. In the null DGM every patient has the same true HR of 0.70, so flag_harm is always 0 — there is no embedded subgroup. The 1000 replications that drive the extreme-subgroup analysis in Part IV follow the same pattern, looping over subgroups inside each replicate.

sim_data <- simulate_from_dgm(
  dgm           = dgm,
  n             = 600,
  rand_ratio    = 1,
  analysis_time = 84,
  max_entry     = 24,
  cens_adjust   = cal$cens_adjust,
  seed          = 100
)

cat("Simulated trial:  N =", nrow(sim_data), "\n")

## Simulated trial:  N = 600

cat("Treatment allocation:",
    table(sim_data$treat_sim)[["0"]], "control /",
    table(sim_data$treat_sim)[["1"]], "experimental\n")

## Treatment allocation: 300 control / 300 experimental

cat("Overall event rate :", round(mean(sim_data$event_sim), 3), "\n")

## Overall event rate : 0.412

cat("Event rate (harm subgroup):",
    round(mean(sim_data$event_sim[sim_data$flag_harm == 1]), 3), "\n")

## Event rate (harm subgroup): 0.57

cat("Event rate (no-harm)      :",
    round(mean(sim_data$event_sim[sim_data$flag_harm == 0]), 3), "\n")

## Event rate (no-harm)      : 0.374

library(survival)

# KM for overall, harm subgroup, and complementary subgroup
km_all  <- survfit(Surv(y_sim, event_sim) ~ treat_sim,
                   data = sim_data)
km_harm <- survfit(Surv(y_sim, event_sim) ~ treat_sim,
                   data = subset(sim_data, flag_harm == 1))
km_noharm <- survfit(Surv(y_sim, event_sim) ~ treat_sim,
                     data = subset(sim_data, flag_harm == 0))

# Quick summary of true vs observed HRs
cox_all   <- coxph(Surv(y_sim, event_sim) ~ treat_sim, data = sim_data)
cox_harm  <- coxph(Surv(y_sim, event_sim) ~ treat_sim,
                   data = subset(sim_data, flag_harm == 1))
cox_noharm <- coxph(Surv(y_sim, event_sim) ~ treat_sim,
                    data = subset(sim_data, flag_harm == 0))

results_df <- data.frame(
  Subgroup   = c("ITT", "Harm (true)", "No-harm (true)"),
  N          = c(nrow(sim_data),
                 sum(sim_data$flag_harm == 1),
                 sum(sim_data$flag_harm == 0)),
  True_HR    = round(c(dgm$hazard_ratios$overall,
                       dgm$hazard_ratios$harm_subgroup,
                       dgm$hazard_ratios$no_harm_subgroup), 3),
  Observed_HR = round(exp(c(coef(cox_all),
                             coef(cox_harm),
                             coef(cox_noharm))), 3)
)
print(results_df)

##         Subgroup   N True_HR Observed_HR
## 1            ITT 600   0.781       0.984
## 2    Harm (true) 114   1.206       2.232
## 3 No-harm (true) 486   0.729       0.782

Part IV — Main Results: Extreme Subgroup Effects Under a Uniform Treatment Benefit

The Core Question

Parts I–III established how to build a null DGM and generate synthetic trials from it. We now use those tools of forestsearch: when many subgroups are examined in a trial with a perfectly uniform treatment benefit, how extreme can a standard Cox analysis look by chance alone?

We generate 1000 synthetic trials from the null DGM and fit a stratified Cox model in each pre-defined subgroup in every trial. Two output statistics are accumulated:

The HR point estimate distribution across trials — summarised as the median and the 1st–99th percentile interval (the empirical 99% confidence interval, ECI). The median should track 0.70 for every subgroup; the width of the ECI reveals how noisy a single-trial estimate is.
The upper confidence bound UB(HR) — the upper end of the Cox 95%
1. We track Pr(UB < 1.0) (power to show nominal benefit) and the conditional distribution of UB(HR) given UB ≥ 1 (how extreme a “failure to show benefit” looks in the worst-case trial).

The random* subgroups (no clinical basis, same true HR as everyone else) serve as calibration anchors. Any clinical subgroup of the same N has an identical chance distribution — the clinical rationale for the grouping provides no statistical protection.

Setting: Uniform Treatment Effect

The DGM is built from the GBSG breast cancer trial using generate_aft_dgm_flex(). The key feature is that the true log hazard ratio is constant across all patients:

$\psi^0(\mathbf{L}) = \log(\mathrm{HR}_{\mathrm{true}}) \quad \text{for all } \mathbf{L}.$

In the examples below the true overall HR is 0.70 (a moderate benefit). Every subgroup — regardless of how it is defined — shares this same underlying HR.

set.seed(99)

dgm_null <- generate_aft_dgm_flex(
  data            = gbsg,
  continuous_vars = c("age", "size", "nodes", "pgr", "er"),
  factor_vars     = c("meno", "grade"),
  outcome_var     = "time_months",
  event_var       = "status",
  treatment_var   = "hormon",
  subgroup_vars   = NULL,   # no embedded subgroup — uniform effect throughout
  model           = "null",
  n_super         = 5000,
  seed            = 99,
  verbose         = FALSE
)

## 
##  ====================================================================== 
##  CENSORING MODEL SELECTION:
##  MULTIPLE SURVIVAL REGRESSION MODEL COMPARISONS
## ====================================================================== 
## 
## CONVERGENCE SUMMARY ( 4 / 4  converged):
## -------------------------------------------------- 
## 
## Weibull (weibull):
##   Status:  ✓ Converged 
##   Iterations:  9 
##   Log-likelihood:  -1838.404 
##   Scale parameter:  0.418 
## 
## LogNormal (lognormal):
##   Status:  ✓ Converged 
##   Iterations:  4 
##   Log-likelihood:  -2000.309 
##   Scale parameter:  0.8707 
## 
## Weibull0 (weibull):
##   Status:  ✓ Converged 
##   Iterations:  7 
##   Log-likelihood:  -1843.717 
##   Scale parameter:  0.4246 
## 
## LogNormal0 (lognormal):
##   Status:  ✓ Converged 
##   Iterations:  6 
##   Log-likelihood:  -2004.326 
##   Scale parameter:  0.8768 
## 
## MODEL COMPARISON TABLE:
## -------------------------------------------------- 
##       Model     AIC delta_AIC AIC_weight     BIC delta_BIC
##    Weibull0 3691.43      0.00      0.976 3700.50      0.00
##     Weibull 3698.81      7.37      0.024 3748.65     48.15
##  LogNormal0 4012.65    321.22      0.000 4021.71    321.22
##   LogNormal 4022.62    331.18      0.000 4072.46    371.96
## 
## MODEL RANKINGS:
## -------------------------------------------------- 
## By AIC:  Weibull0 > Weibull > LogNormal0 > LogNormal 
## By BIC:  Weibull0 > Weibull > LogNormal0 > LogNormal 
## 
## EVIDENCE ASSESSMENT:
## -------------------------------------------------- 
##   Strength:  Strong evidence 
##   Details:  Top model has 97.6% of AIC weight 
## 
## FINAL RECOMMENDATION:
## -------------------------------------------------- 
##   Clear winner: Weibull0 (best by both AIC and BIC)
##   Evidence: Strong evidence - Top model has 97.6% of AIC weight 
## ======================================================================

cat("Null DGM  |  true overall HR:",
    round(dgm_null$hazard_ratios$overall, 3),
    " |  implied median event time:",
    round(exp(dgm_null$model_params$mu), 1), "months\n")

## Null DGM  |  true overall HR: 0.757  |  implied median event time: 55.5 months

Subgroups Under Study

Each simulated trial analyses the same set of pre-defined subgroups, ranging from clinically motivated subpopulations to deliberately artificial ones.

The random* subgroups are the key benchmark. They are constructed by randomly sampling approximately 20% of subjects and taking the first 15, 20, 40, and 60 patients from that draw:

set.seed(8316951)
gbsg$flag_itt <- 1L

# ── Scalar cut-points for continuous variables ────────────────────────────
# Computed once from the observed data; re-exposed as columns inside each
# simulated df_s so that subset expressions like "age <= age_med" resolve.
age_med  <- median(gbsg$age)          # ~53 yrs
size_med <- median(gbsg$size)         # ~25 mm
nodes_q3 <- quantile(gbsg$nodes, 0.75)  # top-quartile node count (~5)
pgr_med  <- median(gbsg$pgr)          # median PGR
er_cut   <- 20                         # ER-low clinical threshold (fmol / mg)

subgroups <- list(

  # ── A. Full trial (ITT) ─────────────────────────────────────────────────
  list(id = "flag_itt == 1",                       name = "All Patients",              grp = "ITT"),

  # ── B. Single clinical / factor subgroups ───────────────────────────────
  list(id = "meno == 1",                           name = "Post-menopausal",           grp = "Clinical"),
  list(id = "meno == 0",                           name = "Pre-menopausal",            grp = "Clinical"),
  list(id = "grade == 3",                          name = "Grade 3",                   grp = "Clinical"),
  list(id = "grade != 3",                          name = "Grade 1/2",                 grp = "Clinical"),

  # ── C. Single continuous-variable cuts ──────────────────────────────────
  list(id = "age <= age_med",                      name = "Age (young)",               grp = "Continuous"),
  list(id = "age > age_med",                       name = "Age (older)",               grp = "Continuous"),
  list(id = "age <= 50",                           name = "Age <= 50",                 grp = "Continuous"),
  list(id = "age > 50",                            name = "Age > 50",                  grp = "Continuous"),
  list(id = "size <= size_med",                    name = "Tumour size (small)",       grp = "Continuous"),
  list(id = "size > size_med",                     name = "Tumour size (large)",       grp = "Continuous"),
  list(id = "nodes == 0",                          name = "Node-negative",             grp = "Continuous"),
  list(id = "nodes > 0 & nodes <= 3",              name = "Nodes 1-3",                grp = "Continuous"),
  list(id = "nodes > nodes_q3",                    name = "High nodes (>Q3)",          grp = "Continuous"),
  list(id = "pgr <= pgr_med",                      name = "PGR low",                  grp = "Continuous"),
  list(id = "pgr > pgr_med",                       name = "PGR high",                 grp = "Continuous"),
  list(id = "er <= er_cut",                        name = "ER-low (<20)",              grp = "Continuous"),
  list(id = "er > er_cut",                         name = "ER-high (>=20)",            grp = "Continuous"),

  # ── D. Menopausal status x Grade (2x2) ──────────────────────────────────
  list(id = "meno == 0 & grade == 3",              name = "Pre-meno / Grade 3",        grp = "Interaction: meno x grade"),
  list(id = "meno == 1 & grade == 3",              name = "Post-meno / Grade 3",       grp = "Interaction: meno x grade"),
  list(id = "meno == 0 & grade != 3",              name = "Pre-meno / Grade 1-2",      grp = "Interaction: meno x grade"),
  list(id = "meno == 1 & grade != 3",              name = "Post-meno / Grade 1-2",     grp = "Interaction: meno x grade"),

  # ── E. Menopausal status x Age ───────────────────────────────────────────
  list(id = "meno == 0 & age <= 50",               name = "Pre-meno / Age<=50",        grp = "Interaction: meno x age"),
  list(id = "meno == 0 & age > 50",                name = "Pre-meno / Age>50",         grp = "Interaction: meno x age"),
  list(id = "meno == 1 & age <= age_med",          name = "Post-meno / Age (young)",   grp = "Interaction: meno x age"),
  list(id = "meno == 1 & age > age_med",           name = "Post-meno / Age (older)",   grp = "Interaction: meno x age"),

  # ── F. Menopausal status x ER status ────────────────────────────────────
  list(id = "meno == 0 & er <= er_cut",            name = "Pre-meno / ER-low",         grp = "Interaction: meno x ER"),
  list(id = "meno == 0 & er > er_cut",             name = "Pre-meno / ER-high",        grp = "Interaction: meno x ER"),
  list(id = "meno == 1 & er <= er_cut",            name = "Post-meno / ER-low",        grp = "Interaction: meno x ER"),
  list(id = "meno == 1 & er > er_cut",             name = "Post-meno / ER-high",       grp = "Interaction: meno x ER"),

  # ── G. Grade x Node status ───────────────────────────────────────────────
  list(id = "grade == 3 & nodes == 0",             name = "Grade 3 / Node-neg",        grp = "Interaction: grade x nodes"),
  list(id = "grade == 3 & nodes > 0",              name = "Grade 3 / Node-pos",        grp = "Interaction: grade x nodes"),
  list(id = "grade != 3 & nodes == 0",             name = "Grade 1-2 / Node-neg",      grp = "Interaction: grade x nodes"),
  list(id = "grade != 3 & nodes > 0",              name = "Grade 1-2 / Node-pos",      grp = "Interaction: grade x nodes"),
  list(id = "grade == 3 & nodes > nodes_q3",       name = "Grade 3 / High nodes",      grp = "Interaction: grade x nodes"),

  # ── H. Grade x ER status ─────────────────────────────────────────────────
  list(id = "grade == 3 & er <= er_cut",           name = "Grade 3 / ER-low",          grp = "Interaction: grade x ER"),
  list(id = "grade == 3 & er > er_cut",            name = "Grade 3 / ER-high",         grp = "Interaction: grade x ER"),
  list(id = "grade != 3 & er <= er_cut",           name = "Grade 1-2 / ER-low",        grp = "Interaction: grade x ER"),
  list(id = "grade != 3 & er > er_cut",            name = "Grade 1-2 / ER-high",       grp = "Interaction: grade x ER"),

  # ── I. Grade x PGR ──────────────────────────────────────────────────────
  list(id = "grade == 3 & pgr <= pgr_med",         name = "Grade 3 / PGR low",         grp = "Interaction: grade x PGR"),
  list(id = "grade == 3 & pgr > pgr_med",          name = "Grade 3 / PGR high",        grp = "Interaction: grade x PGR"),
  list(id = "grade != 3 & pgr <= pgr_med",         name = "Grade 1-2 / PGR low",       grp = "Interaction: grade x PGR"),
  list(id = "grade != 3 & pgr > pgr_med",          name = "Grade 1-2 / PGR high",      grp = "Interaction: grade x PGR"),

  # ── J. Age x ER x Menopausal (3-way) ───────────────────────────────────
  list(id = "meno == 0 & age <= 50 & er <= er_cut",  name = "Pre-meno/Yng/ER-low",    grp = "3-way"),
  list(id = "meno == 0 & age <= 50 & er > er_cut",   name = "Pre-meno/Yng/ER-high",   grp = "3-way"),
  list(id = "meno == 1 & grade == 3 & er <= er_cut", name = "Post-meno/G3/ER-low",    grp = "3-way"),
  list(id = "meno == 1 & grade == 3 & nodes > 0",    name = "Post-meno/G3/Node-pos",  grp = "3-way"),
  list(id = "grade == 3 & nodes > 0 & er <= er_cut", name = "G3/Node-pos/ER-low",     grp = "3-way"),

  # ── K. Size x Node combinations ─────────────────────────────────────────
  list(id = "size > size_med & nodes > 0",         name = "Large/Node-pos",            grp = "Interaction: size x nodes"),
  list(id = "size > size_med & nodes == 0",        name = "Large/Node-neg",            grp = "Interaction: size x nodes"),
  list(id = "size <= size_med & nodes > 0",        name = "Small/Node-pos",            grp = "Interaction: size x nodes"),
  list(id = "size <= size_med & nodes == 0",       name = "Small/Node-neg",            grp = "Interaction: size x nodes"),

  # ── L. Random benchmark subgroups ───────────────────────────────────────
  # No clinical meaning; regenerated each simulation. Calibrate variability
  # by size against the clinical and interaction subgroups above.
  list(id = "random60 == 1",                       name = "random60",                  grp = "Random (N~60)"),
  list(id = "random40 == 1",                       name = "random40",                  grp = "Random (N~40)"),
  list(id = "random20 == 1",                       name = "random20",                  grp = "Random (N~20)"),
  list(id = "random15 == 1",                       name = "random15",                  grp = "Random (N~15)")
)

# ── Show observed sizes from GBSG ────────────────────────────────────────
# Scalar cut-points exposed as columns so eval() resolves them correctly.
gbsg_eval           <- gbsg
gbsg_eval$age_med   <- age_med
gbsg_eval$size_med  <- size_med
gbsg_eval$nodes_q3  <- nodes_q3
gbsg_eval$pgr_med   <- pgr_med
gbsg_eval$er_cut    <- er_cut
gbsg_eval$random60  <- 0L
gbsg_eval$random40  <- 0L
gbsg_eval$random20  <- 0L
gbsg_eval$random15  <- 0L

sg_sizes <- sapply(subgroups, function(s) {
  tryCatch(
    sum(eval(parse(text = s$id), envir = gbsg_eval)),
    error = function(e) NA_integer_
  )
})
is_random_sg <- grepl("^random", sapply(subgroups, `[[`, "name"))
rand_sizes   <- c(60L, 40L, 20L, 15L)

sg_overview <- data.frame(
  Subgroup = sapply(subgroups, `[[`, "name"),
  Group    = sapply(subgroups, `[[`, "grp"),
  N_source = ifelse(is_random_sg, rand_sizes[cumsum(is_random_sg)], sg_sizes),
  stringsAsFactors = FALSE
)
print(sg_overview, row.names = FALSE)

##                 Subgroup                      Group N_source
##             All Patients                        ITT      686
##          Post-menopausal                   Clinical      396
##           Pre-menopausal                   Clinical      290
##                  Grade 3                   Clinical      161
##                Grade 1/2                   Clinical      525
##              Age (young)                 Continuous      363
##              Age (older)                 Continuous      323
##                Age <= 50                 Continuous      289
##                 Age > 50                 Continuous      397
##      Tumour size (small)                 Continuous      353
##      Tumour size (large)                 Continuous      333
##            Node-negative                 Continuous        0
##                Nodes 1-3                 Continuous      376
##         High nodes (>Q3)                 Continuous      143
##                  PGR low                 Continuous      343
##                 PGR high                 Continuous      343
##             ER-low (<20)                 Continuous      272
##           ER-high (>=20)                 Continuous      414
##       Pre-meno / Grade 3  Interaction: meno x grade       74
##      Post-meno / Grade 3  Interaction: meno x grade       87
##     Pre-meno / Grade 1-2  Interaction: meno x grade      216
##    Post-meno / Grade 1-2  Interaction: meno x grade      309
##       Pre-meno / Age<=50    Interaction: meno x age      255
##        Pre-meno / Age>50    Interaction: meno x age       35
##  Post-meno / Age (young)    Interaction: meno x age       77
##  Post-meno / Age (older)    Interaction: meno x age      319
##        Pre-meno / ER-low     Interaction: meno x ER      134
##       Pre-meno / ER-high     Interaction: meno x ER      156
##       Post-meno / ER-low     Interaction: meno x ER      138
##      Post-meno / ER-high     Interaction: meno x ER      258
##       Grade 3 / Node-neg Interaction: grade x nodes        0
##       Grade 3 / Node-pos Interaction: grade x nodes      161
##     Grade 1-2 / Node-neg Interaction: grade x nodes        0
##     Grade 1-2 / Node-pos Interaction: grade x nodes      525
##     Grade 3 / High nodes Interaction: grade x nodes       46
##         Grade 3 / ER-low    Interaction: grade x ER      106
##        Grade 3 / ER-high    Interaction: grade x ER       55
##       Grade 1-2 / ER-low    Interaction: grade x ER      166
##      Grade 1-2 / ER-high    Interaction: grade x ER      359
##        Grade 3 / PGR low   Interaction: grade x PGR      127
##       Grade 3 / PGR high   Interaction: grade x PGR       34
##      Grade 1-2 / PGR low   Interaction: grade x PGR      216
##     Grade 1-2 / PGR high   Interaction: grade x PGR      309
##      Pre-meno/Yng/ER-low                      3-way      116
##     Pre-meno/Yng/ER-high                      3-way      139
##      Post-meno/G3/ER-low                      3-way       58
##    Post-meno/G3/Node-pos                      3-way       87
##       G3/Node-pos/ER-low                      3-way      106
##           Large/Node-pos  Interaction: size x nodes      333
##           Large/Node-neg  Interaction: size x nodes        0
##           Small/Node-pos  Interaction: size x nodes      353
##           Small/Node-neg  Interaction: size x nodes        0
##                 random60              Random (N~60)       60
##                 random40              Random (N~40)       40
##                 random20              Random (N~20)       20
##                 random15              Random (N~15)       15

cat("
Total subgroups:", nrow(sg_overview), "
")

## 
## Total subgroups: 56

These random* subgroups have no clinical meaning whatsoever — their true HR equals the ITT HR of 0.70. Any extreme result they produce is purely a function of their small size and sampling variability. Placing them alongside clinical subgroups of similar size makes explicit how much of the spread in those clinical findings is attributable to noise alone.

Summary Statistics Across 1000 Simulations

For each subgroup the primary analysis — Cox model stratified by the randomisation factor (grade) — is fitted in every simulated trial. Two quantities are accumulated:

HR point estimate distribution. Reported as median and the 1st–99th percentile interval (the empirical 99% confidence interval, ECI). Under truth, the median should be close to 0.70 for every subgroup. The spread of the ECI reflects how variable the estimate is — wider for smaller subgroups.

Upper confidence bound UB(HR). The upper end of the Cox 95% confidence interval. Two thresholds are tracked:

Pr(UB < 1.0): the proportion of trials in which the upper bound excludes no treatment effect, i.e., the subgroup shows a nominally significant benefit. For the full ITT this should be close to the trial power (~95%); for small subgroups it falls substantially.
Conditional distribution of UB(HR) given UB ≥ 1: when the upper bound fails to exclude no effect, how large does it become? This is what a safety reviewer sees in the worst-case trial for that subgroup.

Simulation Loop

cal_null <- calibrate_cens_adjust(
  dgm           = dgm_null,
  target        = "rate",
  n             = 800,
  analysis_time = 84,
  max_entry     = 24,
  seed          = 42,
  n_eval        = 1500,
  verbose       = FALSE
)
cat("Calibrated cens_adjust:", round(cal_null$cens_adjust, 4),
    " | Target censoring rate:", round(cal_null$ref_value, 3), "\n")

## Calibrated cens_adjust: 0.1047  | Target censoring rate: 0.576

sg_names <- sapply(subgroups, `[[`, "name")
sg_ids   <- sapply(subgroups, `[[`, "id")
n_sg     <- length(subgroups)

# Fit stratified Cox, return c(HR point estimate, upper 95% CI bound)
cox_sR <- function(data) {
  fit <- tryCatch(
    coxph(Surv(y_sim, event_sim) ~ treat_sim + strata(grade), data = data),
    error = function(e) NULL
  )
  if (is.null(fit) || nrow(summary(fit)$conf.int) == 0)
    return(c(NA_real_, NA_real_))
  ci <- summary(fit)$conf.int[1, ]
  c(ci["exp(coef)"], ci["upper .95"])
}

# Storage: HR point estimates and upper 95% confidence bounds
sim_hrs <- matrix(NA_real_, n_sims_null, n_sg, dimnames = list(NULL, sg_names))
sim_ubs <- matrix(NA_real_, n_sims_null, n_sg, dimnames = list(NULL, sg_names))
sim_ns  <- matrix(NA_real_, n_sims_null, n_sg, dimnames = list(NULL, sg_names))

for (ss in seq_len(n_sims_null)) {

  df_s <- simulate_from_dgm(
    dgm           = dgm_null,
    n             = nrow(gbsg),
    analysis_time = 84,
    max_entry     = 24,
    cens_adjust   = cal_null$cens_adjust,
    seed          = ss
  )
  df_s$flag_itt <- 1L

  # Make quantile cut-points available as scalar variables in the eval
  # environment so expressions like "size <= size_med" resolve correctly.
  # Expose scalar cut-points as columns so subset expressions resolve.
  df_s$age_med  <- age_med
  df_s$size_med <- size_med
  df_s$nodes_q3 <- nodes_q3
  df_s$pgr_med  <- pgr_med
  df_s$er_cut   <- er_cut

  # Regenerate random benchmark subgroups for this simulated trial.
  # Random subgroups are re-sampled each iteration: their purpose is to
  # benchmark the variability of a subgroup of a given size with no true
  # treatment effect heterogeneity — exactly what each simulated trial provides.
  set.seed(ss + 1e6L)
  n_s   <- nrow(df_s)
  r_idx <- sample.int(n_s, min(60L, n_s), replace = FALSE)
  df_s$random60 <- as.integer(seq_len(n_s) %in% r_idx)
  df_s$random40 <- as.integer(seq_len(n_s) %in% r_idx[seq_len(min(40L, length(r_idx)))])
  df_s$random20 <- as.integer(seq_len(n_s) %in% r_idx[seq_len(min(20L, length(r_idx)))])
  df_s$random15 <- as.integer(seq_len(n_s) %in% r_idx[seq_len(min(15L, length(r_idx)))])

  for (gg in seq_len(n_sg)) {
    df_sg <- tryCatch(
      subset(df_s, eval(parse(text = sg_ids[[gg]]))),
      error = function(e) NULL
    )
    if (is.null(df_sg) || nrow(df_sg) < 5) next
    sim_ns[ss, gg] <- nrow(df_sg)
    r <- cox_sR(df_sg)
    sim_hrs[ss, gg] <- r[1]
    sim_ubs[ss, gg] <- r[2]
  }
}

cat("Simulations completed:", n_sims_null, "\n")

## Simulations completed: 1000

Results

Summary table

# Median HR and 1-99th percentile ECI across simulations
hr_q  <- t(apply(sim_hrs, 2, quantile, probs = c(0.01, 0.50, 0.99),
                 na.rm = TRUE))
ub_q  <- t(apply(sim_ubs, 2, quantile, probs = c(0.01, 0.50, 0.99),
                 na.rm = TRUE))

# n_valid: number of simulations where Cox estimation succeeded
n_valid     <- apply(sim_hrs, 2, function(x) sum(!is.na(x)))
pct_valid   <- round(100 * n_valid / n_sims_null, 1)

mean_n      <- round(apply(sim_ns,  2, mean, na.rm = TRUE))

# Proportions denominated by n_valid (not n_sims_null) so NAs from
# estimation failures are excluded; they are separately reported.
pr_hr_lt080 <- apply(sim_hrs, 2, function(x) mean(x < 0.80, na.rm = TRUE))
pr_ub_lt1   <- apply(sim_ubs, 2, function(x) mean(x < 1.0,  na.rm = TRUE))
pr_ub_ge1   <- 1 - pr_ub_lt1

# Conditional distribution of UB given UB >= 1
cond_ub <- t(apply(sim_ubs, 2, function(x) {
  xx <- x[!is.na(x) & x >= 1.0]
  if (length(xx) < 2) return(c(NA_real_, NA_real_, NA_real_))
  quantile(xx, probs = c(0.50, 0.75, 0.99))
}))

fmt_pct <- function(x) ifelse(is.nan(x), "-", paste0(round(100 * x, 1), "%"))
fmt_num <- function(x) ifelse(is.na(x), "-", as.character(round(x, 2)))

results_tbl <- data.frame(
  Subgroup    = sg_names,
  Mean_N      = ifelse(is.nan(mean_n), "-", as.character(mean_n)),
  N_valid     = paste0(n_valid, " (", pct_valid, "%)"),
  HR_ECI      = ifelse(is.na(hr_q[,2]), "-",
                  paste0(round(hr_q[,2],2),
                         " (", round(hr_q[,1],2), ", ",
                         round(hr_q[,3],2), ")")),
  Pr_HR_lt080 = fmt_pct(pr_hr_lt080),
  Pr_UB_lt1   = fmt_pct(pr_ub_lt1),
  Pr_UB_ge1   = fmt_pct(pr_ub_ge1),
  Med_UB_cond = fmt_num(cond_ub[, 1]),
  P99_UB_cond = fmt_num(cond_ub[, 3]),
  stringsAsFactors = FALSE
)
colnames(results_tbl) <- c(
  "Subgroup", "N", "#(% converged)",
  "Median HR (1st, 99th ECI)",
  "Pr(HR<0.80)", "Pr(UB<1.0)", "Pr(UB>=1.0)",
  "Med UB|UB>=1", "P99 UB|UB>=1"
)
print(results_tbl, row.names = FALSE)

##                 Subgroup   N #(% converged) Median HR (1st, 99th ECI)
##             All Patients 686    1000 (100%)          0.7 (0.54, 0.93)
##          Post-menopausal 389    1000 (100%)            0.71 (0.49, 1)
##           Pre-menopausal 297    1000 (100%)         0.71 (0.45, 1.08)
##                  Grade 3 161    1000 (100%)           0.7 (0.4, 1.17)
##                Grade 1/2 525    1000 (100%)         0.71 (0.51, 0.95)
##              Age (young) 373    1000 (100%)         0.71 (0.49, 1.07)
##              Age (older) 313    1000 (100%)         0.71 (0.46, 1.09)
##                Age <= 50 295    1000 (100%)          0.7 (0.46, 1.09)
##                 Age > 50 391    1000 (100%)         0.71 (0.49, 1.01)
##      Tumour size (small) 357    1000 (100%)         0.69 (0.47, 1.04)
##      Tumour size (large) 329    1000 (100%)         0.72 (0.49, 1.03)
##            Node-negative   -         0 (0%)                         -
##                Nodes 1-3 381    1000 (100%)          0.7 (0.47, 1.06)
##         High nodes (>Q3) 134    1000 (100%)         0.71 (0.42, 1.13)
##                  PGR low 340    1000 (100%)             0.7 (0.49, 1)
##                 PGR high 346    1000 (100%)         0.71 (0.46, 1.11)
##             ER-low (<20) 277    1000 (100%)          0.7 (0.47, 1.06)
##           ER-high (>=20) 409    1000 (100%)         0.71 (0.49, 1.03)
##       Pre-meno / Grade 3  78    1000 (100%)         0.69 (0.28, 1.41)
##      Post-meno / Grade 3  83    1000 (100%)         0.71 (0.35, 1.42)
##     Pre-meno / Grade 1-2 219    1000 (100%)          0.71 (0.43, 1.2)
##    Post-meno / Grade 1-2 306    1000 (100%)         0.71 (0.46, 1.05)
##       Pre-meno / Age<=50 259    1000 (100%)         0.71 (0.44, 1.11)
##        Pre-meno / Age>50  38    1000 (100%)          0.7 (0.11, 3.09)
##  Post-meno / Age (young)  80    1000 (100%)           0.7 (0.29, 1.5)
##  Post-meno / Age (older) 309    1000 (100%)         0.71 (0.46, 1.11)
##        Pre-meno / ER-low 140    1000 (100%)          0.7 (0.39, 1.27)
##       Pre-meno / ER-high 157    1000 (100%)         0.71 (0.36, 1.38)
##       Post-meno / ER-low 137    1000 (100%)          0.7 (0.38, 1.24)
##      Post-meno / ER-high 252    1000 (100%)         0.71 (0.44, 1.14)
##       Grade 3 / Node-neg   -         0 (0%)                         -
##       Grade 3 / Node-pos 161    1000 (100%)           0.7 (0.4, 1.17)
##     Grade 1-2 / Node-neg   -         0 (0%)                         -
##     Grade 1-2 / Node-pos 525    1000 (100%)         0.71 (0.51, 0.95)
##     Grade 3 / High nodes  44    1000 (100%)          0.7 (0.26, 1.66)
##         Grade 3 / ER-low 110    1000 (100%)           0.7 (0.36, 1.3)
##        Grade 3 / ER-high  51    1000 (100%)         0.71 (0.26, 2.06)
##       Grade 1-2 / ER-low 166    1000 (100%)         0.71 (0.39, 1.21)
##      Grade 1-2 / ER-high 358    1000 (100%)          0.7 (0.47, 1.05)
##        Grade 3 / PGR low 128    1000 (100%)          0.7 (0.38, 1.18)
##       Grade 3 / PGR high  34    1000 (100%)         0.71 (0.16, 2.66)
##      Grade 1-2 / PGR low 213    1000 (100%)          0.71 (0.43, 1.1)
##     Grade 1-2 / PGR high 312    1000 (100%)         0.71 (0.46, 1.12)
##      Pre-meno/Yng/ER-low 121    1000 (100%)         0.71 (0.38, 1.35)
##     Pre-meno/Yng/ER-high 138    1000 (100%)         0.71 (0.35, 1.45)
##      Post-meno/G3/ER-low  57    1000 (100%)         0.69 (0.26, 1.64)
##    Post-meno/G3/Node-pos  83    1000 (100%)         0.71 (0.35, 1.42)
##       G3/Node-pos/ER-low 110    1000 (100%)           0.7 (0.36, 1.3)
##           Large/Node-pos 329    1000 (100%)         0.72 (0.49, 1.03)
##           Large/Node-neg   -         0 (0%)                         -
##           Small/Node-pos 357    1000 (100%)         0.69 (0.47, 1.04)
##           Small/Node-neg   -         0 (0%)                         -
##                 random60  60    1000 (100%)          0.7 (0.23, 1.88)
##                 random40  40    1000 (100%)         0.69 (0.16, 2.41)
##                 random20  20    1000 (100%)            0.68 (0, 6.64)
##                 random15  15    997 (99.7%)   0.69 (0, 1848203738.21)
##  Pr(HR<0.80) Pr(UB<1.0) Pr(UB>=1.0) Med UB|UB>=1 P99 UB|UB>=1
##        83.8%      82.9%       17.1%         1.04         1.25
##        78.2%      60.5%       39.5%         1.09         1.43
##        74.6%      46.4%       53.6%         1.14         1.61
##        71.3%      38.5%       61.5%         1.21         1.87
##          80%      68.8%       31.2%         1.08         1.35
##        78.4%      59.8%       40.2%          1.1         1.54
##        75.5%      51.1%       48.9%         1.12         1.58
##        76.6%      49.6%       50.4%         1.13         1.65
##        78.6%      60.9%       39.1%         1.09         1.44
##        78.5%      56.4%       43.6%         1.11         1.54
##          74%      53.3%       46.7%         1.11         1.53
##            -          -           -            -            -
##        77.6%      54.8%       45.2%         1.12         1.54
##          71%      33.4%       66.6%         1.21         1.87
##        80.5%      63.6%       36.4%         1.09         1.38
##        75.4%      46.9%       53.1%         1.14         1.68
##        76.4%      53.8%       46.2%         1.12         1.63
##          78%      58.5%       41.5%          1.1         1.45
##        67.1%      18.9%       81.1%         1.41         2.74
##        63.9%        22%         78%          1.4         2.66
##        69.1%      34.1%       65.9%         1.22         1.89
##        74.6%      48.8%       51.2%         1.13         1.53
##        72.8%      42.6%       57.4%         1.16         1.68
##        59.4%       5.9%       94.1%         2.33        14.52
##        66.7%      18.1%       81.9%         1.47         3.06
##        75.8%      50.2%       49.8%         1.12         1.61
##        69.2%      29.9%       70.1%         1.27         2.15
##        66.9%      26.6%       73.4%         1.33         2.61
##        69.8%      30.5%       69.5%         1.26         2.06
##        70.7%      40.4%       59.6%         1.17         1.73
##            -          -           -            -            -
##        71.3%      38.5%       61.5%         1.21         1.87
##            -          -           -            -            -
##          80%      68.8%       31.2%         1.08         1.35
##        63.4%      14.1%       85.9%         1.56         3.59
##        69.7%      29.4%       70.6%         1.28         2.32
##        61.2%      11.1%       88.9%         1.68         5.05
##        69.1%      31.2%       68.8%         1.23         2.06
##          77%      52.7%       47.3%         1.12         1.53
##          71%        33%         67%         1.23         2.04
##        58.4%       7.8%       92.2%          2.2        11.72
##        72.6%      41.6%       58.4%         1.17         1.66
##        74.5%        43%         57%         1.16         1.71
##          68%      28.3%       71.7%         1.31         2.32
##        65.2%      23.4%       76.6%         1.37         2.68
##        63.1%      18.4%       81.6%         1.55         3.57
##        63.9%        22%         78%          1.4         2.66
##        69.7%      29.4%       70.6%         1.28         2.32
##          74%      53.3%       46.7%         1.11         1.53
##            -          -           -            -            -
##        78.5%      56.4%       43.6%         1.11         1.54
##            -          -           -            -            -
##        62.9%      12.7%       87.3%         1.67         4.55
##        59.2%      10.2%       89.8%         2.07         7.79
##        57.5%       2.7%       97.3%         3.74          Inf
##        54.6%       2.1%       97.9%         5.73          Inf

Reading the table. Every row shares the same true HR = 0.70, so the median of the HR ECI should be close to 0.70 throughout — confirming the Cox model is unbiased. The spread of the ECI is what changes: narrow for the ITT and large clinical subgroups, wide for random20 and random15. Pr(UB < 1.0) measures power for the full trial; it falls sharply with subgroup size. P99 UB | UB ≥ 1 answers the question a safety reviewer faces in the worst 1% of trials: how large can the upper bound get when the subgroup fails to show a significant benefit? For random15 the answer routinely exceeds 5.0; for random40 it commonly exceeds 3.0.

Forest plot: HR point estimate distributions

Each row shows one subgroup. The point is the median HR across 1000 simulations; the line spans the 1st–99th percentile ECI. The red dashed line marks the true HR = 0.70; the grey dotted line marks HR = 1.0. random* benchmarks (orange) are shown alongside same-sized clinical subgroups (blue) so that the comparison is immediate.

# =============================================================================
# Forest Plot Sizing Parameters — edit here to resize all four panels
# =============================================================================
#
# HOW SIZING WORKS WITH gg_forest()
# ──────────────────────────────────
# gg_forest() uses ggplot2 + patchwork.  Unlike forestploter, there is NO
# hidden scaling.  Row height is simply:
#
#     row_height_in  =  fig.height  /  n_rows
#
# So to set a target row height (e.g. 0.45 in):
#     fig.height  =  n_rows  ×  0.45  +  overhead_in
#
# where overhead_in ≈ 1.5 in covers the title, x-axis, and caption.
# ─────────────────────────────────────────────────────────────────────────────

fp_row_height   <- 0.45   # in — target height per row (increase to spread rows)
fp_overhead_in  <- 1.5    # in — fixed overhead (title + x-axis + footnote)
fp_base_size    <- 14     # pt — ggplot2 base font size (controls ALL text sizes)
fp_point_size   <- 2.2    # relative size of the median point symbol
fp_line_size    <- 0.75   # line width of CI whiskers

# ── Colour map (used by all four panels) ──────────────────────────────────
cat_cols <- c(
  "ITT"         = "black",
  "Clinical"    = "steelblue",
  "Continuous"  = "darkgreen",
  "Interaction" = "purple4",
  "3-way"       = "violetred3",
  "Random"      = "darkorange"
)

# ── Recommended fig.height for each panel ─────────────────────────────────
cat_ok  <- dplyr::case_when(
  sapply(subgroups, `[[`, "grp") == "ITT"               ~ "ITT",
  sapply(subgroups, `[[`, "grp") == "Clinical"          ~ "Clinical",
  sapply(subgroups, `[[`, "grp") == "Continuous"        ~ "Continuous",
  grepl("^Interaction", sapply(subgroups, `[[`, "grp")) ~ "Interaction",
  sapply(subgroups, `[[`, "grp") == "3-way"             ~ "3-way",
  TRUE                                                  ~ "Random"
)

ok    <- !is.na(hr_q[, 2])
ok_ub <- !is.na(ub_q[, 2])

single_cats <- c("ITT", "Clinical", "Continuous")
combo_cats  <- c("Interaction", "3-way", "Random")

idx_hr_single <- which(cat_ok[ok]  %in% single_cats)
idx_hr_combo  <- which(cat_ok[ok]  %in% combo_cats)
idx_ub_single <- which(cat_ok[ok_ub] %in% single_cats)
idx_ub_combo  <- which(cat_ok[ok_ub] %in% combo_cats)

n_single <- length(idx_hr_single)
n_combo  <- length(idx_hr_combo)

fh_single <- round(n_single * fp_row_height + fp_overhead_in, 1)
fh_combo  <- round(n_combo  * fp_row_height + fp_overhead_in, 1)

cat(sprintf(
  "Single panels: %d rows  ->  recommended fig.height = %.1f in
",
  n_single, fh_single
))

## Single panels: 17 rows  ->  recommended fig.height = 9.2 in

cat(sprintf(
  "Combo  panels: %d rows  ->  recommended fig.height = %.1f in
",
  n_combo,  fh_combo
))

## Combo  panels: 34 rows  ->  recommended fig.height = 16.8 in

Panel A — HR distributions, single-variable subgroups. Point = median HR; whiskers = 1st–99th percentile ECI. The true HR is 0.70 for every subgroup (red dashed line).

sg_single   <- sg_names[ok][idx_hr_single]
cat_single  <- cat_ok[ok][idx_hr_single]

gg_forest(
  subgroups   = sg_single,
  est         = hr_q[ok, 2][idx_hr_single],
  lo          = hr_q[ok, 1][idx_hr_single],
  hi          = hr_q[ok, 3][idx_hr_single],
  cat_vec     = cat_single,
  cat_colours = cat_cols,
  annot       = list(
    "N"     = as.character(mean_n[ok][idx_hr_single]),
    "Pr(UB<1.0)" = paste0(round(100 * pr_ub_lt1[ok][idx_hr_single], 1), "%")
  ),
  ref_line    = 0.70,
  vert_lines  = 1.00,
  ref_col     = "firebrick",
  vert_col    = "grey55",
  vert_lty    = "dotted",
  xlim        = c(0.15, 3.5),
  ticks_at    = c(0.20, 0.35, 0.50, 0.70, 1.00, 1.50, 2.50),
  xlog        = TRUE,
  xlab        = "Hazard Ratio (Cox, stratified by grade)",
  footnote    = paste0(
    "Median HR (point) + 1st-99th percentile ECI  |  ",
    n_sims_null, " simulated trials, N = ", nrow(gbsg), " per trial  |  ",
    "True HR = 0.70 (red dashed)"
  ),
  point_size  = fp_point_size,
  line_size   = fp_line_size,
  base_size   = fp_base_size,
  widths      = c(3.5, 5, 1.1, 1.2)
)

HR distribution — single-variable subgroups (log scale). Spread reflects sampling variability alone; all subgroups share the same true HR = 0.70.

Panel B — HR distributions, combination and random-benchmark subgroups. Note how the ECI widens sharply as subgroup N falls below ~60; the random* benchmarks (orange) define the noise floor for each size class.

sg_combo   <- sg_names[ok][idx_hr_combo]
cat_combo  <- cat_ok[ok][idx_hr_combo]

gg_forest(
  subgroups   = sg_combo,
  est         = hr_q[ok, 2][idx_hr_combo],
  lo          = hr_q[ok, 1][idx_hr_combo],
  hi          = hr_q[ok, 3][idx_hr_combo],
  cat_vec     = cat_combo,
  cat_colours = cat_cols,
  annot       = list(
    "N"     = as.character(mean_n[ok][idx_hr_combo]),
    "Pr(UB<1.0)" = paste0(round(100 * pr_ub_lt1[ok][idx_hr_combo], 1), "%")
  ),
  ref_line    = 0.70,
  vert_lines  = 1.00,
  ref_col     = "firebrick",
  vert_col    = "grey55",
  vert_lty    = "dotted",
  xlim        = c(0.15, 3.5),
  ticks_at    = c(0.20, 0.35, 0.50, 0.70, 1.00, 1.50, 2.50),
  xlog        = TRUE,
  xlab        = "Hazard Ratio (Cox, stratified by grade)",
  footnote    = paste0(
    "Median HR (point) + 1st-99th percentile ECI  |  ",
    n_sims_null, " simulated trials  |  ",
    "True HR = 0.70 (red dashed)"
  ),
  point_size  = fp_point_size,
  line_size   = fp_line_size,
  base_size   = fp_base_size,
  widths      = c(3.5, 5, 1.1, 1.2)
)

HR distribution — combination subgroups and random benchmarks (log scale). Random benchmarks of matching size (orange) are the baseline for chance variability.

Forest plot: UB(HR) distributions

The upper confidence bound UB(HR) is the quantity a safety reviewer examines when asking whether a subgroup demonstrates harm or lack of benefit. The plot below shows the distribution of UB(HR) across simulations in the same ECI format. Values above 1.0 (grey dotted line) represent trials where the confidence interval includes no treatment effect. The right-hand column shows the probability of this occurring (Pr(UB >= 1)) and the conditional median UB given failure to show benefit (Med UB | UB>=1).

Panel A — UB(HR) distributions, single-variable subgroups.

lab_cmed_ub <- ifelse(
  is.na(cond_ub[ok_ub, 1]), "-",
  as.character(round(cond_ub[ok_ub, 1], 2))
)
sg_ub_s   <- sg_names[ok_ub][idx_ub_single]
cat_ub_s  <- cat_ok[ok_ub][idx_ub_single]

gg_forest(
  subgroups   = sg_ub_s,
  est         = ub_q[ok_ub, 2][idx_ub_single],
  lo          = ub_q[ok_ub, 1][idx_ub_single],
  hi          = ub_q[ok_ub, 3][idx_ub_single],
  cat_vec     = cat_ub_s,
  cat_colours = cat_cols,
  annot       = list(
    "N"         = as.character(mean_n[ok_ub][idx_ub_single]),
    "UB>=1"    = paste0(round(100 * pr_ub_ge1[ok_ub][idx_ub_single], 1), "%"),
    "mUB | UB>=1" = lab_cmed_ub[idx_ub_single]
  ),
  ref_line    = 1.00,
  vert_lines  = 0.70,
  ref_col     = "grey55",
  ref_lty     = "dotted",
  vert_col    = "firebrick",
  vert_lty    = "dashed",
  xlim        = c(0.30, 9.0),
  ticks_at    = c(0.40, 0.70, 1.00, 1.50, 2.50, 4.00, 8.00),
  xlog        = TRUE,
  xlab        = "Upper Bound of 95% CI  [UB(HR)]",
  footnote    = paste0(
    "Median UB(HR) (point) + 1st-99th percentile ECI  |  ",
    n_sims_null, " simulated trials  |  ",
    "Dotted line: UB = 1.0;  Dashed line: true HR = 0.70"
  ),
  point_size  = fp_point_size,
  line_size   = fp_line_size,
  base_size   = fp_base_size,
  widths      = c(3.5, 5, 1.1, 1.2, 1.5)
)

UB(HR) distribution — single-variable subgroups (log scale). Values right of HR = 1.0 (dotted line) represent trials failing to show significant benefit.

Panel B — UB(HR) distributions, combination subgroups and random benchmarks. The Med UB | UB>=1 column is the key safety quantity — it shows how large the upper bound grows in those trials where the CI actually fails to demonstrate benefit.

sg_ub_c   <- sg_names[ok_ub][idx_ub_combo]
cat_ub_c  <- cat_ok[ok_ub][idx_ub_combo]

gg_forest(
  subgroups   = sg_ub_c,
  est         = ub_q[ok_ub, 2][idx_ub_combo],
  lo          = ub_q[ok_ub, 1][idx_ub_combo],
  hi          = ub_q[ok_ub, 3][idx_ub_combo],
  cat_vec     = cat_ub_c,
  cat_colours = cat_cols,
  annot       = list(
    "N"         = as.character(mean_n[ok_ub][idx_ub_combo]),
    "UB>=1"    = paste0(round(100 * pr_ub_ge1[ok_ub][idx_ub_combo], 1), "%"),
    "mUB | UB>=1" = lab_cmed_ub[idx_ub_combo]
  ),
  ref_line    = 1.00,
  vert_lines  = 0.70,
  ref_col     = "grey55",
  ref_lty     = "dotted",
  vert_col    = "firebrick",
  vert_lty    = "dashed",
  xlim        = c(0.30, 9.0),
  ticks_at    = c(0.40, 0.70, 1.00, 1.50, 2.50, 4.00, 8.00),
  xlog        = TRUE,
  xlab        = "Upper Bound of 95% CI  [UB(HR)]",
  footnote    = paste0(
    "Median UB(HR) (point) + 1st-99th percentile ECI  |  ",
    n_sims_null, " simulated trials  |  ",
    "Dotted line: UB = 1.0;  Dashed line: true HR = 0.70"
  ),
  point_size  = fp_point_size,
  line_size   = fp_line_size,
  base_size   = fp_base_size,
  widths      = c(3.5, 5, 1.1, 1.2, 1.5)
)

UB(HR) distribution — combination and random-benchmark subgroups. The widening ECI for small subgroups is the central message: chance alone produces UB(HR) > 3-5 for N ~ 15-40.

Key Finding: Small Subgroups Produce Extreme Results by Chance

The forest plots above tell a consistent story.

Point estimates (HR). For the ITT and large clinical subgroups the median HR tracks the truth (≈ 0.70) and the ECI is narrow. For random15 and random20 the ECI spans from roughly 0.25 to above 1.0 — the entire plausible range. Clinical subgroups of similar size (e.g. Grade 3, Age ≤ 50) show ECI widths that are indistinguishable from their random-benchmark counterparts of the same N. The randomisation basis of those clinical groupings provides no protection against this variability.

Upper bounds UB(HR). The upper-bound distribution is even more striking. For random40 (N ≈ 40), the 99th percentile of UB(HR) commonly exceeds 3.0–4.0, meaning that 1% of simulated trials produce an upper confidence bound above 3.0 for a subgroup where the true HR is 0.70. For random15, values above 5.0 are not rare.

This is not a failure of the Cox model — it is working exactly as intended. A 95% confidence interval that spans from 0.3 to 4.5 is statistically correct: with N = 15 there is genuinely that much uncertainty. The problem is the interpretation: a practitioner seeing UB(HR) = 3.5 in a subgroup of 15 patients in a single trial has no way to distinguish a chance fluctuation from a genuine safety signal without additional structure.

Summary

Step	Function	Purpose
1	—	Convert source data to consistent time scale
2	`generate_aft_dgm_flex()`	Fit AFT model; embed harm subgroup; build super population
3	`check_censoring_dgm()`	Verify simulated censoring matches source data
4	`calibrate_cens_adjust()`	Find `cens_adjust` that aligns censoring rate or KM median
5	`simulate_from_dgm()`	Draw individual synthetic trials from the DGM
6	`forestsearch()` + summaries	Apply method and accumulate operating characteristics

The extreme-subgroup evaluation in Part IV answers a question: under a null where every patient shares the same true HR, how alarming can a standard Cox subgroup analysis look by chance? The random* subgroups anchor the answer — any clinical subgroup of similar size has an identical chance distribution of HR and UB(HR) regardless of clinical rationale. —

Session Information

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] here_1.0.2         patchwork_1.3.2    dplyr_1.2.0        ggplot2_4.0.2     
## [5] survival_3.8-6     forestsearch_0.1.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6         shape_1.4.6.1        xfun_0.56           
##  [4] bslib_0.10.0         htmlwidgets_1.6.4    lattice_0.22-9      
##  [7] vctrs_0.7.1          tools_4.5.1          generics_0.1.4      
## [10] parallel_4.5.1       tibble_3.3.1         pkgconfig_2.0.3     
## [13] Matrix_1.7-4         data.table_1.18.2.1  RColorBrewer_1.1-3  
## [16] S7_0.2.1             desc_1.4.3           gt_1.3.0            
## [19] lifecycle_1.0.5      doFuture_1.2.1       compiler_4.5.1      
## [22] farver_2.1.2         stringr_1.6.0        textshaping_1.0.4   
## [25] policytree_1.2.4     grf_2.5.0            codetools_0.2-20    
## [28] htmltools_0.5.9      sass_0.4.10          yaml_2.3.12         
## [31] glmnet_4.1-10        pillar_1.11.1        pkgdown_2.2.0       
## [34] jquerylib_0.1.4      cachem_1.1.0         weightedsurv_0.1.0  
## [37] iterators_1.0.14     foreach_1.5.2        parallelly_1.46.1   
## [40] tidyselect_1.2.1     digest_0.6.39        stringi_1.8.7       
## [43] future_1.69.0        listenv_0.10.0       labeling_0.4.3      
## [46] splines_4.5.1        rprojroot_2.1.1      fastmap_1.2.0       
## [49] grid_4.5.1           cli_3.6.5            magrittr_2.0.4      
## [52] randomForest_4.7-1.2 future.apply_1.20.2  withr_3.0.2         
## [55] scales_1.4.0         rmarkdown_2.30       globals_0.19.0      
## [58] otel_0.2.0           progressr_0.18.0     ragg_1.5.0          
## [61] evaluate_1.0.5       knitr_1.51           rlang_1.1.7         
## [64] Rcpp_1.1.1           glue_1.8.0           xml2_1.5.2          
## [67] rstudioapi_0.18.0    jsonlite_2.0.0       R6_2.6.1            
## [70] systemfonts_1.3.1    fs_1.6.6

A Simulation Study Based on Observed Trial Data

Larry F. León

2026-02-20

Overview

Example of Subgroups Under Study

Six Analysis Variants

Summary Statistics Across 5 000 Simulations

Setting: Uniform Treatment Effect

Installation

Documentation

Setup

Part I — Building the Null Data Generating Mechanism

Why time-scale consistency matters first

Fitting `generate_aft_dgm_flex()`

Part II — Censoring Diagnostics and Calibration

Diagnosing censoring consistency

Calibrating `cens_adjust`

Verifying the calibrated simulation

Part III — Generating Synthetic Trials

Part IV — Main Results: Extreme Subgroup Effects Under a Uniform Treatment Benefit

The Core Question

Setting: Uniform Treatment Effect

Subgroups Under Study

Summary Statistics Across 1000 Simulations

Simulation Loop

Results

Summary table

Forest plot: HR point estimate distributions

Forest plot: UB(HR) distributions

Key Finding: Small Subgroups Produce Extreme Results by Chance

Summary

Session Information

Extreme Subgroup Effects Under a Uniform Treatment Benefit

A Simulation Study Based on Observed Trial Data

Larry F. León

2026-02-20

Overview

Example of Subgroups Under Study

Six Analysis Variants

Summary Statistics Across 5 000 Simulations

Setting: Uniform Treatment Effect

Installation

Documentation

Setup

Part I — Building the Null Data Generating Mechanism

Why time-scale consistency matters first

Fitting generate_aft_dgm_flex()

Part II — Censoring Diagnostics and Calibration

Diagnosing censoring consistency

Calibrating cens_adjust

Verifying the calibrated simulation

Part III — Generating Synthetic Trials

Part IV — Main Results: Extreme Subgroup Effects Under a Uniform Treatment Benefit

The Core Question

Setting: Uniform Treatment Effect

Subgroups Under Study

Summary Statistics Across 1000 Simulations

Simulation Loop

Results

Summary table

Forest plot: HR point estimate distributions

Forest plot: UB(HR) distributions

Key Finding: Small Subgroups Produce Extreme Results by Chance

Summary

Session Information

Fitting `generate_aft_dgm_flex()`

Calibrating `cens_adjust`