
Analyzes a data frame, classifies each variable as continuous or categorical, and returns the variable names for each class together with a copy of the data from which the specified variables have been removed.

Usage

detect_variable_types(data, max_unique_for_cat = 10, exclude_vars = NULL)

Arguments

data

A data frame to analyze

max_unique_for_cat

Integer. Threshold for treating a numeric variable as categorical: a numeric variable with this many unique values or fewer is classified as categorical. Default is 10.

exclude_vars

Character vector of variable names to exclude from both classification and the returned dataset (e.g., ID variables, timestamps). Default is NULL.

Value

A list containing:

continuous_vars

Character vector of variable names classified as continuous

cat_vars

Character vector of variable names classified as categorical

data_subset

Data frame with exclude_vars columns removed

Details

The function classifies variables using the following rules:

  • Numeric variables with more than max_unique_for_cat unique values are classified as continuous

  • Numeric variables with max_unique_for_cat or fewer unique values are classified as categorical

  • Factor, character, and logical variables are always classified as categorical

  • Variables listed in exclude_vars are omitted from classification and removed from the returned dataset
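
The rules above can be sketched in a few lines of base R. This is an illustrative re-implementation, not the package's source: the name classify_vars_sketch is hypothetical, and two behaviors the documentation does not specify are assumptions here (NA is counted as a unique value, and column types other than numeric, factor, character, and logical are left unclassified).

```r
# Hypothetical sketch of the documented rules; not the package implementation.
classify_vars_sketch <- function(data, max_unique_for_cat = 10, exclude_vars = NULL) {
  stopifnot(is.data.frame(data))

  # Drop excluded columns first so they are neither classified nor returned.
  keep <- setdiff(names(data), exclude_vars)
  data_subset <- data[, keep, drop = FALSE]

  # Number of distinct values per column.
  # Assumption: NA counts as a value (the documentation does not say).
  n_unique <- vapply(data_subset, function(x) length(unique(x)), integer(1))
  is_num <- vapply(data_subset, is.numeric, logical(1))
  # Factor, character, and logical columns are always categorical.
  is_label <- vapply(
    data_subset,
    function(x) is.factor(x) || is.character(x) || is.logical(x),
    logical(1)
  )

  continuous <- is_num & n_unique > max_unique_for_cat
  categorical <- (is_num & n_unique <= max_unique_for_cat) | is_label

  list(
    continuous_vars = names(data_subset)[continuous],
    cat_vars = names(data_subset)[categorical],
    data_subset = data_subset
  )
}

# Deterministic toy data, so the classification is easy to verify by hand.
df <- data.frame(
  id = 1:20,
  height = seq(150, 188, length.out = 20),  # 20 unique values
  level = rep(1:4, 5),                      # 4 unique values
  group = rep(c("a", "b"), 10),
  stringsAsFactors = FALSE
)
res <- classify_vars_sketch(df, exclude_vars = "id")
res$continuous_vars     # "height"
res$cat_vars            # "level" "group"
names(res$data_subset)  # "height" "level" "group"
```

Note that the comparison against max_unique_for_cat is inclusive on the categorical side, so a numeric column with exactly 10 unique values is categorical under the default setting.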

Examples

if (FALSE) { # \dontrun{
set.seed(42)  # make the simulated data reproducible
example_data <- data.frame(
  id = 1:100,
  age = rnorm(100, 50, 10),
  grade = sample(1:3, 100, replace = TRUE),
  status = sample(c("Active", "Inactive"), 100, replace = TRUE),
  score = runif(100, 0, 100)
)
result <- detect_variable_types(example_data,
                                 max_unique_for_cat = 10,
                                 exclude_vars = "id")
result$continuous_vars  # c("age", "score")
result$cat_vars         # c("grade", "status")
names(result$data_subset)  # c("age", "grade", "status", "score")
} # }