Automatically Detect Variable Types in a Dataset
Source: R/synthetic_data_perturbation.R
detect_variable_types.Rd
Analyzes a data frame to automatically classify variables as continuous or categorical, and returns a subset of the data with specified variables excluded.
Arguments
- data
A data frame to analyze
- max_unique_for_cat
Integer. Maximum number of unique values for a numeric variable to be considered categorical. Default is 10.
- exclude_vars
Character vector of variable names to exclude from both classification and the returned dataset (e.g., ID variables, timestamps). Default is NULL.
Value
A list containing:
- continuous_vars
Character vector of variable names classified as continuous
- cat_vars
Character vector of variable names classified as categorical
- data_subset
Data frame with exclude_vars columns removed
Details
The function classifies variables using the following rules:
- Numeric variables with more than max_unique_for_cat unique values are classified as continuous
- Numeric variables with max_unique_for_cat or fewer unique values are classified as categorical
- Factor, character, and logical variables are always classified as categorical
- Variables listed in exclude_vars are omitted from classification and removed from the returned dataset
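The rules above can be sketched in a few lines of base R. This is an illustrative re-implementation, not the package source: the name classify_columns_sketch is hypothetical, and the sketch assumes that any non-numeric column (not only factor, character, and logical) is treated as categorical and that NA counts as one unique value.

```r
# Hypothetical sketch of the documented rules; NOT the package's implementation.
classify_columns_sketch <- function(data,
                                    max_unique_for_cat = 10,
                                    exclude_vars = NULL) {
  stopifnot(is.data.frame(data))

  # Drop excluded columns first so they are neither classified nor returned.
  keep <- setdiff(names(data), exclude_vars)
  data_subset <- data[, keep, drop = FALSE]

  # A column is continuous only if it is numeric AND has more than
  # max_unique_for_cat distinct values. Assumption: unique() counts NA
  # as a distinct value.
  is_continuous <- vapply(
    data_subset,
    function(x) is.numeric(x) && length(unique(x)) > max_unique_for_cat,
    logical(1)
  )

  # Assumption: every remaining column is treated as categorical.
  list(
    continuous_vars = names(data_subset)[is_continuous],
    cat_vars        = names(data_subset)[!is_continuous],
    data_subset     = data_subset
  )
}
```

Because the comparison is strict, a numeric column with exactly max_unique_for_cat unique values falls on the categorical side, which matches the "or fewer" wording of the second rule.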
Examples
if (FALSE) { # \dontrun{
  # Build a small data set with an ID column, two continuous variables,
  # one low-cardinality integer, and one character variable.
  example_data <- data.frame(
    id = 1:100,
    age = rnorm(100, 50, 10),
    grade = sample(1:3, 100, replace = TRUE),
    status = sample(c("Active", "Inactive"), 100, replace = TRUE),
    score = runif(100, 0, 100)
  )

  # Classify the columns, leaving out the ID column.
  result <- detect_variable_types(
    example_data,
    max_unique_for_cat = 10,
    exclude_vars = "id"
  )

  result$continuous_vars     # c("age", "score")
  result$cat_vars            # c("grade", "status")
  names(result$data_subset)  # c("age", "grade", "status", "score")
} # }