Automatically Detect Variable Types in a Dataset
Source: R/synthetic_data_perturbation.R
detect_variable_types.Rd
Analyzes a data frame to automatically classify variables as continuous or categorical, and returns a subset of the data with specified variables excluded.
Arguments
- data
A data frame to analyze
- max_unique_for_cat
Integer. Maximum number of unique values for a numeric variable to be considered categorical. Default is 10.
- exclude_vars
Character vector of variable names to exclude from both classification and the returned dataset (e.g., ID variables, timestamps). Default is NULL.
Value
A list containing:
- continuous_vars
Character vector of variable names classified as continuous
- cat_vars
Character vector of variable names classified as categorical
- data_subset
Data frame with exclude_vars columns removed
Details
The function classifies variables using the following rules:
- Numeric variables with more than max_unique_for_cat unique values are classified as continuous
- Numeric variables with max_unique_for_cat or fewer unique values are classified as categorical
- Factor, character, and logical variables are always classified as categorical
- Variables listed in exclude_vars are omitted from classification and removed from the returned dataset
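
The rules above can be sketched in a few lines of base R. This is a minimal illustration, not the package's actual source: the function name classify_vars_sketch is hypothetical, and two behaviours are assumptions because the documentation does not state them: missing values are not counted as a unique value, and columns of other types (for example dates) are left unclassified.

```r
# Hypothetical re-implementation of the documented rules; not the package source.
classify_vars_sketch <- function(data, max_unique_for_cat = 10, exclude_vars = NULL) {
  stopifnot(is.data.frame(data))

  # Drop excluded columns first so they are neither classified nor returned.
  keep <- setdiff(names(data), exclude_vars)
  data_subset <- data[, keep, drop = FALSE]

  continuous_vars <- character(0)
  cat_vars <- character(0)

  for (v in names(data_subset)) {
    x <- data_subset[[v]]
    if (is.numeric(x)) {
      # Assumption: NA is not counted as a distinct value.
      n_unique <- length(unique(x[!is.na(x)]))
      if (n_unique > max_unique_for_cat) {
        continuous_vars <- c(continuous_vars, v)
      } else {
        cat_vars <- c(cat_vars, v)
      }
    } else if (is.factor(x) || is.character(x) || is.logical(x)) {
      # Non-numeric factor, character, and logical columns are always categorical.
      cat_vars <- c(cat_vars, v)
    }
    # Assumption: any other column type is left unclassified in this sketch.
  }

  list(
    continuous_vars = continuous_vars,
    cat_vars = cat_vars,
    data_subset = data_subset
  )
}

# Small made-up data frame to exercise each rule.
df <- data.frame(
  id = 1:20,                          # excluded below
  age = seq(20, 58, by = 2),          # 20 unique values: continuous
  n_children = rep(0:3, 5),           # 4 unique values: categorical
  sex = rep(c("F", "M"), 10),         # character: categorical
  smoker = rep(c(TRUE, FALSE), 10),   # logical: categorical
  stringsAsFactors = FALSE
)

res <- classify_vars_sketch(df, max_unique_for_cat = 10, exclude_vars = "id")
res$continuous_vars     # "age"
res$cat_vars            # "n_children" "sex" "smoker"
names(res$data_subset)  # "age" "n_children" "sex" "smoker"
```

Note that without exclude_vars = "id", the id column would have 20 unique numeric values and would be classified as continuous, which is why identifier columns are worth excluding explicitly.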