6  Data Setup for Modeling

What we know is little, and what we are ignorant of is immense.

— Pierre-Simon Laplace

Suppose a churn prediction model reports 95% accuracy, yet consistently fails to identify customers who actually churn. What went wrong? In many cases, the issue lies not in the algorithm itself but in how the data were structured before modeling began. Before reliable machine learning models can be built, the dataset must be organized to support learning, robust assessment, and generalization to new data.

This chapter focuses on the fourth stage of the Data Science Workflow shown in Figure 2.3: Data Setup for Modeling. At this stage, the goal is no longer to clean or explore the data, but to prepare it for model development in a way that supports fair comparison, trustworthy assessment, and reproducible results.

To accomplish this, we focus on several key components of model-ready data setup: understanding model fit and generalization, partitioning the data into development and assessment subsets, validating whether the resulting split is representative, preparing predictors for modeling, and addressing class imbalance. Throughout the chapter, we also emphasize how to prevent data leakage by ensuring that all data-dependent decisions are learned from the training data only.

The previous chapters laid the groundwork for this stage. In Section 2.4, we defined the modeling objective. In Chapter 3 and Chapter 4, we cleaned and explored the data. Chapter 5 introduced inferential tools that now help us assess whether training and test sets are statistically comparable.

Data setup is a crucial but often underestimated step in machine learning. A model can appear successful during development yet fail in practice if the data are partitioned poorly, if the test set influences training decisions, or if important preprocessing steps are handled incorrectly. Proper data setup helps prevent overfitting, biased assessment, and data leakage, all of which can undermine predictive performance and lead to misleading conclusions.

This stage often raises important questions, especially for readers new to predictive modeling: What does it mean for a model to generalize? Why is it necessary to partition the data? When and how should features be encoded or scaled? What can we do when one class is severely underrepresented?

These questions are not merely technical. They reflect fundamental principles of modern data science, including fairness, reproducibility, and reliable generalization. By examining model fit, data partitioning, feature preparation, and class imbalance, we lay the groundwork for building models that perform well not only on observed data, but also on new data encountered in practice.

What This Chapter Covers

This chapter completes Step 4 of the Data Science Workflow: Data Setup for Modeling. We begin by introducing the ideas of model fit and generalization, showing why predictive modeling requires more than fitting the observed data closely. In particular, we examine underfitting and overfitting and explain why a useful model must perform well not only on observed data, but also on new data.

We then turn to cross-validation and data partitioning for robust assessment. We introduce the train-test split as a basic partitioning strategy, discuss how to check whether the resulting split is reasonably representative, and present cross-validation as a more stable approach when performance estimates from a single split may be unreliable. Throughout this discussion, we connect these ideas to the inferential tools introduced in Chapter 5.

Next, we examine data leakage as a cross-cutting risk in predictive modeling. We show how leakage can arise during partitioning, preprocessing, balancing, or model tuning, and establish the guiding principle that all data-dependent transformations must be learned from the training data only and then applied unchanged to validation or test data.

We then prepare predictors for modeling by encoding categorical variables and scaling numerical features. We present ordinal and one-hot encoding techniques, along with min-max and z-score transformations, so that predictors are represented in a form suitable for common machine learning algorithms.

Finally, we address class imbalance, a common challenge in classification tasks where one outcome dominates the dataset. We examine strategies such as oversampling, undersampling, and class weighting to ensure that minority classes are adequately represented during model training. Together, these components provide the foundation for building, comparing, and interpreting predictive models in the chapters that follow.

6.1 Model Fit and Generalization: Underfitting and Overfitting

After cleaning the data and exploring its main patterns, it may seem that the dataset is ready for modeling. In practice, however, clean data are not necessarily ready for predictive modeling. Before we train a model, we must also consider how to evaluate whether it has learned meaningful structure and whether it will perform well on new observations.

A central goal of machine learning is generalization: the ability of a model to perform well not only on the data used to develop it, but also on new data. This is what distinguishes a useful predictive model from one that merely reproduces patterns in the observed sample. A model may appear highly accurate when assessed on familiar data, yet perform poorly when applied in practice. For this reason, model building is not only about fitting patterns in the data. It is also about determining whether those patterns reflect signal that extends beyond the sample at hand.

This concern leads directly to the notion of model fit. In machine learning, model fit refers to how well a model captures the relationship between predictors and the outcome in the available data. A model with good fit reflects the main structure of the data closely enough to make accurate predictions, but not so closely that it begins to capture random noise. In other words, model fit is not simply about achieving the highest possible performance on observed data. It is about learning patterns that are stable enough to remain useful when the model is applied to new observations.

When model complexity is poorly matched to the underlying structure of the data, two common problems arise: underfitting and overfitting. Underfitting occurs when a model is too simple to capture important relationships in the data. As a result, it performs poorly even on the observed data because it fails to represent the main signal. Overfitting occurs when a model is too complex and adapts too closely to the observed sample. In that case, the model may appear to perform extremely well on the data used to fit it, but this apparent success is misleading because part of what it has learned is noise rather than structure.

To illustrate these ideas, consider a simple classification example. Suppose we have a two-dimensional dataset containing two classes of observations. The light-green square points belong to one class, while the soft-orange circle points belong to another class. The goal is to construct a decision boundary that separates the two classes as accurately as possible.

Figure 6.1 presents three possible decision boundaries for this dataset. The left panel shows a very simple boundary that fails to capture the structure of the data and misclassifies many observations. This is an example of underfitting. The middle panel shows a boundary with an appropriate level of flexibility, capturing the main pattern without becoming unnecessarily complex. The right panel shows a highly irregular boundary that perfectly classifies the observed points. Although this may seem desirable at first, it is actually problematic because the model is adapting to noise and small idiosyncrasies in the sample. This is an example of overfitting.

Figure 6.1: Illustration of underfitting, appropriate model complexity, and overfitting. The left panel shows an overly simple decision boundary, the middle panel shows a well-balanced model, and the right panel shows an overly complex boundary that fits noise in the observed data.

These examples show that good predictive modeling requires more than fitting the observed data as closely as possible. A model that is too simple may miss important relationships, whereas a model that is too complex may capture patterns that do not extend beyond the sample. In both cases, predictive performance on new data suffers.

The same idea can be shown more broadly through the relationship between model complexity and predictive performance. Figure 6.2 illustrates the typical pattern: as model complexity increases, performance on observed data usually improves, but performance on new data improves only up to a point and may then begin to decline.

Figure 6.2: Relationship between model complexity and predictive performance on observed data and new data. Performance on observed data typically improves as model complexity increases, while performance on new data first improves and then declines because of overfitting.

As complexity grows, the model can adapt more closely to the observed data, so performance on those data tends to increase steadily. Performance on new data, however, often follows a different pattern. It may improve at first as the model becomes flexible enough to capture meaningful structure, but beyond a certain point it begins to decline because the model is fitting noise rather than signal. The most useful model is therefore not the one that fits the observed data best, but the one that performs best on new data.

This distinction between observed performance and performance on new data is the key to understanding generalization. It also explains why evaluating a model on the same data used to develop it is not enough. To assess whether a model is likely to perform well in practice, we need an evaluation strategy that approximates prediction on unseen observations. In the next section, we introduce two common strategies for this purpose: the train-test split and cross-validation.

6.2 Cross-Validation and Data Partitioning

As discussed in the previous section, a predictive model should not be judged only by how well it fits the observed data. To assess whether a model is likely to generalize, we need a strategy that approximates performance on unseen observations. In supervised learning, this requires separating model development from final assessment so that predictive performance is not judged on the same data used to fit the model.

A central tool for this purpose is cross-validation. By repeatedly dividing the data into development and validation subsets, cross-validation provides a more stable basis for assessing model performance than a single random split. At the same time, cross-validation does not replace the need for careful data partitioning. In practice, we still need a clear separation between the data used for model development and the data reserved for final assessment. These two ideas, cross-validation and data partitioning, work together to reduce overfitting, limit data leakage, and support more trustworthy conclusions about model performance. We return to data leakage explicitly later in the chapter, but it already matters at this stage.

The simplest and most widely used partitioning strategy is the train-test split. In this approach, one part of the dataset is used to train the model, while the remaining part is reserved for assessment. This provides a straightforward way to estimate how well the model may perform on new data. However, the result from a single split can depend strongly on how the data happen to be divided. This sensitivity is especially problematic when the dataset is small or when class proportions and feature distributions vary noticeably across subsets.

To obtain a more stable basis for assessment, data scientists often turn to cross-validation. Cross-validation is a resampling strategy in which the data are repeatedly divided into development and validation subsets, allowing the model to be assessed across multiple splits rather than just one. By averaging performance over repeated partitions, cross-validation provides a more robust picture of how well a model is likely to perform beyond the observed sample.

Throughout the modeling chapters of this book, we follow a three-step workflow that forms the foundation of supervised learning in practice:

  1. Partition the dataset and validate the split.
  2. Train and tune models using the training data.
  3. Assess predictive performance on unseen data.

This three-step structure is central to the modeling strategy used throughout the remainder of the book. The training set is used for model development, including fitting and tuning, while the test set is reserved for final assessment. By keeping these roles separate, we obtain a more realistic picture of how well a model is likely to perform on new data. Figure 6.3 provides a visual summary of this workflow in the setting of a train-test split, which we introduce next.

Figure 6.3: Supervised learning workflow based on a train-test split. The dataset is first partitioned into training and test sets. Models are then developed using the training data and assessed on the test data to examine how well they generalize to unseen observations.

The Train-Test Split as a Basic Evaluation Strategy

We begin with the most common and intuitive evaluation strategy in supervised learning: the train-test split, also known as the holdout method. In this approach, the dataset is divided into two subsets. The training set is used to build the model, while the test set is reserved for evaluating how well that model performs on new, unseen data. This separation allows us to approximate the real predictive setting in which a model is applied to observations that were not available during model development.

The choice of split ratio depends on the size of the dataset and the purpose of the analysis. Common choices include 70–30, 80–20, and 90–10. Larger training sets provide more information for model fitting, whereas larger test sets provide a more stable basis for evaluation. In practice, the choice reflects a trade-off between learning and assessment. Regardless of the exact ratio, both subsets should contain the same predictors and outcome variable, but only the training set is used to estimate the model. The test set must be kept separate until the evaluation stage so that predictive performance is assessed fairly. In classification problems, it is often helpful to create the split in a way that preserves the class proportions of the outcome variable across the training and test sets.

This separation also establishes an important rule for the remainder of the chapter: partitioning should occur before any data-dependent preprocessing steps, such as scaling, encoding, imputation, or class balancing. If such transformations are determined using the full dataset before partitioning, then information from the test set may influence the training process. This creates data leakage and leads to overly optimistic performance estimates. By splitting the data first, we preserve the integrity of model assessment.

We illustrate the train-test split using R and the liver package. We return to the churn dataset introduced in Section 4.3; beginning in Chapter 7, we use it to build machine learning models that predict customer churn. We load the dataset as follows:
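A minimal version of this step, assuming the liver package is installed:

```r
# Load the liver package, which provides both the partition() function
# and the churn dataset used throughout this chapter
library(liver)

data(churn)

# Inspect the structure of the dataset before partitioning
str(churn)
```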

The dataset is relatively clean. A small number of observations in the variables education, income, and marital are recorded as "unknown". For simplicity, we treat "unknown" here as a valid category rather than converting it to a missing value. At this stage, our focus is not yet on feature preparation, but on how to partition the data appropriately for model development and evaluation.

There are several ways to perform a train-test split in R, including functions from packages such as rsample or caret, as well as custom code written in base R. In this book, we use the partition() function from the liver package because it provides a simple and consistent interface that we will use throughout the modeling chapters.

The partition() function divides a dataset into subsets according to a specified ratio. In the code below, we create an 80–20 split, with 80% of the observations assigned to the training set and 20% assigned to the test set:

set.seed(42)

splits <- partition(data = churn, ratio = c(0.8, 0.2))

train_set <- splits$part1
test_set  <- splits$part2

test_labels <- test_set$churn

The command set.seed(42) ensures that the same random split is obtained each time the code is run, which supports reproducibility. In principle, any integer could be used as the seed. Here, we use 42, a reference to The Hitchhiker’s Guide to the Galaxy, where 42 is described as the “Answer to the Ultimate Question of Life, the Universe, and Everything.” The specific value itself is not important; what matters is that the same seed is used whenever we want random results to be reproducible. The object test_labels stores the true class labels from the test set. These values are used later when evaluating model predictions and should remain unseen during model training.

Practice: Using the partition() function, repeat the train-test split with a 70–30 ratio. Compare the sizes of the training and test sets using nrow(train_set) and nrow(test_set). Reflect on how the choice of split ratio may influence both model learning and evaluation stability.

Although the train-test split is simple and widely used, it is not enough to partition the data and proceed immediately. We should also examine whether the resulting subsets remain reasonably representative of the original dataset. The next subsection therefore discusses how to validate the quality of the train-test split.

Validating the Train-Test Split

Creating a train-test split is an essential first step in model evaluation, but the split should not be accepted uncritically. After partitioning the data, we should examine whether the training and test sets remain reasonably representative of the original dataset. A well-constructed split helps ensure that the training set reflects the broader data-generating structure and that the test set provides a realistic basis for evaluating predictive performance. Without this check, a model may be trained on an unrepresentative subset or assessed on a test set that does not reflect the setting in which the model will be used.

Validating a split typically involves comparing the distributions of key variables across the training and test sets. In practice, this often includes the outcome variable as well as a small set of influential predictors. Because many datasets contain numerous features, it is rarely necessary, or even practical, to test every variable. Instead, we usually focus on variables that are central to the modeling problem or especially important for interpretation. The choice of statistical test depends on the type of variable being examined, as summarized in Table 6.1.

Table 6.1: Suggested hypothesis tests (from Chapter 5) for validating partitions, based on the type of feature.

  Type of Feature                          Suggested Test
  ---------------------------------------  ----------------------
  Numerical                                Two-sample \(t\)-test
  Binary                                   Two-sample Z-test
  Categorical (with \(> 2\) categories)    Chi-square test

These tests should be interpreted with care. Parametric procedures such as the two-sample \(t\)-test and the two-sample Z-test rely on assumptions and may be sensitive to sample size. In large samples, even minor differences can become statistically significant, whereas in smaller samples meaningful differences may go undetected. For this reason, validation should be viewed as a practical diagnostic step rather than a rigid pass-fail rule. The goal is not to prove that the two subsets are identical, but to check whether any differences are substantial enough to threaten fair model assessment.

To illustrate the process, consider again the churn dataset. We begin by examining whether the proportion of churners is similar in the training and test sets. Since the target variable churn is binary, a two-sample Z-test is appropriate. The hypotheses are \[ \begin{cases} H_0: \pi_{\text{churn, train}} = \pi_{\text{churn, test}} \\ H_a: \pi_{\text{churn, train}} \neq \pi_{\text{churn, test}} \end{cases} \]

The following code performs the test:

x1 <- sum(train_set$churn == "yes")
x2 <- sum(test_set$churn == "yes")

n1 <- nrow(train_set)
n2 <- nrow(test_set)

test_churn <- prop.test(x = c(x1, x2), n = c(n1, n2))
test_churn
   
    2-sample test for equality of proportions with continuity correction
   
   data:  c(x1, x2) out of c(n1, n2)
   X-squared = 0.045831, df = 1, p-value = 0.8305
   alternative hypothesis: two.sided
   95 percent confidence interval:
    -0.02051263  0.01598907
   sample estimates:
      prop 1    prop 2 
   0.1602074 0.1624691

Here, \(x_1\) and \(x_2\) denote the numbers of churners in the training and test sets, and \(n_1\) and \(n_2\) are the corresponding sample sizes. The prop.test() function compares the two proportions and returns a \(p\)-value for assessing whether the observed difference is statistically meaningful.

The resulting \(p\)-value is 0.83. Since this value exceeds the conventional significance level of \(\alpha = 0.05\), we do not reject \(H_0\). This suggests that the difference in churn rates between the training and test sets is not statistically significant. In other words, the partition appears reasonably balanced with respect to the target variable.

Beyond the outcome variable, it is also helpful to compare the distributions of a few influential predictors. For example, numerical variables such as age or available_credit can be examined using two-sample \(t\)-tests, while categorical variables such as education can be compared using Chi-square tests. Detecting substantial imbalances is important because unequal distributions may cause the model to learn from a subset that does not adequately reflect the data on which it will later be evaluated. Although it is rarely feasible to test every predictor in high-dimensional settings, examining a carefully chosen subset provides a useful and practical check on the validity of the partition.

Practice: Use a Chi-square test to evaluate whether the distribution of income differs between the training and test sets. Create a contingency table with table() and apply chisq.test(). Reflect on how differences in income levels across the two sets might influence model training.

Practice: Examine whether the mean of the numerical variable transaction_amount_12 is similar in the training and test sets. Use the t.test() function with the two samples. Consider how imbalanced averages in key financial variables might affect predictions for new customers.

What If the Split Is Not Sufficiently Balanced?

If validation reveals meaningful differences between the training and test sets, the partition should be reconsidered. Even when the split is generated randomly, uneven distributions can arise by chance, especially in smaller datasets. One simple response is to repeat the split with a different random seed or adjust the split ratio slightly to obtain a more representative partition.

Another option is to use stratified sampling. This approach preserves the proportions of key categorical variables, especially the outcome variable, across the training and test sets. Stratified sampling is particularly useful in classification problems, where maintaining similar class proportions supports fairer evaluation.
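As an illustrative sketch (separate from the partition() workflow used elsewhere in this chapter), a stratified split can be constructed in base R by sampling within each outcome class, so that the class proportions of churn are preserved in both subsets:

```r
# Stratified 80-20 split in base R: sample row indices within each
# outcome class so that class proportions carry over to both subsets
set.seed(42)

idx_by_class <- split(seq_len(nrow(churn)), churn$churn)

train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = floor(0.8 * length(idx)))
}))

strat_train <- churn[train_idx, ]
strat_test  <- churn[-train_idx, ]

# The class proportions should now be nearly identical in both subsets
prop.table(table(strat_train$churn))
prop.table(table(strat_test$churn))
```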

When a single split appears unstable or when the available sample size is limited, it may be more appropriate to move beyond the holdout method altogether. In such settings, cross-validation provides a more robust alternative by averaging performance over multiple partitions rather than relying on a single random split.

Validation is therefore more than a procedural checkpoint. It is a safeguard that helps ensure that model evaluation remains credible and that the conclusions drawn from the analysis are trustworthy. The next subsection introduces \(k\)-fold cross-validation, a strategy designed to reduce the instability that can arise from relying on a single train-test split.

k-Fold Cross-Validation

When a single train-test split produces an unstable or overly sample-dependent performance estimate, a more robust alternative is k-fold cross-validation. This is the most widely used form of cross-validation in machine learning because it provides a more stable assessment of predictive performance while making efficient use of the available data.

In k-fold cross-validation, the dataset is randomly partitioned into \(k\) non-overlapping subsets of approximately equal size, called folds. The model is trained on \(k - 1\) folds and evaluated on the remaining fold, which serves as the validation set. This process is repeated \(k\) times so that each fold is used once for validation and \(k - 1\) times for training. The resulting performance values are then averaged to produce an overall estimate. Common choices for \(k\) are 5 or 10, as illustrated in Figure 6.4 for the case \(k = 5\).

Figure 6.4: Illustration of k-fold cross-validation. The dataset is divided into \(k\) folds (\(k = 5\) shown). In each iteration, the model is trained on \(k - 1\) folds (green) and evaluated on the remaining fold (yellow).

This procedure differs from the train-test split in an important way. In the holdout method, the model is trained once and evaluated once, so the estimated performance can depend strongly on a single random partition. In k-fold cross-validation, the model is evaluated repeatedly across multiple partitions of the data. As a result, the final estimate is typically less sensitive to the particular way the data were divided.

Another advantage of k-fold cross-validation is that each observation contributes to both training and validation, though in different iterations. This makes more efficient use of the available data than a single holdout split, which is especially helpful when the sample size is limited. At the same time, cross-validation remains computationally feasible for many practical modeling tasks, which explains its widespread use.
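The mechanics of fold assignment can be sketched in base R; here the model-fitting step is left as a placeholder, and only the fold sizes are reported:

```r
# Sketch of 5-fold cross-validation on the training set
set.seed(42)
k <- 5

# Randomly assign each row of the training set to one of k folds
folds <- sample(rep(1:k, length.out = nrow(train_set)))

for (i in 1:k) {
  cv_train <- train_set[folds != i, ]  # k - 1 folds for model fitting
  cv_valid <- train_set[folds == i, ]  # held-out fold for validation
  # In practice: fit the model on cv_train, score it on cv_valid, and
  # store the performance value; the k values are averaged afterwards.
  cat("Fold", i, "- train:", nrow(cv_train), "validation:", nrow(cv_valid), "\n")
}
```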

In this book, we often begin with the train-test split because it provides a simple and intuitive framework for understanding model evaluation. We then use cross-validation when we need a more stable estimate of predictive performance or when we want to compare models and tune hyperparameters more reliably. In practice, however, cross-validation should usually be applied within the training portion of the data while the test set remains untouched for final assessment. The next subsection explains this important distinction.

Cross-Validation Within the Training Set

Although cross-validation provides a more stable estimate of predictive performance than a single train-test split, it must be applied carefully. In particular, the test set should remain completely untouched during model development. If the test set influences model selection, hyperparameter tuning, or preprocessing decisions, the final evaluation no longer reflects truly unseen data. This leads to data leakage and produces overly optimistic estimates of model performance.

For this reason, cross-validation is typically used within the training set, not on the full dataset. A common workflow proceeds in three stages. First, the data are partitioned into a training set and a test set. Second, cross-validation is carried out using only the training set to compare candidate models or tune hyperparameters. Finally, once the modeling decisions have been made, the selected model is evaluated once on the untouched test set. This strategy is illustrated in Figure 6.5.

Figure 6.5: Cross-validation applied within the training set while the test set is reserved for final evaluation.

This workflow separates model development from final model assessment. As a result, it provides a more realistic estimate of how well the chosen model is likely to perform on new observations. It also helps protect the integrity of the evaluation process, since the test set is not used to guide tuning or selection decisions.

In practice, this means that preprocessing steps must also be handled carefully during cross-validation. If scaling, encoding, imputation, or balancing are applied before the resampling process, information from the validation folds may leak into the training folds. The correct approach is to estimate such transformations within each training portion of the resampling procedure and then apply them to the corresponding validation portion. The same principle applies later when the final model is evaluated on the test set: all data-dependent transformations should be learned from the training data only and then applied unchanged to the test data.

An example of this workflow appears in Section 7.6, where the hyperparameter \(k\) in a k-Nearest Neighbors model is tuned using cross-validation within the training set. Throughout the modeling chapters that follow, we return to this logic repeatedly: we first partition the data, then use the training set for model development and tuning, and finally evaluate the selected model on the reserved test set. This structure supports fair comparison, reduces the risk of overfitting during model selection, and leads to more trustworthy conclusions.

6.3 Data Leakage and How to Prevent It

A common reason why models appear to perform well during development yet disappoint in practice is data leakage: information from outside the training process unintentionally influences model fitting or model selection. Leakage leads to overly optimistic performance estimates because evaluation no longer reflects truly unseen data.

Data leakage can occur in two broad ways. First, feature leakage arises when predictors contain information that is directly tied to the outcome or would only be known after the prediction is made. Second, procedural leakage occurs when preprocessing decisions are informed by the full dataset before the train-test split, allowing the test set to influence the training process.

The guiding principle for preventing leakage is simple: all data-dependent operations must be learned from the training set only. Once a rule is estimated from the training data, such as an imputation value, a scaling parameter, or a selected subset of features, that same rule should be applied unchanged to the test set.

Leakage can arise even earlier than this chapter’s workflow, during data preparation. For example, suppose missing values are imputed using the overall mean of a numerical feature computed from the full dataset. If the test set is included when computing that mean, the training process has indirectly incorporated information from the test set. Although the numerical difference may seem small, the evaluation is no longer strictly out-of-sample. The correct approach is to compute imputation values using the training set only and then apply them to both the training and test sets.
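This imputation example can be sketched as follows, where x stands for a hypothetical numerical predictor with missing values (it is not a variable in the churn dataset):

```r
# Leakage-safe mean imputation: the imputation value is computed from
# the training set only, then applied unchanged to both subsets.
impute_value <- mean(train_set$x, na.rm = TRUE)   # learned from training data only

train_set$x[is.na(train_set$x)] <- impute_value   # apply to the training set
test_set$x[is.na(test_set$x)]   <- impute_value   # apply the SAME value to the test set
```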

This discipline must be maintained throughout the data setup phase. The test set should remain untouched while models are developed and compared and should be used only once for final evaluation. Cross-validation and hyperparameter tuning must be conducted entirely within the training set. Class balancing techniques such as oversampling or undersampling must also be applied exclusively to the training data. Likewise, encoding rules and scaling parameters should be estimated from the training set and then applied to the test set without recalibration. Any deviation from this workflow allows information from the test data to influence model development and compromises the validity of performance estimates.

Practice: Identify two preprocessing steps in this chapter (or in Chapter 3) that could cause data leakage if applied before partitioning. For each step, describe how you would modify the workflow so that the transformation is learned from the training set only and then applied unchanged to the test set.

A practical example of leakage prevention is discussed in Section 7.5, where feature scaling is performed correctly for a k-Nearest Neighbors model. The same principle applies throughout the modeling workflow: partition first, learn preprocessing rules using the training data only, tune models using cross-validation within the training set, and evaluate only once on the untouched test set.

6.4 Encoding Categorical Features

Categorical features often need to be transformed into numerical format before they can be used in machine learning models. Algorithms such as k-Nearest Neighbors and neural networks require numerical inputs, and failing to encode categorical data properly can lead to misleading results or even errors during model training.

Encoding categorical variables is a critical part of data setup for modeling. It allows qualitative information (such as ratings, group memberships, or item types) to be incorporated into models that operate on numerical representations. In this section, we explore common encoding strategies and illustrate their use with examples from the churn dataset, which includes the categorical variables marital and education.

The choice of encoding method depends on the nature of the categorical variable. For ordinal variables—those with an inherent ranking—ordinal encoding preserves the order of categories using numeric values. For example, the income variable in the churn dataset ranges from <40K to >120K and benefits from ordinal encoding.

In contrast, nominal variables, which represent categories without intrinsic order, are better served by one-hot encoding. This approach creates binary indicators for each category and is particularly effective for features such as marital, where categories like married, single, and divorced are distinct but unordered.

The following subsections demonstrate these encoding techniques in practice, beginning with ordinal encoding and one-hot encoding. Together, these transformations ensure that categorical predictors are represented in a form that machine learning algorithms can interpret effectively.

6.5 Ordinal Encoding

For ordinal features with a meaningful ranking (such as low, medium, high), it is preferable to assign numeric values that reflect their order. This preserves the ordinal relationship in calculations, which would otherwise be lost with one-hot encoding.

There are two common approaches to ordinal encoding. The first assigns simple rank values (e.g., low = 1, medium = 2, high = 3). This approach preserves order but assumes equal spacing between categories. The second assigns values that reflect approximate magnitudes when such information is available.

Consider the income variable in the churn dataset, which has levels <40K, 40K-60K, 60K-80K, 80K-120K, and >120K. A common approach is to assign simple rank-based values from 1 through 5. However, this assumes that the distance between <40K and 40K-60K is the same as the distance between 80K-120K and >120K, which may not reflect true economic differences.
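The rank-based variant can be sketched in base R as follows, using the hypothetical column name income_order to keep it distinct from the midpoint encoding shown later in this section:

```r
# Rank-based ordinal encoding: 1 = lowest bracket, 5 = highest.
# Income levels not listed here (such as "unknown") become NA and
# would need separate handling before modeling.
churn$income_order <- as.integer(factor(churn$income,
  levels = c("<40K", "40K-60K", "60K-80K", "80K-120K", ">120K")
))
```

This encoding preserves order but, as noted above, treats adjacent brackets as equally spaced.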

When category ranges represent meaningful numerical intervals, we may instead assign representative values (for example, approximate midpoints) as follows:

churn$income_rank <- factor(churn$income, 
  levels = c("<40K", "40K-60K", "60K-80K", "80K-120K", ">120K"), 
  labels = c(20, 50, 70, 100, 140)
)

# Convert via as.character(): as.numeric() applied directly to a factor
# returns the level codes (1-5), not the midpoint labels
churn$income_rank <- as.numeric(as.character(churn$income_rank))

This alternative better reflects economic distance between categories and may be more appropriate for linear or distance-based models, where numerical spacing directly influences model behavior.

The choice depends on the modeling objective. If only rank matters, simple ordinal encoding is sufficient. If approximate magnitude is meaningful, representative numerical values may provide a more realistic transformation.

Practice: Apply ordinal encoding to the cut variable in the diamonds dataset. The levels of cut are Fair, Good, Very Good, Premium, and Ideal. Assign numeric values from 1 to 5, reflecting their order from lowest to highest quality. Then reflect on whether the distances between these quality levels should be treated as equal.

Ordinal encoding should be applied only when the order of categories is genuinely meaningful. Using it for nominal variables such as “red,” “green,” and “blue” would impose an artificial numerical hierarchy and could distort model interpretation.

In summary, ordinal encoding always preserves order and, when values are carefully chosen, can also approximate magnitude. Thoughtful encoding ensures that numerical representations align with the substantive meaning of the data rather than introducing unintended assumptions. For features without inherent order, a different approach is needed. The next section introduces one-hot encoding, a method designed specifically for nominal features.

6.6 One-Hot Encoding

How can we represent unordered categories, such as marital status, so that machine learning algorithms can use them effectively? One-hot encoding is a widely used solution. It transforms each unique category into a separate binary column, allowing algorithms to process categorical data without introducing an artificial order.

This method is particularly useful for nominal variables, categorical features with no inherent ranking. For example, the variable marital in the churn dataset includes categories such as married, single, and divorced. One-hot encoding creates binary indicators for each category: marital_married, marital_single, marital_divorced. Each column indicates the presence (1) or absence (0) of a specific category.

If there are \(m\) levels, one-hot encoding creates \(m\) binary columns. For linear models, it is common to drop one dummy column to avoid perfect redundancy. For distance-based methods such as kNN, using the full set of indicator columns is often acceptable, provided the same encoding is applied consistently to both training and test sets.
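The dropped-column convention can be seen with base R's model.matrix(), which by default produces \(m - 1\) indicators plus an intercept (a quick sketch for illustration; the liver workflow used later in this section keeps all \(m\) columns):

```r
# model.matrix() drops the first factor level by default, so a
# four-level factor such as marital yields three indicator columns
dummies <- model.matrix(~ marital, data = churn)
head(colnames(dummies))
```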

Let us take a quick look at the marital variable in the churn dataset:

table(churn$marital)
   
    married   single divorced  unknown 
       4687     3943      748      749

The output shows the distribution of observations across the categories. We will now use one-hot encoding to convert these into model-ready binary features. This transformation ensures that all categories are represented without assuming any order or relationship among them.

One-hot encoding is essential for models that rely on distance metrics (e.g., k-nearest neighbors, neural networks) or for linear models that require numeric inputs.

One-Hot Encoding in R

To apply one-hot encoding in practice, we can use the one.hot() function from the liver package. This function automatically detects categorical variables and creates a new column for each unique level, converting them into binary indicators.

# One-hot encode the "marital" variable from the churn dataset
churn_encoded <- one.hot(churn, cols = c("marital"), dropCols = FALSE)

str(churn_encoded)
   'data.frame':    10127 obs. of  26 variables:
    $ customer_ID          : int  768805383 818770008 713982108 769911858 709106358 713061558 810347208 818906208 710930508 719661558 ...
    $ age                  : int  45 49 51 40 40 44 51 32 37 48 ...
    $ gender               : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 2 2 2 2 ...
    $ education            : Factor w/ 7 levels "uneducated","highschool",..: 2 4 4 2 1 4 7 2 1 4 ...
    $ marital              : Factor w/ 4 levels "married","single",..: 1 2 1 4 1 1 1 4 2 2 ...
    $ marital_married      : int  1 0 1 0 1 1 1 0 0 0 ...
    $ marital_single       : int  0 1 0 0 0 0 0 0 1 1 ...
    $ marital_divorced     : int  0 0 0 0 0 0 0 0 0 0 ...
    $ marital_unknown      : int  0 0 0 1 0 0 0 1 0 0 ...
    $ income               : Factor w/ 6 levels "<40K","40K-60K",..: 3 1 4 1 3 2 5 3 3 4 ...
    $ card_category        : Factor w/ 4 levels "blue","silver",..: 1 1 1 1 1 1 3 2 1 1 ...
    $ dependent_count      : int  3 5 3 4 3 2 4 0 3 2 ...
    $ months_on_book       : int  39 44 36 34 21 36 46 27 36 36 ...
    $ relationship_count   : int  5 6 4 3 5 3 6 2 5 6 ...
    $ months_inactive      : int  1 1 1 4 1 1 1 2 2 3 ...
    $ contacts_count_12    : int  3 2 0 1 0 2 3 2 0 3 ...
    $ credit_limit         : num  12691 8256 3418 3313 4716 ...
    $ revolving_balance    : int  777 864 0 2517 0 1247 2264 1396 2517 1677 ...
    $ available_credit     : num  11914 7392 3418 796 4716 ...
    $ transaction_amount_12: int  1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
    $ transaction_count_12 : int  42 33 20 20 28 24 31 36 24 32 ...
    $ ratio_amount_Q4_Q1   : num  1.33 1.54 2.59 1.41 2.17 ...
    $ ratio_count_Q4_Q1    : num  1.62 3.71 2.33 2.33 2.5 ...
    $ utilization_ratio    : num  0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
    $ churn                : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
    $ income_rank          : num  70 20 100 20 70 50 140 70 70 100 ...

The cols argument specifies which variable(s) to encode. Setting dropCols = FALSE retains the original variable alongside the new binary columns; use TRUE to remove it after encoding. This transformation adds new columns such as marital_divorced, marital_married, and marital_single, each indicating whether a given observation belongs to that category.

Practice: What happens if you encode multiple variables at once? Try applying one.hot() to both marital and card_category, and inspect the resulting structure.

While one-hot encoding is simple and effective, it can substantially increase the number of features, especially when applied to high-cardinality variables (e.g., zip codes or product names). Before encoding, consider whether the added dimensionality is manageable and whether all categories are meaningful for analysis.
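Before encoding, it can help to check how many levels each factor contains; a quick base-R sketch:

```r
# Number of levels per categorical column; large counts signal
# high-cardinality features that may need grouping before one-hot encoding
sapply(Filter(is.factor, churn), nlevels)
```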

Once categorical features are properly encoded, attention turns to numerical variables. These often differ in range and scale, which can affect model performance. The next section introduces feature scaling, a crucial step that ensures comparability across numeric predictors.

6.7 Feature Scaling

What happens when one variable, such as price in dollars, spans tens of thousands, while another, like carat weight, ranges only from 0 to 5? Without scaling, machine learning models that rely on distances or gradients may give disproportionate weight to features with larger numerical ranges, regardless of their actual importance.

Feature scaling addresses this imbalance by adjusting the range or distribution of numerical variables to make them comparable. It is particularly important for algorithms such as k-Nearest Neighbors (Chapter 7) and neural networks (Chapter 13). Scaling can also improve optimization stability in models such as logistic regression and enhance the interpretability of coefficients.

In the churn dataset, for example, available_credit ranges from 3 to about 34,516, while utilization_ratio spans from 0 to 0.999. Without scaling, features such as available_credit may dominate the learning process—not because they are more predictive, but simply because of their larger magnitude.

This section introduces two widely used scaling techniques: Min–Max Scaling and Z-Score Scaling. Min–max scaling rescales values to a fixed range, typically \([0, 1]\). Z-score scaling standardizes features by centering them at zero and scaling them to unit variance.

Choosing between these methods depends on the modeling approach and the data structure. Min–max scaling is preferred when a fixed input range is required, such as in neural networks, whereas z-score scaling is more suitable for algorithms that assume standardized input distributions or rely on variance-sensitive optimization.

Scaling is not always necessary. Tree-based models, including decision trees and random forests, are scale-invariant and do not require rescaled inputs. However, for many other algorithms, scaling improves model performance, convergence speed, and fairness across features.

One caution: scaling can obscure real-world interpretability or exaggerate the influence of outliers, particularly when using min–max scaling. The choice of method should always reflect your modeling objectives and the characteristics of the dataset.

In the following sections, we demonstrate how to apply each technique in R using the churn dataset. We begin with min–max scaling, a straightforward method for bringing all numerical variables into a consistent range.

6.8 Min–Max Scaling

When one feature ranges from 0 to 1 and another spans thousands, models that rely on distances—such as k-Nearest Neighbors—can become biased toward features with larger numerical scales. Min–max scaling addresses this by rescaling each feature to a common range, typically \([0, 1]\), so that no single variable dominates because of its units or magnitude.

The transformation is defined by the formula \[ x_{\text{scaled}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}, \] where \(x\) is the original value and \(x_{\text{min}}\) and \(x_{\text{max}}\) are the minimum and maximum of the feature. This operation ensures that the smallest value becomes 0 and the largest becomes 1.
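In base R, the formula translates directly; a small sketch of the same computation (for illustration only; the examples below use the minmax() helper from the liver package):

```r
# Min-max scaling by hand: the smallest value maps to 0, the largest to 1
minmax_by_hand <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# After scaling, the smallest and largest values are 0 and 1
range(minmax_by_hand(churn$age))
```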

Min–max scaling is particularly useful for algorithms that depend on distance or gradient information, such as k-Nearest Neighbors and neural networks. However, this technique is sensitive to outliers: extreme values can stretch the scale, compressing the majority of observations into a narrow band and reducing the resolution for typical values.

To illustrate min–max scaling, consider the variable age in the churn dataset, which ranges from approximately 26 to 73. We use the minmax() function from the liver package to rescale its values to the \([0, 1]\) interval:

ggplot(data = churn) +
  geom_histogram(aes(x = age), bins = 15) +
  ggtitle("Before Min-Max Scaling")

ggplot(data = churn) +
  geom_histogram(aes(x = minmax(age)), bins = 15) +
  ggtitle("After Min-Max Scaling")

The left panel shows the raw distribution of age, while the right panel displays the scaled version. After transformation, all values fall within the \([0, 1]\) range, making this feature numerically comparable to others—a crucial property when modeling techniques depend on distance or gradient magnitude.

While min–max scaling ensures all features fall within a fixed range, some algorithms perform better when variables are standardized around zero. The next section introduces z-score scaling, an alternative approach based on statistical standardization.

6.9 Z-Score Scaling

While min–max scaling rescales values into a fixed range, z-score scaling—also known as standardization—centers each numerical feature at zero and rescales it to have unit variance. This transformation ensures that features measured on different scales contribute comparably during model training.

Z-score scaling is particularly useful for algorithms that rely on gradient-based optimization or are sensitive to the relative magnitude of predictors, such as linear regression and logistic regression. Unlike min–max scaling, which constrains values to a fixed interval, z-score scaling expresses each observation in terms of its deviation from the mean.

The formula for z-score scaling is \[ x_{\text{scaled}} = \frac{x - \text{mean}(x)}{\text{sd}(x)}, \] where \(x\) is the original feature value, \(\text{mean}(x)\) is the mean of the feature, and \(\text{sd}(x)\) is its standard deviation. The result, \(x_{\text{scaled}}\), represents the number of standard deviations that an observation lies above or below the mean.

Z-score scaling places features with different units or magnitudes on a comparable scale. However, it remains sensitive to outliers, since both the mean and standard deviation can be influenced by extreme values.

To illustrate, let us apply z-score scaling to the age variable in the churn dataset. The mean and standard deviation of age are approximately 46.33 and 8.02, respectively. We use the zscore() function from the liver package:

ggplot(data = churn) +
  geom_histogram(aes(x = age), bins = 15) +
  ggtitle("Before Z-Score Scaling")

ggplot(data = churn) +
  geom_histogram(aes(x = zscore(age)), bins = 15) +
  ggtitle("After Z-Score Scaling")

The left panel shows the original distribution of age, while the right panel displays the standardized version. Notice that the center of the distribution shifts to approximately zero and the spread is expressed in units of standard deviation. The overall shape of the distribution—including skewness—remains unchanged.

It is important to emphasize that z-score scaling does not make a variable normally distributed. It standardizes the location and scale but preserves the underlying distributional shape. If a variable is skewed before scaling, it will remain skewed after transformation.

When applying feature scaling, scaling parameters must be estimated using the training set only. If the mean and standard deviation are computed from the full dataset before partitioning, information from the test set influences the training process. This constitutes a form of data leakage and leads to overly optimistic performance estimates. The correct workflow is to compute the scaling parameters on the training data and then apply the same transformation, without recalibration, to the test set. A broader discussion of data leakage and its prevention is provided in Section 6.3.
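A minimal sketch of this leakage-safe workflow for z-score scaling, assuming the train_set and test_set created earlier in the chapter (the same pattern applies to min–max scaling):

```r
# Estimate scaling parameters from the training set only
age_mean <- mean(train_set$age)
age_sd   <- sd(train_set$age)

# Apply the identical transformation to both subsets, without recalibration
train_set$age_z <- (train_set$age - age_mean) / age_sd
test_set$age_z  <- (test_set$age - age_mean) / age_sd
```

Note that the scaled test values need not be centered exactly at zero, since they are expressed relative to the training mean and standard deviation.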

6.10 Dealing with Class Imbalance

Imagine training a fraud detection model that labels every transaction as legitimate. It might achieve 99% accuracy, yet fail completely at detecting fraud. This illustrates the challenge of class imbalance, a situation in which one class dominates the dataset while the rare class carries the greatest practical importance.

In many real-world classification tasks, the outcome of interest is relatively uncommon. Fraudulent transactions are rare, most customers do not churn, and most medical tests are negative. When a model is trained on such data, it may optimize overall accuracy by predicting the majority class most of the time. Although this strategy yields high accuracy, it fails precisely where predictive insight is most valuable: identifying the minority class. Addressing class imbalance is therefore an important step in data setup for modeling, particularly when the minority class has substantial business or scientific relevance.

Several strategies are commonly used to rebalance the training dataset and ensure that both classes are adequately represented during learning. Oversampling increases the number of minority class observations, either by duplicating existing cases or by generating synthetic examples. SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic minority observations by interpolating between nearest neighbors rather than duplicating existing cases. Undersampling reduces the number of majority class observations and is especially useful when the dataset is large. Hybrid approaches combine both strategies. Another powerful alternative is class weighting, in which the learning algorithm penalizes misclassification of the minority class more heavily. Many models, including logistic regression and decision trees, support class weighting directly.

Let us illustrate with the churn dataset. After partitioning the data, we examine the distribution of the target variable in the training set:

table(train_set$churn)
   
    yes   no 
   1298 6804
prop.table(table(train_set$churn))
   
         yes        no 
   0.1602074 0.8397926

In this dataset, churners represent approximately 16% of the training observations. While this proportion does not imply automatic intervention, it raises concern about the model’s ability to detect churn effectively.

Balancing is not always necessary. There is no universal threshold that defines when a dataset is “too imbalanced.” As a practical heuristic, when the minority class represents roughly 10–15% or less of the observations, imbalance often begins to influence model training and evaluation, particularly in small to moderate-sized datasets. However, the decision to apply balancing techniques should not rely solely on class proportions. A more reliable indicator is model behavior. If a classifier achieves high overall accuracy but exhibits poor recall or precision for the minority class, corrective measures may be warranted. Furthermore, when the minority outcome carries substantial practical cost—such as fraud detection, disease diagnosis, or customer churn—even moderate imbalance may justify intervention. The choice to rebalance should therefore consider class proportions, model performance, and the consequences of misclassification.

To rebalance the training data in R, we can use the ovun.sample() function from the ROSE package to oversample the minority class so that it represents 30% of the training set:

library(ROSE)

balanced_train_set <- ovun.sample(
  churn ~ ., 
  data = train_set, 
  method = "over", 
  p = 0.3
)$data

table(balanced_train_set$churn)
   
     no  yes 
   6804 2864
prop.table(table(balanced_train_set$churn))
   
         no      yes 
   0.703765 0.296235

The argument churn ~ . specifies that balancing should be performed with respect to the target variable while retaining all predictors.
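As an alternative to resampling, class weighting keeps the data unchanged and adjusts the cost of errors instead. A hedged sketch using a decision tree from the rpart package, with one common inverse-frequency weighting (the variable names are illustrative, and other weighting schemes are possible):

```r
library(rpart)

# Weight each observation inversely to its class frequency so that
# misclassifying the rare class costs proportionally more
class_freq <- table(train_set$churn)
w <- as.numeric(nrow(train_set) / (length(class_freq) * class_freq[train_set$churn]))

weighted_tree <- rpart(churn ~ ., data = train_set, weights = w, method = "class")
```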

Balancing must always be performed after partitioning and applied only to the training set. The test set should retain the original class distribution, since it represents the real-world population on which the model will ultimately be evaluated. Altering the test distribution would distort performance estimates and undermine the validity of model evaluation.

In summary, class imbalance requires careful consideration during model development. By ensuring that the training process pays adequate attention to the minority class while preserving the natural distribution in the test set, we support fair evaluation and ensure that reported performance reflects meaningful predictive capability rather than majority-class dominance.

6.11 Chapter Summary and Takeaways

This chapter completed Step 4: Data Setup for Modeling in the Data Science Workflow. We began by examining model fit and generalization, emphasizing that a useful predictive model must perform well not only on observed data but also on new data. In this context, we introduced the concepts of underfitting and overfitting and showed why predictive modeling requires more than fitting the available sample as closely as possible.

We then turned to cross-validation and data partitioning for robust assessment. We introduced the train–test split as a basic strategy for separating model development from final assessment and discussed how to validate whether the resulting subsets were reasonably representative of the original dataset. We also showed how cross-validation provides a more stable basis for assessment when a single split may be too sample-dependent.

Next, we examined data leakage as a key risk in predictive modeling. We showed how leakage can arise when information from the test set influences model development, whether during partitioning, balancing, encoding, scaling, imputation, or model tuning. The guiding principle is straightforward: all data-dependent transformations must be learned from the training data only and then applied unchanged to validation or test data.

We also prepared predictors for modeling by encoding categorical variables and scaling numerical features. Ordinal and one-hot encoding techniques allow qualitative information to be used effectively by learning algorithms, while min–max and z-score transformations place numerical variables on comparable scales.

Finally, we addressed class imbalance, a common challenge in classification tasks where one outcome dominates the dataset. Techniques such as oversampling, undersampling, and class weighting help ensure that minority classes are adequately represented during training.

Together, these steps form the practical foundation for the modeling chapters that follow. Although this chapter does not include a standalone case study, its methods are applied repeatedly in later chapters. For example, the churn classification case study in Section 7.7 shows how data partitioning, cross-validation, leakage prevention, feature preparation, and class balancing support the development of a robust classifier.

In larger projects, preprocessing and model training are often combined within a unified workflow. In R, the mlr3pipelines package supports such structured pipelines, helping reduce data leakage and improve reproducibility. Readers seeking a deeper treatment may consult Applied Machine Learning Using mlr3 in R by Bischl et al. (2024).

With the data now structured for model development and robust assessment, we are ready to construct and compare predictive models. The next chapter begins with one of the most intuitive classification methods: k-Nearest Neighbors.

6.12 Exercises

This section combines conceptual questions and applied programming exercises designed to reinforce the key ideas introduced in this chapter. The goal is to consolidate essential preparatory steps for predictive modeling, focusing on partitioning, validating, balancing, and preparing features to support fair and generalizable learning.

Conceptual Questions

  1. Why is partitioning the dataset crucial before training a machine learning model? Explain its role in ensuring generalization.

  2. What is the main risk of training a model without separating the dataset into training and testing subsets? Provide an example where this could lead to misleading results.

  3. Explain the difference between overfitting and underfitting. How does proper partitioning help address these issues?

  4. Describe the role of the training set and the testing set in machine learning. Why should the test set remain unseen during model training?

  5. What is data leakage, and how can it occur during data partitioning? Provide an example of a scenario where data leakage could lead to overly optimistic model performance.

  6. Why is it necessary to validate the partition after splitting the dataset? What could go wrong if the training and test sets are significantly different?

  7. How would you test whether numerical features, such as age in the churn dataset, have similar distributions in both the training and testing sets?

  8. Why must categorical variables often be converted to numeric form before being used in machine learning models?

  9. What is the key difference between ordinal and nominal categorical variables, and how does this difference determine the appropriate encoding technique?

  10. Explain how one-hot encoding represents categorical variables and why this method avoids imposing artificial order on nominal features.

  11. What is the main drawback of one-hot encoding when applied to variables with many categories (high cardinality)?

  12. When is ordinal encoding preferred over one-hot encoding, and what risks arise if it is incorrectly applied to nominal variables?

  13. Compare min–max scaling and z-score scaling. How do these transformations differ in their handling of outliers?

  14. Why is it important to apply feature scaling after data partitioning rather than before?

  15. What type of data leakage can occur if scaling is performed using both training and test sets simultaneously?

  16. If a dataset is highly imbalanced, why might a model trained on it fail to generalize well? Provide an example from a real-world domain where class imbalance is a serious issue.

  17. Why should balancing techniques be applied only to the training dataset and not to the test dataset?

  18. Some machine learning algorithms are robust to class imbalance, while others require explicit handling of imbalance. Which types of models typically require class balancing, and which can handle imbalance naturally?

  19. When dealing with class imbalance, why is accuracy not always the best metric to evaluate model performance? Which alternative metrics should be considered?

  20. Suppose a dataset has a rare but critical class (e.g., fraud detection). What steps should be taken during the data partitioning and balancing phase to ensure effective model learning?

Hands-On Practice

The following exercises use the churn_mlc, bank, and loan datasets from the liver package. The churn_mlc and bank datasets were introduced earlier, while loan will be used again in Chapter 9.

library(liver)

data(churn_mlc)
data(bank)
data(loan)
Partitioning the Data
  1. Partition the churn_mlc dataset into 75% training and 25% testing. Set a reproducible seed for consistency.

  2. Perform a 90–10 split on the bank dataset. Report the number of observations in each subset.

  3. Use stratified sampling to ensure that the churn rate is consistent across both subsets of the churn_mlc dataset.

  4. Apply a 60–40 split to the loan dataset. Save the outputs as train_loan and test_loan.

  5. Generate density plots to compare the distribution of income between the training and test sets in the bank dataset.

Validating the Partition
  1. Use a two-sample Z-test to assess whether the churn proportion differs significantly between the training and test sets.

  2. Apply a two-sample t-test to evaluate whether average age differs across subsets in the bank dataset.

  3. Conduct a Chi-square test to assess whether the distribution of marital status differs between subsets in the bank dataset.

  4. Suppose the churn proportion is 30% in training and 15% in testing. Identify an appropriate statistical test and propose a corrective strategy.

  5. Select three numerical variables in the loan dataset and assess whether their distributions differ between the two subsets.

Balancing the Training Dataset
  1. Examine the class distribution of churn in the training set and report the proportion of churners.

  2. Apply random oversampling to increase the churner class to 40% of the training data using the ROSE package.

  3. Use undersampling to equalize the deposit = "yes" and deposit = "no" classes in the training set of the bank dataset.

  4. Create bar plots to compare the class distribution in the churn_mlc dataset before and after balancing.

Preparing Features for Modeling
  1. Identify two categorical variables in the bank dataset. Decide whether each should be encoded using ordinal or one-hot encoding, and justify your choice.

  2. Apply one-hot encoding to the marital variable in the bank dataset using the one.hot() function from the liver package. Display the resulting column names.

  3. Perform ordinal encoding on the education variable in the bank dataset, ordering the levels from primary to tertiary. Confirm that the resulting values reflect the intended order.

  4. Compare the number of variables in the dataset before and after applying one-hot encoding. How might this expansion affect model complexity and training time?

  5. Apply min–max scaling to the numerical variables age and balance in the bank dataset using the minmax() function. Verify that all scaled values fall within the \([0, 1]\) range.

  6. Use z-score scaling on the same variables with the zscore() function. Report the mean and standard deviation of each scaled variable and interpret the results.

  7. In your own words, explain how scaling before partitioning could cause data leakage. Suggest a correct workflow for avoiding this issue.

  8. Compare the histograms of one variable before and after applying z-score scaling. What stays the same, and what changes in the distribution?

Self-Reflection

  1. Which of the three preparation steps—partitioning, validation, or balancing—currently feels most intuitive, and which would benefit from further practice? Explain your reasoning.

  2. How does a deeper understanding of data setup influence your perception of model evaluation and fairness in predictive modeling?