7 Classification Using k-Nearest Neighbors

Tell me who your friends are, and I will tell you who you are.

— Spanish proverb

Classification is a foundational task in machine learning that enables algorithms to assign observations to specific categories based on patterns learned from labeled data. Whether filtering spam emails, detecting fraudulent transactions, or predicting customer churn, classification plays a vital role in many real-world decision systems. This chapter introduces classification as a form of supervised learning, emphasizing accessible and practical methods for those beginning their journey into predictive modeling.

This chapter also marks the start of Step 5: Modeling in the Data Science Workflow (Figure 2.3). Building on earlier chapters, where we cleaned and explored data, developed statistical reasoning, and prepared datasets for modeling, we now turn to applying machine learning techniques. In particular, this chapter builds directly on Step 4: Data Setup to Model (Chapter 6), where datasets were partitioned, validated, and prepared (including encoding and scaling) to ensure fair, leakage-free evaluation.

What This Chapter Covers

We begin by defining classification and contrasting it with regression, then introduce common applications and categories of classification algorithms. The focus then shifts to one of the most intuitive and interpretable methods: k-Nearest Neighbors (kNN), a distance-based algorithm that predicts the class of a new observation by examining its closest neighbors in the training set.

To demonstrate the method in action, we apply kNN to the churnCredit dataset, where the goal is to predict whether a customer will discontinue a service. The chapter walks through the full modeling workflow: preparing the data, selecting an appropriate value of k, implementing the model in R, and evaluating its predictive performance. Together, these steps offer a blueprint for real-world classification problems.

By the end of this chapter, readers will have a clear understanding of how classification models operate, how kNN translates similarity into prediction, and how to apply this method effectively to real-world data.

7.1 Classification

How do email applications filter spam, streaming services recommend the next show, or banks detect fraudulent transactions in real time? These intelligent systems rely on classification, a core task in supervised machine learning that assigns input data to one of several predefined categories.

In classification, models learn from labeled data to predict categorical outcomes. For example, given customer attributes, a model might predict whether a customer is likely to churn. This contrasts with regression, which predicts continuous quantities such as income or house price.

The target variable, often called the class or label, can take different forms. In binary classification, the outcome has two possible categories, such as spam versus not spam. In multiclass classification, the outcome includes more than two categories, such as distinguishing between a pedestrian, a car, or a bicycle in an object recognition task.

Classification underpins a wide array of applications. Email clients detect spam based on message features and sender behavior. Financial systems flag anomalous transactions to prevent fraud. Businesses use churn models to identify customers at risk of leaving. In healthcare, models assist in diagnosing diseases from clinical data. Autonomous vehicles rely on object recognition to navigate safely. Recommendation systems apply classification logic to tailor content to users.

These examples illustrate how classification enables intelligent systems to translate structured inputs into meaningful, actionable predictions. As digital data becomes more pervasive, classification remains a foundational technique for building effective and reliable predictive models.

How Classification Works

Classification typically involves two main phases:

Training phase: The model learns patterns from a labeled dataset, where each observation contains input features along with a known class label. For example, a fraud detection system might learn that high-value transactions originating from unfamiliar locations are often fraudulent.
Prediction phase: Once trained, the model is used to classify new, unseen observations. Given the features of a new transaction, the model predicts whether it is fraudulent.

A well-performing classification model captures meaningful patterns in the data rather than simply memorizing the training set. Its value lies in the ability to generalize, that is, to make accurate predictions on new data not encountered during training. This ability to generalize is a defining characteristic of all supervised learning methods.

Classification Algorithms and the Role of kNN

A wide range of algorithms can be used for classification, each with its own strengths depending on the nature of the data and the modeling goals. Some commonly used methods include:

k-Nearest Neighbors: A simple, distance-based algorithm that assigns labels based on the nearest neighbors. It is the focus of this chapter.
Naive Bayes: A probabilistic method well-suited to text classification tasks such as spam detection (see Chapter 9).
Logistic Regression: A widely used model for binary outcomes, known for its interpretability (see Chapter 10).
Decision Trees and Random Forests: Flexible models that can capture complex, nonlinear relationships (see Chapter 11).
Neural Networks: High-capacity algorithms effective for high-dimensional or unstructured data, including images and text (see Chapter 12).

Choosing an appropriate algorithm depends on several factors, including dataset size, the types of features, the need for interpretability, and computational constraints. For small to medium-sized datasets or when transparency is a priority, simpler models such as kNN or Decision Trees may be suitable. For more complex tasks involving large datasets or unstructured inputs, Neural Networks may offer better predictive performance.

To illustrate, consider the bank dataset, where the task is to predict whether a customer will subscribe to a term deposit (deposit = yes). Predictor variables such as age, education, and marital status can be used to build a classification model. Such a model can support targeted marketing by identifying customers more likely to respond positively.

Among these algorithms, kNN stands out for its ease of use and intuitive decision-making process. Because it makes minimal assumptions about the underlying data, kNN is often used as a baseline model, helping to gauge how challenging a classification problem is before considering more complex approaches. In the sections that follow, we explore how the kNN algorithm works, how to implement it in R, and how to apply it to a real-world classification task using the churnCredit dataset.

7.2 How k-Nearest Neighbors Works

Imagine making a decision by consulting a few trusted peers who have faced similar situations. The kNN algorithm works in much the same way: it predicts outcomes based on the most similar observations from previously seen data. This intuitive, experience-based approach makes kNN one of the most accessible methods in classification.

Unlike many algorithms that involve an explicit training phase, kNN follows a lazy learning strategy. It stores the entire training dataset and postpones computation until a prediction is needed. When a new observation arrives, the algorithm calculates its distance from all training points, identifies the k closest neighbors, and assigns the most common class among them. The choice of k, the number of neighbors used, is crucial: small values make the model sensitive to local patterns, while larger values promote broader generalization. Because kNN defers all computation until prediction, it avoids upfront model fitting but shifts the computational burden to the prediction phase.

How Does kNN Classify a New Observation?

When classifying a new observation, the kNN algorithm first computes its distance to all data points in the training set, typically using the Euclidean distance. It then identifies the k nearest neighbors and assigns the most frequent class label among them as the predicted outcome.

Figure 7.1 illustrates this idea using a toy dataset with two classes: Class A (light-orange circles) and Class B (soft-green squares). A new data point, shown as a dark star, must be assigned to one of the two classes. The classification result depends on the chosen value of k:

When \(k = 3\), the three closest neighbors include two green squares and one light-orange circle. Since the majority class is Class B, the new point is labeled accordingly.
When \(k = 6\), the nearest neighbors include four light-orange circles and two green squares, resulting in a prediction of Class A.

Figure 7.1: A two-dimensional toy dataset with two classes (Class A and Class B) and a new data point (dark star), illustrating the kNN algorithm with k = 3 and k = 6.

These examples demonstrate how the choice of k directly affects the classification result. A smaller k makes the model more sensitive to local variation and potentially noisy observations, leading to overfitting. In contrast, a larger k smooths the decision boundaries by incorporating more neighbors but may overlook meaningful local structure. Choosing the right value of k is therefore essential for balancing variance and bias, a topic we revisit later in this chapter.
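The procedure described above (compute distances, keep the k closest points, take a majority vote) can be sketched in a few lines of base R. This is an illustrative sketch rather than the implementation used by any package. The helper knn_predict() and the coordinates are hypothetical, chosen so that the prediction flips between k = 3 and k = 6, as in Figure 7.1.

```r
# Minimal kNN classifier in base R (illustrative sketch, not a package implementation)
knn_predict <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training row
  diffs <- sweep(as.matrix(train_x), 2, new_x)
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  # Majority vote (a tie is resolved in favour of the first factor level)
  names(votes)[which.max(votes)]
}

# Hypothetical coordinates: two Class B points lie closest to the origin,
# followed by four Class A points, with two more Class B points far away
train_x <- data.frame(
  x1 = c(1.0, 0.0, 1.5, 0.0, -2.2, 2.5, 4.0, -4.0),
  x2 = c(0.0, 1.2, 0.0, 2.0,  0.0, 0.0, 4.0,  4.0)
)
train_y <- factor(c("B", "B", "A", "A", "A", "A", "B", "B"))
new_point <- c(0, 0)

pred_k3 <- knn_predict(train_x, train_y, new_point, k = 3)
pred_k6 <- knn_predict(train_x, train_y, new_point, k = 6)
print(c(k3 = pred_k3, k6 = pred_k6))
```

With these made-up points, the three nearest neighbors contain two Class B points, while the six nearest contain four Class A points, so the two calls return different labels.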

Strengths and Limitations of kNN

The kNN algorithm is valued for its simplicity and transparent decision-making process, making it a common starting point in classification tasks. It requires no explicit model training; instead, it stores the training data and performs computations only at prediction time. This approach makes kNN easy to implement and interpret, and it is particularly effective for small datasets with well-separated class boundaries.

However, this simplicity comes with important trade-offs. The algorithm is sensitive to irrelevant or noisy features, which can distort distance calculations and degrade predictive performance. Moreover, since kNN calculates distances to all training examples at prediction time, it can become computationally expensive as the dataset grows.

Another crucial consideration is the choice of k, which directly affects model behavior. A small k may lead to overfitting and heightened sensitivity to noise, whereas a large k may oversmooth the decision boundary, obscuring meaningful patterns. As we discuss later in the chapter, selecting an appropriate value of k is key to balancing variance and bias.

Finally, the effectiveness of kNN often hinges on proper data preprocessing. Feature selection, scaling, and outlier handling all play a significant role in ensuring that distance computations reflect meaningful structure in the data, topics we address in the next sections.

7.3 A Simple Example of kNN Classification

To illustrate how kNN operates in practice, consider a simplified classification example involving drug prescriptions. We use a synthetic dataset of 200 patients that records each patient’s age, sodium-to-potassium (Na/K) ratio in the blood, and the prescribed drug type. Although artificially generated, the dataset mimics patterns commonly found in real clinical data. Details of the data generation process are provided in Section 1.22. The dataset is available in the liver package under the name drug. Figure 7.2 visualizes the distribution of patient records, where each point represents a patient. The dataset includes three drug types—Drug A, Drug B, and Drug C—indicated by different colors and shapes.

Suppose three new patients arrive at the clinic, and we need to determine which drug is most suitable for them based on their age and sodium-to-potassium ratio. Patient 1 is 40 years old with a Na/K ratio of 30.5. Patient 2 is 28 years old with a ratio of 9.6, and Patient 3 is 61 years old with a ratio of 10.5. These patients are shown as dark stars in Figure 7.2, with their three nearest neighbors highlighted in gray.

Figure 7.2: Scatter plot of age versus sodium-to-potassium ratio for 200 patients, with drug type indicated by color and shape. The three new patients are shown as dark stars, and their three nearest neighbors are highlighted with gray circles.

For new Patient 1, located deep within a cluster of green-circle points (Drug A), the classification is straightforward. All nearest neighbors belong to Drug A, making the prediction clear and confident.

For new Patient 2, the outcome depends on the chosen value of k, as shown in the left panel of Figure 7.3. When \(k = 1\), the nearest neighbor is a soft-blue square, so the predicted class is Drug C. With \(k = 2\), there is a tie between Drug B and Drug C, leaving no clear majority. At \(k = 3\), two of the three nearest neighbors are soft-blue squares, so the prediction remains Drug C. What happens if we increase k even further? The model begins to smooth the decision boundary, reducing noise sensitivity but potentially missing finer local details.

For new Patient 3, the classification is more uncertain, as seen in the right panel of Figure 7.3. With \(k = 1\) or \(k = 2\), the patient lies nearly equidistant from both light-orange and soft-blue points, leading to an unstable classification. At \(k = 3\), the three nearest neighbors each represent a different class, making the prediction entirely ambiguous. What would happen if the patient’s sodium-to-potassium ratio were slightly higher or lower? Even a small shift could move this patient closer to one cluster or another, changing the predicted class entirely. This highlights a key limitation of kNN: when observations fall near class boundaries, prediction confidence decreases sharply.

Figure 7.3: Zoomed-in views of new Patient 2 (left) and new Patient 3 (right) with their three nearest neighbors.
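The tie described for new Patient 2 at \(k = 2\) is easy to reproduce by counting votes. The sketch below uses a hypothetical helper, vote_summary(), and a label sequence inferred from the description above (nearest neighbor Drug C, then Drug B, then Drug C); it is not taken from the drug dataset itself.

```r
# Labels of the three nearest neighbours, ordered by distance
# (sequence inferred from the description of new Patient 2)
neighbour_labels <- c("Drug C", "Drug B", "Drug C")

# Hypothetical helper: count votes among the first k labels and flag ties
vote_summary <- function(labels, k) {
  votes <- table(labels[seq_len(k)])
  winners <- names(votes)[as.vector(votes) == max(votes)]
  list(votes = votes, winners = winners, tie = length(winners) > 1)
}

res_k1 <- vote_summary(neighbour_labels, 1)
res_k2 <- vote_summary(neighbour_labels, 2)
res_k3 <- vote_summary(neighbour_labels, 3)

print(res_k2$votes)   # one vote each for Drug B and Drug C
print(res_k3$winners)
```

How a tie is broken depends on the implementation (for example, at random or by the nearest neighbor). In two-class problems, choosing an odd k avoids ties altogether.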

This example highlights key considerations for using kNN effectively. The choice of k strongly influences the decision boundary: smaller values emphasize local variation, while larger values yield smoother classifications. The distance metric determines how similarity is assessed, and proper feature scaling ensures that all variables contribute meaningfully. Together, these design choices play a crucial role in the success of kNN in practice. In the next sections, we explain how kNN measures similarity and explore how to choose the optimal value of k.

7.4 How Does kNN Measure Similarity?

Suppose you are a physician comparing two patients based on age and sodium-to-potassium (Na/K) ratio. One patient is 40 years old with a Na/K ratio of 30.5, and the other is 28 years old with a ratio of 9.6. Which of these patients is more similar to a new case you are evaluating?

In the kNN algorithm, classifying a new observation depends on identifying the most similar records in the training set. While similarity may seem intuitive, machine learning requires a precise definition. Specifically, similarity is quantified using a distance metric, which determines how close two observations are in a multidimensional feature space. These distances govern which records are chosen as neighbors and, ultimately, how a new observation is classified.

In this medical scenario, similarity is measured by comparing numerical features such as age and lab values. The smaller the computed distance between two patients, the more similar they are assumed to be, and the more influence they have on classification. Since kNN relies on the assumption that nearby points tend to share the same class label, choosing an appropriate distance metric is essential for accurate predictions.

7.4.1 Euclidean Distance

A widely used measure of similarity in kNN is Euclidean distance, which corresponds to the straight-line, or “as-the-crow-flies,” distance between two points. It is intuitive, easy to compute, and well-suited to numerical data with comparable scales.

Mathematically, the Euclidean distance between two points \(x\) and \(y\) in \(n\)-dimensional space is given by: \[ \text{dist}(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}, \] where \(x = (x_1, x_2, \ldots, x_n)\) and \(y = (y_1, y_2, \ldots, y_n)\) are the feature vectors.

For example, suppose we want to compute the Euclidean distance between two new patients from the previous section, using their age and sodium-to-potassium (Na/K) ratio. Patient 1 is 40 years old with a Na/K ratio of 30.5, and Patient 2 is 28 years old with a Na/K ratio of 9.6. The Euclidean distance between these two patients is visualized in Figure 7.4 in a two-dimensional feature space, where each axis represents one of the features (age and Na/K ratio). The line connecting Patient 1 \((40, 30.5)\) and Patient 2 \((28, 9.6)\) represents their Euclidean distance: \[ \text{dist}(x, y) = \sqrt{(40 - 28)^2 + (30.5 - 9.6)^2} = \sqrt{144 + 436.81} = \sqrt{580.81} = 24.1 \]

Figure 7.4: Visual representation of Euclidean distance between two patients in 2D space.

This value quantifies how dissimilar the patients are in the two-dimensional feature space, and it plays a key role in determining how the new patient would be classified by kNN.
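The calculation can be checked in base R, either by translating the formula directly or with the built-in dist() function. The variable names are illustrative.

```r
# The two patients from the worked example
patient1 <- c(age = 40, ratio = 30.5)
patient2 <- c(age = 28, ratio = 9.6)

# Direct translation of the Euclidean distance formula
d_manual <- sqrt(sum((patient1 - patient2)^2))

# Same computation with the built-in dist() function
d_builtin <- as.numeric(dist(rbind(patient1, patient2), method = "euclidean"))

print(round(d_manual, 2))
print(round(d_builtin, 2))
```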

Although other distance metrics exist, such as Manhattan distance, Hamming distance, or cosine similarity, Euclidean distance is the most commonly used in practice, especially when working with numerical features. Its geometric interpretation is intuitive and it works well when variables are measured on similar scales. In more specialized contexts, other distance metrics may be more appropriate depending on the structure of the data or the application domain. Readers interested in alternative metrics can explore resources such as the proxy package in R or consult advanced machine learning texts.
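As a brief illustration of how the choice of metric changes the result, base R's dist() function supports several alternatives through its method argument. The sketch below compares three of them for the same two patients; it is meant only to show that the metrics differ, not to recommend one.

```r
# The two patients from the Euclidean example, stacked as rows
patients <- rbind(
  patient1 = c(age = 40, ratio = 30.5),
  patient2 = c(age = 28, ratio = 9.6)
)

d_euclid    <- as.numeric(dist(patients, method = "euclidean"))  # straight-line distance
d_manhattan <- as.numeric(dist(patients, method = "manhattan"))  # |40 - 28| + |30.5 - 9.6|
d_maximum   <- as.numeric(dist(patients, method = "maximum"))    # largest single-feature gap

print(c(euclidean = d_euclid, manhattan = d_manhattan, maximum = d_maximum))
```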

In the next section, we will examine how preprocessing steps like feature scaling ensure that Euclidean distance yields meaningful and balanced comparisons across features.

7.5 Data Setup for kNN

The performance of the kNN algorithm is highly sensitive to how the data is set up. Because kNN relies on distance calculations to assess similarity between observations, careful setup of the feature space is essential. Two key steps—encoding categorical variables and feature scaling—ensure that both categorical and numerical features are properly represented in these computations. These tasks belong to the Data Setup to Model phase introduced in Chapter 6 (see Figure 2.3).

To make this idea concrete, imagine working with patient data that includes age, sodium-to-potassium (Na/K) ratio, marital status, and education level. While age and Na/K ratio are numeric, marital status and education are categorical. To prepare these features for a distance-based model, we must convert them into numerical form in a way that preserves their original meaning.

In most tabular datasets (such as the churnCredit and bank datasets introduced earlier), features include a mix of categorical and numerical variables. A recommended approach is to first encode the categorical features into numeric format and then scale all numerical features. This sequence ensures that distance calculations occur on a unified numerical scale without introducing artificial distortions.

The appropriate encoding strategy depends on whether a variable is binary, nominal, or ordinal. These techniques were detailed in Chapter 6: general guidance in Section 6.6, ordinal handling in Section 6.7, and one-hot encoding in Section 6.8.
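The three cases can be sketched in base R with a small, made-up data frame. This is only an illustration of the idea, using as.integer() and model.matrix() rather than the functions covered in Chapter 6. The level order chosen for education is an assumption made for the example.

```r
# Hypothetical toy records
patients <- data.frame(
  gender    = factor(c("female", "male", "male")),
  education = factor(c("highschool", "graduate", "college"),
                     levels = c("highschool", "college", "graduate")),
  marital   = factor(c("married", "single", "divorced"))
)

# Binary: a single 0/1 indicator is enough
gender_male <- as.integer(patients$gender == "male")

# Ordinal: integer codes that follow the declared level order
education_rank <- as.integer(patients$education)

# Nominal: one indicator column per category (one-hot), without an intercept
marital_onehot <- model.matrix(~ marital - 1, data = patients)

encoded <- cbind(gender_male, education_rank, marital_onehot)
print(encoded)
```

Integer codes for an ordinal variable implicitly treat the gaps between adjacent levels as equal, which is a modeling assumption worth checking for each variable.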

Once categorical variables have been encoded, all numerical features—both original and derived—should be scaled so that they contribute fairly to similarity calculations. Even after encoding, features can differ widely in range. For example, age might vary from 20 to 70, while income could range from 20,000 to 150,000. Without proper scaling, features with larger magnitudes may dominate the distance computation, leading to biased neighbor selection.
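A small numerical sketch makes the problem visible. The three customers below are hypothetical, and the ranges used for scaling are the ones mentioned above (age 20 to 70, income 20,000 to 150,000). Customers a and b are similar in both age and income, whereas c is 40 years older than a.

```r
# Hypothetical customers
customers <- rbind(
  a = c(age = 25, income = 50000),
  b = c(age = 27, income = 52000),
  c = c(age = 65, income = 50500)
)

# Distances on the raw features: income differences dominate
raw_dist <- as.matrix(dist(customers))

# Min-max scaling with the assumed feature ranges
mins <- c(age = 20, income = 20000)
maxs <- c(age = 70, income = 150000)
scaled <- sweep(sweep(customers, 2, mins, "-"), 2, maxs - mins, "/")
scaled_dist <- as.matrix(dist(scaled))

print(round(raw_dist, 1))
print(round(scaled_dist, 3))
```

On the raw scale, c appears to be the nearest neighbor of a because their incomes differ by only 500, even though their ages differ by 40 years. After scaling, b becomes the nearest neighbor of a.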

Two widely used scaling methods address this issue: min–max scaling (introduced in Section 6.10) and z-score scaling (introduced in Section 6.11). Min–max scaling rescales values to a fixed range, typically \([0, 1]\), ensuring that all features contribute on the same numerical scale. Z-score scaling centers each feature at zero and divides it by its standard deviation, so that values are expressed in standard-deviation units.

Min–max scaling is generally suitable when feature values are bounded and free of extreme values, since a single outlier stretches the range and compresses the remaining values. Z-score scaling is often preferred when features are measured in different units or contain outliers, because it is less sensitive to a single extreme value than min–max scaling, although the mean and standard deviation are themselves affected by outliers.
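Both transformations are one-line formulas. The sketch below defines two hypothetical helpers, minmax_scale() and zscore_scale(), and applies them to a made-up income vector whose last value is extreme. They are written for illustration and are not the functions from the liver package.

```r
# Hypothetical helpers implementing the two formulas
minmax_scale <- function(x) (x - min(x)) / (max(x) - min(x))
zscore_scale <- function(x) (x - mean(x)) / sd(x)

# Made-up incomes; the last value is an outlier
income <- c(20000, 35000, 50000, 65000, 150000)

income_minmax <- minmax_scale(income)
income_z      <- zscore_scale(income)

print(round(income_minmax, 3))
print(round(income_z, 3))
```

Here the parameters are computed from the vector itself for simplicity. In a modeling workflow they would be estimated on the training set only.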

Before moving on, it is essential to apply scaling correctly, only after the dataset has been partitioned, to avoid data leakage. The next subsection explains this principle in detail.

7.5.1 Preventing Data Leakage during Scaling

Scaling should be performed after splitting the dataset into training and test sets. This prevents data leakage, a common pitfall in predictive modeling where information from the test set inadvertently influences the model during training. Specifically, parameters such as the mean, standard deviation, minimum, and maximum must be computed only from the training data and then applied to scale both the training and test sets.

The comparison in Figure 7.5 visualizes the importance of applying scaling correctly. The middle panel shows proper scaling using training-derived parameters; the right panel shows the distortion caused by scaling the test data independently.

To illustrate, consider the drug classification task from earlier. Suppose age and Na/K ratio are the two predictors. The following code demonstrates both correct and incorrect approaches to scaling using the minmax() function from the liver package:

library(liver)

# Correct scaling: Apply train-derived parameters to test data
train_scaled = minmax(train_set, col = c("age", "ratio"))

test_scaled = minmax(test_set, col = c("age", "ratio"), 
  min = c(min(train_set$age), min(train_set$ratio)), 
  max = c(max(train_set$age), max(train_set$ratio))
)

# Incorrect scaling: Apply separate scaling to test set
train_scaled_wrongly = minmax(train_set, col = c("age", "ratio"))
test_scaled_wrongly  = minmax(test_set , col = c("age", "ratio"))

Note. Scaling parameters should always be derived from the training data and then applied consistently to both the training and test sets. Failing to do so can result in incompatible feature spaces, leading the kNN algorithm to identify misleading neighbors and produce unreliable predictions.
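The same principle applies to z-score scaling. The base-R sketch below uses small made-up data frames standing in for train_set and test_set, estimates the mean and standard deviation on the training rows only, and reuses them for the test rows through the center and scale arguments of scale().

```r
# Hypothetical stand-ins for the training and test partitions
train_set <- data.frame(age   = c(23, 35, 47, 59, 61),
                        ratio = c(8.5, 12.1, 25.4, 30.2, 15.8))
test_set  <- data.frame(age   = c(28, 66),
                        ratio = c(9.6, 35.0))

num_cols <- c("age", "ratio")

# Parameters estimated on the training data only
train_mean <- colMeans(train_set[num_cols])
train_sd   <- sapply(train_set[num_cols], sd)

# The same parameters are applied to both partitions
train_z <- scale(train_set[num_cols], center = train_mean, scale = train_sd)
test_z  <- scale(test_set[num_cols],  center = train_mean, scale = train_sd)

print(round(test_z, 3))
```

Scaled test values may fall outside the range seen in training. That is expected and does not indicate an error.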

With similarity measurement and data preparation steps now complete, the next task is to determine an appropriate value of \(k\). The following section examines how this crucial hyperparameter influences the behavior and performance of the kNN algorithm.

7.6 Choosing the Right Value of k in kNN

Imagine you are new to a city and looking for a good coffee shop. If you ask just one person, you might get a recommendation based on their personal taste, which may differ from yours. If you ask too many people, you could be overwhelmed by conflicting opinions or suggestions that average out to a generic option. The sweet spot is asking a few individuals whose preferences align with your own. Similarly, in the kNN algorithm, selecting an appropriate number of neighbors (\(k\)) requires balancing specificity and generalization.

The parameter k, which determines how many nearest neighbors are considered during classification, plays a central role in shaping model performance. There is no universally optimal value for k; the best choice depends on the structure of the dataset and the nature of the classification task. Selecting k involves navigating the trade-off between overfitting and underfitting.

When k is too small, such as \(k = 1\), the model becomes overly sensitive to individual training points. Each new observation is classified based solely on its nearest neighbor, making the model highly reactive to noise and outliers. This often leads to overfitting, where the model performs well on the training data but generalizes poorly to new cases. A small cluster of mislabeled examples, for instance, could disproportionately influence the results.

As k increases, the algorithm includes more neighbors in its classification decisions, smoothing the decision boundary and reducing the influence of noisy observations. However, when k becomes too large, the model may begin to overlook meaningful patterns, leading to underfitting. If k approaches the size of the training set, predictions may default to the majority class label.

To determine a suitable value of k, it is common to evaluate a range of options using a validation set or cross-validation. Performance metrics such as accuracy, precision, recall, and the F1-score can guide this choice. These metrics are discussed in detail in Chapter 8. For simplicity, we focus here on accuracy (also called the success rate), which measures the proportion of correct predictions.

As an example, Figure 7.6 presents the accuracy of the kNN classifier for k values ranging from 1 to 30, generated with the kNN.plot() function from the liver package in R. Accuracy fluctuates as k increases and peaks at \(k = 9\).
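The underlying idea, evaluating several values of k on held-out data and keeping the best one, can be sketched in base R. The example below uses simulated two-class data and a hypothetical knn_predict() helper; it does not reproduce Figure 7.6 or the internals of kNN.plot(). The two simulated features share the same scale, so no rescaling is applied here.

```r
set.seed(42)

# Simulated two-class data (hypothetical): two overlapping Gaussian clouds
n <- 200
x <- rbind(matrix(rnorm(n, mean = 0),   ncol = 2),
           matrix(rnorm(n, mean = 1.5), ncol = 2))
y <- factor(rep(c("no", "yes"), each = n / 2))

# Hold out 30% of the rows as a validation set
val_idx <- sample(nrow(x), size = round(0.3 * nrow(x)))
x_train <- x[-val_idx, ]
y_train <- y[-val_idx]
x_val   <- x[val_idx, ]
y_val   <- y[val_idx]

# Hypothetical helper: majority vote among the k nearest training points
knn_predict <- function(train_x, train_y, new_x, k) {
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  votes <- table(train_y[order(dists)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

# Odd values of k avoid ties when there are two classes
k_values <- seq(1, 29, by = 2)
accuracy <- sapply(k_values, function(k) {
  preds <- apply(x_val, 1, function(row) knn_predict(x_train, y_train, row, k))
  mean(preds == y_val)
})

best_k <- k_values[which.max(accuracy)]
print(data.frame(k = k_values, accuracy = round(accuracy, 3)))
print(best_k)
```

A single validation split gives a noisy estimate, so the selected k can change with the random seed. Cross-validation averages over several splits and usually gives a more stable choice.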

Choosing k is ultimately an empirical process informed by validation and domain knowledge. There is no universal rule, but careful experimentation helps identify a value that generalizes well for the problem at hand. A detailed case study in the following section revisits this example and walks through the complete modeling process.

7.7 Case Study: Predicting Customer Churn with kNN

In this case study, we apply the kNN algorithm to a real-world classification problem. Using the churnCredit dataset from the liver package in R, we follow the complete modeling workflow: data setup, model training, and evaluation. This provides a practical context to reinforce concepts introduced earlier in the chapter.

The churnCredit dataset summarizes customer characteristics and service usage across multiple dimensions, including account tenure, product holdings, transaction activity, and customer service interactions. Our goal is to predict whether a customer has churned (yes) or not (no) based on these features. Readers unfamiliar with the dataset are encouraged to review the exploratory analysis in Section 4.3, which provides context and preliminary findings. We begin by inspecting the structure:

library(liver)

data(churnCredit)
str(churnCredit)
   'data.frame':    10127 obs. of  21 variables:
    $ customer.ID          : int  768805383 818770008 713982108 769911858 709106358 713061558 810347208 818906208 710930508 719661558 ...
    $ age                  : int  45 49 51 40 40 44 51 32 37 48 ...
    $ gender               : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 2 2 2 2 ...
    $ education            : Factor w/ 7 levels "uneducated","highschool",..: 2 4 4 2 1 4 7 2 1 4 ...
    $ marital              : Factor w/ 4 levels "married","single",..: 1 2 1 4 1 1 1 4 2 2 ...
    $ income               : Factor w/ 6 levels "<40K","40K-60K",..: 3 1 4 1 3 2 5 3 3 4 ...
    $ card.category        : Factor w/ 4 levels "blue","silver",..: 1 1 1 1 1 1 3 2 1 1 ...
    $ dependent.count      : int  3 5 3 4 3 2 4 0 3 2 ...
    $ months.on.book       : int  39 44 36 34 21 36 46 27 36 36 ...
    $ relationship.count   : int  5 6 4 3 5 3 6 2 5 6 ...
    $ months.inactive      : int  1 1 1 4 1 1 1 2 2 3 ...
    $ contacts.count.12    : int  3 2 0 1 0 2 3 2 0 3 ...
    $ credit.limit         : num  12691 8256 3418 3313 4716 ...
    $ revolving.balance    : int  777 864 0 2517 0 1247 2264 1396 2517 1677 ...
    $ available.credit     : num  11914 7392 3418 796 4716 ...
    $ transaction.amount.12: int  1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
    $ transaction.count.12 : int  42 33 20 20 28 24 31 36 24 32 ...
    $ ratio.amount.Q4.Q1   : num  1.33 1.54 2.59 1.41 2.17 ...
    $ ratio.count.Q4.Q1    : num  1.62 3.71 2.33 2.33 2.5 ...
    $ utilization.ratio    : num  0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
    $ churn                : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

The dataset is an R data frame containing 10127 observations and 20 predictor variables, along with a binary outcome variable, churn. Consistent with the earlier analysis in Chapter 4, we exclude customer.ID (identifier) and available.credit (a deterministic transformation of other credit variables) from the predictor set. The candidate predictors for kNN are:

age, gender, education, marital, income, card.category, dependent.count, months.on.book, relationship.count, months.inactive, contacts.count.12, credit.limit, revolving.balance, transaction.amount.12, transaction.count.12, ratio.amount.Q4.Q1, and ratio.count.Q4.Q1.

Before proceeding to Data Setup to Model (Chapter 6), we harmonize missing and unknown values, following the approach in Section 4.3. Because random imputation is involved, we set a seed for reproducibility. The outcome churn is already stored as a factor with two levels, yes and no, so it needs no conversion.

library(Hmisc)

set.seed(42)  # for reproducibility of random imputations

# Treat "unknown" as missing and drop unused levels
churnCredit[churnCredit == "unknown"] <- NA
churnCredit <- droplevels(churnCredit)

# Random imputation for the categorical fields with missing values, as used in Chapter 4
churnCredit$education <- impute(churnCredit$education, "random")
churnCredit$income    <- impute(churnCredit$income, "random")
churnCredit$marital   <- impute(churnCredit$marital, "random")

In the remainder of this section, we proceed step by step: partitioning the data, applying preprocessing after the split to avoid leakage (scaling numeric features and encoding categorical variables for kNN), selecting an appropriate value of \(k\), fitting the model, generating predictions, and evaluating classification performance.

7.7.1 Data Setup for kNN

To evaluate how well the kNN model generalizes to new observations, we begin by splitting the dataset into training and test sets. This separation provides an unbiased estimate of predictive accuracy by testing the model on data not used during training.

Since the churnCredit dataset has already been cleaned and imputed (see Chapter 3), we proceed directly to data partitioning using the partition() function from the liver package. This function divides the data into an 80% training set and a 20% test set:

set.seed(42)

data_sets = partition(data = churnCredit, ratio = c(0.8, 0.2))

train_set = data_sets$part1
test_set  = data_sets$part2

test_labels = test_set$churn

The partition() function preserves the class distribution of the target variable (churn) across both sets, ensuring that the test set remains representative of the population. This stratified sampling approach is especially important for classification problems with imbalanced outcomes. For a discussion of partitioning and validation strategies, see Section 6.4.
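To make the idea of stratified sampling concrete, the following base-R sketch samples 80% of the rows within each class of a made-up, imbalanced outcome. It illustrates the principle only and is not the implementation of partition(). The data frame toy and its class proportions are hypothetical.

```r
set.seed(42)

# Hypothetical imbalanced outcome: 160 "yes" and 840 "no"
toy <- data.frame(
  id    = 1:1000,
  churn = factor(rep(c("yes", "no"), times = c(160, 840)))
)

# Sample 80% of the row indices separately within each class
rows_by_class <- split(seq_len(nrow(toy)), toy$churn)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Class proportions are the same in both partitions
print(prop.table(table(train_toy$churn)))
print(prop.table(table(test_toy$churn)))
```

Because the sampling is done within each class, the share of churners is identical in the two partitions, which a purely random split would match only approximately.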

Encoding Categorical Features for kNN

Because the kNN algorithm relies on distance calculations between observations, all input features must be numeric. Therefore, categorical variables need to be transformed into numerical representations. In the churnCredit dataset, the variables gender, education, marital, income, and card.category are categorical and require encoding. The one.hot() function from the liver package automates this step by generating binary indicator variables:

categorical_vars = c("gender", "education", "marital", "income", "card.category")

train_onehot = one.hot(train_set, cols = categorical_vars)
test_onehot  = one.hot(test_set,  cols = categorical_vars)

str(test_onehot)
   'data.frame':    2025 obs. of  41 variables:
    $ customer.ID            : int  713061558 816082233 709327383 806165208 804424383 709029408 788658483 715318008 827111283 720572508 ...
    $ age                    : int  44 35 45 47 63 41 53 55 45 38 ...
    $ gender                 : Factor w/ 2 levels "female","male": 2 2 1 2 2 2 1 1 2 1 ...
    $ gender_female          : int  0 0 1 0 0 0 1 1 0 1 ...
    $ gender_male            : int  1 1 0 1 1 1 0 0 1 0 ...
    $ education              : Factor w/ 6 levels "uneducated","highschool",..: 4 4 4 6 4 4 3 3 4 4 ...
     ..- attr(*, "imputed")= int [1:310] 5 11 18 35 44 57 59 83 85 87 ...
    $ education_uneducated   : int  0 0 0 0 0 0 0 0 0 0 ...
    $ education_highschool   : int  0 0 0 0 0 0 0 0 0 0 ...
    $ education_college      : int  0 0 0 0 0 0 1 1 0 0 ...
    $ education_graduate     : int  1 1 1 0 1 1 0 0 1 1 ...
    $ education_post-graduate: int  0 0 0 0 0 0 0 0 0 0 ...
    $ education_doctorate    : int  0 0 0 1 0 0 0 0 0 0 ...
    $ marital                : Factor w/ 3 levels "married","single",..: 1 2 1 3 1 1 1 2 2 2 ...
     ..- attr(*, "imputed")= int [1:156] 2 18 44 57 62 80 99 110 122 164 ...
    $ marital_married        : int  1 0 1 0 1 1 1 0 0 0 ...
    $ marital_single         : int  0 1 0 0 0 0 0 1 1 1 ...
    $ marital_divorced       : int  0 0 0 1 0 0 0 0 0 0 ...
    $ income                 : Factor w/ 5 levels "<40K","40K-60K",..: 2 3 4 3 3 3 1 1 4 1 ...
     ..- attr(*, "imputed")= int [1:217] 3 10 30 34 38 64 69 78 88 102 ...
    $ income_<40K            : int  0 0 0 0 0 0 1 1 0 1 ...
    $ income_40K-60K         : int  1 0 0 0 0 0 0 0 0 0 ...
    $ income_60K-80K         : int  0 1 0 1 1 1 0 0 0 0 ...
    $ income_80K-120K        : int  0 0 1 0 0 0 0 0 1 0 ...
    $ income_>120K           : int  0 0 0 0 0 0 0 0 0 0 ...
    $ card.category          : Factor w/ 4 levels "blue","silver",..: 1 1 1 1 1 1 1 1 1 1 ...
    $ card.category_blue     : int  1 1 1 1 1 1 1 1 1 1 ...
    $ card.category_silver   : int  0 0 0 0 0 0 0 0 0 0 ...
    $ card.category_gold     : int  0 0 0 0 0 0 0 0 0 0 ...
    $ card.category_platinum : int  0 0 0 0 0 0 0 0 0 0 ...
    $ dependent.count        : int  2 3 2 1 1 4 2 1 3 4 ...
    $ months.on.book         : int  36 30 37 42 56 36 38 36 41 28 ...
    $ relationship.count     : int  3 5 6 5 3 4 5 4 2 2 ...
    $ months.inactive        : int  1 1 1 2 3 1 2 2 2 3 ...
    $ contacts.count.12      : int  2 3 2 0 2 2 3 1 2 3 ...
    $ credit.limit           : num  4010 8547 14470 20979 10215 ...
    $ revolving.balance      : int  1247 1666 1157 1800 1010 2517 1490 1914 578 2055 ...
    $ available.credit       : num  2763 6881 13313 19179 9205 ...
    $ transaction.amount.12  : int  1088 1311 1207 1178 1904 1589 1411 1407 1109 1042 ...
    $ transaction.count.12   : int  24 33 21 27 40 24 28 43 28 23 ...
    $ ratio.amount.Q4.Q1     : num  1.376 1.163 0.966 0.906 0.843 ...
    $ ratio.count.Q4.Q1      : num  0.846 2 0.909 0.929 1 ...
    $ utilization.ratio      : num  0.311 0.195 0.08 0.086 0.099 0.282 0.562 0.544 0.018 0.209 ...
    $ churn                  : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

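For each categorical variable with \(m\) categories, the function creates \(m\) binary columns (dummy variables); we write \(m\) here to avoid confusion with \(k\), the number of neighbors. Because the \(m\) indicators of a variable always sum to one, any one of them is redundant. In practice, it is therefore often preferable to use \(m - 1\) dummy variables, which avoids redundancy and multicollinearity while remaining interpretable and compatible with distance-based algorithms.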
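The difference between the full and the reduced encoding is easiest to see on a small example. The following sketch uses base R's model.matrix() on a hypothetical three-level factor rather than one.hot(), so the column names differ from those produced by the liver package.

```r
# Hypothetical factor with three levels (mirrors the 'marital' variable)
marital = factor(c("married", "single", "divorced", "single"),
                 levels = c("married", "single", "divorced"))

# Full one-hot encoding: one indicator column per level (3 columns)
full = model.matrix(~ marital - 1)

# Reduced encoding: drop the intercept column; the first level ("married")
# acts as the reference category and has no indicator of its own
reduced = model.matrix(~ marital)[, -1, drop = FALSE]

print(full)
print(reduced)

# Every row of the full encoding sums to 1, so any one column is redundant
print(rowSums(full))
```

In the reduced encoding, a row of all zeros identifies the reference category, so no information is lost.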

Feature Scaling for kNN

To ensure that all numerical variables contribute equally to distance calculations, we apply min–max scaling. This technique rescales each variable using the minimum and maximum values computed from the training set, so that the training values fall in the \([0, 1]\) range. The same scaling parameters are then applied to the test set to prevent data leakage; test values that lie outside the training range can therefore fall slightly outside \([0, 1]\):

numeric_vars = c("age", "dependent.count", "months.on.book", "relationship.count", 
               "months.inactive", "contacts.count.12", "credit.limit", 
               "revolving.balance", "transaction.amount.12", "transaction.count.12", 
               "ratio.amount.Q4.Q1", "ratio.count.Q4.Q1")

min_train = sapply(train_set[, numeric_vars], min)   # Column-wise minimums
max_train = sapply(train_set[, numeric_vars], max)   # Column-wise maximums

train_scaled = minmax(train_onehot, col = numeric_vars, min = min_train, max = max_train)
test_scaled  = minmax(test_onehot,  col = numeric_vars, min = min_train, max = max_train)

Here, sapply() computes the column-wise minimum and maximum values across the selected numeric variables in the training set. These values define the scaling range. The minmax() function from the liver package then applies min–max scaling to both the training and test sets, using the training-set values as reference.
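To make the arithmetic behind min–max scaling explicit, the sketch below rescales a single feature by hand in base R. The values and the helper `minmax_by_hand()` are hypothetical and are not part of the liver package; the point is only that the minimum and maximum come from the training values and are reused for the test values.

```r
# Hypothetical training and test values for one feature (e.g., age)
train_age = c(26, 35, 44, 53, 73)
test_age  = c(30, 73, 80)

# Scaling parameters come from the training set only
min_age = min(train_age)
max_age = max(train_age)

# (x - min) / (max - min), with min and max supplied by the caller
minmax_by_hand = function(x, min_value, max_value) {
  (x - min_value) / (max_value - min_value)
}

train_age_scaled = minmax_by_hand(train_age, min_age, max_age)
test_age_scaled  = minmax_by_hand(test_age,  min_age, max_age)

print(range(train_age_scaled))   # training values span exactly [0, 1]
print(test_age_scaled)           # a test value above the training maximum exceeds 1
```

A test value larger than the training maximum (here, 80) is mapped above 1. This is expected behavior and not a sign of leakage.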

This step places all variables on a comparable scale, ensuring that those with larger ranges do not dominate the distance calculations. For further discussion of scaling methods and their implications, see Section 6.9 and the preparation overview in Section 7.5. With the data now encoded and scaled, we can proceed to determine the optimal number of neighbors (\(k\)) for the kNN model.

7.7.2 Finding the Best Value for \(k\)

The number of neighbors (\(k\)) is a key hyperparameter in the kNN algorithm. Choosing a very small \(k\) can make the model overly sensitive to noise, whereas a very large \(k\) can oversmooth decision boundaries and obscure meaningful local patterns.

In R, there are several ways to identify the optimal value of \(k\). A common approach is to assess model accuracy across a range of values (for example, from 1 to 30) and select the \(k\) that yields the highest performance. This can be implemented manually with a for loop that records the accuracy for each value of \(k\).
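As a rough illustration of the manual approach, the sketch below loops over candidate values of \(k\) and records the test-set accuracy for each. It makes several assumptions that differ from the case study: it uses the built-in iris data as a stand-in for churnCredit, a simple random 80/20 split, and the knn() function from the class package rather than the liver interface.

```r
library(class)   # provides knn(); the case study itself uses the liver package

set.seed(42)

# Stand-in data: iris (4 numeric features, 3 classes), split 80/20 at random
n = nrow(iris)
train_idx = sample(n, size = round(0.8 * n))
train_x = iris[train_idx, 1:4]
test_x  = iris[-train_idx, 1:4]
train_y = iris$Species[train_idx]
test_y  = iris$Species[-train_idx]

# Min-max scale both sets using training-set minimums and maximums
mins = sapply(train_x, min)
maxs = sapply(train_x, max)
train_x = scale(train_x, center = mins, scale = maxs - mins)
test_x  = scale(test_x,  center = mins, scale = maxs - mins)

# Record test-set accuracy for k = 1, ..., 15
k_values = 1:15
accuracy = numeric(length(k_values))
for (i in seq_along(k_values)) {
  pred = knn(train = train_x, test = test_x, cl = train_y, k = k_values[i])
  accuracy[i] = mean(pred == test_y)
}

# which.max() returns the first maximum, i.e. the smallest k with top accuracy
best_k = k_values[which.max(accuracy)]
print(data.frame(k = k_values, accuracy = round(accuracy, 3)))
print(best_k)
```

Plotting accuracy against k_values gives the same kind of curve that an automated tool would produce.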

The liver package simplifies this process with the kNN.plot() function, which automatically computes accuracy across a specified range of \(k\) values and visualizes the results. This enables quick identification of the best-performing model.

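Before running the function, we define a formula object that specifies the relationship between the target variable (churn) and the predictor variables. The predictors include the scaled numeric variables and the binary indicators generated through one-hot encoding, such as gender_female and education_uneducated. One indicator per categorical variable (for example, gender_male) is left out as the reference category, so each categorical variable contributes one fewer indicator than it has categories. Column names that contain special characters, such as income_<40K, are wrapped in backticks: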

formula = churn ~ gender_female + age + 
    education_uneducated  + education_highschool + education_college  + education_graduate + `education_post-graduate` +
    marital_married + marital_single +
    `income_<40K` + `income_40K-60K` + `income_60K-80K` + `income_80K-120K` + 
    card.category_blue + card.category_silver + card.category_gold +
    dependent.count + months.on.book + relationship.count + months.inactive + contacts.count.12 +
    credit.limit + revolving.balance + transaction.amount.12 + transaction.count.12 + 
    ratio.amount.Q4.Q1 + ratio.count.Q4.Q1

We now apply the kNN.plot() function:

kNN.plot(formula = formula, 
         train = train_scaled, 
         test = test_scaled, 
         k.max = 20, 
         reference = "yes", 
         set.seed = 42)

Figure 7.6: Accuracy of the kNN algorithm on the churnCredit dataset for values of k ranging from 1 to 20.

The arguments in kNN.plot() control various aspects of the evaluation. The train and test inputs specify the scaled datasets, ensuring comparable feature scales for distance computation. The argument k.max = 20 defines the largest number of neighbors to test, allowing us to visualize model performance over a meaningful range. Setting reference = "yes" designates the "yes" class as the positive outcome (customer churn), and set.seed = 42 ensures reproducibility.

The resulting plot shows how model accuracy changes with \(k\). In this case, accuracy peaks at \(k = 5\), suggesting that this value strikes a good balance between capturing local patterns and maintaining generalization. With the optimal \(k\) determined, we can now apply the kNN model to classify new customer records in the test set.

7.7.3 Applying the kNN Classifier

With the optimal value \(k = 5\) identified, we now apply the kNN algorithm to classify customer churn in the test set. This step brings together the work from the previous sections—data preparation, feature encoding, scaling, and hyperparameter tuning. Unlike many machine learning algorithms, kNN does not build an explicit predictive model during training. Instead, it retains the training data and performs classification on demand by computing distances to identify the closest training observations.

In R, we use the kNN() function from the liver package to implement the k-Nearest Neighbors algorithm. This function provides a formula-based interface consistent with other modeling functions in R, making the syntax more readable and the workflow more transparent. An alternative is the knn() function from the class package, which requires specifying input matrices and class labels manually; while effective, that approach is less intuitive for beginners and is not used in this book. The call to kNN() is:

kNN_predict = kNN(formula = formula, train = train_scaled, test = test_scaled, k = 5)

In this command, formula defines the relationship between the response variable (churn) and the predictors. The train and test arguments specify the scaled datasets prepared in earlier steps. The parameter k = 5 sets the number of nearest neighbors, as determined in the tuning step. The kNN() function classifies each test observation by computing its distance to all training records and assigning the majority class among the five nearest neighbors.
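The classification rule described above can be sketched in a few lines of base R. The example below is a simplified illustration, not the implementation used by kNN(): the data are hypothetical, the helper `knn_one()` is introduced only for this sketch, and it uses \(k = 3\) because the toy training set has just six rows.

```r
# Hypothetical scaled training data: two features, six labelled customers
train_x = matrix(c(0.10, 0.20,
                   0.15, 0.25,
                   0.20, 0.10,
                   0.80, 0.90,
                   0.85, 0.80,
                   0.90, 0.95),
                 ncol = 2, byrow = TRUE)
train_y = factor(c("no", "no", "no", "yes", "yes", "yes"), levels = c("yes", "no"))

knn_one = function(new_x, train_x, train_y, k) {
  # Euclidean distance from the new observation to every training row
  distances = sqrt(rowSums(sweep(train_x, 2, new_x, "-")^2))
  # Labels of the k closest training observations
  nearest = train_y[order(distances)[seq_len(k)]]
  # Majority vote (ties go to the first factor level in this simple sketch)
  votes = table(nearest)
  names(votes)[which.max(votes)]
}

new_customer = c(0.75, 0.85)
print(knn_one(new_customer, train_x, train_y, k = 3))
```

The new customer lies close to the three "yes" observations, so all three of its nearest neighbors vote "yes".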

7.7.4 Evaluating the Performance of the kNN Model

With predictions in hand, the final step is to assess how well the kNN model performs. A fundamental and intuitive evaluation tool is the confusion matrix, which summarizes the correspondence between predicted and actual class labels in the test set. We use the conf.mat.plot() function from the liver package to compute and visualize this matrix. The argument reference = "yes" specifies that the positive class refers to customers who have churned:

conf.mat.plot(kNN_predict, test_labels, reference = "yes")

The resulting matrix displays the number of true positives, true negatives, false positives, and false negatives. In this example, the model correctly classified 1766 observations and misclassified 259.
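The two counts reported above can be turned into an overall accuracy with a short calculation. The snippet below uses only the numbers stated in the text; the variable names are illustrative.

```r
correct   = 1766   # correctly classified test observations (from the text)
incorrect = 259    # misclassified test observations (from the text)

total    = correct + incorrect   # equals the number of test observations
accuracy = correct / total

print(total)
print(round(accuracy, 4))
```

The total matches the 2025 observations in the test set, and the resulting accuracy is roughly 87.2%.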

While the confusion matrix provides a useful snapshot of model performance, it does not capture all aspects of classification quality. In Chapter 8, we introduce additional evaluation metrics, including accuracy, precision, recall, and F1-score, that offer a more nuanced assessment.

Summary of the kNN Case Study

This case study has demonstrated the complete modeling pipeline for applying kNN: starting with data partitioning, followed by preprocessing (including encoding and scaling), tuning the hyperparameter k, applying the classifier, and evaluating the results. Each stage plays a critical role in ensuring that the final predictions are both accurate and interpretable.

The confusion matrix offers only a first look at model performance. The next chapter (Chapter 8) introduces metrics such as accuracy, precision, recall, and F1-score, along with tools and techniques for evaluating and comparing machine learning models more rigorously.

7.8 Chapter Summary and Takeaways

This chapter introduced the kNN algorithm, a simple yet effective method for classification. We began by revisiting the concept of classification and its practical applications, distinguishing between binary and multi-class problems. We then examined how kNN classifies observations by identifying their nearest neighbors using distance metrics.

To ensure meaningful distance comparisons, we discussed essential preprocessing steps such as one-hot encoding of categorical variables and feature scaling. We also explored how to select the optimal number of neighbors (\(k\)), emphasizing the trade-off between overfitting and underfitting. These concepts were demonstrated through a complete case study using the liver package in R and the churnCredit dataset, highlighting the importance of thoughtful data preparation and parameter tuning.

The simplicity and interpretability of kNN make it a valuable introductory model. However, its limitations, including sensitivity to noise, reliance on proper scaling, and inefficiency with large datasets, can reduce its practicality for large-scale applications. Despite these drawbacks, kNN remains a strong baseline for classification tasks and a useful reference point for model comparison.

While our focus has been on classification, the kNN algorithm also supports regression. In kNN regression, the target variable is numeric, and predictions are based on averaging the outcomes of the k nearest neighbors. This variant follows the same core principles and offers a non-parametric alternative to traditional regression models.
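A minimal sketch of kNN regression, assuming a single numeric feature and hypothetical data, shows how the majority vote is replaced by an average. The helper `knn_regress()` is written for this illustration only.

```r
# Hypothetical one-feature training data with a numeric target
train_x = c(1, 2, 3, 6, 7, 8)
train_y = c(10, 12, 11, 30, 32, 31)

knn_regress = function(new_x, train_x, train_y, k) {
  # Absolute difference is the Euclidean distance in one dimension
  nearest = order(abs(train_x - new_x))[seq_len(k)]
  # Prediction is the mean outcome of the k nearest neighbors
  mean(train_y[nearest])
}

print(knn_regress(2.2, train_x, train_y, k = 3))   # neighbors at x = 2, 3, 1
print(knn_regress(7.0, train_x, train_y, k = 3))   # neighbors at x = 7, 6, 8
```

With several features, the distance calculation would use scaled inputs in the same way as in classification.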

Another important use case is imputation of missing values, where kNN fills in a missing entry by identifying the most similar observations and combining their values: a majority vote for categorical variables, or an average for numeric ones. This method preserves local structure in the data and often outperforms basic imputation techniques such as mean substitution, especially when the extent of missingness is moderate.
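The idea can be illustrated with a small base R sketch. The data frame `customers` and the choice of two neighbors are assumptions made for this example, and dedicated imputation functions would normally be used in practice.

```r
# Hypothetical data: 'income' is missing for the last customer
customers = data.frame(
  age    = c(25, 27, 45, 47, 26),
  tenure = c(2, 3, 20, 22, 2),
  income = c(30, 32, 80, 84, NA)
)

features = c("age", "tenure")
complete = customers[!is.na(customers$income), ]
target   = customers[is.na(customers$income), ]

# Min-max scale the features using the complete rows only
mins = sapply(complete[, features], min)
maxs = sapply(complete[, features], max)
scaled_complete = scale(complete[, features], center = mins, scale = maxs - mins)
scaled_target   = scale(target[, features],   center = mins, scale = maxs - mins)

# Distance from the incomplete row to each complete row
distances = sqrt(rowSums(sweep(scaled_complete, 2, as.numeric(scaled_target), "-")^2))

# Average the income of the two nearest complete rows
nearest = order(distances)[1:2]
imputed = mean(complete$income[nearest])

customers$income[is.na(customers$income)] = imputed
print(customers)
```

Here the incomplete customer resembles the two younger, short-tenure customers, so the imputed income is the average of their incomes rather than the overall mean.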

In the chapters that follow, we turn to more advanced classification methods. We begin with Naive Bayes (Chapter 9), followed by Logistic Regression (Chapter 10), and Decision Trees (Chapter 11). These models address many of kNN’s limitations and provide more scalable and robust tools for real-world predictive tasks.

7.9 Exercises

The following exercises reinforce key ideas introduced in this chapter. Begin with conceptual questions to test your understanding, continue with hands-on modeling tasks using the bank dataset, and conclude with reflective prompts and real-world considerations for applying kNN.

Conceptual Questions

Explain the fundamental difference between classification and regression. Provide an example of each.
What are the key steps in applying the kNN algorithm?
Why is the choice of \(k\) important in kNN, and what happens when \(k\) is too small or too large?
Describe the role of distance metrics in kNN classification. Why is Euclidean distance commonly used?
What are the limitations of kNN compared to other classification algorithms?
How does feature scaling impact the performance of kNN? Why is it necessary?
How is one-hot encoding used in kNN, and why is it necessary for categorical variables?
How does kNN handle missing values? What strategies can be used to deal with missing data?
Explain the difference between lazy learning (such as kNN) and eager learning (such as decision trees or logistic regression). Give one advantage of each.
Why is kNN considered a non-parametric algorithm? What advantages and disadvantages does this bring?

Hands-On Practice: Applying kNN to the bank Dataset

The following tasks apply the kNN algorithm to the bank dataset from the liver package. This dataset includes customer demographics and banking history, with the goal of predicting whether a customer subscribed to a term deposit. These exercises follow the same modeling steps as the churn case study and offer opportunities to deepen your practical understanding.

To begin, load the necessary package and dataset:

library(liver)

# Load the dataset
data(bank)

# View the structure of the dataset
str(bank)
   'data.frame':    4521 obs. of  17 variables:
    $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
    $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
    $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
    $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
    $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
    $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
    $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
    $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
    $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
    $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
    $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
    $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
    $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
    $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
    $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
    $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
    $ deposit  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Data Exploration and Preparation

Load the bank dataset and display its structure. Identify the target variable and the predictor variables.
Perform an initial EDA:
- What are the distributions of key numeric variables like age, balance, and duration?
- Are there any unusually high or low values that might influence distance calculations in kNN?
Explore potential associations:
- Are there noticeable differences in numeric features (e.g., balance, duration) between customers who subscribed to a deposit versus those who did not?
- Are there categorical features (e.g., job, marital) that seem associated with the outcome?
Count the number of instances where a customer subscribed to a term deposit (deposit = "yes") versus those who did not (deposit = "no"). What does this tell you about class imbalance?
Identify nominal variables in the dataset. Apply one-hot encoding using the one.hot() function. For each categorical feature, leave out one of its dummy variables (keeping one fewer than the number of categories) to avoid redundancy and multicollinearity.
Partition the dataset into 80% training and 20% testing sets using the partition() function. Ensure the target variable remains proportionally distributed in both sets.
Validate the partitioning by comparing the class distribution of the target variable in the training and test sets.
Apply min-max scaling to numerical variables in both training and test sets. Ensure that the scaling parameters are derived from the training set only.

Diagnosing the Impact of Preprocessing

What happens if you skip feature scaling before applying kNN? Train a model without scaling and compare its accuracy to the scaled version.
What happens if you leave categorical variables in their original form (factors or strings) without applying one-hot encoding? Does the model return an error, or does performance decline? Explain why.

Choosing the Optimal k

Use the kNN.plot() function to determine the optimal \(k\) value for classifying deposit in the bank dataset.
What is the best \(k\) value based on accuracy? How does accuracy change as \(k\) increases?
Interpret the meaning of the accuracy curve generated by kNN.plot(). What patterns do you observe?

Building and Evaluating the kNN Model

Train a kNN model using the optimal \(k\) and make predictions on the test set.
Generate a confusion matrix for the kNN model predictions using the conf.mat() function. Interpret the results.
Calculate the accuracy of the kNN model. How well does it perform in predicting deposit?
Compare the performance of kNN with different values of \(k\) (e.g., \(k = 1, 5, 15, 25\)). How does changing \(k\) affect the classification results?
Train a kNN model using only a subset of features: age, balance, duration, and campaign. Compare its accuracy with the full-feature model. What does this tell you about feature selection?
Compare the accuracy of kNN when using min-max scaling versus z-score standardization. How does the choice of scaling method impact model performance?

Critical Thinking and Real-World Applications

Suppose you are building a fraud detection system for a bank. Would kNN be a suitable algorithm? What are its advantages and limitations in this context?
How would you handle imbalanced classes in the bank dataset? What strategies could improve classification performance?
In a high-dimensional dataset with hundreds of features, would kNN still be an effective approach? Why or why not?
Imagine you are working with a dataset where new observations are collected continuously. What challenges would kNN face, and how could they be addressed?
If a financial institution wants to classify customers into different risk categories for loan approval, what preprocessing steps would be essential before applying kNN?
In a dataset where some features are irrelevant or redundant, how could you improve kNN’s performance? What feature selection methods would you use?
If computation time is a concern, what strategies could you apply to make kNN more efficient for large datasets?
Suppose kNN is performing poorly on the bank dataset. What possible reasons could explain this, and how would you troubleshoot the issue?

Self-Reflection

What did you find most intuitive about the kNN algorithm? What aspects required more effort to understand?
How did the visualizations (e.g., scatter plots, accuracy curves, and confusion matrices) help you understand the behavior of the model?
If you were to explain how kNN works to a colleague or friend, how would you describe it in your own words?
How would you decide whether kNN is a good choice for a new dataset or project you are working on?
Which data preprocessing steps, such as encoding or scaling, felt most important in improving kNN’s performance?