9  Naive Bayes Classifier

The measure of belief is the measure of action.

— Thomas Bayes

How can we make fast, reasonably accurate predictions, using minimal data and computation? Imagine a bank deciding, in real time, whether to approve a loan based on a customer’s income, age, and mortgage status. Behind the scenes, such decisions must be made quickly, reliably, and at scale. The Naive Bayes classifier offers a remarkably simple yet surprisingly effective solution, relying on probability theory to make informed predictions in milliseconds.

In Chapter 7, we introduced k-Nearest Neighbors (kNN), a model that classifies based on similarity in feature space. In Chapter 8, we learned how to assess model performance using confusion matrices, sensitivity, specificity, ROC curves, and other evaluation metrics. Now, we turn to a fundamentally different approach: Naive Bayes, a probabilistic classifier grounded in Bayesian theory. Unlike kNN, which has no formal training phase, Naive Bayes builds a model from the data, estimating how likely each class is, given the features. It produces not just decisions but class probabilities, which integrate seamlessly with the evaluation tools we introduced earlier. This chapter gives us the chance to apply those tools while exploring a new perspective on classification.

At its core, Naive Bayes is built on Bayes’ theorem and makes a bold simplifying assumption: that all features are conditionally independent given the class label. This assumption is rarely true in practice, yet the model often works surprisingly well. Why? Because it enables fast training, efficient probability estimation, and interpretable outputs. The algorithm is especially well suited to high-dimensional data, such as text classification, where thousands of features are common. It is also ideal for real-time tasks like spam filtering or financial risk scoring, where speed and simplicity matter.

But every model has trade-offs. Naive Bayes assumes feature independence, an assumption often violated when features are strongly correlated. It also does not naturally handle continuous features unless a specific distribution (often Gaussian) is assumed, which may misrepresent the data. And while it performs well in many scenarios, more flexible models, like decision trees or ensemble methods, can outperform it on datasets with complex feature interactions.

Despite these limitations, Naive Bayes remains a favorite in many real-world applications. In domains such as sentiment analysis, email filtering, and document classification, its assumptions hold well enough, and its simplicity becomes an asset. It is fast, easy to implement, and interpretable, qualities that make it a strong first-choice model and a valuable baseline in the early stages of model development.

The model’s power comes from its foundation in Bayesian probability, specifically Bayes’ Theorem, introduced by the 18th-century statistician Thomas Bayes (Bayes 1958). This theorem offers a principled way to update beliefs in light of new data, combining prior knowledge with observed evidence. It remains one of the most influential ideas in both statistics and machine learning.

What This Chapter Covers

This chapter explores how the Naive Bayes classifier leverages probability to make fast, interpretable predictions, even in high-dimensional and sparse settings. You will deepen your conceptual understanding of Bayesian reasoning while gaining hands-on experience implementing Naive Bayes models in R.

In particular, you will:

  • Understand the mathematical foundation of Naive Bayes, including Bayes’ theorem and its role in classification.

  • Work through step-by-step examples to see how the model estimates class probabilities from training data.

  • Compare the main variants of Naive Bayes (Gaussian, Multinomial, and Bernoulli) and identify when each is appropriate.

  • Analyze key assumptions, strengths, and limitations of the model in practical scenarios.

  • Implement and evaluate a Naive Bayes model using the risk dataset from the liver package.

By the end of this chapter, you will be able to explain how Naive Bayes works, choose the right variant for a given task, and apply it effectively in R. To begin, let us revisit the core principle that drives this classifier: Bayes’ theorem.

9.1 Bayes’ Theorem and Probabilistic Foundations

How should we update our beliefs when new evidence becomes available? Whether assessing financial risk, diagnosing medical conditions, or detecting spam, many real-world decisions require reasoning under uncertainty. Bayes’ Theorem provides a formal, principled framework for refining probability estimates as new information emerges, making it a cornerstone of probabilistic learning and modern machine learning.

This framework underlies what is known as Bayesian inference: the process of starting with prior expectations based on historical data and updating them using new evidence to obtain a more accurate posterior belief. For example, when evaluating whether a loan applicant poses a financial risk, an institution might begin with general expectations derived from population statistics. As additional details, such as mortgage status or outstanding debts, are observed, Bayes’ Theorem enables a systematic update of the initial risk assessment.

This idea traces back to Thomas Bayes, an 18th-century minister and self-taught mathematician. His pioneering work introduced a dynamic interpretation of probability, not merely as the frequency of outcomes, but as a belief that can evolve with new data. Readers interested in the broader implications of Bayesian reasoning may enjoy the book “Everything Is Predictable: How Bayesian Statistics Explain Our World”, which explores how this perspective informs real-life decisions.

Even earlier, the conceptual roots of probability theory developed from attempts to reason about chance in gambling, trade, and risk. In the 17th century, mathematicians such as Gerolamo Cardano, Blaise Pascal, and Pierre de Fermat laid the groundwork for formal probability theory. Cardano, for example, observed that some dice outcomes, such as sums of 9 versus 10, have unequal likelihoods due to differing numbers of permutations. These early insights into randomness and structure laid the intellectual foundation for modern approaches to modeling uncertainty, including the Naive Bayes classifier.

The Essence of Bayes’ Theorem

Bayes’ Theorem offers a systematic method for refining probabilistic beliefs as new evidence is observed, forming the theoretical foundation of Bayesian inference. It addresses a central question in probabilistic reasoning: Given what is already known, how should our belief in a hypothesis change when new data are observed?

The theorem is mathematically expressed as:

\[\begin{equation} \label{eq-bayes-theorem} P(A|B) = \frac{P(A \cap B)}{P(B)}, \end{equation}\]

where:

  • \(P(A|B)\) is the posterior probability, the probability of event \(A\) (the hypothesis) given that event \(B\) (the evidence) has occurred;

  • \(P(A \cap B)\) is the joint probability that both events \(A\) and \(B\) occur;

  • \(P(B)\) is the marginal probability or evidence, quantifying the total probability of observing event \(B\) across all possible outcomes.

To clarify the components of Bayes’ Theorem, Figure 9.1 provides a visual interpretation using a Venn diagram. The overlapping region between the circle and the octagon represents the joint probability \(P(A \cap B)\), while the entire area of the octagon corresponds to the marginal probability \(P(A)\), and the entire area of the circle corresponds to \(P(B)\).

Figure 9.1: The Venn diagram (top) visualizes the joint and marginal probabilities in Bayes’ Theorem. The corresponding probabilities are also expressed in the formula shown below.

The expression for Bayes’ Theorem can also be derived by applying the definition of conditional probability. Specifically, \(P(A \cap B)\) can be written as \(P(A) \times P(B|A)\), leading to an alternative form:

\[\begin{equation} \label{eq-bayes-theorem-2} P(A|B) = \frac{P(A \cap B)}{P(B)} = P(A) \times \frac{P(B|A)}{P(B)}. \end{equation}\]

These equivalent expressions result from two ways of expressing the joint probability \(P(A \cap B)\). This formulation highlights how a prior belief \(P(A)\) is updated using the likelihood \(P(B|A)\) and normalized by the marginal probability \(P(B)\).

Bayes’ Theorem thus provides a principled way to refine beliefs by incorporating new evidence. This principle underpins many probabilistic learning techniques, including the Naive Bayes classifier introduced in this chapter.

Let us now apply Bayes’ Theorem to a practical example: estimating the probability that a customer has a good risk profile (\(A\)) given that they have a mortgage (\(B\)), using the risk dataset from the liver package.

We begin by loading the dataset and inspecting the relevant data:

library(liver)  

data(risk)

xtabs(~ risk + mortgage, data = risk)
              mortgage
   risk        yes no
     good risk  81 42
     bad risk   94 29

To improve readability, we add row and column totals to the contingency table:

addmargins(xtabs(~ risk + mortgage, data = risk))
              mortgage
   risk        yes  no Sum
     good risk  81  42 123
     bad risk   94  29 123
     Sum       175  71 246

Now, we define the relevant events: \(A\) is the event that a customer has a good risk profile, and \(B\) is the event that the customer has a mortgage (mortgage = yes). The prior probability of a customer having good risk is:

\[ P(A) = \frac{\text{Total Good Risk Cases}}{\text{Total Cases}} = \frac{123}{246} = 0.5 \]

Using Bayes’ Theorem, we compute the probability of a customer being classified as good risk given that they have a mortgage:

\[\begin{equation} \label{eq1} \begin{split} P(\text{Good Risk} | \text{Mortgage = Yes}) & = \frac{P(\text{Good Risk} \cap \text{Mortgage = Yes})}{P(\text{Mortgage = Yes})} \\ & = \frac{\text{Good Risk with Mortgage Cases}}{\text{Total Mortgage Cases}} \\ & = \frac{81}{175} \\ & = 0.463 \end{split} \end{equation}\]

This result indicates that among customers with mortgages, the observed proportion of those with a good risk profile is lower than in the general population. Such insights help financial institutions refine credit risk models by incorporating new evidence systematically.
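These calculations are easy to verify in R. The following sketch works directly from the contingency table computed above:

tab = xtabs(~ risk + mortgage, data = risk)

# P(A): prior probability of a good risk profile
p_A = sum(tab["good risk", ]) / sum(tab)        # 123 / 246 = 0.5

# P(B): marginal probability of having a mortgage
p_B = sum(tab[, "yes"]) / sum(tab)              # 175 / 246

# P(A and B): joint probability of good risk and mortgage
p_A_and_B = tab["good risk", "yes"] / sum(tab)  # 81 / 246

# Posterior P(A | B) = P(A and B) / P(B)
p_A_and_B / p_B                                 # 81 / 175 = 0.463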

How Does Bayes’ Theorem Work?

Imagine you are deciding whether to approve a loan application. You begin with a general expectation, perhaps most applicants with steady income and low debt are low risk. But what happens when you learn that the applicant has missed several past payments? Your belief shifts. This type of evidence-based reasoning is precisely what Bayes’ Theorem formalizes.

Bayes’ Theorem provides a structured method to refine our understanding of uncertainty as new information becomes available. In everyday decisions, whether assessing financial risk or evaluating the results of a medical test, we often begin with an initial belief and revise it in light of new evidence.

Bayesian reasoning plays a central role in many practical applications. In financial risk assessment, banks typically begin with prior expectations about borrower profiles, and then revise the risk estimate after considering additional information such as income, credit history, or mortgage status. In medical diagnostics, physicians assess the baseline probability of a condition and then update that estimate based on test results, incorporating both prevalence and diagnostic accuracy. In spam detection, email filters estimate the probability that a message is spam using features such as keywords, sender information, and formatting, and continually refine those estimates as new messages are processed.

Can you think of a situation where you made a decision based on initial expectations, but changed your mind after receiving new information? That shift in belief is the intuition behind Bayesian updating. Bayes’ Theorem turns this intuition into a formal rule. It offers a principled mechanism for learning from data, one that underpins many modern tools for prediction and classification.

From Bayes’ Theorem to Naive Bayes

Bayes’ Theorem provides a mathematical foundation for updating probabilities as new evidence emerges. However, directly applying Bayes’ Theorem to problems involving many features becomes impractical, as it requires estimating a large number of joint probabilities from data, many of which may be sparse or unavailable.

The Naive Bayes classifier addresses this challenge by introducing a simplifying assumption: it treats all features as conditionally independent given the class label. While this assumption rarely holds exactly in real-world datasets, it dramatically simplifies the required probability calculations.

Despite its simplicity, Naive Bayes often delivers competitive results. For example, in financial risk prediction, a bank may evaluate a customer’s creditworthiness using multiple variables such as income, loan history, and mortgage status. Although these variables are often correlated, the independence assumption enables the classifier to estimate probabilities efficiently by breaking the joint distribution into simpler, individual terms.

This efficiency is particularly advantageous in domains like text classification, spam detection, and sentiment analysis, where the number of features can be very large and independence is a reasonable approximation.

Why does such a seemingly unrealistic assumption often work so well in practice? As we will see, this simplicity allows Naive Bayes to serve as a fast, interpretable, and surprisingly effective classifier, even in complex real-world settings.

9.2 Why Is It Called “Naive”?

When assessing a borrower’s financial risk using features such as income, mortgage status, and number of loans, it is reasonable to expect dependencies among them. For example, individuals with higher income may be more likely to have multiple loans or stable mortgage histories. However, Naive Bayes assumes that all features are conditionally independent given the class label (e.g., “good risk” or “bad risk”).

This simplifying assumption is what gives the algorithm its name. While features in real-world data are often correlated, such as income and age, assuming independence significantly simplifies probability calculations, making the method both efficient and scalable.

To illustrate this, consider the risk dataset from the liver package:

str(risk)
   'data.frame':    246 obs. of  6 variables:
    $ age     : int  34 37 29 33 39 28 28 25 41 26 ...
    $ marital : Factor w/ 3 levels "single","married",..: 3 3 3 3 3 3 3 3 3 3 ...
    $ income  : num  28061 28009 27615 27287 26954 ...
    $ mortgage: Factor w/ 2 levels "yes","no": 1 2 2 1 1 2 2 2 2 2 ...
    $ nr.loans: int  3 2 2 2 2 2 3 2 2 2 ...
    $ risk    : Factor w/ 2 levels "good risk","bad risk": 2 2 2 2 2 2 2 2 2 2 ...

This dataset includes financial indicators such as age, income, marital status, mortgage, and number of loans. Naive Bayes assumes that, given a person’s risk classification, these features do not influence one another. Mathematically, the probability of a customer being in the good risk category given their attributes is expressed as:

\[ P(Y = y_1 | X_1, X_2, \dots, X_5) = \frac{P(Y = y_1) \times P(X_1, X_2, \dots, X_5 | Y = y_1)}{P(X_1, X_2, \dots, X_5)}. \]

Directly computing the full joint likelihood \(P(X_1, X_2, \dots, X_5 | Y = y_1)\) is computationally expensive, and the cost grows rapidly with the number of features. In datasets with hundreds or thousands of features, storing and calculating joint probabilities for all possible feature combinations becomes impractical.

The naive assumption of conditional independence simplifies this problem by expressing the joint probability as the product of individual probabilities:

\[ P(X_1, X_2, \dots, X_5 | Y = y_1) = P(X_1 | Y = y_1) \times P(X_2 | Y = y_1) \times \cdots \times P(X_5 | Y = y_1). \]

This transformation eliminates the need to compute complex joint probabilities, making the algorithm scalable even for high-dimensional data. Instead of handling an exponential number of feature combinations, Naive Bayes only requires computing simple conditional probabilities for each feature given the class label.
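To make these simpler quantities concrete, the short sketch below tabulates two of the per-feature conditional probabilities from the risk data (loaded earlier in this chapter); single-feature tables like these are exactly the building blocks that Naive Bayes multiplies together:

# Class-conditional probability of each mortgage status, given the risk label
prop.table(table(risk$risk, risk$mortgage), margin = 1)

# Class-conditional probability of each marital status, given the risk label
prop.table(table(risk$risk, risk$marital), margin = 1)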

In practice, the independence assumption is rarely true, as features often exhibit some degree of correlation. Nevertheless, Naive Bayes remains widely used in domains where feature dependencies are sufficiently weak to preserve classification accuracy, where interpretability and computational efficiency are prioritized over capturing complex relationships, and where minor violations of the independence assumption do not substantially degrade predictive performance.

For example, in credit risk prediction, while income and mortgage status are likely correlated, treating them as independent still allows Naive Bayes to classify borrowers effectively. Similarly, in spam detection or text classification, where features (such as word occurrences) are often close to independent, the algorithm delivers fast and accurate predictions.

By reducing complex joint probability estimation to simpler conditional calculations, Naive Bayes offers a scalable solution. In the next section, we address a key practical issue: how to handle zero-probability problems when certain feature values are absent in the training data.

9.3 The Laplace Smoothing Technique

One challenge in Naive Bayes classification is handling feature values that appear in the test data but are missing from the training data for a given class. For example, suppose no borrowers labeled as “bad risk” are married in the training data. If a married borrower later appears in the test set, Naive Bayes would assign a probability of zero to \(P(\text{bad risk} | \text{married})\). Because the algorithm multiplies probabilities when making predictions, this single zero would eliminate the bad risk class from consideration, leading to a biased or incorrect prediction.

This issue arises because Naive Bayes estimates conditional probabilities directly from frequency counts in the training set. If a category is absent for a class, its conditional probability becomes zero. To address this, Laplace smoothing (or add-one smoothing) is used. Named after Pierre-Simon Laplace, the technique assigns a small non-zero probability to every possible feature-class combination, even if some combinations do not appear in the data.

To illustrate, consider the marital variable in the risk dataset. Suppose no customers labeled as bad risk are married. We can simulate this scenario:

            risk
   marital   good risk bad risk
     single         21       11
     married        51        0
     other           8       10

Without smoothing, the conditional probability becomes:

\[ P(\text{bad risk} | \text{married}) = \frac{\text{count}(\text{bad risk} \cap \text{married})}{\text{count}(\text{married})} = \frac{0}{\text{count}(\text{married})} = 0. \]

This would cause every married borrower to be classified as good risk, regardless of other features.

Laplace smoothing resolves this by adjusting the count of each category. A small constant \(k\) (typically \(k = 1\)) is added to each count, yielding:

\[ P(\text{bad risk} | \text{married}) = \frac{\text{count}(\text{bad risk} \cap \text{married}) + k}{\text{count}(\text{married}) + k \times \text{number of marital categories}}. \]

This adjustment ensures that every possible feature-category pair has a non-zero probability, even if unobserved in the training set.
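Plugging in the counts from the illustrative table above (51 married customers labeled good risk, 0 labeled bad risk, and three marital categories), a quick sketch of the arithmetic shows the effect of smoothing:

count_bad_married = 0    # married customers labeled "bad risk"
count_married     = 51   # all married customers in the training data
k                 = 1    # Laplace smoothing constant
n_categories      = 3    # levels of the marital variable

# Without smoothing: the estimate collapses to zero
count_bad_married / count_married

# With Laplace smoothing (k = 1), following the formula above
(count_bad_married + k) / (count_married + k * n_categories)   # 1 / 54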

In R, you can apply Laplace smoothing through the laplace argument of the naive_bayes() function in the naivebayes package. By default, no smoothing is applied (laplace = 0). To apply smoothing, simply set laplace = 1:

library(naivebayes)

formula_nb = risk ~ age + income + marital + mortgage + nr.loans

model = naive_bayes(formula = formula_nb, data = risk, laplace = 1)

This adjustment improves model robustness, especially when working with limited or imbalanced data. Curious to see how the naivebayes package performs in practice? In the case study later in this chapter, we will walk through how to train and evaluate a Naive Bayes model using the risk dataset, complete with R code, predicted probabilities, and performance metrics.

Laplace smoothing is a simple yet effective fix for the zero-probability problem in Naive Bayes. While \(k = 1\) is a common default, the value can be tuned based on domain knowledge. By ensuring that all probabilities remain well-defined, Laplace smoothing makes Naive Bayes more reliable for real-world prediction tasks.

9.4 Types of Naive Bayes Classifiers

What if your dataset includes text, binary flags, and numeric values? Can a single Naive Bayes model accommodate them all? Not exactly. Different types of features require different probabilistic assumptions; this is where distinct variants of the Naive Bayes classifier come into play. The choice of variant depends on the structure and distribution of the predictors in your data.

Each of the three most common types of Naive Bayes classifiers is suited to a specific kind of feature:

  • Multinomial Naive Bayes is designed for categorical or count-based features, such as word frequencies in text data. It models the probability of counts using a multinomial distribution. In the risk dataset, the marital variable, with levels such as single, married, and other, is an example where this variant is appropriate.

  • Bernoulli Naive Bayes is intended for binary features that capture the presence or absence of a characteristic. This approach is common in spam filtering, where features often indicate whether a particular word is present. In the risk dataset, the binary mortgage variable (yes or no) fits this model.

  • Gaussian Naive Bayes is used for continuous features that are assumed to follow a normal distribution. It models feature likelihoods using Gaussian densities and is well suited for variables like age and income in the risk dataset.

Selecting the appropriate variant based on your feature types ensures that the underlying probability assumptions remain valid and that the model produces reliable predictions.
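To make the Gaussian variant concrete, the sketch below estimates the class-conditional mean and standard deviation of income from the risk data and evaluates the normal density for a hypothetical applicant earning 30,000. This is only an illustration of the underlying idea, not the implementation used by the naivebayes package:

# Class-conditional parameters of income, estimated from the risk data
mu    = tapply(risk$income, risk$risk, mean)
sigma = tapply(risk$income, risk$risk, sd)

# Gaussian likelihood of an income of 30000 under each class
dnorm(30000, mean = mu["good risk"], sd = sigma["good risk"])
dnorm(30000, mean = mu["bad risk"],  sd = sigma["bad risk"])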

The names Bernoulli and Gaussian refer to foundational distributions introduced by two prominent mathematicians: Jacob Bernoulli, known for early work in probability theory, and Carl Friedrich Gauss, associated with the normal distribution. Their contributions form the statistical backbone of different Naive Bayes variants.

In the next section, we apply Naive Bayes to the risk dataset and explore how these variants operate in practice.

9.5 Case Study: Predicting Financial Risk with Naive Bayes

How can a bank predict in advance whether an applicant is likely to repay a loan, or default, before making a lending decision? This is a daily challenge for financial institutions, where each loan approval carries both potential profit and risk. Making accurate predictions about creditworthiness helps banks protect their assets, comply with regulatory standards, and promote responsible lending practices.

In this case study, we apply the complete Data Science Workflow introduced in Chapter 2 (Figure 2.3), following each step, from understanding the problem and preparing the data to training, evaluating, and interpreting the model. Using the risk dataset from the liver package in R, we build a Naive Bayes classifier to categorize customers as either good risk or bad risk. By walking through the workflow step-by-step, this example demonstrates how probabilistic classification can guide credit decisions and help institutions manage financial risk in a structured, data-driven manner.

Problem Understanding

How can financial institutions anticipate which applicants are likely to repay their loans and which may default before extending credit? This challenge lies at the heart of modern lending practices. Effective financial risk assessment requires balancing profitability with caution by using demographic and financial indicators to estimate the likelihood of default.

This case study builds on earlier chapters: Chapter 7 introduced classification with instance-based methods, and Chapter 8 covered how to assess model performance. We now extend these foundations by applying a probabilistic classification technique, Naive Bayes, to a real-world dataset.

Key business questions guiding this analysis include:

  • Which financial and demographic features influence a customer’s risk profile?

  • How can we predict a customer’s risk category before making a loan decision?

  • In what ways can such predictions support more effective lending strategies?

By analyzing the risk dataset, we aim to develop a model that classifies customers as good risk or bad risk based on their likelihood of default. The results can inform data-driven credit scoring, guide responsible lending practices, and reduce non-performing loans.

Data Understanding

Before training a classification model, we begin by exploring the dataset to assess the structure of the variables, identify key distributions, and check for any anomalies that might affect modeling. As introduced earlier in Section 9.2, the risk dataset from the liver package contains financial and demographic attributes used to assess whether a customer is a good risk or bad risk. It includes 246 observations across 6 variables.

The dataset consists of 5 predictors and a binary target variable, risk, which distinguishes between customers who are more or less likely to default. The key variables are:

  • age: Customer’s age in years.
  • marital: Marital status (single, married, other).
  • income: Annual income.
  • mortgage: Indicates whether the customer has a mortgage (yes, no).
  • nr.loans: Number of loans held by the customer.
  • risk: The target variable (good risk, bad risk).

For additional details about the dataset, refer to its documentation.

To obtain an overview of the variable distributions and check for missing values or outliers, we examine the dataset’s summary statistics:

summary(risk)
         age           marital        income      mortgage     nr.loans            risk    
    Min.   :17.00   single :111   Min.   :15301   yes:175   Min.   :0.000   good risk:123  
    1st Qu.:32.00   married: 78   1st Qu.:26882   no : 71   1st Qu.:1.000   bad risk :123  
    Median :41.00   other  : 57   Median :37662             Median :1.000                  
    Mean   :40.64                 Mean   :38790             Mean   :1.309                  
    3rd Qu.:50.00                 3rd Qu.:49398             3rd Qu.:2.000                  
    Max.   :66.00                 Max.   :78399             Max.   :3.000

As the summary indicates a clean and well-structured dataset with no apparent anomalies, we can proceed to data preparation before training the Naive Bayes classifier.

Data Setup to Model

Before training the Naive Bayes classifier, we begin by splitting the dataset into training and testing sets. This step allows us to evaluate how well the model generalizes to unseen data. We use an 80/20 split, allocating 80% of the data for training and 20% for testing. To maintain consistency with previous chapters, we apply the partition() function from the liver package:

set.seed(5)

data_sets = partition(data = risk, ratio = c(0.8, 0.2))

train_set = data_sets$part1
test_set  = data_sets$part2

test_labels = test_set$risk

Setting set.seed(5) ensures reproducibility so that the same partitioning is achieved each time the code is run. The train_set will be used to train the Naive Bayes classifier, while the test_set will serve as unseen data to evaluate the model’s predictions. The test_labels vector contains the true class labels for the test set, which we will compare against the model’s outputs.

As discussed in Section 6.4, it is important to check whether the training and test sets are representative of the original dataset. This can be done by comparing the distribution of the target variable or key predictors. Here, we illustrate the process by validating the marital variable across the two sets. As an exercise, you are encouraged to validate the partition based on the target variable risk to confirm that both classes, good risk and bad risk, are similarly distributed.

To check for representativeness, we use a chi-squared test to compare the distribution of marital statuses (single, married, other) in the training and test sets:

chisq.test(x = table(train_set$marital), y = table(test_set$marital))
   
    Pearson's Chi-squared test
   
   data:  table(train_set$marital) and table(test_set$marital)
   X-squared = 6, df = 4, p-value = 0.1991

This test evaluates whether the proportions of marital categories differ significantly between the two sets. The hypotheses are:

\[ \begin{cases} H_0: \text{The proportions of marital categories are the same in both sets.} \\ H_a: \text{At least one of the proportions is different.} \end{cases} \]

Since the p-value exceeds \(\alpha = 0.05\), we do not reject \(H_0\). This suggests that the marital status distribution is statistically similar between the training and test sets, indicating that the partition preserves the key structure of the dataset.

Unlike distance-based algorithms such as k-nearest neighbors, the Naive Bayes classifier does not rely on geometric distance calculations. Therefore, there is no need to scale numeric variables such as age or income, and no need to convert categorical variables like marital into dummy variables. The algorithm models probability distributions directly, making it robust to different variable types without requiring transformation. This illustrates how preprocessing steps must be tailored to the modeling technique in use.

In contrast, when applying kNN to this dataset (see Chapter 7), it would be necessary to scale numerical variables and encode categorical variables. These considerations are explored further in this chapter’s exercises.

Applying the Naive Bayes Classifier

With the dataset partitioned and validated, we now proceed to train and evaluate the Naive Bayes classifier. This model is particularly well suited to problems like credit risk assessment because it is fast, interpretable, and effective even when variables are a mix of numerical and categorical types.

Several R packages provide implementations of Naive Bayes, with two commonly used options being naivebayes and e1071. In this case study, we use the naivebayes package, which offers a fast and flexible implementation that supports both categorical and continuous features.

The core function, naive_bayes(), estimates the required probability distributions during training and stores them in a model object. Based on the types of the predictors, the algorithm makes the following assumptions:

  • Categorical distributions for nominal variables such as marital and mortgage;

  • Bernoulli distributions for binary variables, which are a special case of categorical features;

  • Poisson distributions for count variables (optionally enabled);

  • Gaussian distributions for continuous features such as age and income;

  • Kernel density estimation for continuous features when no parametric form is assumed.

Unlike the kNN algorithm introduced in Chapter 7, which does not include an explicit training phase, Naive Bayes follows a two-step procedure:

  1. Training phase: The model estimates class-conditional probability distributions from the training data.

  2. Prediction phase: The trained model applies Bayes’ theorem to compute posterior probabilities for new observations.

To train the model, we specify a formula where risk is the target variable and all other columns are treated as predictors:

formula = risk ~ age + income + mortgage + nr.loans + marital

We then fit the model using the naive_bayes() function:

library(naivebayes)

naive_bayes = naive_bayes(formula, data = train_set)

naive_bayes
   
   ===================================================== Naive Bayes ======================================================
   
   Call:
   naive_bayes.formula(formula = formula, data = train_set)
   
   ------------------------------------------------------------------------------------------------------------------------ 
    
   Laplace smoothing: 0
   
   ------------------------------------------------------------------------------------------------------------------------ 
    
   A priori probabilities: 
   
   good risk  bad risk 
   0.4923858 0.5076142 
   
   ------------------------------------------------------------------------------------------------------------------------ 
    
   Tables: 
   
   ------------------------------------------------------------------------------------------------------------------------ 
   :: age (Gaussian) 
   ------------------------------------------------------------------------------------------------------------------------ 
         
   age    good risk  bad risk
     mean 46.453608 35.470000
     sd    8.563513  9.542520
   
   ------------------------------------------------------------------------------------------------------------------------ 
   :: income (Gaussian) 
   ------------------------------------------------------------------------------------------------------------------------ 
         
   income good risk  bad risk
     mean 48888.987 27309.560
     sd    9986.962  7534.639
   
   ------------------------------------------------------------------------------------------------------------------------ 
   :: mortgage (Bernoulli) 
   ------------------------------------------------------------------------------------------------------------------------ 
           
   mortgage good risk  bad risk
        yes 0.6804124 0.7400000
        no  0.3195876 0.2600000
   
   ------------------------------------------------------------------------------------------------------------------------ 
   :: nr.loans (Gaussian) 
   ------------------------------------------------------------------------------------------------------------------------ 
           
   nr.loans good risk  bad risk
       mean 1.0309278 1.6600000
       sd   0.7282057 0.7550503
   
   ------------------------------------------------------------------------------------------------------------------------ 
   :: marital (Categorical) 
   ------------------------------------------------------------------------------------------------------------------------ 
            
   marital    good risk   bad risk
     single  0.38144330 0.49000000
     married 0.52577320 0.11000000
     other   0.09278351 0.40000000
   
   ------------------------------------------------------------------------------------------------------------------------

This function automatically identifies the feature types and estimates appropriate probability distributions for each class. For instance:

  • Categorical features (e.g., marital, mortgage) are modeled using class-conditional probabilities.

  • Numerical features (e.g., age, income, nr.loans) are modeled using Gaussian distributions by default.

To inspect the learned parameters, we can use:

summary(naive_bayes)
   
   ===================================================== Naive Bayes ====================================================== 
    
   - Call: naive_bayes.formula(formula = formula, data = train_set) 
   - Laplace: 0 
   - Classes: 2 
   - Samples: 197 
   - Features: 5 
   - Conditional distributions: 
       - Bernoulli: 1
       - Categorical: 1
       - Gaussian: 3
   - Prior probabilities: 
       - good risk: 0.4924
       - bad risk: 0.5076
   
   ------------------------------------------------------------------------------------------------------------------------

The printed model object reports the estimated means and standard deviations for the numerical predictors and the conditional probabilities for the categorical ones, while summary() gives a compact overview of the fitted model. These estimates form the foundation of the model’s predictions.
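To pull out an individual conditional table programmatically, the fitted object can be queried directly. A minimal sketch, assuming the naivebayes model object stores its fitted distributions in the tables element and the prior class probabilities in the prior element:

# Assumed structure of the naivebayes model object
naive_bayes$tables$income   # class-conditional mean and sd of income

naive_bayes$prior           # a priori class probabilities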

Note that the nr.loans variable is a count, taking integer values between 0 and 3. While the default setting models it with a Gaussian distribution, it may be worth experimenting with the usepoisson = TRUE option to see whether a Poisson distribution offers a better fit, as sketched below. As an exercise, you are encouraged to compare model performance with and without this option.
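A minimal sketch of that comparison, assuming the usepoisson argument of naive_bayes(), which models non-negative integer predictors with a Poisson distribution:

# Refit the model, treating integer-valued predictors such as nr.loans
# as Poisson-distributed (an assumption worth checking against the default)
naive_bayes_pois = naive_bayes(formula, data = train_set, usepoisson = TRUE)

summary(naive_bayes_pois)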

Prediction and Model Evaluation

With the Naive Bayes classifier trained, we now evaluate its performance by applying it to the test set, data that was not used during training. The objective is to compare the model’s predicted class probabilities against the actual outcomes stored in test_labels.

To generate predicted probabilities for each class, we use the predict() function from the naivebayes package, setting type = "prob" to return posterior probabilities instead of hard class labels:

prob_naive_bayes = predict(naive_bayes, test_set, type = "prob")

To explore the output, we display the first 6 rows and round the values to three decimal places:

round(head(prob_naive_bayes, n = 6), 3)
        good risk bad risk
   [1,]     0.001    0.999
   [2,]     0.013    0.987
   [3,]     0.000    1.000
   [4,]     0.184    0.816
   [5,]     0.614    0.386
   [6,]     0.193    0.807

The resulting matrix contains two columns: the first shows the predicted probability that a customer belongs to the “good risk” class, while the second shows the probability of being in the “bad risk” class. For example, if a customer receives a high probability for “bad risk,” it suggests that the model considers them more likely to default.

Rather than relying on a fixed decision threshold (such as 0.5), the model’s probabilities can be mapped to class labels using a threshold selected according to specific business needs. In the next subsection, we convert these probabilities into class predictions and evaluate performance using a confusion matrix and additional metrics.

Confusion Matrix

To assess the classification performance of the Naive Bayes model, we compute a confusion matrix using the conf.mat() and conf.mat.plot() functions from the liver package:

# Extract probability of "good risk"
prob_naive_bayes = prob_naive_bayes[, "good risk"] 

conf.mat(prob_naive_bayes, test_labels, cutoff = 0.5, reference = "good risk")
              Predict
   Actual      good risk bad risk
     good risk        24        2
     bad risk          3       20

conf.mat.plot(prob_naive_bayes, test_labels, cutoff = 0.5, reference = "good risk")

We apply a threshold of 0.5, classifying an observation as “good risk” if its predicted probability for that class exceeds 50%. The reference class is “good risk”, meaning that metrics such as sensitivity and precision are computed relative to this category.

The resulting confusion matrix summarizes the model’s predictions compared to the actual outcomes, highlighting both correct classifications and misclassifications. Here, 24 customers were correctly classified as “good risk” and 20 as “bad risk”, while 3 “bad risk” cases were misclassified as “good risk” and 2 “good risk” cases were misclassified as “bad risk”.
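The same table can be reproduced by hand, which makes the role of the cutoff explicit. A minimal sketch, assuming conf.mat() labels an observation as the reference class when its predicted probability exceeds the cutoff:

# Map predicted probabilities of "good risk" to class labels at cutoff 0.5
pred_class = factor(ifelse(prob_naive_bayes > 0.5, "good risk", "bad risk"),
                    levels = levels(test_labels))

table(Actual = test_labels, Predict = pred_class)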

Want to explore the effect of changing the classification threshold? Try setting the cutoff to values such as 0.4 or 0.6 to examine how sensitivity, specificity, and overall accuracy shift under different decision criteria.
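For instance, a more lenient cutoff for the “good risk” class can be examined with the same call:

conf.mat(prob_naive_bayes, test_labels, cutoff = 0.4, reference = "good risk")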

ROC Curve and AUC

To complement the confusion matrix, we use the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) to evaluate the classifier’s performance across all possible classification thresholds. While the confusion matrix reflects accuracy at a fixed cutoff (e.g., 0.5), ROC analysis provides a more flexible, threshold-agnostic view of model performance.

library(pROC)          

roc_naive_bayes = roc(test_labels, prob_naive_bayes)

ggroc(roc_naive_bayes, colour = "#377EB8", size = 1) +
    ggtitle("ROC Curve for Naive Bayes")

The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various thresholds. A curve that bows toward the top-left corner indicates strong discriminative performance, reflecting a high sensitivity with a low false positive rate.

Next, we compute the AUC score:

round(auc(roc_naive_bayes), 3)
   [1] 0.957

The AUC value, 0.957, quantifies the model’s ability to distinguish between the two classes. Specifically, it represents the probability that a randomly selected “good risk” customer will receive a higher predicted probability than a randomly selected “bad risk” customer. An AUC of 1 indicates perfect separation, while an AUC of 0.5 reflects no discriminative power beyond random guessing.
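This interpretation can be checked empirically with a short sketch that compares predicted probabilities across all pairs of “good risk” and “bad risk” customers in the test set (ties counted as one half):

p_good = prob_naive_bayes[test_labels == "good risk"]
p_bad  = prob_naive_bayes[test_labels == "bad risk"]

# Fraction of pairs in which the good-risk customer gets the higher probability
concordance = outer(p_good, p_bad, ">") + 0.5 * outer(p_good, p_bad, "==")
mean(concordance)   # should closely match the AUC reported by pROC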

Together, the ROC curve and AUC offer a comprehensive assessment of model performance, independent of any particular decision threshold. In the final section of this case study, we reflect on the model’s practical strengths and limitations.

Takeaways from the Case Study

This case study illustrated how the Naive Bayes classifier can support financial risk assessment by classifying customers as good risk or bad risk based on demographic and financial attributes. Using tools such as the confusion matrix, ROC curve, and AUC, we evaluated the model’s accuracy and ability to guide lending decisions.

Naive Bayes offers several practical advantages. Its simplicity and computational efficiency make it well suited for real-time decision-making. Despite its strong independence assumption, the algorithm often performs competitively, especially in high-dimensional settings or when feature correlations are weak. Moreover, the ability to output class probabilities allows institutions to adjust classification thresholds based on specific business goals, such as prioritizing sensitivity to minimize default risk or specificity to avoid rejecting reliable applicants.

Nonetheless, the conditional independence assumption can limit performance when predictors are strongly correlated. This limitation can be addressed by incorporating additional features (e.g., credit history), using more flexible probabilistic models, or transitioning to ensemble methods such as random forests or boosting.

By applying Naive Bayes to a real-world dataset, we demonstrated how probabilistic classification can support data-driven credit policy. Models like this help financial institutions strike a balance between risk management and fair lending practices.

Reflective prompt: How might this modeling approach transfer to other domains, such as healthcare or marketing? Could adjusting the classification threshold or selecting a different Naive Bayes variant improve outcomes in those settings? As you compare this method with others, such as k-nearest neighbors or logistic regression, consider when each model is most appropriate and why.

Chapter Summary and Takeaways

This chapter introduced the Naive Bayes classifier as a fast and interpretable approach to probabilistic classification. Grounded in Bayes’ Theorem, the method estimates the likelihood that an observation belongs to a particular class, assuming conditional independence among features. While this assumption rarely holds exactly, Naive Bayes often performs surprisingly well in practice, especially in high-dimensional settings and text-based applications.

We examined three common variants, multinomial, Bernoulli, and Gaussian, each suited to different data types. Using the risk dataset, we applied Naive Bayes in R, evaluated its performance with confusion matrices, ROC curves, and AUC, and interpreted predicted probabilities to support threshold-based decisions.

Key takeaways:

  • Naive Bayes is computationally efficient and scalable, making it well-suited for real-time applications.

  • It offers transparent probabilistic outputs, enabling flexible decision-making and threshold adjustment.

  • The model performs robustly even when the independence assumption is only approximately satisfied.

While this chapter focused on a generative probabilistic model, the next chapter introduces logistic regression, a discriminative linear model that estimates the log-odds of class membership. Logistic regression provides a useful complement to Naive Bayes, particularly when modeling predictor relationships and interpreting coefficients are central to the analysis.

9.6 Exercises

The following exercises are designed to strengthen your understanding of the Naive Bayes classifier and its practical applications. They progress from conceptual questions that test your grasp of probabilistic reasoning and model assumptions, to hands-on analyses using the churn and churnCredit datasets from the liver package. Together, these tasks guide you through data preparation, model training, evaluation, and interpretation—helping you connect theoretical principles to real-world predictive modeling.

Conceptual Questions

  1. Why is Naive Bayes considered a probabilistic classification model?

  2. What is the difference between prior probability, likelihood, and posterior probability in Bayes’ theorem?

  3. What does it mean that Naive Bayes assumes feature independence?

  4. In which situations does the feature independence assumption become problematic? Provide an example.

  5. What are the main strengths of Naive Bayes? Why is it widely used in text classification and spam filtering?

  6. What are its major limitations, and how do they affect performance?

  7. How does Laplace smoothing prevent zero probabilities in Naive Bayes? Hint: See Section 9.3.

  8. When should you use multinomial, Bernoulli, or Gaussian Naive Bayes? Hint: See Section 9.4.

  9. Compare Naive Bayes to k-Nearest Neighbors (Chapter 7). How do their assumptions differ?

  10. How does changing the probability threshold affect predictions and evaluation metrics?

  11. Why can Naive Bayes remain effective even when the independence assumption is violated?

  12. What dataset characteristics typically cause Naive Bayes to perform poorly?

  13. How does Gaussian Naive Bayes handle continuous variables?

  14. How can domain knowledge improve Naive Bayes results?

  15. How does Naive Bayes handle imbalanced datasets? What preprocessing strategies help?

  16. How can prior probabilities be adjusted to reflect business priorities?

Hands-On Practice: Naive Bayes with the churn Dataset

The churn dataset from the liver package contains information about customer subscriptions. The goal is to predict whether a customer will churn (churn = yes/no) using Naive Bayes. See Section 4.3 for prior exploration.

Data Setup to Model
  1. Load the dataset and inspect its structure:
library(liver)

data(churn, package = "liver")
  2. Summarize key variables and their distributions.

  3. Partition the data into 80% training and 20% test sets using partition() from liver.

  4. Confirm that the class distribution of churn is similar across both sets.

Training and Evaluation
  1. Define the model formula:
formula = churn ~ account.length + voice.plan + voice.messages +
                 intl.plan + intl.mins + day.mins + eve.mins +
                 night.mins + customer.calls
  2. Train a Naive Bayes model using the naivebayes package.

  3. Summarize the model and interpret class-conditional probabilities.

  4. Predict class probabilities for the test set.

  5. Display the first ten predictions and interpret churn likelihoods.

  6. Generate a confusion matrix with conf.mat() using a 0.5 threshold.

  7. Visualize it with conf.mat.plot() from liver.

  8. Compute accuracy, precision, recall, and F1-score.

  9. Adjust the threshold to 0.3 and observe the change in performance metrics.

  10. Plot the ROC curve and compute AUC.

  11. Retrain the model with Laplace smoothing (laplace = 1) and compare results.

  12. Compare the Naive Bayes model to k-Nearest Neighbors using identical partitions.

  13. Remove one predictor at a time and re-evaluate model performance.

  14. Diagnose poor performance on subsets of data: could it stem from class imbalance or correlated features?

Hands-On Practice: Naive Bayes with the churnCredit Dataset

The churnCredit dataset from the liver package contains 10,127 customer records and 21 variables combining churn, credit, and demographic features. It allows you to evaluate how financial and behavioral variables jointly affect churn.

Data Setup for Modeling
  1. Load the dataset:
library(liver)

data(churnCredit, package = "liver")
  2. Inspect the structure and summary statistics. Identify the target variable (churn) and main predictors.

  3. Partition the data into 80% training and 20% test sets using partition() from liver. Use set.seed(9) for reproducibility.

  4. Verify that the class distribution of churn is consistent between the training and test sets.

Training and Evaluation
  1. Define a model formula with predictors such as credit.score, age, tenure, balance, products.number, credit.card, and active.member.

  2. Train a Naive Bayes classifier using the naivebayes package.

  3. Summarize the model and interpret key conditional probabilities.

  4. Predict outcomes for the test set and generate a confusion matrix with a 0.5 threshold.

  5. Compute evaluation metrics: accuracy, precision, recall, and F1-score.

  6. Plot the ROC curve and compute AUC.

  7. Retrain the model with Laplace smoothing (laplace = 1) and compare results.

  8. Adjust the classification threshold to 0.3 and note changes in sensitivity and specificity.

  9. Identify any predictors that might violate the independence assumption and discuss their potential effects on model performance.

Reflection
  1. Compare results with the churn dataset. Does adding financial information improve predictive accuracy?

  2. How could this model support retention or credit risk management strategies?

  3. Identify the top three most influential predictors using feature importance or conditional probabilities. Do they differ from the most influential features in the churn dataset? What might this reveal about customer behavior?

Critical Thinking

  1. How could a company use this model to inform business decisions related to churn?

  2. If false negatives are costlier than false positives, how should the decision threshold be adjusted?

  3. How might you use this model to target promotions for likely churners?

  4. Suppose a new feature, customer satisfaction score, were added. How could it improve predictions?

  5. How would you address poor model performance on new data?

  6. How might feature correlation affect Naive Bayes reliability?

  7. How could Naive Bayes be extended to handle multi-class classification problems?

  8. Would Naive Bayes be suitable for time-series churn data? Why or why not?

Self-Reflection

  1. Summarize the main strengths and limitations of Naive Bayes in your own words.

  2. How did the independence assumption influence your understanding of model behavior?

  3. Which stage—data preparation, training, or evaluation—most enhanced your understanding of Naive Bayes?

  4. How confident are you in applying Naive Bayes to datasets with mixed data types?

  5. Which extension would you explore next: smoothing, alternative distributions, or correlated features?

  6. Compared to models like kNN or logistic regression, when is Naive Bayes preferable, and why?

  7. Reflect on how Naive Bayes connects back to the broader Data Science Workflow. At which stage does its simplicity provide the greatest advantage?