11  Generalized Regression Models

The theory of probabilities is at bottom nothing but common sense reduced to calculation.

— Pierre-Simon Laplace

In Chapter 10, we developed regression models for continuous outcomes, focusing on situations where the response variable can take a wide range of numerical values, such as house prices, healthcare costs, or hourly bike rental demand. Those models provided a powerful framework for studying relationships between variables, making predictions, and assessing the contribution of individual predictors. Yet many important data science problems involve outcomes that do not fit this setting. In practice, we are often interested in questions such as whether a customer will churn, whether a patient has a disease, or how many service calls a customer is expected to make. These outcomes are not continuous: they are binary or count-based, and they require a different modeling approach.

When the response variable is binary or a count, the assumptions of ordinary linear regression are no longer appropriate. A linear model can produce predicted values below 0 or above 1 when applied to probabilities, and it does not account naturally for the discrete structure of count data. More fundamentally, the variability of binary and count outcomes is tied to their mean in ways that differ from the constant-variance framework of linear regression. If we ignore these features and apply linear regression mechanically, the resulting model may be difficult to interpret and may lead to misleading conclusions.

Generalized regression models extend the regression framework to address these limitations. They preserve the central idea that we relate an outcome to a set of predictors through a structured model, but they allow the form of that relationship to adapt to the type of response variable under study. This broader framework makes it possible to model probabilities for binary outcomes and expected event counts for count data while still retaining much of the interpretability that makes regression so useful. In this chapter, we focus on two of the most important examples: logistic regression for binary outcomes and Poisson regression for count data.

This chapter also connects naturally to earlier parts of the book. In Chapter 7, we introduced classification through k-Nearest Neighbors, and in Chapter 9, we examined the Naive Bayes classifier as a probabilistic approach to prediction. Logistic regression provides another important perspective on binary classification, one that is grounded in the regression framework and supports direct interpretation of predictor effects. At the same time, Poisson regression extends regression modeling in a different direction by addressing outcomes that represent event frequencies rather than categories. Together, these models broaden the scope of regression analysis and show how a common modeling philosophy can be adapted to different kinds of data.

As in earlier chapters, we work within the Data Science Workflow introduced in Chapter 2. We begin by motivating the generalized linear model framework, then develop logistic regression and Poisson regression in turn, and finally apply these ideas in a case study on customer churn prediction. Along the way, we examine how these models are fitted in R, how their coefficients should be interpreted, and how their predictive performance can be assessed in practice. In this way, the chapter shows not only how generalized regression models are formulated, but also how they can support meaningful decisions in real data science problems.

What This Chapter Covers

In Chapter 10, we focused on regression models for continuous outcomes. In this chapter, we extend that framework to settings where the response variable is no longer continuous, but instead represents a binary outcome or a count. Such outcomes arise frequently in data science. A customer may churn or remain, a patient may respond to treatment or not, and the number of service calls or purchases may vary from one individual to another. These problems require models that respect the structure of the response variable while still allowing us to relate it systematically to a set of predictors.

We begin by introducing the generalized linear model framework, which provides the foundation for the methods developed in this chapter. This framework extends ordinary linear regression by combining a probability distribution for the response, a linear predictor, and a link function that connects the two. This perspective helps explain why different types of outcomes require different regression models and how those models remain part of a common statistical family.

We then turn to logistic regression, which is used when the response variable is binary. We explain why ordinary linear regression is not appropriate for such outcomes, introduce the concepts of odds and log-odds, and show how logistic regression models probabilities in a way that remains interpretable and statistically coherent. We also demonstrate how to fit logistic regression models in R, interpret coefficients and odds ratios, and convert predicted probabilities into class predictions when needed.

Next, we introduce Poisson regression for modeling count data. Here, the focus shifts from probabilities to expected event counts. We show how the Poisson model uses a log link to ensure that predicted counts remain positive, and we discuss how its coefficients are interpreted in terms of multiplicative changes in the expected count. We also highlight an important practical issue, overdispersion, which reminds us that model assumptions must always be checked against the data.

The chapter concludes with a case study on customer churn prediction using logistic regression, together with comparisons to k-Nearest Neighbors and Naive Bayes. This case study illustrates how generalized regression models fit naturally into the Data Science Workflow and how they can be evaluated alongside other predictive methods using ROC curves and the area under the curve. A shorter worked example for Poisson regression complements this analysis by showing how count outcomes can be modeled and interpreted in practice.

By the end of this chapter, you will understand when generalized regression models are needed, how logistic and Poisson regression extend the ideas of Chapter 10, and how these models can be fitted, interpreted, and evaluated in R. More broadly, you will see how regression remains a flexible modeling framework even when the response variable is not continuous.

11.1 From Linear Regression to Generalized Linear Models

In Chapter 10, we used linear regression to model continuous outcomes such as house prices and bike rental demand. That framework was powerful because it gave us a clear way to relate a response variable to one or more predictors, estimate coefficients, interpret their effects, and generate predictions. However, not every response variable is continuous. In many practical problems, the outcome may be binary, such as whether a customer churns, or it may be a count, such as the number of customer service calls. In such settings, the standard linear regression model is no longer well suited to the structure of the data.

Generalized linear models provide a broader framework that extends the main ideas of regression while adapting them to different types of outcomes. Rather than treating binary and count data as awkward exceptions, generalized linear models incorporate them naturally by allowing the response distribution and the relationship between the mean response and the predictors to change according to the problem at hand. This makes the framework both flexible and unified: the details of the model may change, but the central regression idea remains the same.

Why Linear Regression Is Not Enough

To understand why a broader framework is needed, consider what happens if we apply ordinary linear regression to a binary outcome such as customer churn. A linear model could produce fitted values below 0 or above 1, even though probabilities must lie between 0 and 1. This is already a warning sign that the model does not respect the natural range of the response variable. In addition, the variability of a binary response is not constant across all predictor values. Instead, it depends on the underlying probability itself, which conflicts with the constant-variance assumption used in ordinary linear regression.

A similar problem arises with count data. Suppose we wish to model the number of service calls made by a customer. Counts are non-negative integers, so values such as \(-2.4\) or \(3.7\) do not make sense as predictions. Yet a linear regression model treats the response as if any real-valued outcome were possible. Count data also tend to have a variance structure that depends on the mean, rather than remaining roughly constant. Once again, the assumptions of ordinary linear regression do not align well with the data-generating process.

These examples show that the issue is not merely technical. It is not enough for a model to produce a line or a fitted equation. A good model should also respect the basic nature of the outcome being studied. When the response is binary, the model should produce probabilities in the interval \([0,1]\). When the response is a count, the model should produce non-negative expected counts. Generalized linear models are designed precisely for this purpose.
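To see the first problem concretely, the following small simulation (a sketch; the data are artificial, not from any dataset used in this book) fits an ordinary linear regression to a simulated binary outcome and checks the range of its fitted values:

```r
set.seed(1)
x <- seq(-4, 4, length.out = 200)
p <- plogis(2 * x)                    # true event probabilities
y <- rbinom(200, size = 1, prob = p)  # simulated binary outcome

fit_lm <- lm(y ~ x)
range(fitted(fit_lm))                 # fitted "probabilities" can fall outside [0, 1]
```

With a strong predictor such as this one, the fitted line crosses 0 and 1 near the extremes of x, so the fitted values can no longer be interpreted as probabilities.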

The Three Components of a Generalized Linear Model

A generalized linear model, often abbreviated as GLM, is built from three connected components. Together, these components determine how the response variable is modeled and how the predictors enter the model.

The first component is the random component. This specifies the probability distribution of the response variable. In ordinary linear regression, we work with a continuous response and usually assume a normal distribution for the errors. In a generalized linear model, we instead choose a distribution appropriate for the type of response under study. For binary data, we use a binomial distribution. For count data, we often use a Poisson distribution.

The second component is the systematic component. This is the familiar linear predictor, \[ \eta = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_m x_m, \] where \(b_0, b_1, \dots, b_m\) are coefficients and \(x_1, x_2, \dots, x_m\) are predictors. This part of the model looks much like the regression equations we used in Chapter 10. It preserves the core regression idea that the predictors combine linearly on an appropriate scale.

The third component is the link function. This function connects the expected value of the response variable to the linear predictor. If we let \(\mu = \mathbb{E}(Y)\) denote the expected response, then a generalized linear model has the form \[ g(\mu) = \eta, \] where \(g(\cdot)\) is the link function. The role of the link function is crucial. It transforms the mean response to a scale on which a linear relationship with the predictors becomes appropriate.

In ordinary linear regression, the link function is simply the identity link, so that \[ \mu = \eta. \] This means that the expected response is modeled directly as a linear function of the predictors. In logistic regression, by contrast, the link function is the logit, which transforms a probability to the log-odds scale. In Poisson regression, the link function is the logarithm, which transforms a positive expected count to the real line.

Taken together, these three components allow generalized linear models to preserve the interpretability of regression while adapting to the structure of different outcome types. We still build a model around predictors and coefficients, but we do so in a way that respects the distribution and range of the response variable.
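In R, the random component and the link function are bundled together in a family object, which is what we later pass to glm() through its family argument. A quick way to see the default link for each of the families discussed above:

```r
gaussian()$link   # "identity" -- ordinary linear regression
binomial()$link   # "logit"    -- logistic regression
poisson()$link    # "log"      -- Poisson regression
```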

11.2 Logistic Regression for Binary Outcomes

Many important data science questions take the form of a yes-or-no outcome. Will a customer churn? Will a patient respond to treatment? Will a transaction be classified as fraudulent? In each of these cases, the response variable is binary, meaning that it takes one of two possible values. Problems of this kind were introduced earlier in the book through classification methods such as k-Nearest Neighbors in Chapter 7 and the Naive Bayes classifier in Chapter 9. These methods emphasized prediction from a classification perspective. We now turn to a complementary model-based approach that places binary outcomes within the regression framework: logistic regression.

Logistic regression is one of the most widely used generalized linear models. It extends the logic of regression to settings where the response variable represents whether an event occurs or not. Rather than modeling the response directly, logistic regression models the probability of the event through a transformation that ensures the fitted values remain between 0 and 1. This makes the model suitable for binary outcomes while preserving the interpretability and structure of regression analysis.

Why Not Use Linear Regression for a Binary Response?

At first glance, it may seem natural to apply ordinary linear regression to a binary outcome coded as 0 and 1. For example, we might code customer churn as 1 for “yes” and 0 for “no”, then regress this outcome on a set of predictors. Although this idea is simple, it runs into immediate problems.

First, a linear regression model can produce fitted values smaller than 0 or larger than 1, even though probabilities must lie in the interval \([0,1]\). Second, the variability of a binary response is not constant. When the probability of the event is near 0 or near 1, the response has less variability than when the probability is closer to 0.5. This violates the constant-variance assumption underlying ordinary linear regression. Third, the relationship between predictors and the probability of an event is often not naturally linear on the probability scale. A one-unit increase in a predictor may have a different effect when the probability is already near 0.9 than when it is near 0.2.

For these reasons, binary outcomes require a model that respects both the range and the structure of the data. Logistic regression does this by modeling a transformed version of the probability rather than the probability itself.

Odds, Log-Odds, and the Logistic Model

To build the logistic regression model, we begin with the probability of the event of interest. Let \(p\) denote the probability that the outcome equals 1. For example, in a churn application, \(p\) may represent the probability that a customer leaves the service.

Instead of modeling \(p\) directly, logistic regression models the odds of the event, \[ \text{odds} = \frac{p}{1-p}. \] The odds compare the probability that the event occurs to the probability that it does not occur. If \(p = 0.8\), then the odds are \(0.8/0.2 = 4\), meaning that the event is four times as likely to occur as not to occur.
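This calculation is easy to verify directly in R:

```r
p <- 0.8
odds <- p / (1 - p)
odds   # 4: the event is four times as likely to occur as not
```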

Because odds are always positive, we take their natural logarithm to obtain the log-odds, also called the logit: \[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right). \] This transformation maps probabilities in the interval \((0,1)\) to the entire real line, which allows us to model the transformed probability as a linear function of the predictors: \[ \text{logit}(p) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_m x_m. \]

This is the logistic regression model. It says that the predictors have a linear effect on the log-odds scale. Although the model is linear in the coefficients, the relationship between the predictors and the probability itself is nonlinear. Solving the equation above for \(p\) gives the inverse-logit form: \[ p = \frac{e^{b_0 + b_1 x_1 + \dots + b_m x_m}}{1 + e^{b_0 + b_1 x_1 + \dots + b_m x_m}}. \]

This expression guarantees that predicted probabilities always lie between 0 and 1. That is the key reason logistic regression is appropriate for binary outcomes. It preserves the regression idea of linking a response to predictors, but it does so on a scale that matches the structure of the data.
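Base R provides both transformations directly: qlogis() computes the logit and plogis() computes its inverse, so we can check that they undo each other and that the inverse logit always returns a valid probability:

```r
p <- 0.8
eta <- qlogis(p)       # log-odds: log(0.8 / 0.2) = log(4)
plogis(eta)            # back to the probability scale: 0.8
plogis(c(-10, 0, 10))  # inverse logit always stays within (0, 1)
```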

Fitting a Logistic Regression Model in R

We now fit a logistic regression model in R using the churn_mlc dataset from the liver package. This dataset contains customer-level information such as account characteristics, usage behavior, and service interactions. Our goal is to model whether a customer has churned (yes) or not (no) based on a selected set of predictors.

We begin by loading the dataset and inspecting its structure:

library(liver)

data(churn_mlc)

str(churn_mlc)
   'data.frame':    5000 obs. of  20 variables:
    $ state         : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
    $ area_code     : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
    $ account_length: int  128 107 137 84 75 118 121 147 117 141 ...
    $ voice_plan    : Factor w/ 2 levels "yes","no": 1 1 2 2 2 2 1 2 2 1 ...
    $ voice_messages: int  25 26 0 0 0 0 24 0 0 37 ...
    $ intl_plan     : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 2 1 2 1 ...
    $ intl_mins     : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
    $ intl_calls    : int  3 3 5 7 3 6 7 6 4 5 ...
    $ intl_charge   : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
    $ day_mins      : num  265 162 243 299 167 ...
    $ day_calls     : int  110 123 114 71 113 98 88 79 97 84 ...
    $ day_charge    : num  45.1 27.5 41.4 50.9 28.3 ...
    $ eve_mins      : num  197.4 195.5 121.2 61.9 148.3 ...
    $ eve_calls     : int  99 103 110 88 122 101 108 94 80 111 ...
    $ eve_charge    : num  16.78 16.62 10.3 5.26 12.61 ...
    $ night_mins    : num  245 254 163 197 187 ...
    $ night_calls   : int  91 103 104 89 121 118 118 96 90 97 ...
    $ night_charge  : num  11.01 11.45 7.32 8.86 8.41 ...
    $ customer_calls: int  1 1 0 2 3 0 3 0 1 0 ...
    $ churn         : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

The dataset contains 5000 observations and 20 variables. The response variable is churn, and the remaining variables provide information that may help explain why some customers leave while others remain.

To illustrate logistic regression, we use the following predictors:

account_length, voice_plan, voice_messages, intl_plan, intl_mins, day_mins, eve_mins, night_mins, and customer_calls.

We define the model formula as follows:

formula_churn <- churn ~ account_length + voice_messages + day_mins +
  eve_mins + night_mins + intl_mins + customer_calls +
  intl_plan + voice_plan

To fit the logistic regression model, we use the glm() function. The argument family = binomial tells R to fit a generalized linear model for a binary response using the logit link:

glm_churn <- glm(formula = formula_churn, data = churn_mlc, family = binomial)

We then inspect the fitted model with:

summary(glm_churn)
   
   Call:
   glm(formula = formula_churn, family = binomial, data = churn_mlc)
   
   Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
   (Intercept)     8.8917584  0.6582188  13.509  < 2e-16 ***
   account_length -0.0013811  0.0011453  -1.206   0.2279    
   voice_messages -0.0355317  0.0150397  -2.363   0.0182 *  
   day_mins       -0.0136547  0.0009103 -15.000  < 2e-16 ***
   eve_mins       -0.0071210  0.0009419  -7.561 4.02e-14 ***
   night_mins     -0.0040518  0.0009048  -4.478 7.53e-06 ***
   intl_mins      -0.0882514  0.0170578  -5.174 2.30e-07 ***
   customer_calls -0.5183958  0.0328652 -15.773  < 2e-16 ***
   intl_planno     2.0958198  0.1214476  17.257  < 2e-16 ***
   voice_planno   -2.1637477  0.4836735  -4.474 7.69e-06 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
   
   (Dispersion parameter for binomial family taken to be 1)
   
       Null deviance: 4075.0  on 4999  degrees of freedom
   Residual deviance: 3174.3  on 4990  degrees of freedom
   AIC: 3194.3
   
   Number of Fisher Scoring iterations: 6

The output reports the estimated coefficients, their standard errors, corresponding z-statistics, and p-values. As in linear regression, these values help us assess the direction and strength of the association between each predictor and the response. Here, however, the coefficients are interpreted on the log-odds scale rather than on the original response scale.

At this stage, we fit the model using the full dataset because the main purpose of this section is to understand model specification and interpretation. Later in the chapter, we return to logistic regression in a predictive setting and evaluate its out-of-sample performance alongside other classification methods.

It is also important to note that we did not manually create dummy variables for the binary predictors intl_plan and voice_plan. When factor variables are included in a model formula, R automatically creates the required indicator coding and uses one level as the reference category.

Practice: Remove one or more predictors with relatively large p-values, such as account_length, and refit the model. Compare the resulting coefficient estimates and significance levels with those of the original model. Which parts of the model remain stable, and which appear more sensitive to the choice of predictors?

Interpreting Coefficients, Odds Ratios, and Predicted Probabilities

The coefficients in a logistic regression model describe changes in the log-odds of the event. This is mathematically convenient, but it is not always the most intuitive scale for interpretation. A more interpretable quantity is the odds ratio, obtained by exponentiating the coefficient.

For a numeric predictor \(x_j\), the coefficient \(b_j\) represents the change in log-odds associated with a one-unit increase in \(x_j\), holding the other predictors fixed. Exponentiating this coefficient gives \[ e^{b_j}, \] which is the multiplicative change in the odds associated with a one-unit increase in that predictor.

For example, if a coefficient equals 0.20, then \[ e^{0.20} \approx 1.22. \] This means that a one-unit increase in the predictor multiplies the odds of churn by about 1.22, or increases the odds by about 22%, holding the other predictors constant. If a coefficient is negative, the odds ratio will be less than 1, indicating a decrease in the odds.

We can compute the estimated odds ratios in R as follows:

round(exp(coef(glm_churn)), 3)
      (Intercept) account_length voice_messages       day_mins       eve_mins     night_mins      intl_mins customer_calls 
         7271.794          0.999          0.965          0.986          0.993          0.996          0.916          0.595 
      intl_planno   voice_planno 
            8.132          0.115

For binary predictors such as intl_plan or voice_plan, the interpretation is relative to the reference category. The corresponding odds ratio compares the odds of churn for one category against the odds for the reference category, holding the other predictors fixed.

Although odds ratios are useful, readers should be careful not to confuse them with probabilities. A change in odds is not the same as a change in probability, and the same odds ratio can correspond to different probability changes depending on the starting point. This is one reason logistic regression is often interpreted using both odds ratios and predicted probabilities.
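A small numerical sketch makes this point concrete. Suppose a predictor doubles the odds (an odds ratio of 2). The implied change in probability depends heavily on the starting probability:

```r
odds_ratio <- 2
p_start <- c(0.1, 0.5, 0.9)                       # three starting probabilities

odds_new <- odds_ratio * p_start / (1 - p_start)  # multiply the odds by 2
p_new <- odds_new / (1 + odds_new)                # convert back to probabilities

round(p_new - p_start, 3)   # probability changes: 0.082, 0.167, 0.047
```

The same odds ratio of 2 shifts the probability by about 0.17 near \(p = 0.5\), but by far less near the boundaries of the probability scale.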

Predicted probabilities can be obtained using predict() with type = "response":

round(predict(glm_churn, newdata = churn_mlc[1:5, ], type = "response"), 3)
       1     2     3     4     5 
   0.924 0.972 0.938 0.417 0.522

These values represent the model’s estimated probability of the event associated with the non-reference level of the response variable. In a binomial model fitted with a two-level factor response, R treats the second factor level as the event being modeled. Here, with levels "yes" and "no", the modeled event is "no", so the values above are estimated probabilities of remaining rather than churning. It is therefore useful to inspect the factor levels directly:

levels(churn_mlc$churn)
   [1] "yes" "no"

If needed, the reference level can be changed before fitting the model. For example, if we want "yes" to represent the modeled event explicitly, we can write:

churn_mlc$churn <- relevel(churn_mlc$churn, ref = "no")

This should be done before fitting the model. Changing the reference level affects the interpretation of the probabilities and coefficients, although it does not change the underlying information in the data.

Practice: Choose one predictor from the fitted model and interpret both its coefficient and its odds ratio. Then compute predicted probabilities for the first five observations and compare them with the observed churn values.

From Probabilities to Class Predictions

A logistic regression model naturally produces predicted probabilities. In some applications, however, we also need a class prediction such as yes or no. To make this transition, we choose a decision threshold. The most common choice is 0.5: if the predicted probability is at least 0.5, we classify the observation as the event; otherwise, we classify it as the non-event.

For example, class predictions based on a 0.5 threshold can be obtained as follows:

pred_prob <- predict(glm_churn, newdata = churn_mlc, type = "response")

pred_class <- ifelse(pred_prob >= 0.5, "no", "yes")

Note that the labels follow the factor coding discussed above: pred_prob is the fitted probability of the second factor level, "no", so probabilities of at least 0.5 correspond to the class "no".

head(pred_class)
        1     2     3     4     5     6 
     "no"  "no"  "no" "yes"  "no"  "no"

Although a threshold of 0.5 is common, it is not always the most appropriate choice. In churn prediction, for instance, a company may prefer to identify more at-risk customers even if this leads to more false positives. In that case, a lower threshold such as 0.3 may be more useful. In medical screening, the same logic often applies: missing a true positive case may be more costly than issuing an additional false alarm.
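As an illustration, consider a hypothetical vector of estimated churn probabilities (these numbers are invented for the sketch, not taken from the fitted model). Lowering the threshold flags more customers as at risk:

```r
prob_churn <- c(0.12, 0.28, 0.45, 0.61, 0.08)  # hypothetical churn probabilities

ifelse(prob_churn >= 0.5, "yes", "no")   # default threshold: flags 1 customer
ifelse(prob_churn >= 0.3, "yes", "no")   # lower threshold: flags 2 customers
```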

This is why logistic regression is especially valuable as a probabilistic model. It separates the task of estimating risk from the task of making a classification decision. The model provides estimated probabilities, and the analyst chooses a threshold based on the goals of the application. In the case study later in this chapter, we compare logistic regression with Naive Bayes and k-Nearest Neighbors using ROC curves and the area under the curve, which allow us to evaluate models across all possible thresholds rather than relying on a single cutoff.

Logistic regression therefore plays two roles at once. It is a regression model for binary outcomes, and it is also a classification tool that produces interpretable probabilities. This combination of interpretability, flexibility, and practical usefulness explains why logistic regression remains one of the most important methods for binary data analysis.

11.3 Poisson Regression for Count Outcomes

Not all response variables describe whether an event occurs. In many applications, the outcome records how many times something happens within a fixed period or setting. Examples include the number of doctor visits, the number of insurance claims, the number of purchases made in a week, or the number of website visits in an hour. These outcomes are not continuous and are not simply binary. They are counts, and they require a model that respects their discrete and non-negative nature.

Poisson regression is one of the most important generalized linear models for this setting. It extends the regression framework to responses that represent event frequencies, allowing us to relate expected counts to a set of predictors in a structured and interpretable way. Just as logistic regression adapts regression to binary outcomes, Poisson regression adapts it to count outcomes.

Why Count Data Require a Different Model

At first glance, one might try to apply ordinary linear regression to a count outcome such as the number of doctor visits. But this creates several problems.

First, a linear regression model can produce fitted values that are negative, even though counts must be non-negative. Second, count data often have a variance that changes with the mean. In many settings, observations with larger expected counts also show greater variability. This conflicts with the constant-variance assumption of ordinary linear regression. Third, count outcomes are often right-skewed, especially when many observations take small values and a smaller number take larger values. This again makes the normal-error framework of linear regression less appropriate.

These features reflect the fact that count data arise from a different type of data-generating process. A suitable model should recognize that event counts are discrete, non-negative, and often more variable at larger mean levels. Poisson regression provides such a model by linking the expected count to predictors through a structure tailored to count outcomes.
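The Poisson distribution builds this mean–variance relationship in directly: its variance equals its mean. A quick simulation illustrates the pattern:

```r
set.seed(1)
counts_small <- rpois(10000, lambda = 2)    # small expected count
counts_large <- rpois(10000, lambda = 20)   # large expected count

c(mean(counts_small), var(counts_small))    # both close to 2
c(mean(counts_large), var(counts_large))    # both close to 20
```

Observations with a larger mean are automatically more variable, exactly the behavior that constant-variance linear regression cannot capture.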

Fitting a Poisson Regression Model in R

We now illustrate Poisson regression using the doctor_visits dataset from the liver package. This dataset contains cross-sectional data from the 1977–1978 Australian Health Survey and is a standard example for count-data modeling. It includes visits, the number of doctor visits in the past two weeks, together with predictors such as gender, age, income, illness, reduced, health, private, freepoor, freerepat, nchronic, and lchronic.

We begin by loading the dataset and inspecting its structure:

library(liver)

data(doctor_visits)

str(doctor_visits)
   'data.frame':    5190 obs. of  12 variables:
    $ age      : num  19 19 19 19 19 19 19 19 19 19 ...
    $ income   : num  5500 4500 9000 1500 4500 3500 5500 1500 6500 1500 ...
    $ illness  : num  1 1 3 1 2 5 4 3 2 1 ...
    $ reduced  : num  4 2 0 0 5 1 0 0 0 0 ...
    $ health   : num  1 1 0 0 1 9 2 6 5 0 ...
    $ gender   : Factor w/ 2 levels "male","female": 2 2 1 1 1 2 2 2 2 1 ...
    $ private  : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 1 1 2 2 ...
    $ freepoor : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
    $ freerepat: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
    $ nchronic : Factor w/ 2 levels "no","yes": 1 1 1 1 2 2 1 1 1 1 ...
    $ lchronic : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
    $ visits   : num  1 1 1 1 1 1 1 1 1 1 ...

The response variable is visits, which records how many times an individual visited a doctor in the past two weeks. Because this response is a count, Poisson regression provides a natural starting point. Following the standard example from the AER package documentation for these data, we include all available predictors and add a quadratic term for age through I(age^2). The quadratic term allows the relationship between age and doctor visits to be nonlinear.

dv_pois <- glm(visits ~ . + I(age^2), data = doctor_visits, family = poisson)

We inspect the fitted model using:

summary(dv_pois)
   
   Call:
   glm(formula = visits ~ . + I(age^2), family = poisson, data = doctor_visits)
   
   Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
   (Intercept)  -2.224e+00  1.898e-01 -11.716   <2e-16 ***
   age           1.056e-02  1.001e-02   1.055   0.2912    
   income       -2.053e-05  8.838e-06  -2.323   0.0202 *  
   illness       1.869e-01  1.828e-02  10.227   <2e-16 ***
   reduced       1.268e-01  5.034e-03  25.198   <2e-16 ***
   health        3.008e-02  1.010e-02   2.979   0.0029 ** 
   genderfemale  1.569e-01  5.614e-02   2.795   0.0052 ** 
   privateyes    1.232e-01  7.164e-02   1.720   0.0855 .  
   freepooryes  -4.401e-01  1.798e-01  -2.447   0.0144 *  
   freerepatyes  7.980e-02  9.206e-02   0.867   0.3860    
   nchronicyes   1.141e-01  6.664e-02   1.712   0.0869 .  
   lchronicyes   1.412e-01  8.315e-02   1.698   0.0896 .  
   I(age^2)     -8.487e-05  1.078e-04  -0.787   0.4310    
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
   
   (Dispersion parameter for poisson family taken to be 1)
   
       Null deviance: 5634.8  on 5189  degrees of freedom
   Residual deviance: 4379.5  on 5177  degrees of freedom
   AIC: 6737.1
   
   Number of Fisher Scoring iterations: 6

The output reports coefficient estimates, standard errors, z-statistics, and p-values. Each coefficient describes how the corresponding predictor is associated with the log of the expected number of doctor visits, holding the other predictors fixed.

Interpreting Coefficients and Expected Counts

The coefficients in a Poisson regression model are interpreted on the log scale, which is not always the most intuitive scale for practical discussion. A more interpretable quantity is obtained by exponentiating the coefficient.

For a numeric predictor \(x_j\), the coefficient \(b_j\) represents the change in \(\log(\lambda)\) associated with a one-unit increase in \(x_j\), holding the other predictors fixed. Exponentiating this coefficient gives \[ e^{b_j}, \] which is the multiplicative change in the expected count.

For example, if a coefficient equals 0.3, then \[ e^{0.3} \approx 1.35. \] This means that a one-unit increase in the predictor multiplies the expected number of doctor visits by about 1.35, or increases it by approximately 35%, holding the other predictors constant.

If a coefficient is negative, the interpretation is similar but in the opposite direction. For example, if a coefficient equals \(-0.2\), then \[ e^{-0.2} \approx 0.82. \] This means that a one-unit increase in the predictor multiplies the expected count by about 0.82, corresponding to a decrease of roughly 18%.

We can compute the exponentiated coefficients in R as follows:

round(exp(coef(dv_pois)), 3)
    (Intercept)          age       income      illness      reduced       health genderfemale   privateyes  freepooryes 
          0.108        1.011        1.000        1.206        1.135        1.031        1.170        1.131        0.644 
   freerepatyes  nchronicyes  lchronicyes     I(age^2) 
          1.083        1.121        1.152        1.000

For factor variables, the interpretation is relative to the reference category. The exponentiated coefficient compares the expected number of doctor visits in one category with the expected number in the reference category, holding the remaining predictors fixed.
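For instance, using the fitted model dv_pois from above, the gender effect can be read off directly:

```r
# Exponentiated coefficient for gender: expected visits for females
# relative to males (the reference category), other predictors fixed
exp(coef(dv_pois)["genderfemale"])
```

The result, about 1.170, means that women are expected to make roughly 17% more doctor visits than men with the same values of the other predictors.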

Predicted expected counts can be obtained with predict() using type = "response":

round(predict(dv_pois, newdata = doctor_visits[1:5, ], type = "response"), 3)
       1     2     3     4     5 
   0.313 0.248 0.187 0.150 0.370

These values are the model’s estimated expected numbers of doctor visits for the specified observations. Because the model predicts expected counts, the fitted values do not need to be integers. They represent average expected frequencies rather than literal individual outcomes.
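These fitted values are obtained by exponentiating the linear predictor. As a small check, predictions on the link (log) scale reproduce the expected counts after exponentiation; this identity holds for any GLM with a log link:

```r
# type = "link" gives the linear predictor on the log scale;
# exponentiating it recovers the expected counts from type = "response"
eta <- predict(dv_pois, newdata = doctor_visits[1:5, ], type = "link")
all.equal(unname(exp(eta)),
          unname(predict(dv_pois, newdata = doctor_visits[1:5, ], type = "response")))
```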

Practice: Suppose a predictor has an estimated coefficient of \(-0.2\). Compute \(e^{-0.2} - 1\) and interpret the result as a percentage change in the expected number of doctor visits. Then choose one coefficient from dv_pois, exponentiate it, and interpret it in context.

Overdispersion and Practical Limitations

A central assumption of the Poisson model is that the variance of the response equals its mean. In practice, however, count data often show more variability than the Poisson distribution allows. This phenomenon is called overdispersion. When overdispersion is present, the standard Poisson model tends to underestimate uncertainty, which can make standard errors too small and significance tests too optimistic.

A simple diagnostic is to compare the residual deviance with the residual degrees of freedom:

c(
  residual_deviance = deviance(dv_pois),
  residual_df = df.residual(dv_pois),
  ratio = deviance(dv_pois) / df.residual(dv_pois)
)
   residual_deviance       residual_df             ratio 
        4379.5150954      5177.0000000         0.8459562

A ratio noticeably larger than 1 suggests that overdispersion may be present. In this case the ratio is about 0.85, slightly below 1, so this check gives no indication of overdispersion for the fitted model. This is only a rough guide, but it is useful as an initial diagnostic.
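A closely related diagnostic replaces the deviance with the Pearson statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom estimates the dispersion parameter directly, and this estimate should be close to 1 under the Poisson assumption. A sketch using the fitted model from above:

```r
# Pearson-based estimate of the dispersion parameter
pearson_disp <- sum(residuals(dv_pois, type = "pearson")^2) / df.residual(dv_pois)
pearson_disp
```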

When overdispersion is substantial, alternatives such as quasi-Poisson regression or negative binomial regression are often more appropriate because they allow the variance to exceed the mean. We do not develop these models in detail here, but it is important to recognize that Poisson regression is a starting point rather than a universal solution for every count outcome.
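As a sketch of how these alternatives are specified in R: the quasi-Poisson model uses the built-in quasipoisson family, and the negative binomial model is fitted with glm.nb() from the MASS package.

```r
# Quasi-Poisson: same mean structure, but variance = dispersion * mean
dv_quasi <- glm(visits ~ . + I(age^2), data = doctor_visits,
                family = quasipoisson)

# Negative binomial: variance = mean + mean^2 / theta
library(MASS)
dv_nb <- glm.nb(visits ~ . + I(age^2), data = doctor_visits)
```

The quasi-Poisson fit produces the same coefficient estimates as the Poisson fit; only the standard errors are adjusted for the estimated dispersion.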

Poisson regression therefore extends the regression framework in a way that is natural for event counts. It allows us to model expected frequencies, interpret predictor effects multiplicatively, and remain within the broader generalized linear model framework. Later in the chapter, we return to Poisson regression in a short worked example that builds on this foundation.

11.4 Case Study: Customer Churn Prediction with Logistic Regression

Customer churn, defined as the event in which a customer discontinues a service, represents a major challenge in subscription-based industries such as telecommunications, banking, and online platforms. Accurately identifying customers who are at risk of churning enables proactive retention strategies and can substantially reduce revenue loss. This case study focuses on predicting customer churn using multiple classification models and comparing their performance in a realistic modeling setting.

Throughout this chapter, we have introduced several classification approaches from different perspectives. In this case study, we bring these methods together and apply them to the same prediction task using a common dataset. Specifically, we compare three models introduced earlier in the book: logistic regression (Section 11.2), k-Nearest Neighbors (Chapter 7), and the Naive Bayes classifier (Chapter 9). Each model reflects a different modeling philosophy, ranging from parametric and interpretable to instance-based and probabilistic.

The analysis is based on the churn_mlc dataset from the liver package, which contains customer-level information on service usage, plan characteristics, and interactions with customer service. The target variable is churn, a binary indicator that records whether a customer has left the service (yes) or remained active (no). The dataset is provided in an analysis-ready format, allowing us to focus directly on modeling and evaluation within the Data Science Workflow introduced in Chapter 2. We begin by loading the dataset and inspecting its structure:

library(liver)

data(churn_mlc)
str(churn_mlc)
   'data.frame':    5000 obs. of  20 variables:
    $ state         : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
    $ area_code     : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
    $ account_length: int  128 107 137 84 75 118 121 147 117 141 ...
    $ voice_plan    : Factor w/ 2 levels "yes","no": 1 1 2 2 2 2 1 2 2 1 ...
    $ voice_messages: int  25 26 0 0 0 0 24 0 0 37 ...
    $ intl_plan     : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 2 1 2 1 ...
    $ intl_mins     : num  10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
    $ intl_calls    : int  3 3 5 7 3 6 7 6 4 5 ...
    $ intl_charge   : num  2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
    $ day_mins      : num  265 162 243 299 167 ...
    $ day_calls     : int  110 123 114 71 113 98 88 79 97 84 ...
    $ day_charge    : num  45.1 27.5 41.4 50.9 28.3 ...
    $ eve_mins      : num  197.4 195.5 121.2 61.9 148.3 ...
    $ eve_calls     : int  99 103 110 88 122 101 108 94 80 111 ...
    $ eve_charge    : num  16.78 16.62 10.3 5.26 12.61 ...
    $ night_mins    : num  245 254 163 197 187 ...
    $ night_calls   : int  91 103 104 89 121 118 118 96 90 97 ...
    $ night_charge  : num  11.01 11.45 7.32 8.86 8.41 ...
    $ customer_calls: int  1 1 0 2 3 0 3 0 1 0 ...
    $ churn         : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

The dataset consists of 5000 observations and 20 variables. The features describe customer usage patterns, subscription plans, and interactions with customer service. Rather than modeling all available variables, we select a subset of predictors that capture core aspects of customer behavior and are commonly used in churn analysis. Since the primary goal of this case study is to compare modeling approaches rather than to perform exploratory analysis, we keep EDA brief and move directly to data partitioning and model fitting.

Practice: Apply exploratory data analysis techniques to the churn_mlc dataset following the approach used in Chapter 4. Compare the patterns you observe with those from the churn dataset.

To ensure a fair comparison across models, we use the same set of predictors and preprocessing steps for all three classification methods. Model performance is evaluated using ROC curves and the area under the ROC curve (AUC), as introduced in Chapter 8. These metrics provide a threshold-independent assessment of classification performance and allow us to compare models on equal footing. The modeling formula used throughout this case study is:

formula = churn ~ account_length + voice_plan + voice_messages + intl_plan + intl_mins + intl_calls + day_mins + day_calls + eve_mins + eve_calls + night_mins + night_calls + customer_calls

In the following sections, we fit each classification model using this common setup and compare their predictive performance, interpretability, and practical suitability for churn prediction.

Data Setup for Modeling

To evaluate how well our classification models generalize to unseen data, we partition the dataset into separate training and test sets. This separation ensures that model performance is assessed on observations that were not used during model fitting, providing an unbiased estimate of predictive accuracy.

To maintain consistency across chapters and enable meaningful comparison with earlier results, we adopt the same data partitioning strategy used in Section 7.7. Specifically, we use the partition() function from the liver package to randomly split the data into non-overlapping subsets. Setting a random seed guarantees that the results are reproducible.

set.seed(42)

splits = partition(data = churn_mlc, ratio = c(0.8, 0.2))

train_set = splits$part1
test_set  = splits$part2

test_labels = test_set$churn

This procedure assigns 80% of the observations to the training set and reserves the remaining 20% for model evaluation. The response variable from the test set is stored separately in test_labels and will be used to assess predictive performance using ROC curves and AUC.
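A quick sanity check confirms the subset sizes and that the class distribution of churn is similar in both parts (a small sketch continuing the session above):

```r
# Partition sizes: roughly 80% and 20% of the 5000 observations
c(train = nrow(train_set), test = nrow(test_set))

# Class proportions of the target variable in each subset
prop.table(table(train_set$churn))
prop.table(table(test_set$churn))
```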

Practice: Repartition the churn_mlc dataset into a 70% training set and a 30% test set using the same approach. Check whether the class distribution of the target variable churn is similar in both subsets, and reflect on why preserving this balance is important for fair model evaluation.

In the following subsections, we train each classification model using the same formula and training data. We then generate predictions on the test set and compare model performance using ROC curves and the area under the curve (AUC).

Training the Logistic Regression Model

We begin with logistic regression, a widely used baseline model for binary classification. Logistic regression models the probability of customer churn as a function of the selected predictors, making it both interpretable and well suited for probabilistic evaluation.

We fit the model using the glm() function, specifying the binomial family to indicate a binary response:

logistic_model = glm(formula = formula, data = train_set, family = binomial)

Once the model is fitted, we generate predicted probabilities for the observations in the test set:

logistic_probs = predict(logistic_model, newdata = test_set, type = "response")

In logistic regression, predict(..., type = "response") returns estimated probabilities rather than class labels. By default, these probabilities correspond to the non-reference class of the response variable. In the churn_mlc dataset, the response variable churn has two levels, "yes" and "no". Since "yes" is the first factor level and therefore treated as the reference category, the predicted probabilities returned here represent the probability of "no" (i.e., not churning).

If the goal is instead to obtain predicted probabilities for "yes" (customer churn), the reference level should be redefined before data partitioning and model fitting. For example:

churn_mlc$churn = relevel(churn_mlc$churn, ref = "no")

Refitting the model after this change would cause predict() to return probabilities of churn directly. Importantly, while the numerical probabilities change interpretation, the underlying fitted model remains equivalent.
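Alternatively, the churn probabilities can be recovered without refitting: because the two classes are complementary, the probability of "yes" is simply one minus the returned probability of "no". (The name churn_probs below is ours, introduced only for illustration.)

```r
# Probability of churn ("yes") as the complement of P("no")
churn_probs <- 1 - logistic_probs
```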

At this stage, we retain the probabilistic predictions rather than converting them to class labels. This allows us to evaluate model performance across all possible classification thresholds using ROC curves and AUC, as discussed in Chapter 8.

Practice: How would you convert the predicted probabilities into binary class labels? Try using thresholds of 0.5 and 0.3. How do the resulting classifications differ, and what are the implications for false positives and false negatives?

Training the Naive Bayes Model

We briefly introduced the Naive Bayes classifier and its probabilistic foundations in Chapter 9. Here, we apply the model to the same customer churn prediction task, using the same set of predictors as in the logistic regression and kNN models to ensure a fair comparison.

Naive Bayes is a fast, probabilistic classifier that is particularly well suited to high-dimensional and mixed-type data. Its defining assumption is that predictors are conditionally independent given the class label. While this assumption is often violated in practice, Naive Bayes can still perform surprisingly well, especially as a baseline model.

We fit the Naive Bayes classifier using the naive_bayes() function from the naivebayes package:

library(naivebayes)

bayes_model = naive_bayes(formula, data = train_set)

Once the model is trained, we generate predicted class probabilities for the test set:

bayes_probs = predict(bayes_model, test_set, type = "prob")

The object bayes_probs is a matrix in which each row corresponds to a test observation and each column represents the estimated probability of belonging to one of the two classes (no or yes). As with logistic regression, we retain these probabilistic predictions rather than converting them to class labels, since they are required for threshold-independent evaluation using ROC curves and AUC.
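A quick inspection confirms the structure of this matrix; each row should form a valid probability distribution over the two classes:

```r
# First few rows: one column per class level of churn
head(bayes_probs)

# Each row of class probabilities must sum to 1
stopifnot(all(abs(rowSums(bayes_probs) - 1) < 1e-8))
```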

Practice: How might the conditional independence assumption affect the performance of Naive Bayes on this dataset, where usage variables such as call minutes and call counts are likely correlated? Compare this to the assumptions underlying logistic regression.

Training the kNN Model

The k-Nearest Neighbors (kNN) algorithm is a non-parametric, instance-based classifier that assigns a class label to each test observation based on the majority class among its \(k\) closest neighbors in the training set. Because kNN relies entirely on distance calculations, it is particularly sensitive to the scale and encoding of the input features.

We train a kNN model using the kNN() function from the liver package, setting the number of neighbors to \(k = 7\). This choice is informed by experimentation with different values of \(k\) using the kNN.plot() function, as discussed in Section 7.6. To ensure that all predictors contribute appropriately to distance computations, we apply min–max scaling and binary encoding using the scaler = "minmax" option:

knn_probs = kNN(
  formula = formula,
  train   = train_set,
  test    = test_set,
  k       = 7,
  scaler  = "minmax",
  type    = "prob"
)

This preprocessing step scales all numeric predictors to the \([0, 1]\) range and encodes binary categorical variables in a format suitable for distance-based modeling. As with logistic regression and Naive Bayes, we retain predicted class probabilities rather than class labels, since these probabilities are required for threshold-independent evaluation using ROC curves and AUC.

With predicted probabilities now available from all three models (logistic regression, Naive Bayes, and kNN), we are ready to compare their classification performance using ROC curves and the area under the curve.

Model Evaluation and Comparison

To evaluate and compare the performance of the three classification models across all possible classification thresholds, we use ROC curves and the Area Under the Curve (AUC) metric. As introduced in Chapter 8, the ROC curve plots the true positive rate against the false positive rate, while the AUC summarizes the overall discriminatory ability of a classifier: values closer to 1 indicate stronger separation between classes.

ROC-based evaluation is particularly useful in churn prediction settings, where class imbalance is common and the choice of classification threshold may vary depending on business objectives. We compute ROC curves using the pROC package, which requires class probabilities. For Naive Bayes and kNN we extract the column corresponding to the "yes" (churn) class; for logistic regression we pass the fitted probabilities directly. Although these probabilities correspond to the "no" class, as discussed above, roc() detects the direction of comparison automatically, so the resulting AUC is unaffected:

library(pROC)

roc_logistic = roc(test_labels, logistic_probs)
roc_bayes    = roc(test_labels, bayes_probs[, "yes"])
roc_knn      = roc(test_labels, knn_probs[, "yes"])

To facilitate comparison, we visualize all three ROC curves in a single plot:

library(ggplot2)

ggroc(list(roc_logistic, roc_bayes, roc_knn), size = 0.8) + 
  scale_color_manual(values = c("#377EB8", "#E66101", "#4DAF4A"),
           labels = c(
             paste("Logistic (AUC =", round(auc(roc_logistic), 3), ")"),
             paste("Naive Bayes (AUC =", round(auc(roc_bayes), 3), ")"),
             paste("kNN (AUC =", round(auc(roc_knn), 3), ")")
           )) +
  ggtitle("ROC Curves with AUC for Three Models") + 
  theme(legend.title = element_blank(), legend.position = c(.7, .3))

The ROC curves summarize the trade-off between sensitivity and specificity for each classifier. The corresponding AUC values are 0.834 for logistic regression, 0.866 for Naive Bayes, and 0.879 for kNN. Although kNN achieves the highest AUC, the differences among the three models are modest. This suggests that all three approaches provide comparable predictive performance on this dataset.

From a practical perspective, these results highlight an important modeling trade-off. While kNN offers slightly stronger discrimination, logistic regression and Naive Bayes remain attractive alternatives due to their interpretability, simplicity, and lower computational cost. In many real-world applications, such considerations may outweigh small gains in predictive accuracy.

Practice: Repartition the churn_mlc dataset using a 70%–30% train–test split. Following the same workflow as in this section, fit a logistic regression model, a Naive Bayes classifier, and a kNN model, and report the corresponding ROC curves and AUC values. Compare these results with those obtained using the 80%–20% split. What do you observe about the stability of model evaluation across different data partitions?

11.5 Chapter Summary and Takeaways

In this chapter, we extended the regression framework from continuous outcomes to two important non-continuous settings: binary responses and count outcomes. Building on the linear regression ideas developed in Chapter 10, we introduced generalized linear models as a broader modeling framework that combines an appropriate response distribution, a linear predictor, and a link function. This perspective showed that logistic regression and Poisson regression are not entirely separate techniques, but natural extensions of the same regression logic to different kinds of data.

We first examined logistic regression for binary outcomes. This model is designed for situations in which the response records whether an event occurs, such as whether a customer churns or not. By modeling the log-odds of the event rather than the outcome directly, logistic regression ensures that predicted probabilities remain between 0 and 1. We also showed how its coefficients can be interpreted through odds ratios and how predicted probabilities can be translated into class predictions when a decision threshold is required.

We then introduced Poisson regression for count outcomes. In this setting, the response variable records how many times an event occurs within a fixed interval, such as the number of doctor visits or service calls. Poisson regression links the expected count to the predictors through a log link, ensuring that fitted values remain positive and allowing predictor effects to be interpreted multiplicatively. At the same time, we emphasized that model assumptions still matter, particularly the assumption that the variance is tied to the mean, and we highlighted overdispersion as an important practical issue in count-data analysis.

The case study on customer churn brought these ideas together in a predictive setting. There, logistic regression served as the main generalized regression model and was compared with Naive Bayes and k-Nearest Neighbors using ROC curves and the area under the curve. This comparison illustrated an important practical lesson: models with different assumptions and structures can produce similar predictive performance, so model choice should not be based on accuracy alone. Interpretability, robustness, computational cost, and the goals of the analysis are also central considerations.

Taken together, this chapter reinforces a key message of the Data Science Workflow: effective modeling is not about choosing the most complicated method available, but about selecting a model that matches the structure of the response variable and the purpose of the analysis. Generalized regression models provide a flexible and interpretable way to extend regression beyond continuous outcomes while preserving a common modeling language. They allow us to reason about probabilities, expected counts, and uncertainty in a principled way, making them essential tools in modern data science.

11.6 Exercises

These exercises reinforce the main ideas of this chapter by combining conceptual understanding, interpretation of model output, and practical implementation in R. They focus on generalized linear models for binary and count outcomes, with particular attention to logistic regression, Poisson regression, and model comparison in applied settings. The datasets used in these exercises are available in the liver and AER packages.

Generalized Linear Models: Conceptual Questions

  1. Explain why ordinary linear regression is not appropriate for a binary response variable.

  2. Explain why ordinary linear regression is not appropriate for count data.

  3. What are the three main components of a generalized linear model?

  4. What is the role of a link function in a generalized linear model?

  5. Why does logistic regression use the logit link rather than the identity link?

  6. Why does Poisson regression use the log link?

  7. What is an odds ratio, and how is it interpreted in logistic regression?

  8. In Poisson regression, what does \(e^{b_j}\) represent for a predictor \(x_j\)?

  9. What is overdispersion in the context of Poisson regression, and why can it be problematic?

  10. Logistic regression and Poisson regression are both generalized linear models. What do they have in common, and how do they differ?

Hands-On Practice: Logistic Regression with the bank Dataset

data(bank, package = "liver")
  1. Fit a logistic regression model predicting y using age, balance, and duration.

  2. Interpret the estimated coefficient for duration. What does it suggest about the relationship between call duration and the probability of subscription?

  3. Compute the odds ratios for the fitted model and interpret them.

  4. Add campaign and previous to the model. How do the coefficient estimates and their significance change?

  5. Estimate the predicted probability of subscription for a new customer with the following profile: age = 40, balance = 1500, duration = 300, campaign = 1, and previous = 0.

  6. Convert the predicted probabilities into class predictions using a threshold of 0.5. Then repeat the analysis using a threshold of 0.3. How do the results differ?

  7. Construct a confusion matrix for the model using one of the thresholds above. What does the confusion matrix reveal about the model’s strengths and weaknesses?

  8. Compute accuracy, precision, recall, and F1-score for the fitted model. Which of these measures seems most informative in this setting, and why?

  9. Plot the ROC curve and compute the AUC. What does the AUC value suggest about the model’s ability to distinguish between subscribers and non-subscribers?

  10. Apply stepwise regression to obtain a simpler logistic regression model. Compare the selected model with the original model in terms of interpretability and predictive performance.

Hands-On Practice: Poisson Regression with the doctor_visits Dataset

library(liver)

data(doctor_visits)
  1. Fit a Poisson regression model predicting visits using age, income, illness, reduced, and health.

  2. Interpret the estimated coefficient for illness. Then exponentiate the coefficient and interpret the result on the expected-count scale.

  3. Add a quadratic term for age by fitting a model that includes I(age^2). Does this suggest a nonlinear relationship between age and the expected number of doctor visits?

  4. Compute the exponentiated coefficients for the fitted model. Choose two predictors and interpret their effects in context.

  5. Use the fitted model to estimate the expected number of doctor visits for three selected observations in the dataset.

  6. Compare observed and predicted counts for the first ten observations. Does the model appear to capture the variation in doctor visits reasonably well?

  7. Compute the residual deviance, residual degrees of freedom, and their ratio. Does this suggest possible overdispersion?

  8. If overdispersion appears to be present, explain why a quasi-Poisson or negative binomial model might be more appropriate.

Model Comparison and Interpretation

  1. In the customer churn case study, logistic regression was compared with Naive Bayes and k-Nearest Neighbors. Why is it useful to compare models with different assumptions on the same prediction task?

  2. Suppose logistic regression and kNN achieve similar AUC values, but logistic regression is easier to interpret. In what kinds of applications might logistic regression be preferred?

  3. Why are ROC curves and AUC useful when comparing classification models across different thresholds?

  4. A model with the highest AUC is not always the best choice in practice. Explain why, using ideas such as interpretability, computational cost, and robustness.

  5. Logistic regression produces estimated probabilities, while kNN and Naive Bayes can also be used for classification. Why can predicted probabilities be especially useful in decision-making settings?

Self-Reflection

  1. Think of a real-world problem involving a binary outcome, such as disease diagnosis, customer churn, or loan approval. Which predictors would you include in a logistic regression model, and why?

  2. Think of a real-world problem involving a count outcome, such as hospital visits, insurance claims, or website clicks. Would Poisson regression be a reasonable starting point? Explain your reasoning, including what model assumptions you would want to check.