5  Statistical Inference and Hypothesis Testing

Statistics is the grammar of science.

— Karl Pearson

Imagine a bank notices that customers who contact customer service frequently appear more likely to close their credit card accounts. Is this pattern evidence of a genuine underlying relationship, or could it simply reflect random variation in the data? Questions like these lie at the heart of statistical inference.

Statistical inference uses information from a sample to draw conclusions about a broader population. It enables analysts to move beyond the descriptive summaries of exploratory data analysis and toward evidence-based decision-making. In practice, inference helps answer questions such as: What proportion of customers are likely to churn? and Do churners make more service contacts on average than non-churners?

In Chapter 4, we examined the churn dataset and identified several promising patterns. For example, customers with more frequent service contacts or lower spending levels appeared more likely to churn. However, EDA alone cannot tell us whether these differences reflect genuine population-level effects or are merely artifacts of sampling variability. Statistical inference provides the framework to make such distinctions in a principled way.

This chapter emphasizes that sound inference relies on more than formulas or computational steps. It requires critical thinking: recognizing how randomness influences observed data, understanding the limitations of sample-based conclusions, and interpreting results with appropriate caution. Misunderstandings can lead to misleading or overconfident claims, a theme highlighted in Darrell Huff’s classic book How to Lie with Statistics. Strengthening your skills in statistical reasoning will help you evaluate evidence rigorously and draw conclusions that are both accurate and defensible.

What This Chapter Covers

This chapter introduces statistical inference, a set of methods that allow us to draw conclusions about populations using information from samples. Building on the exploratory work of earlier chapters, the focus now shifts from identifying patterns to evaluating whether those patterns reflect meaningful population-level effects. This transition is a central step in the data science workflow, where initial insights are tested and uncertainty is quantified.

The chapter begins with point estimation, where sample statistics are used to estimate unknown population parameters. It then introduces confidence intervals, which provide a principled way to express the uncertainty associated with these estimates. Hypothesis testing follows, offering a framework for assessing whether observed differences or associations are likely to have arisen by chance. Along the way, you will work with several real-world datasets, including churn and diamonds in the main text, and the bank, churn_mlc, and marketing datasets from the liver package in the exercises.

Throughout the chapter, you will apply these inferential tools in R to evaluate patterns, interpret p-values and confidence intervals, and distinguish statistical significance from practical relevance. These skills form the basis for reliable, data-driven conclusions and support the Modeling work that follows.

The chapter concludes by revisiting how statistical inference supports later phases of the data science workflow, including validating data partitions and assessing feature relevance for Modeling, topics that will be developed further in Chapter 6.

5.1 Introduction to Statistical Inference

Data science is not only about describing data; it is about learning from data in order to understand the real world. When we analyze a dataset such as churn, we observe information about a particular group of customers. Yet our ultimate goal is rarely limited to those specific observations. Instead, we seek to draw conclusions about the broader customer population: Are churn rates increasing? Do customers with frequent service interactions tend to leave more often? Are observed differences meaningful, or are they simply the result of random fluctuations?

Statistical inference provides the framework for answering such questions. It connects the data observed in a sample with the larger population from which that sample is drawn. Because we rarely have access to the entire population, we must rely on partial information. This reliance introduces uncertainty, and probability serves as the language used to quantify it, as illustrated in Figure 5.1.

Figure 5.1: A conceptual overview of statistical inference. Data from a sample are used to infer properties of the population, with probability quantifying uncertainty.

A crucial assumption underlying statistical inference is that the data meaningfully represent the population of interest. In classical settings, this is achieved through random sampling, where each member of the population has a known and non-zero probability of being included in the sample. Random sampling does not eliminate variability; rather, it ensures that differences we observe are driven by natural randomness instead of systematic bias.

In many real-world data science applications, however, datasets are observational rather than generated through carefully designed sampling schemes. The principle remains the same: valid inference depends on whether the data reasonably reflect the population about which conclusions are drawn. If certain groups are overrepresented or underrepresented, conclusions may be distorted.

When we analyze a dataset, we observe only one realization of a broader process. If we were to collect another sample from the same population, the numerical summaries would almost certainly differ. Statistical inference acknowledges this variability and provides tools to assess how much confidence we should place in our conclusions.

In statistical notation, we distinguish between population parameters and sample statistics. Population parameters, which are fixed but unknown, are typically denoted by Greek letters such as \(\mu\) (population mean), \(\sigma\) (population standard deviation), and \(\pi\) (population proportion). Their observable counterparts computed from data are denoted by symbols such as \(\bar{x}\) (sample mean), \(s\) (sample standard deviation), and \(p\) (sample proportion). The aim of inference is to use sample statistics to learn about unknown population parameters, as illustrated in Figure 5.1.

To formalize how sample statistics vary across repeated samples, statistical inference relies on probability distributions. Certain distributions play a central role. The normal distribution often arises when sample sizes are large. The \(t\)-distribution accounts for additional uncertainty when population variability is estimated from the data. The chi-square distribution is commonly used when analyzing categorical data and assessing variability in frequency tables. These distributions form the mathematical foundation for the confidence intervals and hypothesis tests developed later in this chapter.
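In R, these distributions are available through the standard quantile functions. The following sketch (using generic reference values, not tied to any dataset) shows the critical values that will reappear later in this chapter:

```r
qnorm(0.975)         # 95% two-sided critical value from the normal distribution (about 1.96)
qt(0.975, df = 29)   # t critical value for a sample of size 30 (slightly larger than 1.96)
qchisq(0.95, df = 1) # chi-square critical value with 1 degree of freedom (about 3.84)

# As the degrees of freedom grow, the t-distribution approaches the normal:
qt(0.975, df = c(10, 30, 100, 1000))
```

Note how the t critical values shrink toward the normal value of 1.96 as the degrees of freedom increase, reflecting the reduced uncertainty that comes with larger samples.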

Within the Data Science Workflow (see Figure 2.3), inference bridges exploratory data analysis and predictive modeling. Exploratory analysis helps reveal potential patterns, such as higher churn rates among customers with many service contacts (contacts_count_12). Inference then evaluates whether these patterns are likely to reflect genuine population-level relationships or could plausibly have arisen by chance.

Statistical thinking therefore requires more than applying formulas or running functions in R. It involves asking how the data were generated, what assumptions are being made, and whether observed patterns are likely to persist beyond the specific sample at hand. A statistically significant result is meaningful only if the data are relevant to the question and reasonably representative of the population of interest. Without this foundation, even technically correct calculations may lead to misleading interpretations. In this sense, statistical inference is not merely a computational procedure but a disciplined way of thinking about evidence, variability, and generalization.

Statistical inference is typically organized around three interconnected components (see Figure 5.2). Point estimation provides numerical summaries that serve as estimates of population parameters. Confidence intervals quantify the uncertainty surrounding those estimates. Hypothesis testing offers a structured framework for evaluating whether observed patterns are statistically credible.

Figure 5.2: The three core goals of statistical inference: point estimation, confidence intervals, and hypothesis testing. Together they support reliable generalization from sample data.

These components build on one another. Estimation provides an initial summary, confidence intervals express the associated uncertainty, and hypothesis testing formalizes decisions about statistical evidence. Together, they allow analysts to move beyond description toward principled, evidence-based conclusions. The remainder of this chapter introduces each component in turn, beginning with point estimation and progressing through confidence intervals and hypothesis testing, supported by practical implementation in R.

5.2 Point Estimation

When analyzing sample data, an essential first step in statistical inference is to estimate characteristics of the population from which the sample is drawn. In Section 4.3, we summarized patterns in the churn dataset using descriptive statistics. We now formalize those summaries as estimates of population parameters. These parameters include quantities such as the average number of customer service contacts, the typical transaction amount, or the proportion of customers who churn.

Because we rarely have access to the entire population, we rely on point estimates computed from sample data. A population parameter is fixed but unknown, whereas a sample statistic is observable and varies from sample to sample. A point estimate is a single numerical value that serves as our best guess for a population parameter. For example, the sample mean estimates the population mean, and the sample proportion estimates the population proportion.

To illustrate, consider the proportion of customers who churn in the churn dataset:

library(liver) # load the liver package to access the churn dataset
data(churn)    # load the churn dataset

prop.table(table(churn$churn))["yes"]
         yes 
   0.1606596

The resulting value, 0.16, provides a sample-based estimate of the true proportion of churners in the broader customer population.

Similarly, we can estimate the average annual transaction amount across all customers:

mean(churn$transaction_amount_12)
   [1] 4404.086

The computed mean, 4404.09, serves as a point estimate of the corresponding population mean.

Practice: Estimate the average annual transaction amount among customers who churn (churn == "yes") in the churn dataset by first subsetting the data and then computing the sample mean of transaction_amount_12. How does this estimate compare to the overall average computed above? What might this difference suggest about the spending behavior of churned customers?

5.3 Confidence Intervals: Quantifying Uncertainty

While point estimates provide useful summaries, they do not indicate how precise those estimates are. A single number does not reveal how much it might vary across different samples. Without accounting for this variability, we risk interpreting random fluctuations as meaningful patterns. Confidence intervals address this limitation by providing a principled way to quantify uncertainty and assess the reliability of our estimates.

Confidence intervals express the uncertainty associated with estimating population parameters. Rather than reporting only a point estimate, such as “the average annual transaction amount in the churn dataset is $4,404,” a confidence interval might state that “we are 95 percent confident that the true average lies between $4,337 and $4,470.” This range reflects the sampling variability that arises whenever conclusions are drawn from a sample rather than from the entire population.

At its core, a confidence interval has a simple structure: \[ \text{Point Estimate} \pm \text{Margin of Error}. \]

The margin of error quantifies the uncertainty surrounding the estimate and can be written as \[ \text{Margin of Error} = \text{Critical Value} \times \text{Standard Error}. \]

The standard error measures how much a statistic is expected to vary from sample to sample. The critical value determines how wide the interval must be to achieve a chosen confidence level, such as 95 percent. Together, these components translate sampling variability into an interpretable range of plausible values. We now develop confidence intervals for two fundamental population parameters: a population mean and a population proportion.

Confidence Interval for a Population Mean

Suppose we wish to estimate the average annual transaction amount in the customer population represented by the churn dataset. Let \(\mu\) denote the unknown population mean and \(\bar{x}\) the sample mean.

For a sample of size \(n\) with sample standard deviation \(s\), the standard error of the mean is \[ \frac{s}{\sqrt{n}}. \]

This expression reveals two important ideas. Larger samples reduce uncertainty because the denominator \(\sqrt{n}\) increases as \(n\) grows. Conversely, greater variability in the data increases uncertainty, since a larger value of \(s\) produces a wider interval.
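The effect of sample size on the standard error is easy to see numerically. The sketch below uses an illustrative standard deviation of s = 100 (a made-up value, not taken from the churn data):

```r
s <- 100                    # illustrative sample standard deviation (hypothetical)
n <- c(25, 100, 400, 1600)  # increasing sample sizes
se <- s / sqrt(n)           # standard error of the mean for each n
se                          # each quadrupling of n halves the standard error
```

Quadrupling the sample size only halves the standard error, which is why gains in precision become progressively more expensive as samples grow.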

To construct a confidence interval, we multiply the standard error by a critical value, which determines how wide the interval must be to achieve a chosen confidence level. For example, a 95 percent confidence level requires a multiplier large enough so that, in repeated sampling, approximately 95 percent of constructed intervals contain the true population mean.

When the population standard deviation \(\sigma\) is unknown, which is almost always the case in practice, we use the \(t\)-distribution. The confidence interval for the population mean is \[ \bar{x} \pm t_{\frac{\alpha}{2}, n-1}\left(\frac{s}{\sqrt{n}}\right), \]

where \(t_{\alpha/2, n-1}\) denotes the critical value from the \(t\)-distribution with \(n-1\) degrees of freedom. Although the notation may appear technical, the structure remains intuitive: estimate \(\pm\) (multiplier \(\times\) uncertainty). The width of the interval reflects the combined influence of sample size, variability, and the chosen confidence level. See Figure 5.3 for a visual representation of a confidence interval for a population mean.

Figure 5.3: Confidence interval for a population mean. The interval is centered around the point estimate, with its width determined by the margin of error. The confidence level specifies the long-run proportion of such intervals that contain the true parameter.

The \(t\)-distribution closely resembles the standard normal (\(z\)) distribution but accounts for additional uncertainty introduced when \(\sigma\) is estimated from the data. When \(\sigma\) is known, the interval becomes \[ \bar{x} \pm z_{\frac{\alpha}{2}}\left(\frac{\sigma}{\sqrt{n}}\right). \]

In practice, however, \(\sigma\) is rarely known, so the \(t\)-distribution is typically used. For large sample sizes, the \(t\)-distribution approaches the normal distribution, and the resulting intervals are nearly identical.
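To make the formula concrete, the interval can be assembled step by step. The sketch below computes a 95 percent interval for the average annual transaction amount by hand; its limits should match the t.test() output used later in this section.

```r
library(liver)  # provides the churn dataset
data(churn)

x <- churn$transaction_amount_12
n <- length(x)
x_bar <- mean(x)                 # point estimate of the population mean
se <- sd(x) / sqrt(n)            # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # t critical value for 95% confidence

c(x_bar - t_crit * se, x_bar + t_crit * se)  # lower and upper limits
```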

In practice, we rarely compute these quantities manually. Statistical software determines the appropriate critical value and standard error automatically. To construct a 95 percent confidence interval for the average annual transaction amount in the churn dataset, we use:

t_result = t.test(churn$transaction_amount_12, conf.level = 0.95)
t_result$conf.int
   [1] 4337.915 4470.258
   attr(,"conf.level")
   [1] 0.95

The t.test() function in R performs a \(t\)-test and, by default, also returns a confidence interval for the population mean. In this example, we use it to compute the 95 percent confidence interval for the average annual transaction amount across customers in the dataset. The argument conf.level = 0.95 specifies the desired confidence level, while the function automatically determines the appropriate \(t\) critical value based on the sample size. Although we focus here on the confidence interval, the same function will be used in Sections 5.5 and 5.7 to test hypotheses about population means.

Interpretation requires care. A 95 percent confidence interval does not mean that there is a 95 percent probability that the true mean lies within this specific interval. Instead, if we were to repeatedly draw samples of the same size and construct a confidence interval from each, approximately 95 percent of those intervals would contain the true population mean. The population mean \(\mu\) is fixed, whereas the interval endpoints vary from sample to sample.

Practice: Construct a 95 percent confidence interval for the average annual transaction amount among customers who churn (churn == "yes") in the churn dataset by first subsetting the data and then applying t.test() with conf.level = 0.95. Compare the resulting interval with the overall confidence interval computed above. How does the average transaction amount among churned customers differ from the overall average, and how might differences in sample size or variability influence the width of the interval?

Confidence Interval for a Population Proportion

Suppose we wish to estimate the overall churn rate in the customer population. From the churn dataset, we can compute the sample proportion of customers who churn. But how precise is this estimate? How much might it vary if we were to observe a different sample?

Let \(\pi\) denote the true population proportion of customers who churn, and let \(p\) be the sample proportion computed from the data. For sufficiently large samples, the sampling distribution of \(p\) is approximately normal. A common rule of thumb is that both \(np\) and \(n(1-p)\) should be at least 5 (or 10) to justify this approximation. Under this condition, the standard error of the sample proportion is \[ \sqrt{\frac{p(1-p)}{n}}. \]

A confidence interval for the population proportion is therefore given by \[ p \pm z_{\frac{\alpha}{2}} \sqrt{\frac{p(1-p)}{n}}, \] where \(z_{\alpha/2}\) is the critical value from the standard normal distribution corresponding to the chosen confidence level.

To construct a 95 percent confidence interval for the proportion of customers who churn in the churn dataset, we use:

prop_result = prop.test(table(churn$churn)["yes"],
                        n = nrow(churn),
                        conf.level = 0.95)

prop_result$conf.int
   [1] 0.1535880 0.1679904
   attr(,"conf.level")
   [1] 0.95

The prop.test() function computes the interval automatically using a normal approximation (with a continuity correction by default). The resulting interval provides a range of plausible values for the true churn rate in the broader customer population.

As with the mean, interpretation requires care. A 95 percent confidence interval for \(\pi\) does not imply that there is a 95 percent probability that the true proportion lies within this specific interval. Rather, it means that if we repeatedly drew samples of the same size and constructed intervals in the same way, approximately 95 percent of those intervals would contain the true population proportion. The width of the interval depends on both the sample size and the variability in the data. Larger samples produce narrower intervals, reflecting greater precision.
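The textbook formula can also be applied directly. The sketch below first checks the large-sample condition and then computes the uncorrected z-interval; its limits differ slightly from the prop.test() output above because prop.test() applies a continuity correction by default.

```r
library(liver)  # provides the churn dataset
data(churn)

p <- mean(churn$churn == "yes")  # sample proportion of churners
n <- nrow(churn)

c(n * p, n * (1 - p))            # rule-of-thumb check: both should exceed 5 (or 10)

se <- sqrt(p * (1 - p) / n)      # standard error of the sample proportion
z_crit <- qnorm(0.975)           # 95% critical value from the normal distribution

c(p - z_crit * se, p + z_crit * se)  # uncorrected z-interval
```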

Practice: Construct a 90 percent confidence interval for the proportion of customers who churn by setting conf.level = 0.90. Compare its width with the 95 percent interval. How does changing the confidence level alter the critical value, and what does this imply about the trade-off between precision and certainty?

Confidence intervals also play an important role when comparing groups. Constructing separate intervals for churned and active customers allows us to assess whether their average transaction amounts plausibly differ at the population level. In this way, confidence intervals not only quantify uncertainty but also prepare the ground for the formal hypothesis testing framework introduced next.

5.4 Hypothesis Testing

Suppose a bank introduces a new customer service protocol and randomly applies it to a subset of customers. After several months, analysts observe that the treated group exhibits a slightly lower churn rate than the untreated group. But does this difference reflect a genuine improvement? Or could it simply be the result of natural sampling variability?

Questions of this kind arise frequently in data-driven decision making. Observed differences are common, but not all differences are meaningful. Some emerge because of random fluctuation rather than underlying population-level change.

Hypothesis testing provides a structured framework for distinguishing between these possibilities. While confidence intervals quantify the uncertainty surrounding an estimate, hypothesis testing addresses a different but complementary question: Do the data provide sufficient evidence to challenge a specific claim about the population?

Within the Data Science Workflow, hypothesis testing forms a bridge between exploratory observations and formal conclusions. In Section 4.3, exploratory analysis suggested that churn tends to be higher among customers with lower spending and fewer transactions. Such patterns are informative, but they remain descriptive. Hypothesis testing allows us to evaluate whether these observed relationships are statistically credible or could plausibly have arisen by chance.

In this sense, hypothesis testing is not merely a computational procedure. It is a disciplined way of assessing evidence and making decisions under uncertainty. We begin by formalizing the logic of hypothesis testing before applying it to concrete examples using the churn dataset.

The Logic of Hypothesis Testing

At its core, hypothesis testing is based on a simple but powerful idea: begin with a default assumption about the population and evaluate whether the observed data provide sufficient evidence to question it.

The framework is built around two competing statements:

  • The null hypothesis (\(H_0\)): the default assumption, typically representing no difference, no association, or equality of population parameters.

  • The alternative hypothesis (\(H_a\)): the competing claim that a difference or association exists.

The logic proceeds by temporarily assuming that the null hypothesis is true. Under this assumption, we ask: If there were truly no difference or no relationship, how likely would we be to observe data as extreme as those obtained? If the observed data would be very unlikely under \(H_0\), this casts doubt on the null hypothesis. Conversely, if the data are consistent with what we would expect under \(H_0\), we do not have strong evidence against it.

This reasoning may initially appear indirect. Why not attempt to prove the alternative hypothesis directly? In statistical inference, it is typically easier to assess how incompatible the data are with a clearly specified baseline assumption than to prove that a particular alternative explanation must be true. Hypothesis testing therefore evaluates evidence against \(H_0\) rather than attempting to confirm \(H_a\). In this way, hypothesis testing provides a disciplined approach to reasoning under uncertainty: we measure how surprising the data would be under a default assumption and base our conclusions on the strength of that evidence.

Practice: Consider the claim that the average number of customer service calls is 2 per month. Formulate the null and alternative hypotheses for testing whether the true average differs from this value. Clearly state which hypothesis represents the default assumption.

Intuition Behind the p-Value

The p-value formalizes the idea of how unusual the observed data are under the null hypothesis. More precisely, the p-value is the probability of obtaining results at least as extreme as those observed, assuming that \(H_0\) is true.

To build intuition, suppose the null hypothesis states that there is no difference in churn rates between two customer groups. If this assumption were correct, any observed difference in sample proportions would arise purely from random variation. Small differences would occur frequently, whereas very large differences would be rare.

Now imagine that the data reveal a difference that appears substantial. The central question becomes: If there were truly no difference in the population, how likely would it be to observe a difference at least this large? The p-value answers precisely this question. It measures the probability of observing data as extreme as those obtained, under the assumption that \(H_0\) is true (see Figure 5.5 for a visualization).

A small p-value indicates that the observed result would be unlikely if the null hypothesis were correct. This provides evidence against \(H_0\). A large p-value indicates that the observed data are consistent with what we would expect under \(H_0\), and therefore do not provide strong evidence against it.

It is essential to understand what the p-value does not represent. It is not the probability that \(H_0\) is true. Rather, it is the probability of the observed data, or more extreme data, given that \(H_0\) is true. The distinction is fundamental: hypothesis testing does not determine whether the null hypothesis is true or false. Instead, it evaluates how incompatible the observed data are with the null assumption. In this way, the p-value serves as a quantitative measure of the strength of evidence against \(H_0\).
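One way to internalize this definition is through simulation. The sketch below is purely illustrative: it assumes a common churn probability of 0.16 for two hypothetical groups of 1,000 customers each, simulates many samples under this null scenario, and asks how often the difference in sample proportions is at least as large as an imagined observed difference of 0.03.

```r
set.seed(1)            # for reproducibility
n <- 1000              # customers per group (hypothetical)
p0 <- 0.16             # common churn rate under H0 (hypothetical)
observed_diff <- 0.03  # the difference we imagine having observed (hypothetical)

sim_diff <- replicate(10000, {
  g1 <- rbinom(1, n, p0) / n  # simulated churn rate, group 1
  g2 <- rbinom(1, n, p0) / n  # simulated churn rate, group 2
  abs(g1 - g2)                # absolute difference under H0
})

mean(sim_diff >= observed_diff)  # approximate two-sided p-value
```

The final line estimates exactly what the p-value describes: the long-run proportion of samples, generated under \(H_0\), that produce a difference at least as extreme as the one observed.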

Practice: Suppose a hypothesis test yields a p-value of 0.08. At a significance level of \(\alpha = 0.05\), what decision would you make? Does this result prove that the null hypothesis is true? Explain your reasoning in words.

Decision Rule and Significance Level

To make decisions systematic and transparent, we compare the p-value to a pre-specified threshold known as the significance level, denoted by \(\alpha\). A common choice is \(\alpha = 0.05\), although other values may be appropriate depending on the context.

The significance level is chosen before examining the data. It represents the maximum probability of incorrectly rejecting a true null hypothesis that we are willing to tolerate. In other words, \(\alpha\) controls the risk of a Type I error. The decision rule is straightforward:

Reject \(H_0\) if the p-value is less than \(\alpha\).

If the p-value is greater than or equal to \(\alpha\), we do not reject \(H_0\). It is important to emphasize that we do not “accept” the null hypothesis. A large p-value does not confirm that \(H_0\) is true; it simply indicates that the data do not provide strong evidence against it.

To illustrate this logic, consider a simple example. Suppose we test whether the average account tenure differs from 36 months: \[ \begin{cases} H_0: \mu = 36 \\ H_a: \mu \neq 36 \end{cases} \] Assume that a one-sample t-test produces a p-value of \(0.03\). If we set \(\alpha = 0.05\), we reject \(H_0\) because \(0.03 < 0.05\). The observed data would be relatively unlikely under the assumption that the true mean equals 36 months. If instead we had chosen \(\alpha = 0.01\), we would not reject \(H_0\), since \(0.03 > 0.01\). The same data therefore lead to different decisions depending on the pre-specified tolerance for Type I error.

This example highlights two important principles. First, hypothesis testing is based on controlled risk rather than certainty. Second, conclusions depend jointly on the observed evidence and the chosen significance level.

There is also a close relationship between hypothesis testing and confidence intervals. For a two-sided test at significance level \(\alpha\), rejecting \(H_0\) is equivalent to observing that the hypothesized parameter value lies outside the corresponding \((1 - \alpha)\) confidence interval. Both approaches rely on the same underlying measure of sampling variability and provide complementary perspectives on statistical evidence (see Figure 5.4).

Figure 5.4: Visual summary of hypothesis testing, showing how sample evidence informs the decision to reject or not reject the null hypothesis (\(H_0\)).

Errors and Statistical Power

Because hypothesis testing relies on sample data rather than the entire population, decisions are inherently subject to uncertainty. As a result, two types of error are possible.

A Type I error occurs when we reject \(H_0\) even though it is true. In practical terms, this means concluding that a difference or association exists when, in reality, it does not. The probability of making a Type I error is denoted by \(\alpha\), the significance level chosen before conducting the test. By selecting \(\alpha = 0.05\), for example, we accept a 5 percent risk of incorrectly rejecting a true null hypothesis.

A Type II error occurs when we do not reject \(H_0\) even though it is false. In this case, a genuine difference or association exists in the population, but the sample data fail to provide strong enough evidence to detect it. The probability of making a Type II error is denoted by \(\beta\).

These two errors reflect different types of risk. Reducing \(\alpha\) lowers the probability of a Type I error but may increase the probability of a Type II error. In practice, there is often a trade-off between being too quick to declare a difference and being too cautious to detect a real one.

A related concept is statistical power, defined as \[ \text{Power} = 1 - \beta, \] which represents the probability of correctly rejecting \(H_0\) when it is false. Power measures the ability of a test to detect a genuine difference when one exists. Statistical power increases as the sample size grows, because larger samples reduce sampling variability and make true differences easier to detect. Power also increases when the variability in the data is smaller or when the true difference between groups is larger. In contrast, small samples or highly variable data make it more difficult to distinguish signal from noise, thereby reducing power. The possible outcomes of hypothesis testing are summarized in Table 5.1.

Table 5.1: Possible outcomes of hypothesis testing.
Decision              | Reality: \(H_0\) is True  | Reality: \(H_0\) is False
Do not Reject \(H_0\) | Correct decision          | Type II Error (\(\beta\))
Reject \(H_0\)        | Type I Error (\(\alpha\)) | Correct decision

An analogy often used is a criminal trial. The null hypothesis represents the presumption of innocence. Rejecting \(H_0\) corresponds to convicting a defendant. A Type I error is convicting an innocent person, whereas a Type II error is acquitting a guilty one. As in legal decision-making, hypothesis testing involves balancing the risks of different types of mistakes rather than eliminating uncertainty altogether.
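The interplay among sample size, effect size, variability, and power can be explored with base R's power.t.test() function. The numbers below are illustrative and not drawn from the churn data:

```r
# Power of a two-sample t-test to detect a difference of 5 units
# when the standard deviation is 15 and each group has 100 observations:
power.t.test(n = 100, delta = 5, sd = 15, sig.level = 0.05)

# Solving instead for the sample size per group needed to reach 80 percent power:
power.t.test(delta = 5, sd = 15, sig.level = 0.05, power = 0.80)
```

Leaving one argument unspecified tells power.t.test() which quantity to solve for, making it a convenient tool for planning how large a sample a study requires.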

One-Tailed and Two-Tailed Tests

The form of a hypothesis test depends on the research question and, more specifically, on how the alternative hypothesis is formulated. In some situations, the goal is to determine whether a parameter differs from a specified value in either direction. In other cases, the interest lies in detecting a deviation in a particular direction.

A two-tailed test is appropriate when the alternative hypothesis states that the parameter is not equal to a specified value, written as \(H_a: \theta \neq \theta_0\). This form is used when deviations in both directions are of interest. For example, we may wish to assess whether the average annual transaction amount differs from $4,000, without specifying whether it is higher or lower.

A right-tailed test is used when the alternative hypothesis asserts that the parameter exceeds a specified value, written as \(H_a: \theta > \theta_0\). This form is appropriate when the objective is to detect an increase. For instance, we might test whether the churn rate is greater than 30 percent.

A left-tailed test is used when the alternative hypothesis proposes that the parameter is smaller than a specified value, written as \(H_a: \theta < \theta_0\). This form applies when the objective is to detect a decrease, such as determining whether the average number of months on book is less than 24 months.

The specification of the alternative hypothesis determines which outcomes are considered “at least as extreme” as the observed test statistic and therefore how the p-value is computed. Under \(H_0\), we examine the sampling distribution of the test statistic and calculate the p-value as an area under that distribution. In a two-tailed test, extreme values in both directions provide evidence against \(H_0\), so the p-value is the combined area in the two tails. In a right-tailed test, only unusually large values contribute to the p-value, corresponding to the area to the right of the observed statistic. In a left-tailed test, only unusually small values contribute, corresponding to the area to the left. Figure 5.5 visualizes these three cases and illustrates how the shaded tail region represents the p-value in each setting.

Figure 5.5: P-values represented as tail areas under the sampling distribution assumed by \(H_0\). The shaded region corresponds to the p-value for two-tailed, right-tailed, and left-tailed tests.

The choice between a one-tailed and two-tailed test must be determined by the research question before examining the data. Selecting the direction after observing the results compromises the validity of the p-value and increases the risk of drawing misleading conclusions.
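The three tail-area computations can be carried out directly from the t-distribution with pt(); the observed statistic and degrees of freedom below are hypothetical, chosen only to illustrate the mechanics.

```r
# Tail-area p-values for a hypothetical observed statistic
# t_obs = -1.8 on 50 degrees of freedom.
t_obs <- -1.8
df    <- 50

p_left  <- pt(t_obs, df)                      # left-tailed test
p_right <- pt(t_obs, df, lower.tail = FALSE)  # right-tailed test
p_two   <- 2 * pt(-abs(t_obs), df)            # two-tailed test

round(c(left = p_left, right = p_right, two_tailed = p_two), 4)
```

Note that the two-tailed p-value is exactly twice the smaller tail area, which mirrors the shaded regions in Figure 5.5.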

Choosing the Appropriate Hypothesis Test

Once the null and alternative hypotheses have been clearly formulated, the next step is to select a statistical test that aligns with both the research question and the structure of the data. This decision is not arbitrary; it depends primarily on the type of outcome variable and the number of groups or variables being compared.

A useful starting point is to determine whether the primary variable of interest is numerical or categorical. If the goal is to draw inference about a population mean based on numerical data, tests such as the one-sample or two-sample t-test are appropriate. If the objective is to draw inference about proportions or relationships between categorical variables, tests for proportions or chi-square procedures are typically used. When more than two groups are compared with respect to a numerical outcome, analysis of variance (ANOVA) provides a natural extension of the two-sample t-test. When assessing linear association between two numerical variables, a correlation test is appropriate.

In this way, the choice of test reflects the structure of the variables involved and the specific parameter being evaluated. The null hypothesis always expresses a statement about equality or lack of association, but the mathematical form of the test statistic depends on the data type and comparison framework. Table 5.2 provides a concise summary of commonly used hypothesis tests and the types of research questions to which they apply.

Table 5.2: Common hypothesis tests, their corresponding null hypotheses, and the types of data to which they are typically applied.
Test | Null Hypothesis (\(H_0\)) | Applied To
One-sample t-test | \(H_0: \mu = \mu_0\) | Single numerical variable
Test for Proportion | \(H_0: \pi = \pi_0\) | Single categorical variable
Two-sample t-test | \(H_0: \mu_1 = \mu_2\) | Numerical outcome for two independent groups
Paired t-test | \(H_0: \mu_d = 0\) | Numerical outcome for paired observations
Two-sample Z-test | \(H_0: \pi_1 = \pi_2\) | Two binary categorical variables
Chi-square Test | \(H_0: \pi_1 = \cdots = \pi_k\) | Two categorical variables
Analysis of Variance (ANOVA) | \(H_0: \mu_1 = \cdots = \mu_k\) | Numerical outcome across three or more groups
Correlation Test | \(H_0: \rho = 0\) | Two numerical variables

The remainder of this chapter applies these tests to concrete examples drawn from the churn dataset. Each example emphasizes not only the mechanics of implementation in R but also the interpretation of results in context. By linking statistical procedures to practical questions, we reinforce the central objective of inference: making principled conclusions from data while acknowledging uncertainty.

5.5 One-sample t-test

Suppose a bank believes that customers typically remain active for 36 months before they churn. Has customer behavior changed in recent years? Is the average tenure of customers still close to this benchmark? The one-sample t-test provides a principled way to evaluate such questions.

The one-sample t-test assesses whether the mean of a numerical variable in a population equals a specified value. It is commonly used when organizations compare sample evidence with a theoretical expectation or operational benchmark. Because the population standard deviation is usually unknown, the test statistic follows a \(t\)-distribution, which accounts for the additional uncertainty introduced by estimating variability from the sample.

The form of the hypotheses depends on the research question. For a two-sided test, the hypotheses are \[ \begin{cases} H_0: \mu = \mu_0 \\ H_a: \mu \neq \mu_0 \end{cases} \]

If interest lies in detecting a decrease, the alternative becomes \(H_a: \mu < \mu_0\), whereas to detect an increase, it becomes \(H_a: \mu > \mu_0\).

To illustrate, we return to the churn dataset. In earlier exploratory analysis, the variable months_on_book emerged as an important feature related to customer retention. We now assess whether the average account tenure differs from the benchmark value of 36 months at the 5 percent significance level (\(\alpha = 0.05\)). The hypotheses are \[ \begin{cases} H_0: \mu = 36 \\ H_a: \mu \neq 36 \end{cases} \]

We conduct the test in R as follows:

t_test <- t.test(churn$months_on_book, mu = 36)
t_test
   
    One Sample t-test
   
   data:  churn$months_on_book
   t = -0.90208, df = 10126, p-value = 0.367
   alternative hypothesis: true mean is not equal to 36
   95 percent confidence interval:
    35.77284 36.08397
   sample estimates:
   mean of x 
    35.92841

The output reports the test statistic, degrees of freedom, the p-value, and a 95 percent confidence interval for the population mean. The p-value is 0.37, which exceeds \(\alpha = 0.05\). We therefore do not reject \(H_0\), indicating that the data do not provide sufficient evidence that the average tenure differs from 36 months.

The 95 percent confidence interval is (35.77, 36.08), and because 36 lies within this interval, the conclusion is consistent with the hypothesis test. The sample mean is 35.93 months, which serves as our point estimate of the population mean. The test statistic follows a \(t\)-distribution with \(n - 1\) degrees of freedom because the population standard deviation is unknown.
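The reported statistic can be reproduced from the textbook formula \(t = (\bar{x} - \mu_0)/(s/\sqrt{n})\). The sketch below uses simulated tenures rather than the actual churn file, so the numbers are illustrative:

```r
# Verifying t.test() against the manual formula on simulated
# tenure data (illustrative; not the real churn dataset).
set.seed(1)
x   <- rnorm(200, mean = 36, sd = 8)  # hypothetical tenures in months
mu0 <- 36

t_manual  <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
t_builtin <- unname(t.test(x, mu = mu0)$statistic)

all.equal(t_manual, t_builtin)  # TRUE
```

Working through the formula once by hand makes it clear why the statistic shrinks toward zero when the sample mean is close to the benchmark and why larger samples amplify small deviations.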

From a practical perspective, even if a difference had been statistically significant, its magnitude would need to be evaluated carefully. A deviation of a few tenths of a month may have limited operational relevance, whereas a difference of several months could meaningfully affect retention strategy.

Practice: Test whether the average account tenure of customers is less than 36 months. Formulate the hypotheses and conduct a left-tailed test using t.test() with the option alternative = "less". Interpret the result at \(\alpha = 0.05\).

Practice: Subset the data to customers who churn (subset(churn, churn == "yes")) and use a one-sample t-test to assess whether the average annual transaction amount (transaction_amount_12) differs from $4,000.

The one-sample t-test provides a structured method for comparing a sample mean with a fixed benchmark. Statistical significance indicates whether an observed difference is unlikely to be due to random variation alone, but practical relevance determines whether the difference matters in context. Effective decision-making requires attention to both.

5.6 Hypothesis Testing for Proportion

Suppose a bank believes that 15 percent of its credit card customers churn each year. Has that rate changed in the current quarter? Are recent retention strategies having a measurable impact? These are common analytical questions whenever the outcome of interest is binary, such as churn versus no churn. To formally assess whether the observed proportion in a sample differs from a historical or expected benchmark, we use a test for a population proportion.

A proportion test evaluates whether the population proportion (\(\pi\)) of a particular category is equal to a hypothesized value (\(\pi_0\)). It is most appropriate when analyzing binary categorical variables, such as service subscription, default status, or churn. The prop.test() function in R implements this test and can be used either for a single proportion or for comparing two proportions.

Suppose we want to test whether the observed churn rate in the churn dataset differs from the bank’s expectation of 15 percent. The hypotheses for a two-tailed test are: \[ \begin{cases} H_0: \pi = 0.15 \\ H_a: \pi \neq 0.15 \end{cases} \]

We conduct a two-tailed proportion test in R:

prop_test <- prop.test(x = sum(churn$churn == "yes"),
                       n = nrow(churn),
                       p = 0.15)

prop_test
   
    1-sample proportions test with continuity correction
   
   data:  sum(churn$churn == "yes") out of nrow(churn), null probability 0.15
   X-squared = 8.9417, df = 1, p-value = 0.002787
   alternative hypothesis: true p is not equal to 0.15
   95 percent confidence interval:
    0.1535880 0.1679904
   sample estimates:
           p 
   0.1606596

Here, x is the number of churned customers, n is the total sample size, and p = 0.15 specifies the hypothesized population proportion. The test uses a chi-square approximation to evaluate whether the observed sample proportion differs significantly from this value.

The output provides three key results: the p-value, a confidence interval for the true proportion, and the estimated sample proportion. The p-value is 0.003. Because it is less than the significance level (\(\alpha = 0.05\)), we reject the null hypothesis. This indicates statistical evidence that the true churn rate differs from 15 percent.

The 95 percent confidence interval for the population proportion is (0.154, 0.168). Since this interval does not contain 0.15, the conclusion is consistent with the decision to reject \(H_0\). The observed sample proportion is 0.161, which serves as our point estimate of the population churn rate.
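Because prop.test() relies on a chi-square approximation, an exact binomial test is a useful cross-check. The counts below are implied by the output above: a sample proportion of 0.1607 over the full dataset corresponds to 1627 churners out of 10127 customers.

```r
# Exact binomial test of H0: pi = 0.15, using counts implied by the
# output above (1627 churners out of 10127 customers).
pv <- binom.test(x = 1627, n = 10127, p = 0.15)$p.value
pv
```

With a sample this large, the exact p-value is very close to the approximate one, so the two tests lead to the same decision.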

Practice: Test whether the proportion of churned customers exceeds 15 percent. Set up a right-tailed one-sample proportion test using the option alternative = "greater" in the prop.test() function.

This example shows how a test for a single proportion can be used to validate operational assumptions about customer behavior. The p-value indicates whether a difference is statistically significant, whereas the confidence interval and estimated proportion help assess practical relevance. When combined with domain knowledge, this method supports evidence-informed decisions about customer retention.

5.7 Two-sample t-test

Do customers who churn have lower credit limits than those who remain active? If so, can credit availability help explain churn behavior? The two-sample t-test provides a statistical method to address such questions by comparing the means of a numerical variable across two independent groups.

Also known as Student’s t-test, this method was developed by William Sealy Gosset in the early twentieth century while he was employed at the Guinness Brewery. Because his employer restricted publication under his own name, he published his work under the pseudonym “Student.” Gosset’s central challenge was practical: how to make reliable inferences from small samples when the population variance was unknown. The resulting \(t\)-distribution accounts for this additional uncertainty and remains one of the most widely used tools in statistical inference.

In Section 4.5, we examined the distribution of the total credit limit (credit_limit) for churners and non-churners using violin and histogram plots. These visualizations suggested that churners may have slightly lower credit limits. The next step is to assess whether this difference is statistically significant.

Both plots indicate that churners tend to have slightly lower credit limits than customers who stay. To test whether this difference is statistically significant, we apply the two-sample t-test. We start by formulating the hypotheses:

\[ \begin{cases} H_0: \mu_1 = \mu_2 \\ H_a: \mu_1 \neq \mu_2 \end{cases} \]

Here, \(\mu_1\) and \(\mu_2\) represent the mean credit limits for churners and non-churners, respectively. The null hypothesis states that the population means are equal. To perform the test, we use the t.test() function in R. The formula syntax credit_limit ~ churn instructs R to compare the credit limits across the two churn groups:

t_test_credit <- t.test(credit_limit ~ churn, data = churn)
t_test_credit
   
    Welch Two Sample t-test
   
   data:  credit_limit by churn
   t = -2.401, df = 2290.4, p-value = 0.01643
   alternative hypothesis: true difference in means between group yes and group no is not equal to 0
   95 percent confidence interval:
    -1073.4010  -108.2751
   sample estimates:
   mean in group yes  mean in group no 
            8136.039          8726.878

The output includes the test statistic, p-value, degrees of freedom, confidence interval, and estimated group means. The p-value is 0.0164, which is smaller than the standard significance level \(\alpha = 0.05\). We therefore reject \(H_0\) and conclude that the average credit limits differ between churners and non-churners.

The 95 percent confidence interval for the difference in means is (-1073.401, -108.275), and because zero is not contained in this interval, the result is consistent with rejecting the null hypothesis. The estimated group means are 8136.04 for churners and 8726.88 for non-churners, indicating that churners tend to have lower credit limits.

Practice: Test whether the average tenure (months_on_book) differs between churners and non-churners using t.test(months_on_book ~ churn, data = churn). Visualizations for this variable appear in Section 4.5.

The two-sample t-test assumes independent groups and approximately normal distributions within each group. In practice, the test is robust when sample sizes are large, due to the Central Limit Theorem. By default, R performs Welch’s t-test, which adjusts the degrees of freedom to accommodate unequal variances between groups and therefore does not require the assumption of equal population variances. If the data are strongly skewed or contain substantial outliers, a nonparametric alternative such as the Mann–Whitney U test may be appropriate.
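The choice among Welch's test, the pooled-variance test, and the rank-based alternative can be sketched on simulated data; the group sizes, means, and spreads below are illustrative, loosely echoing the credit-limit comparison rather than reproducing it.

```r
# Welch (default), pooled-variance, and rank-based two-sample tests
# on simulated data; all numbers here are illustrative.
set.seed(7)
g1 <- rnorm(160, mean = 8100, sd = 2500)  # e.g., churners
g2 <- rnorm(840, mean = 8700, sd = 3500)  # e.g., non-churners

t.test(g1, g2)$p.value                    # Welch's t-test (R's default)
t.test(g1, g2, var.equal = TRUE)$p.value  # classical pooled-variance form
wilcox.test(g1, g2)$p.value               # Mann-Whitney rank-sum test
```

When group variances differ, as here, Welch's adjustment is the safer default; the rank-based test trades some efficiency for robustness to outliers and skew.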

From a business perspective, lower credit limits among churners may indicate financial constraints, lower engagement, or risk management decisions by the bank. This finding can support targeted strategies, such as credit line adjustments or personalized outreach. As always, assessing practical relevance is essential: even if a difference is statistically significant, its magnitude must be evaluated in context.

The two-sample t-test is an effective way to evaluate patterns identified during exploratory analysis. It helps analysts move from visual impressions to statistical evidence, strengthening the foundation for downstream modeling.

When observations are measured on the same units under two conditions, however, the independence assumption no longer holds. In such cases, the paired \(t\)-test provides a more appropriate approach.

5.8 Paired t-test

In many practical situations, observations occur in natural pairs. For example, we may measure the same individuals before and after a treatment, evaluate employee performance before and after training, or compare outcomes for the same customers under two conditions. Because each pair refers to the same unit of analysis, the observations are not independent. The paired \(t\)-test is designed for this type of data.

Unlike the two-sample \(t\)-test, which compares the means of two independent groups, the paired \(t\)-test analyzes the differences within each pair. This structure arises when the same units are measured twice or when observations are naturally matched. Let \((x_{i,1}, x_{i,2})\) denote the two measurements for the \(i\)th unit. For each pair, we compute the difference \[ d_i = x_{i,2} - x_{i,1}. \]

The paired \(t\)-test then examines whether the mean difference \(\mu_d\) equals zero. The hypotheses are \[ \begin{cases} H_0: \mu_d = 0 \\ H_a: \mu_d \neq 0 \end{cases} \] for a two-sided test, although one-sided alternatives can also be used when a specific direction is of interest.

By focusing on differences within pairs, the paired \(t\)-test removes variability between individuals and isolates the effect of the change or intervention. As a result, paired designs often provide more precise inference than analyses that treat the observations as independent.

To illustrate the idea, suppose a bank launches a marketing campaign designed to encourage customers to increase their credit card usage. To evaluate the campaign, the bank records the total transaction amount for the same group of customers before and after the campaign. Because each observation corresponds to the same customer measured at two points in time, the data form natural pairs. It is therefore important to distinguish this situation from the two-sample \(t\)-test introduced earlier. The two-sample test compares the means of independent groups, such as the spending of different customers in two segments. In contrast, the paired \(t\)-test compares two measurements taken on the same units or on closely matched units. Applying a two-sample test in this setting would incorrectly treat the observations as independent and ignore the relationship between the paired measurements.

Let \(d = \text{after} - \text{before}\) denote the change in spending for each customer. The paired \(t\)-test evaluates whether the average change in spending differs from zero, indicating that the campaign may have affected customer behavior.

The following example simulates transaction data for 100 customers and then performs a paired \(t\)-test.

set.seed(42)

n <- 100

before <- rnorm(n, mean = 4000, sd = 800)
after  <- before + rnorm(n, mean = 300, sd = 500)

Figure 5.6 helps visualize the paired structure of the data. In the left panel, each line connects the spending values for a single customer before and after the campaign. If many lines trend upward from left to right, this suggests that spending increased following the campaign.

The right panel shows the distribution of the differences (\(\text{after} - \text{before}\)). The dashed vertical line represents zero difference. If most values lie to the right of this line, the data suggest that many customers increased their spending. The paired \(t\)-test formally evaluates whether the mean of these differences is significantly different from zero.

Figure 5.6: Customer transaction amounts before and after a marketing campaign (left) and distribution of differences (right).

We can perform the paired \(t\)-test in R using the t.test() function with the argument paired = TRUE:

t.test(after, before, paired = TRUE)
   
    Paired t-test
   
   data:  after and before
   t = 5.6683, df = 99, p-value = 1.425e-07
   alternative hypothesis: true mean difference is not equal to 0
   95 percent confidence interval:
    166.5543 345.9620
   sample estimates:
   mean difference 
          256.2581

The argument paired = TRUE instructs R to treat the observations as matched pairs. Internally, R computes the differences between the paired values and performs a one-sample \(t\)-test on those differences. The output reports the estimated mean difference, the \(t\) statistic, the degrees of freedom, the \(p\)-value, and a confidence interval for the mean difference. In this simulated example, the \(p\)-value is close to zero, providing strong evidence against \(H_0\). We therefore conclude that the marketing campaign significantly increased customer spending.
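This internal equivalence is easy to verify: a paired test on (after, before) and a one-sample test on the differences produce identical statistics. The simulation is reproduced here so the block stands alone.

```r
# A paired t-test equals a one-sample t-test on the within-pair
# differences; simulation reproduced from the example above.
set.seed(42)
n      <- 100
before <- rnorm(n, mean = 4000, sd = 800)
after  <- before + rnorm(n, mean = 300, sd = 500)

t_paired <- unname(t.test(after, before, paired = TRUE)$statistic)
t_diff   <- unname(t.test(after - before)$statistic)

all.equal(t_paired, t_diff)  # TRUE
```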

Paired \(t\)-tests are widely used when the same units are measured under two conditions. Common applications include before–after experiments, repeated measurements on the same subjects, and evaluations of interventions or policy changes.

Nonparametric alternatives are available when the assumptions of the \(t\)-test are questionable. In particular, if the data are highly skewed, contain strong outliers, or arise from small samples where normality cannot be reasonably assumed, rank-based methods provide a useful alternative. A commonly used option is the Wilcoxon test, implemented in R through the wilcox.test() function. This function can perform a one-sample test, a paired test, or a two-sample test (also known as the Wilcoxon rank-sum or Mann–Whitney test). Although these methods are less sensitive to deviations from normality, the \(t\)-test remains preferable when its assumptions are reasonably satisfied because it typically provides more efficient inference.
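Applied to the simulated campaign data, the rank-based counterpart looks like this (the simulation is reproduced so the block stands alone):

```r
# Wilcoxon signed-rank test as a nonparametric counterpart to the
# paired t-test; data simulated as in the example above.
set.seed(42)
n      <- 100
before <- rnorm(n, mean = 4000, sd = 800)
after  <- before + rnorm(n, mean = 300, sd = 500)

wilcox.test(after, before, paired = TRUE)  # signed-rank test
```

On these approximately normal data, the signed-rank test agrees with the paired \(t\)-test; the two would be expected to diverge mainly when the differences are skewed or contain outliers.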

Practice: Conduct a one-sided paired \(t\)-test to evaluate whether the marketing campaign increased customer spending (\(H_a: \mu_d > 0\)). Use alternative = "greater" in t.test(). Compare the resulting \(p\)-value with the two-sided test above and explain why the values differ.

5.9 Two-sample Z-test

Do male and female customers churn at different rates? If so, could gender-based differences in behavior or service interaction help explain customer attrition? When the outcome of interest is binary (such as churn versus no churn) and we want to compare proportions across two independent groups, the two-sample Z-test provides an appropriate statistical framework.

Whereas the two-sample t-test compares means of numerical variables, the Z-test evaluates whether the difference between two population proportions is statistically significant or could plausibly be attributed to sampling variability. This makes it especially useful when analyzing binary categorical outcomes.

In Chapter 4, Section 4.4, we examined churn patterns across demographic groups, including gender. Bar plots suggested that churn rates may differ between male and female customers. The two-sample Z-test allows us to formally evaluate whether these observed differences are statistically meaningful.

The first plot displays the number of churned and non-churned customers across genders, while the second shows proportional differences. These patterns suggest that churn may not be evenly distributed across male and female customers. To assess whether the difference is statistically significant, we set up the following hypotheses:

\[ \begin{cases} H_0: \pi_1 = \pi_2 \\ H_a: \pi_1 \neq \pi_2 \end{cases} \]

Here, \(\pi_1\) and \(\pi_2\) are the churn proportions among female and male customers, respectively. We construct a contingency table:

table_gender <- table(churn$churn, churn$gender,
                      dnn = c("Churn", "Gender"))
table_gender
        Gender
   Churn female male
     yes    930  697
     no    4428 4072

Next, we apply the prop.test() function to compare the two proportions:

z_test_gender <- prop.test(table_gender)
z_test_gender
   
    2-sample test for equality of proportions with continuity correction
   
   data:  table_gender
   X-squared = 13.866, df = 1, p-value = 0.0001964
   alternative hypothesis: two.sided
   95 percent confidence interval:
    0.02401099 0.07731502
   sample estimates:
      prop 1    prop 2 
   0.5716042 0.5209412

The output includes the p-value, a confidence interval for the difference in proportions, and the estimated proportions in each group. The p-value is approximately 0.0002, which is less than the significance level \(\alpha = 0.05\). We therefore reject \(H_0\) and conclude that the churn rate differs between male and female customers.

The 95 percent confidence interval of (0.024, 0.077) does not contain zero, which supports the conclusion that the proportions differ. Note how the table is oriented: because churn status forms the rows, prop 1 and prop 2 in the output are the proportions of female customers among churners (0.572) and among non-churners (0.521), not the churn rates themselves. For a 2 × 2 table, the chi-square statistic is the same in either orientation, so the test of association is unaffected. The churn rates by gender are 930/5358 ≈ 0.174 for female customers and 697/4769 ≈ 0.146 for male customers, indicating that female customers churn at a somewhat higher rate.
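The uncorrected test statistic can be reproduced by hand from the churn rates by gender (counts taken from the contingency table above): squaring the pooled z statistic recovers the chi-square statistic reported by prop.test() when the continuity correction is turned off.

```r
# Manual pooled two-proportion z statistic (no continuity
# correction), using the gender counts from the table above.
x <- c(female = 930, male = 697)    # churners per gender
n <- c(female = 5358, male = 4769)  # customers per gender

p_hat  <- x / n               # churn rate in each group
p_pool <- sum(x) / sum(n)     # pooled churn rate under H0
z <- unname((p_hat[1] - p_hat[2]) /
  sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2])))

c(z_squared = z^2,
  X_squared = unname(prop.test(x, n, correct = FALSE)$statistic))
```

This identity is why the procedure is called a Z-test even though R reports a chi-square statistic: for two groups, the chi-square statistic on one degree of freedom is exactly the squared z statistic.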

From a business perspective, differences in churn rates across demographic groups may reflect differences in service expectations, product usage patterns, or engagement levels. However, as always, statistical significance does not guarantee practical relevance. Even if one gender group shows a higher churn rate, the size of the difference should be interpreted in context before informing retention strategies.

Practice: Test whether the proportion of churned customers is higher among female customers than among male customers. Follow the same steps as in this section and set up a right-tailed two-sample Z-test by specifying alternative = "greater" in the prop.test() function.

The two-sample Z-test complements visual exploration and provides a rigorous method for comparing proportions. By integrating statistical inference with domain knowledge, organizations can make informed decisions about customer segmentation and retention strategies.

5.10 Chi-square Test

Does customer churn vary across marital groups? And if so, does marital status reveal behavioral differences that could help inform retention strategies? These are typical questions when analyzing relationships between two categorical variables. The Chi-square test provides a statistical method for evaluating whether such variables are associated or whether any observed differences are likely due to chance.

While earlier tests compared means or proportions between two groups, the Chi-square test examines whether the distribution of outcomes across several categories deviates from what would be expected if the variables were independent. It is particularly useful for demographic segmentation and behavioral analysis when one or both variables have more than two levels.

To illustrate the method, we revisit the churn dataset. In Chapter 4, Section 4.4, we explored churn rates across the marital categories “single”, “married”, and “divorced”. As in that chapter, we use the cleaned version of the dataset, where “unknown” marital values were removed during the data preparation step. Visualizations suggested possible differences across groups, but a formal statistical test is required to determine whether these differences are statistically meaningful.

We begin by visualizing churn across marital groups, as in Section 4.4. The left plot presents raw churn counts; the right plot shows churn proportions within each marital category. While these visuals indicate potential differences, we use the Chi-square test to formally assess whether marital status and churn are associated.

We first construct a contingency table:

table_marital <- table(churn$churn, churn$marital,
                       dnn = c("Churn", "Marital"))
table_marital
        Marital
   Churn married single divorced
     yes     767    727      133
     no     4277   3548      675

This table serves as the input to the chisq.test() function, which assesses whether two categorical variables are independent. The hypotheses are: \[ \begin{cases} H_0: \pi_{\text{divorced, yes}} = \pi_{\text{married, yes}} = \pi_{\text{single, yes}} \\ H_a: \text{Churn proportions differ across at least one marital group.} \end{cases} \]

We conduct the test as follows:

chisq_marital <- chisq.test(table_marital)
chisq_marital
   
    Pearson's Chi-squared test
   
   data:  table_marital
   X-squared = 5.6588, df = 2, p-value = 0.05905

The output reports the Chi-square statistic, the degrees of freedom, and the p-value. The p-value is 0.059, which is slightly greater than the significance level \(\alpha = 0.05\). From a classical inferential perspective, we therefore do not reject \(H_0\) and conclude that the sample does not provide sufficient statistical evidence that churn behavior differs across marital groups.

However, this conclusion should not be interpreted as evidence that marital status is irrelevant for predictive modeling. The Chi-square test evaluates the relationship between marital status and churn in isolation. In contrast, machine learning algorithms assess features jointly, accounting for interactions and correlations among multiple predictors. A variable that shows little association on its own may still improve predictive performance when considered alongside other features.

Moreover, many modeling approaches, such as logistic regression with regularization, tree-based methods, or ensemble models, incorporate feature selection as part of the training process. These algorithms determine the contribution of each predictor in the context of the full model. For this reason, it is generally premature to eliminate a variable solely on the basis of a single univariate hypothesis test.

In predictive modeling, the ultimate criterion is not statistical significance in isolation, but improvement in out-of-sample predictive performance. We return to this distinction in later chapters when constructing and evaluating classification models.

To check whether the test assumptions are satisfied, we inspect the expected frequencies:

chisq_marital$expected
        Marital
   Churn   married    single divorced
     yes  810.3671  686.8199  129.813
     no  4233.6329 3588.1801  678.187

A general rule is that all expected cell counts should be at least 5. When expected frequencies are very small, the Chi-square approximation becomes unreliable, and Fisher’s exact test may be a better option. In the churn dataset, the expected counts are sufficiently large for the Chi-square test to be appropriate.
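The expected counts themselves follow the familiar formula \(E_{ij} = (\text{row total} \times \text{column total}) / n\). A sketch using the counts copied from the table above:

```r
# Expected counts under independence computed by hand and checked
# against chisq.test(); counts copied from the table above.
table_marital <- rbind(yes = c(married = 767, single = 727, divorced = 133),
                       no  = c(married = 4277, single = 3548, divorced = 675))

expected <- outer(rowSums(table_marital), colSums(table_marital)) /
  sum(table_marital)
round(expected, 3)

all.equal(expected, chisq.test(table_marital)$expected,
          check.attributes = FALSE)  # TRUE
```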

Even when the test does not detect a statistically significant association, examining which categories deviate most from their expected counts can provide useful descriptive insight. Identifying whether certain marital groups churn slightly more or less than expected may point toward behavioral patterns worth exploring in further modeling or segmentation analysis.
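Pearson residuals, \((O - E)/\sqrt{E}\), make these deviations explicit: positive values mark cells with more observations than independence predicts.

```r
# Pearson residuals highlight which cells deviate most from the
# counts expected under independence (table copied from above).
table_marital <- rbind(yes = c(married = 767, single = 727, divorced = 133),
                       no  = c(married = 4277, single = 3548, divorced = 675))

round(chisq.test(table_marital)$residuals, 2)
# e.g., the positive residual for (yes, single) means single customers
# churn slightly more often than independence would predict, while the
# negative residual for (yes, married) points the other way.
```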

Practice: Test whether education level is associated with churn in the churn dataset. Follow the same steps as above. For more information on the education variable, see Section 4.4 in Chapter 4.

The Chi-square test therefore complements exploratory visualization by providing a formal statistical framework for analyzing associations between categorical variables. Combined with domain expertise, it supports data-informed decisions about customer segmentation and engagement strategies.

5.11 Analysis of Variance (ANOVA) Test

So far, we have examined hypothesis tests that compare two groups, such as the two-sample t-test and the Z-test. But what if we want to compare more than two groups? For example, does the average price of diamonds vary across different quality ratings? When dealing with a categorical variable that has multiple levels, the Analysis of Variance (ANOVA) provides a principled way to test whether at least one group mean differs significantly from the others.

ANOVA is especially useful for evaluating how a categorical factor with more than two levels affects a numerical outcome. It assesses whether the variability between group means is greater than what would be expected due to random sampling alone. The test statistic follows an F-distribution, which compares the variability between groups with the variability within groups.

To illustrate, consider the diamonds dataset from the ggplot2 package. We analyze whether the mean price (price) differs by cut quality (cut), which has five levels: “Fair,” “Good,” “Very Good,” “Premium,” and “Ideal.”

library(ggplot2)  # provides the diamonds dataset
data(diamonds)

ggplot(data = diamonds) + 
  geom_boxplot(aes(x = cut, y = price, fill = cut)) +
  scale_fill_manual(values = c("#F4A582", "#FDBF6F", "#FFFFBF", "#A6D5BA", "#1B9E77"))

The boxplot shows clear differences in the distribution and median prices across cut categories. Visual inspection, however, cannot determine whether these observed differences are statistically significant. ANOVA provides the formal test needed to make this determination.

We evaluate whether cut quality affects diamond price by comparing the mean price across all five categories. Our hypotheses are: \[ \begin{cases} H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 \quad \text{(All group means are equal);} \\ H_a: \text{At least one group mean differs.} \end{cases} \]

We apply the aov() function in R, which fits a linear model and produces an ANOVA table summarizing the variation between and within groups:

anova_test <- aov(price ~ cut, data = diamonds)
summary(anova_test)
                  Df    Sum Sq   Mean Sq F value Pr(>F)    
   cut             4 1.104e+10 2.760e+09   175.7 <2e-16 ***
   Residuals   53935 8.474e+11 1.571e+07                   
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output reports the degrees of freedom (Df), the F-statistic (F value), and the corresponding p-value (Pr(>F)). Because the p-value is below the significance level (\(\alpha = 0.05\)), we reject the null hypothesis and conclude that cut quality has a statistically significant effect on diamond price. Rejecting \(H_0\) indicates that at least one group mean differs, but it does not tell us which cuts differ from each other. For this, we use post-hoc tests such as Tukey’s Honest Significant Difference (HSD) test, which controls for multiple comparisons while identifying significantly different pairs of groups.
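
As a sketch of this post-hoc step, Tukey's HSD can be applied directly to the fitted aov object; it reports, for each of the ten pairs of cut levels, the difference in mean price, a confidence interval, and an adjusted p-value:

```r
library(ggplot2)   # provides the diamonds dataset

anova_test <- aov(price ~ cut, data = diamonds)
TukeyHSD(anova_test, conf.level = 0.95)   # all pairwise comparisons of cut levels
```

Pairs whose adjusted p-value falls below 0.05 (equivalently, whose interval excludes zero) differ significantly in mean price.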

As with any statistical method, ANOVA has assumptions: independent observations, roughly normal distributions within groups, and approximately equal variances across groups. With large sample sizes—such as those in the diamonds dataset—the test is reasonably robust to moderate deviations from these conditions.
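
Two quick checks of these conditions are a normal Q-Q plot of the model residuals and Bartlett's test of equal variances. The following is a sketch; note that Bartlett's test is itself sensitive to non-normality, so its p-value should be read as a rough guide:

```r
library(ggplot2)   # provides the diamonds dataset

anova_fit <- aov(price ~ cut, data = diamonds)

# Normality of residuals: points should lie close to the reference line
qqnorm(residuals(anova_fit))
qqline(residuals(anova_fit))

# Equality of variances across cut groups
bartlett.test(price ~ cut, data = diamonds)
```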

From a business perspective, understanding differences in price across cut levels supports pricing, inventory, and marketing decisions. For example, if higher-quality cuts consistently command higher prices, retailers may emphasize them in promotions. Conversely, if mid-tier cuts show similar prices, pricing strategies may be reconsidered to align with customer perceptions of value.

Practice: Use ANOVA to test whether the average carat (carat) differs across clarity levels (clarity) in the diamonds dataset. Fit the model using aov(carat ~ clarity, data = diamonds) and examine the ANOVA output. For a visual comparison, create a boxplot similar to the one used for cut quality.

5.12 Correlation Test

Suppose you are analysing sales data and notice that as advertising spend increases, product sales tend to rise as well. Is this trend real, or merely coincidental? In exploratory analysis (see Section 4.7), we used scatter plots and correlation matrices to visually assess such relationships. The next step is to evaluate whether the observed association is statistically meaningful. The correlation test provides a formal method for determining whether a linear relationship between two numerical variables is stronger than what we would expect by random chance.

The correlation test evaluates both the strength and direction of a linear relationship by testing the null hypothesis that the population correlation coefficient (\(\rho\)) is equal to zero. This test is particularly useful when examining how continuous variables co-vary—insights that can guide pricing strategies, forecasting models, and feature selection in predictive analytics.
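
Concretely, the test statistic is built from the sample correlation \(r\) and the sample size \(n\): \[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \] which under \(H_0: \rho = 0\) follows a t-distribution with \(n-2\) degrees of freedom. The farther \(r\) is from zero, and the larger the sample, the larger \(|t|\) becomes.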

To illustrate, we test the relationship between carat (diamond weight) and price in the diamonds dataset from the ggplot2 package. A positive relationship is expected: larger diamonds typically command higher prices. We begin with a scatter plot to visually explore the trend:

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3, size = 0.6) +
  labs(x = "Diamond Weight (Carats)", y = "Price (dollars)")

The plot clearly shows an upward trend, suggesting a positive association. However, visual inspection does not provide formal evidence. To test the linear relationship, we set up the following hypotheses: \[ \begin{cases} H_0: \rho = 0 \quad \text{(No linear correlation)} \\ H_a: \rho \neq 0 \quad \text{(A significant linear correlation exists)} \end{cases} \]

We conduct the test using the cor.test() function, which performs a Pearson correlation test and reports the correlation coefficient, p-value, and a confidence interval for \(\rho\):

cor_test <- cor.test(diamonds$carat, diamonds$price)
cor_test
   
    Pearson's product-moment correlation
   
   data:  diamonds$carat and diamonds$price
   t = 551.41, df = 53938, p-value < 2.2e-16
   alternative hypothesis: true correlation is not equal to 0
   95 percent confidence interval:
    0.9203098 0.9228530
   sample estimates:
         cor 
   0.9215913

The output highlights three important results. First, the p-value is very close to zero, which is well below the significance level \(\alpha = 0.05\). We therefore reject \(H_0\) and conclude that a significant linear relationship exists between carat and price. Second, the correlation coefficient is 0.92, indicating a strong positive association. Finally, the 95 percent confidence interval for the true correlation is (0.920, 0.923), which does not include zero and thus reinforces the conclusion of a statistically meaningful relationship.
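
As a quick sanity check, the reported t statistic can be reproduced by hand from the sample correlation and the degrees of freedom, using the relation \(t = r\sqrt{n-2}/\sqrt{1-r^2}\):

```r
# Reconstruct the t statistic reported by cor.test() from its own output
r  <- 0.9215913   # sample correlation
df <- 53938       # degrees of freedom, n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
round(t_stat, 2)  # matches the reported value of 551.41
```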

From a business perspective, this finding supports the intuitive notion that carat weight is one of the primary determinants of diamond pricing. However, correlation does not imply causation: even a strong correlation may overlook other important attributes, such as cut quality or clarity, that also influence price. These relationships can be examined more fully using multivariate regression models.

The correlation test provides a rigorous framework for evaluating linear relationships between numerical variables. When combined with visual summaries and domain knowledge, it helps identify meaningful patterns and informs decisions about pricing, product quality, and model design.

Practice: Using the churn dataset, test whether credit_limit and transaction_amount_12 are linearly correlated. Create a scatter plot, compute the correlation using cor.test(), and interpret the strength and significance of the relationship.

5.13 From Inference to Prediction in Data Science

You may have identified a statistically significant association between churn and service calls. But will this insight help predict which specific customers are likely to churn next month? This question captures an important transition in the data science workflow: moving from explaining relationships to predicting outcomes.

While the principles introduced in this chapter—estimation, confidence intervals, and hypothesis testing—provide the foundations for rigorous reasoning under uncertainty, their role changes as we shift from classical statistical inference to predictive modeling. In traditional statistics, the emphasis is on population-level conclusions drawn from sample data. In data science, the central objective is predictive performance and the ability to generalize reliably to new, unseen observations.

This distinction has several practical implications. In large datasets, even very small differences can be statistically significant, but not necessarily useful. For example, finding that churners make 0.1 fewer calls on average may yield a significant p-value, yet contribute almost nothing to predictive accuracy. In modeling, the goal is not to determine whether each variable is significant in isolation, but whether it improves the model’s ability to forecast or classify effectively.
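
This can be illustrated with a small simulation using made-up data (not the churn dataset): two groups whose true means differ by a negligible 0.02 calls still yield an extremely small p-value once the samples are large enough.

```r
set.seed(1)
n <- 1e6   # one million observations per group

# Simulated service calls; the true means differ by only 0.02
calls_churners    <- rnorm(n, mean = 1.52, sd = 1)
calls_nonchurners <- rnorm(n, mean = 1.50, sd = 1)

t.test(calls_churners, calls_nonchurners)$p.value  # far below 0.05
```

The test flags the difference as highly significant, yet an effect of 0.02 calls is unlikely to matter for prediction or for business decisions.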

Traditional inference often begins with a clearly defined hypothesis, such as testing whether a marketing intervention increases conversion rates. In contrast, predictive modeling typically begins with exploration: analysts examine many features, apply transformations, compare algorithms, and refine models based on validation metrics. The focus shifts from confirming specific hypotheses to discovering patterns that support robust generalization.

Despite this shift, inference remains highly relevant throughout the modeling pipeline. During data preparation, hypothesis tests can verify that training and test sets are comparable, reducing the risk of biased evaluation (see Chapter 6). When selecting features, inference-based reasoning helps identify variables that show meaningful relationships with the outcome. Later, in model diagnostics, statistical concepts such as residual analysis, variance decomposition, and measures of uncertainty are essential for detecting overfitting, assessing assumptions, and interpreting model behavior. These ideas return again in Chapter 10, where hypothesis testing is used to assess regression coefficients and evaluate competing models.

Recognizing how the role of inference evolves in predictive contexts allows us to use these tools more effectively. The goal is not to replace inference with prediction, but to integrate both perspectives. As we move to the next chapter, we begin constructing predictive models. The principles developed throughout this chapter—careful reasoning about variability, uncertainty, and structure—remain central to building models that are not only accurate but also interpretable and grounded in evidence.

5.14 Chapter Summary and Takeaways

This chapter equipped you with the essential tools of statistical inference. You learned how to use point estimates and confidence intervals to quantify uncertainty and how to apply hypothesis testing to evaluate evidence for or against specific claims about populations.

We applied a range of hypothesis tests using real-world examples: t-tests for comparing group means, proportion tests for binary outcomes, ANOVA for examining differences across multiple groups, the Chi-square test for assessing associations between categorical variables, and correlation tests for measuring linear relationships between numerical variables.

Together, these methods form a framework for drawing rigorous, data-driven conclusions. In the context of data science, they support not only analysis but also model diagnostics, the evaluation of data partitions, and the interpretability of predictive models. While p-values help assess statistical significance, they should always be interpreted alongside effect size, underlying assumptions, and domain relevance to ensure that findings are both meaningful and actionable.

Statistical inference continues to play an important role in later chapters. It helps validate training and test splits (Chapter 6) and reappears in regression modeling (Chapter 10), where hypothesis tests are used to assess model coefficients and compare competing models. For readers who want to explore statistical inference more deeply, a helpful introduction is Intuitive Introductory Statistics by Wolfe and Schneider (Wolfe and Schneider 2017).

In the next chapter, we transition from inference to modeling, beginning with one of the most critical steps in any supervised learning task: dividing data into training and test sets. This step ensures that model evaluation is fair, transparent, and reliable, setting the stage for building predictive systems that generalize to new data.

5.15 Exercises

This set of exercises is designed to help you consolidate and apply what you have learned about statistical inference. They are organized into three parts: conceptual questions to deepen your theoretical grasp, hands-on tasks to practice applying inference methods in R, and reflection prompts to encourage thoughtful integration of statistical thinking into your broader data science workflow.

Conceptual Questions

  1. Why is hypothesis testing important in data science? Explain its role in making data-driven decisions and how it complements exploratory data analysis.

  2. What is the difference between a confidence interval and a hypothesis test? How do they provide different ways of drawing conclusions about population parameters?

  3. A p-value is the probability of obtaining results at least as extreme as those observed in the sample, assuming the null hypothesis is true. How should p-values be interpreted, and why is a p-value of 0.001 in a two-sample t-test not necessarily evidence of practical significance?

  4. Explain the concepts of Type I and Type II errors in hypothesis testing. Why is it important to balance the risks of these errors when designing statistical tests?

  5. In a hypothesis test, failing to reject the null hypothesis does not imply that the null hypothesis is true. Explain why this is the case and discuss the implications of this result in practice.

  6. When working with small sample sizes, why is the t-distribution used instead of the normal distribution? How does the shape of the t-distribution change as the sample size increases?

  7. One-tailed and two-tailed hypothesis tests serve different purposes. When would a one-tailed test be more appropriate than a two-tailed test? Provide an example where each type of test would be applicable.

  8. Both the two-sample Z-test and the Chi-square test analyze categorical data but serve different purposes. How do they differ, and when would one be preferred over the other?

  9. The Analysis of Variance (ANOVA) test is designed to compare means across multiple groups. Why is running many pairwise t-tests instead problematic, and what is the advantage of using ANOVA in this context?

Hands-On Practice: Hypothesis Testing in R

For the following exercises, use the churn_mlc, bank, marketing, and diamonds datasets available in the liver and ggplot2 packages. We have previously used the churn_mlc, bank, and diamonds datasets in this and earlier chapters. In Chapter 10, we will introduce the marketing dataset for regression analysis.

To load the datasets, use the following commands:

library(liver)
library(ggplot2)   

# To import the datasets
data(churn_mlc)  
data(bank)  
data(marketing, package = "liver")  
data(diamonds)  
  1. We are interested in knowing the 90% confidence interval for the population mean of the variable “night_calls” in the churn_mlc dataset. In R, we can obtain a confidence interval for the population mean using the t.test() function as follows:
t.test(x = churn_mlc$night_calls, conf.level = 0.90)$"conf.int"
   [1]  99.45484 100.38356
   attr(,"conf.level")
   [1] 0.9

Interpret the confidence interval in the context of customer service calls made at night. Report the 99% confidence interval for the population mean of “night_calls” and compare it with the 90% confidence interval. Which interval is wider, and what does this indicate about the precision of the estimates? Why does increasing the confidence level result in a wider interval, and how does this impact decision-making in a business context?

  2. Subgroup analyses help identify behavioral patterns in specific customer segments. In the churn_mlc dataset, we focus on customers with both an International Plan and a Voice Mail Plan who make more than 220 daytime minutes of calls. To create this subset, we use:
sub_churn <- subset(churn_mlc, (intl_plan == "yes") & (voice_plan == "yes") & (day_mins > 220)) 

Next, we compute the 95% confidence interval for the proportion of churners in this subset using prop.test():

prop.test(table(sub_churn$churn), conf.level = 0.95)$"conf.int"
   [1] 0.2595701 0.5911490
   attr(,"conf.level")
   [1] 0.95

Compare this confidence interval with the overall churn rate in the dataset (see Section 5.3). What insights can be drawn about this customer segment, and how might they inform retention strategies?

  3. In the churn_mlc dataset, we test whether the mean number of customer service calls (customer_calls) is greater than 1.5 at a significance level of 0.01. The right-tailed test is formulated as:

\[ \begin{cases} H_0: \mu \leq 1.5 \\ H_a: \mu > 1.5 \end{cases} \]

Since the level of significance is \(\alpha = 0.01\), the confidence level is \(1-\alpha = 0.99\). We perform the test using:

t.test(x = churn_mlc$customer_calls, 
        mu = 1.5, 
        alternative = "greater", 
        conf.level = 0.99)
   
    One Sample t-test
   
   data:  churn_mlc$customer_calls
   t = 3.8106, df = 4999, p-value = 7.015e-05
   alternative hypothesis: true mean is greater than 1.5
   99 percent confidence interval:
    1.527407      Inf
   sample estimates:
   mean of x 
      1.5704

Report the p-value and determine whether to reject the null hypothesis at \(\alpha=0.01\). Explain your decision and discuss its implications in the context of customer service interactions.

  4. In the churn_mlc dataset, we test whether the proportion of churners (\(\pi\)) is less than 0.14 at a significance level of \(\alpha=0.01\). The confidence level is \(99\%\), corresponding to \(1-\alpha = 0.99\). The test is conducted in R using:
prop.test(table(churn_mlc$churn), 
           p = 0.14, 
           alternative = "less", 
           conf.level = 0.99)
   
    1-sample proportions test with continuity correction
   
   data:  table(churn_mlc$churn), null probability 0.14
   X-squared = 0.070183, df = 1, p-value = 0.6045
   alternative hypothesis: true p is less than 0.14
   99 percent confidence interval:
    0.0000000 0.1533547
   sample estimates:
        p 
   0.1414

State the null and alternative hypotheses. Report the p-value and determine whether to reject the null hypothesis at \(\alpha=0.01\). Explain your conclusion and its potential impact on customer retention strategies.

  5. In the churn_mlc dataset, we examine whether the number of customer service calls (customer_calls) differs between churners and non-churners. To test this, we perform a two-sample t-test:
t.test(customer_calls ~ churn, data = churn_mlc)
   
    Welch Two Sample t-test
   
   data:  customer_calls by churn
   t = 11.292, df = 804.21, p-value < 2.2e-16
   alternative hypothesis: true difference in means between group yes and group no is not equal to 0
   95 percent confidence interval:
    0.6583525 0.9353976
   sample estimates:
   mean in group yes  mean in group no 
            2.254597          1.457722

State the null and alternative hypotheses. Determine whether to reject the null hypothesis at a significance level of \(\alpha=0.05\). Report the p-value and interpret the results, explaining whether there is evidence of a relationship between churn status and customer service call frequency.

  6. In the marketing dataset, we test whether there is a positive relationship between revenue and spend at a significance level of \(\alpha = 0.025\). We perform a one-tailed correlation test using:
cor.test(x = marketing$spend, 
         y = marketing$revenue, 
         alternative = "greater", 
         conf.level = 0.975)
   
    Pearson's product-moment correlation
   
   data:  marketing$spend and marketing$revenue
   t = 7.9284, df = 38, p-value = 7.075e-10
   alternative hypothesis: true correlation is greater than 0
   97.5 percent confidence interval:
    0.6338152 1.0000000
   sample estimates:
        cor 
   0.789455

State the null and alternative hypotheses. Report the p-value and determine whether to reject the null hypothesis. Explain your decision and discuss its implications for understanding the relationship between marketing spend and revenue.

  7. In the churn_mlc dataset, for the variable “day_mins”, test whether the mean number of “Day Minutes” is greater than 180. Set the level of significance to 0.05.

  8. In the churn_mlc dataset, for the variable “intl_plan”, test at \(\alpha=0.05\) whether the proportion of customers who have an international plan is less than 0.15.

  9. In the churn_mlc dataset, test whether there is a relationship between the target variable “churn” and the variable “intl_charge” with \(\alpha=0.05\).

  10. In the bank dataset, test whether there is a relationship between the target variable “deposit” and the variable “education” with \(\alpha=0.05\).

  11. Compute the proportion of customers in the churn_mlc dataset who have an International Plan (intl_plan). Construct a 95% confidence interval for this proportion using R, and interpret the confidence interval in the context of customer subscriptions.

  12. Using the churn_mlc dataset, test whether the average number of daytime minutes (day_mins) for churners differs significantly from 200 minutes. Conduct a one-sample t-test in R and interpret the results in relation to customer behavior.

  13. Compare the average number of international calls (intl_calls) between churners and non-churners. Perform a two-sample t-test and evaluate whether the observed differences in means are statistically significant.

  14. Test whether the proportion of customers with a Voice Mail Plan (voice_plan) differs between churners and non-churners. Use a two-sample Z-test in R and interpret the results, considering the implications for customer retention strategies.

  15. Investigate whether marital status (marital) is associated with deposit subscription (deposit) in the bank dataset. Construct a contingency table and perform a Chi-square test to assess whether marital status has a significant impact on deposit purchasing behavior.

  16. Using the diamonds dataset, test whether the mean price of diamonds differs across different diamond cuts (cut). Conduct an ANOVA test and interpret the results. If the test finds significant differences, discuss how post-hoc tests could be used to further explore the findings.

  17. Assess the correlation between carat and price in the diamonds dataset. Perform a correlation test in R and visualize the relationship using a scatter plot. Interpret the results in the context of diamond pricing.

  18. Construct a 95% confidence interval for the mean number of customer service calls (customer_calls) among churners. Explain how the confidence interval helps quantify uncertainty and how it might inform business decisions regarding customer support.

  19. Take a random sample of 100 observations from the churn_mlc dataset and test whether the average eve_mins differs from 200. Repeat the test using a sample of 1000 observations. Compare the results and discuss how sample size affects hypothesis testing and statistical power.

  20. Suppose a hypothesis test indicates that customers with a Voice Mail Plan are significantly less likely to churn (p \(<\) 0.01). What are some potential business strategies a company could implement based on this finding? Beyond statistical significance, what additional factors should be considered before making marketing decisions?

Reflection

  1. How do confidence intervals and hypothesis tests complement each other when assessing the reliability of results in data science?

  2. In your work or studies, can you think of a situation where failing to reject the null hypothesis was an important finding? What did it help clarify?

  3. Describe a time when statistical significance and practical significance diverged in a real-world example. What lesson did you learn?

  4. How might understanding Type I and Type II errors influence how you interpret results from automated reports, dashboards, or A/B tests?

  5. When designing a data analysis for your own project, how would you decide which statistical test to use? What questions would guide your choice?

  6. How can confidence intervals help communicate uncertainty to non-technical stakeholders? Can you think of a better way to present this information visually?

  7. Which statistical test from this chapter do you feel most comfortable with, and which would you like to practice more? Why?