4  Exploratory Data Analysis

The greatest value of a picture is when it forces us to notice what we never expected.

— John Tukey

Exploratory Data Analysis (EDA) is a foundational stage of data analysis in which analysts actively interrogate data to understand its structure, quality, and underlying patterns. Rather than serving merely as a preliminary step, EDA directly informs analytical decisions by revealing unexpected behavior, identifying anomalies, and suggesting promising directions for further investigation. In the Data Science Workflow (see Figure 2.3), EDA forms the conceptual bridge between Data Preparation (Chapter 3) and Data Setup for Modeling (Chapter 6), ensuring that modeling choices are grounded in empirical evidence rather than assumptions.

Unlike formal hypothesis testing, EDA is inherently flexible and iterative. It encourages curiosity, experimentation, and repeated refinement of questions as insights emerge. Some exploratory paths will highlight meaningful structure in the data, while others may expose data quality issues or confirm that certain variables carry little information. Through this process, analysts develop intuition about the data, assess which features are likely to be informative, and refine the scope of subsequent analysis. The goal of EDA is not to validate theories but to generate insight. Summary statistics, exploratory visualizations, and correlation measures offer an initial map of the data landscape, although apparent patterns should be interpreted cautiously and not mistaken for causal relationships. Formal tools for statistical inference, introduced in Chapter 5, build on this exploratory foundation.

EDA also plays a central role in diagnosing and improving data quality. Missing values, extreme observations, inconsistent formats, and redundant features often become apparent only through systematic exploration. Identifying such issues early helps prevent misleading results and supports the development of reliable and interpretable models. The choice of exploratory techniques depends on both the nature of the data and the analytical questions of interest. Histograms and box plots provide insight into distributions, while scatter plots and correlation matrices help uncover relationships and potential dependencies among variables. Together, these tools allow analysts to move forward with a clearer understanding of what the data can, and cannot, support.

What This Chapter Covers

This chapter provides a structured introduction to exploratory data analysis, focusing on how summary statistics and visual techniques can be used to understand feature distributions, identify anomalies, and explore relationships within data. You will learn how correlation analysis helps detect redundancy among predictors and how multivariate exploration reveals patterns that support informed modeling decisions.

The chapter begins with EDA as Data Storytelling, which highlights the importance of communicating exploratory findings clearly and in context. This is followed by Key Objectives and Guiding Questions for EDA, where the main goals of exploration are translated into practical questions that guide a systematic analytical process.

These concepts are then applied in a detailed case study using the churn dataset from the liver package. This example demonstrates how exploratory techniques uncover meaningful customer patterns, how visualizations support interpretation, and how EDA prepares data for subsequent classification tasks, including k-nearest neighbours modeling in Chapter 7.

The chapter concludes with a comprehensive set of exercises and hands-on projects based on two additional real-world datasets (bank and churn_mlc, also from the liver package). These activities reinforce exploratory skills and establish continuity with later chapters, including the neural network case study presented in Chapter 12.

4.1 EDA as Data Storytelling

Exploratory data analysis is not only a technical process for uncovering patterns, but also a way of making sense of data through structured interpretation. While EDA reveals structure, anomalies, and relationships, these findings gain analytical value only when they are considered in context and connected to meaningful questions. In this sense, data storytelling is an integral part of exploration: it transforms raw observations into insight by linking evidence, interpretation, and purpose.

Effective storytelling in data science weaves together analytical results, domain knowledge, and visual clarity. Rather than presenting statistics or plots in isolation, strong exploratory analysis connects each finding to a broader narrative about the data-generating process. Whether the audience consists of analysts, business stakeholders, or policymakers, the aim is to communicate what matters, why it matters, and how it informs subsequent decisions.

visualization plays a central role in this process. Summary statistics provide a compact overview of central tendency and variability, but visual displays make patterns and irregularities more apparent. Scatter plots and correlation matrices help reveal relationships among numerical features, while histograms, box plots, and categorical visualizations clarify distributions, skewness, and group differences. Selecting appropriate visual tools strengthens both analytical reasoning and interpretability.

Storytelling through data is widely used across domains, including business analytics, journalism, public policy, and scientific research. A well-known example is Hans Rosling’s TED Talk New insights on poverty, in which decades of demographic and economic data are presented in a clear and engaging manner. Figure 4.1 visualises changes in GDP per capita and life expectancy across world regions from 1950 to 2019. The figure is generated using the gapminder dataset from the liver package and visualised with the ggplot2 package. Although drawn from global development data, the same principles of exploratory analysis apply when examining customer behavior, financial trends, or service outcomes.

Figure 4.1 reveals several broad patterns that emerge through exploratory visualization. Across all regions, both GDP per capita and life expectancy increase substantially between 1950 and 2019, indicating a strong association between economic development and population health. This trend is particularly pronounced for Western countries, which display consistently higher levels of both variables and a more pronounced upward shift over time. Other regions show more gradual improvement and greater dispersion, reflecting heterogeneous development trajectories. While these patterns are descriptive rather than causal, they illustrate how exploratory visualization helps surface broad global trends.

Figure 4.1: Changes in GDP per capita and life expectancy by region from 1950 (left) to 2019 (right). Dot size is proportional to population.

As you conduct EDA, it is therefore useful to ask not only what the data shows, but also why those patterns are relevant. Which findings warrant further investigation? How might they inform modeling choices, challenge assumptions, or guide decision-making? Framing exploration in narrative terms helps ensure that EDA remains purposeful rather than purely descriptive, grounded in the real-world questions that motivate the analysis.

The next section builds on these ideas by introducing the key objectives and guiding questions that structure effective exploratory analysis. Together, they provide a flexible yet systematic foundation for the detailed EDA of customer churn that follows.

4.2 Objectives and Guiding Questions for EDA

A useful starting point is to clarify what exploratory analysis is designed to accomplish. At its core, EDA seeks to understand the structure of the data, including feature types, value ranges, missing entries, and possible anomalies. It examines how individual features are distributed, identifying central tendencies, variation, and skewness. It investigates how features relate to one another, revealing associations, dependencies, or interactions that may later contribute to predictive models. It also detects patterns and outliers that might indicate errors, unusual subgroups, or emerging signals worth investigating further.

These objectives form the foundation for effective Modeling. They help analysts refine which features deserve emphasis, anticipate potential challenges, and identify early insights that can guide the direction of later stages in the workflow.

Exploration becomes more productive when guided by focused questions. These questions can be grouped broadly into those concerning individual features and those concerning relationships among features. When examining features one at a time, the guiding questions ask what each feature reveals on its own, how it is distributed, whether missing values follow a particular pattern, and whether any irregularities stand out. Histograms, box plots, and summary statistics are familiar tools for answering such questions.

When shifting to relationships among features, the focus moves to how predictors relate to the target, whether any features are strongly correlated, whether redundancies or interactions might influence Modeling, and how categorical and numerical features combine to reveal structure. Scatter plots, grouped visualizations, and correlation matrices help reveal these patterns and support thoughtful feature selection.

A recurring challenge, especially for students, is choosing which plots or techniques best suit different types of data. Table 4.1 summarizes commonly used exploratory objectives alongside appropriate analytical tools. It serves as a practical reference when deciding how to approach unfamiliar datasets or new analytical questions.

Table 4.1: Overview of recommended tools for common EDA objectives.
Objective Data.Type Techniques
Examine a feature’s distribution Numerical Histogram, box plot, density plot, summary statistics
Summarize a categorical feature Categorical Bar chart, frequency table
Identify outliers Numerical Box plot, histogram
Detect missing data patterns Any Summary statistics, missingness maps
Explore the relationship between two numerical features Numerical & Numerical Scatter plot, correlation coefficient
Compare a numerical feature across groups Numerical & Categorical Box plot, grouped bar chart, violin plot
Analyze interactions between two categorical features Categorical & Categorical Stacked bar chart, mosaic plot, contingency table
Assess correlation among multiple numerical features Multiple Numerical Correlation matrix, scatterplot matrix

By aligning objectives with guiding questions and appropriate methods, EDA becomes more than a routine diagnostic stage. It becomes a strategic component of the workflow that enhances data quality, informs feature construction, and lays the groundwork for effective Modeling.

The next section applies these principles through a detailed EDA of customer churn, showing how statistical summaries, visual tools, and domain understanding can uncover patterns that support predictive analysis.

4.3 EDA in Practice: The churn Dataset

Exploratory data analysis (EDA) is most informative when it is grounded in real data and motivated by practical questions. In this section, we illustrate the exploratory process using the churn dataset, which contains demographic, behavioral, and financial information about customers, along with a binary indicator of whether a customer has churned by closing their credit card account. The goal is to understand which patterns and characteristics are associated with customer attrition and how these insights can guide subsequent analysis.

This walkthrough follows the logic of the Data Science Workflow introduced in Chapter 2. We begin by briefly revisiting problem understanding and data preparation to establish the business context and examine the structure of the dataset. The core of the section focuses on exploratory data analysis, where summary statistics, visualizations, and guiding questions are used to investigate relationships between customer characteristics and churn outcomes.

The insights developed through this exploratory analysis form the foundation for later stages of the workflow. They inform how the data are prepared for Modeling in Chapter 6, support the construction of predictive models using k-nearest neighbours in Chapter 7, and motivate the evaluation strategies discussed in Chapter 8. Taken together, these stages demonstrate how careful exploratory analysis strengthens understanding and supports well-grounded analytical decisions.

4.3.1 Problem Understanding for the churn Dataset

A bank manager has become increasingly concerned about a growing number of customers closing their credit card accounts. Understanding why customers leave, and anticipating which customers are at greater risk of doing so, has become a strategic priority. Reliable churn prediction would allow the bank to intervene proactively, for example by adjusting services or offering targeted incentives to retain valuable clients.

Customer churn is a persistent challenge in subscription-based industries such as banking, telecommunications, and streaming services. Retaining existing customers is typically more cost-effective than acquiring new ones, which makes identifying the drivers of churn an important analytical objective. From a business perspective, this problem naturally leads to three central questions:

  • Why are customers choosing to leave?
  • Which behavioral or demographic characteristics are associated with higher churn risk?
  • How can these insights inform strategies designed to improve customer retention?

Exploratory data analysis provides an initial foundation for addressing these questions. By examining distributions, group differences, and relationships among features, EDA helps uncover early signals associated with churn. These exploratory insights support a deeper understanding of how customer attributes and behaviors interact and help narrow the focus for subsequent modeling efforts.

In Chapter 7, a k-nearest neighbours (kNN) model will be developed to predict customer churn. Before such a model can be constructed, it is essential to understand the structure of the churn dataset, the types of features it contains, and the patterns they exhibit. The next subsection therefore examines the dataset in detail to establish this foundational understanding.

4.3.2 Overview of the churn Dataset

Before conducting visual or statistical exploration, it is important to understand the dataset used throughout this chapter. The churn dataset, available in the liver package, serves as a realistic case study for applying exploratory data analysis. It contains combined demographic information, account characteristics, credit usage, and customer interaction metrics. The key feature of interest is churn, which indicates whether a customer has closed a credit card account (“yes”) or remained active (“no”). This binary outcome will later serve as the target feature for the classification model in Chapter 7. At this stage, the goal is to understand the structure, content, and quality of the data surrounding this outcome. To load and inspect the dataset, run:

library(liver)

data(churn)

str(churn)
   'data.frame':    10127 obs. of  21 variables:
    $ customer_ID          : int  768805383 818770008 713982108 769911858 709106358 713061558 810347208 818906208 710930508 719661558 ...
    $ age                  : int  45 49 51 40 40 44 51 32 37 48 ...
    $ gender               : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 2 2 2 2 ...
    $ education            : Factor w/ 7 levels "uneducated","highschool",..: 2 4 4 2 1 4 7 2 1 4 ...
    $ marital              : Factor w/ 4 levels "married","single",..: 1 2 1 4 1 1 1 4 2 2 ...
    $ income               : Factor w/ 6 levels "<40K","40K-60K",..: 3 1 4 1 3 2 5 3 3 4 ...
    $ card_category        : Factor w/ 4 levels "blue","silver",..: 1 1 1 1 1 1 3 2 1 1 ...
    $ dependent_count      : int  3 5 3 4 3 2 4 0 3 2 ...
    $ months_on_book       : int  39 44 36 34 21 36 46 27 36 36 ...
    $ relationship_count   : int  5 6 4 3 5 3 6 2 5 6 ...
    $ months_inactive      : int  1 1 1 4 1 1 1 2 2 3 ...
    $ contacts_count_12    : int  3 2 0 1 0 2 3 2 0 3 ...
    $ credit_limit         : num  12691 8256 3418 3313 4716 ...
    $ revolving_balance    : int  777 864 0 2517 0 1247 2264 1396 2517 1677 ...
    $ available_credit     : num  11914 7392 3418 796 4716 ...
    $ transaction_amount_12: int  1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
    $ transaction_count_12 : int  42 33 20 20 28 24 31 36 24 32 ...
    $ ratio_amount_Q4_Q1   : num  1.33 1.54 2.59 1.41 2.17 ...
    $ ratio_count_Q4_Q1    : num  1.62 3.71 2.33 2.33 2.5 ...
    $ utilization_ratio    : num  0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
    $ churn                : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

The dataset is stored as a data.frame with 10127 observations and 21 features. The predictors consist of both numerical and categorical features that describe customer demographics, spending behavior, credit management, and engagement with the bank. Eight features are categorical (gender, education, marital, income, card_category, churn, and two grouping identifiers), while the remaining features are numerical. The categorical features represent demographic or qualitative groupings, and the numerical features capture continuous measures such as credit limits, transaction amounts, and utilization ratios. This distinction guides the choice of summary and visualization techniques used later in the chapter. A structured overview of the features is provided below:

  • customer_ID: Unique identifier for each account holder.
  • age: Age of the customer, in years.
  • gender: Gender of the account holder.
  • education: Highest educational qualification.
  • marital: Marital status.
  • income: Annual income bracket.
  • card_category: Credit card type (blue, silver, gold, platinum).
  • dependent_count: Number of dependents.
  • months_on_book: Tenure with the bank, in months.
  • relationship_count: Number of products held by the customer.
  • months_inactive: Number of inactive months in the past 12 months.
  • contacts_count_12: Number of customer service contacts in the past 12 months.
  • credit_limit: Total credit card limit.
  • revolving_balance: Current revolving balance.
  • available_credit: Unused portion of the credit limit, calculated as credit_limit - revolving_balance.
  • transaction_amount_12: Total transaction amount in the past 12 months.
  • transaction_count_12: Total number of transactions in the past 12 months.
  • ratio_amount_Q4_Q1: Ratio of total transaction amount in the fourth quarter to that in the first quarter.
  • ratio_count_Q4_Q1: Ratio of total transaction count in the fourth quarter to that in the first quarter.
  • utilization_ratio: Credit utilization ratio, defined as revolving_balance / credit_limit.
  • churn: Whether the account was closed (“yes”) or remained active (“no”).

A first quantitative impression of the dataset can be obtained with:

summary(churn)
     customer_ID             age           gender             education        marital          income      card_category 
    Min.   :708082083   Min.   :26.00   female:5358   uneducated   :1487   married :4687   <40K    :3561   blue    :9436  
    1st Qu.:713036770   1st Qu.:41.00   male  :4769   highschool   :2013   single  :3943   40K-60K :1790   silver  : 555  
    Median :717926358   Median :46.00                 college      :1013   divorced: 748   60K-80K :1402   gold    : 116  
    Mean   :739177606   Mean   :46.33                 graduate     :3128   unknown : 749   80K-120K:1535   platinum:  20  
    3rd Qu.:773143533   3rd Qu.:52.00                 post-graduate: 516                   >120K   : 727                  
    Max.   :828343083   Max.   :73.00                 doctorate    : 451                   unknown :1112                  
                                                      unknown      :1519                                                  
    dependent_count months_on_book  relationship_count months_inactive contacts_count_12  credit_limit   revolving_balance
    Min.   :0.000   Min.   :13.00   Min.   :1.000      Min.   :0.000   Min.   :0.000     Min.   : 1438   Min.   :   0     
    1st Qu.:1.000   1st Qu.:31.00   1st Qu.:3.000      1st Qu.:2.000   1st Qu.:2.000     1st Qu.: 2555   1st Qu.: 359     
    Median :2.000   Median :36.00   Median :4.000      Median :2.000   Median :2.000     Median : 4549   Median :1276     
    Mean   :2.346   Mean   :35.93   Mean   :3.813      Mean   :2.341   Mean   :2.455     Mean   : 8632   Mean   :1163     
    3rd Qu.:3.000   3rd Qu.:40.00   3rd Qu.:5.000      3rd Qu.:3.000   3rd Qu.:3.000     3rd Qu.:11068   3rd Qu.:1784     
    Max.   :5.000   Max.   :56.00   Max.   :6.000      Max.   :6.000   Max.   :6.000     Max.   :34516   Max.   :2517     
                                                                                                                          
    available_credit transaction_amount_12 transaction_count_12 ratio_amount_Q4_Q1 ratio_count_Q4_Q1 utilization_ratio
    Min.   :    3    Min.   :  510         Min.   : 10.00       Min.   :0.0000     Min.   :0.0000    Min.   :0.0000   
    1st Qu.: 1324    1st Qu.: 2156         1st Qu.: 45.00       1st Qu.:0.6310     1st Qu.:0.5820    1st Qu.:0.0230   
    Median : 3474    Median : 3899         Median : 67.00       Median :0.7360     Median :0.7020    Median :0.1760   
    Mean   : 7469    Mean   : 4404         Mean   : 64.86       Mean   :0.7599     Mean   :0.7122    Mean   :0.2749   
    3rd Qu.: 9859    3rd Qu.: 4741         3rd Qu.: 81.00       3rd Qu.:0.8590     3rd Qu.:0.8180    3rd Qu.:0.5030   
    Max.   :34516    Max.   :18484         Max.   :139.00       Max.   :3.3970     Max.   :3.7140    Max.   :0.9990   
                                                                                                                      
    churn     
    yes:1627  
    no :8500  
              
              
              
              
   

The summary statistics reveal several broad patterns:

  • Demographics and tenure: Customers are primarily middle-aged, with an average age of about 46 years, and have held their accounts for approximately three years.

  • Credit behavior: Credit limits vary widely around an average of roughly 8,600 dollars. Available credit closely mirrors the credit limit, and utilization ratios range from very low to very high, indicating a mix of conservative and heavy users.

  • Transaction activity: Customers complete about 65 transactions per year on average, with total annual spending near 4,400 dollars. The upper quartile contains high spenders whose behavior may influence churn.

  • behavioral changes: Quarterly spending ratios show a slight decline from the first to the fourth quarter for many customers, although some increase their spending.

  • Categorical features: Females form a slight majority. Education levels are concentrated in the college and graduate categories, and income tends to fall in lower brackets. Most customers hold blue cards, which reflects typical portfolio distributions.

These descriptive patterns illustrate the heterogeneity of the customer base and suggest that several numerical features may require scaling or transformation. Some categorical features, particularly education, marital, and income, contain an “unknown” category that represents missing information. Handling these cases is an important preparatory step.

The next subsection focuses on preparing the dataset for exploration by addressing missing values, verifying feature types, and ensuring consistent formats. Proper preparation ensures that the insights drawn from exploratory data analysis are both valid and interpretable.

4.3.3 Data Preparation for the churn Dataset

Before conducting exploratory data analysis, we carry out a limited amount of data preparation to ensure that summaries and visualisations accurately reflect the underlying data. An initial inspection of the churn dataset reveals that several categorical features (education, income, and marital) contain missing entries encoded as the string “unknown”. These placeholders must be converted to standard missing values so that they are handled correctly during exploration.

To standardise the representation of missing values, we replace all occurrences of “unknown” with NA and remove unused factor levels:

churn[churn == "unknown"] <- NA
churn <- droplevels(churn)

Before deciding how to handle the missing values, we assess their extent. The naniar package provides convenient tools for visualising missingness across features. The function gg_miss_var() displays the proportion of missing observations for each variable:

library(naniar)
gg_miss_var(churn, show_pct = TRUE)

The plot shows that missing values are confined to three categorical features, with the highest proportion occurring in education (approximately 15%). Strategies for handling missing categorical data are discussed in detail in Chapter 3.9. For the purposes of exploratory analysis, we apply random imputation to preserve the observed distribution of each feature and avoid distorting group comparisons. We therefore apply random imputation using the impute() function from the Hmisc package:

library(Hmisc)

churn$education <- impute(churn$education, "random")
churn$income    <- impute(churn$income, "random")
churn$marital   <- impute(churn$marital, "random")

With missing values addressed and feature types confirmed, the dataset is now suitable for exploratory analysis. In the next section, we apply visual and numerical tools to uncover patterns associated with customer churn.

4.4 Exploring Categorical Features

Categorical features group observations into distinct classes and often capture key demographic or behavioral characteristics. In the churn dataset, such features include gender, education, marital, card_category, and the outcome variable churn. Examining how these features are distributed, and how they relate to customer churn, provides an initial understanding of customer retention and disengagement.

We begin by examining the distribution of the target feature churn, which indicates whether a customer has closed a credit card account. Understanding this distribution is important for assessing class balance, a factor that influences both model training and the interpretation of predictive performance. The bar plot and pie chart below summarize the proportion of customers who churned:

library(ggplot2)

# Bar plot
ggplot(data = churn, aes(x = churn, label = scales::percent(prop.table(after_stat(count))))) +
  geom_bar(fill = c("#F4A582", "#A8D5BA")) +
  geom_text(stat = "count", vjust = 0.4, size = 6)

# Pie chart
ggplot(churn, aes(x = 1, fill = churn)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

Both plots show that most customers remain active (churn = "no"), while only a small proportion (about 16.1 percent) have closed their accounts. The bar plot makes the class imbalance immediately visible and supports direct comparison of counts or proportions. The pie chart conveys the same information but is less effective for analytical comparison; it is included here primarily to illustrate alternative presentation styles for a binary outcome.

A simpler bar plot, without colours or percentage labels, can be created as follows:

ggplot(data = churn) +
  geom_bar(aes(x = churn))

This basic version provides a quick overview of class counts, while the enhanced plot communicates relative proportions more clearly. Such refinements are particularly useful when presenting results to non-technical audiences.

Practice: Create a bar plot of the gender feature using ggplot2. Experiment with adding colour fills or percentage labels. This short exercise reinforces the structure of bar plots before examining relationships between categorical features.

Having established the overall distribution of the target variable, the next step is to explore how other categorical features vary across churn outcomes. These comparisons help identify customer segments and behavioral patterns that may be associated with elevated attrition risk.

Relationship Between Gender and Churn

Among the demographic features, gender provides a natural starting point for exploring whether customer retention behavior differs across broad population groups. Although gender is not typically a strong predictor of churn in financial services, examining it first establishes a useful baseline for comparison with more behaviorally driven features.

We note that, in this dataset, the gender feature is recorded as a binary category. This representation does not capture the full diversity of gender identities and excludes non-binary and LGBTQ+ identities. Any conclusions drawn from this feature should therefore be interpreted with caution, both analytically and ethically, as they reflect limitations of the available data rather than characteristics of the underlying population.

ggplot(data = churn) + 
  geom_bar(aes(x = gender, fill = churn)) +
  labs(x = "Gender", y = "Count", title = "Counts of Churn by Gender") 

ggplot(data = churn) + 
  geom_bar(aes(x = gender, fill = churn), position = "fill") +  
  labs(x = "Gender", y = "Proportion", title = "Proportion of Churn by Gender")

The left panel shows the number of churners and non-churners within each gender group, while the right panel displays the corresponding proportions. The proportional view facilitates comparison of churn rates across groups and reveals a slightly higher churn rate among female customers. The difference, however, is small and unlikely to be practically meaningful in isolation.

To examine this pattern more closely, we can inspect the corresponding contingency table:

addmargins(table(churn$churn, churn$gender,
                 dnn = c("Churn", "Gender")))
        Gender
   Churn female  male   Sum
     yes    930   697  1627
     no    4428  4072  8500
     Sum   5358  4769 10127

The table confirms the visual impression that the proportion of female customers who churn is marginally higher than that of male customers. At this stage, the analysis remains descriptive. Determining whether such a difference is statistically significant requires formal inference, which is introduced in Section 5.8.

From an exploratory perspective, this finding suggests that gender alone is not a strong differentiating feature for churn behavior. In practice, larger and more informative variation is typically associated with behavioral and financial indicators such as transaction activity, credit utilization, and customer service interactions. These features therefore tend to carry greater predictive value in churn modeling contexts.

Practice: Compute the churn rate separately for male and female customers using the churn dataset. Then create your own bar plot and compare it with the figures above. Based on the observed proportions, would you expect the difference in churn rates to be statistically significant? This question is revisited formally in Chapter 5.8, where the test for two proportions is introduced.

Relationship Between Card Category and Churn

Card type is one of the most informative service features in the churn dataset. The variable card_category places customers into four tiers: blue, silver, gold, and platinum. These categories reflect different benefit levels and often correspond to distinct customer segments.

ggplot(data = churn) + 
  geom_bar(aes(x = card_category, fill = churn)) + 
  labs(x = "Card Category", y = "Count")

ggplot(data = churn) + 
  geom_bar(aes(x = card_category, fill = churn), position = "fill") + 
  labs(x = "Card Category", y = "Proportion")

The left panel displays the number of churners and non-churners within each card tier. The right panel shows proportions within each tier. The distribution is highly imbalanced: more than 93 percent of customers hold a blue card, the entry-level option. This reflects typical product portfolios in retail banking, where most customers hold standard cards. Because the other categories are much smaller, differences across tiers must be interpreted with care.

addmargins(table(churn$churn, churn$card_category, 
                 dnn = c("Churn", "Card Category")))
        Card Category
   Churn  blue silver  gold platinum   Sum
     yes  1519     82    21        5  1627
     no   7917    473    95       15  8500
     Sum  9436    555   116       20 10127

The contingency table confirms the visual pattern. Churn rates are slightly higher among blue and silver cardholders and lower among customers with gold or platinum cards. Although modest, this difference suggests that customers with premium cards are more engaged and therefore less likely to close their accounts.

Because the silver, gold, and platinum groups are relatively small, analysts often combine similar categories to ensure adequate group sizes for Modeling. A common approach is to separate “blue” from “silver+” (a combined group of silver, gold, and platinum cardholders). This simplification reduces sparsity, stabilises estimates, and often produces clearer and more interpretable models.

Practice: Reclassify the card categories into two groups, “blue” and “silver+”, using the fct_collapse() function from the forcats package (as in Section 3.10). Then recreate both bar plots and compare the patterns. Does the simplified version make the churn differences easier to see? Would this reclassification improve interpretability in a predictive model?

Relationship Between Income and Churn

Income level reflects purchasing power and financial stability, both of which may influence a customer’s likelihood of closing a credit account. The feature income in the churn dataset includes five ordered categories, ranging from less than $40K to over $120K. Because missing values were imputed earlier, the feature now provides a complete and consistent basis for comparison.

ggplot(data = churn) + 
  geom_bar(aes(x = income, fill = churn)) + 
  labs(x = "Annual Income Bracket", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(data = churn) + 
  geom_bar(aes(x = income, fill = churn), position = "fill") + 
  labs(x = "Annual Income Bracket", y = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar plots indicate a gradual decline in churn as income increases. Customers in the lowest bracket (less than $40K) churn slightly more often than those in higher brackets, while customers earning over $120K show the lowest churn rates. Although the trend is modest, it suggests that higher-income customers maintain more stable account relationships. To examine this pattern more closely, we can inspect the corresponding contingency table:

addmargins(table(churn$churn, churn$income, 
                 dnn = c("Churn", "Income")))
        Income
   Churn  <40K 40K-60K 60K-80K 80K-120K >120K   Sum
     yes   677     310     227      271   142  1627
     no   3327    1705    1345     1453   670  8500
     Sum  4004    2015    1572     1724   812 10127

The contingency table supports this observation. Lower-income customers may be more sensitive to service fees or constrained credit limits, while higher-income customers typically exhibit more consistent spending patterns and longer account tenure.

From an analytical perspective, income provides a weak yet interpretable signal of churn behavior. Because the categories follow a natural progression, treating income as an ordered factor may be useful during Modeling.

Practice: Convert income into an ordered factor using factor(..., ordered = TRUE) and recreate the proportional bar plot. Does the plot change? Next, reorder the categories using fct_relevel() and observe how the ordering affects readability. Small adjustments to factor ordering often make EDA plots easier to interpret.

Relationship Between Marital Status and Churn

Marital status may influence financial behavior and account management, making it a useful demographic feature to explore in the context of churn. The marital feature in the churn dataset includes three categories (married, single, and divorced) which may reflect differences in household structure, shared responsibilities, or spending patterns.

ggplot(data = churn) + 
  geom_bar(aes(x = marital, fill = churn)) + 
  labs(x = "Marital Status", y = "Count")

ggplot(data = churn) + 
  geom_bar(aes(x = marital, fill = churn), position = "fill") + 
  labs(x = "Marital Status", y = "Proportion")

The count plot on the left shows that most customers are married, followed by single and divorced individuals. The proportional bar plot on the right highlights that single customers churn at a slightly higher rate than married or divorced customers. This difference is consistent but small, suggesting only a weak relationship between marital status and account closure.

addmargins(table(churn$churn, churn$marital, 
                 dnn = c("Churn", "Marital Status")))
        Marital Status
   Churn married single divorced   Sum
     yes     767    727      133  1627
     no     4277   3548      675  8500
     Sum    5044   4275      808 10127

The contingency table supports the visual impression. Although single customers exhibit marginally higher churn rates, the overall association between marital status and churn appears limited. Small behavioral differences may exist across household types, but marital status is unlikely to be a strong predictor of churn on its own.

From an analytical standpoint, this feature offers only minor explanatory value. Later sections will show that behavioral and financial indicators—including spending activity, utilization ratio, and customer-service interactions—provide more substantial insight into churn risk. Because both marital and churn are categorical variables, the Chi-square test introduced in Section 5.9 will formally assess whether the observed differences are statistically meaningful.

Practice: Examine whether education is associated with churn. Create bar plots for counts and proportions, inspect the contingency table, and consider whether any observed differences appear meaningful in practice. This exercise reinforces the workflow used for exploring categorical features.

4.5 Exploring Numerical Features

The churn dataset contains fourteen numerical features that describe customer behavior, credit management, and engagement with the bank. Examining these features helps us understand how customers differ in spending patterns, activity levels, financial capacity, and behavioral change, all of which are commonly associated with churn risk.

To keep the analysis focused and interpretable, we concentrate on five representative numerical features that capture key behavioral and financial dimensions of customer retention: contacts_count_12, transaction_amount_12, credit_limit, months_on_book, and ratio_amount_Q4_Q1. Together, these variables reflect customer interaction with the bank, overall engagement, financial strength, tenure, and recent behavioral trends. They provide a compact yet informative basis for exploring numerical patterns related to churn.

In the following subsections, we use summary statistics and visualizations to examine the distributions of these features and their relationships with customer churn, with the aim of identifying meaningful variation and potential signals for subsequent analysis.

Customer Contacts and Churn

The number of customer service contacts in the past year (contacts_count_12) offers insight into customer engagement and potential dissatisfaction. This feature is a count variable with small integer values, making bar plots more appropriate than boxplots or density plots. Bar plots clearly display how frequently customers interacted with support and allow easy comparison between churned and active accounts.

ggplot(data = churn) +
  geom_bar(aes(x = contacts_count_12, fill = churn)) +
  labs(x = "Number of Contacts in 12 Months", y = "Count")

ggplot(data = churn) +
  geom_bar(aes(x = contacts_count_12, fill = churn), position = "fill") +
  labs(x = "Number of Contacts in 12 Months", y = "Proportion")

Both plots show that customers who contact customer service more frequently are more likely to churn. The increase is particularly noticeable for those with four or more interactions during the year. This pattern suggests that repeated service contacts may reflect concerns, dissatisfaction, or unresolved issues. From an analytical perspective, contacts_count_12 provides a clear behavioral signal: frequent contact is associated with elevated churn risk. Because it is easy to interpret and directly linked to customer experience, this feature often plays a meaningful role in churn Modeling and early-warning retention strategies.

Transaction Amount and Churn

The total transaction amount over the past twelve months (transaction_amount_12) reflects how actively customers use their credit card. Higher spending is typically associated with regular engagement, whereas lower spending may indicate reduced usage or a shift toward alternative payment methods. Because this feature is continuous, we use boxplots and density plots to examine how its distribution differs between customers who churn and those who remain active.

ggplot(data = churn) +
  geom_boxplot(aes(x = churn, y = transaction_amount_12), 
               fill = c("#F4A582", "#A8D5BA")) +
  labs(x = "Churn", y = "Total Transaction Amount")

ggplot(data = churn) +
  geom_density(aes(x = transaction_amount_12, fill = churn), alpha = 0.6) +
  labs(x = "Total Transaction Amount", y = "Density")

The boxplot highlights differences in central tendency and spread, while the density plot provides a more detailed view of the distributional shape. Together, the plots show that customers who churn tend to have lower total transaction amounts and a narrower range of spending, suggesting more limited engagement over the year. In contrast, customers who remain active exhibit higher and more variable transaction volumes.

From an exploratory perspective, this pattern indicates that sustained reductions in spending are associated with an increased likelihood of churn. Such insights motivate closer monitoring of spending behavior and help identify customers whose engagement appears to be declining, although further analysis is required to assess predictive strength and causal relevance.

Practice: Recreate the density plot for transaction_amount_12 using a histogram instead. Experiment with different bin widths and compare the resulting plots. How sensitive are your conclusions to these choices? Which visualization would you use for exploratory analysis, and which for reporting results?

Credit Limit and Churn

The total credit line assigned to a customer (credit_limit) reflects both financial capacity and the bank’s assessment of creditworthiness. Customers with higher credit limits are often more established or have demonstrated reliable repayment behavior, which may be associated with a lower likelihood of churn. Because credit limits vary substantially across customers, we use violin plots and histograms to examine both distributional shape and differences between churn groups.

ggplot(data = churn, aes(x = churn, y = credit_limit, fill = churn)) +
  geom_violin(trim = FALSE) +
  labs(x = "Churn", y = "Credit Limit")

ggplot(data = churn) +
  geom_histogram(aes(x = credit_limit, fill = churn)) +
  labs(x = "Credit Limit", y = "Count")

The violin plot suggests that customers who churn tend to have lower credit limits on average, although the overlap between the two groups is substantial. The histogram provides additional insight into the overall distribution, revealing that most customers fall into a lower credit limit range, with a smaller group holding substantially higher limits. This pattern gives the appearance of two broad clusters, one concentrated below approximately $7,000 and another above $30,000, though this observation remains exploratory.

Taken together, the plots indicate a modest shift toward higher credit limits among customers who remain active, but the separation between groups is not pronounced. To assess whether the observed difference in average credit limits is statistically meaningful, we return to this comparison in Section 5.7, where we introduce formal hypothesis testing for numerical features.

From an exploratory standpoint, credit limit appears to be a weaker differentiating feature than transaction activity, but it still provides useful contextual information about customer profiles. Its primary value at this stage lies in complementing behavioral indicators rather than serving as a standalone signal of churn.

Practice: Create boxplots and density plots for credit_limit stratified by churn status. Compare these with the violin plot and histogram shown in this section. How do the different visualizations influence your perception of group overlap and central tendency? Discuss which plots are most informative at this exploratory stage.

Months on Book and Churn

The feature months_on_book measures how long a customer has held their credit card account. Tenure often reflects relationship stability, accumulated benefits, and familiarity with the service. Customers with longer histories typically show stronger loyalty, whereas newer customers may be more vulnerable to unmet expectations or early dissatisfaction.

# Violin and boxplot
ggplot(data = churn, 
       aes(x = churn, y = months_on_book, fill = churn)) +
  geom_violin(alpha = 0.5, trim = TRUE) +
  geom_boxplot(width = 0.15, fill = "white", outlier.shape = NA) +
  labs(x = "Churn", y = "Months on Book") +
  theme(legend.position = "none")

# Histogram
ggplot(data = churn) +
  geom_histogram(aes(x = months_on_book, fill = churn), bins = 20) +
  labs(x = "Months on Book", y = "Count")

Both plots suggest that customers who churn tend to have slightly shorter tenures than those who remain active. The difference is not large, but it is consistent: the median tenure for churners is lower by a few months. The pronounced peak around 36 months likely reflects a cohort effect, possibly linked to a major acquisition campaign that occurred three years prior to the observation period.

From a business perspective, these patterns highlight the importance of early relationship management. Targeted onboarding, proactive engagement in the first year, and timely communication may help build loyalty among newer customers and reduce attrition during the initial stages of the customer lifecycle.

Practice: Create a density plot for months_on_book stratified by churn status. Compare these with the histogram shown in this section. How do the different visualizations influence your perception of group overlap and central tendency? Discuss which plots are most informative at this exploratory stage.

Ratio of Transaction Amount (Q4/Q1) and Churn

The feature ratio_amount_Q4_Q1 compares total spending in the fourth quarter with that in the first quarter. It captures how customer behavior changes over time and provides a temporal view of engagement. A ratio below 1 indicates that spending in Q4 was lower than in Q1, whereas a ratio above 1 reflects increased spending toward the end of the year.

ggplot(data = churn) +
  geom_boxplot(aes(x = churn, y = ratio_amount_Q4_Q1), 
               fill = c("#F4A582", "#A8D5BA")) +
  labs(x = "Churn", y = "Transaction Amount Ratio")

ggplot(data = churn) +
  geom_density(aes(x = ratio_amount_Q4_Q1, fill = churn), alpha = 0.6) +
  labs(x = "Transaction Amount Ratio", y = "Density")

The plots show that customers who churn tend to have lower Q4-to-Q1 ratios, indicating a reduction in spending toward the end of the year. Customers who remain active typically maintain or modestly increase their spending. This downward shift in activity may serve as an early sign of disengagement: gradual reductions in spending often precede account closure.

From a business perspective, monitoring quarterly spending patterns can help identify customers who may be at risk of churn. Seasonal incentives or targeted engagement campaigns aimed at customers with declining activity may help maintain their involvement and improve retention outcomes.

Practice: Repeat the analysis using features such as age and months_inactive. Compare the patterns you observe for churners and non-churners. How might these features contribute to predicting which customers are likely to remain active?

4.6 Exploring Multivariate Relationships

Univariate and pairwise analyses provide helpful context, but real-world customer behavior often arises from the interaction of multiple features. Examining these joint patterns is essential for identifying customer segments with distinct churn risks and for selecting features that add genuine value to predictive models.

We begin with a correlation analysis of the numerical features, which highlights pairs of variables that move together and helps detect redundancy. After establishing these relationships, we broaden the analysis to explore how behavioral, transactional, and demographic features interact. These multivariate views reveal usage patterns and customer profiles that are not visible through individual variables alone.

4.7 Assessing Correlation and Redundancy

Before examining more complex interactions among features, we assess how numerical variables relate to one another. Correlation analysis helps us identify features that may carry overlapping information or exhibit redundancy. Recognizing such relationships early simplifies subsequent modeling and reduces the risk of multicollinearity.

Correlation quantifies the degree to which two features move together. A positive correlation indicates that higher values of one feature tend to be associated with higher values of the other, whereas a negative correlation indicates an inverse relationship. The Pearson correlation coefficient, denoted by \(r\), summarizes this association on a scale from \(-1\) to \(1\). Values of \(r = 1\) and \(r = -1\) indicate perfect positive and negative linear relationships, respectively, while \(r = 0\) indicates no linear association.

Figure 4.2: Example scatterplots showing different correlation coefficients.

We emphasize that correlation does not imply causation. For example, a strong positive correlation between customer service contacts and churn does not mean that contacting customer service causes customers to leave. Both behaviors may instead reflect an underlying factor, such as dissatisfaction with service.

A well-known illustration of this principle is shown in Figure 4.3 which depicts a strong correlation between per-capita chocolate consumption and Nobel Prize wins across countries (Messerli 2012). While clearly not causal, the example highlights how correlations can arise through coincidence or shared underlying factors. Readers interested in causal reasoning may consult The Book of Why by Judea Pearl and Dana Mackenzie (2018) for an accessible introduction.

Figure 4.3: Scatterplot illustrating the correlation between Nobel Prize wins and chocolate consumption (per 10 million population) across countries.

Returning to the churn dataset, we compute and visualise the correlation matrix for all numerical features using a heatmap. This overview helps us detect redundant or closely related variables before proceeding to modeling.

library(ggcorrplot)

numeric_features = c("age", "dependent_count", "months_on_book", 
             "relationship_count", "months_inactive", "contacts_count_12", 
             "credit_limit", "revolving_balance", "available_credit", 
             "transaction_amount_12", "transaction_count_12", 
             "ratio_amount_Q4_Q1", "ratio_count_Q4_Q1", "utilization_ratio")

cor_matrix = cor(churn[, numeric_features])

ggcorrplot(cor_matrix, type = "lower", lab = TRUE, lab_size = 1.7, tl.cex = 6, 
           colors = c("#699fb3", "white", "#b3697a"),
           title = "Visualization of the Correlation Matrix") +
  theme(plot.title = element_text(size = 10, face = "plain"),
        legend.title = element_text(size = 7), 
        legend.text = element_text(size = 6))

The heatmap shows that most numerical features are only weakly or moderately correlated, suggesting that they capture distinct behavioral dimensions. One notable exception is the perfect correlation (\(r = 1\)) between credit_limit and available_credit, indicating that one feature is mathematically derived from the other. Including both in a model would therefore introduce redundancy without adding new information. This relationship is illustrated in the following scatter plots:

ggplot(data = churn) +
    geom_point(aes(x = credit_limit, y = available_credit), size = 0.1) +
    labs(x = "Credit Limit", y = "Available Credit")

ggplot(data = churn) +
    geom_point(aes(x = credit_limit - revolving_balance, 
                   y = available_credit), size = 0.1) +
    labs(x = "Credit Limit - Revolving Balance", y = "Available Credit")

The first plot shows the exact linear relationship between credit_limit and available_credit. The second confirms that available_credit is effectively equal to credit_limit - revolving_balance, explaining the observed redundancy.

As an optional exploration, we can examine the joint structure of these three features using a three-dimensional scatter plot. The plotly package enables interactive rotation and zooming, which can make this linear dependency especially apparent. This visualization is available in HTML output or interactive environments such as RStudio, but it does not render in the PDF version of this book.

library(plotly)

plot_ly(
  data = churn,
  x = ~credit_limit,
  y = ~available_credit,
  z = ~revolving_balance,
  color = ~churn,
  colors = c("#F4A582", "#A8D5BA"),
  type = "scatter3d",
  mode = "markers",
  marker = list(size = 1)
)

A similar relationship appears between utilization_ratio, revolving_balance, and credit_limit. Because the utilization ratio is defined as revolving_balance / credit_limit, it does not introduce new information but provides a normalized view of credit usage. Depending on the modeling objective, we may retain the ratio for interpretability or keep its component features for greater flexibility.

ggplot(data = churn) +
    geom_point(aes(x = credit_limit, y = utilization_ratio), size = 0.1) +
    labs(x = "Credit Limit", y = "utilization Ratio")

ggplot(data = churn) +
    geom_point(aes(x = revolving_balance/credit_limit, 
                   y = utilization_ratio), size = 0.1) +
    labs(x = "Revolving Balance / Credit Limit", y = "utilization Ratio")

Practice: Create a three-dimensional scatter plot using credit_limit, revolving_balance, and utilization_ratio. Because these features are mathematically linked, the points should lie close to a plane. Use plotly to explore the structure interactively. Rotate the plot and examine how the features relate. Does the three-dimensional view make the redundancy among these features more visually apparent?

Identifying redundant or highly correlated features provides a clearer foundation for multivariate exploration. After consolidating or removing derived variables, the remaining numerical features offer complementary perspectives on customer behavior. In the next subsection, we examine how key features interact, beginning with joint patterns in transaction amount and transaction frequency, to uncover usage dynamics that are not visible from individual features alone.

4.7.1 Joint Patterns in Transaction Amount and Count

Transaction activity has two complementary dimensions: how much customers spend and how frequently they use their card. The features transaction_amount_12 and transaction_count_12 capture these behaviors over a twelve-month period. Examining them jointly provides insight into usage patterns that remain hidden in univariate analyses. A scatter plot augmented with marginal histograms is particularly useful here, as it simultaneously reveals the joint structure of the data and the marginal distributions of each feature.

The code below first constructs a base scatter plot using ggplot2 and then applies ggMarginal() from the ggExtra package to add histograms along the horizontal and vertical axes:

library(ggExtra)

# Base scatter plot
scatter_plot <- ggplot(data = churn) +
  geom_point(aes(x = transaction_amount_12, y = transaction_count_12, 
                 color = churn), size = 0.1, alpha = 0.7) +
  labs(x = "Transaction Amount", y = "Total Transaction Count") +
  theme(legend.position = "bottom")

# Add marginal histograms
ggMarginal(scatter_plot, type = "histogram", groupColour = TRUE, 
           groupFill = TRUE, alpha = 0.3, size = 4)

The central scatter plot reveals a clear positive association: customers who spend more also tend to make more transactions. Most observations lie along a broad diagonal band representing moderate spending and activity, where churners and non-churners largely overlap. The marginal histograms complement this view by enabling a quick comparison of the individual distributions for both features across churn groups.

Beyond this general trend, the scatter plot suggests the presence of three broad usage segments: customers with low spending and few transactions, customers with moderate spending and moderate transaction counts, and customers with high spending and frequent transactions. Churners are predominantly concentrated in the low-activity segment, while the high-spending, high-usage segment contains very few churners.

Practice: Replace type = "histogram" with type = "density" in ggMarginal() to add marginal density curves. Then recreate the scatter plot using ratio_amount_Q4_Q1 on the horizontal axis instead of transaction_amount_12. Which version makes differences between churn groups easier to detect?

To examine these patterns more closely, we focus on two illustrative subsets: customers with very low spending and customers with moderate spending but relatively few transactions. These subsets are extracted using the subset() function as follows:

sub_churn = subset(churn,
  (transaction_amount_12 < 1000) |
  ((2000 < transaction_amount_12) & 
    (transaction_amount_12 < 3000) & 
    (transaction_count_12 < 52))
  )

ggplot(data = sub_churn, 
       aes(x = churn, 
           label = scales::percent(prop.table(after_stat(count))))) +
  geom_bar(fill = c("#F4A582", "#A8D5BA")) + 
  geom_text(stat = "count", vjust = 0.4, size = 6) 

Within this subset, the proportion of churners is noticeably higher than in the full dataset. This reinforces the earlier observation that customers with low or inconsistent usage—particularly those who spend little and use their card infrequently—are at elevated risk of churn.

From a Modeling perspective, this example highlights the importance of feature interactions: neither transaction amount nor transaction count alone is sufficient to identify these customers, but their combination is informative. From a business perspective, low-activity customers represent a natural target for re-engagement strategies, such as personalized messaging or incentives designed to encourage more frequent card usage.

Card Category and Spending Patterns

The feature card_category divides customers into four product tiers (blue, silver, gold, and platinum). The feature transaction_amount_12 measures the total amount spent over the past twelve months. Examining these features together provides insight into how card tier relates to spending behavior. Because transaction_amount_12 is continuous and card_category is categorical, density plots are a natural choice for comparing entire distributions. They highlight differences in the shape, centre, and spread of spending among card tiers.

ggplot(data = churn, 
       aes(x = transaction_amount_12, fill = card_category)) +
  geom_density(alpha = 0.5) +
  labs(x = "Total Transaction Amount (12 months)",
       y = "Density",
       fill = "Card Category") +
  scale_fill_manual(values = c("#1E90FF", "gray30", "#FFD700", "#BFC7CE"))

The density curves show a clear gradient across tiers: customers with gold and platinum cards tend to have noticeably higher transaction amounts. Their curves are shifted to the right relative to those of blue and silver cardholders. Blue card customers, who constitute more than 90 percent of the entire customer base, display a broader distribution concentrated in the lower and middle spending ranges. Although this imbalance affects how prominent each curve appears, the underlying pattern remains consistent: higher-tier cards are associated with greater spending activity.

From a business perspective, this relationship is intuitive. Premium cardholders typically receive enhanced benefits, rewards, or services, and they often belong to customer segments with higher financial engagement. Blue cardholders, by contrast, form a mixed group ranging from highly active customers to those who use their card only occasionally. These observations can guide differentiated retention and marketing strategies—for example, offering targeted upgrades to high-spending blue cardholders or designing tailored benefits to encourage greater engagement among lower-activity segments.

4.7.2 Transaction Analysis by Age

Age is an important demographic factor that can shape financial behavior, spending patterns, and overall engagement with credit products. In the churn dataset, examining how transaction activity varies across age helps determine whether younger and older customers display different usage profiles that might influence their likelihood of churn. Because individual observations form a dense cloud, we use smoothed trend lines to highlight the overall relationship between age and transaction activity.

# Total Transaction Amount for last 12 months by Age
ggplot(data = churn, 
       aes(x = age, y = transaction_amount_12, color = churn)) +
  geom_smooth(se = FALSE, linewidth = 1.1, alpha = 0.9) +
  labs(x = "Customer Age", y = "Total Transaction Amount") 

# Total Transaction Count for last 12 months by Age
ggplot(data = churn, 
       aes(x = age, y = transaction_count_12, color = churn)) +
  geom_smooth(se = FALSE, linewidth = 1.1, alpha = 0.9) +
  labs(x = "Customer Age", y = "Total Transaction Count") 

The curves indicate that both spending and transaction frequency tend to decline with age. Younger customers generally make more purchases and spend larger amounts, whereas older customers show lower and more stable levels of activity. There is a slight separation between churners and non-churners at younger ages: highly active younger customers appear somewhat more likely to churn, though the difference is modest.

These patterns emphasize that age alone does not determine churn. Instead, demographic characteristics interact with behavioral indicators to shape retention dynamics. Considering age jointly with measures of spending, engagement, and credit usage provides a more complete picture of customer behavior than any single feature on its own.

4.8 Summary of Exploratory Findings

The exploratory analysis of the churn dataset provides a multifaceted view of customer behavior and the factors associated with churn. By examining categorical features, numerical features, and their interactions, several consistent patterns emerge that are relevant for understanding and Modeling customer attrition.

Demographic characteristics show only weak associations with churn. Gender and marital status exhibit small differences in churn rates, and education and income levels display modest variation once other factors are considered. These variables may provide supporting context in Modeling but do not appear to be primary drivers of account closure. In contrast, service-related characteristics such as card category and income bracket offer clearer signals. Customers with higher-tier cards and those in higher income groups churn less often, suggesting that perceived value and financial capacity contribute to account stability.

The numerical features reveal stronger and more actionable patterns. Customers who contact customer service frequently, particularly four or more times within a year, churn at higher rates. This suggests that repeated service interactions may reflect dissatisfaction or unresolved problems. Spending activity, measured by total transaction amount over twelve months, shows a similarly strong relationship with retention. Active customers display higher and more varied spending, whereas churners typically have substantially lower transaction volumes. Declines in spending may therefore serve as early indicators of disengagement.

Credit-related features add further insight. Customers with lower credit limits are somewhat more likely to leave, while those with higher limits tend to remain active. This pattern may relate to differences in financial standing or to perceived benefits associated with higher credit availability. Tenure shows a modest but consistent relationship: customers with longer account histories are slightly less likely to churn, indicating that new customers may require additional support during the early stages of their relationship with the bank. The ratio of fourth-quarter to first-quarter spending highlights behavioral change over time. Churners often show declining spending in the later part of the year, whereas active customers tend to maintain or increase their usage. This dynamic measure is particularly useful for detecting emerging signs of disengagement.

Multivariate exploration deepens these insights. Joint analysis of transaction amount and transaction count shows that customers who both spend little and use their card infrequently have elevated churn rates. This relationship does not emerge as clearly from the individual features and demonstrates the importance of considering interactions. Combining card category with transaction amount reveals that higher-tier cardholders tend to spend more and churn less, while blue cardholders represent a more heterogeneous group that includes many low-activity accounts. Analysis across age groups shows that younger customers generally spend more and complete more transactions but experience slightly higher churn rates than older customers with comparable activity levels. This aligns with broader evidence that younger customers are more willing to switch providers.

The correlation analysis identifies a few redundant features. Available credit is determined by subtracting revolving balance from the credit limit, and the utilization ratio is calculated from revolving balance and credit limit. These relationships indicate that the derived features do not contain additional information beyond their components. For Modeling, it is often preferable to retain either the raw components or the ratio, depending on the analytical objective, rather than all three. Removing such redundant variables simplifies the feature set and reduces the risk of multicollinearity.

Overall, the exploratory analysis shows that churn is more closely associated with behavioral and financial indicators, such as spending activity, credit usage, and service interactions, than with demographic variables alone. Together, these findings provide a clear empirical foundation for the statistical inference and predictive Modeling in the chapters that follow. Several of the patterns identified here will be examined formally in Chapter 5 using hypothesis tests to assess whether these observed differences reflect wider population-level effects.

4.9 Chapter Summary and Takeaways

This chapter introduced exploratory data analysis as a practical and systematic step in the data science workflow. Using the churn dataset, we demonstrated how graphical and numerical techniques can be used to understand data structure, detect data quality issues, and develop initial hypotheses about customer behavior that guide subsequent analysis.

The analysis began with an overview of the dataset and an initial preparation step, during which missing values encoded as "unknown" were identified and resolved. Ensuring that features were clean and correctly typed provided a reliable foundation for exploration. We then examined categorical variables such as gender, education, marital status, income, and card type to characterise customer profiles, followed by numerical features related to credit limits, transaction activity, and utilization.

Several consistent relationships emerged from this exploratory analysis. Customers with smaller credit limits, higher utilization ratios, or frequent customer service interactions were more likely to churn. In contrast, customers with higher transaction amounts and lower utilization tended to remain active. These patterns illustrate how EDA can surface potentially important explanatory features before any formal Modeling is undertaken.

Multivariate exploration further revealed that churn behavior is shaped by combinations of features rather than isolated characteristics. Joint patterns in transaction amount and transaction count, associations between card category and spending, and links between age and financial activity showed how behavioral, financial, and demographic factors interact to influence customer retention.

The chapter also highlighted the importance of identifying redundant features. For example, available credit and utilization ratio were found to be deterministically related to other variables in the dataset. recognizing such redundancy simplifies later Modeling steps and improves interpretability.

Taken together, the examples in this chapter illustrate three guiding principles for effective exploratory analysis. First, graphical and numerical summaries are most informative when used together. Second, careful attention to data quality, including missing values and redundant features, is essential for reliable conclusions. Third, exploratory analysis is not merely descriptive. It provides direction for statistical inference and predictive Modeling by revealing patterns that merit further investigation.

The insights developed here form the empirical foundation for the next stage of the analysis. Chapter 5 introduces the tools of statistical inference, which allow us to formalise uncertainty, quantify relationships, and test hypotheses suggested by the exploratory findings.

4.10 Exercises

These exercises reinforce the main ideas of the chapter, progressing from conceptual questions to applied analysis with the churn_mlc and bank datasets, and concluding with integrative challenges.

Conceptual Questions

  1. Why is exploratory data analysis essential before building predictive models? What risks might arise if this step is skipped?

  2. If a feature does not show a clear relationship with the target during EDA, should it be excluded from modeling? Consider potential interactions, hidden effects, and the role of feature selection.

  3. What does it mean for two features to be correlated? Explain the direction and strength of correlation, and contrast correlation with causation using an example.

  4. How can correlated predictors be detected and addressed during EDA? Describe how this improves model performance and interpretability.

  5. What are the potential consequences of including highly correlated features in a predictive model? Discuss the effects on accuracy, interpretability, and model stability.

  6. Is it always advisable to remove one of two correlated predictors? Under what circumstances might keeping both be justified?

  7. For each of the following methods—histograms, box plots, density plots, scatter plots, summary statistics, correlation matrices, contingency tables, and bar plots—indicate whether it applies to categorical data, numerical data, or both. Briefly describe its role in EDA.

  8. A bank observes that customers with high credit utilization and frequent customer service interactions are more likely to close their accounts. What actions could the bank take in response, and how might this guide retention strategy?

  9. Suppose several pairs of features in a dataset have high correlation (for example, \(r > 0.9\)). How would you handle this to ensure robust and interpretable modeling?

  10. Why is it important to consider both statistical and practical relevance when evaluating correlations? Provide an example of a statistically strong but practically weak correlation.

  11. Why is it important to investigate multivariate relationships in EDA? Describe a case where an interaction between two features reveals a pattern that univariate analysis would miss.

  12. How does data visualization support EDA? Provide two specific examples where visual tools reveal insights that summary statistics might obscure.

  13. Suppose you discover that customers with both high credit utilization and frequent service calls are more likely to churn. What business strategies might be informed by this finding?

  14. What are some common causes of outliers in data? How would you decide whether to retain, modify, or exclude an outlier?

  15. Why is it important to address missing values during EDA? Discuss strategies for handling missing data and when each might be appropriate.

Hands-On Practice: Exploring the churn_mlc Dataset

The churn_mlc dataset from the R package liver contains information on customer behavior and service usage in a telecommunications company. The goal is to study patterns associated with customer churn, defined as whether a customer leaves the service. The dataset was introduced earlier in this chapter and will be used again in later chapters, including the classification case study in Chapter 10. Additional details are available at https://cran.r-project.org/web/packages/liver/refman/liver.html. To load and inspect the dataset:

library(liver)

data(churn_mlc)
str(churn_mlc)
  1. Summarize the structure of the dataset and identify feature types. What information does this provide about the nature of the data?

  2. Examine the target feature churn. What proportion of customers have left the service?

  3. Explore the relationship between intl_plan and churn. Use bar plots and contingency tables to describe what you find.

  4. Analyze the distribution of customer_calls. Which values occur most frequently? What might this indicate about customer engagement or dissatisfaction?

  5. Investigate whether customers with higher day_mins are more likely to churn. Use box plots or density plots to support your reasoning.

  6. Compute the correlation matrix for all numerical features. Which features show strong relationships, and which appear independent?

  7. Summarize your main EDA findings. What patterns emerge that could be relevant for predicting churn?

  8. Reflect on business implications. Which customer behaviors appear most strongly associated with churn, and how could these insights inform a retention strategy?

Hands-On Practice: Exploring the bank Dataset

The bank dataset from the R package liver contains data on direct marketing campaigns of a Portuguese bank. The objective is to predict whether a client subscribes to a term deposit. This dataset will be used for classification in the case study of Chapter 12. More details are available at https://rdrr.io/cran/liver/man/bank.html. To load and inspect the dataset:

library(liver)

data(bank)
str(bank)
  1. Summarize the structure and feature types. What does this reveal about the dataset?

  2. Plot the target feature deposit. What proportion of clients subscribed to a term deposit?

  3. Explore the features default, housing, and loan using bar plots and contingency tables. What patterns emerge?

  4. Visualize the distributions of numerical features using histograms and box plots. Note any skewness or unusual observations.

  5. Identify outliers among numerical features. What strategies would you consider for handling them?

  6. Compute and visualize correlations among numerical features. Which features are highly correlated, and how might this influence modeling decisions?

  7. Summarize your main EDA observations. How would you present these results in a report?

  8. Interpret your findings in business terms. What actionable conclusions could the bank draw from these patterns?

  9. Examine whether higher values of campaign (number of contacts) relate to greater subscription rates. Visualize and interpret.

  10. Propose one new feature that could improve model performance based on your EDA findings.

  11. Investigate subscription rates by month. Are some months more successful than others?

  12. Explore how job relates to deposit. Which occupational groups have higher success rates?

  13. Analyze the joint impact of education and job on subscription outcomes. What patterns do you observe?

  14. Examine whether the duration of the last contact influences the likelihood of a positive outcome.

  15. Compare success rates across campaigns. What strategies might these differences suggest?

Challenge Problems

  1. Create a concise one- or two-plot summary of an EDA finding from the bank dataset. Focus on clarity and accessibility for a non-technical audience, using brief annotations to explain the insight.

  2. Using the adult dataset, identify a subgroup likely to earn over $50K. Describe their characteristics and how you uncovered them through EDA.

  3. A feature appears weakly related to the target in univariate plots. Under what conditions could it still improve model accuracy?

  4. Examine whether the proportion of deposit outcomes differs by marital status or job category. What hypotheses could you draw from these differences?

  5. Using the adult dataset, identify predictors that may not contribute meaningfully to modeling. Justify your selections with evidence from EDA.

Self-Reflection

Reflect on what you have learned in this chapter. Consider the following questions as a guide.

  1. How has exploratory data analysis changed your understanding of the dataset before modeling?

  2. Which visualizations or summary techniques did you find most effective for revealing structure or patterns?

  3. When exploring data, how do you balance curiosity-driven discovery with methodological discipline?

  4. How can EDA findings influence later stages of the data science workflow, such as feature engineering, model selection, or evaluation?

  5. In what ways did EDA help you detect issues of data quality, such as missing values or redundancy?