Preface

Data science integrates statistical reasoning, machine learning techniques, and computational tools to transform raw data into insight and informed decisions. From predictive models used in finance and healthcare to modern machine learning systems underlying generative AI applications, data-driven methods increasingly shape how complex problems are understood and addressed. As these techniques become central across disciplines and industries, the need for accessible yet rigorous educational resources has never been greater.

Data Science Foundations and Machine Learning with R: From Data to Decisions provides a hands-on introduction to this field. Designed for readers with no prior experience in analytics, programming, or formal statistics, the book offers a clear and structured pathway into data science by combining foundational statistical concepts with modern machine learning methods. Emphasis is placed on conceptual understanding, practical implementation, and reproducible workflows using R.

The motivation for this book emerged from a recurring challenge encountered in the classroom. Many students were eager to learn data science and machine learning, yet struggled to find resources that were simultaneously accessible, conceptually rigorous, and practically oriented. Existing materials often emphasized either theoretical abstraction or software mechanics in isolation, leaving beginners uncertain about how methods connect to real analytical problems. This book was written to address that gap. It is intended for newcomers to data science and machine learning, including undergraduate and graduate students, professionals transitioning into data-driven roles, and researchers seeking a practical introduction. Drawing on my experience teaching data science at the university level, the exposition adopts an applied, example-driven approach that integrates statistical foundations with hands-on modeling. The goal is to lower the barrier to entry without sacrificing depth, academic rigor, or relevance to real-world decision-making.

To support a smooth learning trajectory for readers with diverse backgrounds, the book adopts active learning strategies throughout. Concepts are introduced progressively and reinforced through illustrative examples, guided coding tasks, and applied problem-solving activities embedded directly within the main text. Newly introduced ideas are followed by in-text boxes labeled Practice, which invite readers to pause and apply concepts immediately in R as they are encountered. Each chapter also concludes with a case study that applies the chapter’s core ideas to a realistic data-driven scenario, bridging the gap between methodological concepts and real-world application. In addition, every chapter includes a substantial set of end-of-chapter exercises that consolidate learning through more extended implementation. Together, the in-text Practice boxes, case studies, and exercises form a coherent learning framework that steadily develops both conceptual understanding and practical proficiency.

Why This Book?

This book was written to provide a clear, structured, and application-focused introduction to data science and machine learning using R. While data science continues to evolve rapidly, many existing textbooks either emphasize theoretical development without sufficient practical guidance or focus narrowly on software usage without establishing conceptual foundations. This book aims to bridge that gap by integrating statistical modeling, machine learning techniques, and computational tools within a coherent learning framework.

Unlike many textbooks that assume prior experience with programming or analytics, this book is designed to be accessible to beginners while remaining academically rigorous. Core concepts are introduced gradually and reinforced through real-world examples, guided exercises, and annotated R code. This approach enables readers to develop theoretical understanding alongside practical fluency from the outset, fostering confidence in applying methods to realistic data-driven problems.

R is a widely adopted, open-source language with a rich ecosystem of packages for statistical computing, visualization, and reproducible analysis. This book emphasizes its practical use across academic, industrial, and research settings. For readers who prefer Python, a companion volume titled Data Science Foundations and Machine Learning with Python: From Data to Decisions is available from the same publisher. Further information about both books can be found at https://datasciencebook.ai.

Who Should Read This Book?

This book is intended for readers seeking a clear and practical introduction to data science and machine learning, particularly those who are new to the field. It is designed to support a broad audience, ranging from students encountering data analysis for the first time to professionals aiming to incorporate data-driven reasoning into their work.

The book is especially well suited for undergraduate students in programs that emphasize quantitative reasoning, including economics, business administration, business economics (with specializations such as finance or organizational economics), communication science, psychology, and STEM disciplines. It is also appropriate for students in Master’s programs in business analytics, econometrics, and the social sciences, where applied data analysis and modeling play a central role.

Beyond academic audiences, the book is suitable for professionals and researchers who wish to develop practical data science skills without assuming prior training in programming or machine learning. Its structured, example-driven approach makes it appropriate for self-study as well as for use in taught courses at both undergraduate and graduate levels. The material has been developed and refined through use as a reference text in a range of courses on data analytics, machine learning, data wrangling, and business analytics across several BSc and MSc programs, including at the University of Amsterdam.

The book is equally valuable for continuing education and professional development, offering an accessible yet rigorous foundation for readers seeking to strengthen their analytical skills in a rapidly evolving data landscape.

Skills You Will Gain

This book takes you on a practical and progressive journey into data science and machine learning using R, structured around the Data Science Workflow (Figure 1). The workflow emphasizes how to move from a clearly defined problem to a data-driven solution through statistical analysis and machine learning. Each chapter supports both conceptual understanding and applied skill development, from formulating analytical questions and preparing data to building, evaluating, and interpreting models.

By the end of this book, you will be able to:

  • Identify and explain the key stages of a data science project, from problem formulation and data preparation to modeling and evaluation;

  • Apply core R programming concepts, including data structures, control flow, and functions, to explore, prepare, and analyze data;

  • Prepare and transform raw datasets by addressing missing values, outliers, and categorical variables using established best practices;

  • Explore and interpret data using descriptive statistics and effective visualizations;

  • Build, tune, and interpret machine learning models for classification, regression, and clustering, including methods such as k-nearest neighbors, Naive Bayes, decision trees, neural networks, and K-means clustering;

  • Evaluate and compare model performance using appropriate metrics tailored to different analytical tasks;

  • Apply and adapt data science techniques to real-world problems in domains such as marketing, finance, operations, and the social sciences.

Throughout the book, these skills are reinforced through illustrative examples, annotated R code, and practice-oriented exercises. Each chapter concludes with a case study that synthesizes the main concepts and demonstrates how methods can be applied in realistic settings. By the end of the book, you will not only be familiar with data science tools but also be able to apply them critically, responsibly, and effectively in practice.

Requirements and Expectations

This book assumes no prior experience with programming, statistics, or data science. It is designed to be accessible to beginners while maintaining academic rigor, with core concepts introduced gradually and reinforced through real-world examples, guided exercises, and annotated R code.

The material has been developed and refined through teaching at the undergraduate level, particularly for students in econometrics, social sciences, and the natural sciences. Many of these students begin with little or no background in programming, machine learning, or formal statistics. This teaching experience has directly informed the structure, pacing, and level of exposition adopted throughout the book.

Readers are expected to have only basic familiarity with using a computer and installing software. No prior programming experience is assumed, as all necessary R concepts are introduced from first principles. While selected statistical ideas are discussed later in the book, particularly in Chapter 5 (Statistical Inference and Hypothesis Testing), no formal background in statistics is required.

Successful engagement with the material does, however, require a willingness to learn actively. Readers are encouraged to work through the in-text Practice boxes, experiment with code, and complete the end-of-chapter exercises, as hands-on problem-solving is central to the learning approach adopted throughout the book.

All tools and software used in this book are freely available, and detailed installation instructions are provided in Chapter 1 (R Foundations for Data Science). There are no requirements regarding a specific operating system or computer architecture. It is assumed only that readers have access to a computer capable of running R and RStudio, along with an internet connection for downloading packages and datasets.

Structure of This Book

This book is structured around the Data Science Workflow (Figure 1), an iterative framework that emphasizes how data science projects progress from problem formulation to data-driven solutions through statistical analysis and machine learning. The journey begins in Chapter 1 (R Foundations for Data Science), where readers install R, become familiar with its syntax, and work with essential data structures. From there, each chapter builds on the previous one, combining conceptual development with hands-on coding and real-world case studies.

Figure 1: The Data Science Workflow is an iterative framework for structuring data science and machine learning projects. Inspired by the CRISP-DM model (Cross-Industry Standard Process for Data Mining), it supports systematic problem-solving and continuous refinement.

The Data Science Workflow, introduced in Chapter 2 (The Data Science Workflow and the Role of Machine Learning) and illustrated in Figure 1, consists of seven key stages:

  1. Problem Understanding: Defining the analytical objective and broader context (Chapter 2, The Data Science Workflow and the Role of Machine Learning).

  2. Data Preparation: Cleaning, transforming, and organizing raw data (Chapter 3, Data Preparation in Practice: From Raw Data to Insight).

  3. Exploratory Data Analysis (EDA): Visualizing and summarizing data to uncover patterns and relationships (Chapter 4, Exploratory Data Analysis).

  4. Data Setup for Modeling: Selecting features, partitioning datasets, and scaling variables (Chapter 6, Data Setup for Modeling).

  5. Modeling: Building and training predictive models using a range of machine learning algorithms (Chapter 7, Classification Using k-Nearest Neighbors, through Chapter 13, Clustering for Insight: Segmenting Data Without Labels).

  6. Evaluation: Assessing model performance using appropriate metrics and validation strategies (Chapter 8, Model Evaluation and Performance Assessment).

  7. Deployment: Translating analytical insights into real-world decisions and applications.

The sequence of chapters mirrors these stages, supporting a gradual progression from foundational concepts to applied modeling. Chapter 5 (Statistical Inference and Hypothesis Testing) complements this progression by providing a focused introduction to key statistical ideas, such as confidence intervals and hypothesis testing, which underpin critical reasoning, uncertainty assessment, and model interpretation.
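
As a rough illustration of how the data setup, modeling, and evaluation stages fit together, the following minimal sketch uses only base R. It is not taken from the book's chapters: the built-in mtcars dataset, the 80/20 split, the choice of predictors, and the linear regression model are all assumptions made for the sake of a short, self-contained example.

```r
# Illustrative only: built-in mtcars data, not a dataset from the book.
set.seed(42)  # make the random split reproducible

# Data setup: hold out roughly 20% of the rows as a test set
n        <- nrow(mtcars)
test_idx <- sample(n, size = round(0.2 * n))
train    <- mtcars[-test_idx, ]
test     <- mtcars[test_idx, ]

# Data setup: scale predictors using training-set statistics only,
# so that no information from the test set leaks into the model
predictors <- c("wt", "hp")
centers <- colMeans(train[, predictors])
scales  <- apply(train[, predictors], 2, sd)
train[, predictors] <- scale(train[, predictors], center = centers, scale = scales)
test[, predictors]  <- scale(test[, predictors],  center = centers, scale = scales)

# Modeling: fit a simple linear regression on the training set
fit <- lm(mpg ~ wt + hp, data = train)

# Evaluation: root mean squared error on the held-out test set
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

Computing the scaling parameters on the training set and reusing them for the test set keeps the evaluation honest; the same pattern applies whichever model is used in the modeling stage.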

To bridge theory and practice, newly introduced ideas throughout each chapter are accompanied by illustrative examples and in-text boxes labeled Practice, which invite readers to pause and apply concepts immediately in R as they are encountered. Each chapter then concludes with a case study that applies its core ideas to a realistic data-driven problem, demonstrating the Data Science Workflow in action through data preparation, model development, evaluation, and interpretation using real datasets. The datasets used throughout the book, summarized in Table 1, are made available through the liver package, enabling readers to reproduce analyses, complete exercises, and experiment with methods in a consistent environment. Each chapter also includes a set of exercises designed to consolidate learning, ranging from conceptual questions to hands-on coding tasks and applied problem-solving challenges, together reinforcing key ideas and building confidence in applying R for data science.

How to Use This Book

This book is designed for self-study, classroom instruction, and professional learning. Readers may work through the chapters sequentially to follow a structured learning path or consult individual chapters and sections to focus on specific skills or concepts as needed. Regardless of the mode of use, active engagement with the material is essential to achieving the learning objectives of the book.

Readers are encouraged to run the R code examples interactively, experiment with modifications, and explore alternative parameter settings or datasets to reinforce key ideas through hands-on experience. In particular, readers should actively engage with the in-text boxes labeled Practice, which appear immediately after new concepts are introduced and are intended to prompt immediate application and reflection. Each chapter also includes exercises that range from conceptual questions to applied coding tasks, providing further opportunities to deepen understanding and develop analytical fluency. End-of-chapter case studies offer a comprehensive view of the Data Science Workflow in practice, guiding readers through data preparation, modeling, evaluation, and interpretation in realistic analytical contexts.

The book also supports collaborative learning. Working through exercises, Practice boxes, and case studies in pairs or small groups can stimulate discussion, deepen conceptual understanding, and expose readers to diverse analytical perspectives, particularly in classroom and workshop settings.

Using This Book for Teaching

This book is well suited for introductory courses in data science and machine learning, as well as for professional training programs. Its structured progression, emphasis on applied learning, and extensive collection of exercises make it a flexible resource for instructors across a wide range of educational settings.

To support systematic skill development, the book includes more than 500 exercises organized across three levels: conceptual questions that reinforce key ideas, applied tasks based on real-world data, and advanced problems that deepen understanding of machine learning methods. This structure allows instructors to adapt the material to different course levels and learning objectives. Each chapter also features a case study that walks students through the complete Data Science Workflow, from data preparation and modeling to evaluation and interpretation, demonstrating how theoretical concepts translate into practical analysis.

The book has been used as a primary reference in undergraduate and graduate courses on data analytics, machine learning, and data wrangling, including within several BSc and MSc programs at the University of Amsterdam. It is equally suitable for courses in applied statistics, econometrics, business analytics, and quantitative methods across programs in the social sciences, business, and STEM disciplines.

Instructors adopting this book have access to a set of supporting teaching materials, including lecture slides, data science projects for practical sessions, and assessment resources. These materials are designed to facilitate course preparation and to support consistent, engaging instruction. Further information about instructor resources is available on the book’s homepage, https://datasciencebook.ai.

Datasets Used in This Book

This book integrates real-world datasets to support its applied, hands-on approach to learning data science and machine learning. These datasets are used throughout the chapters to illustrate key concepts, demonstrate analytical techniques, and underpin comprehensive case studies. Table 1 summarizes the core datasets featured in the book, most of which are included in the liver package. All datasets provided by liver can be accessed directly in R, enabling seamless replication of examples, case studies, and exercises. This design allows readers to focus on methodological understanding and practical implementation without additional data preparation overhead.

Table 1: Overview of datasets used for case studies in different chapters. All datasets are included in the R package liver, except the diamonds dataset, which is available in the ggplot2 package.

Name          Description                                                  Chapters
churn         Customer churn in the credit card industry.                  4, 5, 6, 7, 8
bank          Direct marketing data from a Portuguese bank.                6, 7, 12
adult         US Census data for income prediction.                        3, 11
risk          Credit risk dataset.                                         9
churn_mlc     Customer churn dataset from MLC++ machine learning.          4, 10
churn_tel     Customer churn dataset from a telecommunications company.    11, 13
marketing     Marketing campaign performance data.                         10
house         House price prediction dataset.                              10
diamonds      Diamond pricing dataset.                                     3
cereal        Nutritional information for 77 breakfast cereals.            13
caravan       Customer data for insurance purchase prediction.             4
insurance     Insurance policyholder data.                                 11
house_price   House price data from Ames, Iowa.                            10
drug          Drug consumption dataset.                                    3
red_wines     Red wine quality dataset.                                    7
white_wines   White wine quality dataset.                                  13
gapminder     Global development indicators from 1950 to 2019.             4

These datasets were selected to expose readers to a broad range of real-world challenges spanning marketing, finance, customer analytics, and predictive modeling. They appear throughout the book in illustrative examples, annotated code, and comprehensive case studies that follow the full Data Science Workflow. All datasets from liver can be loaded directly in R using the data() function (for example, data(churn)). Documentation and references to the original data sources are available through the package reference page at https://cran.r-project.org/web/packages/liver/refman/liver.html. Beyond the datasets listed in Table 1, the liver package includes additional datasets that appear in end-of-chapter exercises, providing further opportunities to practice data exploration, modeling, and evaluation across a variety of applied contexts.
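
A minimal sketch of loading one of these datasets is shown below. It assumes the liver package is installed from CRAN and that the churn dataset is available under that name, as listed in Table 1; the guard around the call and the variable name liver_available are illustrative additions so that the snippet runs without error even when the package is missing.

```r
# Install the package once (requires an internet connection):
# install.packages("liver")

# Check whether liver is available before trying to load data from it
liver_available <- requireNamespace("liver", quietly = TRUE)

if (liver_available) {
  data(churn, package = "liver")    # copy the dataset into the workspace
  if (exists("churn")) str(churn)   # compact overview of its variables
} else {
  message("Package 'liver' is not installed; run install.packages(\"liver\") first.")
}
```

Once the package has been attached with library(liver), the shorter form data(churn) mentioned above is sufficient.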

Online Resources

Additional resources supporting this book are available online. The book’s companion website, https://datasciencebook.ai, provides information about the book, updates, and access to supplementary materials for instructors and readers. The website also includes documentation and additional resources related to the Python edition of the book, Data Science Foundations and Machine Learning with Python: From Data to Decisions, offering guidance for readers interested in working with the Python-based version of the material.

The book is also supported by the R package liver, which contains the datasets used throughout the chapters, exercises, and case studies. The package is freely available from CRAN at https://cran.r-project.org/web/packages/liver/index.html, along with documentation describing each dataset and its original source. These online resources are intended to facilitate reproducibility, support hands-on learning, and streamline the use of the book in both self-study and teaching contexts.

Acknowledgments

Writing this book has been both a challenging and rewarding journey, and I am deeply grateful to all those who supported and inspired me along the way. First and foremost, I thank my wife, Pariya, for her constant support, patience, and encouragement throughout this process. I am also sincerely grateful to my family, especially my mother and older brother, for their unwavering belief in me.

This book would not have taken shape without the contributions of my collaborators. I am particularly thankful to Dr. Kevin Burke for his valuable input in shaping the structure of the book. I also wish to acknowledge Dr. Jeroen van Raak and Dr. Julien Rossi, who enthusiastically collaborated with me on the development of the Python edition of this book. I am especially indebted to Eva Hiripi at Springer for her steadfast support and for encouraging me to pursue this project from the outset.

My colleagues in the Business Analytics Section at the University of Amsterdam provided thoughtful feedback and generous support during the writing process. I am particularly grateful to Prof. Ilker Birbil, Prof. Dick den Hertog, Prof. Marc Salomon, Dr. Marit Schoonhoven, Dr. Stevan Rudinac, Dr. Rob Goedhart, Prof. Jeroen de Mast, Prof. Joaquim Gromicho, Prof. Peter Kroos, Dr. Chintan Amrit, Dr. Inez Zwetsloot, Dr. Alex Kuiper, Dr. Bart Lameijer, Dr. Jannis Kurtz, Dr. Guido van Capelleveen, and Dr. Yeqiu Zheng. I also thank my PhD students, Lucas Vogels and Elias Dubbeldam, for their research insights and continued collaboration.

I would further like to acknowledge my former colleagues and co-authors, Dr. Khodakaram Salimifard, Sara Saadatmand, and Dr. Florian Böing-Messing, for their continued academic partnership. Finally, I am grateful to the students of the courses Data Wrangling and Data Analytics: Machine Learning at the University of Amsterdam. Their feedback has helped refine the material in meaningful ways, and I am particularly thankful to John Gatev for his thoughtful and constructive comments.

To everyone who contributed to this book, your encouragement, feedback, and collaboration have been invaluable.

All models are wrong, but some are useful.

— George Box

Reza Mohammadi
Amsterdam, Netherlands
January 2026