All models are wrong, but some are useful.
— George Box
Many students encounter data science through fragments: a statistical formula in one course, an R command in another, and a machine learning algorithm presented as a ready-made tool. They may learn how to run code without fully understanding the method, or they may study concepts without seeing how those concepts guide real data analysis. This separation between statistical reasoning, programming, and machine learning can make the field appear more technical and less coherent than it needs to be. An introductory textbook should therefore do more than present methods one by one. It should help readers understand how analytical questions, data preparation, statistical thinking, model building, evaluation, and interpretation fit together within a reproducible workflow.
Data Science Foundations and Machine Learning with R: From Data to Decisions was written to support this integrated view of data science education. The book is designed for readers with no prior experience in analytics, programming, or formal statistics, including students, professionals, and researchers seeking a practical introduction to data science and machine learning. Drawing on classroom experience, it combines foundational statistical concepts, modern machine learning methods, reproducible R workflows, and realistic applications. Concepts are introduced progressively and reinforced through examples and applied tasks, with the aim of developing both conceptual understanding and practical fluency. The central goal is to make data science accessible to beginners while preserving the rigor needed to apply methods critically, reproducibly, and responsibly.
The book addresses this educational challenge by treating data science as an integrated process rather than as a sequence of isolated topics. Statistical reasoning, programming, visualization, data preparation, and machine learning are introduced as complementary parts of applied analysis. This organization allows readers to see not only how individual methods work, but also how they contribute to complete data-driven investigations.
The book differs from purely theoretical introductions by placing methods in practical analytical contexts, and it differs from software manuals by treating R as a means for statistical reasoning rather than as an end in itself. Each chapter introduces concepts gradually and illustrates them through reproducible R examples and applied analyses. This structure is intended to help readers move from following examples to carrying out their own analyses with greater independence.
R was chosen because of its central role in statistical computing, data visualization, reproducible research, and applied data science education. Its open-source ecosystem allows readers to work with professional tools while developing transparent and reproducible workflows. Throughout the book, R is used not only to implement methods, but also to support interpretation, model evaluation, and communication of results.
The scope of the book is introductory but broad. It covers the foundations of data science, data preparation, exploratory data analysis, statistical inference, supervised learning, model evaluation, regression models, classification methods, tree-based models, neural networks, and clustering. The aim is not to provide exhaustive treatment of every method, but to build a principled foundation that prepares readers to apply data science and machine learning methods critically in academic, professional, and research settings.
The primary audience for this book is undergraduate and early graduate students taking an introductory course in data science, machine learning, business analytics, applied statistics, econometrics, or quantitative methods. The book is particularly suitable for programs in which students need to develop practical data analysis skills while also understanding the statistical and computational ideas behind the methods they use.
The book is designed for readers who are new to data science, programming, and machine learning. No prior experience with R is assumed, and the necessary programming concepts are introduced from the beginning. A formal background in statistics is also not required, although readers are expected to engage with statistical ideas as they are developed throughout the book, especially in the chapters on exploratory data analysis, statistical inference, regression, and model evaluation.
The material is also suitable for professionals, researchers, and self-study readers who wish to build a practical foundation in data science and machine learning using R. Readers from fields such as business, economics, the social sciences, communication science, psychology, health sciences, and STEM disciplines can use the book to develop reproducible analytical workflows and to apply data-driven methods to realistic problems.
Successful use of the book requires active engagement rather than passive reading. Readers are encouraged to run the R code, inspect intermediate results, modify examples, complete the in-text Practice tasks, and work through the end-of-chapter exercises. Basic computer literacy and access to a computer capable of running R and RStudio are sufficient. Installation instructions and the required software setup are introduced in Chapter 1 (R Foundations for Data Science).
This book is designed to help readers develop both conceptual understanding and practical skill in data science and machine learning with R. The goals below reflect the progression of the book, from formulating analytical questions to preparing data, building models, evaluating results, and communicating findings.
By the end of the book, readers will be able to:
formulate data science questions and relate them to the main stages of the Data Science Workflow, from problem understanding to evaluation and interpretation;
use R to import, inspect, transform, visualize, and analyze data within a reproducible workflow;
prepare and explore datasets by identifying variable types, addressing missing values and outliers, transforming variables, and using graphical and numerical summaries;
apply core ideas from statistical inference and model evaluation to assess uncertainty, compare results, and judge model performance in context;
build, tune, and evaluate supervised and unsupervised learning models, including regression models, classification methods, tree-based models, neural networks, and clustering methods;
interpret analytical results critically and communicate findings in a way that supports evidence-based decisions.
Together, these goals emphasize not only technical implementation, but also interpretation, evaluation, and responsible use of data science methods. The aim is to help readers understand when, why, and how data science tools should be applied.
The book is designed to support active learning in self-study, classroom instruction, and professional training. Readers may work through the chapters sequentially or consult selected chapters for specific topics, but the material is most effective when studied interactively. Concepts are introduced through explanation and examples, then reinforced through annotated R code, in-text Practice tasks, case studies, and end-of-chapter exercises.
Readers are encouraged to run the code, inspect intermediate results, modify examples, and compare alternative choices such as different variables, parameter settings, or models. The case studies show how individual methods contribute to complete analyses, while the exercises provide further opportunities to consolidate understanding and develop fluency in R. This approach emphasizes reproducibility throughout: readers learn not only how to obtain results, but also how to document, check, and interpret the analytical steps that produced them.
The book is organized around a Data Science Workflow, developed in Chapter 2 (The Data Science Workflow and the Role of Machine Learning) and revisited throughout the text. This workflow provides a practical structure for moving from problem formulation and data preparation to modeling, evaluation, and interpretation. Rather than presenting data science as a collection of separate techniques, the chapters show how statistical reasoning, computation, and machine learning contribute to a coherent analytical process.
Chapters 1 (R Foundations for Data Science) and 2 (The Data Science Workflow and the Role of Machine Learning) establish the foundations. Chapter 1 introduces R, RStudio, basic programming concepts, data structures, data import, and reproducible reporting. Chapter 2 introduces the main ideas of data science and machine learning, including the Data Science Workflow that guides the rest of the book.
Chapters 3 (Data Preparation in Practice: From Raw Data to Insight) to 6 (Data Setup for Modeling) prepare readers for modeling. Chapter 3 examines data preparation, including variable types, missing values, outliers, and transformations. Chapter 4 (Exploratory Data Analysis) introduces numerical summaries and visualization as tools for exploring data. Chapter 5 (Statistical Inference and Hypothesis Testing) develops the core ideas of inference and hypothesis testing, which support uncertainty assessment and critical interpretation. Chapter 6 then covers the steps that precede model fitting, including partitioning, validation, preprocessing, leakage prevention, and class imbalance.
Chapters 7 (Classification Using k-Nearest Neighbors) to 11 (Generalized Linear Models for Binary and Count Outcomes) introduce supervised learning and regression-based modeling. These chapters cover classification with k-nearest neighbors, model evaluation, Naive Bayes classification, regression analysis for continuous outcomes, and generalized regression models for binary and count outcomes. Together, they show how predictive models are built, assessed, compared, and interpreted in applied settings.
Chapters 12 (Decision Trees and Random Forests) to 14 (Clustering for Insight: Segmenting Data Without Labels) extend the modeling toolkit. Chapter 12 introduces decision trees and random forests, Chapter 13 (Neural Networks for Supervised Learning) presents neural networks for supervised learning, and Chapter 14 introduces clustering as an unsupervised learning approach. These chapters broaden the methodological scope while continuing to emphasize practical implementation, evaluation, and interpretation.
Each chapter combines conceptual explanation with reproducible R examples, applied tasks, a realistic case study, and end-of-chapter exercises. This structure allows readers to encounter each method first as an idea, then as an implementation, and finally as part of a complete analysis. The emphasis throughout is on developing reproducible workflows and using data science methods critically in context.
This book can be used as a primary or supplementary text in introductory courses on data science, machine learning, business analytics, applied statistics, econometrics, and quantitative methods. Its structure supports use in undergraduate courses, early graduate courses, professional training programs, and self-contained practical modules.
The chapters are designed to support flexible teaching. Instructors may follow the full sequence of the book or select chapters according to the aims of a specific course. The book includes an extensive collection of exercises, ranging from conceptual questions to applied coding tasks and more advanced problems. Each chapter also includes a case study that can be used for classroom discussion, practical sessions, assignments, or project-based learning.
Additional teaching materials, including lecture slides, practical data science projects, and assessment resources, are made available to support course preparation and delivery. Further information about instructor resources and updates is available at the book’s companion website: https://datasciencebook.ai.
The book uses real-world datasets to support hands-on learning and reproducible analysis. These datasets are used throughout the chapters to illustrate concepts, demonstrate R workflows, and develop case studies in areas such as customer analytics, finance, marketing, healthcare, housing, and classification under class imbalance. Most datasets used in the book are provided through the liver package, which allows readers to reproduce examples, complete exercises, and extend the analyses without needing to locate external data sources before beginning the work.
The liver package contains the datasets used in examples, exercises, and case studies, such as churn, bank, adult, loan, purchase_intention, doctor_visits, wholesale_customers, creditcard_fraud, red_wines, and white_wines. The package also includes supporting functions used throughout the book for common data science tasks. Datasets can be loaded directly in R using the data() function, and their documentation provides descriptions of the variables and references to the original sources where applicable.
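As a minimal sketch of loading one of these datasets, the snippet below uses the churn dataset named above. It assumes the liver package has already been installed from CRAN, and the guard around the call is an illustrative choice rather than something the book prescribes.

```r
# Assumption: liver has been installed, e.g. with install.packages("liver").
if (requireNamespace("liver", quietly = TRUE)) {
  # Load the churn dataset from liver into the current environment.
  data(churn, package = "liver")

  # Show the variables and their types.
  str(churn)
} else {
  message("Package 'liver' is not installed; run install.packages(\"liver\") first.")
}
```

In an interactive session, the help page for a dataset (for example `?liver::churn`) shows the variable descriptions and source references mentioned above.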
Additional resources are available through the book’s companion website, https://datasciencebook.ai. The website provides updates, supplementary materials, and information about resources for instructors and readers. Documentation for the liver package is available through CRAN at https://cran.r-project.org/web/packages/liver/index.html, including dataset descriptions and reference material.
For readers interested in a Python-based treatment of the same learning objectives, a companion volume, Data Science Foundations and Machine Learning with Python: From Data to Decisions, is available from the same publisher. Further information about both books and their supporting resources is provided on the companion website.
Writing this book has been a demanding and rewarding process, and I am grateful to the many people who supported its development. First, I thank my wife, Pariya, for her patience, encouragement, and constant support throughout this project. I am also deeply grateful to my family, especially my mother and older brother, for their belief in me.
I am especially thankful to Eva Hiripi at Springer for her support and encouragement from the early stages of this book. I also thank Dr. Kevin Burke for his valuable input on the structure of the book, and Dr. Jeroen van Raak and Dr. Julien Rossi for their collaboration on the Python edition.
I am grateful to my colleagues in the Business Analytics Section at the University of Amsterdam for their feedback, encouragement, and academic support during the writing process. In particular, I thank Prof. Ilker Birbil, Prof. Dick den Hertog, Prof. Marc Salomon, Dr. Marit Schoonhoven, Dr. Stevan Rudinac, Dr. Rob Goedhart, Prof. Jeroen de Mast, Prof. Joaquim Gromicho, Prof. Peter Kroos, Dr. Chintan Amrit, Dr. Inez Zwetsloot, Dr. Alex Kuiper, Dr. Bart Lameijer, Dr. Jannis Kurtz, Dr. Guido van Capelleveen, and Dr. Yeqiu Zheng. I also thank my PhD students, Lucas Vogels and Elias Dubbeldam, for their research insights and continued collaboration.
I further acknowledge my former colleagues and co-authors, Dr. Khodakaram Salimifard, Sara Saadatmand, and Dr. Florian Böing-Messing, for their continued academic partnership. Finally, I am grateful to the students of the courses Data Wrangling and Data Analytics: Machine Learning at the University of Amsterdam. Their questions, feedback, and engagement helped refine the material in meaningful ways. I am particularly thankful to John Gatev for his thoughtful and constructive comments.
Reza Mohammadi
Amsterdam, Netherlands
May 2026