Data science, with big data analytics as its most recent subdiscipline, is a fashionable research area and teaching subject. Many open-access libraries and source code make it possible to experiment with the available algorithms and obtain numerical results that can be evaluated with metrics appropriate to those algorithms. A favorite topic of undergraduate and master’s dissertations is the application of algorithms in various practice areas. However, students and researchers sometimes lack insight into the operation of the algorithms and the underlying mathematical theories. Therefore, this book is apt for courses that introduce the fundamentals of data science/big data analytics at the graduate level.
Reading and understanding the book requires sound knowledge of multivariable real analysis, discrete probability theory, and linear and matrix algebra.
The book presents the fundamental mathematical approaches utilized in big data analytics, chapter by chapter. In each chapter, a formal description of the relevant mathematical tools is followed by an application-oriented showcase. Application areas include not only finance and economy, but also natural language processing and genomics.
The authors wished to write a useful textbook; therefore, the chapters are organized as follows: the problem to be solved, including motivation; the mathematical formalism and techniques; case studies from practice; and a set of exercises. The closing chapter (10) provides solutions for the exercises.
Chapter 1 presents the mathematical foundations of the web page ranking developed by Google. Chapter 2 deals with online learning and demonstrates the application of “expert advice” to predict the price of goods in a market. Chapter 3 describes the underlying mathematics of recommendation systems, motivated by the Netflix Prize, and presents a case study on latent semantic analysis. Chapter 4 treats classification through the use of support vector machines, with applications in quality control. Chapter 5 discusses clustering, both classical k-means and spectral clustering. The capabilities of the clustering methods are demonstrated through topic extraction.
Chapter 6 looks at two significant methods of linear regression, namely ordinary least squares and ridge regression. The usage of linear regression is showcased through capital asset pricing. Chapter 7 concentrates on variable selection in relation to the problem of sparse matrices and vectors as input parameters. Two methods for handling these concerns are described: lasso regression and the iterative shrinkage-thresholding algorithm. The strength of these methods is shown in the signal processing field, specifically in the task of compressed sensing.
Chapter 8 provides an overview of neural networks, followed by logistic regression and its application to spam filtering. Chapter 9 surveys decision trees, including NP-completeness. The chapter briefly analyzes the NP and P complexity classes and their relevance in problem solving, as well as heuristics and the greedy algorithm approach to keeping complexity at bay. The case study presents the underlying mathematical structure of chess-playing systems. As noted, chapter 10 provides complete solutions to the exercises presented at the end of every chapter.
The book can be used in courses devoted to the foundational mathematics of data science and analytics. It should be noted that sound mathematical knowledge, for example, an understanding of partial derivatives and matrix algebra, is required for reading. The case studies and exercises make it quality teaching material.