Computing Reviews
The minimum description length principle
Grünwald P., The MIT Press, 2007. 504 pp. Type: Book (9780262072816)
Date Reviewed: Mar 24 2009

Seven hundred years ago, a Franciscan logician from the English village of Ockham formulated the philosophical principle that has come to be known as Ockham’s (or Occam’s) Razor: “Entities must not be multiplied beyond necessity.” His insight, now widely accepted as a guideline of modern scientific methodology, is that given two explanations of equal accuracy, we should prefer the simpler one. Modern information theory makes it possible to set this principle on a formal mathematical foundation. Over the past three decades, researchers in statistical learning theory have developed an extensive body of insights into this minimum description length (MDL) principle, and Grünwald now offers a unified expository text. Motivated by the challenges he faced as a student in pulling together the necessary mathematical prerequisites, he offers an integrated reference that includes introductions to the necessary foundations in information theory and statistics, along with an accessible, though thoroughly formal, development of the theory.

Grünwald addresses higher-level questions about what is and is not intuitive about MDL. In spite of the antiquity of its fundamental idea, MDL involves an intrinsic paradox: it prefers parsimony to perspicuity. That is, it prefers a shorter model to one that gives more causal insight into the process that generated the data, if both have the same predictive power. This preference is not obvious, and the explicit acknowledgment of the alternative perspective is refreshing. But MDL has its own intuitive justification, in the balance between underfitting and overfitting.

The book has four parts. The first part, a bit less than a quarter of the book, is devoted to an overview of the MDL idea and a summary of the mathematical prerequisites in probability and information theory.

The second part, about a third of the book, is devoted to universal codes: codes that are almost as good as the best possible code for a given data string, but that can be defined without seeing all the data in advance. This part starts with the classical two-part code, which first communicates the particular parameterization of the code and then the data encoded under it. It then goes on to discuss Bayesian, normalized maximum likelihood, and prequential plug-in codes.
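
The two-part idea is easy to make concrete. The following Python sketch (my own illustration, not code from the book) encodes a binary string under a Bernoulli model: a fixed budget of bits, set by the assumed precision parameter, pays for the quantized success probability, and the Shannon code length pays for the data.

    import math

    def two_part_code_length(data, precision=8):
        """Two-part MDL code length, in bits, for a 0/1 sequence."""
        n = len(data)
        k = sum(data)  # number of ones
        # L(H): cost of transmitting the parameter at the chosen precision.
        model_bits = precision
        # Quantize the maximum-likelihood estimate k/n, avoiding 0 and 1.
        grid = 2 ** precision
        theta = max(1, min(grid - 1, round(k / n * grid))) / grid
        # L(D | H): Shannon code length of the data under the quantized model.
        data_bits = -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))
        return model_bits + data_bits

    # A highly regular string compresses far better than a balanced one.
    print(two_part_code_length([1] * 90 + [0] * 10))  # about 55 bits
    print(two_part_code_length([1, 0] * 50))          # about 108 bits

The tension is visible even in this toy: raising the precision buys a better fit to the data at the cost of a longer model description.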

Universal codes have a somewhat ad hoc character. Once one has chosen a model to encode the data, the length of the data is well defined, but how does one then encode the model? Avoiding an infinite regress is tricky. The basic idea of refined MDL, the subject of the third part of the book, is to characterize the model not by its length under some code, but by its parametric complexity, a formal measure of the richness of the set of patterns it can capture. Because it permits more overfitting, a richer model plays the same role as a longer one and is therefore less desirable. Working out this scheme in detail leads to a theoretically more refined approach to MDL.
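
Again taking the Bernoulli model as the simplest case, the normalized maximum likelihood (NML) quantities behind refined MDL can be computed exactly. This Python sketch (my own illustration, not code from the book) evaluates the parametric complexity by summing the maximized likelihood over all sequences of length n, grouped by their number of ones:

    import math

    def max_log_likelihood(n, k):
        """log2 P(D | theta_hat) for a sequence with k ones out of n."""
        if k == 0 or k == n:
            return 0.0  # theta_hat is 0 or 1; the data then has probability 1
        return k * math.log2(k / n) + (n - k) * math.log2(1 - k / n)

    def parametric_complexity(n):
        """COMP(n): log2 of the maximized likelihood summed over all data."""
        total = sum(math.comb(n, k) * 2 ** max_log_likelihood(n, k)
                    for k in range(n + 1))
        return math.log2(total)

    def nml_code_length(data):
        """Refined (one-part) code length: fit term plus complexity term."""
        n, k = len(data), sum(data)
        return -max_log_likelihood(n, k) + parametric_complexity(n)

    print(parametric_complexity(100))  # about 3.7 bits; grows like (1/2)log2(n)

Note that the complexity term depends only on the model and the sample size, not on the particular data, which is what removes the arbitrariness of choosing a code for the model.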

The fourth part offers additional details on exponential families of distributions that have special benefits for MDL.
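
For orientation, an exponential family in canonical form (the standard definition; the book’s notation may differ) is

    \[
      p_{\theta}(x) = h(x)\,\exp\!\bigl(\theta^{\top} T(x) - A(\theta)\bigr),
      \qquad
      A(\theta) = \log \int h(x)\, e^{\theta^{\top} T(x)}\, dx,
    \]

where θ is the natural parameter, T(x) the sufficient statistic, and A(θ) the log-partition function; the Bernoulli and Gaussian models that appear in the earlier parts are special cases.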

The book is very well organized pedagogically. Each chapter begins with an overview paragraph, then a section-by-section summary, and finally suggestions for a fast track through the chapter, identifying the most critical sections. Boxed summaries distributed throughout the book draw the reader’s attention to highlights of the exposition, and illustrative examples are provided for the key technical points. In light of these features, it is surprising that the book does not include any exercises. The book’s orientation is decidedly theoretical rather than practical. It does not deal with the practical issues involved in handling large bodies of data, or point the reader to implementations that support the techniques described. Grünwald has rendered a great benefit to the machine learning field by providing this self-contained theoretical exposition. It would be helpful if someone now built on this volume with a practitioner’s guide.

Reviewer: H. Van Dyke Parunak
Review #: CR136616 (1002-0149)
Categories:
Statistical (I.5.1)
Learning (I.2.6)
Coding and Information Theory (E.4)
Probability and Statistics (G.3)
Other reviews under “Statistical”:

A formulation and comparison of two linear feature selection techniques applicable to statistical classification. Young D., Odell P. Pattern Recognition 17(3): 331-337, 1984. Type: Article. Reviewed: Mar 1 1985.

Remarks on some statistical properties of the minimum spanning forest. Dubes R., Hoffman R. Pattern Recognition 19(1): 49-53, 1986. Type: Article. Reviewed: Dec 1 1987.

Statistical pattern recognition--early development and recent progress. Chen C. International Journal of Pattern Recognition and Artificial Intelligence 1(1): 43-51, 1987. Type: Article. Reviewed: May 1 1988.
