Computers with multicore central processing units (CPUs) and multiple graphics processing units (GPUs) are now widely available to speed up the parallel processing of computationally intensive statistical prediction algorithms. Unfortunately, several existing statistical algorithms [1], designed to cope with the continuous scrutiny of record-keeping systems in areas such as healthcare, remain sequential and are thus computationally inefficient. How should efficient statistical algorithms be designed to uncover and exploit the current and historical trends in medical claims databases, so as to reliably predict which medical products are associated with adverse events such as myocardial infarction or severe renal and liver failure?
The authors of this paper critique the limitations of existing statistical algorithms for coping with regulation and compliance issues in the healthcare industry. They recognize the need to exploit the parallelization capabilities of GPUs for fitting generalized linear models (GLMs), which requires the optimization of computationally intensive log-likelihood functions. Readers who are unfamiliar with computational statistics should browse Kennedy and Gentle's introduction [1] to sequential algorithms for unconstrained optimization and nonlinear regression before exploring the insightful parallel algorithms used in this paper to solve GLMs with Bayesian priors for parameter regularization.
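To make the optimization target concrete, a minimal sketch of a MAP objective of the kind the paper solves follows; the notation (design matrix \(X\), coefficients \(\beta\), prior variance \(\sigma^2\)) is illustrative rather than the authors' own, and the actual self-controlled case series likelihood is more involved:

\[
\hat{\beta}_{\mathrm{MAP}} \;=\; \arg\max_{\beta}\; \ell(\beta \mid y, X) \;+\; \log p(\beta \mid \sigma^{2}),
\]

where \(\ell\) is the GLM log-likelihood and the log-prior term acts as the regularizer (a Gaussian prior yields an L2 penalty, a Laplace prior an L1 penalty).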
The authors present the sequential cyclic coordinate descent algorithm used to fit the Bayesian self-controlled case series model, and they target its time-consuming computation of one-dimensional gradients and Hessians for extensive parallelization. They cleverly show how to represent and manipulate sparse matrices and dense vectors in parallel to compute these gradients and Hessians, and they apply the parallel algorithms to obtain maximum a posteriori (MAP) estimates for numerous observational healthcare databases. Using GPUs to perform the sparse operations significantly speeds up the MAP estimation compared to using CPUs to execute either the sparse or the dense computations.
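As a rough illustration of the serial starting point, the sketch below fits a MAP estimate by cyclic coordinate descent. It assumes a plain Poisson GLM with a Gaussian (ridge) prior as a stand-in for the paper's self-controlled case series likelihood, and the function name ccd_map_poisson is hypothetical:

```python
# Minimal sketch of cyclic coordinate descent for a MAP estimate, assuming a
# Poisson GLM with a Gaussian (ridge) prior rather than the paper's
# self-controlled case series likelihood. Each coordinate update needs only a
# one-dimensional gradient and Hessian built from one sparse column of X.
import numpy as np
from scipy import sparse

def ccd_map_poisson(X, y, prior_var=1.0, n_sweeps=50):
    """X: (n x p) design matrix, y: event counts, prior_var: Gaussian prior variance."""
    X = sparse.csc_matrix(X)          # column slicing is cheap in CSC format
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta = np.zeros(p)
    eta = np.zeros(n)                 # linear predictor X @ beta, updated incrementally
    for _ in range(n_sweeps):
        for j in range(p):
            col = X.getcol(j)
            rows, vals = col.indices, col.data
            mu = np.exp(eta[rows])    # Poisson mean where x_ij != 0
            # one-dimensional gradient and Hessian of the penalized log-likelihood
            grad = vals @ (y[rows] - mu) - beta[j] / prior_var
            hess = -(vals ** 2) @ mu - 1.0 / prior_var
            step = -grad / hess       # one Newton step along coordinate j
            beta[j] += step
            eta[rows] += step * vals  # keep the linear predictor in sync
    return beta
```

The two per-column reductions (the dot products that form the one-dimensional gradient and Hessian) are the kind of sparse operations that the paper offloads to the GPU; the sketch above only illustrates why each coordinate update is cheap and data-sparse.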
Exploiting parallel algorithms to fit complex GLMs to huge datasets offers new opportunities for associating adverse events with specific drugs, while controlling for covariates such as patient demographics, comorbid conditions, and concomitant drugs. However, a complete Bayesian analysis of the entire set of unknown parameters is missing from the proposed model. The authors clearly recognize the roles of cross-validation and bootstrapping in estimating the model's hyperparameters. However, are accurate estimates of the model hyperparameters really computationally infeasible, as the authors claim? I strongly encourage all computational statisticians to read this perceptive paper and weigh in on this question.
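For readers weighing the feasibility question, a minimal sketch of K-fold cross-validation over the prior variance follows. It reuses the hypothetical ccd_map_poisson fitter sketched above, so the total cost is roughly the number of folds times the number of grid points times one MAP fit, which is expensive but not obviously infeasible once each fit is GPU-accelerated:

```python
# Minimal sketch of K-fold cross-validation over the Gaussian prior variance,
# reusing the hypothetical ccd_map_poisson fitter from the earlier sketch.
import numpy as np

def cv_select_prior_var(X, y, grid=(0.01, 0.1, 1.0, 10.0), n_folds=5, seed=0):
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=X.shape[0])   # random fold assignment
    best_var, best_loglik = None, -np.inf
    for prior_var in grid:
        loglik = 0.0
        for k in range(n_folds):
            train_idx = np.flatnonzero(folds != k)
            test_idx = np.flatnonzero(folds == k)
            beta = ccd_map_poisson(X[train_idx], y[train_idx], prior_var=prior_var)
            eta = X[test_idx] @ beta                     # held-out linear predictor
            loglik += np.sum(y[test_idx] * eta - np.exp(eta))  # Poisson log-likelihood (up to a constant)
        if loglik > best_loglik:
            best_var, best_loglik = prior_var, loglik
    return best_var
```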