Classic arbitration between exploration and exploitation can overestimate the optimal recommendation when it relies on “learning by experience” [1] and on measuring user exploration through four essential parameters: accuracy, diversity, novelty, and serendipity. The author of this paper therefore aims to deepen reflection on online users’ choices by modeling the uncertainty of multi-angle predictive factors [2]. Thompson sampling (TS), the upper confidence bound (UCB), and other personalized ranking methods are count-based techniques with established results in the classical multi-armed bandit framework [3].
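To make these baselines concrete, the following minimal sketch contrasts TS and UCB1 on a simulated Bernoulli bandit, where each arm stands for a candidate recommendation. The number of arms, the hidden click-through rates, and the horizon are illustrative assumptions, not values from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                                   # arms = candidate recommendations
true_ctr = rng.uniform(0.05, 0.5, K)    # hidden click-through rates (assumed)

def thompson_sampling(T=10_000):
    # Beta(1, 1) priors over each arm's Bernoulli reward probability.
    alpha, beta = np.ones(K), np.ones(K)
    for _ in range(T):
        arm = np.argmax(rng.beta(alpha, beta))   # sample, then act greedily
        reward = rng.random() < true_ctr[arm]
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return alpha / (alpha + beta)                # posterior mean per arm

def ucb1(T=10_000):
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(T):
        if t < K:                                # play each arm once first
            arm = t
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)   # count-based optimism
            arm = np.argmax(sums / counts + bonus)
        reward = rng.random() < true_ctr[arm]
        counts[arm] += 1
        sums[arm] += reward
    return sums / np.maximum(counts, 1)          # empirical mean per arm
```

Both strategies concentrate pulls on the best arm over time: TS by sampling from a Beta posterior, UCB1 by adding a count-based optimism bonus to the empirical mean.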
This paper extends previous work on entropy regularization with the goal of improving recommender system (RS) metrics. It explores bandit-strategy techniques and focuses on stochastic planning for high-quality recommendations that capture users’ attention even offline. Offline analytics are essential for untangling the web of inferences drawn from the four parameters above, taking into account the feedback loop and the eventual delay with which users evaluate their online experience [4].
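Since the extension builds on entropy regularization, a compact sketch of how an entropy bonus arbitrates exploitation and exploration may help. Maximizing expected score plus τ·H(π) over a distribution π has the standard closed-form solution π ∝ exp(score/τ); the scores and temperatures below are illustrative assumptions.

```python
import numpy as np

def entropy_regularized_policy(scores, tau=0.5):
    # argmax_pi  pi·scores + tau * H(pi)  has the closed form
    # pi ∝ exp(scores / tau): small tau ≈ greedy exploitation,
    # large tau spreads mass over more items (exploration).
    z = scores / tau
    z -= z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

scores = np.array([2.0, 1.5, 0.3, 0.1])   # illustrative relevance scores
for tau in (0.1, 0.5, 2.0):
    print(f"tau={tau}: {entropy_regularized_policy(scores, tau).round(3)}")
```

The temperature τ plays the same role as the exploration term in a bandit: it controls how much probability mass leaks to items whose relevance is still uncertain.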
“Exploration involves activities such as search, variation, risk taking, experimentation, discovery, and innovation. Exploitation involves activities such as refinement, efficiency, selection, implementation, and execution” [5]. By contrast, Chen’s RS delivers on the promise of a multi-angle utility function, implemented on an “industrial recommendation platform serving billions of users,” where feedback data on user experience make that utility directly estimable. The balance between continuity and change is thus the new metric the author intends to introduce: it can either exploit the loophole of preference confirmation, removing alternative routes that are not on track to be recommended, or, better, convert the similarities underlying short-, medium-, and long-term personalization of the recommendations themselves into something desirable.
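As a purely hypothetical illustration of such a multi-angle utility, the sketch below scalarizes the four parameters into one score whose weights encode the continuity-versus-change balance: weight on accuracy favors continuity, weight on novelty and serendipity favors change. The function name, weights, and inputs are assumptions for exposition, not Chen’s actual formulation.

```python
import numpy as np

def multi_angle_utility(acc, div, nov, ser, w=(0.55, 0.15, 0.15, 0.15)):
    # acc, div, nov, ser: per-item scores in [0, 1] for accuracy,
    # diversity, novelty, and serendipity (hypothetical inputs).
    # w: trade-off weights; shifting mass from acc toward nov/ser
    # moves the recommender from continuity toward change.
    w = np.asarray(w)
    assert np.isclose(w.sum(), 1.0), "weights should form a convex combination"
    return w[0] * acc + w[1] * div + w[2] * nov + w[3] * ser

# Rank three candidate items under the illustrative weights.
acc = np.array([0.9, 0.6, 0.4])
div = np.array([0.2, 0.7, 0.8])
nov = np.array([0.1, 0.5, 0.9])
ser = np.array([0.0, 0.4, 0.8])
print(np.argsort(-multi_angle_utility(acc, div, nov, ser)))
```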