Recommender systems are widely used, especially by online applications, to enhance user experience. Most conventional systems derive new recommendations from the history of a user's implicit online behavior. By enabling an explicit feedback mechanism, would it be possible to design a reinforcement learning model that leads to better recommendations? This paper tests that hypothesis: the authors propose a new solution and validate their findings on real-world datasets.
Inputs from a user's interactive sessions are used to model a Markov decision process (MDP), which the paper calls a T-step interactive recommendation, each step denoting the user's response to a recommendation. The responses are fed to a reinforcement learning model, which uses them to learn a global policy by maximizing the cumulative reward it receives. A user-specific deep Q-learning method (christened UDQN) and a bias-incorporated UDQN (christened BUDQN) are formulated, in which the existing latent state is used as input and user responses to recommendations are used as output.
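To make the formulation concrete, the following is a minimal sketch of one user-specific deep Q-learning step in this setting. It is not the authors' exact architecture: the latent factor dimension, the small multilayer perceptron, the epsilon-greedy selection, and names such as QNet, recommend, and td_update are all illustrative assumptions, with latent user and item factors presumed to come from matrix factorization and rewards taken to be the user's explicit feedback.

```python
# Sketch of a Q-learning step for T-step interactive recommendation.
# Assumptions (not from the paper): latent factors of dimension K come
# from matrix factorization; the Q-network is a small MLP; the reward
# is the user's explicit rating of the recommended item.
import torch
import torch.nn as nn

K = 32          # latent factor dimension (assumed)
GAMMA = 0.9     # discount factor for the cumulative reward (assumed)

class QNet(nn.Module):
    """Q(s, a): state = user's current latent state, action = item factors."""
    def __init__(self, k: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * k, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, item):
        return self.mlp(torch.cat([state, item], dim=-1)).squeeze(-1)

q_net = QNet(K)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def recommend(state, item_factors, epsilon=0.1):
    """Epsilon-greedy action selection over the candidate item set."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(len(item_factors), (1,)).item()
    with torch.no_grad():
        q = q_net(state.expand(len(item_factors), -1), item_factors)
    return int(q.argmax())

def td_update(state, item, reward, next_state, item_factors):
    """One Q-learning step: fit Q(s, a) to r + gamma * max_a' Q(s', a')."""
    with torch.no_grad():
        next_q = q_net(next_state.expand(len(item_factors), -1),
                       item_factors).max()
        target = reward + GAMMA * next_q
    loss = (q_net(state, item) - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Iterating recommend and td_update over T steps per user, while the user's latent state is updated from each response, yields a policy trained to maximize the cumulative reward the paper describes.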
Two different MovieLens datasets and a Yahoo! Music dataset are used as benchmarks to validate the experimental results. Tenfold cross-validation, with samples randomly assigned to the training and test sets, minimizes the effect of overlapping data across test sets. Both of the proposed methods, UDQN and BUDQN, are shown to achieve better recommendation results.
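For readers unfamiliar with the protocol, a short sketch of the tenfold split follows. It assumes the interaction records sit in a single array; the placeholder data and the use of scikit-learn's KFold are illustrative, not taken from the paper.

```python
# Sketch of tenfold cross-validation over interaction records.
import numpy as np
from sklearn.model_selection import KFold

interactions = np.arange(100_000)  # placeholder for (user, item, rating) rows
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(interactions)):
    train, test = interactions[train_idx], interactions[test_idx]
    # train UDQN/BUDQN on `train`, evaluate recommendations on `test`
    print(f"fold {fold}: {len(train)} train / {len(test)} test")
```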