Reinforcement learning for multiagent systems aims to find optimal policies that agents can learn through their interactions in cooperative or competitive games. In game theory, the target is the Nash equilibrium, in which each agent makes the best possible decision given the decisions of the other agents.
In this paper, the multiagent team plays a competitive game: each agent has its own goals, assumptions, and algorithms. How can multiagent learning algorithms allow agents to learn Nash equilibrium strategies? The authors answer with two policy-learning algorithms. Both of them “use the exponential moving average approach and the Q-learning algorithm ... to update the policy for the learning agent.”
The difference lies in how the policy is updated. The first algorithm, constant-learning-rate exponential-moving-average Q-learning (CLR-EMAQL), uses, as its name says, a constant rate. The second, exponential-moving-average Q-learning (EMAQL), employs two decay rates, guided by the competing learn-fast and learn-slow mechanisms, which let the agent learn differently depending on whether its policy is winning or losing.
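The win/lose-dependent EMA update can be sketched as follows. This is a minimal illustrative version for a tabular two-action agent: the rate names `eta_win`/`eta_lose` and the simple "winning" test are my assumptions for the sketch, not the paper's exact equations.

```python
import numpy as np

def emaql_policy_update(policy, q_values, eta_win, eta_lose):
    """One EMA policy-update step (illustrative sketch, not the
    paper's exact equations).

    policy   : current mixed strategy over the actions (sums to 1)
    q_values : current Q-values for the same actions
    eta_win  : small decay rate used when the policy is 'winning'
    eta_lose : larger decay rate used when it is 'losing'
    """
    greedy = np.zeros_like(policy)
    greedy[np.argmax(q_values)] = 1.0  # indicator vector of the greedy action

    # 'Winning' is approximated here as: the policy already puts most of
    # its probability on the greedy action (an assumption for illustration).
    winning = policy[np.argmax(q_values)] >= 0.5
    eta = eta_win if winning else eta_lose

    # Exponential moving average of the policy toward the greedy indicator.
    return (1.0 - eta) * policy + eta * greedy
```

Because the update is a convex combination of two probability vectors, the result remains a valid mixed strategy, and the two rates control how quickly it drifts toward the current greedy action.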
The paper extends algorithms previously proposed by the authors. Here they demonstrate that CLR-EMAQL “converges to Nash equilibrium ... in games that have pure Nash equilibrium,” while EMAQL also works in games that have only a mixed Nash equilibrium. The mathematical analysis, including proofs, is carried out on a simplified two-player, two-action game.
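The pure-versus-mixed distinction behind the two convergence claims can be made concrete with a small helper that enumerates the pure Nash equilibria of a two-player, two-action matrix game (an illustrative check of mine, not code from the paper): a coordination game has pure equilibria, while matching pennies has only a mixed one.

```python
import numpy as np

def pure_nash_equilibria(payoff_row, payoff_col):
    """List the pure Nash equilibria of a 2x2 matrix game.

    payoff_row : 2x2 payoffs of the row player
    payoff_col : 2x2 payoffs of the column player
    An action pair is a pure equilibrium when neither player can
    gain by unilaterally switching to the other action.
    """
    equilibria = []
    for i in range(2):
        for j in range(2):
            row_best = payoff_row[i, j] >= payoff_row[1 - i, j]
            col_best = payoff_col[i, j] >= payoff_col[i, 1 - j]
            if row_best and col_best:
                equilibria.append((i, j))
    return equilibria

# A coordination game: two pure equilibria (the CLR-EMAQL setting).
coord = np.array([[2, 0], [0, 1]])

# Matching pennies: no pure equilibrium, only a mixed one
# at (0.5, 0.5) for both players (the setting where EMAQL is needed).
pennies = np.array([[1, -1], [-1, 1]])
```

Running the helper on `coord` (used for both players) returns both diagonal action pairs, while on the zero-sum pair `(pennies, -pennies)` it returns an empty list, matching the distinction the authors draw between the two algorithms.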
A large part of the paper presents matrix and stochastic games that illustrate the two algorithms. An entire section reports detailed simulations comparing the proposed algorithms with other methods from the literature.
The paper is both a good introduction to multiagent policy learning and an effective presentation of the new algorithms. It will be useful to graduate students and researchers interested in this learning approach, which has applications ranging from financial strategies to robotics teams. The only drawback is that the paper does not discuss how to set the method's parameters.