An algorithm for solving average-reward Markov and semi-Markov decision problems is presented in this paper. The algorithm is asynchronous and model-free, and can be combined with a nearest-neighbor approach to handle large state spaces. In addition, a convergence analysis is carried out by means of an ordinary differential equation method.
Classical policy iteration for solving the problem is first presented; notably, the algorithm proposed here does not require uniformization. This is also quite relevant because the transition probabilities are not always available. In reinforcement learning, the optimal policy is instead determined via a reinforcement mechanism: the decision maker can be viewed as an agent that selects an action in each state, visited either on a real-time basis or via a simulator, and the feedback obtained from every selected action is used to update the agent's knowledge base.
The algorithm provided is quite close to classical policy iteration. It begins by selecting a policy arbitrarily and estimating (using simulation) the average reward associated with it. It then performs policy evaluation, followed by policy improvement, in which the new policy is selected greedily from the Q-factors. Before a new evaluation phase begins, the average reward associated with the new policy must first be obtained (which, again, can be done by simulation). The old vector of Q-factors is now called the P-factors (P for policy), since these P-factors encode the policy that will be evaluated in the next phase; fresh Q-factors are then approximated in that phase.
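The phase structure described above can be sketched on a toy problem. Everything below is a hypothetical illustration, not the paper's exact algorithm or experiment: the two-state MDP, its probabilities and rewards, the step counts, and the constant learning rate are all assumptions chosen for readability.

```python
import random

# Hypothetical two-state, two-action MDP used only by the simulator;
# the learning agent never reads these probabilities (model-free).
MDP = {  # MDP[state][action] = [(prob, next_state, reward), ...]
    0: {0: [(0.9, 0, 10.0), (0.1, 1, 10.0)],
        1: [(0.5, 0, 1.0), (0.5, 1, 1.0)]},
    1: {0: [(0.9, 0, 8.0), (0.1, 1, 8.0)],
        1: [(0.5, 0, 0.0), (0.5, 1, 0.0)]},
}
STATES, ACTIONS = (0, 1), (0, 1)

def step(s, a):
    """Simulator: sample one transition for (state, action)."""
    outcomes = MDP[s][a]
    _, s2, r = random.choices(outcomes, weights=[o[0] for o in outcomes])[0]
    return s2, r

def estimate_rho(policy, n=20000):
    """Estimate the average reward of a fixed policy by simulation."""
    s, total = 0, 0.0
    for _ in range(n):
        s, r = step(s, policy[s])
        total += r
    return total / n

def evaluate(policy, rho, n=50000, alpha=0.01):
    """Policy evaluation: learn the Q-factors of `policy` with a
    relative (average-reward-adjusted) update."""
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    s = 0
    for _ in range(n):
        a = random.choice(ACTIONS)            # explore every Q-factor
        s2, r = step(s, a)
        Q[s][a] += alpha * (r - rho + Q[s2][policy[s2]] - Q[s][a])
        s = s2
    return Q

random.seed(0)
policy = {0: 1, 1: 1}                         # arbitrary starting policy
for phase in range(4):
    rho = estimate_rho(policy)                # average reward of current policy
    Q = evaluate(policy, rho)                 # fresh Q-factors this phase
    p_factors = Q                             # old Q-factors become P-factors
    policy = {s: max(ACTIONS, key=lambda a: p_factors[s][a]) for s in STATES}
print("learned policy:", policy)
```

On this toy problem the loop quickly abandons the poor starting policy and settles on the action with the higher long-run reward in each state, mirroring the evaluate/improve alternation the paper describes.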
A case study related to an airline’s seat-selling (and overbooking) policy is presented as an application. The seat allocation problem has its roots in the airlines’ practice of selling seats within the same cabin of a flight at different prices. Customers differ, and airlines use these differences to sell tickets at different prices: a passenger who wants fewer stopovers, or one who arrives late in the booking process, is made to pay a higher fare. Segmenting passengers in this way, by their needs and circumstances, yields the different fare classes. The algorithm is shown to give good results compared to the heuristic algorithm used in the industry.
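The fare-class structure can be made concrete with a small simulation. This is not the paper's model or the industry heuristic; it is a minimal sketch, assuming two fare classes, uniform demand, and a simple "protection level" that reserves seats for late-arriving, higher-paying passengers. All fares, capacities, and demand ranges are invented for illustration.

```python
import random

# Hypothetical single-leg flight: low-fare requests arrive before
# high-fare ones; `protect` seats are reserved for the high-fare class.
CAPACITY, LOW_FARE, HIGH_FARE = 100, 100.0, 300.0

def simulate_revenue(protect, trials=5000):
    """Average revenue under a fixed protection level (illustrative)."""
    total = 0.0
    for _ in range(trials):
        low_demand = random.randint(60, 140)    # early low-fare requests
        high_demand = random.randint(20, 60)    # late high-fare requests
        low_sold = min(low_demand, CAPACITY - protect)
        high_sold = min(high_demand, CAPACITY - low_sold)
        total += low_sold * LOW_FARE + high_sold * HIGH_FARE
    return total / trials

random.seed(1)
for protect in (0, 20, 40, 60):
    print(protect, round(simulate_revenue(protect)))
```

Selling every seat to the first comers (protect = 0) forgoes the higher fares, while protecting too many seats flies them empty; a decision-process formulation like the paper's searches this trade-off systematically rather than fixing a threshold in advance.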