An algorithm for solving average-reward Markov and semi-Markov decision problems is presented in this paper. The algorithm is asynchronous and model-free, and can be combined with a nearest-neighbor approach to handle large state spaces. In addition, a convergence analysis is carried out by means of an ordinary differential equation method.
Classical policy iteration for solving the problem is first presented; notably, the algorithm proposed here does not require uniformization. This is also quite relevant because the transition probabilities are not always available. In reinforcement learning, the optimal policy is instead determined via a reinforcement mechanism: the decision maker can be viewed as an agent that selects an action in each state, visited either on a real-time basis or via a simulator, and the feedback obtained from every selected action is used to update the agent's knowledge base.
The algorithm provided is quite close to classical policy iteration. It begins by selecting a policy arbitrarily and estimating (using simulation) the average reward associated with it. It then performs policy evaluation, followed by policy improvement, in which the new policy is selected greedily from the Q-factors. Before a new evaluation phase begins, the average reward associated with the new policy must first be obtained (which, again, can be done by simulation). The old vector of Q-factors is now called the P-factors (P for policy), since these P-factors encode the policy that will be evaluated in the next phase; fresh Q-factors are then approximated in that phase.
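The phase structure described above can be sketched on a toy problem. Everything below is a hypothetical illustration, not the paper's exact algorithm or experiment: the two-state MDP, its probabilities and rewards, the step counts, and the constant learning rate are all assumptions chosen for readability.

```python
import random

# Hypothetical two-state, two-action MDP used only by the simulator;
# the learning agent never reads these probabilities (model-free).
MDP = {  # MDP[state][action] = [(prob, next_state, reward), ...]
    0: {0: [(0.9, 0, 10.0), (0.1, 1, 10.0)],
        1: [(0.5, 0, 1.0), (0.5, 1, 1.0)]},
    1: {0: [(0.9, 0, 8.0), (0.1, 1, 8.0)],
        1: [(0.5, 0, 0.0), (0.5, 1, 0.0)]},
}
STATES, ACTIONS = (0, 1), (0, 1)

def step(s, a):
    """Simulator: sample one transition for (state, action)."""
    outcomes = MDP[s][a]
    _, s2, r = random.choices(outcomes, weights=[o[0] for o in outcomes])[0]
    return s2, r

def estimate_rho(policy, n=20000):
    """Estimate the average reward of a fixed policy by simulation."""
    s, total = 0, 0.0
    for _ in range(n):
        s, r = step(s, policy[s])
        total += r
    return total / n

def evaluate(policy, rho, n=50000, alpha=0.01):
    """Policy evaluation: learn the Q-factors of `policy` with a
    relative (average-reward-adjusted) update."""
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    s = 0
    for _ in range(n):
        a = random.choice(ACTIONS)            # explore every Q-factor
        s2, r = step(s, a)
        Q[s][a] += alpha * (r - rho + Q[s2][policy[s2]] - Q[s][a])
        s = s2
    return Q

random.seed(0)
policy = {0: 1, 1: 1}                         # arbitrary starting policy
for phase in range(4):
    rho = estimate_rho(policy)                # average reward of current policy
    Q = evaluate(policy, rho)                 # fresh Q-factors this phase
    p_factors = Q                             # old Q-factors become P-factors
    policy = {s: max(ACTIONS, key=lambda a: p_factors[s][a]) for s in STATES}
print("learned policy:", policy)
```

On this toy problem the loop quickly abandons the poor starting policy and settles on the action with the higher long-run reward in each state, mirroring the evaluate/improve alternation the paper describes.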
A case study related to an airline’s seat-selling (and overbooking) policy is presented as an application. The seat allocation problem has its roots in the airlines’ practice of selling seats within the same cabin of a flight at different prices. Customers differ, and airlines use these differences to sell tickets at different prices: a passenger who wants fewer stopovers, or one who arrives late in the booking process, is made to pay a higher fare. Segmenting passengers in this way, by their needs and circumstances, yields the different fare classes. The algorithm is shown to give good results compared to the heuristic algorithm used in the industry.
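The fare-class structure can be made concrete with a small simulation. This is not the paper's model or the industry heuristic; it is a minimal sketch, assuming two fare classes, uniform demand, and a simple "protection level" that reserves seats for late-arriving, higher-paying passengers. All fares, capacities, and demand ranges are invented for illustration.

```python
import random

# Hypothetical single-leg flight: low-fare requests arrive before
# high-fare ones; `protect` seats are reserved for the high-fare class.
CAPACITY, LOW_FARE, HIGH_FARE = 100, 100.0, 300.0

def simulate_revenue(protect, trials=5000):
    """Average revenue under a fixed protection level (illustrative)."""
    total = 0.0
    for _ in range(trials):
        low_demand = random.randint(60, 140)    # early low-fare requests
        high_demand = random.randint(20, 60)    # late high-fare requests
        low_sold = min(low_demand, CAPACITY - protect)
        high_sold = min(high_demand, CAPACITY - low_sold)
        total += low_sold * LOW_FARE + high_sold * HIGH_FARE
    return total / trials

random.seed(1)
for protect in (0, 20, 40, 60):
    print(protect, round(simulate_revenue(protect)))
```

Selling every seat to the first comers (protect = 0) forgoes the higher fares, while protecting too many seats flies them empty; a decision-process formulation like the paper's searches this trade-off systematically rather than fixing a threshold in advance.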