Quality-based Rewards for Monte-Carlo Tree Search Simulations

Tom Pepels*, Mandy J. W. Tak, Marc Lanctot, Mark H. M. Winands

*Corresponding author for this work

Research output: Conference article in proceeding (Academic, peer-reviewed)

Abstract

Monte-Carlo Tree Search is a best-first search technique based on simulations to sample the state space of a decision-making problem. In games, positions are evaluated based on estimates obtained from the rewards of numerous randomized play-outs. Generally, rewards from play-outs are discrete values representing the outcome of the game (loss, draw, or win), e.g., r ∈ {−1, 0, 1}, which are backpropagated from expanded leaf nodes to the root node. However, a play-out may provide additional information. In this paper, we introduce new measures for assessing the a posteriori quality of a simulation. We show that altering the rewards of play-outs based on their assessed quality improves results in six distinct two-player games and in the General Game Playing agent CADIAPLAYER. We propose two specific enhancements, the Relative Bonus and the Qualitative Bonus. Both are used as control variates, a variance reduction method for statistical simulation. The Relative Bonus is based on the number of moves made during a simulation, and the Qualitative Bonus relies on a domain-dependent assessment of the game's terminal state. We show that the proposed enhancements, both separately and combined, lead to significant performance increases in the domains discussed.
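To illustrate the control-variate idea the abstract refers to, the sketch below adjusts a play-out reward using a correlated covariate (here, play-out length, in the spirit of the Relative Bonus). This is not the paper's exact formulation; the class name ControlVariateReward, the online estimation of the coefficient, and the toy usage are illustrative assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class ControlVariateReward:
    """Running estimator that adjusts play-out rewards with a control variate.

    The covariate y (e.g., play-out length) is assumed to correlate with the
    reward r. The adjusted reward r' = r - c * (y - mean_y) keeps (approximately)
    the same expectation as r but has lower variance when c is close to
    Cov(r, y) / Var(y). Illustrative sketch only, not the paper's bonus formula.
    """
    n: int = 0
    mean_r: float = 0.0
    mean_y: float = 0.0
    co_moment: float = 0.0   # running co-deviation sum, used to estimate Cov(r, y)
    sq_moment: float = 0.0   # running squared-deviation sum, used to estimate Var(y)

    def update(self, r: float, y: float) -> float:
        """Record one simulation result and return a variance-reduced reward."""
        self.n += 1
        dr = r - self.mean_r
        dy = y - self.mean_y
        self.mean_r += dr / self.n
        self.mean_y += dy / self.n
        # Welford-style online updates for the co-moment and the y-moment.
        self.co_moment += dr * (y - self.mean_y)
        self.sq_moment += dy * (y - self.mean_y)
        c = self.co_moment / self.sq_moment if self.sq_moment > 0 else 0.0
        # The running mean of y stands in for the (unknown) true E[y].
        return r - c * (y - self.mean_y)


if __name__ == "__main__":
    cv = ControlVariateReward()
    for _ in range(1000):
        length = random.randint(10, 60)        # hypothetical play-out length
        reward = 1.0 if length < 35 else -1.0  # toy correlation between length and outcome
        adjusted = cv.update(reward, length)   # backpropagate `adjusted` instead of `reward`
```

Under these assumptions, the adjusted reward would be backpropagated in place of the raw game outcome, which is the general mechanism the enhancements build on.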
Original language: English
Title of host publication: Proceedings of the Twenty-first European Conference on Artificial Intelligence
Place of publication: Amsterdam, The Netherlands
Publisher: IOS Press
Pages: 705-710
Number of pages: 6
ISBN (Print): 978-1-61499-418-3
DOIs
Publication status: Published - 2014

Publication series

Series: Frontiers in Artificial Intelligence and Applications
Volume: 263
