Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates

Dennis Soemers*, Eric Piette, Matthew Stephenson, Cameron Browne

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceeding › Academic › peer-review

Abstract

In recent years, state-of-the-art game-playing agents have often involved policies that are trained in self-play processes, where Monte Carlo tree search (MCTS) algorithms and trained policies iteratively improve each other. The strongest results have been obtained when policies are trained to mimic the search behaviour of MCTS by minimising a cross-entropy loss. Because MCTS, by design, includes an element of exploration, policies trained in this manner are also likely to exhibit a similar extent of exploration. In this paper, we are interested in learning policies for a project with future goals including the extraction of interpretable strategies, rather than state-of-the-art game-playing performance. For these goals, we argue that such an extent of exploration is undesirable, and we propose a novel objective function for training policies that are not exploratory. We derive a policy gradient expression for maximising this objective function, which can be estimated using MCTS value estimates rather than MCTS visit counts. We empirically evaluate various properties of the resulting policies in a variety of board games.
Original language: English
Title of host publication: IEEE Conference on Games
Subtitle of host publication: (CoG'19)
Publisher: IEEE
Pages: 329-336
Number of pages: 8
ISBN (Print): 9781728118840
Publication status: Published - 23 Aug 2019
Event: IEEE Conference on Games (IEEE COG) - London, United Kingdom
Duration: 20 Aug 2019 → 23 Aug 2019
https://ieee-cog.org/2019/

Publication series

Series: IEEE Conference on Computational Intelligence and Games
ISSN: 2325-4270

Conference

Conference: IEEE Conference on Games (IEEE COG)
Country/Territory: United Kingdom
City: London
Period: 20/08/19 → 23/08/19

Keywords

  • reinforcement learning
  • search
  • self-play
