Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates

Dennis Soemers*, Eric Piette, Matthew Stephenson, Cameron Browne

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceeding › Academic › peer-review


Abstract

In recent years, state-of-the-art game-playing agents often involve policies that are trained through self-play processes in which Monte Carlo tree search (MCTS) algorithms and trained policies iteratively improve each other. The strongest results have been obtained when policies are trained to mimic the search behaviour of MCTS by minimising a cross-entropy loss. Because MCTS, by design, includes an element of exploration, policies trained in this manner are also likely to exhibit a similar extent of exploration. In this paper, we are interested in learning policies for a project whose future goals include the extraction of interpretable strategies, rather than state-of-the-art game-playing performance. For these goals, we argue that such an extent of exploration is undesirable, and we propose a novel objective function for training policies that are not exploratory. We derive a policy gradient expression for maximising this objective function, which can be estimated using MCTS value estimates rather than MCTS visit counts. We empirically evaluate various properties of the resulting policies in a variety of board games.
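The paper's exact objective and its policy gradient derivation are given in the full text; as an illustrative sketch only, the Python snippet below contrasts the standard cross-entropy target built from MCTS visit counts with a policy-gradient ascent direction weighted by MCTS value estimates. The function names (`cross_entropy_grad`, `value_weighted_policy_grad`) and the toy numbers are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over action logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_grad(logits, visit_counts):
    """Gradient of the standard cross-entropy loss that trains a policy
    to mimic the MCTS visit-count distribution (the usual self-play target).
    Returns d(loss)/d(logits) for a softmax policy."""
    pi_mcts = visit_counts / visit_counts.sum()   # visit-count distribution
    return softmax(logits) - pi_mcts              # softmax cross-entropy gradient

def value_weighted_policy_grad(logits, q_values):
    """Hypothetical policy-gradient ascent direction weighted by MCTS value
    estimates Q(s, a) instead of visit counts: the expectation over actions
    of grad log pi(a|s) * Q(s, a), which pushes probability mass toward
    high-valued actions rather than toward the (exploratory) visit
    distribution. Not the paper's exact expression."""
    pi = softmax(logits)
    expected_q = np.dot(pi, q_values)
    # For a softmax policy, E_a[grad log pi(a) * Q(a)] = pi * (Q - E[Q]).
    return pi * (q_values - expected_q)           # ascent direction on logits

# Toy example: 3 actions at the root of an MCTS search.
logits = np.zeros(3)
visits = np.array([50.0, 30.0, 20.0])  # exploratory visit counts
q_vals = np.array([0.9, 0.1, -0.2])    # MCTS value estimates

logits_ce = logits - 0.5 * cross_entropy_grad(logits, visits)          # descent step
logits_pg = logits + 0.5 * value_weighted_policy_grad(logits, q_vals)  # ascent step

print("cross-entropy (visit-count) policy:", softmax(logits_ce))
print("value-weighted policy:             ", softmax(logits_pg))
```

In this toy setting the cross-entropy step reproduces the spread-out visit distribution, while the value-weighted step concentrates probability on the highest-valued action, illustrating the less exploratory behaviour the abstract motivates.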
Original language: English
Title of host publication: IEEE Conference on Games
Subtitle of host publication: (CoG'19)
Publisher: IEEE
Pages: 329-336
Number of pages: 8
ISBN (Print): 9781728118840
DOIs
Publication status: Published - 23 Aug 2019
Event: IEEE Conference on Games (IEEE COG) - London, United Kingdom
Duration: 20 Aug 2019 - 23 Aug 2019
https://ieee-cog.org/2019/

Publication series

Series: IEEE Conference on Computational Intelligence and Games
ISSN: 2325-4270

Conference

Conference: IEEE Conference on Games (IEEE COG)
Country/Territory: United Kingdom
City: London
Period: 20/08/19 - 23/08/19
Internet address: https://ieee-cog.org/2019/

Keywords

  • reinforcement learning
  • search
  • self-play
