Manipulating the Distributions of Experience used for Self-Play Learning in Expert Iteration

Dennis Soemers; Eric Piette; Matthew Stephenson; Cameron Browne

doi:10.1109/CoG47356.2020.9231589

Dennis Soemers^*, Eric Piette, Matthew Stephenson, Cameron Browne

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceeding › Academic › peer-review

Abstract

Expert Iteration (ExIt) is an effective framework for learning game-playing policies from self-play. ExIt involves training a policy to mimic the search behaviour of a tree search algorithm -- such as Monte-Carlo tree search -- and using the trained policy to guide it. The policy and the tree search can then iteratively improve each other, through experience gathered in self-play between instances of the guided tree search algorithm. This paper outlines three different approaches for manipulating the distribution of data collected from self-play, and the procedure that samples batches for learning updates from the collected data. Firstly, samples in batches are weighted based on the durations of the episodes in which they were originally experienced. Secondly, Prioritized Experience Replay is applied within the ExIt framework, to prioritise sampling experience from which we expect to obtain valuable training signals. Thirdly, a trained exploratory policy is used to diversify the trajectories experienced in self-play. This paper summarises the effects of these manipulations on training performance evaluated in fourteen different board games. We find major improvements in early training performance in some games, and minor improvements averaged over fourteen games.
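The first two manipulations named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract only states that samples are weighted "based on" episode durations, so the inverse-length weighting below is an assumption, and the prioritised sampling follows the generic proportional form of Prioritized Experience Replay. The `Sample` class, its fields, and the `alpha` default are hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    """One stored self-play position (all fields are illustrative)."""
    state: str              # placeholder for an encoded game state
    episode_length: int     # number of moves in the episode it came from
    priority: float = 1.0   # hypothetical PER priority, e.g. a recent loss


def duration_weights(samples):
    """Assumed scheme: weight each sample by 1 / episode length, normalised.

    With this choice every episode contributes the same total weight,
    regardless of how many positions it produced.
    """
    raw = [1.0 / s.episode_length for s in samples]
    total = sum(raw)
    return [w / total for w in raw]


def per_probabilities(samples, alpha=0.6):
    """Proportional prioritisation: P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [s.priority ** alpha for s in samples]
    total = sum(scaled)
    return [p / total for p in scaled]


def sample_batch(samples, probs, batch_size, rng):
    """Draw a batch (with replacement) according to the given probabilities."""
    return rng.choices(samples, weights=probs, k=batch_size)


# Two positions from short episodes and one high-priority position
# from a long episode.
buffer = [Sample("a", 10), Sample("b", 10), Sample("c", 40, priority=4.0)]
weights = duration_weights(buffer)
probs = per_probabilities(buffer, alpha=1.0)
batch = sample_batch(buffer, probs, batch_size=2, rng=random.Random(0))
```

In this toy buffer, positions from the 10-move episodes receive four times the weight of the position from the 40-move episode, while prioritised sampling favours the high-priority position instead. How the paper combines or corrects these weights is not specified in the abstract.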

Original language	English
Title of host publication	IEEE Conference on Games
Subtitle of host publication	(CoG'20)
Place of Publication	Osaka, Japan
Publisher	IEEE
Pages	245-252
Number of pages	8
ISBN (Print)	9781728145334
DOIs	https://doi.org/10.1109/CoG47356.2020.9231589
Publication status	Published - 24 Aug 2020
Event	2020 IEEE Conference on Games (CoG) - Osaka, Japan (online). Duration: 24 Aug 2020 → 27 Aug 2020

Publication series

Series	IEEE Conference on Computational Intelligence and Games
ISSN	2325-4270

Conference

Conference	2020 IEEE Conference on Games (CoG)
Period	24/08/20 → 27/08/20

Keywords

reinforcement learning
self-play
games
REINFORCEMENT
LEVEL
GO
GAME

Access to Document

10.1109/CoG47356.2020.9231589

https://ieeexplore.ieee.org/document/9231589

Cite this

@inproceedings{4f0db36dd9064270ab6098422736e0ea,
  title = "Manipulating the Distributions of Experience used for Self-Play Learning in Expert Iteration",
  abstract = "Expert Iteration (ExIt) is an effective framework for learning game-playing policies from self-play. ExIt involves training a policy to mimic the search behaviour of a tree search algorithm -- such as Monte-Carlo tree search -- and using the trained policy to guide it. The policy and the tree search can then iteratively improve each other, through experience gathered in self-play between instances of the guided tree search algorithm. This paper outlines three different approaches for manipulating the distribution of data collected from self-play, and the procedure that samples batches for learning updates from the collected data. Firstly, samples in batches are weighted based on the durations of the episodes in which they were originally experienced. Secondly, Prioritized Experience Replay is applied within the ExIt framework, to prioritise sampling experience from which we expect to obtain valuable training signals. Thirdly, a trained exploratory policy is used to diversify the trajectories experienced in self-play. This paper summarises the effects of these manipulations on training performance evaluated in fourteen different board games. We find major improvements in early training performance in some games, and minor improvements averaged over fourteen games.",
  keywords = "reinforcement learning, self-play, games, REINFORCEMENT, LEVEL, GO, GAME",
  author = "Dennis Soemers and Eric Piette and Matthew Stephenson and Cameron Browne",
  year = "2020",
  month = aug,
  day = "24",
  doi = "10.1109/CoG47356.2020.9231589",
  language = "English",
  isbn = "9781728145334",
  series = "IEEE Conference on Computational Intelligence and Games",
  publisher = "IEEE",
  pages = "245--252",
  booktitle = "IEEE Conference on Games",
  address = "United States",
  note = "2020 IEEE Conference on Games (CoG) ; Conference date: 24-08-2020 Through 27-08-2020",
}

Soemers, D, Piette, E, Stephenson, M & Browne, C 2020, Manipulating the Distributions of Experience used for Self-Play Learning in Expert Iteration. in IEEE Conference on Games: (CoG'20). IEEE, Osaka, Japan, IEEE Conference on Computational Intelligence and Games, pp. 245-252, 2020 IEEE Conference on Games (CoG), 24/08/20. https://doi.org/10.1109/CoG47356.2020.9231589

Manipulating the Distributions of Experience used for Self-Play Learning in Expert Iteration. / Soemers, Dennis ; Piette, Eric ; Stephenson, Matthew et al.
IEEE Conference on Games: (CoG'20). Osaka, Japan: IEEE, 2020. p. 245-252 (IEEE Conference on Computational Intelligence and Games).

TY - GEN
T1 - Manipulating the Distributions of Experience used for Self-Play Learning in Expert Iteration
AU - Soemers, Dennis
AU - Piette, Eric
AU - Stephenson, Matthew
AU - Browne, Cameron
PY - 2020/8/24
Y1 - 2020/8/24
AB - Expert Iteration (ExIt) is an effective framework for learning game-playing policies from self-play. ExIt involves training a policy to mimic the search behaviour of a tree search algorithm -- such as Monte-Carlo tree search -- and using the trained policy to guide it. The policy and the tree search can then iteratively improve each other, through experience gathered in self-play between instances of the guided tree search algorithm. This paper outlines three different approaches for manipulating the distribution of data collected from self-play, and the procedure that samples batches for learning updates from the collected data. Firstly, samples in batches are weighted based on the durations of the episodes in which they were originally experienced. Secondly, Prioritized Experience Replay is applied within the ExIt framework, to prioritise sampling experience from which we expect to obtain valuable training signals. Thirdly, a trained exploratory policy is used to diversify the trajectories experienced in self-play. This paper summarises the effects of these manipulations on training performance evaluated in fourteen different board games. We find major improvements in early training performance in some games, and minor improvements averaged over fourteen games.

KW - reinforcement learning
KW - self-play
KW - games
KW - REINFORCEMENT
KW - LEVEL
KW - GO
KW - GAME
U2 - 10.1109/CoG47356.2020.9231589
DO - 10.1109/CoG47356.2020.9231589
M3 - Conference article in proceeding
SN - 9781728145334
T3 - IEEE Conference on Computational Intelligence and Games
SP - 245
EP - 252
BT - IEEE Conference on Games
PB - IEEE
CY - Osaka, Japan
T2 - 2020 IEEE Conference on Games (CoG)
Y2 - 24 August 2020 through 27 August 2020
ER -