The goal of reinforcement learning is to find the policy π, i.e. a set of rules for selecting an action in each possible state, that maximizes the agent's accumulated long-term reward in a dynamic environment. The problem is especially challenging when the agent must learn without explicit information about the dynamics of the environment or the rewards. The reinforcement learning problem is usually modeled as a Markov decision process (MDP), giving rise to a sequence of observed states, actions and rewards s_0, a_0, r_1, s_1, a_1, r_2, s_2, …, s_T. Here s_t is the state of the environment, a_t the action taken by the agent, and r_t the reward received by the agent at time step t. One episode (game) lasts T time steps, and different episodes can have different lengths. The goal of the agent is to maximize the sum of rewards r_1 + r_2 + … + r_T (the total game score).

One popular method for solving MDPs is Q-learning [14]. Q-learning is based on estimating the expected total discounted future reward (the quality) of each state-action pair under a policy π:

Q^{\pi}(s_t, a_t) = E\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{T-t-1} r_T \mid \pi \right].    (1)

Here γ is a discount rate between 0 and 1 that makes future rewards less valuable than immediate ones and helps to cope with infinite MDPs. The optimal quality value is then Q*(s_t, a_t) = max_π Q^π(s_t, a_t). Hence an optimal policy is easily derived from the optimal values by selecting the highest-valued action in each state, and the problem amounts to obtaining accurate Q-values. Given a state s, action a, reward r and next state s′, Q*(s, a) can be approximated by iteratively solving the Bellman recurrence equation [1]:

Q_{i+1}(s, a) = E\left[ r + \gamma \max_{a'} Q_i(s', a') \right].    (2)

When the state-action space is small enough for the Q-values to be represented as a lookup table, this iterative approximation is proven to converge to the true Q-values [15], provided that all state-action pairs are sampled regularly. However, the combinatorial explosion of the number of possible states in even a modest-sized environment makes this table-based implementation of Q-learning infeasible. This problem can be partially overcome by function approximation methods. Instead of storing each Q-value, their aim is to learn a function that maps state-action pairs to their respective Q-values.

The deep Q-network (DQN) builds on standard Q-learning by approximating the Q-function with a non-linear neural network. The network, parametrized by θ, is trained to minimize the loss function

L(\theta) = E\left[ \left( \underbrace{r + \gamma \max_{a'} Q(s', a'; \theta')}_{\text{target}} - \underbrace{Q(s, a; \theta)}_{\text{prediction}} \right)^2 \right].    (3)

Notice that this formula closely resembles the iterative update rule of the Bellman equation mentioned above (Eq 2). Essentially, the goal is to minimize the difference between the current estimate of the Q-value (the prediction) and an updated estimate (the target) that combines the obtained reward with an estimate of the quality of the next state. There is no proof of convergence for Q-learning with non-linear function approximators. To overcome learning instability, all experiences (s, a, r, s′) are stored in a "replay memory" and sampled uniformly as training examples. This ensures that the examples are uncorrelated and do not drive the policy to a local minimum. Furthermore, a separate target network (with parameters θ′ in the formula above) is used for estimating the maximal Q-value. The target network's weights are updated at certain intervals to be equal to those of the main network. Between updates these target Q-values remain unchanged and provide some much needed stability.
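The code released with [7] and adapted for this work is not reproduced here; the following is only an illustrative sketch, in Python with PyTorch, of the loss in Eq 3 combined with uniform replay sampling and a target network. The names QNetwork, ReplayMemory and dqn_loss are our own placeholders, not taken from the published code.

```python
# Illustrative sketch (not the authors' code) of the DQN loss in Eq 3 with
# uniform replay sampling and a periodically synchronized target network.
# QNetwork is a small fully connected stand-in for the convolutional network of [7].
import random
from collections import deque

import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, n_inputs: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # one Q-value per action


class ReplayMemory:
    """Stores (s, a, r, s', done) transitions and samples them uniformly,
    which decorrelates the training examples."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)


def dqn_loss(batch, q_net: QNetwork, target_net: QNetwork, gamma: float = 0.99):
    """Squared error between the prediction Q(s, a; theta) and the target
    r + gamma * max_a' Q(s', a'; theta'), cf. Eq 3.
    `batch` is a list of (state, action, reward, next_state, done) tensors,
    with actions stored as int64 scalars."""
    states, actions, rewards, next_states, dones = (torch.stack(x) for x in zip(*batch))
    prediction = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target network is held fixed between syncs
        best_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * best_next_q * (1.0 - dones)
    return nn.functional.mse_loss(prediction, target)


# Every few thousand updates the target network is synchronized with the
# main network, e.g.: target_net.load_state_dict(q_net.state_dict())
```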
To balance the exploitation of the current best known Q-values with the exploration of possibly better options, DQN uses a simple ε-greedy policy that selects a random action with probability ε (instead of always picking the action with the maximal Q-value). When ε is annealed over time, this simple strategy is known to work just as well as more complex exploration strategies [1].

There exist several possible adaptations of the Q-learning algorithm to the multiagent case [5]. However, this is an open research area, and theoretical guarantees for multiagent model-free reinforcement learning algorithms are scarce and restricted to specific types of tasks [3, 5]. In practice, the simplest method consists of using an autonomous Q-learning algorithm for each agent in the environment, so that the environment is the sole source of interaction between agents. In this work we use this method due to its simplicity, decentralized nature, computational speed, and ability to produce consistent results for the range of tasks we report on. Therefore, in our tests each agent is controlled by an independent DQN with the architecture and parameters reported in [7].

Adaptation of the code for the multiplayer paradigm: We needed to introduce several adaptations to the original code published with [7] to allow training multiple agents simultaneously. These changes are summarized in the supplementary information (SI) section.

We chose the Pong game environment for our study for reasons summarized in the SI. In short, there are three advantages to illustrating our results with Pong: i) the game has a real-time two-player mode, ii) DQNs perform well in Pong, and iii) the game is well known and easily understood by the reader. In Pong each agent corresponds to one of the paddles situated on the left and right side of the screen (see the screenshots in the Results section). There are 4 actions each of the two agents can take: move up, move down, stand still, and fire (to relaunch the ball or to start the game). Which action is taken is decided separately for each agent by the corresponding DQN.

A central aim of this work is to study the emergence of different types of collective behavior depending on how the agents are rewarded. We adjust the rewarding scheme by simply changing the reward ρ a player receives when putting the ball past the opponent (when scoring). This essentially means that we change the values on the main diagonal of the payoff matrix given in Table 1. The reward for conceding is kept fixed at −1. In this way we are able to create several different games within the same Pong environment. Examples of the rewarding schemes used are given in the following subsections.

Table 1 (10.1371/journal.pone.0172395.t001; table data not included in this extract). Rewarding schemes to explore the transition from the competitive to the cooperative strategy. For the cases we study, ρ ∈ [−1, 1]. Example: with ρ = −0.5, when the left player scores, it receives −0.5 points and the right player receives −1 points.

Score more than the opponent (fully competitive): In the traditional rewarding scheme of Pong, also used in [7], the player who scores a point gets a positive reward of size 1 (ρ = 1), while the player conceding the point is penalized with a reward of −1. This makes the game essentially zero-sum: a positive reward for the left player implies a negative reward of the same size for the right player and vice versa. Notice that ρ = 1 is the only case where the rewards of the two players sum to zero; for all ρ < 1 the sum of rewards is negative.
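As a concrete illustration of this ρ-parametrized scheme (including the cooperative and intermediate variants described next), the minimal sketch below maps a scoring event to the two players' rewards. The helper name and structure are ours and are not part of the published code.

```python
# Minimal sketch of the rho-parametrized rewarding scheme (cf. Table 1);
# the helper name and structure are ours, not part of the published code.

def point_rewards(scorer: str, rho: float) -> dict:
    """Rewards given to both players when `scorer` ('left' or 'right')
    puts the ball past the opponent: the scorer receives rho,
    the conceding player always receives -1."""
    conceder = "right" if scorer == "left" else "left"
    return {scorer: rho, conceder: -1.0}


print(point_rewards("left", 1.0))    # fully competitive: {'left': 1.0, 'right': -1.0}
print(point_rewards("left", -0.5))   # Table 1 example:   {'left': -0.5, 'right': -1.0}
print(point_rewards("left", -1.0))   # fully cooperative: {'left': -1.0, 'right': -1.0}
```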
Losing the ball penalizes both players (fully cooperative): In this setting we want the agents to learn to keep the ball in the game for as long as possible. To achieve this, we penalize both players whenever the ball goes out of play: both scoring and conceding lead to a punitive reward of −1 (ρ = −1). It does not matter which of the players lets the ball pass, and no positive rewards are given.

Transition between cooperation and competition: The above two rewarding schemes define fully competitive and fully collaborative tasks. To study the behavioral patterns lying between these two extremes, we gradually reduce the reward difference between scoring and conceding. The reward for conceding is kept fixed at −1, while a set of intermediate values is given to the ρ parameter.

In all of the experiments we let the agents learn for 50 epochs of 250000 time steps each. We limit the learning to 50 epochs because by then the Q-values predicted by the network have stabilized (see S1 Fig). Due to the use of a frame-skipping technique, the agents see and select actions only on every 4th frame [7]. In this article we always refer to the frames the agents actually see, so we use "visible frame", "frame" and "time step" interchangeably. During training, as in [7], the exploration rate (the proportion of actions chosen randomly) decreases from an initial 1.0 to 0.05 over the first million time steps and stays fixed at that value thereafter. A more detailed description of the training procedure and parameters can be found in [7]. After each epoch, snapshots of the DQNs are stored to facilitate later study of the training process. We also track the average maximal Q-value, which has previously been used as an indicator of training convergence (see SI).

To guarantee a fair comparison between the different rewarding schemes, we need the training signal to be equally strong in all cases. Rewards are the signal that agents use to evaluate their performance and that they learn from. For each scheme, we therefore multiply the rewards by a normalization coefficient so that the sum of their absolute values is equal across schemes. The rewarding schemes and the normalization coefficients are listed in S1 Table.
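The sketch below only illustrates the idea behind this normalization; the coefficients actually used in the experiments are those listed in S1 Table. We assume here, purely for illustration, that the per-point rewards are (ρ, −1) and rescale every scheme so that the absolute rewards of one point sum to 2, the value of the fully competitive case.

```python
# Illustrative sketch of the normalization idea only; the coefficients actually
# used are listed in S1 Table. We assume the per-point rewards are (rho, -1)
# and take the fully competitive case (|1| + |-1| = 2) as the reference sum.

def normalization_coefficient(rho: float, reference: float = 2.0) -> float:
    """Scale factor that makes the summed absolute rewards of one point
    equal to `reference`."""
    return reference / (abs(rho) + 1.0)


for rho in (1.0, 0.5, 0.0, -0.5, -1.0):
    c = normalization_coefficient(rho)
    scorer_reward, conceder_reward = c * rho, c * -1.0
    print(f"rho={rho:+.1f}  coeff={c:.3f}  "
          f"rewards=({scorer_reward:+.3f}, {conceder_reward:+.3f})  "
          f"sum of |rewards|={abs(scorer_reward) + abs(conceder_reward):.1f}")
```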
To obtain quantitative measures of the agents' behavior in the Pong environment, we identified and counted specific events in the game, e.g. the ball bouncing off a paddle or off a wall. We used the integrated debugger of Stella [16] to detect the memory locations containing information about these events. Based on these counts we defined the set of behavioral metrics listed below (a sketch of how the raw counts translate into these statistics follows the list). The statistics are collected after each training epoch by letting the DQNs (in their current state) play against each other for 10 games, each game initialized with a different random seed (in Pong one game consists of multiple exchanges and lasts until one of the agents reaches 21 points). During this testing phase the exploration rate is set to 0.01. The behavioral measures we use are the following:

Average paddle-bounces per point counts how many times the ball bounces between the two players before one of them scores a point. Randomly playing agents almost never hit the ball, whereas well-trained agents hit the ball multiple times in an exchange. Hereafter we refer to this statistic as paddle-bounces.

Average wall-bounces per paddle-bounce quantifies how many times the ball bounces off the top and bottom walls before it reaches the other player. It is possible to hit the ball at an acute angle so that it bounces off the walls several times before reaching the other player. Depending on the strategy, players might prefer to send the ball directly to the other player or to use the wall bounces. Hereafter we refer to this statistic as wall-bounces.

Average serving time per point shows how long it takes the players to restart the game after the ball was lost (measured in frames). To restart the game, the agent who just scored has to issue a specific command (fire). Depending on the rewarding scheme, the agents might want to avoid restarting the game. Hereafter we refer to this statistic as serving time.
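The following sketch shows how such per-game event counts could be turned into the three statistics. The GameCounts fields are hypothetical stand-ins for the values read out of the game's memory with the Stella debugger, and the example numbers are made up; neither is taken from the authors' data or code.

```python
# Sketch of turning per-game event counts into the three behavioral statistics.
# The GameCounts fields are hypothetical stand-ins for the values read from the
# game's memory with the Stella debugger, not the authors' data structures.
from dataclasses import dataclass


@dataclass
class GameCounts:
    points: int          # points scored by the two players in total
    paddle_bounces: int  # times the ball bounced off either paddle
    wall_bounces: int    # times the ball bounced off the top or bottom wall
    serving_frames: int  # frames spent between losing the ball and serving


def behavioral_metrics(games):
    """Aggregate the counts of the test games into the reported averages."""
    points = sum(g.points for g in games)
    paddle = sum(g.paddle_bounces for g in games)
    walls = sum(g.wall_bounces for g in games)
    serving = sum(g.serving_frames for g in games)
    return {
        "paddle-bounces (per point)": paddle / points,
        "wall-bounces (per paddle-bounce)": walls / paddle,
        "serving time (frames per point)": serving / points,
    }


# Example with made-up counts for two test games:
print(behavioral_metrics([
    GameCounts(points=36, paddle_bounces=140, wall_bounces=90, serving_frames=900),
    GameCounts(points=39, paddle_bounces=160, wall_bounces=110, serving_frames=1040),
]))
```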