The model neurons in our population are escape noise neurons [14], i.e. leaky integrate-and-fire neurons where action potentials are generated with an instantaneous firing rate that depends on the membrane potential. Focusing on one of the population neurons, its input is a spike pattern made up of one spike train per afferent; each spike train is a list of the input spike times in that afferent. The output, the postsynaptic spike train produced by the neuron, is likewise a list of spike times. If the neuron, with its synaptic weight vector, produces this output in response to the input pattern, its membrane potential is determined by Eq. (5). There, the unit step function and Dirac's delta function appear, the latter leading to immediate hyperpolarization after a postsynaptic spike. The resting potential is set to a fixed value (arbitrary units), and fixed values are also used for the membrane time constant and the synaptic time constant.

By integrating the differential equation, the membrane potential can be written in spike response form, Eq. (6). Both the postsynaptic kernel and the reset kernel vanish for negative time arguments; for positive arguments they follow from the integration. Note that the first eligibility trace of a synapse can be expressed in terms of the postsynaptic kernel. Action potential generation is controlled by an instantaneous firing rate that increases with the membrane potential: at each point in time, the neuron fires with a probability given by the firing rate multiplied by an infinitesimal time window (a small fixed time step in the simulations). Our firing rate function increases with the membrane potential and has two fixed parameters; in the appropriate limit one would recover a deterministic neuron with a spiking threshold. As shown in [14], the probability density that the neuron actually produces a given output spike train in response to the stimulus during a decision period satisfies Eq. (7). The derivative of its logarithm with respect to the strength of a synapse is known as the characteristic eligibility in reinforcement learning [35]. For our choice of firing rate function one obtains Eq. (8), where the factors are the first eligibility trace of the synapse (Eq. 1) and the postsynaptic signal of the neuron given right below Eq. (2). Note that (8) is similar to our second eligibility trace (Eq. 2), except that in Eq. (2) the integration over the decision period is replaced by low-pass filtering with a time constant matched to the stimulus duration. The reason for this is that it seems biologically implausible to assume that the synapses of the population neurons know when decision periods start and end.
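To make the single-neuron model concrete, the following Python sketch integrates a leaky membrane with exponentially filtered synaptic input and draws spikes stochastically with probability equal to the instantaneous rate times the time step. This is a minimal sketch under assumptions: the parameter values and the exponential form of the escape rate are illustrative choices, not the values or the exact rate function used in the paper.

```python
import numpy as np

def simulate_escape_noise_neuron(input_spikes, weights, T=0.5, dt=1e-4,
                                 u_rest=0.0, tau_m=0.02, tau_s=0.005,
                                 rho0=10.0, theta=1.0, delta_u=0.1,
                                 u_reset=-1.0, rng=None):
    """Euler simulation of a leaky integrate-and-fire neuron with escape noise.

    input_spikes : list of arrays, spike times (s) of each afferent
    weights      : array of synaptic strengths, one per afferent
    Returns the output spike times and the membrane-potential trace.
    All parameter values are placeholders for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_steps = int(T / dt)

    # bin the input spikes once so each step only needs a lookup
    drive = np.zeros(n_steps)
    for w, times in zip(weights, input_spikes):
        idx = (np.asarray(times) / dt).astype(int)
        idx = idx[idx < n_steps]
        np.add.at(drive, idx, w)

    u = u_rest                 # membrane potential
    i_syn = 0.0                # synaptic current (low-pass filtered input)
    u_trace = np.empty(n_steps)
    out_spikes = []
    for k in range(n_steps):
        i_syn += drive[k]                      # jumps at input spikes
        i_syn -= dt * i_syn / tau_s            # decays with tau_s
        u += dt * (-(u - u_rest) + i_syn) / tau_m
        u_trace[k] = u
        # escape noise: spike with probability rho(u) * dt in this time bin
        # (exponential escape rate assumed for illustration)
        rho = rho0 * np.exp((u - theta) / delta_u)
        if rng.random() < min(rho * dt, 1.0):
            out_spikes.append(k * dt)
            u = u_reset                        # immediate hyperpolarization
    return out_spikes, u_trace
```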
We use a superscript, running from 1 to the population size, to index the population neurons; it labels, for instance, the postsynaptic spike train produced by each neuron in response to its own input spike pattern. As suggested by the notation, the population neurons have different inputs, but these inputs are highly correlated because the neurons are randomly connected to a common input layer which presents the stimulus to the network. In particular, we assume that each population neuron synapses onto any given site in the input layer with a fixed probability, leading to many shared input spike trains between the neurons. The population response is read out by the decision-making circuitry based on a spike/no-spike code. For notational convenience we introduce a coding function that takes one value if there is no spike in a neuron's postsynaptic response and the other value if the neuron produces at least one spike in response to the stimulus.

In terms of this coding function, the population activity read out by the decision-making circuitry can be written as a normalized sum of the coding values of the individual neurons. Using this activity reading, the behavioral decision is made probabilistically: the likelihood of producing a given decision is given by the logistic function in Eq. (9). Note that, due to the normalization in the definition of the population activity, its magnitude can become large for a large population. This is why decisions based on the activity of a large population can be close to deterministic, despite the noisy decision-making circuitry.

Feedback signals and the postsynaptic trace: We start with the reward feedback modulating synaptic plasticity in Eq. (4). This feedback is encoded by means of a concentration variable representing ambient levels of a neurotransmitter, e.g. dopamine. In the absence of reward information, the concentration approaches a homeostatic level with a fixed time constant. Whenever external reward information becomes available, this reinforcement leads to a change in the production rate of the neurotransmitter; the change is proportional to the reward and lasts for a fixed duration. So, up to the point in time when further reinforcement becomes available, the concentration variable evolves according to a relaxation equation in which a step function, equal to one during the production-rate change and zero otherwise, gates the reward-driven term. The reward feedback read out at a synapse is the deviation of the current neurotransmitter level from its homeostatic value, scaled by a positive learning rate which, for notational convenience, we absorb into the reward signal.

The decision feedback used in Eq. (3) is encoded in the concentration of a second neurotransmitter. As for the reward feedback, this is achieved by a temporary change in the production rate of the encoding neurotransmitter. To describe it, we assume a stimulus that ended at a given time, evoking a population activity and a behavioral decision. As shown in Text S1, the value of the decision feedback should then be determined by the derivative of the decision likelihood with respect to the population activity and, in view of Eq. (9), this derivative has a simple closed form. Hence we use the corresponding equation for the temporal evolution of the concentration, with fixed parameter values in the simulations. This equation holds up to the time when the subsequent stimulus presentation ends, at which point the decision variables are replaced by their values for the latter stimulus. The decision feedback is then simply read out from this concentration.

For the postsynaptic trace in Eq. (3), we assume a concentration variable which reflects the spiking of the neuron. Each time there is a postsynaptic spike, the variable is set to 1; at other times it decays exponentially. The value of this variable should reflect whether or not the neuron spiked in response to the decision stimulus. So, as for the eligibility trace (see Eq. 2), the relevant time scale is the decision period, and this is why the same time constant is used in both cases. The trace is obtained by comparing the concentration variable to an appropriate threshold; a fixed threshold value is used in the simulations. For the reasoning behind this choice, consider a stimulus of a given duration ending at some time. The value of the concentration variable at that time accurately reflects whether or not the decision stimulus elicited a postsynaptic spike if the threshold is chosen accordingly. But since decision feedback is not instantaneous, the trace is mainly read out at times later than the stimulus offset; this is why a somewhat smaller threshold seemed the better choice.
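The feedback and trace variables above are simple leaky concentration dynamics and can be sketched in a few lines. The following Python snippet illustrates the idea under assumed parameter values; d0, tau_d, pulse, gain, tau and theta are placeholders, and the exact equation forms are not taken from the paper.

```python
import numpy as np

def reward_concentration(reward_times, reward_values, T=10.0, dt=1e-3,
                         d0=1.0, tau_d=0.5, pulse=0.2, gain=1.0):
    """Concentration of a reward-encoding neurotransmitter (Euler integration).

    Without reward information the concentration relaxes to the homeostatic
    level d0 with time constant tau_d; each reward changes the production
    rate by gain * reward for a window of length `pulse`.
    Returns the concentration d(t) and the read-out d(t) - d0, i.e. the
    deviation from the homeostatic level that modulates plasticity.
    """
    n = int(T / dt)
    d = np.empty(n)
    d_now = d0
    for k in range(n):
        t = k * dt
        production = sum(gain * R for t_r, R in zip(reward_times, reward_values)
                         if 0.0 <= t - t_r < pulse)
        d_now += dt * ((d0 - d_now) / tau_d + production)
        d[k] = d_now
    return d, d - d0

def postsynaptic_trace(post_spike_times, T=10.0, dt=1e-3, tau=0.5, theta=0.5):
    """Spike-triggered concentration: set to 1 at each postsynaptic spike,
    exponential decay in between, then thresholded to a binary trace."""
    n = int(T / dt)
    spike_bins = set(int(t / dt) for t in post_spike_times)
    c = np.empty(n)
    c_now = 0.0
    for k in range(n):
        c_now = 1.0 if k in spike_bins else c_now * np.exp(-dt / tau)
        c[k] = c_now
    return c, (c > theta).astype(float)   # did the neuron spike recently?
```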
For TD-learning we used the SARSA control algorithm [1], which estimates the values of state-action pairs. At each point in time, the value estimates are updated according to the standard SARSA rule, with a step-size parameter and a discount factor that both lie between 0 and 1; the step-size parameter plays the role of a learning rate and the discount factor controls the temporal discounting. The update is done after every transition from a nonterminal state; if the successor state is terminal, its value is defined as zero. When in a given state, the next action is chosen using either ε-greedy or softmax action selection. In both cases only the values pertinent to the current state enter into the decision making.

For memoryless TD-learning in the two-armed bandit we used a fixed step-size parameter and a discount factor of zero; a positive discount factor would not qualitatively change the result. For each of the runs per chosen parameter value, we simulated a fixed number of trials. After an initial block of trials learning had converged, and the reported asymptotic quantities are the average over the subsequent trials. For learning with memory a different fixed parameter setting was used. For the sequential decision-making task, action selection used ε-greedy with a fixed ε; the discount factor and the step-size parameter were likewise set to fixed values.

With regard to the failure of TD-learning in the sequential decision-making task, we note that there are also eligibility-trace-based versions of the algorithm, SARSA(λ), with the above version corresponding to λ = 0. For λ > 0, the value update takes into account not just the next state-action pair but the values of all subsequent state-action pairs. Importantly, for the special case λ = 1 the subsequent values occurring in the update cancel, and the value update is in effect driven directly by the reward signal [1]. So SARSA(1) is just a complicated way of doing basic Monte Carlo estimation of the values. It hence does not assume that the process is Markovian, and SARSA(1) does reliably converge towards optimal performance in our task. For intermediate λ the procedure interpolates between the two extremes λ = 0 and λ = 1. Consequently, the valuation of some state-action pairs (e.g. the shortcut 12, left) will then be wrong, but the error will be smaller than for λ = 0. If action selection is based on softmax, the incorrect valuation will nevertheless be detrimental to decision making. However, this need not always be the case for ε-greedy, due to the thresholding inherent in this decision procedure. In particular, there is a positive critical value of λ (which depends mainly on the discount factor) above which the valuation error no longer affects the decision making. In this parameter regime, SARSA(λ) will reliably learn the optimal policy (up to the exploration determined by ε).
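For reference, a minimal tabular implementation of the SARSA update with ε-greedy action selection looks as follows. The environment interface (reset/step) and the parameter values are assumptions made for the sake of a self-contained sketch and are not taken from the paper.

```python
import numpy as np

def epsilon_greedy(Q, state, actions, eps, rng):
    """Pick a random action with probability eps, otherwise a greedy one."""
    if rng.random() < eps:
        return actions[rng.integers(len(actions))]
    values = np.array([Q[(state, a)] for a in actions])
    return actions[int(np.argmax(values))]

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.9, eps=0.1, rng=None):
    """Run one episode of tabular SARSA(0), updating Q in place.

    `env` is assumed (for this sketch) to provide reset() -> state and
    step(state, action) -> (reward, next_state, done).  Q maps
    (state, action) pairs to value estimates, e.g. collections.defaultdict(float).
    """
    rng = np.random.default_rng() if rng is None else rng
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, eps, rng)
    done = False
    while not done:
        r, s_next, done = env.step(s, a)
        if done:
            target = r                              # terminal state has value zero
        else:
            a_next = epsilon_greedy(Q, s_next, actions, eps, rng)
            target = r + gamma * Q[(s_next, a_next)]
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # SARSA update
        if not done:
            s, a = s_next, a_next
    return Q
```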
In all simulations, initial values for the synaptic strengths were picked from a Gaussian distribution with mean zero and standard deviation equal to 4, independently for each afferent and each neuron. The same learning rate was used in all simulations, except for the two-armed bandit task where a different value was used.

In the sequential decision-making task with working memory, the population is presented with stimuli encoding not just the current but also the immediately preceding position. For this, each location on the track is assigned a fixed spike pattern made up of 50 spike trains, representing the location when it is the current position, and a second spike pattern with 30 spike trains for the case that it is the immediately preceding position. The stimulus for the network is then obtained by concatenating the 50 spike trains corresponding to the current position with the 30 spike trains for the preceding position.

The curves showing the evolution of performance were obtained by calculating an exponentially weighted moving average in each run and then averaging over multiple runs. For the sequential decision-making task, reward per episode was considered, and a fixed smoothing factor was used in the exponentially weighted moving average. In the other task, where performance per trial was considered, a different smoothing factor was used. For each run, a new set of initial synaptic strengths and a new set of stimuli was generated. The same number of runs was used throughout, except in the two-armed bandit task, where we averaged over 40 runs.
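The performance curves described above can be reproduced schematically with the following sketch, which applies an exponentially weighted moving average to each run and then averages across runs. The smoothing factor shown is a placeholder, not one of the values used in the paper.

```python
import numpy as np

def smoothed_performance(runs, smoothing=0.05):
    """Exponentially weighted moving average per run, then average over runs.

    `runs` is a 2D array with one row of per-trial (or per-episode)
    performance values per run; `smoothing` is the EWMA smoothing factor.
    """
    runs = np.asarray(runs, dtype=float)
    ewma = np.empty_like(runs)
    ewma[:, 0] = runs[:, 0]
    for t in range(1, runs.shape[1]):
        ewma[:, t] = smoothing * runs[:, t] + (1.0 - smoothing) * ewma[:, t - 1]
    return ewma.mean(axis=0)      # average the smoothed curves over runs
```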