Our model contains several components, illustrated schematically in Fig 1. The first component is a network of spiking neurons, itself divided into two subgroups. The neural network dynamically controls the muscle activities within a simulated vocal tract. The vocal tract simulation computes air pressures within the vocal tract, allowing sounds to be synthesized. The auditory salience of these sounds is then estimated, and this salience determines whether or not the model receives a reward for producing a given sound. Reward engages Hebbian learning (via STDP) within the neural network. Each simulation was run for a total of 2 hours of simulation time, i.e., 7200 trials, each lasting 1 s of simulated time. A number of simulations were run in order to choose appropriate model parameters and to assess the range of natural variation in performance across simulations. Each of these components is discussed in more detail below.
Fig 1 (10.1371/journal.pone.0145096.g001; image not included in full text). Overview of the model. A: Schematic depiction of the groups of neurons in the spiking neural network and how they are connected. There is a reservoir of 1000 recurrently connected neurons, with 200 of those being inhibitory (red) and the rest excitatory (blue and black). 200 of the reservoir’s excitatory neurons are designated as output neurons (black). These output neurons connect to two groups of motor neurons, agonist motor neurons (blue) and antagonist motor neurons (red). The connection weights within the reservoir are set at the start of the simulation to random values and do not change over the course of the simulation. The connection weights from the reservoir output neurons to the motor neurons are initially set to random values and are modified throughout the simulation by dopamine (DA)-modulated STDP. All reservoir and motor neurons receive random input current at each time step (not shown). B: Raster plot of spikes in the reservoir over a 1 s time period. C: Raster plot of spikes in the motor neuron groups over the same 1 s time period. The agonist and antagonist motor neuron spikes are summed at each time step and then smoothed using a 100 ms moving average. The smoothed antagonist activity is subtracted from the smoothed agonist activity, creating a net smoothed muscle activity that is sent to the orbicularis oris and masseter muscles. D: The smoothed agonist, antagonist, and net activity for the same 1 s as in the raster plots. E: Effects of the orbicularis oris and masseter on the vocal tract’s shape (reprinted with permission from [61]). Orbicularis oris activity tends to round and close the lips, and masseter activity tends to raise the jaw. F: Schematic illustration that the vocal tract is modeled as an air-filled tube bounded by walls made up of coupled mass-spring systems (reprinted with permission from [61]). The orbicularis oris and masseter affect the equilibrium positions at the front parts of the tube. The air pressure over time and space in the tube is calculated, and the air pressure at the lip end of the tube forms the sound waveform. The vocal tract shape is modeled more realistically than depicted here and also contains a nasal cavity that is not depicted. G: The sound synthesized by the vocal tract model is input to an algorithm that estimates auditory salience. The plot shows, for the same 1 s as in B–D, the synthesized vocalization waveform (in cyan) and the salience of that waveform over time (in black). Apart from a peak in salience at the sound’s onset, the most salient portion of the sound is around the place where the sound’s one consonant can be heard. The overall salience of this particular sound is 10.77. If the salience of the sound is above the model’s current threshold, a reward is given, which causes an increase in dopamine concentration in the neural network.
The simulation code is provided at https://github.com/AnneSWarlaumont/BabbleNN. The neural network contained two main subgroups of neurons. The first subgroup was a reservoir of 1000 Izhikevich spiking neurons [62]. 80% of the neurons were excitatory and 20% were inhibitory. Each neuron was randomly assigned outgoing connections to 100 other neurons, with the constraint that inhibitory neurons could connect only to excitatory neurons. The reservoir neuron properties and synaptic connectivities were set almost identically to the network described in [63], and our simulation code incorporated MATLAB code from that work. See [64] for another example of an adaptation of such models to a reservoir architecture. A subset of the excitatory neurons in the reservoir was selected to also connect to an equally sized population of excitatory motor neurons, all having the same parameter values as the excitatory reservoir neurons. Half of the motor neurons were agonists, positively activating the masseter and orbicularis oris muscles and serving to promote closure of the jaw and mouth. The other half were antagonists, inhibiting activity in the masseter and orbicularis oris muscles, thereby promoting jaw and mouth opening. Our assumption is that the reservoir and motor neurons can be considered models of subgroups of neurons within motor regions of the neocortex. The motor neurons’ effects on the vocal tract muscles are intended to roughly model the influence of upper motor neurons on the muscles (via lower motor neurons). The neural network simulation ran in 1 ms increments of simulated time. At each increment, a random input current was given to each reservoir and motor neuron, drawn from a uniform distribution between -6.5 and 6.5 pA. This random input was the same as that given to the model in [63]; future work could test the implications of using other random input distributions, such as exponential or power law, and could aim to match this input to observations from real cortical neurons. To this random input was added the current contributed by each neuron’s presynaptic neurons that had fired during the previous time step. This presynaptic current was proportional to a variable representing the synaptic strength from the presynaptic to the postsynaptic neuron. Some of these synaptic strengths (i.e., connection weights), namely those connecting the reservoir to the motor neurons, changed over the course of the simulation as a result of learning. Note that there are no external inputs to the model other than the random inputs at each time step, which ensure spontaneous activity of the neurons in each group. This is by design, as the goal of the present work was to focus on how infants’ spontaneous vocalizations become more speech-like over the course of the first year of life (see [49] and [65] for further discussion).
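As a rough illustration of this setup, the following is a minimal MATLAB sketch of the connectivity and per-millisecond random input. The variable names and the dense output-to-motor connectivity are our own assumptions and need not match the released BabbleNN code.

```matlab
% Minimal sketch of the network setup and per-ms random input described
% above. Names and the dense output-to-motor connectivity are assumptions.
Ne = 800; Ni = 200; N = Ne + Ni;       % excitatory / inhibitory reservoir neurons
Nout = 200; Nmot = 200;                % output neurons and motor neurons
post = zeros(N, 100);                  % 100 postsynaptic targets per neuron
for i = 1:N
    if i <= Ne
        post(i,:) = randperm(N, 100);  % excitatory -> any neuron
    else
        post(i,:) = randperm(Ne, 100); % inhibitory -> excitatory only
    end
end
sres = [rand(Ne,100); -rand(Ni,100)];  % fixed reservoir weights: [0,1] / [-1,0]
smot = rand(Nout, Nmot);               % plastic output -> motor weights, [0,1]

% Per 1 ms step: uniform random input in [-6.5, 6.5] pA to every neuron,
% added to the current arriving from presynaptic spikes (not shown).
I_res = 13 * (rand(N, 1) - 0.5);
I_mot = 13 * (rand(Nmot, 1) - 0.5);
```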
After every second of simulated time, a smoothed muscle activity time series was calculated. A 100 ms moving average of the previous 1000 ms time series of agonist motor neuron spikes was computed, yielding a 900 ms smoothed time series of agonist motor neuron activity. The same computation was done for the antagonist motor neuron spikes. The smoothed antagonist motor neuron activity time series was then subtracted from the smoothed agonist motor activity time series. The result was multiplied by a constant parameter, m, to create the net muscle activity time series. This scaling brought the muscle activity into a range appropriate for the synthesizer. The 900 ms net muscle activity time series was given directly to the articulatory vocalization synthesizer and specified both the Masseter and Orbicularis Oris muscle activities. The vocalization synthesis relied on the articulatory synthesizer developed by Boersma and available in Praat [61, 66]. Praat version 5.3.32 for PC was used for all the simulations. The synthesizer models the walls of the vocal tract as a set of coupled, damped mass-spring systems whose equilibrium positions and spring constants are affected by the activation of the various vocal tract muscles. The air within the vocal tract is treated as a fluid whose aerodynamics are modeled by obtaining approximate numerical solutions to a set of equations representing constraints such as conservation of mass, response to pressure gradients, and friction. The air within the vocal tract affects the movements of the walls and vice versa. Besides the Masseter and Orbicularis Oris activities, a number of other parameters needed to be set in order to generate the synthesized vocalizations. The speaker type needed to be specified; we chose the adult female vocal tract model for all simulations. Although Praat does have a child vocal tract model, it does not have a built-in infant model. Additionally, for the child model to generate sound, the acoustic simulation sampling rate must be increased, which would add to the computational demands of the vocalization synthesis, already the main processing bottleneck within our model. Since the focus of this study was on neuromotor learning rather than on infant vs. adult anatomy, we reasoned that, for our purposes, the adult female vocal tract provided a reasonable approximation of the main bioacoustic constraints on the infant vocal tract, particularly the nonlinear relationships of jaw and mouth movement to vocalization acoustics. The default sampling rate, 22050 Hz, was used. For each sound, the Lungs parameter, which specifies the target lung volume, was set to 0.1 at 0 ms, to 0.1 at 20 ms, to 0 at 50 ms, and to 0 at 900 ms. This created a scenario where the target lung volume went quickly from a high value at the beginning of the vocalization to a low value a few tens of ms later. In a human, such a change would be due to coordinated activity of the muscles of the diaphragm and rib cage. One laryngeal muscle, the Interarytenoid, was set to a value of 0.5 for the duration of the 900 ms vocalization. This muscle adducts the vocal folds, causing a pressure differential between the lungs and the upper vocal tract that sets the vocal folds into vibratory motion. Finally, the Hyoglossus muscle, which lowers the tongue, was set to a value of 0.4 throughout the 900 ms vocalization. This shaped the vocal tract so that, when the jaw and lips were open, the vocalization would sound like the vowel [A].
(See [49] for an example of a model that learns the settings of the laryngeal muscles for static, vowel-only vocalizations.) This combination of 900 ms of muscle activations and other settings was sent to the vocal tract model, which simulates the air pressure throughout the vocal tract at a series of time points and uses the time series of pressures at the mouth of the vocal tract to synthesize the vocalization. The vocalization was saved as a WAV file and subsequently analyzed to estimate its auditory salience.
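The muscle activity computation described above can be sketched in a few lines of MATLAB. The input names (agSpikes, antSpikes) are hypothetical stand-ins for the per-millisecond motor neuron spike counts; the placeholder inputs are only there to make the sketch runnable.

```matlab
% Minimal sketch of the smoothed net muscle activity computation, assuming
% agSpikes and antSpikes are 1000-element vectors of per-ms agonist and
% antagonist motor neuron spike counts (hypothetical names).
agSpikes  = randi([0 5], 1, 1000);           % placeholder spike counts
antSpikes = randi([0 5], 1, 1000);
win = ones(1, 100) / 100;                    % 100 ms moving average window
agSmooth  = conv(agSpikes,  win, 'valid');   % 901-sample smoothed series
antSmooth = conv(antSpikes, win, 'valid');
m = 2;                                       % muscle scaling parameter
netMuscle = m * (agSmooth - antSmooth);      % net activity sent to Praat as
netMuscle = netMuscle(1:900);                % Masseter and OrbicularisOris
```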
The estimated auditory salience of each sound was used as the basis for determining when to reward the model. This was based on the idea that human infants tend to prefer more salient stimuli, as well as on the idea that human caregivers are more likely to notice and respond to more salient infant sounds. Salience was estimated using a program developed by Coath, Denham, and colleagues [36, 37]. The program takes a sound as input and analyzes it in a variety of ways. It first converts the sound to a spectrogram format, with the frequency and time bins based on a model of cochlear processing. Within that spectrogram, it then identifies points in time and frequency where there are transitions in the cochlear activity level; this is essentially a form of temporal edge detection. After that, it convolves the spectrotemporal transients with models of cortical filters. The cortical filter models were developed by unsupervised training on a corpus of speech data and are designed to represent the input data well with minimal redundancy. The final step in the salience estimation is to detect transients, both onsets and offsets, in the activation of these cortical filter models. The transients can be thought of as auditory edge detectors [37]. The overall amount of change in the cortical filter activations at a series of evenly spaced time points determined the salience function for the particular input sound. The salience, s(v, t), over time, t, for a given second’s vocalization, v, was then converted to a single overall salience score for the sound, S(v), by taking the sum of the absolute value of the salience function of the sound (so as to include both onset and offset transients), excluding the first 150 ms:

$$S(v) = \sum_{t = 151\,\mathrm{ms}}^{900\,\mathrm{ms}} \left| s(v, t) \right| \qquad (1)$$

The first 150 ms were excluded because they typically included a spike in salience related to the abrupt onset of the sound, and this spike was not related to the questions of interest in the present study. The model received a reward if the salience of the sound it had just produced, S(v), was greater than a threshold value, θ(v). The threshold was initialized to a value of 4.5 and increased as the model increased the salience of its productions: if at least 30% of the model’s vocalizations on the last 10 trials were rewarded, the threshold value was increased by 0.1. (See Algorithm 1.) The starting threshold, threshold increment, and 30% criterion were decided based on informal explorations during development of pilot versions of the model.
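The threshold-adaptation rule of Algorithm 1 can be sketched as follows. The salience values here are placeholders; in the model, S_v would come from the auditory salience analysis of each 1 s vocalization.

```matlab
% Minimal sketch of the reward-threshold adaptation (Algorithm 1). The
% salience scores are placeholders, not model output.
theta = 4.5;                              % initial salience threshold
nTrials = 7200;
rewardHistory = false(1, nTrials);
for trial = 1:nTrials
    S_v = 4 + 8 * rand;                   % placeholder salience score
    rewardHistory(trial) = S_v > theta;   % reward when salience beats threshold
    if trial >= 10 && mean(rewardHistory(trial-9:trial)) >= 0.3
        theta = theta + 0.1;              % raise threshold by 0.1
    end
end
```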
Table 1 (10.1371/journal.pone.0145096.t001; contents not included in full text): Adapting the reward threshold.

At the beginning of the simulation all neural connection weights within the reservoir were assigned random values. The outgoing connection weights from the excitatory neurons were drawn from a uniform random distribution between 0 and 1. The outgoing connection weights from the inhibitory neurons were drawn from a uniform random distribution between -1 and 0. These connection weights remained the same throughout the simulation. All initial connection weights between the reservoir and the motor neurons were drawn from a uniform random distribution between 0 and 1. The connections from the reservoir neurons to the motor neurons were updated via reward-modulated spike-timing-dependent plasticity. Spike-timing-dependent plasticity (STDP) is a form of Hebbian learning derived from a large number of both in vitro and in vivo studies on long term potentiation and depression, in both hippocampal and neocortical neurons [67, 68]. In STDP, the change in strength of a synapse connecting a presynaptic neuron to a postsynaptic neuron is related to the relative timing of spikes of those two neurons. Long term potentiation occurs when the presynaptic neuron fires before the postsynaptic neuron, and long term depression occurs when the presynaptic neuron fires after the postsynaptic neuron. The degree of potentiation or depression is greater the closer together the two spikes are. There is evidence that the presence of dopamine increases learning rates in the neocortex and that such dopamine-modulated long term potentiation in the motor cortex facilitates skill acquisition [69–71]. It is believed that this provides a means by which animals learn to recreate movement patterns that lead to rewarding outcomes. Izhikevich’s DA-modulated STDP algorithm [63] was used, with the modification that our model implements only the long term potentiation aspect of STDP. Rather than implementing spike-timing-dependent long term depression, the reservoir to motor neuron connection weights are periodically normalized. The algorithm is presented in Algorithm 2 and its essential features are described in the following paragraph.
Table 2 (10.1371/journal.pone.0145096.t002; contents not included in full text): Reward-modulated spike-timing-dependent plasticity.

Each time an output neuron within the reservoir spikes, a small amount, 0.1, is assigned to a trace memory of the firing of that neuron. These reservoir output neuron traces decrease exponentially with time. Whenever a motor neuron fires, the eligibility trace for each of its incoming synapses to be strengthened is increased by adding the memory traces of the firings of the reservoir output neurons. This eligibility trace is then multiplied by the dopamine level in order to determine how much the synapse strength is increased. The dopamine level is increased by adding 1 whenever a reward is received. The dopamine level, eligibility traces, and presynaptic firing memories decay exponentially over time. At each synaptic weight update, if the update would make the strength of the synapse greater than 4, the synaptic strength is capped at 4. This prevents any individual synapse from becoming overly, and unrealistically, strong. Due to the nature of the learning algorithm, no synapse strength could ever have a negative value. Finally, after each synaptic weight update, the synaptic weights are normalized by dividing all weights by the mean synapse strength. This prevents the overall network connectivity from increasing over time, which would severely disrupt the network’s dynamics [72]. Based on pilot explorations, this method of normalization seemed to be less sensitive to small parameter variations than relying solely on long term depression to keep synapse strengths within a desirable range; further exploration of this issue is warranted but outside the scope of the present study. Note that the reward function and the DA-modulated STDP were both deterministic. All random variation in the model stemmed from the random synaptic weight initialization and the random input currents given to the neurons.
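A minimal MATLAB sketch of this update follows. The decay time constants, the spike indicators, and the exact update schedule are illustrative assumptions; only the structure (presynaptic traces, eligibility traces, dopamine gating, capping, and normalization) follows the description above.

```matlab
% Minimal sketch of the reward-modulated STDP update (Algorithm 2).
Nout = 200; Nmot = 200;
smot = rand(Nout, Nmot);                 % reservoir-output -> motor weights
pretrace = zeros(Nout, 1);               % memory of output-neuron firings
elig = zeros(Nout, Nmot);                % synaptic eligibility traces
DA = 0;                                  % dopamine level

% --- at each simulated ms ---
outFired = rand(Nout, 1) < 0.05;         % placeholder spike indicators
motFired = rand(Nmot, 1) < 0.05;
rewarded = false;                        % true when salience beats threshold

pretrace(outFired) = pretrace(outFired) + 0.1;   % trace of presynaptic spikes
elig(:, motFired) = elig(:, motFired) + ...
    repmat(pretrace, 1, nnz(motFired));          % tag recently active synapses
smot = min(smot + DA * elig, 4);                 % DA-gated potentiation, cap 4
smot = smot / mean(smot(:));                     % normalize by mean strength
if rewarded, DA = DA + 1; end                    % phasic dopamine on reward
pretrace = pretrace * exp(-1/20);                % exponential decays with
elig = elig * exp(-1/1000);                      % assumed time constants
DA = DA * exp(-1/200);
```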
Pilot explorations indicated that the types of sounds generated by the model are particularly sensitive to two parameters: the number of motor neurons and the muscle activity scaling parameter, m. With larger numbers of motor neurons in both the agonist and antagonist groups, the net motor neuron activity tends to exhibit higher amplitude variation within a second, i.e. within a vocalization. This leads to a greater likelihood of syllabic vocalizations, since the jaw and lip muscle activities tend to vary within a greater range. For the same reason, when the muscle scaling parameter, m, which is multiplied by the net motor neuron activity to generate muscle activity, is higher, the range of jaw and lip muscle activities tends to vary more within a vocalization, again leading to more syllabic vocalizations. To demonstrate this, and to determine appropriate parameter values on which to focus more detailed analyses, we ran 13 sets of simulations, varying the number of motor neurons and the value of the muscle scaling parameter, m. Each set consisted of 5 simulations with different random synaptic weight initializations and different random inputs given at each time step to the reservoir and motor neurons. We explored three values of the number of motor neurons: 50, 100, and 200. The number of reservoir output neurons was matched to the number of motor neurons, so that as the number of motor neurons increased, the number of output neurons in the reservoir increased as well. The number of agonist motor neurons and the number of antagonist motor neurons were always equal, so a total of 50 motor neurons meant 25 agonist motor neurons promoting jaw and lip closure and 25 antagonist motor neurons promoting jaw and lip opening. We initially explored three values of m: 4, 5, and 6. Recall that m is the value by which the difference between the smoothed agonist and antagonist motor neuron spike counts is multiplied in order to obtain the time series of masseter and orbicularis oris muscle activities. We tested every pairwise combination of these numbers of motor neurons and values of m, making for 9 parameter combinations. Based on the results of these 9 simulation sets, we then decided to test four additional parameter combinations: 50 motor neurons with m = 7 and m = 8, and 200 motor neurons with m = 2 and m = 3. This made for a total of 13 parameter combinations tested (enumerated in the sketch below). We then took the combination that appeared to provide the best balance of learning capability and realism in the initial behavioral starting point (200 motor neurons and m = 2), and focused further analyses on the simulations with that parameter combination.
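For reference, the full set of explored combinations can be written out explicitly; the variable names are our own.

```matlab
% The 13 explored (number of motor neurons, m) combinations: the initial
% 3 x 3 grid plus the four follow-up combinations.
combos = [ 50 4;  50 5;  50 6; ...
          100 4; 100 5; 100 6; ...
          200 4; 200 5; 200 6; ...
           50 7;  50 8; 200 2; 200 3];
nSeeds = 5;   % simulations per combination, each with its own random seed
```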
Even with the synaptic weight normalization in place, changes in the neural network’s connection weights due to STDP can potentially lead to changes in network dynamics that affect the oscillatory dynamics of the motor neuron population and, in turn, the types of vocalizations the model produces. In addition, even with no synaptic weight changes, subtle changes in the neural dynamics can accumulate over time. To ensure that salience-based rewards were driving any increases in vocalization salience and canonical syllable production over time, we ran yoked control simulations. These were simulations with their own unique random synaptic weight initializations and random inputs at each time step, but with reward times taken from a previous simulation in which rewards were salience-driven. This matched the timings of synaptic modification to those of the real simulations, while making the yoked control model’s rewards uncorrelated with the salience of its own vocalizations. This control method is standard procedure in work on animal behavior, including experimental work on human vocal learning during the first year of life (e.g., see [73]).
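A minimal sketch of the distinction between the two reward rules, with all variable names and values as placeholders of our own:

```matlab
% Yoked-control vs. salience-driven reward (illustrative values only).
isYokedControl = true;
yokedRewardTrials = [3 7 12 18];        % reward times saved from a prior run
trial = 12; S_v = 5.0; theta = 4.5;     % placeholder trial state
if isYokedControl
    rewarded = ismember(trial, yokedRewardTrials);  % replay prior reward times
else
    rewarded = S_v > theta;             % salience-driven reward
end
```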
In previous work [59, 60], the syllabicity of the sounds produced by a similar model had been evaluated using two metrics. The first was the salience of the sounds. Based on previous work showing that human ratings of the syllabic quality of a sound correlate with our auditory salience metric, as well as on theoretical considerations of the concept of auditory salience and the specific auditory salience estimation algorithm used here, we expected this to be a fairly useful metric. We also listened to the sounds produced by the model ourselves, to verify with our own ears what the vocalizations sounded like and how they compared to infants’ syllabic and non-syllabic vocalizations (links to downloadable sound examples are given in the Results section, and examples of human infant vocalizations classified as canonical, i.e. syllabic, vs. non-canonical are available at www.babyvoc.org through the IVICT tool). To provide an additional metric of the syllabicity of the sounds, one that was also independent of the development of the computational model, we utilized a Praat script for automatically identifying syllable nuclei in adult speech, developed by de Jong and Wempe [74, 75]. This syllable detection algorithm uses a combination of amplitude difference and voicing information to estimate where syllable nuclei, i.e. the loudest part of a syllable, usually the part containing the vowel, occur. It first searches the sound for segments where a high amplitude portion is surrounded by lower amplitude sound. It then checks whether there is an identifiable pitch, i.e. the perceptual correlate of fundamental frequency, during the high amplitude portion. If so, it labels this a likely syllable nucleus. We ran this program using the model’s individual 900 ms vocalizations as input and, for each input vocalization, obtained the total number of syllable nuclei that the sound was estimated to contain. We used all the default parameters, i.e. a silence threshold of -25 dB, a minimum dip between peaks of 2 dB, and a minimum pause duration of 0.3 s.
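To make the detection idea concrete, here is a minimal MATLAB sketch of the peak-plus-voicing logic. It illustrates the approach only; it is not the actual de Jong and Wempe Praat script, and the input contours are hypothetical.

```matlab
% Illustrative sketch of syllable nucleus counting from an intensity
% contour (dB per analysis frame) and a logical voicing indicator.
function n = countSyllableNuclei(intdB, voiced)
silenceThresh = max(intdB) - 25;   % frames below this count as silence
minDip = 2;                        % required dB dip between accepted peaks
n = 0;
prevPeak = 0;                      % index of the last accepted nucleus
for t = 2:numel(intdB)-1
    isPeak = intdB(t) > intdB(t-1) && intdB(t) >= intdB(t+1);
    if isPeak && intdB(t) > silenceThresh && voiced(t)
        % Accept only if the contour dipped since the previous nucleus.
        if prevPeak == 0 || min(intdB(prevPeak:t)) <= intdB(t) - minDip
            n = n + 1;
            prevPeak = t;
        end
    end
end
end
```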