Variability is an important ingredient in learning
Luca Onnis and Morten H. Christiansen, Department of Psychology, Cornell University, USA
Nick Chater, Department of Psychology, University College London, UK
Rebecca Gómez, Department of Psychology, The University of Arizona at Tucson, USA
An important aspect of language acquisition involves learning nonadjacent dependencies between words, such as subject/verb agreement for tense or number in English. Despite the fact that infants and adults can track adjacent relations in order to infer the structure of sequential stimuli, the evidence is not conclusive for learning nonadjacent items. In four experiments, we provide evidence that discovering nonadjacent dependencies is possible, provided it is modulated by the variability of the intervening material between items. Detection of nonadjacencies improves significantly with either zero or large variability, and in such conditions is independent of the embedded material. In addition, variability-mediated learning also applies to visual abstract stimuli, suggesting that similar statistical computations are carried out within sensory modalities. Learning nonadjacencies is clearly modulated by the statistical properties of the input, although crucially the obtained U-shaped learning curve cannot be explained by current associative mechanisms of artificial grammar learning.
That human learners may discover structural dependencies in sequences of stimuli through probabilistic relationships inherent in the input has long been proposed as a potentially important mechanism in language development. In the 1950s, Harris proposed a number of procedures that capitalized on the distributional properties of relevant linguistic entities (such as phonemes and morphemes) to uncover structural information about languages (Harris, 1955). A simple distributional procedure in English, for instance, would detect that articles precede nouns and not vice versa (the window, but not *window the). Harris also proposed that words that occur within similar contexts are semantically similar (Harris, 1968). Interest in the statistics of language was also great outside linguistics, and spawned fundamental discoveries in information theory. Shannon (1950) derived several important insights from the frequency analysis of single letters (the letter 'e' is by far the most frequent in English text) and of pairs and triples of letters. He developed encryption systems during World War II based on the statistical structure of sequential information, setting the stage for his pioneering work on the mathematical theory of cryptography and information theory.
At the time that Harris was developing his ideas on the statistics of language, Miller (1958) and Reber (1967) began investigating the processes by which learners responded to the statistical properties of miniature artificial grammars. These early experimental studies – now collectively called artificial grammar learning (AGL) – were taken by some to show that adults become sensitive, after limited and often purely incidental exposure, to sequential structure. Although conducted with adults, the studies were motivated by the desire to gain insight into how children might learn their first language. Thus, the idea that statistical and distributional properties of the input could assist in the discovery of the structures of language has historically been taken as a serious scientific endeavor.
In the decades following the first artificial grammar experiments, efforts to employ statistical and distributional analysis withered in the face of a series of theoretical linguistic arguments (e.g., Chomsky, 1957) and interpretations of mathematical results on the unlearnability of certain classes of languages (Gold, 1967). Consequently, the AGL approach largely abandoned the study of language learning in favor of topics such as implicit learning (e.g., Reber, 1993; Reed & Johnson, 1994). In particular, Chomsky – who had been a student of Harris – postulated that the constituent structure at the core of linguistic knowledge cannot be learned from surface-level associations by any means (e.g., Chomsky, 1957; cf. Lashley, 1951, for a similar criticism leveled independently in the area of sequential action). Chomsky pointed out that key relationships between words and constituents are conveyed in nonadjacent (or remotely connected) structure, as opposed to adjacent associations. In English, for instance, arbitrary amounts of linguistic material may intervene between auxiliaries and inflectional morphemes (e.g., is cooking, has traveled) or between subject nouns and verbs in number agreement, and these may be separated by embedded sentences (e.g., the books that sit on the shelf are dusty). The presence of embedding and nonadjacent dependencies in language represented a point of difficulty for early associationist approaches, which mainly focused on discovering adjacent relations. For instance, a distributional mechanism computing solely neighboring information would parse the sentence *the books that sit on the shelf is dusty as grammatical, because the nearest-neighbor noun ("shelf") to the verb ("is") is singular. A similar case, *the book that sits on the shelves are dusty, suffers from the same problem. Such limitations cast doubt on the usefulness of statistical properties of the input for discovering structure in language, and contributed to a paradigm shift in studies of language acquisition in the late 1950s and 1960s.
In this paper we re-evaluate the possibility that nonadjacent dependency learning may be mediated by the statistical properties of sequential stimuli. Recently, after some thirty years, there has been a revived interest in so-called statistical learning, as researchers have begun to investigate empirically how infants might identify aspects of linguistic structure. Because much of this research has focused on the learning of adjacent linguistic elements, such as syllables in words, we know little about the conditions under which nonadjacencies may be acquired. The aim of the paper is to explore learning of nonadjacent dependencies and to discuss the theoretical implications of these results for acquiring language as well as sequential structure more generally.
Learning nonadjacencies with artificial languages
The use of artificial grammars to tap into mechanisms of language acquisition has recently resurged in language acquisition research under the guise of artificial language learning (e.g., Curtin, Mintz, & Christiansen, 2005; Gómez & Gerken, 1999; Saffran, Aslin, & Newport, 1996; see Gómez & Gerken, 2000, and Saffran, 2003, for reviews). Typically, artificial languages differ from artificial grammars in that they aim to mimic the learnability of certain properties of real natural languages that the experimenter desires to test. Conversely, artificial grammars tend to focus more on abstract sequences of shapes or letters, with less relevance to linguistic structures. As such, an artificial language typically uses sequences of pseudo-words such as pel wadim jic, presented auditorily, while a typical example of an artificial grammar is MXVXXMX, where an arbitrary set of rules decides the order of randomly selected letters (Reber, 1967). However, both types of experiments tap into the type of knowledge that learners may acquire after exposure to a limited number of examples from the language (Perruchet & Pacton, 2006).
Much of the research so far points to a quick and robust ability of infants and adults to track statistical regularities among adjacent elements, e.g., syllables (Saffran et al., 1996; Saffran, Newport & Aslin, 1996). These studies suggest that there is a strong natural bias to make immediate statistical computations among adjacent properties of stimuli. Given this bias, tracking nonadjacent probabilities, at least in uncued streams of syllables, has proven elusive (Newport & Aslin, 2004; Onnis, Monaghan, Chater, & Richmond, 2005; Peña, Bonatti, Nespor, & Mehler, 2002).
Gómez (2002; see also Gómez & Maye, 2005) proposed that learners exposed to several sources of information (including adjacent and nonadjacent probabilities) may default to the easier-to-process adjacent probabilities. If in particular conditions adjacent information is not useful, then nonadjacent information may become informative, and thus more salient. To test this hypothesis, Gómez exposed infants and adult participants to sentences of an artificial language of the form AXB. The language contained three families of nonadjacent pairs, notably A1_B1, A2_B2, and A3_B3. She manipulated the variability of the middle element X in four conditions by systematically increasing the pool from which the middle element could be drawn from 2, 6, and 12 to 24 word-like elements, while the number and frequency of Ai_Bi pairs was kept constant across conditions. In the test phase, participants were required to discriminate correct nonadjacent dependencies (e.g., A1XB1) from incorrect ones (*A1XB2). Correct and incorrect sentences differed only in the relation between initial and final elements; the same set of X elements occurred in both grammars, and X elements varied freely without restriction, resulting in identical relations between adjacent AiX and XBi elements in the two types of sentences. Because the sentences are identical with respect to absolute position of elements and adjacent dependencies, they can only be distinguished by noting the relation between the nonadjacent first and third elements. Therefore, tracking first-order adjacent transitional probabilities P(X|Ai) and P(Bi|X) (Saffran et al., 1996) would not lead to distinguishing correct versus incorrect sequences. However, learners might detect the nonadjacencies by tracking trigram information (Perruchet & Pacteau, 1990). In this scenario, the number of unique sentences (trigrams) increased from 6 to 72 as the variability of X increased, and the frequency of each trigram decreased correspondingly from 72 to 6 repetitions. Given plausible memory and processing constraints, if learners were trying to learn trigrams they should be better at learning the AiXBi grammar in conditions of small variability. Surprisingly, Gómez found that learners were significantly better at learning when the variability of X elements was highest, at 24.
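The trade-off between trigram types and repetitions that this trigram account relies on can be checked with a short sketch (our own illustration in Python, not part of the original study; the constants 3 and 432 come from the design described above):

```python
# Illustrative sketch: with a fixed total of 432 training strings, the number
# of distinct AiXBi trigram types and the repetition count of each type trade
# off as the pool of middle elements grows.
DEPENDENCIES = 3      # A1_B1, A2_B2, A3_B3
TOTAL_STRINGS = 432   # overall exposure, constant across conditions

def trigram_stats(set_size):
    trigram_types = DEPENDENCIES * set_size
    tokens_per_type = TOTAL_STRINGS // trigram_types
    return trigram_types, tokens_per_type

for n in (2, 6, 12, 24):
    types, tokens = trigram_stats(n)
    print(f"set-size {n:2d}: {types:2d} trigram types x {tokens:2d} repetitions")
```

A trigram memorizer should thus find set-size 2 (6 types, 72 tokens each) far easier than set-size 24 (72 types, 6 tokens each), which is the opposite of the observed result.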
These results are counterintuitive. They rule out n-gram information, suggesting that learners computed transitional probabilities of nonadjacent dependencies P(Bi|Ai_). What is difficult to explain is the mechanism by which the variability of intervening Xs was beneficial to such computations. Indeed, by most accounts high variability should result in increased noise and thus decreased learning. Gómez proposed that during her task learners were initially focused on adjacent probabilities, and that nonadjacent information (for which P(Bi|Ai_)=1 and the frequency of Ai_Bi was kept constant across conditions) would become relatively more salient only as the probabilities of adjacent information decreased, i.e., as the set-size of X elements increased. Thus, high variability in the large set-size condition acted to increase the salience of the nonadjacent elements compared to the middle elements, and facilitated learning. Gómez proposed that learning involves a tendency to seek out invariant structure (structure remaining constant across varying contexts; E. J. Gibson, 1969; J. J. Gibson, 1966). Therefore, if the statistical probability of preferred (adjacent) structure decreases sufficiently, learners should begin to seek out other forms of information (see also Gómez, in press). Consistent with this argument, infants and adults in Gómez (2002) appeared to be focusing on different types of dependencies as a function of their statistical properties.
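The reversal in cue reliability that this account describes can be made concrete in a few lines (again a purely illustrative sketch of the design, not a model from the study; the uniform-sampling assumption is ours):

```python
# Illustrative sketch: in an AiXBi language with Xs drawn uniformly from a
# pool, the adjacent transitional probability P(X|Ai) falls as the pool grows,
# while the nonadjacent probability P(Bi|Ai_) is always 1, because Bi is fully
# determined by Ai. The relative salience of the two cues thus reverses.
def cue_strengths(set_size):
    p_adjacent = 1.0 / set_size  # P(X|Ai) for any particular middle item X
    p_nonadjacent = 1.0          # P(Bi|Ai_): the frame is deterministic
    return p_adjacent, p_nonadjacent

for n in (1, 2, 6, 12, 24):
    adj, nonadj = cue_strengths(n)
    print(f"set-size {n:2d}: P(X|A) = {adj:.3f}   P(B|A_) = {nonadj:.1f}")
```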
The variability insight is the starting point of our investigations. We argue that another type of variability could potentially trigger nonadjacent computations, namely when there is no variability at all in the intervening X elements. If the principle of seeking invariant structure in the input is correct, then intuitively in the case of zero variability learners should perceive the invariance of the X element as potentially uninformative, allowing them to focus on the nonadjacent computations P(Bi|Ai_). This hypothesis is tested in Experiment 1, which complements the results of Gómez and shows a counterintuitive U-shape of nonadjacent learning mediated by the variability of embedded middle elements. Experiments 2 and 3 further confirm that nonadjacent learning is indeed modulated by variability and, crucially, is independent of the embedded items. Experiment 4 tests whether learners carry out similar variability-modulated computations with visual stimuli, suggesting that such computations are not limited to language-like stimuli. This last experiment speaks directly to a debate on the existence of similar sequencing principles involved in language and other cognitive sequential tasks. The overall pattern of results in Experiments 1-4 points to a U-shape, and in the discussion section we dwell extensively on how current associative mechanisms based on n-gram information cannot account for this pattern. The ensuing discussion should be of relevance to researchers both in implicit learning and language acquisition, and suggests a reunification of the two literatures, which were one early on (Miller, 1958; Reber, 1967) and subsequently drifted apart into different areas of cognitive psychology, as briefly discussed in the introduction (see also Perruchet & Pacton, 2006).
Experiment 1: Testing the zero variability hypothesis
Initial evidence that the nonadjacent dependencies in an AiXBi language would be learned with high variability of intervening X elements comes from examples of natural language. There is a peculiar asymmetry in language such that sequences of morphemes often involve some high-frequency items belonging to a relatively small set (such as am, the, -ing, -s, are) interspersed with less frequent items belonging to a very large set (e.g., nouns, verbs, adjectives, adverbs). Gómez (2002) noted that this asymmetry translates into patterns of highly invariant nonadjacent items separated by highly variable material (am cooking, am working, am going, etc.), potentially making function morphemes more detectable (see also Valian & Coulson, 1988). Several important grammatical relations emerge in this way, notably auxiliaries and inflectional morphemes (e.g., am cooking, has traveled) as well as dependencies in number agreement (the books on the shelf are dusty). Thus, to an extent the AiXBi language reflects some structures in natural languages.
Examples of a second type of variability learning also exist in natural languages, namely that nonadjacencies may be variable with respect to a single fixed embedding: for instance, several different nonadjacent relations can be interspersed with the same material (e.g., am cooking, has cooked). In Experiment 1 we mimicked this condition with an AiXBi grammar, by exploring what happens when the variability between the end-item pairs and the middle items is reversed in the input. Gómez attributed poor results in the small set-sizes to low variability: in these conditions both the nonadjacent dependencies and the middle items vary, but neither considerably more than the other. This may confuse learners, in that it is not clear which structure is invariant. Conversely, with larger set-sizes middle items are considerably more variable than first-last item pairings, making the nonadjacent pairs stand out as invariant. We asked what happens when variability in the middle position is eliminated, thus making the nonadjacent items variable and the X item invariant. We replicated Gómez's experiment with adults, adding a new condition – the zero variability condition – in which there is only one middle element (i.e., A1X1B1, A2X1B2, and A3X1B3). Our prediction is that the invariance of the middle item will make the end-items stand out, and make detection of the appropriate nonadjacent relationships easier. The final predicted picture is a U-shaped learning curve in detecting nonadjacent dependencies, consistent with the idea that learning is a flexible and adaptive process.
Method
Participants. Sixty undergraduate and postgraduate students at the University of Warwick participated and were paid £3 each.
Materials. In the training phase participants listened to auditory strings generated by one of two artificial languages (L1 or L2). Strings in L1 had the form A1XB1, A2XB2, and A3XB3; L2 strings had the form A1XB2, A2XB3, and A3XB1. Variability was manipulated in 5 conditions, by drawing X from a pool of 1, 2, 6, 12, or 24 elements. The strings, recorded by a female voice, were the same that Gómez used in her study and were originally chosen as tokens among several recorded sample strings in order to eliminate talker-induced differences in individual strings. The elements A1, A2, and A3 were instantiated as pel, vot, and dak; B1, B2, and B3 were instantiated as rud, jic, and tood. The 24 X middle items were: wadim, kicey, puser, fengle, coomo, loga, gople, taspu, hiftam, deecha, vamey, skiger, benez, gensim, feenam, laeljeen, chla, roosa, plizet, balip, malsig, suleb, nilbo, and wiffle. Following the design of Gómez (2002), the group of 12 middle elements was drawn from the first 12 words in the list, the set of 6 from the first 6, the set of 2 from the first 2, and the set of 1 from the first word. Three strings in each language were common to all five groups and were used as test stimuli. The three L2 items served as foils for the L1 condition and vice versa. In Gómez (2002) there were six test sentences generated by each language, because the smallest set-size had 2 middle items, resulting in 12 test items. To keep the number of test items equal to Gómez, we presented the 6 test stimuli twice in two blocks, randomizing within blocks for each participant. Words were separated by 250-ms pauses and strings by 750-ms pauses.

Procedure. Six participants were recruited in each of the five set-size conditions (1, 2, 6, 12, 24) and for each of the two language conditions (L1, L2), resulting in 12 participants per set-size. Learners were asked to listen and pay close attention to sentences of an invented language, and they were told that there would be a series of simple questions relating to the sentences after the listening phase. They were not informed of the existence of rules to be discovered. During training, participants in all 5 conditions listened to the same overall number of strings (a total of 432). This way, frequency of exposure to the nonadjacent dependencies was held constant across conditions. For instance, participants in set-size 24 heard six iterations of each of 72 string types (3 dependencies x 24 middle items); participants in set-size 12 encountered each string twice as often as those exposed to set-size 24, and so forth. Whereas nonadjacent dependencies were held constant (each repeated 144 times across conditions), transitional probabilities decreased as set-size increased. Training lasted 18 minutes, and training trials were presented in three blocks separated by a 2-min break. Before the test, participants were told that the sentences they had heard were generated according to a set of rules, and that they would now hear 12 strings, 6 of which would violate the rules. They were asked to press "Y" on a keyboard if they thought a sentence followed the rules and to press "N" otherwise.
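The training-set construction just described can be sketched as follows (an illustrative reconstruction, not the original stimulus-generation code; the helper name make_training_set is ours):

```python
import itertools
import random
from collections import Counter

# Illustrative reconstruction of the training design: the 3 Ai_Bi dependencies
# are crossed with the pool of X items, and the resulting string types are
# repeated to a total of 432 strings, so every nonadjacent pair is heard
# 144 times regardless of set-size.
A = ["pel", "vot", "dak"]
B = ["rud", "jic", "tood"]

def make_training_set(x_pool, total=432):
    types = [(a, x, b) for (a, b), x in itertools.product(zip(A, B), x_pool)]
    reps = total // len(types)   # equal tokens per string type
    strings = types * reps
    random.shuffle(strings)
    return strings

train = make_training_set(["wadim", "kicey"])        # set-size 2 condition
pair_freq = Counter((a, b) for a, _, b in train)
print(pair_freq)   # each of the 3 Ai_Bi pairs occurs 144 times
```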
Results and Discussion
We measured total accuracy scores in endorsing grammatical strings and rejecting ungrammatical strings. An analysis of variance with Variability (set-size 1, 2, 6, 12, 24) and Language (L1 vs. L2) as between-subjects variables resulted in a main effect of Variability, F(4,50)=14.17, p<.001, and a main effect of Language, F(1,50)=103.48, p<.001. More strings were classified correctly for L2 than for L1.

We were particularly interested in determining whether performance across the different set-size conditions would result in a U-shaped function. A polynomial trend analysis (collapsed across languages) yielded a significant quadratic effect, F(1,55)=14.26, p<.001 (Figure 1). In contrast to Gómez (2002), there was not a significant increase between set-size 12 and set-size 24, t(22)=.57, p=.568, but otherwise the results for set-sizes 2-24 replicate her original findings. Figure 1 summarizes the percentage of overall correct classification of grammatical and ungrammatical items for each Variability condition.
[Insert Figure 1 about here]
The results can be broken into two parts. First, the increased performance from small to large set-sizes (12 and 24) replicates the original findings of Gómez (2002). These findings cannot be due to computation of transitional probabilities P(Bi|AiX), memorization of adjacent fragments, or memorization of trigrams, as these become less statistically reliable with increased set-sizes. Rather, as the variability of intervening elements increases, nonadjacent dependencies become more salient and thus more learnable, in favor of nonadjacent computations P(Bi|Ai_). The second finding to note is the high performance in set-size 1 compared to the lower performance in set-size 2, a condition with only one additional X element. One objection regarding the high accuracy rates in set-size 1 is that learning may be due to a simple strategy of memorizing whole sentences. After all, there are only three different sentence types to be learned, and each is repeated 144 times during training. There are thus two potential explanations regarding learning in set-size 1: one is that it is driven by rote learning, and that small differences (presence of one X element in set-size 1 versus two X elements in set-size 2) are sufficient to eliminate rote learning in set-size 2. The other possibility is that the invariance of X elements helps the nonadjacencies to become salient, and thus more learnable. Under both views, statistical information contributes to learning, but only in the latter case do learners actually track 'nonadjacent' transitional probabilities P(Bi|Ai_), in much the same way as with large variability. We therefore conducted a second experiment in which we pitted these two possible explanations against each other by increasing the number of unique items in the set-size 1 condition to six, the same number of unique items found in set-size 2.
Experiment 2a: Memorized exemplars or variability?
The literature on artificial grammar learning has suggested that complex sequential stimuli can be successfully encoded by memorizing chunks of bigrams and trigrams (Dulany et al., 1984; Perruchet & Pacteau, 1990; Servan-Schreiber & Anderson, 1990) or by comparison with stored whole exemplars (Vokey & Brooks, 1992). The simplicity of the language in set-size 1 of Experiment 1 could suggest that learning in set-sizes 2, 6, 12, and 24 may be driven by variability, whereas learning in set-size 1 is driven by rote learning. This interpretation implies that two different mechanisms together are responsible for the U-shape.

In the set-size 2 condition, there are 6 different sentences, each repeated 72 times (as opposed to 3 sentences each repeated 144 times in set-size 1). Learning in set-size 2 could be poor because participants are unable to remember 6 different items and because there is little variability. However, if subjects are merely memorizing items in set-size 1, then equating the number of individual sentences in set-sizes 1 and 2 at six should yield no difference in performance between the two conditions. If, on the other hand, variability plays a role in set-size 1, then we would still expect better learning compared with set-size 2.

In Experiment 2a, we constructed a control language with 6 Ai_Bi frames and 1 X, which we name 'set-size 1-control'. This language generated 6 different sentence types, each repeated 72 times during training, equating the type/token frequency of set-size 2. This allowed us to pit the predictions of a memory mechanism against the variability effect.
Method
Participants. Twenty-four undergraduate students at Cornell University, participating for course credit, were assigned to one of two conditions (set-size 1-control, set-size 2-control).
Materials. In the training phase participants listened to auditory strings generated by the same artificial language as Experiment 1 (L1). In 'set-size 1-control', 6 A_B frames were generated, of the form A1XB1, A2XB2, A3XB3, A4XB4, A5XB5, A6XB6, and one word was instantiated as X. Given that six more pseudo-words had to be created with respect to Experiment 1, the whole set of pseudo-words used in Experiment 1 was recorded afresh by a female voice, along with six new words. A and B pseudo-words were drawn from the following pool: pel, vot, dak, rud, jic, tood, leeg, meep, noob, rauk, sep, and zoet. For each participant, each word was assigned randomly to an Ai or Bi element, so that each participant experienced the same underlying grammar instantiated in unique sequences of pseudo-words. The single X element was randomly chosen from the set of 24 disyllabic pseudo-words in Experiment 1. This procedure eliminated the need for the counterbalancing procedure with two languages (L1 and L2), and allowed for better control of phonological features that have been found to have an impact on other artificial grammar experiments (Onnis et al., 2005). It also helped eliminate talker-induced differences in individual strings. Given that this experiment was conducted at a different institution with new stimuli, and given our interest in comparing performance in the set-size 1-control directly with the set-size 2 condition, we ran a 'set-size 2-control'. In this condition 3 A_B frames and 2 Xs were used, as in Experiment 1, while the random assignment of A, B, and X pseudo-words was used as in set-size 1-control. The test items for the set-size 1-control were constructed such that 12 strings were presented overall (as in Experiment 1): the six grammatical strings were the six string types presented during training, while the 6 ungrammatical strings were constructed by associating an Ai word with a Bj word: A1XB2, A2XB3, A3XB4, A4XB5, A5XB6, A6XB1. The test items for the set-size 2-control were constructed in a similar way: 6 grammatical strings, A1X1B1, A2X1B2, A3X1B3, A1X2B1, A2X2B2, A3X2B3; and 6 ungrammatical strings, A1X1B2, A2X1B3, A3X1B1, A1X2B2, A2X2B3, A3X2B1. Words were separated by 250-ms pauses and strings by 750-ms pauses, as in Experiment 1.

Procedure. Twelve participants were recruited in each of the two conditions. The total number of strings was held constant at 432 as in Experiment 1, and each string type was heard 72 times in both conditions. Training and testing were otherwise similar to Experiment 1, and participants were asked to discriminate grammatical versus ungrammatical strings at test.
Results and Discussion
Participants in set-size 1-control correctly classified items with a mean of 8.83 (74%) and SD of 2.89, while participants in set-size 2-control classified items correctly with a mean of 6.83 (57%) and SD of 1.64 (see Figure 2). The difference between the two groups was significant, t(22)=2.09, p<.05.
[Insert Figure 2 about here]
Learning in the set-size 1-control was significantly better than in set-size 2-control, despite the fact that 6 different Ai_Bi nonadjacencies had to be learned in set-size 1-control within the same overall number of trials, and thus were less frequent than in set-size 2. The language in set-size 1-control also had a richer vocabulary (13 distinct words) than the set-size 2-control (8 distinct words), imposing a heavier burden on memory. Thus in principle set-size 1-control ought to have been a less learnable condition. In both conditions, there were 6 distinct sentence types to be learned, each presented an equal number of times (72). We conclude from this control experiment that a memory advantage is unlikely to explain differences in learning Ai_Bi nonadjacencies between set-size 1 and set-size 2. If anything, a memory advantage should have helped participants in the set-size 2-control, given that a smaller vocabulary was involved.

Nonadjacent learning in set-size 1 thus appears to be better because of variability, not because of rote memorization of whole items. The invariance of a single X middle item likely makes the nonadjacent Ai_Bi items stand out as variable, and this facilitates learning.
Experiment 2b: Controlling for the frequency of X.
When comparing set-size 1 to set-size 2 in Experiments 1 and 2a, a caveat is that the frequency of the middle item X in set-size 1 (432 occurrences) is twice the frequency of the middle items in set-size 2 (216 occurrences each for X1 and X2). Thus it might be the case that if the same middle element is repeated so regularly and frequently, participants will (consciously) notice it due to brute repetition, and may then naturally begin to discount this information, or strategically attend to the spanning information. Under this view, the results with zero variability would not reflect automatic mechanisms, but rather something about the salience of repetition (in this case of individual elements, not trigrams), and distributional cues would have their effects only 'indirectly', as a result of being mediated by salience-induced noticing of the patterns.
In Experiment 2b, we reran the control experiment from Experiment 2a, but with the frequency of the X item in set-size 1 equal to the frequency of each of the two individual Xs in set-size 2; that is, the frequency of X in set-size 1 was reduced to 216 occurrences. This reduced the overall amount of training, thus reducing the frequency of occurrence of each AiXBi trigram as well as the frequency of each Ai_Bi nonadjacent pair. Therefore, in terms of strength of individual items, bigrams, and trigrams, this condition is equally or more difficult than the set-size 2 condition. We anticipated that zero variability would still give rise to learning as compared to set-size 2.
In Experiment 2b, we constructed a language with 6 Ai_Bi frames and 1 X, which we name 'set-size 1-equated'. This language generated 6 different sentence types, each repeated 36 times during training, equating the frequency of X to that of each individual X in set-size 2. This again allowed us to pit the predictions of a memory mechanism against the variability effect.
Method
Participants. Ten undergraduate students at Cornell University participated for course credit or were paid $5.

Materials. In the training phase participants listened to auditory strings generated by the same artificial language as the Experiment 2a set-size 1-control; i.e., 6 A_B frames were generated, of the form A1XB1, A2XB2, A3XB3, A4XB4, A5XB5, A6XB6, and one word was instantiated as X.

Procedure. The total number of training strings was reduced to 216, so that the frequency of the X item was equated to the frequency of each X in set-size 2 (Exp. 2a), namely 216 occurrences. Each string type and each Ai_Bi nonadjacent pair were heard 36 times. Thus, whereas participants in set-size 2 heard each of the AiXBi trigrams (involving three dependencies) 72 times, participants in set-size 1-equated got only half as much exposure to each of the six trigrams (i.e., 36) but had to learn twice as many dependencies.
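The frequency bookkeeping behind this design can be verified in a few lines (an illustrative check of the numbers stated above, not analysis code from the study):

```python
# Check of the Experiment 2b frequency design; constants come from the text.
# Set-size 2 (Exp. 2a): 432 strings, 2 Xs, 3 dependencies.
strings_2 = 432
x_freq_2 = strings_2 // 2              # each X heard 216 times
trigram_freq_2 = strings_2 // (3 * 2)  # each of 6 trigrams heard 72 times

# Set-size 1-equated (Exp. 2b): training halved, 6 dependencies, 1 X.
strings_1eq = 216
x_freq_1eq = strings_1eq               # the single X occurs in every string
trigram_freq_1eq = strings_1eq // 6    # 6 trigram types -> 36 repetitions

assert x_freq_1eq == x_freq_2                    # X frequency equated at 216
assert trigram_freq_1eq == trigram_freq_2 // 2   # half the trigram exposure
```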
Results and Discussion
Participants in set-size 1-equated correctly classified strings with a mean=9.6 (80%) and sd=2.45 The difference with
set-size 2 was significant, t(20)=2.71, p<.05.
The results suggest that learning in set-size 1-equated was significantly better than in set-size 2-control, despite the fact that 6 different Ai_Bi nonadjacencies had to be learned with half the number of trials and the frequency of X was equated to the set-size 2-control. In addition, there was no statistical difference between set-size 1-equated and set-size 1-control of Experiment 2a, t(20)=.66, p=.52. Lastly, there was no significant difference between set-size 1-equated and the original set-size 1 in Experiment 1, t(21)=1.95, p>.05.
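For readers wishing to reproduce this kind of between-group comparison, a two-sample t statistic can be computed directly from summary statistics. The sketch below is illustrative only: group 1 uses the reported set-size 1-equated descriptives (mean 9.6, SD 2.45, n = 10), while group 2's values are invented, since the set-size 2 descriptives are not repeated in this section.

```python
import math

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Two-sample t statistic from summary statistics, pooled variance."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df   # pooled variance
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df

# Group 1: reported set-size 1-equated stats; group 2: hypothetical comparison.
t, df = pooled_t(9.6, 2.45, 10, 7.2, 1.6, 12)
print(f"t({df}) = {t:.2f}")
```

With n1 = 10 and n2 = 12 this yields df = 20, matching the degrees of freedom of the reported t(20) tests.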
Together, these results suggest that learning with zero variability is better than learning with low variability. Learning also seems relatively independent of the brute frequency of exposure to the X element and is robust even when the frequency of Ai_Bi nonadjacencies is reduced significantly. Finally, although Experiment 2b cannot rule out the possibility that attention-mediated mechanisms are responsible for individuating the nonadjacencies, it rules out a brute-force account, whereby mere overexposure to the same X would produce an attention bias that would lead participants to discard the X and focus on the nonadjacent relations.
Experiment 3: Generalizing under variability
Experiments 1, 2a, and 2b provide initial evidence that the U-shape is due to learners computing statistics over specific Ai_Bi nonadjacent elements. Additional support would come from an experiment showing that learners can recognize familiar nonadjacent dependencies independently of X elements, namely when presented in novel sentences with novel embedded X elements not experienced during training. The specific hypothesis that we tested in Experiment 3 is thus that generalization to novel X items embedded in Ai_Bi frames is more likely to occur in conditions of zero or high variability.
This would provide indirect evidence as to how children may start building syntactic categories from the input they receive. Mintz (2002) recently proposed that the statistical properties of nonadjacent dependencies might lead learners to group the intervening elements as members of a category. In a corpus analysis of child-directed speech, Mintz (2003) analyzed all ordered pairs of words that frequently co-occur with exactly one word intervening, and termed such nonadjacencies frequent frames. These are exactly the types of structure that are simulated in the artificial grammar presented in this study. Mintz discovered that the words embedded in particular frequent frames tended to be from the same category (e.g., noun, verb, preposition, adjective, adverb). Mintz (2002) also showed that adults categorize words in artificial languages as a function of their co-occurrence patterns within frequent frames. Importantly, in both the computational analyses and the human experiment of Mintz there is no mechanism for explaining how frequent frames might come to be learned. Together, Gómez (2002), Gómez & Maye (2005), and Experiments 1 and 2 can be construed as a first step toward showing how learners might discover frequent frames. In Experiment 3, in contrast, we aim to establish whether learners can generalize the frames to new X elements, a necessary step if learners are to ultimately learn categories based on frequent frames as proposed by Mintz (2003). Categorization is an important linguistic milestone because once a word is associated with a particular syntactic category, a learner may extend it to novel syntactic contexts. Moreover, frequent nonadjacent dependencies have been shown to be fundamental to the process of
progressively building syntactic knowledge of tense marking, singular and plural markings, etc. For instance, Childers & Tomasello (2001) tested the ability of 2-year-old children to produce a verb-general transitive utterance with a nonce verb. They found that children were best at generalizing if they had been mainly trained on the consistent pronoun frame He’s VERB-ing it (e.g., He’s kicking it, He’s eating it) rather than on several utterances containing unsystematic correlations between the agent and the patient slots (Mary’s kicking the ball, John’s pushing the chair, etc.).
The link between our current study and the linguistic studies above on frequent frames is even stronger when considering that most of the frequent frames found by Mintz, as well as the frames used by Childers and Tomasello in their experiment, are composed of closed-class words (with low variability), and their embeddings are open-class words (which display high variability). Hence a statistical mechanism sensitive to invariant versus variable structure would be able to detect frequent frames in the input, setting the stage for more powerful processes involving syntactic categorization (Gómez & Maye, 2005). Therefore, we adapted Experiment 1 to test generalization to novel X elements.
Method
Subjects
Thirty-six undergraduate and postgraduate students at the University of Warwick participated and were paid £3 each.
Materials
Training items were identical to those in Experiment 1. The test stimuli consisted of 12 strings: six strings were grammatical and six were ungrammatical. These strings were the same as in Experiment 1, except that a novel X item always appeared in middle position, the word thusev, which had not been encountered during training.
Procedure
Six participants were recruited in each of 3 Variability conditions (set-sizes 1, 2, and 24) and for each of two Language conditions (L1, L2), resulting in 12 participants per Variability condition. Training was identical to Experiment 1 in every other way.
Before the test, participants were told that the sentences they had heard were generated according to a set of rules involving word order, and that they would now hear 12 strings, 6 of which would violate the rules. They were asked to give a “Yes/No” answer. They were also told that the strings they were going to hear might contain new words and that they should use the knowledge they had obtained about the language in the training phase to guide their decision.
Results and discussion
An analysis of variance with Variability (set-sizes 1, 2, 24) and Language (L1 vs. L2) as between-subject variables resulted in a main effect of Variability, F(2,30)=7.72, p<.01, a main effect of Language, F(1,30)=5.94, p<.05, and a Variability by Language interaction, F(2,30)=5.48, p<.01. This interaction was due to participants’ higher correct classifications for L2. Performance across the different variability conditions resulted in a U-shaped function: a polynomial trend analysis showed a significant quadratic effect, F(1,33)=10.44, p<.01. Figure 3 presents the percentage of correct classification for each of the three variability conditions.
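The quadratic trend analysis reported here amounts to testing a contrast with weights (1, -2, 1) over the three ordered condition means, which is positive exactly when the middle condition falls below the line joining the two extremes. The sketch below is ours; the per-subject scores are hypothetical, chosen only to mimic the qualitative U-shape.

```python
# Sketch: quadratic polynomial trend contrast for three ordered groups.
# Scores are hypothetical percent-correct values, invented for illustration.
groups = {
    1:  [92, 88, 75, 95, 83, 90],
    2:  [58, 66, 52, 71, 60, 63],
    24: [85, 91, 78, 88, 94, 80],
}
weights = [1, -2, 1]   # standard quadratic contrast for 3 equally spaced levels

means = [sum(v) / len(v) for v in groups.values()]
L = sum(w * m for w, m in zip(weights, means))   # contrast value

# L > 0 indicates a U-shape: the middle condition lies below the extremes.
print(f"means = {means}, quadratic contrast L = {L:.2f}")
```

In a full analysis the contrast value is divided by its standard error to yield the F (or t) statistic reported in the text; the sketch shows only the contrast itself.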
[Figure 3 goes about here]
These results further support the idea that learners in zero and high variability set-sizes are better at detecting invariant structure in the input. They learn nonadjacencies independently of the embedded X elements they have experienced and are able to detect these despite a new element in the X position. As in Experiments 1 and 2, modulation of variability is critical for detecting nonadjacencies. Notice that several types of generalization can be expected a priori even from a simple language such as the one used here. For example, if learners had detected the relative position of the elements in the strings (A elements in position 1, B elements in position 3, X elements in position 2) they may have constructed a positional generalization for X words in second position, regardless of the specific dependencies between A and B items. However, in that case they would have been expected to generalize across the board, regardless of the condition of variability, since both grammatical and ungrammatical test items respected positional information, i.e., they were all of the form AXB.
In the literature on artificial grammars, the debate on generalization has recently revolved around a distinction between statistics versus rules. On one account, a distributional analysis of the input would be responsible for low-level processes such as early segmentation of the speech stream into words, but would not be sufficient for generalization, for which algebra-like variables are needed (e.g., Marcus, 1999; Peña et al., 2002). A tenet of rule-based accounts is that generalizations ‘apply across the board’, that is, rules either apply or they do not, and learners either generalize or they fail. Without delving too deeply into the rules versus statistics discussion, the results of this experiment show that generalization to new X elements – whether statistical or in the form of an algebraic rule – is modulated by distributional information.
It must be acknowledged that learning and generalizing in the set-size 2 condition achieved performance higher than chance. In all the experiments presented here, a score of 6/12 is at chance, because participants are told that half the strings are ungrammatical. Therefore, scores higher than 6 are interpreted as learning somewhat better than chance. A close look at the individual scores of Experiment 3 reveals that only three participants in the set-size 2 condition performed with a score higher than or equal to 10 (individual scores: 6, 6, 7, 6, 11, 12, 8, 7, 3, 9, 10, 8), compared to nine participants in the set-size 1 condition (individual scores: 12, 12, 8, 7, 10, 12, 12, 12, 8, 11, 11, 12), and seven in the set-size 24 condition (individual scores: 12, 7, 11, 6, 12, 6, 12, 11, 11, 5, 11, 4). Thus while it is possible to detect nonadjacent dependencies in set-size 2 (under conditions of low variability), learning is facilitated considerably in conditions of zero or high variability. Additionally, while the individual scores show that some participants in each condition did not learn the grammar, fewer participants failed to learn in the zero and high variability conditions than in set-size 2. In the next section, we investigate whether nonadjacent learning is supported in the visual modality.
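As an aside on the chance criterion above: if each of the 12 test judgments were an independent guess with p = .5 (a simplification, since participants knew exactly half the strings were ungrammatical), the probability of reaching a given score by guessing follows a binomial distribution and can be computed directly. This calculation is ours, added for illustration.

```python
import math

def p_at_least(k, n=12, p=0.5):
    """P(score >= k) under a binomial guessing model with n trials."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Probability of scoring 10 or more out of 12 by chance alone.
print(f"P(score >= 10) = {p_at_least(10):.4f}")   # about .019
```

Under this model a score of 10/12 or better is expected in roughly 2% of guessers, which is why scores at or above 10 are treated as clear evidence of learning in the individual-score analysis.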
Experiment 4: Learning nonadjacencies in the visual modality
An ongoing question is whether statistical learning operates in different sensory modalities, and to what degree this learning is similar across modalities. Artificial grammars are well suited to tackling these questions because the same set of constraints generating the grammars can be instantiated using spoken stimuli, written words, sounds and tones, or visual shapes. Previous studies have shown that adjacent dependencies can be learned in the visual modality (Fiser & Aslin, 2002; Kirkham, Slemmer, & Johnson, 2002; Saffran, 2002). However, the constraints on nonadjacent stimuli in the visual modality are unknown at present. In Experiment 4 we tested whether learning of nonadjacent abstract shapes was facilitated under the same conditions of zero or high variability that mediated learning in our previous experiments, thus generating a similar U-shaped curve to the one obtained with spoken pseudo-words in Experiments 1 and 3.
Method
Subjects
Fifty-four undergraduate students at Cornell University participated for course credit and were randomly assigned to one of three conditions (pilot work showed that more participants would be needed to achieve sufficient power as compared to the previous experiments, and so we ran 18 participants in each condition).
language. Specific token shapes were randomly assigned to A, X, and B variables, so that each participant experienced a different instantiation of the AXB language. Twelve visual test strings were created: 6 grammatical ones maintained Ai_Bi nonadjacent relations, whereas 6 ungrammatical strings contained Ai_Bj relations (similar to Experiment 1).
[Insert Figure 4 about here]
of white screen so that the strings as a whole would be perceived as independent from each other. This method of presentation provided a visuo-spatial configuration that maintained sequentiality in the task, while increasing the chances of visual learning (Conway & Christiansen, 2005b; Saffran, 2002). Pilot studies conducted with abstract visual stimuli suggest that these types of stimuli are harder to attend to for a prolonged time, and participants tire earlier than when they attend to spoken stimuli. Therefore, we decided to reduce the training phase from 3 blocks to 2 blocks of 144 stimuli each, in order to reduce variance resulting from distraction from the task or tiredness. The training and test phases were identical to Experiment 1 in every other way.
Results and Discussion
Initial descriptive statistics revealed three outlier scores in the data, one in set-size 1 and two in set-size 24. As a consequence, we trimmed means in each condition by removing approximately 2.5% of lower and higher scores, i.e., the lowest and highest score in each condition, resulting in 16 scores per condition. An analysis of variance with Variability (set-size 1, 6, 24) as between-subject variable resulted in a marginal effect of Variability, F(2,45)=3.047, p=.057. Performance across the three variability conditions resulted in a U-shaped function: a polynomial trend analysis showed a significant quadratic effect for percent correct across the three conditions, F(1,45)=5.97, p<.02. Figure 5 presents the percentage of total correct classification for each of the three variability conditions.
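The trimming procedure described here (dropping the single lowest and single highest score in each condition, 18 → 16 retained scores) can be sketched in a few lines. The scores below are hypothetical, invented only to show the mechanics.

```python
# Sketch: symmetric trimming as described above, removing one score
# from each tail of a condition's distribution. Scores are hypothetical.
def trimmed(scores):
    s = sorted(scores)
    return s[1:-1]                     # drop the lowest and highest score

cond = [4, 7, 8, 8, 9, 9, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 12]
kept = trimmed(cond)                   # 18 participants -> 16 retained scores
mean = sum(kept) / len(kept)
print(f"trimmed mean = {mean:.2f}")
```

Trimming one score from each tail of 18 removes about 5.6% per tail; the per-condition counts above (16 retained) are what give the error degrees of freedom F(2,45) in the reported ANOVA.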
[Insert Figure 5 about here]
Learning nonadjacent dependencies visually results in a U-shaped learning curve similar to that obtained with spoken pseudo-word stimuli (Experiments 1 and 3). Because the visual stimuli were constructed to prevent a verbal encoding strategy, the results suggest that the effect of variability on nonadjacency learning is not specific to language-like stimuli but may be a more general property of sequential learning, perhaps due to similarities in the computational properties of the mechanisms involved in learning nonadjacencies within the auditory and visual modalities (see also Conway & Christiansen, 2005a).
General Discussion
Statistical learning of dependencies between adjacent elements in a sequence is fast, robust, automatic, and general in nature (for reviews see Gómez & Gerken, 2000; Perruchet & Pacton, 2006; Saffran, 2003). Such learning has been demonstrated across a variety of situations. In contrast, although the ability to track dependencies separated by one or more intervening elements is a fundamental linguistic ability, relatively little research has been directed toward this problem. Nonadjacent structure in sequential information seems harder to learn, possibly because learners have to overcome the bias to attend to adjacent transitional probabilities. In fact, a statistical learning mechanism that kept track of all possible adjacent and nonadjacent regularities in the input would quickly encounter a computationally intractable problem of exponential growth. In this paper we were interested in re-assessing the possibility that nonadjacent statistics can be detected without tracking all possible bigrams and trigrams. We presented two specific conditions in which such computations are felicitous, namely zero or large variability of intervening items. This resulted in a U-shaped curve for learning nonadjacencies, mediated by variability, suggesting that learners may be capitalizing on the most informative information. When the probability of adjacent elements is high, learners will track this structure and will only detect nonadjacent structure when there is no variability in the middle element or when variability is high (as in the set-size 24 condition). Although appealing, this principle does not explain how learners actually shift from computing adjacent versus nonadjacent statistics. Could current associative mechanisms proposed in the AGL literature provide a more fine-grained account of the mechanism involved?
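The intractability point can be made concrete: over a vocabulary of V word types there are V^k possible sequences of length k, so a learner tracking every regularity up to length n faces a hypothesis space of V + V² + … + Vⁿ candidate n-gram types, exponential in n. The toy count below is our illustration; the vocabulary sizes are arbitrary.

```python
# Toy illustration of the combinatorial explosion: how many distinct
# n-gram types a learner would need to track for vocabulary size V.
def ngram_space(V, n):
    """Count all possible sequences of length 1..n over V word types."""
    return sum(V**k for k in range(1, n + 1))

for V in (10, 100, 1000):
    print(f"V = {V:5d}: {ngram_space(V, 3):>13,} candidate 1- to 3-grams")
```

Even at trigram length the count exceeds a million candidates for a modest 100-word vocabulary, which is why selective strategies such as variability-sensitive learning are attractive.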
Standard AGL mechanisms and variability
In this section we consider what predictions current models of artificial grammar learning would make, based on a series of associative measures developed in the literature. Several studies have proposed exemplar- or fragment-based models, based on knowledge of memorized chunks of bigrams and trigrams (Dulany et al., 1984; Perruchet & Pacteau, 1990; Servan-Schreiber & Anderson, 1990; see also Perruchet & Pacton, 2006 for a recent review) and on similarity to whole items (Pothos & Bailey, 2000; Vokey & Brooks, 1992). These models essentially propose that learners acquire knowledge of fragments, chunks, or whole items from the training strings, upon which they subsequently base their decisions about grammaticality. A variety of measures of chunk strength and of the similarity between training and test exemplars have been proposed. We considered the following measures: Global Associative Chunk Strength (GCS), Anchor Strength (AS), Novelty Strength (NS), Novel Chunk Position (NCP), and Global Similarity (GS), in relation to the data in Experiments 1 and 2a. Descriptive fragment statistics for the languages used in Experiments 1 and 2a are summarized in Table 1, while associative measures are reported in Table 2.
Global associative chunk strength (GCS; Knowlton & Squire, 1994) is calculated by averaging the frequencies with which different bigrams and trigrams appear in strings. For instance, we can calculate the GCS for grammatical test items in set-size 2. The form of each test item is AiXjBi, with 3 Ai_Bi dependencies and 2 Xj elements. A specific item, for instance A1X2B1, is composed of 2 bigrams, A1X2 and X2B1, each repeated 72 times during training, and one trigram, A1X2B1, repeated 72 times. The GCS measure for this item is obtained by dividing the summed frequencies of the grams by the number of n-grams: