Pronunciation Teaching Within A Theoretical Framework English Language Essay


Although a number of studies have been reported on segmental features over the past fifteen years (for an overview, see Ekman, 2003; Strange, 1995), only a small number of studies have focused on L2 stress in an EFL context.

On the other hand, suprasegmental properties, including stress, play an important role in second language acquisition. They are shown to be closely related to foreign accent perceived in L2 production and to difficulties in L2 perception. Researchers have attributed the problems with stress to the influence of the L1 prosodic system. However, these studies are inadequate, as their focus on stress acquisition mainly relies on the comparison of the phonological systems of L1 and L2. As Flege (1987) pointed out in research on L2 speech development at the segmental level, it is important to take phonetic details into account in order to gain a better understanding of the possible transfer of L1. The same is true for studies of prosody. It is possible that the influence of L1 lies in the difference between L1 and L2 in the employment of relevant phonetic correlates.

2.2. The history of pronunciation teaching within a theoretical framework

Popular opinion regarding the place of pronunciation training in the ESL or EFL curriculum has ebbed and flowed along with the historical framework of language learning theories and methodologies. Prior to the popularity of the direct method in the late nineteenth century, pronunciation received little overt focus within the language classroom.

Advocates of the direct method claim that an initial emphasis on listening without pressure to speak allows learners to acquire grammar inductively and to internalize the target sound system before speaking, much the way children acquire their first language (Celce-Murcia, Brinton, & Goodwin, 1996, as cited in Aufderhaar, 2004).

Although popular in elite private European schools, the direct method was rejected by the public schools and by most language schools in the United States as impractical due to the classroom time, effort, and background required of both teacher and students for the success of this approach. Criticism centered on the time-consuming nature of this instruction at a time when most students studied a foreign language for only two years, along with a lack of qualified teachers who had a comfortable, native-like fluency at their command. As a result, this essentially intuitive-imitative approach gave way to the return of the grammar-translation approach of the reading era, with very little attention to pronunciation (Celce-Murcia et al., 1996, as cited in Aufderhaar, 2004).

According to Aufderhaar (2004), both the direct and grammar-translation methods received renewed attention when the advent of World War II created a sudden and urgent need for qualified interpreters and intelligence personnel to learn English. Rooted in Skinner's (1957) theory of behaviorism, which treated the acquisition of verbal skills as environmentally determined stimulus-response behavior, the audiolingual method required intensive oral drilling for entire working days, six days a week (E. R. Brown, 1997). In contrast to the grammar-translation method, pronunciation was now considered to be of the highest priority, with phonetic transcription and articulation explicitly taught through charts and demonstrations, along with imitation (Celce-Murcia et al., 1996, as cited in Aufderhaar, 2004).

While generally proving successful within the military environment of small classes of highly motivated instructors and students, whose well-being depended in part on their command of the target language, the theoretical foundation of audiolingualism was shaken by the reality of the post-World War II language classroom, which was not conducive to this military regimen. Its strongest critic was Chomsky (1957), whose introduction of generative-transformational theory viewed the underlying meaning of the whole as more important than any one part. His focus on the creative, rule-governed nature of competence and performance led many educators to conclude that pronunciation should be taught inductively within the context of morphology and syntax (Kreidler, 1989). At the heart of this hypothesis was the suggestion that all language skills, including listening comprehension, verbal production, and pronunciation, are so integrated that there is no need to address them as separate and distinct features (Brown, 1997).

The influence of Chomsky's generative-transformational theory, along with the cognitive-code theory of the 1960s, which focused on listening at the discourse level and discarded skill ordering, paved the way for the trend to avoid or ignore direct pronunciation teaching altogether. The advent of the communicative approach in the late 1970s and early 1980s likewise deemed the teaching of pronunciation ineffective and hopeless; instead, it emphasized language functions over forms, with the goal being overall communicative competence and listening comprehension for general meaning. MacCarthy (1976) stated that "at present any teaching of pronunciation is so ineffective as to be largely a waste of time" (p. 212). At that time, many instructors of the communicative approach assumed that pronunciation skills would be acquired naturally within the context of second language input and communicative practice.

However, pronunciation was not entirely ignored in the time period of the 1960s through the mid 1980s. Remnants of the audiolingual approach lingered within structural linguistics, which viewed language learning as a process of mastering hierarchies of structurally related items for encoding meaning (Morley, 1991). When pronunciation was addressed, instruction was generally oriented toward the drilling of individual sounds via articulatory descriptions and minimal pair contrasts (Chun, 2002).

It is the reliance on this traditional phonemic-based approach that Leather (1987) cites as one of the reasons for the demise of pronunciation teaching during this era: "The process, viewed as meaningless non-communicative drill-and-exercise gambits, lost its appeal; likewise, the product, that is, the success ratio for the time and energy expended, was found wanting" (Morley, 1991, p. 486). Attitudes ranged from serious questioning as to whether pronunciation could be overtly taught and learned at all (Chun, 2002) to unwavering claims that adults were simply unable to acquire second language pronunciation (Scovel, 1988).

According to Madsen and Bowen (1978), the lack of attention to pronunciation prevalent in the communicative approach of the late 1970s and early 1980s, together with the direct assertion by many that pronunciation could not be taught, resulted in a great number of international students who failed to communicate effectively or even intelligibly despite long periods of instruction. This situation sparked research in second language acquisition that suggested a departure from the traditional, bottom-up phonemic-based approach to pronunciation teaching toward a top-down orientation focusing on suprasegmental or prosodic aspects such as rhythm, intonation, and duration.

Defined by Wennerstrom (2001, as cited in Aufderhaar, 2004) as "a general term encompassing intonation, rhythm, tempo, loudness, and pauses, as these interact with syntax, lexical meaning, and segmental phonology in spoken texts" (p. 4), prosody has historically been ignored or relegated to the fringes of research and pedagogy, due in large part, according to Chun (2002), to its inherent complexity and the difficulty of mastering it. Bolinger (1972) labeled the most controversial and notoriously hard-to-define aspect of prosody, intonation, the "greasy part of language."

Despite its historical back-seat status, an undercurrent of research regarding prosody has spanned several disciplines. The first documented study of speech melody has been traced back to Steele (Couper-Kuhlen, 1993, as cited in Aufderhaar, 2004), who, in 1775, used musical notation to identify pitch variations that occur in regular forms upon syllables. Unfortunately, his materials, based on five features he identified as accent, quantity, pause, emphasis, and force, depended upon fixed and absolute musical pitches rather than flexible and relative tones, and thus lacked practical applicability (Pike, 1945).

2.3. Pronunciation research in applied linguistics

Although attaining native-like pronunciation that facilitates mutual intelligibility is considered important by many language learners and teachers alike, there have been few empirical studies of pronunciation in applied linguistics (Derwing & Munro, 2005; Levis, 2005). For example, Derwing and Munro (2005, p. 386) state that "it is widely accepted that suprasegmentals are very important to intelligibility, but as yet few studies support this belief." This claim is echoed by other researchers such as Hahn (1994) and Levis (2005), the latter noting that over the past 25 years there has been encouragement to teach suprasegmentals even though very little pedagogy has been based on empirical research.

The usefulness of empirical research for developing more effective pronunciation teaching is obvious. As Levis (2005) states, "instruction should focus on those features that are most helpful for understanding and should deemphasize those that are relatively unhelpful" (pp. 370-371). Munro (2008) echoes this point when stating that "it is important to establish a set of priorities for teaching. If one aspect of pronunciation instruction is more likely to promote intelligibility than some other aspect, it deserves more immediate attention" (p. 197). Of course, we must first know what the most important elements are to ensure optimal instruction and learning outcomes. As Munro (2008) argues, "because prosody encompasses a wide range of speech phenomena, further research is needed to pinpoint those aspects of prosody that are most critical" (p. 210).

Hahn (2004, p. 201) agrees that there is little empirical support for claims that teaching suprasegmentals is helpful and that "knowing how the various prosodic features actually affect the way native speakers…process nonnative speech would substantially strengthen the rationale for current pronunciation pedagogy." For that reason, Hahn (2004) reiterates that it is important to identify the phonological features that are most salient for native listeners. Due to the complex relationship between suprasegmentals and intelligibility, Hahn (2004) argues that "it is helpful to isolate particular suprasegmental features for analyses" (p. 201). Hahn's argument supports the importance of the research in this dissertation in which the acoustic correlates of English lexical stress are isolated and manipulated individually to identify which are the most pertinent to the perception of speech intelligibility and nativeness.

Levis (2005) states that pronunciation teaching has been a study in extremes in that it was once considered the most important aspect of language learning (when audiolingual methods were favored) and then became very much marginalized in communicative language teaching. Of the research that has been carried out, such as that on intonation patterns, little of it finds its place in pronunciation textbooks (Derwing, 2008; Derwing & Munro, 2005; Levis, 2005; Tarone, 2005). Therefore, there is a need to first fill a gap in empirical research treating aspects of second language pronunciation and then to ensure that these findings are relayed to professionals in the fields of education and applied linguistics so that L2 students can benefit from these findings.

Once a general framework for the delivery of instruction is chosen, the next step in designing a course of any type is to consider the needs and desires of the students and create course objectives and learning outcomes. As stated earlier, ESL students are typically concerned with issues such as intelligibility, accent and nativeness. Students often voice their goals regarding attaining proficiency in these areas and teachers should consider which goals are realistic (Avery & Ehrlich, 1992). To do so, the students' current abilities must be assessed in order to target strategies that will help achieve these goals.

Assessing students' abilities is crucial in planning pronunciation teaching. Derwing (2003; 2008) stresses that each student should be assessed individually to identify the student's strengths and weaknesses and determine individual needs in pronunciation. These assessments can be done in a formal or informal way by the teacher and can include self-reports or self-assessments by the students. Self-assessments by students can provide insight into the students' perceived needs, although these needs may be biased by the students' previous experience with pronunciation instruction. Derwing (2003) found that "of the pronunciation problems identified [by the students], roughly 79% were segmental [in nature], while only 11% were related to prosody" (p. 554). In other words, students are simply more aware of segmental elements than of prosodic ones, owing to more previous instruction on segmental elements.

Once evaluations have been completed, the question becomes how to address the language learners' pronunciation issues. A complication arises at this point because students in ESL classes typically come from very mixed language backgrounds. Even the varying needs of students in EFL classrooms, where all learners are from the same native language background, can be challenging as individual students have individual needs.

Therefore, integrating pronunciation lessons into class activities can be challenging in ESL classrooms as a particular speaker (or group of speakers) may have little difficulty with a particular element of pronunciation while others have great difficulty. A well-known example is Japanese speakers' difficulty acquiring /r/ and /l/ (Bradlow, 2008) which does not cause any trouble for Spanish speakers. As Derwing (2003) advises, focusing heavily on segmental instruction in mixed classrooms is inappropriate due to the variety of language backgrounds and, therefore, prosody should be emphasized as it can have greater importance for a larger diversity of students. Derwing (2008) also argues that instruction in prosody transfers better to spontaneous speech than instruction on segmentals.

Many instructors are reluctant to teach pronunciation and are often unsure how to go about doing it (Derwing & Munro, 2005; Hewings, 2006), as they feel underprepared or have little support in terms of course materials. Derwing (2003) estimates that only about 30% of pronunciation teachers have formal linguistic training in pronunciation pedagogy. To address this issue, it is important that empirical research on pronunciation be conveyed in a clear manner to language teachers so that they can pass this information along to students.

To be sure, pronunciation should be considered an important element of ESL classroom instruction. It has been noted above that pronunciation is implicated in critical elements of communication such as speech intelligibility and can also affect the perception of nativeness. In addition, accurate pronunciation is critical for students needing to pass standardized English tests such as the Test of English as a Foreign Language (TOEFL) and the International English Language Testing System (IELTS) for entrance into colleges and universities in English-speaking countries, or when interviewed by entities such as the Foreign Service Institute, which assesses not only a person's grammar and vocabulary but also comprehension, fluency, and accent in oral interviews (Varonis & Gass, 1982).

Pronunciation is also a key element in programs that prepare international teaching assistants to become teachers in American classrooms (Hahn, 2004; Wennerstrom, 1998).

2.4. The reasons for teaching pronunciation

One of the most urgent reasons for effective pronunciation instruction centers on the large number of non-native English speakers attending American colleges and universities. According to the Institute of International Education, these students numbered 547,867 in the 2000/2001 school year, with a substantial number serving as graduate teaching assistants. The increase in the hours of classroom instruction given by non-native speakers has led to a corresponding decrease in student satisfaction with the quality of instruction, due mainly to the reported difficulty following non-native classroom presentations (Ostrom, 1997, as cited in Aufderhaar, 2004).

A survey by Shaw (1985, as cited in Aufderhaar, 2004) revealed that having an instructor with foreign-accented speech ranked highest among six areas of potential frustration for college students. Similarly, earlier studies by Hinofotis and Bailey (1980) on non-native university teaching assistants revealed a threshold level of understandable pronunciation in English, below which the non-native speaker will not be able to communicate orally regardless of his or her level of control of English grammar and vocabulary. While some instructors and administrators within the field have historically dismissed these problems simply as a matter of not having enough exposure to the spoken target language (Moy, 1986), other well-meaning instructors attempting to address this need have often relied on minimal pair drills, repetition, and articulatory instruction with poor results (MacDonald, Yule, & Powers, 1994).

According to Aufderhaar (2004), the research in second language acquisition that suggested a departure from the traditional, bottom-up phonemic-based approach toward a top-down orientation emphasizing suprasegmental or prosodic aspects such as rhythm, intonation, and duration also revealed a need to increase adult learners' awareness of the suprasegmental patterns of the target language at the discourse level.

Chun (2002) advocates five principles for teaching intonation, including sensitization, explanation, imitation, practice activities, and communicative activities, and stresses the need for focused listening practice requiring the identification of suprasegmental features within a context of various authentic speech samples representing different speaker roles and relationships.

2.5. The sound system of English

According to the Contrastive Analysis Hypothesis (CAH), the features that differ between languages are the main source of errors. Lado (1957, as cited in Gass & Selinker, 2008, p. 96) claims that "those structures that are different will be difficult because when transferred they will not function satisfactorily in the foreign language and will therefore have to be changed". In order to understand the role of the first language in the phonological acquisition of the second language, emphasis has been given to studies that have focused on the differences between the English and Persian phonological systems. As Celce-Murcia, Brinton, and Goodwin (1996) state:

"all languages are unique in terms of their consonant and vowel systems. In linguistics, these distinctive characteristics have been divided into segmental and suprasegmental features. The segmental features of a language relate to consonants and vowels, whereas suprasegmental aspects of a language are involved with word stress, intonation, and rhythm" (p. 35).

2.5.1 English Consonants and Vowels

Standard American English includes 24 consonants and 22 vowels and diphthongs; however, a study of American English asserted that "there are similarities among consonants that permits us to classify them into groups; the classification can be done according to various criteria" (Olive, Greenwood, & Coleman, 1993, p. 22). The authors suggested that consonants could be classified based on voicing, place, and manner of articulation; therefore, consonants sharing common characteristics, such as their location of articulation inside the mouth, can be grouped together (Olive et al., 1993, p. 22). Table 2.1 presents the English consonants.

Table 2.1. English Consonants

The most common vowels in English have been classified in accordance with how the tongue shapes them, and "while the consonant sounds are mostly articulated via closure or obstruction in the vocal tract, vowel sounds are produced with a relatively free flow of air" (Yule, 2006, p. 38). Therefore, vowels can be classified based on the movement of the tongue, lips, and jaw. The vowels of English have been characterized as low, mid, or high, which describe the height of the tongue, whereas features such as front, central, or back refer to the position of the tongue inside the mouth (Barry, 2008, p. 21).

Table 2.2. English Simple Vowels

2.6. The Pronunciation Errors of Persian Speakers and the Negative Transfer of Learned L1 Habits into English

Major (2001) addressed issues in L2 phonology and how L1 phonological features can be transferred to the L2 when the sound patterns and word stress of the L2 differ from those of the L1. A foreign or nonnative accent can be detected more easily in a formal and longer conversation, because in a short conversation the speaker can produce words or sounds that are similar to the L2 in terms of segmental and suprasegmental features. Thus, "the overall impression concerning native speakers of whether or not and to what degree a person sounds native or nonnative is called global foreign accent" (Major, 2000, p. 19). The measurement of global foreign accent is essential as it indicates at what stage of language development pronunciation is acquired.

Moreover, Nation and Newton (2009) stated that the goal of pronunciation instruction is to increase the intelligibility of second language speakers, although factors such as age, L1, perspectives, and attitudes of the learner can affect the learning of the second language phonological system. "There is clear evidence that there is a relationship between the age at which a language is learned and the degree of foreign accent" (Patkowski, 1990, as cited in Nation & Newton, 2009, p. 78). Indeed, pronunciation has been identified as one of the important aspects of second language acquisition, as it plays a crucial role in spoken conversational interactions and intelligibility.

Although some studies indicated that it is impossible for adult learners to acquire native-like pronunciation, Boudaoud and Cardoso's (2009) study suggested that learners' proficiency level in English could affect their pronunciation. They compared the phonological features of Persian with those of four languages (Spanish, Japanese, Portuguese, and Arabic) and asserted that these languages prevent their speakers from producing /s/-initial consonant clusters when learning English. The study focused on four research questions related to the production of such clusters by Persian speakers and the factors that affect the acquisition of English as a second language. The findings indicated that /st/ and /sn/ were more difficult to produce than /sl/ and suggested that error production decreased as proficiency level increased.

Furthermore, Paribakht (2005) "examined the relationship between first language (L1; Persian) lexicalization of the concepts represented by the second language (L2; English) target words and learners' inferencing behavior while reading English texts" (p. 701). This study emphasized the pronunciation errors that English majors produce in Iran when they read English texts. The study asserted that students' errors in reading stemmed from their lack of knowledge in English vocabulary rather than the inability to produce the English sound system. The research questions examined whether lexicalization helped students identify the meaning of unfamiliar words. The findings also showed that students relied on their L1 when they were not provided with the synonym of an unfamiliar word.

Sadeghi (2009) focused on "collocational differences between the L1 and L2 and [suggested] implications for EFL learners and teachers" (p. 100). This study addressed the errors that Iranian EFL students make when they learn English and stated that these errors stemmed from the differences between Persian and English. The study compared Persian and English collocations and focused on the transfer of L1 habits into the L2. The aim of the study was to find out whether students made the same errors regardless of their proficiency level in English. Lower-level students tend to transfer L1 habits into the L2 more frequently as a result of their lack of knowledge of the target language. However, the transfer of Persian vowels and diphthongs into English pronunciation can also be observed in advanced learners of English.

Research on the differences between the English and Persian phonological systems provides a general overview of the difficulties ESL students may encounter when teachers focus on pronunciation, intonation, and word stress.

2.6.1. Common consonant errors of Iranian EFL learners

Persian speakers tend to place a vowel after each consonant in a cluster; therefore, the following errors can be predicted when Persian speakers pronounce English words: bread, script, and scramble are pronounced as [bɛɹɛd], [ɛskiɹipt], and [ɛskɛɹæmbɛl]. Furthermore, according to the contrastive analyses of English and Persian conducted by Yarmohammadi (1969, 1996) and Wilson and Wilson (2001), the following negative transfer of learned L1 habits into English can be expected from Persian speakers of English.

1. Stop consonants such as /p/, /b/, /t/, /d/, /k/, /g/ are articulated with a stronger puff of air. /k/, /p/, /g/ and /t/ become aspirated when they are placed in coda position. Words such as bank, tap, king, and rest are pronounced as [bænkʰ], [tæpʰ], [kɪngʰ], and [ɹɛstʰ].

2. Fricatives are substituted: /θ/ is replaced by /s/ or /t/, and /ð/ by /z/ or /d/; word-initial /s/ in a cluster gains an epenthetic vowel ([ɛs], since Persian permits no initial consonant cluster); and the glide /w/ is replaced by /v/. West, three, father, and school are pronounced as [vɛstʰ], [sɛɹi] or [tɛɹi], [fαdɛɹ], and [ɛskul].

3. The nasal consonant /ŋ/ is articulated as the sequence /n/ + /g/; therefore, sing is pronounced as [sɪngʰ]. /m/ and /n/ are also articulated with a stronger puff of air and may sound like /ɛm/ and /ɛn/.

4. The lateral liquid /l/ can be pronounced with a stronger puff of air, sounding like /ɛl/, when it is placed at the end of a word such as tell.

5. The retroflex liquid /ɹ/ is trilled, produced with a vibration of the tongue.

6. The glide consonant /w/ is replaced by /v/, since /w/ does not exist in the Persian consonant inventory. Therefore, flower is articulated as [fɛlavɛɹ].
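The predicted substitutions above can be collected into a toy transcription sketch. This is only an illustration under simplifying assumptions: one substitution variant per sound is shown, aspiration and cluster-internal epenthesis are not modeled, and the function name and symbol inventory are invented for this example, not taken from the cited contrastive analyses.

```python
# Hypothetical sketch of some predicted Persian L1-transfer effects listed above.
# Not a phonological model; one variant per rule is shown for illustration.

PREDICTED_SUBSTITUTIONS = {
    "w": "v",   # rule 6: /w/ does not exist in Persian
    "θ": "s",   # rule 2: /θ/ -> /s/ (the /t/ variant is not modeled)
    "ð": "d",   # rule 2: /ð/ -> /d/
}

def predict_persian_accent(ipa_word):
    """Apply the substitutions, then break a word-initial /s/+consonant
    cluster with an epenthetic [ɛ] (rule 2: no initial clusters in Persian)."""
    vowels = set("aeiouɛæɪʊʌα")
    out = "".join(PREDICTED_SUBSTITUTIONS.get(ch, ch) for ch in ipa_word)
    if ipa_word.startswith("s") and len(ipa_word) > 1 and ipa_word[1] not in vowels:
        out = "ɛ" + out   # e.g., school /skul/ -> [ɛskul]
    return out
```

For example, `predict_persian_accent("skul")` yields "ɛskul" and `predict_persian_accent("wɛst")` yields "vɛst", matching the (unaspirated portion of the) forms cited in the text.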

2.6.2. Common vowel errors of Iranian EFL learners

According to the contrastive analysis of English and Persian conducted by Yarmohammadi (1969, 1996) and Wilson and Wilson (2001), the following negative transfer of learned L1 habits into English can be expected from Persian speakers of English:

1. /ɛ/ and /æ/ can substitute for one another; therefore, [bæt] is articulated as [bɛt].

2. /α/ replaces /ʌ/. [lʌk] is articulated as [lαk].

3. /ʊ/ replaces /u/. [ful] is pronounced as [fʊl].

4. /ɪ/ replaces /i/. [bit] is articulated as [bɪt].

5. /j/ replaces /i/ if placed in initial position. [twin] is articulated as [tujin].
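The first four vowel shifts can be sketched as simple one-to-one replacements on IPA strings; this is a minimal illustration only (rule 5, the glide insertion, is omitted because it is not a plain substitution, and the mapping directions follow the bracketed examples given above):

```python
# Illustrative sketch of vowel substitutions predicted for Persian speakers.
# Directions follow the examples in the list above; rule 5 is not modeled.

VOWEL_SHIFTS = {
    "æ": "ɛ",   # rule 1: bat [bæt] -> [bɛt]
    "ʌ": "α",   # rule 2: luck [lʌk] -> [lαk]
    "u": "ʊ",   # rule 3: fool [ful] -> [fʊl]
    "i": "ɪ",   # rule 4: beat [bit] -> [bɪt]
}

def shift_vowels(ipa_word):
    """Replace each vowel character according to the predicted shifts."""
    return "".join(VOWEL_SHIFTS.get(ch, ch) for ch in ipa_word)
```

Applied to the examples in the list, `shift_vowels("bæt")` gives "bɛt" and `shift_vowels("bit")` gives "bɪt".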

2.7. The Importance of suprasegmentals and stress in L2 acquisition

2.7.1 The importance of suprasegmentals

Pronunciation is always a difficult step in learning a second or foreign language, especially for adults. Learners may have acquired perfect reading and writing skills while still being unable to communicate functionally in L2.

Problems in pronunciation can be traced to segmental as well as suprasegmental difficulties. Although most previous research has been conducted at the segmental level, recent studies show that suprasegmentals may play a more important role than segmentals in the acquisition of a second language phonological system (Anderson, Johnson & Koehler, 1992; Derwing, Munro & Wiebe, 1998). Anderson et al. (1992) investigated nonnative pronunciation deviance at three different levels: syllable structure, segmental structure, and prosody. The correlation between the actual deviance at the three levels and nonnative speakers' performance on the Speaking Proficiency English Assessment Kit (SPEAK) Test was calculated. It was shown that while all three areas had a significant influence on pronunciation ratings, the effects of the prosodic variable were the strongest.

In Derwing, Munro, and Wiebe's (1998) study, native speakers were invited to evaluate the results of three types of instruction (segmental accuracy; general speaking habits and prosodic factors; and no specific pronunciation instruction) after a 12-week pronunciation course. Three groups of ESL learners, one per treatment, were recorded reading sentences and narratives at the beginning and end of the course. Both groups that received pronunciation instruction showed significant improvement in sentence reading. However, only the second group, whose instruction included prosodic factors, showed improvement in accentedness and fluency in the narratives.

In Johansson's (1978, as cited in Wang, 2009) study of Swedish-accented English speech, segmental and non-segmental errors were compared in terms of accentedness scores. Native English judges were presented with two kinds of production, those with native English intonation but segmental errors on the one hand, and those with nonnative intonation (Swedish-accented) but no segmental errors on the other. Higher scores were assigned to productions with native-like suprasegmental characteristics but poor segmentals.

In a more recent study, Munro (1995, as cited in Wang, 2009) used low-pass filtered English speech produced by Mandarin speakers for accent judgment. Untrained native English listeners were invited to rate the speech samples. It was found that non-segmental factors such as speaking rate, pitch patterns and reduction contribute to the detected foreign accent in Mandarin speakers' production and that their foreign accent can be detected based solely on suprasegmental information.

In addition, some recent studies have, therefore, focused on stress production with nonce words of English. For example, Pater (1997, as cited in Altmann, 2006) investigated the stress placement patterns for English nonce words by both English native speakers and French learners of English. While this study varied syllable weight within words, it used a rather small set of items. The native English speakers exhibited a stress placement pattern that was basically identical to the Latin stress rule (i.e., stress the penult if heavy; if the penult is light, stress the antepenult). The French L2 learners, however, used one of two strategies: 1) stress the leftmost syllable (quantity-insensitive approach), or 2) stress the leftmost heavy syllable (quantity-sensitive approach). This pattern is striking in that the French learners applied neither an L1 nor a target language strategy. That is, they preferred to stress words closer to the beginning than English native speakers did and ignored the French pattern which makes the final syllable prominent, thereby 'missetting' the stress parameter for English which, according to the English control group, requires stress to be placed on the rightmost possible non-final syllable.
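The Latin stress rule attributed to the native English control group above can be sketched as a small procedure. This is an illustration only: the function name is invented, and the handling of words shorter than three syllables is an assumption added for completeness, not part of the rule as stated.

```python
# Illustrative sketch of the "Latin stress rule" described in the text:
# stress the penult if it is heavy; if the penult is light, stress the antepenult.

def latin_stress(syllable_weights):
    """Return the index of the stressed syllable.

    syllable_weights: list of booleans, one per syllable, True = heavy.
    Assumption (not in the rule as stated): words of one or two
    syllables receive initial stress.
    """
    n = len(syllable_weights)
    if n <= 2:
        return 0                  # assumed default for short words
    penult = n - 2
    if syllable_weights[penult]:  # heavy penult attracts stress
        return penult
    return n - 3                  # light penult: stress the antepenult
```

For a three-syllable word with a heavy penult, `latin_stress([False, True, False])` returns 1 (the penult); with a light penult, `latin_stress([False, False, False])` returns 0 (the antepenult).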

Archibald (1998, as cited in Altmann, 2006) further explored the nature of the English stress rule by systematically testing English native speakers. He found a tendency to stress the initial syllable for most items, which was not necessarily the rightmost possible non-final syllable (e.g., aconvent, indumbine). In some cases, however, the majority of native speakers favored final stress (burgee, nidus). These mixed results might have been due to the small number of subjects (only five), or to the fact that some items used in this study might have been too similar to existing words and thus triggered analogous stress patterns.

Lehiste and Fox (1996) studied duration and amplitude independently in sequences of reiterant speech. English and Estonian listeners answered differently when asked which syllable in a sequence was the most prominent. When the experiment was repeated with reiterant nonspeech signals, this difference was even more pronounced: English listeners were more sensitive to amplitude cues, whereas Estonian listeners were more sensitive to duration cues. Lehiste and Fox suggested this difference might reflect the listeners' different language backgrounds. Estonian is a quantity-sensitive language; therefore, the Estonian listeners would presumably rely more on duration cues. However, the authors did not explain at length why the English listeners were more sensitive to amplitude cues. Finally, the authors suggested that the results from the nonspeech condition might indicate that speech experience influences general auditory perception.

A cross-language study conducted by Dupoux, Pallier, Sebastian, and Mehler (1997) found that native speakers of French, a language with fixed word-final stress, have difficulties discriminating nonwords that differ only in the position of stress (e.g., [va´suma] vs. [vasu´ma] vs. [vasuma´]). By contrast, native speakers of Spanish have no such difficulties, as stress is contrastive in their language. On the basis of this finding, the authors argue that French listeners are "deaf" to stress contrasts because French, unlike Spanish, does not have lexical stress. Subsequently, Peperkamp and Dupoux (2002) proposed a typology of stress deafness by testing stress perception in adult speakers of several languages: French, Finnish, Hungarian, and Polish. Speakers of some languages showed more robust "stress deafness" effects than speakers of others. French speakers exhibited the strongest stress-deafness effect of all the groups, as French is a non-stress language, whereas Spanish speakers scored significantly lower than speakers of the other languages on the stress-deafness index, since Spanish, like English, is a stress language.

Finally, in a series of L2 studies, Archibald (1992, 1993, 1997, as cited in Altmann, 2006) found that linguistic experience influences not only speech perception but also speech production. He stated that speakers of stress languages are more likely to show patterned stress behavior than speakers of non-stress languages. In his studies, he proposed that the errors made by native Polish speakers (Archibald, 1992) and native Spanish speakers (Archibald, 1993) were due to transfer from their native language (i.e., first language, L1) systems. Archibald (1992, as cited in Altmann, 2006) investigated the acquisition of English stress patterns by adult L2 learners by examining adult native Polish speakers' production and perception of English stress patterns. The production task was to read English words in isolation and in sentences, and the perception task involved an identification paradigm in which participants listened to English words and identified the stress placement. He observed that Polish speakers transferred their L1 metrical stress pattern (i.e., primary stress always falls on the penult) to the production and perception of two-syllable English words. For example, Polish learners of English tended to produce words such as 'mainTAIN' and 'apPEAR' as 'MAINtain' and 'APpear,' respectively. In a later study, Archibald (1993) found that, as with the Polish learners, Spanish learners of English transferred the stress patterns of their L1 when producing and perceiving English words. Since Polish and Spanish, like English, are both stress languages, learners of English from both L1s demonstrated patterned stress behavior, and their performance showed consistent patterns influenced by their native language.

However, a different scenario arises when learners' native language is not a stress language. Archibald (1997, as cited in Altmann, 2006) had Chinese learners of English and Japanese learners of English participate in a study with the same task paradigm and stimuli as his previous studies (1992, 1993). Both are non-stress languages: Chinese is a tone language and Japanese is a pitch-accent language. He found that both language groups had difficulty placing stress correctly in English words and that the errors they made showed no readily discernible pattern.

This observation was explained later by Wayland, Guion, and Landfair (2006). Instead of looking only at L2 learners' production and perception of two-syllable English words, their study also examined the influence of syllabic structure, lexical class, and the stress patterns of known words on the acquisition of the English stress system. Ten native Thai learners of English participated in this study. Participants were asked to produce and give perceptual judgments on 40 English nonwords of varying syllabic structures in noun and verb sentence frames (i.e., 'I'd like a ___,' 'I'd like to ___'). In the production task, participants said each nonword in both frame sentences. Their production data were coded for first- or second-syllable stress by a trained phonetician and a native English speaker. In the perceptual task, the same 40 nonwords were produced with stress on the initial and final syllable in each carrier frame 'I'd like a ____' and 'I'd like to ____.' Participants listened to the prerecorded phrases in pairs that varied only in the stress placement on the nonwords (e.g., 'I'd like a TOOkips' vs. 'I'd like a tooKIPS') and indicated which sentence sounded more like a real English sentence to them. The results for both tasks showed that the subjects' performance was influenced by their native language. Among the three factors examined, Thai learners' stress assignment on nonwords was significantly influenced by the stress patterns of phonologically similar known words. The authors explained that speakers of non-stress languages may rely more heavily on word-by-word learning of stress patterns and are less likely to abstract generalizations about stress placement from syllabic structure and lexical class, since tone is a lexical property and thus has to be acquired item by item.

It might be reasonable to assume that native speakers of other tonal languages, such as Mandarin Chinese, Cantonese, and Vietnamese, would use a similar approach when acquiring the English stress system. In the appendix of their study, the authors pointed out that, unlike in the production task, in perceptual judgments the Thai participants appeared to prefer final stress over initial stress regardless of syllable structure or lexical class. The authors suggested individual variation as an explanation, since the data could not be easily explained by any participant's language background information.

2.8. Speech learning models

Linguists have long believed that perception and production of foreign speech are influenced by a listener's native language (Sapir, 1921; Polivanov, 1974; Abramson & Lisker, 1970, as cited in Wang, 2008). The influence of the native language system on the segmental level has been widely studied (Goto, 1971; Best & Strange, 1992; Best, 2001; Werker et al., 1981; Werker & Tees, 1984, as cited in Wang, 2008).

2.8.1 Native language magnet model

Kuhl (1993, 2000) proposed the Native Language Magnet (NLM) model. NLM holds that infants are equipped at birth with a discriminative ability to categorize phonetic units. They make use of pattern information and the statistical properties of the language input in speech learning. Through language development, an individual's perception is gradually distorted by his/her language experience (Iverson & Kuhl, 1996) and the acoustic dimensions underlying speech are warped (Kuhl, 2000, as cited in Wang, 2008). With more input from their native language, infants gradually develop acoustic prototypes for native phonemic categories. However, according to Wang (2008), in L2 speech learning such acoustic prototypes for non-native categories are not created, due to insufficient relevant acoustic experience. Our native language acts as a filter and changes what we attend to in speech perception. The acoustic space is expanded or shrunk to highlight the contrasts in the native language. This language-specific filter makes L2 speech learning much more difficult because we may not be aware of dimensions of speech that are not important in L1 learning.

Iverson and Kuhl (2003) used synthesized speech stimuli to study the perception of L2 sound contrasts. Japanese speakers were compared to native English speakers in the perception of English syllables beginning with /r/ and /l/. The stimuli were systematically manipulated for two acoustic cues, F2 and F3. There were three steps of F2 change and six steps of F3 change, producing a total of eighteen stimuli (3 steps of F2 × 6 steps of F3). Native American English listeners perceived the 18 stimuli as instances of either /r/ or /l/. The 18 stimuli, spaced equally in terms of the two acoustic cues, were not spaced equally in the perceptual map of the American listeners. The so-called magnet effects and boundary effects were observed for them. The magnet effect refers to the shrinking of perceptual space around good instances of the /r/ or /l/ categories: allophonic or free variants of either /r/ or /l/ are perceived as close to the prototypical /r/ or /l/ even though their actual acoustic distances to the prototypes are greater. The boundary effect, on the other hand, refers to the stretching of perceptual space at the division between the two categories: an instance of /r/ and an instance of /l/ that are acoustically closer to each other than to their respective prototypes can nevertheless be perceived as two totally different segments (perceptually far apart). Thus, around the boundary of the two segments, the perceptual division is exaggerated despite the acoustic similarity.
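The stimulus design can be sketched as a simple grid computation. The following is a minimal illustration with invented F2/F3 frequency ranges (the original study's exact values are not reproduced here); it only shows how 3 equally spaced F2 steps crossed with 6 equally spaced F3 steps yield 18 acoustically equidistant stimuli.

```python
# Sketch of an Iverson & Kuhl (2003)-style stimulus grid. The Hz ranges
# below are illustrative assumptions, not the study's actual values.

def make_grid(f2_range=(1000, 1400), f2_steps=3,
              f3_range=(1600, 3400), f3_steps=6):
    """Return a list of (F2, F3) pairs spaced equally in Hz."""
    def spaced(lo, hi, n):
        step = (hi - lo) / (n - 1)
        return [lo + i * step for i in range(n)]
    return [(f2, f3) for f2 in spaced(*f2_range, f2_steps)
                     for f3 in spaced(*f3_range, f3_steps)]

stimuli = make_grid()
print(len(stimuli))  # 18 stimuli, equally spaced acoustically
```

The point of the grid is that equal acoustic spacing is the input; the perceptual "warping" (magnet and boundary effects) is what listeners impose on it.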

Japanese listeners showed a perceptual map that was totally different from that of the native American English speakers. First, they seemed to differ from the American listeners in the acoustic dimensions they attend to in perception: while the native listeners were most sensitive to F3 cues, the Japanese listeners were sensitive to F2 cue variation. Second, no magnet or boundary effect was observed for Japanese listeners on the F3 dimension. Only one sound category emerged from the Japanese perceptual map.

This study shows that acoustic dimensions are perceptually distorted by native speakers in speech learning to maximize the difference between contrasts in their native language. The distorted perceptual map, once formed, makes L2 speech learning more difficult. In other words, if there is an L2 contrast that does not exist in L1, it is very hard to create a perceptual map for that contrast, as the Japanese listeners' results show. It can also be inferred from this model that even if L1 and L2 share a similar contrast, the perceptual maps may not be the same in the two languages, because speakers may rely on different acoustic dimensions in perception, which still makes it difficult for an L2 learner to form accurate categorizations.

2.8.2 Perceptual assimilation model

Best (1995, 2001) proposed an L2 speech learning model, the Perceptual Assimilation Model (PAM). This model explicitly draws on Articulatory Phonology and argues that listeners discriminate the speech signal based on information about articulatory gestures (e.g., Fowler, Best, & McRoberts, 1990; Browman & Goldstein, 1992). These gestures, in turn, "are defined by the articulatory organs (active articulators, including laryngeal gestures), constriction locations (place of articulation), and constriction degree (manner of articulation) employed" (Best et al., 2001, p. 777).

PAM proposes that listeners' native knowledge, whether implicit or explicit, has a strong effect on the perception of non-native speech: listeners have a strong tendency to assimilate a non-native sound to the native phoneme or category that is most similar in terms of its articulatory gestures (Best, 1995, 2001). PAM predicts that a non-native phone can be assimilated to the native phonological system in one of three ways: as a categorized phone, an uncategorized sound, or a nonassimilable nonspeech sound. More importantly, PAM predicts not only the assimilation of a single non-native phone but also the assimilation of a non-native contrast.

Depending on the assimilation pattern of the two non-native phones involved in the contrast, six types of assimilation are predicted for non-native contrasts: Two Category assimilation (TC), Single Category assimilation (SC), Category Goodness difference (CG), Uncategorized-Categorized pair (UC), Uncategorized assimilation (UU), and Non-Assimilable pair (NA). When the two phones of a non-native contrast are assimilated to two different native categories, the contrast is perceived as TC; when both are assimilated to a single category, the contrast is SC. CG refers to the case where both phones are assimilated to one category but one is assimilated better than the other. When one phone is categorized and the other is not, the contrast is UC, and when neither is categorized, the contrast is UU.

When both phones are unassimilable, the contrast is predicted to be NA. The discrimination of these six non-native contrast types is affected by the native phonological system in different ways. To be more specific, L1 phonology should have a positive effect on the discrimination of TC and UC contrasts: when a non-native segmental contrast falls into either the TC or the UC type, learners should find it easier to differentiate the non-native sound segments. The effect of L1 phonology may be neutral for NA contrasts, neither positive nor negative. For SC or CG contrasts, L1 phonology is predicted to have a negative effect.
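The mapping from the assimilation of two phones to the six contrast types can be expressed as a simple decision rule. This is an illustrative sketch only, assuming a hypothetical encoding in which each phone's assimilation result is a native category label, None (uncategorized), or the string "nonspeech" (unassimilable); PAM itself is stated in articulatory-phonological terms, not as code.

```python
# A sketch of PAM's six contrast types as a decision rule. Inputs are
# hypothetical assimilation results for the two non-native phones:
# a native category name (string), None (uncategorized), or
# "nonspeech" (unassimilable). Goodness ratings (any comparable
# values) are consulted only in the single-category case.

def pam_contrast(a, b, goodness_a=None, goodness_b=None):
    if a == "nonspeech" and b == "nonspeech":
        return "NA"   # Non-Assimilable pair
    if a is None and b is None:
        return "UU"   # both Uncategorized
    if a is None or b is None:
        return "UC"   # Uncategorized-Categorized pair
    if a != b:
        return "TC"   # Two Category assimilation
    # both phones assimilated to the same native category
    if goodness_a is not None and goodness_b is not None \
            and goodness_a != goodness_b:
        return "CG"   # Category Goodness difference
    return "SC"       # Single Category assimilation

print(pam_contrast("/r/", "/l/"))        # TC: discrimination is easy
print(pam_contrast("/l/", "/l/", 5, 2))  # CG: one phone fits better
```

The rule makes PAM's prediction structure explicit: good discrimination follows from TC and UC, poor discrimination from SC and CG, with NA and UU in between.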

Unlike both NLM and SLM (discussed below), which focus on the attributes of phonetic categories, PAM is a model that is phonological in nature. As Best commented (2001, p. 791), "PAM instead focuses on the functional organization of the native phonological system, specifically on the phonological distinctions between, and phonetic variations within, native phonological equivalence classes."

2.8.3 Speech learning model

Flege (1987, 1991, 1995, 2003) and colleagues proposed the Speech Learning Model (SLM), whose hypotheses Flege laid out in 1995 with supporting evidence from empirical studies. Like the previous two models, SLM posits that listeners' speech perception is attuned to the contrasts of the L1 phonological system. In the acquisition of an L2, contrastive phones may not be perceived as contrastive because L1 phonology may prevent listeners from attending to "the features or properties of L2 sounds that are important phonetically but not phonologically, or both" (Flege, 1995). SLM differs from PAM in that it is not specifically built on Articulatory Phonology. Flege (1995) has not been very explicit about what "the features and properties" of L2 sounds are; they may be articulatory gestures or acoustic properties. In his experimental studies, he has focused on acoustic properties in speech learning. Unlike NLM, SLM focuses on adult speakers', especially bilingual speakers', acquisition of L2 speech sounds. Furthermore, SLM assumes that the construction of new L2 categories is possible. Non-native sounds are proposed to be classified according to their "equivalence" to existing sounds: a new L2 category is less likely to be created when the perceived similarity to an existing L1 sound is large, and the correct perception of a more similar sound is more difficult. In other words, L2 learners can master a 'new' sound in the target language but not a 'similar' sound.

Flege and his colleagues have conducted experimental studies to test SLM in the perception of L2 speech sounds, and these studies discussed the contributions made by acoustic correlates (Flege, 1987, 1993, 1995). For example, in studies on the acquisition of English voiceless stops, it was found that learners whose L1 has the same phonological voiceless stops but different VOT settings had great difficulty with the perception and production of the English voiceless stops. Flege suggested that correct categorization may "be blocked by the continued perceptual linkage of L1 and L2 sounds" (p. 258). In a different study, with German learners of English, Bohn and Flege (1992) found that German learners can be trained to perceive and produce the 'new' English vowel /æ/. The researchers thus concluded that, unlike for a similar sound, with enough time and exposure to a new phone in the L2, a new L2 category can be created.

In general, SLM is not specifically designed to account for speech perception in a non-native language; rather, it uses accuracy and failure in L2 speech perception to explain the acquisition of L2 production. On the one hand, it offers a broader view of L2 speech acquisition, incorporating not only perception and production but also factors such as age of arrival (AOA) and age of learning (AOL). On the other hand, it lacks a detailed account of why and how L2 perception differs from (or resembles) L1 perception, or why AOA or AOL would have an effect on L2 perception and production.

Although the three models differ in their beliefs about the native perceptual framework and how L2 sounds or sound contrasts are mapped to the L1 system, they hold the same view that "adults' discrimination of non-native speech contrasts is systematically related to their having acquired a native speech system" (Best, 2001, p.776). All three models have made important contributions to the study of speech learning. They have offered sophisticated proposals for the possible influence of L1 on L2 speech learning from different levels, phonological, phonemic, phonetic and acoustic. Many experimental studies were conducted to verify and evaluate these different claims.

In the discussion of speech learning, Flege (1995) also pointed out the importance of suprasegmentals in L2 acquisition and indicated that not only segmental but also prosodic divergences may lead to foreign accent. While the proposals about speech learning made by these models should apply to the perception of suprasegmentals, none of the models has made explicit predictions about the acquisition of suprasegmentals. Furthermore, the experimental methods have been used mainly with the study of segment perception. The scarcity of studies on suprasegmentals may be attributed to the complicated nature of suprasegmentals. While it is comparatively easier to identify a distinctive phoneme, defining stress is never an easy task.

2.9. The Importance of Stress

Out of the different components of suprasegmentals, lexical stress is one of the most important factors, yet the most complicated and least investigated one. "Lexical stress plays a central role in determining the profiles of words and phrases in current theories of metrical phonology" (Hogg & McCully, 1987, as cited in Field, 2005, p.403). Furthermore, word stress may also influence the intonation and rhythm of sentence production. Bond (1999) found that in processing speech, native speakers put more emphasis on stressed syllables than on unstressed ones; in other words, they tend to ignore mistakes in unstressed syllables.

In addition, misplacement of stress in a word is more likely to affect native speakers' processing of speech than mispronunciation of a phoneme. In a study on the processing of lexical stress, Cutler and Clifton (1984) found that misplacement of stress in disyllabic words has detrimental effects on speech processing. A shift of stress from the left syllable to the right syllable seriously hampered intelligibility. This can be illustrated with the example of WAllet, where capital letters represent the stressed syllable. If the word is mispronounced as waLLET, native listeners recognize the word far less efficiently.

One other interesting finding of the study is that if a vowel quality change is also involved in the stress misplacement, even greater effects on word recognition are observed. In other words, changing a full vowel into a reduced vowel, or vice versa, can compromise intelligibility severely, as in the case of waLLET, [ˈwɒlɪt] → [wɒˈlɛt]. Incorrect placement of primary stress in L2 words may thus lead to miscommunication, since the misplacement of lexical stress can "precipitate false recognition, often in defiance of segmental evidence" (Cutler, 1984, p.80). L2 learners, on the other hand, may not pay attention to stress placement when listening to a stress language and may not use stress as a cue in lexical processing. In production, their stress mistakes can cause severe problems for native speakers, who may rely primarily on stress. Thus, studying second language learners' problems with lexical stress may lead to overall improvements in second language perception and production. Pedagogically speaking, Dalton and Seidlhofer (1994) point out that, in pronunciation instruction, lexical stress is easier to teach than intonation but has greater communicative value than the phoneme. It is thus worthwhile to study in greater detail what learners' problems are with English lexical stress perception.

2.9.1. The role of lexical stress in English

English is a stress language: it uses stress to make lexical distinctions. Lexical stress is a suprasegmental feature of language, relying particularly on changes in the fundamental frequency (F0) of a word, that serves a linguistic function. In English, for instance, it distinguishes lexical pairs (CONduct vs. conDUCT, where the capitalized letters indicate a stressed syllable) and compounds from noun phrases (e.g., WHITEhouse vs. white HOUSE).

2.9.2. Acoustic attributes in the perception of lexical stress in English

In English, one of the syllables of a word is perceived as the most prominent one, the so-called lexical stress position of the word. The acoustic correlates of lexical stress in English are pitch (F0: fundamental frequency), duration, loudness (i.e., intensity), and vowel quality (e.g., Lehiste, 1970; Beckman, 1986; Pierrehumbert, 1980; Fry, 1955, 1958, as cited in Wang, 2008). Out of these, researchers have proposed pitch and duration as the most important perceptual cues; intensity is generally claimed to be of lesser importance (Fry, 1955, 1958), while vowel quality is the least important cue (Fry, 1965; Rietveld & Koopmans-van Beinum, 1987, as cited in Wang, 2008).

In the perceptual domain of stress, Fry (1955, 1958) developed an experimental paradigm to test listeners' judgments of stress placement in his pioneering perception work on English stress. He used real-word speech stimuli in which certain physical parameters (i.e., pitch, duration, and intensity) were varied systematically while segmental content was kept the same. Listeners judged whether stress placement was affected by the manipulation of each acoustic correlate. Fry found that intensity had the least effect on stress perception. Duration changes had a greater effect than intensity, with longer syllables more likely to be perceived as stressed. The strongest effects on stress perception were achieved by altering the pitch contour. Thus, he concluded that pitch and duration, rather than intensity, are the principal perceptual cues for stress in English. However, his studies did not specify which acoustic cue is most associated with words stressed on the first versus the second syllable. His work thus stimulated the interest of the current study.
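Fry's ranking of cues (pitch > duration > intensity) can be illustrated with a toy weighted comparison between two syllables. The cue weights and syllable measurements below are invented for illustration; Fry's experiments manipulated synthetic stimuli and did not fit a numeric model like this.

```python
# A toy illustration of Fry's cue-weighting result: a listener-style
# stress judgment that weights pitch most, duration next, and
# intensity least. All numbers here are invented for illustration.

# Hypothetical per-syllable measurements: (F0 in Hz, duration in ms,
# intensity in dB) for a two-syllable word.
def judge_stress(syll1, syll2, weights=(0.5, 0.35, 0.15)):
    """Return 1 or 2: which syllable is judged stressed.

    Each cue is compared across the syllables and the normalized
    differences are combined with weights ordered
    pitch > duration > intensity.
    """
    score = 0.0
    for w, x, y in zip(weights, syll1, syll2):
        # normalized difference: positive favors syllable 1
        score += w * (x - y) / max(x, y)
    return 1 if score > 0 else 2

# Higher pitch and longer first syllable -> first syllable stressed
print(judge_stress((220, 180, 70), (180, 140, 68)))  # 1
# Cues favor the second syllable -> second syllable stressed
print(judge_stress((180, 120, 66), (230, 200, 70)))  # 2
```

The weight ordering, not the particular values, is the point: because pitch carries the largest weight, a pitch difference can outvote opposing duration and intensity cues, mirroring Fry's finding.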

2.9.3. Statistical distribution of stress patterns in English

Lexical stress assignment in English is free, or moveable, in that stress can be assigned to a syllable in any position in a word. Every content word in English contains a single syllable bearing primary stress and, optionally, other syllables bearing secondary stress. In general, stress is assigned from right to left to form trochaic (strong-weak) patterns, resulting in an alternating stress pattern in multisyllabic words (Hammond, 1999; Hayes, 1982). Although stress can be assigned to any syllable of a multisyllabic English word, there are strong tendencies for stress to occur in certain positions more than others. For instance, Cutler and Carter (1987) reported that in a corpus of over 20,000 English words, 90% of the content words began with a stressed syllable. Thus, English words are not evenly distributed across syllable-stress patterns. To date, the relationship between the number of syllables and the distribution and frequency of different stress patterns in English words has not been investigated. An analysis based on the CELEX database shows that most English words contain two syllables (44.49%), followed by three syllables (36.43%). Among all English words, the majority of two-syllable words exhibit primary stress on the first syllable (74.94%; henceforth the "trochaic pattern"), while the remainder exhibit primary stress on the second syllable (25.06%; henceforth the "iambic pattern"). Likewise, two-syllable words are the most frequently used words, accounting for 70% of words by frequency out of a total of about 500,000.
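The kind of tally underlying these CELEX figures can be sketched as follows. The five-word lexicon and its annotation format are invented for illustration; a real analysis would read the CELEX database itself rather than a hand-built list.

```python
# Sketch of tallying trochaic vs. iambic patterns in a stress-annotated
# lexicon. The tiny lexicon and its (word, syllable count, stressed
# syllable index) format are hypothetical stand-ins for CELEX entries.
from collections import Counter

lexicon = [
    ("apple",  2, 1),  # trochaic (primary stress on first syllable)
    ("borrow", 2, 1),
    ("cigar",  2, 2),  # iambic (primary stress on second syllable)
    ("banana", 3, 2),
    ("wallet", 2, 1),
]

two_syll = [entry for entry in lexicon if entry[1] == 2]
patterns = Counter("trochaic" if s == 1 else "iambic"
                   for _, _, s in two_syll)

for pattern, n in patterns.items():
    print(f"{pattern}: {100 * n / len(two_syll):.1f}%")
```

On this toy lexicon the split is 75.0% trochaic to 25.0% iambic, coincidentally close to the CELEX proportions quoted above; the code is meant only to show the shape of the computation.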

2.9.4. Perception of lexical stress in English

Studies on the possible role of prosody in spoken word recognition have focused on investigating general sensitivity to lexical stress. Studies in child language development (Jusczyk & Aslin, 1995; Jusczyk, Houston, & Newsome, 1999) found that 7.5-month-old English-learning infants showed a preference for the dominant stress pattern of English (trochaic, strong-weak, or stress on the first syllable). Further, using an artificial language consisting of two-syllable words with trochaic (strong-weak, SW) and iambic (weak-strong, WS) stress patterns, Thiessen and Saffran (2003) found that 9-month-old infants relied on trochaic patterns as a cue to word segmentation, whereas both 6.5- and 9-month-old infants mis-segmented the words when the stress pattern was iambic.

With regard to word segmentation, it has been suggested that strong syllables trigger speech segmentation and that lexical access attempts follow strong syllables (Taft, 1984; Cutler & Norris, 1988; Cutler & Butterfield, 1992, as cited in Cutler, Dahan, & Donselaar, 1997). Taft (1984) examined the effect of stress pattern on word identification. Participants listened to two-syllable words or phrases with SW or WS patterns (e.g., SW: 'lettuce' vs. 'let us'; WS: 'assign' vs. 'a sign') and were asked to identify whether the sound they heard contained one word or two. Results showed that SW patterns attracted more one-word responses, whereas WS patterns tended to signal two-word responses. van Heuven (1985) used the gating paradigm to examine the identification of words with correct or incorrect stress patterns. Presented with fragments of stimulus words, participants were asked to guess the word they heard at each fragment. Responses were mostly words with first-syllable stress, and identification of correctly stressed words was more accurate than that of incorrectly stressed words.

Cutler and Butterfield (1992) examined errors made by English listeners in spoken word recognition based on stress patterns. They found that listeners were more likely to misperceive a polysyllabic word as two words when the two-word response resulted in a WS pattern. They concluded that these 'slips of the ear' were consistent with the idea that native English speakers treat strong syllables as word onsets. These findings are also compatible with the statistical distribution of stress patterns in English: as previously mentioned, in the CELEX database the majority of two-syllable words exhibit primary stress on the first syllable, and these are also the most frequently occurring English words.

Likewise, Cutler and Carter (1987) found that more than 90% of the content words in English begin with strong syllables. In other words, the probability is high for a word with SW stress pattern, which provides a useful strategy for processing spoken words.

All of this evidence seems to support the conclusion that strong syllables facilitate lexical access and the SW pattern, which is the dominant stress pattern in English words, may result in a perceptual bias in which English listeners prefer to interpret stress patterns as SW.

Furthermore, the role of lexical stress in spoken word recognition has received considerable attention over the last several decades, with evidence accumulating in behavioral studies of normal as well as pathological populations. Behavioral studies in healthy adults have supported an important role for lexical prosodic information in normal lexical/semantic access. Cutler and Clifton (1984) presented two-syllable words with SW or WS stress patterns, along with their mis-stressed counterparts, to native English listeners, whose task was to identify the words. Responses to mis-stressed words were slower, with mis-stressed SW words receiving the longest response times. They concluded that the stress shift disrupted word recognition.

Slowiaczek (1990) found consistent results in a shadowing task. Similar to Cutler and Clifton's (1984) study, she used words with correct and incorrect stress patterns (e.g., YELlow vs. yelLOW). Participants were asked to repeat the word they heard as fast and as accurately as possible. Results again showed that responses were faster for correctly stressed words than incorrectly stressed words. These two studies provide evidence that processing spoken words can be slowed down when lexical stress is misplaced, which in turn indicates that stress information affects lexical processing.

However, a discrepant result was found in a study conducted by Cutler and Clifton (1984), where the effect of stress pattern on the identification of word class was investigated. Participants listened to two-syllable nouns and verbs with SW ('apple' or 'borrow') or WS ('cigar' or 'await') stress patterns in two contexts 'the ___' and 'to ___.