ONE STORY OF PHONETICS

or

A research summary

Olle Engstrand

(July 1998)

***

TABLE OF CONTENTS

1 Points of departure
  1.1 Identifying a pseudoproblem
  1.2 Avoiding the physiological bias: data and interpretations
  1.3 A means-end view
2 Prosodic bases of phrase and word structure
  2.1 F0 and phrase structure
  2.2 Tonal structure of the word accent contrast
  2.3 Acquisition of the word accent contrast
3 Studies of the prosodic-spectral interface
  3.1 Effects related to speaking rate and stress
  3.2 Durational vs. spectral bases of quantity
  3.3 Duration vs. spectrum (cont'd): cross-language observations
    3.3.1 Using elicited speech materials
    3.3.2 Using spontaneous speech
4 Systematicity of phonetic variation in spontaneous speech
5 Foreign accents - attitudes and signatures
6 Immigrant voices in Sweden (IRIS) - a database project
  6.1 Background and motivations
  6.2 Special study #1: Aspiration in Swahili
  6.3 Special study #2: Salient features of Lule Sami phonetics
7 Phonetic typology: Constraints and biases
  7.1 Areal biases in stop paradigms
  7.2 Why are clicks so exclusive?
  7.3 Two studies of voicing in stop consonants
8 Phonetic typology: Phonetic evaluation of UPSID
9 Phonetic typology and sound change - the Swedish dialects

 

1 Points of departure

1.1 Identifying a pseudoproblem

Structural linguistics is to be credited with the rediscovery of the segmentation criteria on which the first alphabets were once built. These criteria, which are distributional, semantic and phonetic, now form the cornerstones of a formal machinery designed to reduce language to a minimal set of elementary units. Segments and features are the abstract products of this machinery, i.e., they are the theoretical constructs defined in the phonetic grammar, or phonology, of each language. However, these units have also come to be regarded as primitives in the analysis of phonetic behavior, i.e., in theories of the production and perception of speech. Thus, the idea has arisen that it is a task of phonetics to explain how the units of phonology are implemented in the act of producing or perceiving spoken words and sentences. This, however, would seem to be a case of reification and circularity - reification because units are ascribed an existence independent of the theory in which they are defined, and circularity because these new objects are regarded as inputs to the behavior out of which they were abstracted in the first place. 

This premise is bound to cause a number of rather uneasy problems; specifically, it creates the need to explain a) how the speech production mechanism translates centrally represented discrete, static and invariant input units into a continuous and variable speech flow and, conversely, b) how this flow is processed by the ear and perception system to restore the linguistic units to their original form. This, then, is the root of the invariance problem.

The motor theory was developed in response to the latter problem, i.e., that an abundance of context effects makes the speech signal inadequate as a vehicle for direct identification of the segmental units. But since segments, it was assumed, are recovered, a decoding module was postulated to interpret the signal via articulatory gestures, which were presumably less variable (e.g., Liberman et al., 1967). This highly specialized module was viewed as part of the human genetic endowment and indicative of our innate neurobiological makeup for language.

In retrospect, however, it appears as if the proponents of the motor theory overstated the disorderliness of the acoustic signal. There is thus evidence to indicate that the segmental structure of the signal becomes clearer the further 'downstream' observations are made (e.g., at the acoustic as compared to the movement level; cf. In what sense is speech quantal?, Journal of Phonetics, 17, 107-121, 1989), and that observed acoustic variability in speech displays a considerable amount of systematicity (e.g., Systematicity of phonetic variation in natural discourse, Speech Communication 11, 337-346). In addition, the motor theory disregarded the human listener's capacity for top-down processing in terms of circumstantial knowledge and expectations. Thus, it would appear that the perception module does little more than create the additional problem of explaining its own existence.

Additionally, the former problem ('How are segments converted to movements?') has produced a number of ingenious constructs, most notably the concept of coarticulation. One example is the following: Dealing with so-called anticipatory coarticulation, Henke (1966, 1967) assumed a translation device which is fed with strings of segments, each of which comprises a bundle of features such as 'high', 'front', 'rounded', etc. In the translation process, segments are checked with respect to feature combinations in order to decide at which point they may start to be realized, the rule being that a segment's realization can begin as soon as it does not conflict with an intervening segment. Thus, in utterances such as at school, the feature 'round' (manifested as lip rounding for the vowel /u/) will be 'spread to the left' across the entire consonant cluster because the consonants /tsk/ are supposed to be 'unmarked for rounding'.

From this, the following two observations can be made: 1) The basis for marking /tsk/ in this way is entirely articulatory; specifically, since these are lingual consonants, the lips are assumed to be 'free to coarticulate' the rounding associated with the upcoming vowel; and again, 2) this emphasis on articulatory control mechanisms is a natural consequence of the physiological bias built into the definition of the problem.
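The look-ahead rule attributed to Henke above can be illustrated with a minimal sketch. It is not Henke's actual program: the feature table, the simplified transcription "atskul" for at school, and the function name spread_rounding are assumptions made here for illustration only. Segments unmarked for rounding (None) simply inherit the value of the next marked segment.

```python
from typing import Dict, List, Optional

# Hypothetical feature table (an assumption for illustration):
# True = rounded, False = unrounded, None = unmarked for rounding.
ROUND: Dict[str, Optional[bool]] = {
    "a": False, "t": None, "s": None, "k": None, "u": True, "l": None,
}

def spread_rounding(segments: List[str]) -> List[Optional[bool]]:
    """Look-ahead spreading: each unmarked segment takes the rounding
    value of the nearest marked segment to its right, if there is one."""
    values = [ROUND[s] for s in segments]
    upcoming: Optional[bool] = None
    # Scan right to left so that each segment can "see" what follows it.
    for i in range(len(values) - 1, -1, -1):
        if values[i] is None:
            values[i] = upcoming
        else:
            upcoming = values[i]
    return values

# Simplified stand-in for "at school": rounding spreads across /tsk/.
print(spread_rounding(list("atskul")))
# Vowel-symmetrical case: the rule yields rounding throughout /usu/.
print(spread_rounding(list("usu")))
```

Note that, under these assumptions, the rule predicts uninterrupted rounding across the consonant in /usu/, since nothing in the feature table blocks the spreading.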

 

1.2 Avoiding the physiological bias: data and interpretations

The inadequacy of the physiological bias was first demonstrated in a paper reporting simultaneous EMG and movement data from a Swedish speaker, Acoustic constraints or invariant input representation? (Reports from Uppsala University, Department of Linguistics, RUUL, 7, 67-95). The data presented in that paper showed, among other things, that:

(1) in vowel-symmetrical VCV utterances such as /utu/, /usu/, /ustu/, /ustru/ and /ukstru/, lip rounding activity did not continue throughout the consonantal interval, either at the movement or the motor command level. Rather, the rounding was interrupted during that interval such that there was a 'trough' in the lip rounding pattern, the duration of which was proportional to the number of consonants in the cluster.

(2) the trough gesture was active in some contexts (notably with C=/s/) with labial depressor activity coinciding with it in time;

(3) lower lip movements were partially independent of jaw movements, the net result being an almost constant upper-lower lip distance across vowel contexts; this effect was particularly strong in /s/ productions.

In the earlier, physiologically oriented literature (e.g., Gay, 1977, 1978, 1979; McAllister, 1978), the existence of a trough in this type of utterance was regarded as a perplexing problem. Since the speech organs engaged in articulating the sounds in utterances such as /usu/ are physiologically independent, the intervening consonant should be disregarded by the feature-spreading mechanism. What can it be, then, in the motor system that forces the trough? Clearly, this question overlooks the obvious, that is, that many consonants must be articulated with a certain amount of lip spreading for perceptual reasons. As a case in point, consider the acoustic and auditory properties of sibilants.

Compared to nonsibilants, sibilant fricatives such as /s/ are characterized by high-intensity noise in the high frequency region. The auditory salience of this characteristic has been demonstrated in several confusion studies. For example, McCasland (1979) showed that attenuated /s/ tends to be confused with diffuse sounds such as / / or /f/. Since this is also bound to happen to naturally attenuated, i.e., labialized sibilants, it could be expected that labialization in sibilants will not be a highly valued feature of the sound inventories of the world's languages. This expectation has been clearly confirmed in our analyses of the UPSID database (The UCLA Phonological Segment Inventory Database; Maddieson, 1984; Maddieson and Precoda, 1989) showing that labialized sibilants are extremely rare. Thus, out of the 451 UPSID languages, only 6 (or about 1.3%, half of which are Caucasian languages) have labialized sibilants. Since there is no reason to believe that labialized fricatives are more difficult to produce than plain fricatives, the explanation must be that labialization makes sibilants auditorily less salient. In fact, non-sibilant fricatives are frequently labialized, e.g. Swedish [ ] which is sometimes pronounced with an extreme narrowing at the mouth orifice (Lindblad, 1980).

Attenuation of sibilant noise also occurs in the case of coarticulated lip rounding, i.e., when lip rounding pertaining to a rounded vowel is anticipated in or carried over to a sibilant. It is, thus, reasonable to interpret the partial unrounding during /s/ in /usu/, as observed in the above-mentioned paper (Acoustic constraints or invariant input representation?), as an articulatory means of preserving the sibilant character of /s/. In general, the observed constancy in mouth orifice geometry can be accounted for in terms of particularly sensitive aerodynamic-acoustic-auditory requirements for /s/. Similar conclusions were drawn by Perkell (1986) on the basis of a study of lip rounding in three languages.

Observations of the trough phenomenon were first made in connection with lip rounding. From a physiological point of view, then, it might be argued that troughs and other aspects of motor coordination observed on the lips would be system-specific. However, the fact that the lips are not special with respect to the trough was shown in a subsequent study of VCV coordination, the results of which were reported in Articulatory coordination in selected VCV utterances: A means-end view, Reports from Uppsala University, Department of Linguistics (RUUL) 10, and Articulatory correlates of stress and speaking rate in Swedish VCV utterances (see abstract).

Those experiments used cineradiographic films to observe tongue-lip coordination patterns primarily in utterances involving labial consonants such as [ipi], [ipa], [ipu], [api] etc. Data were obtained from two Swedish subjects; a third subject had to be rejected for technical reasons. The central findings are summarized in the next few paragraphs.

Part of the cinefilm data processing was done using tracings of midsagittal tongue contours. Those tracings allowed the following observations:

a) When midsagittal tongue surface tracings of maximum V1 constriction, initiation of /p/ closure, release of /p/ closure, and maximum V2 constriction were superimposed, the tongue movement trajectory turned out to be non-linear, approximating a neutral position during the closure interval;

b) in the vowel-symmetrical VCV case, a similar movement towards a neutral position was observed during closure. There was thus a tongue trough reminiscent of the lip trough observed in the /uCu/ utterances. The trough gesture was independent of the jaw such that tongue and jaw frequently moved in opposite directions.

In summary, the cinefilm data suggested that a) the trough phenomenon could be generalized to the dorsal motor system, and b) that the trough can be regarded as a special case of a non-linear V-to-V movement trajectory. Two explanations for why this pattern occurred were identified in the two papers just mentioned: aerodynamics and vowel dynamics. These are summarized in the next few paragraphs.

a) Aerodynamics

It was argued in the papers that a near-neutral vocal tract shape is essential in creating aspiration as specified for stressed voiceless stops by the phonetic norm for Swedish and that, given a subglottal pressure typical of stressed syllables (Stevens, 1971), narrowing the tongue constriction beyond some quantal limit would give rise to a fricative rather than an aspirative noise at stop release. To avoid this, much of the tongue movement toward V2 in utterances such as /api/ must be delayed until the stop release has been executed. By the same token, the tongue must be temporarily removed from the /i/ position in connection with the stop release in vowel-symmetrical VCVs, e.g., /ipi/.

These conclusions have recently been corroborated by Molis (1994) as well as in a recent study of our own, The locus line: does aspiration affect its steepness? (Abstract). Molis used the slopes of the so-called locus equation to quantify the amount of consonant-vowel coarticulation in VCV utterances produced by one Swedish, one French and one American English speaker. When measuring F2 onset near the stop release (i.e., in the aspiration noise for /p/), Molis found considerably steeper slopes for /b/ than for /p/ in her Swedish and American English subjects while, in her French data, /p/ and /b/ displayed equal slopes. Furthermore, the French slopes were comparable to the Swedish and American English /b/ slopes. Thus, the tongue may have approximated the position for the following vowel more closely before the release of /b/ than before that of /p/ in the Swedish and American English, but not in the French speakers. Since /p/ is aspirated in Swedish and English, but not in French, and since /b/ is not aspirated in any of these languages, Molis hypothesized that aspiration was the cause of this cross-language difference.

The locus equation data presented in the paper just mentioned further supported this hypothesis in showing that, for labial stops, the slope of the locus equation regression line varies with degree of aspiration as quantified in terms of voice onset time (VOT): short VOTs were associated with relatively steep slopes, and long VOTs were associated with relatively flat slopes. Thus, the short VOT conditions resulted in a relatively great extent of locus-to-target assimilation, suggesting that the tongue is nearly in the position for the upcoming vowel at the moment of stop release. The long VOT conditions, on the other hand, displayed a much smaller amount of locus-to-target assimilation. These data therefore provided further corroboration of the idea that aerodynamic constraints play a role in articulatory coordination patterns in stop-vowel syllables.
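The slope comparison described above can be sketched as an ordinary least-squares fit of F2 at stop release against F2 in the following vowel. The function name fit_locus_equation and all frequency values below are invented for illustration; they are not measurements from Molis (1994) or from the paper summarized here, and the actual studies may have used different measurement points and fitting procedures.

```python
from typing import List, Tuple

def fit_locus_equation(f2_vowel: List[float],
                       f2_onset: List[float]) -> Tuple[float, float]:
    """Least-squares fit of F2_onset = slope * F2_vowel + intercept.
    A steeper slope means that F2 at release tracks the vowel more closely."""
    n = len(f2_vowel)
    mean_x = sum(f2_vowel) / n
    mean_y = sum(f2_onset) / n
    sxx = sum((x - mean_x) ** 2 for x in f2_vowel)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(f2_vowel, f2_onset))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical F2 values in Hz for four vowel contexts (illustrative only).
f2_vowel = [2200.0, 1800.0, 1300.0, 900.0]
f2_onset_short_vot = [1960.0, 1640.0, 1240.0, 920.0]   # onset follows the vowel closely
f2_onset_long_vot = [1700.0, 1500.0, 1250.0, 1050.0]   # onset varies less with the vowel

slope_short, _ = fit_locus_equation(f2_vowel, f2_onset_short_vot)
slope_long, _ = fit_locus_equation(f2_vowel, f2_onset_long_vot)
print(round(slope_short, 2), round(slope_long, 2))
```

With these made-up numbers the short-VOT set gives the steeper slope, which is the direction of the effect reported in the text.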

b) Vowel dynamics

The second part of the explanation of our cineradiographic VCV data related to a hypothesis advanced by Lubker and Gay (1982), who were concerned about the observation that the timing of anticipatory lip rounding seemed to vary systematically between different languages. For Swedish, there was a tendency for anticipatory lip rounding to be initiated in direct relation to the length of the consonant string preceding a rounded vowel; in the American English data, on the other hand, the time lag between rounding and vowel onsets tended to be more constant and independent of the size of the preceding cluster. This observation is unexpected as long as the temporal coordination of speech movements is interpreted in terms of static segmental inputs, with language-independent constraints imposed on their physical manifestation. Realizing this awkward cul-de-sac, Lubker and Gay made a commendable, but still misdirected attempt to explain the observed effects in terms of structural differences between the American English and Swedish vowel inventories. Specifically, since the Swedish vowel space is more crowded than the English one, lip rounding would be more accurate in Swedish in order to meet more demanding perceptual requirements; and this would lead to an earlier onset of anticipatory rounding.

As discussed in the above-mentioned paper (Articulatory coordination in selected VCV utterances: A means-end view), the weakness of this hypothesis becomes obvious as soon as it is recognized that similar timing differences can be observed in dialects of the same language with almost identically structured vowel inventories. This is the case for, e.g., Central Standard Swedish, as spoken in the Stockholm region, vs. Malmö Scanian, a Southern Swedish dialect. Both dialects have distinctly but differently diphthongized long vowels. For example, while Scanian /i/ and /u/ glide towards a near-cardinal from a near-neutral vowel quality (Bruce, 1970), their Central cognates frequently display a final bend from the near-cardinal towards a near-neutral quality, sometimes preceded by slight frication (e.g., Fant, 1973). It goes without saying that these differences must correspond to differences in the position of the articulators at the beginning and end of the vowels in these dialects. Since Southern Swedish diphthongization bears a clear similarity to that of many dialects of English (cf., e.g., Jones, 1964; Thomas, 1958; Labov, 1986), it was concluded in the paper referred to here that the same point could be made in relation to Central Swedish and many dialects of English. Thus, Lubker and Gay's (1982) results could be given a reasonable explanation in terms of language- and dialect-specific vowel dynamics.

The vowel dynamics argument has been further tested and corroborated in a study using a cross-language, electropalatographic (EPG) data material extracted from the ACCOR data base (described in EUR-ACCOR: The design of a multichannel database, Abstract). Preliminary results of this project have been reported in, e.g., Investigating the ‘trough’: vowel dynamics and aerodynamics, Journal of the Acoustical Society of America, Vol. 100, No. 4, Pt. 2, 2659-2660 (see also Towards an electropalatographic specification of consonant articulation in Swedish, Abstract). The material consisted of nonsense utterances such as /ipi/, /ipa/, /api/, read by a number of speakers of English, French, German and Swedish. Essentially, the data for symmetrical VCVs such as /ipi/ showed a) that Swedish and English data displayed significant, consonant-related troughs in the EPG patterns, b) that the Swedish and English trough patterns were differently timed, the Swedish trough occurring earlier than the English one, and c) that French and German VCVs displayed minimal, if any, troughs; the latter observation was interpreted as a consequence of the requirement for attaining essentially steady-state formant patterns typically seen in Standard French and Standard German. For the same reason, then, most of the transitional movement from V1 to V2 occurred during the consonant closure in French and German.

1.3 A means-end view

The above data and discussion have exemplified that complex coordination patterns can be given a natural explanation when seen in the light of a means-end strategy, in which speech gestures are coordinated in such a way as to bring about certain acoustic effects, which are intended by the speaker and correspond to the phonetic norm for the language or dialect. Articulation is thus a goal-oriented activity. As such, articulatory gestures are ad hoc in the sense that acoustic goals are achieved using variable, context-dependent motor schemes. Thus, reorganization of motor commands to adapt to variable articulatory conditions as imposed by speaking rate or style is an integral part of the theory (cf., e.g., section 3.1). In particular, the problem of explaining a specialized translation function for converting underlying units to behavior will never arise; the means-end perspective therefore obviates the need for segment-based coarticulation devices such as those proposed by Henke and others.

As argued in a paper entitled Predicting segment durations in terms of a gesture theory of speech production (Proceedings of the 9th International Congress of Phonetic Sciences, Copenhagen, August 6-11, Vol. II, 305-311), intended acoustic effects can be articulated in two ways, by means of sequencing or by means of coarticulation. Sequencing amounts to the linear ordering of acoustic effects such that the completion of a preceding effect is immediately followed by the execution of a subsequent effect in such a way that the essential acoustic characteristics of both effects will be preserved.

Coarticulation, as we used the term in this paper, amounts to the simultaneous articulation of acoustic effects such that several effects are audible at the same time. Thus, coarticulation results in a complex sound brought about by means of a compound gesture. In the compound gesture, however, one part may be slower than all the others; thus, if several of these gestures are started at about the same time, some of them may be completed earlier than the others in the sense that the effects that they are intended to bring about will emerge before the others. In particular, many linguistically functional effects, e.g., vowel qualities, are not required to have any particular duration; they are felt to be complete as soon as they are heard to emerge. However, to coarticulate all the effects, i.e., to make them all audible at the same time, the effects that emerge early will have to be maintained for some time, waiting for the remaining effects to materialize. Thus, acoustic segments with quasi-stationary qualities will arise not as a final end product of the phonetic action, but as a secondary consequence of the effort to reach a certain final goal, i.e., the simultaneous sound of the effects in question. For example, vowel qualities will be maintained during the execution of pitch movements coarticulated with the vowel.
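The timing argument in the preceding paragraph can be made concrete with a small sketch. The function name hold_times and the millisecond values are hypothetical and serve only to illustrate the reasoning: if all effects of a compound gesture are to be audible together, each early-emerging effect has to be maintained until the slowest one has emerged.

```python
from typing import Dict

def hold_times(emergence_ms: Dict[str, float]) -> Dict[str, float]:
    """For effects whose gestures start together, return how long each one
    must be maintained while waiting for the slowest effect to emerge."""
    latest = max(emergence_ms.values())
    return {effect: latest - t for effect, t in emergence_ms.items()}

# Hypothetical emergence times (ms after the common gesture onset).
example = {"vowel quality": 60.0, "pitch movement": 180.0}
print(hold_times(example))
```

In this illustration the vowel quality is held for the difference between the two emergence times, which is how a quasi-stationary vowel segment would arise as a by-product of coarticulating it with a slower pitch movement.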

Experimental evidence will be presented below for a distinction between what we referred to in the just mentioned paper as primary, intended acoustic effects and secondary, fortuitous traces of the effort involved in bringing about the primary effects. The assemblage of primary acoustic effects, sequenced or coarticulated, is taken to constitute the phonetic as opposed to the physical structure of the words and sentences of the language. Phonetic structures thus represent the means employed by speakers to guide listeners' perception to the linguistic meaning of words and sentences. Consequently, these structures constitute the central object of phonetic investigation. The following experiment provides an example:

Example: /h/ as smooth onset

Explaining the acoustic-phonetic characteristics and phonotactic distribution of /h/ presents notorious problems. Acoustically, /h/ is special in that its noise frequencies vary widely with vowel context, and distributionally, it is excluded from syllable-final position in many languages. A related, curious problem with /h/ is presented by the 'nolla-hallon effect' (pointed out by Lindblom), in which the Swedish word nolla 'zero' played backwards is heard as hallon 'raspberry'.

Traditionally, the characteristic acoustic feature of /h/ has been taken to be the aspirative noise which accompanies many of its productions. The purpose of the paper referred to here, A new phonetic interpretation of /h/ in Swedish (Abstract), was to test the alternative hypothesis that a certain spectral shape of the initial voice pulses, a smooth onset, is a necessary and sufficient requirement for identification of initial /h/, i.e., that it forms part of the phonetic structure of words containing /h/.

A number of listening experiments were conducted, as described in the paper, using various acoustic manipulations of the words Anna and Hanna as recorded with a male Swedish speaker. The essential results were as follows:

a) An h-like masking noise superimposed on Anna and Hanna, even at relatively high intensity levels, did not cause listeners to misidentify the two words;

b) when an h-noise, taken from a natural rendering of Hanna, was added to the initial portion of Anna, no perceptual shift occurred; listeners heard Anna in spite of the h-noise, except at extremely high noise levels.

In combination, these results supported the hypothesis that voice onset properties, abrupt vs. smooth, constitute a necessary and sufficient condition for listeners' perception of initial vowel vs. initial /h/. In consequence, a smooth onset was taken to also be the primary acoustic effect intended by the speaker in producing /h/, while the presence of an aspirative noise could be explained as a secondary acoustic trace of the articulatory gestures needed to bring about the primary effect. As discussed in the final section of the paper, these results also suggested alternative interpretations of the 'nolla-hallon' effect as well as of the underrepresentation of final /h/ in the world's languages.

Fortuitous traces of articulatory gestures such as the aspirative noise in /h/ may occasionally serve also as secondary perceptual criteria for identification of linguistically functional units. This is observed in many perception experiments dealing with 'cues' to phonological contrasts. It is also evident that fortuitous side-effects of articulation are frequently re-evaluated in historical sound change. A well-known example is the development of lexical tone contrasts on vowels from high and low pitches automatically accompanying voiceless and voiced obstruents, respectively (e.g., Hombert et al., 1979; Svantesson, 1983). This idea was further elaborated in a study of the clicks (described in section 7.2).

In what follows, a number of experiments will be summarized which were conducted with the means-end perspective in mind, as well as studies which follow up the various, less immediate implications of this view. Occasionally, ideas are expressed which were not discussed in the publications reviewed, but which have appeared later. As far as possible, the summary will follow a chronological order, but this cannot be consistently adhered to since some projects have been carried out in parallel and, in other cases, threads have been left hanging for a while to be picked up later. Thus, the following summary will be thematically organized to a high degree. Section 2 will mainly deal with prosody, particularly phonetic structures of words and phrases. Section 3 summarizes results relevant to the interactions between prosody and acoustic features of vowels, particularly the phonetic implications of quantity. Section 4 refers to studies of the systematicity of phonetic variation in 'spontaneous' speech. Section 5 deals with problems of foreign accent. Sections 6-8, starting out with a summary description of a cross-linguistic database project, are oriented towards phonetic typology and cross-language studies, particularly focusing on problems of markedness, universal trends, areal biases and historical sound change. In section 9, finally, a recently initiated project is introduced, which continues several of these themes on the basis of the Swedish dialects.

Many of the studies summarized involve Swedish in one way or another. Unless otherwise indicated, 'Swedish' will be used in the restricted sense of Standard Swedish as spoken in the Eastern Central part of the country, notably the Stockholm area.

2 Prosodic bases of phrase and word structure

2.1 F0 and phrase structure

It has traditionally been assumed that fundamental frequency (F0) constitutes the primary prosodic dimension underlying word and phrase accents. Consider, for example, the following minimally contrasting phrase quartet:

      (1) en stor mans dräkt 'a suit belonging to a big man'

      (2) en stor mansdräkt 'a big suit for a man'

      (3) en stormans dräkt 'a suit belonging to a magnate'

      (4) en stormansdräkt 'a suit for a magnate'

The F0 contours pertaining to these respective phrases are distinctive, but acoustic measurements also show systematic effects on the duration and intensity dimensions. Some related observations have been made in the literature; for example, the second syllable in grave accent (accent 2) words (such as the compound words in 1-4) tends to be longer than the second syllable in acute accent (accent 1) words (Gårding and Lindblad, 1973), and the stressed syllable in acute words tends to be longer than the stressed syllables in grave words (Elert, 1964). Also, the second syllable of grave words is ceteris paribus typically more intense than the second syllable of acute words. Thus, even though some limited synthesis work (Malmberg, 1966) has suggested that a crucial contribution is provided by F0, the primacy of F0 as the sole phonetic basis for word and phrase structures should not be taken for granted.

The hypothesis that the F0 dimension contains the necessary and sufficient information for identifying phrase structures such as those in (1)-(4) above was tested in an experiment using an LPC-based synthesis method developed in our research group in the 1970s: An experiment on the perceptual evaluation of prosodic parameters for phrase structure decision in Swedish, published in E. Gårding and G. Bruce (eds.): Nordic Prosody. Lund University, Department of Linguistics, 15-21. The method permitted displays and free manipulation of fundamental frequency, intensity and duration in naturally produced utterances. Several synthetic combinations of these parameters were created and tested with native Swedish listeners who were asked to identify each stimulus as one of the phrases in (1)-(4).

The result of this experiment was quite straightforwardly that F0, when correctly aligned to the syllabic structure of the test phrases, was necessary and sufficient in the sense that a) listeners' identification of a phrase in which the original F0 was preserved could not be changed by manipulating duration or intensity; and b) any phrase which was given a new F0 contour was identified in accordance with this new contour. These perceptual results were also taken to suggest the hypothesis that the F0 dimension contains the primary acoustic effects controlled by the speaker to bring about listeners' perception of phrase structure. As a corollary, it should be possible, given a rigorous production model, to explain the accompanying duration and intensity variations as side-effects of the phonatory gestures involved in producing the relevant F0 contours.
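The logic of this kind of cross-splicing experiment can be sketched as follows. The full crossing of parameters shown here is an assumption made for illustration; the text states only that several combinations were synthesized, and the names build_stimuli and predicted_response are hypothetical. Each stimulus takes its F0 contour, durations and intensity from one of the four phrases, and the F0-primacy result corresponds to responses that follow the F0 donor regardless of the other two parameters.

```python
from itertools import product
from typing import Dict, List

PHRASES = [
    "en stor mans dräkt",
    "en stor mansdräkt",
    "en stormans dräkt",
    "en stormansdräkt",
]

def build_stimuli() -> List[Dict[str, str]]:
    """Cross the donor phrase for each parameter (hypothetical full design)."""
    return [
        {"f0": f0, "duration": dur, "intensity": inten}
        for f0, dur, inten in product(PHRASES, repeat=3)
    ]

def predicted_response(stimulus: Dict[str, str]) -> str:
    """Under the F0-primacy result, identification follows the F0 donor."""
    return stimulus["f0"]

stimuli = build_stimuli()
print(len(stimuli))
# Stimuli whose duration and intensity conflict with the F0 donor
# are still predicted to be heard as the F0 donor phrase.
conflicting = [s for s in stimuli
               if s["duration"] != s["f0"] and s["intensity"] != s["f0"]]
print(predicted_response(conflicting[0]) == conflicting[0]["f0"])
```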

In a follow-up paper, Some observations on the role of prosodic parameters in the perception of phrase structure in Swedish (in I. Karlsson & L. Nord (eds.): Report from the Phonetics Symposium 1978. Royal Institute of Technology, Department of Speech Communication, 11-13), results were presented relevant to the perception of so-called lexicalized phrases such as Vita Huset ('the White House') vs. (det) vita huset ('the white house'). In lexicalized phrases, stress is removed from the first constituent, and our physical measurements showed a considerable effect on both duration and F0 contour (intensity was not considered in this experiment). Here too, listening tests revealed that the only parameter effective for the listeners' choice of phrase structure was F0. In cases where durations were increased much beyond realistic limits, the synthesized utterances sounded rather unnatural, but listeners' choice of phrase structure was still unanimously made in accordance with F0. This experiment thus reconfirmed the role of F0 as a primary determinant of perceived phrase structure in Swedish.

 

2.2 Tonal structure of the word accent contrast

While the results just summarized strengthened our view that certain F0 contours are necessary and sufficient for perception of word and phrase structures, they did not tell us precisely what those conditions are. However, a number of later publications represent attempts to determine F0 conditions on the grave vs. acute word accent contrast in Swedish; see, in particular, Phonetic interpretation of the word accent contrast in Swedish: evidence from spontaneous speech (Abstract) and Phonetic interpretation of the word accent contrast in Swedish (Abstract). These papers bear on phrase structures containing compounds (such as stormansdräkt), since the grave accent is the prime prosodic characteristic of Swedish compounds.

The experiments reported in these word accent papers thus form a logical continuation of the perception work summarized above. But they also represent a (somewhat delayed) reaction to a word accent model proposed in Bruce (1977). The next couple of paragraphs provide a brief background.

The phonetic correlate of the grave accent has traditionally been thought of as a two-peaked F0 contour (e.g., Malmberg, 1963). The primary stress syllable is associated with an F0 peak and a fall, and the secondary stress syllable is associated with an F0 rise and a new peak. The acute accent, which does not have a secondary stress, has traditionally been thought of as a one-peaked F0 contour resulting from a rise on the primary stress syllable (Malmberg, op. cit.).

This analysis was modified by Bruce (1977), who showed that a grave word has an F0 rise on the secondary stress syllable only if it is in sentence focus. Likewise, an acute word has an F0 peak only if in the focus position. These F0 events are, thus, determined at the sentence rather than at the word level. Also in contrast with traditional notions, Bruce argued that the correct characterization of the two accents should be that a) both have an F0 fall in connection with the primary stressed syllable, while b) the difference between them is one of timing: in acute words, the F0 fall occurs during the consonant preceding the primary stress vowel, and in grave words, it occurs on the primary stress vowel. Since sentence focal stress causes F0 to rise immediately after the fall in both accents, the result is two globally similar, two-peaked F0 contours which are differently timed in relation to the segmental structure of the word, the grave contour being delayed in relation to the acute contour.

This novel approach was questioned in one of the above-mentioned accent papers on the basis of F0 measurements of conversational speech produced by three male Swedish speakers. The results suggested that the grave accent was consistently marked by an F0 fall on the primary stress syllable, while F0 in the acute words was quite variable. The robustness of the falling grave contour was taken to suggest that an F0 fall associated with the primary stress syllable is a primary acoustic feature of the basic grave F0 contour, while no such positively defined characteristic could be associated with the acute accent.

Another finding of this experiment also deserves mention. While a descending F0 contour was a stable feature of the grave accent, the rate of the F0 fall varied with consonant context. Specifically, when the following consonant was voiceless, and the vowel thus of shorter duration, this was compensated for by an adjustment of the rate of the fall, so that a relatively constant low F0 value was attained irrespective of vowel duration. Thus, the alternative, a truncation of the grave F0 contour in time-compressed vowels, did not occur. This result was taken as corroborating evidence for the interpretation of the falling F0 contour as a primary feature of the grave accent; however, it did not confirm previous results reported by Bannert and Bredvad-Jensen (1975).
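The difference between rate adjustment and truncation can be made explicit with a minimal numerical sketch. The linear fall, the function names and all F0 and duration values below are assumptions chosen for illustration; they are not measurements from the study.

```python
def fall_rate_adjusted(f0_start_hz, f0_low_hz, vowel_dur_s):
    """Rate adjustment: fall rate (Hz/s) needed to reach the same low
    F0 target within the vowel, whatever its duration."""
    return (f0_start_hz - f0_low_hz) / vowel_dur_s

def final_f0_truncated(f0_start_hz, f0_low_hz, fall_rate_hz_per_s, vowel_dur_s):
    """Truncation: the fall keeps a fixed rate and is cut off at the
    end of the vowel, so a short vowel ends above the low target."""
    return max(f0_low_hz, f0_start_hz - fall_rate_hz_per_s * vowel_dur_s)

# Hypothetical values: a fall from 140 Hz to 100 Hz, a long vowel of
# 160 ms and a short vowel (before a voiceless consonant) of 100 ms.
rate_long = fall_rate_adjusted(140.0, 100.0, 0.16)
rate_short = fall_rate_adjusted(140.0, 100.0, 0.10)
truncated_end = final_f0_truncated(140.0, 100.0, rate_long, 0.10)
```

With these assumed values, rate adjustment requires a steeper fall in the short vowel (about 400 Hz/s against 250 Hz/s), while truncation would leave F0 at about 115 Hz at the end of the short vowel. The reported data correspond to the first pattern.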

In the companion paper on the word accents, data were reported from an experiment in which the intonational context preceding segmentally identical grave and acute test words was systematically manipulated. Again, three male speakers (two of whom were the same as those used in the previous experiment) served as subjects. In brief, the results confirmed the conclusion that F0's behavior in acute words was largely predictable from sentence context whereas, in grave words, the F0 fall appeared consistently as expected. It was thus concluded that the only positively specified tonal feature of the Central Standard Swedish word accent contrast is an F0 fall on the primary stress vowel in grave words. This was more in line with the traditional view of the word accents than with the time shift model proposed by Bruce (1977). However, further experiments are needed to explore the perceptual validity of these conclusions.

In combination, these word accent studies were also of interest from an experimental point of view because they demonstrated the methodological advantage of using data from spontaneous speech as a basis for generating hypotheses, which may then be formally tested using a more precisely targeted speech material.

 

2.3 Acquisition of the word accent contrast

The above summary has suggested that at least the following statements concerning the phonetics of the grave word accent should be uncontroversial: a) a descending F0 contour is a stable acoustic feature of the grave accent; b) when produced with focal or emphatic stress, words with the grave accent display a rising F0 contour associated with the secondary stress syllable (but not always exactly time-aligned to it). Also, grave accent words occur very frequently in Swedish since, in principle, disyllabic stems, including compounds (a very productive word structure in Swedish), are characterized by the grave accent. As a result, practically every spoken sentence contains one or more examples of the grave tonal contour. A consequence of this is that infants reared in Swedish speaking environments are quite heavily exposed to these contours.

Tonal contrasts have frequently been assumed to be acquired earlier than segmental contrasts (e.g., Li and Thompson, 1977; Tse, 1978). Proposed explanations have included the claim that pitch is easy to control in both production (Li and Thompson, 1977) and perception (Tse, 1978). Independent evidence suggesting an auditory-motor basis for early acquisition of tone comes from studies showing that infants are able to imitate pitch before the middle of the first year of life (Kessen et al., 1979; Kuhl and Meltzoff, 1988) and to discriminate pitch as early as 4 weeks of age (Kuhl and Miller, 1982). For all these reasons, it appears likely that Swedish children would display tonal influences of the word accents at a relatively early stage of language acquisition, and that their tonal behavior would begin to diverge from that of non-Swedish children relatively early.

Until recently, however, there has been an almost complete lack of solid phonetic evidence bearing on the acquisition of the Swedish word accents, and little has been done on acquisition of accent and tone even at an international level (cf. Vihman, 1986). Existing data on tonal development generally come from small-scale, informal studies involving few subjects and vocalization samples, and lacking a cross-language perspective. Thus, there is a very meager basis for determining to what extent observed tonal properties of babbling and early words represent individual peculiarities, near-universal development traits, or reflexes of the phonetics of the child’s ambient language.

The study reported in Acquisition of the Swedish tonal word accent contrast (Abstract) thus represented a first attempt to experimentally and systematically investigate the child's path to a productive command of the Swedish word accents. For reasons discussed above, this development was expected to begin relatively early. Therefore, a developmental stage defined on the basis of an active vocabulary of approximately 50 words ('the 50 word point') was chosen as the point of departure; this corresponds to about 17 months of age. The idea was to a) examine the presence and possible systematicity of grave vs. acute tonal characteristics at that point, then b) go down in age successively until no such effects could be found. In such a way, we expected to find out at which developmental stage, defined in terms of lexical acquisition, a command of the accents is acquired, and to shed light on the process by which this is achieved. It was anticipated that both strategy and time of completion of this acquisition process would differ considerably from child to child (cf. Vihman, 1993).

The material for this study consisted of audio and video recordings of 5 Swedish and 5 American-English children. These were part of a series of longitudinal recordings of Swedish, American, French and Japanese children from about 9 to about 18 months of age, made within the project 'From babbling to speech'. The data were based on an examination of all disyllabic babbles and early words produced by these children during one recording session. On the basis of F0 parameters selected to reflect presence and amount of grave accent fall and sentence accent rise, the following observations were made:

a) There was great individual variation in both parameters within both language groups;

b) there was a considerable overlap between the two language groups in both parameters;

c) the only statistically significant between-group difference was in the rise parameter, i.e., the part of the F0 contour that corresponds to sentence accent in the adult Swedish norm. Thus, the Swedish children displayed, as a group, a higher rise than did the American children;

d) for the Swedish children, there was a significant difference in the rise parameter between vocalizations judged as pure babbles and those judged as approximations to Swedish words, such that the rise was higher in the latter than in the former vocalizations.
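To make the two F0 parameters concrete, here is one possible way of quantifying a fall and a rise in a disyllabic vocalization. The exact parameter definitions used in the study are not given in this summary, so the definitions below (fall from the first-syllable peak to the following minimum, rise from that minimum to the highest later value, both in semitones) and the function names are assumptions.

```python
import math

def semitones(f_low_hz, f_high_hz):
    """Interval between two frequencies in semitones."""
    return 12.0 * math.log2(f_high_hz / f_low_hz)

def fall_and_rise(f0_hz, boundary):
    """Return (fall, rise) in semitones for a voiced F0 track.

    f0_hz    -- F0 samples in Hz (all frames assumed voiced)
    boundary -- index of the first sample of the second syllable
    """
    first = f0_hz[:boundary]
    peak_idx = first.index(max(first))       # peak in the first syllable
    after_peak = f0_hz[peak_idx:]
    valley = min(after_peak)                 # lowest point after the peak
    valley_idx = peak_idx + after_peak.index(valley)
    late_peak = max(f0_hz[valley_idx:])      # highest point after the valley
    return semitones(valley, f0_hz[peak_idx]), semitones(valley, late_peak)

# Hypothetical contours: a fall followed by a rise, and a fall only.
fall_rise = fall_and_rise([200.0, 240.0, 200.0, 120.0, 120.0, 240.0], 3)
fall_only = fall_and_rise([240.0, 200.0, 160.0, 120.0, 120.0, 120.0], 3)
```

In terms of this sketch, observations c) and d) concern the second value of the pair, not the first.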

In summary, then, these data suggested that, at about 17 months of age, Swedish children are just beginning to produce grave-like F0 contours and to mark the appropriate words with those contours. However, the distinguishing F0 event did not occur in the F0 fall parameter, as could be expected by virtue of this parameter's robustness in adults’ grave accent productions, but on the secondary stress syllable. A possible explanation may come from the study of speech directed to infants, in which an exaggerated F0 rise on the secondary stress syllable seems to be a means of expressing emphasis and affection.

At first sight, these results appeared to cast doubt on the above-mentioned claim that tonal features are acquired early. From a typological point of view, however, they were not all that surprising; while tones in languages such as Thai or Mandarin Chinese are predominantly lexical, the function of the Swedish accents is to mark word and phrase structure; thus, they do not have a high semantic load. Hypothetically then, this may decrease the pressure to acquire the contrast at an early stage. Also, since the word accent contrast is intimately coupled to morphological structures, it is void of functional meaning until the child begins to master operations such as compounding, derivation and inflection; and preliminary observations suggest that this may not happen until around two years of age (see The Inter-Nordic study of language acquisition, Abstract). In summary then, the continuation of this study will focus on later rather than earlier stages of development. Another, current branch of the acquisition project is described in Is babbling language-specific? A listening test using vocalizations produced by Swedish and American 12- and 18-month-olds (Full text).

 

3 Studies of the prosodic-spectral interface

3.1 Effects related to speaking rate and stress

Pursuing the prosodic theme, we now return for a while to two papers mentioned earlier (section 1.2): Articulatory coordination in selected VCV utterances: A means-end view, and Articulatory correlates of stress and speaking rate in Swedish VCV utterances. An additional purpose of those studies was to use acoustical and movement data to study effects of variations in stress and speaking rate on a) the spectral characteristics of vowels and b) articulatory coordination patterns as observed using cineradiography. The question was: To what extent and by which means are primary vowel effects preserved in various stress and rate conditions? The same two subjects as before were thus asked to produce VCV utterances such as /ipi/, /ipa/, /api/, /ipu/ using all four combinations of two stress degrees and two speaking rates (which the subjects partly failed to control, as described in the papers).

The stress-rate versions produced by the male subject were spectrographed and analyzed by means of formant and duration measurements. As expected, vowel and consonant durations depended in varying degrees on both stress and speaking rate. However, there were also some less obvious results, namely that:

a) formant frequencies (F1 and F3 for /i/, F1 and F2 for /a/ and /u/) were significantly different across the stressed and the unstressed conditions, the stressed condition displaying the most elaborated vowel spectra. In the terminology of Jakobson et al. (1969), stressed /i/ was 'more acute' than unstressed /i/, stressed /u/ was 'more grave', and stressed /a/ was 'more compact';

b) however, variation in speaking rate did not lead to significant spectral effects.

Thus, durational variation caused by stress was associated with spectral effects, while variation induced by speaking rate was not associated with such effects to any considerable extent.

To obtain articulatory data, the midsagittal distances between the tongue surface and the roof of the mouth or posterior pharyngeal wall were measured and converted into cross-sectional areas following algorithms as explained in the papers. Those data showed the following effects of stress and speaking rate for the 2 speakers:

a) Variations in stress were, for all three vowels, associated with significant differences in distance and cross-sectional area values. For example, for the female subject, the range of distances observed for the stressed versions was 2-3 mm, while the range for the unstressed versions was 4.8-5.2 mm. For the male subject, the corresponding ranges were 3.1-5.3 mm and 6.0-7.6 mm, respectively.

b) The effect of rate was as follows: For the female subject, there was no appreciable difference related to variation in speaking rate. The constriction sizes ranged between 2.0 and 2.7 mm for the slow versions and between 2.5 and 3.0 mm for the fast versions. The corresponding ranges for the male subject were 3.1-4.4 mm and 3.3-5.3 mm, i.e., only a slight tendency to wider constrictions as a function of rate. Thus, the essential effects were due to stress rather than to speaking rate.

These results were discussed in relation to Lindblom’s vowel reduction model (Lindblom, 1963) in which degree of target attainment was predicted in terms of ideal target values, adjacent consonants and segment durations. Durational differences were assumed to cause 'undershoot' relative to the target irrespective of whether they resulted from stress or speaking rate variation. In contrast, the results summarized here supported the notions that a) Swedish tense vowels have robust spectral properties which are relatively insensitive to variation in speaking rate; and b) that stress acts to give extra salience to these properties. This conclusion received further support from the observation that spectral properties of the vowels were actively safeguarded under rate variation by means of motor reorganization of the entire VCV utterance, i.e., a case of compensatory response to a rate-induced change of articulatory conditions. Thus, following the means-end approach, the articulatory activity underlying those events was interpreted as a positive means to preserve the spectral features most characteristic of /i/, /a/ and /u/, i.e., their respective ‘acuteness’, ‘compactness’ and ‘gravity’.
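For readers unfamiliar with the undershoot idea, the following sketch shows the kind of prediction a duration-driven reduction model makes. It uses a commonly cited exponential form, in which the formant value reached in the vowel approaches its target as vowel duration increases; the exact formulation in Lindblom (1963) should be checked against the original, and the constants, the formant values and the function name below are hypothetical.

```python
import math

def undershoot_formant(target_hz, onset_hz, duration_s, k=1.0, beta=20.0):
    """Formant value reached in a vowel of a given duration.

    target_hz  -- ideal (context-free) formant target of the vowel
    onset_hz   -- formant value imposed by the adjacent consonant
    duration_s -- vowel duration in seconds
    k, beta    -- hypothetical constants, not fitted to any data
    """
    return target_hz + k * (onset_hz - target_hz) * math.exp(-beta * duration_s)

# Hypothetical F2 values: target 2200 Hz, consonantal onset 1600 Hz.
f2_short = undershoot_formant(2200.0, 1600.0, 0.06)
f2_long = undershoot_formant(2200.0, 1600.0, 0.20)
```

In a model of this type, duration is the only input besides target and context, so a short vowel is predicted to undershoot to the same degree whether it is short because of weak stress or because of fast speech. This is the prediction that the stress and rate data summarized above did not bear out.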

 

3.2 Durational vs. spectral bases of quantity

Given the observed spectral robustness of the tense vowels, the hypothesis presents itself that the two sets of Swedish vowels referred to as 'tense' and 'lax' (or 'long' and 'short'), i.e., vowels distinguished in terms of quantity, are also primarily differentiated in terms of spectral properties. It was hypothesized, moreover, that the tense vs. lax distinction would be maintained in terms of spectral differences in both stressed and unstressed contexts. An alternative hypothesis was that the distinction would be preserved in terms of duration.

These hypotheses were subjected to a preliminary test using a set of formant measurements from one tense vs. lax vowel pair: /i:/ and /I/ as in the minimally contrasting word pair vila ('rest') and villa ('house'). These words were put into sentence frames and produced with and without sentence stress by 5 male Swedish speakers. In brief, the results of these measurements, which were reported in Articulatory coordination in selected VCV utterances: A means-end view, were the following:

1) F3 provided a reliable criterion for separating vila and villa in both the stressed and the unstressed conditions with the greatest degree of separation occurring in the stressed condition;

2) stressed vila vs. villa displayed significant duration differences in both vowels and consonants;

3) in contrast, unstressed vila vs. villa were not consistently separated in terms of duration; in three of the five speakers, there was a small difference in both vowel and consonant durations (the vowel in vila being longer than in villa and vice versa for the consonant). However, these differences were absent in the remaining two subjects.

This experiment thus suggested that F3 provides a stable criterion for separating tense and lax i-cognates. For these vowels then, the labels ‘tense’ and ‘lax’ (acoustically as well as articulatorily interpreted) appear to describe the contrast more appropriately than 'long' and 'short'. It can be hypothesized, in addition, that durational effects associated with tensity and stress will fall out as a consequence of the extra amount of articulatory effort required to bring about the spectral effects underlying these features (cf. section 1.3). The results of a subsequent, small experiment reported in Acoustic features correlating with tenseness, laxness and stress in Swedish: Preliminary observations (in C.-C. Elert, I. Johansson & E. Strangert, eds.: Nordic Prosody III, University of Umeå, Acta Universitatis Umensis, Umeå Studies in the Humanities 59, 51-66; also in Reports from Uppsala University, Department of Linguistics (RUUL) 11, 8-22), in which speaking rate was incorporated as an additional independent variable, could be interpreted in essentially the same way.

It should be noted, however, that full verification of the above conclusions can be obtained only on the basis of a more complete test material. This is particularly important since experiments have indicated that vowel quality and duration may be associated with variable perceptual weights in different tense vs. lax cognate pairs (Hadding-Koch and Abramson, 1964). Thus, acoustical analyses are currently being carried out in order to complete the picture, and an attempt to replicate Hadding-Koch and Abramson's results, using up-to-date editing and synthesis techniques, is planned. To date, preliminary measurements have been made on the following set of words (some of which are somewhat far-fetched). The word list represents an almost complete inventory of quantity contrasts in Swedish. The mid syllables of these words, on which the measurements were made, are lexically unstressed, i.e., they provide an opportunity to evaluate acoustically the durational vs. spectral basis of quantity in unstressed syllables.

        TENSE/LONG VOWEL        LAX/SHORT VOWEL

        ovisheten               ovissheten
        järnsylverket           järnsyllverket
        kartrutmönstret         kartruttmönstret
        stenvägsbygge           stenväggsbygge
        svartlösjorden          svartlössjorden
        finmatlagning           finmattlagning
        Åmål-mässan             å-mollmässan
        gräddmostallrik         gräddmoussetallrik

Recordings were made of these words in sentence frames and in and out of focus. So far, preliminary data for one speaker have indicated, for a majority of the vowel pairs, that quantity cognates are clearly differentiated in terms of formant frequencies and less clearly, if at all, differentiated in terms of duration. This was particularly evident for words produced out of focus, i.e., with sentence stress placed on a different word in the frame. This operation greatly diminished the durational differences in all words, while formant values were just slightly affected. As expected, however, there was one exception, namely the front mid vowel /ɛ/, which closely approximates the 'neutral' vowel (Chomsky and Halle, 1968) and the quality of which is heard to remain essentially unchanged across the two quantities. Acoustically, then, stenvägsbygge and stenväggsbygge were practically identical when produced out of focus. Apart from this marginal exception, this result therefore provided preliminary corroboration of the hypothesis that the primacy of vowel quality as a basis for the Swedish quantity contrast can be generalized to the whole inventory. It should be noted, in addition, that these data also suggest that the quantity feature in Swedish is less dependent on lexical stress than is usually assumed (cf., e.g., Elert, 1964).

Obviously, however, the experiment needs to be repeated using an extended number of subjects and, ultimately, formal perceptual evaluation.

The above experiments did not have a cross-linguistic dimension. However, language-specific phonetic structures can be further highlighted if seen in such a perspective. This was the rationale for undertaking the cross-language studies to be summarized next.

 

3.3 Duration vs. spectrum (cont'd): cross-language observations

3.3.1 Using elicited speech materials

It was suggested above that spectral characteristics provide the primary basis for the Swedish quantity contrast in the sense of constituting the speaker's acoustic targets as well as the listener's primary objects of phonetic perception. In contrast, durational effects were thought of as side-effects of the articulatory maneuvers employed to meet these criteria. In many languages, however, e.g., Finnish, quantity seems to be based almost entirely on durational contrasts, with spectral variation playing a minor role (e.g., Sovijärvi, 1938, 1956; Wiik, 1965; Lehtonen, 1970); and a similar situation is reported for other quantity languages such as Czech and Serbo-Croatian (Lehiste, 1970). It is reasonable to assume, then, that durational contrasts in those languages will be more robust and more resistant to reduction than seems to be the case in Swedish.

To test this hypothesis, durational measurements were made on speech samples produced by a number of Swedish, Finnish and Czech speakers (Durational correlates of quantity and sentence stress: A cross-language study of Swedish, Finnish and Czech, UCLA Working Papers in Phonetics 63, 1-25). The material consisted of sentences containing test words representing all possible quantity contrasts on the lexically stressed syllables in the respective languages: V:C and VC: for Swedish, VC and V:C for Czech, and VC, V:C, VC: and V:C: for Finnish. These sentences were read with the test words in and out of focus. The results can be summarized as follows:

Swedish: Under the stressed speaking condition, all subjects displayed the expected complementary V:C vs. VC: patterns. In the unstressed version, the difference between durational means for the long vs. short vowels failed to reach statistical significance in 3 of the 6 subjects; and the difference between the long and short consonants also failed to meet this requirement in 3 subjects. In particular, 2 subjects distinguished neither vowel nor consonant duration in V:C vs. VC:.

Czech: Under both stress conditions, the long vowel in V:C had a significantly greater duration than the short vowel in VC. There is no appreciable complementary length effect in this Czech material of the kind seen in Standard Swedish. The main effect of stress was a lengthening of the long vowel in V:C.

Finnish: The durational data for Finnish were also quite straightforward in consistently reflecting the four possible patterns under both stress conditions. The VC and V:C: patterns were distinguished such that the total duration of V:C: was greater than the total duration of VC.

In summary, this result supported the hypothesis that durational patterns provide the primary acoustic features on which the Finnish and Czech quantity systems are built. This contrasted with Swedish and suggested a phonetic quantity typology with Finnish and Czech near one end of a duration vs. spectrum continuum, and with Swedish near the other end.
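The kind of per-speaker comparison referred to above (do long and short vowel durations differ reliably?) can be sketched as follows. The summary does not state which statistical test was used in the study; the exact permutation test below is a substitute chosen for illustration because it needs no distributional assumptions, and the durations are invented.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(long_ms, short_ms):
    """Two-sided exact permutation test on the difference of means.

    Enumerates every way of splitting the pooled durations into two
    groups of the original sizes and returns the proportion of splits
    whose absolute mean difference is at least the observed one.
    """
    pooled = list(long_ms) + list(short_ms)
    n_long, n_short = len(long_ms), len(short_ms)
    observed = abs(mean(long_ms) - mean(short_ms))
    total = sum(pooled)
    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_long):
        group_sum = sum(pooled[i] for i in idx)
        diff = abs(group_sum / n_long - (total - group_sum) / n_short)
        if diff >= observed - 1e-9:  # tolerance for rounding error
            extreme += 1
        n_splits += 1
    return extreme / n_splits

# Hypothetical vowel durations (ms) for one speaker.
p_stressed = permutation_p_value([150, 160, 155, 165], [80, 85, 90, 95])
p_unstressed = permutation_p_value([100, 110, 90, 105], [95, 100, 108, 92])
```

With these invented values, the stressed condition gives a small p-value (2 of the 70 possible splits are as extreme as the observed one), whereas the unstressed condition does not separate the two categories, which is the pattern described for several of the Swedish speakers.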

 

3.3.2 Using spontaneous speech

While elicited speech materials of the type just exemplified guarantee adequate control of experimental variables, they are only informative in terms of the isolated operation of those variables. They are thus artificial in precluding effects of the full spectrum of phonetic influences normally interacting in 'spontaneous speech'. Pure lab experiments therefore have a limited validity. For example, do our typological conclusions regarding the durational basis of quantity hold even in the face of a more realistic speech material? As a test of this question, we performed measurements of duration on conversational speech recorded with speakers of Swedish, Finnish and Estonian (Durational correlates of quantity in Swedish, Finnish and Estonian: cross-language evidence for a theory of adaptive dispersion, Abstract).

The Estonian quantity system is, like the Finnish one, unusually complex and thought to be primarily based on the duration dimension (e.g., Krull, pers. comm.). For reasons outlined above, it was thus hypothesized that both Finnish and Estonian speakers would preserve the durational correlate of quantity to a greater extent than would Swedish speakers. Such a state of affairs would also be compatible with the theory of 'adaptive dispersion' (Lindblom, 1990a) which, in essence, assumes that the speaker adapts his articulatory precision to provide the listener with 'sufficient phonetic contrast'.

The results of our durational analyses indicated that:

a) to a certain extent, quantity categories were maintained in terms of duration by speakers of all three languages;

b) however, the durational contrasts were considerably more distinct and consistent in Finnish and Estonian than in Swedish.

In other words, our hypothesis was largely confirmed. However, the following question still needed an answer: Why would quantity languages such as Finnish and Estonian refrain from engaging supplementary acoustic dimensions such as vowel quality or diphthongization to support the perception of the quantity contrast? It is not far-fetched to assume that languages with complex quantity systems would improve perceptual distinctiveness by recruiting these 'extra' dimensions. It was suggested in the paper that a possible explanation would be related to the fact that both Finnish and Estonian have extremely crowded vowel spaces when both monophthongs and diphthongs are taken into account; Finnish, for example, has 8 distinctive monophthongs and 18 distinctive diphthongs. In consequence, the room for quantity-related vowel qualities or diphthongizations would appear to be limited.

This is reasonable. In retrospect, however, the explanation is not completely satisfying. The reason is twofold: first, Swedish has a fairly crowded vowel space, too; and second, there is no a priori reason to believe that the acoustic characteristics of the vowels and diphthongs in Finnish and Estonian could not be auditorily enhanced without causing confusion. Maybe the following, means-end-inspired question is more to the point: If duration is the primary basis for quantity in Finnish and Estonian, what secondary effects would follow as fortuitous side-effects of the vocal tract's effort to bring about the required degrees of duration? This question does not seem to have an obvious answer. At least in the case of Finnish and Estonian, it seems to be a fact that the extra amount of time made available in the long vowels is not utilized for coarticulating additional sound effects.

 

4 Systematicity of phonetic variation in spontaneous speech

Even though a considerable amount of stability is to be found in allegedly primary acoustic features, phonetic variation is readily observed in many situations such as in conversational speech and at stylistic levels not normally encountered in the phonetics laboratory's recording booth. A theory of absolute acoustic invariance is therefore less than attractive. A more interesting task seems to be to describe the range of phonetic variation met with in various speaking styles and to try and determine to what extent and in which sense it is systematic. Such knowledge, of course, will be indispensable for understanding perception since it will help in clarifying the circumstances under which the perceptual system has to work.

Let us first mention the little noticed phenomenon of discontinuous phonetic variation, which was discussed in the paper Discontinuous variation in spontaneous speech (in 'Papers from the Second Swedish Phonetics Conference', Lund, 5-6 May 1988, Working Papers 34, Lund University, Department of Linguistics, 33-36; also in Phonetic Experimental Research, Institute of Linguistics, University of Stockholm (PERILUS) 8, 48-53). It was hypothesized that phonetically discontinuous word and phrase forms may have separate entries in the speaker/listener's phonetic lexicon, i.e., that the speaker and listener have the opportunity to choose among different acoustic target forms of what is semantically the same item (cf. Ohala, 1992). This seems to happen particularly frequently with words that are low in semantic content, either because of contextual/situational factors, or because they are function rather than content words in the language. My own transcriptions of such words include, for example, several instances of [utriksaspatmen] (5 syllables), as heard from a political journalist, for what would be conventionally represented as /utrikeshandelsdepartementet/ (10 syllables), in orthography Utrikeshandelsdepartementet ('The ministry of foreign trade').

It was demonstrated in the paper that the conjunction så att ('so that', 'such that') had two distinct phonetic forms in conversational speech, one 'reduced' [satt] and one 'elaborated' [so att]. Formant tracings of several instances of each of these showed a) that they were clearly separated with no transitional forms, and b) that both forms were phonetically elaborated in the sense of having distinctly articulated vowels and consonants. In particular, the 'reduced' form could easily be heard as a clear rendering of the verb form /satt/ 'sat'.

In other studies of spontaneous speech, we have revisited the problem of the relationship between duration and spectral effects in vowels (e.g., On the systematicity of phonetic variation in spontaneous speech, published in Phonetic Experimental Research, Institute of Linguistics, University of Stockholm (PERILUS) 8, 34-47). It was observed, among other things, that the movement of the higher formants in [VrV] sequences (in words such as bara 'only') from onset via turning-point to offset correlated strongly with the duration of the [VrV] sequence; the shorter the duration, the less formant movement. This, in contrast with previously obtained data from systematically elicited speech, appeared to be compatible with Lindblom's 1963 model. This brought us back again to the question of the relationship between variables such as stress, speaking rate and spectral variation. It was argued, among other things, that descriptive labels such as ‘allegro’, ‘lento’, ‘fast speech’ etc., which were en vogue at the time (e.g., Dressler, 1972; Dalby, 1984), are misleading in implying an oversimplified causal connection between variables such as speaking rate and degree of articulatory elaboration and, consequently, seem to do little more than restate the philosophy behind Lindblom’s original reduction model. An alternative but equally reasonable view of speaking rate and reduction, it was argued, would be that variation in perceived speaking rate is caused, as a secondary consequence, by the articulatory movements required to bring about acoustic effects related to stress or style (cf. Barry, 1984; Lindblom et al., 1992).
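The relation between [VrV] duration and formant movement mentioned above is a correlation between two measured quantities, and can be sketched as follows. The function is a plain Pearson product-moment correlation written out by hand; the duration and formant-movement values are invented for illustration and are not data from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical tokens of a [VrV] sequence: total duration (ms) and
# extent of formant movement from onset via turning-point to offset (Hz).
durations_ms = [80, 100, 120, 150, 180]
movement_hz = [150, 260, 330, 480, 560]
r = pearson_r(durations_ms, movement_hz)
```

A value of r close to 1, as in this invented data set, corresponds to the pattern 'the shorter the duration, the less formant movement'.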

However, everyday experience suggests that people are able to control speaking rate and articulatory precision independently, as becomes apparent when listening to personal speaking styles. Thus, some speakers may be perceived as slow and unenergetic because of a speaking style involving, for example, strong vowel reductions, frequent stop lenitions or monotonous intonations; and others may be perceived otherwise because of a high speaking rate combined with precisely attained vowel qualities and a vivid prosody. Speaking styles associated with particular groups or individuals therefore have a natural sociophonetic bearing and should be given more attention than they have received to date.

In connection with these sociophonetic comments, the next section will summarize phonetic work done in relation to a particular linguistic group, namely non-native speakers of a majority language.

 

5 Foreign accents - attitudes and signatures

From a phonetic point of view, attitudes to foreign accents and their consequences for social and linguistic interaction between native and non-native speakers constitute a largely unexplored area. A thorough search of the literature has also revealed that most of the questions that need to be asked in order to investigate attitudes to foreign accent have no answers, and that many basic questions have not even been raised; see Attitudes to immigrant Swedish - a literature review and preparatory experiments (Abstract). The following are some of the questions raised in that paper:

a) To what extent are native speakers of a language able to identify the linguistic origin of various foreign accents, when heard spoken?

b) To what extent can foreign languages be identified?

c) Can strength of foreign accent be quantified?

Answers to these questions are indispensable if attitudes to foreign accents are to be measured in a reliable way. One example: Suppose that a study of native listeners' attitudes to two languages, L1 and L2, seems to reveal a more positive attitude to L1 than to L2. The conclusion is drawn that this also reflects listeners' attitudes to the underlying ethnic groups. However, since accent strength can be assumed to affect attitudes, this result is confounded by an unknown factor and can only be judged as valid if this factor is corrected for. Thus, accent strength needs to be measured.

The project ‘Attitudes to Immigrant Swedish’ set out to solve problems of this kind. The first study (referred to above) concerned identification of foreign accents in terms of language of origin. Listener groups with various degrees of experience of foreign accents were tested: high school students around 18 years of age, and teachers of Swedish as a foreign language. The test material consisted of readings of several accented Swedish versions of ‘The North Wind and the Sun’, which were available in previously collected database material (IRIS, see section 6.1 below). The results of this experiment showed that identification of foreign accent in terms of language of origin was relatively poor in both groups. As expected, the teachers of Swedish as a foreign language scored better than the high school students, but very few accents (notably Finnish, Norwegian, English, German and French) could be identified.

In another experiment, similar listener groups were asked to identify languages rather than accents. This test was based on recordings of ‘The North Wind and the Sun’ in several languages available in the IRIS database. As expected, the results were somewhat better than in the foreign accent identification task, with teachers still scoring better than students. For example, at least 40% of the students were able to identify Norwegian, English, German, Finnish, Russian, French, Danish and Spanish. Most non-European languages were not correctly identified. However, the task of placing the language samples in the right part of the world was solved more successfully, revealing a somewhat unexpected ability to categorize languages on a geographical rather than genetic basis. For example, all African languages (Swahili, Tigrinya, Kinyarwanda and Yoruba, which do not all belong to the same language family) were identified as African above chance level. The reason for this relative success is not completely clear. However, broad areal phonetic trends might reasonably play a role (cf. section 7.1).

Strength of foreign accent can be assumed to be a significant predictor of natives' attitudes, as noted above. However, it was not clear how to measure degree of accentedness even though the question had been addressed before (e.g., Brennan and Brennan, 1981). Partly based on Brennan and Brennan, we carried out an experiment to test the hypothesis that accentedness as judged by linguists’ formal analysis of phonetic deviations will correlate positively with accentedness as more subjectively judged by informants without linguistic training. A high correlation between these methods would then simplify the task of specifying degree of accentedness for attitude measurements.

Non-linguists judged degree of accentedness, based on a short recorded passage, on a scale from 1 to 5, and a panel of 8 linguists (teachers and graduate students in phonetics and general linguistics) listened independently to the recordings and marked all noticed non-native phonetic features. A comparison of the two group results showed a high correlation. This was taken to indicate that non-experts' subjective judgments could be used as a reliable basis for quantification of accent strength. This experiment also showed that reliable judgments of accentedness can be made on the basis of a relatively small material. Thus, the non-linguists’ judgments were based on about 25 sec. of speech compared to 5 minutes used by Brennan and Brennan (1981). This is a methodologically useful result (which may perhaps also suggest that accent strength is judged quickly in real life situations).
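The comparison between the two kinds of accentedness measure can be illustrated with a small computation. The following is a minimal sketch only: the summary does not state which correlation coefficient was used, so the Pearson coefficient is an assumption here, and the per-speaker figures are hypothetical values invented for illustration, not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical data, one entry per recorded speaker (NOT from the study):
# mean number of non-native features marked by the linguist panel ...
linguist_feature_counts = [2.0, 5.5, 9.0, 12.5, 16.0, 21.0]
# ... and mean accentedness rating (scale 1 to 5) given by the non-linguists.
nonlinguist_ratings = [1.2, 2.0, 2.9, 3.4, 4.1, 4.8]

r = pearson_r(linguist_feature_counts, nonlinguist_ratings)
print(f"r = {r:.2f}")
```

A coefficient close to 1 for real data would correspond to the situation described above, in which the quick subjective ratings can stand in for the more laborious feature count.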

Based on the above experiments, the above-mentioned paper reported the results of an attitude experiment. One of the purposes was to test the hypothesis that stronger accents tend to elicit less positive native speaker attitudes than weaker accents. This was confirmed; stronger accents did elicit less favorable attitudes than weaker accents. An additional effect of accent strength was that weak accents were thought to come from languages spoken near Sweden, while stronger accents were thought to come from languages spoken in more distant parts of the world. In particular, weak accents were taken to come from Norwegian or some dialect of Swedish; slightly heavier accents from Finnish, next English or German, next Spanish, French or Greek, and so forth. The heaviest accents were thought to originate from African (Swahili, Tigrinya, Kinyarwanda or Yoruba) or Asian (Japanese, Korean or Bengali) languages.

This experiment, however, did not tell us much about either a) the importance of individual accent features in creating the impression of foreign accent, or b) which accent features, or combinations of features, were effective in creating impressions of particular foreign accents. But it is reasonable to assume that some accent features sound more foreign than others. Thus, in Perceived strength and identity of foreign accent in Swedish (Abstract), an experiment was reported in which several readings of the Swedish version of ‘The North Wind and the Sun’ were recorded, with all separate recordings systematically differing from one another in terms of deviations from the native phonetic norm; i.e., the deviations were intended to simulate features of foreign accents. Several accent features were introduced simultaneously. (This tongue-twisting exercise was performed by the present author.) The features (which are listed in table 1 of the paper) were selected with particular attention paid to what might be expected from British English and Finnish accents. A total of 34 versions were played to listeners, who were asked to judge, for each reading, whether it sounded like a) a foreign accent (and, if so, which one), b) a possible regional variant of Swedish, or c) merely strange.

The results showed that some versions were predominantly judged as foreign accents, some as Swedish dialects and some as merely strange. Minimal requirements to create impressions of foreign accent included deviations such as lack of aspiration in voiceless stops, velarized /l/, unrounding or backing of front rounded vowels, trilled /r/, and incorrect use of the word accents. Some particular combinations were found to create the impression of a Finnish accent.

The latter observation was followed up in a new experiment specially designed to examine the ingredients of Finnish accent more closely. This was again done by having the same Swedish speaker read the same text using various combinations of 5 supposedly Finnish accent features such as deviating word accents, durationally exaggerated quantity contrasts, unaspirated voiceless stops, velarized /l/, trilled /r/ etc. Acoustic measurements were also carried out on genuine Finnish accents to verify a) that the posited Finnish accent features were valid, and b) that the attempts to simulate them were in the right direction. A group of listeners was asked to indicate, for each reading, a) whether they thought that it sounded like a Finnish accent, and b) how strong they judged the reader’s foreign accent on a scale from 0 to 4.

On the whole, the results showed that perception of both accent strength and Finnishness increased with increasing number of deviations. The effect of adding features was thus cumulative, but the features were not equally powerful in creating the impression of Finnish accent. The outstanding features were unaspirated voiceless stops and exaggerated durational quantity contrasts. However, the experiment also showed that different combinations of accent features could give an impression of Finnish accent in Swedish.

6 Immigrant voices in Sweden (IRIS) - a database project

6.1 Background and motivations

The IRIS database was mentioned above as a source of language material for studies of foreign accent. The IRIS project was reported in IRIS - A data base for cross-linguistic phonetic research (Manuscript, Department of Linguistics, Uppsala University). The purpose of the project was to build up a phonetic database containing digitally represented, comparable speech records from a wide range of languages and foreign accents, many of which are encountered in ethnic minorities in Sweden. The database was meant to provide an easily accessible reference material for phonetic studies of those languages and accents.

To reach this goal, speech samples from several languages and dialects were recorded and complemented with recordings made in other phonetic laboratories. At the completion of the project, we were in possession of speech samples from about 120 languages; about 20 languages have been recorded after the official completion of the project. Each language in the database is represented by between 1 and 12 speakers. Some languages are represented by several regional variants; for example, Spanish is represented by six variants. Another ambition was to represent each language with several types of speech material, and to select material to exemplify special features of the languages.

The IRIS material has been used in phonetic courses at different levels and has provided material for various term papers. The material has also been used to carry out pilot tests of hypotheses concerning phonetic structures of various languages, and to study various aspects of foreign accents in Swedish, as described above. The main current use of IRIS is as a basis for a phonetic evaluation of the UPSID phonological database (see section 8 below).

Most phonetic analyses of languages other than Swedish referred to in this summary have been initiated on the basis of the IRIS database. Following generation and preliminary tests of various hypotheses, supplementary recordings have normally been made; these recordings have subsequently been incorporated into the database. The two studies described in the next few paragraphs were initiated and carried out in connection with the database work.

 

6.2 Special study #1: Aspiration in Swahili

Swahili is an important language in terms of number of speakers and status as native language and lingua franca in a vast region of Africa. However, the language has received almost no attention from modern experimental phonetics and, moreover, available descriptions seem to be partly unreliable. This paper (On aspiration in Swahili: Hypotheses, field-observations, and an instrumental analysis, Abstract) dealt with the contrast between aspirated and unaspirated voiceless stops in Swahili. This is a relatively complex contrast from a functional point of view.

On the basis of (the second author's) native intuition and our listening to several hours of recorded speech produced by Swahili natives, a set of hypotheses regarding functional and physical properties of Swahili aspiration was formulated and tested. Examples of these are: a) aspiration is a feature of N-class nouns as opposed to other noun classes; b) aspiration is used to express congruence between adjectives and N-class nouns qualified by those adjectives; c) Swahili aspiration is stronger, in terms of VOT, than is aspiration typically encountered in stressed voiceless stops in Swedish. These hypotheses were supported on the basis of VOT measurements made on a speech material recorded with a male native Swahili speaker.

6.3 Special study #2: Salient features of Lule Sami phonetics

In spite of being a traditional minority language in Northern Scandinavia, Sami (Lappish) is little known from a phonetic point of view. Our studies of one of the major dialects, Lule Sami, have therefore focused on some particularly salient patterns of the language, some of which form the phonetic basis for phonological phenomena related to the quantity contrast (Durational patterns of Lule Sami phonology, Abstract) and to the voiced vs. voiceless contrast (Preaspiration and the voicing contrast in Lule Sami, Abstract). A more complete survey, also covering aspects of intonation, was presented in Salient features of Lule Sami pronunciation (in C.-C. Elert, ed.: The Sounds of Lappish. Department of Phonetics, Umeå University).

The quantity article reports four experiments based on acoustical analyses of recordings made with native Lule Sami speakers of various dialectal origins.

Experiment 1

This experiment demonstrated, using acoustic analysis, that duration in Lule Sami is instrumental as a basis for quantity in its lexical as well as grammatical function. The lexical contrast can be exemplified by the two vowel lengths in mánná [ma:n:a:] (the child) and manná [man:a:] (he walks). Among grammatical categories expressed by duration are person and aspect in verbs as exemplified by maná [mana:] (you walk), manná [man:a:] (he walks) and man´ná [man::a:] (he starts walking). These words also illustrate the two degrees of quantity in Lule Sami vowels and the three degrees of quantity in consonants. The occurrence of the three quantity degrees displayed by the above stem consonants is, in the Fenno-Ugric literature, usually referred to as ‘grade alternation’. In the literature on this language, the three quantities are usually called grade 1, 2 and 3, respectively. Certain grade 3 forms are associated with a so-called epenthetic vowel as in the phrase (vatte) bál´kán [pa:lahka:n] (‘give as salary’), where the accent sign represents the epenthetic vowel, whose quality depends on the surrounding vowels, vs. mávsij bálkáv [pa:llka:w] (‘paid salary’).

Experiment 2

The previous experiment demonstrated that syllables in Lule Sami display a wide variety of durations. For example, a syllable with a long vowel and an ‘overlong’ consonant, as in mánná, turns out to have a much greater duration than a syllable with a short vowel and a short consonant, as in maná. It was hypothesized on the basis of this result, and confirmed by means of measurements of duration in connected speech, that there would be no physical evidence of isochrony at the syllabic level (‘syllable-timing’; cf. Pike, 1947). Physical isochrony at the level of the stress-group (‘stress-timing’) was also absent since the duration of interstress intervals was a function of the number of syllables.
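The reasoning about stress-group isochrony can be made concrete with a simple regression. The sketch below is illustrative only: the interval durations are hypothetical values, not measurements from the Lule Sami material, and a least-squares line is merely one convenient way (assumed here, not taken from the paper) of expressing how interstress interval duration depends on syllable count. Under strict physical stress-timing the slope would be close to zero; a clearly positive slope means that duration grows with the number of syllables.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = a*x + b."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical interstress intervals (NOT measured data):
# (number of syllables in the interval, interval duration in ms)
intervals = [(1, 310), (2, 480), (2, 510), (3, 690), (3, 720), (4, 900)]

syllable_counts = [n for n, _ in intervals]
durations_ms = [d for _, d in intervals]

slope, intercept = least_squares_line(syllable_counts, durations_ms)
print(f"added duration per syllable: {slope:.0f} ms")
```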

On the one hand, these measurements showed that durational contrasts associated with quantity patterns were carefully preserved in connected speech, resulting in no signs of syllable or stress group isochrony. On the other hand, our measurements agreed with auditory impressions indicating that Lule Sami is characterized by a relatively low syllable rate and short stress-groups, factors that have been referred to in discussions of rhythmic types in the world’s languages (e.g., Dauer, 1983).

The fact that the quantity distinction, in connected speech as well as in isolated words, is strictly upheld in Lule Sami is probably due to the complexity of the system and to the fact that vowel qualities seem to play a negligible role in the quantity distinction (cf. the above discussion of Finnish and Estonian). It could be speculated, however, that the extensive grammatical use of quantity also plays a role in this since contrasts between unrelated lexical items are probably better supported by extralinguistic context than are contrasts between grammatical functions such as aspect in verbs.

Experiments 3 and 4

Finally, the article contributed to the solution of a long-standing Fenno-Ugrist problem by documenting the phonetic nature of the so-called epenthetic vowel which accompanies many grade 3 forms in Lule Sami. The question made explicit was: Is it an 'epenthetic' vowel in the sense of being a consequence of the sequential articulation of certain consonants? Or is it intended, planned and part of the Lule Sami phonetic norm? It was shown that, in the speech of all our informants, this vowel was a full-fledged one on the basis of duration and syllabicity. This observation was also indirectly supported by the fact that the appearance of this vowel in a word caused the preceding vowel to lengthen. The reason for this lengthening was probably that this vowel opened the preceding syllable, adding duration to its vowel. Had the vowel been non-syllabic, the lengthening of the preceding vowel would probably not have come about. However, the discussion of this in the paper is in error in implying that Lule Sami may constitute a counterexample to the word length effect. In fact, experiment 4 showed, on the basis of supplementary material, that the word length effect indeed operates in Lule Sami.

In the companion paper on preaspiration and voicing in Lule Sami, a ‘voice offset time’ (VOffT) parameter was defined as the durational aspect of preaspiration in stop consonants (somewhat analogous to VOT for postaspiration, as defined by Lisker and Abramson, 1964). The respective roles of VOffT, stop closure duration and VOT as phonetic bases for what might, at a rather high level of abstraction, be called a voiced vs. voiceless contrast were investigated by means of acoustical analysis of speech samples produced by speakers of two dialects. For one of the dialects, VOffT provided a reliable phonetic basis for the voicing feature; for the other group, this relationship was less clear. Stop closure duration was not consistently related to the voiced-voiceless contrast, and the VOT parameter was completely irrelevant; VOT values generally fell in the short lag range typical of the widespread plain voiceless stop category.

The above observations on the phonetics of Lule Sami were made in a phonological perspective. However, the sound patterns of a language can also be seen as 'impressionistic' phenomena that give the language a special, global character and make it easily recognizable. These impressions may convey meanings to the non-native listener that are probably neither intended by the speaker nor perceived by the native listener. For example, in our intonational analysis of Lule Sami (in the above-mentioned Salient features of Lule Sami pronunciation), we found that stressed syllables have a quite prominent pitch lowering, which is particularly marked at sentence endings. In addition, a prominent utterance-final aspiration frequently occurs. To a Swedish listener, a falling voice and something that sounds like a deep sigh at the end of an utterance is an emotionally loaded signal; it means that the speaker is expressing a feeling of something like resignation or hopelessness. In Lule Sami, however, these signals are perfectly regular and emotionally unmarked. Such linguistic differences may give rise to misperceptions of a more subtle kind than missed phonological distinctions. It is likely that cross-linguistic differences of this kind may contribute to creating or confirming attitudes towards groups and individuals. They thus represent a sociophonetic aspect of foreign accent that deserves more attention than it has received to date.

7 Phonetic typology: Constraints and biases

7.1 Areal biases in stop paradigms

Some of the long-term goals of phonetic language descriptions are to create a representative picture of the phonetic variation found in the world's languages, and to attempt to specify the constraints which may set limits on the possible variation. Were the phonetic repertoires of languages that existed in prehistory essentially the same as today?

Evidence exists to suggest that they should have been very similar, such as the vocal tract's capacity to produce a wealth of sound effects which do not seem to occur in any human languages; and, at the other end, the existence of a few sound types that are almost ubiquitous. Combinatorial exercises indicate that this state of affairs cannot have come about at random. Secondly, these asymmetries can be derived deductively to a certain extent; thus, markedness theories have identified many good candidates for the rank of universal constraint (cf., e.g., Ohala, 1983; Stevens and Keyser, 1989; Lindblom, 1990b; Willerman, 1994).

In an ongoing series of studies, we attempt to use data on the sound inventories in the world's languages that are as well documented as possible to test predictions made on the basis of allegedly universal constraints. For example, there is the well-known hypothesis that the world’s languages universally tend to favor voiceless over voiced stop consonants, and that the strength of this tendency increases from front to back places of articulation. In particular, voiced velar stops are said to be avoided more often than voiced bilabial or voiced dental/alveolar stops. Among the voiceless stops, on the other hand, the bilabials are thought to be underrepresented in the world’s languages in comparison with voiceless dental/alveolars and velars. This, according to Gamkrelidze (1978), would give rise to stop paradigms with 'empty slots' or 'gaps' as shown in table 1.

 

Table 1. Frequent stop gap patterns as hypothesized by Gamkrelidze (1978).

      (a)      (b)      (c)
      p t k    - t k    - t k
      b d -    b d g    b d -

The relatively low occurrence of voiced velar stops in the world's languages has been explained as a consequence of the fact that the volume between the glottis and the velar closure can be expanded less efficiently than the cavity volume in stops with more fronted places of articulation. Thus, there are few possibilities to maintain the transglottal pressure drop at a level required for voicing (Ohala and Riordan, 1979; Ohala, 1983). On the other hand, the frequent lack of /p/ would reflect the more favorable voicing conditions at the bilabial place of articulation; thus, /p/ might become voiced relatively easily through historical sound change, and /b/ might resist devoicing more efficiently than voiced stops at more posterior places of articulation; in addition, the /p/ burst is auditorily less salient than the /t/ and /k/ bursts due to low amplitude and spectral diffuseness (e.g., Jakobson et al., 1969).

In the paper Areal biases in stop paradigms (Abstract), the hypothesis that stop paradigms tend to display gaps at the voiced velar and voiceless bilabial positions was first tested using the UPSID database. Following a procedure much like that of Sherman (1975), stop inventories with a voiced vs. voiceless contrast in the bilabial-dental/alveolar-velar series were identified; from these, all inventories lacking one or more of the six possible stop cognates were drawn, and the number of gaps was counted.
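The gap-counting procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure as implemented in the paper: the toy inventories and their names are hypothetical, dental and alveolar stops are collapsed into "t" and "d", and an inventory is taken to have a voicing contrast if it contains at least one voiceless and at least one voiced stop of the six (the exact criterion is not given in this summary).

```python
from collections import Counter

VOICELESS = ("p", "t", "k")
VOICED = ("b", "d", "g")
STOPS = VOICELESS + VOICED


def has_voicing_contrast(inventory):
    """Assumed criterion: at least one voiceless and one voiced stop of the six."""
    return (any(s in inventory for s in VOICELESS)
            and any(s in inventory for s in VOICED))


def count_gaps(inventories):
    """Count, for each of the six stops, how many contrast-bearing inventories lack it."""
    gaps = Counter()
    for inventory in inventories.values():
        if not has_voicing_contrast(inventory):
            continue  # inventories without a voicing contrast are left out
        gaps.update(s for s in STOPS if s not in inventory)
    return gaps


# Hypothetical toy inventories (NOT taken from UPSID).
toy_inventories = {
    "lang_A": {"p", "t", "k", "b", "d"},       # pattern (a): /g/ gap
    "lang_B": {"t", "k", "b", "d", "g"},       # pattern (b): /p/ gap
    "lang_C": {"t", "k", "b", "d"},            # pattern (c): /p/ and /g/ gaps
    "lang_D": {"p", "t", "k", "b", "d", "g"},  # complete paradigm, no gaps
    "lang_E": {"p", "t", "k"},                 # no voicing contrast, excluded
}

gaps = count_gaps(toy_inventories)
for stop in STOPS:
    print(stop, gaps[stop])
```

Running the same count separately for each geographical area, rather than over the pooled sample, is what allows areal biases to be distinguished from a universal tendency.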

At first glance, the results of this analysis seemed to corroborate Gamkrelidze's hypothesis that gaps in stop paradigms tend to be universally concentrated at the voiced velar and voiceless bilabial positions. However, there were strong areal biases such that, for example, there was a low incidence of /g/ gaps and a high incidence of /p/ gaps in the African languages, while the reverse was true of the American languages. Thus, the African languages were responsible for most of the /p/ gap effect observed in UPSID as a whole, and they did not conform to the trend to avoid voiced velars. On the basis of an areally differentiated stop gap analysis, it could therefore not be concluded that velars and bilabials constitute universally underrepresented members of the respective voiced and voiceless stop series. Although this pattern is to be expected from proposed production and perception constraints, it seems to be largely overridden by areal biases. This result thus suggested that even if constraints on production and perception do influence the design of sound inventories, there may also be considerable room for opposing influences.

The sound change mechanisms by which areal trends develop are not well understood in the sense that it would be possible to predict when and why a certain change will occur. However, the phonetic preconditions for change can be analyzed such that some developments are more likely to appear than others (cf. Ohala, 1993). This was the topic of a paper that accompanied the just mentioned one and that specifically attempted to outline a theory for explaining the phonetic preconditions for the development of click sounds (Why are clicks so exclusive?, Abstract).

7.2 Why are clicks so exclusive?

One of the most striking examples of areal skewness in the world's sound inventories is the limitation of clicks to the languages of southern Africa. In that part of the world, however, click systems are extremely common and are found both in Khoisan and non-Khoisan languages. Moreover, clicks are fundamental to the phonologies of many languages of the region, and click inventories can be amazingly complex (e.g., Beach, 1938; Traill, 1985; Ladefoged and Traill, 1994).

Attempts to explain the mystery of the clicks are discussed in the article just mentioned. The approach to the problem taken in that paper begins with the uncontroversial premise that 'organic' sound change (i.e., sound change which is not caused by language contact or the like) is phonetically motivated in the sense that 'old sounds' always provide the phonetic preconditions for the historical appearance of ‘new sounds’. A well-known example is the development of lexical tone contrasts on vowels following voiceless and voiced obstruents, as noted above. It was argued, therefore, that a) sounds that are very unusual and have a very skewed distribution are likely to develop only in the presence of certain phonetic preconditions, which are themselves rare, but happen to be amply represented in a particular area; and b) if these unusual and skewed sounds become highly rated in the language, e.g., by virtue of being communicatively efficient, they will survive and perhaps dominate the linguistic scene for a long time.

It was further proposed in the paper that preconditions for the historical development of clicks presumably include the presence of doubly articulated consonants such as /kp/, /gb/ and /ŋm/, a sound type known to be overrepresented in African languages. It was pointed out that these consonants have striking similarities with clicks: they have a double articulation which invariably involves the soft palate, and they tend to exhibit a negative oral pressure and an audible click-like component associated with the labial release (e.g., Doke, 1931; Ladefoged, 1964). And it was noted that the doubly articulated stops are, in their turn, closely related to the implosives, which also represent an areal peculiarity of the African languages (cf. Ladefoged, 1964). The conclusion reached was that the clicks are exclusive, not because they score low in a markedness hierarchy, but because their development presupposes a unique phonetic-typological environment. Unfortunately, however, this hypothesis cannot be subjected to a crucial test since little is known about the prehistory of the African languages. On the other hand, a general phonetic theory of the preconditions for historical development may assist historical linguistics in providing criteria for a well-informed choice between alternative hypotheses (Ohala, 1993).

 

7.3 Two studies of voicing in stop consonants

The two studies to be summarized next are, in different ways, relevant to the above typological theme. These studies raised the following questions: a) How is voicing in stop consonants realized in realistic speech situations such as everyday conversational speech? b) Does stop voicing in infants' and young children's prelinguistic babbling and early word production display evidence of vocal tract constraints?

a) Approximantization of voiced stops

There are, as discussed above, aerodynamic reasons to assume that stop articulation provides a less than ideal context for voicing, and that voicing is hindered more with back than front places of articulation. It was shown in the above-mentioned paper on areal biases in stop paradigms that this constraint is at work in many languages but that it can be largely overruled by areal trends. To further evaluate the role played by this constraint, it is of interest to examine how it is dealt with in connected speech processes. This was done in a paper entitled Lenition of stop consonants in conversational speech: evidence from Swedish (with a sideview on stops in the world’s languages), in Arbeitsberichte, Institut für Phonetik und digitale Sprachverarbeitung, Universität Kiel (AIPUK), 31, 31-41. Three alternative possibilities were distinguished in that paper:

1) The 'speaker-friendly' alternative: If active cavity expansion does not take place during voiced stop closures, voicing will cease almost immediately. There will be no extra expenditure of physiological energy on the part of the speaker, but the contrast between voiced and voiceless stops will become less clear-cut;

2) the 'listener-friendly' alternative: If active cavity expansion does take place, some extra energy expenditure (presumably increasing from front to back) is required on the part of the speaker, but the listener will benefit from the resulting enhancement of the voicing contrast;

3) the 'optimal compromise' alternative: Voicing is preserved by articulatory lenition, specifically approximantization, such that the intraoral pressure will be kept at a level suitable to maintain glottal vibration. This means that the stop is produced with incomplete closure and voicing, but without frication. In the paper, then, the term 'lenition' was used in this restricted sense.

It was argued in the paper that there are two reasons why preservation of voicing by lenition is an optimal compromise: 1) In many languages, including Swedish, lenition of voiced stops will not jeopardize phonological contrasts; and 2) lenition in itself involves articulatory ease, i.e., incomplete stop closure. Thus, lenition preserves voicing in a way that suits both the listener and the speaker.

It was thus hypothesized that voiced stops would tend to become approximants in Swedish connected speech, and that this kind of lenition would occur more often at back than at front places of articulation. In contrast, in the voiceless stops, in which voicing should be avoided, we did not expect this kind of lenition to occur.

The paper reported a test of this hypothesis. Conversational speech was recorded with three male Swedish speakers, one of whom was studied in the present experiment. Occurrences of VCV sequences were identified (approximately 50 occurrences per stop), in which the stops were either voiced /b d g/ or voiceless /p t k/. The VCVs were excised in such a way that the underlying utterance or language could not be identified, and they were randomized and recorded on digital audio tape for a listening test. Linguistically trained subjects were used as listeners since the task was to make a 'phonetic' judgment of the stimuli rather than to identify their meaning. In one part of the test, the listeners were asked to mark whether the consonant of the VCV sequence sounded like a stop, or whether it sounded like something else, e.g., a fricative or an approximant. In a second part of the test, the same VCV sequence was played once more, and the listeners were asked to mark whether the consonant in question sounded voiced or voiceless.

Listeners' responses indicated the following:

a) The voiceless stops /p t k/ were judged as stops in 90-100% of the cases. The voiced /b d g/ were judged to be stops much less frequently, from about 65% for /b/ to about 35 and 40% for /d/ and /g/, respectively;

b) voicing was preserved in both long and short voiced stops in 80 to 100% of the cases.
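The percentages above amount to a simple tally over listener judgments. A minimal Python sketch follows; the data layout (one pair of consonant and stop/non-stop judgment per response) and the function name are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def stop_judgment_rates(responses):
    """Return the percentage of 'stop' judgments per consonant.

    responses: iterable of (consonant, judged_as_stop) pairs, one pair per
    listener judgment of one VCV stimulus (hypothetical layout).
    """
    totals = defaultdict(int)  # number of judgments per consonant
    stops = defaultdict(int)   # number of 'stop' judgments per consonant
    for consonant, judged_as_stop in responses:
        totals[consonant] += 1
        if judged_as_stop:
            stops[consonant] += 1
    return {c: 100.0 * stops[c] / totals[c] for c in totals}
```

The same tally, applied to the voiced/voiceless judgments from the second part of the test, would give the voicing percentages in b).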

Acoustical measurements were compatible with these listener reactions. Thus, voiced stops in Swedish connected speech turned out to be lenited to a high degree as expected. However, the expectation that the extent of lenition of the voiced stops would increase from front to back places of articulation, i.e., in the order labial, dental, velar, was not borne out completely; the dentals were heard as lenited somewhat more often than the velars. It is possible that this was due to the fact that the dentals and the velars tended to occur in different morpheme classes; specifically, most of the dentals occurred in words belonging to 'closed' morpheme classes (such as pronouns and inflectional suffixes), while most of the velars occurred in words belonging to 'open' classes (such as noun and verb stems). It might be expected that closed morphemes are, on the average, pronounced in a more reduced style than open morphemes (cf. Lindblom, 1990; Willerman, 1994). When this difference was eliminated by using the listener responses pertaining to open morpheme classes only, the data went more in the expected direction.

Thus, the data essentially corroborated the view that stop voicing is preserved in connected speech in Swedish in a way that can be thought of as an optimal speaker-listener compromise. In summary then, voiced stop consonants may be tolerated in many languages since it is possible to find a convenient way around the aerodynamic constraints which would otherwise prevent them.

Do infants voice stops?

This question was examined in VOT in stop inventories and in young children’s vocalizations: preliminary analyses (Abstract). It was hypothesized that the same vocal tract and perhaps auditory constraints that are assumed to cause voicing asymmetries in many of the world's stop inventories would be laid bare in the vocalizations of infants and young children who are not much influenced by the phonetic norm of the ambient language. As a preliminary test of this possibility, VOT was measured in initial stops produced by 24 Swedish and 27 American English children, aged 12 and 18 months, with about the same number of children in each age group. The stops were auditorily classified as bilabial, dental/alveolar or palatal/velar. The 12-month-olds produced mostly babbling, while the 18-month-olds produced both babbles and approximations to adult words. Only vocalizations in which the initial consonant was heard as a full, non-nasal stop were included. For the vocalizations that could be identified as adult words, place of articulation was inferred from what was heard rather than from the adult form of the target word.

The material was drawn randomly from a multi-purpose, video-recorded, cross-sectional child language database built up in the project 'Early language-specific sound development: experimental studies of children from 6 to 30 months of age'. For the purpose of this experiment, the data for Swedish and American children were pooled. The VOT data were divided into three intervals: voicing lead (negative VOT), short lag (VOT between 0 and 25 ms) and long lag (VOT greater than 25 ms).
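The three-way division of the VOT data can be made concrete with a minimal Python sketch. The function names, and the choice to count values of exactly 0 ms and exactly 25 ms as short lag, are assumptions made for illustration; the summary does not say how boundary values were treated.

```python
from collections import Counter

SHORT_LAG_MAX_MS = 25.0  # upper bound of the short lag interval given in the text
CATEGORIES = ("lead", "short lag", "long lag")


def vot_category(vot_ms):
    """Assign one VOT value (in ms) to one of the three intervals."""
    if vot_ms < 0:
        return "lead"       # voicing starts before the release
    if vot_ms <= SHORT_LAG_MAX_MS:
        return "short lag"  # 0 to 25 ms (boundary handling is an assumption)
    return "long lag"       # greater than 25 ms


def category_proportions(vots_ms):
    """Return the proportion of tokens falling in each interval."""
    counts = Counter(vot_category(v) for v in vots_ms)
    total = len(vots_ms)
    if total == 0:
        return {cat: 0.0 for cat in CATEGORIES}
    return {cat: counts[cat] / total for cat in CATEGORIES}
```

Applied separately to the tokens for each place of articulation and age group, `category_proportions` would yield the kind of per-interval proportions discussed in the text.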

For the 12-month-olds, the bilabials were found in the lead interval more frequently than the dental/alveolars or palatal/velars; the latter were the least common in the lead interval. In the long lag interval, the pattern was reversed such that the palatal/velars were more common than the labials, with the dental/alveolars falling in between. For the 18-month-olds, the picture was similar except that a greater proportion of the tokens were found in the voicing lag intervals; the bilabials, in particular, were less common in the lead interval as compared to the data for the 12-month-olds. Most significantly, however, the 25 ms short lag interval was proportionately overrepresented such that 47% of the bilabials, 61% of the dental/alveolars and 33% of the palatal/velars were found in this interval. Thus, for both ages, the modal VOTs are short (bilabials and dental/alveolars) to moderate (palatal/velars). This pattern confirmed previously reported babbling and early speech data (e.g., Eilers et al., 1984), and is also close to VOT values previously observed in plain stops in a number of languages (e.g., Lisker and Abramson, 1964).

Our analyses thus suggested that VOT patterns for children at 12 and 18 months of age parallel voicing asymmetries found in many of the world's stop inventories. Specifically, the typologically frequent plain stops were overrepresented in the data. These parallels thus indicated that sound patterns in children's vocalizations and those favored in the world's languages may have a common origin in vocal tract, and perhaps auditory, constraints.

 

8 Phonetic typology: Phonetic evaluation of UPSID

UPSID allows analyses and comparisons of sound inventories in 451 languages - a respectable proportion of the world’s languages. It is, however, important to realize that results of UPSID analyses, such as those discussed in section 7.1, need to be interpreted with due caution. The reason is that it is frequently hard to determine whether the sources on which UPSID is based represent the sound patterns in sufficient phonetic detail. While some field linguists, whose documentation has been used in UPSID, have reported the auditory quality of vowels in the narrowest detail, others have simply relied on the commonest sound symbols. Consequently, it is necessary to raise the question to what extent UPSID offers a correct and reliable empirical basis for phonetic-typological generalizations across the world’s languages.

The ongoing project 'Form and substance in phonetic systems' represents an attempt to deal with this problem. The idea is to use experimental methods, including acoustical and auditory analyses, to evaluate UPSID’s feature-based representation of the sound inventories of several languages. The languages are those common to UPSID and IRIS, which is an intersection of 47 languages with a fair geographical and genetic distribution. The analyses focus primarily on the vowel and stop inventories in those languages (including ejectives, implosives, doubly articulated stops and clicks).

A phonetic evaluation of UPSID is relevant to the areal analyses reported in the above-mentioned papers on clicks and areal biases. For another application, consider the following:

A previous analysis of UPSID (Lindblom and Maddieson, 1988) resulted in a hypothesis called the ‘size principle’. Lindblom and Maddieson classified UPSID’s segments in terms of the three categories ‘basic’, ‘elaborated’ and ‘complex’, and plotted the number of obstruents used in each UPSID language as a function of the total number of consonants used in the language. They found a very robust relationship between system size and the order in which obstruent types were recruited. In small systems, only basic types appeared, intermediate systems used basic and elaborated types, and large systems included all three types. This result was interpreted to mean that small systems pose fewer intrasystemic distinctiveness demands than large systems, so that the elementary basic sounds are sufficient. In intermediate and large systems, the competition within the consonant and vowel paradigms is harder such that clearer contrasts are necessary, i.e., elaborated and complex sound types are recruited in addition to the basic ones.

Suppose, however, that a given inventory appears to contain only consonant segments belonging to the basic category, e.g., /p t m n/. If the inventory is small, this is as expected from the size principle. Suppose further that, without our knowledge, the sounds that our written source refers to as bilabials are actually labiovelar, i.e., sounds produced with a double place of articulation, and thus elaborated rather than basic. This would give a misleading picture of the popularity of feature dimensions in the world’s languages. Thus, since the phonetic interpretation of some UPSID sources is uncertain, there may be sources of error that give the model predictions an apparent but false legitimacy. Even a moderate occurrence of such artefacts would be a serious challenge to the markedness theory.

The following is a general outline of analyses being made in the project:

Vowels

a) Vowel systems are projected on a common F1/F2/F3 space. Formant frequencies are measured and normalized. F0 and duration are documented as needed.

b) Secondary vowel processes that deviate from the basic type (nasalization, retroflexion, velarization, pharyngealization, diphthongization) are observed auditorily, by means of formal listening tests and acoustical analysis.

c) Phonation type and glottal adjustments (voicing, laryngealization, breathy voice etc.) are observed auditorily, by means of formal listening tests and acoustical analysis.
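The summary does not specify which normalization procedure is applied to the formant measurements in a). Purely as an illustration, the sketch below uses Lobanov-style z-score normalization, one common speaker-intrinsic method, on the (F1, F2, F3) values of a single speaker. The function name and the use of the sample standard deviation are assumptions and should not be read as the project's documented method.

```python
from statistics import mean, stdev


def lobanov_normalize(tokens):
    """Z-score each formant across one speaker's vowel tokens.

    tokens: list of (F1, F2, F3) tuples in Hz, all from the SAME speaker.
    Needs at least two tokens, and each formant must vary across tokens
    (otherwise the standard deviation is undefined or zero).
    """
    columns = list(zip(*tokens))        # one tuple of values per formant
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [
        tuple((f - m) / s for f, m, s in zip(token, means, sds))
        for token in tokens
    ]
```

After this step each formant has mean 0 and standard deviation 1 within a speaker, so that vowel systems from speakers with different vocal tract sizes can be plotted in one common space.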

Stops

Are basic stops basic? This is first tested by means of a preliminary auditory search for the possible presence of, e.g., a) secondary modifications such as labialization, palatalization, velarization or pharyngealization; b) elaborated or complex articulations such as implosion, ejection, clicking, affrication, nasalization or prenasalization; c) glottal maneuvers such as aspiration, preaspiration, laryngealization, breathy voice. Candidates are evaluated acoustically and by means of formal listening tests.

The following exemplifies the kind of results coming out of the project:

Voiced implosives in Igbo (and, according to Ladefoged, 1964, many West African languages) are, judging from auditory impressions and observed formant transitions, clearly velarized. This appears to be the primary characteristic of the voiced implosives in Igbo. In contrast, implosives observed in Vietnamese speech samples appear to be produced without velarization. Instead, the characteristic feature of the Vietnamese implosives seems to be strong and growing voicing during the closure interval. The Vietnamese implosives thus seem to be more similar to stops with voicing throughout the closure as observed in languages such as Russian or French, although the voicing seems to be even stronger in Vietnamese than in those languages. These data thus suggest that the category referred to as implosives may include very different kinds. While the Vietnamese implosives are to be characterized as strongly voiced stops, those in Igbo involve an additional mechanism. Thus, a well-informed markedness theory will probably distinguish between these sound types.

 

9 Phonetic typology and sound change - the Swedish dialects

The phonetic 'microvariation' represented by the dialects of 'the same language' can be seen as the result of relatively few diachronic steps affecting a common vocabulary. Thus, a careful phonetic study of the dialects can make possible detailed and reliable inferences as to which sound changes must have taken place. These inferences can provide a valuable 'answer key' against which it is possible to evaluate hypotheses concerning areal, perceptual or articulatory constraints on diachronic processes.

This was one of several reasons why we have recently embarked upon the project 'Phonetics and phonology of the Swedish dialects around the year 2000' (with the acronym 'SWEDIA 2000'); see Phonetics and phonology of Swedish dialects around the year 2000: a research plan (Abstract), Phonetic preconditions for historical sound change - evidence from the dialects (Full text), and Fonetiken i dialektforskningens tjänst - några exempel [Phonetics in the service of dialectology - a few examples] (Swedish abstract). The purpose of the project is to record and analyze comparable speech samples from more than 100 Swedish dialect areas in Sweden and Finland. Such data are badly needed since, while there is a rich literature from around the last turn of the century, phonetic studies of the Swedish dialects have been very rare during the last few decades. Thus, our knowledge of the present state of the Swedish dialects is mostly anecdotal and quite fragmentary.

This relatively large-scale project (6 years, 3 universities) is designed to serve several theoretical as well as practical purposes. The theoretical objectives include the definition of criteria on which phonetic and phonological typologies can be based, the analysis of the geographical distribution of dialect features with respect to traditional dialectological problems (e.g., are transitions between dialects abrupt or continuous?), and sociolinguistically oriented problems such as levelling tendencies in different age groups and sexes.

The practical objective of the project concerns the database itself and its potential use in education and future research. Thus, the project aims at offering linguistics a solid and well-documented source of information on the phonetic and phonological features of the Swedish dialects spoken around the upcoming turn of the century. The database material will thus serve as a reference for future studies of the phonetic and phonological development of the Swedish dialects. In particular, further analyses of the data will be able to complement sociolinguistic investigations, provide a basis for continued research in dialect geography and historical linguistics, serve as a key for the evaluation of phonological theories, and contribute to characterizing the Swedish dialects from a universal-typological perspective. An extended knowledge of the phonetic and phonological characteristics of the Swedish dialects will also be useful in many applied fields such as automatic simulation of dialectal pronunciation, dialect-independent automatic speech recognition, and forensic phonetics. Finally, the project will help build up a strong national competence in dialectally and typologically oriented phonetics and phonology.

 

References

Bannert, R. and Bredvad-Jensen, A.-C. 1975. Temporal organization of Swedish tonal accents: the effect of vowel duration. Working Papers, Lund University, Department of Linguistics, 10, 1-36.

Barry, M.C. 1984. Connected speech: processes, motivations, models. Cambridge Papers in Phonetics and Experimental Linguistics 3, 1-16.

Beach, D.M. 1938. The phonetics of the Hottentot language. Cambridge: W. Heffer & Sons Ltd.

Brennan, E. and Brennan, J. 1981. Measurements of accent and attitude toward Mexican American Speech. Journal of Psycholinguistic Research 10, 487-501.

Bruce, G. 1970. Diphthongization in the Malmö dialect. Working Papers, Lund University, Department of Linguistics 3, 1-19.

Bruce, G. 1977. Swedish word accents in sentence perspective. Travaux de l'institut de linguistique de Lund 12. Lund: Gleerup.

Chomsky, N. and Halle, M. 1968. The sound pattern of English. New York: Harper & Row.

Dalby, J.M. 1984. Phonetic structure of fast speech in American English. Bloomington, Indiana: Indiana University Linguistics Club.

Dauer, R.M. 1983. Stress-timing and syllable-timing reanalyzed. Journal of Phonetics 11, 51-62.

Doke, C.M. 1931. A comparative study in Shona phonetics. Johannesburg: University of the Witwatersrand Press.

Dressler, W. 1972. Allegroregeln rechtfertigen Lentoregeln. Sekundäre Phoneme des Bretonischen. Innsbrucker Beiträge zur Sprachwissenschaft 9.

Eilers, R.E., Oller, D.K. and Benito-Garcia, C.R. 1984. The acquisition of voicing contrasts in Spanish and English learning infants and children: a longitudinal study. Journal of Child Language 11, 313-336.

Elert, C.-C. 1964. Phonologic studies of quantity in Swedish. Uppsala: Almqvist & Wiksell.

Fant, G. 1973. Speech sounds and features. Cambridge: The MIT Press.

Gamkrelidze, T.V. 1978. On the correlation of stops and fricatives in a phonological system. In J.H. Greenberg (ed.), Universals of human language, Vol. 2: Phonology, pp. 9-46. Stanford: Stanford University Press.

Gay, T. 1977. Articulatory movements in VCV sequences. Journal of the Acoustical Society of America 62, 183-193.

Gay, T. 1978. Articulatory units: segments or syllables? In A. Bell and J.B. Hooper (eds.), Syllables and segments, pp. 121-131. Amsterdam: North-Holland Press.

Gay, T. 1979. Coarticulation in some consonant-vowel and consonant cluster-vowel syllables. In B. Lindblom and S. Öhman (eds.), Frontiers of speech communication research, pp. 69-76. London: Academic Press.

Gårding, E. and Lindblad, P. 1973. Constancy and variation in Swedish word accent patterns. Working Papers, Lund University, Department of Linguistics 7.

Hadding-Koch, K. and Abramson, A. 1964. Duration versus spectrum in Swedish vowels: some perceptual experiments. Studia Linguistica 18, 94-107.

Henke, W.L. 1966. Dynamic articulatory model of speech production using computer simulation. PhD Diss., MIT, Cambridge, Massachusetts.

Henke, W.L. 1967. Preliminaries to speech synthesis based on an articulatory model. IEEE Boston Speech Conference, Boston, 170-177.

Hombert, J.-M., Ohala, J. and Ewan, W. 1979. Phonetic explanations for the development of tones. Language 55, 37-58.

Jakobson, R., Fant, G. and Halle, M. 1969 (9th printing). Preliminaries to speech analysis. Cambridge, Massachusetts: The MIT Press.

Jones, D. 1964 (9th ed.). An outline of English phonetics. Cambridge: Heffer.

Kessen, W., Levine, J. and Wendrich, K.A. 1979. The imitation of pitch in infants. Infant Behavior and Development 2, 93-99.

Kuhl, P.K. and Miller, J.D. 1982. Discrimination of auditory target dimensions in the presence or absence of variation in a second dimension by infants. Perception and Psychophysics 31, 279-292.

Kuhl, P.K. and Meltzoff, A.N. 1988. Speech is an intermodal object of perception. In A. Yonas (ed.), Perceptual development in infancy. Hillsdale, New Jersey: Erlbaum.

Labov, W. 1986. Sources of inherent variation in the speech process. In J.S. Perkell and D.H. Klatt (eds.), Invariance and variability in speech processes. Hillsdale, NJ: Erlbaum, 402-425.

Ladefoged, P. 1964. A phonetic study of West African languages: An auditory-instrumental survey. West African Language Monograph, Series I, edited by J.H. Greenberg and J. Spencer. Cambridge: Cambridge University Press.

Ladefoged, P. and Traill, A. 1994. Clicks and their accompaniments. Journal of Phonetics 22, 33-64.

Lehiste, I. 1970. Suprasegmentals. Cambridge, Massachusetts: The MIT Press.

Lehtonen, J. 1970. Aspects of quantity in Standard Finnish. Studia Philologica Jyväskyläensia VI. Jyväskylä: Gummérus.

Li, C.N. and Thompson, S.A. 1977. The acquisition of tone in Mandarin-speaking children. Journal of Child Language 4, 185-199.

Liberman, A.M., Cooper, F.S., Shankweiler, D.P. and Studdert-Kennedy, M. 1967. Perception of the speech code. Psychological Review 74, 431-461.

Lindblad, P. 1980. Svenskans sje- och tje-ljud i ett allmänfonetiskt perspektiv. [With a summary in English: Some Swedish sibilants.] Lund: Gleerup.

Lindblom, B. 1963. Spectrographic study of vowel reduction. Journal of the Acoustical Society of America 35, 1773-1781.

Lindblom, B. 1990a. Explaining phonetic variation: a sketch of the H and H theory. In W. Hardcastle and A. Marchal (eds.), Speech production and speech modelling, pp. 403-439. Dordrecht: Kluwer Academic Publishers.

Lindblom, B. 1990b. On the notion of 'possible speech sounds'. Journal of Phonetics 18, 135-152.

Lindblom, B. and Maddieson, I. 1988. Phonetic universals in consonant systems. In L.M. Hyman and C.N. Li (eds.), Language, speech and mind. Studies in honour of Victoria A. Fromkin, pp. 62-78. London: Routledge.

Lindblom, B., Brownlee, S., Davis, B. and Moon, S.-J. 1992. Speech transforms. Speech Communication 11, 357-368.

Lisker, L. and Abramson, A.S. 1964. A cross-language study of voicing in initial stops: acoustical measurements. Word 20, 348-422.

Lubker, J. and Gay, T. 1982. Anticipatory labial coarticulation: experimental, biological and linguistic variables. Journal of the Acoustical Society of America 71, 437-448.

McCasland, G.P. 1979. Noise intensity and spectrum cues of spoken fricatives. In Wolf and Klatt (eds.): Speech communication papers presented at the 97th Meeting of the Acoustical Society of America, pp. 303-306. New York: Acoustical Society of America.

Maddieson, I. 1984. Patterns of sounds. Cambridge: Cambridge University Press.

Maddieson, I. and Precoda, K. 1989. Updating UPSID. Journal of the Acoustical Society of America, Suppl. 1, Vol. 86, S19.

Malmberg, B. 1966. Studier över den svenska ordaccenten. Nyare fonetiska rön och andra uppsatser i allmän och svensk fonetik. [Studies of the Swedish word accent.] Lund: Gleerup.

McAllister, R. 1978. Temporal asymmetry in labial coarticulation. Papers from the Institute of Linguistics, University of Stockholm (PILUS) 35, 1-29.

Molis, M.R. 1994. A reexamination of coarticulation in VCV utterances. Ms., Department of Linguistics, University of Texas at Austin.

Ohala, J.J. 1983. The origin of sound patterns in vocal tract constraints. In P.F. MacNeilage (ed.), The production of speech, pp. 189-216. New York: Springer-Verlag.

Ohala, J.J. 1992. What is the input to the speech production mechanism? Speech Communication 11, 369-378.

Ohala, J.J. 1993. The phonetics of sound change. In C. Jones (ed.), Historical Linguistics: Problems and perspectives, pp. 237-278. London: Longman.

Öhman, S. 1966. Coarticulation in VCV utterances: spectrographic measurements. Journal of the Acoustical Society of America 39, 151-168.

Öhman, S. 1967. Numerical model of coarticulation. Journal of the Acoustical Society of America 41, 310-320.

Perkell, J.S. 1986. Coarticulation strategies: preliminary implications of a detailed analysis of lower lip protrusion movements. Speech Communication 5, 47-68.

Pike, K.L. 1947. The intonation of American English. University of Michigan Publications. Linguistics, Vol. I. Ann Arbor: University of Michigan Press.

Sherman, D. 1975. Stop and fricative systems: a discussion of paradigmatic gaps and the question of language sampling. Stanford University Phonology Archiving Project, Working Papers on Language Universals 17, 1-31.

Sovijärvi, A. 1938. Die gehaltenen, geflüsterten und gesungenen Vokale und Nasale der finnischen Sprache. Physiologisch-physikalische Lautanalysen, Helsinki.

Sovijärvi, A. 1956. Über die phonetischen Hauptzüge der finnischen und der ungarischen Hochsprache. Wiesbaden: Ural-Altaische Bibliothek, II.

Stevens, K.N. 1971. Airflow and turbulence noise for fricative and stop consonants: static considerations. Journal of the Acoustical Society of America 50, 1180-1192.

Stevens, K.N. and Blumstein, S.E. 1978. Invariant cues for place of articulation in stop consonants. Journal of the Acoustical Society of America 64, 1358-1368.

Stevens, K.N. and Keyser, S.J. 1989. Primary features and their enhancement in consonants. Language 65, 81-106.

Svantesson, J.-O. 1983. Kammu phonology and morphology. Travaux de l’Institut de linguistique de Lund XVIII. Lund: Gleerup.

Thomas, C.K. 1958. Phonetics of American English. New York: Ronald Press (second edition).

Traill, A. 1985. Phonetic and phonological studies of !Xóo Bushman. Hamburg: Buske.

Tse, J.K.P. 1978. Tone acquisition in Cantonese: a longitudinal case study. Journal of Child Language 5, 191-204.

Vihman, M.M. 1993. Variable paths to early word production. Journal of Phonetics 21, 61-82.

Vihman, M.M. 1996. Phonological development. The origins of language in the child. Cambridge, Massachusetts: Blackwell.

Wiik, K. 1965. Finnish and English vowels. Turku: Annales Universitatis Turkuensis, Series B, Tom 94.

Willerman, R. 1994. The phonetics of pronouns: articulatory bases of markedness. PhD dissertation, University of Texas, Austin.