Sound waves form the physical link between speaker and hearer. Central to the field of acoustic phonetics are the concepts and techniques of acoustic physics; but acoustic phonetic research also integrates knowledge about how speech signals are produced by a speaker, how they are perceived by a hearer, and how they are structured by the phonology of languages. From the linguist's point of view, acoustic phonetics provides quantitative information on the realization of the sound system of a language, supplementing the data available from auditory phonetics.
Acoustic phonetics is a relative newcomer to the discipline of phonetics. Developments in the 19th century in the field of acoustics laid its theoretical foundations; but it was given its real impetus in the 20th century, by techniques for recording sound and analyzing it electronically. The availability of computers for digital processing of signals gave it further momentum. Acoustic phonetics has become arguably the most successful branch of phonetics. Its primary data are easy to obtain (unlike, e.g., data on muscle activity in speech production); and advances in acoustic phonetics are often stimulated by the prospect of practical applications in such areas as telecommunications and human/computer interaction through speech.
This article will offer an introduction to some background concepts in acoustic phonetics, a summary of the acoustic properties of major classes of speech sounds, and a review of some of the roles of acoustic phonetics. For more extensive introductions, see Ladefoged 1996 and Fry 1979. The seminal work in modern acoustic phonetics is Fant 1960, and wide-ranging, authoritative coverage of the field is to be found in Stevens 1998.
1. Background concepts
Sound consists of vibrations to which the ear is sensitive. Usually the ear is responding to tiny, fast oscillations of air molecules, which originate at a sound source. A tuning fork provides a familiar illustration: as an arm of the fork swings in one direction, it shunts adjacent molecules nearer to subsequent ones, causing a brief local increase in air pressure. The shunting effect spreads outward through further molecules; a wave of high pressure is radiating from the arm. Meanwhile, the arm swings back to, and overshoots, its rest position—as do the adjacent molecules. Their momentary increased separation from subsequent molecules means a reduction in pressure (“rarefaction”). A wave of low pressure now radiates in the wake of the wave of high pressure. As the oscillation of the fork continues, a regular succession of pressure highs and lows spreads outward.
At a point in the path of the radiating pressure changes, we could plot pressure against time as they sweep by. Figure 1 does this for a tone produced by an “ideal” tuning fork. Notice three properties of this waveform. First, the peaks and troughs have a particular amplitude; increasing the amplitude would cause the tone to sound louder. Second, the cycle of pressure variation takes a given time to repeat itself; this is the period of the wave. The shorter the period, the more repetitions or cycles of the wave there will be per second: i.e., its frequency will be higher. The wave shown has a period of 0.01 s (second); its frequency is 1/0.01 = 100 Hz (Hertz, meaning cycles per second). Increasing the tone's frequency would cause it to sound higher in pitch. Third, the particular wave in Figure 1 shows a simple pattern of rising and falling pressure, called a sine wave.
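The relations among amplitude, period, and frequency can be made concrete in a short computation. The following is a minimal sketch in Python; the sampling rate of 8000 samples per second is an illustrative assumption, not something fixed by the figure.

```python
import math

# One cycle of the 100 Hz tone of Figure 1, sampled at an assumed
# rate of 8000 samples per second.
f0 = 100.0              # frequency in Hz (cycles per second)
period = 1.0 / f0       # 0.01 s, the time one cycle takes to repeat
sr = 8000               # sampling rate (an illustrative assumption)
amplitude = 1.0         # peak pressure deviation, arbitrary units

# A sine wave: pressure as a function of time t = n / sr.
one_cycle = [amplitude * math.sin(2 * math.pi * f0 * n / sr)
             for n in range(round(sr * period))]

# Halving the period would double the frequency and raise the pitch;
# scaling `amplitude` would change loudness without changing pitch.
```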
The wave at the top of Figure 2 is more complex. Crucially, however, a complex wave can be analyzed as being made up of a number of sine waves of different frequencies—for this wave, three, as shown below it. If we add together the amplitudes of the component sine waves, or harmonics, at each point in time, we can recreate the complex wave. Such analysis of a complex wave is known as Fourier analysis.
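The two directions of Fourier analysis, building a complex wave by summing sine components and recovering those components from the complex wave, can be sketched directly. The harmonic frequencies below are those given for Figure 2; the amplitudes are illustrative assumptions, as is the sampling rate.

```python
import cmath
import math

# Rebuild a complex wave from sine components at 100, 300, and 400 Hz,
# then recover their amplitudes by a discrete Fourier transform.
sr = 8000                                      # sampling rate, assumed
N = 80                                         # one 0.01 s fundamental period
components = {100: 1.0, 300: 0.5, 400: 0.25}   # frequency -> amplitude

wave = [sum(a * math.sin(2 * math.pi * f * n / sr)
            for f, a in components.items())
        for n in range(N)]

def dft_amplitude(x, k):
    """Amplitude of harmonic k via an N-point discrete Fourier
    transform; bin k corresponds to frequency k * sr / N."""
    N = len(x)
    X = sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
    return 2 * abs(X) / N      # scaled so a unit-amplitude sine gives 1.0

# With sr / N = 100 Hz per bin, the harmonics land in bins 1, 3, and 4.
for f, a in components.items():
    print(f, round(dft_amplitude(wave, f * N // sr), 3))
```

The analysis recovers each component's amplitude, and shows no energy at absent multiples of the fundamental such as 200 Hz.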
The graph at the right in Figure 2 is another way of showing the essential information from this type of analysis. It plots each harmonic at its frequency by a line which shows the harmonic's relative amplitude. This kind of representation, called an amplitude spectrum, is of central importance in acoustic phonetics, as is the information it contains about the distribution of acoustic energy at different frequencies.
The harmonics in Figure 2 are at 100, 300, and 400 Hz. The highest common factor of these values is 100 Hz; this is the frequency of repetition of the complex wave, which is called its fundamental frequency, and which determines our perception of the pitch of this sound.
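The relation between the harmonic frequencies and the fundamental can be checked directly, as in this small Python sketch:

```python
import math
from functools import reduce

# The fundamental frequency of a periodic wave is the highest common
# factor of its harmonic frequencies; the period is its reciprocal.
harmonics = [100, 300, 400]            # Hz, as in Figure 2
f0 = reduce(math.gcd, harmonics)       # 100 Hz
period = 1 / f0                        # 0.01 s

# Every harmonic is a whole-number multiple of the fundamental,
# though not every multiple need be present (200 Hz is absent here).
assert all(f % f0 == 0 for f in harmonics)
```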
A repetitive or periodic wave will thus have energy in discrete harmonics at some (but not necessarily all) of the frequencies which are whole-number multiples of its fundamental frequency. In many naturally occurring waves, however, it is not possible to discern a repeating cycle. In such aperiodic waves, there is no fundamental frequency; energy is present throughout the frequency range, rather than being banded into discrete harmonics.
2. Acoustics of speech
Figure 3 shows brief extracts from the waveforms of two rather different sounds taken from the word speech: the vowel [i] above, and the consonant [s] below. The vowel's waveform is roughly periodic, though more complex than that in Figure 2. The second waveform, from [s], consists of aperiodic noise. In general, voiced speech sounds (for which the vocal cords are vibrating) will have periodic waveforms. The rate of repetition of the wave, i.e. its fundamental frequency, directly reflects the rate of vibration of the vocal cords, and it cues the hearer's perception of pitch.
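Since the fundamental frequency is the rate at which the waveform repeats, it can be estimated by finding the time lag at which the signal correlates best with a shifted copy of itself. The following is a minimal sketch of that idea; the sampling rate and the lag search range are illustrative assumptions, and real pitch trackers are considerably more robust.

```python
import math

def estimate_f0(x, sr, lo=50, hi=500):
    """Estimate fundamental frequency (Hz) by autocorrelation: a
    periodic wave correlates strongly with itself shifted by one
    period. Lags searched correspond to candidates of lo..hi Hz."""
    def corr(lag):
        return sum(x[n] * x[n + lag] for n in range(len(x) - lag))
    best_lag = max(range(sr // hi, sr // lo + 1), key=corr)
    return sr / best_lag

# A 120 Hz sine, standing in for a voiced vowel waveform:
sr = 8000
voiced = [math.sin(2 * math.pi * 120 * n / sr) for n in range(1600)]
print(round(estimate_f0(voiced, sr)))   # close to 120
```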
The schematic amplitude spectrum of the magnified fragment of [i] shows that it is rich in harmonics. Their amplitude is greater at certain frequencies, and this “shaping” of the spectrum determines the perception of a particular sound. The spectrum for the aperiodic [s] shows energy that is not banded into harmonics, but is present continuously over a range of frequencies. Again, the shape of the spectrum characterizes the sound.
Figure 4 illustrates the production of two different vowels, [ɑ] as in palm and [i] as in heed. At the bottom of the figure is the spectrum of the waveform produced by the vibrating larynx. This laryngeal source wave would sound like a buzz, if we could isolate it from the vocal tract. Its spectrum is rich in harmonics, whose amplitude gradually decreases with increasing frequency.
Above that can be seen alternative vocal tract shapes: on the left, that for [ɑ], and on the right that for [i]. The vocal tract is, in effect, a tube; and like any tube, e.g. a wind instrument, it has a number of resonant frequencies, at which the air in the tube is especially liable to vibrate in sympathy with another sound. The spectrum next to each vocal tract in Figure 4 shows how well the tract resonates at any frequency: in each case, three resonance peaks, or formants, can be seen. In practice, further formants exist at higher frequencies; but the first three are most important in determining vowel quality. Note that [ɑ] has a relatively high first formant (F1) and a low second formant (F2); for [i], F1 is low and F2 is high. The frequency of the formants is dependent on the shape of the vocal tract, and hence on the positioning of the tongue and lips.
The vocal tract acts on the laryngeal source as a filter, which enhances some harmonics relative to others. Thus, in the spectrum of the speech waveform as it emerges at the lips, each harmonic of the laryngeal source has an amplitude which is modified according to how near it is in frequency to a formant—as shown schematically in the two spectra at the top of Figure 4. Some details have been omitted for simplicity; however, this general conception of vowel production as the combination of a sound source at the larynx and the spectral shaping function of the vocal tract, known as the source-filter model, has been highly influential in acoustic phonetics.
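The source-filter model can be sketched computationally: a periodic pulse train stands in for the laryngeal buzz, and each formant is modeled as a two-pole resonator through which the source is passed. This is a simplified sketch; the formant frequencies and bandwidths below are illustrative assumptions, not values taken from Figure 4.

```python
import math

sr = 8000     # sampling rate, an assumed value
f0 = 100      # fundamental frequency of the source

# Source: one pulse per glottal cycle, a crude stand-in for the
# harmonic-rich laryngeal buzz.
source = [1.0 if n % (sr // f0) == 0 else 0.0 for n in range(2000)]

def resonator(x, freq, bw):
    """Two-pole filter resonating at `freq` Hz with bandwidth `bw` Hz:
    y[n] = x[n] + a1 * y[n-1] + a2 * y[n-2]."""
    r = math.exp(-math.pi * bw / sr)
    a1 = 2 * r * math.cos(2 * math.pi * freq / sr)
    a2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for s in x:
        y = s + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# Filter: cascade resonators at roughly [ɑ]-like F1 and F2 values.
vowel = resonator(resonator(source, 700, 80), 1100, 90)
```

Changing the resonator frequencies to a low F1 and a high F2 would shift the output toward an [i]-like quality, while the source, and hence the pitch, stays the same: exactly the independence of source and filter that the model expresses.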
Vowels have been discussed so far as though they were characterized by a steady-state vocal tract posture and corresponding spectrum. In fact, speech sounds rarely involve steady states: the vocal tract is in almost constant motion, and hence the spectrum of the speech wave is constantly changing. It would be possible to represent this as a series of spectra, and this is sometimes done; but another kind of display is more common in acoustic phonetics. This is illustrated schematically at the top of Figure 4. Assume that the vocal tract is moving continuously from the [ɑ] configuration to the [i] configuration. This would produce a diphthong sounding something like the word eye. In the new display, time runs from left to right; frequency, as before, is shown on the vertical axis. In effect, there is a third dimension: high-amplitude parts of the spectrum are shown as black. Thus it is possible to trace the changing formant frequencies as movements in the black bands. Note how, at the start and end of the diphthong, the bands coincide with the peaks in the spectra of the individual vowels.
This general kind of display is called a spectrogram. Real, as opposed to schematic, spectrograms have a range of shades of gray which indicate increasing amplitude; they allow much of the detail of the spectrum at a particular point to be inferred. From the 1940s onward, spectrograms and other spectral displays were produced by the sound spectrograph, a machine based on analog electronics; since the 1970s, digital computers have taken over this function.
Figure 5 shows a real spectrogram of the rhyming phrases “a bye,” “a dye,” “a guy.” The movement of the first two formants in each word is similar to that shown in Figure 4; like eye, these words have a diphthong that moves from an open vowel something like [ɑ] toward a close front vowel something like [i]. The consonants appear almost blank on the spectrogram, because little sound radiates from the vocal tract when it is closed. Adjacent to them, the detailed trajectory of the formants, particularly F2 and F3, differs according to the consonant; thus, for the velar [g], these two formants appear rather close together. Such differing formant transitions are vital cues to our perception of consonants. They occur because, as the vocal tract closes for a consonant and as it opens again, its resonances change (as always when a tube changes shape). The way in which they change depends on where, along its length, the tract is closing—i.e. on the place of articulation of the consonant.
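The digital construction of a spectrogram can be sketched as a short-time Fourier analysis: the waveform is sliced into overlapping frames, each frame is windowed, and an amplitude spectrum is computed for each. The frame and hop sizes below are illustrative assumptions.

```python
import cmath
import math

def spectrogram(x, sr, frame=256, hop=128):
    """Short-time Fourier analysis: one amplitude spectrum per frame.
    frames[t][k] is the amplitude at time t * hop / sr seconds and
    frequency k * sr / frame Hz."""
    frames = []
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame]
        # A Hamming window tapers the frame edges, reducing the
        # spectral smearing caused by cutting the wave abruptly.
        win = [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (frame - 1)))
               for n, s in enumerate(seg)]
        spectrum = [abs(sum(win[n] * cmath.exp(-2j * math.pi * k * n / frame)
                            for n in range(frame)))
                    for k in range(frame // 2)]    # bins up to sr / 2
        frames.append(spectrum)
    return frames
```

Plotting the result with time running left to right, frequency on the vertical axis, and amplitude as darkness yields exactly the kind of display described above.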
Part of the acoustic character of a consonant, then, seems to be explicable like that of a vowel: as changes in the resonances of the vocal tract tube. But consonants are considerably more complex acoustically than vowels. Many consonants have a source of acoustic energy other than the vibrating larynx; e.g. in an [s] (see Figure 3), aperiodic noise is produced when air is forced through a narrow gap at the alveolar ridge, and the flow becomes turbulent.
All fricatives involve the production of such noise at some point in the vocal tract. Voiceless fricatives like [s] have only this kind of source of acoustic energy; in voiced fricatives, like [z], noise is superimposed on energy from the vibrating vocal cords. The spectrum of the noise depends on the kind of turbulence which produces it, and on the way in which it is shaped by the resonances of the vocal tract. An [h] will have a formant structure rather like that of a vowel, because the aperiodic noise source (like the periodic voicing of a vowel) is at the end of the vocal tract. An [s], by contrast, will have a spectral shaping quite unlike that of a vowel.
Figure 6 is a spectrogram of “hazy sunshine.” The [h] has a formant pattern somewhat like that of the vowel following it. The [z] shows two cues to the voicing which differentiates it from [s]: low-frequency energy at the bottom of the pattern, and a continuity of the vertical striations (each of which, as in vowels, indicates a cycle of vocal-cord vibration). By contrast, [s] and [z] share a similar high-frequency noise spectrum, because they are both alveolars; the spectrum of the post-alveolar [š] is rather different.
While fricatives result from turbulence in a steady airflow, the release of a stop brings a short burst of aperiodic acoustic energy at the moment when the air pressure, built up behind the closure, is released. The spectral distribution of energy in the burst varies according to the place of articulation of the stop; this supplements the formant transition cues discussed above.
Nasals are like vowels in that energy produced at the larynx is spectrally shaped by the resonance of a tube; here, however, the tube extends from the larynx through the nasal cavities. The acoustic complexity is increased because there is interaction with the resonances of the mouth cavity behind the oral closure. Nasalized vowels, as in French [ɔ̃] on, are similar to oral vowels, except that their spectrum is made more complex by the interaction of the resonances of the nasal cavity.
The acoustic dimensions that underlie suprasegmental properties are well established. Fundamental frequency correlates closely with perceived pitch, intensity with perceived loudness, and the duration of different acoustic events with length—although, in each case, the perceptual attribute may be influenced by other factors. More complex is the way in which all these acoustic dimensions combine to cue linguistic contrasts. For instance, stress or accent is cued by contributions from fundamental frequency, duration, and intensity, but the precise constellation of cues varies according to language and according to the intonational context. Nor is there a sharp dividing line between dimensions supporting segmental and suprasegmental contrasts: for instance, in many languages vowel quality contributes to stress contrasts, with unstressed vowels being mid-centralized. Nonetheless, as in the case of segmental phonetics, quantitative analysis of the relevant dimensions leads to a much fuller understanding of the linguistic contrasts.
This treatment of the acoustics of speech has been far from complete. Not all types of segments have been considered, and only a little has been said about the prosodic or suprasegmental properties of speech. More generally, it has not explored the quantitative mathematical models which underlie the analysis of speech. It is this quantitative acoustic theory of speech which gives acoustic phonetics the power to manipulate and replicate speech signals, and which opens the way to many of the applications discussed below.
3. The roles of acoustic phonetics
As an adjunct to phonology, acoustic phonetics can supplement the information on phonetic realization which is provided by auditory phonetics. The exact nature of a fine auditory distinction often is not clear from skilled listening alone; acoustic analysis can show objectively the contribution of spectral, durational, and other acoustic dimensions to the realization of a phonological contrast. Beyond this, acoustic phonetics can suggest appropriate phonological features for descriptive use. For example, there is little motivation in articulatory terms for the sound change by which the Germanic velar fricative at the end of the word laugh became labio-dental in modern English; but acoustic analysis suggests a similarity in terms of spectral shape. Both fricatives have a weighting of their energy toward the lower end of the spectrum; in terms of phonological features, they share the value [+grave].
Acoustic phonetics has an important role in the branch of cognitive psychology which deals with the perception of speech. In particular, the analyses of acoustic phonetics provide techniques for manipulating real speech signals, and for creating speech signals artificially by speech synthesis. Thus experimental stimuli can be created whose acoustic properties are precisely known, and which can be varied in controlled ways. It is then possible to discover exactly which properties of a sound are crucial for its perception by a hearer. For instance, it can be shown that, in identifying an English stop as voiceless rather than voiced (e.g. as a realization of /p/ rather than /b/), hearers are mainly sensitive to the delay in the onset of voicing after the release of the stop; but they also integrate information such as the strength of the stop burst and the trajectory of F1 at the start of the vowel.
The study of how humans produce speech, too, benefits from a well worked-out acoustic theory of speech. It is now possible to implement mathematically explicit models of how certain aspects of the speech signal are created in the vocal tract. Perhaps the most advanced are analogs of the vocal tract tube as it acts as a filter on the glottal source. Given such a model, it is possible to predict formant values for any variation in the shape of the tube. Working back from the observed acoustic pattern of a speech sound, the articulatory events underlying the sound can be inferred. Quantitative models also exist for the creation of acoustic energy in the vocal tract, either by a purely aerodynamic process (as with fricative energy), or—in the case of the vocal cords—by a process involving complex interactions of aerodynamics and properties of the vocal-cord tissues. To the extent that such models yield realistic acoustic signals, they confirm progress in understanding speech production.
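The simplest such tube model treats the vocal tract as a uniform tube, closed at the glottis and open at the lips; its resonances then fall at odd multiples of c/4L, where c is the speed of sound and L the tube length. A sketch with standard textbook values (17.5 cm tract length and c = 350 m/s, both assumptions):

```python
def tube_formants(length_m, c=350.0, n=3):
    """Resonances of a uniform tube closed at one end and open at
    the other: F_k = (2k - 1) * c / (4 * L), for k = 1..n."""
    return [(2 * k - 1) * c / (4 * length_m) for k in range(1, n + 1)]

# A 17.5 cm tract predicts schwa-like formants; shortening the tube
# (e.g. a child's vocal tract) raises all the formant frequencies.
print([round(f) for f in tube_formants(0.175)])   # [500, 1500, 2500]
```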
Acoustic phonetic knowledge is crucial in speech technology, including automatic speech recognition, and speech output by computers. Much work in this area is motivated by the goal of allowing humans to interact with machines by using natural language. Part of the challenge is to find explicit and computationally tractable ways to represent existing acoustic phonetic knowledge. One particularly successful technique is linear prediction, which in some ways can be seen as an approximation to the source-filter model; it is applicable in speech synthesis and recognition. However, there are areas where existing knowledge is itself incomplete; this is particularly so in connection with the acoustic variation in a sound which occurs because of differences among individuals' vocal tracts, and which causes difficulties in designing systems of speech recognition that can cope with a variety of speakers. Further progress in understanding speaker-to-speaker variation will also contribute to techniques for speaker recognition, to which acoustic phonetics is central.
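Linear prediction models each sample of the speech wave as a weighted sum of the immediately preceding samples; the weights, obtained from the signal's autocorrelation by the Levinson-Durbin recursion, implicitly capture the vocal tract filter and hence the formants. A minimal sketch of the autocorrelation method in pure Python:

```python
def lpc(x, order):
    """Linear prediction coefficients a[0..order-1] such that
    x[n] is approximated by sum(a[j] * x[n - 1 - j])."""
    # Autocorrelation of the signal at lags 0..order.
    R = [sum(x[n] * x[n + k] for n in range(len(x) - k))
         for k in range(order + 1)]
    a, err = [], R[0]
    for i in range(len(R) - 1):
        # Levinson-Durbin recursion: compute this order's reflection
        # coefficient, then update the predictor and residual energy.
        k = (R[i + 1] - sum(a[j] * R[i - j] for j in range(i))) / err
        a = [a[j] - k * a[i - 1 - j] for j in range(i)] + [k]
        err *= 1 - k * k
    return a

# Recover a known second-order recursion from its output signal:
x = [1.0, 0.9]
for n in range(2, 200):
    x.append(0.9 * x[-1] - 0.2 * x[-2])
print([round(c, 3) for c in lpc(x, 2)])   # close to [0.9, -0.2]
```

In speech applications the predictor order is higher (roughly two coefficients per formant within the analyzed bandwidth), and the prediction residual approximates the source, which is one sense in which linear prediction approximates the source-filter model.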
Acoustic phonetics is not an isolated discipline with sharply defined borders. Its central object of study is the acoustic speech signal; but a purely physics-based study of the signal—ignoring how the signal is produced and perceived, and how it is structured linguistically—would contribute relatively little to our understanding of spoken communication. Acoustic phonetics thus proceeds in symbiosis with the study of speech production, speech perception, and linguistics generally.
Fant, Gunnar. 1960. Acoustic theory of speech production. The Hague: Mouton.
Fry, Dennis B. 1979. The physics of speech. Cambridge and New York: Cambridge University Press.
Ladefoged, Peter. 1996. Elements of acoustic phonetics. 2d ed. Chicago: University of Chicago Press.
Stevens, Kenneth N. 1998. Acoustic phonetics. Cambridge, Mass.: MIT Press.