Dom Massaro comments on the debate regarding the basic unit of speech perception.
The following was originally posted to Talking Brains.
Given that there have been some interesting debates here on Talking Brains regarding the basic unit of speech perception, I asked Dom Massaro, a prominent and long-time player in this debate, to put together a comment on the topic for publication here. He graciously agreed to do this for us and here it is. Thanks Dom!
Here are some reminiscences on how I was led to propose the syllable as the perceptual unit in speech perception. I have relied mostly on my published writings rather than on undocumented memory.
During my graduate studies in mathematical and experimental psychology and also during my postdoctoral position, I developed an information-processing approach to the study of behavior (see Massaro & Cowan, 1993, for this brand of information processing). Two important implications arose from this approach: 1) the proximal influences on behavior and 2) the time course of processing are central to a complete description of behavior (as opposed to simple environment-behavior relationships). My early studies involved a delineation of perception and memory processes in the processing of speech and music. The research led to a theory of perception and memory processes that revealed the properties of pre-perceptual and perceptual memory stores and rules for interference of information in these stores and theories of forgetting (Massaro, 1970).
Initiating my career as a faculty member, I looked to apply this information-processing approach to a more substantive domain of behavior. I held a graduate seminar for three years with the purpose of applying the approach to language processing. We learned that previous work in this area had failed to address the issues described above, and our theoretical framework and empirical reviews anticipated much of the research in psycholinguistics since that time, in which the focus is on real-time, on-line processing (see our book Understanding Language: An Information Processing Analysis of Speech Perception, Reading and Psycholinguistics, 1975).
My own research interests also expanded to include the study of reading and speech perception. Previous research had manipulated only a single variable in these fields, and our empirical work manipulated multiple sources of both bottom-up and top-down information. Gregg Oden and I collaborated to formulate a fuzzy logical model of perception (Oden & Massaro, 1978; Movellan & McClelland, 2001), which has served as a framework for my research to this day. Inherent to the model were prototypes in memory and, therefore, it was important to take a stance on perceptual units in speech and print. By this time, my research and research by others indicated the syllable and the letter as units in speech and print, respectively. Here is the logic I used.
Speech perception can be described as a pattern-recognition problem. Given some speech input, the perceiver must determine which message best describes the input. An auditory stimulus is transformed by the auditory receptor system and sets up a neurological code in a pre-perceptual auditory storage. Based on my backward masking experiments and other experimental paradigms, this storage holds the information in a pre-perceptual form for roughly 250 ms, during which time the recognition process must take place. The recognition process transforms the pre-perceptual image into a synthesized percept. One issue given this framework is, what are the patterns that are functional in the recognition of speech? These sound patterns are referred to as perceptual units.
One reasonable assumption is that every perceptual unit in speech has a representation in long-term memory, which is called a prototype. The prototype contains a list of acoustic features that define the properties of the sound pattern as they would be represented in pre-perceptual auditory storage. As each sound pattern is presented, its corresponding acoustic features are held in pre-perceptual auditory storage. The recognition process operates to find the prototype in long-term memory which best describes the acoustic features in pre-perceptual auditory storage. The outcome of the recognition process is the transformation of the pre-perceptual auditory image of the sound stimulus into a synthesized percept held in synthesized auditory memory.
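The prototype-matching idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature names, the prototype values, and the function names are hypothetical, and the distance-based match score is a stand-in chosen for simplicity. The multiplicative combination of feature matches and the relative-goodness decision follow the general form of the fuzzy logical model of perception (Oden & Massaro, 1978), not any fitted model.

```python
from math import prod

# Hypothetical prototypes in long-term memory: each perceptual unit lists the
# ideal value (0..1) of each acoustic feature. Names and values are made up.
PROTOTYPES = {
    "ba": {"feature_a": 0.9, "feature_b": 0.8},
    "da": {"feature_a": 0.2, "feature_b": 0.1},
    "ga": {"feature_a": 0.1, "feature_b": 0.9},
}


def feature_match(observed, ideal):
    """Fuzzy degree (0..1) to which an observed value matches the ideal value.

    Assumption: a simple 1 - |difference| score, used only for illustration.
    """
    return 1.0 - abs(observed - ideal)


def recognize(observed_features, prototypes=PROTOTYPES):
    """Return the best-matching perceptual unit and the response probabilities."""
    # Evaluate every feature against every prototype, then integrate the
    # feature matches for each prototype by multiplication.
    goodness = {
        unit: prod(
            feature_match(observed_features[name], ideal)
            for name, ideal in proto.items()
        )
        for unit, proto in prototypes.items()
    }
    total = sum(goodness.values())
    if total == 0:
        # No prototype matches at all: fall back to equal probabilities.
        probs = {unit: 1.0 / len(goodness) for unit in goodness}
    else:
        # Relative goodness: each prototype's support divided by the total.
        probs = {unit: g / total for unit, g in goodness.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Features currently held in pre-perceptual auditory storage (hypothetical).
best, probs = recognize({"feature_a": 0.8, "feature_b": 0.7})
print(best, probs)
```

The stored feature values are compared against every prototype in parallel, and the unit whose prototype gives the best overall description becomes the synthesized percept.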
According to this model, pre-perceptual auditory storage can hold only one sound pattern at a time for a short temporal period. Backward recognition masking studies have shown that a second sound pattern can interfere with the recognition of an earlier pattern if the second is presented before the first is recognized. Each perceptual unit in speech must occur within the temporal span of pre-perceptual auditory storage and must be recognized before the following one occurs for accurate speech processing to take place. Therefore, the sequence of perceptual units in speech must be recognized one after the other in a successive and linear fashion. Finally, each perceptual unit must have a relatively invariant acoustic signal so that it can be recognized reliably. If the sound pattern corresponding to a perceptual unit changes significantly within different speech contexts, recognition could not be reliable, since one set of acoustic features would not be sufficient to characterize that perceptual unit. Perceptual units in speech as small as the phoneme or as large as the phrase have been proposed.
The phoneme was certainly a favorite to win the pageant for speech’s perceptual unit. Linguists had devoted their lives to phonemes, and phonemes gained particular prominence when they could be distinguished from one another by distinctive features. Trubetzkoy, Jakobson, and other members of the "Prague school" proposed that phonemes in a language could be distinguished by distinctive features. For example, Jakobson, Fant, and Halle (1961) proposed that a small set of orthogonal, binary properties or features was sufficient to distinguish among the larger set of phonemes of a language. Jakobson et al. were able to classify 28 English phonemes on the basis of only nine distinctive features. While originally intended only to capture linguistic generalities, distinctive feature analysis had been widely adopted as a framework for human speech perception. The attraction of this framework is that, since these features are sufficient to distinguish among the different phonemes, phoneme identification could be reduced to the problem of determining which features are present in any given phoneme. This approach gained credibility with the finding, originally by Miller and Nicely (1955) and since by many others, that the more distinctive features two sounds share, the more likely they are to be perceptually confused with one another. Thus, the first candidate we considered for the perceptual unit was the phoneme.
Consider the acoustic properties of vowel phonemes. Unlike some consonant phonemes, whose acoustic properties change over time, the wave shape of the vowel is considered to be steady-state or tone-like. The wave shape of the vowel repeats itself anywhere from 75 to 200 times per second. In normal speech, vowels last between 100 and 300 ms, and during this time the vowels maintain a fairly regular and unique pattern. It follows that, by our criteria, vowels could function as perceptual units in speech.
Next let us consider consonant phonemes. Consonant sounds are more complicated than vowels and some of them do not seem to qualify as perceptual units. We have noted that a perceptual unit must have a relatively invariant sound pattern in different contexts. However, some consonant phonemes appear to have different sound patterns in different speech contexts. For example, the stop consonant phoneme /d/ has different acoustic representations in different vowel contexts. Since the steady-state portion corresponds to the vowel sounds, the first part, called the transition, must be responsible for the perception of the consonant /d/. The acoustic pattern corresponding to the /d/ sound differs significantly in the syllables /di/ and /du/. Hence, one set of acoustic features would not be sufficient to recognize the consonant /d/ in the different vowel contexts. Therefore, we must either modify our definition of a perceptual unit or eliminate the stop consonant phoneme as a candidate.
There is another reason why the consonant phoneme /d/ cannot qualify as a perceptual unit. In the model, perceptual units are recognized in a successive and linear fashion. Research has shown, however, that the consonant /d/ cannot be recognized before the vowel is also recognized. If the consonant were recognized before the vowel, then we should be able to decrease the duration of the vowel portion of the syllable so that only the consonant would be recognized. In the experiment, the duration of the vowel in the consonant-vowel (CV) syllable is gradually decreased and the subject is asked to report when she hears the stop consonant sound alone. The CV syllable is perceived as a complete syllable until the vowel is eliminated almost entirely (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). At that point, however, instead of the perception changing to the consonant /d/, a nonspeech whistle is heard. Liberman et al. showed that the stop consonant /d/ cannot be perceived independently of perceiving a CV syllable. Therefore, it seems unlikely that the /d/ sound would be perceived before the vowel sound; it appears, rather, that the CV syllable is perceived as an indivisible whole or gestalt.
These arguments led to the idea that syllables function as perceptual units rather than containing two perceptual units each. One way to test this hypothesis is to employ the CV syllables in a recognition-masking task. Liberman et al. found that subjects could identify shortened versions of the CV syllables when most of the vowel portion was eliminated. Analogous to our interpretation of vowel perception, recognition of these shortened CV syllables also should take time. Therefore, a second syllable, if it follows the first soon enough, should interfere with perception of the first. Consider the three CV syllables /ba/, /da/, and /ga/ (/a/ pronounced as in father), which differ from each other only with respect to the consonant phoneme. Backward recognition masking, if found with these sounds, would demonstrate that the consonant sound is not recognized before the vowel occurs and also that the CV syllable requires time to be perceived.
There have been several experiments on the backward recognition masking of CV syllables (Massaro, 1974, 1975; Pisoni, 1972). Newman and Spitzer (1987) employed the three CV syllables /ba/, /da/, and /ga/ as test items in the backward recognition masking task. These items were synthetic speech stimuli that lasted 40 ms; the first 20 ms of the item consisted of the CV transition and the last 20 ms corresponded to the steady-state vowel. The masking stimulus was the steady-state vowel /a/ presented for 40 ms. In one condition, the test and masking stimuli were presented to opposite ears, that is, dichotically. All other procedural details followed the prototypical recognition-masking experiment.
The percentage of correct recognitions for 8 observers improved dramatically with increases in the silent interval between the test and masking CVs. These results show that recognition of the consonant is not complete at the end of the CV transition, nor even at the end of the short vowel presentation. Rather, correct identification of the CV syllable requires perceptual processing after the stimulus presentation. These results support our hypothesis that the CV syllable must have functioned as a perceptual unit, because the syllable must have been stored in pre-perceptual auditory storage, and recognition involved a transformation of this pre-perceptual storage into a synthesized percept of a CV unit. The acoustic features necessary for recognition must, therefore, define the complete CV unit. An analogous argument can be made for VC syllables also functioning as perceptual units (Massaro, 1974).
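The improvement in recognition with longer silent intervals can be pictured with a simple growth curve. The sketch below is an assumption made for illustration, not the analysis or the data from these experiments: it supposes that accuracy rises from chance (one in three, given the three alternatives /ba/, /da/, and /ga/) toward an asymptote as an exponential function of the processing time available before the mask arrives. The function name, the asymptote, and the rate are hypothetical.

```python
import math


def percent_correct(interval_ms, chance=1 / 3, asymptote=0.95, rate_per_ms=0.015):
    """Hypothetical percent correct after `interval_ms` of silence before the mask.

    Assumptions (not taken from the experiments): accuracy starts at chance
    when the mask follows immediately and approaches `asymptote` exponentially.
    """
    # Fraction of the possible improvement achieved in the time available.
    gain = 1.0 - math.exp(-rate_per_ms * interval_ms)
    return 100.0 * (chance + (asymptote - chance) * gain)


# Accuracy grows quickly at first and levels off as the interval lengthens.
for interval in (0, 50, 100, 250):
    print(interval, round(percent_correct(interval), 1))
```

Under these assumptions most of the improvement occurs within the first few hundred milliseconds, which is the qualitative pattern the masking results describe.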
We must also ask whether perceptual units could be larger than vowels, CV, or VC syllables. Miller (1962) argued that the phrase of two or three words might function as a perceptual unit. According to our criteria for a perceptual unit, it must correspond to a prototype in long-term memory which has a list of features describing the acoustic features in the pre-perceptual auditory image of that perceptual unit. Accordingly, pre-perceptual auditory storage must last on the order of one or two seconds to hold perceptual units of the size of a phrase. But the recognition-masking studies usually estimate the effective duration of pre-perceptual storage to be about 250 ms. Therefore, perceptual units must occur within this period, eliminating the phrase as the perceptual unit.
The recognition-masking paradigm developed to study the recognition of auditory sounds has provided a useful tool for determining the perceptual units in speech. If pre-perceptual auditory storage is limited to 250 ms, the perceptual units must occur within this short period. This time period agrees nicely with the durations of syllables in normal speech.
The results of the present experiments demonstrate backward masking in a two-interval forced-choice task, a same-different task, and an absolute identification task. The backward masking of one sound by a second sound is interpreted in terms of auditory perception continuing after a short sound is complete. A representation of the short sound is held in a pre-perceptual auditory storage so that resolution of the sound can continue to occur after the stimulus is complete. A second sound interferes with the storage of the earlier sound and so disrupts its further resolution. The current research contributes to the development of a general information processing model (Massaro, 1972, 1975).
To solve the invariance problem between acoustic signal and phoneme, while simultaneously adhering to a pre-perceptual auditory memory constraint of roughly 250 ms, Massaro (1972) proposed the syllables V, CV, or VC as the perceptual unit, where V is a vowel and C is a consonant or consonant cluster. This assumption was built into the foundation of the FLMP (Oden & Massaro, 1978). It should be noted that CVC syllables would actually be two perceptual units, the CV and VC portions, rather than just one. Assuming that this larger segment is the perceptual unit reinstates a significant amount of invariance between signal and percept. Massaro and Oden (1980, pp. 133-135) reviewed evidence that the major coarticulatory influences on perception occur within these syllables, rather than between syllables. Any remaining lack of invariance across these syllables could conceivably be disambiguated by additional sources of information in the speech stream.
Massaro, D.W. (1970). Perceptual Processes and Forgetting in Memory Tasks. Psychological Review, 77(6), 557-567.
Massaro, D.W. (1972). Preperceptual Images, Processing Time, and Perceptual Units in Auditory Perception. Psychological Review, 79(2), 124-145.
Massaro, D.W. (1974). Perceptual Units in Speech Recognition. Journal of Experimental Psychology, 102(2), 349-353.
Massaro, D.W. (1975). Understanding Language: An Information Processing Analysis of Speech Perception, Reading and Psycholinguistics. New York: Academic Press.
Massaro, D.W. & Cowan, N. (1993). Information Processing Models: Microscopes of the Mind. Annual Review of Psychology, 44, 383-425.
Massaro, D.W. & Oden, G.C. (1980). Speech Perception: A Framework for Research and Theory. In N.J. Lass (Ed.), Speech and Language: Advances in Basic Research and Practice, Vol. 3 (pp. 129-165). New York: Academic Press.
Movellan, J., and McClelland, J. L. (2001). The Morton-Massaro Law of Information Integration: Implications for Models of Perception. Psychological Review,