As I See It, March 2007

By Nelson Morgan, Director

The focus in this issue on speech research reminds me of my early days at ICSI, almost two decades ago. The Institute was also just starting out, and I had been asked to help put together a "Realization" group that would design and build massively parallel systems. This sounded like fun, but I also wanted an application area to focus on so that our work would not be an abstract exercise. At about that time Hervé Bourlard came to ICSI as a visitor, and "infected" us with his enthusiasm for speech recognition research. I had worked in speech processing earlier in that decade, and after several years of working with brain scientists to understand a little about the neurophysiological correlates of cognition, I was ready to return to speech, at least for a while.

Hervé had been working at Philips in Brussels, and had realized that one could estimate probabilities with properly trained neural networks. Both his theoretical work and his intuition told him that these probabilities could be used to improve speech recognition if combined with hidden Markov models. This sounded intriguing to me, so I set to work with him on the related experimental research. Our first result on a problem of realistic scale was very motivating: we got 140% error! To the uninitiated this might sound impossible, but it turns out that in speech recognition we count as errors not only the words that are wrong or missing, but also the extra words that are inserted in the output. Anyway, we figured we had nowhere to go but up... (or down, if you're counting errors). And indeed we did. Over the months that followed, we discovered, step by step, what we needed to do for good performance. The next year we were joined by Chuck Wooters, probably the only student ever to receive a Berkeley degree in the interdisciplinary, cross-department topic of "speech recognition". With Chuck's help, and in a collaborative effort with Mike Cohen and Horacio Franco (both then of SRI), we ultimately ended up with a very good system and plenty of new ideas.
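For readers curious how an error rate can exceed 100%: scoring counts substitutions (S), deletions (D), and insertions (I) against the number of reference words (N), so the rate (S + D + I)/N has no upper bound. The little Python sketch below is purely illustrative (it is not the scoring code we actually used); it computes the rate with a standard word-level edit distance.

```python
# Illustrative sketch: word error rate = (S + D + I) / N, where N is the number
# of reference words, so a hypothesis with many inserted words can score > 100%.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,                # substitution (or match)
                          d[i - 1][j] + 1,    # deletion
                          d[i][j - 1] + 1)    # insertion
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution plus four spurious insertions against a 3-word reference:
print(f"{word_error_rate('the cat sat', 'a the bat sat on the mat'):.0%}")  # 167%
```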

Another milestone from those early days came from our collaboration with Hynek Hermansky. At that time Hynek was working for US West (later Qwest). Unlike Hervé, who was primarily concerned with the statistical models, Hynek was sharply focused on the feature extraction process. Previously he had invented a technique called "Perceptual Linear Prediction" or PLP, which is now used by many systems. He had been motivated by the goal of making systems more independent of variation in the signal due to different speakers, but when we started working together he was interested in making recognizers less sensitive to other kinds of variability. Our interest in this topic was piqued by a small workshop we organized (the SPeech recOgnition frOnt eNd workshop, or SPOONS), to which we invited a number of people who had designed innovative models for speech processing, e.g., Les Atlas, Jordan Cohen, Ron Cole, Malcolm Slaney, Dick Lyon, Dirk Pueschel, and Shihab Shamma. The discussions were great (Mike O'Malley, who had done important work in speech synthesis, asked "Why a 10 millisecond step? Because we have 10 fingers?"), but one of the comments particularly struck Hynek and me. Jordan Cohen, who had designed a biologically-inspired speech recognition front end for IBM, asked: "We could play speech through a filter approximating the inverse of a steady-state vowel spectrum (such as 'e') and the speech is still intelligible, including the vowels which turn into a white spectrum signal. Which hearing model can account for that?" Given this 'inverse-e' challenge, Hynek and I later came up with what we called RelAtive SpecTral Analysis, or RASTA, an approach that was ultimately adopted by Qualcomm and ended up in many millions of cell phones as the front end for speech recognition.
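For the technically inclined, the intuition is that a fixed filtering of the speech, like the inverse-'e' filter, shows up as a roughly constant additive offset in the log spectrum, so band-pass filtering each log-spectral trajectory over time suppresses it while letting the faster speech dynamics through. The snippet below is only a minimal sketch of that filtering step, not our original implementation, and the filter coefficients are the ones commonly cited for RASTA rather than anything specific to our early experiments.

```python
# Minimal sketch of RASTA-style filtering (illustrative, not the original code):
# band-pass filter the time trajectory of each log-spectral band, so components
# that change slowly or not at all over time, such as a fixed spectral tilt like
# the inverse-'e' filter, are suppressed while speech dynamics pass through.
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectra):
    """Filter each column (band) of log_spectra, shaped (num_frames, num_bands)."""
    # Band-pass coefficients commonly cited for RASTA; treat them as illustrative.
    numer = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # zero gain at DC
    denom = np.array([1.0, -0.98])                        # slow integrator pole
    return lfilter(numer, denom, log_spectra, axis=0)

# A band that is constant over time (a fixed offset in the log spectrum) is
# driven toward zero after the filter's initial transient.
constant_band = 5.0 * np.ones((500, 1))
filtered = rasta_filter(constant_band)
print(filtered[0, 0], filtered[-1, 0])  # ~1.0 at the start, near zero at the end
```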

In the following years, I was fortunate to work with both Hervé and Hynek as they changed jobs. As our group grew, adding students, postdocs, and a continuing flow of talented visitors, we all worked closely with these two early contributors and their teams, for instance co-developing new approaches to incorporating multiple feature streams in speech recognition. As of this writing both of them are at IDIAP in Switzerland (in many ways our sister institution) and our joint work continues.

Since those early days, we have graduated 16 PhDs from the Speech (née Realization) group, some of whom are now teaching a new generation of students, while others have come back to ICSI as research staff. We now have a new generation of 10 students working on a much more diverse set of problems, as we have expanded from speech recognition to speaker recognition, sentence segmentation, "diarization" (who spoke when), and a number of aspects of speech understanding. I hope that this issue of our Gazette, which will include a focus on the DARPA-sponsored GALE project, will give our readers some insight into the directions of the current group.