Just as the Kokanee salmon steadfastly progress
up the Wallowa River to spawn, Microsoft is equally determined to
move speech technology into the mainstream, making it a widespread
industry. Kokanee is the codename for a major research and development
effort at Microsoft designed to grow the speech industry. Its focus,
the delivery of the .Net Speech Platform, will make speech-enabled
application development and deployment simpler, faster, and easier.
If Kokanee succeeds in stimulating the creation of killer speech applications,
we may soon find ourselves talking with computers on a more frequent
basis and actually enjoying the experience.
Before delving into the details
of Kokanee, it is helpful to quickly review the origins of speech
recognition and its history at Microsoft. The former lays the groundwork
necessary for understanding the technologies underlying this ambitious
initiative, while the latter reveals how the key players arrived at
Microsoft and their essential roles in shaping the field.
Brief History of Speech Recognition
The first machine to recognize
speech was likely a commercial toy dog called Radio Rex, manufactured
in 1920. Designed to respond to its name, Rex actually responded
to almost any sound with sufficient 500-Hz energy. Rex's inability
to detect out-of-vocabulary sounds foreshadowed a problem that
plagues speech recognizers (speech recognition engines) to this day.
Although the 1930s and 1940s
saw some advances in speech recognition, in the areas of vocoding
(voice compression) and speech analysis, the first computer system
came from Bell Labs in the 1950s. This speech recognition system
may have been the first actual word recognizer. It was
able to discern digits spoken by a single speaker, provided the
digits were separated by long pauses (isolated, or discrete, speech).
The speech recognizers developed in the 1950s relied on acoustic
input for identifying a few words, syllables, vowels, or digits.
However, in 1959, University College London (www.rdg.ac.uk/EPU)
demonstrated the use of a linguistic unit to predict the probability
of the next linguistic unit, a concept that one day would lead to
major breakthroughs in speech research.
In the 1960s, automatic speech
recognition systems made minor improvements in vocabulary and accuracy.
However, the focus was on acoustic models, with little attention
given to speech understanding. Consequently, accuracy and vocabularies
remained quite limited, and systems still required discretely spoken
words (i.e., one at a time, with pauses in between).
In the tumultuous spring of
1968, the movie classic 2001: A Space Odyssey gave voice
to the calm but psychologically unbalanced HAL 9000 computer. For
the public, HAL set exceedingly high expectations for speech recognition
and understanding (and speech synthesis). The smooth-talking HAL
sounded entirely human in its conversations with crew members,
although HAL's voice synthesis resembled a computerized version
of Mr. Rogers.
Government Funds Speech Research
In 1971, with speech technology
at the forefront of public consciousness, the U.S. Department of
Defense's Advanced Research Projects Agency (ARPA, later
renamed DARPA) funded a five-year study, dubbed the Speech Understanding
Research (SUR) project (www.nvrc.org/Conferences%20and%20Workshops/1999/auto_speech recognition_systems.htm).
Its goal was achieving a breakthrough in continuous (connected)
speech recognition. The SUR project Advisory Board, which included
Allen Newell (a founding father of artificial intelligence), specified
that the systems should recognize normally spoken English in a quiet
room, with a restricted vocabulary