Speech is a complex phenomenon. People rarely understand how is it produced and perceived. The naive perception is often that speech is built with words, and each word consists of phones. The reality is unfortunately very different. Speech is a dynamic process without clearly distinguished parts. It’s always useful to get a sound editor and look into the recording of the speech and listen to it. Here is for example the speech recording in an audio editor.
All modern descriptions of speech are to some degree probabilistic. That means that there are no certain boundaries between units, or between words. Speech to text translation and other applications of speech are never 100% correct. That idea is rather unusual for software developers, who usually work with deterministic systems. And it creates a lot of issues specific only to speech technology.
Structure of speech
In current practice, speech structure is understood as follows:
Speech is a continuous audio stream where rather stable states mix with dynamically changed states. In this sequence of states, one can define more or less similar classes of sounds, or phones. Words are understood to be built of phones, but this is certainly not true. The acoustic properties of a waveform corresponding to a phone can vary greatly depending on many factors – phone context, speaker, style of speech and so on. The so called coarticulation makes phones sound very different from their “canonical” representation. Next, since transitions between words are more informative than stable regions, developers often talk about diphones – parts of phones between two consecutive phones. Sometimes developers talk about subphonetic units – different substates of a phone. Often three or more regions of a different nature can easily be found.
The number three is easily explained. The first part of the phone depends on its preceding phone, the middle part is stable, and the
next part depends on the subsequent phone. That’s why there are often three states in a phone selected for HMM recognition.
Sometimes phones are considered in context. There are triphones or even quinphones. But note that unlike phones and diphones, they are matched with the same range in waveform as just phones. They just differ by name. That’s why we prefer to call this object senone. A senone’s dependence on context could be more complex than just left and right context. It can be a rather complex function defined by a decision tree, or in some other way.
Next, phones build subword units, like syllables. Sometimes, syllables are defined as “reduction-stable entities”. To illustrate, when speech becomes fast, phones often change, but syllables remain the same. Also, syllables are related to intonational contour. There are other ways to build subwords – morphologically-based in morphology-rich languages or phonetically-based. Subwords are often used in open vocabulary speech recognition.
Subwords form words. Words are important in speech recognition because they restrict combinations of phones significantly. If there are 40 phones and an average word has 7 phones, there must be 50^7 words. Luckily, even a very educated person rarely uses more then 20k words in his practice, which makes recognition way more feasible.
Words and other non-linguistic sounds, which we call fillers (breath, um, uh, cough), form utterances. They are separate chunks of audio between pauses. They don’t necessary match sentences, which are more semantic concepts.