Bandwidth in speech intelligibility
YUGAL SHARMA, country manager, Polycom India, on the history of bandwidth in
telephony
BANDWIDTH is a much used and abused reason for the success or failure of many
technology applications. Speech in telephony is one such application. Of all
the elements that affect the intelligibility of speech in telephony, bandwidth
has been proven to be one of the most critical—critical enough to be able
to compensate for other deficiencies such as noise, reverberations and other
factors hampering contemporary speech communication systems. I will attempt
to deal with this issue in two parts, tracing the path that has led to the evolution
of telephony, and the bottlenecks that should be done away with to accelerate
its present growth in terms of quality.
Some progress has been made in reducing telephony’s deficiencies in the
years since the first transcontinental phone call in 1915, as many sciences
have come together and enabled a better understanding of the causes and solutions
to these problems.
Early advances
Acoustics, physics, chemistry and electronics have facilitated major advances
in the design of the telephone instrument, with new designs for the mouthpiece
and earpiece alone producing a 10 dB frequency improvement by 1940. Similar
improvements brought closer control to the gain of these elements (early experiments
required the talker to tap the carbon microphone to loosen the granules inside).
As the telephone evolved, antisidetone circuits were added so the talker could
better judge his own loudness. The network added echo suppression and, later,
digital echo cancellation to reduce far-end echo, which became more troublesome
as long-distance calls became routine.
However, in the last sixty years, little progress has been made in the amount
of audio bandwidth that can be carried by the telephone network. Early telephone
connections were not intentionally limited, but were constrained by the characteristics
of the transducers (which convert sound into electrical signals and back again) and
the equipment then available. Intelligibility research was commonly conducted
with frequencies extending from 4 KHz to 8 KHz (and sometimes beyond), but the
telephone network was expected to carry signals only to about 3 KHz into the
1930s, and to about 3.5 KHz with the first multiple-channel carrier systems.
With standardisation, and the codification of digital telephony in G.711, the
upper frequency limit of the telephone network is now commonly accepted to be
about 3.3 KHz at best. The last pre-divestiture Bell PSTN tests in 1984 showed
significant roll-off at 3.2 KHz for short and medium connections, dropping to
2.7 KHz in long-distance connections. At the low end of the spectrum, the telephone
network carries frequencies no lower than 220 Hz, and most commonly only as
far down as 280 or 300 Hz.
In contrast to this telephone performance, we find FM radio and television spanning
30 Hz to 15 KHz, CD audio covering 20 Hz to 20 KHz, professional and audiophile
audio 20 Hz to above 22 KHz, and AM radio extending up to 5 KHz.
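One way to put these ranges on a common footing is to count octaves (doublings of frequency), since hearing is roughly logarithmic in frequency. The following is only an illustrative sketch: the band edges are the figures quoted above, the telephone entry assumes the 300 Hz to 3.3 KHz limits, and the function and dictionary names are made up for this example. AM radio is left out because only its upper limit is given.

```python
import math

# Frequency ranges in Hz, taken from the figures quoted in the text.
# The telephone entry assumes the 300 Hz and 3.3 KHz limits mentioned above.
BANDS_HZ = {
    "telephone": (300.0, 3300.0),
    "FM radio / TV": (30.0, 15000.0),
    "CD audio": (20.0, 20000.0),
}


def octaves(low_hz: float, high_hz: float) -> float:
    """Number of octaves (doublings of frequency) between two frequencies."""
    return math.log2(high_hz / low_hz)


if __name__ == "__main__":
    for name, (low, high) in BANDS_HZ.items():
        width_hz = high - low
        print(f"{name:14s} {width_hz:8.0f} Hz wide, {octaves(low, high):.1f} octaves")
```

On these assumptions the telephone band spans about 3.5 octaves, against roughly 9 for FM radio and nearly 10 for CD audio.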
Bandwidth and intelligibility
Crandall noted in 1917, “It is possible to identify most words in a given
context without taking note of the vowels...the consonants are the determining
factors in...articulation.”
“Take him to the map” has a very different meaning from “take
him to the mat,” and a handyman may waste a lot of time fixing a “faucet”
when the faulty component was actually the “soffit.” Pole, bole,
coal, dole, foal, goal, told, hole, molt, mold, noel, bold, yo, roll, colt,
sole, dolt, sold, toll, bolt, vole, gold, shoal, and troll all share the same
vowel sound, only differing in the consonants with which it is coupled. Consonant
sounds have this critical role in most languages, including French, German,
Italian, Polish, Russian and Japanese. And overall, more than half of all phonemes
are consonants.
This critical role of consonants in speech presents a serious challenge for
the telephone network. The reason for this is that the energy in consonant sounds
is carried predominantly in the higher frequencies, often beyond the telephone’s
bandwidth entirely. While most of the average energy in English speech is in
the vowels, which lie below 3 KHz, the most critical elements of speech, the
consonants, lie above. The difference between “f” and “s,”
for example, is found entirely in frequencies above the 3.3 KHz telephone
bandwidth: the burst of high-frequency sound that distinguishes the “s”
in “sailing” from the “f” in “failing” occurs between
4 KHz and 14 KHz. When these frequencies are removed, no cue remains as to
what has been said.
This makes a conventional telephone incapable of conveying the difference between
“my cousin is sailing in college” and “my cousin is failing
in college” without the analysis of additional contextual information
(knowing whether my cousin sails frequently, for example).
The challenge we face
Overall, two-thirds of the frequencies in which the human ear is most sensitive,
and 80 percent of the frequencies in which speech occurs, are beyond the capabilities
of the public telephone network. The human ear is most sensitive at 3.3 KHz,
just where the telephone network cuts off.
Consonants are formed as non-voiced clicks, puffs, breaths, etc. They are created
not from the vocal cords but by colliding, snapping and hissing through combinations
of tongue, cheeks, teeth and so on. While “formants”, used in some
speech analysis, are useful in examining vowels and long, voiced sounds, we
see that they have very little to do with those elements of speech that carry
so much of its information, the consonants. Intelligibility of speech decreases
with decreasing bandwidth. For single syllables, 3.3 KHz bandwidth yields an
accuracy of only 75 percent, as opposed to over 95 percent with 7 KHz bandwidth.
This loss of intelligibility is compounded when sounds are combined in sentences.
Listeners are rarely conscious of how often this confusion occurs because the brain
has some ability to compensate. When a sound is not clear, the brain attempts
to examine the context of the sound. However, when presented with a continual
string of such verbal puzzles as the meeting progresses, the listener is distracted.
Too much of the listener’s time is spent in unravelling the intended meaning
instead of understanding it.
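A rough feel for how single-syllable errors compound can be had from a deliberately simplified model. The sketch below assumes that each syllable is misheard independently of the others, which is not claimed in the text and ignores the contextual compensation just described, so it overstates the real-world damage. The ten-syllable sentence length and the function name are hypothetical, and 0.95 stands in for "over 95 percent".

```python
def all_correct_probability(per_syllable_accuracy: float, syllables: int) -> float:
    """Chance that every syllable is heard correctly.

    ASSUMPTION: errors on different syllables are independent, so the
    per-syllable accuracies simply multiply.
    """
    return per_syllable_accuracy ** syllables


if __name__ == "__main__":
    # Single-syllable accuracies quoted in the text for each bandwidth.
    for bandwidth, accuracy in (("3.3 KHz", 0.75), ("7 KHz", 0.95)):
        p = all_correct_probability(accuracy, 10)
        print(f"{bandwidth}: {p:.1%} chance of hearing all 10 syllables correctly")
```

Under this simplification a ten-syllable utterance survives intact about 6 percent of the time at 3.3 KHz and about 60 percent of the time at 7 KHz, which is why the listener must lean so heavily on context over a narrowband line.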
What else affects speech accuracy?
There are additional aspects of business conferencing that interact with audio
bandwidth. Reverberation, which comes from the natural reflections occurring
in any room, magnifies the degrading effect of limited bandwidth. This is an
important issue in business telephony because group teleconferences are usually
held in meeting rooms, which are reverberant spaces. This problem is also magnified
as the talker moves farther from the microphone, or when the microphone is pointed
away from the talker, because a larger proportion of the total received sound
is reverberant rather than direct.
Increasing bandwidth is very effective at counteracting this problem. In one
test, word accuracy in a reverberant space increased from 52 to 80 percent when
the available bandwidth was raised from 4 to 8 KHz.
The expansion of global business has increased the importance of accurate telephone
communication among talkers who have different native languages or dialects.
Understanding accented speech can be much more difficult than native speech,
both because of the presence of an accent and because grammar, pronunciation,
and even word selection, are much different from what the listener expects.
A Korean speaker of English, for example, will commonly substitute “p”
for “f” (“faint” becomes “paint,” “coffee”
becomes “copy”). A Turk may insert extra syllables (“stone”
becomes “istone” or “sitone”). Even a speaker in London,
referring to a cigarette container as a “fag packet” (notice all
the consonants?), may leave his American listener completely perplexed.
Because of these substitutions, it is no longer safe to assume that an unclear
word can be deduced from its grammatical context. Hence the increased accuracy
that derives from increasing speech bandwidth is more critical when speech is
accented.
Whispering and soft speech carry proportionately more of their energy at high
frequencies. While the long-term average energy at 7 KHz in normal speech is
roughly 40 dB below that at 600 Hz, in whispered speech it is almost flat,
dropping only 10 dB over the same span of roughly three and a half octaves. Hence,
in whispers, even the vowels are much less intelligible with telephone bandwidth.
A person with a cold, or who is growing hoarse, will have more difficulty being
understood both because they have proportionately less energy within the telephone
band, and because they are probably speaking more softly.
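The normal and whispered speech figures above can be restated as an average spectral slope in dB per octave. This is a back-of-the-envelope sketch only: it treats the drop between 600 Hz and 7 KHz as a straight line on a log-frequency axis, which the text does not claim, and the function name is invented for the example.

```python
import math


def tilt_db_per_octave(drop_db: float, low_hz: float, high_hz: float) -> float:
    """Average fall in level per octave between two frequencies.

    ASSUMPTION: the level falls linearly with log-frequency between the
    two points, so the total drop is spread evenly over the octaves.
    """
    octave_span = math.log2(high_hz / low_hz)
    return drop_db / octave_span


if __name__ == "__main__":
    # Drops quoted in the text between 600 Hz and 7 KHz.
    for label, drop_db in (("normal speech", 40.0), ("whispered speech", 10.0)):
        slope = tilt_db_per_octave(drop_db, 600.0, 7000.0)
        print(f"{label}: about {slope:.1f} dB per octave")
```

The interval from 600 Hz to 7 KHz works out to about 3.5 octaves, so the quoted figures correspond to a slope of roughly 11 dB per octave for normal speech and under 3 dB per octave for whispers.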
One more factor for consideration is that the telephone removes important frequencies
both above and below its pass-band. In general, the telephone’s elimination
of frequencies below 250 Hz is responsible for much of the ‘unreality’
and loss of comfort that we hear in telephonic speech, the sense that the talker
is not really present.
It is clear that extending telephone bandwidth to 7 KHz and beyond can
markedly reduce fatigue, improve concentration, and increase intelligibility.
This improvement is even more significant in real-world room situations, where
the sound is often degraded by reverberation, projector or air-conditioner noise,
accented speech, and other acoustic problems that are encountered in business
telephony. Additionally, extending telephone bandwidth below 300 Hz brings a
significant increase in presence and realism.
In his 1938 paper discussing the bandwidth of the telephone system, AT&T’s
Inglis noted, “Frequency limitation is essentially an economic one, subject
to change as conditions change.” Here in the twenty-first century, economics
and conditions have changed as Inglis predicted, and modern telephony is now
in a position to deliver on the promises of wider bandwidth and clearer speech.
The author may be contacted at yugal.sharma@polycom.com