1993-11-01 - Re: Secure Phone Progress (fwd)

Header Data

From: karn@qualcomm.com (Phil Karn)
To: Matthew J Ghio <cypherpunks@toad.com>
Message Hash: 6caa2934f99eac5464e3d43fad842ea7d999d4b182719bd226509888150bdab1
Message ID: <9311010457.AA13960@servo>
Reply To: N/A
UTC Datetime: 1993-11-01 04:59:42 UTC
Raw Date: Sun, 31 Oct 93 20:59:42 PST

Raw message

From: karn@qualcomm.com (Phil Karn)
Date: Sun, 31 Oct 93 20:59:42 PST
To: Matthew J Ghio <cypherpunks@toad.com>
Subject: Re: Secure Phone Progress (fwd)
Message-ID: <9311010457.AA13960@servo>
MIME-Version: 1.0
Content-Type: text/plain



>Um...  Well, you should be able to do that with an ordinary sound
>board/digitizer and a modem.  However, 14400 bps (without compression)
>isn't enough to transmit sound waves at normal frequencies.
[...]

There are a few minor misstatements in this note. Standard telephony
samples 8000 times per second, with 8 bits per sample, for 64 kb/s.
You can keep the full 8000 samples/sec rate -- which captures all the
important speech frequencies -- and still fit into 14400 bps by cutting
each sample down to under 2 bits of resolution. The resulting speech
would be understandable, but highly distorted.
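
To make the bit budget concrete: 14400 / 8000 is only 1.8 bits per
sample. Here's a toy sketch (my own illustration, not any real coder)
of 2-bit PCM -- that's 16 kb/s, so hitting 14400 exactly would mean
fractional packing, e.g. nine samples into sixteen bits:

#include <stdint.h>

/* Quantize one 16-bit linear sample to 2 bits (4 coarse levels). */
static unsigned char encode2(int16_t s)
{
    return (unsigned char)((s + 32768) >> 14);    /* yields 0..3 */
}

/* Reconstruct: map each 2-bit code to the middle of its bin. */
static int16_t decode2(unsigned char c)
{
    return (int16_t)(((int)c << 14) + (1 << 13) - 32768);
}

Run speech through encode2/decode2 and you'll hear exactly the
distortion I mean: intelligible, but rough.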

There are many other speech coding methods that work somewhat better,
including various forms of delta modulation (sending just the
differences between adjacent samples). These are probably the best in
the bang-for-buck department; Motorola's CVSD (continuously variable
slope delta modulation) is already widely used in secure voice radios
operating at 16 kb/s or so.
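
The flavor of CVSD in a few lines of C (a sketch only -- the constants
are made up, not Motorola's): each transmitted bit says "step up" or
"step down", and the step size grows when the last three bits agree
(the integrator is falling behind a steep slope) and decays otherwise:

#include <stdint.h>

struct cvsd {
    int integrator;    /* current estimate of the waveform */
    int step;          /* current step size; init to, say, 16 */
    unsigned history;  /* last few output bits */
};

static int cvsd_encode(struct cvsd *st, int16_t sample)
{
    int bit = (sample > st->integrator);       /* 1 = step up */

    st->history = ((st->history << 1) | bit) & 7;
    if (st->history == 7 || st->history == 0)
        st->step += 16;        /* three identical bits: grow the step */
    else if (st->step > 2)
        st->step--;            /* otherwise let it decay */

    st->integrator += bit ? st->step : -st->step;
    if (st->integrator > 32767)  st->integrator = 32767;
    if (st->integrator < -32768) st->integrator = -32768;
    return bit;
}

The decoder runs the identical integrator/step logic on the received
bits, so both ends track the same estimate -- one bit per sample, and
at 16000 samples/sec you get the 16 kb/s figure above.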

But if you really want high quality at low data rates, you pretty much have
to use a vocoder. All of the methods mentioned above try to reproduce
(the important parts of) the speech waveform. Vocoders work by modeling
the human vocal tract and sending the parameters that describe it at any
particular moment, instead of attempting to encode the actual waveform.
Since these parameters correspond to things that move relatively slowly
in the modeled system (e.g., the muscles of the tongue, jaw, lips, etc.)
they consume much less bandwidth than the actual sound that's produced.
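
The "linear prediction" in these vocoders is just: guess each sample as
a weighted sum of the previous ten or so, and send the weights (plus
pitch and gain) every 20 ms instead of the samples. A sketch of the
synthesis (decoder) side, with illustrative names and sizes of my own:

#define P 10   /* typical predictor order for 8 kHz speech */

/* One output sample: excitation driven through the vocal-tract model.
 * Implements s[n] = e[n] + sum(k=1..P) a[k] * s[n-k].               */
static float lpc_synth(const float a[P], float hist[P], float excitation)
{
    float s = excitation;
    int k;

    for (k = 0; k < P; k++)
        s += a[k] * hist[k];       /* weighted sum of past outputs */
    for (k = P - 1; k > 0; k--)    /* shift the output history */
        hist[k] = hist[k - 1];
    hist[0] = s;
    return s;
}

Ten coefficients updated every 20 ms is a few hundred parameters per
second, versus 8000 samples per second -- that's where the bandwidth
savings comes from.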

What makes CELP so computationally expensive is the "C" part. CELP
stands for Codebook Excited Linear Prediction. The modeling scheme I
just mentioned is the Linear Prediction part; the Codebook Excitation
part is what drives it, playing the role of the vocal cords in the
vocal-tract model. As I understand it, the analyzer picks the entry in a
predefined "codebook" that produces the best results, i.e., synthesized
speech that most closely matches the original sound. This is a fairly
brute-force process; we're talking tens of DSP MIPS to do it in real
time. I don't know anyone who has done it in assembler on a widely
available general-purpose CPU, but I would be ecstatic to be proven
wrong.
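
To see where the MIPS go, here's the search in skeleton form. The sizes
are invented, and I've left out the part that makes it even worse: a
real coder runs each codeword through the weighted synthesis filter
before comparing, multiplying the work by roughly the predictor order.

#define NCODE 512   /* codebook entries (hypothetical) */
#define FRAME 40    /* 5 ms subframe at 8 kHz */

/* Try EVERY codebook entry against the target speech and return the
 * index of the one with the least squared error.                    */
static int celp_search(const float code[NCODE][FRAME],
                       const float target[FRAME])
{
    int best = 0, i, n;
    float best_err = 1e30f;

    for (i = 0; i < NCODE; i++) {
        float err = 0.0f;
        for (n = 0; n < FRAME; n++) {
            float d = target[n] - code[i][n];
            err += d * d;          /* squared error vs. the input */
        }
        if (err < best_err) {
            best_err = err;
            best = i;
        }
    }
    return best;   /* index transmitted to the far end */
}

Even this stripped-down version is 512 x 40 multiply-adds every 5 ms,
about 4 million per second; add the per-codeword filtering and you are
quickly into the tens of MIPS I mentioned.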

Phil

Thread