Source: http://www.cdf.utoronto.ca/~csc209h/summer/a2/a2.html, written by Daniel Zingaro.
Sounds are waves of air pressure. When
a sound is generated, a sound wave
consisting of compressions (increases
in pressure) and rarefactions
(decreases in pressure) moves through
the air. This is similar to what
happens if you throw a stone into a
pond: the water rises and falls in a
repeating wave.
When a microphone records sound, it
takes a measure of the air pressure
and returns it as a value. These
values are called samples and can be
positive or negative corresponding to
increases or decreases in air
pressure. Each time the air pressure
is recorded, we are sampling the
sound. Each sample records the sound
at an instant in time; the faster we
sample, the more accurate our
representation of the sound. The
sampling rate refers to how many times
per second we sample the sound. For
example, CD-quality sound uses a
sampling rate of 44100 samples per
second; sampling someone's voice for
use in a VoIP conversation uses a far
lower rate than this. Sampling rates of
11025 (voice quality), 22050, and
44100 (CD quality) are common...
For mono sounds (those with one sound
channel), a sample is simply a
positive or negative integer that
represents the amount of compression
in the air at the point the sample was
taken. For stereo sounds (which we use
in this assignment), a sample is
actually made up of two integer
values: one for the left speaker and
one for the right...
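As a rough illustration of the layout described above, a stereo sample can be modelled in C as a pair of signed integers. The type name `stereo_sample` and the helper `bytes_per_second` are hypothetical, and the 16-bit width is an assumption (it matches the "shorts" the assignment works with later).

```c
#include <stdint.h>

/* One stereo sample: a left value and a right value.
 * Assumption: each value is a 16-bit signed integer. */
typedef struct {
    int16_t left;
    int16_t right;
} stereo_sample;

/* Bytes of audio data produced per second at a given sampling rate. */
long bytes_per_second(long sampling_rate)
{
    return sampling_rate * (long)sizeof(stereo_sample);
}
```

Under these assumptions, CD-quality stereo sound at 44100 samples per second takes 44100 * 4 = 176400 bytes for each second of audio.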
Here's how the algorithm [to remove vocals] works.
Copy the first 44 bytes verbatim from the input file to the output
file. Those 44 bytes contain important
header information that should not be
modified.
Next, treat the rest of the input file as a sequence of shorts. Take
each pair of shorts left and right,
and compute combined = (left - right)
/ 2. Write two copies of combined to
the output file.
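The two steps above can be sketched in C roughly as follows. This is a minimal sketch, not the assignment's reference solution: the function name `remove_vocals` and the constant `HEADER_SIZE` are hypothetical, and it assumes that a short is 16 bits and that the machine's byte order matches the file's.

```c
#include <stdint.h>
#include <stdio.h>

#define HEADER_SIZE 44  /* header bytes copied unchanged */

/* Copies the header, then replaces each (left, right) pair with two
 * copies of (left - right) / 2. Returns 0 on success, -1 on error. */
int remove_vocals(FILE *in, FILE *out)
{
    unsigned char header[HEADER_SIZE];
    int16_t pair[2];

    if (fread(header, 1, HEADER_SIZE, in) != HEADER_SIZE)
        return -1;
    if (fwrite(header, 1, HEADER_SIZE, out) != HEADER_SIZE)
        return -1;

    /* Read one left/right pair at a time until the data runs out. */
    while (fread(pair, sizeof(int16_t), 2, in) == 2) {
        /* Operands promote to int, so the subtraction cannot overflow. */
        int16_t combined = (int16_t)((pair[0] - pair[1]) / 2);
        pair[0] = combined;
        pair[1] = combined;
        if (fwrite(pair, sizeof(int16_t), 2, out) != 2)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

A caller would open the input and output files in binary mode ("rb" and "wb") and pass the two streams in. Dividing by 2 also keeps the result within the range of a 16-bit value, since left - right on its own can be as large as 65535 in magnitude.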
Why Does This Work?
For the curious, a brief explanation
of the vocal-removal algorithm is in
order. As you noticed from the
algorithm, we are simply subtracting
one channel from the other (and then
dividing by 2 to keep the volume from
getting too loud). So why does
subtracting the right channel from the
left channel magically remove vocals?
When music is recorded, it is
sometimes the case that vocals are
recorded by a single microphone, and
that single vocal track is used for
the vocals in both channels. The other
instruments in the song are recorded
by multiple microphones, so that they
sound different in each channel.
Subtracting one channel from the other
takes away everything that is "in
common" between those two channels
which, if we're lucky, means removing
the vocals.
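To make the cancellation concrete, here is a small sketch with made-up numbers. It assumes an idealised mix in which each channel is simply the shared vocal value plus that channel's own instrument value; the function names `combine` and `combine_mix` are hypothetical.

```c
#include <stdint.h>

/* combined = (left - right) / 2, as in the algorithm above. */
int16_t combine(int16_t left, int16_t right)
{
    return (int16_t)((left - right) / 2);
}

/* Builds both channels from one shared vocal value plus a separate
 * instrument value per channel, then combines them. The vocal term
 * appears in both channels, so it cancels in the subtraction. */
int16_t combine_mix(int16_t vocal, int16_t inst_left, int16_t inst_right)
{
    int16_t left  = (int16_t)(vocal + inst_left);
    int16_t right = (int16_t)(vocal + inst_right);
    return combine(left, right);
}
```

With vocal = 1000, inst_left = 300 and inst_right = -100, the channels are 1300 and 900, and the result is (1300 - 900) / 2 = 200. Setting vocal to 0 gives the same 200, so the vocal contributes nothing to the output.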
Of course, things rarely work so well.
Try your vocal remover on this
badly-behaved WAV file. Sure, the
vocals are gone, but so is the body of
the music! Apparently, some of the
instruments were also recorded
"centred", so that they are removed
along with the vocals when channels
are subtracted.