Getting started with programmatic audio [closed]

Asked 14/12, 2008 at 4:50 Answered 27/10, 2011 at 21:34

I'm looking for help getting started working programmatically with audio.

Specifically, the platform I'm working with exposes APIs to extract audio data from a resource (like an MP3), or to play back arbitrary data as audio. In both cases the actual data is byte arrays of 32-bit floats representing 44.1 kHz stereo. What I'm looking for is help understanding what those floats represent, and what kinds of things can be done with them to dynamically analyze or modify the sound they represent.

What sort of concepts do I need to learn about to work with audio this way?

Mannish answered 14/12, 2008 at 4:50 Comment(5)

PCM audio: en.wikipedia.org/wiki/Pulse-code_modulation – Nunci 14/12, 2008 at 5:37

Basically, every 32-bit value represents the voltage level at a specified time. Since the sample frequency is 44100 Hz, you get 44100 32-bit values per second per channel (* 2 since you have stereo) – Nunci 14/12, 2008 at 5:41

With stereo sounds the left and right channels are often interleaved, so that the first sample represents the left channel and the second the right, and so on. – Nunci 14/12, 2008 at 5:45

Thanks for the info so far - the platform I'm using does indeed interleave the stereo channels. And feel free to post this stuff as an answer so I can vote you up! :D – Mannish 14/12, 2008 at 5:50

Done! Should I maybe remove the comments here? – Nunci 14/12, 2008 at 6:15

107

As some have pointed out in the comments, what you want to look into is PCM audio.

In a nutshell, sound is a wave that travels through air. To capture that sound, we use a microphone, which contains a membrane that vibrates as the sound waves hit it. This vibration is translated into an electric signal whose voltage goes up and down. An analog-to-digital converter (ADC) then turns this changing voltage into a digital signal by sampling it a fixed number of times per second (the "sampling rate"; here 44.1 kHz, or 44,100 samples per second). In the current case, the result is stored as pulse-code modulated (PCM) audio data.
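To make the idea of sampling concrete, here is a minimal sketch that computes the PCM samples of a pure tone directly instead of recording them. The function name `sine_wave` and its parameters are my own choices for illustration, not part of any platform API.

```python
import math

SAMPLE_RATE = 44100  # samples per second, as in the question


def sine_wave(frequency_hz, duration_s, amplitude=0.5, sample_rate=SAMPLE_RATE):
    """Return mono PCM samples (floats in -1..1) for a pure tone."""
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
        for n in range(n_samples)
    ]


tone = sine_wave(441.0, 0.01)  # 10 ms of a 441 Hz tone
print(len(tone))               # 441 samples, one every 1/44100 of a second
```

Each value in `tone` is the "voltage" of the wave at one instant; one full cycle of this 441 Hz tone spans 100 samples.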

A speaker works the opposite way: the PCM signal is converted to analog by a digital-to-analog converter (DAC), then the analog signal goes to the speaker, where it vibrates a membrane, which produces vibrations in the air that we hear as sound.

Manipulating Audio

There are many libraries for many languages that you can manipulate audio with. However, since you've marked this question as "language-agnostic", I'll mention a few simple ways (as that's all I know!) to manipulate audio in your preferred language.

I'll present the code samples in pseudocode.

The pseudocode will have each audio sample have an amplitude in the range of -1 to 1. This will be dependent on the data type you are using for storing each sample. (I haven't dealt with 32-bit floats before, so this may be different.)
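As a rough illustration of how the range depends on the data type, the sketch below converts between signed 16-bit integer samples and floats in the -1 to 1 range. Dividing by 32768 is one common convention and is an assumption here; the helper names are hypothetical.

```python
def int16_to_float(samples):
    """Map signed 16-bit PCM values (-32768..32767) to floats in -1.0..1.0."""
    return [s / 32768.0 for s in samples]


def float_to_int16(samples):
    """Map floats back to 16-bit values, clamping anything outside -1.0..1.0."""
    out = []
    for s in samples:
        s = max(-1.0, min(1.0, s))       # clamp out-of-range values
        out.append(int(round(s * 32767)))
    return out


print(int16_to_float([0, 16384, -32768]))  # [0.0, 0.5, -1.0]
```

With 32-bit floats the samples are typically already in the -1 to 1 range, so no such conversion should be needed, but check what your platform documents.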

Amplification

To amplify the audio (and so increase the volume of the sound), you'll want the speaker to vibrate more strongly, so that the magnitude of the sound wave increases.

In order to make that speaker move more, you'll have to increase the value of each sample:

original_samples = [0, 0.5, 0, -0.5, 0]

def amplify(samples):
    # Build a new list with every sample doubled
    return [s * 2 for s in samples]

amplified_samples = amplify(original_samples)

# result: amplified_samples == [0, 1, 0, -1, 0]

The resulting samples are now amplified by a factor of 2 (about 6 dB), and on playback they should sound noticeably louder than before.
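One thing to watch for: if samples are expected to stay within -1 to 1, a large gain can push them out of range. A minimal sketch of amplification with an arbitrary gain that clamps ("clips") out-of-range values, assuming that range:

```python
def amplify(samples, gain):
    """Scale each sample by gain, clamping the result to -1.0..1.0."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


samples = [0.0, 0.5, 0.0, -0.5, 0.0]
print(amplify(samples, 2))  # [0.0, 1.0, 0.0, -1.0, 0.0]
print(amplify(samples, 4))  # same output: the peaks are clipped at 1.0 and -1.0
```

Clipping distorts the sound, so in practice you would pick a gain that keeps the loudest sample within range.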

Silence

When there are no vibrations, there is no sound. Silence can be achieved by setting each sample to 0, or to any other constant value, so that the amplitude does not change between samples:

original_samples = [0, 0.5, 0, -0.5, 0]

def silence(samples):
    # Build a new list of the same length with every sample set to 0
    return [0 for s in samples]

silent_samples = silence(original_samples)

# result: silent_samples == [0, 0, 0, 0, 0]

Playing back the above should result in no sound, as the membrane on the speaker is not moving at all, due to a lack of change in amplitude in the samples.

Speed Up and Down

Speeding things up and down can be achieved in two ways: (1) changing the playback sampling rate or (2) changing the samples themselves.

Changing the playback sampling rate from 44100 Hz to 22050 Hz will halve the speed of playback. This will make the sound slower and lower in pitch. Conversely, taking a 22 kHz source and playing it back at 44 kHz will make the sound very fast and high-pitched, like birds chirping.
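The arithmetic behind this is simple enough to sketch. The two helper functions below are hypothetical names of my own, used only to show how duration and pitch scale with the playback rate:

```python
def playback_duration(n_frames, playback_rate_hz):
    """Seconds needed to play n_frames sample frames at the given rate."""
    return n_frames / playback_rate_hz


def played_frequency(original_hz, source_rate_hz, playback_rate_hz):
    """Pitch heard when audio sampled at source_rate is played at playback_rate."""
    return original_hz * playback_rate_hz / source_rate_hz


frames = 44100 * 3  # three seconds of audio sampled at 44100 Hz
print(playback_duration(frames, 44100))       # 3.0
print(playback_duration(frames, 22050))       # 6.0 (half speed)
print(played_frequency(440.0, 44100, 22050))  # 220.0 (one octave lower)
```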

Changing the samples themselves (and keeping a constant playback sampling rate) means that samples either (a) get thrown out or (b) are added in.

To speed up the playback of the audio, throw out samples:

original_samples = [0, 0.1, 0.2, 0.3, 0.4, 0.5]

def faster(samples):
    new_samples = []
    for i in range(len(samples)):
        if i % 2 == 0:  # keep even-indexed samples, drop the rest
            new_samples.append(samples[i])
    return new_samples

faster_samples = faster(original_samples)

# result: faster_samples == [0, 0.2, 0.4]

The result of the above program is that the audio speeds up by a factor of 2, similar to playing back audio that was sampled at 22 kHz at 44 kHz.

To slow down the playback of the audio, throw in a few samples:

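original_samples = [0, 0.1, 0.2, 0.3]

def interpolate(a, b):
    # One simple choice: the midpoint between two neighbouring samples
    return (a + b) / 2

def slower(samples):
    new_samples = []
    for i in range(len(samples)):
        new_samples.append(samples[i])
        if i + 1 < len(samples):  # the last sample has no neighbour after it
            new_samples.append(interpolate(samples[i], samples[i + 1]))
    return new_samples

slower_samples = slower(original_samples)

# result: slower_samples is approximately [0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3]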

Here, extra samples were added, thereby slowing down the playback. The interpolation function makes a "guess" as to how to fill in the extra space that must be added.
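The same idea generalizes to speed factors other than 2 and 1/2. The sketch below steps through the input at a fractional position and linearly interpolates between the two nearest samples. It assumes mono data (for interleaved stereo you would process each channel separately), and `resample_linear` is a name made up for this example.

```python
def resample_linear(samples, speed):
    """Return samples read at `speed` times the original rate (mono)."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    out = []
    position = 0.0
    last = len(samples) - 1
    while position <= last:
        i = int(position)                # sample just before the position
        frac = position - i              # how far we are towards the next one
        nxt = samples[min(i + 1, last)]  # neighbour, or the last sample itself
        out.append(samples[i] * (1 - frac) + nxt * frac)
        position += speed
    return out


print(resample_linear([0.0, 1.0, 2.0, 3.0], 2.0))  # [0.0, 2.0]
print(resample_linear([0.0, 1.0, 2.0, 3.0], 0.5))  # [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
```

A speed above 1 drops samples (faster, higher pitch) and a speed below 1 adds interpolated ones (slower, lower pitch). Linear interpolation is only a crude guess; better resamplers filter the signal as well.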

Spectrum Analysis and Sound Modification by FFT

Using a technique called the fast Fourier transform (FFT), sound data in the time domain (amplitude over time) can be mapped to the frequency domain to find the frequency components of the audio. This can be used to produce the spectrum analyzers that you might see on your favorite audio player.
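To show what "finding the frequency components" means, here is a small sketch that uses a plain discrete Fourier transform (DFT) written out by hand. It computes the same result an FFT would, just far more slowly, so it is only suitable for short buffers; the function names are my own.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each frequency bin from 0 Hz up to the Nyquist frequency."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            s * cmath.exp(-2j * math.pi * k * t / n)
            for t, s in enumerate(samples)
        )
        mags.append(abs(total))
    return mags


def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest bin, ignoring the 0 Hz (DC) bin."""
    mags = dft_magnitudes(samples)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * sample_rate / len(samples)


sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(200)]
print(dominant_frequency(tone, sample_rate))  # 440.0
```

Bin `k` corresponds to the frequency `k * sample_rate / n`, so 200 samples at 8000 Hz give a resolution of 40 Hz per bin. A spectrum analyzer draws these magnitudes as bars.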

Not only that: since you now have the frequency components of the audio, if you change the amount of each component and transform back, you change the sound itself.

If you want to cut off certain frequencies, you can use the FFT to transform the sound data into the frequency domain and zero out the frequency components that are not desired. This is called filtering.

Making a low-pass filter, which only lets through frequencies below a certain cutoff, can be performed like this:

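import numpy as np

# rfft returns only the bins from 0 Hz up to the Nyquist frequency
# (a full FFT would also contain a mirrored upper half)
data = np.fft.rfft(original_samples)

# Zero the upper half of the bins
data[len(data) // 2:] = 0

new_samples = np.fft.irfft(data, n=len(original_samples))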

In the above example, all frequencies over the half-way mark are cut off. So, if the audio could produce 22 kHz as the maximum frequency, any frequency above 11 kHz will be cut out. (For audio played back at 44 kHz, the maximum theoretical frequency that can be produced is 22 kHz. See the Nyquist–Shannon sampling theorem.)
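For a self-contained check of this idea, the sketch below uses a hand-written DFT in place of a library FFT. It mixes a 400 Hz tone with a 3000 Hz tone, removes everything above 2000 Hz, and confirms that only the low tone survives. The names and the cutoff are chosen for illustration. Note that a full transform of n samples has a mirrored upper half, so bins `k` and `n - k` must be treated as the same frequency.

```python
import cmath
import math


def dft(samples):
    n = len(samples)
    return [
        sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples))
        for k in range(n)
    ]


def inverse_dft(bins):
    n = len(bins)
    return [
        (sum(b * cmath.exp(2j * math.pi * k * t / n) for k, b in enumerate(bins)) / n).real
        for t in range(n)
    ]


def low_pass(samples, sample_rate, cutoff_hz):
    """Zero every bin whose frequency is above cutoff_hz, then transform back."""
    n = len(samples)
    bins = dft(samples)
    for k in range(n):
        # Bins past n/2 mirror the lower ones, so use the smaller index.
        freq = min(k, n - k) * sample_rate / n
        if freq > cutoff_hz:
            bins[k] = 0
    return inverse_dft(bins)


rate = 8000
low = [math.sin(2 * math.pi * 400 * t / rate) for t in range(200)]
high = [0.5 * math.sin(2 * math.pi * 3000 * t / rate) for t in range(200)]
mixed = [a + b for a, b in zip(low, high)]
filtered = low_pass(mixed, rate, 2000)
print(max(abs(f - l) for f, l in zip(filtered, low)) < 1e-6)  # True
```

Zeroing bins outright is the bluntest possible filter and works cleanly here only because both tones fit the buffer exactly; on real audio it tends to cause ringing, which is why practical filters are usually designed differently.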

If you want to do something like increase the low-frequency range (similar to the bass boost effect), take the lower-end of the FFT-transformed data and increase its magnitude:

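import numpy as np

# rfft: bins run from 0 Hz up to the Nyquist frequency
data = np.fft.rfft(original_samples)

# Double the lowest quarter of the bins (the factor 2 is arbitrary)
data[:len(data) // 4] *= 2

new_samples = np.fft.irfft(data, n=len(original_samples))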

This example boosts the lowest quarter of the frequency components of the audio, making the low frequencies louder.

There are quite a few things that can be done to the samples to manipulate the audio. Just go ahead and experiment! It's the most exciting way to learn.

Chiton answered 14/12, 2008 at 6:57 Comment(1)

Can you give an example of a noise suppressor for removing noise from my audio? I need to implement noise removal in my application. Is there any way to find this solution? Please give some sample links and code as well. Thanks in advance... – Margo 23/3, 2015 at 7:48

Looks like you want to know more about PCM audio

Basically, every 32-bit value represents the voltage level at a specified time. Since the sample frequency is 44100 Hz, you get 44100 32-bit values per second per channel (* 2 since you have stereo)

With stereo sounds the left and right channels are often interleaved, so that the first sample represents the left channel and the second the right, and so on.
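As a rough sketch of what that looks like in practice, the code below splits a byte array of interleaved 32-bit floats into separate left and right channels. It assumes little-endian floats with the left channel first; both are assumptions you should check against your platform's documentation, and `deinterleave_stereo` is a hypothetical helper name.

```python
import struct


def deinterleave_stereo(raw, byte_order="<"):
    """Split interleaved 32-bit float stereo bytes into (left, right) lists."""
    count = len(raw) // 4  # 4 bytes per 32-bit float
    values = struct.unpack(f"{byte_order}{count}f", raw[:count * 4])
    return list(values[0::2]), list(values[1::2])


raw = struct.pack("<4f", 0.5, -0.5, 0.25, -0.25)  # two stereo frames
left, right = deinterleave_stereo(raw)
print(left, right)  # [0.5, 0.25] [-0.5, -0.25]
```

At 44100 Hz stereo this format takes 44100 * 2 * 4 = 352800 bytes per second of audio.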

Nunci answered 14/12, 2008 at 6:15 Comment(0)

To understand what those arrays of 32 bit floats represent you need to read a good introduction to Digital Audio.

If you're near a library, 'The Computer Music Tutorial' by Curtis Roads may be useful, specifically chapter one, 'Digital Audio Concepts'. (It's been a long time since I read this book, though.)

Once you have an understanding of digital audio, there are many ways to manipulate it. When you're ready these links may help.

The Dsp + Plugin Development forum at KVR Audio is one place to ask questions. Posts here are generally split between general audio DSP and VST plugin topics.

MusicDsp has a lot of code snippets.

The Scientist and Engineer's Guide to Digital Signal Processing is a free online textbook that goes into depth on standard DSP topics, much of which also applies to digital audio.

Impregnate answered 14/12, 2008 at 4:50 Comment(0)

I posted a similar question recently: good audio dsp tutorials.

The golden link is certainly The Audio EQ Cookbook, if you wanna write and sorta understand EQs, but more generally, the musicdsp.org archive is the best resource I've found for audio DSP coding.

Here's a video of a synth ("Soundoid") I co-made in Flash: http://www.youtube.com/watch?v=O-1hHiA7y4o

And you can play with it here: http://www.zachernuk.com/2011/03/28/soundoid-audio-synthesizer-v0-5/

Hertel answered 27/10, 2011 at 21:34 Comment(0)
