Programmatically increase the pitch of an array of audio samples

Hello kind people of the audio computing world,

I have an array of samples that represent a recording. Let us say that it is 5 seconds at 44100Hz. How would I play this back at an increased pitch? And is it possible to increase and decrease the pitch dynamically? Like have the pitch slowly increase to double the speed and then back down.

In other words, I want to take a recording and play it back as if it is being 'scratched' by a DJ.

Pseudocode is always welcomed. I will be writing this up in C.

Thanks,


EDIT 1

Allow me to clarify my intentions. I want to keep the playback at 44100Hz and so therefore I need to manipulate the samples before playback. This is also because I would want to mix the audio that has an increased pitch with audio that is running at a normal rate.

Expressed in another way, maybe I need to shrink the audio over the same number of samples somehow? That way when it is played back it will sound faster?


EDIT 2

Also, I would like to do this myself. No libraries please (unless you feel I could pick through the code and find something interesting).


EDIT 3

A sample piece of code written in C that takes 2 arguments (array of samples and pitch factor) and then returns an array of the new audio would be fantastic!


PS I've started a bounty on this, not because the answers already given aren't valid, but because I thought it would be good to get more feedback on the subject.



AWARD OF BOUNTY

Honestly I wish I could distribute the bounty over several different answers, as there were quite a few that I thought were super helpful. Special shoutout to Daniel for passing me some code, and to AShelly and hotpaw2 for putting in such detailed responses.

Ultimately though I used an answer from another SO question referenced by datageist and so the award goes to him.

Thanks again everyone!

Houchens answered 1/3, 2011 at 15:0 Comment(2)
Please clarify EDIT 3. Scratching a record like a DJ will result in a file that is not the same length. e.g. if a DJ spins a record twice as fast, a recording of that sound would be half the original length in time. Is this what you want?Fiesta
@hotpaw2. Yes, sorry about that. I've corrected EDIT 3. Indeed it would be half the length and that is exactly what I'm looking for. Thanks :)Houchens

Take a look at the "Elephant" paper in Nosredna's answer to this (very similar) SO question: How do you do bicubic (or other non-linear) interpolation of re-sampled audio data?

Sample implementations are provided starting on page 37, and for reference, AShelly's answer corresponds to linear interpolation (on that same page). With a little tweaking, any of the other formulas in the paper could be plugged into that framework.
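To make the "plugged into that framework" idea concrete, here is a hedged sketch (the names `Interp4`, `interp_linear`, and `resample` are my own, not from the paper): a resampling loop that takes the interpolator as a function pointer, so any of the paper's 4-point kernels can be dropped in with the same signature.

```c
#include <stddef.h>

typedef float Sample;

/* A 4-point interpolator: x[0]..x[3] surround the target, which lies
   between x[1] and x[2]; t in [0,1) is the fractional position. */
typedef Sample (*Interp4)(const Sample *x, float t);

/* Linear interpolation, expressed in the same 4-point frame
   (it only touches the middle two points). */
static Sample interp_linear(const Sample *x, float t)
{
    return x[1] + (x[2] - x[1]) * t;
}

/* Resample `in` (inLen samples) by `rate` into `out`; returns the number
   of samples written. The loop starts one sample in and stops two samples
   early so every 4-point window stays inside the input. */
static size_t resample(Sample *out, size_t outCap,
                       const Sample *in, size_t inLen,
                       float rate, Interp4 interp)
{
    size_t n = 0;
    float pos = 1.0f;
    while (pos < (float)(inLen - 2) && n < outCap) {
        size_t i = (size_t)pos;
        float t = pos - (float)i;
        out[n++] = interp(in + i - 1, t);  /* window is in[i-1]..in[i+2] */
        pos += rate;
    }
    return n;
}
```

Swapping in, say, one of the paper's Hermite or B-spline formulas then only means writing another function with the `Interp4` signature.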

For evaluating the quality of a given interpolation method (and understanding the potential problems with using "cheaper" schemes), take a look at this page:

http://www.discodsp.com/highlife/aliasing/

For more theory than you probably want to deal with (with source code), this is a good reference as well:

https://ccrma.stanford.edu/~jos/resample/

Arcadian answered 19/3, 2011 at 12:21 Comment(0)

One way is to keep a floating point index into the original wave, and mix interpolated samples into the output wave.

//Simulate scratching of `inwave` (inputLen samples):
// `rate` is the speedup/slowdown factor.
// result mixed into `outwave` (caller must size it for inputLen/rate samples)
// "Sample" is a typedef for the raw audio type.
void ScratchMix(Sample* outwave, Sample* inwave, int inputLen, float rate)
{
   float index = 0;
   while (index < inputLen - 1)   //stop early so inwave[i+1] stays in bounds
   {
      int i = (int)index;
      float frac = index-i;      //will be between 0 and 1
      Sample s1 = inwave[i];
      Sample s2 = inwave[i+1];
      *outwave++ += s1 + (s2-s1)*frac;   //do clipping here if needed
      index+=rate;
   }
}

If you want to change rate on the fly, you can do that too.
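As a sketch of changing the rate on the fly (the name `ScratchMixRamp` and the linear ramp are my own additions, not part of the answer's code), the rate can be recomputed each step from how far playback has progressed through the input:

```c
#include <stddef.h>

typedef float Sample;

/* Mix `inwave` into `outwave` while the playback rate ramps linearly from
   rateStart to rateEnd across the input; returns samples written. */
static size_t ScratchMixRamp(Sample *outwave, size_t outCap,
                             const Sample *inwave, size_t inLen,
                             float rateStart, float rateEnd)
{
    size_t n = 0;
    float index = 0.0f;
    while (index < (float)(inLen - 1) && n < outCap) {
        size_t i = (size_t)index;
        float frac = index - (float)i;          /* between 0 and 1 */
        Sample s1 = inwave[i];
        Sample s2 = inwave[i + 1];
        outwave[n++] += s1 + (s2 - s1) * frac;  /* mix, as in ScratchMix */
        /* the rate tracks how far we have travelled through the input */
        float progress = index / (float)(inLen - 1);
        index += rateStart + (rateEnd - rateStart) * progress;
    }
    return n;
}
```

Any smooth function of `progress` (or of wall-clock time) could replace the linear ramp to shape the "scratch" gesture.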

If this creates noisy artifacts when rate > 1, try replacing *outwave++ += s1 + (s2-s1)*frac; with this technique (from this question):

*outwave++ += InterpolateHermite4pt3oX(inwave+i-1, frac);

where

static float InterpolateHermite4pt3oX(Sample* x, float t)
{
    // reads x[0]..x[3]; the caller must keep the window in bounds
    // (i.e. 1 <= i <= inputLen - 3 when passing inwave+i-1)
    float c0 = x[1];
    float c1 = 0.5f * (x[2] - x[0]);
    float c2 = x[0] - (2.5f * x[1]) + (2 * x[2]) - (0.5f * x[3]);
    float c3 = (0.5f * (x[3] - x[0])) + (1.5f * (x[1] - x[2]));
    return (((((c3 * t) + c2) * t) + c1) * t) + c0;
}

Example of using the linear interpolation technique on "Windows Startup.wav" with a factor of 1.1 (waveform screenshot: the original on top, the sped-up version on the bottom).

It may not be mathematically perfect, but it sounds right, and ought to work fine for the OP's needs.

Socle answered 1/3, 2011 at 20:48 Comment(10)
This can't work, since audio isn't pixels on the screen - and will give you incredible and unbearable sound artifacts and noises in the output signal. Also, you won't need clipping because there can't be overflow since weights always sum to 100%.Dedifferentiation
@Daniel Mošmondor, This does work, assuming inwave is a single pcm channel, and your factor is not so extreme that you get aliasing. I have a working program in ruby to prove it. If you plot samples vs time, you get a waveform. It's perfectly valid to interpolate points to give you a different representation of that waveform.Socle
As for the clipping, I added that comment to deal with the OP's remark that he wanted to mix the stretched sample into another track. This code handles the case where outwave already contains good data.Socle
I won't flame here, but just try this: open original and re-sampled file in some editor that can show you spectral view, and you'll see great difference.Dedifferentiation
@Socle - Yes it is perfectly valid to interpolate;... BUT, interpolation is filtering: linear interpolation amounts to a triangle filter, and a triangle filter has a nasty frequency response. A much more complicated and/or longer interpolation filter is required for a decent frequency response without aliasing.Fiesta
I'd still recommend that the OP tries this first. Simulating scratching one track on top of a background track may not call for the purest possible transform. At the very least, this method is easy to get working, and from there it is easy to substitute more complicated filters if needed.Socle
I don't see the problem with interpolation... I do it in my code. In the old days we'd just pick the nearest sample (eww). Running a mild lowpass filter prior to the interpolation would eliminate aliasing, if that turns out to be a problem. But if the OP doesn't need perfect quality, why not do this?Lyautey
That's convenient - I was going to go dig up my question about this, and then I saw you'd already linked to it. :)Trollop
@Daniel: Hermite spline interpolation for audio is a little bit better than linear interpolation, but neither is so bad that it will give you "incredible and unbearable sound artifacts and noises". And as Quertie points out, you can get rid of what artifacts there are from this process with a filter.Trollop
The amount of audio distortion artifacts produced by a poor filtering interpolator might be proportional to the amount of high frequency sound (say above Fs/4) in your audio. Try a hi-fi recording including lots of high bell sounds or something.Fiesta

Yes, it is possible.

But this is not a small amount of pseudocode. You are asking for a time-pitch modification algorithm, which takes a fairly large and complicated amount of DSP code to get decent results.

Here's a Time Pitch stretching overview from DSP Dimensions. You can also Google for phase vocoder algorithms.

ADDED:

If you want to "scratch", as a DJ might do with an LP on a physical turntable, you don't need time-pitch modification. Scratching changes the pitch and the speed of play by the same amount (not independently as would require time-pitch modification).

And the resulting array won't be of the same length, but will be shorter or longer by the amount of pitch/speed change.

You can change the pitch, as well as make the sound play faster or slower by the same ratio, by just resampling the signal using properly filtered interpolation. Just advance the read position by your desired rate change instead of by 1.0 (using floating point addition), then filter and interpolate the data at that point. Interpolation using a windowed Sinc interpolation kernel, with a low-pass filter transition frequency below the lower of the original and interpolated local sample rate, will work fairly well. Searching for "windowed Sinc interpolation" on the web returns lots of suitable results.

You need an interpolation method that includes a low-pass filter, or else you will hear horrible aliasing noise. (The exception to this might be if your original sound file is already severely low-pass filtered a decade or more below the sample rate.)

Fiesta answered 2/3, 2011 at 0:51 Comment(0)

If you want this done easily, see AShelly's suggestion [edit: as a matter of fact, try it first anyway]. If you need good quality, you basically need a phase vocoder.

The very basic idea of a phase vocoder is to find the frequencies that the sound consists of, change those frequencies as needed and resynthesize the sound. So a brutal simplification would be:

  1. run FFT
  2. change all frequencies by a factor
  3. run inverse FFT

If you're going to implement this yourself, you definitely should read a thorough explanation of how a phase vocoder works. The algorithm really needs many more considerations than the three-step simplification above.

Of course, ready-made implementations exist, but from the question I gather you want to do this yourself.

Annice answered 24/3, 2011 at 11:47 Comment(6)
I like the 2nd link, it's well illustrated. But I am slightly confused: In the algorithm described, after the inverse FFT, you get "a signal that is now either stretched or compressed in time and the pitch is not changed". Are you saying that to also change the pitch (since that's what the OP wants), you need to modify the frequency data after the FFT? Is that as simple as re-interpreting how wide the bins are when you do the inverse?Socle
@AShelly: What happens is that first the audio is stretched/compressed while maintaining the pitch, then sped up/slowed down to get the appropriate length. This results in the wanted pitch change. My gross simplification probably isn't directly implementable; I think that's the idea, but it's done this way to get correct phases. I have no experience implementing a vocoder myself, but some with other FFT/DCT usage. Therefore I tried to point to a good source.Annice
Also, I definitely suggest trying AShelly's algorithm first, if it's important to do this yourself.Annice
@dancek a drawback of using this technique might be the block-length properties of the FFT. It handles full-length chunks of audio at once, manipulates them in the frequency domain and transforms the result back to the time domain. In order to get the nice dynamic (fast) changing pitch effects the OP asks for, the blocks must be either very small or (recommended) each sample must be covered by multiple chunks. The chunks are overlapping and cross-blended in the time domain, possibly introducing audible artefacts. Therefore, a time-domain based solution would be more feasible.Guru
@user492238: Of course a windowing function will need to be used; see my link to an explanation of a phase vocoder algorithm. This technique actually seems to be called short-time Fourier transform.Annice
@dancek yes, a proper window is needed anyway. But it will not help much when reassembling adjacent chunks of (frequency-altered) audio in the time domain, especially when the frequency has been changed dynamically within the chunks. The necessary re-alignment of the chunks would be very hard in order to prevent doubling artifacts while cross-blending. Keep in mind, the length of the chunks changes when modifying the pitch.Guru

To decrease and increase the pitch is as simple as playing the sample back at a lower or higher rate than 44.1kHz. This will produce the slower/faster record sound but you'll need to add the 'scratchiness' of real records.

Minatory answered 1/3, 2011 at 15:12 Comment(2)
+1 I consider this to be the most simple solution. Most audio HW supports the changing of the playback rate very easily.Guru
@Guru Common audio hardware is shared between applications and runs at a fixed rate. Software mixers are provided by the OS to give the impression that you "own" the hardware.Thessalonians

This helped me with resampling, which is the same thing you need, just looked at from the opposite side.

If you can't find code, ping me, I have a nice C routine for this.

Mammilla answered 1/3, 2011 at 19:46 Comment(1)
I can't see how FIR would help here. What the OP is looking for is to alter the audio in a nonlinear way. Basically, what FIR is helpful for is changing the magnitude of several frequencies in the audio, not transforming the inherent frequencies the signal is made of.Guru
