Does "16bit integer PCM data" mean it's signed or unsigned?

Asked 20/2, 2015 at 15:38 Answered 16/2, 2019 at 16:16

I'm using FMOD to develop an application that, when the user clicks the Next/Prev button, immediately starts playing the next/previous sentence from its exact beginning in an MP3 file that contains speech without music. I got the PCM data of the MP3 file by calling Sound::lock, but Sound::getFormat only told me it was "16bit integer PCM data", without saying whether it is signed or unsigned. How can I tell?

Some articles on the Internet say that almost all 16-bit integer PCM data is signed. If my PCM data is signed, which range of values represents silence: values close to 0 (e.g. -10 to 10), or values close to -32768 (e.g. -32768 to -32750)? If it is the values close to 0, does that mean there's no difference in meaning between opposite numbers like -32767 and 32767?

I need to detect silences which are long enough, e.g. longer than 500ms, to determine where each sentence in the speech begins.

Could anyone give me any suggestions on how to detect silence between sentences?

Medici answered 20/2, 2015 at 15:38 Comment(0)

16-bit audio is, by convention, usually signed.

Think about what PCM audio is: each sample is how far along its axis the speaker should physically rest at that moment in time. Therefore perfect silence is any constant, repeating value, because that represents the speaker not moving.

0 is then the centre of the range, and usually where a microphone should be with no input. -32768 is the speaker as close to one end of its axis as it can be, 32767 is it at the other end.
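To make the point about opposite values concrete, here is a minimal sketch in C, assuming the data really is signed 16-bit and centred on 0. The helper names and the threshold parameter are hypothetical, not part of FMOD. It shows that -32767 and 32767 are equally far from rest, so what matters for loudness is the magnitude of a sample, not its sign.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Magnitude of a signed 16-bit sample. The sample is widened to int first
   so that negating -32768 cannot overflow a 16-bit value. */
static int sample_magnitude(int16_t sample)
{
    return abs((int)sample);
}

/* Returns 1 if the sample lies within `threshold` of the centre value 0.
   `threshold` is a hypothetical tuning value (e.g. 10), not an FMOD setting. */
static int is_near_silence(int16_t sample, int threshold)
{
    assert(threshold >= 0);
    return sample_magnitude(sample) <= threshold;
}
```

A single quiet sample does not mean silence, since any waveform crosses 0 many times per second; only a long run of quiet samples does.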

The safest way to detect silence would be to run a spectral analysis over the relevant range and look for periods where there is no activity in any audible frequency range.

If you're looking for pauses between speech then the easiest thing would probably be to go to somewhere like this, plug in an acceptable frequency range for speech (it's considered to be around 300Hz to around 3500Hz in telephony), your sampling rate and however many multiplications you think you can afford. Copy the coefficients supplied. E.g. I assumed you'll do 37 taps across the speech range with a 44100Hz input and, converted to a C array, I got:

double coefficients[] = {
    -0.000560, -0.001290, -0.002332, -0.003606, -0.004911, -0.005921, -0.006201,
    -0.005256, -0.002610,  0.002106,  0.009059,  0.018139,  0.028924,  0.040691,  0.052479,
     0.063203,  0.071794,  0.077351,  0.079274,  0.077351,  0.071794,  0.063203,  0.052479,
     0.040691,  0.028924,  0.018139,  0.009059,  0.002106, -0.002610, -0.005256, -0.006201,
    -0.005921, -0.004911, -0.003606, -0.002332, -0.001290, -0.000560
};

If it were double input, for each input sample c I'd then compute a sampled value:

// numberOfTaps is the number of coefficients: 37 with the array given above.
const size_t numberOfTaps = sizeof(coefficients) / sizeof(coefficients[0]);

double *inputWave = ... input, an infinite array for the purposes of the example ...
double sampledValue = 0.0;  // filtered output for input position c
for (size_t coeff = 0; coeff < numberOfTaps; coeff++) {
    sampledValue += coefficients[coeff] * inputWave[c + coeff];
}

What I've then got is a bandpass filter. Only that part of the signal representing sound in the frequency range 300–3500Hz should remain in the output values. In real life no such filter is perfect; increase the number of coefficients to increase the quality of your filter.
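As a rough sketch of how the same loop might look over a finite buffer of signed 16-bit samples rather than an infinite array of doubles: the function name and its parameters are hypothetical, and it simply stops where the filter would run off the end of the input, so it produces inputCount - numberOfTaps + 1 output values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Applies an FIR filter to `inputCount` signed 16-bit samples.
   Writes inputCount - numberOfTaps + 1 filtered values to `output` and
   returns that count, or 0 if the input is shorter than the filter. */
static size_t fir_filter_int16(const int16_t *input, size_t inputCount,
                               const double *coefficients, size_t numberOfTaps,
                               double *output)
{
    assert(input && coefficients && output && numberOfTaps > 0);
    if (inputCount < numberOfTaps) {
        return 0;
    }
    size_t outputCount = inputCount - numberOfTaps + 1;
    for (size_t c = 0; c < outputCount; c++) {
        double sampledValue = 0.0;
        for (size_t coeff = 0; coeff < numberOfTaps; coeff++) {
            sampledValue += coefficients[coeff] * (double)input[c + coeff];
        }
        output[c] = sampledValue;
    }
    return outputCount;
}
```

The caller is assumed to supply an output buffer with room for at least inputCount - numberOfTaps + 1 doubles. For stereo data the channels would need to be separated first, since interleaved samples cannot be filtered as one stream.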

Having cut the irrelevant parts of the signal, I could then look for prolonged periods where sampledValue stays close to 0.0, whether slightly above or slightly below it.
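One possible way to search for such periods, sketched in C under stated assumptions: the function name, the threshold and the minimum run length are all hypothetical tuning choices. For the 500ms pauses mentioned in the question, a 44100Hz mono signal would need a run of 22050 samples.

```c
#include <assert.h>
#include <stddef.h>

/* Finds the first run of at least `minRunLength` consecutive values whose
   magnitude is <= `threshold`. Returns the index where that run starts,
   or `count` if no such run exists. */
static size_t find_silence(const double *filtered, size_t count,
                           double threshold, size_t minRunLength)
{
    assert(filtered && threshold >= 0.0 && minRunLength > 0);
    size_t runStart = 0;
    size_t runLength = 0;
    for (size_t i = 0; i < count; i++) {
        double value = filtered[i];
        double magnitude = value < 0.0 ? -value : value;
        if (magnitude <= threshold) {
            if (runLength == 0) {
                runStart = i;            /* a new quiet run begins here */
            }
            runLength++;
            if (runLength >= minRunLength) {
                return runStart;
            }
        } else {
            runLength = 0;               /* a loud sample ends the run */
        }
    }
    return count;
}
```

A suitable threshold depends on the recording's noise floor, so it would have to be found by experiment. To locate every pause rather than just the first, the search can be repeated starting from the end of each run found.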

Banneret answered 20/2, 2015 at 15:50 Comment(2)

Thank you so much, Tommy. I thought I'd only need to compare the PCM data with a number directly to find the pauses between sentences. Excuse my ignorance, but what does "taps" mean? Why should there be 37 taps? Does the array inputWave[] refer to the PCM data like the 16-bit integers I mentioned? And is your example code intended to determine whether the sample inputWave[c] represents silence? Sorry for so many questions and my poor English. – Medici 22/2, 2015 at 15:28

"Taps" is the signal-processing term for the number of input samples that are combined to produce one output sample. It comes more from the hardware tradition. It doesn't need to be 37; that's just the default on that page. You should probably pick based on subjective performance: more is generally better. As to CPU performance, look into using your processor's SIMD unit for the whole thing (which may mean using fixed-point shorts rather than doubles, but whatever). The output is a filtered wave. You could listen to it directly. Look for prolonged periods close to 0 to find silences. – Banneret 23/2, 2015 at 17:13

Surprisingly, if I create DirectSound sound buffers with an 8-bit format, DirectSound expects the samples to be 8-bit SIGNED (-127 to 127) on my machine, while when I create a 16-bit buffer, DirectSound expects them to be 16-bit UNSIGNED (0 to 65535). So at least on my machine the standard seems to be the opposite of Tommy's answer.

Lactary answered 16/2, 2019 at 16:16 Comment(0)
