Detect human voice from audio file input

Asked 21/8, 2013 at 10:51 Answered 2/11, 2023 at 14:0

I am trying to implement automatic voice recording functionality, similar to the Talking Tom app. I use the following code to read input from the audio recorder and analyse the buffer:

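 float totalAbsValue = 0.0f;
 short sample = 0;

 numberOfReadBytes = audioRecorder.read( audioBuffer, 0, bufferSizeInBytes );

 // Analyze sound: average absolute amplitude of the 16-bit little-endian samples.
 // Loop only over the bytes actually read, not the whole buffer.
 for( int i = 0; i + 1 < numberOfReadBytes; i += 2 )
 {
     // Mask the low byte so its sign extension does not corrupt the high byte.
     sample = (short)( (audioBuffer[i] & 0xFF) | (audioBuffer[i + 1] << 8) );
     // Divide as float; integer division would truncate small values to 0.
     totalAbsValue += Math.abs( sample ) / (numberOfReadBytes / 2.0f);
 }

 // Analyze temp buffer: sum the averages of the last three reads.
 tempFloatBuffer[tempIndex%3] = totalAbsValue;
 float temp = 0.0f;

 for( int i=0; i<3; ++i )
     temp += tempFloatBuffer[i];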

Now I am able to detect voice input coming from the audio recorder and I can analyse the audio buffer.

The buffer is converted to a float value, and if it increases by a certain amount, it is assumed that there is some sound in the background and recording is started. But the problem is that the app starts recording all background noise, including fan/AC duct sounds.

Can anyone help me with analysing the buffer to detect human voice only? Or are there any other alternative ways to detect human voice from the audio recorder input?

Thanks in advance,

Exacting answered 21/8, 2013 at 10:51 Comment(6)

Do you know the characteristics of human voice which differentiate it from background noise? – Remarkable 21/8, 2013 at 10:54

@Remarkable No idea mate.. – Exacting 21/8, 2013 at 10:55

time-dependent frequency analysis + a neural network should do the trick. After all, that's what humans naturally do. – Direct 21/8, 2013 at 10:57

have you seen this question? https://mcmap.net/q/500839/-java-speech-recognition-api-closed – Heathcote 21/8, 2013 at 11:18

@vkulla42 tried the speech recognition. But no luck :( – Exacting 22/8, 2013 at 9:8

"The voiced speech of a typical adult male will have a fundamental frequency from 85 to 180 Hz, and that of a typical adult female from 165 to 255 Hz" (From here en.wikipedia.org/wiki/Voice_frequency) - what about you use your existing method but you pass it through a bandpass filter first (do it once for male voice and once for female voice)? Provided that you don't have a lot of noise in these bands then it could work for you. – Cowcatcher 30/8, 2013 at 5:22

Voice detection is not that simple. There are several algorithms, some of them are published, for example GSM VAD. Several open source VAD libraries are available, some of them are discussed here

Showy answered 30/8, 2013 at 3:8 Comment(0)

For voice detection, try the FFT algorithm.

For noise reduction, try the Speex library.

Kraut answered 2/9, 2013 at 2:9 Comment(0)

If you want to have a clean recording, you can:

1. Filter noise from the voice. You can use an FFT for that and apply filters such as low-pass, high-pass and band-pass filters: Filtering using FFT and Filters

2. After filtering, the noise will be reduced and you can use voice recognition APIs:

APIs

The more you filter, the less noise remains and the better recognition works. But be careful: filtering too aggressively can remove the voice together with the noise.
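As a rough illustration of the frequency-domain idea, the sketch below measures what fraction of a frame's spectral energy falls inside a chosen band. It uses a plain O(n²) DFT to stay self-contained; a real app would more likely use an FFT library. The class name `BandEnergy`, the method `bandFraction`, and any band limits or threshold you pass in are assumptions for illustration, not something taken from the linked articles.

```java
/** Hypothetical helper: share of spectral energy inside a frequency band. */
class BandEnergy {

    /**
     * Returns the fraction (0..1) of the frame's energy that lies between
     * lowHz and highHz. The DC bin is skipped so a constant offset in the
     * recording does not dominate the result.
     */
    static double bandFraction(short[] frame, double sampleRate,
                               double lowHz, double highHz) {
        int n = frame.length;
        double inBand = 0.0;
        double total = 0.0;
        for (int k = 1; k <= n / 2; k++) {
            // Naive DFT for bin k.
            double re = 0.0;
            double im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im -= frame[t] * Math.sin(angle);
            }
            double power = re * re + im * im;
            double freq = k * sampleRate / n;   // center frequency of bin k
            total += power;
            if (freq >= lowHz && freq <= highHz) {
                inBand += power;
            }
        }
        return total == 0.0 ? 0.0 : inBand / total;
    }
}
```

A frame could then be treated as a speech candidate when, say, `bandFraction(frame, 8000, 300, 3400)` exceeds a threshold you tune by experiment. Noises that share the band (claps, taps) will still pass, so this is a pre-filter rather than a voice detector.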

Also read more about the FFT:

Fast Fourier Transform of Human Voice

Hope This Helps :)

Persevere answered 2/9, 2013 at 2:25 Comment(4)

Voice Recognition API Link doesn't work.. "Apologies, but the page you requested could not be found. " – Exacting 2/9, 2013 at 4:23

@Exacting try this android-developers.blogspot.com/2010/03/… – Persevere 2/9, 2013 at 5:4

or this developer.android.com/reference/android/speech/… – Persevere 2/9, 2013 at 5:5

this is the link posted above javacodegeeks.com/2012/08/… – Persevere 2/9, 2013 at 5:6

What exactly are you looking for? Do you just want to filter out the human speech in the audio or do you actually want to know what the person has said?

Filtering the human speech is done by nearly every smartphone: it records the background noise with a second microphone at the back of the device and subtracts the two signals. But to be honest, I haven't seen any Android API where you can directly access the two signals.

If you want to do speech to text conversion, then have a look at Sphinx4 and Praat. Both do this job but again, I haven't seen an implementation for Android. Sphinx4 claims to be fully written in Java, so it should be possible to embed it in an Android App.

Muskellunge answered 30/8, 2013 at 17:49 Comment(0)

The way to process the input is to use a specialised library which removes noise.

For example, http://audacity.sourceforge.net, does noise removal.

So long as you have characterised the main types of noise, you should have only speech remaining.

It would be worthwhile collecting sampling data before the capture from the user, and after the user ended the capture, as this would provide at-the-time samples of noise in the environment. This is useful if each user faces unique background noise challenges.
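One simple way to use such at-the-time noise samples is to measure the ambient level first and only trigger on frames that are clearly louder. The sketch below is a minimal version of that idea, not a full noise-removal library; `NoiseGate`, `isAboveNoise` and the `factor` parameter are hypothetical names chosen for illustration. It assumes 16-bit PCM samples, which `AudioRecord` can deliver directly into a `short[]`.

```java
/** Hypothetical gate that compares each frame with a measured noise floor. */
class NoiseGate {
    private final double noiseFloor; // RMS of the ambient sample
    private final double factor;     // how many times louder a frame must be

    /** Calibrates the gate from audio captured before the user speaks. */
    NoiseGate(short[] ambient, double factor) {
        this.noiseFloor = rms(ambient);
        this.factor = factor;
    }

    /** Root-mean-square level of a block of 16-bit samples. */
    static double rms(short[] samples) {
        if (samples.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (short s : samples) {
            sum += (double) s * s;
        }
        return Math.sqrt(sum / samples.length);
    }

    /** True when the frame is louder than the calibrated floor times factor. */
    boolean isAboveNoise(short[] frame) {
        return rms(frame) > noiseFloor * factor;
    }
}
```

This only separates "louder than the room" from "about as loud as the room". A steady fan or A/C hum is handled well because it is part of the calibration sample, but a sudden non-speech sound will still open the gate.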

Single answered 26/8, 2013 at 21:43 Comment(2)

audacity is not an android library. – Pokeberry 27/8, 2013 at 4:42

Fair enough; are you saying that the solution you desire is constrained to android-only libraries, or are you considering porting in a library as part of the project? – Single 29/8, 2013 at 19:27

Have you considered using Microsoft's speech Recognition API? You can use a voice key utterance to begin recording, like how they say "computer" before asking the computer something in Star Trek. Use ISpRecognizer::CreateRecoContext to load your recognition grammar and start recognition. Then implement a check with ISpPhrase to see if you should begin recording or not.

Brunhild answered 30/8, 2013 at 16:56 Comment(1)

Could you explain how to install that on an Android device? – Godard 30/8, 2013 at 17:10

In the completely general case, this is an unsolved problem. In the practical sense...

First step is to get as noise-free a recording as possible. As others have noted, that starts with a directional microphone, focused as tightly as possible on the sound you want to keep.

Second step is filtering. As noted previously, the telephone company did a lot of work on which frequency ranges are actually needed by humans for speech comprehension. Filtering out frequencies outside that range will make the voice sound like... well, a telephone... but will get rid of more of the background noise.
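A minimal sketch of that filtering step, assuming the classic telephone band of roughly 300 to 3400 Hz (the answer itself does not give numbers): a single second-order band-pass section using the widely published RBJ "Audio EQ Cookbook" coefficients. The class name `BandPassFilter` and the `meanAbs` helper are made up for this example, and one biquad has gentle slopes, so real code would probably cascade several sections.

```java
/** Second-order band-pass filter (RBJ cookbook form, 0 dB peak gain). */
class BandPassFilter {
    private final double b0, b2, a1, a2; // b1 is always 0 for this filter type
    private double x1, x2, y1, y2;       // previous two inputs and outputs

    BandPassFilter(double sampleRate, double lowHz, double highHz) {
        double centerHz = Math.sqrt(lowHz * highHz); // geometric center of band
        double q = centerHz / (highHz - lowHz);      // Q derived from bandwidth
        double w0 = 2.0 * Math.PI * centerHz / sampleRate;
        double alpha = Math.sin(w0) / (2.0 * q);
        double a0 = 1.0 + alpha;
        // Normalize all coefficients by a0.
        b0 = alpha / a0;
        b2 = -alpha / a0;
        a1 = -2.0 * Math.cos(w0) / a0;
        a2 = (1.0 - alpha) / a0;
    }

    /** Filters one sample and updates the internal state. */
    double process(double x) {
        double y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1;
        x1 = x;
        y2 = y1;
        y1 = y;
        return y;
    }

    /** Average absolute amplitude of the filtered block. */
    double meanAbs(short[] samples) {
        if (samples.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (short s : samples) {
            sum += Math.abs(process(s));
        }
        return sum / samples.length;
    }
}
```

With something like `new BandPassFilter(8000, 300, 3400)`, the value from `meanAbs` could replace the raw average amplitude in the question's threshold test, so that low-frequency hum contributes much less to the trigger level.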

If you want to go beyond that, things can get really complicated. There are some algorithms which, if you can show them a sample of what you consider noise on that particular recording, will analyse it and try to subtract it out without damaging the sound you want to keep too much. This is not simple programming; if I were you I'd seriously consider buying it from someone who has already gotten it right rather than trying to reinvent/reimplement it. I don't know whether any of them are available for Android or whether the typical Android box has enough computing power to execute them in anything like realtime. (I've used SoundSoap in the studio to remove A/C noise, and it works very well.)

In fact, my own inclination would be to simplify the problem to a solved one: use the most directional and closest mike I could get, let Android do the recording... but then do the signal processing to clean it up later, using off-the-shelf tools. But I admit I'm biased because I have already invested in the latter.

Kaddish answered 1/9, 2013 at 16:56 Comment(0)

I tried to solve a similar problem on Windows. One thing I learned fast -- simple frequency analysis with a fast Fourier transform is not enough. Lots of noises hit human frequencies -- from simple taps on the microphone to clapping hands. Even some level of sophisticated filtering won't do it. I've found the easiest way is to take the noise to a cloud API and ask it to transcribe the speech. If the cloud API can transcribe to a reasonable length string, then I can continue recording -- else, stop recording. This does require that you sample some noise and send it to a cloud provider.

Pawl answered 17/9, 2015 at 23:2 Comment(0)

For best results you can use Silero VAD; find more in the repository.

    val vad = Vad.builder()
        .setContext(applicationContext)
        .setSampleRate(SampleRate.SAMPLE_RATE_8K)
        .setFrameSize(FrameSize.FRAME_SIZE_256)
        .setMode(Mode.NORMAL)
        .setSilenceDurationMs(300)
        .setSpeechDurationMs(50)
        .build()

    val isSpeech = vad.isSpeech(audioData)

    vad.setContinuousSpeechListener(audioData, object : VadListener {
        override fun onSpeechDetected() {
            // Speech detected!
        }

        override fun onNoiseDetected() {
            // Noise detected!
        }
    })

    vad.close()

Tontine answered 2/11, 2023 at 14:0 Comment(0)

Most of the other answers have misunderstood the question, and their replies solve problems different from yours.

You should parse the audio in your buffer, searching for frequencies in the human voice range. As soon as you detect them, it means someone has started talking, and you can start recording (don't forget to include the buffer too, as it contains the first part of the speech).

Search for routines that print the list of frequencies in a raw audio stream.
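One possible routine of this kind is an autocorrelation pitch estimate, sketched below. This is a swapped-in technique, not something the answer names: it looks for a repeating period whose frequency lies between `minHz` and `maxHz` (for example the 85 to 255 Hz range quoted in the comments on the question). `PitchDetector`, `estimatePitch` and the 0.5 periodicity threshold are assumptions for illustration.

```java
/** Hypothetical autocorrelation-based pitch estimator for one audio frame. */
class PitchDetector {

    /**
     * Returns the estimated fundamental frequency in Hz, or -1 if the frame
     * shows no clear periodicity in the requested range.
     */
    static double estimatePitch(short[] frame, double sampleRate,
                                double minHz, double maxHz) {
        int minLag = Math.max(1, (int) (sampleRate / maxHz));
        int maxLag = Math.min(frame.length - 1, (int) (sampleRate / minHz));

        double energy = 0.0;
        for (short s : frame) {
            energy += (double) s * s;
        }
        if (energy == 0.0) {
            return -1.0; // silence
        }

        int bestLag = -1;
        double best = 0.0;
        for (int lag = minLag; lag <= maxLag; lag++) {
            // Correlate the frame with a copy of itself shifted by 'lag'.
            double sum = 0.0;
            for (int i = 0; i + lag < frame.length; i++) {
                sum += (double) frame[i] * frame[i + lag];
            }
            double normalized = sum / energy;
            if (normalized > best) {
                best = normalized;
                bestLag = lag;
            }
        }
        // 0.5 is an arbitrary threshold; tune it on real recordings.
        return (bestLag > 0 && best > 0.5) ? sampleRate / bestLag : -1.0;
    }
}
```

A result other than -1 suggests a periodic sound in the voice pitch range. Periodic machine noise, or a higher tone whose period divides evenly into a lag in that range, can produce the same result, so it is worth combining this with a level check before starting a recording.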

Uzzia answered 1/9, 2013 at 20:46 Comment(0)
