python webrtc voice activity detection is wrong

I need to do voice activity detection as a step to classify audio files.

Basically, I need to know with certainty if a given audio has spoken language.

I am using py-webrtcvad, which I found on GitHub and which is sparsely documented:

https://github.com/wiseman/py-webrtcvad

The problem is that when I try it on my own audio files, it works fine on the ones that contain speech, but it keeps yielding false positives on other types of audio (such as music or birdsong), even with aggressiveness set to 3.

The audio files are sampled at 8000 Hz.

The only thing I changed in the example source code was the way arguments are passed to the main function (I no longer read them from sys.argv).

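import webrtcvad

# read_wave, frame_generator, vad_collector and write_wave are the helper
# functions from example.py in the py-webrtcvad repository (not shown here).

def main(file, agresividad):
    audio, sample_rate = read_wave(file)
    vad = webrtcvad.Vad(int(agresividad))
    # Split the audio into 30 ms frames
    frames = list(frame_generator(30, audio, sample_rate))
    # Collect voiced segments using a 300 ms padding window
    segments = vad_collector(sample_rate, 30, 300, vad, frames)
    for i, segment in enumerate(segments):
        path = 'chunk-%02d.wav' % (i,)
        print(' Writing %s' % (path,))
        write_wave(path, segment, sample_rate)

if __name__ == '__main__':
    file = 'myfilename.wav'
    agresividad = 3  # aggressiveness
    main(file, agresividad)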
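
The helper functions are not shown in the question, so here is a rough sketch of what the framing step amounts to, assuming 16-bit mono PCM (which is what webrtcvad expects). The name `split_frames` and its parameters are hypothetical and are not part of py-webrtcvad; the point is only that a 30 ms frame at 8000 Hz is 240 samples, i.e. 480 bytes.

```python
def split_frames(pcm_bytes, sample_rate, frame_ms=30, sample_width=2):
    """Yield fixed-size frames of raw PCM audio.

    Assumes mono audio with `sample_width` bytes per sample.
    A trailing partial frame is dropped, since the VAD only
    accepts frames of exactly 10, 20 or 30 ms.
    """
    samples_per_frame = int(sample_rate * frame_ms / 1000)
    frame_bytes = samples_per_frame * sample_width
    for start in range(0, len(pcm_bytes) - frame_bytes + 1, frame_bytes):
        yield pcm_bytes[start:start + frame_bytes]


if __name__ == "__main__":
    one_second_of_silence = b"\x00" * (8000 * 2)  # 8000 Hz, 16-bit mono
    frames = list(split_frames(one_second_of_silence, 8000))
    print(len(frames), "frames of", len(frames[0]), "bytes")
```

Each frame produced this way could be passed to `vad.is_speech(frame, sample_rate)`.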
Fremont answered 22/7, 2018 at 8:3 Comment(2)
Any luck? I'm having the same problem. It detects music or even typing as voice. - Springlet
Just wondering if you had reached any retrospective insights about this. It might just be inherent to what types of non-speech noise webrtc is trained for, wouldn't it? - Mahaffey

I'm seeing the same thing. I'm afraid that's just the extent to which it works. Speech detection is a difficult task, and webrtcvad wants to be light on resources, so there's only so much it can do. If you need more accuracy, you would need different packages or methods, which will necessarily take more computing power.

On aggressiveness, you're right that even on 3 there are still a lot of false positives. I'm also seeing false negatives, however, so one trick I'm using is running three instances of the detector, one for each aggressiveness setting. Then, instead of classifying a frame as 0 or 1, I give it the value of the highest aggressiveness that still said it was speech. In other words, each frame now has a score of 0 to 3, with 0 meaning even the least strict detector said it wasn't speech and 3 meaning even the strictest setting said it was. I get a little more resolution that way, and even with the false positives it is good enough for me.
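
A minimal sketch of that scoring trick, under some assumptions: the function names `speech_score` and `build_detectors` are made up for illustration, and I'm assuming the three detectors use modes 1, 2 and 3 (the answer above doesn't say which three are used). The only webrtcvad calls relied on are `webrtcvad.Vad(mode)` and `is_speech(frame, sample_rate)`.

```python
def speech_score(frame, sample_rate, detectors):
    """Return the highest aggressiveness level whose detector still
    classifies `frame` as speech, or 0 if none of them does.

    `detectors` maps an aggressiveness level (int) to an object with an
    is_speech(frame, sample_rate) method, e.g. a webrtcvad.Vad instance.
    """
    score = 0
    for level, vad in detectors.items():
        if vad.is_speech(frame, sample_rate):
            score = max(score, level)
    return score


def build_detectors(levels=(1, 2, 3)):
    """Create one webrtcvad detector per level (levels are an assumption)."""
    import webrtcvad  # third-party: pip install webrtcvad
    return {level: webrtcvad.Vad(level) for level in levels}
```

Usage would look something like `detectors = build_detectors()` followed by `speech_score(frame, 8000, detectors)` for each 10, 20 or 30 ms frame of 16-bit mono PCM. Averaging the per-frame scores over a file is one possible way to get a file-level number, though any threshold would have to be tuned on your own data.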

Villalobos answered 16/6, 2020 at 13:58 Comment(0)

The WebRTC VAD is a very simple, real-time oriented model. It is not a good choice if false positives from things like music, birdsong or other voice-like sounds are an issue.

There are several other open-source VADs out there that are expected to do much better. Some examples are:

Alsatian answered 20/5, 2023 at 12:28 Comment(1)
Whisper also exposes a probability of speech per token. - Magnusson
