EnergyDetector
For Voice Activity Detection, I have been using the EnergyDetector program of the MISTRAL (was LIA_RAL) speaker recognition toolkit, based on the ALIZE library.
It works with feature files, not with audio files, so you'll need to extract the energy of the signal. I usually extract cepstral features (MFCC) with the log-energy parameter, and I use this parameter for VAD. You can use sfbcep`, an utility part of the SPro signal processing toolkit in the following way:
sfbcep -F PCM16 -p 19 -e -D -A input.wav output.prm
It will extract 19 MFCC + log-energy coefficient + first and second order delta coefficients. The energy coefficient is the 19th, you will specify that in the EnergyDetector configuration file.
You will then run EnergyDetector in this way:
EnergyDetector --config cfg/EnergyDetector.cfg --inputFeatureFilename output
If you use the configuration file that you find at the end of the answer, you need to put output.prm
in prm/
, and you'll find the segmentation in lbl/
.
As a reference, I attach my EnergyDetector configuration file:
*** EnergyDetector Config File
***
loadFeatureFileExtension .prm
minLLK -200
maxLLK 1000
bigEndian false
loadFeatureFileFormat SPRO4
saveFeatureFileFormat SPRO4
saveFeatureFileSPro3DataKind FBCEPSTRA
featureServerBufferSize ALL_FEATURES
featureServerMemAlloc 50000000
featureFilesPath prm/
mixtureFilesPath gmm/
lstPath lst/
labelOutputFrames speech
labelSelectedFrames all
addDefaultLabel true
defaultLabel all
saveLabelFileExtension .lbl
labelFilesPath lbl/
frameLength 0.01
segmentalMode file
nbTrainIt 8
varianceFlooring 0.0001
varianceCeiling 1.5
alpha 0.25
mixtureDistribCount 3
featureServerMask 19
vectSize 1
baggedFrameProbabilityInit 0.1
thresholdMode weight
CMU Sphinx
The CMU Sphinx speech recognition software contains a built-in VAD. It is written in C, and you might be able to hack it to produce a label file for you.
A very recent addition is the GStreamer support. This means that you can use its VAD in a GStreamer media pipeline. See Using PocketSphinx with GStreamer and Python -> The 'vader' element
Other VADs
I have also been using a modified version of the AMR1 Codec that outputs a file with speech/non speech classification, but I cannot find its sources online, sorry.