Non-Speech Noise or Sound Recognition Software?

I'm working on some software for children, and looking to add the ability for the software to respond to a number of non-speech sounds. For instance, clapping, barking, whistling, fart noises, etc.

I've used CMU Sphinx and the Windows Speech API in the past. However, as far as I can tell, neither of them has any support for non-speech noises, and in fact I believe both actively filter them out.

In general I'm looking for "How do I get this functionality" but I suspect it may help if I break it down into three questions that are my guesses for what to search for next:

  1. Is there a way to use one of the main speech recognition engines to recognize non-word sounds by changing an acoustic model or pronunciation lexicon?
  2. (or) Is there already an existing library to do non-word noise recognition?
  3. (or) I have a bit of familiarity with Hidden Markov Models and the underlying tech of voice recognition from college, but no good estimate on how difficult it would be to create a very small noise/sound recognizer from scratch (suppose <20 noises to be recognized). If 1) and 2) fail, any estimation on how long it would take to roll my own?

Thanks

Alasdair answered 4/11, 2010 at 15:28 Comment(1)
My answer to the question "Real-time identification of non-speech, non-music sound from a continuous microphone stream" might be relevant. (Procarp)

Yes, you can use speech recognition software like CMU Sphinx to recognize non-speech sounds. To do so, you need to create your own acoustic and language models and define a lexicon restricted to your task. To train the acoustic model, you need enough training data in which the sounds of interest are annotated.

In short, the sequence of steps is the following:

First, prepare the resources for training: the phoneme set, the lexicon (dictionary), and so on. The process is described here: http://cmusphinx.sourceforge.net/wiki/tutorialam. In your case, though, you need to redefine the phoneme set and the lexicon. Namely, you should model fillers as real words (so no ++ around their names), and you don't need to define the full phoneme set. There are many possibilities, but probably the simplest one is to have a single model for all speech phonemes. Your lexicon will then look like:

CLAP CLAP
BARK BARK
WHISTLE WHISTLE
FART FART
SPEECH SPEECH
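
As a rough illustration, the lexicon above could be generated from a list of labels. This is only a sketch: the extra SIL phone for silence is an assumption based on the usual SphinxTrain setup, and the helper names are made up for this example, so check the tutorial for the exact files your version expects.

```python
# Labels taken from the lexicon above; each label doubles as its own "phone".
LABELS = ["CLAP", "BARK", "WHISTLE", "FART", "SPEECH"]


def build_dictionary(labels):
    """Return dictionary text: one 'WORD PHONE' line per sound, word == phone."""
    return "\n".join(f"{label} {label}" for label in labels) + "\n"


def build_phone_list(labels):
    """Return phone list text: every sound unit plus SIL (assumed needed for silence)."""
    return "\n".join(sorted(set(labels) | {"SIL"})) + "\n"


if __name__ == "__main__":
    # Print instead of writing files; redirect the output wherever your setup needs it.
    print(build_dictionary(LABELS), end="")
    print(build_phone_list(LABELS), end="")
```

Keeping the word and the phone identical means every sound is modelled by exactly one HMM, which matches the "single model per sound" idea described above.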

Second, prepare training data with labels. This is similar to VoxForge, but the text annotations must contain only labels from your lexicon. Of course, non-speech sounds must be labeled correctly as well. The open question is where to get a large enough amount of such data, but I guess it should be possible.
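
To make the labeling constraint concrete, here is a small sketch that builds transcription lines and rejects any label that is not in the lexicon. The `<s> ... </s> (file_id)` line layout follows what I remember from the CMU Sphinx training tutorial, so treat it as an assumption; the file IDs and the helper name are hypothetical.

```python
# Allowed labels, matching the lexicon above.
LEXICON = {"CLAP", "BARK", "WHISTLE", "FART", "SPEECH"}


def transcription_line(file_id, labels):
    """Build one transcription line, refusing labels outside the lexicon."""
    unknown = [label for label in labels if label not in LEXICON]
    if unknown:
        raise ValueError(f"labels not in lexicon: {unknown}")
    return f"<s> {' '.join(labels)} </s> ({file_id})"


if __name__ == "__main__":
    # Hypothetical clips: one with two claps, one with speech followed by a bark.
    print(transcription_line("clip_001", ["CLAP", "CLAP"]))
    print(transcription_line("clip_002", ["SPEECH", "BARK"]))
```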

With that, you can train your model. The task is simpler than speech recognition; for instance, you don't need triphones, just monophones.

Assuming equal prior probability of any sound/speech, the simplest language model can be a loop-like grammar (http://cmusphinx.sourceforge.net/wiki/tutoriallm):

#JSGF V1.0;
/**
 * JSGF grammar: a loop over the sound labels from the lexicon
 */
grammar foo;
public <foo> = (CLAP | BARK | WHISTLE | FART | SPEECH)+ ;
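
If the set of sounds changes often, the grammar can be generated from the label list rather than edited by hand. A minimal sketch, assuming the same loop structure as above; the function name is made up for this example:

```python
def build_loop_grammar(labels, name="foo"):
    """Return JSGF text that accepts one or more of the given labels in any order."""
    alternatives = " | ".join(labels)
    return (
        "#JSGF V1.0;\n"
        f"grammar {name};\n"
        f"public <{name}> = ({alternatives})+ ;\n"
    )


if __name__ == "__main__":
    print(build_loop_grammar(["CLAP", "BARK", "WHISTLE", "FART", "SPEECH"]), end="")
```

Because every alternative has the same weight, this corresponds to the equal-prior assumption mentioned above.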

This is the most basic approach to using an ASR toolkit for your task. It can be further improved by fine-tuning the HMM configuration, using statistical language models, and using finer-grained phoneme modeling (e.g. distinguishing vowels and consonants instead of having a single SPEECH model). How far to go depends on the nature of your training data.

Outside the framework of speech recognition, you can build a simple static classifier that will analyze the input data frame by frame. Convolutional neural networks that operate over spectrograms perform quite well for this task.
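
To show what "frame by frame" means in practice, here is a deliberately tiny sketch. It is not a convolutional network: it swaps in two crude per-frame features (log energy and zero-crossing rate) and a nearest-centroid rule, and it uses synthetic signals (a sine tone for a whistle, a noise burst as a stand-in for a clap) instead of real recordings. The sample rate, frame sizes, and all names are assumptions for illustration only.

```python
import math
import random

RATE = 16000  # assumed sample rate in Hz


def split_frames(samples, size=256, hop=128):
    """Cut the signal into overlapping fixed-size frames."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]


def frame_features(frame):
    """Two crude per-frame features: log energy and zero-crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (math.log(energy + 1e-12), crossings / (len(frame) - 1))


def centroid(samples):
    """Average feature vector over all frames of one training clip."""
    feats = [frame_features(f) for f in split_frames(samples)]
    return tuple(sum(col) / len(feats) for col in zip(*feats))


def classify_frames(samples, centroids):
    """Label each frame with the class whose centroid is nearest."""
    return [
        min(centroids, key=lambda label: math.dist(frame_features(f), centroids[label]))
        for f in split_frames(samples)
    ]


def tone(freq, n=2048):
    """Synthetic whistle: a pure sine tone."""
    return [0.5 * math.sin(2 * math.pi * freq * t / RATE) for t in range(n)]


def noise(n=2048, seed=0):
    """Synthetic broadband burst, a rough stand-in for a clap."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]


# "Training": one centroid per class from a single synthetic clip each.
centroids = {"WHISTLE": centroid(tone(2000)), "CLAP": centroid(noise())}

if __name__ == "__main__":
    # Classify an unseen tone at a different pitch, one label per frame.
    print(set(classify_frames(tone(2200), centroids)))
```

A real system would replace the two features with spectrogram frames and the centroid rule with a trained model, but the framing loop and the per-frame decision stay the same.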

Batting answered 28/12, 2016 at 10:41 Comment(0)

I don't know of any existing libraries you can use; I suspect you may have to roll your own.

Would this paper be of interest? It has some technical detail, and the authors seem to be able to recognise claps and differentiate them from whistles.

Ledesma answered 4/11, 2010 at 15:43 Comment(0)
