Real-time identification of non-speech, non-music sound from a continuous microphone stream

I'm looking to log events corresponding to a specific sound, such as a car door slamming, or perhaps a toaster ejecting toast.

The system needs to be more sophisticated than a "loud noise detector"; it needs to be able to distinguish that specific sound from other loud noises.

The identification need not be zero-latency, but the processor needs to keep up with a continuous stream of incoming data from a microphone that is always on.

  • Is this task significantly different from speech recognition, or could I make use of speech recognition libraries/toolkits to identify these non-speech sounds?
  • Given the requirement that I only need to match one sound (as opposed to matching among a library of sounds), are there any special optimizations I can do?

This answer indicates that a matched filter would be appropriate, but I am hazy on the details. I don't believe a simple cross-correlation on the audio waveform data between a sample of the target sound and the microphone stream would be effective, due to variations in the target sound.
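
For concreteness, the naive matched filter I'm skeptical of would look something like this (a rough sketch; the scipy-based approach, the names, and the threshold are my own, not from the linked answer):

    import numpy as np
    from scipy.signal import correlate

    def naive_matched_filter(stream, template, threshold):
        """Slide a reference recording over the stream; flag correlation peaks.

        This is exactly the approach I doubt will work: any variation in
        pitch, duration, or timbre of the target sound weakens the peak.
        """
        # Normalize the template so peak heights are comparable across runs.
        t = template - template.mean()
        t /= np.linalg.norm(t) + 1e-12
        corr = correlate(stream, t, mode="valid")
        return np.flatnonzero(corr > threshold)  # candidate match offsets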

My question is also similar to this, which didn't get much attention.

Irrepealable answered 27/11, 2011 at 9:54
Comment: You might have better luck over on dsp.stackexchange.com. – Furze

I found an interesting paper on the subject.

The paper is about vehicle sounds, but the approach should work just as well for your application, if not better.

When analyzing the training data, it...

  1. Takes samples of 200 ms
  2. Does a Fourier Transform (FFT) on each sample
  3. Does a Principal Component Analysis on the frequency vectors

    • Calculates the mean of all samples of this class
    • Subtracts the mean from each sample
    • Calculates the eigenvectors of the covariance matrix (the mean of the outer products of each mean-subtracted vector with itself)
    • Stores the mean and the most significant eigenvectors

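Sketched in Python/numpy, that training procedure might look like this (names such as train_class_model are illustrative; the paper only describes the math, not code):

    import numpy as np

    def train_class_model(clips, sample_rate, n_components=8):
        """Learn the mean spectrum and top eigenvectors for one sound class."""
        frame_len = int(0.2 * sample_rate)               # 200 ms windows
        spectra = []
        for clip in clips:                               # mono float arrays
            for start in range(0, len(clip) - frame_len + 1, frame_len):
                frame = clip[start:start + frame_len]
                spectra.append(np.abs(np.fft.rfft(frame)))  # magnitude FFT
        X = np.asarray(spectra)

        mean = X.mean(axis=0)                            # class mean spectrum
        centered = X - mean
        # Covariance = mean outer product of the mean-subtracted vectors.
        cov = centered.T @ centered / len(centered)
        eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
        order = np.argsort(eigvals)[::-1][:n_components]
        return mean, eigvecs[:, order]                   # columns = eigenvectors
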
Then to classify a sound, it...

  1. Takes a 200 ms sample (S).
  2. Does a Fourier Transform on the sample, giving the frequency vector F.
  3. Subtracts the mean of the class (C) from F.
  4. Takes the dot product of the mean-subtracted vector with each eigenvector of C, giving one coefficient per eigenvector.
  5. Subtracts the product of each coefficient and its eigenvector from F, removing the projection onto C's subspace.
  6. Takes the length of the resulting residual vector.
  7. If this length is below some constant, S is recognized as belonging to the class C.
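
And the classification step, continuing the sketch above (the threshold has to be tuned on real data; the paper doesn't give a value):

    def classify_frame(frame, mean, eigvecs, threshold):
        """Steps 2-7: threshold the distance from the class subspace."""
        f = np.abs(np.fft.rfft(frame)) - mean        # FFT, subtract class mean
        coeffs = eigvecs.T @ f                       # one number per eigenvector
        residual = f - eigvecs @ coeffs              # remove the projection
        return np.linalg.norm(residual) < threshold  # small residual => match
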
Bagworm answered 27/11, 2011 at 12:50

This doctoral thesis, Non-Speech Environmental Sound Classification System for Autonomous Surveillance, by Cowling (2004), has experimental results on different techniques for audio feature extraction as well as classification. He used environmental sounds such as jangling keys and footsteps, and achieved an accuracy of 70%:

The best technique is found to be either Continuous Wavelet Transform feature extraction with Dynamic Time Warping or Mel-Frequency Cepstral Coefficients with Dynamic Time Warping. Both of these techniques achieve a 70% recognition rate.
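
To make that concrete, an MFCC + DTW comparison might be sketched like this (using librosa, which is my choice, not the thesis's; "template.wav" and max_cost are placeholders you'd replace and tune):

    import librosa

    # Reference recording of the single target sound.
    template, sr = librosa.load("template.wav", sr=None, mono=True)
    template_mfcc = librosa.feature.mfcc(y=template, sr=sr, n_mfcc=13)

    def matches_template(candidate, max_cost):
        """Compare a candidate clip against the template via DTW over MFCCs."""
        cand_mfcc = librosa.feature.mfcc(y=candidate, sr=sr, n_mfcc=13)
        # DTW tolerates differences in duration/tempo between the two clips;
        # D[-1, -1] is the accumulated alignment cost.
        D, _ = librosa.sequence.dtw(X=template_mfcc, Y=cand_mfcc)
        return D[-1, -1] < max_cost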

If you limit yourself to one sound, you might be able to achieve a higher recognition rate.

The author also mentions that techniques that work fairly well with speech recognition (learning vector quantization and neural networks) don't work so well with environmental sounds.

I have also found a more recent article, Detecting Audio Events for Semantic Video Search, by Bugalho et al. (2009), in which they detect sound events in movies (gunshots, explosions, etc.).

I have no experience in this area; I merely stumbled upon this material because your question piqued my interest. I'm posting my findings here in the hope that they help with your research.

Bagwig answered 27/11, 2011 at 11:41
Comment: @AJMansfield Found alternate links to the articles. – Bagwig