Writing software to tell where sound comes from (directional listening) [closed]

I've been curious about this for some time so I thought maybe posting here I could get some good answers.

What I know so far:

Humans can use their two ears to determine not only what sounds "sound like" but also where they are coming from. Pitch is the note we hear, and something like the human voice has various pitches overlaid (not a pure tone).

What I'd like to know:

How do I go about writing a program that can tell where a sound is coming from? From a theoretical standpoint I'd need two microphones; I would record the sound arriving at each one and store the audio so that a split second of data can be put into a tuple like [streamA, streamB].

I feel like there might be a formulaic / mathematical way to calculate based on the audio where a sound comes from. I also feel like it's possible to take the stream data and train a learner (give it sample audio and tell it where the audio came from) and have it classify incoming audio in that way.

What's the best way to go about doing this / are there good resources from which I can learn more about the subject?

EDIT:

Example:

          front

left (mic) x ======== x (mic) right

          back

                            x (sound source should return "back" or "right" or "back right")

I want to write a program that can return front/back and left/right for most of the sound it is hearing. From what I understand it should be simple to set up two microphones pointed "forward." Based on that I'm trying to figure out a way to triangulate the sound and know where the source is in relation to the mics.

Buxom answered 29/12, 2011 at 1:3 Comment(8)
I'm guessing you want to do a discrete cross-correlation between the two channels.Dissoluble
@HotLicks: That doesn't tell you very much. Knowing the relative delay between left and right mic only narrows the location down to the surface of an ellipsoid.Triumvir
BBN makes millions of dollars selling a system that does this. They're not telling how, or if they are they've patented it.Seidler
Hm it doesn't seem absurdly difficult, though. If anything, I feel like we could train a machine learner / classifier to do this rather than writing an algo. I'm just not sure what kind of ML algo I should be investigating, or where I should be looking to find more about this subject. Certainly there must be a mathematical relationship between two separate streams of sound separated by physical distance x that gives us a direction from a given "forward" position.Buxom
@OliCharlesworth -- Which is as much information as you're going to get from the sound, unless you can somehow extract an echo (which is vaguely possible with an auto-correlation).Dissoluble
@HotLicks & Oli -- can someone simplify what you're saying? I'm not sure I follow.Buxom
If we need to simplify it then you're in over your head already.Dissoluble
Do you have the option of incorporating a dummy head into your system? Good localisation results have been achieved with this approach, including front/rear estimates and in some cases elevation. See sdac.kaist.ac.kr/upload/paper/ICCAS_2007_Hwang.pdf and jp.honda-ri.com/upload/document/entry/20110911/…. One author claims good results with a single microphone and "artificial pinna" ai.stanford.edu/~asaxena/monaural/monaural.pdf. Without this, the problem is very tricky due to the relatively flat directional frequency response of standard microphones.Gaseous

If you look into research papers on multi-phase microphone arrays, specifically those used for underwater direction finding (i.e., a big area of submarine research during the Cold War: where is the motor sound coming from, so we can aim the torpedoes?), then you'll find the technology and math required to find the location of a sound given two or more microphone inputs.

It's non-trivial, and not something that could be discussed so broadly here, though, so you aren't going to find an easy code snippet and/or library to do what you need.

The main issue is eliminating echoes and shadows. A simplistic method would be to start with a single tone, filter out everything but that tone, and then measure the phase difference of that tone between the two microphones. The phase difference will give you a lot of information about the location of the tone.

You can then choose whether you want to deal with echoes and multipath issues (many of which can be eliminated by removing all but the strongest tone) or move on to correlating sounds that consist of something other than a single tone - a person talking, or glass breaking, for instance. Start small and easy, and expand from there.
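
For illustration, here is a minimal sketch of that single-tone phase idea in Python/numpy. The function name, the parameters and the 343 m/s speed of sound are assumptions made for this example, not something from the answer above; it also ignores the phase-wrapping ambiguity you get once the mic spacing exceeds half a wavelength.

import numpy as np

def tone_phase_direction(left, right, tone_hz, fs, mic_distance_m, c=343.0):
    """Estimate the bearing (degrees from broadside) of a single tone from two mics."""
    n = len(left)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    k = np.argmin(np.abs(freqs - tone_hz))                     # FFT bin closest to the tone
    phase_left = np.angle(np.fft.rfft(left)[k])
    phase_right = np.angle(np.fft.rfft(right)[k])
    dphi = np.angle(np.exp(1j * (phase_left - phase_right)))   # wrap difference to [-pi, pi]
    delay = dphi / (2 * np.pi * tone_hz)                       # phase difference -> time delay
    sin_theta = np.clip(delay * c / mic_distance_m, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))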

Gooseherd answered 29/12, 2011 at 17:37 Comment(0)

I was looking up something similar and wrote a dumb answer here that got deleted. I had some ideas but didn't really write them up properly. The deletion bruised my internet ego enough that I decided to actually try the problem, and I think it worked!

Actually doing a real localisation à la Adam Davis' answer is very difficult, but doing a human-style localisation (locking onto the first arrival and ignoring echoes, or treating them as separate sources) is not too bad, I think, though I'm not a signal processing expert by any means.

I read this and this, which made me realise that the problem is really one of finding the time shift (cross-correlation) between two signals. From there you can calculate the angle using the speed of sound. Note that you'll get two solutions (front and back are ambiguous).

The key information I read was in this answer and others on the same page, which talk about how to do fast Fourier transforms in scipy to find the cross-correlation curve.

Basically, you need to import the wave file into python. See this.

If your wave file (input) is a tuple with two numpy arrays (left, right), each zero-padded to at least twice its own length (to stop the correlation wrapping around circularly), the code follows from Gustavo's answer. I think you need to recognise that FFTs assume time-invariance, which means that if you want any kind of time-based tracking of signals you need to 'bite off' small chunks of data.

I put the following code together from the sources mentioned. For each chunk it estimates the time delay, in frames, between left and right (negative/positive), divides by the sample rate to get the delay in seconds, and converts that to an angle, which is what gets plotted. If you want to know what the angle is you need to:

  • assume everything is on a plane (no height factor)
  • forget the difference between sounds in front and those behind (you can't differentiate them)

You would also want to use the distance between the two microphones to make sure you aren't picking up echoes (time delays greater than the delay at 90 degrees, i.e. greater than the mic spacing divided by the speed of sound); a rough version of that check is sketched below.
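
As a rough illustration of that check (the 0.2 m spacing is just an assumed number):

mic_spacing_m = 0.2                            # assumed distance between the two microphones
speed_of_sound = 340.0                         # m/s, the value used in the code below
max_delay_s = mic_spacing_m / speed_of_sound   # about 0.59 ms for 20 cm spacing
# any measured delay larger than max_delay_s cannot come from a direct path,
# so treat it as an echo or a spurious correlation peak and discard it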

I realise that I've borrowed a lot here, so thanks to all of those who inadvertently contributed!

import wave
import struct
from numpy import array, concatenate, argmax
from numpy import abs as nabs
from scipy.signal import fftconvolve
from matplotlib.pyplot import plot, show
from math import asin, degrees

def crossco(wav):
    """Returns the cross-correlation function of the left and right audio. It
    uses a convolution of left with right reversed, which is equivalent to a
    cross-correlation.
    """
    cor = nabs(fftconvolve(wav[0], wav[1][::-1]))
    return cor

def trackTD(fname, width, chunksize=5000):
    track = []
    # open the wave file using python's built-in wave library
    wav = wave.open(fname, 'r')
    # get the info from the file (this assumes 16-bit samples, i.e. sampwidth == 2)
    (nchannels, sampwidth, framerate, nframes, comptype, compname) = wav.getparams()

    # only loop while there are enough whole chunks left in the wave
    while wav.tell() < nframes - chunksize:

        # read chunksize audio frames as a sequence of bytes
        frames = wav.readframes(chunksize)

        # unpack that byte sequence into a flat tuple of 16-bit samples
        out = struct.unpack_from("%dh" % (chunksize * nchannels), frames)

        # convert the 2 channels to numpy arrays
        if nchannels == 2:
            # the left channel is the 0th and even-numbered elements
            left = array(out[0::2])
            # the right channel is the odd-numbered elements
            right = array(out[1::2])
        else:
            left = array(out)
            right = left

        # zero-pad each channel with as many zeroes as it has samples
        # (to stop the correlation wrapping around circularly)
        left = concatenate((left, [0] * chunksize))
        right = concatenate((right, [0] * chunksize))

        chunk = (left, right)

        # if the volume is very low (800 or less), assume 0 degrees
        if max(nabs(left)) < 800:
            a = 0.0
        else:
            # otherwise compute how many frames of delay there are in this chunk
            # (zero lag sits at index len(right) - 1 of the correlation)
            lag = argmax(crossco(chunk)) - (len(right) - 1)
            # convert the lag to a time delay in seconds
            t = lag / float(framerate)
            # get the angle assuming v = 340 m/s and sin(a) = (t * v) / width,
            # clamped so rounding errors can't push it outside asin's domain
            sina = max(-1.0, min(1.0, t * 340.0 / width))
            a = degrees(asin(sina))

        # add this chunk's angle estimate to the list
        track.append(a)

    # plot the list
    plot(track)
    show()
I tried this out using some stereo audio I found at equilogy. I used the car example (stereo file). It produced this.

To do this on-the-fly, I guess you'd need to have an incoming stereo source that you could 'listen to' for a short time (I used 1000 frames = 0.0208s) and then calculate and repeat.

[edit: found you can easily use scipy's fftconvolve function, applied to the time-reversed series of one of the two signals, to get the correlation]

Renaerenaissance answered 20/11, 2013 at 6:36 Comment(0)

This is an interesting problem. I don't know of any reference material for this, but I do have some experience in audio software and signal processing that may help point you in the right direction.

Determining sound source direction (where the sound is coming from around you) is fairly simple. Get 6 directional microphones and point them up, down, front, back, left, and right. By looking at the relative amplitudes of the mic signals in response to a sound, you could pretty easily determine which direction a particular sound is coming from. Increase the number of microphones for increased resolution.
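
A hypothetical sketch of that amplitude-comparison idea (the function, the dict layout and the direction names are made up for this example):

import numpy as np

def loudest_direction(mic_signals):
    """mic_signals: dict mapping a direction name to a numpy array of samples
    from the directional mic pointed that way; returns the name of the
    direction whose mic picked up the most energy."""
    rms = {name: np.sqrt(np.mean(np.square(sig.astype(float))))
           for name, sig in mic_signals.items()}
    return max(rms, key=rms.get)

# e.g. loudest_direction({'up': u, 'down': d, 'front': f, 'back': b, 'left': l, 'right': r})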

Two microphones would only tell you whether a sound is coming from the right or the left. The reason your two ears can figure out whether a sound is coming from in front of you or behind you is that the outer structure of your ear modifies the sound depending on the direction, which your brain interprets and corrects for.

Aruabea answered 29/12, 2011 at 1:47 Comment(3)
When you lose hearing in one ear, you lose your ability to tell direction - the outer structure of the ear helps, but both ears are required - hearinglosshelp.com/weblog/… . Your brain performs pretty complex correlation and timing between the two ears in order to determine direction.Gooseherd
This answer is a bit misleading. With use of appropriate binaural techniques, 2 microphones can give you a location estimate in 2D plane (azimuth), not just on a line (left-right). See sdac.kaist.ac.kr/upload/paper/ICCAS_2007_Hwang.pdf and other papers. More recently it has been shown that an elevation estimate can also be achieved jp.honda-ri.com/upload/document/entry/20110911/…. Accuracy can be improved if some assumptions can be made about the source, e.g. ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5443663.Gaseous
It may be a little limited in scope, but I wouldn't call it misleading. While your second link doesn't work, and your 3rd link requires a membership, the first link discusses experiments involving mimicking the structure of the outer ear, which I discussed in the final part of the answer. However, OP was asking primarily about a method for 'formulaic / mathematical way to calculate' position. Emulating the behavior of the outer ear would require specialized hardware.Aruabea

Cross-correlation is the main method, but there are practical details. There are various approaches that help detect the source efficiently with a microphone array; some work without calibration, while others require calibration to adapt to the room geometry.

You can try existing open-source software for the source-localization task:

Manyears robot sound source separation and localization https://sourceforge.net/projects/manyears/

HARK toolkit for robotics applications http://www.ros.org/wiki/hark

Hoarsen answered 29/12, 2011 at 17:31 Comment(0)
