How to find an offset between two audio files? One is noisy and one is clear

I have a scenario in which the user captures a concert scene with the real-time audio of the performer, and at the same time the device downloads the live stream from the audio broadcaster's device. Later I replace the real-time noisy audio (captured while recording) with the one I have streamed and saved on my phone (good-quality audio). Right now I am setting the audio offset manually on a trial-and-error basis while merging, so that I can sync the audio and video activity at the exact position.

Now what I want to do is automate the audio synchronisation. Instead of merging the video with the clear audio at a manually given offset, I want to merge the video with the clear audio automatically, with proper sync.

For that I need to find the offset at which I should replace the noisy audio with the clear audio. For example, when the user starts and stops the recording, I will take that sample of real-time audio, compare it with the live-streamed audio, take the exact matching part of that audio, and sync it at the right time.

Does anyone have any idea how to find the offset by comparing the two audio files and syncing with the video?

Diddle answered 30/12, 2016 at 6:48 Comment(2)
Code example? This question does not appear to be about programming within the scope defined in the help center. – Fado
I made syncstart to sync two recordings using an FFT-based correlation of the start. – Fallacious

Here's a concise, clear answer.

• It's not easy - it will involve signal processing and math.
• A quick Google gives me this solution, code included.
• There is more info on the above technique here.
• I'd suggest gaining at least a basic understanding before you try and port this to iOS.
• I would suggest you use the Accelerate framework on iOS for fast Fourier transforms etc. (a rough sketch of the correlation idea follows after this list)
• I don't agree with the other answer about doing it on a server - devices are plenty powerful these days. A user wouldn't mind a few seconds of processing for something seemingly magic to happen.
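
As a rough illustration of the FFT-based correlation idea mentioned above (not the linked solution itself), here is a minimal Python/NumPy sketch; the function name, the mono float inputs, and the sample rate are assumptions, and on iOS the equivalent steps would be built on Accelerate's FFT routines:

```python
import numpy as np

def find_offset_seconds(noisy, clean, sample_rate):
    """Estimate where the clean audio lines up inside the noisy recording.

    Both inputs are 1-D float arrays (mono, same sample rate). A positive
    result means the clean audio starts that many seconds into the noisy
    phone recording.
    """
    n = len(noisy) + len(clean) - 1
    # Multiplying one zero-padded spectrum by the conjugate of the other and
    # inverse-transforming is equivalent to cross-correlating the two signals.
    xcorr = np.fft.irfft(np.fft.rfft(noisy, n) * np.conj(np.fft.rfft(clean, n)), n)
    # Reorder the circular result so lags run from -(len(clean)-1) to len(noisy)-1.
    xcorr = np.concatenate((xcorr[-(len(clean) - 1):], xcorr[:len(noisy)]))
    lag = int(np.argmax(xcorr)) - (len(clean) - 1)
    return lag / sample_rate
```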

Edit

As an aside, I think it's worth taking a step back for a second. While math and fancy signal processing like this can give great results, and do some pretty magical stuff, there can be outlying cases where the algorithm falls apart (hopefully not often).

What if, instead of getting complicated with signal processing, there's another way? After some thought, there might be. If you meet all the following conditions:

• You are in control of the server component (audio broadcaster device)
• The broadcaster is aware of the 'real audio' recording latency
• The broadcaster and receiver are communicating in a way that allows accurate time synchronisation

...then the task of calculating audio offset becomes reasonably trivial. You could use NTP or some other more accurate time synchronisation method so that there is a global point of reference for time. Then, it is as simple as calculating the difference between audio stream time codes, where the time codes are based on the global reference time.
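
For illustration, a minimal sketch of that time-code arithmetic, assuming both devices stamp events against the same NTP-synchronised clock; the function and parameter names are made up for this example and are not from the original answer:

```python
def clear_audio_offset_seconds(recording_start, stream_first_sample_time,
                               broadcast_latency=0.0):
    """Where to place the first streamed sample on the phone recording's timeline.

    recording_start:          when the phone started recording (shared clock, seconds)
    stream_first_sample_time: time code of the first sample of the clear stream
                              (same shared clock)
    broadcast_latency:        the broadcaster's known capture/encode latency;
                              its sign depends on how the broadcaster reports it
    """
    # A negative result means the stream began before the recording, so the
    # start of the stream should be trimmed instead.
    return (stream_first_sample_time - recording_start) + broadcast_latency
```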

Crutchfield answered 4/1, 2017 at 2:44 Comment(2)
Another problem with the scenario described by the OP will arise when the audio broadcaster "lies" and the stream does not match the real event 100%. In this case the output of the synchronization will be essentially undetermined for the video stream (e.g., an instrument was added to the stream which is not in the live concert; stranger things have happened multiple times...). – Calabar
@D.Kovács that's true (although the OP never says this will happen, so we don't really know the context). My gut feeling though is that the algorithm would still be fine: another instrument is no worse than noise or a bad recording, which apparently the algorithm can deal with. If you wanted the algorithm to handle pitch translation, there may be some work needed, but I think it is still within the realm of the possible. – Crutchfield

This could prove to be a difficult problem: even though the signals are of the same event, the presence of noise makes a comparison harder. You could consider running some post-processing to reduce the noise, but noise reduction in itself is an extensive, non-trivial topic.

Another problem could be that the signals captured by the two devices actually differ a lot. For example, the good-quality audio (I guess the output of the live mix console?) will be fairly different from the live version (which I guess is coming out of the on-stage monitors / FOH system and is captured by a phone mic).

Perhaps the simplest possible approach to start with would be to use cross-correlation to do the time-delay analysis.

A peak in the cross-correlation function would suggest the relative time delay (in samples) between the two signals, so you can apply the shift accordingly.
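
A small sketch of that idea in Python with SciPy, just to make the lag extraction concrete (the names are illustrative, and the signals are assumed to be mono arrays at the same sample rate):

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def estimate_delay_samples(noisy, clean):
    """Return the lag (in samples) at which the clean signal best aligns
    with the noisy one; a positive value means the clean audio starts that
    many samples into the noisy recording."""
    # Remove the DC offset so level differences between the two captures
    # influence the correlation less.
    noisy = noisy - np.mean(noisy)
    clean = clean - np.mean(clean)
    xcorr = correlate(noisy, clean, mode="full", method="fft")
    lags = correlation_lags(len(noisy), len(clean), mode="full")
    return int(lags[np.argmax(xcorr)])

# Usage (hypothetical variables):
# lag = estimate_delay_samples(noisy_signal, clean_signal)
# offset_seconds = lag / sample_rate
```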

Maupassant answered 5/1, 2017 at 14:12 Comment(0)

I don't know a lot about the subject, but I think you are looking for "audio fingerprinting". Similar question here.

An alternative (and more error-prone) way is running both recordings through a speech-to-text library (or an API) and matching the relevant parts. This would of course not be very reliable: phrases frequently repeat in songs, and the concert may be instrumental.

Also, doing audio processing on a mobile device may not play well (because of low performance, high battery drain, or both). I suggest you use a server if you go that way.

Good luck.

Varner answered 2/1, 2017 at 13:18 Comment(0)
