I think as well you may try to select a few second sample from both audio track, mnormalise them in amplitude and reduce noise with a band pass filter and after try to use a correlator.
for instance you may take a 5 second sample of one of the thwo and made it slide over the second one computing a cross corelation for any time you shift. (be carefull that if you take a too small pachet you may have high correlation when not expeced and you will soffer the side effect due to the croping of the signal and the crosscorrelation).
After yo can collect an array with al the results of the cross correlation and get the index of the maximun.
You should then set experimentally up threshould o decide when yo assume the pachet to b the same. this will change depending on the quality of the audio track you are comparing.
I implemented a correator to receive and distinguish preamble in wireless communication. My script is actually done in matlab. if you are interested i can try to find the common part and send it to you.
It would be a too long code to be pasted hene in the forum. if you want just let me know and i will send it to ya asap.
cheers