Convolutional Neural Network (CNN) for Audio [closed]

I have been following the tutorials on DeepLearning.net to learn how to implement a convolutional neural network that extracts features from images. The tutorials are well explained and easy to understand and follow.

I want to extend the same CNN to extract multi-modal features from videos (images + audio) at the same time.

I understand that video input is nothing but a sequence of images (pixel intensities) displayed over a period of time (e.g. 30 FPS) associated with audio. However, I don't really understand what audio is, how it works, or how it is broken down to be fed into the network.

I have read a couple of papers on the subject (multi-modal feature extraction/representation), but none have explained how audio is input to the network.

Moreover, I understand from my studies that multi-modal representation is the way our brains really work, as we don't deliberately filter out our senses to achieve understanding. It all happens simultaneously, without our awareness, through joint representation. A simple example: if we hear a lion roar, we instantly compose a mental image of a lion and feel danger, and vice versa. Multiple neural patterns fire in our brains to achieve a comprehensive understanding of what a lion looks like, sounds like, feels like, smells like, etc.

The above is my ultimate goal, but for the time being I'm breaking my problem down for the sake of simplicity.

I would really appreciate it if anyone could shed light on how audio is dissected and later represented in a convolutional neural network. I would also appreciate your thoughts on multi-modal synchronisation, joint representations, and the proper way to train a CNN with multi-modal data.

Fletcherfletcherism answered 18/3, 2014 at 5:28 Comment(9)
The moderators intrusively changed the content of this question, to the point where it even changed its original inquiry. The post is on audio and they have changed it to video. They have removed important keywords that could help people find this post, such as CNN. This is among some of my best-ranking questions. I can't imagine how this kind of moderation could affect the community and those with lower ranks. @user4157124Fletcherfletcherism
Just because it has been voted up does not make it a good question for this site. At present it is asking for opinions, so I cannot vote for it to be reopened.Chou
@RohitGupta What you consider a bad question could be considered good by a million other people. It's extremely disrespectful to change someone's original effort and still find the audacity to call it bad.Fletcherfletcherism
"The moderators intrusively changed the content of this question" - did they? Check the revision history for this post; it does not contain any moderation edits. Also, as you haven't shared any code so far, how should others check where your approach went wrong? Finally, nobody called your question "bad"Mistrustful
@NicoHaase Oh yes. The moderators edited the question title, removed audio, and added video. They also deleted almost 90% of the content. Moreover, when Rohit Gupta says it's not a good question, I'm assuming he means bad. Moderators can pretty much do whatever they like, although the community clearly finds this question super helpful. We are not robots; we are humans, and there will always be a human element in all questions asked and answered. Unless you want to get the best out of the community and build your next ChatGPT or something. Sad! Very sad.Fletcherfletcherism
Which moderator did that? The revision history shows that user4157124 edited the question, but that user is not a moderator. If you feel this edit caused more harm than good, roll back the edit. Also, Rohit stated this is not a "good question" (however we would define "good") for this site, not that this is a bad question per seMistrustful
Thank you @NicoHaase. The question is currently closed, but would you kindly assist me with how I can roll back the edits? I would very much appreciate your support.Fletcherfletcherism
There's a link to the revision history below the tags (currently named "edited Jun 22 at 19:30"). From there, you can perform a rollbackMistrustful
Thank you @NicoHaase I rolled it back. Much appreciated.Fletcherfletcherism

We used deep convolutional networks on spectrograms for a spoken language identification task. We had around 95% accuracy on a dataset provided in this TopCoder contest. The details are here.
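For anyone wondering what the spectrogram input actually looks like, here is a minimal sketch using only numpy/scipy. The sample rate, window length, and hop are typical illustrative defaults, not the settings from our contest entry:

```python
# Sketch: turning a raw waveform into a spectrogram "image" for a CNN.
# The synthetic sine wave stands in for a real recording.
import numpy as np
from scipy.signal import spectrogram

fs = 16000                            # sample rate in Hz (assumption)
t = np.arange(fs) / fs                # 1 second of audio
wave = np.sin(2 * np.pi * 440 * t)    # stand-in for a real recording

# Short-time Fourier analysis: 25 ms windows (400 samples), 10 ms hop
f, times, Sxx = spectrogram(wave, fs=fs, nperseg=400, noverlap=240)

# Log-compress the power so the dynamic range suits a network input
log_spec = np.log(Sxx + 1e-10)

print(log_spec.shape)   # (frequency_bins, time_frames): a 2-D "image"
```

The resulting 2-D array of log-power values is what the convolutional layers treat as a single-channel image.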

Plain convolutional networks do not capture temporal characteristics, so for example in this work the output of the convolutional network was fed into a time-delay neural network. But our experiments show that, even without additional elements, convolutional networks can perform well on at least some tasks when the inputs have similar sizes.
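One simple way to give a plain convolutional network inputs of similar size is to cut the spectrogram into fixed-length time windows. A sketch (the window length of 32 frames is arbitrary, and the function name is illustrative):

```python
# Sketch: slice a (freq, time) spectrogram into equal-length time windows
# so each chunk can be fed to a fixed-input-size CNN.
import numpy as np

def fixed_windows(spec, frames_per_window):
    """Split a (freq, time) array into (n, freq, frames_per_window) chunks,
    dropping any leftover frames at the end."""
    n = spec.shape[1] // frames_per_window
    trimmed = spec[:, :n * frames_per_window]
    # reshape the time axis into n consecutive chunks, then move the
    # chunk index to the front to get a batch of "images"
    return trimmed.reshape(spec.shape[0], n, frames_per_window).transpose(1, 0, 2)

spec = np.random.rand(201, 98)      # stand-in log-spectrogram
batch = fixed_windows(spec, 32)
print(batch.shape)                   # (3, 201, 32)
```

Overlapping windows or padding the last chunk are common variations on the same idea.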

Individualize answered 11/10, 2015 at 12:4 Comment(2)
the "in this work" Microsoft link doesn't lead to any article or pdf, can you mention the title?Sterilization
sorry for a late reply. Here it is scholar.google.com/…Individualize

There are many techniques for extracting feature vectors from audio data in order to train classifiers. The most commonly used is MFCC (Mel-frequency cepstral coefficients), which you can think of as an "improved" spectrogram that retains more of the information relevant for discriminating between classes. Another commonly used technique is PLP (Perceptual Linear Prediction), which also gives good results. There are still many other, lesser-known techniques.
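To make the MFCC idea concrete, here is a rough numpy/scipy sketch of the pipeline for a single frame (power spectrum → mel filterbank → log → DCT). Real toolkits add pre-emphasis, windowing, framing, and liftering; the constants below are typical defaults, not a reference implementation:

```python
# Rough sketch of MFCC extraction for one audio frame. Constants
# (16 kHz, 26 filters, 13 coefficients) are common defaults.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mfcc_frame(frame, fs=16000, n_fft=512, n_filters=26, n_ceps=13):
    # power spectrum of one frame
    spectrum = np.abs(np.fft.rfft(frame, n_fft)) ** 2

    # triangular filters evenly spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # log filterbank energies, then DCT to decorrelate -> cepstral coefficients
    energies = np.log(fbank @ spectrum + 1e-10)
    return dct(energies, type=2, norm='ortho')[:n_ceps]

frame = np.sin(2 * np.pi * 300 * np.arange(400) / 16000)  # one 25 ms frame
print(mfcc_frame(frame).shape)
```

Stacking these per-frame vectors over time (often with delta features appended) gives the sequence a classifier is trained on.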

More recently, deep networks have been used to extract feature vectors by themselves, much as is done in image recognition. This is an active area of research. Not long ago we also used hand-crafted feature extractors to train classifiers for images (SIFT, HOG, etc.), but these were replaced by deep learning techniques, which take raw images as input and extract feature vectors themselves (indeed, that is what deep learning is really all about).

It's also very important to notice that audio data is sequential. After training a classifier you need to train a sequential model such as an HMM or CRF, which chooses the most likely sequence of speech units, using the probabilities given by your classifier as input.
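As an illustration of that sequential step, here is a small Viterbi decoder over per-frame classifier probabilities. The two states and the transition matrix are made up for the example; a real HMM would model phones or other speech units:

```python
# Sketch: Viterbi decoding of the most likely state sequence from
# per-frame classifier log-probabilities and a transition model.
import numpy as np

def viterbi(log_emit, log_trans, log_start):
    """log_emit: (T, S) per-frame log-probs from the classifier,
    log_trans: (S, S) log transition probs, log_start: (S,) log priors.
    Returns the most likely state sequence as a list."""
    T, S = log_emit.shape
    score = log_start + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans        # (prev_state, next_state)
        back[t] = np.argmax(cand, axis=0)        # best predecessor per state
        score = cand[back[t], np.arange(S)] + log_emit[t]
    # backtrack from the best final state
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two states; the classifier prefers state 0 early and state 1 later.
log_emit = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
log_trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_start = np.log(np.array([0.5, 0.5]))
print(viterbi(log_emit, log_trans, log_start))   # [0, 0, 1, 1]
```

The transition term smooths out spurious frame-level flips that a frame-independent argmax would keep.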

A good starting point for learning speech recognition is Jurafsky and Martin: Speech and Language Processing. It explains all these concepts very well.

[EDIT: adding some potentially useful information]

There are many speech recognition toolkits with modules to extract MFCC feature vectors from audio files, but using them for this purpose is not always straightforward. I'm currently using CMU Sphinx4. It has a class named FeatureFileDumper that can be used standalone to generate MFCC vectors from audio files.

Bigner answered 24/5, 2014 at 1:54 Comment(5)
spectrograms contain all the information that waves (the most direct representation of sound) havePurism
Laie is correct. I am currently using the spectrogram approach, and the first function I wrote converts a wav to a spectrogram and then converts it back to a wav. It reproduces the signal with 100% accuracy except for the first few and last few samplesRendon
@Laie, sorry for the late response (short of 8 years 😅), but perhaps whether a spectrogram preserves the whole signal or not is just a matter of terminology. As I see the word most commonly used, a spectrogram is a graphical representation of the magnitudes (generally squared) of the Fourier coefficients. The coefficients are complex numbers, so their phases are discarded in the graphBigner
@Laie, but sure, you might still be correct for all practical purposes for dealing with audio signals, since for a real and even signal, the coefficients would be real. None of this seems to matter anymore for the OP, now that deep learning made all this old school feature extraction totally obsolete!Bigner

© 2022 - 2024 — McMap. All rights reserved.