I have already implemented image captioning using VGG as the image classification model. I have read that YOLO is a fast image classification and detection model, used primarily for detecting multiple objects. However, for image captioning I just want the classes, not the bounding boxes.
I completely agree with what Parag S. Chandakkar mentioned in his answer. YOLO and RCNN, the two most widely used object detection models, are slow compared to VGG-16 and other classification networks if they are used only for classification. However, in support of YOLO, I would mention that you can create a single model for both image captioning and object detection.
- YOLO generates an output vector of length 1470 (49 × 20 class scores + 98 box confidences + 392 box coordinates for its default 20 classes).
- Tune YOLO to predict the number of classes in your dataset, i.e. make YOLO generate a vector of length 49 × (number of classes in your dataset) + 98 + 392.
- Use this vector to generate the bounding boxes.
- Further transform this vector into a vector whose size equals the number of classes. You can use a dense layer for this.
- Pass this class vector to your language model to generate captions.
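The length formula in the steps above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, the 7 × 7 grid with 2 boxes per cell is the layout that yields 1470 for 20 classes, and the segment order (class scores, then confidences, then coordinates) is an assumption that should be verified against the implementation you use.

```python
def yolo_v1_output_length(num_classes, grid=7, boxes_per_cell=2):
    """Length of the flat output vector for a grid x grid layout."""
    cells = grid * grid
    class_scores = cells * num_classes        # one class distribution per cell
    confidences = cells * boxes_per_cell      # one confidence per predicted box
    coordinates = cells * boxes_per_cell * 4  # x, y, w, h per predicted box
    return class_scores + confidences + coordinates


def split_output(vector, num_classes, grid=7, boxes_per_cell=2):
    """Split the flat vector into (class scores, confidences, coordinates).

    Assumes the three segments are stored in that order.
    """
    expected = yolo_v1_output_length(num_classes, grid, boxes_per_cell)
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")
    cells = grid * grid
    n_cls = cells * num_classes
    n_conf = cells * boxes_per_cell
    return vector[:n_cls], vector[n_cls:n_cls + n_conf], vector[n_cls + n_conf:]


print(yolo_v1_output_length(20))  # 1470
```

For a dataset with, say, 5 classes, the same function gives 49 × 5 + 98 + 392 = 735.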
Thus, to sum up, you can generate the bounding boxes first and then further transform that same vector into the class vector used to generate captions.
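As a rough illustration of the dense-layer step, here is a framework-free sketch that maps a detector output of length 1470 to one score per class. Everything in it is hypothetical: the helper names, the random placeholder weights (in practice they would be learned during fine-tuning), and the choice of a sigmoid so that several classes can be active at once.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def sigmoid(values):
    """Element-wise logistic function, giving independent per-class scores."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def make_class_head(input_size, num_classes, seed=0):
    """Random placeholder parameters; a real head would be trained."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(input_size)
    weights = [[rng.uniform(-scale, scale) for _ in range(input_size)]
               for _ in range(num_classes)]
    biases = [0.0] * num_classes
    return weights, biases


detector_output = [0.5] * 1470  # stand-in for a real detector output
weights, biases = make_class_head(len(detector_output), num_classes=20)
class_vector = sigmoid(dense(detector_output, weights, biases))
print(len(class_vector))  # 20
```

The resulting `class_vector` is what would be handed to the language model in place of the VGG class output.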
My initial guess is that it would not make sense to use YOLO for image classification. YOLO is fast for object detection, but networks built for image classification are faster than YOLO because they have less work to do (so the comparison is not really fair).
According to benchmarks provided here, we can consider the Inception-v1 network, which has 27 layers. The YOLO base network has 24 layers. Now, with the latest cuDNN on a Maxwell TitanX, Inception v1 takes 19.29 ms for 16 images, which translates into ~830 fps (again, expect lower fps when you pass a single image, because a GPU is fast at processing mini-batches, i.e. one forward pass with a mini-batch of 16 is faster than 16 forward passes with a mini-batch size of 1).
The latest version of YOLO runs at 67 fps and its tiny version runs at 207 fps, still a lot slower than Inception v1 (note that YOLO does not use Inception v1 as its base network, but the number of layers is still comparable).
So, in short, I do not see any speed advantage in using YOLO for image classification. As for accuracy, I cannot say for sure whether YOLO would detect the presence of an object better than a conventional image classification network when the object is tiny.
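The throughput figures can be reproduced with simple arithmetic. The sketch below only converts the numbers quoted above; the function name is made up, and the ratios are indicative at best, since the YOLO figures were not necessarily measured on the same GPU or with the same batch size.

```python
def batch_time_to_fps(batch_size, batch_time_ms):
    """Images per second given the time for one forward pass over a batch."""
    return batch_size / (batch_time_ms / 1000.0)


inception_fps = batch_time_to_fps(16, 19.29)
yolo_fps = 67        # figure quoted above
tiny_yolo_fps = 207  # figure quoted above

print(f"Inception v1: {inception_fps:.0f} fps")
print(f"vs YOLO:      {inception_fps / yolo_fps:.1f}x faster")
print(f"vs tiny YOLO: {inception_fps / tiny_yolo_fps:.1f}x faster")
```

Running it shows Inception v1 at roughly 830 fps, about 12 times the YOLO figure and about 4 times the tiny YOLO figure.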