YOLOV8 how does it handle different image sizes
Asked Answered
Yolov8 (and I suspect Yolov5) handles non-square images well. I cannot see any evidence of cropping the input image, i.e. detections seem to go to the edge of the longest side. Does it resize to a square 640x640, which would change the aspect ratio of objects, making them more difficult to detect?

When training on a custom dataset starting from a pre-trained model, what does the imgsz (image size) parameter actually do?

Phyliciaphylis answered 28/1, 2023 at 14:33 Comment(0)

Modern Yolo versions, from v3 onwards, can handle arbitrarily sized images as long as both sides are a multiple of 32. This is because the maximum stride of the backbone is 32 and it is a fully convolutional network. But there are two clearly different cases for how input images are preprocessed:
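To see why the multiple-of-32 constraint exists, here is a minimal sketch (my own helper, not Ultralytics code): a fully convolutional backbone with maximum stride 32 just needs every side to divide cleanly into an integer-sized feature grid.

```python
def feature_grid(h, w, stride=32):
    """Return the coarsest feature-map size for an (h, w) input.

    A fully convolutional backbone places no constraint on the input
    size other than that each side divides evenly by the max stride.
    """
    assert h % stride == 0 and w % stride == 0, "sides must be multiples of 32"
    return h // stride, w // stride

print(feature_grid(512, 384))  # (16, 12)
print(feature_grid(288, 512))  # (9, 16)
```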

Training

An example. Let's say you start training with:

from ultralytics.yolo.engine.model import YOLO
  
model = YOLO("yolov8n.pt")
results = model.train(data="coco128.yaml", imgsz=512) 

By printing what is fed to the model (im) in trainer.py, you will obtain the following output:

Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  0%|          | 0/8 [00:00<?, ?it/s]
 torch.Size([16, 3, 512, 512])
      1/100      1.67G      1.165      1.447      1.198        226        512:  12%|█▎        | 1/8 [00:01<00:08,  1.15s/it]
 torch.Size([16, 3, 512, 512])
      1/100      1.68G      1.144      1.511       1.22        165        512:  25%|██▌       | 2/8 [00:02<00:06,  1.10s/it]
 torch.Size([16, 3, 512, 512])

So, during training, all images have to be reshaped to the same size in order to create mini-batches, as you cannot stack tensors of different shapes. imgsz sets the square size (imgsz x imgsz) that the training images are resized to.
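A simplified stand-in (not Ultralytics' actual LetterBox code) for how the training resize can work: scale the longest side to imgsz, then pad the shorter side up to imgsz so every tensor in the mini-batch ends up the same square shape.

```python
def to_square(h, w, imgsz=512):
    """Sketch of letterbox-style resizing for training batches.

    Scales the long side to imgsz (preserving aspect ratio), then
    reports how much padding the short side needs to reach a square.
    """
    r = imgsz / max(h, w)                        # scale factor for the long side
    new_h, new_w = round(h * r), round(w * r)    # aspect-ratio-preserving resize
    pad_h, pad_w = imgsz - new_h, imgsz - new_w  # padding to reach imgsz x imgsz
    return (new_h, new_w), (pad_h, pad_w)

# bus.jpg from the log below is 1080x810:
print(to_square(1080, 810))  # ((512, 384), (0, 128))
```

With the padding applied, every image in the batch is (imgsz, imgsz) and the batch tensor shown in the log, torch.Size([16, 3, 512, 512]), can be stacked.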

Prediction

Now, let's have a look at prediction. Let's say you select the images under assets as source and set imgsz=512 by

from ultralytics.yolo.engine.model import YOLO
  
model = YOLO("yolov8n.pt")
results = model.predict(stream=True, imgsz=512) # source already setup

By printing the original image shape (im0) and the one fed to the model (im) in predictor.py, you will obtain the following output:

(yolov8) ➜  ultralytics git:(main) ✗ python new.py 
Ultralytics YOLOv8.0.23 🚀 Python-3.8.15 torch-1.11.0+cu102 CUDA:0 (Quadro P2000, 4032MiB)
YOLOv8n summary (fused): 168 layers, 3151904 parameters, 0 gradients, 8.7 GFLOPs
im0s (1080, 810, 3)
im torch.Size([1, 3, 512, 384])
image 1/2 /home/mikel.brostrom/ultralytics/ultralytics/assets/bus.jpg: 512x384 4 persons, 1 bus, 7.4ms
im0s (720, 1280, 3)
im torch.Size([1, 3, 288, 512])
image 2/2 /home/mikel.brostrom/ultralytics/ultralytics/assets/zidane.jpg: 288x512 3 persons, 2 ties, 5.8ms
Speed: 0.4ms pre-process, 6.6ms inference, 1.5ms postprocess per image at shape (1, 3, 512, 512)

You can see that the longest image side is resized to 512. The short side is then resized to the closest multiple of 32 while maintaining the aspect ratio. Since you are not feeding multiple images at the same time, you don't need to reshape them to a common size and stack them, which makes it possible to avoid padding.
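The shapes in the log above can be reproduced with a short sketch (my own code, not the Ultralytics source): scale the long side to imgsz, scale the short side by the same ratio, then round the short side up to the nearest multiple of 32.

```python
import math

def predict_shape(h, w, imgsz=512, stride=32):
    """Sketch of the inference-time resize: long side -> imgsz,
    short side scaled by the same ratio, each side rounded up to a
    multiple of the maximum stride (32)."""
    r = imgsz / max(h, w)                 # single scale factor keeps aspect ratio
    h, w = round(h * r), round(w * r)
    h = math.ceil(h / stride) * stride    # round up to a stride multiple
    w = math.ceil(w / stride) * stride
    return h, w

print(predict_shape(1080, 810))   # (512, 384)  <- bus.jpg
print(predict_shape(720, 1280))   # (288, 512)  <- zidane.jpg
```

Both example images happen to scale to exact multiples of 32, so no rounding (and hence no aspect-ratio change) is needed; see the comments below for what happens otherwise.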

Bonnibelle answered 28/1, 2023 at 21:8 Comment(6)
Thank you for taking the time to explain. Stride of 32! I see in the diagram of the yolov8 backbone here user-images.githubusercontent.com/27466624/… that it mentions stride=32 at the bottom, but I think that is calculated as 2**5, i.e. stride 2 applied 5 times. What I do not see in the diagram is an adaptive pooling layer or similar. I assume that the smaller side is adjusted by padding to be the same as the adjusted larger side. Do you know what the imgsz parameter does?Phyliciaphylis
No need for padding as long as the short side is also a multiple of 32. Will add some more info to the answer...Bonnibelle
Thanks for the additional info. Both the examples work out nicely: 512 is a multiple of 32 (16*32), and the ratios of the image sides are simple multiples that cancel with the 16 to give an exact multiple of 32 without any adjustment. I assume that if this is not the case, the aspect ratio will be altered somewhat.Phyliciaphylis
The aspect ratio could change slightly in the case where the sides are not exact multiples of 32. But it is still maintained closely, as images are inserted into "letterboxes" when feeding multiple images to the model at the same timeBonnibelle
Hi @MikeB, do you know any literature or post that gives more details about this? Specifically, usually we need fixed size input when we convert output of conv layers to fully connected layers, and wonder how yolo handles this. Thanks!Abbatial
Sadly, I don't think this is something you will find in a book. The easiest way to understand how this works is the hard way: by, for example, building yolov3 from scratch.Bonnibelle
