Modern YOLO versions, from v3 onwards, can handle arbitrarily sized images as long as both sides are a multiple of 32. This is because the network is fully convolutional and the maximum stride of the backbone is 32. However, there are two clearly different cases for how input images are preprocessed before reaching the model:
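To illustrate the stride constraint (a minimal sketch with a toy stack of convolutions, not the actual YOLO backbone): any fully convolutional network whose total stride is 32 produces a valid feature map for any input whose height and width are multiples of 32.

import torch
import torch.nn as nn

# Toy fully convolutional "backbone": five stride-2 convs -> total stride 2**5 = 32.
backbone = nn.Sequential(
    *[nn.Conv2d(3 if i == 0 else 16, 16, kernel_size=3, stride=2, padding=1)
      for i in range(5)]
)

# Any input with both sides divisible by 32 goes through without error.
for h, w in [(512, 512), (512, 384), (288, 512)]:
    out = backbone(torch.zeros(1, 3, h, w))
    print((h, w), "->", tuple(out.shape))  # spatial dims are exactly (h/32, w/32)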
Training
An example: let's say you start a training run with:
from ultralytics.yolo.engine.model import YOLO
model = YOLO("yolov8n.pt")
results = model.train(data="coco128.yaml", imgsz=512)
By printing the shape of the tensor fed to the model (im) in trainer.py, you will obtain the following output:
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
0%| | 0/8 [00:00<?, ?it/s]
torch.Size([16, 3, 512, 512])
1/100 1.67G 1.165 1.447 1.198 226 512: 12%|█▎ | 1/8 [00:01<00:08, 1.15s/it]
torch.Size([16, 3, 512, 512])
1/100 1.68G 1.144 1.511 1.22 165 512: 25%|██▌ | 2/8 [00:02<00:06, 1.10s/it]
torch.Size([16, 3, 512, 512])
So, during training, images have to be reshaped to the same size in order to build mini-batches, because tensors of different shapes cannot be stacked together. imgsz selects the size the images are reshaped to for training.
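A minimal sketch of why a common size is needed (plain PyTorch, not the actual ultralytics dataloader): stacking tensors of different spatial sizes fails, while images resized to the same imgsz stack into a batch without issue.

import torch

# Two images of different shapes cannot be stacked into one mini-batch:
a = torch.zeros(3, 512, 384)
b = torch.zeros(3, 288, 512)
try:
    torch.stack([a, b])
except RuntimeError as e:
    print("stack failed:", e)

# After reshaping both to imgsz x imgsz they batch fine:
batch = torch.stack([torch.zeros(3, 512, 512), torch.zeros(3, 512, 512)])
print(batch.shape)  # torch.Size([2, 3, 512, 512])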
Prediction
Now, let's have a look at prediction. Let's say you select the images under assets as the source and imgsz=512 by:
from ultralytics.yolo.engine.model import YOLO
model = YOLO("yolov8n.pt")
results = model.predict(stream=True, imgsz=512) # source already setup
By printing the original image shape (im0) and the shape of the tensor fed to the model (im) in predictor.py, you will obtain the following output:
(yolov8) ➜ ultralytics git:(main) ✗ python new.py
Ultralytics YOLOv8.0.23 🚀 Python-3.8.15 torch-1.11.0+cu102 CUDA:0 (Quadro P2000, 4032MiB)
YOLOv8n summary (fused): 168 layers, 3151904 parameters, 0 gradients, 8.7 GFLOPs
im0s (1080, 810, 3)
im torch.Size([1, 3, 512, 384])
image 1/2 /home/mikel.brostrom/ultralytics/ultralytics/assets/bus.jpg: 512x384 4 persons, 1 bus, 7.4ms
im0s (720, 1280, 3)
im torch.Size([1, 3, 288, 512])
image 2/2 /home/mikel.brostrom/ultralytics/ultralytics/assets/zidane.jpg: 288x512 3 persons, 2 ties, 5.8ms
Speed: 0.4ms pre-process, 6.6ms inference, 1.5ms postprocess per image at shape (1, 3, 512, 512)
You can see that the longer side of each image is resized to 512, while the shorter side is scaled to preserve the aspect ratio and brought to the closest multiple of 32. As you are not feeding multiple images at the same time, you don't need to reshape them to a common size and stack them, making it possible to avoid most of the padding.
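A rough sketch of that arithmetic (simplified; the actual ultralytics LetterBox implementation resizes with the aspect ratio preserved and then pads up to the stride multiple, rather than rounding the resize itself):

import math

def predict_shape(h0, w0, imgsz=512, stride=32):
    # Scale so the longer side equals imgsz, keep the aspect ratio,
    # then bring each side to the nearest multiple of the stride.
    r = imgsz / max(h0, w0)
    h, w = round(h0 * r), round(w0 * r)
    return math.ceil(h / stride) * stride, math.ceil(w / stride) * stride

print(predict_shape(1080, 810))   # (512, 384) -> bus.jpg
print(predict_shape(720, 1280))   # (288, 512) -> zidane.jpg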