I am developing an object detection model to detect ships using YOLO. I want to use the COCO dataset. Is there a way to download only the images that have ships with the annotations?
To download images from a specific category, you can use the COCO API. Here's a demo notebook going through this and other usages. The overall process is as follows:
- Install pycocotools
- Download one of the annotations jsons from the COCO dataset
Here's an example of how to download the subset of images containing a person and save them to a local folder:
from pycocotools.coco import COCO
import requests
# instantiate COCO specifying the annotations json path
coco = COCO('...path_to_annotations/instances_train2014.json')
# Specify a list of category names of interest
catIds = coco.getCatIds(catNms=['person'])
# Get the corresponding image ids and images using loadImgs
imgIds = coco.getImgIds(catIds=catIds)
images = coco.loadImgs(imgIds)
This returns a list of dictionaries with basic information on each image, including its URL. We can now use requests to GET the images and write them into a local folder:
# Save the images into a local folder
for im in images:
    img_data = requests.get(im['coco_url']).content
    with open('...path_saved_ims/coco_person/' + im['file_name'], 'wb') as handler:
        handler.write(img_data)
Note that this will save all images from the specified category, so you might want to slice the images list to the first n.
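A minimal sketch of the slicing idea, which also creates the target folder first, since `open(..., 'wb')` raises `FileNotFoundError` when the directory does not exist. The helper name `save_first_n` and its `fetch` parameter are illustrative assumptions, not part of pycocotools; `fetch` is injectable so the function can be tried without network access.

```python
import os
import urllib.request


def save_first_n(images, out_dir, n, fetch=None):
    """Save at most n images (dicts as returned by coco.loadImgs) into out_dir.

    Returns the list of file paths written.
    """
    if fetch is None:
        # Default fetcher (an assumption for this sketch): plain HTTP GET
        # returning the raw bytes of the response.
        def fetch(url):
            with urllib.request.urlopen(url) as resp:
                return resp.read()

    # Create the folder up front; open() does not create missing directories.
    os.makedirs(out_dir, exist_ok=True)

    written = []
    for im in images[:n]:  # slice to the first n entries
        path = os.path.join(out_dir, im['file_name'])
        with open(path, 'wb') as handler:
            handler.write(fetch(im['coco_url']))
        written.append(path)
    return written
```

With the `images` list from above, a call such as `save_first_n(images, 'coco_person', n=100)` would download only the first 100 entries.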
Why do I get FileNotFoundError: [Errno 2] No such file or directory: '../coco_sheep/COCO_train2014_000000040961.jpg' after running this piece of code for the sheep category? I have the annotations JSON file and no errors in the code; could it be the dataset's fault? – Toandfro
Nowadays there is a package called fiftyone with which you can download the MS COCO dataset and get the annotations for specific classes only. More information about installation can be found at https://github.com/voxel51/fiftyone#installation.
Once you have the package installed, simply run the following to get, say, the "person" and "car" classes:
import fiftyone.zoo as foz
# To download the COCO dataset for only the "person" and "car" classes
dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="train",
    label_types=["detections", "segmentations"],
    classes=["person", "car"],
    # max_samples=50,
)
If desired, you can uncomment the last option to set a maximum number of samples. You can also change the "train" split to "validation" to obtain the validation split instead.
To visualize the dataset downloaded, simply run the following:
# Visualize the dataset in the FiftyOne App
import fiftyone as fo
session = fo.launch_app(dataset)
If you would like to download the "train", "validation", and "test" splits in a single function call, you could do the following:
dataset = foz.load_zoo_dataset(
    "coco-2017",
    splits=["train", "validation", "test"],
    label_types=["detections", "segmentations"],
    # max_samples=50,
)
I got Connection to images.cocodataset.org timed out errors while downloading ~66k images of 5 classes, but it did the job eventually. – Greyback
From what I know, the COCO dataset doesn't have a category for "ships". The closest category it has is "boat". Here's the link to check the available categories: http://cocodataset.org/#overview
BTW, there are ships inside the boat category too.
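To make the category lookup concrete, here is a small sketch on a hand-made dictionary in the COCO annotation layout (`categories`, `images`, `annotations`). It mirrors, in spirit, what looking up one category name and then its images does; the function name `image_ids_for_category` and the toy data are assumptions for illustration only.

```python
def image_ids_for_category(coco_dict, name):
    """Return sorted ids of images with at least one annotation of category `name`."""
    cat_ids = {c['id'] for c in coco_dict['categories'] if c['name'] == name}
    return sorted({a['image_id'] for a in coco_dict['annotations']
                   if a['category_id'] in cat_ids})


# Toy data in the COCO layout (ids and file names are made up for the example).
toy = {
    'categories': [{'id': 1, 'name': 'person'}, {'id': 9, 'name': 'boat'}],
    'images': [{'id': 10, 'file_name': 'a.jpg'},
               {'id': 11, 'file_name': 'b.jpg'},
               {'id': 12, 'file_name': 'c.jpg'}],
    'annotations': [
        {'id': 100, 'image_id': 10, 'category_id': 9},
        {'id': 101, 'image_id': 11, 'category_id': 1},
        {'id': 102, 'image_id': 12, 'category_id': 9},
        {'id': 103, 'image_id': 12, 'category_id': 1},
    ],
}

boat_ids = image_ids_for_category(toy, 'boat')   # images 10 and 12 contain a boat
ship_ids = image_ids_for_category(toy, 'ship')   # no 'ship' category in the toy data
```

With pycocotools itself, the lookup would use `catNms=['boat']`. As far as I can tell, an unknown name such as 'ship' yields an empty category list, so it is worth checking that the list is non-empty before asking for image ids.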
If you want to just select images of a specific COCO category, you might want to do something like this (taken and edited from COCO's official demos):
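from pycocotools.coco import COCO
# instantiate COCO with the annotations json path (same placeholder path as above)
coco = COCO('...path_to_annotations/instances_train2014.json')
# display COCO categories
cats = coco.loadCats(coco.getCatIds())
nms = [cat['name'] for cat in cats]
print('COCO categories: \n{}\n'.format(' '.join(nms)))
# get all images containing given categories (I'm selecting the "bird")
catIds = coco.getCatIds(catNms=['bird'])
imgIds = coco.getImgIds(catIds=catIds)
# download the selected images into 'myfolder'
coco.download('myfolder', imgIds)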
– Charqui
I tried the code that @yatu and @Tim had shared here, but I got lots of requests.exceptions.ConnectionError: HTTPSConnectionPool errors.
So after carefully reading this answer to "Max retries exceeded with URL in requests", I rewrote the code as follows, and now it runs smoothly:
from os.path import isfile
from pathlib import Path
import json

from pycocotools.coco import COCO
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import requests
from tqdm.notebook import tqdm

# instantiate COCO specifying the annotations json path
coco = COCO('annotations/instances_train2017.json')
# Specify a list of category names of interest
catIds = coco.getCatIds(catNms=['person'])
# Get the corresponding image ids and images using loadImgs
imgIds = coco.getImgIds(catIds=catIds)
images = coco.loadImgs(imgIds)

# handle annotations
ANNOTATIONS = {"info": {
    "description": "my-project-name"
}}

# `categories` is not defined in the original snippet; assumed here to map each
# selected category name to its new 1-based id.
categories = {cat['name']: catIds.index(cat['id']) + 1
              for cat in coco.loadCats(catIds)}


def cocoJson(images: list) -> dict:
    # collect the annotations of the selected categories for these images
    imageIds = [k["id"] for k in images]
    annIds = coco.getAnnIds(imgIds=imageIds, catIds=catIds, iscrowd=None)
    anns = coco.loadAnns(annIds)
    # remap category ids to 1..len(catIds) (modifies the loaded annotations in place)
    for k in anns:
        k["category_id"] = catIds.index(k["category_id"]) + 1
    catS = [{'id': int(value), 'name': key}
            for key, value in categories.items()]
    ANNOTATIONS["images"] = images
    ANNOTATIONS["annotations"] = anns
    ANNOTATIONS["categories"] = catS
    return ANNOTATIONS


def createJson(JsonFile: dict, label='train') -> None:
    name = label
    Path("data/labels").mkdir(parents=True, exist_ok=True)
    with open(f"data/labels/{name}.json", "w") as outfile:
        json.dump(JsonFile, outfile)


def downloadImages(images: list) -> None:
    # retry failed connections with a backoff instead of failing immediately
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    Path("data/images").mkdir(parents=True, exist_ok=True)
    for im in tqdm(images):
        # skip images that were already downloaded
        if not isfile(f"data/images/{im['file_name']}"):
            img_data = session.get(im['coco_url']).content
            with open('data/images/' + im['file_name'], 'wb') as handler:
                handler.write(img_data)


trainSet = cocoJson(images)
createJson(trainSet)
downloadImages(images)
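The idea behind `Retry(connect=3, backoff_factor=0.5)` can be illustrated in plain Python. This is only a sketch of retry with exponential backoff, not urllib3's actual implementation or its exact delay schedule; `with_retries`, `flaky`, and the injectable `sleep` parameter are hypothetical names introduced for the example.

```python
import time


def with_retries(func, retries=3, backoff_factor=0.5, sleep=time.sleep,
                 exceptions=(ConnectionError,)):
    """Call func(); on a listed exception, wait and retry up to `retries` times.

    The delay doubles on each retry: backoff_factor * 2**attempt seconds.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(backoff_factor * (2 ** attempt))


# Example: a function that fails twice before succeeding.
calls = {'n': 0}
delays = []


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated timeout')
    return b'image-bytes'


# Record the delays instead of actually sleeping.
result = with_retries(flaky, sleep=delays.append)
```

Passing `sleep=delays.append` keeps the example instant while still showing which waits would have happened between attempts.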
On my side, I recently had difficulties installing fiftyone on an Apple Silicon Mac (M1), so I created a script based on pycocotools that lets me quickly download a subset of the COCO 2017 dataset (images and annotations).
It is very simple to use; details are available here: https://github.com/tikitong/minicoco. Hope this helps.