How can I download a specific part of the COCO dataset?

5

11

I am developing an object detection model to detect ships using YOLO. I want to use the COCO dataset. Is there a way to download only the images that have ships with the annotations?

Pitiable answered 29/6, 2018 at 10:58 Comment(0)
17

To download images from a specific category, you can use the COCO API (pycocotools). Here's a demo notebook going through this and other usages.

Here is an example of how to download the subset of images containing a person and save them to a local folder:

from pycocotools.coco import COCO
import requests

# instantiate COCO specifying the annotations json path
coco = COCO('...path_to_annotations/instances_train2014.json')
# Specify a list of category names of interest
catIds = coco.getCatIds(catNms=['person'])
# Get the corresponding image ids and images using loadImgs
imgIds = coco.getImgIds(catIds=catIds)
images = coco.loadImgs(imgIds)

This returns a list of dictionaries with basic information on each image, including its URL. We can now use requests to GET the images and write them into a local folder:

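# Save the images into a local folder
import os

out_dir = '...path_saved_ims/coco_person/'
os.makedirs(out_dir, exist_ok=True)  # open() fails if the folder does not exist
for im in images:
    img_data = requests.get(im['coco_url']).content
    with open(out_dir + im['file_name'], 'wb') as handler:
        handler.write(img_data)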

Note that this will save all images from the specified category, so you might want to slice the images list to the first n (for example, images[:n]).
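The slicing idea can be wrapped in a small helper. This is only a sketch: the name save_first_n and its fetch parameter are hypothetical (not part of pycocotools), and passing the download function in as an argument is a design choice that lets the helper be tried without network access.

```python
import os
from typing import Callable, Dict, List


def save_first_n(images: List[Dict], out_dir: str, n: int,
                 fetch: Callable[[str], bytes]) -> List[str]:
    """Save at most n images into out_dir and return the paths written.

    `images` is the list returned by coco.loadImgs(...);
    `fetch` is any callable that maps a URL to the raw image bytes.
    """
    os.makedirs(out_dir, exist_ok=True)  # create the target folder if missing
    written = []
    for im in images[:n]:  # only the first n entries are downloaded
        path = os.path.join(out_dir, im['file_name'])
        with open(path, 'wb') as handler:
            handler.write(fetch(im['coco_url']))
        written.append(path)
    return written


# Possible usage with requests (assumes `images` from the snippet above):
# import requests
# save_first_n(images, 'coco_person', 100,
#              lambda url: requests.get(url, timeout=30).content)
```

The commented usage lines show how it could be combined with requests; the folder name and the value of n there are placeholders.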

Dowser answered 7/7, 2020 at 7:51 Comment(3)
How can we download the YOLO labels as .txt files? - Mortenson
The best way to convert COCO to YOLO labels would be to use FiftyOne, as mentioned by @kris-stern in another answer. From there, you can export the dataset to disk in a number of formats, including YOLO: voxel51.com/docs/fiftyone/user_guide/… - Anglicanism
Why do I get the error FileNotFoundError: [Errno 2] No such file or directory: '../coco_sheep/COCO_train2014_000000040961.jpg' after running this code for the sheep category? I have the annotations JSON file and there are no errors in the code; could it be the dataset's fault? - Toandfro
9

Nowadays there is a package called fiftyone with which you could download the MS COCO dataset and get the annotations for specific classes only. More information about installation can be found at https://github.com/voxel51/fiftyone#installation.

Once you have the package installed, simply run the following to get, say, the "person" and "car" classes:

import fiftyone.zoo as foz

# To download the COCO dataset for only the "person" and "car" classes
dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="train",
    label_types=["detections", "segmentations"],
    classes=["person", "car"],
    # max_samples=50,
)

If desired, you can uncomment the last option to set a maximum number of samples. Moreover, you can change the "train" split to "validation" in order to obtain the validation split instead.

To visualize the dataset downloaded, simply run the following:

# Visualize the dataset in the FiftyOne App
import fiftyone as fo
session = fo.launch_app(dataset)

If you would like to download the "train", "validation", and "test" splits in a single function call, you could do the following:

dataset = foz.load_zoo_dataset(
    "coco-2017",
    splits=["train", "validation", "test"],
    label_types=["detections", "segmentations"],
    classes=["person"],
    # max_samples=50,
)
Ettore answered 15/10, 2021 at 3:8 Comment(2)
Just a tip for those using this method: if you use "train" or "validation", everything is in the JSON files, but that is not the case for the "test" split. - Sochor
I had to run the code 7 times because of "Connection to images.cocodataset.org timed out" when downloading ~66k images of 5 classes, but it did the job eventually. - Greyback
7

As far as I know, the COCO dataset itself has no category for "ships". The closest category is "boat". Here's the link to check the available categories: http://cocodataset.org/#overview

BTW, there are ships inside the boat category too.

If you want to just select images of a specific COCO category, you might want to do something like this (taken and edited from COCO's official demos):

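from pycocotools.coco import COCO

# instantiate COCO with the annotations json path (as in the other answers)
coco = COCO('...path_to_annotations/instances_train2014.json')

# display COCO categories
cats = coco.loadCats(coco.getCatIds())
nms = [cat['name'] for cat in cats]
print('COCO categories: \n{}\n'.format(' '.join(nms)))

# get all images containing given categories (I'm selecting the "bird")
catIds = coco.getCatIds(catNms=['bird'])
imgIds = coco.getImgIds(catIds=catIds)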
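Since the question targets YOLO, the selected annotations still need to be turned into YOLO label lines. The sketch below assumes the usual conventions: COCO boxes are [x, y, width, height] in pixels with (x, y) the top-left corner, and YOLO expects "class x_center y_center width height" normalized by the image size. The helper names and the class_index mapping are hypothetical, not part of pycocotools.

```python
from typing import Dict, List


def coco_bbox_to_yolo(bbox: List[float], img_w: int, img_h: int) -> List[float]:
    """Convert a COCO [x, y, w, h] pixel box to normalized YOLO [xc, yc, w, h]."""
    x, y, w, h = bbox
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]


def yolo_lines(anns: List[Dict], image: Dict,
               class_index: Dict[int, int]) -> List[str]:
    """Build the lines of one YOLO .txt label file for a single image.

    `anns` are COCO annotation dicts for that image, `image` is its entry
    from coco.loadImgs, and `class_index` maps COCO category ids to
    0-based YOLO class indices.
    """
    lines = []
    for ann in anns:
        if ann['category_id'] not in class_index:
            continue  # skip categories that were not selected
        xc, yc, w, h = coco_bbox_to_yolo(ann['bbox'], image['width'], image['height'])
        lines.append(f"{class_index[ann['category_id']]} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines


# Possible usage (assumes `coco`, `catIds` and `imgIds` from the snippet above):
# class_index = {cid: i for i, cid in enumerate(catIds)}
# img = coco.loadImgs(imgIds[0])[0]
# anns = coco.loadAnns(coco.getAnnIds(imgIds=img['id'], catIds=catIds))
# print('\n'.join(yolo_lines(anns, img, class_index)))
```

Each image would get its own .txt file containing these lines, named after the image file.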
Inearth answered 13/9, 2018 at 6:56 Comment(3)
So can I download all the boats separately? - Pitiable
What do you mean? All the images of various categories are in the image set they've provided. You can't just download one single category 'boat' by itself. But with the code above, you can select specific categories and save them into a folder later if you want. - Inearth
@ShobhitKumar You can. Just follow the answer's code and add coco.download('myfolder', imgIds) - Charqui
2

I tried the code that @yatu and @Tim had shared here, but I got lots of requests.exceptions.ConnectionError: HTTPSConnectionPool errors.

So after carefully reading this answer to Max retries exceeded with URL in requests, I rewrote the code as follows, and now it runs smoothly:

import json
from os.path import isfile
from pathlib import Path

import numpy as np
import requests
from pycocotools.coco import COCO
from requests.adapters import HTTPAdapter
from tqdm.notebook import tqdm
from urllib3.util.retry import Retry


# instantiate COCO specifying the annotations json path
coco = COCO('annotations/instances_train2017.json')
# Specify a list of category names of interest
catIds = coco.getCatIds(catNms=['person'])
# Get the corresponding image ids and images using loadImgs
imgIds = coco.getImgIds(catIds=catIds)
images = coco.loadImgs(imgIds)

# handle annotations


ANNOTATIONS = {"info": {
    "description": "my-project-name"
}
}


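def cocoJson(images: list) -> dict:
    arrayIds = np.array([k["id"] for k in images])
    annIds = coco.getAnnIds(imgIds=arrayIds, catIds=catIds, iscrowd=None)
    anns = coco.loadAnns(annIds)
    # `categories` was not defined in the original snippet; this mapping of
    # category name -> new 1-based id is an assumption matching the remap below
    categories = {cat["name"]: catIds.index(cat["id"]) + 1
                  for cat in coco.loadCats(catIds)}
    # renumber the selected categories as 1..len(catIds)
    for k in anns:
        k["category_id"] = catIds.index(k["category_id"])+1
    catS = [{'id': int(value), 'name': key}
            for key, value in categories.items()]
    ANNOTATIONS["images"] = images
    ANNOTATIONS["annotations"] = anns
    ANNOTATIONS["categories"] = catS

    return ANNOTATIONS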


def createJson(JsonFile: dict, label='train') -> None:
    name = label
    Path("data/labels").mkdir(parents=True, exist_ok=True)
    with open(f"data/labels/{name}.json", "w") as outfile:
        json.dump(JsonFile, outfile)

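def downloadImages(images: list) -> None:
    # create the output folder first, otherwise open() raises FileNotFoundError
    Path("data/images").mkdir(parents=True, exist_ok=True)
    # retry failed connections up to 3 times with exponential backoff
    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    for im in tqdm(images):
        # skip images that were already downloaded in a previous run
        if not isfile(f"data/images/{im['file_name']}"):
            img_data = session.get(im['coco_url']).content
            with open('data/images/' + im['file_name'], 'wb') as handler:
                handler.write(img_data)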


trainSet = cocoJson(images)
createJson(trainSet) 
downloadImages(images)
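The annotation-filtering step can also be checked without pycocotools or network access by working on the loaded JSON dictionary directly. This is a minimal sketch under the assumption that the file follows the standard COCO layout with "images", "annotations" and "categories" keys; the function name subset_coco is hypothetical.

```python
from typing import Dict, List


def subset_coco(data: Dict, names: List[str]) -> Dict:
    """Keep only the given category names and renumber them from 1."""
    cats = [c for c in data['categories'] if c['name'] in names]
    # map each original category id to a new consecutive id starting at 1
    new_id = {c['id']: i + 1 for i, c in enumerate(cats)}
    # copy the matching annotations so the input dict is left unmodified
    anns = [dict(a, category_id=new_id[a['category_id']])
            for a in data['annotations'] if a['category_id'] in new_id]
    # keep only images that still have at least one annotation
    keep = {a['image_id'] for a in anns}
    images = [im for im in data['images'] if im['id'] in keep]
    return {
        'images': images,
        'annotations': anns,
        'categories': [{'id': new_id[c['id']], 'name': c['name']} for c in cats],
    }


# Possible usage (path taken from the snippet above):
# import json
# with open('annotations/instances_train2017.json') as f:
#     subset = subset_coco(json.load(f), ['person'])
```

Unlike cocoJson above, this version copies each annotation before changing its category_id, so the original data can be reused for another subset.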
Adversity answered 5/8, 2022 at 12:32 Comment(1)
Thanks a lot for this small modification concerning the URL requests. However, the code as you provide it is incomplete. I modified the original repository github.com/tikitong/minicoco to take this into account; it now also downloads the annotation file directly and can be launched with argparse. - Bolling
1

I recently had difficulties installing fiftyone on an Apple Silicon Mac (M1), so I created a script based on pycocotools that allows me to quickly download a subset of the COCO 2017 dataset (images and annotations).

It is very simple to use; details are available here: https://github.com/tikitong/minicoco. Hope this helps.

Bolling answered 23/5, 2022 at 12:30 Comment(0)
