Ground truth pixel labels in PASCAL VOC for semantic segmentation

I'm experimenting with FCN (Fully Convolutional Networks) and trying to reproduce the results reported in the original paper (Long et al., CVPR'15).

In that paper the authors report results on the PASCAL VOC dataset. After downloading and untarring the 2012 train-val dataset (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar), I noticed there are 2913 PNG files in the SegmentationClass subdirectory and the same number of files in the SegmentationObject subdirectory.

The pixel values in these PNG files seem to be multiples of 32 (e.g. 0, 128, 192, 224...), which don't fall in the range 0 to 20. I'm just wondering what the correspondence is between these pixel values and the ground-truth labels for pixels. Or am I looking at the wrong files?

Derayne answered 3/4, 2018 at 12:18 Comment(2)
I recently reproduced these FCN results and it worked fine. How do you read your images? Have you resized them? I once did it mindlessly and messed up the labels because of interpolation or averaging during resizing...Tracery
Did you figure this out? I too see lots of 224 values in the raw byte data. I don't see 224 anywhere in the color map. Does it mean they're undefined? VOC_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0], [0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128], [64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0], [64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128], [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0], [0, 64, 128]]Stead

I just downloaded PASCAL VOC. The pixel values in the dataset are as follows:

  • 0: background

  • [1 .. 20] interval: segmented objects, classes [Aeroplane, ..., Tvmonitor]

  • 255: void category, used for border regions (5px) and to mask difficult objects

You can find more info on the PASCAL VOC website (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/).

The previous answer by captainist discusses PNG files saved with color palettes; I think it's not related to the original question. The linked TensorFlow code simply loads a PNG that was saved with a color map (palette), converts it to a NumPy array (at this step the color palette is lost), then saves the array as a PNG again. The numerical values are not changed in this process; only the color palette is removed.
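If it helps, here is a minimal sanity check along these lines (the file name is just an example from the segmentation set; the class list follows the standard VOC ordering):

import numpy as np
from PIL import Image

# Standard VOC ordering: index 0 is background, indices 1..20 are the classes.
VOC_CLASSES = [
    'background', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus',
    'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
    'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']

# Example file; any annotation from VOCdevkit/VOC2012/SegmentationClass works.
label = np.array(Image.open('VOCdevkit/VOC2012/SegmentationClass/2007_000129.png'))

for v in np.unique(label):
    # 255 marks the void regions (object borders and difficult pixels).
    print(v, 'void' if v == 255 else VOC_CLASSES[v])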

Tho answered 19/3, 2019 at 17:17 Comment(0)

I know this question was asked some time ago, but I ran into a similar question when trying out PASCAL VOC 2012 with TensorFlow DeepLab.

If you look at the file download_and_convert_voc2012.sh, there are lines marked by "# Remove the colormap in the ground truth annotations". This part processes the original SegmentationClass files and produces the raw segmentation files, in which each pixel value lies between 0 and 20 (plus 255 for the void label). (If you wonder why, check this post: Python: Use PIL to load png file gives strange results.)

Pay attention to this magic function:

import numpy as np
from PIL import Image

def _remove_colormap(filename):
  """Removes the color map from the annotation.

  Args:
    filename: Ground truth annotation filename.

  Returns:
    Annotation without color map.
  """
  return np.array(Image.open(filename))

I have to admit that I do not fully understand the operation by

np.array(Image.open(filename))
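As far as I can tell, it works because the annotation PNG is stored in palette ('P') mode: each pixel already holds the class index, and np.array() reads those indices directly instead of the display colors. A quick check, with an example path:

import numpy as np
from PIL import Image

# Example path; any file from SegmentationClass works.
img = Image.open('SegmentationClass/2007_000129.png')
print(img.mode)           # 'P' -> each pixel stores a palette index

# np.array() on a 'P'-mode image keeps the indices, i.e. the class labels.
labels = np.array(img)
print(np.unique(labels))  # e.g. [  0   2  15 255]

# Converting to RGB first would return the display colors looked up from
# the palette instead, which is where values like 128 or 224 come from.
rgb = np.array(img.convert('RGB'))
print(np.unique(rgb))     # e.g. [  0 128 192 224]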

Below is a set of images for your reference (from top to bottom: original image, segmentation class, and segmentation raw class):

[images omitted]

Cabinetwork answered 3/8, 2018 at 15:15 Comment(0)

The values mentioned in the original question look like the "color map" values, which can be obtained with the getpalette() function from PIL's Image module.

To check the annotated values of the VOC images, I use the following code snippet:

import numpy as np
from PIL import Image

files = [ 
        'SegmentationObject/2007_000129.png',
        'SegmentationClass/2007_000129.png',
        'SegmentationClassRaw/2007_000129.png', # processed by _remove_colormap()
                                                # in captainst's answer...
        ]

for f in files:
    img = Image.open(f)
    annotation = np.array(img)
    print('\nfile: {}\nanno: {}\nimg info: {}'.format(
        f, set(annotation.flatten()), img))

The three images used in the code are shown below (left to right, respectively):

[image omitted: the SegmentationObject, SegmentationClass, and SegmentationClassRaw annotations, left to right]

The corresponding outputs of the code are as follows:

file: SegmentationObject/2007_000129.png
anno: {0, 1, 2, 3, 4, 5, 6, 255}
img info: <PIL.PngImagePlugin.PngImageFile image mode=P size=334x500 at 0x7F59538B35F8>

file: SegmentationClass/2007_000129.png
anno: {0, 2, 15, 255}
img info: <PIL.PngImagePlugin.PngImageFile image mode=P size=334x500 at 0x7F5930DD5780>

file: SegmentationClassRaw/2007_000129.png
anno: {0, 2, 15, 255}
img info: <PIL.PngImagePlugin.PngImageFile image mode=L size=334x500 at 0x7F5930DD52E8>

There are two things I learned from the above output.

First, the annotation values of the images in the SegmentationObject folder are assigned per object instance. In this case there are 3 people and 3 bicycles, so the annotated values run from 1 to 6. For images in the SegmentationClass folder, however, the values are assigned by the class of the objects: all the people belong to class 15 and all the bicycles to class 2.

Second, as mkisantal has already mentioned, the np.array() operation removes the color palette (I "know" it by observing the results but I still don't understand the mechanism under the hood...). We can confirm this by checking the image mode of the outputs (and by inspecting the palette directly, as sketched after the list below):

  • Both the SegmentationObject/2007_000129.png and SegmentationClass/2007_000129.png have image mode=P while

  • SegmentationClassRaw/2007_000129.png has image mode=L. (ref: The modes of PIL Image)
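For completeness, the palette itself can be inspected with getpalette(); here is a small sketch reusing the example files above:

from PIL import Image

# Reusing the example annotation files from the snippet above.
p_img = Image.open('SegmentationClass/2007_000129.png')     # mode 'P'
l_img = Image.open('SegmentationClassRaw/2007_000129.png')  # mode 'L'

# A 'P'-mode image stores one index per pixel plus an RGB look-up table.
palette = p_img.getpalette()        # flat list [R0, G0, B0, R1, G1, B1, ...]
print(palette[15 * 3:15 * 3 + 3])   # display color of class 15 (person)

# The 'L'-mode (grayscale) raw file keeps the same pixel values but has no
# palette, which is why it looks almost black in a regular image viewer.
print(l_img.getpalette())           # None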

Scaleboard answered 25/4, 2020 at 12:23 Comment(4)
"I "know" it by observing the results but I still don't understand the mechanism under the hood..." - The thing is saving color requires 3 bytes, one for R,G,B each. When u have only 20 classes in whole dataset, i.e. 20 colors for foreground and 1 color for background, u have only 21 unique values for each pixel, which can be encoded in at most, 5 bits(2^5=32) so why use 3 bytes. Now, this can be done using PNG image, where you can store LUT, i.e. LOOK UP TABLE of colors.Mangrove
If you check the raw pixel values, they store 0, 1, 20, 255, etc. But most image viewers read the image and then apply the LUT to those values to give you, say, the red color. The look-up table holds only 256 color entries, i.e. 0 maps to black, 1 maps to color X, and so on, and it is stored in the image metadata.Mangrove
Finally, you can inspect the image metadata by running identify -verbose <imagepath>. identify is an ImageMagick command and comes preinstalled on Ubuntu.Mangrove
P.S. This LUT is called a palette in the PIL library.Mangrove
