For this particular image, you need not work in full color space, but instead can work on the intensity alone (the "V" part of HSV - "value," meaning intensity).
Whether you use Value space or Hue space, as Penelope mentioned, will depend on the natural images you produce for your real objects. For the general case, you may need to use a combination of hue and value (intensity) to segment images properly. Rather than work in hue-value vector space, it's more straightforward to work in the H and V image planes separately and then combine results. (Segmentation in 3D vector spaces is certainly possible, but would probably be unnecessarily complicated for this project.)
The watershed algorithm in OpenCV could be a good match for your needs.
http://www.seas.upenn.edu/~bensapp/opencvdocs/ref/opencvref_cv.htm
One word of caution about Otsu's method: it's fine for separating two modes when a histogram of intensity values (or hue values) is a bimodal distribution, but for natural images it's not common to have true bimodal distributions. If the background and/or foreground objects vary in intensity and/or hue from one side of an object to another, then Otsu can perform poorly.
Otsu can certainly be extended for multiple modes, as is explained in Digital Image Processing by Gonzalez and Woods and other introductory textbooks on the subject. However, a background gradient will cause problems even if you use Otsu to separate one pair of modes at a time.
You also want to ensure that if your camera lens zooms in or out, you'll still find the same binarization thresholds. The basic Otsu technique uses all pixels in the image histogram. That means that you could scramble all of the pixels in the image to produce pure noise with the same image histogram as your original image, and Otsu's method would generate the same threshold.
One common trick is to rely on pixels near edges. In your example, each object can be treated as a region with sharp edges, sharp corners, and (hopefully) uniform HSV values. Sampling pixels near edges can be done in several ways, including the following:
- Find strong edge points (using Canny or some simpler technique). Along the direction of the edge gradient, and at distances +/- D from the edge point, sample the gray levels of the (relative) foreground and (relative) background. Distance D should be much smaller than the size of the objects in question.
- Find strong edge points. Use the gray levels at the edge points themselves as estimates of the likely desired threshold. In your example, you'll end up with two strong peaks: one at the edge between object1 and object2, and the other at the edge between object2 and object3.
Since your objects have corners, you can use those to help identify object boundaries and/or edge pixels suitable for sampling.
If you have nominally rectangular objects, you could also use a Hough edge or RANSAC edge algorithm to identify lines in the image, find intersections at corners, etc.
All that said, for nearly any natural image involving objects stacked on top of each other you're going to run into several complications:
- Shadows
- Color and intensity gradients across an object of nominally consistent color
- Edges of varying sharpness if objects are at varying distances from the optical system
If you know for certain how many objects are present, you can try a K Means technique.
http://aishack.in/tutorials/knearest-neighbors-in-opencv/
For more complex segmentation tasks, such as when the number of objects isn't known, you can use the Mean Shift technique, though I'd recommend trying simpler techniques first.
The first and easiest fix is proper lighting. To reduce reflections and shadows, use diffuse lighting. For many applications, the closest to ideally diffuse lighting is "cloudy day" lighting:
http://www.microscan.com/en-us/products/nerlite-machine-vision-lighting/cdi-illuminators.aspx
More simply, you could try one or more "bounce" lights such as those used in studio photography.
http://www.photography.com/articles/taking-photos/bounce-lighting/