I don't quite understand the following:
In the proposed FCN for Semantic Segmentation by Shelhamer et al, they propose a pixel-to-pixel prediction to construct masks/exact locations of objects in an image.
In the slightly modified version of the FCN for biomedical image segmentation, the U-net, the main difference seems to be "a concatenation with the correspondingly cropped feature map from the contracting path."
Now, why does this feature make a difference particularly for biomedical segmentation? The main differences I can point out for biomedical images vs other data sets is that in biomedical images there are not as rich set of features defining an object as for common every day objects. Also the size of the data set is limited. But is this extra feature inspired by these two facts or some other reason?