In my opinion, the reason random cropping helps as data augmentation is that while the semantics of the image are preserved (unless you pick a really bad crop, but let's assume you set up your random cropping so that this has very low probability), the activation values you get in your conv net are different. So in effect our conv net learns to associate a broader range of spatial activation statistics with a given class label, which improves the robustness of our feature detectors. In the same vein, each random crop produces different intermediate activations and a different forward pass, so it's like a "new" training point. A toy sketch of this is below.
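To make the "different forward pass" point concrete, here's a toy sketch: two overlapping crops of the same image, one fixed filter, and the resulting activation maps come out different. The conv2d_valid helper and the Sobel-style kernel are just illustrative stand-ins I made up for the example, not anything from a particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy grayscale "image" and a fixed 3x3 edge-detector-style filter.
img = rng.random((32, 32))
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

def conv2d_valid(x, k):
    """Minimal valid-mode 2D cross-correlation (what conv layers compute)."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

# Two random 24x24 crops of the same image: same semantics, but the
# filter responses (activation maps) differ.
crop_a = img[0:24, 0:24]
crop_b = img[5:29, 3:27]
act_a = conv2d_valid(crop_a, kernel)
act_b = conv2d_valid(crop_b, kernel)
print(np.allclose(act_a, act_b))  # False: different forward passes
```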
It's also not trivial. See the recent work on adversarial examples in neural networks (from relatively shallow nets up to AlexNet-sized ones). Images that look semantically the same, more or less, can yield drastically different class probabilities when we pass them through a neural net with a softmax classifier on top. So changes that are subtle from a semantic point of view can still produce very different forward passes through a conv net. For more details see Intriguing properties of neural networks.
To answer the last part of your question: I usually just write my own random cropping script. Say my images are (3, 256, 256) (3 RGB channels, 256x256 spatial size); you can code up a loop which takes 224x224 random crops of your image by randomly selecting a valid corner point. So I typically compute an array of valid corner points, and if I want to take 10 random crops, I randomly select 10 different corner points from this set. Say I choose (x0, y0) as my upper left hand corner point; I then take the crop X[:, x0:x0+224, y0:y0+224] (note the leading ':' to keep all channels in a channels-first layout). I personally like to draw without replacement from a pre-computed set of valid corner points instead of sampling a corner one draw at a time, because this way I guarantee I never get a duplicate crop, though in reality a duplicate is pretty unlikely anyway. A sketch of this is below.
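Here's a minimal sketch of that scheme in NumPy. The function name random_crops and the (C, H, W) channels-first layout are my assumptions for the example; the idea is just to precompute the valid corner points and sample them without replacement so no crop is duplicated.

```python
import numpy as np

def random_crops(X, crop_size=224, n_crops=10, rng=None):
    """Take n_crops distinct random crops from a channels-first image X.

    X is assumed to be a numpy array of shape (C, H, W), e.g. (3, 256, 256).
    Corner points are drawn without replacement from the set of all valid
    upper-left corners, so no two crops are identical.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, H, W = X.shape
    # All valid upper-left corner points for a crop_size x crop_size crop.
    corners = [(x, y)
               for x in range(H - crop_size + 1)
               for y in range(W - crop_size + 1)]
    # Sample distinct corners (without replacement) to guarantee no duplicates.
    idx = rng.choice(len(corners), size=n_crops, replace=False)
    crops = []
    for i in idx:
        x0, y0 = corners[i]
        # Leading ':' keeps all channels; slice only the spatial dims.
        crops.append(X[:, x0:x0 + crop_size, y0:y0 + crop_size])
    return crops

# Example: ten distinct 224x224 crops from a 3x256x256 image.
X = np.random.rand(3, 256, 256).astype(np.float32)
crops = random_crops(X, crop_size=224, n_crops=10)
print(len(crops), crops[0].shape)  # 10 (3, 224, 224)
```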