Yes, out of order execution can definitely cause memory reordering, such as load/load re-ordering
It is not so much a question of the loads being retired in order, as of when the load value is bound to the load instruction. Eg Load1 may precede Load2 in program order, Load2 gets its value from memory before Load1 does, and eg if there is an intervening store to the location read by Load2, then Load/load reordering has occurred.
However, certain systems, such as Intel P6 family systems, have additional mechanisms to detect such conditions to obtain stronger memory order models.
In these systems all loads are buffered until retirement, and if a possible store is detected to such a buffered but not yet retired load, then the load and program order instructions are “nuked”, and execution is resumed art, e.g., Load2.
I call this Freye’s Rule snooping, after I learned that Brad Freye at IBM had invented it many years before I thought I had. I believe the standard academic reference is Gharachorloo.
I.e. it is not so much buffering loads until retirement, as it is providing such a detection and correction mechanism associated with buffering loads until retirement. Many CPUs provide buffering until retirement but do not provide this detection mechanism.
Note also that this requires something like snoop based cache coherence. Many systems, including Intel systems that have such mechanisms also support noncoherent memory, e.g. memory that may be cached but which is managed by software. If speculative loads are allowed to such cacheable but non-coherent memory regions, the Freye’s Rule mechanism will not work and memory will be weakly ordered.
Note: I said “buffer until retirement”, but if you think about it you can easily come up with ways of buffering not quite until retirement. E.g. you can stop this snooping when all earlier loads have them selves been bound, and there is no longer any possibility of an intervening store being observed even transitively.
This can be important, because there is quite a lot of performance to be gained by “early retirement“, removing instructions such as loads from buffering and repair mechanisms before all earlier instructions have retired. Early retirement can greatly reduce the cost of out of order hardware mechanisms.