Related for x86: Why flush the pipeline for Memory Order Violation caused by other logical processors? The observable result will obey x86 ordering rules, but microarchitecturally, yes, a load can execute early. (And of course that's from cache; HW prefetch is a different mechanism.)
OoO exec CPUs truly do reorder load execution: if the address isn't ready for one load, or if it misses in cache, later loads can run before data arrives for it. But on x86, to maintain correctness wrt. the strong memory model (program order + a store buffer with store forwarding), the core checks whether the eventual result was legal according to the ISA's on-paper memory-model guarantees, i.e. that the cache line it loaded from early is still valid and thus still contains the data the load is only now architecturally allowed to read. If not, it nukes the in-flight instructions that depended on this possibly-unsafe speculation and rolls back to a known safe state.
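To make that concrete, here's a minimal litmus-test sketch (the names `x`, `y`, `writer`, `reader` are my invention, not from the question). On x86-64 the acquire/release operations below all compile to plain `mov`, yet the "forbidden" outcome never shows up, even if the core executed the load of `x` before the load of `y`:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};

void writer() {
    x.store(1, std::memory_order_relaxed);
    y.store(1, std::memory_order_release);  // plain mov on x86
}

void reader() {
    int r1 = y.load(std::memory_order_acquire);  // plain mov on x86
    int r2 = x.load(std::memory_order_relaxed);
    // The core may execute the load of x before the load of y. If the
    // line holding x is invalidated before both loads retire, it nukes
    // and replays, so r1 == 1 && r2 == 0 is never observed.
    if (r1 == 1) assert(r2 == 1);
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}
```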
So modern x86 gets the performance of relaxed load ordering (most of the time) while still maintaining the memory-model rules where every load is effectively an acquire load. But that comes at the cost of pipeline nukes if you do something the pipeline doesn't like, e.g. false sharing (which is already bad enough on its own).
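Assuming Intel hardware, a toy microbenchmark along these lines should make those nukes visible as a high MACHINE_CLEARS.MEMORY_ORDERING count in performance counters (the struct layout and loop counts are mine; this is an illustration, not measured code):

```cpp
#include <atomic>
#include <thread>

// Both atomics share one cache line: false sharing between the threads.
struct SharedLine {
    std::atomic<int> a{0};  // hammered by the writer
    std::atomic<int> b{0};  // repeatedly loaded by the reader
};

int main() {
    SharedLine s;
    std::thread writer([&s] {
        for (int i = 0; i < 100000000; ++i)
            s.a.store(i, std::memory_order_relaxed);  // keeps invalidating the line
    });
    std::thread reader([&s] {
        long sink = 0;
        // Loads that executed early from a line that then gets invalidated
        // before they retire have to be nuked and replayed.
        for (int i = 0; i < 100000000; ++i)
            sink += s.b.load(std::memory_order_relaxed);
        (void)sink;
    });
    writer.join();
    reader.join();
}
```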
Other CPUs with a strong memory model (SPARC TSO) might not be this aggressive. Weak memory models (e.g. AArch64, PowerPC) simply allow later loads to complete early, with no speculation to verify and no rollback needed.
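For comparison, the same C++ acquire load costs nothing extra on x86 but needs a special instruction on a weakly-ordered ISA like AArch64 (the codegen comments below are what mainstream compilers typically emit; function names are mine):

```cpp
#include <atomic>

int load_acquire(const std::atomic<int>& v) {
    return v.load(std::memory_order_acquire);
    // x86-64:  mov eax, [rdi]   (every load is already an acquire load)
    // AArch64: ldar w0, [x0]    (explicit load-acquire instruction)
}

int load_relaxed(const std::atomic<int>& v) {
    return v.load(std::memory_order_relaxed);
    // x86-64:  mov eax, [rdi]   (same instruction as the acquire version)
    // AArch64: ldr w0, [x0]     (plain load; may visibly reorder with later loads)
}
```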
Of course this is all reading from cache; demand-load requests are only seen by a memory controller on a cache miss. But HW prefetchers can access memory asynchronously from the CPU; that's how they get data into cache ahead of when the CPU runs an instruction that loads it, ideally avoiding a cache miss at all.
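A sketch of the difference that makes (helper names are mine): the sequential loop below has a stride the HW prefetchers can predict, while the pointer chase gives them nothing to work with, so each of its misses is a demand miss:

```cpp
#include <vector>

// Sequential: the prefetchers see the unit stride and pull lines into
// L1d/L2 ahead of the demand loads, so most of them hit in cache.
long sum_sequential(const std::vector<long>& v) {
    long s = 0;
    for (long x : v) s += x;
    return s;
}

// Pointer chase: the next address depends on the current load's result,
// so the prefetchers can't run ahead and every miss stalls the chain.
struct Node { Node* next; long val; };
long sum_chase(const Node* n) {
    long s = 0;
    for (; n; n = n->next) s += n->val;
    return s;
}
```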
And yes, the memory subsystem is pipelined, like 12 to 16 outstanding requests per core in Skylake. (12 LFBs for L1<->L2, and IIRC 16 superqueue entries in the L2.)
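That pipelining is why memory-level parallelism matters: interleaving several independent dependency chains keeps more of those buffer entries busy at once. A sketch, reusing the hypothetical Node type from above:

```cpp
// Same Node type as above; repeated so this snippet stands alone.
struct Node { Node* next; long val; };

// One chain: the next address isn't known until the current load
// completes, so at most one demand miss is outstanding at a time.
long chase_one(const Node* a) {
    long s = 0;
    for (; a; a = a->next) s += a->val;
    return s;
}

// Four independent chains: up to four misses can be in flight at once
// (still well under Skylake's 12 LFBs), hiding much of the latency.
long chase_four(const Node* a, const Node* b, const Node* c, const Node* d) {
    long s = 0;
    while (a && b && c && d) {
        s += a->val + b->val + c->val + d->val;
        a = a->next; b = b->next; c = c->next; d = d->next;
    }
    return s;
}
```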