The following are some examples of replacement policies used in actual processors.
The PowerPC 7450's 8-way L1 cache used binary tree pLRU. Binary tree pLRU uses one bit per pair of ways to record which way of that pair is less recently used, then one bit for each pair of pairs, and so on up the tree. The 8-way L2 used pseudo-random replacement; privileged software (the OS) could select either a 3-bit counter incremented every clock cycle or a shift-register-based pseudo-random number generator as the source of the victim way.
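To make the binary tree pLRU description concrete, here is a minimal sketch of the bookkeeping for one cache set. It is a generic illustration, not the PowerPC 7450's actual logic; the class name `TreePLRU`, the heap-style bit layout, and the bit polarity (0 means "victim is in the lower half") are all my own assumptions.

```python
class TreePLRU:
    """Binary tree pseudo-LRU state for one set (illustrative sketch only)."""

    def __init__(self, ways=8):
        assert ways >= 2 and ways & (ways - 1) == 0, "ways must be a power of two"
        self.ways = ways
        # ways - 1 bits stored heap-style: node n has children 2n+1 and 2n+2.
        # Assumed polarity: 0 = victim is in the lower half, 1 = upper half.
        self.bits = [0] * (ways - 1)

    def touch(self, way):
        """On an access, make every bit on the path point away from `way`."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1      # accessed lower half, so victim is upper
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0      # accessed upper half, so victim is lower
                node, lo = 2 * node + 2, mid

    def victim(self):
        """Follow the bits from the root down to the way to replace."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo


plru = TreePLRU(8)
for w in range(8):
    plru.touch(w)
plru.touch(0)
print(plru.victim())
```

After touching ways 0 through 7 and then way 0 again, this sketch picks way 4 as the victim, whereas true LRU would pick way 1. That gap is the "pseudo" in pLRU: only 7 bits per set are kept instead of a full ordering of the 8 ways.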
The StrongARM SA-1110's 32-way L1 data cache used FIFO. It also had a 2-way minicache for transient data, which also seems to have used FIFO. (The Intel StrongARM SA-1110 Microprocessor Developer’s Manual states: "Replacements in the minicache use the same round-robin pointer mechanism as in the main data cache. However, since this cache is only two-way set-associative, the replacement algorithm reduces to a simple least-recently-used (LRU) mechanism." However, 2-way FIFO is not the same as LRU even with only two ways, though for streaming data it works out the same.)
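The difference between 2-way FIFO and 2-way LRU can be seen with a tiny simulation of a single set. This is only a sketch; the `simulate` function and the block names are made up for illustration and are unrelated to the SA-1110's actual hardware.

```python
def simulate(trace, policy, ways=2):
    """Return the blocks evicted from one set under 'fifo' or 'lru'."""
    resident = []   # position 0 is the next victim
    evicted = []
    for block in trace:
        if block in resident:
            if policy == "lru":
                # A hit refreshes the block under LRU; FIFO ignores hits.
                resident.remove(block)
                resident.append(block)
            continue
        if len(resident) == ways:
            evicted.append(resident.pop(0))
        resident.append(block)
    return evicted


# Reuse of A before the miss on C: the two policies disagree.
print(simulate("ABAC", "fifo"), simulate("ABAC", "lru"))
# Pure streaming (no reuse): the two policies agree.
print(simulate("ABCD", "fifo"), simulate("ABCD", "lru"))
```

With the trace A, B, A, C, FIFO evicts A (the oldest insertion) while LRU evicts B (the least recently used). With a streaming trace in which no block is reused, both evict the same blocks in the same order.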
The HP PA 7200 had a 64-block fully associative "assist cache" that was accessed in parallel with an off-chip direct-mapped data cache. The assist cache used FIFO replacement with the option of evicting to the off-chip L1 cache. Load and store instructions had a "locality only" hint; if an assist cache entry was loaded by such a memory access, it would be evicted to memory bypassing the off-chip L1.
For 2-way associativity, true LRU might be the most common choice since it has good behavior (and, incidentally, is the same as binary tree pLRU when there are only two ways). E.g., the Fairchild Clipper Cache And Memory Management Unit used LRU for its 2-way cache. FIFO is slightly cheaper than LRU since the replacement information is only updated when the tags are written anyway, i.e., when inserting a new cache block, but has better behavior than counter-based pseudo-random replacement (which has even lower overhead). The HP PA 7300LC used FIFO for its 2-way L1 caches.
The Itanium 9500 series (Poulson) uses NRU for the L1 and L2 data caches, the L2 instruction cache, and the L3 cache (the L1 instruction cache is documented as using LRU). For the 24-way L3 cache in the Itanium 2 6M (Madison), one bit per block was provided for NRU, with an access to a block setting the bit corresponding to its set and way ("Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache", Stefan Rusu et al., 2004). This is similar to the clock page replacement algorithm.
I seem to recall reading elsewhere that the bits were cleared when all were set (rather than keeping the bit of the access that set the last unset bit) and that the victim was chosen by a find-first-unset scan of the bits. This would have the hardware advantage of only having to read the replacement information (which was stored in arrays distinct from, but near, the L3 tags) on a cache miss; a cache hit could simply set the appropriate bit. Incidentally, this type of NRU avoids some of the bad traits of true LRU (e.g., LRU degrades to FIFO in some cases, and in some of these cases even random replacement can increase the hit rate).
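Here is a minimal sketch of the NRU variant as I recalled it above: a hit only sets a bit, and the bits are read (and cleared if all are set) only on a miss. This is my reading of the scheme, not a description taken from Itanium documentation; the class name `NRUSet`, the choice to clear the bits at miss time, and the 4-way size in the example are assumptions made for illustration.

```python
class NRUSet:
    """One cache set with not-recently-used replacement (illustrative sketch)."""

    def __init__(self, ways):
        self.tags = [None] * ways
        self.used = [False] * ways   # one NRU bit per way

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then filled)."""
        if tag in self.tags:
            # Hit: write-only update, the bits never need to be read.
            self.used[self.tags.index(tag)] = True
            return False or True
        # Miss: the only time the bits are read.
        if all(self.used):
            # Assumed behavior: clear every bit, including the most recent one.
            self.used = [False] * len(self.used)
        way = self.used.index(False)  # find-first-unset scan
        self.tags[way] = tag
        self.used[way] = True
        return False


s = NRUSet(4)
for t in "abcd":
    s.access(t)        # four cold misses fill ways 0..3
s.access("a")          # hit
s.access("e")          # all bits set, so they are cleared and way 0 is replaced
print(s.tags)
```

Note that in this sketch the miss on `e` evicts `a` even though `a` was just accessed, because clearing all the bits discards that information. Keeping the last-set bit instead would avoid that particular case at the cost of slightly more logic.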