fyi: I'm not sure I understood you correctly, and you may already know/understand everything I've written below, maybe better than me. Drop me a note, and I'll remove the answer.
1/2a: The hardware used is "just" some extra registers and logic circuits that form/inject additional (orthogonal!) states into the standard JTAG state machine.
If you understand how the JTAG protocol performs the boundary scan and how the bitstream is pushed into/pulled from the device, you should be able to imagine how it is used to, e.g., program on-chip memory banks. Imagine typical daisy-chaining, not between chips but rather inside a chip.
Let's say that device has some programmable persistent memory. With a few more flops and gates, the device forms an extra buffer before or after the JTAG chain of the actual memory:
input -> xflops -> memory -> yflops -> output
Let's say that x/mem/y = 16/1024/0. Now the chain has 1040 bits. The preceding xflops do not directly affect the memory, nor vice versa. The xflops might now be linked to the control lines of the built-in internal programmer that drives the memory.
input -> progcmd -> memory -> output
The logic circuit inside the chip can now react to some 16-bit 'magic number', a.k.a. a "write command", that triggers the procedure of writing/erasing the persistent memory. Any other 16-bit value is ignored, and the device behaves like 1024 bits of read-only data followed by a 16-bit echo or zeroes.
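To make the 16 + 1024 chain a bit more concrete, here is a minimal Python sketch of it. Everything in it is made up for illustration: the class `ScanChainDevice`, the helper `frame`, the capture/shift/update split and the magic value `0xA55A` are my assumptions, not any real chip's protocol.

```python
CMD_BITS = 16
MEM_BITS = 1024
WRITE_MAGIC = 0xA55A  # hypothetical "write command" value, not from any real device


class ScanChainDevice:
    """Toy model of: input -> progcmd (16 flops) -> memory (1024 flops) -> output."""

    def __init__(self):
        self.persistent = [0] * MEM_BITS          # the "real" persistent memory
        self.chain = [0] * (CMD_BITS + MEM_BITS)  # shift register, index 0 is next to input

    def capture(self):
        # Load the chain: command flops cleared, memory flops show the stored data.
        self.chain = [0] * CMD_BITS + list(self.persistent)

    def shift(self, bits_in):
        # Clock bits through the chain, one per cycle; return the bits that fall out.
        bits_out = []
        for b in bits_in:
            bits_out.append(self.chain[-1])
            self.chain = [b] + self.chain[:-1]
        return bits_out

    def update(self):
        # The internal programmer reacts only to the magic command; all else is ignored.
        cmd = int("".join(map(str, self.chain[:CMD_BITS])), 2)
        if cmd == WRITE_MAGIC:
            self.persistent = list(self.chain[CMD_BITS:])


def frame(cmd, data):
    # Bits in the order they must be clocked in: the first bit sent travels
    # farthest, so the memory data goes first and the command last (all reversed).
    cmd_bits = [int(c) for c in format(cmd, "016b")]
    return list(reversed(cmd_bits + data))


dev = ScanChainDevice()
data = [i % 2 for i in range(MEM_BITS)]

# Write: shift 1040 bits in, then let the device look at the command flops.
dev.capture()
dev.shift(frame(WRITE_MAGIC, data))
dev.update()

# Read back with a non-magic command: memory is untouched and simply shifted out.
dev.capture()
out = dev.shift(frame(0x0000, [0] * MEM_BITS))
dev.update()
readback = list(reversed(out[:MEM_BITS]))
```

Note that the host has to send the memory bits first and the command last, because the first bit clocked in ends up farthest from the input. With a non-magic command the device just hands back its 1024 memory bits followed by the 16 command flops.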
Ok, so we have a simple on-device 'controller' that performs operations on a 'real device'. You can extend the idea, e.g. by giving the controller states that control which subdevices are attached to the chain, on the fly:
default chain after reset is:
input -> progcmd -> output
if the controller now gets ENABLE_WRITE, it attaches MEM to the chain
input -> progcmd -> memory -> output
then the controller reacts to WRITE and ABORTs on everything else, detaching MEM again
input -> progcmd -> output
when the controller gets VERIFY, it reattaches MEM, but in READONLY mode
input -> progcmd -> memory -> output
etc
It is of course just an extra state machine. In a similar way you could do almost any fancy operation, including debugging: freezing, stepping, reading/writing registers, etc. But all of this requires tons of extra logic to be built into the chip in question. Actually, it's like having several devices in one chip.
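The attach/detach sequence described above can be sketched as a tiny state machine in Python. The state names, the `Controller` class and the rule that any command ends a VERIFY are my own assumptions for the sake of the example; a real chip will define its own states and commands.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()         # chain: input -> progcmd -> output
    WRITE_ARMED = auto()  # chain: input -> progcmd -> memory -> output (writable)
    VERIFYING = auto()    # chain: input -> progcmd -> memory -> output (read-only)


class Controller:
    """Hypothetical on-chip controller that reshapes the scan chain on the fly."""

    def __init__(self):
        self.state = State.IDLE
        self.writes_done = 0

    def chain(self):
        # Which sub-devices currently sit between input and output.
        if self.state is State.IDLE:
            return ["progcmd"]
        return ["progcmd", "memory"]

    def memory_writable(self):
        return self.state is State.WRITE_ARMED

    def command(self, name):
        # Advance the state machine on one received command word.
        if self.state is State.IDLE:
            if name == "ENABLE_WRITE":
                self.state = State.WRITE_ARMED
            elif name == "VERIFY":
                self.state = State.VERIFYING
            # anything else is ignored while idle
        elif self.state is State.WRITE_ARMED:
            if name == "WRITE":
                self.writes_done += 1  # stand-in for the real programming step
            # WRITE completes, everything else aborts; both detach the memory
            self.state = State.IDLE
        else:
            # VERIFYING: assume any command ends the read-back and detaches memory
            self.state = State.IDLE


ctl = Controller()
trace = [ctl.chain()]
for name in ["ENABLE_WRITE", "WRITE", "VERIFY"]:
    ctl.command(name)
    trace.append(ctl.chain())
```

After the loop, `trace` holds the chain layout after reset and after each command, in the same order as the sequence listed above.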
2b: Unfortunately, I cannot say more, because I'm too green in the subject ;) I know that many manufacturers form their own internal standards, and the 'controllers' are simply shared between models and sometimes families of chips, but I've not heard of any 'global' standard common between manufacturers.