When I say array[4] = 12 in a program, I'm really just storing the bit
representation of the memory address into a register. This physical
register in the hardware will turn on the corresponding electrical
signals according to the bit representation I fed it. Those electrical
signals will then somehow magically ( hopefully someone can explain
the magic ) access the right memory address in physical/main memory.
I am not quite sure what you are asking, but I don't see any answers related to what is really going on in the magic of the hardware. Hopefully I understood enough to go through this long-winded explanation (which is still very high level).
array[4] = 12;
So from the comments it sounds like it is understood that you have to get the base address of array, then multiply the index by the size of an array element (or shift, if that optimization is possible) and add the result to the base to get the address (from your program's perspective) of the memory location. Right off the bat we have a problem. Are these items already in registers or do we have to go get them? The base address of array may or may not be in a register depending on the code that surrounds this line, in particular the code that precedes it. That address might be on the stack or in some other location depending on where and how you declared it, and that may or may not matter as to how long it takes. An optimizing compiler may (often) go so far as to pre-compute the address of array[4] and place it somewhere so it can go straight into a register and the multiply never happens at runtime. So it is absolutely not true that the computation of array[4] for a random access takes a fixed amount of time compared to other random accesses. Depending on the processor, some immediate patterns fit in one instruction and others take more, which also factors into whether this address is read from .text or the stack, etc., etc. To not chicken-and-egg that problem to death, assume we have the address of array[4] computed.
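To make the address arithmetic concrete, here is a minimal sketch in C. The helper names (element_address, element_address_shift, matches_compiler) are made up for illustration, and it assumes a flat address space where converting a pointer to uintptr_t gives a plain byte address, which is true on common platforms but not guaranteed by the language.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte address of element `index` in an array that
 * starts at `base` and whose elements are `elem_size` bytes each. */
uintptr_t element_address(uintptr_t base, size_t index, size_t elem_size)
{
    return base + (uintptr_t)(index * elem_size);
}

/* Same computation with a shift instead of a multiply.
 * Only valid when elem_size == 1 << log2_size. */
uintptr_t element_address_shift(uintptr_t base, size_t index, unsigned log2_size)
{
    assert(log2_size < 8 * sizeof(size_t));
    return base + ((uintptr_t)index << log2_size);
}

/* Returns 1 when the hand computation agrees with what the compiler
 * produces for &array[4] (assumes a flat, byte-addressed memory model). */
int matches_compiler(void)
{
    static int32_t array[8];
    uintptr_t base = (uintptr_t)array;
    return element_address(base, 4, sizeof array[0]) == (uintptr_t)&array[4]
        && element_address_shift(base, 4, 2) == (uintptr_t)&array[4];
}
```

With a base of 0x1000 and 4-byte elements, element 4 lands at 0x1010 either way. Whether the compiler emits a multiply, a shift, or a pre-computed constant is up to the compiler and the surrounding code.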
This is a write operation from the programmer's perspective. Start with a simple processor: no cache, no write buffer, no mmu, etc. Eventually the simple processor will put the address on the edge of the processor core, with a write strobe and data. Each processor family's bus is different from the others, but it is roughly the same: the address and data can come out in the same cycle or in separate cycles, and the command type (read, write) can come out at the same time or at a different time, but the command comes out. The edge of the processor core is connected to a memory controller that decodes that address. The result is a destination: is this a peripheral, and if so which one and on what bus; is this memory, and if so on what memory bus; and so on. Assume ram, and assume this simple processor has sram, not dram. Sram is more expensive and faster in an apples-to-apples comparison. The sram has an address, write/read strobes and other controls. Eventually you will have the transaction type (read/write), the address and the data. The sram, whatever its geometry is, will route and store the individual bits in their individual pairs/groups of transistors.
A write cycle can be fire and forget. All the information needed to complete the transaction (this is a write, this is the address, this is the data) is known right then and there. The memory controller can, if it chooses, tell the processor that the write transaction is complete even if the data is nowhere near the memory. That address/data pair will take its time getting to the memory and the processor can keep operating. In some systems, though, the design is such that the processor's write transaction waits until a signal comes back to indicate that the write has made it all the way to the ram. In a fire-and-forget type setup, that address/data pair will be queued up somewhere and work its way to the ram. The queue can't be infinitely deep, otherwise it would be the ram itself, so it is finite, and it is possible and likely that many writes in a row can fill that queue faster than the other end can write to ram. At that point the current and/or next write has to wait for the queue to indicate there is room for one more. So in situations like this, how fast your write happens, and whether your simple processor is I/O bound or not, has to do with prior transactions, which may or may not be write instructions, that preceded the instruction in question.
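The queue behaviour can be sketched with a toy cycle counter in C. Everything here is an assumption for illustration, not a model of any real memory controller: the queue depth of 4, the names write_queue, tick and run_writes, and the idea that the ram retires one queued write every ram_period cycles while the processor tries to issue one write per cycle.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_DEPTH 4u   /* assumed depth; real hardware varies */

struct write_queue {
    unsigned count;          /* writes accepted but not yet in ram */
    unsigned stall_cycles;   /* cycles the processor spent waiting */
};

/* Memory side of one cycle: the ram retires one queued write every
 * `ram_period` cycles (assumed slower than the processor). */
static void tick(struct write_queue *q, unsigned cycle, unsigned ram_period)
{
    if (q->count > 0 && cycle % ram_period == 0)
        q->count--;
}

/* Issue `n_writes` back-to-back writes, one attempt per cycle.
 * Returns how many cycles the processor had to wait for queue space. */
unsigned run_writes(unsigned n_writes, unsigned ram_period)
{
    struct write_queue q = {0, 0};
    unsigned cycle = 0;

    assert(ram_period > 0);
    while (n_writes > 0) {
        cycle++;
        tick(&q, cycle, ram_period);
        if (q.count < QUEUE_DEPTH) {
            q.count++;          /* fire and forget: accepted right away */
            n_writes--;
        } else {
            q.stall_cycles++;   /* queue full: this write has to wait */
        }
    }
    return q.stall_cycles;
}
```

In this sketch run_writes(4, 4) never stalls because the burst fits in the queue, while run_writes(8, 4) stalls once the queue fills. The cost of a given write depends on the writes that came before it.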
Now add some complexity: ECC, or whatever name you want to call it (EDAC is another one). The way an ECC memory works is that the writes are all a fixed size. Even if your implementation is four 8-bit-wide memory parts giving you 32 bits of data per write, you have to have a fixed width that the ECC covers, and you have to write the data bits plus the ecc bits all at the same time (the ecc has to be computed over the full width). So if this was an 8-bit write, for example, into a 32-bit ECC-protected memory, then that write cycle requires a read cycle: read the 32 bits (checking the ecc on that read), modify the new 8 bits in that 32-bit pattern, compute the new ecc pattern, and write the 32 bits plus ecc bits. Naturally that read portion of the write cycle can end up with an ecc error, which just makes life even more fun. Single-bit errors can usually be corrected (what good is an ECC/EDAC if it can't), multi-bit errors not. How the hardware is designed to handle these faults affects what happens next: the read fault may just trickle back to the processor, faulting the write transaction, or it may go back as an interrupt, etc. But here is another place where one random access is not the same as another: depending on the memory being accessed and the size of the access, a read-modify-write definitely takes longer than a simple write.
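The read-modify-write sequence for a narrow write can be sketched in C as below. Note the big assumption: compute_check is a single even-parity bit standing in for a real ECC code (a real part would use something like a Hamming code that can also correct), so this sketch can only detect an error, not fix it. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct ecc_word {
    uint32_t data;    /* the 32 protected data bits */
    uint8_t  check;   /* stand-in check bits (one even-parity bit here) */
};

/* Stand-in for the real ECC computation: parity over all 32 data bits. */
static uint8_t compute_check(uint32_t data)
{
    uint8_t p = 0;
    while (data) {
        p ^= (uint8_t)(data & 1u);
        data >>= 1;
    }
    return p;
}

/* Full-width write: data and check bits go out together, no read needed. */
void write_word(struct ecc_word *w, uint32_t value)
{
    w->data = value;
    w->check = compute_check(value);
}

/* 8-bit write into a 32-bit protected word.
 * Returns 0 on success, -1 if the read half detects an error. */
int write_byte(struct ecc_word *w, unsigned byte_lane, uint8_t value)
{
    assert(byte_lane < 4);

    uint32_t old = w->data;                        /* read */
    if (compute_check(old) != w->check)            /* check the read */
        return -1;

    unsigned shift = 8u * byte_lane;
    uint32_t merged = (old & ~(0xFFu << shift))    /* modify one byte */
                    | ((uint32_t)value << shift);

    write_word(w, merged);                         /* write data + check */
    return 0;
}
```

The point of the sketch is the shape of write_byte: a narrow write costs a read, a check, a merge and a full-width write, and it can fail on the read half even though the program only asked for a write.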
Dram can also fall into this fixed-width category, even without ECC. Actually all memory falls into this category at some point. The memory array is optimized on the silicon for a certain height and width in units of bits. You cannot violate that: at that level the memory can only be read and written in units of that width. The silicon libraries will include many geometries of ram, and the designers will choose those geometries for their parts. The parts will have fixed limits, and often you can use multiple parts to get some integer multiple of that width. Sometimes the design will allow you to write to only one of those parts if only some of the bits are changing, and some designs will force all parts to light up. Notice how with each new ddr family of modules that you plug into your home computer or laptop, the first wave has many parts on both sides of the board. Then as that technology gets older and more boring, it may change to fewer parts on both sides of the board, eventually becoming fewer parts on one side of the board before that technology is obsolete and we are already into the next.
This fixed-width category also carries with it alignment penalties. Unfortunately most folks learn on x86 machines, which don't restrict you to aligned accesses like many other platforms do. There is a definite performance penalty on x86, or on others, for unaligned accesses, if they are allowed at all. It is usually when folks go to a mips, or more often an arm on some battery-powered device, that they first learn as programmers about aligned accesses. And sadly they find them to be painful rather than a blessing (for the simplicity they bring both to programming and to the hardware). In a nutshell, if your memory is, say, 32 bits wide and can only be accessed, read or write, 32 bits at a time, that means it is limited to aligned accesses only. A memory bus on a 32-bit-wide memory usually does not have the lower address bits a[1:0] because there is no use for them; from a programmer's perspective those lower bits are zeros. Now suppose our write was 32 bits against one of these 32-bit memories and the address was 0x1002. Then somebody along the line has to read the memory at address 0x1000, modify that 32-bit value with two of our bytes, and write it back, then read the 32 bits at address 0x1004, modify two bytes, and write it back. That is four bus cycles for a single write. If we were writing 32 bits to address 0x1008, though, it would be a simple 32-bit write, no reads.
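The cycle counting in that example can be written down as a small C function. This is a sketch under the same assumption as the text: a 32-bit-wide memory that can only be read or written a whole aligned word at a time (no byte-lane strobes). The name write_bus_cycles is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BUS_BYTES 4u   /* assumed 32-bit wide memory */

/* Count bus cycles for a write of `size` bytes at byte address `addr`
 * against a memory that only reads/writes whole aligned 32-bit words. */
unsigned write_bus_cycles(uint32_t addr, unsigned size)
{
    assert(size > 0);

    unsigned cycles = 0;
    uint32_t end  = addr + size;                /* one past the last byte */
    uint32_t word = addr & ~(BUS_BYTES - 1u);   /* drop a[1:0] */

    while (word < end) {
        uint32_t lo = addr > word ? addr : word;
        uint32_t hi = end < word + BUS_BYTES ? end : word + BUS_BYTES;

        if (hi - lo == BUS_BYTES)
            cycles += 1;   /* whole word replaced: plain write */
        else
            cycles += 2;   /* partial word: read, then write back */

        word += BUS_BYTES;
    }
    return cycles;
}
```

write_bus_cycles(0x1002, 4) gives 4 (two read-modify-writes) and write_bus_cycles(0x1008, 4) gives 1, matching the numbers above. A memory system with byte enables would do better on the partial words, so treat the counts as belonging to this simplified model only.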
Sram vs dram. Dram is painfully slow, but super cheap: a fraction of the number of transistors per bit (for example four to six for sram, one plus a capacitor for dram). Sram remembers the bit so long as the power is on. Dram has to be refreshed, like a rechargeable battery. Even if the power stays on, a single bit will only be remembered for a very short period of time. So some hardware along the way (ddr controller, etc.) has to regularly perform bus cycles telling that ram to remember a certain chunk of the memory. Those cycles steal time from your processor wanting to access that memory. Dram is very slow. It may say 2133 MHz (2.133 GHz) on the box, but it is really more like 133 MHz ram, right, 0.133 GHz. The first cheat is ddr. Normally things in the digital world happen once per clock cycle. The clock goes to an asserted state then goes to a deasserted state (ones and zeros); one such cycle is one clock. DDR means that it can do something on both the high half cycle and the low half cycle, so that 2133 MHz memory really uses a 1066 MHz clock. Then pipeline-like parallelisms happen: you can shove in commands, in bursts, at that high rate, but eventually that ram has to actually get accessed. Overall dram is non-deterministic and very slow. Sram, on the other hand, requires no refreshes; it remembers so long as the power is on. It can be several times faster (133 MHz * N), and so on. It can be deterministic.
The next hurdle: cache. Cache is good and bad. Cache is generally made from sram. Hopefully you have an understanding of a cache. If the processor or someone upstream has marked the transaction as non-cacheable then it goes through uncached to the memory bus on the other side. If it is cacheable then a portion of the address is looked up in a table and will result in a hit or a miss. This being a write, depending on the cache and/or transaction settings, if it is a miss it may pass through to the other side. If there is a hit then the data will be written into the cache memory. Depending on the cache type it may also pass through to the other side, or that data may sit in the cache waiting for some other chunk of data to evict it, and then it gets written to the other side. Caches definitely make reads, and sometimes writes, non-deterministic. Sequential accesses benefit the most as your eviction rate is lower: the first access in a cache line is slow relative to the others, then the rest are fast. Which is where we get this term of random access anyway. Random accesses go against the schemes that are designed to make sequential accesses faster.
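The hit/miss lookup can be illustrated with a toy direct-mapped cache in C. The sizes (32-byte lines, 64 lines), the allocate-on-every-miss policy, and the names cache, cache_access and run_pattern are all assumptions picked for the sketch; real caches are usually set-associative and have configurable write policies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32u   /* assumed cache line size */
#define NUM_LINES  64u   /* assumed number of lines (2 KB cache) */

struct cache {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
    unsigned hits, misses;
};

void cache_init(struct cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Look up one address. On a miss the line is filled, evicting whatever
 * was in that slot. Returns 1 on a hit, 0 on a miss. */
int cache_access(struct cache *c, uint32_t addr)
{
    uint32_t line  = addr / LINE_BYTES;   /* which line of memory */
    uint32_t index = line % NUM_LINES;    /* which slot in the cache */
    uint32_t tag   = line / NUM_LINES;    /* what identifies the line */

    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
        return 1;
    }
    c->valid[index] = 1;
    c->tag[index] = tag;
    c->misses++;
    return 0;
}

/* Touch `count` addresses starting at `base`, `stride` bytes apart,
 * on a freshly emptied cache. Returns the number of misses. */
unsigned run_pattern(uint32_t base, uint32_t stride, unsigned count)
{
    struct cache c;
    cache_init(&c);
    for (unsigned i = 0; i < count; i++)
        cache_access(&c, base + i * stride);
    return c.misses;
}
```

In this model 64 sequential word accesses (stride 4) miss 8 times, once per line, while 64 accesses spaced a full line apart (stride 32) miss every time. That gap is the sequential-versus-random difference in miniature.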
Sometimes the far side of your cache has a write buffer. A relatively small queue/pipe/buffer/fifo that holds some number of write transactions. Another fire and forget deal, with those benefits.
Multiple layers of caches: l1, l2, l3... L1 is usually the fastest, either by its technology or its proximity, and usually the smallest, and from there the size goes up and the speed goes down, and some of that has to do with the cost of the memory. We are doing a write, but when you do a cache-enabled read understand that if l1 has a miss it goes to l2, which if it has a miss goes to l3, which if it has a miss goes to main memory; then l3, l2 and l1 will all store a copy. So a miss on all three is of course the most painful and is slower than if you had no cache at all, but sequential reads will give you the cached items, which are now in l1 and super fast. For the cache to be useful, sequential reads over the cache line should take less time overall than reading that much memory directly from the slow dram. A system doesn't have to have three layers of caches; it can vary. Likewise some systems can separate instruction fetches from data reads and can have separate caches which don't evict each other, and in some the caches are not separate and instruction fetches can evict data from data reads.
Caches help with alignment issues. But of course there is an even more severe penalty for an unaligned access across cache lines. Caches tend to operate on chunks of memory called cache lines. These are often some integer multiple in size of the width of the memory on the other side; for a 32-bit memory the cache line might be 128 bits or 256 bits, for example. So if and when the cache line is in the cache, a read-modify-write due to an unaligned write is against faster memory, still more painful than aligned but not as painful. If it were an unaligned read and the address was such that part of that data is on one side of a cache line boundary and the rest on the other, then two cache lines have to be read. A 16-bit read, for example, can cost you many bytes read against the slowest memory, obviously several times slower than if you had no caches at all. Depending on how the caches and the memory system in general are designed, a write across a cache line boundary may be similarly painful, or perhaps not as much: one fraction might be written to the cache and the other fraction go out on the far side as a smaller-sized write.
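Whether an access straddles a cache line boundary is simple arithmetic, sketched here in C. The 32-byte (256-bit) line size and the name lines_touched are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 32u   /* assumed cache line size (256 bits) */

/* Number of cache lines touched by an access of `size` bytes at `addr`. */
unsigned lines_touched(uint32_t addr, unsigned size)
{
    assert(size > 0);
    uint32_t first = addr / LINE_BYTES;                 /* line of first byte */
    uint32_t last  = (addr + size - 1u) / LINE_BYTES;   /* line of last byte */
    return (unsigned)(last - first + 1u);
}
```

A 16-bit read at 0x101F touches two lines, because its second byte falls at 0x1020 in the next line, so on a double miss it can pull in 64 bytes from slow memory to deliver 2. The same read at 0x1010 touches one line.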
The next layer of complexity is the mmu, allowing the processor and programmer the illusion of flat memory spaces, and/or control over what is cached or not, and/or memory protection, and/or the illusion that all programs are running in the same address space (so your toolchain can always compile/link for address 0x8000, for example). The mmu takes a portion of the virtual address on the processor core side and looks that up in a table, or series of tables. Those tables are often in system address space, so each one of those lookups may be one or more of everything stated above, as each is a memory cycle on the system memory. Those lookups can result in ecc faults even though you are trying to do a write. Eventually, after one or two or three or more reads, the mmu has determined what the address on the other side of the mmu is, and the properties (cacheable or not, etc.), and that is passed on to the next thing (l1, etc.) and all of the above applies. Some mmus have a bit of a cache in them holding some number of prior lookups. Remember, because programs are sequential, the tricks used to boost the illusion of memory performance are based on sequential accesses, not random accesses. So some number of lookups might be stored in the mmu so it doesn't have to go out to main memory right away...
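A rough sketch of that lookup in C, assuming a single-level page table, 4 KB pages, and a tiny four-entry cache of prior lookups with round-robin replacement. All of the sizes and names (mmu, translate, table_reads) are invented for the example; real mmus use multi-level tables and also carry the per-page properties mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u   /* assumed 4 KB pages */
#define NUM_PAGES  16u   /* assumed tiny virtual address space */
#define TLB_SIZE   4u    /* assumed number of cached lookups */

struct mmu {
    uint32_t page_table[NUM_PAGES];  /* virtual page -> physical page,
                                        pretend this lives in main memory */
    uint32_t tlb_vpn[TLB_SIZE];
    uint32_t tlb_ppn[TLB_SIZE];
    uint8_t  tlb_valid[TLB_SIZE];
    unsigned next;                   /* round-robin replacement slot */
    unsigned table_reads;            /* memory cycles spent on lookups */
};

/* Translate a virtual address to a physical address. */
uint32_t translate(struct mmu *m, uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1u);

    assert(vpn < NUM_PAGES);

    /* Recently used translation: no trip to system memory. */
    for (unsigned i = 0; i < TLB_SIZE; i++)
        if (m->tlb_valid[i] && m->tlb_vpn[i] == vpn)
            return (m->tlb_ppn[i] << PAGE_SHIFT) | offset;

    /* Otherwise the lookup itself is a read of system memory. */
    uint32_t ppn = m->page_table[vpn];
    m->table_reads++;

    m->tlb_vpn[m->next]   = vpn;
    m->tlb_ppn[m->next]   = ppn;
    m->tlb_valid[m->next] = 1;
    m->next = (m->next + 1u) % TLB_SIZE;

    return (ppn << PAGE_SHIFT) | offset;
}
```

Two accesses to the same page cost one table read between them, while an access to a new page costs another read before the actual write can even start, and that read is itself subject to everything described for caches, ecc and dram.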
So in a modern computer with mmus, caches and dram, sequential reads in particular, but also writes, are likely to be faster than random access. The difference can be dramatic. The first transaction in a sequential read or write is at that moment a random access, as it has not been seen ever or for a while. Once the sequence continues, though, the optimizations fall into place and the next few/some are noticeably faster. The size and alignment of your transaction play an important role in performance as well. While there are so many non-deterministic things going on, as a programmer with this knowledge you can modify your programs to run much faster, or, if unlucky or on purpose, modify your programs to run much slower. Sequential is going to be, in general, faster on one of these systems. Random access is going to be very non-deterministic. array[4]=12; followed by array[37]=12; Those two high-level operations could take dramatically different amounts of time, both in the computation of the write address and in the actual writes themselves. But for example discarded_variable=array[3]; array[3]=11; array[4]=12; can quite often execute significantly faster than array[3]=11; array[4]=12;