The fetch block is responsible for loading instructions from memory. It first checks whether the instruction the CPU needs is already in the L1 instruction cache. If not, it looks in the L2 cache. If the instruction is not in the L2 cache either, it must be loaded directly from RAM.
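Purely as an illustration, here is a small C sketch of that lookup order; the cache sizes, addresses and latencies are invented and do not match any real CPU.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of the fetch path; sizes and latencies are made up. */
enum { L1_LINES = 4, L2_LINES = 16 };

static int l1_tag[L1_LINES];   /* addresses currently held in the L1 I-cache */
static int l2_tag[L2_LINES];   /* addresses currently held in the L2 cache   */

static bool in_cache(const int *tags, int lines, int addr)
{
    for (int i = 0; i < lines; i++)
        if (tags[i] == addr)
            return true;
    return false;
}

/* The fetch block tries the fastest memory first, falling back level by level. */
static int fetch_cost(int addr)
{
    if (in_cache(l1_tag, L1_LINES, addr)) return 3;    /* L1 hit        */
    if (in_cache(l2_tag, L2_LINES, addr)) return 12;   /* L2 hit        */
    return 150;                                        /* go out to RAM */
}

int main(void)
{
    for (int i = 0; i < L1_LINES; i++) l1_tag[i] = i;  /* warm L1 with 0..3  */
    for (int i = 0; i < L2_LINES; i++) l2_tag[i] = i;  /* warm L2 with 0..15 */

    printf("addr  0: %3d cycles\n", fetch_cost(0));    /* L1 hit   */
    printf("addr  8: %3d cycles\n", fetch_cost(8));    /* L2 hit   */
    printf("addr 40: %3d cycles\n", fetch_cost(40));   /* RAM read */
    return 0;
}
```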
When you turn the computer on, all the caches are empty. As the system begins loading the operating system, the CPU starts processing the first instructions from the hard drive, the cache controller begins filling the caches, and instruction processing gets under way.
After the fetch block has obtained the instruction the CPU needs to process, it sends that instruction to the decoding block.
The decoding block works out what the instruction does. It does this by consulting a ROM that exists inside the CPU, called the microcode. Each instruction the CPU understands has its own microcode, which spells out, step by step, what the CPU must do; it is like a step-by-step guide in the documentation. For example, if the instruction is a + b, its microcode tells the decoding block that the instruction needs two parameters, a and b. The decoding block then asks the fetch unit to retrieve the data held in the next two memory positions, which correspond to the values of a and b. Once the decoding block has decoded the instruction and gathered all the data needed to execute it, it passes that data, together with the step-by-step recipe for executing the instruction, to the execution block.
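To make that concrete, here is a tiny C sketch of the decode step; the opcode value, the layout of the instruction in memory and the operand values are all invented for the example.

```c
#include <stdio.h>

/* Invented instruction stream: an opcode followed by its operands. */
enum { OP_ADD = 1 };                         /* stands in for "a + b"        */
static const unsigned memory[] = { OP_ADD, 7, 35 };

int main(void)
{
    unsigned pc = 0;
    unsigned opcode = memory[pc++];          /* delivered by the fetch block */

    if (opcode == OP_ADD) {
        /* The microcode for ADD says: this instruction needs two operands,
         * so the decoder asks the fetch unit for the next two memory
         * positions, which hold the values of a and b. */
        unsigned a = memory[pc++];
        unsigned b = memory[pc++];

        /* Decode then hands the operands plus the execution recipe onward. */
        printf("execute: %u + %u = %u\n", a, b, a + b);
    }
    return 0;
}
```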
The execution block actually executes the instruction. On modern CPUs you will find several execution units working in parallel; this is done to increase performance. For example, a CPU with six execution units can execute up to six instructions at the same time, so in theory it can match the performance of six CPUs that each have only one execution unit. This kind of design is called a 'superscalar' architecture.
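Whether six units can actually be kept busy depends on the program itself: only instructions that do not depend on each other can run side by side. A contrived C illustration (the variables and values are made up):

```c
#include <stdio.h>

int main(void)
{
    int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6;

    /* Independent operations: a superscalar CPU with several ALUs can, in
     * principle, issue all three in the same clock cycle. */
    int r1 = a + b;
    int r2 = c + d;
    int r3 = e + f;

    /* A dependent chain: each addition needs the previous result, so extra
     * execution units cannot speed this part up. */
    int s = a + b;
    s = s + c;
    s = s + d;

    printf("%d %d %d %d\n", r1, r2, r3, s);
    return 0;
}
```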
Modern CPUs usually do not have many identical execution units; instead they have execution units dedicated to particular instruction types. The clearest example is the FPU (Floating Point Unit), which is responsible for executing complex mathematical instructions. Between the decoding block and the execution block there is usually a block (called the dispatch or scheduler block) whose job is to send each instruction to the right execution unit: if it is a mathematical instruction, it is sent to the FPU rather than to a generic execution unit. The generic execution units are called ALUs (Arithmetic and Logic Units).
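A toy C sketch of that dispatch step; the instruction types and the routing rule are simplified assumptions, not a description of any real scheduler.

```c
#include <stdio.h>

typedef enum { GENERIC, MATH } kind_t;

int main(void)
{
    /* A short instruction stream, tagged only by type. */
    kind_t program[] = { GENERIC, MATH, GENERIC, MATH, GENERIC };
    int n = sizeof program / sizeof program[0];

    for (int i = 0; i < n; i++) {
        /* The dispatch/scheduler block routes each instruction to the
         * kind of execution unit that can actually handle it. */
        if (program[i] == MATH)
            printf("instruction %d -> FPU\n", i + 1);
        else
            printf("instruction %d -> ALU\n", i + 1);
    }
    return 0;
}
```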
Finally, when processing is done, the result is sent to the L1 data cache. Continuing our a + b example, the sum is written to the L1 data cache. From there, the result may be sent back to RAM or to another destination such as the video card; that depends on the instruction that will be processed next (the next instruction might be to print the result on the screen).
Another interesting feature that all processors have is the 'pipeline': in hardware design, this is an assembly line that speeds up instruction processing by splitting it into stages such as fetch, decode, execute and write back. With this design, several instructions can be inside the CPU, at different stages, at the same time.
After the fetch block has sent an instruction to the decoding block, does it just sit idle? Instead of doing nothing, why not have it fetch the next instruction? By the time the first instruction reaches the execution block, the fetch block can already have sent a second instruction to the decoding block and be retrieving a third, and so on.
In a modern CPU with an 11-stage pipeline (each stage being one block of the CPU), up to 11 instructions can be inside the CPU at the same time. In fact, since virtually all modern CPUs are also superscalar, the number of instructions simultaneously inside the CPU is even higher.
On the other hand, with an 11-stage pipeline, a given instruction has to pass through 11 blocks to be fully executed. The more stages there are, the longer each individual instruction takes to make it through. Remember, though, that several instructions run inside the CPU at the same time: the first instruction loaded by the CPU may take 11 steps to come out, but once it does, the second instruction comes out right behind it (one step later, not another 11).
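A back-of-the-envelope C sketch of that effect, under the idealized assumption of no stalls and one new instruction entering the pipeline per step:

```c
#include <stdio.h>

int main(void)
{
    const long stages = 11;            /* pipeline depth from the example */
    const long n      = 1000;          /* instructions to run             */

    /* Without a pipeline, each instruction occupies the whole CPU for
     * 'stages' steps; with one, a new instruction can enter every step. */
    long unpipelined = stages * n;
    long pipelined   = stages + (n - 1);   /* fill the pipe once, then 1/step */

    printf("unpipelined: %ld steps, pipelined: %ld steps\n",
           unpipelined, pipelined);
    return 0;
}
```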
There are several other tricks modern CPUs use to increase performance. We will introduce two of them: out-of-order execution (OOO) and speculative execution.
Out-of-order execution (OOO)
Recall that modern CPUs have several execution units working in parallel, and that there is more than one type of execution unit: ALUs for generic instructions and FPUs for mathematical instructions. As a concrete example, take a CPU with six execution units: four ALUs for generic instructions and two FPUs for mathematical instructions. We also assume the program contains the sequence of instructions below.
1. generic instruction (ALU)
2. generic instruction
3. generic instruction
4. generic instruction
5. generic instruction
6. generic instruction
7. math instruction (FPU)
8. generic instruction
9. generic instruction
10. math instruction
What happens? The dispatch/scheduler block sends the first four instructions to the four ALUs, but the fifth instruction then has to wait for one of those ALUs to become free, since at this point all four generic execution units are busy. That is not good, because we still have two math units (FPUs) sitting idle. An out-of-order (OOO) engine, which all modern CPUs have, therefore looks at the next instruction to see whether it can be sent to one of the two idle units. In our example it cannot, because the sixth instruction also needs a generic execution unit (ALU). The OOO engine keeps searching and finds that the seventh instruction is a mathematical instruction that can be executed on an idle math unit. Since the other math unit is also still idle, it goes further into the program looking for another mathematical instruction: in our example it skips past the eighth and ninth instructions and loads the tenth.
So in our example, all the execution units are busy at the same time: the instructions being executed at this moment are the 1st, 2nd, 3rd, 4th, 7th and 10th.
The name OOO comes from the fact that the CPU does not have to wait: it can pull an instruction from further down the program and process it before the instructions above it. Of course, the OOO engine cannot search forever for an instruction to issue if none is suitable (for instance, if the program above contained no mathematical instructions at all). The OOO engine in every CPU has a limit on how many instructions ahead it can look (usually 512).
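Here is a minimal C sketch of one dispatch pass over the ten-instruction example above; the unit counts come from the example, but real OOO engines also track data dependencies between instructions, which this deliberately ignores.

```c
#include <stdio.h>

typedef enum { GENERIC, MATH } kind_t;

int main(void)
{
    /* The example program: instructions 7 and 10 are math, the rest generic. */
    kind_t prog[] = { GENERIC, GENERIC, GENERIC, GENERIC, GENERIC,
                      GENERIC, MATH,    GENERIC, GENERIC, MATH };
    int n = sizeof prog / sizeof prog[0];

    int free_alu = 4, free_fpu = 2;        /* 4 generic units, 2 math units */

    /* One dispatch pass: scan ahead and issue anything a free unit can run. */
    for (int i = 0; i < n; i++) {
        if (prog[i] == GENERIC && free_alu > 0) {
            free_alu--;
            printf("issue instruction %d to an ALU\n", i + 1);
        } else if (prog[i] == MATH && free_fpu > 0) {
            free_fpu--;
            printf("issue instruction %d to an FPU\n", i + 1);
        }
        /* otherwise the instruction waits for a later cycle */
    }
    return 0;
}
```

Running it issues instructions 1 through 4 to the ALUs and instructions 7 and 10 to the FPUs, matching the walkthrough above.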
Speculative execution
Now suppose one of those generic instructions is a conditional branch. What does the OOO engine do? If the CPU has a feature called speculative execution (all modern CPUs do), it will execute both branches. Consider the example below.
1. generic instruction
2. generic instruction
3. if a <= b go to instruction 15
4. generic instruction
5. generic instruction
6. generic instruction
7. math instruction
8. generic instruction
9. generic instruction
10. math instruction
...
15. math instruction
16. generic instruction
...
When the OOO engine analyzes this program, it pulls instruction 15 into the FPU, since the FPU is idle at that point. So both branches are being processed at the same time. If, when the CPU finishes processing the third instruction, it turns out that a > b, the CPU simply discards the processing of instruction 15. You might think this wastes time, but in fact it costs nothing at all: executing that particular instruction cost the CPU nothing, because the FPU would otherwise have been idle. On the other hand, if a =
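As a rough C sketch of the idea (the values of a and b and the work done on each path are invented): compute both sides while the comparison is still in flight, then keep only the result from the path actually taken.

```c
#include <stdio.h>

/* Stand-ins for the work on each side of the branch. */
static int taken_path_work(void)     { return 15; }   /* e.g. instruction 15 */
static int not_taken_path_work(void) { return 4;  }   /* e.g. instruction 4  */

int main(void)
{
    int a = 7, b = 3;                      /* invented values for the example */

    /* Speculation: start both sides while the comparison is still being
     * resolved.  A real CPU does this on otherwise idle execution units,
     * so the extra work costs it nothing. */
    int taken_result     = taken_path_work();
    int not_taken_result = not_taken_path_work();

    /* Once the branch condition is resolved, keep one result and simply
     * discard the other. */
    int result = (a <= b) ? taken_result : not_taken_result;

    printf("kept result: %d\n", result);
    return 0;
}
```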