The standard processor without conflict resolution (not without conflicts as you wrote) has no forwarding logic, but makes use of register bypassing. With the standard five-stage pipeline, we get the following numbers of nops:
- after ALU: pWB − pID − 1 = 5 − 2 − 1 = 2
- after LOAD: pWB − pID − 1 = 5 − 2 − 1 = 2
- after BRANCH: pWB − pIF − 1 = 5 − 1 − 1 = 3
and with forwarding, we get
- after ALU: pEX − pID − 1 = 3 − 2 − 1 = 0
- after LOAD: pMA − pID − 1 = 4 − 2 − 1 = 1
- after BRANCH: pEX − pIF − 1 = 3 − 1 − 1 = 1
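As a quick check of the arithmetic, here is a minimal Python sketch of the formula used in both lists. The stage numbering (IF=1 through WB=5) follows the values in the formulas; the names `STAGE` and `nops_needed` are made up for this illustration.

```python
# Stage positions in the five-stage pipeline, as used in the formulas above.
STAGE = {"IF": 1, "ID": 2, "EX": 3, "MA": 4, "WB": 5}

def nops_needed(producer_stage, consumer_stage):
    """nops = (stage where the result becomes available)
            - (stage where the next instruction needs it) - 1"""
    return STAGE[producer_stage] - STAGE[consumer_stage] - 1

without_forwarding = {
    "ALU": nops_needed("WB", "ID"),
    "LOAD": nops_needed("WB", "ID"),
    "BRANCH": nops_needed("WB", "IF"),
}
with_forwarding = {
    "ALU": nops_needed("EX", "ID"),
    "LOAD": nops_needed("MA", "ID"),
    "BRANCH": nops_needed("EX", "IF"),
}

print(without_forwarding)  # {'ALU': 2, 'LOAD': 2, 'BRANCH': 3}
print(with_forwarding)     # {'ALU': 0, 'LOAD': 1, 'BRANCH': 1}
```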
So, your calculations are correct. The program you had doubts about, with the nop instructions inserted, is as follows:
00: mov $0,0
nop,nop
01: addi $1,$0,5
nop,nop
02 wh: ldi $2,$1,10
03: subi $1,$1,1
nop,nop
04: ldi $3,$1,10
nop,nop
05: add $3,$3,$2
nop,nop
06: sti $3,$1,10
07: bez $1,ex
nop,nop,nop
08: j wh
nop,nop,nop
09 ex: ldi $7,$1,10
There are no nops between instructions 2 and 3 since there is no conflict to resolve there; the same holds for instructions 6 and 7. Note that the numbers of stalls/nops calculated above are only necessary if there is a conflict between instructions that directly follow each other. Note further that a single nop after a load may be sufficient even without forwarding, e.g. in the following program
ldi $0,$1,0
addi $2,$2,1
addi $2,$0,1
which results in
ldi $0,$1,0
addi $2,$2,1
nop
addi $2,$0,1
We only have to make sure that at least two instructions lie between a load and a later instruction that reads the loaded register (in the case above), and add nops to increase the distance where needed. This does not mean that we always have to add two nops after every load. It does mean, however, that we never have to add more than two.
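The rule can be sketched as a small Python routine that inserts nops only where the distance is too short. This is an illustration under assumptions, not an assembler pass: each instruction is written by hand as a hypothetical tuple (text, destination register, source registers), only data conflicts are handled (no branches), and `gap=2` is the no-forwarding case from above.

```python
NOP = ("nop", None, ())

def insert_nops(program, gap=2):
    """Insert nops so that at least `gap` instructions lie between a
    register write and a later read of the same register."""
    out = []
    for text, dest, srcs in program:
        needed = 0
        # Look at the last `gap` emitted slots (nops included).
        for back in range(1, gap + 1):
            if back > len(out):
                break
            prev_dest = out[-back][1]
            if prev_dest is not None and prev_dest in srcs:
                # `back - 1` instructions already separate the two.
                needed = max(needed, gap - (back - 1))
        out.extend([NOP] * needed)
        out.append((text, dest, srcs))
    return [text for text, _, _ in out]

program = [
    ("ldi $0,$1,0",  "$0", ("$1",)),
    ("addi $2,$2,1", "$2", ("$2",)),
    ("addi $2,$0,1", "$2", ("$0",)),
]
print(insert_nops(program))
# ['ldi $0,$1,0', 'addi $2,$2,1', 'nop', 'addi $2,$0,1']
```

If the independent `addi $2,$2,1` were removed, the same routine would insert two nops after the load, which is the upper bound mentioned above.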