Lecture 11: Memory Data Flow Techniques
Load/store buffer design, memory-level parallelism, consistency model, memory disambiguation

Load/Store Execution Steps
Load: LW R2, 0(R1)
1. Generate virtual address; may wait on base register
2. Translate virtual address into physical address
3. Read data cache
Store: SW R2, 0(R1)
1. Generate virtual address; may wait on base register and data register
2. Translate virtual address into physical address
3. Write data cache
Unlike register accesses, memory addresses are not known prior to execution

Load/store Buffer in Tomasulo
• Supports memory-level parallelism
• Loads wait in the load buffer until their address is ready; memory reads are then processed
• Stores wait in the store buffer until their address and data are ready; memory writes wait further until the stores are committed
[Figure: pipeline with IM, Fetch Unit, Decode, Rename, Regfile, Reorder Buffer, RS, FU1, FU2, S-buf, L-buf, DM]

Load/store Unit with Centralized RS
• The centralized RS includes part of the load/store buffer in Tomasulo
• Loads and stores wait in the RS until they are ready
[Figure: pipeline with IM, Fetch Unit, Decode, Rename, Regfile, Reorder Buffer, RS, FU1, FU2, S-unit, L-unit, store buffer (addr, data), cache]

Memory-level Parallelism
    for (i=0;i<100;i++) A[i] = A[i]*2;

    Loop: L.S  F2, 0(R1)
          MULT F2, F2, F4
          SW   F2, 0(R1)
          ADD  R1, R1, 4
          BNE  R1, R3, Loop
F4 holds 2.0
[Figure: timeline of overlapped memory operations LW1, LW2, LW3, SW1, SW2, SW3]
Significant improvement over sequential reads/writes

Memory Consistency
• Memory contents must be the same as under sequential execution
• Must respect RAW, WAW, and WAR dependences
Practical implementations:
1. Reads may proceed out of order
2. Writes proceed to memory in program order
3. Reads may bypass earlier writes only if their addresses are different

Store Stages in Dynamic Execution
1. Wait in RS until base address and store data are available (ready)
2. Move to store unit for address calculation and address translation
3. Move to store buffer (finished)
4. Wait for ROB commit (completed)
5. Write to data cache (retired)
Stores always retire in order, to respect WAW and WAR dependences
[Figure: RS feeding store unit and load unit; finished and completed store buffer entries; D-cache]
Source: Shen and Lipasti, page 197

Load Bypassing and Memory Disambiguation
• To exploit memory parallelism, loads have to bypass writes, but this may violate RAW dependences
• Dynamic memory disambiguation: dynamic detection of memory dependences
• Compare the load address with the addresses of all older stores

Load Bypassing Implementation
Load unit stages:
1. Address calculation
2. Address translation
3. If no match, update destination register
• Associative search of store addresses for matching
• Assume in-order execution of loads/stores
[Figure: in-order RS feeding store unit (stages 1, 2) and load unit (stages 1, 2, 3); load address matched against store buffer addr/data entries; D-cache]

Load Forwarding
• Load forwarding: if a load address matches an older write address, the store data can be forwarded
• If a match is found, forward the related data to the destination register (in ROB)
• Multiple matches may exist; the last one wins
[Figure: in-order RS feeding store unit (stages 1, 2) and load unit (stages 1, 2, 3); on a match, store buffer data goes to the destination register; D-cache]

In-order Issue Limitation
    for (i=0;i<100;i++) A[i] = A[i]/2;

    Loop: L.S  F2, 0(R1)
          DIV  F2, F2, F4
          SW   F2, 0(R1)
          ADD  R1, R1, 4
          BNE  R1, R3, Loop
• Any store in the RS may block all following loads
• When is F2 of SW available?
• When is the next L.S ready?
• Assume reasonable FU latency and pipeline length

Speculative Load Execution
• Forwarding does not always work if some addresses are unknown
• No match: predict that a load has no RAW dependence on older stores
• Match at completion: if match, flush pipeline
• Flush pipeline at commit if predicted wrong
[Figure: out-of-order RS feeding store unit (stages 1, 2) and load unit (stages 1, 2, 3); finished load buffer (addr, data) matched at completion; D-cache]

Alpha 21264 Pipeline
[Figure: Alpha 21264 pipeline]

Alpha 21264 Load/Store Queues
32-entry load queue, 32-entry store queue
[Figure: int issue queue and fp issue queue; Addr ALU, Int ALU, Int ALU, Addr ALU, FP ALU, FP ALU; two Int RF(80) and FP RF(72); D-TLB, L-Q, S-Q, AF, dual D-cache]

Load Bypassing, Forwarding, and RAW Detection
At commit (ROB), load or store?
• Load: wait if LQ head is not completed, then move LQ head
• Store: mark SQ head as completed, then move SQ head
Load forwarding: if match, forward data
RAW detection: if match, mark store-load trap to flush pipeline (at commit)
[Figure: IQ, LQ, SQ, ROB, and D-cache with match and commit paths]

Speculative Memory Disambiguation
• When a load is trapped at commit, set the stWait bit in the table, indexed by the load's PC
• When the load is fetched, get its stWait bit from the table
• The load waits in the issue queue until older stores are issued
• The stWait table is cleared periodically
[Figure: fetch PC indexing a 1024-entry, 1-bit table; renamed instruction enters int issue queue]

Architectural Memory States
Memory request: search the hierarchy from top to bottom
[Figure: LQ; SQ with completed entries; committed states in L1-Cache, L2-Cache, L3-Cache (optional), Memory, Disk, Tape, etc.]
Source: Shen & Lipasti

Summary of Superscalar Execution
• Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch
• Register data flow techniques: register renaming, instruction scheduling, in-order commit, misprediction recovery
• Memory data flow techniques: load/store units, memory consistency
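
Load bypassing and load forwarding, as described in the slides, can be illustrated with a small software model. The sketch below is not the slides' hardware design: the names StoreEntry, StoreBuffer, and execute_load are hypothetical, the cache is a plain dictionary, unwritten memory is assumed to read as 0, and the load is assumed to be younger than every buffered store. It models the non-speculative case, so a load stalls whenever an older store address is still unknown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    # Hypothetical store buffer entry; None means "not ready yet".
    addr: Optional[int] = None
    data: Optional[int] = None


class StoreBuffer:
    """Program-ordered buffer of stores that have not yet written the cache."""

    def __init__(self):
        self.entries = []  # oldest store first

    def add(self, addr=None, data=None):
        entry = StoreEntry(addr, data)
        self.entries.append(entry)
        return entry

    def retire_oldest(self, cache):
        # Stores write the cache in program order; the entry must be ready.
        entry = self.entries.pop(0)
        assert entry.addr is not None and entry.data is not None
        cache[entry.addr] = entry.data


def execute_load(addr, store_buffer, cache):
    """Return (status, value) for a load younger than all buffered stores."""
    # Scan youngest to oldest so that the last matching store wins.
    for entry in reversed(store_buffer.entries):
        if entry.addr is None:
            return ("stall", None)      # cannot disambiguate yet
        if entry.addr == addr:
            if entry.data is None:
                return ("stall", None)  # RAW match, data not ready
            return ("forward", entry.data)
    # No older store matches: the load may bypass them and read the cache.
    # Assumption: addresses never written read as 0.
    return ("bypass", cache.get(addr, 0))


# Usage: two older stores to 0x100, then loads to 0x100 and 0x200.
cache = {0x100: 1, 0x200: 7}
sb = StoreBuffer()
sb.add(0x100, 10)
sb.add(0x100, 20)
assert execute_load(0x100, sb, cache) == ("forward", 20)  # last match wins
assert execute_load(0x200, sb, cache) == ("bypass", 7)    # different address
```

Scanning from the youngest store backwards gives the "last one wins" behavior directly. A speculative design would replace the stall on an unknown address with a prediction and a later check.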
© Copyright 2025