goal of pipelining is to approach 1 instruction per clock tick
superpipelining (deeply pipelined microprocessors)
superscalar (multiple issue, multiple instruction dispatch,
multiple concurrent pipelines)
- one way to get instruction level parallelism is pipelining
- longer pipes can give more parallelism
- stages are divided to lengthen pipe
- depth limited by logic speed, branch frequency, etc.
- tries to get us closer to 1 instruction per clock tick
- but no matter how deep,
never gets us to exactly 1 instruction per clock tick.
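A minimal sketch of why the rate approaches but never reaches 1 instruction per tick, assuming the simplest textbook model: the pipe has to fill once, then one instruction completes per tick. The function name `pipeline_cpi` and the `stall_cycles` parameter are illustrative, not from the notes.

```python
def pipeline_cpi(num_instructions, depth, stall_cycles=0):
    """Average clock ticks per instruction for a single-issue pipeline.

    Assumed model: the first instruction takes `depth` ticks to pass through
    the pipe, each later one finishes one tick after its predecessor, and
    `stall_cycles` (hypothetical parameter) adds any hazard stalls on top.
    """
    total_ticks = depth + (num_instructions - 1) + stall_cycles
    return total_ticks / num_instructions


for n in (10, 100, 10_000):
    # CPI shrinks toward 1 as n grows, but the fill cost keeps it above 1.
    print(n, pipeline_cpi(n, depth=5))
```

In this model any stall at all pushes the average further from 1, which is the motivation for issuing more than one instruction per tick.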
superscalar (multiple issue) origins
- multiple concurrent pipelines
- an "instruction dispatcher" (some intelligence in the CPU) tries to
schedule different instructions in different pipelines.
- everything's still sequential in terms of data
- can't have 2 instructions working on the same data (so not same as MIMD).
- only works if no data dependencies between instructions.
- origins in Cray's CDC-6600
- Cray designs kept multiple functional units busy simultaneously
- Cray issued just one instruction at a time, but fetched up to four at once
- (dispatch was pipelined)
- good issue rate for technology in the mid-1960s
Crossing the Threshold Between single and multiple issue
- microprocessor tech. lags mainframe computer architecture by about 20 years
- there's really very little that's new in microprocessors despite intel's
- what they're doing is taking mainframe tech and making it low cost.
- (the CDC-6600, the ILLIAC IV (a SIMD design, of which only 1 was ever made),
etc., had implemented this)
- superscalar extends Cray approach with multiple issue.
- need to fetch instructions in parallel to increase memory bandwidth
- some superscalar designs still fetch and dispatch serially to
multiple pipes, but becoming less common
- R4400 is borderline superscalar
- UltraSPARC is clearly superscalar
Implications of Multiple Issue
- if functional units are multi-cycle or use a slower clock rate,
then a single fetch may be adequate
- once the average time for the functional units to operate
is < 1 clock cycle, then the system will benefit from multiple issue.
Limits to Superscalar (multiple issue) microprocessors
- need to fetch multiple instructions at once
- inter-pipeline interlocking
- reordering of instructions for multiple interlocked pipelines
- potentially multiple register write-backs
- multiported register file, and/or split register file
- limited number of independent operations available at any point in a stream
- pipeline depth affects that (in the sense that deeper pipelines need more
registers and that the penalty is much higher).
- example: if you stall for 3 cycles, you're stalling 18 cycles on
- limited by branches to generally fewer than 10, typically about 4
(downstairs alpha is 4, upstairs alpha is 6)
- this limit can be increased by increasing the number of registers
- limited also by number of independent buses
(each pipeline needs access to the registers, e.g. 4 pipes need 8 buses,
counting a source and a destination bus for each); therefore the register
file needs to be eight-ported.
- depends on how good your compiler is and how well it can pull apart
loops. examples, e.g. dot products, limited only by number of registers
(practical considerations aside).
- summing loops are another good example of concurrency.
(current term is dependent on previous term, can break the range apart
and sum at the end).
- summing loops with "tree" structure (loop level parallelism, not actually
instruction level parallelism --- but loop level parallelism can also be
exploited by multiple issue).
- higher levels can be achieved with aggressive branch prediction
and speculative execution.
- the new IA64 arch. does not have branch prediction or speculative hardware;
relies very heavily on the compiler (it's somewhere between risc and vliw:
called "post risc").
- Can be as high as 50 on some very regular codes
- Appears to be about 6 for typical mix (e.g. SPEC) so not much sense
making an 8 issue CPU, for most applications.
- May be only 4 for integer programs.
- Potential ILP may be 3X this, but hard to obtain
- Still a worthwhile performance gain
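The summing-loop point above (break the range apart, sum at the end) can be sketched as follows. This is only an illustration of the dependence structure: Python itself will not run the lanes in parallel, and the names `sum_serial`, `sum_split`, and `lanes` are made up for the example. Integers are used so that regrouping the additions cannot change the result; with floating point the rounding could differ.

```python
def sum_serial(values):
    """One accumulator: every add depends on the result of the previous add."""
    total = 0
    for v in values:
        total += v
    return total


def sum_split(values, lanes=4):
    """Split the range into `lanes` contiguous pieces with one accumulator each.

    Adds into different accumulators have no data dependence on each other,
    so a multiple-issue CPU could overlap them; the pieces are combined at the end.
    """
    chunk = (len(values) + lanes - 1) // lanes  # ceiling division
    partial = [0] * lanes
    for step in range(chunk):
        for lane in range(lanes):
            idx = lane * chunk + step
            if idx < len(values):
                partial[lane] += values[idx]
    return sum(partial)


data = list(range(1, 101))
print(sum_serial(data), sum_split(data))
```

The number of lanes that can be kept in flight this way is bounded by the number of registers available for accumulators, as noted above.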
Two techniques to identify hazards:
- ILP is a property of code: instructions in a given sequence are often
independent of one another,
which opens up the possibility of executing them concurrently (at the same time).
- technique to run more than one instruction at once is multiple issue.
- in order to exploit ILP, need to know where hazards are (2 techniques)
- then we need to do something about the hazards; 2 techniques for that
- register renaming
- Scoreboarding -- originally developed for the CDC-6600 in 1964 by Cray.
(first implementation): centralized and explicit control.
helps with register renaming, which helps with multiple issue.
- used in modern microprocessors
- Reservation stations -- developed for the IBM 360/91 in 1967 by
Tomasulo. Distributed and more implicit control
e.g. IBM 360 only had 4 FP regs; used Tomasulo's ...
- more complicated than scoreboarding.
- no longer used in modern microprocessors.
- may make a comeback in the next 5 years.
- scoreboarding looks at what's used when,
whereas Tomasulo's reserves registers in advance (doesn't look forward).
- scoreboarding looks forward to some number of instructions (limited by how
much silicon you have). you can scoreboard more than the stages in pipeline
(e.g. can reorder instructions based on that information).
- both of these 2 methods maximize utilization of register file, alu, memory,
- in modern computing: more registers, so both of these 2 methods get more
- both of these alleviate work from the compiler.
- historically compiler writers and CPU makers didn't talk to each other.
- when they started communicating, silicon could be used for something other
than scoreboarding or Tomasulo's.
- things have reversed nowadays (almost all the intelligence is in the
- Centralized set of tables ("registers") in the cpu that keep track of what's
being used and when.
- each row indicates what's being used at each clock tick.
- scoreboard manages dispatch, stalling, and completion of instructions
- Watches for WAW, RAW, WAR hazards, and manages the execution sequence
(e.g. possible instruction reordering) to avoid these hazards.
- Hazards are detected by observing register references and operation types
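As a rough illustration of detecting hazards from register references, the sketch below classifies the hazards between an earlier and a later instruction. The `Instr` tuple and the `hazards` function are hypothetical names for this example, and each instruction is assumed to have one destination register.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str                # register written
    srcs: Tuple[str, ...]    # registers read


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Hazards that arise if `second` is reordered or overlapped with `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")     # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")     # second overwrites what first still has to read
    if first.dest == second.dest:
        found.add("WAW")     # both write the same register
    return found


a = Instr("mul", "r1", ("r2", "r3"))
b = Instr("add", "r4", ("r1", "r5"))
c = Instr("sub", "r2", ("r6", "r7"))
d = Instr("or", "r1", ("r8", "r9"))
print(hazards(a, b))  # RAW on r1
print(hazards(a, c))  # WAR on r2
print(hazards(a, d))  # WAW on r1
print(hazards(b, c))  # empty set: b and c are independent
```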
Result Register Designator
- Result Register Designator
- Entry-Operand Register Designator
- Valid Operand Flag
- Source Function Unit Designator
- Each functional unit also has 2 operand registers, a result
register, a mode register, and a busy flag
Result Register Dependencies
- Each ISA register has a matching scoreboard register indicating which
functional unit is scheduled to write to it next (or if it is free)
- Instructions are issued only if their ISA result register is free,
and the necessary functional unit isn't busy
- Avoids WAW hazards
- Just because a destination register is free doesn't mean its
contents can be overwritten - can still be an operand
- Destination register must not be listed in the Entry-Operand
Register Designators in order to be written
- Avoids WAR hazards
- forwarding gets more complicated in superscalar processors because we now need
to check for validity across several pipelines.
- forwarding needs to be responsive to the contents of the scoreboard table.
- Source Function Unit Designators tell each unit where its operands will
come from, so it can take them from the other unit's Result Register
- If a result is in an ISA register, then its valid flag is set and
the SFUD is empty
- Avoids RAW hazards
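The three checks above can be put together in a toy model. This is a heavily simplified sketch, not the CDC-6600 design: one instruction per functional unit, no timing, and the class and method names (`Scoreboard`, `issue`, `read_operands`, `write_back`) are assumptions made for the example. The `waiting` table plays the role of the Source Function Unit Designators, with `None` standing for a set valid flag.

```python
class Scoreboard:
    def __init__(self, registers, units):
        # Per ISA register: unit scheduled to write it next, or None if free.
        self.pending_writer = {r: None for r in registers}
        self.busy = {u: False for u in units}
        # Per unit: {source register: producing unit, or None once the value is valid}.
        self.waiting = {u: {} for u in units}

    def issue(self, unit, dest, srcs):
        # Issue only if the unit is idle and the result register is free (avoids WAW).
        if self.busy[unit] or self.pending_writer[dest] is not None:
            return False
        self.busy[unit] = True
        # Record producers before claiming dest, so "add r1, r1, r2" reads the old r1.
        self.waiting[unit] = {s: self.pending_writer[s] for s in srcs}
        self.pending_writer[dest] = unit
        return True

    def read_operands(self, unit):
        # Operands may be read only once every producer has delivered (avoids RAW).
        if any(p is not None for p in self.waiting[unit].values()):
            return False
        self.waiting[unit] = {}
        return True

    def write_back(self, unit, dest):
        # Hold the write while another unit still has to read the old value (avoids WAR).
        for other, srcs in self.waiting.items():
            if other != unit and dest in srcs and srcs[dest] is None:
                return False
        self.pending_writer[dest] = None
        self.busy[unit] = False
        for srcs in self.waiting.values():
            for s, producer in srcs.items():
                if producer == unit:
                    srcs[s] = None   # operand is now valid
        return True


sb = Scoreboard(["r1", "r2", "r3", "r4", "r5", "r6"], ["mul", "add", "alu"])
print(sb.issue("mul", "r1", ["r2", "r3"]))   # issues
print(sb.issue("add", "r1", ["r4", "r5"]))   # refused: r1 already has a pending writer
print(sb.issue("add", "r4", ["r1", "r5"]))   # issues, but must wait for mul's r1
print(sb.read_operands("add"))               # refused until mul writes back
```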
Reservation Station Contents
- Pools of register(s) associated with each functional unit
- Distributed resource, following instruction dispatch
- Controls entry of instructions into functional units, but allows
out-of-order generation of results
Feeding Reservation Stations
- Operand sources or values
- Result destination
- Valid source flags
- Instruction is released to functional unit when its operands are
all valid (which can include forwarding)
- Avoids RAW hazards
Release and Execute
- Issue and release are separate
- Instruction is issued if a station for its FU is available, and
its destination register isn't busy -- avoids WAW hazards
- On issue, destination register is marked busy, and assigned tag
for the FU that will fill it
- Valid source values or source tags are sent to the reservation
station - avoids WAR hazards
- Results with matching tags go to reservation stations and registers
- Execution proceeds when all values are available
- Result broadcast on common bus
- Register matches its tag to a broadcast value and clears busy
- Parallel issue leads to out-of-order execution that must be turned
into in-order completion
- Done through register renaming:
- Register Pool or
- Reorder Buffers
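The issue, release, and broadcast steps listed in this section can be sketched with a small model. It follows the variant described in these notes (issue stalls when the destination register is busy) and is an illustration only: `Station`, `Machine`, `num_stations`, and the two-operand restriction are assumptions for the example.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class Station:
    """One reservation-station entry; operands are ("value", x) or ("tag", t)."""

    def __init__(self, tag, op, operands):
        self.tag, self.op, self.operands = tag, op, list(operands)

    def ready(self):
        return all(kind == "value" for kind, _ in self.operands)

    def capture(self, tag, value):
        # Replace any operand waiting on `tag` with the broadcast value.
        self.operands = [("value", value) if (kind, v) == ("tag", tag) else (kind, v)
                         for kind, v in self.operands]


class Machine:
    def __init__(self, regs, num_stations=4):
        self.value = dict(regs)
        self.tag = {r: None for r in regs}   # tag of the station that will fill r
        self.stations, self.num_stations, self.next_tag = [], num_stations, 0

    def issue(self, op, dest, srcs):
        # Stall if no station is free or the destination is busy (avoids WAW).
        if len(self.stations) >= self.num_stations or self.tag[dest] is not None:
            return None
        # Copy valid source values now, or record the producing tag (avoids WAR).
        operands = [("value", self.value[s]) if self.tag[s] is None
                    else ("tag", self.tag[s]) for s in srcs]
        tag, self.next_tag = self.next_tag, self.next_tag + 1
        self.tag[dest] = tag
        self.stations.append(Station(tag, op, operands))
        return tag

    def execute(self, tag):
        # Release to the functional unit only when all operands are valid (avoids RAW).
        st = next((s for s in self.stations if s.tag == tag), None)
        if st is None or not st.ready():
            return False
        a, b = (v for _, v in st.operands)   # two-operand instructions assumed
        self.stations.remove(st)
        self.broadcast(tag, OPS[st.op](a, b))
        return True

    def broadcast(self, tag, result):
        # Common bus: waiting stations and the tagged register pick up the result.
        for st in self.stations:
            st.capture(tag, result)
        for reg, t in self.tag.items():
            if t == tag:
                self.value[reg], self.tag[reg] = result, None


m = Machine({"r1": 0, "r2": 6, "r3": 7, "r4": 0, "r5": 1})
mul = m.issue("mul", "r1", ["r2", "r3"])
add = m.issue("add", "r4", ["r1", "r5"])    # waits on mul's tag
inc = m.issue("add", "r2", ["r5", "r5"])    # overwrites r2, which mul reads
m.execute(inc)                               # runs first, out of order
m.execute(mul)                               # still uses the old r2 captured at issue
m.execute(add)
print(m.value["r1"], m.value["r2"], m.value["r4"])
```

Because results here land in the register file in execution order, this model by itself does not give in-order completion; that is what the register pool or reorder buffer adds.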
Register Pool Assignment
- More physical registers than logical
- Mapping table: logical to physical
- Free, logical, pending states
- Register becomes free when another one takes the same logical name
- Pending temporarily holds a result
- Indicates logical destination and source instruction tag
- Function unit result is sent to first free register -- if none, then stall
- Register becomes pending, hold its future name and instruction tag
- Takes the name, becomes logical when instruction "writes back"
- Instruction has to wait for all ahead of it to write back before it can
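A small sketch of the register pool states described above (free, pending, logical). The class name `RegisterPool` and its methods are hypothetical, and the caller is assumed to invoke `write_back` in program order, since the model does not enforce that itself.

```python
class RegisterPool:
    def __init__(self, logical_names, num_physical):
        if num_physical <= len(logical_names):
            raise ValueError("need more physical registers than logical ones")
        self.state = ["free"] * num_physical
        self.value = [0] * num_physical
        self.pending_info = {}   # physical index -> (future logical name, instruction tag)
        self.mapping = {}        # logical name -> physical index
        for i, name in enumerate(logical_names):
            self.state[i] = "logical"
            self.mapping[name] = i

    def accept_result(self, logical_dest, instr_tag, value):
        """Park a functional-unit result in the first free register.

        Returns the physical index, or None if no register is free (stall).
        """
        for i, s in enumerate(self.state):
            if s == "free":
                self.state[i] = "pending"
                self.value[i] = value
                self.pending_info[i] = (logical_dest, instr_tag)
                return i
        return None

    def write_back(self, phys):
        """The pending register takes its logical name; the old holder becomes free."""
        name, _tag = self.pending_info.pop(phys)
        self.state[self.mapping[name]] = "free"
        self.state[phys] = "logical"
        self.mapping[name] = phys

    def read(self, name):
        return self.value[self.mapping[name]]


pool = RegisterPool(["r1", "r2"], num_physical=3)
p = pool.accept_result("r1", instr_tag=7, value=42)
print(pool.read("r1"))   # still the old value: the result is only pending
pool.write_back(p)
print(pool.read("r1"))   # now the new value, and the old register is free again
```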
Limits to ILP
- Central pool (or attached to stations)
- Circular queue of registers
- Assigned when instruction is issued
- Structure represents issue order
- Holds result value and logical destination until instruction commits
- Forwards as necessary, station refers to buffer that will write value last
- Example: superscalar (4), superpipeline (10) requires 40
instructions that can be executed
- Superscalar effectively shortens distance between branches in pipe
- Every issue group may include a branch -- separate branch unit
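The reorder buffer bullets above (a queue in issue order that holds each result and its logical destination until the instruction commits) can be sketched like this. A bounded `collections.deque` stands in for the circular queue; `ReorderBuffer` and its method names are made up for the example.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()   # oldest instruction at the left, in issue order
        self.next_tag = 0

    def issue(self, dest):
        """Assign a buffer entry at issue; returns its tag, or None if full (stall)."""
        if len(self.entries) == self.size:
            return None
        entry = {"tag": self.next_tag, "dest": dest, "value": None, "done": False}
        self.next_tag += 1
        self.entries.append(entry)
        return entry["tag"]

    def complete(self, tag, value):
        """Record a result; results may arrive out of order."""
        for e in self.entries:
            if e["tag"] == tag:
                e["value"], e["done"] = value, True
                return
        raise KeyError(tag)

    def commit(self, regfile):
        """Retire finished entries from the head only, so registers update in order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            regfile[e["dest"]] = e["value"]
            retired.append(e["tag"])
        return retired

    def latest_writer(self, reg):
        """Entry that will write `reg` last, which is where forwarding should look."""
        for e in reversed(self.entries):
            if e["dest"] == reg:
                return e
        return None


regs = {"r1": 0, "r2": 0}
rob = ReorderBuffer(size=4)
first, second = rob.issue("r1"), rob.issue("r2")
rob.complete(second, 20)     # finishes early
print(rob.commit(regs))      # nothing retires: the older instruction is not done
rob.complete(first, 10)
print(rob.commit(regs))      # both retire together, in issue order
```

The second `commit` also shows the later point that several instructions can complete in the same cycle when they are all done and adjacent at the head of the buffer.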
PowerPC 601 Part 1
- 4-way superscalar
- In-order dispatch, out-of-order completion
- Groups 4 instructions per cycle for issue: Load/Store,
FP/Graphics, Integer, Branch
- In-order register file update
PowerPC 601 Part 2
- 3-way superscalar
- Integer, branch, FP (loads and stores handled in corresponding units)
- First 3 stages shared: Fetch, Cache, Dispatch
- 4-word instruction prefetch buffer
- 4-word dispatch buffer
PowerPC 601 Part 3
- Branches start in dispatch, take 1 more cycle for writeback (4 stages)
- Integer ops execute then write back (5 stages)
- FP ops multiply, add, arithmetic or load, write back (7 stages)
- In-order within pipes, out-of-order between pipes
PowerPC 604 Part 1
- All register modifying operations are synchronized with a
corresponding integer operation
- If an integer operation isn't available, then a no-op is inserted
- Simple means to ensure in-order completion
- Limited out-of-order capabilities
PowerPC 604 Part 2
- 4-way superscalar (load/store unit)
- Fetch, Decode, Dispatch are shared
- Branches: Validate, Complete
- Integers: Execute, Complete, WB
- Load/Store: Address/Cache/Align (2 cycles), Complete, WB
- FP: Mult, Add, Round/Normalize, Complete, WB
PowerPC 604 Part 3
- 3 integer units, FP, branch, L/S
- 16 reorder buffers -- no result field
- Multiple instructions can complete simultaneously if they are all
done and together at the end of the buffer
- Separate 12 integer, 8 FP, 8 cond rename registers
- Each unit has 2 reservation stations
- Integer units can release out-of-order, others are in-order
- Exceptions are tied to integer pipe
- Branch history and target buffers
- Resolution at any stage
- Penalty of 1 to 4, 1 cycle to start the mispredicted instruction