goal of pipelining is to approach 1 instruction per clock tick

superpipelining (deeply pipelined microprocessors)

one way to get instruction level parallelism is pipelining
longer pipes can give more parallelism
stages are divided to lengthen pipe
depth limited by logic speed, branch frequency, etc.
tries to get us closer to 1 instruction per clock tick
but no matter how deep, never gets us to exactly 1 instruction per clock tick.
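
The "closer to 1, never exactly 1" point can be checked with the usual idealized fill-time model. A minimal sketch, assuming no stalls; the function name and the 5-stage figure are illustrative assumptions, not from the lecture:

```python
def cycles_per_instruction(n_instructions: int, n_stages: int) -> float:
    """Ideal pipeline: the first instruction needs n_stages ticks to drain,
    every later instruction finishes one tick after the previous one."""
    total_cycles = n_stages + (n_instructions - 1)
    return total_cycles / n_instructions


# The ratio falls toward 1 as the run gets longer, but the fill time
# (n_stages - 1 extra ticks) never disappears completely.
for n in (10, 1_000, 1_000_000):
    print(n, cycles_per_instruction(n, n_stages=5))
```

With 5 stages, 10 instructions take 14 ticks (1.4 ticks per instruction); longer runs get closer to 1 but stay above it for any pipe with more than one stage.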

superscalar (multiple issue, multiple instruction dispatch, multiple concurrent pipelines)

multiple concurrent pipelines
an "instruction dispatcher" (some intelligence in the CPU) tries to schedule different instructions in different pipelines.
everything's still sequential in terms of data
can't have 2 instructions working on the same data (so not same as MIMD).
only works if no data dependencies between instructions.
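
A rough sketch of what the dispatcher has to decide, assuming a simple in-order dispatcher that closes an issue group at the first conflict. The `Instr` type, the helper names, and the register names are hypothetical, chosen only for illustration:

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def independent(earlier: Instr, later: Instr) -> bool:
    """True if `later` can go down another pipe in the same cycle as `earlier`."""
    return (earlier.dest not in later.srcs      # later does not need earlier's result
            and later.dest not in earlier.srcs  # later does not overwrite earlier's operand
            and earlier.dest != later.dest)     # they do not write the same register


def issue_groups(program: List[Instr], width: int) -> List[List[Instr]]:
    """Greedy in-order grouping: start a new group when the current one is
    full or the next instruction conflicts with anything already in it."""
    groups: List[List[Instr]] = []
    current: List[Instr] = []
    for ins in program:
        if len(current) < width and all(independent(p, ins) for p in current):
            current.append(ins)
        else:
            groups.append(current)
            current = [ins]
    if current:
        groups.append(current)
    return groups


program = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r5", "r6")),
    Instr("r7", ("r1", "r4")),   # needs both earlier results
    Instr("r8", ("r2", "r5")),
]
print([len(g) for g in issue_groups(program, width=2)])
```

Here the four instructions fit in two issue groups of two; a chain in which every instruction reads the previous result would fall back to one instruction per group.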

superscalar (multiple issue) origins

origins in Cray's CDC-6600
Cray designs kept multiple functional units busy simultaneously
Cray issued just one instruction at a time, but fetched up to four at once
(dispatch was pipelined)
good issue rate for technology in mid 1960's

Modern Superscalar

microprocessor tech. lags mainframe computer architecture by about 20 years
there's really very little that's new in microprocessors despite Intel's claims
what they're doing is taking mainframe tech and making it low cost.
(the CDC-6600, the ILLIAC IV (which used SIMD concepts, and of which only 1 was ever made), etc. had already implemented this)
superscalar extends Cray approach with multiple issue.
need to fetch instructions in parallel to increase memory bandwidth
some superscalar designs still fetch and dispatch serially to multiple pipes, but becoming less common
R4400 is borderline superscalar
UltraSPARC is clearly superscalar

Crossing the Threshold Between single and multiple issue

if functional units are multi-cycle or use a slower clock rate, then a single fetch may be adequate
once the average time for the functional units to operate is < 1 clock cycle, then the system will benefit from multiple issue.

Implications of Multiple Issue

need to fetch multiple instructions at once
inter-pipeline interlocking
reordering of instructions for multiple interlocked pipelines
potentially multiple register write-backs
multiported register file, and/or split register file

Limits to Superscalar (multiple issue) microprocessors

limited number of independent operations available at any point in a stream
pipeline depth affects that (in the sense that deeper pipelines need more registers and that the penalty is much higher).
example: if you stall for 3 cycles, you're stalling 18 cycles on alpha (3 × N)
limited by branches to generally fewer than 10, typically about 4 (downstairs alpha is 4, upstairs alpha is 6)
this limit can be increased by increasing the number of registers
limited also by the number of independent buses: each pipeline needs its own access to the registers (e.g. 4 pipes need 8 buses, a source and a destination bus each), so the register file needs to be octal ported.
depends on how good your compiler is and how well it can pull apart loops. e.g. dot products are limited only by number of registers (practical considerations aside).
summing loops are another good example of concurrency. (current term is dependent on previous term, can break the range apart and sum at the end).
summing loops with "tree" structure (loop level parallelism, not actually instruction level parallelism --- but loop level parallelism can also be exploited by multiple issue).
higher levels can be achieved with aggressive branch prediction and speculative execution.
the new IA64 arch. does not have branch prediction or speculative hardware; relies very heavily on the compiler (it's somewhere between risc and vliw: called "post risc").
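
The summing-loop point above (break the range apart, sum at the end, combine as a tree) can be sketched as follows. The function names and the default of 4 lanes are assumptions for illustration:

```python
def sum_serial(xs):
    """Every add depends on the previous one: no parallelism to exploit."""
    total = 0
    for x in xs:
        total += x
    return total


def sum_split(xs, lanes=4):
    """Keep `lanes` independent running sums, then combine them pairwise."""
    partial = [0] * lanes
    for i, x in enumerate(xs):
        partial[i % lanes] += x   # adds in different lanes do not depend on each other
    # Tree-shaped reduction of the partial sums.
    while len(partial) > 1:
        paired = [partial[i] + partial[i + 1] for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:
            paired.append(partial[-1])   # odd one out carries to the next level
        partial = paired
    return partial[0]


print(sum_serial(range(100)), sum_split(range(100)))
```

Python itself still runs this serially; the point is only the dependence structure a compiler or dispatcher would see. With floating-point data the reordered sum can differ in the last bits, which is one reason compilers do not always do this on their own.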

Maximum ILP

Can be as high as 50 on some very regular codes
Appears to be about 6 for typical mix (e.g. SPEC) so not much sense making an 8 issue CPU, for most applications.
May be only 4 for integer programs.
Potential ILP may be 3X this, but hard to obtain
Still a worthwhile performance gain
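
One common way to put a number on ILP for a straight-line block is instruction count divided by the length of the longest dependence chain. A minimal sketch, assuming unit latency and that only true (read-after-write) dependences count; this is an illustration and not the method behind the figures above:

```python
def ilp(program):
    """program: list of (dest, srcs) tuples in program order.

    Assumes every instruction takes one cycle and that WAR/WAW conflicts
    have been removed (e.g. by renaming), so only RAW chains matter.
    """
    depth_of = {}   # register -> chain depth of the instruction that last wrote it
    longest = 0
    for dest, srcs in program:
        depth = 1 + max((depth_of.get(s, 0) for s in srcs), default=0)
        depth_of[dest] = depth
        longest = max(longest, depth)
    return len(program) / longest


independent_ops = [("r1", ("a", "b")), ("r2", ("c", "d")),
                   ("r3", ("e", "f")), ("r4", ("g", "h"))]
chain = [("r1", ("a",)), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]
print(ilp(independent_ops), ilp(chain))
```

Four independent operations give an ILP of 4, while a four-long chain gives 1 no matter how many pipes are available.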

ILP:

ILP is a property of code: instructions in a given sequence are often independent of one another, which opens up the possibility of executing them concurrently (at the same time).
technique to run more than one instruction at once is multiple issue.
in order to exploit ILP, need to know where hazards are (2 techniques)
then we need to do something about the hazards; 2 techniques for that as well:
1. reordering
2. register renaming

Two techniques to identify hazards:

Scoreboarding -- originally developed for the CDC-6600 in 1964 by Cray. (first implementation): centralized and explicit control. helps with register renaming, which helps with multiple issue.
- used in modern microprocessors
Reservation stations -- developed for the IBM 360/91 in 1967 by Tomasulo. Distributed and more implicit control, e.g. IBM 360 only had 4 FP regs; used Tomasulo's ...
- more complicated than scoreboarding.
- no longer used in modern microprocessors.
- may make a comeback in the next 5 years.
scoreboarding looks at what's used when, whereas Tomasulo's reserves registers in advance (doesn't look forward).
scoreboarding looks forward to some number of instructions (limited by how much silicon you have). you can scoreboard more than the stages in pipeline (e.g. can reorder instructions based on that information).
both of these 2 methods maximize utilization of register file, alu, memory, etc..
in modern computing: more registers, so both of these 2 methods get more complicated.
both of these alleviate work from the compiler.
historically compiler writers and CPU makers didn't talk to each other.
when they started communicating, silicon could be used for something other than scoreboarding or Tomasulo's.
things have reversed nowadays (almost all the intelligence is in the compiler)

Scoreboard

Centralized set of tables ("registers") in the cpu that keep track of what's being used and when.
each row indicates what's being used at each clock tick.
scoreboard manages dispatch, stalling, and completion of instructions
Watches for WAW, RAW, WAR hazards, and manages the execution sequence (e.g. possible instruction reordering) to avoid these hazards.
Hazards are detected by observing register references and operation types
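
Detecting hazards from register references alone can be sketched like this. The `(dest, srcs)` instruction encoding is an assumption made for the example:

```python
def hazards(first, second):
    """Classify the hazards between two instructions, `first` being earlier
    in program order. Each instruction is a (dest, srcs) tuple."""
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 in srcs2:
        found.add("RAW")   # second reads what first writes
    if dest2 in srcs1:
        found.add("WAR")   # second overwrites what first still has to read
    if dest1 == dest2:
        found.add("WAW")   # both write the same register
    return found


print(hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r5"))))   # RAW only
print(hazards(("r1", ("r2", "r3")), ("r2", ("r4", "r5"))))   # WAR only
print(hazards(("r1", ("r2", "r3")), ("r1", ("r4", "r5"))))   # WAW only
```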

Scoreboard Contents

Result Register Designator
Entry-Operand Register Designator
Valid Operand Flag
Source Function Unit Designator
Each functional unit also has 2 operand registers, a result register, a mode register, and a busy flag

Result Register Designator

Each ISA register has a matching scoreboard register indicating which functional unit is scheduled to write to it next (or if it is free)
Instructions are issued only if their ISA result register is free, and the necessary functional unit isn't busy
Avoids WAW hazards

Result Register Dependencies

Just because a destination register is free doesn't mean its contents can be overwritten - can still be an operand
Destination register must not be listed in the Entry-Operand Register Designators in order to be written
Avoids WAR hazards
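
The two checks above (issue only if the result register and unit are free; write only if nobody still lists the register as an unread operand) can be sketched together. This is a toy model under stated assumptions, not the CDC-6600 logic: the class and method names are hypothetical, and operands that are still being computed are simply left out of the entry-operand set, since waiting for them is a RAW matter handled elsewhere.

```python
class Scoreboard:
    def __init__(self):
        self.result_unit = {}      # ISA register -> unit scheduled to write it next
        self.entry_operands = {}   # busy unit -> source registers it has not read yet

    def can_issue(self, unit, dest):
        # Unit must be idle, and no pending writer for dest (avoids WAW).
        return unit not in self.entry_operands and dest not in self.result_unit

    def issue(self, unit, dest, srcs):
        if not self.can_issue(unit, dest):
            return False
        # Only operands currently valid in the register file are listed here.
        self.entry_operands[unit] = {s for s in srcs if s not in self.result_unit}
        self.result_unit[dest] = unit
        return True

    def read_operands(self, unit):
        self.entry_operands[unit].clear()

    def can_write(self, unit, dest):
        # Blocked while another unit still has dest as an unread operand (avoids WAR).
        return all(dest not in regs
                   for other, regs in self.entry_operands.items() if other != unit)

    def write_back(self, unit, dest):
        if not self.can_write(unit, dest):
            return False
        del self.result_unit[dest]
        del self.entry_operands[unit]   # unit becomes idle again
        return True


sb = Scoreboard()
print(sb.issue("mul", "r1", ["r2", "r3"]))   # issues
print(sb.issue("add", "r1", ["r4", "r5"]))   # refused: r1 already has a pending writer
print(sb.issue("add", "r2", ["r4", "r5"]))   # issues: r2 is free as a result register
sb.read_operands("add")
print(sb.can_write("add", "r2"))             # blocked: mul has not read r2 yet
```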

Forwarding

forwarding gets more complicated in superscalar processors because we now need to check for validity across several pipelines.
forwarding needs to be responsive to the contents of the scoreboard table.
Source Function Unit Designators tell each unit where its operands will come from, so it can take them from the other unit's Result Register
If a result is in an ISA register, then its valid flag is set and the SFUD is empty
Avoids RAW hazards

Reservation Stations
- Pools of register(s) associated with each functional unit
- Distributed resource, following instruction dispatch
- Controls entry of instructions into functional units, but allows out-of-order generation of results
Reservation Station Contents
- Instruction
- Operand sources or values
- Result destination
- Valid source flags
- Instruction is released to functional unit when its operands are all valid (which can include forwarding)
- Avoids RAW hazards
Feeding Reservation Stations
- Issue and release are separate
- Instruction is issued if a station for its FU is available, and its destination register isn't busy -- avoids WAW hazards
- On issue, destination register is marked busy, and assigned tag for the FU that will fill it
Release and Execute
- Valid source values or source tags are sent to the reservation station - avoids WAR hazards
- Results with matching tags go to reservation stations and registers
- Execution proceeds when all values are available
- Result broadcast on common bus
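
The tag matching described above can be sketched as follows. This is a toy model, assuming operands are stored as `("val", x)` or `("tag", producer)` pairs; the names `Station` and `broadcast` and the tag strings are hypothetical:

```python
class Station:
    """One reservation station entry: an operation plus its operands."""

    def __init__(self, op, operands):
        self.op = op
        self.operands = list(operands)   # each is ("val", number) or ("tag", producer)

    def ready(self):
        # Released to the functional unit only when every operand is a value.
        return all(kind == "val" for kind, _ in self.operands)

    def capture(self, tag, value):
        # Replace any operand waiting on this tag with the broadcast value.
        self.operands = [
            ("val", value) if operand == ("tag", tag) else operand
            for operand in self.operands
        ]


def broadcast(tag, value, stations, reg_tags, reg_values):
    """Common-bus broadcast: stations and busy registers both watch for their tag."""
    for station in stations:
        station.capture(tag, value)
    for reg, waiting_on in list(reg_tags.items()):
        if waiting_on == tag:
            reg_values[reg] = value
            del reg_tags[reg]            # register is no longer busy


add = Station("add", [("tag", "mul1"), ("val", 3)])
reg_tags = {"r1": "mul1", "r2": "add1"}   # busy registers and who will fill them
reg_values = {"r1": 0, "r2": 0}
print(add.ready())
broadcast("mul1", 10, [add], reg_tags, reg_values)
print(add.ready(), add.operands, reg_values, reg_tags)
```

After the broadcast the add station holds two values and can execute, r1 has its result, and r2 stays busy until the tag `add1` is broadcast.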
Completion
- Register matches its tag to a broadcast value and clears busy
- Parallel issue leads to out-of-order execution that must be turned into in-order completion
- Done through register renaming:
- Register Pool or
- Reorder Buffers
Register Pool
- More physical registers than logical
- Mapping table: logical to physical
- Free, logical, pending states
- Register becomes free when another one takes the same logical name
- Pending temporarily holds a result
- Indicates logical destination and source instruction tag
Register Pool Assignment
- Function unit result is sent to first free register -- if none, then stall
- Register becomes pending, holds its future name and instruction tag
- Takes the name, becomes logical when instruction "writes back"
- Instruction has to wait for all ahead of it to write back before it can
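
The free / pending / logical life cycle can be sketched as a small state table. A minimal model, assuming the caller invokes `write_back` in program order (the pool itself does not enforce that); all names are hypothetical:

```python
class RegisterPool:
    def __init__(self, logical_names, n_physical):
        self.state = ["free"] * n_physical
        self.name = [None] * n_physical    # logical name held now, or to be taken
        self.tag = [None] * n_physical     # instruction tag while pending
        self.value = [None] * n_physical
        self.mapping = {}                  # logical name -> physical index
        for i, logical in enumerate(logical_names):
            self.state[i] = "logical"
            self.name[i] = logical
            self.mapping[logical] = i

    def accept_result(self, logical_dest, instr_tag, value):
        """Functional unit result goes to the first free register; None means stall."""
        if "free" not in self.state:
            return None
        p = self.state.index("free")
        self.state[p] = "pending"
        self.name[p], self.tag[p], self.value[p] = logical_dest, instr_tag, value
        return p

    def write_back(self, instr_tag):
        """Pending register takes the logical name; the previous holder becomes free."""
        p = self.tag.index(instr_tag)
        old = self.mapping[self.name[p]]
        self.state[old], self.name[old], self.value[old] = "free", None, None
        self.state[p], self.tag[p] = "logical", None
        self.mapping[self.name[p]] = p
        return p


pool = RegisterPool(["r1", "r2"], n_physical=4)
p = pool.accept_result("r1", "i7", 42)
print(p, pool.state, pool.mapping)
pool.write_back("i7")
print(pool.state, pool.mapping)
```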
Reorder Buffers
- Central pool (or attached to stations)
- Circular queue of registers
- Assigned when instruction is issued
- Structure represents issue order
- Holds result value and logical destination until instruction commits
- Forwards as necessary, station refers to buffer that will write value last
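
A reorder buffer as a circular queue, with out-of-order completion and in-order commit, might look like the sketch below. The class, its method names, and the dict-based register file are assumptions for illustration; forwarding from the buffer is not modeled:

```python
class ReorderBuffer:
    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0      # oldest entry still in flight
        self.count = 0     # number of occupied slots

    def issue(self, dest):
        """Allocate the next slot in issue order; None means the buffer is full."""
        if self.count == len(self.entries):
            return None
        slot = (self.head + self.count) % len(self.entries)
        self.entries[slot] = {"dest": dest, "value": None, "done": False}
        self.count += 1
        return slot

    def complete(self, slot, value):
        """A functional unit finished; results may arrive in any order."""
        self.entries[slot]["value"] = value
        self.entries[slot]["done"] = True

    def commit(self, regfile):
        """Retire finished entries from the head only, so registers update in order."""
        committed = 0
        while self.count and self.entries[self.head]["done"]:
            entry = self.entries[self.head]
            regfile[entry["dest"]] = entry["value"]
            self.entries[self.head] = None
            self.head = (self.head + 1) % len(self.entries)
            self.count -= 1
            committed += 1
        return committed


rob = ReorderBuffer(4)
a = rob.issue("r1")
b = rob.issue("r2")
regs = {}
rob.complete(b, 20)          # younger instruction finishes first
print(rob.commit(regs), regs)
rob.complete(a, 10)
print(rob.commit(regs), regs)
```

The younger result waits in the buffer until the older instruction is done, then both retire together.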
Limits to ILP
- Example: superscalar (4), superpipeline (10) requires 40 instructions that can be executed
- Superscalar effectively shortens distance between branches in pipe
- Every issue group may include a branch -- separate branch unit
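
The arithmetic behind these limits, as a small worked sketch. The "one branch every 5 instructions" figure is a made-up input for illustration, not a number from the lecture:

```python
def instructions_in_flight(issue_width: int, pipeline_depth: int) -> int:
    """Independent instructions needed to keep every stage of every pipe busy."""
    return issue_width * pipeline_depth


def cycles_between_branches(instructions_per_branch: float, issue_width: int) -> float:
    """Wider issue consumes the instructions between branches in fewer cycles."""
    return instructions_per_branch / issue_width


print(instructions_in_flight(issue_width=4, pipeline_depth=10))            # 40
print(cycles_between_branches(instructions_per_branch=5, issue_width=1))   # 5.0
print(cycles_between_branches(instructions_per_branch=5, issue_width=4))   # 1.25
```

Under that assumption a 4-issue machine meets a branch a little more often than once per cycle, which is why an issue group may need its own branch unit.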
UltraSPARC
- 4-way superscalar
- In-order dispatch, out-of-order completion
- Groups 4 instructions per cycle for issue: Load/Store, FP/Graphics, Integer, Branch
- In-order register file update
PowerPC 601 Part 1
- 3-way superscalar
- Integer, branch, FP (loads and stores handled in corresponding units)
- First 3 stages shared: Fetch, Cache, Dispatch
- 4-word instruction prefetch buffer
- 4-word dispatch buffer
PowerPC 601 Part 2
- Branches start in dispatch, take 1 more cycle for writeback (4 stages)
- Integer ops execute then write back (5 stages)
- FP ops multiply, add, arithmetic or load, write back (7 stages)
- In-order within pipes, out-of-order between pipes
PowerPC 601 Part 3
- All register modifying operations are synchronized with a corresponding integer operation
- If an integer operation isn't available, then a no-op is inserted
- Simple means to ensure in-order completion
- Limited out-of-order capabilities
PowerPC 604 Part 1
- 4-way superscalar (load/store unit)
- Fetch, Decode, Dispatch are shared
- Branches: Validate, Complete
- Integers: Execute, Complete, WB
- Load/Store: Address/Cache/Align (2 cycles), Complete, WB
- FP: Mult, Add, Round/Normalize, Complete, WB
PowerPC 604 Part 2
- 3 integer units, FP, branch, L/S
- 16 reorder buffers -- no result field
- Multiple instructions can complete simultaneously if they are all done and together at the end of the buffer
- Separate 12 integer, 8 FP, 8 cond rename registers
PowerPC 604 Part 3
- Each unit has 2 reservation stations
- Integer units can release out-of-order, others are in-order
- Exceptions are tied to integer pipe
- Branch history and target buffers
- Resolution at any stage
- Penalty of 1 to 4 cycles, 1 cycle to start the mispredicted instruction
References
1. http://www.cs.umass.edu/~weems/CmpSci635/