speedups and parallelism in programs

there are many approaches to increasing the number of instructions microprocessors can do per clock tick:

pipelining (described already)
multiple pipelines = superscalar = multiple issue = multiple execution units
- one "thread" of code
- single instruction stream
- programmer doesn't need to worry about or even know about the existence of multiple pipes
- exploit instruction level parallelism
- pentium is a crippled dual issue design: its two pipelines are unequal, due to the need for downward compatibility.
- 2nd flr. DEC alpha = chaos@eecg.toronto.edu has 4 equal pipelines (4 execution units of depth 5)
- 4th flr. DEC alpha = eyetap.org has 6 equal pipelines (6 execution units of depth 5)
- microprocessors with multiple execution units can actually execute more than one instruction per clock cycle, e.g. the DEC alpha does up to 6 instructions per clock cycle. therefore, although it's only a 600MHz alpha, its peak rate is 3.6 billion instructions per second, i.e. it's effectively doing the work of a 3.6 gigahertz single-issue processor. (e.g. like a P1800 only better because of other architectural factors)
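
the peak-rate arithmetic for the 600MHz alpha can be sketched as follows. this is only a back-of-the-envelope upper bound, assuming every pipeline is kept busy on every cycle (real code rarely manages that); the function name is made up for illustration.

```python
def peak_instructions_per_second(clock_hz, issue_width):
    """Upper bound on throughput: one instruction per pipeline per cycle.

    Assumes no stalls and no data dependencies, so this is a ceiling,
    not a measurement.
    """
    return clock_hz * issue_width


# 600MHz clock, up to 6 instructions issued per cycle (figures from the notes)
alpha_peak = peak_instructions_per_second(600_000_000, 6)
print(alpha_peak)  # 3600000000, i.e. 3.6 billion instructions per second
```
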
multiple microprocessors
- exploit loop level parallelism
- multiple "threads"
- forking
- there are many examples of loops that don't have data dependencies
- useful when there are no data dependencies between iterations of the loop
- example: inner product (can give a copy of array to many processes)
- beowulf (a cluster of networked commodity PCs): parallelism is written explicitly into the code
- SMP (symmetric multiprocessing) can also exploit loop level parallelism
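
a minimal sketch of the inner product example, in python. each worker gets its own slice of the two arrays and computes a partial sum; because no iteration depends on another, the partial sums can be computed in any order and added up at the end. threads from the standard library stand in for the separate processors here (in CPython they will not actually run the arithmetic in parallel); `ProcessPoolExecutor` from the same module has the same interface and uses separate processes instead. the names `split`, `partial_dot` and `parallel_inner_product` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split(seq, parts):
    """Cut seq into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(seq), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks


def partial_dot(pair):
    """Inner product of one pair of slices; independent of all other slices."""
    a_chunk, b_chunk = pair
    return sum(x * y for x, y in zip(a_chunk, b_chunk))


def parallel_inner_product(a, b, workers=4):
    """Split the loop across `workers` and add up the partial sums."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    pieces = zip(split(a, workers), split(b, workers))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_dot, pieces))


print(parallel_inner_product([1, 2, 3], [4, 5, 6]))  # 32
```

the only step that needs results from more than one worker is the final sum, which is why this kind of loop splits so easily.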

problem: the fetch stage is always trying to fetch the next instruction, so it contends for the bus with the data reads and writes of the instructions being executed.

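solution: introduce a prefetch queue, i.e. fetch instructions ahead of time whenever the bus is free and hold them in a small buffer until they are needed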
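
a toy cycle-count model of the idea, in python. everything about it is an assumption made for illustration: the bus carries one transfer per cycle, a fetch takes one cycle, and the instruction kinds and their costs in `COST` are hypothetical. it is not a model of any real processor; it only shows that fetching during bus-idle cycles hides fetch time.

```python
from collections import deque

# hypothetical instruction kinds: (execution cycles, uses the bus while executing)
COST = {"alu": (1, False), "mul": (3, False), "mem": (1, True)}


def cycles_without_prefetch(program):
    """Fetch and execute strictly one after the other: 1 fetch cycle + execution."""
    return sum(1 + COST[kind][0] for kind in program)


def cycles_with_prefetch(program, depth=4):
    """Count cycles when a queue of `depth` entries is filled on bus-idle cycles."""
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    queue = deque()
    next_fetch = 0   # index of the next instruction to fetch
    remaining = 0    # cycles left on the instruction currently executing
    uses_bus = False
    done = 0
    cycles = 0
    while done < len(program):
        cycles += 1
        # execution unit: start the next queued instruction if idle
        if remaining == 0 and queue:
            remaining, uses_bus = COST[queue.popleft()]
        bus_busy = remaining > 0 and uses_bus
        # fetch unit: use the bus only when execution is not using it
        if not bus_busy and len(queue) < depth and next_fetch < len(program):
            queue.append(program[next_fetch])
            next_fetch += 1
        if remaining > 0:
            remaining -= 1
            if remaining == 0:
                done += 1
    return cycles


program = ["mul", "mem", "mem", "alu"]
print(cycles_without_prefetch(program))        # 10
print(cycles_with_prefetch(program, depth=1))  # 9
print(cycles_with_prefetch(program, depth=4))  # 7
```

in this model the deeper queue wins because the three bus-idle cycles of the multiply are used to fetch the following instructions, so the later memory accesses never have to wait behind a fetch.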

IA-64 (intel's 64bit architecture; not an extension of x86)
it's somewhere between CISC and VLIW
128 64bit general registers (register 0 always reads as zero, leaving 127 usable)

instructions are issued in 128 bit bundles (three 41 bit instructions plus a 5 bit template)
HP+intel joint venture
gnux (gnu linux) is currently the only operating system that will run on it natively
nobody could get win64 to run on it.