MMX Microarchitecture of Pentium® Processors With
MMX Technology and Pentium® II Microprocessors
Michael Kagan, IDC, Intel Corp.
Simcha Gochman, IDC, Intel Corp.
Doron Orenstien, IDC, Intel Corp.
Derrick Lin, MD6, Intel Corp.
Index Words: MMX™ technology, multimedia applications, IA extensions, Pentium® processor
Abstract
The MMX™ technology is an extension to the Intel
Architecture (IA) aimed at boosting the performance of
multimedia applications. This technology is the most
significant IA extension since the introduction of the
Intel386™ microprocessor. The challenge in
implementing this technology came from retrofitting the
new functionality into existing Pentium® and
Pentium® Pro processor designs.
Additional changes were introduced in the
microarchitecture of the predecessor microprocessor in order
to stay on the performance curve, improving the
frequency and clock per instruction performance.
Introduction
During the ramp of the Pentium® processor in 1993, it
became evident that the home market was becoming a
major consumer of PCs, with a major boost coming from
multimedia applications.
The main challenge was how to incorporate the new
instructions while also keeping upcoming products on the
Intel performance curve. Both projects had to deliver
higher performance than their predecessors on legacy
applications, using both frequency gain and CPI (Clocks
Per Instruction) microarchitecture improvements.
On the other hand, the new instructions had to be
implemented in a cost-effective way, i.e., provide a
breakthrough performance boost on multimedia
applications while keeping the die-size cost
reasonably low. Moreover, the Pentium processor with MMX
technology and the Pentium® II processor, being the first
microprocessors to implement the new Instruction Set
Architecture (ISA), had to deliver superior multimedia
performance to demonstrate that the benefit of the ISA
extension would be compelling enough for Independent
Software Vendors (ISVs) to develop software using these
new instructions and fuel up the software spiral.
Traditionally, multimedia applications were supported by
expansion hardware with dedicated software, thereby
increasing the cost of the machine and lacking common
standards. Engineers in the Intel Architecture group
saw the need to execute multimedia operations
on the core CPU itself. This would establish a
standard for the industry, reduce the cost of the system,
and free up motherboard expansion slots.
A distinct characteristic of multimedia applications is the
execution of the same operation on multiple small-size
data items (e.g., 8 and 16 bits). The Single Instruction
Multiple Data (SIMD) architecture provides a cost-
effective solution for such applications, and therefore it
was decided to extend the IA with 57 new MMX™
SIMD-type instructions.
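As a rough illustration of the SIMD idea described above, the following Python sketch treats one 64-bit integer as four packed 16-bit lanes and adds two such words lane by lane. The helper names (`pack`, `unpack`, `packed_add_wrap`, `packed_add_unsigned_saturate`) are hypothetical and chosen for this example only; the two add variants are meant to mirror the wraparound and unsigned-saturating behavior of packed word adds such as PADDW and PADDUSW, not to model the hardware.

```python
LANES = 4                      # four 16-bit items in one 64-bit operand
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def unpack(word):
    """Split a 64-bit integer into four 16-bit lanes (lane 0 = lowest bits)."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def pack(lanes):
    """Combine four 16-bit lanes back into one 64-bit integer."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & LANE_MASK) << (i * LANE_BITS)
    return word


def packed_add_wrap(a, b):
    """Lane-wise add; each lane wraps around modulo 2**16 (pack masks it)."""
    return pack([x + y for x, y in zip(unpack(a), unpack(b))])


def packed_add_unsigned_saturate(a, b):
    """Lane-wise add; each lane clamps at 0xFFFF instead of wrapping."""
    return pack([min(x + y, LANE_MASK) for x, y in zip(unpack(a), unpack(b))])


a = pack([1, 2, 0xFFFF, 4])
b = pack([10, 20, 1, 40])
wrapped = unpack(packed_add_wrap(a, b))
saturated = unpack(packed_add_unsigned_saturate(a, b))
```

One operation on the packed word replaces four scalar additions. The third lane shows the difference between the two variants: it wraps to 0 in the first and clamps to 0xFFFF in the second.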
At the time of this decision, two design projects were in
their initial development stages: a high-end Pentium
processor, and the Pentium® II processor, a
Pentium® Pro processor compaction, both based on
Intel’s 0.35-micron CMOS process. In order to allow a fast ramp
and a top-to-bottom penetration of the new extensions into
the PC market, it was decided to incorporate new
instructions in both projects and have them become the
flagships of the new architecture extension.
The new instructions operate on packed data types (single
operand represents more than one datum) and use a flat
register file that is architecturally aliased to an existing
register file of the Floating-Point (FP) stack. This
definition allows a variety of implementation alternatives.
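The architectural aliasing described above can be pictured with a small Python sketch. It models only the software-visible view, in which each MMX register occupies the low 64 bits (the mantissa field) of one 80-bit floating-point register, and a write through the MMX view sets the upper 16 bits (exponent and sign) to all ones. The class and method names are hypothetical, and tag-word and top-of-stack effects are deliberately left out.

```python
MANTISSA_MASK = (1 << 64) - 1      # low 64 bits of an 80-bit FP register
EXP_SIGN_ONES = 0xFFFF << 64       # bits 64..79 set to all ones


class AliasedRegisterFile:
    """Toy model of eight 80-bit FP registers with an MMX view on top."""

    def __init__(self):
        self.fp = [0] * 8          # each entry holds an 80-bit value as an int

    def write_mmx(self, index, value):
        # The 64-bit value lands in the mantissa field; exponent and sign
        # bits are forced to ones, as on an MMX register write.
        self.fp[index] = EXP_SIGN_ONES | (value & MANTISSA_MASK)

    def read_mmx(self, index):
        # The MMX view sees only the low 64 bits of the FP register.
        return self.fp[index] & MANTISSA_MASK


rf = AliasedRegisterFile()
rf.write_mmx(3, 0x0123456789ABCDEF)
```

Because only the architectural view is fixed, an implementation is free to keep the MMX data in separate physical storage as long as software observes this behavior.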
Intel Technology Journal Q3 ’97
At that time, the Pentium and Pentium Pro processors
were both in advanced development stages with a much
more mature database and silicon experience. In order to
stay on the performance curve and catch up on frequency,
we had to set a more aggressive frequency goal than our
predecessors and also improve CPI performance. In the
Pentium processor with MMX technology, this resulted in
restructuring the entire machine by adding one more stage
to the processor main pipeline. The Pentium II processor
design team improved the performance of graphics
applications and achieved a higher frequency through less
aggressive architectural changes.
Figure 1. Frequency Improvement Trends
(Plot of normalized frequency, 0.50 to 3.00, by quarter from Q1’95 to Q2’97. Curves: Pentium speedup trend and PP/MT* speedup trend, shown with the Pentium and PP/MT* architecture limits. PP/MT data prior to Q4’96 are pre-production. *PP/MT: Pentium processor with MMX technology.)
Both design teams delivered excellent results. The
Pentium processor with MMX technology achieved both
its CPI and frequency goals. It is 20% higher in frequency
(running at 233MHz in production) and 15% faster on
CPI than other Pentium processors. The Pentium II
processor significantly improved the performance of
graphics code and achieved a 300MHz frequency at
introduction. The speedup goal for multimedia
applications was achieved as well. Most applications
using the new instructions improved by a factor of 1.6X,
with some having improved up to 4X.
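As a back-of-the-envelope check on how the two gains quoted above combine, the sketch below multiplies the frequency ratio by the per-clock ratio. It assumes that "15% faster on CPI" means 1.15 times as many instructions completed per clock and that the two effects compose multiplicatively; both are assumptions of this example, not statements from the measurements.

```python
def relative_performance(frequency_ratio, per_clock_ratio):
    """Performance scales with clock rate times work done per clock."""
    return frequency_ratio * per_clock_ratio


# 20% higher frequency and 15% better per-clock performance (figures from the text).
combined = relative_performance(1.20, 1.15)
```

Under these assumptions the combined speedup on legacy code comes out at roughly 1.38 times the predecessor.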
Pentium Processor With MMX Technology
Microarchitecture
In order to exceed the performance of its predecessor, the
design team had to improve both the frequency and CPI
performance of the microprocessor. Both of these goals
could be achieved with microarchitecture changes
implemented in the new processor.
Frequency Speedup
Frequency is the most significant factor that determines
the performance of a microprocessor and is a major (and
sometimes only) performance indicator used by
customers. Therefore, it was not an option to come up with
a new product running at a lower frequency than its
predecessor.
The frequency of a product approaches the architectural
limit of the device asymptotically, as escapes are cleaned
up and slight design improvements are made in critical
paths. Therefore, in order to
match a predecessor’s frequency, a product that comes to
market later must have higher architectural frequency
limits. Figure 1 illustrates frequency improvement trends
for the Pentium processor and Pentium processor with
MMX technology.
In order to improve the architectural limit of the device,
we had to identify and resolve the major speed bottlenecks
of the Pentium processor’s architecture. After a thorough
analysis, two major bottlenecks were identified: the
decoder and the data cache access. The two bottlenecks
were dependent. In other words, resolving one of them
would help to speed up the other one. We decided to
resolve the decoder bottleneck, since it was simpler and
less risky, and it would also allow a smooth
implementation of MMX instruction decoding. The
Pentium processor execution pipeline originally consisted
of five pipeline stages: Pre-fetch (PF), Decode1 (D1),
Decode2 (D2), Execute (E), and Writeback (WB). We
added a pipeline stage, Fetch (F), in the front end of the
machine, rebalanced the entire pipeline to take advantage
of the extra clock cycle, and added a queue between the F
and D1 stages to decouple freezes, which are the most
critical signals generated in every pipeline stage. Figure 2
illustrates the difference between the original Pentium
processor pipeline and the MMX technology pipeline.
Figure 2. Pentium Processor and
Pentium Processor With MMX Technology Pipeline
(Diagram: the original Pentium pipeline shown above the Pentium with MMX technology pipeline, which consists of the PF, F, D1, D2, E, and WB stages with a queue in the front end.)
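The effect of the queue between the F and D1 stages can be sketched behaviorally. The toy model below is an assumption-laden illustration, not the actual design: it counts how often the fetch stage has to freeze when decode stalls, for a given queue capacity. The function name, the one-item-per-cycle rates, and the stall pattern are all hypothetical.

```python
from collections import deque


def run_front_end(decode_stalls, queue_depth, cycles):
    """Return (fetch_freezes, decoded) for a toy F -> queue -> D1 front end.

    decode_stalls: set of cycle numbers in which D1 cannot accept work.
    queue_depth:   capacity of the queue (1 behaves like a plain latch).
    """
    queue = deque()
    fetch_freezes = 0
    decoded = 0
    for cycle in range(cycles):
        # D1 drains one entry per cycle unless it is stalled this cycle.
        if cycle not in decode_stalls and queue:
            queue.popleft()
            decoded += 1
        # F delivers one entry per cycle unless the queue is full.
        if len(queue) < queue_depth:
            queue.append(cycle)
        else:
            fetch_freezes += 1
    return fetch_freezes, decoded


latch_only = run_front_end({3, 4}, queue_depth=1, cycles=8)
with_queue = run_front_end({3, 4}, queue_depth=4, cycles=8)
```

With a single latch, the two-cycle decode stall freezes fetch for two cycles; with a four-entry queue, fetch keeps running and the stall is absorbed. In hardware the benefit described in the text is a timing one: the freeze no longer has to reach the fetch logic within the same clock.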
An additional clock cycle in the front end of the pipeline
resolved the decoder speed bottleneck and reduced fan-out
for the data cache freeze (generated in the E stage),
which in turn relaxed a requirement for this freeze signal.
This was the first step in the resolution of the data cache
bottleneck.
The next step was to improve the timing of the data freeze
signal generated by the data cache. The cache access path
starts with address generation in the D2 stage, followed by
a subsequent cache access in the E stage. The entire path
was redesigned for self-timed pipelined execution with time
borrowing between the stages. The address generation
logic was changed, incorporating simplified and faster
adders, thereby allowing faster address generation.
The third step was the cache circuit architecture. It was
performance-crucial to execute single-clock read and
write operations in each cache port. As a result, the
Pentium processor’s cache access windows were designed
to support two access windows per clock, as illustrated in
Figure 3.
Figure 3. Pentium Processor’s Cache Array Access Windows
(Timing diagram. Labels: read timing (tag lookup, data read, precharge); write timing (tag lookup, data write to array, precharge); cache access windows (data read, data write, precharge).)
Although read and write operations to the same port were
never performed in the same cycle, cache timers had to
support two access windows, thereby limiting the overall
cache access time. On the other hand, since read and write
operations never happen in the same clock to the same
port, both access windows could never be active in the
same clock cycle. In other words, during a read operation,
no data access could be performed in a write access
window and vice versa. Therefore, we decided to have
just one data access window in the front end of the cycle
(i.e., at the read window timing) and use it for both read and
write accesses. The read access works as in other Pentium
processors; it is a speculative operation and can be thrown
away. Write access depends on the result of a tag lookup
and cannot be executed in the same clock in which the tags
are looked up. Therefore, the Pentium processor with MMX
technology implemented a cache store hit buffer. If a store
hit is encountered at the cache lookup phase, the data is
stored to this buffer. The actual store to the data array will
be done at the data access window of the next write
operation, while this window is idle. Meanwhile, before
the next write, the data can be delivered from the store hit
buffer to subsequent reads from this address. In other
words, the Pentium processor with MMX technology
pipelines write operations with one another. Each time a
store is executed, the tag lookup is performed for the
current store, while the data array is updated with data
from the previous store. This way we could have only one
data array access window, which allowed a significant
speedup of cache access.
Figure 4 illustrates the Pentium processor with MMX
technology’s cache access windows architecture.
Figure 4. Pentium Processor With MMX Technology’s
Cache Array Access Windows Architecture
(Timing diagram. Labels: read timing (tag lookup, data read, precharge); write timing (tag lookup, data write to hit buffer, data write to array, precharge); cache access windows (data read/write, precharge); timing relaxation.)
The solutions described above resolved major Pentium
processor speed paths, allowing a frequency leap.
Additional local changes were performed in every
functional block to keep all the rest of the circuitry in line
with this new goal.
In summary, the Pentium processor with MMX
technology designers addressed two major bottlenecks at a
global architecture level (adding a pipeline stage and
rebalancing the entire machine), made a few changes on the
intermediate level (time borrowing between pipe stages
for a specific operation), and implemented numerous local
changes to keep the machine balanced. This top-down
approach allowed us to achieve a 20% frequency boost
over the original Pentium processor design.
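The store hit buffer scheme described in this section can be sketched as a toy Python model. It is a simplification under stated assumptions: a dictionary stands in for the tag and data arrays, the buffer holds a single pending store, and misses are not modeled. The class name `StoreHitBufferCache` and its methods are hypothetical.

```python
class StoreHitBufferCache:
    """Toy model: a store hit is parked in a buffer until the next store."""

    def __init__(self, lines):
        self.array = dict(lines)   # data array contents: address -> value
        self.pending = None        # store hit buffer: (address, value) or None
        self.array_writes = 0      # number of data-array write accesses

    def store(self, addr, value):
        hit = addr in self.array               # tag lookup for the current store
        if self.pending is not None:           # meanwhile, retire the previous store
            old_addr, old_value = self.pending
            self.array[old_addr] = old_value
            self.array_writes += 1
            self.pending = None
        if hit:
            self.pending = (addr, value)       # park the new data in the buffer
        return hit                             # miss handling is out of scope here

    def load(self, addr):
        if self.pending is not None and self.pending[0] == addr:
            return self.pending[1]             # forward from the store hit buffer
        return self.array.get(addr)


cache = StoreHitBufferCache({0x10: 1, 0x20: 2})
cache.store(0x10, 99)
value_before_drain = cache.load(0x10)
array_before_drain = cache.array[0x10]
cache.store(0x20, 7)
```

After the first store the data array still holds the old value, while a load from the same address already sees the new one through the buffer. The second store performs its tag lookup while the first store's data is written into the array, which is the overlap the text describes.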
CPI Performance
Although adding a pipeline stage improves frequency, it
decreases CPI performance: the longer the pipeline,
the more work the machine does speculatively, and
therefore the more work is thrown away on a
branch misprediction. The additional pipeline stage
decreased the CPI performance of the processor by 5-6%.
In order to stay on the performance curve, we had to gain
back this loss and, in addition, speed up the machine
further.
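The trade-off described above can be made concrete with a common first-order model, in which the cycles lost to mispredicted branches are added to a base CPI. The function and every number below are hypothetical and chosen only to show the direction of the effects; they are not measurements of either processor.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average CPI including cycles thrown away on mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 10% of them mispredicted.
shorter = effective_cpi(1.0, 0.20, 0.10, penalty_cycles=4)
# One more front-end stage adds one cycle to the misprediction penalty.
longer = effective_cpi(1.0, 0.20, 0.10, penalty_cycles=5)
# Halving the misprediction rate more than recovers the loss.
better_predictor = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=5)
```

In this model a deeper pipeline raises CPI only through the penalty term, so a better predictor attacks the loss directly.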
The Pentium processor with MMX technology’s CPI
performance was increased in three major ways:
1. Improved branch prediction. We implemented a more
advanced branch prediction algorithm that was
developed by the Pentium Pro processor design team.
This algorithm improved the prediction of branches,
which resulted in fewer branch mispredictions
and caused less work to be thrown away. On top of
the Branch Target Buffer (BTB), we also
implemented a Return Stack Buffer (RSB), a
dedicated branch prediction logic for call/return
instructions. The combination of the updated BTB
algorithm and the RSB improved CPI performance by
about 8%. This helped close the performance gap
opened while adding the new pipeline stage and gave
us some advantage over the Pentium processor.
2. Improving core/bus protocols. The original Pentium
processor design was tuned to a 1:1 ratio between the
core and bus clocks. As a result, some
price/performance tradeoffs that were made for a 1:1
clock ratio were not optimal for use when the gap
between the core and bus frequency increased.
Several enhancements were made by the design team
to tune the protocols. Write buffers were combined
into a single pool, thereby allowing both pipes to
share the same hardware, the clock crossover
mechanism was changed, and the DP protocol was
completely redesigned to decouple core and bus
frequencies. These improvements yielded about a 5%
CPI performance gain and simplified the
design and testing (e.g., crossover, DP protocols).
3. Creating larger caches and fully-associative
Translation Lookaside Buffers (TLB). In general,
increasing cache size is the most cost-effective way to
improve performance. The Pentium processor with
MMX technology increased the size of both caches
from 8Kbyte to 16Kbyte and made them four-way
set-associative. Fully-associative TLBs improved CPI
to some extent, making address translation faster than
in the original TLB design. Larger caches and
fully-associative TLBs bought us about a 7-10% CPI
performance improvement.
In summary, by improving the BTB, redesigning the
core/bus protocol, and making larger caches, the Pentium
processor with MMX technology achieved about a 15%
higher CPI performance than the Pentium processor
despite the CPI loss due to the additional pipeline stage.
MMX Technology Implementation
After setting the stage for frequency and CPI performance,
we could incorporate the MMX instructions relatively
straightforwardly.
The instruction decode logic had to be modified to
decode, schedule, and issue the new instructions at a rate
of up to two instructions per clock. The MMX opcodes
are mapped to a 0F prefix, which is rarely used in
previous IA native software. Therefore, decoding of these
instructions in the original Pentium processor design was
slow, with a throughput of two clock cycles per
instruction. The Pentium processor with MMX technology
decoder was redesigned to quadruple the throughput of 0F
instructions, allowing two instructions per cycle
throughput.
Additional modifications were made to the MMX
technology pipeline to incorporate the MMX execute
stage (MEX) and the MMX writeback stage. To improve
the performance of MMX ARITH-MEM instructions, the
integer-execute stage is used as an MMX “read-stage,”
where the source operands as well as the memory
operands are read. As a result, an ARITH-MEM
instruction is executed in a single clock cycle. Since the
Pentium processor with MMX technology may pair an
ARITH-MEM instruction with an ARITH instruction, it is
equivalent to having three execution units (two ARITH,
one LOAD) working in parallel, similar in concept to a
Pentium II processor.
According to the MMX technology architecture definition,
the MMX register file is aliased to the FP mantissa
register file. It was decided to design dedicated hardware
to execute the MMX instructions (the Munit). This unit
has a dedicated MMX register file, capable of delivering
four 64-bit operands and storing three 64-bit results in a
single clock cycle. The Munit also incorporates the MMX
execution units, which were defined and designed as a
module, and which allowed the design to be shared with
the Pentium II processor.
Clean partitioning of the MMX technology design and an
additional pipeline stage in the decoder resulted in no
speed issues associated with the new units. The area
penalty for the Munit was small.
Pentium Processor With MMX Technology
Block Diagram
The block diagram of the Pentium processor with MMX
technology is shown in Figure 5, outlining parts that were
redesigned for speed, CPI, and MMX technology.
Figure 5. Block Diagram of the Pentium Processor
With MMX Technology
(Block diagram. Pipeline stages: Prefetch, Fetch, D1, D2, Execute, Writeback. Block labels: BTB, RSB, code cache 16K, instruction decode and FIFO, length decode, CROM, address calculation and operand read, integer exec, FPU, FP registers, shadow reg., Munit (MMX), Dcache 16K, fully-associative TLBs, page unit, bus unit, IPC.)
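To illustrate what "mapped to a 0F prefix" means for the decoder, the sketch below classifies an instruction by its first opcode bytes. It is a minimal illustration under stated assumptions: legacy prefixes are ignored, only a handful of MMX encodings are listed, and the function and table names are hypothetical. The byte 0F is the escape to the two-byte opcode map, so the decoder has to look at a second byte before it knows which instruction it is handling.

```python
TWO_BYTE_ESCAPE = 0x0F

# A small subset of MMX instructions, keyed by the byte that follows 0F.
MMX_OPCODES = {
    0x6F: "MOVQ",
    0x77: "EMMS",
    0xEF: "PXOR",
    0xF5: "PMADDWD",
    0xFD: "PADDW",
}


def classify(code):
    """Classify an instruction from its leading opcode bytes (no prefixes)."""
    if len(code) >= 2 and code[0] == TWO_BYTE_ESCAPE:
        return MMX_OPCODES.get(code[1], "other 0F-escaped instruction")
    return "one-byte-opcode instruction"


first = classify(bytes([0x0F, 0xFD, 0xC1]))   # PADDW mm0, mm1
second = classify(bytes([0x0F, 0x77]))        # EMMS
third = classify(bytes([0x90]))               # NOP, a one-byte opcode
```

Because every MMX instruction goes through this escape byte, a decoder that handled 0F-escaped opcodes slowly would have throttled all MMX code, which is why the throughput of this path mattered.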
Results
The Pentium processor with MMX technology design
achieved its goals. The processor taped out in late 1995,
and samples were delivered to customers less than a week
after the first silicon. With six months of extensive silicon
debug, we closed the frequency gap with the Pentium
processor and, half a year later, achieved 233MHz in
production, which is one bin above the Pentium
processor’s production frequency.
Figure 6 shows the actual speed improvement of the
Pentium processor and the Pentium processor with MMX
technology versus the anticipated trend.
Figure 6. Actual Versus Anticipated Speed Improvement
Trend
(Plot of normalized frequency, 0.50 to 3.00, by quarter from Q1’95 to Q2’97. Curves: Pentium speedup (actual), PP/MT* speedup (actual; data prior to Q4’96 are pre-production), Pentium speedup trend, and PP/MT speedup trend, shown with the Pentium and PP/MT* architecture limits. *PP/MT: Pentium processor with MMX technology.)
The Pentium processor with MMX technology also met its
CPI goals. Figure 7 shows the CPI performance of the
Pentium processor with MMX technology compared to
the Pentium processor.
Figure 7. CPI Performance of the Pentium Processor with
MMX Technology Compared to the Pentium Processor
(Bar chart, 0% to 30%, for iSpec95, fSpec95, and iSpec92.)
Finally, multimedia applications gained significant
performance from the new instructions. Figure 8 illustrates
the performance gain that can be achieved by several
applications when using the new instructions.
Figure 8. Performance Improvement Using New Instructions
(Bar chart, scale 1 to 4. Category labels: MPEG1, Audio, Image Filter, Modem, 3D Integer Geometry, 3D True Color Shading, Video Conferencing.)
Pentium II Processor Microarchitecture
While the Pentium processor with MMX technology made
microarchitecture changes to improve frequency and
performance as well as implement the MMX technology,
the Pentium II processor improved upon the Pentium Pro
processor’s microarchitecture and brought MMX
technology to a new level of performance. The Pentium II
processor is based on the dynamic execution
microarchitecture of the Pentium Pro processor. Changes
were made in the Pentium II processor’s microarchitecture
to improve graphics performance and to implement MMX
technology. In addition, the entire back-side bus interface
that connects the processor to an off-chip second-level
cache was redesigned to allow low-cost commodity
SRAMs to be used as second-level cache. Doing so
significantly reduced the system cost compared to the
Pentium Pro processor’s Multi-Chip Module (MCM) that
houses the processor as well as the second-level cache. A
higher frequency was achieved through aggressive circuit
techniques and other changes.
Overview
The Pentium II processor is the second Intel
microprocessor to implement MMX technology. The
Pentium II processor’s MMX technology implementation
offers multimedia applications the benefits of
out-of-order execution, aggressive memory speculation, a
superpipelined and superscalar microarchitecture, etc.
These are the same features that the Pentium Pro
microprocessor provides. The Pentium II processor
supports two packed ALU operations, one packed shift,
and one packed multiply operation. Pack and unpack
operations are implemented by the packed shifter. The
Pentium II processor allows packed shift and packed
multiply to be executed concurrently.