
MMX™ Microarchitecture of Pentium® Processors With MMX Technology and Pentium® II Microprocessors

Michael Kagan, IDC, Intel Corp.

Simcha Gochman, IDC, Intel Corp.

Doron Orenstien, IDC, Intel Corp.

Derrick Lin, MD6, Intel Corp.

Index Words: MMX™ technology, multimedia applications, IA extensions, Pentium® processor

Abstract

The MMX™ technology is an extension to the Intel Architecture (IA) aimed at boosting the performance of multimedia applications. This technology is the most significant IA extension since the introduction of the Intel386™ microprocessor. The challenge in implementing this technology came from retrofitting the new functionality into existing Pentium® and Pentium® Pro processor designs. Additional changes were introduced in the microarchitecture of the predecessor microprocessor in order to stay on the performance curve, improving the frequency and clocks-per-instruction performance.

Introduction

During the ramp of the Pentium® processor in 1993, it became evident that the home market was becoming a major consumer of PCs, with a major boost coming from multimedia applications.

Traditionally, multimedia applications were supported by expansion hardware with dedicated software, which increased the cost of the machine and lacked common standards. Engineers in the Intel Architecture group envisioned the need to execute multimedia operations on the core CPU. This would establish a standard for the industry, reduce the cost of the system, and free up motherboard expansion slots.

A distinct characteristic of multimedia applications is the execution of the same operation on multiple small-size data items (e.g., 8 and 16 bits). The Single Instruction Multiple Data (SIMD) architecture provides a cost-effective solution for such applications, and therefore it was decided to extend the IA with 57 new MMX™ SIMD-type instructions.

At the time of this decision, two design projects were in their initial development stages: a high-end Pentium processor, and the Pentium® II processor, a Pentium® Pro processor compaction, both based on Intel’s 0.35-micron CMOS process. In order to allow a fast ramp and a top-to-bottom penetration of the new extensions into the PC market, it was decided to incorporate the new instructions in both projects and have them become the flagships of the new architecture extension.

The main challenge was how to incorporate the new instructions while also keeping upcoming products on the Intel performance curve. Both projects had to deliver higher performance than their predecessors on legacy applications, using both frequency gain and CPI (Clocks Per Instruction) microarchitecture improvements.

On the other hand, the new instructions had to be implemented in a cost-effective way, i.e., provide a breakthrough performance boost on multimedia applications while maintaining a reasonably low die size cost. Moreover, the Pentium processor with MMX technology and the Pentium® II processor, being the first microprocessors to implement the new Instruction Set Architecture (ISA), had to deliver superior multimedia performance to demonstrate that the benefit of the ISA extension would be compelling enough for Independent Software Vendors (ISVs) to develop software using these new instructions and fuel the software spiral.

The new instructions operate on packed data types (a single operand represents more than one datum) and use a flat register file that is architecturally aliased to an existing register file of the Floating-Point (FP) stack. This definition allows a variety of implementation alternatives.
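To make the idea of a packed data type concrete, the following minimal sketch treats a 64-bit integer as a vector of equal-width lanes and applies one add to every lane at once. The function names (`pack`, `unpack`, `packed_add_wrap`, `packed_add_saturate_unsigned`) are illustrative choices made here, not names from the MMX instruction set, and the code models only the arithmetic, not the hardware.

```python
def pack(lanes, width):
    """Pack unsigned lanes into one 64-bit integer, lane 0 in the low bits."""
    assert len(lanes) * width == 64
    mask = (1 << width) - 1
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & mask) << (i * width)
    return word


def unpack(word, width):
    """Split a 64-bit integer back into its unsigned lanes."""
    mask = (1 << width) - 1
    return [(word >> shift) & mask for shift in range(0, 64, width)]


def packed_add_wrap(a, b, width):
    """Lane-wise add with wraparound; carries never cross a lane boundary."""
    mask = (1 << width) - 1
    sums = [(x + y) & mask for x, y in zip(unpack(a, width), unpack(b, width))]
    return pack(sums, width)


def packed_add_saturate_unsigned(a, b, width):
    """Lane-wise add that clamps each lane at its maximum value."""
    mask = (1 << width) - 1
    sums = [min(x + y, mask) for x, y in zip(unpack(a, width), unpack(b, width))]
    return pack(sums, width)


pixels = pack([250, 1, 2, 3, 4, 5, 6, 255], 8)   # eight 8-bit data items
offset = pack([10] * 8, 8)
print(unpack(packed_add_wrap(pixels, offset, 8), 8))
print(unpack(packed_add_saturate_unsigned(pixels, offset, 8), 8))
```

The saturating variant is the one that matters for pixel or audio samples, where clamping at the maximum value is preferable to wrapping around to a small value.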


Intel Technology Journal Q3 ’97

At that time, the Pentium and Pentium Pro processors

were both in advanced development stages with a much

more mature database and silicon experience. In order to

stay on the performance curve and catch up on frequency,

we had to set a more aggressive frequency goal than our

predecessors and also improve CPI performance. In the

Pentium processor with MMX technology, this resulted in

restructuring the entire machine by adding one more stage

to the processor main pipeline. The Pentium II processor

design team improved the performance of graphics

applications and achieved a higher frequency through less

aggressive architectural changes.

Both design teams delivered excellent results. The Pentium processor with MMX technology achieved both its CPI and frequency goals. It is 20% higher in frequency (running at 233 MHz in production) and 15% faster on CPI than other Pentium processors. The Pentium II processor significantly improved the performance of graphics code and achieved a 300 MHz frequency at introduction. The speedup goal for multimedia applications was achieved as well. Most applications using the new instructions improved by a factor of 1.6, with some improving by up to a factor of 4.

Pentium Processor With MMX Technology Microarchitecture

In order to exceed the performance of its predecessor, the design team had to improve both the frequency and CPI performance of the microprocessor. Both of these goals could be achieved with microarchitecture changes implemented in the new processor.

Frequency Speedup

Frequency is the most significant factor that determines the performance of a microprocessor and is a major (and sometimes the only) performance indicator used by customers. Therefore, it was not possible to come up with a new product running at a lower frequency than its predecessor.

The frequency of a product asymptotically approaches the architectural limit of the device as escapes are cleaned up and slight design improvements are made in critical paths. Therefore, in order to match a predecessor’s frequency, a product that comes to market later must have a higher architectural frequency limit. Figure 1 illustrates frequency improvement trends for the Pentium processor and Pentium processor with MMX technology.

Figure 1. Frequency Improvement Trends
(Plot of normalized frequency, 0.50 to 3.00, by quarter from Q1/95 to Q2/97. Series: Pentium architecture limit, PP/MT* architecture limit, Pentium speedup trend, and PP/MT* speedup trend; PP/MT data prior to Q4’96 are pre-production. *PP/MT: Pentium processor with MMX technology.)

In order to improve the architectural limit of the device, we had to identify and resolve the major speed bottlenecks of the Pentium processor’s architecture. After a thorough analysis, two major bottlenecks were identified: the decoder and the data cache access. The two bottlenecks were dependent; in other words, resolving one of them would help to speed up the other. We decided to resolve the decoder bottleneck, since it was simpler and less risky, and it would also allow a smooth implementation of MMX instruction decoding. The Pentium processor execution pipeline originally consisted of five pipeline stages: Pre-fetch (PF), Decode1 (D1), Decode2 (D2), Execute (E), and Writeback (WB). We added a pipeline stage in the front end of the machine, rebalanced the entire pipeline to take advantage of the extra clock cycle, and added a queue between the F and D1 stages to decouple freezes, which are the most critical signals generated in every pipeline stage. Figure 2 illustrates the difference between the original Pentium processor pipeline and the MMX technology pipeline.

Figure 2. Pentium Processor and Pentium Processor With MMX Technology Pipeline
(Pentium pipeline, as labeled in the figure: F, D1, D2, E, WB. Pentium with MMX technology pipeline: PF, F, D1, D2, E, WB, with a queue.)

An additional clock cycle in the front end of the pipeline resolved the decoder speed bottleneck and reduced fan-out for the data cache freeze (generated in the E stage), which in turn relaxed a requirement for this freeze signal. This was the first step in the resolution of the data cache bottleneck.

The next step was to improve the timing of the data freeze

signal generated by the data cache. The cache access path

starts with address generation in the D2 stage, followed by

a subsequent cache access in the E stage. The entire path

was redesigned to self-time pipelined execution with time

borrowing between the stages. The address generation

logic was changed, incorporating simplified and faster

adders, thereby allowing faster address generation.

The third step was the cache circuit architecture. It was performance-crucial to execute a single-clock read and write operation in each cache port. As a result, the Pentium processor’s cache access windows were designed to support two access windows per clock, as illustrated in Figure 3.

Figure 3. Pentium Processor’s Cache Array Access Windows
(Timing diagram of the Pentium cache array access windows. Read timing: tag lookup, data read, precharge. Write timing: tag lookup, data write to array, precharge. Cache access windows: data read, data write, precharge.)

Although read and write operations to the same port were never performed in the same cycle, cache timers had to support two access windows, thereby limiting the overall cache access time. On the other hand, since read and write operations never happen in the same clock to the same port, both access windows could never be active in the same clock cycle. In other words, during a read operation, no data access could be performed in a write access window and vice versa. Therefore, we decided to have just one data access window in the front end of the cycle (i.e., with read window timing) and use it for both read and write accesses. The read access works as in other Pentium processors; it is a speculative operation and can be thrown away. A write access depends on the result of a tag lookup and cannot be executed in the same clock in which the tags are looked up. Therefore, the Pentium processor with MMX technology implemented a cache store hit buffer. If a store hit is encountered at the cache lookup phase, the data is stored to this buffer. The actual store to the data array is done at the data access window of the next write operation, while this window is idle. Meanwhile, before the next write, the data can be delivered from the store hit buffer to subsequent reads from this address. In other words, the Pentium processor with MMX technology pipelines write operations with one another. Each time a store is executed, the tag lookup is performed for the current store, while the data array is updated with data from the previous store. This way we could have only one data array access window, which allowed a significant speedup of cache access.

Figure 4 illustrates the cache access windows architecture of the Pentium processor with MMX technology.

Figure 4. Pentium Processor With MMX Technology’s Cache Array Access Windows Architecture
(Timing diagram of the cache array access windows. Read timing: tag lookup, data read, precharge. Write timing: tag lookup, data write to hit buffer, data write to array, precharge, with timing relaxation. Cache access windows: data read/write, precharge.)

The solutions described above resolved major Pentium processor speed paths, allowing a frequency leap. Additional local changes were performed in every functional block to keep all the rest of the circuitry in line with this new goal.

In summary, the Pentium processor with MMX technology designers addressed two major bottlenecks at a global architecture level (adding a pipeline stage and rebalancing the entire machine), made a few changes at the intermediate level (time borrowing between pipe stages for a specific operation), and implemented numerous local changes to keep the machine balanced. This top-down approach allowed us to achieve a 20% frequency boost over the original Pentium processor design.
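The behavior of the cache store hit buffer discussed in this section can be sketched with a small behavioral model. This is a toy illustration under stated assumptions, not the actual circuit: the class name `StoreHitBufferCache`, the one-entry buffer depth, and the use of a dictionary for both tags and data are choices made here for brevity, and store misses are simply reported rather than handled.

```python
class StoreHitBufferCache:
    """Toy model of a data cache with a one-entry store hit buffer."""

    def __init__(self, resident):
        # `resident` maps address -> data for lines already in the cache.
        self.array = dict(resident)   # data array (its keys stand in for the tags)
        self.pending = None           # store hit buffer: (address, data) or None
        self.array_writes = []        # order in which the data array was updated

    def _drain(self):
        """Use the single data access window to retire the buffered store."""
        if self.pending is not None:
            addr, data = self.pending
            self.array[addr] = data
            self.array_writes.append(addr)
            self.pending = None

    def store(self, addr, data):
        """Tag lookup for this store; the data window writes the previous one."""
        hit = addr in self.array      # tag lookup for the current store
        self._drain()                 # data array gets the previous store
        if hit:
            self.pending = (addr, data)
        return hit                    # miss handling is outside this sketch

    def load(self, addr):
        """Reads occupy the data window themselves, so nothing is drained."""
        if self.pending is not None and self.pending[0] == addr:
            return self.pending[1]    # forwarded from the store hit buffer
        return self.array.get(addr)


cache = StoreHitBufferCache({0x10: 1, 0x20: 2})
cache.store(0x10, 99)          # hit: data waits in the buffer
print(cache.array[0x10])       # array still holds the old value
print(cache.load(0x10))        # the read is served from the buffer
cache.store(0x20, 77)          # next store retires the first one
print(cache.array[0x10])
```

The point of the model is the ordering: the array write for one store happens during the next store, so a single data access window per clock is enough, and reads in between must check the buffer first.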

CPI Performance

Although adding a pipeline stage improves frequency, it decreases CPI performance: the longer the pipeline, the more work the machine does speculatively, and therefore the more work is thrown away on a branch misprediction. The additional pipeline stage decreased the CPI performance of the processor by 5-6%. In order to stay on the performance curve, we had to gain back this loss and, in addition, speed up the machine further.
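The link between pipeline depth and CPI can be illustrated with a standard first-order model in which every mispredicted branch costs a fixed number of extra clocks. The sketch below is an assumption-laden illustration: the function names and all input values (base CPI, branch fraction, misprediction rate, penalty) are hypothetical and are not taken from the processor or from measurements. Note that here `cpi` is the literal clocks-per-instruction count, so a larger value means lower performance.

```python
def cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average clocks per instruction, including misprediction stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def cpi_increase_from_extra_stage(base_cpi, branch_fraction, mispredict_rate,
                                  penalty_cycles):
    """Relative CPI increase when the misprediction penalty grows by one clock."""
    before = cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles)
    after = cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles + 1)
    return after / before - 1.0


# Hypothetical workload: 20% branches, 3-clock penalty before the extra stage.
weak_predictor = cpi_increase_from_extra_stage(1.0, 0.20, 0.20, 3)
strong_predictor = cpi_increase_from_extra_stage(1.0, 0.20, 0.10, 3)
print(f"extra-stage cost with 20% mispredictions: {weak_predictor:.1%}")
print(f"extra-stage cost with 10% mispredictions: {strong_predictor:.1%}")
```

In this model the cost of the extra stage scales with the misprediction rate, which is why better branch prediction is a natural way to win back the loss from a longer pipeline.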

The Pentium processor with MMX technology’s CPI

performance was increased in three major ways:

1. Improved branch prediction. We implemented a more advanced branch prediction algorithm that was developed by the Pentium Pro processor design team. This algorithm improved the prediction of branches, which resulted in fewer branch mispredictions and caused less work to be thrown away. On top of the Branch Target Buffer (BTB), we also implemented a Return Stack Buffer (RSB), a dedicated branch prediction logic for call/return instructions. The combination of the updated BTB algorithm and the RSB improved CPI performance by about 8%. This helped close the performance gap opened while adding the new pipeline stage and gave us some advantage over the Pentium processor.

2. Improving core/bus protocols. The original Pentium processor design was tuned to a 1:1 ratio between the core and bus clocks. As a result, some price/performance tradeoffs that were made for a 1:1 clock ratio were not optimal once the gap between the core and bus frequencies increased. Several enhancements were made by the design team to tune the protocols. Write buffers were combined into a single pool, thereby allowing both pipes to share the same hardware; the clock crossover mechanism was changed; and the DP protocol was completely redesigned to decouple core and bus frequencies. These improvements yielded about a 5% CPI performance gain and simplified the design and testing (e.g., crossover, DP protocols).

3. Creating larger caches and fully-associative Translation Lookaside Buffers (TLBs). In general, increasing cache size is the most cost-effective way to improve performance. The Pentium processor with MMX technology increased the size of both caches from 8 Kbytes to 16 Kbytes and made them four-way set-associative. Fully-associative TLBs improved CPI to some extent, making address translation faster than in the original TLB design. Larger caches and fully-associative TLBs bought us about a 7-10% CPI performance improvement.

In summary, by improving the BTB, redesigning the core/bus protocol, and making larger caches, the Pentium processor with MMX technology achieved about a 15% higher CPI performance than the Pentium processor despite the CPI loss due to the additional pipeline stage.

MMX Technology Implementation

After setting the stage for frequency and CPI performance, we could incorporate the MMX instructions in a relatively straightforward way.

The instruction decode logic had to be modified to decode, schedule, and issue the new instructions at a rate of up to two instructions per clock. The MMX opcodes are mapped to a 0F prefix, which was rarely used in previous IA native software. Therefore, decoding of these instructions in the original Pentium processor design was slow, with a throughput of two clock cycles per instruction. The Pentium processor with MMX technology decoder was redesigned to quadruple the throughput of 0F instructions, allowing a throughput of two instructions per cycle.

Additional modifications were made to the MMX technology pipeline to incorporate the MMX execute stage (MEX) and the MMX writeback stage. To improve the performance of MMX ARITH-MEM instructions, the integer-execute stage is used as an MMX “read-stage,” where the source operands as well as the memory operands are read. As a result, an ARITH-MEM instruction is executed in a single clock cycle. Since the Pentium processor with MMX technology may pair an ARITH-MEM instruction with an ARITH instruction, it is equivalent to having three execution units (two ARITH, one LOAD) working in parallel, similar in concept to a Pentium II processor.

According to the MMX technology architecture definition, the MMX register file is aliased to the FP mantissa register file. It was decided to design dedicated hardware to execute the MMX instructions (the Munit). This unit has a dedicated MMX register file, capable of delivering four 64-bit operands and storing three 64-bit results in a single clock cycle. The Munit also incorporates the MMX execution units, which were defined and designed as a module, and which allowed the design to be shared with the Pentium II processor.

Clean partitioning of the MMX technology design and an additional pipeline stage in the decoder resulted in no speed issues associated with the new units. The area penalty for the Munit was small.

Pentium Processor With MMX Technology Block Diagram

The block diagram of the Pentium processor with MMX technology is shown in Figure 5, outlining parts that were redesigned for speed, CPI, and MMX technology.

Figure 5. Block Diagram of the Pentium Processor With MMX Technology
(Pipeline stages across the top: Prefetch, Fetch, D1, D2, Execute, Writeback. Block labels: Code cache 16K, TLB (fully associative), BTB, RSB, Len. decod, Instr. decod and FIFO, CROM, Adr. calc and op. read, Integer exec, FPU, FP registers, Munit, Shadow reg., MMX, Dcache 16K, TLB (fully associative), Page unit, Bus unit, IPC.)
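As a rough cross-check of the figures quoted in the CPI Performance section, the sketch below multiplies the individual contributions together. Treating the percentages as independent multiplicative factors is an assumption made here for illustration; the text does not say how they combine. The same assumption is applied to the overall speedup on legacy code, taking performance to scale with frequency times CPI performance.

```python
from math import prod


def combined(factors):
    """Multiply relative performance factors (1.00 means no change)."""
    return prod(factors)


# Percentages quoted in the text, written as (low, high) multiplicative factors.
PIPELINE_STAGE = (1 - 0.06, 1 - 0.05)    # extra pipeline stage: 5-6% loss
BRANCH_PREDICTION = (1.08, 1.08)         # updated BTB algorithm plus RSB: about 8%
CORE_BUS_PROTOCOLS = (1.05, 1.05)        # core/bus protocol tuning: about 5%
CACHES_AND_TLBS = (1.07, 1.10)           # larger caches, fully-associative TLBs: 7-10%

CONTRIBUTIONS = [PIPELINE_STAGE, BRANCH_PREDICTION, CORE_BUS_PROTOCOLS, CACHES_AND_TLBS]

cpi_low = combined(low for low, _ in CONTRIBUTIONS)
cpi_high = combined(high for _, high in CONTRIBUTIONS)

# 20% higher frequency and 15% better CPI performance, as stated in the text.
overall = 1.20 * 1.15

print(f"net CPI gain: {cpi_low - 1:.1%} to {cpi_high - 1:.1%}")   # 14.1% to 18.5%
print(f"frequency x CPI: {overall:.2f}x")                         # 1.38x
```

Under this assumption the individual contributions bracket the quoted net figure of about 15%, and the frequency and CPI gains together suggest roughly a 1.4x speedup on legacy code.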

Results

The Pentium processor with MMX technology design achieved its goals. The processor taped out in late 1995, and samples were delivered to customers less than a week after the first silicon. With six months of extensive silicon debug, we closed the frequency gap with the Pentium processor and, half a year later, achieved 233 MHz in production, which is one bin above the Pentium processor’s production frequency.

Figure 6 shows the actual speed improvement of the Pentium processor and the Pentium processor with MMX technology versus the anticipated trend.

Figure 6. Actual Versus Anticipated Speed Improvement Trend
(Plot of normalized frequency, 0.50 to 3.00, by quarter from Q1/95 to Q2/97. Series: Pentium architecture limit, PP/MT* architecture limit, Pentium speedup (actual), PP/MT* speedup (actual; data prior to Q4’96 are pre-production), Pentium speedup trend, and PP/MT speedup trend. *PP/MT: Pentium processor with MMX technology.)

The Pentium processor with MMX technology also met its CPI goals. Figure 7 shows the CPI performance of the Pentium processor with MMX technology compared to the Pentium processor.

Figure 7. CPI Performance of the Pentium Processor with MMX Technology Compared to the Pentium Processor
(Bar chart; vertical axis from 0% to 30%; benchmark labels iSpec95, fSpec95, and iSpec92.)

Finally, multimedia applications gained significant performance using the new instructions. Figure 8 illustrates the performance gain that can be achieved by several applications when using the new instructions.

Figure 8. Performance Improvement Using New Instructions
(Bar chart; vertical axis from 1 to 4; category labels in extracted order: MPEG1, Audio, Image Filter, Modem, 3D Integer Geometry, 3D True Color Shading, Video Conferencing.)

Pentium II Processor Microarchitecture

While the Pentium processor with MMX technology made microarchitecture changes to improve frequency and performance as well as implement the MMX technology, the Pentium II processor improved upon the Pentium Pro processor’s microarchitecture and brought MMX technology to a new level of performance. The Pentium II processor is based on the dynamic execution microarchitecture of the Pentium Pro processor. Changes were made in the Pentium II processor’s microarchitecture to improve graphics performance and to implement MMX technology. In addition, the entire back-side bus interface that connects the processor to an off-chip second-level cache was redesigned to allow low-cost commodity SRAMs to be used as second-level cache. Doing so significantly reduced the system cost compared to the Pentium Pro processor’s Multi-Chip Module (MCM) that houses the processor as well as the second-level cache. A higher frequency was achieved through aggressive circuit techniques and other changes.

Overview

The Pentium II processor is the second Intel microprocessor to implement MMX technology. The Pentium II processor’s MMX technology implementation offers multimedia applications the benefits of out-of-order execution, aggressive memory speculation, and a superpipelined and superscalar microarchitecture, among others. These are the same features that the Pentium Pro microprocessor provides. The Pentium II processor supports two packed ALU operations, one packed shift, and one packed multiply operation. Pack and unpack operations are implemented by the packed shifter. The Pentium II processor allows packed shift and packed multiply to be executed concurrently.
