Topic: integer multiplications on IA32 architecture.

  • wkwai
  • [*][*][*][*]
  • Developer
integer multiplications on IA32 architecture.
Hi,


I am used to working with assembly language programming on the Pentium processor generation (166 - 200 MHz MMX). I noticed that for operations like int16 and int32 multiplications / divisions, it used to take as long as 20 clock cycles to complete the an instruction execution. However, I noticed that on a Celeron processor (using the VTune 7.0 evaluation kit from Intel's website) it takes only 1 clock cycle to execute. Could anyone verify this? In the past, we would use a combination of shift and add operations to implement integer multiplications / divisions.
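
For readers who have not seen the shift-and-add technique mentioned above, here is a minimal sketch. It is written in Python purely for readability; the function name `shift_add_mul32` and the 32-bit mask are illustrative assumptions, and code of that era would have been hand-written assembly.

```python
MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits, like a register


def shift_add_mul32(a: int, b: int) -> int:
    """Multiply two unsigned 32-bit values using only shifts and adds.

    Returns the low 32 bits of the product.
    """
    a &= MASK32
    b &= MASK32
    result = 0
    while b:
        if b & 1:                      # lowest bit of the multiplier is set:
            result = (result + a) & MASK32  # add the shifted multiplicand
        a = (a << 1) & MASK32          # multiplicand * 2 for the next bit
        b >>= 1                        # move on to the next multiplier bit
    return result


print(shift_add_mul32(12345, 6789) == 12345 * 6789)
```

The loop runs once per significant bit of the multiplier, so small multipliers finish sooner than large ones.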


wkwai

  • NumLOCK
  • [*][*][*][*][*]
  • Developer
integer multiplications on IA32 architecture.
Reply #1
Hi,

Quote
I noticed that for operations like int16 and int32 multiplications / divisions, it used to take as long as 20 clock cycles to complete the an instruction execution.

What is the 'an' instruction ?

edit: ok, if I ignore the 'an':  20 cycles seems way out of line. Ensure your mul instruction doesn't fetch its argument from memory.

Quote
However, I noticed that on a Celeron processor (using the VTune 7.0 evaluation kit from Intel's website) it takes only 1 clock cycle to execute. Could anyone verify this?

If you mean 1-cycle latency for a 32x32-bit 'mul' or 'imul' instruction, it is impossible.
Any x86-compatible processor to date will need at the very least 2 cycles (IIRC) because of the high clock frequency. I think the fastest one was the K6, with 2-cycle latency and 3-cycle execution time for mul/imul.

edit: I think on K6, the 32 lowest bits were available in 2 cycles, and the higher 32 bits were available 1 cycle later.

Quote
In the past, we would use a combination of shift and add operations to implement integer multiplications / divisions.

Yeah.
Nowadays, it's a bit different though: thanks to improved multiplication circuitry it's usually worth using special instructions only for:

- result = n*2^k =>  shl reg, k
- result = 3*n+k => lea reg, [reg+2*reg+k]
- result = 5*n+k => lea reg, [reg+4*reg+k]

In most other cases the multiply will be faster. Plus (depending on your program) you'll avoid saturating the AGU (address generation unit). Also while the mul runs, you can do something else.
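
As a rough illustration of the three special cases listed above, the following Python sketch models what `shl` and `lea` compute. The helper names (`shl`, `lea`, `times3_plus`, `times5_plus`) and the 32-bit truncation are assumptions made for this example; it models only the arithmetic, not the timing.

```python
MASK32 = 0xFFFFFFFF  # results wrap around like a 32-bit register


def shl(n: int, k: int) -> int:
    """Model of `shl reg, k`: n * 2**k, truncated to 32 bits."""
    return (n << k) & MASK32


def lea(base: int, index: int, scale: int, disp: int) -> int:
    """Model of `lea reg, [base + scale*index + disp]`."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 addressing only allows a scale of 1, 2, 4 or 8")
    return (base + scale * index + disp) & MASK32


def times3_plus(n: int, k: int) -> int:
    """3*n + k via lea reg, [reg+2*reg+k]."""
    return lea(n, n, 2, k)


def times5_plus(n: int, k: int) -> int:
    """5*n + k via lea reg, [reg+4*reg+k]."""
    return lea(n, n, 4, k)


print(shl(3, 4), times3_plus(7, 1), times5_plus(7, 2))
```

The same addressing form with a scale of 8 gives 9*n+k.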

Regards
  • Last Edit: 06 August, 2003, 10:17:29 AM by NumLOCK
Try Leeloo Chat at http://leeloo.webhop.net

  • wkwai
  • [*][*][*][*]
  • Developer
integer multiplications on IA32 architecture.
Reply #2
Quote
20 cycles seems way out of line. Ensure your mul instruction doesn't fetch its argument from memory.

I think so. In fact, there are also penalties for mixing byte and int16 instructions with int32 instructions on old Pentium processors. In fact, by using the MMX instructions for integer multiplications, the speedup is about a factor of 100.

However, on a Celeron system, MMX multiplication instructions only speed things up by about a factor of 4.
I think the Celeron, Pentium II and Pentium III are all based on a different architecture.

wkwai

  • Gabriel
  • [*][*][*][*][*]
  • Developer
integer multiplications on IA32 architecture.
Reply #3
P6 architecture: pentium pro, pII, PIII, older celerons
Netburst architecture: p4, newer celerons

  • NumLOCK
  • [*][*][*][*][*]
  • Developer
integer multiplications on IA32 architecture.
Reply #4
Quote
I think so. In fact, there are also penalties for mixing byte and int16 instructions with int32 instructions on old Pentium processors. In fact, by using the MMX instructions for integer multiplications, the speedup is about a factor of 100.

However, on a Celeron system, MMX multiplication instructions only speed things up by about a factor of 4.
I think the Celeron, Pentium II and Pentium III are all based on a different architecture.

wkwai

I think it's too bad that MMX doesn't support 32-bit multiplications.
Also, the inability of MMX instructions to interoperate with x86 registers is a big design flaw in their architecture.  If you want to mix both types of instructions, you have to use useless "MOVD" instructions, which rule out many opportunities to optimize.

Have you seen how well Motorola's Altivec is designed, for instance ?  You can do several 64x64bit multiplies in parallel...

Well after all, Intel is Intel... and stays Intel 

Edit: To be completely impartial, I must admit that MMX still proved useful for me in several 24-bit graphics routines.

By the way, I loved their funny PCKUNMLL and PSKCNNNLXGLCBB mnemonics 
  • Last Edit: 08 August, 2003, 11:34:18 AM by NumLOCK

  • wkwai
  • [*][*][*][*]
  • Developer
integer multiplications on IA32 architecture.
Reply #5
Quote
By the way, I loved their funny PCKUNMLL and PSKCNNNLXGLCBB mnemonics 


I think those instructions do not exist on Celeron and PII systems. For PIII and above, the MMX instructions actually work on 128-bit registers. That is what I noticed in the latest Intel programmer's guide.

I wonder how much performance gain a 64-bit processor has over the IA32 architecture? It seems to me that most of the internal floating point operations of the IA32 architecture are already 64-bit operations???

When using a floating point instruction in IA32, such as fmul, would the instruction load the data 32 bits at a time or 64 bits?

  • Diocletian
  • [*]
integer multiplications on IA32 architecture.
Reply #6
Quote
Quote
By the way, I loved their funny PCKUNMLL and PSKCNNNLXGLCBB mnemonics 


I think those instructions do not exist on Celeron and PII systems. For PIII and above, the MMX instructions actually work on 128-bit registers. That is what I noticed in the latest Intel programmer's guide.

I wonder how much performance gain a 64-bit processor has over the IA32 architecture? It seems to me that most of the internal floating point operations of the IA32 architecture are already 64-bit operations???

When using a floating point instruction in IA32, such as fmul, would the instruction load the data 32 bits at a time or 64 bits?

The 64 bit FMUL instructions have nothing to do with the 64 bit IMUL instructions on IA64
or x86-64. The main advantage of a 64 bit CPU is that it can work with more memory, or with more
fragmented memory:
- you can work with more than 1.5 GB of memory per process
- you don't have to care about virtual address space fragmentation
- you can map files into memory
- you can build sparse structures in memory, so that a lot of the work is done
  in hardware rather than in software
-----------------------------------

The optimization rules which you find in books and in people's heads are typically 10 or more years
old, COMPLETELY out of date, and often able to deoptimize programs.

Evaluating the speed of current CPUs is easier than evaluating the speed of 10-year-old CPUs,
because in modern CPUs decoding and execution are almost completely decoupled.
This was not the case for CPUs like the Pentium, Pentium MMX and AMD K5, where
predicting calculation speed was a pain.

On a modern CPU, the execution time of code whose instructions and data are completely in the L1 cache can
be characterized by two parameters:

- Latency (the time from the input to the output register)
- Throughput (the average time per instruction when executing multiple independent instructions)

Latency/Throughput is typically an integer which can be interpreted as the number of execution
pipelines. The execution time of the mul32 instruction:

- i386: Depending on the number of significant bits in the second operand: 6...37 clocks
- i486: Depending on the number of significant bits in the second operand: 9...40 clocks
- Pentium/Pentium MMX:  11 clocks (fixed)
- K6: 2 clocks , a 3rd clock for the upper 32 bits
- Athlon: 5 clocks (throughput: 2.5 clocks)
  but: operand in memory: 4 clocks (throughput: 2 clocks)
- Pentium II: 4 clocks (throughput: 4 clocks)  (?)
- Pentium 4: 14 clocks (throughput: 5.67 clocks)
  operand in memory: 18 clocks (throughput: 6 clocks)
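
To make the latency/throughput distinction concrete, here is a deliberately simplified Python sketch that estimates the cycle count for a run of multiplies. The function names are illustrative assumptions, the model ignores decoding, scheduling and memory effects, and "throughput" is used in the same sense as in the list above (average clocks per instruction).

```python
def dependent_chain_cycles(count: int, latency: float) -> float:
    """Each multiply needs the previous result, so latencies simply add up."""
    return count * latency


def independent_cycles(count: int, latency: float, throughput: float) -> float:
    """Independent multiplies overlap: one full latency for the first result,
    then one result every `throughput` clocks."""
    if count == 0:
        return 0
    return latency + (count - 1) * throughput


# Figures taken from the list above: (latency, throughput) in clocks.
cpus = {"Athlon": (5, 2.5), "Pentium 4": (14, 5.67)}

for name, (lat, thr) in cpus.items():
    chain = dependent_chain_cycles(100, lat)
    overlapped = independent_cycles(100, lat, thr)
    print(f"{name}: 100 dependent mul = {chain} clocks, "
          f"100 independent mul = {overlapped:.1f} clocks")
```

Under this model, whether a chain of multiplies is latency-bound or throughput-bound depends on how many of them can be kept independent.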

The Pentium 4 is much slower than the Pentium II/III or the K6. Even shl doesn't help, because
it is also very slow:

- shl  reg,n: 4 clocks

Fast indeed is:

- add reg1, reg2:  0.5 clocks

MMX on the Pentium 4 is also slower than on the Pentium MMX/II/III, because there's only
ONE MMX pipeline instead of two. The Pentium 4 is clock speed optimized, not speed optimized. A lot of these changes are to allow high clock speeds. In the first P4 stepping
there were additional serious penalties for misaligned memory accesses which dropped the
speed down to Pentium MMX times.
Diocletian

Time Travel Agency
Book a journey to the Diocletian Palace. Not today!

  • NumLOCK
  • [*][*][*][*][*]
  • Developer
integer multiplications on IA32 architecture.
Reply #7
Quote from: wkwai,Aug 10 2003, 07:45 AM
I think those instructions do not exist on Celeron and PII systems. For PIII and above, the MMX instructions actually work on 128-bit registers. That is what I noticed in the latest Intel programmer's guide.

They don't really exist, I was joking about their habit for strange mnemonics.

I think the 128-bit version would be best called "MMX2".

Quote
I wonder how much performance gain a 64-bit processor has over the IA32 architecture? It seems to me that most of the internal floating point operations of the IA32 architecture are already 64-bit operations???

You're right, there would be little performance gain switching to 64 bits. The real advantage is the addressing.

For x86, much more useful changes would be:
- an extension to raise number of registers (8 regs is ridiculous)
- the possibility to use 3-operand instructions, like on most sane architectures (ie: ADDL source1, source2, destination).

Quote
When using a floating point instruction in IA32, such as fmul, would the instruction load the data 32 bits at a time or 64 bits?

Since most instructions see memory through 32-byte cache lines, the load will be done in one clock (assuming your 64-bit operand is duly aligned).
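
The alignment condition can be checked with a little arithmetic. The sketch below is only an illustration: the function names are made up for this example, and the default of 32 bytes is the cache line size quoted above, not a universal constant.

```python
def is_aligned(address: int, size: int) -> bool:
    """True if an operand of `size` bytes starts on a multiple of its size."""
    return address % size == 0


def crosses_cache_line(address: int, size: int, line_size: int = 32) -> bool:
    """True if the operand's first and last byte fall in different cache lines."""
    first_line = address // line_size
    last_line = (address + size - 1) // line_size
    return first_line != last_line


# Hypothetical addresses of two 64-bit (8-byte) operands.
for addr in (0x1008, 0x101C):
    print(hex(addr), is_aligned(addr, 8), crosses_cache_line(addr, 8))
```

A naturally aligned 8-byte operand can never straddle a 32-byte line, which is why the aligned case is the cheap one.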

  • wkwai
  • [*][*][*][*]
  • Developer
integer multiplications on IA32 architecture.
Reply #8
Thanks. Another question: I have a Celeron 650 MHz. I thought that the Celeron processor was almost identical to the PII? But someone just said that the latest versions of the Celeron processors are based on the new P4 architecture??? I am wondering if mine would support SSE2 instructions.

As for int64 operations, I think there are very limited applications apart from scientific and engineering purposes / memory-intensive applications. Most of the intensive computational requirements are usually for 8- to 16-bit audio-visual data. I hardly use long int types in my programming.

I thought so, the IA32 architecture is already a "hybrid 32-64 bit" processor.


wkwai

  • Lefungus
  • [*][*]
integer multiplications on IA32 architecture.
Reply #9
Celerons based on the P4 = Celerons above 1.6 GHz, so no, your Celeron is just a PII with less L2 cache
It's a 'Jump to Conclusions Mat'. You see, you have this mat, with different CONCLUSIONS written on it that you could JUMP TO.

  • Audible!
  • [*][*][*][*][*]
integer multiplications on IA32 architecture.
Reply #10
Quote
Another question: I have a Celeron 650 MHz. I thought that the Celeron processor was almost identical to the PII? But someone just said that the latest versions of the Celeron processors are based on the new P4 architecture??? I am wondering if mine would support SSE2 instructions.


  The very first Celerons (PPGA, not FCPGA) were PIIs with less L2 cache.
Starting at the 533 MHz clock rate (and going to about 1.4 GHz), the Celerons used the PIII architecture with less L2 cache, meaning SSE (1, not 2).
  This is the type of Celeron you have.

  After 1.4 GHz or so, the Celeron moved to the NetBurst architecture (PIV with less cache, SSE2).

edit: note that there were 500 and 533 MHz PPGA AND FCPGA Celerons, the former being quite easy to spot because of the heat spreader. For more information visit sandpile.org
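
Rather than inferring SSE/SSE2 support from the clock rate, a program can test the feature flags that CPUID function 1 returns in EDX (bit 23 = MMX, bit 25 = SSE, bit 26 = SSE2). The Python sketch below only decodes such a value; the name `decode_features` and the two sample EDX values are constructed for illustration and are not dumps from real processors.

```python
# Bit positions of the relevant feature flags in EDX after CPUID with EAX = 1.
CPUID1_EDX_BITS = {"MMX": 23, "SSE": 25, "SSE2": 26}


def decode_features(edx: int) -> dict:
    """Return which of MMX / SSE / SSE2 are flagged in a CPUID.1:EDX value."""
    return {name: bool((edx >> bit) & 1) for name, bit in CPUID1_EDX_BITS.items()}


# Constructed sample values (assumptions): only the three bits of interest are set.
piii_like = (1 << 23) | (1 << 25)                  # MMX + SSE
netburst_like = (1 << 23) | (1 << 25) | (1 << 26)  # MMX + SSE + SSE2

print(decode_features(piii_like))
print(decode_features(netburst_like))
```

Obtaining the real EDX value requires executing the CPUID instruction itself, for example from a few lines of assembly.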
  • Last Edit: 16 August, 2003, 06:24:44 PM by Audible!

  • CiTay
  • [*][*][*][*][*]
  • Administrator
integer multiplications on IA32 architecture.
Reply #11
Quote
Starting at the 533 MHz clock rate (and going to about 1.4 GHz), the Celerons used the PIII architecture with less L2 cache, meaning SSE (1, not 2).

To make things completely confusing, there were two types of PIII Celerons, the Coppermine- and the Tualatin-based ones.

You can see the various models on this roadmap, including some future CPUs up to Q4/2004...