Intel SSE [Archive] - Hardware Upgrade Forum

checo

29-09-2009, 08:26

sse

addps - Adds 4 single-precision (32bit) floating-point values to 4 other single-precision floating-point values.
addss - Adds the lowest single-precision values, top 3 remain unchanged.
subps - Subtracts 4 single-precision floating-point values from 4 other single-precision floating-point values.
subss - Subtracts the lowest single-precision values, top 3 remain unchanged.
mulps - Multiplies 4 single-precision floating-point values with 4 other single-precision values.
mulss - Multiplies the lowest single-precision values, top 3 remain unchanged.
divps - Divides 4 single-precision floating-point values by 4 other single-precision floating-point values.
divss - Divides the lowest single-precision values, top 3 remain unchanged.
rcpps - Reciprocates (1/x) 4 single-precision floating-point values.
rcpss - Reciprocates the lowest single-precision values, top 3 remain unchanged.
sqrtps - Square root of 4 single-precision values.
sqrtss - Square root of lowest value, top 3 remain unchanged.
rsqrtps - Reciprocal square root of 4 single-precision floating-point values.
rsqrtss - Reciprocal square root of lowest single-precision value, top 3 remain unchanged.
maxps - Returns maximum of 2 values in each of 4 single-precision values.
maxss - Returns maximum of 2 values in the lowest single-precision value. Top 3 remain unchanged.
minps - Returns minimum of 2 values in each of 4 single-precision values.
minss - Returns minimum of 2 values in the lowest single-precision value, top 3 remain unchanged.
pavgb - Returns average of 2 values in each of 8 bytes.
pavgw - Returns average of 2 values in each of 4 words.
psadbw - Returns sum of absolute differences of 8 8bit values. Result in bottom 16 bits.
pextrw - Extracts 1 of 4 words.
pinsrw - Inserts 1 of 4 words.
pmaxsw - Returns maximum of 2 values in each of 4 signed word values.
pmaxub - Returns maximum of 2 values in each of 8 unsigned byte values.
pminsw - Returns minimum of 2 values in each of 4 signed word values.
pminub - Returns minimum of 2 values in each of 8 unsigned byte values.
pmovmskb - builds mask byte from top bit of 8 byte values.
pmulhuw - Multiplies 4 unsigned word values and stores the high 16bit result.
pshufw - Shuffles 4 word values. Takes 2 64bit MMX values (source and dest) and an 8-bit immediate value, and then fills in each Dest 16-bit word from a Source 16-bit word specified by the immediate. The immediate byte is broken into 4 2-bit values.
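
A minimal sketch of the packed (ps) versus scalar (ss) difference described in the list above, using the compiler intrinsics from xmmintrin.h rather than hand-written assembly. The helper names add_packed and add_scalar are made up for illustration, and the compiler is free to pick the exact instructions it emits.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed add (addps): all four lanes are summed. */
static void add_packed(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);   /* unaligned load, as with movups */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Scalar add (addss): only lane 0 is summed, lanes 1..3 are copied from a. */
static void add_scalar(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_packed gives {11, 22, 33, 44} while add_scalar gives {11, 2, 3, 4}.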

Logic:
andnps - Logically ANDs 4 single-precision values with the logical inverse (NOT) of 4 other single-precision values.
andps - Logically ANDs 4 single-precision values with 4 other single-precision values.
orps - Logically ORs 4 single-precision values with 4 other single-precision values.
xorps - Logically XORs 4 single-precision values with 4 other single-precision values.

Compare:
cmpxxps - Compares 4 single-precision values.
cmpxxss - Compares lowest 2 single-precision values.
comiss - Compares lowest 2 single-precision values and stores result in EFLAGS.
ucomiss - Compares lowest 2 single-precision values and stores result in EFLAGS. (QNaNs don't throw exceptions with ucomiss, unlike comiss.)
Compare Codes (the xx parts above):
eq - Equal to.
lt - Less than.
le - Less than or equal to.
ne - Not equal.
nlt - Not less than.
nle - Not less than or equal to.
ord - Ordered.
unord - Unordered.
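
The packed compares do not set flags: each lane becomes all ones (true) or all zeros (false), and that mask is then combined with the logic instructions. A small sketch with intrinsics, assuming the "lt" code; the function names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns a 4-bit mask: bit i is set when a[i] < b[i]. */
static int less_than_mask(const float a[4], const float b[4])
{
    __m128 m = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); /* cmpltps */
    return _mm_movemask_ps(m);                                  /* movmskps */
}

/* Branch-free select: out[i] = (a[i] < b[i]) ? a[i] : b[i]. */
static void select_smaller(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 m  = _mm_cmplt_ps(va, vb);            /* all ones where a < b */
    __m128 r  = _mm_or_ps(_mm_and_ps(m, va),     /* andps: keep a where true */
                          _mm_andnot_ps(m, vb)); /* andnps: keep b elsewhere */
    _mm_storeu_ps(out, r);
}
```

For a = {1, 5, 3, 9} and b = {2, 4, 3, 10} the mask is 9 (lanes 0 and 3) and the selected values are {1, 4, 3, 9}.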

Conversion:
cvtpi2ps - Converts 2 32bit integers to 32bit floating-point values. Top 2 values remain unchanged.
cvtps2pi - Converts 2 32bit floating-point values to 32bit integers.
cvtsi2ss - Converts 1 32bit integer to 32bit floating-point value. Top 3 values remain unchanged.
cvtss2si - Converts 1 32bit floating-point value to 32bit integer.
cvttps2pi - Converts 2 32bit floating-point values to 32bit integers using truncation.
cvttss2si - Converts 1 32bit floating-point value to 32bit integer using truncation.
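
To make the difference between the plain and the "tt" (truncating) conversions concrete, here is a short sketch with intrinsics. It assumes the MXCSR rounding mode is still the default (round to nearest, ties to even); the wrapper names are invented for the example.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* cvtss2si: rounds according to the current MXCSR rounding mode. */
static int convert_rounding(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* cvttss2si: always truncates toward zero, whatever the rounding mode. */
static int convert_truncating(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

With the default mode, 2.7f converts to 3 with cvtss2si and to 2 with cvttss2si, which is why C-style casts are normally compiled to the truncating form.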

State:
fxrstor - Restores FP and SSE State.
fxsave - Stores FP and SSE State.
ldmxcsr - Loads the mxcsr register.
stmxcsr - Stores the mxcsr register.

Load/Store:
movaps - Moves a 128bit value.
movhlps - Moves the high half of the source to the low half of the destination.
movlhps - Moves the low half of the source to the high half of the destination.
movhps - Moves 64bit value into top half of an xmm register.
movlps - Moves 64bit value into bottom half of an xmm register.
movmskps - Moves top bits of single-precision values into bottom 4 bits of a 32bit register.
movss - Moves the bottom single-precision value. Top 3 remain unchanged if the source is another xmm register; when loading from memory they're set to zero.
movups - Moves a 128bit value. Address can be unaligned.
maskmovq - Moves a 64bit value according to a mask.
movntps - Moves a 128bit value directly to memory, skipping the cache. (NT stands for "Non Temporal".)
movntq - Moves a 64bit value directly to memory, skipping the cache.
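
The two half-register moves are easy to mix up, so here is a small sketch of what they do, written with intrinsics. The wrapper names are hypothetical; only the intrinsics themselves come from xmmintrin.h.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* movhlps: low half of result = high half of b, high half stays from a. */
static void move_high_to_low(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_movehl_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* movlhps: high half of result = low half of b, low half stays from a. */
static void move_low_to_high(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_movelh_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, movhlps yields {7, 8, 3, 4} and movlhps yields {1, 2, 5, 6}.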

Shuffling:
shufps - Shuffles 4 single-precision values. Complex.
unpckhps - Unpacks single-precision values from high halves.
unpcklps - Unpacks single-precision values from low halves.
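
Since shufps is only described as "complex", a short sketch may help. The 8-bit immediate holds four 2-bit lane indexes: the two low result lanes are picked from the first operand, the two high result lanes from the second. The example passes the same register twice to reverse it, and also shows unpcklps. Function names are made up for illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* shufps with both operands equal: out = {in[3], in[2], in[1], in[0]}. */
static void reverse4(const float in[4], float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    /* _MM_SHUFFLE lists the source index for lanes 3, 2, 1, 0 in that order. */
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* unpcklps: interleaves the low halves, out = {a[0], b[0], a[1], b[1]}. */
static void interleave_low(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_unpacklo_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```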

Cache Control:
prefetchT0 - Fetches a cache-line of data into all levels of cache.
prefetchT1 - Fetches a cache-line of data into all levels of cache except the first (L1).
prefetchT2 - Fetches a cache-line of data into all levels of cache except the first two (L1 and L2).
prefetchNTA - Fetches data close to the processor while minimizing pollution of the other cache levels (non-temporal hint).
sfence - Guarantees that all memory writes issued before the sfence instruction are completed before any writes after the sfence instruction.

sse2
Arithmetic:
addpd - Adds 2 64bit doubles.
addsd - Adds bottom 64bit doubles.
subpd - Subtracts 2 64bit doubles.
subsd - Subtracts bottom 64bit doubles.
mulpd - Multiplies 2 64bit doubles.
mulsd - Multiplies bottom 64bit doubles.
divpd - Divides 2 64bit doubles.
divsd - Divides bottom 64bit doubles.
maxpd - Gets largest of 2 64bit doubles for 2 sets.
maxsd - Gets largest of 2 64bit doubles for bottom set.
minpd - Gets smallest of 2 64bit doubles for 2 sets.
minsd - Gets smallest of 2 64bit values for bottom set.
paddb - Adds 16 8bit integers.
paddw - Adds 8 16bit integers.
paddd - Adds 4 32bit integers.
paddq - Adds 2 64bit integers.
paddsb - Adds 16 8bit integers with saturation.
paddsw - Adds 8 16bit integers using saturation.
paddusb - Adds 16 8bit unsigned integers using saturation.
paddusw - Adds 8 16bit unsigned integers using saturation.
psubb - Subtracts 16 8bit integers.
psubw - Subtracts 8 16bit integers.
psubd - Subtracts 4 32bit integers.
psubq - Subtracts 2 64bit integers.
psubsb - Subtracts 16 8bit integers using saturation.
psubsw - Subtracts 8 16bit integers using saturation.
psubusb - Subtracts 16 8bit unsigned integers using saturation.
psubusw - Subtracts 8 16bit unsigned integers using saturation.
pmaddwd - Multiplies 16bit integers into 32bit results and adds adjacent pairs of results.
pmulhw - Multiplies 16bit integers and returns the high 16bits of the result.
pmullw - Multiplies 16bit integers and returns the low 16bits of the result.
pmuludq - Multiplies 2 32bit pairs and stores 2 64bit results.
rcpps - Approximates the reciprocal of 4 32bit singles.
rcpss - Approximates the reciprocal of bottom 32bit single.
sqrtpd - Returns square root of 2 64bit doubles.
sqrtsd - Returns square root of bottom 64bit double.
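
The "with saturation" variants clamp at the limits of the type instead of wrapping around. A minimal sketch comparing paddb and paddusb through intrinsics; the helper names are assumptions made for the example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* paddb: each of the 16 bytes wraps around modulo 256. */
static void add_bytes_wrap(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* paddusb: each of the 16 unsigned bytes is clamped to 255. */
static void add_bytes_saturate(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}
```

Adding 200 and 100 gives 44 with paddb (300 modulo 256) and 255 with paddusb.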

Logic:
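andnpd - Logically ANDs the inverse (NOT) of the first operand with the second, for 2 64bit doubles.
andnps - Logically ANDs the inverse (NOT) of the first operand with the second, for 4 32bit singles.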
andpd - Logically ANDs 2 64bit doubles.
pand - Logically ANDs 2 128bit registers.
pandn - Logically Inverts the first 128bit operand and ANDs with the second.
por - Logically ORs 2 128bit registers.
pslldq - Logically left shifts 1 128bit value (by whole bytes).
psllq - Logically left shifts 2 64bit values.
pslld - Logically left shifts 4 32bit values.
psllw - Logically left shifts 8 16bit values.
psrad - Arithmetically right shifts 4 32bit values.
psraw - Arithmetically right shifts 8 16bit values.
psrldq - Logically right shifts 1 128bit value (by whole bytes).
psrlq - Logically right shifts 2 64bit values.
psrld - Logically right shifts 4 32bit values.
psrlw - Logically right shifts 8 16bit values.
pxor - Logically XORs 2 128bit registers.
orpd - Logically ORs 2 64bit doubles.
xorpd - Logically XORs 2 64bit doubles.
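
Arithmetic and logical right shifts only differ for negative inputs: psrad copies the sign bit in from the top, psrld shifts in zeros. A quick sketch with intrinsics that shifts by one bit and reads back lane 0; the function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* psrad: arithmetic shift right by 1, the sign bit is replicated. */
static int32_t shift_right_arithmetic(int32_t x)
{
    __m128i v = _mm_srai_epi32(_mm_set1_epi32(x), 1);
    return _mm_cvtsi128_si32(v);            /* read lane 0 */
}

/* psrld: logical shift right by 1, a zero enters from the top. */
static uint32_t shift_right_logical(int32_t x)
{
    __m128i v = _mm_srli_epi32(_mm_set1_epi32(x), 1);
    return (uint32_t)_mm_cvtsi128_si32(v);  /* read lane 0 */
}
```

For x = -8 the arithmetic shift returns -4, while the logical shift returns 0x7FFFFFFC.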

Compare:
cmpxxpd - Compares 2 pairs of 64bit doubles.
cmpxxsd - Compares bottom 64bit doubles.
comisd - Compares bottom 64bit doubles and stores result in EFLAGS.
ucomisd - Compares bottom 64bit doubles and stores result in EFLAGS. (QNaNs don't throw exceptions with ucomisd, unlike comisd.)
pcmpxxb - Compares 16 8bit integers (xx is eq or gt only).
pcmpxxw - Compares 8 16bit integers (xx is eq or gt only).
pcmpxxd - Compares 4 32bit integers (xx is eq or gt only).
Compare Codes (the xx parts of cmpxxpd/cmpxxsd above):
eq - Equal to.
lt - Less than.
le - Less than or equal to.
ne - Not equal.
nlt - Not less than.
nle - Not less than or equal to.
ord - Ordered.
unord - Unordered.

Conversion:
cvtdq2pd - Converts 2 32bit integers into 2 64bit doubles.
cvtdq2ps - Converts 4 32bit integers into 4 32bit singles.
cvtpd2pi - Converts 2 64bit doubles into 2 32bit integers in an MMX register.
cvtpd2dq - Converts 2 64bit doubles into 2 32bit integers in the bottom of an XMM register.
cvtpd2ps - Converts 2 64bit doubles into 2 32bit singles in the bottom of an XMM register.
cvtpi2pd - Converts 2 32bit integers from an MMX register into 2 64bit doubles in an XMM register.
cvtps2dq - Converts 4 32bit singles into 4 32bit integers.
cvtps2pd - Converts 2 32bit singles into 2 64bit doubles.
cvtsd2si - Converts 1 64bit double to a 32bit integer in a GPR.
cvtsd2ss - Converts bottom 64bit double to a bottom 32bit single. Tops are unchanged.
cvtsi2sd - Converts a 32bit integer to the bottom 64bit double.
cvtsi2ss - Converts a 32bit integer to the bottom 32bit single.
cvtss2sd - Converts bottom 32bit single to bottom 64bit double.
cvtss2si - Converts bottom 32bit single to a 32bit integer in a GPR.
cvttpd2pi - Converts 2 64bit doubles to 2 32bit integers using truncation into an MMX register.
cvttpd2dq - Converts 2 64bit doubles to 2 32bit integers using truncation.
cvttps2dq - Converts 4 32bit singles to 4 32bit integers using truncation.
cvttps2pi - Converts 2 32bit singles to 2 32bit integers using truncation into an MMX register.
cvttsd2si - Converts a 64bit double to a 32bit integer using truncation into a GPR.
cvttss2si - Converts a 32bit single to a 32bit integer using truncation into a GPR.

Load/Store:
(is "minimize cache pollution" the same as "without using cache"??)
movq - Moves a 64bit value, clearing the top 64bits of an XMM register.
movsd - Moves a 64bit double, leaving tops unchanged if move is between two XMM registers.
movapd - Moves 2 aligned 64bit doubles.
movupd - Moves 2 unaligned 64bit doubles.
movhpd - Moves top 64bit value to or from an XMM register.
movlpd - Moves bottom 64bit value to or from an XMM register.
movdq2q - Moves bottom 64bit value into an MMX register.
movq2dq - Moves an MMX register value to the bottom of an XMM register. Top is cleared to zero.
movntpd - Moves a 128bit value to memory without using the cache. NT is "Non Temporal."
movntdq - Moves a 128bit value to memory without using the cache.
movnti - Moves a 32bit value without using the cache.
maskmovdqu - Moves 16 bytes based on sign bits of another XMM register.
pmovmskb - Generates a 16bit Mask from the sign bits of each byte in an XMM register.
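
A typical use of pmovmskb is together with pcmpeqb from the Compare list: compare 16 bytes at once, then collapse the result into a 16bit mask that ordinary integer code can test. A minimal sketch with intrinsics; find_byte16 is a made-up name and the plain loop over the mask is kept simple on purpose.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Returns the index of the first byte equal to wanted in a 16-byte block,
   or -1 when there is no match. */
static int find_byte16(const unsigned char block[16], unsigned char wanted)
{
    __m128i data  = _mm_loadu_si128((const __m128i *)block);
    __m128i match = _mm_cmpeq_epi8(data, _mm_set1_epi8((char)wanted)); /* pcmpeqb */
    int mask = _mm_movemask_epi8(match);                                /* pmovmskb */

    for (int i = 0; i < 16; i++)      /* bit i of mask belongs to byte i */
        if (mask & (1 << i))
            return i;
    return -1;
}
```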

Shuffling:
pshufd - Shuffles 32bit values in a complex way.
pshufhw - Shuffles high 16bit values in a complex way.
pshuflw - Shuffles low 16bit values in a complex way.
unpckhpd - Unpacks and interleaves top 64bit doubles from 2 128bit sources into 1.
unpcklpd - Unpacks and interleaves bottom 64bit doubles from 2 128 bit sources into 1.
punpckhbw - Unpacks and interleaves top 8 8bit integers from 2 128bit sources into 1.
punpckhwd - Unpacks and interleaves top 4 16bit integers from 2 128bit sources into 1.
punpckhdq - Unpacks and interleaves top 2 32bit integers from 2 128bit sources into 1.
punpckhqdq - Unpacks and interleaves top 64bit integers from 2 128bit sources into 1.
punpcklbw - Unpacks and interleaves bottom 8 8bit integers from 2 128bit sources into 1.
punpcklwd - Unpacks and interleaves bottom 4 16bit integers from 2 128bit sources into 1.
punpckldq - Unpacks and interleaves bottom 2 32bit integers from 2 128bit sources into 1.
punpcklqdq - Unpacks and interleaves bottom 64bit integers from 2 128bit sources into 1.
packssdw - Packs 32bit integers to 16bit integers using saturation.
packsswb - Packs 16bit integers to 8bit integers using saturation.
packuswb - Packs 16bit integers to 8bit unsigned integers using saturation.
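
Unpack and pack are often used as a pair: interleaving bytes with zero widens them to 16bit, the work is done at 16bit, and packuswb narrows back with clamping. A sketch under the assumption that the offset is small enough (roughly -255..255) that the 16bit sum cannot overflow; the function name is invented.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds a signed offset to 8 unsigned bytes, clamping results to 0..255. */
static void add_offset_clamped(const uint8_t in[8], int16_t offset, uint8_t out[8])
{
    __m128i zero   = _mm_setzero_si128();
    __m128i bytes  = _mm_loadl_epi64((const __m128i *)in);   /* movq: load 8 bytes */
    __m128i words  = _mm_unpacklo_epi8(bytes, zero);         /* punpcklbw: 8bit -> 16bit */
    words = _mm_add_epi16(words, _mm_set1_epi16(offset));    /* paddw */
    __m128i packed = _mm_packus_epi16(words, words);         /* packuswb: clamp to 0..255 */
    _mm_storel_epi64((__m128i *)out, packed);                /* movq: store 8 bytes */
}
```

For example, an input byte of 200 with offset 100 becomes 255, and an input byte of 50 with offset -100 becomes 0.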

Cache Control:
clflush - Flushes a Cache Line from all levels of cache.
lfence - Guarantees that all memory loads issued before the lfence instruction are completed before any loads after the lfence instruction.
mfence - Guarantees that all memory reads and writes issued before the mfence instruction are completed before any reads or writes after the mfence instruction.
pause - Hint that the code is in a spin-wait loop; the processor inserts a short, implementation-specific delay.

sse3
addsubpd - Adds the top doubles and subtracts the bottom doubles.
addsubps - Adds top singles and subtracts bottom singles.
haddpd - Bottom double is sum of first operand's top and bottom, top double is sum of second operand's top and bottom.
haddps - Horizontal addition of single-precision values.
hsubpd - Horizontal subtraction of double-precision values.
hsubps - Horizontal subtraction of single-precision values.
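
"Horizontal" means the operation works on neighbouring lanes inside one register instead of matching lanes of two registers. The following is a plain C scalar model of what haddps and addsubps compute, not intrinsics, so it runs without SSE3 support; the function names are hypothetical.

```c
/* Scalar model of haddps: adjacent pairs are summed, first operand's
   sums land in the low half, second operand's sums in the high half. */
static void haddps_model(const float a[4], const float b[4], float out[4])
{
    out[0] = a[0] + a[1];
    out[1] = a[2] + a[3];
    out[2] = b[0] + b[1];
    out[3] = b[2] + b[3];
}

/* Scalar model of addsubps: even lanes subtract, odd lanes add. */
static void addsubps_model(const float a[4], const float b[4], float out[4])
{
    out[0] = a[0] - b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] - b[2];
    out[3] = a[3] + b[3];
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the haddps model gives {3, 7, 30, 70} and the addsubps model gives {-9, 22, -27, 44}.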

Load/Store:
lddqu - Loads an unaligned 128bit value.
movddup - Loads 64bits and duplicates it in the top and bottom halves of a 128bit register.
movshdup - Duplicates the odd-indexed singles (1 and 3) into the even-indexed positions (0 and 2).
movsldup - Duplicates the even-indexed singles (0 and 2) into the odd-indexed positions (1 and 3).
fisttp - Converts a floating-point value to an integer using truncation.

Process Control:
monitor - Sets up a region to monitor for activity.
mwait - Waits until activity happens in a region specified by monitor.

ssse3
psignd - Negates, zeroes, or keeps each 32bit integer depending on whether the matching integer in the 2nd operand is negative, zero, or positive.
psignw - Same as psignd, for 16bit integers.
psignb - Same as psignd, for 8bit integers.
phaddd - Horizontal addition of 32bit integers.
phaddw - Horizontal addition of 16bit integers.
phaddsw - Horizontal saturated addition of 16bit integers.
phsubd - Horizontal subtraction of 32bit integers.
phsubw - Horizontal subtraction of 16bit integers.
phsubsw - Horizontal saturated subtraction of 16bit words.
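pmaddubsw - Multiply-accumulate instruction (finally): multiplies unsigned bytes by signed bytes and adds adjacent pairs into 16bit results using signed saturation.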
pabsd - abs() for 32bit integers.
pabsw - abs() for 16bit integers.
pabsb - abs() for 8bit integers.
pmulhrsw - 16bit integer multiplication with rounding and scaling; each result is (a*b + 0x4000) >> 15.
pshufb - Another complex shuffle instruction.
palignr - Combines two register values, and extracts a register-width value from it, based on an offset.
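
pshufb is less mysterious than "complex" suggests: every byte of the control operand picks one source byte by its low 4 bits, or forces a zero when its top bit is set. Below is a plain C scalar model of the 128bit form, together with a model of psignb; both are illustrative sketches with made-up names, not intrinsics.

```c
#include <stdint.h>

/* Scalar model of pshufb (128bit form). */
static void pshufb_model(const uint8_t src[16], const uint8_t ctrl[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        if (ctrl[i] & 0x80)
            out[i] = 0;                    /* top bit set: result byte is zero */
        else
            out[i] = src[ctrl[i] & 0x0F];  /* low 4 bits select the source byte */
    }
}

/* Scalar model of psignb: negate, zero, or keep a[i] by the sign of b[i]. */
static void psignb_model(const int8_t a[16], const int8_t b[16], int8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        if (b[i] < 0)
            out[i] = (int8_t)(-a[i]);
        else if (b[i] == 0)
            out[i] = 0;
        else
            out[i] = a[i];
    }
}
```

A control vector of 15, 14, ..., 0 reverses the register, which is the usual byte-swap trick done with pshufb.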

sse4.1/2/a
SSE4.1
mpsadbw - Sum of absolute differences.
phminposuw - minimum+index extraction (16bit word).
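pmuldq - packed multiply of the signed 32bit integers in the even positions, stores 2 64bit results.
pmulld - packed multiply of 4 32bit integers, stores the low 32bits of each result.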
dpps - dot product, single precision.
dppd - dot product, double precision.
blendps - conditional copy.
blendpd - conditional copy.
blendvps - conditional copy.
blendvpd - conditional copy.
pblendvb - conditional copy.
pblendw - conditional copy.
pminsb - packed minimum signed byte.
pmaxsb - packed maximum signed byte.
pminuw - packed minimum unsigned word.
pmaxuw - packed maximum unsigned word.
pminud - packed minimum unsigned dword.
pmaxud - packed maximum unsigned dword.
pminsd - packed minimum signed dword.
pmaxsd - packed maximum signed dword.
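roundps - packed round single precision float to an integral value (result stays floating-point).
roundss - scalar round single precision float to an integral value (result stays floating-point).
roundpd - packed round double precision float to an integral value (result stays floating-point).
roundsd - scalar round double precision float to an integral value (result stays floating-point).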
insertps - complex data shuffling.
pinsrb - complex data shuffling.
pinsrd - complex data shuffling.
pinsrq - complex data shuffling.
extractps - complex data shuffling.
pextrb - complex data shuffling.
pextrw - complex data shuffling.
pextrd - complex data shuffling.
pextrq - complex data shuffling.
pmovsxbw - packed sign extension.
pmovzxbw - packed zero extension.
pmovsxbd - packed sign extension.
pmovzxbd - packed zero extension.
pmovsxbq - packed sign extension.
pmovzxbq - packed zero extension.
pmovsxwd - packed sign extension.
pmovzxwd - packed zero extension.
pmovsxwq - packed sign extension.
pmovzxwq - packed zero extension.
pmovsxdq - packed sign extension.
pmovzxdq - packed zero extension.
ptest - same as test, but for sse registers.
pcmpeqq - quadword compare for equality.
packusdw - saturating signed dwords to unsigned words.
movntdqa - Non-temporal aligned load (efficient when reading from write-combining memory).
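
As an illustration of the dot product entry, here is a plain C scalar model of dpps. The 8-bit immediate is split in two: the high 4 bits choose which lane products enter the sum, the low 4 bits choose which result lanes receive it (the others become zero). This is a sketch with a hypothetical name; the real instruction may add the products in a different order, which can matter for floating-point rounding.

```c
/* Scalar model of dpps. */
static void dpps_model(const float a[4], const float b[4], unsigned imm8, float out[4])
{
    float sum = 0.0f;

    for (int i = 0; i < 4; i++)          /* bits 4..7: which products to include */
        if (imm8 & (0x10u << i))
            sum += a[i] * b[i];

    for (int i = 0; i < 4; i++)          /* bits 0..3: which lanes get the sum */
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, imm8 = 0xF1 puts the full 4-element dot product (70) in lane 0 only, while imm8 = 0x7F puts the 3-element dot product (38) in all four lanes.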

SSE4.2
crc32 - CRC32C function (using 0x11edc6f41 as the polynomial).
pcmpestri - Packed compare explicit length string, Index.
pcmpestrm - Packed compare explicit length string, Mask.
pcmpistri - Packed compare implicit length string, Index.
pcmpistrm - Packed compare implicit length string, Mask.
pcmpgtq - Packed compare, greater than.
popcnt - Population count.
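
For reference, a bit-by-bit software sketch of the CRC32C that the crc32 instruction accumulates. It uses 0x82F63B78, which is the bit-reversed form of the polynomial 0x1EDC6F41 quoted above. The initial value of all ones and the final inversion are the usual convention applied around the instruction, assumed here rather than done by the instruction itself; the function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* One byte of CRC32C accumulation, the same update the crc32
   instruction applies to its accumulator for an 8bit source. */
static uint32_t crc32c_step(uint32_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
}

/* Full CRC32C of a buffer with the conventional init and final inversion. */
static uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
        crc = crc32c_step(crc, p[i]);
    return ~crc;
}
```

The standard check input "123456789" should give 0xE3069283.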

SSE4a
lzcnt - Leading Zero count.
popcnt - Population count.
extrq - Mask-shift operation.
inserq - Mask-shift operation.
movntsd - Non-temporal double precision move.
movntss - Non-temporal single precision move.

Google can often help, you know?
http://softpixel.com/~cwright/programming/simd/index.php