By Nigel Paver, Bradley Aldrich and Moinul Khan, Intel Corp.
Performing MAC Operations on Registers
Wireless MMX technology supports 16-bit SIMD multiply and
accumulate operations in which both the sources and the destination
are SIMD coprocessor registers. As with the TMIA instruction, any of
the coprocessor registers can be used as the accumulator in this case.
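To make the operation concrete, the following is a minimal Python reference model of the signed variant (WMACS). It is a sketch under stated assumptions, not a definition taken from this text: it assumes each 64-bit register holds four 16-bit halfwords, that the four signed products are summed and added to the 64-bit destination register, and that the accumulation wraps at 64 bits. The helper names `pack`, `halfwords`, `to_signed16` and `wmacs` are hypothetical.

```python
MASK64 = (1 << 64) - 1


def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement halfword."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def halfwords(reg64):
    """Split a 64-bit register value into four 16-bit lanes, lowest lane first."""
    return [(reg64 >> (16 * lane)) & 0xFFFF for lane in range(4)]


def pack(lanes):
    """Pack four (possibly negative) 16-bit lane values into one 64-bit value."""
    reg64 = 0
    for lane, value in enumerate(lanes):
        reg64 |= (value & 0xFFFF) << (16 * lane)
    return reg64


def wmacs(acc, wrn, wrm):
    """Assumed WMACS behavior: acc + sum of four signed 16x16 products, mod 2**64."""
    products = sum(
        to_signed16(a) * to_signed16(b)
        for a, b in zip(halfwords(wrn), halfwords(wrm))
    )
    return (acc + products) & MASK64


# 1*5 + 2*6 + 3*7 + 4*8 = 70
print(wmacs(0, pack([1, 2, 3, 4]), pack([5, 6, 7, 8])))  # -> 70
```

Because the accumulator is passed in and returned, any register value can play the accumulator role in this model, which mirrors the point made above about the coprocessor registers.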
The issue latency of the WMAC instruction is one cycle, and the
result and resource latencies are each two cycles. The second WMAC
instruction in the following example stalls for one cycle because of
the two-cycle resource latency.
wmacs wR0, wR2, wR3
wmacs wR1, wR4, wR5 @ stall 1 cycle
The WADD instruction in the following example stalls for one cycle
due to the two-cycle result latency. However, the second WMACS does not
stall for two cycles, because of the internal forwarding supported by
the multiplier and accumulate unit (MAU) of the coprocessor.
wmacs wR0, wR4, wR5
wmacs wR0, wR2, wR3 @ stall 1 cycle
waddhss wR1, wR0, wR2 @ stall 2 cycles
It is often possible to interleave instructions and effectively
overlap their execution with multicycle instructions that use the
multiply pipeline. The two-cycle WMAC instruction may be easily
interleaved with operations that do not use the same resources:
wmacs wR14, wR2, wR3
wldrd wR3, [r4], #8
wmacs wR15, wR1, wR0
waligni wR5, wR6, wR7, #4
wmacs wR15, wR5, wR0
wldrd wR0, [r3], #8
In the preceding example, the WLDRD and WALIGNI instructions do not
incur a stall, since they use the memory and execution pipelines
respectively and have no data dependencies. Instructions of the Intel
XScale core can also be interleaved with WMACS:
wmacs wR14, wR1, wR2
add R1, R2, R3
wmacs wR14, wR1, wR2
mul R4, R5, R6
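The stall behavior described in this section can be approximated with a small in-order issue model. The Python sketch below is illustrative only and rests on several assumptions: one instruction issues per cycle at most, an instruction waits until its pipeline is free (resource latency) and its source registers are ready (result latency), and the accumulator counts as a source. Only the WMACS timing (two-cycle resource and result latency, multiply pipeline) comes from the text; the one-cycle timings for the other mnemonics, the pipeline names, and the names `TIMING` and `count_stalls` are hypothetical placeholders.

```python
# Assumed timing per mnemonic: (pipeline, resource latency, result latency).
TIMING = {
    "wmacs": ("multiply", 2, 2),    # figures taken from the text
    "wldrd": ("memory", 1, 1),      # placeholder timing, not from the text
    "waligni": ("execute", 1, 1),   # placeholder timing, not from the text
    "add": ("core", 1, 1),          # placeholder timing, not from the text
    "mul": ("core", 1, 1),          # placeholder timing, not from the text
}


def count_stalls(program):
    """Return the stall cycles seen by each (mnemonic, dest, sources) entry."""
    pipe_free = {}   # pipeline name -> first cycle it can accept a new instruction
    reg_ready = {}   # register name -> first cycle its value can be consumed
    stalls = []
    next_slot = 0    # earliest issue cycle if nothing stalls
    for mnemonic, dest, sources in program:
        pipeline, resource_latency, result_latency = TIMING[mnemonic]
        issue = max(
            [next_slot, pipe_free.get(pipeline, 0)]
            + [reg_ready.get(reg, 0) for reg in sources]
        )
        stalls.append(issue - next_slot)
        pipe_free[pipeline] = issue + resource_latency
        reg_ready[dest] = issue + result_latency
        next_slot = issue + 1
    return stalls


back_to_back = [
    ("wmacs", "wR0", ["wR0", "wR2", "wR3"]),
    ("wmacs", "wR1", ["wR1", "wR4", "wR5"]),
]

interleaved = [
    ("wmacs", "wR14", ["wR14", "wR2", "wR3"]),
    ("wldrd", "wR3", ["r4"]),
    ("wmacs", "wR15", ["wR15", "wR1", "wR0"]),
    ("waligni", "wR5", ["wR6", "wR7"]),
    ("wmacs", "wR15", ["wR15", "wR5", "wR0"]),
    ("wldrd", "wR0", ["r3"]),
]

print(count_stalls(back_to_back))  # -> [0, 1]
print(count_stalls(interleaved))   # -> [0, 0, 0, 0, 0, 0]
```

Under these assumptions the back-to-back pair reproduces the one-cycle resource stall, while the interleaved sequence issues every cycle because each WMACS finds the multiply pipeline free again by the time it is reached.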