Taking a multicore DSP approach to medical ultrasound beamforming
Because beamforming is a fairly demanding algorithm, it is split across five cores and two processing phases. In Phase 1, each core coherently sums six channels; in Phase 2, the outputs of Phase 1 are added together (Figure 8 below).
Figure 8 – Beamforming architecture
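As a rough illustration of this split, the two phases can be sketched in Python. Only the five-core, six-channel partition comes from the text; the function names, the fixed integer delays and the plain-list data layout are assumptions made for the sketch, not the article's implementation (which updates the delay and apodization coefficients as it runs).

```python
# Illustrative two-phase delay-and-sum. Only the 5 x 6 partition is from the
# article; names, fixed integer delays and list layout are assumptions.
N_CORES = 5
CHANNELS_PER_CORE = 6


def phase1(channels, delays, weights, n_out):
    """Coherently sum one core's channels after per-channel delay and apodization."""
    out = [0] * n_out
    for ch, d, w in zip(channels, delays, weights):
        for n in range(n_out):
            idx = n + d  # integer sample delay (assumed; no interpolation)
            if 0 <= idx < len(ch):
                out[n] += w * ch[idx]
    return out


def phase2(partials):
    """Add the per-core partial sums sample by sample."""
    return [sum(samples) for samples in zip(*partials)]


def beamform(all_channels, all_delays, all_weights, n_out):
    """Run Phase 1 once per core on its six channels, then combine in Phase 2."""
    partials = []
    for core in range(N_CORES):
        lo = core * CHANNELS_PER_CORE
        hi = lo + CHANNELS_PER_CORE
        partials.append(
            phase1(all_channels[lo:hi], all_delays[lo:hi], all_weights[lo:hi], n_out)
        )
    return phase2(partials)
```

On the target device each `phase1` call would run on its own core in parallel; the loop over cores here only models the data partitioning.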
The time frames for Phase 1 and Phase 2 are allocated as in Figure 9 below, based on the complexity of each stage.

Figure 9 – Time frame allocation for Beamforming Phase 1 and Phase 2
Apodization and delay coefficients are assumed to be pre-calculated and stored in DDR memory. To reduce the bandwidth and size requirements, the coefficients are combined into two eight-bit variables.
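One possible reading of this packing is that an eight-bit apodization weight and an eight-bit delay value share a single 16-bit word, so both can be fetched in one access. The sketch below assumes that layout; the field order and the helper names `pack_coeffs` and `unpack_coeffs` are hypothetical and are not taken from the article.

```python
def pack_coeffs(apod, delay):
    """Pack two unsigned 8-bit coefficients into one 16-bit word.

    Assumed layout (hypothetical): apodization in the high byte, delay in the low byte.
    """
    if not (0 <= apod <= 0xFF and 0 <= delay <= 0xFF):
        raise ValueError("both coefficients must fit in 8 bits")
    return (apod << 8) | delay


def unpack_coeffs(word):
    """Return (apod, delay) from a word produced by pack_coeffs."""
    return (word >> 8) & 0xFF, word & 0xFF
```

Under this assumption a coefficient pair costs two bytes of DDR traffic per update.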
Taking into account all input and output data, as well as the RapidIO transfers, within the allocated time frames, the bandwidth requirements are calculated and presented in Table 2 below. For each time frame, the requirements do not exceed 50% of the theoretical achievable bandwidth for DDR3 and M3.

Table 2 – Memory bandwidth requirements
The next step in the evaluation process is the estimation of the cycle count spent on the actual processing. There are two factors that influence the overall cycle count: core cycles due to operations and cycles due to penalties.

Figure 10 – Pseudo-code for Phase 1; on the right-hand side is the pseudo-code for the apodization and delay coefficient updates and on the left-hand side the actual coherent summing; in brackets are the operations that can be grouped in a single VLES (variable length execution set)
The pseudo-code in Figure 10 above and Figure 11 below suggests a pipelining method that can achieve 9.4 cycles per output sample in Phase 1 and 2 cycles per output sample in Phase 2.

Figure 11 – Pseudo-code for Phase 2; in brackets are the operations that can be grouped in a single VLES (variable length execution set)
Because of efficient pipelining, there are no stall penalties between VLESs (variable length execution sets), so the only penalties that have to be taken into account are those due to cache misses. Based on cache line sizes and typical miss penalties, the cost of cache misses can be estimated with the following rule of thumb:
1) For M3: 80 cycles/128 bytes (overestimated by about 30% because of the RapidIO traffic)
2) For DDR: 100 cycles/128 bytes
The MSC8156 offers a mechanism to reduce the penalties due to cache misses by fetching data from memory before it is demanded by the core (dfetch instruction), so the figures above are expected to be higher than those achievable after optimization.
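The rule of thumb for miss penalties can be written as a small estimator. The per-128-byte costs are the ones quoted in the text; rounding a partial block up to a whole one and the function name `miss_penalty_cycles` are assumptions of this sketch, and no prefetching is modelled.

```python
import math

LINE_BYTES = 128                       # block size used by the rule of thumb
MISS_CYCLES = {"M3": 80, "DDR": 100}   # cycles per 128 bytes, from the text


def miss_penalty_cycles(n_bytes, memory):
    """Estimate cache-miss cycles for streaming n_bytes from 'M3' or 'DDR'.

    Partial blocks are rounded up to a full 128-byte block (assumption).
    """
    blocks = math.ceil(n_bytes / LINE_BYTES)
    return blocks * MISS_CYCLES[memory]
```

For example, streaming 1024 bytes costs an estimated 800 cycles from DDR and 640 cycles from M3 under this rule.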
After adding all the cycle-consuming elements together and comparing the total with the available cycle count, it can be observed that beamforming is also achievable from this last point of view (Table 3 below).

Table 3 – Beamforming, cycle count required and available
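The comparison behind Table 3 amounts to adding core cycles and miss penalties for a phase and checking the total against the cycles available in its time frame. A minimal sketch follows; the sample count, miss-cycle total, clock rate and time frame are hypothetical parameters supplied by the caller, not values from the article's tables.

```python
def required_cycles(cycles_per_sample, n_samples, miss_cycles):
    """Core cycles for n_samples plus the estimated cache-miss cycles."""
    return cycles_per_sample * n_samples + miss_cycles


def phase_fits(cycles_per_sample, n_samples, miss_cycles, clock_hz, timeframe_s):
    """True if a phase's required cycles fit within its allocated time frame.

    clock_hz and timeframe_s are caller-supplied assumptions.
    """
    available = clock_hz * timeframe_s
    return required_cycles(cycles_per_sample, n_samples, miss_cycles) <= available
```

As a usage example with made-up inputs, `phase_fits(9.4, 1000, 800, 1e9, 20e-6)` checks whether 1000 Phase 1 output samples at 9.4 cycles each, plus 800 miss cycles, fit in a 20 microsecond frame on a core assumed to run at 1 GHz.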

