Taking a multicore DSP approach to medical ultrasound beamforming - Embedded.com

Taking a multicore DSP approach to medical ultrasound beamforming

This product how-to article describes how to use a Freescale DSP MSC8156 to produce diagnostically useful medical B-mode ultrasound imaging results, using no more than about 38% of the resources of the DSP, leaving enough room to also perform Doppler imaging.

A critical factor in the design of many medical ultrasound equipment designs is the effectiveness of the beamforming algorithms, a signal processing technique used in sensor arrays for directional signal transmission or reception. But until recently, the computational power needed to do such signal processing – at a reasonable cost and with the resolution needed for medical diagnostic purposes – was only possible with FPGAs and ASICS.

Now, however a new generation of dedicated digital signal processing architectures can bring a lot of processing power and highly parallel architectures to the problem at a much lower cost and at lower power.

This article describes how to use a Freescale DSP MSC8156 loaded with software libraries for B-Mode images that produce diagnostically useful medical ultrasound imaging results, using no more than about 38% of the resources of the DSP, leaving enough room to also fit Doppler Imaging modes.

In B-mode ultrasound, a linear array of transducers simultaneously scans a plane through the body that can be viewed as a two-dimensional image on screen, whereas in the Doppler mode, the ultrasound system makes use of the well-known Doppler effect to measure and visualize blood flow.

The ultrasound imaging problem
A classic architecture for ultrasound system, up to where the image is formed, can be visualized in Figure 1, below. Usually, in such architectures, dedicated hardware is used to perform signal processing tasks such as beam-forming.This approach lowers the flexibility of the system, making updates impractical (e.g. introducing new transducers that have different geometrical properties might imply changes that are more efficient to be done in software).

In this article, we will not go into details regarding transducers, pulsers, protection T/R switches and analog frontends; the interested reader can find information about these parts in the datasheets provided by different manufacturers.

Figure 1 – Classic architecture for an ultrasound system
To bring more flexibility into the system, we propose a software approach where all signal processing is done in software and FPGAs are employed as simple connectors. By using “off the shelf” hardware that is mature in terms of tools and “getting started” materials, a lower time to market can be expected (Figure 2 below ).

Figure 2 – Proposed architecture with Beamforming done on DSP
Today solutions for 32 channels, based on FPGAs, can be divided into categories as in Table 1 below. While for small and medium large FPGAs dynamic focusing and apodization is hardly achievable on high end FPGAs almost any feature is available, leaving a gap in image quality for low and mid – end devices. Due to the fact that the delay and apodization coefficients can be pre-calculated and stored into memory, a DSP approach can prove to be efficient in implementing beamforming, as long as the processing power is sufficient.

Table 1 – FPGAs, DSPs capabilities

Applying the MSC8156 to the problem
In terms of costs, DSPs have a clear advantage over high end FPGAs and are suitable for low and mid end devices. The following shows how dynamic focusing and dynamic apodization, features that are usually in expensive FPGAs, can be implemented also on lower cost DSPs (Figure 3 below ).

Figure 3 – Features supported vs. price
The DSP selected for the study is MSC8156, a six core device from Freescale Semiconductor that is widely used in baseband, voice and video applications.

To meet the high-data-rate and demanding computational requirements of medical ultrasound imaging application, DSP architectures have to be analyzed based of five critical criteria: parallelism, memory bandwidth, IO bandwidth, instruction set and hardware accelerators.

Freescale’s MSC8156 serves these applications with a high parallel architecture that has six cores running at 1GHz, each core having four Data Arithmetic-Logic Units, where up to four arithmetic and logic operations can be executed in a single clock cycle, and two Address Generation Units, where usually integer operations and address generation is performed (Figure 4 below ).

Figure 4 – High level block diagram of MSC8156 (To view larger image, click here)
In terms of memory, MSC8156 features an on SoC M3 memory of 1MB with a theoretical speed of about 8GBps and the capability to connect two DDR3 memories of up to 1GB each with a theoretical bandwidth of about 12GBps.

Since we are dealing with high data rates, Serial Rapid IO is elected as the interface of choice. MSC8156 has two Rapid IO ports for that the measured speed for one port and one direction can be up to 9.11Gbps.

Beamforming algorithms, such as delay and sum, make use of interpolation operations implementable by packed instructions like:

For Doppler imaging modes such as Spectral Doppler, the FFT/DFT accelerator MAPLE-B might be of interest. While MAPLE is executing FFTs the cores are free to execute modules for other imaging modes.

The basics of beamforming
Beamforming is the signal processing heart of any medical ultrasound machine. It is usually implemented in FPGAs and ASICS due to the large bandwidth and computational requirements. The recent advances in digital signal processors (DSPs) opened the door for beamforming on general processing chips.

The medical ultrasound images are formed by first creating focused ultrasound beams in transmit and directivity patterns (in the same direction as for transmit) on receive. The area of interest is covered by adjacent beams with a spacing defined by the minimum resolution required. The ultrasound beams are formed by adding delays to the electrical pulses sent to each element with the purpose of controlling the radiation pattern.

On the receive side, the approach is similar; the signals from each element are delayed with the same amount as on transmit and summed together. The delay phase insures that the waves/signals are all in phase and that no destructive adding takes place (Figure 5 below ).

Figure 5 – Rx –Beamforming high level diagram
The beamforming process can be elegantly put into the following formula:

where N is the number of elements, Ai are the apodization coefficients (weights), s i are the received echoes and t are the delays. In this context, we are speaking about dynamic focusing and dynamic apodization because delay and respectively apodization coefficients are functions of time.

The first step to be done when analyzing beamforming is defining the physical parameters of the system and the features that are to be included. We will consider a 30 channel system with ADCs running at 32MHz and 16bit resolution for a central frequency of 2MHz.For the use case presented, dynamic focusing and apodization with coefficients updates every 0.96mm and interpolation at every cycle is considered. The depth of scan and view angle are depicted in Figure 6 below.

Figure 6 – Geometry parameters
Based on the Fraunhofer approximation the required number of lines to sample the sector can be calculated as:

where θ is the sector angle, At and Ar are the transmit and receive apertures and α is the wavelength. Due to the detection process that implies usually squaring of the echoes and logarithmic compression the actual number of lines needed to sample the sector θ is twice the value of N. For the use case presented a sufficient number of line is 61.2 but since we are dealing with an approximation 64 lines will be considered in the following exposition.

To avoid a drop in frames/second the data that has been acquired has to be processed until new data is transferred as schematically pictured in Figure 7 below.

Figure 7 – Schematics of the pipelined signal processing approach; Time frame allocation

A first step to consider is the time frame allocation for beamforming and B-Mode library based on the number of frames/second that needs to be supported. For this use case a number of 45 frames /second is considered as adequate. After dividing the number of frames/second to the number of scan lines we can allocate the time frame for beamforming module and for B-Mode libraries as presented in Figure 7 above.

When estimating that certain algorithms are feasible for certain architecture three factors are most critical: IO bandwidth (if it is the case), memory bandwidth requirements and the cycle count needed to execute the algorithm.

The IO bandwidth requirement can be calculated based on the timeframe allocation and ADC parameters as being 8.07Gbps well in bounds of the accepted for one Rapid IO port (measured speed of up to 9.11Gbps.

The memory where the input data is stored will be M3, since it is quicker than DDR3 and features sufficient space to support even a double buffering approach if needed. Since the transfer will be done via DMA, in the following no module will use a DMA approach so that the Rapid IO transfer is not interrupted.

Since beam-forming is a fairly demanding algorithm it will be split on five cores and two processing phases. In Phase 1 on each core six channels will be added coherently and in Phase2 the output of Phase 1 will go through an adding process (Figure 8 below ).

Figure 8 – Beamforming architecture
The time frames for Phase 1 and Phase 2 are allocated as in Figure 9 below , based on the complexity of each stage.

Figure 9 – Time frame allocation for Beamforming Phase 1 and Phase 2
Apodization and delay coefficients will be considered pre-calculated and stored in DDR memory. To reduce the bandwidth and size requirements the coefficients will be combined in two eight bit variables.

By taking into consideration all the input and output data and also the Rapid IO transfer in the timeframes allocated the bandwidth requirements are calculated and presented in Table 2 below. As can be observed, for each of the timeframes, the requirements do not pass over 50% of the theoretical achievable bandwidth for DDR3 and M3.

Table 2 – Memory bandwidth requirements
The next step in the evaluation process is the estimation of the cycle count spent on the actual processing. There are two factors that influence the overall cycle count: core cycles due to operations and cycles due to penalties.

Figure 10 – Pseudo-code for Phase 1; in the right hand side the pseudo-code for apodization and delay coefficient updates and on the left hand side the actual coherent summing; in brackets are the operations that can be grouped in a single VLES (variable length execution set) [ To view larger image, click here]

The pseudo-codes in Figure 10 above and Figure 11 below suggest a pipelining method that can achieve 9.4 cycles/output sample of Phase 1 and 2 cycles/output sample of Phase 2.

Figure 11– Pseudo-code for Phase2; in brackets are the operations that can be grouped in a single VLES (variable length execution set)

Due to efficient pipelining there are no stall penalties between VLES (variable length execution set) and so the only type of penalties that have to be taken into consideration are penalties due to cache misses. By taking into consideration cache line sizes and typical penalties due to cache misses one can evaluate the miss penalties based on the following rule of thumb:

1) For M3: 80 cycles/128bytes (is over evaluated with about 30% due to the Rapid IO traffic )
2) For DDR: 100 cycles/128bytes

The MSC8156 offers a mechanism to reduce the penalties due to cache misses by fetching data from memory before it is demanded by the core (dfetch instruction), so it is expected that the above reported figures are higher than the ones that could be achieved after optimizations.

After adding all the cycle consuming elements together and comparing with the available cycle count, it can be observed that beam-forming is achievable also from this last point of view (Table 3 below ).

Table 3 – Beamforming, cycle count required and available

Implementing the B-MODE
We propose a B-Mode library composed of the following four modules:

1) Envelope detection, decimation and logarithmic compression (ED)
2) Scan Conversion (SC)
3) Median Filtering (MF)
4) Histogram equalization (HQ)

To obtain B-Mode images only the first two modules (ED, SC) are necessary; the next two modules are classic image enhancing techniques that might bring a plus in image quality. Integrating enhancers in the acquisition timeframe has the advantage that no loss in frames/second is observed compared to applying them after the image is already formed. In the following we will speak about bilinear interpolation based Scan Conversion with an output image size of 640×480 and 3×3 window median filtering.

In the literature one can find different approaches to perform envelope detection, based on Hilbert transform, quadrature demodulation or by simply squaring the input signal. We propose an envelope detection algorithm formed from the following sub-modules:

1) Square value of the input beam-formed signal (Figure 12 below )
2) Decimation with a CIC and compensation FIR
3) Logarithmic compression

This method is specific for B-Mode image forming and proves to be one of the friendliest approaches from cycle count consumption perspective. MSC8156 is helping to implement this approach by featuring native adders on 40 bit, which can handle the bit-growth demands of CIC filters.

Figure 12 – In the top figure the spectrum of the echo signal is presented; in the bottom figure the spectrum of the squared echo signal – the band of interest moves towards 0 Hz

In Figure 13 below one can observe the time frame allocated for ED and that a single core is assigned for the processing.

Figure 13 – ED time frame allocation and number of cores used
Median filtering and scan conversion need access to at least three scan lines and respectively two scan lines to be called, this makes them not very suitable for low end FPGAs that have limited buffering resources. In the following, four instances of median filtering and four instances of scan conversion are called on for cores, when enough data is acquired (Figure 14 below ).

Figure 14 – MF and SC time frame allocation and number of cores used
Histogram equalization is a slightly different type of algorithm compared to the ones already discussed, due to the fact that at in some point one needs access to the entire image. To maximize the performance, histogram equalization (Figure 15 below ) is split into two phases:

(HQ1) Computing the histogram – is executed in the Acquisition and IO timeframe
(HQ2) Apply transformation – is executed when the whole image is acquisitioned (will drop the number of frames/second to about 44.5 )

Figure 15 –HQ1 and HQ2 time frame allocation and number of cores used

Bandwidth requirements and cycle count requirements (as obtained after simulator validation) have been summarized in Table 4 below.

Table 4 – Bandwidth and cycle count requirements for B-Mode library
In about 75% of the time frame we will not execute Median Filtering, Scan Conversion and HQ1 so we see a utilization factor as presented in Table 5 below while for the case when MF , SC and HQ1 are called the utilization is presented in Table 6 below .

Table 5.Utilization when MF, SC and HQ1 are not called

Table 6 – Utilization when MF, SC and HQ1 are called
By averaging the utilization from tables… and taking into consideration the unused timeframes a 38% utilization can be observed, about 2300MCPS (million cycles per second) from 6000MCPS available.

Scheduling image scans
Figure 16 below presents the processing performed on Core0 for each scan line; as can be observed 30% of the time frame is unused and cold be utilized for Doppler imaging modes.

Figure 16 – Core 0 utilization
Cores 1 to 4 typically do not execute MF, SC and HQ1 so almost 60% of the time frame is free for other imaging modes (Figure 17 below ). When also MF, SC and HQ1 are called, 10% of the time frame is unused (Figure 18 below ).

Figure 17 – Core1-4, when MF, SC and HQ1 are not called

Figure 18 – Core1-4, when MF, SC and HQ1 are called
The DDR bandwidth requirements for all the time frames are pictured in Figure 19a below . To lower the cost of the overall system by using a single DDR the unused time frame can be distributed to scan conversion and beamforming Phase 2 , leading to DDR bandwidth requirements acceptable for a single DDR (Figure19b below ).

Figure 19 – DDR Bandwidth requirements

In the past three years digital signal processors have progressed significantly in terms of processing power and power consumption making them more and more suitable for low and mid end portable medical ultrasound devices.

The plus in flexibility by using a software approach for all signal processing modules, lower time to market and lower cost by using “of the shelf” hardware that is tested and for that a lot of “getting started” material exist are the big plus of DSPs.

For a simple beam approach, as presented in this paper, a usage fact of about 38% is very encouraging to also move to multiple line acquisition techniques and to add other imaging modes like Power Doppler and Color Doppler.

MAPLE-B, while not used in this exposition, can be found very useful for Spectral Doppler imaging modes and for frequency domain beam forming algorithms.

Robert Krutsch is a DSP Software Engineer at Freescale Semiconductor. He received a BSc degree specializing in automatics and computer science from the Polytechnic University of Timisoara; a BSc degree from the Automation Institute at the University of Bremen and PhD in automatic control from Polytechnic University of Bucharest.

[1] Thomas L. Szabo, “Diagnostic Ultrasound Imaging : Inside Out”, Elsevier Academic Press, 2004
[2] Kai E. Thomenius, “Evolution of ultrasound beamformers”, IEEE Ultrasonics Symposium, 1996
[3] Hui Li, Dong C. Liu, “An embedded high performance Ultrasonic Signal processing subsystem”, IEEE computer society, 2009
[4] J. Nelson Wright,” Image formation in diagnostic ultrasound”, IEEE International Ultrasonics Symposium, 1997
[5] Tore Gruner Bjastad, “High frame rate ultrasound imaging using parallel beamforming”, PhD. Thesis, Norwegian University of Science and Technology, 2009
[6]  Iulian Talpiga, “Optimizing Serial Rapid IO Controller Performance in the MSC815x and MSC825x DSPs”, Application Note AN3872, 2010

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.