Editor’s Note: In Part 2 of this series from Digital Video Processing For Engineers, Andrew Draper describes some of the strategies for debugging an FPGA-based video system to be sure it reliably delivers the necessary video streams in real time. Part 2: Clocked and flow controlled video streams.
Most digital video protocols send video frames between boards using a clock and a series of synchronization signals. This is simple to explain but it is an inefficient way to communicate within a device, as all processing modules need to be ready to process data on every clock within the frame, but will be idle during the synchronization intervals.
Using a flow-controlled interface is more flexible because it simplifies processing blocks and allows them to spread the data processing over the whole frame time. Flow-controlled interfaces provide a way to control the flow of data in both directions: the source can indicate on which cycles there is data present, and the sink can backpressure when it is not ready to accept data.
In the Avalon ST flow-controlled interface the valid signal indicates that the source has data and the ready signal indicates that the sink is able to accept it (i.e. is not backpressuring the source).
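The valid/ready handshake can be illustrated with a small cycle-by-cycle model. This is only a sketch: it assumes that a beat of data moves on any cycle where both signals are high (that is, a ready latency of zero), and the function name and the sample waveforms are made up for illustration.

```python
def transfer_cycles(valid, ready):
    """Return the cycle indices on which a beat of data actually moves.

    Assumption: data is transferred on any cycle where the source asserts
    valid and the sink asserts ready (ready latency of zero).
    """
    return [i for i, (v, r) in enumerate(zip(valid, ready)) if v and r]


# Hypothetical six-cycle waveform, one entry per clock cycle.
valid = [1, 1, 0, 1, 1, 1]   # source has data
ready = [1, 0, 1, 1, 1, 0]   # sink can accept data

print(transfer_cycles(valid, ready))  # [0, 3, 4]
```

On cycles 1 and 5 the sink is backpressuring the source; on cycle 2 the source has nothing to send.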
If you are building a system from library components, most problems will occur when converting from clocked-video streams to flow-controlled video streams, and vice versa.
Several debugging tools are available. The most basic are an oscilloscope, a logic analyser or (within an FPGA) an embedded logic analyser such as Altera’s SignalTap tool. These tools provide a high-resolution view of the data being transferred on a number of signals.
If you have data integrity issues between boards, then low-level debugging tools such as these can be used to diagnose the problem. Unfortunately, once the signals between boards or within devices are clean, these tools typically provide too much data to diagnose the types of problems that appear at higher levels.
Higher-level debug tools provide a way to trace the data passing through the system and display it as video packets. The amount of data in a video system is more than can easily be transferred, so it must be compressed to allow it to be transferred to the debug host and analysed.
The highest level of compression can be achieved by ignoring most of the pixel values and only transferring control packets and statistics about the data flow: for example, counts of the clock cycles on which data was transferred, was not available from the source, or was backpressured by the sink.
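As a rough sketch of the kind of per-packet statistics described above, the following classifies each clock cycle into one of the three categories. The category names and the rule that a cycle with no valid data counts as "unavailable" regardless of ready are assumptions made for this illustration, not the trace monitor's documented behavior.

```python
from collections import Counter


def flow_stats(valid, ready):
    """Summarize a stream as three cycle counts instead of raw pixel data."""
    stats = Counter(transferred=0, unavailable=0, backpressured=0)
    for v, r in zip(valid, ready):
        if v and r:
            stats["transferred"] += 1     # data moved this cycle
        elif v:
            stats["backpressured"] += 1   # source had data, sink not ready
        else:
            stats["unavailable"] += 1     # source had nothing to send
    return dict(stats)


print(flow_stats([1, 1, 0, 1], [1, 0, 1, 1]))
# {'transferred': 2, 'unavailable': 1, 'backpressured': 1}
```

Three small integers per packet are far cheaper to send to the debug host than the pixel data they summarize.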
The Altera trace system (Figure 21.3) is instantiated when you are building a video design within the Qsys environment. Two parts are needed: a trace monitor component for each interface to be traced and a trace system component which transfers trace data packets to the host.
A video trace monitor component needs to be inserted into each video stream you want to monitor. This component is nonintrusive: it has no effect on the video data going through the stream. The video trace monitor component reads the signals being transmitted and sends summaries to the trace system.
You will need to parameterize the video trace monitor to match the type of data being sent and to match the trace system data-width. The trace system component takes the reports from the trace monitors and buffers them before sending the results to the host over JTAG or USB, where they are reconstructed for the user.
You will also need to parameterize this component to select the type of connection to the host, the number of monitors, the buffer size, etc. The SystemConsole host application decodes and displays the received packets to show the data as it passes through the system (Figure 21.4). Each video packet is displayed as one line in the display. The rest of this article describes some common video errors and how to recognize them in the trace output.
Debug tools are also available which allow the debug host to access memory-mapped slaves within the target. The Altera JTAG Avalon Master and USB Debug Master components are explicitly designed to do this. If you do not have such a component available, most processor debuggers can be used in a similar way.
Converting from Clocked to Flow-controlled Video Streams
In a functioning system the input to the flow-controlled domain will send data as it becomes available. The system needs to transfer, on average, one line’s worth of pixels in each line scan time. The transfer of data will normally be controlled by “valid”, which selects the cycles on which data is transferred, with “ready” de-asserted only occasionally.
The number of cycles on which “valid” is asserted depends on the ratio between the screen resolution and the clock rate in the flow-controlled domain. If the clock is just sufficient for the highest resolution then “valid” will be asserted on most cycles within the main part of the frame. At lower resolutions “valid” will only be asserted on a proportion of the cycles.
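The ratio described above can be estimated with simple arithmetic. The sketch below averages over the whole frame time (including blanking) and assumes one pixel per clock; the 148.5 MHz clock and the two resolutions are illustrative values, not figures from the text.

```python
def valid_duty_cycle(width, height, frame_rate_hz, clock_hz, pixels_per_clock=1):
    """Average fraction of clock cycles on which "valid" must be asserted.

    Assumption: only active pixels are carried on the streaming interface,
    and the result is averaged over the full frame time.
    """
    pixel_rate = width * height * frame_rate_hz
    return pixel_rate / (clock_hz * pixels_per_clock)


CLOCK_HZ = 148_500_000  # hypothetical flow-controlled domain clock

print(round(valid_duty_cycle(1920, 1080, 60, CLOCK_HZ), 3))  # 0.838
print(round(valid_duty_cycle(1280, 720, 60, CLOCK_HZ), 3))   # 0.372
```

With these numbers the higher resolution keeps "valid" asserted on most cycles, while the lower resolution needs it on well under half of them.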
The “ready” signal to the clocked video input should not be the main source of flow control on the frame, so it is typically deasserted only for short periods to synchronise with the sink. One common problem is that if “ready” is de-asserted for too long then the memory buffer in the video input block can overflow.
Attaching a streaming video monitor to the output of the video input block can help detect overflow situations: if the video input block is backpressured (by de-asserting “ready”) for too long then it will abandon the backpressured frame and send a short packet.
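The relationship between stall length and buffer depth can be sketched with a toy model. It assumes the clocked side pushes exactly one pixel per cycle and cannot be stalled, and that the sink removes one pixel on each cycle where "ready" is high; the FIFO depth and stall lengths are made-up values.

```python
def overflows(fifo_depth, ready_pattern):
    """Return True if the input FIFO overflows for the given "ready" pattern."""
    fill = 0
    for ready in ready_pattern:
        fill += 1                # clocked input cannot be stalled
        if ready:
            fill -= 1            # sink accepts one pixel this cycle
        if fill > fifo_depth:
            return True          # the real block would abandon the frame here
    return False


short_stall = [1] * 10 + [0] * 16 + [1] * 10
long_stall = [1] * 10 + [0] * 100 + [1] * 10

print(overflows(64, short_stall))  # False
print(overflows(64, long_stall))   # True
```

In this simplified model the FIFO never drains after a stall because input and output rates are equal; a real system relies on the clock headroom to recover.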
This can be seen on the trace. The trace also reports the number of not-ready cycles within each packet and the time interval between packets. This can be used to check that the interface is being mostly flow-controlled by “valid” rather than “ready”.
If the clocked video-input block has a control port then the debug master can be used to check the overflow sticky-bit in the status register. This bit will be set if there has been an overflow since it was last checked. Note that if you have software monitoring and clearing this bit, then reading it from the debugger will not be reliable.
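The reason a debugger read becomes unreliable can be shown with a minimal model of a sticky bit. The class and method names are hypothetical, and the explicit clear operation is an assumption; the real register may be cleared in a different way.

```python
class StickyOverflowBit:
    """Toy model of a sticky status bit: set by hardware, held until cleared."""

    def __init__(self):
        self._overflow = False

    def hardware_overflow(self):
        self._overflow = True    # latched when an overflow occurs

    def read(self):
        return self._overflow

    def clear(self):
        self._overflow = False   # assumed to be an explicit software action


status = StickyOverflowBit()
status.hardware_overflow()       # an overflow happens

software_saw = status.read()     # embedded software polls the bit...
status.clear()                   # ...and clears it
debugger_saw = status.read()     # the debugger looks afterwards

print(software_saw, debugger_saw)  # True False
```

The overflow did occur, but the debugger sees a clear bit because the software got there first.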
Converting from Flow-controlled to Clocked Video Streams
The clocked-video output component converts flow-controlled video packets into a clocked video signal. The flow control on the input to this component is controlled by the “ready” signal, which essentially pulls data out of the interface as it is needed.
If the source is unable to provide data at a sufficient rate then the FIFO in this component will empty. This is referred to as underflow. At this point the component tries to re-synchronize, sending out blank video data and reading continuously from the input until the start of the new frame appears, when it will re-start the output video.
The clocked-video output component latches the underflow indication: the underflow sticky-bit is set when an underflow occurs. You can use the debug master or software on an embedded processor to check this bit. As with the overflow bit in the clocked-video input, if embedded software is monitoring and resetting the bit then reading it from the debugger will not be reliable.
The video trace monitor can also indicate when there are problems with underflow. Normally the stream going in to the clocked-video output is controlled by “ready”, but if there is a problem with underflow then “ready” will not be de-asserted during the re-synchronization process. The resulting lack of backpressure is visible in the captured video packet summaries.
Free-running Streaming Video Interfaces
The clock rate within a flow-controlled video system is normally set to provide sufficient bandwidth on the streaming ports for a picture of maximum resolution to be transmitted (with a small amount of overhead to allow for jitter). The flow control signals (“ready” near the video output or “valid” near the video input) ensure that processing does not run faster than the incoming video stream. If processing runs too far ahead then frames will be missed and the picture will be jerky.
This can happen if the design has instantiated multiple triple-buffer components. Triple buffers do not flow-control their inputs or their outputs (except temporarily when waiting for memory accesses). A video pipeline between two triple buffers will run at the processing clock speed rather than staying in sync with the video frames.
If part of the video pipeline is allowed to free run then this will waste memory bandwidth. It can also reduce picture quality, as the input triple buffer will duplicate frames to keep its output busy while the output triple buffer will delete frames to match the frame rate on the output. The overall effect will be that some frames are output multiple times while other frames are not output at all.
The solution to the free-running problem is to replace all but one of the triple buffers with a double-buffer component. The double buffer does no frame rate conversion so will not allow its input and output to run more than one frame apart. This will provide flow control to the central part of the system. The video trace monitor can also be used to detect free-running streaming video components. Examining the flow-control statistics will report that there is no backpressure or unavailable data, i.e. “ready” and “valid” will be high for most of the frame.
The timing information on the captured video packets reports the average frame rate passing through the monitor. If the streaming video interface is free running then the frame rate in parts of the video pipeline will be much faster than expected.
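The frame-rate check can be sketched as follows. It assumes the captured packets carry a start timestamp in seconds; the function names, the 10% tolerance and the 150 frames-per-second figure are hypothetical choices for illustration.

```python
def average_frame_rate(frame_start_times_s):
    """Average frames per second from a list of frame start timestamps."""
    if len(frame_start_times_s) < 2:
        raise ValueError("need at least two frame timestamps")
    span = frame_start_times_s[-1] - frame_start_times_s[0]
    return (len(frame_start_times_s) - 1) / span


def looks_free_running(frame_start_times_s, expected_fps, tolerance=0.1):
    """Flag a stream whose frame rate is well above the expected video rate."""
    return average_frame_rate(frame_start_times_s) > expected_fps * (1 + tolerance)


locked = [i / 60 for i in range(11)]        # frames arriving at 60 fps
free_run = [i / 150 for i in range(11)]     # frames arriving at 150 fps

print(looks_free_running(locked, expected_fps=60))    # False
print(looks_free_running(free_run, expected_fps=60))  # True
```

Combining this with the flow-control statistics ("ready" and "valid" both high for most of the frame) gives a second, independent indication.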
Insufficient Memory Bandwidth
Some video processing components, such as a color space converter, can process the video data one pixel at a time. Others need to store the pixels between input and output. The simplest examples are the buffer components that write the input pixels to a frame buffer in memory and read from memory (with different timing) to create the output pixel stream.
Components using a frame buffer demand a large amount of memory bandwidth: the sum of the bandwidth of the input and output data rates. If the memory subsystem is not designed correctly then it will not be able to provide this bandwidth. This will cause excessive flow control of the input and/or output, which in turn will make FIFOs in other components overflow or underflow as described previously.
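A back-of-envelope version of this sum is shown below. The resolution, pixel size and frame rates are illustrative, and the calculation ignores memory overheads such as burst alignment and refresh, so a real design needs headroom beyond this figure.

```python
def frame_buffer_bandwidth(width, height, bytes_per_pixel, in_fps, out_fps):
    """Bytes per second a frame buffer needs: write rate plus read rate."""
    frame_bytes = width * height * bytes_per_pixel
    write_rate = frame_bytes * in_fps    # input stream written to memory
    read_rate = frame_bytes * out_fps    # output stream read from memory
    return write_rate + read_rate


# Hypothetical 1080p stream, 3 bytes per pixel, 60 fps in and out.
demand = frame_buffer_bandwidth(1920, 1080, 3, in_fps=60, out_fps=60)
print(demand)           # 746496000
print(demand / 1e6)     # 746.496 (MB/s)
```

Each additional frame buffer in the pipeline adds a similar demand on the shared memory.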
Because of their size, most frame buffers are stored in external memory, which is usually shared between multiple, different, memory-mapped masters. Even in the case where memory is not shared, a double- or triple-buffer component has two masters, one to write and the other to read.
When there are multiple masters for the same memory-mapped slave, an arbiter is needed to share the slave’s bandwidth between the masters. In some cases the arbiter is inserted automatically as part of the bus fabric; in other cases it is explicitly inserted by the user as a separate component, or as part of the slave component.
The Altera multi-port front-end component is a specialized arbiter which understands the costs of different DDR accesses and can be configured to maximize bus efficiency. When used correctly this component can achieve memory-bandwidth efficiency of over 90%, i.e. the number of cycles lost due to bank opens, closes, read-after-write delays and other DDR performance hazards is less than 10%.
Setting up the arbiter to achieve high efficiencies is sometimes complex, as the interface priorities need to be set correctly so that low-latency masters are serviced quickly. Most video component masters will use only as much bandwidth as is needed for the selected video resolution, although when free running they will use as much bandwidth as is available, possibly locking out lower-priority masters from the memory.
A processor master does not normally have a bandwidth limit: it will use as much bandwidth as is available and can be consumed. Most processors are only able to pipeline a limited number of memory accesses, so the latency of the memory can limit the amount of bandwidth they can consume. Processors are normally put at the lowest priority to prevent them from starving video masters, which have a bandwidth target.
Many arbiters, including the Altera external memory interface toolkit, have an optional efficiency monitoring feature which collects statistics about the bandwidths and latencies used by different masters. This efficiency monitor can be used to check that the memory is running at a sufficiently high overall bandwidth, and can help with optimization when it is not.
Check Data Within Stream
During the prototype stage all components have bugs that must be fixed. The usual hardware flow is to fix these bugs through simulation where the visibility into the system is good.
This is harder for video components as the high data rates mean that complex components can take several minutes to simulate each frame. For edge-case bugs, which occur once every few hours on video data, this would mean many days of simulation before a bug occurs. These bugs are only really debuggable in hardware.
Most debug components, including the Altera trace system, can be set up to continuously capture data into a circular buffer. When the trace system is triggered it stops capturing data, sending its stored data to the host for analysis. Ignoring the activity of the system significantly before the trigger lets you concentrate on the immediate causes of the bug, rather than having to wade through large amounts of captured data.
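The capture-until-trigger behavior can be sketched in software with a bounded deque. This is a simplified model with hypothetical names: it stops capturing immediately on the trigger, whereas real tools often let you keep a configurable number of samples after the trigger as well.

```python
from collections import deque


class TraceBuffer:
    """Circular capture buffer that freezes its contents when triggered."""

    def __init__(self, depth):
        self._samples = deque(maxlen=depth)  # oldest samples fall off the front
        self.triggered = False

    def capture(self, sample):
        if not self.triggered:
            self._samples.append(sample)

    def trigger(self):
        self.triggered = True                # stop capturing from now on

    def dump(self):
        return list(self._samples)           # what would be sent to the host


trace = TraceBuffer(depth=4)
for cycle in range(10):
    trace.capture(cycle)
trace.trigger()
trace.capture(99)                            # ignored: arrives after the trigger

print(trace.dump())  # [6, 7, 8, 9]
```

Only the four samples immediately before the trigger survive, which is the history most likely to explain the bug.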
What drives the trigger signal?
For hard-to-find bugs you might write custom hardware, which monitors various parts of the system and sends a trigger when misbehavior is detected. This is difficult, can be error-prone and is not always necessary. Most component vendors ship bus protocol monitors that are used in simulation to check that the signals on a bus do not violate the specification.
For example, a lot of memory-mapped buses require that after an access has started the address signals must remain stable until that access is accepted by the slave. A master that changes the address lines halfway through its transaction will be detected by the monitor. Bus monitors are used extensively during simulation and some of them are now synthesizable, so they can be temporarily included within an FPGA design.
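The address-stability rule can be expressed as a small checker over a captured trace. The bus model here is generic and hypothetical: each cycle is a tuple of (request, accepted, address), and only this one rule is checked, so it is not a complete monitor for any particular bus specification.

```python
def address_stability_errors(cycles):
    """Return the cycle indices where the address changed mid-access.

    cycles: iterable of (request, accepted, address) tuples, one per clock.
    """
    errors = []
    pending_address = None                   # address of an access not yet accepted
    for index, (request, accepted, address) in enumerate(cycles):
        if request:
            if pending_address is not None and address != pending_address:
                errors.append(index)         # address moved before acceptance
            pending_address = None if accepted else address
        else:
            pending_address = None
    return errors


good = [(1, 0, 0x100), (1, 0, 0x100), (1, 1, 0x100), (0, 0, 0x000)]
bad = [(1, 0, 0x100), (1, 0, 0x104), (1, 1, 0x104)]

print(address_stability_errors(good))  # []
print(address_stability_errors(bad))   # [1]
```

In hardware, the equivalent of a non-empty error list is the monitor's error output, which can drive a trigger.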
Connecting the error output of the monitor to the trigger input of a trace system, or embedded logic analyzer, will let the user capture the events leading up to, and just after, the error.
In streaming video systems, application-aware bus monitors also detect higher-level errors: for example, a component that outputs data packets whose size does not match the size described in the preceding control packet will be logged and/or reported as an error.
The video trace system will also show edge cases that might trigger a bug: for example, two control packets preceding a data packet is legal (the second control packet takes priority) but is not handled correctly by some components.
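A higher-level check of this kind might look like the sketch below. The packet representation (tuples tagged "control" or "data") is invented for illustration and is not the actual packet format; the checker treats the most recent control packet as authoritative, matching the rule that the second of two control packets takes priority.

```python
def size_mismatches(packets):
    """Return (index, expected_pixels, actual_pixels) for bad data packets.

    packets: list of ("control", width, height) or ("data", pixel_count).
    """
    errors = []
    expected = None                          # no control packet seen yet
    for index, packet in enumerate(packets):
        if packet[0] == "control":
            _, width, height = packet
            expected = width * height        # latest control packet wins
        elif packet[0] == "data":
            _, pixel_count = packet
            if expected is not None and pixel_count != expected:
                errors.append((index, expected, pixel_count))
    return errors


stream = [
    ("control", 4, 2), ("data", 8),                      # matches 4x2
    ("control", 4, 2), ("control", 2, 2), ("data", 4),   # second control applies
    ("data", 3),                                         # short packet
]

print(size_mismatches(stream))  # [(5, 4, 3)]
```

A component that mishandles the double control packet would instead expect 8 pixels for the packet at index 4, which is exactly the sort of disagreement the trace makes visible.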
Debugging video systems can be daunting, especially when the only visible symptom is the output of a black picture. Many trace components are available to provide visibility into the system and narrow down the location of the bug which is causing the symptom. Careful use of these components can save significant time during development.
In some cases the trace components can be left active in shipped systems. Remote debugging can then be used on units running real data. This can be especially valuable when the bug has been triggered by almost standard data being generated by other equipment that is only installed in one broadcaster, in one distant country.
Read Part 1: Timing analysis and debugging
This article is from a chapter in Digital Video Processing For Engineers, by Michael Parker and Suhel Dhanani, and is used with the permission of the publisher Newnes (Copyright 2013), an imprint of Elsevier Ltd.
Andrew Draper is a principal design engineer at Altera Corp. and is based in Chesham, Buckinghamshire, United Kingdom. He received his engineering training at Cambridge University and Godalming College before going on to work at Philips Consumer Electronics and Madge Networks.