The effective identification and debug of semiconductor issues is critical for the technical teams involved in system design and delivery. The challenging combination of increased system complexity and decreased time to market can result in extreme pressure to resolve issues in the shortest time possible – what engineer hasn’t been asked to deliver their fix yesterday? At times of such intense pressure, it can be difficult to follow a logical and structured approach to debugging an issue. Yet following such an approach is the very key to a timely resolution.
It is clear that a comprehensive debug framework could save engineers significant time and frustration in debugging complex semiconductor issues. Using illustrative examples, this article describes such a framework. Although video products are used as a lens to examine how semiconductor issues can manifest and be resolved, the framework outlined here should be considered as generic and applicable to many semiconductors and problems. Starting with the review of the application against any available reference schematics and layouts and culminating with the submission of parts through official failure analysis channels, this guide attempts to provide as comprehensive a framework as possible.
The front cover of the Hitchhiker's Guide to the Galaxy [Adams, Douglas (1981), The Hitchhiker's Guide to the Galaxy, New York: Pocket Books] famously called out a calming message to those who were lucky enough to possess a copy: Don’t Panic. Douglas Adams wrote that “despite its many glaring (and occasionally fatal) inaccuracies, the Hitchhiker's Guide to the Galaxy itself has outsold the Encyclopedia Galactica because it is slightly cheaper, and because it has the words “DON'T PANIC” in large, friendly letters on the cover”.
We’ve all been there though. The demo system is overdue for delivery, the marketing department is on the phone looking for updates and a small group of engineers in the lab pores over a board which refuses to act as it was intended to. It is at times like this that an intergalactic traveller would reach for the Guide. Engineers have many alternatives to the Guide – the Internet, The Art of Electronics or even one of Dilbert’s many insightful cartoons. Such an exalted list is now augmented by this article – a fault-finding recipe that engineers can use to alleviate the panic when it occurs.
Start with software….
Every engineer has their own biases towards starting a debug with either software or hardware. Temporarily suppressing these biases (although the author does have a hardware background…), software can often be the best place to start a debug given the ability to change complex elements reasonably quickly and the sticky nature of hardware (e.g. the lead time involved in non-bill of materials changes).
Silicon vendors invest significant time and resources prior to releasing products to define optimal configuration settings which work across a range of operating conditions, such as process variations, temperature and voltage. Modern semiconductor devices such as HDMI receivers rely heavily on the use of optimal configuration settings to ensure their stable and robust operation. Although extra settings may need to be added to the core configuration settings to address application specific issues (e.g. input muxing, color space conversion), the core configuration settings must be maintained without adjustment.
When confronted with an issue, the configuration settings being employed in the application must be examined as a priority. If the configuration settings being employed do not match those recommended by the silicon vendor, the next step must be to change those settings before immediately retesting. The impact of incorrect settings can range from the slight (e.g. blurred video due to filters being disabled) to the serious (e.g. complete absence of video or failing compliance).
During software debug, it can be helpful to isolate a complex software driver with many interactions (e.g. interrupt responses, scheduled tasks) from the basic I2C configurations required for the system components. If everything works okay with the static configuration, the issue may be in the increased number of interactions incorporated into the software. If there are still issues, hardware debug is the next step.
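The comparison of applied settings against the vendor's recommended core settings can be automated. The following is a minimal sketch only: the register addresses and values are placeholders rather than settings for any real device, and `diff_settings` is a hypothetical helper name. Registers that appear only in the application's list are assumed to be application specific extras and are not flagged.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical core settings {register address: value}. Placeholder numbers,
# not taken from any vendor's recommended configuration.
RECOMMENDED: Dict[int, int] = {0x00: 0x1E, 0x01: 0x05, 0x0C: 0x42}


def diff_settings(
    recommended: Dict[int, int], applied: Dict[int, int]
) -> List[Tuple[int, int, Optional[int]]]:
    """Return (address, expected, actual) for each core register that differs.

    `actual` is None when the application never writes the register.
    Registers present only in `applied` are ignored (application extras).
    """
    mismatches = []
    for addr in sorted(recommended):
        expected = recommended[addr]
        actual = applied.get(addr)
        if actual != expected:
            mismatches.append((addr, expected, actual))
    return mismatches


# Example: register 0x01 has been altered and register 0x0C is never written.
applied = {0x00: 0x1E, 0x01: 0x07, 0x3D: 0x10}
for addr, expected, actual in diff_settings(RECOMMENDED, applied):
    shown = "not written" if actual is None else f"0x{actual:02X}"
    print(f"reg 0x{addr:02X}: expected 0x{expected:02X}, got {shown}")
```

Any mismatch reported this way is a candidate for correction and an immediate retest before moving on to hardware debug.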
Part To Play
Once software has been ruled out as the possible source of the issues being experienced, hardware is the next area for analysis. It might sound unlikely and elementary but the first stage of any hardware debug should be to confirm that the parts on the PCB are actually correct. Silicon vendors typically develop several models of each product with each model differentiated from the others in varying degrees.
One common differentiator between models is in speed grade. A simple mistake in selecting from an ordering guide could result in a model which does not perform to the required specification e.g. an ADV7802BSTZ-150 video decoder operates at frequencies of up to 150MHz whereas an ADV7802BSTZ-80 video decoder only operates at frequencies of up to 80MHz. Another differentiator could be in feature set differences between models. For example, ordering a non-HDCP supporting HDMI receiver by mistake (e.g. ADV7611BSWZ vs ADV7611BSWZ-P) could result in a system which does not support video from consumer video sources. Models of the same product may also differ in pinout despite sharing the same package. For example, two models of the same product with different output interfaces may have slightly offset I2C interfaces as illustrated in Figure 1. A minor ordering oversight could result in an incorrect part and a non-functional hardware design.
Figure 1. Product Comparison
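As a rough illustration of the speed grade check, the sketch below reads a trailing numeric suffix from a model number and compares it with the clock rate the application needs. It assumes the suffix encodes the maximum frequency in MHz, as in the two ADV7802 examples above; that convention will not hold for every part, so the ordering guide remains the authority. The function names are hypothetical, and the 148.5 MHz figure is simply the standard 1080p60 pixel clock used as an example requirement.

```python
import re
from typing import Optional


def speed_grade_mhz(part_number: str) -> Optional[int]:
    """Return the trailing numeric suffix of a model number, or None.

    Assumption: a suffix such as "-150" denotes a 150 MHz maximum.
    """
    match = re.search(r"-(\d+)$", part_number)
    return int(match.group(1)) if match else None


def supports_clock(part_number: str, required_mhz: float) -> bool:
    """Check whether the model's speed grade covers the required clock."""
    grade = speed_grade_mhz(part_number)
    if grade is None:
        raise ValueError(f"no numeric speed grade found in {part_number!r}")
    return required_mhz <= grade


# Example requirement: a 1080p60 pixel clock of 148.5 MHz.
for model in ("ADV7802BSTZ-150", "ADV7802BSTZ-80"):
    verdict = "ok" if supports_clock(model, 148.5) else "too slow"
    print(f"{model}: {verdict}")
```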
Hardware is, well, hard….
Getting a complex hardware design correct first time isn’t easy. A good design process incorporating up-front planning, schematic entry, layout and layout simulation is a solid foundation which maximizes the likelihood of success. Even with a good process in place however, errors can still happen. Once the correct part has been confirmed on the PCB, the next step is to confirm the hardware.
Most silicon vendors provide reference schematics and/or schematic guidelines to assist in getting hardware designs correct first time around. If your system is not behaving as desired, review it against the aforementioned references to ensure that no significant differences exist beyond those required by the application e.g. input and output connections may differ between a reference system and a specific application.
Careful attention should be paid to the power supply, ensuring that the filtering and decoupling are implemented in accordance with the recommendations – failure to do so could result in noise coupling from a digital switching supply (e.g. a digital core supply) to a sensitive analog supply (e.g. a PLL supply). Unused pins must be carefully handled to ensure that damage is not caused to the part or that interference is not induced into the system (e.g. unterminated outputs could oscillate uncontrollably).
If the schematic is okay, the next step for analysis is the layout itself. Layout induced issues can range from basic component placement problems through to complex coupling issues. Video products usually carry recommendations to place key external circuit components (e.g. external loop filters, crystal oscillators) on the same side of the board as, and close to, the video part itself. Failure to lay out external circuit components carefully, and in accordance with the recommendations, could result in unpredictable behavior from the circuit.
Differential circuits (e.g. HDMI, MHL, MIPI and APIX), if not correctly designed and implemented, can be particularly susceptible to layout induced issues. Failure to follow recommendations relating to such technologies can result in degraded performance leading to functional or compliance issues. Discovery of functional or compliance issues should trigger a comprehensive review of the basic principles of differential layout. Have the differential traces been kept short and on the same side of the board (avoiding vias) as the video part? Has a solid ground plane been used underneath the traces? Have intra- and inter-pair spacings been kept consistent? Has the surrounding copper ground fill been kept far enough away from the differential traces?
The power supply is another element of the layout that can induce significant issues if poorly implemented. As outlined in the schematic section, while power supply filtering is important, the supporting layout is equally so. Key things to check for are that low inductance power supply planes have been used wherever feasible and that decoupling capacitors have been carefully located such that decoupling can be achieved right at the pin. If all the basics look okay, then more subtle elements may need to be examined such as whether stitching capacitors have been used to shorten current return paths.
Investigate All Available Knowledge Bases
If the hardware and software have been eliminated as possible sources of the issues being observed, it may be valuable to checkpoint with other engineers and knowledge bases. While some semiconductor problems are novel, many others are observed on a regular basis by technical specialists or experienced engineers.
While this has traditionally happened on a local level with engineers within the same company working together to identify solutions, many semiconductor companies now provide web-based collaboration tools such as Analog Devices’ EngineerZone forum to broaden the mindshare.
There are many benefits to using such tools – they provide rapid access to technical specialists, they do not discriminate between customers and they provide access to a unique knowledge base which can result in up to 30% of people suffering issues finding answers to their queries without actually asking a question.
Characterize The Problem
If a solution to the problem is not obvious from either the local assistance or from the web-based collaboration tools, the time has arrived to characterize the problem. A detailed analysis of the issue at hand will assist in delivering a faster solution to the problem once support has been sought.
If the hardware and software checks completed during the initial investigation stages were not conclusive, then the ecosystem around the part should be examined to ensure that it is to specification. The first element of the platform to examine is the mechanical fixing of the parts to the printed circuit board. Was the printed circuit board electrically tested? Do all visible solder joints look robust? Do all non-visible solder joints seem robust when viewed using an alternate means of inspection (e.g. x-ray)? The reliability of mechanical assembly is always improving but low volume prototype production runs can sometimes result in issues.
If, following investigation, the board is confirmed to be reliable, the next step should involve correlating the hardware behavior to the original design specifications. Are all aspects of the platform operating as per each part’s datasheet requirements (e.g. power supply level and stability, crystal frequency)? For example, if a single voltage supply is not within specification or incorrect supplies have been connected together, the part may not operate as expected (the output may suffer increased noise, compliance performance may be compromised). When checking supply levels, it is important to probe as close to the device pin as possible, thus ruling out any possible voltage drop across the board. In addition to the power supply, other common sources of issues are the crystal circuit (e.g. stability, load capacitance requirement, negative resistance), termination circuits (device selection and layout) and voltage reference circuits.
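A supply check of this kind is simple to script once the readings have been taken at the device pins. The sketch below is illustrative only: the rail names, nominal voltages and ±5% tolerance are hypothetical and must be replaced with the limits from the part's datasheet.

```python
from typing import Dict, List, NamedTuple


class Rail(NamedTuple):
    name: str
    nominal_v: float
    tolerance: float  # fractional tolerance, e.g. 0.05 means +/-5 %


def rails_out_of_spec(rails: List[Rail], measured_v: Dict[str, float]) -> List[str]:
    """Return one message per rail that is unmeasured or outside its limits."""
    failures = []
    for rail in rails:
        low = rail.nominal_v * (1 - rail.tolerance)
        high = rail.nominal_v * (1 + rail.tolerance)
        reading = measured_v.get(rail.name)
        if reading is None:
            failures.append(f"{rail.name}: not measured")
        elif not low <= reading <= high:
            failures.append(
                f"{rail.name}: {reading:.3f} V outside {low:.3f}-{high:.3f} V"
            )
    return failures


# Hypothetical rails and readings, probed close to the device pins.
rails = [Rail("DVDD", 1.8, 0.05), Rail("PVDD", 1.8, 0.05), Rail("DVDDIO", 3.3, 0.05)]
readings = {"DVDD": 1.80, "PVDD": 1.68, "DVDDIO": 3.31}
for message in rails_out_of_spec(rails, readings):
    print(message)
```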
Another key aspect of the design is the margin available for setup and hold time on each chip-to-chip connection. Manifesting on the display as anything from sparkles (i.e. pixel errors appearing as occasional white dots), through green lines, to complete picture loss, timing errors can be detrimental to the performance of a system across a range of temperatures. Matching the data invalid times of the transmitter with the setup and hold times required of the receiver while incorporating all PCB specific information (e.g. trace length, series resistors) is vital to delivering a stable system.
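The margin calculation can be sketched as follows. This is a simplified model under stated assumptions, not a substitute for a full timing analysis: the clock is forwarded with the data, the receiver captures on the clock edge one period after the edge that launched the data, and `skew_ns` is the data trace delay minus the clock trace delay. All parameter names and numbers are hypothetical and should come from the two datasheets and the PCB layout.

```python
from typing import Tuple


def timing_margins_ns(
    clock_mhz: float,
    t_out_min_ns: float,  # earliest the transmitter's data can change after a clock edge
    t_out_max_ns: float,  # latest the transmitter's data becomes valid after a clock edge
    skew_ns: float,       # data trace delay minus clock trace delay
    t_setup_ns: float,    # receiver setup requirement
    t_hold_ns: float,     # receiver hold requirement
) -> Tuple[float, float]:
    """Return (setup_margin, hold_margin) in ns; a negative value is a violation."""
    period_ns = 1000.0 / clock_mhz
    setup_margin = period_ns - (t_out_max_ns + skew_ns) - t_setup_ns
    hold_margin = (t_out_min_ns + skew_ns) - t_hold_ns
    return setup_margin, hold_margin


# Hypothetical figures for a 100 MHz pixel bus.
setup, hold = timing_margins_ns(100.0, 0.5, 3.0, 0.2, 2.0, 1.0)
print(f"setup margin {setup:+.2f} ns, hold margin {hold:+.2f} ns")
```

A negative margin on either side suggests adjusting the clock-to-data relationship, for example through trace lengths, series resistors or any programmable clock delay the parts offer, and the calculation should be repeated with worst-case figures across temperature.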
Report Your Issue…
If the hardware debug is successful and everything is to specification, the next step should be to characterize the issue as concisely as possible before seeking further assistance. Gathering information on the following topics will assist technical support staff in providing an informed and timely response.
- How is the issue reproduced and what is the reproducibility rate of the issue (e.g. 1 in 1000, 1 in 100, 1 in 10 or 10 in 10 attempts)? Can the issue be recreated in a static manner or does it require dynamic input (e.g. repeated hot plugging of an input)?
- Describe the input used to duplicate the issue (e.g. is the input a standard or application specific source such as a reversing camera? Does the issue occur with just one or a range of different sources? What is the resolution/frequency of the input? What is the color space of the input? What is the color depth of the input? For LVDS or MIPI type inputs, how many lanes does the input employ?)
- Describe the output used to duplicate the issue (e.g. is the output a standard or application specific sink? Does the issue occur with one or a range of sinks? What is the resolution/frequency of the output? What is the color space of the output? What is the color depth of the output? For LVDS or MIPI type outputs, how many lanes does the output employ?)
- Describe the configuration of the part required to duplicate the issue (e.g. what input(s) are being employed – analog or digital? What signal path through the part is being used – component processor, scaler, de-interlacer, on-screen display? What output(s) are being used – analog or digital? What are the settings being employed and how have they been adjusted from those recommended?)
- Characterize the influencers (e.g. Does the issue change with temperature? Does the issue change with voltage? Is the issue independent of cable length or are the parts being connected on a PCB?)
Note: When applying either heat or cold to the device, it is strongly recommended to use either a chamber or forced air equipment. Freezer spray and heat guns are to be avoided if at all possible due to the unpredictable nature of their output.
- Can the issue be replicated on the semiconductor device’s evaluation board when the same inputs and outputs are connected with the DUT configured in as similar a manner as possible? Does the issue manifest in the same, or in a slightly different, manner?
Note: This question is vitally important for two key reasons. First, it indicates very quickly to the technical support staff whether the problem is related to the semiconductor device and supporting material (e.g. documentation or recommended configuration settings) or if the problem is related to something application specific. Second, if the problem can be reproduced on the evaluation board, it gives the technical support staff a very fast means of reproducing it themselves.
- Does the issue happen on all platforms or on a single platform? An ABA swap may be required to confirm this.
Note: An ABA swap is an effective tool for semiconductor packages which can be removed from, and remounted to, a PCB with minimal levels of rework – e.g. this is suitable for LQFP and LFCSP packages but less so for BGA packages unless there is scope to reball the device. The purpose of an ABA swap is to confirm whether the failure follows a suspect system, or a suspect device. First, reproduce the issue on the suspect system and confirm a second system which does not show the issue. Remove the parts from both systems, swap and remount. Confirm if the issue remains on the suspect platform or if the issue moved to the second system with the suspect part.
- Capture scope traces, pictures or, if more insightful, videos to demonstrate the issue (a video is significantly more useful if the issue is dynamic in nature e.g. sparkles, random noise etc).
After all of this research has been completed and the results are at hand, it is time to return to the web-based collaboration tools to initiate a conversation with the technical support specialists.
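The interpretation of the ABA swap described in the list above can be written down as a small decision table. This is only a sketch: `interpret_aba_swap` is a hypothetical helper, and the wording of the two inconclusive outcomes is an assumption rather than part of the procedure itself.

```python
def interpret_aba_swap(suspect_board_fails: bool, good_board_fails: bool) -> str:
    """Interpret results observed AFTER the parts have been swapped.

    suspect_board_fails: the suspect system, now fitted with the known-good part.
    good_board_fails:    the known-good system, now fitted with the suspect part.
    """
    if suspect_board_fails and not good_board_fails:
        return "issue stays with the suspect system: investigate the board"
    if good_board_fails and not suspect_board_fails:
        return "issue follows the suspect part: investigate the device"
    if suspect_board_fails and good_board_fails:
        return "inconclusive: both systems now fail"
    return "inconclusive: issue no longer reproduces after rework"


# Example: the failure moved to the second system along with the suspect part.
print(interpret_aba_swap(suspect_board_fails=False, good_board_fails=True))
```

Recording the outcome in these terms gives technical support staff an unambiguous statement of whether the failure follows the system or the device.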
After exploiting all possible avenues of exploration, the physical condition of the semiconductor device may itself be the final stage of investigation. To progress with this stage, the device must normally be returned to the semiconductor supplier through the formal failure analysis channel – this can usually be achieved by contacting either the semiconductor supplier’s local sales office or the sales office of the distributor.
The failure analysis process employs tools such as repeat production testing and bench evaluation to check for failure modes such as electrical overstress, electrostatic discharge damage, manufacturing defects, production test escapes and application related issues. Should the failure analysis process be unsuccessful in reproducing or confirming the issue, the part may be returned categorized as No Trouble Found (NTF) for further debug.
The sage words, “DON’T PANIC”, can seem antagonistic when offered at the wrong instant. But the intention of the advice is the key to unlocking so many issues that are commonly experienced by engineers when trying to design and deliver complex systems featuring multiple semiconductor devices. Following a structured and reasoned approach to the debug effort (e.g. reviewing reference schematics and layouts, recommended settings, confirming hardware operation as being to specification) increases the possibility of finding the root cause of any issue.