
Best practices for debugging Zephyr-based IoT applications

The Linux Foundation’s Zephyr open source project has grown into the backbone of many IoT projects. Zephyr offers a best-in-class small, scalable, real-time operating system (RTOS) optimized for resource-constrained devices across multiple architectures. The project currently has 1,000 contributors and 50,000 commits, with support for architectures including ARC, Arm, Intel, Nios, RISC-V, SPARC, and Tensilica, and for more than 250 boards.

When working with Zephyr, there are a few critical considerations for keeping devices connected and functioning reliably. Developers can’t work out every class of issue at their desk; some problems only become obvious as a device fleet grows. And as networks and network stacks evolve, you need to ensure that upgrades don’t introduce new problems.

For example, consider a situation we faced with GPS trackers deployed to track farm animals. The device was a low-footprint, sensor-based collar. On any given day, an animal roamed from mobile network to mobile network, from country to country, and from location to location. Such movement quickly exposed misconfigurations and unexpected behavior that could lead to power loss and, in turn, significant economic loss. We didn’t just need to know about an issue; we needed to know why it happened and how to fix it. When working with connected devices, remote monitoring and debugging are crucial to gaining instant insight into what went wrong, the next best steps to address the situation, and ultimately how to establish and maintain normal operation.

We use a combination of Zephyr and cloud-based device observability platform Memfault to support device monitoring and updating. In our experience, you can leverage both to establish best practices for remote monitoring using reboots, watchdogs, fault/asserts, and connectivity metrics.

Setting Up an Observability Platform

Memfault lets developers monitor, debug, and update firmware remotely, which allows us to:

  • avoid production freezes in favor of minimum viable product and Day-0 updates
  • continuously monitor overall device health
  • push updates and patches before most, if any, end users notice issues

Memfault’s SDK is easy to integrate and collects packets of data for cloud analysis and issue deduplication. It works like a typical Zephyr module: you add it to your west manifest and enable it in your Kconfig settings.

# west.yml
[ ... ]
    - name: memfault-firmware-sdk
      url: https://github.com/memfault/memfault-firmware-sdk
      path: modules/memfault-firmware-sdk
      revision: master


# prj.conf
CONFIG_MEMFAULT=y
CONFIG_MEMFAULT_HTTP_ENABLE=y

First Area of Focus: Reboots

Suppose you see a considerable surge in resets across your devices. This is often an early indicator that something in the topology has changed or that devices are starting to experience issues due to hardware defects. The reset reason is the smallest piece of information you can collect to start gaining insight into device health, and it helps to think about it in two parts: hardware resets and software resets.

Hardware resets are often due to hardware watchdogs and brownouts. Software resets can be caused by firmware updates, asserts, or user-initiated restarts.

After identifying what types of resets are taking place, we can understand if there are issues that are affecting the whole fleet, or if they are limited to a small percentage of devices.

Record reason for reboot

void fw_update_finish(void) {
   // ...
   memfault_reboot_tracking_mark_reset_imminent(kMfltRebootReason_FirmwareUpdate, ...);
   sys_reboot(0);
}

Zephyr has a mechanism for registering RAM regions that are preserved across a reset, which Memfault hooks into. If you’re about to reboot the platform, we recommend saving state right before you start: record the reason for the reboot –  in this case, a firmware update –  and then call Zephyr’s sys_reboot().
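
To illustrate the underlying mechanism, here is a minimal sketch of preserving a reboot reason across a warm reset with Zephyr’s __noinit section; the variable names and magic value are illustrative and not part of the Memfault SDK.

#include <zephyr.h>
#include <stdbool.h>
#include <stdint.h>

// Variables placed in the noinit section are not zeroed by the C runtime at
// boot, so their contents survive a warm reset (but not a power cycle)
static uint32_t s_reboot_reason __noinit;
static uint32_t s_reboot_reason_magic __noinit;

#define REBOOT_REASON_MAGIC 0xbaadcafe

void save_reboot_reason(uint32_t reason) {
  s_reboot_reason = reason;
  s_reboot_reason_magic = REBOOT_REASON_MAGIC;
}

bool load_reboot_reason(uint32_t *reason) {
  // Only trust the stored value if the magic pattern survived the reset
  if (s_reboot_reason_magic != REBOOT_REASON_MAGIC) {
    return false;
  }
  *reason = s_reboot_reason;
  s_reboot_reason_magic = 0;
  return true;
}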

Capturing Device Resets on Zephyr

Register an init handler to read bootup information

static int record_reboot_reason() {
   // 1. Read hardware reset reason register
   //    (check the MCU data sheet for the register name)
   // 2. Capture software reset reason from noinit RAM
   // 3. Send data to server for aggregation
   return 0;
}

SYS_INIT(record_reboot_reason, APPLICATION,
         CONFIG_KERNEL_INIT_PRIORITY_DEFAULT);

You can set up a handler that captures system information after a reset by reading the MCU reset reason register. When the device restarts, Zephyr runs handlers registered with the SYS_INIT macro. MCU reset reason registers all have slightly different names, and all are useful because they show whether any hardware problems or defects are present.
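
To make step 1 more concrete, here is a hedged sketch of such a handler using Zephyr’s portable hwinfo API instead of a vendor-specific register. It assumes CONFIG_HWINFO=y, an SoC driver that implements hwinfo_get_reset_cause(), and the Zephyr 2.x init-handler signature that takes a device pointer.

#include <device.h>
#include <drivers/hwinfo.h>
#include <init.h>

static int record_reboot_reason(const struct device *dev) {
  ARG_UNUSED(dev);

  uint32_t cause = 0;
  if (hwinfo_get_reset_cause(&cause) == 0) {
    if (cause & RESET_WATCHDOG) {
      // the hardware watchdog fired before this boot
    }
    if (cause & RESET_BROWNOUT) {
      // the supply voltage dipped below the brownout threshold
    }
    // ... stash the cause (for example, in noinit RAM) so it can be sent to
    // the server for aggregation once connectivity is up
  }

  return 0;
}

SYS_INIT(record_reboot_reason, APPLICATION, CONFIG_KERNEL_INIT_PRIORITY_DEFAULT);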

Example: Power Supply Issue

Let’s look at an example of how remote monitoring can give vital insight into fleet health by looking at reboots and power supply. Here we can see a small number of devices account for more than 12,000 reboots (Figure 1). 


Figure 1: Example of Power Supply Issue, Chart of Reboots over 15 days. (Source: Authors)

  • 12K device reboots a day – way too many
  • 99% of reboots contributed by 10 devices
  • A bad mechanical part contributing to constant device reboots

In this case, some devices are rebooting 1,000 times a day, likely due to a mechanical issue (a bad part, poor battery contact, or similar chronic hardware issues).

Once devices are in production, you can handle a number of these issues through firmware updates. Rolling out an update lets you work around hardware defects and avoid having to recover and replace devices.

Second Area of Focus: Watchdogs

When working with connected stacks, a watchdog is the last line of defense for getting a system back into a clean state without manually resetting the device. Hangs can happen for many reasons, such as:

  • the connectivity stack blocking on send()
  • infinite retry loops
  • deadlocks between tasks
  • corruption

Hardware watchdogs are a dedicated peripheral in the MCU that must be “fed” periodically to prevent it from resetting the device. Software watchdogs are implemented in firmware and fire ahead of the hardware watchdog, making it possible to capture the system state leading up to the hardware watchdog reset.

Zephyr has a generic hardware watchdog API that all MCUs can use to set up and configure the watchdog on the platform. (See the Zephyr API for more details: zephyr/include/drivers/watchdog.h)

#include <zephyr.h>
#include <device.h>
#include <devicetree.h>
#include <drivers/watchdog.h>

// Watchdog timeout window in milliseconds (60 seconds in this example)
#define WDT_MAX_WINDOW 60000U

static const struct device *s_wdt;
static int s_wdt_channel_id;

void start_watchdog(void) {
  // consult device tree for available hardware watchdog
  s_wdt = device_get_binding(DT_LABEL(DT_INST(0, nordic_nrf_watchdog)));

  struct wdt_timeout_cfg wdt_config = {
    /* Reset SoC when watchdog timer expires. */
    .flags = WDT_FLAG_RESET_SOC,

    /* Expire watchdog after max window */
    .window.min = 0U,
    .window.max = WDT_MAX_WINDOW,
  };

  s_wdt_channel_id = wdt_install_timeout(s_wdt, &wdt_config);

  const uint8_t options = WDT_OPT_PAUSE_HALTED_BY_DBG;
  wdt_setup(s_wdt, options);
  // TODO: Start a software watchdog
}

void feed_watchdog(void) {
  wdt_feed(s_wdt, s_wdt_channel_id);
  // TODO: Feed software watchdog
}

Let’s walk through a few steps using this example of the Nordic nRF9160.

  1. Consult the device tree and look up the Nordic nRF watchdog node.
  2. Set the configuration options for the watchdog through the exposed API.
  3. Install the watchdog timeout.
  4. Periodically feed the watchdog while the system is behaving as expected, often from the lowest-priority task. If the system gets stuck, the watchdog triggers a reboot.

Using Memfault on Zephyr, the software watchdog can make use of kernel timers, which are backed by a timer peripheral. Set the software watchdog timeout ahead of your hardware watchdog (for example, a hardware watchdog at 60 seconds and a software watchdog at 50 seconds). If the callback is ever invoked, an assert is triggered, which takes you through the Zephyr fault handler and captures what was happening at the point in time when the system was stuck.
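
As a rough illustration (not the Memfault SDK’s own implementation), such a software watchdog can be built on a Zephyr kernel timer; the names and the 50-second timeout below are illustrative, and CONFIG_ASSERT=y is assumed so the assert actually fires.

#include <kernel.h>
#include <sys/__assert.h>

// Fire 10 seconds before the 60-second hardware watchdog
#define SW_WDT_TIMEOUT K_SECONDS(50)

static void prv_software_watchdog_expired(struct k_timer *timer) {
  ARG_UNUSED(timer);
  // Asserting here funnels execution through the Zephyr fault handler, so a
  // coredump captures what the system was doing when it hung
  __ASSERT(0, "software watchdog expired");
}

K_TIMER_DEFINE(s_sw_wdt_timer, prv_software_watchdog_expired, NULL);

void software_watchdog_start(void) {
  k_timer_start(&s_sw_wdt_timer, SW_WDT_TIMEOUT, K_NO_WAIT);
}

void software_watchdog_feed(void) {
  // Restarting the timer pushes the expiry out by another SW_WDT_TIMEOUT
  k_timer_start(&s_sw_wdt_timer, SW_WDT_TIMEOUT, K_NO_WAIT);
}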

Example: SPI Driver Stuck

Let’s again turn to an example of an issue that isn’t caught in development but arises in the field. In Figure 2, you can see the timing of the faults and the degradation of the SPI flash.


Figure 2: SPI Driver Stuck Example. (Source: Authors)

  • SPI flash degrading over time, causing incorrect communication timing
  • Traced to 1% of devices after 16 months of field deployment
  • Driver fix rolled out with the next release

For the flash, after a year in the field, you can see a sudden onset of errors caused by the firmware getting stuck in SPI transactions or other parts of the code. Having the whole trace helps you find the root cause and develop a solution.

The watchdog shown in Figure 3 kicks off the Zephyr fault handler.


Figure 3: Fault Handler Example, Register Dump. (Source: Authors)

 

Third Area of Focus: Faults/Asserts

The third component to track is faults and asserts. If you’ve ever done some local debugging or built features of your own, you’ve probably seen a similar screen of register state when a fault has taken place on the platform. Faults can be due to:

  • asserts
  • accessing bad memory
  • dividing by zero
  • using a peripheral in the wrong way

Here’s an example of the fault handling flow on Cortex-M microcontrollers running Zephyr.

#include <kernel.h>
#include <string.h>

void network_send(void) {
  const size_t packet_size = 1500;
  void *buffer = k_malloc(packet_size);
  // missing NULL check!
  memcpy(buffer, 0x0, packet_size);
  // ...
}


bool memfault_coredump_save(const sMemfaultCoredumpSaveInfo *save_info) {
  // Save register state
  // Save _kernel and task contexts
  // Save selected .bss & .data regions
}

void sys_arch_reboot(int type) {
  // ...
}

When an assert or a fault occurs, an exception fires and Zephyr’s fault handler is invoked, providing the register state at the time of the crash.

The Memfault SDK automatically stitches into this fault handling flow, saving critical information for upload to the cloud, including the register state, the state of the kernel, and a portion of every task running on the system at the time of the crash.

There are three things to look for when you are debugging locally or remotely:

  1. The Cortex-M fault status register tells you why the platform asserted or faulted (a sketch of reading it follows this list).
  2. Memfault recovers the exact line of code the system was running before the crash, along with the state of all the other tasks.
  3. Collecting the _kernel structure from the Zephyr RTOS shows the scheduler state and, for a connected application, the state of the Bluetooth or LTE parameters.
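
For reference, here is a hedged sketch of reading that fault status register directly during local debugging; 0xE000ED28 is the architecturally defined address of the Configurable Fault Status Register (CFSR) on Cortex-M3/M4/M7 parts, and the helper name is illustrative.

#include <stdint.h>

// Configurable Fault Status Register (CFSR) on Cortex-M3/M4/M7
#define CFSR_ADDR 0xE000ED28UL

// Returns the raw CFSR value; individual bits identify the fault, for example
// bit 25 (DIVBYZERO) for a divide-by-zero, bit 24 (UNALIGNED) for an unaligned
// access, and the low byte for MemManage faults
uint32_t read_cortex_m_cfsr(void) {
  return *(volatile uint32_t *)CFSR_ADDR;
}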

 

Fourth Area of Focus: Tracking Metrics for Device Observability

Tracking metrics lets you build a picture of what is happening on your system and compare devices across your fleet to understand which changes are having an impact.

A few metrics useful to track are:

  • CPU utilization
  • connectivity parameters
  • heap usage

With the Memfault SDK, you can define a metric and start updating it on Zephyr with two lines of code:

  1. Define the metric:

MEMFAULT_METRICS_KEY_DEFINE(LteDisconnect, kMemfaultMetricType_Unsigned)

  2. Update the metric in code:

void lte_disconnect(void) {
  memfault_metrics_heartbeat_add(MEMFAULT_METRICS_KEY(LteDisconnect), 1);
  // ...
}

Memfault SDK + Cloud

  • Serializes and compresses metrics for transport
  • Indexes metrics by device and firmware version
  • Exposes a web interface for browsing metrics by device and across the fleet

Dozens of metrics can be collected and indexed by device and firmware version. A few examples:

  • NB-IoT/LTE-M basic connectivity: See how a modem impacts battery life, either by being connected or connecting.
  • Tracking Base Stations and PSM in NB-IoT/LTE-M: Mobile signal quality can be painful and can drain battery life if unmanaged. Create metrics for network status, events, cell tower information, settings, timers and more. Monitor for changes and use alerts.
  • Testing Large Fleets: Unexpectedly large payloads can increase device connectivity costs; tracking data size helps identify outliers (a sketch follows this list).
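
As a rough sketch modeled on the LteDisconnect example above, a data-size metric might look like the following; the metric name and helper function are hypothetical.

#include "memfault/metrics/metrics.h"
#include <stddef.h>
#include <stdint.h>

MEMFAULT_METRICS_KEY_DEFINE(UdpBytesSent, kMemfaultMetricType_Unsigned)

// Call this from the network code after each send to accumulate the number
// of bytes sent during the current heartbeat interval
void record_udp_send(size_t bytes_sent) {
  memfault_metrics_heartbeat_add(MEMFAULT_METRICS_KEY(UdpBytesSent),
                                 (int32_t)bytes_sent);
}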

Example: NB-IoT/LTE-M data size


Figure 4: Tracking metrics for device observability – NB-IoT example, LTE-M data size. (Source: Authors)

  • UDP data size: Track bytes per send interval (Figure 4)
  • More data is sent right after a reboot
  • Some packets are bigger because they carry more information or traces
  • Tracking data size helps catch data-consumption issues

Conclusion

Leveraging Zephyr and Memfault, developers can implement remote monitoring to get better observability into connected device functionality. By focusing on reboots, watchdogs, fault/asserts, and connectivity metrics, developers can optimize the cost and performance of IoT systems.

Learn more by watching a recorded presentation from the 2021 Zephyr Developer Summit.


Luka Mustafa is the founder and CEO of IRNAS, which develops durable IoT solutions for industrial use cases and low-power off-grid operation. Luka leads a multidisciplinary team developing open systems and devices for the most challenging environments and uses, ranging from custom CNC machines to electronics and fiber-optic systems (the wireless optical system KORUZA). He also promotes and deploys open wireless networks through the wlan slovenija project and manages national and international wireless backbones.
Chris Coleman is Co-Founder and CTO of Memfault, which offers an observability platform for connected devices. Prior to Memfault, Chris was an embedded software engineer at Pebble and Fitbit, where he led efforts across the firmware stack and developed a reputation for tracking down and fixing challenging firmware bugs.
