Updating firmware reliably

Many devices still treat firmware updates as exceptional events, to be done in exceptional circumstances and only by advanced users or qualified personnel. However, especially for connected devices, keeping up to date is becoming ever more important. Keeping a device secure requires being able to update the software running on it quickly and reliably, because vulnerabilities will inevitably be discovered.

But nobody likes bricked devices. Sure, a bricked device is perfectly secure, but that is hardly a consolation. Therefore, when performing firmware updates, especially ones delivered over the air and in the background, there is one overarching concern: reliability. At every step of the process there must be safety mechanisms that allow the device to be recovered with no or minimal intervention from the end user (a manual reboot is fine, an RMA for reflashing is not).

These are the principles we set out for ourselves when designing the firmware update mechanism for Mongoose Firmware. In this article we will talk about performing reliable firmware updates in general and consider the particular implementation used on the TI CC3200 (the same approach applies to the ESP8266).

TI CC3200 is an unusual device in that it does not have any on-chip flash memory. Code and data sections are loaded from an external SPI flash into SRAM and executed from there. The SPI flash chip is formatted to contain a rudimentary file system. The size of the flash chip can be up to 64 Mb, but the most popular size is 8 Mb – as seen on both the LAUNCHXL dev board and in the CC3200MOD, the module offered by TI.

At first glance, code not being executed from flash directly makes things easier – files on the SPI file system can be deleted and rewritten without interfering with the code being executed, so code could be updated while it is running. This, however, is extremely unsafe – if the process were to fail for any reason (an interrupted network connection or a sudden loss of power), the device would not be able to boot at all, let alone roll back the update. Thus, the first order of business is creating a mechanism for maintaining an alternate image of the firmware, with the ability to boot from it.

The solution is a boot loader: a small piece of code that rarely, if ever, changes. Its role is to determine which of the firmware images to load and execute. This is usually specified by a configuration block found at a predetermined location. Since the CC3200 does not allow direct access to flash, this needs to be a separate file rather than a flash sector. The boot config specifies which image should be loaded – see figure 1.


Figure 1. (Source: Cesanta)

There is one more consideration, however: if the boot config is stored in only one location, it is susceptible to failure during updates, which are usually performed as a read-erase-write operation: a reboot after the erase but before the write is complete could render the device unbootable. The time between the two is short, but we set out to make our update process safe at all points, so we have to deal with it. We do this by using two config files with versioning, or sequencing. A sequencer is a monotonically decreasing number, so of the two files the one with the smaller sequencer is more recent – in figure 2, config 1 is selected as active because it has the smaller sequencer. When writing a new config file, we always use the currently inactive (older) slot, and it does not become the newer one until it is fully written – an erased config is older than any valid one because erased NOR flash is filled with all 1s.


Figure 2. (Source: Cesanta)
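To make the two-slot scheme concrete, here is a minimal sketch in C. It is not the actual Mongoose code: the struct layout, the field names and the function names are all assumptions made for illustration, and the two slots are modelled as an in-memory array rather than as files on the SPI flash.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, for illustration only. */
#define SEQ_ERASED UINT64_MAX /* erased NOR flash reads back as all 1s */

struct boot_cfg {
  uint64_t seq;        /* sequencer: smaller means more recent */
  uint32_t flags;
  char image_file[32]; /* e.g. "c_hello.bin.0" */
};

/* Returns the index (0 or 1) of the active config, or -1 if both are erased. */
int select_active_cfg(const struct boot_cfg cfgs[2]) {
  if (cfgs[0].seq == SEQ_ERASED && cfgs[1].seq == SEQ_ERASED) return -1;
  return (cfgs[0].seq <= cfgs[1].seq) ? 0 : 1;
}

/* Writes a new config into the inactive slot and returns that slot's index.
 * Sequencer exhaustion (seq reaching 0) is not handled in this sketch. */
int write_new_cfg(struct boot_cfg cfgs[2], const char *image_file,
                  uint32_t flags) {
  int active = select_active_cfg(cfgs);
  int target = (active == 0) ? 1 : 0; /* never touch the active slot */
  uint64_t next_seq = (active < 0) ? SEQ_ERASED - 1 : cfgs[active].seq - 1;
  assert(strlen(image_file) < sizeof(cfgs[target].image_file));
  /* Erase first: if power is lost here, the slot still looks like the oldest. */
  memset(&cfgs[target], 0xff, sizeof(cfgs[target]));
  cfgs[target].flags = flags;
  strcpy(cfgs[target].image_file, image_file);
  cfgs[target].seq = next_seq; /* the slot becomes newest only at this point */
  return target;
}
```

Starting from two erased slots, the first write lands in slot 0 with sequencer 0xfffffffffffffffe and the second in slot 1 with 0xfffffffffffffffd. At no point is the active slot modified, so an interrupted write leaves the previous config selectable.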

Next we consider rollbacks. What if the firmware is bad and does not work? For example, it may just reboot, or hang and be rebooted by the watchdog timer. Thus the first boot after an update is unconfirmed: the newly flashed firmware image has the “first boot” flag set. Figure 2 illustrates the state after an update was first applied: it was downloaded and flashed to slot 1, and boot config 1 was written with the “first boot” flag set. At an early stage of the boot process, a boot config with “first boot” set is erased and the boot process continues. If at any point the system reboots, the bad config that led to the failed boot will not be there and a rollback will occur – the system will boot from the previous, good configuration (cfg 0 and image 0 in figure 2). If the boot is successful and the update is committed, the “first boot” flag is removed from the config and from then on it will always be used as the active firmware.
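The rollback logic can be sketched in a few lines of C. This is an illustration under stated assumptions, not the real boot loader: the struct, the function names and the idea of keeping the pending config in RAM until commit are ours. Only the BOOT_F_FIRST_BOOT name and its value 0x1 come from the firmware itself.

```c
#include <assert.h>
#include <stdint.h>

#define SEQ_ERASED UINT64_MAX  /* erased NOR flash reads back as all 1s */
#define BOOT_F_FIRST_BOOT 0x1u /* flag name and value used by the firmware */

/* Hypothetical in-memory model of a boot config slot. */
struct boot_cfg {
  uint64_t seq;   /* sequencer: smaller means more recent */
  uint32_t flags;
  int image;      /* firmware image slot to boot */
};

static int active_cfg(const struct boot_cfg cfgs[2]) {
  if (cfgs[0].seq == SEQ_ERASED && cfgs[1].seq == SEQ_ERASED) return -1;
  return (cfgs[0].seq <= cfgs[1].seq) ? 0 : 1;
}

/* Boot-time step: pick the active config and, if it is a first boot, erase
 * it before running the image, so that any reboot falls back to the other
 * config. Returns the image to boot; *pending keeps a copy for the commit. */
int boot_select(struct boot_cfg cfgs[2], struct boot_cfg *pending) {
  int idx = active_cfg(cfgs);
  assert(idx >= 0); /* at least one valid config must exist */
  *pending = cfgs[idx];
  if (cfgs[idx].flags & BOOT_F_FIRST_BOOT) {
    cfgs[idx].seq = SEQ_ERASED;
    cfgs[idx].flags = UINT32_MAX;
  }
  return pending->image;
}

/* Commit: restore the pending config with the first-boot flag cleared. */
void boot_commit(struct boot_cfg cfgs[2], const struct boot_cfg *pending) {
  if (!(pending->flags & BOOT_F_FIRST_BOOT)) return; /* nothing to commit */
  int idx = (active_cfg(cfgs) == 0) ? 1 : 0;         /* the slot we erased */
  cfgs[idx] = *pending;
  cfgs[idx].flags &= ~BOOT_F_FIRST_BOOT;
}
```

If the new image crashes before boot_commit is called, the next boot_select sees only the old config and boots the old image. If the commit happens, the new config wins every subsequent boot.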

However, what if the firmware initializes successfully but fails to function at a higher level – e.g. fails to establish a network connection or perform some similar higher-level function? The point is, it may not always be possible for the firmware itself to tell whether it is OK, and an external confirmation is desirable. Mongoose IoT firmware supports this by way of a commit timeout: a kind of watchdog timer that automatically reverts an update if it is not explicitly committed within a certain time. A commit can be done in multiple ways: by explicit invocation of an API function from user code, by an external process sending an HTTP request to the device’s /update/commit URI, or by having the firmware poll a commit URL after the update – a successful response tells the device to commit the firmware. We found this very useful when recovering devices from bad updates – so convenient, in fact, that we can do a significant part of our development on devices deployed in the field, via OTA with delayed commits.
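A commit timeout boils down to a small state machine. The sketch below is a simplified model in C, not the firmware's implementation: the type and function names are hypothetical, the timer is assumed to tick once per second, and a timeout of 0 is assumed to mean that no delayed commit was requested.

```c
#include <assert.h>
#include <stdbool.h>

enum update_state { UPD_IDLE, UPD_PENDING, UPD_COMMITTED, UPD_REVERTED };

/* Hypothetical commit watchdog state. */
struct commit_watchdog {
  enum update_state state;
  int seconds_left;
};

/* Called after the first successful boot of a new image. */
void watchdog_arm(struct commit_watchdog *wd, int commit_timeout) {
  assert(commit_timeout >= 0);
  if (commit_timeout == 0) {
    wd->state = UPD_COMMITTED; /* no delayed commit: commit right away */
    wd->seconds_left = 0;
  } else {
    wd->state = UPD_PENDING;
    wd->seconds_left = commit_timeout;
  }
}

/* Explicit commit, e.g. from user code or from an HTTP handler.
 * Returns true if a pending update was committed. */
bool watchdog_commit(struct commit_watchdog *wd) {
  if (wd->state != UPD_PENDING) return false;
  wd->state = UPD_COMMITTED;
  return true;
}

/* Called once per second. Returns true when the update must be reverted. */
bool watchdog_tick(struct commit_watchdog *wd) {
  if (wd->state != UPD_PENDING) return false;
  if (--wd->seconds_left > 0) return false;
  wd->state = UPD_REVERTED;
  return true;
}
```

With a 30 second timeout, the 30th tick without a commit triggers the revert. A commit at any earlier point stops the countdown for good.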

And now, let’s consider update delivery. There are multiple approaches, of course. We chose and implemented two in Mongoose Firmware: push-based delivery via HTTP POST, and poll-based delivery, where the device checks a specific URL for updates at regular intervals. The former offers the lowest latency and is best suited to cases where the device is directly accessible (e.g. during development); the latter is best suited for production, when a fleet of devices needs to be updated. In the latter case, the server responding to update requests can be configured to perform targeted updates and staged rollouts.

Let me show you how this works in practice. We will use our “Hello, world!” example.

So, let’s build and run it to establish the baseline (you will need to register on the Mongoose Cloud to get your own username and password). Note: in the following examples, for simplicity, we are using HTTP. A real setup should use HTTPS with proper certificate validation (which is supported).

rojer@nbt:~/cesanta/mongoose-iot/fw/examples/c_hello master$ miot build --arch cc3200

Connecting to http://cloud.mongoose-iot.com, user cesanta

Uploading sources (1734 bytes)

Success, built c_hello/cc3200 version 1.0 (20161114-113755/???).

Firmware saved to build/fw.zip

rojer@nbt:~/cesanta/mongoose-iot/fw/examples/c_hello master$ miot flash && miot console 

Loaded c_hello/cc3200 version 1.0 (20161114-113755/???)

Opening /dev/ttyUSB1…

Connecting to boot loader..

Main boot loader v2.1.4.0

Booting firmware…

All done!

S0c_hello.bin.0@20000000+138973+.2001ec45 (1)

cc3200_init         c_hello 1.0 (20161114-113755/???)

cc3200_init         Mongoose IoT Firmware 2016111411 (20161114-113755/master@e2a4f704)

cc3200_init         RAM: 122260 total, 109044 free

start_nwp           NWP v2.7.0.0 started, host driver v1.0.1.6

cc3200_init         Boot cfg 0: 0xfffffffffffffffe, 0x0, c_hello.bin.0 @ 0x20000000, spiffs.img.0 (2)

fs_mount_idx         Mounting spiffs.img.0.0 0xfffffffffffffffe

mg_sys_config_init   MAC: F4B85E49A7B3

mg_sys_config_init   WDT: 15 seconds

clubby_channel_uart 20025edc UART0

mg_wifi_setup_ap     AP Mongoose_49A7B3 configured

mg_sys_config_init_http HTTP server started on [80]

Hello, world! (3)

Hey, a file!

cc3200_init         Init done, RAM: 104664 free

mg_wifi_on_change_cb WiFi: ready, IP 192.168.4.1

blink_timer_cb       Tick

blink_timer_cb       Tock

A few key things in the log above:

  1. The output of the boot loader. It’s terse, but if boot fails, it is possible to tell at what stage.
  2. The contents of the boot config used to boot this firmware: sequencer, flags, image name, load address and SPIFFS container image name (not covered here, see our blog post on the subject).
  3. The output of our example’s mg_app_init function. After that you see the output of the timer callback, which should be accompanied by blinking of the red LED (on the LAUNCHXL board).

By default, the board starts up in the AP mode. To make our life easier, let’s instead make it join a WiFi network:

$ miot config-set wifi.ap.enable=false wifi.sta.enable=true wifi.sta.ssid=Cesanta  wifi.sta.pass=***  && miot console

Getting configuration…

Setting new configuration…

mg_wifi_connect     Connecting to Cesanta

mg_wifi_on_change_cb WiFi: ready, IP 192.168.1.33

blink_timer_cb       Tock

blink_timer_cb       Tick

The device rebooted and is now connected to the network. Now let’s build a new firmware and push an update. Keeping the console attached, switch to a different window. Make a change to src/main.c to print something distinctive on the console. I added a simple counter and modified the statements in the timer callback to print it. Then update the version in miot.yml to 1.1 and build.

$ miot build --arch cc3200

Connecting to http://cloud.mongoose-iot.com, user cesanta

Uploading sources (1756 bytes)

Success, built c_hello/cc3200 version 1.1  (20161114-135843/???).

Firmware saved to build/fw.zip

So, there we have our v1.1. Now instead of flashing directly, let’s perform an update.

The configuration page at http://192.168.1.33/ has a firmware update form at the bottom. Select build/fw.zip and press “upload”.

When you press the button, you should see the following output on the console:

mongoose_ev_handler 20025a1c HTTP connection from 192.168.1.100:44532

updater_context_create Starting update (timeout 300)

Starting update (timeout 300)

parse_manifest       FW: c_hello cc3200 1.0 20161114-135644/??? -> 1.1 20161114-135843/??? (1)

handle_update_post   Rebooting device

mg_system_restart_after Rebooting in 101 ms

Rebooting in 101 ms

fs_umount           Unmounting spiffs.img.0.1

S1c_hello.bin.1@20000000+138988+.2001ec4d

cc3200_init         c_hello 1.1 (20161114-135843/???) (2)

cc3200_init         Mongoose IoT Firmware 2016111413 (20161114-135843/master@e2a4f704)

cc3200_init         RAM: 122244 total, 109028 free

start_nwp            NWP v2.7.0.0 started, host driver v1.0.1.6

cc3200_init         Boot cfg 1: 0xfffffffffffffffd, 0x3, c_hello.bin.1 @ 0x20000000, spiffs.img.1 (3)

fs_mount_idx         Mounting spiffs.img.1.0 0xfffffffffffffffe

fs_delete_container Deleting spiffs.img.0.0

cc3200_init         Applying update

fs_mount_idx         Mounting spiffs.img.0.1 0xfffffffffffffffd

file_copy           Copying conf.json (4)

fs_umount           Unmounting spiffs.img.0.1

mg_sys_config_init   MAC: F4B85E49A7B3

mg_sys_config_init WDT: 15 seconds

clubby_channel_uart 20025f34 UART0

mg_wifi_connect     Connecting to Cesanta

mg_sys_config_init_http HTTP server started on [80]

Hello, world!

Hey, a file!

cc3200_init         Init done, RAM: 104568 free

mg_upd_boot_commit   Committed cfg 1, seq 0xfffffffffffffffd (5)

mg_wifi_on_change_cb Wifi: connected

mg_wifi_on_change_cb WiFi: ready, IP 192.168.1.33

blink_timer_cb       Tick 0 (6)

blink_timer_cb       Tock 1

  1. This tells us the current version and the version of the firmware being applied.
  2. Here we see the new firmware booting.
  3. The active boot config is 1, the image is 1 and the sequencer is 0xf…fd. The flags value is 0x3, which includes BOOT_F_FIRST_BOOT (0x1).
  4. Here the existing configuration file is copied into the new SPIFFS file system.
  5. Since a delayed commit was not requested, the update is committed immediately after successful initialization.
  6. Here we see our updated code – every timer iteration is accompanied by a counter.
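If you want to check the boot config from a captured console log rather than by eye, a few lines of C will do. This is a sketch of our own, not part of the firmware or the miot tool: the function and struct names are hypothetical, and the parser expects only the message part of the line (after the function-name prefix), in the format shown in the logs above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BOOT_F_FIRST_BOOT 0x1u /* flag value as given in the notes above */

/* Hypothetical holder for the fields we care about. */
struct boot_cfg_log {
  int idx;        /* boot config slot, 0 or 1 */
  uint64_t seq;   /* sequencer */
  uint32_t flags; /* boot flags */
};

/* Parses a message such as
 *   "Boot cfg 1: 0xfffffffffffffffd, 0x3, c_hello.bin.1 @ 0x20000000, ..."
 * Returns true if the slot index, sequencer and flags were all read. */
bool parse_boot_cfg_log(const char *line, struct boot_cfg_log *out) {
  assert(line != NULL && out != NULL);
  return sscanf(line, "Boot cfg %d: %" SCNx64 ", %" SCNx32,
                &out->idx, &out->seq, &out->flags) == 3;
}

/* True if the config was written for an unconfirmed first boot. */
bool is_first_boot(const struct boot_cfg_log *cfg) {
  return (cfg->flags & BOOT_F_FIRST_BOOT) != 0;
}
```

Fed the line marked (3) above, this yields slot 1, sequencer 0xfffffffffffffffd and flags 0x3, with the first-boot bit set. The boot config line from the initial flash (flags 0x0) has it clear.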

Now, let’s try to simulate a bad update by introducing a bug during initialization. Use your favorite way to trigger a crash – for example, by adding something like this to the mg_app_init function, right after printing “Hello, world!”: *((uint32_t *) 0xF00DBEEF) = 1;

(Fun fact: a NULL dereference is not an error on the CC3200, because address 0x0 is mapped to internal ROM.)

Bump the version, build, upload and observe the following:

cc3200_init         c_hello 1.2 (20161114-140409/???)

mg_wifi_connect     Connecting to Cesanta

mg_sys_config_init_http HTTP server started on [80]

Hello, world!

— Bus Fault —

SHCTL=0x00070002, FSTAT=0x00000400, HFSTAT=0x00000000, FADDR=e000ed38

SF @ 0x20024a20:

   R0=0x00000040 R1=0x00000002 R2=0x00000000 R3=0x200247dc R12=0x0000004f

   LR=0x20015509 PC=0x200172d0 xPSR=0x21000000

S1c_hello.bin.1@20000000+138988+.2001ec4d

cc3200_init         c_hello 1.1 (20161114-135843/???)

Hello, world!

Hey, a file!

cc3200_init         Init done, RAM: 104568 free

mg_wifi_on_change_cb WiFi: ready, IP 192.168.1.33

blink_timer_cb       Tick 0

blink_timer_cb       Tock 1

Here you can see the first boot of v1.2: an exception during mg_app_init, a reboot of the device by the exception handler and a subsequent successful boot of v1.1 from slot 1. If the device is rebooted for any reason before the update is committed, the update is rolled back on the next boot. The cause could be an exception like in our example, the WDT (you may have noticed it is set to 15 seconds) or even a manual reset.

Now let’s see how delayed commit works. There are no controls for it on the configuration page, but the form takes an additional parameter: commit_timeout, expressed in seconds.

Let’s prepare v1.3 by making some minor change and building it, then use the curl utility to perform a form submission:

$ curl -F commit_timeout=30 -F file=@build/fw.zip http://192.168.1.33/update

Update applied, finalizing

Now let’s see what happens on the console:

cc3200_init         c_hello 1.3 (20161114-140752/???)

cc3200_init         Init done, RAM: 104560 free

mg_upd_boot_finish   Update state: 0 30

mg_upd_boot_finish   Arming commit watchdog for 30 seconds

blink_timer_cb       Ticketty 0

blink_timer_cb       Tockitty 1

blink_timer_cb       Tockitty 27

blink_timer_cb       Ticketty 28

mg_upd_watchdog_cb   Update commit timeout expired

mg_upd_revert       Reverting update

mg_upd_boot_revert   Config 0 is bad, reverting

fs_umount           Unmounting spiffs.img.0.0

S1c_hello.bin.1@20000000+138988+.2001ec4d

cc3200_init         c_hello 1.1 (20161114-135843/???)

So, as you can see, an otherwise perfectly functional firmware is nevertheless rolled back after 30 seconds. Why wasn’t it committed? Nobody knows, but better safe than sorry, right? Again – a reboot for any other reason during this time will also lead to an automatic rollback.

So how does one perform a commit? The simplest way is by fetching the /update/commit URI. Let’s try pushing 1.3 again, but this time make it stick:

$ curl -F commit_timeout=30 -F file=@build/fw.zip http://192.168.1.33/update

Update applied, finalizing

And after a while, when the firmware boots and connects to WiFi:

$ curl http://192.168.1.33/update/commit

Ok

This is accompanied by the following messages in the log:

blink_timer_cb       Tockitty 9

mongoose_ev_handler 20026804 HTTP connection from 192.168.1.100:44640

mg_upd_commit       Committing update

mg_upd_boot_commit   Committed cfg 0, seq 0xfffffffffffffffc

From then on, v1.3 will be booted by default (try pressing reset).

Finally, let me show you how poll-based updates work – after all, in the field you won’t be POSTing updates to each and every device. Instead, you’ll want to have them periodically contact an update server and ask for updates. Mongoose Firmware has you covered there.

First, let’s set up a web server that will serve our updates. Need a web server to run on your desktop? Hey, I happen to know one… But it’s up to you.

rojer@nbt:~/cesanta/mongoose-iot/fw/examples/c_hello/build master$ mongoose

Mongoose web server v6.6 serving [/home/rojer/cesanta/mongoose-iot/fw/examples/c_hello/build] on port 8080

Confirm that the firmware can be fetched:

$ curl -O http://192.168.1.100:8080/fw.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  272k  100  272k    0     0  57.6M      0 --:--:-- --:--:-- --:--:-- 66.6M

Now let’s configure the firmware to check that URL for updates every 10 seconds (NB: if you have the console running, you’ll need to stop it for config-set to work):

blink_timer_cb       Tockitty 1177

blink_timer_cb       Ticketty 1178

^C

$ miot config-set --port /dev/ttyUSB1 update.url=http://192.168.1.100:8080/fw.zip update.interval=10

Getting configuration…

Setting new configuration…

rojer@nbt:~/cesanta/mongoose-iot/fw/examples/c_hello master$ miot console --port /dev/ttyUSB1

cc3200_init         c_hello 1.3 (20161114-140752/???)

mg_sys_config_init_http HTTP server started on [80]

mg_updater_http_init Updates from http://192.168.1.100:8080/fw.zip, every 10 seconds

Hello, world!

Hey, a file!

cc3200_init         Init done, RAM: 104216 free

blink_timer_cb       Ticketty 8

updater_context_create Starting update (timeout 300)

mg_updater_http_start Update URL: http://192.168.1.100:8080/fw.zip, ct: 0, isv? 1

parse_manifest       FW: c_hello cc3200 1.3 20161114-140752/??? -> 1.3 20161114-140752/???

updater_finish       Update finished: 1 Version is the same as current, mem free 95712
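The last two log lines show the updater declining an image that is identical to the one already running. Here is a hedged sketch in C of what such a check might look like. We are assuming that the decision is based on the version string and build ID from the manifest; the struct and function names are hypothetical and are not taken from the Mongoose sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical subset of the manifest fields seen in the parse_manifest log. */
struct fw_info {
  const char *version;  /* e.g. "1.3" */
  const char *build_id; /* e.g. "20161114-140752" */
};

/* Returns true if the offered firmware differs from the running one and
 * should therefore be downloaded and applied. */
bool should_apply_update(const struct fw_info *current,
                         const struct fw_info *offered) {
  assert(current != NULL && offered != NULL);
  bool same_version = strcmp(current->version, offered->version) == 0;
  bool same_build = strcmp(current->build_id, offered->build_id) == 0;
  return !(same_version && same_build);
}
```

With a check like this, polling every 10 seconds is cheap: the device only has to read the manifest to decide, and it applies an update only when the server starts offering a different build.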

4 thoughts on “Updating firmware reliably”

  1. “Good write-up of the problems faced with over-the-air updates. I've come up with similar solutions. I assume your config sequence numbers are big enough that they will never get to zero?”

  2. “yes, sequencer is 64 bit. even with 32 it'd take decades in a tight loop to exhaust, with 64 bits we're getting into “heat death of the universe” kind of time frame :)”

  3. “Hello Deomid, First of all – thanks for the article. It covers a lot of relevant update-related problems. As for your approach with two update areas – looks really robust, but not available in most of micro controller-based systems (if you count statis
