Updating firmware reliably
Many devices still treat firmware updates as exceptional events, to be done in exceptional circumstances and only by advanced users or qualified personnel. However, especially for connected devices, keeping up to date is becoming ever more important. Keeping a device secure requires being able to update the software running on it quickly and reliably, because vulnerabilities will inevitably be discovered.
But nobody likes bricked devices. Sure, a bricked device is perfectly secure, but that is hardly a consolation. Therefore, when performing firmware updates, especially ones delivered over the air and in the background, there is one overarching concern: reliability. At every step of the process there must be safety mechanisms that allow the device to be recovered with no or minimal intervention from the end user (a manual reboot is fine, an RMA for reflashing is not).
These are the principles we set out for ourselves when designing a firmware update mechanism for Mongoose Firmware. In this article we will talk about performing reliable firmware updates in general and consider the particular implementation used on the TI CC3200 (but the same applies to ESP8266).
TI CC3200 is an unusual device in that it does not have any on-chip flash memory. Code and data sections are loaded from an external SPI flash into SRAM and executed from there. The SPI flash chip is formatted to contain a rudimentary file system. The size of the flash chip can be up to 64 Mb, but the most popular size is 8 Mb – as seen on both the LAUNCHXL dev board and in the CC3200MOD, the module offered by TI.
At first glance, code not being executed from flash directly makes things easier – files on the SPI filesystem can be deleted and rewritten without interfering with code being executed, so code could be updated while it is running. This, however, is extremely unsafe – if the process were to fail for any reason (interrupted network connection or sudden loss of power), the device would not be able to boot at all, let alone roll back the update. Thus, the first order of business is creating a mechanism for maintaining an alternate image of the firmware, with the ability to boot from it.
The solution is a boot loader: a small piece of code that rarely, if ever, changes. Its role is to determine which of the firmware images to load and execute. This is usually specified by a configuration block, found at a predetermined location. Since CC3200 does not allow direct access to flash, this will need to be a separate file rather than a flash sector. The boot config specifies which image should be loaded – see figure 1.
Figure 1. (Source: Cesanta)
There is one more consideration, however: if the boot config is stored in only one location, it is susceptible to failure during updates, which are usually performed as a read-erase-write operation: a reboot after the erase and before the write is complete could render the device unbootable. The time between the two is short, but we set out to make our update process safe at all points, so we have to deal with it. We do this by using two config files with versioning, or sequencing. A sequencer is a monotonically decreasing number, so of the two files the one with the smaller sequencer is more recent – in figure 2, config 1 is selected as active because it has the smaller sequencer. When writing a new config file, we always use the currently inactive (older) slot, and it does not become the newer one until the write is complete – an erased config is older than any valid one because erased NOR flash is filled with all 1s.
Figure 2. (Source: Cesanta)
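The slot selection rule can be sketched in a few lines of C. This is a minimal illustration, not the actual Mongoose boot loader code: the struct layout and the function names boot_cfg_pick_active and boot_cfg_write_next are hypothetical, and the two config files are modelled as an in-memory array.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout; the real boot config carries more fields. */
#define SEQ_ERASED UINT64_MAX /* erased NOR flash reads back as all 1s */

struct boot_cfg {
  uint64_t seq;  /* sequencer: smaller value means more recent */
  int image_idx; /* which firmware image this config boots */
};

/* Returns the index (0 or 1) of the active config, or -1 if both are erased. */
int boot_cfg_pick_active(const struct boot_cfg cfgs[2]) {
  if (cfgs[0].seq == SEQ_ERASED && cfgs[1].seq == SEQ_ERASED) return -1;
  return (cfgs[0].seq <= cfgs[1].seq) ? 0 : 1;
}

/* Writes a new config into the inactive slot and returns the slot used. */
int boot_cfg_write_next(struct boot_cfg cfgs[2], int image_idx) {
  int active = boot_cfg_pick_active(cfgs);
  int target = (active == 0) ? 1 : 0; /* slot 0 when nothing is valid yet */
  uint64_t next_seq;
  assert(active < 0 || cfgs[active].seq > 0); /* sequencer must not wrap */
  next_seq = (active < 0) ? SEQ_ERASED - 1 : cfgs[active].seq - 1;
  /* Erase step: the target slot is now older than any valid config, so a
   * reboot at this point still selects the previous, good config. */
  cfgs[target].seq = SEQ_ERASED;
  cfgs[target].image_idx = image_idx;
  /* Write step: only now does the target slot become the newest. */
  cfgs[target].seq = next_seq;
  return target;
}
```

Starting from two erased slots, the first write lands in slot 0 with sequencer 0xfffffffffffffffe and the second in slot 1 with 0xfffffffffffffffd. At no point is the currently active slot touched.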
Next we consider rollbacks. What if the firmware is bad and does not work? For example, it may just reboot, or hang and be rebooted by the watchdog timer. Thus the first boot after an update is unconfirmed: the boot config of a newly flashed firmware image has the “first boot” flag set. Figure 2 illustrates the state after an update was first applied: it was downloaded and flashed to slot 1, and boot config 1 was written with the “first boot” flag set. At an early stage of the boot process, a boot config with “first boot” set is erased and the boot process continues. If at any point the system reboots, the bad config that led to the failed boot will no longer be there and a rollback will occur – the system will boot from the previous, good configuration (cfg 0 and image 0 in fig 2). If the boot is successful and the update is committed, the config is written back without the “first boot” flag, and from then on it will always be used as the active firmware.
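To make the rollback path concrete, here is a small C model of the “first boot” handling. It is a sketch under assumptions: the two config files are an in-memory array, boot_select and boot_commit are hypothetical names, and the only flag modelled is BOOT_F_FIRST_BOOT (0x1). Whether the erase is done by the boot loader or by early firmware init is left open here.

```c
#include <assert.h>
#include <stdint.h>

#define SEQ_ERASED UINT64_MAX   /* erased NOR flash reads back as all 1s */
#define BOOT_F_FIRST_BOOT 0x1u  /* unconfirmed first boot after an update */

/* Hypothetical layout; the real boot config carries more fields. */
struct boot_cfg {
  uint64_t seq;
  uint32_t flags;
  int image_idx;
};

static int pick_active(const struct boot_cfg c[2]) {
  if (c[0].seq == SEQ_ERASED && c[1].seq == SEQ_ERASED) return -1;
  return (c[0].seq <= c[1].seq) ? 0 : 1;
}

/* Returns the image index to boot, or -1 if nothing is bootable.
 * A first-boot config is copied to *pending (kept in RAM) and erased from
 * storage, so any reboot before commit falls back to the other config. */
int boot_select(struct boot_cfg c[2], struct boot_cfg *pending,
                int *has_pending) {
  int active = pick_active(c);
  int image;
  *has_pending = 0;
  if (active < 0) return -1;
  image = c[active].image_idx;
  if (c[active].flags & BOOT_F_FIRST_BOOT) {
    *pending = c[active];
    *has_pending = 1;
    c[active].seq = SEQ_ERASED;
    c[active].flags = 0;
  }
  return image;
}

/* Commit: write the pending config back, without the first-boot flag,
 * into the slot that is currently not active (the erased one). */
void boot_commit(struct boot_cfg c[2], const struct boot_cfg *pending) {
  int active = pick_active(c);
  int target = (active == 0) ? 1 : 0;
  assert(pending->flags & BOOT_F_FIRST_BOOT);
  c[target] = *pending;
  c[target].flags &= ~BOOT_F_FIRST_BOOT;
}
```

In this model a crash between boot_select and boot_commit needs no special handling: the next boot_select simply finds only the old config.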
However, what if the firmware initializes successfully, but fails to function on a higher level – e.g. fails to establish a network connection or perform some similar higher-level function? The point is, it may not always be possible for the firmware itself to tell if it’s OK or not, and an external confirmation is desirable. Mongoose IoT firmware supports this by way of a commit timeout: a kind of watchdog timer that will automatically revert an update if it is not explicitly committed within a certain time. A commit can be done in multiple ways: by explicit invocation of an API function from user code, by an external process sending an HTTP request to the device’s /update/commit URI, or by having the firmware poll a commit URL after the update – a successful response will tell the device to commit the firmware. We found this very useful when recovering devices from bad updates – so convenient, in fact, that it is possible for us to do a significant part of development on devices that are deployed in the field via OTA with delayed commits.
And now, let’s consider update delivery. There are multiple approaches, of course. We chose and implemented two in Mongoose Firmware: push-based update delivery via HTTP POST and poll-based delivery by having the device check a specific URL for updates at regular intervals. The former is best suited for lowest latency and when the device is directly accessible (e.g. during development); the latter is best suited for production, when a fleet of devices needs to be updated. In the latter case, the server responding to update requests can be configured to perform targeted updates and staged rollout.
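The commit timeout behaves like a small state machine, which the following C sketch models. The type and function names (commit_wdt, commit_wdt_arm and so on) are made up for this illustration, and the timer is driven by an explicit once-per-second tick rather than a real hardware or software timer.

```c
#include <assert.h>
#include <stdbool.h>

enum upd_state { UPD_IDLE, UPD_PENDING, UPD_COMMITTED, UPD_REVERTED };

struct commit_wdt {
  enum upd_state state;
  int seconds_left;
};

/* Arm after the first boot of an update. A timeout of 0 models the case
 * where no delayed commit was requested: commit right away. */
void commit_wdt_arm(struct commit_wdt *w, int timeout_s) {
  assert(timeout_s >= 0);
  w->seconds_left = timeout_s;
  w->state = (timeout_s == 0) ? UPD_COMMITTED : UPD_PENDING;
}

/* Call once per second. Returns true when the timeout has just expired
 * and the update must be reverted. */
bool commit_wdt_tick(struct commit_wdt *w) {
  if (w->state != UPD_PENDING) return false;
  if (--w->seconds_left > 0) return false;
  w->state = UPD_REVERTED;
  return true;
}

/* Explicit commit (API call, HTTP request, ...). Returns true if it was
 * accepted, false if there was nothing pending. */
bool commit_wdt_commit(struct commit_wdt *w) {
  if (w->state != UPD_PENDING) return false;
  w->state = UPD_COMMITTED;
  return true;
}
```

The important property is the default: doing nothing leads to a revert, so a firmware that cannot reach the outside world undoes itself.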
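As an illustration of staged rollout, one common approach (not necessarily what any particular update server does) is to hash a stable device identifier into one of 100 buckets and offer the update only to buckets below the current rollout percentage. The function names below are hypothetical, and using the MAC address as the device identifier is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a hash of a NUL-terminated string. */
static uint32_t fnv1a(const char *s) {
  uint32_t h = 2166136261u;
  while (*s != '\0') {
    h ^= (uint8_t) *s++;
    h *= 16777619u;
  }
  return h;
}

/* Returns true if the device should be offered the update when `percent`
 * (0..100) of the fleet is being targeted. The bucket depends only on the
 * device id, so raising the percentage never drops a device that was
 * already included. */
bool in_rollout(const char *device_id, unsigned percent) {
  assert(device_id != NULL && percent <= 100);
  return (fnv1a(device_id) % 100u) < percent;
}
```

With this scheme a server could start at 1%, watch for rollbacks, and then raise the percentage step by step.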
Let me show you how this works in practice. We will use our “Hello, world!” example.
So, let’s build and run it to establish the baseline (you will need to register on the Mongoose Cloud to get your own username and password). Note: in the following examples, for simplicity, we are using HTTP. A real setup should use HTTPS with proper certificate validation (which is supported).
rojer@nbt:~/cesanta/mongoose-iot/fw/examples/c_hello master$ miot build --arch cc3200
Connecting to http://cloud.mongoose-iot.com, user cesanta
Uploading sources (1734 bytes)
Success, built c_hello/cc3200 version 1.0 (20161114-113755/???).
Firmware saved to build/fw.zip
rojer@nbt:~/cesanta/mongoose-iot/fw/examples/c_hello master$ miot flash && miot console
Loaded c_hello/cc3200 version 1.0 (20161114-113755/???)
Opening /dev/ttyUSB1…
Connecting to boot loader..
Main boot loader v2.1.4.0
…
Booting firmware…
All done!
S0c_hello.bin.0@20000000+138973+.2001ec45 (1)
cc3200_init c_hello 1.0 (20161114-113755/???)
cc3200_init Mongoose IoT Firmware 2016111411 (20161114-113755/master@e2a4f704)
cc3200_init RAM: 122260 total, 109044 free
start_nwp NWP v2.7.0.0 started, host driver v1.0.1.6
cc3200_init Boot cfg 0: 0xfffffffffffffffe, 0x0, c_hello.bin.0 @ 0x20000000, spiffs.img.0 (2)
fs_mount_idx Mounting spiffs.img.0.0 0xfffffffffffffffe
mg_sys_config_init MAC: F4B85E49A7B3
mg_sys_config_init WDT: 15 seconds
clubby_channel_uart 20025edc UART0
mg_wifi_setup_ap AP Mongoose_49A7B3 configured
mg_sys_config_init_http HTTP server started on [80]
Hello, world! (3)
Hey, a file!
cc3200_init Init done, RAM: 104664 free
mg_wifi_on_change_cb WiFi: ready, IP 192.168.4.1
blink_timer_cb Tick
blink_timer_cb Tock
A few key things in the log above:
- (1) is the output of the boot loader. It’s terse, but if boot fails, it is possible to tell at what stage.
- (2) logs the contents of the boot config used to boot this firmware: sequencer, flags, image name, load address and SPIFFS container image name (not covered here, see our blog post on the subject).
- (3) is the output of our example’s mg_app_init function. After that you see the output of the timer callback, and it should be accompanied by blinking of the red LED (on the LAUNCHXL board).
By default, the board starts up in AP mode. To make our life easier, let’s instead make it join a WiFi network:
$ miot config-set wifi.ap.enable=false wifi.sta.enable=true wifi.sta.ssid=Cesanta wifi.sta.pass=*** && miot console
Getting configuration…
Setting new configuration…
…
mg_wifi_connect Connecting to Cesanta
…
mg_wifi_on_change_cb WiFi: ready, IP 192.168.1.33
blink_timer_cb Tock
blink_timer_cb Tick
The device rebooted and is now connected to the network. Now let’s build a new firmware and push an update. Keeping the console attached, switch to a different window. Make a change to src/main.c to print something distinctive on the console. I added a simple counter and modified the statements in the timer callback to print it. Then update the version in miot.yml to 1.1 and build.
$ miot build --arch cc3200
Connecting to http://cloud.mongoose-iot.com, user cesanta
Uploading sources (1756 bytes)
Success, built c_hello/cc3200 version 1.1 (20161114-135843/???).
Firmware saved to build/fw.zip
So, there we have our v1.1. Now instead of flashing directly, let’s perform an update.
The configuration page at http://192.168.1.33/ has a firmware update form at the bottom. Select build/fw.zip and press “upload”.
When you press the button, you should see the following output on the console:
mongoose_ev_handler 20025a1c HTTP connection from 192.168.1.100:44532
updater_context_create Starting update (timeout 300)
Starting update (timeout 300)
parse_manifest FW: c_hello cc3200 1.0 20161114-135644/??? -> 1.1 20161114-135843/??? (1)
…
handle_update_post Rebooting device
mg_system_restart_after Rebooting in 101 ms
Rebooting in 101 ms
fs_umount Unmounting spiffs.img.0.1
S1c_hello.bin.1@20000000+138988+.2001ec4d
cc3200_init c_hello 1.1 (20161114-135843/???) (2)
cc3200_init Mongoose IoT Firmware 2016111413 (20161114-135843/master@e2a4f704)
cc3200_init RAM: 122244 total, 109028 free
start_nwp NWP v2.7.0.0 started, host driver v1.0.1.6
cc3200_init Boot cfg 1: 0xfffffffffffffffd, 0x3, c_hello.bin.1 @ 0x20000000, spiffs.img.1 (3)
fs_mount_idx Mounting spiffs.img.1.0 0xfffffffffffffffe
fs_delete_container Deleting spiffs.img.0.0
cc3200_init Applying update
fs_mount_idx Mounting spiffs.img.0.1 0xfffffffffffffffd
file_copy Copying conf.json (4)
fs_umount Unmounting spiffs.img.0.1
mg_sys_config_init MAC: F4B85E49A7B3
mg_sys_config_init WDT: 15 seconds
clubby_channel_uart 20025f34 UART0
mg_wifi_connect Connecting to Cesanta
mg_sys_config_init_http HTTP server started on [80]
Hello, world!
Hey, a file!
cc3200_init Init done, RAM: 104568 free
mg_upd_boot_commit Committed cfg 1, seq 0xfffffffffffffffd (5)
mg_wifi_on_change_cb Wifi: connected
mg_wifi_on_change_cb WiFi: ready, IP 192.168.1.33
blink_timer_cb Tick 0 (6)
blink_timer_cb Tock 1
- (1) shows the current version and the version of the firmware being applied.
- (2) is where we see the new firmware booting.
- (3) shows that the active boot config is 1, the image is 1 and the sequencer is 0xf…fd. Also, BOOT_F_FIRST_BOOT (0x1) is set in the flags.
- (4) is where the existing configuration file is copied into the new SPIFFS file system.
- (5) shows that, since a delayed commit was not requested, the update is committed immediately after a successful boot.
- (6) is our updated code – every timer iteration is accompanied by a counter.
Now, let’s try to simulate a bad update. Let’s introduce a bug during initialization. Use your favorite way to trigger a crash – for example, by adding something like this to the mg_app_init function, right after printing “Hello, world!”: *((uint32_t *) 0xF00DBEEF) = 1;
(Fun fact: NULL dereference is not a bug on CC3200, 0x0 is mapped to internal ROM)
Bump the version, build, upload and observe the following:
cc3200_init c_hello 1.2 (20161114-140409/???)
…
mg_wifi_connect Connecting to Cesanta
mg_sys_config_init_http HTTP server started on [80]
Hello, world!
--- Bus Fault ---
SHCTL=0x00070002, FSTAT=0x00000400, HFSTAT=0x00000000, FADDR=e000ed38
SF @ 0x20024a20:
R0=0x00000040 R1=0x00000002 R2=0x00000000 R3=0x200247dc R12=0x0000004f
LR=0x20015509 PC=0x200172d0 xPSR=0x21000000
---
S1c_hello.bin.1@20000000+138988+.2001ec4d
cc3200_init c_hello 1.1 (20161114-135843/???)
…
Hello, world!
Hey, a file!
cc3200_init Init done, RAM: 104568 free
mg_wifi_on_change_cb WiFi: ready, IP 192.168.1.33
blink_timer_cb Tick 0
blink_timer_cb Tock 1
Here you can see the first boot of v1.2: an exception during mg_app_init, a reboot of the device by the exception handler and the subsequent successful boot of v1.1 from slot 1. If the device is rebooted for any reason before the update is committed, the update will be rolled back on the next boot. The cause could be an exception like in our example, the WDT (you may have noticed it is set to 15 seconds) or even a manual reset.
Now let’s see how delayed commit works. There are no controls for it on the configuration page, but the form takes an additional parameter: commit_timeout, expressed in seconds.
Let’s prepare v1.3 by making some minor change, build it and then use the curl utility to perform a form submission:
$ curl -F commit_timeout=30 -F file=@build/fw.zip http://192.168.1.33/update
Update applied, finalizing
Now let’s see what happens on the console:
cc3200_init c_hello 1.3 (20161114-140752/???)
…
cc3200_init Init done, RAM: 104560 free
mg_upd_boot_finish Update state: 0 30
mg_upd_boot_finish Arming commit watchdog for 30 seconds
blink_timer_cb Ticketty 0
blink_timer_cb Tockitty 1
…
blink_timer_cb Tockitty 27
blink_timer_cb Ticketty 28
mg_upd_watchdog_cb Update commit timeout expired
mg_upd_revert Reverting update
mg_upd_boot_revert Config 0 is bad, reverting
fs_umount Unmounting spiffs.img.0.0
S1c_hello.bin.1@20000000+138988+.2001ec4d
cc3200_init c_hello 1.1 (20161114-135843/???)
…
So, as you can see, an otherwise perfectly functional firmware is nevertheless rolled back after 30 seconds. Why wasn’t it committed? Nobody knows, but better safe than sorry, right? Again – reboot for any other reason during this time will also lead to automatic rollback.
So how does one perform a commit? The simplest way is by fetching the /update/commit URI. Let’s try pushing 1.3 again, but this time make it stick:
$ curl -F commit_timeout=30 -F file=@build/fw.zip http://192.168.1.33/update
Update applied, finalizing
And after a while, once the firmware has booted and connected to WiFi:
$ curl http://192.168.1.33/update/commit
Ok
This is accompanied by the following messages in the log:
blink_timer_cb Tockitty 9
mongoose_ev_handler 20026804 HTTP connection from 192.168.1.100:44640
mg_upd_commit Committing update
mg_upd_boot_commit Committed cfg 0, seq 0xfffffffffffffffc
From then on, v1.3 will be booted by default (try pressing reset).
Finally, let me show you how poll-based updates work – after all, in the field you won’t be POSTing updates to each and every device. Instead, you’ll want to have them periodically contact an update server and ask for updates. Mongoose Firmware has you covered there.
First, let’s set up a web server that will serve our updates. Need a web server to run on your desktop? Hey, I happen to know one… But it’s up to you.
rojer@nbt:~/cesanta/mongoose-iot/fw/examples/c_hello/build master$ mongoose
Mongoose web server v6.6 serving [/home/rojer/cesanta/mongoose-iot/fw/examples/c_hello/build] on port 8080
Confirm that fw can be fetched:
$ curl -O http://192.168.1.100:8080/fw.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 272k 100 272k 0 0 57.6M 0 --:--:-- --:--:-- --:--:-- 66.6M
Now let’s configure the firmware to check that URL for updates every 10 seconds (NB: if you have the console running, you’ll need to stop it for config-set to work):
blink_timer_cb Tockitty 1177
blink_timer_cb Ticketty 1178
^C
$ miot config-set --port /dev/ttyUSB1 update.url=http://192.168.1.100:8080/fw.zip update.interval=10
Getting configuration…
Setting new configuration…
rojer@nbt:~/cesanta/mongoose-iot/fw/examples/c_hello master$ miot console --port /dev/ttyUSB1
…
cc3200_init c_hello 1.3 (20161114-140752/???)
…
mg_sys_config_init_http HTTP server started on [80]
mg_updater_http_init Updates from http://192.168.1.100:8080/fw.zip, every 10 seconds
Hello, world!
Hey, a file!
cc3200_init Init done, RAM: 104216 free
…
blink_timer_cb Ticketty 8
updater_context_create Starting update (timeout 300)
mg_updater_http_start Update URL: http://192.168.1.100:8080/fw.zip, ct: 0, isv? 1
parse_manifest FW: c_hello cc3200 1.3 20161114-140752/??? -> 1.3 20161114-140752/???
updater_finish Update finished: 1 Version is the same as current, mem free 95712
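In the last two log lines the updater parses the manifest, finds that the offered firmware is the same as the one already running and finishes without applying anything. A check of that kind might look like the sketch below. The struct, the field names and the choice to compare both the version and the build ID are assumptions made for illustration, not the actual Mongoose Firmware logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of what a firmware manifest describes. */
struct fw_info {
  const char *version;  /* e.g. "1.3" */
  const char *build_id; /* e.g. a build timestamp */
};

/* Returns true if the offered firmware differs from the running one and
 * should therefore be downloaded and applied. */
bool fw_should_apply(const struct fw_info *cur, const struct fw_info *offered) {
  assert(cur != NULL && offered != NULL);
  return strcmp(cur->version, offered->version) != 0 ||
         strcmp(cur->build_id, offered->build_id) != 0;
}
```

A check like this is what makes frequent polling cheap: when the server still has the same firmware, the device does no further work.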
“Good write-up of the problems faced with over-the-air updates. I've come up with similar solutions. I assume your config sequence numbers are big enough that they will never get to zero?”
“yes, sequencer is 64 bit. even with 32 it'd take decades in a tight loop to exhaust, with 64 bits we're getting into “heat death of the universe” kind of time frame :)”
“Agree with you peter…”
“Hello Deomid, First of all – thanks for the article. It covers a lot of relevant update-related problems. As for your approach with two update areas – looks really robust, but not available in most of micro controller-based systems (if you count statis