Chapter 291: OTA Failure Recovery Strategies
Chapter Objectives
By the end of this chapter, you will be able to:
- Understand the common causes of Over-the-Air (OTA) update failures.
- Explain the role of the partition table, specifically the ota_0, ota_1, and ota_data partitions, in enabling OTA updates and recovery.
- Describe the boot process flow in an OTA-enabled ESP32 application.
- Implement a robust application self-test to validate a new firmware image.
- Use ESP-IDF APIs to programmatically confirm a successful update or trigger a rollback.
- Manually revert to a previous firmware version.
- Identify and troubleshoot common issues related to OTA rollback mechanisms.
Introduction
Over-the-Air (OTA) updates are a cornerstone of modern IoT device management, allowing for the deployment of new features, bug fixes, and security patches without physical intervention. While incredibly powerful, the OTA process itself introduces a critical point of potential failure. A power outage during an update, a corrupted firmware binary, or a critical bug in the new code could leave a device in an unresponsive or “bricked” state, making it impossible to manage or update further.
For devices deployed in the field, whether on a factory floor, in a consumer’s home, or atop a remote mountain, such a failure is unacceptable. This chapter delves into the robust failure recovery mechanisms provided by the ESP-IDF. We will explore the theory behind the ESP32’s boot and partition system that makes recovery possible and then implement practical, application-level strategies to ensure that your devices can reliably recover from a faulty update, maintaining operational integrity and saving you from costly field service.
Theory
The foundation of the ESP32’s OTA failure recovery is its flexible partitioning system and a bootloader designed with OTA in mind.
The OTA Partition Scheme
To perform an OTA update, the device’s flash memory must be laid out according to a specific scheme. Instead of a single application partition, we use at least two, conventionally named ota_0 and ota_1.
- ota_0, ota_1 (App Partitions): These partitions are of identical size and are where the application firmware is stored. At any given time, one is the “active” or “boot” partition (running the current code), while the other is the “inactive” or “update” partition (the target for the next OTA update).
- ota_data (OTA Data Partition): This is a small (typically 8KB) but crucial partition that acts as a control sector for the bootloader. It stores state information, most importantly indicating which of the app partitions (ota_0 or ota_1) the bootloader should load on startup. In the partition table CSV it has the subtype ota and is usually named otadata.
- factory (Optional): Many designs include a factory partition that holds the initial, golden-master firmware image. OTA updates never write to this partition, so it is not used in the standard OTA update cycle, but it can serve as a last-resort recovery option.
The OTA Update and Boot Process
Let’s walk through a successful OTA update and the subsequent boot process to understand how these partitions interact.
- Initial State: The device is running firmware from the ota_0 partition. The ota_data partition contains information pointing to ota_0 as the boot partition.
- Update Initiated: The application receives a command to update. It begins downloading the new firmware binary from a server.
- Writing to Inactive Partition: The OTA logic writes the new firmware, chunk by chunk, into the inactive partition (ota_1). The currently running application in ota_0 is not affected.
- Switching the Boot Target: Once the entire binary is written and verified (e.g., via a checksum), the application calls the function esp_ota_set_boot_partition(). This function does not immediately switch the firmware. Instead, it updates the ota_data partition to point to ota_1 as the target for the next boot.
- Reboot: The application triggers a system reboot via esp_restart().
- New Boot Sequence:
  - The ROM Bootloader executes.
  - It loads the second-stage Bootloader from flash.
  - The second-stage Bootloader reads the ota_data partition. It sees that ota_1 is now the designated boot partition.
  - It loads and executes the new application from ota_1.
graph TD
    subgraph "Current State"
        A[Start: App Running from ota_0]
    end
    subgraph "OTA Update Process"
        B{Receive OTA Command?}
        C[Download New Firmware]
        D[Write Firmware to ota_1]
        E[Verify Firmware Checksum]
        F[Update ota_data to boot from ota_1]
        G[Reboot System]
    end
    subgraph "Boot & Validation"
        H[Bootloader reads ota_data]
        I[Bootloader loads App from ota_1]
        J{"Is App in<br><b>PENDING_VERIFY</b> state?"}
        K[Rollback stays armed<br>until App is marked valid]
        L[Run App Self-Test]
    end
    subgraph "Outcome"
        M{Self-Test Passed?}
        N[Mark App as <b>VALID</b><br>Update is now permanent]
        O[Self-Test Failed or<br>Reset Before Validation]
        P[Mark App as <b>INVALID</b><br>and Reboot]
        Q[Bootloader reads ota_data,<br>sees invalid/pending state]
        R[Rollback: Boot from ota_0]
        S[Device is recovered<br>Running old firmware]
    end
    %% Styling
    classDef start fill:#D1FAE5,stroke:#059669,stroke-width:2px,color:#065F46
    classDef process fill:#DBEAFE,stroke:#2563EB,stroke-width:1px,color:#1E40AF
    classDef decision fill:#FEF3C7,stroke:#D97706,stroke-width:1px,color:#92400E
    classDef success fill:#D1FAE5,stroke:#059669,stroke-width:2px,color:#065F46
    classDef failure fill:#FEE2E2,stroke:#DC2626,stroke-width:1px,color:#991B1B
    class A,N,S start
    class C,D,E,F,G,H,I,K,L,P,Q,R process
    class B,J,M decision
    class O failure
    %% Connections
    A --> B
    B -- Yes --> C --> D --> E -- OK --> F --> G
    B -- No --> A
    G --> H --> I --> J
    J -- Yes --> K --> L --> M
    J -- No --> L
    M -- Yes --> N
    M -- No --> O --> P
    P --> Q
    Q --> R --> S
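The partition-selection step of the boot sequence can be pictured with a small host-side model. This is a simplified sketch, not the bootloader's source: the otadata_entry_t type, the crc_ok flag, and choose_boot_slot() are hypothetical names introduced for illustration. It assumes ota_data holds two records, each carrying a sequence number, and that the usable record with the highest sequence number decides which slot boots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for one ota_data record (hypothetical type; the real
 * record also stores an image state and a CRC that the bootloader verifies). */
typedef struct {
    uint32_t ota_seq;  /* update counter; OTADATA_ERASED models blank flash */
    bool     crc_ok;   /* stands in for the real CRC check */
} otadata_entry_t;

#define OTADATA_ERASED 0xFFFFFFFFu

/* Returns the OTA slot index to boot (0 -> ota_0, 1 -> ota_1, ...),
 * or -1 when neither record is usable and a fallback must be chosen. */
int choose_boot_slot(const otadata_entry_t entries[2], int ota_app_count)
{
    assert(ota_app_count > 0);
    int best = -1;
    for (int i = 0; i < 2; ++i) {
        bool usable = entries[i].crc_ok && entries[i].ota_seq != OTADATA_ERASED;
        if (usable && (best < 0 || entries[i].ota_seq > entries[best].ota_seq)) {
            best = i;
        }
    }
    if (best < 0) {
        return -1;
    }
    /* Sequence numbers start at 1, so seq 1 -> slot 0, seq 2 -> slot 1, ... */
    return (int)((entries[best].ota_seq - 1u) % (uint32_t)ota_app_count);
}
```

In this model, "switching the boot target" amounts to writing a record with a higher sequence number that maps to the other slot; nothing about the running application changes until the next reset.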
How Failure Recovery Works
The brilliance of this system is what happens when the new firmware in ota_1 is faulty.
Automatic Rollback (Bootloader-Driven)
When rollback support is enabled (CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE), the bootloader treats an application in a freshly updated partition as being on trial. On the first boot after the update, it marks the image as “pending verification” (ESP_OTA_IMG_PENDING_VERIFY) and starts it. The bootloader does not run a countdown of its own; the new application simply has until the next reset to signal its health. If the device resets first (e.g., from a watchdog timer reset or a crash), the bootloader will see that the “pending” app never validated itself, and it will automatically revert the ota_data partition to boot from the previous, known-good application (ota_0).
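The trial-boot behaviour described above can be sketched as a small state machine. The enum below only mirrors the names of the ESP-IDF image states (ESP_OTA_IMG_NEW, ESP_OTA_IMG_PENDING_VERIFY, and so on); img_state_t, boot_decision_t, and bootloader_step() are hypothetical names for a host-side model, not bootloader code.

```c
#include <assert.h>
#include <stdbool.h>

/* Local model of the image states; names follow the ESP-IDF state names. */
typedef enum {
    IMG_NEW,             /* set when the update selects the new slot */
    IMG_PENDING_VERIFY,  /* the trial boot of the new image is in progress */
    IMG_VALID,           /* the app confirmed itself */
    IMG_INVALID,         /* the app rejected itself */
    IMG_ABORTED          /* a reset happened before the app confirmed itself */
} img_state_t;

typedef struct {
    img_state_t state;           /* state written back for the selected slot */
    bool        boot_this_slot;  /* false means: fall back to the previous slot */
} boot_decision_t;

/* What the modeled bootloader does with the selected slot at each reset. */
boot_decision_t bootloader_step(img_state_t state_in_otadata)
{
    assert(state_in_otadata >= IMG_NEW && state_in_otadata <= IMG_ABORTED);
    boot_decision_t d = { state_in_otadata, true };
    switch (state_in_otadata) {
    case IMG_NEW:
        d.state = IMG_PENDING_VERIFY;  /* grant exactly one trial boot */
        break;
    case IMG_PENDING_VERIFY:
        d.state = IMG_ABORTED;         /* trial boot ended without confirmation */
        d.boot_this_slot = false;
        break;
    case IMG_INVALID:
    case IMG_ABORTED:
        d.boot_this_slot = false;      /* never boot a rejected image */
        break;
    case IMG_VALID:
        break;                         /* confirmed image boots normally */
    }
    return d;
}
```

Feeding the result of one step into the next shows why an unconfirmed image survives only a single boot: NEW becomes PENDING_VERIFY on the first reset, and PENDING_VERIFY becomes ABORTED on the second.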
Application-Driven Validation and Rollback
Relying on the bootloader alone to catch a failed boot is good, but not perfect. A truly robust system requires the application to perform its own intelligent self-test. The new firmware should be responsible for validating its own integrity and its ability to perform its core functions.
flowchart TD
    A(Start: New Firmware Boots from ota_1)
    B{Check OTA Partition State}
    C{"State == <b>PENDING_VERIFY</b>?"}
    subgraph "Self-Validation Logic"
        D[Run Critical Self-Tests<br>- Connect to Wi-Fi<br>- Ping Backend Server<br>- Check Sensor/Actuator]
        E{All Tests Successful?}
    end
    subgraph "Decision & Action"
        F["Call <b>esp_ota_mark_app_valid_cancel_rollback()</b>"]
        G["Call <b>esp_ota_mark_app_invalid_rollback_and_reboot()</b>"]
    end
    H[Update is now permanent.<br>Continue normal operation.]
    I[Device reboots.<br>Bootloader sees INVALID state and rolls back to ota_0.]
    J[State is already <b>VALID</b>.<br>Skip test and run normally.]
    %% Styling
    classDef start fill:#EDE9FE,stroke:#5B21B6,stroke-width:2px,color:#5B21B6
    classDef process fill:#DBEAFE,stroke:#2563EB,stroke-width:1px,color:#1E40AF
    classDef decision fill:#FEF3C7,stroke:#D97706,stroke-width:1px,color:#92400E
    classDef check fill:#FEE2E2,stroke:#DC2626,stroke-width:1px,color:#991B1B
    classDef success fill:#D1FAE5,stroke:#059669,stroke-width:2px,color:#065F46
    class A start
    class B,D process
    class C,E decision
    class F,H,J success
    class G,I check
    %% Connections
    A --> B --> C
    C -- Yes --> D
    C -- No --> J
    D --> E
    E -- Yes --> F --> H
    E -- No --> G --> I
A common validation sequence is:
- Connect to Wi-Fi.
- Connect to the designated cloud backend (e.g., MQTT broker).
- Perform a key business logic function (e.g., read a sensor, control a relay).
Only after this checklist is complete should the application report its health.
- To signal success, the application must call esp_ota_mark_app_valid_cancel_rollback(). This function updates ota_data to mark the current partition as ESP_OTA_IMG_VALID, so the bootloader will no longer roll it back. The update is now considered permanent.
- To signal failure, if the application’s self-test fails, it should immediately call esp_ota_mark_app_invalid_rollback_and_reboot(). This function marks the current partition as ESP_OTA_IMG_INVALID and triggers an immediate reboot. The bootloader will see the invalid state and revert to the previous working firmware.
This application-driven approach is far superior because it can detect not just boot-loops, but also logical failures in the application that bootloader-driven rollback alone would miss.
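One way to organize such a checklist is a table of checks, each with a bounded number of attempts, that yields a single verdict. The sketch below is a host-side illustration under stated assumptions: self_check_t, run_self_checks(), and the three stub checks are hypothetical names, and real checks (Wi-Fi, backend, sensor) would replace the stubs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*self_check_fn)(void);

typedef struct {
    const char   *name;      /* label for logging in a real application */
    self_check_fn run;       /* returns true when the check passes */
    int           attempts;  /* tries allowed before the check counts as failed */
} self_check_t;

typedef enum { VERDICT_MARK_VALID, VERDICT_ROLL_BACK } verdict_t;

/* Runs every check in order; a check passes if any of its attempts passes.
 * The first check that exhausts its attempts decides for rollback. */
verdict_t run_self_checks(const self_check_t *checks, size_t count)
{
    assert(checks != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        bool passed = false;
        for (int attempt = 0; attempt < checks[i].attempts && !passed; ++attempt) {
            passed = checks[i].run();
        }
        if (!passed) {
            return VERDICT_ROLL_BACK;
        }
    }
    return VERDICT_MARK_VALID;
}

/* Hypothetical stand-ins for real checks. */
static int flaky_calls = 0;
bool check_always_ok(void)   { return true; }
bool check_always_bad(void)  { return false; }
bool check_ok_on_third(void) { return ++flaky_calls >= 3; }  /* simulates a flaky link */
```

In firmware, VERDICT_MARK_VALID would map to calling esp_ota_mark_app_valid_cancel_rollback() and VERDICT_ROLL_BACK to esp_ota_mark_app_invalid_rollback_and_reboot(). Bounding the attempts keeps a flaky network from causing a false rollback while still guaranteeing that the self-test finishes.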
Warning: A very common mistake is for developers to write a perfectly functional new application but forget to call esp_ota_mark_app_valid_cancel_rollback(). The device works until it is power-cycled or rebooted, at which point the bootloader, having never received the “all clear,” rolls the firmware back, leaving the developer confused.
Practical Example: Self-Testing OTA
Let’s build an application that demonstrates a robust, self-testing OTA recovery mechanism. We will simulate an OTA update where the “new” firmware must pass a simple test before being accepted. If it fails, it will automatically roll itself back.
1. Project Setup and Partition Table
First, ensure your project uses an OTA-capable partition table. Create a file named partitions.csv in your project’s root directory:
# Name, Type, SubType, Offset, Size, Flags
nvs, data, nvs, , 24K,
otadata, data, ota, , 8K,
phy_init, data, phy, , 4K,
ota_0, app, ota_0, , 1M,
ota_1, app, ota_1, , 1M,
storage, data, nvs, , ,
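ESP-IDF selects the partition table through the project configuration rather than through a CMake variable. In idf.py menuconfig, open Partition Table, choose Custom partition table CSV, and set the custom partition CSV file name to partitions.csv.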
Next, run idf.py menuconfig and ensure the following are set:
- Bootloader config -> [*] Enable app rollback support
- Bootloader config -> Enable app anti-rollback support (optional and not needed for this example; it relies on eFuse, so it is best left disabled while experimenting)
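The same settings can also be kept in an sdkconfig.defaults file so that a clean checkout builds with them. The fragment below is a sketch: the option names are ESP-IDF Kconfig symbols, while the 4 MB flash size line is an assumption added here because two 1 MB app partitions plus the data partitions do not fit in 2 MB of flash; adjust it to your module.

```
# sdkconfig.defaults (sketch)
CONFIG_PARTITION_TABLE_CUSTOM=y
CONFIG_PARTITION_TABLE_CUSTOM_FILENAME="partitions.csv"
CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE=y
# Assumption: the module has at least 4 MB of flash.
CONFIG_ESPTOOLPY_FLASHSIZE_4MB=y
```

Values in sdkconfig.defaults only take effect when sdkconfig is generated, so delete an existing sdkconfig (or change the options in menuconfig) after editing this file.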
2. The Application Code
We will create an application that, for the sake of this example, will have its version hardcoded. We’ll first flash v1.0, then simulate an OTA update to a “bad” v2.0 which will fail its self-test and roll back.
Main Application Logic (main/main.c)
This code includes a placeholder for an OTA update task and the crucial self-test logic.
#include <stdio.h>
#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_system.h"
#include "esp_log.h"
#include "esp_ota_ops.h"
#include "esp_app_format.h"
#include "nvs_flash.h"
#include "esp_wifi.h"
#include "esp_event.h"

static const char *TAG = "OTA_RECOVERY_EXAMPLE";

// In a real app, you would have a proper OTA task to download the binary.
// For this example, we assume the OTA has already happened and the device has rebooted.
// void ota_update_task(void *pvParameter);

// --- SIMULATED SELF-TEST ---
// This function simulates a critical self-test the new firmware must pass.
// In a real application, this would check for WiFi, server connectivity, etc.
// Here, we'll use a Kconfig option to simulate success or failure.
static bool run_application_self_test(void) {
    ESP_LOGI(TAG, "Running application self-test...");
#if CONFIG_EXAMPLE_FIRMWARE_FAILS_SELF_TEST
    ESP_LOGE(TAG, "Self-test FAILED! This firmware is bad.");
    vTaskDelay(pdMS_TO_TICKS(2000));
    return false;
#else
    ESP_LOGI(TAG, "Self-test PASSED! This firmware is good.");
    vTaskDelay(pdMS_TO_TICKS(2000));
    return true;
#endif
}

// --- MAIN LOGIC ---
void app_main(void) {
    ESP_LOGI(TAG, "Starting OTA Recovery Example App...");

    // Initialize NVS
    esp_err_t err = nvs_flash_init();
    if (err == ESP_ERR_NVS_NO_FREE_PAGES || err == ESP_ERR_NVS_NEW_VERSION_FOUND) {
        ESP_ERROR_CHECK(nvs_flash_erase());
        err = nvs_flash_init();
    }
    ESP_ERROR_CHECK(err);

    // Get OTA partition info.
    // esp_ota_get_partition_description() reads the app descriptor of a given partition.
    const esp_partition_t *running_partition = esp_ota_get_running_partition();
    esp_app_desc_t running_app_info;
    ESP_ERROR_CHECK(esp_ota_get_partition_description(running_partition, &running_app_info));
    ESP_LOGI(TAG, "Running firmware version: %s", running_app_info.version);

    // Check the state of the currently running partition.
    esp_ota_img_states_t ota_state;
    err = esp_ota_get_state_partition(running_partition, &ota_state);
    if (err == ESP_OK) {
        if (ota_state == ESP_OTA_IMG_PENDING_VERIFY) {
            ESP_LOGW(TAG, "This is a new firmware image, pending verification.");
            ESP_LOGW(TAG, "Running self-test to validate...");
            if (run_application_self_test()) {
                ESP_LOGI(TAG, "Self-test successful. Marking app as valid.");
                esp_ota_mark_app_valid_cancel_rollback();
            } else {
                ESP_LOGE(TAG, "Self-test failed. Triggering rollback!");
                // Does not return on success: the device reboots into the previous app.
                esp_ota_mark_app_invalid_rollback_and_reboot();
            }
        } else if (ota_state == ESP_OTA_IMG_VALID) {
            ESP_LOGI(TAG, "This firmware is already marked as valid.");
        } else {
            // Any other state (e.g., ESP_OTA_IMG_UNDEFINED) ends up here.
            ESP_LOGE(TAG, "Firmware is in an invalid state!");
            // You could trigger a rollback here as a safety measure too.
            esp_ota_mark_app_invalid_rollback_and_reboot();
        }
    } else {
        ESP_LOGE(TAG, "Failed to get OTA state, error: %s", esp_err_to_name(err));
        // Critical error, maybe try to roll back anyway.
    }

    // Main application loop can go here.
    // For the example, we just print the version every 10 seconds.
    while (1) {
        ESP_LOGI(TAG, "Application loop running, version: %s", running_app_info.version);
        vTaskDelay(pdMS_TO_TICKS(10000));
    }
}
3. Kconfig for Simulating Failure
To easily switch between a “good” and “bad” firmware for our test, we add a custom option in main/Kconfig.projbuild.
menu "Example Configuration"

    config EXAMPLE_FIRMWARE_FAILS_SELF_TEST
        bool "Simulate a failed self-test"
        default n
        help
            If selected, the application's self-test function will intentionally fail,
            triggering an OTA rollback if the app is in a PENDING_VERIFY state.
            Unselect this for the initial "good" firmware.

endmenu
4. Build and Run Instructions
Follow these steps precisely to observe the rollback mechanism.
Step 1: Build and Flash the Initial “Good” Firmware (v1.0)
- Ensure the Simulate a failed self-test option is disabled (n) in menuconfig.
- Open your project’s CMakeLists.txt and set the version above the project() line: set(PROJECT_VER "1.0")
- Build and flash the project: idf.py build flash monitor
- Observe the output. You should see:
I (OTA_RECOVERY_EXAMPLE): Starting OTA Recovery Example App...
I (OTA_RECOVERY_EXAMPLE): Running firmware version: 1.0
I (OTA_RECOVERY_EXAMPLE): This firmware is already marked as valid.
I (OTA_RECOVERY_EXAMPLE): Application loop running, version: 1.0
If the otadata partition is still blank after a serial flash, esp_ota_get_state_partition() may instead return an error such as ESP_ERR_NOT_FOUND, in which case the “Failed to get OTA state” message appears in place of the “already marked as valid” line.
This is now our stable, known-good firmware running from ota_0.
Step 2: Build the “Bad” Firmware (v2.0)
- In menuconfig, go to Example Configuration and enable (y) the [*] Simulate a failed self-test option. Save and exit.
- In CMakeLists.txt, update the version: set(PROJECT_VER "2.0-bad")
- Only build the project, do not flash it yet: idf.py build
- This creates the ota_recovery_example.bin file in your build directory. This is the “bad” firmware we will serve for the OTA update.
Step 3: Simulate the OTA Update and Observe Rollback
For a real OTA, you’d use a cloud service. Here, we can use the esp_https_ota component from ESP-IDF and a simple local Python server. For simplicity, let’s just describe the expected outcome as if an OTA process has just completed and the device is rebooting.
Imagine your device has just downloaded the v2.0-bad firmware and rebooted.
- The device starts up. The bootloader reads ota_data and boots from the newly updated partition (e.g., ota_1).
- You will see the following in the monitor:
I (312) boot: Loaded app from partition at offset 0x110000
...
I (OTA_RECOVERY_EXAMPLE): Starting OTA Recovery Example App...
I (OTA_RECOVERY_EXAMPLE): Running firmware version: 2.0-bad
W (OTA_RECOVERY_EXAMPLE): This is a new firmware image, pending verification.
W (OTA_RECOVERY_EXAMPLE): Running self-test to validate...
I (OTA_RECOVERY_EXAMPLE): Running application self-test...
E (OTA_RECOVERY_EXAMPLE): Self-test FAILED! This firmware is bad.
E (OTA_RECOVERY_EXAMPLE): Self-test failed. Triggering rollback!
- The device will immediately reboot.
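- Watch the boot log again. This time, the bootloader selects the previous partition. The output below is illustrative: exact bootloader messages, timestamps, and offsets vary with the ESP-IDF version and partition layout, and with the partition table above the rollback target is ota_0 rather than a factory partition.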
I (311) boot: Rollback to factory app
I (315) boot: Loaded app from partition at offset 0x10000
...
I (OTA_RECOVERY_EXAMPLE): Starting OTA Recovery Example App...
I (OTA_RECOVERY_EXAMPLE): Running firmware version: 1.0
I (OTA_RECOVERY_EXAMPLE): This firmware is already marked as valid.
I (OTA_RECOVERY_EXAMPLE): Application loop running, version: 1.0
The device has successfully identified the bad firmware, triggered a rollback, and is now running the previous stable version (v1.0) again. It has saved itself from being bricked.
Variant Notes
The core OTA rollback functionality described here is a fundamental feature of ESP-IDF and is consistent across all ESP32 variants, including the ESP32, ESP32-S2, ESP32-S3, ESP32-C3, ESP32-C6, and ESP32-H2. The esp_ota_* APIs and the bootloader logic are part of the shared framework.
Where you might encounter differences is in related features:
- Flash & PSRAM: The available flash size on a given module determines the maximum size of your application partitions (ota_0, ota_1). PSRAM adds RAM rather than flash, so it gives an application more memory at run time but does not change the partition sizes; the OTA logic remains the same.
- Security Features:
  - Secure Boot: When enabled, the bootloader will verify the digital signature of the application in ota_0 or ota_1 before loading it. A failed signature check will prevent the app from booting and can trigger a rollback, providing a hardware-enforced layer of security.
  - Flash Encryption: Encrypts the flash content. The OTA APIs encrypt the received image as they write it to an encrypted partition, so the update is normally delivered as a plaintext binary over a secure transport. The rollback process works seamlessly with this, as the bootloader handles decryption.
- Anti-Rollback: This feature, which prevents downgrading to an older version, relies on security version numbers stored in the app descriptor. It is supported on all variants that have eFuses to securely store the minimum version.
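The anti-rollback rule reduces to a comparison between two numbers. The following host-side sketch models it under stated assumptions: secure_version_allowed() and raise_device_minimum() are hypothetical helpers, and device_min_version stands in for the minimum version the device keeps in eFuse.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An image is acceptable only if its security version is not lower than
 * the minimum already recorded on the device. An equal version is allowed. */
bool secure_version_allowed(uint32_t image_secure_version, uint32_t device_min_version)
{
    return image_secure_version >= device_min_version;
}

/* Once a newer image has been confirmed as valid, the recorded minimum can
 * only move up, which is what makes a later downgrade impossible. */
uint32_t raise_device_minimum(uint32_t device_min_version, uint32_t confirmed_image_version)
{
    assert(secure_version_allowed(confirmed_image_version, device_min_version));
    return confirmed_image_version > device_min_version ? confirmed_image_version
                                                        : device_min_version;
}
```

Because the minimum never decreases, raising the security version is a one-way decision: do it only for releases that must make older firmware unbootable.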
Feature | ESP32 | ESP32-S2 | ESP32-S3 | ESP32-C3 | ESP32-C6 | ESP32-H2 |
---|---|---|---|---|---|---|
Core OTA Rollback | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Secure Boot | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Flash Encryption | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Anti-Rollback (HW) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
External PSRAM | ✔ | ✔ | ✔ | ✖ | ✖ | ✖ |
Wi-Fi | ✔ | ✔ | ✔ | ✔ | ✔ (Wi-Fi 6) | ✖ |
Bluetooth LE | ✔ | ✖ | ✔ | ✔ | ✔ | ✔ |
Thread / Zigbee | ✖ | ✖ | ✖ | ✖ | ✔ | ✔ |
In summary, the recovery strategies you learn in this chapter are directly portable across the entire ESP32 family.
Common Mistakes & Troubleshooting Tips
Mistake / Issue | Symptom(s) | Troubleshooting / Solution |
---|---|---|
Forgot to mark app as valid | The new firmware (e.g., v2.0) runs perfectly after OTA. However, after a power cycle or manual reset, the device mysteriously reverts to the old firmware (v1.0). | This is the most common OTA issue. The bootloader’s rollback feature is working correctly. Your new app never signaled that it was “good”. Solution: In your application logic, after you have confirmed the new firmware is stable (e.g., connected to Wi-Fi/cloud), you must call esp_ota_mark_app_valid_cancel_rollback();. |
Incorrect Partition Table | The OTA update fails immediately. The log shows an error like E (ota): esp_ota_begin… not found or the device fails to boot entirely, complaining about a missing partition. | Your project is not configured for OTA. It needs dedicated partitions for the process. Solution: Ensure you have a partitions.csv file with at least ota_0, ota_1, and otadata defined. Then select it in idf.py menuconfig under Partition Table (choose Custom partition table CSV and set the file name to partitions.csv). |
Flawed Self-Test Logic | Rollbacks happen sporadically. Sometimes the update succeeds, other times it fails. The device might reboot unexpectedly during the self-test phase. | Your self-test might be unreliable or too slow. Solutions: 1. Watchdog Timeout: If the test takes too long, the watchdog timer might reset the device, causing a rollback. Make tests fast or feed the watchdog timer during long operations. 2. Network Dependency: If your test requires a cloud connection that is flaky, it will cause false failures. Implement retries with a final timeout. |
Fighting Anti-Rollback | The OTA download seems to work, but the final write/validation step fails with an error such as ESP_ERR_OTA_VALIDATE_FAILED or ESP_ERR_OTA_SMALL_SEC_VER. The log might mention an “anti-rollback” check failure. | You have Anti-Rollback enabled and are trying to install firmware whose security version number is lower than the minimum the device has already recorded. Solution: For a release that must supersede older ones, increase the app secure version in your project configuration (e.g., in idf.py menuconfig, or in your build scripts). During development, you can disable Anti-Rollback. |
Corrupted Firmware Binary | The device enters a boot loop after an OTA update. The log shows a guru meditation error or a checksum mismatch error during boot. | The binary file downloaded during the OTA was corrupted, incomplete, or was built for the wrong ESP32 variant. Solution: 1. Ensure your OTA server provides a checksum (like SHA-256) along with the binary. 2. Before calling esp_ota_set_boot_partition(), verify the checksum of the downloaded data against the one from the server. |
Incorrect OTA State Logic | The self-test logic runs on every single boot, not just after an update. This can cause unintended behavior, like trying to roll back an already-validated app. | Your code isn’t checking the partition state correctly. Solution: Wrap your self-test and rollback logic inside a condition that checks if the state is ESP_OTA_IMG_PENDING_VERIFY. Do not run the validation logic if the state is already ESP_OTA_IMG_VALID. |
Exercises
- Implement a Timed Self-Test: Modify the example code. Instead of a simple boolean flag, the self-test should start a 60-second timer. The application must receive a specific command (e.g., via UART or an MQTT message like device/123/validate) within this window. If the command is received, call esp_ota_mark_app_valid_cancel_rollback(). If the timer expires, call esp_ota_mark_app_invalid_rollback_and_reboot().
- Manual Rollback Trigger: Implement a “safe mode” button. If a specific GPIO pin is held low during boot, the application should immediately call esp_ota_mark_app_invalid_rollback_and_reboot(), regardless of its validation state. This provides a physical way for a user to force a recovery if a new firmware has a critical flaw that even the self-test missed (e.g., it corrupts the display but otherwise functions).
- Investigate ota_data with parttool.py: Use the ESP-IDF command-line partition tool (python ${IDF_PATH}/components/partition_table/parttool.py) to inspect the otadata partition.
  - Read the partition after flashing the initial v1.0 firmware.
  - Perform a (simulated) OTA to v2.0 and reboot. Before it rolls back, quickly halt the device and read the otadata partition again.
  - Let the device boot, fail, and roll back. Read the otadata partition one last time.
  - Document the changes you observe in the raw bytes of the partition.
Summary
- OTA update failures are a significant risk for deployed IoT devices, but ESP-IDF provides a robust recovery framework.
- The system relies on a partition scheme with at least two app partitions (ota_0, ota_1) and a state partition (ota_data).
- The ota_data partition informs the bootloader which application to boot.
- The bootloader can automatically roll back to a previous application if a newly updated one fails to validate itself before the device resets.
- The most reliable strategy is application-level self-validation, where the new firmware confirms its own operational health.
- A successful new application must call esp_ota_mark_app_valid_cancel_rollback() to make the update permanent.
- A faulty application should call esp_ota_mark_app_invalid_rollback_and_reboot() to trigger an immediate rollback to the last known-good version.
- This core recovery mechanism is portable across all ESP32 variants.
Further Reading
- ESP-IDF OTA Updates Documentation: https://docs.espressif.com/projects/esp-idf/en/v5.2.1/esp32/api-reference/system/ota.html
- Partition Tool (parttool.py) Documentation: https://docs.espressif.com/projects/esp-idf/en/v5.2.1/esp32/api-reference/storage/parttool.html
- Anti-rollback Documentation: https://docs.espressif.com/projects/esp-idf/en/v5.2.1/esp32/api-reference/system/ota.html#anti-rollback