Chapter 291: OTA Failure Recovery Strategies

Chapter Objectives

By the end of this chapter, you will be able to:

  • Understand the common causes of Over-the-Air (OTA) update failures.
  • Explain the role of the partition table, specifically the ota_0, ota_1, and ota_data partitions, in enabling OTA updates and recovery.
  • Describe the boot process flow in an OTA-enabled ESP32 application.
  • Implement a robust application self-test to validate a new firmware image.
  • Use ESP-IDF APIs to programmatically confirm a successful update or trigger a rollback.
  • Manually revert to a previous firmware version.
  • Identify and troubleshoot common issues related to OTA rollback mechanisms.

Introduction

Over-the-Air (OTA) updates are a cornerstone of modern IoT device management, allowing for the deployment of new features, bug fixes, and security patches without physical intervention. While incredibly powerful, the OTA process itself introduces a critical point of potential failure. A power outage during an update, a corrupted firmware binary, or a critical bug in the new code could leave a device in an unresponsive or “bricked” state, making it impossible to manage or update further.

For devices deployed in the field (whether on a factory floor, in a consumer’s home, or atop a remote mountain), such a failure is unacceptable. This chapter delves into the robust failure recovery mechanisms provided by the ESP-IDF. We will explore the theory behind the ESP32’s boot and partition system that makes recovery possible and then implement practical, application-level strategies to ensure that your devices can reliably recover from a faulty update, maintaining operational integrity and saving you from costly field service.

Theory

The foundation of the ESP32’s OTA failure recovery is its flexible partitioning system and a bootloader designed with OTA in mind.

The OTA Partition Scheme

To perform an OTA update, the device’s flash memory must be laid out according to a specific scheme. Instead of a single application partition, we use at least two, conventionally named ota_0 and ota_1.

  • ota_0 / ota_1 (App Partitions): These partitions are of identical size and are where the application firmware is stored. At any given time, one is the “active” or “boot” partition (running the current code), while the other is the “inactive” or “update” partition (the target for the next OTA update).
  • ota_data (OTA Data Partition): This is a small (typically 8KB) but crucial partition that acts as a control sector for the bootloader. It stores state information, most importantly indicating which of the app partitions (ota_0 or ota_1) the bootloader should load on startup.
  • factory (Optional): Many designs include a factory partition that holds the initial, golden-master firmware image. The standard OTA update cycle never writes to this partition, so it can serve as a last-resort recovery option.

The OTA Update and Boot Process

Let’s walk through a successful OTA update and the subsequent boot process to understand how these partitions interact.

  1. Initial State: The device is running firmware from the ota_0 partition. The ota_data partition contains information pointing to ota_0 as the boot partition.
  2. Update Initiated: The application receives a command to update. It begins downloading the new firmware binary from a server.
  3. Writing to Inactive Partition: The OTA logic writes the new firmware, chunk by chunk, into the inactive partition (ota_1). The currently running application in ota_0 is not affected.
  4. Switching the Boot Target: Once the entire binary is written and verified (e.g., via a checksum), the application calls the function esp_ota_set_boot_partition(). This function does not immediately switch the firmware. Instead, it updates the ota_data partition to point to ota_1 as the target for the next boot.
  5. Reboot: The application triggers a system reboot via esp_restart().
  6. New Boot Sequence:
    • The ROM Bootloader executes.
    • It loads the second-stage Bootloader from flash.
    • The second-stage Bootloader reads the ota_data partition. It sees that ota_1 is now the designated boot partition.
    • It loads and executes the new application from ota_1.
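
The selection in step 6 can be pictured with a small host-side model. This is a simplified sketch, not the bootloader source: it assumes the OTA data partition holds two records, each carrying a sequence counter that starts at 1, and that an erased record reads back as all ones. The real bootloader also checks a CRC and the image state of each record, which this model leaves out. The names choose_boot_slot, OTA_SEQ_BLANK, and NO_OTA_SLOT are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define OTA_SEQ_BLANK 0xFFFFFFFFu /* erased flash reads back as all ones */
#define NO_OTA_SLOT   (-1)        /* no usable record: caller falls back */

/* Return the OTA slot index (0 for ota_0, 1 for ota_1, ...) chosen by two
 * sequence counters, or NO_OTA_SLOT when both records are blank.
 * Assumption: counters start at 1 and the highest one is the newest. */
static int choose_boot_slot(uint32_t seq_a, uint32_t seq_b, uint32_t slot_count)
{
    assert(slot_count > 0);

    int a_ok = (seq_a != OTA_SEQ_BLANK);
    int b_ok = (seq_b != OTA_SEQ_BLANK);
    if (!a_ok && !b_ok) {
        return NO_OTA_SLOT;
    }

    uint32_t newest;
    if (a_ok && b_ok) {
        newest = (seq_a > seq_b) ? seq_a : seq_b;
    } else {
        newest = a_ok ? seq_a : seq_b;
    }

    /* Counter 1 maps to slot 0, counter 2 to slot 1, and so on, wrapping. */
    return (int)((newest - 1u) % slot_count);
}
```

With two app partitions, a counter of 1 selects ota_0 and a counter of 2 selects ota_1, so switching the boot target amounts to writing a record with a larger counter.
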
graph TD
    subgraph "Current State"
        A[Start: App Running from ota_0]
    end

    subgraph "OTA Update Process"
        B{Receive OTA Command?}
        C[Download New Firmware]
        D[Write Firmware to ota_1]
        E[Verify Firmware Checksum]
        F[Update ota_data to boot from ota_1]
        G[Reboot System]
    end

    subgraph "Boot & Validation"
        H[Bootloader reads ota_data]
        I[Bootloader loads App from ota_1]
        J{"Is App in<br><b>PENDING_VERIFY</b> state?"}
        K[Rollback Stays Armed Until App Confirms]
        L[Run App Self-Test]
    end

    subgraph "Outcome"
        M{Self-Test Passed?}
        N[Mark App as <b>VALID</b><br>Update is now permanent]
        O[Self-Test Failed or Timed Out]
        P[Mark App as <b>INVALID</b><br>and Reboot]
        Q[Bootloader reads ota_data,<br>sees invalid/pending state]
        R[Rollback: Boot from ota_0]
        S[Device is recovered<br>Running old firmware]
    end

    %% Styling
    classDef start fill:#D1FAE5,stroke:#059669,stroke-width:2px,color:#065F46
    classDef process fill:#DBEAFE,stroke:#2563EB,stroke-width:1px,color:#1E40AF
    classDef decision fill:#FEF3C7,stroke:#D97706,stroke-width:1px,color:#92400E
    classDef success fill:#D1FAE5,stroke:#059669,stroke-width:2px,color:#065F46
    classDef failure fill:#FEE2E2,stroke:#DC2626,stroke-width:1px,color:#991B1B
    
    class A,N,S start
    class C,D,E,F,G,H,I,K,L,P,Q,R process
    class B,J,M decision
    class O failure

    %% Connections
    A --> B
    B -- Yes --> C --> D --> E -- OK --> F --> G
    B -- No --> A
    G --> H --> I --> J
    J -- Yes --> K --> L --> M
    J -- No --> L
    M -- Yes --> N
    M -- No --> O --> P
    P --> Q
    Q --> R --> S
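
The “write chunk by chunk, then verify” steps can be sketched as a running checksum that is updated with every downloaded chunk. To keep the example short it swaps in CRC-32 instead of the SHA-256 digest a production server would more likely publish; the structure (begin, update per chunk, compare at the end) is the same. The names image_check_t, image_check_begin, image_check_update, image_check_final, and image_is_intact are hypothetical and not part of ESP-IDF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Running CRC-32 (reflected form, polynomial 0xEDB88320). */
typedef struct {
    uint32_t crc;   /* internal state, kept inverted between updates */
    size_t   bytes; /* total number of bytes folded in so far */
} image_check_t;

static void image_check_begin(image_check_t *c)
{
    assert(c != NULL);
    c->crc = 0xFFFFFFFFu;
    c->bytes = 0;
}

/* Fold one downloaded chunk into the checksum. On a device this call would
 * sit next to the code that writes the same chunk to the inactive partition. */
static void image_check_update(image_check_t *c, const uint8_t *chunk, size_t len)
{
    assert(c != NULL && (chunk != NULL || len == 0));
    for (size_t i = 0; i < len; i++) {
        c->crc ^= chunk[i];
        for (int bit = 0; bit < 8; bit++) {
            uint32_t mask = 0u - (c->crc & 1u); /* all ones if low bit set */
            c->crc = (c->crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    c->bytes += len;
}

static uint32_t image_check_final(const image_check_t *c)
{
    assert(c != NULL);
    return c->crc ^ 0xFFFFFFFFu;
}

/* True only if both the length and the checksum match what the server
 * advertised; only then should the boot target be switched. */
static int image_is_intact(const image_check_t *c, uint32_t expected_crc,
                           size_t expected_len)
{
    return c->bytes == expected_len && image_check_final(c) == expected_crc;
}
```

Checking the length as well as the checksum catches a truncated download early, before any attempt is made to boot the image.
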

How Failure Recovery Works

The brilliance of this system is what happens when the new firmware in ota_1 is faulty.

Automatic Rollback (Bootloader-Driven)

When rollback support is enabled (CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE), esp_ota_set_boot_partition() records the freshly written image as ESP_OTA_IMG_NEW. On the first boot, the bootloader changes that state to ESP_OTA_IMG_PENDING_VERIFY and starts the application. There is no countdown inside the bootloader: the image simply stays “pending” until the application confirms it. If the device resets before that confirmation (for example, because a watchdog timer fires or the new code crashes), the bootloader finds the image still pending on the next boot, marks it ESP_OTA_IMG_ABORTED, and boots the previous, known-good application (ota_0) instead.
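
The life cycle of a new image can be modelled as a small state machine. The sketch below runs on a host and is only an illustration: the enum mirrors the names of the ESP_OTA_IMG_* states but is a stand-in type, and bootloader_visit and app_report are hypothetical helpers that summarise what the bootloader and the application each do to the recorded state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the image states kept in the OTA data partition. */
typedef enum {
    IMG_NEW,            /* written by the updater, never booted */
    IMG_PENDING_VERIFY, /* booted once, waiting for the app's verdict */
    IMG_VALID,          /* confirmed by the app */
    IMG_INVALID,        /* rejected by the app */
    IMG_ABORTED         /* reset happened before any verdict */
} img_state_t;

/* What the bootloader does with the selected image at each reset.
 * Returns true if the image is started, false if the bootloader
 * falls back to the previous application. */
static bool bootloader_visit(img_state_t *state)
{
    assert(state != NULL);
    switch (*state) {
    case IMG_NEW:
        *state = IMG_PENDING_VERIFY; /* first boot after the update */
        return true;
    case IMG_PENDING_VERIFY:
        *state = IMG_ABORTED;        /* booted before, never confirmed */
        return false;
    case IMG_VALID:
        return true;
    case IMG_INVALID:
    case IMG_ABORTED:
    default:
        return false;
    }
}

/* What the application does once its self-test has a verdict. */
static void app_report(img_state_t *state, bool healthy)
{
    assert(state != NULL);
    if (*state == IMG_PENDING_VERIFY) {
        *state = healthy ? IMG_VALID : IMG_INVALID;
    }
}
```

Walking an image through bootloader_visit twice without calling app_report in between reproduces the automatic rollback: the second visit refuses to start it.
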

Application-Driven Validation and Rollback

Relying on a simple timeout is good, but not perfect. A truly robust system requires the application to perform its own intelligent self-test. The new firmware should be responsible for validating its own integrity and its ability to perform its core functions.

flowchart TD
    A(Start: New Firmware Boots from ota_1)
    B{Check OTA Partition State}
    C{"State == <b>PENDING_VERIFY</b>?"}
    
    subgraph "Self-Validation Logic"
        D[Run Critical Self-Tests<br>- Connect to Wi-Fi<br>- Ping Backend Server<br>- Check Sensor/Actuator]
        E{All Tests Successful?}
    end

    subgraph "Decision & Action"
        F["Call <b>esp_ota_mark_app_valid_cancel_rollback()</b>"]
        G["Call <b>esp_ota_mark_app_invalid_rollback_and_reboot()</b>"]
    end

    H[Update is now permanent.<br>Continue normal operation.]
    I[Device reboots.<br>Bootloader sees INVALID state and rolls back to ota_0.]
    J[State is already <b>VALID</b>.<br>Skip test and run normally.]

    %% Styling
    classDef start fill:#EDE9FE,stroke:#5B21B6,stroke-width:2px,color:#5B21B6
    classDef process fill:#DBEAFE,stroke:#2563EB,stroke-width:1px,color:#1E40AF
    classDef decision fill:#FEF3C7,stroke:#D97706,stroke-width:1px,color:#92400E
    classDef check fill:#FEE2E2,stroke:#DC2626,stroke-width:1px,color:#991B1B
    classDef success fill:#D1FAE5,stroke:#059669,stroke-width:2px,color:#065F46
    
    class A start
    class B,D process
    class C,E decision
    class F,H,J success
    class G,I check
    
    %% Connections
    A --> B --> C
    C -- Yes --> D
    C -- No --> J
    D --> E
    E -- Yes --> F --> H
    E -- No --> G --> I

A common validation sequence is:

  1. Connect to Wi-Fi.
  2. Connect to the designated cloud backend (e.g., MQTT broker).
  3. Perform a key business logic function (e.g., read a sensor, control a relay).

Only after this checklist is complete should the application report its health.

  • To signal success, the application must call esp_ota_mark_app_valid_cancel_rollback(). This function updates ota_data to mark the current partition as ESP_OTA_IMG_VALID, so the bootloader will no longer roll it back on a later reset. The update is now considered permanent.
  • To signal failure, if the application’s self-test fails, it should immediately call esp_ota_mark_app_invalid_rollback_and_reboot(). This function marks the current partition as ESP_OTA_IMG_INVALID and triggers an immediate reboot. The bootloader will see the invalid state and instantly revert to the previous working firmware.

This application-driven approach is far superior because it can detect not just boot-loops, but also logical failures in the application that a simple boot timeout would miss.
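
One way to organise such a checklist is a table of check functions that is walked in order, stopping at the first failure. The sketch below is host-runnable and uses placeholder checks; self_check_t, run_self_checks, and the verdict enum are hypothetical names, and the three check_* functions stand in for real Wi-Fi, backend, and peripheral tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*self_check_fn)(void);

typedef struct {
    const char   *name; /* label for logging */
    self_check_fn run;  /* returns true when the check passes */
} self_check_t;

typedef enum { OTA_VERDICT_CONFIRM, OTA_VERDICT_ROLL_BACK } ota_verdict_t;

/* Run the checks in order and stop at the first failure. *failed_at
 * receives the index of the failing check, or count if all passed. */
static ota_verdict_t run_self_checks(const self_check_t *checks, size_t count,
                                     size_t *failed_at)
{
    assert(checks != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!checks[i].run()) {
            if (failed_at != NULL) {
                *failed_at = i;
            }
            return OTA_VERDICT_ROLL_BACK;
        }
    }
    if (failed_at != NULL) {
        *failed_at = count;
    }
    return OTA_VERDICT_CONFIRM;
}

/* Placeholder checks so the sketch builds off-target. Real versions would
 * bring up Wi-Fi, reach the backend, and exercise a sensor or relay. */
static bool check_wifi(void)    { return true; }
static bool check_backend(void) { return true; }
static bool check_sensor(void)  { return false; } /* simulated fault */
```

On the device, OTA_VERDICT_CONFIRM would map to esp_ota_mark_app_valid_cancel_rollback() and OTA_VERDICT_ROLL_BACK to esp_ota_mark_app_invalid_rollback_and_reboot(). Keeping the verdict separate from the action makes the checklist easy to test without hardware.
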

Warning: A very common mistake is for developers to write a perfectly functional new application but forget to call esp_ota_mark_app_valid_cancel_rollback(). The device works until it is power-cycled or rebooted, at which point the bootloader, having never received the “all clear,” rolls the firmware back, leaving the developer confused.

Practical Example: Self-Testing OTA

Let’s build an application that demonstrates a robust, self-testing OTA recovery mechanism. We will simulate an OTA update where the “new” firmware must pass a simple test before being accepted. If it fails, it will automatically roll itself back.

1. Project Setup and Partition Table

First, ensure your project uses an OTA-capable partition table. Create a file named partitions.csv in your project’s root directory:

Plaintext
# Name,   Type, SubType, Offset,  Size, Flags
nvs,      data, nvs,     ,        24K,
otadata,  data, ota,     ,        8K,
phy_init, data, phy,     ,        4K,
ota_0,    app,  ota_0,   ,        1M,
ota_1,    app,  ota_1,   ,        1M,
storage,  data, nvs,     ,        ,
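
Because the Offset column is left blank, the build tool places each partition automatically. The host-side sketch below estimates where the entries land, assuming the default partition table location (0x8000, occupying one 4 KB sector), 4 KB alignment for data partitions, and 64 KB alignment for app partitions. The part_t type and place_partitions function are hypothetical helpers for illustration, not part of the ESP-IDF tooling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIRST_FREE_OFFSET 0x9000u  /* assumed: table at 0x8000 plus one sector */
#define DATA_ALIGN        0x1000u  /* assumed alignment for data partitions */
#define APP_ALIGN         0x10000u /* assumed alignment for app partitions */

typedef struct {
    const char *name;
    int         is_app; /* non-zero for app partitions */
    uint32_t    size;   /* size in bytes */
    uint32_t    offset; /* filled in by place_partitions() */
} part_t;

static uint32_t align_up(uint32_t value, uint32_t align)
{
    return (value + align - 1u) & ~(align - 1u);
}

/* Assign offsets in table order and return the first byte past the last
 * placed partition. */
static uint32_t place_partitions(part_t *parts, size_t count)
{
    assert(parts != NULL || count == 0);
    uint32_t cursor = FIRST_FREE_OFFSET;
    for (size_t i = 0; i < count; i++) {
        cursor = align_up(cursor, parts[i].is_app ? APP_ALIGN : DATA_ALIGN);
        parts[i].offset = cursor;
        cursor += parts[i].size;
    }
    return cursor;
}
```

Under these assumptions, the table above puts ota_0 at 0x20000 and ota_1 at 0x120000, and ota_1 ends at 0x220000. That is past the 2 MB mark (0x200000), so this layout needs a flash size of 4 MB or more. The offsets printed in a boot log always reflect the table actually flashed.
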

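In ESP-IDF the partition table is selected through the project configuration, not through a CMake variable. Run idf.py menuconfig and set the following:

  • Partition Table -> Partition Table -> Custom partition table CSV, with the custom file name set to partitions.csv
  • Serial flasher config -> Flash size -> 4 MB or larger (two 1 MB app partitions plus the data partitions do not fit in 2 MB)
  • Bootloader config -> [*] Enable app rollback support (CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE)
  • Bootloader config -> [ ] Enable app anti-rollback support: leave this disabled for the example. Anti-rollback burns eFuses, which cannot be undone, so enable it only deliberately for production builds.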

2. The Application Code

We will create an application that, for the sake of this example, will have its version hardcoded. We’ll first flash v1.0, then simulate an OTA update to a “bad” v2.0 which will fail its self-test and roll back.

Main Application Logic (main/main.c)

This code includes a placeholder for an OTA update task and the crucial self-test logic.

C
#include <stdio.h>
#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_system.h"
#include "esp_log.h"
#include "esp_ota_ops.h"
#include "esp_app_format.h"
#include "nvs_flash.h"
#include "esp_wifi.h"
#include "esp_event.h"

static const char *TAG = "OTA_RECOVERY_EXAMPLE";

// In a real app, you would have a proper OTA task to download the binary.
// For this example, we assume the OTA has already happened and the device has rebooted.
// void ota_update_task(void *pvParameter);

// --- SIMULATED SELF-TEST ---
// This function simulates a critical self-test the new firmware must pass.
// In a real application, this would check for WiFi, server connectivity, etc.
// Here, we'll use a Kconfig option to simulate success or failure.
static bool run_application_self_test(void) {
    ESP_LOGI(TAG, "Running application self-test...");

#if CONFIG_EXAMPLE_FIRMWARE_FAILS_SELF_TEST
    ESP_LOGE(TAG, "Self-test FAILED! This firmware is bad.");
    vTaskDelay(pdMS_TO_TICKS(2000));
    return false;
#else
    ESP_LOGI(TAG, "Self-test PASSED! This firmware is good.");
    vTaskDelay(pdMS_TO_TICKS(2000));
    return true;
#endif
}

// --- MAIN LOGIC ---
void app_main(void) {
    ESP_LOGI(TAG, "Starting OTA Recovery Example App...");

    // Initialize NVS
    esp_err_t err = nvs_flash_init();
    if (err == ESP_ERR_NVS_NO_FREE_PAGES || err == ESP_ERR_NVS_NEW_VERSION_FOUND) {
        ESP_ERROR_CHECK(nvs_flash_erase());
        err = nvs_flash_init();
    }
    ESP_ERROR_CHECK(err);

    // Get OTA partition info
    const esp_partition_t *running_partition = esp_ota_get_running_partition();
    esp_app_desc_t running_app_info;
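    // esp_ota_get_partition_description() reads the descriptor of a given partition.
    ESP_ERROR_CHECK(esp_ota_get_partition_description(running_partition, &running_app_info));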

    ESP_LOGI(TAG, "Running firmware version: %s", running_app_info.version);

    // Check the state of the currently running partition.
    esp_ota_img_states_t ota_state;
    err = esp_ota_get_state_partition(running_partition, &ota_state);

    if (err == ESP_OK) {
        if (ota_state == ESP_OTA_IMG_PENDING_VERIFY) {
            ESP_LOGW(TAG, "This is a new firmware image, pending verification.");
            ESP_LOGW(TAG, "Running self-test to validate...");

            if (run_application_self_test()) {
                ESP_LOGI(TAG, "Self-test successful. Marking app as valid.");
                esp_ota_mark_app_valid_cancel_rollback();
            } else {
                ESP_LOGE(TAG, "Self-test failed. Triggering rollback!");
                esp_ota_mark_app_invalid_rollback_and_reboot();
            }
        } else if (ota_state == ESP_OTA_IMG_VALID) {
            ESP_LOGI(TAG, "This firmware is already marked as valid.");
        } else {
            // Typically ESP_OTA_IMG_NEW or ESP_OTA_IMG_UNDEFINED: no verdict is
            // pending, so rolling back here could discard a working image.
            ESP_LOGW(TAG, "Unexpected OTA state 0x%x; continuing without rollback.", (unsigned)ota_state);
        }
    } else {
        // ESP_ERR_NOT_FOUND can occur when otadata holds no record for this
        // partition yet, for example right after flashing over serial.
        ESP_LOGE(TAG, "Failed to get OTA state, error: %s", esp_err_to_name(err));
    }

    // Main application loop can go here.
    // For the example, we just print the version every 10 seconds.
    while (1) {
        ESP_LOGI(TAG, "Application loop running, version: %s", running_app_info.version);
        vTaskDelay(pdMS_TO_TICKS(10000));
    }
}

3. Kconfig for Simulating Failure

To easily switch between a “good” and “bad” firmware for our test, we add a custom option in main/Kconfig.projbuild.

Plaintext
menu "Example Configuration"

    config EXAMPLE_FIRMWARE_FAILS_SELF_TEST
        bool "Simulate a failed self-test"
        default n
        help
            If selected, the application's self-test function will intentionally fail,
            triggering an OTA rollback if the app is in a PENDING_VERIFY state.
            Unselect this for the initial "good" firmware.

endmenu

4. Build and Run Instructions

Follow these steps precisely to observe the rollback mechanism.

Step 1: Build and Flash the Initial “Good” Firmware (v1.0)
  1. Ensure the Simulate a failed self-test option is disabled (n) in menuconfig.
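  2. Open your project’s CMakeLists.txt and set the version before the project() line: set(PROJECT_VER "1.0")
  3. Build and flash the project: idf.py build flash monitor
  4. Observe the output. You should see:
    I (OTA_RECOVERY_EXAMPLE): Starting OTA Recovery Example App...
    I (OTA_RECOVERY_EXAMPLE): Running firmware version: 1.0
    I (OTA_RECOVERY_EXAMPLE): This firmware is already marked as valid.
    I (OTA_RECOVERY_EXAMPLE): Application loop running, version: 1.0
    This is now our stable, known-good firmware running from ota_0. On a freshly flashed device whose otadata partition is still blank, esp_ota_get_state_partition() may return ESP_ERR_NOT_FOUND, in which case the “Failed to get OTA state” line appears instead of the “already marked as valid” line; the application still runs normally.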
Step 2: Build the “Bad” Firmware (v2.0)
  1. In menuconfig, go to Example Configuration and enable (y) the [*] Simulate a failed self-test option. Save and exit.
  2. In CMakeLists.txt, update the version: set(PROJECT_VER "2.0-bad")
  3. Only build the project, do not flash it yet: idf.py build
  4. This creates the ota_recovery_example.bin file in your build directory. This is the “bad” firmware we will serve for the OTA update.
Step 3: Simulate the OTA Update and Observe Rollback

For a real OTA, you’d use a cloud service. Here, we can use the esp_https_ota example component from ESP-IDF and a simple local python server. For simplicity, let’s just describe the expected outcome as if an OTA process has just completed and the device is rebooting.

Imagine your device has just downloaded the v2.0-bad firmware and rebooted.

  • The device starts up. The bootloader reads ota_data and boots from the newly updated partition (e.g., ota_1).
  • You will see the following in the monitor:
Plaintext
I (312) boot: Loaded app from partition at offset 0x110000
...
I (OTA_RECOVERY_EXAMPLE): Starting OTA Recovery Example App...
I (OTA_RECOVERY_EXAMPLE): Running firmware version: 2.0-bad
W (OTA_RECOVERY_EXAMPLE): This is a new firmware image, pending verification.
W (OTA_RECOVERY_EXAMPLE): Running self-test to validate...
I (OTA_RECOVERY_EXAMPLE): Running application self-test...
E (OTA_RECOVERY_EXAMPLE): Self-test FAILED! This firmware is bad.
E (OTA_RECOVERY_EXAMPLE): Self-test failed. Triggering rollback!
  • The device will immediately reboot.
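  • Watch the boot log again. This time, the bootloader skips the invalidated image and loads the previous one. The bootloader lines shown here are illustrative; the exact messages and partition offsets depend on your ESP-IDF version and partition table.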
Plaintext
I (311) boot: Rollback to factory app
I (315) boot: Loaded app from partition at offset 0x10000
...
I (OTA_RECOVERY_EXAMPLE): Starting OTA Recovery Example App...
I (OTA_RECOVERY_EXAMPLE): Running firmware version: 1.0
I (OTA_RECOVERY_EXAMPLE): This firmware is already marked as valid.
I (OTA_RECOVERY_EXAMPLE): Application loop running, version: 1.0

The device has successfully identified the bad firmware, triggered a rollback, and is now running the previous stable version (v1.0) again. It has saved itself from being bricked.

Variant Notes

The core OTA rollback functionality described here is a fundamental feature of ESP-IDF and is consistent across all ESP32 variants, including the ESP32, ESP32-S2, ESP32-S3, ESP32-C3, ESP32-C6, and ESP32-H2. The esp_ota_* APIs and the bootloader logic are part of the shared framework.

Where you might encounter differences is in related features:

  • Flash & PSRAM: The available flash size on a given module determines the maximum size of your application partitions (ota_0 and ota_1). PSRAM adds RAM for the running application but does not change how much flash is available for firmware images; the OTA logic is the same either way.
  • Security Features:
    • Secure Boot: When enabled, the bootloader will verify the digital signature of the application in ota_0 or ota_1 before loading it. A failed signature check will prevent the app from booting and can trigger a rollback, providing a hardware-enforced layer of security.
    • Flash Encryption: Encrypts the flash content. The OTA update binary must also be encrypted. The rollback process works seamlessly with this, as the bootloader handles decryption.
  • Anti-Rollback: This feature, which prevents downgrading to an older version, relies on security version numbers stored in the app descriptor. It is supported on all variants that have eFuses to securely store the minimum version.
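
The anti-rollback comparison can be sketched on a host as follows. This model assumes the minimum secure version is stored in eFuse as a count of burned bits (so it can only grow) and that an image is accepted when its secure version is at least that minimum. The function names are hypothetical and the 32-bit field width is an assumption for illustration; the real field size depends on the chip and configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the bit pattern that represents a given minimum version.
 * Assumption: version N is stored as N burned (set) bits. */
static uint32_t efuse_bits_for_version(uint32_t version)
{
    assert(version <= 32u);
    return (version == 32u) ? 0xFFFFFFFFu : ((1u << version) - 1u);
}

/* Recover the minimum version by counting the burned bits. */
static uint32_t efuse_min_version(uint32_t efuse_bits)
{
    uint32_t count = 0;
    while (efuse_bits != 0u) {
        efuse_bits &= efuse_bits - 1u; /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* An image is acceptable if its secure version is not below the minimum. */
static bool anti_rollback_accepts(uint32_t image_secure_version,
                                  uint32_t efuse_bits)
{
    return image_secure_version >= efuse_min_version(efuse_bits);
}
```

Because bits can be burned but never cleared, raising the minimum is a one-way operation. That is why anti-rollback is best left off until a release process is in place.
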
Feature              ESP32   ESP32-S2   ESP32-S3   ESP32-C3   ESP32-C6      ESP32-H2
Core OTA Rollback    ✔       ✔          ✔          ✔          ✔             ✔
Secure Boot          ✔       ✔          ✔          ✔          ✔             ✔
Flash Encryption     ✔       ✔          ✔          ✔          ✔             ✔
Anti-Rollback (HW)   ✔       ✔          ✔          ✔          ✔             ✔
External PSRAM       ✔       ✔          ✔          ✘          ✘             ✘
Wi-Fi                ✔       ✔          ✔          ✔          ✔ (Wi-Fi 6)   ✘
Bluetooth LE         ✔       ✘          ✔          ✔          ✔             ✔
Thread / Zigbee      ✘       ✘          ✘          ✘          ✔             ✔

In summary, the recovery strategies you learn in this chapter are directly portable across the entire ESP32 family.

Common Mistakes & Troubleshooting Tips

Each entry below lists the mistake or issue, its symptom(s), and the troubleshooting steps or solution.

Forgot to mark app as valid
  • Symptom(s): The new firmware (e.g., v2.0) runs perfectly after OTA. However, after a power cycle or manual reset, the device mysteriously reverts to the old firmware (v1.0).
  • Troubleshooting / Solution: This is the most common OTA issue. The bootloader’s rollback feature is working correctly; your new app never signaled that it was “good”. In your application logic, after you have confirmed the new firmware is stable (e.g., connected to Wi-Fi/cloud), you must call esp_ota_mark_app_valid_cancel_rollback().

Incorrect Partition Table
  • Symptom(s): The OTA update fails immediately. The log shows an error like E (ota): esp_ota_begin… not found or the device fails to boot entirely, complaining about a missing partition.
  • Troubleshooting / Solution: Your project is not configured for OTA. It needs dedicated partitions for the process. Ensure you have a partitions.csv file with at least ota_0, ota_1, and otadata defined, then select it in idf.py menuconfig under Partition Table (Custom partition table CSV, file name partitions.csv).

Flawed Self-Test Logic
  • Symptom(s): Rollbacks happen sporadically. Sometimes the update succeeds, other times it fails. The device might reboot unexpectedly during the self-test phase.
  • Troubleshooting / Solution: Your self-test might be unreliable or too slow.
    1. Watchdog Timeout: If the test takes too long, the watchdog timer might reset the device, causing a rollback. Make tests fast or feed the watchdog timer during long operations.
    2. Network Dependency: If your test requires a cloud connection that is flaky, it will cause false failures. Implement retries with a final timeout.

Fighting Anti-Rollback
  • Symptom(s): The OTA download seems to work, but the final write/validation step fails with an error such as ESP_ERR_OTA_VALIDATE_FAILED or ESP_ERR_OTA_SMALL_SEC_VER. The log might mention an “anti-rollback” check failure.
  • Troubleshooting / Solution: You have Anti-Rollback enabled and are trying to install firmware whose secure version is lower than the minimum already recorded on the device. For a release that must supersede older ones, increment the secure version in your project configuration (CONFIG_BOOTLOADER_APP_SECURE_VERSION, under Bootloader config in idf.py menuconfig). During development, keep Anti-Rollback disabled.

Corrupted Firmware Binary
  • Symptom(s): The device enters a boot loop after an OTA update. The log shows a Guru Meditation Error or a checksum mismatch error during boot.
  • Troubleshooting / Solution: The binary file downloaded during the OTA was corrupted, incomplete, or was built for the wrong ESP32 variant.
    1. Ensure your OTA server provides a checksum (like SHA-256) along with the binary.
    2. Before calling esp_ota_set_boot_partition(), verify the checksum of the downloaded data against the one from the server.

Incorrect OTA State Logic
  • Symptom(s): The self-test logic runs on every single boot, not just after an update. This can cause unintended behavior, like trying to roll back an already-validated app.
  • Troubleshooting / Solution: Your code isn’t checking the partition state correctly. Wrap your self-test and rollback logic inside a condition that checks if the state is ESP_OTA_IMG_PENDING_VERIFY. Do not run the validation logic if the state is already ESP_OTA_IMG_VALID.
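
The “retries with a final timeout” advice for flaky network checks can be sketched as a bounded retry loop with a doubling delay. The version below is host-runnable: retry_with_budget, probe_fn, and wait_fn are hypothetical names, and flaky_probe and fake_wait are stand-ins for a real connectivity check and a real delay. On a device the wait callback would typically wrap vTaskDelay(), which also lets other tasks run between attempts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*probe_fn)(void *ctx);   /* returns true on success */
typedef void (*wait_fn)(uint32_t ms);  /* blocks for the given time */

/* Retry probe() with a doubling delay until it succeeds or the next wait
 * would exceed the total time budget. */
static bool retry_with_budget(probe_fn probe, void *ctx, wait_fn wait,
                              uint32_t first_delay_ms, uint32_t budget_ms)
{
    assert(probe != NULL && wait != NULL && first_delay_ms > 0);

    uint32_t spent = 0;
    uint32_t delay = first_delay_ms;
    for (;;) {
        if (probe(ctx)) {
            return true;
        }
        if (delay > budget_ms - spent) { /* spent never exceeds budget_ms */
            return false;
        }
        wait(delay);
        spent += delay;
        if (delay <= UINT32_MAX / 2u) {
            delay *= 2u;
        }
    }
}

/* Host-side stand-ins: a probe that fails a set number of times, and a
 * wait function that only records how long it was asked to sleep. */
static bool flaky_probe(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return false;
    }
    return true;
}

static uint32_t waited_ms;
static void fake_wait(uint32_t ms) { waited_ms += ms; }
```

A fixed budget keeps the self-test from hanging indefinitely while still tolerating a slow network, so a transient outage is less likely to be mistaken for bad firmware.
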

Exercises

  1. Implement a Timed Self-Test: Modify the example code. Instead of a simple boolean flag, the self-test should start a 60-second timer. The application must receive a specific command (e.g., via UART or an MQTT message like device/123/validate) within this window. If the command is received, call esp_ota_mark_app_valid_cancel_rollback(). If the timer expires, call esp_ota_mark_app_invalid_rollback_and_reboot().
  2. Manual Rollback Trigger: Implement a “safe mode” button. If a specific GPIO pin is held low during boot, the application should immediately call esp_ota_mark_app_invalid_rollback_and_reboot(), regardless of its validation state. This provides a physical way for a user to force a recovery if a new firmware has a critical flaw that even the self-test missed (e.g., it corrupts the display but otherwise functions).
  3. Investigate ota_data with parttool.py: Use the ESP-IDF command-line partition tool (python ${IDF_PATH}/components/partition_table/parttool.py) to inspect the otadata partition.
    • Read the partition after flashing the initial v1.0 firmware.
    • Perform a (simulated) OTA to v2.0 and reboot. Before it rolls back, quickly halt the device and read the otadata partition again.
    • Let the device boot, fail, and roll back. Read the otadata partition one last time.
    • Document the changes you observe in the raw bytes of the partition.

Summary

  • OTA update failures are a significant risk for deployed IoT devices, but ESP-IDF provides a robust recovery framework.
  • The system relies on a partition scheme with at least two app partitions (ota_0 and ota_1) and a state partition (ota_data).
  • The ota_data partition informs the bootloader which application to boot.
  • The bootloader can automatically roll back to a previous application if a newly updated one resets before validating itself.
  • The most reliable strategy is application-level self-validation, where the new firmware confirms its own operational health.
  • A successful new application must call esp_ota_mark_app_valid_cancel_rollback() to make the update permanent.
  • A faulty application should call esp_ota_mark_app_invalid_rollback_and_reboot() to trigger an immediate rollback to the last known-good version.
  • This core recovery mechanism is portable across all ESP32 variants.
