Chapter 295: Fleet Management for ESP32 Devices
Chapter Objectives
By the end of this chapter, you will be able to:
- Define fleet management and explain its importance for scalable IoT solutions.
- Describe the concept of a Device Shadow (or Digital Twin) and its role in state synchronization.
- Implement firmware that can report its state to the cloud and respond to desired state changes.
- Understand how to use metadata and tagging for device grouping and targeted actions.
- Explain the strategy of phased OTA rollouts (canary deployments) to minimize risk.
- Design systems for applying fleet-wide configuration changes without requiring a full firmware update.
Introduction
Throughout the previous chapters, we have mastered the art of building robust, single devices. We can update their firmware, monitor their health, and collect their data. But a successful IoT product rarely consists of just one device; it comprises hundreds, thousands, or even millions, scattered across the globe. How do you update the firmware on 10,000 devices in the field? How do you change a setting on all devices located in Germany? Managing them one by one is not just impractical; it’s impossible.
This is the challenge that Fleet Management solves. It is the set of tools, strategies, and architectural patterns for administering and controlling a large population of devices in aggregate. It allows you to move beyond interacting with individual devices and start orchestrating your entire deployment as a cohesive system.
In this chapter, we will explore the core concepts of modern fleet management, focusing on the de-facto industry standard for state management: the Device Shadow. Learning these principles is the final step in moving from an embedded developer to an IoT systems architect.

Theory
Effective fleet management relies on the cloud backend being the central source of truth and control. The device’s role is to synchronize its state with this central authority.
1. Device Registry and Grouping
At the heart of any fleet management system is a Device Registry. This is a database on your cloud platform that lists every device authorized to connect to your service. Each entry contains the device’s unique ID (e.g., its certificate ID or a custom name) and, crucially, a set of metadata, often called tags or attributes.
Tags are simple key-value pairs that describe the device, for example:
"location": "factory-A"
"customer_id": "cust-943"
"firmware_version": "2.1.4"
"hardware_rev": "D"
"deployment_ring": "beta_testers"
This metadata is the foundation of group management. It allows you to query your fleet and perform bulk operations on a specific subset of devices, such as “initiate an OTA update for all devices in factory-A running firmware version 2.1.3.”
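The tag query described above can be sketched in a few lines of C. This is a minimal in-memory illustration, not how any particular cloud registry is implemented; the types `tag_t` and `device_t` and the functions `device_get_tag`, `device_matches`, and `count_matching` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory registry types; a real registry lives in the cloud. */
typedef struct {
    const char *key;
    const char *value;
} tag_t;

typedef struct {
    const char *device_id;
    const tag_t *tags;
    size_t tag_count;
} device_t;

/* Returns the value stored under `key`, or NULL if the device lacks that tag. */
const char *device_get_tag(const device_t *dev, const char *key) {
    assert(dev != NULL && key != NULL);
    for (size_t i = 0; i < dev->tag_count; i++) {
        if (strcmp(dev->tags[i].key, key) == 0) {
            return dev->tags[i].value;
        }
    }
    return NULL;
}

/* True only if the device carries every filter tag with an identical value. */
bool device_matches(const device_t *dev, const tag_t *filter, size_t filter_count) {
    for (size_t i = 0; i < filter_count; i++) {
        const char *value = device_get_tag(dev, filter[i].key);
        if (value == NULL || strcmp(value, filter[i].value) != 0) {
            return false;
        }
    }
    return true;
}

/* Counts the devices a bulk job with this filter would target. */
size_t count_matching(const device_t *fleet, size_t fleet_size,
                      const tag_t *filter, size_t filter_count) {
    size_t matches = 0;
    for (size_t i = 0; i < fleet_size; i++) {
        if (device_matches(&fleet[i], filter, filter_count)) {
            matches++;
        }
    }
    return matches;
}
```

A filter of `{"location", "factory-A"}` plus `{"firmware_version", "2.1.3"}` selects only devices carrying both tags, which is the AND semantics the example query implies.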
2. The Device Shadow (Digital Twin)
The most powerful concept in fleet management is the Device Shadow, also known as a Digital Twin. It is a JSON document stored in the cloud that represents the last known state of a physical device and the desired future state of that device. It acts as a reliable, always-available intermediary, decoupling the device from the applications that control it.
The shadow document typically has two main sections: reported and desired.
Shadow Section | Source of Truth | Purpose | Example |
---|---|---|---|
reported | The Physical Device | Describes the last known, actual state of the device. It is the device’s view of itself. | {"led_is_on": true, "firmware_version": "1.0"} |
desired | User or Cloud Application | Specifies the state the device should be in. It is a command or request for a state change. | {"led_is_on": false, "telemetry_interval": 300} |
delta | Cloud Service (Calculated) | Represents the difference between the desired and reported states. This is sent to the device so it knows what to change. | {"led_is_on": false, "telemetry_interval": 300} |
- reported state: This section is written by the device. It represents the device’s current, actual state. For example: {"reported": {"led_is_on": true, "firmware_version": "1.0", "telemetry_interval": 60}}.
- desired state: This section is written by cloud applications or users. It represents the state we want the device to be in. For example, a user flips a switch in a mobile app, which writes {"desired": {"led_is_on": false}} to the shadow.
graph TD
    subgraph Cloud
        A(User App / Backend)
        B{"Device Shadow<br><b>desired:</b> {led: on}<br><b>reported:</b> {led: off}"}
        C{Calculate Delta}
        D(Shadow Service)
        E[<b>Convergence</b><br>desired & reported<br>match]
    end
    subgraph Device
        F(ESP32 Device)
        G["Perform Action<br><i>gpio_set_level(LED, 1)</i>"]
        H[Report New State]
    end
    A --"1 Update Desired State"--> B
    B --"2 State Difference Detected"--> C
    C --"3 Publish Delta to Device"--> D
    D --"/shadow/update/delta<br><b>payload:</b> {led: on}"--> F
    F --"4 Receive Delta"--> G
    G --"5 Action Complete"--> H
    H --"6 Publish Reported State<br>/shadow/update<br><b>payload:</b> {reported: {led: on}}"--> D
    D --"7 States Match"--> E
    classDef start fill:#EDE9FE,stroke:#5B21B6,stroke-width:2px,color:#5B21B6;
    classDef process fill:#DBEAFE,stroke:#2563EB,stroke-width:1px,color:#1E40AF;
    classDef decision fill:#FEF3C7,stroke:#D97706,stroke-width:1px,color:#92400E;
    classDef check fill:#FEE2E2,stroke:#DC2626,stroke-width:1px,color:#991B1B;
    classDef endo fill:#D1FAE5,stroke:#059669,stroke-width:2px,color:#065F46;
    class A,F start;
    class B,C,D decision;
    class G,H process;
    class E endo;
The Synchronization Flow:
- Device Reports: When a device boots or its state changes, it publishes a message to update its reported state in the cloud.
- User/App Requests Change: A backend service or user application updates the desired state in the shadow.
- Delta Calculation: The cloud service compares the desired and reported states. If there are any differences, it calculates a delta and publishes it to a special topic.
- Device Receives Delta: The device is subscribed to this delta topic. It receives a message like {"state": {"led_is_on": false}, "timestamp": ...}.
- Device Acts: The device’s firmware parses this delta and performs the required action (e.g., turns the LED off).
- Device Confirms: After successfully changing its state, the device publishes a new reported state ({"reported": {"led_is_on": false}}).
- Shadow Reaches Convergence: The cloud service sees that the reported state now matches the desired state, so it no longer produces a delta for led_is_on, completing the loop. The shadow is now “in sync.” (On AWS IoT, the desired field itself stays in the document until a client clears it by setting it to null.)
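This asynchronous mechanism is robust because the device doesn’t need to be online when the user requests a change. The desired state is held in the shadow, and the device picks it up the next time it connects. On AWS IoT, for example, a reconnecting device typically requests the current shadow document (via the .../shadow/get topic) instead of relying on a delta message that was published while it was offline.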
3. Phased Rollouts and Bulk Actions
With grouping and device shadows, you can orchestrate complex, fleet-wide actions safely. The most common is an OTA update.
graph TD
    Start((Start Rollout)) --> A["Deploy to Canary Group (1% of fleet)"];
    A --> B{Monitor Canary Health & Errors};
    B -- "No Issues" --> C["Deploy to Early Adopters (10% of fleet)"];
    B -- "Issues Detected!" --> Stop((Halt Rollout & Investigate));
    C --> D{Monitor Expanded Group};
    D -- "No Issues" --> E["Deploy to General Fleet (Wave 1 - 50%)"];
    D -- "Issues Detected!" --> Stop;
    E --> F{Monitor General Fleet};
    F -- "No Issues" --> G["Deploy to Remainder (Wave 2 - 100%)"];
    F -- "Issues Detected!" --> Stop;
    G --> End((Rollout Complete));
    classDef start fill:#EDE9FE,stroke:#5B21B6,stroke-width:2px,color:#5B21B6;
    classDef process fill:#DBEAFE,stroke:#2563EB,stroke-width:1px,color:#1E40AF;
    classDef decision fill:#FEF3C7,stroke:#D97706,stroke-width:1px,color:#92400E;
    classDef check fill:#FEE2E2,stroke:#DC2626,stroke-width:1px,color:#991B1B;
    classDef endo fill:#D1FAE5,stroke:#059669,stroke-width:2px,color:#065F46;
    class Start,A,C,E,G process;
    class B,D,F decision;
    class Stop check;
    class End endo;
A phased rollout (or canary deployment) is the industry-standard best practice:
- Canary Group: Identify a small group of non-critical devices using tags (e.g., deployment_ring: canary).
- Deploy: Initiate a bulk job that sets the desired state for this group to {"ota_url": "...", "target_version": "1.1"}.
- Monitor: Closely watch the telemetry and error logs from the canary group.
- Expand: If no issues arise, expand the rollout to a larger group (e.g., deployment_ring: early_adopters, representing 5-10% of the fleet).
- Full Rollout: Continue expanding the deployment in stages until 100% of the fleet is updated.
Phase | Device Group (by Tag) | Fleet Percentage | Action & Purpose |
---|---|---|---|
1. Canary | deployment_ring: canary | ~1% | Deploy new firmware to a small, low-risk group. Purpose: Detect show-stopping bugs with minimal impact. |
2. Early Adopter | deployment_ring: early_adopters | 5-10% | Expand rollout to a larger, but still limited, group. Purpose: Verify stability and performance at a slightly larger scale. |
3. General Availability | customer_id: cust-123 | 25-50% per wave | Continue rolling out in larger waves. Can be targeted by customer or region. Purpose: Gradual, controlled update for the majority of the fleet. |
4. Full Deployment | All remaining devices | 100% | Complete the rollout to all devices. Purpose: Ensure the entire fleet is on the target version. |
This process dramatically reduces the risk of a bad firmware update. If a problem is found in the canary group, you can cancel the rollout, having impacted only a tiny fraction of your devices.
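The gate between phases can be expressed as a small decision function. The sketch below is one possible shape for it, assuming cumulative targets of 1%, 10%, 50%, and 100% and an error threshold expressed in failures per thousand updated devices. The names `rollout_phase_t`, `rollout_next_phase`, and the threshold parameter are hypothetical, and the logic would normally run in a backend service rather than on the device.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    unsigned cumulative_percent; /* share of the fleet updated after this phase */
} rollout_phase_t;

/* Illustrative phases; the exact percentages are an assumption. */
static const rollout_phase_t ROLLOUT_PHASES[] = {
    {"canary", 1},
    {"early_adopters", 10},
    {"general_wave_1", 50},
    {"full", 100},
};
#define ROLLOUT_PHASE_COUNT (sizeof(ROLLOUT_PHASES) / sizeof(ROLLOUT_PHASES[0]))

#define ROLLOUT_HALTED   (-1)
#define ROLLOUT_COMPLETE (-2)

/* Name of a phase, for logging or for building the tag query of the next wave. */
const char *rollout_phase_name(size_t phase) {
    assert(phase < ROLLOUT_PHASE_COUNT);
    return ROLLOUT_PHASES[phase].name;
}

/* Decides what happens after monitoring the current phase.
 * Returns the next phase index, the same index if no device has reported yet,
 * ROLLOUT_HALTED if the failure rate exceeds the threshold, or
 * ROLLOUT_COMPLETE after the last phase. */
int rollout_next_phase(size_t current, unsigned updated, unsigned failed,
                       unsigned max_error_permille) {
    assert(current < ROLLOUT_PHASE_COUNT);
    if (updated == 0) {
        return (int)current; /* nothing to judge yet, keep monitoring */
    }
    /* Compare failed/updated against the threshold without using floats. */
    if ((unsigned long long)failed * 1000ULL >
        (unsigned long long)max_error_permille * updated) {
        return ROLLOUT_HALTED;
    }
    if (current + 1 == ROLLOUT_PHASE_COUNT) {
        return ROLLOUT_COMPLETE;
    }
    return (int)(current + 1);
}
```

With a threshold of 20 per thousand, 5 failures among 100 canary devices halts the rollout, while 2 failures lets it advance to the early adopter ring.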
Practical Example: Device Shadow Interaction
Let’s write firmware that manages an LED’s state via a device shadow. The device will report its state on boot and listen for delta updates to change its state.
Note: We will use the MQTT topic structure popularized by AWS IoT, as it is a widely understood pattern for shadow interactions. In MQTT topic filters, + is a single-level wildcard.
Action | MQTT Topic | Direction | Purpose |
---|---|---|---|
Update Shadow | {prefix}/{thingName}/shadow/update | Device ➔ Cloud | Used by the device to report its state or by an app to set a desired state. |
Receive Delta | {prefix}/{thingName}/shadow/update/delta | Cloud ➔ Device | The device subscribes here to receive notifications about differences between desired and reported states. |
Update Accepted | {prefix}/{thingName}/shadow/update/accepted | Cloud ➔ Client | A confirmation from the cloud that the /update message was successfully processed. |
Update Rejected | {prefix}/{thingName}/shadow/update/rejected | Cloud ➔ Client | An error message from the cloud if the /update message was malformed or failed. |
- To publish updates: {prefix}/{thingName}/shadow/update
- To receive accepted/rejected responses: {prefix}/{thingName}/shadow/update/accepted and .../rejected
- To receive deltas: {prefix}/{thingName}/shadow/update/delta
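Because every topic shares the same stem, it can help to build and compare them in one place. The following is a minimal sketch using only the C standard library; `shadow_topic_t`, `shadow_build_topic`, and `shadow_topic_equals` are hypothetical helper names. The comparison helper takes an explicit length because MQTT clients commonly hand over incoming topics as a pointer plus length rather than a null-terminated string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef enum {
    SHADOW_TOPIC_UPDATE,
    SHADOW_TOPIC_DELTA,
    SHADOW_TOPIC_ACCEPTED,
    SHADOW_TOPIC_REJECTED
} shadow_topic_t;

/* Writes "{prefix}/{thingName}/shadow/update[/suffix]" into buf.
 * Returns false if the result did not fit, so callers never use a
 * silently truncated topic. */
bool shadow_build_topic(char *buf, size_t buf_len, const char *prefix,
                        const char *thing_name, shadow_topic_t kind) {
    static const char *const suffixes[] = {"", "/delta", "/accepted", "/rejected"};
    assert(buf != NULL && prefix != NULL && thing_name != NULL);
    if ((unsigned)kind > (unsigned)SHADOW_TOPIC_REJECTED) {
        return false;
    }
    int written = snprintf(buf, buf_len, "%s/%s/shadow/update%s",
                           prefix, thing_name, suffixes[kind]);
    return written >= 0 && (size_t)written < buf_len;
}

/* Exact match between a length-delimited incoming topic and an expected
 * null-terminated topic. Checking the length first prevents ".../update"
 * from being mistaken for ".../update/delta". */
bool shadow_topic_equals(const char *topic, size_t topic_len, const char *expected) {
    assert(topic != NULL && expected != NULL);
    return strlen(expected) == topic_len && memcmp(topic, expected, topic_len) == 0;
}
```

Checking the return value of `shadow_build_topic` is worthwhile on devices with long thing names, since a truncated topic fails silently: the subscription succeeds but never matches anything.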
The Code
This example assumes a configured GPIO for an LED and a connected MQTT client.
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_system.h"
#include "esp_log.h"
#include "driver/gpio.h"
#include "cJSON.h"
#include "mqtt_client.h" // ESP-IDF MQTT client API (esp_mqtt_client_*)

// --- Configuration ---
#define FIRMWARE_VERSION "1.0.0"
#define LED_GPIO 2 // Your board's LED GPIO
#define SHADOW_PREFIX "$aws/things" // AWS IoT-style prefix, change if your broker uses another

// Assume mqtt_client is initialized and connected elsewhere
extern esp_mqtt_client_handle_t mqtt_client;
extern char device_id[]; // A string holding the unique device ID (e.g., thingName)

static const char *TAG = "FLEET_MGMT";
static bool led_state = false;

// --- Helper Functions ---
void set_led_state(bool is_on) {
    led_state = is_on;
    gpio_set_level(LED_GPIO, led_state ? 1 : 0);
    ESP_LOGI(TAG, "LED state set to: %s", is_on ? "ON" : "OFF");
}

void report_current_state(void) {
    char topic[256];
    snprintf(topic, sizeof(topic), "%s/%s/shadow/update", SHADOW_PREFIX, device_id);

    // Build {"state": {"reported": {...}}}
    cJSON *root = cJSON_CreateObject();
    cJSON *state = cJSON_AddObjectToObject(root, "state");
    cJSON *reported = cJSON_AddObjectToObject(state, "reported");
    cJSON_AddBoolToObject(reported, "led_is_on", led_state);
    cJSON_AddStringToObject(reported, "firmware_version", FIRMWARE_VERSION);

    char *payload = cJSON_PrintUnformatted(root);
    if (payload != NULL) {
        // len = 0 lets the client compute the length from the string; QoS 1, not retained
        esp_mqtt_client_publish(mqtt_client, topic, payload, 0, 1, 0);
        ESP_LOGI(TAG, "Reported current state: %s", payload);
        free(payload);
    } else {
        ESP_LOGE(TAG, "Failed to serialize reported state");
    }
    cJSON_Delete(root);
}

// --- Shadow Delta Handler ---
void handle_shadow_delta(const char *payload, int len) {
    ESP_LOGI(TAG, "Received shadow delta: %.*s", len, payload);

    cJSON *root = cJSON_ParseWithLength(payload, len);
    if (root == NULL) {
        ESP_LOGE(TAG, "Failed to parse delta JSON");
        return;
    }

    cJSON *state_node = cJSON_GetObjectItem(root, "state");
    if (cJSON_IsObject(state_node)) {
        cJSON *led_node = cJSON_GetObjectItem(state_node, "led_is_on");
        if (cJSON_IsBool(led_node)) {
            // Act on the desired state
            set_led_state(cJSON_IsTrue(led_node));
            // After acting, report back our new state to clear the delta
            report_current_state();
        }
    }
    cJSON_Delete(root);
}

// --- MQTT Event Handler Snippet ---
// In your existing MQTT event handler, add a case for incoming data:
/*
case MQTT_EVENT_DATA: {
    // ... other handlers
    char delta_topic[256];
    snprintf(delta_topic, sizeof(delta_topic), "%s/%s/shadow/update/delta", SHADOW_PREFIX, device_id);
    // event->topic is not null-terminated, so compare the length as well as the bytes
    if (event->topic_len == (int)strlen(delta_topic) &&
        strncmp(event->topic, delta_topic, event->topic_len) == 0) {
        handle_shadow_delta(event->data, event->data_len);
    }
    break;
}
*/

// --- Main Application Logic ---
void fleet_management_init(void) {
    // Configure LED GPIO
    gpio_reset_pin(LED_GPIO);
    gpio_set_direction(LED_GPIO, GPIO_MODE_OUTPUT);
    set_led_state(false); // Default to off

    // Subscribe to the delta topic
    char delta_topic[256];
    snprintf(delta_topic, sizeof(delta_topic), "%s/%s/shadow/update/delta", SHADOW_PREFIX, device_id);
    esp_mqtt_client_subscribe(mqtt_client, delta_topic, 1);
    ESP_LOGI(TAG, "Subscribed to shadow delta topic: %s", delta_topic);

    // Give a moment for subscription to complete, then report initial state.
    // This delay blocks the calling task, so call this function from an
    // application task rather than directly inside the MQTT event callback.
    vTaskDelay(pdMS_TO_TICKS(2000));
    report_current_state();
}
Build and Run
- Integrate the code into your MQTT project. Call fleet_management_init() after the MQTT client connects.
- Flash and run. Observe the serial monitor. You should see it subscribe and then publish its initial reported state.
- Using an MQTT client (like MQTTX or a cloud provider’s console), publish to the …/shadow/update topic with the following payload: {"state": {"desired": {"led_is_on": true}}}
- Observe:
  - The device’s serial monitor will show “Received shadow delta…”.
  - The physical LED on your board will turn ON.
  - The device will immediately publish a new reported state back to the cloud.
  - If you now publish {"state": {"desired": {"led_is_on": false}}}, the LED will turn off.
Variant Notes
- Universally Applicable: Fleet management is a cloud-centric concept. The device-side logic for interacting with a shadow is almost entirely independent of the specific ESP32 variant. The example code should build for an ESP32, S3, C3, or H2 with little or no modification, provided LED_GPIO matches your board and the chip has an IP connection for MQTT (the H2 has no Wi-Fi radio, so it needs Thread or another transport). The core task is always MQTT communication and JSON parsing.
- Trustworthiness of reported State: This is where hardware security features matter. When a device reports "firmware_version": "1.0.0", how much do you trust it?
  - On a device without Secure Boot enabled, an attacker could potentially modify the firmware to lie about its version.
  - On a device with Secure Boot v2 enabled (available on the ESP32-S3/C3/C6, among others), the hardware guarantees that only authentically signed firmware can run. Therefore, the reported firmware version has a much higher degree of trust. This makes fleet management more secure, as you can be confident that your device groups (e.g., “all devices on v1.0”) are accurate.
- Low-Power Scenarios (ESP32-C6/H2): For battery-powered devices on Thread or in Wi-Fi power-save modes, the device shadow becomes even more vital. The device may only connect for 30 seconds every hour. It cannot maintain a persistent connection to wait for commands. Its operational loop is: Wake -> Connect -> Check for shadow delta -> Act -> Report new state -> Disconnect -> Sleep. The shadow provides the necessary asynchronous command queue.
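The operational loop for a sleepy device can be modeled as a small state machine. This is a sketch of the sequencing only, with no radio or sleep calls; the enum `cycle_phase_t` and the functions `cycle_next` and `cycle_transitions_until_sleep` are hypothetical names. Skipping straight to Disconnect when no delta is pending is an assumption made here to shorten the awake time; a device could equally choose to report on every wake-up.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    PHASE_WAKE,
    PHASE_CONNECT,
    PHASE_CHECK_DELTA,
    PHASE_ACT,
    PHASE_REPORT,
    PHASE_DISCONNECT,
    PHASE_SLEEP
} cycle_phase_t;

/* Returns the phase that follows `phase`. `delta_pending` only matters when
 * leaving PHASE_CHECK_DELTA: with nothing to apply, Act and Report are skipped. */
cycle_phase_t cycle_next(cycle_phase_t phase, bool delta_pending) {
    switch (phase) {
    case PHASE_WAKE:        return PHASE_CONNECT;
    case PHASE_CONNECT:     return PHASE_CHECK_DELTA;
    case PHASE_CHECK_DELTA: return delta_pending ? PHASE_ACT : PHASE_DISCONNECT;
    case PHASE_ACT:         return PHASE_REPORT;
    case PHASE_REPORT:      return PHASE_DISCONNECT;
    case PHASE_DISCONNECT:  return PHASE_SLEEP;
    case PHASE_SLEEP:       return PHASE_WAKE;
    }
    return PHASE_SLEEP; /* unknown input: fall back to the lowest-power phase */
}

/* Counts the transitions in one wake cycle, from Wake until Sleep is reached. */
int cycle_transitions_until_sleep(bool delta_pending) {
    cycle_phase_t phase = PHASE_WAKE;
    int transitions = 0;
    while (phase != PHASE_SLEEP) {
        phase = cycle_next(phase, delta_pending);
        transitions++;
        assert(transitions <= 7); /* the cycle has at most seven phases */
    }
    return transitions;
}
```

In firmware, each phase would map to a concrete action (bringing up the network, requesting the shadow, applying the change, publishing the reported state, entering deep sleep), with the function above deciding only the order.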
Common Mistakes & Troubleshooting Tips
Mistake / Issue | Symptom(s) | Troubleshooting / Solution |
---|---|---|
Mismatched Topic Names | Device never receives delta updates. No errors appear, the device just seems to ignore desired state changes. | Log the exact topic string the device subscribes to (e.g., “$aws/things/my-device/shadow/update/delta”) and compare it character-by-character with the topic used by your MQTT client to publish. |
Forgetting to Report Back | The device acts on a delta (e.g., LED turns on), but then performs the same action again every time it reconnects. | Always call your report_current_state() function immediately after the device successfully changes its state. Once reported matches desired, the shadow stops producing a delta for that field. |
Crashing on Bad JSON | Device reboots or crashes when a delta is received. The log might show a Guru Meditation Error. | Add defensive checks. Before using a JSON object, ensure it’s not NULL. Use functions like cJSON_IsBool or cJSON_IsObject to verify the type before accessing the value. |
No Shadow Versioning | You add a new feature (e.g., “brightness”) and push it to the fleet. Old devices, which don’t have the feature, receive the delta but do nothing, making the shadow permanently out of sync for them. | Include “firmware_version” in the device’s reported state. Your cloud-side logic should check this version before pushing a desired state with new features. |
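The version check suggested in the last row needs a numeric comparison, since comparing version strings character by character would rank "2.10.0" below "2.9.0". The sketch below assumes versions follow a strict MAJOR.MINOR.PATCH format such as "1.0.0"; the names `fw_version_t`, `fw_version_parse`, `fw_version_compare`, and `fw_supports_feature` are hypothetical. The same comparison could run on the cloud side before pushing a new desired field, or on the device when evaluating a target version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int major;
    int minor;
    int patch;
} fw_version_t;

/* Parses "MAJOR.MINOR.PATCH". Rejects missing parts, negative numbers, and
 * trailing characters (for example "1.0" or "1.0.0-beta"). */
bool fw_version_parse(const char *text, fw_version_t *out) {
    assert(out != NULL);
    if (text == NULL) {
        return false;
    }
    char trailing;
    int fields = sscanf(text, "%d.%d.%d%c",
                        &out->major, &out->minor, &out->patch, &trailing);
    return fields == 3 && out->major >= 0 && out->minor >= 0 && out->patch >= 0;
}

/* Returns a negative value if a < b, zero if equal, positive if a > b. */
int fw_version_compare(const fw_version_t *a, const fw_version_t *b) {
    assert(a != NULL && b != NULL);
    if (a->major != b->major) return a->major < b->major ? -1 : 1;
    if (a->minor != b->minor) return a->minor < b->minor ? -1 : 1;
    if (a->patch != b->patch) return a->patch < b->patch ? -1 : 1;
    return 0;
}

/* True if the reported version is at least the minimum version that
 * understands a given shadow field. Unparseable input counts as unsupported. */
bool fw_supports_feature(const char *reported_version, const char *min_version) {
    fw_version_t reported;
    fw_version_t minimum;
    if (!fw_version_parse(reported_version, &reported) ||
        !fw_version_parse(min_version, &minimum)) {
        return false;
    }
    return fw_version_compare(&reported, &minimum) >= 0;
}
```

Treating an unparseable version as unsupported is the conservative choice here: a device that reports something unexpected is left alone instead of receiving a field it may not understand.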
Exercises
- Remote Configuration: Extend the practical example by adding a telemetry_interval_s field to the shadow. The device should report its current interval. Implement logic in the handle_shadow_delta function to parse this new field and change the delay in a (hypothetical) telemetry task.
- OTA via Shadow: Add ota_url and target_version to the reported state (initially null). The exercise is to handle a delta where the cloud sets these desired values. The device firmware should compare its FIRMWARE_VERSION to the target_version. If they differ, it should initiate the OTA process using the provided ota_url.
- Group Tagging: In your report_current_state function, add a static “tag” to the reported state, such as "location": "lab-bench-1". This demonstrates how devices can self-report attributes that can then be used on the cloud side to create logical groups for bulk operations.
Summary
- Fleet Management is the process of managing large numbers of IoT devices collectively, which is essential for scaling.
- The Device Shadow (or Digital Twin) is the cornerstone of modern fleet management, acting as a cloud-based, virtual representation of the device’s state.
- The shadow decouples the device from control applications using reported (from device) and desired (to device) states, synchronized via a delta.
- Tagging devices with metadata in a central registry allows for powerful grouping and targeted bulk actions.
- Phased OTA rollouts are a critical risk-mitigation strategy, leveraging device groups to deploy updates incrementally.
- Writing robust device-side firmware that correctly interacts with the shadow is key to creating a manageable and controllable IoT fleet.
Further Reading
- AWS IoT Device Shadow Service: https://docs.aws.amazon.com/iot/latest/developerguide/iot-device-shadows.html
- Azure IoT Hub Device Twins: https://docs.microsoft.com/en-us/azure/iot-hub/iot-hub-devguide-device-twins
- Google Cloud IoT Core Device Registry (service retired in August 2023): https://cloud.google.com/iot/docs/how-tos/devices
- ESP RainMaker (Espressif’s own fleet management platform): https://rainmaker.espressif.com/